Hello devs,
as you might already have noticed, there is a constant stream of
issues containing the keywords "encoding" or, more often,
"UnicodeDecodeError". The main reason behind this is that Python 2.x
has two types of text strings: byte sequences (what you get with
str()) and Unicode (unicode()). Python 3.x has only one, Unicode (a
byte sequence is not a string any more), thus fixing this frustrating
source of errors.
Moving GRASS Python code to use Unicode internally will bring it closer
to being Python 3 ready and solve the largest part of the errors caused
by implicit conversion from encoded text strings to Unicode text strings.
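To illustrate the failure mode: when Python 2 mixes a byte string with a
Unicode string, it decodes the bytes implicitly with the ASCII codec. The
sketch below performs that same decoding step explicitly, so it behaves
the same on Python 2 and Python 3. The helper name and the sample text
are made up for this example only.

```python
# The explicit .decode('ascii') stands in for the implicit conversion
# Python 2 performs when bytes and Unicode are mixed.

raw = u'M\u0101ris'.encode('utf-8')  # bytes as they arrive from outside Python


def implicit_style_decode(data):
    """Mimic Python 2's implicit ASCII decoding (illustrative helper)."""
    try:
        return data.decode('ascii')
    except UnicodeDecodeError as err:
        return 'failed: ' + err.__class__.__name__


result_implicit = implicit_style_decode(raw)  # fails on the non-ASCII bytes
result_explicit = raw.decode('utf-8')         # works, the encoding is stated
```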
The proposal is to make GRASS GIS Python code compliant with Unicode
best practice [1], following the principle "decode early, encode late".
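A minimal sketch of what "decode early, encode late" could look like.
The decode(), encode() and make_title() helpers are hypothetical names
used only for illustration, and UTF-8 is assumed as the external
encoding; this is not existing GRASS API.

```python
def decode(data, encoding='utf-8'):
    """Entry point: turn bytes into Unicode text (hypothetical helper)."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def encode(text, encoding='utf-8'):
    """Exit point: turn Unicode text into bytes (hypothetical helper)."""
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)


def make_title(name):
    # Internal code only ever sees Unicode text.
    return u'Map of ' + name


incoming = b'R\xc4\xabga'             # UTF-8 bytes, e.g. output of a module
title = make_title(decode(incoming))  # decode early, at the entry point
outgoing = encode(title)              # encode late, at the exit point
```

Everything between the two boundary calls works on Unicode only, so no
implicit conversion can happen in the middle.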
Things to change:
1) Any text string entering the Python part of the code should be
decoded at its entry point and encoded back to a byte sequence at its
exit point. This also applies to all calls to GRASS modules that pass
text around;
2) Replace all text string literals with Unicode literals (u'text'). No
exceptions. Note: this covers "text strings" only, so byte sequences
should not be touched;
3) Ensure text file reading / writing is done via codecs.open;
4) Pass only Unicode to Python file handling calls (this is important
for running on MS-Windows);
5) Use Unicode in tests to ensure correctness of code;
6) Introduce information on Unicode usage into Python submitting
guidelines [2],[3].
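Items 3) and 4) could look roughly like the sketch below. The file name,
the temporary directory and the sample text are assumptions made for the
example; UTF-8 is assumed as the file encoding.

```python
import codecs
import os
import shutil
import tempfile

text = u'J\u0101\u0146a s\u0113ta'  # Unicode text with non-ASCII characters

# The file name is given as a Unicode string, as item 4) asks for.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, u'example.txt')

# codecs.open encodes on write ...
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# ... and decodes on read, so the code only handles Unicode text.
with codecs.open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

# Reading in binary mode shows the encoded bytes stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

shutil.rmtree(tmp_dir)  # clean up the temporary directory
```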
Things to change outside of Python code:
1) Store attribute table encoding information along with connection parameters;
2) Ensure storage of correct encoding information on data import and
correct use on export (especially painful for ESRI Shapefiles);
3) Ensure correct encoding information in headers of all PO and XML files.
Expected problems:
1) When moving to Python 3, explicit Unicode literal definitions become
redundant and can be removed (u'text' -> 'text'); only Python 3.0 to
3.2 reject the u prefix, 3.3 and later accept it;
2) Introduction of the "decode early" principle will break all of the
band-aids currently in place - a major breakage of code for a short
time is expected;
3) Guessing the correct encoding can be a problem. One solution could
be checking early for correctness of the system configuration and
refusing to operate on improperly configured systems. A fatal error is
better than silent data corruption (as is happening at the moment in
certain scenarios).
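An early configuration check, as suggested in 3), might be sketched like
this. The check_encoding() helper is hypothetical, and treating an
ASCII-only or unknown encoding as "improperly configured" is an
assumption of this example, not an agreed rule.

```python
import codecs


def check_encoding(name):
    """Return the normalized codec name or raise (hypothetical helper).

    Assumption of this sketch: an unknown codec or plain ASCII counts
    as an improperly configured system.
    """
    try:
        info = codecs.lookup(name)
    except (LookupError, TypeError):
        raise RuntimeError('Unknown system encoding: %r' % (name,))
    if info.name == 'ascii':
        raise RuntimeError('System encoding is ASCII only, refusing to run')
    return info.name


# Possible use at startup (not executed here, the result depends on the
# machine the code runs on):
#     import locale
#     check_encoding(locale.getpreferredencoding())
```

Failing here, before any data is touched, gives the fatal error instead
of the silent corruption.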
Topics to discuss:
1) Implementation plan:
a) should it be done before 7.1?
b) should separate bugs be opened for parts of the migration?
c) how big / long a breakage is acceptable?
2) Moving all text in a GRASS location to UTF-8 encoding (GRASS 8), thus
pushing the encode/decode "boundary" further out. Upside - most of the
existing data is already UTF-8 ready (the parts supporting only ASCII) [4].
1. http://unicodebook.readthedocs.org/good_practices.html
2. http://www.azavea.com/blogs/labs/2014/03/solving-unicode-problems-in-python-2-7/
3. https://docs.python.org/2/howto/unicode.html
4. http://utf8everywhere.org/
Jauku dienu;
miłego dnia;
хорошего дня
("have a nice day" in Latvian, Polish and Russian),
Māris.
Moved from trac ticket https://trac.osgeo.org/grass/ticket/2885