[GRASS-dev] Moving GRASS Python parts to Unicode

Hello devs,
as you might already have noticed, there is a constant stream of
issues containing the keywords "encoding" or, more often,
"UnicodeDecodeError". The main reason behind this is that Python 2.x
has two types of text strings - byte sequence (the one you get with
str()) and Unicode (unicode()). Python 3.x will have only one -
Unicode (a byte sequence is not a string any more), thus fixing this
frustrating source of errors.
Moving GRASS Python code to use Unicode internally will bring it closer
to being Python 3 ready and solve the largest part of the errors caused
by implicit conversion from encoded text strings to Unicode text
strings.

The proposal is to make GRASS GIS Python code compliant with Unicode
best practice [1], following the principle "decode early, encode late".
Things to change:
1) Any text string entering the Python part of the code should be
decoded at its entry point and encoded back to a byte sequence at its
exit point. This also applies to all calls to GRASS modules passing
around text;
2) Replace all text strings with Unicode literals (u'text'). No
exceptions. Note that this covers "text strings" only; byte sequences
should not be touched;
3) Ensure text file reading / writing is done via codecs.open;
4) Pass only Unicode to Python file handling calls (this is important
for running on MS-Windows);
5) Use Unicode in tests to ensure correctness of code;
6) Introduce information on Unicode usage into Python submitting
guidelines [2],[3].
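
As a rough illustration of the "decode early, encode late" idea in the
list above, here is a minimal sketch. The helper names (decode, encode,
make_title) and the fixed UTF-8 encoding are assumptions made for this
example only, not existing GRASS API. io.open is used here in place of
codecs.open as another encoding-aware way to open text files.

```python
import io
import os
import tempfile

# Assumption: a real system would take this from the locale or from
# stored metadata instead of hard-coding it.
ENCODING = "utf-8"


def decode(data, encoding=ENCODING):
    """Entry point: turn incoming bytes into a Unicode text string."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def encode(text, encoding=ENCODING):
    """Exit point: turn a Unicode text string back into bytes."""
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)


def make_title(raw_name):
    """Hypothetical helper: everything between entry and exit is Unicode."""
    name = decode(raw_name)            # decode early
    title = u"Map: " + name.strip()    # pure Unicode processing
    return encode(title)               # encode late


# Bytes as they might arrive from outside (u"R\u012bga" is "Riga" with a
# long i, written as an escape to keep this source file ASCII-only).
raw = u"  R\u012bga ".encode(ENCODING)
result = make_title(raw)

# Text files go through an encoding-aware layer, with a Unicode path.
path = os.path.join(tempfile.mkdtemp(), u"title.txt")
with io.open(path, "w", encoding=ENCODING) as f:
    f.write(decode(result))
with io.open(path, "r", encoding=ENCODING) as f:
    round_trip = f.read()
```

The point of the design is that only decode() and encode() ever need to
know which encoding is in use; the code in between never mixes the two
string types.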

Things to change outside of Python code:
1) Store attribute table encoding information along with connection parameters;
2) Ensure storage of correct encoding information on data import and
correct use on export (especially painful for ESRI Shapefiles);
3) Ensure correct encoding information in headers of all PO and XML files.

Expected problems:
1) When moving to Python 3, all explicit Unicode literal definitions
will need to be removed (u'text' -> 'text');
2) Introduction of the "decode early, encode late" principle will break
all of the band-aids currently in place - a major breakage of code for
a short time is expected;
3) Guessing the correct encoding can be a problem. One solution could
be to check early that the system is configured correctly and to
refuse to operate on improperly configured systems. A fatal error is
better than silent data corruption (as is happening at the moment for
certain scenarios).
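
One possible shape for such an early check is sketched below. What
counts as a "properly configured" system is not defined in this
proposal, so the rule used here (the encoding reported by the locale
must be known to Python) is only an assumption, and the function name
require_usable_encoding is hypothetical.

```python
import codecs
import locale


def require_usable_encoding(name=None):
    """Return the normalized codec name, or fail loudly instead of guessing.

    `name` defaults to what the locale reports; passing it explicitly is
    meant for illustration and testing only.
    """
    if name is None:
        name = locale.getpreferredencoding(False)
    try:
        return codecs.lookup(name).name
    except LookupError:
        # A fatal error is preferred over silently corrupting data.
        raise RuntimeError("Unusable text encoding configured: %r" % (name,))
```

Calling this once at startup would turn a misconfiguration into one
clear error message rather than a UnicodeDecodeError somewhere deep in
the code.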

Topics to discuss:
1) Implementation plan:
a) should it be done before 7.1?
b) should separate bugs be opened for parts of the migration?
c) how big / long a breakage is acceptable?
2) Moving all text in a GRASS location to UTF-8 encoding (GRASS 8), thus
pushing the encode/decode "boundary" further out. Upside - most
existing data is already UTF-8 ready (the parts supporting only ASCII)
[4].

1. http://unicodebook.readthedocs.org/good_practices.html
2. http://www.azavea.com/blogs/labs/2014/03/solving-unicode-problems-in-python-2-7/
3. https://docs.python.org/2/howto/unicode.html
4. http://utf8everywhere.org/

Jauku dienu;
miłego dnia;
хорошего дня,
Māris.

Moved from trac ticket https://trac.osgeo.org/grass/ticket/2885

Hi Maris,

On 07/02/16 11:56, Maris Nartiss wrote:

Hello devs,
as you might already have noticed, there is a constant stream of
issues containing the keywords "encoding" or, more often,
"UnicodeDecodeError". The main reason behind this is that Python 2.x
has two types of text strings - byte sequence (the one you get with
str()) and Unicode (unicode()). Python 3.x will have only one -
Unicode (a byte sequence is not a string any more), thus fixing this
frustrating source of errors.
Moving GRASS Python code to use Unicode internally will bring it closer
to being Python 3 ready and solve the largest part of the errors caused
by implicit conversion from encoded text strings to Unicode text
strings.

I would be very happy if we could find a structural solution to this which would avoid having to deal with so many individual errors all the time.

The proposal is to make GRASS GIS Python code compliant with Unicode
best practice [1], following the principle "decode early, encode late".
Things to change:
1) Any text string entering the Python part of the code should be
decoded at its entry point and encoded back to a byte sequence at its
exit point. This also applies to all calls to GRASS modules passing
around text;
2) Replace all text strings with Unicode literals (u'text'). No
exceptions. Note that this covers "text strings" only; byte sequences
should not be touched;
3) Ensure text file reading / writing is done via codecs.open;
4) Pass only Unicode to Python file handling calls (this is important
for running on MS-Windows);
5) Use Unicode in tests to ensure correctness of code;
6) Introduce information on Unicode usage into Python submitting
guidelines [2],[3].

Things to change outside of Python code:
1) Store attribute table encoding information along with connection parameters;
2) Ensure storage of correct encoding information on data import and
correct use on export (especially painful for ESRI Shapefiles);
3) Ensure correct encoding information in headers of all PO and XML files.

Expected problems:
1) When moving to Python 3, all explicit Unicode literal definitions
will need to be removed (u'text' -> 'text');
2) Introduction of the "decode early, encode late" principle will break
all of the band-aids currently in place - a major breakage of code for
a short time is expected;
3) Guessing the correct encoding can be a problem. One solution could
be to check early that the system is configured correctly and to
refuse to operate on improperly configured systems. A fatal error is
better than silent data corruption (as is happening at the moment for
certain scenarios).

I am no expert on this question, and thus do not have a clear opinion on your proposal, except for the fact that I'm very happy that it exists, but here are my intuitive ideas & questions on your topics:

Topic to discuss:
1) Implementation plan:
a) should it be done before 7.1?

I think the sooner, the better, so 7.1 should be our latest milestone (7.0.x should be in "bugfix only" mode).

b) should separate bugs be opened for parts of migration?

To what extent can the work be split into more or less autonomous issues?

c) how big / long breakage is acceptable?

How complete would the breakage be: for all encodings, or would LANG=C always work?

Is this something which could be done for the most part in a concentrated manner during a code sprint (e.g. FOSS4G 2016)?

2) Moving all text in GRASS location to UTF-8 encoding (GRASS 8) thus
pushing the encode/decode "boundary" further. Upside - most of
existing data is UTF-8 ready (parts supporting only ASCII) [4].

What do you mean by "text in GRASS location"? What about files on the filesystem that some users might want to access via other tools? Shouldn't they be in the system-wide encoding?

Thank you very much for bringing up this discussion in such a structured manner. I hope that others will show some interest in the matter...

Moritz

On Wed, Feb 10, 2016 at 9:42 AM, Moritz Lennert <
mlennert@club.worldonline.be> wrote:

Hi Maris,

I am no expert on this question, and thus do not have a clear opinion on
your proposal, except for the fact that I'm very happy that it exists, but
here are my intuitive ideas & questions on your topics:

I don't have a clear opinion either but I hoped Glynn could state his
opinion here, because I understood he has a different view on some of these
things. AFAIR, one of the problems is possibly different needs of Python
scripting library vs. GUI.

Anna

2016-02-10 17:03 GMT+02:00 Anna Petrášová <kratochanna@gmail.com>:

On Wed, Feb 10, 2016 at 9:42 AM, Moritz Lennert
<mlennert@club.worldonline.be> wrote:

Hi Maris,

I would be very happy if we could find a structural solution to this which
would avoid having to deal with so many individual errors all the time.

That is my proposal. Get it right + policy to enforce to avoid
breakdown in the future.

I am no expert on this question, and thus do not have a clear opinion on
your proposal, except for the fact that I'm very happy that it exists, but
here are my intuitive ideas & questions on your topics:

Neither am I. I just got fed up with UnicodeDecodeError.

I don't have a clear opinion either but I hoped Glynn could state his
opinion here, because I understood he has a different view on some of these
things. AFAIR, one of the problems is possibly different needs of Python
scripting library vs. GUI.

Anna

Anna, there should be no other "special" way of treating some parts of
the Python code. If it is Python, it should follow Python idioms. That's
the whole point of using Python in the first place - to provide
Pythonic access to the power of GRASS. I do not see the Python
community moving away from Unicode strings to raw byte strings for
text in the near future, so either we adopt the Pythonic approach or
we continue to fight an uphill battle with Python. So far we are not
doing too well at it.

Topic to discuss:
1) Implementation plan:
a) should it be done before 7.1?

I think the sooner, the better, so 7.1 should be our latest milestone
(7.0.x should be in "bugfix only" mode).

That depends on how far away 7.1 is. I would prefer to have GRASS
releases more often; in that case it should go to 7.2.

b) should separate bugs be opened for parts of migration?

To what extent can the work be split into more or less autonomous
issues?

Good question.

c) how big / long breakage is acceptable?

How complete would the breakage be: for all encodings, or would LANG=C
always work?

Only partially. There are no UnicodeEncodeErrors for LANG=C, but
there will be a UnicodeWarning instead when comparing a Unicode
string to a byte string.

Is this something which could be done for the most part in a
concentrated manner during a code sprint (e.g. FOSS4G 2016)?

I am not that familiar with the whole codebase.

2) Moving all text in GRASS location to UTF-8 encoding (GRASS 8) thus
pushing the encode/decode "boundary" further. Upside - most of
existing data is UTF-8 ready (parts supporting only ASCII) [4].

What do you mean by "text in GRASS location"? What about files on the
filesystem that some users might want to access via other tools?
Shouldn't they be in the system-wide encoding?

I meant any text strings (raster categories, metadata entries, etc.).
A system-wide encoding makes a GRASS location non-portable: I cannot
just copy it to another system and expect it to work. UTF-8 would be a
natural choice, as it is backwards compatible with ASCII (existing data
does not need to be changed) and at the same time would allow us to
accept any characters in the future. Besides - it is used by 86% of the
Web [1].
If we introduce such a policy, the same principle would apply - decode
early, encode late. On the bright side - legacy systems are dying out,
MacOS uses UTF-8 for all locales by default, and Linux has nice UTF-8
support (my guess - it is the most popular encoding after plain
ASCII).
The current situation, where data is in an unknown encoding, is the
worst - either we adopt this approach, or we start to store metadata on
the encoding in use. I assume anyone who has played the game "guess
the encoding of a Shapefile" will agree on the downsides of the latter.
Anyway - this is discussion about GRASS 8.
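
The compatibility argument can be checked with a few lines of Python.
This is only an illustrative sketch: the sample strings and the choice
of ISO-8859-13 as the "legacy" encoding are assumptions for the
example, not something taken from existing GRASS data.

```python
# ASCII-only data is already valid UTF-8, byte for byte.
ascii_bytes = b"elevation"
text = ascii_bytes.decode("utf-8")
ascii_round_trip = text.encode("utf-8")

# A label with one non-ASCII character (a with macron, U+0101).
label = u"M\u0101ris"
utf8_bytes = label.encode("utf-8")        # the character takes two bytes
legacy_bytes = label.encode("iso-8859-13")  # one byte in a legacy encoding

# Without metadata, legacy bytes cannot be told apart from UTF-8 except
# by failing to decode them.
try:
    legacy_bytes.decode("utf-8")
    legacy_is_valid_utf8 = True
except UnicodeDecodeError:
    legacy_is_valid_utf8 = False
```

Here ascii_round_trip equals ascii_bytes, which is why existing
ASCII-only data would not need to be changed, while the legacy bytes
show the "unknown encoding" problem in miniature.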

Thank you very much for bringing up this discussion in such a structured
manner. I hope that others will show some interest in the matter...

Moritz

I hope so.

Dziękuje,
Māris.

1. http://w3techs.com/technologies/overview/character_encoding/all

On Sat, Feb 13, 2016 at 11:30 AM, Maris Nartiss <maris.gis@gmail.com> wrote:
...

Depends on how far is 7.1. I would prefer to have GRASS releases more
often, then it should go to 7.2.

Agreed. We could change this in trunk (current 7.1.svn) and then
release it as stable 7.2.

Markus

Hi,

2016-02-13 12:13 GMT+01:00 Markus Neteler <neteler@osgeo.org>:

Agreed. We could change this in trunk (current 7.1.svn) and then
release it as stable 7.2.

I am confused. I thought that the next stable release would be 7.1 and
not 7.2; in other words, some months (let's say two) before the planned
release of 7.1 we create releasebranch_7_1 from trunk, and trunk
becomes 7.2.svn. Or do we want to create releasebranch_7_2 and have
trunk turn into 7.3.svn (so version 7.1.x will never be released)?

My vote would be for the first option - release 7.1 as stable. Ma

--
Martin Landa
http://geo.fsv.cvut.cz/gwiki/Landa
http://gismentors.cz/mentors/landa

Maris Nartiss wrote:

as you might already have noticed, there is a constant stream of
issues containing the keywords "encoding" or, more often,
"UnicodeDecodeError". The main reason behind this is that Python 2.x
has two types of text strings - byte sequence (the one you get with
str()) and Unicode (unicode()). Python 3.x will have only one -
Unicode (a byte sequence is not a string any more), thus fixing this
frustrating source of errors.

Both versions have both types of string. In 2.x, str() and "plain"
string literals create byte strings, while unicode() and u"..." create
unicode strings. In 3.x, str() and plain string literals create
unicode strings, while bytes() and b"..." create byte strings.

The biggest differences between the two are:

a) 2.x allows implicit conversions. If you pass a byte string where a
unicode string is expected (or vice versa), the string is implicitly
converted using the default encoding (which can't be set by a script).
3.x doesn't do this; you get an exception.

b) 3.x tries quite hard to maintain the fiction that everything is
unicode. E.g. sys.argv contains unicode strings, os.environ uses
unicode strings for both keys and values, sys.stdin/stdout/stderr are
text streams which return Unicode data.
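
A short illustration of these points under Python 3 (a sketch only;
the variable names are made up for the example, and under Python 2 the
concatenation below would succeed through implicit conversion):

```python
import sys

text = "abc"     # str: a unicode string in Python 3
data = b"abc"    # bytes: a byte string

# Mixing the two types is an error in Python 3 rather than an
# implicit conversion.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# The conversion has to be spelled out, naming the encoding.
joined = text + data.decode("ascii")

# The interpreter hands command-line arguments over as unicode strings.
argv_is_text = all(isinstance(arg, str) for arg in sys.argv)
```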

Moving GRASS Python code to use Unicode internally will bring it closer
to being Python 3 ready and solve the largest part of the errors caused
by implicit conversion from encoded text strings to Unicode text
strings.

I don't particularly care what happens with wxGUI, and using unicode
consistently would make sense there, as wx itself uses Unicode. But if
you're planning on doing this to grass.script, I'm strongly opposed.
It achieves nothing beyond making what should be wxGUI's problem into
everyone else's problem.

Pretending that everything is unicode only works so long as the rest
of the world makes sure not to dispel the illusion. Otherwise, it
fails hard. Something as simple as e.g. copying stdin to stdout fails
just because the data isn't in the assumed encoding.
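
The copying case can be sketched as follows in Python 3. The function
names copy_bytes and copy_text are hypothetical, in-memory streams
stand in for stdin and stdout, and UTF-8 is merely assumed as the
"expected" encoding.

```python
import io
import shutil


def copy_bytes(src, dst):
    """Copy a binary stream without interpreting its contents."""
    shutil.copyfileobj(src, dst)


def copy_text(src, dst, encoding="utf-8"):
    """Copy by decoding and re-encoding; raises UnicodeDecodeError when
    the input is not valid in the assumed encoding."""
    text = src.read().decode(encoding)
    dst.write(text.encode(encoding))


# "cafe" with e-acute encoded as Latin-1: not valid UTF-8.
payload = b"caf\xe9\n"

byte_out = io.BytesIO()
copy_bytes(io.BytesIO(payload), byte_out)   # succeeds, bytes untouched

try:
    copy_text(io.BytesIO(payload), io.BytesIO())
    text_copy_failed = False
except UnicodeDecodeError:
    text_copy_failed = True
```

In a real Python 3 script the byte-level variant would operate on
sys.stdin.buffer and sys.stdout.buffer rather than on the text-mode
sys.stdin and sys.stdout.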

Bear in mind that the C portion of GRASS (i.e. most of it) doesn't pay
any attention to encodings unless it has to. It just passes bytes
around. It doesn't care whether the bytes are in any particular
encoding, and certainly won't attempt to ensure that data written to
stdout or to files is in any particular encoding.

--
Glynn Clements <glynn@gclements.plus.com>

I would like to add a link on this topic:

http://www.catb.org/esr/faqs/practical-python-porting/

which provides an extensive overview of possible issues and solutions.

Best regards

Pietro