[GRASS-dev] [GRASS GIS] #2617: wxgui Raster query redirect to console UnicodeDecodeError

#2617: wxgui Raster query redirect to console UnicodeDecodeError
-------------------------+--------------------------------------------------
Reporter: marisn | Owner: grass-dev@…
     Type: defect | Status: new
Priority: normal | Milestone: 7.0.1
Component: wxGUI | Version: svn-trunk
Keywords: | Platform: MSWindows Vista
      Cpu: Unspecified |
-------------------------+--------------------------------------------------
Steps to reproduce:
  * use raster query tool to query raster
  * check "redirect to console"

{{{
Vaicājuma rezultāti:
Traceback (most recent call last):
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 65, in <lambda>
    self.redirect.Bind(wx.EVT_CHECKBOX, lambda evt: self._onRedirect(evt.IsChecked()))
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 143, in _onRedirect
    self.redirectOutput.emit(output=self._textToRedirect())
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 148, in _textToRedirect
    text = printResults(self._model, self._colNames[1])
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 215, in printResults
    return '\n'.join(textList)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 4: ordinal not in range(128)
}}}

Also reported as a part of #2120;
7.0.0 is also affected.
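
The last frame hints at a likely cause, although the ticket does not show the contents of `textList`: if the list mixes unicode labels with UTF-8 byte strings, Python 2 implicitly decodes the byte strings as ASCII inside `join`, which would fail on byte 0xc4. A minimal sketch of the usual remedy, written for Python 3 (where such a mix raises TypeError instead). The helper `join_lines` and the UTF-8 default are illustrative assumptions, not code from query.py:

```python
def join_lines(items, encoding="utf-8"):
    """Join a mix of bytes and str into one text block.

    `join_lines` and the default encoding are illustrative assumptions,
    not part of query.py.
    """
    decoded = []
    for item in items:
        if isinstance(item, bytes):
            # Decode explicitly instead of relying on an implicit conversion
            item = item.decode(encoding)
        decoded.append(item)
    return "\n".join(decoded)


mixed = ["east, north: 622578.672986, 6399325.43444", b"kr\xc4\x81sa: 255:202:000"]
text = join_lines(mixed)
```

Decoding at the boundary keeps the join itself free of any codec guesswork.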

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2617>
GRASS GIS <http://grass.osgeo.org>

#2617: wxgui Raster query redirect to console UnicodeDecodeError
Changes (by annakrat):

  * keywords: => query, encoding

Comment:

Please try r64818 in trunk. Any chance it would solve #2601?

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2617#comment:1>
GRASS GIS <http://grass.osgeo.org>

#2617: wxgui Raster query redirect to console UnicodeDecodeError

Comment(by marisn):

No, this is not a solution - it is still broken.

I added the following print statement at line 185 of query.py:
{{{
print 'k: %s (%s) v: %s (%s)' % (k, type(k), v, type(v))
}}}

And here is the output:
{{{
k: east, north (<type 'unicode'>) v: 622578.672986, 6399325.43444 (<type 'str'>)
k: dores_idw@kalistrats (<type 'unicode'>) v: {'nosaukums': '', 'kr\xc4\x81sa': '255:202:000', 'v\xc4\x93rt\xc4\xabba': '71.6390742964988'} (<type 'dict'>)
k: nosaukums (<type 'str'>) v: (<type 'str'>)
k: krāsa (<type 'str'>) v: 255:202:000 (<type 'str'>)
Traceback (most recent call last):
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapwin\buffered.py", line 1230, in MouseActions
    self.OnLeftUp(event)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapwin\buffered.py", line 1407, in OnLeftUp
    self.mapQueried.emit(x=self.mouse['end'][0], y=self.mouse['end'][1])
  File "C:\Program Files\GRASS GIS 7.1.svn\etc\python\grass\pydispatch\signal.py", line 229, in emit
    dispatcher.send(signal=self, *args, **kwargs)
  File "C:\Program Files\GRASS GIS 7.1.svn\etc\python\grass\pydispatch\dispatcher.py", line 349, in send
    **named
  File "C:\Program Files\GRASS GIS 7.1.svn\etc\python\grass\pydispatch\robustapply.py", line 60, in robustApply
    return receiver(*arguments, **named)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapdisp\frame.py", line 868, in Query
    self.QueryMap(east, north, qdist, rast, vect)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapdisp\frame.py", line 922, in QueryMap
    self.dialogs['query'] = QueryDialog(parent = self, data = result)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 46, in __init__
    self._model = QueryTreeBuilder(self.data, column=self._colNames[1])
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 201, in QueryTreeBuilder
    addNode(parent=model.root, data=part, model=model)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 190, in addNode
    addNode(parent=node, data=v, model=model)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 187, in addNode
    k = DecodeString(k)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\core\gcmd.py", line 76, in DecodeString
    return string.decode(_enc)
  File "C:\Program Files\GRASS GIS 7.1.svn\Python27\lib\encodings\cp1257.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3: character maps to <undefined>
}}}
As the output shows, on the second line v contains UTF-8 encoded text.
Lines 3 and 4 report the keys to be str, and thus DecodeString is called.
So far nothing is wrong, but DecodeString uses GetSystemEncoding to decode
the string. On this system the _enc variable is set to cp1257, which is
definitely not UTF-8, and thus decoding fails.
The string in question (krāsa) comes from the GRASS translation into
Latvian. To reproduce the issue on your system, you must translate "color"
to a word with non-ASCII letters in it (zbarvenã) and, of course, encode
the translation file (PO) as UTF-8.
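
The decoding failure can be reproduced outside GRASS. A minimal sketch, assuming Python 3 (the ticket itself concerns Python 2.7) and using the byte sequence from the debug output above; the variable names are illustrative only:

```python
# UTF-8 bytes of the Latvian word "krāsa", as seen in the debug output
raw = b"kr\xc4\x81sa"

try:
    raw.decode("cp1257")          # the codec DecodeString ends up using here
    failed_at = None
except UnicodeDecodeError as err:
    failed_at = err.start         # index of the byte that could not be decoded

as_utf8 = raw.decode("utf-8")     # succeeds once the matching codec is used
print(failed_at, as_utf8)
```

The second byte of the UTF-8 pair for "ā" has no mapping in cp1257, which matches the "byte 0x81 in position 3" message in the traceback.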

The source of the problem is r47310, which installs the bytestring version
of gettext instead of the unicode version. This should work fine, but now
every _() call returns str, even for translations that contain non-ASCII
text. Reverting r47310 fixes this bug (and probably others too!) without
any problems. Still, I would like to hear Glynn's rationale for why it was
necessary in the first place (preferably with patches that solve the _()
issue if r47310 is to stay). Not using the unicode version of gettext is
really strange: Slovenian is the only language NOT using UTF-8 in its PO
files, and it was last updated in 2005, so the GRASS PO files ARE
unicode-ready.

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2617#comment:2>
GRASS GIS <http://grass.osgeo.org>

#2617: wxgui Raster query redirect to console UnicodeDecodeError

Comment(by glynn):

Replying to [comment:2 marisn]:

> The source of problem is r47310 where instead of installing unicode
> version of gettext a bytestring version is installed. This should work
> fine, but now in every place where a _() call is made, it returns str for
> unicode translations. Reverting r47310 fixes this bug (and probably others
> too!) without any problems, still I would like to hear Glynn's rationale
> why it was necessary in the first place (preferably with patches that
> solve _() issue if r47310 is to stay).

The scripting library only uses byte strings, never unicode. Values
returned from _() are typically written to streams (stdout/stderr or
files) or used as command-line arguments. These contexts invariably
require byte strings, so if _() returned a unicode value it would just get
converted to a byte string using the default encoding (not the locale's
encoding or filesystem encoding etc), which is usually ASCII. So prior to
r47310, any attempt by a script to use a translated string while in a
non-English locale was likely to result in the familiar "codec can't
encode character ..." error.
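
The failure mode described here can be sketched as follows. This assumes Python 3, which has no implicit conversion, so the ASCII encode is written out explicitly to mimic what Python 2 did behind the scenes; cp1257 is the codepage reported earlier in this ticket, and the variable names are illustrative:

```python
import locale

message = "kr\u0101sa"  # a translated string with a non-ASCII letter

try:
    message.encode("ascii")       # mimics Python 2's implicit default-encoding conversion
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly with the locale's codepage avoids the error
# (cp1257 is the codepage reported in this ticket).
encoded = message.encode("cp1257")

# The encoding this particular system would prefer; it varies by platform.
preferred = locale.getpreferredencoding(False)
```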

If there's a bug here, it's wxGUI expecting the grass.script library to
cater to it. grass.script doesn't exist for the benefit of wxGUI. If
grass.script isn't suitable for wxGUI (e.g. because of wxPython's use of
unicode), wxGUI should provide its own alternatives, not break
grass.script.

But the real question is: where is that UTF-8 coming from? On Windows,
nothing should ever see UTF-8, as Windows doesn't support UTF-8 as an
actual codepage (cp65001 is a pseudo-codepage which exists to allow
certain functions to use UTF-8; but you can't have a locale which uses
cp65001 as its codepage).

Byte strings which end up in wxGUI should be interpreted as using the
locale's codepage (cp1257 in this case), as should anything converted from
unicode to a byte string by wxGUI. Anything coming from wxPython (e.g. the
contents of a text field) should be unicode values (UTF-16-LE internally).

> Not using unicode version of gettext is really strange, as Slovenian is
> the only language NOT using UTF-8 in their PO files and it has seen the
> last update in 2005, thus GRASS PO files ARE unicode-ready.

The encoding used in PO files doesn't matter on systems which use GNU
gettext, which will automatically convert from the encoding used in the PO
file to the locale's encoding (so a single PO file can be used for both
e.g. en_GB.utf8 and en_GB.iso88591). In fact, the encoding used in PO
files shouldn't even be visible to applications (unless they're trying to
read the PO file directly rather than using gettext, which would be dumb).

Ideally, PO files should use the locale's legacy encoding (e.g. ISO-8859-1
for most of Western Europe). Newer systems will translate that to UTF-8 if
that's what the locale uses; older systems will just copy the data
verbatim, so it needs to use the locale's encoding (which, on older
systems, won't be UTF-8). This has the added advantage of restricting what
goes into those files to characters which can actually be displayed.

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2617#comment:3>
GRASS GIS <http://grass.osgeo.org>

#2617: wxgui Raster query redirect to console UnicodeDecodeError

Comment(by marisn):

Just dropping a note here as it needs further investigation:
https://docs.python.org/2/library/gettext.html#gettext-vs-lgettext

   In Python 2.4 the lgettext() family of functions were introduced. The
intention of these functions is to provide an alternative which is more
compliant with the current implementation of GNU gettext. Unlike
'''gettext(), which returns strings encoded with the same codeset used in
the translation file''', lgettext() will return strings encoded with the
preferred system encoding, as returned by locale.getpreferredencoding().
Also notice that Python 2.4 introduces new functions to explicitly choose
the codeset used in translated strings. If a codeset is explicitly set,
even lgettext() will return translated strings in the requested codeset,
as would be expected in the GNU gettext implementation.

The note on "same codeset" explains where the UTF-8 strings are coming
from and why the behaviour differs from the C implementation of gettext.
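
Going by the documentation quoted above, the difference between the two functions amounts to a re-encoding step. A rough sketch of that step, assuming Python 3; `recode` is a hypothetical helper written for illustration, not the gettext module's actual implementation:

```python
import locale

def recode(catalog_bytes, catalog_charset, target_charset=None):
    """Re-encode a translated byte string from the catalog's charset
    to the target charset (hypothetical helper, for illustration only)."""
    if target_charset is None:
        # Default to the preferred system encoding, as lgettext() is documented to do
        target_charset = locale.getpreferredencoding(False)
    return catalog_bytes.decode(catalog_charset).encode(target_charset)

utf8_msg = b"kr\xc4\x81sa"                        # bytes in the PO file's codeset (UTF-8)
baltic_msg = recode(utf8_msg, "utf-8", "cp1257")  # bytes a cp1257 locale expects
```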

The same document contains another interesting remark:
https://docs.python.org/2/library/gettext.html#the-gnutranslations-class

   Note that the Unicode version of the methods (i.e. ugettext() and
ungettext()) are the recommended interface to use for internationalized
Python programs.

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2617#comment:4>
GRASS GIS <http://grass.osgeo.org>

#2617: wxgui Raster query redirect to console UnicodeDecodeError

Comment(by glynn):

Replying to [comment:4 marisn]:

> In Python 2.4 the lgettext() family of functions were introduced. The
> intention of these functions is to provide an alternative which is more
> compliant with the current implementation of GNU gettext. Unlike
> '''gettext(), which returns strings encoded with the same codeset used in
> the translation file''', lgettext() will return strings encoded with the
> preferred system encoding, as returned by locale.getpreferredencoding().

Right. Unfortunately, gettext.install() binds the _() function to the
.gettext() method rather than to the .lgettext() method.

Try r64834.

> Note that the Unicode version of the methods (i.e. ugettext() and
> ungettext()) are the recommended interface to use for internationalized
> Python programs.

"Recommended" by someone who isn't going to be doing the (substantial)
amount of work involved in adding all the required .encode() calls, or
dealing with the bugs which arise whenever someone forgets the .encode()
call. Because without those calls, unicode values will be converted using
implicit conversions, which fails whenever the unicode value contains non-
ASCII characters.

As a rough guide, you can (and should) ignore anything the Python
developers have to say about Unicode. Their attitude tends to be
"everything should use Unicode, and the fact that POSIX (and a lot else)
doesn't is your problem and not ours".

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2617#comment:5>
GRASS GIS <http://grass.osgeo.org>

#2617: wxgui Raster query redirect to console UnicodeDecodeError

Comment(by zarch):

Replying to [comment:5 glynn]:
> Replying to [comment:4 marisn]:
> > Note that the Unicode version of the methods (i.e. ugettext() and
> > ungettext()) are the recommended interface to use for internationalized
> > Python programs.
>
> "Recommended" by someone who isn't going to be doing the (substantial)
> amount of work involved in adding all the required .encode() calls, or
> dealing with the bugs which arise whenever someone forgets the .encode()
> call. Because without those calls, unicode values will be converted using
> implicit conversions, which fails whenever the unicode value contains
> non-ASCII characters.

We have to do this work in any case for Python 3. We can create a function
that explicitly converts every input to unicode, something like:

{{{
import sys

# True when running under Python 2
PY2 = sys.version_info[0] == 2

def to_text_string(obj, encoding=None):
    """Convert `obj` to a (unicode) text string."""
    if PY2:
        # Python 2: `unicode` is the text type
        if encoding is None:
            return unicode(obj)
        else:
            return unicode(obj, encoding)
    else:
        # Python 3: `str` is the text type
        if encoding is None:
            return str(obj)
        elif isinstance(obj, str):
            # Already text; this could happen if the function is not used properly
            return obj
        else:
            return str(obj, encoding)
}}}

> As a rough guide, you can (and should) ignore anything the Python
> developers have to say about Unicode. Their attitude tends to be
> "everything should use Unicode, and the fact that POSIX (and a lot else)
> doesn't is your problem and not ours".

Many recent programming languages (e.g. Go, Rust) consider this good
practice, and personally I agree with them.
Python 3 removes this implicit conversion, which is why I believe we
should move to Python 3.

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2617#comment:6>
GRASS GIS <http://grass.osgeo.org>

#2617: wxgui Raster query redirect to console UnicodeDecodeError
Changes (by wenzeslaus):

  * keywords: query, encoding => query, encoding, python, gettext

Comment:

Replying to [comment:5 glynn]:
> Replying to [comment:4 marisn]:
>
> > In Python 2.4 the lgettext() family of functions were introduced.
> > The intention of these functions is to provide an alternative which is
> > more compliant with the current implementation of GNU gettext. Unlike
> > '''gettext(), which returns strings encoded with the same codeset used in
> > the translation file''', lgettext() will return strings encoded with the
> > preferred system encoding, as returned by locale.getpreferredencoding().
>
> Right. Unfortunately, gettext.install() binds the _() function to the
> .gettext() method rather than to the .lgettext() method.

> Try r64834:

{{{
#!python
import os
import gettext
import __builtin__

# install() binds _() to .gettext(); rebind it to .lgettext() instead
gettext.install('grasslibs', os.path.join(os.getenv("GISBASE"), 'locale'))
__builtin__.__dict__['_'] = __builtin__.__dict__['_'].im_self.lgettext
}}}

This solves the problem, but the fix is yet another reason for me to
believe that the translation function should be explicitly imported and
that changing builtins, explicitly or implicitly, should be avoided.
Compare the code above with the code in the GUI (r57219 and r57220):

{{{
#!python
# gui/wxpython/core/utils.py
import os

try:
    # _ is intended to be used also outside this module
    import gettext
    _ = gettext.translation('grasswxpy',
                            os.path.join(os.getenv("GISBASE"), 'locale')).ugettext
except IOError:
    # no translation found: silently fall back to the untranslated string
    def null_gettext(string):
        return string
    _ = null_gettext
}}}

Please see the further discussion in #2425.
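
For comparison, the explicit-import approach might look like this under Python 3, where `.gettext` already returns text. This is only a sketch: the `grasswxpy` domain and `GISBASE` variable come from the snippet above, while the empty-string default and the use of `fallback=True` are assumptions made so that it runs outside a GRASS session:

```python
import gettext
import os

# GISBASE and the 'grasswxpy' domain are taken from the snippet above;
# the empty-string default is an assumption so the sketch runs outside GRASS.
locale_dir = os.path.join(os.environ.get("GISBASE", ""), "locale")

# fallback=True returns a NullTranslations object when no catalog is found,
# so no IOError handling is needed and builtins stay untouched.
translation = gettext.translation("grasswxpy", locale_dir, fallback=True)
_ = translation.gettext

label = _("color")
```

Modules that need translations would then import `_` from this module instead of relying on a name injected into builtins.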

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2617#comment:7>
GRASS GIS <http://grass.osgeo.org>

#2617: wxgui Raster query redirect to console UnicodeDecodeError

Comment(by glynn):

Replying to [comment:6 zarch]:

> We have to do this work in any case for python3.

If we actually use it. For most scripting tasks, Python 3 offers nothing
but inconvenience.

And even then, there's a much simpler way to deal with it: convert unicode
strings to byte strings at the point they arise (there are far fewer of
these compared to the number of places where we will need to write byte
strings to streams or pass them as command arguments).
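
Converting at the point where unicode values arise could be sketched like this, assuming Python 3 naming (`str` for text, `bytes` for byte strings). The helper `to_bytes` is hypothetical and is not part of grass.script:

```python
import locale

def to_bytes(value, encoding=None):
    """Return `value` as a byte string (hypothetical helper).

    Text is encoded with `encoding`, defaulting to the locale's preferred
    encoding; byte strings pass through unchanged.
    """
    if isinstance(value, bytes):
        return value
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return value.encode(encoding)

# Convert once, where the unicode value arises, then use bytes downstream
label = to_bytes("kr\u0101sa", "cp1257")
```

With a helper of this kind, the stream writes and command invocations further down never see a unicode value.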

> We can create a function that explicity convert every input to unicode,
> something like:

But why bother? At the lowest level, scripts tend to do two things: invoke
commands and read/write streams. Both of these deal with byte strings.
Converting to unicode then back again just creates unnecessary failure
modes; there's no guarantee that data read from a given stream will be in
the locale's encoding, or even in any known encoding.

wxGUI has to deal with this because wxPython uses Unicode throughout (and
look how many wxGUI issues relate to Unicode{Encode,Decode}Error as a
result). The scripting library doesn't need to deal with this; there's no
inherent reason why most scripts should ever encounter a unicode value.

--
Ticket URL: <http://trac.osgeo.org/grass/ticket/2617#comment:8>
GRASS GIS <http://grass.osgeo.org>

#2617: wxgui Raster query redirect to console UnicodeDecodeError

Comment (by neteler):

Replying to [comment:5 glynn]:
> Replying to [comment:4 marisn]:
>
> > In Python 2.4 the lgettext() family of functions were introduced.
> > The intention of these functions is to provide an alternative which is
> > more compliant with the current implementation of GNU gettext. Unlike
> > '''gettext(), which returns strings encoded with the same codeset used in
> > the translation file''', lgettext() will return strings encoded with the
> > preferred system encoding, as returned by locale.getpreferredencoding().
>
> Right. Unfortunately, gettext.install() binds the _() function to the
> .gettext() method rather than to the .lgettext() method.
>
> Try r64834.

Is this a backport candidate?

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/2617#comment:9>
GRASS GIS <http://grass.osgeo.org>