[GRASS-dev] Re: [GRASS-stats] Sys.setlocale for GRASS6.4

(cc grass-dev)

@grass-dev: There are encoding issues with --interface-description
   which states UTF-8 even when the actual language encoding is different:

2009/9/4 Roger Bivand <Roger.Bivand@nhh.no>:

> OK, don't worry about that. I'll try to send an updated spgrass6 when I have
> adequate net access. The problem is that GRASS writes UTF-8 as the encoding
> into the XML header from --interface-description, but the French translations
> seem (on my XP machine) to be in latin1, that is the 0x.. spell eacute
> (\'{e}) in latex, then "fin" from "défine...", the first non-ASCII string in
> the output for g.region.
>
> The fix is to insert "latin1" into the header, so I've updated the package
> to allow the user to do this from within R

This gave me the idea to check the .po files in GRASS 6.4:

grep charset locale/po/grass*_fr.po
locale/po/grasslibs_fr.po:"Content-Type: text/plain; charset=ISO-8859-1\n"
locale/po/grassmods_fr.po:"Content-Type: text/plain; charset=ISO-8859-1\n"
locale/po/grasstcl_fr.po:"Content-Type: text/plain; charset=UTF-8\n"
locale/po/grasswxpy_fr.po:"Content-Type: text/plain; charset=UTF-8\n"

Apparently the translators mixed several charsets instead of using one;
the same is true for various other languages supported in GRASS.
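
The mismatch Roger describes can be seen at the byte level: in latin1, "é" is the single byte 0xE9, which a UTF-8 reader treats as the start of a three-byte sequence, so the following "f" makes the input malformed. The sketch below is only an illustration of that check, not code from GRASS or spgrass6; the function name looks_like_utf8() is hypothetical, and the test is deliberately simplified (it does not catch every overlong or surrogate form).

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the first len bytes of s look like well-formed UTF-8, else 0.
 * Simplified check: validates lead bytes and continuation bytes only. */
int looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;   /* number of continuation bytes expected */

        if (c < 0x80)
            extra = 0;                      /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;                      /* two-byte sequence */
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;                      /* three-byte sequence */
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;                      /* four-byte sequence */
        else
            return 0;                       /* stray continuation or invalid lead byte */

        if (extra > len - i - 1)
            return 0;                       /* sequence truncated at end of input */

        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                   /* expected a 10xxxxxx continuation byte */

        i += extra + 1;
    }
    return 1;
}
```

Called on the latin1 bytes 'd', 0xE9, 'f', 'i', 'n', 'i' the function returns 0, while the UTF-8 form 'd', 0xC3, 0xA9, 'f', 'i', 'n', 'i' passes.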

@grass-dev: Should we harmonize the encodings to UTF-8 for all/subset
  of languages? If yes, how? With iconv?

@grass-stats: perhaps we need to do our homework first and fix the mess
  if it is a mess...

Markus

Markus Neteler wrote:

> @grass-dev: Should we harmonize the encodings to UTF-8 for all/subset
>   of languages? If yes, how? With iconv?

No.

For systems using GNU libc, the .mo files use unicode, which is
converted to the locale's encoding automatically at run time. If your
locale uses ISO-8859-1, that's what the program will get regardless of
the encoding used in the .po files.

Additionally, using "historical" encodings provides better
compatibility. Systems which support unicode will typically convert
the .po files to unicode in the .mo files, then convert this to the
locale's encoding at run time, so it doesn't matter which encoding is
used. Systems which don't support unicode will require the .po files
to use the locale's historical encoding (ISO-8859-*, EUC-JP, etc).

The --interface-description option needs to either determine the
locale's encoding via e.g. nl_langinfo() and use that in the header,
or convert the data to UTF-8. The latter has the advantage of
relieving the reader of the burden of handling multiple encodings.
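
As a rough sketch of the second option (converting the data to UTF-8 before writing it), the POSIX iconv(3) interface could be used along the following lines. This is an illustration only, not code from GRASS: the helper name to_utf8() and its buffer-based interface are assumptions, and the source codeset would in practice come from the locale rather than being passed in by hand.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated string `in` from `from_codeset` to UTF-8.
 * The result is written NUL-terminated into `out` (capacity `outsz`).
 * Returns 0 on success, -1 on any failure (unknown codeset, invalid
 * input, or output buffer too small). */
int to_utf8(const char *from_codeset, const char *in, char *out, size_t outsz)
{
    assert(from_codeset != NULL && in != NULL && out != NULL);

    if (outsz == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", from_codeset);
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;         /* iconv() wants char **, input is not modified */
    size_t srcleft = strlen(in);
    char *dst = out;
    size_t dstleft = outsz - 1;     /* reserve room for the terminating NUL */

    size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;

    *dst = '\0';
    return 0;
}
```

With this approach the XML header can always declare UTF-8, because the payload is converted to match it regardless of the locale's own encoding.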

--
Glynn Clements <glynn@gclements.plus.com>

On Fri, Sep 4, 2009 at 9:26 PM, Glynn Clements<glynn@gclements.plus.com> wrote:

Markus Neteler wrote:

@grass-dev: There are encoding issues with --interface-description
which states UTF-8 also then the actual language encoding is different:

[then -> when]

...

> The --interface-description option needs to either determine the
> locale's encoding via e.g. nl_langinfo() and use that in the header,

It seems to be already there?

lib/gis/parser.c:

static void G_usage_xml(void)
{
...
#if defined(HAVE_LANGINFO_H)
    encoding = nl_langinfo(CODESET);
    if (!encoding || strlen(encoding) == 0) {
        encoding = "UTF-8";
    }
#else
    encoding = "UTF-8";
#endif
...
    fprintf(stdout, "<?xml version=\"1.0\" encoding=\"%s\"?>\n", encoding);
    fprintf(stdout, "<!DOCTYPE task SYSTEM \"grass-interface.dtd\">\n");

Apparently the Windows build was missing HAVE_LANGINFO_H, or it
isn't properly set on Windows.

?
Markus

On Fri, 4 Sep 2009, Markus Neteler wrote:

> Apparently the Windows build was missing HAVE_LANGINFO_H, or it
> isn't properly set on Windows.

As a work-around, I've submitted to CRAN a revised version of spgrass6 allowing the user to manipulate the encoding string in the XML data directly, since Windows users of binary releases cannot get at this themselves on the GRASS side.

The user would typically run a parseGRASS(<whatever>) command, see a UTF-8 error message, and then try inserting usual suspects with setXMLencoding() - typically "latin1" - and retry parseGRASS(<whatever>). Users of initGRASS() will see the UTF-8 error because parseGRASS("g.region") is run when the function completes.
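
The work-around amounts to rewriting the encoding attribute of the XML declaration before the document is parsed. The spgrass6 implementation is in R and is not shown in this thread; the C sketch below only illustrates the idea, and the function name set_xml_encoding() and its buffer-based interface are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy `xml` into `out`, replacing the value of encoding="..." inside the
 * XML declaration with `enc`. Returns 0 on success, -1 if no encoding
 * attribute is found in the declaration or `out` is too small. */
int set_xml_encoding(const char *xml, const char *enc, char *out, size_t outsz)
{
    static const char key[] = "encoding=\"";

    assert(xml != NULL && enc != NULL && out != NULL);

    const char *decl_end = strstr(xml, "?>");   /* end of the XML declaration */
    const char *start = strstr(xml, key);
    if (!decl_end || !start || start > decl_end)
        return -1;

    start += sizeof key - 1;                    /* first character of the old value */
    const char *end = strchr(start, '"');       /* closing quote of the old value */
    if (!end || end > decl_end)
        return -1;

    /* prefix up to the opening quote + new encoding + rest of the document */
    int n = snprintf(out, outsz, "%.*s%s%s", (int)(start - xml), xml, enc, end);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

For example, passing a document that starts with `<?xml version="1.0" encoding="UTF-8"?>` and the string "latin1" yields the same document with `encoding="latin1"` in the declaration, which is what lets the parser accept the latin1 bytes that follow.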

Roger

--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand@nhh.no

Markus Neteler wrote:

>> @grass-dev: There are encoding issues with --interface-description
>> which states UTF-8 also then the actual language encoding is different:

[then -> when]

...
> The --interface-description option needs to either determine the
> locale's encoding via e.g. nl_langinfo() and use that in the header,

It seems to be already there?

lib/gis/parser.c:

static void G_usage_xml(void)
{
...
#if defined(HAVE_LANGINFO_H)
    encoding = nl_langinfo(CODESET);
    if (!encoding || strlen(encoding) == 0) {
        encoding = "UTF-8";
    }

That would explain the sense of déjà vu ;)

> Apparently the Windows build was missing HAVE_LANGINFO_H, or it
> isn't properly set on Windows.

This probably won't exist on Windows. I think that we can get the
information from locale_charset(), declared in localcharset.h and
defined in both libintl and libgettext.

Can someone try the attached patch?

--
Glynn Clements <glynn@gclements.plus.com>

(attachments)

locale_charset.diff (666 Bytes)
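
The attached patch is not reproduced in this archive. Purely as an illustration of the idea discussed above, a lookup that prefers locale_charset(), falls back to nl_langinfo(CODESET), and finally to "UTF-8" might look like the sketch below. The macro HAVE_LOCALCHARSET_H and the function names xml_encoding_name() and write_xml_header() are assumptions made for this example and should not be read as the actual change to lib/gis/parser.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#if defined(HAVE_LOCALCHARSET_H)    /* hypothetical configure macro */
#include <localcharset.h>
#elif defined(HAVE_LANGINFO_H)
#include <langinfo.h>
#endif

/* Return the name of the encoding to declare in the XML header. */
const char *xml_encoding_name(void)
{
    const char *encoding = NULL;

#if defined(HAVE_LOCALCHARSET_H)
    encoding = locale_charset();        /* from libintl / libcharset */
#elif defined(HAVE_LANGINFO_H)
    encoding = nl_langinfo(CODESET);    /* POSIX systems */
#endif

    /* Last resort when nothing is available or the result is empty. */
    if (!encoding || strlen(encoding) == 0)
        encoding = "UTF-8";

    return encoding;
}

/* Write the XML declaration to fp; returns the fprintf() result. */
int write_xml_header(FILE *fp)
{
    assert(fp != NULL);
    return fprintf(fp, "<?xml version=\"1.0\" encoding=\"%s\"?>\n",
                   xml_encoding_name());
}
```

Note that both lookups report the encoding of the current locale, so they only return something other than the "C" locale's codeset if the program has called setlocale() beforehand.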

(for the record)

On Sat, Sep 5, 2009 at 8:55 PM, Glynn Clements <glynn@gclements.plus.com> wrote:

> Can someone try the attached patch?

Glynn's patch has been backported to 6.5.svn and 6.4.svn. Hopefully
the problem is gone now.

Markus