[GRASS-dev] Character encoding of module i.atcorr files

Dear Grass developers,

I tried to open the file "computations.cpp" of module i.atcorr in gedit and got the following error message:

-------
There was a problem opening the file “computations.cpp”.
The file you opened has some invalid characters. If you continue editing this file you could corrupt this document.
You can also choose another character encoding and try again.
Character Encoding: Current Locale (UTF-8).
-------

Then I tried opening the file using another character encoding (Western (ISO-8859-15)) and the file opened without problems. I wonder if the different character encoding of this file (all other files opened using UTF-8) could cause errors when running the module.
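
The editor's behaviour can be reproduced in a few lines. This is only an illustrative sketch: the sample bytes are made up (a comment containing a single byte 0xE9), they are not the real content of computations.cpp, and try_decode is a hypothetical helper.

```python
# Hypothetical sample: a comment holding one byte 0xE9, which is
# "e with acute" in both ISO-8859-1 and ISO-8859-15.
sample = b"/* caf\xe9 */"


def try_decode(data, encoding):
    """Return the decoded text, or None if data is not valid in encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# In UTF-8, 0xE9 announces a three-byte sequence; the following space is
# not a continuation byte, so strict decoding fails.
as_utf8 = try_decode(sample, "utf-8")

# In ISO-8859-15 every byte maps to a character, so decoding succeeds.
as_latin9 = try_decode(sample, "iso-8859-15")
```

A strict UTF-8 reader rejects the byte sequence, while a single-byte encoding accepts any byte, which is why the second attempt in the editor worked.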

Regards,

--

Alessandro Samuel-Rosa
---
PhD Candidate Graduate School in Agronomy - Soil Science
Federal Rural University of Rio de Janeiro
Seropédica, Rio de Janeiro, Brazil
---
Guest Researcher ISRIC - World Soil Information
Wageningen, the Netherlands alessandro.rosa@wur.nl
---
Homepage: soil-scientist.net Skype: alessandrosamuel

The offending line is a reference in the comment section:
http://trac.osgeo.org/grass/browser/grass/trunk/imagery/i.atcorr/computations.cpp#L1365

I browsed the SUBMITTING file and didn't find any rules about source
encoding. As a supporter of Unicode everywhere, I would suggest adding
a requirement for source files to be in UTF-8. The upside: most files
already are in UTF-8, so only files with symbols outside of latin1
would be affected.

Just my 0.02,
Maris.

2014-02-27 12:22 GMT+02:00 Alessandro Samuel Rosa
<alessandrosamuel@yahoo.com.br>:

> Dear Grass developers,
>
> I tried to open the file "computations.cpp" of module i.atcorr in gedit and got the following error message:
>
> -------
> There was a problem opening the file “computations.cpp”.
> The file you opened has some invalid characters. If you continue editing this file you could corrupt this document.
> You can also choose another character encoding and try again.
> Character Encoding: Current Locale (UTF-8).
> -------
>
> Then I tried opening the file using another character encoding (Western (ISO-8859-15)) and the file opened without problems. I wonder if the different character encoding of this file (all other files opened using UTF-8) could cause errors when running the module.
>
> Regards,
>
> --
>
> Alessandro Samuel-Rosa
> ---
> PhD Candidate Graduate School in Agronomy - Soil Science
> Federal Rural University of Rio de Janeiro
> Seropédica, Rio de Janeiro, Brazil
> ---
> Guest Researcher ISRIC - World Soil Information
> Wageningen, the Netherlands alessandro.rosa@wur.nl
> ---
> Homepage: soil-scientist.net Skype: alessandrosamuel
> _______________________________________________
> grass-dev mailing list
> grass-dev@lists.osgeo.org
> http://lists.osgeo.org/mailman/listinfo/grass-dev

Maris Nartiss wrote:

> The offending line is a reference in the comment section:
> http://trac.osgeo.org/grass/browser/grass/trunk/imagery/i.atcorr/computations.cpp#L1365
>
> I browsed SUBMITTING file and didn't find any rules about source
> encoding. As a supporter of Unicode everywhere, I would suggest to add
> a requirement for source files to be in UTF-8. Upside - most of files
> already are in UTF-8. Thus only files with symbols outside of latin1
> would be affected.

Most files are ASCII. Those which aren't are almost evenly split
between ISO-8859-1 and UTF-8:

Files using ISO-8859-1:

raster/r.sunmask/g_solposition.c U+00B0 DEGREE SIGN
imagery/i.topo.corr/main.c U+00F1 LATIN SMALL LETTER N WITH TILDE
imagery/i.landsat.toar/landsat.h U+00B5 MICRO SIGN
imagery/i.evapo.pm/functions.c U+00B0 DEGREE SIGN
imagery/i.atcorr/computations.cpp U+00E9 LATIN SMALL LETTER E WITH ACUTE
lib/raster/color_look.c U+00AD SOFT HYPHEN
lib/raster/color_set.c U+00AD SOFT HYPHEN

Files using UTF-8:

raster/r.sunmask/main.c U+00B0 DEGREE SIGN
raster/r.watershed/ram/do_flatarea.c U+2013 EN DASH
vector/v.net.salesman/main.c U+2013 EN DASH
gui/wxpython/lmgr/frame.py U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
          U+2019 RIGHT SINGLE QUOTATION MARK
lib/python/pygrass/functions.py U+00B0 DEGREE SIGN
lib/arraystats/class.c U+00E9 LATIN SMALL LETTER E WITH ACUTE

Many of these are gratuitous, e.g. use of soft hyphen or
en-dash when an ASCII "-" (U+002D HYPHEN-MINUS) would suffice.

Some are due to comments written in languages other than English
(i.topo.corr = Spanish, lib/arraystats = French); these should be
translated.

All but one are in comments: the pygrass one is a string literal,
which should really use escape notation (assuming that the
is_clean_name() function is actually correct, and not a half-baked
attempt at re-implementing G_legal_filename()).
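
As an illustration of the escape notation mentioned above, a non-ASCII character in a Python string literal can be written so that the source file itself stays pure ASCII. This is a minimal sketch; the names below are hypothetical and are not taken from pygrass.

```python
# The degree sign written by code point; the source stays 7-bit clean.
DEGREE_SIGN = u"\u00b0"

# The same character written by its Unicode name, which is easier to read.
DEGREE_SIGN_NAMED = u"\N{DEGREE SIGN}"


def contains_degree_sign(name):
    """Return True if the (hypothetical) map name contains a degree sign."""
    return DEGREE_SIGN in name
```

Both spellings produce the same one-character string at run time, so the behaviour of the code does not depend on how an editor or tool guesses the file's encoding.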

So, if those are fixed, it boils down to whether we actually want to
have to deal with source-code encoding issues for the sake of comments
which include:

a) °C for degrees Celsius,
b) µm for micrometres (microns), and
c) proper names using the Latin script with accents (names using any
other script will invariably be romanised).

Personally, I would prefer it if source code was 7-bit clean.
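
A survey like the one above can be approximated with a short script. This is a sketch under stated assumptions, not the method actually used to produce the lists: the set of file suffixes is a guess, and any file that is not valid UTF-8 is simply assumed to be ISO-8859-1. The function names are hypothetical.

```python
import os
import unicodedata

# Assumed set of source-file suffixes; adjust as needed.
SOURCE_SUFFIXES = (".c", ".h", ".cpp", ".py")


def classify(data):
    """Classify raw bytes as 'ascii', 'utf-8' or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


def non_ascii_chars(data, encoding):
    """Return sorted 'U+XXXX NAME' labels for the non-ASCII characters."""
    text = data.decode(encoding)
    found = sorted(set(ch for ch in text if ord(ch) > 127))
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in found]


def scan(root):
    """Yield (path, kind, labels) for source files that are not 7-bit clean."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(SOURCE_SUFFIXES):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as handle:
                data = handle.read()
            kind = classify(data)
            if kind == "ascii":
                continue
            # Assumption: bytes that are not valid UTF-8 are ISO-8859-1.
            encoding = "utf-8" if kind == "utf-8" else "iso-8859-1"
            yield path, kind, non_ascii_chars(data, encoding)
```

Iterating over scan(".") from the top of a checkout would report each offending file with the characters it contains. A check of this kind could also be run before committing if a 7-bit-clean rule were adopted.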

--
Glynn Clements <glynn@gclements.plus.com>

Maris wrote:

> The offending line is a reference in the comment section:
> http://trac.osgeo.org/grass/browser/grass/trunk/imagery/i.atcorr/computations.cpp#L1365
>
> I browsed SUBMITTING file and didn't find any rules about source
> encoding.

...

Glynn wrote

> Most files are ASCII. Those which aren't are almost evenly split
> between ISO-8859-1 and UTF-8:
>
> Files using ISO-8859-1:
>
> raster/r.sunmask/g_solposition.c U+00B0 DEGREE SIGN
> imagery/i.topo.corr/main.c U+00F1 LATIN SMALL LETTER N WITH TILDE
> imagery/i.landsat.toar/landsat.h U+00B5 MICRO SIGN
> imagery/i.evapo.pm/functions.c U+00B0 DEGREE SIGN
> imagery/i.atcorr/computations.cpp U+00E9 LATIN SMALL LETTER E WITH ACUTE
> lib/raster/color_look.c U+00AD SOFT HYPHEN
> lib/raster/color_set.c U+00AD SOFT HYPHEN
>
> Files using UTF-8:
>
> raster/r.sunmask/main.c U+00B0 DEGREE SIGN
> raster/r.watershed/ram/do_flatarea.c U+2013 EN DASH
> vector/v.net.salesman/main.c U+2013 EN DASH
> gui/wxpython/lmgr/frame.py U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
>           U+2019 RIGHT SINGLE QUOTATION MARK
> lib/python/pygrass/functions.py U+00B0 DEGREE SIGN
> lib/arraystats/class.c U+00E9 LATIN SMALL LETTER E WITH ACUTE
>
> Many of these are either gratuitous, e.g. use of soft hyphen or
> en-dash when an ASCII "-" (U+002D HYPHEN-MINUS) would suffice.
>
> Some are due to comments written in languages other than English
> (i.topo.corr = Spanish, lib/arraystats = French); these should be
> translated.
>
> All but one are in comments: the pygrass one is a string literal,
> which should really use escape notation (assuming that the
> is_clean_name() function is actually correct, and not a half-baked
> attempt at re-implementing G_legal_filename()).
>
> So, if those are fixed, it boils down to whether we actually want to
> have to deal with source-code encoding issue for the sake of comments
> which include:
>
> a) °C for degrees Celcius,
> b) µm for micrometres (microns), and
> c) proper names using the Latin script with accents (names using any
> other script will invariably be romanised).

I've now removed most of these in trunk with r59172.

remaining:
imagery/i.atcorr/computations.cpp (someone's name)
gui/wxpython/lmgr/frame.py (an example of something using UTF-8)

and lib/python/pygrass/functions.py ...

as for functions.py, hooking into G_legal_filename() would
be best, but failing that, a white-list of allowed chars would
seem much more robust than a small black-list of disallowed
chars.

> Personally, I would prefer it if source code was 7-bit clean.

Me too. Not sure how to deal with non-ASCII chars in people's names though.

regards,
Hamish

On Sun, Mar 2, 2014 at 10:59 PM, Hamish <hamish_b@yahoo.com> wrote:

> I've now removed most of these in trunk with r59172.
>
> remaining:
> imagery/i.atcorr/computations.cpp (someone's name)
>
> gui/wxpython/lmgr/frame.py (an example of something using UTF-8)

https://trac.osgeo.org/grass/browser/grass/trunk/gui/wxpython/lmgr/frame.py#L978

I wanted this to be written without UTF-8 chars, but since the UTF-8 chars
are what is problematic there, I agree with MarkusN that it is better to be
explicit.

> and lib/python/pygrass/functions.py ...
>
> as for functions.py, hooking into G_legal_filename() would
> be best, but failing that, a white-list of allowed chars would
> seem much more robust than a small black-list of disallowed
> chars.

>> Personally, I would prefer it if source code was 7-bit clean.
>
> Me too. Not sure how to deal with non-ASCII chars in people's names though.

The problem is that each language deals with this differently. While for
Czech you write Petras instead of Petráš, for German, you write Soeren
instead of Sören in case you want to avoid non-ASCII. For languages with
non-latin alphabet, it is even more complicated. And moreover, the context
when it is appropriate or tolerated may differ.
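
The difference between the two conventions can be shown with a small sketch. It is only an illustration: the function names are hypothetical and the German table is deliberately tiny (lowercase letters only).

```python
import unicodedata


def strip_accents(name):
    """Decompose accented letters and drop the combining marks.

    Characters without an ASCII decomposition are silently dropped.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Hand-maintained, language-specific rules; only an example, lowercase only.
GERMAN_RULES = {
    u"\u00e4": "ae",  # a with diaeresis
    u"\u00f6": "oe",  # o with diaeresis
    u"\u00fc": "ue",  # u with diaeresis
    u"\u00df": "ss",  # sharp s
}


def romanise_german(name):
    """Apply the German digraph convention, then strip remaining accents."""
    for char, replacement in GERMAN_RULES.items():
        name = name.replace(char, replacement)
    return strip_accents(name)
```

Generic accent stripping gives the expected Czech form (Petras) but turns Sören into Soren, whereas the German convention needs its own table to produce Soeren. No single mechanical rule covers both cases.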

However, it seems that languages usually have some way to write names in
ASCII or in English transcription, so we can use that in source code.
Original names in UTF-8 can be in contributors.csv and in the (HTML)
documentation for modules, which may contain some UTF-8 chars anyway for
various reasons.

But anyway, UTF-8 is now everywhere, and from time to time it is necessary and
much easier than various workarounds such as entities in HTML, Unicode
escape sequences or rewriting the readable and standard °C as degC. So I don't
see 7-bit or any other simplification as advantageous, because the problem is
complex and you just cannot fit it into 7 bits (1).

Are there any disadvantages of using UTF-8?

Vaclav (Václav Petráš)

(1) This reminded me of a comment somewhere in which the question "How
do I use this with a Latin2 encoded language?" was answered with "Use Latin1.",
which is of course absurd since Latin1 contains different characters than
Latin2 (that's why both exist). My point is that encoding text in
something other than Unicode/UTF-8 is usually a huge simplification which
may destroy the original text.

On Mon, Mar 3, 2014 at 3:59 AM, Hamish <hamish_b@yahoo.com> wrote:

> and lib/python/pygrass/functions.py …
>
> as for functions.py, hooking into G_legal_filename() would
> be best

done in r59180.

>> Personally, I would prefer it if source code was 7-bit clean.
>
> Me too. Not sure how to deal with non-ASCII chars in people’s names though.

I don’t see the reason to avoid source code in utf-8, at least in
Python files, in particular when it is explicitly declared in the file
header like this (as in PEP 0263):

# -*- coding: utf-8 -*-

regards

Pietro

One other file with non-ASCII is doc/infrastructure.txt which contains

grass-es La lista de correo de GRASS GIS en español

I would say that in infrastructure.txt the English name would be more appropriate, but the official name is the one above and it is clear what it means, so I would keep the original.


On Mon, Mar 3, 2014 at 11:19 AM, Pietro <peter.zamb@gmail.com> wrote:

> On Mon, Mar 3, 2014 at 3:59 AM, Hamish <hamish_b@yahoo.com> wrote:
>
>> and lib/python/pygrass/functions.py …
>> as for functions.py, hooking into G_legal_filename() would
>> be best
>
> done in r59180.
>
>>> Personally, I would prefer it if source code was 7-bit clean.
>> Me too. Not sure how to deal with non-ASCII chars in people’s names though.
>
> I don’t see the reason to avoid source code in utf-8, at least on
> python files, in particular when it is explicitly declare in the file
> header like this (as in PEP 0263):
>
> # -*- coding: utf-8 -*-
>
> regards
>
> Pietro
>
> ps: sorry my previous email lost the format.



Vaclav Petras wrote:

> But anyway, UTF-8 is now everywhere and time to time it is necessary and
> much easier than various workarounds such as entities in HTML, unicode
> escape sequences or rewriting readable and standard °C to degC. So, I don't
> see 7 bit or whatever simplification as advantageous because the problem is
> complex and you just cannot fit into 7 bit (1).
>
> Are there any disadvantages of using UTF-8?

Yes. It isn't actually supported everywhere, and even where it is
supported, it isn't necessarily the only encoding used or even the
preferred encoding.

Even ASCII isn't universal, but people using encodings which are
incompatible with ASCII are familiar with the issues.

Using anything other than ASCII in source code creates unnecessary
problems, which is something which we should try to avoid.

--
Glynn Clements <glynn@gclements.plus.com>

Pietro wrote:

> On Mon, Mar 3, 2014 at 3:59 AM, Hamish <hamish_b@yahoo.com> wrote:
>
>> and lib/python/pygrass/functions.py …
>>
>> as for functions.py, hooking into G_legal_filename() would
>> be best
>
> done in r59180.

FWIW, if you want to avoid using a C function, it's trivial to
re-implement G_legal_filename() in Python. In particular, *any*
character >= 127 is invalid.
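
A Python version of such a check might look like the sketch below. The rule that any character >= 127 is invalid comes from the text above; the rest (rejecting empty names, names starting with a dot, control characters, spaces, and the particular set of punctuation) is an assumption based on a reading of the C function and should be verified against lib/gis/legal_name.c. The name legal_filename is hypothetical.

```python
# Punctuation rejected in addition to control characters, space and
# non-ASCII. ASSUMPTION: this set must be checked against the C source.
ILLEGAL_CHARS = set("/\"'@,=*")


def legal_filename(name):
    """Return True if name passes the (assumed) GRASS file name rules."""
    # Assumed rule: empty names and names starting with '.' are rejected.
    if not name or name.startswith("."):
        return False
    for ch in name:
        code = ord(ch)
        # Control characters and space (<= 32), DEL and anything above it
        # (>= 127, i.e. every non-ASCII character) are rejected.
        if code <= 32 or code >= 127:
            return False
        if ch in ILLEGAL_CHARS:
            return False
    return True
```

Because the range test rejects everything outside printable ASCII, no individual non-ASCII character such as the degree sign needs to appear in the source at all.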

>>> Personally, I would prefer it if source code was 7-bit clean.
>>
>> Me too. Not sure how to deal with non-ASCII chars in people’s names though.

If they're moderately well known, it shouldn't be hard to find out the
standard romanisation.

> I don’t see the reason to avoid source code in utf-8, at least on
> python files, in particular when it is explicitly declare in the file
> header like this (as in PEP 0263):
>
> # -*- coding: utf-8 -*-

Source code doesn't just have to work with the interpreter. It needs
to work with text editors, web browsers, revision-control systems,
printers, etc. Some of those understand coding cookies, some don't.

The main problem with coding cookies is that software which
(explicitly or implicitly) converts between encodings typically leaves
the cookie unchanged. Which is why libraries such as chardet typically
treat coding cookies as little more than a hint or fallback.
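
The stale-cookie situation described above can be detected, at least in one direction, with the standard library. This is a minimal sketch: cookie_matches_content is a hypothetical helper, and it relies on tokenize.detect_encoding from Python 3 to read the declaration.

```python
import io
import tokenize


def cookie_matches_content(data):
    """Return (declared_encoding, ok) for the raw bytes of a Python file.

    declared_encoding is None if the declaration itself cannot be read.
    """
    try:
        declared, _lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    except SyntaxError:
        return None, False
    try:
        data.decode(declared)
    except UnicodeDecodeError:
        # The cookie promises one encoding but the bytes are in another,
        # e.g. after a tool converted the file and left the cookie alone.
        return declared, False
    return declared, True
```

The check only catches a file whose cookie says UTF-8 while the bytes are something else. A file converted to UTF-8 with a leftover latin-1 cookie still decodes without error, because a single-byte encoding accepts any byte, so the mismatch goes unnoticed.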

--
Glynn Clements <glynn@gclements.plus.com>

On Mon, Mar 3, 2014 at 9:18 PM, Glynn Clements <glynn@gclements.plus.com> wrote:

> Source code doesn't just have to work with the interpreter. It needs
> to work with text editors, web browsers, revision-control systems,
> printers, etc. Some of those understand coding cookies, some don't.
>
> The main problem with coding cookies is that software which
> (explicitly or implicitly) converts between encodings typically leaves
> the cookie unchanged. Which is why libraries such as chardet typically
> treat coding cookies as little more than a hint or fallback.

What are coding cookies?

My search brought me to a "Cookies and Coding" event [1], which is surely great
but apparently not related to the current discussion.

[1] http://www.meetup.com/Women-Who-Code-Austin/events/160041362/

Vaclav Petras wrote:

> What are coding cookies?

A coding cookie is a piece of text indicating the encoding of a text
file, such as:

  -*- coding: utf-8 -*-

The above syntax originates from Emacs, and has been adopted by
Python. I don't know what else (if anything) supports it.

--
Glynn Clements <glynn@gclements.plus.com>