[Geoserver-devel] [jira] Created: (GEOS-2399) Need a way to specify the encoding of shapefiles generated with SHAPE-ZIP output format

Need a way to specify the encoding of shapefiles generated with SHAPE-ZIP output format
---------------------------------------------------------------------------------------

                 Key: GEOS-2399
                 URL: http://jira.codehaus.org/browse/GEOS-2399
             Project: GeoServer
          Issue Type: New Feature
          Components: WFS
    Affects Versions: 1.7.0
            Reporter: Andrea Aime
            Assignee: Andrea Aime
             Fix For: 1.7.2

At the moment the platform default encoding is used, which may not suit the characters that will be written to the shapefile (UTF-8 is not a common charset for shapefiles; the common one is ISO-8859-1, but it does not work well with non-Western scripts).
Ideas collected so far:
* allow the specification in the output format name: SHAPE-ZIP;encoding=ISO-8859-15
* allow the specification as a GET parameter (and keep it as a GET parameter even with POST requests): ...&format_options=encoding:ISO-8859-15
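A minimal sketch of how the second idea could be parsed on the server side. The class and method names are hypothetical (this is not GeoServer code), and the assumption that options are separated by ';' with each one written as key:value goes beyond what the issue states.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of GeoServer: extracts the charset from a
// format_options value such as "encoding:ISO-8859-15".
final class ShapeZipEncodingOption {

    // Assumption: options are separated by ';' and each one is "key:value".
    static Charset parse(String formatOptions, Charset fallback) {
        if (formatOptions == null) {
            return fallback;
        }
        for (String option : formatOptions.split(";")) {
            int colon = option.indexOf(':');
            if (colon < 0) {
                continue; // not a key:value pair, ignore it
            }
            String key = option.substring(0, colon).trim();
            String value = option.substring(colon + 1).trim();
            if (key.equalsIgnoreCase("encoding")) {
                // Charset.forName throws an unchecked exception if the
                // name is illegal or the charset is not installed.
                return Charset.forName(value);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Falls back to the platform default when no encoding is given.
        System.out.println(parse("encoding:ISO-8859-15", Charset.defaultCharset()));
        System.out.println(parse(null, Charset.defaultCharset()));
    }
}
```

Returning the fallback rather than failing keeps today's behaviour (platform default) for requests that do not mention an encoding.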

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.codehaus.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

Great enhancement for all non-A-Z languages.

How will the encoding be stored in the DBF-file?
ESRI has two ways of doing this, either storing LDID in the DBF header or creating a text file <filename>.cpg which stores the codepage.
See http://support.esri.com/index.cfm?fa=knowledgebase.techarticles.articleShow&d=26015
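As a rough sketch of the .cpg option: the sidecar is a small text file next to the other shapefile parts. The class name below is hypothetical, and the exact label a given reader expects inside the file (a charset name, a codepage number, or something else) is an assumption that would need checking against the ESRI article.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes a <filename>.cpg sidecar next to a shapefile part.
final class CpgWriter {

    // Derives e.g. "roads.cpg" from "roads.shp" or "roads.dbf".
    static Path cpgPathFor(Path shapefilePart) {
        String name = shapefilePart.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot < 0 ? name : name.substring(0, dot);
        return shapefilePart.resolveSibling(base + ".cpg");
    }

    // Writes the label as plain ASCII text and returns the sidecar path.
    // What label to pass in is left to the caller (assumption: readers
    // accept a single short token such as a charset or codepage name).
    static Path write(Path shapefilePart, String codePageLabel) throws IOException {
        Path cpg = cpgPathFor(shapefilePart);
        Files.write(cpg, codePageLabel.getBytes(StandardCharsets.US_ASCII));
        return cpg;
    }
}
```

For SHAPE-ZIP output the same bytes would go into an extra zip entry rather than onto disk, but the naming rule is the same.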

Best Regards



Andreas Oxenstierna

Direct phone 040-16 70 17
Mobile 0734-12 80 17
andreas.oxenstierna@anonymised.com

SWECO Position AB

Hans Michelsensgatan 2
Box 286
201 22 Malmö
Phone 040-16 70 00
www.sweco.se







-----Original message-----
From: Andrea Aime (JIRA) [mailto:jira@anonymised.com]
Sent: 19 November 2008 09:41
To: geoserver-devel@lists.sourceforge.net
Subject: [Geoserver-devel] [jira] Created: (GEOS-2399) Need a way to specify the encoding of shapefiles generated with SHAPE-ZIP output format

Geoserver-devel mailing list
Geoserver-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geoserver-devel

Oxenstierna Andreas wrote:

Great enhancement for all non-A-Z languages.
How will the encoding be stored in the DBF-file?
ESRI has two ways of doing this, either storing LDID in the DBF header or creating a textfile <filename>.cpg which stores the codepage.
See http://support.esri.com/index.cfm?fa=knowledgebase.techarticles.articleShow&d=26015

Hum, not sure we can use any of these... In particular, Java has no
notion of what a codepage is, only knows about Locale and Charset,
both basically go with the standard encoding names such as
ISO-8859-xx or UTF8/16/32 family.

For reading shapefiles with foreign characters we already allow the
user to specify the encoding that way, and for writing we would do the
same, but how to turn a java.nio.Charset into a codepage number is
something I don't know.
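One way this could look is a small hand-maintained lookup table, sketched below. The class is hypothetical; the numbers are Windows code page identifiers as published by Microsoft, and whether shapefile readers accept them in a .cpg file or map them to a DBF LDID is an assumption that is not verified here. The table is deliberately partial, so charsets without an entry yield an empty result instead of a guess.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical, partial mapping from Java canonical charset names
// to Windows code page identifiers.
final class CodePages {

    private static final Map<String, Integer> BY_CHARSET = new HashMap<>();

    static {
        BY_CHARSET.put("windows-1250", 1250);
        BY_CHARSET.put("windows-1251", 1251);
        BY_CHARSET.put("windows-1252", 1252);
        BY_CHARSET.put("ISO-8859-1", 28591);
        BY_CHARSET.put("ISO-8859-2", 28592);
        BY_CHARSET.put("ISO-8859-15", 28605);
        BY_CHARSET.put("UTF-8", 65001);
    }

    // Charset.name() returns the canonical name, so aliases passed to
    // Charset.forName() resolve to the same table key.
    static OptionalInt codePageFor(Charset charset) {
        Integer codePage = BY_CHARSET.get(charset.name());
        return codePage == null ? OptionalInt.empty() : OptionalInt.of(codePage);
    }

    public static void main(String[] args) {
        System.out.println(codePageFor(StandardCharsets.ISO_8859_1)); // OptionalInt[28591]
        System.out.println(codePageFor(StandardCharsets.UTF_16));     // OptionalInt.empty
    }
}
```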

By quickly looking around with Google I've found this
library (http://cpdetector.sourceforge.net/) that does the
opposite, it guesses the encoding based on the file contents, and
it's called Code Page detector, but in fact it does return a
java.nio.Charset.
By looking more I've found this post (http://forums.sun.com/thread.jspa?messageID=10372122) where someone
states that codepage concept is not supported by Java as it's something
Windows specific.

There was some discussion about codepage support in OGR, not sure
how it turned out:
http://article.gmane.org/gmane.comp.gis.gdal.devel/8710

So it seems that to pull this off we'd first need to build a
conversion table from codepages to encodings, provided that is even
possible. Seems like quite a bit of long, boring work...

Cheers
Andrea

PS: more info about code pages here:
http://en.wikipedia.org/wiki/Code_page

--
Andrea Aime
OpenGeo - http://opengeo.org
Expert service straight from the developers.

Yes, character sets should have been unified from the beginning of the computer age…
Codepages are essentially an IBM PC / Windows thing, as described by Wikipedia. Fairly complete lists are available in the external links, e.g. http://msdn.microsoft.com/en-us/library/ms776446.aspx.
For us the important issue is to easily use the shapefiles correctly in different software. To my knowledge, the codepage file (.cpg) is read by all ESRI software and many others as well, and it can easily be updated if needed…

Best Regards



Andreas Oxenstierna

Direct phone 040-16 70 17
Mobile 0734-12 80 17
andreas.oxenstierna@anonymised.com

SWECO Position AB

Hans Michelsensgatan 2
Box 286
201 22 Malmö
Phone 040-16 70 00
www.sweco.se







-----Original message-----
From: Andrea Aime [mailto:aaime@anonymised.com]
Sent: 19 November 2008 10:14
To: Oxenstierna Andreas
Cc: Andrea Aime (JIRA); geoserver-devel@lists.sourceforge.net
Subject: Re: [Geoserver-devel] [jira] Created: (GEOS-2399) Need a way to specify the encoding of shapefiles generated with SHAPE-ZIP output format


Oxenstierna Andreas wrote:

Yes, character sets should have been unified from the beginning of the computer age...
Codepages are essentially an IBM PC / Windows thing, as described by Wikipedia. Fairly complete lists are available in the external links, e.g. http://msdn.microsoft.com/en-us/library/ms776446.aspx.
For us the important issue is to easily use the shapefiles correctly in different software. To my knowledge, the codepage file (.cpg) is read by all ESRI software and many others as well, and it can easily be updated if needed...

Ok... this still does not tell me how to match a standard charset name
with a codepage (by the looks of it there is no 1-1 match, you can
just try to make a partial guess).

From where I stand, one possible way to solve this would be to either
add another parameter (codepage) that the shapefile encoder can use,
or set up some kind of map going from charset to codepage.

I also see nasty issues in the very act of encoding: we may end
up encoding some chars in a way that does not respect the codepage
reality, thus making other software have a hard time properly decoding
what we wrote. For example, even taking the most common codepage in
western Europe, cp1252, there are mismatches between it and ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252
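To make the mismatch concrete, here is a small sketch using the euro sign. The class name is hypothetical, and it assumes the JDK in use provides a charset under the name windows-1252, which is common but not among the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of one cp1252 / ISO-8859-1 mismatch: the euro sign.
final class Cp1252VsLatin1 {

    // True if the charset has a byte sequence for the given character.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    // Encodes with one charset and decodes with another, as happens when
    // the writer and the reader disagree about the file's encoding.
    static String roundTrip(String text, Charset writer, Charset reader) {
        return new String(text.getBytes(writer), reader);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed available
        char euro = '\u20AC';

        System.out.println(canEncode(cp1252, euro));                      // true
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, euro)); // false

        // cp1252 stores the euro as byte 0x80; read back as ISO-8859-1
        // that byte becomes the control character U+0080.
        String misread = roundTrip("\u20AC", cp1252, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals("\u0080"));                     // true
    }
}
```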

To properly support codepage encodings we'd have to turn all those
tables into software and have encode/decode driven by those tables.
The encode/decode loops should not be that difficult; the hard part
is filling all the codepage <-> UTF-16 tables needed to do the
job (and finding their contents in the first place; this
http://www.microsoft.com/globaldev/reference/cphome.mspx
Microsoft help page seems to list some, though it's not clear if
those U+00xy codes are really Unicode or what).
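A toy sketch of what such table-driven loops might look like for a single-byte codepage. Everything here is hypothetical: the class name, the '?' replacement for unmappable characters, and the demo table, which is an identity mapping with a single override and not a real codepage. It may also be worth checking first which charsets the JDK already ships (for instance under names like windows-1252), since those could cover part of the work.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-byte codec driven by a 256-entry byte -> char table.
final class TableCodec {

    private final char[] decodeTable;                                // index: byte value 0..255
    private final Map<Character, Byte> encodeTable = new HashMap<>(); // reverse lookup

    TableCodec(char[] decodeTable) {
        if (decodeTable.length != 256) {
            throw new IllegalArgumentException("table must have 256 entries");
        }
        this.decodeTable = decodeTable.clone();
        for (int i = 0; i < 256; i++) {
            // If two bytes map to the same char, the lower byte value wins.
            encodeTable.putIfAbsent(this.decodeTable[i], (byte) i);
        }
    }

    String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append(decodeTable[b & 0xFF]); // mask to get an unsigned index
        }
        return sb.toString();
    }

    byte[] encode(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Byte b = encodeTable.get(text.charAt(i));
            out[i] = (b != null) ? b : (byte) '?'; // assumption: '?' for unmappable chars
        }
        return out;
    }

    // Demo table only: identity for 0..255, except byte 0x80 -> euro sign.
    static TableCodec toyTable() {
        char[] table = new char[256];
        for (int i = 0; i < 256; i++) {
            table[i] = (char) i;
        }
        table[0x80] = '\u20AC';
        return new TableCodec(table);
    }
}
```

The loops are indeed the easy part; the sketch leaves open exactly the question raised above, namely where trustworthy table contents come from.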

Cheers
Andrea

--
Andrea Aime
OpenGeo - http://opengeo.org
Expert service straight from the developers.