[Geoserver-devel] WFS 1.1, citewfs config, and UTF8

Hi,
trying to run the WFS 1.1 test suite I noticed that the
feature type "EntiteGenerique" wasn't loaded properly
(the name has in fact three e with an acute accent, it's French).

Investigation showed that configuration reading wasn't working
properly because on Windows the default charset is not UTF8,
while the config files are encoded in UTF8.

To solve this problem, I had to force the UTF8 charset in the
reader: instead of simply doing a new FileReader(file) I did
a new InputStreamReader(new FileInputStream(infoFile), Charset.forName("utf8"));

If this cure is ok, we should also force UTF8 when writing.
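
A minimal sketch of what forcing UTF8 on both the read and the write side could look like. The class name Utf8ConfigIo and its two methods are hypothetical and are not taken from the GeoServer code base; only the InputStreamReader/FileInputStream pairing comes from the fix described above.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads and writes config files with an explicit
// UTF-8 charset instead of relying on the platform default.
public class Utf8ConfigIo {

    // Reads the whole file, decoding its bytes as UTF-8.
    public static String read(File file) throws IOException {
        try (Reader reader = new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    // Writes the content, encoding it as UTF-8 (replaces any existing file).
    public static void write(File file, String content) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(content);
        }
    }
}
```

With both sides pinned to the same charset, a name with accented characters should round-trip regardless of the platform default.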

Yet, I'm wondering, may this cause problems with old data dirs?
We could eventually try to use http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
to try and guess the encoding, or at least determine if the
file is encoded in utf8 or utf16, and fall back on the system
default otherwise.
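
To make the fallback idea concrete, here is a rough sketch using only the JDK. It is not the GuessEncoding code from the link above, and the class name CharsetGuess is made up for illustration. It checks for a byte order mark, then tries a strict UTF-8 decode, and otherwise returns whatever fallback the caller passes in (for instance Charset.defaultCharset()).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the GuessEncoding library.
public class CharsetGuess {

    public static Charset guess(byte[] data, Charset fallback) {
        // UTF-8 byte order mark: EF BB BF.
        // Note: Java's UTF-8 decoder does not strip it, the caller has to.
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // UTF-16 byte order marks: FE FF (big endian) or FF FE (little endian).
        // The generic "UTF-16" charset reads the mark to pick the byte order.
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return StandardCharsets.UTF_16;
            }
        }
        // Strict decode: any byte sequence that is not valid UTF-8 throws.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }
}
```

Pure ASCII content passes the strict check and is reported as UTF-8, which is harmless as long as the fallback charset is ASCII-compatible.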

Thoughts?

Cheers
Andrea

Andrea Aime wrote:

Hi,
trying to run the WFS 1.1 test suite I noticed that the
feature type "EntiteGenerique" wasn't loaded properly
(the name has in fact three e with an acute accent, it's French).

Investigation showed that configuration reading wasn't working
properly because on Windows the default charset is not UTF8,
while the config files are encoded in UTF8.

To solve this problem, I had to force the UTF8 charset in the
reader: instead of simply doing a new FileReader(file) I did
a new InputStreamReader(new FileInputStream(infoFile),
Charset.forName("utf8"));

I can't think of any issues with this. So the extended characters in the
file name are not an issue either?

If this cure is ok, we should also force UTF8 when writing.

Yet, I'm wondering, may this cause problems with old data dirs?

I guess if perhaps someone tries a new version of geoserver and then
reverts to an older one? Not sure...

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Geoserver-devel mailing list
Geoserver-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geoserver-devel

--
Justin Deoliveira
The Open Planning Project
jdeolive@anonymised.com

Justin Deoliveira wrote:

Yet, I'm wondering, may this cause problems with old data dirs?

I guess if perhaps someone tries a new version of geoserver and then
reverts to an older one? Not sure...

No no, I mean: what happens if we force the UTF8 charset in the
InputStreamReader and the files are not in UTF8?
Cheers
Andrea
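
For reference, a small sketch of what the JDK does in that situation. InputStreamReader does not fail on bytes that are not valid UTF-8; it substitutes the replacement character U+FFFD, so the accented letters are silently lost. The class name ForcedUtf8Demo is hypothetical, and the accented spelling of the feature type name used in the usage note is an assumption based on the description in the first message.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes arbitrary bytes through an
// InputStreamReader that is forced to UTF-8.
public class ForcedUtf8Demo {

    public static String readAsUtf8(byte[] bytes) throws IOException {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }
}
```

Feeding it the ISO-8859-1 bytes of "EntitéGénérique" gives back a string in which each é has been replaced by U+FFFD, with no exception raised. So an old data dir written in the platform charset would load, but with damaged names.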

Justin Deoliveira wrote:

Yet, I'm wondering, may this cause problems with old data dirs?

I guess if perhaps someone tries a new version of geoserver and then
reverts to an older one? Not sure...

We may eventually try to use http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
to try and guess the encoding, or at least determine if the
file is encoded in utf8 or utf16, and fallback on the system
default otherwise.

I noticed FilePublisher assumes files are encoded in UTF8 as well...
hum... IMHO we had better use this class to guess the
encoding instead of assuming stuff:
http://glaforge.free.fr/projects/guessencoding/html/com/glaforge/i18n/io/CharsetToolkit.java.html

Unfortunately here I cannot see any licensing information:
http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

So, can one assume it's public domain?
Cheers
Andrea

Andrea Aime wrote:

I noticed FilePublisher assumes files encoded in UTF8 as well...
hum... Imho we should better use this class to guess the
encoding instead of assuming stuff:
http://glaforge.free.fr/projects/guessencoding/html/com/glaforge/i18n/io/CharsetToolkit.java.html

Unfortunately here I cannot see any licensing information:
http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

Ok, I contacted Guillaume and he told me we can use his code as
if it had an Apache license, that is, integrate it and keep
all the headers intact. This should be ok, no?

Cheers
Andrea

You know, I think we already have some code in GeoServer that guesses
character sets. I think it's used for incoming XML streams. See
http://jira.codehaus.org/browse/GEOS-323. Charset issues are confusing,
so I'm not sure whether it's actually the same issue here, but if they
are similar we should align the code.

Chris

--
Chris Holmes
The Open Planning Project
http://topp.openplans.org

Chris Holmes wrote:

You know, I think we already have some code in GeoServer that guesses
character sets. I think it's used for incoming XML streams. See
http://jira.codehaus.org/browse/GEOS-323. Charset issues are confusing,
so I'm not sure whether it's actually the same issue here, but if they
are similar we should align the code.

Hum, yeah, the charset detection code seems to pick up even more
encodings than the one I cited. It just needs to be extracted from that
class, since depending on the usage I may not want to create a reader
(in FilePublisher we won't: we serve a binary stream directly, but we
have to pick up the encoding and declare it in the HTTP header).
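
As an illustration of detection without a reader, here is a sketch that sniffs only the first bytes of a stream and builds a Content-Type value from the result. It is not the existing GeoServer detection code; the class EncodingSniffer, its method names and the UTF-8-or-fallback rule are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: inspects a prefix of the stream, never wraps it in a Reader.
public class EncodingSniffer {

    // Reads up to 'limit' bytes and reports UTF-8 if they are a valid
    // UTF-8 prefix, otherwise the supplied fallback charset.
    public static Charset sniff(InputStream in, int limit, Charset fallback)
            throws IOException {
        byte[] head = new byte[limit];
        int len = 0;
        int n;
        while (len < limit && (n = in.read(head, len, limit - len)) != -1) {
            len += n;
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // endOfInput=false: a multi-byte character cut off by the limit
        // counts as "need more input", not as an error.
        CoderResult result = decoder.decode(
                ByteBuffer.wrap(head, 0, len),
                CharBuffer.allocate(Math.max(len, 1)),
                false);
        return result.isError() ? fallback : StandardCharsets.UTF_8;
    }

    // Builds the value of the HTTP Content-Type header.
    public static String contentType(String mimeType, Charset charset) {
        return mimeType + "; charset=" + charset.name();
    }
}
```

The sniffed bytes are consumed, so the caller would have to reopen the file (or use mark/reset on a buffered stream) before serving the binary content.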

Cheers
Andrea