[Geoserver-devel] [jira] Created: (GEOS-258) /wfs/GetFeature works not too well with i18n data

Message:

  A new issue has been created in JIRA.

---------------------------------------------------------------------
View the issue:
  http://jira.codehaus.org/browse/GEOS-258

Here is an overview of the issue:
---------------------------------------------------------------------
        Key: GEOS-258
    Summary: /wfs/GetFeature works not too well with i18n data
       Type: Bug

     Status: Open
   Priority: Minor

Original Estimate: Unknown
Time Spent: Unknown
  Remaining: Unknown

    Project: GeoServer
Components:
             WFS
   Versions:
             1.2.3

   Assignee: Chris Holmes
   Reporter: Artie Konin

    Created: Thu, 9 Dec 2004 10:22 AM
    Updated: Thu, 9 Dec 2004 10:22 AM

Description:
I encountered this problem after moving both GeoServer and my web client to
a Linux machine with the US locale as its default. After Chris fixed
WfsDispatcher, all GetFeature requests that go directly to it work fine. But
I was advised to use URLs specific to each WFS request type and found this to
be a good idea :) Then, by chance, I found an interesting thing.
If the same XML is sent to
  http://somehost:8080/geoserver/wfs
and
  http://somehost:8080/geoserver/wfs/GetFeature
the results are not necessarily the same.

  When I send a query that filters features with PropertyIsLike or any other
string-based filter, and pass a Cyrillic string as the pattern to "/wfs",
I receive the expected results. The same request results in an empty
featureCollection if sent to "/wfs/GetFeature".
  I was not fond of that behavior and tried to see why it happens.
The following code in org.vfny.geoserver.servlets.AbstractService is of
particular interest. It is located in

doPost(HttpServletRequest, HttpServletResponse, Reader) - lines 286-289:

      XmlRequestReader requestReader = getXmlRequestReader();
      Reader xml = (requestXml != null) ? requestXml : request.getReader();
      serviceRequest = requestReader.read(xml, request);

If I understand it correctly, the Reader object that is passed to
requestReader.read() is either the one that comes from WfsDispatcher
(requestXml) or the one obtained directly from the request.
When "requestXml" is passed from WfsDispatcher, it already has the
right character encoding set and the filter operates on a correctly decoded
pattern string. But I believe that the second reader is not aware of the
incoming data's encoding and uses a default one.
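
To illustrate the suspected effect, here is a minimal standalone sketch
(not GeoServer code). The class name ReaderCharsetDemo and the sample
pattern are made up for illustration, and ISO-8859-1 merely stands in for
whatever default encoding the second reader ends up using.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of GeoServer.
class ReaderCharsetDemo {

    /** Reads all bytes through a Reader that decodes with the given charset. */
    public static String decode(byte[] body, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(body), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A Cyrillic filter pattern, as the client would send it in UTF-8.
        String pattern = "\u041c\u043e\u0441\u043a\u0432\u0430*";
        byte[] body = pattern.getBytes(StandardCharsets.UTF_8);

        // Reader that knows the real encoding: the pattern survives.
        System.out.println(pattern.equals(decode(body, StandardCharsets.UTF_8)));
        // Reader that assumes a different default: the pattern is mangled.
        System.out.println(pattern.equals(decode(body, StandardCharsets.ISO_8859_1)));
    }
}
```

A filter built from the second string can no longer match the stored
Cyrillic values, which would be consistent with the empty result.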

   Further, in org.vfny.geoserver.requests.readers.wfs.GetFeatureXmlReader
(read() method, line 52) the aforementioned reader is used to construct an
InputSource for a SAX parser:

      // translate string into a proper SAX input source
      InputSource requestSource = new InputSource(reader);

   Below I quote part of the InputSource class description (from its
javadocs) for convenience:

"The SAX parser will use the InputSource object to determine how to
read XML input. If there is a character stream available, the parser
will read that stream directly, disregarding any text encoding
declaration found in that stream. If there is no character stream,
but there is a byte stream, the parser will use that byte stream,
using the encoding specified in the InputSource or else (if no encoding
is specified) autodetecting the character encoding using an algorithm
such as the one in the XML specification."

   It is clear that in our case we are feeding a character stream to the
SAX parser and it will just use it without any charset conversion.
And since the reader uses the system default encoding (when the request
arrives at "/wfs/GetFeature"), textual data will surely be mangled if
its encoding differs from the default one.
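
The difference between the two kinds of InputSource can be reproduced
outside GeoServer. The sketch below is only an illustration under stated
assumptions: the class InputSourceDemo, its method names and the tiny
<Literal> document are invented here, and ISO-8859-1 again stands in for
a wrong default encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of GeoServer.
class InputSourceDemo {

    /** Parses the source with SAX and returns all character data it reports. */
    public static String textOf(InputSource source) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    /** A UTF-8 request body whose XML declaration names its encoding. */
    public static byte[] sampleRequest() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Literal>\u041c\u043e\u0441\u043a\u0432\u0430*</Literal>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    /** Byte stream: the parser works out the encoding on its own. */
    public static String viaByteStream(byte[] body) {
        return textOf(new InputSource(new ByteArrayInputStream(body)));
    }

    /** Character stream decoded with the wrong charset: the declaration is ignored. */
    public static String viaMisdecodedReader(byte[] body) {
        return textOf(new InputSource(
                new InputStreamReader(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1)));
    }

    public static void main(String[] args) {
        byte[] body = sampleRequest();
        String expected = "\u041c\u043e\u0441\u043a\u0432\u0430*";
        System.out.println("byte stream ok:      " + expected.equals(viaByteStream(body)));
        System.out.println("character stream ok: " + expected.equals(viaMisdecodedReader(body)));
    }
}
```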

   I see two possible ways to solve the problem.

   1. The first one is simple and can be implemented quickly, but the
      tradeoff is that it is also very unreliable.

   In AbstractService we can replace "request.getReader()" with something
like "new InputStreamReader(request.getInputStream(), charSet)". But I
can't see any reliable way to determine the correct value of "charSet".
One can examine the "Content-Type" header of the request, but that is
obviously not a complete solution, as the header may be missing or wrong.
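
For what it is worth, the header-based guess in option 1 could look
roughly like the sketch below. The class ContentTypeCharset and the method
fromContentType are hypothetical names, the parsing is deliberately naive,
and the caller still has to choose a fallback, which is exactly the
unreliable part. In a servlet the header value would come from
request.getContentType().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of GeoServer.
class ContentTypeCharset {

    /**
     * Extracts the charset parameter from a Content-Type header value,
     * or returns the fallback if the header is missing or unusable.
     */
    public static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: treat as missing.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(fromContentType("text/xml; charset=UTF-8", fallback));
        System.out.println(fromContentType("text/xml", fallback));
    }
}
```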

   2. I think a more robust way is to use byte streams and let the SAX
      parser determine the encoding by its own means from the XML declaration.

   I briefly looked at the part of the XML specification that deals with
charset autodetection (mentioned in the InputSource javadocs) and it looks
like it should suffice for most cases.
   But replacing Readers with Streams is definitely not an easy task,
so I would not risk doing it myself :) It would involve changing
lots of method contracts, and I don't know which parts of GeoServer would
be affected by these changes.
   And I came to another thought after reading the above spec. It is
obvious that the method used to extract the character encoding from the
XML declaration in WfsDispatcher is somewhat flawed. It will work
in most cases when the incoming XML is in UTF-8, Latin-1, Windows-1251
or another basic encoding, but it will fail when the XML is in "true"
Unicode (UCS-4, UTF-16 etc.), as detection of such cases requires a
far more sophisticated algorithm. Using byte streams will eliminate
the need to detect the encoding in GeoServer itself, as this will
be nicely performed by the SAX parser. I haven't tested that, though,
but I don't see why it should not work.
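
To give an idea of what that autodetection involves, here is a simplified
sketch of the first step described in the XML specification's appendix on
encoding detection: looking at up to four leading bytes. It is an
illustration only; XmlEncodingSniffer, sniff() and the returned labels are
made-up names, the unusual UCS-4 byte orders and EBCDIC are left out, and a
real parser goes on to read the encoding declaration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not WfsDispatcher code and not a complete detector.
class XmlEncodingSniffer {

    /** Label meaning: ASCII-compatible bytes, read the XML declaration to decide. */
    public static final String DECLARED = "see-declaration";

    /** Guesses the encoding family from the first bytes of an XML document. */
    public static String sniff(byte[] b) {
        int b0 = at(b, 0), b1 = at(b, 1), b2 = at(b, 2), b3 = at(b, 3);

        // With a byte order mark. Four-byte patterns are checked first
        // because FF FE 00 00 would otherwise look like a UTF-16LE BOM.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UCS-4BE";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UCS-4LE";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";

        // Without a byte order mark: look for "<" or "<?" in each layout.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0x00 && b3 == 0x3C) return "UCS-4BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x00 && b3 == 0x00) return "UCS-4LE";
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D) return DECLARED; // "<?xm"

        // No BOM and no declaration: XML defaults to UTF-8.
        return "UTF-8";
    }

    /** Returns the unsigned byte at index i, or -1 if the array is shorter. */
    private static int at(byte[] b, int i) {
        return i < b.length ? (b[i] & 0xFF) : -1;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><a/>";
        System.out.println(sniff(xml.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(sniff(xml.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Searching the raw bytes for an ASCII "encoding=" attribute covers only the
DECLARED case, which matches the limitation described above.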

   Well, I hope I didn't miss something important. And I am not
completely sure even about this thing as a whole :) It would be good
to hear any thoughts on this matter.

---------------------------------------------------------------------
JIRA INFORMATION:
This message is automatically generated by JIRA.

If you think it was sent incorrectly contact one of the administrators:
   http://jira.codehaus.org/secure/Administrators.jspa

If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira