[Geoserver-devel] [jira] Created: (GEOS-254) WFS Dispatcher mangles character encoding

Message:

  A new issue has been created in JIRA.

---------------------------------------------------------------------
View the issue:
  http://jira.codehaus.org/browse/GEOS-254

Here is an overview of the issue:
---------------------------------------------------------------------
        Key: GEOS-254
    Summary: WFS Dispatcher mangles character encoding
       Type: Bug

     Status: Open
   Priority: Major

Original Estimate: Unknown
Time Spent: Unknown
  Remaining: Unknown

    Project: GeoServer
Components:
             WFS
   Fix Fors:
             1.2.4
   Versions:
             1.2.3

   Assignee: Chris Holmes
   Reporter: Chris Holmes

    Created: Sat, 20 Nov 2004 3:26 PM
    Updated: Sat, 20 Nov 2004 3:26 PM

Description:
From Artie Konin. His bug fix should be rolled in, and we should look into his suggestions:

I was testing the WFS functionality of GeoServer 1.2.1 and found that
the `like` and `equals` filters stop working when I post any
windows-1251 encoded characters in the pattern.
Needless to say, I set the right encoding both in the XML declaration
and in the request's `Content-Type` header.

After spending some time trying to get the GeoServer logging system
to differentiate between individual loggers, I realized that this was
too complicated a matter for me and started putting good old
System.out.println() calls in any suspicious part of the source code.
  Eventually I found the offending section in WfsDispatcher's
doPost() method.
  It uses the FileWriter and FileReader classes to save the incoming
XML into a temporary file and then read it back. Examining the Java API
specification, I learned that these two classes are suitable when
"the default character encoding and the default byte-buffer size
are acceptable". In my case those assumptions about the default character
encoding led to sad results :) It looks like my non-US-ASCII
XML content was mangled during either file writing or reading,
or maybe even both :) Strangely, it behaved the same way both
on Linux and on my Windows machine, though on the latter the default
encoding is "windows-1251".
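
To illustrate the kind of mangling described above, here is a minimal
sketch. It is not GeoServer code: the class and method names are
hypothetical, in-memory buffers stand in for the temporary file, and
US-ASCII stands in for an unsuitable platform default charset. It
contrasts writing text through a charset-converting Writer with copying
the raw bytes and decoding them once with the request's encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {

    // Mimics FileWriter followed by FileReader: the text is encoded and
    // decoded with one "file" charset, which plays the platform default.
    static String roundTripThroughWriter(String text, Charset fileCharset) {
        try {
            ByteArrayOutputStream file = new ByteArrayOutputStream();
            Writer out = new OutputStreamWriter(file, fileCharset);
            out.write(text); // unmappable characters are replaced here
            out.close();

            Reader in = new InputStreamReader(
                    new ByteArrayInputStream(file.toByteArray()), fileCharset);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
            in.close();
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mimics the stream-based approach: bytes are copied untouched and
    // decoded exactly once, with the encoding the request declared.
    static String roundTripThroughBytes(String text, Charset requestCharset) {
        byte[] onTheWire = text.getBytes(requestCharset);
        byte[] tempFile = onTheWire.clone(); // byte-for-byte copy
        return new String(tempFile, requestCharset);
    }

    public static void main(String[] args) {
        // Cyrillic sample pattern, written with escapes so the source
        // file itself stays plain ASCII.
        String pattern = "\u041c\u043e\u0441\u043a\u0432\u0430";
        System.out.println(roundTripThroughWriter(
                pattern, StandardCharsets.US_ASCII).equals(pattern));
        System.out.println(roundTripThroughBytes(
                pattern, StandardCharsets.UTF_8).equals(pattern));
    }
}
```

The Writer path loses the Cyrillic characters as soon as the file
charset cannot represent them, while the byte path never interprets the
data until the correct encoding is known.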
  So I returned to the Java documentation and was lucky enough to
circumvent the problem. The changes I made in doPost()'s code are
below:

Was:

            BufferedReader tempReader = new BufferedReader(request.getReader());

            // REVISIT: Should do more than sequence here
            // (In case we are running two GeoServers at once)
            // - Could we use response.getHandle() in the filename?
            // - ProcessID is traditional, I don't know how to find that in Java
            sequence++;
            temp = File.createTempFile("wfsdispatch" + sequence, "tmp");

            FileWriter out = new FileWriter(temp);
            int c;

            while ((c = tempReader.read()) != -1) {
                out.write(c);
            }

            tempReader.close();
            out.close();

            BufferedReader disReader = new BufferedReader(new FileReader(temp));
            BufferedReader requestReader = new BufferedReader(new FileReader(
                        temp));

Became:

            InputStream is = new BufferedInputStream(request.getInputStream());

            // REVISIT: Should do more than sequence here
            // (In case we are running two GeoServers at once)
            // - Could we use response.getHandle() in the filename?
            // - ProcessID is traditional, I don't know how to find that in Java
            sequence++;
            temp = File.createTempFile("wfsdispatch" + sequence, "tmp");

            BufferedOutputStream out = new BufferedOutputStream(
                                         new FileOutputStream(temp));

            int c;

            while (-1 != (c = is.read())) {
                out.write(c);
            }

            is.close();
            out.flush();
            out.close();

            String req_enc = guessRequestEncoding(request);

            BufferedReader disReader = new BufferedReader(
                                         new InputStreamReader(
                                           new FileInputStream(temp), req_enc));
            BufferedReader requestReader = new BufferedReader(
                                             new InputStreamReader(
                                               new FileInputStream(temp), req_enc));

Where `guessRequestEncoding()` is a convenience method to resolve the
character encoding of the XML markup contained within the incoming
request:

    protected String guessRequestEncoding(HttpServletRequest request) {

        String def_enc = "UTF-8";

        String enc = getXmlEncoding();

        if (null == enc) {

            enc = request.getHeader("Content-Type");

            if (null == enc) {
                enc = def_enc;
            } else {
                if (-1 == enc.indexOf("=")) {
                    enc = def_enc;
                } else {
                    enc = enc.substring(enc.lastIndexOf("=") + 1).trim();
                }
            }
        
        }

        return enc;
    }

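    protected String getXmlEncoding() {

        try {

            // Read only the XML declaration, i.e. up to the first '>'.
            // FileReader decodes with the platform default charset, which
            // is adequate here only while the declaration is plain ASCII
            // in an ASCII-compatible encoding (not UTF-16, for example).
            StringWriter sw = new StringWriter(60);
            BufferedReader in = new BufferedReader(new FileReader(temp));

            int c;
            while ((-1 != (c = in.read())) && ('>' != c)) {
                sw.write(c);
            }
            in.close();

            // The XML specification allows either quote character
            // around the encoding name.
            Pattern p = Pattern.compile("encoding\\s*=\\s*[\"']([^\"']+)[\"']");
            Matcher m = p.matcher(sw.toString());
            if (m.find()) {
                return m.group(1);
            } else {
                return null;
            }
        } catch (IOException e) {
            // Unreadable temp file: let the caller fall back to a default.
            return null;
        }
    }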
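
One weak spot in `guessRequestEncoding()` is that it takes everything
after the last "=" in the `Content-Type` header, so a header with more
than one parameter, or with a quoted charset, would yield the wrong
value. The following is a hedged sketch of a stricter parse. The class
and method names are hypothetical, and splitting on ";" is a
simplification that ignores semicolons inside quoted values. The
servlet API's `request.getCharacterEncoding()` may well make this
unnecessary in practice.

```java
import java.util.Locale;

class ContentTypeCharset {

    // Returns the value of the charset parameter of a Content-Type
    // header, or the fallback when the header or parameter is missing.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself, e.g. "text/xml".
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? fallback : value;
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/xml; charset=windows-1251", "UTF-8"));
        System.out.println(charsetOf("text/xml; charset=\"UTF-16\"; boundary=abc", "UTF-8"));
        System.out.println(charsetOf("text/xml", "UTF-8"));
    }
}
```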

After the above changes everything works perfectly on both Linux and
Windows. I'm almost happy :)

However, I have some notes concerning that code:

1. It looks like changing BufferedReader to BufferedInputStream at the
beginning was not really necessary, as BufferedReader itself doesn't
perform any codepage conversions. But on the other hand, streams are
just more reliable from my point of view :) You get the data exactly
as it comes.

2. Both the `guessRequestEncoding()` and `getXmlEncoding()` functions
look pretty ugly even to me :) But that is the best I can do with my
current level of Java knowledge :/

3. Placing the "Content-Type" header check before reading the XML
declaration should be faster, but I think that the encoding specified in
the declaration is simply more reliable.

4. Is there any less ugly way to extract the encoding info from the
incoming XML data? `getXmlEncoding()` looks like a pregnant mammoth :)
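
On note 4, one possible answer is to let an XML parser read the
declaration instead of using a regular expression. The sketch below
uses the StAX API (`javax.xml.stream`), which only became part of the
JDK with Java 6 and so was not available to the GeoServer 1.2.x code
discussed here. The class and method names are hypothetical. Because
the parser is given raw bytes, it can also cope with encodings such as
UTF-16, where reading the prolog as ASCII text would fail.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XmlDeclaredEncoding {

    // Returns the encoding named in the XML declaration, or the
    // fallback when none is declared or the prolog cannot be parsed.
    static String declaredEncoding(InputStream xml, String fallback) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Only the declaration is of interest, so DTD processing is off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            try {
                String declared = reader.getCharacterEncodingScheme();
                return declared != null ? declared : fallback;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><q/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(new ByteArrayInputStream(doc), "UTF-8"));
    }
}
```

An alternative design is to skip the detection step entirely and hand
the byte stream of the temporary file straight to the XML parser, since
parsers are required to work out the encoding from the declaration
themselves.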
