[Geoserver-devel] WFS and non-Latin-1 input - part 2

Greetings again, all.

Sorry for my previous incomplete posting :confused: It seems that pressing
F2 to save program code turns out to be a very bad habit when
editing mail messages in TheBat! :slight_smile:

So I'll continue.

  After spending some time trying to get the GeoServer logging system
to differentiate between individual loggers, I realized that this was too
complicated for me and fell back to good old
System.out.println() calls in every suspicious part of the source code.
  Eventually I found the offending section in WfsDispatcher's
doPost() method.
  It uses the FileWriter and FileReader classes to save the incoming
XML into a temporary file and then read it back. Examining the Java API
specification, I learned that these two classes are suitable when
"the default character encoding and the default byte-buffer size
are acceptable". In my case those assumptions about the default character
encoding led to sad results :slight_smile: It looks like my non-US-ASCII
XML content was mangled during file writing, file reading,
or maybe even both :slight_smile: Strangely, this behaved the same way
on Linux and on my Windows machine, even though on the latter the default
encoding is "windows-1251".
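
A minimal sketch of how such mangling can happen, not taken from the GeoServer source. It assumes the request body is UTF-8, that it gets decoded to characters as ISO-8859-1 (the servlet default when no charset is declared), and that it is then re-encoded with a platform default that cannot represent those characters (US-ASCII is used here for illustration). The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    // Decode bytes to chars with one charset, then re-encode with another.
    // This is what a Reader -> FileWriter copy does implicitly.
    static byte[] copyViaChars(byte[] input, Charset readAs, Charset writeAs) {
        String decoded = new String(input, readAs);
        return decoded.getBytes(writeAs);
    }

    // Copy the bytes untouched, as an InputStream -> FileOutputStream copy does.
    static byte[] copyViaBytes(byte[] input) {
        return input.clone();
    }

    public static void main(String[] args) {
        // Cyrillic text written with \\u escapes so the source file encoding does not matter.
        byte[] original = "<name>\u041f\u0440\u0438\u0432\u0435\u0442</name>"
                .getBytes(StandardCharsets.UTF_8);

        byte[] viaChars = copyViaChars(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII);
        byte[] viaBytes = copyViaBytes(original);

        // The char-based copy has lost the non-ASCII bytes; the byte copy has not.
        System.out.println(new String(viaChars, StandardCharsets.UTF_8));
        System.out.println(new String(viaBytes, StandardCharsets.UTF_8));
    }
}
```

The byte-for-byte copy defers the decoding decision until the real encoding is known, which is the idea behind the change described next.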
  So I returned to the Java documentation and was lucky enough to
work around the problem. The changes I made to the doPost() code are
below:

Was:

            BufferedReader tempReader = new BufferedReader(request.getReader());

            // REVISIT: Should do more than sequence here
            // (In case we are running two GeoServers at once)
            // - Could we use response.getHandle() in the filename?
            // - ProcessID is traditional, I don't know how to find that in Java
            sequence++;
            temp = File.createTempFile("wfsdispatch" + sequence, "tmp");

            FileWriter out = new FileWriter(temp);
            int c;

            while ((c = tempReader.read()) != -1) {
                out.write(c);
            }

            tempReader.close();
            out.close();

            BufferedReader disReader = new BufferedReader(new FileReader(temp));
            BufferedReader requestReader = new BufferedReader(new FileReader(
                        temp));

Became:

            InputStream is = new BufferedInputStream(request.getInputStream());

            // REVISIT: Should do more than sequence here
            // (In case we are running two GeoServers at once)
            // - Could we use response.getHandle() in the filename?
            // - ProcessID is traditional, I don't know how to find that in Java
            sequence++;
            temp = File.createTempFile("wfsdispatch" + sequence, "tmp");

            BufferedOutputStream out = new BufferedOutputStream(
                                         new FileOutputStream(temp));

            int c;

            while (-1 != (c = is.read())) {
                out.write(c);
            }

            is.close();
            out.flush();
            out.close();

            String req_enc = guessRequestEncoding(request);

            BufferedReader disReader = new BufferedReader(
                                         new InputStreamReader(
                                           new FileInputStream(temp), req_enc));
            BufferedReader requestReader = new BufferedReader(
                                             new InputStreamReader(
                                               new FileInputStream(temp), req_enc));

Here `guessRequestEncoding()` is a convenience method that resolves the
character encoding of the XML markup contained in the incoming
request:

    protected String guessRequestEncoding(HttpServletRequest request) {

        String def_enc = "UTF-8";

        String enc = getXmlEncoding();

        if (null == enc) {

            enc = request.getHeader("Content-Type");

            if (null == enc) {
                enc = def_enc;
            } else {
                if (-1 == enc.indexOf("=")) {
                    enc = def_enc;
                } else {
                    enc = enc.substring(enc.lastIndexOf("=") + 1).trim();
                }
            }
        
        }

        return enc;
    }

    protected String getXmlEncoding() {

        try {
            
            StringWriter sw = new StringWriter(60);
            BufferedReader in = new BufferedReader(new FileReader(temp));
            
            int c;
            while ((-1 != (c = in.read())) && (0x3E != c)) {
                sw.write(c);
            }
            in.close();
           
            Pattern p = Pattern.compile("encoding\\s*\\=\\s*\"([^\"]+)\"");
            Matcher m = p.matcher(sw.toString());
            if (m.find()) {
                return m.toMatchResult().group(1);
            } else {
                return null;
            }
        } catch (IOException e) {
            return null;
        }
    }
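
The Content-Type fallback above takes whatever follows the last "=" in the header, so a header such as `text/xml; charset=UTF-8; boundary=x` or a quoted value would give a wrong name. A slightly more defensive parse might look like the sketch below. The class and method names are hypothetical, and splitting on ";" is a simplification that does not handle quoted values containing semicolons. In a servlet, `request.getCharacterEncoding()` may already cover this case.

```java
import java.util.Locale;

public class ContentTypeCharset {

    // Returns the value of the charset parameter of a Content-Type header,
    // or null if the header is missing or has no usable charset parameter.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? null : value;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/xml; charset=\"windows-1251\"; boundary=x"));
    }
}
```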

After the above changes everything works perfectly on both Linux and
Windows. I'm almost happy :slight_smile:

However, I have a few notes on that code:

1. It looks like changing BufferedReader to BufferedInputStream at the
beginning was not really necessary, as BufferedReader itself doesn't
perform any codepage conversions. But on the other hand, streams are
just more reliable from my point of view :slight_smile: You get the data exactly
as it comes.

2. Both the `guessRequestEncoding()` and `getXmlEncoding()` functions
look pretty ugly even to me :slight_smile: But that is the best I can do with my
current level of Java knowledge :confused:

3. Checking the "Content-Type" header before reading the XML
declaration should be faster, but I think the encoding specified in
the declaration is simply more appropriate.

4. Is there a less ugly way to extract the encoding info from the
incoming XML data? `getXmlEncoding()` looks like a pregnant mammoth :slight_smile:
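
On note 4, one possible simplification, offered only as a sketch: decode the first bytes of the saved request as ISO-8859-1 (which maps every byte to a character, so it cannot fail) and apply a single regular expression that accepts both quote styles allowed in an XML declaration. The class name, method name, and the 200-byte limit are assumptions for illustration. UTF-16 input is not handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlDeclEncoding {

    // Matches encoding="..." or encoding='...' inside an <?xml ... ?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml\\s[^>]*?encoding\\s*=\\s*(?:\"([^\"]+)\"|'([^']+)')");

    // Returns the encoding named in the XML declaration found in the first
    // bytes of the document, or null if there is none.
    static String declaredEncoding(byte[] head) {
        int length = Math.min(head.length, 200);
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the two groups is set, depending on the quote style.
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        byte[] head = "<?xml version=\"1.0\" encoding=\"windows-1251\"?><a/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(head));
    }
}
```

A caller could try this first and fall back to the Content-Type header and then to UTF-8, keeping the order of preference from note 3.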

-- Best regards,
Artie Konin mailto:a-thor@anonymised.com

Awesome, thanks for this. I will dig into the code and see what I can
come up with; I may be able to simplify things a bit with some Java
convenience methods. Apologies for the bug: WfsDispatcher was kind of
hacked together to replicate the preferred CGI style of WFS requests. To
avoid the problem you could also issue the request to
http://localhost:8080/geoserver/wfs/GetFeature/ instead of
http://localhost:8080/geoserver/wfs/
To see the locations of all operations, see the capabilities document at
http://localhost:8080/geoserver/wfs/GetCapabilities
Note that it is completely fine by the WFS specification to have the
operations at different locations. It's just that all the examples use
the CGI style. But we pass the CITE tests just fine with the way we do
it. We offered the wfs/ location since many clients naively assume that
it works that way, without actually parsing the Capabilities document.
I had thought we were at the point where it worked just as well, but I
guess this proves we aren't. I'll get the fix into 1.2.4.

best regards,

Chris

On Thu, 18 Nov 2004, Artie Konin wrote:
