[Geoserver-devel] [jira] Created: (GEOS-323) I18n patch for XML requests handling

I18n patch for XML requests handling
------------------------------------

         Key: GEOS-323
         URL: http://jira.codehaus.org/browse/GEOS-323
     Project: GeoServer
        Type: Improvement
  Components: WFS
    Versions: 1.2.4
Environment: Tested it (not too well) with Tomcat 5.5.7 on jsdk 1.5.0_01
    Reporter: Artie Konin
Assigned to: Chris Holmes
    Priority: Minor
     Fix For: 1.3.0
Attachments: i18n_patch.zip

This is basically a small patch that allows GeoServer to detect the charset used
in incoming XML requests (POST only, of course). I modified the files from the
GEOS-258 attachment for GeoServer's needs (the work is based on a trunk snapshot
made a week or so ago). In addition to what was sent to Chris, this version also
contains a change in WfsDispatcher, mostly to make charset detection more uniform
across the code (this also allows handling of multibyte encoding schemes, which
was absent in the previous version).

Changes in GeoServer code are limited to the following two files:

org.vfny.geoserver.servlets.AbstractService.java

org.vfny.geoserver.wfs.servlets.WfsDispatcher.java

These files, with all modifications, are included in the attached archive, but I
think it would be good to summarize the changes here, so they can be applied by
hand to possibly newer versions of these files. Of course, using a diff tool may
prove to be a better solution :)

org.vfny.geoserver.servlets.AbstractService.java

1. Added a few new imports:

      import java.io.BufferedReader;

      import org.vfny.geoserver.util.requests.XmlCharsetDetector;

2. Changed `doPost` method:

  a) Commented out the line

      Reader xml = (requestXml != null) ? requestXml : request.getReader();

  b) Added the following lines instead:

      Reader xml;
      if (null != requestXml) {
          xml = requestXml;
      } else {
          /*
           * `getCharsetAwareReader` returns a reader which does not support
           * mark/reset, so it is a good idea to wrap it in a BufferedReader.
           * That way the debug output below will still work.
           */
          xml = new BufferedReader(
                    XmlCharsetDetector.getCharsetAwareReader(
                        request.getInputStream()));
      }

3. Some diff tools may highlight other lines as different when comparing
    against the attached file. That's because I removed excess trailing spaces
    from the lines with a regexp (replacing /[ \t]+$/ with an empty string).
    That kind of artefact often appears when editing files with the Far Manager
    internal editor, and it seems that a few trailing spaces in other code were
    affected by this too. This note also applies to the other files.
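
On the `BufferedReader` wrapping from step 2 b) above: the following is a minimal, self-contained sketch of why the wrapper matters, not code from the patch. A plain `InputStreamReader` stands in for whatever `getCharsetAwareReader` returns (that substitution is an assumption), and `MarkResetDemo` and `peek` are hypothetical names used only for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class MarkResetDemo {

    /** Reads up to {@code limit} chars (e.g. for debug logging), then rewinds. */
    static String peek(Reader reader, int limit) throws IOException {
        reader.mark(limit);                 // throws IOException if mark is unsupported
        char[] buf = new char[limit];
        int read = reader.read(buf, 0, limit);
        reader.reset();                     // later consumers still see the whole document
        return read <= 0 ? "" : new String(buf, 0, read);
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "<?xml version=\"1.0\"?><GetCapabilities/>"
                .getBytes(StandardCharsets.UTF_8);

        // Stand-in for the reader returned by the detector (an assumption).
        Reader plain = new InputStreamReader(
                new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        System.out.println("plain supports mark: " + plain.markSupported());

        // Wrapping adds mark/reset, so a debug peek does not consume the request.
        Reader buffered = new BufferedReader(plain);
        System.out.println("buffered supports mark: " + buffered.markSupported());
        System.out.println("debug peek: " + peek(buffered, 5));
        System.out.println("first char after reset: " + (char) buffered.read());
    }
}
```

Calling `peek` directly on the unwrapped reader would fail at `mark`, which is the situation the comment in the patch is guarding against.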

org.vfny.geoserver.wfs.servlets.WfsDispatcher.java

1. Imported two new classes:

     import org.vfny.geoserver.util.requests.XmlCharsetDetector;
     import org.vfny.geoserver.util.requests.EncodingInfo;

2. Commented out the code relating to the old detection algorithm (so it can be
    quickly brought back in case of trouble with the new one), namely

    a) Some static declarations:

      // private static final String DEFAULT_ENCODING = "UTF-8";

      // private static final String ENCODING_HEADER_ARG = "Content-Type";

      // private static final Pattern ENCODING_PATTERN =
      //         Pattern.compile("encoding\\s*\\=\\s*\"([^\"]+)\"");

    b) Methods `guessRequestEncoding` and `getXmlEncoding`

    c) The following code in `doPost`:

      String req_enc = guessRequestEncoding(request);
      BufferedReader disReader = new BufferedReader(
                                  new InputStreamReader(
                                   new FileInputStream(temp), req_enc));

      BufferedReader requestReader = new BufferedReader(
                                   new InputStreamReader(
                                    new FileInputStream(temp), req_enc));

3. Added these lines to `doPost` before the commented-out ones shown above:

      /*
       * To avoid running the charset detection twice, we remember the
       * charset info after the first pass and use it to create all
       * subsequent readers over the same source. `createReader`
       * immediately creates a charset-aware reader using data from an
       * existing `EncodingInfo` instance. It doesn't perform any
       * detection.
       */
      EncodingInfo encInfo = new EncodingInfo();

      BufferedReader disReader;
      BufferedReader requestReader;

      try {
          disReader = new BufferedReader(
                  XmlCharsetDetector.getCharsetAwareReader(
                          new FileInputStream(temp), encInfo));

          requestReader = new BufferedReader(
                  XmlCharsetDetector.createReader(
                          new FileInputStream(temp), encInfo));
      } catch (Exception e) {
          /*
           * Any exception other than WfsException will "hang up" the
           * process - no client output, no log entries, only "Internal
           * server error". So this is a little trick to make detector's
           * exceptions "visible".
           */
          throw new WfsException(e);
      }

  It is worth noting that the HTTP `Content-Type` header is now ignored, which
  slightly reduces the detection capabilities compared to the old algorithm.
  It can still be put back, and I can do this if necessary. I simply don't
  think it is a reliable source of information. The XML document itself should
  generally be enough of a hint.
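
To make the idea of taking the hint from the document itself concrete, here is a minimal sketch, assuming only a UTF-8/UTF-16 byte order mark or an ASCII-compatible XML declaration. It is not the `XmlCharsetDetector` from the patch and is much cruder than the Xerces-based logic (no UTF-32/UCS-4 or EBCDIC handling, and BOM-less UTF-16 falls through to the default). `XmlEncodingSniffer`, `sniff` and `PEEK_LIMIT` are hypothetical names. It also shows the "detect once, reuse for the second reader" pattern used in `doPost`.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlEncodingSniffer {

    private static final int PEEK_LIMIT = 256;

    // Matches the encoding pseudo-attribute of a leading XML declaration.
    private static final Pattern XML_DECL_ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Guesses a charset name from the first bytes, then rewinds the stream. */
    static String sniff(BufferedInputStream in) throws IOException {
        in.mark(PEEK_LIMIT);
        byte[] head = new byte[PEEK_LIMIT];
        int total = 0;
        while (total < PEEK_LIMIT) {
            int r = in.read(head, total, PEEK_LIMIT - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        in.reset();

        // A byte order mark takes precedence over the declaration.
        if (total >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8"; // note: Java's UTF-8 decoder will not strip the BOM
        }
        if (total >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return "UTF-16"; // Java's UTF-16 decoder reads the BOM itself
            }
        }

        // ISO-8859-1 maps every byte to one char, so an ASCII declaration survives.
        String text = new String(head, 0, total, StandardCharsets.ISO_8859_1);
        Matcher m = XML_DECL_ENCODING.matcher(text);
        return m.lookingAt() ? m.group(1) : "UTF-8";
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Detect once on the first stream...
        BufferedInputStream first = new BufferedInputStream(new ByteArrayInputStream(body));
        String charset = sniff(first);
        Reader disReader = new InputStreamReader(first, charset);

        // ...and reuse the result for a second stream over the same bytes.
        Reader requestReader = new InputStreamReader(new ByteArrayInputStream(body), charset);

        System.out.println("detected: " + charset);
        System.out.println("both readers start with '<': "
                + (disReader.read() == '<' && requestReader.read() == '<'));
    }
}
```

Putting the `Content-Type` header back would amount to consulting it before or after `sniff`; which of the two should win is exactly the reliability question raised above.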

  And now about the four new files, which I placed in a new `util` package (this
  location could of course be changed as appropriate).

  1. `org.vfny.geoserver.util.requests.XmlCharsetDetector` is a container for
     the detection methods themselves. It uses the other three classes to
     achieve its goals. Note that at least one of its methods is taken from the
     Xerces codebase.

  2. `org.vfny.geoserver.util.requests.EncodingInfo` (should've named it
     CharsetInfo, though :) ) can hold information about the detected charset.
     This data can then be used to form an appropriately encoded response,
     as Gabriel suggested, though currently I have no idea how to do this.

  3. `org.vfny.geoserver.util.requests.RewindableInputStream` is a legacy of
     the original Xerces-J charset detection algorithm that I based this work
     on. I don't know why they invented it (I wonder if the same thing can be
     achieved with BufferedInputStream somehow), but I didn't have time to
     investigate this more closely and just copied it (with a few
     modifications). This byte stream provides very limited mark/reset
     functionality.

  4. `org.vfny.geoserver.util.requests.readers.UCSReader` is based upon
     `org.apache.xerces.impl.io.UCSReader` and (in theory) should handle the
     decoding of the ISO-10646-UCS-2 and ISO-10646-UCS-4 charsets, which are
     missing from the standard JVM distribution. I don't know if these charsets
     are ever used, but I thought it would be better to preserve this Xerces
     functionality.
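
To give a feeling for what decoding UCS-4 involves, here is a minimal sketch that turns big-endian UCS-4 bytes (four bytes per code point) into a Java `String`. It is neither the `UCSReader` from the patch nor the Xerces class, it ignores little-endian input and byte order marks, and `Ucs4Sketch` and `decodeUcs4BigEndian` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Ucs4Sketch {

    /** Decodes a big-endian UCS-4 byte stream into a String. */
    static String decodeUcs4BigEndian(InputStream in) throws IOException {
        StringBuilder out = new StringBuilder();
        byte[] unit = new byte[4];
        while (true) {
            // Fill one 4-byte unit; a single read may return fewer bytes.
            int filled = 0;
            while (filled < 4) {
                int r = in.read(unit, filled, 4 - filled);
                if (r < 0) {
                    break;
                }
                filled += r;
            }
            if (filled == 0) {
                return out.toString();      // clean end of stream
            }
            if (filled < 4) {
                throw new IOException("Truncated UCS-4 unit at end of stream");
            }
            int codePoint = ((unit[0] & 0xFF) << 24) | ((unit[1] & 0xFF) << 16)
                    | ((unit[2] & 0xFF) << 8) | (unit[3] & 0xFF);
            if (!Character.isValidCodePoint(codePoint)) {
                throw new IOException("Invalid code point: 0x"
                        + Integer.toHexString(codePoint));
            }
            // Code points above U+FFFF become a surrogate pair in the String.
            out.appendCodePoint(codePoint);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {
            0x00, 0x00, 0x00, 0x41,         // U+0041 'A'
            0x00, 0x01, (byte) 0xF6, 0x00   // U+1F600, outside the BMP
        };
        String text = decodeUcs4BigEndian(new ByteArrayInputStream(data));
        System.out.println("chars: " + text.length()
                + ", code points: " + text.codePointCount(0, text.length()));
    }
}
```

A UCS-2 variant would read two bytes per unit instead of four and append each unit as a single `char`.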

  I also wrote a basic Writer for UCS-4, but it is of no use currently and so
  is not included. Maybe it will prove useful when the task of adapting
  responses to requests is solved.

  OK, the whole thing needs some serious testing (including performance and
  possible synchronization issues). I can only guarantee that it works for
  the specific tasks I tested it with.

  I'm always open to questions, though I don't always have Internet access :)
