[Geoserver-devel] fun with cite tests

Hi all,

For 2.0-alpha2 I have been running cite tests as I would for a normal release; however, I used this as an opportunity to resurrect some stuff I was working on a while back.

To step back a moment, my motivation was to be able to run cite tests against multiple backends easily, targeting mainly database backends: postgis, h2, mysql, etc...

So the first thing I did was come up with a small utility program to load an arbitrary datastore with the various configurations of cite data. Currently I have only implemented wfs 1.0 and wfs 1.1. The loader can be found underneath the rest of the cite utilities I have worked on in the past:

http://svn.codehaus.org/geoserver/trunk/src/community/cite/loader/

The nice thing about this is that it really exercises what goes *into the datastore*, strictly through the geotools api, rather than hacking the backend with sql scripts and making things appear ok coming *out of the datastore*.
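
For the curious, the loader boils down to something like the following minimal sketch. The class and method names are mine rather than the actual loader's, and the parsing of the cite data into feature types and features is assumed to happen elsewhere:

    import java.util.Map;

    import org.geotools.data.DataStore;
    import org.geotools.data.DataStoreFinder;
    import org.geotools.data.FeatureWriter;
    import org.geotools.data.Transaction;
    import org.opengis.feature.simple.SimpleFeature;
    import org.opengis.feature.simple.SimpleFeatureType;

    public class CiteLoaderSketch {

        /** Loads one cite feature type into an arbitrary datastore, api only. */
        public static void load(Map<String, Object> connectionParams,
                SimpleFeatureType schema, Iterable<SimpleFeature> citeFeatures)
                throws Exception {
            // look up the target datastore (h2, postgis, mysql, ...) by params
            DataStore store = DataStoreFinder.getDataStore(connectionParams);
            try {
                // the datastore generates its own DDL for the type
                store.createSchema(schema);

                FeatureWriter<SimpleFeatureType, SimpleFeature> writer =
                        store.getFeatureWriterAppend(schema.getTypeName(),
                                Transaction.AUTO_COMMIT);
                try {
                    for (SimpleFeature feature : citeFeatures) {
                        SimpleFeature target = writer.next();
                        target.setAttributes(feature.getAttributes());
                        writer.write();
                    }
                } finally {
                    writer.close();
                }
            } finally {
                store.dispose();
            }
        }
    }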

So with the loader script I loaded up an H2 database for the wfs 1.0 cite tests and did a run. And... it uncovered quite a few of the hacks that the current postgis setup uses to be "cite compliant". To name the major ones:

1) the boundedBy attribute only gets encoded because it is an attribute in the underlying tables, which obviously would never get updated if the actual geometry changed.

2) The reference to "built-in" gml geometry property types in GML2 is toggled via a flag to the gml2 encoder, rather than by mapping feature attributes to the actual application schema and gml schemas like we do for GML3.

And there are other issues. I finally figured out why wfs 1.0 describe feature types never pass with the new engine. It is because the schema generated by wfs 1.0 does not match the schemas built into the cite tests. I am not sure how this ever passed on the old engine, but I had to dig into the XSLT pit of the new engine and indeed found the schemas that it uses to validate.

Here is a summary of what I did codewise:

1) Added a new GML2 output format (GML2OutputFormat2) which uses the gtxml encoder like the GML3 one does. (I can see Andrea's eyes rolling from here.) I did try to make it work with the existing encoder but could not, since it does not at all respect the application schema being encoded against.

2) Added proper schema.xsd files for each feature type so that geoserver actually creates the feature types properly for the cite wfs 1.0 data.

3) Refactored the wfs 1.1 schema encoder a bit to work with both 1.1 and 1.0.

4) Some other random changes here and there... mostly bug fixes to work against the new configuration.

And it works!! I was able to run the wfs 1.0 cite tests on both H2 and MySQL (NG). It should work for pretty much all NG datastores, but I have not gotten around to trying them all. Although I know there is something special with Oracle (go figure) that prevents us from passing cite. Same goes for DB2, I believe.

The situation is similar for the wfs 1.1 tests.

I have committed all my changes that are "non-disruptive" and wanted to bounce my ideas off the list for the more important changes.

1) GML2OutputFormat2: I realize that there are performance issues with the gtxml encoder (which btw I am working on, but that is a discussion for another thread). So I am not proposing a replacement. What I am proposing is that GML2OutputFormat2 be engaged only when strict cite compliance is set.
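
In code terms the proposal amounts to something like this sketch. It is simplified on purpose: the real output formats are wired up through Spring and their constructors take more arguments, so the interface below is a stand-in rather than the actual GeoServer API:

    public class Gml2EncoderChooser {

        /** Stand-in for the common encoder contract, simplified. */
        public interface FeatureCollectionEncoder {
            void encode(Object featureCollection, java.io.OutputStream out)
                    throws java.io.IOException;
        }

        private final FeatureCollectionEncoder oldTransformer; // GML2OutputFormat
        private final FeatureCollectionEncoder gtxmlEncoder;   // GML2OutputFormat2

        public Gml2EncoderChooser(FeatureCollectionEncoder oldTransformer,
                FeatureCollectionEncoder gtxmlEncoder) {
            this.oldTransformer = oldTransformer;
            this.gtxmlEncoder = gtxmlEncoder;
        }

        /** Engage the schema-aware gtxml encoder only under strict cite compliance. */
        public FeatureCollectionEncoder choose(boolean citeCompliant) {
            return citeCompliant ? gtxmlEncoder : oldTransformer;
        }
    }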

2) XmlSchemaEncoder: I am proposing replacing the old 1.0 schema encoder with the new one. The old one has no notion of schema overrides, and quite brutishly builds up a big string buffer and then spits out the XML.

And that is it. Thanks for reading :).

-Justin

--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

Justin Deoliveira wrote:

Hi all,

For 2.0-alpha2 I have been running cite tests as I would for a normal release; however, I used this as an opportunity to resurrect some stuff I was working on a while back.

To step back a moment, my motivation was to be able to run cite tests against multiple backends easily, targeting mainly database backends: postgis, h2, mysql, etc...

So the first thing I did was come up with a small utility program to load an arbitrary datastore with the various configurations of cite data. Currently I have only implemented wfs 1.0 and wfs 1.1. The loader can be found underneath the rest of the cite utilities I have worked on in the past:

http://svn.codehaus.org/geoserver/trunk/src/community/cite/loader/

The nice thing about this is that it really exercises what goes *into the datastore*, strictly through the geotools api, rather than hacking the backend with sql scripts and making things appear ok coming *out of the datastore*.

Nice, I love this. DataStore.createSchema and data loading have never
been tested much. I expect those to have some issues (e.g., creating
the wrong column types, not setting the expected length restrictions
and the like).

So with the loader script I loaded up an H2 database for the wfs 1.0 cite tests and did a run. And... it uncovered quite a few of the hacks that the current postgis setup uses to be "cite compliant". To name the major ones:

1) the boundedBy attribute only gets encoded because it is an attribute in the underlying tables, which obviously would never get updated if the actual geometry changed.

Mumble mumble... isn't this related to the fact that we don't have feature
bounding on by default anymore? If you enable it the gml:boundedBy element will be generated; see for example here:

http://geo.openplans.org:8080/geoserver/wfs?service=WFS&version=1.0.0&request=GetFeature&typeName=topp:states

Hmm... but if you have it inside the table, won't you get it inside
another namespace (the one of the feature)?

2) The reference to "built-in" gml geometry property types in GML2 is toggled via a flag to the gml2 encoder, rather than by mapping feature attributes to the actual application schema and gml schemas like we do for GML3.

Can you explain why the GML3 way is any better? AFAIK GML3 is not doing
real "app schema" anyway?

And there are other issues. I finally figured out why wfs 1.0 describe feature types never pass with the new engine. It is because the schema generated by wfs 1.0 does not match the schemas built into the cite tests. I am not sure how this ever passed on the old engine, but I had to dig into the XSLT pit of the new engine and indeed found the schemas that it uses to validate.

Didn't we use featureTypes/typeName/schema.xml in order to
match the expected feature type?

Here is a summary of what I did codewise:

1) Added a new GML2 output format (GML2OutputFormat2) which uses the gtxml encoder like the GML3 one does. (I can see Andrea's eyes rolling from here.) I did try to make it work with the existing encoder but could not, since it does not at all respect the application schema being encoded against.

My eyes will be rolling only as long as the new encoder is not up
to snuff speed-wise; the day we can recommend it for production I'll
be happy with it.

Having a separate gml2 encoder using the new architecture could be a good intermediate step to get it some exposure in the meantime, but
I'm not sure we want to run cite tests with it: we'd end up running
cite tests with one encoder but then suggesting people use the
other one in production.

2) Added proper schema.xsd files for each feature type so that geoserver actually creates the feature types properly for the cite wfs 1.0 data.

Seems like more of a workaround than what we had before? If you look here:
http://svn.codehaus.org/geoserver/branches/1.7.x/data/citewfs-1.0/featureTypes/
you'll notice that only one of the feature types is using a schema
override:
http://svn.codehaus.org/geoserver/branches/1.7.x/data/citewfs-1.0/featureTypes/cdf_Other/

3) Refactored the wfs 1.1 schema encoder a bit to work with both 1.1 and 1.0.

4) Some other random changes here and there... mostly bug fixes to work against the new configuration.

And it works!! I was able to run the wfs 1.0 cite tests on both H2 and MySQL (NG). It should work for pretty much all NG datastores, but I have not gotten around to trying them all. Although I know there is something special with Oracle (go figure) that prevents us from passing cite. Same goes for DB2, I believe.

For Oracle we cannot pass the cite tests for a couple of reasons:
- table and geometry names are forced to uppercase by Oracle
   Spatial, while the cite tests want a different case
- there is no boolean data type
Hmmm... there may be some issue related to number handling as well.
In any case, it seems we need a mapping datastore in the middle in
order to cite test with Oracle (or a purely renaming one, if we
establish some convention on what a "boolean" is in Oracle); something
like the sketch below.
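
Purely as an illustration, the convention-based retyping could look like this. The NUMBER(1)-means-boolean convention is hypothetical, and a real version would also have to carry over CRS and other constraints:

    import org.geotools.feature.simple.SimpleFeatureTypeBuilder;
    import org.opengis.feature.simple.SimpleFeatureType;
    import org.opengis.feature.type.AttributeDescriptor;

    public class OracleCiteRetyper {

        /** Retypes an Oracle-reported schema back to what cite expects. */
        public static SimpleFeatureType retype(SimpleFeatureType oracleType) {
            SimpleFeatureTypeBuilder b = new SimpleFeatureTypeBuilder();
            // undo the forced upper-casing of the table name
            b.setName(oracleType.getTypeName().toLowerCase());
            for (AttributeDescriptor att : oracleType.getAttributeDescriptors()) {
                Class<?> binding = att.getType().getBinding();
                // hypothetical convention: NUMBER(1) columns, surfaced as
                // Short, really mean boolean
                if (Short.class.equals(binding)) {
                    binding = Boolean.class;
                }
                b.add(att.getLocalName().toLowerCase(), binding);
            }
            return b.buildFeatureType();
        }
    }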

The situation is similar for the wfs 1.1 tests.

I have committed all my changes that are "non-disruptive" and wanted to bounce my ideas off the list for the more important changes.

1) GML2OutputFormat2: I realize that there are performance issues with the gtxml encoder (which btw I am working on, but that is a discussion for another thread). So I am not proposing a replacement. What I am proposing is that GML2OutputFormat2 be engaged only when strict cite compliance is set.

I would prefer to see the production choice be used for cite testing as well. Can you point me at what issues there are with the old gml2
encoder? I've had good success fixing it in the past.
What about an environment variable telling the encoder which one to use?
This way one could use GML2OutputFormat2 if they wanted to.
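
I.e., something along these lines at encoder lookup time (a sketch; the property name is made up):

    public class Gml2EncoderSwitch {

        /** True if the (hypothetical) system property asks for the new encoder. */
        public static boolean useNewGml2Encoder() {
            String choice = System.getProperty(
                    "org.geoserver.wfs.gml2encoder", "transformer");
            return "gtxml".equalsIgnoreCase(choice);
        }
    }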

2) XmlSchemaEncoder: I am proposing replacing the old 1.0 schema encoder with the new one. The old one has no notion of schema overrides, and quite brutishly builds up a big string buffer and then spits out the XML.

Yes, works for me.

Cheers
Andrea

Nice, I love this. DataStore.createSchema and data loading have never
been tested much. I expect those to have some issues (e.g., creating
the wrong column types, not setting the expected length restrictions
and the like).

I echo Andrea's words. This is really fun stuff.

Mark filled in all the table metadata information for the CCIP PostGIS
instance, so we may be able to get good descriptions out of jdbc-ng as
well.

Cheers,
Jody

Jody Garnett wrote:

Nice, I love this. DataStore.createSchema and data loading have never
been tested much. I expect those to have some issues (e.g., creating
the wrong column types, not setting the expected length restrictions
and the like).

I echo Andrea's words. This is really fun stuff.

Mark filled in all the table metadata information for the CCIP PostGIS
instance, so we may be able to get good descriptions out of jdbc-ng as
well.

To do this we'd have to pimp up the jdbc-ng datastore to read the
table descriptions, and modify GeoServer to either look for specific
metadata in the FeatureType definition or to use the getInfo
methods.

Which seems like a good idea for very database-centric admins who
like to configure as much as possible in the db itself.
Jody, can you open a jira and propose a patch?

Cheers
Andrea

Andrea Aime wrote:

Justin Deoliveira wrote:

Hi all,

For 2.0-alpha2 I have been running cite tests as I would for a normal release; however, I used this as an opportunity to resurrect some stuff I was working on a while back.

To step back a moment, my motivation was to be able to run cite tests against multiple backends easily, targeting mainly database backends: postgis, h2, mysql, etc...

So the first thing I did was come up with a small utility program to load an arbitrary datastore with the various configurations of cite data. Currently I have only implemented wfs 1.0 and wfs 1.1. The loader can be found underneath the rest of the cite utilities I have worked on in the past:

http://svn.codehaus.org/geoserver/trunk/src/community/cite/loader/

The nice thing about this is that it really exercises what goes *into the datastore*, strictly through the geotools api, rather than hacking the backend with sql scripts and making things appear ok coming *out of the datastore*.

Nice, I love this. DataStore.createSchema and data loading have never
been tested much. I expect those to have some issues (e.g., creating
the wrong column types, not setting the expected length restrictions
and the like).

So with the loader script I loaded up an H2 database for the wfs 1.0 cite tests and did a run. And... it uncovered quite a few of the hacks that the current postgis setup uses to be "cite compliant". To name the major ones:

1) the boundedBy attribute only gets encoded because it is an attribute in the underlying tables, which obviously would never get updated if the actual geometry changed.

Mumble mumble... isn't this related to the fact that we don't have feature
bounding on by default anymore? If you enable it the gml:boundedBy element will be generated; see for example here:

http://geo.openplans.org:8080/geoserver/wfs?service=WFS&version=1.0.0&request=GetFeature&typeName=topp:states

Perhaps... the boundedBy attribute may be unnecessary in the database, and it is just there for legacy reasons.

Hmm... but if you have it inside the table, won't you get it inside
another namespace (the one of the feature)?

2) The reference to "built-in" gml geometry property types in GML2 is toggled via a flag to the gml2 encoder, rather than by mapping feature attributes to the actual application schema and gml schemas like we do for GML3.

Can you explain why the GML3 way is any better? AFAIK GML3 is not doing
real "app schema" anyway?

Well, it is, in that the encoder respects the schema, and this allows for very basic mapping facilities: namely, being able to specify the namespace attached to an element, being able to specify the encoded type of an element, etc... The old transformer can't do that.

The flag is a brute force approach that only works in the most common case.

* Will it work in the case where you have two geometries, and one needs to reference a gml property type while the other needs an app schema property type? -> no, both are either gml or both are app schema (see the sketch after this list)

* Will it work if the property is named something other than the well known property types, pointPropertyType, lineStringPropertyType, etc...? -> no
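
To make the first case concrete, here is the kind of made-up type the global flag cannot handle: two geometry attributes that should be encoded against different property types, while a single encoder-wide flag forces them both down the same path.

    import com.vividsolutions.jts.geom.LineString;
    import com.vividsolutions.jts.geom.Point;

    import org.geotools.feature.simple.SimpleFeatureTypeBuilder;
    import org.opengis.feature.simple.SimpleFeatureType;

    public class TwoGeometryExample {

        public static SimpleFeatureType build() {
            SimpleFeatureTypeBuilder b = new SimpleFeatureTypeBuilder();
            b.setName("Roads"); // made-up type, for illustration only
            // should reference the built-in gml:LineStringPropertyType
            b.add("centerline", LineString.class);
            // should get its own app schema declared property type
            b.add("label_point", Point.class);
            return b.buildFeatureType();
        }
    }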

And there are other issues. I finally figured out why wfs 1.0 describe feature types never pass with the new engine. It is because the schema generated by wfs 1.0 does not match the schemas built into the cite tests. I am not sure how this ever passed on the old engine, but I had to dig into the XSLT pit of the new engine and indeed found the schemas that it uses to validate.

Didn't we use featureTypes/typeName/schema.xml in order to
match the expected feature type?

For one type... and after I examined it with respect to the schemas used by the cite tests they still did not match up, so I admit I am unsure how this worked.

Here is a summary of what I did codewise:

1) Added a new GML2 output format (GML2OutputFormat2) which uses the gtxml encoder like the GML3 one does. (I can see Andrea's eyes rolling from here.) I did try to make it work with the existing encoder but could not, since it does not at all respect the application schema being encoded against.

My eyes will be rolling only as long as the new encoder is not up
to snuff speed-wise; the day we can recommend it for production I'll
be happy with it.

Having a separate gml2 encoder using the new architecture could be a good intermediate step to get it some exposure in the meantime, but
I'm not sure we want to run cite tests with it: we'd end up running
cite tests with one encoder but then suggesting people use the
other one in production.

Well, that is kind of a moot argument, since many of the code paths we run during cite are run only during cite and not in production. Although I admit using a full-blown different encoder is drastic, it does not seem any different in principle. I mean, without it we are not cite compliant, just like without all the other cite hacks we are not cite compliant.

2) Added proper schema.xsd files for each feature type so that geoserver actually creates the feature types properly for the cite wfs 1.0 data.

Seems like more of a workaround than what we had before? If you look here:
http://svn.codehaus.org/geoserver/branches/1.7.x/data/citewfs-1.0/featureTypes/

you'll notice that only one of the feature types is using a schema
override:
http://svn.codehaus.org/geoserver/branches/1.7.x/data/citewfs-1.0/featureTypes/cdf_Other/

Not quite: all the types have special schema requirements that we were not really supporting. Search for dataFeatures.xsd and geometryFeatures.xsd under the cite/tests directory to see what I mean.

3) Refactored the wfs 1.1 schema encoder a bit to work with both 1.1 and 1.0.

4) Some other random changes here and there... mostly bug fixes to work against the new configuration.

And it works!! I was able to run the wfs 1.0 cite tests on both H2 and MySQL (NG). It should work for pretty much all NG datastores, but I have not gotten around to trying them all. Although I know there is something special with Oracle (go figure) that prevents us from passing cite. Same goes for DB2, I believe.

For Oracle we cannot pass the cite tests for a couple of reasons:
- table and geometry names are forced to uppercase by Oracle
  Spatial, while the cite tests want a different case
- there is no boolean data type
Hmmm... there may be some issue related to number handling as well.
In any case, it seems we need a mapping datastore in the middle in
order to cite test with Oracle (or a purely renaming one, if we
establish some convention on what a "boolean" is in Oracle).

The situation is similar for the wfs 1.1 tests.

I have committed all my changes that are "non-disruptive" and wanted to bounce my ideas off the list for the more important changes.

1) GML2OutputFormat2: I realize that there are performance issues with the gtxml encoder (which btw I am working on, but that is a discussion for another thread). So I am not proposing a replacement. What I am proposing is that GML2OutputFormat2 be engaged only when strict cite compliance is set.

I would prefer to see the production choice be used for cite testing as well. Can you point me at what issues there are with the old gml2
encoder? I've had good success fixing it in the past.
What about an environment variable telling the encoder which one to use?
This way one could use GML2OutputFormat2 if they wanted to.

Ha, try to run the wfs cite tests with a regular database setup and have fun. It took me a couple of weeks of spare time to figure out all the issues and fix them cleanly, so good luck.

The alternative is to not change anything and keep the old postgis db around with the old encoder, and pass the tests for that special case. In which case calling ourselves cite compliant would be a stretch.

The whole point of this exercise for me was not to test our WFS protocol, we have already done that; it was to test our backend datastores against the variety of cases that the cite tests throw at them.

Anyways, I am curious if other people think the value add here is worth the hit in performance. My opinion is that I have never seen GML as a format built for speed: it is way too verbose, it requires the loading of an external document to describe itself, etc... I am also curious to know if anyone has actually chosen server software based solely on how fast it spits out GML.

2) XmlSchemaEncoder: I am proposing replacing the old 1.0 schema encoder with the new one. The old one has no notion of schema overrides, and quite brutishly builds up a big string buffer and then spits out the XML.

Yes, works for me.

Cheers
Andrea

--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

Justin Deoliveira wrote:

Anyways, I am curious if other people think the value add here is worth the hit in performance. My opinion is that I have never seen GML as a format built for speed: it is way too verbose, it requires the loading of an external document to describe itself, etc... I am also curious to know if anyone has actually chosen server software based solely on how fast it spits out GML.

Solely on that makes no sense. In general you have a set of minimum
requirements, and you have to satisfy them all for a solution to be chosen. Performance is usually one of them, unless you're just
setting up a proof of concept.

For example, INSPIRE is setting minimum requirements for both WMS
and WFS. The WFS requirements are not high (whilst the WMS ones
might be very taxing), but you can be sure people deploying an
INSPIRE compliant service will do a formal performance comparison
(whilst before that it might have been just an informal check):
http://inspire.jrc.ec.europa.eu/reports/ImplementingRules/network/D3.9_Draft_IR_Download_Services_v2.0.pdf
The requirement is 0.5MB/s sustained for each connection, but it also
says the service must be able to serve at least 10 concurrent
connections ("The capacity of an INSPIRE service is given by a number of service request which are sent in a given
time frame. Then the performance indicator has to be met for every individual service response.").
This means the total requirement is 5MB/s (quite a taxing
one, since we're talking MBytes, not Megabits, meaning one would
also need a 40Mbit line to serve out that much data).

Marketing-wise, the performance presentation already showed GeoServer
being slower than MapServer in GML3/shapefile output.
You may say it's fast enough for production purposes, and you're probably
right about it, but it does not make for a good impression on users looking at the presentation anyway (the takeaway from looking at
bar charts is relative performance; the fact that both are
plenty fast is something that you have to force them to read).

Cheers
Andrea

What I am proposing is that GML2OutputFormat2 be engaged only when strict cite compliance is set.

I would prefer to see the production choice be used for cite testing as well. Can you point me at what issues there are with the old gml2
encoder? I've had good success fixing it in the past.
What about an environment variable telling the encoder which one to use?
This way one could use GML2OutputFormat2 if they wanted to.

Ha, try to run the wfs cite tests with a regular database setup and have fun. It took me a couple of weeks of spare time to figure out all the issues and fix them cleanly, so good luck.

The alternative is to not change anything and keep the old postgis db around with the old encoder, and pass the tests for that special case. In which case calling ourselves cite compliant would be a stretch.

The whole point of this exercise for me was not to test our WFS protocol, we have already done that; it was to test our backend datastores against the variety of cases that the cite tests throw at them.

Anyways, I am curious if other people think the value add here is worth the hit in performance.

As I see it, there are different situations in which people tend to use one or another QA factor as the main driver to choose a product. We can't deny that speed, even if a lame one compared to robustness, scalability, reliability, etc., is the easiest to assess and hence the most often talked about. I have seen a large gov agency wanting to spit out GML as-fast-as-possible. I think an organization delivering GML to the public will always find the bottleneck to be the network bandwidth, while an organization willing to use WFS as the centralized data editing service in its intranet will want it to be really fast.
But, that said, I'm very willing to agree with you on this, Justin. I certainly want to have the fewest code paths possible, a single (gt-xsd) tech in use for both gml2 and gml3, and am also willing to sacrifice some perf to obtain that. I just want to make sure the solution, even if a bit slower, does scale up and does not blow up resource consumption, AND I would love to sit down with you and research a strategy by which we can a) incorporate a pull/push model for gt-xsd streaming and b) make the underlying tech used for the low level IO pluggable, such that I can just as easily reuse all the infrastructure for binary xml streaming.
In conclusion, and sorry if these comments didn't actually add more value to the discussion, this is something I would really love to see on _trunk_, but I have my reservations about changing the gml2 encoder in 1.7.x.

My opinion is that I have never seen GML as a format built for speed: it is way too verbose, it requires the loading of an external document to describe itself, etc... I am also curious to know if anyone has actually chosen server software based solely on how fast it spits out GML.

2) XmlSchemaEncoder: I am proposing replacing the old 1.0 schema encoder with the new one. The old one has no notion of schema overrides, and quite brutishly builds up a big string buffer and then spits out the XML.

+1
Cheers,

Gabriel

Yes, works for me.

Cheers
Andrea

--
Gabriel Roldan
OpenGeo - http://opengeo.org
Expert service straight from the developers.

Putting the philosophical debate aside for the moment, there are two things on the table here:

1) fast GML
2) cite compliance with a generic setup

The current setup can't do both without a complete overhaul of the current gml2 encoder... which is what the gtxml encoder is.

Also, to stress the point, I only want to replace the encoder when cite is enabled, which is what, 1% of the time? Does anyone in production actually run with cite enabled?

Asking for the sacrifice of some speed in a 1% case in order to achieve much better testing and qa of many of our datastores does not seem like an unreasonable request to me.

Gabriel Roldan wrote:

What I am proposing is that GML2OutputFormat2 be engaged only when strict cite compliance is set.

I would prefer to see the production choice be used for cite testing as well. Can you point me at what issues there are with the old gml2
encoder? I've had good success fixing it in the past.
What about an environment variable telling the encoder which one to use?
This way one could use GML2OutputFormat2 if they wanted to.

Ha, try to run the wfs cite tests with a regular database setup and have fun. It took me a couple of weeks of spare time to figure out all the issues and fix them cleanly, so good luck.

The alternative is to not change anything and keep the old postgis db around with the old encoder, and pass the tests for that special case. In which case calling ourselves cite compliant would be a stretch.

The whole point of this exercise for me was not to test our WFS protocol, we have already done that; it was to test our backend datastores against the variety of cases that the cite tests throw at them.

Anyways, I am curious if other people think the value add here is worth the hit in performance.

As I see it, there are different situations in which people tend to use one or another QA factor as the main driver to choose a product. We can't deny that speed, even if a lame one compared to robustness, scalability, reliability, etc., is the easiest to assess and hence the most often talked about. I have seen a large gov agency wanting to spit out GML as-fast-as-possible. I think an organization delivering GML to the public will always find the bottleneck to be the network bandwidth, while an organization willing to use WFS as the centralized data editing service in its intranet will want it to be really fast.
But, that said, I'm very willing to agree with you on this, Justin. I certainly want to have the fewest code paths possible, a single (gt-xsd) tech in use for both gml2 and gml3, and am also willing to sacrifice some perf to obtain that. I just want to make sure the solution, even if a bit slower, does scale up and does not blow up resource consumption, AND I would love to sit down with you and research a strategy by which we can a) incorporate a pull/push model for gt-xsd streaming and b) make the underlying tech used for the low level IO pluggable, such that I can just as easily reuse all the infrastructure for binary xml streaming.
In conclusion, and sorry if these comments didn't actually add more value to the discussion, this is something I would really love to see on _trunk_, but I have my reservations about changing the gml2 encoder in 1.7.x.

My opinion is that I have never seen GML as a format built for speed: it is way too verbose, it requires the loading of an external document to describe itself, etc... I am also curious to know if anyone has actually chosen server software based solely on how fast it spits out GML.

2) XmlSchemaEncoder: I am proposing replacing the old 1.0 schema encoder with the new one. The old one has no notion of schema overrides, and quite brutishly builds up a big string buffer and then spits out the XML.

+1
Cheers,

Gabriel

Yes, works for me.

Cheers
Andrea

--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

Justin Deoliveira wrote:

Putting the philosophical debate aside for the moment, there are two things on the table here:

1) fast GML
2) cite compliance with a generic setup

The current setup can't do both without a complete overhaul of the current gml2 encoder... which is what the gtxml encoder is.

Also, to stress the point, I only want to replace the encoder when cite is enabled, which is what, 1% of the time? Does anyone in production actually run with cite enabled?

Asking for the sacrifice of some speed in a 1% case in order to achieve much better testing and qa of many of our datastores does not seem like an unreasonable request to me.

I don't care if we run the cite tests a bit slower.
What I'm worried about is that we would remove some test coverage of
the case that people really do run in production, since the CITE tests
would not test it anymore.
Are you sure our unit tests provide the same coverage of GML encoding
as the cite tests do?

Cheers
Andrea

Andrea Aime wrote:

For example, INSPIRE is setting minimum requirements for both WMS
and WFS. The WFS requirements are not high

Meh, I wrote this before noticing that the 0.5MB/s had to be
multiplied by 10. Forget about it, they are high :-)

Cheers
Andrea

Justin Deoliveira wrote:

Putting the philosophical debate aside for the moment, there are two things on the table here:

What is philosophical about my comments? Didn't I basically say that whilst I understand Andrea's concerns about speed I am willing to support this, but that I just have some reservations about doing it in 1.7.x for the general case, due to the new code's lack of exposure to production conditions?
Wait a minute... reading the thread from the beginning again I see you're talking about 2.0 here... sorry, I just got alarmed about 1.7.x; this seems totally fine for 2.0 to me, as I already said.

1) fast GML
2) cite compliance with a generic setup

The current setup can't do both without a complete overhaul of the current gml2 encoder... which is what the gtxml encoder is.

Also, to stress the point, I only want to replace the encoder when cite is enabled, which is what, 1% of the time? Does anyone in production actually run with cite enabled?

I don't know.

Asking for the sacrifice of some speed in a 1% case in order to achieve much better testing and qa of many of our datastores does not seem like an unreasonable request to me.

Yes, sounds reasonable to me; you're trying to get better end-to-end QA by easily running cite against different backends.

Gabriel Roldan wrote:

What I am proposing is that GML2OutputFormat2 be engaged only when strict cite compliance is set.

I would prefer to see the production choice be used for cite testing as well. Can you point me at what issues there are with the old gml2
encoder? I've had good success fixing it in the past.
What about an environment variable telling the encoder which one to use?
This way one could use GML2OutputFormat2 if they wanted to.

Ha, try to run the wfs cite tests with a regular database setup and have fun. It took me a couple of weeks of spare time to figure out all the issues and fix them cleanly, so good luck.

The alternative is to not change anything and keep the old postgis db around with the old encoder, and pass the tests for that special case. In which case calling ourselves cite compliant would be a stretch.

The whole point of this exercise for me was not to test our WFS protocol, we have already done that; it was to test our backend datastores against the variety of cases that the cite tests throw at them.

Anyways, I am curious if other people think the value add here is worth the hit in performance.

As I see it, there are different situations in which people tend to use one or another QA factor as the main driver to choose a product. We can't deny that speed, even if a lame one compared to robustness, scalability, reliability, etc., is the easiest to assess and hence the most often talked about. I have seen a large gov agency wanting to spit out GML as-fast-as-possible. I think an organization delivering GML to the public will always find the bottleneck to be the network bandwidth, while an organization willing to use WFS as the centralized data editing service in its intranet will want it to be really fast.
But, that said, I'm very willing to agree with you on this, Justin. I certainly want to have the fewest code paths possible, a single (gt-xsd) tech in use for both gml2 and gml3, and am also willing to sacrifice some perf to obtain that. I just want to make sure the solution, even if a bit slower, does scale up and does not blow up resource consumption, AND I would love to sit down with you and research a strategy by which we can a) incorporate a pull/push model for gt-xsd streaming and b) make the underlying tech used for the low level IO pluggable, such that I can just as easily reuse all the infrastructure for binary xml streaming.
In conclusion, and sorry if these comments didn't actually add more value to the discussion, this is something I would really love to see on _trunk_, but I have my reservations about changing the gml2 encoder in 1.7.x.

My opinion is that I have never seen GML as a format built for speed: it is way too verbose, it requires the loading of an external document to describe itself, etc... I am also curious to know if anyone has actually chosen server software based solely on how fast it spits out GML.

2) XmlSchemaEncoder: I am proposing replacing the old 1.0 schema encoder with the new one. The old one has no notion of schema overrides, and quite brutishly builds up a big string buffer and then spits out the XML.

+1
Cheers,

Gabriel

Yes, works for me.

Cheers
Andrea

--
Gabriel Roldan
OpenGeo - http://opengeo.org
Expert service straight from the developers.

I don't care if we run the cite tests a bit slower.
What I'm worried about is that we would remove some test coverage of
the case that people really do run in production, since the CITE tests
would not test it anymore.
Are you sure our unit tests provide the same coverage of GML encoding
as the cite tests do?

Fair enough, that is a very valid concern. We can continue to maintain the test coverage we achieve now by running the old encoder with the old postgis setup. But keep in mind that I am the only one who runs and fixes cite tests for releases (or at least I am the one that usually does it). Maintaining a hacked cite setup is not something I am crazy about, and probably something I don't have the capacity for. Would someone else be willing to take on the responsibility of running these tests?

So an option to enable the new GML2 output format only when running certain cite suites is deemed ok?

Cheers
Andrea

--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

That comment was not targeted at you directly; it was targeted at everyone, including myself, who opened up this point of conversation.

Gabriel Roldan wrote:

Justin Deoliveira wrote:

Putting the philosophical debate aside for the moment, there are two things on the table here:

What is philosophical about my comments? Didn't I basically say that whilst I understand Andrea's concerns about speed I am willing to support this, but that I just have some reservations about doing it in 1.7.x for the general case, due to the new code's lack of exposure to production conditions?
Wait a minute... reading the thread from the beginning again I see you're talking about 2.0 here... sorry, I just got alarmed about 1.7.x; this seems totally fine for 2.0 to me, as I already said.

1) fast GML
2) cite compliance with a generic setup

The current setup can't do both without a complete overhaul of the current gml2 encoder... which is what the gtxml encoder is.

Also, to stress the point, I only want to replace the encoder when cite is enabled, which is what, 1% of the time? Does anyone in production actually run with cite enabled?

I don't know.

Asking for the sacrifice of some speed in a 1% case in order to achieve much better testing and qa of many of our datastores does not seem like an unreasonable request to me.

Yes, sounds reasonable to me; you're trying to get better end-to-end QA by easily running cite against different backends.

Gabriel Roldan wrote:

What I am proposing is that GML2OutputFormat2 be engaged only when strict cite compliance is set.

I would prefer to see the production choice be used for cite testing as well. Can you point me at what issues there are with the old gml2
encoder? I've had good success fixing it in the past.
What about an environment variable telling the encoder which one to use?
This way one could use GML2OutputFormat2 if they wanted to.

Ha, try to run the wfs cite tests with a regular database setup and have fun. It took me a couple of weeks of spare time to figure out all the issues and fix them cleanly, so good luck.

The alternative is to not change anything and keep the old postgis db around with the old encoder, and pass the tests for that special case. In which case calling ourselves cite compliant would be a stretch.

The whole point of this exercise for me was not to test our WFS protocol, we have already done that; it was to test our backend datastores against the variety of cases that the cite tests throw at them.

Anyways, I am curious if other people think the value add here is worth the hit in performance.

As I see it, there are different situations in which people tend to use one or another QA factor as the main driver to choose a product. We can't deny that speed, even if a lame one compared to robustness, scalability, reliability, etc., is the easiest to assess and hence the most often talked about. I have seen a large gov agency wanting to spit out GML as-fast-as-possible. I think an organization delivering GML to the public will always find the bottleneck to be the network bandwidth, while an organization willing to use WFS as the centralized data editing service in its intranet will want it to be really fast.
But, that said, I'm very willing to agree with you on this, Justin. I certainly want to have the fewest code paths possible, a single (gt-xsd) tech in use for both gml2 and gml3, and am also willing to sacrifice some perf to obtain that. I just want to make sure the solution, even if a bit slower, does scale up and does not blow up resource consumption, AND I would love to sit down with you and research a strategy by which we can a) incorporate a pull/push model for gt-xsd streaming and b) make the underlying tech used for the low level IO pluggable, such that I can just as easily reuse all the infrastructure for binary xml streaming.
In conclusion, and sorry if these comments didn't actually add more value to the discussion, this is something I would really love to see on _trunk_, but I have my reservations about changing the gml2 encoder in 1.7.x.

My opinion is that I have never seen GML as a format built for speed: it is way too verbose, it requires the loading of an external document to describe itself, etc... I am also curious to know if anyone has actually chosen server software based solely on how fast it spits out GML.

2) XmlSchemaEncoder: I am proposing replacing the old 1.0 schema encoder with the new one. The old one has no notion of schema overrides, and quite brutishly builds up a big string buffer and then spits out the XML.

+1
Cheers,

Gabriel

Yes, works for me.

Cheers
Andrea

--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

Andrea Aime wrote:

I don't care if we run the cite tests a bit slower.
What I'm worried about is that we would remove some test coverage of
the case that people really do run in production, since the CITE tests
would not test it anymore.
Are you sure our unit tests provide the same coverage of GML encoding
as the cite tests do?

And oh, I forgot to mention the obvious, but if the new encoder can
get in the same ballpark as the old encoder speed-wise, the whole
argument ceases to exist: we can have the new encoder as the only one,
use it always, in test and production, and not worry about perf
issues ;-)
I know you're working on it; wondering if we can start kicking the
tires of your patches sooner rather than later?

Cheers
Andrea

Justin Deoliveira wrote:

That comment was not targeted at you directly; it was targeted at everyone, including myself, who opened up this point of conversation.

Fair enough, so it seems like you have our blessing on this.

Gabriel Roldan wrote:

Justin Deoliveira wrote:

Putting the philosophical debate aside for the moment, there are two things on the table here:

What is philosophical about my comments? Didn't I basically say that whilst I understand Andrea's concerns about speed I am willing to support this, but that I just have some reservations about doing it in 1.7.x for the general case, due to the new code's lack of exposure to production conditions?
Wait a minute... reading the thread from the beginning again I see you're talking about 2.0 here... sorry, I just got alarmed about 1.7.x; this seems totally fine for 2.0 to me, as I already said.

1) fast GML
2) cite compliance with a generic setup

The current setup can't do both without a complete overhaul of the current gml2 encoder... which is what the gtxml encoder is.

Also, to stress the point, I only want to replace the encoder when cite is enabled, which is what, 1% of the time? Does anyone in production actually run with cite enabled?

I don't know.

Asking for the sacrifice of some speed in a 1% case in order to achieve much better testing and qa of many of our datastores does not seem like an unreasonable request to me.

Yes, sounds reasonable to me; you're trying to get better end-to-end QA by easily running cite against different backends.

Gabriel Roldan wrote:

What I am proposing is that GML2OutputFormat2 be engaged only when strict cite compliance is set.

I would prefer to see the production choice be used for cite testing as well. Can you point me at what issues there are with the old gml2
encoder? I've had good success fixing it in the past.
What about an environment variable telling the encoder which one to use?
This way one could use GML2OutputFormat2 if they wanted to.

Ha, try to run the wfs cite tests with a regular database setup and have fun. It took me a couple of weeks of spare time to figure out all the issues and fix them cleanly, so good luck.

The alternative is to not change anything and keep the old postgis db around with the old encoder, and pass the tests for that special case. In which case calling ourselves cite compliant would be a stretch.

The whole point of this exercise for me was not to test our WFS protocol, we have already done that; it was to test our backend datastores against the variety of cases that the cite tests throw at them.

Anyways, I am curious if other people think the value add here is worth the hit in performance.

As I see it, there are different situations in which people tend to use one or another QA factor as the main driver to choose a product. We can't deny that speed, even if a lame one compared to robustness, scalability, reliability, etc., is the easiest to assess and hence the most often talked about. I have seen a large gov agency wanting to spit out GML as-fast-as-possible. I think an organization delivering GML to the public will always find the bottleneck to be the network bandwidth, while an organization willing to use WFS as the centralized data editing service in its intranet will want it to be really fast.
But, that said, I'm very willing to agree with you on this, Justin. I certainly want to have the fewest code paths possible, a single (gt-xsd) tech in use for both gml2 and gml3, and am also willing to sacrifice some perf to obtain that. I just want to make sure the solution, even if a bit slower, does scale up and does not blow up resource consumption, AND I would love to sit down with you and research a strategy by which we can a) incorporate a pull/push model for gt-xsd streaming and b) make the underlying tech used for the low level IO pluggable, such that I can just as easily reuse all the infrastructure for binary xml streaming.
In conclusion, and sorry if these comments didn't actually add more value to the discussion, this is something I would really love to see on _trunk_, but I have my reservations about changing the gml2 encoder in 1.7.x.

My opinion is that I have never seen GML as a format built for speed: it is way too verbose, it requires the loading of an external document to describe itself, etc... I am also curious to know if anyone has actually chosen server software based solely on how fast it spits out GML.

2) XmlSchemaEncoder: I am proposing replacing the old 1.0 schema encoder with the new one. The old one has no notion of schema overrides, and quite brutishly builds up a big string buffer and then spits out the XML.

+1
Cheers,

Gabriel

Yes, works for me.

Cheers
Andrea

--
Gabriel Roldan
OpenGeo - http://opengeo.org
Expert service straight from the developers.

Justin Deoliveira wrote:

I don't care if we run the cite tests a bit slower.
What I'm worried about is that we would remove some test coverage of
the case that people really do run in production, since the CITE tests
would not test it anymore.
Are you sure our unit tests provide the same coverage of GML encoding
as the cite tests do?

Fair enough, that is a very valid concern. We can continue to maintain the test coverage we achieve now by running the old encoder with the old postgis setup. But keep in mind that I am the only one who runs and fixes cite tests for releases (or at least I am the one that usually does it). Maintaining a hacked cite setup is not something I am crazy about, and probably something I don't have the capacity for. Would someone else be willing to take on the responsibility of running these tests?

I guess it's time to put my money where my mouth is ;-)
I'll do it.

So an option to enable the new GML2 output format only when running certain cite suites is deemed ok?

Sure. Hopefully some time in the future we'll be able to just switch
over to the new encoder for everything.

Cheers
Andrea

Andrea Aime wrote:

Andrea Aime wrote:

I don't care if we run the cite tests a bit slower.
What I'm worried about is that we would remove some test coverage of
the case that people really do run in production, since the CITE tests
would not test it anymore.
Are you sure our unit tests provide the same coverage of GML encoding
as the cite tests do?

And oh, I forgot to mention the obvious, but if the new encoder can
get in the same ballpark as the old encoder speed-wise, the whole
argument ceases to exist: we can have the new encoder as the only one,
use it always, in test and production, and not worry about perf
issues ;-)
I know you're working on it; wondering if we can start kicking the
tires of your patches sooner rather than later?

Well, they are still pretty rough, the result of raw experimentation, so nothing is ready to commit in patch form. And of course I want to do some rigorous testing before I proceed. The rest of this week is looking pretty busy, but perhaps in the next couple of weeks...

Cheers
Andrea

--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

Will it help if we aim for CITE compliance with 2.0, using the
app-schemas stuff?

People who care about CITE are probably going to care about
app-schema support as well (cf. INSPIRE).

IMHO we could afford to let CITE conformance slide for pre-2.0
versions - in much the same way as we don't try to certify OGC
conformance for every release.

I.e., put the energy into optimising performance and hack-free
configuration for 2.0.

The question is whether we lose people not willing to wait till 2.0. It
would be good to poll people to see what combination of requirements
they have. It doesn't make sense to talk about INSPIRE's requirements
without supporting INSPIRE-compliant app-schemas, for example, so
optimising 1.7 performance to meet their needs doesn't really make much
sense.

NB: this way we are still free to have non-compliant (to a published
standard) but fast format(s) that take shortcuts, for people who care
more about performance than transparency of contract. Effectively 1.7
only supports these; no big deal, perhaps?

Rob

On Wed, Apr 15, 2009 at 12:39 AM, Justin Deoliveira
<jdeolive@anonymised.com> wrote:

Andrea Aime wrote:

Andrea Aime wrote:

I don't care if we run the cite tests a bit slower.
What I'm worried about is that we would remove some test coverage of
the case that people really do run in production, since the CITE tests
would not test it anymore.
Are you sure our unit tests provide the same coverage of GML encoding
as the cite tests do?

And oh, I forgot to mention the obvious, but if the new encoder can
get in the same ballpark as the old encoder speed-wise, the whole
argument ceases to exist: we can have the new encoder as the only one,
use it always, in test and production, and not worry about perf
issues ;-)
I know you're working on it; wondering if we can start kicking the
tires of your patches sooner rather than later?

Well, they are still pretty rough, the result of raw experimentation, so
nothing is ready to commit in patch form. And of course I want to do
some rigorous testing before I proceed. The rest of this week is looking
pretty busy, but perhaps in the next couple of weeks...

Cheers
Andrea

--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.


Will it help if we aim for CITE compliance with 2.0, using the
app-schemas stuff?

It would help; I view the app-schemas stuff as combining the schemas
and allowing for mapping (such as the integer-to-boolean mapping
required for oracle).

IMHO we could afford to let CITE conformance slide for pre-2.0
versions - in much the same way as we don't try to certify OGC
conformance for every release.

I am not sure if that would help; being able to run CITE tests has
been very helpful from a QA standpoint. Justin, are the improved test
cases able to pick up the slack in this case?

Jody

Rob Atkinson wrote:

Will it help if we aim for CITE compliance with 2.0, using the
app-schemas stuff?

I want them both, the app schema and the non app schema case. The non
app schema case is still 100% of our current user base; the app schema
case will expand it, but I don't see it becoming the primary use case
for a while, since lots of custom apps are built on top of GeoServer
that don't care at all about shareable schemas.

People who care about CITE are probably going to care about
app-schema support as well (cf. INSPIRE).

CITE has been the only test GeoServer had for a long time, and our
unit tests are still far from covering all the tests a normal
cite run does (500+ tests in wfs 1.1 without complex features alone).

IMHO we could afford to let CITE conformance slide for pre-2.0
versions - in much the same way as we don't try to certify OGC
conformance for every release.

We don't try to certify for cost reasons. Letting compliance go
for the stable series would be foolish in my opinion; it would be
a definite contradiction in terms (a stable series that's not compliant?
We never release a stable if we don't pass all the CITE tests).

Cheers
Andrea