[Geoserver-devel] Some thoughts about using ogr2ogr to make new wfs output formats

Hi,
I'm dumping some thoughts about the ogr2ogr output formats before
I start to implement them so that we can have a bit of discussion
and decide on the details of how this can be done.

First off, the basic idea is to generate an output on the file
system, and then have ogr2ogr generate the desired output,
and return it. The code would leverage the existing ogr2ogr
executable, not the ogr java bindings, so it would not have
platform dependency/threading/server segfault issues.

On the other hand, we don't have much control over which outputs
are actually available in the ogr2ogr build.

The code attached to http://jira.codehaus.org/browse/GEOS-665
is quite old, and makes an assumption that makes me wonder if
it's worth using at all: it hard-codes the output types that
can be used (MapInfo, CSV, KML2). If you want to add more,
you have to write a class.
I see two problems with this approach:
- what if the user wants to write out an S57? Does he have
   to make a custom build of the plugin for his GeoServer
   version, which he may already have in production?
- what if one wants to specify custom parameters? OGR
   output formats have many parameters
Wouldn't it be better to allow users to configure the outputs
instead? Say we add an ogr folder to the data dir, this
folder contains a number of property files, one for each
output format that we want to generate with ogr2ogr,
with the following content:

mapinfo.properties
----------------------------------------------
format=MapInfo (the -f parameter of ogr2ogr)
ofname=MapInfo (the output format name)
options=... (a list of options to be passed to ogr2ogr)
extension=.tab (the extension of the file(s) generated in output)
zip=true/false (can we return a single file?)
contentType=xyz (if we return a single file, what is its content type?)

And then have some generic code in GeoServer that
writes the intermediate output, builds an ogr2ogr invocation,
and returns the output file (or zips up all the
files in the resulting folder).
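
To make the idea a bit more concrete, here is a minimal sketch of how such generic code could turn one of those property files into an ogr2ogr command line. The class and method names are made up for illustration, the whitespace splitting of `options` is a simplification (it would break options that contain spaces), and the sample values assume the OGR "MapInfo File" driver with its `FORMAT=MIF` dataset creation option.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not existing GeoServer code.
public class OgrCommandSketch {

    /** Builds the argument list for: ogr2ogr -f <format> [options] <output> <input>. */
    public static List<String> buildCommand(Properties format, String input, String output) {
        List<String> cmd = new ArrayList<>();
        cmd.add("ogr2ogr");
        cmd.add("-f");
        cmd.add(format.getProperty("format"));
        // Assumption: options are separated by whitespace in the property file
        String options = format.getProperty("options", "").trim();
        if (!options.isEmpty()) {
            for (String option : options.split("\\s+")) {
                cmd.add(option);
            }
        }
        // ogr2ogr expects the destination first, then the source
        cmd.add(output);
        cmd.add(input);
        return cmd;
    }

    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(
                "format=MapInfo File\nofname=MapInfo\noptions=-dsco FORMAT=MIF\nextension=.mif\n"));
        System.out.println(buildCommand(p, "in.gml", "out.mif"));
    }
}
```

The resulting list could be handed to `new ProcessBuilder(cmd)` to actually run the executable; error handling and timeouts are left out of the sketch.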

Opinions? (I know I had a similar conversation with Justin
some time ago and he was leaning towards having one class
per possible output, though in that case it was about
an OGR datastore using the java bindings).

Ah, for feeding ogr2ogr we either provide some GML2
or shapefiles. With gml2 we have the issue that
ogr2ogr does not read the schema, and thus will try
to make a guess on the data types.
With shapefiles we have issues when multiple shapefiles
are generated, which happens any time:
- the wfs GetFeature contains multiple Query elements
- the geometry type is "generic" and there are mixed
   geometric types in the output, in which case a
   shapefile per geometry type is generated

Of course I could circumvent the issue and have
ogr2ogr invoked on each of the shapefiles in output,
and then zip up the whole result (I'm actually kind
of leaning towards this solution).
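
The "zip up the whole result" part could look roughly like the sketch below, which packs every file found in the ogr2ogr output folder into a single archive. The class name is hypothetical, and it assumes the zip file is written outside the folder being packed (otherwise the archive would try to include itself).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not existing GeoServer code.
public class ZipFolderSketch {

    /** Writes every regular file directly inside dir as an entry of zipFile. */
    public static void zipFolder(Path dir, Path zipFile) throws IOException {
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out);
             DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // sub-folders are ignored in this sketch
                }
                zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
    }
}
```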

Again, opinions?
Cheers
Andrea

--
Andrea Aime
OpenGeo - http://opengeo.org
Expert service straight from the developers.

Sounds good on the whole Andrea. A few comments inline.

<snip>

> Opinions? (I know I had a similar conversation with Justin
> some time ago and he was leaning towards having one class
> per possible output, though in that case it was about
> an OGR datastore using the java bindings).

Yeah having to write separate classes or deploy separate jars I agree is a lot of work, which is probably unnecessary. My motivation here is to be able to somehow configure which formats from ogr show up in GeoServer, and not get bombarded with the many formats ogr provides. Others may disagree... this really just stems from my minimalist need to only have things around that I ask for :). Regardless, not a big issue.

> Ah, for feeding ogr2ogr we either provide some GML2
> or shapefiles. With gml2 we have the issue that
> ogr2ogr does not read the schema, and thus will try
> to make a guess on the data types.
> With shapefiles we have issues when multiple shapefiles
> are generated, which happens any time:
> - the wfs GetFeature contains multiple Query elements
> - the geometry type is "generic" and there are mixed
>    geometric types in the output, in which case a
>    shapefile per geometry type is generated
>
> Of course I could circumvent the issue and have
> ogr2ogr invoked on each of the shapefiles in output,
> and then zip up the whole result (I'm actually kind
> of leaning towards this solution).

How about perhaps making the "feed source" configurable? For instance perhaps use shapefile as the default but provide a facility to allow users to use postgis or some other source they want to configure.

--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

Justin Deoliveira wrote:

> Yeah having to write separate classes or deploy separate jars I agree is a lot of work, which is probably unnecessary.

I'm not really scared of that work, but of the fact that the user
might have an ogr2ogr build that supports format XYZ that we did
not code up support for. If it's configurable, then he can add
his own format without learning Java, setting up Maven and so on.

> My motivation here is to be able to somehow configure which formats from ogr show up in GeoServer, and not get bombarded with the many formats ogr provides. Others may disagree... this really just stems from my minimalist need to only have things around that I ask for :). Regardless, not a big issue.

Ah, but we would list only the formats that the user has provided
a property file for. Maybe we can ship with a few sample formats
supported by default and hard coded into the plugin
(MapInfo and a few others), and allow the user to provide
extras by adding those property files.
Whatever is user-provided has not been tested by the devs,
so no big issue there.
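
As a rough illustration of "list only the formats that have a property file", the sketch below scans a folder for `*.properties` files and keeps those that define both `format` and `ofname`. The class name and the validation rule are assumptions made for the example; the key names are the ones proposed earlier in the thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not existing GeoServer code.
public class FormatScanSketch {

    /** Returns the configured output formats found in ogrDir, keyed by their "ofname". */
    public static Map<String, Properties> loadFormats(Path ogrDir) throws IOException {
        Map<String, Properties> formats = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(ogrDir, "*.properties")) {
            for (Path file : files) {
                Properties p = new Properties();
                try (Reader reader = Files.newBufferedReader(file)) {
                    p.load(reader);
                }
                String name = p.getProperty("ofname");
                // Assumption: a file missing either key is silently skipped
                if (name != null && p.getProperty("format") != null) {
                    formats.put(name, p);
                }
            }
        }
        return formats;
    }
}
```

Only the keys of the returned map would then be advertised as output formats, so nothing shows up unless someone asked for it.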

> How about perhaps making the "feed source" configurable? For instance perhaps use shapefile as the default but provide a facility to allow users to use postgis or some other source they want to configure.

You mean, direct connection to the original data source from
ogr2ogr, like connecting directly to the postgis database that
we know holds the data? I see a few issues:
- how do we handle all of the WFS options like filtering and so on?
- what happens if ogr2ogr does not have the proper driver?

The mechanisms I envisioned would use GeoServer to do all the
filtering, attribute shaving, reprojecting and so on (eventually,
computation too, as this can be used as a WPS output format too),
then dump into shapefile, and have ogr2ogr turn that shapefile
into the format that we want to return.

Maybe I misunderstood your suggestion. Please elaborate? :)

Cheers
Andrea

--
Andrea Aime
OpenGeo - http://opengeo.org
Expert service straight from the developers.

Andrea Aime wrote:

> Ah, but we would list only the formats that the user has provided
> a property file for. Maybe we can ship with a few sample formats
> supported by default and hard coded into the plugin
> (MapInfo and a few others), and allow the user to provide
> extras by adding those property files.
> Whatever is user-provided has not been tested by the devs,
> so no big issue there.

Hmmm... good point. Mission accomplished then!! :).

> Maybe I misunderstood your suggestion. Please elaborate? :)

Yeah, not quite, sorry, I did not explain myself well :).

So what we are talking about is this, correct?

1. request comes in and is handled by geoserver
2. geoserver dumps response to local shapefile
3. geoserver pumps the shapefile through ogr to another format and returns the new format to the client

My suggestion is that in step 2, instead of dumping to a shapefile, we dump to some other intermediary data source. The rationale being that you stated issues with shapefile, and gml... so perhaps being able to configure a postgis db for the temporary storage might work better? Just a thought.

--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

Aaah, I see what you mean. Hum, I'm playing a little with gml generated
by GeoServer and it's not going that badly. I'll make tests with some
bigger data sets to see what happens, time and memory usage wise.
So far GML as the intermediate format is working fine (though I have not
tried it on that 1.3GB shapefile, which will most probably generate a
3+GB gml file).

Cheers
Andrea

--
Andrea Aime
OpenGeo - http://opengeo.org
Expert service straight from the developers.

Ok, so I made a couple of tests.

If we look at a synthetic feature type I created with a different
geometry type per feature (which can happen if you play with spatial
databases), gml is the winner, as the conversion to mapinfo, which
also supports mixed geometries, works just fine.
Same goes for producing KML, which generates an interesting pure
data KML dialect (no styles) with all the mixed geometries happily
sitting around.

Using shapefile would have produced a number of different shapefiles,
and as a result a number of mapinfo files (assuming we generate one
per output shapefile). Which is not that bad, as we output data
anyway, but not brilliant.

When I tried the big 1.3GB shapefile as a data source, the results were
quite grim though:
GML
- time to generate the 3GB gml2 file: 2m 30s
- ogr2ogr the gml into mapinfo: well over 12m (eek!)
SHAPEFILE
- time to generate the output shapefile using SHAPE-ZIP: around 4m
   (unfair, we won't need to actually zip it, so assume around 2-3
    minutes to dump to shape and the rest to zip it)
- time to ogr2ogr the generated shape into mapinfo: 2m 30s

All in all the output times of shapefile and gml are not that
different, but the ogr2ogr times look really bad for GML.
Humm... which pill? One is general enough but it's slow as a snail,
the other is somewhat decent speed wise, but will end up
generating multiple files each time we hit the shapefile
limitations.

As for dumping to a configurable datastore, yeah, that might
work, but we'd also end up debugging createSchema which is not
really used much by any popular gt2 based software so far...

If you're still interested in taking over the OGR data store
in GeoTools I'd say we could implement the gml way
for the time being, and look forward to a faster implementation
using the OGR data store in the future. Opinions?

Cheers
Andrea

--
Andrea Aime
OpenGeo - http://opengeo.org
Expert service straight from the developers.

Interesting, sounds like you are having fun :). I would say go ahead as you see fit. Anything will be better than nothing. As for the ability to configure the intermediate datastore, I agree it will rely on createSchema, which is pretty brittle across datastore implementations. Perhaps something to look at for the future.

As for the tradeoff between shapefile and gml... would it be worth adding the ability to use a format_option to control which is used? Again perhaps something to look at as an improvement for the future.
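
A minimal sketch of what that could look like, assuming the usual `format_options=key:value;key2:value2` syntax and a purely hypothetical `intermediate` key (no such option exists today). It falls back to GML, since that is the approach proposed for the time being.

```java
import java.util.Locale;

// Hypothetical helper; the "intermediate" key is an assumption, not an existing option.
public class IntermediateFormatSketch {

    public enum Intermediate { GML, SHAPEFILE }

    /** Picks the intermediate format from a format_options string, defaulting to GML. */
    public static Intermediate parse(String formatOptions) {
        if (formatOptions != null) {
            for (String pair : formatOptions.split(";")) {
                String[] kv = pair.split(":", 2);
                if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("intermediate")) {
                    try {
                        return Intermediate.valueOf(kv[1].trim().toUpperCase(Locale.ROOT));
                    } catch (IllegalArgumentException e) {
                        return Intermediate.GML; // unknown value, keep the default
                    }
                }
            }
        }
        return Intermediate.GML;
    }
}
```

With this, a request carrying `format_options=intermediate:shapefile` would take the faster shapefile route, and everything else would keep going through GML.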

--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.