[Geoserver-devel] Some ideas on the WPS module

Hi all,
during FOSS4G 2008 and the associated code sprint Justin and I had a look
at the WPS specification and the current GeoServer
community module, looking for a way to move it forward.
This mail summarizes some of our findings and ideas. Don't read it as
a "will do all of this by next month", but more as a set of wishful
thoughts that I will probably try to turn into reality, slowly,
in my spare time.

FIX EXISTING ISSUES
We explored the current WPS module and found a number of issues
that we'd like to address:
- lack of unit testing
- lack of XML request support for GetCapabilities, lack of
   KVP support for Execute (neither is mandatory, but
   not having the choice of KVP for Execute is actually quite
   an annoyance, as KVP is often perceived as more immediate to use)
- the transmuters API seems overkill: it requires a class per type
   handled, and class-level javadoc is missing.
   Justin has implemented on his notebook a tentative replacement that
   does the same job as the existing 12 classes in 3 simple classes
   and still leaves the door open to handling raster data (whilst the
   current complex data handler simply assumes the data to be parsed
   is XML, the input could be anything, such as a GeoTIFF or a
   compressed shapefile); see the sketch below.
Once this is fixed, we can try to hook up reading feature collections
from remote servers.
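
To make the transmuter point a bit more concrete, here is a rough sketch of
the direction such a replacement could take. The interface name and methods
below are made up for illustration; this is not Justin's actual code:

  // Hypothetical sketch of a simplified complex data handling API: one handler
  // per encoding family instead of one transmuter class per handled type, and
  // stream based signatures so that binary payloads (GeoTIFF, zipped shapefile)
  // can be supported as well as XML.
  public interface ComplexDataHandler {

      /** Returns true if this handler can parse/encode the given mime type. */
      boolean canHandle(String mimeType);

      /** Parses the raw input into a Java object (FeatureCollection, coverage, ...). */
      Object decode(java.io.InputStream input, String mimeType) throws java.io.IOException;

      /** Encodes a process result in the requested mime type. */
      void encode(Object value, String mimeType, java.io.OutputStream output) throws java.io.IOException;
  }

Three implementations (XML/vector, raster, literals) could then cover what the
twelve existing classes do today.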

AVOID THE MIDDLE MAN IF POSSIBLE
We would like to avoid GML encoding/decoding (or coverage
encoding/decoding) when possible, and directly access the datastores if
the source is local.
We can try to recognize a local request from the host, port and first
part of the path, also dealing with proxies and the like, but as a first
step we can just have a special URL that resolves to the current
GeoServer, something like:

.?service=wfs&version=1.0&request=GetFeature&typeName=topp:states

(think of it as a relative link: the WPS request is posted to
http://host:port/geoserver/ows, so "." would represent exactly that URL).
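
A minimal sketch of the recognition step, assuming for now we only look for
that special relative URL (the helper name and class are invented here):

  // Hypothetical helper: treat "." based hrefs as references to this very
  // GeoServer instance, so we can skip HTTP and GML parsing entirely.
  // Later on this could also compare host, port and context path.
  public class LocalReferences {
      public static boolean isLocal(String href) {
          return href != null && (href.equals(".") || href.startsWith(".?") || href.startsWith("./"));
      }
  }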

Once we recognize that the request is local, we want to leverage the
work the Dispatcher is already doing up to a point, that is, parsing the
request, executing it, and returning the results without actually
encoding them. For example, WFS returns a number of FeatureCollection
objects out of GetFeature, and the same happens with WCS. We can then
plug those feature collections directly into the process, allowing
for more efficient handling of large amounts of data.
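
The shortcut boils down to something like the interface below; this is only a
conceptual sketch (name and signature invented), the real thing would have to
piggyback on the Dispatcher machinery:

  // Conceptual sketch: parse and execute a local request through the owning
  // service and hand back the raw result objects (e.g. FeatureCollections for
  // a WFS GetFeature, a coverage for a WCS GetCoverage) before any encoding.
  public interface LocalRequestExecutor {
      Object executeLocally(String kvpOrXmlRequest) throws Exception;
  }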

SCALE UP WITH REAL REMOTE REQUESTS
If the request is really remote, we have to be prepared and parse
whatever is coming in.
Now, generally speaking a process will need to access a
feature collection's content multiple times, so we either need to:
- store the feature collection in memory, hoping it's not too big
- store the feature collection in a persistent datastore (postgis,
   spatialite, even a shapefile would do) and then return a feature
   collection coming from that datastore, and eventually clean up the
   datastore when the Execute is done (meaning we need some extension
   allowing after-the-fact cleanup of those collections).
For starters we'll follow the first path; the second one is the natural
evolution. A detail that bugs me is that the second option is
inefficient for those processes that only need to access the feature
collection once, streaming over it; for those it would be ok to use a
streaming parser and just scan over the remote input stream once.
We could create a marker interface to identify those processes and act
accordingly.
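
The marker interface idea could be as simple as this (the name is invented
for the example):

  // Hypothetical marker: a process implementing this promises to scan each
  // input feature collection at most once, front to back, so the WPS can feed
  // it a purely streaming collection instead of a stored copy.
  public interface SinglePassProcess {
      // intentionally empty, the interface itself is the declaration
  }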

If the remote input happens to be a coverage instead, we can
just download it as is, put it into a temporary directory, and
create a coverage out of it. Actually, for some processes we could
again decide to pre-process it a little, such as tiling it, in order
to get memory efficient processing as opposed to reading the
coverage as a single big block of memory.
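
For the coverage case, the download-to-temporary-file route could look roughly
like this (a GeoTIFF is assumed for simplicity, error handling and content type
sniffing are left out, and the class itself is just a sketch):

  // Sketch: spool the remote coverage to a temp file and build a coverage from
  // it with the GeoTools GeoTIFF reader. Tiling/pre-processing would happen
  // between the download and the read.
  import java.io.*;
  import java.net.URL;
  import org.geotools.coverage.grid.GridCoverage2D;
  import org.geotools.gce.geotiff.GeoTiffReader;

  public class RemoteCoverageFetcher {
      public GridCoverage2D fetch(URL remote) throws IOException {
          File file = File.createTempFile("wps", ".tiff");
          InputStream in = remote.openStream();
          OutputStream out = new FileOutputStream(file);
          try {
              byte[] buffer = new byte[8192];
              int read;
              while ((read = in.read(buffer)) != -1) {
                  out.write(buffer, 0, read);
              }
          } finally {
              in.close();
              out.close();
          }
          return new GeoTiffReader(file).read(null);
      }
  }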

STORE=TRUE
For long running processes it makes lots of sense to actually
support storing. Yet just storing the result is kind of unsatisfactory,
my guess is that most of the client code would be interested in
being able to access the results of the computation using WMS/WFS.
Following this line of thinking, it would be nice to allow
two store operation modes:
* plain storage of the outputs on the file system (as the standard
   requires)
* storing the results in a datastore/coverage, registering the result
   in the catalog, and having the WPS response return a GetFeature
   or a GetCoverage URL as the "web accessible URL" store requires us
   to put in the response

Given that most of the computations have a temporary or per
user nature, it would be nice if we could decide whether to
store the results in a per-session catalog that only the current
user sees, and that eventually expires, killing the registered
resources along with it, or in the full catalog.
This could be done by adding a per-session catalog to be used
side by side with the "global" one, and having the normal services
use a wrapper that looks for data first in the session one, and
then in the global one.
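
A very rough sketch of the wrapper idea; "SimpleCatalog" below is a made-up
stand-in for the real catalog interface, used only to keep the example
self-contained:

  // Sketch of the lookup order only: the wrapper consults the per-session
  // catalog first and falls back on the global one.
  interface SimpleCatalog {
      Object getLayerByName(String name);
  }

  public class SessionFirstCatalog implements SimpleCatalog {
      private final SimpleCatalog session; // temporary, per-user results
      private final SimpleCatalog global;  // the normal GeoServer catalog

      public SessionFirstCatalog(SimpleCatalog session, SimpleCatalog global) {
          this.session = session;
          this.global = global;
      }

      public Object getLayerByName(String name) {
          Object layer = session.getLayerByName(name);
          return layer != null ? layer : global.getLayerByName(name);
      }
  }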

Well, that's it. Thoughts?
Cheers
Andrea

Hi Andrea; just a quick follow up ... I viewed the priorities as:
- feature collection (where most of the fun is in including the schema used by the feature collection)
- support for the store/status use case (where content is uploaded to an FTP site and you can check on the status of a long running application)

Other comments inline.

- the transmuters API seems overkill: it requires a class per type
   handled, and class-level javadoc is missing.
   Justin has implemented on his notebook a tentative replacement that
   does the same job as the existing 12 classes in 3 simple classes
   and still leaves the door open to handling raster data (whilst the
   current complex data handler simply assumes the data to be parsed
   is XML, the input could be anything, such as a GeoTIFF or a
   compressed shapefile)

What is the transmuters API you are talking about here?

AVOID THE MIDDLE MAN IF POSSIBLE

This was a nice to have; your idea seems fine. Please keep in mind the "simple chaining" examples from the WPS spec.

SCALE UP WITH REAL REMOTE REQUESTS
If the request is really remote, we have to be prepared and parse whatever is coming in.

We may need to make a new kind of api here - for the code that is in the uDig project. I am thinking along the lines of Jesse's "provider" classes. So we could have a FeatureCollectionProvider that is passed into a Process; if its answer is a FeatureCollection perhaps it could give us a FeatureCollectionProvider which we could hook up to the next process in the "chain"? If you wanted to "wrap" this in something that would lazily save the contents to memory or disk in order to prevent "reprocessing", that would be transparent. The providers could be considered the "edges" and the processes the nodes in a graph producing the answer.
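
Roughly, such a provider could be sketched like this (hypothetical names, not
the actual uDig code):

  // Hypothetical provider acting as an "edge" between two processes in a chain.
  // A caching wrapper could lazily save the features to memory or disk on the
  // first resolve() and hand back the saved copy afterwards, transparently.
  import org.geotools.feature.FeatureCollection;

  public interface FeatureCollectionProvider {
      FeatureCollection resolve() throws java.io.IOException;
  }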

streaming parser and just scan over the remote input stream once.
We could create a marker interface to identify those processes and act
accordingly.

The processes have a data structure describing their parameter requirements; it includes a Map that you can use for hints like the one you describe. So you could have a process that expects a feature collection and include a key in the metadata map that is true if the feature collection will be used more than once.
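
For example, the hint could be stored like this (the key name is invented for
illustration):

  // Illustration only: a metadata map entry declaring that a feature collection
  // parameter will be read more than once. The key name is made up here.
  import java.util.HashMap;
  import java.util.Map;

  public class FeatureCollectionHints {
      public static final String MULTIPLE_PASSES = "featureCollection.multiplePasses";

      public static Map<String, Object> multiPass() {
          Map<String, Object> metadata = new HashMap<String, Object>();
          metadata.put(MULTIPLE_PASSES, Boolean.TRUE);
          return metadata;
      }
  }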

STORE=TRUE
For long running processes it makes lots of sense to actually
support storing. Yet just storing the result is kind of unsatisfactory,
my guess is that most of the client code would be interested in
being able to access the results of the computation using WMS/WFS.

This is not documented in the spec; they refer to making results available on FTP sites. I could see doing this
for processes that are scheduled to occur at a set time (make a "daily" layer for example). However, why not
make a simple "publish" process - and people can use that at the end of their chain if they want the result made
available via WFS.
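
A bare-bones sketch of such a "publish" process (the interface and wiring are
invented for the example; the real signature would come from the process API):

  // Hypothetical end-of-chain process: persist a FeatureCollection, register
  // the resulting layer in the catalog, and return the name it was published as.
  import java.io.IOException;
  import org.geotools.feature.FeatureCollection;

  interface LayerPublisher {
      String publish(String layerName, FeatureCollection features) throws IOException;
  }

  public class PublishProcess {
      private final LayerPublisher publisher;

      public PublishProcess(LayerPublisher publisher) {
          this.publisher = publisher;
      }

      public String execute(String layerName, FeatureCollection features) throws IOException {
          // after this, the result is reachable via a plain WFS GetFeature on layerName
          return publisher.publish(layerName, features);
      }
  }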

Well, that's it. Thoughts?

The per session stuff is only okay; the specification provides some guidance about how long a result will be available and so forth.
Jody

Jody Garnett wrote:

Hi Andrea; just a quick follow up ... I viewed the priorities as:
- feature collection (where most of the fun is in including the schema used by the feature collection)
- support for the store/status use case (where content is uploaded to an FTP site and you can check on the status of a long running application)

I share the priority of feature collection support, but I definitely won't implement upload to an FTP server as part of store; the
specification just says that you have to give back a URL at which you
can access the result.
Which, from where I stand, means just putting the GML file in the
data directory and providing access to it through a URL, just as we
do for WCS store support. Also, the file is only temporarily stored there,
and will be removed after a timeout (that, I fear, for the moment is
not configurable... waiting for the new UI to be completed to
add that config option).
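
In sketch form, the storage side would be little more than this (the helper
class, the "temp/wps" folder and the URL layout are all invented here):

  // Hypothetical sketch: write an Execute output under the data directory and
  // return the URL it will be served from; a cleanup task would later remove
  // files older than the timeout.
  import java.io.*;

  public class StoredResponseWriter {
      private final File dataDirectory;
      private final String baseUrl; // e.g. http://host:port/geoserver

      public StoredResponseWriter(File dataDirectory, String baseUrl) {
          this.dataDirectory = dataDirectory;
          this.baseUrl = baseUrl;
      }

      public String store(String fileName, String gml) throws IOException {
          File folder = new File(dataDirectory, "temp/wps");
          folder.mkdirs();
          Writer out = new FileWriter(new File(folder, fileName));
          try {
              out.write(gml);
          } finally {
              out.close();
          }
          return baseUrl + "/temp/wps/" + fileName; // the web accessible URL for the response
      }
  }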

Other comments inline.

- the transmuters API seems overkill: it requires a class per type
   handled, and class-level javadoc is missing.
   Justin has implemented on his notebook a tentative replacement that
   does the same job as the existing 12 classes in 3 simple classes
   and still leaves the door open to handling raster data (whilst the
   current complex data handler simply assumes the data to be parsed
   is XML, the input could be anything, such as a GeoTIFF or a
   compressed shapefile)

What is the transmuters API you are talking about here?

The one contained in the org.geoserver.wps.transmute package in the wps module.

AVOID THE MIDDLE MAN IF POSSIBLE

This was a nice to have; your idea seems fine. Please keep in mind the "simple chaining" examples from the WPS spec.

SCALE UP WITH REAL REMOTE REQUESTS
If the request is really remote, we have to be prepared and parse whatever is coming in.

We may need to make a new kind of api here - for the code that is in the uDig project. I am thinking along the lines of Jesse's "provider" classes. So we could have a FeatureCollectionProvider that is passed into a Process; if its answer is a FeatureCollection perhaps it could give us a FeatureCollectionProvider which we could hook up to the next process in the "chain"? If you wanted to "wrap" this in something that would lazily save the contents to memory or disk in order to prevent "reprocessing", that would be transparent. The providers could be considered the "edges" and the processes the nodes in a graph producing the answer.

I need some help in understanding the gain here.
Generally speaking we want the processes to depend on
some well known input and output types to ensure that it's possible
to easily build chains. In my mind those would be FeatureCollection
and Coverage. Good performing chaining should allow for middle
man avoidance. What I proposed in fact looks like a fc provider managed
by geoserver, with only three possible behaviours:
- provide the collection as is
- store the collection in memory (as a stop gap measure, don't want to
   keep this around in the long term as it's obviously a scalability
   killer)
- store the collection on disk
The logic used to apply the first or the third option (or the second
until the third is available) is simply to look at the
streaming requirements of the process that will use it during
chaining. The processes would know nothing about providers; adding
that would just increase the difficulty of implementing one. Generally
speaking, if something has to deal with providers, it should be
whatever orchestrates the execution of a chain of processes.
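
In other words, the chain orchestrator would pick the behaviour per input with
something as dumb as this (a sketch only, names made up):

  // Sketch of the per-input decision: pass the incoming collection through when
  // a single streaming scan is enough, otherwise stash it first (in memory for
  // now, in a temporary datastore later on).
  import org.geotools.feature.FeatureCollection;

  public class InputPreparer {
      public FeatureCollection prepare(FeatureCollection incoming, boolean multiplePasses) {
          if (!multiplePasses) {
              return incoming; // streamed straight from the remote request
          }
          return copyToTemporaryStore(incoming);
      }

      private FeatureCollection copyToTemporaryStore(FeatureCollection incoming) {
          // omitted: copy into an in-memory or disk backed store and return a
          // collection reading from it, cleaning up once the Execute is done
          throw new UnsupportedOperationException("left out of the sketch");
      }
  }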

If you foresee a pluggable, extensible API, you need to create
extension points both for the providers themselves and for the logic
it takes to decide which one to provide. That sounds like a lot of
work for something that I'm not really excited about, but I may be
mistaken. Can you provide more details?

streaming parser and just scan over the remote input stream once.
We could create a marker interface to identify those processes and act
accordingly.

The processes have a data structure describing their parameter requirements; it includes a Map that you can use for hints like the one you describe. So you could have a process that expects a feature collection and include a key in the metadata map that is true if the feature collection will be used more than once.

Yeah, that works better: this way a process can have multiple fc inputs
and say that only some of them will be scanned just once (think of
nested-loop-like processing: outer collection scanned just once,
inner collection scanned once for each feature of the outer one).
I like it.
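
For instance, the nested loop case would declare its two inputs differently,
roughly like this (key meaning as in the hint idea above, purely illustrative):

  // Illustration: per-input hints for a nested loop style process. The outer
  // collection can be streamed (one pass), the inner one must be re-readable.
  import java.util.HashMap;
  import java.util.Map;

  public class NestedLoopInputHints {
      public static Map<String, Boolean> multiplePassesByInput() {
          Map<String, Boolean> passes = new HashMap<String, Boolean>();
          passes.put("outer", Boolean.FALSE); // scanned just once
          passes.put("inner", Boolean.TRUE);  // scanned once per outer feature
          return passes;
      }
  }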

STORE=TRUE
For long running processes it makes lots of sense to actually
support storing. Yet just storing the result is kind of unsatisfactory,
my guess is that most of the client code would be interested in
being able to access the results of the computation using WMS/WFS.

This is not documented in the spec; they refer to making results available on FTP sites.

The FTP server is just an example of how the behaviour could be implemented; nowhere is it said that it has to be that.
The specification says that if an output has been requested to be
published "asReference" then "it should be stored by the process as a web-accessible resource", meaning you just have to provide a URL
allowing the resource to be retrieved, without enforcing a specific
protocol or location.

I could see doing this
for processes that are scheduled to occur at a set time (make a "daily" layer for example). However, why not
make a simple "publish" process - and people can use that at the end of their chain if they want the result made
available via WFS.

Good idea! This allows cherry-picking which of the outputs of
a complex process get stored on the server.

Well, that's it. Thoughts?

The per session stuff is only okay; the specification provides some guidance about how long a result will be available and so forth.

Then I must have missed something in the spec. Can you clarify how an
interactive client can perform processes and then access the
outputs with WMS/WFS, without the other users seeing them, and without
requiring permanent storage?
Let me draw a simple use case that I've seen in action at FOSS4G: an OL
client allows a user to select an origin and a destination on a road
network, a process is invoked to compute the route, and the result is
then drawn in OL. Suppose the result can be so big that it's not a good
idea to just return the GML (just change the kind of computation and
you'll easily find an example that can kill OL's vector drawing
abilities). How would you handle it without using per-session catalog
storage?

Cheers
Andrea

--
Andrea Aime
OpenGeo - http://opengeo.org
Expert service straight from the developers.