Hi all,
during FOSS4G 2008 and the associated code sprint Justin and I had a
look at the WPS specification and the current GeoServer community
module, looking for a way to move it forward.
This mail summarizes some of our findings and ideas. Don't read it as
a "will do all of this by next month", but more as a set of wishful
thoughts that I will probably try to turn into reality, slowly,
in my spare time.
FIX EXISTING ISSUES
We explored the current WPS module and found a number of issues
that we'd like to address:
- lack of unit testing
- lack of XML request support for GetCapabilities, and lack of
KVP support for Execute (neither is mandatory, but not having
the choice of KVP for Execute is actually quite an annoyance,
as KVP is often perceived as more immediate to use)
- the transmuters API seems like overkill: it requires a class per
handled type, and class-level javadoc is missing.
Justin has implemented on his notebook a tentative replacement that
does the same job as the existing 12 classes with 3 simple ones, and
still leaves the door open to handling raster data (whilst the
current complex data handler simply assumes the data to be parsed is
XML, the input could actually be anything, such as a GeoTIFF or a
compressed shapefile).
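Just to give an idea of the direction, something along these lines
(names and signatures are made up by me on the spot, not Justin's
actual code):

  // Rough sketch, hypothetical names: a tiny hierarchy instead of one
  // transmuter class per handled type.
  public abstract class ProcessParameterHandler {
      /** true if this handler can parse/encode the given parameter type */
      public abstract boolean canHandle(Class type);
  }

  public abstract class LiteralHandler extends ProcessParameterHandler {
      /** parse a literal value out of its string representation */
      public abstract Object parse(String value) throws Exception;
  }

  public abstract class ComplexHandler extends ProcessParameterHandler {
      /** MIME type this handler deals with (GML, GeoTIFF, zipped shapefile, ...) */
      public abstract String getMimeType();

      /** parse from a raw stream, so inputs are not forced to be XML */
      public abstract Object decode(java.io.InputStream in) throws Exception;

      /** encode a result back onto a stream */
      public abstract void encode(Object value, java.io.OutputStream out) throws Exception;
  }

The key point is that complex data gets decoded from a raw stream, so a
GeoTIFF or a zipped shapefile becomes as easy to support as GML.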
Once this is fixed, we can try to hook up reading feature collections
from remote servers.
AVOID THE MIDDLE MAN IF POSSIBLE
We would like to avoid GML encoding/decoding (or coverage
encoding/decoding) when possible, and directly access the datastores if
the source is local.
We can try to recognize that from the host, port and first part of the
path, also dealing with proxies and the like, but as a first step we can
just have a special URL that is resolved to the current GeoServer,
something like:
.?service=wfs&version=1.0.0&request=GetFeature&typeName=topp:states
(think of it as a relative link: the WPS request is posted to
http://host:port/geoserver/ows, so "." would represent exactly that URL).
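Recognizing that special URL would then be trivial, something like
(just a sketch, the real check will also have to compare against the
full local base URL at some point):

  public class LocalReferences {
      // Sketch: treat "." relative references as requests against this same GeoServer
      public static boolean isLocal(String href) {
          return href != null
              && (href.equals(".") || href.startsWith(".?") || href.startsWith("./"));
      }
  }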
Once we recognize that the request is local, we want to leverage the
work the Dispatcher is already doing up to a point, that is, parsing the
request, executing it, and returning the results without actually
encoding them. For example, WFS returns a number of FeatureCollection
objects out of GetFeature, and WCS does the same with coverages. We can
then plug those feature collections directly into the process, allowing
for more efficient handling of large amounts of data.
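In code that could look more or less like the following (class names and
signatures quoted from memory, to be double checked against trunk, but
the idea is to call the WFS service bean directly and grab the feature
collections before any GML encoding kicks in):

  import net.opengis.wfs.FeatureCollectionType;
  import net.opengis.wfs.GetFeatureType;
  import org.geoserver.wfs.WebFeatureService;
  import org.geotools.feature.FeatureCollection;

  // Sketch: run a local GetFeature and hand the raw collections to the
  // process, skipping GML encoding/decoding entirely
  public class LocalWfsAccess {
      private final WebFeatureService wfs;

      public LocalWfsAccess(WebFeatureService wfs) {
          this.wfs = wfs;
      }

      public FeatureCollection getFeatures(GetFeatureType request) throws Exception {
          FeatureCollectionType result = wfs.getFeature(request);
          // one collection per query in the request, here we just take the first
          return (FeatureCollection) result.getFeature().get(0);
      }
  }

The WCS story would be analogous, just getting back coverage objects
instead of feature collections.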
SCALE UP WITH REAL REMOTE REQUESTS
If the request is really remote, we have to be prepared to parse
whatever comes in.
Now, generally speaking a process will need to access the contents of a
feature collection multiple times, so we either need to:
- store the feature collection in memory, hoping it's not too big
- store the feature collection in a persistent datastore (PostGIS,
SpatiaLite, even a shapefile would do), return a feature collection
backed by that datastore, and eventually clean up the datastore when
the Execute is done (meaning we need some extension allowing
after-the-fact cleanup of those collections).
For starters we'll follow the first path; the second one is its natural
evolution. A detail that bugs me is that the second option is
inefficient for those processes that only need to access the feature
collection once, streaming over it: for those it would be fine to use a
streaming parser and just scan the remote input stream once.
We could create a marker interface to identify those processes and act
accordingly.
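The marker interface itself would be trivial, something like (name to
be bikeshedded):

  // A process implementing this promises to read each feature collection
  // input at most once, front to back, so the WPS can feed it a collection
  // streaming straight off the remote input instead of storing it first.
  public interface StreamingProcess {
      // no methods, it is just a marker
  }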
If the remote input happens to be a coverage instead, we can
just download it as-is, put it into a temporary directory, and
create a coverage out of it. Actually, for some processes we could
again decide to pre-process it a little, such as tiling it, in order
to get memory-efficient processing as opposed to reading the
coverage as a single big block of memory.
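The plain "download it as is" part is straightforward, along these
lines (no error handling, tiling or format sniffing yet):

  import java.io.File;
  import java.io.FileOutputStream;
  import java.io.InputStream;
  import java.io.OutputStream;
  import java.net.URL;

  // Sketch: dump the remote coverage into a temp file as-is; a coverage
  // reader can then be built on top of it (and eventually re-tile it)
  public class CoverageDownloader {
      public static File download(URL source) throws Exception {
          File file = File.createTempFile("wps", ".coverage");
          InputStream in = source.openStream();
          OutputStream out = new FileOutputStream(file);
          try {
              byte[] buffer = new byte[8192];
              int read;
              while ((read = in.read(buffer)) != -1) {
                  out.write(buffer, 0, read);
              }
          } finally {
              in.close();
              out.close();
          }
          return file;
      }
  }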
STORE=TRUE
For long running processes it makes a lot of sense to actually
support storing. Yet just storing the result is kind of unsatisfactory;
my guess is that most client code would be interested in being able to
access the results of the computation using WMS/WFS.
Following this line of thinking it would be nice to allow
two store operation modes:
* plain storage of the outputs on the file system (as the standard
requires)
* storing the results in a datastore/coverage, registering the result
in the catalog, and having the WPS response return a GetFeature
or a GetCoverage request as the "web accessible URL" that store
requires us to put in the response
Given that most computations have a temporary or per-user nature,
it would be nice if we could decide whether to store the results in
the full catalog, or in a per-session catalog that only the current
user sees and that eventually expires, killing the registered
resources along with it.
This could be done by adding a per-session catalog to be used
side by side with the "global" one, and having the normal services
use a wrapper that looks for data first in the session catalog, and
then in the global one.
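Very roughly (this is not the real Catalog API, just the lookup order
the wrapper would implement for every catalog method):

  import java.util.HashMap;
  import java.util.Map;

  // Sketch only: the real thing would implement the Catalog interface and
  // delegate every lookup the same way, session catalog first, then global
  public class SessionFirstCatalog {
      private final Map<String, Object> sessionResources = new HashMap<String, Object>();
      private final Map<String, Object> globalResources;

      public SessionFirstCatalog(Map<String, Object> globalResources) {
          this.globalResources = globalResources;
      }

      public Object lookup(String name) {
          Object resource = sessionResources.get(name);
          return resource != null ? resource : globalResources.get(name);
      }

      /** results of a WPS execution get registered here and die with the session */
      public void registerSessionResource(String name, Object resource) {
          sessionResources.put(name, resource);
      }
  }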
Well, that's it. Thoughts?
Cheers
Andrea