[Geoserver-devel] WPS integration back into the catalog: import, temp layers, dynamic layers

Hi all,
I would like to submit three ideas to the community that I'd like
to implement: one in the short term, another possibly in the
short term as well, and one for the future.

As you may know I've been working to revitalize WPS enough that it may
become an extension in GS 2.1 (at least, that's the plan).

One of the main attractions of a WPS in GeoServer is that the WPS
is not standalone, but has local services and a catalog at its
disposal.
So far that means a WPS process does not have to painfully gather
data from remote sources, but can get it directly from the local
catalog. This is great, but it's one-way: the outputs still go out
in some form (GML, shapefiles, JSON) that the client has to process
by itself.

I want to integrate back in the other direction by having an "import"
process that can be used at the end of a processing chain to save the
results back into the catalog, so that the result can then be rendered
by WMS and queried by WFS. This makes it possible to interact with the
GS WPS from lightweight clients without being limited to small data
sets (not to mention the fact that the result layer can be a legitimate
new layer to be used long term).

For vectors the import process would take (a rough sketch of the
process follows the list):
- the feature collection to be stored
- a layer name
- the workspace (optional, we can use the default)
- the target store (optional, on trunk we have a concept of default
   store). I'd say the target store must exist (and be either a DB, or a
   directory store)
- a style name (optional, we can use one of the built-ins)
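
To make this a bit more concrete, here is a rough sketch of what the
process body could look like, assuming the GeoTools annotated-process
style and the GeoServer catalog API; the class name and parameter names
are placeholders, and the retyping of the collection to the requested
layer name is glossed over:

import org.geoserver.catalog.Catalog;
import org.geoserver.catalog.CatalogBuilder;
import org.geoserver.catalog.DataStoreInfo;
import org.geoserver.catalog.FeatureTypeInfo;
import org.geoserver.catalog.LayerInfo;
import org.geoserver.catalog.WorkspaceInfo;
import org.geotools.data.DataStore;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureStore;
import org.geotools.process.factory.DescribeParameter;
import org.geotools.process.factory.DescribeProcess;
import org.geotools.process.factory.DescribeResult;

@DescribeProcess(title = "Import",
        description = "Stores a feature collection as a new catalog layer")
public class ImportProcess {

    private final Catalog catalog;

    public ImportProcess(Catalog catalog) {
        this.catalog = catalog;
    }

    @DescribeResult(name = "layerName", description = "qualified name of the new layer")
    public String execute(
            @DescribeParameter(name = "features") SimpleFeatureCollection features,
            @DescribeParameter(name = "name") String name,
            @DescribeParameter(name = "workspace", min = 0) String workspace,
            @DescribeParameter(name = "store", min = 0) String store,
            @DescribeParameter(name = "style", min = 0) String style) throws Exception {

        // fall back on the default workspace/store when they are not provided
        WorkspaceInfo ws = workspace != null
                ? catalog.getWorkspaceByName(workspace) : catalog.getDefaultWorkspace();
        DataStoreInfo storeInfo = store != null
                ? catalog.getDataStoreByName(ws, store) : catalog.getDefaultDataStore(ws);

        // write the features into the (pre-existing) target store; retyping the
        // collection so that its type name matches "name" is not shown here
        DataStore ds = (DataStore) storeInfo.getDataStore(null);
        ds.createSchema(features.getSchema());
        SimpleFeatureStore fs = (SimpleFeatureStore)
                ds.getFeatureSource(features.getSchema().getTypeName());
        fs.addFeatures(features);

        // register the new feature type and layer in the catalog
        CatalogBuilder builder = new CatalogBuilder(catalog);
        builder.setStore(storeInfo);
        FeatureTypeInfo featureType = builder.buildFeatureType(fs);
        LayerInfo layer = builder.buildLayer(featureType);
        if (style != null) {
            layer.setDefaultStyle(catalog.getStyleByName(style));
        }
        catalog.add(featureType);
        catalog.add(layer);

        return ws.getName() + ":" + name;
    }
}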

It's evident there is some overlap with restconfig, but a processing
chain will result in a feature collection, something we cannot
throw at REST (plus we don't want the data to travel back to the client
and then again to the server).
This would be a special case, I don't intend to actually go and redo
RESTConfig as a set of WPS processes (btw, if you have ideas of how
to integrate the two without having the data go round the world I'm
all ears).
At most it could be useful to add a RemoveLayer process that would
remove the layer and the underlying contents from the catalog, so
that a client can actually do the two most common things without having
to switch protocols (add a layer, remove a layer).

Oh, the process would actually run only if an admin-level user is
invoking it (yeah, it would be nice to have more granular administration
rights, but that's a can of worms I don't intend to open in my spare time)

So ok, this would be step one, and something I'd definitely like to do
this week.

For step two, let's consider that not all processed layers are meant to
live for a long time. You do some processing, _look_ at the results,
decide that some of the filtering, buffering distances, or the like is
not ok, and want to redo the process with different params and look at
the results again.
Of course this can be done by using the above Import process, but in
reality you probably don't want to:
a) have the layer be visible to the world
b) have to manage its lifecycle, since the layer is meant to be a
    throwaway anyway
So it would be nice to mark a layer as temporary and as private somehow.
Temporary means the layer (and the data backing it) would disappear into
thin air some time after it was last used; private could either mean:
- it would not be advertised in the caps (but anyone knowing its full
   name could access it)
- it would be protected using security so that only a certain user can
   access it
I would go for the first, since the second implies working on the
granular security can of worms.
Also, adding handling of temp layers sounds relatively easy to
implement: a little touch in the capabilities transformers, a scheduled
activity that periodically checks when the layer was last accessed,
and it's done. Perfect for spare-time coding (while the more complex
solutions can still be tackled with funding, when and if there is some).
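
Just to show how little code the cleanup side might need, here is a
rough sketch of that scheduled activity, assuming the temporary flag
and last-access time live in the layer metadata (the metadata keys and
the expiry value below are made up, and the capabilities filtering is
not shown):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.geoserver.catalog.Catalog;
import org.geoserver.catalog.LayerInfo;

/** Periodically drops temporary layers that have not been touched for a while. */
public class TempLayerReaper implements Runnable {

    // hypothetical metadata keys, not an existing convention
    static final String TEMPORARY = "wps.temporary";
    static final String LAST_ACCESS = "wps.lastAccess";
    static final long EXPIRY_MS = 24 * 60 * 60 * 1000L; // e.g. one day

    private final Catalog catalog;

    public TempLayerReaper(Catalog catalog) {
        this.catalog = catalog;
    }

    public void run() {
        long now = System.currentTimeMillis();
        for (LayerInfo layer : catalog.getLayers()) {
            Boolean temp = layer.getMetadata().get(TEMPORARY, Boolean.class);
            Long lastAccess = layer.getMetadata().get(LAST_ACCESS, Long.class);
            if (Boolean.TRUE.equals(temp) && lastAccess != null
                    && now - lastAccess > EXPIRY_MS) {
                // drop the layer and its resource from the catalog; removing the
                // underlying data from the store is glossed over here
                catalog.remove(layer);
                catalog.remove(layer.getResource());
            }
        }
    }

    /** Wires the reaper on a simple scheduler (Spring scheduling would do too). */
    public static void schedule(Catalog catalog) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleWithFixedDelay(new TempLayerReaper(catalog), 1, 1, TimeUnit.HOURS);
    }
}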

Step three is daydreaming. But let me dream for once. Say I have a
process that generates a layer. It does so in a way that the layer is
cached, but dynamic: it is computed, but the process used to compute it
is saved, and it's run every time the input data changes (well, maybe
driven by a certain polling so that a storm of changes does not result
in a storm of processing routines running).
Actually this should not be _so_ hard to implement. Add to the layer
definition four new entries in the metadata section:
- the full process definition (as xml)
- the last reprocessing date
- the recompute interval
- the last date an input was changed
Then add a scheduled job that:
- knows about all the dynamic layers
- knows about the sources and has a transaction listener on them
- runs the processes again when the output is stale, and uses
   transactions to swap the data under the layer's feet in a single shot
   (a rough sketch of the staleness check follows).
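
The staleness check itself would be trivial, something along these
lines (the dates and the interval coming from the metadata entries
listed above):

import java.util.Date;

/** Minimal staleness test for a dynamic layer, based on the metadata entries above. */
public class DynamicLayerChecker {

    /**
     * @param lastReprocessed when the process output was last recomputed
     * @param lastInputChange when an input of the process last changed (from the
     *        transaction listener), or null if it never changed
     * @param recomputeIntervalMs minimum interval between recomputations, so that
     *        a storm of changes does not result in a storm of recomputations
     */
    public static boolean needsRecompute(Date lastReprocessed, Date lastInputChange,
            long recomputeIntervalMs) {
        if (lastInputChange == null) {
            return false; // the inputs never changed since the last run
        }
        boolean stale = lastInputChange.after(lastReprocessed);
        boolean throttled = System.currentTimeMillis()
                - lastReprocessed.getTime() < recomputeIntervalMs;
        return stale && !throttled;
    }
}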

How does this sound? I think temporary/private layers are more important
than this (my dream is to one day have a GeoServer-based in-browser
client that can behave similarly to a desktop GIS).
On the other hand it seems the latter is doable without API changes,
which makes it a low-hanging fruit.

Opinions and comments... very welcome!

Cheers
Andrea

Very interesting problems. Some random thoughts for you inline. Thanks for your continued hard work on pushing WPS to the core.

On 10-07-19 7:56 AM, Andrea Aime wrote:

Hi all,
I would like to submit three ideas to the community that I'd like
to implement: one in the short term, another possibly in the
short term as well, and one for the future.

As you may know I've been working to revitalize WPS enough that it may
become an extension in GS 2.1 (at least, that's the plan).

One of the main attractions of a WPS in GeoServer is that the WPS
is not standalone, but has local services and a catalog at its
disposal.
So far that means a WPS process does not have to painfully gather
data from remote sources, but can get it directly from the local
catalog. This is great, but it's one-way: the outputs still go out
in some form (GML, shapefiles, JSON) that the client has to process
by itself.

I want to integrate back in the other direction by having an "import"
process that can be used at the end of a processing chain to save the
results back into the catalog, so that the result can then be rendered
by WMS and queried by WFS. This makes it possible to interact with the
GS WPS from lightweight clients without being limited to small data
sets (not to mention the fact that the result layer can be a legitimate
new layer to be used long term).

For vectors the import process would take:
- the feature collection to be stored
- a layer name
- the workspace (optional, we can use the default)
- the target store (optional, on trunk we have a concept of default
    store). I'd say the target store must exist (and be either a DB, or a
    directory store)
- a style name (optional, we can use one of the built-ins)

It's evident there is some overlap with restconfig, but a processing
chain will result in a feature collection, something we cannot
throw at REST (plus we don't want the data to travel back to the client
and then again to the server).
This would be a special case, I don't intend to actually go and redo
RESTConfig as a set of WPS processes (btw, if you have ideas of how
to integrate the two without having the data go round the world I'm
all ears).

Well there has been talk of integrating restconfig into the core for 2.1.x. So a hackish but relatively easy way to integrate could be to just use the restlet resources directly, sort of mocking up request objects.

A cleaner way would be to refactor restconfig into some reusable command-like objects, and add those objects to the core. I like this approach and think it could be useful in terms of code reuse even today, as there is some code overlap between the ui and restconfig. But it's not a trivial undertaking, to be sure.

At most it could be useful to add a RemoveLayer process that would
remove the layer and the underlying contents from the catalog, so
that a client can actually do the two most common things without having
to switch protocols (add a layer, remove a layer).

Oh, the process would actually run only if an admin-level user is
invoking it (yeah, it would be nice to have more granular administration
rights, but that's a can of worms I don't intend to open in my spare time)

So ok, this would be step one, and something I'd definitely like to do
this week.

For step two, let's consider that not all processed layers are meant to
live for a long time. You do some processing, _look_ at the results,
decide that some of the filtering, buffering distances, or the like is
not ok, and want to redo the process with different params and look at
the results again.
Of course this can be done by using the above Import process, but in
reality you probably don't want to:
a) have the layer be visible to the world
b) have to manage its lifecycle, since the layer is meant to be a
     throwaway anyway
So it would be nice to mark a layer as temporary and as private somehow.
Temporary means the layer (and the data backing it) would disappear
into thin air some time after it was last used; private could
either mean:
- it would not be advertised in the caps (but anyone knowing its full
    name could access it)
- it would be protected using security so that only a certain user can
    access it
I would go for the first, since the second implies working on the
granular security can of worms.
Also, adding handling of temp layers sounds relatively easy to
implement: a little touch in the capabilities transformers, a scheduled
activity that periodically checks when the layer was last accessed,
and it's done. Perfect for spare-time coding (while the more complex
solutions can still be tackled with funding, when and if there is some).

The temp layer idea makes sense, but having to explicitly skip over temp layer objects seems a bit error prone. There are a few places in code that have to iterate through layers: capabilities, ui, restconfig, etc. It would be a lot to update, and probably the first thing someone forgets to do when writing new code that iterates over layers.

Obviously some sort of thread local view of the catalog would not work since the layers need to live across requests. But I wonder if it could work in conjunction with a sort of token or key system. What I am thinking is that the temp layers are stored outside the core catalog, but can be engaged (thread locally) when the client specifies a particular token.
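
Something along these lines maybe (all the class names and the token
plumbing below are made up, just to illustrate the idea):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.geoserver.catalog.LayerInfo;

/**
 * Hypothetical registry of temp layers, living outside the core catalog.
 * A dispatcher callback/filter would look up the token (request parameter,
 * header, ...) and "engage" the matching set of layers for the current thread.
 */
public class TempLayerRegistry {

    /** token -> temp layers created under that token */
    private final ConcurrentHashMap<String, Map<String, LayerInfo>> layersByToken =
            new ConcurrentHashMap<String, Map<String, LayerInfo>>();

    /** layers visible to the current request, if a token was provided */
    private static final ThreadLocal<Map<String, LayerInfo>> ACTIVE =
            new ThreadLocal<Map<String, LayerInfo>>();

    public void register(String token, LayerInfo layer) {
        Map<String, LayerInfo> layers = layersByToken.get(token);
        if (layers == null) {
            layers = new ConcurrentHashMap<String, LayerInfo>();
            Map<String, LayerInfo> prev = layersByToken.putIfAbsent(token, layers);
            if (prev != null) {
                layers = prev;
            }
        }
        layers.put(layer.getName(), layer);
    }

    /** Called at the start of the request, once the token has been extracted. */
    public void engage(String token) {
        ACTIVE.set(layersByToken.get(token));
    }

    /** Called by a catalog wrapper when a layer lookup misses the core catalog. */
    public static LayerInfo lookup(String name) {
        Map<String, LayerInfo> active = ACTIVE.get();
        return active != null ? active.get(name) : null;
    }

    /** Called at the end of the request. */
    public void release() {
        ACTIVE.remove();
    }
}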

How does the WPS send back the info for a temp layer to a client? If it is a full OGC request link, like a GetFeature or GetMap request, the token could be used relatively transparently... anyways, just a thought.

Step three is daydreaming. But let me dream for once. Say I have a
process that generates a layer. It does so in a way that the layer is
cached, but dynamic: it is computed, but the process used to compute it
is saved, and it's run every time the input data changes (well, maybe
driven by a certain polling so that a storm of changes does not result
in a storm of processing routines running).
Actually this should not be _so_ hard to implement. Add to the layer
definition four new entries in the metadata section:
- the full process definition (as xml)
- the last reprocessing date
- the recompute interval
- the last date an input was changed
Then add a scheduled job that:
- knows about all the dynamic layers
- knows about the sources and has a transaction listener on them
- runs the processes again when the output is stale, and uses
transactions to swap the data under the layer's feet in a single shot.

Yeah, I agree, I don't think this is too far-fetched. Might be cool to try and drag in the H2 datastore as the temporary storage... I have been working on improving performance lately and hope to get it usable as a first-class datastore in GeoServer.

How does this sound? I think temporary/private layers are more important
than this (my dream is to one day have a GeoServer-based in-browser
client that can behave similarly to a desktop GIS).
On the other hand it seems the latter is doable without API changes,
which makes it a low-hanging fruit.

Opinions and comments... very welcome!

Cheers
Andrea


--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

Justin Deoliveira wrote:

It's evident there is some overlap with restconfig, but a processing
chain will result in a feature collection, something we cannot
throw at REST (plus we don't want the data to travel back to the client
and then again to the server).
This would be a special case, I don't intend to actually go and redo
RESTConfig as a set of WPS processes (btw, if you have ideas of how
to integrate the two without having the data go round the world I'm
all ears).

Well there has been talk of integrating restconfig into the core for 2.1.x. So a hackish but relatively easy way to integrate could be to just use the restlet resources directly, sort of mocking up request objects.

I see that. However RESTConfig cannot take an arbitrary feature collection
and store it into a target store; the most it can do is take and configure
a shapefile, right?

It may be easier to go the other way around: add a process that can
do an import of a feature collection into a target store, and then
use it back in RESTConfig to add this new functionality (I guess it
would be a POST against the target store to create a new layer,
or something like that)

For step two, let's consider that not all processed layers are meant to
live for a long time. You do some processing, _look_ at the results,
decide that some of the filtering, buffering distances, or the like is
not ok, and want to redo the process with different params and look at
the results again.
Of course this can be done by using the above Import process, but in
reality you probably don't want to:
a) have the layer be visible to the world
b) have to manage its lifecycle, since the layer is meant to be a
     throwaway anyway
So it would be nice to mark a layer as temporary and as private somehow.
Temporary means the layer (and the data backing it) would disappear
into thin air some time after it was last used; private could
either mean:
- it would not be advertised in the caps (but anyone knowing its full
    name could access it)
- it would be protected using security so that only a certain user can
    access it
I would go for the first, since the second implies working on the
granular security can of worms.
Also, adding handling of temp layers sounds relatively easy to
implement: a little touch in the capabilities transformers, a scheduled
activity that periodically checks when the layer was last accessed,
and it's done. Perfect for spare-time coding (while the more complex
solutions can still be tackled with funding, when and if there is some).

The temp layer idea makes sense, but having to explicitly skip over temp layer objects seems a bit error prone. There are a few places in code that have to iterate through layers: capabilities, ui, restconfig, etc. It would be a lot to update, and probably the first thing someone forgets to do when writing new code that iterates over layers.

Obviously some sort of thread local view of the catalog would not work since the layers need to live across requests. But I wonder if it could work in conjunction with a sort of token or key system. What I am thinking is that the temp layers are stored outside the core catalog, but can be engaged (thread locally) when the client specifies a particular token.

It could work. The token could be a cookie. And in fact we already have
one, the session cookie. A session-bound catalog? Could be a way.
Maybe we want something a bit longer lived, something that keeps
the temp layers around for many hours if not a few days (think you're
working against GS on Friday, come back on Sunday and want to start back
from where you left off without having to wait half an hour for a layer
to be recomputed).

Though... going there it would make more sense to have an explicit
per-user local catalog.

How does the WPS send back the info for a temp layer to a client? If it is a full OGC request link, like a GetFeature or GetMap request, the token could be used relatively transparently... anyways, just a thought.

I think the "import" process would return the layer name, or
accept a layer name from the user.

Step three is daydreaming. But let me dream for once. Say I have a
process that generates a layer. It does so in a way that the layer is
cached, but dynamic: it is computed, but the process used to compute it
is saved, and it's run every time the input data changes (well, maybe
driven by a certain polling so that a storm of changes does not result
in a storm of processing routines running).
Actually this should not be _so_ hard to implement. Add to the layer
definition four new entries in the metadata section:
- the full process definition (as xml)
- the last reprocessing date
- the recompute interval
- the last date an input was changed
Then add a scheduled job that:
- knows about all the dynamic layers
- knows about the sources and has a transaction listener on them
- runs the processes again when the output is stale, and uses
transactions to swap the data under the layer's feet in a single shot.

Yeah, I agree, I don't think this is too far-fetched. Might be cool to try and drag in the H2 datastore as the temporary storage... I have been working on improving performance lately and hope to get it usable as a first-class datastore in GeoServer.

Yep, actually I want to start using it in WPS sooner rather than later.
Sextante needs a place to store the results of a process, and shapefiles
are bad in a number of ways: mostly, they don't allow for geometryless
storage and they will mangle the collection schema (the geometry is named
the_geom, the attributes are uppercased, the strings cannot be longer
than 256 chars, and so on).

I was thinking of having a single H2 db and creating a different temporary
schema for each temp layer, so that I'm not even forced to change the
feature type name (and can use a random schema name instead).
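
In GeoTools terms that would boil down to something like this (I'm
quoting the H2 factory connection params from memory, so treat the
"schema" one as an assumption; worst case the schema can be set on
the JDBCDataStore directly via setDatabaseSchema):

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;

public class TempH2Store {

    /** Connects to the single shared H2 db, using a separate schema per temp layer. */
    public static DataStore connect(String dataDir, String schemaName) throws Exception {
        Map<String, Serializable> params = new HashMap<String, Serializable>();
        params.put("dbtype", "h2");                    // gt-jdbc-h2 factory
        params.put("database", dataDir + "/temp/wps"); // single shared database
        params.put("schema", schemaName);              // e.g. a random, per-layer schema
        return DataStoreFinder.getDataStore(params);
    }
}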

Cheers
Andrea