[Geoserver-devel] Process annotations improvements (with a WPS bit)

Hi,
writing processes over and over let me to see some deficiencies in the
WPS annotations that
I would like to clear out. Some are general to processes, some are WPS
usage specific
(that's why I'm cross posting to gs-devevl).

One thing I see I do over and over at the start of a process is value
defaulting and validation.

For params that are not mandatory but have a default value, it would
be nice to just stick
the default value in the annotation:

@DescribeParameter(name = "multiplier", min = 0, default = 1d) double multiplier

For params that need validation it would be nice to have a set of
common validators so that we don't
have to roll the validation logic in the process code:

@DescribeParameter(name = "myparam", validator = new
RangeValidator(100, 200)) double myparam)

I would really hate to have to roll our own in this case, there is
plenty of validation packages
around on the web, but all seem to be focusing quite specifically on a
particular use case,
e.g., validating forms, validating a Swing GUI, or carrying around a
lot of extras. Sigh...
I guess we could roll our own built on top of commons-validator
routines, to avoid duplicating
really everything? Or use some convention and reflection, something like:

new Validator("double", "isInRange", 0, 100)

to end up calling DoubleValidator.isInRange(value, min, max).
Hmm... the above could be also done using code generation against all
the routines to get a
set of typesafe validation objects, or a factory exposing all the
validators, Validators.doubleIsInRange(0, 100) -> Validator

Moving on to something WPS specific, we need to be able to declare
what mime types the input files/streams
we are going to accept, and which ones we are going to produce.
Normally we avoid this issue in GeoServer by trading objects (e.g.,
RenderedImage, Coverage, FeatureCollection)
that specialized utility classes (ProcessParameterIO) turn into a
serialized format with a specific mime type.
But there are processes around that really do trade files, and
specific formats of them for what is worth.

For files that are inputs we could have something like:

@DescribeBinaryInput(..., mimeTypes = new String {image/png,
image/jpeg, ...} ...)

The mimeType would have then to be carried around by Parameter, which
has a metadata map
that we cannot read... can we add a getMetadata(key) method in there
to access some random
metadata?

Dealing with the output formats instead is going to be loads of fun.
We can have a
@DescribeBinaryResult( ... mimeTypes = new String {image/png,
image/jpeg, ...}
and that would allow us to run DescribeProcess, but when it comes to
executing the fun starts,
since the user will choose one of the output formats as part of the
_output_, not as part of
the input:

<wps:RawDataOutput mimeType="image/png">
<ows:Identifier>theImage</ows:Identifier>
</wps:RawDataOutput>
</wps:ResponseForm>

How is the process going to know that the user requested image/png as
the output format, since
that's not part of the inputs?
All I can think if being explicit about which of the input params is
going to be mapped to such choice,
and avoid exposing such param in the DescribeProcess:

@DescribeBinaryResult( ... mapToInput = "outputFormat")
@DescribeParameter(name="outputFormat") String outputFormat

Any better idea?

One final thing that happened to me while writing processes is
realizing that a process
will take long, that it cannot be written in a streaming manner for
some reason, and thus
really wanting the process to be only run in asynch mode.
Ideally it would be nice to have the following:

@DescribeProcess(... synch=false, asynch=true)

However I don't see a way to expose this from ProcessFactory, and it's
indeed something
that would be pretty WPS specific. Maybe we should augment the ProcessFactory
to return a open ended metadata map about the process?

Map<String, Object> getProcessMetadata(Name process)

Feedback welcomed.

Cheers
Andrea

--
Ing. Andrea Aime
GeoSolutions S.A.S.
Tech lead

Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy

phone: +39 0584 962313
fax: +39 0584 962313
mob: +39 339 8844549

http://www.geo-solutions.it
http://geo-solutions.blogspot.com/
http://www.youtube.com/user/GeoSolutionsIT
http://www.linkedin.com/in/andreaaime
http://twitter.com/geowolf

Andrea,

Funny you bring this up, very timely as I’ve been looking at processes the past few days.

On Fri, Jun 8, 2012 at 6:06 AM, Andrea Aime <andrea.aime@anonymised.com> wrote:

Hi,
writing processes over and over let me to see some deficiencies in the
WPS annotations that
I would like to clear out. Some are general to processes, some are WPS
usage specific
(that’s why I’m cross posting to gs-devevl).

One thing I see I do over and over at the start of a process is value
defaulting and validation.

For params that are not mandatory but have a default value, it would
be nice to just stick
the default value in the annotation:

@DescribeParameter(name = “multiplier”, min = 0, default = 1d) double multiplier

For params that need validation it would be nice to have a set of
common validators so that we don’t
have to roll the validation logic in the process code:

@DescribeParameter(name = “myparam”, validator = new
RangeValidator(100, 200)) double myparam)

I would really hate to have to roll our own in this case, there is
plenty of validation packages
around on the web, but all seem to be focusing quite specifically on a
particular use case,
e.g., validating forms, validating a Swing GUI, or carrying around a
lot of extras. Sigh…
I guess we could roll our own built on top of commons-validator
routines, to avoid duplicating
really everything? Or use some convention and reflection, something like:

new Validator(“double”, “isInRange”, 0, 100)

to end up calling DoubleValidator.isInRange(value, min, max).
Hmm… the above could be also done using code generation against all
the routines to get a
set of typesafe validation objects, or a factory exposing all the
validators, Validators.doubleIsInRange(0, 100) → Validator

Moving on to something WPS specific, we need to be able to declare
what mime types the input files/streams
we are going to accept, and which ones we are going to produce.
Normally we avoid this issue in GeoServer by trading objects (e.g.,
RenderedImage, Coverage, FeatureCollection)
that specialized utility classes (ProcessParameterIO) turn into a
serialized format with a specific mime type.
But there are processes around that really do trade files, and
specific formats of them for what is worth.

For files that are inputs we could have something like:

@DescribeBinaryInput(…, mimeTypes = new String {image/png,
image/jpeg, …} …)

The mimeType would have then to be carried around by Parameter, which
has a metadata map
that we cannot read… can we add a getMetadata(key) method in there
to access some random
metadata?

Dealing with the output formats instead is going to be loads of fun.
We can have a
@DescribeBinaryResult( … mimeTypes = new String {image/png,
image/jpeg, …}
and that would allow us to run DescribeProcess, but when it comes to
executing the fun starts,
since the user will choose one of the output formats as part of the
output, not as part of
the input:

<wps:RawDataOutput mimeType=“image/png”>
ows:IdentifiertheImage</ows:Identifier>
</wps:RawDataOutput>
</wps:ResponseForm>

How is the process going to know that the user requested image/png as
the output format, since
that’s not part of the inputs?
All I can think if being explicit about which of the input params is
going to be mapped to such choice,
and avoid exposing such param in the DescribeProcess:

@DescribeBinaryResult( … mapToInput = “outputFormat”)
@DescribeParameter(name=“outputFormat”) String outputFormat

Any better idea?

This one doesn’t look terrible, from an implementer point of view it’s quite clear what it does.

One final thing that happened to me while writing processes is
realizing that a process
will take long, that it cannot be written in a streaming manner for
some reason, and thus
really wanting the process to be only run in asynch mode.
Ideally it would be nice to have the following:

@DescribeProcess(… synch=false, asynch=true)

I really like this idea, and had been wondering about it. I’m not sure if you want synch and asynch, perhaps only keep asynch with a default value of false?

However I don’t see a way to expose this from ProcessFactory, and it’s
indeed something
that would be pretty WPS specific. Maybe we should augment the ProcessFactory
to return a open ended metadata map about the process?

Map<String, Object> getProcessMetadata(Name process)

Feedback welcomed.

Cheers
Andrea


Ing. Andrea Aime
GeoSolutions S.A.S.
Tech lead

Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy

phone: +39 0584 962313
fax: +39 0584 962313
mob: +39 339 8844549

http://www.geo-solutions.it
http://geo-solutions.blogspot.com/
http://www.youtube.com/user/GeoSolutionsIT
http://www.linkedin.com/in/andreaaime
http://twitter.com/geowolf


Live Security Virtual Conference
Exclusive live event will cover all the ways today’s security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/


GeoTools-Devel mailing list
GeoTools-Devel@anonymised.coms.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geotools-devel


Juan Marín Otero
GIS Consultant

-------Visita mi blog en---------------------
http://guachintoneando.blogspot.com

On Fri, Jun 8, 2012 at 1:40 PM, Juan Marín Otero
<juan.marin.otero@anonymised.com> wrote:

One final thing that happened to me while writing processes is
realizing that a process
will take long, that it cannot be written in a streaming manner for
some reason, and thus
really wanting the process to be only run in asynch mode.
Ideally it would be nice to have the following:

@DescribeProcess(... synch=false, asynch=true)

If a process runs fast

I really like this idea, and had been wondering about it. I'm not sure if
you want synch and asynch, perhaps only keep asynch with a default value of
false?

Hmm... this matches what the WPS protocol exposes, there is no way to
say a process cannot be run in a synchronous way, one can only say
that asynch is supported or not.

Yet, the GeoServer WPS is transparently supporting asynch, so it comes at
no extra cost for the implementor, and little extra cost for the server.

What I really want is a way to deny synchronous execution of processes that
take so long it's unreasonable to think a HTTP call can be kept up that
long.

Long story short, I believe asynch = true and synch = true is a better fit for
the general case, and even if we cannot advertise it, I'd like to have the
option to throw an exception if a process is called in synch mode when
it has synch = false

Cheers
Andrea

--
Ing. Andrea Aime
GeoSolutions S.A.S.
Tech lead

Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy

phone: +39 0584 962313
fax: +39 0584 962313
mob: +39 339 8844549

http://www.geo-solutions.it
http://geo-solutions.blogspot.com/
http://www.youtube.com/user/GeoSolutionsIT
http://www.linkedin.com/in/andreaaime
http://twitter.com/geowolf

I like Juan’s goal of reducing the number of logical combinations that have to be considered. It seems like the WPS spec has it backwards. What you really want to know is whether a process can be called as synch - any process could be called as asynch, with the only repercussion some slight inefficiency for fast processes.

So maybe provide both synch and asynch annotation attributes, but default them both to true, so that the only thing that really needs to be supplied is synch=false for long-running processes? I’m not sure when you would ever want to say asynch=false…

As for providing this metadata in a Map, is a more structured alternative to provide a specific class carrying the various metadata values? I guess Maps are used elsewhere in GT/GS for this kind of thing, but in this case where the parameters are all known in advance it seems like a type-checked class would give more support.

On Fri, Jun 8, 2012 at 5:03 AM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Fri, Jun 8, 2012 at 1:40 PM, Juan Marín Otero
<juan.marin.otero@anonymised.com.> wrote:

One final thing that happened to me while writing processes is
realizing that a process
will take long, that it cannot be written in a streaming manner for
some reason, and thus
really wanting the process to be only run in asynch mode.
Ideally it would be nice to have the following:

@DescribeProcess(… synch=false, asynch=true)

If a process runs fast

I really like this idea, and had been wondering about it. I’m not sure if
you want synch and asynch, perhaps only keep asynch with a default value of
false?

Hmm… this matches what the WPS protocol exposes, there is no way to
say a process cannot be run in a synchronous way, one can only say
that asynch is supported or not.

Yet, the GeoServer WPS is transparently supporting asynch, so it comes at
no extra cost for the implementor, and little extra cost for the server.

What I really want is a way to deny synchronous execution of processes that
take so long it’s unreasonable to think a HTTP call can be kept up that
long.

Long story short, I believe asynch = true and synch = true is a better fit for
the general case, and even if we cannot advertise it, I’d like to have the
option to throw an exception if a process is called in synch mode when
it has synch = false

Martin Davis
OpenGeo - http://opengeo.org
Expert service straight from the developers.

On Sat, Jun 9, 2012 at 9:21 AM, Martin Davis <mdavis@anonymised.com> wrote:

I like Juan's goal of reducing the number of logical combinations that have
to be considered. It seems like the WPS spec has it backwards. What you
really want to know is whether a process can be called as synch - *any*
process could be called as asynch, with the only repercussion some slight
inefficiency for fast processes.

Indeed. Unfortunately the writers created a spec for a network protocol
without considering the first of the distributed computing
(http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing):
the network is _not_ reliable, practically speaking not to the point that you
can assume a http connection will stay up for hours waiting for a long
process to complete.

So maybe provide both synch and asynch annotation attributes, but default
them both to true, so that the only thing that really needs to be supplied
is synch=false for long-running processes? I'm not sure when you would ever
want to say asynch=false...

Yep, I agree. asynch=false would make sense for a WPS server that gives
the process writer the onus to handle an asynch process, but GeoServer
makes that transparent.

One borderline case would be a process that has no support for progress
notification, yet can take long and is not streaming.
I'd argue such process is likely broken, but implementors might argue back
that adding progress support is too much of a overhead.

As for providing this metadata in a Map, is a more structured alternative to
provide a specific class carrying the various metadata values? I guess Maps
are used elsewhere in GT/GS for this kind of thing, but in this case where
the parameters are all known in advance it seems like a type-checked class
would give more support.

That's the key, knowing in advance. In the rest of GeoTools we did not assume
we could know in advance, and indeed new needs have been popping up
over time.
One could say that a class can get new fields, and I would agree, but it you
have a need that the community does not share you're forced into a fork,
a map is more lenient in that direction.
I'd prefer an open ended map with well known keys for the reasons above

Cheers
Andrea

--
Ing. Andrea Aime
GeoSolutions S.A.S.
Tech lead

Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy

phone: +39 0584 962313
fax: +39 0584 962313
mob: +39 339 8844549

http://www.geo-solutions.it
http://geo-solutions.blogspot.com/
http://www.youtube.com/user/GeoSolutionsIT
http://www.linkedin.com/in/andreaaime
http://twitter.com/geowolf

On Sat, Jun 9, 2012 at 1:00 AM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Sat, Jun 9, 2012 at 9:21 AM, Martin Davis <mdavis@anonymised.com> wrote:

So maybe provide both synch and asynch annotation attributes, but default
them both to true, so that the only thing that really needs to be supplied
is synch=false for long-running processes? I’m not sure when you would ever
want to say asynch=false…

Yep, I agree. asynch=false would make sense for a WPS server that gives
the process writer the onus to handle an asynch process, but GeoServer
makes that transparent.

One borderline case would be a process that has no support for progress
notification, yet can take long and is not streaming.
I’d argue such process is likely broken, but implementors might argue back
that adding progress support is too much of a overhead.

Actually this kind of thing is not too uncommon in geometric algorithms. There are algorithms for which it is hard or inefficient to determine the progress (eg. as a percentage), but which take a long time and cannot easily be streamed. Sometimes there are alternatives which can be streamed, but the implementation is either unknown or a research problem. A classic example is Delaunay Triangulation over input of unknown size (eg streamed). Buffer is another example.

But surely these still be called as asynch, but with no feedback on intermediate progress? And with no ability to stream the output it would be better to make the call asynch (for the reason mentioned of unreliable network).

Anyway, no matter, the ability to say asynch=false should still be provided, since the spec needs it.

As for providing this metadata in a Map, is a more structured alternative to
provide a specific class carrying the various metadata values? I guess Maps
are used elsewhere in GT/GS for this kind of thing, but in this case where
the parameters are all known in advance it seems like a type-checked class
would give more support.

That’s the key, knowing in advance. In the rest of GeoTools we did not assume
we could know in advance, and indeed new needs have been popping up
over time.
One could say that a class can get new fields, and I would agree, but it you
have a need that the community does not share you’re forced into a fork,
a map is more lenient in that direction.
I’d prefer an open ended map with well known keys for the reasons above

Fair enough. I suppose as long as the keys are named constants it’s still possible to find where they are set and read.

Was there ever any thought about using named subclasses of HashMap to make usage context clear?

Martin Davis
OpenGeo - http://opengeo.org
Expert service straight from the developers.

On Sun, Jun 10, 2012 at 5:34 AM, Martin Davis <mdavis@anonymised.com> wrote:

One borderline case would be a process that has no support for progress
notification, yet can take long and is not streaming.
I'd argue such process is likely broken, but implementors might argue back
that adding progress support is too much of a overhead.

Actually this kind of thing is not too uncommon in geometric algorithms.
There are algorithms for which it is hard or inefficient to determine the
progress (eg. as a percentage), but which take a long time and cannot easily
be streamed. Sometimes there are alternatives which can be streamed, but
the implementation is either unknown or a research problem. A classic
example is Delaunay Triangulation over input of unknown size (eg streamed).
Buffer is another example.

But surely these still be called as asynch, but with no feedback on
intermediate progress? And with no ability to stream the output it would be
*better* to make the call asynch (for the reason mentioned of unreliable
network).

Yep, there is no need to have progress listener support to make a process
asynch.
But it'll tell you, processes that take longer than 10-30 seconds get users
pretty upset in case there is no progress listener attached.
Actually that's another fallacy in the WPS spec, it would be nice to know
that a process takes long but has no way to report progress, as the front
end GUI for it (if there is one) could use one of those undeterminate
progress reports instead of being pegged at 0% all the time...

One could say that a class can get new fields, and I would agree, but it
you
have a need that the community does not share you're forced into a fork,
a map is more lenient in that direction.
I'd prefer an open ended map with well known keys for the reasons above

Fair enough. I suppose as long as the keys are named constants it's still
possible to find where they are set and read.

Was there ever any thought about using named subclasses of HashMap to make
usage context clear?

I don't believe we ever did that. You can color a map blue with yellow stripes,
but it's still a map :-p

Cheers
Andrea

--
Ing. Andrea Aime
GeoSolutions S.A.S.
Tech lead

Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy

phone: +39 0584 962313
fax: +39 0584 962313
mob: +39 339 8844549

http://www.geo-solutions.it
http://geo-solutions.blogspot.com/
http://www.youtube.com/user/GeoSolutionsIT
http://www.linkedin.com/in/andreaaime
http://twitter.com/geowolf