On Thu, Oct 16, 2014 at 11:42 PM, Brad Hards <bradh@anonymised.com> wrote:
>> The disk space management problem cannot be solved: the results need to
>> stay available for a while, and the spec does not say that they will be
>> deleted the first time you request them. And some processes compute data
>> while they are encoding the output (known as streaming processes: they
>> return feature collections that compute the results as you pull them, or
>> grid coverages backed by JAI operations that compute tiles as you request
>> them), and we cannot keep the database connections open until the client
>> asks for the results either.
>> Also, WPS clients can do something crazy like sending over a 2GB request
>> with the input data embedded into it, and then ask for ancestry in the
>> response, which requires repeating the whole request in the response, so
>> the request needs to be stored on disk somewhere.

> OK. My question is whether it needs to be on *shared* disk?

No, it does not need to, but it's a valid solution.
One of the mandates of a GSIP is that whatever is presented can be worked on
fully by the party presenting it, without requiring outside help and without
leaving the code in a half-done state. This requires scope control, i.e.,
the proposal either stays within the resources available to implement it, or
it's not done at all.
At the same time, it has to be general enough that it can grow beyond the
limits of the initial funding/timelines.
My hope is that the ProcessArtifactsStore interface is general enough to
allow future extension beyond the shared file system, and can be implemented
with some other technology:
https://github.com/geoserver/geoserver/wiki/GSIP%20119%20WPS%20clustering%20of%20asynchronous%20requests#processartifactsstore
If you think it is too limited, then yes, we have a problem that I have to
address before the proposal can pass. But if you simply don't like the
shared filesystem approach, then it's up to you to provide the resources to
implement a different solution; I just need to make sure you can implement
it if you want to.
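To make the discussion a bit more concrete, here is a minimal Java sketch of what a pluggable artifact store could look like. It is not the ProcessArtifactsStore interface from the wiki page: ArtifactStoreSketch, DirectoryArtifactStore and all the method names are hypothetical, made up for illustration only. The directory-backed variant behaves the same against a local folder or a shared mount, and a store based on some other technology could implement the same interface.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical stand-in, NOT the ProcessArtifactsStore interface from the wiki page.
interface ArtifactStoreSketch {
    // Streams one artifact (request, output, ...) of an execution into the store.
    void put(String executionId, String name, InputStream content);

    // Opens a stored artifact for reading; the caller closes the stream.
    InputStream get(String executionId, String name);

    // Drops every artifact of an execution, e.g. once its results have expired.
    void clear(String executionId);
}

// Directory-backed variant: "root" can be a local folder or a shared network mount.
final class DirectoryArtifactStore implements ArtifactStoreSketch {
    private final Path root;

    DirectoryArtifactStore(Path root) {
        this.root = root;
    }

    // Convenience for trying the sketch out without choosing a directory.
    static DirectoryArtifactStore inTempDirectory() {
        try {
            return new DirectoryArtifactStore(Files.createTempDirectory("wps-artifacts"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private Path resolve(String executionId, String name) {
        return root.resolve(executionId).resolve(name);
    }

    @Override
    public void put(String executionId, String name, InputStream content) {
        Path target = resolve(executionId, name);
        try {
            Files.createDirectories(target.getParent());
            // Files.copy streams the content, so memory use does not grow with artifact size.
            Files.copy(content, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public InputStream get(String executionId, String name) {
        try {
            return Files.newInputStream(resolve(executionId, name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void clear(String executionId) {
        Path dir = root.resolve(executionId);
        if (!Files.exists(dir)) {
            return;
        }
        // Walk deepest-first so files are deleted before their parent directory.
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Size in bytes of a stored artifact, or -1 if it is not in the store.
    long size(String executionId, String name) {
        Path p = resolve(executionId, name);
        try {
            return Files.exists(p) ? Files.size(p) : -1L;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The calling code would only see the interface, so switching from a shared directory to some other storage would not need to touch the WPS side.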

> One alternative would be to only store on the local disk for the server
> doing the processing.
> The shared part would be a messaging system (e.g. ActiveMQ, although the
> specific choice may not be important).

If we think in terms of interface, the specific choice is indeed not
important, but in terms of the first implementation of this proposal, it is
very much so.
WPS can have large inputs and large outputs (as in GB or even TB large)
and the solution needs to support large file transfer.
Message passing is normally not a good solution for streaming large results,
see e.g.:
http://activemq.apache.org/can-i-send-really-large-files-over-activemq.html
Also, assuming there is large enough local storage simply does not match my
experience of large installations (e.g., the ones that do fund this kind of
proposal), where the local disks often do not even get beyond a few hundred
GB, while the network disks are often pretty large.
Also, we have experience with message passing technology (the configuration
clustering module we just donated is based on ActiveMQ) and I can tell you
it's not always welcome.
First, people see the need for an external server as a complication, often
too much of one, so in that solution we made ActiveMQ an integral part of
the plugin by default (you can still use an external one if you want): it
runs as a library and automatically discovers its peers.
Which is cool, but often does not work because multicast is banned from the
network, at which point you have to provide a list of TCP addresses for the
various bits to communicate with each other.
Recently we have also been playing with Hazelcast, and I'm also considering
using it for the status sharing part (as a replacement for the database),
but besides its evident coolness it suffers from the same issues as an
embedded/clustered ActiveMQ solution: it either has to use multicast for
the discovery, or needs a list of TCP addresses that will form the core of
the cluster (and that list needs to be known before starting up the
cluster).
I can tell you that we've been trying to push this kind of technology for a
while, but the resistance is strong: it can be a solution, but it clearly
cannot be _the solution_, it cannot be the only option.
I'm actually trying to push it right now for sharing small, short-lived data
across a cluster in a customer project I'm following. For that task
Hazelcast is clearly easier and faster than a database, and yet there's a
good chance I'll have to implement the database one instead, because network
and database admins are against it (the maintenance and politics angles
often play a very significant role and can supersede the technical merits).
That said, I believe that also having an artifact store based on message
passing, as an option, would be cool, and you're more than welcome to work
alongside this proposal to implement it. If you try, just let me know if the
current interface shows weaknesses and we'll try to address them.

Cheers
Andrea
--
Ing. Andrea Aime
@geowolf
Technical Lead
GeoSolutions S.A.S.