[Geoserver-devel] GeoServer cluster scalability

Greetings. I’ve been a long-time user of GeoServer and an occasional contributor, and I am interested in the general direction the developers envision for GeoServer with regard to deployment scalability, specifically support for master-less horizontal scalability in an Apache Mesos / Kubernetes / etc. cluster environment.

As background, I’m aware of various efforts / documentation on how to coordinate GeoServer configuration between multiple instances:
http://docs.geoserver.org/latest/en/user/community/jms-cluster/index.html

https://boundlessgeo.com/2013/04/geoserver-in-a-clustered-configuration-part-1/
https://2016.foss4g-na.org/sites/default/files/slides/High%20Performance%20Geoserver%20Clusters_0.pdf

In developing a GeoServer package for Apache Mesos via DC/OS, I went down a path similar to the one Derek Kern described in his 2016 FOSS4G-NA talk (linked above): mounted network storage to share the GeoServer data configuration across multiple machines. While this solution is functional, it requires consistently mounted data across the cluster, as well as an external coordination service to monitor the configuration directory and force instances to reload from disk. My preferred approach would be either to coordinate directly between GeoServers or to use a cluster-native coordination system (such as ZooKeeper) for configuration. I have considered using the recently developed GeoServer backup/restore plugin to push configuration to all other GeoServer instances within a cluster.
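
To make the "external coordination service" idea concrete, here is a minimal sketch of the watch-and-reload loop body. It is not the dcos-geoserver implementation: the node URLs and credentials are placeholders, the `/rest/reload` path is my recollection of GeoServer's REST reload endpoint and should be checked against the REST docs for your version, and the `send` parameter is a hypothetical hook so the logic can be exercised without a live cluster.

```python
import hashlib
import os
import urllib.request
from base64 import b64encode


def fingerprint(config_dir: str) -> str:
    """Hash relative paths, sizes, and mtimes of every file under config_dir."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(config_dir):
        dirs.sort()  # make the walk order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            stat = os.stat(path)
            rel = os.path.relpath(path, config_dir)
            digest.update(f"{rel}|{stat.st_size}|{stat.st_mtime_ns}\n".encode())
    return digest.hexdigest()


def build_reload_request(base_url: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a POST to a node's REST reload endpoint (assumed path)."""
    token = b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/rest/reload",
        data=b"",
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


def reload_if_changed(config_dir, last_seen, nodes, user, password,
                      send=urllib.request.urlopen):
    """Return the current fingerprint; call send() once per node if it changed."""
    current = fingerprint(config_dir)
    if current != last_seen:
        for base_url in nodes:
            send(build_reload_request(base_url, user, password))
    return current
```

A caller would invoke `reload_if_changed` on a timer, passing back the fingerprint it returned last time. Every change triggers a full reload on every node, which is exactly the cost being discussed here.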

Does anyone else have experience or opinions in this domain? I’m just brainstorming and would love to discuss this in more detail.

Also, any feedback on the DC/OS GeoServer package would be welcomed:
http://universe.serv.sh/#/package/geoserver/version/latest

https://github.com/appliedis/dcos-geoserver

v/r,
Jonathan Meyer
Sr. Software Engineer
Applied Information Sciences

It should no longer be required to have a shared file system. GSIP-106 removed file references from the codebase, allowing you to plug in alternate storage for configuration (styles, icons, fonts, templates). Previously only the catalog and settings were isolated from the file system.

Your dcos-geoserver solution is interesting; I have seen teams use version control on their data directory for a similar effect. As part of the work done around the backup/restore package, a change was made allowing you to use variable substitution in data store path locations, which may be a good fit for your cluster if the nodes do not all have a shared disk.
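
The general idea behind variable substitution in store paths can be sketched as below. This only illustrates the concept: the `${name}` placeholder syntax, the `mosaic_root` variable, and the per-node values are assumptions made for the example, not GeoServer's documented behaviour, so check the GeoServer documentation for the actual syntax and where values are defined.

```python
from string import Template


def resolve_store_path(template: str, node_settings: dict) -> str:
    """Expand ${name} placeholders; raises KeyError if a placeholder has no value."""
    return Template(template).substitute(node_settings)


# Hypothetical placeholder name and per-node values, for illustration only.
stored_path = "file://${mosaic_root}/landsat/2017"
node_a = {"mosaic_root": "/mnt/nfs/imagery"}
node_b = {"mosaic_root": "/data/local/imagery"}

print(resolve_store_path(stored_path, node_a))
print(resolve_store_path(stored_path, node_b))
```

The stored configuration stays identical on every node; only the small per-node settings differ, which is what makes it usable without a shared disk.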

Have you tried the jms-cluster or jdbc-config approaches? They take two different approaches to notifying nodes in the cluster and can point at how integration can be done.

Jody Garnett

On Wed, Jul 12, 2017 at 12:19 AM, Jonathan Meyer <jon@anonymised.com> wrote:

As background, I'm aware of various efforts / documentation on how to
coordinate GeoServer configuration between multiple instances:
http://docs.geoserver.org/latest/en/user/community/jms-cluster/index.html
https://boundlessgeo.com/2013/04/geoserver-in-a-clustered-configuration-part-1/
https://2016.foss4g-na.org/sites/default/files/slides/High%20Performance%20Geoserver%20Clusters_0.pdf

In developing a GeoServer package for Apache Mesos via DC/OS, I went down
a similar path to Derek Kern as identified in his 2016 FOSS4G-NA talk
(linked above) - mounted network storage to share GeoServer data
configuration across multiple machines.

The overall approach of the presentation is a bit ... old? It represents the "state of the art" as of 2010; as Jody remarked, other avenues have been considered since then. To be fair, that approach is still simple and viable if you have a master-driven configuration with a low change rate (though the separate front-end GWC is something I have not seen in a while), and with the speedups to file system catalog loading in 2.11 it's viable even if you have a large-ish number of layers. At the same time, shared data dir and reload is the only approach you can take if you restrict yourself to supported modules (both jms and jdbc-config are community modules and thus unsupported).

While this solution is functional, it enforces a requirement on consistent
mounted data across a cluster, as well as requiring an external
coordination service to monitor configuration directory and force instances
to reload from disk. My preferred approach would be to either directly
coordinate between GeoServers or use a cluster native coordination system
(such as Zookeeper) for configuration. I have considered looking into using
the GeoServer backup/restore plugin that was recently developed to push
configuration to all other GeoServer instances within a cluster.

The backup/restore module has been developed for "full/slow" backup/restore operations, not for on-the-fly change notification. Something based on ZooKeeper would be interesting. I'd also like to play with/develop a distributed in-memory configuration based on Hazelcast (or something similar) to see how it works. Nowadays the jdbcconfig module takes a significant performance hit due to the many queries it makes to the config db per request, slowing down each OGC request (Niels showed interest in improving that, though I haven't heard about it since).
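
As a rough illustration of why an in-memory layer with change notifications helps, here is a minimal read-through cache sketch. It does not describe how jdbcconfig or Hazelcast actually work: `CachedCatalog`, `get_layer`, and `on_change` are hypothetical names, the loader stands in for a config db query, and thread safety is ignored for brevity.

```python
from typing import Callable, Dict


class CachedCatalog:
    """Read-through cache in front of a slow configuration lookup."""

    def __init__(self, load: Callable[[str], dict]):
        self._load = load  # e.g. a function that queries the config db
        self._cache: Dict[str, dict] = {}

    def get_layer(self, name: str) -> dict:
        """Return the cached entry, loading it from the backend only on a miss."""
        if name not in self._cache:
            self._cache[name] = self._load(name)
        return self._cache[name]

    def on_change(self, name: str) -> None:
        """Drop the entry when a change notification for `name` arrives."""
        self._cache.pop(name, None)
```

With this shape, repeated requests for the same layer hit the backend once, and a notification from another node (however it is delivered) only invalidates the affected entry rather than forcing a full reload.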

Ideally, I'd like to see something easier to set up than JMS clustering, with performance comparable to in-memory config storage, and not requiring changes to the database when the configuration object properties change or when the queries against the catalog change (something that jdbcconfig nowadays requires, making it hard to upgrade [1]).

That said, the configuration needs to be stored somewhere (to support a full cluster restart at the very least). As Jody said, there are indirections in the code nowadays that allow storage on something other than a file system, and there is a community (unsupported) module allowing storage in a relational database, to be used alongside jdbcconfig.

Does any one else have experience or opinions in this domain? I'm just
brainstorming and would love to discuss this in more detail.

Been playing with all options above, yep.

Regards,

Andrea Aime

[1]: This is one annoying issue in jdbcconfig imho.

Basically, jdbcconfig stores XML blobs and maps out interesting attributes in a separate table for indexing searches. So, if a new property pops up that needs to be searchable for whatever reason, one has to go and change the jdbcconfig mappings to map it out; failing to do so will make jdbcconfig go and de-serialize the XML blobs from the db every time a search based on the offending property happens.

Another issue happens if the code querying the catalog starts issuing queries against properties that are already in the stored XML but have not been mapped out to be indexed. There is no tooling to add the mapping and extract the values from the XML blobs; the only approach I've found is to re-import from a file-system-based data dir... which is not possible once you have been using jdbcconfig for a while and it has gotten out of sync with the fs-based data dir. Hopefully one has used a dbms with xml/xpath extraction support, so that mass extraction queries can be set up to re-align the db.
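
The "blob plus mapped-out properties" pattern, and the kind of backfill that is missing, can be sketched as follows. The two-table schema, the table and column names, and `backfill_property` are invented for illustration and are not jdbcconfig's actual schema or tooling; SQLite and client-side XML parsing stand in for a dbms with native xpath extraction.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical layout: one XML document per object, plus an index table
# holding the properties that were "mapped out" for searching.
SCHEMA = """
CREATE TABLE config_object (id INTEGER PRIMARY KEY, xml TEXT NOT NULL);
CREATE TABLE config_property (
    object_id INTEGER NOT NULL REFERENCES config_object(id),
    name TEXT NOT NULL,
    value TEXT,
    PRIMARY KEY (object_id, name)
);
"""


def backfill_property(conn: sqlite3.Connection, name: str, path: str) -> int:
    """Extract `path` from every stored document and index it under `name`.

    Returns the number of objects that had a value for the property.
    """
    count = 0
    rows = conn.execute("SELECT id, xml FROM config_object").fetchall()
    for object_id, xml_text in rows:
        node = ET.fromstring(xml_text).find(path)
        if node is None:
            continue  # this object does not carry the property
        conn.execute(
            "INSERT OR REPLACE INTO config_property (object_id, name, value) "
            "VALUES (?, ?, ?)",
            (object_id, name, node.text),
        )
        count += 1
    conn.commit()
    return count
```

Once a property is backfilled, a search on it becomes an indexed lookup on the property table instead of a scan that de-serializes every blob.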

==
GeoServer Professional Services from the experts! Visit http://goo.gl/it488V
for more information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via di Montramito 3/A
55054 Massarosa (LU)
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


Jody / Andrea,

Thanks for the detailed responses. I was not aware of the work to decouple the GeoServer configuration from the filesystem - that would certainly facilitate making a change away from the file system as a first class configuration storage mechanism.

I had briefly investigated both the jdbc-config and jms approaches. The primary issue we had with anything but the file-system backed config was that all seemed to lack complete configuration support. We make heavy use of image mosaics and a fair number of FTL template files and both of these require configuration that falls outside of that managed by the community plugins. This may have changed, but at the time that was a deal breaker (initial investigation was back around 2.6 or 2.7). Jody, I also noticed that the Resource API plan doc on the wiki said the TemplateLoader wasn’t handled yet in the transition. Is that accurate or out-of-date? I assume this is where changes would be made to support FTL files. https://github.com/geoserver/geoserver/wiki/Resource-API-Transition-Plan#file-events

Ultimately, I’d like configuring a different backend to be as simple as ZooKeeper configuration is elsewhere: setting an environment variable on GeoServer startup, such as “ZK=zk://zk-1.zk:2181,zk-2.zk:2181,zk-3.zk:2181/geoserver”, would make GeoServer behave just as it does with the file-system-backed configuration. This would require separating any actual data within the “data” directory from the config; I’ve always personally avoided mixing the two. Short of the demo files, I’ve always referenced image data from a location outside the “data” tree, as that seemed like a best practice and made it possible to put the directory under SCM if desired.
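
To show how little a startup hook would need in order to interpret such a variable, here is a small parsing sketch. The `ZK` variable name and URL shape are the ones proposed above; `ZkConfig`, `parse_zk_url`, and the default port are assumptions for the example, and nothing here talks to ZooKeeper or exists in GeoServer today.

```python
import os
from typing import List, NamedTuple, Tuple


class ZkConfig(NamedTuple):
    hosts: List[Tuple[str, int]]
    path: str


def parse_zk_url(url: str, default_port: int = 2181) -> ZkConfig:
    """Parse 'zk://host1:port,host2:port/path' into host/port pairs and a znode path."""
    prefix = "zk://"
    if not url.startswith(prefix):
        raise ValueError(f"expected a {prefix} URL, got {url!r}")
    host_part, slash, path = url[len(prefix):].partition("/")
    hosts = []
    for entry in host_part.split(","):
        host, _, port = entry.partition(":")
        if not host:
            raise ValueError(f"empty host in {url!r}")
        hosts.append((host, int(port) if port else default_port))
    return ZkConfig(hosts=hosts, path="/" + path if slash else "/")


# Fall back to the example value from this message when ZK is not set.
example = "zk://zk-1.zk:2181,zk-2.zk:2181,zk-3.zk:2181/geoserver"
config = parse_zk_url(os.environ.get("ZK", example))
```

The parsed path (`/geoserver` in the example) would be the root under which all configuration nodes live, playing the role the data directory root plays today.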

If complete configuration coordination (FTL and Image Mosaic config) is, to your knowledge, available with the JMS plugin, I may look into that again. Doing a complete configuration reload works reasonably well for us at present, as we have a large amount of data in a fairly small number of layers (~100) / stores (~10), but I can imagine that in a scenario with 1000s+ this would be more problematic.

Thanks,
Jonathan Meyer
Sr. Software Engineer
Applied Information Sciences

On Wed, Jul 12, 2017 at 2:00 PM, Jonathan Meyer <jon@anonymised.com> wrote:

If complete configuration coordination (FTL and Image Mosaic config), to
your knowledge, is available with the JMS plugin, I may look into that
again. Doing complete configuration reload works reasonably well for us
presently as we have a large amount of data in a fairly small number of
layers (~100) / stores (~10), but I can imagine that in a scenario with 1000s+
this would be more problematic.

I don't know what's going on with templates, but I suspect there might be problems: FreeMarker's automatic reload checks local timestamps, and I'm not sure the resource integration goes that deep.

As for the JMS plugin, I believe it does not support FreeMarker templates, though that could be changed.

To my knowledge, nothing is *designed* to support image mosaic configuration files, as to GeoServer those are really data, not config (it's config of the store; GeoServer is not aware of it). That said, if you keep the mosaic configuration files in the data directory and force them to load information from a separate directory (still keeping data out), then it _might_ work with jdbcconfig/jdbcstore, but I'm not sure, as I have never tried.

Cheers
Andrea


Jody, I also noticed that the Resource API plan doc on the wiki said the TemplateLoader wasn’t handled yet in the transition. Is that accurate or out-of-date? I assume this is where changes would be made to support FTL files. https://github.com/geoserver/geoserver/wiki/Resource-API-Transition-Plan#file-events

Everything has been migrated over to ResourceStore; if you find any gaps, mark them as bugs.

Note that each node still needs to be configured with a temporary file system: a few libraries we use require us to unpack the BLOB into a file and then pass a file reference in. I think only 1 or 2 spots in the codebase require this, so I am not fussed.
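
The unpack-to-a-file step looks roughly like the sketch below. It is a generic illustration of the pattern, not GeoServer's ResourceStore code; `blob_as_file` is a hypothetical helper name.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def blob_as_file(blob: bytes, suffix: str = "") -> Iterator[str]:
    """Write `blob` to a node-local temporary file and yield its path.

    The file is deleted when the `with` block exits.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(blob)
        yield path
    finally:
        os.remove(path)
```

A library that only accepts file paths can then be called inside `with blob_as_file(data, ".ttf") as path: ...`, which is why each node needs some scratch space even when the configuration itself lives elsewhere.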

If complete configuration coordination (FTL and Image Mosaic config) is, to your knowledge, available with the JMS plugin, I may look into that again. Doing a complete configuration reload works reasonably well for us at present, as we have a large amount of data in a fairly small number of layers (~100) / stores (~10), but I can imagine that in a scenario with 1000s+ this would be more problematic.

For image mosaic, you can also configure it to use a database rather than a shapefile for granule management; not sure if that is of interest.

This work is planned, but I have some other things lying on top now. Will happen within the next few months.

Regards
Niels

···

nowadays the jdbcconfig module is taking a significant performance hit
due to the many queries it does to the config db per request, slowing down each OGC request (Niels showed interest in
improving that, haven’t heard about it since though).