[Geoserver-devel] Achievement unlocked: (partial) parallel catalog loading

Hi,
I managed to get a “light” parallelization of the catalog load by rolling a background
thread pool that takes care of loading the XML files into memory, without having
to parallelize the code that runs in the Spring startup.

Basically, the existing code does a number of “scan this directory for XML files and load them
one by one” passes. That’s the part I parallelized, by creating an async iterator that does the loading
and gives the main thread a byte array for each file, to decode and put into the catalog.
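
To illustrate the idea, here is a minimal sketch of such an iterator. It is not the code from the pull request: the class name `AsyncXmlIterator`, the fixed-size pool and the eager submission of all files are assumptions made for the example. Background threads read the files, while the calling thread pulls one byte array at a time and stays free to do the decoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: reads the *.xml files of one directory on a background pool. */
final class AsyncXmlIterator implements Iterator<byte[]>, AutoCloseable {
    private final ExecutorService pool;
    private final Iterator<Future<byte[]>> pending;

    AsyncXmlIterator(Path dir, int threads) throws IOException {
        // Collect the XML files first, sorted so the iteration order is stable.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.xml")) {
            stream.forEach(files::add);
        }
        Collections.sort(files);

        // Submit one read task per file; the reads overlap on the pool threads.
        // Assumption: everything is queued at once. A real implementation would
        // probably bound the number of files held in memory.
        pool = Executors.newFixedThreadPool(threads);
        List<Future<byte[]>> futures = new ArrayList<>();
        for (Path file : files) {
            futures.add(pool.submit(() -> Files.readAllBytes(file)));
        }
        pending = futures.iterator();
    }

    @Override
    public boolean hasNext() {
        return pending.hasNext();
    }

    /** Blocks until the next file has been read, then returns its raw bytes. */
    @Override
    public byte[] next() {
        Future<byte[]> future = pending.next();
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a file", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Failed to read a file", e.getCause());
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny throwaway directory instead of a real data dir.
        Path dir = Files.createTempDirectory("catalog");
        Files.write(dir.resolve("layer1.xml"), "<layer/>".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("layer2.xml"), "<layer/>".getBytes(StandardCharsets.UTF_8));

        try (AsyncXmlIterator it = new AsyncXmlIterator(dir, 4)) {
            while (it.hasNext()) {
                byte[] xml = it.next(); // decoding would happen here, on the caller thread
                System.out.println(xml.length + " bytes loaded");
            }
        }
    }
}
```

The consumer keeps its sequential “load them one by one” loop, and only the disk reads move to other threads, which is why the rest of the startup code does not need to change.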

The benefit of doing so is visible because IO is the bottleneck.
So, let’s see some numbers, but before, a reminder of the data dirs involved:

  • “Many states”: 1 workspace, 1 store, 10k layers, 10k cached layers
  • “Large”: 1001 workspaces, 11000 stores (a mix of shapefiles, PostGIS, directories of shapefiles, single TIFF, ArcGrid, mosaics), 42000 layers and 42000 associated tile layers

Here is a comparison of the last loading times I have vs the ones with parallel IO loading.

  • Many states, cold startup. Before: 68s. After: 30s
  • Many states, warm startup. Before: 29s. After: 21s
  • Large, cold startup. Before: 230s. After: 107s
  • Large, warm startup. Before: 45s. After: 45s (weird?)

As expected, the benefit shows up mostly on cold startups, where actual IO against the disk happens (btw, the
IO is likely happening against the SSD cache of my hybrid HD; a pure spinning disk drive will likely have worse timings).
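
For a quick read of the numbers, the timings above can be turned into speedup factors (before divided by after). The small program below is only an illustration; the class and method names are made up for the example, and the inputs are the timings listed above.

```java
import java.util.Locale;

/** Hypothetical helper: converts the before/after timings into speedup factors. */
final class StartupSpeedup {
    /** Speedup factor: how many times faster the "after" run is. */
    static double speedup(double beforeSeconds, double afterSeconds) {
        return beforeSeconds / afterSeconds;
    }

    public static void main(String[] args) {
        String[] names = {
            "Many states, cold", "Many states, warm", "Large, cold", "Large, warm"
        };
        // {before, after} in seconds, copied from the list above.
        double[][] timings = {{68, 30}, {29, 21}, {230, 107}, {45, 45}};

        for (int i = 0; i < names.length; i++) {
            double factor = speedup(timings[i][0], timings[i][1]);
            System.out.printf(Locale.ROOT, "%-18s %.2fx%n", names[i], factor);
        }
    }
}
```

This gives roughly 2.27x and 2.15x for the two cold startups, 1.38x for the warm “Many states” startup and 1.00x for the warm “Large” one.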

Pull request available here; please let me know if anybody wants to review it or has reservations
against a merge (plan B: land it after the freeze and wait until September to have it in a release):
https://github.com/geoserver/geoserver/pull/2116

Cheers
Andrea

PS: the IO rate I see is still quite a bit below the potential of my local drive, so a true
parallelization of the catalog loading (including the CPU bound part) is likely to reap extra
benefits, but as said previously, it’s harder to implement. This is a more modest approach that
still manages to provide a speedup.

==
GeoServer Professional Services from the experts! Visit
http://goo.gl/it488V for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via di Montramito 3/A
55054 Massarosa (LU)
phone: +39 0584 962313

fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

AVVERTENZE AI SENSI DEL D.Lgs. 196/2003

Le informazioni contenute in questo messaggio di posta elettronica e/o nel/i file/s allegato/i sono da considerarsi strettamente riservate. Il loro utilizzo è consentito esclusivamente al destinatario del messaggio, per le finalità indicate nel messaggio stesso. Qualora riceviate questo messaggio senza esserne il destinatario, Vi preghiamo cortesemente di darcene notizia via e-mail e di procedere alla distruzione del messaggio stesso, cancellandolo dal Vostro sistema. Conservare il messaggio stesso, divulgarlo anche in parte, distribuirlo ad altri soggetti, copiarlo, od utilizzarlo per finalità diverse, costituisce comportamento contrario ai principi dettati dal D.Lgs. 196/2003.

The information in this message and/or attachments, is intended solely for the attention and use of the named addressee(s) and may be confidential or proprietary in nature or covered by the provisions of privacy act (Legislative Decree June, 30 2003, no.196 - Italy’s New Data Protection Code).Any use not in accord with its purpose, any disclosure, reproduction, copying, distribution, or either dissemination, either whole or partial, is strictly forbidden except previous formal approval of the named addressee(s). If you are not the intended recipient, please contact immediately the sender by telephone, fax or e-mail and delete the information in this message that has been received in error. The sender does not give any warranty or accept liability as the content, accuracy or completeness of sent messages and accepts no responsibility for changes made after they were sent or for other risks which arise as a result of e-mail transmission, viruses, etc.


I am happy to merge this in for the beta; indeed I would like to see this kind of work receive the slightly wider testing a beta and release candidate can provide.

···

On 19 February 2017 at 03:08, Andrea Aime <andrea.aime@anonymised.com> wrote:


Jody Garnett

Ok,
I waited for the world to make a full spin since my mail; considering Jody’s interest, I’ve merged the pull request in for the beta.

Cheers
Andrea

···

On Mon, Feb 20, 2017 at 7:20 AM, Jody Garnett <jody.garnett@anonymised.com> wrote:
