[Geoserver-devel] GSIP 69 - Catalog scalability enhancements - OOM

On Sat, Apr 28, 2012 at 4:29 PM, Andrea Aime
<andrea.aime@anonymised.com> wrote:

Doh, forgot one last bit that I've found interesting. Citing Gabriel:

----------------------------------------------------------------------------

If starting up geoserver in seconds instead of minutes, loading the
home page almost instantly instead of waiting for seconds or even
minutes, under-second response times for the layers page list with
thousands of layers, including filtering and paging; not going OOM or
getting timeouts when doing a GetCapabilities request under
concurrency and/or low heap size, but instead streaming out as quickly
as possible, using as little memory as possible, and gracefully
degrading under load; are not ways of exercising the new API, then I'm
lost.

----------------------------------------------------------------------------

All good stuff that I did not see mentioned in the proposal, though
I can hardly imagine a GS going OOM under concurrent load of
GetCapabilities unless... well, maybe it has 200k layers and
works off just 256M of memory (gut feeling estimate).

In the past, OOM with many layers was caused by leaks in the
DescribeFeatureType subsystem... did you measure how much memory
it takes to keep the catalog in memory and do GetCapabilities?

Pardon the very rough way of assessing it, but if we take as reference
the release directory we have 19 layers, with quite a bit of stores that
could be avoided (single shapefiles instead of directory stores), and
running inside "workspaces" the following gives me:

du -csh `find . -name "*.xml"`
256K total

which means, on average, 13KB of xml per layer (which still pads it quite
a bit since the service configuration is shared and normally you don't
have so many stores). And then it's XML, would you agree that the
in memory representation should be something like 5 times more compact?
This would give us a rough estimate of 3KB per layer.
If I have 200k layers it means 600MB of in memory storage.
Which is a lot, I'm not denying it, but if you are handling that many layers
you do also want to have some beefy hardware, 600MB should be peanuts.
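The estimate above can be reproduced as a short calculation. This is only a sketch of the arithmetic in the message: the 5x compaction factor is the guess stated above, not a measurement, and the variable names are illustrative.

```python
# Back-of-envelope catalog memory estimate, using the figures quoted above.
xml_total_kb = 256        # "du" total for the XML files under "workspaces"
layers = 19               # layers in the release data directory
compaction = 5            # assumed XML-to-heap compaction factor (a guess)
target_layers = 200_000   # hypothetical large deployment

xml_per_layer_kb = xml_total_kb / layers             # about 13.5 KB
heap_per_layer_kb = xml_per_layer_kb / compaction    # about 2.7 KB
rounded_kb = round(heap_per_layer_kb)                # rounds up to 3 KB
total_mb = rounded_kb * target_layers / 1000         # decimal megabytes

print(f"{xml_per_layer_kb:.1f} KB of XML per layer")
print(f"{heap_per_layer_kb:.1f} KB in memory per layer (rounded: {rounded_kb} KB)")
print(f"{total_mb:.0f} MB for {target_layers} layers")
```

The result is only as good as the compaction guess: the whole estimate scales linearly with it.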

I'm not trying to deny the scalability advantages of secondary storage, it
just seems to me the OOM reports may be a bit exaggerated.

Actually it is not exaggerated. I can make it go OOM with 25K layers
(not 200K), 2GB heap size, and 20 concurrent GetCapabilities requests.
Try the following for yourself, from the GSIP69 branch
<https://github.com/groldan/geoserver/tree/GSIP69> (finally managed to
redo it with squashed commits where the system builds ok on each one,
the old branch is still there for reference, called GSIP69_old):

1- Check out the code at a point where the getcapabilities processing
doesn't use the new API (commit "GSIP-69: add catalog bulk copy
tool...")
2- Run against release data directory, with -Xmx1024m
-XX:MaxPermSize=128m -XX:+UseCompressedOops (Oracle Java 6, 64bit,
Linux), but without the -P jdbcconfig profile, in order to use the
default catalog.
3- Disable GWC's "Automatically configure a GeoWebCache Layer for
every new Layer and LayerGroup"
4- Use the catalog bulk load tool and add 25k copies of
topp:tasmania_water_bodies
5- Shut down, checkout the master branch so you're sure no GSIP69 code
gets in the middle, restart geoserver (I'm doing all this through
eclipse).
6- Connect jconsole to the java process
7- Run curl -v "http://localhost:8080/geoserver/ows?service=wms&version=1.3.0&request=GetCapabilities" >
caps.xml. grep "<Layer" caps.xml |wc -l gives something like 25122,
all right.
8- Check jconsole, memory usage should be around 227M
9- Hit "Perform GC", memory should go down to around 130M. Clearing
the resource cache again takes it up to over 300M, hit "Free memory"
and it should get it back to ~130M again.
10- Run ab -n 10 -c 10
"http://localhost:8080/geoserver/ows?service=wms&version=1.3.0&request=GetCapabilities"
11- Go check memory consumption in jconsole. Memory fills up almost
completely. Both the old gen and the eden memory pools are up to the
top. Jconsole says ~950M are in use. And this is only 10 concurrent
GetCapabilities requests. "ab" reports a mean response time of 184.2
seconds. Other times ab times out after 10 minutes or so. Most of the
time is spent on GC. Running with a 2GB heap shows GeoServer is happy
with 1.17 GB to serve 10 getcaps requests, with a mean of 23 seconds.
But with 20 concurrent requests, it eats up to 1.8GB of heap and then
the GC has a hard time avoiding OOM, which it finally hits (*) after
about 750 seconds.

12- Shut down GeoServer and checkout the commit "GSIP-69: port WMS
GetCapabilities 1.3 to extended Catalog API"
13- Start up GeoServer (it's gonna take a while) and repeat steps 6
to 11.
This time ab reports (on my system) a mean response time of 14.794
seconds. Memory usage in jconsole peaks at about 370M, then goes
down to about 180M without explicitly calling the GC, and back to 130M
after clearing the resource cache.
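For anyone without ab at hand, the concurrent GetCapabilities run in steps 10 and 11 can be approximated with a small script. This is a rough sketch, not a replacement for ab: the function names are made up for illustration, the URL is the one used in the steps above, and the mean it computes is a plain average of per-request wall-clock times, which is not necessarily the same figure ab reports.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Same request as in the steps above (assumes a local GeoServer on port 8080).
URL = ("http://localhost:8080/geoserver/ows"
       "?service=wms&version=1.3.0&request=GetCapabilities")

def fetch(url):
    """Download url, discarding the body in chunks; return elapsed seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as resp:
        while resp.read(64 * 1024):
            pass  # read and drop, so the client itself stays small
    return time.perf_counter() - start

def run_load(n, concurrency, request=fetch, url=URL):
    """Issue n requests with at most `concurrency` in flight; return timings."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(request, [url] * n))

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Rough equivalent of "ab -n 10 -c 10" (uncomment with GeoServer running):
# timings = run_load(n=10, concurrency=10)
# print(f"mean response time: {mean(timings):.1f} s")
```

The `request` parameter exists only so the harness can be exercised with a stand-in function before pointing it at a real server.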

Doing it with 100 concurrent requests instead of 10, memory usage
barely exceeds 550M, and mean response time is about 130.9 seconds.
Pretty much linearly scaling. And this is with the default catalog,
which has to do sorting in-memory.
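As a quick sanity check of the "linearly scaling" remark, the two mean response times quoted above can be compared directly. This is just arithmetic on the reported numbers, nothing measured anew.

```python
# Mean response times reported above for the ported GetCapabilities code.
mean_s = {10: 14.794, 100: 130.9}   # concurrent requests -> mean seconds

load_ratio = 100 / 10                      # 10x more concurrent requests
time_ratio = mean_s[100] / mean_s[10]      # growth in mean response time
per_request = {c: t / c for c, t in mean_s.items()}  # seconds per concurrent request

print(f"load x{load_ratio:.0f}, mean time x{time_ratio:.2f}")
for c, s in per_request.items():
    print(f"{c} concurrent: {s:.2f} s per concurrent request")
```

A time ratio a little under the load ratio means the mean response time grew slightly less than proportionally to the number of concurrent requests.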
Doing the same with 100 concurrent requests, but without the
getcaps transformer ported to the new API (same commit as in step 1), the
1G heap fills up and ab times out: "Benchmarking localhost (be
patient)...apr_poll: The timeout specified has expired (70007)"
If adding a bigger timeout (add -t 3600 to the ab arguments),

(*)
java.lang.OutOfMemoryError: GC overhead limit exceeded
  at sun.misc.FloatingDecimal.dtoa(FloatingDecimal.java:659)
  at sun.misc.FloatingDecimal.<init>(FloatingDecimal.java:440)
  at java.lang.Double.toString(Double.java:179)
  at java.lang.String.valueOf(String.java:2973)
  at org.geoserver.wms.capabilities.Capabilities_1_3_0_Transformer$Capabilities_1_3_0_Translator.handleBBox(Capabilities_1_3_0_Transformer.java:1174)
  at org.geoserver.wms.capabilities.Capabilities_1_3_0_Transformer$Capabilities_1_3_0_Translator.handleLayer(Capabilities_1_3_0_Transformer.java:834)
  at org.geoserver.wms.capabilities.Capabilities_1_3_0_Transformer$Capabilities_1_3_0_Translator.handleLayerTree(Capabilities_1_3_0_Transformer.java:791)
  at org.geoserver.wms.capabilities.Capabilities_1_3_0_Transformer$Capabilities_1_3_0_Translator.handleLayers(Capabilities_1_3_0_Transformer.java:658)
  at org.geoserver.wms.capabilities.Capabilities_1_3_0_Transformer$Capabilities_1_3_0_Translator.handleCapability(Capabilities_1_3_0_Transformer.java:437)
  at org.geoserver.wms.capabilities.Capabilities_1_3_0_Transformer$Capabilities_1_3_0_Translator.encode(Capabilities_1_3_0_Transformer.java:252)

On Mon, Apr 30, 2012 at 6:23 PM, Gabriel Roldan <groldan@anonymised.com> wrote:

Actually it is not exaggerated. I can make it go OOM with 25K layers
(not 200K), 2GB heap size, and 20 concurrent GetCapabilities requests.
Try the following for yourself, from the GSIP69 branch
<https://github.com/groldan/geoserver/tree/GSIP69> (finally managed to
redo it with squashed commits where the system builds ok on each one,
the old branch is still there for reference, called GSIP69_old):

Read it all, interesting points. Not much to add, I guess my understanding
of where the memory is spent in the in memory catalog is missing some
important bit, the usage you’ve found seems really a lot to me… but numbers
are numbers.

Cheers
Andrea

Ing. Andrea Aime
GeoSolutions S.A.S.
Tech lead
