Hi all,
recently I was investigating an OOM reported by a user
that was basically just using OpenLayers with tiles
and meta-tiling on a single machine (so one user connected
to GeoServer).
The result of the investigation is not completely new, but
it's worrisome anyways.
Basically the user was moving around a lot using OL,
panning and zooming, and the VM was configured as default,
which on the platform of this example meant only
having 64M of memory.
Each request resulted in the building of a 3x3 meta tile,
though of course not all requests triggered that, as the
code prevents the same meta tile from being computed in parallel
by more than one thread.
I've added some machinery to get a count of the concurrent
requests working in parallel and usually the count was 6
(which is the default Firefox max connections) but if
someone starts zooming around while OL is still asking
for the tiles of the current level, boom, one can
easily get up to 30-40 concurrent requests and the OOM
is pretty much guaranteed.
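For reference, a minimal sketch of the kind of counting machinery
described above. The class and method names are hypothetical and are
not taken from the GeoServer code base; it only assumes that each
request calls enter() when it starts and exit() in a finally block.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: counts the requests being processed right now. */
class RequestCounter {
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    /** Call when a request starts; returns the number of requests now in flight. */
    int enter() {
        int now = active.incrementAndGet();
        // remember the highest concurrency level seen so far
        peak.accumulateAndGet(now, Math::max);
        return now;
    }

    /** Call in a finally block when the request ends. */
    void exit() {
        active.decrementAndGet();
    }

    int active() {
        return active.get();
    }

    int peak() {
        return peak.get();
    }
}
```

Logging the value returned by enter() is enough to see the jump
from the usual 6 to 30-40 while zooming.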
The thing is, Firefox gives up on the older requests,
but GeoServer does not know that until it actually tries
to write anything to the response, which happens only
after the rendering is fully done.
Given that each meta tile uses 2+MB of memory,
it does not take much to fill up a 64MB heap
(especially since a good part of it
is already filled with the HSQL EPSG database cache,
around 19MB, hopefully switching to H2 will give
us some breathing room in the future).
We really need to find a way to make GeoServer stop working
on requests that the client has dropped.
I've looked a bit around, here is what I've found.
Apache in CGI mode kills the cgi process as soon
as the connection is dropped.
In Java we cannot, because we're using threads, and since the
threads share resources, one cannot kill one of them without
bad consequences.
I looked into the servlet API but could find no "supported" way to
actually guess if the client connection is still alive
or not. It seems one actually has to try and write something
to the output.
I asked on the Sun J2EE servlet forum and got a couple of answers:
http://forums.sun.com/thread.jspa?threadID=5408542
The idea of trying to flush() periodically seems to be a good
one, I've read in other places that flushing the output
stream should not turn the response into committed status:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4362880
The reason it's important that flush() does not commit
the response is that by the time one commits the response
the headers have to be set and cannot be modified,
and our dispatch system sets them only after the response
object has been created (the fully rendered image in our case).
Since we want to try periodic flush() calls during the rendering
we would be in trouble, as the headers are set only after that.
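A rough sketch of the periodic flush() idea, written against a plain
java.io.OutputStream so it stays self-contained. ClientProbe and
clientGone() are hypothetical names. It assumes that a flush() on a
dropped connection fails with an IOException and that it does not
commit the response; both depend on the container and are exactly the
open questions here.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical helper: probes the output stream to detect a dropped client. */
class ClientProbe {
    private final OutputStream out;
    private final long intervalMillis;
    private long lastProbe = System.currentTimeMillis();
    private boolean dropped;

    ClientProbe(OutputStream out, long intervalMillis) {
        this.out = out;
        this.intervalMillis = intervalMillis;
    }

    /**
     * Meant to be called from the rendering loop of a single request.
     * Returns true once a flush() has failed, meaning rendering can stop.
     */
    boolean clientGone() {
        if (dropped) {
            return true;
        }
        long now = System.currentTimeMillis();
        if (now - lastProbe >= intervalMillis) {
            lastProbe = now;
            try {
                // assumption: this reaches the socket and fails if the client left
                out.flush();
            } catch (IOException e) {
                dropped = true;
            }
        }
        return dropped;
    }
}
```

The renderer would check clientGone() between features or layers and
bail out when it returns true.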
Alternatively, or in parallel to this, we could make sure no more
than X threads are rendering.
This could be done by using a concurrent queue limited in
size, with each rendering action trying to push a token into it
and waiting if the queue is full.
This would solve the OOM, but would
make all the new requests wait for the older ones to be
dropped, basically making GS WMS unusable for a while.
Failing everything else, this may not be such a bad idea.
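The bounded queue approach could look roughly like the sketch below.
RenderingLimiter and its methods are hypothetical names, and the
rendering itself is stood in for by a Supplier.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/** Hypothetical helper: lets at most maxConcurrent rendering actions run at once. */
class RenderingLimiter {
    private static final Object TOKEN = new Object();
    private final BlockingQueue<Object> slots;

    RenderingLimiter(int maxConcurrent) {
        this.slots = new ArrayBlockingQueue<>(maxConcurrent);
    }

    /** Runs the task once a slot is free, waiting if the queue is full. */
    <T> T render(Supplier<T> task) {
        try {
            slots.put(TOKEN); // blocks while maxConcurrent renderings are running
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a slot", e);
        }
        try {
            return task.get();
        } finally {
            slots.poll(); // give the slot back, even if rendering failed
        }
    }

    /** Number of rendering actions currently holding a slot. */
    int running() {
        return slots.size();
    }
}
```

Replacing put() with offer(token, timeout, unit) would let a request
give up after a while instead of waiting forever.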
With a little generalization we could apply this at the
dispatcher level and allow the administrator to set limits
to the number of requests GS is serving for each service
(typically you can serve many more WFS requests in parallel
than WMS ones).
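A sketch of what per-service limits might look like, using one
Semaphore per service name. ServiceLimits is a hypothetical class,
not an existing dispatcher API. Note that this variant rejects
requests over the limit instead of making them wait, which is a
design choice and not something decided above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical helper: caps the number of concurrent requests per service. */
class ServiceLimits {
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    /** Sets the maximum number of concurrent requests for a service, e.g. "WMS". */
    void setLimit(String service, int maxConcurrent) {
        permits.put(service, new Semaphore(maxConcurrent));
    }

    /** Returns false if the service is already at its limit; unlimited if none is set. */
    boolean tryEnter(String service) {
        Semaphore s = permits.get(service);
        return s == null || s.tryAcquire();
    }

    /** Call in a finally block, but only after tryEnter returned true. */
    void exit(String service) {
        Semaphore s = permits.get(service);
        if (s != null) {
            s.release();
        }
    }
}
```

An administrator could then configure, say, a low limit for WMS and
a higher one for WFS; the actual numbers would have to be tuned
against the available heap.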
Another option that comes to mind is to get our hands
dirty and write plugins that leverage container-specific
APIs to check if the connection is still alive.
Downside, it would work only for specific versions of
specific containers, and I haven't checked if such an API
exists at all.
Well, does anybody have experience with this? Suggestions?
Cheers
Andrea
--
Andrea Aime
OpenGeo - http://opengeo.org
Expert service straight from the developers.