Hi,
lately at OpenGeo we've been having some trouble keeping
a few WMS demos up due to exceptional load. Looking into them,
it's easy to see that our WMS does not defend itself against
excessive workload, as I reported in a Jira issue almost
2 years ago:
http://jira.codehaus.org/browse/GEOS-1127
I've been given some time to try and provide a few solutions
that can land in the GeoServer 1.7.x series in order to make
putting a GeoServer WMS out in the wild less of a concern.
Since we're talking about 1.7.x, the changes have to be as
non-invasive as possible, but the idea and the configuration
should be portable unchanged to trunk, where we can find
a fuller, more extensive solution (time and funding permitting,
that is... if the above Jira issue teaches anything, it is that
finding resources to pull this off is harder than it would seem
at first sight...).
Mail thread wise, I would suggest we stick to what
can be done on 1.7.x, since I have no mandate to do
a full-fledged solution on trunk, but only to make simple
changes to 1.7.x.
If you feel the proposed solutions are not good for 1.7.x,
or are not good at all, just say so, I will stop my
attempt and we'll start waiting again for resources for
a fuller solution.
I also encourage anybody interested to start discussing
the fuller solution in a separate thread, so that we have
a design ready for estimates should anyone with funds be
interested in having it implemented.
The following are the items I'm thinking about for
the 1.7.x branch.
Memory usage
--------------------------------------------------------
We need a way to limit the memory used by each request. WMS
requests use quite a lot of memory due to the need to set up
the drawing surface, which usually takes
width * height * 4 bytes (4 bytes per pixel). So a 1024x1024
image takes up 4MB of memory (this is the quite typical
4x4 GWC metatile).
If someone is determined enough, and has access to a big
enough dataset on the server side, they can make a request
with a custom style that will use up 99% of the heap
without going OOM itself, but making any other
legitimate request go OOM.
Even without a big dataset, one can make a loop
of big enough requests and obtain the same effect.
Now, I think external tools can be used to throttle
too many requests from a single host, but those
tools won't be able to assess the image size being
requested.
So one config item I would like to add is a limit on image size.
As per Gabriel's suggestion in private mail, a cap of x MB
per request seems to be a good one.
It would be a global WMS parameter, simple to check,
and I would like to land a patch for this in 1.7.x,
without adding the param to the UI, and add the UI
in trunk instead.
The parameter could be a new full-fledged field,
or an entry in the metadata map. I would prefer the
former.
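
To make the check concrete, here is a minimal sketch of the per-request
cap, assuming the 4 bytes per pixel estimate above. The class name, the
method names and the use of IllegalArgumentException (a stand-in for a
proper service exception) are all hypothetical, none of this is
existing GeoServer API.

```java
/** Hypothetical helper, names are illustrative and not GeoServer API. */
final class RequestMemoryCheck {

    /** Bytes per pixel assumed for the drawing surface. */
    static final long BYTES_PER_PIXEL = 4;

    /** Estimated size of the drawing surface, in bytes. */
    static long estimateBytes(int width, int height) {
        // long arithmetic, so that very large requests do not overflow int
        return (long) width * height * BYTES_PER_PIXEL;
    }

    /**
     * Rejects requests whose drawing surface exceeds the cap.
     * A cap of zero or less is treated as "no limit" (assumption).
     */
    static void check(int width, int height, long maxRequestMemoryMb) {
        if (maxRequestMemoryMb <= 0) {
            return;
        }
        long required = estimateBytes(width, height);
        long allowed = maxRequestMemoryMb * 1024 * 1024;
        if (required > allowed) {
            // a real WMS would throw a service exception here
            throw new IllegalArgumentException("Image of " + width + "x" + height
                    + " needs " + required + " bytes, the limit is " + allowed);
        }
    }
}
```

With a 4 MB cap, check(1024, 1024, 4) passes, while check(2048, 2048, 4)
is rejected since that image needs 16 MB.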
Time usage
-------------------------------------------------
A request taking too much time to execute is
no good.
If you look at WFS, this requirement has been
translated from a time limit into a feature count limit,
and even in that case, we had to allow admins
to turn off bounds computation on the returned
feature collections because that single
thing could take minutes on big data sets.
WMS wise we could do the same, but in the
end a request can take a lot of time due to many
features, or to a few gigantic ones.
Gabriel provided a solution at the NY
sprint that involves setting up a thread pool
that executes the rendering, and that can be
timed out based on configuration (and that can also
limit how many threads actually perform
rendering).
I have some reservations about applying this kind
of solution to 1.7.x due to a couple of things:
- it always requires two threads per request,
one provided by the container that is executing
the http request, and another doing the actual
rendering in the thread pool
- it changes the way the request is executed even
if the admin did not activate it
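
For reference, the thread pool approach and the two threads per request
it implies can be sketched roughly as follows. This is not Gabriel's
actual code: the class, the exception and the parameter names are
hypothetical, and the timeout is expressed in milliseconds only to keep
the example short.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch, neither the sprint code nor GeoServer API. */
final class PooledRendering {

    /** Unchecked stand-in for whatever service exception the WMS would throw. */
    static class RenderingTimeoutException extends RuntimeException {
        RenderingTimeoutException(String message) {
            super(message);
        }
    }

    // fixed size pool: it also caps how many threads render at the same time
    private final ExecutorService pool;
    private final long timeoutMillis;

    PooledRendering(int maxRenderingThreads, long timeoutMillis) {
        this.pool = Executors.newFixedThreadPool(maxRenderingThreads);
        this.timeoutMillis = timeoutMillis;
    }

    <T> T render(Callable<T> renderingJob) {
        Future<T> future = pool.submit(renderingJob);
        try {
            // the container thread blocks here while a pool thread renders:
            // this is where the "two threads per request" come from
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // only interrupts the worker, the job must react to interruption
            future.cancel(true);
            throw new RenderingTimeoutException(
                    "Rendering took more than " + timeoutMillis + " ms");
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while waiting for rendering", e);
        } catch (ExecutionException e) {
            throw new RuntimeException("Rendering failed", e.getCause());
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Note that every request goes through the pool here, whether or not a
timeout was configured, which is the second reservation listed above.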
I was considering a lower tech solution involving
the usage of a timer. A timer is started just before
rendering begins, with the timeout as its delay.
If the rendering terminates within it, the timer
is just cancelled. If the timer fires instead,
it calls the stop() method on the renderer, and
for good measure it also disposes of the graphics
the renderer was using so that coverage rendering
is killed as well.
Mind, this ends up using extra threads as well, but
the rendering itself stays on the request thread, and if
the option is not enabled, the main path is not modified at all.
At that point, we can decide whether to throw a
service exception, or return the partially generated
image with some marker showing it timed out.
I would go for the former.
Configuration wise, I suggest we add a WMS timeout
specified in seconds, and again, add only the config
option to 1.7.x, and provide a UI for it on trunk.
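
A minimal sketch of the timer idea described above, assuming a renderer
that exposes a blocking paint call and a stop() method. The
StoppableRenderer interface, the class and the exception are hypothetical
stand-ins rather than the real renderer API, and the timeout is taken in
milliseconds here (the seconds from the configuration would be converted
by the caller).

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the renderer, only the calls used here. */
interface StoppableRenderer {
    void paint(); // blocks until rendering completes or is stopped
    void stop();  // asks a running paint() to terminate early
}

/** Hypothetical sketch of the timer based timeout, not GeoServer API. */
final class TimedRendering {

    /** Unchecked stand-in for the service exception. */
    static class RenderingTimeoutException extends RuntimeException {
        RenderingTimeoutException(String message) {
            super(message);
        }
    }

    // one shared daemon thread handles the timeouts of all requests
    private static final Timer TIMER = new Timer("wms-timeout", true);

    static void render(final StoppableRenderer renderer, long timeoutMillis) {
        if (timeoutMillis <= 0) {
            // option not enabled: the main path is left exactly as it was
            renderer.paint();
            return;
        }
        final AtomicBoolean timedOut = new AtomicBoolean(false);
        TimerTask task = new TimerTask() {
            @Override
            public void run() {
                timedOut.set(true);
                renderer.stop();
                // a real implementation would also dispose of the graphics
                // here, so that coverage rendering is interrupted too
            }
        };
        TIMER.schedule(task, timeoutMillis);
        try {
            // rendering still runs on the request thread
            renderer.paint();
        } finally {
            // harmless if the task already ran
            task.cancel();
        }
        if (timedOut.get()) {
            throw new RenderingTimeoutException(
                    "Rendering stopped after " + timeoutMillis + " ms");
        }
    }
}
```

This version throws instead of returning the partial image, in line with
the preference stated above. A rendering that completes at the very
moment the timer fires would still be reported as timed out; I did not
try to close that small window in this sketch.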
Number of rendering errors
--------------------------------------------------
The StreamingRenderer was developed for a long
time with uDig as its use case.
One of the effects of this shows up in its
"best effort rendering", which means the renderer
skips features it cannot render and goes on.
Typical issues that may arise during rendering
are reprojection problems, invalid geometries,
but also data source connections suddenly being
severed.
In the face of these, the renderer just keeps on going,
eventually wasting a lot of time handling exceptions.
I would like to add a max errors setting inside
the renderer. It was there once, and an error
counter is still available in the code, but most
of it has been removed.
This could also be implemented as a listener,
yet listeners are kind of heavy in that they
are also informed of each feature rendered, not only
of errors.
There is also the fact that by implementing
timeouts we make it impossible for this
"best effort rendering" to keep the CPU busy for more
than x seconds. Having this knob has its own merit
though, as wasting time handling exceptions is
an expensive and useless way to burn CPU cycles.
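
To illustrate the max errors setting, here is a minimal sketch of a
best effort loop with an error cap. It is not the StreamingRenderer
code: the class, the FeaturePainter interface and the exception are
hypothetical, and I am assuming that a value of zero or less keeps the
current unlimited behaviour.

```java
/** Hypothetical sketch of a best effort loop with an error cap. */
final class BestEffortLoop {

    /** Unchecked stand-in for the exception that would abort the request. */
    static class TooManyErrorsException extends RuntimeException {
        TooManyErrorsException(int errors, Throwable last) {
            super("Rendering aborted after " + errors + " errors", last);
        }
    }

    /** Hypothetical callback painting a single feature. */
    interface FeaturePainter<F> {
        void paint(F feature) throws Exception;
    }

    /**
     * Paints all features, skipping the ones that fail. Aborts as soon as
     * the number of failures goes above maxErrors; a maxErrors of zero or
     * less means no limit (assumption). Returns the features painted.
     */
    static <F> int paintAll(Iterable<F> features, FeaturePainter<F> painter,
            int maxErrors) {
        int painted = 0;
        int errors = 0;
        for (F feature : features) {
            try {
                painter.paint(feature);
                painted++;
            } catch (Exception e) {
                errors++;
                if (maxErrors > 0 && errors > maxErrors) {
                    throw new TooManyErrorsException(errors, e);
                }
                // best effort: skip this feature and go on
            }
        }
        return painted;
    }
}
```

Compared to a listener, the counter is only touched when something goes
wrong, so the normal rendering path pays nothing for it.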
Questions
-------------------------------------------------
Justin, to make sure, what's the effort involved
in adding an option to the configuration in a way
that it goes straight to the services without the
need to add it to the UI in 1.7.x?
I think it would require changing the XML reader/writer
classes and the involved ServiceInfo class, and that would
be it, assuming the patch goes down and grabs it?
I guess if I use the metadata map I would not even need
to change the reader/writer classes or the ServiceInfo,
but only change the service code, right?
Conclusion
-------------------------------------------------
While there are other items in the checklist for a more solid
server (like disallowing custom styles or disabling certain
output formats), the above seem to offer the
best bang for the buck, and I believe I can implement them
in the time I've been given (16 hours, for the record).
Feedback welcome.
Cheers
Andrea
--
Andrea Aime
OpenGeo - http://opengeo.org
Expert service straight from the developers.