[SAC] Runaway process on Projects VM

On the ProjectsVM, we currently face:
"PHP Fatal error: Out of memory (allocated 6553600)..."

Any suggestions how to change the apache server values (again)?

Note, on IRC, MartinS pointed me to


which confirms a recurrent memory issue. The GRASS Wiki currently fails
to save edits because the (PHP) memory limit is being exceeded, and other
PHP applications on the VM are affected likewise.
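On the PHP side, the allocation reported in the error (6553600 bytes) is only
about 6.25 MB, far below what MediaWiki typically needs per request. The usual
first step is to raise memory_limit in php.ini; a sketch, with the path and the
64M value being assumptions for this VM:

```ini
; /etc/php5/apache2/php.ini -- path assumed for a Debian-era Apache VM
; 6553600 bytes is ~6.25 MB; 64M is a common floor for MediaWiki
memory_limit = 64M
```

Apache needs a reload after the change for the new limit to take effect.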



On Wed, Mar 13, 2013 at 1:34 PM, Christopher Schmidt
<crschmidt@crschmidt.net> wrote:

So, I hopped on to look at this a bit this morning.

Here are the observations I made, and what I did to try to negate the problem.

1. Load up http://projects.osgeo.osuosl.org/server-status , note that
    many, many connections appear to be stuck in the "Reading" state, which
    means they have not yet sent enough information to the server to make
    a request.
2. Bump MaxClients to the max of 256 -- no difference.
3. Turn down timeout from 120 seconds to 5 seconds -- server behavior
    improves, but behavior is still consistent with many clients opening a
    connection and trying to hold it for > 5 seconds. Free clients exist,
    but any client taking longer than 5s is killed.
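For reference, the stuck-in-Reading count can be pulled out of mod_status's
machine-readable view. A minimal bash sketch, assuming the ?auto endpoint is
enabled; the scoreboard string here is invented to stand in for a live fetch
("R" marks a worker in the Reading state):

```shell
# Scoreboard characters from mod_status: "_" waiting, "R" reading,
# "W" writing, "." open slot. This sample string is made up; on the VM
# you would fetch the real one with:
#   curl -s http://projects.osgeo.osuosl.org/server-status?auto
scoreboard="RRRR__WW_RRR..__W"
r_only=${scoreboard//[^R]/}            # bash: strip everything except R
echo "workers stuck in Reading: ${#r_only}"
```

Watching that count over a minute or two gives a rough arrival rate for hung
connections.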

At this point, it seems like the problem is that many clients are connecting,
but are not sending packets after that for a long time. Unfortunately, I can't
see any traffic pattern that would explain this.

One possible explanation is simply that we're network-bound on input, but that is
inconsistent with low-latency interactive ssh and with the fact that lowering the
timeout seems to help. Another possible explanation is a DOS of
some sort, but I can't find any obvious evidence of that. (Of course, running
a webserver shared by many projects and accessed by a worldwide network of users
of many websites doesn't really look much different than a DOS to begin with,
so I'm hard pressed to deny the possibility entirely.)

After a bit more playing it seems clear that whatever the timeout is, we will
have an extra 10 requests per second hanging in the 'reading' state. Bumping
the timeout from 5s -> 30s doesn't seem to increase the overall r/s throughput
(which is around 68-70 / s at the moment.)

As a tradeoff between a higher timeout and letting clients in, I have:

- Bumped ServerLimit and MaxClients to 700
- Put the timeout to 30s
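Concretely, those settings correspond to something like this fragment of
/etc/apache2/apache2.conf (a sketch for the prefork MPM; the exact directive
placement on the VM is an assumption):

```apache
# Close connections that stay idle longer than 30 seconds
Timeout 30

# Prefork MPM: allow up to 700 simultaneous worker processes
<IfModule mpm_prefork_module>
    ServerLimit 700
    MaxClients  700
</IfModule>
```

Note that ServerLimit must be at least as large as MaxClients, and a change to
ServerLimit only takes effect on a full stop/start, not a graceful reload.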

This means the server will close any connection that sits idle for 30s
(vs 120s before). Given that most traffic on the projects VM is static
file requests, I think this is a possibly-unfortunate, but reasonable
tradeoff.

The setting of 700 leaves us, under the current operating conditions,
with a spare of about 200 client slots. Observing for the past 20 minutes,
this seems relatively stable. From our historic munin data, it looks like
this will be about 100 more than our max at peak times.
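That headroom figure is consistent with a Little's-law estimate: each hung
connection pins a worker slot for the length of the timeout before being
reaped, so the pinned-slot count scales with Timeout. A rough sketch using
the numbers observed above:

```shell
# Little's law: slots pinned = arrival rate of hung connections * timeout
hung_per_sec=10   # new stuck-in-Reading connections per second (observed)
timeout=30        # current Apache Timeout in seconds
pinned=$((hung_per_sec * timeout))
echo "slots pinned by hung readers: $pinned"
```

Of the 700 slots, roughly 300 are therefore expected to be pinned at any
moment, with the remainder split between active requests and the spare margin.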

I do not feel comfortable saying that this is a good fix: something looks
like it has changed in the usage pattern. But I can't find out what it is,
and I don't know what tool to use to debug the connections that are sitting
in 'waiting' state.

CPU, I/O utilization, and memory usage are all reasonable, and requests are flowing.

As a first line of defense, if the problem becomes drastic again -- which
should be obvious from:


if the 'busy servers' line is flat, the first change is to lower the timeout
in /etc/apache2/apache2.conf to 5s, and restart the server. Beyond that,
I'm a bit lost as to what we can do here.
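The quick change itself is a one-line edit. A sketch of the substitution,
demonstrated here against a sample line (on the VM it would be sed -i against
/etc/apache2/apache2.conf, followed by an Apache restart):

```shell
# Show the edit on a sample line; the real recovery step is
#   sed -i 's/^Timeout .*/Timeout 5/' /etc/apache2/apache2.conf
# followed by "apache2ctl restart".
echo "Timeout 30" | sed 's/^Timeout .*/Timeout 5/'
```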

Christopher Schmidt
Web Developer
Sac mailing list
