[SAC] Runaway process on Projects VM

Martin L.,

We noticed that the projects VM has been having load issues for about
the last two hours. There seem to be a lot of processes and a connection
running under your user with massive I/O wait.
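
For reference, this is roughly how it shows up from the shell -- a quick
sketch, assuming iotop is actually installed on the VM (I haven't checked):

# overall I/O wait, the "%wa" figure in the Cpu(s) line
top -b -n 1 | head -5
# per-process I/O, if iotop is available (needs root)
iotop -b -n 1 -o | head -20
# processes stuck in uninterruptible sleep ("D" state), usually waiting on I/O
ps -eo user,pid,stat,cmd | awk '$3 ~ /^D/'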

Any idea what's going on?

Thanks,
Alex

On 03/12/2013 03:56 AM, Markus Neteler wrote:

Hi again,

there is also a crschmidt Python process potentially causing trouble:

dmesg
Mar 11 19:04:30 projects kernel: [7499374.049855] python[20053]:
segfault at 0 ip 00007f1ce8040f32 sp 00007fff44f89908 error 4 in
libc-2.11.3.so[7f1ce7fc5000+159000]
Mar 11 19:14:36 projects kernel: [7499979.356895] python[8413]:
segfault at 0 ip 00007f1ce8040f32 sp 00007fff44f89908 error 4 in
libc-2.11.3.so[7f1ce7fc5000+159000]
Mar 11 20:06:12 projects kernel: [7503075.591039] python[29533]:
segfault at 0 ip 00007f1ce8040f32 sp 00007fff44f89908 error 4 in
libc-2.11.3.so[7f1ce7fc5000+159000]
Mar 11 20:52:28 projects kernel: [7505851.389659] python[11701]:
segfault at 0 ip 00007f1ce8040f32 sp 00007fff44f89908 error 4 in
libc-2.11.3.so[7f1ce7fc5000+159000]
Mar 11 22:24:25 projects kernel: [7511368.687807] TCP: Possible SYN
flooding on port 80. Sending cookies. Check SNMP counters.
Mar 12 02:24:43 projects kernel: [7525786.958763] python[22365]:
segfault at 0 ip 00007f1ce8040f32 sp 00007fff44f89908 error 4 in
libc-2.11.3.so[7f1ce7fc5000+159000]
Mar 12 02:33:05 projects kernel: [7526288.270835] python[21809]:
segfault at 0 ip 00007f1ce8040f32 sp 00007fff44f89908 error 4 in
libc-2.11.3.so[7f1ce7fc5000+159000]
Mar 12 02:42:09 projects kernel: [7526832.527160] python[2854]:
segfault at 0 ip 00007f1ce8040f32 sp 00007fff44f89908 error 4 in
libc-2.11.3.so[7f1ce7fc5000+159000]
Mar 12 02:44:22 projects kernel: [7526965.583365] python[12697]:
segfault at 0 ip 00007f1ce8040f32 sp 00007fff44f89908 error 4 in
libc-2.11.3.so[7f1ce7fc5000+159000]
^C

projects:/home/neteler# ps -aef | grep python
root 1861 1 0 2012 ? 00:23:22 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/run/gunicorn/sr.gunicorn.pid
--name sr.gunicorn --user crschmidt --group metacrs --daemon
--log-file /var/log/gunicorn/sr.gunicorn.log --bind=0.0.0.0:8092
--workers=4 --worker-class=egg:gunicorn#sync --timeout=30
10011 7546 1861 0 02:42 ? 00:00:13 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/run/gunicorn/sr.gunicorn.pid
--name sr.gunicorn --user crschmidt --group metacrs --daemon
--log-file /var/log/gunicorn/sr.gunicorn.log --bind=0.0.0.0:8092
--workers=4 --worker-class=egg:gunicorn#sync --timeout=30
10011 7848 1861 0 02:44 ? 00:00:12 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/run/gunicorn/sr.gunicorn.pid
--name sr.gunicorn --user crschmidt --group metacrs --daemon
--log-file /var/log/gunicorn/sr.gunicorn.log --bind=0.0.0.0:8092
--workers=4 --worker-class=egg:gunicorn#sync --timeout=30
10011 9355 1861 0 02:24 ? 00:00:16 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/run/gunicorn/sr.gunicorn.pid
--name sr.gunicorn --user crschmidt --group metacrs --daemon
--log-file /var/log/gunicorn/sr.gunicorn.log --bind=0.0.0.0:8092
--workers=4 --worker-class=egg:gunicorn#sync --timeout=30
root 13060 1 0 Mar10 ? 00:03:00 python
/usr/sbin/denyhosts --daemon --purge --config=/etc/denyhosts.conf
--config=/etc/denyhosts.conf
www-data 20596 6443 0 03:53 ? 00:00:00 python
/osgeo/openlayers/sites/openlayers.org/dev/examples/proxy.cgi
www-data 20599 6377 0 03:53 ? 00:00:00 python
/osgeo/openlayers/sites/openlayers.org/dev/examples/proxy.cgi
www-data 20623 6310 0 03:54 ? 00:00:00 python
/osgeo/openlayers/sites/openlayers.org/dev/examples/proxy.cgi
root 20657 6127 0 03:54 pts/1 00:00:00 grep python
10011 21808 1861 0 00:43 ? 00:00:30 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/run/gunicorn/sr.gunicorn.pid
--name sr.gunicorn --user crschmidt --group metacrs --daemon
--log-file /var/log/gunicorn/sr.gunicorn.log --bind=0.0.0.0:8092
--workers=4 --worker-class=egg:gunicorn#sync --timeout=30

No idea what to do about it...

Markus

That one runs as a service, so:

sudo /etc/init.d/gunicorn restart

I just did that, since apache still seems to be using all 200 of its
allowed processes.

I don't know what the libc error is about, though. I wonder if someone
is overusing the proxy in some way, or the OpenLayers hosted JS (I
can't recall whether we implemented caching for that).
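
For reference, a rough sketch of how one can check both of those (the
access log path here is a guess for this vhost):

# running apache workers vs. the configured limit
ps -C apache2 --no-headers | wc -l
grep -i 'MaxClients' /etc/apache2/apache2.conf
# rough count of proxy.cgi hits in the access log (path is a guess)
grep -c 'proxy.cgi' /var/log/apache2/access.log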

Thanks,
Alex

Hi all,

2013/3/12 Alex Mandel <tech_dev@wildintellect.com>:

sorry for being silent; unfortunately I was partly offline yesterday,
then I was chatting with MarkusN and later got pinged by FrankW. I am
not sure what could be wrong. These cronjobs (building addons for
GRASS) had been running for a long time without any problems. I
disabled them this morning and will check what the source of the
problems could be. Anyway, there should currently be no load issues
related to my account. Sorry for the trouble! I will be more careful
about my cronjobs, even though they were running without problems before.

Thanks for support and patience, Martin

--
Martin Landa <landa.martin gmail.com> * http://geo.fsv.cvut.cz/~landa

On 03/12/2013 10:53 AM, Martin Landa wrote:

Hi all,

2013/3/12 Alex Mandel <tech_dev@wildintellect.com>:

sorry for being silent; unfortunately I was partly offline yesterday,
then I was chatting with MarkusN and later got pinged by FrankW. I am
not sure what could be wrong. These cronjobs (building addons for
GRASS) had been running for a long time without any problems. I
disabled them this morning and will check what the source of the
problems could be. Anyway, there should currently be no load issues
related to my account. Sorry for the trouble! I will be more careful
about my cronjobs, even though they were running without problems before.

Thanks for support and patience, Martin

I actually don't think it's your cronjobs, especially with nice added
to them. It looks like apache memory consumption, or a memory leak in
someone's app (perhaps python or php). Looking around at the options,
I think we might need to adjust MaxRequestsPerChild so that child
processes are killed slightly more often and their memory is reset.

Thanks,
Alex

On Tue, Mar 12, 2013 at 7:08 PM, Alex Mandel <tech_dev@wildintellect.com> wrote:
...

I actually don't think it's your cronjobs, especially with nice added
to them. It looks like apache memory consumption, or a memory leak in
someone's app (perhaps python or php). Looking around at the options,
I think we might need to adjust MaxRequestsPerChild so that child
processes are killed slightly more often and their memory is reset.

We (GRASS) have stopped everything, but Apache is still going haywire:

http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org-apache_processes.html

No idea what's going on there.

Markus

Thanks,
Alex

On Tue, Mar 12, 2013 at 7:08 PM, Alex Mandel <tech_dev@wildintellect.com> wrote:

It looks like apache memory consumption, or a memory leak in
someone's app (perhaps python or php). Looking around at the options,
I think we might need to adjust MaxRequestsPerChild so that child
processes are killed slightly more often and their memory is reset.

Looking at:
http://httpd.apache.org/docs/2.2/mod/mpm_common.html#maxrequestsperchild

-> Default: MaxRequestsPerChild 10000

but we have 0 - "If MaxRequestsPerChild is 0, then the process will
never expire."

I'll set it to 10000 and restart, since Apache is continuously blocked and
  http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org-apache_processes.html
has looked bad for some days now (I wonder what happened).
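
For the record, the change is roughly this in the prefork block of
/etc/apache2/apache2.conf (exact layout may differ), followed by a restart:

<IfModule mpm_prefork_module>
    # other prefork settings left as they are;
    # this was 0, i.e. workers were never recycled
    MaxRequestsPerChild 10000
</IfModule>

sudo /etc/init.d/apache2 restart   # or "apache2ctl graceful" to keep active connections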

Markus

On Wed, Mar 13, 2013 at 12:00:59PM +0100, Markus Neteler wrote:

On Tue, Mar 12, 2013 at 7:08 PM, Alex Mandel <tech_dev@wildintellect.com> wrote:
> It looks like apache memory consumption, or a memory leak in
> someone's app (perhaps python or php). Looking around at the options,
> I think we might need to adjust MaxRequestsPerChild so that child
> processes are killed slightly more often and their memory is reset.

Looking at:
http://httpd.apache.org/docs/2.2/mod/mpm_common.html#maxrequestsperchild

-> Default: MaxRequestsPerChild 10000

but we have 0 - "If MaxRequestsPerChild is 0, then the process will
never expire."

I'll set it to 10000 and restart, since Apache is continuously blocked and
  http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org-apache_processes.html
has looked bad for some days now (I wonder what happened).

So, I hopped on to look at this a bit this morning.

Here are the observations I made, and what I did to try to mitigate the problem.

1. Load up http://projects.osgeo.osuosl.org/server-status , and note that
    many, many connections appear to be stuck in the "Reading" state, which
    means they have not yet sent enough information to the server to make
    a request.
2. Bump MaxClients to the maximum of 256 -- no difference.
3. Turn the timeout down from 120 seconds to 5 seconds -- server behavior
    improves, but is still consistent with many clients opening a
    connection and trying to hold it for more than 5 seconds. Free slots
    exist, but any client taking longer than 5s is killed.

At this point, it seems like the problem is that many clients are connecting,
but are not sending packets after that for a long time. Unfortunately, I can't
see any traffic pattern that would explain this.
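
A quick way to watch this from the shell, assuming mod_status answers on
localhost the same way it does on the public vhost (a sketch):

# in the Scoreboard line: R = reading request, W = sending reply,
# _ = idle worker, . = open slot
curl -s 'http://localhost/server-status?auto' \
    | awk '/^Scoreboard:/ { print gsub(/R/, "R", $2), "workers stuck in Reading" }'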

One possible explanation is simply that we're network-bound on input, but that is
inconsistent with low-latency interactive ssh, and with the fact that lowering the
timeout seems to help. Another possible explanation is a DOS of
some sort, but I can't find any obvious evidence of that. (Of course, running
a webserver shared by many projects and accessed by a worldwide network of users
of many websites doesn't really look much different than a DOS to begin with,
so I'm hard pressed to deny the possibility entirely.)

After a bit more playing, it seems clear that whatever the timeout is, we will
have an extra ~10 requests per second hanging in the 'reading' state. Bumping
the timeout from 5s -> 30s doesn't seem to increase the overall r/s throughput
(around 68-70 r/s at the moment).

As a tradeoff between a higher timeout and letting clients in, I have:

- Bumped ServerLimit and MaxClients to 700
- Put the timeout to 30s

This means that any connection which doesn't get its request to us within 30s will
be closed by the server (vs. 120s before). Given that most traffic on the
projects VM is static file requests, I think this is a possibly unfortunate
but reasonable tradeoff.
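
In config terms the change is roughly this (prefork MPM; ServerLimit has to
be raised as well, since it is what caps MaxClients at 256 by default):

Timeout 30

<IfModule mpm_prefork_module>
    # ServerLimit must be >= MaxClients
    ServerLimit 700
    MaxClients  700
</IfModule>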

The setting of 700 is chosen so that, under the current operating conditions, we
have roughly 200 spare client slots available to respond. Having observed it for
the past 20 minutes, this seems relatively stable. From our historic munin data,
it looks like 700 is about 100 more than our maximum at peak times.

I do not feel comfortable saying that this is a good fix: it looks like something
has changed in the usage pattern, but I can't work out what it is, and I'm not
sure what the right tool is to debug the connections that are sitting in the
'waiting' state.
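
The closest thing I can think of is grouping the open port-80 connections
by client IP -- a rough sketch, assuming the stock net-tools netstat:

# established connections to port 80, grouped by remote IP, busiest first
netstat -tn | awk '$4 ~ /:80$/ && $6 == "ESTABLISHED" { split($5, a, ":"); print a[1] }' \
    | sort | uniq -c | sort -rn | head -20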

CPU, I/O utilization, and memory usage are reasonable, and requests are flowing
again.

As a first line of defense, if the problem becomes drastic again -- which
should be obvious from

http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org-apache_processes.html

if the 'busy servers' line goes flat -- the first change is to lower the Timeout
in /etc/apache2/apache2.conf to 5s and restart the server. Beyond that,
I'm a bit lost as to what we can do here.
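
For the record, the timeout change itself would be something like this
(double-check the sed match before restarting):

# drop the request timeout to 5 seconds, verify, then restart apache
sudo sed -i 's/^Timeout .*/Timeout 5/' /etc/apache2/apache2.conf
grep '^Timeout' /etc/apache2/apache2.conf
sudo /etc/init.d/apache2 restart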

Regards,
--
Christopher Schmidt
Web Developer

On Wed, Mar 13, 2013 at 8:34 AM, Christopher Schmidt
<crschmidt@crschmidt.net> wrote:

On Wed, Mar 13, 2013 at 12:00:59PM +0100, Markus Neteler wrote:

On Tue, Mar 12, 2013 at 7:08 PM, Alex Mandel <tech_dev@wildintellect.com> wrote:
> It looks like apache memory consumption, or a memory leak in
> someone's app (perhaps python or php). Looking around at the options,
> I think we might need to adjust MaxRequestsPerChild so that child
> processes are killed slightly more often and their memory is reset.

Looking at:
http://httpd.apache.org/docs/2.2/mod/mpm_common.html#maxrequestsperchild

-> Default: MaxRequestsPerChild 10000

but we have 0 - "If MaxRequestsPerChild is 0, then the process will
never expire."

I'll set it to 10000 and restart, since Apache is continuously blocked and
  http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org-apache_processes.html
has looked bad for some days now (I wonder what happened).

So, I hopped on to look at this a bit this morning.

Here are the observations I made, and what I did to try to mitigate the problem.

1. Load up http://projects.osgeo.osuosl.org/server-status , and note that
    many, many connections appear to be stuck in the "Reading" state, which
    means they have not yet sent enough information to the server to make
    a request.
2. Bump MaxClients to the maximum of 256 -- no difference.
3. Turn the timeout down from 120 seconds to 5 seconds -- server behavior
    improves, but is still consistent with many clients opening a
    connection and trying to hold it for more than 5 seconds. Free slots
    exist, but any client taking longer than 5s is killed.

At this point, it seems like the problem is that many clients are connecting,
but are not sending packets after that for a long time. Unfortunately, I can't
see any traffic pattern that would explain this.

One possible explanation is simply that we're network-bound on input, but that is
inconsistent with low-latency interactive ssh, and with the fact that lowering the
timeout seems to help. Another possible explanation is a DOS of
some sort, but I can't find any obvious evidence of that. (Of course, running
a webserver shared by many projects and accessed by a worldwide network of users
of many websites doesn't really look much different than a DOS to begin with,
so I'm hard pressed to deny the possibility entirely.)

After some more research, I found that the timed-out requests were all coming
from a narrow range of IPs which were trying to use OSGeo as a proxy server.
This behavior is ongoing in general -- about 5 r/s hitting OSGeo are trying,
and failing, to use our service as a proxy server -- but this block of IPs was
new, and was causing real harm.

I have, for the time being, implemented the following iptables rules, using help
from petschge on IRC:

DROP all -- 192.74.224.0/19 anywhere
DROP all -- 142.4.96.0/19 anywhere
DROP all -- 142.0.128.0/20 anywhere

These are from:

http://bgp.he.net/AS54600#_prefixes
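
The listing above is iptables -L output; the rules were added with commands
along these lines (assuming the INPUT chain; note they won't survive a
reboot unless saved):

iptables -I INPUT -s 192.74.224.0/19 -j DROP
iptables -I INPUT -s 142.4.96.0/19 -j DROP
iptables -I INPUT -s 142.0.128.0/20 -j DROP
# review later with
iptables -L INPUT -n -v --line-numbers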

This brought us back down to our previous load levels without other obvious
negative consequences.

I've left the apache config mostly the same for now -- though I bumped the
timeout up to 60s instead of 30s -- and it would be wise to watch it for a few
days. But the problem seems to have simply been that a network somewhat known
for spamming started trying to use our server for spam. That won't actually
work, but since we had an open proxy a year ago, it seems we still get regular
traffic trying to use us as a proxy server.

-- Chris