[SAC] [OSGeo] #1325: ProjectsVM not responding

#1325: ProjectsVM not responding
---------------------------+------------------------------------------------
Reporter: jmckenna | Owner: sac@…
     Type: task | Status: new
Priority: critical | Milestone:
Component: Systems Admin | Keywords:
---------------------------+------------------------------------------------
- discussion: http://lists.osgeo.org/pipermail/sac/2014-April/004950.html
  - affecting many projects such as GRASS, GDAL, OpenLayers, MapServer,
    etc.
  - VM: http://wiki.osgeo.org/wiki/ProjectsVM

-jeff

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325>
OSGeo <http://www.osgeo.org/>
OSGeo committee and general foundation issue tracker.

Comment(by jmckenna):

I've asked for an update in #osuosl but got no response. They may not
know who I am. Maybe someone else can try as well.

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:1>

Comment(by jmckenna):

"Ramereth" in #osuosl says the ProjectsVM is rebooting now, very slowly.

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:2>

Comment(by bartvde):

still down, any update?

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:3>

Comment(by marcjansen):

Bumping this again, openlayers.org and others are still down.

Is there anything I could possibly do to get them up again?

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:4>

Comment(by wildintellect):

Know anything about optimizing I/O on a KVM-based VM?
We took care of the underlying hardware issue, a RAID rebuild for a drive
replacement. Now we're stuck wondering why the I/O is really bad, bad
enough to make Projects unresponsive. Apache is temporarily off while we
look for a way to prevent it from locking up.
There was a MapServer Sphinx job that got out of control due to the I/O
wait and spun up 4 simultaneous runs. Hopefully killing that is all that's
needed right now.
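
One common way to keep a scheduled job like the Sphinx build from stacking up when I/O is slow is a non-blocking file lock: a new run exits right away if an earlier run still holds the lock. A minimal Python sketch, assuming a Linux host; the lock path and the `try_lock` helper are hypothetical and not taken from the ProjectsVM setup:

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if already locked."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Hypothetical lock file location, used only for this illustration.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "sphinx-build-example.lock")

first = try_lock(LOCK_PATH)    # the first run gets the lock
second = try_lock(LOCK_PATH)   # an overlapping run is refused
print("first run holds lock:", first is not None)
print("second run refused:", second is None)
```

The lock goes away when the holding process exits, so a crashed run would not block later ones.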

FYI, OSUOSL thinks we need new RAID card batteries 2x
http://amzn.com/B0045ZNJJU

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:5>

Comment(by dmorissette):

For the record the RAID cards have been ordered and the order info sent to
Alex.

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:6>

Comment(by dmorissette):

Replying to [comment:6 dmorissette]:
> For the record the RAID cards have been ordered and the order info sent
> to Alex.

I meant RAID card batteries of course.

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:7>

Comment(by wildintellect):

Batteries have arrived. Looking to schedule a time to put them in. Is 3-4
hours from now a good time? Any objections? That's 10-11 pm Germany, 4-5 pm
US east coast, 1-2 pm US west coast.

http://www.timeanddate.com/worldclock/meetingdetails.html?year=2014&month=4&day=7&hour=21&min=0&sec=0&p1=217&p2=37
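
The three local windows quoted above can be cross-checked with a few lines of Python. This is only a sketch: it assumes the date is 7 April 2014 (taken from the link) and uses fixed UTC offsets for the daylight saving time in effect on that date instead of a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed UTC offsets for 7 April 2014 (daylight saving time everywhere).
CEST = timezone(timedelta(hours=2), "CEST")   # Germany
EDT = timezone(timedelta(hours=-4), "EDT")    # US east coast
PDT = timezone(timedelta(hours=-7), "PDT")    # US west coast

start = datetime(2014, 4, 7, 22, 0, tzinfo=CEST)  # 10 pm in Germany
end = start + timedelta(hours=1)

for zone in (CEST, EDT, PDT):
    local_start = start.astimezone(zone)
    local_end = end.astimezone(zone)
    print(zone.tzname(None),
          local_start.strftime("%H:%M"), "-", local_end.strftime("%H:%M"))
```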

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:8>

Comment(by martin):

Sounds plausible to me; please consider first equipping the host where the
secure VM resides.

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:9>

Comment(by jmckenna):

+1

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:10>

Comment(by crschmidt):

I don't know the current state of this, but here's some things I do know:

  1. The ProjectsVM Apache was swamped this morning.
  2. Somehow, in the past week, the number of available connections to the
     projectsVM apache was dropped:
     http://webextra.osgeo.osuosl.org/munin/osgeo.org/projects.osgeo.org/apache_processes.html
  3. Markus raised that back up.
  4. As always, a huge number of botnets were hitting the OSGeo projects VM
     trying to use it as an open proxy. (These requests were just returning
     404s.) I blackholed a number of IPs to pull the number of incoming
     connections down; I think I got about 40% of the incoming spam
     connections. iptables --list will show the ranges I blacklisted; I tried
     to be relatively conservative. I used the following command to find IPs
     to blacklist.

sudo tail -n 100 /var/log/apache2/docs.geotools.org-access_log |
  grep "http:" | cut -f 1-3 -d'.' | sort | uniq -c | sort -n
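
For readers who prefer to see the steps of that pipeline spelled out, here is a rough Python counterpart. It is a sketch only: it assumes each log line starts with the client's IPv4 address, the `count_proxy_prefixes` helper is a made-up name, and the sample lines are invented using documentation address ranges, not real log data.

```python
from collections import Counter


def count_proxy_prefixes(lines, last=100):
    """Count first-three-octet prefixes on lines containing 'http:'."""
    counts = Counter()
    for line in lines[-last:]:
        if "http:" in line:
            # Same as cut -f 1-3 -d'.': keep the first three dot-separated fields.
            prefix = ".".join(line.split(".")[:3])
            counts[prefix] += 1
    # Ascending by count, like the final `sort -n`.
    return sorted(counts.items(), key=lambda item: item[1])


# Invented sample lines, not taken from the real access log.
sample = [
    '192.0.2.10 - - [07/Apr/2014:09:00:00 -0700] "GET http://example.com/ HTTP/1.1" 404 0',
    '192.0.2.77 - - [07/Apr/2014:09:00:01 -0700] "GET http://example.org/ HTTP/1.1" 404 0',
    '198.51.100.5 - - [07/Apr/2014:09:00:02 -0700] "GET /index.html HTTP/1.1" 200 512',
    '203.0.113.9 - - [07/Apr/2014:09:00:03 -0700] "CONNECT http://example.net/ HTTP/1.1" 404 0',
]

for prefix, hits in count_proxy_prefixes(sample):
    print(hits, prefix)
```

Prefixes with the highest counts end up at the bottom of the listing, which is where the candidates for blacklisting would show up.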

Now that Apache is configured correctly and the incoming spam has decreased
a bit, the disks are still massively underperforming; iostat -x -m 2 shows
that there are as few as 2-3 disk seeks/sec going on with full
utilization and high await times. This usually means bad things; if the
RAID batteries have not been replaced yet, this would be consistent with
WriteThrough mode.

This means that the high load average on the projects vm appears to be
entirely due to bad disk performance. I don't know how to debug more --
the machine seems otherwise fine -- and I think the machine will run, with
abnormally high load numbers, though access that requires disk seeks will
be much slower than usual.

Device: rrqm/s wrqm/s  r/s  w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm  %util
vda       0.00   0.00 0.00 0.00  0.00  0.00     0.00     1.00  0.00  0.00 100.00
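
The pattern in that sample (100% utilization while almost no requests complete) can be checked mechanically. A small Python sketch that parses the row shown above; the `looks_stalled` helper and both of its thresholds are hypothetical choices for illustration, not established limits:

```python
HEADER = ("Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz "
          "avgqu-sz await svctm %util")
ROW = "vda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.00"


def parse_iostat_row(header, row):
    """Map the column names of an `iostat -x -m` header onto one device row."""
    names = header.split()[1:]          # drop the leading "Device:" label
    device, *values = row.split()
    return device, dict(zip(names, map(float, values)))


def looks_stalled(stats, util_threshold=95.0, iops_threshold=5.0):
    """Hypothetical heuristic: device is busy but completes almost no requests."""
    iops = stats["r/s"] + stats["w/s"]
    return stats["%util"] >= util_threshold and iops <= iops_threshold


device, stats = parse_iostat_row(HEADER, ROW)
print(device, "stalled" if looks_stalled(stats) else "ok")
```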

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:11>

Comment(by neteler):

Are the new batteries in place?

--
Ticket URL: <http://trac.osgeo.org/osgeo/ticket/1325#comment:12>

Comment(by wildintellect):

Yes, the new battery is in on osgeo4.

Limiting connections was intentional, so we could slowly raise the limit to
a level that balances the needs. BufferedLogs was also intentional, and we
doubled the RAM to make sure it wouldn't be an issue.

Note, sometimes the performance is actually impacted when a different
VM on the same host is going to swap. We have a memory leak on QGIS with
Redmine under Phusion Passenger that needs to be debugged.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1325#comment:13>

Comment(by wildintellect):

BufferedLogs is back on. <- we added more ram so this could be used
without issue, and it seems to make a big difference.
Connections down to 500 from 700. That should be enough based on the
historical charts.
Thanks for blacklisting bad IPs.

RAID card status checked: Current Cache Policy: WriteBack, ReadAdaptive,
Direct, Write Cache OK if Bad BBU
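
If this check needs to be repeated, the status line can be split into its individual settings and tested for write-back mode. A minimal Python sketch that works on the text quoted above; `parse_cache_policy` is a made-up helper name, and how the line is obtained from the controller is left out on purpose:

```python
def parse_cache_policy(line):
    """Split a 'Current Cache Policy: ...' status line into its settings."""
    _, _, settings = line.partition("Current Cache Policy:")
    return [item.strip() for item in settings.split(",") if item.strip()]


status = ("Current Cache Policy: WriteBack, ReadAdaptive, "
          "Direct, Write Cache OK if Bad BBU")
policy = parse_cache_policy(status)

# WriteBack means writes are acknowledged from the controller cache;
# WriteThrough would mean every write waits for the disks.
print("write-back active:", "WriteBack" in policy)
```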

We also need to check if the disk I/O can be improved on Mail.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1325#comment:14>

Comment(by neteler):

FWIW, the disk latency on the projVM is terrible right now, in
the seconds (!) range.

{{{
neteler@projects:~/cronjobs$ date
Sat Apr 26 03:18:13 PDT 2014
neteler@projects:~/cronjobs$ date -u
Sat Apr 26 10:18:23 UTC 2014
}}}
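
The two timestamps above are 10 seconds apart once both are expressed in UTC. A short Python sketch of that arithmetic, assuming PDT is UTC-7; note that the gap also includes whatever time it took to type the second command, so it is only an upper bound on the delay:

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")  # assumed offset for PDT

# Timestamps copied from the two `date` outputs above.
local_stamp = datetime(2014, 4, 26, 3, 18, 13, tzinfo=PDT)
utc_stamp = datetime(2014, 4, 26, 10, 18, 23, tzinfo=timezone.utc)

gap = utc_stamp - local_stamp
print("seconds between the two commands:", gap.total_seconds())
```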

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1325#comment:15>

Comment(by wildintellect):

It appears to be at the same time each day; I suspect it may be the backup
job. Maybe we can throttle the speed of the backup or shift it to a time
when there is less traffic. If I'm reading the timing right, it's during
prime morning hours in EU. Perhaps moving it up a couple of hours to early
morning EU, late night US?

I don't necessarily think that's the root cause, but maybe a slight shift to
a less-used time will help us decipher what else is going on.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1325#comment:16>

Comment(by hamish):

Replying to [comment:16 wildintellect]:
> It appears to be at the same time each day, I suspect it may
> be the backup job. Maybe we can throttle the speed of the backup

is the backup job running at 'ionice -c3'? (consume i/o only when
otherwise idle)

Hamish

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1325#comment:17>

Comment(by martin):

No, it's not running at 'ionice -c3' and I'm having mixed feelings wrt.
consuming I/O only when otherwise idle because in real life the system is
hardly ever close to idle.

Running at 'ionice -c 2 -n 7' the weekly full backup of just the Projects
VM took 10 hours at an average bandwidth of 1.6 MByte/s, and the VM still
wasn't very responsive. In fact I didn't notice any relevant difference
after applying ionice. Thus I'm curious about what you're trying to achieve
by further reducing its priority?
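
To put those figures in perspective, 10 hours at 1.6 MByte/s works out to a fairly small amount of data. A quick Python check, assuming the average applies to the whole 10 hours and using decimal units:

```python
HOURS = 10                 # duration of the weekly full backup, from the comment
RATE_MBYTE_PER_S = 1.6     # average bandwidth, from the comment

seconds = HOURS * 3600
total_mbyte = RATE_MBYTE_PER_S * seconds
total_gbyte = total_mbyte / 1000   # assumes decimal units (1 GByte = 1000 MByte)

print(f"{total_mbyte:.0f} MByte, about {total_gbyte:.1f} GByte")
```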

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1325#comment:18>

Comment(by neteler):

The projectsVM continues to be very, very slow.

Suggestion
  * create a new VM
  * migrate per-project stuff over to the new machine

Rationale: The current projectsVM is overly complex and by now
impossible to maintain.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1325#comment:19>