[SAC] xblade14 distress

Folks,

I found xblade14 in distress this morning. Load average very high, and
barely responsive.

I stopped httpd for a while, and observed that several buildbot slaves were
active, as well as some of Markus' scripts and the GDAL doxygen web site
rebuild. The doxygen rebuild seemed to be the heaviest of these. Once it
finished I restarted Apache and all is well.

I have tuned my web site doxygen rebuild for gdal down to once per day
(it was roughly eight times per day).

I think it would also be very helpful to migrate buildbot slaves off this
blade, as I have noted before.

In the meantime, if it is not too difficult Mateusz, could you disable all
automatic GDAL buildbot rebuilds, leaving it to us to force a rebuild when
we want to check something? This should at least moderate the load somewhat.
GDAL isn't the only buildbot slave on xblade14, but I think it is one of
the heavier ones.
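For reference, disabling the automatic builds could look something like the
following master.cfg sketch. This is a hedged illustration only: the
scheduler and builder names ("gdal-auto", "gdal-linux") are hypothetical,
not the actual OSGeo configuration.

```python
# Sketch of a buildbot master.cfg fragment; the scheduler/builder names
# below are hypothetical, not the real OSGeo setup.
c = BuildmasterConfig = {}

# Before: a periodic scheduler rebuilt GDAL several times a day, e.g.:
# from buildbot.scheduler import Periodic
# c['schedulers'] = [Periodic("gdal-auto", ["gdal-linux"],
#                             periodicBuildTimer=3 * 60 * 60)]

# After: no automatic schedulers at all, so builds run only when someone
# presses "Force Build" in the buildbot web status.
c['schedulers'] = []
```

With the schedulers list empty, commits no longer trigger builds and the
load is incurred only on explicit request.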

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush | Geospatial Programmer for Rent

Frank Warmerdam wrote:

I think it would also be very helpful to migrate buildbot slaves off this
blade, as I have noted before.

In the meantime, if it is not too difficult Mateusz, could you disable all
automatic GDAL buildbot rebuilds, leaving it to us to force a rebuild when
we want to check something? This should at least moderate the load somewhat.
GDAL isn't the only buildbot slave on xblade14, but I think it is one of
the heavier ones.

Frank,

I'm working on the GDAL buildbot slave migration to buildtest.
I'm on it now. It's about time.

Best regards,

-----
--
Mateusz Loskot
http://mateusz.loskot.net
--
View this message in context: http://n2.nabble.com/xblade14-distress-tp4273533p4279684.html
Sent from the OSGeo System Administration Committee mailing list archive at Nabble.com.

Mateusz Loskot wrote:

Frank Warmerdam wrote:

I think it would also be very helpful to migrate buildbot slaves off this
blade, as I have noted before.

In the meantime, if it is not too difficult Mateusz, could you disable all
automatic GDAL buildbot rebuilds, leaving it to us to force a rebuild when
we want to check something? This should at least moderate the load somewhat.
GDAL isn't the only buildbot slave on xblade14, but I think it is one of
the heavier ones.

Frank,

I'm working on the GDAL buildbot slave migration to buildtest.
I'm on it now. It's about time.

Frank,

Job done. GDAL now lives on buildtest.
Let me know if other slaves should go there too.

While spending hours ssh'ed into xblade14-2 and xblade11-2, I noticed that
the average load on xblade14-2 is constantly around 2-3, and that two kinds
of processes occupy most of the resources: 2-3 x httpd and 1 x mysqld.
The load on buildtest is less than 1 most of the time.

Best regards,

-----
--
Mateusz Loskot
http://mateusz.loskot.net
--
View this message in context: http://n2.nabble.com/xblade14-distress-tp4273533p4283025.html
Sent from the OSGeo System Administration Committee mailing list archive at Nabble.com.

Mateusz wrote:

While spending hours ssh'ed into xblade14-2 and
xblade11-2, I noticed that the average load on xblade14-2
is constantly around 2-3, and that two kinds of processes
occupy most of the resources: 2-3 x httpd and 1 x mysqld.

is the mysqld going to be the grass mediawiki-backend?

(no idea)

Hamish

Hamish wrote:

Mateusz wrote:

While spending hours ssh'ed into xblade14-2 and
xblade11-2, I noticed that the average load on xblade14-2
is constantly around 2-3, and that two kinds of processes
occupy most of the resources: 2-3 x httpd and 1 x mysqld.

is the mysqld going to be the grass mediawiki-backend?

Hamish,

That is the only service using mysql that I am aware of.

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush | Geospatial Programmer for Rent

On Mon, Jan 11, 2010 at 2:05 AM, Hamish <hamish_b@yahoo.com> wrote:

Mateusz wrote:

While spending hours ssh'ed into xblade14-2 and
xblade11-2, I noticed that the average load on xblade14-2
is constantly around 2-3, and that two kinds of processes
occupy most of the resources: 2-3 x httpd and 1 x mysqld.

is the mysqld going to be the grass mediawiki-backend?

[you can figure out with lsof -p PID]
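The same check can be sketched in Python by reading /proc/&lt;pid&gt;/fd,
which is essentially what lsof reports on Linux. The PID used below is
our own process, standing in for mysqld's actual PID, so this is an
illustration rather than the command run on the blade:

```python
import os

def open_files(pid):
    """Return the paths a process currently has open (Linux /proc only)."""
    fd_dir = "/proc/%d/fd" % pid
    paths = []
    for fd in os.listdir(fd_dir):
        try:
            paths.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            continue  # the fd was closed between listdir and readlink
    return paths

# Inspect our own process as a stand-in for mysqld's PID:
files = open_files(os.getpid())
wiki_hits = sum("wiki" in p for p in files)
print(len(files), "open files,", wiki_hits, "matching 'wiki'")
```

Grepping the resulting paths for "wiki" or "qgis" gives the same kind of
answer as the lsof pipelines quoted later in this thread.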

Right now (Mon Jan 11 21:12:40 UTC 2010), the machine
is almost dead again... Apache is unresponsive and
I can login (accpeted) but don't reach command line.

From
http://home.gdal.org/tmp/server-status.html
I don't see anything in particular. Not sure what's going on...

Markus

Hi again,

the blade granted me a command line :-)

Top is showing high load average but almost no CPU usage:

top - 13:35:28 up 6:23, 2 users, load average: 62.46, 75.51, 49.82
Tasks: 229 total, 5 running, 224 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.6% us, 3.6% sy, 0.0% ni, 0.0% id, 93.1% wa, 0.7% hi, 0.0% si
Mem: 1034320k total, 1022192k used, 12128k free, 5824k buffers
Swap: 2096472k total, 1072120k used, 1024352k free, 40772k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10685 apache 16 0 40556 8240 2784 D 1.7 0.8 0:00.05 httpd
10592 nobody 16 0 5096 1516 824 D 1.0 0.1 0:00.20 rsync
4530 root 16 0 2104 964 676 R 0.3 0.1 0:31.00 top
28909 apache 16 0 57272 18m 5780 D 0.3 1.8 1:11.57 httpd
8854 apache 16 0 55028 20m 3776 R 0.3 2.0 0:03.47 httpd
8938 apache 16 0 52212 12m 3880 D 0.3 1.3 0:01.36 httpd
8985 apache 16 0 52232 12m 4124 D 0.3 1.3 0:01.37 httpd
9739 neteler 16 0 2108 968 680 R 0.3 0.1 0:03.25 top
10305 apache 15 0 45132 13m 3980 S 0.3 1.4 0:00.68 httpd
10336 apache 15 0 41000 9.9m 4052 D 0.3 1.0 0:00.40 httpd
    1 root 15 0 1744 444 420 S 0.0 0.0 0:00.67 init
    2 root 34 19 0 0 0 S 0.0 0.0 0:00.06 ksoftirqd/0
    3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
    4 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 events/0
    5 root 19 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
    6 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kthread
   10 root 10 -5 0 0 0 S 0.0 0.0 0:01.28 kblockd/0
   11 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 kacpid
  168 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 khubd
...

Half of the swap is used, which is not good.
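Swap usage can be checked programmatically as well. A minimal Linux-only
Python sketch, shown for illustration rather than as what was run on the
blade, parses /proc/meminfo:

```python
def meminfo():
    """Parse /proc/meminfo into a {field: kB} dict (Linux only)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])  # values are reported in kB
    return info

m = meminfo()
swap_used = m["SwapTotal"] - m["SwapFree"]
if m["SwapTotal"]:
    print("swap used: %.0f%%" % (100.0 * swap_used / m["SwapTotal"]))
else:
    print("no swap configured")
```

A box that is half into swap while also showing 93% I/O wait in top is
spending most of its time paging, which matches the symptoms above.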

Searching for processes stuck in uninterruptible sleep (state D):

top -b -n 1 | awk '{if (NR <= 7) print; else if ($8 == "D") {print; count++}} END {print "Total status D: " count}'
top - 13:37:10 up 6:25, 2 users, load average: 83.92, 78.91, 53.75
...
10774 root 17 0 6928 5224 1644 D 5.8 0.5 0:00.25 mrtg
10305 apache 16 0 45132 13m 3980 D 1.9 1.4 0:00.87 httpd
  400 root 10 -5 0 0 0 D 0.0 0.0 0:02.28 kjournald
  898 root 10 -5 0 0 0 D 0.0 0.0 0:01.26 kjournald
1223 root 15 0 1612 540 492 D 0.0 0.1 0:00.11 syslogd
1462 root 16 0 1824 576 532 D 0.0 0.1 0:00.19 automount
3295 apache 16 0 63916 12m 6156 D 0.0 1.2 3:06.06 httpd
26976 apache 16 0 63140 18m 6156 D 0.0 1.8 1:29.20 httpd
28446 apache 16 0 63236 15m 5808 D 0.0 1.6 1:09.26 httpd
28906 apache 16 0 57188 12m 5776 D 0.0 1.3 1:16.37 httpd
...
10816 apache 16 0 39080 7184 3148 D 0.0 0.7 0:00.03 httpd
10820 apache 16 0 38868 5808 2248 D 0.0 0.6 0:00.04 httpd
10885 apache 15 0 38868 4828 1516 D 0.0 0.5 0:00.00 httpd
10894 neteler 16 0 1572 420 364 D 0.0 0.0 0:00.01 md5sum
10896 neteler 16 0 228 8 0 D 0.0 0.0 0:00.00 cut
Total status D: 80
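The awk filter above can also be sketched in Python by reading the state
field from /proc/&lt;pid&gt;/stat directly. This is a minimal illustration,
not what was run on the blade:

```python
import glob

def count_state(target="D"):
    """Count processes whose state matches target ('D' = uninterruptible sleep)."""
    count = 0
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                data = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The comm field may contain spaces; the state char follows the last ')'.
        state = data.rsplit(")", 1)[1].split()[0]
        if state == target:
            count += 1
    return count

print("Total status D:", count_state())
```

Dozens of D-state processes together with ~93% I/O wait usually point at a
saturated disk or hung filesystem rather than genuine CPU load.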

Way too many jobs stuck in state D.
Sounds like another reboot?

Markus

On Mon, Jan 11, 2010 at 3:37 AM, Frank Warmerdam <warmerdam@pobox.com> wrote:

Hamish wrote:

Mateusz wrote:

While spending hours ssh'ed into xblade14-2 and
xblade11-2, I noticed that the average load on xblade14-2
is constantly around 2-3, and that two kinds of processes
occupy most of the resources: 2-3 x httpd and 1 x mysqld.

is the mysqld going to be the grass mediawiki-backend?

Hamish,

That is the only service using mysql that I am aware of.

It is also used in particular by QGIS/Joomla:
joomla15_qgisweb

[root@xblade14 neteler]# lsof -p 1742 | grep wiki | wc -l
2
[root@xblade14 neteler]# lsof -p 1742 | grep qgis | wc -l
67

For the given usage, IMHO the machine does not have
enough RAM. Could some more be added? This might
help avoid heavy swapping when peaks hit the machine.

Markus

Markus Neteler wrote:

On Mon, Jan 11, 2010 at 3:37 AM, Frank Warmerdam <warmerdam@pobox.com> wrote:

Hamish wrote:

Mateusz wrote:

While spending hours ssh'ed into xblade14-2 and
xblade11-2, I noticed that the average load on xblade14-2
is constantly around 2-3, and that two kinds of processes
occupy most of the resources: 2-3 x httpd and 1 x mysqld.

is the mysqld going to be the grass mediawiki-backend?

Hamish,

That is the only service using mysql that I am aware of.

It is also used in particular by QGIS/Joomla:
joomla15_qgisweb

[root@xblade14 neteler]# lsof -p 1742 | grep wiki | wc -l
2
[root@xblade14 neteler]# lsof -p 1742 | grep qgis | wc -l
67

For the given usage, IMHO the machine does not have
enough RAM. Could some more be added? This might
help avoid heavy swapping when peaks hit the machine.

Markus,

I do not believe it is practical to modify the hardware. I think we just
need to migrate some heavy services elsewhere.

Potentially the qgis site should be migrated to OSL once we have
servers set up there.

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush | Geospatial Programmer for Rent

On Fri, Jan 08, 2010 at 11:40:14AM -0500, Frank Warmerdam wrote:

Folks,

I found xblade14 in distress this morning. Load average very high, and
barely responsive.

A series of fixes has alleviated this situation to some extent.

1. The root cause of the change is the popularity of the new qgis release:
   the qgis website was the culprit for the drop in performance.
2. The number of apache children on this server was completely out of line
   with the actual server RAM. Lowering the maximum to 30 leaves the machine
   usable even when the load is high.
3. Once the problem was narrowed to qgis, I changed the qgis homepage to a
   static page, and had macha enable caching on the site. This seemed to
   resolve the performance problem.
4. Reverting the homepage to a dynamic page caused it to break again.
5. The homepage had a poll, which no other pages have; this was causing
   80% CPU load with only a few requests.
6. Removing the poll fixed the performance issues.
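Point 2 is simple arithmetic: each apache child holds its own resident
memory, so the maximum child count has to fit in the RAM left over after
other services. A back-of-the-envelope sketch, where every figure is an
assumption taken loosely from the top output earlier in the thread rather
than a measured limit:

```python
# All figures are rough assumptions based on the thread, not measurements:
total_mb = 1024      # the blade reports ~1 GB RAM in top
reserved_mb = 300    # allowance for mysqld, OS, buffers, other daemons
per_child_mb = 20    # typical httpd RSS seen in the top listings above

# Maximum apache children that fit without pushing the box into swap:
max_clients = (total_mb - reserved_mb) // per_child_mb
print(max_clients)
```

That lands in the mid-30s, so a cap of 30 leaves a little headroom.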

The server is now handling load reasonably with a maximum of 30 apache
children and no poll on the qgis Joomla homepage.

Performance will still be somewhat slower than usual; on average this server
is currently doing 2x the number of requests that osgeo1 does (6 reqs/sec
vs. 3 reqs/sec), on one small CPU vs. 4 big ones. So don't kill the machine,
but it should be reasonable to use again.

Regards,
--
Christopher Schmidt
MetaCarta