[SAC] Fair projects VM usage (esp. OpenLayers)

Markus wrote:

top - 00:21:37 up 10 days, 3:53, 1 user,
load average: 42.23, 36.57, 31.99 <<---!!

Hi,

since we know exactly what time it happens each day, today I sat here
and watched it. It's bacula saturating the I/O, causing upwards of 50-90% of
CPU time to be spent waiting on kernel I/O, causing lots of the apache
processes to deadlock, which is [perhaps!] causing a whole bunch of them to
respawn threads, with the old ones left behind as hollow zombies that inflate
the reported load for a moment until the prior deadlock clears. System load
is reporting above 80 as I type this, but it is still responsive. Also I
notice the bacula job is running completely unniced.

I suggest running bacula with nice, and with 'ionice -c3', if for no other
reason than to stop the innocent apaches from getting deadlocked by it.
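
for example, something roughly like this (just a sketch; the daemon path and
config location on that box may differ):

  # start the file daemon at lowest CPU priority and idle I/O class
  nice -n 19 ionice -c3 /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf

or, for the already-running daemon:

  renice +19 -p $(pgrep -x bacula-fd)
  ionice -c3 -p $(pgrep -x bacula-fd)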

thanks,
Hamish

On Tue, Feb 14, 2012 at 8:49 AM, Hamish <hamish_b@yahoo.com> wrote:

Markus wrote:

top - 00:21:37 up 10 days, 3:53, 1 user,
load average: 42.23, 36.57, 31.99 <<---!!

Hi,

since we know exactly what time it happens each day, today I sat here
and watched it. It's bacula saturating the I/O, causing upwards of 50-90% of
CPU time to be spent waiting on kernel I/O, causing lots of the apache
processes to deadlock, which is [perhaps!] causing a whole bunch of them to
respawn threads, with the old ones left behind as hollow zombies that inflate
the reported load for a moment until the prior deadlock clears. System load
is reporting above 80 as I type this, but it is still responsive. Also I
notice the bacula job is running completely unniced.

I suggest running bacula with nice,

Yes, this may be a good idea.
But...

and with 'ionice -c3', if for no other
reason than to stop the innocent apaches from getting deadlocked by it.

... there seems to be another problem:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    7 root 20 0 0 0 0 R 100 0.0 45:56.36 ksoftirqd/1

ps -aef | grep python | cut -b1-100
10011 2251 17542 2 10:14 ? 00:00:24 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/
10011 2663 17542 2 10:14 ? 00:00:22 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/
10011 2798 17542 2 10:14 ? 00:00:22 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/
10011 3375 17542 2 10:15 ? 00:00:22 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/
10011 6437 17542 2 09:35 ? 00:01:21 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/
10011 12262 17542 2 09:38 ? 00:01:21 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/
10011 13087 17542 2 09:39 ? 00:01:20 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/
root 15098 1 0 Feb12 ? 00:00:34 python
/usr/sbin/denyhosts --daemon --purge --config
root 17542 1 0 Feb09 ? 00:01:30 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/
10011 20292 17542 2 10:10 ? 00:00:33 /usr/bin/python
/usr/bin/gunicorn_django --pid /var/
...

many!

ksoftirqd/1 is definitely too high. And no bacula around...

Markus

Markus Neteler wrote:

ksoftirqd/1 is definitely too high. And no bacula around...

... and fwiw lots and lots of python segfaults seen in dmesg

Hamish

can we /please/ "nice ionice -c3" the bacula job? This is ridiculous: the
projects VM turns to molasses for 20-25 minutes each day as all the apache
processes are deadlocked waiting for access to the disks and respawning
themselves like mad trying to keep availability intact.

currently:
load average: 103.93, 39.21, 15.08
Cpu(s): 3.4%us, 2.8%sy, 0.0%ni, 0.0%id, 93.6%wa, 0.0%hi, 0.2%si, 0.0%st

thanks,
Hamish
(slightly frustrated waiting 45 seconds for a wiki page to load)

On 02/27/2012 11:27 PM, Hamish wrote:

can we /please/ "nice ionice -c3" the bacula job? This is ridiculous: the
projects VM turns to molasses for 20-25 minutes each day as all the apache
processes are deadlocked waiting for access to the disks and respawning
themselves like mad trying to keep availability intact.

currently:
load average: 103.93, 39.21, 15.08
Cpu(s): 3.4%us, 2.8%sy, 0.0%ni, 0.0%id, 93.6%wa, 0.0%hi, 0.2%si, 0.0%st

thanks,
Hamish
(slightly frustrated waiting 45 seconds for a wiki page to load)

I have no objections to such an idea. I think we haven't made any
progress on it because no one knows how to do it.

Thinking about how bacula works, the server triggers the client and the
server is in charge. If there happens to be a script that gets run
before the backup, I can see how we might be able to specify nice.
Do you have any idea how to apply nice or ionice to a process that is
remotely triggered through a server-client system?

Maybe the bacula client service could be run with nice? Would that
require us to maintain a variation of the init.d script?

If you can provide more information on a way to do it, I'd be happy to try.

Thanks,
Ale

On Tue, Feb 28, 2012 at 12:38:22AM -0800, Alex Mandel wrote:

I have no objections to such an idea. I think we haven't made any
progress on it because no one knows how to do it.

Well, I don't agree ...
Instead, I wonder why Hamish doesn't simply do it himself instead of
complaining?

Cheers,
  Martin.

Hamish wrote:

> can we /please/ "nice ionice -c3" the bacula job?

Alex wrote:

I have no objections to such an idea. I think we haven't made any
progress on it because no one knows how to do it.

Thinking about how bacula works, the server triggers the client and the
server is in charge. If there happens to be a script that gets run
before the backup, I can see how we might be able to specify nice.
Do you have any idea how to apply nice or ionice to a process that is
remotely triggered through a server-client system?

Maybe the bacula client service could be run with nice?
Would that require us to maintain a variation of the init.d script?

If you can provide more information on a way to do it, I'd
be happy to try.

I don't have much experience re. how bacula is implemented or how the
client is triggered, so my best suggestions are a bit of a shot in the
dark. However, if the server does trigger a script to run the client,
adding "nice ionice -c3" in front of some part of the command might work.
Or, within the script, do "renice +15 -p $$" and "ionice -c3 -p $$" on
their own lines near the top of the script.
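
For example, the top of such a script could look roughly like this
(hypothetical; I don't know what script, if any, the server actually calls
on the client):

  #!/bin/sh
  # lower our own CPU and I/O priority before doing anything heavy;
  # $$ is this script's PID, and child processes inherit both settings
  renice +15 -p $$
  ionice -c3 -p $$
  # ... the actual backup commands follow ...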

Failing that, the "and" daemon could be installed/adapted and told to keep
an eye out for baculas (`apt-cache show and`). It might not be a bad idea
to set that up anyway, to lessen the overall effect of any processes
stuck in a loop.

A:

> I have no objections to such an idea. I think we haven't made any
> progress on it because no one knows how to do it.

Martin:

Well, I don't agree ...
Instead, I wonder why Hamish doesn't simply do it himself instead of
complaining?

.. for one thing, I'm not a bacula expert, so I don't like messing with
it myself without checking in with the local experts first; for another, I
don't have sudo rights on that VM.

thanks & regards,
Hamish

Alex wrote:

I have no objections to such an idea. I think we haven't
made any progress on it because no one knows how to do it.

Thinking about how bacula works, the server triggers the
client and the server is in charge. If there happens to
be a script that gets run before the backup I see how we
might be able to specify nice.

try this:

edit /usr/share/bacula-common/common-functions
adding this to the end:

# HB Mod 28 Feb 2012:
renice +19 -p $$
ionice -c3 -p $$

then sudo /etc/init.d/bacula-fd restart

and wait.
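
To check whether it took effect, running `ionice -p $(pgrep -x bacula-fd)`
afterwards should report "idle".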

cheers,
Hamish

On Tue, Feb 28, 2012 at 01:24:23AM -0800, Hamish wrote:

.. for one thing, I'm not a bacula expert, so I don't like messing with
it myself without checking in with the local experts first; for another, I
don't have sudo rights on that VM.

I'm certainly not a Bacula guru, but I do claim some familiarity
with Bacula (as well as with some other backup systems).

As such I've been pointing out that, according to my understanding,
you're trying to cure a symptom instead of fixing a cause. If you
don't agree, that's fine with me, but don't expect me to waste time on
stuff which I think follows the wrong track.

Cheers,
  Martin.

On 02/28/2012 01:50 AM, Martin Spott wrote:

On Tue, Feb 28, 2012 at 01:24:23AM -0800, Hamish wrote:

.. for one thing, I'm not a bacula expert, so I don't like messing with
it myself without checking in with the local experts first; for another, I
don't have sudo rights on that VM.

I'm certainly not a Bacula guru, but I do claim some familiarity
with Bacula (as well as with some other backup systems).

As such I've been pointing out that, according to my understanding,
you're trying to cure a symptom instead of fixing a cause. If you
don't agree, that's fine with me, but don't expect me to waste time on
stuff which I think follows the wrong track.

Cheers,
  Martin.

Martin,

What other solutions would you recommend?

Doing some research, I did see that someone requested such an ability be
added to bacula back in 2010, though I don't know whether it was ever put in.
One option I found was
ionice -c3 -p $(pgrep bacula)
I was thinking we could put that into a script and set bacula to run
that script before each backup, so basically it would re-nice itself every
time before starting the intensive stuff.
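
A rough sketch of what I mean (the script name and directive are just my
guess from the docs, untested):

  #!/bin/sh
  # e.g. /usr/local/sbin/bacula-deprioritize.sh, installed on the client:
  # drop bacula-fd to lowest CPU priority and the idle I/O class
  for pid in $(pgrep -x bacula-fd); do
      renice +19 -p "$pid"
      ionice -c3 -p "$pid"
  done

and then in the Job resource on the director, something along the lines of:

  ClientRunBeforeJob = "/usr/local/sbin/bacula-deprioritize.sh"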

Could it be that Martin is also suggesting we be more selective about
what gets backed up as a way to reduce the load?

Thanks,
Alex

Martin wrote:

As such I've been pointing out that, according to my
understanding, you're trying to cure a symptom instead of
fixing a cause.

It may well be that my theory is not correct, but even if it is
wrong, nice'ing background jobs is good practice and should
happen anyway. If it solves the problem, great; if the waiting-
for-kernel-I/O level remains very high, then we keep looking
for some other root cause, like the way the virtual disks are
set up or an unfortunate stripe size on the host's RAID. Maybe by
clearing away the noisy symptom we get a clearer look at what is
currently hidden behind it.
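
(e.g., running `iostat -x 5` on the VM, and if possible on the host, during
the backup window would show whether the underlying device's utilisation and
wait times are pegged, or whether the pain is purely self-inflicted.)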

If you don't agree, that's fine with me, but don't expect me
to waste time on stuff which I think follows the wrong track.

I ask nothing of you personally, just for help from the group
since now the performance of the server is poor for <=30 minutes
a day.

regards,
Hamish
(experimentalist)

On Tue, Feb 28, 2012 at 12:17:06PM -0800, Hamish wrote:

I ask nothing of you personally, just for help from the group
since now the performance of the server is poor for <=30 minutes
a day.

Has it been this way all the time since the 'new' infrastructure has been in
place?

Cheers,
  Martin.

Right NOW the machine is very slow due to bacula:

top - 23:21:06 up 28 days, 2:53, 1 user, load average: 23.56, 16.19, 7.67
Tasks: 319 total, 2 running, 317 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.6%us, 2.0%sy, 0.0%ni, 9.8%id, 84.2%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 8198148k total, 7102704k used, 1095444k free, 1148892k buffers
Swap: 4096564k total, 61776k used, 4034788k free, 2367504k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1023 root 20 0 154m 2796 888 S 6 0.0 126:06.31 bacula-fd

neteler@projects:~$ date -u
Thu Mar 1 07:22:00 UTC 2012

neteler@projects:~$ w
23:22:15 up 28 days, 2:54, 1 user, load average: 64.68, 31.19, 13.56
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
neteler pts/0 151.62.4.12 23:20 1.00s 0.20s 0.00s w

projects:/home/neteler# iostat
Linux 2.6.32-5-amd64 (projects.osgeo.osuosl.org) 02/29/2012
_x86_64_ (4 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
           7.87 0.27 4.48 2.35 0.00 85.04

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
vda 45.39 757.68 365.37 1840930995 887735416

I think that this should be improved with renicing or whatever.

Markus

Markus wrote:

Right NOW the machine is very slow due to bacula:

yes, every day at this same time for 20-25 minutes.

I just logged in; the daemon has now been reniced to +15 (thanks!), but
the system load was still >100.0 and time spent waiting on the kernel still
at >90.0% CPU.

`ionice -p $PID` shows that the daemon has not been ioniced, which is the
main thing that needs to happen to it, AFAICT.
  to do that: `ionice -c3 -p $PID`

the /usr/share/bacula-common/common-functions file is '.' sourced when
the bacula daemon starts up, so this would, I think, take care of it:

edit /usr/share/bacula-common/common-functions
adding this to the end:

# HB Mod 28 Feb 2012:
renice +19 -p $$
ionice -c3 -p $$

then sudo /etc/init.d/bacula-fd restart

Maybe it just cures the symptom, but the effects of the symptom are acute,
it takes only 30 seconds to apply, and it should probably be done as good
practice regardless of any other root cause which might also exist.

The Adhoc VM sees the same trouble happening, but on a much smaller scale,
with fewer things running on it and a less frequent full backup (it relies
on incremental backups between the full ones). The next full backup there
is still a few days away, so I'm waiting to see whether the above edit to
the 'common-functions' file helps much.

As to whether the problem existed before the upgrade: I don't know. Back in
October '11 it wasn't happening; a week after the upgrade (early Feb '12) it
was. I have no data in between.

thanks,
Hamish

On Thu, Mar 01, 2012 at 12:25:33AM -0800, Hamish wrote:

> edit /usr/share/bacula-common/common-functions
> adding this to the end:
>
> # HB Mod 28 Feb 2012:
> renice +19 -p $$
> ionice -c3 -p $$
>
> then sudo /etc/init.d/bacula-fd restart

Maybe it just cures the symptom, but the effects of the symptom are acute,
it takes only 30 seconds to apply, and it should probably be done as good
practice regardless of any other root cause which might also exist.

Even better, it would come at no cost at all to stop the repetition of
uneducated blah-blah about "good practice" of applying unspecific
measures to cover a specific cause.

  Martin.

On Thu, Mar 1, 2012 at 9:25 AM, Hamish <hamish_b@yahoo.com> wrote:

Markus wrote:

Right NOW the machine is very slow due to bacula:

yes, every day at this same time for 20-25 minutes.

I just logged in; the daemon has now been reniced to +15 (thanks!), but

Just for the record: I reniced it manually, so it's not a change that will persist.

Markus

Martin wrote:

Even better, it would come at no cost at all to stop the repetition of
uneducated blah-blah about "good practice" of applying unspecific
measures to cover a specific cause.

the intended meaning was the good practice of running time-insensitive
background jobs at low priority, instead of having them compete directly
for resources with the time-sensitive apps.

regards,
Hamish

PS: being rude doesn't help solve the issue or educate us out of
ignorance; it just makes good people walk away, leaving even larger
piles of work for the few who remain.

On Thu, Mar 01, 2012 at 02:04:34AM -0800, Hamish wrote:

Martin wrote:

> Even better, it would come at no cost at all to stop the repetition of
> uneducated blah-blah about "good practice" of applying unspecific
> measures to cover a specific cause,

the intended meaning was the good practice of running time-insensitive
background jobs at low priority, instead of having them compete directly
for resources with the time-sensitive apps.

No, this approach isn't "good practice" at all - and re-iterating your
statement doesn't make it any better. Instead, it's really _bad_
practice to apply such a measure unless you know what you're doing.

PS: being rude doesn't help solve the issue or educate us out of
ignorance; it just makes good people walk away, leaving even larger
piles of work for the few who remain.

Let's face it, we're all doing this as a hobby in our spare time. How do
you motivate people to dedicate their spare time?
Maybe almost every one of us who does paid IT work is familiar with
customers who'd rather tape over a cracking wall ten times instead of
replacing one or two unsound bricks - because they think it's cheaper.
OK, you're getting paid for it, and maybe that's the reason why you're
going to make your customer feel good by taping over the crack.

I don't intend to carry such habits into my spare time - we already have
too many occurrences of a similar style in the OSGeo server
infrastructure, and you're right now trying to make us believe that it
would be "good practice" to tape over a mock-up wall? Seriously!?

  Martin.

Martin wrote:

Seriously !?

This meta-discussion needs to end, now. If there are technical issues or
suggested solutions, please advise. Otherwise there are better things to do.

regards,
Hamish out.

In an attempt to eliminate the causes of performance issues affecting
the projects VM, and (hopefully) as a starting point for a general
directory cleanup, I've moved the "/osgeo/tilecache/wms-c/cache/"
directory to "/var/cache/tilecache/".
I've added an "Alias /wms-c/cache /var/cache/tilecache" to the Apache
config in "/etc/apache2/sites-available/tilecache", and I've modified the
"base=" in "/osgeo/tilecache/wms-c/metacarta.cfg" accordingly.

As far as _I_ can tell, the tilecache is still working as expected, but
please, affected parties, check your sites as well and report any failures.

This change cuts the backup time for the projects VM in half just for the
differential(!) backup, because it skips the check of about 5 million files.

Cheers,
  Martin.