[SAC] dsmc - tivoli backup system on osgeo1

Folks,

I am increasingly convinced that the automated backup done by dsmc
on osgeo1 is responsible for the IO contention that leads to service
unavailability on a fairly frequent basis.

I would like to disable these backups for a couple weeks to see if it helps
a lot. We have never used the backups and most service data is backed up
by other mechanisms anyways.

Thoughts? If there are no objections raised here in a couple days I might
just go ahead and do it.

Also, does anyone know how to manage this service? Do I need to file a ticket
with peer1?

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush | Geospatial Programmer for Rent

On Nov 19, 2009, at 3:36 PM, Frank Warmerdam wrote:

> Folks,
>
> I am increasingly convinced that the automated backup done by dsmc
> on osgeo1 is responsible for the IO contention that leads to service
> unavailability on a fairly frequent basis.
>
> I would like to disable these backups for a couple weeks to see if it helps
> a lot. We have never used the backups and most service data is backed up
> by other mechanisms anyways.
>
> Thoughts? If there are no objections raised here in a couple days I might
> just go ahead and do it.
>
> Also, does anyone know how to manage this service? Do I need to file a ticket
> with peer1?

+1, assuming we have documented by-hand or existing backup solutions
for all of our services that currently assume tivoli is doing
their backup (ldap?, databases?, htdocs directories?, trac config

Howard

On Thu, 19 Nov 2009 15:55:29 -0600
Howard Butler <hobu.inc@gmail.com> wrote:

On Nov 19, 2009, at 3:36 PM, Frank Warmerdam wrote:

> Folks,
>
> I am increasingly convinced that the automated backup done by dsmc
> on osgeo1 is responsible for the IO contention that leads to service
> unavailability on a fairly frequent basis.
>
> I would like to disable these backups for a couple weeks to see if
> it helps a lot. We have never used the backups and most service
> data is backed up by other mechanisms anyways.
>
> Thoughts? If there are no objections raised here in a couple days
> I might just go ahead and do it.
>
> Also, does anyone know how to manage this service? Do I need to
> file a ticket with peer1?

> + 1 Assuming we have documented by-hand or existing backup solutions
> for all of our services that are currently assuming tivoli's doing
> their backup (ldap?, databases?, htdocs directories?, trac config

Yes, you have to file a ticket to discuss this with them. It would be
good to let them know since I get a note from their automated
systems every time their backup system cannot be contacted from their
end.

On Thu, Nov 19, 2009 at 04:36:12PM -0500, Frank Warmerdam wrote:

> I am increasingly convinced that the automated backup done by dsmc
> on osgeo1 is responsible for the IO contention that leads to service
> unavailability on a fairly frequent basis.

Today I installed 'iostat' on the 'osgeo1' machine. The next time
you suspect IO contention (when I'm not around), please paste about
half a minute of its output into an email, running the command as:

  # ~> iostat -x 5
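
A minimal sketch of how that capture could be scripted, assuming the
sysstat 'iostat' accepts an interval followed by a report count (it does
on common Linux builds). The helper names and the log file name are
hypothetical, not something that exists on osgeo1:

```python
import subprocess


def iostat_command(interval_s=5, duration_s=30):
    """Build the iostat argument list for roughly duration_s seconds."""
    # iostat takes "interval count"; round the count up so the window is covered.
    count = max(1, -(-duration_s // interval_s))
    return ["iostat", "-x", str(interval_s), str(count)]


def capture(path, interval_s=5, duration_s=30):
    """Run iostat and write its output to path (hypothetical helper)."""
    with open(path, "w") as out:
        subprocess.run(iostat_command(interval_s, duration_s),
                       stdout=out, check=True)
```

Calling capture("iostat.log") would leave a file that can be attached to
a reply. Note that the first report iostat prints covers the averages
since boot, so only the later reports reflect the incident itself.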

Thanks,
  Martin.
--
Unix _IS_ user friendly - it's just selective about who its friends are !
--------------------------------------------------------------------------

Martin Spott wrote:

> On Thu, Nov 19, 2009 at 04:36:12PM -0500, Frank Warmerdam wrote:
>
>> I am increasingly convinced that the automated backup done by dsmc
>> on osgeo1 is responsible for the IO contention that leads to service
>> unavailability on a fairly frequent basis.
>
> Today I have installed 'iostat' on the 'osgeo1' machine. The next time
> you suspect IO contention (when I'm not around), please paste the
> output of approx. half a minute into an EMail, running the command as:
>
>   # ~> iostat -x 5

Martin,

log for a part of a minute attached. This is during a period when the load
average spiked to 23 or so, and "wait states" were around 65% in the top
report.

In this case, there is no sign of dsmc. The top of the top report looks like:

top - 02:43:22 up 19 days, 5:38, 2 users, load average: 23.77, 22.19, 19.00
Tasks: 333 total, 1 running, 332 sleeping, 0 stopped, 0 zombie
Cpu(s): 14.9% us, 2.6% sy, 0.0% ni, 21.9% id, 60.6% wa, 0.1% hi, 0.0% si
Mem: 2074860k total, 2050372k used, 24488k free, 5060k buffers
Swap: 2040244k total, 723380k used, 1316864k free, 453616k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
26106 apache    17   0 64836  45m 7516 S 25.8  2.2   0:15.01 httpd
26116 postgres  18   0 27032  11m  10m D  6.9  0.6   0:01.22 postmaster
26162 apache    16   0 40060  21m 6248 S  5.9  1.1   0:01.17 httpd
25202 apache    15   0 65620  42m 7656 D  4.6  2.1   0:11.93 httpd
26114 root      16   0  3252  540  188 S  4.6  0.0   0:00.74 gzip
22970 apache    16   0  228m 137m 7636 S  3.6  6.8   1:04.32 httpd
25971 postgres  17   0 26424  11m   9m D  3.6  0.5   0:03.70 postmaster
26149 postgres  16   0 26336  11m   9m S  3.3  0.5   0:01.87 postmaster
 2922 mysql     16   0  200m  93m 3344 S  2.6  4.6   3016:12 mysqld
25456 apache    15   0 76080  50m 7776 D  1.0  2.5   0:25.90 httpd
26159 postgres  15   0 26336  11m  10m D  1.0  0.5   0:00.61 postmaster
   67 root      15   0     0    0    0 S  0.7  0.0  33:18.59 kswapd0
25937 postgres  15   0 26380  11m   9m S  0.7  0.5   0:02.09 postmaster
26017 postgres  16   0 26768  11m  10m D  0.7  0.6   0:01.40 postmaster
26040 postgres  15   0 26380  11m   9m D  0.7  0.5   0:00.77 postmaster
26113 root      16   0  5020 1676 1296 S  0.7  0.1   0:00.11 pg_dump
26140 root      16   0  3188 1156  772 R  0.7  0.1   0:00.56 top
25252 postgres  16   0 26324 8440 7520 S  0.3  0.4   0:00.15 postmaster
25792 postgres  15   0 26336  10m  10m D  0.3  0.5   0:00.63 postmaster
25846 postgres  15   0 26952  11m  10m D  0.3  0.6   0:02.64 postmaster
25883 postgres  17   0 26336  10m 9512 D  0.3  0.5   0:00.22 postmaster
25993 postgres  15   0 26344  10m   9m D  0.3  0.5   0:00.45 postmaster
26003 postgres  17   0 26336  10m 9.9m D  0.3  0.5   0:00.35 postmaster
26020 apache    16   0 62312  43m 7664 S  0.3  2.2   0:05.19 httpd
26093 postgres  15   0 26336  11m  10m D  0.3  0.5   0:00.79 postmaster
26151 postgres  16   0 26340 8064 7068 D  0.3  0.4   0:00.08 postmaster

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush | Geospatial Programmer for Rent

log (3.93 KB)

Hi Frank, thanks for posting the report.

On Tue, Nov 24, 2009 at 02:46:34AM -0500, Frank Warmerdam wrote:

> In this case, there is no sign of dsmc.

... which doesn't come as a surprise, since the backup client doesn't
consume a lot of CPU cycles while it's waiting for the result of its
read request to be returned :-)
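
To make that point concrete: a process blocked on disk shows up in top
with state 'D' (uninterruptible sleep) and very little %CPU, so a
listing sorted by CPU can easily hide it. A minimal sketch, assuming the
12-column row layout of the report quoted above; SAMPLE is a handful of
rows copied from that report and blocked_processes is a hypothetical
helper:

```python
SAMPLE = """\
26106 apache 17 0 64836 45m 7516 S 25.8 2.2 0:15.01 httpd
26116 postgres 18 0 27032 11m 10m D 6.9 0.6 0:01.22 postmaster
25202 apache 15 0 65620 42m 7656 D 4.6 2.1 0:11.93 httpd
26114 root 16 0 3252 540 188 S 4.6 0.0 0:00.74 gzip
26140 root 16 0 3188 1156 772 R 0.7 0.1 0:00.56 top
"""


def blocked_processes(top_rows):
    """Return (pid, user, command) for rows whose state column is 'D'."""
    blocked = []
    for line in top_rows.splitlines():
        fields = line.split()
        # Columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip headers, blank lines and anything malformed
        if fields[7] == "D":
            blocked.append((int(fields[0]), fields[1], fields[11]))
    return blocked


for pid, user, command in blocked_processes(SAMPLE):
    print(pid, user, command)
```

Fed with a complete process list (for example from 'top -b -n 1') rather
than just the first screen, the same filter would also list dsmc
whenever it is stuck waiting for a read.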

Are 'we' running RAID5/6 here, or just stupid disk mirroring via
the 3ware controller? The backup server is 'bu2atl.bu.peer1.net'; if
this points to a fixed IP address, then we could try to throttle the
backup bandwidth via network traffic shaping.
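
One way to check the "fixed IP" assumption before writing any shaping
rules would be to resolve the name a few times over a day or so and
compare the answers. A minimal sketch; resolve and is_stable are
hypothetical helper names, and nothing here says what the name actually
resolves to:

```python
import socket


def resolve(hostname):
    """Return the set of IPv4 addresses the name currently resolves to."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return frozenset(addresses)


def is_stable(samples):
    """True if every lookup sample returned the same non-empty address set."""
    distinct = set(samples)
    return len(distinct) == 1 and bool(next(iter(distinct)))
```

Collecting resolve('bu2atl.bu.peer1.net') at intervals into a list and
passing that list to is_stable would tell us whether a shaping rule
keyed on a single address is likely to keep matching.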

Cheers,
  Martin.
--
Unix _IS_ user friendly - it's just selective about who its friends are !
--------------------------------------------------------------------------