[SAC] [OSGeo] #1940: osgeo4 raid needs a replacement

#1940: osgeo4 raid needs a replacement
---------------------------+-------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: normal | Milestone:
Component: Systems Admin | Keywords:
---------------------------+-------------------
As reported by Justin L Dugger (IRC nickname pwnguin) from OSUOSL staff
there's a degraded raid on osgeo4 server which should be replaced.

This ticket is to track operations toward getting that disk replaced.

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1940>
OSGeo <http://www.osgeo.org/>
OSGeo committee and general foundation issue tracker.

On 06/12/2017 03:17 PM, OSGeo wrote:

> #1940: osgeo4 raid needs a replacement
> ---------------------------+-------------------
> Reporter: strk | Owner: sac@…
>      Type: task | Status: new
> Priority: normal | Milestone:
> Component: Systems Admin | Keywords:
> ---------------------------+-------------------
> As reported by Justin L Dugger (IRC nickname pwnguin) from OSUOSL staff
> there's a degraded raid on osgeo4 server which should be replaced.
>
> This ticket is to track operations toward getting that disk replaced.

As I stated when it was first noticed a while back, my preferred
solution is to migrate the final 2 VMs off OSGeo4 and retire the machine.

OSGeo6 is completely capable of handling the load. The fastest method
would be to do KVM managed by libvirt (or similar). A better solution
would probably be to rebuild the few remaining services in containers.

Thanks,
Alex

#1940: osgeo4 raid needs a replacement
---------------------------+--------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: normal | Milestone:
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+--------------------
Changes (by strk):

* cc: wildintellect (added)

Comment:

NOTE: this got assigned ID {{{[support.osuosl.org #29544]}}} by OSUOSL
support.

Alex, do you know how to proceed here?

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1940#comment:1>
OSGeo <http://www.osgeo.org/>
OSGeo committee and general foundation issue tracker.

On Mon, Jun 12, 2017 at 10:05:16PM -0700, Alex Mandel wrote:

> > #1940: osgeo4 raid needs a replacement
>
> As I stated when it was first noticed a while back, my preferred
> solution is to migrate the final 2 VMs off OSGeo4 and retire the machine.
>
> OSGeo6 is completely capable of handling the load. The fastest method
> would be to do KVM managed by libvirt (or similar). A better solution
> would probably be to rebuild the few remaining services in containers.

I've the impression that given the lack of a new sysadmin contract
the fastest thing here would be to buy a new disk. It can always be
moved to one of the existing machines later (we have space issues
in the download machine if I'm not mistaken?)

Or, how about setting up a new sysadmin contract to do the final VM
migration (and maybe other things, if there's enough budget)?

BTW: last time I checked I could not find a wiki page with
information about setting up VMs; is there one?

--strk;

On 06/13/2017 08:54 AM, Sandro Santilli wrote:

> On Mon, Jun 12, 2017 at 10:05:16PM -0700, Alex Mandel wrote:
>
> > > #1940: osgeo4 raid needs a replacement
> >
> > As I stated when it was first noticed a while back, my preferred
> > solution is to migrate the final 2 VMs off OSGeo4 and retire the machine.
> >
> > OSGeo6 is completely capable of handling the load. The fastest method
> > would be to do KVM managed by libvirt (or similar). A better solution
> > would probably be to rebuild the few remaining services in containers.
>
> I've the impression that given the lack of a new sysadmin contract
> the fastest thing here would be to buy a new disk. It can always be
> moved to one of the existing machines later (we have space issues
> in the download machine if I'm not mistaken?)

OSGeo3, which contains Downloads, has a space issue, but that's actually
just a disk allocation issue; it has more space. But nothing new should
go on that machine; it's just as old.

FYI, adding a replacement disk the last time caused significant downtime
to rebuild the raid (days). If I can get sudo access to the host (need
Martin's help) I know how to migrate a ganeti lvm volume to a new host.
I'll see if I can write up the process in the next week. It's not
actually that many commands. I expect the downtime to be the same or
less than rebuilding the raid.

The 2 VMs still on osgeo4 are, I believe, adhoc and qgis. qgis contains
their bug tracker, which, if I recall, the qgis PSC has been debating the
future of. Adhoc could probably stand a fresh rebuild since it was
always a crazy pile of various demo servers to start with.

> Or, how about setting up a new sysadmin contract to do the final VM
> migration (and maybe other things, if there's enough budget)?

We could consider that.

> BTW: last time I checked I could not find a wiki page with
> information about setting up VMs; is there one?

Probably not since OSUOSL set them up, at least from the host side
originally. I only know how it works because I've used ganeti myself.

> --strk;

Thanks,
Alex

Hi Alex,

On Tue, 13. Jun 2017 at 22:08:25 -0700, Alex Mandel wrote:

> OSGeo3, which contains Downloads, has a space issue, but that's actually
> just a disk allocation issue; it has more space. But nothing new should
> go on that machine; it's just as old.

So it's still just a matter of access? Expanding an lv and growing a fs should
be a matter of minutes - although making the expanded lv known to the vm
without a reboot might be an issue.

> FYI, adding a replacement disk the last time caused significant downtime to
> rebuild the raid (days).

Usually you would just pull the disk and put in a new one - without any
downtime. At least the rebuild should run in the background. Anyone recall why
it had to run in the foreground with everything offline the last time?

> qgis contains their bug tracker, which, if I recall, the qgis PSC has been
> debating the future of.

We moved redmine to a non-osgeo machine. The VM now is only hosting the
website with documentation and downloads. It's also behind cloudflare. But
disk space is an issue there too.

Building of website and documentation has been moved off the VM and is rsynced
back from another non-osgeo machine to the VM.

Jürgen

--
Jürgen E. Fischer norBIT GmbH Tel. +49-4931-918175-31
Dipl.-Inf. (FH) Rheinstraße 13 Fax. +49-4931-918175-50
Software Engineer D-26506 Norden http://www.norbit.de
QGIS release manager (PSC) Germany IRC: jef on FreeNode

On 06/14/2017 04:09 AM, Jürgen E. Fischer wrote:

> Hi Alex,
>
> On Tue, 13. Jun 2017 at 22:08:25 -0700, Alex Mandel wrote:
>
> > OSGeo3, which contains Downloads, has a space issue, but that's actually
> > just a disk allocation issue; it has more space. But nothing new should
> > go on that machine; it's just as old.
>
> So it's still just a matter of access? Expanding an lv and growing a fs should
> be a matter of minutes - although making the expanded lv known to the vm
> without a reboot might be an issue.

Yes, access. And you are correct: there is no way to tell the guest OS to
grow the / partition without it being offline. The most commonly used trick
is to boot the VM with a live distro and do an ext grow operation.
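
For reference, the grow steps discussed above could be sketched roughly as
below. This is only a sketch under assumptions: the volume group, LV name,
size and guest device are made-up placeholders (not the real OSGeo3 values),
and it assumes an ext3/ext4 filesystem sitting directly on the guest device;
a partition table inside the guest would need an extra step. The script only
prints the commands, it does not run anything.

```python
import shlex


def grow_commands(vg, lv, extra, guest_device):
    """Return the commands for growing an LV and its ext filesystem.

    Nothing is executed; the caller decides where to run each command.
    All arguments are placeholders supplied by the caller.
    """
    lv_path = f"/dev/{vg}/{lv}"
    return [
        # On the host: add `extra` (e.g. "20G") to the logical volume.
        ["lvextend", "-L", f"+{extra}", lv_path],
        # In the live distro, filesystem unmounted: forced check first,
        # which resize2fs expects before an offline resize.
        ["e2fsck", "-f", guest_device],
        # Grow the ext filesystem to fill the enlarged device.
        ["resize2fs", guest_device],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in grow_commands("vg0", "downloads-disk0", "20G", "/dev/vda"):
        print(shlex.join(cmd))
```

Keeping the commands as lists and only printing them makes it easy to review
the exact invocations before anyone with host access runs them by hand.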

> > FYI, adding a replacement disk the last time caused significant downtime to
> > rebuild the raid (days).
>
> Usually you would just pull the disk and put in a new one - without any
> downtime. At least the rebuild should run in the background. Anyone recall why
> it had to run in the foreground with everything offline the last time?
>
> > qgis contains their bug tracker, which, if I recall, the qgis PSC has been
> > debating the future of.
>
> We moved redmine to a non-osgeo machine. The VM now is only hosting the
> website with documentation and downloads. It's also behind cloudflare. But
> disk space is an issue there too.
>
> Building of website and documentation has been moved off the VM and is rsynced
> back from another non-osgeo machine to the VM.

If that's the case and you just have static assets, then please go ahead
and start migrating the qgis VM content to OSGeo6.

> Jürgen

Thanks,
Alex

#1940: osgeo4 raid needs a replacement
---------------------------+--------------------
Reporter: strk | Owner: sac@…
     Type: task | Status: new
Priority: normal | Milestone:
Component: Systems Admin | Resolution:
Keywords: |
---------------------------+--------------------

Comment (by strk):

From the upstream ticket (Samarendra Hedaoo):
{{{
To add more details, the current status is:
RAID-6:6 drives:557.75GB:
Degraded Drives:5
1 Bad Drives (1974326 Errors)
}}}
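
As a rough reading of those numbers, assuming they describe a standard RAID-6
set (which survives the loss of any two member drives), the remaining margin
can be worked out as in the sketch below. The function names are made up for
illustration, and the inputs are just the "6 drives" and "1 Bad Drives"
figures from the status above.

```python
RAID6_PARITY_DRIVES = 2  # RAID-6 tolerates the loss of any two member drives


def remaining_failures_tolerated(failed_drives):
    """Number of further drive failures the array can survive."""
    if failed_drives < 0:
        raise ValueError("failed_drives must be non-negative")
    return max(RAID6_PARITY_DRIVES - failed_drives, 0)


def data_drive_count(total_drives):
    """Number of drives' worth of capacity that holds data, not parity."""
    if total_drives < 4:
        raise ValueError("RAID-6 needs at least 4 drives")
    return total_drives - RAID6_PARITY_DRIVES


if __name__ == "__main__":
    print(remaining_failures_tolerated(1))  # prints 1
    print(data_drive_count(6))              # prints 4
```

In other words, with one bad drive the array can still lose one more drive,
but a second failure would leave it with no redundancy.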

--
Ticket URL: <https://trac.osgeo.org/osgeo/ticket/1940#comment:2>
OSGeo <http://www.osgeo.org/>
OSGeo committee and general foundation issue tracker.