telascience backup plan - was Re: [SAC] xblade 14 down

On Tue, Aug 4, 2009 at 4:43 PM, Frank Warmerdam<warmerdam@pobox.com> wrote:

Folks,

xblade14 which hosts quite a bit of stuff (buildbot.osgeo.org, www.gdal.org,
mapserver.org, remotesensing.org, grass.osgeo.org, qgis.osgeo.org,
www.geotools.org, upload.osgeo.org) is down and unreachable.

also
http://gallery.osgeo.org/

John has been emailed but is travelling and it is unclear how soon he can
arrange someone to reboot the system. Norman may be able to
contact someone directly if John is unavailable.

In the meantime it is unclear how long the blade will remain down.

I would like to bring up the backup situation of our "services"
hosted at Telascience. We all appreciate very much the ease
of access there and the tremendous bandwidth they offer.

The problem

I learned (again) that no systematic backups were done for
xblade14. "Every project has to self-organize this" I was told
in IRC yesterday. While we, the GRASS project, have done our
homework others may not have done so:

For example, the "OSGeo Gallery" is not a group. We should not
lose important showcasing of "OSGeo in the wild" just because
we are unable to run an rsync call to another machine. Disk
space is cheap today; I could easily have done the gallery backup
if I had *known* that no backup was being done so far.
Remember that the community has contributed the examples;
it will be hard to ask them to repeat everything.

A Proposal

I suggest that we immediately start to better coordinate
backups of OSGeo web sites hosted at Telascience.
Just having a simple clone of the disk could save many
hours of reconstructing everything. Note that this can be done
with a one-line rsync cron job.
I know that wikis, Drupal sites and so forth cannot be restored
like this, but the majority of the (hopefully not lost) content
is plain files.
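
To make the idea concrete, here is a minimal sketch of how such a
crontab entry could be put together. The source path, the backup
host, the target directory and the schedule are all placeholders
chosen for illustration, not the actual Telascience setup, and the
two helper functions are hypothetical names.

```python
import shlex


def rsync_command(source, destination, delete=False):
    """Build the rsync argument list for mirroring source to destination."""
    # -a: archive mode (recursive, keeps permissions, times, symlinks)
    # -z: compress data during the transfer
    args = ["rsync", "-az"]
    if delete:
        # Also remove files on the mirror that no longer exist at the source.
        args.append("--delete")
    args += [source, destination]
    return args


def crontab_line(schedule, command):
    """Format a crontab entry: five schedule fields, then the shell command."""
    return schedule + " " + " ".join(shlex.quote(arg) for arg in command)


# Placeholder values (assumptions): nightly at 02:30, web root pushed to a
# hypothetical backup host over ssh.
line = crontab_line(
    "30 2 * * *",
    rsync_command("/var/www/", "backup@backuphost.example.org:/backups/xblade14/www/"),
)
print(line)
```

The printed line could then be added with "crontab -e" on the blade.
The --delete option is left off by default here so that a file
removed by accident on the source is not also removed from the
mirror on the next run.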

According to http://wiki.osgeo.org/wiki/SAC:Backups
the download.osgeo.org site is rsync'ed to osgeo2.
If that's not possible for xblade1x, I am sure that we
can find someone who offers this. We really need it as
a failover. Additionally, the projects should of course
take care of their own backups.

At least some more coordination is needed IMHO.
I'd suggest nominating a contact person for each
project hosted at Telascience.

Best,
Markus

I would like to bring up the backup situation of our "services"
hosted at Telascience. We all appreciate very much the ease
of access there and the tremendous bandwidth they offer.

I can't speak to the backup problem, but I was surprised to see
"essential" services being run on Telascience servers. I was under the
impression these services were being used for large file downloads and
test/build environments, etc. and not necessarily for project primary
websites.

Why are our funded and "supported" hosting services (PEER1) not being used
for mission critical website services? Is it related to our security
policies, disk space limitations or some other issue with osgeo1/2? Is it
something we can change to better support the projects with live supported
machines?

I'm not questioning the rationale for Telascience hosting but just
wondering how we got here with some of the main project websites,
specifically because it looks bad. When I speak to audiences about
"supporting our projects" and then their sites disappear for a long
period, it looks like we are not doing a good support job - even
though it's beyond our control and with best intentions.

Hope these are constructive questions we can work through together.

Tyler

On Thu, Aug 6, 2009 at 8:41 AM, Tyler Mitchell<tmitchell@osgeo.org> wrote:

I would like to bring up the backup situation of our "services"
hosted at Telascience. We all appreciate very much the ease
of access there and the tremendous bandwidth they offer.

I can't speak to the backup problem, but I was surprised to see
"essential" services being run on Telascience servers. I was under the
impression these services were being used for large file downloads and
test/build environments, etc. and not necessarily for project primary
websites.

Why are our funded and "supported" hosting services (PEER1) not being used
for mission critical website services?

At least for the GRASS project, we simply didn't have ssh access to
the PEER1 machine. I have it in the meantime, but I consider it a
bottleneck to give access to only one person per project (as we have
just seen).

Is it related to our security
policies, disk space limitations or some other issue with osgeo1/2? Is it
something we can change to better support the projects with live supported
machines?

I'm not questioning the rationale for Telascience hosting but just
wondering how we got here with some of the main project websites,
specifically because it looks bad. When I speak to audiences about
"supporting our projects" and then their sites disappear for a long
period, it looks like we are not doing a good support job - even
though it's beyond our control and with best intentions.

Hope these are constructive questions we can work through together.

I hope the same...

Markus