[SAC] Reliability of Services

I am not happy with the reliability of the services that OSGeo operates, especially the website and Wiki. I do not think the cause is a lack of administrative capability, capacity (meaning know-how) or enthusiasm. Even though resources are very limited (we all know that), it never takes long to get things back up. Kudos to all SAC admins and supporters.

So what is the problem? The OSGeo Wiki was never down before it moved to the PEER1 servers, even though it was never even maintained. It just worked. Now we are starting to have trouble with the Wiki too. The CMS cannot be used to edit the web site, and this is starting to block normal business.

I am trying not to blame anybody but to find out what can be done to make things more reliable.

I expect basic services operated by OSGeo to be up 24/7. From a marketing point of view, at least the web site and the Wiki *must* be up all the time. We are trying hard to build a brand, and every time OSGeo is down it breaks the message we are trying to get out.

If there is no other possibility, I suggest getting a dedicated server for the web site and Wiki only. It can be hosted by any standard ISP because the technical requirements are so low that any LAMP box is good enough (just MySQL, Drupal and MediaWiki required).

Regards, Arnulf.

Arnulf Christl wrote:

I am not happy with the reliability of the services that OSGeo operates, especially the website and Wiki. I do not think the cause is a lack of administrative capability, capacity (meaning know-how) or enthusiasm. Even though resources are very limited (we all know that), it never takes long to get things back up. Kudos to all SAC admins and supporters.
So what is the problem? The OSGeo Wiki was never down before it moved to the PEER1 servers, even though it was never even maintained. It just worked. Now we are starting to have trouble with the Wiki too. The CMS cannot be used to edit the web site, and this is starting to block normal business.
I am trying not to blame anybody but to find out what can be done to make things more reliable.
I expect basic services operated by OSGeo to be up 24/7. From a marketing point of view, at least the web site and the Wiki *must* be up all the time. We are trying hard to build a brand, and every time OSGeo is down it breaks the message we are trying to get out.
If there is no other possibility, I suggest getting a dedicated server for the web site and Wiki only. It can be hosted by any standard ISP because the technical requirements are so low that any LAMP box is good enough (just MySQL, Drupal and MediaWiki required).

Arnulf,

I believe the wiki problems were related to an improperly administered
MySQL install and could have happened anywhere. There is no particular
reason to anticipate future problems with the wiki since it is
essentially the only service on osgeo2.

The problems on osgeo1 however are ongoing and I'm not at all sure what
to do about them. Segregating svn, drupal, mailing list and trac services
to some degree might help avoid all services being lost when one is
behaving improperly.

I would suggest we focus our attention on the osgeo1 problems and how to
alleviate them if we are concerned about downtime (which I am - though
I'm not sure it is a crisis).

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush | President OSGeo, http://osgeo.org

On 10-Mar-08, at 6:34 AM, Frank Warmerdam wrote:

Arnulf Christl wrote:

I am not happy with the reliability of the services that OSGeo operates, especially the website and Wiki. I do not think the cause is a lack of administrative capability, capacity (meaning know-how) or enthusiasm. Even though resources are very limited (we all know that), it never takes long to get things back up. Kudos to all SAC admins and supporters.

I would suggest we focus our attention on the osgeo1 problems and how to
alleviate them if we are concerned about downtime (which I am - though
I'm not sure it is a crisis).

Hi all,

Perhaps it's a good time to pause and ask a few questions that affect maintaining our services:

1- Do we have enough available SAC volunteers to help focus on these problems? Need to recruit a few more?
2- Can we give a couple more people shell access to help? e.g. Folks from projects that have moved into OSGeo (GRASS, QGIS)
3- Could we use some hired help on-demand?
4- Do we want to consider moving to another hosting platform?
5- Should we move some services from one server to another?

As with any volunteer group, we are dependent on goodwill and time availability of the committee and personal availability fluctuates over time. It might be a good idea for SAC members to comment on their current level of availability so we can see if there is a hole in support availability.

The hosting and service location questions are more what Arnulf seems concerned about. Even if we decide to migrate to another ISP or to move some services around on the boxes, I wonder if we have enough volunteer time available to make it happen effectively.

Tyler

Tyler Mitchell (OSGeo) wrote:

Hi all,

Perhaps it's a good time to pause and ask a few questions that affect maintaining our services:

1- Do we have enough available SAC volunteers to help focus on these problems? Need to recruit a few more?

Tyler,

I think some additional volunteers would be helpful. I'm not sure
though how we know when we have too many and management is becoming
too chaotic.

2- Can we give a couple more people shell access to help? e.g. Folks from projects that have moved into OSGeo (GRASS, QGIS)

I think Chris Schmidt and Daniel Morissette are skilled admins
who could be an asset as primary administrators if they were
interested. Perhaps also folks from other projects.

I do however think that "primary administrators" need to make
some commitment to learning how things are set up and being available
to deal with stuff beyond just the one thing that they want to set up.

3- Could we use some hired help on-demand?

I think it would be helpful if there was on-demand hired help that
Howard could task to do stuff. I'm a bit leery about paying
commercial rates for this though as we could eat through quite a bit
of money quickly. My *hope* was that there might be a
consultant/contractor from somewhere with modest wage expectations, and
a passion for OSGeo that might take this on for a modest wage (say
$20/hr). Of course, I hope that for everything and it is hard to
get a solid mix of skills and willingness to work cheaply.

4- Do we want to consider moving to another hosting platform?

I'm not sure what you are proposing, but I think change without
a clear rationale is the last thing we need.

5- Should we move some services from one server to another?

This is plausible, but once again without a good understanding
of what the problem is, it seems like churn. We really need
someone decently skilled to watch the situation closely and try
to diagnose what is happening (perhaps more than one thing!).

This would include post-seizeup log analysis, adding various
sorts of instrumentation (logging server-status reports, top
to a log, etc) and then careful experimentation with variations.
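
As a rough illustration of the "top to a log" idea, here is a minimal sketch that appends timestamped load averages to a file. It assumes a Unix host with Python available; the function names and the idea of running it from cron are illustrative assumptions, not something already in place on osgeo1.

```python
import os
import time


def format_sample(now, load):
    """Render one log line: UTC timestamp plus the 1/5/15-minute load averages."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    return "%s load1=%.2f load5=%.2f load15=%.2f" % ((stamp,) + tuple(load))


def log_sample(path):
    """Append one sample to the log file at `path` and return the line written."""
    # os.getloadavg() is Unix-only and returns a 3-tuple of floats.
    line = format_sample(time.time(), os.getloadavg())
    with open(path, "a") as handle:
        handle.write(line + "\n")
    return line
```

Calling `log_sample()` with a log path of your choosing once a minute would leave a record that can be lined up against Apache and MySQL logs after the next seize-up.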

It could take a while. I haven't observed any server problems
in the last several days for instance.

As with any volunteer group, we are dependent on goodwill and time availability of the committee and personal availability fluctuates over time. It might be a good idea for SAC members to comment on their current level of availability so we can see if there is a hole in support availability.

I am available as needed, but have rather ad hoc system administration
skills. Doing stuff by ssh is very painful for me due to my extremely
high-latency internet connection, so I feel like I should be a
resource-of-last-resort for most admin work. :-(

The hosting and service location questions are more what Arnulf seems concerned about. Even if we decide to migrate to another ISP or to move some services around on the boxes, I wonder if we have enough volunteer time available to make it happen effectively.

I'm not sure I follow this. I recall Arnulf mentioning website
(presumably www.osgeo.org Drupal) downtime and wiki downtime. I
don't recall all the details but I'm not sure either of these was
a problem with our ISP. One was mysql stuck with a broken table after
a hard reboot, and the other was mysql binary logs filling the disk -
a configuration error on our part.
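
A disk filling up is the kind of failure a simple periodic check can flag early. The sketch below is only an illustration under assumptions: the function names and the 90% threshold are hypothetical, and it reports on whatever filesystem holds the given path rather than on MySQL specifically.

```python
import shutil


def percent_used(total, used):
    """Return used space as a percentage of total, rounded to one decimal place."""
    return round(100.0 * used / total, 1)


def disk_alert(path, threshold=90.0):
    """Return a warning string if the filesystem holding `path` is at or above
    `threshold` percent full, otherwise None."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    if pct >= threshold:
        return "WARNING: %s is %.1f%% full" % (path, pct)
    return None
```

Run against the partition that holds the MySQL data directory, a non-None result could be mailed to the SAC list before the disk actually fills.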

I will say, both of these could have been either resolved more
quickly, or prevented in the first place by more ongoing volunteer
effort and more systematic management of our services.

All IMHO of course.

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush | President OSGeo, http://osgeo.org

On Thu, Mar 20, 2008 at 01:50:16PM -0400, Frank Warmerdam wrote:

>Perhaps it's a good time to pause and ask a few questions that affect
>maintaining our services:

My *hope* was that there might be a
consultant/contractor from somewhere with modest wage expectations, and
a passion for OSGeo that might take this on for a modest wage

It does sound to me as if what is needed (a fairly intensive burst of
analysis of the underlying problems, network monitoring setup, and
recommending any rearrangement of services that may be needed)
would benefit a lot from a short-term (paid) dedicated person on it.

>4- Do we want to consider moving to another hosting platform?

I'm not sure what you are proposing, but I think change without
a clear rationale is the last thing we need.

One possible reason to consider this more seriously is that PEER1
is VERY expensive. (At 15K USD p/a, 5-10 times what it would seem
necessary to cough up for a couple of vhosts on some anonymous rack.)

good luck!

jo

jo@frot.org wrote:

4- Do we want to consider moving to another hosting platform?

I'm not sure what you are proposing, but I think change without
a clear rationale is the last thing we need.

One possible reason to consider this more seriously is that PEER1
is VERY expensive. (At 15K USD p/a, 5-10 times what it would seem
necessary to cough up for a couple of vhosts on some anonymous rack.)

Jo,

Well, I think we need more than "a couple of vhosts". I believe we
ended up with a relatively expensive (by my standards) solution partly
because we demanded a provider with some sort of uptime guarantee though
I'm personally not overly keen on that aspect.

If we want to reduce costs I think the most obvious step might be
to drop the second Peer1 host which we are currently using as a place
to rsync backups to, and to operate the wiki.

This would save roughly half our costs.

But all things considered $15K/yr isn't ruinous.

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush | President OSGeo, http://osgeo.org

I guess I need to take some blame here, as one of the folks who was initially involved in investigating hosting solutions. Some of the things that we were constrained by included:

- Managed hardware solution - we do not have the resources to provide reliable co-lo management.
- Reasonable bandwidth allowance - many hosts have incredibly stingy allowances, and huge penalties for overage
- SLA (part of due diligence for ADSK; I believe that they may have gone elsewhere for MapGuide without this)
- Full root access to the machines - many providers with SLA also have change management procedures, with costs associated - no flexibility there

Peer1 was the cheapest and most functional of the reliable managed hardware solutions that were reviewed at the time. I would hate to see the current solution abandoned without proper review of the alternatives.

That said, I wonder if we are visible enough now that there are providers that are willing to give us some services for free; I know that pair Networks does this often. Do we have US charity status yet?

Jason


On 24-Mar-08, at 2:55 PM, Jason Birch wrote:

That said, I wonder if we are visible enough now that there are providers that are willing to give us some services for free; I know that pair Networks does this often. Do we have US charity status yet?

No, it's not finalised yet.