[SAC] Motion: robots.txt, Drupal and performance

We're buried again. Hope this email gets to you :)

It is my opinion that Drupal is not our problem, svn is. Next time we are
"up against the wall" I think we should just rename /var/www/svn and see if
that doesn't bring things back under control (and verify my claim that svn is
the real pig).

Whoever re-enabled robots.txt has proven what I have suspected... our Drupal performance is very poor, CPU intensive, and completely unable to keep up with the load. In the 20 minutes I've watched, we haven't had any svn activity, and we're still averaging a load of 1.5-2.0. Throw in an FDO checkout and the load jumps to 6+ quickly.

o I could pull one Drupal web page (http://www.osgeo.org) at about 60 pages
   per second.
o I found pulling a non-existent page ran at about the same speed.
o I can pull GDAL pages at about 2400 pages/second.
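
For anyone who wants to reproduce these numbers, ApacheBench runs along the following lines should give comparable figures (the tool choice, request counts and concurrency are illustrative rather than the exact invocations I used):

    ab -n 500 -c 10 http://www.osgeo.org/
    ab -n 500 -c 10 http://www.osgeo.org/this-page-does-not-exist

Point the same command at a static GDAL doc page for the third comparison.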

Trac's performance is 40 times better than Drupal's in our respective configurations, and Trac doesn't have to respond to just about every request on the OSGeo.org domain. Additionally, Trac doesn't cache *anything*; it generates everything dynamically. Admittedly, Drupal is doing a lot more, but 40 times the per-page cost, multiplied by who knows how many times the traffic, adds up to an unsustainable situation.

Motion:
I move that we disable crawlers on osgeo.org until we can do 500+ pages/sec with Drupal. In our current configuration, even if we moved svn and trac off to the other server, we would not be able to keep up with the bots.
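
For clarity, "disable crawlers" here just means serving a deny-all robots.txt, i.e. something along these lines:

    User-agent: *
    Disallow: /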

I completely agree that a website not on the search engines is equivalent to practically not existing. I would rather not exist and have the machine stand up. In the last week, load issues have caused us to burn through *a lot* of our volunteer admin juice. Someone needs to step forward to aggressively marshal the Drupal performance stuff and get us where we need to be. Otherwise we can just leave it alone and not be on the indexes.

Howard

Hi
As far as optimizing Drupal performance is concerned, it looks like some unnecessary installed modules can reduce performance. See http://drupal.org/node/32091 for some possible ways of dealing with the performance hit.
As far as I remember, we have the Access Stats module enabled, which is not really useful now that awstats handles the access logs. Maybe we should disable it, and also find other such modules that are not used at all.

regards,

Swapnil Hajare

Not sure that I get a vote, but I reluctantly have to agree with Howard. Is there a way that we can allow the bots to at least spider the main page, but ignore anything else?

Honestly, I really doubt that we'll get to 500 pages/second with Drupal without throwing a more aggressive caching strategy (Squid?) or a better-performing CMS (is there such a thing?) in place. Or going to static HTML :)
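
If we did go the Squid route, the reverse-proxy configuration itself is small -- roughly the following in squid.conf, assuming Apache were moved off port 80 (the ports and hostname are placeholders):

    # Squid answers on 80 and forwards cache misses to Apache on 8080
    http_port 80 accel defaultsite=www.osgeo.org
    cache_peer 127.0.0.1 parent 8080 0 no-query originserver

The catch is that Squid can only cache responses that are marked cacheable, which comes back to the expires-header issue below.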

But what are the options? As Howard has pointed out, we have limited volunteer server admin resource, so what can we do with what we’ve got?

Let's turn off as many modules as we can, because each one adds SQL queries. Drupal can be a SQL pig; I've seen reports of 100+ queries to generate a single page, which is unimaginable.

Is there anyone in the group with MySQL tuning experience? If we can get MySQL set up as effectively as possible in DW mode, this will help Drupal as well.
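
As a starting point, the kind of my.cnf settings I have in mind (the values and log path are placeholders and would need to be sized against the box's actual RAM and traffic):

    [mysqld]
    key_buffer_size  = 64M
    table_cache      = 256
    query_cache_type = 1
    query_cache_size = 32M
    log_slow_queries = /var/log/mysql/slow.log
    long_query_time  = 2

The slow query log alone would tell us a lot about where Drupal's queries are going.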

We should consider turning off content-compression in Apache. This isn’t nearly as expensive as MySQL, but…
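
Assuming the compression is coming from mod_deflate rather than from Drupal/PHP itself, it can be switched off without unloading the module by setting the environment variable mod_deflate checks:

    # httpd.conf / vhost scope: mod_deflate skips requests flagged no-gzip
    SetEnv no-gzip 1

(Or, if compression is enabled via an AddOutputFilterByType DEFLATE line, just commenting that out has the same effect.)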

Another thing that should be looked into is why the server is responding in a way that invalidates caches. This makes the bots see our content as highly dynamic, and they'll come back for it way more frequently.

Jason


On 29-Mar-07, at 8:45 AM, Jason Birch wrote:

Not sure that I get a vote, but I reluctantly have to agree with Howard. Is there a way that we can allow the bots to at least spider the main page, but ignore anything else?

Isn’t there a way to feed search engines our site map directly?

But what are the options? As Howard has pointed out, we have limited volunteer server admin resource, so what can we do with what we’ve got?

I'm also trying to find more people who are familiar with tuning Drupal. (Swapnil, keep the ideas coming and put them into the wiki!) Ejer and alienbrain (on IRC) may be able to help too.

Another thing that should be looked into is why the server is responding in a way that invalidates caches. This makes the bots see our content as highly dynamic, and they'll come back for it way more frequently.

The cache was set to a 5-minute lifetime; perhaps it should be longer now.

On 29-Mar-07, at 8:17 AM, Swapnil Hajare wrote:

Hi
As far as optimizing Drupal performance is concerned, it looks like some unnecessary installed modules can reduce performance. See http://drupal.org/node/32091 for some possible ways of dealing with the performance hit.
As far as I remember, we have the Access Stats module enabled, which is not really useful now that awstats handles the access logs. Maybe we should disable it, and also find other such modules that are not used at all.

I just disabled modules:
-aggregator
-graphstat
-statistics

I know at least statistics is causing some grief.
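
For anyone who wants to double-check what is still enabled, a query along these lines against the Drupal database should list it (table layout from memory, so verify the column names):

    SELECT name FROM system WHERE type = 'module' AND status = 1;

Statistics is the obvious win: if its access-log option was on, it was doing an INSERT into the accesslog table on every single page view.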

> Isn't there a way to feed search engines our site map directly?

Yes, but this only tells them what to spider.
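
For reference, the sitemaps.org format the engines accept is tiny; a minimal sketch (URLs and change frequencies are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.osgeo.org/</loc>
        <changefreq>daily</changefreq>
      </url>
      <!-- more <url> entries ... -->
    </urlset>

It tells the crawlers where the pages are, not to stay away, so it doesn't reduce load by itself.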

> The cache was set to a 5-minute lifetime; perhaps it should be longer now.

No, this only affects how long each of the content blocks is cached, not
how Drupal sets HTTP expires headers.
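
It's easy to see what we're actually sending:

    curl -I http://www.osgeo.org/

If anonymous page views come back with an Expires date in the past or Cache-Control: no-cache, every crawler will treat the whole site as uncacheable and re-fetch it aggressively.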

Jason

Howard Butler wrote:

Motion:
I move that we disable crawlers on osgeo.org until we can do 500+ pages/sec with Drupal. In our current configuration, even if we moved svn and trac off to the other server, we would not be able to keep up with the bots.

Howard,

I'm -1 on this.

Reasons:

#1 removing ourselves from search engines is death to OSGeo's promotional
    efforts.

#2 Even with full crawler activity, I haven't seen us doing anything
    near 50 pages/sec - I think 500 pages/sec is an unnecessarily high bar.

#3 I don't think we are grokking where our load is coming from.

My short term suggestion would be:

  o a moderately restrictive robots.txt (i.e. screen out stuff like the mapguide
    and fdo docs -- sketched below). That way the total number of pages should be modest.
  o continued monitoring of load and activity to try and better understand
    the issues
  o try some of the performance tweaks previously suggested for Drupal
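
For the first item, something along these lines (the paths are illustrative and would need to match where those doc trees actually live):

    User-agent: *
    # keep crawlers out of the big generated doc trees
    Disallow: /mapguide/
    Disallow: /fdo/
    # honoured by some crawlers (Yahoo, MSN), ignored by others
    Crawl-delay: 10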

I completely agree that a website not on the search engines is equivalent to practically not existing. I would rather not exist and have the machine stand up. In the last week, load issues have caused us to burn through *a lot* of our volunteer admin juice. Someone needs to step forward to aggressively marshal the Drupal performance stuff and get us where we need to be. Otherwise we can just leave it alone and not be on the indexes.

I agree, and understand that we can't keep babying things at the level we
have been, but in the last week we have also learned a lot about our machine
configuration, and improved quite a few things.

I am prepared to continue tweaking backup schemes and robots.txt to
reach a healthy steady state.

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush | President OSGeo, http://osgeo.org

Howard Butler wrote:

Whoever re-enabled robots.txt has proven what I have suspected...

Folks,

By the way, I'm pretty sure I'm responsible for this. I moved the file
away for some testing, and then put it back. But it seems I put back the
wrong one.

I apologize, and wish to make it clear this was not an attempt to subvert
the need for discussion on stuff like this. In the kind of collaborative system
administration environment we are trying to operate, I realize it is
important that I not just make changes without at *least* notification.

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush | President OSGeo, http://osgeo.org