[SAC] Google walking trac...

Folks,

I was doing a bit of digging around to see what the httpd processes
were doing that were pegging at 100% for a while. I had blamed this on
subversion. It looks like I was partly right, and partly wrong. Digging
in /proc for the httpd's I found they were doing trac related stuff, and
looking in /var/log/httpd/trac-access_log I found these entries near the end
of the file:

66.249.66.10 - - [29/Mar/2007:23:17:40 -0400] "GET /mapguide/changeset/1142/trunk?old_path=%2F&old=1142&format=zip HTTP/1.1" 200 123170821

66.249.66.10 - - [29/Mar/2007:23:21:02 -0400] "GET /fdo/changeset/2615/trunk?old_path=%2F&old=2615&format=zip HTTP/1.1" 200 208304671

It appears that googlebots have been walking the changeset links available
in Trac, and requesting changesets that were basically the complete state
of the svn repository - in the above cases 208MB and 123MB each.
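For anyone who wants to repeat this check on another log, the scan I did by hand can be sketched in a few lines of Python. The log path and the 50 MB threshold below are just illustrative assumptions, and `large_changeset_hits` is a hypothetical helper name, not anything deployed:

```python
# Sketch: scan an Apache access log for oversized Trac changeset downloads.
import re

LOG_PATH = "/var/log/httpd/trac-access_log"  # path as mentioned above
THRESHOLD = 50 * 1024 * 1024                 # flag responses larger than 50 MB

# Apache Common Log Format: host ident user [time] "request" status bytes
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def large_changeset_hits(lines, threshold=THRESHOLD):
    """Yield (host, request, size) for changeset requests over the threshold."""
    for line in lines:
        m = LOG_RE.match(line)
        if not m or m.group("bytes") == "-":
            continue
        size = int(m.group("bytes"))
        if size >= threshold and "/changeset/" in m.group("request"):
            yield m.group("host"), m.group("request"), size
```

Running it over the log above would flag exactly the two googlebot hits quoted earlier.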

So, I think this is a significant part of our problem. I also suspect, but
don't know, that our httpd memory leaks (the httpd processes kept bloating
through the day until the machine came down unless they were restarted) were
related to this machinery.

Howard and I are putting robots.txt files into place that we hope will screen
direct access to svn and access to svn through trac. Hopefully this will
moderate our load issues.

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | President OSGeo, http://osgeo.org

Frank Warmerdam wrote:

> Howard and I are putting robots.txt files into place that we hope will screen
> direct access to svn and access to svn through trac. Hopefully this will
> moderate our load issues.

Folks,

In follow up, we have put the following robots.txt files into place:

  http://trac.osgeo.org/robots.txt
  http://download.osgeo.org/robots.txt
  http://www.gdal.org/robots.txt
  http://buildbot.osgeo.org/robots.txt
  http://svn.osgeo.org/robots.txt

Hopefully this will reduce the search engine bot churn on our services,
while at the same time ensuring the bots focus on appropriate content:
wikis, Drupal pages, and mailing lists.
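For reference, the general shape of the rules is along these lines. This is only a sketch; the disallowed paths below are illustrative assumptions, not the actual contents of the deployed files, so check the URLs above for what is really in place:

```
# Hypothetical robots.txt sketch - real deployed rules may differ.
User-agent: *
# Keep bots out of per-project Trac changeset/browser views,
# which can generate whole-repository zip exports on demand.
Disallow: /mapguide/changeset/
Disallow: /mapguide/browser/
Disallow: /fdo/changeset/
Disallow: /fdo/browser/
```

Wiki and ticket pages are left crawlable, which is the content we actually want indexed.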

When modifying and adding services, please keep robots.txt issues in mind.

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam@pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | President OSGeo, http://osgeo.org