[GeoNetwork-devel] bots & harvesters and geonetwork

Last week we had serious issues with a 2.10 production instance of GeoNetwork running out of memory (a fatal crash about once a day). Yesterday we identified that it is caused by bots/scripts spidering/harvesting the full catalogue without reusing the JSESSIONID cookie, so a new session is created for every request.

Apparently there is some work on GeoNetwork 3.0/2.10.4 that allows for a stateless CSW; is there documentation about this somewhere? Besides CSW, the same can happen on the HTML representation of metadata (when it is crawled by Google/Bing etc.).

I see three approaches to improve this:
- Detect the number of active sessions and, when sessions are running out, remove the oldest sessions (and/or the sessions that share an IP address).
- Detect whether a request comes from a script/crawler and, if so, don't start a session, or use a shared/short-lived session (see the sketch after this list).
- Don't use sessions for the public part; only allow sessions for the private (logged-in) part.
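As a rough sketch of the second approach (untested, and the class name and regex are made up for illustration), a servlet filter could shorten the timeout of any session that was just created for a crawler-like User-Agent, so bot sessions expire quickly instead of piling up on the heap:

    import java.io.IOException;
    import java.util.regex.Pattern;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpSession;

    public class CrawlerSessionFilter implements Filter {

        // Assumption: a simple User-Agent regex is enough to spot common bots.
        private static final Pattern CRAWLER_UA =
                Pattern.compile(".*([bB]ot|[cC]rawler|[sS]pider|Slurp).*");

        @Override
        public void doFilter(ServletRequest req, ServletResponse res,
                FilterChain chain) throws IOException, ServletException {
            chain.doFilter(req, res);

            if (req instanceof HttpServletRequest) {
                HttpServletRequest httpReq = (HttpServletRequest) req;
                String ua = httpReq.getHeader("User-Agent");
                HttpSession session = httpReq.getSession(false);
                if (session != null && session.isNew()
                        && ua != null && CRAWLER_UA.matcher(ua).matches()) {
                    // Let crawler sessions die after 60 seconds of inactivity
                    // instead of the container default (often 30 minutes).
                    session.setMaxInactiveInterval(60);
                }
            }
        }

        @Override
        public void init(FilterConfig filterConfig) { }

        @Override
        public void destroy() { }
    }

A session is still created per request, but it is reclaimed within a minute rather than held for the full default timeout.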

Related to the second point, this page seems relevant: http://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve. The Crawler Session Manager Valve, if enabled, treats all requests from crawlers (identified by a User-Agent regular expression) as a single session. Should we advise our users to enable this valve when installing GeoNetwork?
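According to that Tomcat documentation, enabling it is a single element in conf/server.xml (the attribute values below are the documented defaults and can be tuned):

    <!-- Inside the <Engine>, <Host> or <Context> element of conf/server.xml -->
    <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
           crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"
           sessionInactiveInterval="60"/>

The advantage over an application-level filter is that all crawler requests share one session, so the session count stays flat, but it only catches bots that match the User-Agent regex.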

What's your view on this?

--
Paul van Genuchten
GeoCat bv