[GeoNetwork-devel] [GeoNetwork opensource Developer website] #163: Simultaneous reharvest fail

#163: Simultaneous reharvest fail
---------------------------------------+------------------------------------
Reporter: ddnebert | Owner: geonetwork-devel@lists.sourceforge.net
     Type: defect | Status: new
Priority: major | Milestone:
Component: General | Version:
Keywords: harvest, concurrency, run |
---------------------------------------+------------------------------------
I have over 50 WMS GetCapabilities to check on a daily basis. If I set the
revisit frequency to be the same, a significant number of the harvest
processes return as errors, purging the old entries and not fetching new
ones. It would appear that there is no multi-threaded support for harvest
and proper queuing of the results to be indexed by Lucene.

This defect can be demonstrated by selecting more than 3-4 of the listed
harvest services and clicking Run. Invariably one or more will fail.
However, if one runs the checks individually, they all succeed.

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/163>
GeoNetwork opensource Developer website <http://trac.osgeo.org/geonetwork>
GeoNetwork opensource is a standards based, Free and Open Source catalog application to manage spatially referenced resources through the web. It provides powerful metadata editing and search functions as well as an embedded interactive web map viewer. This website contains information related to the development of the software.

#163: Simultaneous reharvest fail
-----------------------+----------------------------------------------------
  Reporter: ddnebert | Owner: geonetwork-devel@lists.sourceforge.net
      Type: defect | Status: new
  Priority: major | Milestone:
Component: General | Version:
Resolution: | Keywords: harvest, concurrency, run
-----------------------+----------------------------------------------------
Comment (by ddnebert):

version 2.4.2 on Linux.

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/163#comment:1>

#163: Simultaneous reharvest fail
-----------------------+----------------------------------------------------
  Reporter: ddnebert | Owner: geonetwork-devel@lists.sourceforge.net
      Type: defect | Status: new
  Priority: major | Milestone:
Component: General | Version:
Resolution: | Keywords: harvest, concurrency, run
-----------------------+----------------------------------------------------
Comment (by ddnebert):

Class: MSQL Exception, Error: Concurrent Serializable Transaction Conflict

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/163#comment:2>

#163: Simultaneous reharvest fail
-----------------------+----------------------------------------------------
  Reporter: ddnebert | Owner: geonetwork-devel@lists.sourceforge.net
      Type: defect | Status: new
  Priority: major | Milestone: v2.5.0
Component: General | Version: v2.4.2
Resolution: | Keywords: harvest, concurrency, run
-----------------------+----------------------------------------------------
Changes (by ddnebert):

  * version: => v2.4.2
  * milestone: => v2.5.0

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/163#comment:3>

#163: Simultaneous reharvest fail
-----------------------+----------------------------------------------------
  Reporter: ddnebert | Owner: geonetwork-devel@lists.sourceforge.net
      Type: defect | Status: new
  Priority: major | Milestone: v2.5.0
Component: General | Version: v2.4.2
Resolution: | Keywords: harvest, concurrency, run
-----------------------+----------------------------------------------------
Comment (by simonp):

Hi Doug,

Which database are you using when you get this bug?

I have also come across the concurrent serializable transaction conflict
in the recent bug on Dbms retries and before that I used to get it in lots
of situations with ORACLE (e.g. when doing a big batch import).

I think the problem is in part due to the rather restrictive (but very
safe) way in which access to data in the database is synchronized between
different threads. At the moment (except for ORACLE) we have exclusive
locks, i.e. one writer or one reader can have a lock at any time (isolation
level SERIALIZABLE). This appears to be in part due to McKoi which doesn't
support any other isolation level and because this is the safest way (in
terms of consistency) to handle this issue.

We could change the default isolation level for all databases except McKoi
to allow data to be read at any time as long as there is no write in
progress (isolation level READ_COMMITTED - for all except McKoi which
shouldn't be used in production anyway) which would help with this problem
and also improve performance. However we should first check our database
interaction to see whether there are any issues.
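
A minimal sketch of the isolation-level choice described above, using only
the standard JDBC API. The class and method names are hypothetical and are
not taken from the GeoNetwork code base, and detecting McKoi by a
"jdbc:mckoi:" URL prefix is an assumption made for illustration.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical helper, not part of GeoNetwork.
class IsolationChooser {

    // Picks the isolation level to request for a given JDBC URL.
    // Assumption: McKoi connections are recognised by the "jdbc:mckoi:" prefix.
    static int isolationFor(String jdbcUrl) {
        if (jdbcUrl != null && jdbcUrl.startsWith("jdbc:mckoi:")) {
            // McKoi is kept on SERIALIZABLE, as described in the comment above.
            return Connection.TRANSACTION_SERIALIZABLE;
        }
        // Everything else: readers are not blocked by other readers.
        return Connection.TRANSACTION_READ_COMMITTED;
    }

    // Applies the chosen level to a connection that is already open.
    static void apply(Connection conn, String jdbcUrl) throws SQLException {
        conn.setTransactionIsolation(isolationFor(jdbcUrl));
    }
}
```

Whether READ_COMMITTED is safe everywhere still depends on the review of
database interaction mentioned above; the sketch only shows where the
switch would be made.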

The other issue with the harvest is that by my reckoning, each harvest
process uses one connection to the database - if you attempt to start 35
at the same time by selecting them all and choosing run then I think your
harvest will fail anyway - you need to schedule them in groups according
to the database pool size for your database (see <resources><resource
enabled="true">..... in web/geonetwork/WEB-INF/config.xml) or more simply,
just increase the poolsize - default poolsize is 10.
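
As a rough illustration of scheduling harvests in groups no larger than the
pool, the sketch below caps the number of harvest tasks that run at once.
It is not the GeoNetwork harvester: BoundedHarvestRunner and its parameters
are hypothetical, and each Callable merely stands in for one harvest that
would hold one database connection while it runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of GeoNetwork.
class BoundedHarvestRunner {

    // Runs every harvest task, but never more than maxConcurrent at a time.
    // Results come back in the same order as the input list.
    static <T> List<T> runAll(List<Callable<T>> harvests, int maxConcurrent) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until all tasks have completed.
            for (Future<T> f : pool.invokeAll(harvests)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while harvesting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a harvest task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With the default poolsize of 10, a caller might pass a maxConcurrent
somewhat below 10 so that connections remain for the rest of the
application; the exact margin is a guess and would need testing.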

Cheers,
Simon

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/163#comment:4>

#163: Simultaneous reharvest fail
-----------------------+----------------------------------------------------
  Reporter: ddnebert | Owner: geonetwork-devel@lists.sourceforge.net
      Type: defect | Status: new
  Priority: major | Milestone: v2.4.2
Component: General | Version: v2.4.2
Resolution: | Keywords: harvest, concurrency, run
-----------------------+----------------------------------------------------
Changes (by ddnebert):

  * milestone: v2.5.0 => v2.4.2

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/163#comment:5>

#163: Simultaneous reharvest fail
-----------------------+----------------------------------------------------
  Reporter: ddnebert | Owner: geonetwork-devel@lists.sourceforge.net
      Type: defect | Status: new
  Priority: major | Milestone: v2.4.2
Component: General | Version: v2.4.2
Resolution: | Keywords: harvest, concurrency, run
-----------------------+----------------------------------------------------
Comment (by ddnebert):

Replying to [comment:5 ddnebert]:

I am only using the default (embedded) database solution for
experimentation. If using or deploying MySQL or Postgresql would make this
concurrency problem go away, then I'll try that. I get the same problem if
I hit "Run" manually on two or more targets, or if a scheduled harvest on
multiple targets takes place. Not talking 35 concurrent requests here.

If this is a known behavior it should be flagged or prevented in the
interface "warning: more than one concurrent harvest not permitted with
embedded McKoi database"

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/163#comment:6>

#163: Simultaneous reharvest fail
-----------------------+----------------------------------------------------
  Reporter: ddnebert | Owner: geonetwork-devel@lists.sourceforge.net
      Type: defect | Status: new
  Priority: major | Milestone: v2.4.2
Component: General | Version: v2.4.2
Resolution: | Keywords: harvest, concurrency, run
-----------------------+----------------------------------------------------
Comment (by ddnebert):

Replying to [comment:6 ddnebert]:
> Replying to [comment:5 ddnebert]:
>
> I am only using the default (embedded) database solution for
experimentation. If using or deploying MySQL or Postgresql would make this
concurrency problem go away, then I'll try that. I get the same problem if
I hit "Run" manually on two or more targets, or if a scheduled harvest on
multiple targets takes place. Not talking 35 concurrent requests here.
>
> If this is a known behavior it should be flagged or prevented in the
interface "warning: more than one concurrent harvest not permitted with
embedded McKoi database"

We reinstalled the software, this time using MySQL and get the same
result. If I try to harvest only two remote collections at the same time,
the indexing chokes and both need to be redone separately. This is a
defect.

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/163#comment:7>

#163: Simultaneous reharvest fail
-----------------------+----------------------------------------------------
  Reporter: ddnebert | Owner: geonetwork-devel@lists.sourceforge.net
      Type: defect | Status: new
  Priority: major | Milestone: v2.4.2
Component: General | Version: v2.4.2
Resolution: | Keywords: harvest, concurrency, run
-----------------------+----------------------------------------------------
Comment (by ddnebert):

We reinstalled the software, this time using MySQL and get the same
result. If I try to harvest only two remote collections at the same time,
the indexing chokes and both need to be redone separately. This is a
defect.

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/163#comment:8>