[GeoNetwork-users] [GeoNetwork-devel] Harvesting delete logic

Hello Heikki/List

Correction to my email yesterday. The statement "However deletion is successful if the first harvest to
run after deletion is on one on the machines with harvested copies" is not correct. In fact deletion
might be unsuccessful because machines with harvested copies are harvesting off both the machine
where the original was deleted and also from machines where the harvested copy still exists.

So there is a chance (which increases with the number of machines) that the last harvest action might
be from a machine which still has a harvested copy. I explain this in more detail below:

I did 2 further tests by timing the first harvest to occur after deletion (on C) on machines A and B
which had harvested copies. I found for these 2 tests:

1) A runs harvest first after deletion - result was sucessful deletion on all
2) B runs harvest first after deletion - same problem "un-deletion", end up with harvested copy on all

Test 1) details:

A harvests from B,C every 2 minutes
B harvests from A,C every 3 minutes
C harvests from A,B every 4 minutes

Sequence of events for test 1)
- 0 minutes - start the 6 harvest nodes off on the 3 machines
- at 1 minutes M was deleted on C
-1 minute later:
-2 minutes A harvests from B - finds harvested record - 'says managed by another node' - no change
-2 minutes A harvests from C - finds nothing - so removes harvested record
- 3 minutes B harvests from C, finds nothing, so removes harvested copy
- 3 minutes B harvests from A, finds nothing, no action
- 4 minutes A harvests from B, finds nil, no action
- 4 minutes A harvests from C, finds nil, no action

and so on with end result deletion is successfully propogated to all machines.

Test 2 details:

A harvests from B,C every 4 minutes
B harvests from A,C every 2 minutes
C harvests from A,B every 3 minutes

Sequence of events for test 2)
- 0 minutes - start the 6 harvest nodes off on the 3 machines
- at 1 minutes M was deleted on C
-1 minute later:
-2 minutes B harvests from C - not found on C , so deleted
-2 minutes B harvests from A - finds harvested record - puts it back (log says 'indexing record (1555)')
- 3 minutes C harvests from B, finds harvested record, puts back - un-delete (log says 'Adding metadata with remote uuid:684...)
- 3 minutes C harvests from A, finds harvested record, but already got from B, no change
- 4 minutes A harvests from B, finds harvested record, already got, no change
- 4 minutes A harvests from C, finds harvested record, already got, no change

and so on with end result a harvested copy on all 3 machines.

So summarising the "deletion problem" (harvested copy gets put back) occurs under these conditions:

1) Assume harvesting is set up so that all machines harvest off each other. In general there
will be n machines and n(n-1) harvest nodes.

2a) The first harvest to run after deletion occurs on the machine with the original copy.
There is a 1/(n-1) chance of this occurring.

OR

2b) The first harvest runs on any of machines with a harvested copy [(n-1)/n chance]. Deletion is not
propogated to these machine from the original if the last harvest action is from one
from one of the machines which still has a copy. In general the chances of deletion
not occurring increases as the number of machines increasing. For example
with 3 machines there is a 50% chance that the record won't be deleted as we harvest
off the machine with the deleted copy (original) and also off a machine with a
harvested copy. For 6 machines we harvest off 5 nodes so there is a 4/5
chance that the last harvest action gets to keep a harvested copy. While there
is a 1/5 chance that the last harvest action results in a deletion by harvesting
off the machine where the original was deleted.

Andrew
  ----- Original Message -----
  From: awalsh
  To: tropicano@anonymised.com
  Cc: geonetwork-users@lists.sourceforge.net ; geonetwork-devel@anonymised.comeforge.net ; Paul Sliogeris ; Greg Reed
  Sent: Thursday, May 13, 2010 5:11 PM
  Subject: Re: [GeoNetwork-devel] Harvesting delete logic

  Hi Heikki,

  We have been doing some harvest testing with the same scenario you outline (each machine
  harvests off each other) and we find under certain timing the 'un-deletion' problem you mentioned
  indeed occurs.

  We set up 3 instances of the BlueNet MEST 20091209 version, lets call them A,B,C.
  We create metadata record M on C . We stagger the harvest times and set up 2 harvest nodes
  on each machine:

  A harvests B,C every 4 minutes.
  B harvests A,C every 3 minutes
  C harvests A,B every 2 minutes

  1) We start off the harvesting and metadata M gets harvested to A and B Ok
  2) We stop harvesting.
  3) Then start all off harvests again on all 3 machines.
  4) Approx. 1 minute later we delete metadata M on C.
  5) Then 1 minute later C harvests from B and gets M back, its un-deleted
  and a harvested copy of M is put back!
  6) Few seconds later C harvests from A - already got it no change.

  7) 3 minutes from start B does its harvests from C and A - no change

  and so on....

  So the end result is each machine all have harvested copies of M
  and deletion never gets done. Below is an excerpt of the geonetwork
  log from machine C to illustrate the issue.

  Basically this problem will occur if, after the delete is done, the first
  to harvest to run is on the machine where the deletion occurred. The deleted
  record gets harvested back form others where it has not been deleted
  yet. However deletion is successful if the first harvest to run after deletion
  is on one on the machines with harvested copies.

  Interested to hear thoughts/solutions from the GN community....

  Andrew

  ****C Harvest run - starts harvesting from A and B****

  2010-05-13 15:48:35,426 INFO [jeeves.request] - ==========================================================
  2010-05-13 15:48:35,442 INFO [jeeves.request] - HTML Request (from 192.103.47.15) : /geonetwork/srv/en/xml.harvesting.start
  2010-05-13 15:48:35,442 DEBUG [jeeves.request] - Method : POST
  2010-05-13 15:48:35,442 DEBUG [jeeves.request] - Content type : application/xml; charset=UTF-8
  2010-05-13 15:48:35,442 DEBUG [jeeves.request] - Accept : text/javascript, text/html, application/xml, text/xml, */*
  2010-05-13 15:48:35,442 DEBUG [jeeves.request] - Session id is 1g5cur9m95s6t
  2010-05-13 15:48:35,442 INFO [jeeves.service] - Dispatching : xml.harvesting.start
  2010-05-13 15:48:35,442 DEBUG [jeeves.service] - -> parameters are :
  <request>
    <id>418</id>
    <id>390</id>
  </request>
  2010-05-13 15:48:35,442 DEBUG [geonetwork.harvest-man] - Starting harvesting with id : 418
  2010-05-13 15:48:35,458 DEBUG [geonetwork.harvest-man] - Starting harvesting with id : 390
  2010-05-13 15:48:35,473 INFO [jeeves.service] - -> dispatching to output for : xml.harvesting.start
  2010-05-13 15:48:35,473 INFO [jeeves.service] - -> writing xml for : xml.harvesting.start
  2010-05-13 15:48:35,473 DEBUG [jeeves.service] - Service xml is :
  <response>
    <id status="ok">418</id>
    <id status="ok">390</id>
  </response>
  2010-05-13 15:48:35,473 INFO [jeeves.service] - -> output ended for : xml.harvesting.start
  2010-05-13 15:48:35,473 INFO [jeeves.service] - -> dispatch ended for : xml.harvesting.start
  2010-05-13 15:48:40,692 INFO [jeeves.request] - ==========================================================
  2010-05-13 15:48:40,708 INFO [jeeves.request] - HTML Request (from 192.103.47.15) : /geonetwork/srv/en/xml.harvesting.run
  2010-05-13 15:48:40,708 DEBUG [jeeves.request] - Method : POST
  2010-05-13 15:48:40,708 DEBUG [jeeves.request] - Content type : application/xml; charset=UTF-8
  2010-05-13 15:48:40,708 DEBUG [jeeves.request] - Accept : text/javascript, text/html, application/xml, text/xml, */*
  2010-05-13 15:48:40,708 DEBUG [jeeves.request] - Session id is 1g5cur9m95s6t
  2010-05-13 15:48:40,708 INFO [jeeves.service] - Dispatching : xml.harvesting.run
  2010-05-13 15:48:40,708 DEBUG [jeeves.service] - -> parameters are :
  <request>
    <id>418</id>
    <id>390</id>
  </request>
  2010-05-13 15:48:40,708 DEBUG [geonetwork.harvest-man] - Running harvesting with id : 418
  2010-05-13 15:48:40,723 DEBUG [geonetwork.harvest-man] - Running harvesting with id : 390
  2010-05-13 15:48:40,723 INFO [geonetwork.harvester] - Started harvesting from node : fromB (GeonetHarvester)
  2010-05-13 15:48:40,723 INFO [geonetwork.harvester] - Login into : fromB
  2010-05-13 15:48:40,723 INFO [geonetwork.harvester] - Started harvesting from node : fromA (GeonetHarvester)
  2010-05-13 15:48:40,723 INFO [jeeves.service] - -> dispatching to output for : xml.harvesting.run
  2010-05-13 15:48:40,723 INFO [geonetwork.harvester] - Login into : fromA
  2010-05-13 15:48:40,723 INFO [jeeves.service] - -> writing xml for : xml.harvesting.run
  2010-05-13 15:48:40,739 DEBUG [jeeves.service] - Service xml is :
  <response>
    <id status="ok">418</id>
    <id status="ok">390</id>
  </response>
  2010-05-13 15:48:40,739 INFO [jeeves.service] - -> output ended for : xml.harvesting.run
  2010-05-13 15:48:40,739 INFO [geonetwork.harvester] - Retrieving information from : snapper
  2010-05-13 15:48:40,739 INFO [jeeves.service] - -> dispatch ended for : xml.harvesting.run
  2010-05-13 15:48:40,739 INFO [geonetwork.harvester] - Retrieving information from : grey.metoc.gov.au
  2010-05-13 15:48:40,755 INFO [geonetwork.harvester] - Searching on : fromB
  2010-05-13 15:48:40,770 INFO [geonetwork.harvester] - Searching on : fromA
  2010-05-13 15:48:40,801 DEBUG [geonetwork.harvester] - Search results:
  <response>
    <summary count="1" type="local">
      <keywords />
    </summary>
    <metadata>
      <geonet:info xmlns:geonet="http://www.fao.org/geonetwork&quot;&gt;
        <id>1554</id>
        <uuid>6845b0b9-e9d4-458a-b44b-45f317d1dfb0</uuid>
        <schema>iso19139.mcp</schema>
        <createDate>2010-05-13T15:47:39</createDate>
        <changeDate>2010-05-13T15:47:39</changeDate>
        <source>9c78aea4-f747-46ae-8e67-3a1b28ec1ca1</source>
        <category>datasets</category>
        <score>1.0</score>
      </geonet:info>
    </metadata>
  </response>
  2010-05-13 15:48:40,801 INFO [geonetwork.harvester] - Records added to result list : 1
  2010-05-13 15:48:40,801 INFO [geonetwork.harvester] - Total records processed in all searches :1
  2010-05-13 15:48:40,801 INFO [geonetwork.harvester] - Start of alignment for : fromB
  2010-05-13 15:48:40,817 DEBUG [geonetwork.harvester] - - Skipped metadata managed by another harvesting node. uuid:6845b0b9-e9d4-458a-b44b-45f317d1dfb0, name:fromB
  2010-05-13 15:48:40,817 INFO [geonetwork.harvester] - End of alignment for : fromB
  2010-05-13 15:48:40,817 INFO [geonetwork.harvester] - Aligning source logos from for : fromB
  2010-05-13 15:48:40,817 INFO [geonetwork.harvester] - Ended harvesting from node : fromB (GeonetHarvester)
  2010-05-13 15:48:40,880 DEBUG [geonetwork.harvester] - Search results:
  <response>
    <summary count="1" type="local">
      <keywords />
    </summary>
    <metadata>
      <geonet:info xmlns:geonet="http://www.fao.org/geonetwork&quot;&gt;
        <id>484</id>
        <uuid>6845b0b9-e9d4-458a-b44b-45f317d1dfb0</uuid>
        <schema>iso19139.mcp</schema>
        <createDate>2010-05-13T15:47:39</createDate>
        <changeDate>2010-05-13T15:47:39</changeDate>
        <source>9c78aea4-f747-46ae-8e67-3a1b28ec1ca1</source>
        <category>datasets</category>
        <score>1.0</score>
      </geonet:info>
    </metadata>
  </response>
  2010-05-13 15:48:40,880 INFO [geonetwork.harvester] - Records added to result list : 1
  2010-05-13 15:48:40,880 INFO [geonetwork.harvester] - Total records processed in all searches :1
  2010-05-13 15:48:40,880 INFO [geonetwork.harvester] - Start of alignment for : fromA
  2010-05-13 15:48:40,895 WARN [geonetwork.harvester] - Raised exception while harvesting from : fromA (GeonetHarvester)
  2010-05-13 15:48:40,895 WARN [geonetwork.harvester] - (C) Class : MSQLException
  2010-05-13 15:48:40,895 WARN [geonetwork.harvester] - (C) Message : Concurrent Serializable Transaction Conflict(4): Select from table that has committed changes: Settings
  2010-05-13 15:48:56,395 INFO [jeeves.request] - ==========================================================
  2010-05-13 15:48:56,395 INFO [jeeves.request] - HTML Request (from 192.103.47.15) : /geonetwork/srv/en/metadata.select
  2010-05-13 15:48:56,395 DEBUG [jeeves.request] - Method : GET
  2010-05-13 15:48:56,395 DEBUG [jeeves.request] - Content type : null
  2010-05-13 15:48:56,395 DEBUG [jeeves.request] - Accept : text/javascript, text/html, application/xml, text/xml, */*
  2010-05-13 15:48:56,395 DEBUG [jeeves.request] - Session id is 1g5cur9m95s6t
  2010-05-13 15:48:56,411 INFO [jeeves.service] - Dispatching : metadata.select
  2010-05-13 15:48:56,411 DEBUG [jeeves.service] - -> parameters are :
  <request>
    <id>6845b0b9-e9d4-458a-b44b-45f317d1dfb0</id>
    <selected>true</selected>
  </request>
  2010-05-13 15:48:56,411 INFO [jeeves.service] - -> dispatching to output for : metadata.select
  2010-05-13 15:48:56,411 INFO [jeeves.service] - -> writing xml for : metadata.select
  2010-05-13 15:48:56,411 DEBUG [jeeves.service] - Service xml is :
  <response>
    <Selected>1</Selected>
  </response>
  2010-05-13 15:48:56,411 INFO [jeeves.service] - -> output ended for : metadata.select
  2010-05-13 15:48:56,411 INFO [jeeves.service] - -> dispatch ended for : metadata.select
  2010-05-13 15:49:04,786 INFO [jeeves.request] - ==========================================================

  ******approx. 1 minute later original metadata is deleted on C*****

  2010-05-13 15:49:04,801 INFO [jeeves.request] - HTML Request (from 192.103.47.15) : /geonetwork/srv/en/metadata.delete
  2010-05-13 15:49:04,817 DEBUG [jeeves.request] - Method : GET
  2010-05-13 15:49:04,817 DEBUG [jeeves.request] - Content type : null
  2010-05-13 15:49:04,817 DEBUG [jeeves.request] - Accept : text/javascript, text/html, application/xml, text/xml, */*
  2010-05-13 15:49:04,817 DEBUG [jeeves.request] - Session id is 1g5cur9m95s6t
  2010-05-13 15:49:04,817 INFO [jeeves.service] - Dispatching : metadata.delete
  2010-05-13 15:49:04,817 DEBUG [jeeves.service] - -> parameters are :
  <request>
    <id>104</id>
  </request>
  2010-05-13 15:49:05,130 INFO [jeeves.service] - -> dispatching to output for : metadata.delete
  2010-05-13 15:49:05,176 INFO [jeeves.service] - -> transforming with stylesheet : Y:\MEST20091209\web\geonetwork/xsl/infoedit.xsl
  2010-05-13 15:49:05,223 INFO [jeeves.service] - -> end transformation for : metadata.delete
  2010-05-13 15:49:05,223 INFO [jeeves.service] - -> output ended for : metadata.delete
  2010-05-13 15:49:05,223 INFO [jeeves.service] - -> dispatch ended for : metadata.delete

  ******Next (2 min from start) harvest runs again on C****

  2010-05-13 15:50:40,817 INFO [geonetwork.harvester] - Started harvesting from node : fromB (GeonetHarvester)
  2010-05-13 15:50:40,833 INFO [geonetwork.harvester] - Login into : fromB
  2010-05-13 15:50:40,833 INFO [geonetwork.harvester] - Retrieving information from : snapper
  2010-05-13 15:50:40,848 INFO [geonetwork.harvester] - Searching on : fromB

  ****It finds the same record harvested copy on B

  2010-05-13 15:50:40,895 DEBUG [geonetwork.harvester] - Search results:
  <response>
    <summary count="1" type="local">
      <keywords />
    </summary>
    <metadata>
      <geonet:info xmlns:geonet="http://www.fao.org/geonetwork&quot;&gt;
        <id>1554</id>
        <uuid>6845b0b9-e9d4-458a-b44b-45f317d1dfb0</uuid>
        <schema>iso19139.mcp</schema>
        <createDate>2010-05-13T15:47:39</createDate>
        <changeDate>2010-05-13T15:47:39</changeDate>
        <source>9c78aea4-f747-46ae-8e67-3a1b28ec1ca1</source>
        <category>datasets</category>
        <score>1.0</score>
      </geonet:info>
    </metadata>
  </response>

  *****and it gets put back! as a harvested record

  2010-05-13 15:50:40,895 INFO [geonetwork.harvester] - Records added to result list : 1
  2010-05-13 15:50:40,895 INFO [geonetwork.harvester] - Total records processed in all searches :1
  2010-05-13 15:50:40,895 INFO [geonetwork.harvester] - Start of alignment for : fromB
  2010-05-13 15:50:40,895 INFO [geonetwork.harvester] - Started harvesting from node : fromA (GeonetHarvester)
  2010-05-13 15:50:40,911 INFO [geonetwork.harvester] - Login into : fromA
  2010-05-13 15:50:40,911 INFO [geonetwork.harvester] - Retrieving information from : grey.metoc.gov.au
  2010-05-13 15:50:40,942 INFO [geonetwork.harvester] - Searching on : fromA
  2010-05-13 15:50:40,973 DEBUG [geonetwork.harvester] - - Adding metadata with remote uuid:6845b0b9-e9d4-458a-b44b-45f317d1dfb0
  2010-05-13 15:50:40,973 DEBUG [geonetwork.datamanager] - setNamespacePrefixUsingSchemas: Found schema iso19139.mcp with NSURI http://bluenet3.antcrc.utas.edu.au/mcp
  2010-05-13 15:50:40,973 DEBUG [geonetwork.datamanager] - xlink resolver was false
  2010-05-13 15:50:40,989 DEBUG [geonetwork.datamanager] - Indexing record (105)
  2010-05-13 15:50:40,989 DEBUG [geonetwork.datamanager] - xlink resolver was false
  2010-05-13 15:50:41,005 DEBUG [geonetwork.datamanager] - record schema (iso19139.mcp)
  2010-05-13 15:50:41,005 DEBUG [geonetwork.datamanager] - record createDate (2010-05-13T15:47:39)
  2010-05-13 15:50:41,051 DEBUG [geonetwork.harvester] - Search results:
  <response>
    <summary count="1" type="local">
      <keywords />
    </summary>
    <metadata>
      <geonet:info xmlns:geonet="http://www.fao.org/geonetwork&quot;&gt;
        <id>484</id>
        <uuid>6845b0b9-e9d4-458a-b44b-45f317d1dfb0</uuid>
        <schema>iso19139.mcp</schema>
        <createDate>2010-05-13T15:47:39</createDate>
        <changeDate>2010-05-13T15:47:39</changeDate>
        <source>9c78aea4-f747-46ae-8e67-3a1b28ec1ca1</source>
        <category>datasets</category>
        <score>1.0</score>
      </geonet:info>
    </metadata>
  </response>
  2010-05-13 15:50:41,051 INFO [geonetwork.harvester] - Records added to result list : 1
  2010-05-13 15:50:41,051 INFO [geonetwork.harvester] - Total records processed in all searches :1
  2010-05-13 15:50:41,051 INFO [geonetwork.harvester] - Start of alignment for : fromA
  2010-05-13 15:50:41,067 WARN [geonetwork.harvester] - Raised exception while harvesting from : fromA (GeonetHarvester)
  2010-05-13 15:50:41,067 WARN [geonetwork.harvester] - (C) Class : MSQLException
  2010-05-13 15:50:41,067 WARN [geonetwork.harvester] - (C) Message : Concurrent Serializable Transaction Conflict(4): Select from table that has committed changes: Settings
  2010-05-13 15:50:41,567 DEBUG [geonetwork.datamanager] - Indexing record (105)
  2010-05-13 15:50:41,567 DEBUG [geonetwork.datamanager] - xlink resolver was false
  2010-05-13 15:50:41,583 DEBUG [geonetwork.datamanager] - record schema (iso19139.mcp)
  2010-05-13 15:50:41,583 DEBUG [geonetwork.datamanager] - record createDate (2010-05-13T15:47:39)
  2010-05-13 15:50:41,958 DEBUG [geonetwork.harvester] - - Setting category : datasets
  2010-05-13 15:50:42,067 DEBUG [geonetwork.datamanager] - Indexing record (105)
  2010-05-13 15:50:42,067 DEBUG [geonetwork.datamanager] - xlink resolver was false
  2010-05-13 15:50:42,067 DEBUG [geonetwork.datamanager] - record schema (iso19139.mcp)
  2010-05-13 15:50:42,067 DEBUG [geonetwork.datamanager] - record createDate (2010-05-13T15:47:39)
  2010-05-13 15:50:42,473 INFO [geonetwork.harvester] - End of alignment for : fromB
  2010-05-13 15:50:42,473 INFO [geonetwork.harvester] - Aligning source logos from for : fromB
  2010-05-13 15:50:42,473 INFO [geonetwork.harvester] - Ended harvesting from node : fromB (GeonetHarvester)
  2010-05-13 15:50:55,801 INFO [jeeves.request] - ==========================================================

    ----- Original Message -----
    From: heikki
    To: Devel geonetwork-devel@lists.sourceforge.net ; Geonetwork Users
    Sent: Thursday, December 03, 2009 5:06 AM
    Subject: [GeoNetwork-devel] Harvesting delete logic

    hello,

    I think there is a potential problem in the Harvesting logic concerning "delete". It is caused by the impossibility of deciding whether to update/insert harvested metadata when this decision is based on the last modification date (as it is -- for updates).

    Suppose there are nodes A, B and C, and all of them harvest each other.

    They all contain metadata M, originally introduced in C but now present in A, B and C.

    Then this metadata is deleted in C. The "delete" action is propagated to A and B when they do their next havesting runs.

    Now let's say M is already deleted in A, but not yet in B; for example because B's harvesting interval for C is much longer.

    Meanwhile A does a new harvesting run from B, and so retrieves the not-yet deleted metadata M from B and inserts it again in A.

    Result: delete in A is undone.

    And of course, now C could harvest it again from A, so it gets back to its origin. In this manner a "delete" action could go around forever, or be completely canceled in all nodes.

    My question to you, list: is this correct, or am I missing something here?

    If this scenario is correct, then I think this is a bug in GeoNetwork harvesting logic. Possible fixes are:

    - at delete, do not really delete the metadata but instead set a boolean flag 'isDeleted' to true and record the timestamp; this would fix it but probably requires to keep all deleted metadata forever, if we want to be sure this scenario can never occur.

    - harvesting client nodes could be registered with harvesting server nodes. The harvesting server node, on metadata delete, pushes a delete action to all its registered clients, using for example CSW-Transaction; this would probably fix it but it's fishy, and requires the harvesting client nodes to enable CSW serving just to handle those CSW-T deletes (or expose some other means for the harvesting server node to push the delete).

    Or maybe I'm missing some point completely and this is no issue at all ?

    Looking forward to your ideas

    Kind regards
    Heikki Doeleman

----------------------------------------------------------------------------

    ------------------------------------------------------------------------------
    Join us December 9, 2009 for the Red Hat Virtual Experience,
    a free event focused on virtualization and cloud computing.
    Attend in-depth sessions from your desk. Your couch. Anywhere.
    http://p.sf.net/sfu/redhat-sfdev2dev

----------------------------------------------------------------------------

    _______________________________________________
    GeoNetwork-devel mailing list
    GeoNetwork-devel@lists.sourceforge.net
    https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
    GeoNetwork OpenSource is maintained at http://sourceforge.net/projects/geonetwork