[GeoNetwork-devel] File Identifiers being replaced during harvest [SEC=UNCLASSIFIED]

Hi all,

We've discovered that when harvesting ISO 19139 metadata records using WebDAV, GeoNetwork replaces the original file identifier UUID in the metadata record with a new UUID on initial harvest. It is important for us that the file identifiers assigned by the external metadata system are preserved, as we have references to those metadata records in other systems based on the file identifier.

Before we go charging off to modify the code so that the original file identifiers are preserved, I'd be very interested to find out from the developers that implemented this functionality the rationale behind replacing the original file identifier. Do other users of GeoNetwork see benefit in preserving the original file identifier?

Our GeoNetwork implementation is based on the BlueNetMEST-1.1 branch.

Thanks,

Aaron Sedgmen
Geoscience Australia
GPO Box 378, Canberra, ACT, 2601, Australia
(02) 6249 9576
Aaron.Sedgmen@anonymised.com
http://www.ga.gov.au

Hi All and Happy New Year,

I have just returned to work and noticed that no-one has replied to Aaron's email. I also work at Geoscience Australia, so I thought that I would let others respond. However, no-one has, so I will give my two cents' worth.

I believe that harvested metadata can't be edited via GeoNetwork. It is therefore meant to be a copy of the original metadata record. Therefore, I agree that the original FileIdentifier content should be preserved when harvesting metadata. The metadata record IS a COPY of the original and therefore should be exact.

The ANZLIC Metadata Profile of ISO 19115 made FileIdentifier mandatory for two reasons. One of the reasons is to identify duplicate metadata records. It is expected, because of the practice of harvesting rather than distributed searching, that there would be multiple copies of the same metadata record exposed for searching at multiple CSW or Z39.50 servers. If a search is performed across these multiple servers then duplicate metadata records will be returned from the different servers. This may be very annoying to the users of that search system. It was expected that an intelligent search system would identify multiple copies of one metadata record and only present one of those records. The user can then use the content of the metadata record to determine the authoritative version and use it to access the data, service, etc. Hence, the FileIdentifier was made mandatory in the ANZLIC Metadata Profile.
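
A minimal sketch of the duplicate filtering described above, assuming each search hit has already been reduced to a dictionary that carries the record's file identifier. The function name, the `file_identifier` key and the sample hits are hypothetical and only illustrate the idea.

```python
from collections import OrderedDict


def dedupe_by_file_identifier(hits):
    """Keep the first hit seen for each file identifier, preserving order."""
    unique = OrderedDict()
    for hit in hits:
        # setdefault leaves an existing entry untouched, so the first copy wins.
        unique.setdefault(hit["file_identifier"], hit)
    return list(unique.values())


# Hypothetical hits returned by two different catalogue servers.
hits = [
    {"server": "csw-a", "file_identifier": "uuid-1", "title": "Geology map"},
    {"server": "csw-b", "file_identifier": "uuid-1", "title": "Geology map"},
    {"server": "csw-b", "file_identifier": "uuid-2", "title": "Bathymetry grid"},
]
unique_hits = dedupe_by_file_identifier(hits)
```

This only works if every server exposes the same identifier for the same record, which is why the identifier has to survive harvesting unchanged.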

I don't know of a distributed search system that currently does this; however, I believe that the FileIdentifier could and should be used for this purpose.

I strongly suggest that GeoNetwork's code be changed so that the FileIdentifier can be used to identify duplicate metadata records.

If Geoscience Australia modifies the code to preserve the FileIdentifier from harvested metadata then will the GN community accept that code into the trunk of GN? If not can anyone explain why not?

Thanks.

John

-----Original Message-----
From: Sedgmen Aaron
Sent: Friday, 19 December 2008 4:58 PM
To: geonetwork-devel@lists.sourceforge.net
Subject: [GeoNetwork-devel] File Identifiers being replaced during harvest [SEC=UNCLASSIFIED]

[...]

Hi John,

On Jan 19, 2009, at 2:03 AM, <John.Hockaday@anonymised.com> wrote:

> I strongly suggest that GeoNetwork’s code be changed so that the FileIdentifier can be used to identify duplicate metadata records.

I am not sure I fully understand. GeoNetwork is not supposed to harvest a metadata record again, not even from another source, if the record has a FileIdentifier that is already in use by a local copy (or by the local original). That makes it impossible to harvest multiple copies of a metadata record with the same identifier, and so it makes the above function redundant.
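
The intended behaviour described above could be sketched as follows. This is an illustration only, not GeoNetwork's harvester code; `harvest`, `local_store` and the record layout are hypothetical.

```python
def harvest(remote_records, local_store):
    """Add records whose identifier is not held locally; return skipped ids."""
    skipped = []
    for record in remote_records:
        identifier = record["file_identifier"]
        if identifier in local_store:
            # A local copy (or the local original) already uses this identifier.
            skipped.append(identifier)
            continue
        local_store[identifier] = record
    return skipped


# Hypothetical local catalogue keyed by file identifier.
local_store = {"uuid-1": {"file_identifier": "uuid-1", "source": "node-a"}}
remote = [
    {"file_identifier": "uuid-1", "source": "node-b"},
    {"file_identifier": "uuid-2", "source": "node-b"},
]
skipped = harvest(remote, local_store)
```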

> If Geoscience Australia modifies the code to preserve the FileIdentifier from harvested metadata then will the GN community accept that code into the trunk of GN? If not can anyone explain why not?

Hope the above explained why not. Besides that, it would make sense to develop a process in GeoNetwork that would identify similarity between metadata records. Metadata CAN end up in GeoNetwork as a duplicate if the FileIdentifier changed, for instance. Currently we have no mechanism to deal with that. In this case, very similar results could be grouped, much as Google does.
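
One possible way to group near-duplicate records, sketched here with Jaccard similarity over title words. The choice of measure, the 0.7 threshold and all names are assumptions made for illustration; nothing like this exists in GeoNetwork according to this thread.

```python
def jaccard(a, b):
    """Jaccard similarity of the lower-cased word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def group_similar(titles, threshold=0.7):
    """Greedily put each title into the first group whose first member is similar."""
    groups = []
    for title in titles:
        for group in groups:
            if jaccard(group[0], title) >= threshold:
                group.append(title)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([title])
    return groups


titles = [
    "Geology map of Canberra 1:100 000",
    "Geology map of Canberra 1:100 000 (2nd edition)",
    "Bathymetry grid of the Tasman Sea",
]
groups = group_similar(titles)
```

A real implementation would probably compare more than the title (abstract, extent, dates), but the grouping step would look much the same.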

Hope this helps. Please let me know if you found cases where your duplicate FileIdentifier actually occurred.
Ciao,
Jeroen


Jeroen Ticheler

GeoCat bv
Grotenhuisweg 61
7384 CT Wilp
Tel: +31 (0)6 81286572
http://geocat.net


Hi Jeroen,

Thanks for taking time to look at this. I'll try to clarify our situation and the issue we're having a little better.

We have a metadata system separate to our GeoNetwork (GN) installation in which we are generating ISO 19139 metadata records (ANZLIC Metadata Profile). We periodically harvest metadata from this system into our GN repository, resulting in two copies of the same metadata records (or so we hope). The non-GN metadata system is the authoritative source - this is where the metadata are created, updated and deleted. GN must synchronise its metadata content in the harvest process, synchronising any inserted, updated or deleted records that may have occurred since the previous harvest.

Our observation is that during the initial harvest of a record, GN will ignore the existing FileIdentifier and replace it with its own UUID. This is effectively creating a logically new metadata record, and we now have two unique metadata records describing the same resource. It is important that the FileIdentifier is not altered, such that the harvest results in a copy of the same metadata record, i.e. the two different metadata systems contain instances of the same metadata record.
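
A minimal sketch of the behaviour we are after on initial harvest: read the identifier from the ISO 19139 record and only generate a new UUID when none is present. The function names and the sample identifier are hypothetical; only the `gmd` and `gco` namespace URIs and the `gmd:fileIdentifier/gco:CharacterString` path come from ISO 19139 itself.

```python
import uuid
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def read_file_identifier(xml_text):
    """Return the record's file identifier, or None if it is missing or empty."""
    root = ET.fromstring(xml_text)
    node = root.find("gmd:fileIdentifier/gco:CharacterString", NS)
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


def uuid_for_harvested_record(xml_text):
    """Preserve the source identifier; fall back to a fresh UUID only if absent."""
    return read_file_identifier(xml_text) or str(uuid.uuid4())


# Hypothetical record with a made-up identifier.
SAMPLE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:fileIdentifier>
    <gco:CharacterString>a05f7892-0000-0000-0000-000000000001</gco:CharacterString>
  </gmd:fileIdentifier>
</gmd:MD_Metadata>"""

preserved = uuid_for_harvested_record(SAMPLE)
```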

It becomes more problematic when synchronising an updated record. GN will blindly replace a previously harvested record in its repository if the time stamp has changed on the file being harvested (we're using WebDAV). GN does not check if the FileIdentifier in the updated record being harvested matches the FileIdentifier in the corresponding record in the repository. This effectively results in GN now containing the updated record with a FileIdentifier matching the authoritative source, which is what we wanted in the first place, although it upsets GN's internal record keeping as it tracks the FileIdentifier in a separate column in the database, populated during initial harvest.
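
The missing check could look roughly like this. It is a sketch under assumptions, not GN's code: `plan_update`, its arguments and the three outcome labels are hypothetical, with the stored values standing in for the UUID column and the time stamp recorded at the previous harvest.

```python
def plan_update(stored_uuid, stored_mtime, incoming_uuid, incoming_mtime):
    """Decide what to do with a file that was harvested before."""
    if incoming_mtime == stored_mtime:
        return "skip"      # file untouched since the last harvest
    if incoming_uuid != stored_uuid:
        return "conflict"  # identifier in the file differs from the stored one
    return "update"        # same record, newer content


# Time stamps are plain integers here purely for illustration.
unchanged = plan_update("uuid-1", 100, "uuid-1", 100)
changed = plan_update("uuid-1", 100, "uuid-1", 200)
mismatch = plan_update("gn-generated-uuid", 100, "uuid-1", 200)
```

With the current behaviour every updated record would land in the "conflict" case, because the stored UUID is the one GN generated on initial harvest rather than the one in the file.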

I can see why the FileIdentifier is being replaced by GN during harvest, to ensure uniqueness of the metadata in the GN repository - this could be a real issue when harvesting from multiple nodes. The question here is whether the benefits of ensuring uniqueness in the GN repository outweigh the implications of altering harvested metadata from the authoritative source.

I hope this makes things clearer.

Regards,

Aaron Sedgmen
Geoscience Australia

Hi Aaron,
I see I never responded to this email. Sorry about that.

I fully agree with your observations and note that there is some inconsistent behavior in GeoNetwork in this respect. The MEF-based import that GeoNetwork also has actually keeps the original UUID. I now understand from you that the WebDAV harvester does not consider the UUID. In fact, that is something that would be better fixed to support your workflow.

You now have two options:
1- put a ticket in the trac.osgeo.org/geonetwork and wait until someone fixes it
2- modify the behavior yourself (or hire someone to do it) and submit a patch with the fix.

Hope this helps,
ciao,
Jeroen

On Jan 22, 2009, at 2:01 AM, <Aaron.Sedgmen@anonymised.com> wrote:

[...]

Hi Jeroen,

Thanks for confirming that this is a bug. We're not in a position to develop a fix immediately, and we've developed a workaround within our own system for now. We may be able to apply a proper fix and submit a patch sometime down the track, although this is uncertain, so I've created a ticket as suggested.

Regards,

Aaron Sedgmen
Geoscience Australia

-----Original Message-----
From: Jeroen Ticheler [mailto:Jeroen.Ticheler@anonymised.com]
Sent: Tuesday, 10 February 2009 9:55
To: Sedgmen Aaron
Cc: Hockaday John; geonetwork-devel@lists.sourceforge.net
Subject: Re: [GeoNetwork-devel] File Identifiers being replaced during harvest [SEC=UNCLASSIFIED]

[...]