[GeoNetwork-users] GeoNetwork OAI-PMH deletedRecord support?

Hi All,

About two years ago, a proposal to implement OAI-PMH deletedRecord support was discussed (http://trac.osgeo.org/geonetwork/wiki/oaipmh_improvements).
This would be the best way to preserve the metadata of a GN node A on all the other GN nodes X, Y, Z that are synchronized with it…

Case study: GN node A provides metadata set A, which is harvested daily by GN nodes X, Y, Z. All GN nodes A, X, Y, Z are intended to provide metadata set A to users. Suppose that, because of a crash, node A loses metadata set A. On the next run of the harvester for metadata set A on GN nodes X, Y, Z, the behavior is as follows:

  1. Without deletedRecord support: as the UUIDs of metadata set A will no longer be returned by GN node A, the next OAI-PMH harvest of node A from nodes X, Y, Z will also delete all mirrored copies of metadata set A on nodes X, Y, Z. These metadata are then completely lost.
  2. With deletedRecord support: GN nodes X, Y, Z should be able to keep the metadata set A mirrored from GN node A as long as the records are not explicitly marked as deleted in the OAI-PMH response. In a crash of GN node A, the metadata were not deleted but simply lost; thus metadata set A will remain available to users on GN nodes X, Y, Z.

Does anyone know the current status of this proposal, or whether future releases of GN are intended to support deletedRecord?
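
To make the two cases concrete, here is a minimal sketch of the reconciliation step on a harvesting node. It is a toy model, not GeoNetwork's harvester code: the `reconcile` function and the `{uuid: status}` shape of the response are assumptions made for illustration (in OAI-PMH itself, a deleted record is reported with `status="deleted"` on its header).

```python
def reconcile(mirrored, response, deleted_record_support):
    """Return the set of UUIDs a harvesting node keeps after one run.

    mirrored: set of UUIDs already mirrored locally from node A
    response: dict mapping UUID -> "active" or "deleted" (hypothetical shape)
    deleted_record_support: whether deletions must be explicit
    """
    active = {uuid for uuid, status in response.items() if status != "deleted"}
    if deleted_record_support:
        # Case 2: only drop records that node A explicitly marks as deleted.
        explicitly_deleted = {
            uuid for uuid, status in response.items() if status == "deleted"
        }
        return (mirrored - explicitly_deleted) | active
    # Case 1: anything node A no longer returns is treated as deleted.
    return active


mirrored_on_x = {"a1", "a2", "a3"}
response_after_crash = {}  # node A lost set A and returns nothing

print(reconcile(mirrored_on_x, response_after_crash, deleted_record_support=False))
print(sorted(reconcile(mirrored_on_x, response_after_crash, deleted_record_support=True)))
```

In the first call the mirror on node X is emptied; in the second it is left untouched, because nothing was explicitly marked as deleted.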

Many thanks,
Victor

Hi all,

I am having trouble getting the expected result when I enter the UUID of a metadata record and the UUID contains one or more of the characters “:” and “.”.
This is for Simple Search, with the UUID entered in the “What?” input field.

Thus, for a metadata record with UUID=“TEST:LUCENE1:TESTLUCENE2:TEST.LUCENE.3::TEST_AFTER_DOTS”, when I enter the full UUID in the Keywords field, it is not clear to me what happens, as both the indexer and the searcher tokenize the input string in a strange manner.

The searcher does this when entering WHAT?=TEST:LUCENE1:TESTLUCENE2 (picked from the logs):

  • Analyze field any : TEST:LUCENE1:TESTLUCENE2
  • Analyzed text is test:lucene1 testlucene2
  • Lucene query: … … +(+any:test:lucene1 +any:testlucene2) … …

The Lucene index also has similar content for the field any (I guess the same tokenizer is used):

  • any=test:lucene1
  • any=testlucene2

The Lucene index does not hold the full content of such a UUID in the field any; it is only kept in the field _uuid, but the latter is not used in a Simple Search (where only the field any is considered). The same behavior happens for the “.” separator (for “TEST.LUCENE.3”, for example, the tokenizer produces two different values for the field ANY: “TEST.LUCENE” and “3”).
Moreover, the Lucene index keeps the full content of the UUID in the “any” index field for UUIDs with the default format (i.e. UUIDs generated by GeoNetwork), like ****-****-****-****

Is this the correct behavior?
Why does Lucene keep the first occurrence of “:” or “.” but not the following occurrences when indexing/searching the field ANY, and why does it keep the full UUID content in the field ANY when there are no “:” or “.” characters inside?

Many Thanks,
Victor

Victor,

The behaviour you are seeing is determined by the type of lucene tokenizer (which is part of the lucene analyzer). The analyzer is applied to both the text to be indexed and the search terms (for consistent results) - as you observed.

There are a range of analyzers available with Lucene that you can choose and configure GeoNetwork to apply to your index - see WEB-INF/config-lucene.xml. You can even write your own and configure GeoNetwork to use it.

By default, GeoNetwork applies its own Analyzer (with a set of tokenizers) to many fields (e.g. any) - see WEB-INF/config-lucene.xml again for which ones. The GeoNetworkAnalyzer uses a StandardTokenizer, an ASCIIFoldingFilter and a StopFilter (with a configurable list of stop words) - see the Lucene documentation for more details of what these do, as it is too much to explain here. The main advantage of the GeoNetworkAnalyzer (over the Lucene StandardAnalyzer) is support for wildcard queries (see the Javadoc for the GeoNetworkAnalyzer - web/src/main/java/org/fao/geonet/kernel/search/GeoNetworkAnalyzer.java).
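
As a rough illustration of such a chain (tokenizer, then ASCII folding, then stop-word removal), here is a Python sketch. It is not the GeoNetworkAnalyzer: the stop-word list is made up, whitespace splitting stands in for the tokenizer, the folding is only approximated by stripping diacritics, and the lower-casing step is an assumption based on the lower-cased terms in the logs Victor quoted.

```python
import unicodedata

STOP_WORDS = {"the", "of", "a"}  # hypothetical list; the real one is configurable


def ascii_fold(token):
    """Approximate ASCII folding: strip diacritics via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Tokenize on whitespace, lower-case, fold to ASCII, drop stop words."""
    tokens = (ascii_fold(tok.lower()) for tok in text.split())
    return [tok for tok in tokens if tok not in STOP_WORDS]


# The same chain runs at index time and at query time, so the terms line up.
indexed_terms = set(analyze("The Lake of Genève"))
query_terms = analyze("geneve LAKE")
print(all(term in indexed_terms for term in query_terms))
```

Because both sides go through the same steps, a query typed without accents or in a different case still matches the indexed terms.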

We've found that changing the GeoNetworkAnalyzer to use a WhitespaceTokenizer (instead of the StandardTokenizer) gives results users expect to see when searching using terms that contain characters like apostrophes, slashes etc. The change is easy to make - see https://github.com/marlin2/core-geonetwork/commit/a6b2830577b29d8096b6efd586bf0b16ee16869c#diff-0d1a0e81fbf55a25e7971a5fd50b5471 for example. (Note: If you apply a recompiled jar containing your modified GeoNetworkAnalyzer or other changes in config-lucene.xml to an existing catalogue then you need to reindex from the admin page). I've been thinking we should apply this to trunk but as yet I haven't had time to fully explore all the implications of this change, other alternatives or discuss it with anyone :-).
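
To show why the choice of tokenizer matters for identifiers like Victor's, here is a small Python sketch comparing whitespace splitting with a toy word-boundary rule. The toy rule is NOT Lucene's actual grammar; it is a hypothetical simplification (keep ":" or "." only between two letters, or "." between two digits) chosen because it reproduces the splits shown in the logs. Both function names are made up for this example.

```python
def whitespace_tokens(text):
    """Split on whitespace only, so punctuation inside a term is preserved."""
    return text.lower().split()


def toy_standard_tokens(text):
    """Toy word-boundary rule (an assumption, not Lucene's real grammar)."""
    chars = text.lower()
    tokens, current = [], ""
    for i, ch in enumerate(chars):
        prev = chars[i - 1] if i > 0 else ""
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        if ch.isalnum() or ch == "_":
            current += ch
        elif ch in ":." and prev.isalpha() and nxt.isalpha():
            current += ch  # separator flanked by letters stays inside the token
        elif ch == "." and prev.isdigit() and nxt.isdigit():
            current += ch  # dot flanked by digits stays inside the token
        else:
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


for value in ("TEST:LUCENE1:TESTLUCENE2", "TEST.LUCENE.3"):
    print(value, toy_standard_tokens(value), whitespace_tokens(value))
```

Under the toy rule the first ":" survives because it sits between two letters, while the second one follows the digit "1" and so splits the term; whitespace splitting keeps each value as a single term.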

Cheers,
Simon
________________________________________
From: Victor Sinceac [victor.sinceac@anonymised.com]
Sent: Monday, 7 April 2014 8:06 PM
To: geonetwork-devel@lists.sourceforge.net; geonetwork-users@lists.sourceforge.net
Subject: [GeoNetwork-devel] GeoNetwork: Lucene query tokenized for ANY

[...]

Hi, another option, if you know that your end users will search by UUIDs
or resource ids, is to use the OR operator on any and another field. The
other field, like _uuid or identifier, will generally use a
WhitespaceTokenizer or KeywordTokenizer, which will preserve special
characters.

The query will contain any_OR_identifier=value.
This could be done by modifying your search form field.
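
As a small sketch of what such a request parameter looks like once URL-encoded, assuming the field names are simply joined with `_OR_` as in the example above (the `build_search_query` helper is hypothetical, and no particular search endpoint is implied):

```python
from urllib.parse import urlencode


def build_search_query(value, fields=("any", "identifier")):
    """Build a query string that searches `value` in any of `fields`."""
    parameter = "_OR_".join(fields)  # e.g. any_OR_identifier
    return urlencode({parameter: value})


print(build_search_query("TEST:LUCENE1:TESTLUCENE2"))
```

The ":" characters are percent-encoded in the query string; the server decodes them before the value reaches the search fields.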

Cheers.

Francois

2014-04-07 14:36 GMT+02:00 <Simon.Pigot@anonymised.com>:

[...]

------------------------------------------------------------------------------

Put Bad Developers to Shame
Dominate Development with Jenkins Continuous Integration
Continuously Automate Build, Test & Deployment
Start a new project now. Try Jenkins in the cloud.
http://p.sf.net/sfu/13600_Cloudbees_APR
_______________________________________________
GeoNetwork-devel mailing list
GeoNetwork-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
GeoNetwork OpenSource is maintained at http://sourceforge.net/projects/geonetwork

Many thanks François!

I finally used Simon’s solution but also had to combine it with some gn_search.js sauce: the simple/advanced search JS functions detect when an any/phrase/all/or input field contains a single, complete, well-formed UUID (WMO Metadata Core Profile in this case, i.e. something like “urn:x-wmo:md:..::”) and, if so, change the similarity from the default to 1.

I had to use this ugly workaround to reduce noise in the results: as UUIDs are unique, a user entering a UUID in Simple Search would expect to get a single result, while the default similarity (0.8 for the What? field in Simple Search) would return multiple similar UUIDs (they may differ by only one character, for example).
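
The idea behind that workaround can be sketched as follows. This is written in Python for illustration only (the real change lives in gn_search.js), and the pattern is a loose stand-in, since the exact identifier format is not spelled out here: it only assumes a single term with no spaces that starts with `urn:x-wmo:md:`. The function name is made up for this example.

```python
import re

DEFAULT_SIMILARITY = 0.8  # default for the What? field in Simple Search
EXACT_SIMILARITY = 1.0

# Hypothetical pattern: a lone identifier with no whitespace.
UUID_LIKE = re.compile(r"urn:x-wmo:md:\S+")


def choose_similarity(search_text):
    """Use exact matching when the input is a lone UUID-like identifier."""
    if UUID_LIKE.fullmatch(search_text.strip()):
        return EXACT_SIMILARITY
    return DEFAULT_SIMILARITY


print(choose_similarity("urn:x-wmo:md:int.example::ABC"))
print(choose_similarity("sea surface temperature"))
```

Free-text searches keep the fuzzy default, and only an input that looks like one complete identifier switches to exact matching.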

Regards,
Victor

On 07/04/2014 18:00, Francois Prunayre wrote:

[...]

Hi All,

In New/Edit Metadata modes, the metadata FileIdentifier/uuid field is always generated/retrieved by GN and users cannot edit it (they can in XML mode, but the change has no effect).

As far as I know, this happens in all GN releases since at least 2.6.5.

Currently, a user can use a custom FileIdentifier/uuid via Metadata Insert or Batch Import.

But what about catalogs that have a customized FileIdentifier/uuid format (e.g. “MY:METADATA::TYPE-12345:A-12345:B-12345”) and metadata editors who edit the metadata online, from custom templates?

Is there a simple way to make the FileIdentifier/uuid value editable in current GN releases, or is such a feature scheduled for future releases?

Many Thanks,
Victor

All I can say is that there was some work done to allow this. Sadly, I can’t remember the details now. It might have been a configuration option or something.

Maybe a search through the source code or the mailing lists archive can help you find clues.

Jesse

On Thu, Apr 10, 2014 at 2:33 PM, Victor Sinceac <victor.sinceac@anonymised.com> wrote:

[...]