Hi, another option, if you know that your end users will search by UUIDs
or resource ids, is to use the OR operator on any and another field. The
other field, such as _uuid or identifier, will generally use a
WhitespaceTokenizer or KeywordTokenizer, which preserves special
characters.
The query will then contain any_OR_identifier=value.
This can be done by modifying your search form field.
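
As a rough illustration of the combined parameter, here is a minimal Python sketch that builds such a query string. The function name build_search_query is hypothetical and only the any_OR_identifier parameter name comes from this message; how the resulting string is sent to the search service is left out on purpose.

```python
from urllib.parse import urlencode, parse_qs

def build_search_query(value, fields=("any", "identifier")):
    """Join the field names with _OR_ so one value is searched in any of them.

    Hypothetical helper for illustration; only the any_OR_identifier
    parameter name is taken from the message above.
    """
    param = "_OR_".join(fields)
    # urlencode percent-encodes characters such as ':' in the value.
    return urlencode({param: value})

query = build_search_query("TEST:LUCENE1:TESTLUCENE2")
print(query)
# The value survives a decode round trip unchanged.
print(parse_qs(query))
```

Passing fields=("any", "_uuid") would target the _uuid field instead, assuming the search service accepts that field in the same way.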
Cheers.
Francois
2014-04-07 14:36 GMT+02:00 <Simon.Pigot@anonymised.com>:
Victor,
The behaviour you are seeing is determined by the type of Lucene
tokenizer (which is part of the Lucene analyzer). The analyzer is applied
to both the text to be indexed and the search terms (for consistent
results) - as you observed.
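
To make the point about consistency concrete, here is a toy sketch in plain Python (it does not use Lucene, and the analyze function and TinyIndex class are made up for illustration). Both the indexed text and the search terms go through the same analysis step, so they produce comparable tokens.

```python
from collections import defaultdict

def analyze(text):
    """Toy analyzer (an assumption, not Lucene's): lowercase, split on whitespace."""
    return text.lower().split()

class TinyIndex:
    """Maps each token to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index time: store the analyzed tokens, not the raw text.
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        # Search time: analyze the query the same way, then require every
        # token to be present (similar in spirit to +any:... clauses).
        tokens = analyze(query)
        if not tokens:
            return set()
        hits = [self.postings.get(t, set()) for t in tokens]
        return set.intersection(*hits)

index = TinyIndex()
index.add("doc1", "Sea Surface Temperature")
print(index.search("surface SEA"))   # same analysis on both sides, so it matches
print(index.search("surface-sea"))   # one token that was never indexed, no match
```

If the query were analyzed differently from the indexed text, the tokens would no longer line up and even an exact copy of the original text could fail to match.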
There are a range of analyzers available with Lucene that you can choose
and configure GeoNetwork to apply to your index - see
WEB-INF/config-lucene.xml. You can even write your own and configure
GeoNetwork to use it.
By default, GeoNetwork applies its own Analyzer (with a set of
tokenizers) to many fields (e.g. any) - see WEB-INF/config-lucene.xml again
for which ones. The GeoNetworkAnalyzer uses a StandardTokenizer, an
ASCIIFoldingFilter and a StopFilter (with a configurable list of stop
words) - see the Lucene doco for more details of what these do, as there is
too much to explain here. The main advantage of the GeoNetworkAnalyzer (over
the Lucene StandardAnalyzer) is support for wildcard queries (see the
Javadoc for the GeoNetworkAnalyzer -
web/src/main/java/org/fao/geonet/kernel/search/GeoNetworkAnalyzer.java).
We've found that changing the GeoNetworkAnalyzer to use a
WhitespaceTokenizer (instead of the StandardTokenizer) gives results users
expect to see when searching using terms that contain characters like
apostrophes, slashes etc. The change is easy to make - see
https://github.com/marlin2/core-geonetwork/commit/a6b2830577b29d8096b6efd586bf0b16ee16869c#diff-0d1a0e81fbf55a25e7971a5fd50b5471
for example. (Note: If you apply a recompiled jar containing your modified
GeoNetworkAnalyzer or other changes in config-lucene.xml to an existing
catalogue then you need to reindex from the admin page). I've been thinking
we should apply this to trunk but as yet I haven't had time to fully
explore all the implications of this change, other alternatives or discuss
it with anyone :-).
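
To show why splitting on whitespace only keeps such terms intact, here is a small plain-Python sketch. It does not call Lucene: whitespace_tokens and punctuation_tokens are hypothetical stand-ins, the second one is deliberately cruder than the StandardTokenizer, and the lowercasing is an assumption added so the output is easy to compare.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace only, so ':' and '.' stay inside the token."""
    return [t.lower() for t in text.split()]

def punctuation_tokens(text):
    """Hypothetical tokenizer that also breaks on every non-word character.

    Not the StandardTokenizer; it only illustrates what happens when
    punctuation is treated as a separator.
    """
    return [t.lower() for t in re.split(r"[^\w]+", text) if t]

uuid = "TEST:LUCENE1:TESTLUCENE2:TEST.LUCENE.3::TEST_AFTER_DOTS"
print(whitespace_tokens(uuid))    # a single token, the whole uuid
print(punctuation_tokens(uuid))   # several fragments
```

With the whitespace variant a search for the complete uuid is compared against one complete token, which is the behaviour users tend to expect for identifiers.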
Cheers,
Simon
________________________________________
From: Victor Sinceac [victor.sinceac@anonymised.com]
Sent: Monday, 7 April 2014 8:06 PM
To: geonetwork-devel@lists.sourceforge.net;
geonetwork-users@lists.sourceforge.net
Subject: [GeoNetwork-devel] GeoNetwork: Lucene query tokenized for ANY
Hi all,
I have some trouble getting the expected result when entering the uuid of a
metadata record when the uuid contains one or more of the chars ":" and ".".
I mean for Simple Search, with the uuid entered in the input field
"What?".
Thus, for a metadata record with
UUID="TEST:LUCENE1:TESTLUCENE2:TEST.LUCENE.3::TEST_AFTER_DOTS", when
entering the full UUID in the Keywords field, it is not clear to me what
happens, as both the Indexer and the Searcher tokenize the input string in a
strange manner.
The searcher does this when entering WHAT?=TEST:LUCENE1:TESTLUCENE2
(picked from logs):
* Analyze field any : TEST:LUCENE1:TESTLUCENE2
* Analyzed text is test:lucene1 testlucene2
* Lucene query: ... ... +(+any:test:lucene1 +any:testlucene2) ...
...
The Lucene index also has similar content for the field any (I guess
the same tokenizer is used):
* any=test:lucene1
* any=testlucene2
The Lucene index does not have the full content of such a uuid in the field
any; it is only kept in the field _uuid, but the latter is not used in a Simple
Search (where only the field any is considered). The same behavior happens
for the "." separator (for "TEST.LUCENE.3", for example, the tokenizer
produces two different values for the field ANY: "TEST.LUCENE" and "3").
Moreover, the Lucene index keeps the full content of the uuid in the "any"
index field for uuids with the default format (i.e. uuids generated by
GeoNetwork) like ****-****-****-****
Is this the correct behavior?
Why does Lucene keep the first occurrence of ":" or "." but not the
following occurrences when indexing/searching the field ANY, and why does it
keep the full uuid content in the field ANY when there are no ":" or "."
chars inside?
Many Thanks,
Victor