#336: Lucene / improve configuration
-------------------------+--------------------------------------------------
Reporter: Fxp | Owner: geonetwork-devel@…
Type: enhancement | Status: new
Priority: minor | Milestone:
Component: General | Version: v2.6.1
Keywords: |
-------------------------+--------------------------------------------------
Currently GeoNetwork only has configuration information about Lucene
tokenized fields. Many other parameters are set in the Java code. This could
be improved by externalizing some of them (e.g. RAMBufferSizeMB, the standard
analyzer, per-field analyzers, ...).
Draft configuration file for discussion:
{{{
<?xml version="1.0"?>
<config>
<index>
<!--
The amount of memory to be used for buffering documents in memory.
48MB seems to be plenty for running at least two long indexing jobs
(e.g. importing 20,000 records) while keeping disk activity for
Lucene index writing to a minimum.
-->
<RAMBufferSizeMB>48.0d</RAMBufferSizeMB>
<!-- Determines how often segment indices are merged by
addDocument(). -->
<MergeFactor>10</MergeFactor>
<!-- Default Lucene version to use (mainly for Analyzer
creation). -->
<luceneVersion>29</luceneVersion>
</index>
<!-- Search parameters are applied at search time and do not need
an index rebuild in order to be taken into account. -->
<search>
<!--
By default, Lucene computes the score according to the search criteria,
the corresponding result set and the index content.
For a search with no criteria, Lucene returns the top docs
in index order (because none is more relevant than the others).
To change the score computation, a boost function can
be defined. The boosting query class needs to be on the classpath.
* RecencyBoostingQuery will promote recently modified documents
<boostQuery
name="org.fao.geonet.kernel.search.function.RecencyBoostingQuery">
<Param name="multiplier" type="double"
value="2.0"/>
<Param name="maxDaysAgo" type="int" value="365"/>
<Param name="dayField" type="java.lang.String"
value="_changeDate"/>
</boostQuery>
-->
</search>
<!-- Default analyzer to use for all fields not defined in the
fieldSpecificAnalyzer section.
If not set, GeoNetwork uses its default per-field analyzer
(i.e. fieldSpecificAnalyzer is not taken into account).
Example:
org.apache.lucene.analysis.fr.FrenchAnalyzer
-->
<defaultAnalyzer
name="org.apache.lucene.analysis.standard.StandardAnalyzer">
<Param name="version"
type="org.apache.lucene.util.Version"/>
</defaultAnalyzer>
<!-- TODO: Add a language specific analyzer -->
<!-- Field analyzer
Define here a specific analyzer for each field stored in the index.
For example, using a different analyzer for "any" (i.e. full text search)
could be better than the standard analyzer, which has a particular way of
creating tokens.
For instance, the field value "mission AD-T" is tokenized to
"mission" "ad" & "t" using StandardAnalyzer. A WhitespaceTokenizer
tokenizes it to "mission" "AD-T", which could be better in some
situations. However, the field value "mission AD-34T" is tokenized to
"mission" "ad-34t" using StandardAnalyzer because of the number.
doeleman: UUIDs must be case insensitive, as their parts are
hexadecimal numbers, which are not case sensitive. StandardAnalyzer is
recommended for UUIDs.
A list of analyzers is available at
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Analyzer.html
Common analyzers:
* org.apache.lucene.analysis.standard.StandardAnalyzer
* org.apache.lucene.analysis.WhitespaceAnalyzer
The analyzer must be on the classpath.
-->
<fieldSpecificAnalyzer>
<Field name="_uuid"
analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
<Field name="parentUuid"
analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
<Field name="operatesOn"
analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
<Field name="operatesOnIdentifier"
analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
<Field name="any"
analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer">
<Param name="version"
type="org.apache.lucene.util.Version"/>
<!--<Param name="stopWords" type="java.io.File"
value="/path/to/stopwords/stopwords.txt"/>-->
</Field>
<Field name="subject"
analyzer="org.apache.lucene.analysis.KeywordAnalyzer"/>
</fieldSpecificAnalyzer>
<!-- All Lucene fields that are tokenized must be listed here.
Unfortunately, the Lucene API cannot tell which fields are tokenized
and which are not unless we read documents, and we may not have
an index to read them from. Since most fields are not tokenized, we
keep a list of the tokenized fields here.
-->
<tokenized>
<Field name="any"/>
<Field name="abstract"/>
<Field name="title"/>
<Field name="altTitle"/>
<Field name="inspiretheme"/>
<Field name="keywordType"/>
<Field name="orgName"/>
<Field name="specificationTitle"/>
<Field name="levelName"/>
<!-- from SearchManager/static -->
<Field name="_uuid"/>
<Field name="parentUuid"/>
<Field name="operatesOn"/>
<Field name="subject"/>
</tokenized>
</config>
}}}
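As a rough illustration of how such a file could be consumed, the sketch below reads the `<index>` settings and the `fieldSpecificAnalyzer` mapping with the JDK DOM parser. The class `LuceneConfigSketch` and its method names are hypothetical and are not taken from the patch; only the element and attribute names come from the draft above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the draft configuration; not the patch's implementation. */
class LuceneConfigSketch {
    private final Document doc;

    LuceneConfigSketch(String xml) {
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse Lucene configuration", e);
        }
    }

    /** Text of the first element with the given tag name, or null if absent. */
    private String text(String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    /** Double.parseDouble accepts a trailing 'd', so "48.0d" parses as 48.0. */
    double ramBufferSizeMB(double fallback) {
        String value = text("RAMBufferSizeMB");
        return value == null ? fallback : Double.parseDouble(value);
    }

    int mergeFactor(int fallback) {
        String value = text("MergeFactor");
        return value == null ? fallback : Integer.parseInt(value);
    }

    /** Field name -> analyzer class name, read from fieldSpecificAnalyzer/Field only. */
    Map<String, String> fieldAnalyzers() {
        Map<String, String> result = new LinkedHashMap<>();
        NodeList sections = doc.getElementsByTagName("fieldSpecificAnalyzer");
        if (sections.getLength() == 0) {
            return result;
        }
        NodeList fields = ((Element) sections.item(0)).getElementsByTagName("Field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            result.put(field.getAttribute("name"), field.getAttribute("analyzer"));
        }
        return result;
    }
}
```

The fallback arguments let the caller keep the values currently hard-coded in the Java code when an element is missing from the file.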
The proposed patch has the same level of functionality as the current version, plus:
* !RecencyBoostingQuery (disabled by default): to promote newly added
records
* Reload configuration option: to reload a modified Lucene configuration
file (no restart required)
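To make the effect of the `multiplier` and `maxDaysAgo` parameters in the draft more concrete, here is a minimal sketch of one possible recency boost. The linear decay formula and the class `RecencyBoostSketch` are assumptions for discussion; they are not confirmed to match what !RecencyBoostingQuery in the patch computes.

```java
/** Hypothetical helper; the linear decay is an assumption, not taken from the patch. */
final class RecencyBoostSketch {
    private final double multiplier;
    private final int maxDaysAgo;

    RecencyBoostSketch(double multiplier, int maxDaysAgo) {
        if (maxDaysAgo <= 0) {
            throw new IllegalArgumentException("maxDaysAgo must be positive");
        }
        this.multiplier = multiplier;
        this.maxDaysAgo = maxDaysAgo;
    }

    /**
     * Score for a document whose date field is daysAgo days old.
     * Inside the window the boost shrinks linearly from multiplier (today)
     * towards 0 (maxDaysAgo); outside the window the score is unchanged.
     */
    double score(double baseScore, int daysAgo) {
        if (daysAgo < 0 || daysAgo >= maxDaysAgo) {
            return baseScore;
        }
        double boost = multiplier * (maxDaysAgo - daysAgo) / maxDaysAgo;
        return baseScore * (1.0 + boost);
    }
}
```

With the values from the draft (`multiplier` 2.0, `maxDaysAgo` 365), a record changed today would score three times its base score under this assumption, and a record older than a year would keep its base score.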
Review and comments welcome.
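For the reload option listed above, one way to apply a modified file without a restart is to keep the parsed settings behind an atomic reference and swap them on request. This is only a sketch of the idea; `ReloadableConfig` and its `loader` are hypothetical names and do not describe how the patch does it.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical holder: readers see the current settings, reload() swaps them atomically. */
final class ReloadableConfig<T> {
    private final Supplier<T> loader;
    private final AtomicReference<T> current;

    ReloadableConfig(Supplier<T> loader) {
        this.loader = loader;
        this.current = new AtomicReference<>(loader.get());
    }

    /** Settings as of the last successful load. */
    T get() {
        return current.get();
    }

    /** Re-runs the loader; if it throws, the previous settings stay active. */
    T reload() {
        T fresh = loader.get();
        current.set(fresh);
        return fresh;
    }
}
```

Search-time parameters could be picked up this way immediately, while changes to index-time settings such as analyzers would presumably still require an index rebuild.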
--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/336>
GeoNetwork opensource Developer website <http://trac.osgeo.org/geonetwork>