[GeoNetwork-devel] [GeoNetwork opensource Developer website] #336: Lucene / improve configuration

#336: Lucene / improve configuration
-------------------------+--------------------------------------------------
Reporter: Fxp | Owner: geonetwork-devel@…
     Type: enhancement | Status: new
Priority: minor | Milestone:
Component: General | Version: v2.6.1
Keywords: |
-------------------------+--------------------------------------------------
Currently GeoNetwork only has configuration information about Lucene
tokenized fields. Many other parameters are hard-coded in the Java code.
This could be improved by externalizing some of them (e.g. RAMBufferSizeMB,
the standard analyzer, per-field analyzers, ...).
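
As a rough illustration of what externalizing could look like on the Java side, the sketch below reads two index settings and the list of tokenized field names from an XML string with the JDK's DOM parser. The class name `LuceneConfigSketch` and the fallback values are assumptions made for this example, and the element names are borrowed from the draft configuration in this ticket; this is not the patch's actual code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical holder for values read from a Lucene configuration file. */
final class LuceneConfigSketch {
    final double ramBufferSizeMB;
    final int mergeFactor;
    final Set<String> tokenizedFields = new LinkedHashSet<>();

    private LuceneConfigSketch(double ramBufferSizeMB, int mergeFactor) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        this.mergeFactor = mergeFactor;
    }

    /** Parses the XML text; element names follow the draft in this ticket. */
    static LuceneConfigSketch parse(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid configuration XML", e);
        }
        Element root = doc.getDocumentElement();

        // Double.parseDouble accepts the trailing "d" used in the draft ("48.0d").
        // The fallbacks (48.0 and 10) are assumptions copied from the draft values.
        double ram = Double.parseDouble(text(root, "RAMBufferSizeMB", "48.0"));
        int merge = Integer.parseInt(text(root, "MergeFactor", "10"));
        LuceneConfigSketch config = new LuceneConfigSketch(ram, merge);

        // Only <Field> elements inside <tokenized> are collected, so the
        // <Field> entries of <fieldSpecificAnalyzer> are ignored here.
        NodeList sections = root.getElementsByTagName("tokenized");
        if (sections.getLength() > 0) {
            NodeList fields = ((Element) sections.item(0)).getElementsByTagName("Field");
            for (int i = 0; i < fields.getLength(); i++) {
                config.tokenizedFields.add(((Element) fields.item(i)).getAttribute("name"));
            }
        }
        return config;
    }

    /** Returns the trimmed text of the first matching element, or the fallback. */
    private static String text(Element root, String tag, String fallback) {
        NodeList nodes = root.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? fallback : nodes.item(0).getTextContent().trim();
    }
}
```

The parsed values would then be handed to whatever code creates the index writer, instead of being constants in that code.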

Draft of configuration file for discussion:
{{{
<?xml version="1.0"?>
<config>
         <index>
                 <!--
                         The amount of memory to be used for buffering documents in memory.
                         48MB seems to be plenty for running at least two long indexing jobs
                         (e.g. importing 20,000 records) and keeping disk activity for
                         Lucene index writing to a minimum.
                 -->
                 <RAMBufferSizeMB>48.0d</RAMBufferSizeMB>

                 <!-- Determines how often segment indices are merged by addDocument(). -->
                 <MergeFactor>10</MergeFactor>

                 <!-- Default Lucene version to use (mainly for Analyzer creation). -->
                 <luceneVersion>29</luceneVersion>
         </index>

         <!-- Search parameters are applied at search time and do not need
              an index rebuild in order to be taken into account. -->
         <search>
                 <!--
                         By default, Lucene computes the score from the search criteria,
                         the corresponding result set and the index content.
                         For a search with no criteria, Lucene returns the top docs
                         in index order (because none is more relevant than the others).

                         To change the score computation, a boost function can be
                         defined. The boosting query class must be on the classpath.
                         * RecencyBoostingQuery promotes recently modified documents

                         <boostQuery name="org.fao.geonet.kernel.search.function.RecencyBoostingQuery">
                                 <Param name="multiplier" type="double" value="2.0"/>
                                 <Param name="maxDaysAgo" type="int" value="365"/>
                                 <Param name="dayField" type="java.lang.String" value="_changeDate"/>
                         </boostQuery>
                 -->
         </search>

         <!-- Default analyzer to use for all fields not defined in the
                 fieldSpecificAnalyzer section.
                 If not set, GeoNetwork uses a default per-field analyzer
                 (i.e. fieldSpecificAnalyzer is not taken into account).

                 Example:
                 org.apache.lucene.analysis.fr.FrenchAnalyzer
         -->
         <defaultAnalyzer name="org.apache.lucene.analysis.standard.StandardAnalyzer">
                 <Param name="version" type="org.apache.lucene.util.Version"/>
         </defaultAnalyzer>

         <!-- TODO: Add a language specific analyzer -->

         <!-- Field analyzer
                 Define here a specific analyzer for each field stored in the index.

                 For example, a different analyzer for "any" (i.e. full-text search)
                 could be better than a standard analyzer, which has a particular
                 way of creating tokens.

                 In that situation, the field value "mission AD-T" is tokenized to
                 "mission", "ad" and "t" by StandardAnalyzer. A WhitespaceTokenizer
                 tokenizes it to "mission" and "AD-T", which could be better in some
                 situations. But the field value "mission AD-34T" is tokenized to
                 "mission" and "ad-34t" by StandardAnalyzer because of the number.

                 doeleman: UUIDs must be case insensitive, as their parts are
                 hexadecimal numbers, which are not case sensitive.
                 StandardAnalyzer is recommended for UUIDs.

                 A list of analyzers is available at
                 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Analyzer.html
                 Common analyzers:
                 * org.apache.lucene.analysis.standard.StandardAnalyzer
                 * org.apache.lucene.analysis.WhitespaceAnalyzer
                 The analyzer must be on the classpath.

         -->
         <fieldSpecificAnalyzer>
                 <Field name="_uuid" analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
                 <Field name="parentUuid" analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
                 <Field name="operatesOn" analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>
                 <Field name="operatesOnIdentifier" analyzer="org.apache.lucene.analysis.SimpleAnalyzer"/>

                 <Field name="any" analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer">
                         <Param name="version" type="org.apache.lucene.util.Version"/>
                         <!--<Param name="stopWords" type="java.io.File" value="/path/to/stopwords/stopwords.txt"/>-->
                 </Field>
                 <Field name="subject" analyzer="org.apache.lucene.analysis.KeywordAnalyzer"/>
         </fieldSpecificAnalyzer>

         <!-- All Lucene fields that are tokenized must be listed here because,
                 unfortunately, the Lucene API cannot tell which fields are
                 tokenized and which are not unless we read documents, and we
                 may not have an index to read from. Since most fields are not
                 tokenized, we keep a list of the tokenized fields here.
         -->
         <tokenized>
                 <Field name="any"/>
                 <Field name="abstract"/>
                 <Field name="title"/>
                 <Field name="altTitle"/>
                 <Field name="inspiretheme"/>
                 <Field name="keywordType"/>
                 <Field name="orgName"/>
                 <Field name="specificationTitle"/>
                 <Field name="levelName"/>
                 <!-- from SearchManager/static -->
                 <Field name="_uuid"/>
                 <Field name="parentUuid"/>
                 <Field name="operatesOn"/>
                 <Field name="subject"/>
         </tokenized>

</config>
}}}
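
To make the tokenization remarks in the field analyzer comment more concrete, here is a small Lucene-free sketch. It approximates two of the analyzers named in the draft with plain string operations: splitting on whitespace only (what WhitespaceAnalyzer does) and splitting on non-letters plus lowercasing (what SimpleAnalyzer does). `TokenizerSketch` and its method names are made up for this example, and StandardAnalyzer's grammar is more involved and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical approximations of two simple tokenization strategies. */
final class TokenizerSketch {

    /** Splits on runs of whitespace and keeps case and punctuation. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Splits on anything that is not a letter, then lowercases each token. */
    static List<String> letterLowercaseTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

For the value `mission AD-T`, the whitespace variant yields `mission` and `AD-T`, while the letter-based variant yields `mission`, `ad` and `t`, so a search for the exact code `AD-T` only matches as one token with the former.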

The proposed patch has the same level of functionality as the current
version, plus:
  * !RecencyBoostingQuery (disabled by default): to promote newly added
    records
  * Reload configuration option: to reload a modified Lucene configuration
    file (no restart required)
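
The ticket does not spell out how the recency boost is computed. One plausible reading of the `multiplier` and `maxDaysAgo` parameters from the draft is a linear decay: a record changed today gets the full boost, and the boost shrinks to nothing at `maxDaysAgo`. The sketch below shows only that arithmetic, without Lucene; the class `RecencyBoostSketch`, the formula and the handling of future dates are assumptions, and the real class may behave differently.

```java
/** Hypothetical, Lucene-free illustration of a linear recency boost. */
final class RecencyBoostSketch {
    private final double multiplier;
    private final int maxDaysAgo;

    RecencyBoostSketch(double multiplier, int maxDaysAgo) {
        if (maxDaysAgo <= 0) {
            throw new IllegalArgumentException("maxDaysAgo must be positive");
        }
        this.multiplier = multiplier;
        this.maxDaysAgo = maxDaysAgo;
    }

    /** Returns the score for a document last changed daysAgo days ago. */
    double boostedScore(double baseScore, int daysAgo) {
        // Assumption: records older than the window, or dated in the
        // future, keep their original score.
        if (daysAgo < 0 || daysAgo >= maxDaysAgo) {
            return baseScore;
        }
        // Linear decay from `multiplier` (today) down to 0 (at maxDaysAgo).
        double boost = multiplier * (maxDaysAgo - daysAgo) / (double) maxDaysAgo;
        return baseScore * (1.0 + boost);
    }
}
```

Under these assumptions, with `multiplier` 2.0 and `maxDaysAgo` 365 as in the draft, a record changed today has its score tripled and a record older than a year is left untouched.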

Review and comments are welcome.

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/336>
GeoNetwork opensource Developer website <http://trac.osgeo.org/geonetwork>
GeoNetwork opensource is a standards based, Free and Open Source catalog application to manage spatially referenced resources through the web. It provides powerful metadata editing and search functions as well as an embedded interactive web map viewer. This website contains information related to the development of the software.

Comment(by Fxp):

Such a configuration file could help to add and configure new analyzers,
like the recently added (and now default) !GeoNetworkAnalyzer:
http://osgeo-org.1803224.n2.nabble.com/SF-net-SVN-geonetwork-6615-trunk-web-src-main-java-org-fao-geonet-kernel-search-td5592349.html#a5592349
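
One way such a file can make analyzers pluggable is to create them reflectively from a class name and typed parameters, in the spirit of the `name`, `type` and `value` attributes of the draft. The sketch below is Lucene-free and uses JDK classes as stand-ins. `ReflectiveFactorySketch` and its methods are hypothetical; it handles only `int`, `double` and `java.lang.String` parameters, and it does not cover a `Param` without a `value`, such as `version`.

```java
import java.lang.reflect.Constructor;

/** Hypothetical factory that builds an object from configured names. */
final class ReflectiveFactorySketch {

    /** Maps a type name from a Param element to a Class. */
    static Class<?> resolveType(String typeName) throws ClassNotFoundException {
        switch (typeName) {
            case "int":
                return int.class;
            case "double":
                return double.class;
            default:
                return Class.forName(typeName);
        }
    }

    /** Converts the textual value to an instance of the declared type. */
    static Object convert(Class<?> type, String value) {
        if (type == int.class) {
            return Integer.valueOf(value);
        }
        if (type == double.class) {
            return Double.valueOf(value);
        }
        if (type == String.class) {
            return value;
        }
        throw new IllegalArgumentException("Unsupported parameter type: " + type.getName());
    }

    /** Finds the public constructor matching the types and invokes it. */
    static Object instantiate(String className, String[] typeNames, String[] values) {
        try {
            Class<?>[] types = new Class<?>[typeNames.length];
            Object[] args = new Object[typeNames.length];
            for (int i = 0; i < typeNames.length; i++) {
                types[i] = resolveType(typeNames[i]);
                args[i] = convert(types[i], values[i]);
            }
            Constructor<?> ctor = Class.forName(className).getConstructor(types);
            return ctor.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create " + className, e);
        }
    }
}
```

For instance, `instantiate("java.lang.StringBuilder", new String[] {"java.lang.String"}, new String[] {"abc"})` returns a `StringBuilder` holding `abc`; an analyzer class on the classpath could be created the same way, so switching analyzers would only need a configuration change.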

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/336#comment:1>

Comment(by Fxp):

Add an option for scoring (see #341).

{{{

  <search>
         <!-- Score parameters. Setting these parameters to true affects performance. -->
         <!-- Set trackDocScores to true if the score needs to be displayed in
                 results using the geonet:info/score element -->
         <trackDocScores>false</trackDocScores>
         <trackMaxScore>false</trackMaxScore>

         <!-- Not used because no Scorer defined -->
         <docsScoredInOrder>false</docsScoredInOrder>
  </search>

}}}

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/336#comment:2>

Comment(by Fxp):

r6639

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/336#comment:3>

#336: Lucene / improve configuration
--------------------------+-------------------------------------------------
  Reporter: Fxp | Owner: geonetwork-devel@…
      Type: enhancement | Status: closed
  Priority: minor | Milestone:
Component: General | Version: v2.6.1
Resolution: fixed | Keywords:
--------------------------+-------------------------------------------------
Changes (by Fxp):

  * status: new => closed
  * resolution: => fixed

--
Ticket URL: <http://trac.osgeo.org/geonetwork/ticket/336#comment:4>