[GeoNetwork-devel] Fuzzy search on Lucene indexed keywords

Hello,

A client recently asked us whether it is possible to implement a "fuzzy"
search in GeoNetwork. Since I am fairly new to GeoNetwork
development, I don't know whether any work on this has already been
done in some branch. The idea would be to implement a
case-insensitive and accent-insensitive search. In
addition, our client wants the ability to search on parts
of words.

For example, the search "bâti" should return all metadata records
with any of the following keywords: "Bati", "batî", "batiment", ...

Does anyone know whether something along these lines has been tried in
one of the branches, or have any pointers on where I should look to
implement this? I have heard of the "token" parameter in Lucene's
configuration files (to allow "incomplete keyword" searches), but I
don't know much about it.

Thanks in advance for the answers,

--
Pierre Mauduit

Camptocamp France SAS
Savoie Technolac, BP 352
73377 Le Bourget du Lac Cedex
Tel : + 33 (0)4 79 44 44 92
http://www.camptocamp.com
pierre.mauduit@anonymised.com

Hello Pierre,

GeoNetwork already implements fuzzy search: in the advanced search parameters you can set the fuzziness level. The default is no fuzziness. You could easily change the default to your desired level of fuzziness in your implementation.
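
To make the idea of a fuzziness level concrete, here is a minimal Python sketch (not GeoNetwork or Lucene code) that scores keywords by edit distance and keeps those above a threshold. The names `edit_distance`, `similarity`, `fuzzy_match` and `min_similarity` are made up for illustration, and Lucene's own scoring formula may differ from the one assumed here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the edit distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_match(query: str, keywords: list, min_similarity: float) -> list:
    """Keep the keywords whose score reaches the (hypothetical) threshold."""
    return [k for k in keywords if similarity(query, k) >= min_similarity]
```

With a threshold of 0.7, the query "bati" keeps "bâti" (one substitution out of four characters, score 0.75) but drops "batiment" (score 0.5), which is why fuzziness alone does not cover the parts-of-word requirement.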

I think the search is already case-insensitive.

To abstract over accented characters, the Lucene ISOLatin1AccentFilter must be applied to both the indexed values and the search values. I’m sure this has been done in some branches, but I’m not sure whether it is currently in the trunk or the GN2.4.x branches.
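
The point about applying the same filter on both sides can be sketched in a few lines of Python. This is only an approximation of what an accent filter does: it uses the standard `unicodedata` module rather than Lucene, and the `fold` helper and the dictionary-based index are hypothetical stand-ins.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase the text and strip combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


# Index side: store every keyword under its folded form.
keywords = ["Bati", "batî", "Route"]
index = {}
for keyword in keywords:
    index.setdefault(fold(keyword), []).append(keyword)

# Search side: fold the query with the same function before the lookup.
matches = index.get(fold("bâti"), [])
print(matches)
```

If the folding were applied only at indexing time, the query "bâti" would be looked up unchanged and would find nothing, since the index holds only "bati".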

To also support your parts-of-word search, it is sufficient to change the search logic to append a wildcard to each search term.
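
A rough Python sketch of that change, assuming the user input is a plain whitespace-separated list of terms and that a trailing `*` is the wildcard (as in Lucene's query syntax). The helper names are hypothetical; escaping of special query characters is left out.

```python
def add_trailing_wildcards(user_input: str) -> str:
    """Turn 'bati route' into 'bati* route*'."""
    terms = user_input.split()
    return " ".join(t if t.endswith("*") else t + "*" for t in terms)


def prefix_matches(term: str, keywords: list) -> list:
    """Emulate what a trailing-wildcard term selects: keywords with that prefix."""
    prefix = term.rstrip("*")
    return [k for k in keywords if k.startswith(prefix)]


query = add_trailing_wildcards("bati route")
print(query)
print(prefix_matches("bati*", ["bati", "batiment", "route"]))
```

One thing to check in the Lucene version in use: wildcard terms are typically not passed through the analyzer by the query parser, so the accent and case normalization may need to be applied to each term before the wildcard is appended.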

Hope this helps,
Heikki Doeleman

In reply to the message from Pierre Mauduit <pierre.mauduit@anonymised.com> of Thu, Mar 11, 2010 at 8:46 PM.

GeoNetwork-devel mailing list
GeoNetwork-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
GeoNetwork OpenSource is maintained at http://sourceforge.net/projects/geonetwork

Hi,

Thanks for the fast answer.

> GeoNetwork already implements fuzzy search: in the advanced search
> parameters you can set fuzziness level. The default is not fuzzy. You could
> easily set the default to your desired level of fuzziness in your
> implementation.
>
> I think the search is already case-insensitive.

In fact, the GeoNetwork used in this project is a fairly old "fork"
of an old GeoNetwork version (roughly adapted to the client's needs
and very specific architecture), and I have not seen any "fuzzy
search" option in its interface. I will look at a recent trunk for
inspiration ;) As far as I remember, the client reported that the
search was case-sensitive, so that needs to be tested.

> To abstract over accentuated characters, the Lucene ISOLatin1AccentFilter
> must be applied to both indexing and search values. I'm sure this has been
> done in some branches but I'm not sure that is currently in the trunk or
> GN2.4.x branches.
>
> To also fulfill your parts-of-word search, it is sufficient to change the
> search logic to include a wildcard at the end of each search term.

Okay, I will look into this.

> Hope this helps,

Thanks again for these hints,

--
Pierre Mauduit
