[GeoNetwork-devel] GeoNetwork fuzzy search

Hi Francois,
As you can see from the exchange below, the fuzzy search is looked at with great interest. Did you do any performance testing?
Ciao,
Jeroen

Begin forwarded message:

On the fuzzy search, I wrote the following short note to the developers' list last May.

On the fuzzy search, I wonder if there are performance issues to be considered; in general any partial match search that does not work on a prefix tends to give some performance penalty, both in terms of space (in this case, number of generated clauses?) and query resolution time.
It would be worth doing some tests before adding fuzzy search.

Has anyone tested the performance of the fuzzy search since then?
I am raising this again because I remember that Stefano, at some point and entirely outside GeoNetwork, needed fuzzy search features for a big database he was working on, and he reported to me that searches became much slower; but maybe I don't remember well, since this happened more than a year ago.
Regards from Vienna,

Roberto

On 12 Oct 2006, at 14:41, Andrea Carboni wrote:

Dear Patrizia,

This problem will be fixed automatically as soon as our French colleagues implement
fuzzy search in GeoNetwork.

Cheers,
Andrea

Hi,

I just want to add to the email below that the same problem occurs with a
dash, as in Sub-Sahara.

Again thanks a lot,
Patrizia

Hi Jeroen, I'm sure that a fuzzy query will be slower than a TermQuery. "A fuzzy search needs to calculate the (Levenshtein) edit distance between the search term and all terms in the index, which is an expensive operation". Usually, the closer the similarity is to 0, the longer the query takes. There are not many comments about this on the Lucene forum.

I set the default value to .8, which tolerates roughly one accent or one special character in a word (e.g. TERMQUERY (rivière) = FUZZYQUERY (riviere~.8) ~= FUZZYQUERY (rivere~.6)). We will have to set a similarity close to .5 (Lucene's default value) if we expect to handle typos. Fuzzy search will also improve searching for plurals (TERMQUERY (nitrate) <> TERMQUERY (nitrates) ~= FUZZYQUERY (nitrate~.8)).
Maybe GeoNetwork users will have to analyse what their users type in the search interface in order to set the default similarity (it will depend on the language, on whether the search uses one or more words, etc.).
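To make the thresholds above concrete, here is a minimal Python sketch of the edit distance and a similarity score. It assumes the score is 1 minus the edit distance divided by the length of the shorter word, which is how I understand the Lucene versions of that time to normalise it; treat that formula, the `>=` comparison, and all function names as assumptions for illustration, not as GeoNetwork or Lucene code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(query: str, term: str) -> float:
    """Assumed normalisation: 1 - distance / length of the shorter word."""
    shortest = min(len(query), len(term))
    if shortest == 0:
        return 1.0 if query == term else 0.0
    return 1.0 - levenshtein(query, term) / shortest


def fuzzy_matches(query: str, index_terms: list[str], min_similarity: float) -> list[str]:
    """Scan every indexed term, which is why fuzzy search costs more than a term lookup."""
    return [t for t in index_terms if similarity(query, t) >= min_similarity]


RIVIERE_ACCENTED = "rivi\u00e8re"  # "rivière" written with an escape to avoid encoding surprises

print(similarity("riviere", RIVIERE_ACCENTED))  # one substitution over 7 letters
print(similarity("rivere", RIVIERE_ACCENTED))   # two edits over 6 letters
print(fuzzy_matches("nitrate", ["nitrate", "nitrates", "nitrogen"], 0.8))
```

Under these assumptions "riviere" scores about 0.86 against "rivière" and so passes at .8, while "rivere" scores about 0.67 and only passes at .6, which agrees with the examples given in the message.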

I didn't do much performance testing; I just played a bit with the ApacheBench utility: 50 searches for "riviere" with
TERMQUERY  : Time per request: 2782.456 [ms] (mean)
FUZZYQUERY : Time per request: 3045.039 [ms] (mean)
These are full HTTP request times, not just the search time, measured with 500 metadata records on a dev server.
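As a rough reading of these two figures, the following Python sketch computes the overhead and compares it with the spread of the measurements. The means, standard deviations (the "Total" row) and medians are copied from the two ApacheBench reports in this message; the comparison against a standard error is only an illustrative sanity check, not a proper benchmark analysis.

```python
import math

N = 50  # requests per run, from "ab -n 50"

# Values copied from the two ApacheBench reports (milliseconds).
term = {"mean": 2782.456, "sd": 1683.4, "median": 2183}
fuzzy = {"mean": 3045.039, "sd": 2605.6, "median": 2143}


def standard_error(sd: float, n: int) -> float:
    """Standard error of a mean estimated from n samples."""
    return sd / math.sqrt(n)


diff_ms = fuzzy["mean"] - term["mean"]
overhead_pct = 100.0 * diff_ms / term["mean"]

# Standard error of the difference between two independent means.
se_diff = math.hypot(standard_error(term["sd"], N), standard_error(fuzzy["sd"], N))

print(f"mean difference: {diff_ms:.1f} ms ({overhead_pct:.1f}% slower)")
print(f"standard error of that difference: {se_diff:.1f} ms")
print(f"median difference: {fuzzy['median'] - term['median']} ms")
```

The mean difference is about 263 ms (roughly 9%), which is smaller than the standard error of the difference, and the median of the fuzzy run is actually slightly lower. With only 50 requests and full HTTP times, these runs cannot separate the cost of the fuzzy query from network and rendering noise.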

If everybody agrees, I'll try to commit the code by the end of this week so that everybody can test it. We will be able to turn off the fuzzy query easily on the main page by setting the similarity to 1 and removing the slider from the advanced search.
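The on/off behaviour described here could look roughly like the Python sketch below: a similarity of 1 produces a plain term query, anything lower appends Lucene's `~` fuzzy operator to each word. The function name and the per-word handling are hypothetical and are not taken from the code to be committed.

```python
def build_lucene_query(text: str, similarity: float) -> str:
    """Build a Lucene query-parser string for the words in `text`.

    Hypothetical helper: similarity 1 means exact term search (fuzzy off),
    lower values append the fuzzy operator, e.g. "riviere~0.8".
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    words = text.split()
    if similarity >= 1.0:
        return " ".join(words)  # fuzzy search disabled
    return " ".join(f"{word}~{similarity:g}" for word in words)


print(build_lucene_query("riviere", 1.0))
print(build_lucene_query("riviere", 0.8))
```

Keeping the switch in a single parameter means the main page can pass a fixed value while the advanced search passes the slider value.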

Cheers. Francois

ab -n 50 -c 1 "http://dev.sandre.eaufrance.fr/geonetworkalpha/srv/en/main.search?extended=off&remote=off&attrset=geo&any=rivi%C3%A8re&similarity=1&hitsPerPage=10"
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Benchmarking dev.sandre.eaufrance.fr (be patient)…done
Server Software: Jetty/5.1.5rc2
Server Hostname: dev.sandre.eaufrance.fr
Server Port: 80

Document Path: /geonetworkalpha/srv/en/main.search?extended=off&remote=off&attrset=geo&any=rivi%C3%A8re&similarity=1&hitsPerPage=10
Document Length: 20001 bytes

Concurrency Level: 1
Time taken for tests: 139.122823 seconds
Complete requests: 50
Failed requests: 0
Write errors: 0
Total transferred: 1015738 bytes
HTML transferred: 1000050 bytes
Requests per second: 0.36 [#/sec] (mean)
Time per request: 2782.456 [ms] (mean)
Time per request: 2782.456 [ms] (mean, across all concurrent requests)
Transfer rate: 7.12 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  305  818.2     80   3955
Processing:  1081 2476 1297.6   2032   6229
Waiting:      991 1491  315.3   1442   2423
Total:       1091 2781 1683.4   2183   9814

Percentage of the requests served within a certain time (ms)
50% 2183
66% 2733
75% 3505
80% 4055
90% 5648
95% 6028
98% 9814
99% 9814
100% 9814 (longest request)

ab -n 50 -c 1 "http://dev.sandre.eaufrance.fr/geonetworkalpha/srv/en/main.search?extended=off&remote=off&attrset=geo&any=rivi%C3%A8re&similarity=.6&hitsPerPage=10"
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Benchmarking dev.sandre.eaufrance.fr (be patient)…done
Server Software: Jetty/5.1.5rc2
Server Hostname: dev.sandre.eaufrance.fr
Server Port: 80

Document Path: /geonetworkalpha/srv/en/main.search?extended=off&remote=off&attrset=geo&any=rivi%C3%A8re&similarity=.6&hitsPerPage=10
Document Length: 20282 bytes

Concurrency Level: 1
Time taken for tests: 152.251964 seconds
Complete requests: 50
Failed requests: 0
Write errors: 0
Total transferred: 1029792 bytes
HTML transferred: 1014100 bytes
Requests per second: 0.33 [#/sec] (mean)
Time per request: 3045.039 [ms] (mean)
Time per request: 3045.039 [ms] (mean, across all concurrent requests)
Transfer rate: 6.60 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  181  449.1     60   3134
Processing:  1151 2863 2398.8   1962  12157
Waiting:     1111 1722 1113.6   1372   7470
Total:       1161 3044 2605.6   2143  12157

Percentage of the requests served within a certain time (ms)
50% 2143
66% 2814
75% 3895
80% 4146
90% 5928
95% 11286
98% 12157
99% 12157
100% 12157 (longest request)


From: geonetwork-devel-bounces@lists.sourceforge.net [mailto:geonetwork-devel-bounces@lists.sourceforge.net] On behalf of Jeroen Ticheler
Sent: Thursday, 12 October 2006 16:32
To: geonetwork-devel@lists.sourceforge.net
Subject: [GeoNetwork-devel] GeoNetwork fuzzy search

Hi Francois,
As you can see from the exchange below, the fuzzy search is looked at with great interest. Did you do any performance testing?
Ciao,
Jeroen

This message has been checked by MailScanner for viruses and spam, and nothing suspicious was found.

Any data and information contained in this electronic mail is personal, confidential and private. Any total or partial publication, use or distribution must be authorized.

I agree. With a similarity slider set to 0.8.

Cheers,
Andrea

Francois, thanks a lot for that feedback! Looking forward to seeing the code contribution!
Ciao,
Jeroen

On Oct 12, 2006, at 9:01 PM, Andrea Carboni wrote:

I agree. With a similarity slider set to 0.8.

Cheers,
Andrea

_______________________________________________
GeoNetwork-devel mailing list
GeoNetwork-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
GeoNetwork OpenSource is maintained at http://sourceforge.net/projects/geonetwork