I agree. With a similarity slider set to 0.8.
Hi Jeroen, I'm sure that a fuzzy query will be slower than a TermQuery. "A
fuzzy search needs to calculate the (Levenshtein) edit distance between the
search term and all terms in the index, which is an expensive operation".
Usually, the closer the similarity is to 0, the longer the query takes.
There are not many comments about this on the Lucene forum.
I set the default value to .8, which lets a search ignore roughly one
accent or one special character in a word (e.g. TERMQUERY (rivière) =
FUZZYQUERY (riviere~.8) ~= FUZZYQUERY (rivere~.6)). We will have to set a
similarity close to .5 (Lucene's default value) to expect to deal with
typos ... Fuzzy search will also improve searches for plurals (TERMQUERY
(nitrate) <> TERMQUERY (nitrates) ~= FUZZYQUERY (nitrate~.8)) ...
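To make the thresholds above concrete, here is a minimal Python sketch (not
the Lucene code itself). It assumes the similarity is computed as
1 - distance / length of the shorter word, which is my understanding of how
the Lucene fuzzy query scores terms; the function names and the ">="
comparison are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute ca by cb
        previous = current
    return previous[-1]


def similarity(query: str, term: str) -> float:
    """Assumed scoring: 1 - distance / length of the shorter word."""
    shortest = min(len(query), len(term))
    if shortest == 0:
        return 1.0 if query == term else 0.0
    return 1.0 - levenshtein(query, term) / shortest


def fuzzy_match(query: str, term: str, minimum_similarity: float) -> bool:
    """Assumption: a term matches when its similarity reaches the threshold."""
    return similarity(query, term) >= minimum_similarity


ACCENTED = "rivi\u00e8re"  # "rivière" written with a single precomposed è

print(similarity("riviere", ACCENTED))      # distance 1, so 1 - 1/7
print(similarity("rivere", ACCENTED))       # distance 2, so 1 - 2/6
print(similarity("nitrate", "nitrates"))    # distance 1, so 1 - 1/7
print(fuzzy_match("rivere", ACCENTED, 0.8)) # below the .8 threshold
```

Under that assumption, "riviere" and "nitrate" pass at .8, while "rivere"
only matches "rivière" once the threshold is lowered to about .6, which is
consistent with the examples above.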
Maybe GeoNetwork users will have to analyse what people are typing in the
search interface in order to set the default similarity (it will depend on
the language, the type of search (one or more words), etc.).
I didn't do a lot of performance testing; I just played a bit with the
Apache Bench utility ... 50 searches for "riviere" with
TERMQUERY (similarity=1) : Time per request: 2782.456 [ms] (mean)
FUZZYQUERY (similarity=.6) : Time per request: 3045.039 [ms] (mean)
These are total HTTP times, not only the search time! Measured with 500
metadata records on a dev server.
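For a quick read of these numbers, a small Python sketch of the arithmetic
follows. The means and medians are copied from the two ab runs in this
mail; the helper name is only illustrative.

```python
def relative_overhead(baseline_ms: float, candidate_ms: float) -> float:
    """Fraction by which candidate is slower (positive) or faster (negative)."""
    return (candidate_ms - baseline_ms) / baseline_ms


# Values taken from the two ab runs (similarity=1 versus similarity=.6).
term_mean_ms, fuzzy_mean_ms = 2782.456, 3045.039
term_median_ms, fuzzy_median_ms = 2183, 2143

mean_overhead = relative_overhead(term_mean_ms, fuzzy_mean_ms)
median_overhead = relative_overhead(term_median_ms, fuzzy_median_ms)

print(f"mean:   {mean_overhead:+.1%}")
print(f"median: {median_overhead:+.1%}")
```

The mean is roughly 9% higher for the fuzzy run, but the median is slightly
lower, and the standard deviations reported by ab (about 1.7 s and 2.6 s on
the total time) are large compared with the difference. With only 50
requests per run, this gap may well be within the noise.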
If everybody agrees, I'll try to commit the code by the end of this week so
everybody can test it. We will be able to easily turn off the fuzzy query
on the main page by setting the similarity to 1 and removing the slider
from the advanced search.
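The on/off behaviour could look something like the Python sketch below. It
is only an illustration, not the code to be committed: build_query is a
hypothetical helper, and the only thing borrowed from Lucene is the
"term~similarity" query-parser syntax.

```python
def build_query(term: str, similarity: float) -> str:
    """Return a Lucene-style query string for one search term.

    Hypothetical helper: a similarity of 1 means "exact match only", so a
    plain term query is used; anything lower becomes a fuzzy query.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    if similarity == 1.0:
        return term                      # fuzzy search switched off
    return f"{term}~{similarity:g}"      # e.g. riviere~0.8


print(build_query("riviere", 1.0))
print(build_query("riviere", 0.8))
```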
Cheers. Francois
ab -n 50 -c 1 "http://dev.sandre.eaufrance.fr/geonetworkalpha/srv/en/main.search?extended=off&remote=off&attrset=geo&any=rivi%C3%A8re&similarity=1&hitsPerPage=10"
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Benchmarking dev.sandre.eaufrance.fr (be patient).....done
Server Software: Jetty/5.1.5rc2
Server Hostname: dev.sandre.eaufrance.fr
Server Port: 80
Document Path: /geonetworkalpha/srv/en/main.search?extended=off&remote=off&attrset=geo&any=rivi%C3%A8re&similarity=1&hitsPerPage=10
Document Length: 20001 bytes
Concurrency Level: 1
Time taken for tests: 139.122823 seconds
Complete requests: 50
Failed requests: 0
Write errors: 0
Total transferred: 1015738 bytes
HTML transferred: 1000050 bytes
Requests per second: 0.36 [#/sec] (mean)
Time per request: 2782.456 [ms] (mean)
Time per request: 2782.456 [ms] (mean, across all concurrent requests)
Transfer rate: 7.12 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  305  818.2     80   3955
Processing:  1081 2476 1297.6   2032   6229
Waiting:      991 1491  315.3   1442   2423
Total:       1091 2781 1683.4   2183   9814
Percentage of the requests served within a certain time (ms)
50% 2183
66% 2733
75% 3505
80% 4055
90% 5648
95% 6028
98% 9814
99% 9814
100% 9814 (longest request)
ab -n 50 -c 1 "http://dev.sandre.eaufrance.fr/geonetworkalpha/srv/en/main.search?extended=off&remote=off&attrset=geo&any=rivi%C3%A8re&similarity=.6&hitsPerPage=10"
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Benchmarking dev.sandre.eaufrance.fr (be patient).....done
Server Software: Jetty/5.1.5rc2
Server Hostname: dev.sandre.eaufrance.fr
Server Port: 80
Document Path: /geonetworkalpha/srv/en/main.search?extended=off&remote=off&attrset=geo&any=rivi%C3%A8re&similarity=.6&hitsPerPage=10
Document Length: 20282 bytes
Concurrency Level: 1
Time taken for tests: 152.251964 seconds
Complete requests: 50
Failed requests: 0
Write errors: 0
Total transferred: 1029792 bytes
HTML transferred: 1014100 bytes
Requests per second: 0.33 [#/sec] (mean)
Time per request: 3045.039 [ms] (mean)
Time per request: 3045.039 [ms] (mean, across all concurrent requests)
Transfer rate: 6.60 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  181  449.1     60   3134
Processing:  1151 2863 2398.8   1962  12157
Waiting:     1111 1722 1113.6   1372   7470
Total:       1161 3044 2605.6   2143  12157
Percentage of the requests served within a certain time (ms)
50% 2143
66% 2814
75% 3895
80% 4146
90% 5928
95% 11286
98% 12157
99% 12157
100% 12157 (longest request)
_____
From: geonetwork-devel-bounces@lists.sourceforge.net
[mailto:geonetwork-devel-bounces@lists.sourceforge.net] On behalf of Jeroen
Ticheler
Sent: Thursday, 12 October 2006 16:32
To: geonetwork-devel@lists.sourceforge.net
Subject: [GeoNetwork-devel] GeoNetwork fuzzy search
Hi Francois,
As you can see from the exchange below, the fuzzy search is looked at with
great interest. Did you do any performance testing?
Ciao,
Jeroen
Begin forwarded message:
On the fuzzy search, I wrote the following short note to the developers'
list last May.
On the fuzzy search, I wonder if there are performance issues to be
considered; in general any partial match search that does not work on a
prefix tends to give some performance penalty, both in terms of space (in
this case, number of generated clauses?) and query resolution time.
It would be worth doing some tests before adding fuzzy search.
Has anyone done any tests on the performance of the fuzzy search since then?
I am re-issuing this since I remember that Stefano at some point, totally
outside GeoNetwork, needed some fuzzy search features for a big database he
was working on, and he reported to me that the speed became much slower; but
maybe I don't remember well, since this happened more than a year ago.
Regards from Vienna,
Roberto
On 12 Oct 2006, at 14:41, Andrea Carboni wrote:
Dear Patrizia,
this problem will be automatically fixed as soon as our French colleagues
implement fuzzy search in GeoNetwork.
Cheers,
Andrea
Hi,
I just want to add to the email below that the same problem occurs with
dashes, as in Sub-Sahara.
Again thanks a lot,
Patrizia