[GeoNetwork-devel] Help? - about indexing 1000's records

Frederic,

Thanks for your reply. I am sure the geonetwork developers list would
be interested in your findings so I have copied them in too.

Andrew

----- Original Message ----- From: "Frederic Vallois" <Frederic.Vallois@anonymised.com>
To: "awalsh" <awalsh@anonymised.com>
Sent: Friday, March 05, 2010 4:26 AM
Subject: RE: Help?

Hello,

I have a confirmation for you: the limit on the number of records is tied to the maximum length of a file name on your file system (2^32 even on a plain Windows machine), so there is practically no limit.
But beyond 100 000 metadata records the Lucene index becomes quite slow to answer. Response time stays very linear, but long.

It seems this could be improved with a Hadoop solution if needed. PS: in that case I think some combination of a distributed Lucene index
with Apache Hadoop could be a way to go (see for example
http://www.hpl.hp.com/techreports/2008/HPL-2008-64.pdf).

For larger databases we at VEGA normally use much faster solutions (like
those used for ground-segment catalogues in Earth-observation satellite
systems), but the price is not the same, so Hadoop could be a very
interesting option depending on the number of metadata records and the
budget you have for the project.

PS 2: The tests were made with the current latest version of
GeoNetwork 2.4.x, running on Windows 7 32-bit, 2.8 GHz, 4 GB RAM.
GeoNetwork's JVM settings and logging settings were left as they are
out of the box, with GeoNetwork running on its embedded Jetty servlet
container and connected to a Postgres database.

Best regards,

Frederic VALLOIS
VEGA Technologies
A Finmeccanica Company
Mobile: + 33 6 68 37 15 45

-----Original Message-----
From: awalsh [mailto:awalsh@anonymised.com]
Sent: Thursday, 26 November 2009 23:10
To: Frederic Vallois
Subject: Re: Help?

Frederic,

I got some advice from one of the GeoNetwork programmers that you can
have more than 99999 records. The XML importer Java code keeps
incrementing when it creates numbered directories for numbers larger
than 99999. I can't remember which Java file it was, but when I checked
its logic it seemed to be doing the job.

I haven't actually tested this as we don't have that many records here,
but I would be interested to know how your GeoNetwork goes if you do
have more than 99999 records.
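For what it's worth, the behaviour described above is easy to sketch. The snippet below is my own illustration, not GeoNetwork's actual importer code (the class and method names are invented): a plain integer counter formatted as a directory name has no 5-digit ceiling, it simply grows one digit wider past 99999.

```java
// Hypothetical sketch of a counter-based directory naming scheme.
// Not GeoNetwork's actual importer code -- it only illustrates that an
// int counter, when padded to 5 digits, keeps incrementing past 99999.
public class DirNaming {
    // Pad to at least 5 digits; longer ids grow naturally to 6+ digits.
    static String dirFor(int recordId) {
        return String.format("%05d", recordId);
    }

    public static void main(String[] args) {
        System.out.println(dirFor(42));      // 00042
        System.out.println(dirFor(99999));   // 99999
        System.out.println(dirFor(100000));  // 100000 -- one digit wider, no overflow
    }
}
```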

Andrew

----- Original Message ----- From: <frederic.vallois@anonymised.com>
To: <awalsh@anonymised.com>
Sent: Thursday, November 26, 2009 7:35 PM
Subject: Help?

Hello,

I have seen your post about the 5-digit limit on the number of records. Is it confirmed?

Did you find a way around this limit?
Did you get any answers to your post?

I am in the same situation and trying to find out the limit on the number of records.

Could you help me?
Best regards,

Frederic.

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email
______________________________________________________________________


Hi Frederic and Andrew,

Coincidentally, I've been testing and developing some improvements for this over the last week on catalogs of up to 120,000 records.

The findings so far are that the search-speed issue is not Lucene, which is really fast even for catalogs of this size, but the processing that GeoNetwork does on the result set. One solution is to limit the extent of this result-set processing (e.g. collecting keywords and keyword frequencies) to the most relevant records; that way you can still do very fast searches (1-2 seconds) on these larger catalogs.
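To make the idea concrete, here is a minimal sketch of restricting keyword-frequency collection to the top-N ranked hits. This is my own illustration under stated assumptions, not the actual patch: the `Hit` record and `topKeywords` method are invented names, and a real implementation would work against Lucene's scored results rather than a plain list.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: aggregate keyword frequencies from only the N
// most relevant hits, instead of post-processing the entire result set.
// Hit and topKeywords are invented names for illustration.
public class TopNKeywords {
    record Hit(double score, List<String> keywords) {}

    static Map<String, Integer> topKeywords(List<Hit> hits, int n) {
        Map<String, Integer> freq = new HashMap<>();
        hits.stream()
            .sorted(Comparator.comparingDouble((Hit h) -> -h.score)) // best first
            .limit(n)                                                // cap the work
            .flatMap(h -> h.keywords().stream())
            .forEach(k -> freq.merge(k, 1, Integer::sum));
        return freq;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit(0.9, List.of("ocean", "temperature")),
            new Hit(0.8, List.of("ocean")),
            new Hit(0.1, List.of("ice")));  // low-relevance hit, skipped for n=2
        System.out.println(topKeywords(hits, 2)); // frequencies from the top 2 hits only
    }
}
```

The point of the sketch is simply that the cost of the aggregation step becomes proportional to N rather than to the full catalog size.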

I have a patch almost ready for the trunk which copes with this and the other speed problem of building indexes for these catalogs.

Ultimately we will run into other limits (though it seems not with catalogs of this size), and Hadoop is out there.

Cheers and thanks,
Simon

awalsh wrote:


_______________________________________________
GeoNetwork-devel mailing list
GeoNetwork-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
GeoNetwork OpenSource is maintained at http://sourceforge.net/projects/geonetwork