Hello friends,
Some of our collaborators are interested in using GeoNetwork for serving up their metadata records for very large collections - 16 million records or more.
Are there any tips or suggestions to help them tune the installation and ensure good performance? Does anyone here have experience with collections of similar size?
Thanks for any useful input you might be able to provide!
--
Archie
-- Archie Warnock warnock@anonymised.com
-- A/WWW Enterprises www.awcubed.com
-- As a matter of fact, I _do_ speak for my employer.
There are a couple of important factors:
- Which version of GeoNetwork is being used
- What requests are being used
The trunk (and 2.8.0) versions have a search request that does not access the database and is therefore quite fast compared to other versions. However, if CSW is used, it accesses the database for each request, resulting in a much slower response. We intend to fix at least some of the CSW issues in the future, but at the moment it is recommended to use the "quick" q service for performing searches.
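To illustrate the difference, here is a rough sketch of the two request styles against a 2.8-style install. The base URL, the /srv/eng/q and /srv/eng/csw paths, and the query parameter names are assumptions that may differ in your deployment:

import requests

GEONETWORK = "http://localhost:8080/geonetwork"  # assumed base URL

# Index-backed "q" search (no database round trip per hit).
# Parameter names (any, fast, from, to) are assumptions about the 2.8 API.
q_resp = requests.get(
    GEONETWORK + "/srv/eng/q",
    params={"any": "elevation", "fast": "index", "from": 1, "to": 50},
    timeout=30,
)
print("q service:", q_resp.status_code, len(q_resp.content), "bytes")

# Equivalent CSW GetRecords request; each returned record is re-read from
# the database, which is where the slowdown shows up on large catalogues.
csw_body = """<?xml version="1.0"?>
<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                service="CSW" version="2.0.2"
                resultType="results" maxRecords="50">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>summary</csw:ElementSetName>
  </csw:Query>
</csw:GetRecords>"""
csw_resp = requests.post(
    GEONETWORK + "/srv/eng/csw",
    data=csw_body,
    headers={"Content-Type": "application/xml"},
    timeout=60,
)
print("CSW:", csw_resp.status_code, len(csw_resp.content), "bytes")

Timing a few of each against the same catalogue should make the gap obvious.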
Other issues, like uploaded content associated with records, I have not had to work with, so I can't help there.
The 2.8 and trunk versions also have a metrics API that you can use to set up something like collectd (in the future I intend to make it more configurable so that push services like Ganglia and statsd can also be used). The metrics can be used to identify slow periods. It would be interesting to get feedback on what other metrics would be useful to add to the system.
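Until push support is in place, a small cron-driven poller can bridge the gap. This is only a sketch: the /monitor/metrics URL is a hypothetical mount point for the metrics servlet, and the JSON layout and the statsd address both depend on the actual deployment:

import json
import socket
import urllib.request

# Hypothetical metrics endpoint; the real path depends on how the
# metrics servlet is mounted in your GeoNetwork deployment.
METRICS_URL = "http://localhost:8080/geonetwork/monitor/metrics"
STATSD_ADDR = ("localhost", 8125)

def flatten(obj, prefix=""):
    """Yield (dotted_name, numeric_value) pairs from nested metric dicts."""
    for key, val in obj.items():
        name = prefix + "." + key if prefix else key
        if isinstance(val, dict):
            yield from flatten(val, name)
        elif isinstance(val, (int, float)):
            yield name, val

def push_metrics():
    """Poll the metrics JSON once and forward numeric values to statsd as gauges."""
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        metrics = json.load(resp)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for name, value in flatten(metrics):
        # statsd gauge line format: <name>:<value>|g
        sock.sendto("geonetwork.{}:{}|g".format(name, value).encode(), STATSD_ADDR)

if __name__ == "__main__":
    push_metrics()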
Jesse
On 5/23/12 2:04 AM, Jesse Eichar wrote:
There are a couple important factors:
1. Which version of GeoNetwork is being used
2. What requests are being used
The trunk (and 2.8.0) versions have a search request that does not access the database and is therefore quite fast compared to other versions. However, if CSW is used, it accesses the database for each request, resulting in a much slower response. We intend to fix at least some of the CSW issues in the future, but at the moment it is recommended to use the "quick" q service for performing searches.
I'd like to know whether the spatial searches are now being done in Lucene or whether that is still done in the database. I would expect multi-term searches to be much slower than free-text searches, but even those could be optimized as far as possible. A test that bulk-loads 100K, 250K, 500K, and then 1M metadata records, then performs searches against the large collection (inventory), would be very interesting.
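As a starting point, a harness along these lines could time the search side of that test once each catalogue size has been loaded. The base URL and the q-service parameter names (any, geometry, relation, fast) are assumptions about a 2.8-style install:

import statistics
import time
import requests

GEONETWORK = "http://localhost:8080/geonetwork"  # assumed base URL

# Representative query mixes for the test described above.
QUERIES = {
    "free-text": {"any": "temperature"},
    "multi-term": {"any": "temperature salinity depth"},
    "spatial": {"geometry": "POLYGON((-10 40,-10 50,5 50,5 40,-10 40))",
                "relation": "intersection"},
}

def time_query(params, runs=20):
    """Run one query repeatedly and return the median wall-clock latency."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        r = requests.get(GEONETWORK + "/srv/eng/q",
                         params={**params, "fast": "index", "from": 1, "to": 20},
                         timeout=120)
        r.raise_for_status()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

if __name__ == "__main__":
    for label, params in QUERIES.items():
        print("{}: {:.3f}s median".format(label, time_query(params)))

Running the same script at each catalogue size (100K, 250K, 500K, 1M) would show how each query type scales.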
Doug.
--
Douglas D. Nebert
Senior Advisor for Geospatial Technology, System-of-Systems Architect
FGDC Secretariat Tel/Fax:+1 503 454-6248 Cell:+1 703 459-5860
On 5/23/2012 2:04 AM, Jesse Eichar wrote:
There are a couple important factors:
1. Which version of GeoNetwork is being used
Thanks, Jesse. This would be the current version - they're just ramping up to bring the system online.
2. What requests are being used
It'll be primarily CSW, at least for the purposes of this project. I imagine the answer has two parts - the time it takes to identify and locate the matching records, and the time to format them into the proper response.
They're not unrelated, of course, but it's probably reasonable to expect they'll implement some kind of limit to prevent overly large result sets. The big issue is more likely to be how long it takes to locate the relevant records in a collection of tens of millions of records.
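For what it's worth, the result-set size can be bounded by paging with maxRecords/startPosition on the client side. A minimal sketch, assuming the standard CSW 2.0.2 GetRecords interface and an (assumed) /srv/eng/csw endpoint:

import requests

CSW_URL = "http://localhost:8080/geonetwork/srv/eng/csw"  # assumed endpoint

GETRECORDS = """<?xml version="1.0" encoding="UTF-8"?>
<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                xmlns:ogc="http://www.opengis.net/ogc"
                service="CSW" version="2.0.2" resultType="results"
                startPosition="{start}" maxRecords="{page}">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>summary</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
      <ogc:Filter>
        <ogc:PropertyIsLike wildCard="*" singleChar="_" escapeChar="\\">
          <ogc:PropertyName>AnyText</ogc:PropertyName>
          <ogc:Literal>*elevation*</ogc:Literal>
        </ogc:PropertyIsLike>
      </ogc:Filter>
    </ogc:Constraint>
  </csw:Query>
</csw:GetRecords>"""

def fetch_page(start, page=50):
    """POST one bounded GetRecords page so the server never has to format a huge result set."""
    body = GETRECORDS.format(start=start, page=page)
    resp = requests.post(CSW_URL, data=body,
                         headers={"Content-Type": "application/xml"}, timeout=60)
    resp.raise_for_status()
    return resp.text

print(fetch_page(1)[:500])

That caps the formatting cost per response; as you say, the open question is still how quickly the matching records can be located in the first place.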
The trunk (and 2.8.0) versions have a search request that does not
access the database and is therefore quite fast compared to other
versions. However, if CSW is used, it accesses the database for each
request, resulting in a much slower response. We intend to fix at least
some of the CSW issues in the future, but at the moment it is recommended
to use the "quick" q service for performing searches.
Other issues, like uploaded content associated with records, I have not
had to work with, so I can't help there.
The 2.8 and trunk versions also have a metrics API that you can use to
set up something like collectd (in the future I intend to make it more
configurable so that push services like Ganglia and statsd can also be
used). The metrics can be used to identify slow periods. It would be
interesting to get feedback on what other metrics would be useful to add
to the system.
--
Archie
-- Archie Warnock warnock@anonymised.com
-- A/WWW Enterprises www.awcubed.com
-- As a matter of fact, I _do_ speak for my employer.