[GeoNetwork-users] how many records?

http://geonetwork-opensource.org/manuals/trunk/users/_sources/admin/advanced-configuration/index.txt mentions catalogs with over a million records; is there an actual limit?

And if there is such a limit, would one have access to all metadata if one tried to harvest from two different GN nodes that were at this limit?

Thank you,

Jim
--

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
James Long
International Arctic Research Center
University of Alaska Fairbanks
jlong|at|alaska.edu
(907) 474-2440
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Hi there,

for the CSW service itself there are no hard-and-fast limits, but of course with a million records things are not going to run particularly smoothly: it all depends on the capabilities and scalability of the external database holding all the records, so beyond a certain point things will simply slow down too much, and too suddenly, to remain usable, particularly if you use e.g. McKoi. For the GeoNetwork web interface there may be stricter limits; e.g. the Lucene index is known to break at as "little" as 16K, 32K or 64K records, depending on how it is configured.

Hi Victor,

do you have any references for your statement that "the Lucene index is known to break"? Are you referring specifically to the Lucene index in GeoNetwork?

Kind regards,
Heikki Doeleman

Well, I first investigated this aspect a few days ago, when somebody
posted a complaint about the maxClauseCount parameter of the Lucene
index (forwarding the post):

On Tue, Feb 14, 2012 at 4:32 PM, <kieransun@anonymised.com> wrote:

Hello GeoNetwork users,

we are searching for time ranges in about 54,500 datasets via CSW and get the following error:

Raised exception while searching metadata : org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 16384

How can we raise this limit?

Kind regards,
Kieran

After that, just googling for "maxClauseCount" brought up an
interesting backstory. Apparently it is always hardcoded to some
"high enough" value (which, depending on whom you ask, might be
16,000, 32,000, 64,000, etc.), and such values are of course
proverbially proven wrong sooner or later. Fortunately, it can be
changed through configuration. That is the only obvious hard-and-fast
limit, due purely to an arbitrary constraint, that I could identify.
Beyond that... I suppose interesting things could happen if someone
exceeded a total of 2^31 records (which might cause 32-bit integer
overflows in certain software modules).
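
For reference, the limit can also be raised programmatically; here is
a minimal sketch against the Lucene 3.x API (the value 65536 is an
arbitrary example, not a recommendation, and note the setting is
static, i.e. JVM-wide):

    import org.apache.lucene.search.BooleanQuery;

    public class RaiseClauseLimit {
        public static void main(String[] args) {
            // Lucene's out-of-the-box default is 1024; GeoNetwork raises it to 16384.
            System.out.println("Current limit: " + BooleanQuery.getMaxClauseCount());

            // Static, JVM-wide setting: must run before the offending query executes.
            // 65536 is an arbitrary example value.
            BooleanQuery.setMaxClauseCount(65536);
            System.out.println("New limit: " + BooleanQuery.getMaxClauseCount());
        }
    }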

The TooManyClauses error happens because of the way certain queries are
constructed; it is not directly caused by the number of documents in the
index. See e.g. this explanation:
http://stackoverflow.com/questions/1534789/help-needed-figuring-out-reason-for-maxclausecount-is-set-to-1024-error

MaxClauseCount is set to 16384 in GeoNetwork, which seems a bit arbitrary
to me; it could be set to Integer.MAX_VALUE (which is 2^31 - 1).

Also, we're using TermRangeQuery for date ranges, where it may be better
to use NumericRangeQuery
(http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/search/NumericRangeQuery.html).
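
To make the contrast concrete, here is a minimal sketch against the
Lucene 3.x API; the field name "tempExtentBegin" and the
epoch-millisecond values are hypothetical, and the numeric variant
assumes the dates were indexed with NumericField:

    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermRangeQuery;

    public class DateRangeQueries {
        public static void main(String[] args) {
            // String-based range: under a scoring rewrite this expands into one
            // boolean clause per unique term between the bounds, which is what
            // trips maxClauseCount on large catalogs.
            Query termRange = new TermRangeQuery("tempExtentBegin",
                    "2010-01-01", "2012-01-01", true, true);

            // Numeric range: matches against a trie of precision buckets, so it
            // needs far fewer terms. Assumes the field was indexed as a
            // NumericField holding epoch milliseconds.
            Query numericRange = NumericRangeQuery.newLongRange("tempExtentBegin",
                    1262304000000L,   // 2010-01-01T00:00:00Z
                    1325376000000L,   // 2012-01-01T00:00:00Z
                    true, true);

            System.out.println(termRange);
            System.out.println(numericRange);
        }
    }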

Apart from range queries, I would expect that a catalog with hundreds of
thousands or millions of records would still be fine -- though I have no
performance data at such sizes. If anyone does, please let us know.

Kind regards,
Heikki Doeleman

Thanks for the replies, everyone. I'm working with a user who wants us to ingest one image, with associated radar data, every 5 minutes for several years into the future; at that rate it's 288 records per day, so over 100,000 per year. We can automate the process with a nested parent/child metadata hierarchy. I'm thinking of a separate GN instance for each year, and harvesting from those. I would bundle the images together if they were small enough, but the total comes out to 315 MB/day.

Does anyone have experience they would like to share with such large catalogs?

Thank you,

Jim

--

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
James Long
International Arctic Research Center
University of Alaska Fairbanks
jlong|at|alaska.edu
(907) 474-2440
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%