[GeoNetwork-devel] Proposal: Local harvesting and remote search

Harvesting of local sources does not appear to work reliably, although
one-time import of local records does seem to work. We propose to
test and refine, as necessary, the ability to harvest (or update) local
collections of records on a regular basis.

In addition, a number of metadata sources in GEOSS are either too large
(250,000 - 1 million records or more) or too dynamic (several
updates/hour, perhaps related to emergency conditions) to harvest and
hold locally. In these cases, we anticipate that searching through the
clearinghouse instance should be performed as a distributed search
against the original collection, rather than against a locally-held
harvested collection, and further, that this process should be
transparent to the end user. That is, while such collections may be
presented to the end user as an optional source to be searched, users
should not be expected to know which collections are held in the
clearinghouse and which are searched remotely, nor should they
be directed away from the clearinghouse site to search these remote
sites separately.

We are fully cognizant of the network latencies involved in such a
scenario, having had direct experience with it in the FGDC Clearinghouse
network in years past. Nonetheless, support for distributed, remote
searching is seen to be unavoidable within the GEOSS framework. The
basic client functions for doing distributed, remote search are already
in GeoNetwork - we propose to implement it as part of the search
interface, at least through the CSW API. Note that, in GEOSS anyway,
clearinghouse and portal functions are separate - portals provide the
user interface, clearinghouses provide the programmatic search interface
to the portals via CSW.
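
For concreteness, the kind of programmatic query a portal would push to a
clearinghouse is a plain CSW 2.0.2 GetRecords request. Below is a minimal
sketch in Python; the endpoint URL and the search term are placeholders,
not a real GEOSS catalogue:

import requests

# Placeholder clearinghouse endpoint - substitute a real GeoNetwork CSW URL.
CSW_ENDPOINT = "http://example.org/geonetwork/srv/en/csw"

params = {
    "service": "CSW",
    "version": "2.0.2",
    "request": "GetRecords",
    "typeNames": "csw:Record",
    "resultType": "results",
    "elementSetName": "summary",
    "constraintLanguage": "CQL_TEXT",
    "constraint_language_version": "1.1.0",
    "constraint": "AnyText LIKE '%flood%'",   # example full-text constraint
    "startPosition": "1",
    "maxRecords": "10",
}

response = requests.get(CSW_ENDPOINT, params=params, timeout=30)
response.raise_for_status()
# The body is a csw:GetRecordsResponse containing csw:SummaryRecord entries.
print(response.text)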

--

Archie

-- Archie Warnock warnock@anonymised.com
-- A/WWW Enterprises www.awcubed.com
-- As a matter of fact, I _do_ speak for my employer.

I definitely support further testing and possible rework to improve harvesting.

For the second part, on on-the-fly searching (or distributed search), I agree that it should be practically (or totally) invisible to the end user. I'm absolutely no fan of distributed searches and think that if a clearinghouse is known to update its information with high frequency, it should be configured to be checked regularly. Indeed, maybe the distributed search function could trigger a simple check against the remote catalogue and then subsequently update its local cache. That would spare the next user who queries the same catalogue from facing the same latencies.
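
Something like this is what I have in mind - just a rough sketch in Python,
with made-up names rather than actual GeoNetwork internals:

from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=1)   # assumed threshold for "worth re-checking"

def search_with_refresh(catalogue, query, search_remote, queue_refresh):
    """Forward a query to the remote catalogue and opportunistically
    trigger a local cache update.

    catalogue     -- dict with 'endpoint' and 'last_harvested' (datetime)
    search_remote -- callable(endpoint, query), e.g. a CSW GetRecords client
    queue_refresh -- callable(endpoint) that schedules a harvester run
    """
    results = search_remote(catalogue["endpoint"], query)

    if datetime.utcnow() - catalogue["last_harvested"] > STALE_AFTER:
        # Piggy-back on this query: schedule a harvest so that later users
        # hit the local cache instead of paying the same network latency.
        queue_refresh(catalogue["endpoint"])

    return results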

Would it be an idea to have a public status page that shows the last update time for each catalogue that is connected?

Ciao,
Jeroen

Jeroen Ticheler wrote:

> I definitely support further testing and possible rework to improve
> harvesting.

Thanks.

> For the second part, on on-the-fly searching (or distributed search), I
> agree that it should be practically (or totally) invisible to the end
> user. I'm absolutely no fan of distributed searches and think that if a
> clearinghouse is known to update its information with high frequency, it
> should be configured to be checked regularly. Indeed, maybe the
> distributed search function could trigger a simple check against the
> remote catalogue and then subsequently update its local cache. That
> would spare the next user who queries the same catalogue from facing
> the same latencies.

I'm really not a fan of distributed search, either, but much discussion
has taken place in the GEOSS community and the general consensus seems
to be that it can't be avoided in certain situations. Obviously, it
should be a last resort - to be used only when the clearinghouse can't
maintain a reasonable level of performance or a reasonably current local
cache - but I think it needs to be an option.

As an example, 100,000+ Geospatial OneStop records are absolutely
killing my current database. We're investigating migrating the database
to a more powerful machine, but I can tell you that there's more latency
right now in searching against my local cache than in searching the GOS
source remotely. It would be a huge improvement in performance to push
the search out to them - and the prospect of duplicating over a million
NASA records into a local cache someday is pretty daunting.
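
One way I could picture keeping it a last resort (purely a sketch with
hypothetical names, not existing GeoNetwork code): each registered
catalogue carries a flag saying whether it is held in the local index or
searched remotely, and the clearinghouse fans the query out accordingly.

def route_query(query, catalogues, search_local, search_remote):
    """catalogues: list of dicts such as
       {'name': 'GOS', 'mode': 'remote'} or {'name': 'small-archive', 'mode': 'local'}.
    search_local / search_remote: callables(catalogue, query) -> list of records."""
    results = []
    for cat in catalogues:
        if cat["mode"] == "remote":
            # Too large or too dynamic to cache: push the search out via CSW.
            results.extend(search_remote(cat, query))
        else:
            # Harvested into the local index: answer from the cache.
            results.extend(search_local(cat, query))
    return results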

> Would it be an idea to have a public status page that shows the last
> update time for each catalogue that is connected?

Absolutely.

--

Archie

-- Archie Warnock warnock@anonymised.com
-- A/WWW Enterprises www.awcubed.com
-- As a matter of fact, I _do_ speak for my employer.