[GeoNetwork-users] [GeoNetwork-devel] OAI-PMH ListSets Validation Error: Can be silenced?

Hi John,

Two issues here:

1. The ListSets response from the Equella server is invalid (as we found), which means that you can't get the sets & prefixes to search/use in the edit interface when adding an OAI-PMH harvester for an Equella server. The fix here is for Equella to fix their server - ok, not likely :slight_smile: - so we could relax the validation on this request in GeoNetwork, but the risk is that we get back stuff we don't understand and further confuse the user by presenting junk in the interface. (A hack here is to run the ListSets request yourself and look at the response, e.g. http://equella-server/oai/provider?verb=ListSets, or just use the oai_dc metadata prefix, as that is defined anyway.) Alternatively, if we don't get anything (valid) back from the ListSets request in the harvester edit interface, we could let the user enter their own set name and metadata prefix values, as jOAI does (which sounds preferable to me anyway) - anyone else want to comment?

2. I think the "invalid xml" error you're referring to is how all the metadata records returned from the Equella server are flagged when the harvester is actually run. What is happening here is that GeoNetwork is doing a GetRecord request and getting back a response from the Equella server that includes the metadata record. This is fine, but GeoNetwork then attempts to validate the response, which fails, probably because the Equella server (in common with many other OAI servers) is not including the schemaLocation attribute for the oai_dc schema on the root element of the embedded metadata record in the GetRecord response. This is perhaps a bit too restrictive on our part, as schemaLocation is optional, I think, on all XML records. Anyway, this can be relaxed by leaving validation of the embedded metadata record until after GeoNetwork has made a guess at the schema to which it belongs - at that stage it can use the local schema to validate it, and it doesn't need a schemaLocation attribute. This seems to me to be a worthwhile relaxation because it makes lots more records available from the servers I know about without increasing the risk of ingesting junk. Again, does anyone else want to comment on this?

I've tested the second fix in the ANZMEST code and I'll develop the first fix as well to prove the concept (although maybe Mathieu and Julien have done this in the refactoring of the OAI-PMH harvester they have done?).
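The manual check suggested in point 1 (look at the ListSets response yourself) can be sketched in Python as below. This is only an illustration, not GeoNetwork code: the sample response, the stray text inside `<set>` and the function name `list_set_specs` are all assumptions, since the thread does not show the actual Equella output. The sketch reads the setSpec values without any schema validation.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical ListSets response. The stray text inside the first <set>
# stands in for the kind of content a strict XSD validation would reject.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2011-03-28T00:00:00Z</responseDate>
  <request verb="ListSets">http://equella-server/oai/provider</request>
  <ListSets>
    <set>stray text
      <setSpec>maps</setSpec>
      <setName>Maps</setName>
    </set>
    <set>
      <setSpec>reports</setSpec>
      <setName>Reports</setName>
    </set>
  </ListSets>
</OAI-PMH>"""


def list_set_specs(response_xml):
    """Return the setSpec values in a ListSets response.

    The document only has to be well-formed XML; nothing is validated
    against the OAI-PMH schema.
    """
    root = ET.fromstring(response_xml)
    specs = []
    for element in root.iter(f"{{{OAI_NS}}}setSpec"):
        if element.text and element.text.strip():
            specs.append(element.text.strip())
    return specs


print(list_set_specs(SAMPLE))
```

Reading leniently like this is fine for a one-off inspection, but it carries exactly the risk described in point 1: whatever the server sends is accepted as is.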

Cheers,
Simon

________________________________________
From: boabjohn [john@anonymised.com]
Sent: Sunday, 27 March 2011 8:36 PM
To: geonetwork-devel@lists.sourceforge.net
Subject: [GeoNetwork-devel] OAI-PMH ListSets Validation Error: Can be silenced?

(Apologies: cross-posted from GN-Users...perhaps this is the better forum?)

G'Day all,

We're attempting to harvest from a records management platform called
Equella (http://www.equella.com/) which say they can support an OAI-PMH
endpoint.

However, GN throws an "invalid xml" error when we attempt to harvest.

After some laser-like investigation by a friendly wizard (thanks Simon!) it
looks like the problem is with the ListSets response on the Equella server.
It is returning text content in element which apparently is
not valid oaipmh (according to oaipmh XSDs).

Equella say they will fix it "some time soon..."

We need to harvest now.

Can anyone suggest how the validation error can be safely (and succinctly)
silenced so that the harvest can continue?

Thanks in advance,

JB

---

John Brisbin

Managing Director, BoaB interactive

[mb] +61 (0)407 471 565

[ph] +61 (0)7 3103 0574 (voice 2 text)

[im] skype:boabjohn

[www] http://www.boab.info

[po] POB 802, Townsville QLD 4810 AUSTRALIA

--
View this message in context: http://osgeo-org.1803224.n2.nabble.com/OAI-PMH-ListSets-Validation-Error-Can-be-silenced-tp6211988p6211988.html
Sent from the GeoNetwork developer mailing list archive at Nabble.com.

_______________________________________________
GeoNetwork-devel mailing list
GeoNetwork-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
GeoNetwork OpenSource is maintained at
http://sourceforge.net/projects/geonetwork

On Tuesday, 29 March 2011 1:52 AM Simon.Pigot@anonymised.com wrote:

<snip>

This is perhaps a
bit too restrictive on our part as schemaLocation is optional

Yes. schemaLocation is unfortunately optional, so we can't expect it always to be included in the XML document instance.

I think on all XML records. Anyway this can be relaxed by
leaving validation of the embedded metadata record until
after GeoNetwork has made a guess at the schema to which it
belongs - at that stage it can use the local schema to
validate it and it doesn't need a schemaLocation
attribute. This seems to me to be a worthwhile relaxation
because it makes lots more records available from the servers
I know about without increasing the risk of ingesting junk.
Again, does anyone else want to comment on this?

If:

1. an entry in an OASIS Catalog file includes the OAI-PMH namespace (http://www.openarchives.org/OAI/2.0/)

2. a local copy of the OAI-PMH.xsd is stored in the directory where GeoNetwork holds its XSDs and

3. GeoNetwork can read OASIS Catalog files during the validation process:

Then:

The validation of the harvested metadata files shouldn't need a schemaLocation attribute. However, if the metadata file is invalid then there will still be issues.

An entry in the OASIS Catalog file could look like this:

    <system
        systemId="http://www.openarchives.org/OAI/2.0/"
        uri="file://path_to_geonetwork_xsds/OAI-PMH.xsd"/>

I think that this is a regular way of validating XML document instance files where the XSD is not included in the schemaLocation.
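To make the catalog idea concrete, here is a rough Python sketch of the lookup such an entry enables. It is not a full OASIS catalog resolver (no rewrite or delegate rules) and not how GeoNetwork does it; the function names `load_catalog` and `resolve` are made up for illustration, and the local path is the same placeholder as in the entry above.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OASIS XML Catalogs specification.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Hypothetical catalog file content; the uri value is a placeholder path.
CATALOG = """<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <system
      systemId="http://www.openarchives.org/OAI/2.0/"
      uri="file://path_to_geonetwork_xsds/OAI-PMH.xsd"/>
</catalog>"""


def load_catalog(catalog_xml):
    """Map each systemId in the catalog's <system> entries to its local uri."""
    root = ET.fromstring(catalog_xml)
    mapping = {}
    for entry in root.iter(f"{{{CATALOG_NS}}}system"):
        mapping[entry.get("systemId")] = entry.get("uri")
    return mapping


def resolve(mapping, identifier):
    """Return the local uri for an identifier, or None if it is not mapped."""
    return mapping.get(identifier)


mapping = load_catalog(CATALOG)
print(resolve(mapping, "http://www.openarchives.org/OAI/2.0/"))
```

An identifier with no catalog entry resolves to `None`, which is the case where a validator would still have to fall back on schemaLocation or give up.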

<snip>

Thanks.

John


On 03/29/2011 11:16 AM, john.hockaday@anonymised.com wrote:

<snip>

Hi John & John,

In GeoNetwork 2.7, OASIS catalogs are now available and are used to map external URLs to local filesystem references for things like this.

We do already map schemaLocation URLs (e.g. http://www.isotc211.org/2005/gmd/gmd.xsd) to local file system paths (e.g. INSTALL_PATH/web/geonetwork/xml/schemas/iso19139/schema/gmd/gmd.xsd). I'll check whether the validation code looks up the namespace URI when no schemaLocation attribute is present, as this would indeed add more flexibility.

Note that the issue is validating an OAI-PMH GetRecord response (the others work OK), and the problem is that the embedded metadata in the response can be from any schema. The embedded metadata won't be of use in GeoNetwork unless its schema is known to GeoNetwork. So the proposed solution is:

    * switch off validation of the complete GetRecord response
    * extract the embedded metadata and detect its schema
    * use the file path to the local schema XSD in GeoNetwork for
      validation (and then only if the user checks the validate checkbox
      when defining the harvester)

This will provide enough detail for GeoNetwork to tell anyone what it can and can't handle, in addition to what is and isn't valid.
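The first two steps above (skip whole-response validation, extract the embedded record, detect its schema) might look roughly like this in Python. It is a sketch under assumptions, not the ANZMEST or GeoNetwork implementation: `LOCAL_SCHEMAS`, the oai_dc path and the sample response are hypothetical, and the final XSD validation step is left out because the Python standard library has no XSD validator.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical table of metadata namespaces known locally -> XSD path.
LOCAL_SCHEMAS = {
    "http://www.openarchives.org/OAI/2.0/oai_dc/": "schemas/oai_dc/oai_dc.xsd",
    "http://www.isotc211.org/2005/gmd": "schemas/iso19139/schema/gmd/gmd.xsd",
}

# Hypothetical GetRecord response; note the embedded record carries no
# schemaLocation attribute.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def extract_metadata(response_xml):
    """Return the root element of the embedded metadata record, or None."""
    root = ET.fromstring(response_xml)
    metadata = root.find(f".//{{{OAI_NS}}}metadata")
    if metadata is None or len(metadata) == 0:
        return None
    return metadata[0]


def detect_schema(record):
    """Return (namespace, local XSD path or None) for a metadata root element."""
    tag = record.tag
    namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
    return namespace, LOCAL_SCHEMAS.get(namespace)


record = extract_metadata(SAMPLE)
namespace, xsd_path = detect_schema(record)
print(namespace, xsd_path)
```

A `None` XSD path is the "schema not known to GeoNetwork" case, which can be reported separately from "known schema but invalid".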

Cheers,
Simon

Hi Simon and John,

On Mon, Mar 28, 2011 at 4:51 PM, <Simon.Pigot@anonymised.com> wrote:

<snip: point 1, the invalid ListSets response from the Equella server>

I agree with Simon that the best fix is for the provider to fix
their invalid operation! :wink:

But, in the case where the ListSets response from the remote OAI provider is
invalid, I would suggest not using the set as a search criterion
and performing a request (ListIdentifiers or ListRecords) without any
set restriction. If you let the user enter a specific set name
manually, this will mostly lead to them specifying an unknown set and
retrieving no results...but maybe I'm wrong. The same goes for the
metadataPrefix: because of the oai_dc default value you will be able to get
results from the provider in Dublin Core.
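As an illustration of a request without any set restriction, the following Python sketch builds the request URL with oai_dc as the default metadataPrefix and adds the `set` argument only when one is given. The helper name `build_request` is hypothetical; the base URL is the placeholder used earlier in the thread.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH request URL; the set argument is left out unless given."""
    params = {"verb": verb, "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# No set restriction: harvest everything the provider exposes as oai_dc.
print(build_request("http://equella-server/oai/provider", "ListRecords"))
```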

<snip: point 2, validating the embedded metadata record, and whether the refactored OAI-PMH harvester already covers these fixes>

For the time being we do not plan to support this case. We consider that if
the provider is unable to provide any valid response for any operation, the
provider is not usable as a harvesting source. So no fix to apply! :wink:

Cheers,

Mathieu


G'Day Mathieu,

Thanks very much indeed for looking into this case (and for Simon's
investigation). We have formulated a response to Equella with your combined
analysis and recommendations.

With luck they will see that strict OAI-PMH compliance will make Equella
collections visible to the growing GN community (and others!). They would
surely not want to deprive their clients of this use case...and perhaps they
will sort themselves out accordingly. We can only hope!

Many kind regards,

JB


Mathieu,

Whilst the principle of 'if the implementation is broken the provider should fix it' is fine, I think what we are proposing here is simply being a little flexible about what we accept. So with respect to:

1. Getting the sets and prefixes from the provider for the harvester interface: It seems fine to me to offer the user a chance to enter these things manually if nothing, or an invalid response, comes back from the server. We don't need to show the manual boxes until then, and then at least the (informed) user has some chance of getting back something other than just oai_dc records describing the whole collection :slight_smile:

2. Validating the GetRecord/ListRecords response and embedded metadata records: Don't be too quick to say that you aren't implementing it! I think we're actually talking about how GeoNetwork validates the response here. There are two options: one is as I've described earlier in this thread (extract the metadata record(s), detect the schema and validate against the schema XSDs in GeoNetwork); the other relies on having OASIS catalogs with mappings of namespace URIs and schemaLocation URLs to GeoNetwork schema XSDs (and would only work in 2.7 or higher, as that is the only version that implements the OASIS catalog and resolver stuff). What we're trying to work out is which validation approach we should follow. My feeling is that the 'extract and then validate' approach gives more info back to the harvest result than the 'validate the whole response' approach (especially if we switch the harvester to use ListRecords).
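To show why 'extract and then validate' can report more than a single pass/fail on the whole response, here is a small Python sketch that tallies a ListRecords response record by record. Everything in it is assumed for illustration (the `KNOWN` namespace set, the sample response, the outcome labels); it only detects the schema of each record and stops short of validating against an XSD.

```python
import xml.etree.ElementTree as ET
from collections import Counter

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical set of metadata namespaces the catalogue can handle.
KNOWN = {"http://www.openarchives.org/OAI/2.0/oai_dc/"}

# Hypothetical ListRecords response: one known schema, one unknown schema,
# and one deleted record (which carries a header but no metadata).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"/>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example:2</identifier></header>
      <metadata>
        <x:thing xmlns:x="http://example.org/unknown"/>
      </metadata>
    </record>
    <record>
      <header status="deleted"><identifier>oai:example:3</identifier></header>
    </record>
  </ListRecords>
</OAI-PMH>"""


def summarise(response_xml):
    """Count records by outcome instead of failing the whole response."""
    results = Counter()
    root = ET.fromstring(response_xml)
    for record in root.iter(f"{{{OAI_NS}}}record"):
        metadata = record.find(f"{{{OAI_NS}}}metadata")
        if metadata is None or len(metadata) == 0:
            results["no metadata"] += 1
            continue
        tag = metadata[0].tag
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
        outcome = "known schema" if namespace in KNOWN else "unknown schema"
        results[outcome] += 1
    return results


print(dict(summarise(SAMPLE)))
```

With whole-response validation, one unrecognised record would fail all three; here each record gets its own outcome in the harvest result.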

Let me know what you think and whether we're still talking about the same thing here :slight_smile:

Cheers,
Simon
________________________________________
From: Mathieu Coudert [mathieu.coudert@anonymised.com]
Sent: Tuesday, 29 March 2011 8:28 PM
To: Pigot, Simon (CMAR, Hobart)
Cc: john@anonymised.com; geonetwork-devel@lists.sourceforge.net; geonetwork-users@lists.sourceforge.net
Subject: Re: [GeoNetwork-devel] OAI-PMH ListSets Validation Error: Can be silenced?

Hi Simon and John,

On Mon, Mar 28, 2011 at 4:51 PM, <Simon.Pigot@anonymised.com> wrote:
Hi John,

Two issues here:

1. the ListSets response from the equella server is invalid (as we found) which means that you can't get the sets & prefixes to search/use in the edit interface when adding an OAI-PMH harvester for an equella server. The fix here is for equella to fix their server - ok, not likely :slight_smile: - so we could relax the validation on this request in GeoNetwork but the risk is that we get back stuff we don't understand and further confuse the user by presenting junk in the interface. (A hack here is to run the ListSets request yourself and look at the response eg. http://equella-server/oai/provider?verb=ListSets or just use oai_dc as that set is defined anyway). Alternatively we could let the user enter their own set name and metadata prefix values like joai if we don't get anything (valid) back from the ListSets request in the harvester edit interface (which sounds preferable to me anyway) - anyone else want to comment?

I agree with Simon to say that the best fix should be to the provider to fix their invalid operation! :wink:

But, in the case where the ListSets response for the remote OAI provider is invalid, I would suggest to not use the set as a search criteria and perform a request (ListIdentifiers or ListRecords) without any set restriction. Because if you let the user enter a specific set name manually this will mostly lead to specify an unknown set and retrieve no results...but maybe I'm wrong. That's also the same for the metadataPrefix, because of the oai_dc default value you will be able to get results from the provider in dublin-core.

2. I think the invalid xml error you're referring to is how all the metadata records returned from the equella server are flagged when the harvester is actually run. What is happening here is that GeoNetwork is doing a GetRecords request and getting back a request from the equella server that includes the metadata record. This is fine but GeoNetwork attempts to validate the response which fails validation, probably because the equella server (in common with many other oai servers) is not including the schemaLocation attribute for the oai_dc schema on the root element of the embedded metadata record in the GetRecords response. This is perhaps a bit too restrictive on our part as schemaLocation is optional I think on all XML records. Anyway this can be relaxed by leaving validation of the embedded metadata record until after GeoNetwork has made a guess at the schema to which it belongs - at that stage it can use the local schema to validate it and it doesn't need a schemaLocation attribute. This seems to me to be a worthwhile relaxation because it makes lots more records available from the servers I know about without increasing the risk of ingesting junk. Again, does anyone else want to comment on this?

I've tested the second fix in the ANZMEST code and I'll develop the first fix as well to prove the concept (although maybe Mathieu and Julien have done this in the refactoring of the OAI-PMH harvester they have done?).

At the time being we do not plan to support this case. We consider that if the provider is unable to provide any valid response for any operation, the provider is not usable as an harvesting source. So no fix to apply! :wink:

Cheers,

Mathieu

Cheers,
Simon

________________________________________
From: boabjohn [john@anonymised.com<mailto:john@anonymised.com>]
Sent: Sunday, 27 March 2011 8:36 PM
To: geonetwork-devel@lists.sourceforge.net<mailto:geonetwork-devel@anonymised.comurceforge.net>
Subject: [GeoNetwork-devel] OAI-PMH ListSets Validation Error: Can be silenced?

(Apologies: cross-posted from GN-Users...perhaps this is the better forum?)

G'Day all,

We're attempting to harvest from a records management platform called
Equella (http://www.equella.com/) which say they can support an OAI-PMH
endpoint.

However, GN throws an "invalid xml" error when we attempt to harvest.

After some laser-like investigation by a friendly wizard (thanks Simon!) it
looks like the problem is with the ListSets response on the Equella server.
It is returning text content in element which apparently is
not valid oaipmh (according to oaipmh XSDs).

Equella say they will fix it "some time soon..."

We need to harvest now.

Can anyone suggest how the validation error can be safely (and succinctly)
silenced so that the harvest can continue?

Thanks in advance,

JB

---

John Brisbin

Managing Director, BoaB interactive

[mb] +61 (0)407 471 565<tel:%2B61%20%280%29407%20471%20565>

[ph] +61 (0)7 3103 0574<tel:%2B61%20%280%297%203103%200574> (voice 2 text)

[im] skype:boabjohn

[www] http://www.boab.info

[po] POB 802, Townsville QLD 4810 AUSTRALIA

--
View this message in context: http://osgeo-org.1803224.n2.nabble.com/OAI-PMH-ListSets-Validation-Error-Can-be-silenced-tp6211988p6211988.html
Sent from the GeoNetwork developer mailing list archive at Nabble.com.

------------------------------------------------------------------------------
Enable your software for Intel(R) Active Management Technology to meet the
growing manageability and security demands of your customers. Businesses
are taking advantage of Intel(R) vPro (TM) technology - will your software
be a part of the solution? Download the Intel(R) Manageability Checker
today! http://p.sf.net/sfu/intel-dev2devmar
_______________________________________________
GeoNetwork-devel mailing list
GeoNetwork-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
GeoNetwork OpenSource is maintained at http://sourceforge.net/projects/geonetwork

Hi all,

on point 2) I would like to mention that flexibility is important when it
comes to validating incoming stuff, as one might like to enforce their own
policies on incoming records. What people would actually like to do is to
not just validate but check other constraints, so a plugin mechanism would
be good here.

on 1) I would caution against working around broken stuff. There is no correct
behaviour to be expected from a broken provider. What is more, it makes
your own software more bloated, with all the consequences following from
that.
A compromise would be to show the boxes for metadata format and set by default,
together with an "autodetect" button. This would make the behaviour
predictable for users.

While this is slightly OT here, the GN provider also causes harvesters to
fail validation, as pointed out in my other email, at least if the
metadata record declares the same XML namespace as the OAI
GetRecord/ListRecords response.

The question is how to deal with duplicate namespaces when self-contained
XML records are embedded in a transport xml structure.

If the whole XML (container + embedded doc) is considered as one document,
it is XML-wise correct to remove duplicate namespace declarations. However, this
requires a harvester to reassemble the namespaces when extracting the
embedded document from the OAI response. Most harvesters do not do this,
but just extract the string instead.

On the other hand, if one considers the OAI transport container and the embedded
XML as different documents, there would be no problem with namespaces, but they
could not be treated as one DOM, as is currently the case inside GN.
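
To make the namespace point concrete, here is a minimal Python sketch. It is not GeoNetwork code (GN uses JDOM); the sample response and both helper functions are made up for illustration. It assumes a response in which the namespaces of the embedded record are declared only on the OAI container. Cutting the record out as a string leaves the prefixes unbound, while extracting it through a parsed tree and re-serialising it puts the declarations back on the record.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical response: the record's namespaces are declared on the
# container only, as happens when duplicate declarations are removed.
RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <GetRecord><record><metadata>
    <oai_dc:dc><dc:title>Example</dc:title></oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""


def naive_slice(response_xml):
    """Cut the record out as a plain string (what many harvesters do)."""
    start = response_xml.index("<metadata>") + len("<metadata>")
    end = response_xml.index("</metadata>")
    return response_xml[start:end].strip()


def extract_records(response_xml):
    """Extract each embedded record via the parsed tree.

    ElementTree keeps the namespace URI with every element, so the
    serialised subtree carries its own xmlns declarations again
    (possibly under different prefixes).
    """
    root = ET.fromstring(response_xml)
    records = []
    for metadata in root.iter(f"{{{OAI_NS}}}metadata"):
        for child in metadata:
            records.append(ET.tostring(child, encoding="unicode").strip())
    return records
```

Parsing the output of `naive_slice(RESPONSE)` on its own fails with an unbound-prefix error, whereas every string returned by `extract_records(RESPONSE)` parses as a standalone document.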

best
Timo

Mathieu,

Whilst the principle of 'if the implementation is broken the provider
should fix it' is fine, I think what we are proposing here is simply being
a little flexible about what we accept. So with respect to:

1. Getting the sets and prefixes from the provider for the harvester
interface: It seems fine to me to offer the user a chance to enter these
things manually if nothing or an invalid response comes back from the
server - don't need to show the manual boxes until then and then at least
the (informed) user has some chance of getting back something other than
just oai_dc records describing the whole collection :slight_smile:

2. Validating the GetRecord/ListRecords response and embedded metadata
records: Don't be too quick to say that you aren't implementing it! I
think we're actually talking about how GeoNetwork validates the response
here - there are two options: one is as I've described below (extract
metadata record(s), detect schema and validate against schema XSDs in
GeoNetwork), the other relies on having oasis catalogs with mappings of
namespace URIs and schemaLocation URLs to GeoNetwork schema XSDs (and
would only work in 2.7 or higher as that is the only one that implements
the oasis and resolver stuff). What we're trying to work out here is which
validation approach we should follow here. My feeling is that the 'extract
and then validate' approach gives more info back to the harvest result
than the 'validate the whole response' approach (especially if we switch
the harvester to use ListRecords).
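
As a rough illustration of the 'extract and then validate' approach, the Python sketch below pulls each record out of a ListRecords response, guesses the schema from the namespace of the record's root element, and only then calls a validator for that schema. Everything named here is an assumption for the example: the namespace-to-schema table is illustrative rather than GeoNetwork's real detection logic, and the validators are plain callables standing in for validation against the local XSDs.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Illustrative mapping only; not GeoNetwork's actual schema detection.
SCHEMA_BY_ROOT_NAMESPACE = {
    "http://www.openarchives.org/OAI/2.0/oai_dc/": "oai_dc",
    "http://www.isotc211.org/2005/gmd": "iso19139",
}


def detect_schema(record):
    """Guess the schema from the namespace of the record's root element."""
    if record.tag.startswith("{"):
        namespace = record.tag[1:].split("}", 1)[0]
        return SCHEMA_BY_ROOT_NAMESPACE.get(namespace)
    return None


def harvest_results(response_xml, validators):
    """Return one (identifier, status) pair per record in the response.

    `validators` maps a schema name to a callable taking the record
    element and returning True or False (a stand-in for XSD validation).
    """
    root = ET.fromstring(response_xml)
    results = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(
            f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier", default="?"
        )
        metadata = record.find(f"{{{OAI_NS}}}metadata")
        if metadata is None or len(metadata) == 0:
            results.append((identifier, "no metadata"))
            continue
        schema = detect_schema(metadata[0])
        if schema is None or schema not in validators:
            results.append((identifier, "unknown schema"))
        elif validators[schema](metadata[0]):
            results.append((identifier, f"valid {schema}"))
        else:
            results.append((identifier, f"invalid {schema}"))
    return results
```

Because each record gets its own status, one bad record does not cause the whole response to be rejected, which is the extra information in the harvest result mentioned above.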

Let me know what you think and whether we're still talking about the same
thing here :slight_smile:

Cheers,
Simon
________________________________________
From: Mathieu Coudert [mathieu.coudert@anonymised.com]
Sent: Tuesday, 29 March 2011 8:28 PM
To: Pigot, Simon (CMAR, Hobart)
Cc: john@anonymised.com; geonetwork-devel@lists.sourceforge.net;
geonetwork-users@lists.sourceforge.net
Subject: Re: [GeoNetwork-devel] OAI-PMH ListSets Validation Error: Can be
silenced?

Hi Simon and John,

On Mon, Mar 28, 2011 at 4:51 PM, <Simon.Pigot@anonymised.com> wrote:
Hi John,

Two issues here:

1. the ListSets response from the equella server is invalid (as we found)
which means that you can't get the sets & prefixes to search/use in the
edit interface when adding an OAI-PMH harvester for an equella server. The
fix here is for equella to fix their server - ok, not likely :slight_smile: - so we
could relax the validation on this request in GeoNetwork but the risk is
that we get back stuff we don't understand and further confuse the user by
presenting junk in the interface. (A hack here is to run the ListSets
request yourself and look at the response eg.
http://equella-server/oai/provider?verb=ListSets or just use oai_dc as
that set is defined anyway). Alternatively we could let the user enter
their own set name and metadata prefix values like joai if we don't get
anything (valid) back from the ListSets request in the harvester edit
interface (which sounds preferable to me anyway) - anyone else want to
comment?

I agree with Simon that the best fix would be for the provider to
fix their invalid operation! :wink:

But in the case where the ListSets response from the remote OAI provider
is invalid, I would suggest not using the set as a search criterion and
performing a request (ListIdentifiers or ListRecords) without any set
restriction. If you let the user enter a specific set name
manually, this will mostly lead to specifying an unknown set and retrieving no
results... but maybe I'm wrong. The same goes for the
metadataPrefix: because of the oai_dc default value you will be able to
get results from the provider in Dublin Core.
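
A small Python sketch of such a request, assuming the base URL pattern from the ListSets example earlier in the thread; the function name is made up for illustration. In OAI-PMH the `set` argument of ListRecords is optional and `oai_dc` must be supported by every repository, so the request can always be built without a set.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL; the set restriction is optional."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Only restrict by set when one was actually chosen or entered.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"
```

For example, `list_records_url("http://equella-server/oai/provider")` requests every record in Dublin Core with no set restriction.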

2. I think the invalid xml error you're referring to is how all the
metadata records returned from the equella server are flagged when the
harvester is actually run. What is happening here is that GeoNetwork is
doing a GetRecord request and getting back a response from the equella
server that includes the metadata record. This is fine but GeoNetwork
attempts to validate the response which fails validation, probably because
the equella server (in common with many other oai servers) is not
including the schemaLocation attribute for the oai_dc schema on the root
element of the embedded metadata record in the GetRecord response. This
is perhaps a bit too restrictive on our part as schemaLocation is optional
I think on all XML records. Anyway this can be relaxed by leaving
validation of the embedded metadata record until after GeoNetwork has made
a guess at the schema to which it belongs - at that stage it can use the
local schema to validate it and it doesn't need a schemaLocation
attribute. This seems to me to be a worthwhile relaxation because it
makes lots more records available from the servers I know about without
increasing the risk of ingesting junk. Again, does anyone else want to
comment on this?

I've tested the second fix in the ANZMEST code and I'll develop the first
fix as well to prove the concept (although maybe Mathieu and Julien have
done this in the refactoring of the OAI-PMH harvester they have done?).

For the time being we do not plan to support this case. We consider that if
the provider is unable to provide any valid response for any operation,
the provider is not usable as a harvesting source. So no fix to apply! :wink:

Cheers,

Mathieu

Cheers,
Simon

Hi Timo et al,

------------------
<snip>

on 1) I would caution against working around broken stuff. There is no correct
behaviour to be expected from a broken provider. What is more, it makes
your own software more bloated, with all the consequences following from
that.
A compromise would be to show the boxes for metadata format and set by default,
together with an "autodetect" button. This would make the behaviour
predictable for users.

------------------

Yep - this is effectively what I'm proposing. The oaipmh harvester form would offer text input boxes for prefix and set if the user pressed the 'Retrieve Info' button when defining a search and didn't get back anything to fill out the select pulldown menus. It's basically a few extra lines of code so no bloat and it gives just a little extra flexibility. It's done for the anzmest sandbox anyway so anyone interested can take a look there later.
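
The fallback can be sketched in a few lines of Python. This is not the anzmest code: the function names and the returned dictionary are invented for the example, and the sketch only checks that the ListSets response parses and contains at least one set, whereas the real harvester validates against the OAI-PMH XSDs.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def parse_sets(response_xml):
    """Return (setSpec, setName) pairs, or an empty list if unusable."""
    try:
        root = ET.fromstring(response_xml)
    except ET.ParseError:
        return []
    sets = []
    for set_el in root.iter(f"{{{OAI_NS}}}set"):
        spec = set_el.findtext(f"{{{OAI_NS}}}setSpec")
        name = set_el.findtext(f"{{{OAI_NS}}}setName")
        if spec:
            sets.append((spec.strip(), (name or spec).strip()))
    return sets


def set_field(response_xml):
    """Choose the form widget: pulldown if sets came back, else free text."""
    sets = parse_sets(response_xml)
    if sets:
        return {"widget": "select", "options": sets}
    return {"widget": "text", "options": []}
```

The text box only appears when 'Retrieve Info' returns nothing usable, so a well-behaved provider still gets the pulldown menus.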

Apologies for snipping the rest of your text - I'd like to take some time to look at that too but time is really short right now....

Cheers,
Simon

Hi,

Yep - this is effectively what I'm proposing. The oaipmh harvester form
would offer text input boxes for prefix and set if the user pressed the
'Retrieve Info' button when defining a search and didn't get back anything
to fill out the select pulldown menus. It's basically a few extra lines of
code so no bloat and it gives just a little extra flexibility. It's done
for the anzmest sandbox anyway so anyone interested can take a look there
later.

ok

Apologies for snipping the rest of your text - I'd like to take some time
to look at that too but time is really short right now....

I have started investigating the matter a bit. Here is some additional info.

http://stackoverflow.com/questions/5469878/jdom-removes-duplicate-namespace-declaration-xmloutputter

http://www.openarchives.org/pipermail/oai-implementers/2011-March/002049.html

Timo

Cheers,
Simon