[GeoNetwork-devel] xml namespace problem in OAI response

Dear all,

The OAI actions that return embedded XML documents have a small but very
serious bug: XML namespace declarations in the embedded XML document are
removed if the namespace in question is already declared by the OAI
response.

The consequence is that an OAI harvester that extracts the embedded XML
from the OAI response as text will produce an invalid XML document,
because the prefix is no longer declared anywhere in the extracted part.

To see this, consider the following example.

An OAI response starts like this:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">

You can see that the xsi namespace is declared on the root element.

Now suppose that an XML document embedded in the OAI response (that is,
in a GetRecord or ListRecords response) also declares the xsi namespace.
Since GN parses the embedded XML into a JDOM tree and attaches it to the
internal JDOM tree that represents the OAI response, the duplicate
namespace declaration is removed.

The behaviour is reproduced by the attached code.
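
The attachment itself is not reproduced in this thread. As a rough
illustration of the consequence for a harvester, the sketch below uses only
the JDK parser (not JDOM and not GeoNetwork code). The shortened response,
the rec prefix, the example.org namespace and the class name are all
made up for the example. It cuts the record out of a response in which
xmlns:xsi is declared only on the root, and then tries to parse the result.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class CutAndPasteDemo {

    // Hypothetical, shortened OAI response. xmlns:xsi appears only on the
    // root, as it does once the duplicate declaration on the record is gone.
    public static final String RESPONSE =
        "<OAI-PMH xmlns=\"http://www.openarchives.org/OAI/2.0/\""
        + " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">"
        + "<metadata>"
        + "<rec:record xmlns:rec=\"http://example.org/rec\""
        + " xsi:schemaLocation=\"http://example.org/rec rec.xsd\"/>"
        + "</metadata>"
        + "</OAI-PMH>";

    /** Returns true if the text can be parsed as namespace-aware XML. */
    public static boolean parses(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    /** Naive harvester: cuts the record out of the response as plain text. */
    public static String cutOutRecord(String response) {
        int start = response.indexOf("<rec:record");
        int end = response.indexOf("</metadata>");
        return response.substring(start, end);
    }

    public static void main(String[] args) throws Exception {
        String fragment = cutOutRecord(RESPONSE);
        System.out.println("full response parses: " + parses(RESPONSE));
        System.out.println("cut-out record parses: " + parses(fragment));
    }
}
```

The full response should parse, while the cut-out record should be
rejected because the xsi prefix on xsi:schemaLocation is no longer bound.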

The question to the community is how to avoid this behaviour. I have
tracked the error to Record.java, toXml(), line 86. There, the parsed
metadata (in JDOM form) is attached to the OAI record response structure.
What I do not know is where the removal of the duplicate namespace
declaration takes place. It might well be done somewhere else
(in Jeeves?) when the XML is marshalled.
In my code example the culprit seems to be the XMLOutputter.

We need to find the part of GN responsible for the duplicate elimination
and then devise a way to tell it not to do so.
Other parts of GN may well be affected by this behaviour, too.

best regards
Timo

(attachments)

TestDomNamespace.java (2.56 KB)

One way to fix this is to patch the JDOM XMLOutputter so that it doesn't check the global namespace stack when adding namespace declarations to an element. Essentially, I've added a new setter on XMLOutputter, setDuplicateNamespacesAllowed(boolean), which toggles this behaviour. In GeoNetwork, I've created a new method in Jeeves.utils.Xml as an alternative to writeResponse that uses the XMLOutputter with setDuplicateNamespacesAllowed(true). This method is used (in the Jeeves ServiceManager) for services that produce XML (no output page), of which the OaiPmhDispatcher is one.

It works with ISO records that have the xmlns:xsi declaration that is common to the OAI response root element, but it obviously needs more testing to make sure that there are no unintended consequences before we let it out into the wider world.
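
To make the idea of a switchable namespace-stack check concrete, here is a toy model written for this thread. It is not JDOM code and not the actual patch: the Elem class and the ToyOutputter serializer are invented for illustration, and only the setter name is borrowed from the description above. The serializer keeps a stack of the declarations on the currently open elements and, unless the flag is set, skips any declaration that an enclosing element already provides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: this is not JDOM code. */
public class ToyOutputter {

    /** Minimal element: a name, its own prefix-to-URI declarations, children. */
    public static final class Elem {
        final String name;
        final Map<String, String> declared = new LinkedHashMap<>();
        final List<Elem> children = new ArrayList<>();

        public Elem(String name) { this.name = name; }
        public Elem declare(String prefix, String uri) { declared.put(prefix, uri); return this; }
        public Elem add(Elem child) { children.add(child); return this; }
    }

    private boolean duplicateNamespacesAllowed = false;
    // Declarations of the elements currently open, innermost first.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    public void setDuplicateNamespacesAllowed(boolean allowed) {
        this.duplicateNamespacesAllowed = allowed;
    }

    public String output(Elem root) {
        StringBuilder sb = new StringBuilder();
        write(root, sb);
        return sb.toString();
    }

    /** True if an enclosing element already binds this prefix to this URI. */
    private boolean inScope(String prefix, String uri) {
        for (Map<String, String> scope : scopes) {
            String bound = scope.get(prefix);
            if (bound != null) {
                return bound.equals(uri); // the nearest binding decides
            }
        }
        return false;
    }

    private void write(Elem e, StringBuilder sb) {
        sb.append('<').append(e.name);
        for (Map.Entry<String, String> ns : e.declared.entrySet()) {
            // This check against the stack is what drops a "duplicate".
            if (duplicateNamespacesAllowed || !inScope(ns.getKey(), ns.getValue())) {
                sb.append(" xmlns:").append(ns.getKey())
                  .append("=\"").append(ns.getValue()).append('"');
            }
        }
        if (e.children.isEmpty()) {
            sb.append("/>");
            return;
        }
        sb.append('>');
        scopes.push(e.declared);
        for (Elem child : e.children) {
            write(child, sb);
        }
        scopes.pop();
        sb.append("</").append(e.name).append('>');
    }

    public static void main(String[] args) {
        String xsi = "http://www.w3.org/2001/XMLSchema-instance";
        Elem response = new Elem("OAI-PMH").declare("xsi", xsi)
                .add(new Elem("record").declare("xsi", xsi));

        ToyOutputter out = new ToyOutputter();
        System.out.println(out.output(response)); // declaration on record dropped
        out.setDuplicateNamespacesAllowed(true);
        System.out.println(out.output(response)); // declaration on record kept
    }
}
```

With the flag off, the inner element is written without its own xmlns:xsi. With the flag on, the declaration is repeated, so the inner element stays self-contained when it is cut out as text.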

Cheers,
Simon
   
________________________________________
From: Timo Proescholdt [timo@anonymised.com]
Sent: Tuesday, 29 March 2011 12:07 AM
To: geonetwork-devel@lists.sourceforge.net
Subject: [GeoNetwork-devel] xml namespace problem in OAI response

Hi Simon,

Can you share the code? I have not seen it in the sandbox.

Interestingly, the behaviour is not that wrong after all: an intelligent
harvester would reassemble the namespaces when cutting out the embedded
XML. I wrote a proof of concept for that yesterday, but it seems
that most harvesters really do cut and paste without paying attention to
namespaces etc.
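
The proof of concept mentioned here was not posted to the list. The
following is only a sketch of what such a harvester-side step might look
like, using the JDK DOM and Transformer APIs. The class name, the helper
names, the shortened response and the example.org namespace are all
assumptions made for the example. It copies every namespace declaration
that is in scope on the ancestors down onto the record before serializing
it on its own.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NamespaceAwareExtract {

    public static final String OAI_NS = "http://www.openarchives.org/OAI/2.0/";
    public static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    // Hypothetical, shortened response: xmlns:xsi is declared on the root only.
    public static final String RESPONSE =
        "<OAI-PMH xmlns=\"" + OAI_NS + "\" xmlns:xsi=\"" + XSI_NS + "\">"
        + "<metadata>"
        + "<rec:record xmlns:rec=\"http://example.org/rec\""
        + " xsi:schemaLocation=\"http://example.org/rec rec.xsd\"/>"
        + "</metadata>"
        + "</OAI-PMH>";

    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Copies the xmlns declarations of all ancestors onto the element; the nearest one wins. */
    public static void pullDownNamespaces(Element record) {
        for (Node n = record.getParentNode(); n instanceof Element; n = n.getParentNode()) {
            NamedNodeMap attrs = n.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr a = (Attr) attrs.item(i);
                String name = a.getName();
                boolean isDeclaration = name.equals("xmlns") || name.startsWith("xmlns:");
                // Skip anything the record (or a nearer ancestor) already declared.
                if (isDeclaration && !record.hasAttribute(name)) {
                    record.setAttributeNS(XMLConstants.XMLNS_ATTRIBUTE_NS_URI, name, a.getValue());
                }
            }
        }
    }

    public static String serialize(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    /** Returns the first element inside the first OAI metadata element as standalone XML. */
    public static String extractRecord(String response) throws Exception {
        Document doc = parse(response);
        Node child = doc.getElementsByTagNameNS(OAI_NS, "metadata").item(0).getFirstChild();
        while (child != null && !(child instanceof Element)) {
            child = child.getNextSibling();
        }
        Element record = (Element) child;
        pullDownNamespaces(record);
        return serialize(record);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractRecord(RESPONSE));
    }
}
```

For simplicity the sketch copies all declarations in scope, including the
OAI default namespace. A tidier version could copy only the prefixes that
the record actually uses.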

best
Timo

Yep, will share it once I check it in shortly. And yep, I think this enhancement, if it works out, comes under the heading of GeoNetwork being flexible, as opposed to saying that all those other harvesters are broken and should be fixed :-)

Cheers,
Simon
________________________________________
From: Timo Proescholdt [timo@anonymised.com]
Sent: Wednesday, 30 March 2011 7:35 PM
To: Pigot, Simon (CMAR, Hobart)
Cc: geonetwork-devel@lists.sourceforge.net
Subject: RE: [GeoNetwork-devel] xml namespace problem in OAI response

svn commit 7570 into the BlueNetMEST/ANZMEST sandbox - you should be able to apply it to a working copy of trunk.

Cheers,
Simon