[GeoNetwork-devel] EDINA GN activity - Spatial queries - Lucene and JTS

Hi all,
I think this discussion is of interest to the devel list.

Ciao,
Jeroen

Begin forwarded message:

From: Andrew Robson <andrew.melmoth@anonymised.com>
Date: August 31, 2006 6:32:26 PM GMT+02:00
Subject: Re: EDINA GN activity

Hello all,

I’ve sent Andrea the GML to WKT library. I thought I’d just pick up a couple of things
from the mails earlier today.

Firstly, why I think geo-enabling Lucene is a bigger task than it might look at first sight.
The assumption that you are doing tests for textual similarity runs through the Lucene
codebase from top to bottom. After tracing a number of different queries I’m not at all
sure how you would intervene to do a spatial match using the WKT field or how you could
combine that with other query types.
The RangeQuery is not much help as a model because, as far as I can see, Lucene simply uses its
indexes to rewrite a range query as a set of term queries. The Boolean query, too, does quite a
lot of rewriting of its queries before it submits them to the main search classes. It is not at all clear
that the Boolean query in its present form offers us much in dealing with the arbitrary complexity
of the Filter spec. The base operation just isn’t abstract enough, it seems to me, to simply slot in
query types that cannot ultimately be expressed as text matches.
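
To make the range-query point concrete, here is a minimal sketch in plain Java of what "rewriting a range as a set of term queries" amounts to. It is not Lucene code: the class and method names are hypothetical, and a TreeSet stands in for the sorted term dictionary. The range is resolved by walking the sorted terms between two bounds, and each term found would become one exact-match clause.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: mimics the idea of expanding a range into exact terms.
// This is not Lucene's API; every name here is made up for the sketch.
class RangeRewriteSketch {

    // Expand the inclusive range [lower, upper] into the indexed terms it covers.
    static List<String> rewriteRange(TreeSet<String> termDictionary,
                                     String lower, String upper) {
        // subSet walks the sorted dictionary between the two bounds.
        return new ArrayList<>(termDictionary.subSet(lower, true, upper, true));
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>();
        terms.add("19990101");
        terms.add("20050615");
        terms.add("20060831");
        terms.add("20070101");
        // Each returned term would become one exact-match clause.
        System.out.println(rewriteRange(terms, "20050101", "20061231"));
    }
}
```

The trick depends on terms having a single sort order. A two-dimensional extent stored as WKT text has no such order, which is why this model does not carry over to spatial matches.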
My initial conclusion was that what would be required would be to start from the bottom,
implementing one’s own spatial indexing system and redesigning more or less all the major
Lucene classes to support a greater degree of abstraction.
(I had a look at the JTS implementations of in-memory quadtrees, STR trees and bin trees,
which would probably be a good starting point.)
I could be wrong about this, of course. At its lower levels Lucene is not the most comprehensible
code I’ve ever looked at.
I’m going to write down my thoughts in more detail. I may change my mind about whether it would
take 2 months, but unless I get a cleverer idea the change might not necessarily be for the better.

Secondly - The error message on GetRecords for FGDC was
[Fatal Error] ogc-full.xsl:64:68: The prefix “ows” for element “ows:BoundingBox” is not bound.

FATAL ERROR: ‘Could not compile stylesheet’
I imagine this is simple to fix.
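
The stylesheet error above is the generic "unbound namespace prefix" failure: an element named ows:BoundingBox is used without an xmlns:ows declaration in scope. A minimal sketch using only the JDK's XML parser reproduces the failure and shows the likely shape of the fix. The two sample snippets and the namespace URI http://www.opengis.net/ows are assumptions for illustration; the real ogc-full.xsl is not shown in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks whether a snippet is namespace-well-formed.
class PrefixBindingSketch {

    // Returns true if the XML parses with a namespace-aware parser.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // An unbound prefix is reported as a fatal parse error.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed stand-ins for the stylesheet fragment, before and after the fix.
        String broken = "<root><ows:BoundingBox/></root>";
        String fixed = "<root xmlns:ows=\"http://www.opengis.net/ows\">"
                + "<ows:BoundingBox/></root>";
        System.out.println(parses(broken));
        System.out.println(parses(fixed));
    }
}
```

If that reading is right, declaring the ows prefix on the root element of the stylesheet should be enough to let it compile.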

hope this is helpful
andrew

On 31/08/06, Andrea Carboni <acarboni@anonymised.com> wrote:

Hi Chris,

here are my answers:

Andrew Robson has conducted further investigation of the incorporation of
spatial queries into Lucene. We will summarise this in a document and
pass it to you. As the kind of sophisticated spatial query this work
entails is not required for minimal compliance with the ISO AP of CSW 2.0,
we are not inclined to proceed with this at the moment. Andrew reckons
there is about 2 months’ worth of work here. Happy to discuss this
further.

In his report, Andrew spoke about a simple library to convert from GML to WKT.
If you send us that library (with sources) we can investigate ourselves.

  • Our understanding is that, in order to be minimally compliant with Cat2
    CSW (04-021r3), the BBOX operator must be supported (p122)?

Yes.

  • Do you believe that the current version of GN CSW implements BBOX
    according to the definition in the FE Spec (02-059), ie, “all geometries
    that spatially interact with the box in some manner (p10)”? I am asking
    so that we can be clear about what the software should do. As you will
    have guessed, we had some trouble getting this working.

GeoNetwork uses version 1.1 of the FE specification (04-095). After
converting our 19115 metadata into the 19139 equivalent, I tested the code
against the simple BBOX that is contained in our metadata. The behaviour
of the operator is specified in the 1.1 spec, but tests against complex geometries
still need to be done.
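
For the simple box-against-box case, the "spatially interact in some manner" reading of BBOX quoted above reduces to an envelope overlap test. Here is a minimal sketch in plain Java, with hypothetical class and field names, assuming axis-aligned boxes that do not cross the antimeridian. It says nothing about how GeoNetwork implements the operator, and complex geometries would need a real geometry library such as JTS.

```java
// Hypothetical axis-aligned bounding box; not a GeoNetwork or JTS class.
class BboxSketch {
    final double west, south, east, north;

    BboxSketch(double west, double south, double east, double north) {
        if (west > east || south > north) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.west = west;
        this.south = south;
        this.east = east;
        this.north = north;
    }

    // True unless the boxes are disjoint; shared edges or corners count.
    boolean interactsWith(BboxSketch other) {
        return west <= other.east && other.west <= east
            && south <= other.north && other.south <= north;
    }

    public static void main(String[] args) {
        BboxSketch query = new BboxSketch(-10, 40, 5, 60);
        System.out.println(query.interactsWith(new BboxSketch(0, 50, 20, 70)));
        System.out.println(query.interactsWith(new BboxSketch(10, 40, 20, 60)));
    }
}
```

The inclusive comparisons are a design choice: they treat boxes that merely touch as interacting, i.e. BBOX is read as "not disjoint".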

  • The “GetRecords” request appears to work in some cases, but not others,
    e.g., for Dublin Core records it’s OK, FGDC has a bug and ISO19115 just
    doesn’t seem to be implemented

Would you like us to work on fixing this?

This is a subtle point. The CSW-ISO spec obviously refers to the 19139. This
means that:

  • the DC schema works only for the OGC CORE output schema. I made the
    mapping myself, so it may not be correct in some cases.

  • the FGDC schema works only for the OGC CORE output schema. I received
    the mapping from some guys at FGDC. Maybe this mapping needs to be
    improved.

  • the 19115 schema doesn’t have the CSW stylesheets because it is just a
    legacy schema.

  • The records that we are getting back (including ones we generate using
    GN) appear to be formatted to a much older version of 19139 than the
    recently ratified version…

We use the latest 19139 schema (presented in May/June).

We have fixed some things in the migration script. I suggest running that script
again if you want to use some 19115 metadata. Anyway, the Metadata.ddf file
contains 2 metadata records in 19139 format that can be used for testing.

Cheers,
Andrea