[GeoNetwork-devel] clarification on dates / possible bug

Hello GN-developers!

My name is Timo Proescholdt. I work at the World Meteorological
Organization in the WMO information system office (together with
David Thomas and Eliot Christian).
We have been looking into Z39.50 and SRU support for GN and are
currently implementing both with the new Jzkit2 library. I want to
share the code in the next few days, but wanted to ask for
clarification on another point first.

We have also looked into date-search support for classic Z39.50.
For this I changed z3959Server.xml to support date queries, mapping them
to Lucene RangeQueries (no Java code change required).
In the process I ran across several puzzles.

First, suppose we have the following two records in the index and DB,
added via the import function:

id | _changedate
1 | 2009-11-26T16:23:22
2 | 2009-11-27T16:23:22

If we search using the GN web frontend for records between
2009-11-26T00:00:00 and 2009-11-29T00:00:00 we would
expect records 1 and 2 to come up. However, only record 2 is returned.
To get both, the lower bound has to go back to 2009-11-25T00:00:00.

This is because the date string is indexed with the capital 'T'
but LuceneSearcher.java [line: 472] changes the query to lowercase.
If I issue the query in uppercase (e.g. with Luke) then I get the
correct(?) result.

Lucene orders the terms lexicographically, and since uppercase 'T'
sorts before lowercase 't' the result is this odd-looking order:
2009-11-25t00:00:00
2009-11-26T16:23:22
2009-11-26t00:00:00

I don't know if this is the expected behaviour, since it is not clear
whether the first date of the range is included, but the 00:00:00
time suggests that it should be. I want to point out that this has
nothing to do with the changes we made but is the behaviour of a
vanilla system.
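
The effect can be reproduced outside Lucene with plain string comparison. The following is only a minimal sketch, assuming the range is inclusive and that terms are compared lexicographically; the class and method names are made up for illustration and are not GeoNetwork or Lucene code.

```java
import java.util.Locale;

class DateRangeCaseDemo {
    // Inclusive range check by plain lexicographic comparison of the terms.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String record1 = "2009-11-26T16:23:22"; // indexed as-is, capital 'T'
        String record2 = "2009-11-27T16:23:22";

        // The query bounds are lowercased before the search, as described above.
        String lower = "2009-11-26T00:00:00".toLowerCase(Locale.ROOT);
        String upper = "2009-11-29T00:00:00".toLowerCase(Locale.ROOT);

        System.out.println(inRange(record1, lower, upper)); // false: 'T' (0x54) < 't' (0x74)
        System.out.println(inRange(record2, lower, upper)); // true
        // With uppercase bounds both records match.
        System.out.println(inRange(record1, "2009-11-26T00:00:00", "2009-11-29T00:00:00")); // true
    }
}
```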

I have also noticed that the Lucene index contains several almost
identical indices and I was wondering if somebody could enlighten me
as to their intended usage. There is _changedate, _createdate and
changedate (without the underscore) (and possibly many more dates).

Am I right that the indices prefixed with an underscore are identical
to the database fields, whereas everything else comes from the metadata
itself (e.g. extracted by index-fields.xsl)? If this is the case, what
sense does that make, since the DB changedate and createdate are filled
in when the metadata is imported by extracting these very dates
from the metadata (Importer.java line: 152)?

A possible interpretation would be that the _ prefixed ones are allowed
to change whereas everything else is bound to be equal to the actual
metadata?

Code reading cannot help me here so I would very much appreciate
clarification on how the dates are supposed to be used.

many thanks and best regards
Timo

--
Timo Pröscholdt
Program Officer, WMO Information System (WIS)
Observing and Information Systems Department
World Meteorological Organization
Tel: +41 22 730 81 76
Cell: +41 77 40 63 554
e-mail: tproescholdt@anonymised.com

Hi Timo,

I'll weigh in first, mainly because some of the things you are discussing and working on have come up in the Australian (and the US) community. It's great (though not unexpected) to hear that they are not unique to one or two communities, which means we can solve them together more efficiently (many hands etc.) and probably more effectively too.

Timo Proescholdt wrote:

Hello GN-developers!

My name is Timo Proescholdt. I work at the World Meteorological
Organization in the WMO information system office (together with
David Thomas and Eliot Christian).
We have been looking into Z39.50 and SRU support for GN and are
currently implementing both with the new Jzkit2 library. I want to
share the code in the next few days, but wanted to ask for
clarification on another point first.
  

Great - we'd like to test this code out. Could you comment on whether your implementation in JZKit2 solves some of the existing issues that have been found with the JZKit1 implementation? (See the message from software-improvements, Richard Walker, in http://n2.nabble.com/Z3950-in-2-1-td3429902.html)

We have also looked into date-search support for classic Z39.50.
For this I changed z3959Server.xml to support date queries, mapping them
to Lucene RangeQueries (no Java code change required).
In the process I ran across several puzzles.

First, suppose we have the following two records in the index and DB,
added via the import function:

id | _changedate
1 | 2009-11-26T16:23:22
2 | 2009-11-27T16:23:22

If we search using the GN web frontend for records between
2009-11-26T00:00:00 and 2009-11-29T00:00:00 we would
expect records 1 and 2 to come up. However, only record 2 is returned.
To get both, the lower bound has to go back to 2009-11-25T00:00:00.

This is because the date string is indexed with the capital 'T'
but LuceneSearcher.java [line: 472] changes the query to lowercase.
If I issue the query in uppercase (e.g. with Luke) then I get the
correct(?) result.

Lucene orders the terms lexicographically, and since uppercase 'T'
sorts before lowercase 't' the result is this odd-looking order:
2009-11-25t00:00:00
2009-11-26T16:23:22
2009-11-26t00:00:00

I don't know if this is the expected behaviour, since it is not clear
whether the first date of the range is included, but the 00:00:00
time suggests that it should be. I want to point out that this has
nothing to do with the changes we made but is the behaviour of a
vanilla system.
  

I think this is a bug. It seems to me that the same Lucene analyzer should be applied by GeoNetwork to both the query and the indexed dates; otherwise the search is not guaranteed to return sensible results (as I think you've found), i.e. the T separators in both dates should be both uppercase or both lowercase.
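
To illustrate the idea of using one normalisation for both sides, here is a rough sketch in plain Java. It is not how GeoNetwork or a Lucene analyzer is actually wired up; the class NormalizedDateIndex and its methods are hypothetical, and a sorted map stands in for the index. The only point is that add() and search() both go through the same normalize() call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

class NormalizedDateIndex {
    // Sorted term -> record ids, standing in for the real index (assumption).
    private final TreeMap<String, List<Integer>> terms = new TreeMap<>();

    // The single normalisation step shared by indexing and querying.
    static String normalize(String date) {
        return date.trim().toUpperCase(Locale.ROOT);
    }

    void add(int id, String date) {
        terms.computeIfAbsent(normalize(date), k -> new ArrayList<>()).add(id);
    }

    // Inclusive range search; the bounds are normalised exactly like the terms.
    List<Integer> search(String from, String to) {
        List<Integer> hits = new ArrayList<>();
        for (List<Integer> ids : terms.subMap(normalize(from), true, normalize(to), true).values()) {
            hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        NormalizedDateIndex index = new NormalizedDateIndex();
        index.add(1, "2009-11-26T16:23:22");
        index.add(2, "2009-11-27T16:23:22");
        // Lowercase bounds no longer matter: prints [1, 2]
        System.out.println(index.search("2009-11-26t00:00:00", "2009-11-29t00:00:00"));
    }
}
```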

I have also noticed that the Lucene index contains several almost
identical indices and I was wondering if somebody could enlighten me
as to their intended usage. There is _changedate, _createdate and
changedate (without the underscore) (and possibly many more dates).

Am I right that the indices prefixed with an underscore are identical
to the database fields, whereas everything else comes from the metadata
itself (e.g. extracted by index-fields.xsl)? If this is the case, what
sense does that make, since the DB changedate and createdate are filled
in when the metadata is imported by extracting these very dates
from the metadata (Importer.java line: 152)?

A possible interpretation would be that the _ prefixed ones are allowed
to change whereas everything else is bound to be equal to the actual
metadata?

Code reading cannot help me here so I would very much appreciate
clarification on how the dates are supposed to be used.
  

I think your interpretation of where the fields prefixed with _ come from is mostly right, but I think fields with an _ prefix are more generally for GeoNetwork internal use (not supposed to be searchable directly by users). For example, _title does not come from the database; it is a non-tokenized version of the title from the metadata, used for ordering search results by title. The general idea of indexing database content alongside the metadata fields makes good sense to me, as there is then only one thing to search (Lucene) rather than Lucene and the database. You also need some of these things in Lucene, e.g. privileges, to make sure that search results do not include content that the user is not privileged to see/manipulate.

As for changeDate, _changeDate and _createDate, I think you've understood them correctly: the prefixed ones are from the database, while changeDate appears to be from the metadata (if it has such a field). So if you were searching externally, I think you would have to use changeDate, as that is what is in the metadata and what the searcher would see in any records returned. (Note I think that indexing gmd:dateStamp as the changeDate might not be a correct interpretation of the ISO 19115 standard, though, as gmd:dateStamp is described in the standard as the creation date for the metadata record.)

Filling the changedate and createdate in the database with content coming from the metadata seems OK to me if you're importing metadata. They do get modified by GN in other places, e.g. changedate is changed when a record is edited and saved, and createdate is filled in when a record is cloned. You have to have both the database fields and the metadata fields in the Lucene index because some metadata standards don't have one or the other and GeoNetwork needs both of them, e.g. changedate for user search queries and _changedate for things like sorting hits on changedate.

Cheers and thanks - looking forward to seeing the new Z code!
Simon

Simon Pigot wrote:

Hi Timo,

Hi Simon,

I'll weigh in first - mainly because some of the things you are
discussing and working on have come up in the Australian (and the US)
community and it's great (though not unexpected) to hear that they are
not unique to one or two communities and that means we can solve them
together more efficiently (many hands etc.) and probably more
effectively too.

Great to hear. I will come back to you in the next couple of days on
the JZkit stuff.
We will probably also work on speeding up search in GN. Areas that
need some work according to our analysis are date range queries and
geospatial indexing. My preliminary idea for how to go about it is
to store dates more efficiently (different resolutions, numerically?)
and to map a geospatial indexing algorithm onto Lucene.

This is because the date string is indexed with the capital 'T'
but LuceneSearcher.java [line: 472] changes the query to lowercase.
If I issue the query in uppercase (e.g. with Luke) then I get the
correct(?) result.

Lucene orders the terms lexicographically, and since uppercase 'T'
sorts before lowercase 't' the result is this odd-looking order:
2009-11-25t00:00:00
2009-11-26T16:23:22
2009-11-26t00:00:00

I don't know if this is the expected behaviour, since it is not clear
whether the first date of the range is included, but the 00:00:00
time suggests that it should be. I want to point out that this has
nothing to do with the changes we made but is the behaviour of a
vanilla system.
  

I think this is a bug - seems to me that the same Lucene analyzer should
be applied by GeoNetwork to both the query and the indexed dates -
otherwise the search is not guaranteed to return sensible results (as I
think you've found), i.e. the T separators in both dates should be both
upper or both lower.

I share your analysis. I thought of the analyser as well, but could not
see it being used when the index is searched. Applying it there is
probably the way to fix it. BTW, it would be nice if the analyser could
be configured without changing the code (this would allow the user to
make certain fields upper or lower case).

I have also noticed that the Lucene index contains several almost
identical indices and I was wondering if somebody could enlighten me
as to their intended usage. There is _changedate, _createdate and
changedate (without the underscore) (and possibly many more dates).

Am I right that the indices prefixed with an underscore are identical
to the database fields, whereas everything else comes from the metadata
itself (e.g. extracted by index-fields.xsl)? If this is the case, what
sense does that make, since the DB changedate and createdate are filled
in when the metadata is imported by extracting these very dates
from the metadata (Importer.java line: 152)?

A possible interpretation would be that the _ prefixed ones are allowed
to change whereas everything else is bound to be equal to the actual
metadata?

[..]

I think your interpretation of where those prefixed with the _ come
from is mostly right but I think fields with an _ prefix are more
generally for GeoNetwork internal use (not supposed to be searchable
directly by users) - as there are some such as _title which is not from

I traced the web interface search back to Lucene queries on _changedate
(this can be seen in xml/search/lucene.xsl line: 316),
but I share your view. The semantics of the _ indices are that they are
for internal usage. It might be that people just jumped on them later
for searching due to the ambiguity in the names.
Is this a bug then (should search on createdate)?

the database but is a non-tokenized version of title from the metadata
used for ordering search results by title. The general idea of indexing
database content with the metadata fields makes good sense to me as
there is only one thing to search - Lucene - not Lucene and the database
- also you need some of these things in Lucene, e.g. privileges, to make
sure that search results do not include content that the user is not
privileged to see/manipulate.

As for changeDate, _changeDate and _createDate, I think you've
understood them correctly, the prefixed ones are from the database -

Thanks for your other comments, they have been very helpful to me.

Cheers and thanks - looking forward to seeing the new Z code!
Simon

cheers, looking forward to sharing code with you!
Timo

Timo Proescholdt wrote:

[snip]
Great to hear. I will come back to you in the next couple of days on
the JZkit stuff.
We will probably also work on speeding up search in GN. Areas that
need some work according to our analysis are date range queries and
geospatial indexing. My preliminary idea for how to go about it is
to store dates more efficiently (different resolutions, numerically?)
and to map a geospatial indexing algorithm onto Lucene.
  

Hi Timo,

A couple of things spring to mind here about dates and date formats. We are going to propose a change from the BlueNetMEST sandbox that handles searching on temporal extents from metadata records (ISO being the most obvious). One of the issues was parsing all the different date formats that could be used for the temporal extents into a common format for the index. We used JODA to do the parsing and chose ISO time as output. It might be useful for you to look at while you are doing your research, as it accepts date-times in many formats and produces one output (which could be changed from an ISO time string to some sort of number, perhaps with NumericRangeQuery in Lucene for faster searching?).
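
As a rough illustration of "many input formats, one numeric output", here is a small sketch. It uses java.time from the JDK rather than JODA so that it is self-contained, which is a substitution and not the BlueNetMEST code. The class name, the three accepted formats and the choice of UTC epoch seconds as the numeric value are all assumptions made for the example.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class DateNormalizer {
    // Date-only formats tried in order: 2009-11-26, then 20091126.
    private static final DateTimeFormatter[] DATE_ONLY = {
        DateTimeFormatter.ISO_LOCAL_DATE,
        DateTimeFormatter.BASIC_ISO_DATE
    };

    // Converts a date or date-time string to seconds since the epoch,
    // treating values without a zone as UTC (an assumption of this sketch).
    static long toEpochSeconds(String text) {
        String s = text.trim();
        try {
            return LocalDateTime.parse(s, DateTimeFormatter.ISO_LOCAL_DATE_TIME)
                    .toEpochSecond(ZoneOffset.UTC);
        } catch (DateTimeParseException notADateTime) {
            // fall through to the date-only formats
        }
        for (DateTimeFormatter format : DATE_ONLY) {
            try {
                return LocalDate.parse(s, format).atStartOfDay().toEpochSecond(ZoneOffset.UTC);
            } catch (DateTimeParseException tryNext) {
                // try the next format
            }
        }
        throw new IllegalArgumentException("Unrecognised date: " + text);
    }

    public static void main(String[] args) {
        // All three inputs end up as comparable numbers.
        System.out.println(toEpochSeconds("2009-11-26T16:23:22"));
        System.out.println(toEpochSeconds("2009-11-26"));
        System.out.println(toEpochSeconds("20091126"));
    }
}
```

Once the values are numbers, a range search becomes a numeric comparison and the upper/lower case question for the 'T' separator goes away.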

Also, the spatial indexing stuff is already handled in trunk (which I guess you know), not with Lucene but with GeoTools. It would be nice to use Lucene for all sorts of reasons, but I am not sure that the efforts to graft spatial onto Lucene have actually gotten there yet, especially when, say, comparing polygons for spatial relationships.

[snip] - More to come later on the use of the Lucene fields changeDate etc., as it's late down here. I'm thinking that, given the syncing that goes on between _changeDate and changeDate, it might be appropriate in this case only to use _changeDate for user searching (it certainly makes sense as a possible search for an admin user).

Cheers,
Simon

Hi Simon,
hi all,

Simon Pigot wrote:

Timo Proescholdt wrote:

[snip]

[..]

geospatial indexing. My preliminary idea for how to go about it is
to store dates more efficiently (different resolutions, numerically?)
and to map a geospatial indexing algorithm onto Lucene.
  

Hi Timo,

A couple of things spring to mind here - about dates and date formats -
we are going to propose a change from the BlueNetMEST sandbox that
handles searching on temporal extents from metadata records (ISO being
the most obvious) - one of the issues was parsing all the different date
formats that could be used for the temporal extents into a common format
for the index - we used JODA to do the parsing and chose ISO time as
output - might be useful for you to look at if you are doing your
research as it gathers date-time formats from many formats and produces
one output (which could be changed from ISO time string to some sort of
number and use NumericRangeQuery in Lucene maybe for faster searching?).

Good to know. I don't know how deeply this matter has been discussed
already. If it has, please ignore what follows. I'm relatively new to
GN so sorry if I stir up old issues, but I'm definitely interested in
contributing.

Implementation-wise I think it would be good if there was
a DateRangeQuery type in the GN XML query language which could then be
an interface to the actual indexing and mapped to whatever indexing
structure is deemed best. That way the user-generated(?) XSLT sheets for
search could be retained while the indexing structure is allowed to
change (reindexing needed of course). This type could also handle the
date string parsing in the way you described.
Effective indexing strategies for dates depend on the resolution needed.
As a rule of thumb only what is really searchable should be indexed.
In order to speed up the lookup one can use several indices with
different resolutions (e.g. one for year, year_month, year_month_day, ...).
These can then be used in combination with a Filter. Numerical indexing
should be faster for dates, IMHO.
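
A minimal sketch of the several-resolutions idea, in plain Java and only for illustration: the field names year, year_month and year_month_day follow the ones mentioned above, while the class name and the compact digit encoding are assumptions.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class DateResolutionKeys {
    // Builds one term per resolution for a single date, coarse to fine.
    static Map<String, String> keysFor(LocalDate date) {
        int y = date.getYear();
        int m = date.getMonthValue();
        int d = date.getDayOfMonth();
        Map<String, String> keys = new LinkedHashMap<>();
        keys.put("year", String.format(Locale.ROOT, "%04d", y));
        keys.put("year_month", String.format(Locale.ROOT, "%04d%02d", y, m));
        keys.put("year_month_day", String.format(Locale.ROOT, "%04d%02d%02d", y, m, d));
        return keys;
    }

    public static void main(String[] args) {
        // prints {year=2009, year_month=200911, year_month_day=20091126}
        System.out.println(keysFor(LocalDate.of(2009, 11, 26)));
    }
}
```

With keys like these, a query for "everything changed in November 2009" can match the single term year_month=200911 instead of scanning a range of second-resolution terms.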

Also, the spatial indexing stuff is already handled in trunk (which I
guess you know) but not using Lucene, instead using GeoTools - would be
nice to use Lucene for all sorts of reasons but am not sure that all the
efforts to graft spatial onto Lucene have actually gotten there yet
especially when say comparing polygons for spatial relationships?

Is there a summary or documentation of the current implementation
and the requirements for the spatial indexing in GN?
My idea was actually just to use a grid index to store bounding
boxes more efficiently. I did not know that polygon queries have to
be supported, too.
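
To make the grid idea a bit more concrete, here is a small sketch of what I have in mind. It is a toy in-memory version with hypothetical names (BoundingBoxGrid, add, candidates), not a proposal for the actual GN or Lucene integration. It only returns candidate records that share a grid cell with the query box, so an exact intersection test would still be needed afterwards, and boxes crossing the date line (west > east) are not handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BoundingBoxGrid {
    private final double cellSize; // cell edge length in degrees
    private final Map<String, Set<Integer>> cells = new HashMap<>();

    BoundingBoxGrid(double cellSizeDegrees) {
        this.cellSize = cellSizeDegrees;
    }

    // Keys of all grid cells touched by the given box.
    private List<String> cellsFor(double west, double south, double east, double north) {
        List<String> keys = new ArrayList<>();
        int x0 = (int) Math.floor(west / cellSize);
        int x1 = (int) Math.floor(east / cellSize);
        int y0 = (int) Math.floor(south / cellSize);
        int y1 = (int) Math.floor(north / cellSize);
        for (int x = x0; x <= x1; x++) {
            for (int y = y0; y <= y1; y++) {
                keys.add(x + ":" + y);
            }
        }
        return keys;
    }

    // Registers a record id under every cell its bounding box touches.
    void add(int id, double west, double south, double east, double north) {
        for (String key : cellsFor(west, south, east, north)) {
            cells.computeIfAbsent(key, k -> new HashSet<>()).add(id);
        }
    }

    // Record ids sharing at least one cell with the query box (candidates only).
    Set<Integer> candidates(double west, double south, double east, double north) {
        Set<Integer> result = new TreeSet<>();
        for (String key : cellsFor(west, south, east, north)) {
            Set<Integer> ids = cells.get(key);
            if (ids != null) {
                result.addAll(ids);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BoundingBoxGrid grid = new BoundingBoxGrid(10.0);
        grid.add(1, 5, 45, 15, 55);      // a box over central Europe
        grid.add(2, 110, -40, 155, -10); // a box over Australia
        System.out.println(grid.candidates(0, 40, 10, 50)); // prints [1]
    }
}
```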

[..]

Cheers,
Simon

best regards
Timo