[GeoNetwork-devel] Metadata registries too domain specific or too general - how to fill in the gaps?

(Didn't those used to be known as Content Management Systems, back in
the day? What changed?)

Rufus suggested i start braindumping *somewhere*, about *something*
related to data repository / collective metadata management.
I have been suffering from a bit of intellectual constipation recently.
I once spent the years 2003-2005 rewriting the same graph-annotating
application over and over and over. Eventually i tired of that and
began to write variants on it, one of which was an early attempt at a
backend for CKAN, the knowledge package archive project. http://www.ckan.net/

As so often happens, the software stalled half-finished and mutated into something else.
Later, when i picked up the task of making metadata libraries for geodata,
i was very much aware of the conceptual overlap, and hoped to reintegrate later.
Meanwhile the Java OSGeo people would nudge me - "Why don't you just use
FAO GeoNetwork?" Eventually I learned to overcome some anti-java prejudice.

In the GIS software world there is an overfocus on standards. Given
several different domain models + XML serialisations, etc,
all differingly mandatory in regional government data policies, what
is an implementor to do? The GN answer is to use XML templates (or XML
documents as prototype templates) and to store per-package metadata in
BLOBs of XML in a database - extracting a few key properties for indexing.
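
To make that storage pattern concrete, here is a rough sketch of the
"XML blob plus a few indexed columns" shape in python - not GN's actual
schema or element names, just an illustration of the trade-off:

    import sqlite3
    from xml.etree import ElementTree as ET

    # one row per metadata record: the full XML is kept opaque,
    # only a handful of properties are pulled out for indexing
    db = sqlite3.connect("registry.db")
    db.execute("""CREATE TABLE IF NOT EXISTS record
                  (id INTEGER PRIMARY KEY, title TEXT, abstract TEXT, xml BLOB)""")

    def index_record(xml_text):
        tree = ET.fromstring(xml_text)
        # element names invented for the example, not real ISO 19115 paths
        title = tree.findtext("title") or ""
        abstract = tree.findtext("abstract") or ""
        db.execute("INSERT INTO record (title, abstract, xml) VALUES (?, ?, ?)",
                   (title, abstract, xml_text))
        db.commit()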

But it becomes harder to re-use descriptions of people or organisations -
and to get a 'network effect' from that re-use; and the *internal
structure* of these data sets often can't be expressed in the
prevalent standards (ISO 19115, FGDC, etc). The information models and
domain models for geographic data have a lot of specific details in them
that I don't necessarily want to fill out, or see on the screen.

I wrote a bit a while back about extrapolating a "core model"
from the standards currently prevalent in geo-metadata, and about
what i perceive as structural flaws in some of the information model
designs - byproducts of a "metadata first, data later" view.
http://wiki.osgeo.org/index.php/Why_DCLite4G
http://frot.org/terradue/minimal_metadata.html

Implementation-wise, though, this needs more than a generic CKAN.
Foremost is the ability to add spatial properties to data sets - typically
an envelope described by two X,Y points showing what area of space the
data set covers. This bleeds into wanting more things:

- The ability to store and query geometry objects in the backend
  (for the postgres database this is supplied by PostGIS)
- The ability to spatially filter search results from the frontend
- "Semi-automatic" grabbing of a data set and extracting the spatial
  extents from the metadata in the file (usually possible)
  and post-facto inserting that into the metadata record.
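
For the third item, a minimal sketch assuming a vector data set that OGR
can read (the record dict is just a stand-in for whatever the repository
actually stores):

    from osgeo import ogr

    def extract_extent(path):
        """Return (minx, miny, maxx, maxy) for the first layer of a data set."""
        ds = ogr.Open(path)
        if ds is None:
            raise ValueError("OGR cannot read %s" % path)
        # GetExtent() gives (minx, maxx, miny, maxy); reorder into a bbox
        minx, maxx, miny, maxy = ds.GetLayer(0).GetExtent()
        return (minx, miny, maxx, maxy)

    # post-facto: push the envelope into an existing metadata record
    record = {"name": "london-boroughs"}
    record["bbox"] = extract_extent("london_boroughs.shp")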

Last year it made sense to DIY with some homegrown geo extensions to
sqlobject. Then GeoDjango came along and rendered all that obsolete,
and I ported across in a couple of hours.
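
For those who haven't seen GeoDjango, this is roughly what it buys you -
a model with a geometry field plus spatial filtering in queries. The
Package model and its fields are invented for the example; the
PolygonField / GeoManager / intersects bits are GeoDjango's:

    from django.contrib.gis.db import models
    from django.contrib.gis.geos import Polygon

    class Package(models.Model):
        name = models.CharField(max_length=100)
        # spatial extent of the data set, stored in PostGIS via GeoDjango
        extent = models.PolygonField(srid=4326)
        objects = models.GeoManager()   # enables spatial lookups on the field

    # spatial filter: every package whose extent overlaps a (roughly London) bbox
    bbox = Polygon(((-0.5, 51.3), (-0.5, 51.7), (0.3, 51.7),
                    (0.3, 51.3), (-0.5, 51.3)))
    hits = Package.objects.filter(extent__intersects=bbox)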

But now development-wise I feel i am stuck between two stools.
GeoNetwork is doing a great job on both search UI and on support for
standard "harvesting" protocols like old-school OAI-PMH and
new-fangled OpenSearch.
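
For anyone who hasn't poked at OAI-PMH, harvesting is just HTTP GETs with
a "verb" parameter - the base URL below is a placeholder, but verb and
metadataPrefix are part of the protocol:

    import urllib, urllib2

    # endpoint is a placeholder - substitute a real harvesting base URL
    base = "http://example.org/oai"
    query = urllib.urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    xml = urllib2.urlopen(base + "?" + query).read()
    # 'xml' now holds a page of Dublin Core records plus (possibly) a
    # resumptionToken for fetching the next page with another ListRecords call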

One interesting side-project there is the MEF data package format.
This is a structured zipfile containing metadata about the data,
metadata about the contents of the package, potentially accompanying
screenshots and thumbnails, potentially the data itself.
http://frot.org/terradue/explore_mef.html is an excerpt of the GN
manual that describes MEF. Again it is geospatially specific in
some assumptions about the detail, but this could be a useful way to
think of delivering data from CKAN in something more approaching the
"apt-get install london" dream. MEF originated as part of an
*interchange protocol between GN nodes*, i.e. a mechanism for
registry/repositories to share data amongst one another.
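
Building a stub MEF is not much more than writing a zipfile; the member
names below (metadata.xml, info.xml, public/) follow my reading of the
manual excerpt, so check explore_mef.html before relying on them:

    import zipfile

    def build_stub_mef(path, metadata_xml, info_xml, thumbnail=None):
        """Write a minimal MEF-like package; member names are taken from the
        GN manual excerpt, so verify against the spec before real interchange."""
        z = zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED)
        z.writestr("metadata.xml", metadata_xml)
        z.writestr("info.xml", info_xml)
        if thumbnail is not None:
            z.write(thumbnail, "public/" + thumbnail)
        z.close()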

Now client-side software is getting into the act, e.g. the gvSIG
graphical view and analysis program is getting a plugin that will
generate stubs of MEFs, extract the spatial properties, (i *assume*)
walk the user through filling in the more useful/mandatory fields, and POST
the resulting metadata/data package off to a GeoNetwork instance.

This seems to make it less and less worthwhile to replicate what GN
does directly, and more worthwhile to replicate its most successful
interfaces. I start a project to do some of the above in python.
Then i look at CKAN and think about how I'd like to add new query
interfaces to it and contribute directly; being able to "scratch my
own itch" with CKAN would maximise the chance that i commit something ;-)

Right now, adding data sets to it feels like a drag because it lacks the
capability to import new structured metadata records from an existing
repository - something that MEF-likes could help to facilitate - or to
easily dump out records for consumption by a different repository.
It probably makes more sense to help CKAN do this than to work on a
near-clone just because CKAN currently can't.
But a geo-specific near-clone comes with the requirements outlined above,
which leaves me in a position of:

- wanting to be able to "plug in" an extended domain model to a
  given ckan instance (it would be enough, though not ideal, if
  all records in a given repository had to use the same model)

- wanting to "plug in" a query/display protocol to a core

- wanting an easy way to add 'post-create-hooks' to different
  classes of packages (a sketch of what i mean follows this list)

- wanting to contrib useful stuff to the core, not just extensions
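
By 'post-create-hooks' i mean something as dumb as this - a toy registry,
none of it existing CKAN API:

    # toy registry: map a package 'class' to callables run after creation
    POST_CREATE_HOOKS = {}

    def register_hook(package_class, func):
        POST_CREATE_HOOKS.setdefault(package_class, []).append(func)

    def run_post_create(package_class, package):
        for func in POST_CREATE_HOOKS.get(package_class, []):
            func(package)

    # e.g. geodata packages get an extra step after creation
    def announce(package):
        print("created geodata package: %s" % package["name"])

    register_hook("geodata", announce)
    run_post_create("geodata", {"name": "london-boroughs"})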

I scared myself off DIY frameworks a bit after the experiments with
"nodel". But where that went wrong was the replicating-Django-in-RDF
part of it, rather than the useful bits consisting of "application
packages" of domain models + python modules defining HTTP/XMLish interfaces.

Please tell me if I'm succumbing to frameworkitis again.
Plus, I am still addicted to GeoDjango to the extent that if i were to
work on a custom distribution of CKAN then i would really want to port
it from pylons to Django first. I know OKFN is platform-neutral, and i
know pylons is ORM-neutral and there has been some PostGIS
integration work with SQLAlchemy, but i consider it likely Django will
provide richer "network effects" in terms of related work.
I wish we didn't still have to have this conversation, either.

cheers,

jo

--

Hi Jo,
That's a long mail :-)

On Oct 17, 2007, at 11:39 PM, Jo Walsh wrote:

In the GIS software world there is an overfocus on standards. Given
several different domain models + XML serialisations, etc,
all differingly mandatory in regional government data policies, what
is an implementor to do? The GN answer is to use XML templates (or XML
documents as prototype templates) and to store per-package metadata in
BLOBs of XML in a database - extracting a few key properties for indexing.

But it becomes harder to re-use, and get a 'network effect' from the
re-use of, descriptions of people or organisations;

The workflow should be driven more by the data; I mean that when the data changes, the metadata (which possibly sits with the data) is updated automatically. The problem is/was that many people struggled to publish data with any metadata at all. For them the web-based editor is a solution, even if not an ideal one. I hope that we get more integration with the data processing applications (ESRI products, gvSIG, uDig and others). There are several initiatives that attempt to further improve the automatic metadata generation process. It would be cool to see them evolve into powerful tools that many users can benefit from, through a simple installation and configuration process.

the *internal
structure* of these data sets can't often be expressed by the
prevalent standards (ISO 19115, FGDC, etc). The information models,
domain models for geographic data have a lot of specific details in them
that I don't necessarily want to fill out, or see on the screen.

I think we expect the network services (the W*S family) to provide the detail to the client app, and in that way hide it from the screen. This may need further evolution, but I would say "keep it as close to the data as possible; only provide the metadata you need to find the resource and make a first assessment".

I wrote a bit back when about the extrapolation of a "core model"
from the current standards prevalent in geo-metadata; and also about
what i perceive as structural flaws in some of the information models
designs that are byproducts of a "metadata first, data later" view.
http://wiki.osgeo.org/index.php/Why_DCLite4G
http://frot.org/terradue/minimal_metadata.html

From what I see, metadata is almost always an afterthought. I can understand that, although I disagree :-) It should be something that is done during, and at the end of, the data creation process. Creating metadata after the data usually results in other people writing it, lots of copying and duplication of metadata content, and so on. It shows that metadata should be part of the normal workflow and should be really easy to do. I think we're fully aware that, from that perspective, a web-based metadata editor is not very useful - but eh, remember the afterthought... It helps to get stuff published in a consistent manner.

But implementation-wise this needs more than a generic CKAN, though.
Foremost is ability to add spatial properties of data sets - typically
an envelope described by two X,Y points showing what area of space the
data set covers. This bleeds into wanting more things:

- The ability to store and query geometry objects in the backend
  (for the postgres database this is supplied by PostGIS)
- The ability to spatially filter search results from the frontend
- "Semi-automatic" grabbing of a data set and extracting the spatial
  extents from the metadata in the file (usually possible)
  and post-facto inserting that into the metadata record.

We're working on that here: Martin Seiler is writing a Python script that does some nice things using GDAL/OGR.
Also, the gvSIG guys in Valencia are hatching something; not sure how far along they are. (Copied Mike.)

Last year it made sense to DIY with some homegrown geo extensions to
sqlobject. Then GeoDjango came along and rendered all that obsolete,
and I ported across in a couple of hours.

Can you describe in simple terms what that means for those of us who don't know Django?

But now development-wise I feel i am stuck between two stools.
GeoNetwork is doing a great job on both search UI and on support for
standard "harvesting" protocols like old-school OAI-PMH and
new-fangled OpenSearch.

One interesting byproject there is the MEF data package format.
This is a structured zipfile containing metadata about the data,
metadata about the contents of the package, potentially accompanying
screenshots and thumbnails, potentially the data itself.
http://frot.org/terradue/explore_mef.html is an excerpt of the GN
manual that describes MEF. Again it is geospatially specific in
some assumptions about the detail, but this could be a useful way to
think of delivering data from CKAN in something more approaching the
"apt-get install london" dream.

It would be cool to see something like a MEF stored in CKAN and deployed like that :-) It could also work for metadata only, without the data - if the data is too big, just point to the services or the data instead.

MEF originated as part of an
*interchange protocol between GN nodes* e.g. was a mechanism for
registry/repositories to share data amongst one another.

Now client-side software is getting into the act, e.g. the gvSIG
graphical view and analysis program is getting a plugin that will
generate stubs of MEFs, extract the spatial properties, i *assume*
walk through filling in the more useful/mandatory fields, and POST
the resulting metadata/data package off to a GeoNetwork instance.

Ciao, Jeroen

cheers,

jo

--


Jo Walsh wrote:
[snip]

In the GIS software world there is an overfocus on standards. Given
several different domain models + XML serialisations, etc,
all differingly mandatory in regional government data policies, what
is an implementor to do? The GN answer is to use XML templates (or XML
documents as prototype templates) and to store per-package metadata in
BLOBs of XML in a database - extracting a few key properties for indexing.

Interesting. One would imagine there must be quite a bit of common stuff too, plus some consensus on what constitutes an absolute minimum (a bit like Dublin Core for document metadata).

But it becomes harder to re-use, and get a 'network effect' from the
re-use of, descriptions of people or organisations; the *internal
structure* of these data sets can't often be expressed by the
prevalent standards (ISO 19115, FGDC, etc). The information models,
domain models for geographic data have a lot of specific details in them that I don't necessarily want to fill out, or see on the screen.

I wrote a bit back when about the extrapolation of a "core model"
from the current standards prevalent in geo-metadata; and also about
what i perceive as structural flaws in some of the information models designs that are byproducts of a "metadata first, data later" view.
http://wiki.osgeo.org/index.php/Why_DCLite4G
http://frot.org/terradue/minimal_metadata.html

Ah! You're already there, so ignore my previous comment :-) I note for others that the actual spec (which looks nice and concise) is at:

http://wiki.osgeo.org/index.php/DCLite4G

But implementation-wise this needs more than a generic CKAN, though.
Foremost is ability to add spatial properties of data sets - typically
an envelope described by two X,Y points showing what area of space the
data set covers. This bleeds into wanting more things:

- The ability to store and query geometry objects in the backend
  (for the postgres database this is supplied by PostGIS)
- The ability to spatially filter search results from the frontend
- "Semi-automatic" grabbing of a data set and extracting the spatial extents from the metadata in the file (usually possible)
  and post-facto inserting that into the metadata record.

Okay, let's break this down into:

   1. Additional metadata: possible ways to do this in CKAN include
     a) 'machine tags' (as proposed by Aaron), which just involves layering on the existing tag infrastructure (sketched just below)
     b) arbitrary additional per-package metadata (should be easy, but then it is hard to specify grouping a la dc4g)
     c) metadata 'plugins' that allow addition of metadata sets (a la dc4g)

   2. Querying based on metadata. One nice thing about plugins is that, in addition to the basic metadata spec, they could provide metadata-specific code for the user interface, for querying etc.

   3. Machine-automated retrieval of data. This is the (or at least one of the major) reasons for building the system. It would be the beginning of componentization and automated builds for knowledge packages, similar to the way we do software.
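
On 1a, a quick sketch of what machine tags could look like for the bounding
box case - the "geo" namespace and key names here are made up for the
example, not an agreed convention:

    def bbox_to_machine_tags(minx, miny, maxx, maxy):
        # flickr-style machine tags: namespace:predicate=value
        return ["geo:minx=%s" % minx, "geo:miny=%s" % miny,
                "geo:maxx=%s" % maxx, "geo:maxy=%s" % maxy]

    def machine_tags_to_bbox(tags):
        vals = dict(t.split("=", 1) for t in tags if t.startswith("geo:"))
        return tuple(float(vals["geo:" + k]) for k in ("minx", "miny", "maxx", "maxy"))

    tags = bbox_to_machine_tags(-0.5, 51.3, 0.3, 51.7)
    assert machine_tags_to_bbox(tags) == (-0.5, 51.3, 0.3, 51.7)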

Last year it made sense to DIY with some homegrown geo extensions to sqlobject. Then GeoDjango came along and rendered all that obsolete,
and I ported across in a couple of hours.

But now development-wise I feel i am stuck between two stools.
GeoNetwork is doing a great job on both search UI and on support for
standard "harvesting" protocols like old-school OAI-PMH and
new-fangled OpenSearch.

One interesting byproject there is the MEF data package format.
This is a structured zipfile containing metadata about the data,
metadata about the contents of the package, potentially accompanying
screenshots and thumbnails, potentially the data itself.
http://frot.org/terradue/explore_mef.html is an excerpt of the GN
manual that describes MEF. Again it is geospatially specific in some assumptions about the detail, but this could be a useful way to
think of delivering data from CKAN in something more approaching the "apt-get install london" dream. MEF originated as part of an
*interchange protocol between GN nodes* e.g. was a mechanism for
registry/repositories to share data amongst one another.

This sounds very interesting. I'd been thinking of stuff along the lines of python package metadata or good old apt, on the basis that it was already "there" and could potentially be easily reused. I even started on a datapkg tool, similar to python's easy_install, back in May.

Now client-side software is getting into the act, e.g. the gvSIG
graphical view and analysis program is getting a plugin that will
generate stubs of MEFs, extract the spatial properties, i *assume*
walk through filling in the more useful/mandatory fields, and POST
the resulting metadata/data package off to a GeoNetwork instance.

Nice.

This seems to make it less and less worthwhile to replicate what GN
does directly, and more worthwhile to replicate its most successful
interfaces. I start a project to do some of the above in python.
Then i look at CKAN and think about how I'd like to add new query
interfaces to it and contribute directly; being able to "scratch my
own itch" with CKAN would maximise the chance that i commit something :wink:

Absolutely. Am i understanding correctly that one could add a new query interface to it that goes off and talks to other repositories (such as GN)? If so this sounds great and I'd happily help you out in coding this up.

Right now adding data sets to it feels like a drag because there lacks the
capability to import new structured metadata records from an existing
repository - something that MEF-likes could help to facilitate - or to
easily dump out records for consumption by a different repository.

Completely agree. Earlier this week I put in support for purging revisions (to deal with the occasional spam we're now getting), but the next item is to provide a good machine API. In fact, the system is already designed around a quasi-RESTful interface, so a straight POST to:

http://ckan.net/package/create/

and

http://ckan.net/package/update/

with the right variables should work. However, something should probably be done to make this more completely RESTful, plus some documentation. The relevant current code is at:

http://knowledgeforge.net/ckan/svn/ckan/trunk/ckan/controllers/package.py
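
So, hedging on the exact field names (check package.py for the real ones),
creating a package remotely ought to look something like:

    import urllib, urllib2

    # the field names below (name, title, url) are guesses for the sake of
    # the example; the authoritative list is in controllers/package.py
    params = urllib.urlencode({
        "name": "london-boroughs",
        "title": "London borough boundaries",
        "url": "http://example.org/london-boroughs.zip",
    })
    response = urllib2.urlopen("http://ckan.net/package/create/", params)
    print(response.read())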

It probably makes more sense to help CKAN to do this, rather than work
on a near-clone specifically because it currently can't.

I agree :-)

But a geo-specific near-clone has some of the constraints earmarked above
which leave me in a position of:

- wanting to be able to "plug in" an extended domain model to a given ckan instance (it would be enough, though not ideal, if
  all records in a given repository had to use the same model)

- wanting to "plug in" a query/display protocol to a core

- wanting an easy way to add 'post-create-hooks' to different classes of packages
- wanting to contrib useful stuff to the core, not just extensions

I scared myself off DIY frameworks a bit after the experiments with
"nodel". But where that went wrong was the replicating-Django-in-RDF
part of it, rather than the useful bits consisting of "application
packages" of domain models + python modules defining HTTP/XMLish interfaces.

I think one always wants to use a framework but also remember that it only does 10% of the work ...

CKAN uses Pylons, which is the other major python framework and very similar to Django, so I don't think there should be any problems using it if you've used Django.

Please tell me if I'm succumbing to frameworkitis again. Plus, I am still addicted to GeoDjango to the extent that if i were to
work on a custom distribution of CKAN then i would really want to port
it from pylons to Django first. I know OKFN is platform-neutral, and i
know pylons is ORM-neutral and there has been some PostGIS
integration work with SQLAlchemy, but i consider it likely Django will
provide richer "network effects" in terms of related work.
I wish we didnt still have to have this conversation, either.

Now you're asking. This is quite a rewrite, since ckan also uses the versioned domain model code, which is written against sqlobject (and, almost, elixir). I note that e.g. bycycle.org uses pylons and postgis for stuff; see e.g.:

http://bycycle.org/2007/01/29/using-postgis-with-sqlalchemy/

As this is rather technical, perhaps this is something we should discuss further on okfn-help.

~rufus

Yes, our metadata extractor and manager, based on gvSIG ver 1.1, will appear
soon, to be demonstrated at the GN workshop on Nov 9, and then again at the
gvSIG conference 14-16 November. Not sure when the extension will be
"published" however.

Mike

-------
Michael Gould
Centro de Visualización Interactiva www.cevi.uji.es
Dept. Information Systems (LSI), Universitat Jaume I, 12071 Castellón, Spain
email: gould (at) lsi.uji.es // email2: mgould (at) opengeospatial.org
research group www.geoinfo.uji.es // personal www.mgould.com
AGILE www.agile-online.org
Vespucci Summer Institute www.vespucci.org
Erasmus Mundus: Master in Geospatial Technologies www.mastergeotech.info
