[Geoserver-devel] Some thoughts on the versioning datastore

Hi,
a few days ago Justin and I discussed the versioning
datastore as part of a broader discussion about geosynchronization.
During the discussion a list of shortcomings of the current datastore
were exposed:
* the datastore is postgis specific
* it modifies the tables it version-enables
* the code inside the datastore is quite complex

If you need a quick refresher on how the datastore currently works,
see here: http://geoserver.org/display/GEOS/Versioning+WFS+-+Database+design

Now, during that conversation a different design came up, namely
one where the versioning system is a pure wrapper around a target
datastore (the current design is implemented as a wrapper of sorts
for most operations, but it requires direct access to the db
in order to handle FIDs, alter tables and set up indexes).

The idea is to be able to version the history of a shapefile, a
remote WFS, whatever you want, by keeping the main table as is
and recording the history information in a separate set of tables.
It's not so difficult: all that is needed is to keep the current
table design, but start with an empty versioned table (the Data
table in the above document's picture) and fill it only
when the main table gets changed, by recording the old value
of each attribute changed. When a new record is created, an empty
row is created in the Data table; when one is deleted, the
record is removed from the actual feature table and copied
verbatim into the Data table (of course, all these operations
have to properly track the revision numbers where these
changes occurred).
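
To make the idea a bit more concrete, here is a minimal sketch of the
bookkeeping against an H2 backend. Mind, the table and column names below
are made up just for illustration, they are not meant to be the actual
schema, and the history is simplified to a couple of attributes:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class HistorySketch {
    public static void main(String[] args) throws Exception {
        // in-memory H2 db holding only the history; the real data lives
        // untouched in the wrapped datastore (shapefile, WFS, postgis, ...)
        Connection cx = DriverManager.getConnection("jdbc:h2:mem:history");
        Statement st = cx.createStatement();
        // one row per committed changeset
        st.execute("CREATE TABLE changesets (revision BIGINT PRIMARY KEY, "
                + "author VARCHAR, message VARCHAR, commit_time TIMESTAMP)");
        // superseded feature states, valid from 'created' (inclusive)
        // to 'expired' (exclusive), in terms of revision numbers
        st.execute("CREATE TABLE data (fid VARCHAR, created BIGINT, "
                + "expired BIGINT, name VARCHAR, geom_wkt VARCHAR)");

        // say feature 'roads.1' gets updated at revision 2: before touching
        // the main table we copy its old attribute values into the history
        st.execute("INSERT INTO changesets VALUES "
                + "(2, 'aaime', 'renamed road', CURRENT_TIMESTAMP)");
        PreparedStatement ps = cx
                .prepareStatement("INSERT INTO data VALUES (?, ?, ?, ?, ?)");
        ps.setString(1, "roads.1");
        ps.setLong(2, 1); // this state appeared at revision 1
        ps.setLong(3, 2); // and got superseded at revision 2
        ps.setString(4, "Main St"); // the old attribute values
        ps.setString(5, "LINESTRING(0 0, 1 1)");
        ps.execute();
        // ...and only now the UPDATE hits the wrapped datastore
        cx.close();
    }
}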

Advantages of this approach:
* allows every datastore to be version-enabled
* the history storage can be very efficient by using an H2
   backend for it
* access to the current state of the feature is direct,
   there is no slowdown whatsoever
Disadvantages:
* all referential integrity is lost. A direct modification
   to the original table can be made and the history subsystem
   won't notice. Note that the current
   approach does not make it impossible to alter the table
   directly, just harder, that is, you must be willing to actively
   circumvent the extra hoops that the foreign keys to the
   changesets table impose
* all transactional integrity is lost. There is no
   guarantee anymore that the modification will occur in an ACID
   way, since the data and the history are not under
   the control of the same database transaction.
   This can be mitigated in two ways:
   - if the target datastore is a dbms, history support
     for it could be built so that modifications occur
     under the same db and transaction. For example, if
     you're playing against PostGIS, a PostGIS specific
     history subsystem could be built and used in place
     of the H2 based one, and so on (see the sketch right
     after this list)
   - if the target datastore can participate in a two
     phase commit distributed transaction, it could be
     possible to use distributed transactions to keep
     the ACID safeties around. This would probably require
     a significant change in GT2, which is JTA unaware
* access to an older snapshot of the data is going
   to be very slow, with times proportional to the number
   of extracted features.
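
As a sketch of the first mitigation (same caveat as before, the table and
column names are hypothetical), if the feature table and the history tables
live in the same database the wrapper can simply run the two statements
inside a single JDBC transaction:

import java.sql.Connection;
import java.sql.PreparedStatement;

public class SameTransactionSketch {
    // cx is assumed to point at the database hosting both the feature
    // table and the history tables (e.g. the same PostGIS instance)
    static void renameRoad(Connection cx) throws Exception {
        cx.setAutoCommit(false);
        try {
            // 1) archive the old state in the history table
            //    (1 and 2 stand for the creation/expiration revisions)
            PreparedStatement hist = cx.prepareStatement(
                    "INSERT INTO data (fid, created, expired, name) "
                    + "SELECT fid, 1, 2, name FROM roads WHERE fid = ?");
            hist.setString(1, "roads.1");
            hist.execute();
            // 2) apply the change to the feature table
            PreparedStatement upd = cx.prepareStatement(
                    "UPDATE roads SET name = ? WHERE fid = ?");
            upd.setString(1, "Main Street");
            upd.setString(2, "roads.1");
            upd.execute();
            // either both the data and the history change, or neither does
            cx.commit();
        } catch (Exception e) {
            cx.rollback();
            throw e;
        }
    }
}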

The last disadvantage, slow access to older snapshots,
requires some explanation. Think about
accessing the state of a feature at revision N.
How do you know what the state of that feature was?
You have to query the history table with that fid
and revision.
Now, what happens if you query 100 features at
a specific revision? Theoretically one could use a
merge-sort like approach, that is, grab the features
and the history records in the same order, peek at
each stream one record at a time, merge them, and
return the results.
Unfortunately this cannot be done because of sorting,
or the lack of it, depending on the case:
* if the query is not sorted on any attribute, we'd
   need to sort on the fid in order to get features
   and history records in the same order... too bad
   we don't have a way to do so with the OGC Query API
* if the query is sorted, we're screwed because the
   old attribute values may impose a different sorting
   order on the features
This means that to access history we need to make 1 query
to grab the current features, plus one query for each
feature to grab its state at revision N from the history.
This scales very badly, since for each query
we make we'll have to pay the latency of a round
trip to the db (and to make things worse, we're still not
using prepared statements).
This could be mitigated somewhat by treating features
in batches (see the sketch right after this list), that is:
* grab 100 features from the original storage
   (which we already do, see the jdbc fetch size)
* query the history table with those 100 fids
* build the feature state at revision N in memory
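
A rough sketch of the batched lookup, again using the hypothetical history
schema sketched above and a single versioned attribute for brevity:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchedSnapshotSketch {
    // for each batch of fids coming from the main datastore, fetch in a
    // single query the value each feature had at the requested revision;
    // a miss simply means the current value is still the right one
    static Map<String, String> oldNamesAtRevision(Connection history,
            List<String> fids, long revision) throws Exception {
        StringBuilder sql = new StringBuilder("SELECT fid, name FROM data "
                + "WHERE created <= ? AND expired > ? AND fid IN (");
        for (int i = 0; i < fids.size(); i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        sql.append(")");
        PreparedStatement ps = history.prepareStatement(sql.toString());
        ps.setLong(1, revision);
        ps.setLong(2, revision);
        for (int i = 0; i < fids.size(); i++) {
            ps.setString(i + 3, fids.get(i));
        }
        Map<String, String> result = new HashMap<String, String>();
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            result.put(rs.getString("fid"), rs.getString("name"));
        }
        rs.close();
        ps.close();
        return result;
    }
}

With batches of 100 (matching the jdbc fetch size) the N+1 queries turn
into roughly N/100 round trips to the history db.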

All in all, it seems doable, and it would have
certain advantages over the current approach, though,
as you can see, making it deliver decent performance
and provide transactional guarantees will require
jumping through some non trivial extra hoops.

Comments, opinions?
Cheers
Andrea

Just for discussion (not a solid proposal)...

There are several approaches to versioning, depending on how you
expose the versioning meta-model to the user:

1) a separate interface (WFS-V) with different calls to get metadata
and inject constraints based on versioning metadata

2) build versioning metadata into the data model (feature type) on a
per-data set basis

3) derive datasets from a "root" that contains standardised versioning metadata

4) subscribe to a meta-model that allows metadata objects to refer to
the described features - and then expose this metadata model as
features (allowing traversal of feature relationships in queries)

5) ditto, but allowing special interfaces to expose the relationships
(almost the same as #1, except it's possible to transfer versioning
metadata)

I haven't thought through the pros and cons of each approach, but 2, 3 and 4
are all easily enabled by the community schemas stuff (explicitly, we
have a test case to support the use of ISO19115 feature-scoped metadata,
using standard schemas. This was a bugger, but is now feasible)

INSPIRE is taking approach #3... so it's probably the market leader,
except that database replication must use some form of similar
meta-model, and perhaps we should be directly implementing a pattern
that's proven there in the most effective way, with a minimum of new
interfaces.

Rob Atkinson

Rob Atkinson wrote:

Hum, interesting, is there any document about INSPIRE and versioning?
Cheers
Andrea