Hi,
a few days ago Justin and I discussed the versioning
datastore as part of a broader discussion about geosynchronization.
During the discussion a number of shortcomings of the current
datastore were pointed out:
* the datastore is Postgis specific
* it modifies the tables it version-enables
* the code inside the datastore is quite complex
If you need a quick refresher on how the datastore currently works,
see here: http://geoserver.org/display/GEOS/Versioning+WFS+-+Database+design
Now, during that discussion an alternative design came up, namely
one where the versioning system is a pure wrapper around a target
datastore (the current design is implemented as a wrapper of sorts
for most operations, but it requires direct access to the db
in order to handle FIDs, alter tables and set up indexes).
The idea is to be able to version a shapefile, a
remote WFS, whatever you want, by keeping the main table as is
and recording the history information in a separate set of tables.
It's not that difficult: all that is needed is to keep the current
table design, but start with an empty versioned table (the Data
table in the above document's picture) and fill it only
when the main table gets changed, recording the old value
of each changed attribute. When a new record is created, an empty
row is created in the Data table; when one is deleted, the
record is removed from the actual feature table and copied
verbatim into the Data table (of course, all these operations
have to properly track the revision numbers at which these
changes occurred). A rough sketch of the write path follows.
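To make this a bit more concrete, here is a rough Java/JDBC sketch
of what the wrapper could do when a write goes through. Table and
column names (roads, roads_history, roads_created) are made up, and
I'm simplifying the history layout to a single "superseded at
revision" column instead of the full revision/expired pair of the
current design:

import java.sql.Connection;
import java.sql.PreparedStatement;

// Illustrative only: a wrapper intercepting writes to a hypothetical
// "roads" table, keeping history in a separate (e.g. H2) database.
public class HistoryWriteSketch {

    // Before an update or a delete touches the main table, copy the
    // current row verbatim into the history table, tagged with the
    // revision that supersedes it.
    void recordOldState(Connection history, String fid, String name,
            byte[] geomWkb, long supersededAtRevision) throws Exception {
        PreparedStatement ps = history.prepareStatement(
                "INSERT INTO roads_history(fid, superseded, name, geom) "
                + "VALUES (?, ?, ?, ?)");
        try {
            ps.setString(1, fid);
            ps.setLong(2, supersededAtRevision);
            ps.setString(3, name);
            ps.setBytes(4, geomWkb);
            ps.executeUpdate();
        } finally {
            ps.close();
        }
    }

    // Inserts only need to remember the creation revision, so that
    // older snapshots know the feature was not there yet.
    void recordCreation(Connection history, String fid, long revision)
            throws Exception {
        PreparedStatement ps = history.prepareStatement(
                "INSERT INTO roads_created(fid, created) VALUES (?, ?)");
        try {
            ps.setString(1, fid);
            ps.setLong(2, revision);
            ps.executeUpdate();
        } finally {
            ps.close();
        }
    }
}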
Advantages of this approach:
* allows every datastore to be version-enabled
* the history storage can be made very efficient by using an H2
backend for it
* access to the current state of the features is direct, so
there is no slowdown whatsoever
Disadvantages:
* all referential integrity is lost. The original table can
be modified directly and the history subsystem won't notice.
It should be noted that the current approach does not make
it impossible to alter the table either, just harder, that is,
you must be willing to actively circumvent the extra hoops
that the foreign keys to the changesets table impose
* all transactional integrity is lost. There is no longer any
guarantee that the modification will occur in an ACID
way, since the data and the history are not under
the control of the same database transaction.
This can be mitigated in two ways:
- if the target datastore is a dbms, history support
for it could be built so that modifications occur
under the same db and the same transaction (see the
sketch after this list). For example, if you're playing
against Postgis, a Postgis specific history subsystem
could be built and used in place of the H2 based one,
and so on
- if the target datastore can participate in a two
phase commit distributed transaction, it could be
possible to use distributed transactions to keep
the ACID safeties around. This would probably require
a significant change in GT2, which is JTA unaware
* access to an older snapshot of the data is going
to be very slow, with times proportional to the number
of extracted features.
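Just to make the first mitigation concrete, here's a minimal JDBC
sketch (again with made up table names) where the change to the main
table and the history record are written under the same database
transaction, so either both happen or neither does:

import java.sql.Connection;
import java.sql.PreparedStatement;

// Illustrative only: main table and history live in the same
// database, so one JDBC transaction covers both writes.
public class SameTransactionSketch {

    void updateRoadName(Connection conn, String fid, String oldName,
            String newName, long revision) throws Exception {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try {
            // 1. record the old state in the history table
            PreparedStatement hist = conn.prepareStatement(
                    "INSERT INTO roads_history(fid, superseded, name) "
                    + "VALUES (?, ?, ?)");
            hist.setString(1, fid);
            hist.setLong(2, revision);
            hist.setString(3, oldName);
            hist.executeUpdate();
            hist.close();

            // 2. apply the change to the main table
            PreparedStatement main = conn.prepareStatement(
                    "UPDATE roads SET name = ? WHERE fid = ?");
            main.setString(1, newName);
            main.setString(2, fid);
            main.executeUpdate();
            main.close();

            conn.commit(); // both writes, or neither
        } catch (Exception e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}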
The last point, slow access to older snapshots, requires some
explanation. Think about accessing the state of a feature at
revision N. How do you know what the state of that feature was?
You have to query the history table with that fid
and revision, roughly as in the sketch below.
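With the simplified history layout used in the sketches above, the
single feature lookup could look something like this (purely
illustrative): the oldest history record superseded after N holds
the state the feature had at N; if there is none, the current row
in the main table is the answer.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative only: fetch the name a feature had at revision N.
public class SnapshotLookupSketch {

    String nameAtRevision(Connection conn, String fid, long n)
            throws Exception {
        // the first change recorded after revision N captured the
        // state the feature had at N
        PreparedStatement ps = conn.prepareStatement(
                "SELECT name FROM roads_history "
                + "WHERE fid = ? AND superseded > ? "
                + "ORDER BY superseded LIMIT 1");
        try {
            ps.setString(1, fid);
            ps.setLong(2, n);
            ResultSet rs = ps.executeQuery();
            if (rs.next()) {
                return rs.getString(1);
            }
        } finally {
            ps.close();
        }

        // no change after N: the current row is also the state at N
        // (assuming the feature already existed at revision N)
        PreparedStatement current = conn.prepareStatement(
                "SELECT name FROM roads WHERE fid = ?");
        try {
            current.setString(1, fid);
            ResultSet rs = current.executeQuery();
            return rs.next() ? rs.getString(1) : null;
        } finally {
            current.close();
        }
    }
}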
Now, what happens if you query 100 features at
a specific revision? In theory one could use a
merge-sort like approach, that is, grab the features
and the history records in the same order, peek at
each stream one record at a time, merge them, and
return the results.
Unfortunately this cannot be done, because of sorting,
or the lack of it, depending on the case:
* if the query is not sorted on any attribute, we'd
need to sort on the fid in order to get features
and history records in the same order... too bad
we don't have a way to do that with the OGC Query API
* if the query is sorted, we're screwed anyway, because
the old attribute values may impose a different sorting
order on the features
This means that to access history we need to make one query
to grab the current features, plus one query per feature
to grab its state at revision N from the history.
This scales very badly, since for each query we'll
have to pay the latency of a round trip to the db
(and to make things worse, we're still not
using prepared statements).
This could be mitigated somewhat by treating features
in batches, as in the sketch after this list, that is:
* grab 100 features from the original storage
(which we already do, see the jdbc fetch size)
* query the history table with those 100 fids
* build the feature state at revision N in memory
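Here is a rough sketch of that batched lookup, using the same
simplified history layout as before (illustrative names, and an IN
clause built dynamically just to keep the example short):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: given a batch of fids just read from the main
// table, grab their history records with a single query and keep,
// per fid, the oldest record superseded after revision N.
public class BatchSnapshotSketch {

    Map<String, String> oldNamesAtRevision(Connection conn,
            List<String> fids, long n) throws Exception {
        StringBuilder sql = new StringBuilder(
                "SELECT fid, superseded, name FROM roads_history "
                + "WHERE superseded > ? AND fid IN (");
        for (int i = 0; i < fids.size(); i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        sql.append(") ORDER BY fid, superseded");

        Map<String, String> result = new HashMap<String, String>();
        PreparedStatement ps = conn.prepareStatement(sql.toString());
        try {
            ps.setLong(1, n);
            for (int i = 0; i < fids.size(); i++) {
                ps.setString(i + 2, fids.get(i));
            }
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                String fid = rs.getString(1);
                // rows are sorted by revision, so the first one per
                // fid is the state the feature had at revision N
                if (!result.containsKey(fid)) {
                    result.put(fid, rs.getString(3));
                }
            }
        } finally {
            ps.close();
        }
        // fids missing from the map have not changed since revision
        // N, the rows read from the main table are already right
        return result;
    }
}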
All in all, it seems doable, and it would have
certain advantages over the current approach, though,
as you can see, making it deliver decent performance
and provide transactional guarantees will require
jumping through some non trivial extra hoops.
Comments, opinions?
Cheers
Andrea