[Geoserver-devel] Versioned WFS-T

I'm taking the liberty of CCing the Geoserver list as I think this discussion is worth archiving.

A quick wrap-up for those joining this thread:
* Adrian Custer notes that a French government agency has a tender which should use versioned WFS in its solution.
* TOPP have implemented Versioned WFS
* The OGC OWS5.2 testbed has a proposal for Federated Geosynchronisation, which extends Versioned WFS to provide Cached WFS-T.
* I'm lobbying Australian agencies to fund the OWS5.2 thread. There are a couple of other interested parties, but it would help to have extra resources behind the thread.
* The gvSIG team are proposing to provide a PDA data entry client based on gvSIG. I understand that this is funded or about to be funded.
* I've recently blogged a more detailed document at: http://techblog.terrapages.com/2007/05/federated-geo-synchronization-standards.html
* Rob Atkinson has a proposal to handle complex features. (Someone needs to write this up on a wiki or similar)
--
Adrian,
I agree with Chris, version control is a difficult problem and we should address it in lots of small achievable steps rather than going for the big bang approach.

Software versioning systems have gone through numerous iterations before reaching their current state (SCCS -> RCS -> CVS -> SVN -> Git ...).

Currently, most of the components in my blog at http://techblog.terrapages.com/2007/05/federated-geo-synchronization-standards.html are vapourware. However, if we play our cards right, they will be funded as part of the OWS5.2 testbed.
Lisasoft has been selected to work on the Canadian Geographic Data Infrastructure Interoperability Project, and as part of that, we will be extending Mapbuilder to enter and then view Feature Update entries. While this is not a Versioned WFS client or a Versioned WFS-T Update Reviewer, it will provide core building blocks.

I'm proposing we support decentralised WFS-T updates in a simple manner. The workflow would be something like:
1. PDA stores updates in local cache or WFS-T
2. PDA docks with desktop
3. PDA copies cached updates into a staging WFS-T
4. A user views a GUI with a side-by-side view of updates from the staging WFS-T matched against the master WFS-T. Conflicts are highlighted. The user can view each feature and its diffs individually.
5. User accepts/rejects updates.
6. Accepted updates are moved from the staging WFS-T back into the master WFS-T (a rough sketch of this step follows below).
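
For illustration, step 6 could be as simple as replaying each accepted update against the master server as a plain WFS-T Transaction. A minimal sketch follows; the endpoint URL, namespace, type name and property names are made-up placeholders, not part of any agreed design.

    import urllib.request

    MASTER_WFST = "http://example.org/geoserver/wfs"  # hypothetical endpoint

    def accept_update(feature_id, new_geom_gml):
        """Replay one reviewed update against the master WFS-T.

        new_geom_gml is the replacement geometry as a GML fragment.
        """
        body = f"""<wfs:Transaction service="WFS" version="1.1.0"
            xmlns:wfs="http://www.opengis.net/wfs"
            xmlns:ogc="http://www.opengis.net/ogc"
            xmlns:topp="http://www.openplans.org/topp">
          <wfs:Update typeName="topp:roads">
            <wfs:Property>
              <wfs:Name>the_geom</wfs:Name>
              <wfs:Value>{new_geom_gml}</wfs:Value>
            </wfs:Property>
            <ogc:Filter><ogc:FeatureId fid="{feature_id}"/></ogc:Filter>
          </wfs:Update>
        </wfs:Transaction>"""
        req = urllib.request.Request(MASTER_WFST, data=body.encode("utf-8"),
                                     headers={"Content-Type": "text/xml"})
        return urllib.request.urlopen(req).read()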

--

Yes, Rob has a proposal for handling complex feature types. I'm not sure whether anyone has written this down, and I haven't yet had a spare 30 minutes to explain it.

--
Adrian,
I'm keen to understand your timeframes, and whether your sponsors would be likely to fund and/or participate in the OWS testbed. Ideally, funding should be committed by August 3 2007.
The usual sales spiel may help. Contribute $1 to the testbed, get $3 worth of functionality and access to the best developers in the world. Use open standards and future-proof your solution.

Chris Holmes wrote:

Sure.

Thus far TOPP has done all the work on this, the vast majority by Andrea. He spent a good while researching how other systems handle versioning, then came up with a scheme for us to do it and we have a working implementation.

TOPP is not actively on the hook in terms of external contracts, but it is our primary funder's main interest in GeoServer, something he's wanted for a while. So to some extent it's a higher priority than external contracts, since he gives us a lot more money than any other contract. But it's been on hold as we've had other things come in, and we have been waiting to get someone to do a front end to flesh out our design thus far. Active funding will help it move a _lot_ faster though, since it does go on hold when funded work comes in.

I'm not sure of the state of Cameron's work; his write-up was a good evaluation of where things are at and where they could go next. I believe there's some decent money behind it, but he'll obviously have to explain more.

As for distributed versioning, I've definitely thought about it and do very much want to work in that direction. I don't believe our current architecture is prohibitive of that.

But I feel _very_ strongly that we should focus on the simple implementations and getting those into place before turning to things like distribution. For source code, Shuttleworth and Linus can trumpet the benefits of distributed version control because SVN and CVS are well established, and Bazaar, svk and the rest have built upon that basic knowledge.

We don't yet have that basic knowledge in versioning geospatial data. I mean, ArcSDE and Oracle do it, but from what I've heard they don't do it _all_ that well. And there's no open source versioning yet.

Our initial implementation is just focusing on wiki type editing. No branching at all. I think it's wise to design and implement just the wiki stuff, and then turn to branches. If not, we'll overdesign, which has been the source of most of the failures in GeoTools: there weren't clear needs and the basic stuff hadn't been tested.

So along the same lines I don't want to over-optimize for distributed versioning. I think what Cameron's focused on, and what should be in OWS-5.2, is the logical next step - pure synchronization, having the edges stay up to date with the center. We need to get a real operational version of that working before we move on to how to push data back to the center, and then how to make it fully distributed.

So in terms of being 'beholden' to the SVN model, in some ways yes, to simplify the assumptions. We'll feel much more confident thinking through the distributed model once we have the non-distributed model down. Are we going to lock it down to a SVN model? No. But are we going to think through all the implications for a truly distributed system before we have an implemented base to stand on? No. Are we ok with other people doing that and extending our work in that direction? Yes. But I definitely caution against getting overly ambitious. This stuff is _really_ hard. Repeat, this stuff is _really_ hard.

In terms of diffs being central/not central - the current database design is optimized towards wiki type operations. The current database design is not set in stone though. Mostly what we're working towards is the protocol, which should then be able to be mapped onto different backends. I don't believe 'diff' is marginalized in the current protocol. There is a 'GetDiff' operation, and the result can be a WFS transaction that can be applied to any other WFS-T. The rollback operation is just shorthand for applying a diff backwards.
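
As a rough sketch of that round trip (the KVP parameter names and both URLs here are guesses for illustration, not the settled protocol):

    import urllib.parse
    import urllib.request

    SOURCE_WFS = "http://example.org/geoserver/wfs"   # versioned WFS (hypothetical)
    TARGET_WFST = "http://example.net/geoserver/wfs"  # any plain WFS-T (hypothetical)

    def replicate(type_name, from_version, to_version):
        """Pull a diff from the versioned WFS and apply it to another WFS-T."""
        query = urllib.parse.urlencode({
            "service": "WFS",
            "request": "GetDiff",               # operation named above
            "typeName": type_name,
            "fromFeatureVersion": from_version,  # parameter names are guesses
            "toFeatureVersion": to_version,
        })
        # The GetDiff response is itself a wfs:Transaction document...
        diff = urllib.request.urlopen(SOURCE_WFS + "?" + query).read()
        # ...so it can be forwarded unchanged to the target server.
        req = urllib.request.Request(TARGET_WFST, data=diff,
                                     headers={"Content-Type": "text/xml"})
        return urllib.request.urlopen(req).read()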

To be honest, the idea of each database having its own internal schema sounds like an absolute nightmare. The idea of mapping an internal database schema to an external XML schema is the heart of RobA's complex feature work. And theoretically much of that should be coming home soon. But that's purely for reading data out. Being able to take an external transaction and turn it into the internal database structure is another whole ball of wax. But if I were doing it I'd look very closely into leveraging that work; it'd be a transactional complex feature store that can gather the appropriate transactions.

For checksumming - I have not thought about it at all. It's in the category of things I don't want to over-optimize for before we've got a basic working wikiable map.

As for naivety, I think you've got a decent grasp of the issues; the one thing I'd reiterate is how hard this stuff is. If I were going for the project I'd try to start with a couple of pilots to show basic functionality before jumping into this fully distributed vision. It's for sure where I want to go and the kind of workflows we should support. But it's tough stuff that will take time, not just a bunch of people working on it. I do think our community has a good chance of doing a great job at it, but I really don't want to see us get burned the way we have several times on complex features.

Feel free to throw your use cases and thoughts onto the versioning WFS wiki though; I'd like to keep that as a central place to gather thoughts. It could be good to port some of Cameron's posts there as well. I think it's important to gather the more advanced use cases so we can keep them in mind, and I think it's great work you're doing. I'm just looking to be realistic about what we can accomplish.

best regards,

Chris

Adrian Custer wrote:

Hey all,

        I would send this directly to some list if I felt a particular
        one was appropriate. The message is intended to be open to
        anyone; the current list of recipients merely contains people
        who I suspect are interested and involved in this work.

After searching the backwaters of the relational database world, I
discovered late yesterday that several of you have been thinking heavily
about the issues of distributed data set management leading to the
Versioning WFS-T work. I'm excited by your work since the issues are
both interesting and fundamental, as you are all aware. For us, this
work is perfectly timely since I'm helping Geomatys consider
participation in a bid for a French government agency which, to be done
right, requires a distributed versioning system of this kind. (I'll come
back eventually to this project since I think it introduces an extra
twist into the existing discussion.)

This email is principally for two reasons:
  * clarification of where things stand today, who is doing what,
    where you are each taking things, and
  * clarification of your analysis perspectives so I can better
    understand your writeups on blogs and wiki pages as I try to
    digest and contextualize the research that has been done.

Status:
------
Ideally, I'd love a one liner from the various groups explaining who is
doing what so that we can understand how to help each other.

Cameron, I see from your blog that you have a project going. Are you
really going to deliver something in Feb '08? If so, what are you
delivering? Your blog talks about all the required pieces from field GPS
survey (data dictionaries...) through uDig verification and integration
to the cascading server system. Is there a proposal document for what you
are doing? Is this for either the Canadian government or for the OGC?
Chris, similarly, is TOPP actively on the hook to deliver versioned WFS-T
products, or is this work incidental in that you can see that TOPP will
need it in the future? Do you have a plan and schedule, or is this work
catch-as-catch-can, to be completed when funding arrives?

Info:
----
Recently, I've been following several discussions about source code
version control systems including talks by Linus on Git, Mark
Shuttleworth (Ubuntu) on Bazaar and Launchpad, Mark Wielaard (GNU
Classpath) on Mercurial. From that reading, from experience with
svn.geotools.org, and from the history of issues around GNOME, I've
acquired some working assumptions about versioning systems through
which I'm arriving at the Versioning WFS-T work:
  1. If it's not distributed, it's not version control.
  2. Merges are key.
  3. Verifiable integrity is necessary.
It's probably not worth going into the particular reasons I've arrived
at these notions. Linus and Shuttleworth have been hammering the first
of these recently, justifying their stance partly out of convenience
(all work can be local rather than networked) and partly out of
political conviction (if everyone has a fully functional code tree then
we have a functional system). Similarly, the second point seems, by
collective consensus, to be a consequence of the first point. The second
point is also something that those of us in the CVS and SVN tradition
are likely to under-appreciate since we avoid branching due to our poor
experience with merging. The third point comes from Git's use of
checksums to immediately notice disk or memory corruption: you need to
be sure that everyone pulling version xxx.yyy is getting *exactly* the
same thing.

A. To what extent is the current discussion of Versioning WFS-T beholden
to the SVN version control model?

Andrea, so far your discussion (at least you seem to be a primary author
of much of the wiki work) seems deeply infused with the vision of SVN.
Is this a conscious choice of a 'place to start', or is your take that SVN
can be assumed to be representative of a good 'versioning'
architecture? Linus' hostility to SVN, coupled with the issues we have
working against the Refractions server from Europe, makes me think that
there is a serious problem with using the SVN architecture as a suitable
example for the Versioning WFS-T work; instead, I think, without fully
understanding the consequences, that we need to be thinking along the
lines of Git, Bazaar or Mercurial.

B. To what extent have diffs been marginalized from the currently
proposed solution?

From what I'm gathering, it seems that the vision of the diff and its
merge is losing precedence in the current implementations, which favour
a particular table structure of {Feature+[temporal validity]}. Since in
the project we are evaluating, the diff is the *only* well defined
element of the whole exchange system, I'm obviously arriving at the
problem from that end. Since the diff blob is also the unit of work done
in the field and because in Git the diff blob is the unit of integration
and authentication, I'm biased to think that the diff is central.
However, the wiki writeups focus on versioning via a table structure. Is
this merely for current convenience? Does the current system still
explicitly guarantee a 1:1 correspondence between a set of changes and
some external representation?
The interesting twist in this French project is that each database is
allowed to have its own internal schema. The project is to maintain a
national road network system based on the work of regional governments
and local operators. Some operators apparently have a full roadway
topology with all the lanes whereas the region or central databases may
only have a schematic representation of the road centerlines. The
restriction is that all databases are required to be able to generate
and digest a 'diff' statement in a common GML-based format. So workers
in the field make a change to one of the databases and this same change
can then be sent to the other databases for integration.

        For this to work, I suspect there must implicitly be a semantic
        'lowest common denominator' schema involving a simple schematic
        representation of the road network. The diffs are made against
        this implicit schema and each database then has its own,
        separate conflation step between the implicit schema and the
        internal schema. I suspect this idea of conflation helps
        separate two aspects of a diff: the change being made and the
        relation of that change to the pre-existing state.
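
To picture that split (purely illustrative; every name below is
invented), a diff would be expressed only against the shared
lowest-common-denominator schema, and each database would supply its own
conflation step to fold the change into its richer internal model:

    # Illustrative toy only: a change against an invented common
    # centreline schema, and one database's local conflation step that
    # folds it into a richer (here, lane-based) internal model.
    COMMON_DIFF = {
        "schema": "common-centreline-v0",
        "changes": [{"road_id": "A7", "geometry": "LINESTRING(0 0, 10 0)"}],
    }

    # This particular database happens to keep two lanes per road.
    LANES = {"A7": [{"lane": 1, "geom": None}, {"lane": 2, "geom": None}]}

    def conflate(change):
        """Map a centreline change onto this database's own lane records."""
        for lane in LANES.get(change["road_id"], []):
            # Real conflation would derive each lane from the new
            # centreline; here we only flag the lanes as needing rework.
            lane["geom"] = change["geometry"]
            lane["dirty"] = True

    for change in COMMON_DIFF["changes"]:
        conflate(change)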

C. Does checksumming have a role to play?

In the vision of *distributed* version control, there has to be a way to
know when different states match. I.e. we need to know that
  base + patchA + patchB + patchC
is the same as
  base + patchB + patchD
when it is. Have any of you been thinking about this kind of
verification work and how it could be implemented?
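
Concretely, I imagine something like this minimal sketch (the feature
representation and everything else here is invented): hash a canonical
serialization of the full feature state, so that two nodes which reached
the same state by different patch paths produce the same digest.

    import hashlib
    import json

    def state_digest(features):
        """features: dict of feature id -> attribute dict (incl. geometry)."""
        canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
        return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

    # Two replicas compare digests to decide whether
    #   base + patchA + patchB + patchC  ==  base + patchB + patchD
    replica_a = {"roads.1": {"name": "Main St", "geom": "LINESTRING(0 0, 1 1)"}}
    replica_b = {"roads.1": {"name": "Main St", "geom": "LINESTRING(0 0, 1 1)"}}
    assert state_digest(replica_a) == state_digest(replica_b)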

Enough for now. Perhaps this is all totally naive. I'm planning to flesh
this out by considering use cases and workflows both from your wiki
examples (e.g. an OpenStreetMap-like system vs. a simple cache-then-sync
system vs. a hierarchical system) and from the design specifications for
this French project.

All answers, flames and pointers are welcome,

--adrian
