[Geoserver-devel] WFS Locking operation

Good Morning all:

Jody Garnett of Refractions Research here - I am in the process of
planning some GeoServer work: a prototype of a transactionally strong
PostGIS Datasource. The goal is to have GeoServer use database locking
mechanisms.

Near as I can tell (see the sketch after this list):
- locks are held in a static repository HashMap in TypeRepository
- the locks are held by TypeRepository.InternalLock objects, which track
which Features are under their control
- Datasource transactions are supported, using commit and rollback
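
Roughly, the shape I have in mind is the following (a minimal sketch on my
part - the field and method names are guesses, not the actual GeoServer
source):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TypeRepository {
    // static repository of locks, keyed by lockId
    private static Map locks = new HashMap();

    static class InternalLock {
        private String lockId;
        // the Features (by fid) under this lock's control
        private Set featureIds = new HashSet();

        boolean holds(String fid) {
            return featureIds.contains(fid);
        }
    }
}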

Any suggestions on how this work could proceed would be appreciated.

Some quick ideas I had were to approach the geotools team about creating
a TransactionDatasource interface that contains explicit feature locking
methods, creating a postgis implementation of this, and adding a series
of checks and calls from the TypeRepository.InternalLock implementation.
While this smacks of wearing a belt and braces, it would allow the use of
GeoServer with PostGIS and pave the way for similar OracleSpatial support.
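
A first cut of the interface might look something like the sketch below -
the method names and signatures are only my guesses, nothing agreed with
the geotools team yet:

import java.util.Set;

public interface TransactionDataSource {
    // lock the given feature ids under a lock id; return the ids actually acquired
    Set lockFeatures(String lockId, Set featureIds) throws Exception;

    // release every feature held by the lock id
    void unlockFeatures(String lockId) throws Exception;

    // transaction support, as the existing datasources already provide
    void setAutoCommit(boolean autoCommit) throws Exception;
    void commit() throws Exception;
    void rollback() throws Exception;
}

A PostGIS implementation could back lockFeatures with the database's own
row locks; a file-based implementation could fall back to tracking the
fids in memory.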

I still have not looked into the DataSource management strategy for
GeoServer (located in getTransactionDataSource()?) and would be
hard-pressed to reattach locked operations to their correct DataSource.

Thanks,

Jody Garnett
Refractions Research Inc.

> Jody Garnett of Refractions Research here - I am in the process of
> planning some GeoServer work: a prototype of a transactionally strong
> PostGIS Datasource. The goal is to have GeoServer use database locking
> mechanisms.

That would be nice. I looked into locking in postgres for a bit when I
started on locking, but it didn't cover all my needs. I haven't thought
about things much since then, but reading your email makes me realize
that we can easily do parts in postgis and have a nice synthesis. Postgresql
couldn't cover the expiries at all, so I decided to just focus on getting
things done in java, and just in geoserver. I figured we could come back
later and look at how to beef things up, and migrate stuff to geotools.
It looks like the time is now...

> Near as I can tell:
> - locks are held in a static repository HashMap in TypeRepository
> - the locks are held by TypeRepository.InternalLock objects, which track
> which Features are under their control
> - Datasource transactions are supported, using commit and rollback
>
> Any suggestions on how this work could proceed would be appreciated.

> Some quick ideas I had were to approach the geotools team about creating
> a TransactionDatasource interface that contains explicit feature locking
> methods, creating a postgis implementation of this, and adding a series
> of checks and calls from the TypeRepository.InternalLock implementation.
> While this smacks of wearing a belt and braces, it would allow the use of
> GeoServer with PostGIS and pave the way for similar OracleSpatial support.

I think this is definitely the route to follow. As we figure out things
in geotools we can get rid of the redundant code in geoserver, as it was
merely my first shot at getting locking working, and I'm sure we can do
better. I had been thinking of an all or nothing approach with geotools
and locking, but I like the idea of moving things over slowly: leaving the
lock management in geoserver for a bit, but making the geotools interfaces
able to handle the locking. I know there are going to be issues when we
start to remove the braces (geoserver code), as I remember there were
reasons I didn't put code into geotools right away. I remember one had to
do with the fact that a wfs lock can be across datasources, so be sure to
keep that in mind.

I think the geotools team would be receptive to a TransactionDataSource -
Ian and I talked a bit about extending the DataSource interface. An
AbstractTransactionDataSource would probably be a good idea, to reuse
common code. You could also put code in there for datasources that do not
have native locking, that just keeps track of the features that are
locked. This of course is not as good for file datasources, since other
applications could still modify them, but it would be something. Perhaps
that code would be in the AbstractTransactionDatasource, and we could also
write an AbstractJDBCDatasource, something I've been wanting for a while,
that would implement commit/rollback/setAutoCommit and others, leaving the
concrete implementations to do the actual locking with sql.
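
Something like this little sketch is what I picture for the
no-native-locking case (all names made up, just to show the idea of
tracking locked features in memory):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

public abstract class AbstractTransactionDataSource {
    // featureId -> lockId for everything currently locked
    private Map lockedFeatures = new HashMap();

    public synchronized Set lockFeatures(String lockId, Set featureIds) {
        Set acquired = new HashSet();
        for (Iterator i = featureIds.iterator(); i.hasNext();) {
            String fid = (String) i.next();
            if (!lockedFeatures.containsKey(fid)) {
                lockedFeatures.put(fid, lockId);
                acquired.add(fid);
            }
        }
        return acquired;
    }

    public synchronized void unlockFeatures(String lockId) {
        for (Iterator i = lockedFeatures.values().iterator(); i.hasNext();) {
            if (lockId.equals(i.next())) {
                i.remove();
            }
        }
    }
}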

> I still have not looked into the DataSource management strategy for
> GeoServer (located in getTransactionDataSource()?) and would be
> hard-pressed to reattach locked operations to their correct DataSource.

The datasource management strategy for transactions is pretty bad, and
brings up another thing that I'd like to see in geotools - connection
pooling. The code for how jdbc datasources get their connections is
practically the same; the only difference is the actual driver used. So
this is another place where we can get a lot out of it if we implement it
correctly.
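
Even a dead simple shared pool, something like the sketch below (a
hypothetical class, nothing that exists in geotools), would cut out most
of that duplicated connection code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

public class JDBCConnectionPool {
    // jdbc url -> list of idle connections (assumes the driver is already registered)
    private Map idle = new HashMap();

    public synchronized Connection getConnection(String url, String user, String pass)
            throws SQLException {
        LinkedList free = (LinkedList) idle.get(url);
        if (free != null && !free.isEmpty()) {
            return (Connection) free.removeFirst();
        }
        return DriverManager.getConnection(url, user, pass);
    }

    public synchronized void release(String url, Connection conn) {
        LinkedList free = (LinkedList) idle.get(url);
        if (free == null) {
            free = new LinkedList();
            idle.put(url, free);
        }
        free.addLast(conn);
    }
}

The jdbc datasources would then only differ in the url and driver they
hand to it.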

Apologies for a fairly vague email, but I've been away from locking for
a while, so I need to think things through a bit more, and I need to run
right now. I will think more about these issues over the next few days, but
these are certainly timely questions, as we've just about completed spec
conformance for geoserver, so it's time to move to issues of robustness.
But I think you're on the right path - do things right in geotools, and as
we move functionality there, phase out our lightweight implementations in
geoserver, testing all along the way (I have locking unit tests, but they
are specific to my install, so I'll try to get some distributable tests).

  Chris

> Thanks,
>
> Jody Garnett
> Refractions Research Inc.


Thanks for the quick response,

There is certainly plenty of complexity in any distributed system, and locking is usually where it becomes apparent.

>> PostGIS Datasource. The goal is to have GeoServer use database locking
>> mechanisms.
>
> about things much since then, but reading your email makes me realize
> that we can easily do parts in postgis and have a nice synthesis. Postgresql
> couldn't cover the expiries at all, so I decided to just focus on getting

Since you have done all that nice expiries work, I was simply going to
have the TypeRepository.InternalLock objects release their connection (or
rollback and repool, I suppose) when their lease had run out. For my
purposes the strong transactional support is important; I have no need to
ask the database to implement timeouts.
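
Something along these lines is what I mean - a minimal sketch, with the
names made up and the repooling left as a comment:

import java.sql.Connection;
import java.util.Timer;
import java.util.TimerTask;

class LockLease {
    private final Connection conn;
    private final Timer timer = new Timer(true);

    LockLease(Connection conn, long expiryMillis) {
        this.conn = conn;
        timer.schedule(new TimerTask() {
            public void run() {
                expire();
            }
        }, expiryMillis);
    }

    synchronized void expire() {
        try {
            conn.rollback();   // drops the row locks this connection's transaction holds
            conn.close();      // or hand it back to a connection pool instead
        } catch (Exception e) {
            // log and move on; the lease is gone either way
        }
    }
}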

> reasons I didn't put code into geotools right away. I remember one had to
> do with the fact that a wfs lock can be across datasources, so be sure to
> keep that in mind.

I had run across the commit and rollback operations across multiple
datasources, which brought this to mind. The WFS lock element seems to be
implementation dependent. It would be nice to extend this idea to the
locks the TransactionDataSources provide, allowing them a scratch space
to store their session-based information in. TransactionalDatasources
could provide enough information in the lock xml fragment they provide to
recover the specific database connection used.

> I think the geotools team would be receptive to a TransactionDataSource -
> Ian and I talked a bit about extending the DataSource interface. An
> AbstractTransactionDataSource would probably be a good idea, to reuse
> common code. You could also put code in there for datasources that do not
> have native locking, that just keeps track of the features that are
> locked. This of course is not as good for file datasources, since other
> applications could still modify them, but it would be something. Perhaps
> that code would be in the AbstractTransactionDatasource, and we could also
> write an AbstractJDBCDatasource, something I've been wanting for a while,
> that would implement commit/rollback/setAutoCommit and others, leaving the
> concrete implementations to do the actual locking with sql.

This sounds like a sensible arrangement - I will have to be fairly
focused for my initial prototype. The idea is to do a quick spike to prove
the idea workable for Postgis, and then refine the implementation and
generalize for OracleSpatial and ArcSDE.

Not that I have seen either of the above yet - but I will.

> The datasource management strategy for transactions is pretty bad, and
> brings up another thing that I'd like to see in geotools - connection
> pooling. The code for how jdbc datasources get their connections is
> practically the same; the only difference is the actual driver used. So
> this is another place where we can get a lot out of it if we implement it
> correctly.

The interaction between locks and database connections is where the
complexity hits for me; I am used to pooling database connections for
reuse across different transactions. Keeping a connection around for open
locks is going to be interesting.

> Apologies for a fairly vague email, but I've been away from locking for
> a while, so I need to think things through a bit more, and I need to run
> right now. I will think more about these issues over the next few days, but
> these are certainly timely questions, as we've just about completed spec
> conformance for geoserver, so it's time to move to issues of robustness.
> But I think you're on the right path - do things right in geotools, and as
> we move functionality there, phase out our lightweight implementations in
> geoserver, testing all along the way (I have locking unit tests, but they
> are specific to my install, so I'll try to get some distributable tests).

No apologies needed - and testing code would be appreciated (tests are
always a great place to start a design). My timeline is to have a
prototype PostgisTransactionDatastore by the middle of August.

Time to hop on the geotools2 irc I suppose. I hate to disturb their code
crunch for a beta. I am happy with a TransactionalDataSource being added
in a dot release. Does GeoServer have any policy regarding tracking
geotools2 versions?

>>PostGIS Datasource. The goal is to have GeoServer use database locking
>>mechanisms.
>>
>>
>about things much since then, but reading your email makes me realize
>that we can easily do parts in postgis and have a nice synthesis. Postgresql
>couldn't cover the expiries at all, so I decided to just focus on getting
>
>
> Since you have done all that nice expiries work, I was simply going to
> have the TypeRepository.InternalLock objects release their connection (or
> rollback and repool, I suppose) when their lease had run out. For my
> purposes the strong transactional support is important; I have no need to
> ask the database to implement timeouts.

Definitely. Time to use what postgis _can_ do, synthesize the best from
each.

>reasons I didn't put code into geotools right away. I remember one had to
>do with the fact that a wfs lock can be across datasources, so be sure to
>keep that in mind.
>
> I had run across the commit and rollback operations across multiple
> datasources, which brought this to mind. The WFS lock element seems to be
> implementation dependent. It would be nice to extend this idea to the
> locks the TransactionDataSources provide, allowing them a scratch space
> to store their session-based information in. TransactionalDatasources
> could provide enough information in the lock xml fragment they provide to
> recover the specific database connection used.

I think this might work well, if I'm understanding correctly. Code it up
and we'll see.

>I think the geotools team would be receptive to a TransactionDataSource -
>Ian and I talked a bit about extending the DataSource interface. An
>AbstractTransactionDataSource would probably be a good idea, to reuse
>common code. You could also put code in there for datasources that do not
>have native locking, that just keeps track of the features that are
>locked. This of course is not as good for file datasources, since other
>applications could still modify, but it would be something. Perhaps that
>code would be in the AbstractTransactionDatasource and we could also write
>an AbstractJDBCDatasource, something I've been wanting for awhile, that
>would implement commit/rollback/setAutoCommit, and others, leaving the
>concrete implementations to do the actual locking with sql.
>
>
> This sounds like a sensible arrangement - I will have to be fairly
> focused for my initial prototype. The idea is to do a quick spike to prove
> the idea workable for Postgis, and then refine the implementation and
> generalize for OracleSpatial and ArcSDE.

That's definitely the approach I've taken through all geoserver
development - focus on postgis and then generalize.

>The datasource management strategy for transactions is pretty bad, and
>brings up another thing that I'd like to see in geotools - connection
>pooling. The code for how jdbc datasources get their connections is
>practically the same, the only difference is the actual driver used. So
>this is another place where we can get a lot out of it if we implement
>correctly.
>
> The interaction between locks and database connections is where the
> complexity hits for me; I am used to pooling database connections for
> reuse across different transactions. Keeping a connection around for open
> locks is going to be interesting.

Will think more on this near the end of the week. Trying to get a release
out in the next day or two, but then I'll have good time to devote to
this. But yes, it's going to be interesting.

>Apologies for a fairly vague email, but I've been away from locking for
>awhile, so I need to think things through a bit more, and I need to run right
>now. I will think more about these issues over the next few days, but
>these are certainly timely questions, as we've just about completed spec
>conformance for geoserver, so it's time to move to issues of robustness.
>But I think you're on the right path - do things right in geotools, and as
>we move functionality there phase out our lightweight implementations in
>geoserver, testing all along the way (I have locking unit tests, but they
>are specific to my install, so I'll try to get some distributable tests).
>
>
> No apologies needed - and testing code would be appreciated (tests are
> always a great place to start a design).

I'll pass along what I already have, plus a sql file to get the test
data into your postgis install.

> My timeline is to have a prototype PostgisTransactionDatastore by the
> middle of August.

Sounds good.

> Time to hop on the geotools2 irc I suppose. I hate to disturb their code
> crunch for a beta. I am happy with a TransactionalDataSource being added
> in a dot release. Does GeoServer have any policy regarding tracking
> geotools2 versions?

Considering that there is no such thing as geotools2 versioning yet, no.
We've been begging geotools to start releases for awhile now, for that
very reason. Our current way is awful - I hand roll the jars on my
machine, updating the jar as geotools evolves to GeoServer's needs. As
soon as geotools actually gets a release out the door that will change,
and we'll figure out a policy for geotools2 versions.

  Chris

> Near as I can tell:
> - locks are held in a static repository HashMap in TypeRepository
> - the locks are held by TypeRepository.InternalLock objects, which track
> which Features are under their control
> - Datasource transactions are supported, using commit and rollback
>
> Any suggestions on how this work could proceed would be appreciated.
>
> Some quick ideas I had were to approach the geotools team about creating
> a TransactionDatasource interface that contains explicit feature locking
> methods, creating a postgis implementation of this, and adding a series
> of checks and calls from the TypeRepository.InternalLock implementation.
> While this smacks of wearing a belt and braces, it would allow the use of
> GeoServer with PostGIS and pave the way for similar OracleSpatial support.
>
> I still have not looked into the DataSource management strategy for
> GeoServer (located in getTransactionDataSource()?) and would be
> hard-pressed to reattach locked operations to their correct DataSource.

Thinking about this more and more, I do not envy your task at all. There
are a lot of pitfalls, and tons of special cases to think through. I
think I'd pretty much blocked locking from my mind, as the first iteration
was annoying to implement. It definitely could use beefing up, as doing
the locking just in java is not ideal. But there are a ton of little
things that you need to get exactly right, which get harder when you start
doing locking on the backend as well.

As I think about it more I think that you're probably going to have to
more or less completely redo how geotools jdbc datasources handle their
database connections. I think first I'll give a base description of how
geoserver handles locks/connections, to make sure we're on the same page,
though you've probably figured most of this out. I'll highlight the
tricky things, and the things we don't implement very well right now, and
then detail how I think we should perhaps proceed.

First off, database connections. Each postgis datasource is created using
a java.sql.Connection object, which is generated from
PostgisConnectionFactory (which implements javax.sql.DataSource for reasons
that were never clear to me - looking just now at the javax.sql.DataSource
interface it seems to make even less sense, as that interface should be
implemented by the driver, which it is for postgres -
PostgresqlDataSource.java, in their source tree - but we actually can't
just use that, as it doesn't allow you to set the charset, which is
something our users need... but I digress). GeoServer currently makes use
of the geotools DataSourceFinder, which will find the appropriate
DataSourceFactory based on the parameters. The postgis factory is called
PostgisDataSourceFactory; it creates a PostgisConnectionFactory, and
uses that connection to make the PostgisDataSource.

The DataSourceFinder is quite nice because it allows us to add new
geotools datasources with no additional work if they are implemented
correctly. No code would need to change; the only thing required for a
user to use the new datasource is for them to fill out their info.xml
correctly - they just need to put the right geotools params in the
DatasourceParams element, and DataSourceFinder does the rest of the work.
Where this falls short is connections, since a new connection is made each
time the DataSourceFinder is used. A geotools connection pooling object
for all jdbc datasources to share could help out here.

GeoServer currently gets around that by holding on to a datasource for
each featureType. Well, actually 2 datasources if transactions are being
used. TypeInfo supplies the datasources with getDataSource and
getTransactionDatasource, both of which lazily initialize DSes with the
DataSourceFinder - the params are stored in the TypeInfo, read in with
FeatureType (yes, should probably be renamed to avoid conflict with
geotools). The reason for the two different datasources is that we don't
want a getFeature that comes in while a transaction is waiting to complete
to return its view. This is because the transaction may roll back if a
later operation (each transaction request can have more than one insert,
update and delete) fails. If a getFeature is on the same connection then
it would see the database from the view of an altered but not yet
committed transaction.
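
In plain jdbc terms the situation is basically this (a toy example - the
url and table are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TwoConnectionDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost/geo";
        Connection read = DriverManager.getConnection(url, "user", "pass");
        Connection write = DriverManager.getConnection(url, "user", "pass");
        write.setAutoCommit(false);

        Statement w = write.createStatement();
        w.executeUpdate("UPDATE rail SET name = 'x' WHERE gid = 1");

        // nothing committed yet: a getFeature running on the read connection
        // still sees the old row, so a later rollback() leaves it untouched
        Statement r = read.createStatement();
        ResultSet rs = r.executeQuery("SELECT name FROM rail WHERE gid = 1");

        // commit() or rollback() on the write connection decides what readers see
        write.rollback();
    }
}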

Of course, having all transactions on the same connection is bad, because
two different transactions may access it at the same time. I didn't worry
too much about this, since it's a fairly rare case - two transactions need
to come in at the same time, and need to be operating on the same
features, and it would only mess things up in cases when locking was not
used. If one of the transactions had used the lockFeatures operation,
then only one would have the lockId, and thus it would be the only one
allowed to operate - the other transaction wouldn't be successful until
the lock was released.

Another shortcoming of the current implementation is that on shutdown
all the open connections fail ungracefully - they don't call
connection.close(). I tried for a bit to put a finalize() into the
postgis datasource, but I never got it working; I think I didn't
successfully get rid of all references to it or something.

As for locking - it's a lot easier when you don't have to deal with
connections. And it still was a damn bitch to implement. The simple case
is one lock on one datasource. And this would be easy to implement with a
lock interface - either the features are locked, or they're not and can be
operated on. But you also have to handle different locks operating on one
datasource - lockid 45 could hold features rail.1-30, and lockid 55 could
hold rail.31-70. And 55 could come in and request rail.20-70, but with
lockAction SOME, and it has to check what's already locked and grab
what it can. And when a transaction with lockid 55 comes in with
releaseAction=ALL then 55 must know which ones it holds. But it doesn't
stop there, you also have to deal with one (or more) lock operating on
more than one datasource. This means lock 55 can also be locking road.23,
and if a transaction with releaseAction=ALL comes in and updates rail.43,
then road.23 also needs to be released after the transaction is complete.

The current implementation successfully deals with all of these -
internalLocks know all the features they lock, and thus can release them
all. The lockedFeatures map allows quick look-up to see if a given
feature is locked, as lockedFeatures.contains(featureId) just needs to be
called. Every lock or transaction must first get all the features that it
is trying to operate on from the datasource (just the fids, to save the
overhead of constructing geometries) - the hope with doing stuff in the db
is that we can get rid of this huge overhead (all that checking, that is),
though I haven't thought through a good way yet (that's where you come
in, right? :). If one of the features that it is about to operate on is
locked, then the lockedFeatures map is used to look up its lockId, which
is compared against the lockId the transaction holds, and if they match
it is allowed to proceed. The internalLock also has a timer to manage the
expiry, and can reset its timer if a releaseAction of SOME comes through
(the spec says expiries are reset in that case).
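
Boiled down, the bookkeeping amounts to something like the sketch below
(names are approximate, and the expiry timer is left out):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

class LockManager {
    // featureId -> lockId, for the quick "is this feature locked?" check
    private Map lockedFeatures = new HashMap();
    // lockId -> set of featureIds, so releaseAction=ALL can release everything
    private Map locks = new HashMap();

    synchronized boolean canOperate(String fid, String lockId) {
        String holder = (String) lockedFeatures.get(fid);
        // free, or held by the very lock the transaction presented
        return holder == null || holder.equals(lockId);
    }

    synchronized void lock(String lockId, String fid) {
        lockedFeatures.put(fid, lockId);
        Set held = (Set) locks.get(lockId);
        if (held == null) {
            held = new HashSet();
            locks.put(lockId, held);
        }
        held.add(fid);
    }

    // releaseAction=ALL: release every feature the lock holds, on any datasource
    synchronized void releaseAll(String lockId) {
        Set held = (Set) locks.remove(lockId);
        if (held == null) return;
        for (Iterator i = held.iterator(); i.hasNext();) {
            lockedFeatures.remove(i.next());
        }
    }

    // releaseAction=SOME: release only the features the transaction touched
    synchronized void releaseSome(String lockId, Set fids) {
        Set held = (Set) locks.get(lockId);
        if (held == null) return;
        for (Iterator i = fids.iterator(); i.hasNext();) {
            Object fid = i.next();
            if (held.remove(fid)) {
                lockedFeatures.remove(fid);
            }
        }
    }
}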

I've been thinking through what you want to do, and it's looking like we
will probably always need both the belt and braces. I'd still like to
move the code to geotools eventually, but even there I think we will still
need a lock manager, and some sort of internal lock that can keep track of
which datasources are locked by the same lockId. It'll manage the
expiries as well.

I'm having trouble figuring out how the postgis datasource will be able to
handle locking. I think the way to start out is for each internalLock to
hold pointers to the datasources that it affects; then it can call
unlock() when it needs to. But I think even for unlock we need to redo
postgis so it deals with connections better, if we are going to do locking
in the database. This is where my knowledge runs thin, and you are
obviously in a better place to figure things out, with all the postgis
team right there. But as far as I can tell, the only way to get row level
locking in postgis (from java) is to setAutoCommit(false), and then do a
select ... where (Filter) ... for update. I guess you could use the same
unpacking mechanism as getFeatures(), and then also lock the fids. But to
keep the lock, you have to hold on to that connection. Of course,
geoserver holds on to that connection, so things are fine from that end.
But from the geotools side, the lock will be lost as soon as someone calls
commit(). Oh, but this extends to geoserver as well, as soon as you get
into two locks on the same datasource, since they will be on the same
connection, and when one calls commit the other is lost. Or am I thinking
about this wrong? Is there another way to acquire a lock?
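
In bare jdbc, the approach I'm describing is just something like this
(the connection details and table are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SelectForUpdateSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/geo", "user", "pass");
        conn.setAutoCommit(false);

        Statement st = conn.createStatement();
        // the rows returned here stay locked for the life of this transaction
        ResultSet rs = st.executeQuery(
                "SELECT gid FROM rail WHERE gid BETWEEN 1 AND 30 FOR UPDATE");
        while (rs.next()) {
            // remember the fids that are now locked
        }

        // the locks live exactly as long as the transaction: commit() or
        // rollback() on this connection releases them, which is why the
        // connection itself has to stay open and unused by anything else
    }
}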

My initial thoughts are that the postgis datasource is going to have to be
reimplemented with better connection management, and it will have to have
its own internalLocks, each of which will have a connection to the
database where its locks are open. And then it'd also make sense to have
a lockedFeatures map, so that an isLocked operation doesn't have to query
each connection to see if it holds a given row. I guess this will also
make locking with lockAction SOME a bit better, since creating the
featuresLocked and featuresNotLocked lists will be easier. But each
datasource internal lock - let's call them subInternalLocks - will in turn
have to have a superLock, like the InternalLock of our current
TypeRepository. Of course, the only information that would be useful in
the subInternalLock is the Connection that it holds, as my first intuition
is that its other information will be lists of features locked, which may
function better in the superLock/InternalLock. But either way, postgis
will need to be able to handle its connections better, so that
transactions with lockIds can operate on the connections that hold the
locks.
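
So in code, the arrangement I'm picturing is roughly this (all the names
are made up):

import java.sql.Connection;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// one wfs-level lock spanning datasources...
class SuperLock {
    private final String lockId;
    private final List subLocks = new ArrayList();   // one SubInternalLock per datasource
    private final Set lockedFids = new HashSet();    // for quick isLocked() checks

    SuperLock(String lockId) {
        this.lockId = lockId;
    }

    void add(SubInternalLock sub, Set fids) {
        subLocks.add(sub);
        lockedFids.addAll(fids);
    }

    boolean isLocked(String fid) {
        return lockedFids.contains(fid);
    }

    void release() throws Exception {
        for (Iterator i = subLocks.iterator(); i.hasNext();) {
            ((SubInternalLock) i.next()).release();
        }
        lockedFids.clear();
    }
}

// ...each pinned to the connection that actually holds its rows
class SubInternalLock {
    private final Connection conn;

    SubInternalLock(Connection conn) {
        this.conn = conn;
    }

    void release() throws Exception {
        conn.rollback();   // ending the transaction is what frees the row locks
    }
}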

Another thing I'm having trouble figuring out in postgis is that I
can't seem to get locking to fail if it can't get what it wants. If
something else is already locking something I try to lock, then the
connection that I'm on just sits there and waits until it clears. But for
wfs locking and updating, we need to know that it can't acquire the lock.
I guess the alternative would be a way to query postgis to see what rows
are locked, but I couldn't figure out how to do that either. And then
ideally we could figure out which connection actually held that lock, but
I'm not sure that's possible.

I think an initial interface extension might be able to get by with
something like:
lock(Filter filter, Connection conn);
unlock(Set fids, Connection conn);
transaction(TransactionRequest trans, Connection conn);

This way you could generate connections in GeoServer, and perhaps manage
them there. The TransactionRequest would be just like the wfs transaction
- with a lockId, and one or more update/delete/insert operations. The
transaction operation would use the existing interface -
setAutoCommit(false);
add/remove/modify for each operation in the request,
commit() or rollback() depending on whether they go through.

Of course you'd have to do something to get the add/remove/modify to use
the same connection as the one where the lock is. Not sure if this is the
right way, but if it worked then you could start to think through how
PostgisDataSource uses/generates connections - it should perhaps take
PostgisConnectionFactory in its constructor or something.
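
In very rough terms the transaction operation might look like the sketch
below - TransactionRequest and its contents are placeholders here, not
anything that exists yet:

import java.sql.Connection;

public class TransactionSketch {
    public void transaction(TransactionRequest trans, Connection conn) throws Exception {
        conn.setAutoCommit(false);
        try {
            for (int i = 0; i < trans.size(); i++) {
                Object op = trans.get(i);
                // dispatch op to the existing add/remove/modify methods, making
                // sure they run on this same connection so they see (and keep)
                // the locks it holds
            }
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    // placeholder for the wfs-style request: a lockId plus an ordered list
    // of insert/update/delete operations
    public interface TransactionRequest {
        String getLockId();
        int size();
        Object get(int i);
    }
}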

Ok, so now I'm rambling into confusion, as I'd need to actually start to
implement some of this stuff to see how it goes. But I'd say start with
the simple case - one datasource with one lock, and then expand from
there. I think it may get nasty fast. There may be much better solutions
out there, that I just can't think through right now, as I've put lots of
thought into the way that I did it, so extensions seem to stem from that.
But I'll think through other options. An initial thought that hit me
while writing this email was something like a LockedFeatureCollection,
that would have a datasource backend, so that it could call lock
operations as needed. Perhaps lock management could take place using it.
That's all I have for now. I hope this email wasn't too incoherent and
hard to understand - feel free to ask questions about any parts of it.

  good luck,

         Chris