[Geoserver-devel] Featureid discussion [was] Re: Breakout Datasource IRC [was] Re: Javadocs for DataSource locking extension

Jody, Sean and I have been having an off list discussion, but I figured it
was time to make the discussion more open, so it can be archived and input
can be given. We started talking about datasources and setting up a break
out irc discussion to talk about geotools datasources, but several threads
developed. If the others continue I shall post them, but I think we should
generally try to have our devel discussions on the devel list. I'm still
unsure if this should go to geoserver or geotools - geotools is nice because
there is the devel vs. users list that we don't yet have on geoserver.
Perhaps if users of geoserver don't want so much email about development we
could create a geoserver-users list. There just hasn't really been enough
traffic on the list to justify it. Jody and Sean, this also contains my
latest reply to the thread...

Some other issues that would be nice to address eventually are:
* Refactoring the SQL statement code from oracle and postgis into a
common class ( but where to put it??)
* Resolving the FID generation problem.

Yes, I'd like to resolve that problem. I don't have any great ideas
right now though.

Me neither. I need to think about it some more.

> Perhaps It would help if you explained the problem to me?
> Reading the code on Fids was confusing for me - what is the
> actual intent?
>

As for the general problem, the contract of generating fids is a hard one
to get exactly right. They aren't _required_ by gml, but they are very
useful. They need to be persistant, across shutting down and starting up
of the db and changes to the file, they can't start with digits, and they
need to be automatically generated during an insert statement. The ideal
candidate I know for fids is the postgresql oid column, as it's
automatically generated, unique and persistant. But the postgis guys say
that I'm not allowed to use it as it's going away in postgres in the
future. So I started looking for primary keys to use those, as they have
to be unique, but the problem is generation. It's also tricky for shape,
since the row number provides a decent fid, but if you are to modify the
file it can change the row numbers.

The current method for producing unique FIDs in the Oracle datasource
is to use an FIDSequence that is initialised with the max fid in the
database. When new features are added, the sequence is then queried for
the next fid, which is produced by incrementing the max fid in the
FIDSequence.

>>This creates a sort of artificial, in code version of an oracle

SEQUENCE construct. However the main problems is that if some other
application adds data to the table, our in code generated fids will be
invalid.

This of course is a better solution than the current postgis datasource,
as that datasource doesn't implement anything for inserts if primary keys
are declared. But most postgis installs don't declare primary keys, so
the oid is just used. Jody, did the postgis guys ever fix shp2pgsql to
use primary keys? They were going to declare a sequence or something, but
I'm not sure if they ever actually implemented. Doing that would make
moving to using primary keys more reasonable. But we still need to figure
out the problem of fids. Another problem with the current approach is
that it also assumes that the primary key column is a number, which will
not always be the case. We can tell users that their databases need to be
set up in a certain way to use geotools, but it's always best to have it
work right away with any existing install.

Oh yeah, the other thing we could do to make the FIDSequence sure to
return the correct fid is to do the SELECT max(fidcolumn) each time you
want the next key. And you could also lock the table before calling the
max, releasing after the insert, to ensure that no one else got it.

>Ideally, we should use a DB sequence construct, however this makes some

assumptions about that database that may not be inline with reality.

Right. Perhaps the best route is to check for each of our ideal
constructs, and go to the more hacky construct if the ideal is not found.
So for postgis we would first check for a primary key with a sequence for
it, if not available then check if primary key is a number, and do an
FIDSequence type thing, and if that's not there either then use the oid.

I haven't really devoted much time to coming up with a good solution to
this and the current solution is just a stop gap measure, however it is
probably sufficient for the beta at this stage. If you have any ideas on
this, I would be most interested.

I'm in the same place, I spent a lot of time thinking about it before, but
haven't since then.

---slight jump, picks up from jody's reply to my email, and my reply to that.

>the oid is just used. Jody, did the postgis guys ever fix shp2pgsql to
>use primary keys? They were going to declare a sequence or something, but
>I'm not sure if they ever actually implemented.
>
It seems that they just use the row number in the shape file, so the
answer is probably no.

Hrm. They were definitely the ones that wanted me to change my datasource
to use primary keys, even though most users of postgis don't make primary
keys. Or wait, are you saying no they don't use primary keys, or no they
don't use a sequence? If they don't use a sequence what happens with
using the append flag? Won't you get duplicates on the primary key, as
rows will be repeated? Did they solve that problem? And if they didn't
implement a primary key declaration at all I'm inclined to use the oid
column. Could ask them again why using the oid is bad? Paul originally
said that pgsql is becoming oid independant, so they might not be around
forever, and that since they are invisible, therefor non-inuitive to non
pgsql users. At the time I just went ahead and did it, but he also said
shp2pgsql would be fixed up with primary keys. If that hasn't really
happened I'm thinking change it when oid's disappear, or when we know for
sure they are going to. As for the second concern, I don't think it
really matters if people don't know where their fids come from.

Also, Sean, did you ever look into using ROWID? As far as I can tell it's
like oid, but not scheduled to go away any time soon? I'm asking my
friend at oracle right now to be sure, but it looks promising. It's
generated for each row, is unique and autogenerated, and can easily be
queried into a 16 character string with ROWIDTOCHAR(ROWID). I'm still
trying to set up my oracle db, so I haven't checked it yet, what form it
is, but if it starts with a number we can just prepend the table name,
though I suspect it might already do that, as the wfs spec author uses
some oracle type db. If it works for oracledatasource then we won't need
to mess with fidsequence for it.

So is there just the two suggestions:
- a sequence backed by the Database
- a fid generated from the primary key

I am also wondering for the purposes of a WFS what happens when we are
trying to Insert and we get a failure or a rollback on the same
transaction. Do we rollback the sequence, or just have gaps in our fids?

I see no reason at all for us to not have gaps in our fids. That's not in
the contract at all. They just have to be unique and autogenerated.

Also, Sean, did you ever look into using ROWID? As far as I can tell it's
like oid, but not scheduled to go away any time soon? I'm asking my
friend at oracle right now to be sure, but it looks promising. It's
generated for each row, is unique and autogenerated, and can easily be
queried into a 16 character string with ROWIDTOCHAR(ROWID). I'm still
trying to set up my oracle db, so I haven't checked it yet, what form it
is, but if it starts with a number we can just prepend the table name,
though I suspect it might already do that, as the wfs spec author uses
some oracle type db. If it works for oracledatasource then we won't need
to mess with fidsequence for it.

Ok, just talked to my friend at oracle, and I guess rowid looks a bit
nasty - AABJSeAAjAAAdECAAA, and there is a chance it can move if the
partition changes? We could use a hash of it though, to get a number,
which would be a bit nicer, to which we could prepend the tablename. But
a sequence generated logical ID would be better. He was saying that you
can create sequences with jdbc, so that might be the way to go, but that
of course requires appropriate permissions of the user connecting.
There's also the idea of looking for already existing primary key
sequences, but that could be bad if the we use a sequence that the user
needs to control himself. But yes, this is a tough problem because we
need to cover as many cases as possible, from the most incredible dba who
obviously sets up all his primary keys and sequences right to the least
skillful oracle newbie who doesn't even know what a primary key is. We
probably should just cast a wide net and try a few different approaches
for each datasource.

  Chris

Chris Holmes wrote:
> But yes, this is a tough problem because we

need to cover as many cases as possible, from the most incredible dba who obviously sets up all his primary keys and sequences right to the least skillful oracle newbie who doesn't even know what a primary key is. We probably should just cast a wide net and try a few different approaches for each datasource.

I dunno, call me Evil, but I think that primary keys are a must on tables you are expecting to transact. Prerequisites to allowing tables to be published are not a crime, they just have to be explained. For example, the transactional mechanism for PostGIS is going to require the existance of a new lock table. So there's a version dependency. OIDs in PgSQL are not actually that great of a choice (I want to get away from them in our mapserver connector too) because there is no such thing as an OID index unless you explicitly create one. Whereas, primary key indexes *are* automatically created, so you get guaranteed great performance with PK queries, and *maybe* great performance with OID queries (if you have remembered to create an OID index).

Paul

--
       __
      /
      | Paul Ramsey
      | Refractions Research
      | Email: pramsey@anonymised.com
      | Phone: (250) 885-0632
      \_

> But yes, this is a tough problem because we
> need to cover as many cases as possible, from the most incredible dba who
> obviously sets up all his primary keys and sequences right to the least
> skillful oracle newbie who doesn't even know what a primary key is. We
> probably should just cast a wide net and try a few different approaches
> for each datasource.

I dunno, call me Evil, but I think that primary keys are a must on
tables you are expecting to transact. Prerequisites to allowing tables
to be published are not a crime, they just have to be explained. For
example, the transactional mechanism for PostGIS is going to require the
existance of a new lock table. So there's a version dependency. OIDs in
PgSQL are not actually that great of a choice (I want to get away from
them in our mapserver connector too) because there is no such thing as
an OID index unless you explicitly create one. Whereas, primary key
indexes *are* automatically created, so you get guaranteed great
performance with PK queries, and *maybe* great performance with OID
queries (if you have remembered to create an OID index).

Well I guess a reasonable solution would be to have all geoserver
documentation say that you absolutely must create primary keys, but still
allow the geotools postgis datasource to default to oid's if it doesn't
find any primary keys. I agree that prerequists for tables is not bad for
geoserver, but I think geotools should have reasonable backups so that all
users of geotools don't also have to publish those requirements in every
app that they create. For geotools it becomes a question of conformance
to the gml:id contract vs. performance, since if geotools were to never
look at the oid, even after it couldn't find primary keys, then features
would be created with java generated ids. The java generated ids would
not be persistant even between queries of the datasource, which violates
the spirit of gml:id. Yes, the performance would not be as good, but I
think geotools design goals are more focused on ogc conformance.

The question of actually creating new fids does remain, as Sean brought
up. I remember an email on the postgis email list about using sequences:
http://postgis.refractions.net/pipermail/postgis-users/2003-February/002086.html,
but as far as I can tell 0.8.5, which was released almost two months after
the email, doesn't even declare primary keys, let alone a sequence. So
I'm hesitant to make any sort of requirement even to geoserver users when
the method that most postgis newbies use to get data into their tables
doesn't yet declare primary keys. I don't really want to force users to
hand modify files that are created by shp2pgsql to declare primary keys
and make sure all the gids are unique. Are there plans for that
improvement to shp2pgsql Paul? Will it be in the next release? Will it
use a sequence as suggested in the email? And if it does declare a
sequence is it reasonable for us to use that sequence? Also, even if it
is in the next release I'd sorta like geoserver users with exisiting
instances of postgis made with shp2pgsql to not have to go in and change
all their tables around just so geoserver will work. Though I actually
don't feel all that strongly about this, I'm just trying to look out for
my users who aren't unix/database/opensource experts, for whom the
postgis installation process is the most complicated they've ever done,
which isn't even half of the lengthy geoserver install process.

Chris Holmes wrote:

Are there plans for that improvement to shp2pgsql Paul?

Yes, I think we should modify shp2pgsql to make the unique ID be declared as a primary key. It can only help, and it cannot hurt.

Will it be in the next release?

I will ask Jeff to put that it ASAP, so it will be in the next release.

Will it use a sequence as suggested in the email?

That will be harder, as sequence management is something which has been changing from PgSQL version to version. Particularly from the perspective of create PgSQL dump files (which is the high performance operating mode of shp2pgsql). The "serial" as a column type is actually an addition which postdates PostGIS. (Although sequences have been around forever, in order to use them before you had to explicitly tie the default value of a column to their next value.)

It is possible if we made the sequence implementation more of a traditional pgsql format we could make something version independant. Create table with unique pkey, and default value set to sequence.next, create sequence, set sequence to nrecords + 1. It will mean that loading a table creates alot of extra scruft. Ideally we would do it with the serial type. Maybe we should reseach (Jeff) when the serial type arrived. If it was 7.2 then we could just use it and say to heck with the 7.1 users (if there are any).

And if it does declare a sequence is it reasonable for us to use that sequence?

So the operation would be
- I need a new feature ID!
- What is the primary key of that feature table?
- What is the next value of the sequence joined to that pkey?

Do feature IDs have to be unique within the WFS, or unique within a feature class? What are they being used for? Why does the WFS have to create them? If a new feature is being created, can't you just insert into the table and let the DBMS handle things?

P.

--
       __
      /
      | Paul Ramsey
      | Refractions Research
      | Email: pramsey@anonymised.com
      | Phone: (250) 885-0632
      \_

> Are there plans for that improvement to shp2pgsql Paul?

Yes, I think we should modify shp2pgsql to make the unique ID be
declared as a primary key. It can only help, and it cannot hurt.

> Will it be in the next release?

I will ask Jeff to put that it ASAP, so it will be in the next release.

Awesome.

> Will it use a sequence as suggested in the email?

That will be harder, as sequence management is something which has been
changing from PgSQL version to version. Particularly from the
perspective of create PgSQL dump files (which is the high performance
operating mode of shp2pgsql). The "serial" as a column type is actually
an addition which postdates PostGIS. (Although sequences have been
around forever, in order to use them before you had to explicitly tie
the default value of a column to their next value.)

Well, if you don't do sequences I'm curious as to how you'll deal with the
problem of using the append flag in shp2pgsql, as it currently repeats the
gid's if I'm not mistaken, just using the row number.

> And if it does declare a sequence is it reasonable for us to use that sequence?

So the operation would be
- I need a new feature ID!
- What is the primary key of that feature table?
- What is the next value of the sequence joined to that pkey?

Do feature IDs have to be unique within the WFS, or unique within a
feature class?

I'm not quite sure what you're asking, but if by feature class you mean
datasource, then it just needs to be unique for the datasource. If you
can get me id numbers for each table in postgis then I can make them
unique for the wfs, by appending the tablename and setting the namespace
of the schema. So if I declare a table called rail and a table called
road, then both can use their own sequences, because the postgis
datasource will create fids like rail.34 and road.34, each for the primary
key that is 34. If my geoserver install also uses another database that
also has a rail table, then geoserver will have them be declared with
different namespaces, and thus ns1:rail.34 will be different than
ns2:rail.34 (assuming the prefixes correspond to full namespace uri's).
OID's are unique for a whole database, but we definitely don't have that
requirement. Is that what you were asking?

What are they being used for? Why does the WFS have to
create them? If a new feature is being created, can't you just insert
into the table and let the DBMS handle things?

A wfs has to create them because the insert operation requires you to
return a set of fids for the newly created features. I think this makes
sense so that users can easily reference the features they just created,
and it's part of the spec so we have to do it. A feature to be inserted
does not specify its fid, and I'm pretty sure wfs's can just ignore the
fid to be inserted, as the fid should depend on the data backend. If
you've got a way for the dbms to handle things and report back a unique id
that can be used to refer to the newly inserted id, that will be
persistant even if the fields change, I'm all ears.

  Chris