[Geoserver-devel] More on name conflict resolution (in REST API)

Hi all,

I’m investigating an issue I came across with respect to importing data into existing datastores when data (shapefiles etc.) is uploaded through the REST API. Currently the behavior is a little complicated to explain:

  1. If no name conflict is detected, then the data is imported into a new physical resource (say, a database table) with a name derived from the name of the uploaded file (so foo.shp => CREATE TABLE foo).
  2. If a physical resource of the same name is present in the target datastore, then the data is put into that resource (either replacing or appending to the existing data, depending on request parameters). Actually, the name conflict check is done after this step, so the resource is always modified.
  3. Then, when creating the catalog entry:
    3a) If a featuretype of the same name already exists in the same datastore, then the existing featuretype is used.
    3b) If a featuretype of the same name exists in a different datastore in the same workspace, a numeric suffix is appended to the native name to derive a name for the GeoServer ResourceInfo that gets created (see the sketch after this list). If this suffix would need to be greater than 9, then GeoServer just gives up and uses the _9 suffix, throwing an error when it tries to save.
    3c) If a coverage of the same name exists in the same workspace, then GeoServer doesn’t detect the conflict and again errors when trying to save the ResourceInfo.
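For illustration, the suffix search described in 3b amounts to something like the following sketch (not the actual GeoServer code; the takenNames set stands in for the catalog lookup):

import java.util.Set;

// Sketch of the suffix resolution from 3b, including the cap at 9 that
// produces the failure mode; takenNames stands in for a catalog lookup.
static String deriveResourceName(String base, Set<String> takenNames) {
    if (!takenNames.contains(base)) {
        return base;
    }
    for (int i = 1; i <= 9; i++) {
        String candidate = base + "_" + i;
        if (!takenNames.contains(candidate)) {
            return candidate;
        }
    }
    return base + "_9"; // gives up; saving the ResourceInfo later throws
}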

http://jira.codehaus.org/browse/GEOS-5057

I think the new name conflict adjusting code I talked about last week[1] can help with issues 3(a-c), but some adjustment to the data import behavior may be in order as well. I think a simple, less confusing behavior would be to never import data when there is a name conflict, and simply error out in this case.

A more complicated option would be to rearrange things so that the name resolution happens before the data import, so that the name always matches up with the created table. Why is this more complicated? It raises the issue of what to do when a table exists that appears to have had its name resolved previously: say I have topp:states (a shapefile) and topp:states_1 (a postgis table) and I try to import a shapefile into the postgis store through the REST API. Should the shapefile be appended to topp:states_1 or added as a new featuretype in topp:states_2?

[1]: http://comments.gmane.org/gmane.comp.gis.geoserver.devel/16512


David Winslow
OpenGeo - http://opengeo.org/

Hmmm... some subtle issues indeed. I agree that the most sane thing would
be to just send back an error when a name conflict occurs, giving the
client the ability to specify a different name.


--
Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

On Mon, Apr 23, 2012 at 7:23 PM, Justin Deoliveira <jdeolive@anonymised.com> wrote:

Hmmm... some subtle issues indeed. I agree that the most sane thing would be
to just send back an error when a name conflict occurs, giving the client
the ability to specify a different name.

+1. Simpler, cleaner.


--
Gabriel Roldan
OpenGeo - http://opengeo.org
Expert service straight from the developers.

Looking deeper into identifying conflicts ahead of time, it seems that the “update=append” option makes things a bit more complicated - we really should avoid appending when the schemas for the source and target store don’t match up (typename differences should be ok though.) I tried to implement a schema equality check ignoring the name by simply overwriting the name on one of the featuretypes:

import org.geoserver.catalog.FeatureTypeInfo;
import org.geotools.feature.simple.SimpleFeatureTypeBuilder;
import org.opengis.feature.simple.SimpleFeatureType;

// Rebuild the source schema under the target's name so that equals()
// only sees differences in the attribute structure.
SimpleFeatureType sourceSchema = sourceDataStore.getSchema(featureTypeName);
SimpleFeatureType targetSchema = (SimpleFeatureType) ((FeatureTypeInfo) resource).getFeatureType();
SimpleFeatureTypeBuilder ftBuilder = new SimpleFeatureTypeBuilder();
ftBuilder.init(sourceSchema);
ftBuilder.setName(targetSchema.getName());
sourceSchema = ftBuilder.buildFeatureType();
boolean sameSchema = sourceSchema.equals(targetSchema);

However, I’m still not getting the expected value (true) from this for a Shapefile that I’m attempting to upload multiple times.
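For reference, a per-attribute loop reusing the variables from the snippet above can show which descriptor breaks the equality check (descriptor equality may compare more than the name and binding, e.g. occurrence counts and nillability, so metadata differences surface here):

import org.opengis.feature.type.AttributeDescriptor;

// Walk the attributes pairwise and print the descriptors that differ.
int n = Math.min(sourceSchema.getAttributeCount(), targetSchema.getAttributeCount());
for (int i = 0; i < n; i++) {
    AttributeDescriptor s = sourceSchema.getDescriptor(i);
    AttributeDescriptor t = targetSchema.getDescriptor(i);
    if (!s.equals(t)) {
        System.out.println(s.getLocalName() + ": " + s + " vs " + t);
    }
}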

Am I barking up the wrong tree? Is there a GeoTools method I should be using instead of trying to roll my own?

Also, from reviewing this thread, it’s not clear whether we were talking about removing the ability to overwrite/append altogether, or just avoiding munging when the requested type is not available. Just to double check, we do want to keep the “update=” parameter and append or overwrite when the desired resource is already present in the target store, right?


David Winslow
OpenGeo - http://opengeo.org/


On Mon, Apr 30, 2012 at 10:15 PM, David Winslow <dwinslow@anonymised.com> wrote:

Looking deeper into identifying conflicts ahead of time, it seems that the “update=append” option makes things a bit more complicated - we really should avoid appending when the schemas for the source and target store don’t match up (typename differences should be ok though.) I tried to implement a schema equality check ignoring the name by simply overwriting the name on one of the featuretypes:


Well… I am not sure I 100% agree that the source and target have to match up exactly, especially given that specific format differences might lead to situations where this is unwanted. For example, consider Oracle. Unless you have a spatial index on a column, I believe Oracle will simply return “GEOMETRY” as the type. But say you are uploading a shapefile that has a concrete type for the geometry. Should the transaction be rejected? I would say probably not.

The strategy I usually take when dealing with this sort of thing is a “pull” approach: for every attribute in the destination type (the table being updated), look for an attribute in the source type (the file being uploaded). If an attribute doesn’t exist in the source type, ignore it; similarly, any extra attributes in the source that don’t exist in the destination should also be ignored.
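As a sketch (illustrative names, not GeoServer code), the pull approach amounts to building each output feature against the destination schema and copying values by attribute name:

import org.geotools.feature.simple.SimpleFeatureBuilder;
import org.opengis.feature.simple.SimpleFeature;
import org.opengis.feature.type.AttributeDescriptor;

// destinationType is the schema of the table being updated and
// sourceFeature one of the uploaded features (both assumed names).
SimpleFeatureBuilder builder = new SimpleFeatureBuilder(destinationType);
for (AttributeDescriptor att : destinationType.getAttributeDescriptors()) {
    String name = att.getLocalName();
    if (sourceFeature.getFeatureType().getDescriptor(name) != null) {
        builder.set(name, sourceFeature.getAttribute(name));
    }
    // destination attributes missing from the source stay null;
    // extra source attributes are simply never read
}
SimpleFeature pulled = builder.buildFeature(sourceFeature.getID());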

Also, from reviewing this thread, it’s not clear whether we were talking about removing the ability to overwrite/append altogether, or just avoiding munging when the requested type is not available. Just to double check, we do want to keep the “update=” parameter and append or overwrite when the desired resource is already present in the target store, right?

No, I think we are just talking about avoiding the strange cases that occur when there is potential for name clashing, by doing some pre-checks and not allowing the user to create resources when there are name clashes. Not removing functionality like updating or appending to an existing table.


Justin Deoliveira
OpenGeo - http://opengeo.org
Enterprise support for open source geospatial.

On Tue, May 1, 2012 at 2:36 AM, Justin Deoliveira <jdeolive@anonymised.com> wrote:


Well… I am not sure I 100% agree that the source and target have to match up exactly, especially given that specific format differences might lead to situations where this is unwanted. For example, consider Oracle. Unless you have a spatial index on a column, I believe Oracle will simply return “GEOMETRY” as the type. But say you are uploading a shapefile that has a concrete type for the geometry. Should the transaction be rejected? I would say probably not.

Indeed it’s in general not possible to get an exact match between the original feature type and the
native feature type:

  • attributes in Oracle are always uppercase
  • Oracle does not have the concept of “boolean”
  • shapefiles have only a single geometry column; it’s always the first attribute and it’s always called “the_geom”
  • shapefile dbf attributes have severe length limitations
  • and so on

Generally speaking we’d need createSchema to return some form of map from the original
attribute names to the ones actually created (they could be properties in the AttributeDescriptor
user map).
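As a hypothetical illustration of that idea (no such convention exists today, and the "originalName" key is invented), the mapping could then be recovered from the created schema like so:

import java.util.HashMap;
import java.util.Map;
import org.opengis.feature.simple.SimpleFeatureType;
import org.opengis.feature.type.AttributeDescriptor;

// After createSchema, the store would have recorded each attribute's
// original name in the descriptor's user data map (invented convention).
SimpleFeatureType created = dataStore.getSchema(createdTypeName);
Map<String, String> rename = new HashMap<String, String>();
for (AttributeDescriptor att : created.getAttributeDescriptors()) {
    Object original = att.getUserData().get("originalName");
    if (original != null) {
        rename.put((String) original, att.getLocalName());
    }
}
// rename now maps uploaded attribute names to the names actually created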

In the OGR data store it gets even worse: you cannot call createSchema and expect the output to be
created at all; you actually have to do both the schema creation and the data appending in a single
shot, or you won’t get any output created by OGR.

The latter made me roll the following extra method in the OGRDataStore:

public void createSchema(SimpleFeatureCollection data, boolean approximateFields,
        String options) throws IOException

Now, the above will dump the data into the target storage as best it can, doing all the attribute
mapping internally, which I believe is even better than creating the mapping I described
above, as it lets the store do whatever is best.
It opens the road to recognizing, at the db level, that one can use some bulk loading method
to add data, create the indexes after the table is loaded, and generally speaking
be free to do whatever type and name mapping is deemed necessary given the target
storage tech constraints.

The generic DataAccess API addition could look as follows:

public FeatureType createSchema(FeatureCollection data, Hints hints) throws IOException;

where hints allows passing down some data store specific extras: e.g., databases
could use it to create some extra indexes on the attributes, the WFS data store could
get the admin credentials to create the schema via RESTConfig on a remote
GeoServer, and so on.
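Hypothetical usage, just to show the shape of it (the overload does not exist yet; it is the proposal itself):

// uploadedCollection is the parsed upload (assumed name); the returned type
// tells the caller what the store actually created, name mapping included.
FeatureType created = dataStore.createSchema(uploadedCollection, new Hints());
String nativeName = created.getName().getLocalPart(); // may differ from the upload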

This would be new API, for which we’d need a new trunk… stuff seems to be piling
up waiting for a new trunk to be available; time to cut a 2.3.x branch?

Cheers
Andrea

Ing. Andrea Aime
GeoSolutions S.A.S.
Tech lead

Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy

phone: +39 0584 962313
fax: +39 0584 962313
mob: +39 339 8844549

http://www.geo-solutions.it
http://geo-solutions.blogspot.com/
http://www.youtube.com/user/GeoSolutionsIT
http://www.linkedin.com/in/andreaaime
http://twitter.com/geowolf


Well, equality here was meant as a stand-in for “compatibility” - a reasonable expectation that copying the uploaded features into the pre-existing store is not going to cause problems. As I was looking into it I realized I was writing a lot of code, hence the question about whether this sort of check is already implemented somewhere that I could take advantage of.

The “pull” approach you suggest seems to be actually modifying the data, which isn’t where I was thinking things would go. I can definitely see some utility in adjusting field types (in safe ways - widening integers, etc.) but I think throwing out unexpected fields is going a bit too far. In the extreme, if I upload a layer to the wrong table entirely and NO fields are common between the source and target schema, wouldn’t the approach you advocate result in a bunch of rows with all fields set to NULL being inserted into the target table? That doesn’t seem to me like it would be a good default behavior. Maybe if we take the “pull” approach without this condition: “any extra attributes in the source that don’t exist in the destination should also be ignored” it would be less likely to insert junk data. Put succinctly, the uploaded data would need to contain a subset of the fields in the target (and omitted fields would default to NULL.)
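In code, that subset rule would look something like this sketch (illustration only):

import org.opengis.feature.simple.SimpleFeatureType;
import org.opengis.feature.type.AttributeDescriptor;

// Every uploaded attribute must exist in the target with a binding that can
// hold the source values; target attributes missing from the upload are fine
// (they default to NULL).
static boolean isSubsetOf(SimpleFeatureType source, SimpleFeatureType target) {
    for (AttributeDescriptor att : source.getAttributeDescriptors()) {
        AttributeDescriptor match = target.getDescriptor(att.getLocalName());
        if (match == null) {
            return false; // a source field with no destination: reject
        }
        if (!match.getType().getBinding().isAssignableFrom(att.getType().getBinding())) {
            return false; // destination column cannot hold the source values
        }
    }
    return true;
}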


David Winslow
OpenGeo - http://opengeo.org/


Creating a schema is actually not that big a problem: assuming that all datastores create tables (or whatever) that can accept features conforming to the passed-in schema, the REST API importer can handle the importing without worrying too much about the munging that’s going on “behind the scenes.” That’s how things work today, and it seems to generally work fine.

The complication arises when the schema is already present and we are attempting to insert some new data into it: the destination schema might not be suitable for the uploaded data. So modifying createSchema as you suggest wouldn’t actually solve the problem that I’m encountering (although in general I think it would be good to get some feedback when GeoTools has to modify your schema to realize it in a particular store).

For the problem I originally encountered, I suppose it would just be a boolean method taking two schemas:

public static boolean canInsertFrom(SimpleFeatureType source, SimpleFeatureType sink);

This would implement the logic described in Justin’s mail (or similar). Of course, this API wouldn’t address the concerns you raise about the geometry column having a different name, so maybe instead it should return some kind of FeatureAdjuster object that can make that sort of change:

interface FeatureAdjuster {
    public void adjust(SimpleFeature f);
}

public static FeatureAdjuster deriveAdjuster(SimpleFeatureType source, SimpleFeatureType sink)
        throws NoSafeAdjustmentException;
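For instance, the importer could use it like so (a sketch; the surrounding names are assumed):

import org.geotools.data.simple.SimpleFeatureIterator;
import org.opengis.feature.simple.SimpleFeature;

try {
    FeatureAdjuster adjuster = deriveAdjuster(uploadedType, existingType);
    SimpleFeatureIterator it = uploadedFeatures.features();
    try {
        while (it.hasNext()) {
            SimpleFeature f = it.next();
            adjuster.adjust(f);
            // ... write f into the existing table ...
        }
    } finally {
        it.close();
    }
} catch (NoSafeAdjustmentException e) {
    // refuse the upload: the schemas cannot be reconciled safely
}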

Either way, we could let it “incubate” in the GeoServer REST API for a generation or two instead of having it go straight to GeoTools, making it less of a blocker for the release.


David Winslow
OpenGeo - http://opengeo.org/



On Tue, May 1, 2012 at 3:36 PM, David Winslow <dwinslow@anonymised.com> wrote:

Creating a schema is actually not that big a problem: assuming that all datastores create tables (or whatever) that can accept features conforming to the passed-in schema, the REST API importer can handle the importing without worrying too much about the munging that’s going on “behind the scenes.” That’s how things work today, and it seems to generally work fine.

Err… which is exactly the problem. Try importing a shapefile with lowercase attributes into Oracle: if you don’t uppercase them
manually during the copy you’ll get a table with all NULL values. The same will happen with shapefiles afaik (try pushing random
GML into a shapefile and you should see it).
You need some information about how the attribute names have been transformed to do something sensible.
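The kind of fix-up meant here, as a sketch (the GeoTools classes are real, the variable names illustrative): rebuild the shapefile schema with uppercased names before calling createSchema, so the copy loop can target the columns Oracle will actually create.

import org.geotools.feature.simple.SimpleFeatureTypeBuilder;
import org.opengis.feature.simple.SimpleFeatureType;
import org.opengis.feature.type.AttributeDescriptor;
import org.opengis.feature.type.GeometryDescriptor;

SimpleFeatureTypeBuilder b = new SimpleFeatureTypeBuilder();
b.setName(shpSchema.getTypeName().toUpperCase());
for (AttributeDescriptor att : shpSchema.getAttributeDescriptors()) {
    if (att instanceof GeometryDescriptor) {
        // geometry attributes need their CRS carried over as well
        b.add(att.getLocalName().toUpperCase(), att.getType().getBinding(),
                ((GeometryDescriptor) att).getCoordinateReferenceSystem());
    } else {
        b.add(att.getLocalName().toUpperCase(), att.getType().getBinding());
    }
}
SimpleFeatureType oracleFriendly = b.buildFeatureType();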

I’ve attached to this mail two import scripts I use to load shapefiles into Oracle and SDE; you can see that both
use some soft knowledge of how that particular store adds data in order to actually do a successful import.
Both are actually lacking and don’t really deal with the whole set of issues (they are just throw-aways anyway).

Imho it would be much better if the store itself handled the problem to start with, instead of trying with
some external set of heuristics that only handles some of the issues we know we have today.

I mean… using createSchema(…) you don’t even know if the name of the feature type has been preserved or not
(in general, at least). In Oracle, for example, it hasn’t: it has been turned uppercase; in SDE you get a prefix
in front of it.

A way to get these features inside the stores in this release is to have selected stores implement the
createSchema(FeatureCollection, Hints) method and access it reflectively if available, whilst on the
new trunk it could be called directly.
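The reflective fallback could look like this (a sketch, assuming the overload is present on selected stores):

import java.io.IOException;
import java.lang.reflect.Method;
import org.geotools.factory.Hints;
import org.geotools.feature.FeatureCollection;

// Use the new overload when the store provides it; otherwise fall back to the
// classic createSchema(SimpleFeatureType). dataStore, data and hints assumed.
try {
    Method m = dataStore.getClass().getMethod(
            "createSchema", FeatureCollection.class, Hints.class);
    m.invoke(dataStore, data, hints);
} catch (NoSuchMethodException e) {
    dataStore.createSchema(data.getSchema()); // data is a SimpleFeatureCollection
} catch (Exception e) {
    throw new IOException("createSchema(data, hints) failed", e);
}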

Or you can go on with the FeatureAdjuster idea… but imho it’s only going to be a limited hack.

That said… maybe I’m talking about a different problem? The thread started talking about conflicts
with existing feature types, which is not the same thing as creating a new one and having to deal
with how the feature type has been altered by the store.

Cheers
Andrea




(attachments)

OracleImporter.java (3.38 KB)
SDEImporter.java (4.12 KB)

Yes, I have expanded a bit from the original topic. I have already made some improvements to the way temporary files are handled in the data upload process that don’t change existing APIs, so perhaps we can put off these other topics for a later release - I have a fair bit of work cut out for me translating my external test scripts into unit tests, and reviewing the file handling in the coverage upload code as well. My main focus here is GeoNode, where stores other than Shapefile and PostGIS are vanishingly rare, so I guess schema mismatch issues won’t come up much.

With respect to how we could better handle schema adaptation, I agree with you that stores should be responsible for knowing what they can encode. I was suggesting FeatureAdjuster as an alternative to a Map for representing an adjusted schema, not as a static repository of all schema-adjusting knowledge. I think if we want to improve the error handling on appending through the REST API it will be important to decouple schema adjustment from table creation, but it will probably be useful to have a convenience method like the createSchema(FeatureCollection col, Hints hints) you suggest.

Anyway, I’ll step out of the deep end for now and get to work on bringing my existing patch for http://jira.codehaus.org/browse/GEOS-5056 up to a better level of testing/documentation.


David Winslow
OpenGeo - http://opengeo.org/
