Hello Andrea and all
On Fri, Apr 27, 2012 at 2:29 PM, Andrea Aime
<andrea.aime@anonymised.com> wrote:
I could make things more explicit on the main proposal page, but each
section contains links to more detailed information, including the
actual "patch" to be seen inline. Namely:
<http://geoserver.org/display/GEOS/GSIP+69+-+API+Proposal>
As it's all about new stuff, when it comes strictly to the API
proposal, I judged it convenient to have the whole proposal readable
inline. Yet there's also a link to the github branch where the
whole work lives: the API proposal, the code migration for exemplary
use cases, and the alternative jdbc backend:
<https://github.com/groldan/geoserver/tree/GSIP69>
All right, it took half a day to go through the proposal and the code.
Thanks for doing that, much appreciated.
The proposal direction is good; I agree with all the needs it expresses
and with the general idea that we should be able to filter, page, and
load stuff interactively.
That's a good starting point.
I strongly disagree with the idea that rolling a new filter API is better
than using the OGC one; this is a first show stopper for me.
The Predicate API is very limited and has no spatial filter support.
GeoServer core already depends heavily on GeoTools, so the whole
reasoning about Predicate advantages is pretty empty. I actually
see a lot more weaknesses in rolling a new API:
- it violates the KISS principle, adding more stuff does not make anything
simpler
- it does not make writing encoders any easier; on the contrary, it demands
more code to be written, while we already have a pretty simple way to
split a complex OGC filter into a supported and an unsupported part
(see the sketch after this list), and lots of encoders that we can mimic
and/or copy code from
- it does not avoid external dependencies, as geotools is already there
- it misses a lot of expressiveness: instead of writing clumsy Predicates
that can only run in memory (since they are not well known), we can
actually use an API that can get translated to SQL (thinking about the
name matching filters in the GUI here)
- the idea that the domain is different is quite surprising: most of the
elements that grow in big numbers have a bbox attached to them, so they
are indeed spatial. One of the things you normally want to do in a
security subsystem is restrict access by geographic area, and we could
not express that with Predicate
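To make the last two points concrete, here is a minimal sketch of building
such a spatial restriction and splitting it into a natively encoded part and
an in-memory leftover. It uses GeoTools' FilterFactory2, FilterCapabilities
and PostPreProcessFilterSplittingVisitor; a sketch only, double check
against the current API:

    import org.geotools.factory.CommonFactoryFinder;
    import org.geotools.filter.FilterCapabilities;
    import org.geotools.filter.visitor.PostPreProcessFilterSplittingVisitor;
    import org.opengis.filter.Filter;
    import org.opengis.filter.FilterFactory2;
    import org.opengis.filter.spatial.BBOX;

    static void splitExample() {
        FilterFactory2 ff = CommonFactoryFinder.getFilterFactory2(null);

        // restrict access to objects whose bounds fall within an area,
        // combined with a name filter like the ones the GUI builds
        Filter area = ff.bbox("boundingBox", -10, 35, 30, 60, "EPSG:4326");
        Filter name = ff.like(ff.property("name"), "topp:*");
        Filter combined = ff.and(area, name);

        // declare what the backend can encode natively: logical ops and bbox
        FilterCapabilities caps = new FilterCapabilities();
        caps.addAll(FilterCapabilities.LOGICAL_OPENGIS);
        caps.addType(BBOX.class);

        // split into a part the store runs natively and a leftover
        // evaluated in memory over the already reduced result set
        PostPreProcessFilterSplittingVisitor splitter =
                new PostPreProcessFilterSplittingVisitor(caps, null, null);
        combined.accept(splitter, null);
        Filter pre = splitter.getFilterPre();   // bbox: encoded natively
        Filter post = splitter.getFilterPost(); // like: in-memory leftover
    }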
This is arguable, and I don't want to focus on this discussion just
now; for the time being I feel it's more important to point out
some possible errors of interpretation below. They're probably due to
the patches not being squashed into single functional units, which I
apologize for. As I mentioned before, I'm working on that.
Moreover, with OGC filters it would get really easy to create a datastore
based catalog implementation if we wanted to, and it would make a much
better proof of concept than the current ones (more on this later).
The only drawback of Filter is that it is supposed to be a "closed" API,
with no way to implement a new filter, but that is actually less of a
limitation
since the model is rich, and easily worked around by implementing
whatever filter function is missing.
Just pointing out here that nothing prevents you from doing so either
way, if you want to take advantage of the current feature access
oriented infrastructure, but that'd be an implementation detail, better
hidden from the Catalog API. Given the proposed Catalog predicate object
model is so small and based on observed usage, it'd be a lot easier
going from Predicate to Filter than the other way around. Also, if you
wanted to leverage the DataStore API and adapt it to the Catalog API,
you would be making assumptions about the data structures the catalog
objects are stored with (i.e. a flat RDBMS table). I don't want those
assumptions at the API level, so I deliberately stayed away from them.
That's how a layered architecture usually works. More on this later;
I'm already starting to rant and I said I wanted to focus on the code
review.
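Just to illustrate how thin such a bridge would be, here is a sketch of
adapting a property-equality predicate to an OGC Filter; the
equal(property, value) shape is hypothetical, merely following the
proposal's style:

    import org.geotools.factory.CommonFactoryFinder;
    import org.opengis.filter.Filter;
    import org.opengis.filter.FilterFactory;

    // hypothetical: maps a property-equals predicate, carrying a
    // property name and an expected value, onto the OGC model
    static Filter toFilter(String property, Object expected) {
        FilterFactory ff = CommonFactoryFinder.getFilterFactory(null);
        return ff.equals(ff.property(property), ff.literal(expected));
    }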
Moving forward, I would advise against having to check whether the store
can sort data or not; it just makes the caller code more complicated
and forces it to do workarounds if sorting is really needed.
I agree with that. Justin pointed that out already, but I left
it in just to get the proposal out and let others chime in. Leveraging
the GeoTools merge-sort code looks to me like the way to go; I need
to look into it.
In GeoTools we have code that does merge-sort using disk space
if necessary that can sort whatever amount of data with little memory
footprint (of course, at the price of performance).
It would be better to have a method that checks whether sorting can be
done fast instead, so that code which needs sorting as an optimization
can leverage it or use an alternate path, while code that really needs
sorting can just ask for it and have it done by the catalog impl,
without having to repeat that logic in all the places that need sorting
for good.
As a rule of thumb, I am against adding such a check just because, or
just in case, without an actual use case. We can always add that check
when the real need arises, but it's harder to get rid of it if that
never happens, plus the Catalog API complexity wouldn't be reduced at
all. My intention is hence to leave the canSort method on
CatalogFacade but remove it from Catalog. Or rather, remove it from
both. Implementing the merge-sort in CatalogImpl and using canSort on
CatalogFacade seems to make the most sense: given the way the Catalog
API has evolved, it's unlikely that you have multiple Catalog
implementations, but you can certainly have multiple CatalogFacade ones.
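In code, the caller-side pattern I have in mind is roughly this; a
sketch, where the method names follow the proposal and the naive
in-memory fallback is just a stand-in for the GeoTools disk-backed
merge-sort:

    // inside CatalogImpl: the facade only advertises what it can sort
    // natively, while the catalog guarantees sorting either way
    Iterator<LayerInfo> list(Predicate filter, SortBy order) {
        if (facade.canSort(LayerInfo.class, order)) {
            return facade.list(LayerInfo.class, filter, order);
        }
        // fallback: sort on the catalog side; a disk-backed merge-sort
        // would replace this in-memory placeholder
        List<LayerInfo> all = new ArrayList<LayerInfo>();
        Iterator<LayerInfo> it = facade.list(LayerInfo.class, filter, null);
        while (it.hasNext()) {
            all.add(it.next());
        }
        Collections.sort(all, comparatorFor(order)); // hypothetical helper
        return all.iterator();
    }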
Another small thing is that these methods are supposed to access the
file system or the network, but they don't throw any exception... I can
live with that, most likely the calling code does not have anything
meaningful to do in case of exception anyway, but I thought I'd point
it out.
That's right. There's a lot of literature out there on checked vs
unchecked exceptions. Plus, the Catalog is already using unchecked ones,
so it kind of makes sense to follow suit.
A thing that I find surprising instead is seeing no trace of a
transaction concept. If the intent is to move to a model where multiple
GeoServer instances share the same db and write to it in parallel, being
able to use transactions seems quite important; there is a need for
coordination that is not addressed by this proposal.
I'm not sure if you mean just database transactions or some kind of
coordination between nodes in a cluster.
The latter is out of scope. The former is just a wrong assessment:
check out the JDBCCatalogFacade add/remove/save methods. They're
transactional; it's just that spring-jdbc is taking care of the
boilerplate.
If, on the other hand, you mean transactions at the Catalog API level,
so that you can, for example, add multiple resources atomically,
that'd be the subject of another GSIP.
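For the record, the spring-jdbc side looks conceptually like this; a
sketch where the table and column names are made up, not the ones in
the branch:

    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.transaction.PlatformTransactionManager;
    import org.springframework.transaction.TransactionStatus;
    import org.springframework.transaction.support.TransactionCallbackWithoutResult;
    import org.springframework.transaction.support.TransactionTemplate;

    void add(final JdbcTemplate template, PlatformTransactionManager txManager,
            final String oid, final String xml, final String name) {
        TransactionTemplate tx = new TransactionTemplate(txManager);
        tx.execute(new TransactionCallbackWithoutResult() {
            @Override
            protected void doInTransactionWithoutResult(TransactionStatus status) {
                // both inserts commit or roll back together
                template.update(
                        "INSERT INTO object (oid, type, blob) VALUES (?, ?, ?)",
                        oid, "LayerInfo", xml);
                template.update(
                        "INSERT INTO object_property (oid, property_name, value)"
                                + " VALUES (?, ?, ?)",
                        oid, "name", name);
            }
        });
    }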
The modifications done below and above the API changes are simple proofs
of concept, meaning the validation of the API is low and the checks on
its quality are low as well; not something we'd want to fast-track into
a code base that we want to make more stable.
Strongly disagree here. Call it a proof of concept if you want, but I
spent quite a bit of time identifying "exemplary", not "random sample",
use cases where the vertical scalability of the system was highly
compromised. If you have more exemplary use cases from the current code,
those will be very welcome. Most of the service code just queries
single objects by name or by id, which is not a scalability problem.
Let's start by what's above the API. All we have is a handful of examples,
but the code base is largely unchanged.
I think my paragraph above explains that. And I wonder how come you
seemed so concerned about large code base changes before, yet now
complain about the lack of them. Identifying key scalability-compromising
use cases and allowing for incremental adoption of the new API seemed to
me like a good way of addressing your original concern, which I share,
btw.
On one side this means the new code is not going to be touched very much;
on the other side it means we get no benefit from the new API, and we're
not validating it and its implementation at all.
If starting up geoserver in seconds instead of minutes; loading the
home page almost instantly instead of waiting for seconds or even
minutes; sub-second response times for the layers list page with
thousands of layers, including filtering and paging; and not going OOM
or getting timeouts when doing a GetCapabilities request under
concurrency and/or low heap size, but instead streaming out as quickly
as possible, using as little memory as possible, and gracefully
degrading under load; if those are not ways of exercising the new API,
then I'm lost.
Looks like a trojan horse to get in the higher level modifications
later,
that's discourteous.
which will actually destabilize the code base as we are already in RC or
bugfix release mode.
I don't get you. Haven't we said we don't want to destabilize the
stable code base? Isn't touching as little as possible on the stable
branch (even nothing at all!) a way of preserving its stability? Yet
you complain there are too few changes on the wider code base. Haven't
we already discussed putting only the API on 2.2 for peace of mind re
stability, adding the incremental changes on 2.3 only, and perhaps not
touching 2.2.x at all?
Moreover, various of the predicates created have no chance of being
encoded natively, since they are not "well known".
Examples? The only one I can think of is the one in
SecureCatalogImpl.securityFilter.
The way it works is that it builds the wrapper policy on an object by
object basis, just like it used to. It may be possible to write it in a
way that's "encodable" to SQL; that's a piece of code I don't fully
understand. In any case, the net effect is that that predicate, which
executes in-process, is and'ed to any other filter the
SecureCatalogImpl gets, and the backend can split the and'ed filter
into supported and unsupported parts, hence having the chance of
evaluating far fewer objects than a full scan every time, which is what
currently happens.
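Schematically, assuming the proposal's Predicate is generic like Guava's
and using hypothetical helper names in the proposal's style:

    // the in-process security check, opaque to the backend
    Predicate<LayerInfo> security = securityFilter(user, LayerInfo.class);
    // whatever filter the caller passed in
    Predicate<LayerInfo> requested = Predicates.equal("name", "topp:states");

    // the backend encodes 'requested' natively and runs 'security'
    // in process, over a much smaller result set than a full scan
    Iterator<LayerInfo> layers =
            catalog.list(LayerInfo.class, Predicates.and(security, requested));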
In fact, the authorization subsystem should be changed too in order to
leverage the new scalability api, so that it returns a filter instead
of checking single layers point by point.
How's that a bad thing? If a user is authorized for 10 out of 1000
resources, isn't it better to get only those 10 out of the catalog,
and to create wrapper objects for only those 10 instead of for all
1000 of them?
Same goes for the GUI filtering code, which:
- loads the full data set in memory in case the store is not able to sort on
the desired attribute
That is true, but it's kind of a dead code snippet. The trick is that
BeanProperties (addressing a property name, like styles.name,
resource.store.workspace.name, etc.) can indeed be sorted at the db.
For the layers page, the only non-BeanProperty is the "enabled" one,
because it uses the derived enabled() property instead of the
isEnabled() property. But still, one could sort on "enabled" (as per
the isEnabled() accessor) natively; I should make the code account for
that. Or, if we end up using the geotools on-disk sorting code, that
wouldn't be a concern either.
- builds a predicate that is not encodable (with OGC filter we could
actually encode it instead).
Again, which one? The only non-encodable one is the enabled() property
in LayerProvider. It should actually be encodable by adding the smarts
to map it to the isEnabled() property, plus an in-process check to
account for the derived non-enabled states (when the store or resource
is not enabled but the layer itself is).
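Something along these lines; a sketch with the same hypothetical
helpers as above:

    // encode the stored isEnabled() flag natively...
    Predicate<LayerInfo> stored = Predicates.equal("enabled", Boolean.TRUE);
    // ...and keep only the derived part in process: a layer is
    // effectively enabled when its resource and store are enabled too
    Predicate<LayerInfo> derived = new Predicate<LayerInfo>() {
        public boolean apply(LayerInfo layer) {
            return layer.getResource().isEnabled()
                    && layer.getResource().getStore().isEnabled();
        }
    };
    Predicate<LayerInfo> effectivelyEnabled = Predicates.and(stored, derived);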
The bits below the API are baffling too. Both the JE and JDBC
implementations are based on a key/value store where the value is the
XML dump of each CatalogInfo.
That's a good thing.
This makes the whole point about filter encoding moot, as there is
almost no filter being encoded down at the DB level.
That is plain wrong: both implementations use indexes to filter on the
different properties.
Say I want to do a GetMap request with a single layer: we know the
name, yet we end up scanning the whole database, loading all the blobs,
parsing them back into memory, and applying the filter in memory.
Wrong assessment. The BDB JE implementation uses an index based on
name. The JDBC one uses a separate table where the properties are
stored and indexed.
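Concretely, a lookup by name follows this pattern; a sketch, the actual
table and column names in the branch may differ:

    import java.util.List;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.jdbc.core.SingleColumnRowMapper;

    // the indexed property table narrows down the candidates first;
    // only the matching blobs are fetched and parsed back
    List<String> findBlobsByName(JdbcTemplate template, String name) {
        String sql = "SELECT o.blob FROM object o"
                + " JOIN object_property p ON o.oid = p.oid"
                + " WHERE p.property_name = 'name' AND p.value = ?";
        return template.query(sql,
                new SingleColumnRowMapper<String>(String.class), name);
    }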
Sure, it scales to 100M items, but nobody will want to wait for the
time it takes this implementation to do a simple GetMap.
I know they are community modules, but even a community module should
have a fighting chance of being used. This implementation seems so weak
that I don't believe anyone will actually want to use it, and in order
to validate the API we should have an implementation that actually makes
use of some of its concepts (some actual native filtering, for example).
When you have a chance please re-review taking into consideration my
previous comments.
A tip: ConfigDatabase uses PredicateToSQL. Looks like you didn't find
that out, and hence were so disappointed. I would have been too, if
that were the case.
That said, I am not sure the current approach to translating a
predicate to SQL is the best one, given that I honestly suck at SQL.
But for a proof of concept it does a pretty decent job, and as far as I
understand, a GSIP with "under discussion" status doesn't need to have
all its implementation details perfected.
(little aside: nice to see bonecp in the mix, I actually wanted to try
out that connection pool myself)
yeah, seems to work better than dbcp.
Long story short, the proposal seems weak in some API points, and the
implementation is a proof of concept which I don't think should be
allowed to land on trunk right now.
You say "proof of concept" as if it were pejorative. I call it
functional prototype and have seen more than one land on the community
space over the years.
But I want to offer a reasonable alternative: shall we move to a
time-boxed 6 month release cycle? 4 months to add new stuff, 2 months
to stabilize, rinse and repeat, pushing out 3-4 bugfix releases in the
stable series.
This way we don't have these long waits for new features to show up,
and this much needed scalability improvement can land on a stable
series in the next 8-9 months (assuming 2 months to release 2.2.0 and a
6 month cycle), be vetted and improved so that it's an API and an
implementation we can all be proud of.
Sounds reasonable to me. Separate email thread?
I really want this GSIP in, just not now and not in its current state.
But I'm willing to put forward resources to help make it become a
reality.
I really do hope that the rest of the PSC chimes in as well; this is an
important GSIP and it deserves other people's opinions (besides my
personal rants).
So do I.
Thanks for your time.
Gabriel.
Ah, next week I'll also try to prepare a GSIP for the 6 month release
cycle, unless of course there are very negative reactions to the idea.
Cheers
Andrea
--
Gabriel Roldan
OpenGeo - http://opengeo.org
Expert service straight from the developers.