Hi,
In the current WFS architecture, GeoServer is set up to “favour” large requests,
with a fully streaming protocol that can serve back basically limitless
amounts of GML without being memory bound… at most, we are
disk bound in certain output formats (shape-zip, where we have to write the
shapefile fully before compressing it).
In order to do that we perform a feature count before actually fetching the
data, as in many responses the count is given at the beginning of the response.
Only then do we actually fetch the data.
Now, for small returned datasets this is overkill: we could theoretically just
fetch the data once, store it in memory, count it, and then use it for encoding.
That’s just one request, and in some cases a significant speedup:
- Complex database views/sqlviews doing heavy computation, where the
output fetching time is actually of little consequence
- Custom data stores that are equally doing heavy computations to figure
out which features to provide as part of the results
- Also see the ArcSDE slowness issue recently reported by Martin Davis
on the users list, where going for a count is systematically slower than
actually fetching the data itself
With this in mind, it would be nice to have some smarts in WFS so that
we can just hit the data stores once for small results.
The thing is, we don’t know if the result is small until we fetch it.
I’ve hashed out a few ideas, all based on the notion that when asked
for size, we should really try to fetch the data instead,
and do something “special” only in case we find we are loading too much of it.
1) Fetch and fall back on count
When the feature collection is asked for its size, start fetching data
instead, counting in the process, up to a certain limit (e.g., 1000 features).
If we finish before the limit, return the count and keep the result aside in
memory, waiting for the features() call.
If we hit the limit, keep that iterator open, run a normal size(), and
when features() is called, use the features we have in memory up until
that point, then start scrolling the cached iterator.
This one seems nice, but it’s a no go, as it doubles the number of database connections
needed to answer a single GetFeature request, which means it’s deadlock
prone once we reach the connection pool size limits.
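To make the deadlock concern concrete, here is a minimal sketch in plain Java. It models the connection pool as a `java.util.concurrent.Semaphore`; the class and method names are hypothetical, and nothing here is actual GeoServer or GeoTools code. Each request takes one connection for its open iterator and then needs a second one for the count.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical model of approach 1 under load; not GeoServer code. */
final class PoolExhaustionDemo {

    /**
     * Simulates `requests` concurrent GetFeature calls against a pool of
     * `poolSize` connections. Every request first grabs a connection for
     * the iterator it keeps open, then tries to grab a second one for the
     * count. Returns how many requests managed to get that second one.
     */
    static int requestsAbleToCount(int poolSize, int requests) {
        Semaphore pool = new Semaphore(poolSize);

        // Phase 1: each request opens its iterator and holds the connection.
        int holdingIterator = 0;
        for (int i = 0; i < requests; i++) {
            if (pool.tryAcquire()) {
                holdingIterator++;
            }
        }

        // Phase 2: each of those requests now needs a second connection.
        // tryAcquire() is used so the demo returns; a blocking acquire()
        // would wait forever once the pool is drained.
        int ableToCount = 0;
        for (int i = 0; i < holdingIterator; i++) {
            if (pool.tryAcquire()) {
                ableToCount++;
            }
        }
        return ableToCount;
    }
}
```

With a pool of 2 and 2 concurrent requests, `PoolExhaustionDemo.requestsAbleToCount(2, 2)` returns 0: every connection is held by an open iterator, so no request can ever run its count.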
2) Fetch and fall back on count v2
Same as above, but if we hit the limit, close the iterator,
do a normal count, and when features() is called,
start a fetch again.
No deadlock issues, but for requests over the limit we are actually
looking at 3 requests to the store instead of 2, making those slower
than they are today.
Arguably, this will make large requests slower, but the ones really
affected will be the ones around the threshold.
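A rough sketch of approach 2 in plain Java, to show where the third request comes from. `CountableSource`, `ListSource` and `FetchThenCount` are hypothetical names made up for illustration, not GeoTools interfaces, and the `storeHits` counter exists only to make the cost visible.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a data store; this is not a GeoTools interface. */
interface CountableSource<T> {
    Iterator<T> fetch(); // one hit on the store
    int count();         // one hit on the store
}

/** In-memory source, only used to exercise the sketch. */
final class ListSource<T> implements CountableSource<T> {
    private final List<T> items;
    ListSource(List<T> items) { this.items = items; }
    public Iterator<T> fetch() { return items.iterator(); }
    public int count() { return items.size(); }
}

/** Approach 2: buffer up to `limit` items, else fall back on count + refetch. */
final class FetchThenCount<T> {
    private final CountableSource<T> source;
    private final int limit;
    private List<T> buffered; // non-null only if the whole result fit in memory
    int storeHits = 0;        // bookkeeping for the demo

    FetchThenCount(CountableSource<T> source, int limit) {
        this.source = source;
        this.limit = limit;
    }

    int size() {
        storeHits++;
        Iterator<T> it = source.fetch();
        List<T> buffer = new ArrayList<>();
        while (buffer.size() < limit && it.hasNext()) {
            buffer.add(it.next());
        }
        if (!it.hasNext()) {
            buffered = buffer; // small result: keep it for features()
            return buffer.size();
        }
        // Over the limit: a real iterator would be closed here, releasing
        // its connection before the count is issued.
        storeHits++;
        return source.count();
    }

    Iterator<T> features() {
        if (buffered != null) {
            return buffered.iterator(); // served from memory, no store hit
        }
        storeHits++;
        return source.fetch(); // large result: fetch again from scratch
    }
}
```

For a result under the limit, `size()` followed by `features()` leaves `storeHits` at 1; for a result over the limit it ends at 3 (partial fetch, count, full fetch).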
3) Fetch and store
Same as above, but if we go past the limit, we continue fetching
and store all features on disk, using the fast serializer that GeoTools
already uses in its merge-sort sorting algorithm, and read them back
from disk when features() is called.
Of all the approaches I think I like 3) the best, as it has dynamics
similar to the shapefile output format (so, nothing new really).
The one downside of 3) is that it will limit this optimization to
simple features.
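A minimal sketch of the "fetch and store" idea, assuming plain Java serialization to a temporary file as a stand-in for the GeoTools serializer (whose API is not shown here). `SpillingBuffer` is a hypothetical name, and the `Serializable` bound loosely mirrors the "simple features only" limitation.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical buffer: first `limit` items in memory, the rest on disk. */
final class SpillingBuffer<T extends Serializable> implements AutoCloseable {
    private final int limit;
    private final List<T> memory = new ArrayList<>();
    private File spill;              // created lazily, only past the limit
    private ObjectOutputStream out;
    private int count;
    private boolean sealed;          // true once reading back has started

    SpillingBuffer(int limit) { this.limit = limit; }

    void add(T item) {
        if (sealed) throw new IllegalStateException("already reading back");
        count++;
        if (memory.size() < limit) { memory.add(item); return; }
        try {
            if (out == null) {
                spill = File.createTempFile("features", ".bin");
                out = new ObjectOutputStream(
                        new BufferedOutputStream(new FileOutputStream(spill)));
            }
            out.writeObject(item);
            out.reset(); // drop back-references so the stream does not pin items in memory
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    int size() { return count; }          // known without a separate count query
    boolean spilled() { return spill != null; }

    private ObjectInputStream openSpill() {
        try {
            if (out != null) { out.close(); out = null; } // flush pending writes
            return spill == null ? null : new ObjectInputStream(
                    new BufferedInputStream(new FileInputStream(spill)));
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    /** Replays the in-memory items first, then the ones stored on disk. */
    @SuppressWarnings("unchecked")
    Iterator<T> features() {
        sealed = true;
        final int onDisk = count - memory.size();
        final Iterator<T> mem = memory.iterator();
        final ObjectInputStream in = openSpill();
        return new Iterator<T>() {
            private int read = 0;
            public boolean hasNext() { return mem.hasNext() || read < onDisk; }
            public T next() {
                if (mem.hasNext()) return mem.next();
                if (read >= onDisk) throw new NoSuchElementException();
                try {
                    T item = (T) in.readObject();
                    if (++read == onDisk) in.close(); // done with the file
                    return item;
                } catch (IOException e) { throw new UncheckedIOException(e); }
                catch (ClassNotFoundException e) { throw new IllegalStateException(e); }
            }
        };
    }

    @Override
    public void close() {
        try { if (out != null) out.close(); }
        catch (IOException e) { throw new UncheckedIOException(e); }
        if (spill != null) spill.delete(); // best effort cleanup
    }
}
```

The store is hit exactly once regardless of result size: the caller drains the store's iterator into `add()`, answers `size()` from the running count, and serves `features()` from memory plus disk. Small results never touch the file system.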
I’m also wondering whether this should be the default behavior, something
configurable at the WFS config level, or set on a layer by layer basis.
Now… I’m looking at this from the point of view of a customer project, and
making an effort to create something that would be useful to the community
at large, but I see there are trade-offs that might prevent it from being a
good general solution.
So, I’m soliciting your feedback. I have a (short) window to try and work on
this, so quick feedback is very much appreciated (and slow feedback too, as
it will be useful anyway to readers approaching this issue in the future,
in case I don’t make it now).
Cheers
Andrea
Ing. Andrea Aime
@geowolf
Technical Lead
GeoSolutions S.A.S.