[Geoserver-devel] [NEW FEATURE] New configuration and memory caching of features in WFS GetFeature requests.

Hi, I would like to propose this new feature to solve issue https://osgeo-org.atlassian.net/browse/GEOS-7296

Currently, WFS 1.1/2.0 GetFeature requests force the features to be read from the data source (and the filter to be evaluated) twice: once each time a new iterator is created and visited.
This is especially problematic with big data sources or complex/heavy queries.

I propose discussing a new configurable capability to cache results from the data provider in order to avoid this.

A good point at which to cache the results could be:
https://github.com/geoserver/geoserver/blob/master/src/wfs/src/main/java/org/geoserver/wfs/GetFeature.java#L530

This pull request, https://github.com/geoserver/geoserver/pull/1321, implements the fix, but it still needs a new setting in the WFS admin panel.

What do you think?

Thank you very much!
Best regards
Alvaro

On Thu, Nov 5, 2015 at 11:52 AM, A Huarte <ahuarte47@anonymised.com> wrote:

Hi, I would like to propose this new feature to solve issue
https://osgeo-org.atlassian.net/browse/GEOS-7296

Currently, WFS 1.1/2.0 GetFeature requests force the features to be read
from the data source (and the filter to be evaluated) twice: once each
time a new iterator is created and visited.
This is especially problematic with big data sources or complex/heavy
queries.

Mind, your observation about reading the data twice is true only for
shapefiles; a proper spatial database can count much faster without
the need to actually load all the data.

I propose discussing a new configurable capability to cache results from
the data provider in order to avoid this.

A good point at which to cache the results could be:

https://github.com/geoserver/geoserver/blob/master/src/wfs/src/main/java/org/geoserver/wfs/GetFeature.java#L530

This pull request, https://github.com/geoserver/geoserver/pull/1321, implements
the fix, but it still needs a new setting in the WFS admin panel.

The change as written is not acceptable imho; there must be a size limit
for the cached collection, like, cache at most 1000 features,
where 1000 is the configurable bit.
No one in their right mind would want to cache all the results all the
time; it would fit only a very narrow use case in which all data sources
are small or there is much more RAM than data, and traffic is very low (each
request keeps its own result in memory).
Typical installations have a mix of small and large data sources, with small
and large data extractions.

The usefulness is also debatable in general terms... if I have a database,
do I really want to read and cache up to 1000 features from the db
just to replace the first count operation? Thinking about it, it would
seem to me this would be useful only on a per dataset basis,
and configured manually only for those datasets that cannot do a fast count.

Cheers
Andrea

--

GeoServer Professional Services from the experts! Visit
http://goo.gl/it488V for more information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


-------------------------------------------------------

On Thu, Nov 5, 2015 at 12:04 PM, Andrea Aime <andrea.aime@anonymised.com>
wrote:

The usefulness is also debatable in general terms... if I have a database,
do I really want to read and cache up to 1000 features from the db
just to replace the first count operation? Thinking about it, it would
seem to me this would be useful only on a per dataset basis,
and configured manually only for those datasets that cannot do a fast
count.

Thinking a bit more... you cannot really control the requests the server is
receiving, and whenever you go beyond the
limit you have wasted the time to load the 1000 features and have to fall back
and do a count.
But the limit cannot be set high, because each GetFeature ends up caching
that many features, so you have to assume
your memory consumption is featureLimits * numConcurrentRequests.
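As a back-of-the-envelope example (the per-feature footprint is an assumption,
not a measurement): with featureLimits = 1000, roughly 1 KB per feature in
memory, and 100 concurrent requests, that is 1000 x 1 KB x 100 = ~100 MB of
heap spent on per-request caches alone.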

Unless... your dataset is really small to start with. In that case, it
would make more sense to just load it in memory once,
in a MemoryDataStore (maybe a variant that supports spatial indexes) and
just hit that one instead... it should be blazing
fast and it would only keep one copy of the features in memory at any time
instead of many (since each GetFeature would
receive a pointer to the feature, not a copy of it).
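A minimal sketch of that idea (shapefileSource and typeName are assumed to be
available; the GeoTools calls are as I recall them, so treat this as an
outline rather than the definitive implementation):

import org.geotools.data.memory.MemoryDataStore;
import org.geotools.data.simple.SimpleFeatureSource;

// Copy the small dataset into memory once, e.g. at startup or on first access.
MemoryDataStore memStore = new MemoryDataStore();
memStore.addFeatures(shapefileSource.getFeatures()); // the single read of the source

// Every GetFeature afterwards hits the in-memory copy: counts become cheap and
// only one copy of each feature lives on the heap, shared by all requests.
SimpleFeatureSource fastSource = memStore.getFeatureSource(typeName);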

Cheers
Andrea


-------------------------------------------------------

Hi, we could create this cache at this point:

https://github.com/ahuarte47/geoserver/blob/eeca42adab263a880856e9ca11c962ac2c0ebb6d/src/wfs/src/main/java/org/geoserver/wfs/GetFeature.java#L531

An idea…

if (calculateSize) {
    if (queries.size() == 1 && features.getSchema() instanceof SimpleFeatureType) {

        final int LIMIT_FEATURE_CACHE_SIZE = 128; // -> 1024... configurable?

        List<SimpleFeature> featureCache = new ArrayList<>();
        boolean makeFeatureCache = true;
        size = 0;

        FeatureIterator<? extends Feature> iterator = features.features();
        try {
            while (iterator.hasNext()) {
                // always consume the feature, otherwise the loop never advances
                Feature feature = iterator.next();

                if (makeFeatureCache && size >= LIMIT_FEATURE_CACHE_SIZE) {
                    // limit exceeded: give up caching, keep counting
                    makeFeatureCache = false;
                    featureCache.clear();
                }
                if (makeFeatureCache) {
                    featureCache.add((SimpleFeature) feature);
                }
                size++;
            }
            if (makeFeatureCache) {
                // the whole result fits: reuse it for the encoding step
                DefaultFeatureCollection fc = new DefaultFeatureCollection(
                        features.getID(), (SimpleFeatureType) features.getSchema()); // -> or similar
                fc.addAll(featureCache);
                features = new CachedFeatureCollection(fc);
            }
        } finally {
            iterator.close();
        }
    } else {
        size = features.size();
    }
}

Alvaro


-------------------------------------------------------


On Thu, Nov 5, 2015 at 11:52 AM, A Huarte <ahuarte47@anonymised.com> wrote:

Hi, I would like to propose this new feature to solve issue https://osgeo-org.atlassian.net/browse/GEOS-7296

Currently, WFS 1.1/2.0 GetFeature requests force the features to be read from the data source (and the filter to be evaluated) twice: once each time a new iterator is created and visited.
This is especially problematic with big data sources or complex/heavy queries.

Mind, your observation about reading the data twice is true only for shapefiles; a proper spatial database can count much faster without
the need to actually load all the data.

Hi Andrea, you are right, I verified the same behavior with PostGIS layers. Each WFS GetFeature request (v1.1/2.0) executes two SQL queries against the database, just like with a shapefile:

  • the size = features.size() statement:
    first query, via the “JDBCFeatureSource::getCountInternal()” method;

  • the final feature-encoding step:
    second query, via the “JDBCFeatureSource::getReaderInternal()” method.

We agree that PostGIS, or any modern database, is fast, but IMHO there is an unnecessary waste of database resources, and of performance, in each GetFeature request that returns a small number of records (10, 100, 1000?). Currently the initial count and the later full query are always executed; we could reduce them to a single query, at no extra cost, by caching the data when the result contains a small number of records, and otherwise preserve the current behavior to avoid overloading the RAM. A sketch of the double round trip follows.
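In GeoTools API terms it looks roughly like this (a sketch; the type name and
filter are made up, and the SQL in the comments only indicates what the JDBC
data store generates):

import org.geotools.data.Query;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.geotools.data.simple.SimpleFeatureSource;

SimpleFeatureSource source = dataStore.getFeatureSource("states");
Query query = new Query("states", filter);

// 1st trip, getCountInternal(): roughly SELECT count(*) FROM states WHERE ...
int size = source.getFeatures(query).size();

// 2nd trip, getReaderInternal(): roughly SELECT ... FROM states WHERE ...
SimpleFeatureIterator it = source.getFeatures(query).features();
try {
    while (it.hasNext()) {
        it.next(); // encode each feature
    }
} finally {
    it.close();
}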

Alvaro


-------------------------------------------------------


To try to clarify my comments, here is a little code:

https://github.com/geoserver/geoserver/compare/master...ahuarte47:GEOS-7296_fix_read_twice

:-)

Alvaro


-------------------------------------------------------


Hi, I see a problem in my proposal. For long GetFeature requests, two “JDBCFeatureSource::getReaderInternal()” calls are executed, which is much worse than one “JDBCFeatureSource::getCountInternal()” call plus one “JDBCFeatureSource::getReaderInternal()” call.

The best option would be to keep the cache directly in the shapefile provider.

Do you agree?


-------------------------------------------------------

On Fri, Nov 6, 2015 at 9:56 AM, A Huarte <ahuarte47@anonymised.com> wrote:

Hi, I see a problem in my proposal. For long GetFeature requests, two "JDBCFeatureSource::getReaderInternal()"
calls are executed, which is much worse than one "JDBCFeatureSource::getCountInternal()"
call plus one "JDBCFeatureSource::getReaderInternal()" call.

The best option would be to keep the cache directly in the shapefile provider.

As said before, I'm not convinced this is an improvement, but I'd like
to hear other developers. As far as I'm concerned, I'd rather not
have this complication only for the sake of an installation powered by
shapefiles (it's a rather uncommon scenario; even if you receive
shapefiles, one can automate their ingestion into a local
database using the importer module)... but it's just my personal
opinion.

If other developers see this as an improvement, then yes, I'd keep it in
the shapefile store, adding a query hint to activate
this behavior and control the cache size (by no means should it be the
default). Even for the shapefile store this would
need to be activated carefully. For example, if I'm asking for the size of the
shapefile with no filters, it can be read from the header
without having to check the actual file contents; if it's just a bbox
filter, then the count can also be optimized by scanning
just the spatial index and the bbox header contained in each geometry in
the shp file, and so on.
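For reference, the no-filter case really is just header arithmetic: the .shx
index file has a fixed 100-byte header followed by one 8-byte entry per
feature, so the count falls out of the file length (a sketch, not GeoTools'
actual code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Feature count of a shapefile, derived from its .shx index file:
// 100-byte header, then one fixed-size 8-byte record per feature.
static long countFeatures(Path shx) throws IOException {
    return (Files.size(shx) - 100) / 8;
}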

Cheers
Andrea


-------------------------------------------------------

On 6 November 2015 at 08:04, A Huarte <ahuarte47@anonymised.com> wrote:

Hi Andrea, you are right, I verified the same behavior with PostGIS layers. Each
WFS GetFeature request (v1.1/2.0) executes two SQL queries against the database,
just like with a shapefile:

- the size = features.size() statement:
  first query, via the "JDBCFeatureSource::getCountInternal()" method;

- the final feature-encoding step:
  second query, via the "JDBCFeatureSource::getReaderInternal()" method.

We agree that PostGIS, or any modern database, is fast, but IMHO there is an
unnecessary waste of database resources, and of performance, in each GetFeature
request that returns a small number of records (10, 100, 1000?). Currently the
initial count and the later full query are always executed; we could reduce them
to a single query, at no extra cost, by caching the data when the result contains
a small number of records, and otherwise preserve the current behavior to avoid
overloading the RAM.

Sure, except count/startIndex parameters and request limits (maxFeatures)
all impact this -- you want an unrestricted count (how many match) *and* a
limited number of features (startIndex/count/maxFeatures) -- they're two
different queries. *And* you need them in the right order: GML3 writes both
numberMatched & numberReturned counts at the top, but features should be
streamed from the datasource so the whole resultset doesn't have to sit in
memory.

Note you can disable the extra count query for individual feature types via
the skipNumberMatched option, which then returns numberMatched=unknown in
GML.

There are some alternative approaches you can use for SQL databases:
http://stackoverflow.com/questions/156114/best-way-to-get-result-count-before-limit-was-applied-in-php-postgresql
discusses a few different ways for PostgreSQL, and one for MySQL too. I
suspect MSSQL can use the window approach too. Whether that's faster than
2x separate queries I don't know (I suspect so).
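A sketch of that window-function approach over plain JDBC (the table, columns
and filter are made up for illustration):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// One round trip returns both the requested page and the unrestricted count.
static long pageWithCount(Connection conn) throws SQLException {
    String sql = "SELECT fid, geom, count(*) OVER () AS full_count "
            + "FROM states WHERE persons > 1000000 "
            + "ORDER BY fid LIMIT 100 OFFSET 0"; // count=100, startIndex=0
    long numberMatched = 0;
    Statement st = conn.createStatement();
    try {
        ResultSet rs = st.executeQuery(sql);
        while (rs.next()) {
            numberMatched = rs.getLong("full_count"); // same value on every row
            // ... encode this row as a feature ...
        }
    } finally {
        st.close();
    }
    // Caveat: if the requested page lies past the last match, no rows come
    // back and the count is lost, so a fallback count query is still needed.
    return numberMatched;
}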

Seems like (if those ^ don't work) you could expand your approach to pre-read
*part* of the resultset (eg. the first 100? 1000? features), then, if you
*did* get to the end, check whether the rowcount is < min(maxFeatures,
startIndex+count) -- in that case startIndex+rowcount *is* numberMatched, and
you don't need a count query. Potentially. But getting that through the
datasource/featurecollection abstractions down to GeoTools might be complex
:-)

As Andrea suggests, doing this at the datasource level might be easier: for
shapefiles, implementing something that has the same sort of effect as the
window approach in SQL (filter all, count matched, return subset).

Rob
--

*Koordinates*
PO Box 1604, Shortland St, Auckland 1140, New Zealand
Phone +64-9-966 0433 koordinates.com

Hi Robert,
your mail made me consider another option… assuming that we have a caching memory limit configured somewhere,
what if we know in advance that we cannot possibly go beyond it, because the current request limit or the global
max features is no bigger than the cache size?

Say someone sets the cache limit to 1000, and also sets the global max features to 1000… or the GetFeature
is a paged one (has a maxFeatures/count parameter) and we know in advance only 100 features are going to be loaded, tops… in that case, indeed,
we could do a single query, load the features in memory into a ListFeatureCollection, and call it a day.
That would help with all data sources, not just shapefiles. A sketch of the shortcut follows.
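Concretely (cacheLimit, query and source are assumed to exist; the names are
illustrative, not the actual GeoServer code):

import java.util.ArrayList;
import java.util.List;
import org.geotools.data.collection.ListFeatureCollection;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.opengis.feature.simple.SimpleFeature;

// The request can never return more than the cache limit: one read is enough.
if (query.getMaxFeatures() <= cacheLimit) {
    List<SimpleFeature> list = new ArrayList<SimpleFeature>();
    SimpleFeatureIterator it = source.getFeatures(query).features();
    try {
        while (it.hasNext()) {
            list.add(it.next()); // single query, features now live in memory
        }
    } finally {
        it.close();
    }
    SimpleFeatureCollection cached =
            new ListFeatureCollection(source.getSchema(), list);
    int size = cached.size(); // served from memory, no second trip
}

Note that size here is the number of returned features; whether that can also
stand in for numberMatched is exactly the question Rob raises at the end of
the thread.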

Cheers
Andrea

-------------------------------------------------------


Hi,

I moved the cache to another point (ContentFeatureCollection); it seems better to me, and it would help with all data sources, not just shapefiles.
:-)

https://github.com/geotools/geotools/compare/master...ahuarte47:GEOS-7301_CachedFeatureCollection

What do you think?
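The shape of such a collection-level cache could be something like the
following decorator (a hypothetical sketch to illustrate the idea, not the
actual code in the branch above):

import java.util.ArrayList;
import java.util.List;
import org.geotools.data.collection.ListFeatureCollection;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.geotools.feature.collection.DecoratingSimpleFeatureCollection;
import org.opengis.feature.simple.SimpleFeature;

// Hypothetical decorator: the first full read materializes the features, so a
// later size() or features() call does not go back to the data source.
class CachingFeatureCollection extends DecoratingSimpleFeatureCollection {
    private final int limit;
    private SimpleFeatureCollection cache; // filled on first read, if small enough
    private boolean tooBig;                // result exceeded the limit, don't retry

    CachingFeatureCollection(SimpleFeatureCollection delegate, int limit) {
        super(delegate);
        this.limit = limit;
    }

    private synchronized SimpleFeatureCollection contents() {
        if (cache != null) return cache;
        if (tooBig) return delegate;
        List<SimpleFeature> list = new ArrayList<SimpleFeature>();
        SimpleFeatureIterator it = delegate.features();
        try {
            while (it.hasNext()) {
                if (list.size() >= limit) {
                    // too large to cache: the partial read is wasted, which is
                    // exactly the drawback discussed earlier in the thread
                    tooBig = true;
                    return delegate;
                }
                list.add(it.next());
            }
        } finally {
            it.close();
        }
        cache = new ListFeatureCollection(delegate.getSchema(), list);
        return cache;
    }

    @Override
    public int size() { return contents().size(); }

    @Override
    public SimpleFeatureIterator features() { return contents().features(); }
}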


-------------------------------------------------------


Hi Andrea,

On 6 November 2015 at 12:03, Andrea Aime <andrea.aime@anonymised.com>
wrote:

Say someone sets the cache limit to 1000, and also sets the global max
features to 1000... or the GetFeature
is a paged one (has a maxFeatures/count parameter) and we know in advance
only 100 features are going to be loaded, tops... in that case, indeed,
we could do a single query, load the features in memory into a
ListFeatureCollection, and call it a day.

Isn't the count query only needed for numberMatched though? i.e.
global maxFeatures or &count don't affect it.

/me could be missing something, it's a year since I looked at that code.

Rob :-)