[Geoserver-users] Control flow extension troubleshooting

Hello,

I'm looking for some advice on the control-flow extension.

In my case, a couple of hours after a restart, GeoServer stops serving any WMS GetMap requests, even when there is no load on the server at all. I see no OOM messages either.

Is there any particular counter/indicator I should monitor?

Regards

Maciej

Tomcat 7.0.69 on CentOS 7, 12 cores
WMS Max rendering time = 45s

controlflow.properties:
timeout=45
ows.global=48
ows.wms.getmap=24
ows.gwc=48

On Sat, May 21, 2016 at 5:08 PM, Maciej Filocha <m.filocha@anonymised.com>
wrote:

Is there any particular counter/indicator I should monitor?

What version of GeoServer are you using?
If you set the logging level to INFO (e.g. with the default logging profile),
control-flow will log the requests in flight.
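As a rough way to turn those log lines into a counter you can watch, here is a minimal Python sketch. It assumes the message format quoted later in this thread ("Running requests: N, blocked requests: M"); the function names are made up for illustration, and the exact wording of the log line may differ between GeoServer versions.

```python
import re

# Pattern for the control-flow INFO line as quoted in this thread.
# The exact wording is an assumption and may vary between versions.
FLOW_LINE = re.compile(r"Running requests: (\d+), blocked requests: (\d+)")


def parse_flow_line(line):
    """Return (running, blocked) for a control-flow log line, else None."""
    match = FLOW_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def looks_stuck(line):
    """Heuristic: nothing is running, yet requests are waiting for a token."""
    counts = parse_flow_line(line)
    return counts is not None and counts[0] == 0 and counts[1] > 0
```

Feeding each new log line to looks_stuck() and alerting when it stays true for several minutes would be one simple indicator; the threshold is a judgment call.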

Recently I helped debug a similar situation, and it turned out the
threads were actually blocked by the GWC disk quota subsystem
and never released. Having control-flow around made the
server lockup happen sooner, because it limits the number of threads
you can actually execute requests on (e.g. in your case it would
happen as soon as 24 threads get locked up, instead of the 200 you'd
normally have to eat away to lock up a GeoServer without control-flow).
It would however require someone playing with the disk quota configuration
at runtime: https://osgeo-org.atlassian.net/browse/GEOS-7462

If that's not your case, looking at a jstack output may help identify the
issue.
Normally these lockups are due to control-flow not receiving the notification
that a request completed, or to requests not completing at all.

Cheers
Andrea

--

GeoServer Professional Services from the experts! Visit
http://goo.gl/it488V for more information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via di Montramito 3/A
55054 Massarosa (LU)
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

*AVVERTENZE AI SENSI DEL D.Lgs. 196/2003*

Le informazioni contenute in questo messaggio di posta elettronica e/o
nel/i file/s allegato/i sono da considerarsi strettamente riservate. Il
loro utilizzo è consentito esclusivamente al destinatario del messaggio,
per le finalità indicate nel messaggio stesso. Qualora riceviate questo
messaggio senza esserne il destinatario, Vi preghiamo cortesemente di
darcene notizia via e-mail e di procedere alla distruzione del messaggio
stesso, cancellandolo dal Vostro sistema. Conservare il messaggio stesso,
divulgarlo anche in parte, distribuirlo ad altri soggetti, copiarlo, od
utilizzarlo per finalità diverse, costituisce comportamento contrario ai
principi dettati dal D.Lgs. 196/2003.

The information in this message and/or attachments, is intended solely for
the attention and use of the named addressee(s) and may be confidential or
proprietary in nature or covered by the provisions of privacy act
(Legislative Decree June, 30 2003, no.196 - Italy's New Data Protection
Code).Any use not in accord with its purpose, any disclosure, reproduction,
copying, distribution, or either dissemination, either whole or partial, is
strictly forbidden except previous formal approval of the named
addressee(s). If you are not the intended recipient, please contact
immediately the sender by telephone, fax or e-mail and delete the
information in this message that has been received in error. The sender
does not give any warranty or accept liability as the content, accuracy or
completeness of sent messages and accepts no responsibility for changes
made after they were sent or for other risks which arise as a result of
e-mail transmission, viruses, etc.

-------------------------------------------------------

Geoserver 2.8.3, -DGWC_DISKQUOTA_DISABLED=true

Maciej

On 21.05.2016 at 17:25, Andrea Aime wrote:

If that's not your case, looking at a jstack output may help identify
the issue.
Normally these lockups are due to control-flow not receiving the notification
that a request completed, or to requests not completing at all.

I have now:

INFO [geoserver.flow] - Running requests: 0, blocked requests: 23

and jstack shows 52 threads in the TIMED_WAITING state, with two kinds of stack:

"http-apr-8443-exec-21" #25008 daemon prio=5 os_prio=0 tid=0x00007f42d8021000 nid=0x655 waiting on condition [0x00007f423bffe000]
    java.lang.Thread.State: TIMED_WAITING (parking)
         at sun.misc.Unsafe.park(Native Method)
         - parking to wait for <0x00000004c5330050> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
         at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
         at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:86)
         at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:32)
         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
         at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
         at java.lang.Thread.run(Thread.java:745)

or

"http-apr-8443-exec-12" #24999 daemon prio=5 os_prio=0 tid=0x00007f42d8016800 nid=0x64c waiting on condition [0x00007f423b4f1000]
    java.lang.Thread.State: TIMED_WAITING (parking)
         at sun.misc.Unsafe.park(Native Method)
         - parking to wait for <0x00000004c1d9dac0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
         at java.util.concurrent.ArrayBlockingQueue.offer(ArrayBlockingQueue.java:379)
         at org.geoserver.flow.controller.SingleQueueFlowController.requestIncoming(SingleQueueFlowController.java:58)
         at org.geoserver.flow.ControlFlowCallback.operationDispatched(ControlFlowCallback.java:139)
         at org.geoserver.ows.Dispatcher.fireOperationDispatchedCallback(Dispatcher.java:800)
         at org.geoserver.ows.Dispatcher.dispatch(Dispatcher.java:795)
         at org.geoserver.ows.Dispatcher.handleRequestInternal(Dispatcher.java:265)
         at org.springframework.web.servlet.mvc.AbstractController.handleRequest(AbstractController.java:153)
...
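
To see how many threads fall into each of the two groups across a whole dump, a small script can split the jstack output into thread entries and check for a control-flow frame. This is only a sketch: the function name is made up, it assumes thread entries are separated by blank lines as in the excerpts above, and it uses the package prefix org.geoserver.flow seen in the second stack as the marker.

```python
def count_flow_waiters(dump_text, marker="org.geoserver.flow."):
    """Count thread entries in a jstack dump, split by whether any frame
    contains `marker`. Returns (with_marker, without_marker)."""
    with_marker = 0
    without_marker = 0
    # Thread entries are assumed to be separated by blank lines, and each
    # entry starts with the quoted thread name.
    for block in dump_text.split("\n\n"):
        block = block.strip()
        if not block.startswith('"'):
            continue  # skip headers and other non-thread sections
        if marker in block:
            with_marker += 1
        else:
            without_marker += 1
    return with_marker, without_marker
```

Threads of the first kind (parked in TaskQueue.poll) are idle Tomcat workers waiting for work; threads of the second kind (parked in SingleQueueFlowController.requestIncoming) are requests waiting for control-flow to let them in.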

On Sat, May 21, 2016 at 8:44 PM, Maciej Filocha <m.filocha@anonymised.com>
wrote:

I have now:

INFO [geoserver.flow] - Running requests: 0, blocked requests: 23

OK, this may suggest that the ControlFlowCallback never received the
"finished" event for some requests, and thus did not return their tokens.
The system is based on a finite number of tokens in a pool: each request
that starts grabs one, each request that finishes returns it, and requests
that do not find a free token wait.
If the end-of-request notification never reaches the callback, we lose
tokens, and as a result the pool depletes.
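
The token accounting described above can be illustrated with a toy model. This is a minimal Python sketch using a plain semaphore, not GeoServer's actual implementation; the function name and the boolean "notification delivered" flags are hypothetical.

```python
import threading


def served_before_lockup(pool_size, notifications):
    """Serve requests one after another from a pool of `pool_size` tokens.

    `notifications` holds one boolean per request: True if the "finished"
    notification is delivered (token returned), False if it is lost.
    Returns how many requests were served before the pool ran dry.
    """
    tokens = threading.Semaphore(pool_size)
    served = 0
    for finished_seen in notifications:
        if not tokens.acquire(blocking=False):
            break  # no free token: this request would wait
        served += 1
        if finished_seen:
            tokens.release()  # normal case: the token goes back to the pool
        # otherwise the token is never returned, and the pool shrinks by one
    return served
```

With a pool of 3 tokens, three lost notifications are enough to block every later request, no matter how many well-behaved requests ran in between.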

Do you have stack traces in the logs, or any other indication as to what
might have gone wrong there?
I'm wondering in particular whether any tiling request failed. Those are the
nastiest ones to handle, because if they don't find a tile they circle back
and issue another, fake request to the GeoServer WMS. That's the most
problematic bit to handle, and the most likely to cause misbehavior. We
already have machinery to handle the double request without getting into a
deadlock, but I'm wondering whether something goes wrong when a tile build
fails.

Cheers
Andrea


-------------------------------------------------------

Andrea,

On 21.05.2016 at 21:16, Andrea Aime wrote:

Wondering in particular if any tiling request failed, those are the
nastiest ones to handle because if they don't find a tile they circle back
and issue another fake request back to geoserver wms, that's the most
problematic bit to handle, the most likely to cause a misbehavior. We
already have machinery to handle the double request without getting into a
deadlock, but I'm wondering if maybe something is happening if a tile
build fails.

Following your advice, I made two changes. One of them has definitely fixed the problem.

First: there is a bug in GeoServer. Changing the name of a style does not propagate (sometimes? always?) to the GWC XML config in the gwc-layers subdirectory, under the <defaultStyle> key. I forgot to change the style name manually in one layer, so an error was sometimes thrown ("style not found"). That kind of error was probably not handled properly by the control-flow extension, causing infinite locks.

Second: as my service is "WMS tile-only" by design, the main URL was "http://server/geoserver/gwc/service/wms". I changed it to "http://server/geoserver/wms" yesterday (should I expect any performance degradation?).

I think the first fix was the crucial one.

Thank you for your help!

Maciej

On Sun, May 22, 2016 at 10:59 AM, Maciej Filocha <m.filocha@anonymised.com>
wrote:

Following your advice, I made two changes. One of them has definitely fixed
the problem.

First: there is a bug in GeoServer. Changing the name of a style does not
propagate (sometimes? always?) to the GWC XML config in the gwc-layers
subdirectory, under the <defaultStyle> key.

Confirmed, changing the name of a style seems to never update the
corresponding GWC configuration. Ticket created:
https://osgeo-org.atlassian.net/browse/GEOS-7552

I also confirmed that this ends up causing the lockup: the machinery
protecting control-flow from the double requests made by GWC (the first WMS
request being the client's, the second, fake one being generated by GWC to
build a missing tile) does not work if the fake request fails during
parsing. Here is a second ticket:

https://osgeo-org.atlassian.net/browse/GEOS-7553

I forgot to change the style name manually in one layer, so an error was
sometimes thrown ("style not found"). That kind of error was probably not
handled properly by the control-flow extension, causing infinite locks.

Second: as my service is "WMS tile-only" by design, the main URL was
"http://server/geoserver/gwc/service/wms". I changed it to
"http://server/geoserver/wms" yesterday (should I expect any performance
degradation?).

Well, that depends: do you have the GWC direct integration active (see
caching defaults)? If not, then yes, the requests are not going to be
cached at all.

Cheers
Andrea


-------------------------------------------------------

On 22.05.2016 at 19:42, Andrea Aime wrote:

    Second: as my service is "WMS tile-only" by design, the main URL was
    "http://server/geoserver/gwc/service/wms". I changed it to
    "http://server/geoserver/wms" yesterday (should I expect any
    performance degradation?).

Well that depends, do you have the GWC direct integration active (see
caching defaults)? If not, then yes, the requests are not going to be
cached at all

Sure, I have it enabled. All requests are tiled and cached.

My question was a little bit "low-level" - about minor performance differences, behaviour under high load etc.

Regards

Maciej

On Mon, May 23, 2016 at 9:50 AM, Maciej Filocha <m.filocha@anonymised.com>
wrote:

My question was a little bit "low-level" - about minor performance
differences, behaviour under high load etc.

I never measured, so I'm just guessing, but I don't foresee a significant
performance difference: both systems have to check whether your WMS request
actually matches a cacheable tile.
To get the last ounce of performance out, one would have to hit an xyz
protocol (WMTS, TMS), so that the checks do not have to be performed... but
I don't think the overhead of those checks is all that significant.

Cheers
Andrea


-------------------------------------------------------

Ok, this may suggest the ControlFlowCallback never received the finished event for some requests, and thus it did not return their tokens: the system is based on a finite number of tokens in a pool, each request starting grabs one, each one finishing returns one back, requests not finding a free token wait.
If the end of request message never reaches the callback, we lose tokens and as a result the pool depletes.

Hi Andrea,
Just a thought, but what about implementing a timeout on the pool tokens, and/or allowing one to be configured?

That way, if a request goes astray once in a while, after n seconds control-flow will assume things broke and the token will be returned to the pool. You then don’t end up with the pool getting slowly smaller over time as the odd erroneous-in-a-certain-way query makes tokens gradually disappear.

Cheers,
Jonathan

On Mon, May 23, 2016 at 1:42 PM, Jonathan Moules <
jonathan-lists@anonymised.com> wrote:

Hi Andrea,

Just a thought, but what about implementing a timeout on the pool tokens,
and/or allowing one to be configured?

Yes, this idea has been discussed, but I'm fiercely resisting it for a
couple of reasons:

   - It is not entirely trivial to implement.
   - It makes control-flow completely useless under heavy load, when all
   the requests end up taking a long time: the timeout lets more requests
   get in, which in the end makes things worse, because requests then take
   even longer, more tokens reach the timeout, and even more requests are
   let in.

Long story short, imho, if I cannot trust control-flow to limit concurrent
access under heavy load, it's useless and it's better to run without it. I
want to fix it instead of allowing it to degrade into no control.
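
To put rough numbers on the second point, here is a minimal discrete-time sketch in Python. It is a simplified model, not control-flow code: the function and parameter names are hypothetical, there is always a backlog of waiting requests, and every request takes the same fixed time. Real requests would slow down further as concurrency rises, so this understates the feedback effect described above.

```python
def peak_concurrency(limit, duration, token_timeout, steps):
    """Simulate `steps` one-second ticks with a permanent request backlog.

    limit:         number of tokens in the pool
    duration:      seconds each request actually runs
    token_timeout: seconds after which a token is handed back even though
                   its request is still running (None disables this)
    Returns the highest number of requests running at the same time.
    """
    running = []  # start times of requests still executing
    holders = []  # start times of requests still holding a token
    peak = 0
    for now in range(steps):
        running = [t for t in running if now - t < duration]
        holders = [
            t for t in holders
            if now - t < duration
            and (token_timeout is None or now - t < token_timeout)
        ]
        # Backlog: every free token is taken immediately by a new request.
        while len(holders) < limit:
            running.append(now)
            holders.append(now)
        peak = max(peak, len(running))
    return peak
```

In this model, with 4 tokens and 10-second requests, no token timeout keeps the peak at 4, while a 5-second token timeout doubles it to 8.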

Cheers
Andrea


-------------------------------------------------------

Hi Andrea,
I’ll defer to your judgement on implementation details of course, but I think the heavy-load problem can be resolved by simply setting the value really high. I was thinking of something considerably higher than whatever timeouts GeoServer (the rendering time, which I know covers only a portion of the full query-handling time), the load balancer, the proxy server, or anything else are set to.
So if they all have timeouts set to 60s, a minimum token timeout of, say, 120s would probably be my suggestion (or maybe even higher?).

The idea is to ensure tokens remain alive over the long term. In this context it shouldn’t matter if a token is missing for a few minutes while the difference between the timeouts passes; what the server admin wants is to avoid permanently losing tokens, which over a period of days, weeks, months, or even longer could add up.
As a bonus, this would improve the load balancing further if tokens did go missing during high load, because it would give GeoServer a slight reprieve for a minute or two (or whatever the value is set to) while the number of tokens (and thus requests) is temporarily reduced.

As long as that’s documented clearly in the manual (including the reasoning) so folks can set it appropriately, I believe this should neatly side-step the heavy-load problem.

Cheers,
Jonathan

On Mon, May 23, 2016 at 2:27 PM, Jonathan Moules <
jonathan-lists@anonymised.com> wrote:

Hi Andrea,
  I'll defer to your judgement on implementation details of course, but I
think the heavy-load is resolvable by simply having the value really high.
I was thinking considerably higher than whatever the GeoServer (rendering
time - which I know is only for a portion of the full-query-handling-time),
load-balancer, proxy-server, or whatever else have their timeouts set to.
So if they all have timeouts set to 60s, a minimum token-timeout of say
120s would probably be my suggestion (or maybe even higher??).

Eeeh, there are a few catches there... pure raster requests are not under
the control of the timeout at all, and some vector rendering operations are
not controlled either, because the check is done only between rendering one
feature and the next (so if one feature takes a long time, or forever, to
render, you're toast). In Java there is no such thing as stopping a thread
from outside: the thread has to allow you to stop it, which you do by adding
checkpoints in the code, or by poisoning the inputs and outputs of that
thread.
The rendering timeout remains a very good idea, and it's best to have it
set up in most cases, but the implementation stopped where it was relatively
easy to add...
On a separate note, some types of requests just take minutes to execute,
e.g. large printouts and some WPS requests (you cannot force clients to use
async, and async is out of control-flow's reach anyway).
Long story short, a reasonable timeout to return the tokens would be,
outside of the simple cases, tens of minutes.
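
To illustrate the checkpoint idea, here is a minimal Java sketch. It is not
GeoServer's renderer: the class name CheckpointDemo, the method renderAll and
the cancel flag are all hypothetical, made up for the example. It only shows
that a cancel request takes effect between two features, never in the middle
of drawing one.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class CheckpointDemo {

    /**
     * Draws features one by one and checks the cancel flag between features.
     * Returns how many features were drawn before the loop stopped.
     * (Hypothetical helper, not a GeoServer API.)
     */
    static <T> int renderAll(List<T> features, AtomicBoolean cancelled, Consumer<T> drawFeature) {
        int drawn = 0;
        for (T feature : features) {
            // Checkpoint: the only place where a timeout can take effect.
            if (cancelled.get()) {
                break;
            }
            // If this call never returns, the checkpoint is never reached again.
            drawFeature.accept(feature);
            drawn++;
        }
        return drawn;
    }

    public static void main(String[] args) {
        AtomicBoolean cancelled = new AtomicBoolean(false);
        List<String> features = List.of("a", "b", "c", "d");
        // Simulate a timeout firing while the second feature is being drawn.
        int drawn = renderAll(features, cancelled, f -> {
            if (f.equals("b")) {
                cancelled.set(true);
            }
        });
        System.out.println("features drawn: " + drawn); // prints "features drawn: 2"
    }
}
```

Feature "b" is still drawn to completion even though the flag is set while it
is being processed; the loop only notices at the next checkpoint. A single
feature that never finishes would therefore never be interrupted.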

Do you know how much time it takes to lock up control-flow in the scenario
described by Maciej? Less than a second (the time for someone to do two
zoom-ins in the GWC preview).
It's not randomly losing tokens out of bad luck, it's consistent and
fast... even with a timeout of a minute, what good does it actually do? Once
a minute you allow another batch of 16 or 32 GetMap requests to go, not very
useful imho :-p
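
The arithmetic behind that can be sketched with a small simulation. This is
not how control-flow is implemented, and no such token timeout exists in it:
the class LeasedTokens and its methods are hypothetical. The sketch assumes
24 tokens (the ows.wms.getmap value Maciej posted), a 60 s lease, one
incoming request every 100 ms, and a server where no request ever returns its
token, so tokens come back only when their lease expires.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class LeasedTokens {
    private final int capacity;
    private final long leaseMillis;
    // Expiry times of the tokens currently handed out, oldest first.
    private final Deque<Long> expiries = new ArrayDeque<>();

    LeasedTokens(int capacity, long leaseMillis) {
        this.capacity = capacity;
        this.leaseMillis = leaseMillis;
    }

    /** Tries to take a token at time nowMillis, first reclaiming tokens whose lease has expired. */
    synchronized boolean tryAcquire(long nowMillis) {
        while (!expiries.isEmpty() && expiries.peekFirst() <= nowMillis) {
            expiries.pollFirst();
        }
        if (expiries.size() >= capacity) {
            return false; // every token is still held by a stuck request
        }
        expiries.addLast(nowMillis + leaseMillis);
        return true;
    }

    /** Counts the requests admitted over a period when no request ever gives its token back. */
    static int admittedWhenAllLeak(int capacity, long leaseMillis, long periodMillis, long stepMillis) {
        LeasedTokens tokens = new LeasedTokens(capacity, leaseMillis);
        int admitted = 0;
        for (long now = 0; now < periodMillis; now += stepMillis) {
            if (tokens.tryAcquire(now)) {
                admitted++;
            }
        }
        return admitted;
    }

    public static void main(String[] args) {
        // 24 tokens, 60 s lease, 3 minutes of traffic, one request every 100 ms.
        System.out.println(admittedWhenAllLeak(24, 60_000, 180_000, 100)); // prints 72
    }
}
```

Out of the 1800 requests that arrive in those three minutes, 72 get through,
i.e. one batch of 24 per minute, which is the "not very useful" outcome
described above.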

On the other hand, as said, a correct implementation of that concept is not
entirely trivial... funding and/or pull requests welcome ;-)

Cheers
Andrea

-------------------------------------------------------

Hi Andrea,
Interesting feedback, that does put paid to my idea for the most part. :-)

I’d just clarify that the purpose of the notion was a long-term safeguard. I was thinking more of dealing with cases where tokens disappear slowly over time (which I’m just guessing happens - I don’t know if that’s a real thing, it seemed likely from your explanation of tokens). In that circumstance a 10 minute timeout would be fine (in fact “600s” was going to be after the “or maybe even higher” in my last email).

If one were implemented, it would probably need to be configurable per service, to deal with the different usages you noted.

Alas I don’t have any funding. Just wondering aloud. :-)
Thanks,
Jonathan

On 23.05.2016 at 15:49, Jonathan Moules wrote:

I was more thinking to deal with cases where tokens disappear
slowly over time (which I'm just guessing happens - I don't know if
that's a real thing, it seemed likely from your explanation of tokens).

In my case, after serving ~300k requests, not a single one of the processes has been lost so far :-)

Maciej