[Geoserver-users] GWC CPU to 100% / hung / gmaps service

Sorry for # of questions from me recently … trying to get my first geoserver instance running (and one question resulted in legitimate bug report) …

I am now having an issue with GWC … specifically using the gmaps service … taking a thread to 100% CPU use and hanging … with no more response from the service. Any pointers toward debugging this?

This seems to happen randomly … sometimes it will happen quickly after a tomcat restart … sometimes I won’t see it again for a few hours. Not a huge load on server. Just me in development zooming and panning the maps, requesting tiles from the gmaps gwc service.

jstack can’t even seem to walk the stack in the hung thread. I get a “Error occurred during stack walking: java.lang.NullPointerException” from jstack on the specific PID

Platform: Ubuntu 12.04 running in AWS. One layer doing this is postgis-backed layer. The other is an image mosaic layer (with meta info in postgis).

Other gwc services seem to continue ok … I can access the gwc WMS service, etc.

Thanks again,

Mike Grogan

On Mon, Aug 19, 2013 at 7:52 PM, Mike Grogan <d.michael.grogan@anonymised.com>wrote:

Sorry for # of questions from me recently ... trying to get my first
geoserver instance running (and one question resulted in legitimate bug
report) ...

I am now having an issue with GWC ... specifically using the gmaps service
... taking a thread to 100% CPU use and hanging ... with no more response
from the service. Any pointers toward debugging this?

This seems to happen randomly ... sometimes it will happen quickly after a
tomcat restart ... sometimes I won't see it again for a few hours. Not a
huge load on server. Just me in development zooming and panning the maps,
requesting tiles from the gmaps gwc service.

jstack can't even seem to walk the stack in the hung thread. I get a
"Error occurred during stack walking: java.lang.NullPointerException" from
jstack on the specific PID

Hmm... not sure, but have you tried sending sigquit instead? Quoting from
http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx

-----------------------------------------
kill -QUIT process_id
There process_id is the process number of the running Java process. The
thread dump will be sent to wherever standard out is redirected too.
-----------------------------------------

Cheers
Andrea

--

Our support, Your Success! Visit http://opensdi.geo-solutions.it for more
information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

-------------------------------------------------------

I have gotten a thread dump with “kill -3”. Care to take a look at it?

  • Mike
···

On Mon, Aug 19, 2013 at 2:19 PM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Mon, Aug 19, 2013 at 7:52 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

Sorry for # of questions from me recently … trying to get my first geoserver instance running (and one question resulted in legitimate bug report) …

I am now having an issue with GWC … specifically using the gmaps service … taking a thread to 100% CPU use and hanging … with no more response from the service. Any pointers toward debugging this?

This seems to happen randomly … sometimes it will happen quickly after a tomcat restart … sometimes I won’t see it again for a few hours. Not a huge load on server. Just me in development zooming and panning the maps, requesting tiles from the gmaps gwc service.

jstack can’t even seem to walk the stack in the hung thread. I get a “Error occurred during stack walking: java.lang.NullPointerException” from jstack on the specific PID

Hmm… not sure, but have you tried sending sigquit instead? Quoting from http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx


kill -QUIT process_id

There process_id is the process number of the running Java process. The thread dump will be sent to wherever standard out is redirected too.


Cheers
Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


This looks like it may be getting stuck in a TimerTask in LRUAuthenticationCacheImpl.java that is searching for expired authentication tokens. The stack from the process utilizing 100% CPU is consistently:

“Timer-0” daemon prio=10 tid=0x00007f57495e1800 nid=0x6d0e runnable [0x00007f5743546000]
java.lang.Thread.State: RUNNABLE
at org.geoserver.security.auth.LRUAuthenticationCacheImpl$1.run(LRUAuthenticationCacheImpl.java:66)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

In this case, my process id was 27918, which matches the hex 6d0e above. I see this consistently after tomcat restart, though, obviously, the PID/nid changes. Each time, though, the trace for PID/nid returns the TimerTask above.

Looking at the source, this is a TimerTask for adding authentication tokens to be removed.

So, Andrea, I know you and I just discussed gwc and security not working well together earlier. I do still have auth on my data layers … and am considering some workarounds outside of geoserver to help me.

Do you think this issue goes away once I remove all the authentication to the data layers?

Even if it does, could there be cause for concern outside of my GWC services? Could I have stumbled upon a race condition or similar that might impact other things? Could this impact any geoserver layers, etc., requiring authentication even outside of GWC?

Thanks,

Mike Grogan

···

On Mon, Aug 19, 2013 at 2:32 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

I have gotten a thread dump with “kill -3”. Care to take a look at it?

  • Mike

On Mon, Aug 19, 2013 at 2:19 PM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Mon, Aug 19, 2013 at 7:52 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

Sorry for # of questions from me recently … trying to get my first geoserver instance running (and one question resulted in legitimate bug report) …

I am now having an issue with GWC … specifically using the gmaps service … taking a thread to 100% CPU use and hanging … with no more response from the service. Any pointers toward debugging this?

This seems to happen randomly … sometimes it will happen quickly after a tomcat restart … sometimes I won’t see it again for a few hours. Not a huge load on server. Just me in development zooming and panning the maps, requesting tiles from the gmaps gwc service.

jstack can’t even seem to walk the stack in the hung thread. I get a “Error occurred during stack walking: java.lang.NullPointerException” from jstack on the specific PID

Hmm… not sure, but have you tried sending sigquit instead? Quoting from http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx


kill -QUIT process_id

There process_id is the process number of the running Java process. The thread dump will be sent to wherever standard out is redirected too.


Cheers
Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


Hello,

We have had this issue with a version 2.2.0 after upgrading the major cause of the huge number of threads disappeared,
but we still have the issue that from time to time a timer thread was using 100% CPU.

What we saw is that often (always) the CPU issue appears after an outOf memory problem, than an other issue caused this.

In our case this happend after a java.lang.OutOfMemoryError: Java heap space on a rendering task, we are still looking
what could trigger this issue.

Regards,

Eric

---
Op 19/08/2013 22:33, Mike Grogan schreef:

This looks like it may be getting stuck in a TimerTask in LRUAuthenticationCacheImpl.java that is searching for expired authentication tokens. The stack from the process utilizing 100% CPU is consistently:

"Timer-0" daemon prio=10 tid=0x00007f57495e1800 nid=0x6d0e runnable [0x00007f5743546000]
   java.lang.Thread.State: RUNNABLE
        at org.geoserver.security.auth.LRUAuthenticationCacheImpl$1.run(LRUAuthenticationCacheImpl.java:66)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)

In this case, my process id was 27918, which matches the hex 6d0e above. I see this consistently after tomcat restart, though, obviously, the PID/nid changes. Each time, though, the trace for PID/nid returns the TimerTask above.

Looking at the source, this is a TimerTask for adding authentication tokens to be removed.

So, Andrea, I know you and I just discussed gwc and security not working well together earlier. I do still have auth on my data layers ... and am considering some workarounds outside of geoserver to help me.

Do you think this issue goes away once I remove all the authentication to the data layers?

Even if it does, could there be cause for concern outside of my GWC services? Could I have stumbled upon a race condition or similar that might impact other things? Could this impact any geoserver layers, etc., requiring authentication even outside of GWC?

Thanks,

Mike Grogan

On Mon, Aug 19, 2013 at 2:32 PM, Mike Grogan <d.michael.grogan@anonymised.com <mailto:d.michael.grogan@anonymised.com>> wrote:

    I have gotten a thread dump with "kill -3". Care to take a look at it?

    - Mike

    On Mon, Aug 19, 2013 at 2:19 PM, Andrea Aime <andrea.aime@anonymised.com
    <mailto:andrea.aime@anonymised.com>> wrote:

        On Mon, Aug 19, 2013 at 7:52 PM, Mike Grogan
        <d.michael.grogan@anonymised.com <mailto:d.michael.grogan@anonymised.com>> wrote:

            Sorry for # of questions from me recently ... trying to get my
            first geoserver instance running (and one question resulted in
            legitimate bug report) ...

            I am now having an issue with GWC ... specifically using the gmaps
            service ... taking a thread to 100% CPU use and hanging ... with
            no more response from the service. Any pointers toward debugging
            this?

            This seems to happen randomly ... sometimes it will happen quickly
            after a tomcat restart ... sometimes I won't see it again for a
            few hours. Not a huge load on server. Just me in development
            zooming and panning the maps, requesting tiles from the gmaps gwc
            service.

            jstack can't even seem to walk the stack in the hung thread. I
            get a "Error occurred during stack walking:
            java.lang.NullPointerException" from jstack on the specific PID

        Hmm... not sure, but have you tried sending sigquit instead? Quoting
        from
        http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx

        -----------------------------------------
        kill -QUIT process_id
        There process_id is the process number of the running Java process.
        The thread dump will be sent to wherever standard out is redirected too.
        -----------------------------------------

        Cheers
        Andrea

        -- ==
        Our support, Your Success! Visit http://opensdi.geo-solutions.it for
        more information.
        ==

        Ing. Andrea Aime
        @geowolf
        Technical Lead

        GeoSolutions S.A.S.
        Via Poggio alle Viti 1187
        55054 Massarosa (LU)
        Italy
        phone: +39 0584 962313 <tel:%2B39%200584%20962313>
        fax: +39 0584 1660272 <tel:%2B39%200584%201660272>
        mob: +39 339 8844549 <tel:%2B39%20%C2%A0339%208844549>

        http://www.geo-solutions.it
        http://twitter.com/geosolutions_it

        -------------------------------------------------------

------------------------------------------------------------------------------
Introducing Performance Central, a new site from SourceForge and
AppDynamics. Performance Central is your source for news, insights,
analysis and resources for efficient Application Performance Management.
Visit us today!
http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk

_______________________________________________
Geoserver-users mailing list
Geoserver-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geoserver-users

--
Eric Smets eric.smets@anonymised.com
FKS bvba - Formal and Knowledge Systems http://www.fks.be/
Schampbergstraat 32 Tel: ++32-(0)11-21 49 11
B-3511 Hasselt Fax: ++32-(0)11-22 04 19

Hi guys,

I’m interested to know whether either of you came up with any further leads on this issue. I’m experiencing a similar thing, using version 2.3.5.

My instance of GeoServer is primarily used to serve tiles of weather-related GeoTIFF (relatively small in size) layers which are created and aged dynamically using the REST API. Authentication is required (implemented using “data security” rules on workspaces) and is provided with each request by means of HTTP header proxy authentication.

Under light to moderate load GeoServer operates normally for anywhere between minutes and hours before OGC service requests suddenly hang, despite the Tomcat being responsive to other requests.

A thread dump shows many threads in the WAITING state stemming from org.geoserver.security.auth.LRUAuthenticationCacheImpl.

Regards,
Andrew

···

On 20/08/2013 17:11, Eric Smets wrote:

Hello,

We have had this issue with a version 2.2.0 after upgrading the major cause of the huge number of threads disappeared,
but we still have the issue that from time to time a timer thread was using 100% CPU.

What we saw is that often (always) the CPU issue appears after an outOf memory problem, than an other issue caused this.

In our case this happend after a java.lang.OutOfMemoryError: Java heap space on a rendering task, we are still looking
what could trigger this issue.

Regards,

Eric


Op 19/08/2013 22:33, Mike Grogan schreef:

This looks like it may be getting stuck in a TimerTask in LRUAuthenticationCacheImpl.java that is searching for expired authentication tokens. The stack from the process utilizing 100% CPU is consistently:

“Timer-0” daemon prio=10 tid=0x00007f57495e1800 nid=0x6d0e runnable [0x00007f5743546000]
java.lang.Thread.State: RUNNABLE
at org.geoserver.security.auth.LRUAuthenticationCacheImpl$1.run(LRUAuthenticationCacheImpl.java:66)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

In this case, my process id was 27918, which matches the hex 6d0e above. I see this consistently after tomcat restart, though, obviously, the PID/nid changes. Each time, though, the trace for PID/nid returns the TimerTask above.

Looking at the source, this is a TimerTask for adding authentication tokens to be removed.

So, Andrea, I know you and I just discussed gwc and security not working well together earlier. I do still have auth on my data layers … and am considering some workarounds outside of geoserver to help me.

Do you think this issue goes away once I remove all the authentication to the data layers?

Even if it does, could there be cause for concern outside of my GWC services? Could I have stumbled upon a race condition or similar that might impact other things? Could this impact any geoserver layers, etc., requiring authentication even outside of GWC?

Thanks,

Mike Grogan

On Mon, Aug 19, 2013 at 2:32 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

I have gotten a thread dump with “kill -3”. Care to take a look at it?

  • Mike

On Mon, Aug 19, 2013 at 2:19 PM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Mon, Aug 19, 2013 at 7:52 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

Sorry for # of questions from me recently … trying to get my first geoserver instance running (and one question resulted in legitimate bug report) …

I am now having an issue with GWC … specifically using the gmaps service … taking a thread to 100% CPU use and hanging … with no more response from the service. Any pointers toward debugging this?

This seems to happen randomly … sometimes it will happen quickly after a tomcat restart … sometimes I won’t see it again for a few hours. Not a huge load on server. Just me in development zooming and panning the maps, requesting tiles from the gmaps gwc service.

jstack can’t even seem to walk the stack in the hung thread. I get a “Error occurred during stack walking: java.lang.NullPointerException” from jstack on the specific PID

Hmm… not sure, but have you tried sending sigquit instead? Quoting from http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx


kill -QUIT process_id

There process_id is the process number of the running Java process. The thread dump will be sent to wherever standard out is redirected too.


Cheers
Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


------------------------------------------------------------------------------
Introducing Performance Central, a new site from SourceForge and 
AppDynamics. Performance Central is your source for news, insights, 
analysis and resources for efficient Application Performance Management. 
Visit us today!
[http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk](http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk)
_______________________________________________
Geoserver-users mailing list
[Geoserver-users@lists.sourceforge.net](mailto:Geoserver-users@lists.sourceforge.net)
[https://lists.sourceforge.net/lists/listinfo/geoserver-users](https://lists.sourceforge.net/lists/listinfo/geoserver-users)

-- 
Eric Smets                                    [eric.smets@anonymised.com](mailto:eric.smets@anonymised.com)
FKS bvba - Formal and Knowledge Systems       [http://www.fks.be/](http://www.fks.be/)
Schampbergstraat 32                           Tel:  ++32-(0)11-21 49 11
B-3511 Hasselt                                Fax:  ++32-(0)11-22 04 19

------------------------------------------------------------------------------
Introducing Performance Central, a new site from SourceForge and 
AppDynamics. Performance Central is your source for news, insights, 
analysis and resources for efficient Application Performance Management. 
Visit us today!
[http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk](http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk)
_______________________________________________
Geoserver-users mailing list
[Geoserver-users@lists.sourceforge.net](mailto:Geoserver-users@lists.sourceforge.net)
[https://lists.sourceforge.net/lists/listinfo/geoserver-users](https://lists.sourceforge.net/lists/listinfo/geoserver-users)

On Mon, Aug 19, 2013 at 10:33 PM, Mike Grogan <d.michael.grogan@anonymised.com>wrote:

This looks like it may be getting stuck in a TimerTask
in LRUAuthenticationCacheImpl.java that is searching for expired
authentication tokens. The stack from the process utilizing 100% CPU is
consistently:

"Timer-0" daemon prio=10 tid=0x00007f57495e1800 nid=0x6d0e runnable
[0x00007f5743546000]
   java.lang.Thread.State: RUNNABLE
        at
org.geoserver.security.auth.LRUAuthenticationCacheImpl$1.run(LRUAuthenticationCacheImpl.java:66)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)

In this case, my process id was 27918, which matches the hex 6d0e above.
I see this consistently after tomcat restart, though, obviously, the
PID/nid changes. Each time, though, the trace for PID/nid returns the
TimerTask above.

Looking at the source, this is a TimerTask for adding authentication
tokens to be removed.

Yep, I noticed. Not sure why this is happening, haven't wrote that code
myself, but that thing
just loops over authentication cache keys, my only guess is that they might
be accumulating faster
than we can expire them, which might be compatible with a tilecache like
workload when many
parallel requests are coming in, under the assumption that each one for
some reason creates
a new entry.
By looking at the code, it seems that might happen if the client does not
send back the
_geoserver_security_cache_key attribute...

Do you think this issue goes away once I remove all the authentication to
the data layers?

This one seems more related to the client and maybe the fact you're trying
to authenticate from it?
I cannot be more precise without spending time in a serious investigation,
the code in question has been
written by Justin and Christian (cc'ed) so I'm not familiar with it.

Cheers
Andrea

--

Our support, Your Success! Visit http://opensdi.geo-solutions.it for more
information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

-------------------------------------------------------

Hi all

The code was written by me. I is hard to find the problem but looking at the code in

org.geoserver.security.auth.AuthenticationCacheKey

I see the equals(…) implementation comparing strings with == instead of using String.equals(). Maybe the cache is filled up.

Proposal:

I can send you an AuthenticationCacheKey.class file. You have to replace this file in the main-.jar file contained in your WEB-INF/lib folder. (The jar is a zip file, use your preferred zip tool).

After replacement, restart Geoserver and look at the results.

Let me know

Christian

···

On Wed, Aug 21, 2013 at 9:42 AM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Mon, Aug 19, 2013 at 10:33 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

This looks like it may be getting stuck in a TimerTask in LRUAuthenticationCacheImpl.java that is searching for expired authentication tokens. The stack from the process utilizing 100% CPU is consistently:

“Timer-0” daemon prio=10 tid=0x00007f57495e1800 nid=0x6d0e runnable [0x00007f5743546000]
java.lang.Thread.State: RUNNABLE
at org.geoserver.security.auth.LRUAuthenticationCacheImpl$1.run(LRUAuthenticationCacheImpl.java:66)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

In this case, my process id was 27918, which matches the hex 6d0e above. I see this consistently after tomcat restart, though, obviously, the PID/nid changes. Each time, though, the trace for PID/nid returns the TimerTask above.

Looking at the source, this is a TimerTask for adding authentication tokens to be removed.

Yep, I noticed. Not sure why this is happening, haven’t wrote that code myself, but that thing
just loops over authentication cache keys, my only guess is that they might be accumulating faster
than we can expire them, which might be compatible with a tilecache like workload when many
parallel requests are coming in, under the assumption that each one for some reason creates
a new entry.
By looking at the code, it seems that might happen if the client does not send back the
_geoserver_security_cache_key attribute…

Do you think this issue goes away once I remove all the authentication to the data layers?

This one seems more related to the client and maybe the fact you’re trying to authenticate from it?
I cannot be more precise without spending time in a serious investigation, the code in question has been
written by Justin and Christian (cc’ed) so I’m not familiar with it.

Cheers

Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


On Wed, Aug 21, 2013 at 2:14 PM, Christian Mueller <
christian.mueller@anonymised.com> wrote:

Hi all

The code was written by me. I is hard to find the problem but looking at
the code in

org.geoserver.security.auth.AuthenticationCacheKey

I see the equals(..) implementation comparing strings with == instead of
using String.equals(). Maybe the cache is filled up.

Hmmm... the code seems correct to me?

        AuthenticationCacheKey otherKey = (AuthenticationCacheKey) other;
        return (filterName == otherKey.filterName
            || (filterName != null &&
filterName.equals(otherKey.filterName)))
            && (cacheKey == otherKey.cacheKey
            || (cacheKey != null && cacheKey.equals(otherKey.cacheKey)));

First tries to see if it's the same string, or if both are null, if not,
ensures the filterName
is not null, then uses equals.

Cheers
Andrea

--

Our support, Your Success! Visit http://opensdi.geo-solutions.it for more
information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

-------------------------------------------------------

-1 to me, the code is correct. Perhaps I should not work on 3 projects concurrently.

@Andrew

Do you see some lines in the log file like

21 Aug 15:34:57 DEBUG [geoserver.security] - Number of expired authentication tokens found: 0
21 Aug 15:34:57 DEBUG [geoserver.security] - End searching for expired authentication tokens

(Maybe you have to set your logging profile to VERBOSE_LOGGING.properties in the global settings page).

Cheers
Christian

···

On Wed, Aug 21, 2013 at 2:34 PM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Wed, Aug 21, 2013 at 2:14 PM, Christian Mueller <christian.mueller@anonymised.com.5445…> wrote:

DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

Hi all

The code was written by me. I is hard to find the problem but looking at the code in

org.geoserver.security.auth.AuthenticationCacheKey

I see the equals(…) implementation comparing strings with == instead of using String.equals(). Maybe the cache is filled up.

Hmmm… the code seems correct to me?

AuthenticationCacheKey otherKey = (AuthenticationCacheKey) other;
return (filterName == otherKey.filterName
|| (filterName != null && filterName.equals(otherKey.filterName)))
&& (cacheKey == otherKey.cacheKey
|| (cacheKey != null && cacheKey.equals(otherKey.cacheKey)));

First tries to see if it’s the same string, or if both are null, if not, ensures the filterName
is not null, then uses equals.

Cheers

Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


I see the same thing as Andrew on 2.3.3 and 2.3.4

I would be willing to swap the class file and give it a try if you want, Christian, though I just see your e-mail voting yourself -1 … thinking existing code was ok.

At first, this seemed almost random. It was happening really with close to 0 load on the geoserver. It was just me on a development box zooming and panning my map. Of course, in my case, my Google maps-based client did make multiple simultaneous requests to load tiles from the gwc gmaps service, but just a few at a time. It seemed that, if I hit it just right (or wrong, I guess) in making new requests, it would lock up. Sometimes this was even after everything had been idle with no requests at all for many minutes or more. I could initiate new requests after none had happened for a while and, bam, 100% cpu use and no response from gwc instantly. This is what made this seem like a race condition … I would have to hit stuff just at the right time it seemed to make it happen. Sometimes I could get this to happen within a few minutes … sometimes it would take me an hour or so to make this happen.

So, I wrote some crude bash scripts that would make about 30 simultaneous requests continuously via wget. With this, I can often (but not always) get the 100% cpu lockup from LRUAuthenticationCacheImpl within 2 minutes. This sort of suggests to me that this problem is occurring whenever there is an incoming request happening at the exact instant with the TimerTask firing. I haven’t dug through the code enough to prove this, but that is what it feels like.

While Andrew has seen this with header proxy authentication, I have seen it with both basic and authkey plugin authentication, though it seems to be more reproducible with the authkey.

Andrea speaks of having the client return geoserver_security_cache_key. Can you provide a bit more info? What would you expect the client to provide back to geoserver? Especially in the authkey case where it’s one request and done.

Thanks,

Mike Grogan

···

On Wed, Aug 21, 2013 at 8:14 AM, Christian Mueller <christian.mueller@anonymised.com> wrote:

Hi all

The code was written by me. I is hard to find the problem but looking at the code in

org.geoserver.security.auth.AuthenticationCacheKey

I see the equals(…) implementation comparing strings with == instead of using String.equals(). Maybe the cache is filled up.

Proposal:

I can send you an AuthenticationCacheKey.class file. You have to replace this file in the main-.jar file contained in your WEB-INF/lib folder. (The jar is a zip file, use your preferred zip tool).

After replacement, restart Geoserver and look at the results.

Let me know

Christian

On Wed, Aug 21, 2013 at 9:42 AM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Mon, Aug 19, 2013 at 10:33 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

This looks like it may be getting stuck in a TimerTask in LRUAuthenticationCacheImpl.java that is searching for expired authentication tokens. The stack from the process utilizing 100% CPU is consistently:

“Timer-0” daemon prio=10 tid=0x00007f57495e1800 nid=0x6d0e runnable [0x00007f5743546000]
java.lang.Thread.State: RUNNABLE
at org.geoserver.security.auth.LRUAuthenticationCacheImpl$1.run(LRUAuthenticationCacheImpl.java:66)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

In this case, my process id was 27918, which matches the hex 6d0e above. I see this consistently after tomcat restart, though, obviously, the PID/nid changes. Each time, though, the trace for PID/nid returns the TimerTask above.

Looking at the source, this is a TimerTask for adding authentication tokens to be removed.

Yep, I noticed. Not sure why this is happening, haven’t wrote that code myself, but that thing
just loops over authentication cache keys, my only guess is that they might be accumulating faster
than we can expire them, which might be compatible with a tilecache like workload when many
parallel requests are coming in, under the assumption that each one for some reason creates
a new entry.
By looking at the code, it seems that might happen if the client does not send back the
_geoserver_security_cache_key attribute…

Do you think this issue goes away once I remove all the authentication to the data layers?

This one seems more related to the client and maybe the fact you’re trying to authenticate from it?
I cannot be more precise without spending time in a serious investigation, the code in question has been
written by Justin and Christian (cc’ed) so I’m not familiar with it.

Cheers

Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


Hi Mike

geoserver_security_cache_key is for internal use only.

About your bash scripts. Can you reproduce the race condition using basic auth (standard) and the Geoserver demo data. If you can, it would be nice if you can send me these scripts.

Cheers
Christian

···

On Wed, Aug 21, 2013 at 4:04 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

I see the same thing as Andrew on 2.3.3 and 2.3.4

I would be willing to swap the class file and give it a try if you want, Christian, though I just see your e-mail voting yourself -1 … thinking existing code was ok.

At first, this seemed almost random. It was happening really with close to 0 load on the geoserver. It was just me on a development box zooming and panning my map. Of course, in my case, my Google maps-based client did make multiple simultaneous requests to load tiles from the gwc gmaps service, but just a few at a time. It seemed that, if I hit it just right (or wrong, I guess) in making new requests, it would lock up. Sometimes this was even after everything had been idle with no requests at all for many minutes or more. I could initiate new requests after none had happened for a while and, bam, 100% cpu use and no response from gwc instantly. This is what made this seem like a race condition … I would have to hit stuff just at the right time it seemed to make it happen. Sometimes I could get this to happen within a few minutes … sometimes it would take me an hour or so to make this happen.

So, I wrote some crude bash scripts that would make about 30 simultaneous requests continuously via wget. With this, I can often (but not always) get the 100% cpu lockup from LRUAuthenticationCacheImpl within 2 minutes. This sort of suggests to me that this problem is occurring whenever there is an incoming request happening at the exact instant with the TimerTask firing. I haven’t dug through the code enough to prove this, but that is what it feels like.

While Andrew has seen this with header proxy authentication, I have seen it with both basic and authkey plugin authentication, though it seems to be more reproducible with the authkey.

Andrea speaks of having the client return geoserver_security_cache_key. Can you provide a bit more info? What would you expect the client to provide back to geoserver? Especially in the authkey case where it’s one request and done.

Thanks,

Mike Grogan

DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

On Wed, Aug 21, 2013 at 8:14 AM, Christian Mueller <christian.mueller@anonymised.com> wrote:

Hi all

The code was written by me. I is hard to find the problem but looking at the code in

org.geoserver.security.auth.AuthenticationCacheKey

I see the equals(…) implementation comparing strings with == instead of using String.equals(). Maybe the cache is filled up.

Proposal:

I can send you an AuthenticationCacheKey.class file. You have to replace this file in the main-.jar file contained in your WEB-INF/lib folder. (The jar is a zip file, use your preferred zip tool).

After replacement, restart Geoserver and look at the results.

Let me know

Christian

On Wed, Aug 21, 2013 at 9:42 AM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Mon, Aug 19, 2013 at 10:33 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

This looks like it may be getting stuck in a TimerTask in LRUAuthenticationCacheImpl.java that is searching for expired authentication tokens. The stack from the process utilizing 100% CPU is consistently:

“Timer-0” daemon prio=10 tid=0x00007f57495e1800 nid=0x6d0e runnable [0x00007f5743546000]
java.lang.Thread.State: RUNNABLE
at org.geoserver.security.auth.LRUAuthenticationCacheImpl$1.run(LRUAuthenticationCacheImpl.java:66)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

In this case, my process id was 27918, which matches the hex 6d0e above. I see this consistently after tomcat restart, though, obviously, the PID/nid changes. Each time, though, the trace for PID/nid returns the TimerTask above.

Looking at the source, this is a TimerTask for adding authentication tokens to be removed.

Yep, I noticed. Not sure why this is happening, haven’t wrote that code myself, but that thing
just loops over authentication cache keys, my only guess is that they might be accumulating faster
than we can expire them, which might be compatible with a tilecache like workload when many
parallel requests are coming in, under the assumption that each one for some reason creates
a new entry.
By looking at the code, it seems that might happen if the client does not send back the
_geoserver_security_cache_key attribute…

Do you think this issue goes away once I remove all the authentication to the data layers?

This one seems more related to the client and maybe the fact you’re trying to authenticate from it?
I cannot be more precise without spending time in a serious investigation, the code in question has been
written by Justin and Christian (cc’ed) so I’m not familiar with it.

Cheers

Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


Hi all,

Not sure whether this is helpful, or indicates anything significant, but I’ve attached the truncated output of kill -3 which I ran against my “hung” GeoServer JVM this morning. It’s only 410 lines long - the ellipses indicate essentially repeated traces.

A summary of the first few sections is as follows:

Full thread dump Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode):

"http-bio-8180-exec-31" daemon prio=10 tid=0x00002aaac1c66000 nid=0xac0 waiting on condition [0x00000000520a0000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000783995318> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:86)
at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:32)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:945)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:662)

... 5-10 similar to the above
``
"http-bio-8180-exec-23" daemon prio=10 tid=0x0000000059641800 nid=0xa22 waiting on condition [0x0000000051998000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000783a48450> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:941)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1261)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594)
at org.geoserver.security.auth.LRUAuthenticationCacheImpl.get(LRUAuthenticationCacheImpl.java:148)
at org.geoserver.security.filter.GeoServerSecurityFilter.authenticateFromCache(GeoServerSecurityFilter.java:92)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:47)
at org.geoserver.security.filter.GeoServerCompositeFilter.doFilter(GeoServerCompositeFilter.java:91)
at org.geoserver.security.filter.GeoServerBasicAuthenticationFilter.doFilter(GeoServerBasicAuthenticationFilter.java:82)
...

... several hundred similar to the above

Regards,
Andrew

kill-3.txt (30.5 KB)

···

On 22/08/2013 00:38, Christian Mueller wrote:

Hi Mike

geoserver_security_cache_key is for internal use only.

About your bash scripts. Can you reproduce the race condition using basic auth (standard) and the Geoserver demo data. If you can, it would be nice if you can send me these scripts.

Cheers
Christian

On Wed, Aug 21, 2013 at 4:04 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

I see the same thing as Andrew on 2.3.3 and 2.3.4

I would be willing to swap the class file and give it a try if you want, Christian, though I just see your e-mail voting yourself -1 … thinking existing code was ok.

At first, this seemed almost random. It was happening really with close to 0 load on the geoserver. It was just me on a development box zooming and panning my map. Of course, in my case, my Google maps-based client did make multiple simultaneous requests to load tiles from the gwc gmaps service, but just a few at a time. It seemed that, if I hit it just right (or wrong, I guess) in making new requests, it would lock up. Sometimes this was even after everything had been idle with no requests at all for many minutes or more. I could initiate new requests after none had happened for a while and, bam, 100% cpu use and no response from gwc instantly. This is what made this seem like a race condition … I would have to hit stuff just at the right time it seemed to make it happen. Sometimes I could get this to happen within a few minutes … sometimes it would take me an hour or so to make this happen.

So, I wrote some crude bash scripts that would make about 30 simultaneous requests continuously via wget. With this, I can often (but not always) get the 100% cpu lockup from LRUAuthenticationCacheImpl within 2 minutes. This sort of suggests to me that this problem is occurring whenever there is an incoming request happening at the exact instant with the TimerTask firing. I haven’t dug through the code enough to prove this, but that is what it feels like.

While Andrew has seen this with header proxy authentication, I have seen it with both basic and authkey plugin authentication, though it seems to be more reproducible with the authkey.

Andrea speaks of having the client return geoserver_security_cache_key. Can you provide a bit more info? What would you expect the client to provide back to geoserver? Especially in the authkey case where it’s one request and done.

Thanks,

Mike Grogan

On Wed, Aug 21, 2013 at 8:14 AM, Christian Mueller <christian.mueller@anonymised.com> wrote:

Hi all

The code was written by me. I is hard to find the problem but looking at the code in

org.geoserver.security.auth.AuthenticationCacheKey

I see the equals(…) implementation comparing strings with == instead of using String.equals(). Maybe the cache is filled up.

Proposal:

I can send you an AuthenticationCacheKey.class file. You have to replace this file in the main-.jar file contained in your WEB-INF/lib folder. (The jar is a zip file, use your preferred zip tool).

After replacement, restart Geoserver and look at the results.

Let me know

Christian

On Wed, Aug 21, 2013 at 9:42 AM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Mon, Aug 19, 2013 at 10:33 PM, Mike Grogan <d.michael.grogan@anonymised.com> wrote:

This looks like it may be getting stuck in a TimerTask in LRUAuthenticationCacheImpl.java that is searching for expired authentication tokens. The stack from the process utilizing 100% CPU is consistently:

“Timer-0” daemon prio=10 tid=0x00007f57495e1800 nid=0x6d0e runnable [0x00007f5743546000]
java.lang.Thread.State: RUNNABLE
at org.geoserver.security.auth.LRUAuthenticationCacheImpl$1.run(LRUAuthenticationCacheImpl.java:66)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)

In this case, my process id was 27918, which matches the hex 6d0e above. I see this consistently after tomcat restart, though, obviously, the PID/nid changes. Each time, though, the trace for PID/nid returns the TimerTask above.

Looking at the source, this is a TimerTask for adding authentication tokens to be removed.

Yep, I noticed. Not sure why this is happening, haven’t wrote that code myself, but that thing
just loops over authentication cache keys, my only guess is that they might be accumulating faster
than we can expire them, which might be compatible with a tilecache like workload when many
parallel requests are coming in, under the assumption that each one for some reason creates
a new entry.
By looking at the code, it seems that might happen if the client does not send back the
_geoserver_security_cache_key attribute…

Do you think this issue goes away once I remove all the authentication to the data layers?

This one seems more related to the client and maybe the fact you’re trying to authenticate from it?
I cannot be more precise without spending time in a serious investigation, the code in question has been
written by Justin and Christian (cc’ed) so I’m not familiar with it.

Cheers

Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

On Wed, Aug 21, 2013 at 4:38 PM, Christian Mueller <
christian.mueller@anonymised.com> wrote:

Hi Mike

geoserver_security_cache_key is for internal use only.

About your bash scripts. Can you reproduce the race condition using basic
auth (standard) and the Geoserver demo data. If you can, it would be nice
if you can send me these scripts.

Christian,
I had another quick look at the code during a build, and I'm starting to
question the need of that cleanup thread,
I have the impression in the end it's only causing trouble.

Rationale:
* the map is a LRU one, with 1000 entries, so it purges unused items by
itself
* if an item that's expired is looked up, it gets removed anyways by the
lookup code

So, why should we do a "stop the world" operation like the one the timed
cleaner is doing?
(while it's running, no one else can authenticate)

Cheers
Andrea

--

Our support, Your Success! Visit http://opensdi.geo-solutions.it for more
information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

-------------------------------------------------------

On Thu, Aug 22, 2013 at 11:08 AM, Andrea Aime
<andrea.aime@anonymised.com>wrote:

Christian,
I had another quick look at the code during a build, and I'm starting to
question the need of that cleanup thread,
I have the impression in the end it's only causing trouble.

Rationale:
* the map is a LRU one, with 1000 entries, so it purges unused items by
itself
* if an item that's expired is looked up, it gets removed anyways by the
lookup code

So, why should we do a "stop the world" operation like the one the timed
cleaner is doing?
(while it's running, no one else can authenticate)

Ah, another bit. Since we have Guava stuck in the classpath, we could
probably make good use of it
and replace the LRUMap with a concurrent LRU map as built by the
CacheBuilder, have a look:

http://docs.guava-libraries.googlecode.com/git-history/v10.0.1/javadoc/com/google/common/cache/CacheBuilder.html

With this data structure I believe we could also get rid of the
ReadWriteLock and increase concurrency.

Cheers
Andrea

--

Our support, Your Success! Visit http://opensdi.geo-solutions.it for more
information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

-------------------------------------------------------

Hi Andrea

The problem is that most cache implementations have global expire time settings. We need global expire times as default but it must be possible to assign expire times to individual entries. As an example, digest authentication has its own expire times and it does not make sense if the cache has expire times larger than the configured expire times of the digest authentication filter. A cache entry created by digest authentication has expire times limited by the digest authentication configuration.

The cleaner thread is here to guarantee that an unused authentication token is not cached for a long time period, this is a security risk.

I am thinking about using Collections.synchronizedMap(…) and to kick out all the locking stuff.

What do you think ?

Cheers

···

On Thu, Aug 22, 2013 at 11:13 AM, Andrea Aime <andrea.aime@anonymised.com> wrote:

On Thu, Aug 22, 2013 at 11:08 AM, Andrea Aime <andrea.aime@anonymised.com> wrote:

DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

Christian,

Ah, another bit. Since we have Guava stuck in the classpath, we could probably make good use of it
and replace the LRUMap with a concurrent LRU map as built by the CacheBuilder, have a look:

http://docs.guava-libraries.googlecode.com/git-history/v10.0.1/javadoc/com/google/common/cache/CacheBuilder.html

With this data structure I believe we could also get rid of the ReadWriteLock and increase concurrency.

Cheers

Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


I had another quick look at the code during a build, and I’m starting to question the need of that cleanup thread,
I have the impression in the end it’s only causing trouble.

Rationale:

  • the map is a LRU one, with 1000 entries, so it purges unused items by itself
  • if an item that’s expired is looked up, it gets removed anyways by the lookup code

So, why should we do a “stop the world” operation like the one the timed cleaner is doing?
(while it’s running, no one else can authenticate)

On Thu, Aug 22, 2013 at 11:52 AM, Christian Mueller <
christian.mueller@anonymised.com> wrote:

Hi Andrea

The problem is that most cache implementations have global expire time
settings. We need global expire times as default but it must be possible to
assign expire times to individual entries. As an example, digest
authentication has its own expire times and it does not make sense if the
cache has expire times larger than the configured expire times of the
digest authentication filter. A cache entry created by digest
authentication has expire times limited by the digest authentication
configuration.

The cleaner thread is here to guarantee that an unused authentication
token is not cached for a long time period, this is a security risk.

I am thinking about using Collections.synchronizedMap(..) and to kick out
all the locking stuff.

Ugh, it would limit scalability quite a bit. The Guava CacheBuilder should
be a much more performant solution.
Once built you can treat it as a regular map, it handles LRU and
concurrency internally.

Cheers
Andrea

--

Our support, Your Success! Visit http://opensdi.geo-solutions.it for more
information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

-------------------------------------------------------

Hello,

We had/have on a production machine the same issue.
Following the discuss we looked into the implementation of LRUAuthenticationCacheImpl

The origin of the problem is not clear but the result is only one loop over the keys.

LRUAuthenticationCacheImpl. Some addional logging is also added but could be removed

The changed snipped follows
----

    TimerTask removeExpiredTask = new TimerTask() {
         @Override
         public void run() {
             LOGGER.fine("Start searching for expired authentication tokens");
             writeLock.lock();
             try {
                  int removed = 0;
                  long currentTime=System.currentTimeMillis();

                  if (cache.size() > 50) LOGGER.info("AuthKey Cache entries: " + cache.size());

                  Iterator<AuthenticationCacheEntry> iter = cache.values().iterator();

                  while (iter.hasNext())
                  {
                     if (iter.next().hasExpired(currentTime))
                     {
                        iter.remove();
                        removed++;
                     }
                  }
                 LOGGER.fine("Number of expired authentication tokens found: " + removed);
             } finally {
                 writeLock.unlock();
             }
             LOGGER.fine("End searching for expired authentication tokens");
         }
     };

Op 22/08/2013 11:58, Andrea Aime schreef:

On Thu, Aug 22, 2013 at 11:52 AM, Christian Mueller <christian.mueller@anonymised.com <mailto:christian.mueller@anonymised.com>> wrote:

    Hi Andrea

    The problem is that most cache implementations have global expire time
    settings. We need global expire times as default but it must be possible
    to assign expire times to individual entries. As an example, digest
    authentication has its own expire times and it does not make sense if the
    cache has expire times larger than the configured expire times of the
    digest authentication filter. A cache entry created by digest
    authentication has expire times limited by the digest authentication
    configuration.

    The cleaner thread is here to guarantee that an unused authentication
    token is not cached for a long time period, this is a security risk.

    I am thinking about using Collections.synchronizedMap(..) and to kick out
    all the locking stuff.

Ugh, it would limit scalability quite a bit. The Guava CacheBuilder should be a much more performant solution.
Once built you can treat it as a regular map, it handles LRU and concurrency internally.

Cheers
Andrea

--

Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

-------------------------------------------------------

------------------------------------------------------------------------------
Introducing Performance Central, a new site from SourceForge and
AppDynamics. Performance Central is your source for news, insights,
analysis and resources for efficient Application Performance Management.
Visit us today!
http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk

_______________________________________________
Geoserver-users mailing list
Geoserver-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geoserver-users

--
Eric Smets eric.smets@anonymised.com
FKS bvba - Formal and Knowledge Systems http://www.fks.be/
Schampbergstraat 32 Tel: ++32-(0)11-21 49 11
B-3511 Hasselt Fax: ++32-(0)11-22 04 19

Hi Eric

Only to be sure, the above code snippet solves the issue ?

If it does, I will publish a modified class file for the other users to test.

My current idea was to remove the time task completely, but your solution would be better and safer.

Please let me know

···

On Thu, Aug 22, 2013 at 4:09 PM, Eric Smets <eric.smets@…5451…> wrote:

Hello,

We had/have on a production machine the same issue.
Following the discuss we looked into the implementation of LRUAuthenticationCacheImpl

The origin of the problem is not clear but the result is only one loop over the keys.

LRUAuthenticationCacheImpl. Some addional logging is also added but could be removed

The changed snipped follows

TimerTask removeExpiredTask = new TimerTask() {
@Override
public void run() {
LOGGER.fine(“Start searching for expired authentication tokens”);
writeLock.lock();
try {
int removed = 0;
long currentTime=System.currentTimeMillis();

if (cache.size() > 50) LOGGER.info("AuthKey Cache entries: " + cache.size());

Iterator iter = cache.values().iterator();

while (iter.hasNext())
{
if (iter.next().hasExpired(currentTime))
{
iter.remove();
removed++;
}
}
LOGGER.fine("Number of expired authentication tokens found: " + removed);
} finally {
writeLock.unlock();
}
LOGGER.fine(“End searching for expired authentication tokens”);
}
};

Op 22/08/2013 11:58, Andrea Aime schreef:

On Thu, Aug 22, 2013 at 11:52 AM, Christian Mueller <christian.mueller@anonymised.com> wrote:

------------------------------------------------------------------------------
Introducing Performance Central, a new site from SourceForge and 
AppDynamics. Performance Central is your source for news, insights, 
analysis and resources for efficient Application Performance Management. 
Visit us today!
[http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk](http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk)
_______________________________________________
Geoserver-users mailing list
[Geoserver-users@lists.sourceforge.net](mailto:Geoserver-users@lists.sourceforge.net)
[https://lists.sourceforge.net/lists/listinfo/geoserver-users](https://lists.sourceforge.net/lists/listinfo/geoserver-users)

-- 
Eric Smets                                    [eric.smets@anonymised.com](mailto:eric.smets@...5451...)
FKS bvba - Formal and Knowledge Systems       [http://www.fks.be/](http://www.fks.be/)
Schampbergstraat 32                           Tel:  ++32-(0)11-21 49 11
B-3511 Hasselt                                Fax:  ++32-(0)11-22 04 19


Introducing Performance Central, a new site from SourceForge and
AppDynamics. Performance Central is your source for news, insights,
analysis and resources for efficient Application Performance Management.
Visit us today!
http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk


Geoserver-users mailing list
Geoserver-users@anonymised.comsts.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geoserver-users

DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

Hi Andrea

The problem is that most cache implementations have global expire time settings. We need global expire times as default but it must be possible to assign expire times to individual entries. As an example, digest authentication has its own expire times and it does not make sense if the cache has expire times larger than the configured expire times of the digest authentication filter. A cache entry created by digest authentication has expire times limited by the digest authentication configuration.

The cleaner thread is here to guarantee that an unused authentication token is not cached for a long time period, this is a security risk.

I am thinking about using Collections.synchronizedMap(…) and to kick out all the locking stuff.

Ugh, it would limit scalability quite a bit. The Guava CacheBuilder should be a much more performant solution.
Once built you can treat it as a regular map, it handles LRU and concurrency internally.

Cheers
Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it


Hi all

I attached LRUAuthenticationCacheImpl.class and LRUAuthenticationCacheImpl$1.class for testing. These classes includes the code snippet proposed by Eric. Please replace these classes in

WEB-INF/lib/main-.jar

Let me know about the results.

If the problem persists, I will publish a version without the timer task. This should resolve the issue. The disadvantage would be that authentication tokens are hold in memory longer than needed, the functionality would not change.

Cheers
Christian

LRUAuthenticationCacheImpl.class (6.24 KB)

LRUAuthenticationCacheImpl$1.class (2.36 KB)

···

On Thu, Aug 22, 2013 at 5:18 PM, Eric Smets <eric.smets@anonymised.com> wrote:

Op 22/08/2013 16:30, Christian Mueller schreef:

Hi Eric

Only to be sure, the above code snippet solves the issue ?

We have only tested this modification, this came up after an internal discussion.
We can not confirm that this solves the original issue, but this solution should do a better job.

We could use the testscript from Andrew to confirm the solution.

If it does, I will publish a modified class file for the other users to test.

My current idea was to remove the time task completely, but your solution would be better and safer.

Please let me know

-- 
Eric Smets                                    [eric.smets@anonymised.com](mailto:eric.smets@...5451...)
FKS bvba - Formal and Knowledge Systems       [http://www.fks.be/](http://www.fks.be/)
Schampbergstraat 32                           Tel:  ++32-(0)11-21 49 11
B-3511 Hasselt                                Fax:  ++32-(0)11-22 04 19

DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

On Thu, Aug 22, 2013 at 4:09 PM, Eric Smets <eric.smets@anonymised.com> wrote:

Hello,

We had/have on a production machine the same issue.
Following the discuss we looked into the implementation of LRUAuthenticationCacheImpl

The origin of the problem is not clear but the result is only one loop over the keys.

LRUAuthenticationCacheImpl. Some addional logging is also added but could be removed

The changed snipped follows

TimerTask removeExpiredTask = new TimerTask() {
@Override
public void run() {
LOGGER.fine(“Start searching for expired authentication tokens”);
writeLock.lock();
try {
int removed = 0;
long currentTime=System.currentTimeMillis();

if (cache.size() > 50) LOGGER.info("AuthKey Cache entries: " + cache.size());

Iterator iter = cache.values().iterator();

while (iter.hasNext())
{
if (iter.next().hasExpired(currentTime))
{
iter.remove();
removed++;
}
}
LOGGER.fine("Number of expired authentication tokens found: " + removed);
} finally {
writeLock.unlock();
}
LOGGER.fine(“End searching for expired authentication tokens”);
}
};

Op 22/08/2013 11:58, Andrea Aime schreef:

On Thu, Aug 22, 2013 at 11:52 AM, Christian Mueller <christian.mueller@anonymised.com445…> wrote:

------------------------------------------------------------------------------
Introducing Performance Central, a new site from SourceForge and 
AppDynamics. Performance Central is your source for news, insights, 
analysis and resources for efficient Application Performance Management. 
Visit us today!
[http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk](http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk)
_______________________________________________
Geoserver-users mailing list
[Geoserver-users@lists.sourceforge.net](mailto:Geoserver-users@lists.sourceforge.net)
[https://lists.sourceforge.net/lists/listinfo/geoserver-users](https://lists.sourceforge.net/lists/listinfo/geoserver-users)

-- 
Eric Smets                                    [eric.smets@anonymised.com](mailto:eric.smets@...5451...)
FKS bvba - Formal and Knowledge Systems       [http://www.fks.be/](http://www.fks.be/)
Schampbergstraat 32                           Tel:  ++32-(0)11-21 49 11
B-3511 Hasselt                                Fax:  ++32-(0)11-22 04 19


Introducing Performance Central, a new site from SourceForge and
AppDynamics. Performance Central is your source for news, insights,
analysis and resources for efficient Application Performance Management.
Visit us today!
http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk


Geoserver-users mailing list
Geoserver-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geoserver-users

DI Christian Mueller MSc (GIS), MSc (IT-Security)
OSS Open Source Solutions GmbH

Hi Andrea

The problem is that most cache implementations have global expire time settings. We need global expire times as default but it must be possible to assign expire times to individual entries. As an example, digest authentication has its own expire times and it does not make sense if the cache has expire times larger than the configured expire times of the digest authentication filter. A cache entry created by digest authentication has expire times limited by the digest authentication configuration.

The cleaner thread is here to guarantee that an unused authentication token is not cached for a long time period, this is a security risk.

I am thinking about using Collections.synchronizedMap(…) and to kick out all the locking stuff.

Ugh, it would limit scalability quite a bit. The Guava CacheBuilder should be a much more performant solution.
Once built you can treat it as a regular map, it handles LRU and concurrency internally.

Cheers
Andrea

==
Our support, Your Success! Visit http://opensdi.geo-solutions.it for more information.

Ing. Andrea Aime

@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it