Apparently we have a deadlock in org.geoserver.gwc.layer.CatalogConfiguration caused by improper protection of a read write lock.
Here is the situation. The first thread T1 is holding a write lock in CatalogConfiguration.initialize, trying to get a value though the loading guava cache, and failing to do so because another thread is already loading the same key:
Thread 21546: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=834 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(int) @bci=72, line=994 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int) @bci=24, line=1303 (Interpreted frame)
- com.google.common.util.concurrent.AbstractFuture$Sync.get() @bci=2, line=285 (Interpreted frame)
- com.google.common.util.concurrent.AbstractFuture.get() @bci=4, line=116 (Interpreted frame)
- com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(java.util.concurrent.Future) @bci=3, line=135 (Interpreted frame)
- com.google.common.cache.LocalCache$LoadingValueReference.waitForValue() @bci=4, line=3554 (Interpreted frame)
- com.google.common.cache.LocalCache$Segment.waitForLoadingValue(com.google.common.cache.LocalCache$ReferenceEntry, java.lang.Object, com.google.common.cache.LocalCache$ValueReference) @bci=43, line=2300 (Interpreted frame)
- com.google.common.cache.LocalCache$Segment.get(java.lang.Object, int, com.google.common.cache.CacheLoader) @bci=122, line=2189 (Compiled frame)
- com.google.common.cache.LocalCache.get(java.lang.Object, com.google.common.cache.CacheLoader) @bci=17, line=3934 (Compiled frame)
- com.google.common.cache.LocalCache.getOrLoad(java.lang.Object) @bci=6, line=3938 (Compiled frame)
- com.google.common.cache.LocalCache$LocalLoadingCache.get(java.lang.Object) @bci=5, line=4821 (Compiled frame)
- org.geoserver.gwc.layer.CatalogConfiguration.getTileLayerById(java.lang.String) @bci=12, line=277 (Compiled frame)
- org.geoserver.gwc.layer.CatalogConfiguration.initialize(org.geowebcache.grid.GridSetBroker) @bci=93, line=405 (Interpreted frame)
- org.geowebcache.layer.TileLayerDispatcher.initialize(org.geowebcache.config.Configuration) @bci=95, line=192 (Interpreted frame)
- org.geowebcache.layer.TileLayerDispatcher.initialize() @bci=41, line=167 (Interpreted frame)
- org.geowebcache.layer.TileLayerDispatcher.reInit() @bci=20, line=121 (Interpreted frame)
- org.geoserver.gwc.GWC.reload() @bci=16, line=576 (Interpreted frame)
- org.geoserver.gwc.GWC.tryReload() @bci=9, line=563 (Interpreted frame)
- org.geoserver.gwc.config.GWCLifeCycleHandler.onReload() @bci=0, line=26 (Interpreted frame)
Here is the second thread T2, that is making a un-protected access to the loading cache, trying to get a read lock to perform the loading, and failing to do so because T1 is holding the write lock:
Thread 22303: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=834 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(int) @bci=83, line=964 (Interpreted frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(int) @bci=10, line=1282 (Compiled frame)
- java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock() @bci=5, line=731 (Compiled frame)
- org.geoserver.gwc.layer.CatalogConfiguration$TileLayerLoader.load(java.lang.String) @bci=22, line=83 (Interpreted frame)
- org.geoserver.gwc.layer.CatalogConfiguration$TileLayerLoader.load(java.lang.Object) @bci=5, line=71 (Interpreted frame)
- com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(java.lang.Object, com.google.common.cache.CacheLoader) @bci=24, line=3524 (Interpreted frame)
- com.google.common.cache.LocalCache$Segment.loadSync(java.lang.Object, int, com.google.common.cache.LocalCache$LoadingValueReference, com.google.common.cache.CacheLoader) @bci=4, line=2317 (Interpreted frame)
- com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(java.lang.Object, int, com.google.common.cache.CacheLoader) @bci=379, line=2280 (Interpreted frame)
- com.google.common.cache.LocalCache$Segment.get(java.lang.Object, int, com.google.common.cache.CacheLoader) @bci=138, line=2195 (Compiled frame)
- com.google.common.cache.LocalCache.get(java.lang.Object, com.google.common.cache.CacheLoader) @bci=17, line=3934 (Compiled frame)
- com.google.common.cache.LocalCache.getOrLoad(java.lang.Object) @bci=6, line=3938 (Compiled frame)
- com.google.common.cache.LocalCache$LocalLoadingCache.get(java.lang.Object) @bci=5, line=4821 (Compiled frame)
- org.geoserver.gwc.layer.CatalogConfiguration.getTileLayerById(java.lang.String) @bci=12, line=277 (Compiled frame)
- org.geoserver.gwc.layer.CatalogConfiguration.getTileLayer(java.lang.String) @bci=87, line=305 (Compiled frame)
- org.geoserver.gwc.layer.CatalogConfiguration.getTileLayer(java.lang.String) @bci=2, line=65 (Compiled frame)
- org.geowebcache.layer.TileLayerDispatcher.layerExists(java.lang.String) @bci=31, line=81 (Interpreted frame)
- org.geoserver.gwc.GWC.dispatch(org.geoserver.wms.GetMapRequest, java.lang.StringBuilder) @bci=39, line=636 (Interpreted frame)
Sequence of events to get here:
- T1 gets the write lock, loads some keys from the cache, and gets kicked out of the CPU
- T2 is answering a request, manages to get down in the cache to the point of loading the layer it’s trying to get the GeoServerTileLayer needed, but fails to get a read lock to build it
- T1 gets unlocked, finds that the key its looking for is already being loaded by T2, and the deadlock happens.
Long story short, it seems that each and every access to the cache should be protected by a preventive read lock (getTileLayerById) to make sure no one is already holding a write lock at that point.
As a further improvement it’s probably best to have all lock acquisition attempts to use a tryLock (at least all the read locks) with a reasonable timeout (e.g., 60 seconds, maybe with a system variable to change it), in order to make impossible this type of deadlock to happen in the first place.
|