[Geoserver-devel] Sharing locks between GWC and data dir resources considered harmful

Hi all,
I’ve been recently tasked to find out why one of our GeoServer was getting in a deadlock preventing
further requests.

By the looks of it, all threads were dead on the GWC metatile locks. The common setup for
those is to use the MemoryLockProvider, a striped lock having 1024 ReentrantLock instances,
which are chosen for a given metatile by hashing its path.
This class was then copied over to GeoServer, by Boundless, while working on the Resource
subsystem. I did voice concerns about doing that, the locks was designed to work against
meta tiles, not files, but guess I could not pinpoint exactly why that could cause issues.

In normal conditions the locking works pretty well, and it has been around for years, so why was it causing
trouble? Turns out a colleague tried out a configuration we never use, which makes GWC use
the same locks set in the global configuration, e.g.:

image.png

While the “global locking” can be found in the global panel as “file locking”:

image.png

At first I told myself, well, ok, sharing locks for different purposes is a bad idea for sure: deadlock prevention 101,
all shared resources need to be acquired in a single shot, not one by one, so if a thread grabs a lock for tile generation,
and then while running the internal GetMap, for whatever reason, due to whatever plugin, it grabs a second lock
on a data directory Resource, we’re in trouble, a deadlock can happen.

At the same time I thought, but we don’t use Resource.lock() all that much… I was wrong.
Every time the input or output stream of a resource are used, a lock is acquired around their
usage, transparently, by the FileSytemResourceStore.FileSystemResource, by the in() and out() methods.
Ouch, that makes the deadlock so much more likely to happen.

My first suggestion would be, don’t allow admins to shoot themselves in the foot, and remove, from the
caching defaults, the option to share the locks between GWC and Resource. This should be done ASAP in my opinion.
Objections to this point?

The second thing is, using a striped lock for file system access like FileSystemResourceStore ends up doing,
is a really bad idea… if two threads manage to try and use 2 files that end up in the same MemoryLockProvider
buckets by chance of hashing, and they grab them in opposite order of acquisition, they will deadlock each other
(T1 grabs the lock on bucket A, T2 on bucket B, T1 goes to grab its second resource, which happens to also hit
bucket B, T2 does the same with bucked A, and the deadlock is done).

The chance can be lessened by not using striped locks, but creating unique reentrant locks for each string
(probably backed by a weak hash map). Not entirely trivial to do, the reentrant lock creation needs to be well guarded.

Two threads will still manage to deadlock each other, but they have to use the exact two same files, in opposite
order of acquisition. Which is at least less likely to happen.

But, I’m wondering if the Resource should automatically grab locks when using in or out at all,
instead of allowing the caller to explicitly manage it, if and when needed.
In the default GeoServer configuration, there is not even a need for them: the configuration global lock covers it.
That lock prevents two admin requests to write at the same time (the global lock is more
than just safety for IO, it’s our poor man serializable transaction mechanism, it’s important to prevent inconsistent
changes to related catalog resources that often happen while doing admin tasks).
Also checking what JDBCResourceStore does, it does not acquire locks around in() or out().

Opinions?

Cheers

Andrea

···

== GeoServer Professional Services from the experts! Visit http://goo.gl/it488V for more information. == Ing. Andrea Aime @geowolf Technical Lead GeoSolutions S.A.S. Via di Montramito 3/A 55054 Massarosa (LU) phone: +39 0584 962313 fax: +39 0584 1660272 mob: +39 339 8844549 http://www.geo-solutions.it http://twitter.com/geosolutions_it ------------------------------------------------------- Con riferimento alla normativa sul trattamento dei dati personali (Reg. UE 2016/679 - Regolamento generale sulla protezione dei dati “GDPR”), si precisa che ogni circostanza inerente alla presente email (il suo contenuto, gli eventuali allegati, etc.) è un dato la cui conoscenza è riservata al/i solo/i destinatario/i indicati dallo scrivente. Se il messaggio Le è giunto per errore, è tenuta/o a cancellarlo, ogni altra operazione è illecita. Le sarei comunque grato se potesse darmene notizia. This email is intended only for the person or entity to which it is addressed and may contain information that is privileged, confidential or otherwise protected from disclosure. We remind that - as provided by European Regulation 2016/679 “GDPR” - copying, dissemination or use of this e-mail or the information herein by anyone other than the intended recipient is prohibited. If you have received this email by mistake, please notify us immediately by telephone or e-mail.

Hi Andrea,

totally accurate analysis.

Also, meta-tile locks serve a very different purpose than config locks IIRC.
Meta-tile locks are most useful at seed-on-demand time, where you get concurrent tile requests from the browser/client that fall into the same meta-tile and there’s a cache hit, so that all the tiles that comprise the meta-tile are generated once.
May the not exist, you get a perf degradation, but not an inconsistent server state. Which means that in a single instance deployment scenario an in-memory locking strategy is all you need, and stripped locks work just fine.
That shared locking mechanism might have been added (guessing) for the clustered deployment scenario where the load balancer sends tile requests that fall on the same metatile to different nodes. Nonetheless, some sort of meta-tile key based distributed lock is all you need, and shall not be shared with the global config lock AT ALL.

(attachments)

image.png
image.png

···

Gabriel Roldán