Hello everyone,
As I got two fixes nearly committed in the last two weeks, I'll throw this one in.
A few years ago I did some work to improve our GeoServer uptime. As part of that, I debugged all kinds of crash dumps and user requests. Some of them are easily fixable GeoServer problems, and I'd like to provide a bunch of PRs to fix them.
This will probably be a multi-month process for me; I'll work on it whenever I have some time left. I don't want to overload the maintainers, so it's no problem if the PRs sit in the queue for months. None of this is urgent.
I’ll throw in a personal CLA just to be sure, then for each problem:
- Validate that the problem still exists. These findings are from GeoServer 2.22.
- Check if an issue exists or create one
- Write the PR & unit tests
- I don't know if I can help with backports and babysitting the GitHub CI
This is the list of problems:
GS1: Raster plugin error messages. If the raster plugin (netcdf?) can't generate a raster, it checks for common problems. Sometimes it picks the wrong one and gives a misleading error message. See Issue with Emodnetbio__aca_spp_19582016_L1 data query · Issue #53 · EMODnet/emodnet.wcs · GitHub.
GS3: GeoWebCache leaks connections and has an effective maximum of 2.
GeoWebCache uses the Apache HttpComponents HTTP client. In WMSHttpHelper it calls HttpClientBuilder and calls setMaxConnTotal, but forgets to call setMaxConnPerRoute, even though it only uses one route per pool. This de facto limits the maximum number of connections to the per-route default of 2. Additionally, if a request is not successful (e.g. a 404), the connection is never closed; it keeps reserving a slot in the connection pool until the GC happens to clean it up. As a result, any amount of error traffic causes a deadlock in GeoWebCache and GeoServer.
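A minimal sketch of what the fix looks like, assuming the standard Apache HttpClient 4.x builder API (the exact wiring in WMSHttpHelper will differ):

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;

public class PooledClientSketch {

    public static void main(String[] args) throws IOException {
        try (CloseableHttpClient client = HttpClientBuilder.create()
                .setMaxConnTotal(32)
                // Without this line, each route stays capped at the default
                // of 2 connections, no matter how large the total pool is.
                .setMaxConnPerRoute(32)
                .build()) {

            // Use try-with-resources so the response releases its pool slot
            // even on error statuses such as 404.
            try (CloseableHttpResponse response =
                    client.execute(new HttpGet("http://example.com/wms"))) {
                EntityUtils.consume(response.getEntity()); // drain so the connection can be reused
            }
        }
    }
}
```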
GS4: GeoServer has a thread-local leak
The class PostGISDialect has two thread-locals, wkbreader and twkbreader. They make every (Tomcat thread, PostGIS datasource) combination hold on to the last seen WKB, even if it is never used again. As these objects can be up to a few hundred KB each, this eats a few GB of memory on each server. We mitigated this by aggressively tuning down the number of Tomcat threads: a thread not doing any work for 5 seconds gets torn down.
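A condensed illustration of the leak pattern (my own sketch, not the GeoTools source); the fix direction is to give the code a way to clear the thread-local so parked threads don't pin the last buffer:

```java
// Sketch of the pattern: a per-instance ThreadLocal keeps the last value
// (and whatever it still references) alive for every thread that ever
// touched this dialect instance, until the thread itself dies.
public class DialectSketch {

    private final ThreadLocal<byte[]> lastGeometry = new ThreadLocal<>();

    public void decode(byte[] wkb) {
        lastGeometry.set(wkb); // cached for reuse, but never released
        // ... parse the geometry ...
    }

    // Fix sketch: let callers release the slot once a request is done,
    // e.g. from a finally block or a request-scoped cleanup hook.
    public void releaseThreadState() {
        lastGeometry.remove();
    }
}
```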
GS5: Invisible errors for the user when using URL/Chart. When a GeoServer style uses a URL, e.g. for images, and fetching the URL gives an error, the error message is logged at a level below INFO and is therefore invisible even in the logs. This is especially painful for chart URLs with multiple intricate flags, where the slightest error makes the chart disappear without any indication why.
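GeoServer logs through java.util.logging, so the kind of change I have in mind is roughly this (logger name and fetch call are placeholders, not the actual GeoServer code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ExternalGraphicSketch {

    private static final Logger LOGGER = Logger.getLogger("org.geoserver.example");

    InputStream fetch(URL url) {
        try {
            return url.openStream();
        } catch (IOException e) {
            // Was effectively logged at Level.FINE -- invisible at the default
            // INFO level, so the chart just silently disappears. Log it where
            // an operator can actually see it:
            LOGGER.log(Level.WARNING, "Failed to fetch style resource " + url, e);
            return null;
        }
    }
}
```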
GS6: GeoServer leaks thread pools. On a server running for a few days, the number of live threads rises without bound. It seems we have quite a few thread pools in there, doing nothing, and the GC cannot clean them up as long as the threads are parked, even though nothing references the pools anymore. Apart from using 2 MB or so per thread, this is harmless. I don't know where they are coming from, so I propose to adapt GeoTools & GeoServer to create names for their thread pools, then give everything a name and find out who is making them.
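Something along these lines (a plain java.util.concurrent sketch; the class name is my invention, not an existing GeoTools/GeoServer API):

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Gives every pool a recognizable thread name so stray pools show up
// clearly in thread dumps.
public class NamedThreadFactory implements ThreadFactory {

    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger();

    public NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
        t.setDaemon(true); // a forgotten pool no longer blocks JVM shutdown
        return t;
    }

    // Usage: Executors.newFixedThreadPool(4, new NamedThreadFactory("gt-renderer"));
}
```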
GS7: Crash after catching the wrong exception. In WMTSStoreNewPage, RuntimeErrorException is the wrong class; it should be RuntimeException. As a result, a crash in the getCatalog().remove() calls on lines 97 & 98 is not handled correctly, so the real error message disappears.
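Condensed to its essence (my sketch, not the verbatim GeoServer source):

```java
import javax.management.RuntimeErrorException;

public class CatchSketch {

    void removeStores(Runnable removeFromCatalog) {
        try {
            removeFromCatalog.run(); // stands in for the getCatalog().remove() calls
        } catch (RuntimeErrorException e) {
            // Wrong type: this is the JMX wrapper for Errors, so a plain
            // RuntimeException sails straight past this handler and the
            // original error message is lost.
        }
        // Fix: catch (RuntimeException e) { ... } instead.
    }
}
```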
GS8: web.xml contains an invalid CORS config. Tomcat can't accept a * in the list of allowed headers; it checks for a literal header named "*" instead of accepting everything. We worked around it by adding .disabled to the relevant parameter in our Ansible script. This one might already be fixed in recent GeoServers.
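If it is still present, the fix would be along these lines in web.xml (the header list here is an example, not the authoritative set):

```xml
<!-- Tomcat's CorsFilter matches header names literally, so "*" here is
     treated as a header called "*". Spell the allowed headers out instead: -->
<init-param>
  <param-name>cors.allowed.headers</param-name>
  <param-value>Origin,Accept,Content-Type,X-Requested-With</param-value>
</init-param>
```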
Not in this list, but also regularly recurring: a memory leak in XML filter objects, and gwc refusing to restart if an underlying layer has disappeared.