GeoServer 2.28 Docker container running out of memory

We recently upgraded from GeoServer 2.18.1 running on AWS EC2 to the GeoServer 2.28 Docker container running in ECS.

Our application requests map tiles as PNG from GeoServer, which then uses WFS to fetch the underlying data. We have a tile-layer cache using a file blobstore, which seems to be working fine. Our application runs at high scale, with tens of millions of map points, serving 2,000-5,000 requests per second.

The only EXTRA_JAVA_OPTS settings we are passing are `-Xms8g -Xmx16g`.

The issue is that GeoServer's Java heap continually increases until the container crashes (maxing out at the 16 GB max heap) and is replaced. Even clicking "Free Memory" does nothing to lower GeoServer's memory usage.

Using the heap dump option returns the following:

```
1: 228040028 7297280896 java.util.HashMap$Node (java.base@17.0.16)
2: 227942507 5470620168 javax.xml.namespace.QName (java.xml@17.0.16)
3: 886185 2401492560 [Ljava.util.HashMap$Node; (java.base@17.0.16)
4: 1166812 56006976 java.util.HashMap (java.base@17.0.16)
5: 516265 51450296 [B (java.base@17.0.16)
6: 653643 38105104 [Ljava.lang.Object; (java.base@17.0.16)
7: 290612 24476904 [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@17.0.16)
```

indicating that a HashMap is holding on to parsed XML QName objects that are never cleared. Note that GeoServer 2.18.1 had no issues with the load or memory, and in fact worked on a 12 GB max heap without issue.
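One way to confirm that these objects are strongly reachable (a genuine leak) rather than just uncollected garbage is to sample heap usage after a forced GC. A minimal sketch, not GeoServer code, using the standard `java.lang.management` API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Sketch: if used heap keeps climbing across samples even after a forced GC,
// the retained objects are strongly reachable, i.e. a real leak.
public class HeapWatch {
    public static long usedAfterGc() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        mem.gc(); // same effect as the admin UI "Free Memory" button (System.gc())
        return mem.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            System.out.printf("used after GC: %d MB%n", usedAfterGc() / (1024 * 1024));
            Thread.sleep(500);
        }
    }
}
```

Running this periodically under load shows whether "Free Memory" genuinely cannot reclaim the QName objects.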

The only thing being logged is "WARN [xsd.impl] - cvc-elt.1.a: Cannot find the declaration of element 'Filter'.", which we are not sure is related.

Looking for input on how to further debug and/or resolve this issue. Thanks!

As a stop-gap, we increased the maximum heap to 40 GB; this delayed but did not prevent the out-of-memory crash (the container lasted 3 hours rather than 1).

Shortly before the crash, the top offenders in the heap dump log are the same:

```
1: 404353293 16174131720 java.util.HashMap$Node (java.base@17.0.16)
2: 404258402 16170336080 javax.xml.namespace.QName (java.xml@17.0.16)
3: 1551610 8489845056 [Ljava.util.HashMap$Node; (java.base@17.0.16)
4: 2053978 131454592 java.util.HashMap (java.base@17.0.16)
5: 1085585 107218152 [Ljava.lang.Object; (java.base@17.0.16)
```

Welcome @jgronfur

I don’t know the answer to your problem, nor have I come across others with similar symptoms. GeoServer is open source, so here’s how I would go about tracking down your problem:

Has the bug previously been logged in Jira, the issue tracker? If not, please log it here: Jira

There are obviously a lot of changes in the five years from 2.18.1 to 2.28! Do a binary search (focusing on major versions) to narrow down when the problem started: 2.23 is halfway between 2.18 and 2.28; did the problem exist then? Repeat until you can identify the exact version that introduced the problem. Report that here and in Jira.

Then, look at the change log for that version. It might be possible to immediately identify a suspicious change that introduced the bug.

Ultimately, someone will likely need to create a software fix for this. Does your organisation have a Java developer who can contribute a patch? Alternatively, you can look at Support - GeoServer to find a suitable provider whom you can engage.

Best of luck & thank you for highlighting the problem. Perhaps someone else here has already experienced the same and can share their solution.

Peter

I was able to take a full heap dump and analyze it with Eclipse MAT. The heap dump shows:

  • org.geotools.filter.v2_0.FES has an attribute named schema of type XSDResourceImpl
  • XSDResourceImpl.eAdapters() contains many instances of org.geotools.xsd.impl.SchemaIndexImpl$SchemaAdapter, each retaining roughly 800 QName objects
  • Those adapters appear to be created by ApplicationSchemaXSD.buildSchema(), which seems to be called every time a request is made to an upstream WFS data source
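To illustrate the retention pattern those findings suggest, here is a hypothetical simulation (not GeoTools code): a long-lived shared resource, standing in for the XSDResourceImpl held by the static FES schema, gains one adapter per request, and each adapter pins a batch of QName-like objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the suspected pattern: adapters are added to a
// process-lifetime list on every request and never removed, so retained
// memory grows without bound.
public class AdapterLeakSketch {
    static class Adapter {
        // stands in for SchemaIndexImpl$SchemaAdapter retaining ~800 QNames
        final List<String> retainedNames = new ArrayList<>();
        Adapter(int n) {
            for (int i = 0; i < n; i++) retainedNames.add("qname-" + i);
        }
    }

    static class SharedResource {
        // stands in for XSDResourceImpl.eAdapters()
        final List<Adapter> eAdapters = new ArrayList<>();
    }

    static final SharedResource SCHEMA = new SharedResource(); // lives for the whole process

    // stands in for a per-request buildSchema() call that registers an
    // adapter on the shared resource without a matching removal
    static void handleRequest() {
        SCHEMA.eAdapters.add(new Adapter(800));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) handleRequest();
        System.out.println("adapters retained: " + SCHEMA.eAdapters.size());
    }
}
```

If the real code follows this shape, the fix would be to either reuse one adapter per schema or unregister adapters when the per-request index is disposed.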

While I have knowledge of Java and heap analysis, I know very little about GeoServer itself. I will try opening a bug.

I don't seem to be able to create issues in Jira other than anonymous reports.

However, I did find a bug in Jira that almost exactly matches (same symptoms, same schema/adapter classes, just retained by org.geotools.filter.v2_0.FES rather than PullParser).

Hi,

The class you're referring to is within GeoTools. The code that triggers this behavior is almost certainly this part:

```java
/** Adapter for tracking changes to schemas. */
SchemaAdapter adapter;

public SchemaIndexImpl(XSDSchema[] schemas) {
    this.schemas = new XSDSchema[schemas.length + 1];
    adapter = new SchemaAdapter();

    // set the schemas passed in
    for (int i = 0; i < schemas.length; i++) {
        this.schemas[i] = schemas[i];
        synchronized (this.schemas[i].eAdapters()) {
            this.schemas[i].eAdapters().add(adapter);
        }
    }

    // add the schema for xml schema itself
    this.schemas[schemas.length] = schemas[0].getSchemaForSchema();
}
```

I'm not sure I have the capacity to look at this at the moment, but someone should.

Please have a look at the link Peter posted about support providers.

Best regards,

Roar Brænden

I believe you will need to create an Atlassian account to create an issue, which you should do, please. Ref: Issue tracker - GeoServer

Peter

Even with an Atlassian account, the Create button does nothing; I don't think I have permission to do anything other than search. In the meantime, I'll look into reaching out to one of the support providers.

Thank you for that feedback. I was not aware there was a problem with Jira, which obviously should work for new accounts, so that we can properly track issues. We will have to look into that.

Until then, please continue to report your findings here (particularly if you can do a binary search to narrow down the version). I have just done some work on generating (old) Docker images, which you can use to quickly obtain the ones you need for the search.

Peter

Hi there,

I also observed similar memory leaks with the 2.28.1 Docker images, with a store accessing images on object storage.

We changed our memory settings (using the `-XX:MaxRAMPercentage=80` parameter instead of `-Xms`/`-Xmx`) and adjusted our Kubernetes requests/limits, and we no longer have problems.
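When switching between `-Xmx` and `-XX:MaxRAMPercentage`, it is worth verifying which setting actually took effect inside the container. A quick sanity-check sketch using the standard `Runtime` API:

```java
// Sketch: print the max heap the JVM actually settled on. With
// -XX:MaxRAMPercentage the heap is sized relative to the container's
// memory limit, so this confirms the limit the JVM detected.
public class MaxHeapCheck {
    public static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.printf("max heap: %d MB%n", maxHeapBytes() / (1024 * 1024));
    }
}
```

Note that a percentage-based heap plus container limits changes when the OOM occurs; if the retention is unbounded, it may delay rather than eliminate the crash.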

Regards

Alexandre

Perhaps we can make another topic to figure out ticket creation?

In the past we have run into a 2,000-user limit. Checking in today, we have 2,667 total users and 1,821 active users. The categories are active, suspended, and deactivated.

I cannot figure out what the current user limit is.

This discussion does indeed sound very similar to the findings I made a few years ago, reported here: https://osgeo-org.atlassian.net/browse/GEOT-6517. As mentioned earlier, the problem back then looked exactly as it does now. Looking back at my comments on the ticket, I wasn't sure even then whether the fix that was made only addressed one part of the problem.