As we transitioned our Alfresco Enterprise Helm charts pipeline to use the recently introduced GitHub Actions Large Runners, we sought to reduce maintenance costs associated with managing an internal cluster of self-hosted runners. During this process, we encountered failures affecting all ACS versions up to and including 7.2.
Our integration test pipeline is quite simple: we create a KinD cluster, deploy ACS in a typical clustered setup using Helm charts, wait for all resources to become ready and healthy, and then run a basic test suite that checks the primary endpoints of all the components.
The problem that arose involved memory-intensive pods, specifically the repository, share, and tika transformers pods, which were being terminated due to Out of Memory (OOM) errors triggered by the node host. Typically, this occurs when the process within the container attempts to allocate more memory than what's allowed by .resources.limits.
Running kubectl get pod reveals unhealthy repository pods in a CrashLoopBackOff state:
default   pod/acs-alfresco-repository-6bcfb996cc-88shg   0/1   CrashLoopBackOff   6 (2m41s ago)   13m
default   pod/acs-alfresco-repository-6bcfb996cc-khswb   0/1   CrashLoopBackOff   6 (3m1s ago)    13m
Describing one of them confirms that it was terminated due to an Out of Memory (OOM) event:
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
However, all of our Java processes use the MaxRAMPercentage JVM option (e.g. -XX:MaxRAMPercentage=80) to dynamically cap heap allocation based on the total memory available to the container. In principle, the memory limit should therefore never be hit, because the remaining 20% stays available for non-heap memory allocation and caching.
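The budget described above can be worked out with a few lines of arithmetic. This is only an illustrative sketch: the function name is hypothetical, the 2 GiB limit is an example value, and the real JVM may round or align the heap size slightly differently.

```python
GIB = 1024 ** 3


def heap_budget(container_limit_bytes: int, max_ram_percentage: float = 80.0):
    """Return (max_heap_bytes, headroom_bytes) for a container memory limit.

    Approximates what -XX:MaxRAMPercentage is meant to achieve when the JVM
    sees the container limit as its total available RAM.
    """
    max_heap = int(container_limit_bytes * max_ram_percentage / 100)
    return max_heap, container_limit_bytes - max_heap


# Example: a 2 GiB limit with -XX:MaxRAMPercentage=80
heap, headroom = heap_budget(2 * GIB)
print(f"max heap: {heap / GIB:.2f} GiB, non-heap headroom: {headroom / GIB:.2f} GiB")
```

The calculation only holds if the value the JVM uses as total RAM is actually the container limit.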
We discovered that cgroup v2 support graduated to Generally Available (GA) status in Kubernetes 1.25. Consequently, if your nodes run on a sufficiently recent distribution, containers will automatically use cgroup v2.
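Unfortunately, Java has only supported cgroup v2 since OpenJDK jdk8u372, 11.0.16, and 15 or later. A JVM without that support cannot read the container's memory limit and sizes its heap from the host's total memory instead, so the configured percentage no longer tracks the container limit. As of today, the latest ACS 7.2.x image relies on an OpenJDK runtime that lacks support for cgroup v2: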
~ ❯❯❯ docker run -it -m 1g --rm quay.io/alfresco/alfresco-content-repository:7.2.1.12 /bin/bash
[alfresco@db47d44a5b2b tomcat]$ cat /sys/fs/cgroup/memory.max
1073741824
[alfresco@db47d44a5b2b tomcat]$ java -version
openjdk version "11.0.14.1" 2022-02-08 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1-LTS, mixed mode, sharing)
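To make the difference between the two cgroup layouts concrete, here is a rough sketch of the lookup a cgroup-aware runtime has to perform. It is not the JVM's actual implementation, and the function names are hypothetical. It assumes the usual mount point /sys/fs/cgroup, the v2 file memory.max shown in the session above, and the conventional v1 location memory/memory.limit_in_bytes.

```python
from pathlib import Path
from typing import Optional

DEFAULT_ROOT = Path("/sys/fs/cgroup")


def cgroup_version(root: Path = DEFAULT_ROOT) -> int:
    """Return 2 on a cgroup v2 unified hierarchy, otherwise 1."""
    # cgroup.controllers only exists in the v2 unified hierarchy.
    return 2 if (root / "cgroup.controllers").is_file() else 1


def container_memory_limit(root: Path = DEFAULT_ROOT) -> Optional[int]:
    """Return the memory limit in bytes, or None if unset or unreadable."""
    if cgroup_version(root) == 2:
        limit_file = root / "memory.max"
    else:
        limit_file = root / "memory" / "memory.limit_in_bytes"
    try:
        raw = limit_file.read_text().strip()
    except OSError:
        return None
    # v2 reports the literal string "max" when no limit is set;
    # v1 reports a very large number instead, which is returned as-is here.
    if raw == "max":
        return None
    return int(raw)


if __name__ == "__main__":
    print("cgroup version:", cgroup_version())
    print("memory limit:", container_memory_limit())
```

A runtime that only knows the v1 path finds nothing there on a v2 node, which is the situation the 11.0.14.1 JVM above is in.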
To work around this limitation without rebuilding the image with a more recent OpenJDK version, you can tell the JVM explicitly how much memory is available. To do so, pass the MaxRAM Java option with the same amount, in bytes, as the memory resource limit (e.g., -XX:MaxRAM=2147483648 if resources.limits.memory is set to 2Gi, i.e. 2 GiB).
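Converting the limit to bytes by hand is easy to get wrong, since Kubernetes distinguishes binary suffixes (Gi) from decimal ones (G). The helper below is a minimal sketch of that conversion. Its function names are hypothetical, and it only handles whole-number quantities with the most common suffixes, not the full Kubernetes quantity syntax.

```python
import re

# Common Kubernetes memory suffixes: binary (power of two) and decimal.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4,
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3, "T": 1000 ** 4,
    "": 1,  # plain bytes
}


def memory_quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '2Gi' or '512Mi' to a number of bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(number) * _SUFFIXES[suffix or ""]


def max_ram_option(quantity: str) -> str:
    """Build the -XX:MaxRAM flag matching a resources.limits.memory value."""
    return f"-XX:MaxRAM={memory_quantity_to_bytes(quantity)}"


print(max_ram_option("2Gi"))  # -XX:MaxRAM=2147483648
```

Note that "2G" would yield 2000000000 rather than 2147483648, so the flag should be derived from the exact value used in the limit.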