Kubernetes cgroup OOM: When the Kernel Kills Before kubectl Knows

3:40 AM, a Tuesday I will not forget. A batch processor was dying every four minutes in production. kubectl top pod reported 38Mi average. The alert from our APM said peak RSS was 42Mi. The limit on the pod was 50Mi. By every number I could see, we had headroom. But the pod kept dying with exit code 137. I spent twenty minutes reading Go profiles before I remembered that kubectl top samples every thirty seconds and the kernel checks every single page fault. Somewhere in the gap between those two clocks, the container was touching 60Mi for half a second and the kernel was doing what kernels do. I had been debugging the wrong layer the whole time.

The scenario

◆ DAY 16 · RESOURCES · CGROUP THROTTLE

The pod was scheduled. The kernel throttled it anyway.

The kubelet saw available CPU and scheduled the pod. But the cgroup v1 hierarchy on the node had a stale cpu.cfs_quota_us, so the Linux CFS scheduler immediately began throttling the container. The pod runs — at less than 10% of its requested CPU.

FIGURE 16 / 35

The CPU bar shows delivery at 9% of the request

The pod asked for requests.cpu: 2. The kubelet scheduled it — available capacity checked out. But the container is only receiving 0.18 cores because the cgroup quota was written wrong, not because the node is busy.

The cgroup driver wrote a stale quota

With cgroup-driver: cgroupfs, the kubelet left cpu.cfs_quota_us at 200ms (200000 in the file, which is expressed in microseconds) in the kubepods hierarchy, a value left over from a previous node reconfiguration. The period is 100ms, so on paper the limit works out to 2 cores, but the bandwidth the scheduler actually enforces on the container is far lower.
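The quota arithmetic is easy to check by hand. The following is a minimal sketch, assuming cgroup v1 semantics where cpu.cfs_quota_us and cpu.cfs_period_us both hold microseconds and a quota of -1 means no limit. The helper name cfs_cores is hypothetical, and the 18000 value is not from the node; it is simply the quota that would correspond to the 0.18 cores shown in the figure.

```shell
# cfs_cores QUOTA_US PERIOD_US
# Prints the CPU ceiling, in cores, implied by a CFS quota/period pair.
cfs_cores() {
  awk -v q="$1" -v p="$2" 'BEGIN {
    if (q < 0) { print "unlimited"; exit }   # -1 disables the bandwidth limit
    printf "%.2f\n", q / p
  }'
}

cfs_cores 200000 100000   # the 200ms quota over the 100ms period: prints 2.00
cfs_cores 18000 100000    # hypothetical quota matching 0.18 cores: prints 0.18
```

On a node you would feed it the contents of the two cgroup files instead of literals.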

cpu.stat reveals the real story

Run cat /sys/fs/cgroup/.../cpu.stat on the node. nr_throttled: 4823 and throttled_time: 92.4s (the file itself reports nanoseconds) confirm that the scheduler is throttling the container against the wrong quota. None of this appears in kubectl top pod.
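Since cgroup v1 reports throttled_time in nanoseconds, a small filter makes the file easier to read. This is a sketch only: throttle_summary is a hypothetical helper, and the sample input is built from the two numbers quoted above, with 92.4s written out as nanoseconds.

```shell
# throttle_summary: reads cgroup v1 cpu.stat text on stdin and prints the
# throttle count plus throttled time converted from nanoseconds to seconds.
throttle_summary() {
  awk '
    $1 == "nr_throttled"   { n = $2 }
    $1 == "throttled_time" { t = $2 }
    END { printf "nr_throttled=%d throttled_seconds=%.1f\n", n, t / 1e9 }
  '
}

# Sample input assembled from the figures in the text.
throttle_summary <<'EOF'
nr_throttled 4823
throttled_time 92400000000
EOF
```

On a real node the input would come from the cpu.stat file of the container's cgroup rather than a here-document.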

A pod requests cpu: 2 but the cgroup driver misconfiguration causes the Linux CFS scheduler to throttle it to 0.18 cores.

A pod inside a Kubernetes cluster requests two CPUs. The kubelet's cgroupfs driver writes a wrong cpu.cfs_quota_us to the kubepods cgroup. The Linux CFS scheduler enforces the misconfigured bandwidth limit and throttles the container to 0.18 actual cores, recording 4823 throttle events and 92.4 seconds of throttled time.

pod.spec.containers.resources.requests.cpu — kubectl explain pod.spec.containers.resources · kind v0.22.0, Kubernetes 1.30.0, kernel 6.x cgroup v1

Same repo, different folder. This one exists to make the gap between pod-level metrics and cgroup-level enforcement painfully visible.

bash

git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/cgroup-issues
ls

issue.yaml runs stress with 100M of allocations against a 50Mi cgroup ceiling. You cannot win that fight.

Reproduce the issue

bash

kubectl apply -f issue.yaml
kubectl get pods

plaintext

NAME               READY   STATUS             RESTARTS      AGE
cgroup-issue-pod   0/1     CrashLoopBackOff   4 (18s ago)   90s

The restart count climbs while the metrics dashboards stay flat. That is the cgroup loop. Fast, clean, invisible to anything that samples on a poll.

Debug the hard way

bash

kubectl describe pod cgroup-issue-pod | grep -A5 "Last State"

plaintext

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 04 Apr 2026 03:41:02 +0000
  Finished:     Mon, 04 Apr 2026 03:41:03 +0000

One second of life. Born, killed, done.

bash

kubectl get pod cgroup-issue-pod -o jsonpath='{.spec.containers[0].resources}{"\n"}'

plaintext

{"limits":{"memory":"50Mi"}}

If you can SSH to the node, the evidence is sharper:

bash

dmesg -T | grep -i "killed process"
# [Mon Apr 4 03:41:03 2026] Memory cgroup out of memory: Killed process 18422 (stress) total-vm:106048kB, anon-rss:51200kB

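anon-rss of 51200kB is exactly 50Mi (51200 / 1024). The process filled the cgroup to its limit while still trying to grow toward 100M, and the kernel killed it at the moment the next page could not be charged, not at the next scrape.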
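To compare the kernel's number with the pod limit without doing the division in your head, something like the following can help. It is a rough sketch: oom_anon_rss_mib is a hypothetical helper, and it assumes the log line has the shape shown above and that the process name contains no spaces.

```shell
# oom_anon_rss_mib: reads kernel log lines on stdin and, for each cgroup OOM
# kill, prints the victim's pid, name, and anon-rss converted from kB to MiB.
oom_anon_rss_mib() {
  sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\1 \2 \3/p' |
    awk '{ printf "pid=%s comm=%s anon_rss_mib=%.1f\n", $1, $2, $3 / 1024 }'
}

# The line from the dmesg output above.
echo '[Mon Apr 4 03:41:03 2026] Memory cgroup out of memory: Killed process 18422 (stress) total-vm:106048kB, anon-rss:51200kB' |
  oom_anon_rss_mib
```

On the node, piping dmesg -T into the function gives one line per kill.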

Why this happens

Kubernetes does not enforce memory limits. The Linux kernel does, through cgroups v1 or v2, depending on your distro. When you write limits.memory: 50Mi in a pod spec, the kubelet and container runtime translate that into a cgroup file on the host: something like /sys/fs/cgroup/memory/.../memory.limit_in_bytes on cgroup v1, or memory.max on cgroup v2. From that moment on, every allocation that faults in a new page is charged to the cgroup and checked against the limit. If a charge would push usage past the limit and the kernel cannot reclaim enough pages from the cgroup to make room, the OOM killer fires. There is no poll, no sample, no averaging. The check happens at the moment of the allocation.
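The value that lands in the cgroup file is the limit expressed in bytes. As a minimal sketch of that conversion, assuming only the common binary (Ki, Mi, Gi) and decimal (k, M, G) suffixes matter, a hypothetical helper could look like this:

```shell
# mem_quantity_to_bytes QUANTITY
# Converts a Kubernetes memory quantity to bytes. Only the common binary
# (Ki, Mi, Gi) and decimal (k, M, G) suffixes are handled in this sketch.
mem_quantity_to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *k)  echo $(( ${1%k} * 1000 )) ;;
    *M)  echo $(( ${1%M} * 1000 * 1000 )) ;;
    *G)  echo $(( ${1%G} * 1000 * 1000 * 1000 )) ;;
    *)   echo "$1" ;;   # plain integer: already bytes
  esac
}

mem_quantity_to_bytes 50Mi    # the limit in issue.yaml: prints 52428800
mem_quantity_to_bytes 256Mi   # the limit in fix.yaml: prints 268435456
```

Comparing that number with what the node's cgroup file actually contains is a quick way to confirm the limit you wrote is the limit being enforced.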

kubectl top pod and your Prometheus dashboards work very differently. They scrape cAdvisor on a schedule, usually every 15 to 30 seconds, and they report whatever they saw at those moments. Short spikes between samples are invisible to them. A pod that touches 120Mi for 200 milliseconds looks identical to a pod that stays at 38Mi forever, because the sample never landed during the spike.
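The sampling gap can be shown with a toy simulation. This is an illustrative sketch, not a model of any real scraper: it replays one minute of usage at 100 ms resolution with the 38Mi baseline and a single 120Mi spike lasting 200 ms, then compares the true peak with what a poller sees. The function name, the 15 s default interval, and the spike start time are assumptions chosen for the example.

```shell
# simulate_sampling [INTERVAL_MS] [SPIKE_START_MS]
# Replays 60 s of memory usage in 100 ms steps: 38Mi flat, with one 120Mi
# spike lasting 200 ms. Prints the peak seen by a poller next to the true peak.
simulate_sampling() {
  awk -v interval_ms="${1:-15000}" -v spike_start_ms="${2:-7000}" 'BEGIN {
    true_peak = 0; sampled_peak = 0
    for (t = 0; t < 60000; t += 100) {
      usage = (t >= spike_start_ms && t < spike_start_ms + 200) ? 120 : 38
      if (usage > true_peak) true_peak = usage
      # The poller only looks at the instants that fall on its interval.
      if (t % interval_ms == 0 && usage > sampled_peak) sampled_peak = usage
    }
    printf "sampled_peak=%dMi true_peak=%dMi\n", sampled_peak, true_peak
  }'
}

simulate_sampling   # prints sampled_peak=38Mi true_peak=120Mi
```

The poller only catches the spike if a sample happens to land inside the 200 ms window, for example with simulate_sampling 15000 15000.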

Once you see the two layers clearly, cgroup OOMs stop being a mystery. Your graphs are not lying, they are just looking away at the wrong moment. The kernel never looks away.

The fix

bash

kubectl delete -f issue.yaml
kubectl apply -f fix.yaml

yaml

command: ["stress", "--vm", "1", "--vm-bytes", "50M"]
resources:
  limits:
    memory: "256Mi"

Same workload numbers as Day 15, different point. Here we are not just giving the app more headroom, we are giving it enough headroom that short allocation spikes during GC or buffer resize cannot punch through the cgroup ceiling.

bash

kubectl get pod cgroup-issue-fixed-pod
# cgroup-issue-fixed-pod   1/1   Running   0   2m

The lesson

kubectl top samples, the kernel does not. If the two disagree, trust the kernel.
Memory limits are enforced at the page-fault layer. Spikes shorter than your scrape interval can still kill you.
When an OOM happens that your graphs cannot explain, walk down one layer to dmesg or cgroup stats. The evidence is always there, you just need to open the right file.
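For the last point, on a cgroup v2 node the memory.events file keeps a running oom_kill counter for the cgroup. The snippet below is a minimal sketch of reading it: oom_kill_count is a hypothetical helper and the sample file contents are invented for illustration. On cgroup v1, memory.failcnt plays a loosely similar role by counting how often the limit was hit.

```shell
# oom_kill_count: reads cgroup v2 memory.events text on stdin and prints the
# number of processes the OOM killer has killed in that cgroup (0 if absent).
oom_kill_count() {
  awk '$1 == "oom_kill" { n = $2 } END { print n + 0 }'
}

# Hypothetical file contents, for illustration only.
oom_kill_count <<'EOF'
low 0
high 0
max 12
oom 1
oom_kill 1
EOF
```

A non-zero result means the kernel killed something in that cgroup, whatever the dashboards showed.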

Day 16 of 35. Tomorrow we jump from compute to storage, starting with a volumeMount that points at a volume that does not exist.
