Mastering Kubernetes the Right Way · DAY 16 / 35

Kubernetes cgroup OOM: When the Kernel Kills Before kubectl Knows

Your pod metrics look fine. The kernel disagrees. Here is what lives beneath kubectl.

Koti Vellanki · 04 Apr 2026 · 3 min read
kubernetes · debugging · resources

3:40 AM, a Tuesday I will not forget. A batch processor was dying every four minutes in production. kubectl top pod reported 38Mi average. The alert from our APM said peak RSS was 42Mi. The limit on the pod was 50Mi. By every number I could see, we had headroom. But the pod kept dying with exit code 137. I spent twenty minutes reading Go profiles before I remembered that kubectl top samples every thirty seconds and the kernel checks every single page fault. Somewhere in the gap between those two clocks, the container was touching 60Mi for half a second and the kernel was doing what kernels do. I had been debugging the wrong layer the whole time.

The scenario

Same repo, different folder. This one exists to make the gap between pod-level metrics and cgroup-level enforcement painfully visible.

```bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/cgroup-issues
ls
```

issue.yaml runs stress with 100M of allocations against a 50Mi cgroup ceiling. You cannot win that fight.
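For reference, the failing spec boils down to something like this. This is a sketch reconstructed from the description above, not the literal contents of issue.yaml — the image and metadata names are assumptions:

```yaml
# Hypothetical reconstruction of issue.yaml -- the repo file may differ
# in details; the essentials are the stress args and the memory limit.
apiVersion: v1
kind: Pod
metadata:
  name: cgroup-issue-pod
spec:
  containers:
    - name: stress
      image: polinux/stress      # assumed image; any stress image works
      command: ["stress", "--vm", "1", "--vm-bytes", "100M"]
      resources:
        limits:
          memory: "50Mi"         # cgroup ceiling; a 100M allocation cannot fit
```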

Reproduce the issue

```bash
kubectl apply -f issue.yaml
kubectl get pods
```
```plaintext
NAME               READY   STATUS             RESTARTS      AGE
cgroup-issue-pod   0/1     CrashLoopBackOff   4 (18s ago)   90s
```

The restart count climbs while the metrics dashboards stay flat. That is the cgroup loop. Fast, clean, invisible to anything that samples on a poll.

Debug the hard way

```bash
kubectl describe pod cgroup-issue-pod | grep -A5 "Last State"
```
```plaintext
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 04 Apr 2026 03:41:02 +0000
  Finished:     Mon, 04 Apr 2026 03:41:03 +0000
```

One second of life. Born, killed, done.

```bash
kubectl get pod cgroup-issue-pod -o jsonpath='{.spec.containers[0].resources}{"\n"}'
```
```plaintext
{"limits":{"memory":"50Mi"}}
```

If you can SSH to the node, the evidence is sharper:

```bash
dmesg -T | grep -i "killed process"
# [Mon Apr 4 03:41:03 2026] Memory cgroup out of memory: Killed process 18422 (stress) total-vm:106048kB, anon-rss:51200kB
```

An anon-rss of 51200kB is exactly 50Mi. The process was killed the instant it charged the page that crossed the limit; the kernel saw the overshoot at the exact microsecond it happened.

Why this happens

Kubernetes does not enforce memory limits. The Linux kernel does, through cgroups v1 or v2, depending on your distro. When you write limits.memory: 50Mi in a pod spec, the kubelet translates that into a cgroup file on the host, something like /sys/fs/cgroup/memory/.../memory.limit_in_bytes. From that moment on, every page fault inside the container goes through a kernel check. If the total charged memory exceeds the limit by even one page, the OOM killer fires. There is no poll, no sample, no averaging. It is instantaneous.
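You can see the translation on the node itself. The exact file paths vary with cgroup version and container runtime, so treat the ones in the comments below as typical locations rather than gospel; the byte arithmetic is the part you can rely on:

```bash
# Typical locations of the limit file the kubelet writes (paths vary by
# distro, cgroup version, and runtime -- these are illustrative):
#   cgroup v2: /sys/fs/cgroup/kubepods.slice/<pod-scope>/memory.max
#   cgroup v1: /sys/fs/cgroup/memory/kubepods/<pod>/<container>/memory.limit_in_bytes
# Either file holds the limit in plain bytes. 50Mi in the pod spec becomes:
echo $((50 * 1024 * 1024))    # 52428800 bytes
# dmesg reports rss in kB, so the same limit there reads as:
echo $((50 * 1024))           # 51200 kB -- exactly the anon-rss in the kill line
```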

kubectl top pod and your Prometheus dashboards work very differently. They scrape cAdvisor on a schedule, usually every 15 to 30 seconds, and they report whatever they saw at those moments. Short spikes between samples are invisible to them. A pod that touches 120Mi for 200 milliseconds looks identical to a pod that stays at 38Mi forever, because the sample never landed during the spike.
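A toy model makes the gap concrete. The numbers below are illustrative, not measured: usage sits at 38Mi except for a single 200 ms spike to 120Mi at the 45-second mark, and a 30-second scraper never lands inside the spike window:

```bash
# Simulated memory usage in Mi at time t (milliseconds).
usage_mib() {
  if [ "$1" -ge 45000 ] && [ "$1" -lt 45200 ]; then
    echo 120    # 200 ms spike
  else
    echo 38     # steady state
  fi
}

# What a 30-second scraper sees: four samples, every one of them 38.
for t in 0 30000 60000 90000; do usage_mib "$t"; done

# The kernel checks on every allocation, so a probe inside the spike
# window reports the real peak -- well over a 50Mi limit.
usage_mib 45100
```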

Once you see the two layers clearly, cgroup OOMs stop being a mystery. Your graphs are not lying, they are just looking away at the wrong moment. The kernel never looks away.

The fix

```bash
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml
```
```yaml
command: ["stress", "--vm", "1", "--vm-bytes", "50M"]
resources:
  limits:
    memory: "256Mi"
```

Same workload numbers as Day 15, different point. Here we are not just giving the app more headroom, we are giving it enough headroom that short allocation spikes during GC or buffer resize cannot punch through the cgroup ceiling.
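The headroom arithmetic behind the 256Mi choice looks roughly like this. The spike figure is an assumption for illustration, not a number from the repo:

```bash
steady_mib=38    # steady-state RSS, the number kubectl top reports
spike_mib=60     # worst short-lived spike, the kind only dmesg catches
limit_mib=256    # the new limit in fix.yaml

# Slack above the worst observed spike:
echo "$((limit_mib - spike_mib))Mi of headroom"

# The spike could double and the pod would still fit under the ceiling:
[ "$limit_mib" -ge "$((2 * spike_mib))" ] && echo "headroom OK"
```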

```bash
kubectl get pod cgroup-issue-fixed-pod
# cgroup-issue-fixed-pod   1/1   Running   0   2m
```

The lesson

  1. kubectl top samples, the kernel does not. If the two disagree, trust the kernel.
  2. Memory limits are enforced at the page-fault layer. Spikes shorter than your scrape interval can still kill you.
  3. When an OOM happens that your graphs cannot explain, walk down one layer to dmesg or cgroup stats. The evidence is always there, you just need to open the right file.
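When you do walk down to dmesg, the kill line carries everything you need: pid, process name, and the rss at the moment of death. A quick way to pull the numbers out — the line format here is assumed from the output earlier in this post, and field names can shift slightly across kernel versions:

```bash
# Sample kernel OOM line (from the dmesg output above):
line='[Mon Apr 4 03:41:03 2026] Memory cgroup out of memory: Killed process 18422 (stress) total-vm:106048kB, anon-rss:51200kB'

# Extract the pid and the anon-rss at kill time.
pid=$(printf '%s\n' "$line" | sed -n 's/.*Killed process \([0-9]*\).*/\1/p')
rss_kb=$(printf '%s\n' "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')

echo "pid=$pid rss=$((rss_kb / 1024))Mi"
```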

Day 16 of 35. Tomorrow we jump from compute to storage, starting with a volumeMount that points at a volume that does not exist.
