koti.dev
Mastering Kubernetes the Right Way · DAY 16 / 35

Kubernetes cgroup OOM: When the Kernel Kills Before kubectl Knows

Your pod metrics look fine. The kernel disagrees. Here is what lives beneath kubectl.

Koti Vellanki · 04 Apr 2026 · 3 min read
kubernetes · debugging · resources

3:40 AM, a Tuesday I will not forget. A batch processor was dying every four minutes in production. kubectl top pod reported 38Mi average. The alert from our APM said peak RSS was 42Mi. The limit on the pod was 50Mi. By every number I could see, we had headroom. But the pod kept dying with exit code 137. I spent twenty minutes reading Go profiles before I remembered that kubectl top samples every thirty seconds and the kernel checks on every single page fault. Somewhere in the gap between those two clocks, the container was reaching for 60Mi for half a second and the kernel was doing what kernels do. I had been debugging the wrong layer the whole time.

The scenario

DAY 16 · RESOURCES · CGROUP THROTTLE

The pod was scheduled. The kernel throttled it anyway.

The kubelet saw available CPU and scheduled the pod. But the cgroup v1 hierarchy on the node had a stale cpu.cfs_quota_us, so the Linux CFS scheduler immediately began throttling the container. The pod runs — at less than 10% of its requested CPU.

FIGURE 16 / 35
cgroup CPU throttle: misconfigured cfs_quota_us starves the container. A pod requests two CPUs but the cgroup v1 hierarchy has a wrong cpu.cfs_quota_us. The Linux CFS scheduler throttles the container to 0.18 actual cores. The container runs but is starved of CPU, accumulating thousands of throttle events.
[Diagram: Kubernetes cluster (production · us-east-1 · v1.30). (1) Pod in the default namespace, requests.cpu: 2, actual usage 0.18 cores, 9% of request delivered. (2) Kubelet with cgroup-driver: cgroupfs, kubepods.slice/ cpu.cfs_quota_us: 200ms (wrong), cpu.cfs_period_us: 100ms, stale quota from node reconfiguration. (3) Linux scheduler, CFS bandwidth control, nr_throttled: 4823, throttled_time: 92.4s, container starved; debug with cat /sys/fs/cgroup/.../cpu.stat.]
1

The CPU bar shows delivery at 9% of the request

The pod asked for requests.cpu: 2. The kubelet scheduled it — available capacity checked out. But the container is only receiving 0.18 cores because the cgroup quota was written wrong, not because the node is busy.

2

The cgroup driver wrote a stale quota

The kubelet's cgroup-driver: cgroupfs set cpu.cfs_quota_us: 200ms in the kubepods hierarchy, a value left over from a previous node reconfiguration. With a 100ms period, 200ms of quota works out to 2 cores on paper (quota divided by period), but the value actually enforced on the container is wrong.

3

cpu.stat reveals the real story

Run cat /sys/fs/cgroup/.../cpu.stat on the node. nr_throttled: 4823 and throttled_time: 92.4s confirm the scheduler is enforcing the wrong quota. This never appears in kubectl top pod.

pod.spec.containers.resources.requests.cpu — kubectl explain pod.spec.containers.resources · kind v0.22.0, Kubernetes 1.30.0, kernel 6.x cgroup v1
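The cpu.stat check from step 3 can be sketched as two small shell helpers. This is a minimal sketch, assuming the cgroup v1 cpu.stat layout (nr_periods, nr_throttled, and throttled_time in nanoseconds); the function names throttle_summary and cfs_cores are mine, not from the repo. Note that the kernel files store microseconds, so the figure's 200ms and 100ms are 200000 and 100000 on disk.

```bash
# throttle_summary CPU_STAT_FILE
# Summarises a cgroup v1 cpu.stat file: how many CFS periods were throttled
# and for how long in total. (Hypothetical helper name.)
throttle_summary() {
  awk '
    $1 == "nr_periods"     { periods = $2 }
    $1 == "nr_throttled"   { throttled = $2 }
    $1 == "throttled_time" { ns = $2 }   # cgroup v1 reports nanoseconds
    END {
      pct = (periods > 0) ? 100 * throttled / periods : 0
      printf "throttled %d of %d periods (%.1f%%), %.1fs total\n",
             throttled, periods, pct, ns / 1e9
    }
  ' "$1"
}

# cfs_cores QUOTA_US PERIOD_US
# Effective CPU ceiling in cores: quota divided by period.
cfs_cores() {
  awk -v q="$1" -v p="$2" 'BEGIN { printf "%.2f\n", q / p }'
}

# Usage on a node (path elided exactly as in the figure):
#   throttle_summary /sys/fs/cgroup/.../cpu.stat
#   cfs_cores 200000 100000
```

If the ratio from cfs_cores matches the request but throttle_summary still shows most periods throttled, the quota being enforced is not the one you think you wrote.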

Same repo, different folder. This one exists to make the gap between pod-level metrics and cgroup-level enforcement painfully visible.

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/cgroup-issues
ls

issue.yaml runs stress with 100M of allocations against a 50Mi cgroup ceiling. You cannot win that fight.

Reproduce the issue

bash
kubectl apply -f issue.yaml
kubectl get pods
plaintext
NAME               READY   STATUS             RESTARTS      AGE
cgroup-issue-pod   0/1     CrashLoopBackOff   4 (18s ago)   90s

The restart count climbs while the metrics dashboards stay flat. That is the cgroup loop. Fast, clean, invisible to anything that samples on a poll.

Debug the hard way

bash
kubectl describe pod cgroup-issue-pod | grep -A5 "Last State"
plaintext
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 04 Apr 2026 03:41:02 +0000
  Finished:     Mon, 04 Apr 2026 03:41:03 +0000

One second of life. Born, killed, done.
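Exit code 137 is worth decoding once so it stops looking like a magic number. Exit statuses above 128 follow the shell convention of 128 plus the signal number, and 137 minus 128 is 9, which is SIGKILL on Linux. A minimal sketch, with exit_code_signal as a hypothetical helper name:

```bash
# exit_code_signal EXIT_CODE
# Prints the signal number encoded in an exit status above 128,
# or "none" when the status is an ordinary exit code.
exit_code_signal() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo none
  fi
}

exit_code_signal 137   # prints 9
```

In bash, `kill -l 9` turns that number back into the name KILL. SIGKILL cannot be caught, which is why the application logs nothing on the way out.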

bash
kubectl get pod cgroup-issue-pod -o jsonpath='{.spec.containers[0].resources}{"\n"}'
plaintext
{"limits":{"memory":"50Mi"}}

If you can SSH to the node, the evidence is sharper:

bash
dmesg -T | grep -i "killed process"
# [Mon Apr 4 03:41:03 2026] Memory cgroup out of memory: Killed process 18422 (stress) total-vm:106048kB, anon-rss:51200kB

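anon-rss of 51200kB against a 50Mi limit: 51200 divided by 1024 is exactly 50Mi. The process had filled the cgroup to its ceiling, and the kernel killed it at the moment the next page could not be charged.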
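To compare the kernel's number with the pod limit without doing the division in your head, the kill line can be reduced to a one-line summary. A minimal sketch, assuming the message format shown in the dmesg output above (it differs somewhat between kernel versions) and a command name without spaces; oom_kill_summary is a hypothetical helper name. The kernel's kB here are 1024-byte units, so dividing by 1024 gives Mi.

```bash
# oom_kill_summary
# Reads kernel log lines on stdin and prints, for each "Killed process" line,
# the pid, the command name, and anon-rss converted from kB to Mi.
oom_kill_summary() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/\1 \2 \3/p' |
    awk '{ printf "pid=%s comm=%s anon_rss=%.1fMi\n", $1, $2, $3 / 1024 }'
}

# Usage on a node:
#   dmesg -T | oom_kill_summary
```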

Why this happens

Kubernetes does not enforce memory limits. The Linux kernel does, through cgroups v1 or v2, depending on your distro. When you write limits.memory: 50Mi in a pod spec, the kubelet and container runtime translate that into a cgroup file on the host, something like /sys/fs/cgroup/memory/.../memory.limit_in_bytes on cgroup v1 (memory.max on v2). From that moment on, every page the container faults in is charged against that limit. If a charge would push the total past the limit and the kernel cannot reclaim enough memory inside the cgroup, the OOM killer fires. There is no poll, no sample, no averaging. It happens at the moment of the allocation.
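The files the kernel enforces against can be read directly on the node. The sketch below assumes you have already found the container's memory cgroup directory, which varies by cgroup version, driver, and runtime, so it is passed in as an argument; cgroup_mem_report is a hypothetical helper name. On v1 it reads memory.limit_in_bytes, memory.max_usage_in_bytes, and memory.failcnt; on v2 it reads memory.max, the oom_kill counter in memory.events, and memory.peak where the kernel provides it.

```bash
# cgroup_mem_report DIR
# Prints the enforced memory limit plus the kernel's own peak and OOM
# counters for one memory cgroup directory.
cgroup_mem_report() {
  dir=$1
  if [ -f "$dir/memory.max" ]; then                  # cgroup v2
    echo "version=v2 limit=$(cat "$dir/memory.max")"
    if [ -f "$dir/memory.peak" ]; then               # only on newer kernels
      echo "peak=$(cat "$dir/memory.peak")"
    fi
    # oom_kill counts processes in this cgroup killed by the OOM killer
    awk '$1 == "oom_kill" { print "oom_kill=" $2 }' "$dir/memory.events"
  elif [ -f "$dir/memory.limit_in_bytes" ]; then     # cgroup v1
    echo "version=v1 limit=$(cat "$dir/memory.limit_in_bytes")"
    echo "peak=$(cat "$dir/memory.max_usage_in_bytes")"
    echo "failcnt=$(cat "$dir/memory.failcnt")"
  else
    echo "no memory cgroup files in $dir" >&2
    return 1
  fi
}
```

The peak value is tracked by the kernel itself, so unlike a scraped metric it does not depend on a sample landing at the right moment.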

kubectl top pod and your Prometheus dashboards work very differently. They scrape cAdvisor on a schedule, usually every 15 to 30 seconds, and they report whatever they saw at those moments. Short spikes between samples are invisible to them. A pod that touches 120Mi for 200 milliseconds looks identical to a pod that stays at 38Mi forever, because the sample never landed during the spike.
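The blind spot is easy to reproduce on paper. The sketch below is a toy simulation, not a measurement: it walks a 60-second window in 100ms ticks (both assumptions of mine), holds usage at 38Mi except for one 120Mi spike, and compares the true peak with the peak a fixed-interval sampler would report. simulate_sampling is a hypothetical name.

```bash
# simulate_sampling SPIKE_START_MS SPIKE_LEN_MS SAMPLE_INTERVAL_MS
# Compares the real peak of a synthetic usage trace with the peak seen by
# a sampler that only looks every SAMPLE_INTERVAL_MS.
simulate_sampling() {
  awk -v start="$1" -v len="$2" -v interval="$3" 'BEGIN {
    base = 38; spike = 120; total = 60000        # Mi, Mi, window in ms
    true_peak = 0; sampled_peak = 0
    for (t = 0; t < total; t += 100) {           # 100ms simulation tick
      usage = (t >= start && t < start + len) ? spike : base
      if (usage > true_peak) true_peak = usage
      # the sampler only observes ticks that fall on its interval
      if (t % interval == 0 && usage > sampled_peak) sampled_peak = usage
    }
    printf "true_peak=%dMi sampled_peak=%dMi\n", true_peak, sampled_peak
  }'
}

simulate_sampling 20200 200 15000   # prints true_peak=120Mi sampled_peak=38Mi
```

Move the spike to start at 30000 and the 15-second sampler happens to catch it. Whether your dashboard ever shows the spike comes down to that kind of luck.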

Once you see the two layers clearly, cgroup OOMs stop being a mystery. Your graphs are not lying, they are just looking away at the wrong moment. The kernel never looks away.

The fix

bash
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml
yaml
command: ["stress", "--vm", "1", "--vm-bytes", "50M"]
resources:
  limits:
    memory: "256Mi"

Same workload numbers as Day 15, different point. Here we are not just giving the app more headroom, we are giving it enough headroom that short allocation spikes during GC or buffer resize cannot punch through the cgroup ceiling.

bash
kubectl get pod cgroup-issue-fixed-pod
# cgroup-issue-fixed-pod 1/1 Running 0 2m

The lesson

  1. kubectl top samples, the kernel does not. If the two disagree, trust the kernel.
  2. Memory limits are enforced at the page-fault layer. Spikes shorter than your scrape interval can still kill you.
  3. When an OOM happens that your graphs cannot explain, walk down one layer to dmesg or cgroup stats. The evidence is always there, you just need to open the right file.

Day 16 of 35. Tomorrow we jump from compute to storage, starting with a volumeMount that points at a volume that does not exist.
