Mastering Kubernetes the Right Way · DAY 15 / 35

CrashLoopBackOff from Tight Memory Limits: The 2-Minute Fix

Pod created. Pod killed. Pod created. Pod killed. Welcome to the forever loop.

Koti Vellanki · 03 Apr 2026 · 3 min read
kubernetes · debugging · resources

2:14 AM. The pager said "payments-api RESTARTS=47." I rolled over, opened my laptop, and watched a pod get born and killed in perfect rhythm. Every 30 seconds: Created, Running, Terminated, Created. kubectl apply from the deploy pipeline had come back green an hour ago. The pod was, technically, created. It just kept getting killed a second later. The RESTARTS counter was at 51 by the time I finished typing kubectl describe. I had seen this shape before. A limit set by somebody who never ran the workload under real traffic, now meeting real traffic at 2 in the morning.

The scenario

This one lives in the troubleshooting repo. Clone it, apply the broken manifest, and you can reproduce the exact loop I was staring at.

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/failed-resource-limits
ls

You will see issue.yaml, fix.yaml, and a short description.md. The issue manifest runs polinux/stress asking for 64M of memory under a 32Mi limit, twice what it is allowed. Perfect CrashLoop material.
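For orientation, the broken manifest looks roughly like this. This is a sketch from memory, not a copy of the repo's issue.yaml; field names like the pod name and container name are assumptions, so check the repo for the real thing:

```yaml
# Hedged sketch of issue.yaml -- values are assumptions, see the repo.
apiVersion: v1
kind: Pod
metadata:
  name: failed-resource-limits-pod
spec:
  containers:
    - name: stress
      image: polinux/stress
      command: ["stress", "--vm", "1", "--vm-bytes", "64M"]
      resources:
        limits:
          memory: "32Mi"   # hard ceiling, half of what stress will ask for
```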

Reproduce the issue

bash
kubectl apply -f issue.yaml
# pod/failed-resource-limits-pod created
kubectl get pods -w

Wait sixty seconds and the RESTARTS column starts climbing like a stopwatch.

plaintext
NAME                         READY   STATUS             RESTARTS      AGE
failed-resource-limits-pod   0/1     CrashLoopBackOff   5 (12s ago)   2m

Five restarts in two minutes. The pod is not flaky. It is a dead machine being revived over and over.

Debug the hard way

bash
kubectl describe pod failed-resource-limits-pod

Buried in the events:

plaintext
Last State:   Terminated
  Reason:     OOMKilled
  Exit Code:  137
Limits:
  memory:  32Mi
Command:
  stress --vm 1 --vm-bytes 64M

Two fields, one answer. The limit is 32Mi. The workload wants 64M.

bash
kubectl logs failed-resource-limits-pod --previous
plaintext
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [1] (415) <-- worker 7 got signal 9
stress: FAIL: [1] (451) failed run completed in 0s

Signal 9 is the kernel saying "I killed this on purpose." No application bug. No race condition. Just cgroups doing their job.
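The 137 is not arbitrary either: shells and container runtimes report a signal death as 128 plus the signal number, and SIGKILL is signal 9. You can reproduce the exact exit code locally with no cluster at all:

```shell
# 137 = 128 + 9: the shell encodes "killed by SIGKILL" in the exit status,
# which is the same number the kubelet records after the OOM killer fires.
sleep 30 &
pid=$!
kill -9 "$pid"
code=0
wait "$pid" || code=$?
echo "exit code: $code"   # prints "exit code: 137"
```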

bash
kubectl get pod failed-resource-limits-pod \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'
# 11

Why this happens

Memory limits in Kubernetes are not soft targets. They are hard ceilings enforced by the Linux kernel through cgroups. When your container tries to allocate past its limit, the kernel does not send a warning or a graceful shutdown. It fires the OOM killer, the process dies with exit code 137, and the kubelet dutifully restarts the pod because restartPolicy defaults to Always. That loop runs forever. The backoff caps at five minutes, so you get one dead pod every five minutes until a human notices.
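The backoff schedule itself is easy to picture. A rough sketch of the kubelet's behavior, assuming a 10s base delay that doubles per restart and caps at 300s (the real kubelet adds jitter and resets the counter after a stretch of healthy running):

```shell
# Approximate CrashLoopBackOff delays: base 10s, doubling, capped at 300s.
delay=10
for restart in 1 2 3 4 5 6 7; do
  echo "restart ${restart}: wait ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
# restarts 1..7 wait 10s, 20s, 40s, 80s, 160s, 300s, 300s
```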

The mental model I wish somebody had drawn for me in year one: a pod with a limit below its actual memory need is not a bug. It is a permanent kill switch. CrashLoopBackOff is not a transient state here, it is the steady state. No amount of patience or retries will fix it because nothing about the workload is going to change between attempts.

The lesson from the field is that limits are a contract with the kernel, not a guideline for the scheduler. Write the contract wrong and the kernel enforces it exactly.

The fix

bash
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml

The diff is two lines:

yaml
command: ["stress", "--vm", "1", "--vm-bytes", "50M"]
resources:
  limits:
    memory: "256Mi"

256Mi for a 50M workload. That looks wasteful until you remember that a limit is a ceiling, not a reservation: unused headroom costs nothing, and outages do.

bash
kubectl get pod failed-resource-limits-fixed-pod
# failed-resource-limits-fixed-pod   1/1   Running   0   1m

Zero restarts. Steady state.

The lesson

  1. CrashLoopBackOff plus OOMKilled equals a memory limit below real usage. It will not self-heal. Stop waiting.
  2. The RESTARTS counter is the most honest metric in Kubernetes. A climbing number means something is fundamentally wrong, not transiently wrong.
  3. Set memory limits to peak observed usage times 1.5, minimum. Headroom is the cheapest insurance you can buy.
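Point 3 is plain arithmetic you can script. A hedged sketch: peak_mi here is an example value you would in practice take from kubectl top pod or your metrics stack, and rounding up to a 64Mi boundary is my own choice, not a Kubernetes rule:

```shell
# Size a memory limit at 1.5x peak observed usage, rounded up to 64Mi.
peak_mi=170                             # example peak usage in Mi
target=$(( peak_mi * 3 / 2 ))           # 1.5x headroom -> 255
limit_mi=$(( (target + 63) / 64 * 64 )) # round up to a 64Mi boundary
echo "memory limit: ${limit_mi}Mi"      # prints "memory limit: 256Mi"
```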

Day 15 of 35. Tomorrow we go one layer deeper, into the cgroup itself, where the kernel makes the decisions Kubernetes only reports.
