koti.dev
The Runbook · Mastering Kubernetes the Right Way · DAY 02 / 35

CrashLoopBackOff in Kubernetes: Read the Exit Code First

A status, a timer, and an exit code. Learn to read all three and you debug in sixty seconds instead of forty minutes.

Koti Vellanki · 21 Mar 2026 · 6 min read
kubernetes · debugging · application

2:14 AM. Payments service, RESTARTS column climbing like a stock chart, monitoring paging me every 90 seconds, and the pod stuck in CrashLoopBackOff with nothing useful in the logs. I run kubectl logs and get the current container's output, which is empty because it has been alive for less than a second. I run --previous and get an error saying the previous container does not exist. The pod has restarted fourteen times in two minutes and I still cannot see a single line of application output. This is the trap every engineer falls into on their first real CrashLoopBackOff: the symptom is loud, the signal is silent.

The scenario

This comes from my troubleshoot-kubernetes-like-a-pro repo. You will break a pod yourself, time the backoff, read the exit code, and build the mental model once so you never have to guess again.

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/crashloopbackoff
ls

Three files: description.md, issue.yaml, fix.yaml. Assumes you already have a working cluster from Day 0.

Reproduce the issue

bash
kubectl apply -f issue.yaml
kubectl get pods

Wait about twenty seconds and run it again.

plaintext
NAME                   READY   STATUS             RESTARTS      AGE
crashloopbackoff-pod   0/1     CrashLoopBackOff   5 (12s ago)   2m14s

Five restarts, the last one twelve seconds ago, and the pod has been alive for just over two minutes. The backoff timer is already kicking in.
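If you want to watch the cycle rather than poll it, stock kubectl has flags for that; nothing below is specific to this scenario beyond the pod it happens to show:

```shell
# Stream pod state changes as they happen; Ctrl-C to stop.
kubectl get pods -w

# Or pull only the backoff events, oldest first.
kubectl get events --field-selector reason=BackOff --sort-by=.lastTimestamp
```

The `-w` stream shows the Error → CrashLoopBackOff flip-flop in real time, which makes the doubling intervals in the next section visible to the naked eye.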

Terminal: cd into scenarios/crashloopbackoff, ls the folder, kubectl apply -f issue.yaml, kubectl get po showing crashloopbackoff-pod with status Error, 1 restart, age 10s
Apply issue.yaml. The first kubectl get po catches the pod mid-cycle, showing Error before the kubelet flips it into the CrashLoopBackOff state on the next restart.

Debug the hard way

Reach for logs.

bash
kubectl logs crashloopbackoff-pod
plaintext

Empty. The container's command is sh -c "exit 1", so it dies before producing any output. Try --previous.

bash
kubectl logs crashloopbackoff-pod --previous
plaintext
Error from server (BadRequest): previous terminated container "busybox" in pod "crashloopbackoff-pod" not found

Not helpful. Describe it.

bash
kubectl describe pod crashloopbackoff-pod
plaintext
Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Sat, 21 Mar 2026 02:14:03 +0530
  Finished:     Sat, 21 Mar 2026 02:14:03 +0530
Events:
  Type     Reason   Age               From     Message
  ----     ------   ----              ----     -------
  Normal   Pulled   1m (x6 over 2m)   kubelet  Successfully pulled image "busybox"
  Warning  BackOff  8s (x10 over 2m)  kubelet  Back-off restarting failed container

Two useful numbers. Exit Code: 1 and a BackOff event that has fired ten times. Started and Finished are the same second, meaning the container lived for milliseconds. That timing is itself a signal.
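When you only want those numbers and not the whole describe wall, a JSONPath query pulls them straight out of the pod status. These are standard kubectl output paths, assuming a single-container pod as in this scenario:

```shell
# Exit code of the last terminated container state.
kubectl get pod crashloopbackoff-pod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# And the reason the kubelet recorded alongside it.
kubectl get pod crashloopbackoff-pod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```

Two commands, two answers, no scrolling. Handy in scripts and alert annotations too.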

Terminal: kubectl logs crashloopbackoff-pod (empty), kubectl logs --previous (previous container not found), kubectl describe pod crashloopbackoff-pod showing container state Waiting with reason CrashLoopBackOff, Last State Terminated with Exit Code 1 and Reason Error, and the command sh -c exit 1
kubectl describe — the container is stuck in Waiting with reason CrashLoopBackOff. Last State: Terminated with Exit Code 1 tells you the container ran, failed, and the kubelet is backing off.
Terminal: bottom half of kubectl describe showing Conditions, Volumes, QoS Class Burstable, and the Events table with multiple Pulled, Created, Started, and Warning BackOff entries
Scroll to the Events table. Normal Pulled, Normal Created, Normal Started, Warning BackOff — the full birth-and-death cycle of every failed restart, timestamped and counted.

Why this happens

CrashLoopBackOff is not a cause. It is a state machine. The kubelet starts the container, the container exits, the kubelet waits, starts it again, exits, waits longer. The backoff doubles: 10s, 20s, 40s, 80s, capped at 5 minutes. Kubernetes is being polite, trying not to hammer a broken container while still giving it chances to recover.
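The doubling schedule is simple enough to sketch in plain shell. This is a simulation of the kubelet's timer, not the kubelet itself; the 10s start and 300s cap match the defaults described above:

```shell
# Simulate the kubelet's crash backoff: start at 10s, double, cap at 300s.
delay=10
for restart in 1 2 3 4 5 6 7; do
  echo "restart ${restart}: wait ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```

By the sixth restart the pod is already waiting the full five minutes between attempts, which is why a long-crashing pod looks deceptively quiet in kubectl get pods.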

The real question is always the same: why did the container exit? The answer lives in two places, the exit code and the last stream of logs before it died. Exit code 1 usually means the app ran and failed on its own (bad config, a missing env var, an unhandled exception). Exit code 137 is 128 + 9: something sent SIGKILL, most often the OOM killer, or the kubelet force-killing a container that ignored SIGTERM after a failed liveness probe. Exit code 143 is 128 + 15: SIGTERM, a graceful shutdown request. Exit code 2 often means shell misuse.
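You can verify the 128 + signal arithmetic in any POSIX shell without a cluster; the numbers a container reports are the same ones the shell reports:

```shell
# A process killed by a signal exits with 128 + the signal number,
# the same arithmetic behind container exit codes 137 and 143.
sh -c 'kill -9 $$'  || sigkill=$?   # SIGKILL: 128 + 9  = 137
sh -c 'kill -15 $$' || sigterm=$?   # SIGTERM: 128 + 15 = 143
sh -c 'exit 1'      || apperr=$?    # plain app-level failure = 1
echo "SIGKILL=$sigkill SIGTERM=$sigterm app=$apperr"
```

Once that mapping is in your head, Exit Code: 137 in a describe reads as "who sent the kill?" instead of "what does 137 mean?".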

And the backoff timing tells you how long the pod has really been crashing. Last restart one minute ago, the one before that five minutes earlier? The pod has hit the backoff cap and has been crashing for at least twenty minutes. That is a clock you get for free.

The fix

Apply the fixed version.

bash
kubectl apply -f fix.yaml
kubectl get pods

The only change is the command. The broken pod ran sh -c "exit 1" and the fixed pod runs something that stays alive:

yaml
command:
  - "sh"
  - "-c"
  - "sleep 3600"
Terminal: cat issue.yaml shows command sh -c exit 1 with a comment 'Command that causes the container to crash', cat fix.yaml shows command sh -c sleep 3600 with a comment 'Keeps the container running'
Diff the two manifests. The broken pod runs exit 1 (deliberate suicide). The fixed pod runs sleep 3600. Same shape, one word of difference, completely different fate.

Verify.

plaintext
NAME                         READY   STATUS    RESTARTS   AGE
crashloopbackoff-fixed-pod   1/1     Running   0          8s
Terminal: kubectl apply -f fix.yaml creating crashloopbackoff-fixed-pod, followed by kubectl get po showing crashloopbackoff-fixed-pod with status Running 1/1 and the original crashloopbackoff-pod still CrashLoopBackOff with 6 restarts
One kubectl apply. The fixed pod is Running in seconds. The original is still crash-looping until you delete it — the fix is scoped to the new manifest.
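Once the fixed pod is green, clean up the crashing original. This assumes issue.yaml created only that one pod, as in this scenario:

```shell
# Delete the still-crashing pod; the fixed pod is a separate object.
kubectl delete -f issue.yaml

# Confirm only the fixed pod remains.
kubectl get pods
```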

In the real world you are rarely fixing this with sleep 3600. You are fixing the missing config, the bad env var, the unhandled exception. But the shape of the fix is identical: give the main process a reason to keep living.
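The missing-config failure mode is easy to reproduce locally in plain shell. DATABASE_URL is a hypothetical variable name here; the POSIX `${VAR:?}` expansion makes the shell exit nonzero when the variable is unset, which is exactly the "ran and failed on its own" crash that exit code 1 usually signals:

```shell
# Crashes: required config is missing, so the process exits immediately.
sh -c ': "${DATABASE_URL:?is required}"; echo "app started"' || crashed=$?
echo "exit status without config: $crashed"

# Runs: the same command with the config supplied stays on its feet.
DATABASE_URL="postgres://example" \
  sh -c ': "${DATABASE_URL:?is required}"; echo "app started"'
```

In Kubernetes the fix is the same shape: supply the env var (or ConfigMap/Secret) the process needs, and the restart loop ends on its own.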

The easiest way — with Kubilitics

The same debug, surfaced one click at a time. Open the Pods view and the broken pod is already badged with its status and restart count, no kubectl needed.

Kubilitics Pods view showing 1 pod total and 1 failed, crashloopbackoff-pod with a CrashLoopBackOff status badge, READY 0/1, RESTARTS counter visible, and the failed count highlighted in the summary cards
Kubilitics Pods view — the failed counter sits in the summary row, the pod is badged CrashLoopBackOff, the restart counter is visible at a glance. The entire triage from the kubectl describe wall is here in one row.

Click the pod and open the Overview tab. The rendered pod spec shows up under Annotations, and the bad command array (sh -c exit 1) is visible inline without running kubectl get pod -o yaml.

Kubilitics Overview tab showing Tolerations and Annotations sections with the full pod spec JSON, including the command field set to sh -c exit 1
Overview tab — the pod's command array is visible inline. exit 1 is right there in the Annotations block, no terminal, no YAML export.

After applying fix.yaml, the list updates in real time. The fixed pod shows green, the original stays red, and you see both outcomes side by side while you decide whether to delete the crashing pod.

Kubilitics Pods view after applying the fix showing 2 pods — crashloopbackoff-fixed-pod with a green Running status badge, READY 1/1, RESTARTS 0, and the original crashloopbackoff-pod still showing CrashLoopBackOff with 8 restarts
Pods view after the fix lands — the new pod is green, the original is still red, the restart counter frozen. Side-by-side proof the fix worked without touching the terminal.

The lesson

  1. CrashLoopBackOff is a status, not a cause. The cause lives in Last State and --previous logs.
  2. The exit code is the fastest triage signal. 1 is app error, 137 is SIGKILL, 143 is SIGTERM, 2 is shell misuse. Read it first.
  3. The backoff interval is a clock. If you see restarts spaced five minutes apart, the pod has been crashing for at least twenty minutes. You do not need AGE to tell you that.

Day 2 of 35 — tomorrow the pod does not even make it to the backoff because the image itself never arrives.
