Mastering Kubernetes the Right Way · DAY 06 / 35

Liveness Probe Killing Your Kubernetes Pods? Read This First

Exit code 137 with clean app logs means the probe is the murderer. Here is how to catch it in the act.

Koti Vellanki · 25 Mar 2026 · 4 min read
kubernetes · debugging · probes

3:05 AM. JVM service, 400 MB of cached data loaded on boot, it takes 45 seconds to warm up on a cold node. The liveness probe has initialDelaySeconds: 10. You can already see what is about to happen. The kubelet probes at T+10, gets connection refused because the app is still loading, probes again at T+13, again at T+16, hits failureThreshold: 3, and kills the container with SIGKILL. New container starts, same 45 second warmup, same kill. The pod is in CrashLoopBackOff and every log line I can find says the application is perfectly healthy. I spend ninety minutes convinced it is an OOM. It is not. The probe is killing the app before it is born.

The scenario

From my troubleshoot-kubernetes-like-a-pro repo. You are going to reproduce the case where the app is fine and the probe is the problem, and learn to spot it from exit code alone.

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/liveness-probe-failure
ls

description.md, issue.yaml, fix.yaml. Assumes you have a cluster running from Day 0.

Reproduce the issue

bash
kubectl apply -f issue.yaml
kubectl get pods

Wait about thirty seconds.

plaintext
NAME                         READY   STATUS             RESTARTS     AGE
liveness-probe-failure-pod   0/1     CrashLoopBackOff   4 (8s ago)   1m20s

Four restarts in eighty seconds, each one roughly 15 to 20 seconds apart. That timing is itself a clue.

Debug the hard way

Logs.

bash
kubectl logs liveness-probe-failure-pod
plaintext
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
2026/03/25 03:05:12 [notice] 1#1: start worker processes

Nginx started. Clean startup, no errors. The app is not crashing on its own. Describe it.

bash
kubectl describe pod liveness-probe-failure-pod
plaintext
Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Wed, 25 Mar 2026 03:05:12 +0530
  Finished:     Wed, 25 Mar 2026 03:05:27 +0530
Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Warning  Unhealthy  12s (x6 over 45s)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing    12s (x2 over 36s)  kubelet  Container nginx failed liveness probe, will be restarted

Read two things. Exit Code: 137 is 128 + 9, SIGKILL. Something killed the container from the outside; it did not die on its own. And the Killing event says why: "Container nginx failed liveness probe, will be restarted." The kubelet is the killer. The probe is the weapon.

Check the probe spec directly.

bash
kubectl get pod liveness-probe-failure-pod -o yaml | grep -A 6 livenessProbe
yaml
livenessProbe:
  httpGet:
    path: /nonexistent
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 3
  failureThreshold: 3

Read it like a recipe. Start probing after 5 seconds. Probe every 3 seconds. Kill after 3 failures. Total window before death: 14 seconds. An app that does not respond with a 200 at /nonexistent within 14 seconds is dead. And nginx does not serve /nonexistent at all, ever. It is a death sentence on a timer.
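The same spec once more, annotated field by field as a sketch of how the kubelet reads it:

yaml
livenessProbe:
  httpGet:
    path: /nonexistent   # nginx never serves this route, so every probe is a 404, a failure
    port: 80
  initialDelaySeconds: 5 # wait 5 seconds after container start before the first probe
  periodSeconds: 3       # then probe every 3 seconds
  failureThreshold: 3    # three consecutive failures and the kubelet restarts the container

There is one more field hiding behind the defaults: timeoutSeconds, which defaults to 1. A probe that hangs past the timeout counts as a failure too.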

Why this happens

A liveness probe exists to detect a wedged process and restart it. The kubelet runs the probe on a schedule, and when failureThreshold consecutive probes fail it kills the container: SIGTERM first, then SIGKILL after the termination grace period if the process is still alive, which is where 137 (128 + 9) comes from. The restart happens through the normal restart policy, which is why you end up with the combination of CrashLoopBackOff and exit code 137.

The dangerous part is the interaction with slow starts and transient dependency failures. If your probe hits the database, and the database hiccups for 20 seconds, the kubelet will happily roll your entire fleet while the database recovers. If your app takes 45 seconds to warm up and your initialDelaySeconds is 10, you are never going to get past the first probe window. The defaults are a trap. The safest liveness probe is the cheapest, most local check you can write.
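As a sketch of what cheap and local can look like. The /healthz path and port here are assumptions, not from the repo; the point is that the endpoint answers from the process itself, without touching a database or any downstream service:

yaml
livenessProbe:
  httpGet:
    path: /healthz       # hypothetical endpoint: answers from memory, no dependencies
    port: 8080           # assumed app port
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3    # roughly 30 seconds of consecutive failure before a restart

If the process is alive enough to serve a static 200, it is alive enough to keep running. Anything deeper belongs in a readiness probe, where failure means "stop sending traffic," not "kill me."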

The fix

bash
kubectl apply -f fix.yaml
kubectl get pods

The key change is the probe path and the timing. Broken:

yaml
livenessProbe:
  httpGet:
    path: /nonexistent
    port: 80
  periodSeconds: 3

Fixed:

yaml
livenessProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 5

/ returns 200, periodSeconds goes from 3 to 5, total grace window stretches from 14 to 20 seconds.

plaintext
NAME                               READY   STATUS    RESTARTS   AGE
liveness-probe-failure-fixed-pod   1/1     Running   0          15s

For a slow-starting app, the real fix is a startupProbe. That is exactly what it was added for. Let the startup probe take five minutes, and only then hand off to a tight liveness probe.
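A sketch of that handoff, reusing the probe path from fix.yaml; the startup timings are illustrative, not from the repo:

yaml
startupProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 10
  failureThreshold: 30   # up to ~300 seconds (5 min) to finish warming up
livenessProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 5
  failureThreshold: 3    # only armed after the startup probe succeeds

While the startup probe is running, the kubelet holds the liveness and readiness probes back entirely, so a slow warmup cannot trigger a kill.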

The lesson

  1. Exit code 137 plus clean application logs equals liveness probe kill. This correlation has never failed me in seven years.
  2. Liveness probes are self-inflicted wounds waiting to happen. Keep them local, cheap, and independent of any dependency.
  3. If any part of your app takes more than ten seconds to be ready, use a startupProbe. It is not optional, it is the correct answer.

Day 6 of 35 — tomorrow both probes fail at once and the kubelet has an argument with itself.
