koti.dev
Mastering Kubernetes the Right Way · DAY 06 / 35

Liveness Probe Killing Your Kubernetes Pods? Read This First

Exit code 137 with clean app logs means the probe is the murderer. Here is how to catch it in the act.

Koti Vellanki · 25 Mar 2026 · 4 min read
Tags: kubernetes, debugging, probes

3:05 AM. JVM service, 400 MB of cached data loaded on boot, 45 seconds to warm up on a cold node. The liveness probe has initialDelaySeconds: 10. You can already see what is about to happen. The kubelet probes at T+10 and gets connection refused because the app is still loading. It probes again at T+13 and again at T+16, hits failureThreshold: 3, and kills the container with SIGKILL. New container starts, same 45-second warmup, same kill. The pod is in CrashLoopBackOff and every log line I can find says the application is perfectly healthy. I spend ninety minutes convinced it is an OOM kill. It is not. The probe is killing the app before it is born.

The scenario

DAY 6 · APP · LIVENESS PROBE

The app is healthy. The probe disagrees.

A slow-starting JVM service takes 45 seconds to warm up. The liveness probe fires at T+10 with a 1-second timeout. Three consecutive timeouts later, the kubelet kills the perfectly healthy container with SIGKILL. Restart count climbs. The app never gets a chance.

Figure 06 / 35. LivenessProbeFailure: probe timeout kills a healthy slow-starting container.
1. The app is healthy; the probe is too aggressive

This JVM service loads 400 MB of cache on boot and takes 45 seconds to warm up. restartCount: 5 is not the app crashing. It is the kubelet killing the container before it is ready to answer.

2. timeoutSeconds: 1 is the default, and it is lethal here

The first probe fires at initialDelaySeconds: 10. The app needs 45 seconds. Three consecutive 1-second timeouts trigger the kill. Raise initialDelaySeconds and failureThreshold before assuming the app is broken.

3. SIGKILL leaves no chance to flush state

The kubelet sends SIGTERM first, waits up to terminationGracePeriodSeconds (default 30s), then sends SIGKILL if the process is still running. The container cannot catch or block SIGKILL. The cycle repeats on every restart.
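
The arithmetic behind this cycle can be sketched in a few lines of shell. This is an idealized model, not the kubelet's actual scheduling: it assumes the first probe fires exactly at initialDelaySeconds, later probes fire every periodSeconds (3 seconds here, inferred from the T+10, T+13, T+16 sequence in the intro), and every probe sent before warmup completes fails. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Idealized model: probes fire at delay, delay+period, delay+2*period, ...
# and every probe that starts before the app has warmed up fails.

# failing_probes_during_warmup WARMUP DELAY PERIOD
# Prints how many consecutive probes fail before the app can answer.
failing_probes_during_warmup() {
  local warmup="$1" delay="$2" period="$3"
  if [ "$warmup" -le "$delay" ]; then
    echo 0
    return
  fi
  # Ceiling division: count of probe times t = delay + k*period with t < warmup.
  echo $(( (warmup - delay + period - 1) / period ))
}

# The slow JVM service: 45s warmup, initialDelaySeconds 10, periodSeconds 3 (assumed).
fails=$(failing_probes_during_warmup 45 10 3)
echo "probes failing during warmup: $fails"
echo "smallest failureThreshold that survives: $(( fails + 1 ))"
```

Under these assumptions twelve probes land before the app is ready, so a failureThreshold of 3 is exhausted long before the warmup ends.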

A liveness probe with timeoutSeconds: 1 kills a healthy slow app before it finishes warming up.
A pod with a slow-starting app is inside a Kubernetes cluster. The kubelet fires an HTTP liveness probe at /healthz with a 1-second timeout. The app is still loading and does not respond in time. After 3 consecutive failures, the kubelet sends SIGTERM then SIGKILL. The restart count increments and the cycle repeats.
Source: kubectl explain pod.spec.containers.livenessProbe (timeoutSeconds defaults to 1, failureThreshold defaults to 3). Environment: kind v0.22.0, Kubernetes 1.30.0.

From my troubleshoot-kubernetes-like-a-pro repo. You are going to reproduce the case where the app is fine and the probe is the problem, and learn to spot it from exit code alone.

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/liveness-probe-failure
ls

The directory contains description.md, issue.yaml, and fix.yaml. This assumes you have a cluster running from Day 0.

Reproduce the issue

bash
kubectl apply -f issue.yaml
kubectl get pods

Wait a minute or so.

plaintext
NAME                         READY   STATUS             RESTARTS     AGE
liveness-probe-failure-pod   0/1     CrashLoopBackOff   4 (8s ago)   1m20s

Four restarts in eighty seconds, each one roughly 15 to 20 seconds apart. That timing is itself a clue.

Debug the hard way

Logs.

bash
kubectl logs liveness-probe-failure-pod

plaintext
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
2026/03/25 03:05:12 [notice] 1#1: start worker processes

Nginx started. Clean startup, no errors. The app is not crashing on its own. Describe it.

bash
kubectl describe pod liveness-probe-failure-pod

plaintext
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 25 Mar 2026 03:05:12 +0530
      Finished:     Wed, 25 Mar 2026 03:05:27 +0530
Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Warning  Unhealthy  12s (x6 over 45s)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing    12s (x2 over 36s)  kubelet  Container nginx failed liveness probe, will be restarted

Read two things. Exit Code: 137 means SIGKILL. Something killed the container from the outside, it did not die on its own. And the Killing event says why: "Container nginx failed liveness probe, will be restarted." The kubelet is the killer. The probe is the weapon.
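
The 137 is not arbitrary. By shell convention, an exit code above 128 means the process was terminated by signal number (code minus 128), and 137 minus 128 is 9, which is SIGKILL. A minimal sketch of that decoding, using the shell's built-in `kill -l` to turn a signal number into its name (the function name is hypothetical):

```shell
#!/usr/bin/env bash
# Decode a container exit code into the signal that ended the process, if any.
# Exit codes above 128 conventionally mean "terminated by signal (code - 128)".
signal_from_exit_code() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    kill -l "$((code - 128))"   # shell builtin: signal number -> signal name
  else
    echo "none"
  fi
}

signal_from_exit_code 137   # prints KILL
signal_from_exit_code 143   # prints TERM
signal_from_exit_code 1     # prints none
```

The same decoding applies to the Exit Code field of any terminated container, so a 143 would point at SIGTERM.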

Check the probe spec directly.

bash
kubectl get pod liveness-probe-failure-pod -o yaml | grep -A 6 livenessProbe

plaintext
livenessProbe:
  httpGet:
    path: /nonexistent
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 3
  failureThreshold: 3

Read it like a recipe. Start probing after 5 seconds. Probe every 3 seconds. Kill after 3 consecutive failures. Total window before death: 14 seconds at most (5 + 3 × 3). An app that does not answer /nonexistent with a 2xx or 3xx status inside that window is dead. And nginx does not serve /nonexistent at all, ever; it returns the 404 you saw in the events. It is a death sentence on a timer.
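
The window can be worked out for any probe spec. The sketch below is an approximation under stated assumptions: the first probe fires exactly at initialDelaySeconds, probes repeat every periodSeconds, and timeoutSeconds and kubelet scheduling jitter are ignored. It prints both the idealized time of the final failing probe and the looser delay + period × threshold figure that gives the 14 seconds in the text. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# probe_kill_window DELAY PERIOD THRESHOLD
# Prints the idealized time of the final failing probe and the looser
# delay + period * threshold bound.
probe_kill_window() {
  local delay="$1" period="$2" threshold="$3"
  local last_failure=$(( delay + (threshold - 1) * period ))
  local upper_bound=$(( delay + threshold * period ))
  echo "last_failure=${last_failure}s upper_bound=${upper_bound}s"
}

probe_kill_window 5 3 3   # issue.yaml values: last_failure=11s upper_bound=14s
```

Either way the container gets well under twenty seconds, which matches the 15-second gap between Started and Finished in the describe output.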

Why this happens

A liveness probe exists to detect a wedged process and restart it. The kubelet runs the probe on a schedule, and when failureThreshold consecutive probes fail, it kills the container: SIGTERM first, then SIGKILL if the process is still running after the grace period. The restart happens through the normal restart policy, which is why you end up with the combination of CrashLoopBackOff and exit code 137.

The dangerous part is the interaction with slow starts and transient dependency failures. If your probe hits the database, and the database hiccups for 20 seconds, the kubelet will happily roll your entire fleet while the database recovers. If your app takes 45 seconds to warm up and your initialDelaySeconds is 10, you are never going to get past the first probe window. The defaults are a trap. The safest liveness probe is the cheapest, most local check you can write.
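
The mechanism in this section leaves two traces that can be pulled directly instead of scanning the full describe output: the previous container's exit code, and the kubelet's Killing events. A small sketch, assuming the pod lives in the current namespace and has a single container. The helper names are hypothetical; the kubectl flags and status fields are standard.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; both need kubectl and a reachable cluster.

# Exit code of the previous run of the pod's first container.
last_exit_code() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Kubelet "Killing" events for the pod (the message names the failed probe).
killing_events() {
  kubectl get events \
    --field-selector "involvedObject.name=$1,reason=Killing"
}

# Usage against the scenario pod:
#   last_exit_code liveness-probe-failure-pod
#   killing_events liveness-probe-failure-pod
```

A 137 from the first helper together with a "failed liveness probe" message from the second is the signature described above.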

The fix

bash
kubectl apply -f fix.yaml
kubectl get pods

The key changes are the probe path and the timing. Broken:

yaml
livenessProbe:
  httpGet:
    path: /nonexistent
    port: 80
  periodSeconds: 3

Fixed:

yaml
livenessProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 5

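/ returns 200 and periodSeconds goes from 3 to 5, so the total grace window stretches from 14 to 20 seconds (5 + 3 × 5). The path change is what stops the restarts; the slower period only adds slack.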

plaintext
NAME                               READY   STATUS    RESTARTS   AGE
liveness-probe-failure-fixed-pod   1/1     Running   0          15s

For a slow-starting app, the real fix is a startupProbe. That is exactly what it was added for. Let the startup probe take five minutes, and only then hand off to a tight liveness probe.
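
As a rough illustration of that hand-off, the function below prints a Pod manifest with a five-minute startup budget followed by a tight liveness probe. Everything specific in it is an assumption made for the example, not taken from the repo's fix.yaml: the pod name, the image, port 8080, and the /healthz path. The kubelet does not run the liveness probe until the startup probe has succeeded.

```shell
#!/usr/bin/env bash
# Prints a hypothetical Pod manifest. Pod name, image, port, and /healthz path
# are illustrative assumptions, not taken from the repo's fix.yaml.
slow_app_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: slow-app
spec:
  containers:
    - name: slow-app
      image: example.com/slow-jvm-app:latest   # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 30   # 30 x 10s = up to 5 minutes to start
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 3    # only evaluated after the startup probe succeeds
EOF
}

# Validate locally without creating anything (requires kubectl):
#   slow_app_manifest | kubectl apply --dry-run=client -f -
```

The design choice is to put all the slack in the startup probe, so the liveness probe can stay strict once the app is warm.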

The lesson

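  1. Exit code 137 plus clean application logs, with Reason: Error rather than OOMKilled, points to a liveness probe kill. Confirm it with the Killing event in kubectl describe. This correlation has never failed me in seven years.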
  2. Liveness probes are self-inflicted wounds waiting to happen. Keep them local, cheap, and independent of any dependency.
  3. If any part of your app takes more than ten seconds to be ready, use a startupProbe. It is not optional, it is the correct answer.

Day 6 of 35 — tomorrow both probes fail at once and the kubelet has an argument with itself.
