3:05 AM. JVM service, 400 MB of cached data loaded on boot, it takes 45 seconds to warm up on a cold node. The liveness probe has initialDelaySeconds: 10. You can already see what is about to happen. The kubelet probes at T+10, gets connection refused because the app is still loading, probes again at T+13, again at T+16, hits failureThreshold: 3, and kills the container with SIGKILL. New container starts, same 45 second warmup, same kill. The pod is in CrashLoopBackOff and every log line I can find says the application is perfectly healthy. I spend ninety minutes convinced it is an OOM. It is not. The probe is killing the app before it is born.
The scenario
From my troubleshoot-kubernetes-like-a-pro repo. You are going to reproduce the case where the app is fine and the probe is the problem, and learn to spot it from exit code alone.
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/liveness-probe-failure
ls
description.md  issue.yaml  fix.yaml
Assumes you have a cluster running from Day 0.
Reproduce the issue
kubectl apply -f issue.yaml
kubectl get pods
Wait about thirty seconds.
NAME                         READY   STATUS             RESTARTS     AGE
liveness-probe-failure-pod   0/1     CrashLoopBackOff   4 (8s ago)   1m20s
Four restarts in eighty seconds, each one roughly 15 to 20 seconds apart. That timing is itself a clue.
Debug the hard way
Logs.
kubectl logs liveness-probe-failure-pod
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
2026/03/25 03:05:12 [notice] 1#1: start worker processes
Nginx started. Clean startup, no errors. The app is not crashing on its own. Describe it.
kubectl describe pod liveness-probe-failure-pod
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Wed, 25 Mar 2026 03:05:12 +0530
Finished: Wed, 25 Mar 2026 03:05:27 +0530
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 12s (x6 over 45s) kubelet Liveness probe failed:
HTTP probe failed with statuscode: 404
Normal Killing 12s (x2 over 36s) kubelet Container nginx failed
liveness probe, will be restarted
Read two things. Exit Code: 137 means SIGKILL. Something killed the container from the outside, it did not die on its own. And the Killing event says why: "Container nginx failed liveness probe, will be restarted." The kubelet is the killer. The probe is the weapon.
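You can pull that exit code directly with jsonpath instead of scanning the describe output: kubectl get pod liveness-probe-failure-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'. The snippet below runs the same extraction offline against a hypothetical saved status fragment, so the parsing logic is visible without a cluster:

```shell
# Hypothetical status fragment; the shape matches a pod's containerStatuses entry.
status='{"lastState":{"terminated":{"exitCode":137,"reason":"Error"}}}'

# Pull out the numeric exit code with grep alone (no jq dependency).
exit_code=$(printf '%s' "$status" | grep -o '"exitCode":[0-9]*' | grep -o '[0-9]*$')
echo "exit code: $exit_code"

# 137 = 128 + 9: the process died on signal 9, SIGKILL.
if [ "$exit_code" -eq 137 ]; then
  echo "killed from outside: suspect the kubelet, not the app"
fi
```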
Check the probe spec directly.
kubectl get pod liveness-probe-failure-pod -o yaml | grep -A 6 livenessProbe
livenessProbe:
httpGet:
path: /nonexistent
port: 80
initialDelaySeconds: 5
periodSeconds: 3
failureThreshold: 3
Read it like a recipe. Start probing after 5 seconds. Probe every 3 seconds. Kill after 3 consecutive failures. Total window before death: 14 seconds. An app that does not answer with a success status (anything from 200 to 399) at /nonexistent within 14 seconds is dead. And nginx does not serve /nonexistent at all, ever, so every single probe comes back 404. It is a death sentence on a timer.
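The window math is worth checking with plain shell arithmetic. The formula below is the simple reading of the spec (initial delay plus one period per allowed failure); the numbers come straight from the probe fields:

```shell
# Kill window for the broken spec:
# window = initialDelaySeconds + periodSeconds * failureThreshold
initial_delay=5
period=3
failure_threshold=3
window=$((initial_delay + period * failure_threshold))
echo "broken spec kill window: ${window}s"   # 14s

# Same formula with the fixed spec's periodSeconds: 5
echo "fixed spec kill window: $((initial_delay + 5 * failure_threshold))s"   # 20s
```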
Why this happens
A liveness probe exists to detect a wedged process and restart it. The kubelet runs the probe on a schedule, and when failureThreshold consecutive probes fail, it kills the container: SIGTERM first, then SIGKILL once the termination grace period runs out. Exit code 137 is 128 + 9, the signature of a process that died on signal 9. The restart happens through the normal restart policy, which is why you end up with the combination of CrashLoopBackOff and exit code 137.
The dangerous part is the interaction with slow starts and transient dependency failures. If your probe hits the database, and the database hiccups for 20 seconds, the kubelet will happily roll your entire fleet while the database recovers. If your app takes 45 seconds to warm up and your initialDelaySeconds is 10, you are never going to get past the first probe window. The defaults are a trap. The safest liveness probe is the cheapest, most local check you can write.
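The kubelet's decision loop is simple enough to sketch in a few lines of shell. This is a toy model, not kubelet internals; the probe function stands in for the HTTP GET, and since /nonexistent always 404s, it always fails:

```shell
# Toy model of the kubelet liveness loop: probe every periodSeconds,
# count consecutive failures, restart the container at failureThreshold.
failure_threshold=3
consecutive_failures=0

probe() {
  # Stand-in for the HTTP GET; /nonexistent always returns 404, so always fail.
  return 1
}

for tick in 1 2 3 4 5; do
  if probe; then
    consecutive_failures=0          # any success resets the counter
  else
    consecutive_failures=$((consecutive_failures + 1))
  fi
  if [ "$consecutive_failures" -ge "$failure_threshold" ]; then
    echo "tick $tick: threshold hit, container will be killed and restarted"
    break
  fi
done
```

Note the reset on success: one good probe wipes the slate clean, which is why a flaky check that fails two out of every three probes can limp along forever while a check that fails three in a row kills instantly.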
The fix
kubectl apply -f fix.yaml
kubectl get pods
The key change is the probe path and the timing. Broken:
livenessProbe:
httpGet:
path: /nonexistent
port: 80
periodSeconds: 3
Fixed:
livenessProbe:
httpGet:
path: /
port: 80
periodSeconds: 5
/ returns 200, periodSeconds goes from 3 to 5, and the total grace window stretches from 14 to 20 seconds.
NAME                               READY   STATUS    RESTARTS   AGE
liveness-probe-failure-fixed-pod   1/1     Running   0          15s
For a slow-starting app, the real fix is a startupProbe. That is exactly what it was added for. Let the startup probe take five minutes, and only then hand off to a tight liveness probe.
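A sketch of that handoff, assuming a worst-case warmup of five minutes; the specific numbers here are illustrative, not from fix.yaml:

```yaml
# startupProbe owns the boot window: up to 30 x 10s = 300s before the
# kubelet gives up. The liveness probe does not run until this succeeds once.
startupProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 10
  failureThreshold: 30
# After startup passes, a tight liveness probe takes over.
livenessProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 5
  failureThreshold: 3
```

The design point: the generous failureThreshold lives only on the startup probe, so a wedged process at steady state still gets caught within 15 seconds instead of 5 minutes.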
The lesson
- Exit code 137 plus clean application logs equals liveness probe kill. This correlation has never failed me in seven years.
- Liveness probes are self-inflicted wounds waiting to happen. Keep them local, cheap, and independent of any dependency.
- If any part of your app takes more than ten seconds to be ready, use a startupProbe. It is not optional, it is the correct answer.
Day 6 of 35 — tomorrow both probes fail at once and the kubelet has an argument with itself.
