Liveness Probe Killing Your Kubernetes Pods? Read This First

3:05 AM. A JVM service loads 400 MB of cached data on boot and takes 45 seconds to warm up on a cold node. The liveness probe has initialDelaySeconds: 10. You can already see what is about to happen. The kubelet probes at T+10 and gets connection refused because the app is still loading. It probes again at T+13 and T+16, hits failureThreshold: 3, and kills the container. The new container starts, goes through the same 45-second warmup, and gets the same kill. The pod is in CrashLoopBackOff and every log line I can find says the application is perfectly healthy. I spend ninety minutes convinced it is an OOM. It is not. The probe is killing the app before it is born.

The scenario

Figure 6 of 35 (Day 6, App, Liveness Probe): The app is healthy. The probe disagrees.

A slow-starting JVM service takes 45 seconds to warm up. The liveness probe fires at T+10 with a 1-second timeout. Three consecutive timeouts later, the kubelet kills the perfectly healthy container with SIGKILL. Restart count climbs. The app never gets a chance.

The app is healthy — the probe is too aggressive

This JVM service loads 400 MB of cache on boot and takes 45 seconds to warm up. restartCount: 5 is not the app crashing — it is the kubelet killing it before it is ready to answer.

timeoutSeconds: 1 is the default — and it is lethal here

The probe fires at initialDelaySeconds: 10. The app needs 45 seconds. Three consecutive 1-second timeouts trigger the kill. Raise initialDelaySeconds and failureThreshold before assuming the app is broken.

SIGKILL leaves no chance to flush state

The kubelet sends SIGTERM first, waits terminationGracePeriodSeconds (default 30s), then sends SIGKILL if the container is still running. A process cannot catch, block, or ignore SIGKILL, so anything not flushed by then is lost. The cycle repeats on every restart.

A liveness probe with timeoutSeconds: 1 kills a healthy slow app before it finishes warming up.

A pod with a slow-starting app is inside a Kubernetes cluster. The kubelet fires an HTTP liveness probe at /healthz with a 1-second timeout. The app is still loading and does not respond in time. After 3 consecutive failures, the kubelet sends SIGTERM then SIGKILL. The restart count increments and the cycle repeats.

Sources: pod.spec.containers.livenessProbe (kubectl explain pod.spec.containers.livenessProbe); livenessProbe.timeoutSeconds default 1, failureThreshold default 3 (kubectl explain pod.spec.containers.livenessProbe.timeoutSeconds); kind v0.22.0, Kubernetes 1.30.0.

This scenario comes from my troubleshoot-kubernetes-like-a-pro repo. You are going to reproduce the case where the app is fine and the probe is the problem, and learn to spot it from the exit code alone.

bash

git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/liveness-probe-failure
ls

You should see description.md, issue.yaml, and fix.yaml. This assumes you have a cluster running from Day 0.

Reproduce the issue

bash

kubectl apply -f issue.yaml
kubectl get pods

Wait a little over a minute, then run kubectl get pods again.

plaintext

NAME                         READY   STATUS             RESTARTS      AGE
liveness-probe-failure-pod   0/1     CrashLoopBackOff   4 (8s ago)    1m20s

Four restarts in eighty seconds, each one roughly 15 to 20 seconds apart. That timing is itself a clue.

Debug the hard way

Logs.

bash

kubectl logs liveness-probe-failure-pod

plaintext

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to
perform configuration
2026/03/25 03:05:12 [notice] 1#1: start worker processes

Nginx started. Clean startup, no errors. The app is not crashing on its own. Describe it.

bash

kubectl describe pod liveness-probe-failure-pod

plaintext

Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Wed, 25 Mar 2026 03:05:12 +0530
  Finished:     Wed, 25 Mar 2026 03:05:27 +0530
Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Warning  Unhealthy  12s (x6 over 45s)  kubelet  Liveness probe failed:
                                                   HTTP probe failed with statuscode: 404
  Normal   Killing    12s (x2 over 36s)  kubelet  Container nginx failed
                                                   liveness probe, will be restarted

Read two things. Exit Code: 137 means SIGKILL: 137 is 128 + 9, and signal 9 is SIGKILL. Something killed the container from the outside; it did not die on its own. And the Killing event says why: "Container nginx failed liveness probe, will be restarted." The kubelet is the killer. The probe is the weapon.
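
To make the 128 + 9 arithmetic concrete, here is a minimal sketch of a decoder for container exit codes. The function name signal_from_exit_code is my own, not part of kubectl or the repo; it only relies on the shell convention that a process killed by signal N reports exit status 128 + N.

```shell
# Hypothetical helper: map an exit code to the signal that caused it, if any.
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    # Exit codes above 128 mean "terminated by signal (code - 128)".
    kill -l "$((code - 128))"
  else
    # 0 to 128: the process exited on its own.
    echo "none"
  fi
}

signal_from_exit_code 137   # KILL
signal_from_exit_code 143   # TERM
signal_from_exit_code 1     # none
```

If you want the exit code without reading the whole describe output, kubectl get pod liveness-probe-failure-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}' should print it once the container has restarted at least once.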

Check the probe spec directly.

bash

kubectl get pod liveness-probe-failure-pod -o yaml | grep -A 6 livenessProbe

plaintext

    livenessProbe:
      httpGet:
        path: /nonexistent
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 3
      failureThreshold: 3

Read it like a recipe. Start probing after 5 seconds. Probe every 3 seconds. Kill after 3 consecutive failures. Total window before death: about 14 seconds at most (5 + 3 × 3). An app that does not answer /nonexistent with a 2xx or 3xx status inside that window is dead. And nginx does not serve /nonexistent at all, ever. It is a death sentence on a timer.
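
The window arithmetic can be sketched as two small functions. The names earliest_kill and latest_kill are mine, and both ignore timeoutSeconds and kubelet scheduling jitter, so treat the results as rough bounds rather than exact timings.

```shell
# Arguments for both helpers: initialDelaySeconds periodSeconds failureThreshold

# Earliest kill: the first probe fires right at initialDelaySeconds, and the
# last failure needed lands (failureThreshold - 1) periods later.
earliest_kill() {
  echo $(( $1 + ($3 - 1) * $2 ))
}

# Rule-of-thumb upper bound: the first probe may lag by up to one full period.
latest_kill() {
  echo $(( $1 + $3 * $2 ))
}

echo "issue.yaml probe: $(earliest_kill 5 3 3)s to $(latest_kill 5 3 3)s"
echo "intro JVM story:  $(earliest_kill 10 3 3)s at the earliest"
```

The first line gives 11s to 14s for the probe spec above. The second reproduces the T+16 kill from the opening story, far short of the 45 seconds that service needs.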

Why this happens

A liveness probe exists to detect a wedged process and restart it. The kubelet runs the probe on a schedule, and when failureThreshold consecutive probes fail, it kills the container: SIGTERM first, then SIGKILL if the process has not exited by the end of the grace period. The restart happens through the normal restart policy, which is why you end up with the combination of CrashLoopBackOff and exit code 137.
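
You can watch the same escalation on your laptop without a cluster. This is a rough sketch, not what the kubelet literally runs: the child process is a hypothetical stand-in for a container that ignores SIGTERM, and the one-second sleep stands in for the grace period.

```shell
# Stand-in "container": ignores SIGTERM, then becomes a long sleep.
sh -c 'trap "" TERM; exec sleep 30' &
pid=$!
sleep 1                      # give the child time to install its trap

kill -TERM "$pid"            # polite request, ignored by the child
sleep 1                      # stand-in for the termination grace period

# Still running after the grace period, so escalate to SIGKILL.
if kill -0 "$pid" 2>/dev/null; then
  kill -KILL "$pid"
fi

status=0
wait "$pid" 2>/dev/null || status=$?
echo "exit status: $status"  # 137 = 128 + 9 (SIGKILL)
```

A process that handles SIGTERM and exits in time never reaches the SIGKILL step, which is why 137 specifically tells you the kill came from outside and was not negotiable.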

The dangerous part is the interaction with slow starts and transient dependency failures. If your probe hits the database, and the database hiccups for 20 seconds, the kubelet will happily roll your entire fleet while the database recovers. If your app takes 45 seconds to warm up and your initialDelaySeconds is 10, you are never going to get past the first probe window. The defaults are a trap. The safest liveness probe is the cheapest, most local check you can write.

The fix

bash

kubectl apply -f fix.yaml
kubectl get pods

The key change is the probe path and the timing. Broken:

yaml

livenessProbe:
  httpGet:
    path: /nonexistent
    port: 80
  periodSeconds: 3

Fixed:

yaml

livenessProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 5

/ returns 200, periodSeconds goes from 3 to 5, and the total grace window stretches from roughly 14 to 20 seconds (5 + 3 × 5, with the other fields as before).

plaintext

NAME                               READY   STATUS    RESTARTS   AGE
liveness-probe-failure-fixed-pod   1/1     Running   0          15s

For a slow-starting app, the real fix is a startupProbe. That is exactly what it was added for. The kubelet holds off liveness checks until the startup probe succeeds, so let the startup probe take up to five minutes, and only then hand off to a tight liveness probe.
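
Here is a minimal sketch of that handoff, written out as a manifest you can inspect. It is not the repo's fix.yaml: the file name, pod name, and container name are hypothetical, and nginx is only a stand-in image. The probe fields themselves are standard pod spec fields, and the five-minute budget comes from failureThreshold × periodSeconds.

```shell
# Hypothetical output path for the example manifest.
manifest="${TMPDIR:-/tmp}/startup-probe-example.yaml"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: slow-start-example        # hypothetical name
spec:
  containers:
    - name: app
      image: nginx                # stand-in image
      startupProbe:
        httpGet:
          path: /
          port: 80
        periodSeconds: 10
        failureThreshold: 30      # 30 x 10s = up to 300s to start
      livenessProbe:
        httpGet:
          path: /
          port: 80
        periodSeconds: 5
        failureThreshold: 3       # tight check, runs only after startup succeeds
EOF

startup_budget=$((30 * 10))
echo "startup budget: ${startup_budget}s"

# With a cluster available, validate it without creating anything:
# kubectl apply --dry-run=client -f "$manifest"
```

The split lets each probe do one job: the startup probe absorbs the slow boot, and the liveness probe stays strict enough to catch a genuinely wedged process within seconds.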

The lesson

Exit code 137 plus clean application logs points to a liveness probe kill, and the Killing event in kubectl describe confirms it. This correlation has never failed me in seven years.
Liveness probes are self-inflicted wounds waiting to happen. Keep them local, cheap, and independent of any dependency.
If any part of your app takes more than ten seconds to be ready, use a startupProbe. It is not optional, it is the correct answer.

Day 6 of 35 — tomorrow both probes fail at once and the kubelet has an argument with itself.
