koti.dev
Mastering Kubernetes the Right Way · DAY 07 / 35

Liveness and Readiness Failing Together: Why Startup Probes Exist

Two probes, one bug, two outages. Read the event ages and you will know the order of operations.

Koti Vellanki · 26 Mar 2026 · 4 min read
kubernetes · debugging · probes

3:22 AM. An API service whose two probes share a /health endpoint. A developer renamed it to /api/v1/health in a "cleanup" PR that passed review. Readiness fails at T+8, traffic drains, the Service endpoints go empty, and the load balancer starts returning 503 to every request. At T+14 liveness trips its threshold and the kubelet kills the container. The container restarts with the same renamed route, the same failure, the same outage. The dashboard looks like the service has been unplugged. Two kubelet probes are arguing about whether this pod should even exist, and I have to untangle which failure came first before I can stop the bleeding.

The scenario

DAY 7 · APP · PROBE COMPARISON

Two probes, two behaviours. One gates traffic. One gates restarts.

Readiness and liveness probes look identical in YAML but have completely different consequences when they fail. Use the wrong one and you either take traffic during a deploy or restart-loop a pod that is perfectly healthy.

Figure 07 / 35. Liveness vs readiness probe failure, side-by-side comparison. Left panel: a readiness probe failure sets Ready=False and removes the pod from the Service Endpoints (it moves to notReadyAddresses); the pod keeps running. Right panel: a liveness probe failure causes the kubelet to kill and restart the container (SIGTERM, then SIGKILL).
  1. Readiness failure: pod stays alive. kubelet sets Ready=False on the PodCondition. The process keeps running: no kill, no restart. The pod can recover without kubelet touching it if the probe starts passing again.
  2. Removed from Endpoints: traffic drains. The Endpoints controller watches PodConditions. A Ready=False pod moves to notReadyAddresses. kube-proxy removes its iptables/IPVS rules. No traffic reaches the pod until it passes again.
  3. Liveness failure: kubelet kills the container. After failureThreshold consecutive failures kubelet sends SIGTERM, then SIGKILL after terminationGracePeriodSeconds. The container is restarted per restartPolicy. Every liveness kill increments restartCount.
  4. Wrong probe, wrong consequence. Use liveness on a slow-starting pod and kubelet kills it in a restart loop. Use readiness as a health gate during a deploy and traffic drains but the container stays running and can recover. Choose the probe type that matches the recovery you want.

Side-by-side: readiness failure drains the pod from Endpoints while it keeps running; liveness failure triggers a kubelet container kill and restart.
pod.spec.containers.livenessProbe — kubectl explain pod.spec.containers.livenessProbe · pod.spec.containers.readinessProbe — kubectl explain pod.spec.containers.readinessProbe · kind v0.22.0, Kubernetes 1.30.0

From my troubleshoot-kubernetes-like-a-pro repo. This is the final probe scenario, and the one that forces you to read event ages as a timeline instead of a blob.

```bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/liveness-readiness-failure
ls
```

description.md, issue.yaml, fix.yaml. Assumes you have a cluster running from Day 0.

Reproduce the issue

```bash
kubectl apply -f issue.yaml
kubectl get pods
```

Wait about a minute.

```plaintext
NAME                             READY   STATUS             RESTARTS     AGE
liveness-readiness-failure-pod   0/1     CrashLoopBackOff   3 (5s ago)   1m10s
```

0/1 and CrashLoopBackOff at the same time. Two symptoms, which is what makes this confusing. You have to figure out which one to fix first, or whether they share a cause.

Debug the hard way

Logs are useless because nginx is happy. Describe.

```bash
kubectl describe pod liveness-readiness-failure-pod
```

```plaintext
Conditions:
  Type              Status
  Ready             False
  ContainersReady   False
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  50s (x8 over 1m10s)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 404
  Warning  Unhealthy  14s (x6 over 50s)    kubelet  Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing    14s (x2 over 40s)    kubelet  Container nginx failed liveness probe, will be restarted
```

Read the ages like a timeline. Readiness first failed 1m10s ago. Liveness first failed 50s ago. The gap is not random. A container starts out not ready and only becomes Ready after its readiness probe succeeds, so a broken readiness endpoint keeps the pod out of the endpoints from the first second, independent of liveness. Liveness needs failureThreshold consecutive failures, roughly failureThreshold * periodSeconds, before it triggers a kill. So the order here is: readiness fails first, traffic drains, then liveness pulls the trigger, then the container restarts and the clock resets.
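One way to make that timeline explicit is to list the pod's events with absolute timestamps instead of relative ages. This is a minimal sketch, assuming the pod lives in the current namespace; `pod_event_timeline` is a hypothetical helper name, not something shipped in the repo.

```bash
# Hypothetical helper: list a pod's events ordered by last-seen time,
# with the first-seen timestamp and repeat count alongside.
pod_event_timeline() {
  # $1 = pod name (required)
  kubectl get events \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp \
    -o custom-columns='FIRST:.firstTimestamp,LAST:.lastTimestamp,COUNT:.count,REASON:.reason,MESSAGE:.message'
}

# Usage (needs a running cluster):
#   pod_event_timeline liveness-readiness-failure-pod
```

The FIRST column corresponds to the "over 1m10s" part of the describe output, which is the number that tells you which probe started failing first.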

Confirm the cause in one row:

```bash
kubectl get pod liveness-readiness-failure-pod \
  -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,RESTARTS:.status.containerStatuses[0].restartCount,LASTEXIT:.status.containerStatuses[0].lastState.terminated.exitCode'
```

```plaintext
NAME                             READY   RESTARTS   LASTEXIT
liveness-readiness-failure-pod   False   3          137
```

Ready: False, RESTARTS: 3, LASTEXIT: 137. Two probes broken, one shared root cause.
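To see the shared root cause directly, you can print the path each probe is configured to hit. A rough sketch, assuming both probes are httpGet probes on the first container (the 404 messages in the events suggest they are); `probe_paths` is a hypothetical helper name.

```bash
# Hypothetical helper: show the HTTP path used by each probe on the first container.
probe_paths() {
  # $1 = pod name (required)
  kubectl get pod "$1" \
    -o custom-columns='NAME:.metadata.name,LIVENESS:.spec.containers[0].livenessProbe.httpGet.path,READINESS:.spec.containers[0].readinessProbe.httpGet.path'
}

# Usage (needs a running cluster):
#   probe_paths liveness-readiness-failure-pod
```

If both columns show the same path, one broken route is enough to fail both probes.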

Why this happens

Readiness and liveness share the same probe machinery in the kubelet but answer different questions. Readiness answers "should the Service route traffic here right now?" and its failure is cheap, just mark the pod unready. Liveness answers "is the process wedged beyond recovery?" and its failure is expensive, SIGKILL and restart. Both run on their own schedules, both independent, both reading their own result off the same HTTP endpoint if you are reckless enough to point them at the same path.

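When the shared endpoint breaks, readiness takes effect first in this scenario. A restarted container is not ready until its readiness probe succeeds once, and going unready only flips a condition. Liveness has to fail failureThreshold times in a row before the kubelet kills anything. (On a pod that was already Ready, readiness waits for its own failureThreshold too, so the exact gap depends on how each probe is tuned.) You end up with a narrow window where the pod is hidden from the Service but the container is still alive, followed by the SIGKILL, followed by a restart that starts the entire cycle again. Two outages stacked on top of each other from one renamed route.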
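The size of that window can be estimated with the same approximation used above, initialDelaySeconds + failureThreshold * periodSeconds. The values below are assumptions: periodSeconds=3 is the pre-fix value mentioned later in this post, failureThreshold=3 and initialDelaySeconds=0 are the Kubernetes defaults, and the scenario's issue.yaml may set them differently. `probe_budget` is a hypothetical helper name.

```bash
# Rough time until a continuously failing probe takes effect:
#   initialDelaySeconds + failureThreshold * periodSeconds
probe_budget() {
  # $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
  echo $(( $1 + $2 * $3 ))
}

# Assumed values: no initial delay, period 3s, threshold 3 (Kubernetes default).
liveness_kill_after=$(probe_budget 0 3 3)
echo "liveness kill after ~${liveness_kill_after}s of continuous failure"
# prints: liveness kill after ~9s of continuous failure
```

Treat the result as an estimate only. The kubelet adds jitter to probe start times, and each probe attempt can take up to timeoutSeconds to report.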

This is also exactly why startupProbe was added in Kubernetes 1.16. Before it existed, you had to cram slow-start tolerance into initialDelaySeconds, which meant either crippling your liveness probe or accepting that cold starts would kill pods. A startup probe disables both liveness and readiness entirely until it passes, which gives slow-starting apps a safe window to finish booting.

The fix

```bash
kubectl apply -f fix.yaml
kubectl get pods
```

The change is both probes pointed at / and periodSeconds relaxed from 3 to 5:

```yaml
livenessProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 5
```

```plaintext
NAME                                   READY   STATUS    RESTARTS   AGE
liveness-readiness-failure-fixed-pod   1/1     Running   0          11s
```

The real fix you would push in a PR review is three probes, not two. A generous startupProbe that covers the boot window. A tight readinessProbe that owns traffic routing. A cheap local livenessProbe that only catches true deadlocks and never touches a dependency.
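A three-probe layout along those lines might look like the sketch below. It is not the repo's fix.yaml: the pod name, image, port, and the /healthz/* paths are all hypothetical, and the thresholds are illustrative starting points rather than recommendations.

```bash
# Write a three-probe Pod manifest to a scratch file.
# Every name, the image, the port, and the /healthz/* paths are hypothetical.
manifest="${TMPDIR:-/tmp}/three-probes.yaml"
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: three-probes-demo
spec:
  containers:
    - name: api
      image: registry.example.com/api:1.0
      ports:
        - containerPort: 8080
      # Generous boot window: up to 24 * 5s = 120s before the other probes start.
      startupProbe:
        httpGet:
          path: /healthz/startup
          port: 8080
        periodSeconds: 5
        failureThreshold: 24
      # Tight traffic gate: may check dependencies, only affects routing.
      readinessProbe:
        httpGet:
          path: /healthz/ready
          port: 8080
        periodSeconds: 5
        failureThreshold: 2
      # Cheap local check: no dependency calls, only catches a wedged process.
      livenessProbe:
        httpGet:
          path: /healthz/live
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
EOF

echo "wrote $manifest"
# To check it against a cluster without creating anything (needs kubectl):
#   kubectl apply --dry-run=client -f "$manifest"
```

Each probe gets its own path here, so a change to one route cannot take down routing and restarts at the same time.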

The lesson

  1. When both probes fail, read the event ages as a timeline. With comparable settings, readiness fails first, and liveness needs about failureThreshold * periodSeconds of consecutive failures before the kill. The pattern repeats on every restart.
  2. Liveness and readiness should almost never share an endpoint. Different questions, different failure modes, different blast radius.
  3. Startup probes exist for exactly this situation. If any boot step takes more than a few seconds, a startupProbe is the correct answer, not a looser initialDelaySeconds.

Day 7 of 35 — tomorrow the pod boots cleanly and then quietly gets killed by the node itself.
