Mastering Kubernetes the Right Way · DAY 20 / 35

Kubernetes Disk I/O Errors: Pod Symptoms, Node Root Cause

The container is crashing. The node is the reason. Here is how to prove it in 90 seconds.

Koti Vellanki · 08 Apr 2026 · 3 min read
kubernetes · debugging · storage

4:02 AM. An ingest service was dying on exactly one node out of twenty-three. The pod spec was identical everywhere. The image was identical. The config was identical. Same Helm release, same values. On twenty-two nodes it ran fine. On worker-14 it crashed within three seconds every single time, with a cryptic no such file or directory coming out of a path that existed on every other host. Twenty minutes of reading Go stack traces and diffing node labels got me nowhere. The moment I ran kubectl describe pod and looked at the volume block, the pattern snapped into place: we had a hostPath volume pointing at a directory that had been wiped off worker-14 during a disk replacement two days earlier and never restored.

The scenario

The repo reproduces this cleanly with a hostPath that points at a path guaranteed not to exist.

```bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/disk-io-errors
ls
```

issue.yaml mounts hostPath: /nonexistent-path with type: Directory. The kubelet will try to stat that directory on the node and fail.
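The manifest looks roughly like this. This is a sketch, not the repo's exact file: the pod name and volume name match the output shown below, but the image, command, and mountPath are assumptions — check issue.yaml for the real spec.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: disk-io-error-pod
spec:
  containers:
    - name: app
      image: busybox               # assumed image; the repo may use something else
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data-volume
          mountPath: /data         # assumed mount path
  volumes:
    - name: data-volume
      hostPath:
        path: /nonexistent-path
        type: Directory            # kubelet stats this path on the node before mounting
```

The `type: Directory` line is what makes the failure loud and early: the kubelet refuses to set up the volume if the path is missing, instead of silently creating an empty directory the way `type: DirectoryOrCreate` would.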

Reproduce the issue

```bash
kubectl apply -f issue.yaml
kubectl get pod disk-io-error-pod
```
```plaintext
NAME                READY   STATUS              RESTARTS   AGE
disk-io-error-pod   0/1     ContainerCreating   0          30s
```

Stuck in ContainerCreating. No restarts, no crash loop, just waiting. That is the first tell: a hostPath problem does not crash the container; it prevents the container from ever starting.

Debug the hard way

```bash
kubectl describe pod disk-io-error-pod
```
```plaintext
Events:
  Type     Reason       Age                From     Message
  ----     ------       ----               ----     -------
  Warning  FailedMount  18s (x3 over 35s)  kubelet  MountVolume.SetUp failed for volume "data-volume" : hostPath type check failed: /nonexistent-path is not a directory
```

The error names the path, names the check that failed, and tells you exactly which node component emitted it. The kubelet did a type: Directory validation, the directory did not exist, and the mount step aborted.

```bash
kubectl get pod disk-io-error-pod -o jsonpath='{.spec.nodeName}{"\n"}'
# worker-14
```

Now you know which node to look at. If you can, SSH there (kubectl debug node/worker-14 works too when direct SSH is locked down):

```bash
ssh worker-14 "ls -la /nonexistent-path"
# ls: cannot access '/nonexistent-path': No such file or directory
```

Confirmed. The pod-level error was a faithful report of a node-level reality. Nothing was wrong with the container, the image, or the cluster control plane.

Why this happens

Any pod that mounts node-local storage (hostPath, local PV, some CSI drivers with pre-provisioned volumes) couples the pod's health to the node's disk state. Kubernetes does not replicate hostPath directories between nodes, and it does not create them on demand. If you write path: /data/logs in your spec, that directory has to exist on every node where the pod might land, or the mount fails — unless you use a nodeSelector to pin the pod to a node where the directory does exist.
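If you genuinely need the hostPath, the pinning option looks like this. A sketch only: the label key and value are hypothetical, so substitute whatever label your cluster uses to mark nodes that have been prepared with the directory.

```yaml
spec:
  nodeSelector:
    storage/prepared: "true"   # hypothetical label, applied only to nodes where /data/logs exists
  volumes:
    - name: data-volume
      hostPath:
        path: /data/logs
        type: Directory
```

The trade-off is obvious: you have traded a random mount failure for a scheduling constraint you now have to keep in sync with the actual state of node disks, which is exactly the kind of drift that bit worker-14.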

The reason these bugs feel mysterious is that kubectl presents everything from the pod's point of view, but the root cause lives one layer down on the node. The pod looks broken when actually the filesystem under it is missing. Storage hardware failures, disk replacements, rebuilds, and manual cleanups all leave nodes in slightly different states over time, and any workload that assumes filesystem uniformity across nodes is one bad reboot away from a FailedMount.

The lesson I took from the worker-14 outage: if your pod events mention mounts and your pod works on some nodes but not others, stop reading application logs. Walk to the node.

The fix

The repo's fix swaps the hostPath for an emptyDir, which removes the dependency on any specific node filesystem:

```bash
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml
```
```yaml
volumes:
  - name: data-volume
    emptyDir: {}
```
```bash
kubectl get pod disk-io-error-fixed-pod
# disk-io-error-fixed-pod   1/1   Running   0   10s
```

For a real workload that needs persistence, replace hostPath with a proper PVC backed by a CSI driver. HostPath is fine for system daemons and debugging. For application state, it is a trap.
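A minimal sketch of that swap. The claim name is hypothetical and the StorageClass is an assumption — use whatever class your CSI driver actually provisions:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ingest-data              # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard     # assumed class; list yours with: kubectl get storageclass
  resources:
    requests:
      storage: 10Gi
---
# In the pod spec, the hostPath volume becomes a claim reference:
# volumes:
#   - name: data-volume
#     persistentVolumeClaim:
#       claimName: ingest-data
```

With a dynamically provisioned PVC, the volume follows the claim rather than the node, so a disk replacement on one worker can no longer strand the workload.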

The lesson

  1. Mount errors live in pod events, not pod logs. kubectl describe pod is the first command, not the fourth.
  2. HostPath and local-PV failures are node-local. Always resolve the pod's nodeName before you start guessing.
  3. If you need persistence, use a PVC backed by a real CSI driver. HostPath is a debugging tool, not a storage strategy.

Day 20 of 35. Tomorrow, the eviction that hits your pod because a completely different pod filled the node's disk.
