4:02 AM. An ingest service was dying on exactly one node out of twenty-three. The pod spec was identical everywhere. The image was identical. The config was identical. Same Helm release, same values. On twenty-two nodes it ran fine. On worker-14 it crashed within three seconds every single time, with a cryptic `no such file or directory` coming out of a path that existed on every other host. Twenty minutes of reading Go stack traces and diffing node labels got me nowhere. The moment I ran `kubectl describe pod` and looked at the volume block, the pattern snapped into place: we had a hostPath volume pointing at a directory that had been wiped off worker-14 during a disk replacement two days earlier and never restored.
The scenario
The repo reproduces this cleanly with a hostPath that points at a path guaranteed not to exist.
```bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/disk-io-errors
ls
```

`issue.yaml` mounts `hostPath: /nonexistent-path` with `type: Directory`. The kubelet will try to stat that directory on the node and fail.
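Reconstructed from the symptoms described here, the relevant part of `issue.yaml` looks roughly like this (the image and command are illustrative assumptions, not copied from the repo; the pod, volume, and path names come from the outputs below):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: disk-io-error-pod
spec:
  containers:
    - name: app
      image: busybox:1.36   # assumption: the repo may use a different image
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data-volume
          mountPath: /data
  volumes:
    - name: data-volume
      hostPath:
        path: /nonexistent-path
        # type: Directory means the kubelet stats the path on the node
        # and refuses to mount if it is not an existing directory.
        type: Directory
```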
Reproduce the issue
```bash
kubectl apply -f issue.yaml
kubectl get pod disk-io-error-pod
```

```
NAME                READY   STATUS              RESTARTS   AGE
disk-io-error-pod   0/1     ContainerCreating   0          30s
```

Stuck in ContainerCreating. No restarts, no crashloop, just waiting. That is the first tell: a hostPath problem does not crash the container; it prevents the container from ever starting.
Debug the hard way
```bash
kubectl describe pod disk-io-error-pod
```

```
Events:
  Type     Reason       Age                From     Message
  ----     ------       ----               ----     -------
  Warning  FailedMount  18s (x3 over 35s)  kubelet  MountVolume.SetUp failed for volume "data-volume" : hostPath type check failed: /nonexistent-path is not a directory
```

The error names the path, names the check that failed, and tells you exactly which node component emitted it. The kubelet did a `type: Directory` validation, the directory did not exist, and the mount step aborted.
```bash
kubectl get pod disk-io-error-pod -o jsonpath='{.spec.nodeName}{"\n"}'
# worker-14
```

Now you know which node to look at. If you can, SSH there:
```bash
ssh worker-14 "ls -la /nonexistent-path"
# ls: cannot access '/nonexistent-path': No such file or directory
```

Confirmed. The pod-level error was a faithful report of a node-level reality. Nothing was wrong with the container, the image, or the cluster control plane.
Why this happens
Any pod that mounts node-local storage (hostPath, local PVs, some CSI drivers with pre-provisioned volumes) couples the pod's health to the node's disk state. Kubernetes does not replicate hostPath directories between nodes, and it does not create them on demand unless you opt into `type: DirectoryOrCreate`. If you write `path: /data/logs` in your spec, that directory has to exist on every node where the pod might land, or you have to pin the pod with a nodeSelector to a node where it does exist; otherwise the mount fails.
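The two mitigations above can be sketched in one spec (the pod name, image, and node hostname here are illustrative assumptions, not from the repo):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-aware-pod   # illustrative name
spec:
  # Mitigation 1: pin the pod to a node known to have the directory.
  nodeSelector:
    kubernetes.io/hostname: worker-03   # illustrative node name
  containers:
    - name: app
      image: busybox:1.36   # illustrative image
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data-volume
          mountPath: /data/logs
  volumes:
    - name: data-volume
      hostPath:
        path: /data/logs
        # Mitigation 2: DirectoryOrCreate tells the kubelet to create the
        # directory if it is missing, instead of failing the type check.
        type: DirectoryOrCreate
```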
The reason these bugs feel mysterious is that kubectl presents everything from the pod's point of view, but the root cause lives one layer down on the node. The pod looks broken when actually the filesystem under it is missing. Storage hardware failures, disk replacements, rebuilds, and manual cleanups all leave nodes in slightly different states over time, and any workload that assumes filesystem uniformity across nodes is one bad reboot away from a FailedMount.
The lesson I took from the worker-14 outage: if your pod events mention mounts and your pod works on some nodes but not others, stop reading application logs. Walk to the node.
The fix
The repo's fix swaps the `hostPath` for an `emptyDir`, which removes the dependency on any specific node filesystem:

```bash
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml
```

```yaml
volumes:
  - name: data-volume
    emptyDir: {}
```

```bash
kubectl get pod disk-io-error-fixed-pod
# disk-io-error-fixed-pod   1/1   Running   0   10s
```

For a real workload that needs persistence, replace hostPath with a proper PVC backed by a CSI driver. HostPath is fine for system daemons and debugging. For application state, it is a trap.
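For that case, a minimal PVC-backed sketch looks like this (the claim name, storage class, and size are assumptions; use whatever your cluster's CSI driver provides):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ingest-data   # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard   # assumption: your cluster's default class
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: ingest-pod   # illustrative name
spec:
  containers:
    - name: app
      image: busybox:1.36   # illustrative image
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data-volume
          mountPath: /data/logs
  volumes:
    - name: data-volume
      # The CSI driver provisions and attaches the volume wherever the
      # pod lands, so no node-local filesystem assumption is baked in.
      persistentVolumeClaim:
        claimName: ingest-data
```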
The lesson
- Mount errors live in pod events, not pod logs. `kubectl describe pod` is the first command, not the fourth.
- HostPath and local-PV failures are node-local. Always resolve the pod's `nodeName` before you start guessing.
- If you need persistence, use a PVC backed by a real CSI driver. HostPath is a debugging tool, not a storage strategy.
Day 20 of 35. Tomorrow, the eviction that hits your pod because a completely different pod filled the node's disk.
