The Runbook.
Real incidents. Real fixes. Mapped 1:1 to scenarios in troubleshoot-kubernetes-like-a-pro. Deploy the broken config, debug it, apply the fix.
- DAY 00
Day 0 — Spin Up a Real Kubernetes Cluster in One Command
Before you debug a single Pending pod, you need a real cluster. Here is how to get one on AWS, GCP, or Azure with a single command — and the Terraform that makes it boring.
Setup · 2026-03-19 · 8 min read
- DAY 01
RunContainerError in Kubernetes: Why Your Pod Never Starts
The kubelet pulled the image, runc tried to exec, and nothing happened. Here is the one line that tells you why.
App · 2026-03-20 · 6 min read
- DAY 02
CrashLoopBackOff in Kubernetes: Read the Exit Code First
A status, a timer, and an exit code. Learn to read all three and you debug in sixty seconds instead of forty minutes.
App · 2026-03-21 · 6 min read
- DAY 03
ErrImagePull in Kubernetes: Typo, Auth, or Network?
One status, three completely different fixes. The describe event tells you which one you are actually looking at.
Image · 2026-03-22 · 5 min read
- DAY 04
ImagePullBackOff in Kubernetes: The Timer on Top of the Error
Same cause as ErrImagePull, plus a backoff clock that makes the cluster look broken when it is just being polite.
Image · 2026-03-23 · 5 min read
- DAY 05
Kubernetes Pod Running But Not Ready: Readiness Probe Failures Explained
STATUS says Running. READY says 0/1. Users see 503s. Here is where the pod is hiding.
Probes · 2026-03-24 · 5 min read
- DAY 06
Liveness Probe Killing Your Kubernetes Pods? Read This First
Exit code 137 with clean app logs and no OOMKilled reason usually means the probe is the murderer. Here is how to catch it in the act.
Probes · 2026-03-25 · 4 min read
- DAY 07
Liveness and Readiness Failing Together: Why Startup Probes Exist
Two probes, one bug, two outages. Read the event ages and you will know the order of operations.
Probes · 2026-03-26 · 4 min read
- DAY 08
Why Is Your Kubernetes Pod Stuck in Pending? The Real Fix
45 minutes staring at a Pending pod, a Slack channel on fire, and one line of kubectl output that finally made sense.
Scheduling · 2026-03-27 · 7 min read
- DAY 09
Pod Affinity Violation in Kubernetes: The Silent Pending Trap
The scheduler will wait forever for an affinity rule that can never be satisfied, and it will never tell you out loud.
Scheduling · 2026-03-28 · 3 min read
- DAY 10
Node Affinity in Kubernetes: The Hostname Typo That Pending'd My Pod
One wrong letter in a hostname inside a nodeAffinity block, and the scheduler goes silent for an hour.
Scheduling · 2026-03-29 · 3 min read
- DAY 11
Taints and Tolerations in Kubernetes: Why Your Pod Won't Land on Any Node
Half the Pending pods I have debugged in my career were a missing toleration. Here is the mental model that ends the confusion.
Scheduling · 2026-03-30 · 3 min read
- DAY 12
Cluster Autoscaler Not Scaling Up? The 4 Signals to Check First
Twenty replicas asking for CPU the cluster does not have, and an autoscaler that stays silent. Here is the debug path.
Scheduling · 2026-03-31 · 4 min read
- DAY 13
Kubernetes Resource Limits Must Be Greater Than or Equal to Requests: Here's Why
One of the few Kubernetes errors that fails loudly at apply time, and the one most people still misread.
Resources · 2026-04-01 · 4 min read
- DAY 14
OOMKilled in Kubernetes: Why the Linux Kernel Murdered Your Pod
Memory limits, cgroups, and the OOM score nobody reads. Why your container is dead and the node is perfectly fine.
Resources · 2026-04-02 · 4 min read
- DAY 15
CrashLoopBackOff from Tight Memory Limits: The 2-Minute Fix
Pod created. Pod killed. Pod created. Pod killed. Welcome to the forever loop.
Resources · 2026-04-03 · 3 min read
- DAY 16
Kubernetes cgroup OOM: When the Kernel Kills Before kubectl Knows
Your pod metrics look fine. The kernel disagrees. Here is what lives beneath kubectl.
Resources · 2026-04-04 · 3 min read
- DAY 17
Kubernetes volumeMount References Undefined Volume: The Typo Fix
The API server rejects the pod, the error scrolls past, and you lose an hour to one missing block.
Storage · 2026-04-05 · 3 min read
- DAY 18
The 5 Reasons Your Kubernetes PVC Never Binds
Pending forever. No events. No provisioner. Here is the five-minute diagnosis.
Storage · 2026-04-06 · 3 min read
- DAY 19
Read-Only Filesystem in Kubernetes: The Volume Permission Fix
readOnlyRootFilesystem is a security win and a 2AM footgun. Here is how to get both.
Storage · 2026-04-07 · 3 min read
- DAY 20
Kubernetes Disk I/O Errors: Pod Symptoms, Node Root Cause
The container is crashing. The node is the reason. Here is how to prove it in 90 seconds.
Storage · 2026-04-08 · 3 min read
- DAY 21
Pod Evicted from Disk Pressure in Kubernetes: The Ephemeral Storage Fix
Your pod is healthy. The node is out of disk. Guess who gets evicted.
Storage · 2026-04-09 · 4 min read
- DAY 22
Kubernetes Service Returns Nothing? Check targetPort First
The pod runs, the Service exists, curl returns "connection refused". One number was wrong.
Network · 2026-04-10 · 4 min read
- DAY 23
hostPort Conflicts in Kubernetes: Why Your Pod Is Stuck Pending
Two containers, one host port, zero useful error messages. Here is what is happening.
Network · 2026-04-11 · 4 min read
- DAY 24
LoadBalancer Stuck on <pending>? It Is Probably Your Selector
The cloud provider is fine. Your Service selector matches nothing. Here is the five-minute fix.
Network · 2026-04-12 · 4 min read
- DAY 25
Kubernetes Ingress 404? Check These 4 Things Before nginx Logs
IngressClass, TLS secret, host, path. One of these four is why your Ingress is silent.
Network · 2026-04-13 · 4 min read
- DAY 26
It Is Always DNS: Debugging CoreDNS Failures in Kubernetes
The pod is healthy, the Service is up, nslookup hangs. The five-hop debug.
Network · 2026-04-14 · 4 min read
- DAY 27
NetworkPolicy Default-Deny Broke My Whole Namespace. Here Is the Fix
One default-deny egress policy, one black-hole namespace, one very long pager night.
Network · 2026-04-15 · 4 min read
- DAY 28
Pod Cannot Reach External API? It Is Probably Egress NetworkPolicy
Cluster-to-cluster is fine. Cluster-to-Stripe is dead. The rule that hides in plain sight.
Network · 2026-04-16 · 4 min read
- DAY 29
Kubernetes ServiceAccount Forbidden? Here's What Your RBAC Actually Says
The pod is Running. The controller is silent. The API server is quietly returning 403 and nobody is reading the audit log.
RBAC · SOON
- DAY 30
I Ran Kubernetes Pods as Root for 2 Years. Then the Auditor Called.
runAsNonRoot, runAsUser, fsGroup, allowPrivilegeEscalation. The four lines that turn a SOC2 finding from red to green.
Security · SOON
- DAY 31
The hostPID Footgun: One Line That Shows You Every Process on the Node
A pod with hostPID: true is not a pod, it is a node-wide observation deck. Here is exactly what an attacker sees.
Security · SOON
- DAY 32
SELinux Denied Your Kubernetes Pod. kubectl Has No Idea.
Profile mismatches, denied syscalls, and the audit2allow loop you should never run blind.
Security · SOON
- DAY 33
RunPodSandbox Failed: When Kubernetes CRI Says No Before Your Image Pulls
RuntimeClass not found, sandbox failures, the layer below containerd where kubelet and CRI disagree.
Cluster · SOON
- DAY 34
kubectl drain Stuck? Your PodDisruptionBudget Is Lying About 'Available'
Voluntary disruptions, minAvailable, and the PDB somebody wrote eighteen months ago that now blocks your upgrade.
Cluster · SOON
- DAY 35
35 Days of Kubernetes — What Nobody Tells You About Running Clusters in Production
The upgrade matrix, deprecated APIs, why you cannot skip a minor, and the honest lessons from thirty-five scenarios in the dark.
Cluster · SOON
- DAY 36
Day 36 — The Unified Playbook (35 Scenarios, One Brain)
Thirty-five scenarios in five weeks. Here is the single mental model that ties them together — and the repo you can keep coming back to.
Wrap · SOON