koti.dev
◆ Sr. DevOps Engineer · Day 0 + 35 Runbook Entries

Koti Vellanki.

Seven years running Kubernetes in production. I write about the late-night incidents, the configs that broke things, and the workflows I actually use to fix them.

7+ YEARS IN PRODUCTION · EKS · AKS · GKE · TERRAFORM · HELM · ARGO CD · CALICO · CILIUM · ISTIO · PROMETHEUS · GRAFANA · LOKI · KUBECTL DESCRIBE → KUBECTL FIX · 35 DEPLOYABLE SCENARIOS · ON CALL FOR YEARS
◆ The Runbook · 35 Entries

Mastering Kubernetes the right way.

Thirty-five entries, mapped 1:1 to scenarios in troubleshoot-kubernetes-like-a-pro. Every entry: deploy the broken config, debug it, apply the fix.

DAY 01

RunContainerError in Kubernetes: Why Your Pod Never Starts

The kubelet pulled the image, runc tried to exec, and nothing happened. Here is the one line that tells you why.

App
DAY 02

CrashLoopBackOff in Kubernetes: Read the Exit Code First

A status, a timer, and an exit code. Learn to read all three and you debug in sixty seconds instead of forty minutes.

App
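
For a feel of what "read the exit code first" means in practice, here is a minimal sketch that decodes a container exit code using the common shell convention that codes above 128 mean "128 plus a signal number". The function name `describe_exit_code` is hypothetical, and signal names are resolved on the machine running the script, so this assumes a Linux host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        try:
            # Look up the signal name on this host (assumes Linux numbering).
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    # Anything else is the application's own non-zero exit status.
    return f"application error (exit {code})"


for code in (0, 1, 137, 143):
    print(code, describe_exit_code(code))
```

Under that convention 137 maps to SIGKILL and 143 to SIGTERM. The code alone does not say who sent the signal, so it is a starting point for the describe output, not a replacement for it.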
DAY 03

ErrImagePull in Kubernetes: Typo, Auth, or Network?

One status, three completely different fixes. The describe event tells you which one you are actually looking at.

Image
DAY 04

ImagePullBackOff in Kubernetes: The Timer on Top of the Error

Same cause as ErrImagePull, plus a backoff clock that makes the cluster look broken when it is just being polite.

Image
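
The "backoff clock" is easier to reason about with numbers. The sketch below models a doubling retry delay with a ceiling. The 10-second starting delay and 300-second cap are assumptions based on commonly documented kubelet defaults, and `backoff_delays` is a hypothetical helper, not a kubelet API.

```python
from typing import List


def backoff_delays(attempts: int, base: float = 10.0, cap: float = 300.0) -> List[float]:
    """Return the wait before each retry: doubling from `base`, never above `cap`."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(delay)
        # Double the wait for the next attempt, but stop growing at the cap.
        delay = min(delay * 2, cap)
    return delays


print(backoff_delays(7))
# [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

With these assumed values, a pod that has been failing for a while retries only once every five minutes, which is why a fixed image can still look stuck for a few minutes after the fix lands.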
DAY 05

Kubernetes Pod Running But Not Ready: Readiness Probe Failures Explained

STATUS says Running. READY says 0/1. Users see 503s. Here is where the pod is hiding.

Probes
DAY 06

Liveness Probe Killing Your Kubernetes Pods? Read This First

Exit code 137 with clean app logs means the probe is the murderer. Here is how to catch it in the act.

Probes
DAY 07

Liveness and Readiness Failing Together: Why Startup Probes Exist

Two probes, one bug, two outages. Read the event ages and you will know the order of operations.

Probes
DAY 08

Why Is Your Kubernetes Pod Stuck in Pending? The Real Fix

45 minutes staring at a Pending pod, a Slack channel on fire, and one line of kubectl output that finally made sense.

Scheduling
DAY 09

Pod Affinity Violation in Kubernetes: The Silent Pending Trap

The scheduler will wait forever for an affinity rule that can never be satisfied, and it will never tell you out loud.

Scheduling
DAY 10

Node Affinity in Kubernetes: The Hostname Typo That Pending'd My Pod

One wrong letter in a hostname inside a nodeAffinity block, and the scheduler goes silent for an hour.

Scheduling
DAY 11

Taints and Tolerations in Kubernetes: Why Your Pod Won't Land on Any Node

Half the Pending pods I have debugged in my career were a missing toleration. Here is the mental model that ends the confusion.

Scheduling
DAY 12

Cluster Autoscaler Not Scaling Up? The 4 Signals to Check First

Twenty replicas asking for CPU the cluster does not have, and an autoscaler that stays silent. Here is the debug path.

Scheduling
DAY 13

Kubernetes Resource Limits Must Be Greater Than Requests: Here's Why

One of the few Kubernetes errors that fails loudly at apply time, and the one most people still misread.

Resources
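
The rule behind that apply-time error is that a request may equal its limit but may not exceed it. A minimal sketch of the check for CPU values follows. It handles only plain cores ("2", "0.5") and millicores ("500m"), which is a subset of the real quantity syntax, and both function names are hypothetical.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "0.5" to millicores (subset only)."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    # Plain numbers are whole or fractional cores; 1 core = 1000 millicores.
    return round(float(quantity) * 1000)


def request_fits_limit(request: str, limit: str) -> bool:
    """A request is valid when it is less than or equal to the limit."""
    return cpu_to_millicores(request) <= cpu_to_millicores(limit)


print(request_fits_limit("500m", "1"))     # request below limit
print(request_fits_limit("500m", "500m"))  # request equal to limit
print(request_fits_limit("2", "500m"))     # request above limit
```

The first two calls return True and the third returns False: equality passes, and only a request larger than its limit is rejected.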
DAY 14

OOMKilled in Kubernetes: Why the Linux Kernel Murdered Your Pod

Memory limits, cgroups, and the OOM score nobody reads. Why your container is dead and the node is perfectly fine.

Resources
DAY 15

CrashLoopBackOff from Tight Memory Limits: The 2-Minute Fix

Pod created. Pod killed. Pod created. Pod killed. Welcome to the forever loop.

Resources
DAY 16

Kubernetes cgroup OOM: When the Kernel Kills Before kubectl Knows

Your pod metrics look fine. The kernel disagrees. Here is what lives beneath kubectl.

Resources
DAY 17

Kubernetes volumeMount References Undefined Volume: The Typo Fix

API rejects the pod, the error scrolls past, and you lose an hour to one missing block.

Storage
DAY 18

The 5 Reasons Your Kubernetes PVC Never Binds

Pending forever. No events. No provisioner. Here is the five-minute diagnosis.

Storage
DAY 19

Read-Only Filesystem in Kubernetes: The Volume Permission Fix

readOnlyRootFilesystem is a security win and a 2AM footgun. Here is how to get both.

Storage
DAY 20

Kubernetes Disk I/O Errors: Pod Symptoms, Node Root Cause

The container is crashing. The node is the reason. Here is how to prove it in 90 seconds.

Storage
DAY 21

Pod Evicted from Disk Pressure in Kubernetes: The Ephemeral Storage Fix

Your pod is healthy. The node is out of disk. Guess who gets evicted.

Storage
DAY 22

Kubernetes Service Returns Nothing? Check targetPort First

The pod runs, the service exists, curl returns refused. One number was wrong.

Network
DAY 23

hostPort Conflicts in Kubernetes: Why Your Pod Is Stuck Pending

Two containers, one host port, zero useful error messages. Here is what is happening.

Network
DAY 24

LoadBalancer Stuck on <pending>? It Is Probably Your Selector

The cloud provider is fine. Your Service selector matches nothing. Here is the five-minute fix.

Network
DAY 25

Kubernetes Ingress 404? Check These 4 Things Before nginx Logs

IngressClass, TLS secret, host, path. One of these four is why your Ingress is silent.

Network
DAY 26

It Is Always DNS: Debugging CoreDNS Failures in Kubernetes

The pod is healthy, the Service is up, nslookup hangs. The five-hop debug.

Network
DAY 27

NetworkPolicy Default-Deny Broke My Whole Namespace. Here Is the Fix

One default-deny egress policy, one black-hole namespace, one very long pager night.

Network
DAY 28

Pod Cannot Reach External API? It Is Probably Egress NetworkPolicy

Cluster-to-cluster is fine. Cluster-to-Stripe is dead. The rule that hides in plain sight.

Network
DAY 29 · soon

Kubernetes ServiceAccount Forbidden? Here's What Your RBAC Actually Says

The pod is Running. The controller is silent. The API server is quietly returning 403 and nobody is reading the audit log.

RBAC
DAY 30 · soon

I Ran Kubernetes Pods as Root for 2 Years. Then the Auditor Called.

runAsNonRoot, runAsUser, fsGroup, allowPrivilegeEscalation. The four lines that turn a SOC2 finding from red to green.

Security
DAY 31 · soon

The hostPID Footgun: One Line That Shows You Every Process on the Node

A pod with hostPID: true is not a pod, it is a node-wide observation deck. Here is exactly what an attacker sees.

Security
DAY 32 · soon

SELinux Denied Your Kubernetes Pod. kubectl Has No Idea.

Profile mismatches, denied syscalls, and the audit2allow loop you should never run blind.

Security
DAY 33 · soon

RunPodSandbox Failed: When Kubernetes CRI Says No Before Your Image Pulls

RuntimeClass not found, sandbox failures, the layer below containerd where kubelet and CRI disagree.

Cluster
DAY 34 · soon

kubectl drain Stuck? Your PodDisruptionBudget Is Lying About 'Available'

Voluntary disruptions, minAvailable, and the PDB somebody wrote eighteen months ago that now blocks your upgrade.

Cluster
DAY 35 · soon

35 Days of Kubernetes — What Nobody Tells You About Running Clusters in Production

The upgrade matrix, deprecated APIs, why you cannot skip a minor, and the honest lessons from thirty-five scenarios in the dark.

Cluster
DAY 36 · soon

Day 36 — The Unified Playbook (35 Scenarios, One Brain)

Thirty-five scenarios in five weeks. Here is the single mental model that ties them together — and the repo you can keep coming back to.

Wrap
◆ About me

I've been on call for seven years. These are the notes I wish I had.

Sr. DevOps Engineer based in India. I've run Kubernetes in production across AWS, Azure, and GCP — the good, the bad, and the 3AM pages. These posts are the playbook I wish existed when I started.

7+ Years in production
100+ Clusters operated
30 Field notes incoming