koti.dev
Mastering Kubernetes the Right Way · DAY 10 / 35

Node Affinity in Kubernetes: The Hostname Typo That Pending'd My Pod

One wrong letter in a hostname inside a nodeAffinity block, and the scheduler goes silent for an hour.

Koti Vellanki · 29 Mar 2026 · 3 min read
kubernetes · debugging · scheduling

2:55 AM. A colleague pings me: "my pod won't schedule, can you look?" He sends me a screenshot of kubectl get pods and the status is, of course, Pending. I ask him how long. "An hour." I ask him if he changed anything recently. "Just added a nodeAffinity so it lands on the right box." And I already know what I am going to find before I even look at the YAML. Because every nodeAffinity bug I have ever seen comes down to the same two things: a label that does not exist, or a value with a typo in it.

This one was the typo. One character wrong in a hostname. The scheduler happily filtered out every real node in the cluster and then waited for a node that was never going to come.

The scenario

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/node-affinity-issue
ls

Three files: description.md, issue.yaml, and fix.yaml. The issue pod pins itself to a hostname called non-existent-node. No such node exists in the cluster. The fix drops the affinity block entirely.

Reproduce the issue

bash
kubectl apply -f issue.yaml
kubectl get pod node-affinity-issue-pod
plaintext
NAME                      READY   STATUS    RESTARTS   AGE
node-affinity-issue-pod   0/1     Pending   0          1m

The pod lands in Pending and stays there. No crash, no image pull error, no container state. Just a schedule that is never going to happen.
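
Pending on its own is ambiguous: a pod that already has a node but is still pulling its image also reports Pending. One way to confirm the pod was never scheduled at all is to read the PodScheduled condition from its status. A minimal sketch, assuming kubectl is pointed at the cluster from this scenario; pending_reason is a name made up for this post, not a kubectl feature:

```shell
# Print the pod's phase and the reason on its PodScheduled condition.
# pending_reason is a hypothetical helper name, not part of kubectl.
# $1: pod name; any further arguments (for example -n my-namespace)
# are passed through to kubectl unchanged.
pending_reason() {
  kubectl get pod "$@" \
    -o jsonpath='{.status.phase}{" "}{.status.conditions[?(@.type=="PodScheduled")].reason}{"\n"}'
}
```

Calling pending_reason node-affinity-issue-pod should print the phase followed by the scheduler's reason, typically Unschedulable for a pod in this state, which points at scheduling rather than at the kubelet or the image.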

Debug the hard way

Go straight to describe.

bash
kubectl describe pod node-affinity-issue-pod
plaintext
Events:
  Type     Reason            From               Message
  ----     ------            ----               -------
  Warning  FailedScheduling  default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.

Same message as yesterday's post, and that is an important clue. Affinity failures all look alike from the event log. To tell them apart you have to read the actual affinity rule.

bash
kubectl get pod node-affinity-issue-pod -o yaml | grep -A 10 affinity
yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - non-existent-node

There it is. The pod is demanding a node whose kubernetes.io/hostname label equals non-existent-node. Now the sanity check:

bash
kubectl get nodes -o jsonpath='{.items[*].metadata.labels.kubernetes\.io/hostname}'
plaintext
kind-control-plane

Cluster has one node, named kind-control-plane. The pod is asking for non-existent-node. No match, no schedule, forever.
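
With one node you can compare the two values by eye. With fifty nodes it helps to let the shell do it. A small sketch of that comparison, assuming you have already pulled the wanted hostnames out of the pod spec and the real ones out of the node labels; missing_hostnames is a hypothetical helper name:

```shell
# Print each hostname the pod asks for that no node actually carries.
# missing_hostnames is a hypothetical helper name.
# $1: whitespace-separated hostnames taken from the pod's affinity rule
# $2: whitespace-separated hostnames taken from the cluster's nodes
# The body runs in a subshell, ( ... ), so its variables do not leak.
missing_hostnames() (
  for wanted in $1; do
    found=no
    for real in $2; do
      if [ "$wanted" = "$real" ]; then
        found=yes
      fi
    done
    if [ "$found" = no ]; then
      printf '%s\n' "$wanted"
    fi
  done
)
```

For this scenario, missing_hostnames "non-existent-node" "kind-control-plane" prints non-existent-node. An empty result means every hostname in the rule matches a real node and the problem is somewhere else.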

Why this happens

kubernetes.io/hostname is a well-known label that the kubelet sets on every node automatically. The value is normally the node's hostname, which in most clusters is also the node name shown by kubectl get nodes. When you use it in a nodeAffinity rule with operator: In, you are saying "pin this pod to a specific named machine." That is a completely legal thing to do, and sometimes it is exactly what you want, for example when a workload has a licence tied to a specific MAC address.

The problem is that the value is a free-form string. Nothing in the API server or the scheduler validates that the hostname you wrote actually exists. If you type nod1 instead of node1, the pod is accepted, the scheduler filters out every real node, and the pod waits. There is no linter between you and the mistake.

The cure is boring but effective. Any time you hand-write an affinity rule against kubernetes.io/hostname, run kubectl get nodes in the same breath and copy the value from the output. Do not retype it.
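
If you want something closer to the missing linter, you can ask the cluster whether any node carries the value before you apply the manifest. A rough sketch, assuming kubectl has access to the target cluster; hostname_label_exists is a hypothetical helper name, not something kubectl ships:

```shell
# Succeed only if at least one node carries the given hostname label value.
# hostname_label_exists is a hypothetical helper name, not a kubectl feature.
# $1: the value you intend to put under kubernetes.io/hostname
hostname_label_exists() {
  # -l filters nodes by label; -o name prints one "node/<name>" line per
  # match and nothing on stdout when no node matches.
  matches=$(kubectl get nodes -l "kubernetes.io/hostname=$1" -o name) || return 2
  [ -n "$matches" ]
}
```

Running hostname_label_exists nod1 || echo "no node has that hostname" before kubectl apply would have flagged the typo. The check is only a snapshot: it says nothing about nodes that join or leave the cluster later.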

The fix

bash
kubectl apply -f fix.yaml
kubectl get pod node-affinity-issue-fixed-pod
plaintext
NAME                            READY   STATUS    RESTARTS   AGE
node-affinity-issue-fixed-pod   1/1     Running   0          5s

The diff: the entire affinity block removed. If the intent is to actually pin the pod, rewrite it with a real hostname:

yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - kind-control-plane # copied from kubectl get nodes
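
One way to make "copy, do not retype" mechanical is to generate the block from a value the cluster gave you. A minimal sketch under that assumption; hostname_affinity_yaml is a hypothetical helper name, and the output is only the affinity fragment, which still has to be placed under the pod's spec:

```shell
# Print a nodeAffinity block that pins a pod to the given hostname.
# hostname_affinity_yaml is a hypothetical helper name.
# $1: a hostname value read from the cluster, never typed by hand
hostname_affinity_yaml() {
  cat <<EOF
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - $1
EOF
}
```

For example, hostname_affinity_yaml "$(kubectl get nodes -o jsonpath='{.items[0].metadata.labels.kubernetes\.io/hostname}')" fills in the label value of the first node in the list, so the string in the manifest is whatever the cluster reported.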

The lesson

  1. The API server does not validate affinity values against real nodes. Typos are silent.
  2. Every affinity failure looks the same in the event log. The difference is in the pod spec, not the events.
  3. When you reference kubernetes.io/hostname, copy the value from kubectl get nodes. Do not retype it.

Day 10 of 35 — tomorrow, the most common scheduling block in production: a taint without a matching toleration.
