Mastering Kubernetes the Right Way · DAY 26 / 35

It Is Always DNS: Debugging CoreDNS Failures in Kubernetes

The pod is healthy, the Service is up, nslookup hangs. The five-hop debug.

Koti Vellanki · 14 Apr 2026 · 4 min read
kubernetes · debugging · networking

2 AM, yaar, the application is throwing connection timeouts to a Service that is definitely up. I kubectl exec into a healthy pod and wget the ClusterIP directly; it works instantly. But wget http://payments.default.svc.cluster.local hangs for thirty seconds and dies. I already know what this is. Every SRE knows what this is. It is DNS. It has always been DNS. Yet every single time, my brain refuses to believe DNS is broken and spends twenty minutes checking kube-proxy and iptables first, because "DNS is infrastructure, DNS does not just break". DNS does just break. And when it does, the pod is healthy, the Service is healthy, and the error looks like application flake.

The scenario

Reproduce it in your own cluster so the hang you see matches the hang I describe.

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/dns-resolution-failure
ls

You will see issue.yaml, fix.yaml, description.md, dns_failure.sh. The issue.yaml creates a pod with dnsPolicy: None and a bogus nameserver (192.0.2.1, an RFC 5737 TEST-NET address that nothing can reach). That is not how real DNS breaks in production, but the resulting symptom is identical, and that is what matters for practicing the debug loop.
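Based on that description, issue.yaml looks roughly like this — a sketch of the shape, not the exact file from the repo:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-failure-pod
spec:
  dnsPolicy: None              # ignore the cluster's kube-dns entirely
  dnsConfig:
    nameservers:
      - 192.0.2.1              # RFC 5737 TEST-NET-1: guaranteed unreachable
  containers:
    - name: test
      image: busybox
      command: ["sh", "-c", "nslookup kubernetes.default.svc.cluster.local; sleep 3600"]
```

dnsPolicy: None tells the kubelet to write only what dnsConfig specifies into the pod's /etc/resolv.conf, which is what makes the breakage so clean to reproduce.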

Reproduce the issue

bash
kubectl apply -f issue.yaml
plaintext
pod/dns-failure-pod created
bash
kubectl logs dns-failure-pod
plaintext
Server:    192.0.2.1
Address 1: 192.0.2.1

nslookup: can't resolve 'kubernetes.default.svc.cluster.local'

The pod is Running but DNS is a wall. In production the signature is a little different, your pod looks fine on the outside, applications connect to some Services and not others, and retries make everything worse. But the underlying symptom is the same. DNS queries fail or time out, the pod keeps running, and nothing ties it all together in the events.

Debug the hard way

Five hops. Do them in order, do not skip.

Hop one, is CoreDNS even running:

bash
kubectl -n kube-system get pods -l k8s-app=kube-dns
plaintext
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-4xk7v   1/1     Running   0          14d
coredns-5d78c9869d-hq2mp   1/1     Running   0          14d

Hop two, does the Service exist and have endpoints:

bash
kubectl -n kube-system get svc kube-dns
kubectl -n kube-system get endpoints kube-dns

Both populated. Good. Hop three, what does the broken pod's /etc/resolv.conf look like:

bash
kubectl exec dns-failure-pod -- cat /etc/resolv.conf
plaintext
nameserver 192.0.2.1

There it is. The nameserver is wrong. In production, the most common equivalent is nameserver 10.96.0.10 pointing at kube-dns correctly, but search lines plus ndots:5 multiplying every external hostname into several extra lookups. Hop four, try a direct query bypassing search paths:

bash
kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup -timeout=2 kubernetes.default.svc.cluster.local 10.96.0.10

If that works but nslookup kubernetes from the app pod fails, the problem is the pod's /etc/resolv.conf, not CoreDNS itself. Hop five, look at CoreDNS logs:

bash
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

Errors like i/o timeout talking to upstream mean CoreDNS can reach the cluster but cannot reach its configured upstream. That's a cluster networking or firewall issue, not a DNS config issue.

Why this happens

CoreDNS is a recursive resolver with a Kubernetes plugin. When a pod sends a query for payments.default.svc.cluster.local, CoreDNS answers from its cache of Service records and returns the ClusterIP instantly. When a pod queries google.com, CoreDNS forwards the query upstream, usually to the node's /etc/resolv.conf nameservers, and proxies the answer back.
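Both behaviors come from the Corefile, which lives in the coredns ConfigMap in kube-system (check yours with kubectl -n kube-system get configmap coredns -o yaml). A stock kubeadm-style Corefile looks roughly like this, trimmed for readability:

```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf   # everything non-cluster goes to the node's resolvers
    cache 30
    loop
}
```

The kubernetes plugin answers cluster.local queries from the API server's Service and Pod records; the forward plugin proxies everything else upstream.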

Two things break this, and they look identical from the app. The first is ndots:5. Most pods have ndots:5 in their resolv.conf, which means any hostname with fewer than five dots gets run through every entry in the search list before the literal name is tried. A lookup for api.stripe.com becomes api.stripe.com.default.svc.cluster.local, then api.stripe.com.svc.cluster.local, then api.stripe.com.cluster.local, then finally api.stripe.com.. Three wasted round trips, 4x the DNS load, and 4x the chance one of them times out under pressure. The fix is either to add a trailing dot to hostnames in the app or to set dnsConfig.options with a lower ndots.
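The dnsConfig route looks like this in the pod spec — a minimal sketch, keeping cluster DNS but lowering the search-list threshold:

```yaml
spec:
  dnsPolicy: ClusterFirst      # keep resolving through kube-dns as usual
  dnsConfig:
    options:
      - name: ndots
        value: "2"             # api.stripe.com (2 dots) is now tried literally first
```

Short in-cluster names like payments still resolve via the search list; anything with two or more dots skips it.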

The second is CoreDNS's upstream timing out. CoreDNS forwards unknown names to the node's upstream resolver. If that resolver is flaky, if the network to it is flaky, or if the upstream DNS server itself is slow, every external lookup from every pod in the cluster hangs. You'll see the log line forward: i/o timeout in CoreDNS, repeatedly, and the app looks like it is "randomly" slow.
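If the node's resolvers are the flaky part, one mitigation is to point the forward directive at explicit upstreams in the coredns ConfigMap — assuming those addresses are actually reachable from your nodes, which you should verify first:

```
forward . 1.1.1.1 8.8.8.8 {
    max_concurrent 1000
}
```

CoreDNS picks up ConfigMap changes on its own when the reload plugin is enabled; otherwise restart the coredns pods after editing.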

The fix

bash
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml

The scenario fix is trivial: switch from dnsPolicy: None back to dnsPolicy: ClusterFirst:

yaml
spec:
  dnsPolicy: ClusterFirst
bash
kubectl logs dns-fixed-pod
plaintext
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

In a real cluster, the fix depends on which hop failed. Broken resolv.conf means fix the pod spec or the node's kubelet. Broken upstream means fix CoreDNS's Corefile forward directive or fix the network to the upstream resolver. Scaling issues mean scale CoreDNS replicas and tune autoscaling.

The lesson

  1. When applications randomly time out and direct ClusterIP calls work, stop debugging the app and check DNS. Always.
  2. ndots:5 is the single biggest cause of slow external DNS in Kubernetes. If you resolve a lot of external names, set ndots to 2 in the pod's dnsConfig.
  3. Do the five hops in order. CoreDNS pods, kube-dns service and endpoints, pod resolv.conf, direct query to CoreDNS, CoreDNS logs. The answer is always in one of those five.

Day 26 of 35, tomorrow someone merges a NetworkPolicy during the security review and accidentally cuts the kubelet off from the pod it needs to probe.
