Mastering Kubernetes the Right Way · DAY 26 / 35

It Is Always DNS: Debugging CoreDNS Failures in Kubernetes

The pod is healthy, the Service is up, nslookup hangs. The five-hop debug.

Koti Vellanki · 14 Apr 2026 · 4 min read
kubernetes · debugging · networking

2 AM, yaar, the application is throwing connection timeouts to a Service that is definitely up. I kubectl exec into a healthy pod and wget the ClusterIP directly; it works instantly. But wget http://payments.default.svc.cluster.local hangs for thirty seconds and dies. I already know what this is. Every SRE knows what this is. It is DNS. It has always been DNS. Yet every single time, my brain refuses to believe DNS is broken and spends twenty minutes checking kube-proxy and iptables first, because "DNS is infrastructure, DNS does not just break". DNS does just break. And when it does, the pod is healthy, the Service is healthy, and the error looks like application flake.

The scenario

Reproduce it in your own cluster so the hang you see matches the hang I describe.

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/dns-resolution-failure
ls

You will see issue.yaml, fix.yaml, description.md, dns_failure.sh. The issue.yaml creates a pod with dnsPolicy: None and a bogus nameserver (192.0.2.1, an RFC 5737 TEST-NET address that nothing can reach). That is not how real DNS breaks in production, but the resulting symptom is identical, and that is what matters for practicing the debug loop.
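Based on that description, issue.yaml looks roughly like this — a sketch of the shape, not the exact file from the repo:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-failure-pod
spec:
  dnsPolicy: None              # ignore the cluster's kube-dns entirely
  dnsConfig:
    nameservers:
      - 192.0.2.1              # RFC 5737 TEST-NET-1: guaranteed unreachable
  containers:
    - name: test
      image: busybox
      command: ["sh", "-c", "nslookup kubernetes.default.svc.cluster.local; sleep 3600"]
```

dnsPolicy: None tells the kubelet to write only what dnsConfig specifies into the pod's /etc/resolv.conf, which is what makes the breakage so clean to reproduce.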

Reproduce the issue

bash
kubectl apply -f issue.yaml
plaintext
pod/dns-failure-pod created
bash
kubectl logs dns-failure-pod
plaintext
Server:    192.0.2.1
Address 1: 192.0.2.1

nslookup: can't resolve 'kubernetes.default.svc.cluster.local'

The pod is Running but DNS is a wall. In production the signature is a little different, your pod looks fine on the outside, applications connect to some Services and not others, and retries make everything worse. But the underlying symptom is the same. DNS queries fail or time out, the pod keeps running, and nothing ties it all together in the events.

Debug the hard way

Five hops. Do them in order, do not skip.

Hop one, is CoreDNS even running:

bash
kubectl -n kube-system get pods -l k8s-app=kube-dns
plaintext
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-4xk7v   1/1     Running   0          14d
coredns-5d78c9869d-hq2mp   1/1     Running   0          14d

Hop two, does the Service exist and have endpoints:

bash
kubectl -n kube-system get svc kube-dns
kubectl -n kube-system get endpoints kube-dns

Both populated. Good. Hop three, what does the broken pod's /etc/resolv.conf look like:

bash
kubectl exec dns-failure-pod -- cat /etc/resolv.conf
plaintext
nameserver 192.0.2.1

There it is. The nameserver is wrong. In production, the most common equivalent is nameserver 10.96.0.10 pointing at kube-dns correctly, but search lines plus ndots:5 multiplying every external hostname into several extra lookups. Hop four, try a direct query bypassing search paths:

bash
kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup -timeout=2 kubernetes.default.svc.cluster.local 10.96.0.10

If that works but nslookup kubernetes from the app pod fails, the problem is the pod's /etc/resolv.conf, not CoreDNS itself. Hop five, look at CoreDNS logs:

bash
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

Errors like i/o timeout talking to upstream mean CoreDNS can reach the cluster but cannot reach its configured upstream. That's a cluster networking or firewall issue, not a DNS config issue.

Why this happens

CoreDNS is a recursive resolver with a Kubernetes plugin. When a pod sends a query for payments.default.svc.cluster.local, CoreDNS answers from its cache of Service records and returns the ClusterIP instantly. When a pod queries google.com, CoreDNS forwards the query upstream, usually to the node's /etc/resolv.conf nameservers, and proxies the answer back.
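Both behaviors come from the Corefile, which lives in the coredns ConfigMap in kube-system (check yours with kubectl -n kube-system get configmap coredns -o yaml). A stock kubeadm-style Corefile looks roughly like this, trimmed for readability:

```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf   # everything non-cluster goes to the node's resolvers
    cache 30
    loop
}
```

The kubernetes plugin answers cluster.local queries from the API server's Service and Pod records; the forward plugin proxies everything else upstream.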

Two things break this, and they look identical from the app. The first is ndots:5. Most pods have ndots:5 in their resolv.conf, which means any hostname with fewer than five dots gets run through every entry in the search list before the literal name is tried. A lookup for api.stripe.com becomes api.stripe.com.default.svc.cluster.local, then api.stripe.com.svc.cluster.local, then api.stripe.com.cluster.local, then finally api.stripe.com.. Three wasted round trips, 4x the DNS load, and 4x the chance one of them times out under pressure. The fix is either to add a trailing dot to hostnames in the app or to set dnsConfig.options with a lower ndots.
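The dnsConfig route looks like this in the pod spec — a minimal sketch, keeping cluster DNS but lowering the search-list threshold:

```yaml
spec:
  dnsPolicy: ClusterFirst      # keep resolving through kube-dns as usual
  dnsConfig:
    options:
      - name: ndots
        value: "2"             # api.stripe.com (2 dots) is now tried literally first
```

Short in-cluster names like payments still resolve via the search list; anything with two or more dots skips it.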

The second is CoreDNS's upstream timing out. CoreDNS forwards unknown names to the node's upstream resolver. If that resolver is flaky, if the network to it is flaky, or if the upstream DNS server itself is slow, every external lookup from every pod in the cluster hangs. You'll see the log line forward: i/o timeout in CoreDNS, repeatedly, and the app looks like it is "randomly" slow.
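If the node's resolvers are the flaky part, one mitigation is to point the forward directive at explicit upstreams in the coredns ConfigMap — assuming those addresses are actually reachable from your nodes, which you should verify first:

```
forward . 1.1.1.1 8.8.8.8 {
    max_concurrent 1000
}
```

CoreDNS picks up ConfigMap changes on its own when the reload plugin is enabled; otherwise restart the coredns pods after editing.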

The fix

bash
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml

The scenario fix is trivial: switch from dnsPolicy: None back to dnsPolicy: ClusterFirst:

yaml
spec:
  dnsPolicy: ClusterFirst
bash
kubectl logs dns-fixed-pod
plaintext
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

In a real cluster, the fix depends on which hop failed. Broken resolv.conf means fix the pod spec or the node's kubelet. Broken upstream means fix CoreDNS's Corefile forward directive or fix the network to the upstream resolver. Scaling issues mean scale CoreDNS replicas and tune autoscaling.

The lesson

  1. When applications randomly time out and direct ClusterIP calls work, stop debugging the app and check DNS. Always.
  2. ndots:5 is the single biggest cause of slow external DNS in Kubernetes. If you resolve a lot of external names, set ndots to 2 in the pod's dnsConfig.
  3. Do the five hops in order. CoreDNS pods, kube-dns service and endpoints, pod resolv.conf, direct query to CoreDNS, CoreDNS logs. The answer is always in one of those five.

Day 26 of 35, tomorrow someone merges a NetworkPolicy during the security review and accidentally cuts the kubelet off from the pod it needs to probe.
