2AM, yaar, the application is throwing connection timeout to a Service that is definitely up. I kubectl exec into a healthy pod and wget the ClusterIP directly, works instantly. But wget http://payments.default.svc.cluster.local hangs for thirty seconds and dies. I already know what this is. Every SRE knows what this is. It is DNS. It has always been DNS. Yet every single time, my brain refuses to believe DNS is broken and spends twenty minutes checking kube-proxy and iptables first, because "DNS is infrastructure, DNS does not just break". DNS does just break. And when it does, the pod is healthy, the Service is healthy, and the error looks like application flake.
The scenario
Reproduce it in your own cluster so the hang you see matches the hang I describe.
```
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/dns-resolution-failure
ls
```

You will see `issue.yaml`, `fix.yaml`, `description.md`, and `dns_failure.sh`. The `issue.yaml` creates a pod with `dnsPolicy: None` and a bogus nameserver (192.0.2.1, an RFC 5737 TEST-NET address that nothing can reach). That is not how real DNS breaks in production, but the resulting symptom is identical, and that is what matters for practicing the debug loop.
Reproduce the issue
```
kubectl apply -f issue.yaml
```

```
pod/dns-failure-pod created
```

```
kubectl logs dns-failure-pod
```

```
Server:    192.0.2.1
Address 1: 192.0.2.1

nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
```

The pod is Running but DNS is a wall. In production the signature is a little different: your pod looks fine on the outside, applications connect to some Services and not others, and retries make everything worse. But the underlying symptom is the same. DNS queries fail or time out, the pod keeps running, and nothing ties it all together in the events.
Debug the hard way
Five hops. Do them in order, do not skip.
Hop one, is CoreDNS even running:
```
kubectl -n kube-system get pods -l k8s-app=kube-dns
```

```
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-4xk7v   1/1     Running   0          14d
coredns-5d78c9869d-hq2mp   1/1     Running   0          14d
```

Hop two, does the Service exist and have endpoints:

```
kubectl -n kube-system get svc kube-dns
kubectl -n kube-system get endpoints kube-dns
```

Both populated. Good. Hop three, what does the broken pod's /etc/resolv.conf look like:

```
kubectl exec dns-failure-pod -- cat /etc/resolv.conf
```

```
nameserver 192.0.2.1
```

There it is. The nameserver is wrong. In production, the most common equivalent is `nameserver 10.96.0.10` pointing at kube-dns correctly, but `search` lines plus `ndots:5` causing 4x the lookups for every external hostname. Hop four, try a direct query that bypasses the search paths:
```
kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup -timeout=2 kubernetes.default.svc.cluster.local 10.96.0.10
```

If that works but `nslookup kubernetes` from the app pod fails, the problem is the pod's /etc/resolv.conf, not CoreDNS itself. Hop five, look at CoreDNS logs:

```
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50
```

Errors like `i/o timeout` talking to upstream mean CoreDNS can reach the cluster but cannot reach its configured upstream. That is a cluster networking or firewall issue, not a DNS config issue.
Why this happens
CoreDNS is a recursive resolver with a Kubernetes plugin. When a pod sends a query for payments.default.svc.cluster.local, CoreDNS answers from its cache of Service records and returns the ClusterIP instantly. When a pod queries google.com, CoreDNS forwards the query upstream, usually to the node's /etc/resolv.conf nameservers, and proxies the answer back.
Two things break this, and they look identical from the app. The first is `ndots:5`. Most pods have `ndots:5` in their resolv.conf, which means any hostname with fewer than five dots gets run through every entry in the search list before the literal name is tried. A lookup for api.stripe.com becomes api.stripe.com.default.svc.cluster.local, then api.stripe.com.svc.cluster.local, then api.stripe.com.cluster.local, then finally api.stripe.com.. Three wasted round trips, 4x the DNS load, and 4x the chance one of the queries times out under pressure. The fix is either to add a trailing dot to hostnames in the app or to set `dnsConfig.options` with a lower `ndots`.
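You can watch the expansion happen without a cluster. Here is a minimal sketch that mimics the resolver's ndots decision, using the default search list a pod in the `default` namespace gets; the `expand` function is illustrative, not glibc's actual algorithm:

```shell
#!/bin/sh
# Simulate how a pod's stub resolver expands a name under ndots:5.
# This search list is what kubelet writes for a default-namespace pod.
SEARCH="default.svc.cluster.local svc.cluster.local cluster.local"
NDOTS=5

expand() {
  name="$1"
  # A trailing dot means "absolute name": no search-list expansion at all.
  case "$name" in
    *.) echo "$name"; return ;;
  esac
  # Count the dots in the name.
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$dots" -lt "$NDOTS" ]; then
    # Fewer dots than ndots: every search suffix is tried first.
    for suffix in $SEARCH; do
      echo "$name.$suffix"
    done
  fi
  # The literal name is tried last.
  echo "$name."
}

expand api.stripe.com
```

Running it prints the four queries the resolver would issue for api.stripe.com, three of which are guaranteed NXDOMAIN. Run `expand api.stripe.com.` with the trailing dot and you get exactly one.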
The second is CoreDNS's upstream timing out. CoreDNS forwards unknown names to the node's upstream resolver. If that resolver is flaky, if the network to it is flaky, or if the upstream DNS server itself is slow, every external lookup from every pod in the cluster hangs. You'll see the log line forward: i/o timeout in CoreDNS, repeatedly, and the app looks like it is "randomly" slow.
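For orientation, the forwarding behavior lives in CoreDNS's Corefile. A stock kubeadm-style Corefile looks roughly like this; the exact plugins and values vary by distribution, so treat this as a sketch, not your cluster's config:

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # Anything that is not a cluster name goes to the node's resolvers.
    # This is the line implicated when you see "forward ... i/o timeout".
    forward . /etc/resolv.conf {
        max_concurrent 1000
    }
    cache 30
    loop
    reload
}
```

The `kubernetes` block answers cluster names from the API-backed cache; the `forward` block is the single path every external lookup in the cluster squeezes through.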
The fix
```
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml
```

The scenario fix is trivial: switch from `dnsPolicy: None` back to `dnsPolicy: ClusterFirst`:

```yaml
spec:
  dnsPolicy: ClusterFirst
```

```
kubectl logs dns-fixed-pod
```

```
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
```

In a real cluster, the fix depends on which hop failed. A broken resolv.conf means fix the pod spec or the node's kubelet. A broken upstream means fix CoreDNS's Corefile `forward` directive or fix the network path to the upstream resolver. Scaling issues mean scale CoreDNS replicas and tune autoscaling.
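For the scaling case, the immediate mitigation is just more replicas (this needs a live cluster, and the replica count here is an arbitrary example, not a recommendation):

```
kubectl -n kube-system scale deployment coredns --replicas=4
kubectl -n kube-system rollout status deployment coredns
```

For the durable fix, run the cluster-proportional-autoscaler against the CoreDNS Deployment so replica count tracks node and core count instead of being a number someone picked once.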
The lesson
- When applications randomly time out and direct ClusterIP calls work, stop debugging the app and check DNS. Always.
- `ndots:5` is the single biggest cause of slow external DNS in Kubernetes. If you resolve a lot of external names, set `ndots` to 2 in the pod's `dnsConfig`.
- Do the five hops in order: CoreDNS pods, kube-dns Service and endpoints, pod resolv.conf, direct query to CoreDNS, CoreDNS logs. The answer is always in one of those five.
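The `ndots` change from the lesson above looks like this in a pod spec; the pod name and image are placeholders, and 2 is a starting point to tune, not a universal answer:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: external-heavy-app   # hypothetical name for illustration
spec:
  dnsPolicy: ClusterFirst    # keep cluster DNS; only override the options
  dnsConfig:
    options:
      - name: ndots
        value: "2"           # names with 2+ dots are tried literally first
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
```

With `ndots: 2`, api.stripe.com goes straight upstream as a literal name, while one-label names like `payments` still walk the search list and resolve to Services.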
Day 26 of 35, tomorrow someone merges a NetworkPolicy during the security review and accidentally cuts the kubelet off from the pod it needs to probe.