Somebody wrote a "default deny" NetworkPolicy during the security review last week. Looked reasonable on paper, applied cleanly, everybody signed off. 2AM tonight, a fresh deployment rolls out into that namespace and every pod turns into a black hole. Running, Ready, but unable to reach the database, the metrics endpoint, even kube-dns. Liveness probes start failing because the kubelet itself tries to HTTP-GET the pod from the node and the kubelet's source IP is not in any allowlist. The pods start restarting. The restarts don't help because nothing changed on the pod side. The blast radius is the entire namespace and I'm the one holding the pager.
The scenario
Reproduce it in your own cluster. You need a CNI that actually enforces NetworkPolicy for this to mean anything: Calico, Cilium, or Antrea. Plain flannel will accept the policy object and silently ignore it.
```
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/network-connectivity-issues
ls
```

You should see issue.yaml, fix.yaml, description.md, network_issue.sh. The issue file creates a pod plus a NetworkPolicy that denies all egress for pods with the label app: network-test.
Reproduce the issue
```
kubectl apply -f issue.yaml
pod/network-connectivity-issue-pod created
networkpolicy.networking.k8s.io/deny-egress-network-test created
```

The pod tries to wget http://google.com and fails:

```
kubectl logs network-connectivity-issue-pod
blocked
```

The pod is Running, the container is happy, the wget timed out, the log says blocked, and nothing in the pod events tells you a NetworkPolicy is the cause. NetworkPolicy drops are silent: the CNI drops the packet in the kernel, and the pod just sees a connect timeout, as if it were an upstream problem.
Debug the hard way
First the usual checks, because you will run them anyway:
```
kubectl get pod network-connectivity-issue-pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE
network-connectivity-issue-pod   1/1     Running   0          90s   10.244.1.17   worker-1
```

Pod is fine. Then DNS and direct reachability from inside the pod:
```
kubectl exec network-connectivity-issue-pod -- wget -qO- --timeout=3 http://kubernetes.default || echo fail
fail
```

Even the API server Service is unreachable. That is the fingerprint of a default-deny egress policy. Now the command that actually matters: list every NetworkPolicy that selects this pod:
```
kubectl get networkpolicy -A
NAMESPACE   NAME                       POD-SELECTOR       AGE
default     deny-egress-network-test   app=network-test   2m

kubectl describe networkpolicy deny-egress-network-test
Name:          deny-egress-network-test
Namespace:     default
PodSelector:   app=network-test
Policy Types:  Egress
Egress:        <none>
```

Egress: <none> with Policy Types: Egress means "all egress denied for pods matching app=network-test". That is your answer, and no events, no logs, no pod conditions would have told you that. You had to go look for the policy yourself.
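There is no built-in "which policies select this pod" command, so the look-up is a manual cross-reference. A quick sketch, assuming kubectl is pointed at the affected cluster (pod name is the one from this scenario; adjust for your own incident):

```shell
# Print the pod's labels, then compare them by eye against every
# policy's PodSelector in the same namespace.
kubectl get pod network-connectivity-issue-pod -o jsonpath='{.metadata.labels}'; echo

# With no name argument, describe dumps every policy in the
# namespace: selectors, policyTypes, and rules in one pass.
kubectl describe networkpolicy
```

Any policy whose PodSelector matches the labels you printed is in play for that pod.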
Why this happens
NetworkPolicy is additive in an interesting way: if no policy selects a pod, all traffic is allowed. The moment any policy selects a pod, only the traffic explicitly allowed by the union of all selecting policies is permitted, for each direction listed in policyTypes. So an egress policy with an empty rule list that selects a pod denies all egress until you add allow rules. A lot of teams write this policy thinking "default deny" means "start from deny and then layer allows on top", which is correct in intent but wrong in consequence, because they forget to ship the allow-list layer.
The second trap is the kubelet health probe. The kubelet sends HTTP probes to the pod from the node's IP, which is not the pod network. An ingress policy that only allows traffic from podSelector in the same namespace will silently block the kubelet's probe, marking the pod Unhealthy and restarting it in a loop. The fix is an ingress rule allowing traffic from the node CIDR, or using exec probes instead of HTTP probes.
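A sketch of the exec-probe workaround, assuming the container can report its own health from inside (the command and file path here are illustrative, not from the scenario repo):

```yaml
# Exec probes run inside the container via the container runtime,
# so no packet ever crosses the pod network boundary and no
# NetworkPolicy can block them.
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy      # illustrative: app touches this file when healthy
  initialDelaySeconds: 5
  periodSeconds: 10
```

The trade-off is that exec probes cost a process fork per check and cannot validate that the HTTP listener itself is serving, so prefer the node-CIDR ingress allow when you can compute the CIDR reliably.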
The third trap is DNS. A default-deny egress policy blocks traffic to kube-dns on port 53, which means every application call that uses a hostname fails before it even starts. Your allow-list needs an explicit rule allowing UDP and TCP to port 53 to the kube-system namespace, or nothing resolves.
The fix
```
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml
```

The scenario fix removes the NetworkPolicy entirely. In a real cluster you do not want to delete the security policy, you want to fix it. The correct pattern is a default-deny policy paired with explicit allows for DNS, for kubelet probes, and for the actual traffic the app needs:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-app
spec:
  podSelector:
    matchLabels:
      app: network-test
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  - to:
    - podSelector:
        matchLabels:
          app: api
    ports:
    - protocol: TCP
      port: 8080
```

Verify with a wget from the pod and with kubectl describe on the policy to make sure the rules render as you expect.
The lesson
- Default-deny egress without an explicit kube-dns allow is self-sabotage. Every default-deny policy needs DNS exceptions on day one.
- NetworkPolicy drops are silent. No events, no logs on the pod. The only diagnosis is listing the policies that select the pod.
- You must have a CNI that enforces NetworkPolicy. Calico, Cilium, Antrea. Plain flannel accepts the policy and ignores it, which is worse than not having one at all because it gives you false confidence.
Day 27 of 35, tomorrow the cluster talks to itself perfectly but cannot reach the payment processor, and nothing in the cluster looks broken.