Mastering Kubernetes the Right Way · DAY 28 / 35

Pod Cannot Reach External API? It Is Probably Egress NetworkPolicy

Cluster-to-cluster is fine. Cluster-to-Stripe is dead. The rule that hides in plain sight.

Koti Vellanki · 16 Apr 2026 · 4 min read
kubernetes · debugging · networking

2 AM, and the billing service is failing to reach Stripe. Exactly five-second timeouts, every call, nothing else in the cluster showing any symptom. My first reflex is "it must be Stripe", because come on, what are the odds. The status page is green. curl from my laptop works instantly. SSH to a worker node and curl from there works. kubectl exec into the pod and curl is dead silent: five-second timeout, nothing. So the problem sits in the pod's own network namespace, not on the node, not upstream. Cluster-to-cluster traffic is fine; the pod hits its database happily. Only external traffic is broken, and only for this one workload. Somebody wrote an egress rule that says "you, specifically, are not allowed to talk outside", and I need to find it.

The scenario

Reproduce it locally so your debug loop practices on real output. You need a CNI that enforces NetworkPolicy.

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/firewall-restriction
ls

You will see issue.yaml, fix.yaml, description.md, firewall_restriction.sh. The issue file creates a pod with the label app: firewall-test and a NetworkPolicy that denies all egress for pods with that label.
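The policy half of issue.yaml looks roughly like this, a sketch reconstructed from the description above rather than the literal file contents:

```yaml
# Sketch of the deny-all egress policy in issue.yaml (reconstructed):
# selects pods labeled app=firewall-test and allows no egress at all.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-egress
spec:
  podSelector:
    matchLabels:
      app: firewall-test
  policyTypes:
    - Egress
  # No egress rules listed: selected pods may send nothing, anywhere.
```

The absence of any `egress:` entries is the whole trick; there is no explicit "deny" keyword in NetworkPolicy, only the absence of allows.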

Reproduce the issue

bash
kubectl apply -f issue.yaml
plaintext
pod/firewall-restriction-pod created
networkpolicy.networking.k8s.io/deny-all-egress created
bash
kubectl logs firewall-restriction-pod
plaintext
blocked

The wget to google.com times out at five seconds and prints blocked. The pod itself is Running. No events, no restarts, no warnings. In production, the signature is exactly this: one workload, one destination, silent timeouts, and every other part of the cluster is fine.
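Where does that "blocked" line come from? The scenario pod effectively runs something like this, a reconstruction of the container command rather than the literal manifest:

```shell
# Reconstruction of what the scenario pod's container does (not the literal
# command from issue.yaml): attempt an external fetch with a 5-second budget
# and report the result either way.
wget -qO- --timeout=5 http://google.com >/dev/null 2>&1 \
  && echo reachable \
  || echo blocked
```

With the deny-all-egress policy in place, the wget never gets a SYN-ACK back, the timeout fires, and the `||` branch prints blocked.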

Debug the hard way

Layer by layer, from the outside in. First prove the network path works for something else:

bash
kubectl exec firewall-restriction-pod -- wget -qO- --timeout=3 http://kubernetes.default

If that also fails, you have a default-deny egress that includes the cluster, same category as yesterday's post. If it works but external fails, the policy is more surgical, allowing in-cluster but blocking external. Next, test from outside the pod to rule out the node and the upstream:

bash
NODE=$(kubectl get pod firewall-restriction-pod -o jsonpath='{.spec.nodeName}')
kubectl debug node/$NODE -it --image=busybox -- wget -qO- --timeout=3 http://google.com

If this works but the pod fails, congratulations, the node can reach the internet and only the pod cannot. That is a NetworkPolicy fingerprint, or a CNI egress filter, nothing else. Now find the rule:

bash
kubectl get networkpolicy -A
plaintext
NAMESPACE   NAME              POD-SELECTOR        AGE
default     deny-all-egress   app=firewall-test   3m
bash
kubectl describe networkpolicy deny-all-egress
plaintext
Name:          deny-all-egress
Namespace:     default
PodSelector:   app=firewall-test
Policy Types:  Egress
Egress:        <none>

Egress: <none> with Policy Types: Egress is the universal "nothing egress allowed" signature. The pod's labels match app=firewall-test, so the policy catches it. And because the scenario only has this one policy, nothing else allows any traffic, so all egress dies.
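In a cluster with dozens of policies, eyeballing describe output per policy gets old. This pair of commands (plain kubectl and jsonpath, nothing scenario-specific) lines the pod's labels up against every policy's selector in one screen:

```shell
# Print the pod's labels, then every NetworkPolicy's namespace/name,
# podSelector and policyTypes, so you can spot which policies select the pod.
kubectl get pod firewall-restriction-pod --show-labels
kubectl get networkpolicy -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.podSelector.matchLabels}{"\t"}{.spec.policyTypes}{"\n"}{end}'
```

Any row whose selector matches the pod's labels and whose policy types include Egress is a suspect.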

Why this happens

NetworkPolicy egress works on a pod-by-pod basis. The moment a pod is selected by any policy with policyTypes: Egress, its egress traffic is filtered by the union of all egress allow-rules across all selecting policies. Zero rules means zero allowed. One rule means only that destination. Teams get bitten because they write a policy to restrict one specific workload, forget that restricting means "deny-everything-else", and the workload loses access to DNS, to the metrics sink, and to external APIs all at once.
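DNS is the sneakiest casualty. If a policy allows TCP 443 but not UDP 53, the symptom is identical to a full block, because the pod never resolves the hostname. Two quick checks separate the cases (pod name from the scenario; 1.1.1.1 is Cloudflare's public resolver, which also serves an HTTP landing page):

```shell
# 1. Resolve without connecting: if this times out, DNS egress (UDP 53 to
#    kube-dns) is blocked, and even an allowed destination will look dead.
kubectl exec firewall-restriction-pod -- nslookup google.com

# 2. Bypass DNS entirely by hitting a well-known IP directly; if this works
#    while the hostname fails, the policy allows the traffic but not the lookup.
kubectl exec firewall-restriction-pod -- wget -qO- --timeout=3 http://1.1.1.1
```

In the scenario both fail, because the deny covers everything; in production, "IP works, hostname doesn't" is a missing port-53 allow, full stop.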

The other flavor of this bug lives outside Kubernetes, on the cloud. AWS security groups, GCP VPC firewall rules, Azure NSGs. The cluster nodes may have egress, but individual pods sitting behind a NAT gateway might be going out through an IP that the destination does not allow. Stripe and other payment processors often require IP allowlisting for webhook callbacks, and if your NAT gateway rotates, your callbacks start failing silently. The fingerprint is similar: only external traffic affected, in-cluster fine.
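For the NAT flavor, the first question is "what source IP does the destination actually see?" An external echo service answers it from inside the pod (checkip.amazonaws.com is a public AWS service that echoes the caller's IP; any equivalent works):

```shell
# Ask an external echo service what source IP our traffic arrives from.
# If the destination allowlists IPs, this address is what must be on the list,
# not the pod IP and not the node IP.
kubectl exec firewall-restriction-pod -- wget -qO- --timeout=3 http://checkip.amazonaws.com
```

If that IP is not the NAT gateway address your ops team put on the allowlist, you have found the rotation.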

The third cause is CNI-level egress filtering. Cilium has CiliumNetworkPolicy which can do FQDN-based egress and DNS-based allowlists. Misconfigure a FQDN rule and you get the same symptom. The debug is the same as standard NetworkPolicy, just with kubectl get ciliumnetworkpolicy.
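Cilium's FQDN rules are easy to half-configure because an FQDN egress rule only works if DNS traffic is also allowed and routed through Cilium's DNS proxy. A sketch of a working pair of rules, assuming Cilium as the CNI and a hypothetical app: billing label, with field names per the CiliumNetworkPolicy API:

```yaml
# Sketch: allow DNS to kube-dns (inspected, so Cilium learns the IPs behind
# names), then allow HTTPS only to api.stripe.com. Labels are illustrative.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-stripe-fqdn
spec:
  endpointSelector:
    matchLabels:
      app: billing
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    - toFQDNs:
        - matchName: "api.stripe.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
```

Drop the DNS rule and the toFQDNs rule silently matches nothing: same five-second timeouts, same green dashboards.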

The fix

bash
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml

The scenario fix removes the policy. In a real cluster you want to keep the default-deny and add an allow for the destination:

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-stripe
spec:
  podSelector:
    matchLabels:
      app: billing
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
      ports:
        - protocol: TCP
          port: 443
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53

Verify:

bash
kubectl exec firewall-restriction-fixed-pod -- wget -qO- --timeout=3 http://google.com

You get HTML back instead of blocked. (Note that a production-style allow like the allow-stripe example only opens TCP 443, so there you would verify with an https:// URL; plain HTTP on port 80 would still be blocked.) If you want stricter security, use an FQDN-based policy via Cilium or Calico and allowlist only api.stripe.com on port 443 instead of the entire internet.

The lesson

  1. Symptom "pod cannot reach external, everything else fine" is egress policy 95% of the time. The other 5% is NAT or cloud firewall.
  2. kubectl debug node lets you test from the node's network namespace in one command. If the node works and the pod does not, you have found the layer.
  3. A default-deny egress policy is a good idea, but every one needs explicit allows for DNS, metrics, and the real destinations. Ship the allowlist and the deny together, never the deny alone.
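The three lessons condense into a small triage script you can keep next to the runbook. This is a sketch stitched together from the checks in this post, not part of the scenario repo; pod name and URL are arguments:

```shell
#!/bin/sh
# Quick triage for "pod cannot reach external": run this post's checks in
# order. Usage: ./triage.sh <pod> [external-url]
POD=$1
URL=${2:-http://google.com}

echo "1. in-cluster egress:"
kubectl exec "$POD" -- wget -qO- --timeout=3 http://kubernetes.default >/dev/null 2>&1 \
  && echo "   ok" || echo "   BLOCKED (default-deny including the cluster?)"

echo "2. external egress from pod:"
kubectl exec "$POD" -- wget -qO- --timeout=3 "$URL" >/dev/null 2>&1 \
  && echo "   ok" || echo "   BLOCKED (egress policy, NAT, or cloud firewall)"

echo "3. candidate policies (match selector against pod labels):"
kubectl get pod "$POD" --show-labels
kubectl get networkpolicy -A
```

If step 1 passes and step 2 fails, you are in this post; if both fail, you are in yesterday's.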

Day 28 of 35, tomorrow we leave networking behind and step into the world of RBAC, ServiceAccounts, and the Forbidden errors that make no sense.
