Mastering Kubernetes the Right Way · DAY 28 / 35

Pod Cannot Reach External API? It Is Probably Egress NetworkPolicy

Cluster-to-cluster is fine. Cluster-to-Stripe is dead. The rule that hides in plain sight.

Koti Vellanki · 16 Apr 2026 · 4 min read
Tags: kubernetes, debugging, networking

2 AM, and the billing service is failing to reach Stripe. Exactly five-second timeouts on every call, and nothing else in the cluster shows any symptom. My first reflex is "it must be Stripe" because come on, what are the odds. Status page is green. curl from my laptop works instantly. SSH to a worker node and curl from there works. kubectl exec into the pod and curl is dead silent: five-second timeout, nothing. So the problem is sitting in the pod's own network namespace, not on the node, not upstream. In-cluster traffic is fine; the pod hits its database happily. Only external traffic is broken, and only for this one workload. Somebody wrote an egress rule that says "you, specifically, are not allowed to talk outside", and I need to find it.

The scenario

DAY 14 · NETWORK · EGRESS POLICY

The pod can reach the cluster. It cannot reach the internet.

Cluster-to-cluster traffic flows. Cluster-to-Stripe is silently dropped at the egress firewall. The default-deny NetworkPolicy is doing exactly what you told it to do — and exactly what you forgot you told it to do.

Figure 14 / 35. Pod cannot reach api.stripe.com: the egress NetworkPolicy drops the SYN. Diagram labels: Kubernetes cluster (production · us-east-1 · v1.30); pod api-client in the default namespace sends a SYN on 443/tcp; egress firewall (NetworkPolicy default-deny, enforced at the CNI plugin) allows 10.0.0.0/8 and kube-dns and denies 0.0.0.0/0; silent drop, no RST sent; external destination api.stripe.com at 203.0.113.4.
1. The pod issues the request

It runs curl https://api.stripe.com. The resolver in the pod looks up the name through kube-dns. Up to here, everything is normal.

2. The CNI consults the policy

The egress NetworkPolicy in the default namespace only allows in-cluster CIDRs. Stripe's address (a documentation IP in the figure) is not on the list, so the SYN is silently dropped: no RST, no ICMP unreachable.

3. The pod sees a hang, then a timeout

From the pod's view the connection just stalls until the kernel exhausts tcp_syn_retries or the client's own timeout fires, whichever comes first. There is no firewall log because logging was never enabled.

An api-client pod hits a default-deny egress NetworkPolicy on its way to api.stripe.com.
A pod inside a Kubernetes cluster sends a request on port 443 toward api.stripe.com. The request hits an egress firewall enforcing a NetworkPolicy that denies all destinations except in-cluster CIDRs. The Stripe IP is not in the allow list, so the SYN is silently dropped and the pod times out.
443/tcp · 203.0.113.4 (RFC 5737 TEST-NET-3, documentation only) · Silent drop — no errno; the kernel keeps retransmitting via tcp_syn_retries (man tcp(7)) · NetworkPolicy v1 networking.k8s.io — kubectl explain networkpolicy.spec.egress · kind v0.22.0, Kubernetes 1.30.0, Calico CNI 3.27 (default-deny egress NetworkPolicy)
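
To put a number on that stall, here is a minimal sketch of the arithmetic. It assumes a 1-second initial retransmission timeout that doubles after every unanswered SYN, and the documented Linux default of tcp_syn_retries=6. The function name syn_timeout_seconds is made up for illustration, and the sketch ignores any cap on a single retransmission interval.

```bash
# syn_timeout_seconds [RETRIES]
# Sums the wait after the initial SYN plus the wait after each retry,
# assuming the interval starts at 1 second and doubles every time.
syn_timeout_seconds() (
  retries=${1:-6}   # 6 is the documented default for tcp_syn_retries
  total=0
  rto=1
  i=0
  while [ "$i" -le "$retries" ]; do
    total=$((total + rto))
    rto=$((rto * 2))
    i=$((i + 1))
  done
  echo "$total"
)

# Usage: syn_timeout_seconds      # default retries
#        syn_timeout_seconds 2    # a tighter setting
```

With the default this comes out to 127 seconds, the figure tcp(7) quotes. That suggests the five-second failures in the story come from the client's own timeout rather than from the kernel giving up.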

Reproduce it locally so your debug loop practices on real output. You need a CNI that enforces NetworkPolicy.

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/firewall-restriction
ls
bash

You will see issue.yaml, fix.yaml, description.md, firewall_restriction.sh. The issue file creates a pod with the label app: firewall-test and a NetworkPolicy that denies all egress for pods with that label.

Reproduce the issue

bash
kubectl apply -f issue.yaml
bash
plaintext
pod/firewall-restriction-pod created
networkpolicy.networking.k8s.io/deny-all-egress created
bash
kubectl logs firewall-restriction-pod
bash
plaintext
blocked

The wget to google.com times out at five seconds and prints blocked. The pod itself is Running. No events, no restarts, no warnings. In production, the signature is exactly this: one workload, one destination, silent timeouts, and every other part of the cluster is fine.

Debug the hard way

Layer by layer, from the outside in. First prove the network path works for something else:

bash
kubectl exec firewall-restriction-pod -- wget -qO- --timeout=3 http://kubernetes.default
bash

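If that also fails, you have a default-deny egress that includes the cluster, same category as yesterday's post. If it works but external fails, the policy is more surgical: it allows in-cluster traffic but blocks external. One caveat: the kubernetes Service exposes only port 443, so a plain-HTTP probe against it can hang even on a healthy pod. If in doubt, aim the probe at an in-cluster Service and port you know is listening. Next, test from outside the pod to rule out the node and the upstream: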

bash
NODE=$(kubectl get pod firewall-restriction-pod -o jsonpath='{.spec.nodeName}')
kubectl debug node/$NODE -it --image=busybox -- wget -qO- --timeout=3 http://google.com
bash

If this works but the pod fails, congratulations, the node can reach the internet and only the pod cannot. That is a NetworkPolicy fingerprint, or a CNI egress filter, nothing else. Now find the rule:

bash
kubectl get networkpolicy -A
bash
plaintext
NAMESPACE   NAME              POD-SELECTOR        AGE
default     deny-all-egress   app=firewall-test   3m
bash
kubectl describe networkpolicy deny-all-egress
bash
plaintext
Name: deny-all-egress
Namespace: default
PodSelector: app=firewall-test
Policy Types: Egress
Egress: <none>

Egress: <none> with Policy Types: Egress is the universal "nothing egress allowed" signature. The pod's labels match app=firewall-test, so the policy catches it. And because the scenario only has this one policy, nothing else allows any traffic, so all egress dies.
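
In a busier namespace, the matching step is easier with the pod's labels and every policy's selector side by side. A minimal sketch, assuming only standard kubectl output flags; the helper name policies_near_pod is made up for illustration. NetworkPolicies are namespaced, so only policies in the pod's own namespace can select it.

```bash
# policies_near_pod POD [NAMESPACE]
# Prints the pod's labels, then every NetworkPolicy in the same namespace
# with its selector, policy types, and egress rules.
policies_near_pod() (
  pod=$1
  ns=${2:-default}

  # Labels on the pod: every podSelector is matched against these
  kubectl get pod "$pod" -n "$ns" --show-labels

  # One row per policy; a missing field is printed as <none>
  kubectl get networkpolicy -n "$ns" \
    -o custom-columns='NAME:.metadata.name,SELECTOR:.spec.podSelector,TYPES:.spec.policyTypes,EGRESS:.spec.egress'
)

# Usage: policies_near_pod firewall-restriction-pod default
```

Any row whose SELECTOR matches the pod's labels and whose TYPES include Egress takes part in filtering that pod's outbound traffic.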

Why this happens

NetworkPolicy egress works on a pod-by-pod basis. The moment a pod is selected by any policy with policyTypes: Egress, its egress traffic is filtered by the union of all egress allow-rules across all selecting policies. Zero rules means zero allowed. One rule means only that destination. Teams get bitten because they write a policy to restrict one specific workload, forget that restricting means "deny-everything-else", and the workload loses access to DNS, to the metrics sink, and to external APIs all at once.
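
The union rule is easier to see with two policies that select the same pods. A minimal sketch: the policy names and the helper print_additive_policies are invented for illustration, and the DNS rule assumes cluster DNS runs in kube-system.

```bash
# print_additive_policies
# Prints two NetworkPolicies that both select app=billing.
print_additive_policies() {
  cat <<'EOF'
# Policy 1: selects app=billing for Egress and lists no rules,
# so on its own it allows nothing.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: billing-deny-egress
spec:
  podSelector:
    matchLabels:
      app: billing
  policyTypes:
    - Egress
---
# Policy 2: same selector, one rule. The union of both is "DNS only".
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: billing-allow-dns
spec:
  podSelector:
    matchLabels:
      app: billing
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Usage: print_additive_policies | kubectl apply --dry-run=client -f -
```

With only the first policy applied, the pod cannot even resolve names. Adding the second restores DNS and nothing else, so a call to an external API still hangs until a third rule allows that destination.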

The other flavor of this bug lives outside Kubernetes, in the cloud network: AWS security groups, GCP VPC firewall rules, Azure NSGs. The cluster nodes may have egress, but pods going out through a NAT gateway may present a source IP that the destination does not allow. Some payment processors and partner APIs only accept traffic from allowlisted source IPs, and if your NAT gateway's address changes, those calls start failing silently. The fingerprint is similar: only external traffic affected, in-cluster fine.
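
For the NAT case, one quick check is to compare the source IP the outside world sees with the addresses the destination has allowlisted. A minimal sketch: the helper ip_in_allowlist is hypothetical, and the pod name, the allowlisted address, and the choice of checkip.amazonaws.com as the echo service are all placeholders for whatever you actually use.

```bash
# ip_in_allowlist SEEN_IP ALLOWED_IP...
# Prints "allowed" and returns 0 if SEEN_IP equals one of the allowed IPs,
# otherwise prints "not-allowed" and returns 1. Exact string match only,
# no CIDR arithmetic.
ip_in_allowlist() (
  seen=$1
  shift
  for allowed in "$@"; do
    if [ "$seen" = "$allowed" ]; then
      echo "allowed"
      return 0
    fi
  done
  echo "not-allowed"
  return 1
)

# Usage sketch (placeholders throughout):
#   seen=$(kubectl exec billing-pod -- wget -qO- https://checkip.amazonaws.com)
#   ip_in_allowlist "$seen" 198.51.100.7
```

If the answer flips to not-allowed after a NAT gateway change, the fix belongs in the cloud network or with the destination's allowlist, not in a NetworkPolicy.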

The third cause is CNI-level egress filtering. Cilium has CiliumNetworkPolicy, which can do FQDN-based egress and DNS-based allowlists. Misconfigure an FQDN rule and you get the same symptom. The debug is the same as for a standard NetworkPolicy, just with kubectl get ciliumnetworkpolicy.

The fix

bash
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml
bash

The scenario fix removes the policy. In a real cluster you want to keep the default-deny and add an allow for the destination:

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-stripe
spec:
  podSelector:
    matchLabels:
      app: billing
  policyTypes:
    - Egress
  egress:
    # HTTPS to anything outside the private 10.0.0.0/8 range
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
      ports:
        - protocol: TCP
          port: 443
    # DNS in kube-system; TCP is included because DNS falls back to it for large responses
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
yaml

Verify:

bash
kubectl exec firewall-restriction-fixed-pod -- wget -qO- --timeout=3 http://google.com
bash

You get HTML back instead of blocked. If you want stricter security, use an FQDN-based policy via Cilium or Calico and allowlist only api.stripe.com on port 443 instead of the entire internet.
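
For the stricter variant, a CiliumNetworkPolicy along these lines is one way to express it. This is a sketch, not a drop-in: it assumes Cilium is the CNI with its DNS proxy enabled, that cluster DNS pods carry the k8s-app=kube-dns label in kube-system, and that the workload is labelled app=billing. The policy name and the helper stripe_fqdn_policy are invented for illustration.

```bash
# stripe_fqdn_policy
# Prints a CiliumNetworkPolicy that allows DNS lookups and HTTPS to
# api.stripe.com only.
stripe_fqdn_policy() {
  cat <<'EOF'
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: billing-allow-stripe-fqdn
spec:
  endpointSelector:
    matchLabels:
      app: billing
  egress:
    # DNS must be allowed and inspected so Cilium can learn the IPs behind the name
    - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Only api.stripe.com on 443
    - toFQDNs:
        - matchName: "api.stripe.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
EOF
}

# Usage: stripe_fqdn_policy | kubectl apply -f -
```

The DNS rule matters as much as the FQDN rule: without it the name never resolves, and the symptom is the same silent timeout.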

The lesson

  1. Symptom "pod cannot reach external, everything else fine" is egress policy 95% of the time. The other 5% is NAT or cloud firewall.
  2. kubectl debug node lets you test from the node's network namespace in one command. If the node works and the pod does not, you have found the layer.
  3. A default-deny egress policy is a good idea, but every one needs explicit allows for DNS, metrics, and the real destinations. Ship the allowlist and the deny together, never the deny alone.
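
The in-pod half of that layer check can be folded into a small helper. A minimal sketch, assuming the pod image ships wget as in the scenario; the name egress_fingerprint and its output strings are made up for illustration, and the in-cluster URL is left as an argument so you can point it at a Service you know is listening.

```bash
# egress_fingerprint POD INTERNAL_URL [EXTERNAL_URL]
# Runs one external and one in-cluster probe from inside the pod and
# prints which pattern the results match.
egress_fingerprint() (
  pod=$1
  internal=$2                        # an in-cluster URL you know is listening
  external=${3:-http://google.com}   # same external target the scenario uses

  # kubectl exec exits with the status of the command it ran in the pod
  probe() {
    kubectl exec "$pod" -- wget -qO- --timeout=3 "$1" >/dev/null 2>&1
  }

  if probe "$external"; then
    echo "external-ok"
  elif probe "$internal"; then
    echo "external-only-blocked"
  else
    echo "all-egress-blocked"
  fi
)

# Usage: egress_fingerprint firewall-restriction-pod http://my-svc.default:8080
```

external-only-blocked together with a working probe from the node is the NetworkPolicy or CNI fingerprint. all-egress-blocked points at a default-deny that also covers the cluster, DNS included.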

Day 28 of 35. Tomorrow we leave networking behind and step into the world of RBAC, ServiceAccounts, and the Forbidden errors that make no sense.
