Mastering Kubernetes the Right Way · DAY 12 / 35

Cluster Autoscaler Not Scaling Up? The 4 Signals to Check First

Twenty replicas asking for CPU the cluster does not have, and an autoscaler that stays silent. Here is the debug path.

Koti Vellanki · 31 Mar 2026 · 4 min read
kubernetes · debugging · scheduling

2:18 AM. A load test is meant to be running. Marketing is sending traffic at 9 AM and we need to know the new service can take it. I run kubectl get pods and see eighteen of twenty replicas stuck in Pending. The autoscaler should have woken up by now. It has not. The node count is the same as it was an hour ago. I am about to start swearing at AWS when I remember that the autoscaler is not magic, it is just a controller watching unscheduled pods and deciding whether a new node would help. And if it decided that a new node would not help, it would not scale. The question is why it decided that.

There are four signals I check in order, every single time, and they tell me which of the usual suspects is the problem.

The scenario

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/cluster-autoscaler-issues
ls

description.md, issue.yaml, fix.yaml, autoscaler_issue.sh. The issue manifest creates a Deployment with 20 replicas, each asking for 500 millicores and 256 megabytes. On a small dev cluster, most of those will pile up in Pending.
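
The exact manifest lives in the repo, but from the description its shape is roughly this (a hedged sketch; the name and label match the pod output below, the busybox sleep comes from the fix section):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler-issue-deployment
  labels:
    app: cluster-autoscaler
spec:
  replicas: 20                     # far more than a small cluster can hold
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - name: busybox
          image: busybox
          command: ["sleep", "3600"]
          resources:
            requests:
              cpu: "500m"          # 20 x 500m = 10 cores requested
              memory: "256Mi"
```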

Reproduce the issue

bash
kubectl apply -f issue.yaml
kubectl get pods -l app=cluster-autoscaler
plaintext
NAME                                                   READY   STATUS    AGE
cluster-autoscaler-issue-deployment-6b7cf9c5f9-2fq8p   1/1     Running   30s
cluster-autoscaler-issue-deployment-6b7cf9c5f9-4lmxc   0/1     Pending   30s
cluster-autoscaler-issue-deployment-6b7cf9c5f9-7h2vj   0/1     Pending   30s
cluster-autoscaler-issue-deployment-6b7cf9c5f9-8k9rn   0/1     Pending   30s
... 16 more Pending ...

One or two running, the rest sitting in Pending. On a real cluster with an autoscaler, this is the exact state right before the autoscaler is supposed to act.

Debug the hard way

Signal one, the scheduling event. What does the scheduler say about the Pending pods?

bash
kubectl describe pod -l app=cluster-autoscaler | grep -A 3 FailedScheduling | head -10
plaintext
Warning FailedScheduling default-scheduler 0/1 nodes are available: 1 Insufficient cpu.

Insufficient CPU. Good, that is what an autoscaler is for. Signal two, the node count:

bash
kubectl get nodes
plaintext
NAME                 STATUS   ROLES           AGE
kind-control-plane   Ready    control-plane   2h

One node, still. The autoscaler has not added any. Signal three, the autoscaler logs. In a real cluster you would run:

bash
kubectl logs -n kube-system deployment/cluster-autoscaler | tail -30
plaintext
I0331 02:20:11 scale_up.go:452] No pod can be scheduled even if a node group is expanded to maximum size.
I0331 02:20:11 scale_up.go:310] No expansion options.

That is the line that matters. The autoscaler looked at the pending pods and said "even if I add a node of the biggest allowed size to every node group I manage, these pods still would not fit." Signal four, the node group config. If the max size is 1, or the instance type is too small for a 500m request plus system overhead, the autoscaler has no move to make.
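
Where that config lives depends on your platform. On AWS it is typically the autoscaler's own flags; a hedged sketch of the relevant Deployment fragment (image version and ASG name are illustrative, not from this scenario):

```yaml
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # format is min:max:ASG-name -- if max equals the current
      # node count, the autoscaler has no move to make
      - --nodes=1:1:my-worker-asg
```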

Why this happens

Cluster autoscaler works at the node group level, not the cluster level. For each node group, it knows the max size, the instance type, and the resources that a fresh node would provide. When a pod is Pending, the autoscaler simulates adding one new node to each group and checks whether the pod would then fit. If yes, it scales. If no, it logs and moves on.
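That fit check is just arithmetic: a fresh node's capacity, minus what the system reserves, must cover the pod's request. A toy version of the check (the numbers are illustrative, not real allocatable values from any instance type):

```shell
# Toy version of the autoscaler's per-node-group fit simulation.
node_cpu_m=2000        # a hypothetical 2-vCPU instance type, in millicores
overhead_m=300         # assumed kubelet + system daemon reservation
pod_request_m=500      # the scenario's per-pod CPU request

allocatable_m=$((node_cpu_m - overhead_m))
if [ "$pod_request_m" -le "$allocatable_m" ]; then
  echo "fits: one new node would help, scale up"
else
  echo "does not fit: node group skipped"
fi
```

Run the same arithmetic with a small instance type (say, 1000m capacity and a 600m reservation) and the 500m pod no longer fits, which is exactly the "no expansion options" case.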

The silent failure modes are all about that simulation. A node group capped at its max will be skipped. A node group whose instance type cannot hold the pod's requests will be skipped. A pod with an affinity rule that pins it to a label no group provides will be skipped. A pending pod with a PVC stuck in the wrong zone will be skipped. In every case, the autoscaler is not broken, it is correctly refusing to scale up something that would not solve the problem.

The fastest way out of this debug loop is to treat the autoscaler like a scheduler in its own right. Read its logs, not just the pod events.

The fix

For this scenario, the replicas are a lie. Nobody needs twenty busybox sleeps. The fix drops the count to three and the request to something sane.

bash
kubectl apply -f fix.yaml
kubectl get pods -l app=cluster-autoscaler
plaintext
NAME                                                   READY   STATUS    AGE
cluster-autoscaler-fixed-deployment-7c8d5f4b9d-abcde   1/1     Running   5s
cluster-autoscaler-fixed-deployment-7c8d5f4b9d-fghij   1/1     Running   5s
cluster-autoscaler-fixed-deployment-7c8d5f4b9d-klmno   1/1     Running   5s

The diff that matters:

yaml
replicas: 3          # was 20
resources:
  requests:
    cpu: "100m"      # was "500m"
    memory: "64Mi"   # was "256Mi"

In a real incident the fix is rarely the replica count. Usually it is the node group's max size, or the instance type, or an affinity rule that pins the pods to a group that is already full.
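
On EKS, for example, that max-size fix is a one-line change in the node group definition. A hedged sketch as an eksctl ClusterConfig fragment (names and instance type are illustrative):

```yaml
managedNodeGroups:
  - name: workers
    instanceType: m5.large
    minSize: 1
    maxSize: 5       # was 1 -- a ceiling of 1 is "no expansion options"
```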

The lesson

  1. Autoscaler silence is almost always a deliberate decision. Read its logs before you restart anything.
  2. The four signals are always the same: pod events, node count, autoscaler logs, node group config. In that order.
  3. The autoscaler simulates a single new node per group. If one new node would not fix the pod, the autoscaler is correct to do nothing.

Day 12 of 35 — tomorrow, a resource spec where the limit is smaller than the request, and the one error message Kubernetes gets right.
