Mastering Kubernetes the Right Way · DAY 12 / 35

Cluster Autoscaler Not Scaling Up? The 4 Signals to Check First

Twenty replicas asking for CPU the cluster does not have, and an autoscaler that stays silent. Here is the debug path.

Koti Vellanki · 31 Mar 2026 · 4 min read
kubernetes · debugging · scheduling

2:18 AM. A load test is meant to be running. Marketing is sending traffic at 9 AM and we need to know the new service can take it. I run kubectl get pods and see eighteen of twenty replicas stuck in Pending. The autoscaler should have woken up by now. It has not. The node count is the same as it was an hour ago. I am about to start swearing at AWS when I remember that the autoscaler is not magic: it is just a controller watching unscheduled pods and deciding whether a new node would help. If it decided that a new node would not help, it would not scale. The question is why it decided that.

There are four signals I check in order, every single time, and they tell me which of the usual suspects is the problem.

DAY 12 · SCHEDULING · CLUSTER AUTOSCALER

The autoscaler wants to scale. The node group is at max.

A pending pod requests 4 CPU. Cluster Autoscaler evaluates whether adding a node would help, but the node group has already hit its configured max-size of 5. CA logs that the group's max size has been reached and does nothing. The pod stays Pending. The fix is to raise max-size, not to debug the pod.

[Figure 12 / 35: Cluster Autoscaler at max-size. A pending pod (api-worker, requests.cpu: 4) cannot be scheduled. NODE-1 to NODE-5 each show cpu free: 0m. The node group is at max=5, current=5 (v1.30), so CA finds no expansion options. Callouts 1 to 3 follow.]
1. The pod is Pending because no node has enough free CPU

The pod requests cpu: 4. Every node in the cluster already has all of its allocatable CPU requested by other pods. The scheduler emits 0/5 nodes are available: 5 Insufficient cpu. This is the signal Cluster Autoscaler watches, and it should respond by adding a node.

2. All five nodes are saturated, so there is nothing to reclaim

The node group has reached its node count limit. Every node shows cpu free: 0m after accounting for system overhead and running workloads. CA simulates adding one new node per node group and checks whether the pending pod would fit. It would, but only if the node group's max-size allowed a sixth node.

3. CA is correct to do nothing. Raise max-size to unblock

Cluster Autoscaler respects the node group's max-size as a hard ceiling. It logs No expansion options and stops. The fix is to increase the max-size in the cloud provider's node group (AWS ASG, GCP MIG, or Azure VMSS), not to restart the autoscaler or modify the pod. To see what the autoscaler said about the limit, run kubectl logs -n kube-system deployment/cluster-autoscaler | grep -iE 'max[ -]size'.

Cluster Autoscaler refuses to add a node because the node group is already at max-size: 5.
A pending pod on the left requests 4 CPU. A cluster in the middle shows 5 nodes all at zero free CPU, with a node-group max equals 5 label. A Cluster Autoscaler card on the right shows pending pods 1, would scale plus 1, node-group max-size 5, current 5, and the verdict AT CAP. A blocked animated arrow from the CA to the cluster is labelled cannot expand.
pod.status.phase Pending with reason FailedScheduling — visible via kubectl describe pod · kind (no real CA — verified concept against the cluster-autoscaler v1.30 docs and a real EKS deployment)

The scenario

bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro/scenarios/cluster-autoscaler-issues
ls

The directory contains description.md, issue.yaml, fix.yaml, and autoscaler_issue.sh. The issue manifest creates a Deployment with 20 replicas, each requesting 500 millicores of CPU and 256Mi of memory. On a small dev cluster, most of those will pile up in Pending.
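A quick back-of-the-envelope check shows why. The snippet below is only arithmetic on the numbers described above; the helper name total_requests is made up for illustration and is not part of the scenario repo.

```shell
# Hypothetical helper: total CPU and memory requested by a Deployment.
# Arguments: replicas, CPU request per pod (millicores), memory per pod (Mi).
total_requests() {
  replicas=$1
  cpu_m=$2
  mem_mi=$3
  echo "cpu=$((replicas * cpu_m))m memory=$((replicas * mem_mi))Mi"
}

# 20 replicas at 500m CPU and 256Mi memory each
total_requests 20 500 256
```

That prints cpu=10000m memory=5120Mi: ten full CPUs and 5Gi of requests before any system pods are counted, which a one-node dev cluster is unlikely to have spare.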

Reproduce the issue

bash
kubectl apply -f issue.yaml
kubectl get pods -l app=cluster-autoscaler
plaintext
NAME                                                   READY   STATUS    AGE
cluster-autoscaler-issue-deployment-6b7cf9c5f9-2fq8p   1/1     Running   30s
cluster-autoscaler-issue-deployment-6b7cf9c5f9-4lmxc   0/1     Pending   30s
cluster-autoscaler-issue-deployment-6b7cf9c5f9-7h2vj   0/1     Pending   30s
cluster-autoscaler-issue-deployment-6b7cf9c5f9-8k9rn   0/1     Pending   30s
... 16 more Pending ...

One or two running, the rest sitting in Pending. On a real cluster with an autoscaler, this is the exact state right before the autoscaler is supposed to act.

Debug the hard way

Signal one, the scheduling event. What does the scheduler say about the Pending pods?

bash
kubectl describe pod -l app=cluster-autoscaler | grep -A 3 FailedScheduling | head -10
plaintext
Warning  FailedScheduling  default-scheduler  0/1 nodes are available: 1 Insufficient cpu.

Insufficient CPU. Good, that is what an autoscaler is for. Signal two, the node count:

bash
kubectl get nodes
plaintext
NAME                 STATUS   ROLES           AGE
kind-control-plane   Ready    control-plane   2h

One node, still. The autoscaler has not added any. Signal three, the autoscaler logs. In a real cluster you would run:

bash
kubectl logs -n kube-system deployment/cluster-autoscaler | tail -30
plaintext
I0331 02:20:11 scale_up.go:452] No pod can be scheduled even if a node group is expanded to maximum size.
I0331 02:20:11 scale_up.go:310] No expansion options.

That is the line that matters. The autoscaler looked at the pending pods and concluded that no node group it manages can be expanded in a way that lets them schedule. Signal four, the node group config. If the group is already at its max size (a max of 1 on a one-node cluster, for example), or the instance type is too small for a 500m request plus system overhead, the autoscaler has no move to make.
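When the log is long, a small filter can point at the next signal to check. This is a rough sketch, not an official tool: the helper name ca_hint is hypothetical, and the patterns are assumptions based on the sample log above. Real wording differs between Cluster Autoscaler versions, so adjust the patterns to what your logs actually say.

```shell
# Hypothetical helper: read Cluster Autoscaler log lines on stdin and print
# a one-line hint. Patterns are assumptions taken from the sample log above.
ca_hint() {
  logs=$(cat)
  if printf '%s\n' "$logs" | grep -qiE 'max[ -]size'; then
    echo "node group at max size: check signal four (node group config)"
  elif printf '%s\n' "$logs" | grep -qi 'no expansion options'; then
    echo "no node group can fit the pod: check instance type, affinity, zones"
  else
    echo "no scale-up decision found: widen the log window"
  fi
}

# Usage against a real cluster (needs a running cluster-autoscaler):
#   kubectl logs -n kube-system deployment/cluster-autoscaler --tail=200 | ca_hint
```

The hint only narrows the search. The log lines themselves remain the source of truth.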

Why this happens

Cluster autoscaler works at the node group level, not the cluster level. For each node group, it knows the max size, the instance type, and the resources that a fresh node would provide. When a pod is Pending, the autoscaler simulates adding one new node to each group and checks whether the pod would then fit. If yes, it scales. If no, it logs and moves on.
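The per-group decision described above can be sketched in a few lines of shell. This is a deliberately tiny model that looks at CPU only; the real autoscaler runs the full set of scheduler checks (memory, taints, affinity and more). The function name would_scale_up and the allocatable figures in the examples are illustrative assumptions.

```shell
# Hypothetical sketch of the per-group check, CPU only.
# Arguments: pod CPU request (m), allocatable CPU of a fresh node (m),
#            current group size, group max size.
would_scale_up() {
  pod_cpu_m=$1
  node_alloc_m=$2
  current=$3
  max=$4
  if [ "$current" -ge "$max" ]; then
    echo "skip: group at max size"
  elif [ "$pod_cpu_m" -gt "$node_alloc_m" ]; then
    echo "skip: pod does not fit on a fresh node"
  else
    echo "scale up"
  fi
}

would_scale_up 4000 7910 5 5   # group is capped
would_scale_up 4000 1930 2 5   # fresh node too small for the pod
would_scale_up 500 1930 2 5    # fits, and the group has room to grow
```

The first two calls print a skip reason and the third prints scale up, which mirrors the "if yes, it scales; if no, it logs and moves on" behaviour.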

The silent failure modes are all about that simulation. A node group capped at its max will be skipped. A node group whose instance type cannot hold the pod's requests will be skipped. A pod with an affinity rule that pins it to a label no group provides will be skipped. A pending pod with a PVC stuck in the wrong zone will be skipped. In every case, the autoscaler is not broken, it is correctly refusing to scale up something that would not solve the problem.
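For the affinity and zone cases, a few read-only kubectl queries usually show the mismatch. The helper names below are hypothetical, and the zone label assumes your nodes carry the standard topology.kubernetes.io/zone label. Treat this as a starting point rather than a complete check.

```shell
# Hypothetical helpers; pass the name of one Pending pod where needed.
pod_placement_rules() {
  # Prints the pod's nodeSelector and affinity blocks (empty if unset).
  kubectl get pod "$1" -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.affinity}{"\n"}'
}

node_zones() {
  # Adds a column with each node's zone label.
  kubectl get nodes -L topology.kubernetes.io/zone
}

pv_zones() {
  # Shows the node affinity recorded on each PersistentVolume, which is
  # where a zone restriction usually lives.
  kubectl get pv -o custom-columns='NAME:.metadata.name,AFFINITY:.spec.nodeAffinity.required.nodeSelectorTerms'
}

# Usage:
#   pod_placement_rules <pending-pod-name>
#   node_zones
#   pv_zones
```

If the pod asks for a label or zone that no node group can provide, no amount of scaling will help, and the autoscaler's silence is the correct answer.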

The fastest way out of this debug loop is to treat the autoscaler like a scheduler in its own right. Read its logs, not just the pod events.

The fix

For this scenario, the replicas are a lie. Nobody needs twenty busybox sleeps. The fix drops the count to three and the request to something sane.

bash
kubectl apply -f fix.yaml
kubectl get pods -l app=cluster-autoscaler
plaintext
NAME                                                   READY   STATUS    AGE
cluster-autoscaler-fixed-deployment-7c8d5f4b9d-abcde   1/1     Running   5s
cluster-autoscaler-fixed-deployment-7c8d5f4b9d-fghij   1/1     Running   5s
cluster-autoscaler-fixed-deployment-7c8d5f4b9d-klmno   1/1     Running   5s

The diff that matters:

yaml
replicas: 3            # was 20
resources:
  requests:
    cpu: "100m"        # was "500m"
    memory: "64Mi"     # was "256Mi"

In a real incident the fix is rarely the replica count. Usually it is the node group's max size, or the instance type, or an affinity rule that pins the pods to a group that is already full.
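When the max size is the culprit on AWS, the change is a single AWS CLI call against the Auto Scaling group. A minimal sketch follows, assuming a self-managed ASG; the helper names, the group name and the new size are placeholders. EKS managed node groups are normally resized through the EKS API instead, and if the autoscaler was started with explicit --nodes=min:max:name flags, that max needs raising too.

```shell
# Hypothetical helpers; the group name you pass in is your own.
asg_limits() {
  # Prints min, max and desired capacity for one Auto Scaling group.
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query 'AutoScalingGroups[0].[MinSize,MaxSize,DesiredCapacity]' \
    --output text
}

asg_raise_max() {
  # Raises the ceiling only; the autoscaler still decides when to add nodes.
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$1" \
    --max-size "$2"
}

# Usage (example name and size, not from the scenario):
#   asg_limits my-node-group
#   asg_raise_max my-node-group 8
```

Raising the ceiling does not add a node by itself. It gives the autoscaler room to act on its next evaluation of the Pending pods.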

The lesson

  1. Autoscaler silence is almost always a deliberate decision. Read its logs before you restart anything.
  2. The four signals are always the same: pod events, node count, autoscaler logs, node group config. In that order.
  3. The autoscaler simulates a single new node per group. If one new node would not let the pod schedule, the autoscaler is correct to do nothing.
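The four signals can be strung together in one wrapper so the order is never skipped at 2 AM. This is a sketch built from the same commands used earlier in the post; the function name four_signals is hypothetical, and step four stays manual because the node group config lives in the cloud provider.

```shell
# Hypothetical wrapper around the four checks, in the order used above.
# Argument: label selector for the Pending pods.
four_signals() {
  selector=$1
  echo "== 1. pod events =="
  kubectl describe pod -l "$selector" | grep -A 3 FailedScheduling | head -10
  echo "== 2. node count =="
  kubectl get nodes
  echo "== 3. autoscaler logs =="
  kubectl logs -n kube-system deployment/cluster-autoscaler | tail -30
  echo "== 4. node group config =="
  echo "check min/max size and instance type in your cloud provider"
}

# Usage:
#   four_signals app=cluster-autoscaler
```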

Day 12 of 35 — tomorrow, a resource spec where the limit is smaller than the request, and the one error message Kubernetes gets right.
