Key Features:
- Node Scaling: The Cluster Autoscaler adds or removes nodes based on pending pods that cannot be scheduled due to insufficient resources.
- Pod Scheduling: Ensures that all pending pods are scheduled by scaling the cluster up.
It works with EKS Managed Node Groups backed by AWS Auto Scaling Groups. If we provide specific settings in a node group (such as custom block_device_mappings), EKS creates an EC2 Launch Template under the hood.
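The generated template can be inspected with the AWS CLI; lt-xxx below is a placeholder for the launch template ID that aws eks describe-nodegroup reports:

```shell
# Show version 1 of the launch template EKS generated for the node group
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-xxx \
  --versions 1
```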
Cluster Autoscaler and kube-scheduler
kube-scheduler is the default control plane component in Kubernetes responsible for deciding which Node a newly created or unscheduled Pod should run on. It essentially matches pods to the most suitable available machines based on resource requirements and specific constraints.
The Cluster Autoscaler and kube-scheduler components DO NOT communicate with each other directly. Instead, the Cluster Autoscaler (CA) watches the kube-scheduler's decisions by monitoring the state of pods in the cluster.
They work in an indirect loop via the Kubernetes API server:
- Kube-scheduler: Attempts to place pods on existing nodes. If it cannot find a node with sufficient capacity, it marks the pod as Pending with an Unschedulable status.
- Cluster Autoscaler: Monitors the cluster for these Unschedulable pods.
- Action: When CA detects a pending pod, it triggers a scale-up by adding a node.
- Completion: Once the new node joins, the kube-scheduler notices the new capacity and schedules the pending pod.
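This handoff is observable from the outside: pods the scheduler cannot place emit FailedScheduling events, which is precisely the signal CA reacts to. For example:

```shell
# Scheduling failures across all namespaces (the trigger CA watches for)
kubectl get events -A --field-selector reason=FailedScheduling
# The pending pods themselves
kubectl get pods -A --field-selector status.phase=Pending
```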
Key Takeaways:
- The Autoscaler watches the Scheduler: The autoscaler reacts to the decisions (or failed attempts) of the scheduler.
- No Direct Connection: They are "blissfully unaware" of each other and interact only through Kubernetes API objects.
- Not Resource Based: The Cluster Autoscaler does not directly monitor node CPU/memory usage; it only cares if the scheduler cannot place a pod.
This indirect workflow ensures that new nodes are only provisioned when necessary to satisfy pod scheduling constraints.
How to check if it's installed and enabled?
(1) Look for its deployment
Cluster Autoscaler usually runs as a Deployment in kube-system namespace so we can look for that deployment:
% kubectl get deployments -n kube-system | grep -i cluster-autoscaler
cluster-autoscaler-aws-cluster-autoscaler 2/2 2 2 296d
We can also list pods directly:
% kubectl get pods -n kube-system | grep -i cluster-autoscaler
cluster-autoscaler-aws-cluster-autoscaler-7cbb844455-q2lxv 1/1 Running 0 206d
cluster-autoscaler-aws-cluster-autoscaler-7cbb844455-vhbsw 1/1 Running 0 206d
If we see a pod running, it’s installed.
Typical names:
- cluster-autoscaler
- cluster-autoscaler-aws-clustername
- cluster-autoscaler-eks-...
(2) Inspect the Deployment
Confirm it’s enabled & configured.
% kubectl describe deployment cluster-autoscaler -n kube-system
Name:                   cluster-autoscaler-aws-cluster-autoscaler
Namespace:              kube-system
CreationTimestamp:      Wed, 16 Apr 2025 12:25:38 +0100
Labels:                 app.kubernetes.io/instance=cluster-autoscaler
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=aws-cluster-autoscaler
                        helm.sh/chart=cluster-autoscaler-9.46.6
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: cluster-autoscaler
                        meta.helm.sh/release-namespace: kube-system
Selector:               app.kubernetes.io/instance=cluster-autoscaler,app.kubernetes.io/name=aws-cluster-autoscaler
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=cluster-autoscaler
                    app.kubernetes.io/name=aws-cluster-autoscaler
  Service Account:  cluster-autoscaler-aws-cluster-autoscaler
  Containers:
   aws-cluster-autoscaler:
    Image:      registry.k8s.io/autoscaling/cluster-autoscaler:v1.32.0
    Port:       8085/TCP
    Host Port:  0/TCP
    Command:
      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/mycorp-prod-mycluster
      --logtostderr=true
      --stderrthreshold=info
      --v=4
    Liveness:   http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:    (v1:metadata.namespace)
      SERVICE_ACCOUNT:  (v1:spec.serviceAccountName)
      AWS_REGION:       us-east-1
    Mounts:             <none>
  Volumes:              <none>
  Priority Class Name:  system-cluster-critical
  Node-Selectors:       <none>
  Tolerations:          <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   cluster-autoscaler-aws-cluster-autoscaler-7cbb844455 (2/2 replicas created)
Events:          <none>
Key things to look for:
- Replicas ≥ 1
- No crash loops
- Command args like:
- --cloud-provider=aws
- --nodes=1:10:nodegroup-name
- --balance-similar-node-groups
If replicas are 0, it’s installed but effectively disabled.
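For reference, the deployment shown above discovers node groups via ASG tags (--node-group-auto-discovery); the alternative is to list each group explicitly with repeated --nodes flags. A sketch with hypothetical ASG names:

```shell
# Each --nodes flag is min:max:asg-name (ASG names here are made up)
./cluster-autoscaler \
  --cloud-provider=aws \
  --nodes=1:10:my-app-asg \
  --nodes=2:20:my-batch-asg \
  --balance-similar-node-groups
```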
(3) Check logs
Is it actively scaling?
This confirms it’s working, not just running.
kubectl logs deployment/cluster-autoscaler-aws-cluster-autoscaler -n kube-system
or find pods:
kubectl get pods \
  -l app.kubernetes.io/name=aws-cluster-autoscaler \
  -n kube-system
Then check logs:
kubectl logs \
  -l app.kubernetes.io/name=aws-cluster-autoscaler \
  -n kube-system \
  | grep -i scale
Healthy / active signs:
- scale up
- scale down
- Unschedulable pods
- Node group ... increase size
- If you see messages like Refresher: resolving ASGs, it will list the names of the ASGs it is currently monitoring.
Red flags:
- AccessDenied
- no node groups found
- failed to get ASG
(4) Check for unschedulable pods trigger
If CA is working, it reacts to pods stuck in Pending.
% kubectl get pods -A | grep Pending
If pods are pending and CA logs mention them → CA is enabled and reacting.
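To provoke this deliberately, deploy pods whose combined CPU requests exceed the cluster's current free capacity but still fit on a single node. A sketch assuming m5.large-class (2 vCPU) nodes; adjust requests and replica count to your instance sizes:

```shell
# Create more 1.5-CPU pods than the current nodes can hold
kubectl create deployment ca-scale-test --image=busybox --replicas=5 -- sleep 3600
kubectl set resources deployment ca-scale-test --requests=cpu=1500m
# Watch some pods go Pending, then get scheduled once CA adds nodes
kubectl get pods -l app=ca-scale-test -w
# Clean up afterwards
kubectl delete deployment ca-scale-test
```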
(5) AWS EKS-specific checks (very common)
a) Check IAM permissions (classic failure mode)
Cluster Autoscaler must run with an IAM role that can talk to ASGs.
% kubectl -n kube-system get sa | grep autoscaler
cluster-autoscaler-aws-cluster-autoscaler 0 296d
horizontal-pod-autoscaler 0 296d
Let's inspect the cluster-autoscaler-aws-cluster-autoscaler service account:
% kubectl -n kube-system get sa cluster-autoscaler-aws-cluster-autoscaler -o yaml
apiVersion: v1
automountServiceAccountToken: true
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxx:role/mycorp-prod-mycluster-cluster-autoscaler
    meta.helm.sh/release-name: cluster-autoscaler
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2026-04-16T11:25:37Z"
  labels:
    app.kubernetes.io/instance: cluster-autoscaler
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-cluster-autoscaler
    helm.sh/chart: cluster-autoscaler-9.46.6
  name: cluster-autoscaler-aws-cluster-autoscaler
  namespace: kube-system
  resourceVersion: "15768"
  uid: 0a7da521-1bf5-5a5f-a155-8801e876ea7b
Look for:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ClusterAutoscalerRole
If missing → CA may exist but cannot scale.
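To verify the role behind that annotation actually exists and trusts the cluster's OIDC provider (role name and profile below are placeholders):

```shell
# Inspect the IRSA role's trust policy
aws iam get-role \
  --role-name ClusterAutoscalerRole \
  --query 'Role.AssumeRolePolicyDocument' \
  --profile my_profile
```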
b) Check Auto Scaling Group tags
Our node group ASGs must be tagged:
k8s.io/cluster-autoscaler/enabled = true
k8s.io/cluster-autoscaler/<cluster-name> = owned
Without these tags, Cluster Autoscaler runs but does nothing: it ignores that ASG entirely.
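Missing tags can be added in place without recreating anything; create-or-update-tags is idempotent. ASG and cluster names below are placeholders:

```shell
# Tag the ASG so Cluster Autoscaler's auto-discovery picks it up
aws autoscaling create-or-update-tags --tags \
  "ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/<cluster-name>,Value=owned,PropagateAtLaunch=true"
```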
(6) Check Helm (if installed via Helm)
Let's list all Helm releases across every namespace in a Kubernetes cluster and look for cluster autoscaler:
% helm list -A
NAME                NAMESPACE    REVISION  UPDATED                                  STATUS    CHART                      APP VERSION
cluster-autoscaler  kube-system  1         2025-04-16 12:25:30.389073326 +0100 BST  deployed  cluster-autoscaler-9.46.6  1.32.0
Then:
helm status cluster-autoscaler -n kube-system
Note that helm ls -A is simply an alias for helm list -A. Helm identifies your cluster and authenticates through the same mechanism as kubectl: the kubeconfig file, typically located at ~/.kube/config, which determines which cluster to target.
(7) Double-check it’s not replaced by Karpenter
Many newer EKS clusters don’t use Cluster Autoscaler anymore.
% kubectl get pods -A | grep -i karpenter
kube-system karpenter-6f67b8c97b-lbq8p 1/1 Running 0 206d
kube-system karpenter-6f67b8c97b-wmprj 1/1 Running 0 206d
If Karpenter is installed, Cluster Autoscaler usually isn’t (or shouldn’t be).
Quick decision table
---------------------------------------------------
Symptom                      Meaning
---------------------------------------------------
No CA pod                    Not installed
Pod running, replicas=0      Installed but disabled
Logs show AccessDenied       Broken IAM
Pods Pending, no scale-up    ASG tags / config issue
Karpenter present            CA likely not used
---------------------------------------------------
(8) Check the "Status" ConfigMap
Cluster Autoscaler maintains a ConfigMap that shows which groups it is managing and if they are at their max/min size:
% kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
apiVersion: v1
data:
  status: |
    time: 2026-03-21 08:52:07.308206626 +0000 UTC
    autoscalerStatus: Running
    clusterWide:
      health:
        status: Healthy
        nodeCounts:
          registered:
            total: 6
            ready: 6
            notStarted: 0
          longUnregistered: 0
          unregistered: 0
        lastProbeTime: "2026-03-21T08:52:07.308206626Z"
        lastTransitionTime: "2026-03-20T16:30:07.460032826Z"
      scaleUp:
        status: NoActivity
        lastProbeTime: "2026-03-21T08:52:07.308206626Z"
        lastTransitionTime: "2026-03-20T16:30:07.460032826Z"
      scaleDown:
        status: NoCandidates
        lastProbeTime: "2026-03-21T08:52:07.308206626Z"
        lastTransitionTime: "2026-03-20T16:30:07.460032826Z"
    nodeGroups:
    - name: eks-mycorp-env-app-k8s-v1_33-202603...04-2cc...dc8
      health:
        status: Healthy
        nodeCounts:
          registered:
            total: 2
            ready: 2
            notStarted: 0
          longUnregistered: 0
          unregistered: 0
        cloudProviderTarget: 2
        minSize: 2
        maxSize: 10
        lastProbeTime: "2026-03-21T08:52:07.308206626Z"
        lastTransitionTime: "2026-03-20T16:30:07.460032826Z"
      scaleUp:
        status: NoActivity
        lastProbeTime: "2026-03-21T08:52:07.308206626Z"
        lastTransitionTime: "2026-03-20T16:30:07.460032826Z"
      scaleDown:
        status: NoCandidates
        lastProbeTime: "2026-03-21T08:52:07.308206626Z"
        lastTransitionTime: "2026-03-20T16:30:07.460032826Z"
kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2026-03-21 08:52:07.308206626 +0000 UTC
  creationTimestamp: "2026-03-20T16:29:56Z"
  name: cluster-autoscaler-status
  namespace: kube-system
  resourceVersion: "18...78"
  uid: 17b...0af
Installation and Setup:
To use the Cluster Autoscaler in the EKS cluster we need to deploy it using a Helm chart or a pre-configured YAML manifest.
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
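Alternatively, the Helm route (which matches the Helm release seen earlier in this post) looks roughly like this; the cluster name is a placeholder and all other chart values are left at their defaults:

```shell
# Add the official autoscaler chart repo and install/upgrade the release
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=<cluster-name> \
  --set awsRegion=us-east-1
```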
If the cluster and its node groups were provisioned with Terraform (the tags shown later reveal the terraform-aws-modules/eks module), we can list the node groups with the AWS CLI:
% aws eks list-nodegroups \
  --cluster-name mycorp-env-app-k8s \
  --region us-east-2 \
  --profile my_profile
{
  "nodegroups": [
    "mycorp-env-app-k8s-20260714151819635800000002"
  ]
}
To inspect the nodegroup, including its labels, use:
% aws eks describe-nodegroup \
--cluster-name mycorp-env-app-k8s \
--nodegroup-name mycorp-env-app-k8s-v1_33-20260...03 \
--region us-east-2 \
--profile my_profile \
--output json
{
  "nodegroup": {
    "nodegroupName": "mycorp-env-app-k8s-v1_33-20260...003",
    "nodegroupArn": "arn:aws:eks:us-east-2:xxxx:nodegroup/mycorp-env-app-k8s/mycorp-env-app-k8s-v1_33-202....03/2cce8....c7",
    "clusterName": "mycorp-env-app-k8s",
    "version": "1.33",
    "releaseVersion": "1.33.8-20260317",
    "createdAt": "2026-03-20T12:41:07.961000+00:00",
    "modifiedAt": "2026-03-22T09:21:58.892000+00:00",
    "status": "ACTIVE",
    "capacityType": "ON_DEMAND",
    "scalingConfig": {
      "minSize": 2,
      "maxSize": 10,
      "desiredSize": 2
    },
    "instanceTypes": [
      "m5.large"
    ],
    "subnets": [
      "subnet-02xxx",
      "subnet-00xxx",
      "subnet-04xxx"
    ],
    "amiType": "AL2023_x86_64_STANDARD",
    "nodeRole": "arn:aws:iam::xxx:role/mycorp-env-app-k8s-v1_33-eks-node-group",
    "labels": {
      "Environment": "prod",
      "mycorp/node-type": "v1.33"
    },
    "resources": {
      "autoScalingGroups": [
        {
          "name": "mycorp-env-app-k8s-v1_33-202603...dc7"
        }
      ]
    },
    "health": {
      "issues": []
    },
    "updateConfig": {
      "maxUnavailablePercentage": 33
    },
    "launchTemplate": {
      "name": "mycorp-env-app-k8s-v1_33-202...0001",
      "version": "1",
      "id": "lt-xxx"
    },
    "tags": {
      "ClusterName": "mycorp-env-app-k8s",
      "Environment": "prod",
      "terraform-aws-modules": "eks",
      "Terraform": "true",
      "Name": "mycorp-env-app-k8s-v1_33"
    }
  }
}
If we inspect the tags on one such ASG, we can see that this Terraform module attached tags to it so that CAS can discover it and manage its parameters (usually just desired_size, which is used to increase or decrease the number of running EC2 instances):
% aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names eks-mycorp-env-app-k8s-v1_33-20260320...4-2c...8 \
--query "AutoScalingGroups[].Tags[?starts_with(Key, 'k8s.io/cluster-autoscaler')]" \
--region us-east-2 \
--profile my_profile
[
  [
    {
      "ResourceId": "eks-mycorp-env-app-k8s-v1_33-20260320...4-2c...8",
      "ResourceType": "auto-scaling-group",
      "Key": "k8s.io/cluster-autoscaler/enabled",
      "Value": "true",
      "PropagateAtLaunch": true
    },
    {
      "ResourceId": "eks-mycorp-env-app-k8s-v1_33-20260320...4-2c...8",
      "ResourceType": "auto-scaling-group",
      "Key": "k8s.io/cluster-autoscaler/mycorp-env-app-k8s",
      "Value": "owned",
      "PropagateAtLaunch": true
    }
  ]
]
In the example above, mycorp-env-app-k8s is the name of the cluster.
If the cluster is overprovisioned, why doesn't Cluster Autoscaler scale nodes down automatically?
If Cluster Autoscaler is running but not shrinking the cluster, it's usually because:
System Pods: Non-DaemonSet kube-system pods (like kube-dns or metrics-server) without PDBs (Pod Disruption Budgets) block scale-down, because CA will not evict them by default.
Local Storage: A pod is using emptyDir or local storage.
Annotation: A pod has the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation.
Manual Overrides: Check if someone manually updated the Auto Scaling Group (ASG) or the EKS Managed Node Group settings in the AWS Console. Terraform won't automatically "downgrade" those nodes until the next terraform apply or a node recycle.
If nodes are very old, they are "frozen" in time. Even if you changed your Terraform to smaller EC2 instances recently, EKS Managed Node Groups do not automatically replace existing nodes just because the configuration changed. They wait for a triggered update or a manual recycling of the nodes.
How to fix this overprovisioning?
Since your current Terraform state says you want e.g. 2 nodes of m5.large, but the reality is e.g. 4 nodes of m5.xlarge, you need to force a sync.
Step 1: Check for Drift
Run a terraform plan. It will likely show that it wants to update the Launch Template or the Node Group version to switch from xlarge back to large.
Step 2: Trigger a Rolling Update
If you apply the Terraform and nothing happens to the existing nodes, you need to tell EKS to recycle them. You can do this via the AWS CLI:
aws eks update-nodegroup-version \
--cluster-name <your-cluster-name> \
--nodegroup-name <your-nodegroup-name> \
--force
Note: This will gracefully terminate nodes one by one and replace them with the new m5.large type defined in your TF.
Cluster Autoscaler vs Karpenter
CAS (Cluster Autoscaler) and Karpenter are both Kubernetes tools that adjust node capacity to match the workload. CAS relies on fixed node groups and slower, infrastructure-driven scaling, while Karpenter is a faster, modern, open-source, workload-driven node provisioner that talks to cloud APIs directly, improving efficiency and cost optimization.
Cluster Autoscaler (CAS): Operates by adjusting the size of specific, pre-defined node groups (e.g., autoscaling groups). It is generally better suited for smaller, predictable, or steady-state workloads where strict node group management is preferred.
Karpenter: Evaluates pending pods and launches optimally sized nodes directly, bypassing the need for manual node group management. It is ideal for high-churn, highly dynamic, and cost-sensitive, large-scale production environments.
While both tools scale Kubernetes nodes to meet pod demand, they use fundamentally different approaches. Cluster Autoscaler (CA) is the traditional, "group-based" tool that adds nodes to existing pools, whereas Karpenter is a "provisioning" tool that directly creates the specific instances your applications need.
Quick Feature Comparison Table
Scaling Logic
- Cluster Autoscaler (CA): Scales pre-defined node groups (ASGs)
- Karpenter: Directly provisions individual EC2 instances.
Speed
- Cluster Autoscaler (CA): Slower; waits for cloud provider group updates
- Karpenter: Faster; provisions nodes in seconds via direct APIs; better for rapid, "spiky" traffic.
Cost Control
- Cluster Autoscaler (CA): Limited; uses fixed node sizes in groups.
- Karpenter: High; picks the cheapest/optimal instance for the pod. It has built-in node consolidation, which reduces costs by bin-packing pods onto fewer, more efficient nodes.
Complexity
- Cluster Autoscaler (CA): Higher; must manage multiple node groups.
- Karpenter: Lower; one provisioner can handle many pod types.
Flexibility
- Karpenter: supports diverse instance types and, while commonly used with AWS, it can be used with other providers.
Configuration
- Karpenter uses Kubernetes-native YAML for defining node pools and node classes.
Key Differences
Infrastructure Model:
- CA asks, "How many more of these pre-configured nodes do I need?".
- Karpenter asks, "What specific resources (CPU, RAM, GPU) does this pending pod need right now?" and builds a node to match.
Node Groups:
- CA requires you to manually define and maintain Auto Scaling Groups (ASGs) for different instance types or zones.
- Karpenter bypasses ASGs entirely, allowing it to "mix and match" instance types dynamically in a single cluster.
Consolidation:
- Karpenter actively monitors the cluster to see if it can move pods to fewer or cheaper nodes to save money (bin-packing).
- While CA has a "scale-down" feature, it is less aggressive at optimizing for cost.
Spot Instance Management:
- Karpenter handles Spot interruptions and price changes more natively, selecting the most stable and cost-efficient Spot instances in real-time.
Which should you choose?
Use Cluster Autoscaler if you need a stable, battle-tested solution that works across multiple cloud providers (GCP, Azure) or if your workloads are very predictable and don't require rapid scaling.
Use Karpenter if you are on AWS EKS, need to scale up hundreds of nodes quickly, want to heavily use Spot instances, or want to reduce the operational burden of managing dozens of node groups.
Disable Cluster Autoscaler if you plan to use Karpenter. Having both leads to race conditions and wasted cost.
When to Run Both Together
It's generally not recommended to run Cluster Autoscaler and Karpenter together in the same cluster. However, there are specific scenarios where it might be acceptable:
Valid use cases for running both:
- Migration period: Transitioning from Cluster Autoscaler to Karpenter, where you temporarily run both while gradually moving workloads
- Hybrid node management: Managing distinct, non-overlapping node groups where Cluster Autoscaler handles some node groups and Karpenter handles others (though this adds complexity)
When It's Not Recommended (and Why)
Primary reasons to avoid running both:
Conflicting decisions: Both tools make independent scaling decisions, which can lead to:
- Race conditions where both try to provision nodes simultaneously
- Inefficient resource allocation
- Unpredictable scaling behavior
- One tool removing nodes the other just provisioned
Increased operational complexity:
- Two systems to monitor, troubleshoot, and maintain
- Doubled configuration overhead
- More difficult to understand which tool made which scaling decision
Resource contention: Both tools consume cluster resources and API server capacity, adding unnecessary load.
No significant benefits: Karpenter can handle everything Cluster Autoscaler does, often more efficiently, so there's rarely a technical need for both.
EKS-Specific Considerations
The same principles apply to AWS EKS clusters, with some additional context:
EKS particularities:
- Karpenter was designed specifically for AWS/EKS and integrates deeply with EC2 APIs
- Karpenter typically provides better performance on EKS (faster provisioning, better bin-packing)
- If you're on EKS, the general recommendation is to choose Karpenter over Cluster Autoscaler for new deployments
Migration best practice for EKS: If migrating from Cluster Autoscaler to Karpenter on EKS, ensure they manage completely separate node groups, and complete the migration as quickly as feasible to minimize the period of running both.
How to migrate pods from nodes deployed by Cluster Autoscaler to those deployed by Karpenter?
If you'd rather use Karpenter for everything, you should eventually scale this node group down to zero (min_size and desired_size to 0; note that the EKS API requires max_size to stay at least 1, or you can delete the node group outright) and let Karpenter handle the provisioning instead.
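A sketch of that wind-down via the AWS CLI (names are placeholders; the EKS API requires maxSize to be at least 1, so to reach zero nodes you either keep desiredSize at 0 or delete the node group once Karpenter owns all capacity):

```shell
# Shrink the CAS-managed node group to (effectively) zero...
aws eks update-nodegroup-config \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name> \
  --scaling-config minSize=0,maxSize=1,desiredSize=0
# ...or remove it entirely:
aws eks delete-nodegroup \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name>
```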
---