- Designed to automatically adjust the number of nodes (EC2 instances) in our cluster based on the resource requests of the workloads running in the cluster
- A Kubernetes project, supported on EKS
Key Features:
- Node Scaling: Adds nodes when pending pods cannot be scheduled due to insufficient resources, and removes nodes that are underutilized.
- Pod Scheduling: Ensures that all pending pods are scheduled by scaling the cluster up.
How to check if it's installed and enabled?
% kubectl get deployments -n kube-system | grep -i autoscaler
Typical deployment names:
- cluster-autoscaler
- cluster-autoscaler-aws-clustername
- cluster-autoscaler-eks-...
A healthy installation shows:
- Replicas ≥ 1
- No crash loops
- Command args like:
- --cloud-provider=aws
- --nodes=1:10:nodegroup-name
- --balance-similar-node-groups
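To see the actual args on a live install (assuming the deployment is named cluster-autoscaler; adjust if yours uses one of the other names above):

% kubectl get deployment cluster-autoscaler -n kube-system -o jsonpath='{.spec.template.spec.containers[0].command}'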
In the logs, healthy scaling activity looks like:
- scale up
- scale down
- Unschedulable pods
- Node group ... increase size
Common errors to watch for:
- AccessDenied
- no node groups found
- failed to get ASG
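A quick way to scan for these messages, assuming the default deployment name:

% kubectl logs -n kube-system deployment/cluster-autoscaler --tail=200 | grep -Ei "scale up|scale down|unschedulable|error"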
Installation and Setup:
In Terraform:
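The exact snippet varies by setup; a minimal sketch using the Helm provider's helm_release resource (var.cluster_name and var.aws_region are assumed variables):

resource "helm_release" "cluster_autoscaler" {
  name       = "cluster-autoscaler"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  namespace  = "kube-system"

  # Auto-discover the ASGs tagged for this cluster (see Configuration below)
  set {
    name  = "autoDiscovery.clusterName"
    value = var.cluster_name
  }

  set {
    name  = "awsRegion"
    value = var.aws_region
  }
}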
Configuration:
- Ensure the --nodes flag in the deployment specifies the min and max node counts for your node group (format: --nodes=min:max:nodegroup-name).
- Tag the Auto Scaling Groups behind your node groups with the k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster-name> tags so the autoscaler's auto-discovery can manage them.
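For example, tagging an ASG with the AWS CLI (my-asg and my-cluster are placeholder names):

% aws autoscaling create-or-update-tags --tags \
    "ResourceId=my-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
    "ResourceId=my-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=true"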
How to know if a node was provisioned by Cluster Autoscaler?
{
"nodegroups": [
"mycorp-prod-mycluster-20260714151819635800000002"
]
}
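Output like the above matches the shape of aws eks list-nodegroups. A few hedged checks to tie a specific node back to Cluster Autoscaler (cluster name is a placeholder):

# Reproduce the node group listing above
% aws eks list-nodegroups --cluster-name my-cluster

# See which node group owns each node (EKS managed nodes carry this label)
% kubectl get nodes -L eks.amazonaws.com/nodegroup

# Find the scale-up decision that created the node in the CA logs
% kubectl logs -n kube-system deployment/cluster-autoscaler | grep -i "scale.up"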
If the cluster is overprovisioned, why doesn't Cluster Autoscaler scale nodes down automatically?
- System Pods: Pods like kube-dns or metrics-server don't have PDBs (Pod Disruption Budgets), and CA is afraid to move them.
- Local Storage: A pod is using emptyDir or local storage.
- Annotation: A pod has the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation (example below).
- Manual Overrides: Check if someone manually updated the Auto Scaling Group (ASG) or the EKS Managed Node Group settings in the AWS Console. Terraform won't automatically "downgrade" those nodes until the next terraform apply or a node recycle.
If nodes are very old, they are "frozen" in time. Even if you changed your Terraform to smaller EC2 instances recently, EKS Managed Node Groups do not automatically replace existing nodes just because the configuration changed. They wait for a triggered update or a manual recycling of the nodes.
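For reference, the safe-to-evict annotation from the list above can be flipped per pod with kubectl (pod name is a placeholder); setting it to "true" tells CA the pod may be moved during scale-down:

% kubectl annotate pod my-batch-pod cluster-autoscaler.kubernetes.io/safe-to-evict="true" --overwrite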
How to fix this overprovisioning?
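One hedged remediation path, following the reasoning above: apply the updated (smaller) instance configuration, then recycle the old nodes so replacements come up at the new size (node name is a placeholder):

# Roll out the new instance configuration from Terraform
% terraform apply

# Recycle an old node: block new pods, then evict its workloads
% kubectl cordon ip-10-0-1-23.ec2.internal
% kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data

Once a drained node sits empty, Cluster Autoscaler (or the ASG) can terminate it, and replacement capacity comes up with the new configuration.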
Cluster Autoscaler vs. Karpenter
Quick Feature Comparison Table
| Feature | Cluster Autoscaler (CA) | Karpenter |
| --- | --- | --- |
| Provisioning model | Scales pre-defined node groups (ASGs) | Directly provisions individual EC2 instances |
| Scaling speed | Slower; waits for cloud provider group updates | Faster; provisions nodes in seconds via direct APIs |
| Instance flexibility | Limited; uses fixed node sizes in groups | High; picks the cheapest/optimal instance for the pod |
| Management overhead | Higher; must manage multiple node groups | Lower; one provisioner can handle many pod types |
Key Differences
Scaling Approach:
- CA asks, "How many more of these pre-configured nodes do I need?"
- Karpenter asks, "What specific resources (CPU, RAM, GPU) does this pending pod need right now?" and builds a node to match.
Node Groups:
- CA requires you to manually define and maintain Auto Scaling Groups (ASGs) for different instance types or zones.
- Karpenter bypasses ASGs entirely, allowing it to "mix and match" instance types dynamically in a single cluster.
Consolidation:
- Karpenter actively monitors the cluster to see if it can move pods to fewer or cheaper nodes to save money (bin-packing).
- While CA has a "scale-down" feature, it is less aggressive at optimizing for cost.
Spot Instance Management:
- Karpenter handles Spot interruptions and price changes more natively, selecting the most stable and cost-efficient Spot instances in real-time.
Which should you choose?
Running both tools at the same time makes sense only in a few scenarios:
- Migration period: transitioning from Cluster Autoscaler to Karpenter, where you temporarily run both while gradually moving workloads
- Hybrid node management: managing distinct, non-overlapping node groups, where Cluster Autoscaler handles some node groups and Karpenter handles others (though this adds complexity)
Outside those scenarios, running both on the same node groups risks:
- Race conditions where both try to provision nodes simultaneously
- Inefficient resource allocation
- Unpredictable scaling behavior
- One tool removing nodes the other just provisioned
It also doubles the operational burden:
- Two systems to monitor, troubleshoot, and maintain
- Doubled configuration overhead
- Harder to tell which tool made which scaling decision
Why Karpenter for EKS:
- Karpenter was designed specifically for AWS/EKS and integrates deeply with EC2 APIs
- Karpenter typically provides better performance on EKS (faster provisioning, better bin-packing)
- If you're on EKS, the general recommendation is to choose Karpenter over Cluster Autoscaler for new deployments
