This article explores AWS EKS cluster upgrade strategies end-to-end (control plane + nodes), and where node strategies fit inside them. We assume a single cluster; if you run one cluster per environment (e.g. dev, stage, prod), additional approaches should also be considered.
General Upgrade order
This is critical in EKS. A proper cadence explicitly states the order:
- Non-prod clusters first
- Control plane
- Managed add-ons
  - VPC CNI
  - CoreDNS
  - kube-proxy
- Worker nodes
- Platform controllers
  - Ingress
  - Autoscalers
  - Observability
Example:
Upgrade order: dev → staging → prod
Control plane → add-ons → nodes → workloads
At the cluster level, there are really 3 canonical upgrade approaches:
- In-place cluster upgrade (sequential, same cluster)
- Blue/Green cluster upgrade (two clusters)
- Rolling control-plane + self-healing nodes
(1) In-place cluster upgrade (sequential, same cluster)
(Most common / AWS default)
What it is
- Upgrade control plane
- Upgrade add-ons
- Upgrade nodes
- All within the same EKS cluster
Node replacement can be:
- blue/green node groups
- in-place rolling
- Karpenter-driven
✅ Benefits
- No DNS or traffic switching
- Minimal external dependencies
- Supported and documented by AWS
- Lowest operational overhead
- No full cluster rebuild
- You get a safe pause point during node replacement (just stop draining if something looks off)
- Works with:
- Managed Node Groups
- Self-managed ASGs
- Karpenter (with some extra checks)
❌ Drawbacks
- You must respect:
- version skew rules
- API deprecations
- Errors surface in prod if you’re careless
⚠️ Risks
- Deprecated API removal breaks workloads
- Operators not compatible with target version
- Node drains blocked by PDBs
Risk level: 🟡 (low if done carefully)
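The last risk above (drains blocked by PDBs) is easy to surface before you start. A minimal sketch, assuming placeholder node names:
# PDBs with ALLOWED DISRUPTIONS = 0 will block node drains
kubectl get pdb -A
# Drain one node the way a rolling update would: respect PDBs, skip DaemonSet pods,
# and time out instead of hanging forever
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=5m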
(1.1) Upgrading Control Plane
Upgrading an Amazon EKS control plane is an "in-place" process, but it requires a specific sequence to avoid downtime for our workloads.
In EKS, the control plane (managed by AWS) and the data plane (your worker nodes) are upgraded separately. You should always upgrade the control plane before your worker nodes.
(1.1.1) Pre-Upgrade Preparation
Before you click "Update," you must ensure your cluster is ready for the new version.
- Check API Deprecations: Newer Kubernetes versions often remove older APIs. Use the EKS Upgrade Insights tab in the AWS Console to see which of your resources are using deprecated APIs.
- To see which API versions the cluster currently serves (useful when hunting for APIs removed in 1.33), we can also use (a scanner sketch follows this list):
kubectl get apiservices
kubectl api-resources
- Scan manifests / Helm charts for removed APIs
- Check Add-on Compatibility: Verify that your CoreDNS, kube-proxy, and VPC CNI versions are compatible with the target Kubernetes version.
- Check CRDs, especially if you run:
- cert-manager
- ingress controllers
- ECK / Elastic
- external-dns
- Sequential Upgrades: You can only upgrade one minor version at a time (e.g., 1.29 → 1.30). If you are on 1.28 and want 1.30, you must upgrade to 1.29 first.
- Control plane is backward-compatible with nodes at least one minor version behind: a 1.33 control plane happily runs 1.32 nodes
- IP Address Availability: Ensure your subnets have at least 5 available IP addresses; EKS needs these to spin up the new control plane instances during the rolling update
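For the deprecated-API check, open-source scanners do the heavy lifting. A hedged sketch (kubent and pluto are third-party tools, not part of EKS, and their flags may differ by version):
# kubent (kube-no-trouble) scans live cluster objects against a target version
kubent --target-version 1.33
# pluto scans rendered manifests and Helm releases
pluto detect-files -d ./manifests
pluto detect-helm -o wide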
(1.1.2) Upgrade the Control Plane
You can trigger this via the Console, CLI, or Terraform. The API server will remain available during this process, though there may be a few seconds of latency.
Using AWS CLI:
aws eks update-cluster-version \
--name <my-cluster> \
--kubernetes-version <target-version>
Notes:
- Zero downtime for the API server
- Existing workloads keep running
- Takes ~10–15 minutes typically
Afterwards:
kubectl version
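To confirm the upgrade from the AWS side as well, a small sketch (cluster name is a placeholder):
# Confirm the control plane reports the new version and is ACTIVE again
aws eks describe-cluster --name <my-cluster> \
  --query '{version: cluster.version, status: cluster.status}'
# Or block until the cluster returns to ACTIVE
aws eks wait cluster-active --name <my-cluster>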
(1.2) Update EKS Managed Add-ons
Once the cluster status is ACTIVE, update the default / core add-ons to match the new version. Do this before adding new nodes.
Core add-ons:
- VPC CNI
- CoreDNS
  - Mismatched CoreDNS + new nodes = weird DNS failures
- kube-proxy
aws eks update-addon --cluster-name <cluster> --addon-name coredns
aws eks update-addon --cluster-name <cluster> --addon-name kube-proxy
aws eks update-addon --cluster-name <cluster> --addon-name vpc-cni
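To pin an add-on to a version known to be compatible with the target Kubernetes release, look it up first and pass it explicitly. A sketch with placeholder values:
# List add-on versions published as compatible with the target Kubernetes version
aws eks describe-addon-versions \
  --addon-name coredns \
  --kubernetes-version 1.33 \
  --query 'addons[].addonVersions[].addonVersion'
# Then pin the add-on explicitly, preserving any custom configuration
aws eks update-addon \
  --cluster-name <cluster> \
  --addon-name coredns \
  --addon-version <version-from-above> \
  --resolve-conflicts PRESERVE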
(1.3) Upgrade the Data Plane (Worker Nodes)
After the control plane is finished, you must upgrade your nodes.
- Managed Node Groups: In the EKS Console, select your Node Group and click Update version. AWS will perform a rolling update, cordoning and draining old nodes one by one.
- Fargate: Simply restart your pods (e.g., kubectl rollout restart deployment <name>). New pods will automatically join with the updated version.
- Self-Managed Nodes: You must manually update the AMI and rotate the instances.
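For Managed Node Groups, the console flow can also be driven from the CLI. A minimal sketch with placeholder names:
# Kick off the rolling update to the AMI release matching the new control plane
aws eks update-nodegroup-version \
  --cluster-name <cluster> \
  --nodegroup-name <nodegroup>
# Track progress; the node group returns to ACTIVE once all nodes are replaced
aws eks describe-nodegroup \
  --cluster-name <cluster> \
  --nodegroup-name <nodegroup> \
  --query 'nodegroup.{status: status, release: releaseVersion}'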
[!IMPORTANT] No Rollbacks: EKS does not support "downgrading" a cluster version once the upgrade has started. Always test the upgrade in a development environment first.
(2) Blue/Green cluster upgrade (two clusters)
What it is
- Stand up new EKS cluster at new version
- Deploy all workloads there
- Shift traffic (DNS, LB, mesh)
- Decommission old cluster
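For the traffic-shift step, one common pattern is weighted DNS between the two clusters' load balancers. A hedged sketch using Route 53 weighted records (zone ID, record name, and LB hostname are placeholders; raise the weight gradually, and set it back to 0 to roll back):
aws route53 change-resource-record-sets \
  --hosted-zone-id <hosted-zone-id> \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "green-cluster",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{"Value": "<green-cluster-lb-hostname>"}]
      }
    }]
  }'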
✅ Benefits
- No version skew
- Full rollback (just flip traffic back)
- Clean break from legacy cruft
- Very predictable cutover
❌ Drawbacks
- Expensive (double infra)
- Heavy operational lift
- Hard for stateful workloads
- Requires mature CI/CD and traffic control
⚠️ Risks
- Data sync / migration errors
- Split-brain risks for stateful systems
- DNS propagation delays
Risk level: 🟡
Operational cost: 🔴
(3) Rolling control-plane + self-healing nodes
(“Hands-off” / Karpenter-heavy)
What it is
- Upgrade control plane
- Update node provisioning config
- Let autoscaling replace nodes over time
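While the autoscaler rotates capacity, the main thing to watch is kubelet versions converging. A sketch, assuming Karpenter v1-style CRDs are installed (resource names may differ by version):
# Watch node versions converge as old nodes are replaced
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
# With Karpenter, inspect the capacity it currently manages
kubectl get nodepools
kubectl get nodeclaims -o wide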
✅ Benefits
- Very little manual work
- No explicit node migration steps
- Elegant once mature
❌ Drawbacks
- Less deterministic
- Harder to audit
- Slower convergence
- Poor visibility during incidents
⚠️ Risks
- Nodes stick around longer than expected
- Sudden mass replacement
- Operators reschedule in unsafe order
Risk level: 🟡→🔴 (depends on maturity)
Where node strategies fit
Node strategies are sub-choices inside the cluster-level approaches (primarily #1):
| Cluster strategy | Node strategy options |
|---|---|
| In-place cluster upgrade | blue/green, rolling, Karpenter |
| Blue/green cluster | new cluster nodes only |
| Self-healing | Karpenter-driven |
Practical comparison (cluster-level)
| Strategy | Safety | Effort | Cost | Stateful-friendly | Rollback |
|---|---|---|---|---|---|
| In-place cluster | 🟢🟡 | 🟡 | 🟢 | 🟡 | 🟡 |
| Blue/green cluster | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 |
| Self-healing | 🟡→🔴 | 🟢 | 🟢 | 🟡 | 🔴 |
Pre-upgrade Checks
What must be checked before moving on:
- Cluster health checks
- Critical workloads running
- No failing webhooks
- No deprecated APIs in use
Example:
Validation: all Tier 0 workloads healthy for 24h before prod upgrade
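A few of these gates can be checked with plain kubectl. A minimal sketch:
# Pods that are not Running or Succeeded anywhere in the cluster
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Webhooks whose backing services, if broken, will block API calls and drains
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
# Aggregated API services that are not Available (a common source of upgrade stalls)
kubectl get apiservices | grep -v True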
Rollback & abort criteria
A mature cadence defines when to stop.
Examples:
- Node upgrade paused if error rate increases
- Abort if CoreDNS fails to stabilise
- Roll back Helm releases, not control plane
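For the last point, a minimal sketch (release, namespace, and revision are placeholders):
# Inspect past revisions of the affected release
helm history <release> -n <namespace>
# Roll the application back without touching the cluster version
helm rollback <release> <revision> -n <namespace>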
Update strategies from the operational scheduling and planning perspective
I mentioned cadence above, so let's look at upgrades from the operational scheduling and planning perspective:
An upgrade strategy can be manual or fully automated, and reactive, ad-hoc, or cadence-based.
A cadence upgrade strategy means:
Upgrading Kubernetes (EKS) on a regular, predictable schedule, rather than only when forced by deprecation or incidents.
It answers when, how often, and how upgrades are performed.
An upgrade cadence usually defines four things:
1) How often upgrades happen
Examples:
- Every minor Kubernetes release (roughly every 3–4 months)
- Every second release
- Twice per year
- Just before AWS EOL deadlines (not recommended)
Example wording:
Upgrade cadence: every minor EKS release, within 30 days of availability
2) Upgrade order (this is critical in EKS) - discussed above
3) Validation gates - discussed above
4) Rollback & abort criteria - discussed above
Cadence vs other strategies (quick comparison)
| Strategy | Description | Risk |
|---|---|---|
| Reactive | Upgrade only when AWS forces it | 🔥 High |
| Ad-hoc | “When someone remembers” | 😬 Medium |
| Cadence-based | Predictable, scheduled | ✅ Low |
| Fully automated | Zero-touch pipelines | ⚠️ Depends on maturity |
Example: a solid EKS cadence strategy
We upgrade EKS clusters on a quarterly cadence.
- Dev: within 7 days of a new EKS minor version
- Staging: within 14 days
- Production: within 30 days
Upgrade order:
1. EKS control plane
2. Managed add-ons
3. Node groups / Karpenter
4. Platform services
5. Application workloads
Upgrades are paused if:
- Tier 0 services fail health checks
- CoreDNS or CNI fails to stabilise within 15 minutes
Why this matters (especially for EKS)
- AWS only supports each Kubernetes version for a limited window (standard support is roughly 14 months)
- Miss the window and you’re forced into rushed upgrades
- Deprecated APIs will bite you eventually
- Regular cadence = smaller, safer jumps
