Sunday, 25 January 2026

Strategies for AWS EKS Cluster Kubernetes Version Upgrade


This article explores AWS EKS cluster upgrade strategies end-to-end (control plane + nodes) and where node strategies fit inside them. We assume a single cluster throughout. If you run multiple clusters, one per environment (e.g. dev, stage, prod), additional approaches should also be considered.


General Upgrade order


This is critical in EKS. A proper cadence explicitly states the order:

  1. Non-prod clusters first
  2. Control plane
  3. Managed add-ons
    1. VPC CNI
    2. CoreDNS
    3. kube-proxy
  4. Worker nodes
  5. Platform controllers
    1. Ingress
    2. Autoscalers
    3. Observability

Example:

Upgrade order: dev → staging → prod
Control plane → add-ons → nodes → workloads


At the cluster level, there are really 3 canonical upgrade approaches:
  1. In-place cluster upgrade (sequential, same cluster)
  2. Blue/Green cluster upgrade (two clusters)
  3. Rolling control-plane + self-healing nodes

(1) In-place cluster upgrade (sequential, same cluster)


(Most common / AWS default)

What it is
  • Upgrade control plane
  • Upgrade add-ons
  • Upgrade nodes
  • All within the same EKS cluster

Node replacement can be:
  • blue/green node groups
  • in-place rolling
  • Karpenter-driven

✅ Benefits
  • No DNS or traffic switching
  • Minimal external dependencies
  • Supported and documented by AWS
  • Lowest operational overhead
  • No full cluster rebuild
  • We get a safe rollback point (just stop draining if something looks off)
  • Works with:
    • Managed Node Groups
    • Self-managed ASGs
    • Karpenter (with some extra checks)

❌ Drawbacks
  • You must respect:
    • version skew rules
    • API deprecations
  • Errors surface in prod if you’re careless

⚠️ Risks
  • Deprecated API removal breaks workloads
  • Operators not compatible with target version
  • Node drains blocked by PDBs

Risk level: 🟡 (low if done carefully)


(1.1) Upgrading Control Plane


Upgrading an Amazon EKS control plane is an "in-place" process, but it requires a specific sequence to avoid downtime for your workloads.

In EKS, the control plane (managed by AWS) and the data plane (your worker nodes) are upgraded separately. You should always upgrade the control plane before your worker nodes.

(1.1.1) Pre-Upgrade Preparation


Before you click "Update," you must ensure your cluster is ready for the new version.
  • Check API Deprecations: Newer Kubernetes versions often remove older APIs. Use the EKS Upgrade Insights tab in the AWS Console to see which of your resources are using deprecated APIs.
    • To see which API versions and resources the cluster currently serves (useful when checking for APIs removed in, e.g., 1.33):
kubectl get apiservices
kubectl api-resources
  • Scan manifests / Helm charts for removed APIs (see the sketch after this list)
  • Check Add-on Compatibility: Verify that your CoreDNS, kube-proxy, and VPC CNI versions are compatible with the target Kubernetes version.
  • Check CRDs, especially if you run:
    • cert-manager
    • ingress controllers
    • ECK / Elastic
    • external-dns
  • Sequential Upgrades: You can only upgrade one minor version at a time (e.g., 1.29 → 1.30). If you are on 1.28 and want 1.30, you must upgrade to 1.29 first.
    • The control plane is backward-compatible with older nodes: a 1.33 control plane happily runs 1.32 nodes (the upstream skew policy even allows kubelets up to three minor versions behind)
  • IP Address Availability: Ensure your subnets have at least 5 available IP addresses; EKS needs these to spin up the new control plane instances during the rolling update (see the sketch after this list)
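
A minimal sketch for the deprecated-API and IP checks above. The apiserver metric is built into Kubernetes; the manifest scan assumes you use Fairwinds' pluto, and the manifest directory and subnet IDs are placeholders:

# Deprecated APIs actually being requested on the live cluster
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

# Scan local manifests and Helm releases for APIs removed in the target version
pluto detect-files -d ./manifests
pluto detect-helm --target-versions k8s=v1.33.0

# Free IPs in the cluster subnets (EKS needs at least 5)
aws ec2 describe-subnets --subnet-ids subnet-aaa subnet-bbb \
  --query 'Subnets[].{id: SubnetId, freeIPs: AvailableIpAddressCount}' \
  --output table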

(1.1.2) Upgrade the Control Plane


You can trigger this via the Console, CLI, or Terraform. The API server will remain available during this process, though there may be a few seconds of latency.

Using AWS CLI:

aws eks update-cluster-version \
  --name <my-cluster> \
  --kubernetes-version <target-version>


Notes:
  • Zero downtime for the API server
  • Existing workloads keep running
  • Takes ~10–15 minutes typically

Afterwards:

kubectl version
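
The update call returns an update ID, so progress can also be tracked from the CLI. A minimal sketch (cluster name and update ID are placeholders):

aws eks describe-update --name <my-cluster> --update-id <update-id>
aws eks wait cluster-active --name <my-cluster>
aws eks describe-cluster --name <my-cluster> --query 'cluster.version'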

(1.2) Update EKS Managed Add-ons


Once the cluster status is ACTIVE, update the default / core add-ons to match the new version. Do this before adding new nodes.

Core add-ons:
  • VPC CNI
  • CoreDNS
    • Mismatched CoreDNS + new nodes = weird DNS failures
  • kube-proxy

aws eks update-addon --cluster-name <cluster> --addon-name coredns --addon-version <compatible-version>
aws eks update-addon --cluster-name <cluster> --addon-name kube-proxy --addon-version <compatible-version>
aws eks update-addon --cluster-name <cluster> --addon-name vpc-cni --addon-version <compatible-version>
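
Pinning an explicit version is safest. To list add-on versions compatible with the target cluster version (the JMESPath query is just one way to slice the output):

aws eks describe-addon-versions \
  --addon-name coredns \
  --kubernetes-version <target-version> \
  --query 'addons[].addonVersions[].{version: addonVersion, default: compatibilities[0].defaultVersion}' \
  --output table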



(1.3) Upgrade the Data Plane (Worker Nodes)


After the control plane is finished, you must upgrade your nodes.
  • Managed Node Groups: In the EKS Console, select your Node Group and click Update version. AWS will perform a rolling update, cordoning and draining old nodes one by one (see the CLI sketch after this list).
  • Fargate: Simply restart your pods (e.g., kubectl rollout restart deployment <name>). New pods will automatically join with the updated version.
  • Self-Managed Nodes: You must manually update the AMI and rotate the instances.
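
The managed node group update can also be triggered from the CLI; omitting an explicit version rolls the group to the latest AMI for the cluster's current Kubernetes version:

aws eks update-nodegroup-version \
  --cluster-name <cluster> \
  --nodegroup-name <nodegroup>

kubectl get nodes -o wide -w   # watch old nodes drain and new ones join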


[!IMPORTANT] No Rollbacks: EKS does not support "downgrading" a cluster version once the upgrade has started. Always test the upgrade in a development environment first.


(2) Blue/Green cluster upgrade (two clusters)


What it is
  • Stand up new EKS cluster at new version
  • Deploy all workloads there
  • Shift traffic via DNS, LB, or mesh (see the sketch after this list)
  • Decommission old cluster
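
Traffic shifting is often done with weighted DNS. A minimal sketch using a Route 53 weighted record to send 10% of traffic to the new cluster's load balancer (the zone ID, hostnames, and record are all hypothetical):

aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "green-cluster",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{"Value": "new-cluster-lb.example.com"}]
      }
    }]
  }'

Raise the weight gradually and keep the old cluster's record in place until cutover completes; flipping the weights back is the rollback.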

✅ Benefits
  • No version skew
  • Full rollback (just flip traffic back)
  • Clean break from legacy cruft
  • Very predictable cutover

❌ Drawbacks
  • Expensive (double infra)
  • Heavy operational lift
  • Hard for stateful workloads
  • Requires mature CI/CD and traffic control

⚠️ Risks
  • Data sync / migration errors
  • Split-brain risks for stateful systems
  • DNS propagation delays

Risk level: 🟡
Operational cost: 🔴

(3) Rolling control-plane + self-healing nodes


(“Hands-off” / Karpenter-heavy)

What it is
  • Upgrade control plane
  • Update node provisioning config (e.g. the AMI in your Karpenter NodeClass; see the sketch after this list)
  • Let autoscaling replace nodes over time
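
With Karpenter, node replacement is typically driven by drift: point the NodeClass at a new AMI and let Karpenter retire old nodes gradually, respecting PDBs. A minimal sketch, assuming Karpenter v1 APIs and an EC2NodeClass named default (the name and the AMI alias version are placeholders):

kubectl patch ec2nodeclass default --type merge \
  -p '{"spec":{"amiSelectorTerms":[{"alias":"al2023@v20250715"}]}}'

kubectl get nodeclaims                      # watch old claims being replaced
kubectl get nodes -L karpenter.sh/nodepool  # confirm new nodes per pool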

✅ Benefits
  • Very little manual work
  • No explicit node migration steps
  • Elegant once mature

❌ Drawbacks
  • Less deterministic
  • Harder to audit
  • Slower convergence
  • Poor visibility during incidents

⚠️ Risks
  • Nodes stick around longer than expected
  • Sudden mass replacement
  • Operators reschedule in unsafe order

Risk level: 🟡→🔴 (depends on maturity)


Where node strategies fit


Node strategies are sub-choices inside approach #1:

Cluster strategy           Node strategy options
In-place cluster upgrade   blue/green, rolling, Karpenter
Blue/green cluster         new cluster nodes only
Self-healing               Karpenter-driven


Practical comparison (cluster-level)


Strategy             Safety   Effort   Cost   Stateful-friendly   Rollback
In-place cluster     🟢🟡     🟡       🟢     🟡                  🟡
Blue/green cluster   🟢       🔴       🔴     🔴                  🟢
Self-healing         🟡→🔴    🟢       🟢     🟡                  🔴



Pre-upgrade Checks


What must be checked before moving on:

  • Cluster health checks
  • Critical workloads running
  • No failing webhooks
  • No deprecated APIs in use

Example:

Validation: all Tier 0 workloads healthy for 24h before prod upgrade
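
A minimal pre-flight sketch covering these gates (Tier 0 definitions and the 24h window are policy, not commands):

kubectl get nodes                                   # all nodes Ready?
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations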


Rollback & abort criteria


A mature cadence defines when to stop.

Examples:
  • Node upgrade paused if error rate increases
  • Abort if CoreDNS fails to stabilise
  • Roll back Helm releases, not control plane
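
For the last point, assuming Helm-managed application releases (release name and revision number are placeholders):

helm history my-app        # find the last known-good revision
helm rollback my-app 3     # roll back to it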


Update strategies from the operational scheduling and planning perspective


I mentioned cadence above, so let's explore upgrades from the operational scheduling and planning perspective.

An upgrade strategy can be manual or fully automated, and reactive, ad-hoc, or cadence-based.

A cadence upgrade strategy means:

Upgrading Kubernetes (EKS) on a regular, predictable schedule, rather than only when forced by deprecation or incidents.

It answers when, how often, and how upgrades are performed.

An upgrade cadence usually defines four things:

1) How often upgrades happen

Examples:
  • Every minor Kubernetes release (roughly every 3–4 months)
  • Every second release
  • Twice per year
  • Just before AWS EOL deadlines (not recommended)

Example wording:
Upgrade cadence: every minor EKS release, within 30 days of availability

2) Upgrade order (this is critical in EKS) - discussed above
3) Validation gates - discussed above
4) Rollback & abort criteria - discussed above


Cadence vs other strategies (quick comparison)


Strategy          Description                       Risk
Reactive          Upgrade only when AWS forces it   🔥 High
Ad-hoc            “When someone remembers”          😬 Medium
Cadence-based     Predictable, scheduled            ✅ Low
Fully automated   Zero-touch pipelines              ⚠️ Depends on maturity

Example: a solid EKS cadence strategy


We upgrade EKS clusters on a quarterly cadence.

- Dev: within 7 days of a new EKS minor version
- Staging: within 14 days
- Production: within 30 days

Upgrade order:
1. EKS control plane
2. Managed add-ons
3. Node groups / Karpenter
4. Platform services
5. Application workloads

Upgrades are paused if:
- Tier 0 services fail health checks
- CoreDNS or CNI fails to stabilise within 15 minutes

Why this matters (especially for EKS)
  • AWS supports each Kubernetes version in EKS only for a limited window (14 months of standard support, then paid extended support)
  • Miss the window and you’re forced into rushed upgrades
  • Deprecated APIs will bite you eventually
  • Regular cadence = smaller, safer jumps
