This article explores AWS EKS cluster upgrade strategies end-to-end (control plane + nodes), and where node strategies fit inside them. We assume a single cluster shared across environments. If there is one cluster per environment (e.g. dev, stage, prod), additional approaches should also be considered.
General Upgrade order (brief)
This is critical in EKS. A proper cadence explicitly states the order:
- Control plane (API server version)
- Managed add-ons
- VPC CNI
- CoreDNS
- kube-proxy
- Node groups (Worker nodes) (kubelet agents version)
- Platform controllers
- Ingress
- Autoscalers
- Observability agents (e.g. Prometheus/Grafana Stack)
Complete EKS Cluster Upgrade Order
Phase 1: Pre-Upgrade Preparation
Audit and compatibility check
- Review Kubernetes changelog for API deprecations (e.g. 1.32 → 1.33)
- Scan workloads for deprecated APIs: kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
- Use tools like pluto or kubent to detect deprecated API usage
- Check third-party controller compatibility
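The audit step above can be partially scripted. Below is a lightweight sketch that greps rendered manifests for a few long-removed API versions; the list is illustrative, not exhaustive, so prefer pluto or kubent for a real audit against your target release notes:

```shell
#!/usr/bin/env bash
# scan_deprecated_apis <dir>: prints matching file:line for each deprecated
# apiVersion found under <dir>, and returns non-zero if any were found.
scan_deprecated_apis() {
  local dir="$1" found=0 api
  # Illustrative examples of API versions removed in past releases; extend as needed.
  local deprecated=("policy/v1beta1" "extensions/v1beta1" "apiextensions.k8s.io/v1beta1")
  for api in "${deprecated[@]}"; do
    # Print every matching file:line so offenders are easy to locate.
    if grep -rn --include='*.yaml' --include='*.yml' "apiVersion: ${api}" "$dir" 2>/dev/null; then
      found=1
    fi
  done
  if [ "$found" -eq 1 ]; then
    echo "Deprecated API versions found; update manifests before upgrading."
    return 1
  fi
  echo "No deprecated API versions detected."
}

# Example: scan_deprecated_apis ./manifests
```

This only catches literal `apiVersion:` strings in plain YAML, so render Helm charts (`helm template`) first.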
Backup everything
- etcd snapshots (AWS manages these on EKS; you cannot take them yourself)
- Backup all persistent volumes (snapshots of EBS/EFS)
- Export critical resources: kubectl get all --all-namespaces -o yaml > backup.yaml
- Document current state (versions, configs, ingress endpoints)
Test in non-production
- Upgrade a dev/staging cluster first
- Run smoke tests on applications
- Verify data migration strategies work
Phase 2: Core Infrastructure Upgrade
Upgrade control plane
- Update cluster_version = "1.33"
- Apply and monitor
Upgrade managed add-ons
- VPC CNI
- CoreDNS
- kube-proxy
- EBS CSI driver (if using persistent volumes)
Upgrade node groups
- Update managed node group versions to 1.33
- Rolling update happens automatically
- Monitor pod rescheduling
Phase 3: Application Layer Upgrade
Update workload manifests for API compatibility
- Before or during node upgrades, update application manifests that use deprecated APIs
- Example: If using deprecated policy/v1beta1 PodDisruptionBudget, update to policy/v1
- Update CRDs that might be version-dependent
- Redeploy applications with updated manifests
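As a concrete example of the apiVersion bump mentioned above, a PodDisruptionBudget moves from the removed policy/v1beta1 to policy/v1 with the spec otherwise unchanged (the name and labels here are illustrative):

```yaml
# Before (apiVersion removed in Kubernetes 1.25):
# apiVersion: policy/v1beta1
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb        # illustrative name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
```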
Upgrade platform controllers
- AWS Load Balancer Controller
- Cluster Autoscaler (MUST match k8s version)
- Metrics Server
- External DNS
- Cert Manager
- Any other cluster-level controllers
Upgrade observability stack
- Prometheus/Grafana
- CloudWatch Container Insights
- Datadog/New Relic/other APM agents
- Logging agents (Fluentd, Fluent Bit)
Phase 4: Stateful Workload Migration
Stateful workload data considerations
Important distinction: Data migration isn't usually a separate step in in-place EKS upgrades because:
- PersistentVolumes stay attached to the same cluster
- StatefulSets maintain their PVCs when pods reschedule
- Data persists as pods move between old and new nodes
However, you DO need to handle:
a) Database version compatibility:
# If running PostgreSQL in k8s and the pod image needs updating
# The data in the PV might need migration
kubectl exec -it postgres-0 -- pg_upgrade ...
b) Application-level state:
- Cache warmup after pods restart on new nodes
- Session data if using in-memory sessions
- Queue processing (ensure messages aren't lost during pod restarts)
c) Stateful application upgrades:
# Update the StatefulSet with a new image that's compatible with k8s 1.33
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-database
spec:
  template:
    spec:
      containers:
        - name: db
          image: postgres:15  # Updated version
d) Storage driver compatibility:
- If using EBS CSI driver, ensure workloads work with updated driver
- Test PVC creation/attachment on new nodes
- Verify snapshot/restore functionality
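One way to exercise the updated driver on the new nodes is a throwaway claim: apply it, confirm it reaches Bound, then delete it. The storage class name below is an assumption; use a class backed by ebs.csi.aws.com:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-smoke-test      # throwaway; delete after verification
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3     # assumption: adjust to your CSI-backed class
  resources:
    requests:
      storage: 1Gi
```

Note that if the class uses volumeBindingMode: WaitForFirstConsumer, the claim stays Pending until a pod consumes it, so pair it with a test pod.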
Phase 5: Validation & Monitoring
Validate cluster health
kubectl get nodes # All should be Ready with v1.33
kubectl get pods --all-namespaces # All should be Running
kubectl top nodes # Resource utilization
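The node check can be scripted. A sketch of a helper that flags stragglers, fed by standard kubectl output (the helper function name is ours):

```shell
# all_nodes_at_version <minor> reads "name kubeletVersion" pairs on stdin and
# fails if any node is not on the target minor version.
# Feed it with:
#   kubectl get nodes --no-headers \
#     -o custom-columns=NAME:.metadata.name,VER:.status.nodeInfo.kubeletVersion
all_nodes_at_version() {
  local target="$1" node ver rc=0
  while read -r node ver; do
    [ -z "$node" ] && continue
    case "$ver" in
      "v${target}."*) ;;                      # e.g. v1.33.1-eks-... is fine
      *) echo "stale node: $node ($ver)"; rc=1 ;;
    esac
  done
  return "$rc"
}
```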
Application-level validation
- Run smoke tests against applications
- Check ingress/service endpoints are accessible
- Verify database connections work
- Test autoscaling behavior
- Validate monitoring/alerting still works
Performance validation
- Check application latency/throughput
- Verify persistent storage I/O performance
- Monitor for any degradation
Phase 6: Cleanup
Remove old/deprecated resources
- Clean up any temporary migration resources
- Remove old API version manifests
- Update GitOps repos with new manifests
The Data Migration Question - When Does It Happen?
For in-place EKS upgrades (most common):
- Data migration is implicit during pod rescheduling
- When a pod on an old node is terminated and recreated on a new node, its PV reattaches automatically
- No explicit "data migration step" needed
You DO need explicit data migration if:
1) Changing storage classes:
Old: gp2 volumes
New: gp3 volumes
Need to: snapshot → restore → update PVC
2) Moving between clusters (blue/green upgrade):
- This is where heavy data migration happens
- NOT recommended for stateful workloads (as discussed above)
3) Database major version upgrades:
# PostgreSQL 12 → 15 might need pg_upgrade
# This happens AFTER k8s upgrade, as part of workload upgrade
4) Application schema changes:
- Run database migrations after deploying new app versions
- Often part of your CI/CD pipeline
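For the storage-class change in case 1) above, the snapshot → restore → new PVC path can be sketched with the CSI snapshot API. This requires the external snapshot controller and its CRDs; all names below are illustrative:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap
spec:
  volumeSnapshotClassName: ebs-csi-snapclass  # illustrative class name
  source:
    persistentVolumeClaimName: data-gp2       # the existing gp2-backed claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-gp3                              # new claim on the new class
spec:
  storageClassName: gp3
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi                           # must be >= the snapshot size
  dataSource:
    name: data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```

Point the workload at the new claim once the restored volume is Bound.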
Timeline View
Pre-upgrade prep (API audit, backups)
↓
Control plane upgrade
↓
Managed add-ons upgrade
↓
Node groups upgrade ← Pods reschedule here (implicit data "migration" via PV reattachment)
↓
Update workload manifests (fix deprecated APIs)
↓
Redeploy applications with new manifests
↓
Upgrade platform controllers
↓
Upgrade observability
↓
Explicit data migrations (if needed for app version changes)
↓
Validation & testing
Key Points
- Workload manifest updates: Must update apps using deprecated APIs
- Application redeployment: Some apps need redeployment for compatibility
- Database/app data migrations: Happen as part of app upgrades, not k8s upgrade
- Validation phase: Critical but often overlooked
- Pre-upgrade API auditing: Must happen BEFORE you start
The data migration question is nuanced - for in-place upgrades, it's mostly automatic via PV reattachment. Explicit migration is only needed for application-level changes or storage class changes.
Start with non-prod clusters first. Example upgrade order:
dev → staging → prod
Control plane → add-ons → nodes → workloads
Why does the Control Plane need to be upgraded before Add-ons?
Control plane should be upgraded first, then managed add-ons. Here's why:
Correct Upgrade Order:
- Control plane (cluster_version)
- Managed add-ons (VPC CNI, CoreDNS, kube-proxy)
- Node groups (worker nodes)
Why This Order Matters
Add-ons depend on API versions: Managed add-ons interact with the Kubernetes API server. If you upgrade add-ons before the control plane:
- Add-ons might use API versions not yet available in the older control plane
- Incompatibility errors can break cluster networking or DNS
- You could lose cluster connectivity entirely
AWS enforces compatibility checks: When you try to upgrade add-ons, AWS EKS checks compatibility with your current control plane version. Upgrading add-ons to versions designed for 1.33 while still on 1.32 control plane may fail validation.
Add-ons have version skew policies: Each add-on version is qualified against specific EKS versions. The add-on versions for 1.33 are tested against 1.33 control plane APIs, not 1.32.
Version Skew Support
AWS EKS follows these skew policies:
- Control plane → Add-ons: Add-ons typically support N and N-1 control plane versions
- Control plane → Nodes: Nodes can be up to 3 minor versions behind on Kubernetes 1.28+ (1.33 control plane, 1.30 nodes OK)
This means:
- Upgrading control plane first is safe - existing add-ons continue working
- Upgrading add-ons before control plane risks incompatibility
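The node-skew rule can be encoded as a tiny pre-flight helper (a sketch; it compares minor versions numerically and assumes the N-3 policy of Kubernetes 1.28+):

```shell
# within_skew <control_plane_minor> <node_minor> [max_skew]
# Succeeds when nodes are not newer than the control plane and are
# at most max_skew (default 3) minor versions behind it.
within_skew() {
  local cp="$1" node="$2" max="${3:-3}"
  [ "$node" -le "$cp" ] && [ $((cp - node)) -le "$max" ]
}

# Example: within_skew 33 32   (1.33 control plane, 1.32 nodes)
```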
Practical Example
# Step 1: Upgrade control plane
cluster_version = "1.33" # was 1.32

# Step 2: Upgrade add-ons (in the same or a subsequent apply)
cluster_addons = {
  vpc-cni = {
    addon_version = "v1.19.0-eksbuild.1" # compatible with 1.33
  }
  coredns = {
    addon_version = "v1.11.4-eksbuild.1"
  }
  kube-proxy = {
    addon_version = "v1.33.0-eksbuild.1"
  }
}

# Step 3: Upgrade nodes
eks_managed_node_groups = {
  example = {
    version = "1.33"
  }
}
What Happens If You Upgrade Add-ons First?
VPC CNI: Might fail to create network interfaces or configure pod networking correctly if it expects newer API features.
CoreDNS: Could break DNS resolution cluster-wide if it tries to use API endpoints not available in 1.32.
kube-proxy: Might fail to update iptables rules properly, breaking service networking.
The risk is cluster-wide networking failure, which is much harder to recover from than a failed control plane upgrade.
Best Practice
Terraform will often handle this automatically if you update everything together, but the underlying AWS API calls happen in the correct order: control plane, then add-ons, then nodes. If doing manual staged upgrades, always follow this sequence.
Why do Node Groups (Worker Nodes) need to be upgraded before Platform Controllers?
Node groups should be upgraded first, then platform controllers. Here's why:
Correct Upgrade Order:
- Control plane (1.32 → 1.33)
- Managed add-ons (VPC CNI, CoreDNS, kube-proxy)
- Node groups (worker nodes to 1.33)
- Platform controllers (Ingress, Autoscalers, Observability)
Why Nodes Before Platform Controllers
Platform controllers run ON the nodes: Controllers like AWS Load Balancer Controller, Cluster Autoscaler, and observability agents are deployed as pods on your worker nodes. They depend on:
- The kubelet version on the nodes
- Node-level features and APIs
- Container runtime capabilities
Version compatibility flows upward:
- Nodes must be compatible with the control plane
- Controllers must be compatible with the node/kubelet version they run on
- Upgrading nodes first ensures the runtime environment is ready for updated controllers
Controllers are more tolerant of version skew: Most platform controllers (especially those from AWS or CNCF) are designed to work across multiple Kubernetes versions. For example:
- AWS Load Balancer Controller supports multiple EKS versions
- Cluster Autoscaler typically supports N-2 versions
- These controllers are more flexible than core components
Practical Reasons
Reduced risk of deployment failures: If you upgrade controllers before nodes:
- New controller versions might require node features only available in 1.33
- Pod scheduling could fail if nodes don't support required capabilities
- Controllers might crash-loop waiting for node-level features
Example - Cluster Autoscaler:
If you upgrade CA to v1.33 while nodes are still on 1.32:
- CA might expect new node labels or taints
- Node group scaling could behave unexpectedly
- CA might not recognize older kubelet versions properly
Autoscaler-specific issue: If you upgrade Cluster Autoscaler before nodes, it might:
- Try to scale using 1.33 assumptions on 1.32 nodes
- Misread node capacity or allocatable resources
- Make incorrect scaling decisions during the transition
Ingress controllers need stable node networking: AWS Load Balancer Controller relies on:
- VPC CNI being properly configured on nodes
- Node security groups and networking
- Upgrading nodes ensures network stack is stable before controller changes
Real-World Scenario
# After control plane + add-ons are on 1.33:

# Step 3: Upgrade nodes first
eks_managed_node_groups = {
  app_nodes = {
    version = "1.33"
  }
}

# Step 4: Then upgrade platform controllers
# (via Helm or whatever deployment method you use)
Then update controllers:
# After nodes are on 1.33
helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller \
--version 1.9.0 # supports k8s 1.33
helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
--version 9.38.0
Why This Order Works
- Forward compatibility: Platform controllers are typically designed with forward compatibility in mind - they'll work on older Kubernetes versions but expect nodes to be reasonably current.
- Blast radius: If a platform controller upgrade fails, it's usually isolated (broken autoscaling, broken ingress, etc.). If nodes are in a bad state, nothing works properly.
- Rollback simplicity: Rolling back a controller deployment (Helm rollback, kubectl apply old manifest) is much easier than rolling back a node group upgrade.
Exception: Pre-upgrade Compatibility Checks
The only time you might check controller documentation first is to verify compatibility requirements, but you still upgrade nodes before actually updating the controllers. For example:
- Check that Cluster Autoscaler v1.33.x supports EKS 1.33
- Then upgrade nodes
- Then deploy the new controller version
Summary: Nodes are part of the core Kubernetes infrastructure; platform controllers are workloads running on that infrastructure. Upgrade the foundation (nodes) before the applications (controllers).
Canonical AWS EKS Kubernetes cluster upgrade approaches
At the cluster level, there are 3 canonical upgrade approaches:
- In-place cluster upgrade (within the same cluster)
  - sequential - you can't jump minor versions; upgrades must go one step at a time, e.g. 1.32 → 1.33 → 1.34
  - comes in two variants:
    - (1) Manual:
      - control plane: manual
      - data plane (nodes): manual - blue/green strategy where a green/new node group is created
    - (2) Hybrid:
      - control plane: manual
      - data plane (nodes): rolling upgrade / self-healing nodes - within the same/current node group
- (3) Blue/Green cluster upgrade
  - two clusters: blue - current, green - new
  - non-sequential - the green cluster can be on an arbitrarily higher version
In-place cluster upgrade (sequential, same cluster)
(Most common / AWS default)
Steps:
- Upgrade control plane
- Upgrade add-ons
- Upgrade nodes (node groups). Node replacement can be:
- manual:
- blue/green node groups - we create a new node group, cordon off the old one and drain it
- automatic:
- in-place rolling - once the node group's version is bumped, EKS creates new nodes at the new k8s version, cordons off the current nodes and drains them, migrating pods to the new nodes
- Cluster Autoscaler or Karpenter-driven
- All within the same EKS cluster
Upgrading Control Plane
Upgrading an Amazon EKS control plane is an "in-place" process, but it requires a specific sequence to avoid downtime for our workloads.
In EKS, the control plane (managed by AWS) and the data plane (your worker nodes) are upgraded separately. You should always upgrade the control plane before your worker nodes.
Pre-Upgrade Preparation (Checks & Prerequisites)
Before you click "Update" or upgrade via Terraform (terraform-aws-modules/eks/aws) you must ensure your cluster is ready for the new version.
- Kubectl Version: Ensure your local kubectl is within one minor version of the target. E.g. if the new version is v1.33, your local kubectl should be at least 1.32.
- If using Terraform with the terraform-aws-modules/eks/aws module: check the module version. Use a recent release (v20.0.0+ is recommended for 1.30+ support).
- Check API Deprecations: Newer Kubernetes versions often remove older APIs. We need to scan manifests / Helm charts for removed APIs. We can use:
- EKS Upgrade Insights tab in the AWS Console to see which of your resources are using deprecated APIs.
- A tool like Pluto [https://github.com/FairwindsOps/pluto] to check for APIs removed in 1.33.
- To check deprecated APIs removed in 1.33 we can also use:
kubectl get apiservices
kubectl api-resources
- Check Add-on Compatibility: Verify that versions of these addons are compatible with the target Kubernetes version:
- CoreDNS
- kube-proxy
- VPC CNI
- Check CRDs, especially if you run:
- cert-manager
- ingress controllers
- ECK / Elastic
- external-dns
- Sequential Upgrades: You can only upgrade one minor version at a time (e.g., v1.29 to 1.30). If you are on 1.28 and want 1.30, you must upgrade to 1.29 first.
- The control plane is backward-compatible with nodes up to three minor versions behind (on Kubernetes 1.28+): a 1.33 control plane happily runs 1.32 nodes
- IP Address Availability: Ensure your subnets have at least 5 available IP addresses; EKS needs these to spin up the new control plane instances during the rolling update
Upgrade the Control Plane
You can trigger this via the Console, CLI, eksctl or Terraform. The API server will remain available during this process, though there may be a few seconds of latency.
Using AWS CLI:
aws eks update-cluster-version \
--name <my-cluster> \
--kubernetes-version <target-version>
Notes:
- Zero downtime for the API server
- Existing workloads keep running
- Takes ~10–15 minutes typically
Afterwards:
kubectl version
Update EKS Managed Add-ons
Once the cluster status is ACTIVE, update the default / core add-ons to match the new version. Do this before adding new nodes.
Core add-ons:
- VPC CNI
- CoreDNS
- Mismatched CoreDNS + new nodes = weird DNS failures
- Kube-proxy
aws eks update-addon --cluster-name <cluster> --addon-name coredns
aws eks update-addon --cluster-name <cluster> --addon-name kube-proxy
aws eks update-addon --cluster-name <cluster> --addon-name vpc-cni
(1) Manual In-place cluster upgrade
✅ Benefits
- No DNS or traffic switching
- Minimal external dependencies
- Supported and documented by AWS:
- "Update existing cluster to new Kubernetes version" - Amazon EKS docs
- "Best Practices for Cluster Upgrades" - Amazon EKS docs
- Lowest operational overhead
- Storage stays attached to the same cluster
- StatefulSets maintain their identities
- Pods reschedule to new nodes but keep their volumes
- No data migration needed
- No DNS cutover complexity
- No full cluster rebuild
- We get a safe rollback point (just stop draining if something looks off)
- Works with:
- Managed Node Groups
- Self-managed ASGs
- Karpenter (with some extra checks)
❌ Drawbacks
- You must respect:
- version skew rules - read below what this means!
- API deprecations
- Errors surface in prod if you're careless (though, to be fair, this applies to any approach actioned on the prod cluster)
⚠️ Risks
- Deprecated API removal breaks workloads
- Operators not compatible with target version
- Node drains blocked by PDBs
Risk level: 🟡 (low if done carefully)
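On the PDB risk: with two replicas, a budget of minAvailable: 2 blocks every eviction and stalls the drain, whereas the shape below still lets the drain proceed one pod at a time (the name and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-pdb                 # illustrative
spec:
  maxUnavailable: 1                    # permits evicting one pod at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus
```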
Why do we need to respect version skew rules in case of Manual In-place cluster upgrade (blue/green manual)?
In a Manual Blue/Green Node Group upgrade, you are effectively running a "hybrid" cluster for a period of time where the Control Plane is at v1.33 but the nodes are at v1.32.
Respecting the Version Skew Policy is critical because Kubernetes components (the API Server on the control plane and the Kubelet on the nodes) communicate using specific internal APIs. If the gap between them is too wide, they literally stop speaking the same language.
1. The "Language" Problem (API Compatibility)
Kubernetes follows a strict N−3 skew policy (as of v1.28). This means the Kubelet (the agent on your node) can be up to three minor versions behind the API Server.
The Kubernetes project tests compatibility between the control plane and nodes for up to three minor versions. For example, 1.30 nodes continue to operate when orchestrated by a 1.33 control plane. However, running a cluster with nodes that are persistently three minor versions behind the control plane isn’t recommended. For more information, see Kubernetes version and version skew support policy in the Kubernetes documentation. We recommend maintaining the same Kubernetes version on your control plane and nodes.
- Allowed: Control Plane 1.33 ↔ Node 1.32 (1 version apart).
- Dangerous: Control Plane 1.33 ↔ Node 1.29 (4 versions apart).
If you violate this skew, the API Server may send instructions or objects that the old Kubelet doesn't understand. This leads to nodes reporting as NotReady, or worse, pods appearing to run while their networking or storage is actually broken.
2. Feature Gates and Fields
When you upgrade to 1.33, new "Feature Gates" might be enabled by default. If your old 1.32 nodes don't recognize these features:
- The Kubelet might crash trying to parse a Pod specification that includes a new 1.33 field.
- The Scheduler might place a pod on a 1.32 node that requires a 1.33-specific capability, causing a CreateContainerConfigError.
3. The kube-proxy Risk
In your kube-prometheus-stack setup, networking is vital. The kube-proxy component (which handles Service routing) has even stricter rules than the Kubelet.
- Rule: kube-proxy must not be newer than the Control Plane, and it should ideally match the Kubelet version.
- Risk: If you upgrade the Control Plane but keep very old nodes, your kube-proxy might fail to sync iptables or IPVS rules, causing your Grafana pod to lose its connection to the Prometheus datasource.
4. Why it matters for Blue/Green
Even though you are planning to delete the 1.32 nodes eventually, they must remain functional and healthy during the "transition" phase.
If you ignore version skew (e.g., trying to jump from 1.28 to 1.33 in one Terraform apply):
- The Control Plane updates to 1.33.
- The 1.28 nodes immediately become incompatible.
- The nodes go NotReady.
- The Drain Fails: You cannot gracefully kubectl drain a node that is NotReady. The pods will get stuck in Terminating, and your EBS volumes won't detach cleanly.
Summary: The "Safety Window"
By respecting the skew and only moving one version at a time (1.32 → 1.33), you ensure that:
- Workloads stay alive on the Blue nodes while the Green nodes are being provisioned.
- The API Server can still command the Blue nodes to "Drain" and "Detach Volumes."
- The EBS CSI Driver remains operational across both sets of nodes.
Since you are moving from 1.32 to 1.33, you are within the safe N−1 window. Everything will remain fully compatible during your manual handover.
Upgrade the Data Plane (Worker Nodes)
After the control plane is finished, you must upgrade your nodes.
Manual Blue/Green node group strategy
Self-Managed Nodes: You must manually update the AMI and rotate the instances.
If our node group is already using AMI (e.g. AL2023_x86_64_STANDARD) supported by the new K8s version, the upgrade becomes significantly smoother because the operating system and its underlying dependencies (like the kernel and init system) aren't changing—just the Kubernetes components (kubelet, containerd).
Upgrading k8s version using a Blue/Green node group strategy is the industry standard for minimizing downtime, especially when handling stateful workloads like Prometheus and Grafana.
Since you are moving to EKS 1.33, the EBS CSI Driver is your most critical component. It must be updated before the node migration to ensure it can talk to the new AL2023 nodes.
1. Upgrade the Storage Layer First
Before touching the nodes, update your EBS CSI Driver. For EKS 1.33, the recommended version is v1.54.0-eksbuild.1 or later.
How to check your current version:
kubectl get deployment ebs-csi-controller -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'
Terraform Update: In your cluster_addons block, ensure you have:
cluster_addons = {
  aws-ebs-csi-driver = {
    addon_version               = "v1.54.0-eksbuild.1"
    resolve_conflicts_on_update = "PRESERVE"
  }
}
2. The Blue/Green Node Group Strategy
Since you aren't using Karpenter, you will manually provision the "Green" group.
Step A: Create the "Green" Node Group
Add a new entry to your eks_managed_node_groups map.
Note: EKS 1.33 requires AL2023; AL2 will not work.
eks_managed_node_groups = {
  # Keep your existing "${local.cluster_name}-default" here unchanged

  # NEW: Green Node Group
  "${local.cluster_name}-v133" = {
    ami_type       = "AL2023_x86_64_STANDARD" # Mandatory for 1.33
    instance_types = ["m5.large"]
    min_size       = 2
    max_size       = 5
    desired_size   = 2

    # Ensure this matches your existing networking/tags
    subnet_ids = var.private_subnets
    # ... other configs from your snippet ...
  }
}
How to ensure Manual Control in Terraform
Make sure that simply bumping the cluster version will not trigger a "forced" upgrade of your existing nodes before you can manually intervene (create the green node group and migrate workloads from the blue node group onto it).
In the terraform-aws-modules/eks/aws module, Managed Node Groups have an optional version argument. If you omit it, they usually track the cluster_version at creation time, but they won't force an update unless you tell them to.
To be 100% safe during a Blue/Green migration, follow this pattern:
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "my-cluster"
  cluster_version = "1.33" # Control Plane goes to 1.33

  eks_managed_node_groups = {
    # BLUE (Old) GROUP
    "default-v132" = {
      # Hardcode the version here so it stays on 1.32
      # even though the cluster is 1.33
      ami_type     = "AL2023_x86_64_STANDARD"
      version      = "1.32"
      min_size     = 2
      desired_size = 2
    }

    # GREEN (New) GROUP
    "default-v133" = {
      ami_type     = "AL2023_x86_64_STANDARD"
      version      = "1.33" # New group gets the new version
      min_size     = 2
      desired_size = 2
    }
  }
}
The "Manual" Part: Step-by-Step
Once you run terraform apply with the code above:
- AWS upgrades the Control Plane to 1.33.
- The "Blue" nodes stay on 1.32. They continue running your Grafana/Prometheus pods.
- AWS creates new "Green" nodes on 1.33.
Now, nothing has moved yet. Your pods are still on Blue. This is where you take the wheel:
Step B: Cordon and Drain
Once Terraform finishes creating the new nodes, verify they are Ready:
kubectl get nodes -l eks.amazonaws.com/nodegroup=${local.cluster_name}-v133
Check Health:
kubectl get nodes
(You should see 2 nodes at 1.32 and 2 nodes at 1.33).
Cordon the old (Blue) nodes so nothing new is scheduled on them:
kubectl cordon -l eks.amazonaws.com/nodegroup=default-v132
Drain the Blue nodes (DaemonSet pods are left in place; emptyDir data is discarded):
kubectl drain -l eks.amazonaws.com/nodegroup=default-v132 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300
Note: kubectl drain operates on whole nodes, not namespaces. If you want the monitoring stack (Grafana/Prometheus) to move first, drain the nodes hosting those pods one at a time and verify the monitoring namespace between drains.
At this point, Kubernetes kills the pods on Blue. Since Green is the only place left to go, it starts them there.
Your EBS volumes detach from Blue and attach to Green.
Verify: Check your Grafana dashboards. If something is broken, you can uncordon Blue and move back.
Cleanup: Once you are happy, delete the default-v132 block from Terraform and apply again.
Why this is safer than "In-Place"
In an In-Place rollover, AWS decides when to pull the rug out from under your Prometheus pod. In Blue/Green, you decide. If the first node drain causes a storage error, you can stop immediately without AWS trying to kill your second node.
3. Monitoring the PVC Migration
Since Grafana and Prometheus use ReadWriteOnce EBS volumes, they cannot start on the new node until they are fully detached from the old one.
Watch the "Handover":
Monitor Pods:
kubectl get pods -n monitoring -w (Look for ContainerCreating).
Monitor Volume Attachments:
kubectl get volumeattachments
You should see the attachment for the old node disappear and a new one for the v1.33 node appear.
Check for Errors:
If a pod stays in ContainerCreating for >2 mins, check the events:
kubectl describe pod <pod-name> -n monitoring
Look for: "FailedAttachVolume" or "Multi-Attach error".
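A tiny helper to spot those two errors in describe output (the function name is ours):

```shell
# has_attach_errors: reads `kubectl describe pod` output on stdin and
# succeeds when a volume-attach failure appears in the events.
# Usage: kubectl describe pod <pod> -n monitoring | has_attach_errors && echo "attach problem"
has_attach_errors() {
  grep -Eq 'FailedAttachVolume|Multi-Attach error'
}
```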
4. Risks & Verification
Risk: AZ Mismatch
Mitigation: EBS volumes are AZ-locked. Ensure your new Node Group spans the exact same Availability Zones as the old one.
Risk: IMDSv2 Hop Limit
Mitigation: AL2023 defaults to a hop limit of 1. If your Grafana/Prometheus pods need to fetch IAM metadata, they might fail. Set http_put_response_hop_limit = 2 in your launch template if needed.
Risk: StorageClass Name
Mitigation: Your Helm values use storageClassName: general. Ensure this StorageClass exists and uses the ebs.csi.aws.com provisioner.
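A class matching that mitigation would look roughly like this (the gp3 parameters are assumptions; adjust to your setup):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: general
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # provisions in the consuming pod's AZ
parameters:
  type: gp3
```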
Verification Checklist:
[ ] Grafana Login: Ensure you can log in and that your dashboards (stored in the PVC) are present.
[ ] Prometheus Targets: Check http://<prometheus-url>/targets to ensure it's still scraping.
[ ] Clean Up: After 24 hours of stability, remove the old node group from Terraform.
[!IMPORTANT] No Rollbacks: EKS does not support "downgrading" a cluster version once the upgrade has started. Always test the upgrade in a development environment first.
(2) Hybrid In-place cluster upgrade
- control plane: manual
- nodes: automatic rolling upgrade / self-healing nodes
(“Hands-off” / Karpenter-heavy)
What it is
- Upgrade control plane
- Update node provisioning config (k8s version, AMIs...)
- Let autoscaling replace nodes over time
✅ Benefits
- Very little manual work
- No explicit node migration steps
- Low operational overhead
- Storage stays attached to the same cluster
- StatefulSets maintain their identities
- Pods reschedule to new nodes but keep their volumes
- No data migration needed
- No DNS cutover complexity
- Elegant once mature
❌ Drawbacks
- We have less control over the timing of node replacement - EKS/the autoscaler drives it
- Less deterministic
- Harder to audit
- Slower convergence
- Poor visibility during incidents
⚠️ Risks
- Nodes stick around longer than expected
- Sudden mass replacement
- Operators reschedule in unsafe order
Risk level: 🟡→🔴 (depends on maturity)
Upgrade the Data Plane (Worker Nodes)
In-place Rolling Update strategy
When you change the cluster_version to 1.33 in Terraform and apply, only the control plane is upgraded. AWS EKS Managed Node Groups do not automatically upgrade just because the control plane version changed; a node group only starts its rolling update once you bump its own version (or AMI/launch template).
The Skew "Grace Period"
Kubernetes allows a version skew between the Control Plane and the Nodes. For version 1.28 and above, EKS supports up to 3 minor versions of skew.
- Control Plane: 1.33
- Nodes: 1.32 (This is perfectly valid and supported)
As long as you do not change the version property of the specific node group in Terraform, AWS will leave those nodes alone on v1.32, even after the control plane hits v1.33.
So, if you're using managed node groups (via the module's eks_managed_node_groups):
- You need to explicitly update the version parameter for each node group
- The module will handle the rolling update
- Example: Set version = "1.33" in your node group configuration
If using self-managed node groups:
- You'll need to update the AMI and launch template
- The module typically uses ami_type which will need updating
Recommended Approach
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_version = "1.33"

  eks_managed_node_groups = {
    example = {
      version = "1.33" # ← Add this
      # ... other config
    }
  }
}
Apply in stages if you want to minimize risk:
- Upgrade control plane first (just cluster_version)
- Test your workloads
- Upgrade node groups (add/update version in node group configs)
This gives you a chance to validate compatibility before fully committing to the new version across all nodes.
Once the node group version is bumped, EKS triggers a Rolling Update (managed by AWS):
- Surge: EKS spins up new EC2 instances using the latest AWS-optimized AL2023 AMI for Kubernetes 1.33.
- Readiness: It waits for these new nodes to join the cluster and reach the Ready state.
- Drain: It selects an old 1.32 node, cordons it (marking it unschedulable), and drains it (evicting your pods).
- Terminate: Once the old node is empty, AWS terminates the instance.
- Repeat: This continues until all 1.32 nodes are replaced by 1.33 nodes.
What happens to your PVCs?
Since your (e.g. Grafana and Prometheus) pods use EBS volumes with ReadWriteOnce, the rolling update behaves like this:
- Pod Eviction: When a node hosting Grafana is drained, the pod is signaled to shut down.
- Volume Detach: The EBS CSI driver (which you upgraded in the previous step) sends a command to AWS to detach the EBS volume from the old 1.32 node.
- Scheduling: Kubernetes sees the Grafana pod is pending. Because the old node is cordoned, it schedules the pod onto one of the new 1.33 nodes.
- Volume Attach: The CSI driver attaches the existing EBS volume to the new node. Your data stays intact because the EBS volume is independent of the EC2 instance.
The "Gotcha": Surge Capacity
By default, Managed Node Groups use 1 surge node (this is configurable via update_config).
- If you have 2 nodes, EKS will spin up a 3rd (1.33), then kill one old one, then spin up a 4th, then kill the last old one.
- Risk: If your AWS Account Quotas or Subnet IP space are very tight, the "Surge" node might fail to provision, causing the upgrade to hang.
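With the terraform-aws-modules/eks module, this behaviour can be tuned per node group via update_config; a sketch (the values shown are examples, not recommendations):

```hcl
eks_managed_node_groups = {
  example = {
    version = "1.33"

    update_config = {
      # How many nodes may be replaced in parallel during the rolling update;
      # alternatively set max_unavailable_percentage.
      max_unavailable = 1
    }
  }
}
```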
(3) Blue/Green cluster upgrade (two clusters)
Steps:
- (If not using Terraform for EKS infra) Create cluster backup using Velero
- AWS resources (IAM roles, EKS add-ons, VPC settings, etc.) are not backed up by Velero — you must re-create or re-configure them manually.
- Always validate workloads and custom resources for API deprecations/changes before you upgrade.
- Create a new EKS cluster at new version
- Deploy all workloads there:
- If EKS infra is in Terraform - re-apply Terraform
- If using Velero:
- Use Velero to restore backups from the old cluster into the new one. Velero cannot restore into a cluster running a lower Kubernetes version than the one the backup came from, but restoring into a higher version generally works; just test workloads for API compatibility.
- Re-apply any AWS infrastructure configuration that Velero does not capture (e.g., IAM roles, load balancers, service accounts, security groups).
- Perform state migration for stateful workloads:
- Persistent storage doesn't move automatically: When you create a new green cluster, your EBS volumes, EFS filesystems, and other persistent storage remain attached to the old blue cluster. You need to:
- Snapshot and restore volumes to the new cluster
- Migrate data between storage backends
- Handle cross-cluster storage access during transition
- Deal with potential data inconsistency during migration
- Wire up new Persistent Volume Claims to new volumes
- StatefulSets have cluster-specific identities: Pods in StatefulSets have stable network identities and persistent volume claims tied to the original cluster. Simply recreating them in a new cluster can cause:
- Identity mismatches
- Orphaned PVCs in the old cluster
- Loss of pod-to-volume associations
- Disruption of quorum-based systems (etcd, databases)
- Shift/redirect traffic (DNS, LB, mesh) to the new cluster once validated.
- Decommission old cluster
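Velero can be driven declaratively as well as via its CLI; a minimal sketch of the Backup created on the blue cluster and the matching Restore applied on green (resource names are hypothetical, and both clusters must point at the same backup storage location):

```yaml
# On the blue cluster
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: pre-upgrade-backup
  namespace: velero
spec:
  includedNamespaces: ["*"]
  snapshotVolumes: true
---
# On the green cluster, once Velero sees the same storage location
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: pre-upgrade-restore
  namespace: velero
spec:
  backupName: pre-upgrade-backup
```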
Operational Complexity
No shared state between clusters: Unlike in-place rolling updates where pods gradually migrate, blue/green means:
- Databases need full replication or backup/restore between clusters
- No shared distributed state (cache warmth, leader election, etc.)
- Message queues need draining or dual-publishing
- All stateful data must be explicitly moved
Synchronization windows are risky: You need to:
- Stop writes to the blue cluster
- Ensure all data is migrated/synced to green
- Switch DNS
- Verify green cluster has all data
- Any failure means rollback with potential data loss
DNS propagation delays: Even after switching DNS:
- TTL means clients may hit old cluster for minutes
- Long-lived connections stay on blue cluster
- Split-brain scenarios where some clients hit blue, others hit green
- Stateful systems can't handle this dual-write scenario safely
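One way to soften the DNS risks above is weighted routing instead of a hard cutover; a Route 53 sketch in Terraform (zone ID, hostnames, and weights are placeholders):

```hcl
resource "aws_route53_record" "app_blue" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60        # low TTL so clients re-resolve quickly
  set_identifier = "blue"
  records        = [var.blue_lb_dns_name]

  weighted_routing_policy {
    weight = 90              # gradually shift this toward 0
  }
}

resource "aws_route53_record" "app_green" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "green"
  records        = [var.green_lb_dns_name]

  weighted_routing_policy {
    weight = 10              # and this toward 100 as green is validated
  }
}
```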
Specific Examples
Database with active transactions:
- Can't just switch DNS mid-transaction
- Need to drain connections, finalize replication, then cut over
- Risk of split-brain if clients connect to both clusters
Distributed caches (Redis, Memcached):
- Cache in green cluster starts cold
- Can't share cache state between clusters
- Performance degradation until warm
Message queues (Kafka, RabbitMQ):
- Messages in-flight in blue cluster
- Need to drain queues or accept message loss
- Consumer offsets don't transfer automatically
Why In-Place Upgrades Are Easier for Stateful Workloads
With in-place node-by-node upgrades:
- Storage stays attached to the same cluster
- StatefulSets maintain their identities
- Pods reschedule to new nodes but keep their volumes
- No data migration needed
- No DNS cutover complexity
The "heavy operational lift" comes from having to manually orchestrate what Kubernetes normally handles automatically during rolling updates - maintaining state consistency while changing the underlying infrastructure.
✅ Benefits
- No version skew
- Full rollback (just flip traffic back)
- Clean break from legacy cruft
- Very predictable cutover
❌ Drawbacks
- Expensive (double infra)
- Heavy operational lift
- Hard for stateful workloads
- Requires mature CI/CD and traffic control
⚠️ Risks
- Data sync / migration errors
- Split-brain risks for stateful systems
- DNS propagation delays
Risk level: 🟡
Operational cost: 🔴
Where node strategies fit
Node strategies are sub-choices inside approach #1:
| Cluster strategy | Node strategy options |
| --- | --- |
| In-place cluster upgrade | blue/green, rolling, Karpenter |
| Blue/green cluster | new cluster nodes only |
| Self-healing | Karpenter-driven |
Practical comparison (cluster-level)
| Strategy | Safety | Effort | Cost | Stateful-friendly | Rollback |
| --- | --- | --- | --- | --- | --- |
| In-place cluster | 🟢🟡 | 🟡 | 🟢 | 🟡 | 🟡 |
| Blue/green cluster | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 |
| Self-healing | 🟡→🔴 | 🟢 | 🟢 | 🟡 | 🔴 |
Pre-upgrade Checks
What must be checked before moving on:
- Cluster health checks
- Critical workloads running
- No failing webhooks
- No deprecated APIs in use
Example:
Validation: all Tier 0 workloads healthy for 24h before prod upgrade
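The gates above can be checked with the tools already mentioned in this article; for example (the target version and output flags are illustrative):

```shell
# Deprecated APIs actually being served by the API server
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

# Deprecated APIs in Helm releases / manifests (pluto, kubent)
pluto detect-helm -o wide
kubent --target-version 1.33

# Webhooks that could block evictions or object updates mid-upgrade
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
```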
Rollback & abort criteria
A mature cadence defines when to stop.
Examples:
- Node upgrade paused if error rate increases
- Abort if CoreDNS fails to stabilise
- Roll back Helm releases, not control plane
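Rolling back at the Helm layer, per the last point, looks like this (the release name, namespace, and revision number are examples):

```shell
# Inspect the release history to find the last known-good revision
helm history ingress-nginx -n ingress-nginx

# Roll the release back without touching the control plane version
helm rollback ingress-nginx 3 -n ingress-nginx
```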
Update strategies from the operational scheduling and planning perspective
I mentioned cadence above, so let's look at upgrades from the operational scheduling and planning perspective.
An upgrade strategy can be manual or fully automated, and reactive, ad-hoc, or cadence-based.
A cadence upgrade strategy means:
Upgrading Kubernetes (EKS) on a regular, predictable schedule, rather than only when forced by deprecation or incidents.
It answers when, how often, and how upgrades are performed.
An upgrade cadence usually defines four things:
1) How often upgrades happen
Examples:
- Every minor Kubernetes release (roughly every 3–4 months)
- Every second release
- Twice per year
- Just before AWS EOL deadlines (not recommended)
Example wording:
Upgrade cadence: every minor EKS release, within 30 days of availability
2) Upgrade order (this is critical in EKS) - discussed above
3) Validation gates - discussed above
4) Rollback & abort criteria - discussed above
Cadence vs other strategies (quick comparison)
| Strategy | Description | Risk |
| --- | --- | --- |
| Reactive | Upgrade only when AWS forces it | 🔥 High |
| Ad-hoc | "When someone remembers" | 😬 Medium |
| Cadence-based | Predictable, scheduled | ✅ Low |
| Fully automated | Zero-touch pipelines | ⚠️ Depends on maturity |
Example: a solid EKS cadence strategy
We upgrade EKS clusters on a quarterly cadence.
- Dev: within 7 days of a new EKS minor version
- Staging: within 14 days
- Production: within 30 days

Upgrade order:
1. EKS control plane
2. Managed add-ons
3. Node groups / Karpenter
4. Platform services
5. Application workloads

Upgrades are paused if:
- Tier 0 services fail health checks
- CoreDNS or CNI fails to stabilise within 15 minutes
Why this matters (especially for EKS)
- AWS only supports N–2 Kubernetes versions
- Miss the window and you’re forced into rushed upgrades
- Deprecated APIs will bite you eventually
- Regular cadence = smaller, safer jumps