This article explores AWS EKS cluster upgrade strategies end-to-end (control plane + nodes), and where node strategies fit inside them. We assume a single cluster shared across environments. If there is one cluster per environment (e.g. dev, stage, prod), additional approaches should also be considered.
General Upgrade order (brief)
This is critical in EKS. A proper cadence explicitly states the order:
- Control plane (API server version)
- Managed add-ons
- VPC CNI
- CoreDNS
- kube-proxy
- Node groups (Worker nodes) (kubelet agents version)
- Platform controllers
- Ingress
- Autoscalers
- Observability agents (e.g. Prometheus/Grafana Stack)
Complete EKS Cluster Upgrade Order
Phase 1: Pre-Upgrade Preparation
Audit and compatibility check
- Review Kubernetes changelog for API deprecations (e.g. 1.32 → 1.33)
- Scan workloads for deprecated APIs: kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
- Use tools like pluto or kubent to detect deprecated API usage
- Check third-party controller compatibility
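The audit step above can be partially scripted. Below is a lightweight sketch that greps rendered manifests for a few long-removed API versions; the list is illustrative, not exhaustive, so prefer pluto or kubent for a real audit against your target release notes:

```shell
#!/usr/bin/env bash
# scan_deprecated_apis <dir>: prints matching file:line for each deprecated
# apiVersion found under <dir>, and returns non-zero if any were found.
scan_deprecated_apis() {
  local dir="$1" found=0 api
  # Illustrative examples of API versions removed in past releases; extend as needed.
  local deprecated=("policy/v1beta1" "extensions/v1beta1" "apiextensions.k8s.io/v1beta1")
  for api in "${deprecated[@]}"; do
    # Print every matching file:line so offenders are easy to locate.
    if grep -rn --include='*.yaml' --include='*.yml' "apiVersion: ${api}" "$dir" 2>/dev/null; then
      found=1
    fi
  done
  if [ "$found" -eq 1 ]; then
    echo "Deprecated API versions found; update manifests before upgrading."
    return 1
  fi
  echo "No deprecated API versions detected."
}

# Example: scan_deprecated_apis ./manifests
```

This only catches literal `apiVersion:` strings in plain YAML, so render Helm charts (`helm template`) first.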
Backup everything
- etcd snapshots (AWS manages these on EKS; you cannot take them yourself)
- Backup all persistent volumes (snapshots of EBS/EFS)
- Export critical resources: kubectl get all --all-namespaces -o yaml > backup.yaml
- Document current state (versions, configs, ingress endpoints)
Test in non-production
- Upgrade a dev/staging cluster first
- Run smoke tests on applications
- Verify data migration strategies work
Phase 2: Core Infrastructure Upgrade
Upgrade control plane
- Update cluster_version = "1.33"
- Apply and monitor
Upgrade managed add-ons
- VPC CNI
- CoreDNS
- kube-proxy
- EBS CSI driver (if using persistent volumes)
Upgrade node groups
- Update managed node group versions to 1.33
- Rolling update happens automatically
- Monitor pod rescheduling
Phase 3: Application Layer Upgrade
Update workload manifests for API compatibility
- Before or during node upgrades, update application manifests that use deprecated APIs
- Example: If using deprecated policy/v1beta1 PodDisruptionBudget, update to policy/v1
- Update CRDs that might be version-dependent
- Redeploy applications with updated manifests
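As a concrete example of the apiVersion bump mentioned above, a PodDisruptionBudget moves from the removed policy/v1beta1 to policy/v1 with the spec otherwise unchanged (the name and labels here are illustrative):

```yaml
# Before (apiVersion removed in Kubernetes 1.25):
# apiVersion: policy/v1beta1
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb        # illustrative name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
```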
Upgrade platform controllers
- AWS Load Balancer Controller
- Cluster Autoscaler (MUST match k8s version)
- Metrics Server
- External DNS
- Cert Manager
- Any other cluster-level controllers
Upgrade observability stack
- Prometheus/Grafana
- CloudWatch Container Insights
- Datadog/New Relic/other APM agents
- Logging agents (Fluentd, Fluent Bit)
Phase 4: Stateful Workload Migration
Stateful workload data considerations
Important distinction: Data migration isn't usually a separate step in in-place EKS upgrades because:
- PersistentVolumes stay attached to the same cluster
- StatefulSets maintain their PVCs when pods reschedule
- Data persists as pods move between old and new nodes
However, you DO need to handle:
a) Database version compatibility:
# If running PostgreSQL in k8s and the pod image needs updating
# The data in the PV might need migration
kubectl exec -it postgres-0 -- pg_upgrade ...
b) Application-level state:
- Cache warmup after pods restart on new nodes
- Session data if using in-memory sessions
- Queue processing (ensure messages aren't lost during pod restarts)
c) Stateful application upgrades:
# Update the StatefulSet with a new image that's compatible with k8s 1.33
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-database
spec:
  template:
    spec:
      containers:
        - name: db
          image: postgres:15  # Updated version
d) Storage driver compatibility:
- If using EBS CSI driver, ensure workloads work with updated driver
- Test PVC creation/attachment on new nodes
- Verify snapshot/restore functionality
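One way to exercise the updated driver on the new nodes is a throwaway claim: apply it, confirm it reaches Bound, then delete it. The storage class name below is an assumption; use a class backed by ebs.csi.aws.com:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-smoke-test      # throwaway; delete after verification
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3     # assumption: adjust to your CSI-backed class
  resources:
    requests:
      storage: 1Gi
```

Note that if the class uses volumeBindingMode: WaitForFirstConsumer, the claim stays Pending until a pod consumes it, so pair it with a test pod.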
Phase 5: Validation & Monitoring
Validate cluster health
kubectl get nodes # All should be Ready with v1.33
kubectl get pods --all-namespaces # All should be Running
kubectl top nodes # Resource utilization
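The node check can be scripted. A sketch of a helper that flags stragglers, fed by standard kubectl output (the helper function name is ours):

```shell
# all_nodes_at_version <minor> reads "name kubeletVersion" pairs on stdin and
# fails if any node is not on the target minor version.
# Feed it with:
#   kubectl get nodes --no-headers \
#     -o custom-columns=NAME:.metadata.name,VER:.status.nodeInfo.kubeletVersion
all_nodes_at_version() {
  local target="$1" node ver rc=0
  while read -r node ver; do
    [ -z "$node" ] && continue
    case "$ver" in
      "v${target}."*) ;;                      # e.g. v1.33.1-eks-... is fine
      *) echo "stale node: $node ($ver)"; rc=1 ;;
    esac
  done
  return "$rc"
}
```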
Application-level validation
- Run smoke tests against applications
- Check ingress/service endpoints are accessible
- Verify database connections work
- Test autoscaling behavior
- Validate monitoring/alerting still works
Performance validation
- Check application latency/throughput
- Verify persistent storage I/O performance
- Monitor for any degradation
Phase 6: Cleanup
Remove old/deprecated resources
- Clean up any temporary migration resources
- Remove old API version manifests
- Update GitOps repos with new manifests
The Data Migration Question - When Does It Happen?
For in-place EKS upgrades (most common):
- Data migration is implicit during pod rescheduling
- When a pod on an old node is terminated and recreated on a new node, its PV reattaches automatically
- No explicit "data migration step" needed
You DO need explicit data migration if:
1) Changing storage classes:
Old: gp2 volumes
New: gp3 volumes
Need to: snapshot → restore → update PVC
2) Moving between clusters (blue/green upgrade):
- This is where heavy data migration happens
- NOT recommended for stateful workloads (as discussed above)
3) Database major version upgrades:
# PostgreSQL 12 → 15 might need pg_upgrade
# This happens AFTER k8s upgrade, as part of workload upgrade
4) Application schema changes:
- Run database migrations after deploying new app versions
- Often part of your CI/CD pipeline
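For the storage-class change in case 1) above, the snapshot → restore → new PVC path can be sketched with the CSI snapshot API. This requires the external snapshot controller and its CRDs; all names below are illustrative:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap
spec:
  volumeSnapshotClassName: ebs-csi-snapclass  # illustrative class name
  source:
    persistentVolumeClaimName: data-gp2       # the existing gp2-backed claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-gp3                              # new claim on the new class
spec:
  storageClassName: gp3
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi                           # must be >= the snapshot size
  dataSource:
    name: data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```

Point the workload at the new claim once the restored volume is Bound.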
Timeline View
Pre-upgrade prep (API audit, backups)
↓
Control plane upgrade
↓
Managed add-ons upgrade
↓
Node groups upgrade ← Pods reschedule here (implicit data "migration" via PV reattachment)
↓
Update workload manifests (fix deprecated APIs)
↓
Redeploy applications with new manifests
↓
Upgrade platform controllers
↓
Upgrade observability
↓
Explicit data migrations (if needed for app version changes)
↓
Validation & testing
Key Points
- Workload manifest updates: Must update apps using deprecated APIs
- Application redeployment: Some apps need redeployment for compatibility
- Database/app data migrations: Happen as part of app upgrades, not k8s upgrade
- Validation phase: Critical but often overlooked
- Pre-upgrade API auditing: Must happen BEFORE you start
The data migration question is nuanced - for in-place upgrades, it's mostly automatic via PV reattachment. Explicit migration is only needed for application-level changes or storage class changes.
Start with non-prod clusters first. Example upgrade order:
dev → staging → prod
Control plane → add-ons → nodes → workloads
Why does the Control Plane need to be upgraded before Add-ons?
Control plane should be upgraded first, then managed add-ons. Here's why:
Correct Upgrade Order:
- Control plane (cluster_version)
- Managed add-ons (VPC CNI, CoreDNS, kube-proxy)
- Node groups (worker nodes)
Why This Order Matters
Add-ons depend on API versions: Managed add-ons interact with the Kubernetes API server. If you upgrade add-ons before the control plane:
- Add-ons might use API versions not yet available in the older control plane
- Incompatibility errors can break cluster networking or DNS
- You could lose cluster connectivity entirely
AWS enforces compatibility checks: When you try to upgrade add-ons, AWS EKS checks compatibility with your current control plane version. Upgrading add-ons to versions designed for 1.33 while still on 1.32 control plane may fail validation.
Add-ons have version skew policies: Each add-on version is qualified against specific EKS versions. The add-on versions for 1.33 are tested against 1.33 control plane APIs, not 1.32.
Version Skew Support
AWS EKS follows these skew policies:
- Control plane → Add-ons: Add-ons typically support N and N-1 control plane versions
- Control plane → Nodes: Nodes can be up to 3 minor versions behind on Kubernetes 1.28+ (1.33 control plane, 1.30 nodes OK)
This means:
- Upgrading control plane first is safe - existing add-ons continue working
- Upgrading add-ons before control plane risks incompatibility
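The node-skew rule can be encoded as a tiny pre-flight helper (a sketch; it compares minor versions numerically and assumes the N-3 policy of Kubernetes 1.28+):

```shell
# within_skew <control_plane_minor> <node_minor> [max_skew]
# Succeeds when nodes are not newer than the control plane and are
# at most max_skew (default 3) minor versions behind it.
within_skew() {
  local cp="$1" node="$2" max="${3:-3}"
  [ "$node" -le "$cp" ] && [ $((cp - node)) -le "$max" ]
}

# Example: within_skew 33 32   (1.33 control plane, 1.32 nodes)
```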
Practical Example
# Step 1: Upgrade control plane
cluster_version = "1.33" # was 1.32

# Step 2: Upgrade add-ons (in the same or a subsequent apply)
cluster_addons = {
  vpc-cni = {
    addon_version = "v1.19.0-eksbuild.1" # compatible with 1.33
  }
  coredns = {
    addon_version = "v1.11.4-eksbuild.1"
  }
  kube-proxy = {
    addon_version = "v1.33.0-eksbuild.1"
  }
}

# Step 3: Upgrade nodes
eks_managed_node_groups = {
  example = {
    version = "1.33"
  }
}
What Happens If You Upgrade Add-ons First?
VPC CNI: Might fail to create network interfaces or configure pod networking correctly if it expects newer API features.
CoreDNS: Could break DNS resolution cluster-wide if it tries to use API endpoints not available in 1.32.
kube-proxy: Might fail to update iptables rules properly, breaking service networking.
The risk is cluster-wide networking failure, which is much harder to recover from than a failed control plane upgrade.
Best Practice
Terraform will often handle this automatically if you update everything together, but the underlying AWS API calls happen in the correct order: control plane, then add-ons, then nodes. If doing manual staged upgrades, always follow this sequence.
Why do Node Groups (Worker Nodes) need to be upgraded before Platform Controllers?
Node groups should be upgraded first, then platform controllers. Here's why:
Correct Upgrade Order:
- Control plane (1.32 → 1.33)
- Managed add-ons (VPC CNI, CoreDNS, kube-proxy)
- Node groups (worker nodes to 1.33)
- Platform controllers (Ingress, Autoscalers, Observability)
Why Nodes Before Platform Controllers
Platform controllers run ON the nodes: Controllers like AWS Load Balancer Controller, Cluster Autoscaler, and observability agents are deployed as pods on your worker nodes. They depend on:
- The kubelet version on the nodes
- Node-level features and APIs
- Container runtime capabilities
Version compatibility flows upward:
- Nodes must be compatible with the control plane
- Controllers must be compatible with the node/kubelet version they run on
- Upgrading nodes first ensures the runtime environment is ready for updated controllers
Controllers are more tolerant of version skew: Most platform controllers (especially those from AWS or CNCF) are designed to work across multiple Kubernetes versions. For example:
- AWS Load Balancer Controller supports multiple EKS versions
- Cluster Autoscaler typically supports N-2 versions
- These controllers are more flexible than core components
Practical Reasons
Reduced risk of deployment failures: If you upgrade controllers before nodes:
- New controller versions might require node features only available in 1.33
- Pod scheduling could fail if nodes don't support required capabilities
- Controllers might crash-loop waiting for node-level features
Example - Cluster Autoscaler:
If you upgrade CA to v1.33 while nodes are still on 1.32:
- CA might expect new node labels or taints
- Node group scaling could behave unexpectedly
- CA might not recognize older kubelet versions properly
Autoscaler-specific issue: If you upgrade Cluster Autoscaler before nodes, it might:
- Try to scale using 1.33 assumptions on 1.32 nodes
- Misread node capacity or allocatable resources
- Make incorrect scaling decisions during the transition
Ingress controllers need stable node networking: AWS Load Balancer Controller relies on:
- VPC CNI being properly configured on nodes
- Node security groups and networking
- Upgrading nodes ensures network stack is stable before controller changes
Real-World Scenario
# After control plane + add-ons are on 1.33:

# Step 3: Upgrade nodes first
eks_managed_node_groups = {
  app_nodes = {
    version = "1.33"
  }
}

# Step 4: Then upgrade platform controllers
# (via Helm or whatever deployment method you use)
Then update controllers:
# After nodes are on 1.33
helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller \
--version 1.9.0 # supports k8s 1.33
helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
--version 9.38.0
Why This Order Works
- Forward compatibility: Platform controllers are typically designed with forward compatibility in mind - they'll work on older Kubernetes versions but expect nodes to be reasonably current.
- Blast radius: If a platform controller upgrade fails, it's usually isolated (broken autoscaling, broken ingress, etc.). If nodes are in a bad state, nothing works properly.
- Rollback simplicity: Rolling back a controller deployment (Helm rollback, kubectl apply old manifest) is much easier than rolling back a node group upgrade.
Exception: Pre-upgrade Compatibility Checks
The only time you might check controller documentation first is to verify compatibility requirements, but you still upgrade nodes before actually updating the controllers. For example:
- Check that Cluster Autoscaler v1.33.x supports EKS 1.33
- Then upgrade nodes
- Then deploy the new controller version
Summary: Nodes are part of the core Kubernetes infrastructure; platform controllers are workloads running on that infrastructure. Upgrade the foundation (nodes) before the applications (controllers).
Canonical AWS EKS Kubernetes cluster upgrade approaches
At the cluster level, there are 3 canonical upgrade approaches:
- In-place cluster upgrade (within the same cluster)
  - sequential - you can't jump minor versions; upgrades must go one step at a time, e.g. 1.32 → 1.33 → 1.34
  - comes in two variants:
    - (1) Manual:
      - control plane: manual
      - data plane (nodes): manual - blue/green strategy where a green/new node group is created
    - (2) Hybrid:
      - control plane: manual
      - data plane (nodes): rolling upgrade / self-healing nodes - within the same/current node group
- (3) Blue/Green cluster upgrade
  - two clusters: blue - current, green - new
  - non-sequential - the green cluster can be on an arbitrarily higher version
In-place cluster upgrade (sequential, same cluster)
(Most common / AWS default)
Steps:
- Upgrade control plane
- Upgrade add-ons
- Upgrade nodes (node groups). Node replacement can be:
- manual:
- blue/green node groups - we create a new node group, cordon off the old one and drain it
- automatic:
- in-place rolling - once the node group's version is bumped, EKS creates new nodes at the new k8s version, cordons off the current nodes and drains them, migrating pods to the new nodes
- Cluster Autoscaler or Karpenter-driven
- All within the same EKS cluster
Upgrading Control Plane
Upgrading an Amazon EKS control plane is an "in-place" process, but it requires a specific sequence to avoid downtime for our workloads.
In EKS, the control plane (managed by AWS) and the data plane (your worker nodes) are upgraded separately. You should always upgrade the control plane before your worker nodes.
Pre-Upgrade Preparation (Checks & Prerequisites)
Before you click "Update" or upgrade via Terraform (terraform-aws-modules/eks/aws) you must ensure your cluster is ready for the new version.
- Kubectl Version: Ensure your local kubectl is within one minor version of the target. E.g. if the new version is v1.33, your local kubectl should be at least 1.32.
- If using Terraform with the terraform-aws-modules/eks/aws module: check the module version. Use a recent release (v20.0.0+ is recommended for 1.30+ support).
- Check API Deprecations: Newer Kubernetes versions often remove older APIs. We need to scan manifests / Helm charts for removed APIs. We can use:
- EKS Upgrade Insights tab in the AWS Console to see which of your resources are using deprecated APIs.
- A tool like Pluto [https://github.com/FairwindsOps/pluto] to check for APIs removed in 1.33.
- To check deprecated APIs removed in 1.33 we can also use:
kubectl get apiservices
kubectl api-resources
- Check Add-on Compatibility: Verify that versions of these addons are compatible with the target Kubernetes version:
- CoreDNS
- kube-proxy
- VPC CNI
- Check CRDs, especially if you run:
- cert-manager
- ingress controllers
- ECK / Elastic
- external-dns
- Sequential Upgrades: You can only upgrade one minor version at a time (e.g., v1.29 to 1.30). If you are on 1.28 and want 1.30, you must upgrade to 1.29 first.
- The control plane is backward-compatible with nodes up to three minor versions behind (on Kubernetes 1.28+): a 1.33 control plane happily runs 1.32 nodes
- IP Address Availability: Ensure your subnets have at least 5 available IP addresses; EKS needs these to spin up the new control plane instances during the rolling update
Upgrade the Control Plane
You can trigger this via the Console, CLI, eksctl or Terraform. The API server will remain available during this process, though there may be a few seconds of latency.
Using AWS CLI:
aws eks update-cluster-version \
--name <my-cluster> \
--kubernetes-version <target-version>
Notes:
- Zero downtime for the API server
- Existing workloads keep running
- Takes ~10–15 minutes typically
Afterwards:
kubectl version
Update EKS Managed Add-ons
Once the cluster status is ACTIVE, update the default / core add-ons to match the new version. Do this before adding new nodes.
Core add-ons:
- VPC CNI
- CoreDNS
- Mismatched CoreDNS + new nodes = weird DNS failures
- Kube-proxy
aws eks update-addon --cluster-name <cluster> --addon-name coredns
aws eks update-addon --cluster-name <cluster> --addon-name kube-proxy
aws eks update-addon --cluster-name <cluster> --addon-name vpc-cni
(1) Manual In-place cluster upgrade
✅ Benefits
- No DNS or traffic switching
- Minimal external dependencies
- Supported and documented by AWS:
- "Update existing cluster to new Kubernetes version" - Amazon EKS docs
- "Best Practices for Cluster Upgrades" - Amazon EKS docs
- Lowest operational overhead
- Storage stays attached to the same cluster
- StatefulSets maintain their identities
- Pods reschedule to new nodes but keep their volumes
- No data migration needed
- No DNS cutover complexity
- No full cluster rebuild
- We get a safe rollback point (just stop draining if something looks off)
- Works with:
- Managed Node Groups
- Self-managed ASGs
- Karpenter (with some extra checks)
❌ Drawbacks
- You must respect:
- version skew rules - read below what this means!
- API deprecations
- Errors surface in prod if you're careless (though, to be fair, this applies to any approach actioned on the prod cluster)
⚠️ Risks
- Deprecated API removal breaks workloads
- Operators not compatible with target version
- Node drains blocked by PDBs
Risk level: 🟡 (low if done carefully)
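On the PDB risk: with two replicas, a budget of minAvailable: 2 blocks every eviction and stalls the drain, whereas the shape below still lets the drain proceed one pod at a time (the name and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-pdb                 # illustrative
spec:
  maxUnavailable: 1                    # permits evicting one pod at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus
```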
Why do we need to respect version skew rules in case of Manual In-place cluster upgrade (blue/green manual)?
In a Manual Blue/Green Node Group upgrade, you are effectively running a "hybrid" cluster for a period of time where the Control Plane is at v1.33 but the nodes are at v1.32.
Respecting the Version Skew Policy is critical because Kubernetes components (the API Server on the control plane and the Kubelet on the nodes) communicate using specific internal APIs. If the gap between them is too wide, they literally stop speaking the same language.
1. The "Language" Problem (API Compatibility)
Kubernetes follows a strict N−3 skew policy (as of v1.28). This means the Kubelet (the agent on your node) can be up to three minor versions behind the API Server.
The Kubernetes project tests compatibility between the control plane and nodes for up to three minor versions. For example, 1.30 nodes continue to operate when orchestrated by a 1.33 control plane. However, running a cluster with nodes that are persistently three minor versions behind the control plane isn’t recommended. For more information, see Kubernetes version and version skew support policy in the Kubernetes documentation. We recommend maintaining the same Kubernetes version on your control plane and nodes.
- Allowed: Control Plane 1.33 ↔ Node 1.32 (1 version apart).
- Dangerous: Control Plane 1.33 ↔ Node 1.29 (4 versions apart).
If you violate this skew, the API Server may send instructions or objects that the old Kubelet doesn't understand. This leads to nodes reporting as NotReady, or worse, pods appearing to run while their networking or storage is actually broken.
2. Feature Gates and Fields
When you upgrade to 1.33, new "Feature Gates" might be enabled by default. If your old 1.32 nodes don't recognize these features:
- The Kubelet might crash trying to parse a Pod specification that includes a new 1.33 field.
- The Scheduler might place a pod on a 1.32 node that requires a 1.33-specific capability, causing a CreateContainerConfigError.
3. The kube-proxy Risk
In your kube-prometheus-stack setup, networking is vital. The kube-proxy component (which handles Service routing) has even stricter rules than the Kubelet.
- Rule: kube-proxy must not be newer than the Control Plane, and it should ideally match the Kubelet version.
- Risk: If you upgrade the Control Plane but keep very old nodes, your kube-proxy might fail to sync iptables or IPVS rules, causing your Grafana pod to lose its connection to the Prometheus datasource.
4. Why it matters for Blue/Green
Even though you are planning to delete the 1.32 nodes eventually, they must remain functional and healthy during the "transition" phase.
If you ignore version skew (e.g., trying to jump from 1.28 to 1.33 in one Terraform apply):
- The Control Plane updates to 1.33.
- The 1.28 nodes immediately become incompatible.
- The nodes go NotReady.
- The Drain Fails: You cannot gracefully kubectl drain a node that is NotReady. The pods will get stuck in Terminating, and your EBS volumes won't detach cleanly.
Summary: The "Safety Window"
By respecting the skew and only moving one version at a time (1.32 → 1.33), you ensure that:
- Workloads stay alive on the Blue nodes while the Green nodes are being provisioned.
- The API Server can still command the Blue nodes to "Drain" and "Detach Volumes."
- The EBS CSI Driver remains operational across both sets of nodes.
Since you are moving from 1.32 to 1.33, you are within the safe N−1 window. Everything will remain fully compatible during your manual handover.
Upgrade the Data Plane (Worker Nodes)
After the control plane is finished, you must upgrade your nodes.
Manual Blue/Green node group strategy
Self-Managed Nodes: You must manually update the AMI and rotate the instances.
If our node group is already using AMI (e.g. AL2023_x86_64_STANDARD) supported by the new K8s version, the upgrade becomes significantly smoother because the operating system and its underlying dependencies (like the kernel and init system) aren't changing—just the Kubernetes components (kubelet, containerd).
Upgrading k8s version using a Blue/Green node group strategy is the industry standard for minimizing downtime, especially when handling stateful workloads like Prometheus and Grafana.
Since you are moving to EKS 1.33, the EBS CSI Driver is your most critical component. It must be updated before the node migration to ensure it can talk to the new AL2023 nodes.
1. Upgrade the Storage Layer First
Before touching the nodes, update your EBS CSI Driver. For EKS 1.33, the recommended version is v1.54.0-eksbuild.1 or later.
How to check your current version:
kubectl get deployment ebs-csi-controller -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'
Terraform Update: In your cluster_addons block, ensure you have:
cluster_addons = {
  aws-ebs-csi-driver = {
    addon_version               = "v1.54.0-eksbuild.1"
    resolve_conflicts_on_update = "PRESERVE"
  }
}
2. The Blue/Green Node Group Strategy
Since you aren't using Karpenter, you will manually provision the "Green" group.
Step A: Create the "Green" Node Group
Add a new entry to your eks_managed_node_groups map.
Note: EKS 1.33 requires AL2023; AL2 will not work.
eks_managed_node_groups = {
  # Keep your existing "${local.cluster_name}-default" here unchanged

  # NEW: Green Node Group
  "${local.cluster_name}-v133" = {
    ami_type       = "AL2023_x86_64_STANDARD" # Mandatory for 1.33
    instance_types = ["m5.large"]
    min_size       = 2
    max_size       = 5
    desired_size   = 2

    # Ensure this matches your existing networking/tags
    subnet_ids = var.private_subnets
    # ... other configs from your snippet ...
  }
}
How to ensure Manual Control in Terraform
Make sure that simply bumping the cluster version will not trigger a "forced" upgrade of your existing nodes before you can manually intervene (create the green node group and migrate workloads from the blue node group onto it).
In the terraform-aws-modules/eks/aws module, Managed Node Groups have an optional version argument. If you omit it, they usually track the cluster_version at creation time, but they won't force an update unless you tell them to.
To be 100% safe during a Blue/Green migration, follow this pattern:
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "my-cluster"
  cluster_version = "1.33" # Control Plane goes to 1.33

  eks_managed_node_groups = {
    # BLUE (Old) GROUP
    "default-v132" = {
      # Hardcode the version here so it stays on 1.32
      # even though the cluster is 1.33
      ami_type     = "AL2023_x86_64_STANDARD"
      version      = "1.32"
      min_size     = 2
      desired_size = 2
    }

    # GREEN (New) GROUP
    "default-v133" = {
      ami_type     = "AL2023_x86_64_STANDARD"
      version      = "1.33" # New group gets the new version
      min_size     = 2
      desired_size = 2
    }
  }
}
The "Manual" Part: Step-by-Step
Once you run terraform apply with the code above:
- AWS upgrades the Control Plane to 1.33.
- The "Blue" nodes stay on 1.32. They continue running your Grafana/Prometheus pods.
- AWS creates new "Green" nodes on 1.33.
Now, nothing has moved yet. Your pods are still on Blue. This is where you take the wheel:
Step B: Cordon and Drain
Once Terraform finishes creating the new nodes, verify they are Ready:
kubectl get nodes -l eks.amazonaws.com/nodegroup=${local.cluster_name}-v133
Check Health:
kubectl get nodes
(You should see 2 nodes at 1.32 and 2 nodes at 1.33).
Cordon the old (Blue) nodes so nothing new is scheduled on them:
kubectl cordon -l eks.amazonaws.com/nodegroup=default-v132
Drain the Blue nodes (DaemonSet pods are left in place; emptyDir data is discarded):
kubectl drain -l eks.amazonaws.com/nodegroup=default-v132 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300
Note: kubectl drain operates on whole nodes, not namespaces. If you want the monitoring stack (Grafana/Prometheus) to move first, drain the nodes hosting those pods one at a time and verify the monitoring namespace between drains.
At this point, Kubernetes kills the pods on Blue. Since Green is the only place left to go, it starts them there.
Your EBS volumes detach from Blue and attach to Green.
Verify: Check your Grafana dashboards. If something is broken, you can uncordon Blue and move back.
Cleanup: Once you are happy, delete the default-v132 block from Terraform and apply again.
Why this is safer than "In-Place"
In an In-Place rollover, AWS decides when to pull the rug out from under your Prometheus pod. In Blue/Green, you decide. If the first node drain causes a storage error, you can stop immediately without AWS trying to kill your second node.
3. Monitoring the PVC Migration
Since Grafana and Prometheus use ReadWriteOnce EBS volumes, they cannot start on the new node until they are fully detached from the old one.
Watch the "Handover":
Monitor Pods:
kubectl get pods -n monitoring -w (Look for ContainerCreating).
Monitor Volume Attachments:
kubectl get volumeattachments
You should see the attachment for the old node disappear and a new one for the v1.33 node appear.
Check for Errors:
If a pod stays in ContainerCreating for >2 mins, check the events:
kubectl describe pod <pod-name> -n monitoring
Look for: "FailedAttachVolume" or "Multi-Attach error".
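A tiny helper to spot those two errors in describe output (the function name is ours):

```shell
# has_attach_errors: reads `kubectl describe pod` output on stdin and
# succeeds when a volume-attach failure appears in the events.
# Usage: kubectl describe pod <pod> -n monitoring | has_attach_errors && echo "attach problem"
has_attach_errors() {
  grep -Eq 'FailedAttachVolume|Multi-Attach error'
}
```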
4. Risks & Verification
Risk: AZ Mismatch
Mitigation: EBS volumes are AZ-locked. Ensure your new Node Group spans the exact same Availability Zones as the old one.
Risk: IMDSv2 Hop Limit
Mitigation: AL2023 defaults to a hop limit of 1. If your Grafana/Prometheus pods need to fetch IAM metadata, they might fail. Set http_put_response_hop_limit = 2 in your launch template if needed.
Risk: StorageClass Name
Mitigation: Your Helm values use storageClassName: general. Ensure this StorageClass exists and uses the ebs.csi.aws.com provisioner.
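A class matching that mitigation would look roughly like this (the gp3 parameters are assumptions; adjust to your setup):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: general
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # provisions in the consuming pod's AZ
parameters:
  type: gp3
```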
Verification Checklist:
[ ] Grafana Login: Ensure you can log in and that your dashboards (stored in the PVC) are present.
[ ] Prometheus Targets: Check http://<prometheus-url>/targets to ensure it's still scraping.
[ ] Clean Up: After 24 hours of stability, remove the old node group from Terraform.
[!IMPORTANT] No Rollbacks: EKS does not support "downgrading" a cluster version once the upgrade has started. Always test the upgrade in a development environment first.
(2) Hybrid In-place cluster upgrade
- control plane: manual
- nodes: automatic rolling upgrade / self-healing nodes
(“Hands-off” / Karpenter-heavy)
What it is
- Upgrade control plane
- Update node provisioning config (k8s version, AMIs...)
- Let autoscaling replace nodes over time
✅ Benefits
- Very little manual work
- No explicit node migration steps
- Low operational overhead
- Storage stays attached to the same cluster
- StatefulSets maintain their identities
- Pods reschedule to new nodes but keep their volumes
- No data migration needed
- No DNS cutover complexity
- Elegant once mature
❌ Drawbacks
- We have less control over the timing of node replacement - EKS/the autoscaler drives it
- Less deterministic
- Harder to audit
- Slower convergence
- Poor visibility during incidents
⚠️ Risks
- Nodes stick around longer than expected
- Sudden mass replacement
- Operators reschedule in unsafe order
Risk level: 🟡→🔴 (depends on maturity)
Upgrade the Data Plane (Worker Nodes)
In-place Rolling Update strategy
When you change the cluster_version to 1.33 in Terraform and apply, only the control plane is upgraded. AWS EKS Managed Node Groups do not automatically upgrade just because the control plane version changed; a node group only starts its rolling update once you bump its own version (or AMI/launch template).
The Skew "Grace Period"
Kubernetes allows a version skew between the Control Plane and the Nodes. For version 1.28 and above, EKS supports up to 3 minor versions of skew.
- Control Plane: 1.33
- Nodes: 1.32 (This is perfectly valid and supported)
As long as you do not change the version property of the specific node group in Terraform, AWS will leave those nodes alone on v1.32, even after the control plane hits v1.33.
So, if you're using managed node groups (via the module's eks_managed_node_groups):
- You need to explicitly update the version parameter for each node group
- The module will handle the rolling update
- Example: Set version = "1.33" in your node group configuration
If using self-managed node groups:
- You'll need to update the AMI and launch template
- The module typically uses ami_type which will need updating
Recommended Approach
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_version = "1.33"

  eks_managed_node_groups = {
    example = {
      version = "1.33" # ← Add this
      # ... other config
    }
  }
}
Apply in stages if you want to minimize risk:
- Upgrade control plane first (just cluster_version)
- Test your workloads
- Upgrade node groups (add/update version in node group configs)
This gives you a chance to validate compatibility before fully committing to the new version across all nodes.
Once the node group version is bumped, EKS triggers a Rolling Update (managed by AWS):
- Surge: EKS spins up new EC2 instances using the latest AWS-optimized AL2023 AMI for Kubernetes 1.33.
- Readiness: It waits for these new nodes to join the cluster and reach the Ready state.
- Drain: It selects an old 1.32 node, cordons it (marking it unschedulable), and drains it (evicting your pods).
- Terminate: Once the old node is empty, AWS terminates the instance.
- Repeat: This continues until all 1.32 nodes are replaced by 1.33 nodes.
What happens to your PVCs?
Since your (e.g. Grafana and Prometheus) pods use EBS volumes with ReadWriteOnce, the rolling update behaves like this:
- Pod Eviction: When a node hosting Grafana is drained, the pod is signaled to shut down.
- Volume Detach: The EBS CSI driver (which you upgraded in the previous step) sends a command to AWS to detach the EBS volume from the old 1.32 node.
- Scheduling: Kubernetes sees the Grafana pod is pending. Because the old node is cordoned, it schedules the pod onto one of the new 1.33 nodes.
- Volume Attach: The CSI driver attaches the existing EBS volume to the new node. Your data stays intact because the EBS volume is independent of the EC2 instance.
The "Gotcha": Surge Capacity
By default, Managed Node Groups use 1 surge node (this is configurable via update_config).
- If you have 2 nodes, EKS will spin up a 3rd (1.33), then kill one old one, then spin up a 4th, then kill the last old one.
- Risk: If your AWS Account Quotas or Subnet IP space are very tight, the "Surge" node might fail to provision, causing the upgrade to hang.
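With the terraform-aws-modules/eks module, this behaviour can be tuned per node group via update_config; a sketch (the values shown are examples, not recommendations):

```hcl
eks_managed_node_groups = {
  example = {
    version = "1.33"

    update_config = {
      # How many nodes may be replaced in parallel during the rolling update;
      # alternatively set max_unavailable_percentage.
      max_unavailable = 1
    }
  }
}
```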
(3) Blue/Green cluster upgrade (two clusters)
Steps:
- (If not using Terraform for EKS infra) Create cluster backup using Velero
- AWS resources (IAM roles, EKS add-ons, VPC settings, etc.) are not backed up by Velero — you must re-create or re-configure them manually.
- Always validate workloads and custom resources for API deprecations/changes before you upgrade.
- Create a new EKS cluster at new version
- Deploy all workloads there:
- If EKS infra is in Terraform - re-apply Terraform
- If using Velero:
- Use Velero to restore backups from the old cluster into the new one. Velero cannot restore into a cluster running a lower Kubernetes version than the one the backup came from, but restoring into a higher version generally works; just test workloads for API compatibility.
- Re-apply any AWS infrastructure configuration that Velero does not capture (e.g., IAM roles, load balancers, service accounts, security groups).
- Perform state migration for stateful workloads:
- Persistent storage doesn't move automatically: When you create a new green cluster, your EBS volumes, EFS filesystems, and other persistent storage remain attached to the old blue cluster. You need to:
- Snapshot and restore volumes to the new cluster
- Migrate data between storage backends
- Handle cross-cluster storage access during transition
- Deal with potential data inconsistency during migration
- Wire up new Persistent Volume Claims to new volumes
- StatefulSets have cluster-specific identities: Pods in StatefulSets have stable network identities and persistent volume claims tied to the original cluster. Simply recreating them in a new cluster can cause:
- Identity mismatches
- Orphaned PVCs in the old cluster
- Loss of pod-to-volume associations
- Disruption of quorum-based systems (etcd, databases)
- Shift/redirect traffic (DNS, LB, mesh) to the new cluster once validated.
- Decommission old cluster
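Velero can be driven declaratively as well as via its CLI; a minimal sketch of the Backup created on the blue cluster and the matching Restore applied on green (resource names are hypothetical, and both clusters must point at the same backup storage location):

```yaml
# On the blue cluster
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: pre-upgrade-backup
  namespace: velero
spec:
  includedNamespaces: ["*"]
  snapshotVolumes: true
---
# On the green cluster, once Velero sees the same storage location
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: pre-upgrade-restore
  namespace: velero
spec:
  backupName: pre-upgrade-backup
```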
Operational Complexity
No shared state between clusters: Unlike in-place rolling updates where pods gradually migrate, blue/green means:
- Databases need full replication or backup/restore between clusters
- No shared distributed state (cache warmth, leader election, etc.)
- Message queues need draining or dual-publishing
- All stateful data must be explicitly moved
Synchronization windows are risky: You need to:
- Stop writes to the blue cluster
- Ensure all data is migrated/synced to green
- Switch DNS
- Verify green cluster has all data
- Any failure means rollback with potential data loss
DNS propagation delays: Even after switching DNS:
- TTL means clients may hit old cluster for minutes
- Long-lived connections stay on blue cluster
- Split-brain scenarios where some clients hit blue, others hit green
- Stateful systems can't handle this dual-write scenario safely
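One way to soften the DNS risks above is weighted routing instead of a hard cutover; a Route 53 sketch in Terraform (zone ID, hostnames, and weights are placeholders):

```hcl
resource "aws_route53_record" "app_blue" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60        # low TTL so clients re-resolve quickly
  set_identifier = "blue"
  records        = [var.blue_lb_dns_name]

  weighted_routing_policy {
    weight = 90              # gradually shift this toward 0
  }
}

resource "aws_route53_record" "app_green" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "green"
  records        = [var.green_lb_dns_name]

  weighted_routing_policy {
    weight = 10              # and this toward 100 as green is validated
  }
}
```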
Specific Examples
Database with active transactions:
- Can't just switch DNS mid-transaction
- Need to drain connections, finalize replication, then cut over
- Risk of split-brain if clients connect to both clusters
Distributed caches (Redis, Memcached):
- Cache in green cluster starts cold
- Can't share cache state between clusters
- Performance degradation until warm
Message queues (Kafka, RabbitMQ):
- Messages in-flight in blue cluster
- Need to drain queues or accept message loss
- Consumer offsets don't transfer automatically
Why In-Place Upgrades Are Easier for Stateful Workloads
With in-place node-by-node upgrades:
- Storage stays attached to the same cluster
- StatefulSets maintain their identities
- Pods reschedule to new nodes but keep their volumes
- No data migration needed
- No DNS cutover complexity
The "heavy operational lift" comes from having to manually orchestrate what Kubernetes normally handles automatically during rolling updates - maintaining state consistency while changing the underlying infrastructure.
✅ Benefits
- No version skew
- Full rollback (just flip traffic back)
- Clean break from legacy cruft
- Very predictable cutover
❌ Drawbacks
- Expensive (double infra)
- Heavy operational lift
- Hard for stateful workloads
- Requires mature CI/CD and traffic control
⚠️ Risks
- Data sync / migration errors
- Split-brain risks for stateful systems
- DNS propagation delays
Risk level: 🟡
Operational cost: 🔴
Where node strategies fit
Node strategies are sub-choices inside approach #1:
| Cluster strategy | Node strategy options |
| --- | --- |
| In-place cluster upgrade | blue/green, rolling, Karpenter |
| Blue/green cluster | new cluster nodes only |
| Self-healing | Karpenter-driven |
Practical comparison (cluster-level)
| Strategy | Safety | Effort | Cost | Stateful-friendly | Rollback |
| --- | --- | --- | --- | --- | --- |
| In-place cluster | 🟢🟡 | 🟡 | 🟢 | 🟡 | 🟡 |
| Blue/green cluster | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 |
| Self-healing | 🟡→🔴 | 🟢 | 🟢 | 🟡 | 🔴 |
Pre-upgrade Checks
What must be checked before moving on:
- Cluster health checks
- Critical workloads running
- No failing webhooks
- No deprecated APIs in use
Example:
Validation: all Tier 0 workloads healthy for 24h before prod upgrade
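The gates above can be checked with the tools already mentioned in this article; for example (the target version and output flags are illustrative):

```shell
# Deprecated APIs actually being served by the API server
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

# Deprecated APIs in Helm releases / manifests (pluto, kubent)
pluto detect-helm -o wide
kubent --target-version 1.33

# Webhooks that could block evictions or object updates mid-upgrade
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
```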
Rollback & abort criteria
A mature cadence defines when to stop.
Examples:
- Node upgrade paused if error rate increases
- Abort if CoreDNS fails to stabilise
- Roll back Helm releases, not control plane
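Rolling back at the Helm layer, per the last point, looks like this (the release name, namespace, and revision number are examples):

```shell
# Inspect the release history to find the last known-good revision
helm history ingress-nginx -n ingress-nginx

# Roll the release back without touching the control plane version
helm rollback ingress-nginx 3 -n ingress-nginx
```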
Update strategies from the operational scheduling and planning perspective
I mentioned cadence above, so let's look at upgrades from the operational scheduling and planning perspective.
An upgrade strategy can be manual or fully automated, and reactive, ad-hoc, or cadence-based.
A cadence upgrade strategy means:
Upgrading Kubernetes (EKS) on a regular, predictable schedule, rather than only when forced by deprecation or incidents.
It answers when, how often, and how upgrades are performed.
An upgrade cadence usually defines four things:
1) How often upgrades happen
Examples:
- Every minor Kubernetes release (roughly every 3–4 months)
- Every second release
- Twice per year
- Just before AWS EOL deadlines (not recommended)
Example wording:
Upgrade cadence: every minor EKS release, within 30 days of availability
2) Upgrade order (this is critical in EKS) - discussed above
3) Validation gates - discussed above
4) Rollback & abort criteria - discussed above
Cadence vs other strategies (quick comparison)
| Strategy | Description | Risk |
| --- | --- | --- |
| Reactive | Upgrade only when AWS forces it | 🔥 High |
| Ad-hoc | "When someone remembers" | 😬 Medium |
| Cadence-based | Predictable, scheduled | ✅ Low |
| Fully automated | Zero-touch pipelines | ⚠️ Depends on maturity |
Example: a solid EKS cadence strategy
We upgrade EKS clusters on a quarterly cadence.
- Dev: within 7 days of a new EKS minor version
- Staging: within 14 days
- Production: within 30 days

Upgrade order:
1. EKS control plane
2. Managed add-ons
3. Node groups / Karpenter
4. Platform services
5. Application workloads

Upgrades are paused if:
- Tier 0 services fail health checks
- CoreDNS or CNI fails to stabilise within 15 minutes
Why this matters (especially for EKS)
- AWS only supports N–2 Kubernetes versions
- Miss the window and you’re forced into rushed upgrades
- Deprecated APIs will bite you eventually
- Regular cadence = smaller, safer jumps