Monday, 16 March 2026

How to architect a highly available and fault-tolerant AWS EKS Kubernetes cluster


We should follow a layered approach—from the network up to the application.

The "Highly Available EKS" Design Framework


1. Networking & Control Plane (The Foundation)


  • Multi-AZ VPC
    • Design a VPC with at least 3 Private Subnets across 3 different Availability Zones (AZs).
  • EKS Endpoint (API server endpoint) Access
    • Enable "Private Access" for the EKS control plane so worker nodes communicate with the API server within the VPC, reducing exposure and latency.
  • NAT Gateways
    • Use one NAT Gateway per AZ (3 total) to ensure that if one AZ fails, nodes in the other zones still have outbound internet access for image pulls. This means we need 3 public subnets, one in each AZ.

To use a NAT Gateway, you must place it in a public subnet (a subnet with a route to an Internet Gateway). If your goal is to have one NAT Gateway per Availability Zone (AZ) for high availability, you need a corresponding public subnet in each of those three AZs to host them.

Why this structure is necessary

The architecture follows a specific "dependency chain" to ensure that an issue in one data center doesn't take down your entire outbound connectivity:
  1. AZ Independence: A NAT Gateway is redundant within its own Availability Zone, but it physically resides in that one AZ. If AZ-a goes down, the NAT Gateway inside it goes down too.
  2. The Public Subnet Requirement: A NAT Gateway needs a Public IP (EIP) and a route to the Internet Gateway (IGW). Only subnets configured as "public" can provide this.
  3. Cross-Zone Resilience: By having three public subnets (one in each AZ), you can place a NAT Gateway in each. Then, you point the private subnets in AZ-a to the NAT Gateway in AZ-a, the private subnets in AZ-b to the one in AZ-b, and so on.

The Standard Setup

If you are following the recommendation for a 3-AZ deployment, your VPC structure will typically look like this:

Component        AZ-1              AZ-2              AZ-3
---------------  ----------------  ----------------  ----------------
Public Subnet    Subnet-Pub-1      Subnet-Pub-2      Subnet-Pub-3
NAT Gateway      NAT-GW-1          NAT-GW-2          NAT-GW-3
Private Subnet   Nodes/Workloads   Nodes/Workloads   Nodes/Workloads
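The layout above can be sketched as an eksctl ClusterConfig. This is a minimal sketch, not a definitive template: the cluster name, region, AZ names, and Kubernetes version are placeholders you would replace with your own.

```yaml
# Hypothetical eksctl config for the 3-AZ layout above.
# Name, region, AZs, and version are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ha-demo
  region: eu-west-1
  version: "1.29"

availabilityZones: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

vpc:
  nat:
    # "HighlyAvailable" provisions one NAT Gateway per AZ,
    # matching the table above ("Single" would create just one).
    gateway: HighlyAvailable
  clusterEndpoints:
    privateAccess: true   # nodes reach the API server inside the VPC
    publicAccess: true    # optionally disable for a fully private endpoint
```

With `gateway: HighlyAvailable`, eksctl creates the public subnet, EIP, and NAT Gateway in each AZ and wires each private subnet's route table to its local NAT Gateway, which is exactly the dependency chain described earlier.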

Note on Cost: 

While this is the "Gold Standard" for reliability (it avoids cross-zone data-transfer charges and keeps outbound connectivity intact during an AZ failure), keep in mind that AWS charges per hour for each NAT Gateway, plus data-processing fees. Running three of them is significantly more expensive than running one.

A Common Misconception

You could technically have 3 private subnets and only 1 public subnet (with 1 NAT Gateway). In that case, all nodes in all 3 AZs would send their traffic to that single NAT Gateway.

The Risk: If the AZ containing that lone NAT Gateway fails, your nodes in the other two healthy AZs will lose their ability to pull images or talk to the internet, effectively "breaking" your cluster even though the nodes themselves are fine.


2. Compute & Data Plane (The Muscle)


  • Managed Node Groups
    • Node groups implement basic compute scaling through EC2 Auto Scaling groups.
    • Use EKS Managed Node Groups spread across the three private subnets. Selecting multiple subnets for a node group provisions nodes across multiple Availability Zones.
    • Amazon EKS managed node groups make it easy to provision compute capacity for your cluster. Each consists of one or more EC2 instances running the latest EKS-optimized AMI, provisioned as part of an EC2 Auto Scaling group that Amazon EKS manages for you. All resources, including the EC2 instances and Auto Scaling groups, run within your own AWS account.
  • Auto Scaling Groups (ASG)
    • Set min_size to 3 (one node per AZ). This ensures that if a node fails, the ASG replaces it immediately.
    • This is NOT the Kubernetes Cluster Autoscaler (or Karpenter); if we want one of those, it must be installed separately.
      • The Cluster Autoscaler (CAS) and the ASG's own scaling policies can conflict when both control the same node group. Solution: disable the scaling policies on the ASG so CAS alone adjusts capacity.
      • Karpenter and an ASG only conflict if they control the same nodes. In a well-architected EKS cluster we usually have two different "families" of nodes:
        • static: a small, fixed-size Managed Node Group that runs "system" pods (CoreDNS, the CNI, Karpenter itself)
        • dynamic: Karpenter-managed nodes that run the actual workloads, with the node count varying with current usage
  • Instance Diversity
    • Use multiple instance types (e.g., m5.large and m5a.large) to avoid "InsufficientInstanceCapacity" errors in a specific AZ. (Mixing in Graviton types such as m6g.large also helps, but requires multi-architecture container images.)
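The compute recommendations above can be sketched as an eksctl managed node group fragment. This is a sketch under the same placeholder assumptions as before (names, sizes, AZs, and instance types are illustrative):

```yaml
# Hypothetical "static" system node group; names and sizes are placeholders.
managedNodeGroups:
  - name: system
    minSize: 3            # one node per AZ; a lost node is replaced automatically
    desiredCapacity: 3
    maxSize: 6
    # Spanning all three AZs lets the ASG recreate a lost node in a healthy zone.
    availabilityZones: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
    # Same CPU architecture types, so one image works everywhere.
    instanceTypes: ["m5.large", "m5a.large"]
    privateNetworking: true   # place nodes in the private subnets
    labels:
      role: system
```

A Karpenter-managed "dynamic" pool for workloads would then live alongside this group, scaling node count on demand while the static group keeps the system pods stable.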

3. Traffic Management (The Entry Point)


  • AWS Load Balancer Controller
    • Use the aws-load-balancer-controller to provision an Application Load Balancer (ALB).
  • Cross-Zone Load Balancing
    • Cross-zone load balancing is enabled by default on ALBs; ensure it stays enabled so the ALB can route traffic to healthy pods in any AZ, even if nodes in its "local" zone are struggling.
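A minimal Ingress handled by the aws-load-balancer-controller might look like the sketch below. The host, service name, and port are hypothetical; the annotations shown are the controller's standard ones.

```yaml
# Hypothetical Ingress provisioning an internet-facing ALB.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    # "ip" target mode registers pod IPs directly, so the ALB can reach
    # healthy pods in any AZ regardless of which node they run on.
    # (ALBs enable cross-zone load balancing by default.)
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```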

4. Pod-Level Availability (The Brains)


  • Pod Anti-Affinity
    • To ensure replicas don't land on the same node.
  • Topology Spread Constraints
    • To force an equal distribution of pods across the 3 AZs.
  • Pod Disruption Budgets (PDB)
    • To prevent the Cluster Autoscaler or AWS maintenance from taking down too many replicas at once.
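The three pod-level mechanisms above can be combined in one Deployment plus a PodDisruptionBudget. A minimal sketch, assuming a hypothetical app labeled `app: web` with three replicas:

```yaml
# Hypothetical Deployment combining spread constraints and anti-affinity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      # Spread replicas evenly across the three AZs (max imbalance of 1).
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: {app: web}
      # Keep replicas off the same node.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels: {app: web}
      containers:
        - name: web
          image: nginx:1.27
---
# Never let voluntary disruptions (node drains, autoscaler scale-down)
# take the app below two ready replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2
  selector:
    matchLabels: {app: web}
```

Note the trade-off in `whenUnsatisfiable`: `ScheduleAnyway` prefers balance but still schedules during an AZ outage, while `DoNotSchedule` enforces the spread strictly at the cost of pending pods.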


3 Pillars of High Availability


  • Component: Multi-AZ ASG
    • Layer: Infrastructure
    • Goal: Survive an entire AWS Data Center outage.
  • Component: PDBs & Rollouts
    • Layer: Orchestration
    • Goal: Survive maintenance and human error (updates).
  • Component: Spread Constraints
    • Layer: Application
    • Goal: Survive individual EC2 instance crashes.
