Thursday, 2 July 2026

Mitigating Load Balancer Routing Clumping




In computing and networking, routing clumping (also known as traffic clumping or clustering) refers to the phenomenon where a load balancer or router unevenly sends bursts of traffic to the same backend servers instead of distributing requests evenly across the entire available server pool.

While a load balancer is designed to act as a traffic cop, routing clumping defeats this purpose, momentarily overloading specific servers while leaving others idle.

Primary Causes of Routing Clumping


  • Hashing Bias: Algorithms like 5-tuple hashing (which use source IP, destination IP, ports, and protocols) calculate a fixed value to select a routing path. If thousands of users share a proxy or exit node, the load balancer may calculate the same hash and clump all their traffic onto a single backend server.
  • Sticky Sessions: If you use session persistence (sticky sessions) based on a client's IP, multiple requests from the same office or network range will clump to a single server to maintain session continuity.
  • Connection Multiplexing: Load balancers maintain long-lived backend connections to save resources. When a new batch of requests comes in, the load balancer routes them through existing open connections, causing requests to cluster on specific nodes rather than opening new connections to idle servers.
  • Inconsistent Pod or Instance Count: In environments like Kubernetes, if traffic is processed node-by-node, unequal distribution of application pods across nodes or restrictive node affinities can cause traffic to clump in specific zones or servers.
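
The hashing-bias cause can be illustrated with a small sketch. It assumes the hash key is dominated by the source IP (as with source-IP affinity); the function names, the SHA-256 hash, and the addresses are hypothetical choices for illustration, not what any particular load balancer uses.

```python
import hashlib


def pick_backend(src_ip: str, dst_ip: str, dst_port: int, n_backends: int) -> int:
    """Map a flow to a backend index with a stable hash.

    Assumption: the source port is left out of the key, mimicking
    source-IP based affinity.
    """
    key = f"{src_ip}|{dst_ip}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_backends


def distribution(src_ips, n_backends=4):
    """Count how many flows land on each backend."""
    counts = [0] * n_backends
    for ip in src_ips:
        counts[pick_backend(ip, "10.0.0.10", 443, n_backends)] += 1
    return counts


# 1,000 users behind one proxy share a single source IP (example address).
proxied_counts = distribution(["203.0.113.7"] * 1000)

# 1,000 users with distinct source IPs (example addresses).
direct_counts = distribution([f"10.0.{i // 250}.{i % 250 + 1}" for i in range(1000)])

print("behind one proxy:", proxied_counts)  # one backend receives everything
print("distinct clients:", direct_counts)   # spread across several backends
```

Every proxied flow produces the same key, so the hash cannot spread those flows no matter how many backends exist.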

How to Prevent and Mitigate Clumping


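  • Use Round Robin or Least Connections: Switch from IP-hashing or fixed-hash methods to round-robin or least-connections algorithms (on an AWS ALB, the equivalent is least outstanding requests). Round robin rotates through targets regardless of client identity, and least connections additionally accounts for how busy each server currently is, so neither depends on how client addresses happen to hash.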
  • Enable Cross-Zone Load Balancing: For cloud architectures (like AWS Elastic Load Balancing), enabling cross-zone load balancing forces your load balancers to distribute traffic evenly across all registered targets in all enabled Availability Zones.
  • Use Layer 7 Intelligent Routing: Instead of Layer 4 packet forwarding, use a Layer 7 Application Load Balancer to inspect HTTP headers and distribute traffic based on request context, allowing for finer-grained distribution.
  • Deploy Hierarchical Splitting: For massive, hyperscale systems, implementing a hierarchical L4 + L7 structure (ultra-fast L4 hardware balancers routing to a software-based L7 pool) helps prevent clumping while maintaining high raw throughput.


Why didn't round robin routing work well in this scenario, and why do we need least connections routing instead?



Good question — the answer is about the kind of workload service3 is. Round robin isn't broken in general; it's the wrong fit for this specific traffic shape.

  What each algorithm does 
  • Round robin: hands each new request to the next target in rotation (pod A, B, C, A, B, C…), purely by count — blind to how busy each target currently is.
  • Least outstanding requests (LOR): sends each new request to the target with the fewest in-flight requests right now — i.e., the least-busy one.

Why round robin fails here

Round robin distributes by count, which only equals distributing by load when requests are uniform and short. service3 is the opposite — it has the toxic combination:

1. Highly variable, long service times — a spaCy extraction is anywhere from ~0.3s to multi-second depending on text length/entity count. One heavy request ties a pod up for seconds.
2. Low per-pod concurrency — each pod does ~1–2 concurrent extractions (2 workers, GIL-bound). It saturates almost immediately.

Under those conditions, round robin keeps handing new requests to a pod that's still grinding a heavy multi-second extraction, just because it's "next in rotation" — while a pod that just finished sits idle. The new request then queues head-of-line behind the heavy one for seconds, even though free capacity exists elsewhere. Over a burst this compounds: unlucky pods accumulate a backlog while others drain and idle → exactly the clumping / 33%-CPU-but-3s-latency pattern we measured.

The analogy: round robin is a supermarket sending shoppers to checkout lanes strictly 1‑2‑3‑1‑2‑3, ignoring that lane 2 has someone with a giant cart. You get stuck behind the big cart while lane 3 is empty. LOR is "go to the lane with the fewest items in it right now."

Why LOR fixes it

LOR tracks in-flight requests per target and skips the busy pod — a pod mid-heavy-extraction has a high outstanding count, so new requests go to the idle/draining pods instead. It self-corrects for variance (whichever pod is busy, for any reason, automatically gets fewer new requests until it catches up) and it reaches the idle capacity round robin was stranding.
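
A minimal simulation sketch of the argument above. It assumes each pod serves one request at a time in FIFO order, and the arrival times and durations are made up (a 3 s extraction every third request, 0.3 s otherwise); it is not the measured service3 workload.

```python
def simulate(requests, n_pods, policy):
    """Return per-request latency for a list of (arrival, service_time) pairs.

    Assumption: each pod handles one request at a time, in arrival order.
    """
    finish_times = [[] for _ in range(n_pods)]  # completion times per pod
    latencies = []
    for i, (arrival, service) in enumerate(requests):
        if policy == "round_robin":
            pod = i % n_pods  # next in rotation, blind to load
        else:
            # least outstanding requests: count work not yet finished
            outstanding = [sum(1 for f in ft if f > arrival) for ft in finish_times]
            pod = outstanding.index(min(outstanding))
        last_finish = finish_times[pod][-1] if finish_times[pod] else arrival
        start = max(arrival, last_finish)  # wait behind earlier work on this pod
        finish = start + service
        finish_times[pod].append(finish)
        latencies.append(finish - arrival)
    return latencies


# One request every 0.5 s; every third request is a heavy 3 s extraction.
requests = [(0.5 * i, 3.0 if i % 3 == 0 else 0.3) for i in range(12)]

rr = simulate(requests, n_pods=3, policy="round_robin")
lor = simulate(requests, n_pods=3, policy="least_outstanding")

print("round robin max latency:", max(rr))
print("LOR max latency:        ", max(lor))
```

With this input, round robin sends every heavy request to the same pod, so its backlog grows while the other two pods idle. LOR routes each heavy request to whichever pod has nothing in flight, so no request waits behind another.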

The catch — why we also needed IP mode

LOR only helps if the ALB is choosing among the right units. In instance mode, the ALB's targets are nodes, so it'd balance outstanding requests across nodes — then kube-proxy randomly picks a pod within the node, re-scrambling the per-pod balance. LOR literally can't see per-pod busyness in instance mode.

IP mode makes the targets pods, so LOR's "least outstanding" is measured per pod → it can actually steer to the least-busy pod. That's why the fix is ip mode + LOR together: ip mode gives LOR the right granularity; LOR gives ip mode the right decision rule. Either alone wouldn't do it.

Two footnotes

- The other 5 services keep round robin, deliberately. auth/data/game/webhook are fast (<0.5s), uniform, high-concurrency: round robin distributes them fine, and there's no reason to touch them. The algorithm choice is workload-specific: LOR for slow/variable/expensive backends, round robin for fast/uniform ones.
- Terminology: you said "least connections" — the ALB (L7) equivalent is least outstanding requests, which counts in-flight HTTP requests, not TCP connections. That distinction matters here: the ALB reuses keep-alive connections (one connection carries many sequential requests), so "least connections" would be misleading — LOR counts the actual requests, which is the right signal for this.



In which cases does round-robin fall short, so that least outstanding requests routing is needed?


While Round-Robin works well for uniform traffic and identical servers, it fails when request complexity or server capacity varies. Switch from Round-Robin to Least Outstanding Requests (LOR), the request-level counterpart of Least Connections, in the following four scenarios:

1. Varying Request Processing Times (Asymmetric Workloads)


  • The Problem: In many applications, some API calls take 5 milliseconds (e.g., fetching a profile cached in memory), while others take 5 seconds (e.g., generating a heavy PDF report or running a complex database query).
  • Why Round-Robin Fails: Round-Robin blindly hands out requests in a strict, alternating sequence. If Server A randomly receives a consecutive string of heavy PDF requests while Server B receives fast cache requests, Server A's queue will spike, causing high latency or timeouts, while Server B sits mostly idle.
  • How LOR Helps: LOR actively tracks the active connection count. It will notice Server A is backed up with pending work and will divert all new incoming traffic to Server B until Server A finishes its heavy processing.

2. Heterogeneous Server Capacities (Mixed Server Sizes)


  • The Problem: Production clusters often use mixed hardware. For example, during an auto-scaling event, you might temporarily mix older 4-core virtual machines with newer, high-performance 16-core instances.
  • Why Round-Robin Fails: Round-Robin treats every backend target as equal. It sends exactly 1,000 requests to the weak 4-core machine and exactly 1,000 requests to the powerful 16-core machine. The weaker machine will quickly choke, run out of memory, or drop packets, while the stronger machine remains underutilised.
  • How LOR Helps: The faster, more powerful server processes and closes connections much quicker than the weaker server. Because its outstanding connection count drops rapidly, LOR naturally funnels a significantly higher volume of traffic to the stronger hardware without needing manual weight configurations.

3. Persistent and Long-Lived Connections (WebSocket & gRPC)


  • The Problem: Modern applications rely heavily on persistent connections like WebSockets, HTTP/2 multiplexing, server-sent events (SSE), or gRPC streams. These connections stay open for minutes or hours.
  • Why Round-Robin Fails: Round-Robin only counts the initial connection establishment. If Server A hosts 50 clients who stay connected for 3 hours chatting, and Server B hosts 50 clients who disconnect after 30 seconds, Round-Robin will continue to feed new connections to both servers equally. Server A will slowly crush under the cumulative weight of long-lived active sessions.
  • How LOR Helps: LOR continuously monitors active, open connections rather than connection arrival rates. It will see that Server A has 50 active outstanding connections and Server B has 0, immediately routing all new clients to Server B.

4. Unpredictable Backend "Cold Starts" and Drifts


  • The Problem: When a new application container or server boots up (e.g., during a Kubernetes deployment or AWS auto-scaling event), it often suffers from a "cold start" where it runs slowly while warming up caches or compiling code just-in-time (JIT).
  • Why Round-Robin Fails: Round-Robin immediately floods the newly booted server with its full share of production traffic. Because the server is not yet performing at 100% capacity, this sudden burst easily overwhelms it, causing it to crash immediately after launching.
  • How LOR Helps: LOR naturally throttles traffic to the warming server. Since the cold server processes its first few requests slowly, its outstanding request count will naturally rise, signaling the load balancer to back off and route traffic elsewhere until the server catches up.


Algorithm Comparison Table


Scenario | Round-Robin Behavior | Least Outstanding Requests Behavior
All requests take equal time | Perfect, completely equal distribution. | Excellent, identical result to Round-Robin.
Mix of short & long requests | Overloads random servers (creates traffic clumps). | Dynamically balances the processing queue.
Mixed server capacities | Overwhelms smaller, weaker backend servers. | Directs more traffic to faster instances automatically.
Long-lived WebSockets / gRPC | Ignores session duration; causes massive load imbalance. | Tracks real-time active sessions perfectly.

If you are using a cloud platform, you can learn more about implementing this via the AWS Least Outstanding Requests Documentation or the NGINX Least Connections Guide.



k8s ingress - AWS ALB


To implement Least Outstanding Requests (LOR) routing with an AWS Application Load Balancer (ALB) inside a Kubernetes cluster, you must configure it using the AWS Load Balancer Controller.

By default, the AWS ALB uses Round-Robin routing. You can change this behavior by applying the target-group-attributes annotation to your Kubernetes Ingress or Service resource.


1. Ingress Configuration Example


Apply the alb.ingress.kubernetes.io/target-group-attributes annotation to your Ingress manifest. This tells the AWS controller to configure the underlying target groups to use the load_balancing.algorithm.type=least_outstanding_requests setting.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  namespace: production
  annotations:
    # Essential ALB Controller configurations
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    
    # Enable Least Outstanding Requests Routing
    alb.ingress.kubernetes.io/target-group-attributes: load_balancing.algorithm.type=least_outstanding_requests
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-service
                port:
                  number: 80



2. Crucial Step: Use "IP" Target Type


When you use LOR with the AWS ALB in Kubernetes, set alb.ingress.kubernetes.io/target-type: ip rather than relying on instance mode.

  • Why? In instance mode, the ALB routes traffic to NodePorts on the EC2 worker nodes. The node then uses kube-proxy (iptables or IPVS) to randomly route the packet to a pod. This completely breaks the math behind Least Outstanding Requests because the ALB can only see outstanding connections to the node, not the actual pods.
  • The Fix: Using ip mode configures the ALB to bypass the node network entirely and route traffic directly to the Pod IPs. This gives the ALB true visibility into the exact outstanding request count for each application container.

3. Verify the Configuration


After applying the manifest via kubectl apply -f ingress.yaml, you can verify the changes are active in AWS:
  • Open the AWS EC2 Console and navigate to Target Groups.
  • Select the target group automatically generated by your Kubernetes Ingress.
  • Click on the Attributes tab.
  • Verify that Routing algorithm is set to Least outstanding requests.

Alternatively, check the AWS Load Balancer Controller logs to ensure the modification was successfully reconciled:

kubectl logs -n kube-system deployment/aws-load-balancer-controller
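
The console check can also be scripted. The sketch below assumes boto3 is installed and AWS credentials are configured; the helper names are hypothetical, and the target group ARN is left to the caller. The parsing helper is kept separate so it can be exercised without AWS access.

```python
def routing_algorithm(attributes_response: dict) -> str:
    """Extract the routing algorithm from a DescribeTargetGroupAttributes response."""
    for attr in attributes_response.get("Attributes", []):
        if attr["Key"] == "load_balancing.algorithm.type":
            return attr["Value"]
    # Assumption: if the attribute is missing, report the ALB default.
    return "round_robin"


def fetch_routing_algorithm(target_group_arn: str) -> str:
    """Query AWS for a target group's routing algorithm (needs credentials)."""
    import boto3  # imported lazily so the parser above works without boto3

    client = boto3.client("elbv2")
    response = client.describe_target_group_attributes(
        TargetGroupArn=target_group_arn
    )
    return routing_algorithm(response)


# Offline example with a hand-written response of the same shape.
sample_response = {
    "Attributes": [
        {"Key": "stickiness.enabled", "Value": "false"},
        {"Key": "load_balancing.algorithm.type", "Value": "least_outstanding_requests"},
    ]
}
print(routing_algorithm(sample_response))
```

Calling fetch_routing_algorithm with the ARN of the target group generated by the Ingress should return least_outstanding_requests once the controller has reconciled the annotation.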

---

Wednesday, 1 July 2026

Introduction to Velero

 


Velero is an open-source, Kubernetes-native tool for backup, restore, disaster recovery and migration of Kubernetes clusters. It allows you to back up and restore both your Kubernetes cluster resources and, optionally, the persistent volumes (PVs) that hold application data.

It was originally created by Heptio (the company founded by two of Kubernetes' creators) and is now maintained by VMware and the open-source community.

What does Velero back up?

Velero can back up:

  • Kubernetes resources (Kubernetes API objects):
    • Deployments
    • StatefulSets
    • Services
    • Ingresses
    • ConfigMaps
    • Secrets
    • CRDs
    • Namespaces
    • RBAC resources
    • Custom Resources
They are stored as tarballs in object storage, for example an S3 bucket.

Optionally, it can also back up:

  • Persistent Volumes (application data)
    • via storage snapshots (AWS EBS, Azure Disk, GCP Persistent Disk, etc.)
      • cloud snapshots (EBS snapshots through the CSI driver)
    • or via a file-level backup tool called Node Agent (formerly Restic)
      • file-level backup with the built-in Kopia/Restic uploader for non-snapshottable volumes (EFS, hostPath, etc.)

How it works

A typical Velero deployment consists of:


                +----------------+
                | Kubernetes API |
                +--------+-------+
                         |
                   Velero Server
                         |
        +----------------+----------------+
        |                                 |
 Metadata Backup                    Volume Backup
        |                                 |
        v                                 v
 Object Storage                Snapshot or File Backup
 (S3, Azure Blob,              (EBS, CSI Snapshot,
  GCS, MinIO...)                Node Agent/Restic)


For example:

  • Cluster metadata → stored in an S3 bucket
  • PV data → stored as EBS snapshots or uploaded to object storage

Typical use cases

1. Disaster recovery

Your EKS cluster is accidentally deleted.

With Velero you can:

  • recreate the cluster
  • install Velero
  • restore all workloads
  • restore persistent data

2. Accidental deletion

Someone runs:

kubectl delete namespace production

Instead of recreating everything manually:

velero restore create \
--from-backup production-backup

3. Cluster migration

Move workloads from:

  • EKS → EKS
  • EKS → AKS
  • EKS → GKE
  • On-prem → cloud

Velero restores Kubernetes objects into the new cluster.


4. Scheduled backups

Example:

Every night at 2 AM



Backup namespaces:
- production
- monitoring
- logging

Retention can be configured, for example:

Keep 30 daily backups
Delete older ones automatically
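
The schedule and retention above can be turned into a Velero CLI invocation. The sketch below only assembles the command and assumes that "keep 30 daily backups" is expressed as a 30-day TTL on each backup; the helper name is hypothetical, while --schedule, --include-namespaces and --ttl are existing velero schedule create flags.

```python
import shlex


def velero_schedule_command(name, cron, namespaces, keep_days):
    """Build the argv for `velero schedule create` with TTL-based retention."""
    ttl_hours = keep_days * 24  # Velero TTLs are durations, e.g. 720h0m0s
    return [
        "velero", "schedule", "create", name,
        "--schedule", cron,
        "--include-namespaces", ",".join(namespaces),
        "--ttl", f"{ttl_hours}h0m0s",
    ]


cmd = velero_schedule_command(
    name="nightly",
    cron="0 2 * * *",  # every night at 2 AM
    namespaces=["production", "monitoring", "logging"],
    keep_days=30,
)
print(shlex.join(cmd))
```

Backups older than the TTL are removed by Velero, which gives roughly 30 retained backups when one is taken per day.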

What it does NOT back up

Velero does not automatically back up:

  • etcd directly
  • cloud infrastructure (VPCs, Load Balancers, IAM, Security Groups)
  • managed databases like RDS
  • container images (they remain in your registry)
  • external services

Those require separate backup strategies.


Storage providers

Velero supports many object storage backends:

  • Amazon S3
  • MinIO
  • Azure Blob Storage
  • Google Cloud Storage
  • OCI Object Storage
  • many S3-compatible systems

Persistent Volume backup methods

There are two main approaches.

1. CSI snapshots (preferred)

If your storage class supports the Container Storage Interface (CSI) snapshot API:

PVC
 ↓
VolumeSnapshot
 ↓
Cloud snapshot

Advantages:

  • very fast
  • incremental (depending on storage backend)
  • cloud-native
  • recommended

2. Node Agent (formerly Restic)

If snapshots aren't available:

PVC
 ↓
Read filesystem
 ↓
Compress
 ↓
Upload to object storage

Advantages:

  • works almost everywhere
  • storage-independent

Disadvantages:

  • slower
  • consumes CPU and network bandwidth

Example architecture in AWS

               Amazon EKS
                    |
        +-----------+-----------+
        |                       |
  Kubernetes API        Persistent Volumes
        |                       |
  Velero Server           EBS Snapshots
        |
        |
    S3 Bucket
     backups/

Example installation

Install the CLI:

brew install velero

Deploy into an EKS cluster:

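# velero install requires either --secret-file or --no-secret.
# ./credentials-velero is an example path; with IAM roles for service
# accounts, pass --no-secret instead.
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws \
--bucket my-backups \
--secret-file ./credentials-velero \
--backup-location-config region=eu-west-2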

Example backup

Back up an entire cluster:

velero backup create full-cluster

Back up a namespace:

velero backup create production \
--include-namespaces production

Example restore

velero restore create \
--from-backup full-cluster

Scheduling

Create a nightly backup:

velero schedule create nightly \
--schedule="0 2 * * *"

When should you use Velero?

Velero is a good fit if you want to:

  • Recover Kubernetes workloads after accidental deletion or cluster failure.
  • Back up application manifests and, optionally, persistent data.
  • Migrate workloads between Kubernetes clusters or cloud providers.
  • Automate recurring backups with retention policies.
  • Protect stateful applications running on Kubernetes.

If your applications also depend on external systems (for example, managed databases, message brokers, or cloud resources), Velero should be part of a broader disaster recovery strategy rather than the only backup solution.

Velero vs. etcd backup

Feature | Velero | etcd backup
Kubernetes resources | Yes | Yes
Persistent volume data | Yes | No
Restore individual namespaces | Yes | No
Restore individual applications | Yes | No
Migrate between clusters | Yes | Limited
Cloud agnostic | Yes | Mostly
Disaster recovery for applications | Yes | Partial

For managed Kubernetes services such as Amazon Elastic Kubernetes Service (EKS), Velero is generally the preferred backup tool because it focuses on application-level recovery rather than restoring the control plane itself. In contrast, direct etcd backups are more common in self-managed Kubernetes clusters where you control the control plane.

Friday, 26 June 2026

DevOps Interview Questions - k8s ALB alarm




Post-incident review / technical interview questions.

These questions and answers were generated by Claude after it analyzed and fixed a CloudWatch alarm that had been flapping. The prompt I used:

I was following your analysis and resolution of this issue in order to acquire your knowledge. I would like to test my knowledge now. Can you compile a list of questions which cover every aspect of the issue and solution here? Don't be shy of creating a really long list of questions. If I am able to answer them all that means I am able to fix the issue on my own next time. Also prepare the answer key.



A. The alarm itself (CloudWatch / ALB fundamentals)

  1. What does the metric TargetResponseTime actually measure, and from whose perspective (client → ALB → target)?

  2. The alarm name was k8s-api-prod-core-7db0ccf2c3-target-response-time. What do the k8s- prefix and the hash portion tell you about how it was created?

  3. The alarm config was: threshold 0.8, GreaterThanThreshold, period 60s, 2 evaluation periods, statistic Average. In plain English, what condition makes it fire?

  4. Why does the alarm use the LoadBalancer dimension only, and not a TargetGroup dimension? What consequence did that have for our investigation?

  5. What is "flapping," and why does this particular threshold/period combination make flapping likely for a bursty workload?

  6. How do you pull an alarm's state-transition history, and what did 9 OK→ALARM cycles in 5 hours tell you?

  7. The alarm fires on Average. Why is that distinction (vs Maximum/p99) absolutely central to both the diagnosis and the fix?

  8. What is the "low-traffic statistical artifact" pattern, where a handful of slow requests inflate the average on a near-idle target — and what evidence did we use to rule it out here?
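
As a study aid for questions 3 and 7, here is a simplified sketch of the alarm rule: it fires when the per-minute Average exceeds 0.8 for 2 consecutive periods. It ignores INSUFFICIENT_DATA and missing-data handling, and the datapoints are invented for illustration.

```python
def alarm_states(datapoints, threshold=0.8, evaluation_periods=2):
    """Return OK/ALARM after each datapoint for a GreaterThanThreshold alarm.

    Simplification: all evaluation periods must breach, and there is no
    INSUFFICIENT_DATA state.
    """
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = (
            len(window) == evaluation_periods
            and all(value > threshold for value in window)
        )
        states.append("ALARM" if breaching else "OK")
    return states


# Invented per-minute averages (seconds) for a bursty workload.
averages = [0.3, 0.9, 0.4, 1.2, 1.5, 0.5, 0.2]
states = alarm_states(averages)
print(states)  # a single slow minute is not enough; two in a row are

# Why Average matters: one 3 s request among nine fast ones barely moves it.
minute = [0.2] * 9 + [3.0]
minute_average = sum(minute) / len(minute)
print(round(minute_average, 2), max(minute))
```

A short window and a low threshold mean that a burst lasting two minutes flips the alarm on and the next quiet minute flips it back, which is the flapping pattern.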

B. Narrowing from ALB to one service

  1. The ALB fronted six target groups. Name the technique we used to find which one was responsible, and the AWS CLI call behind it.

  2. Why can a single ALB serve six different Kubernetes services? What AWS-LB-Controller concept ties them together (hint: group.name)?

  3. Given the target group name k8s-default-dataservice-88e28b4b6c, how do you map it back to a Kubernetes Service and namespace?

  4. During the burst, data-service showed 5.83s avg / 29.7s max while every other service was <0.5s. Why did that immediately exonerate the shared ALB/ingress as the cause?

C. First look at the workload

  1. What does kubectl top pods show, and why is a single snapshot from it dangerous as evidence?

  2. The first snapshot showed one pod at 997m and three near-idle. What two different explanations are consistent with that, and why can't a snapshot distinguish them?

  3. What's the difference between a CPU request and a CPU limit in Kubernetes?

  4. How did we check whether the pod was being CPU-throttled at its limit, and what file did we read inside the container? What did nr_throttled / throttled_usec tell us?

  5. A deploy to v1.14.3-prod had happened ~15 min earlier. Why was it a red herring, and what evidence dated the flapping as pre-existing?

  6. Distinguish the three probe types (liveness, readiness, startup). What does each one control?

  7. The deployment had a startupProbe and livenessProbe but no readinessProbe. Operationally, what can't the system do without a readiness probe?

D. The load-balancing theory (and why it was wrong)

  1. Explain why round-robin load balancing degrades when request durations are highly variable. Use the "request count vs. total work" framing.

  2. What is the feedback-loop difference between round-robin and least-outstanding-requests (LOR)?

  3. What older, well-known algorithm is LOR equivalent to in nginx/HAProxy terms, and what does AWS's own guidance say about when to use LOR vs round-robin?

  4. "Head-of-line blocking" appeared twice in this incident at two different layers. Name both layers and how each one blocks.

  5. Why is a single kubectl top snapshot insufficient to prove round-robin is causing imbalance, and what data would actually prove or refute it?

  6. We initially called round-robin the "smoking gun," then withdrew it. What specifically made that conclusion wrong?

E. Target mode: instance vs IP (the topology that broke the theory)

  1. What's the difference between an ALB target group in instance mode vs ip mode? How can you tell which one you have from the registered targets and the target-group port?

  2. The target group registered 10 EC2 instances on port 31892. What does that tell you about the request path from ALB to pod?

  3. What is a NodePort service, and what is externalTrafficPolicy: Cluster vs Local?

  4. With instance mode + Cluster policy, describe the full path of a request from client to the backend process, including every hop and who load-balances at each hop.

  5. Given that topology, explain precisely why switching the ALB to least_outstanding_requests would not fix per-pod imbalance.

  6. What two changes together would enable load-aware per-pod routing, and why is that a much bigger change than a one-line annotation?

  7. Why were ALB access logs not useful for confirming per-pod distribution in this setup?

F. Getting the proof (Prometheus / time series)

  1. What is kube-prometheus-stack, and where did it live in the cluster?

  2. Why is kubectl port-forward an acceptable read-only way to query Prometheus, and what does it actually do?

  3. Write (conceptually) the PromQL that gives per-pod CPU usage over time. Why rate(container_cpu_usage_seconds_total[...]) rather than the raw counter?

  4. The time series showed all four pods evenly at 0.5–0.95 cores during the burst. Why did that refute the imbalance theory in one stroke?

  5. Each pod plateaued at ~0.9 cores despite a 1.5-core limit. What does that plateau strongly imply about the process model inside the pod?

G. The real root cause (app server / GIL / async)

  1. What is an application server worker (e.g., Gunicorn), and what's the difference between the master process and a worker process?

  2. What is a Global Interpreter Lock (GIL) or similar single-threaded runtime constraint, and why does it mean one worker process ≈ one core of CPU-bound throughput?

  3. The config had workers = 1 and an asynchronous event-loop worker class specified. Explain what each line does.

  4. What is the difference between a synchronous worker and an asynchronous worker in an application server? When does each shine?

  5. Why is an async (event-loop) worker the wrong model for heavy, synchronous CPU-bound data processing? What does "blocking the event loop" mean concretely?

  6. So with one async worker doing CPU-bound work, what is the per-pod concurrency for heavy requests — and how did that produce the fleet-wide ceiling of ~4 concurrent requests?

  7. How did we confirm the worker count and the CPU quota from inside a running pod (what command, what does cpu.max = 150000 100000 mean, what does worker count = 2 mean)?

  8. Why was the 1.5-core CPU limit effectively unusable given workers = 1?

  9. Tie it together: explain the full causal chain from "traffic burst" to "alarm fires," in one paragraph, using the confirmed root cause.
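
For questions 5 and 6, a small sketch of what "blocking the event loop" looks like. It uses time.sleep as a stand-in for synchronous CPU-bound work and hypothetical coroutine names; it needs Python 3.9+ for asyncio.to_thread. Real CPU-bound Python code in a thread would still contend for the GIL, so a process pool is the usual offload there.

```python
import asyncio
import time


async def heavy_blocking(log):
    # Stand-in for synchronous CPU-bound work: never yields to the loop.
    time.sleep(0.2)
    log.append("heavy")


async def heavy_offloaded(log):
    # The loop stays free while a worker thread does the waiting.
    await asyncio.to_thread(time.sleep, 0.2)
    log.append("heavy")


async def health(log):
    # Stand-in for a cheap /health request arriving at the same time.
    log.append("health")


async def run(heavy):
    log = []
    await asyncio.gather(heavy(log), health(log))
    return log


blocking_order = asyncio.run(run(heavy_blocking))
offloaded_order = asyncio.run(run(heavy_offloaded))

print(blocking_order)   # health is served only after the heavy request
print(offloaded_order)  # health is served while the heavy request runs
```

In the blocking case the cheap request waits for the whole heavy request, which is why one async worker doing heavy synchronous work behaves like a concurrency of one.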

H. Designing the fix

  1. We considered vertical (more workers/CPU per pod), horizontal (more pods via HPA), and both. What's the trade-off, and why did "both" win?

  2. Why couldn't we just change the worker class to synchronous (or offload to a process pool) as part of this immediate infrastructure fix? What kind of change would that be?

  3. We set workers = 2. Why did the CPU limit have to go up to 2000m at the same time? What would happen if we'd set workers = 2 but left the limit at 1500m?

  4. The pods used ~2.5Gi RSS each at one worker. Why did we expect ~5Gi with two workers, and why wouldn't worker-fork preloading save us here? (What did we learn about when the application memory caches are built?)

  5. There's one import-time load we found (e.g., heavy model loading in a utility file). Why is that one shareable-via-fork but the bulk of the runtime memory is not?

  6. The HPA was autoscaling/v1, target 80%, min 4 / max 7. We measured bursts peaking at ~75% of the 1200m request. Explain mechanically why the HPA never scaled.

  7. targetCPUUtilizationPercentage is a percentage of what? Recompute: at the new 1500m request and a pod using ~0.9 cores, what utilization does the HPA see?

  8. Why did we lower the target to 50% and raise max to 10, rather than just one of those?

  9. What does a topologySpreadConstraints with maxSkew: 1, topologyKey: kubernetes.io/hostname, whenUnsatisfiable: ScheduleAnyway do — and why soft (ScheduleAnyway) rather than hard (DoNotSchedule)?

  10. We deliberately did not add a readiness probe. Explain the failure mode that a naive /health readiness probe would cause in this specific app under heavy load. Why is "no readiness probe" temporarily safer than a bad one here?

  11. What was the container port vs Service target port mismatch situation? Why was alignment low-risk, and why didn't it affect routing?

  12. An orphaned HPA manifest file was deleted. Why was it safe to delete, and how did we confirm it was no longer active?
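
For questions 6 and 7, the HPA arithmetic can be sketched as follows. It uses the standard formula desired = ceil(current × utilization / target) with the default 10% tolerance, and plugs in the figures quoted in the questions (about 0.9 cores of usage per pod); the function name is hypothetical.

```python
import math


def hpa_desired_replicas(current, usage_m, request_m, target_pct,
                         min_replicas, max_replicas, tolerance=0.1):
    """Return (utilization %, desired replicas) for a CPU-utilization HPA.

    Utilization is usage as a percentage of the CPU *request*, not the limit.
    """
    utilization = usage_m * 100 / request_m
    ratio = utilization / target_pct
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # within tolerance: no scaling action
    else:
        desired = math.ceil(current * ratio)
    return utilization, max(min_replicas, min(max_replicas, desired))


# Before: 1200m request, 80% target, min 4 / max 7.
before = hpa_desired_replicas(4, usage_m=900, request_m=1200, target_pct=80,
                              min_replicas=4, max_replicas=7)

# After: 1500m request, 50% target, min 4 / max 10.
after = hpa_desired_replicas(4, usage_m=900, request_m=1500, target_pct=50,
                             min_replicas=4, max_replicas=10)

print(before)  # 75% utilization stays under the 80% target
print(after)   # 60% utilization exceeds the 50% target
```

At 75% against an 80% target the ratio never exceeds 1, so the old HPA had no reason to add pods during bursts.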

I. Where the config lives & deploy mechanics

  1. How did we determine the workload was not managed by GitOps tool deployments (e.g., Argo CD), despite the tool being installed? What metadata annotation was the fingerprint?

  2. Which repository and file holds the deployment/HPA, and which separate repository/file holds the ingress? Why do they deploy through different mechanisms?

  3. Describe the deployment pipeline flow end to end. What event triggers it, and what are the key build/test/deploy jobs?

  4. The deploy step does a string substitution (sed 's#$TAG#...#') then kubectl apply. What's the role of the $TAG placeholder, and where does the tag value come from?

  5. Why does merging a PR to the main branch not deploy anything, while pushing a specific environment release tag does?

  6. The pipeline runs on private self-hosted runners. Why does that matter for a private-endpoint cluster?

J. Capacity analysis

  1. What instance types/sizes back the general-purpose compute tier, and how much allocatable CPU/memory does each have?

  2. What is Cluster Autoscaler, and how does it differ from node lifecycle managers like Karpenter? Which one is active in this cluster?

  3. There are two node groups feeding the tier (spot instances and on-demand instances). What are their min/max sizes, and what's the combined node ceiling?

  4. Do the packing math: given ~3920m allocatable CPU and ~400m daemonset overhead, how many pods at 1500m request fit per node? Why is CPU, not memory, the binding constraint?

  5. At HPA max (10 pods), how many nodes are needed, and is that within the ceiling? What's left for other tenants?

  6. Why is horizontal scale-out slow relative to instant traffic bursts? List every contributor to a cold pod's total time-to-serve.

  7. Given that scale-out lag, which part of our fix delivers immediate relief, and which acts as the slower "second line of defense"?

  8. Why did we bump the memory request to 5Gi even though it doesn't change node packing density?
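
For questions 4 and 5, the CPU side of the packing math can be sketched with the numbers given in the questions. Memory figures and the node ceiling are not stated there, so they are left out; the function names are hypothetical.

```python
import math


def pods_per_node(allocatable_m, daemonset_m, request_m):
    """How many pods fit on one node by CPU request (millicores)."""
    return (allocatable_m - daemonset_m) // request_m


def nodes_needed(pods, per_node):
    """Nodes required to place the given number of pods."""
    return math.ceil(pods / per_node)


per_node = pods_per_node(allocatable_m=3920, daemonset_m=400, request_m=1500)
nodes_at_hpa_max = nodes_needed(pods=10, per_node=per_node)
leftover_m = (3920 - 400) - per_node * 1500  # CPU left for other tenants

print(per_node, nodes_at_hpa_max, leftover_m)
```

Two pods per node leaves about half a core of requestable CPU on each node for anything else scheduled there.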

K. Staging-first deploy & verification

  1. Why deploy to the staging environment first when the staging manifest file wasn't even changed by the PR?

  2. Precisely what does the staging deploy validate, and what does it not validate?

  3. Staging runs on the same cluster as production. How is it isolated, and what was the risk we flagged about a tiny 0.5-core staging pod suddenly running workers=2?

  4. List the verification steps we ran on staging, and the pass criteria for each.

  5. Why did we test both GET /health and a heavy POST data processing route, rather than just the health check?

  6. The startup probe warms up by hitting an internal pre-cache endpoint. What does that endpoint do, and why is it the most likely place for a deploy to fail (especially on a CPU-starved staging pod)?

L. Production rollout & confirmation

  1. What version did we tag, and why a patch bump (v1.14.4) rather than a minor/major version shift?

  2. During the production rollout, two new pods went Pending, one with no available node. What happened next, and which log event confirmed the cluster autoscaler reacting?

  3. Why did the rollout take ~6 minutes and stay safe (no dropped traffic) the whole time? What deployment configuration controls the surge/unavailable behavior?

  4. Post-deploy, the new pods showed only ~2.6Gi memory usage, not ~5Gi. Why — and why is that expected rather than a contradiction of our sizing?

  5. After deploy, individual requests still hit ~3s max, but the alarm stayed OK. Why is that consistent with a successful fix? (Connect it back to question A.7.)

  6. We monitored for 45 minutes and saw no flapping, yet we kept the incident ticket open. What's the honest gap in that evidence, and what would definitively close it?

  7. The HPA stayed at 4 pods during the entire monitoring window. Why is that a good sign rather than a sign the HPA fix did nothing?

M. Operational / process & gotchas

  1. The cluster API is private-only. What's the practical consequence for local engineers using kubectl, and why does aws sts get-caller-identity succeed while kubectl times out? How do you distinguish an auth problem from a network/VPN problem?

  2. Why did the CI/CD deployment job still work even when our local machine's kubectl couldn't reach the cluster?

  3. The standard code security checks failed on automated image-scanning and vulnerability gates. Diagnose how to isolate the root cause. Were they caused by our configuration change? How do we prove that?

  4. The repository management system showed a BLOCKED merge status, but branch protection rules returned a 404. What's the resolution of that apparent contradiction (e.g., legacy branch protection vs. modern repository rulesets)?

  5. What was the only thing actually gating the code merge, and why were the red security check flags irrelevant to it?

  6. Which actions in this whole deployment flow required explicit manual operator confirmation before execution, and why those specifically?

  7. What's the commit-authorship convention in this engineering environment, and what must never appear in a commit message or PR description?

N. Synthesis & transfer (test of true mastery)

  1. If you were paged for this exact alarm tomorrow with zero prior context, list the first five commands/queries you'd run, in order, and what each one would tell you.

  2. Name three plausible-but-wrong hypotheses for a flapping TargetResponseTime alarm, and the single piece of evidence that kills each.

  3. Suppose the per-pod CPU time series had shown one pod pinned at 100% and three idle (the load imbalance pattern we originally expected). Given the instance-mode topology, what would the real fix have been then — and why is it different from the LOR annotation?

  4. The fix here was vertical + HPA scaling. Under what circumstances would the correct long-term fix instead be an application architecture change, and what would that change look like?

  5. Generalize the core lesson: what specific property of an application workload makes "adding more replicas behind a standard round-robin/random balancer" fail to resolve response time spikes, and what is the class of fixes that does help?

  6. If the morning peak traffic burst still trips the alarm after this infrastructure fix, what are your next two remediation levers (in order), and what data would you collect to choose between them?


Thursday, 25 June 2026

Elastic Fleet


Elastic Agents do not strictly require Fleet and Fleet Server. You can deploy them in standalone mode, which allows you to manually configure and manage them using local configuration files.

However, running Elastic Agents in Fleet-managed mode with a Fleet Server is the recommended best practice for most enterprise environments.

Here is how the two approaches compare:

1. Fleet-Managed Mode (Recommended)


In this setup, you use the Fleet UI in Kibana to centrally manage agent policies, roll out upgrades, and apply integrations.
  • How it works: Agents connect to a Fleet Server (itself a specialized Elastic Agent process), which reads policies from Elasticsearch and delivers them to agents when they check in.
  • Best for: Large-scale deployments, continuous monitoring, and Elastic Security/Endpoint integrations.
  • Action: To set this up, refer to the official Elastic Agent Installation Guide.


2. Standalone Mode


In this setup, you manually install the agent and define its inputs, outputs, and integrations directly in a local YAML configuration file.

  • How it works: The agent connects directly to outputs like Elasticsearch or Logstash without a Fleet Server intermediary.
  • Limitations: Central management, automated upgrades, and certain advanced Endpoint Security features are disabled.
  • Best for: Edge cases, highly air-gapped networks, or evaluating specific integrations.
  • Action: To configure this, use the steps outlined in the Standalone Elastic Agent Tutorial.


Example of how an Elastic Fleet can actually be wired


Components & where they live


Everything is in cluster company-prod-elastic-eks (EKS, us-east-2, now on v1.36.2), namespace elastic-system, across 5 nodes in 3 AZs:

Node            AZ  Node Group             Runs
====            ==  ==========             ====
ip-10-99-44-1   2a  es MNG (m7g.2xlarge)   agent
ip-10-99-55-34  2b  default MNG (m5.large) agent + Fleet Server pod
ip-10-99-55-89  2b  es MNG (m7g.2xlarge)   agent
ip-10-99-66-51  2c  es MNG (m7g.2xlarge)   agent
ip-10-99-66-78  2c  default MNG (m5.large) agent


  • Fleet Server — Deployment fleet-server-prod, replicas: 1 (single pod, currently on the 2b node). Listens on :8220 HTTPS. Exposed by a NodePort service fleet-server-prod-agent-http (8220 --> 32202 on every node).
  • Elastic Agents — DaemonSet agent-prod-eck-agent, one pod per node (5 total), spread across all 3 AZs. hostNetwork: true, mode: fleet, FLEET_INSECURE=true. They collect node/pod logs + metrics (hostPath mounts of /var/log/...).
  • ALB k8s-elastics-eckfleet-a1b2c3d4e5 — internal (not internet-facing), spanning subnets in 2a / 2b / 2c. Created by the AWS LB controller from Ingress eck-fleet-server-prod-ingress. Targets = all 5 nodes' NodePort 32202 (instance mode), backend-protocol: HTTPS, idle_timeout: 300s.
  • DNS — eck-fleet-server.internal-domain.local (Route53 via external-dns) -->  the internal ALB. Elasticsearch is a separate endpoint, elasticsearch.internal-domain.local:443.

Who listens, who initiates


The key thing: every connection is initiated by the agent (outbound). Nothing is ever pushed to the agents — that's why this works with agents behind hostNetwork and no inbound rules of their own.

                          

                           ┌───────────────────────────────────────────────┐
                           │ cluster company-prod-elastic-eks / elastic-sys│
     ┌──────────┐          │                                               │
     │ Agent    │ check-in │   internal ALB          Fleet Server (1 pod)  │
     │ DaemonSet│─────────►│  eckfleet-a1b2...   ┌──────► :8220 (HTTPS)    │
     │ (5 pods, │  :443    │  :443 TLS  ─────────┘  NodePort 32202→8220    │
     │  3 AZs)  │  HTTPS   │  (internal cert)       kube-proxy → the 1 pod │
     └────┬─────┘          │       ▲  re-encrypt HTTPS to a node's :32202  │
          │                │       │                     │                 │
          │ ship data      └───────┼─────────────────────┼─────────────────┘
          │ (logs/metrics)         │ DNS                 │ writes .fleet-* / agent
          ▼                   eck-fleet-server.          ▼ metadata, reads policy
 elasticsearch.               internal-domain.local  Elasticsearch + Kibana (Fleet app)
 internal-domain.local:443 ◄──────────────────────  
 (NOT via the fleet ALB)



Traffic Flows: Breakdown


1. Agent --> Fleet Server (Control Plane: Enrollment & Policy Check-in)

The agent acts as the client, and Fleet Server listens on :8220. The agent dials https://eck-fleet-server.internal-domain.local:443:

Traffic Path:

agent --> DNS --> internal ALB :443 (TLS termination, internal cert) --> re-encrypt --> a node's NodePort :32202 -->  kube-proxy --> the single Fleet Server pod :8220

Cross-AZ Routing: 

Because the agent always targets the ALB DNS name (not the pod directly), an agent can land on any node's NodePort. It is then forwarded by kube-proxy to whichever node is hosting the single Fleet Server pod—making cross-AZ hops standard behavior.
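
To put rough numbers on that cross-AZ behavior, here is a back-of-the-envelope sketch. It assumes the ALB spreads requests evenly over the five node targets and that kube-proxy always forwards to the one Fleet Server pod. Both are simplifying assumptions, not measurements. The node-to-AZ mapping is copied from the node table above, and the function name is hypothetical.

```python
from fractions import Fraction

# Node -> AZ, copied from the node table above.
NODE_AZ = {
    "ip-10-99-44-1": "2a",
    "ip-10-99-55-34": "2b",   # hosts the single Fleet Server pod
    "ip-10-99-55-89": "2b",
    "ip-10-99-66-51": "2c",
    "ip-10-99-66-78": "2c",
}
FLEET_SERVER_NODE = "ip-10-99-55-34"


def hop_probabilities(node_az, pod_node):
    """Return (P[extra node hop], P[cross-AZ hop]) for the NodePort -> pod leg.

    Assumes every node is equally likely to receive the request.
    """
    total = len(node_az)
    other_node = sum(1 for node in node_az if node != pod_node)
    other_az = sum(1 for az in node_az.values() if az != node_az[pod_node])
    return Fraction(other_node, total), Fraction(other_az, total)


extra_hop, cross_az = hop_probabilities(NODE_AZ, FLEET_SERVER_NODE)
print(f"lands on a node without the pod: {extra_hop}")  # 4/5
print(f"NodePort -> pod leg crosses AZs: {cross_az}")   # 3/5
```

Under those assumptions, four out of five check-ins take an extra node hop, and three out of five cross an AZ boundary on the NodePort-to-pod leg.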

The Long Poll Mechanism: The check-in is a long poll. The agent opens a connection, and Fleet Server holds it open for up to ~5 minutes until a policy change occurs or the poll times out.

The Latency "False Positive": This held-open duration is exactly what the ALB records as a TargetResponseTime of ~300s, which triggered the false-positive alert. The ALB's idle_timeout is intentionally set to 300s to support these long polls. Conversely, the health check (/health, expects 404, 300s interval) is a separate lightweight probe, explaining why targets report as perfectly healthy while "latency" metrics look highly skewed.


2. Agent --> Elasticsearch (Data Plane: Telemetry Shipping)

This data path does not traverse the Fleet ALB. The Fleet output configuration points agents directly at https://elasticsearch.internal-domain.local:443.

Consequently, the eckfleet ALB only handles control-plane check-in traffic. This explains why its request volume is incredibly low (~3–5 requests/min) while individual request durations hover around 300s. They are two entirely separate paths utilizing two different load balancers.


3. Fleet Server --> Elasticsearch + Kibana

Fleet Server (via elasticsearchRefs / kibanaRef) acts as the client here. It writes agent and policy metadata directly into the .fleet-* indices and coordinates with the Kibana Fleet application, where the eck-fleet-server and eck-agent policies are managed.


One-Line Summary


Agents (DaemonSet, 5 pods across 2a/2b/2c) long-poll the single Fleet Server pod for policy updates via the internal ALB on :443 --> NodePort 32202 --> :8220, but ship their actual telemetry data directly to Elasticsearch, completely bypassing that ALB. The ALB only processes the slow, held-open control traffic, which is the root cause of the alarm.



Why does Fleet Server hold the connection open for so long? Why doesn't it reply immediately?


It's a deliberate design choice called long polling, and it's the opposite of a bug. Here's the reasoning.

The problem Fleet is solving

Fleet Server's job is to push policy changes out to agents promptly — a new integration, a changed log path, an output credential rotation. Agents need to find out "has my policy changed?" with low latency.

There are three ways an agent could learn about changes:

  1. Short polling — agent asks "anything new?" every N seconds, server replies instantly "no", agent sleeps, repeats.
  2. Server push — server opens a connection to each agent when something changes.
  3. Long polling — agent asks "anything new?", and the server holds the request open until either something actually changes or a timeout fires.

Fleet uses #3, and the connection sitting open for ~5 minutes is that hold.

Why not reply immediately (short polling)?


If Fleet answered every check-in instantly with "nothing changed", then to get fast reaction to a policy change, agents would have to poll very frequently — say every few seconds. With your 5 agents that's tolerable, but Fleet is built to manage thousands to tens of thousands of agents from one server. At that scale:

  - Frequent short polls = a constant storm of requests, almost all of which return "no change". Huge wasted CPU/network on both ends.
  - To keep reaction time low you'd poll more often, which makes the storm worse. To cut the storm you'd poll less often, which makes policy changes take longer to land. There's no good setting.

Long polling breaks that trade-off: the agent gets a near-instant reaction to a real change (the server responds the moment policy updates) and there's almost no idle chatter (one held-open connection per agent instead of hundreds of empty round-trips).
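
The trade-off can be illustrated with some quick arithmetic. This sketch counts idle requests per hour for a fleet where no policy ever changes. The 5-second short-poll interval and the 10,000-agent fleet size are hypothetical figures chosen for illustration; the 300-second hold is the ~5 minute long-poll ceiling described above.

```python
def idle_requests_per_hour(agents: int, cycle_seconds: int) -> int:
    """Requests per hour when every agent finishes one request per cycle."""
    return agents * 3600 // cycle_seconds


SHORT_POLL_INTERVAL = 5   # seconds; hypothetical short-poll setting
LONG_POLL_HOLD = 300      # seconds; the ~5 minute hold described above

for agents in (5, 10_000):
    short = idle_requests_per_hour(agents, SHORT_POLL_INTERVAL)
    long_ = idle_requests_per_hour(agents, LONG_POLL_HOLD)
    print(f"{agents:>6} agents: short={short:>9,}/h  long={long_:>7,}/h  "
          f"ratio={short // long_}x")
```

With these numbers the short-polling fleet generates 60 times the idle traffic, and its worst-case reaction time is still the full poll interval, whereas a long poll answers as soon as the change happens.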

Why not server push (#2)?


Pushing would mean the server initiating connections inward to every agent. Agents are all over the place — behind NAT, firewalls, in private subnets, on laptops, on hostNetwork pods like yours. The server usually can't reach them, and you'd need inbound rules everywhere. Long polling flips it: the agent always dials out to the server, the connection is already established and held open, and the server pushes the change down that existing agent-initiated connection the instant it happens. You get push-like latency with poll-like (outbound-only) connectivity. That's exactly why your agents work behind hostNetwork with no inbound exposure.

 So what's actually happening in those ~300s


A check-in is essentially: "Here's my current state; tell me the moment my policy differs." The server parks that request. Two things can end it:

  - A policy change occurs → server responds immediately with the new policy (could be 2 seconds in).
  - Nothing changes → the server lets the request time out at its poll ceiling (~5 min), responds "no change", and the agent immediately opens a fresh one.

Since your policies rarely change, almost every check-in runs the full clock and returns at ~300s. That's the held connection the ALB measures as TargetResponseTime.
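
The two ways a check-in can end are easy to reproduce in a toy model. The sketch below is not Fleet Server's implementation: `PolicyStore`, `publish` and `check_in` are hypothetical names, and the hold is shortened to 0.3 seconds so the example finishes quickly. It only demonstrates the long-poll pattern of parking a request until the state changes or a timeout fires.

```python
import threading
import time


class PolicyStore:
    """Toy stand-in for the server-side policy state (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._revision = 1

    def publish(self):
        """Simulate an operator changing the policy."""
        with self._cond:
            self._revision += 1
            self._cond.notify_all()   # wake every parked check-in

    def check_in(self, known_revision, hold_seconds):
        """Park the caller until the revision changes or the hold expires.

        Returns (changed, revision, seconds_held).
        """
        start = time.monotonic()
        with self._cond:
            changed = self._cond.wait_for(
                lambda: self._revision != known_revision, timeout=hold_seconds
            )
            return changed, self._revision, time.monotonic() - start


store = PolicyStore()

# Case 1: nothing changes, so the check-in runs the full clock.
changed, rev, held = store.check_in(known_revision=1, hold_seconds=0.3)
print("no change:", changed, rev, f"{held:.2f}s")

# Case 2: a policy change lands mid-poll, so the reply comes back early.
threading.Timer(0.05, store.publish).start()
changed, rev, held = store.check_in(known_revision=1, hold_seconds=0.3)
print("changed:  ", changed, rev, f"{held:.2f}s")
```

A load balancer in front of this server would record the first call as a slow request and the second as a fast one, even though the server behaved correctly both times.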

Why this specifically fools the ALB


 From the ALB's point of view, "request received → response sent" took 300 seconds, so it reports TargetResponseTime ≈ 300s. The ALB can't tell the difference between "the backend was slow for 300s" (bad) and "the backend intentionally held an idle long-poll for 300s" (normal). That ambiguity is the whole reason your generic 0.8s threshold misfires — and why the right fix (DOP-833) is to exempt this endpoint rather than treat it as latency. It's also why the Ingress sets idle_timeout.timeout_seconds=300: the ALB has to be told to tolerate the held connection, otherwise it would sever the long poll before the agent's poll cycle completes.

In short: Fleet holds the connection open so it can deliver policy changes near-instantly without either hammering the server with empty polls or needing to reach inward to firewalled agents. The ~300s is just an idle long-poll waiting for a change that usually never comes during that window — efficient by design, and only a problem for a latency metric that doesn't know to expect it.


----

AWS EC2: Application Load Balancer

 



An Application Load Balancer (ALB) is a fully managed AWS service that automatically distributes incoming HTTP and HTTPS traffic across multiple backend targets.

It operates at the Application Layer (Layer 7) of the Open Systems Interconnection (OSI) model.

Key Features:

  • Content-Based Routing: Routes traffic based on URL paths (/api vs /images) or hostnames (e.g. example.com).
  • Container Support: Integrates directly with Amazon ECS and EKS using dynamic port mapping.
  • Advanced Protocols: Native support for modern protocols like HTTP/2, gRPC, and WebSockets.
  • Security Integration: Features built-in HTTPS/TLS termination and integrates directly with AWS WAF for web security.

How Components Work Together


  • Listener: Evaluates connection requests from clients using protocols and ports you configure.
  • Rules: Determines how the load balancer routes requests to its registered targets.
  • Target Group: Groups backend resources (like EC2 instances, containers, or IP addresses) that receive the traffic.

How ALB Health Checks Keep Applications Online


An Application Load Balancer (ALB) keeps your application online by continuously monitoring the health of your backend targets and dynamically redirecting traffic away from failing nodes.

1. Automatic Traffic Redirection


The ALB sends periodic health check requests to every registered target. If a target fails the configured number of consecutive checks, the ALB marks it as unhealthy and stops routing new requests to it. Traffic goes to the remaining healthy targets, so users are not sent to the failed one.

2. Auto Scaling Integration


When paired with an Auto Scaling Group (ASG), ALB health checks can trigger the automatic replacement of broken instances.
  • The Problem: An EC2 instance might be running (healthy at the hardware level), but the web server inside it has crashed (unhealthy at the application level).
  • The Solution: The ALB tells the ASG that the instance is failing application health checks. The ASG terminates that specific broken instance and launches a fresh, working one.

3. Graceful Recovery


When an unhealthy instance recovers, or when a new instance is launched, the ALB does not send traffic to it immediately. It enters an initial state and undergoes consecutive successful health checks. Only when it passes the threshold does the ALB safely introduce it back into the traffic rotation.

How to Configure an ALB Health Check


You configure health checks inside the Target Group settings using these parameters:

Parameter            What it does                                                      Recommended Setting
=========            ============                                                      ===================
Health Check Path    The URL endpoint the ALB hits (e.g., /health or /index.html).     /health
Healthy Threshold    Consecutive successes needed to mark a target as healthy.         3
Unhealthy Threshold  Consecutive failures needed to mark a target as unhealthy.        2
Timeout              How long the ALB waits for a response before failing.             5 seconds
Interval             The time between individual health check pings.                   30 seconds
Success Codes        The HTTP status codes that prove the app is working.              200 (or 200-399)
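
The interplay of the two thresholds is easier to see as a small state machine. The following is a minimal sketch, not AWS code: the class name `TargetHealth` and the state labels are made up for illustration, and the defaults come from the Recommended Setting column above.

```python
class TargetHealth:
    """Consecutive-count health tracking for one target (illustrative only)."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.state = "initial"    # a new target starts out of rotation
        self._streak = 0          # length of the current run of identical results
        self._last_ok = None

    def record(self, ok):
        """Feed one health check result and return the resulting state."""
        # Any flip between success and failure restarts the streak.
        self._streak = self._streak + 1 if ok == self._last_ok else 1
        self._last_ok = ok
        if ok and self._streak >= self.healthy_threshold:
            self.state = "healthy"
        elif not ok and self._streak >= self.unhealthy_threshold:
            self.state = "unhealthy"
        return self.state


target = TargetHealth()
results = [True, True, True, False, False, True, True, True]
print([target.record(ok) for ok in results])
# ['initial', 'initial', 'healthy', 'healthy', 'unhealthy',
#  'unhealthy', 'unhealthy', 'healthy']
```

With a 30-second interval and an unhealthy threshold of 2, a crashed target is taken out of rotation after roughly a minute, which is the price of not reacting to a single failed check.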



When you configure an ALB, you do not select an Availability Zone (AZ) directly; instead, you must select at least two subnets in different Availability Zones to ensure high availability.

  • How it works: AWS places a load balancer node in each of the specified subnets.
  • The AZ link: Because each subnet belongs to exactly one AZ, this fundamentally binds the ALB's nodes to those corresponding Availability Zones.
  • Custom routing: You can modify the subnets via the Update Availability Zones settings in the EC2 Console at any time.

Public ALB


An ALB is public when its scheme is internet-facing; such a load balancer is placed in public subnets.

When you create an internet-facing ALB, select public subnets so that the ALB nodes can receive public IP addresses and accept traffic from the internet.

Key Characteristics:

  • Public DNS: The ALB receives a public DNS name that resolves to public IP addresses.
  • Internet Gateway: The selected public subnets must have a route to an Internet Gateway (IGW) in their route tables.
  • Target Routing: Even though the ALB is public, it can still route traffic to EC2 instances living in private subnets.

An internet-facing ALB routes traffic directly to the individual backend targets (such as EC2 instances or IP addresses), not to the private subnets themselves.


How Routing Works

  • Target Group Config: You configure the ALB to route traffic to a Target Group.
  • Direct Node Communication: The ALB nodes in the public subnets communicate directly with the private IP addresses of your backend nodes.
  • Cross-Subnet Traffic: AWS handles this routing internally via the VPC router, allowing the public ALB to securely traverse into private subnets.

Configuration Checklist

  • VPC: Both the public subnets (where the ALB lives) and the private subnets (where the nodes live) must be in the same VPC.
  • Security Groups: The private instances must have a security group that allows inbound traffic from the ALB's security group.



Private ALB



An internal (private) ALB routes traffic in the exact same way as a public ALB, but it is only accessible within your VPC or connected networks.

It routes traffic directly to individual backend targets, not to subnets.

Key Characteristics

  • Private Subnets: You deploy the ALB nodes into private subnets.
  • DNS: The ALB's DNS name is publicly resolvable, but it resolves exclusively to private IP addresses.
  • No Internet Access: It cannot receive any traffic from the public internet because it lacks a public IP.

Common Use Cases

  • Internal Microservices: Routing traffic from a public-facing web tier to a private backend API tier.
  • Hybrid Networks: Routing traffic coming from an on-premises data centre via AWS Direct Connect or a VPN.

Setting Up ALB in AWS Console


The basic building blocks of an AWS load balancer are listeners and target groups.

To create an Application Load Balancer, go to EC2 >> Load balancers >> Create load balancer >> Select load balancer type (click on Create under Application Load Balancer).



Here we can set:

  • Basic configuration
    • Name
    • Scheme (cannot be changed after the load balancer is created)
      • Internet-facing. An internet-facing load balancer routes requests from clients over the internet to targets. Requires a public subnet. 
      • Internal. An internal load balancer routes requests from clients to targets using private IP addresses.
    • IP address type. Select the type of IP addresses that your subnets use.
      • IPv4. Recommended for internal load balancers.
      • Dualstack. Includes IPv4 and IPv6 addresses.
  • Network mapping. The load balancer routes traffic to targets in the selected subnets, and in accordance with your IP address settings.
    • VPC. Virtual private cloud for your targets. If the load balancer is internet-facing, only VPCs with an internet gateway are enabled for selection. The selected VPC cannot be changed after the load balancer is created. Because a VPC is region-specific, so is the Application Load Balancer.
    • Mappings. Once VPC is selected, its availability zones are listed here and are selectable. Select at least two Availability Zones and one subnet per zone. The load balancer routes traffic to targets in these Availability Zones only. Availability Zones that are not supported by the load balancer or the VPC are not available for selection. We should select all AZs that we listed in the Auto scaling group (if we used it).
  • Security groups. A security group is a set of firewall rules that control the traffic to your load balancer. We can select up to 10 security groups.
    • If our application is listening for HTTP requests on port 80 we should select a security group with:
      • Inbound rule: accept HTTP/TCP traffic on port 80 with source Anywhere-IPv4
      • Outbound rule: allow all traffic for all protocols and port ranges to custom destination 0.0.0.0/0
  • Listeners and routing. A listener is a process that checks for connection requests using the port and protocol you configure. The rules that you define for a listener determine how the load balancer routes requests to its registered targets.
    • Add listener
      • Protocol e.g. HTTP
      • Port e.g. 80. This is a public facing port and it does not need to be the same as the port from the attached target group. E.g. LB can listen on port 80 and forward traffic to target group port 8080.
      • Default action: Forward to (select a target group)
      • Add listener tags
  • Add-on services - optional
    • AWS Global Accelerator
  • Tags - optional


More info on Scheme, from AWS documentation:

When you create a load balancer, you must choose whether to make it an internal load balancer or an internet-facing load balancer.

The nodes of an internet-facing load balancer have public IP addresses.

The nodes of an internal load balancer have only private IP addresses.

Both internet-facing and internal load balancers route requests to your targets using private IP addresses. Therefore, your targets do not need public IP addresses to receive requests from an internal or an internet-facing load balancer.

More info on how ALB routes traffic to multiple Availability Zones (and about what Load Balancer Nodes are):

When you enable an Availability Zone for your load balancer, Elastic Load Balancing creates a load balancer node in the Availability Zone. 

The nodes for your load balancer distribute requests from clients to registered targets. When cross-zone load balancing is enabled, each load balancer node distributes traffic across the registered targets in all enabled Availability Zones. When cross-zone load balancing is disabled, each load balancer node distributes traffic only across the registered targets in its Availability Zone.
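
The effect of the cross-zone setting on per-target load can be worked out with a few lines of arithmetic. This sketch assumes that client traffic is split evenly across the load balancer nodes (one per enabled AZ), that every target is healthy, and that each AZ has at least one target. The function name and the target counts are hypothetical.

```python
from fractions import Fraction


def traffic_share(targets_per_az, cross_zone):
    """Fraction of total traffic that each single target receives, keyed by AZ."""
    n_az = len(targets_per_az)
    total = sum(targets_per_az.values())
    if cross_zone:
        # Every node spreads its requests over all targets in all zones.
        return {az: Fraction(1, total) for az in targets_per_az}
    # Each node spreads only its 1/n_az share over its own zone's targets.
    return {az: Fraction(1, n_az * n) for az, n in targets_per_az.items()}


fleet = {"us-west-2a": 2, "us-west-2b": 8}   # hypothetical target counts
print(traffic_share(fleet, cross_zone=True))
print(traffic_share(fleet, cross_zone=False))
```

With cross-zone enabled, each of the ten targets receives 1/10 of the traffic. With it disabled, each of the two targets in us-west-2a receives 1/4 while each of the eight in us-west-2b receives 1/16, which is exactly the kind of uneven load an unbalanced target count produces.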

Before a client sends a request to your load balancer, it resolves the load balancer's domain name using a Domain Name System (DNS) server. The DNS entry is controlled by Amazon, because your load balancers are in the amazonaws.com domain. The Amazon DNS servers return one or more IP addresses to the client. These are the IP addresses of the load balancer nodes for your load balancer.

As traffic to your application changes over time, Elastic Load Balancing scales your load balancer and updates the DNS entry. The DNS entry also specifies the time-to-live (TTL) of 60 seconds. This helps ensure that the IP addresses can be remapped quickly in response to changing traffic.

The client determines which IP address to use to send requests to the load balancer. The load balancer node that receives the request selects a healthy registered target and sends the request to the target using its private IP address.

With Application Load Balancers, the load balancer node that receives the request uses the following process:

1) Evaluates the listener rules in priority order to determine which rule to apply.

2) Selects a target from the target group for the rule action, using the routing algorithm configured for the target group. The default routing algorithm is round robin. Routing is performed independently for each target group, even when a target is registered with multiple target groups.
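
The two steps above can be sketched as follows. This is a simplified model, not how the ALB is implemented: real listener rules support several condition types and wildcard path patterns, whereas this sketch matches plain path prefixes and ignores target health. `TargetGroup`, `route` and the IP addresses are hypothetical.

```python
import itertools


class TargetGroup:
    """A named set of targets with its own round robin position."""

    def __init__(self, name, targets):
        self.name = name
        self._cycle = itertools.cycle(targets)   # state is kept per target group

    def pick(self):
        return next(self._cycle)


def route(rules, default_group, path):
    """Apply the first matching rule in priority order, then pick a target.

    rules is a list of (priority, path_prefix, target_group) tuples;
    a lower priority number is evaluated first.
    """
    for _, prefix, group in sorted(rules, key=lambda rule: rule[0]):
        if path.startswith(prefix):
            return group.name, group.pick()
    return default_group.name, default_group.pick()


api = TargetGroup("api", ["10.0.1.10", "10.0.2.10"])
web = TargetGroup("web", ["10.0.1.20"])
rules = [(20, "/", web), (10, "/api", api)]

for path in ["/api/users", "/api/orders", "/index.html", "/api/users"]:
    print(path, "->", route(rules, web, path))
```

The `/api` rule wins over the catch-all `/` rule because of its lower priority number, and the two `api` targets alternate independently of how many requests went to `web`.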

For further info: How Elastic Load Balancing works - Elastic Load Balancing

ALB nodes use elastic network interfaces (see Elastic network interfaces - Amazon Elastic Compute Cloud), which have public IP addresses when the load balancer is internet-facing:

At least one ENI is created and attached to the balancer in each availability zone where the balancer is deployed (except NLB, which should only have one per AZ). Over the life of the balancer, new ENIs will appear and old ones will disappear, as the balancer scales horizontally (number of nodes) and/or vertically (capacity of underlying hardware), all of which is handled transparently by the infrastructure. Even though you can tag them, the tagging will become stale over time.

Source: amazon web services - AWS - Affect Load Balancer's tags to its Network Interfaces (ENI) - Stack Overflow

 

You can determine the IP addresses associated with an internal load balancer or an internet-facing load balancer by resolving the DNS name of the load balancer. These are the IP addresses where the clients should send the requests that are destined for the load balancer. However, Classic Load Balancers and Application Load Balancers use the private IP addresses associated with their elastic network interfaces as the source IP address for requests forwarded to your web servers.

Source: Find the IP address used by a load balancer to forward traffic to web servers
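
Resolving the DNS name, as described above, needs nothing beyond the standard library. This is a minimal sketch: `resolve_ips` is a hypothetical helper, and it is demonstrated against `localhost` so that it runs anywhere. In practice you would pass your load balancer's DNS name instead.

```python
import socket


def resolve_ips(hostname, port=443):
    """Return the sorted set of IP addresses a DNS name currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Replace "localhost" with the DNS name of your own load balancer.
print(resolve_ips("localhost"))
```

Because the set of load balancer nodes changes as the load balancer scales, the result should be treated as a snapshot rather than hard-coded anywhere.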

 

The load balancer routes requests to the targets in a target group and performs health checks on those targets. A target group is a logical set of targets that receives the requests a listener rule forwards to it. These targets can be e.g. EC2 instances created either manually or through an Auto Scaling group.

How to create a Target Group used by Load Balancer listeners? (This applies for any type of Load Balancer)

EC2 >> Target groups >> Create target group

Step 1: Specify group details

 

Here we can set:

  • Basic configuration. Settings in this section cannot be changed after the target group is created.
    • Target type
      • Instances
        • Supports load balancing to instances within a specific VPC.
        • Facilitates the use of Amazon EC2 Auto Scaling  to manage and scale your EC2 capacity.
      • IP addresses
        • Supports load balancing to VPC and on-premises resources.
        • Facilitates routing to multiple IP addresses and network interfaces on the same instance.
        • Offers flexibility with microservice based architectures, simplifying inter-application communication.
        •  Supports IPv6 targets, enabling end-to-end IPv6 communication, and IPv4-to-IPv6 NAT.
      • Lambda function
        • Facilitates routing to a single Lambda function.
        •  Accessible to Application Load Balancers only.
      • Application Load Balancer
        • Offers the flexibility for a Network Load Balancer to accept and route TCP requests within a specific VPC
        • Facilitates using static IP addresses and PrivateLink with an Application Load Balancer.
    • Target group name
    • Protocol:Port e.g. If our application is accepting HTTP requests on port 8080 this would be HTTP:8080
    • VPC - VPC with the instances that you want to include in the target group.
    • Protocol version
      • HTTP1. Send requests to targets using HTTP/1.1. Supported when the request protocol is HTTP/1.1 or HTTP/2.
      • HTTP2. Send requests to targets using HTTP/2. Supported when the request protocol is HTTP/2 or gRPC, but gRPC-specific features are not available.
      • gRPC. Send requests to targets using gRPC. Supported when the request protocol is gRPC.
  • Health checks. The associated load balancer periodically sends requests, per the settings below, to the registered targets to test their status.
    • Health check protocol
      • HTTP
      • HTTPS
    • Health check path. Use the default path of "/" to ping the root, or specify a custom path if preferred.
    • Advanced health check settings
      • Port. The port the load balancer uses when performing health checks on targets. The default is the port on which each target receives traffic from the load balancer, but you can specify a different port.
        • Traffic port
        • Override
      • Healthy threshold. The number of consecutive health checks successes required before considering an unhealthy target healthy.
      • Unhealthy threshold. The number of consecutive health check failures required before considering a target unhealthy.
      • Timeout. The amount of time, in seconds, during which no response means a failed health check.
      • Interval. The approximate amount of time between health checks of an individual target.
      • Success codes. The HTTP codes to use when checking for a successful response from a target. You can specify multiple values (for example, "200,202") or a range of values (for example, "200-299").
  • Attributes
  • Tags - optional
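
To illustrate the Success codes format described above (single values, comma-separated lists, and ranges), here is a small matcher. It is a sketch of how such a specification can be interpreted, not the load balancer's own parser, and `matches_success_codes` is a hypothetical name.

```python
def matches_success_codes(status, spec):
    """True if an HTTP status matches a spec such as "200", "200,202" or "200-299"."""
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(value) for value in part.split("-", 1))
            if low <= status <= high:
                return True
        elif part and int(part) == status:
            return True
    return False


print(matches_success_codes(202, "200,202"))   # True
print(matches_success_codes(404, "200-299"))   # False
print(matches_success_codes(404, "404"))       # True
```

The last call shows that a non-2xx code can be a valid success code when the application is known to answer that way on the health check path.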


Step 2: Register targets

This is an optional step to create a target group. However, to ensure that your load balancer routes traffic to this target group you must register your targets.




After the load balancer is created, it spends several minutes in the provisioning state before it becomes active. After this, we can use its DNS name to see what it's doing.

If we copy its DNS name and paste it into our browser before registering any targets in the target group associated with the load balancer, we'll get error 503 Service Temporarily Unavailable.

If we've registered targets and are getting error 504 Gateway time-out, we should check first if security groups (firewalls) for our EC2 instances (inbound rule - source IP range) are set up correctly as this error usually indicates that inbound traffic is not allowed.

The AWS Terraform provider can provision all of these resources.

 
How is AWS Application Load Balancing usually implemented?
 
Let's say we have our application running on 3 EC2 instances where 2 are in the same region e.g. us-west-2 but in separate availability zones e.g. us-west-2a and us-west-2b. Third EC2 instance is in eu-central-1, in availability zone eu-central-1a.
 
A VPC is region-specific but can span multiple Availability Zones (AZs).
A subnet is an IP address range within a VPC.
A VPC can have public and private subnets.
A VPC can be divided into multiple subnets, but each subnet is AZ-specific.
An AZ can have multiple subnets.

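So, the two EC2 instances in us-west-2 can belong to the same VPC but, as they are in different AZs, each of them belongs to a different subnet. The third instance, in eu-central-1, has to live in a separate VPC in that region; since an ALB is a regional resource, it would need its own load balancer there.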
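
To illustrate the VPC and subnet relationships, here is a sketch with one VPC in us-west-2 and one subnet in each of the two AZs. The CIDR ranges and resource names are assumptions; a subnet only becomes "public" once its route table has a route to an internet gateway, which is omitted here for brevity.

```hcl
provider "aws" {
  region = "us-west-2" # a VPC lives in exactly one region
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # assumed address range
}

# Each subnet is pinned to a single Availability Zone.
resource "aws_subnet" "public_a" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-west-2a"
  map_public_ip_on_launch = true
}

resource "aws_subnet" "public_b" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.2.0/24"
  availability_zone       = "us-west-2b"
  map_public_ip_on_launch = true
}
```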
 
An internet-facing load balancer must be placed in public subnets of the VPC, as clients communicate with it via the internet (public network).
 
A load balancer is not associated directly with EC2 instances but with subnets (an ALB requires subnets in at least two Availability Zones):

resource "aws_lb" "test" {
    subnets = ["subnet-0001", "subnet-0002"] 
    ...
}

A target group is associated with a VPC:
 
resource "aws_alb_target_group" "test" {
    vpc_id   = var.vpc_id
    ...
}
 
 

Difference between ALB and NLB (Network Load Balancer)


An Application Load Balancer (ALB) and a Network Load Balancer (NLB) serve different purposes based on the layer of the network they operate on and the type of traffic they handle.

The core difference is that an ALB understands application-level traffic (Layer 7) like HTTP/HTTPS headers, while an NLB handles low-level network traffic (Layer 4) like TCP/UDP packets at extreme speeds.

Direct Comparison Matrix


Feature           Application Load Balancer (ALB)                          Network Load Balancer (NLB)
=======           ===============================                          ===========================
OSI Layer         Layer 7 (Application)                                    Layer 4 (Transport)
Protocols         HTTP, HTTPS, HTTP/2, gRPC, WebSockets                    TCP, UDP, TLS
IP Addresses      Dynamic IPs (change automatically; requires a DNS name)  Static IPs (can assign an Elastic IP per AZ)
Routing Features  Advanced (path, host, query parameters, headers)         Basic (port and IP protocol routing only)
Performance       Optimized for complex web apps                           Optimized for ultra-low latency (millions of requests/sec)


Key Technical Differences


1. Smart Routing vs. Raw Speed


  • ALB (Smart): Can read the contents of your HTTP requests. It can, for example, route requests for one host name or URL path to an API server cluster and requests for another to a storage cluster.
  • NLB (Fast): Does not look inside the data packet. It simply looks at the target port and forwards the packet, which keeps added latency very low.
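
The ALB's content-based routing can be sketched with listener rules. The host name api.example.com, the /files/* path and the variable names below are hypothetical and only illustrate the idea of host-based and path-based conditions.

```hcl
variable "listener_arn" {
  type = string # ARN of an existing ALB listener
}

variable "api_target_group_arn" {
  type = string
}

variable "storage_target_group_arn" {
  type = string
}

# Host-based rule: requests for the API host name go to the API cluster.
resource "aws_lb_listener_rule" "api" {
  listener_arn = var.listener_arn
  priority     = 10

  action {
    type             = "forward"
    target_group_arn = var.api_target_group_arn
  }

  condition {
    host_header {
      values = ["api.example.com"] # hypothetical host name
    }
  }
}

# Path-based rule: requests under /files/ go to the storage cluster.
resource "aws_lb_listener_rule" "storage" {
  listener_arn = var.listener_arn
  priority     = 20

  action {
    type             = "forward"
    target_group_arn = var.storage_target_group_arn
  }

  condition {
    path_pattern {
      values = ["/files/*"] # hypothetical path
    }
  }
}
```

Rules are evaluated in priority order, and requests that match no rule fall through to the listener's default action.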

2. IP Addresses and DNS


  • ALB: Scales out dynamically by adding or removing nodes. This causes its underlying IP addresses to change over time. You must always point your domain name to the ALB's DNS name, never to an IP address.
  • NLB: Gives you a static IP address per Availability Zone. You can also assign your own Elastic IP addresses. This is critical if your corporate clients need to whitelist specific, unchanging IPs in their firewalls.
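
Assigning your own Elastic IPs to an NLB might look like the sketch below. The subnet IDs reuse the placeholder values from the earlier example, and the resource names are assumptions; the domain argument on aws_eip assumes a recent AWS provider version (older versions used vpc = true).

```hcl
# One Elastic IP per Availability Zone the NLB is deployed in.
resource "aws_eip" "nlb_a" {
  domain = "vpc"
}

resource "aws_eip" "nlb_b" {
  domain = "vpc"
}

resource "aws_lb" "nlb" {
  name               = "app-nlb" # hypothetical name
  load_balancer_type = "network"

  # subnet_mapping replaces the plain "subnets" list when fixed IPs are needed.
  subnet_mapping {
    subnet_id     = "subnet-0001" # placeholder subnet in the first AZ
    allocation_id = aws_eip.nlb_a.id
  }

  subnet_mapping {
    subnet_id     = "subnet-0002" # placeholder subnet in the second AZ
    allocation_id = aws_eip.nlb_b.id
  }
}
```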

3. Client IP Preservation


  • ALB: Terminates the connection and makes a new one to your backend instances. The backend sees the ALB's private IP address as the source. To find the real user's IP, your code must read the X-Forwarded-For HTTP header.
  • NLB: With client IP preservation enabled (the default for instance-type targets), passes the original source address through to your backend server. Your backend instances see the client's IP address natively, without needing extra headers.
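
For NLB target groups, client IP preservation is controlled by a target group attribute. The sketch below sets it explicitly for an IP-type TCP target group; the name, port and target type are assumptions made for illustration.

```hcl
variable "vpc_id" {
  type = string
}

resource "aws_lb_target_group" "tcp" {
  name        = "tcp-tg" # hypothetical name
  port        = 443      # assumed backend port
  protocol    = "TCP"
  target_type = "ip"
  vpc_id      = var.vpc_id

  # Backends see the client's source IP instead of the NLB's address.
  preserve_client_ip = true
}
```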


When to Choose Which?


Choose an ALB if you are building:
  • Standard web applications and microservices.
  • Containerized apps (ECS/EKS) requiring path-based or host-based routing.
  • Applications requiring tight integration with AWS Web Application Firewall (WAF).

Choose an NLB if you are building:
  • Non-HTTP applications (e.g., gaming servers, SFTP, MQTT, database clusters).
  • Architectures requiring fixed, static IP addresses or Elastic IPs.
  • High-frequency financial applications where very low network latency is a hard requirement.

Which alerts should typically be set for AWS ALB?


To keep your applications highly available, you should set up Amazon CloudWatch alarms for a mix of availability, performance, and target health metrics.

The most critical metrics to monitor for an AWS ALB are grouped by priority below:

1. High Priority (Critical Infrastructure Impact)

UnHealthyHostCount (Per Target Group)
What it means: The number of backend instances failing health checks (Target Group Metrics).
Alert Threshold: > 0 (or > 1 for larger clusters).
Why it matters: Signals that your servers are crashing or cannot handle traffic.
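
As a sketch, this alarm could be expressed in Terraform as follows. The variable names and the SNS topic are assumptions; the arn_suffix values are the attributes exported by the load balancer and target group resources.

```hcl
variable "alb_arn_suffix" {
  type = string # e.g. the arn_suffix attribute of the aws_lb resource
}

variable "target_group_arn_suffix" {
  type = string # e.g. the arn_suffix attribute of the target group resource
}

variable "oncall_sns_topic_arn" {
  type = string # hypothetical SNS topic that pages the on-call engineer
}

resource "aws_cloudwatch_metric_alarm" "unhealthy_hosts" {
  alarm_name          = "alb-unhealthy-hosts"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum"
  period              = 60 # seconds
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0 # raise to 1 for larger clusters
  alarm_actions       = [var.oncall_sns_topic_arn]

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
    TargetGroup  = var.target_group_arn_suffix
  }
}
```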

HTTPCode_Target_5XX_Count
What it means: The number of 5xx server error codes generated by your backend application (ALB Metrics).
Alert Threshold: Depends on baseline traffic, typically > 5 failures within a 1-minute to 5-minute window.
Why it matters: Indicates server crashes, database connection timeouts, or unhandled exceptions in your application code.

HTTPCode_ELB_5XX_Count
What it means: The number of 5xx errors generated directly by the ALB itself, not by your servers (ALB Metrics).
Alert Threshold: > 0.
Why it matters: Usually means the ALB cannot find any healthy hosts, or it is experiencing a configuration mismatch (e.g., bad TLS handshake with the target).

2. Medium Priority (Performance & User Experience)

TargetResponseTime
What it means: The time elapsed (in seconds) from when the ALB sent the request to the target until the target started responding (ALB Metrics).
Alert Threshold: Use the p95 or p99 statistic. Alert if it exceeds your application’s maximum acceptable latency (e.g., > 2.0 seconds).
Why it matters: Users are experiencing severe application slowdowns, likely due to high CPU/memory usage on your instances.
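
Percentile statistics such as p95 use the extended_statistic argument rather than statistic. A minimal sketch, assuming a 2-second latency limit and hypothetical variable names:

```hcl
variable "alb_arn_suffix" {
  type = string # e.g. the arn_suffix attribute of the aws_lb resource
}

variable "ticket_sns_topic_arn" {
  type = string # hypothetical SNS topic feeding a ticket or chat channel
}

resource "aws_cloudwatch_metric_alarm" "slow_responses" {
  alarm_name          = "alb-target-response-time-p95"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  extended_statistic  = "p95" # percentile instead of Average/Sum/Maximum
  period              = 300   # seconds
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 2.0 # seconds; replace with your own latency limit
  alarm_actions       = [var.ticket_sns_topic_arn]

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }
}
```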

RejectedConnectionCount
What it means: The number of connections the load balancer rejected because it had reached its maximum number of connections (ALB Metrics).
Alert Threshold: > 0.
Why it matters: Your application is getting sudden traffic spikes and the ALB cannot scale fast enough, or backend targets are failing to keep up.

3. Low Priority (Anomalies & Security)

HTTPCode_Target_4XX_Count
What it means: The number of 4xx client errors (like 404 Not Found or 401 Unauthorized) returned by backend targets (ALB Metrics).
Alert Threshold: A significant spike above your standard baseline.
Why it matters: A sudden surge might indicate a broken frontend deployment, a bad API update, or a malicious entity scanning your network for vulnerabilities.

Summary Checklist for CloudWatch Alarms

Metric Name                Statistic  Recommended Period  Suggested Threshold        Action
===========                =========  ==================  ===================        ======
UnHealthyHostCount         Maximum    1 Minute            > 0                        Page/On-Call
HTTPCode_ELB_5XX_Count     Sum        1 Minute            > 0                        Page/On-Call
HTTPCode_Target_5XX_Count  Sum        5 Minutes           > 10 (or > 1% of traffic)  Ticket/Slack
TargetResponseTime         p95        5 Minutes           > [Your Limit]             Ticket/Slack
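
The "> 1% of traffic" variant of the target 5xx threshold needs a ratio of two metrics, which CloudWatch metric math can provide. The following is a sketch under assumptions (variable names and the SNS topic are hypothetical), dividing HTTPCode_Target_5XX_Count by RequestCount over 5-minute periods:

```hcl
variable "alb_arn_suffix" {
  type = string # e.g. the arn_suffix attribute of the aws_lb resource
}

variable "ticket_sns_topic_arn" {
  type = string # hypothetical SNS topic feeding a ticket or chat channel
}

resource "aws_cloudwatch_metric_alarm" "target_5xx_rate" {
  alarm_name          = "alb-target-5xx-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 1 # percent of requests
  treat_missing_data  = "notBreaching" # no traffic means no alarm
  alarm_actions       = [var.ticket_sns_topic_arn]

  # The expression the alarm actually evaluates.
  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / requests"
    label       = "Target 5xx rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"

    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "HTTPCode_Target_5XX_Count"
      period      = 300
      stat        = "Sum"

      dimensions = {
        LoadBalancer = var.alb_arn_suffix
      }
    }
  }

  metric_query {
    id = "requests"

    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "RequestCount"
      period      = 300
      stat        = "Sum"

      dimensions = {
        LoadBalancer = var.alb_arn_suffix
      }
    }
  }
}
```

A percentage threshold scales with traffic, so it tends to be less noisy than a fixed count on busy load balancers and less blind on quiet ones.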





Resources: