My Public Notepad: Mitigating Load Balancer Routing Clumping

In computing and networking, routing clumping (also known as traffic clumping or clustering) refers to the phenomenon where a load balancer or router unevenly sends bursts of traffic to the same backend servers instead of distributing requests evenly across the entire available server pool.

While a load balancer is designed to act as a traffic cop, routing clumping defeats this purpose, momentarily overloading specific servers while leaving others idle.

Primary Causes of Routing Clumping

Hashing Bias: Hash-based algorithms, such as source-IP hashing or 5-tuple hashing (source IP, destination IP, source and destination ports, and protocol), compute a fixed value to select a routing path. If thousands of users share a proxy or exit node, a hash keyed on the source IP produces the same value for all of them, and the load balancer clumps their traffic onto a single backend server.
Sticky Sessions: If you use session persistence (sticky sessions) based on a client's IP, multiple requests from the same office or network range will clump to a single server to maintain session continuity.
Connection Multiplexing: Load balancers maintain long-lived backend connections to save resources. When a new batch of requests comes in, the load balancer routes them through existing open connections, causing requests to cluster on specific nodes rather than opening new connections to idle servers.
Uneven Pod or Instance Count: In environments like Kubernetes, if traffic is balanced per node rather than per pod, an unequal distribution of application pods across nodes or restrictive node affinities can cause traffic to clump in specific zones or servers.
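As a rough illustration of the hashing-bias cause, the sketch below maps source IPs onto a four-server pool with a stable hash. The backend names, the addresses, and the use of SHA-256 are assumptions made for the example; real load balancers use their own hash functions and keys.

```python
import hashlib
from collections import Counter

BACKENDS = ["backend-0", "backend-1", "backend-2", "backend-3"]  # hypothetical pool


def pick_by_source_ip(source_ip: str) -> str:
    """Map a source IP to a backend with a stable hash (source-IP hashing)."""
    digest = hashlib.sha256(source_ip.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]


def distribution(source_ips):
    """Count how many requests each backend receives."""
    counts = Counter({backend: 0 for backend in BACKENDS})
    counts.update(pick_by_source_ip(ip) for ip in source_ips)
    return counts


# 1,000 clients with distinct addresses...
distinct = [f"10.0.{i // 250}.{i % 250}" for i in range(1000)]
# ...versus 1,000 clients behind one proxy that all present the same address.
shared = ["203.0.113.7"] * 1000

spread = distribution(distinct)
clumped = distribution(shared)
print("distinct IPs:", dict(spread))
print("shared proxy:", dict(clumped))
```

With distinct addresses the counts come out roughly even, while the shared-proxy case puts all 1,000 requests on one backend and leaves the other three idle.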

How to Prevent and Mitigate Clumping

Use Round Robin or Least Connections: Switch from IP-hash or other fixed-hash methods to round robin or least connections. Round robin rotates through the targets regardless of who the client is, and least connections picks the target with the fewest active connections at that moment, so neither depends on the hash of a client's address.
Enable Cross-Zone Load Balancing: For cloud architectures (like AWS Elastic Load Balancing), enabling cross-zone load balancing forces your load balancers to distribute traffic evenly across all registered targets in all enabled Availability Zones.
Use Layer 7 Intelligent Routing: Instead of Layer 4 packet forwarding, use a Layer 7 Application Load Balancer to inspect HTTP headers and distribute traffic based on request context, allowing for finer-grained distribution.
Deploy Hierarchical Splitting: For massive, hyperscale systems, implementing a hierarchical L4 + L7 structure (ultra-fast L4 hardware balancers routing to a software-based L7 pool) helps prevent clumping while maintaining high raw throughput.

Why didn't round robin routing work well in this scenario, and why do we need least connections routing instead?

The answer comes down to the kind of workload service3 is. Round robin isn't broken in general; it's the wrong fit for this specific traffic shape.

What each algorithm does

Round robin: hands each new request to the next target in rotation (pod A, B, C, A, B, C…), purely by count — blind to how busy each target currently is.
Least outstanding requests (LOR): sends each new request to the target with the fewest in-flight requests right now — i.e., the least-busy one.
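The two decision rules can be sketched in a few lines. The class names and pod names are made up for illustration, and the tie-breaking (first target in list order) is an assumption; an actual load balancer may break ties differently.

```python
from itertools import cycle


class RoundRobin:
    """Hand out targets in a fixed rotation, ignoring how busy they are."""

    def __init__(self, targets):
        self._rotation = cycle(targets)

    def pick(self, in_flight):
        return next(self._rotation)


class LeastOutstandingRequests:
    """Pick the target with the fewest in-flight requests right now."""

    def __init__(self, targets):
        self._targets = list(targets)

    def pick(self, in_flight):
        # Ties go to the first target in list order (an assumption of this sketch).
        return min(self._targets, key=lambda target: in_flight[target])


# pod-b is busy with two in-flight requests; the others are idle.
in_flight = {"pod-a": 0, "pod-b": 2, "pod-c": 0}
rr = RoundRobin(in_flight)
lor = LeastOutstandingRequests(in_flight)

rr_picks = [rr.pick(in_flight) for _ in range(3)]

lor_picks = []
for _ in range(3):
    target = lor.pick(in_flight)
    lor_picks.append(target)
    in_flight[target] += 1  # the new request is now in flight on that pod

print("round robin:", rr_picks)
print("least outstanding:", lor_picks)
```

Round robin includes the busy pod-b in its rotation, while the LOR picker avoids it until the other pods have caught up to its count.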

Why round robin fails here

Round robin distributes by count, which only equals distributing by load when requests are uniform and short. service3 is the opposite — it has the toxic combination:

1. Highly variable, long service times — a spaCy extraction is anywhere from ~0.3s to multi-second depending on text length/entity count. One heavy request ties a pod up for seconds.

2. Low per-pod concurrency — each pod does ~1–2 concurrent extractions (2 workers, GIL-bound). It saturates almost immediately.

Under those conditions, round robin keeps handing new requests to a pod that's still grinding a heavy multi-second extraction, just because it's "next in rotation" — while a pod that just finished sits idle. The new request then queues head-of-line behind the heavy one for seconds, even though free capacity exists elsewhere. Over a burst this compounds: unlucky pods accumulate a backlog while others drain and idle → exactly the clumping / 33%-CPU-but-3s-latency pattern we measured.

The analogy: round robin is a supermarket sending shoppers to checkout lanes strictly 1‑2‑3‑1‑2‑3, ignoring that lane 2 has someone with a giant cart. You get stuck behind the big cart while lane 3 is empty. LOR is "go to the lane with the fewest items in it right now."

Why LOR fixes it

LOR tracks in-flight requests per target and skips the busy pod — a pod mid-heavy-extraction has a high outstanding count, so new requests go to the idle/draining pods instead. It self-corrects for variance (whichever pod is busy, for any reason, automatically gets fewer new requests until it catches up) and it reaches the idle capacity round robin was stranding.
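A small simulation makes the difference concrete. Everything here is an assumption chosen for the example, not a measurement of service3: three single-worker pods with FIFO queues, one request every 0.5s, and every third request heavy (3.0s) while the rest are light (0.3s). The pattern is deliberately adversarial, because heavy requests line up with the rotation; random mixes typically show a smaller gap in the same direction.

```python
HEAVY, LIGHT = 3.0, 0.3  # hypothetical service times in seconds
PODS = 3
ARRIVAL_GAP = 0.5        # one request every 0.5s


def workload(n):
    """Every third request is heavy; the rest are light."""
    return [HEAVY if i % 3 == 0 else LIGHT for i in range(n)]


def simulate(service_times, choose):
    """Return mean latency; each pod is a single worker with a FIFO queue."""
    free_at = [0.0] * PODS                # when each pod finishes its queued work
    finishes = [[] for _ in range(PODS)]  # completion times of assigned requests
    latencies = []
    for i, service in enumerate(service_times):
        now = i * ARRIVAL_GAP
        # In-flight = assigned to the pod and not yet completed at arrival time.
        in_flight = [sum(1 for done in f if done > now) for f in finishes]
        pod = choose(i, in_flight)
        start = max(now, free_at[pod])
        free_at[pod] = start + service
        finishes[pod].append(free_at[pod])
        latencies.append(free_at[pod] - now)
    return sum(latencies) / len(latencies)


def round_robin(i, in_flight):
    return i % PODS


def least_outstanding(i, in_flight):
    return in_flight.index(min(in_flight))


jobs = workload(30)
rr_mean = simulate(jobs, round_robin)
lor_mean = simulate(jobs, least_outstanding)
print(f"round robin mean latency: {rr_mean:.2f}s")
print(f"least outstanding mean latency: {lor_mean:.2f}s")
```

Under these assumptions round robin sends every heavy request to the same pod, whose backlog keeps growing (mean latency about 3.45s), while LOR never queues a request behind a busy pod (mean latency about 1.20s).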

The catch — why we also needed IP mode

LOR only helps if the ALB is choosing among the right units. In instance mode, the ALB's targets are nodes, so it balances outstanding requests across nodes. kube-proxy then picks a pod at random (by default from all of the Service's pods, not only those on that node), re-scrambling the per-pod balance. LOR cannot see per-pod busyness in instance mode.

IP mode makes the targets pods, so LOR's "least outstanding" is measured per pod → it can actually steer to the least-busy pod. That's why the fix is ip mode + LOR together: ip mode gives LOR the right granularity; LOR gives ip mode the right decision rule. Either alone wouldn't do it.
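The granularity problem can be sketched as follows. The node and pod names and the in-flight counts are hypothetical. The sketch assumes kube-proxy forwards to any ready pod of the Service at random (the default externalTrafficPolicy: Cluster behaviour), and it approximates each node's outstanding count as the sum of its pods' counts.

```python
import random

# Hypothetical in-flight request counts per pod, grouped by node.
PODS_BY_NODE = {
    "node-1": {"pod-a": 2, "pod-b": 0},
    "node-2": {"pod-c": 1, "pod-d": 2},
}


def ip_mode_pick(pods_by_node):
    """ip mode: targets are pods, so LOR compares per-pod counts directly."""
    per_pod = {
        pod: count
        for pods in pods_by_node.values()
        for pod, count in pods.items()
    }
    return min(per_pod, key=per_pod.get)


def instance_mode_pick(pods_by_node, rng):
    """instance mode: LOR compares per-node totals, then kube-proxy picks the pod."""
    node = min(pods_by_node, key=lambda name: sum(pods_by_node[name].values()))
    # Assumption: kube-proxy forwards to any ready pod of the Service at random.
    candidates = [pod for pods in pods_by_node.values() for pod in pods]
    return node, rng.choice(candidates)


rng = random.Random(0)
ip_choice = ip_mode_pick(PODS_BY_NODE)
instance_choices = [instance_mode_pick(PODS_BY_NODE, rng)[1] for _ in range(1000)]
missed = sum(1 for pod in instance_choices if pod != ip_choice)

print("ip mode picks:", ip_choice)
print("instance mode missed the idlest pod in", missed, "of 1000 picks")
```

In ip mode the idle pod-b is chosen every time. In instance mode the ALB's choice of node has no control over which pod finally serves the request, so busy pods keep receiving new work.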

Two footnotes

- The other 5 services keep round robin, deliberately. auth/data/game/webhook are fast (<0.5s), uniform, high-concurrency; round robin distributes them fine, and there's no reason to touch them. The algorithm choice is workload-specific: LOR for slow/variable/expensive backends, round robin for fast/uniform ones.

- Terminology: you said "least connections" — the ALB (L7) equivalent is least outstanding requests, which counts in-flight HTTP requests, not TCP connections. That distinction matters here: the ALB reuses keep-alive connections (one connection carries many sequential requests), so "least connections" would be misleading — LOR counts the actual requests, which is the right signal for this.

In which cases does round robin not help, so that we need least outstanding requests routing?

While Round-Robin works well for uniform traffic and identical servers, it falls short when request complexity or server capacity varies. Consider switching from Round-Robin to Least Outstanding Requests (LOR), the request-level counterpart of Least Connections, in the following four engineering scenarios:

1. Varying Request Processing Times (Asymmetric Workloads)

The Problem: In many applications, some API calls take 5 milliseconds (e.g., fetching a profile cached in memory), while others take 5 seconds (e.g., generating a heavy PDF report or running a complex database query).
Why Round-Robin Fails: Round-Robin blindly hands out requests in a strict, alternating sequence. If Server A happens to receive a consecutive string of heavy PDF requests while Server B receives fast cache requests, Server A's queue will spike, causing high latency or timeouts, while Server B sits mostly idle.
How LOR Helps: LOR tracks the in-flight request count per server. It notices that Server A is backed up with pending work and steers new incoming traffic to Server B until Server A's count drops back.

2. Heterogeneous Server Capacities (Mixed Server Sizes)

The Problem: Production clusters often use mixed hardware. For example, during an auto-scaling event, you might temporarily mix older 4-core virtual machines with newer, high-performance 16-core instances.
Why Round-Robin Fails: Round-Robin treats every backend target as equal. It sends exactly 1,000 requests to the weak 4-core machine and exactly 1,000 requests to the powerful 16-core machine. The weaker machine will quickly choke, run out of memory, or drop packets, while the stronger machine remains underutilised.
How LOR Helps: The faster, more powerful server processes and closes connections much quicker than the weaker server. Because its outstanding connection count drops rapidly, LOR naturally funnels a significantly higher volume of traffic to the stronger hardware without needing manual weight configurations.
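The mixed-capacity case can be sketched with two single-worker servers, one four times faster than the other. The service times (2.0s and 0.5s), the arrival rate, and the tie-breaking towards the first server are all assumptions picked to keep the arithmetic easy to follow.

```python
SERVICE_TIME = [2.0, 0.5]  # seconds per request: index 0 = slow, 1 = fast (hypothetical)
ARRIVAL_GAP = 0.5          # one request every 0.5s


def simulate(n_requests, choose):
    """Return (requests per server, mean latency) for single-worker FIFO servers."""
    free_at = [0.0, 0.0]   # when each server finishes its queued work
    finishes = [[], []]    # completion times of the requests each server received
    latencies = []
    for i in range(n_requests):
        now = i * ARRIVAL_GAP
        in_flight = [sum(1 for done in f if done > now) for f in finishes]
        server = choose(i, in_flight)
        start = max(now, free_at[server])
        free_at[server] = start + SERVICE_TIME[server]
        finishes[server].append(free_at[server])
        latencies.append(free_at[server] - now)
    counts = [len(f) for f in finishes]
    return counts, sum(latencies) / n_requests


def round_robin(i, in_flight):
    return i % 2


def least_outstanding(i, in_flight):
    return in_flight.index(min(in_flight))  # ties go to the slow server here


rr_counts, rr_mean = simulate(40, round_robin)
lor_counts, lor_mean = simulate(40, least_outstanding)
print("round robin:", rr_counts, f"mean latency {rr_mean:.3f}s")
print("least outstanding:", lor_counts, f"mean latency {lor_mean:.3f}s")
```

In this setup round robin splits the 40 requests 20/20 and the slow server's queue grows steadily (mean latency 6.0s), while LOR settles into a 10/30 split with no queueing (mean latency 0.875s), without any weights being configured.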

3. Persistent and Long-Lived Connections (WebSocket & gRPC)

The Problem: Modern applications rely heavily on persistent connections like WebSockets, HTTP/2 multiplexing, server-sent events (SSE), or gRPC streams. These connections stay open for minutes or hours.
Why Round-Robin Fails: Round-Robin only counts the initial connection establishment. If Server A hosts 50 clients who stay connected for 3 hours chatting, and Server B hosts 50 clients who disconnect after 30 seconds, Round-Robin will continue to feed new connections to both servers equally. Server A will slowly buckle under the cumulative weight of long-lived active sessions.
How LOR Helps: LOR continuously monitors active, open connections rather than connection arrival rates. It will see that Server A has 50 active outstanding connections and Server B has 0, immediately routing all new clients to Server B.

4. Unpredictable Backend "Cold Starts" and Drifts

The Problem: When a new application container or server boots up (e.g., during a Kubernetes deployment or AWS auto-scaling event), it often suffers from a "cold start" where it runs slowly while warming up caches or compiling code just-in-time (JIT).
Why Round-Robin Fails: Round-Robin immediately gives the newly booted server its full share of production traffic. Because the server is not yet performing at 100% capacity, this sudden burst can overwhelm it shortly after launch.
How LOR Helps: LOR tends to throttle traffic to the warming server. Since the cold server processes its first few requests slowly, its outstanding request count rises, signaling the load balancer to back off and route traffic elsewhere until the server catches up. One caveat: a freshly registered server starts with zero outstanding requests, so it can still receive a short initial burst before this feedback takes effect.

Algorithm Comparison Table

| Scenario | Round-Robin Behavior | Least Outstanding Requests Behavior |
| --- | --- | --- |
| All requests take equal time | Perfect, completely equal distribution. | Excellent, identical result to Round-Robin. |
| Mix of short & long requests | Overloads random servers (creates traffic clumps). | Dynamically balances the processing queue. |
| Mixed server capacities | Overwhelms smaller, weaker backend servers. | Directs more traffic to faster instances automatically. |
| Long-lived WebSockets / gRPC | Ignores session duration; causes massive load imbalance. | Tracks real-time active sessions perfectly. |

If you are using a cloud platform, you can learn more about implementing this via the AWS Least Outstanding Requests Documentation or the NGINX Least Connections Guide.

k8s ingress - AWS ALB

To implement Least Outstanding Requests (LOR) routing with an AWS Application Load Balancer (ALB) inside a Kubernetes cluster, you must configure it using the AWS Load Balancer Controller.

By default, the AWS ALB uses Round-Robin routing. You can change this behavior by applying a target group attributes annotation to your Kubernetes Ingress (or to the backing Service).

1. Ingress Configuration Example

Apply the alb.ingress.kubernetes.io/target-group-attributes annotation to your Ingress manifest. This tells the AWS controller to configure the underlying target groups to use the load_balancing.algorithm.type=least_outstanding_requests setting.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: my-app-ingress
      namespace: production
      annotations:
        # Essential ALB Controller configurations
        # (this annotation is deprecated but still honoured; newer setups use spec.ingressClassName: alb)
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: ip
        # Enable Least Outstanding Requests Routing
        alb.ingress.kubernetes.io/target-group-attributes: load_balancing.algorithm.type=least_outstanding_requests
    spec:
      rules:
        - http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: my-app-service
                    port:
                      number: 80

2. Crucial Step: Use "IP" Target Type

When you rely on LOR with the AWS ALB in Kubernetes, set alb.ingress.kubernetes.io/target-type: ip rather than instance mode.

Why? In instance mode, the ALB routes traffic to NodePorts on the EC2 worker nodes. The node then uses kube-proxy (iptables or IPVS) to forward each connection to a pod chosen at random. This defeats Least Outstanding Requests, because the ALB only sees outstanding requests per node, not per pod.
The Fix: Using ip mode configures the ALB to skip the NodePort and kube-proxy hop and route traffic directly to the Pod IPs. The ALB then tracks the outstanding request count for each pod.

3. Verify the Configuration

After applying the manifest via kubectl apply -f ingress.yaml, you can verify the changes are active in AWS:

Open the AWS EC2 Console and navigate to Target Groups.
Select the target group automatically generated by your Kubernetes Ingress.
Click on the Attributes tab.
Verify that Routing algorithm is set to Least outstanding requests.

Alternatively, check the AWS Load Balancer Controller logs to ensure the modification was successfully reconciled:

kubectl logs -n kube-system deployment/aws-load-balancer-controller
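The console check can also be scripted. The sketch below assumes boto3 is installed and AWS credentials are configured. The helper names are made up for this example, and the target group ARN has to be supplied by you. The parsing helper is separated out so it can be tried on a sample attribute list without AWS access.

```python
from typing import Dict, List, Optional

ALGORITHM_KEY = "load_balancing.algorithm.type"


def routing_algorithm(attributes: List[Dict[str, str]]) -> Optional[str]:
    """Extract the routing algorithm from a list of target group attributes."""
    for attribute in attributes:
        if attribute.get("Key") == ALGORITHM_KEY:
            return attribute.get("Value")
    return None


def fetch_routing_algorithm(target_group_arn: str) -> Optional[str]:
    """Query AWS for the target group's attributes (needs boto3 and credentials)."""
    import boto3  # imported lazily so the parsing helper works without AWS access

    client = boto3.client("elbv2")
    response = client.describe_target_group_attributes(
        TargetGroupArn=target_group_arn
    )
    return routing_algorithm(response["Attributes"])


# Sample data in the Key/Value shape the API returns.
sample = [
    {"Key": "stickiness.enabled", "Value": "false"},
    {"Key": ALGORITHM_KEY, "Value": "least_outstanding_requests"},
]
print(routing_algorithm(sample))
```

Calling fetch_routing_algorithm with the ARN of the target group created for the Ingress should return least_outstanding_requests once the controller has reconciled the annotation.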

---

Thursday, 2 July 2026
