
Friday, 26 June 2026

DevOps Interview Questions - k8s ALB alarm




Post-incident review / technical interview questions.

These questions and answers were generated by Claude after it analyzed and fixed a CloudWatch alarm that had been flapping. The prompt I used:

I was following your analysis and resolution of this issue in order to acquire your knowledge. I would like to test my knowledge now. Can you compile a list of questions which cover every aspect of the issue and solution here? Don't be shy of creating a really long list of questions. If I am able to answer them all that means I am able to fix the issue on my own next time. Also prepare the answer key.



A. The alarm itself (CloudWatch / ALB fundamentals)

  1. What does the metric TargetResponseTime actually measure, and from whose perspective (client → ALB → target)?

  2. The alarm name was k8s-api-prod-core-7db0ccf2c3-target-response-time. What do the k8s- prefix and the hash portion tell you about how it was created?

  3. The alarm config was: threshold 0.8, GreaterThanThreshold, period 60s, 2 evaluation periods, statistic Average. In plain English, what condition makes it fire?

  4. Why does the alarm use the LoadBalancer dimension only, and not a TargetGroup dimension? What consequence did that have for our investigation?

  5. What is "flapping," and why does this particular threshold/period combination make flapping likely for a bursty workload?

  6. How do you pull an alarm's state-transition history, and what did 9 OK ↔ ALARM cycles in 5 hours tell you?

  7. The alarm fires on Average. Why is that distinction (vs Maximum/p99) absolutely central to both the diagnosis and the fix?

  8. What is the "low-traffic statistical artifact" pattern, where a handful of slow requests inflate the average on a near-idle target — and what evidence did we use to rule it out here?

B. Narrowing from ALB to one service

  1. The ALB fronted six target groups. Name the technique we used to find which one was responsible, and the AWS CLI call behind it.

  2. Why can a single ALB serve six different Kubernetes services? What AWS-LB-Controller concept ties them together (hint: group.name)?

  3. Given the target group name k8s-default-dataapi-88e28b4b6c, how do you map it back to a Kubernetes Service and namespace?

  4. During the burst, data-api showed 5.83s avg / 29.7s max while every other service was <0.5s. Why did that immediately exonerate the shared ALB/ingress as the cause?

C. First look at the workload

  1. What does kubectl top pods show, and why is a single snapshot from it dangerous as evidence?

  2. The first snapshot showed one pod at 997m and three near-idle. What two different explanations are consistent with that, and why can't a snapshot distinguish them?

  3. What's the difference between a CPU request and a CPU limit in Kubernetes?

  4. How did we check whether the pod was being CPU-throttled at its limit, and what file did we read inside the container? What did nr_throttled / throttled_usec tell us?

  5. A deploy to v1.14.3-prod had happened ~15 min earlier. Why was it a red herring, and what evidence dated the flapping as pre-existing?

  6. Distinguish the three probe types (liveness, readiness, startup). What does each one control?

  7. The deployment had a startupProbe and livenessProbe but no readinessProbe. Operationally, what can't the system do without a readiness probe?

D. The load-balancing theory (and why it was wrong)

  1. Explain why round-robin load balancing degrades when request durations are highly variable. Use the "request count vs. total work" framing.

  2. What is the feedback-loop difference between round-robin and least-outstanding-requests (LOR)?

  3. What older, well-known algorithm is LOR equivalent to in nginx/HAProxy terms, and what does AWS's own guidance say about when to use LOR vs round-robin?

  4. "Head-of-line blocking" appeared twice in this incident at two different layers. Name both layers and how each one blocks.

  5. Why is a single kubectl top snapshot insufficient to prove round-robin is causing imbalance, and what data would actually prove or refute it?

  6. We initially called round-robin the "smoking gun," then withdrew it. What specifically made that conclusion wrong?

E. Target mode: instance vs IP (the topology that broke the theory)

  1. What's the difference between an ALB target group in instance mode vs ip mode? How can you tell which one you have from the registered targets and the target-group port?

  2. The target group registered 10 EC2 instances on port 31892. What does that tell you about the request path from ALB to pod?

  3. What is a NodePort service, and what is externalTrafficPolicy: Cluster vs Local?

  4. With instance mode + Cluster policy, describe the full path of a request from client to the backend process, including every hop and who load-balances at each hop.

  5. Given that topology, explain precisely why switching the ALB to least_outstanding_requests would not fix per-pod imbalance.

  6. What two changes together would enable load-aware per-pod routing, and why is that a much bigger change than a one-line annotation?

  7. Why were ALB access logs not useful for confirming per-pod distribution in this setup?

F. Getting the proof (Prometheus / time series)

  1. What is kube-prometheus-stack, and where did it live in the cluster?

  2. Why is kubectl port-forward an acceptable read-only way to query Prometheus, and what does it actually do?

  3. Write (conceptually) the PromQL that gives per-pod CPU usage over time. Why rate(container_cpu_usage_seconds_total[...]) rather than the raw counter?

  4. The time series showed all four pods evenly at 0.5–0.95 cores during the burst. Why did that refute the imbalance theory in one stroke?

  5. Each pod plateaued at ~0.9 cores despite a 1.5-core limit. What does that plateau strongly imply about the process model inside the pod?

G. The real root cause (app server / GIL / async)

  1. What is an application server worker (e.g., Gunicorn), and what's the difference between the master process and a worker process?

  2. What is a Global Interpreter Lock (GIL) or similar single-threaded runtime constraint, and why does it mean one worker process ≈ one core of CPU-bound throughput?

  3. The config had workers = 1 and an asynchronous event-loop worker class specified. Explain what each line does.

  4. What is the difference between a synchronous worker and an asynchronous worker in an application server? When does each shine?

  5. Why is an async (event-loop) worker the wrong model for heavy, synchronous CPU-bound data processing? What does "blocking the event loop" mean concretely?

  6. So with one async worker doing CPU-bound work, what is the per-pod concurrency for heavy requests — and how did that produce the fleet-wide ceiling of ~4 concurrent requests?

  7. How did we confirm the worker count and the CPU quota from inside a running pod (what command, what does cpu.max = 150000 100000 mean, and what does a process count of 2 mean)?

  8. Why was the 1.5-core CPU limit effectively unusable given workers = 1?

  9. Tie it together: explain the full causal chain from "traffic burst" to "alarm fires," in one paragraph, using the confirmed root cause.

H. Designing the fix

  1. We considered vertical (more workers/CPU per pod), horizontal (more pods via HPA), and both. What's the trade-off, and why did "both" win?

  2. Why couldn't we just change the worker class to synchronous (or offload to a process pool) as part of this immediate infrastructure fix? What kind of change would that be?

  3. We set workers = 2. Why did the CPU limit have to go up to 2000m at the same time? What would happen if we'd set workers = 2 but left the limit at 1500m?

  4. The pods used ~2.5Gi RSS each at one worker. Why did we expect ~5Gi with two workers, and why wouldn't worker-fork preloading save us here? (What did we learn about when the application memory caches are built?)

  5. There's one import-time load we found (e.g., heavy model loading in a utility file). Why is that one shareable-via-fork but the bulk of the runtime memory is not?

  6. The HPA was autoscaling/v1, target 80%, min 4 / max 7. We measured bursts peaking at ~75% of the 1200m request. Explain mechanically why the HPA never scaled.

  7. targetCPUUtilizationPercentage is a percentage of what? Recompute: at the new 1500m request and a pod using ~0.9 cores, what utilization does the HPA see?

  8. Why did we lower the target to 50% and raise max to 10, rather than just one of those?

  9. What does a topologySpreadConstraints with maxSkew: 1, topologyKey: kubernetes.io/hostname, whenUnsatisfiable: ScheduleAnyway do — and why soft (ScheduleAnyway) rather than hard (DoNotSchedule)?

  10. We deliberately did not add a readiness probe. Explain the failure mode that a naive /health readiness probe would cause in this specific app under heavy load. Why is "no readiness probe" temporarily safer than a bad one here?

  11. What was the container port vs Service target port mismatch situation? Why was alignment low-risk, and why didn't it affect routing?

  12. An orphaned HPA manifest file was deleted. Why was it safe to delete, and how did we confirm it was no longer active?

I. Where the config lives & deploy mechanics

  1. How did we determine the workload was not managed by GitOps tool deployments (e.g., Argo CD), despite the tool being installed? What metadata annotation was the fingerprint?

  2. Which repository and file holds the deployment/HPA, and which separate repository/file holds the ingress? Why do they deploy through different mechanisms?

  3. Describe the deployment pipeline flow end to end. What event triggers it, and what are the key build/test/deploy jobs?

  4. The deploy step does a string substitution (sed 's#$TAG#...#') then kubectl apply. What's the role of the $TAG placeholder, and where does the tag value come from?

  5. Why does merging a PR to the main branch not deploy anything, while pushing a specific environment release tag does?

  6. The pipeline runs on private self-hosted runners. Why does that matter for a private-endpoint cluster?

J. Capacity analysis

  1. What instance types/sizes back the general-purpose compute tier, and how much allocatable CPU/memory does each have?

  2. What is Cluster Autoscaler, and how does it differ from node lifecycle managers like Karpenter? Which one is active in this cluster?

  3. There are two node groups feeding the tier (spot instances and on-demand instances). What are their min/max sizes, and what's the combined node ceiling?

  4. Do the packing math: given ~3920m allocatable CPU and ~400m daemonset overhead, how many pods at 1500m request fit per node? Why is CPU, not memory, the binding constraint?

  5. At HPA max (10 pods), how many nodes are needed, and is that within the ceiling? What's left for other tenants?

  6. Why is horizontal scale-out slow relative to instant traffic bursts? List every contributor to a cold pod's total time-to-serve.

  7. Given that scale-out lag, which part of our fix delivers immediate relief, and which acts as the slower "second line of defense"?

  8. Why did we bump the memory request to 5Gi even though it doesn't change node packing density?

K. Staging-first deploy & verification

  1. Why deploy to the staging environment first when the staging manifest file wasn't even changed by the PR?

  2. Precisely what does the staging deploy validate, and what does it not validate?

  3. Staging runs on the same cluster as production. How is it isolated, and what was the risk we flagged about a tiny 0.5-core staging pod suddenly running workers=2?

  4. List the verification steps we ran on staging, and the pass criteria for each.

  5. Why did we test both GET /health and a heavy POST data processing route, rather than just the health check?

  6. The startup probe warms up by hitting an internal pre-cache endpoint. What does that endpoint do, and why is it the most likely place for a deploy to fail (especially on a CPU-starved staging pod)?

L. Production rollout & confirmation

  1. What version did we tag, and why a patch bump (v1.14.4) rather than a minor/major version shift?

  2. During the production rollout, two new pods went Pending, one with no available node. What happened next, and which log event confirmed the cluster autoscaler reacting?

  3. Why did the rollout take ~6 minutes and stay safe (no dropped traffic) the whole time? What deployment configuration controls the surge/unavailable behavior?

  4. Post-deploy, the new pods showed only ~2.6Gi memory usage, not ~5Gi. Why — and why is that expected rather than a contradiction of our sizing?

  5. After deploy, individual requests still hit ~3s max, but the alarm stayed OK. Why is that consistent with a successful fix? (Connect it back to question 7.)

  6. We monitored for 45 minutes and saw no flapping, yet we kept the incident ticket open. What's the honest gap in that evidence, and what would definitively close it?

  7. The HPA stayed at 4 pods during the entire monitoring window. Why is that a good sign rather than a sign the HPA fix did nothing?

M. Operational / process & gotchas

  1. The cluster API is private-only. What's the practical consequence for local engineers using kubectl, and why does aws sts get-caller-identity succeed while kubectl times out? How do you distinguish an auth problem from a network/VPN problem?

  2. Why did the CI/CD deployment job still work even when our local machine's kubectl couldn't reach the cluster?

  3. The standard code security checks failed on automated image-scanning and vulnerability gates. Diagnose how to isolate the root cause. Were they caused by our configuration change? How do we prove that?

  4. The repository management system showed a BLOCKED merge status, but branch protection rules returned a 404. What's the resolution of that apparent contradiction (e.g., legacy branch protection vs. modern repository rulesets)?

  5. What was the only thing actually gating the code merge, and why were the red security check flags irrelevant to it?

  6. Which actions in this whole deployment flow required explicit manual operator confirmation before execution, and why those specifically?

  7. What's the commit-authorship convention in this engineering environment, and what must never appear in a commit message or PR description?

N. Synthesis & transfer (test of true mastery)

  1. If you were paged for this exact alarm tomorrow with zero prior context, list the first five commands/queries you'd run, in order, and what each one would tell you.

  2. Name three plausible-but-wrong hypotheses for a flapping TargetResponseTime alarm, and the single piece of evidence that kills each.

  3. Suppose the per-pod CPU time series had shown one pod pinned at 100% and three idle (the load imbalance pattern we originally expected). Given the instance-mode topology, what would the real fix have been then — and why is it different from the LOR annotation?

  4. The fix here was vertical + HPA scaling. Under what circumstances would the correct long-term fix instead be an application architecture change, and what would that change look like?

  5. Generalize the core lesson: what specific property of an application workload makes "adding more replicas behind a standard round-robin/random balancer" fail to resolve response time spikes, and what is the class of fixes that does help?

  6. If the morning peak traffic burst still trips the alarm after this infrastructure fix, what are your next two remediation levers (in order), and what data would you collect to choose between them?



Incident Study: core-service-api · prod-default-k8s

The Flapping Latency Alarm, From Page to Fix

A complete answer key to the 98-question self-test — covering triage, the two false trails, the confirmed root cause, the fix, capacity, deployment, and the operational mechanics.

How to use: Answer from memory first, then check here. If you can reconstruct section N (synthesis) unaided, you can run this incident solo. Answers are deliberately tight; the why-not-the-other-thing reasoning matters more than the recall.

Sections Overview

  • A: The Alarm Itself

  • B: ALB → One Service

  • C: First Look at the Workload

  • D: Load-Balancing Theory (False Trail)

  • E: Instance vs IP Mode (False Trail)

  • F: Getting the Proof

  • G: The Real Root Cause (Confirmed)

  • H: Designing the Fix

  • I: Where Config Lives

  • J: Capacity Analysis

  • K: Staging & Verification

  • L: Prod Rollout

  • M: Operational Gotchas

  • N: Synthesis & Transfer

A. The Alarm Itself

CloudWatch & ALB fundamentals — what the metric is and what makes it fire.

A1. What does TargetResponseTime measure, and from whose perspective?

The elapsed seconds from when the ALB forwards a request to a target until the target starts returning the response headers — i.e., backend processing time as measured by the ALB. It excludes the ALB's own request/response processing and the client ↔ ALB network time. The vantage point is the load balancer, not the client or the pod.

A2. What do the k8s- prefix and the hash in the alarm name tell you?

It's machine-generated, not hand-authored. The ALB k8s-coreproddefault-7db0ccf2c3 is one that the AWS Load Balancer Controller created for an ingress group; the hash is derived from it, and the alarm is named after that LB by monitoring automation. Practical takeaway: it tracks an ALB resource, so you investigate at the ALB, and hand-edits to the alarm may be reverted by whatever automated system manages it.

A3. Threshold 0.8, GreaterThanThreshold, period 60s, 2 evaluation periods, statistic Average — what fires it?

When the 60-second average TargetResponseTime is greater than 0.8s for two consecutive 1-minute periods.
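The firing rule can be sketched in a few lines of Python. This is a simplified model, assuming one datapoint per 60-second period and ignoring CloudWatch's missing-data handling; the function name and the sample averages are illustrative, not taken from the incident.

```python
def alarm_states(averages, threshold=0.8, evaluation_periods=2):
    """Return 'ALARM' or 'OK' after each period, given per-period averages in seconds."""
    states = []
    for i in range(len(averages)):
        # The most recent `evaluation_periods` datapoints, ending at period i.
        window = averages[max(0, i - evaluation_periods + 1): i + 1]
        # GreaterThanThreshold is strict: a value equal to the threshold does not breach.
        breaching = (len(window) == evaluation_periods
                     and all(v > threshold for v in window))
        states.append("ALARM" if breaching else "OK")
    return states


# Hypothetical per-minute averages: one isolated spike, then a two-minute burst.
samples = [0.2, 0.9, 0.3, 1.4, 5.8, 0.4]
print(alarm_states(samples))
```

The isolated 0.9s minute does not fire the alarm; only the two consecutive breaching minutes do, and the alarm clears as soon as one period falls back under the threshold.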

A4. Why only the LoadBalancer dimension, and what did that cost us?

With only the LoadBalancer dimension, the metric is aggregated across all six target groups behind the ALB. Consequence: the alarm can't tell you which specific service is slow, so we had to break latency down per-target-group ourselves to localize it.

A5. What is flapping, and why is it likely here?

Flapping is the alarm rapidly toggling OK ↔ ALARM. A modest threshold (0.8s) on a short 2 × 60s average, against a workload that spikes for a few minutes then falls to near-zero, means every burst crosses the line and every recovery clears it, resulting in repeated state transitions rather than a sustained ALARM.

A6. How do you pull state-transition history, and what did 9 cycles / 5h say?

aws cloudwatch describe-alarm-history --history-item-type StateUpdate. Nine OK ↔ ALARM cycles in five hours signaled a genuine, recurring degradation: a pattern worth real investigation, not a one-off network blip.
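The same history can be pulled and counted from Python. The sketch below assumes boto3 is available and that each state-update item carries a HistorySummary of the form "Alarm updated from OK to ALARM"; the helper names are hypothetical.

```python
def count_ok_to_alarm(history_items):
    """Count OK -> ALARM transitions in a list of alarm history items."""
    return sum(
        1 for item in history_items
        if "from OK to ALARM" in item.get("HistorySummary", "")
    )


def fetch_state_updates(alarm_name, start, end):
    """Fetch StateUpdate history items for one alarm between two datetimes."""
    import boto3  # imported lazily so the counting helper works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    paginator = cloudwatch.get_paginator("describe_alarm_history")
    items = []
    for page in paginator.paginate(
        AlarmName=alarm_name,
        HistoryItemType="StateUpdate",
        StartDate=start,
        EndDate=end,
    ):
        items.extend(page["AlarmHistoryItems"])
    return items


# Offline example with hand-written items shaped like the API response.
example = [
    {"HistorySummary": "Alarm updated from OK to ALARM"},
    {"HistorySummary": "Alarm updated from ALARM to OK"},
    {"HistorySummary": "Alarm updated from OK to ALARM"},
]
print(count_ok_to_alarm(example))
```

A count that keeps growing over several hours is the flapping signature; a single transition is more likely a one-off.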

A7. The alarm fires on Average. Why is that central to both diagnosis and fix?

Because the trigger is the average, the fix doesn't need to eliminate slow individual requests (tail max can stay ~3s) — it needs to keep the average under 0.8s. Adding concurrency pulls the average down even while tail latency persists. This framing drives the diagnosis (bursts raise the average once capacity saturates) and the fix (more concurrency → average drops below threshold). Keep this in mind for Q83.
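A small worked example makes the average-versus-tail distinction concrete. The latency samples below are invented for illustration: both minutes have the same 3s maximum, but only one of them breaches a 0.8s average.

```python
from statistics import mean


def summarize(latencies):
    """Return the average and maximum of a list of request latencies (seconds)."""
    return {"avg": mean(latencies), "max": max(latencies)}


# Hypothetical minute with little queuing: 95 fast requests, 5 slow ones.
healthy = summarize([0.2] * 95 + [3.0] * 5)

# Hypothetical saturated minute: same slowest request, but many more of them.
saturated = summarize([0.2] * 70 + [3.0] * 30)

print(healthy, saturated)
```

An alarm on Maximum or p99 would treat both minutes alike; an alarm on Average only reacts to the second one.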

A8. What is the low-traffic statistical artifact, and how did we rule it out?

On a near-idle target, a handful of slow requests dominate a tiny sample and inflate the average, tripping the alarm with no real systemic problem. We ruled it out because request count rose during the spikes (~100 → 130–320/min) — the latency was load-correlated, meaning real degradation under load, not a sparse-sample artifact.

B. Narrowing from ALB to One Service

B9. How did we find the responsible target group?

A per-target-group breakdown of TargetResponseTime using the TargetGroup dimension, via aws cloudwatch get-metric-data with one query per target group (avg, max, request count) over the burst window.
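One way to build that request is to generate the three queries per target group programmatically. This is a sketch assuming boto3's get_metric_data request shape; the dimension values are placeholders, and the function name is hypothetical.

```python
def target_group_queries(load_balancer, target_groups, period=60):
    """Build MetricDataQueries: avg, max and request count for each target group."""
    wanted = (
        ("avg", "TargetResponseTime", "Average"),
        ("max", "TargetResponseTime", "Maximum"),
        ("count", "RequestCount", "Sum"),
    )
    queries = []
    for n, target_group in enumerate(target_groups):
        dimensions = [
            {"Name": "LoadBalancer", "Value": load_balancer},
            {"Name": "TargetGroup", "Value": target_group},
        ]
        for label, metric, stat in wanted:
            queries.append({
                "Id": f"tg{n}_{label}",  # must start with a lowercase letter
                "Label": f"{target_group} {label}",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/ApplicationELB",
                        "MetricName": metric,
                        "Dimensions": dimensions,
                    },
                    "Period": period,
                    "Stat": stat,
                },
            })
    return queries


# Placeholder dimension values; real ones come from the ALB and target group ARNs.
queries = target_group_queries(
    "app/<lb-name>/<lb-id>",
    [f"targetgroup/<tg-{i}>/<id-{i}>" for i in range(6)],
)
print(len(queries))
```

The resulting list would be passed as MetricDataQueries to boto3.client("cloudwatch").get_metric_data(...) together with StartTime and EndTime covering the burst window.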

B10. Why can one ALB serve six services?

The AWS Load Balancer Controller's ingress grouping feature: ingresses sharing the alb.ingress.kubernetes.io/group.name annotation are merged onto a single ALB, with each backend service getting its own target group.

B11. From k8s-default-dataapi-88e28b4b6c, how do you get back to a Service?

The target-group name encodes namespace + service: namespace default, service truncated to dataapi...core-service-api. Confirm via the ingress backend mapping (host → service) or by matching the target-group port to the service's NodePort.

B12. Why did data-api being slow exonerate the shared ALB/ingress?

Only data-api showed 5.83s avg / 29.7s max; the other five services on the same ALB and ingress were <0.5s. A shared-path fault (ALB, ingress, global TLS) would hurt all of them. Isolation to one backend points squarely at that backend, not the shared infrastructure.

C. First Look at the Workload

C13. What does kubectl top pods show, and why is one snapshot dangerous?

Current per-pod CPU/memory from metrics-server — a point-in-time sample. It has no time axis, so it can't distinguish a persistent hotspot from a transient one or reveal behavior across an entire traffic burst. Conclusions drawn from a single sample are easy to get wrong (we did, briefly).

C14. One pod at 997m, three idle — what two explanations fit, and why can't a snapshot decide?

  • (a) Routing keeps hammering one pod (a persistent imbalance); or

  • (b) That pod merely happens to be mid-way through one heavy request while the others are momentarily between requests (a transient, rotating hotspot).

With no time dimension, a snapshot can't tell a fixed hotspot from a moving one.

C15. CPU request vs limit?

  • Request: Reserved amount used for scheduling (and the denominator for HPA CPU %).

  • Limit: Hard ceiling; exceed CPU and you're throttled, exceed memory and you're OOMKilled.

C16. How did we check CPU-limit throttling, and what did we find?

Read the cgroup stats inside the container: cat /sys/fs/cgroup/cpu.stat. nr_throttled / throttled_usec were near-zero, so the pod was not being throttled at its CPU limit — limit-throttling wasn't the cause; the ceiling was elsewhere.
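Reading that file can be reduced to one number: the share of scheduler periods in which the container was throttled. The sketch below assumes the cgroup v2 cpu.stat layout (one "key value" pair per line); the sample values are invented.

```python
def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat content into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Fraction of enforcement periods in which the cgroup hit its CPU quota."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Hypothetical file content, not the values from the incident.
sample = """usage_usec 912345678
user_usec 800000000
system_usec 112345678
nr_periods 40000
nr_throttled 12
throttled_usec 340000
"""
stats = parse_cpu_stat(sample)
print(throttled_fraction(stats))
```

A fraction near zero, as in this example, says the quota is not the ceiling and the search should move elsewhere.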

C17. Why was the recent v1.14.3 deploy a red herring?

The deploy created the young pods (~12:00), but the alarm history showed flapping starting ~04:34 and running ~8h — well before the deploy. A cause can't post-date its effect, so the deploy was purely coincidental.

C18. Distinguish liveness, readiness, startup probes.

  • Liveness: Is the container alive — fail → restart container.

  • Readiness: Is the pod fit to receive traffic — fail → removed from Service/LB endpoints (no restart).

  • Startup: Has the app finished initializing — gates the other two until it passes; fail → restart container.

C19. What can't the system do without a readiness probe?

It can't shed load: a pod has no way to signal "I'm overloaded, stop sending me traffic" and drop out of the ALB target pool. As long as the container runs, it stays in the routing rotation.

D. The Load-Balancing Theory

False Trail

A well-reasoned hypothesis that the evidence later refuted — kept here because the reasoning is correct in general, just not the cause here.

D20. Why does round-robin degrade with variable request durations?

Round-Robin (RR) equalizes request count, but actual load is count × duration. With uniform durations, equal count ≈ equal load. With high variance, a target that draws a long request still receives its turn-by-turn share of new requests (count-fair) while it's busy, so its backlog grows while targets handling short requests finish quickly and idle. RR is blind to how busy each target actually is.

D21. The feedback-loop difference vs least-outstanding-requests?

RR has no feedback — it rotates regardless of backend state. Least Outstanding Requests (LOR) keeps a per-target in-flight counter and sends each new request to the least-busy target, so a stuck target stops receiving work until it drains. LOR closes the feedback loop that RR lacks.
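The difference between the two policies shows up in a toy simulation. This is a simplified model, assuming each target serves one request at a time in FIFO order; the workload (one 10s request followed by short ones) is invented for illustration.

```python
def mean_latency(arrivals, durations, n_targets, policy):
    """Simulate serial targets and return the mean request latency."""
    free_at = [0.0] * n_targets                  # when each target's queue drains
    completions = [[] for _ in range(n_targets)]  # finish times per target
    latencies = []
    for i, (t, d) in enumerate(zip(arrivals, durations)):
        if policy == "round_robin":
            k = i % n_targets                    # rotate, blind to backend state
        elif policy == "least_outstanding":
            # Count requests still unfinished on each target at arrival time t.
            outstanding = [sum(1 for c in completions[j] if c > t)
                           for j in range(n_targets)]
            k = outstanding.index(min(outstanding))
        else:
            raise ValueError(f"unknown policy: {policy}")
        start = max(t, free_at[k])               # wait behind queued work, if any
        end = start + d
        free_at[k] = end
        completions[k].append(end)
        latencies.append(end - t)
    return sum(latencies) / len(latencies)


arrivals = [0, 1, 2, 3, 4, 5]
durations = [10, 0.1, 0.1, 0.1, 0.1, 0.1]
rr = mean_latency(arrivals, durations, 2, "round_robin")
lor = mean_latency(arrivals, durations, 2, "least_outstanding")
print(rr, lor)
```

Under round-robin, every second request lands behind the 10s job and waits; under least-outstanding-requests, the busy target receives nothing new until it drains, so the short requests stay short.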

D22. What is LOR equivalent to, and what does AWS advise?

It's AWS's name for classic least-connections (nginx least_conn, HAProxy leastconn). AWS guidance: round-robin when requests and targets are similar; least-outstanding-requests when request complexity varies or targets differ in capacity — exactly the 0.01–30s spread we saw.

D23. "Head-of-line blocking" appeared at two layers — name both.

  1. ALB/target layer: Under RR, requests queue behind a slow request on a busy target instance.

  2. Inside the pod: The single async worker's event loop is blocked by a CPU-bound request, freezing other in-flight requests on that same worker.

Same phenomenon, two different layers — and the second turned out to be the real root cause.

D24. Why can't a snapshot prove RR is the cause, and what would?

A snapshot lacks a time axis, so it can't show whether the hot pod persists or rotates, nor whether requests spread evenly by count while load concentrates. Proof needs a time series: per-pod CPU through a burst, or per-target request-count and latency over time.

D25. What made the round-robin conclusion wrong?

Two things: the per-pod CPU time series showed all pods evenly loaded (no pinned pod), and the target group is in instance mode, where the ALB balances across nodes — not pods — so its algorithm can't even produce a per-pod imbalance. Both facts independently killed the theory.

E. Target Mode: Instance vs IP

False Trail

The topology detail that demolished the LB-algorithm fix and reframed the problem.

E26. instance vs ip target mode — how do you tell which you have?

  • instance: Targets are EC2 nodes (ALB → node:NodePort → kube-proxy → pod).

  • ip: Targets are pod IPs directly.

Tell from the registered targets (instance IDs i-... vs pod IPs) and the target-group port (a NodePort like 31892 ⇒ instance; a container port like 8080 ⇒ ip).
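The registered-target check is easy to automate. The sketch below classifies a target ID as it would appear in target-health output; the function name and example IDs are hypothetical.

```python
import ipaddress


def target_mode(target_id):
    """Guess the target type from a registered target's ID."""
    if target_id.startswith("i-"):
        return "instance"   # EC2 instance ID: traffic goes to node:NodePort
    try:
        ipaddress.ip_address(target_id)
        return "ip"         # pod IP: traffic goes straight to the pod
    except ValueError:
        return "unknown"


print(target_mode("i-0123456789abcdef0"), target_mode("10.0.12.34"))
```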

E27. 10 instances registered on 31892 — what does that imply about the path?

The ALB sends to nodes on the NodePort, then each node's kube-proxy forwards to a pod. So the ALB's balancing decision is across nodes, and there's a second hop (kube-proxy) that actually picks the destination pod.

E28. NodePort, and externalTrafficPolicy: Cluster vs Local?

NodePort exposes the service on a static port on every node.

  • Cluster: Any node may forward to any pod cluster-wide (extra network hop, source IP is SNAT'd).

  • Local: A node only forwards to pods running on itself (preserves client IP, no second hop, but a node with no local pod fails LB health checks).

E29. Trace the full request path in this topology.

client → ALB (round-robins across 10 nodes) → node:31892 → kube-proxy (iptables/IPVS, effectively random across all 4 pod endpoints cluster-wide, possibly a second network hop to a pod on a different node) → application master → a worker process.

E30. Why wouldn't LOR fix per-pod imbalance here?

LOR would balance outstanding requests across EC2 nodes, but kube-proxy then picks the destination pod randomly afterward. The ALB neither sees nor controls which pod serves a request, so per-pod imbalance is completely untouched by altering the ALB algorithm.
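A toy model of the two hops shows why the ALB's choice is irrelevant to pod distribution. It assumes kube-proxy picks a pod uniformly at random regardless of the receiving node, and it ignores connection reuse; the "pinned" policy is a deliberately extreme, made-up ALB behavior.

```python
import random
from collections import Counter


def route(n_requests, n_nodes, n_pods, alb_policy, seed=0):
    """Return (requests per node, requests per pod) for a two-hop path."""
    rng = random.Random(seed)  # models kube-proxy's random endpoint choice
    node_hits, pod_hits = Counter(), Counter()
    for i in range(n_requests):
        # Hop 1: the ALB picks a node.
        node = i % n_nodes if alb_policy == "round_robin" else 0
        # Hop 2: kube-proxy on that node picks any pod cluster-wide.
        pod = rng.randrange(n_pods)
        node_hits[node] += 1
        pod_hits[pod] += 1
    return node_hits, pod_hits


nodes_rr, pods_rr = route(1000, 10, 4, "round_robin")
nodes_pin, pods_pin = route(1000, 10, 4, "pinned")
print(pods_rr == pods_pin, nodes_rr == nodes_pin)
```

Even when the node distribution changes completely, the per-pod distribution is identical, because the second hop never looks at the first.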

E31. What two changes together would enable load-aware per-pod routing?

Switch target-type to ip (ALB targets pods directly, removing the kube-proxy hop) and set least_outstanding_requests. It's a much bigger change: ip mode alters the data path, requires specific VPC CNI / security-group wiring for ALB-to-pod traffic, and the algorithm annotation applies ingress-wide.

E32. Why were ALB access logs useless for the per-pod question?

In instance mode, the access log records the EC2 node as the target (node IP:31892), not the individual pod — so it cannot reveal per-pod traffic distribution.

F. Getting the Proof

Replacing inference with a time series — the turning point.

F33. What is kube-prometheus-stack and where did it live?

The Prometheus Operator bundle — Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter. It ran in the dedicated monitoring namespace.

F34. Why is kubectl port-forward an acceptable read-only way to query it?

It opens a local secure tunnel (localhost → pod/service port) through the Kubernetes API server. It proxies a connection without mutating cluster state — so it's completely safe for querying metrics.

F35. What PromQL gives per-pod CPU over time, and why rate(...)?

rate(container_cpu_usage_seconds_total{pod=~"core-service-api.*"}[1m]). The metric is a monotonic CPU-seconds counter; rate(...[1m]) turns it into CPU-seconds consumed per second, i.e., the average number of cores in use over the window. The raw counter alone is meaningless for "how busy is it right now".
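The arithmetic behind the rate is a slope over the window. The sketch below is a simplification: Prometheus's rate() also handles counter resets and extrapolates to the window edges, which this version does not. The sample values are invented.

```python
def cores_used(samples):
    """Average cores in use between the first and last (timestamp, counter) sample."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical scrapes of a CPU-seconds counter, 30 seconds apart.
window = [(0, 1200.0), (30, 1227.0), (60, 1254.0)]
print(cores_used(window))
```

Here the counter grew by 54 CPU-seconds over 60 wall-clock seconds, which is 0.9 cores.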

F36. All four pods even at 0.5–0.95 cores — why did that refute imbalance?

The imbalance theory predicted one pinned pod at 100% and the rest idle. The time series showed the exact opposite — every pod working evenly and hard, none starved — so load was already naturally balanced. There was nothing to rebalance; the LB algorithm was irrelevant.

F37. Why does a ~0.9-core plateau under a 1.5-core limit point at the process model?

A CPU-bound workload that flatlines just under one core despite a 1.5-core allowance is the classic fingerprint of a single-process, single-threaded execution ceiling: the pod cannot push past ~1 core, which points straight at a single worker process.

G. The Real Root Cause

Confirmed

Application server workers, runtime resource locks, and an async worker doing synchronous CPU work.

G38. What is an application server worker vs the master?

A worker is an OS process running a full copy of the application code and handling incoming HTTP requests. The master process binds the network socket, spawns/monitors/restarts workers, and handles reloads — but serves no requests itself.

G39. What is a Global Interpreter Lock (GIL) or single-threaded runtime constraint and its consequence?

It prevents multiple threads in one process from executing interpreter bytecode at the same time. A single process therefore yields about one core of CPU-bound throughput, regardless of how many threads it creates or how many cores are allocated to the container.

G40. What do workers = 1 and worker_class = "async_class_name" do?

workers = 1 creates exactly one worker process. The worker class sets each worker to run an asynchronous event loop, designed to juggle many concurrent connections via async non-blocking I/O multiplexing.

G41. Sync vs async worker — when does each shine?

  • Sync: Handles one request at a time to completion; concurrency comes from adding more worker processes — ideal for CPU-bound tasks.

  • Async: Handles many concurrent requests on a single loop — but is only an advantage when requests spend most of their time waiting on I/O.

G42. Why is async wrong for CPU-bound data processing work?

An event loop can only switch tasks at await points (non-blocking I/O waits). A heavy synchronous CPU computation holds the thread and the runtime lock tightly and never yields, meaning the entire event loop is blocked for the duration of the calculation — freezing every other concurrent request on that worker. Async buys nothing for CPU-heavy tasks and introduces massive head-of-line blocking.
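Blocking the event loop can be reproduced in a few lines of asyncio. In this sketch, time.sleep stands in for synchronous CPU-bound work (it holds the thread without yielding, just as a long computation would); the function names and durations are illustrative.

```python
import asyncio
import time


async def light_request(io_s=0.01):
    """A request that only needs a short I/O wait; returns its observed latency."""
    start = time.perf_counter()
    await asyncio.sleep(io_s)      # yields to the event loop while "waiting on I/O"
    return time.perf_counter() - start


async def heavy_request(cpu_s):
    """A request that never yields: stand-in for synchronous CPU-bound processing."""
    time.sleep(cpu_s)              # blocks the whole event loop for cpu_s seconds


async def run(cpu_s):
    light = asyncio.create_task(light_request())
    await asyncio.sleep(0)         # let the light request start and reach its await
    await heavy_request(cpu_s)     # the loop is frozen here
    return await light             # the light request can only finish afterwards


latency = asyncio.run(run(0.2))
print(f"light request took {latency:.2f}s for 0.01s of I/O")
```

The light request needed 10 ms but could not be resumed until the heavy one released the loop, which is the in-pod head-of-line blocking described above.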

G43. So what's the per-pod and fleet concurrency for heavy requests?

  • Per pod: Effectively one heavy request in flight at a time (one worker, loop blocked during CPU work).

  • Fleet: 4 pods × 1 worker ⇒ a strict fleet ceiling of about 4 concurrent heavy data extractions.

G44. How did we confirm worker count and CPU quota from inside the pod?

  • cat /sys/fs/cgroup/cpu.max → 150000 100000 = quota/period = 1.5 cores.

  • ps | grep -c app_server → returned 2, confirming 1 master + 1 worker, meaning workers=1.

The Prometheus plateau and the running configuration now perfectly agreed.
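Converting cpu.max into cores is a single division. The sketch assumes the cgroup v2 format ("quota period" in microseconds, or "max" for no limit); the function name is hypothetical.

```python
def cpu_limit_cores(cpu_max_text):
    """Convert cgroup v2 cpu.max content into a core count, or None if unlimited."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None                # no CPU limit set on the container
    return int(quota) / int(period)


print(cpu_limit_cores("150000 100000"))
```

For the value read in the pod, 150000 µs of CPU time per 100000 µs period works out to 1.5 cores, matching the 1500m limit.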

G45. Why was the 1.5-core limit largely unusable?

Because one worker ≈ one core max due to the process lock, the extra ~0.5 core of the 1.5 limit could never be consumed by CPU-bound work. The resource allocation was there on paper, but the process model couldn't physically use it.

G46. State the full causal chain, burst → alarm.

A traffic burst arrives → with only ~4 concurrent-request fleet capacity, excess requests queue up immediately. On each pod, the single event loop is completely blocked by the in-progress CPU-bound processing task → per-request latency climbs dramatically into seconds (max ~30s) → the 60s average crosses 0.8s for two consecutive periods → the CloudWatch alarm fires. Once the burst passes, the queue drains, the average falls, the alarm clears, resulting in a flapping pattern.
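The last link of the chain (two consecutive 60s averages above 0.8s) can be sketched as a simplified evaluator. This is not CloudWatch's implementation: it ignores missing-data handling, and the per-minute averages in the example are hypothetical.

```python
def alarm_states(period_averages, threshold=0.8, evaluation_periods=2):
    """State after each period: ALARM only when the last
    `evaluation_periods` averages are all above the threshold."""
    states = []
    for i in range(len(period_averages)):
        window = period_averages[max(0, i - evaluation_periods + 1): i + 1]
        breaching = (len(window) == evaluation_periods
                     and all(value > threshold for value in window))
        states.append("ALARM" if breaching else "OK")
    return states

if __name__ == "__main__":
    # Hypothetical per-minute averages: one isolated spike, then a real burst.
    print(alarm_states([0.2, 0.9, 0.3, 1.5, 2.0, 1.1, 0.4, 0.2]))
```

The isolated spike at 0.9 does not fire because it lasts one period. The sustained burst does, and the state returns to OK as soon as the queue drains, which is the flapping pattern.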

H. Designing the Fix

H47. Vertical, horizontal, or both — and why?

  • Vertical (more workers/CPU per pod) adds concurrency instantly to existing running pods.

  • Horizontal (more pods via HPA) adds capacity but provisions slowly due to node provisioning.

Both won: vertical gives immediate relief on the real bottleneck; HPA adds elastic burst headroom. Horizontal alone is too slow for sudden short bursts; vertical alone has no safety ceiling.

H48. Why not just switch to sync workers / offload to a pool?

The application code is fundamentally built on top of an async framework with coroutine request handlers. Switching to sync workers or offloading heavy work to a background process pool requires rewriting structural parts of the app — an application-code change, which is out of scope for an emergency infrastructure hotfix.

H49. Why raise the CPU limit to 2000m alongside workers=2?

Two workers need up to ~2 full cores. Leaving the limit at 1500m would CPU-throttle them (two processes fighting over 1.5 cores), meaning neither could run a full core and defeating the change. The limit rose to 2000m to back two full workers.
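A rough sketch of that reasoning, assuming the CPU limit is split evenly between busy workers and that one worker can never use more than one core (1000m). Real CFS throttling is burstier than this, so treat it as an approximation.

```python
def per_worker_millicores(limit_millicores, busy_workers):
    """Approximate CPU available to each busy single-core worker."""
    return min(1000, limit_millicores // busy_workers)

if __name__ == "__main__":
    print(per_worker_millicores(1500, 2))  # old limit, two workers
    print(per_worker_millicores(2000, 2))  # new limit, two workers
```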

H50. Why expect ~5Gi at two workers, and why won't application preloading help?

Each worker builds its own heavy data models/caches in memory at runtime (via a warmup endpoint or on initial requests), so memory scales near-linearly with workers (~2.5Gi → ~5Gi). Preloading only shares state loaded at import time before the process forks (copy-on-write); caches built post-fork are private to each worker process.

H51. Why is the import-time model load shareable but the runtime caches aren't?

The base library imports run at module load time — with preloading, this data lives in the master process before the fork and is shared copy-on-write. The actual execution caches are built after the fork, per worker, making them private anonymous memory.
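The pre-fork versus post-fork distinction can be illustrated with a small Unix-only sketch using os.fork directly (the real server's forking is handled by its master process, so this is an analogy, not its code). PRELOADED and runtime_cache are hypothetical names. The sketch shows visibility only; it does not measure copy-on-write page sharing.

```python
import os

PRELOADED = {"model": "loaded at import time"}  # exists before the fork
runtime_cache = {}                              # filled by each worker after the fork

def fork_worker():
    """Fork one worker and report what each side can see."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process, playing the worker
        os.close(read_fd)
        runtime_cache["entities"] = "built after fork"
        message = f"{PRELOADED['model']}|{len(runtime_cache)}"
        os.write(write_fd, message.encode())
        os.close(write_fd)
        os._exit(0)
    os.close(write_fd)
    data = os.read(read_fd, 1024).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    child_model, child_cache_len = data.split("|")
    # Third value: the parent's cache, untouched by the child's writes.
    return child_model, int(child_cache_len), len(runtime_cache)

if __name__ == "__main__":
    print(fork_worker())
```

The child inherits the preloaded data without loading it again, while the cache it builds afterwards never appears in the parent. A second worker would have to build its own copy, which is why memory grows with the worker count.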

H52. Why did the HPA never scale originally?

The targetCPUUtilizationPercentage is a percentage of the CPU request (1200m). Traffic bursts pushed pods to ~0.9 cores ≈ 75% — just under the 80% trigger — and the bursts were too short for the time-averaged utilization metric to sustain above 80%. So the cluster sat flat at 4 replicas.

H53. Percent of what? Recompute at the new request.

It is a percentage of the CPU request. At the new 1500m request, 0.9 cores ≈ 60%. Under real heavy load, two workers can use up to ~2 cores ≈ 133% of the request — far over the new 50% target, which easily triggers scaling.
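The percentages can be checked with a short sketch. The replica formula is the one documented for the HPA (desired = ceil(current × currentMetric / targetMetric)); the sketch ignores the HPA's tolerance band, stabilization windows, and minimum replica count.

```python
import math

def utilization_percent(usage_millicores, request_millicores):
    """CPU utilization as the HPA sees it: usage relative to the request."""
    return usage_millicores * 100 / request_millicores

def desired_replicas(current_replicas, current_percent, target_percent, max_replicas):
    """Simplified HPA scaling decision, capped at max_replicas."""
    wanted = math.ceil(current_replicas * current_percent / target_percent)
    return min(wanted, max_replicas)

if __name__ == "__main__":
    print(utilization_percent(900, 1200))   # old request
    print(utilization_percent(900, 1500))   # new request
    print(desired_replicas(4, 60, 50, 10))  # new target and max
```

The same 0.9 cores of usage reads as 75% against the old request and 60% against the new one, so the target has to be read together with the request it is measured against.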

H54. Why lower the target to 50% and raise max to 10?

Lowering the target makes it scale much earlier during short bursts; raising max gives the cluster room to scale out. A lower target without a higher max caps out too early; a higher max without a lower target would still fail to trigger.

H55. What does the topology spread constraint do, and why soft?

maxSkew: 1 / topologyKey: kubernetes.io/hostname ensures the per-node count of these pods stays within 1 of each other, spreading them evenly across nodes. Soft (ScheduleAnyway) makes it a preference: if node capacity is scarce, it will still schedule pods (just less evenly) rather than leaving pods stuck in Pending. Hard (DoNotSchedule) could block scheduling entirely and cause a self-inflicted outage.
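A simplified model of the skew check, assuming skew for a candidate node is its pod count after placement minus the lowest count on any node. The real scheduler also filters eligible nodes and scores them; node names here are hypothetical.

```python
def skew_if_placed(pods_per_node, node):
    """Skew that would result from placing one more pod on `node`."""
    return pods_per_node[node] + 1 - min(pods_per_node.values())

def placement(pods_per_node, node, max_skew=1, when_unsatisfiable="ScheduleAnyway"):
    """Soft constraints only deprioritize a violating node; hard ones block it."""
    if skew_if_placed(pods_per_node, node) <= max_skew:
        return "allowed"
    if when_unsatisfiable == "ScheduleAnyway":
        return "allowed (deprioritized)"
    return "blocked"

if __name__ == "__main__":
    counts = {"node-a": 2, "node-b": 1, "node-c": 1}
    print(placement(counts, "node-a"))
    print(placement(counts, "node-a", when_unsatisfiable="DoNotSchedule"))
```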

H56. Why deliberately omit the readiness probe?

A readiness probe shares the exact same event loop that CPU-bound requests block. Under sustained load, the probe would time out on every busy pod → all pods would flip to NotReady at once → all would be removed from the ALB → causing wide-scale 503 errors, which is worse than slow responses. Without a readiness probe, pods stay in rotation (slow but serving). Revisit this once loop-blocking is fixed in code.

H57. The container port vs Service target port situation?

The application server binds to port 8080 and the Kubernetes Service targetPort was set to 8080 (numeric), so traffic already routed correctly; containerPort: 80 in the manifest was just an incorrect documentation declaration. Changing it to 8080 aligns the documentation with no risk to production routing.

H58. Why was deleting the separate HPA file safe?

The deployment pipeline only applies k8s.prod.yml (which has its own inline HPA definition); the old hpa-prod.yml was never referenced anywhere — confirmed by searching the repo and workflows. It was a confusing duplicate file, safe to remove.

I. Where the Config Lives & Deploy Mechanics

I59. How did we know it wasn't GitOps-managed despite the tool being installed?

The live Deployment carried a kubectl.kubernetes.io/last-applied-configuration annotation, the clear fingerprint of a standard kubectl apply. Objects managed by a GitOps tool (like Argo CD) carry distinct tracking annotations instead, and querying the tool showed it managed only specific preview apps.
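The annotation check can be sketched as a small classifier over an object's metadata (for example, parsed from kubectl get deployment -o json). It assumes the GitOps tool is Argo CD configured for annotation-based tracking, which writes argocd.argoproj.io/tracking-id; other tools and tracking modes leave different markers.

```python
KUBECTL_ANNOTATION = "kubectl.kubernetes.io/last-applied-configuration"
ARGOCD_TRACKING_ANNOTATION = "argocd.argoproj.io/tracking-id"  # assumes annotation tracking

def likely_manager(metadata):
    """Guess how an object was applied from its metadata annotations."""
    annotations = metadata.get("annotations") or {}
    if ARGOCD_TRACKING_ANNOTATION in annotations:
        return "argocd"
    if KUBECTL_ANNOTATION in annotations:
        return "kubectl apply"
    return "unknown"

if __name__ == "__main__":
    print(likely_manager({"annotations": {KUBECTL_ANNOTATION: "{...}"}}))
```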

I60. Which repo/file holds the Deployment vs the ingress, and why different mechanisms?

  • Deployment/HPA: org/core-service → deploy/k8s.prod.yml, shipped by the repo's tag-triggered GitHub Actions (kubectl apply).

  • Ingress: org/infra-k8s.../platform/router/main.tf, shipped via terraform apply.

They have different lifecycles — application config moves with the app's release tags, while cluster routing is foundational infrastructure managed in Terraform.

I61. Describe the production pipeline end to end.

  • Trigger: Pushing a git tag matching v*-prod.

  • Jobs: test (unit tests) → build (docker build + push to AWS ECR) → deploy-prod (sed string replacement on the tag, then kubectl apply executed on an in-VPC self-hosted runner) → release-prod (creates GitHub release).

I62. Role of the $TAG placeholder?

It acts as a placeholder string in the manifest's image declaration line; the deploy step uses sed to substitute it with the exact pushed tag (e.g., v1.14.4-prod) so the applied manifest pulls the freshly built image. The value is sourced from the pipeline context (GITHUB_REF).
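The substitution can be sketched in Python as an equivalent of the sed step (the pipeline itself uses sed). The image path in the example is hypothetical.

```python
def tag_from_ref(github_ref):
    """Extract the tag name from a GITHUB_REF such as refs/tags/v1.14.4-prod."""
    prefix = "refs/tags/"
    if not github_ref.startswith(prefix):
        raise ValueError(f"not a tag ref: {github_ref}")
    return github_ref[len(prefix):]

def render_manifest(manifest_text, github_ref):
    """Replace every $TAG placeholder with the pushed tag."""
    return manifest_text.replace("$TAG", tag_from_ref(github_ref))

if __name__ == "__main__":
    line = "image: registry.example.com/core-service:$TAG"  # hypothetical image path
    print(render_manifest(line, "refs/tags/v1.14.4-prod"))
```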

I63. Why does merging to main branch not deploy, but a tag does?

The automation workflows do not trigger deployments on main branch pushes; the CD workflow is configured to trigger only on a v*-prod tag. Deployment is therefore a deliberate, separate engineering act — explicitly cutting and pushing the tag.
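The trigger rule can be approximated with a glob match. This uses Python's fnmatch as a stand-in for the GitHub Actions tag filter, whose pattern syntax differs in details (for example, how * treats slashes), so it is an illustration rather than the exact matcher.

```python
from fnmatch import fnmatchcase

def should_deploy_prod(github_ref, pattern="v*-prod"):
    """True only for tag refs whose tag name matches the production pattern."""
    prefix = "refs/tags/"
    if not github_ref.startswith(prefix):
        return False  # branch pushes, including main, never deploy
    return fnmatchcase(github_ref[len(prefix):], pattern)

if __name__ == "__main__":
    print(should_deploy_prod("refs/heads/main"))
    print(should_deploy_prod("refs/tags/v1.14.4-prod"))
```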

I64. Why does the self-hosted runner matter for a private cluster?

The runner instance is situated inside the VPC, meaning it can natively reach the Kubernetes cluster's private API endpoint to run kubectl commands — even when an engineer's local machine is off the VPN.

J. Capacity Analysis

J65. What instances back the general-purpose tier, and how big?

t3.xlarge / t3a.xlarge (and similar compute variants in the node group) — 4 vCPU, ~3920m allocatable CPU, ~14.4Gi allocatable memory each.

J66. Cluster Autoscaler vs Karpenter — which is here?

Cluster Autoscaler scales fixed-shape Auto Scaling Groups (ASGs) based on pods being marked as unschedulable (Pending); Karpenter provisions right-sized nodes directly on demand from pending pod resource definitions without predefined groups. This cluster uses Cluster Autoscaler (the alternative tool was removed during a node group migration).

J67. The two node groups and the combined ceiling?

  • gen-purpose (Spot instances): min 2 / max 5.

  • gp-od (On-Demand fallback): min 0 / max 5.

  • Combined ceiling: 10 total nodes.

J68. Do the packing math — why is CPU the binding constraint?

(3920m allocatable − ~400m daemonset overhead) ÷ 1500m request ≈ 2 pods per node. Memory utilization at 2 pods would be 2 × 5Gi + ~1Gi system overhead ≈ 11Gi, well under the 14.4Gi allocatable. A 3rd pod would require more CPU than the node possesses (4500m > 3920m). Therefore, CPU exhaustion prevents further packing before memory runs out.

J69. At HPA max (10 pods), how many nodes, and does it fit?

10 pods ÷ 2 pods per node = 5 nodes. This is completely within the 10-node group ceiling, leaving 5 nodes available for other workloads in the tier. It fits safely.
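The packing and node-count arithmetic from J68 and J69 can be reproduced with a short sketch, using only the figures quoted above. The function name is illustrative.

```python
import math

def pods_that_fit(allocatable, overhead, per_pod_request):
    """Fractional number of pods one node can hold for a single resource."""
    return (allocatable - overhead) / per_pod_request

cpu_fit = pods_that_fit(3920, 400, 1500)  # millicores
mem_fit = pods_that_fit(14.4, 1.0, 5.0)   # Gi
pods_per_node = math.floor(min(cpu_fit, mem_fit))
nodes_at_hpa_max = math.ceil(10 / pods_per_node)

if __name__ == "__main__":
    print(round(cpu_fit, 2), round(mem_fit, 2), pods_per_node, nodes_at_hpa_max)
```

CPU leaves room for about 2.35 pods and memory for about 2.68, so CPU is the tighter of the two and both round down to 2 pods per node.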

J70. Why is scale-out slow relative to the bursts?

Spot capacity acquisition + node OS boot and cluster join + pulling the 726MB Docker image (~48s) + startup probe warmup (up to 300s) ≈ 3–6 minutes in total before a brand-new pod can serve traffic. That is a large share of a brief 5–8 minute traffic spike.

J71. Which part of the fix gives immediate relief?

The vertical workers=2 change: every existing pod immediately doubles its capacity for concurrent heavy requests. HPA scale-out is a slower second line of defense and may not even complete within a short burst window.

J72. Why bump the memory request to 5Gi if density is unchanged?

5Gi matches the actual steady-state memory utilization of the dual-worker process model, making the scheduler's reservation honest and avoiding over-packing followed by sudden runtime evictions. Density remains unchanged because the CPU request (1500m) already limits placement to 2 pods per node.

K. Staging-First Deploy & Verification

K73. Why deploy to staging first when the staging manifest file wasn't changed?

The workers=2 setting and its environment configuration are baked directly into the Docker image, which staging also runs. Deploying to staging therefore smoke-tests the new container version before it reaches production, even though the staging manifest itself is unchanged.

K74. What does staging validate, and what does it not validate?

  • Validates: Image build/push mechanics, container runtime startup with multiple workers, startup probe caching execution, and basic API routing correctness (/health and live requests).

  • Does NOT validate: Production resource sizing under load, HPA auto-scaling dynamics, or topology spread rules (as those are production-only manifest keys). Those configurations were verified via client-side dry-runs and theoretical capacity analysis.

K75. How is staging isolated, and what risk did we flag?

Staging runs as a distinct Deployment and Service (core-service-api-staging) inside the staging namespace on the same shared cluster. Risk: running workers=2 inside a tiny 0.5-core staging allocation means both workers are severely CPU-throttled. The startup warmup could slow down enough to exceed the 300s startup probe threshold, triggering a crash loop that is an artifact of staging constraints rather than an application defect.

K76. List the staging verification steps and pass criteria.

Pipeline checks pass green → Rollout completes successfully with zero container restarts → Verify process table via container exec (1 master + 2 workers) → HTTP GET /health returns 200 → Sample heavy request returns 200 with correct output payloads → Logs are clean. All must be green to pass.

K77. Why test both /health and a real request route?

The /health endpoint only verifies that the web server socket is open and listening (it is trivial and lightweight). Sending a real processing request confirms that the heavy data models have loaded correctly into active memory and that the application can execute computations under the new worker configuration.

K78. What does the warmup endpoint do, and why is it the likely failure point?

It builds and primes the heavy in-memory entity models for data extraction. It is the most intensive step during application boot and is gated by the startup probe; on a resource-starved container, this is the most likely location for a timeout failure to occur.

L. Production Rollout & Confirmation

L79. Why a patch bump (v1.14.4)?

It represents an operational configuration and infrastructure hotfix with absolutely no changes to the external API surface or core application behavior, following v1.14.3 — making it a patch version under semver guidelines.

L80. Two new pods went Pending — what happened next?

The larger 1500m CPU / 5Gi memory pods could not fit on the existing allocated nodes. A TriggeredScaleUp event was successfully logged by the Cluster Autoscaler, a new node was provisioned and joined the cluster, and the pending pods scheduled onto it — behaving exactly as predicted in the capacity analysis.

L81. Why did the ~6-minute rollout drop no traffic?

A RollingUpdate deployment replaces pods incrementally based on maxSurge and maxUnavailable definitions. Each new pod is required to completely pass its lengthy startup probe before an older pod is decommissioned out of the active routing pool, ensuring continuous handling capacity.

L82. Why only ~2.6Gi post-deploy, not ~5Gi?

Only the initial worker that processed the startup warmup request had fully populated its internal cache; the second worker process was still cold (having received no user traffic yet). Memory usage scales up toward ~5Gi as the second worker processes requests under load, which our sizing footprint anticipates.

L83. ~3s max requests persisted, yet the alarm stayed OK — why is that consistent with success?

The CloudWatch alarm tracks the mathematical average. Individual 3-second maximum latencies are expected due to the variable nature of heavy data workloads, but because 2 workers × 4 pods are now absorbing the volume, the aggregate average stays low (0.1s - 0.3s). Tail latency maxes can persist while keeping the alarm green (confirming the point in Q7).
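A small numeric illustration of why a 3s maximum and a green alarm can coexist. The latency sample is hypothetical: 95 fast requests and 5 slow ones in one period.

```python
from statistics import mean

# Hypothetical one-minute sample: mostly fast requests plus a few heavy ones.
latencies = [0.1] * 95 + [3.0] * 5

average = mean(latencies)
worst = max(latencies)
alarm_breaching = average > 0.8  # the alarm evaluates the average, not the max

if __name__ == "__main__":
    print(round(average, 3), worst, alarm_breaching)
```

The average stays well under the 0.8s threshold even though individual requests take 3s. A Maximum or p99 alarm would tell a different story for the same traffic.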

L84. What's the honest gap in the 45-minute monitoring evidence?

The monitoring window captured only moderate off-peak traffic patterns, not a worst-case peak morning burst. The fix is not yet fully proven against maximum production volume; observing a morning traffic peak stay entirely under the latency threshold is required to definitively close out the incident ticket.

L85. HPA stayed at 4 pods — why is that good, not a non-event?

The vertical optimization change alone provided enough per-pod concurrency to keep individual utilization comfortably below the scaling threshold under the observed load. The horizontal scaling capacity is a safety cushion held in reserve, not a failure of the fix.

M. Operational & Process Gotchas

M86. Why does sts get-caller-identity succeed while kubectl times out?

The Kubernetes cluster control plane API endpoint is configured as strictly private. kubectl requires a path through the internal VPN network to resolve and route traffic. The AWS STS command calls public AWS API endpoints, succeeding regardless of local network routing. Takeaway: AWS SSO success combined with a kubectl I/O timeout or context deadline error indicates a local network/VPN connectivity issue, not an authentication problem.

M87. Why did the deploy still work when local kubectl couldn't?

The CI/CD pipeline's deployment step executes directly on a self-hosted runner instance stationed natively inside the project VPC, giving it a direct internal network path to the private cluster endpoint independent of local engineer connectivity.

M88. Diagnose the two CI failures — were they ours?

  • security-check: The internal package scanner attempted to run an interactive prompt setup in an environment with no standard input (stdin), resulting in an EOFError crash (a regression in the testing utility version).

  • image-scan: The automated vulnerability scan flagged pre-existing CVE thresholds within upstream base dependencies.

Neither issue was introduced by our configuration change (the Dockerfile and application dependencies were completely untouched), and historical runs across unrelated branches confirmed identical global failures.

M89. BLOCKED but branch protection returned a 404 — resolve the contradiction.

The repository does not use legacy branch protection rules (hence the API returning 404), but instead implements modern repository compliance rulesets. Rulesets represent a distinct platform feature and must be queried via specific ruleset endpoints. The merge block was enforced by an active ruleset requirement.

M90. What actually gated the merge?

A ruleset requirement mandating exactly 1 peer-approved review. The automated status checks were not configured as blocking gates within the ruleset, which allowed the branch to be merged despite the red CI scanner flags once a manual review was submitted.

M91. Which actions needed explicit confirmation, and why?

Any operation that writes state changes to production infrastructure or mutates the main branch and shared tracking state (e.g., applying infrastructure changes, pushing release tags, merging to main, modifying tracking tickets). These actions are highly visible and difficult to reverse cleanly, necessitating a strict "confirm-before-acting" protocol.

M92. Commit-authorship convention here?

Commits must be authored under the engineer's verified corporate identity rather than an auto-generated local environment placeholder, and must not carry AI author attribution footers in either commit messages or pull request descriptions.

N. Synthesis & Transfer

The Real Mastery Test

N93. Paged cold — first five moves, in order?

  1. describe-alarms → Check current metric state, threshold definitions, and exact configurations.

  2. describe-alarm-history → Determine if the issue is a flapping pattern and correlate the start time against the cluster deployment log.

  3. Break down TargetResponseTime and RequestCount by Target Group → Identify the specific failing service and check if the degradation is load-correlated.

  4. kubectl get/describe pods and top → Check pod health, restart loops, resource tracking, and probe status.

  5. Analyze Prometheus per-pod CPU over time series → Evaluate if workload distribution is balanced and look for single-core saturation plateaus.

Each progressive step narrows the operational scope: Alarm → Service → Pod → Process.

N94. Name three plausible-but-wrong hypotheses and the one fact that kills each.

  1. Low-traffic statistical artifact: Contradicted by clear increases in total request volume coinciding exactly with the latency spikes.

  2. Load balancer round-robin routing imbalance: Disproven by historical Prometheus time series showing even CPU utilization across all pods, alongside an instance-mode infrastructure topology that prevents pod-level load steering.

  3. The recent application deployment: Refuted by historical alarm logs demonstrating that the latency flapping pattern was active hours prior to the deployment timestamp.

N95. If the CPU series had shown one pinned pod, what's the real fix in instance mode?

Applying a load balancer annotation wouldn't work because an ALB operating in instance mode cannot steer individual pod traffic behind a NodePort hop. The structural resolution would require changing the target mode to ip (paired with the least outstanding requests (LOR) routing algorithm) to give the ALB direct visibility into individual pods, or rectifying the balancing mechanics at the internal proxy layer. Most importantly, per-pod concurrency limits would need to be raised to prevent single long-running requests from starving concurrent traffic.

N96. When is the correct long-term fix an application change?

When the core performance bottleneck stems directly from the internal concurrency model of the runtime — such as running compute-intensive, synchronous operations inside a single asynchronous event loop thread. The robust architectural solution requires modifying the application code: transitioning to synchronous worker processes mapped to available cores, or offloading compute tasks onto background worker queues so the main request loop remains fully responsive. Infrastructure configuration tweaks merely act as mitigation.

N97. Generalize: what makes "more replicas behind round-robin" fail, and what helps?

  • The Failure Property: Significant variance in per-request computation times, coupled with a routing layer that distributes traffic based purely on request count without visibility into backend state, running on replicas with restricted internal concurrency. Adding replicas fails because the actual load (not request count) is severely unbalanced, leaving individual instances susceptible to head-of-line blocking.

  • What Helps: Implementing load-aware routing metrics (e.g., Least Connections / Least Outstanding Requests), increasing true parallel processing capability within each discrete replica, or refactoring the code to handle heavy calculations asynchronously outside the primary request path. You must provide capacity where the work actually serializes.
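The contrast between count-based and load-aware routing can be sketched with a tiny deterministic simulation. Each replica handles one request at a time, and the workload (one 10s request followed by four 1s requests) is hypothetical; the router model is a simplification, not the ALB's algorithm.

```python
def simulate(requests, n_replicas, policy):
    """Return per-request latency for serial replicas.

    requests: list of (arrival_time, service_time), sorted by arrival.
    policy: "round_robin" or "least_outstanding".
    """
    finish_times = [[] for _ in range(n_replicas)]  # finish time of each assigned request
    free_at = [0] * n_replicas
    latencies = []
    for i, (arrival, service) in enumerate(requests):
        if policy == "round_robin":
            replica = i % n_replicas
        else:
            # Count unfinished requests per replica at arrival time.
            outstanding = [sum(1 for f in finish_times[k] if f > arrival)
                           for k in range(n_replicas)]
            replica = outstanding.index(min(outstanding))
        start = max(arrival, free_at[replica])
        end = start + service
        free_at[replica] = end
        finish_times[replica].append(end)
        latencies.append(end - arrival)
    return latencies

if __name__ == "__main__":
    workload = [(0, 10), (1, 1), (2, 1), (3, 1), (4, 1)]
    print(simulate(workload, 2, "round_robin"))
    print(simulate(workload, 2, "least_outstanding"))
```

Round-robin keeps sending light requests to the replica stuck on the heavy one, so they queue behind it. Least-outstanding routing steers them to the idle replica, and only the heavy request itself stays slow.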

N98. If the morning peak still trips it — next two levers, and how to choose?

  1. Optimize Horizontal Scaling Responsiveness: Increase minimum running replicas ahead of time or adjust the auto-scaler to be more aggressive. Evaluate per-pod CPU limits, HPA state machine transitions, and raw request velocity during spikes to verify if additional pre-warmed instances can absorb the peak.

  2. Refactor Application Concurrency: Rewrite the request handling layer to run synchronous workers or execute compute tasks via isolated worker pools. Check detailed profiling data to confirm if event loop starvation continues to limit per-pod throughput.

Choose based on whether the exhaustion is caused by total cluster capacity constraints (→ scale out) or localized thread blocking (→ refactor application).
