These questions and answers were generated by Claude after it analyzed and fixed a CloudWatch alarm that had been flapping. The prompt I used:
I was following your analysis and resolution of this issue in order to acquire your knowledge. I would like to test my knowledge now. Can you compile a list of questions which cover every aspect of the issue and solution here? Don't be shy of creating a really long list of questions. If I am able to answer them all that means I am able to fix the issue on my own next time. Also prepare the answer key.
A. The alarm itself (CloudWatch / ALB fundamentals)
What does the metric TargetResponseTime actually measure, and from whose perspective (client → ALB → target)?
The alarm name was k8s-api-prod-core-7db0ccf2c3-target-response-time. What do the k8s- prefix and the hash portion tell you about how it was created?
The alarm config was: threshold 0.8, GreaterThanThreshold, period 60s, 2 evaluation periods, statistic Average. In plain English, what condition makes it fire?
Why does the alarm use the LoadBalancer dimension only, and not a TargetGroup dimension? What consequence did that have for our investigation?
What is "flapping," and why does this particular threshold/period combination make flapping likely for a bursty workload?
How do you pull an alarm's state-transition history, and what did 9 OK→ALARM cycles in 5 hours tell you?
The alarm fires on Average. Why is that distinction (vs Maximum/p99) absolutely central to both the diagnosis and the fix?
What is the "low-traffic statistical artifact" pattern, where a handful of slow requests inflate the average on a near-idle target, and what evidence did we use to rule it out here?
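To check your answers about the firing condition and the Average vs Maximum distinction, here is a minimal Python sketch of the evaluation logic. It is a simplification, not CloudWatch's actual implementation: it assumes every 60s period has at least one sample, that all evaluation periods must breach, and it ignores missing-data handling. The function name and the sample latencies are made up for illustration.

```python
from statistics import mean


def alarm_states(per_period_latencies, threshold=0.8, evaluation_periods=2):
    """Return the alarm state after each period.

    per_period_latencies: list of lists; each inner list holds the response
    times (seconds) seen in one 60s period. Assumption: no empty periods.
    The alarm fires only when the period Average is greater than the
    threshold for `evaluation_periods` consecutive periods.
    """
    breaches = []
    states = []
    for samples in per_period_latencies:
        breaches.append(mean(samples) > threshold)
        recent = breaches[-evaluation_periods:]
        firing = len(recent) == evaluation_periods and all(recent)
        states.append("ALARM" if firing else "OK")
    return states


# Hypothetical burst: two slow periods in a row, recovery, then one slow period.
burst = [[0.2, 0.3], [0.2, 5.0], [0.3, 4.0], [0.2, 0.3], [0.2, 3.0]]
print(alarm_states(burst))

# One 3s request among nine fast ones keeps the Average under 0.8.
mostly_fast = [[0.2] * 9 + [3.0]] * 3
print(alarm_states(mostly_fast))
```

The second case shows why a slow Maximum alone does not trip an Average-based alarm.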
B. Narrowing from ALB to one service
The ALB fronted six target groups. Name the technique we used to find which one was responsible, and the AWS CLI call behind it.
Why can a single ALB serve six different Kubernetes services? What AWS-LB-Controller concept ties them together (hint: group.name)?
Given the target group name k8s-default-dataservice-88e28b4b6c, how do you map it back to a Kubernetes Service and namespace?
During the burst, data-service showed 5.83s avg / 29.7s max while every other service was <0.5s. Why did that immediately exonerate the shared ALB/ingress as the cause?
C. First look at the workload
What does kubectl top pods show, and why is a single snapshot from it dangerous as evidence?
The first snapshot showed one pod at 997m and three near-idle. What two different explanations are consistent with that, and why can't a snapshot distinguish them?
What's the difference between a CPU request and a CPU limit in Kubernetes?
How did we check whether the pod was being CPU-throttled at its limit, and what file did we read inside the container? What did nr_throttled/throttled_usec tell us?
A deploy to v1.14.3-prod had happened ~15 min earlier. Why was it a red herring, and what evidence dated the flapping as pre-existing?
Distinguish the three probe types (liveness, readiness, startup). What does each one control?
The deployment had a startupProbe and livenessProbe but no readinessProbe. Operationally, what can't the system do without a readiness probe?
D. The load-balancing theory (and why it was wrong)
Explain why round-robin load balancing degrades when request durations are highly variable. Use the "request count vs. total work" framing.
What is the feedback-loop difference between round-robin and least-outstanding-requests (LOR)?
What older, well-known algorithm is LOR equivalent to in nginx/HAProxy terms, and what does AWS's own guidance say about when to use LOR vs round-robin?
"Head-of-line blocking" appeared twice in this incident at two different layers. Name both layers and how each one blocks.
Why is a single kubectl top snapshot insufficient to prove round-robin is causing imbalance, and what data would actually prove or refute it?
We initially called round-robin the "smoking gun," then withdrew it. What specifically made that conclusion wrong?
E. Target mode: instance vs IP (the topology that broke the theory)
What's the difference between an ALB target group in instance mode vs ip mode? How can you tell which one you have from the registered targets and the target-group port?
The target group registered 10 EC2 instances on port 31892. What does that tell you about the request path from ALB to pod?
What is a NodePort service, and what is externalTrafficPolicy: Cluster vs Local?
With instance mode + Cluster policy, describe the full path of a request from client to the backend process, including every hop and who load-balances at each hop.
Given that topology, explain precisely why switching the ALB to least_outstanding_requests would not fix per-pod imbalance.
What two changes together would enable load-aware per-pod routing, and why is that a much bigger change than a one-line annotation?
Why were ALB access logs not useful for confirming per-pod distribution in this setup?
F. Getting the proof (Prometheus / time series)
What is kube-prometheus-stack, and where did it live in the cluster?
Why is kubectl port-forward an acceptable read-only way to query Prometheus, and what does it actually do?
Write (conceptually) the PromQL that gives per-pod CPU usage over time. Why rate(container_cpu_usage_seconds_total[...]) rather than the raw counter?
The time series showed all four pods evenly at 0.5–0.95 cores during the burst. Why did that refute the imbalance theory in one stroke?
Each pod plateaued at ~0.9 cores despite a 1.5-core limit. What does that plateau strongly imply about the process model inside the pod?
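As a rough aid for the rate-vs-raw-counter question, the sketch below shows the core arithmetic of turning a cumulative CPU-seconds counter into "cores in use". It is only an approximation of what PromQL's rate() does: it assumes no counter resets inside the window and skips Prometheus's extrapolation. The function name and sample values are hypothetical.

```python
def cores_in_use(samples):
    """Average CPU cores used across a window of counter samples.

    samples: list of (timestamp_seconds, cumulative_cpu_seconds) pairs,
    oldest first. Assumption: the counter never resets within the window.
    The raw counter only ever grows; its slope is the actual usage.
    """
    (t_start, v_start), (t_end, v_end) = samples[0], samples[-1]
    if t_end <= t_start:
        raise ValueError("window must span a positive amount of time")
    return (v_end - v_start) / (t_end - t_start)


# Hypothetical pod: 54 CPU-seconds consumed in a 60s window.
window = [(0, 1000.0), (30, 1027.0), (60, 1054.0)]
print(round(cores_in_use(window), 2))  # 0.9
```

A counter value like 1054.0 says nothing by itself; only the slope over the window is comparable with a 1.5-core limit.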
G. The real root cause (app server / GIL / async)
What is an application server worker (e.g., Gunicorn), and what's the difference between the master process and a worker process?
What is a Global Interpreter Lock (GIL) or similar single-threaded runtime constraint, and why does it mean one worker process ≈ one core of CPU-bound throughput?
The config had workers = 1 and an asynchronous event-loop worker class specified. Explain what each line does.
What is the difference between a synchronous worker and an asynchronous worker in an application server? When does each shine?
Why is an async (event-loop) worker the wrong model for heavy, synchronous CPU-bound data processing? What does "blocking the event loop" mean concretely?
So with one async worker doing CPU-bound work, what is the per-pod concurrency for heavy requests — and how did that produce the fleet-wide ceiling of ~4 concurrent requests?
How did we confirm the worker count and the CPU quota from inside a running pod (what command, what does cpu.max = 150000 100000 mean, what does worker count = 2 mean)?
Why was the 1.5-core CPU limit effectively unusable given workers = 1?
Tie it together: explain the full causal chain from "traffic burst" to "alarm fires," in one paragraph, using the confirmed root cause.
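The cpu.max and worker-count questions come down to simple arithmetic, sketched below in Python. The cgroup v2 cpu.max format ("quota period" in microseconds, or "max" for unlimited) is standard; the one-core-per-worker model is an idealization that assumes each worker does CPU-bound work on a single thread. The helper names are made up for illustration.

```python
def cpu_limit_cores(cpu_max):
    """Parse a cgroup v2 cpu.max value such as '150000 100000'.

    Returns the CPU limit in cores, or None when the quota is 'max'
    (no limit set).
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def usable_cores(workers, limit_cores):
    """Idealized model: each CPU-bound worker can use at most one core."""
    return min(float(workers), limit_cores)


def fleet_heavy_concurrency(pods, workers_per_pod):
    """Heavy requests the fleet can process at once under the same model."""
    return pods * workers_per_pod


limit = cpu_limit_cores("150000 100000")
print(limit)                          # 1.5
print(usable_cores(1, limit))         # 1.0, so a third of the limit sits idle
print(fleet_heavy_concurrency(4, 1))  # 4
```

Under this model, the observed plateau near 0.9 cores per pod is what a single worker saturating one core looks like.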
H. Designing the fix
We considered vertical (more workers/CPU per pod), horizontal (more pods via HPA), and both. What's the trade-off, and why did "both" win?
Why couldn't we just change the worker class to synchronous (or offload to a process pool) as part of this immediate infrastructure fix? What kind of change would that be?
We set workers = 2. Why did the CPU limit have to go up to 2000m at the same time? What would happen if we'd set workers = 2 but left the limit at 1500m?
The pods used ~2.5Gi RSS each at one worker. Why did we expect ~5Gi with two workers, and why wouldn't worker-fork preloading save us here? (What did we learn about when the application memory caches are built?)
There's one import-time load we found (e.g., heavy model loading in a utility file). Why is that one shareable-via-fork but the bulk of the runtime memory is not?
The HPA was autoscaling/v1, target 80%, min 4 / max 7. We measured bursts peaking at ~75% of the 1200m request. Explain mechanically why the HPA never scaled.
targetCPUUtilizationPercentage is a percentage of what? Recompute: at the new 1500m request and a pod using ~0.9 cores, what utilization does the HPA see?
Why did we lower the target to 50% and raise max to 10, rather than just one of those?
What does a topologySpreadConstraints with maxSkew: 1, topologyKey: kubernetes.io/hostname, whenUnsatisfiable: ScheduleAnyway do, and why soft (ScheduleAnyway) rather than hard (DoNotSchedule)?
We deliberately did not add a readiness probe. Explain the failure mode that a naive /health readiness probe would cause in this specific app under heavy load. Why is "no readiness probe" temporarily safer than a bad one here?
What was the container port vs Service target port mismatch situation? Why was alignment low-risk, and why didn't it affect routing?
An orphaned HPA manifest file was deleted. Why was it safe to delete, and how did we confirm it was no longer active?
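For the HPA questions, the following Python sketch works through the numbers quoted above. It uses the replica formula documented for the Kubernetes HPA (ceil of current replicas times the ratio of current to target utilization) and the default 10% tolerance, and it assumes for illustration that a burst again pushes every pod to about 900m. The function names are hypothetical, and real HPA behavior also involves stabilization windows and pod readiness, which are ignored here.

```python
import math


def utilization_percent(usage_millicores, request_millicores):
    """CPU utilization as the HPA sees it: usage relative to the request."""
    return 100 * usage_millicores / request_millicores


def desired_replicas(current, utilization, target,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA decision for a single CPU utilization metric."""
    ratio = utilization / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # within tolerance: no scaling action
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Before: 900m used against a 1200m request, target 80%, min 4 / max 7.
before = utilization_percent(900, 1200)
print(before, desired_replicas(4, before, 80, 4, 7))   # 75.0 4

# After: same usage against a 1500m request, target 50%, min 4 / max 10.
after = utilization_percent(900, 1500)
print(after, desired_replicas(4, after, 50, 4, 10))    # 60.0 5
```

Raising the request alone would have lowered the reported utilization, which is one reason the target had to move with it.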
I. Where the config lives & deploy mechanics
How did we determine the workload was not managed by GitOps tool deployments (e.g., Argo CD), despite the tool being installed? What metadata annotation was the fingerprint?
Which repository and file holds the deployment/HPA, and which separate repository/file holds the ingress? Why do they deploy through different mechanisms?
Describe the deployment pipeline flow end to end. What event triggers it, and what are the key build/test/deploy jobs?
The deploy step does a string substitution (sed 's#$TAG#...#') then kubectl apply. What's the role of the $TAG placeholder, and where does the tag value come from?
Why does merging a PR to the main branch not deploy anything, while pushing a specific environment release tag does?
The pipeline runs on private self-hosted runners. Why does that matter for a private-endpoint cluster?
J. Capacity analysis
What instance types/sizes back the general-purpose compute tier, and how much allocatable CPU/memory does each have?
What is Cluster Autoscaler, and how does it differ from node lifecycle managers like Karpenter? Which one is active in this cluster?
There are two node groups feeding the tier (spot instances and on-demand instances). What are their min/max sizes, and what's the combined node ceiling?
Do the packing math: given ~3920m allocatable CPU and ~400m daemonset overhead, how many pods at 1500m request fit per node? Why is CPU, not memory, the binding constraint?
At HPA max (10 pods), how many nodes are needed, and is that within the ceiling? What's left for other tenants?
Why is horizontal scale-out slow relative to instant traffic bursts? List every contributor to a cold pod's total time-to-serve.
Given that scale-out lag, which part of our fix delivers immediate relief, and which acts as the slower "second line of defense"?
Why did we bump the memory request to 5Gi even though it doesn't change node packing density?
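The packing questions in this section can be checked with a few lines of Python. The sketch uses only the CPU figures quoted in the questions (about 3920m allocatable, about 400m daemonset overhead, 1500m request, HPA max of 10) and assumes the node is otherwise empty, which real nodes shared with other tenants are not. The function names are hypothetical.

```python
import math


def pods_per_node(allocatable_m, daemonset_overhead_m, pod_request_m):
    """How many pods fit on one node, counting CPU requests only."""
    return (allocatable_m - daemonset_overhead_m) // pod_request_m


def nodes_needed(pod_count, per_node):
    """Nodes required to place pod_count pods at per_node pods each."""
    return math.ceil(pod_count / per_node)


per_node = pods_per_node(3920, 400, 1500)
leftover_m = 3920 - 400 - per_node * 1500

print(per_node)                    # 2
print(leftover_m)                  # 520 (millicores left over per node)
print(nodes_needed(10, per_node))  # 5
```

Compare the node count against the combined node-group ceiling to answer whether HPA max fits.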
K. Staging-first deploy & verification
Why deploy to the staging environment first when the staging manifest file wasn't even changed by the PR?
Precisely what does the staging deploy validate, and what does it not validate?
Staging runs on the same cluster as production. How is it isolated, and what was the risk we flagged about a tiny 0.5-core staging pod suddenly running workers=2?
List the verification steps we ran on staging, and the pass criteria for each.
Why did we test both GET /health and a heavy POST data processing route, rather than just the health check?
The startup probe warms up by hitting an internal pre-cache endpoint. What does that endpoint do, and why is it the most likely place for a deploy to fail (especially on a CPU-starved staging pod)?
L. Production rollout & confirmation
What version did we tag, and why a patch bump (v1.14.4) rather than a minor/major version shift?
During the production rollout, two new pods went Pending, one with no available node. What happened next, and which log event confirmed the cluster autoscaler reacting?
Why did the rollout take ~6 minutes and stay safe (no dropped traffic) the whole time? What deployment configuration controls the surge/unavailable behavior?
Post-deploy, the new pods showed only ~2.6Gi memory usage, not ~5Gi. Why — and why is that expected rather than a contradiction of our sizing?
After deploy, individual requests still hit ~3s max, but the alarm stayed OK. Why is that consistent with a successful fix? (Connect it back to question 7, the Average vs Maximum/p99 question in section A.)
We monitored for 45 minutes and saw no flapping, yet we kept the incident ticket open. What's the honest gap in that evidence, and what would definitively close it?
The HPA stayed at 4 pods during the entire monitoring window. Why is that a good sign rather than a sign the HPA fix did nothing?
M. Operational / process & gotchas
The cluster API is private-only. What's the practical consequence for local engineers using kubectl, and why does aws sts get-caller-identity succeed while kubectl times out? How do you distinguish an auth problem from a network/VPN problem?
Why did the CI/CD deployment job still work even when our local machine's kubectl couldn't reach the cluster?
The standard code security checks failed on automated image-scanning and vulnerability gates. Diagnose how to isolate the root cause. Were they caused by our configuration change? How do we prove that?
The repository management system showed a BLOCKED merge status, but branch protection rules returned a 404. What's the resolution of that apparent contradiction (e.g., legacy branch protection vs. modern repository rulesets)?
What was the only thing actually gating the code merge, and why were the red security check flags irrelevant to it?
Which actions in this whole deployment flow required explicit manual operator confirmation before execution, and why those specifically?
What's the commit-authorship convention in this engineering environment, and what must never appear in a commit message or PR description?
N. Synthesis & transfer (test of true mastery)
If you were paged for this exact alarm tomorrow with zero prior context, list the first five commands/queries you'd run, in order, and what each one would tell you.
Name three plausible-but-wrong hypotheses for a flapping TargetResponseTime alarm, and the single piece of evidence that kills each.
Suppose the per-pod CPU time series had shown one pod pinned at 100% and three idle (the load imbalance pattern we originally expected). Given the instance-mode topology, what would the real fix have been then, and why is it different from the LOR annotation?
The fix here was vertical + HPA scaling. Under what circumstances would the correct long-term fix instead be an application architecture change, and what would that change look like?
Generalize the core lesson: what specific property of an application workload makes "adding more replicas behind a standard round-robin/random balancer" fail to resolve response time spikes, and what is the class of fixes that does help?
If the morning peak traffic burst still trips the alarm after this infrastructure fix, what are your next two remediation levers (in order), and what data would you collect to choose between them?







