Friday, 26 June 2026

DevOps Interview Questions - k8s ALB alarm

Post-incident review / technical interview questions.

These questions and answers were generated by Claude, upon it analyzed and fixed one CloudWatch alarm which have been flipping. Prompt I used:

I was following your analysis and resolution of this issue in order to acquire your knowledge. I would like to test my knowledge now. Can you compile a list of questions which cover every aspect of the issue and solution here? Don't be shy of creating a really long list of questions. If I am able to answer them all that means I am able to fix the issue on my own next time. Also prepare the answer key.

A. The alarm itself (CloudWatch / ALB fundamentals)

What does the metric TargetResponseTime actually measure, and from whose perspective (client → ALB → target)?
The alarm name was k8s-api-prod-core-7db0ccf2c3-target-response-time. What do the k8s- prefix and the hash portion tell you about how it was created?
The alarm config was: threshold 0.8, GreaterThanThreshold, period 60s, 2 evaluation periods, statistic Average. In plain English, what condition makes it fire?
Why does the alarm use the LoadBalancer dimension only, and not a TargetGroup dimension? What consequence did that have for our investigation?
What is "flapping," and why does this particular threshold/period combination make flapping likely for a bursty workload?
How do you pull an alarm's state-transition history, and what did 9 OK→ALARM cycles in 5 hours tell you?
The alarm fires on Average. Why is that distinction (vs Maximum/p99) absolutely central to both the diagnosis and the fix?
What is the "low-traffic statistical artifact" pattern, where a handful of slow requests inflate the average on a near-idle target — and what evidence did we use to rule it out here?

B. Narrowing from ALB to one service

The ALB fronted six target groups. Name the technique we used to find which one was responsible, and the AWS CLI call behind it.
Why can a single ALB serve six different Kubernetes services? What AWS-LB-Controller concept ties them together (hint: group.name)?
Given the target group name k8s-default-dataservice-88e28b4b6c, how do you map it back to a Kubernetes Service and namespace?
During the burst, data-service showed 5.83s avg / 29.7s max while every other service was <0.5s. Why did that immediately exonerate the shared ALB/ingress as the cause?

C. First look at the workload

What does kubectl top pods show, and why is a single snapshot from it dangerous as evidence?
The first snapshot showed one pod at 997m and three near-idle. What two different explanations are consistent with that, and why can't a snapshot distinguish them?
What's the difference between a CPU request and a CPU limit in Kubernetes?
How did we check whether the pod was being CPU-throttled at its limit, and what file did we read inside the container? What did nr_throttled / throttled_usec tell us?
A deploy to v1.14.3-prod had happened ~15 min earlier. Why was it a red herring, and what evidence dated the flapping as pre-existing?
Distinguish the three probe types (liveness, readiness, startup). What does each one control?
The deployment had a startupProbe and livenessProbe but no readinessProbe. Operationally, what can't the system do without a readiness probe?

D. The load-balancing theory (and why it was wrong)

Explain why round-robin load balancing degrades when request durations are highly variable. Use the "request count vs. total work" framing.
What is the feedback-loop difference between round-robin and least-outstanding-requests (LOR)?
What older, well-known algorithm is LOR equivalent to in nginx/HAProxy terms, and what does AWS's own guidance say about when to use LOR vs round-robin?
"Head-of-line blocking" appeared twice in this incident at two different layers. Name both layers and how each one blocks.
Why is a single kubectl top snapshot insufficient to prove round-robin is causing imbalance, and what data would actually prove or refute it?
We initially called round-robin the "smoking gun," then withdrew it. What specifically made that conclusion wrong?

E. Target mode: instance vs IP (the topology that broke the theory)

What's the difference between an ALB target group in instance mode vs ip mode? How can you tell which one you have from the registered targets and the target-group port?
The target group registered 10 EC2 instances on port 31892. What does that tell you about the request path from ALB to pod?
What is a NodePort service, and what is externalTrafficPolicy: Cluster vs Local?
With instance mode + Cluster policy, describe the full path of a request from client to the backend process, including every hop and who load-balances at each hop.
Given that topology, explain precisely why switching the ALB to least_outstanding_requests would not fix per-pod imbalance.
What two changes together would enable load-aware per-pod routing, and why is that a much bigger change than a one-line annotation?
Why were ALB access logs not useful for confirming per-pod distribution in this setup?

F. Getting the proof (Prometheus / time series)

What is kube-prometheus-stack, and where did it live in the cluster?
Why is kubectl port-forward an acceptable read-only way to query Prometheus, and what does it actually do?
Write (conceptually) the PromQL that gives per-pod CPU usage over time. Why rate(container_cpu_usage_seconds_total[...]) rather than the raw counter?
The time series showed all four pods evenly at 0.5–0.95 cores during the burst. Why did that refute the imbalance theory in one stroke?
Each pod plateaued at ~0.9 cores despite a 1.5-core limit. What does that plateau strongly imply about the process model inside the pod?

G. The real root cause (app server / GIL / async)

What is an application server worker (e.g., Gunicorn), and what's the difference between the master process and a worker process?
What is a Global Interpreter Lock (GIL) or similar single-threaded runtime constraint, and why does it mean one worker process ≈ one core of CPU-bound throughput?
The config had workers = 1 and an asynchronous event-loop worker class specified. Explain what each line does.
What is the difference between a synchronous worker and an asynchronous worker in an application server? When does each shine?
Why is an async (event-loop) worker the wrong model for heavy, synchronous CPU-bound data processing? What does "blocking the event loop" mean concretely?
So with one async worker doing CPU-bound work, what is the per-pod concurrency for heavy requests — and how did that produce the fleet-wide ceiling of ~4 concurrent requests?
How did we confirm the worker count and the CPU quota from inside a running pod (what command, what does cpu.max = 150000 100000 mean, what does worker count = 2 mean)?
Why was the 1.5-core CPU limit effectively unusable given workers = 1?
Tie it together: explain the full causal chain from "traffic burst" to "alarm fires," in one paragraph, using the confirmed root cause.

H. Designing the fix

We considered vertical (more workers/CPU per pod), horizontal (more pods via HPA), and both. What's the trade-off, and why did "both" win?
Why couldn't we just change the worker class to synchronous (or offload to a process pool) as part of this immediate infrastructure fix? What kind of change would that be?
We set workers = 2. Why did the CPU limit have to go up to 2000m at the same time? What would happen if we'd set workers = 2 but left the limit at 1500m?
The pods used ~2.5Gi RSS each at one worker. Why did we expect ~5Gi with two workers, and why wouldn't worker-fork preloading save us here? (What did we learn about when the application memory caches are built?)
There's one import-time load we found (e.g., heavy model loading in a utility file). Why is that one shareable-via-fork but the bulk of the runtime memory is not?
The HPA was autoscaling/v1, target 80%, min 4 / max 7. We measured bursts peaking at ~75% of the 1200m request. Explain mechanically why the HPA never scaled.
targetCPUUtilizationPercentage is a percentage of what? Recompute: at the new 1500m request and a pod using ~0.9 cores, what utilization does the HPA see?
Why did we lower the target to 50% and raise max to 10, rather than just one of those?
What does a topologySpreadConstraints with maxSkew: 1, topologyKey: kubernetes.io/hostname, whenUnsatisfiable: ScheduleAnyway do — and why soft (ScheduleAnyway) rather than hard (DoNotSchedule)?
We deliberately did not add a readiness probe. Explain the failure mode that a naive /health readiness probe would cause in this specific app under heavy load. Why is "no readiness probe" temporarily safer than a bad one here?
What was the container port vs Service target port mismatch situation? Why was alignment low-risk, and why didn't it affect routing?
An orphaned HPA manifest file was deleted. Why was it safe to delete, and how did we confirm it was no longer active?

I. Where the config lives & deploy mechanics

How did we determine the workload was not managed by GitOps tool deployments (e.g., Argo CD), despite the tool being installed? What metadata annotation was the fingerprint?
Which repository and file holds the deployment/HPA, and which separate repository/file holds the ingress? Why do they deploy through different mechanisms?
Describe the deployment pipeline flow end to end. What event triggers it, and what are the key build/test/deploy jobs?
The deploy step does a string substitution (sed 's#$TAG#...#') then kubectl apply. What's the role of the $TAG placeholder, and where does the tag value come from?
Why does merging a PR to the main branch not deploy anything, while pushing a specific environment release tag does?
The pipeline runs on private self-hosted runners. Why does that matter for a private-endpoint cluster?

J. Capacity analysis

What instance types/sizes back the general-purpose compute tier, and how much allocatable CPU/memory does each have?
What is Cluster Autoscaler, and how does it differ from node lifecycle managers like Karpenter? Which one is active in this cluster?
There are two node groups feeding the tier (spot instances and on-demand instances). What are their min/max sizes, and what's the combined node ceiling?
Do the packing math: given ~3920m allocatable CPU and ~400m daemonset overhead, how many pods at 1500m request fit per node? Why is CPU, not memory, the binding constraint?
At HPA max (10 pods), how many nodes are needed, and is that within the ceiling? What's left for other tenants?
Why is horizontal scale-out slow relative to instant traffic bursts? List every contributor to a cold pod's total time-to-serve.
Given that scale-out lag, which part of our fix delivers immediate relief, and which acts as the slower "second line of defense"?
Why did we bump the memory request to 5Gi even though it doesn't change node packing density?

K. Staging-first deploy & verification

Why deploy to the staging environment first when the staging manifest file wasn't even changed by the PR?
Precisely what does the staging deploy validate, and what does it not validate?
Staging runs on the same cluster as production. How is it isolated, and what was the risk we flagged about a tiny 0.5-core staging pod suddenly running workers=2?
List the verification steps we ran on staging, and the pass criteria for each.
Why did we test both GET /health and a heavy POST data processing route, rather than just the health check?
The startup probe warms up by hitting an internal pre-cache endpoint. What does that endpoint do, and why is it the most likely place for a deploy to fail (especially on a CPU-starved staging pod)?

L. Production rollout & confirmation

What version did we tag, and why a patch bump (v1.14.4) rather than a minor/major version shift?
During the production rollout, two new pods went Pending, one with no available node. What happened next, and which log event confirmed the cluster autoscaler reacting?
Why did the rollout take ~6 minutes and stay safe (no dropped traffic) the whole time? What deployment configuration controls the surge/unavailable behavior?
Post-deploy, the new pods showed only ~2.6Gi memory usage, not ~5Gi. Why — and why is that expected rather than a contradiction of our sizing?
After deploy, individual requests still hit ~3s max, but the alarm stayed OK. Why is that consistent with a successful fix? (Connect it back to question 7.)
We monitored for 45 minutes and saw no flapping, yet we kept the incident ticket open. What's the honest gap in that evidence, and what would definitively close it?
The HPA stayed at 4 pods during the entire monitoring window. Why is that a good sign rather than a sign the HPA fix did nothing?

M. Operational / process & gotchas

The cluster API is private-only. What's the practical consequence for local engineers using kubectl, and why does aws sts get-caller-identity succeed while kubectl times out? How do you distinguish an auth problem from a network/VPN problem?
Why did the CI/CD deployment job still work even when our local machine's kubectl couldn't reach the cluster?
The standard code security checks failed on automated image-scanning and vulnerability gates. Diagnose how to isolate the root cause. Were they caused by our configuration change? How do we prove that?
The repository management system showed a BLOCKED merge status, but branch protection rules returned a 404. What's the resolution of that apparent contradiction (e.g., legacy branch protection vs. modern repository rulesets)?
What was the only thing actually gating the code merge, and why were the red security check flags irrelevant to it?
Which actions in this whole deployment flow required explicit manual operator confirmation before execution, and why those specifically?
What's the commit-authorship convention in this engineering environment, and what must never appear in a commit message or PR description?

N. Synthesis & transfer (test of true mastery)

If you were paged for this exact alarm tomorrow with zero prior context, list the first five commands/queries you'd run, in order, and what each one would tell you.
Name three plausible-but-wrong hypotheses for a flapping TargetResponseTime alarm, and the single piece of evidence that kills each.
Suppose the per-pod CPU time series had shown one pod pinned at 100% and three idle (the load imbalance pattern we originally expected). Given the instance-mode topology, what would the real fix have been then — and why is it different from the LOR annotation?
The fix here was vertical + HPA scaling. Under what circumstances would the correct long-term fix instead be an application architecture change, and what would that change look like?
Generalize the core lesson: what specific property of an application workload makes "adding more replicas behind a standard round-robin/random balancer" fail to resolve response time spikes, and what is the class of fixes that does help?
If the morning peak traffic burst still trips the alarm after this infrastructure fix, what are your next two remediation levers (in order), and what data would you collect to choose between them?

Thursday, 25 June 2026

Elastic Fleet

Elastic Agents do not strictly require Fleet and Fleet Server. You can deploy them in standalone mode, which allows you to manually configure and manage them using local configuration files.

However, running Elastic Agents in Fleet-managed mode with a Fleet Server is the recommended best practice for most enterprise environments.

Here is how the two approaches compare:

1. Fleet-Managed Mode (Recommended)

In this setup, you use the Fleet UI in Kibana to centrally manage agent policies, roll out upgrades, and apply integrations.

How it works: Agents connect to a Fleet Server (which is just a specialized Elastic Agent process), which then receives policies from Elasticsearch and pushes them to your endpoints.
Best for: Large-scale deployments, continuous monitoring, and Elastic Security/Endpoint integrations.
Action: To set this up, refer to the official Elastic Agent Installation Guide.

2. Standalone Mode

In this setup, you manually install the agent and define its inputs, outputs, and integrations directly in a local YAML configuration file.

How it works: The agent connects directly to outputs like Elasticsearch or Logstash without a Fleet Server intermediary.
Limitations: Central management, automated upgrades, and certain advanced Endpoint Security features are disabled.
Best for: Edge cases, highly air-gapped networks, or evaluating specific integrations.
Action: To configure this, use the steps outlined in the Standalone Elastic Agent Tutorial.

Example of how an Elastic Fleet can actually be wired

Components & where they live

Everything is in cluster company-prod-elastic-eks (EKS, us-east-2, now on v1.36.2), namespace elastic-system, across 5 nodes in 3 AZs:

Node AZ Node Group Runs

==== == ========== ====

ip-10-99-44-1 2a es MNG (m7g.2xlarge) agent

ip-10-99-55-34 2b default MNG (m5.large) agent + Fleet Server pod

ip-10-99-55-89 2b es MNG (m7g.2xlarge) agent

ip-10-99-66-51 2c es MNG (m7g.2xlarge) agent

ip-10-99-66-78 2c default MNG (m5.large) agent

Fleet Server — Deployment fleet-server-prod, replicas: 1 (single pod, currently on the 2b node). Listens on :8220 HTTPS. Exposed by a NodePort service fleet-server-prod-agent-http (8220 --> 32202 on every node).
Elastic Agents — DaemonSet agent-prod-eck-agent, one pod per node (5 total), spread across all 3 AZs. hostNetwork: true, mode: fleet, FLEET_INSECURE=true. They collect node/pod logs + metrics (hostPath mounts of /var/log/...).
ALB k8s-elastics-eckfleet-a1b2c3d4e5 — internal (not internet-facing), spanning subnets in 2a / 2b / 2c. Created by the AWS LB controller from Ingress eck-fleet-server-prod-ingress. Targets = all 5 nodes' NodePort 32202 (instance mode), backend-protocol: HTTPS, idle_timeout: 300s.
DNS — eck-fleet-server.internal-domain.local (Route53 via external-dns) --> the internal ALB. Elasticsearch is a separate endpoint, elasticsearch.internal-domain.local:443.

Who listens, who initiates

The key thing: every connection is initiated by the agent (outbound). Nothing is ever pushed to the agents — that's why this works with agents behind hostNetwork and no inbound rules of their own.

┌───────────────────────────────────────────────┐

│ cluster company-prod-elastic-eks / elastic-sys│

┌──────────┐ │ │

│ Agent │ check-in │ internal ALB Fleet Server (1 pod) │

│ DaemonSet│─────────►│ eckfleet-a1b2... ┌──────► :8220 (HTTPS) │

│ (5 pods, │ :443 │ :443 TLS ─────────┘ NodePort 32202→8220 │

│ 3 AZs) │ HTTPS │ (internal cert) kube-proxy → the 1 pod │

└────┬─────┘ │ ▲ re-encrypt HTTPS to a node's :32202 │

│ │ │ │ │

│ ship data └───────┼─────────────────────┼─────────────────┘

│ (logs/metrics) │ DNS │ writes .fleet-* / agent

▼ eck-fleet-server. ▼ metadata, reads policy

elasticsearch. internal-domain.local Elasticsearch + Kibana (Fleet app)

internal-domain.local:443 ◄──────────────────────

(NOT via the fleet ALB)

Traffic Flows: Breakdown

1. Agent --> Fleet Server (Control Plane: Enrollment & Policy Check-in)

The agent acts as the client, and Fleet Server listens on :8220. The agent dials https://eck-fleet-server.internal-domain.local:443:

Traffic Path:

agent --> DNS --> internal ALB :443 (TLS termination, internal cert) --> re-encrypt --> a node's NodePort :32202 --> kube-proxy --> the single Fleet Server pod :8220

Cross-AZ Routing:

Because the agent always targets the ALB DNS name (not the pod directly), an agent can land on any node's NodePort. It is then forwarded by kube-proxy to whichever node is hosting the single Fleet Server pod—making cross-AZ hops standard behavior.

The Long Poll Mechanism: The check-in is a long poll. The agent opens a connection, and Fleet Server holds it open for up to ~5 minutes until a policy change occurs or the poll times out.

The Latency "False Positive": This held-open duration is exactly what the ALB records as a TargetResponseTime of ~300s, which triggered the false-positive alert. The ALB's idle_timeout is intentionally set to 300s to support these long polls. Conversely, the health check (/health, expects 404, 300s interval) is a separate lightweight probe, explaining why targets report as perfectly healthy while "latency" metrics look highly skewed.

2. Agent --> Elasticsearch (Data Plane: Telemetry Shipping)

This data path does not traverse the Fleet ALB. The Fleet output configuration points agents directly at https://elasticsearch.internal-domain.local:443.

Consequently, the eckfleet ALB only handles control-plane check-in traffic. This explains why its request volume is incredibly low (~3–5 requests/min) while individual request durations hover around 300s. They are two entirely separate paths utilizing two different load balancers.

3. Fleet Server --> Elasticsearch + Kibana

Fleet Server (via elasticsearchRefs / kibanaRef) acts as the client here. It writes agent and policy metadata directly into the .fleet-* indices and coordinates with the Kibana Fleet application, where the eck-fleet-server and eck-agent policies are managed.

One-Line Summary

Agents (DaemonSet, 5 pods across 2a/2b/2c) long-poll the single Fleet Server pod for policy updates via the internal ALB on :443 --> NodePort 32202 --> :8220, but ship their actual telemetry data directly to Elasticsearch, completely bypassing that ALB. The ALB only processes the slow, held-open control traffic—which is the root cause of the alarm.

Why does fleet server holds connection open for so long? Why it does not reply immediately?

It's a deliberate design choice called long polling, and it's the opposite of a bug. Here's the reasoning.

The problem Fleet is solving

Fleet Server's job is to push policy changes out to agents promptly — a new integration, a changed log path, an output credential rotation. Agents need to find out "has my policy changed?" with low latency.

There are three ways an agent could learn about changes:

1. Short polling — agent asks "anything new?" every N seconds, server replies instantly "no", agent sleeps, repeats.

2. Server push — server opens a connection to each agent when something changes.

3. Long polling — agent asks "anything new?", and the server holds the request open until either something actually changes or a timeout fires.

Fleet uses #3, and the connection sitting open for ~5 minutes is that hold.

Why not reply immediately (short polling)?

If Fleet answered every check-in instantly with "nothing changed", then to get fast reaction to a policy change, agents would have to poll very frequently — say every few seconds. With your 5 agents that's tolerable, but Fleet is built to manage thousands to tens of thousands of agents from one server. At that scale:

- Frequent short polls = a constant storm of requests, almost all of which return "no change". Huge wasted CPU/network on both ends.

- To keep reaction time low you'd poll more often, which makes the storm worse. To cut the storm you'd poll less often, which makes policy changes take longer to land. There's no good setting.

Long polling breaks that trade-off: the agent gets a near-instant reaction to a real change (the server responds the moment policy updates) and there's almost no idle chatter (one held-open connection per agent instead of hundreds of empty round-trips).

Why not server push (#2)?

Pushing would mean the server initiating connections inward to every agent. Agents are all over the place — behind NAT, firewalls, in private subnets, on laptops, on hostNetwork pods like yours. The server usually can't reach them, and you'd need inbound rules everywhere. Long polling flips it: the agent always dials out to the server, the connection is already established and held open, and the server pushes the change down that existing agent-initiated connection the instant it happens. You get push-like latency with poll-like (outbound-only) connectivity. That's exactly why your agents work behind hostNetwork with no inbound exposure.

So what's actually happening in those ~300s

A check-in is essentially: "Here's my current state; tell me the moment my policy differs." The server parks that request. Two things can end it:

- A policy change occurs → server responds immediately with the new policy (could be 2 seconds in).

- Nothing changes → the server lets the request time out at its poll ceiling (~5 min), responds "no change", and the agent immediately opens a fresh one.

Since your policies rarely change, almost every check-in runs the full clock and returns at ~300s. That's the held connection the ALB measures as TargetResponseTime.

Why this specifically fools the ALB

From the ALB's point of view, "request received → response sent" took 300 seconds, so it reports TargetResponseTime ≈ 300s. The ALB can't tell the difference between "the backend was slow for 300s" (bad) and "the backend intentionally held an idle long-poll for 300s" (normal). That ambiguity is the whole reason your generic 0.8s threshold misfires — and why the right fix (DOP-833) is to exempt this endpoint rather than treat it as latency. It's also why the Ingress sets idle_timeout.timeout_seconds=300: the ALB has to be told to tolerate the held connection, otherwise it would sever the long poll before the agent's poll cycle completes.

In short: Fleet holds the connection open so it can deliver policy changes near-instantly without either hammering the server with empty polls or needing to reach inward to firewalled agents. The ~300s is just an idle long-poll waiting for a change that usually never comes during that window — efficient by design, and only a problem for a latency metric that doesn't know to expect it.

----

AWS EC2: Application Load Balancer

An Application Load Balancer (ALB) is a fully managed AWS service that automatically distributes incoming HTTP and HTTPS traffic across multiple backend targets.

It operates at the Application Layer (Layer 7) of the Open Systems Interconnection (OSI) model.

Key Features:

Content-Based Routing: Routes traffic based on URL paths (/api vs /images) or hostnames (://example.com).
Container Support: Integrates directly with Amazon ECS and EKS using dynamic port mapping.
Advanced Protocols: Native support for modern protocols like HTTP/2, gRPC, and WebSockets.
Security Integration: Features built-in HTTPS/TLS termination and integrates directly with AWS WAF for web security.

How Components Work Together

Listener: Evaluates connection requests from clients using protocols and ports you configure.
Rules: Determines how the load balancer routes requests to its registered targets.
Target Group: Groups backend resources (like EC2 instances, containers, or IP addresses) that receive the traffic

How ALB health checks keep applications online?

An Application Load Balancer (ALB) keeps your application online by continuously monitoring the health of your backend targets and dynamically redirecting traffic away from failing nodes.

1. Automatic Traffic Redirection

The ALB sends periodic ping requests (health checks) to every registered target. If a target fails to respond correctly, the ALB marks it as unhealthy and immediately stops sending user traffic to it. Traffic is rerouted to the remaining healthy nodes with zero downtime for the user.

2. Auto Scaling Integration

When paired with an Auto Scaling Group (ASG), ALB health checks can trigger the automatic replacement of broken instances.

The Problem: An EC2 instance might be running (healthy at the hardware level), but the web server inside it has crashed (unhealthy at the application level).
The Solution: The ALB tells the ASG that the instance is failing application health checks. The ASG terminates that specific broken instance and launches a fresh, working one.

3. Graceful Recovery

When an unhealthy instance recovers, or when a new instance is launched, the ALB does not send traffic to it immediately. It enters an initial state and undergoes consecutive successful health checks. Only when it passes the threshold does the ALB safely introduce it back into the traffic rotation.

How to Configure an ALB Health Check

You configure health checks inside the Target Group settings using these parameters:

Parameter What it does Recommended Setting

======== ========= =================

Health Check Path The URL endpoint the ALB hits

(e.g., /health or /index.html). /health

Healthy Threshold Consecutive successes needed to mark a target as healthy. 3

Unhealthy Threshold Consecutive failures needed to mark a target as unhealthy. 2

Timeout How long the ALB waits for a response before failing. 5 seconds

Interval The time between individual health check pings. 30 seconds

Success Codes The HTTP status codes that prove the app is working. 200 (or 200-399)

When you configure an ALB, you do not select an Availability Zone (AZ) directly; instead, you must select at least two subnets in different Availability Zones to ensure high availability.

How it works: AWS places a load balancer node in each of the specified subnets.
The AZ link: Because each subnet belongs to exactly one AZ, this fundamentally binds the ALB's nodes to those corresponding Availability Zones.
Custom routing: You can modify the subnets via the Update Availability Zones settings in the EC2 Console at any time.

Public ALB

Binding an ALB to public subnets makes it a public (internet-facing) load balancer.

When you create an internet-facing ALB, AWS requires you to select public subnets so the ALB nodes can receive a public IP address and route traffic from the internet.

Key Characteristics:

Public DNS: The ALB receives a public DNS name that resolves to public IP addresses.
Internet Gateway: The selected public subnets must have a route to an Internet Gateway (IGW) in their route tables.
Target Routing: Even though the ALB is public, it can still route traffic to EC2 instances living in private subnets

An internet-facing ALB routes traffic directly to the individual backend targets (such as EC2 instances or IP addresses), not to the private subnets themselves.

How Routing Works

Target Group Config: You configure the ALB to route traffic to a Target Group.
Direct Node Communication: The ALB nodes in the public subnets communicate directly with the private IP addresses of your backend nodes.
Cross-Subnet Traffic: AWS handles this routing internally via the VPC router, allowing the public ALB to securely traverse into private subnets.

Configuration Checklist

VPC: Both the public subnets (where the ALB lives) and the private subnets (where the nodes live) must be in the same VPC.
Security Groups: The private instances must have a security group that allows inbound traffic from the ALB's security group

Private ALB

An internal (private) ALB routes traffic in the exact same way as a public ALB, but it is only accessible within your VPC or connected networks.

It routes traffic directly to individual backend targets, not to subnets.

Key Characteristics

Private Subnets: You deploy the ALB nodes into private subnets.
Private DNS: The ALB receives a public DNS name, but it resolves exclusively to private IP addresses.
No Internet Access: It cannot receive any traffic from the public internet because it lacks a public IP.

Common Use Cases

Internal Microservices: Routing traffic from a public-facing web tier to a private backend API tier.
Hybrid Networks: Routing traffic coming from an on-premises data centre via AWS Direct Connect or a VPN

Setting Up ALB in AWS Console

AWS Elastic Load Balancing shows basic building blocks of AWS Load Balancer which include listeners and target groups.

To create Application Load Balancer go to EC2 >> Load balancers >> Create Load balancer >> Select load balancer type (click on Create under Application Load Balancer)

Here we can set:

Basic configuration

Name
Scheme (cannot be changed after the load balancer is created)

Internet-facing. An internet-facing load balancer routes requests from clients over the internet to targets. Requires a public subnet.
Internal. An internal load balancer routes requests from clients to targets using private IP addresses.

IP address type. Select the type of IP addresses that your subnets use.

IPv4. Recommended for internal load balancers.
Dualstack. Includes IPv4 and IPv6 addresses.

Network mapping. The load balancer routes traffic to targets in the selected subnets, and in accordance with your IP address settings.

VPC. Virtual private cloud for your targets. If balancer is internet-facing, only VPCs with an internet gateway are enabled for selection. The selected VPC cannot be changed after the load balancer is created. As VPC is region-specific so is Application Load Balancer.
Mappings. Once VPC is selected, its availability zones are listed here and are selectable. Select at least two Availability Zones and one subnet per zone. The load balancer routes traffic to targets in these Availability Zones only. Availability Zones that are not supported by the load balancer or the VPC are not available for selection. We should select all AZs that we listed in the Auto scaling group (if we used it).

Security groups. A security group is a set of firewall rules that control the traffic to your load balancer. We can select up to 10 security groups.

If our application is listening for HTTP requests on port 80 we should select a security group with:

Inbound rule: accept HTTP/TCP traffic on port 80 with source Anywhere-IPv4
Outbound rule: allow all traffic for all protocols and port ranges to custom destination 0.0.0.0/0

Listeners and routing. A listener is a process that checks for connection requests using the port and protocol you configure. The rules that you define for a listener determine how the load balancer routes requests to its registered targets.

Add listener

Protocol e.g. HTTP
Port e.g. 80. This is a public facing port and it does not need to be the same as the port from the attached target group. E.g. LB can listen on port 80 and forward traffic to target group port 8080.
Default action: Forward to (select a target group)
Add listener tags

Add-on services - optional

AWS Global Accelerator

Tags - optional

More info on Scheme, from AWS documentation:

When you create a load balancer, you must choose whether to make it an internal load balancer or an internet-facing load balancer.

The nodes of an internet-facing load balancer have public IP addresses.

The nodes of an internal load balancer have only private IP addresses.

Both internet-facing and internal load balancers route requests to your targets using private IP addresses. Therefore, your targets do not need public IP addresses to receive requests from an internal or an internet-facing load balancer.

More info on how ALB routes traffic to multiple Availability Zones (and about what Load Balancer Nodes are):

When you enable an Availability Zone for your load balancer, Elastic Load Balancing creates a load balancer node in the Availability Zone.
The nodes for your load balancer distribute requests from clients to registered targets. When cross-zone load balancing is enabled, each load balancer node distributes traffic across the registered targets in all enabled Availability Zones. When cross-zone load balancing is disabled, each load balancer node distributes traffic only across the registered targets in its Availability Zone.
Before a client sends a request to your load balancer, it resolves the load balancer's domain name using a Domain Name System (DNS) server. The DNS entry is controlled by Amazon, because your load balancers are in the amazonaws.com domain. The Amazon DNS servers return one or more IP addresses to the client. These are the IP addresses of the load balancer nodes for your load balancer.
As traffic to your application changes over time, Elastic Load Balancing scales your load balancer and updates the DNS entry. The DNS entry also specifies the time-to-live (TTL) of 60 seconds. This helps ensure that the IP addresses can be remapped quickly in response to changing traffic.

The client determines which IP address to use to send requests to the load balancer. The load balancer node that receives the request selects a healthy registered target and sends the request to the target using its private IP address.

With Application Load Balancers, the load balancer node that receives the request uses the following process:

1) Evaluates the listener rules in priority order to determine which rule to apply.

2) Selects a target from the target group for the rule action, using the routing algorithm configured for the target group. The default routing algorithm is round robin. Routing is performed independently for each target group, even when a target is registered with multiple target groups.

For further info: How Elastic Load Balancing works - Elastic Load Balancing

ALB nodes use Elastic Network Interface (Elastic network interfaces - Amazon Elastic Compute Cloud) which has public IP address:

At least one ENI is created and attached to the balancer in each availability zone where the balancer is deployed (except NLB, which should only have one per AZ). Over the life of the balancer, new ENIs will appear and old ones will disappear, as the balancer scales horizontally (number of nodes) and/or vertically (capacity of underlying hardware), all of which is handled transparently by the infrastructure. Even though you can tag them, the tagging will become stale over time.

Source: amazon web services - AWS - Affect Load Balancer's tags to its Network Interfaces (ENI) - Stack Overflow

You can determine the IP addresses associated with an internal load balancer or an internet-facing load balancer by resolving the DNS name of the load balancer. These are the IP addresses where the clients should send the requests that are destined for the load balancer. However, Classic Load Balancers and Application Load Balancers use the private IP addresses associated with their elastic network interfaces as the source IP address for requests forwarded to your web servers.

Source: Find the IP address used by a load balancer to forward traffic to web servers

Load balancer routes requests to the targets in a target group and performs health checks on the targets. Target group is accepting requests from the load balancer and forwards them to targets. These targets can be e.g. EC2 instances created either manually or through auto scaling group.

How to create a Target Group used by Load Balancer listeners? (This applies for any type of Load Balancer)

EC2 >> Target groups >> Create target group

Step 1: Specify group details

Here we can set:

Basic configuration. Settings in this section cannot be changed after the target group is created.

Target type

Instances

Supports load balancing to instances within a specific VPC.
Facilitates the use of Amazon EC2 Auto Scaling to manage and scale your EC2 capacity.

IP addresses

Supports load balancing to VPC and on-premises resources.
Facilitates routing to multiple IP addresses and network interfaces on the same instance.
Offers flexibility with microservice based architectures, simplifying inter-application communication.
Supports IPv6 targets, enabling end-to-end IPv6 communication, and IPv4-to-IPv6 NAT.

Lambda function

Facilitates routing to a single Lambda function.
Accessible to Application Load Balancers only.

Application Load Balancer

Offers the flexibility for a Network Load Balancer to accept and route TCP requests within a specific VPC
Facilitates using static IP addresses and PrivateLink with an Application Load Balancer.

Target group name
Protocol:Port e.g. If our application is accepting HTTP requests on port 8080 this would be HTTP:8080
VPC - VPC with the instances that you want to include in the target group.
Protocol version

HTTP1. Send requests to targets using HTTP/1.1. Supported when the request protocol is HTTP/1.1 or HTTP/2.
HTTP2. Send requests to targets using HTTP/2. Supported when the request protocol is HTTP/2 or gRPC, but gRPC-specific features are not available.
gRPC. Send requests to targets using gRPC. Supported when the request protocol is gRPC.

Health checks. The associated load balancer periodically sends requests, per the settings below, to the registered targets to test their status.

Health check protocol

HTTP
HTTPS

Health check path. Use the default path of “/“ to ping the root, or specify a custom path if preferred.
Advanced health check settings

Port. The port the load balancer uses when performing health checks on targets. The default is the port on which each target receives traffic from the load balancer, but you can specify a different port.

Traffic port
Override

Healthy threshold. The number of consecutive health checks successes required before considering an unhealthy target healthy.
Unhealthy threshold. The number of consecutive health check failures required before considering a target unhealthy.
Timeout. The amount of time, in seconds, during which no response means a failed health check.
Interval. The approximate amount of time between health checks of an individual target
Success codes. The HTTP codes to use when checking for a successful response from a target. You can specify multiple values (for example, "200,202") or a range of values (for example, "200-299").

Attributes
Tags - optional

Step 2: Register targets

This is an optional step to create a target group. However, to ensure that your load balancer routes traffic to this target group you must register your targets.

After load balancer is created it takes several minutes while it's in provisioning state and get into active state. After this, we can use its DNS name in order to see what it's doing.

If we copy its DNS name and paste it to our browser, if we haven't registered any targets in the target group associated with the load balancer, we'll get error 503 - Service Temporary Unavailable.

If we've registered targets and are getting error 504 Gateway time-out, we should check first if security groups (firewalls) for our EC2 instances (inbound rule - source IP range) are set up correctly as this error usually indicates that inbound traffic is not allowed.

AWS Terraform provider offers provisioning all these resources:

Application Load Balancer: aws_lb | Resources | hashicorp/aws | Terraform Registry
Listener: aws_lb_listener | Resources | hashicorp/aws | Terraform Registry
Target Group: aws_lb_target_group | Resources | hashicorp/aws | Terraform Registry

How is AWS Application Load Balancing usually implemented?

Let's say we have our application running on 3 EC2 instances where 2 are in the same region e.g. us-west-2 but in separate availability zones e.g. us-west-2a and us-west-2b. Third EC2 instance is in eu-central-1, in availability zone eu-central-1a.

VPC is region-specific but can span multiple availability zones (AZ).

Subnet is an IP address range within VPC.

VPC can have public and private subnets.

VPC can be divided into multiple subnets but each subnet is AZ-specific.

AZ can have multiple subnets.

So, all EC2 instances belong to the same VPC but, as they are in different AZs, each of them belongs to different subnet.

Load balancer must be in the public subnet of VPC as clients communicate with load balancer via internet (public network).

Load balancer does not get associated directly with EC2 instances but subnets:

resource "aws_lb" "test" {

subnets = ["subnet-0001", "subnet-0002"]

...

}

Target group is associated with VPC:

resource "aws_alb_target_group" "test" {

vpc_id = var.vpc_id

...

}

Difference between ALB and NLB (Network Load Balancer)

An Application Load Balancer (ALB) and a Network Load Balancer (NLB) serve different purposes based on the layer of the network they operate on and the type of traffic they handle.

The core difference is that an ALB understands application-level traffic (Layer 7) like HTTP/HTTPS headers, while an NLB handles low-level network traffic (Layer 4) like TCP/UDP packets at extreme speeds.

Direct Comparison Matrix

Feature Application Load Balancer (ALB) Network Load Balancer (NLB)

====== ========================= =======================

OSI Layer Layer 7 (Application) Layer 4 (Transport)

Protocols HTTP, HTTPS, HTTP/2, gRPC, WebSockets TCP, UDP, TLS

IP Addresses

Dynamic IPs (Changes automatically; requires a DNS name)

Static IPs (Can assign an Elastic IP per AZ)

Routing Features

Advanced (Path, Host, Query parameters, Headers)

Basic (Port and IP protocol routing only)

Performance

Optimized for complex web apps (Millions of requests/sec)

Optimized for ultra-low latency (Billions of requests/sec)

Key Technical Differences

1. Smart Routing vs. Raw Speed

ALB (Smart): Can read the contents of your HTTP requests. It can route traffic bound for ://example.com to an API server cluster, and traffic for ://example.com to a storage cluster.
NLB (Fast): Does not look inside the data packet. It simply looks at the target port and forwards the packet instantly. This results in ultra-low latency (measured in milliseconds).

2. IP Addresses and DNS

ALB: Scale out dynamically by adding or removing nodes. This causes its underlying IP addresses to change frequently. You must always point your domain name to the ALB's DNS Name, never to a static IP.
NLB: Gives you a Static IP address per Availability Zone. You can also assign your own Elastic IP addresses. This is critical if your corporate clients need to whitelist specific, unchanging IPs in their firewalls.

3. Client IP Preservation

ALB: Terminates the connection and makes a new one to your backend instances. The backend see the ALB's private IP. To find the real user's IP, your code must read the X-Forwarded-For HTTP header.
NLB: Passes the original TCP packet straight through to your backend server. Your backend instances see the original source IP address of the client natively, without needing extra headers.

When to Choose Which?

Choose an ALB if you are building:

Standard web applications and microservices.
Containerized apps (ECS/EKS) requiring path-based or host-based routing.
Applications requiring tight integration with AWS Web Application Firewall (WAF).

Choose an NLB if you are building:

Non-HTTP applications (e.g., gaming servers, SFTP, MQTT, database clusters).
Architectures requiring fixed, static IP addresses or Elastic IPs.
High-frequency financial applications where sub-millisecond network latency is a hard requirement

Which alerts should typically be set for AWS ALB?

To keep your applications highly available, you should set up Amazon CloudWatch alarms for a mix of availability, performance, and target health metrics.

The most critical metrics to monitor for an AWS ALB are grouped by priority below:

1. High Priority (Critical Infrastructure Impact)

UnHealthyHostCount (Per Target Group)

What it means: The number of backend instances failing health checks Target Group Metrics.

Alert Threshold: > 0 (or > 1 for larger clusters).

Why it matters: Signals that your servers are crashing or cannot handle traffic.

HTTPCode_Target_5XX_Count

What it means: The number of 5xx server error codes generated by your backend application ALB Metrics.

Alert Threshold: Depends on baseline traffic, typically > 5 failures within a 1-minute to 5-minute window.

Why it matters: Indicates server crashes, database connection timeouts, or unhandled exceptions in your application code.

HTTPCode_ELB_5XX_Count

What it means: The number of 5xx errors generated directly by the ALB itself (not your servers) ALB Metrics.

Alert Threshold: > 0.

Why it matters: Usually means the ALB cannot find any healthy hosts, or it is experiencing a configuration mismatch (e.g., bad TLS handshake with the target).

2. Medium Priority (Performance & User Experience)

TargetResponseTime

What it means: The time elapsed (in seconds) from when the ALB sent the request to the target until the target started responding ALB Metrics.

Alert Threshold: Use the p95 or p99 statistic. Alert if it exceeds your application’s maximum acceptable latency (e.g., > 2.0 seconds).

Why it matters: Users are experiencing severe application slowdowns, likely due to high CPU/memory usage on your instances.

RejectedConnectionCount

What it means: The load balancer is rejecting connections because it has reached its maximum capacity ALB Metrics.

Alert Threshold: > 0.

Why it matters: Your application is getting sudden traffic spikes and the ALB cannot scale fast enough, or backend targets are failing to keep up.

3. Low Priority (Anomalies & Security)

HTTPCode_Target_4XX_Count

What it means: The number of 4xx client errors (like 404 Not Found or 401 Unauthorized) returned by backend targets ALB Metrics.

Alert Threshold: A significant spike above your standard baseline.

Why it matters: A sudden surge might indicate a broken frontend deployment, a bad API update, or a malicious entity scanning your network for vulnerabilities.

Summary Checklist for CloudWatch Alarms

Metric Name Statistic Recommended Suggested Action

Period Threshold

========== ====== =========== ======== =====

UnHealthyHostCount Maximum 1 Minute > 0 Page/On-Call

HTTPCode_ELB_5XX_Count Sum 1 Minute > 0 Page/On-Call

HTTPCode_Target_5XX_Count Sum 5 Minutes > 10 (or > 1% of traffic) Ticket/Slack

TargetResponseTime p95 5 Minutes> [Your Limit] Ticket/Slack

Resources:

How Elastic Load Balancing works - Elastic Load Balancing

Wednesday, 3 June 2026

Some notes on LLMs

What does e.g. 1M context mean in a model description?

"1M context" means the model can handle up to 1 million tokens in its context window — the combined input (your messages, files, tool results, system prompt) plus output it can consider at once. For reference, that's roughly 750,000 words or a few thousand pages of text, letting it work with very large codebases or long conversations without losing earlier content.

What is a LLM's tool-use loop?

An LLM tool-use loop is the pattern where you let a language model drive an investigation by repeatedly choosing tools to call, rather than answering in one shot.

The shape

1. Send: system prompt + user request + list of available tools (with JSON schemas)

2. Model responds with either:

(a) a final text answer -> exit loop

(b) a "tool_use" block: { name: "run_aws_cli", input: { args: [...] } }

3. Your code executes that tool, captures the result

4. Append the tool result to the conversation as a "tool_result" message

5. Send the whole conversation back to the model

6. Goto 2

The model never executes anything itself — it just emits requests to call tools. Your code is the runtime that actually runs them and feeds the output back.

Why it's a loop

Each turn the model sees everything it has learned so far (prior tool calls + their outputs) and decides the next step based on that. So a real run looks like:

- Turn 1: model calls cloudwatch describe-alarms --state-value ALARM

- Turn 2: sees 3 alarms, picks the noisiest, calls logs filter-log-events for that log group around the alarm time

- Turn 3: sees an error pattern, calls kubectl describe pod on the affected workload

- Turn 4: emits final Markdown report, no tool call → loop exits

The model is doing the planning; your code is the dispatcher.

Why you need a budget

Without limits the loop can spin forever — the model keeps finding "one more thing to check." Hence in agent.run():

- max_iterations=30 — hard cap on turns

- max_tokens_per_turn=12288 — cap on a single response

- Per-tool wall-clock timeouts (60 s for CLI, 30 s for HTTP)

- Output truncation (50 000 char stdout) so a giant tool result doesn't blow the context window

How it ends

The loop terminates when the model returns a response with no tool_use block — that's the "I'm done, here's the answer" signal (stop_reason: end_turn). Or when you hit a budget limit and force-stop it.

Where the safety lives

Because the model can ask for arbitrary tool calls, the loop is only as safe as the tool implementations. That's why when implementing agents we should have the allowlists (services, verbs, paths) - the model can request aws s3 rm, but the validator rejects it before subprocess.run ever sees it.

The "two-pass" design in agent is a refinement: pass 1 is a tool-use loop (gather), pass 2 is a single non-loop call (synthesize). Splitting them lets each prompt focus on one job.

What are those .md files used by AI Agents?

There isn't a universally agreed official name, but people commonly refer to files like CLAUDE.md, GEMINI.md, AGENTS.md, COPILOT_INSTRUCTIONS.md, and .cursorrules as:

AI agent instruction files (most generic)
Agent configuration files
Agent context files
LLM instruction files
Repository AI instructions
Project AI guidelines

In the developer tooling community, "agent instructions" or "agent context files" are probably the most widely understood umbrella terms.

For example:

Tool File

---- -----

Claude Code CLAUDE.md

Gemini CLI GEMINI.md

GitHub Copilot .github/copilot-instructions.md

OpenAI Codex CLI AGENTS.md

Cursor .cursorrules / project rules

Windsurf Rules files

Collectively, you could describe them as:

"Repository-level AI agent instruction files that provide persistent context and operating rules for coding assistants."

If you're building tooling around them (e.g., in your DevOps work), I'd recommend using "agent instructions" as the generic term because it's vendor-neutral and easily understood across Claude, Gemini, Copilot, Cursor, Codex, and similar tools.

Introduction to Claude by Anthropic

How to give Claude an instruction to apply label "DevOps" whenever it creates a new Linear ticket?

It depends on whether you're creating Linear tickets via:

Linear MCP,
a custom /create-ticket command,
or just asking Claude in chat to create tickets

The best location depends on which of those you're using.

If you're just asking Claude in chat to create tickets: if you want this behavior for all projects, put it in:

~/.claude/CLAUDE.md

If you want it only for a specific repository/project, put it in:

<repo-root>/CLAUDE.md

If you want it only for yourself in a specific project (without committing it to git), put it in:

<repo-root>/CLAUDE.local.md

CLAUDE.local.md is usually the best place for personal workflow preferences like Linear labels. Claude loads it after the project-level file, so your preference is read later and tends to have more influence.

For example:

## Linear Ticket Creation

When creating new Linear issues:

- Always apply the "DevOps" label.

- If multiple labels are appropriate, include "DevOps" in addition to the others.

- Verify the label exists before creating the issue.

If you're creating tickets through a custom slash command (e.g. /create-ticket) or a Linear MCP workflow, an even more reliable approach is to put the instruction directly in the command definition under:

~/.claude/commands/create-ticket.md

or in a dedicated skill, because the instruction is then attached specifically to the ticket-creation workflow rather than relying on general memory. Many Claude Code users find workflow-specific commands more reliable than large global CLAUDE.md files for things like Linear automation.

How to instruct Claude to use feature branch name from a Linear ticket when implementing it?

Add a specific workflow rule to CLAUDE.md rather than a vague instruction. Claude follows procedural instructions much more reliably.

For example:

## Linear Ticket Workflow

When implementing work from a Linear ticket:

1. Retrieve the ticket details from Linear.

2. Check whether the ticket has a suggested branch name.

3. Before making any code changes, switch to or create a git branch using the Linear branch name.

4. Do not create an alternative branch name unless the Linear branch name is unavailable.

5. If the branch already exists remotely, check it out instead of creating a new branch.

6. Confirm the branch name being used before starting implementation.

Example:

Linear ticket: DEV-123 Add deployment health checks

Branch: bojan/dev-123-add-deployment-health-checks

Commands:

git fetch origin

git checkout bojan/dev-123-add-deployment-health-checks || \

git checkout -b bojan/dev-123-add-deployment-health-checks

If you're using the Linear MCP server, I'd make it even more explicit:

## Branch Naming

For any task originating from Linear:

- Always obtain the branch name from the Linear ticket metadata.

- Never invent a branch name if Linear provides one.

- Use the Linear-provided branch name exactly as written.

- Checkout the branch before planning or implementing changes.

An even better approach is to create a dedicated command such as .claude/commands/implement-ticket.md:

Given a Linear ticket ID:

1. Load the ticket.

2. Read the ticket's branch name.

3. Checkout the branch.

4. Implement the work.

5. Commit using the ticket ID in the commit message.

Workflow-specific commands tend to be more reliable than relying on global instructions in CLAUDE.md, especially for things like branch creation, ticket labeling, and PR naming.

If I update ~/.claude/CLAUDE.md, do I need to start the new Claude session?

Usually no restart is strictly required, but it depends on how you’re using Claude:

If you’re using Claude Code / CLI (most common case)

~/.claude/CLAUDE.md is typically read when a new session starts

So:

✅ New terminal session / new Claude run → picks up changes

⚠️ Already-running interactive session → may not reliably reload it

Best practice: restart the Claude session (or re-open the CLI/chat) after changing it if you want to be sure the new rules apply.

If you’re using a long-running agent session

(e.g. an ongoing “Claude is working in the repo” process)

It often caches the instructions in memory at session start

So updates won’t consistently apply mid-task

If you’re using repo-level CLAUDE.md

Same rule: loaded at session start or project initialization

Changes apply reliably on next session or re-attach to repo

Practical rule of thumb

Changed CLAUDE.md → start a new Claude session to guarantee it applies

Minor edits while actively working → assume they won’t fully apply until restart

How to add auto-generated instructions to global context?

During Claude Code session, it created two memories but they were bound to the current project only:

Write(/Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/gh-token-unset.md)

⎿ Wrote 10 lines to /Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/gh-token-unset.md

1 ---

2 name: gh-token-unset

3 description: gh CLI fails with HTTP 401 Bad credentials unless GH_TOKEN is unset first

4 metadata:

5 node_type: memory

6 type: reference

7 originSessionId: 9xxx6

8 ---

10 In this environment `gh` fails with `HTTP 401: Bad credentials` because a stale `GH_TOKEN` env var overrides the working keyring auth. Run `unset GH_TOKEN` in the same Bash command before any `gh`

call — the keyring account (`BojanKomazec`, scopes gist/read:org/repo) then works. Env doesn't persist between Bash calls, so include the `unset` each time.

⏺ Write(/Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/precommit-hook-needs-no-verify.md)

⎿ Wrote 10 lines to /Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/precommit-hook-needs-no-verify.md

1 ---

2 name: precommit-hook-needs-no-verify

3 description: git commit aborts in non-interactive shell; the pre-commit hook prompts on /dev/tty

4 metadata:

5 node_type: memory

6 type: reference

7 originSessionId: 9xxx6

8 ---

10 The global pre-commit hook (`~/.git-templates/hooks/pre-commit`) prompts interactively via `/dev/tty` ("view diff? proceed with commit?"). In a non-interactive shell it errors `/dev/tty: Device no

t configured` and aborts the commit. Use `git commit --no-verify` to bypass it when committing programmatically.

⏺ Write(/Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/MEMORY.md)

⎿ Wrote 2 lines to /Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/MEMORY.md

1 - [gh token unset](gh-token-unset.md) — `unset GH_TOKEN` before any `gh` command or it 401s

2 - [pre-commit hook needs --no-verify](precommit-hook-needs-no-verify.md) — commit hook prompts on /dev/tty, aborts non-interactively

What you’re looking at in:

~/.claude/projects/.../memory/MEMORY.md

is project-scoped learned memory (auto-generated runtime notes), not the best place to define global rules.

To make something apply to all projects, you should put it in:

~/.claude/CLAUDE.md

How to structure it properly

Think of ~/.claude/CLAUDE.md as your global “agent operating system”.

You can directly migrate those MEMORY.md rules into it like this:

## Global DevOps / CLI Behavior Rules

These rules apply to all projects unless explicitly overridden.

### GitHub CLI

- Always run `unset GH_TOKEN` before executing any `gh` command.

- Reason: GH_TOKEN causes 401 errors in local CLI authentication.

### Git commits

- When committing changes, always use `--no-verify`.

- Reason: pre-commit hooks may attempt interactive prompts via /dev/tty and break non-interactive runs.

Important distinction (this is the key part)

File Scope Who writes it Purpose

---- ----- ------------- --------

~/.claude/CLAUDE.md Global (all projects) You Your permanent rules

<repo>/CLAUDE.md Per project You/team Repo-specific rules

~/.claude/projects/.../MEMORY.md Auto-generated Claude Learned behavior, not authoritative

👉 Never manually “promote” MEMORY.md as your source of truth

Instead, copy stable patterns into ~/.claude/CLAUDE.md.

For your Linear branch-name requirement

You’d add it here too:

## Linear Workflow (Global)

When working from Linear tickets:

- Always use the branch name provided by Linear metadata.

- Never generate custom branch names.

- Always checkout the Linear branch before implementation.

One extra improvement (worth doing)

If you want this to be really robust, structure your global file like:

~/.claude/CLAUDE.md

├── Git / GitHub rules

├── Linear workflow rules

├── Shell safety rules

├── CI/CD rules

Claude responds better to clear domains than long flat lists.

Pages

Friday, 26 June 2026

DevOps Interview Questions - k8s ALB alarm

A. The alarm itself (CloudWatch / ALB fundamentals)

B. Narrowing from ALB to one service

C. First look at the workload

D. The load-balancing theory (and why it was wrong)

E. Target mode: instance vs IP (the topology that broke the theory)

F. Getting the proof (Prometheus / time series)

G. The real root cause (app server / GIL / async)

H. Designing the fix

I. Where the config lives & deploy mechanics

J. Capacity analysis

K. Staging-first deploy & verification

L. Production rollout & confirmation

M. Operational / process & gotchas

N. Synthesis & transfer (test of true mastery)

Thursday, 25 June 2026

Elastic Fleet

1. Fleet-Managed Mode (Recommended)

2. Standalone Mode

Example of how an Elastic Fleet can actually be wired

Components & where they live

Who listens, who initiates

Traffic Flows: Breakdown

1. Agent --> Fleet Server (Control Plane: Enrollment & Policy Check-in)

2. Agent --> Elasticsearch (Data Plane: Telemetry Shipping)

3. Fleet Server --> Elasticsearch + Kibana

One-Line Summary

Why does fleet server holds connection open for so long? Why it does not reply immediately?

Why not reply immediately (short polling)?

Why not server push (#2)?

So what's actually happening in those ~300s

Why this specifically fools the ALB

AWS EC2: Application Load Balancer

Key Features:

How Components Work Together

How ALB health checks keep applications online?

1. Automatic Traffic Redirection

2. Auto Scaling Integration

3. Graceful Recovery

How to Configure an ALB Health Check

Public ALB

Key Characteristics:

How Routing Works

Configuration Checklist

Private ALB

Key Characteristics

Common Use Cases

Setting Up ALB in AWS Console

Difference between ALB and NLB (Network Load Balancer)

Direct Comparison Matrix

Key Technical Differences

1. Smart Routing vs. Raw Speed

2. IP Addresses and DNS

3. Client IP Preservation

When to Choose Which?

Which alerts should typically be set for AWS ALB?

Resources:

Wednesday, 3 June 2026

Some notes on LLMs

What does e.g. 1M context mean in a model description?

What is a LLM's tool-use loop?

The shape

Why it's a loop

Why you need a budget

How it ends

Where the safety lives

What are those .md files used by AI Agents?

Introduction to Claude by Anthropic

How to give Claude an instruction to apply label "DevOps" whenever it creates a new Linear ticket?

How to instruct Claude to use feature branch name from a Linear ticket when implementing it?

If I update ~/.claude/CLAUDE.md, do I need to start the new Claude session?

If you’re using Claude Code / CLI (most common case)

If you’re using a long-running agent session

If you’re using repo-level CLAUDE.md

Practical rule of thumb

How to add auto-generated instructions to global context?

How to structure it properly

Important distinction (this is the key part)