Wednesday, 3 June 2026

Some notes on LLMs

 

What does e.g. 1M context mean in a model description?


"1M context" means the model can handle up to 1 million tokens in its context window — the combined input (your messages, files, tool results, system prompt) plus output it can consider at once. For reference, that's roughly 750,000 words or a few thousand pages of text, letting it work with very large codebases or long conversations without losing earlier content.

What is a LLM's tool-use loop? 


An LLM tool-use loop is the pattern where you let a language model drive an investigation by repeatedly choosing tools to call, rather than answering in one shot.

The shape


  1. Send: system prompt + user request + list of available tools (with JSON schemas)
  2. Model responds with either:
       (a) a final text answer  -> exit loop
       (b) a "tool_use" block: { name: "run_aws_cli", input: { args: [...] } }
  3. Your code executes that tool, captures the result
  4. Append the tool result to the conversation as a "tool_result" message
  5. Send the whole conversation back to the model
  6. Goto 2

The model never executes anything itself — it just emits requests to call tools. Your code is the runtime that actually runs them and feeds the output back.

Why it's a loop

Each turn the model sees everything it has learned so far (prior tool calls + their outputs) and decides the next step based on that. So a real run looks like:

  - Turn 1: model calls cloudwatch describe-alarms --state-value ALARM
  - Turn 2: sees 3 alarms, picks the noisiest, calls logs filter-log-events for that log group around the alarm time
  - Turn 3: sees an error pattern, calls kubectl describe pod on the affected workload
  - Turn 4: emits final Markdown report, no tool call → loop exits

  The model is doing the planning; your code is the dispatcher.

Why you need a budget

Without limits the loop can spin forever — the model keeps finding "one more thing to check." Hence in agent.run():

  - max_iterations=30 — hard cap on turns
  - max_tokens_per_turn=12288 — cap on a single response
  - Per-tool wall-clock timeouts (60 s for CLI, 30 s for HTTP)
  - Output truncation (50 000 char stdout) so a giant tool result doesn't blow the context window

How it ends

The loop terminates when the model returns a response with no tool_use block — that's the "I'm done, here's the answer" signal (stop_reason: end_turn). Or when you hit a budget limit and force-stop it.

Where the safety lives

Because the model can ask for arbitrary tool calls, the loop is only as safe as the tool implementations. That's why when implementing agents we should have the allowlists (services, verbs, paths) - the model can request aws s3 rm, but the validator rejects it before subprocess.run ever sees it.

The "two-pass" design in agent is a refinement: pass 1 is a tool-use loop (gather), pass 2 is a single non-loop call (synthesize). Splitting them lets each prompt focus on one job.


What are those .md files used by AI Agents?


There isn't a universally agreed official name, but people commonly refer to files like CLAUDE.md, GEMINI.md, AGENTS.md, COPILOT_INSTRUCTIONS.md, and .cursorrules as:

  • AI agent instruction files (most generic)
  • Agent configuration files
  • Agent context files
  • LLM instruction files
  • Repository AI instructions
  • Project AI guidelines

In the developer tooling community, "agent instructions" or "agent context files" are probably the most widely understood umbrella terms.

For example:

Tool             File
----               -----
Claude Code       CLAUDE.md
Gemini CLI       GEMINI.md
GitHub Copilot   .github/copilot-instructions.md
OpenAI Codex CLI   AGENTS.md
Cursor             .cursorrules / project rules
Windsurf          Rules files

Collectively, you could describe them as:

"Repository-level AI agent instruction files that provide persistent context and operating rules for coding assistants."

If you're building tooling around them (e.g., in your DevOps work), I'd recommend using "agent instructions" as the generic term because it's vendor-neutral and easily understood across Claude, Gemini, Copilot, Cursor, Codex, and similar tools.



Introduction to Claude by Anthropic




How to give Claude an instruction to apply label "DevOps" whenever it creates a new Linear ticket?



It depends on whether you're creating Linear tickets via:
  • Linear MCP,
  • a custom /create-ticket command,
  • or just asking Claude in chat to create tickets
The best location depends on which of those you're using.


If you're just asking Claude in chat to create tickets: if you want this behavior for all projects, put it in:

~/.claude/CLAUDE.md

If you want it only for a specific repository/project, put it in:

<repo-root>/CLAUDE.md

If you want it only for yourself in a specific project (without committing it to git), put it in:

<repo-root>/CLAUDE.local.md

CLAUDE.local.md is usually the best place for personal workflow preferences like Linear labels. Claude loads it after the project-level file, so your preference is read later and tends to have more influence.

For example:

## Linear Ticket Creation

When creating new Linear issues:

- Always apply the "DevOps" label.
- If multiple labels are appropriate, include "DevOps" in addition to the others.
- Verify the label exists before creating the issue.

If you're creating tickets through a custom slash command (e.g. /create-ticket) or a Linear MCP workflow, an even more reliable approach is to put the instruction directly in the command definition under:

~/.claude/commands/create-ticket.md

or in a dedicated skill, because the instruction is then attached specifically to the ticket-creation workflow rather than relying on general memory. Many Claude Code users find workflow-specific commands more reliable than large global CLAUDE.md files for things like Linear automation.


How to instruct Claude to use feature branch name from a Linear ticket when implementing it?


Add a specific workflow rule to CLAUDE.md rather than a vague instruction. Claude follows procedural instructions much more reliably.

For example:

## Linear Ticket Workflow

When implementing work from a Linear ticket:

1. Retrieve the ticket details from Linear.
2. Check whether the ticket has a suggested branch name.
3. Before making any code changes, switch to or create a git branch using the Linear branch name.
4. Do not create an alternative branch name unless the Linear branch name is unavailable.
5. If the branch already exists remotely, check it out instead of creating a new branch.
6. Confirm the branch name being used before starting implementation.

Example:

Linear ticket: DEV-123 Add deployment health checks
Branch: bojan/dev-123-add-deployment-health-checks

Commands:

git fetch origin
git checkout bojan/dev-123-add-deployment-health-checks || \
git checkout -b bojan/dev-123-add-deployment-health-checks

If you're using the Linear MCP server, I'd make it even more explicit:

## Branch Naming

For any task originating from Linear:

- Always obtain the branch name from the Linear ticket metadata.
- Never invent a branch name if Linear provides one.
- Use the Linear-provided branch name exactly as written.
- Checkout the branch before planning or implementing changes.

An even better approach is to create a dedicated command such as .claude/commands/implement-ticket.md:

Given a Linear ticket ID:

1. Load the ticket.
2. Read the ticket's branch name.
3. Checkout the branch.
4. Implement the work.
5. Commit using the ticket ID in the commit message.

Workflow-specific commands tend to be more reliable than relying on global instructions in CLAUDE.md, especially for things like branch creation, ticket labeling, and PR naming.


If I update ~/.claude/CLAUDE.md, do I need to start the new Claude session?


Usually no restart is strictly required, but it depends on how you’re using Claude:

If you’re using Claude Code / CLI (most common case)


~/.claude/CLAUDE.md is typically read when a new session starts

So:
✅ New terminal session / new Claude run → picks up changes
⚠️ Already-running interactive session → may not reliably reload it

Best practice: restart the Claude session (or re-open the CLI/chat) after changing it if you want to be sure the new rules apply.

If you’re using a long-running agent session


(e.g. an ongoing “Claude is working in the repo” process)

It often caches the instructions in memory at session start
So updates won’t consistently apply mid-task

If you’re using repo-level CLAUDE.md


Same rule: loaded at session start or project initialization
Changes apply reliably on next session or re-attach to repo

Practical rule of thumb


Changed CLAUDE.md → start a new Claude session to guarantee it applies
Minor edits while actively working → assume they won’t fully apply until restart


How to add auto-generated instructions to global context?

During Claude Code session, it created two memories but they were bound to the current project only:

Write(/Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/gh-token-unset.md)
  ⎿  Wrote 10 lines to /Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/gh-token-unset.md
       1 ---
       2 name: gh-token-unset
       3 description: gh CLI fails with HTTP 401 Bad credentials unless GH_TOKEN is unset first
       4 metadata:
       5   node_type: memory
       6   type: reference
       7   originSessionId: 9xxx6
       8 ---
       9
      10 In this environment `gh` fails with `HTTP 401: Bad credentials` because a stale `GH_TOKEN` env var overrides the working keyring auth. Run `unset GH_TOKEN` in the same Bash command before any `gh`
          call — the keyring account (`BojanKomazec`, scopes gist/read:org/repo) then works. Env doesn't persist between Bash calls, so include the `unset` each time.

⏺ Write(/Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/precommit-hook-needs-no-verify.md)
  ⎿  Wrote 10 lines to /Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/precommit-hook-needs-no-verify.md
       1 ---
       2 name: precommit-hook-needs-no-verify
       3 description: git commit aborts in non-interactive shell; the pre-commit hook prompts on /dev/tty
       4 metadata:
       5   node_type: memory
       6   type: reference
       7   originSessionId: 9xxx6
       8 ---
       9
      10 The global pre-commit hook (`~/.git-templates/hooks/pre-commit`) prompts interactively via `/dev/tty` ("view diff? proceed with commit?"). In a non-interactive shell it errors `/dev/tty: Device no
         t configured` and aborts the commit. Use `git commit --no-verify` to bypass it when committing programmatically.

⏺ Write(/Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/MEMORY.md)
  ⎿  Wrote 2 lines to /Users/bojan/.claude/projects/-Users-bojan-repos-Company-Project/memory/MEMORY.md
      1 - [gh token unset](gh-token-unset.md) — `unset GH_TOKEN` before any `gh` command or it 401s
      2 - [pre-commit hook needs --no-verify](precommit-hook-needs-no-verify.md) — commit hook prompts on /dev/tty, aborts non-interactively

What you’re looking at in:

~/.claude/projects/.../memory/MEMORY.md

is project-scoped learned memory (auto-generated runtime notes), not the best place to define global rules.

To make something apply to all projects, you should put it in:

~/.claude/CLAUDE.md

How to structure it properly


Think of ~/.claude/CLAUDE.md as your global “agent operating system”.

You can directly migrate those MEMORY.md rules into it like this:

## Global DevOps / CLI Behavior Rules

These rules apply to all projects unless explicitly overridden.

### GitHub CLI
- Always run `unset GH_TOKEN` before executing any `gh` command.
- Reason: GH_TOKEN causes 401 errors in local CLI authentication.

### Git commits
- When committing changes, always use `--no-verify`.
- Reason: pre-commit hooks may attempt interactive prompts via /dev/tty and break non-interactive runs.


Important distinction (this is the key part)


File               Scope                 Who writes it Purpose
----                  -----                  -------------    --------
~/.claude/CLAUDE.md  Global (all projects) You             Your permanent rules
<repo>/CLAUDE.md      Per project             You/team         Repo-specific rules
~/.claude/projects/.../MEMORY.md Auto-generated Claude     Learned behavior, not authoritative

👉 Never manually “promote” MEMORY.md as your source of truth
Instead, copy stable patterns into ~/.claude/CLAUDE.md.

For your Linear branch-name requirement


You’d add it here too:

## Linear Workflow (Global)

When working from Linear tickets:
- Always use the branch name provided by Linear metadata.
- Never generate custom branch names.
- Always checkout the Linear branch before implementation.


One extra improvement (worth doing)


If you want this to be really robust, structure your global file like:

~/.claude/CLAUDE.md
├── Git / GitHub rules
├── Linear workflow rules
├── Shell safety rules
├── CI/CD rules

Claude responds better to clear domains than long flat lists.









Thursday, 14 May 2026

Introduction to Checkly



How Checkly works

Checkly is a SaaS synthetic monitoring platform — you define "checks" (HTTP requests or browser scripts), Checkly runs them on a schedule from probe locations around the world (or on-demand from CI), records latency/assertions/screenshots, and alerts you when they fail or get slow.
  
  Two main check types:

  - API checks — a single HTTP request with assertions on status, headers, body, response time.
  - Browser checks — a Playwright script run in a real headless Chromium against your deployed app.

There's also multi-step API checks (chain requests, e.g. login → use token → logout) and heartbeat checks (your job pings Checkly; alert if it stops).

Hearbeat vs Ping 

Heartbeats and pings are both vital network failure-detection mechanisms, but they differ in purpose: Heartbeats are proactive, periodic "I am alive" messages sent by an application to signal it is healthy, while Pings are reactive requests to check if a server is reachable. Heartbeats detect application crashes, while pings detect network downtime.

Checks are typically authored as code (Checkly CLI, TypeScript) and checkly deploy'd to the cloud. You can tag them (tags: ["auth"]), parametrise them with env vars like ENVIRONMENT_URL, and trigger them on-demand from CI — which is exactly what this PR does with npx checkly trigger --tags=auth.
  
  Runtime model:
  - Scheduled: every N minutes from chosen regions (e.g. us-east-2, eu-west-1) — catches regressions/outages between deploys.
  - Triggered from CI: post-deploy smoke test, results gate (or just annotate) the deploy.
  - Alerts: Slack/PagerDuty/email on failure, with retry/degraded thresholds to avoid flap.

  ---

  What it would check for this auth API
  
  Given the auth API's surface (login, OAuth, JWT issuance, admin endpoints), realistic auth-tagged checks:

  1. Health endpoint — basic liveness

  new ApiCheck("auth-health", {
    name: "Auth API – health",
    tags: ["auth"],
    frequency: 1, // minute
    locations: ["us-east-2", "eu-west-1"],
    request: {
      url: `${process.env.ENVIRONMENT_URL}/health`,
      method: "GET",
      assertions: [
        AssertionBuilder.statusCode().equals(200),
        AssertionBuilder.responseTime().lessThan(500),
        AssertionBuilder.jsonBody("$.status").equals("ok"),
      ],
    },
  });

  2. Login flow — happy path, returns a JWT

  new ApiCheck("auth-login", {
    name: "Auth API – login returns JWT",
    tags: ["auth"],
    request: {
      url: `${process.env.ENVIRONMENT_URL}/auth/login`,
      method: "POST",
      headers: [{ key: "Content-Type", value: "application/json" }],
      body: JSON.stringify({
        email: process.env.SYNTHETIC_USER_EMAIL,
        password: process.env.SYNTHETIC_USER_PASSWORD,
      }),
      assertions: [
        AssertionBuilder.statusCode().equals(200),
        AssertionBuilder.responseTime().lessThan(1500),
        AssertionBuilder.jsonBody("$.token").isNotNull(),
        // structural check on JWT shape
        AssertionBuilder.jsonBody("$.token").matches("^eyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+$"),
      ],
    },
  });
  
  3. Login — wrong password returns 401 (negative path)

  Catches the "accidentally accepts anything" class of regression.

  new ApiCheck("auth-login-bad-pw", {
    name: "Auth API – wrong password = 401",
    tags: ["auth"],
    request: {
      url: `${process.env.ENVIRONMENT_URL}/auth/login`,
      method: "POST",
      headers: [{ key: "Content-Type", value: "application/json" }],
      body: JSON.stringify({ email: process.env.SYNTHETIC_USER_EMAIL, password: "wrong" }),
      assertions: [AssertionBuilder.statusCode().equals(401)],
    },
  });
  
  4. Multi-step — login then call protected endpoint

  This is the most useful kind for an auth API, because it proves the token actually works.

  new MultiStepCheck("auth-token-roundtrip", {
    name: "Auth API – token works against /me",
    tags: ["auth"],
    code: { entrypoint: path.join(__dirname, "token-roundtrip.spec.ts") },
  });
  // token-roundtrip.spec.ts
  import { test, expect } from "@playwright/test";
  test("login then /me", async ({ request }) => {
    const login = await request.post(`${process.env.ENVIRONMENT_URL}/auth/login`, {
      data: { email: process.env.SYNTHETIC_USER_EMAIL, password: process.env.SYNTHETIC_USER_PASSWORD },
    });
    expect(login.ok()).toBeTruthy();
    const { token } = await login.json();
    
    const me = await request.get(`${process.env.ENVIRONMENT_URL}/me`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    expect(me.status()).toBe(200);
    const body = await me.json();
    expect(body.email).toBe(process.env.SYNTHETIC_USER_EMAIL);
  });
  
  5. TLS & cert expiry

  A pure config check — useful because cert rotation is a classic outage cause.

  new ApiCheck("auth-tls", {
    name: "Auth API – TLS cert valid > 14d",
    tags: ["auth"],
    request: {
      url: `${process.env.ENVIRONMENT_URL}/health`,
      method: "GET",
      assertions: [AssertionBuilder.statusCode().equals(200)],
    },
    // Checkly surfaces cert expiry on the run; you set a threshold per check
  });
  
  6. Browser check — full login UX

    expect(login.ok()).toBeTruthy();
    const { token } = await login.json();

    const { token } = await login.json();

    const me = await request.get(`${process.env.ENVIRONMENT_URL}/me`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    expect(me.status()).toBe(200);
    const body = await me.json();
    expect(body.email).toBe(process.env.SYNTHETIC_USER_EMAIL);
  });

  5. TLS & cert expiry

  A pure config check — useful because cert rotation is a classic outage cause.

  new ApiCheck("auth-tls", {
    name: "Auth API – TLS cert valid > 14d",
    tags: ["auth"],
    request: {
      url: `${process.env.ENVIRONMENT_URL}/health`,
      method: "GET",
      assertions: [AssertionBuilder.statusCode().equals(200)],
    },
    // Checkly surfaces cert expiry on the run; you set a threshold per check

  5. TLS & cert expiry

  A pure config check — useful because cert rotation is a classic outage cause.

  new ApiCheck("auth-tls", {
    name: "Auth API – TLS cert valid > 14d",
    tags: ["auth"],
    request: {
      url: `${process.env.ENVIRONMENT_URL}/health`,
      method: "GET",
      assertions: [AssertionBuilder.statusCode().equals(200)],
    },
    // Checkly surfaces cert expiry on the run; you set a threshold per check
  });

  new ApiCheck("auth-tls", {
    name: "Auth API – TLS cert valid > 14d",
    tags: ["auth"],
    request: {
      url: `${process.env.ENVIRONMENT_URL}/health`,
      method: "GET",
      assertions: [AssertionBuilder.statusCode().equals(200)],
    },
    // Checkly surfaces cert expiry on the run; you set a threshold per check
  });

  6. Browser check — full login UX

  Runs against the front-end but exercises the auth API end-to-end including redirects, cookies, CSRF.

  new BrowserCheck("auth-ui-login", {
    name: "Login UI works",
    tags: ["auth"],
    code: { entrypoint: path.join(__dirname, "login.spec.ts") },
  });
  import { test, expect } from "@playwright/test";
  test("user can sign in", async ({ page }) => {
    await page.goto(process.env.ENVIRONMENT_URL!);
    await page.getByLabel("Email").fill(process.env.SYNTHETIC_USER_EMAIL!);
    await page.getByLabel("Password").fill(process.env.SYNTHETIC_USER_PASSWORD!);
    await page.getByRole("button", { name: "Sign in" }).click();
    await expect(page.getByText("Dashboard")).toBeVisible({ timeout: 10_000 });
  });

  7. OAuth callback reachability

  Doesn't fully exercise the Google/Microsoft flow (those need real consent), but checks the callback
  endpoint responds correctly to a missing-code request — confirms route + handler are wired.

  new ApiCheck("auth-oauth-google-callback-shape", {
    name: "Auth API – Google OAuth callback exists",
    tags: ["auth"],
    request: {
      url: `${process.env.ENVIRONMENT_URL}/auth/google/callback`,
      method: "GET",
      assertions: [
        // 400 for missing `code`, not 404/500 — proves handler is mounted
        AssertionBuilder.statusCode().equals(400),
      ],
    },
  });

Intro to QA with Headless Browsers

Headless browsers are used in QA to execute automated browser tests faster and more efficiently by eliminating the graphical user interface (GUI). Because they don't render visuals, they consume fewer resources, enabling rapid, parallel testing in CI/CD pipelines, making them ideal for high-volume functional and regression testing.

Key Reasons for Using Headless Browsers in QA:
  • Faster Execution: Without the need to render CSS, images, or layout, tests run significantly faster.
  • CI/CD Integration: They are ideal for server-side environments where a GUI is unavailable, allowing automated tests to run after every code commit.
  • Lower Resource Usage: They consume significantly less RAM and CPU, allowing for higher parallelization (running many tests simultaneously) without overloading hardware.
  • Automated Functional Testing: They can accurately simulate user actions such as clicking buttons, submitting forms, and navigating pages.
  • Regression Testing: Due to speed and efficiency, they are perfect for running large suites of regression tests to ensure new changes haven't broken existing functionality.
Common tools for headless testing include headless Chrome, Firefox, Puppeteer, and Playwright

Headless browsers parse, compile, and execute the exact same underlying code as standard browsers, but they skip the final step of painting pixels to a physical screen.

What Headless Browsers Still Do
  • Construct the DOM: They parse HTML into a full Document Object Model tree.
  • Apply Styling: They process CSS and calculate layout, element positions, and visibility.
  • Execute JavaScript: They run a full JS engine (like V8 in Chrome) to handle AJAX, animations, and frontend logic.
  • Manage Network Traffic: They make real HTTP requests, download cookies, and handle API responses.

How QA Verifies Visuals Without a Display
  • Layout Queries: Code checks if elements are present, hidden, or overlapping by querying their coordinates.
  • Computed Styles: Scripts verify specific CSS properties, like checking if a button color is exactly rgb(0, 0, 255).
  • Virtual Screenshots: The browser renders the page into an in-memory buffer, allowing QA tools to save PNGs or perform pixel-by-pixel visual regression comparisons.

To help tailor using headless browser to our workflow, we need to know:
  • Which testing framework we are using (e.g., Playwright, Selenium, Cypress)?
  • Are we trying to catch functional bugs or visual layout glitches?
  • Do our tests run on a local machine or a CI/CD server (e.g., GitHub Actions, Jenkins)?

---

Wednesday, 29 April 2026

Introduction to Amazon Simple Notification Service (SNS)



Amazon Simple Notification Service (SNS) is a fully managed messaging service that enables you to decouple microservices, distributed systems, and serverless applications. Here's how SNS works:

Key Concepts


Topics:

A topic is a logical access point and communication channel. Publishers send messages to a topic, and subscribers receive these messages by subscribing to the topic.

Publishers:

Publishers are entities that send messages to an SNS topic. They could be applications, services, or even other AWS services like Lambda or CloudWatch.

Subscribers:

Subscribers are endpoints that receive messages from an SNS topic. These can include Amazon SQS queues, AWS Lambda functions, HTTP/S endpoints, email addresses, and SMS numbers.

Messages:

Messages are the payload sent by publishers to SNS topics. They can include a variety of data formats, typically JSON.



How It Works


Creating a Topic:

First, you create a topic using the AWS Management Console, AWS CLI, or AWS SDKs. This topic acts as a communication channel.

Subscribing to a Topic:

You then subscribe one or more endpoints to the topic. These endpoints can be other AWS services or external services capable of receiving notifications.

When subscribing, you specify the protocol (such as HTTP, SQS, Lambda, etc.) and the endpoint (like the URL or ARN of the SQS queue).

Publishing a Message:

Publishers send messages to the SNS topic using the Publish API. The message can include a subject, a message body, and optional attributes.

SNS stores multiple copies of the message for redundancy and high availability.

Message Delivery:

SNS distributes the message to all subscribed endpoints.

Each endpoint processes the message according to its protocol. For example, an HTTP endpoint receives a POST request with the message content, and an SQS queue receives the message as a new queue entry.



Use Cases


Fan-out Scenarios:

When a message published to an SNS topic needs to be sent to multiple endpoints, SNS acts as a fan-out service. For example, updating various microservices or notifying multiple systems about an event.

Push Notifications:

SNS can be used to send push notifications to mobile devices through services like Amazon Device Messaging (ADM), Apple Push Notification Service (APNS), and Google Cloud Messaging (GCM).

Application Alerts:

Sending notifications or alerts to administrators or users via SMS, email, or other endpoints.

Event-Driven Computing:

Integrating with AWS Lambda to trigger functions in response to incoming messages, enabling event-driven processing.



Benefits


Scalability:

SNS is designed to handle high-throughput and large volumes of messages, making it suitable for large-scale applications.


Flexibility:

It supports multiple protocols, allowing for versatile messaging architectures.


Reliability:

With built-in redundancy and fault tolerance, SNS ensures reliable message delivery.


Ease of Use:

The service is fully managed, reducing the operational overhead associated with maintaining messaging infrastructure.


Security



Access Control:

Using AWS IAM policies, you can control who can publish or subscribe to topics.


Encryption:

SNS supports server-side encryption (SSE) for messages at rest and SSL/TLS for messages in transit.

Compliance:

SNS is compliant with various industry standards and regulations, ensuring it meets security and compliance requirements.

Amazon SNS is a powerful and flexible tool for building loosely coupled, scalable, and robust messaging solutions in the cloud.



How does Amazon SNS differ from Amazon SQS?



Amazon SNS (Simple Notification Service) and Amazon SQS (Simple Queue Service) are both messaging services provided by AWS, but they serve different purposes and have distinct features. Here's a comparison highlighting their differences:

Purpose and Use Cases


Amazon SNS:
  • Publish/Subscribe Messaging: 
    • SNS is designed for sending notifications or messages to multiple subscribers.
    • It's ideal for broadcasting messages to multiple endpoints, such as sending alerts, notifications, or updates to different systems or users.
  • Use Cases: 
    • Real-time notifications
    • push notifications
    • fan-out scenarios (where a message needs to be sent to multiple recipients simultaneously)
    • application alerts
    • event-driven architectures

Amazon SQS:
  • Message Queuing: 
    • SQS is designed for decoupling and scaling distributed systems. 
    • It allows you to send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available.
  • Use Cases:
    • Task queues
    • asynchronous processing
    • decoupling microservices
    • job dispatching
    • buffering messages between producer and consumer systems



Messaging Patterns



Amazon SNS:

  • Push-Based: SNS pushes messages to subscribers. Subscribers can be other AWS services (like Lambda, SQS), HTTP/S endpoints, email addresses, SMS numbers, and mobile push notifications.
  • Fan-Out: One message can be sent to multiple subscribers.

Amazon SQS:

  • Pull-Based: Consumers pull messages from the queue. A consumer explicitly retrieves messages from the queue.
  • Point-to-Point: Each message is delivered to and processed by one consumer.


Message Handling


Amazon SNS:

  • Real-Time Delivery: Messages are delivered immediately to all subscribers.
  • No Message Persistence: Messages are not stored after delivery; if a subscriber is unavailable, the message is lost unless it's sent to an SQS queue or some other durable store.

Amazon SQS:

  • Message Persistence: Messages are stored in the queue until they are processed and deleted by a consumer, or until they expire.
  • Delivery Guarantees: Ensures at least once delivery. With FIFO queues, SQS provides exactly-once processing and message ordering.

Scalability and Performance


Amazon SNS:

  • Scalable: Designed to handle massive numbers of messages and deliver them to large numbers of subscribers.
  • Latency: Typically has very low latency for message delivery.

Amazon SQS:

  • Scalable: Automatically scales to handle large volumes of messages. Suitable for high-throughput applications.
  • Latency: Slightly higher latency compared to SNS due to the nature of pull-based consumption.

Features and Capabilities


Amazon SNS:

  • Multiple Protocols: Supports multiple delivery protocols including HTTP/S, email, SMS, SQS, Lambda, and mobile push notifications.
  • Filtering: Allows message filtering, enabling subscribers to receive only the messages that match their filter policies.

Amazon SQS:

  • Visibility Timeout: Temporarily hides a message from other consumers while it is being processed.
  • Dead-Letter Queues (DLQ): Allows you to handle messages that can't be processed successfully.
  • FIFO Queues: Ensures the order of messages and exactly-once processing.
  • Delay Queues: Postpones the delivery of new messages to consumers for a specified amount of time.


Pricing


Amazon SNS:

  • Pricing Model: Based on the number of requests (publishes, deliveries, and notifications) and data transfer.
  • Cost Efficiency: More cost-effective for scenarios requiring a high number of subscribers and real-time notifications.

Amazon SQS:

  • Pricing Model: Based on the number of requests (send, receive, delete) and data transfer.
  • Cost Efficiency: More cost-effective for decoupling microservices and scenarios requiring message persistence and complex message handling.


Integration and Interoperability


Amazon SNS:

  • Integration: Easily integrates with a wide range of AWS services (e.g., Lambda, SQS, HTTP/S endpoints, etc.).
  • Interoperability: Often used in conjunction with SQS for fan-out scenarios where messages need to be processed asynchronously and stored reliably.

Amazon SQS:

  • Integration: Commonly used to decouple systems and provide reliable message delivery. Often used with other AWS services like Lambda, ECS, and EC2.
  • Interoperability: Can be subscribed to SNS topics to receive messages that need persistent storage or further processing.

Summary


In summary, Amazon SNS is a pub/sub messaging service optimized for real-time notifications and broadcasting messages to multiple subscribers, while Amazon SQS is a message queuing service designed for decoupling distributed systems and ensuring reliable message delivery through persistence and processing guarantees. They are often used together to build scalable, resilient, and flexible messaging architectures in AWS.


Push Notification Service - Amazon Simple Notification Service - AWS

Tuesday, 21 April 2026

Provisioning AWS EKS Cluster with terraform-aws-modules/eks/aws





In this article we want to explore and breakdown its key components and their purposes.

We'd typically use this module like here:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "21.15.1"
  ...
}


Let's explore this module's attributes.

1. Cluster Configuration


name,  version

Sets the name and Kubernetes version for the EKS cluster. Use local and variable values for flexibility.

endpoint_public_access

Set public (Internet) access to the Kubernetes API endpoint (via kubectl). Disable it for enhanced security. 

endpoint_private_access

Set private access to the API endpoint, whether only resources within the VPC can access it. If enabled, it is only reachable from within the VPC (Virtual Private Cloud) where your EKS cluster is deployed. There are few ways to access it:

How to Access the Kubernetes API from VPC

1. Use a Bastion Host or EC2 Instance in the VPC

Launch an EC2 instance (bastion host or jump box) in a subnet within the same VPC as your EKS cluster.
SSH into this instance, and from there, use kubectl to access the cluster.
Alternatively, use SSH port forwarding or a VPN to proxy kubectl commands from your local machine through the bastion.

2. Use AWS Systems Manager (SSM) Session Manager

If your EC2 instances have the SSM agent and the necessary IAM permissions, you can use AWS SSM Session Manager to start a shell session on an instance in the VPC, then run kubectl from there.

3. Use a VPN Connection

Set up a VPN (such as AWS Client VPN or OpenVPN, or Site-to-site VPN for office LAN) that connects your local network to the VPC. Once connected, your local machine will be able to reach the private endpoint.

4. Use AWS PrivateLink (Interface VPC Endpoints)

For advanced scenarios, you can use AWS PrivateLink to expose the Kubernetes API endpoint privately to other VPCs or on-premises networks.


enable_cluster_creator_admin_permissions


If enabled, grants admin permissions to the user who creates the cluster.


2. Logging and Add-ons


enabled_log_types

Enables logging for various Kubernetes components (API, audit, authenticator, controllerManager, scheduler) for monitoring and troubleshooting.

Example:

  enabled_log_types = [
    "api",
    "audit",
    "authenticator",
    "controllerManager",
    "scheduler"
  ]

addons

A dictionary-type attribute which installs and configures essential Kubernetes add-ons. Dictionary keys are addon names like:
  • coredns
  • kube-proxy
  • aws-ebs-csi-driver
  • vpc-cni

Dictionary values are objects which attributes are:
  • most_recent - to set using the latest version (set it to false for version pinning)
  • version - addon version (use it for version pinning)
  • before_compute - set it to true if addon should be installed and set before nodes (compute layer)
  • service_account_role_arn - to configure addon with IAM roles for service accounts, enabling secure integration with AWS services.

Example:

addons = {
    ...
    vpc-cni = {
      most_recent              = false
      version                  = "v1.21.1-eksbuild.7"
      before_compute           = true
      service_account_role_arn = module.k8s_default_vpc_cni_irsa.iam_role_arn
    }
    ...
}

VPC CNI (Container networking interface) is responsible for allocating IP addresses to the Kubernetes nodes and provides networking to pods. The plugin manage network interfaces (ENIs) on the nodes and uses it to assign IP addresses to pods.



3. Networking

We need to integrate the EKS cluster with existing VPC and subnets:

vpc_id 

VPC ID

subnet_ids

Subnets in which nodes (EC2 instances) will be created.
Where your worker nodes (EC2 instances) run.

control_plane_subnet_ids

Where the EKS control plane ENIs (network interfaces) are placed
Defines where the EKS control plane creates its Elastic Network Interfaces (ENIs)

What it controls:
  • The EKS control plane runs in an AWS-managed VPC (you don't see it)
  • To communicate with your worker nodes, it creates ENIs in your VPC
  • These ENIs are placed in the subnets you specify here

Typical configuration:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  
  name = "my-cluster"
  
  # Control plane ENIs go here
  control_plane_subnet_ids = [
    "subnet-private-1a",
    "subnet-private-1b",
    "subnet-private-1c"
  ]
}

Best practices:
  • Usually private subnets
  • Should span multiple AZs for high availability (AWS requires at least 2)
  • Minimum of 2 subnets, maximum of 16
  • Each subnet needs at least 5 available IP addresses

What these ENIs do:
  • Allow the control plane to communicate with worker nodes
  • Allow worker nodes to communicate with the API server
  • Handle API server endpoint traffic


security_group_additional_rules


Adds custom security group rules for the cluster, such as allowing node-to-node communication and VPN access for kubectl.

node_security_group_additional_rules


Further customizes node security groups, allowing all node-to-node traffic and all outbound traffic.



Understanding EKS Architecture

An EKS cluster has two main components:

┌─────────────────────────────────────────────────────────┐
│                    EKS Cluster                          │
│                                                         │
│  ┌───────────────────────────────────────┐              │
│  │   Control Plane (AWS Managed)         │              │
│  │   - API Server                        │              │
│  │   - etcd                              │              │
│  │   - Scheduler                         │              │
│  │   - Controller Manager                │              │
│  │                                       │              │
│  │   Runs in AWS-managed account         │              │
│  └──────────────┬────────────────────────┘              │
│                 │                                       │
│                 │ ENIs in your VPC                      │
│                 │ (control_plane_subnet_ids)            │
│  ┌──────────────▼────────────────────────┐              │
│  │   Your VPC                            │              │
│  │   ┌─────────────────────────────┐     │              │
│  │   │  Worker Nodes (subnet_ids)  │     │              │
│  │   │  - EC2 instances            │     │              │
│  │   │  - Your pods run here       │     │              │
│  │   └─────────────────────────────┘     │              │
│  └───────────────────────────────────────┘              │
└─────────────────────────────────────────────────────────┘

ENI: elastic network interface. It is a logical networking component in a VPC that represents a virtual network card.



4. Node Group Configuration


node_security_group_tags


Adds a tag for Karpenter (an open-source Kubernetes node autoscaler) discovery.

eks_managed_node_group_defaults


Sets default properties for all managed node groups, including:
  • Attaching the CNI policy for networking.
  • Using a specific SSH key.
  • Associating additional security groups.
  • Defining block device mappings for EBS volumes.
  • Attaching the AmazonSSMManagedInstanceCore policy for SSM access.

eks_managed_node_groups


Defines a default managed node group with:
  • A specific AMI type.
  • Desired, minimum, and maximum node counts.
  • Instance types from a variable.
  • On-demand capacity, EBS optimization, and disk size.
  • Custom labels for node identification and environment.

The gold standard for production environments is explicit pinning. This ensures that our infrastructure only changes when we decide to change the code. In order to pin AMI version used in node groups we need to set two attributes:
  • ami_release_version needs to be set. This prevents nodes from cycling unexpectedly during a routine deployment.
  • use_latest_ami_release_version needs to be set to false (without this, terraform plan will still show that it wants to upgrade AMI version, even if we've set ami_release_version)

Example:

  eks_managed_node_groups = {
    "${local.cluster_name}-v1_33" = {
      ...
      ami_release_version            = "1.33.8-20260224"
      use_latest_ami_release_version = false
      ...


5. Tagging


tags


Applies custom tags to all AWS resources created by the module, supporting cost allocation and resource management.


Summary



Our configuration sets up a secure, private, and production-ready EKS cluster with managed node groups, essential add-ons, robust logging, and fine-grained network and IAM controls. It leverages best practices for security (private endpoints, IAM roles for service accounts), scalability (managed node groups, Karpenter tags), and maintainability (modular, versioned, and tagged infrastructure).


---

Wednesday, 15 April 2026

Core Security Practices in DevSecOps & Software Engineering

 


Integrating security into DevOps and software engineering, often called DevSecOps, is a critical shift from treating security as a final checkpoint to embedding it throughout the entire development lifecycle. 

Here are the best security practices, with a specific focus on secrets management and key rotation.

Core Security Practices in DevSecOps & Software Engineering


1. Shift Left


This is the foundational principle of DevSecOps. It means introducing security testing and considerations as early as possible in the software development life cycle (SDLC).
  • Why: It is significantly cheaper and faster to fix a security flaw during the design or coding phase than it is after deployment.
  • Action: Conduct threat modeling during design, use secure coding standards, and run security scans on every code commit.

2. Automate Security Testing


Manual security reviews cannot keep up with the speed of DevOps. Automation is essential.
  • Static Application Security Testing (SAST): Scans your source code for known vulnerabilities (like SQL injection or cross-site scripting) without running the application. Tools: SonarQube, CodeQL.
  • Dynamic Application Security Testing (DAST): Tests the running application from the outside, mimicking an attacker to find runtime vulnerabilities. Tools: OWASP ZAP, Burp Suite.
  • Software Composition Analysis (SCA): Analyzes your application’s dependencies (open-source libraries) for known vulnerabilities. Tools: Snyk, Dependabot, OWASP Dependency-Check.

3. Implement the Principle of Least Privilege (PoLP)


Every user, process, and system should have only the minimum permissions necessary to perform its function.

Action:
  • Developers should not have administrative access to production environments.
  • CI/CD pipelines should use dedicated service accounts with tightly scoped permissions (e.g., a pipeline deploying to a specific AWS S3 bucket should only have s3:PutObject permissions on that bucket).
  • Use Role-Based Access Control (RBAC) to manage permissions.

4. Secure the CI/CD Pipeline


The pipeline itself is a high-value target for attackers. If they compromise the pipeline, they can inject malicious code into your production application.

Action:
  • Lock down pipeline configurations: Require code reviews for any changes to pipeline definition files (e.g., .github/workflows/*.yml).
  • Use code signing: Digitally sign your build artifacts (containers, binaries) to ensure their integrity and origin.
  • Monitor pipeline logs: Look for unauthorized changes or suspicious activity.