Friday, 10 April 2026

How to install and set up Claude Code on macOS + VS Code

 

Let's follow the steps from Quickstart - Claude Code Docs:

% curl -fsSL https://claude.ai/install.sh | bash
Setting up Claude Code...

✔ Claude Code successfully installed!        
                                                                       
  Version: 2.1.100
                                                                       
  Location: ~/.local/bin/claude

  Next: Run claude --help to get started

⚠ Setup notes:
  • Native installation exists but ~/.local/bin is not in your PATH. Run:

  echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc && source ~/.zshrc


✅ Installation complete!



Let's add ~/.local/bin to PATH by appending it to the zsh config and reloading it:

% echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc && source ~/.zshrc


If you use Bash, append the same line to ~/.bashrc instead:

% echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc


Verification:

$HOME/.local/bin is now in $PATH:

% echo $PATH
/Users/bojan/.local/bin:....


Let's check Claude version:

% claude --version
2.1.100 (Claude Code)

Let's also see its CLI arguments:

% claude --help
Usage: claude [options] [command] [prompt]

Claude Code - starts an interactive session by default, use -p/--print for non-interactive output

Arguments:
  prompt                                            Your prompt

Options:
  --add-dir <directories...>                        Additional directories to allow tool access to
  --agent <agent>                                   Agent for the current session. Overrides the 'agent' setting.
  --agents <json>                                   JSON object defining custom agents (e.g. '{"reviewer": {"description": "Reviews code", "prompt": "You are a code
                                                    reviewer"}}')
  --allow-dangerously-skip-permissions              Enable bypassing all permission checks as an option, without it being enabled by default. Recommended only for
                                                    sandboxes with no internet access.
  --allowedTools, --allowed-tools <tools...>        Comma or space-separated list of tool names to allow (e.g. "Bash(git:*) Edit")
  --append-system-prompt <prompt>                   Append a system prompt to the default system prompt
  --bare                                            Minimal mode: skip hooks, LSP, plugin sync, attribution, auto-memory, background prefetches, keychain reads, and
                                                    CLAUDE.md auto-discovery. Sets CLAUDE_CODE_SIMPLE=1. Anthropic auth is strictly ANTHROPIC_API_KEY or apiKeyHelper via
                                                    --settings (OAuth and keychain are never read). 3P providers (Bedrock/Vertex/Foundry) use their own credentials.
                                                    Skills still resolve via /skill-name. Explicitly provide context via: --system-prompt[-file],
                                                    --append-system-prompt[-file], --add-dir (CLAUDE.md dirs), --mcp-config, --settings, --agents, --plugin-dir.
  --betas <betas...>                                Beta headers to include in API requests (API key users only)
  --brief                                           Enable SendUserMessage tool for agent-to-user communication
  --chrome                                          Enable Claude in Chrome integration
  -c, --continue                                    Continue the most recent conversation in the current directory
  --dangerously-skip-permissions                    Bypass all permission checks. Recommended only for sandboxes with no internet access.
  -d, --debug [filter]                              Enable debug mode with optional category filtering (e.g., "api,hooks" or "!1p,!file")
  --debug-file <path>                               Write debug logs to a specific file path (implicitly enables debug mode)
  --disable-slash-commands                          Disable all skills
  --disallowedTools, --disallowed-tools <tools...>  Comma or space-separated list of tool names to deny (e.g. "Bash(git:*) Edit")
  --effort <level>                                  Effort level for the current session (low, medium, high, max)
  --exclude-dynamic-system-prompt-sections          Move per-machine sections (cwd, env info, memory paths, git status) from the system prompt into the first user
                                                    message. Improves cross-user prompt-cache reuse. Only applies with the default system prompt (ignored with
                                                    --system-prompt). (default: false)
  --fallback-model <model>                          Enable automatic fallback to specified model when default model is overloaded (only works with --print)
  --file <specs...>                                 File resources to download at startup. Format: file_id:relative_path (e.g., --file file_abc:doc.txt file_def:img.png)
  --fork-session                                    When resuming, create a new session ID instead of reusing the original (use with --resume or --continue)
  --from-pr [value]                                 Resume a session linked to a PR by PR number/URL, or open interactive picker with optional search term
  -h, --help                                        Display help for command
  --ide                                             Automatically connect to IDE on startup if exactly one valid IDE is available
  --include-hook-events                             Include all hook lifecycle events in the output stream (only works with --output-format=stream-json)
  --include-partial-messages                        Include partial message chunks as they arrive (only works with --print and --output-format=stream-json)
  --input-format <format>                           Input format (only works with --print): "text" (default), or "stream-json" (realtime streaming input) (choices:
                                                    "text", "stream-json")
  --json-schema <schema>                            JSON Schema for structured output validation. Example:
                                                    {"type":"object","properties":{"name":{"type":"string"}},"required":["name"]}
  --max-budget-usd <amount>                         Maximum dollar amount to spend on API calls (only works with --print)
  --mcp-config <configs...>                         Load MCP servers from JSON files or strings (space-separated)
  --mcp-debug                                       [DEPRECATED. Use --debug instead] Enable MCP debug mode (shows MCP server errors)
  --model <model>                                   Model for the current session. Provide an alias for the latest model (e.g. 'sonnet' or 'opus') or a model's full name
                                                    (e.g. 'claude-sonnet-4-6').
  -n, --name <name>                                 Set a display name for this session (shown in /resume and terminal title)
  --no-chrome                                       Disable Claude in Chrome integration
  --no-session-persistence                          Disable session persistence - sessions will not be saved to disk and cannot be resumed (only works with --print)
  --output-format <format>                          Output format (only works with --print): "text" (default), "json" (single result), or "stream-json" (realtime
                                                    streaming) (choices: "text", "json", "stream-json")
  --permission-mode <mode>                          Permission mode to use for the session (choices: "acceptEdits", "auto", "bypassPermissions", "default", "dontAsk",
                                                    "plan")
  --plugin-dir <path>                               Load plugins from a directory for this session only (repeatable: --plugin-dir A --plugin-dir B) (default: [])
  -p, --print                                       Print response and exit (useful for pipes). Note: The workspace trust dialog is skipped when Claude is run with the
                                                    -p mode. Only use this flag in directories you trust.
  --remote-control-session-name-prefix <prefix>     Prefix for auto-generated Remote Control session names (default: hostname)
  --replay-user-messages                            Re-emit user messages from stdin back on stdout for acknowledgment (only works with --input-format=stream-json and
                                                    --output-format=stream-json)
  -r, --resume [value]                              Resume a conversation by session ID, or open interactive picker with optional search term
  --session-id <uuid>                               Use a specific session ID for the conversation (must be a valid UUID)
  --setting-sources <sources>                       Comma-separated list of setting sources to load (user, project, local).
  --settings <file-or-json>                         Path to a settings JSON file or a JSON string to load additional settings from
  --strict-mcp-config                               Only use MCP servers from --mcp-config, ignoring all other MCP configurations
  --system-prompt <prompt>                          System prompt to use for the session
  --tmux                                            Create a tmux session for the worktree (requires --worktree). Uses iTerm2 native panes when available; use
                                                    --tmux=classic for traditional tmux.
  --tools <tools...>                                Specify the list of available tools from the built-in set. Use "" to disable all tools, "default" to use all tools,
                                                    or specify tool names (e.g. "Bash,Edit,Read").
  --verbose                                         Override verbose mode setting from config
  -v, --version                                     Output the version number
  -w, --worktree [name]                             Create a new git worktree for this session (optionally specify a name)

Commands:
  agents [options]                                  List configured agents
  auth                                              Manage authentication
  auto-mode                                         Inspect auto mode classifier configuration
  doctor                                            Check the health of your Claude Code auto-updater. Note: The workspace trust dialog is skipped and stdio servers from
                                                    .mcp.json are spawned for health checks. Only use this command in directories you trust.
  install [options] [target]                        Install Claude Code native build. Use [target] to specify version (stable, latest, or specific version)
  mcp                                               Configure and manage MCP servers
  plugin|plugins                                    Manage Claude Code plugins
  setup-token                                       Set up a long-lived authentication token (requires Claude subscription)
  update|upgrade                                    Check for updates and install if available


And finally, let's launch it:

% claude
Welcome to Claude Code v2.1.100
…………………………………………………………………………………………………………………………………………………………

     *                                       █████▓▓░
                                 *         ███▓░     ░░
            ░░░░░░                        ███▓░
    ░░░   ░░░░░░░░░░                      ███▓░
   ░░░░░░░░░░░░░░░░░░░    *                ██▓░░      ▓
                                             ░▓▓███▓▓░
 *                                 ░░░░
                                 ░░░░░░░░
                               ░░░░░░░░░░░░░░░░
       █████████                                        *
      ██▄█████▄██                        *
       █████████      *
…………………█ █   █ █………………………………………………………………………………………………………………

 Let's get started.

 Choose the text style that looks best with your terminal
 To change this later, run /theme

 ❯ 1. Dark mode ✔
   2. Light mode
   3. Dark mode (colorblind-friendly)
   4. Light mode (colorblind-friendly)
   5. Dark mode (ANSI colors only)
   6. Light mode (ANSI colors only)

 ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
  1  function greet() {
  2 -  console.log("Hello, World!");                                   
  2 +  console.log("Hello, Claude!");                                  
  3  }
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
  Syntax theme: Monokai Extended (ctrl+t to disable)

After that we need to select a login method:


 ❯ 1. Claude account with subscription · Pro, Max, Team, or Enterprise

   2. Anthropic Console account · API usage billing

   3. 3rd-party platform · Amazon Bedrock, Microsoft Foundry, or Vertex AI


Option 1 - Claude accounts are for the consumer/Pro web interface (claude.ai), which is seat-based.

Option 2 - Anthropic Console account should be selected if your organization is on an API plan (pay-as-you-go billing based on token usage). Anthropic Console (platform.claude.com) is the hub for managing API keys, billing, and developer organizations.

Option 3 - 3rd-party platforms are only for when you want to route Claude's "brain" through your existing AWS (Bedrock) or Google Cloud (Vertex) bills.


After selecting Anthropic Console, you'll be taken to a page which shows the following:


Claude Code would like to connect to your Anthropic organization MYORG

YOUR ACCOUNT WILL BE USED TO:
    • Generate API keys on your behalf
    • Access your Anthropic profile information
    • Upload files on your behalf

Logged in as user@myorg.com
Switch account


After clicking the Authorize button, you'll be redirected to a page which shows:


Build something great
You’re all set up for Claude Code.

You can now close this window.


Back in terminal, you'll see:

Logged in as user@myorg.com                                           
Login successful. Press Enter to continue…   

After pressing Enter:

 Security notes:                                                        
 1. Claude can make mistakes                                          
    You should always review Claude's responses, especially when       
    running code.                                                                                                                             
 2. Due to prompt injection risks, only use it with code you trust    
    For more details see:                                              
    https://code.claude.com/docs/en/security
                                                                
 Press Enter to continue…   


After pressing Enter again:

Use Claude Code's terminal setup?                                                                                                   
 For the optimal coding experience, enable the recommended settings    
 for your terminal: Shift+Enter for newlines                            
 ❯ 1. Yes, use recommended settings                                    
   2. No, maybe later with /terminal-setup                                                                                                     
 Enter to confirm · Esc to skip   


After choosing 1 - recommended settings:

 Accessing workspace:
                                                     
 /Users/bojan/path/to/project
                                               
 Quick safety check: Is this a project you created or one you trust? (Like your own code, a well-known open source project, or work from your team). If not, take a
 moment to review what's in this folder first.         
                                                      
 Claude Code'll be able to read, edit, and execute files here.
                                                          
 Security guide                                           
                                                         
 ❯ 1. Yes, I trust this folder            
   2. No, exit         
                                                          
 Enter to confirm · Esc to cancel


After selecting 1:

╭─── Claude Code v2.1.100───────────────────────────────────────────────────────────────────────────────────────╮
│                                            │ Tips for getting started                                          
│             Welcome back User!             │ Run /init to create a CLAUDE.md file with instructions for Claude│
│                                            │ ─────────────────────────────────────────────────────────────────│
│                   ▐▛███▜▌                  │ Recent activity                                                  │
│                  ▝▜█████▛▘                 │ No recent activity                                               │
│                    ▘▘ ▝▝                   │                                                                  │
│                                            │                                                                  │
│   Sonnet 4.6 · API Usage Billing · MYORG   │                                                                   
│   ~/…/path/to/project                      │                                                                   
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
                                                          
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
❯                                         
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  ? for shortcuts                                                                                                                                       ● high · /effort
                                                                                                                                                                          
                                    
We can now run various commands, like:

────────────────────────────────────────────────────────────
❯ /stats                                  
────────────────────────────────────────────────────────────
/stats                  Show your Claude Code usage statistics and activity            
/status                 Show Claude Code status including version, model, account, API connectivity, and tool statuses  
/statusline             Set up Claude Code's status line UI
/ide                    Manage IDE integrations and show status    


If we execute /stats at this point, the output will show:

❯ /stats                                                                
────────────────────────────────────────────────────────────   
Status   Config   Usage   Stats                                        

No stats available yet. Start using Claude Code!   


In my case, the Status tab showed, among other things:

  IDE: ✘ Error installing VS Code extension: 1: Command failed with ERR_STREAM_PREMATURE_CLOSE: code --force --install-extension anthropic.claude-code
       Premature close
       Please restart your IDE and try again.


I restarted VS Code to no avail. I then manually installed the Claude Code for VS Code extension and restarted VS Code, but the same error appeared again. There is a related bug report, still open: [BUG] Claude code VS Code extension error in MacOS · Issue #34639 · anthropics/claude-code

If we try /cost:

❯ /stats 
  ⎿  Status dialog dismissed

❯ /cost                                                                
  ⎿  Total cost:            $0.0000
     Total duration (API):  0s                                        
     Total duration (wall): 1h 16m 21s                                 
     Total code changes:    0 lines added, 0 lines removed            
     Usage:                 0 input, 0 output, 0 cache read, 0 cache write



Wednesday, 8 April 2026

Model Context Protocol (MCP)

 


The Model Context Protocol (MCP):
  • Open-source standard
  • Enables AI models to seamlessly connect with external data sources, tools, and software systems
  • Acts as a universal "USB-C port" for AI, allowing LLMs to securely access local files, databases, and APIs to enhance context-aware responses. 
  • Introduced by Anthropic in late 2024

Key Aspects of MCP:
  • Purpose: Replaces fragmented, custom integrations with a single, open standard, making it easier to connect AI assistants to enterprise data, tools, and development environments.
  • Components: Consists of:
    • MCP Clients - components inside the host, each maintaining a connection to one server
    • MCP Hosts - AI apps like Claude or coding agents 
    • MCP Servers - programs that bridge specific data sources
  • Security: MCP supports secure, two-way connections, allowing developers to control exactly what data is exposed to the AI.
  • Functionality: Enables models to read files, query databases, use search engines, and call external APIs, providing live, relevant context for tasks.
  • Open Standard: Hosted by the Linux Foundation, the protocol is designed for broad industry adoption. 

MCP differs from RAG (Retrieval-Augmented Generation) by focusing on active, two-way interaction with systems, whereas RAG is focused on retrieving documents for context.

For developers, it provides SDKs in Python and TypeScript.


MCP clients



MCP clients are the components within AI applications (AI Hosts) that manage one-to-one connections with MCP servers, translating AI requests into protocol-standardized messages. 

Popular MCP Client Applications


Several major AI-powered platforms and editors have integrated MCP client support to allow users to pull in their own tools and context: 
  • Claude Desktop: Anthropic’s flagship app provides a built-in interface for managing local and remote MCP servers (e.g., Google Drive, Slack, GitHub).
  • Cursor: An AI-native code editor that uses MCP to give its internal AI models direct access to project files, local databases, and custom developer tools.
  • Windsurf Editor: A developer environment that supports tool invocation through MCP servers, allowing it to seamlessly interact with external scripts and APIs during coding sessions.
  • Visual Studio Code (Agent Mode): Developers can use extensions to register MCP servers, enabling chat assistants to interact with internal enterprise tools directly within the editor.
  • JetBrains IDEs: Platforms like IntelliJ IDEA feature an MCP-client UI where users can paste server configurations to bring external tool catalogues into the AI Assistant pane.
  • BeeAI: An open-source desktop AI assistant from IBM that supports tool integration via built-in or custom MCP servers. 

Core Client Features


In the MCP architecture, clients don't just consume data; they provide specific features that enable complex, "agentic" workflows: 
  • Sampling: Allows a server to request that the client (and its LLM) generate a completion, enabling the AI to "ask back" for clarification or more information.
  • Elicitation: Provides a structured way for servers to ask users for specific information (like a password or preference) through the client’s UI.
  • Roots: Allows the client to define specific file system boundaries, telling servers which directories they are permitted to access for safety and context scoping. 
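As a rough illustration of the roots feature, here's what a client's answer to a server's roots/list request might look like on the wire. This is a sketch of the JSON-RPC message shape; the project URI is a made-up example.

```python
import json

# Sketch of a JSON-RPC 2.0 response a client could send when a server
# calls roots/list: the client declares which directories servers may
# access. The file URI below is a hypothetical example.
roots_response = {
    "jsonrpc": "2.0",
    "id": 1,  # matches the id of the server's roots/list request
    "result": {
        "roots": [
            {"uri": "file:///Users/bojan/path/to/project", "name": "project"},
        ]
    },
}

print(json.dumps(roots_response, indent=2))
```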

Developer SDKs for Building Clients 


If you are building your own AI application, you can implement an MCP client using official software development kits: 
  • TypeScript/Node.js SDK: The most comprehensive SDK for building web-based or local Node.js clients using standard JSON-RPC 2.0.
  • Python SDK: Ideal for data science or automation workflows, allowing easy integration with Python-based AI agents.


MCP Hosts


Model Context Protocol (MCP) hosts are AI applications or environments that manage, connect, and run MCP clients to interact with external data sources (servers). Key examples include Claude Desktop, Claude Code, Zed, Replit, Cursor, and Sourcegraph. 

Key Roles of MCP Hosts:
  • Orchestration: Act as the container for AI models to coordinate and communicate with multiple MCP servers.
  • Security & Permissions: Manage user consent and enforce security policies (e.g., in sandboxed iframes) to control what data servers can access.
  • Workflow Management: Aggregate user prompts, conversation state, and tool results from various sources to enable seamless interaction.
  • Session Control: Initialize, maintain, and terminate connections between LLMs and tools via the MCP client. 
Primary MCP Host Examples:
  • Claude Desktop: Anthropic's desktop application is a primary implementation of an MCP host, allowing the Claude model to utilize local tools and data.
  • Claude Code: An AI agentic coding tool that acts as a host to interact with development-focused MCP servers.
  • AI IDEs (Zed, Cursor, Replit): Integrated Development Environments that use MCP to connect coding assistants with files, documentation, and databases.
  • Platforms & Agents: Other platforms like Workato or custom agents that orchestrate data access via the protocol.

What's the difference between an MCP host and an MCP client?

In the Model Context Protocol (MCP), the Host is the overall application you interact with, while the Client is the specific component inside that application that handles the technical connection to a server.


Key Differences at a Glance

  • What it is
    • MCP Host: The end-user application (e.g., Claude Desktop, Cursor).
    • MCP Client: A protocol handler or "adapter" inside the host.
  • Primary Role
    • MCP Host: Manages the user interface and coordinates multiple clients.
    • MCP Client: Maintains a 1:1 connection with a single MCP server.
  • Responsibility
    • MCP Host: Security policies, user consent, and aggregating data for the AI model.
    • MCP Client: Translating protocol messages (JSON-RPC) between the host and server.
  • Hierarchy
    • MCP Host: A single Host can contain multiple Clients.
    • MCP Client: A Client is a subsidiary of the Host.

The "Restaurant" Analogy


To make it simpler, imagine a restaurant setting: 
  • The Host is the Executive Chef: They decide what needs to be cooked and oversee everything, but they don't leave the kitchen to buy ingredients.
  • The Client is the Waiter: They take the Chef's specific order, run to the source (the Server), and bring back exactly what was requested in a format the Chef can use. 

Why the distinction matters


While you will often hear people refer to applications like Claude Desktop as "the client," technically they are hosts. This architecture allows one app to connect to many different data sources (like Google Drive, Slack, and local files) simultaneously by instantiating a separate client for each one.


What is an MCP Client in Claude Desktop?



In Claude Desktop, the MCP client is the internal software layer that allows the app to "talk" to the tools you've added. While you might call the whole app "the client," it actually functions as a host that manages multiple individual client connections. 

How it works in Claude Desktop

  • The Translator: When you ask Claude to "read a file," the client translates that human request into a technical JSON-RPC message that the Filesystem server understands.
  • The Connection Manager: Claude Desktop can run several clients at once. For example, one client might be connected to a GitHub server while another is connected to a Google Drive server.
  • Permission Gatekeeper: The client facilitates the security handshake. Before a tool executes, the client triggers the UI popup in Claude Desktop asking for your explicit permission. 
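To make the "translator" role concrete, here's a toy sketch of how a client might turn "read a file" into a JSON-RPC 2.0 tools/call message. The tool name and argument key are illustrative, not the Filesystem server's exact schema, and real clients use the official SDKs rather than hand-rolled JSON.

```python
import json

def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Wrap a tool invocation in a JSON-RPC 2.0 request, as an MCP client does."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# "Read my notes file" becomes a structured message a server can parse.
# Tool name and argument key are hypothetical examples.
msg = build_tool_call(42, "read_file", {"path": "/Users/bojan/notes.txt"})
print(msg)
```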

How to see them

You can see your active MCP clients and their available tools by clicking the "hammer" or "plug" icon (the MCP server indicator) in the bottom-right corner of the chat input box. 

Configuration

Claude Desktop's clients are configured via a local JSON file (claude_desktop_config.json). This file tells the internal clients exactly how to launch and communicate with your servers.

Config File Location:
  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
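A minimal claude_desktop_config.json might look like this; the Filesystem server package and the directory path are just examples:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/Users/bojan/Documents"
      ]
    }
  }
}
```

Each entry under mcpServers becomes one client connection: Claude Desktop launches the command and speaks JSON-RPC to it over stdio.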


MCP Servers



Model Context Protocol (MCP) servers:
  • Lightweight programs that connect AI models (like Claude or ChatGPT) to external data sources and tools, such as local files, databases, GitHub, or Slack
  • Provide a standardized interface, enabling AI agents to securely access, read, and manipulate data beyond their training sets

Key Aspects of MCP Servers:
  • Functionality: They expose specific capabilities—resources, prompts, and tools—to AI applications.
  • Use Cases: Common implementations include file system access for documentation, database querying, and API interactions for services like GitHub or Google Tasks.
  • Security: They provide controlled, authorized access to local or remote resources, with user permission required for actions.
  • Architecture: As part of the Model Context Protocol, they act as the "server" in a client-server model, connecting to "hosts" like desktop apps. 

Common MCP Server Examples:
  • Local File System: Allows AI to read, write, and organize local documents.
  • GitHub/GitLab: Enables AI to manage repositories, create issues, and pull code.
  • Database/API Connectors: Connects AI to SQL databases, HubSpot CRM, or AWS services.
  • Developer Tools: Includes servers for Terraform, Angular CLI, and Home Assistant. 

You can build your own MCP server using Python or TypeScript, often utilizing tools like uv for environment setup.
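To show the core idea behind a server, here's a toy stdlib-only sketch of the dispatch step: an incoming JSON-RPC tools/call request is routed to a local capability and the result is wrapped in a response. A real server should use the official Python SDK (the mcp package); the greet tool and the exact result shape here are illustrative.

```python
import json

def greet(name: str) -> str:
    return f"Hello, {name}!"

# The capabilities this toy server exposes, keyed by tool name.
TOOLS = {"greet": greet}

def handle_request(raw: str) -> str:
    """Dispatch a JSON-RPC tools/call request to a local tool (toy version)."""
    req = json.loads(raw)
    tool = TOOLS[req["params"]["name"]]
    result = tool(**req["params"]["arguments"])
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req["id"],
        "result": {"content": [{"type": "text", "text": result}]},
    })

request = json.dumps({
    "jsonrpc": "2.0", "id": 1,
    "method": "tools/call",
    "params": {"name": "greet", "arguments": {"name": "Claude"}},
})
print(handle_request(request))
```

The real SDKs hide this plumbing behind decorators, so you only write the tool functions themselves.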


How to start using Agentic AI in DevOps and Platform Engineering



The next frontier of DevOps and Platform Engineering is Agentic AI. To move beyond simple automation toward self-optimizing ecosystems that scale with confidence, innovation, and enterprise governance, we need to learn how autonomous agents reason and adapt to reduce cognitive load and accelerate the SDLC.


We should be able to:
  • Explain the shift from automation to agentic AI and articulate what makes an AI system truly "agentic"
  • Design agent-aware workflows in GitHub Actions, integrating LLMs with events, logs, APIs, and quality gates to create intelligent CI/CD pipelines
  • Build AI-powered diagnostic loops that ingest failure context, reason about root causes, and generate structured remediation proposals or self-healing fixes
  • Implement intelligent release decisions using multi-signal quality gates (test coverage, performance, security, cost) and generate auditable release rationale reports
  • Deploy our own end-to-end platform engineering agent, capable of diagnosing pipeline failures, evaluating release readiness, and autonomously opening a fix PR or escalating with structured context.
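As a sketch of what a diagnostic loop's "structured remediation proposal" could look like, here's a toy classifier that matches a failure log against known signatures. The patterns, fields, and confidence values are invented for illustration; a real agent would put an LLM behind this interface and use the structured output to open a fix PR or escalate.

```python
import re

# Known failure signatures mapped to structured remediation proposals.
# Patterns and proposal fields are hypothetical examples.
KNOWN_FAILURES = [
    (re.compile(r"OOMKilled|out of memory", re.I),
     {"root_cause": "memory exhaustion", "action": "raise memory limit", "confidence": 0.8}),
    (re.compile(r"connection refused|timed? ?out", re.I),
     {"root_cause": "dependency unavailable", "action": "retry with backoff", "confidence": 0.6}),
]

def diagnose(log: str) -> dict:
    """Return a remediation proposal for a known failure, or escalate to a human."""
    for pattern, proposal in KNOWN_FAILURES:
        if pattern.search(log):
            return {"status": "proposal", **proposal}
    return {"status": "escalate", "reason": "no known signature matched"}

print(diagnose("Step 3 failed: container OOMKilled after 512Mi"))
```

The escalation branch is the important design choice: the agent self-heals only on patterns it recognizes and hands everything else to a human with context attached.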

Learning while Doing
  • Identify Platform engineering pain points and the AI opportunity
    • How can we get from static scripts and CI/CD automation to agentic AI
    • Make a comparison of manual vs. AI-driven diagnosis
    • Understand how platform engineering is evolving from static automation toward AI-driven systems that proactively diagnose and resolve operational issues
  • Agentic AI fundamentals - how agents reason and act
    • Learn about core agent components (LLMs, memory, and tools)
    • Compare event-driven vs. polling architectures
    • Balance autonomous actions with human oversight
    • Understand how agentic systems combine reasoning, memory, and tools to perceive events, make decisions, and act within engineering workflows
  • How to set up the environment and create our first agentic workflow
    • Set up an agentic runtime that responds to CI/CD events
    • Connect an AI agent to our pipeline's event stream and context
    • Trigger our first agent run and interpret its reasoning logs
    • Learn how to connect AI agents to CI/CD events and platform context to trigger automated reasoning and actions in real time
  • AI-powered diagnosis and remediation
    • Compare manual vs. AI-driven incident diagnosis 
    • Build agents that read logs, reason about failures, and propose fixes 
    • Define escalation boundaries: when the agent self-heals vs. asks a human
    • Understand how AI agents analyze logs, diagnose failures, and determine whether to self-heal or escalate issues to humans
  • Intelligent CI/CD & adaptive delivery
    • Move beyond pass/fail pipelines to AI-driven release decisions
    • Automate rollback decisions using AI quality gates
    • Query pipeline state and release history using natural language
    • How AI transforms CI/CD pipelines into adaptive systems that make context-aware release and rollback decisions
  • Operational intelligence & conversational observability
    • Replace complex dashboards with AI anomaly detection
    • Check platform health via chat interfaces
    • Shift from reactive alerts to predictive management
    • Understand how AI enables conversational access to platform health and detects anomalies to support proactive operations.
  • Multi-agent coordination & implementation strategy
    • Architect multi-agent systems for our platform workflows
    • Handle agent conflicts, failures, and graceful degradation 
    • Design a phased enterprise rollout with guardrails and audit trails
    • Learn how to design coordinated multi-agent systems that handle complex platform workflows with governance and reliability
  • Build our platform engineering agent
    • Wire together diagnosis, quality gates, and observability into one agent pipeline
    • Implement self-healing PRs with confidence thresholds
    • Shift our role from platform operator to AI supervisor
    • Learn how to combine diagnosis, delivery intelligence, and observability into a unified agent that automates key platform workflows

---

Thursday, 2 April 2026

Kubernetes StatefulSet

 

In Kubernetes, a StatefulSet is a specialized workload API object designed to manage stateful applications. Unlike standard Deployments, where Pods are interchangeable "cattle," StatefulSets treat Pods as unique "pets" with a persistent identity that is maintained even if they are rescheduled or restarted.

Key Features

  • Stable Network Identity: Each Pod is assigned a unique, ordinal index (e.g., web-0, web-1) and a corresponding stable DNS name through a Headless Service.
  • Stable Storage: By using volumeClaimTemplates, each Pod is automatically paired with its own PersistentVolume. If a Pod dies, the replacement Pod with the same identity will automatically remount the same storage.
  • Ordered Deployment: Pods are created and scaled sequentially from 0 to N-1. Kubernetes ensures that the previous Pod is "Running and Ready" before starting the next one.
  • Ordered Termination: Scaling down or deleting the StatefulSet occurs in reverse order, starting from the highest ordinal (e.g., web-2 is deleted before web-1). 


When to Use StatefulSets


StatefulSets are the standard choice for applications that require consistent data and unique identities, such as: 
  • Databases: Systems like MySQL, PostgreSQL, MongoDB, and Cassandra.
  • Distributed Systems: Tools like ZooKeeper, Kafka, and Elasticsearch that need a quorum or master election.
  • Clustered Applications: Any software where instances need to know each other’s specific addresses to sync data. 

Comparison: StatefulSet vs. Deployment


Feature        StatefulSet                                Deployment
-------        -----------                                ----------
Pod Identity   Unique and stable (ordinal names)          Randomly generated and ephemeral
Storage        Dedicated volume per Pod (via template)    Typically shared or transient
Network        Fixed DNS per Pod (via Headless Service)   Single Load Balancer for the whole set
Scaling        Sequential (0, then 1, then 2...)          Parallel (multiple Pods at once)


Best Practices

  • Use Headless Services: Always pair our StatefulSet with a Service that has clusterIP: None to ensure Pods are individually addressable.
  • Persistent Storage: Ensure our StorageClass is correctly configured for dynamic provisioning so that each Pod gets its own disk automatically.
  • Manual Data Sync: Note that while Kubernetes manages the infrastructure, we are still responsible for configuring internal application logic like data replication or master/slave sync. 


YAML manifest example for a basic MySQL StatefulSet


Below is a standard YAML manifest for a MySQL StatefulSet. It includes a Headless Service for network identity and a volumeClaimTemplate to automatically provision unique storage for each replica.

apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
  - port: 3306
    name: mysql
  clusterIP: None # Defines this as a Headless Service
  selector:
    app: mysql
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: "mysql"
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: "password" # Use Secrets in production!
        ports:
        - containerPort: 3306
          name: mysql
        volumeMounts:
        - name: mysql-data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi


Why this works:
  • Stable DNS: Each Pod gets a predictable name: mysql-0.mysql, mysql-1.mysql, etc.
  • Unique Storage: Kubernetes creates three separate PersistentVolumeClaims. mysql-0 will always mount the first disk, even after a reboot.
  • Ordered Startup: Pods launch one after another (0, then 1, then 2), which is critical for forming database clusters.
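To see the ordering and per-Pod storage in action, we can apply the manifest and watch the resources appear (the filename is an assumption; outputs omitted since they depend on the cluster):

```
% kubectl apply -f mysql-statefulset.yaml   # assumed filename for the manifest above
% kubectl get pods -l app=mysql -w          # pods start in order: mysql-0, mysql-1, mysql-2
% kubectl get pvc                           # one PVC per Pod: mysql-data-mysql-0, mysql-data-mysql-1, ...
```

PVC names follow the pattern `<volumeClaimTemplate name>-<pod name>`, which is why `mysql-0` always reattaches to the same disk.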


To use a Kubernetes Secret (like mysql-secret) instead of hardcoding passwords, we need to create a Secret object and then reference it in our StatefulSet. This is the standard practice for distributing credentials securely in Kubernetes.

1. Create the Secret

We can define our password in a YAML file. Note that values in the data field must be base64 encoded.
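A quick way to produce the encoded value is the shell's base64 utility:

```shell
# Encode the password for the Secret's data field.
# -n matters: without it, echo appends a newline that gets encoded too.
echo -n 'password' | base64
# → cGFzc3dvcmQ=
```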

apiVersion: v1
kind: Secret
metadata:
  name: mysql-secret
type: Opaque
data:
  # 'password' encoded in base64 is 'cGFzc3dvcmQ='
  root-password: cGFzc3dvcmQ=


Alternatively, we can use stringData to provide the password in plain text; Kubernetes will handle the encoding for us when we apply it:

apiVersion: v1
kind: Secret
metadata:
  name: mysql-secret
type: Opaque
stringData:
  root-password: "our-secure-password"

2. Update the StatefulSet

Modify the env section of our MySQL container to use valueFrom and secretKeyRef. This tells the Pod to pull the value of MYSQL_ROOT_PASSWORD from the secret we just created.

 containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secret   # Name of our Secret object
              key: root-password   # The specific key inside the Secret


Key Considerations

  • Initialization Only: For MySQL, the MYSQL_ROOT_PASSWORD environment variable is typically only used during the first-time initialization of the data directory. Changing the Secret later will not automatically update the root password in an existing database.
  • Security: Ensure our cluster has encryption at rest enabled for Secrets to truly protect sensitive data.
  • Alternative for Multiple Variables: If we have many credentials (user, password, DB name), we can use envFrom to map all keys in a Secret to environment variables at once.
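For reference, the envFrom variant looks like this. Note that each Secret key becomes an environment variable of the same name, so the keys must already match what the container expects (e.g. MYSQL_ROOT_PASSWORD rather than root-password):

```
containers:
- name: mysql
  image: mysql:8.0
  envFrom:
  - secretRef:
      name: mysql-secret   # every key in this Secret becomes an env var
```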

Changing Storage Spec


Changing spec.volumeClaimTemplates updates the StatefulSet template but does not resize PVCs that already exist. If the goal is to fix an existing CrashLoopBackOff caused by a full disk, we still need to expand the current PVC(s) directly (and ensure the StorageClass has allowVolumeExpansion: true), or recreate the PVC/StatefulSet so the new size takes effect.
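Expanding an existing PVC is a direct edit of its storage request. A sketch, assuming the mysql-data-mysql-0 claim from the manifest above and a StorageClass that permits expansion:

```
% kubectl get sc                       # check the ALLOWVOLUMEEXPANSION column
% kubectl patch pvc mysql-data-mysql-0 \
    -p '{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}'
```

Depending on the volume plugin and filesystem, the resize may complete online or may require a Pod restart to finish.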


Thursday, 19 March 2026

Amazon EBS CSI Driver



The Amazon EBS CSI Driver is a standard interface that allows Amazon Elastic Kubernetes Service (EKS) clusters to manage the full lifecycle of Amazon EBS volumes as persistent storage for containers. It replaces the older, deprecated "in-tree" Kubernetes storage plugin with a more flexible, decoupled model. 

Key Features

  • Dynamic Provisioning: Automatically creates and attaches EBS volumes when a PersistentVolumeClaim (PVC) is made.
  • Volume Lifecycle Management: Handles the creation, attachment, mounting, and deletion of volumes.
  • Resizing & Snapshots: Supports online volume resizing (for gp3 and others) and taking volume snapshots for data backup.
  • EKS Auto Mode Support: In EKS Auto Mode, routine block storage tasks are automated, and you don't even need to manually install the driver. 

Deployment Methods


You can install and manage the driver through several channels: 
  • EKS Managed Add-on (Recommended): Simplifies installation and updates via the AWS Console, CLI, or Terraform.
  • Helm Chart: Provides highly customizable installation options.
  • Kustomize: Direct deployment using manifests from the official GitHub repository. 
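The managed add-on route is a single AWS CLI call. The cluster name, account ID, and role name below are placeholders for our own values:

```
% aws eks create-addon \
    --cluster-name my-cluster \
    --addon-name aws-ebs-csi-driver \
    --service-account-role-arn arn:aws:iam::111122223333:role/AmazonEKS_EBS_CSI_DriverRole
```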

Core Requirements

  • IAM Permissions: The driver requires an IAM role with the AmazonEBSCSIDriverPolicy to interact with EBS resources.
  • Service Accounts: Typically uses IAM Roles for Service Accounts (IRSA) to securely provide AWS credentials to the driver pods.
  • Compatibility: Supports Linux and Windows worker nodes, as well as ARM64 architectures.
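With eksctl, the IRSA role for the driver's service account can be created in one step. A sketch, with the cluster and role names as placeholders; the policy ARN is the managed AmazonEBSCSIDriverPolicy:

```
% eksctl create iamserviceaccount \
    --name ebs-csi-controller-sa \
    --namespace kube-system \
    --cluster my-cluster \
    --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
    --approve \
    --role-only \
    --role-name AmazonEKS_EBS_CSI_DriverRole
```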

Driver Components


The driver is typically deployed into the kube-system namespace and consists of two main parts: 
  • Controller Deployment: Runs as a set of replicas (ebs-csi-controller) to communicate with the AWS EC2 API and manage volume operations.
  • Node DaemonSet: Runs on every worker node (ebs-csi-node) to handle the actual mounting and unmounting of volumes to pods on that specific host. 

In the Amazon EBS CSI driver architecture, the ebs-csi-controller and ebs-csi-node are the two primary components that work together to manage the lifecycle of EBS volumes in a Kubernetes cluster.

Core Feature Differences


ebs-csi-controller

  • Deployment Type: 
    • Deployment (typically 2 replicas for HA)
  • Main Function: 
    • Control Plane operations: Creating, deleting, attaching, and detaching volumes
  • AWS Interaction: 
    • Calls the AWS EC2 API to manage EBS resources
  • IAM Permissions: 
    • Requires an IAM role with permissions like ec2:CreateVolume and ec2:AttachVolume

ebs-csi-node

  • Deployment Type: 
    • DaemonSet (runs on every worker node)
  • Main Function: 
    • Node-level operations: Mounting and unmounting volumes to the local filesystem
  • AWS Interaction: 
    • Interacts with the local OS (privileged system calls) to handle block devices
  • IAM Permissions: 
    • Generally requires fewer/no AWS API permissions, as it mostly performs local mount actions

How They Work Together
  • Provisioning & Attachment: When you create a PersistentVolumeClaim (PVC), the ebs-csi-controller watches the request and calls the AWS API to create the EBS volume and attach it to the correct EC2 instance.
  • Mounting: Once the volume is physically attached to the EC2 instance, the ebs-csi-node pod running on that specific node detects the new block device and mounts it into the container’s path so your application can use it. 
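The provisioning flow above is driven by a StorageClass that names the EBS CSI driver as its provisioner. A minimal sketch (the class name is illustrative; the provisioner string ebs.csi.aws.com is the driver's actual name):

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3                      # illustrative name
provisioner: ebs.csi.aws.com         # hands PVCs for this class to the EBS CSI driver
parameters:
  type: gp3
allowVolumeExpansion: true           # enables online resizing
volumeBindingMode: WaitForFirstConsumer
```

WaitForFirstConsumer delays volume creation until a Pod is scheduled, so the EBS volume lands in the same Availability Zone as the node.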

Key Considerations
  • Security: For better security, you can schedule the ebs-csi-controller on hardened management nodes, while the ebs-csi-node must run everywhere your workloads need storage.
  • Fargate: You can run the controller on Fargate nodes, but the ebs-csi-node (as a DaemonSet) only runs on EC2 instances.
  • Troubleshooting: If a volume fails to "attach," check the controller logs; if it fails to "mount" or "format," check the node logs.
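Following that attach-vs-mount split, the logs for each side can be pulled by label (the legacy app labels and the ebs-plugin container name match standard deployments, but verify them in your cluster):

```
% kubectl logs -n kube-system -l app=ebs-csi-controller -c ebs-plugin   # attach/detach issues
% kubectl logs -n kube-system -l app=ebs-csi-node -c ebs-plugin         # mount/format issues
```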

Pods for both the ebs-csi-controller and ebs-csi-node typically share the same value for the app.kubernetes.io/name label. 

In standard deployments (such as via the official Helm chart or EKS add-on), both components use this label to identify that they belong to the same overarching application: the Amazon EBS CSI Driver.

Label Comparisons


While they share the same application name, they use the app.kubernetes.io/component label to distinguish between their specific roles.

Label Key                     ebs-csi-controller Pods      ebs-csi-node Pods
---------                     -----------------------      -----------------
app.kubernetes.io/name        aws-ebs-csi-driver           aws-ebs-csi-driver
app.kubernetes.io/instance    aws-ebs-csi-driver           aws-ebs-csi-driver
app.kubernetes.io/component   csi-driver (or controller)   csi-driver (or node)
app (legacy label)            ebs-csi-controller           ebs-csi-node


How to Verify in Your Cluster


You can check these labels yourself using kubectl. This is useful if you are writing Prometheus rules or network policies that need to target the entire driver or just one part of it. 

To see labels for all EBS CSI pods:

kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver --show-labels

To target only the controller:

kubectl get pods -n kube-system -l app=ebs-csi-controller


---