Thursday, 5 February 2026

Horizontal Pod Autoscaler (HPA)

 


The Horizontal Pod Autoscaler (HPA) serves to automatically align your application's capacity with its real-time demand by adjusting the number of pod replicas. Its operation depends on several critical components and configurations within an EKS or Kubernetes cluster. 

The Horizontal Pod Autoscaler (HPA) is a built-in Kubernetes controller:
  • the standard Kubernetes autoscaling mechanism
  • available out of the box (i.e., without installing third-party controllers)

It's "standard" in the sense that the feature is built into the Kubernetes control plane, but it isn't "automatic" in the sense that it guesses which of your apps need scaling.

Think of it like a Thermostat: The thermostat (HPA Controller) is already installed on the wall (EKS Control Plane), but it won't turn on the AC until you tell it what the Target Temperature (CPU/Memory threshold) is and which room (Deployment) to monitor.

Here is why a manifest is required for every app:

1. The Controller vs. The Resource

The Controller (The "How"): This is a loop running inside the EKS Control Plane. It is always active, waiting for instructions. Kubernetes HPA Documentation explains this loop.
The Resource (The "What"): The HPA Manifest is that instruction. It tells the controller: "Watch Deployment X, keep CPU at 50%, and don't go above 10 pods."

2. Manual Intent

Kubernetes follows a Declarative Model. It never assumes you want to scale. If it scaled every pod automatically, a single bug in your code (like an infinite loop) could scale your cluster to 1,000 nodes and drain your AWS budget instantly. You must explicitly opt-in by creating the HPA resource.

3. Unique Criteria for Every App

Not all apps scale the same way:
  • Web API: Might scale when CPU hits 70%.
  • Background Worker: Might scale based on Memory usage.
  • Data Processor: Might scale based on a Custom Metric like SQS queue depth (see the metric blocks sketched below).
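For illustration, here is roughly how each of those criteria looks as a metric block in an autoscaling/v2 HPA. Each app would normally have its own HPA with just the relevant block, and the External metric assumes a metrics adapter (e.g. KEDA or the CloudWatch adapter) exposes a queue-depth metric; the metric name is hypothetical.

metrics:
# Web API: scale on CPU utilization
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
# Background worker: scale on memory utilization
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 80
# Data processor: scale on an external metric such as SQS queue depth
- type: External
  external:
    metric:
      name: sqs_queue_depth   # hypothetical name exposed by the adapter
    target:
      type: AverageValue
      averageValue: "30"      # aim for ~30 messages per pod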

Summary: What is "Standard"?

What is standard is the API definition and the Controller. What is not standard is your specific application's scaling logic.

To see what the HPA Controller is looking for, you can check your Deployment's resource requests via kubectl:

kubectl get deployment <name> -o yaml | grep resources -A 5


Key Features:

  • Pod Scaling: Adjusts the number of pod replicas to match the demand.
  • Automatically scales up/down the number of pods in a deployment, replication controller, or replica set based on observed CPU utilization, memory or other selected custom/external metrics.

Purpose

  • Dynamic Scalability: Automatically adds pods during traffic surges to maintain performance and removes them during low-traffic periods to reduce waste.
  • Cost Optimisation: Ensures you only pay for the compute resources currently needed rather than over-provisioning for peak loads.
  • Resilience & Availability: Prevents application crashes and outages by proactively scaling out before resources are fully exhausted.
  • Operational Efficiency: Replaces manual intervention with "architectural definition," allowing infrastructure to manage itself based on predefined performance rules. 

Dependencies


HPA cannot function on its own; it requires the following "links" and infrastructure: 

  • Metrics Server (The Aggregator): This is the most critical infrastructure dependency. The HPA controller queries the Metrics API (typically provided by the Metrics Server) to get real-time CPU and memory usage data.
  • Resource Requests (The Baseline): For the HPA to calculate percentage-based utilization (e.g., "scale at 50% CPU"), the target Deployment must have resources.requests defined. Without these, the HPA has no 100% baseline to measure against and will show an unknown status.
  • Controller Manager: The HPA logic runs as a control loop within the Kubernetes kube-controller-manager, which periodically (every 15 seconds by default) evaluates the metrics and updates the desired replica count.
  • Scalable Target: The HPA must be linked to a resource that supports scaling, such as a Deployment, ReplicaSet, or StatefulSet.
  • Cluster Capacity (Node Scaling): While HPA scales pods, it depends on an underlying node scaler (like Karpenter or Cluster Autoscaler) to provide new EC2 instances if the cluster runs out of physical space for the additional pods. 


Installation and Setup:


To use HPA, ensure the Metrics Server is installed in your cluster to provide resource metrics.

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Configuration:


Create an HPA resource for your deployment.

kubectl autoscale deployment your-deployment --cpu-percent=50 --min=1 --max=10

How to check if it's enabled?


% kubectl api-resources -o wide | grep autoscaling

NAME                       SHORTNAMES   APIVERSION       NAMESPACED   KIND                      VERBS                                                         CATEGORIES
...
horizontalpodautoscalers   hpa          autoscaling/v2   true         HorizontalPodAutoscaler   create,delete,deletecollection,get,list,patch,update,watch   all


In which namespace do HorizontalPodAutoscalers reside in?


In AWS EKS, HorizontalPodAutoscalers (HPA) are namespaced resources, meaning they belong in the same namespace as the workload (e.g., Deployment or StatefulSet) they are intended to scale. 

While there is no single "HPA namespace," here is how they are distributed and where related components live:

1. The HPA Resource Namespace 

  • Application Namespace: When you create an HPA, you define it within the specific namespace where your application is running (e.g., default, production, or demo).
  • Constraint: An HPA can only scale a target resource (like a Deployment) that exists in that same namespace. 

2. Infrastructure & Metrics Namespaces 

While the HPA resource lives with your app, the supporting infrastructure often resides in system namespaces: 

  • Metrics Server: This is a mandatory prerequisite for HPA on EKS. It is typically deployed in the kube-system namespace.
  • Custom Metrics Adapters: If you are scaling based on custom metrics (like Prometheus or CloudWatch), components like the prometheus-adapter or k8s-cloudwatch-adapter may be installed in kube-system or a dedicated namespace like custom-metrics.
  • Cluster Autoscaler: Often confused with HPA, the Cluster Autoscaler (which scales EC2 nodes rather than pods) also typically resides in the kube-system namespace. 

To find all HPAs across your entire EKS cluster, you can run:

kubectl get hpa -A

We might have an output like this:

% kubectl get horizontalpodautoscalers -A    
No resources found

It is possible to get "No resources found" for several reasons, despite the resource being namespaced. This usually means that while the API type exists, no actual instances of that resource have been created in your EKS cluster yet.

Why you see "No resources found":
  • HPA is not yet created: By default, EKS clusters do not come with any HorizontalPodAutoscalers pre-configured. You must explicitly create one for your application.
  • Metrics Server Missing: HPAs rely on the Kubernetes Metrics Server to function. While the HPA object can be created without it, it will show a status of <unknown> and may not appear if you are looking for active scaling.
  • Namespace Context: Even with -A (all namespaces), if no user or system service has defined an HPA resource, the list will be empty. 

How to Verify and Fix:
  • Check if Metrics Server is running:
    • Run kubectl get deployment metrics-server -n kube-system. If it’s missing, you can install it via the AWS EKS Add-ons in the console or via kubectl apply.
        kubectl get all -A | grep metrics-server 
  • Check API availability:
    • Run kubectl api-resources | grep hpa to confirm the cluster recognizes the resource type.
  • Create a test HPA:
    • If you have a deployment named my-app, try creating one: 
        kubectl autoscale deployment my-app \
            --cpu-percent=50 --min=1 --max=10


Note: If you are using a newer version of EKS (like 1.31) with Auto Mode, some autoscaling is handled automatically by the control plane, but standard HPAs still need to be manually defined if you want pod-level scaling based on custom metrics. 

% kubectl get all -A | grep metrics-server 

default pod/metrics-server-5db5f64c66-sjd2p        1/1     Running     0          205d
default service/metrics-server  ClusterIP       172.21.76.224    <none>  443/TCP  95d
default deployment.apps/metrics-server             1/1     1            1         295d
default replicaset.apps/metrics-server-5db5f64c66   1         1         1         295d

This behavior occurs because no instances of HorizontalPodAutoscaler (HPA) have been created yet, even though the supporting infrastructure (Metrics Server) and API are active. 

In Kubernetes, the presence of the metrics-server and the autoscaling/v2 API resource does not mean an HPA is automatically running for your apps. You must manually define an HPA for each deployment you want to scale. 

Why kubectl get hpa -A is empty
  • Workloads are not yet auto-scaled: By default, EKS (and Kubernetes) does not apply HPAs to your deployments. You must explicitly create an HPA object that references your target Deployment or StatefulSet.
  • kubectl get all exclusion: Standard kubectl get all does not include HPAs in its output, which is why your previous command didn't show them even if they existed.
  • Namespace Location: While your metrics-server is in the default namespace (though typically it's in kube-system), HPAs must be created in the same namespace as the app they are scaling. 

How to create your first HPA

If you have a deployment (e.g., named my-deployment) in the default namespace, you can create an HPA for it using this command:

kubectl autoscale deployment my-deployment --cpu-percent=50 --min=1 --max=10


Verification Steps

Once created, verify it with the following:
  • List all HPAs: kubectl get hpa -A.
  • Check Metrics Flow: Since your metrics-server is running, ensure it is actually collecting data by running kubectl top pods -A. If this returns usage data, your HPA will be able to scale correctly. 

Note: For HPA to function, your Deployment must have resource requests (specifically cpu) defined in its container spec, or the HPA will show <unknown> targets. 
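To dig deeper than the TARGETS column, kubectl describe shows the HPA's conditions and recent events (HPA name is hypothetical):

kubectl describe hpa my-app-hpa

# Useful things to look at in the output:
#   Conditions: AbleToScale, ScalingActive, ScalingLimited
#   Events: e.g. FailedGetResourceMetric when metrics or resource requests are missing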



How to configure metrics it needs to observe?


In AWS EKS, you set criteria for scaling in the spec section of a HorizontalPodAutoscaler (HPA) resource. You define thresholds through two primary blocks: metrics (to trigger scaling) and behavior (to control the rate and stability of scaling). 

1. Setting Thresholds (metrics)

The HPA calculates the required number of replicas based on the gap between current usage and your target. 
  • Target Utilization: Typically set as a percentage of a pod's requested CPU or memory.
  • Where to define: Inside the metrics list in your HPA manifest. 

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 60 # Scale when average CPU exceeds 60% of requests



2. Setting Scaling Speed (behavior)

Advanced scaling logic is set in the behavior block, allowing you to fine-tune how fast the cluster grows or shrinks. 
  • Stabilization Window: Prevents "flapping" by making the HPA wait and look at past recommendations before acting.
    • Scale-Up: Default is 0 seconds (instant growth).
    • Scale-Down: Default is 300 seconds (5 minutes) to ensure a spike is truly over before killing pods.
  • Policies: Restrict the absolute number or percentage of pods changed within a specific timeframe. 

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15 # Double replicas every 15 seconds if needed
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 1
      periodSeconds: 60 # Remove only 1 pod per minute for stability


3. Critical Prerequisites
  • Resource Requests: You must define resources.requests in your Deployment manifest. HPA cannot calculate utilization percentages without this baseline.
  • Metrics Server: Must be running in your cluster (usually in kube-system or default) to provide the data HPA needs.


To link an HPA to your Deployment, the HPA uses a scaleTargetRef. This acts like a pointer, telling the HPA controller exactly which resource to watch and resize.

1. Ensure your Deployment has "Requests"

The HPA cannot calculate percentages (like "50% CPU") unless the Deployment defines what 100% looks like. Check your Deployment for a resources.requests block:

# Inside your Deployment manifest
spec:
  containers:
  - name: my-app
    image: my-image
    resources:
      requests:
        cpu: "250m"    # HPA uses this as the 100% baseline
        memory: "512Mi"


2. The HPA Manifest

Create a file named hpa.yaml. The scaleTargetRef is the "link" that connects it to your app.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: default # MUST be the same as your deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment-name # <--- This is the "Link"
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60 # Target 60% of requested CPU


3. Apply and Verify

Run the following commands to put the link into action:

Apply the manifest:

kubectl apply -f hpa.yaml

Check the link:

kubectl get hpa

Wait about 30-60 seconds. If you see TARGETS: <unknown>/60%, the Metrics Server is still calculating.
If you see TARGETS: 0%/60% (or a specific number), the link is successful!

Pro Tip - The Quick Link: If you don't want to write YAML, you can create this link instantly via the CLI:

kubectl autoscale deployment my-deployment-name --cpu-percent=60 --min=2 --max=10


HPA vs Karpenter


In AWS EKS, it is perfectly normal to have Karpenter running without any HorizontalPodAutoscaler (HPA) manifests. This happens because they solve two entirely different scaling problems: 

1. Karpenter vs. HPA: The "Supply" vs. "Demand" Gap

  • HPA manages Pods (The Demand): It decides how many pods you need (e.g., "CPU is high, let's go from 2 pods to 5 pods").
  • Karpenter manages Nodes (The Supply): It provides the underlying infrastructure for those pods. It watches for pods that are "Pending" because there is no room for them, then quickly spins up a new EC2 instance that fits them perfectly. 

If you have no HPAs, it means your application replica counts are currently static (e.g., always 3 pods). Karpenter is only "scaling" when you manually change that number or when you deploy a new app that needs more room. 

2. Can pods be adjusted automatically without HPA? 

Yes, there are a few other ways pod counts or resources can be adjusted:
  • Vertical Pod Autoscaler (VPA): Instead of adding more pods, VPA adjusts the CPU and Memory limits of your existing pods based on actual usage.
  • KEDA (Kubernetes Event-driven Autoscaling): Often used instead of standard HPA for complex triggers. It can scale pods to zero and back up based on external events like AWS SQS queue depth, Kafka lag, or Cron schedules.
  • GitOps/CD Pipelines: Sometimes scaling is "automated" via external CI/CD tools (like ArgoCD) that update the replica count in your git repo based on specific triggers or schedules rather than in-cluster metrics. 

3. Why you might want to add HPA

Without HPA, Karpenter is essentially a "just-in-time" provisioning tool for a static workload. If your traffic spikes, your pods might crash from resource exhaustion before Karpenter has a reason to act. Adding HPA allows your app to "request" more pods, which then triggers Karpenter to "supply" more nodes. 

To handle traffic spikes, HPA and Karpenter work as a two-stage relay: 
  • HPA (The Demand): Triggers when CPU/Memory usage spikes, creating "Pending" pods that cannot fit on current nodes.
  • Karpenter (The Supply): Sees those "Pending" pods and immediately provisions new EC2 instances to accommodate them. 

The Combined YAML Example

This configuration sets up an application to scale up during spikes and ensures Karpenter has the right "instructions" to provide nodes. 

Part A: The Workload (Deployment) 

You must define resource requests so HPA has a baseline and Karpenter knows what size node to buy. 

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spike-app
spec:
  replicas: 2
  template:
    spec:
      containers:
      - name: web-server
        image: nginx
        resources:
          requests:
            cpu: "500m"    # Crucial: HPA uses this for % calculation
            memory: "512Mi" # Crucial: Karpenter uses this to select EC2 size

Part B: The Scaling Rule (HPA)

This tells Kubernetes to add pods when the existing ones are busy. 

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spike-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spike-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # Scale up at 60% to give Karpenter time to boot nodes

Part C: The Node Provisioner (Karpenter NodePool)

This tells Karpenter which AWS instances are "allowed" for your scaling pods. 

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"] # Use Spot to save money during spikes
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        name: default
  disruption:
    consolidationPolicy: WhenUnderutilized # Automatically kill nodes when pods scale down

Why this works for spikes

  • Buffer Time: Setting the HPA to 60% (instead of 90%) ensures you start scaling before pods are overwhelmed, giving Karpenter ~60 seconds to join new nodes to the cluster.
  • Just-in-Time Nodes: Unlike the old Cluster Autoscaler, Karpenter doesn't wait for "Node Groups" to update; it calls the EC2 Fleet API directly to get exactly what your pending pods need.
  • Automatic Cleanup: When the spike ends, HPA reduces pod counts. Karpenter's consolidationPolicy then notices the nodes are empty and terminates them to stop AWS billing. 

In Kubernetes, "700m" stands for 700 millicores.

It is a unit of measurement for CPU processing power, where 1000m is equal to 1 vCPU (or 1 Core). Therefore, 700m is 0.7 of a vCPU.

How it works in your HPA:

If your HPA uses type: AverageValue (instead of Utilization), it looks at raw CPU usage rather than a percentage (see the sketch after this list):
  1. The Trigger: The HPA controller calculates the average CPU usage across all currently running pods in that deployment.
  2. The Action: If the average usage exceeds 700m, the HPA will add more pods to spread the load and bring that average back down to 700m.
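A sketch of what such a metric block looks like in an autoscaling/v2 manifest (the 700m value is illustrative):

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: AverageValue
      averageValue: 700m   # raw average CPU across pods, not a percentage of requests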

Why use AverageValue instead of AverageUtilization?

  • AverageUtilization (Percentage): Requires you to have resources.requests defined. It scales based on "percentage of what I asked for."
  • AverageValue (Raw Number): Does not technically require a request baseline to function. It scales based on "absolute CPU consumed." This is useful if your app has a hard performance limit (e.g., "This app starts lagging if it hits 0.7 cores") regardless of what the pod's requested limit is.

Pro-Tip for Karpenter users:

When using AverageValue, ensure your Deployment's CPU request is set to something sensible (like 800m or 1000m). If your request is much lower than your HPA target (e.g., request is 500m but the target is 700m), the scheduler packs pods onto nodes based on the 500m figure, so they can be starved by CPU contention (or throttled by a low limit) before the 700m average ever triggers a scale-up. Kubernetes Resource Management provides more details on these units.

Horizontal Pod Autoscaler and Upgrading Kubernetes version of the cluster


When upgrading your Kubernetes cluster version, the most critical Horizontal Pod Autoscaler (HPA) considerations involve API version deprecations, metrics server compatibility, and newly introduced scaling configuration fields.

1. API Version Deprecations & Removals

Kubernetes frequently matures its APIs, meaning older HPA versions are deprecated and eventually removed. 
  • autoscaling/v2 is now GA (General Availability): As of Kubernetes v1.23, the autoscaling/v2 API version is stable and generally available.
  • Removal of v2beta2: The autoscaling/v2beta2 version was removed in v1.26. If your manifests still use this version, they will fail to apply or update in clusters v1.26 and newer.
  • Manifest Updates: You must update the apiVersion in your YAML files. Note that fields like targetAverageUtilization from autoscaling/v2beta1 were replaced by a more structured target block, which was introduced in v2beta2 and carried unchanged into the stable v2 API. 

2. Metrics Server & Infrastructure

The HPA depends on external components that may also require updates during a cluster upgrade. 
  • Metrics Server Compatibility: Ensure your Metrics Server version is compatible with your new Kubernetes version. Without it, HPA cannot fetch CPU or memory data, and scaling will fail.
  • Custom Metrics Adapters: if you use custom metrics (e.g., via Prometheus), ensure your Prometheus Adapter supports the new Kubernetes API version, as some older adapters may still attempt to call removed API endpoints. 

3. New Features and Behaviors

Upgrading allows you to leverage newer scaling controls that improve stability: 
  • Configurable Scaling Behavior: Introduced in v1.18 and matured in later versions, the behavior field allows you to set a stabilizationWindowSeconds for scale-up and scale-down independently. This is essential for preventing "flapping" (rapidly scaling up and then down).
  • Configurable Tolerance: In very recent versions (e.g., v1.33), you can now fine-tune the 10% default tolerance. Previously, HPA would only act if the metric differed by more than 10%; you can now adjust this for more sensitive or coarser scaling needs. 

4. Best Practices for the Upgrade Process

  • Audit Before Upgrading: Use tools like Kube-no-trouble (kubent) or Pluto to find resources using deprecated HPA APIs.
  • Dry Runs: Run kubectl apply --dry-run=client on your HPA manifests against the target cluster version to catch schema errors before they impact production.
  • Monitor Events: After upgrading, watch HPA events using kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler to ensure it is still successfully fetching metrics and making decisions. 

When moving from the deprecated autoscaling/v2beta2 (removed in v1.26) to the stable autoscaling/v2 (available since v1.23), the schema is nearly identical; in most manifests only the apiVersion needs to change. The larger field changes happened earlier, when autoscaling/v2beta1 was superseded. 

YAML Comparison

The nested target block (replacing direct fields such as targetAverageUtilization and targetAverageValue) was introduced in v2beta2 and carried into the stable v2 API. If you are migrating from the older v2beta1 (removed in v1.25), the field layout changes as well:

Feature         Deprecated autoscaling/v2beta1    Stable autoscaling/v2
-------         ------------------------------    ---------------------
API Version     apiVersion: autoscaling/v2beta1   apiVersion: autoscaling/v2
CPU Target      targetAverageUtilization: 50      target: {type: Utilization, averageUtilization: 50}
Custom Target   targetAverageValue: "100"         target: {type: AverageValue, averageValue: "100"}


Comparison Example:

The side-by-side manifests below show how the apiVersion changes and how the resource target is nested within a target block in the v2 version. 
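Illustrative manifests for a hypothetical my-app Deployment (only the metric fields differ):

# Deprecated: autoscaling/v2beta1 (removed in v1.25)
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 50

# Stable: autoscaling/v2 (same schema as v2beta2 apart from the apiVersion)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50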

Key Migration Notes
  • Seamless Conversion: The Kubernetes API server converts between these versions automatically, so you can request the new format with kubectl get hpa.v2.autoscaling <name> -o yaml (the old --output-version flag has long been removed from kubectl).
  • Manifest Updates: While conversion is possible, you must update your CI/CD pipelines and YAML manifests to use autoscaling/v2 before upgrading to v1.26 to prevent errors.
  • Behavior Block: The behavior block remains the same structurally, but using the v2 API is required for long-term support.


To identify which HPAs in your cluster are using deprecated API versions, you can use built-in kubectl commands or specialized open-source tools.
 

1. Using Built-in kubectl Commands


While kubectl doesn't have a single "find-deprecated" flag, you can use these methods to audit your resources:

Audit via API Server Warnings (v1.19+):

The API server automatically sends a warning header when a client accesses a deprecated endpoint. Note that a plain kubectl get hpa -A uses the preferred (v2) version, so to surface the warning you can request the deprecated version explicitly while it is still served:

kubectl get horizontalpodautoscalers.v2beta2.autoscaling -A


Dry-Run Manifest Validation:

Before applying an update, use a client-side dry-run to see if the manifest will be accepted by the new cluster version's schema.

kubectl apply -f your-hpa.yaml --dry-run=client



Check Metrics for Requested Deprecated APIs:

You can query the API server's raw metrics to see if any client (like an old CI/CD script) is still requesting deprecated HPA versions.

kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

 
2. Using Specialized Audit Tools (Recommended)

Specialized tools are the most reliable way to find exactly which resources are affected before an upgrade. 

Kube-no-trouble (kubent):

This tool scans your live cluster and lists exactly which resources are using APIs scheduled for removal.

# Install and run (requires no cluster installation)
sh -c "$(curl -sSL https://git.io/install-kubent)"
kubent


Pluto:

While kubent scans the live cluster, Pluto is best for scanning your Helm charts and static YAML files in your git repository to catch issues before they are deployed.

# Scan local directory
pluto detect-files -d .

 
3. Quick Check of Supported Versions 

To see which API versions your cluster currently supports for horizontal scaling, use the following command: 

kubectl api-versions | grep autoscaling


Note: If you are upgrading to v1.26 or newer, any HPA using autoscaling/v2beta2 must be updated to autoscaling/v2, as the older version will no longer be served


How to use kubectl convert to automatically upgrade your existing YAML manifests to the latest API version?


The kubectl convert command is no longer part of the standard kubectl binary; it is now a standalone plugin. You must install it to automatically upgrade your HPA manifests from v2beta2 to v2. 

1. Install the kubectl-convert Plugin 

Choose the method that matches your operating system:
macOS (via Homebrew):

brew install kubectl-convert


Manual Download (Linux/macOS/Windows):

Download the binary for your architecture from the official Kubernetes release page and move it to your system path (e.g., /usr/local/bin/kubectl-convert).

Verification:

Run kubectl convert --help to confirm the plugin is active. 

2. Convert Your HPA Manifests

Once installed, you can use the command to rewrite your old YAML files to the stable autoscaling/v2 version. 

Convert a Specific File:

This command reads your v2beta2 file and outputs a clean v2 version to your terminal.

kubectl convert -f old-hpa.yaml --output-version autoscaling/v2

Save the Converted File:

kubectl convert -f old-hpa.yaml --output-version autoscaling/v2 > new-hpa.yaml

Bulk Conversion (Directory):

You can point it to a directory containing multiple manifests to update them all at once.

kubectl convert -f ./my-hpa-folder/ --output-version autoscaling/v2

 
3. Alternative: Direct Export from the Cluster

Because the Kubernetes API server internally handles conversion between versions, you can "live-convert" an existing HPA by explicitly requesting the target version during a get command: 

kubectl get hpa.v2.autoscaling <hpa-name> -o yaml > upgraded-hpa.yaml

This method is often faster if the HPA is already running in your cluster, as it bypasses the need for the convert plugin. 


How to use Kustomize to handle these API version changes across multiple environments?



Kustomize allows you to manage the transition from v2beta2 to v2 across multiple environments (e.g., Dev, Staging, Prod) by layering environment-specific changes over a common base configuration. 

1. Structure Your Directory

Keep your primary HPA manifest in a base folder and create overlays for each environment. 


├── base
│   ├── hpa.yaml
│   └── kustomization.yaml
└── overlays
    ├── dev
    │   └── kustomization.yaml
    └── prod
        ├── hpa-v2-patch.yaml
        └── kustomization.yaml
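The base kustomization simply lists the shared manifests; a minimal sketch, assuming hpa.yaml is the only base resource:

# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - hpa.yaml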

2. Strategy for Phased Upgrades

If you are upgrading clusters one by one, you can use Kustomize patches to change the apiVersion only for specific environments while keeping others on the older version. 

Example: Upgrading 'Prod' to v2

If your base/hpa.yaml still uses v2beta2, you can create a patch in your Prod overlay to upgrade it to v2 without touching the base file used by other environments. 

overlays/prod/hpa-v2-patch.yaml:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

overlays/prod/kustomization.yaml:

resources:
  - ../../base
patches:
  - path: hpa-v2-patch.yaml
    target:
      kind: HorizontalPodAutoscaler
      name: my-app-hpa

3. Validating the Conversion

Before applying changes to a live cluster, use the following commands to ensure Kustomize has correctly merged the new apiVersion and schema: 

View Rendered YAML:

kubectl kustomize overlays/prod

Diff Against Live Cluster:

Use kubectl diff to see exactly what will change in the API server.

kubectl diff -k overlays/prod

 
4. Best Practices

  • Keep Base "Newest": Once all clusters are upgraded, move the v2 configuration into the base and remove the patches from your overlays to keep your code DRY.
  • CI/CD Integration: Use Pluto in your CI pipeline to scan the output of kustomize build for any remaining deprecated API versions.


Wednesday, 28 January 2026

How to test Redis connectivity




We first need to check if Redis DNS name resolves:

% nslookup redis.example.com

Server: 192.168.1.1
Address: 192.168.1.1#53

Non-authoritative answer:
redis.example.com canonical name = example-prod-redis-serverless-kg5o59.serverless.use2.cache.amazonaws.com.
example-prod-redis-serverless-kg5o59.serverless.use2.cache.amazonaws.com canonical name = default.example-prod-redis-serverless-kg5o59.serverless.use2.cache.amazonaws.com.
Name: default.example-prod-redis-serverless-kg5o59.serverless.use2.cache.amazonaws.com
Address: 10.0.3.74
...


Let's try to make a TCP connection:

% nc -vz redis.example.com 6379

nc: connectx to redis.example.com port 6379 (tcp) failed: Operation timed out

After adding remote client's IP address to inbound rules of Redis security group (firewall):

% nc -vz redis.example.com 6379

Connection to redis.example.com port 6379 [tcp/*] succeeded!

 
Let's now install Redis client so we can try to connect to the server:

% brew install redis
==> Fetching downloads for: redis
✔︎ Bottle Manifest redis (8.4.0)                                                                                                                                                            Downloaded   10.9KB/ 10.9KB
✔︎ Bottle Manifest ca-certificates (2025-12-02)                                                                                                                                             Downloaded    2.0KB/  2.0KB
✔︎ Bottle ca-certificates (2025-12-02)                                                                                                                                                      Downloaded  131.8KB/131.8KB
✔︎ Bottle redis (8.4.0)                                                                                                                                                                     Downloaded    1.2MB/  1.2MB
==> Installing redis dependency: ca-certificates
==> Pouring ca-certificates--2025-12-02.all.bottle.tar.gz
==> Regenerating CA certificate bundle from keychain, this may take a while...
🍺  /opt/homebrew/Cellar/ca-certificates/2025-12-02: 4 files, 236.4KB
==> Pouring redis--8.4.0.arm64_sequoia.bottle.tar.gz
==> Caveats
To start redis now and restart at login:
  brew services start redis
Or, if you don't want/need a background service you can just run:
  /opt/homebrew/opt/redis/bin/redis-server /opt/homebrew/etc/redis.conf
==> Summary
🍺  /opt/homebrew/Cellar/redis/8.4.0: 15 files, 3MB
==> Running `brew cleanup redis`...
...
==> Caveats
==> redis
To start redis now and restart at login:
  brew services start redis
Or, if you don't want/need a background service you can just run:
  /opt/homebrew/opt/redis/bin/redis-server /opt/homebrew/etc/redis.conf


Let's connect to Redis server and execute ping command (expected response is PONG) and also get some information about the server:


% redis-cli \       
  --tls \
  -h redis.example.com \
  -p 6379

redis.example.com:6379> ping
PONG
redis.example.com:6379> info server 
# Server
redis_version:7.1
redis_mode:cluster
os:Amazon ElastiCache
arch_bits:64
run_id:0
redis.example.com:6379> 
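For scripted checks, redis-cli can also run a single command non-interactively (same hypothetical endpoint as above):

% redis-cli --tls -h redis.example.com -p 6379 ping
PONG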


---

Sunday, 25 January 2026

Strategies for AWS EKS Cluster Kubernetes Version Upgrade


This article explores AWS EKS cluster upgrade strategies end-to-end (control plane + nodes) and where node strategies fit inside them. We assume a single cluster rather than one cluster per environment. If you run multiple clusters (e.g. dev, stage, prod), additional approaches should also be considered.


General Upgrade order


This is critical in EKS. A proper cadence explicitly states the order:

  1. Non-prod clusters first
  2. Control plane
  3. Managed add-ons
    1. VPC CNI
    2. CoreDNS
    3. kube-proxy
  4. Worker nodes
  5. Platform controllers
    1. Ingress
    2. Autoscalers
    3. Observability

Example:

Upgrade order: dev → staging → prod
Control plane → add-ons → nodes → workloads
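As a rough sketch, the control-plane and add-on steps map to AWS CLI calls like these (cluster name, Kubernetes version and add-on version are placeholders; always verify add-on compatibility first):

# 1. Control plane (one minor version at a time)
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.31

# 2. Managed add-ons: find a compatible version, then update
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.31
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
  --addon-version v1.18.1-eksbuild.1 --resolve-conflicts PRESERVE

# 3. Worker nodes (managed node group example)
aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name my-nodes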

Wednesday, 14 January 2026

Elasticsearch Data Streams





In Elasticsearch, a data stream is an abstraction layer designed to simplify the management of continuously generated time-series data, such as logs, metrics, and events. 

Key Characteristics


  • Time-Series Focus: Every document indexed into a data stream must contain a @timestamp field, which is used to organize and query the data.
  • Append-Only Design: Data streams are optimized for use cases where data is rarely updated or deleted. We cannot send standard update or delete requests for existing documents directly to the stream; these must be performed via _update_by_query / _delete_by_query or directed at specific backing indices.
  • Unified Interface: Users interact with a single named resource (the data stream name) for both indexing and searching, even though the data is physically spread across multiple underlying indices. 

Architecture: Backing Indices


A data stream consists of one or more hidden, auto-generated backing indices: 
  • Write Index: The most recently created backing index. All new documents are automatically routed here.
  • Rollover: When the write index reaches a specific size or age, Elasticsearch automatically creates a new backing index (rollover) and sets it as the new write index.
  • Search: Search requests sent to the data stream are automatically routed to all of its backing indices to return a complete result set. 

Automated Management


Data streams rely on two primary automation tools: 
  • Index Templates: These define the stream's structure, including field mappings and settings, and must include a data_stream object to enable the feature.
  • Lifecycle Management (ILM/DSL): Tools like Index Lifecycle Management (ILM) or the newer Data Stream Lifecycle automate tasks like moving old indices to cheaper hardware (hot/warm/cold tiers) and eventually deleting them based on retention policies. 
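For illustration, a minimal index template that enables a data stream with a DSL retention policy might look like this (names and values are hypothetical):

PUT _index_template/logs-myapp
{
  "index_patterns": ["logs-myapp*"],
  "data_stream": {},
  "priority": 500,
  "template": {
    "lifecycle": {
      "data_retention": "30d"
    }
  }
}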

When to Use


  • Ideal for: Logs, events, performance metrics, and security traces.
  • Avoid for: Use cases requiring frequent updates to existing records (like a product catalog) or data that lacks a timestamp.

How does a data stream know when to roll over?


Data streams are typically managed by:
  • Index Lifecycle Management (ILM)
  • Data Stream Lifecycle (DSL) - newer concept

In cluster settings, data_streams.lifecycle.poll_interval defines how often Elasticsearch goes over each data stream, checks whether it is eligible for a rollover, and performs the rollover if so. 

To find this interval value, check the output of 

GET _cluster/settings

By default, the GET _cluster/settings command only returns settings that have been manually overridden, so if we are using default values we need to add ?include_defaults=true.

Default interval value is 5 minutes which can be verified by checking cluster's default settings:

GET _cluster/settings?include_defaults=true&filter_path=defaults.data_streams.lifecycle.poll_interval

Output:

{
  "defaults": {
    "data_streams": {
      "lifecycle": {
        "poll_interval": "5m"
      }
    }
  }
}


At each poll interval, Elasticsearch rolls over the write index of the data stream if it fulfills the conditions defined by cluster.lifecycle.default.rollover. If we are using default cluster settings, we can check its default value:

GET _cluster/settings?include_defaults=true&filter_path=defaults.cluster.lifecycle

Output:

{
  "defaults": {
    "cluster": {
      "lifecycle": {
        "default": {
          "rollover": "max_age=auto,max_primary_shard_size=50gb,min_docs=1,max_primary_shard_docs=200000000"
        }
      }
    }
  }
}


max_age=auto: the rollover age is derived from the data stream's retention (see the heuristics below); with a 90-day retention this works out to 7 days, which is why our indices roll over every week.
max_primary_shard_size=50gb: prevents shards from becoming too large and slow.
min_docs=1: avoids rolling over an index that has not received any documents.
max_primary_shard_docs=200000000: a built-in limit to maintain search performance, even if the 50GB size hasn't been reached yet.

In our case max_age=auto which means Elasticsearch is using a dynamic rollover strategy based on our retention period. If we look at https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/action/admin/indices/rollover/RolloverConfiguration.java#L174-L195 we can see the comment:

    /**
     * When max_age is auto we’ll use the following retention dependent heuristics to compute the value of max_age:
     * - If retention is null aka infinite (default), max_age will be 30 days
     * - If retention is less than or equal to 1 day, max_age will be 1 hour
     * - If retention is less than or equal to 14 days, max_age will be 1 day
     * - If retention is less than or equal to 90 days, max_age will be 7 days
     * - If retention is greater than 90 days, max_age will be 30 days
     */


So, max age of backing index before rollover depends on how long we want to keep data overall in our data stream. For example, if it's 90 days, Elasticsearch will perform rollover and create a new backing index every 7 days.

Instead of a single fixed value for every data stream, auto adjusts the rollover age to ensure that indices aren't kept too long or rolled over too frequently for their specific retention settings.

max_age=auto is a "smart" setting designed to prevent "small index bloat" while ensuring data is deleted on time. It ensures our max_age is always a fraction of our total retention so that we have several backing indices to delete sequentially as they expire.


Data Stream Lifecycle (DSL)


This is a streamlined, automated alternative to the older Index Lifecycle Management (ILM). 

While ILM focuses on "how" data is stored (tiers, hardware, merging), the lifecycle block focuses on "what" happens to the data based on business needs, primarily focusing on retention and automated optimization.


How to find out if data stream is managed by Index Lifecycle Management (ILM) or Data Stream Lifecycle (DSL)?


Get the data stream's details and look at template, lifecycle, next_generation_managed_by and prefer_ilm attributes. Example:

GET _data_stream/ilm-history-7

Output snippet:

      "template": "ilm-history-7",
      "lifecycle": {
        "enabled": true,
        "data_retention": "90d",
        "effective_retention": "90d",
        "retention_determined_by": "data_stream_configuration"
      },
      "next_generation_managed_by": "Data stream lifecycle",
      "prefer_ilm": true,


lifecycle block in our data stream's index template refers to the Data Stream Lifecycle (DSL). 


Inside that lifecycle block, we typically see these children:
  • enabled (Boolean):
    • Interpretation: Determines if Elasticsearch should actively manage this data stream using DSL.
    • Behavior: When set to true, Elasticsearch automatically handles rollover (based on cluster defaults) and deletion (based on our retention settings). If this is missing but other attributes are present, it often defaults to true.
  • data_retention (String):
    • Interpretation: The minimum amount of time Elasticsearch is guaranteed to store our data.
    • Format: Uses time units like 90d (90 days), 30m (30 minutes), or 1h (1 hour).
    • Behavior: This period is calculated starting from the moment a backing index is rolled over (it becomes "read-only"), not from its creation date.
  • effective_retention
    • This is the final calculated value that Elasticsearch actually uses to delete data. 
    • What it represents: It is the minimum amount of time our data is guaranteed to stay in the cluster after an index has rolled over.
    • Why it might differ from our setting: We might set data_retention: "90d", but the cluster might have a global "max retention" or "default retention" policy that overrides our specific request
  • retention_determined_by
    • This attribute identifies the source of the effective_retention value. Common values include: 
      • data_stream_configuration: The retention is coming directly from the data_retention we set in our index template or data stream.
      • default_retention: We didn't specify a retention period, so Elasticsearch is using the cluster-wide default (e.g., data_streams.lifecycle.retention.default).
      • max_retention: We tried to set a very long retention (e.g., 1 year), but a cluster admin has capped all streams at a lower value (e.g., 90 days) using data_streams.lifecycle.retention.max
  • downsampling (Object/Array):
    • Interpretation: Configures the automatic reduction of time-series data resolution over time.
    • Behavior: It defines when (e.g., after 7 days) and how (e.g., aggregate 1-minute metrics into 1-hour blocks) data should be condensed to save storage space while keeping historical trends searchable.

Elasticsearch determines the final retention value using this priority: 
  • If a Max Retention is set on the cluster and our setting exceeds it, Max Retention wins.
  • If we have configured Data Retention on the stream, it is used (as long as it's under the max).
  • If we have not configured anything, the Default Retention for the cluster is used.
  • If no defaults or maxes exist and we haven't set a value, retention is Infinite.
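To set or change retention on an existing data stream directly (without touching its index template), the lifecycle can be updated in place (stream name is hypothetical):

PUT _data_stream/logs-myapp/_lifecycle
{
  "data_retention": "30d"
}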

The prefer_ilm setting is a transition flag used when a data stream has both an Index Lifecycle Management (ILM) policy and a Data Stream Lifecycle (DSL) configuration. It tells Elasticsearch which of the two management systems should take control of the data stream. Value options are:

  • true (the default): Elasticsearch will use the ILM policy defined in the index template setting index.lifecycle.name and ignore the lifecycle block (DSL). Use this if you need granular control over shard allocation, force merging, or specific rollover ages that DSL doesn't offer.
  • false: Elasticsearch will prioritize the Data Stream Lifecycle (DSL) block and ignore the ILM policy for rollover and retention. Note that prefer_ilm only matters when both systems are configured; if no ILM policy is attached (no index.lifecycle.name), DSL manages the stream regardless of this flag. 



What if data retention is set both in lifecycle (DSL) and in ILM associated to the index template used for data stream?


If we see retention-related settings both in the lifecycle block (DSL) and in an ILM policy referenced from the settings block of an index template, the winner is decided by index.lifecycle.prefer_ilm. That setting defaults to true, so an attached ILM policy normally takes precedence; the lifecycle block only governs retention when prefer_ilm is explicitly false or when no ILM policy is attached at all. In the latter case (lifecycle block present with enabled: true and no competing ILM policy), Elasticsearch manages the retention of the data stream indices exclusively through the DSL background process.


If a data stream has both a lifecycle block and an ILM policy in its index template like:

"settings": {
  "index.lifecycle.name": "my-ilm-policy"
}

...then:
  • prefer_ilm decides the winner: with the default prefer_ilm: true, the ILM policy takes over rollover and retention and the lifecycle (DSL) block is ignored.
  • Setting prefer_ilm: false hands control back to DSL: the ILM policy is then ignored, and the GET _data_stream output (managed_by / next_generation_managed_by) shows which system is managing each backing index.

If we have some other custom setting in the settings block (like a metadata field or a legacy retention-style setting such as a hypothetical index.lifecycle.retention), it is ignored by the lifecycle logic: DSL only looks at the lifecycle object, and ILM only looks at its policy. Any other setting is treated as a static index setting and will not trigger the deletion of indices.


How do we associate ILM policy with data stream?


Associating an ILM policy with a data stream requires configuring the data stream's backing index template. Because your current template uses the newer Data Stream Lifecycle (DSL), you must also explicitly tell Elasticsearch to favor ILM.

1. Update the Index Template

To associate a policy, you must add the policy name to the index settings within the template that matches your data stream.

Using the API:

PUT _index_template/<your-template-name>
{
  "index_patterns": ["ilm-history-7*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "your-ilm-policy-name",
      "index.lifecycle.prefer_ilm": true
    }
  },
  "data_stream": { }
}

index.lifecycle.name: Specifies which ILM policy to use.
index.lifecycle.prefer_ilm: true: Makes the preference explicit when your template still has a lifecycle block. true is already the default, but stating it guards against the flag having been set to false elsewhere and ensures Elasticsearch uses the ILM policy instead of DSL.

2. Apply to Existing Backing Indices

Updating a template only affects future indices created by the data stream. To apply the policy to the 14 indices already in your stream, you must update their settings directly: 


PUT .ds-ilm-history-7-*/_settings
{
  "index": {
    "lifecycle": {
      "name": "your-ilm-policy-name",
      "prefer_ilm": true
    }
  }
}


Note: Use a wildcard like .ds-ilm-history-7-* to target all existing backing indices at once.


If you are moving back to ILM because you need a specific max_age (e.g., rollover every 1 day instead of 7), ensure your new ILM policy has the rollover action defined in its Hot phase. Once applied, the ilm-history-7 stream will immediately begin following the custom timings in your ILM policy instead of the cluster-wide DSL defaults.
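A sketch of such a policy (the name and timings are examples, not recommendations):

PUT _ilm/policy/your-ilm-policy-name
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_primary_shard_size": "50gb"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}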

What if index template has lifecycle attribute but no index.lifecycle.name?


If an index template contains a lifecycle block, it is configured to use Data Stream Lifecycle (DSL).

If you want to associate a specific ILM policy (to gain granular control over rollover max_age, for example) while this block exists, you must handle the conflict as follows:

1. DSL vs. ILM Precedence

With the default index.lifecycle.prefer_ilm: true, an ILM policy attached via index.lifecycle.name takes precedence over the lifecycle (DSL) block, so adding the policy name is normally sufficient; setting prefer_ilm: true explicitly in the settings block documents that intent and guards against it having been flipped to false elsewhere.
If prefer_ilm is false, the lifecycle block will "win," and your ILM policy will be ignored.

2. How to associate the ILM Policy

To properly link an ILM policy to this specific template, you should update it to look like this:


{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "your-ilm-policy-name",  // 1. Point to your ILM policy
          "prefer_ilm": true              // 2. Tell ES to favor ILM over the DSL block
        },
        "number_of_shards": "1",
        "auto_expand_replicas": "0-1",
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_hot"
            }
          }
        }
      }
    },
    "mappings": { ... },
    "lifecycle": {              // This block can remain, but will be ignored 
      "enabled": true,          // because prefer_ilm is true above.
      "data_retention": "90d"
    }
  }
}


Key Interpretations

  • The lifecycle block at the root: This governs retention (90 days) and rollover (defaulted to auto, which usually means 7 days or 30 days depending on retention).
  • The settings.index block: This is where you define the ILM link.
  • Conflict Resolution: prefer_ilm defaults to true, so once index.lifecycle.name is set the ILM policy takes over. Only if prefer_ilm has been set to false will the data stream keep following the lifecycle block, rolling over every 7–30 days based on the auto logic even though an ILM policy name is in the settings.

Recommendation

If you want to use an ILM policy, the cleanest approach is to remove the lifecycle block from the template entirely and only use index.lifecycle.name in the settings. This eliminates any ambiguity for the orchestration engine.


How to fix a Shard Explosion?


A shard explosion happens when a cluster accumulates far more shards than its heap can comfortably manage; a common rule of thumb is to stay below ~20 shards per GB of Elasticsearch node heap (the JVM heap allocated to the Elasticsearch process, not the Kubernetes node's total memory).
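A couple of quick read-only checks to compare shard counts against heap:

GET _cat/nodes?v&h=name,heap.max,heap.percent,node.role
GET _cat/shards?v&s=store:desc
GET _cluster/health?filter_path=active_shards,active_primary_shards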

Force merge does not reduce the number of backing indices in a data stream. 

A force merge (_forcemerge) acts on the segments within the shards of the backing indices, not on the backing indices themselves.

Here is the breakdown of what force merge does and how to reduce backing indices:

What Force Merge Does

  • Merges Segments: It reduces the number of Lucene segments in each shard, ideally to one, which improves search performance.
  • Cleans Deleted Docs: It permanently removes (expunges) documents that were soft-deleted, freeing up disk space.
  • Targets Shards: The operation is performed on the shards of one or more indices (or data stream backing indices). 

What Reduces Backing Indices


To reduce the number of backing indices for a data stream, you must use other strategies:
  • Rollover (ILM): Index Lifecycle Management (ILM) creates new backing indices and can automatically delete old ones based on age or size.
  • Data Stream Lifecycle: This automates the deletion of backing indices that are older than the defined retention period.
  • Shrink API: While not typically used to combine multiple daily indices into one, it can be used to reduce the primary shard count of a specific, read-only backing index.
  • Delete Index API: You can manually delete older backing indices (except the current write index). 

Summary:

  • Force Merge = Reduces segments inside shards (better search, less disk space).
  • Rollover/Delete = Reduces the total number of backing indices (fewer indices overall).
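Illustrative calls against hypothetical backing index names (only force merge read-only indices, and never delete the current write index):

# Shrink segment count in an older, read-only backing index
POST .ds-logs-myapp-2026.01.01-000007/_forcemerge?max_num_segments=1

# Manually remove an expired backing index
DELETE .ds-logs-myapp-2026.01.01-000001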


---

Wednesday, 7 January 2026

Kubernetes Scheduling

 


Pod scheduling is controlled by the scheduling constraints section of the pod spec, which can be found in the Kubernetes manifest (YAML) for resources like:
  • Deployment
  • StatefulSet
  • Pod
  • DaemonSet
  • Job/CronJob

Kubernetes scheduling mechanisms:
  • Tolerations
  • Node Selectors
  • Node Affinity
  • Pod Affinity/Anti-Affinity
  • Taints (node-side)
  • Priority and Preemption
  • Topology Spread Constraints
  • Resource Requests/Limits
  • Custom Schedulers
  • Runtime Class


Example:

    tolerations:
      - key: "karpenter/elastic"
        operator: "Exists"
        effect: "NoSchedule"
    nodeSelector:
      karpenter-node-pool: elastic
      node.kubernetes.io/instance-type: m7g.large
      karpenter.sh/capacity-type: "on-demand"


Tolerations


Specify what node taints it can tolerate.

tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"

Allows the pod to be scheduled on nodes with the taint karpenter/elastic:NoSchedule.
Without this toleration, the pod would be repelled from those nodes.
operator: "Exists" means it tolerates the taint regardless of its value.

Karpenter applies the taint karpenter/elastic:NoSchedule to nodes in the "elastic" pool. This taint acts as a gatekeeping mechanism - it says: "Only pods that explicitly tolerate this taint can schedule here". By default, most pods CANNOT schedule on these nodes (they lack the toleration). Our pod explicitly opts in with the toleration, saying "I'm allowed on elastic nodes".

Why This Pattern?

This is actually a common workload isolation strategy:

Regular pods (no toleration) 
  ↓
  ❌ BLOCKED from elastic nodes
  ✅ Schedule on general-purpose nodes

Elastic workload pods (with toleration)
  ↓  
  ✅ CAN schedule on elastic nodes
  ✅ Can also schedule elsewhere (unless nodeSelector restricts)

Real-World Use Case:

# Elastic nodes are tainted to reserve them for specific workloads
# General traffic shouldn't land here accidentally

# Your pod says: "I'm an elastic workload, let me in"
tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"

# PLUS you add nodeSelector to say: "And I ONLY want elastic nodes"
nodeSelector:
  karpenter-node-pool: elastic


The Karpenter Perspective

Karpenter knows the node state perfectly. The taint isn't about node health—it's about reserving capacity for specific workloads. This prevents:
  • Accidental scheduling of non-elastic workloads
  • Resource contention
  • Cost inefficiency (elastic nodes might be expensive/specialized)

Think of it like a VIP section: the velvet rope (taint) keeps everyone out except those with a pass (toleration).


Node Selector


nodeSelector:
  karpenter-node-pool: elastic
  node.kubernetes.io/instance-type: m7g.large
  karpenter.sh/capacity-type: "on-demand"

Requires the pod to run only on nodes matching ALL these labels:
  • Must be in the "elastic" Karpenter node pool
  • Must be an AWS m7g.large instance (ARM-based Graviton3)
  • Must be on-demand (not spot instances; karpenter.sh/capacity-type can also have value "spot")

What This Means

This pod is configured to run on dedicated elastic infrastructure managed by Karpenter (a Kubernetes node autoscaler), specifically targeting:
  • ARM-based instances (m7g = Graviton)
  • On-demand capacity (predictable, no interruptions)
  • A specific node pool for workload isolation

This is common for workloads that need consistent performance or have specific architecture requirements.

Node Affinity


More flexible than nodeSelector with support for soft/hard requirements:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:  # Hard requirement
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m7g.large", "m7g.xlarge"]
    preferredDuringSchedulingIgnoredDuringExecution:  # Soft preference
    - weight: 100
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]


Pod Affinity/Anti-Affinity


Schedule pods based on what other pods are running:

affinity:
  podAffinity:  # Schedule NEAR certain pods
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: cache
      topologyKey: kubernetes.io/hostname
      
  podAntiAffinity:  # Schedule AWAY from certain pods
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app
        topologyKey: topology.kubernetes.io/zone


Taints (node-side)


Complement to tolerations, applied to nodes:

kubectl taint nodes node1 dedicated=gpu:NoSchedule


Priority and Preemption


Control which pods get scheduled first and can evict lower-priority pods:

priorityClassName: high-priority
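The high-priority class referenced above must exist as a cluster-scoped PriorityClass; a minimal sketch:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000           # higher value = scheduled first, may preempt lower-priority pods
globalDefault: false
description: "For latency-critical workloads"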


Topology Spread Constraints


Distribute pods evenly across zones, nodes, or other topology domains:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app


Resource Requests/Limits


Influence scheduling based on available resources:

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"


Custom Schedulers


You can even specify a completely different scheduler:

schedulerName: my-custom-scheduler


Runtime Class


For specialized container runtimes (like gVisor, Kata Containers):

runtimeClassName: gvisor

Each mechanism serves different use cases—nodeSelector is simple but rigid, while affinity rules and topology constraints offer much more flexibility for complex scheduling requirements.