Wednesday, 7 January 2026

Kubernetes Scheduling

 


Pod scheduling is controlled by the scheduling constraints in the pod spec of the Kubernetes configuration, which can be found in the Kubernetes manifest (YAML) of resources like:
  • Deployment
  • StatefulSet
  • Pod
  • DaemonSet
  • Job/CronJob

Kubernetes scheduling mechanisms:
  • Tolerations
  • Node Selectors
  • Node Affinity
  • Pod Affinity/Anti-Affinity
  • Taints (node-side)
  • Priority and Preemption
  • Topology Spread Constraints
  • Resource Requests/Limits
  • Custom Schedulers
  • Runtime Class


Example:

    tolerations:
      - key: "karpenter/elastic"
        operator: "Exists"
        effect: "NoSchedule"
    nodeSelector:
      karpenter-node-pool: elastic
      node.kubernetes.io/instance-type: m7g.large
      karpenter.sh/capacity-type: "on-demand"


Tolerations


Tolerations specify which node taints the pod can tolerate.

tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"

Allows the pod to be scheduled on nodes with the taint karpenter/elastic:NoSchedule.
Without this toleration, the pod would be repelled from those nodes.
operator: "Exists" means it tolerates the taint regardless of its value.

Karpenter applies the taint karpenter/elastic:NoSchedule to nodes in the "elastic" pool. This taint acts as a gatekeeping mechanism - it says: "Only pods that explicitly tolerate this taint can schedule here". By default, most pods CANNOT schedule on these nodes (they lack the toleration). Our pod explicitly opts in with the toleration, saying "I'm allowed on elastic nodes".

Why This Pattern?

This is actually a common workload isolation strategy:

Regular pods (no toleration) 
  ↓
  ❌ BLOCKED from elastic nodes
  ✅ Schedule on general-purpose nodes

Elastic workload pods (with toleration)
  ↓  
  ✅ CAN schedule on elastic nodes
  ✅ Can also schedule elsewhere (unless nodeSelector restricts)

Real-World Use Case:

# Elastic nodes are tainted to reserve them for specific workloads
# General traffic shouldn't land here accidentally

# Your pod says: "I'm an elastic workload, let me in"
tolerations:
  - key: "karpenter/elastic"
    operator: "Exists"
    effect: "NoSchedule"

# PLUS you add nodeSelector to say: "And I ONLY want elastic nodes"
nodeSelector:
  karpenter-node-pool: elastic


The Karpenter Perspective

Karpenter knows the node state perfectly. The taint isn't about node health—it's about reserving capacity for specific workloads. This prevents:
  • Accidental scheduling of non-elastic workloads
  • Resource contention
  • Cost inefficiency (elastic nodes might be expensive/specialized)

Think of it like a VIP section: the velvet rope (taint) keeps everyone out except those with a pass (toleration).


Node Selector


nodeSelector:
  karpenter-node-pool: elastic
  node.kubernetes.io/instance-type: m7g.large
  karpenter.sh/capacity-type: "on-demand"

Requires the pod to run only on nodes matching ALL these labels:
  • Must be in the "elastic" Karpenter node pool
  • Must be an AWS m7g.large instance (ARM-based Graviton3)
  • Must be on-demand (not spot instances; karpenter.sh/capacity-type can also have value "spot")
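
To double-check which nodes currently satisfy all three labels, the same selectors can be passed to kubectl (a quick sanity check built from the label values above):

kubectl get nodes -l karpenter-node-pool=elastic,node.kubernetes.io/instance-type=m7g.large,karpenter.sh/capacity-type=on-demand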

What This Means

This pod is configured to run on dedicated elastic infrastructure managed by Karpenter (a Kubernetes node autoscaler), specifically targeting:
  • ARM-based instances (m7g = Graviton)
  • On-demand capacity (predictable, no interruptions)
  • A specific node pool for workload isolation

This is common for workloads that need consistent performance or have specific architecture requirements.

Node Affinity


More flexible than nodeSelector with support for soft/hard requirements:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:  # Hard requirement
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m7g.large", "m7g.xlarge"]
    preferredDuringSchedulingIgnoredDuringExecution:  # Soft preference
    - weight: 100
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]


Pod Affinity/Anti-Affinity


Schedule pods based on what other pods are running:

affinity:
  podAffinity:  # Schedule NEAR certain pods
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: cache
      topologyKey: kubernetes.io/hostname
      
  podAntiAffinity:  # Schedule AWAY from certain pods
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app
        topologyKey: topology.kubernetes.io/zone


Taints (node-side)


Complement to tolerations, applied to nodes:

kubectl taint nodes node1 dedicated=gpu:NoSchedule
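
A pod that should be allowed onto those tainted nodes would then carry a matching toleration, for example:

tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"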


Priority and Preemption


Control which pods get scheduled first and can evict lower-priority pods:

priorityClassName: high-priority
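
The referenced class is a cluster-scoped PriorityClass object, e.g. (a minimal sketch; the name and value here are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical workloads that may preempt lower-priority pods."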


Topology Spread Constraints


Distribute pods evenly across zones, nodes, or other topology domains:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app


Resource Requests/Limits


Influence scheduling based on available resources:

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"


Custom Schedulers


You can even specify a completely different scheduler:

schedulerName: my-custom-scheduler


Runtime Class


For specialized container runtimes (like gVisor, Kata Containers):

runtimeClassName: gvisor
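
The name refers to a RuntimeClass object that maps to a runtime handler configured on the nodes, e.g. (assuming gVisor's runsc handler is installed there):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc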

Each mechanism serves different use cases—nodeSelector is simple but rigid, while affinity rules and topology constraints offer much more flexibility for complex scheduling requirements.


Useful kubectl commands

 


To get the list of all the nodes (worker machines, e.g. EC2 instances in an AWS EKS cluster) in the cluster:

kubectl get nodes

Output columns:
  • NAME e.g. ip-10-2-12-73.us-east-1.compute.internal
  • STATUS e.g. Ready
  • ROLES    <none>
  • AGE    e.g. 1d
  • VERSION e.g. v1.32.9-eks-ecaa3a6


kubectl get nodes -L node.kubernetes.io/instance-type,topology.kubernetes.io/zone,karpenter.sh/capacity-type

Output columns:
  • NAME e.g. ip-10-2-12-73.us-east-1.compute.internal
  • STATUS e.g. Ready
  • ROLES    <none>
  • AGE    e.g. 1d
  • VERSION e.g. v1.32.9-eks-ecaa3a6
  • INSTANCE-TYPE e.g. m7g.2xlarge
  • ZONE e.g. us-east-2a
  • CAPACITY-TYPE e.g. on-demand

kubectl get nodes -o wide

Output columns:
  • NAME e.g. ip-10-2-12-73.us-east-1.compute.internal
  • STATUS e.g. Ready
  • ROLES    <none>
  • AGE    e.g. 1d
  • VERSION e.g. v1.32.9-eks-ecaa3a6
  • INTERNAL-IP e.g. 10.2.12.73
  • EXTERNAL-IP e.g. <none> 
  • OS-IMAGE e.g. Amazon Linux 2023.9.20251208
  • KERNEL-VERSION e.g. 6.1.158-180.294.amzn2023.aarch64 or 6.1.132-147.221.amzn2023.x86_64
  • CONTAINER-RUNTIME e.g.  containerd://2.1.5

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'

Output:

ip-10-2-12-73.us-east-2.compute.internal aws:///us-east-2a/i-039a9aaafa975358f
ip-10-2-13-147.us-east-2.compute.internal aws:///us-east-2a/i-0627bbbb3c15da009
...


List all pods (by default the output is grouped by NAMESPACE):

kubectl get pods -A -o wide 

Output columns:
  • NAMESPACE
  • NAME                                            
  • READY   (X/Y means X out of Y containers are ready)
  • STATUS
  • RESTARTS
  • AGE
  • IP
  • NODE
  • NOMINATED NODE
  • READINESS GATES

List all pods and sort them by NODE:

kubectl get pods -A -o wide --sort-by=.spec.nodeName 
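
To narrow the list down to the pods running on a single node (handy when debugging scheduling), a field selector can be used, e.g.:

kubectl get pods -A -o wide --field-selector spec.nodeName=ip-10-2-12-73.us-east-1.compute.internal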


To get a list of all namespaces in the cluster:

kubectl get ns

Output columns:
  • NAME e.g. default
  • STATUS e.g. Active
  • AGE e.g. 244d
...

Kubernetes DaemonSet

 


Kubernetes DaemonSet is a workload resource that ensures a specific pod runs on all (or selected) nodes in a cluster. It's commonly used for deploying node-level services like log collectors, monitoring agents, or network plugins.

Example:

Elastic Agents are Elastic's unified data shippers, typically used in a k8s cluster to collect container logs, Kubernetes metrics and node-level metrics, and to ship all of that data to Elasticsearch. They are deployed in the cluster as a DaemonSet.

We can use a DaemonSet to run a copy of a pod on every node, or we can use node affinity or selector rules to run it on only certain nodes.
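
A minimal DaemonSet manifest looks much like a Deployment but has no replicas field (a sketch; the name, labels and image below are illustrative):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      # Optionally restrict the DaemonSet to a subset of nodes:
      # nodeSelector:
      #   karpenter-node-pool: elastic
      containers:
      - name: agent
        image: fluent/fluentd:v1.16-1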


What is the difference between ReplicaSet and DaemonSet?

ReplicaSets ensure a specific number of identical pods run for scaling stateless apps (e.g., web servers), while DaemonSets guarantee one pod runs on every (or a subset of) node(s) for node-specific tasks like logging or monitoring. The key difference is quantity versus location: ReplicaSets focus on maintaining pod count for availability, whereas DaemonSets ensure pod presence on each node for system-level services. 


ReplicaSet

  • Purpose: Maintain a stable set of replica pods for stateless applications, ensuring high availability and scalability.
  • Scaling: Scales pods up or down based on the replicas field you define in the manifest.
  • Use Case: Running web frontends, APIs, or any application needing multiple identical instances.
  • Behavior: If a pod dies, it creates a new one to meet the replica count; if a node fails, it tries to reschedule elsewhere. 


DaemonSet

  • Purpose: Run a single copy of a pod on every (or specific) node in the cluster for node-specific tasks.
  • Scaling: Automatically adds a pod when a new node joins the cluster and removes it when a node leaves.
  • Use Case: Logging agents (Fluentd, Elastic Agent), monitoring agents (Prometheus node-exporter), or storage daemons.
  • Behavior: Ensures that a particular service runs locally on each machine for local data collection or management. 


References:

DaemonSet | Kubernetes

DevOps Interview: Replica sets vs Daemon sets - DEV Community

Monday, 5 January 2026

Kubernetes ReplicaSets

 


A ReplicaSet is a Kubernetes object that ensures a specified number of identical pod replicas are running at any given time. It's a fundamental component for maintaining application availability and scalability.

Key Functions


A ReplicaSet continuously monitors your pods and takes action if the actual number differs from the desired number:
  • If pods crash or are deleted, it creates new ones to replace them
  • If there are too many pods, it terminates the excess ones
  • This self-healing behavior keeps your application running reliably

How It Works


You define a ReplicaSet with three main components:
  • Selector: Labels used to identify which pods belong to this ReplicaSet
  • Replicas: The desired number of pod copies
  • Pod template: The specification for creating new pods when needed

Example:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-app-replicaset
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: nginx:1.14

This ReplicaSet ensures 3 nginx pods are always running with the label app: my-app.
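
Assuming the manifest above is saved to a file (called replicaset.yaml here purely for illustration), it can be applied, inspected and scaled directly, though in practice this is normally done via a Deployment:

kubectl apply -f replicaset.yaml
kubectl get rs my-app-replicaset
kubectl scale rs my-app-replicaset --replicas=5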



What is the relation between replicaset and deployment?


While ReplicaSets are still used internally by Kubernetes, you typically don't create them directly. Instead, you use Deployments, which manage ReplicaSets for you and provide additional features like rolling updates, rollbacks, and version history. Deployments are the recommended way to manage replicated applications in Kubernetes.

A Deployment is a higher-level Kubernetes object that manages ReplicaSets for you. Think of it as a wrapper that adds intelligent update capabilities on top of ReplicaSets.


The Relationship

When you create a Deployment, Kubernetes automatically creates a ReplicaSet underneath it. The Deployment controls this ReplicaSet to maintain your desired number of pods.
The key difference becomes apparent when you update your application:

With just a ReplicaSet: If you want to update your application (like changing the container image), you'd need to manually delete the old ReplicaSet and create a new one. This causes downtime.

With a Deployment: When you update the pod template, the Deployment intelligently manages the transition by:
  1. Creating a new ReplicaSet with the updated pod specification
  2. Gradually scaling up the new ReplicaSet while scaling down the old one
  3. Keeping both ReplicaSets around for rollback capability

Visual Example

Deployment: my-app
    │
    ├── ReplicaSet: my-app-abc123 (old version, scaled to 0)
    │
    └── ReplicaSet: my-app-xyz789 (current version, 3 replicas)
        ├── Pod: my-app-xyz789-1
        ├── Pod: my-app-xyz789-2
        └── Pod: my-app-xyz789-3

What Deployment Adds

  • Rolling updates: Gradual replacement of old pods with new ones
  • Rollback: Easy reversion to previous versions
  • Update strategies: Control how updates happen (RollingUpdate, Recreate)
  • Revision history: Track changes over time
  • Pause/resume: Control when updates are applied

In Practice


You almost always use Deployments rather than creating ReplicaSets directly. The Deployment handles the ReplicaSet management complexity, and you just declare what you want your application to look like.
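
The ReplicaSets behind a Deployment, and the rollback mechanism they enable, can be inspected with the standard rollout commands (names taken from the example above):

kubectl get rs -l app=my-app                  # ReplicaSets created for the Deployment
kubectl rollout history deployment/my-app     # revision history
kubectl rollout undo deployment/my-app        # roll back to the previous ReplicaSet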



Why are there many old ReplicaSets in my cluster?


% kubectl get replicaset  -n staging | grep my-app-staging
my-app-staging-545f64d64b                   0         0         0       46d
my-app-staging-57897b6496                   0         0         0       131d
my-app-staging-57979c66c                    0         0         0       186d
my-app-staging-5974fd86f9                   0         0         0       185d
my-app-staging-5f9fbdf898                   0         0         0       173d
my-app-staging-6754c468                     0         0         0       185d
my-app-staging-695469db68                   0         0         0       144d
my-app-staging-6cd7d77789                   0         0         0       185d
my-app-staging-6f5cc5b6cf                   0         0         0       185d
my-app-staging-76f5bf4bf5                   0         0         0       186d
my-app-staging-d9c7895d9                    5         5         5       27d


This is completely normal Kubernetes behaviour — and the reason is almost always revisionHistoryLimit (or the default value of it).


Why there are many old ReplicaSets

Every time a Deployment’s pod template changes, Kubernetes:
  • Creates a new ReplicaSet
  • Scales the old one down to 0
  • Keeps the old ReplicaSet for rollback purposes

Examples of changes that trigger a new ReplicaSet:
  • Image tag change
  • Env var change
  • ConfigMap checksum change
  • Resource requests/limits change
  • Annotation change on the pod template
  • Helm re-deploy with different values

👉 Over time, this naturally accumulates ReplicaSets like the ones you see.

Why they are so old (46–186 days)

Because Kubernetes never deletes them automatically unless told to.

By default:

revisionHistoryLimit: 10


If:
  • revisionHistoryLimit is unset
  • or explicitly set high
  • or Helm chart doesn’t define it

...Kubernetes will happily keep many old ReplicaSets around indefinitely (there is no age-based cleanup, only the revision-count limit).

Your output strongly suggests:
  • Multiple deployments over months
  • Each one created a new ReplicaSet
  • All old ones were scaled to 0
  • Only the newest one is active:
            my-app-staging-d9c7895d9   5/5   27d   ← current

Why this is not usually a problem

Old ReplicaSets with 0 replicas:
  • Consume almost no resources
  • Do not schedule pods
  • Exist mainly as metadata
So:
  • Performance impact → negligible
  • Scheduler impact → none

When it does become a problem

It’s worth cleaning up if:
  • You have hundreds or thousands of old ReplicaSets
  • kubectl get rs becomes noisy
  • GitOps / audits become painful
  • You accidentally roll back to a very old revision
  • etcd size is a concern (rare, but real at scale)

The right fix: limit revision history

Set this on the Deployment:

spec:
  revisionHistoryLimit: 3


or (Helm values example):

revisionHistoryLimit: 3


This means:
  • Kubernetes keeps only the last 3 old ReplicaSets
  • Older ones are automatically deleted

Typical values:
  • 2–3 for staging
  • 5–10 for prod (depending on rollback needs)
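
On an existing Deployment this can also be applied as a one-off patch, without editing the manifest (adjust the name and namespace to your case):

kubectl patch deployment my-app-staging -n staging \
  --type merge -p '{"spec":{"revisionHistoryLimit":3}}'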

How to confirm current setting

kubectl get deployment my-app-staging -n staging -o yaml | grep revisionHistoryLimit


If nothing shows → default behaviour applies.

Can you safely delete the old ones manually?

Yes — as long as they’re at 0 replicas.

Example:

kubectl delete rs my-app-staging-545f64d64b -n staging


But:
  • This is a one-off cleanup
  • Without fixing revisionHistoryLimit, they’ll come back
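
A one-off bulk cleanup of all scaled-to-zero ReplicaSets in the namespace could look like this (a sketch; review the output of the inner command before deleting anything):

kubectl delete rs -n staging $(kubectl get rs -n staging --no-headers | awk '$2 == 0 {print $1}')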

TL;DR

  • Many old ReplicaSets is normal
  • They exist for rollback history
  • They’re old because Kubernetes keeps them indefinitely
  • Fix it with revisionHistoryLimit
  • Manual deletion is safe but not a long-term solution


Friday, 2 January 2026

Kubernetes Probes

 


Kubernetes restarts the containers if they exit or crash. But this is sometimes not enough indication that the apps are running fine. They can stall and get into an error state while their process is still running indefinitely.

This is why we need to proactively probe the pods in order to check that the app is alive and ready to serve the requests.

Kubernetes probes are tests the cluster runs against pods to see if they are good to serve traffic, or if they need corrective action.


Liveness probe

  • Purpose: 
    • Tells Kubernetes if the application is still running properly
    • Detects that the application/pod is in such a state of failure that it requires a restart
    • Checks if a container is alive (running and healthy/responsive)
    • Detects situations where an application is running but unresponsive (e.g., deadlock) and needs a restart to recover. 
      • For example: suddenly, because of a bug or some other reason, instead of 2xx (or 3xx responses, both of which k8s treats as success), the app's endpoints start returning error codes that indicate failure. Without a liveness probe the pod will not restart on its own, and we won't see that something is wrong unless we monitor the pod logs.
    • Ensures basic functionality by restarting. Each time pod restarts, RESTARTS number gets incremented by 1.
  • What it checks: 
    • Whether the app is in a broken state that requires a restart (deadlocked, corrupted state, etc.)
  • Behavior: 
    • Runs throughout the container's lifetime
  • Action on Failure: 
    • Restarts the container. Container is killed and restarted.
  • When to Use: 
    • For fundamental health checks, like ensuring a web server process is running. 
    • To force a restart of our container even if it did not crash (terminate with a non-zero code) or exit (terminate gracefully, with a zero code). We can use this to get our app killed and restarted when an unrecoverable or unforeseen problem happens. But sometimes it is a bad idea to kill the container: it may not be dead, it may just be taking very long to respond while processing a big user request. In that case we should not kill it.

Example:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 1
  timeoutSeconds: 1


The configuration above tells the kubelet to send an HTTP GET request to the pod's /health endpoint 10 seconds (initialDelaySeconds) after the pod is started (or restarted), to keep probing every 10 seconds (periodSeconds), to consider a probe failed if a 2xx/3xx response is not returned within 1 second (timeoutSeconds; 1 second is also the default value), and to report the pod unhealthy on the 1st failed probe (failureThreshold).

Example #2:

livenessProbe:
  httpGet:
    path: /healthz/live


How can the app know that it's not deadlocked and that the liveness endpoint should return 200 OK?


Detecting deadlocks is tricky because a deadlocked thread can't check itself. Here are practical approaches:


1. Simple "I'm Alive" Check (Most Common)


Just return 200 if the HTTP server can respond:


@app.get("/healthz/live")
async def liveness_probe():
    return {"status": "ok"}

Logic: If the web server thread is responding, the process isn't completely frozen. This catches:
  • Process crashes
  • Complete hangs
  • Out of memory kills

Limitation: Doesn't detect deadlocks in background workers or specific threads.

2. Heartbeat Pattern (Background Workers)


Track that critical background tasks are still running:

import time
from threading import Thread

class Heartbeat:
    def __init__(self, timeout=60):
        self.last_beat = time.time()
        self.timeout = timeout
    
    def beat(self):
        self.last_beat = time.time()
    
    def is_alive(self):
        return (time.time() - self.last_beat) < self.timeout

worker_heartbeat = Heartbeat()

def background_worker():
    while True:
        # Do work
        process_queue()
        worker_heartbeat.beat()
        time.sleep(5)

@app.get("/healthz/live")
async def liveness_probe():
    if not worker_heartbeat.is_alive():
        return Response(status_code=503)
    return {"status": "ok"}



3. Watchdog Timer Pattern (More Robust)


Use a separate monitoring thread that expects regular check-ins:

import threading
import time

class WatchdogTimer:
    def __init__(self, timeout=30):
        self.timeout = timeout
        self.last_reset = time.time()
        self.lock = threading.Lock()
        self.healthy = True
        
        # Start watchdog thread
        self.thread = threading.Thread(target=self._monitor, daemon=True)
        self.thread.start()
    
    def reset(self):
        """Call this periodically from your main processing loop"""
        with self.lock:
            self.last_reset = time.time()
    
    def _monitor(self):
        while True:
            time.sleep(5)
            with self.lock:
                if (time.time() - self.last_reset) > self.timeout:
                    self.healthy = False
                else:
                    self.healthy = True
    
    def is_healthy(self):
        with self.lock:
            return self.healthy

watchdog = WatchdogTimer(timeout=30)

# In your main processing loop
def process_requests():
    while True:
        handle_request()
        watchdog.reset()  # Signal we're still alive

@app.get("/healthz/live")
async def liveness_probe():
    if not watchdog.is_healthy():
        return Response(status_code=503)
    return {"status": "ok"}


4. Critical Resource Check (Pragmatic)


Check that critical resources are accessible:

@app.get("/healthz/live")
async def liveness_probe():
    try:
        # Can we still access the database connection pool?
        db.ping()  # Lightweight check
        return {"status": "ok"}
    except Exception:
        # If we can't even ping, we're probably deadlocked on DB
        return Response(status_code=503)
5. Async Event Loop Check (Python-specific)


For async applications, verify the event loop isn't blocked:

import asyncio

@app.get("/healthz/live")
async def liveness_probe():
    try:
        # If this times out, event loop is blocked
        await asyncio.wait_for(
            asyncio.sleep(0),
            timeout=1.0
        )
        return {"status": "ok"}
    except asyncio.TimeoutError:
        return Response(status_code=503)


Best Practice: Keep It Simple

For most apps, approach #1 (simple check) is sufficient:

@app.get("/healthz/live")
async def liveness_probe():
    return {"status": "ok"}

Add complexity only if you have specific deadlock scenarios you've observed. The liveness probe should be:
  • Fast (< 100ms)
  • Lightweight (no heavy computation or I/O)
  • Reliable (doesn't fail due to temporary issues)

Remember: Kubernetes will restart your container if liveness fails, so false positives cause unnecessary restarts. When in doubt, start simple and add monitoring only for known failure modes.


Startup probe

  • Purpose: 
    • Tells Kubernetes when the application has started successfully
    • Delays the start of the other probes when the pod is created or restarted.
  • What it checks:
    •  Whether the app has completed initialization (loaded config, connected to databases, warmed up caches, etc.)
  • Behavior: 
    • Runs only during startup; once it succeeds, it stops running
  • Failure: 
    • Container is killed and restarted
  • When to Use: 
    • When the app needs some time to start up, we don't want to set an arbitrary value for the liveness probe's initialDelaySeconds. A small initial delay means some apps may not be ready yet when the liveness probe kicks in, and they will get killed unnecessarily. A large initial delay has the opposite problem: bad containers run for longer before Kubernetes notices they are bad. The startup probe takes care of both cases: containers that get ready quickly are put to work quickly, and containers that take longer still have a chance of getting ready instead of being terminated too soon.
The liveness probe starts after the first successful startup probe (if one is configured), as the startup probe's main job is to delay liveness and readiness checks until the application is fully initialized, preventing premature restarts of slow-starting containers. Once the startup probe succeeds, Kubernetes begins running the regular liveness and readiness probes for the container's entire lifecycle. 

If a startup probe is defined, the initialDelaySeconds configuration for the liveness and readiness probes is irrelevant (specifically, those probes are disabled) until the startup probe succeeds. 

The startup probe acts as a gatekeeper: 
  • While the startup probe is running and failing, the liveness and readiness probes are paused and do not begin their checks.
  • Only after the startup probe has successfully passed does the standard logic for liveness and readiness probes begin, at which point their respective initialDelaySeconds (if set) will then be respected. 
This mechanism is designed to handle applications with variable or long startup times, preventing the liveness probe from prematurely killing the container or the readiness probe from marking it as "not ready" before it has had a chance to fully initialize. 

Here's the sequence:
  • Container Starts: The container begins to run.
  • Startup Probe Runs: The kubelet executes the startup probe.
  • Startup Probe Succeeds: If successful, the application is considered started, and the startup probe stops running.
  • Liveness & Readiness Start: Kubernetes immediately begins periodically running the configured liveness and readiness probes.
  • Liveness Probe Fails: If the liveness probe fails, Kubernetes restarts the container.
  • Readiness Probe Fails: If the readiness probe fails, the pod is removed from service endpoints but not restarted. 
In essence: The startup probe acts as a gatekeeper, allowing slow applications to get ready before the health checks (liveness/readiness) kick in and potentially kill them. 


Example configuration:

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 10
  periodSeconds: 5
  timeoutSeconds: 1 

This configuration tells k8s to probe this pod every 5 seconds (periodSeconds), wait up to 1 second (timeoutSeconds; 1 second is the default value) for a response and fail only after 10 failed attempts (failureThreshold). So this app has failureThreshold × periodSeconds = 10 × 5 = 50 seconds to initialize before the startup probe gives up.


Example configuration #2 (recommended endpoint format):

startupProbe:
  httpGet:
    path: /healthz/startup


/healthz/startup returns 200 only after full initialization is complete. 

How can the app know that full initialization is complete so it can return 200 OK for the startup endpoint?


The app needs to track its initialization state internally. Here are common approaches:


1) Boolean Flag Pattern (Simplest)


# At module level
is_initialized = False

async def initialize():
    # Load configuration
    config = load_config()
    
    # Connect to database
    await db.connect()
    
    # Warm up caches
    await cache.warm_up()
    
    # Mark as ready
    global is_initialized
    is_initialized = True

@app.get("/healthz/startup")
async def startup_probe():
    if is_initialized:
        return {"status": "ok"}
    return Response(status_code=503)


While initialization is not yet complete, we chose to return HTTP 503 (Service Unavailable), as this server error status code indicates that the server is not ready to handle the request.


2) Dependency Checklist Pattern (More Robust)



class AppState:
    def __init__(self):
        self.config_loaded = False
        self.db_connected = False
        self.cache_ready = False
    
    def is_ready(self):
        return all([
            self.config_loaded,
            self.db_connected,
            self.cache_ready
        ])

app_state = AppState()

async def initialize():
    app_state.config_loaded = load_config()
    app_state.db_connected = await db.connect()
    app_state.cache_ready = await cache.warm_up()

@app.get("/healthz/startup")
async def startup_probe():
    if app_state.is_ready():
        return {"status": "ok"}
    return Response(status_code=503)


3) Application Lifecycle Hook (Framework-specific)


# FastAPI example
from fastapi import FastAPI, HTTPException

app = FastAPI()
startup_complete = False

@app.on_event("startup")
async def startup_event():
    global startup_complete
    # Do all initialization
    await db.connect()
    await load_data()
    startup_complete = True

@app.get("/healthz/startup")
async def startup_probe():
    if startup_complete:
        return {"status": "ok"}
    raise HTTPException(status_code=503)


4) State Machine Pattern (For Complex Apps)


from enum import Enum

class AppStatus(Enum):
    STARTING = "starting"
    READY = "ready"
    SHUTTING_DOWN = "shutting_down"

current_status = AppStatus.STARTING

async def initialize():
    global current_status
    # initialization steps...
    current_status = AppStatus.READY

@app.get("/healthz/startup")
async def startup_probe():
    if current_status == AppStatus.READY:
        return {"status": "ok"}
    return Response(status_code=503)


Key Points

  • The startup endpoint returns 503 (Service Unavailable) until initialization is done, then 200 OK
  • Kubernetes will keep checking until it gets 200, or kill the container if it takes too long (configurable via failureThreshold and periodSeconds)
  • The flag/state should be set after all critical initialization steps complete
  • This is typically done in your application's main entry point or framework's startup hooks

Readiness probe

  • Purpose: 
    • Determines if an application is fully initialized and prepared to handle incoming network requests.
    • Checks if a container is ready to serve traffic (e.g., database connected, cache loaded)
    • Manages traffic flow for gradual rollouts and recovery from temporary issues
    • Tells Kubernetes if the application can handle traffic
  • Action on Failure: 
    • Pod is removed from service endpoints (no restart)
    • Stops sending new traffic to the pod (removes it from service) but doesn't restart it.
    • It temporarily removes container from service if not ready, preventing new requests but not restarting
    • It stops the pod from taking traffic, but it will not restart the container.
    • If the Deployment has only 1 replica and this single pod is taken out of ready status, for a while no pods will be READY and no users can be served. This can be worked around by setting multiple replicas for the Deployment or by using a PodDisruptionBudget object.
  • What it checks: 
    • Whether the app is temporarily unable to serve requests (overloaded, waiting for dependencies, performing maintenance)
  • Behavior: 
    • Runs throughout the container's lifetime
  • When to Use: 
    • When an app needs time to load data, connect to databases, or initialize, preventing users from hitting a partially ready instance.
    • We need this probe in addition to the liveness probe because there can be a case where the app needs extra time to process some request, which means the next liveness check would fail (if request processing time exceeds the liveness probe's periodSeconds + timeoutSeconds). In that case the failed liveness probe would restart a (healthy!) pod just because it took a long time to process a request, and setting a large failureThreshold or timeoutSeconds on the liveness probe is not a good fix.
    • Benefits of using Readiness probe are in that by not restarting the container, the user is still going to get his response, and whatever processing that was halfway through is not lost. Also, by taking the pod out of service, we avoid sending more traffic to a blocked container that is, at least right now, responding slowly to all requests.
Example:

livenessProbe:
  ...
  failureThreshold: 10   # increased

readinessProbe:
  httpGet:
    path: /healthy
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 1


The readiness probe configuration above instructs the kubelet to wait 10 seconds (initialDelaySeconds) after the pod's start or restart, query the pod every 10 seconds (periodSeconds) and fail after the 1st failed attempt (failureThreshold). This is appropriate if it is acceptable for the pod to take up to 10 seconds to process a request. If this probe fails, k8s takes the pod out of service (it is marked as not ready, so the READY column will show e.g. 0/1) but the pod is NOT restarted (the RESTARTS value stays the same). As soon as the long request is served and the next readiness probe succeeds, k8s puts the pod back into service: it starts receiving requests again and the READY column shows that the pod is ready to serve them.

The liveness and readiness probes run at the same time. If they have the same settings (failureThreshold, periodSeconds, timeoutSeconds), then when the readiness probe fails and the pod is about to be taken out of ready status, the liveness probe will also fail and the container gets restarted. It does not make much sense to put the same configs on readiness and liveness probes: the readiness probe becomes ineffective, and there is no point in taking a pod out of service while restarting it.


Example #2:

Deployment configuration

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 9
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 30
  timeoutSeconds: 60


20 seconds after pod is (re)started, both liveness and readiness probes will kick in. 

How can the app know that it can accept traffic right now and return 200 OK for the readiness probe?



The readiness probe should check dependencies and capacity - things that might be temporarily unavailable but don't require a restart. Here are practical approaches:

1. Dependency Health Checks (Most Common)


Check that external services your app needs are accessible:

@app.get("/healthz/ready")
async def readiness_probe():
    checks = {}
    
    # Check database
    try:
        await db.execute("SELECT 1", timeout=1)
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = "failed"
        return Response(
            status_code=503,
            content=json.dumps({"status": "not ready", "checks": checks})
        )
    
    # Check Redis cache
    try:
        await redis.ping(timeout=1)
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "failed"
        return Response(status_code=503, content=json.dumps(checks))
    
    # Check message queue
    try:
        await rabbitmq.health_check(timeout=1)
        checks["rabbitmq"] = "ok"
    except Exception:
        checks["rabbitmq"] = "failed"
        return Response(status_code=503, content=json.dumps(checks))
    
    return {"status": "ready", "checks": checks}

2. Circuit Breaker Pattern


Track downstream service health and fail fast:

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half_open
    
    def is_available(self):
        if self.state == "closed":
            return True
        
        if self.state == "open":
            # Check if timeout has passed
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half_open"
                return True
            return False
        
        return True  # half_open, try again
    
    def record_success(self):
        self.failures = 0
        self.state = "closed"
    
    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = "open"

db_circuit = CircuitBreaker()

@app.get("/healthz/ready")
async def readiness_probe():
    if not db_circuit.is_available():
        return Response(status_code=503, content="Database circuit open")
    
    try:
        await db.ping()
        db_circuit.record_success()
        return {"status": "ready"}
    except Exception:
        db_circuit.record_failure()
        return Response(status_code=503)

3. Load/Capacity Checks


Check if the app is overloaded:

import psutil

@app.get("/healthz/ready")
async def readiness_probe():
    # Check memory usage
    memory = psutil.virtual_memory()
    if memory.percent > 90:
        return Response(
            status_code=503,
            content=f"Memory usage too high: {memory.percent}%"
        )
    
    # Check CPU usage
    cpu = psutil.cpu_percent(interval=0.1)
    if cpu > 95:
        return Response(status_code=503, content=f"CPU too high: {cpu}%")
    
    # Check active connections/requests
    active_requests = get_active_request_count()
    if active_requests > MAX_CONCURRENT_REQUESTS:
        return Response(
            status_code=503,
            content=f"Too many active requests: {active_requests}"
        )
    
    return {"status": "ready"}


4. Graceful Shutdown State


During shutdown, mark as not ready while finishing existing requests:

class AppState:
    def __init__(self):
        self.accepting_traffic = True
        self.active_requests = 0
    
    def start_shutdown(self):
        self.accepting_traffic = False

app_state = AppState()

@app.get("/healthz/ready")
async def readiness_probe():
    if not app_state.accepting_traffic:
        return Response(status_code=503, content="Shutting down")
    return {"status": "ready"}

@app.on_event("shutdown")
async def shutdown_event():
    # Stop accepting new traffic
    app_state.start_shutdown()
    
    # Wait for existing requests to complete
    while app_state.active_requests > 0:
        await asyncio.sleep(0.1)

5. Combined Example (Production-Ready)


from datetime import datetime, timedelta

class ReadinessChecker:
    def __init__(self):
        self.is_shutting_down = False
        self.last_db_check = None
        self.db_healthy = True
        self.cache_timeout = timedelta(seconds=5)
    
    async def check_database(self):
        """Cached database check"""
        now = datetime.now()
        
        # Use cached result if recent
        if (self.last_db_check and 
            now - self.last_db_check < self.cache_timeout):
            return self.db_healthy
        
        try:
            await db.execute("SELECT 1", timeout=2)
            self.db_healthy = True
            self.last_db_check = now
            return True
        except Exception as e:
            logger.warning(f"Database health check failed: {e}")
            self.db_healthy = False
            self.last_db_check = now
            return False
    
    async def is_ready(self):
        if self.is_shutting_down:
            return False, "shutting down"
        
        if not await self.check_database():
            return False, "database unavailable"
        
        # Check memory
        memory = psutil.virtual_memory()
        if memory.percent > 90:
            return False, f"high memory: {memory.percent}%"
        
        return True, "ok"

readiness = ReadinessChecker()

@app.get("/healthz/ready")
async def readiness_probe():
    ready, reason = await readiness.is_ready()
    
    if ready:
        return {"status": "ready"}
    else:
        return Response(
            status_code=503,
            content=json.dumps({"status": "not ready", "reason": reason})
        )

Key Principles


DO check:
  • External dependencies (databases, caches, APIs)
  • System resources (memory, CPU, disk)
  • Application capacity (connection pools, queue sizes)
  • Shutdown state

DON'T check:
  • Things that would require a restart to fix (use liveness for that)
  • Expensive operations (keep it under 1-2 seconds)
  • Non-critical dependencies (if you can serve requests without it)

Timeouts are critical: Always use short timeouts (1-2 seconds) for dependency checks. A hanging check is worse than a failed check.

When the readiness probe fails, Kubernetes removes the pod from the service load balancer, so traffic stops flowing to it. Once dependencies recover and the probe succeeds again, traffic resumes - no restart needed.



Key Differences


  • Impact:
    • Liveness restarts; Readiness stops traffic.
  • Focus: 
    • Liveness is about the process being alive; Readiness is about being able to serve requests.
  • Scenario: 
    • A liveness failure means the container is broken; a readiness failure means it's busy or temporarily unavailable. 

Example Scenario


Imagine a web app:
  • Startup: Uses a startup probe to wait for initial setup (slow).
  • Running:
    • Liveness Probe: Checks if the web server process is still running. If it crashes, restart it.
    • Readiness Probe: Checks if the database connection is active and cache is warm. If not, stop traffic until it's ready, avoiding errors for users. 

What to choose for probe endpoints?


Endpoints (or sometimes probe endpoints/paths) generally shouldn't be the same - each serves a distinct purpose.


Example implementation:

livenessProbe:
  httpGet:
    path: /healthz/live
    
readinessProbe:
  httpGet:
    path: /healthz/ready
    
startupProbe:
  httpGet:
    path: /healthz/startup


Typical endpoint logic:

/healthz/startup - Returns 200 only after full initialization is complete
/healthz/live - Returns 200 if the process is running and not deadlocked (lightweight check)
/healthz/ready - Returns 200 if the app can accept traffic right now (may check dependencies, load, etc.)

Many simple applications use the same endpoint for liveness and readiness, but separating them gives you more control over pod lifecycle management.



Readiness Probe Times Out Scenario


Is it normal to have a deployment which pods frequently emit readiness probe error like this: 

Readiness probe failed: Get "http://10.2.30.79:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

It’s not “normal”, but it is common in certain situations — and whether it’s acceptable depends on how often, when, and what impact it has.

The error above tells us:
  • Kubelet could connect to the pod IP
  • The HTTP request did not receive response headers within the probe timeout
  • The container was not killed (readiness ≠ liveness)
  • The pod was temporarily marked NotReady and removed from Service endpoints
So this is about latency or blocking, not a hard crash.


It is NOT normal if:
  • It happens frequently under normal load
  • Pods flap between Ready / NotReady
  • Traffic drops or retries spike
  • Deployments take a long time to become stable
  • You see cascading failures in upstream services

This usually indicates:
  • Slow or blocking health endpoint
  • Thread starvation
  • GC pauses
  • DB / external dependency slowness
  • CPU or memory pressure
  • Too aggressive probe settings

It is somewhat normal if:
  • It happens briefly during startup
  • It happens during traffic spikes
  • It coincides with:
    • Rolling deployments
    • Node pressure
    • Cold caches
    • JVM warm-up / Python startup
  • Pods recover quickly and remain stable

Even then, it’s still a signal worth fixing, not something to ignore.


Common root causes (very common in real clusters)


1) Health endpoint is doing “real work”

Bad pattern:

/health → checks DB → checks Redis → checks downstream API

If any dependency is slow, readiness fails.

Best practice:
  • Readiness = “Can I serve traffic?”
  • Keep it fast and in-memory
  • Avoid network calls


2) Probe timeout too low

Default:

timeoutSeconds: 1

That’s extremely aggressive.

If your app occasionally takes >1s to respond → you’ll see exactly this error.

Typical sane values:

timeoutSeconds: 2–5
failureThreshold: 3
periodSeconds: 10

3) CPU throttling or memory pressure

Symptoms:
  • Happens under load
  • Worse on smaller nodes
  • Improves if you scale replicas

Check:

kubectl top pod
kubectl describe pod <pod>


Look for:
  • CPU limits too low
  • OOM kills nearby in time
  • High GC time (JVM)


4) Thread pool exhaustion

Very common with:
  • Java (Tomcat / Netty)
  • Node.js
  • Python (Gunicorn)

If all worker threads are busy:
  • App accepts connection
  • But doesn’t return headers in time
  • Probe times out


5) Node-level issues

Less common but real:
  • kubelet under pressure
  • noisy neighbors
  • CNI latency
  • iptables overload on busy nodes

When does it become a real problem?

This is definitely a problem if:
  • HPA keeps scaling up/down unnecessarily
  • Ingress shows intermittent 502/503
  • You see “endpoint removed” events frequently
  • Rolling updates never fully settle
  • You rely on readiness for zero-downtime deploys


What to do immediately (practical checklist)


Hit the endpoint manually:

kubectl exec -it <pod> -- curl -v http://localhost:8080/health


Time it under load.
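
curl's timing output makes this concrete, e.g.:

kubectl exec -it <pod> -- curl -s -o /dev/null -w 'total: %{time_total}s\n' http://localhost:8080/health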

Check probe config:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 1   # 🚩likely too low


Check resource limits:

resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "200m"       # 🚩 possible throttling


Check events:

kubectl describe pod


Check if readiness == liveness
If they point to the same endpoint → that’s usually wrong.

Rule of thumb:


A readiness probe should almost never fail once the pod is fully running and warmed up.

Occasional blips under extreme load might happen, but frequent failures mean something is mis-sized, misconfigured, or doing too much work.


How to check if the Readiness probe timeout failure (mentioned above) takes place on pod's startup (there is no Startup probe) or later, during regular running?



You can determine very reliably whether readiness probe timeouts happen only at startup or later during steady state, even without a startup probe.

1) Check when readiness failures happen relative to pod start


Look at pod events with timestamps:

kubectl describe pod <pod-name> -n staging


Focus on:

Events:
  Type     Reason     Age                  Message
  ----     ------     ----                 -------
  Warning  Unhealthy  3m (x12 over 30m)    Readiness probe failed: ...


Now compare that with:

Start Time:  <timestamp>

Interpret:

❌ Failures minutes/hours after Start Time → NOT startup
⚠️ Failures only within first ~30–120s → startup-related
🚨 Failures continue long after pod is Ready → runtime issue

This alone often answers the question.

2) Correlate readiness failures with container restarts


Even though readiness doesn’t restart pods, startup-related readiness failures usually coincide with restarts caused by something else.

kubectl get pod <pod-name> -n staging -o wide
kubectl describe pod <pod-name> -n staging


Look for:

Restart Count: 0


Interpretation:

  • Restart Count = 0 and failures happen later → runtime slowness
  • Restarts + readiness failures early → startup / initialization

3) Use kubectl get events sorted by time (very useful)


kubectl get events -n staging \
  --field-selector involvedObject.name=<pod-name> \
  --sort-by='.lastTimestamp'


This gives you a timeline:

Normal   Scheduled   pod/xxx   ...
Normal   Pulled      pod/xxx   ...
Normal   Started     pod/xxx   ...
Warning  Unhealthy   pod/xxx   Readiness probe failed
Normal   Ready       pod/xxx   ...

What you want to see:

Do Unhealthy events appear after the pod has already been Ready once?

If yes → definitely not startup.

4) Watch readiness transitions live


If the issue is intermittent:

kubectl get pod <pod-name> -n staging -w


You’ll see:

READY   STATUS
1/1     Running
0/1     Running   ← readiness failed
1/1     Running


Now compare with pod age:

kubectl get pod <pod-name> -n staging


If the pod is:

AGE=2h and readiness flips → runtime issue
AGE=20s → startup

5) Check application logs around the failure time


Readiness timeouts often align with:
  • GC pauses
  • Thread pool exhaustion
  • Slow DB calls

kubectl logs <pod-name> -n staging --timestamps


Then compare timestamps with event timestamps from step 3.

If you see:
  • Long pauses
  • Request backlogs
  • “Waiting for DB” messages

→ not startup.

6) Check if readiness fails during deployments only


This is a big clue.

kubectl rollout history deployment next-api-staging -n staging


Then check:

Do readiness failures cluster around rollout times?
Or happen randomly hours later?

Random = runtime
Rollout-only = startup / warm-up

7) (Optional but powerful) Add logging to the health endpoint


If you can change the app:
  • Log request duration for /health
  • Log thread pool usage
  • Log dependency latency

This gives you absolute proof.

8) Why a startupProbe would help (even now)


If failures are startup-only, the clean fix is:

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 5


This:
  • Disables readiness + liveness until startup completes
  • Removes false negatives during cold start

Decision table:

Observation                         Conclusion
Failures only in first 1–2 min      Startup issue
Failures after pod was Ready        Runtime issue
Pod age is hours/days               Runtime issue
HPA / traffic spikes correlate      Runtime issue
GC / DB slowness in logs            Runtime issue

Bottom line

If you ever see a pod go Ready → NotReady → Ready again long after startup, it is NOT a startup problem — startup probes won’t fix it.

Let's say that describe pod output contains: 

Start Time: Tue, 23 Dec 2025 15:41:04 +0000 

and 

Events: 
Type Reason Age From Message
---- ------ ---- ---- ------- 
Warning Unhealthy 33m (x6 over 3h42m) kubelet Readiness probe failed: Get "http://10.2.44.98:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)


What this data proves

Pod start time: Start Time: Tue, 23 Dec 2025 15:41:04 +0000
Readiness failures: Warning  Unhealthy  33m (x6 over 3h42m)

This means:

The pod has been running for hours
Readiness probe failures happened repeatedly over ~4 hours
Failures occurred as recently as 33 minutes ago

Conclusion:
  • This is not related to pod startup
  • A startupProbe would not fix this

What is happening

Your pod is occasionally too slow to respond to GET /health within the probe timeout. During those moments:
  • The pod stays Running
  • It flips to NotReady
  • It is removed from Service endpoints
  • Traffic is temporarily stopped
  • Then it recovers

This is classic runtime degradation, not initialization.

Most likely root causes (ranked)


1) Health endpoint is blocking or slow


Very common:

/health waits on:
  • DB
  • Redis
  • external APIs

Under load → response time exceeds probe timeout

Strong indicator:

“awaiting headers” → handler accepted connection but couldn’t respond

2) Thread / worker starvation


Especially if this is:
  • Node.js
  • Java (Tomcat / Spring Boot)
  • Python (Gunicorn / uWSGI)

Symptoms:
  • All workers busy handling traffic
  • Health endpoint queued
  • Probe times out

3) CPU throttling


If CPU limits are tight:
  • App runs
  • But occasionally gets throttled
  • Health check misses deadline

Check:

kubectl top pod <pod> -n staging

4) GC pauses / memory pressure


Common with JVM apps:
  • Stop-the-world GC pauses
  • Health endpoint blocked briefly

What will NOT help:
  • startupProbe
  • increasing initialDelaySeconds only
  • ignoring the warnings

Recommended Immediate actions:


1) Relax readiness probe (quick mitigation)


If current config is aggressive (e.g. timeoutSeconds: 1):

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 3
  periodSeconds: 10
  failureThreshold: 3


This reduces false negatives without hiding real failures.

2) Ensure readiness endpoint is lightweight


Best practice:
  • In-memory checks only
  • No DB / Redis / network calls
  • No locks or long async work

If you need deeper checks:
  • Put them behind /health/deep
  • Use them for diagnostics, not readiness

3) Check resource limits carefully


Look for this pattern:

limits:
  cpu: 200m   # 🚩 common cause


If you see CPU throttling:
  • Increase limit
  • Or remove limit and rely on requests

4) Correlate with traffic & deploys


Ask:
  • Do failures align with traffic spikes?
  • With cron jobs?
  • With background workers?

This helps pinpoint starvation.

How to prove the cause (next steps)


If you want to be 100% sure:
  • Log request duration for /health
  • Log worker pool usage
  • Compare timestamps with readiness failures
  • Temporarily increase timeoutSeconds

If failures disappear → latency problem confirmed

Bottom line:


Your readiness probe is doing its job correctly — it’s telling you the app is occasionally not ready to serve traffic. This is a runtime performance or design issue, not Kubernetes misbehaviour.

Let's say we have this config:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 9
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 30
  timeoutSeconds: 60

and     

    router.get("/health", (_req: Request, res: Response) => {
        return res.status(200).json({});
    });

With this config the problem is almost certainly not Kubernetes probe tuning and not your health handler logic.

Let’s walk through it carefully and then I’ll tell you what is left.

1) This probe configuration is sane (even generous)

Readiness:
periodSeconds: 10
timeoutSeconds: 9


That means:

Kubelet waits up to 9 seconds
Only 1 second “gap” before next probe
Failure happens only if the app does not send headers for 9s

This is not aggressive at all.

Liveness:
timeoutSeconds: 60

Very forgiving — and you're not seeing restarts, which confirms that the app does eventually respond; it's just occasionally slow.

2) Your /health handler is trivial

router.get("/health", (_req, res) => {
  return res.status(200).json({});
});


This:
  • Does no I/O
  • No DB calls
  • No async work
  • Should respond in microseconds

So if this times out: the event loop was blocked, or the process wasn’t scheduled. That narrows the root causes dramatically.

Given everything above, only a few things can explain a 9-second no-headers timeout.

Most likely causes (ranked)


1) Node.js event loop blockage


If your app:
  • Does CPU-heavy work
  • JSON serialization of big objects
  • Synchronous crypto / compression
  • Large regexes
  • Blocking filesystem calls
Then:
  • Event loop freezes
  • /health cannot respond
  • Probe times out

This is the #1 cause in Node apps.

2) CPU throttling at container level


Even if your code is fine:

limits:
  cpu: 200m


Under load:
  • Node is throttled
  • Event loop starves
  • Requests queue
  • Health check misses deadline

Very common in staging.

3) Too few Node workers


If you are running:
  • A single Node process
  • No cluster / PM2
  • Handling concurrent traffic

Then a single hot path can stall everything.

4) Long GC pauses

If:
  • Memory limit is tight
  • Heap pressure increases

Node’s GC can block the event loop for seconds.

4) What it is definitely not

  • Not startup
  • Not probe timeout config
  • Not the health handler
  • Not networking
  • Not kubelet flakiness

This is application runtime starvation

How to prove it conclusively (very actionable)

A. Measure event loop lag


Add something like:

setInterval(() => {
  const start = Date.now();
  setImmediate(() => {
    const lag = Date.now() - start;
    if (lag > 1000) {
      console.warn("Event loop lag:", lag);
    }
  });
}, 1000);


You’ll see spikes that align with readiness failures.

B. Check CPU throttling


kubectl top pod <pod> -n staging


and, inside the container, the cgroup throttling counters (cgroup v2 path shown; on cgroup v1 it is /sys/fs/cgroup/cpu/cpu.stat):

kubectl exec -it <pod> -n staging -- cat /sys/fs/cgroup/cpu.stat

Look for nr_throttled / throttled_usec increasing over time.

C. Increase replicas temporarily


If readiness failures disappear when you scale up:

kubectl scale deploy next-api-staging -n staging --replicas=10

→ confirms saturation/starvation.

Immediate mitigations (practical)


1) Increase CPU limit


If you have:

limits:
  cpu: 200m


Try:

limits:
  cpu: 500m


or remove limit entirely (often better for Node).

2) Run multiple Node workers


Use:

  • the Node cluster module
  • pm2 -i max
  • or one pod = one core (best)

3) Keep readiness & liveness separate (minor improvement)


livenessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 10


No need for 60s liveness; it hides deadlocks.

Why this is actually good news

Your readiness probe is doing exactly what it should:
  • Detecting real inability to serve traffic
  • Preventing bad pods from receiving requests
  • You’ve uncovered a real production issue that just happens to surface in staging first.

Bottom line


A trivial /health endpoint timing out for 9 seconds in Node.js means the event loop is blocked or starved.




---

References: