Friday, 2 January 2026

Kubernetes Probes

 


Kubernetes restarts containers when they exit or crash, but that alone does not guarantee the apps are running fine: an application can stall or get into an error state while its process keeps running indefinitely.

This is why we need to proactively probe the pods to check that the app is alive and ready to serve requests.

Kubernetes probes are tests the cluster runs against pods to check whether they are fit to serve traffic, or whether corrective action is needed.
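
Probes are defined per container in the pod template. A minimal sketch of where they live (the pod name, image and port here are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: web-app                        # illustrative name
spec:
  containers:
    - name: web
      image: example.com/web-app:1.0   # illustrative image
      ports:
        - containerPort: 8080
      livenessProbe:                   # readinessProbe and startupProbe
        httpGet:                       # go at this same level
          path: /health
          port: 8080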


Liveness probe

  • Purpose: 
    • Checks if a container is alive (running and healthy/responsive)
    • Detects situations where an application is running but unresponsive (e.g., deadlock) and needs a restart to recover. 
      • For example: because of a bug or some other reason, instead of 2xx or 3xx responses (both of which k8s treats as success), the app's endpoints suddenly start returning other status codes, which indicate failure. Without a liveness probe, the pod will not restart on its own, and we won't notice that something is wrong unless we monitor the pod's logs.
    • Ensures basic functionality by restarting the container. Each time the container restarts, the pod's RESTARTS counter is incremented by 1.
  • When to Use: 
    • For fundamental health checks, like ensuring a web server process is running. 
    • To force a restart of a container that has neither crashed (terminated with a non-zero exit code) nor exited gracefully (terminated with a zero exit code). We can use this to get the app killed and restarted if an unrecoverable or unforeseen problem occurs. Sometimes, though, killing the container is a bad idea: it may not be dead, just taking very long to respond while processing a big user request. In that case we should not kill it.
  • Action on Failure: 
    • Restarts the container

Example:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 1
  timeoutSeconds: 1


The configuration above tells the kubelet to send an HTTP GET request to the pod's /health endpoint 10 seconds (initialDelaySeconds) after the container starts (or restarts), to keep probing every 10 seconds (periodSeconds), to consider a probe failed if a 2xx or 3xx response is not returned within 1 second (timeoutSeconds; 1 second is the default), and to declare the container unhealthy on the 1st failed probe (failureThreshold).
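
httpGet is not the only probe mechanism. A liveness probe can instead run a command inside the container (exec) or simply check that a TCP port accepts connections (tcpSocket). Two alternative sketches (the file path and port are illustrative):

livenessProbe:
  exec:
    command:            # probe fails if the command exits with a non-zero code
      - cat
      - /tmp/healthy    # illustrative: the app touches this file while healthy
  periodSeconds: 10

livenessProbe:
  tcpSocket:
    port: 8080          # probe fails if nothing accepts the TCP connection
  periodSeconds: 10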


Startup probe

  • Purpose: 
    • Delays the start of the other probes when the pod is created or restarted.
  • When to Use: 
    • When the app needs some time to start up, we don't want to set an arbitrary initialDelaySeconds value on the liveness probe. A small initial delay means some apps may not be ready yet when the liveness probe kicks in, and they will get killed unnecessarily. A large initial delay creates the opposite problem: bad containers run for longer before Kubernetes notices they are bad. The startup probe addresses both cases: containers that get ready quickly are put to work quickly, and containers that take longer to get ready still have a chance to become ready instead of being terminated too soon.
The liveness probe starts after the first successful startup probe (if one is configured): the startup probe's main job is to delay liveness and readiness checks until the application is fully initialized, preventing premature restarts of slow-starting containers. Once the startup probe succeeds, Kubernetes runs the regular liveness and readiness probes for the rest of the container's lifecycle.

If a startup probe is defined, the initialDelaySeconds configuration for the liveness and readiness probes is irrelevant (specifically, those probes are disabled) until the startup probe succeeds. 

The startup probe acts as a gatekeeper: 
  • While the startup probe is running and failing, the liveness and readiness probes are paused and do not begin their checks.
  • Only after the startup probe has successfully passed does the standard logic for liveness and readiness probes begin, at which point their respective initialDelaySeconds (if set) will then be respected. 
This mechanism is designed to handle applications with variable or long startup times, preventing the liveness probe from prematurely killing the container or the readiness probe from marking it as "not ready" before it has had a chance to fully initialize. 

Here's the sequence:
  • Container Starts: The container begins to run.
  • Startup Probe Runs: The kubelet executes the startup probe.
  • Startup Probe Succeeds: If successful, the application is considered started, and the startup probe stops running.
  • Liveness & Readiness Start: Kubernetes then begins periodically running the configured liveness and readiness probes (respecting their initialDelaySeconds, if set).
  • Liveness Probe Fails: If the liveness probe fails, Kubernetes restarts the container.
  • Readiness Probe Fails: If the readiness probe fails, the pod is removed from service endpoints but not restarted. 
In essence: The startup probe acts as a gatekeeper, allowing slow applications to get ready before the health checks (liveness/readiness) kick in and potentially kill them. 


Example configuration:

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 10
  periodSeconds: 5
  timeoutSeconds: 1 

This configuration tells k8s to probe this pod every 5 seconds (periodSeconds), to wait 1 second for a response (timeoutSeconds; 1 second is the default value) and to fail only after 10 unsuccessful attempts (failureThreshold). So this app has failureThreshold x periodSeconds = 10 x 5 = 50 seconds to initialize before the startup probe fails.
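
To make the gatekeeper relationship concrete, here is a sketch of a startup probe combined with a liveness probe on the same container (the timings are illustrative):

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 10    # up to 10 x 5 = 50 seconds allowed for startup
  periodSeconds: 5
livenessProbe:            # paused until the startup probe succeeds
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 1     # once running, a single failure triggers a restart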

Readiness probe

  • Purpose: 
    • Determines if an application is fully initialized and prepared to handle incoming network requests.
    • Checks if a container is ready to serve traffic (e.g., database connected, cache loaded)
    • Manages traffic flow for gradual rollouts and recovery from temporary issues
  • Action on Failure: 
    • Stops sending new traffic to the pod: it is temporarily removed from the Service endpoints while not ready, which prevents new requests from reaching it, but the container is not restarted.
    • If the Deployment has only 1 replica and this single pod is taken out of ready status, for a while no pods will be READY and no users can be served. This can be worked around by setting multiple replicas on the Deployment or by using a PodDisruptionBudget object (see the sketch after this list).
  • When to Use: 
    • When an app needs time to load data, connect to databases, or initialize, preventing users from hitting a partially ready instance.
    • We need this probe in addition to the liveness probe because there can be a case when the app needs extra time to process some request, which means the next liveness query would fail (if the request processing time exceeds the liveness probe's periodSeconds + timeoutSeconds). In that case, the failed liveness probe would restart a healthy pod, just because it took a long time to process a request. Setting a large failureThreshold or timeoutSeconds on the liveness probe is not a good fix, because it also delays detection of genuinely dead containers.
    • The benefit of the readiness probe is that, by not restarting the container, the user still gets their response, and whatever processing was halfway through is not lost. Also, by taking the pod out of service, we avoid sending more traffic to a blocked container that is, at least right now, responding slowly to all requests.
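
A minimal PodDisruptionBudget sketch for the single-replica concern above (the name and label are illustrative; note that a PDB guards against voluntary disruptions such as node drains, so running multiple replicas remains the primary fix):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb          # illustrative name
spec:
  minAvailable: 1            # keep at least one pod up during disruptions
  selector:
    matchLabels:
      app: web-app           # must match the Deployment's pod labels
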
Example:

livenessProbe:
  ...
  failureThreshold: 10    # increased

readinessProbe:
  httpGet:
    path: /healthy
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 1


The readiness probe configuration above instructs the kubelet to wait 10 seconds (initialDelaySeconds) after the container starts or restarts, then query the pod every 10 seconds (periodSeconds) and fail after the 1st unsuccessful attempt (failureThreshold). This fits if we consider it acceptable for the pod to take up to 10 seconds to process a request. If this probe fails, k8s takes the pod out of service: it is marked as not ready (the READY column will show e.g. 0/1) but the pod is NOT restarted (the RESTARTS value remains the same). As soon as the long request is served and the next readiness probe succeeds, k8s puts the pod back into service, meaning it again starts receiving requests and the READY column shows that the pod is ready to serve them.

The liveness and readiness probes run at the same time. If they have the same settings (failureThreshold, periodSeconds, timeoutSeconds), then when the readiness probe fails and the pod is about to be taken out of ready status, the liveness probe will also fail and the container gets restarted. It therefore does not make much sense to give the readiness and liveness probes identical configurations: the readiness probe becomes ineffective, as there is no point in taking a pod out of service while it is being restarted.


Example #2:

Deployment configuration

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 9
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 30
  timeoutSeconds: 60


20 seconds (initialDelaySeconds) after the container (re)starts, both the liveness and readiness probes kick in. The readiness probe then runs every 10 seconds with a 9-second timeout, while the liveness probe runs every 30 seconds with a far more tolerant 60-second timeout, so a pod that responds slowly is taken out of service well before it is restarted.

Key Differences


  • Impact:
    • Liveness restarts; Readiness stops traffic.
  • Focus: 
    • Liveness is about the process being alive; Readiness is about being able to serve requests.
  • Scenario: 
    • A liveness failure means the container is broken; a readiness failure means it's busy or temporarily unavailable. 

Example Scenario


Imagine a web app (a combined sketch of all three probes follows this list):
  • Startup: Uses a startup probe to wait for initial setup (slow).
  • Running:
    • Liveness Probe: Checks if the web server process is still running. If it crashes, restart it.
    • Readiness Probe: Checks if the database connection is active and cache is warm. If not, stop traffic until it's ready, avoiding errors for users. 
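
Putting this scenario together, the probes section of the container spec might look roughly like this (a sketch; paths, port and timings are illustrative, and /ready is assumed to be a separate endpoint that also checks the database and cache):

startupProbe:             # gatekeeper: liveness/readiness wait for this
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30    # up to 30 x 5 = 150 seconds for slow initial setup
  periodSeconds: 5
livenessProbe:            # is the web server process still responsive?
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 30
  failureThreshold: 3     # tolerant, because a restart is disruptive
readinessProbe:           # can it serve traffic right now?
  httpGet:
    path: /ready          # illustrative endpoint
    port: 8080
  periodSeconds: 10
  failureThreshold: 1     # strict, because leaving service is cheap and reversible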

What to choose for tests?


...

