Kubernetes restarts containers when they exit or crash, but that alone is not enough to guarantee the apps are running fine: an application can stall or enter an error state while its process keeps running indefinitely.
This is why we need to proactively probe the pods to check that the app is alive and ready to serve requests.
Kubernetes probes are tests the cluster runs against pods to see if they are good to serve traffic, or if they need corrective action.
Liveness probe
- Purpose:
- Tells Kubernetes whether the application is still running properly
- Detects that the application/pod is in a state of failure that requires a restart
- Checks if a container is alive (running and healthy/responsive)
- Detects situations where an application is running but unresponsive (e.g., deadlocked) and needs a restart to recover
- For example: because of a bug or some other reason, the app's endpoints suddenly stop returning 2xx or 3xx status codes (both of which Kubernetes treats as probe success) and start returning codes that indicate failure. Without a liveness probe the pod will not restart on its own, and we won't notice that something is wrong unless we monitor the pod's logs.
- Ensures basic functionality by restarting the container. Each restart increments the pod's RESTARTS counter by 1.
- What it checks:
- Whether the app is in a broken state that requires a restart (deadlocked, corrupted state, etc.)
- Behavior:
- Runs throughout the container's lifetime
- Action on Failure:
- The container is killed and restarted.
- When to Use:
- For fundamental health checks, like ensuring a web server process is running.
- To force a restart of our container even if it did not crash (terminate with a non-zero exit code) or exit gracefully (terminate with a zero exit code). We can use this to have the app killed and restarted when an unrecoverable or unforeseen problem happens. Sometimes, however, killing the container is a bad idea: it may not be dead, just slow to respond because it is processing a big user request. In that case we should not kill it.
Example:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 1
  timeoutSeconds: 1
The configuration above tells the Kubernetes kubelet to send an HTTP GET request to the pod's /health endpoint on port 8080, starting 10 seconds (initialDelaySeconds) after the container starts (or restarts), to keep probing every 10 seconds (periodSeconds), to consider a probe failed if a 2xx or 3xx response is not returned within 1 second (timeoutSeconds; 1 second is the default), and to report the pod as unhealthy after the 1st failed probe (failureThreshold).
Example #2:
livenessProbe:
  httpGet:
    path: /healthz/live
How can the app know that it's not deadlocked and that the liveness endpoint should return 200 OK?
Detecting deadlocks is tricky because a deadlocked thread can't check itself. Here are practical approaches:
1. Simple "I'm Alive" Check (Most Common)
Just return 200 if the HTTP server can respond:
@app.get("/healthz/live")
async def liveness_probe():
return {"status": "ok"}
Logic: If the web server thread is responding, the process isn't completely frozen. This catches:
- Process crashes
- Complete hangs
- Out of memory kills
Limitation: Doesn't detect deadlocks in background workers or specific threads.
2. Heartbeat Pattern (Background Workers)
Track that critical background tasks are still running:
import time
from threading import Thread

class Heartbeat:
    def __init__(self, timeout=60):
        self.last_beat = time.time()
        self.timeout = timeout

    def beat(self):
        self.last_beat = time.time()

    def is_alive(self):
        return (time.time() - self.last_beat) < self.timeout

worker_heartbeat = Heartbeat()

def background_worker():
    while True:
        # Do work
        process_queue()
        worker_heartbeat.beat()
        time.sleep(5)

@app.get("/healthz/live")
async def liveness_probe():
    if not worker_heartbeat.is_alive():
        return Response(status_code=503)
    return {"status": "ok"}
3. Watchdog Timer Pattern (More Robust)
Use a separate monitoring thread that expects regular check-ins:
import threading
import time

class WatchdogTimer:
    def __init__(self, timeout=30):
        self.timeout = timeout
        self.last_reset = time.time()
        self.lock = threading.Lock()
        self.healthy = True
        # Start watchdog thread
        self.thread = threading.Thread(target=self._monitor, daemon=True)
        self.thread.start()

    def reset(self):
        """Call this periodically from your main processing loop"""
        with self.lock:
            self.last_reset = time.time()

    def _monitor(self):
        while True:
            time.sleep(5)
            with self.lock:
                if (time.time() - self.last_reset) > self.timeout:
                    self.healthy = False
                else:
                    self.healthy = True

    def is_healthy(self):
        with self.lock:
            return self.healthy

watchdog = WatchdogTimer(timeout=30)

# In your main processing loop
def process_requests():
    while True:
        handle_request()
        watchdog.reset()  # Signal we're still alive

@app.get("/healthz/live")
async def liveness_probe():
    if not watchdog.is_healthy():
        return Response(status_code=503)
    return {"status": "ok"}
4. Critical Resource Check (Pragmatic)
Check that critical resources are accessible:
@app.get("/healthz/live")
async def liveness_probe():
try:
# Can we still access the database connection pool?
db.ping() # Lightweight check
return {"status": "ok"}
except Exception:
# If we can't even ping, we're probably deadlocked on DB
return Response(status_code=503)
5. Async Event Loop Check (Python-specific)
For async applications, verify the event loop isn't blocked:
import asyncio

@app.get("/healthz/live")
async def liveness_probe():
    try:
        # If this times out, the event loop is blocked
        await asyncio.wait_for(
            asyncio.sleep(0),
            timeout=1.0
        )
        return {"status": "ok"}
    except asyncio.TimeoutError:
        return Response(status_code=503)
Best Practice: Keep It Simple
For most apps, approach #1 (simple check) is sufficient:
@app.get("/healthz/live")
async def liveness_probe():
    return {"status": "ok"}
Add complexity only if you have specific deadlock scenarios you've observed. The liveness probe should be:
- Fast (< 100ms)
- Lightweight (no heavy computation or I/O)
- Reliable (doesn't fail due to temporary issues)
Remember: Kubernetes will restart your container if liveness fails, so false positives cause unnecessary restarts. When in doubt, start simple and add monitoring only for known failure modes.
Startup probe
- Purpose:
- Tells Kubernetes when the application has started successfully
- Delays the start of the other probes when the pod is created or restarted
- What it checks:
- Whether the app has completed initialization (loaded config, connected to databases, warmed up caches, etc.)
- Behavior:
- Runs only during startup; once it succeeds, it stops running
- Failure:
- Container is killed and restarted
- When to Use:
- When the app needs some time to start, we don't want to pick an arbitrary initialDelaySeconds value for the liveness probe. A small initial delay means some apps may not be ready yet when the liveness probe kicks in, so they get killed unnecessarily. A large initial delay causes the opposite problem: bad containers run for longer before Kubernetes notices they are bad. The startup probe handles both cases: containers that become ready quickly are put to work quickly, and containers that take longer still get a chance to become ready instead of being terminated too soon.
The liveness probe starts after the first successful startup probe (if one is configured), as the startup probe's main job is to delay liveness and readiness checks until the application is fully initialized, preventing premature restarts of slow-starting containers. Once the startup probe succeeds, Kubernetes begins running the regular liveness and readiness probes for the container's entire lifecycle.
If a startup probe is defined, the initialDelaySeconds configuration for the liveness and readiness probes is irrelevant (specifically, those probes are disabled) until the startup probe succeeds.
The startup probe acts as a gatekeeper:
- While the startup probe is running and failing, the liveness and readiness probes are paused and do not begin their checks.
- Only after the startup probe has successfully passed does the standard logic for liveness and readiness probes begin, at which point their respective initialDelaySeconds (if set) will then be respected.
This mechanism is designed to handle applications with variable or long startup times, preventing the liveness probe from prematurely killing the container or the readiness probe from marking it as "not ready" before it has had a chance to fully initialize.
Here's the sequence:
- Container Starts: The container begins to run.
- Startup Probe Runs: The kubelet executes the startup probe.
- Startup Probe Succeeds: If successful, the application is considered started, and the startup probe stops running.
- Liveness & Readiness Start: Kubernetes immediately begins periodically running the configured liveness and readiness probes.
- Liveness Probe Fails: If the liveness probe fails, Kubernetes restarts the container.
- Readiness Probe Fails: If the readiness probe fails, the pod is removed from service endpoints but not restarted.
In essence: The startup probe acts as a gatekeeper, allowing slow applications to get ready before the health checks (liveness/readiness) kick in and potentially kill them.
Example configuration:
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 10
  periodSeconds: 5
  timeoutSeconds: 1
This configuration tells Kubernetes to probe the pod every 5 seconds (periodSeconds), wait 1 second for a response (timeoutSeconds; 1 second is the default), and give up only after 10 consecutive failed probes (failureThreshold). So the app has failureThreshold x periodSeconds = 10 x 5 = 50 seconds to initialize before the startup probe finally fails.
Example configuration #2 (recommended endpoint format):
startupProbe:
  httpGet:
    path: /healthz/startup
/healthz/startup returns 200 only after full initialization is complete.
How can the app know that full initialization is complete, so it can return 200 OK on the startup endpoint?
The app needs to track its initialization state internally. Here are common approaches:
1) Boolean Flag Pattern (Simplest)
# At module level
is_initialized = False

async def initialize():
    # Load configuration
    config = load_config()
    # Connect to database
    await db.connect()
    # Warm up caches
    await cache.warm_up()
    # Mark as ready
    global is_initialized
    is_initialized = True

@app.get("/healthz/startup")
async def startup_probe():
    if is_initialized:
        return {"status": "ok"}
    return Response(status_code=503)
While initialization is not yet complete, we chose to return HTTP 503 (Service Unavailable): this server error status code indicates that the server is not ready to handle the request, so the startup probe fails.
2) Dependency Checklist Pattern (More Robust)
class AppState:
    def __init__(self):
        self.config_loaded = False
        self.db_connected = False
        self.cache_ready = False

    def is_ready(self):
        return all([
            self.config_loaded,
            self.db_connected,
            self.cache_ready
        ])

app_state = AppState()

async def initialize():
    app_state.config_loaded = load_config()
    app_state.db_connected = await db.connect()
    app_state.cache_ready = await cache.warm_up()

@app.get("/healthz/startup")
async def startup_probe():
    if app_state.is_ready():
        return {"status": "ok"}
    return Response(status_code=503)
3) Application Lifecycle Hook (Framework-specific)
# FastAPI example
from fastapi import FastAPI
app = FastAPI()
startup_complete = False
@app.on_event("startup")
async def startup_event():
global startup_complete
# Do all initialization
await db.connect()
await load_data()
startup_complete = True
@app.get("/healthz/startup")
async def startup_probe():
if startup_complete:
return {"status": "ok"}
raise HTTPException(status_code=503)
4) State Machine Pattern (For Complex Apps)
from enum import Enum

class AppStatus(Enum):
    STARTING = "starting"
    READY = "ready"
    SHUTTING_DOWN = "shutting_down"

current_status = AppStatus.STARTING

async def initialize():
    global current_status
    # initialization steps...
    current_status = AppStatus.READY

@app.get("/healthz/startup")
async def startup_probe():
    if current_status == AppStatus.READY:
        return {"status": "ok"}
    return Response(status_code=503)
Key Points
- The startup endpoint returns 503 (Service Unavailable) until initialization is done, then 200 OK
- Kubernetes will keep checking until it gets 200, or kill the container if it takes too long (configurable via failureThreshold and periodSeconds)
- The flag/state should be set after all critical initialization steps complete
- This is typically done in your application's main entry point or framework's startup hooks
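For newer FastAPI versions, which deprecate on_event in favour of a lifespan handler, the same flag pattern can be sketched roughly like this (the initialization steps are placeholders):
from contextlib import asynccontextmanager

from fastapi import FastAPI, Response

startup_complete = False

@asynccontextmanager
async def lifespan(app: FastAPI):
    global startup_complete
    # Do all initialization here (config, DB connections, cache warm-up)
    startup_complete = True
    yield
    # Shutdown/cleanup logic goes here

app = FastAPI(lifespan=lifespan)

@app.get("/healthz/startup")
async def startup_probe():
    if startup_complete:
        return {"status": "ok"}
    return Response(status_code=503)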
Readiness probe
- Purpose:
- Determines if an application is fully initialized and prepared to handle incoming network requests.
- Checks if a container is ready to serve traffic (e.g., database connected, cache loaded)
- Manages traffic flow for gradual rollouts and recovery from temporary issues
- Tells Kubernetes if the application can handle traffic
- Action on Failure:
- Pod is removed from service endpoints (no restart)
- Stops sending new traffic to the pod (removes it from service) but doesn't restart it.
- It temporarily removes container from service if not ready, preventing new requests but not restarting
- It stops the pod from taking traffic, but it will not restart the container.
- If the Deployment has only 1 replica and this single pod is taken out of ready status, for a while no pods will be READY and no users can be served. This can be mitigated by running multiple replicas in the Deployment (a PodDisruptionBudget object additionally protects against too many pods being taken down at once by voluntary disruptions).
- What it checks:
- Whether the app is temporarily unable to serve requests (overloaded, waiting for dependencies, performing maintenance)
- Behavior:
- Runs throughout the container's lifetime
- When to Use:
- When an app needs time to load data, connect to databases, or initialize, preventing users from hitting a partially ready instance.
- We need this probe in addition to the liveness probe because the app may occasionally need extra time to process a query, which means the next liveness probe would fail (if request processing time exceeds the liveness probe's periodSeconds + timeoutSeconds). In that case a failed liveness probe would restart a healthy pod just because it took a long time to process a request, and setting a large failureThreshold or timeoutSeconds on the liveness probe is not a good workaround.
- The benefit of the readiness probe is that, since the container is not restarted, the user still gets their response and whatever processing was halfway through is not lost. Also, by taking the pod out of service, we avoid sending more traffic to a blocked container that is, at least right now, responding slowly to all requests.
Example:
livenessProbe:
  ...
  failureThreshold: 10   # <-- increased
readinessProbe:
  httpGet:
    path: /healthy
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 1
The readiness probe configuration above instructs the kubelet to wait 10 seconds (initialDelaySeconds) after the pod's start or restart, query the pod every 10 seconds (periodSeconds), and fail after the 1st failed attempt (failureThreshold). This assumes it is acceptable for the pod to take up to 10 seconds to process a request. If this probe fails, Kubernetes takes the pod out of service (it is marked as not ready; the READY column will show e.g. 0/1), but the pod is NOT restarted (the RESTARTS value stays the same). As soon as the long request is served and the next readiness probe succeeds, Kubernetes puts the pod back into service: it starts receiving requests again and the READY column shows it as ready.
The liveness and readiness probes run at the same time. If they have the same settings (failureThreshold, periodSeconds, timeoutSeconds), then when the readiness probe fails and the pod is about to be taken out of ready status, the liveness probe fails as well and the container gets restarted. It therefore does not make much sense to use the same configuration for the readiness and liveness probes: the readiness probe becomes ineffective, and there is no point in taking a pod out of service while it is being restarted.
Example #2:
Deployment configuration
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 9
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 30
  timeoutSeconds: 60
20 seconds after pod is (re)started, both liveness and readiness probes will kick in.
How can the app know that it can accept traffic right now and return 200 OK for the readiness probe?
The readiness probe should check dependencies and capacity - things that might be temporarily unavailable but don't require a restart. Here are practical approaches:
1. Dependency Health Checks (Most Common)
Check that external services your app needs are accessible:
@app.get("/healthz/ready")
async def readiness_probe():
checks = {}
# Check database
try:
await db.execute("SELECT 1", timeout=1)
checks["database"] = "ok"
except Exception as e:
checks["database"] = "failed"
return Response(
status_code=503,
content=json.dumps({"status": "not ready", "checks": checks})
)
# Check Redis cache
try:
await redis.ping(timeout=1)
checks["redis"] = "ok"
except Exception:
checks["redis"] = "failed"
return Response(status_code=503, content=json.dumps(checks))
# Check message queue
try:
await rabbitmq.health_check(timeout=1)
checks["rabbitmq"] = "ok"
except Exception:
checks["rabbitmq"] = "failed"
return Response(status_code=503, content=json.dumps(checks))
return {"status": "ready", "checks": checks}
2. Circuit Breaker Pattern
Track downstream service health and fail fast:
class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half_open

    def is_available(self):
        if self.state == "closed":
            return True
        if self.state == "open":
            # Check if timeout has passed
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half_open"
                return True
            return False
        return True  # half_open, try again

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = "open"

db_circuit = CircuitBreaker()

@app.get("/healthz/ready")
async def readiness_probe():
    if not db_circuit.is_available():
        return Response(status_code=503, content="Database circuit open")
    try:
        await db.ping()
        db_circuit.record_success()
        return {"status": "ready"}
    except Exception:
        db_circuit.record_failure()
        return Response(status_code=503)
3. Load/Capacity Checks
Check if the app is overloaded:
import psutil

@app.get("/healthz/ready")
async def readiness_probe():
    # Check memory usage
    memory = psutil.virtual_memory()
    if memory.percent > 90:
        return Response(
            status_code=503,
            content=f"Memory usage too high: {memory.percent}%"
        )

    # Check CPU usage
    cpu = psutil.cpu_percent(interval=0.1)
    if cpu > 95:
        return Response(status_code=503, content=f"CPU too high: {cpu}%")

    # Check active connections/requests
    active_requests = get_active_request_count()
    if active_requests > MAX_CONCURRENT_REQUESTS:
        return Response(
            status_code=503,
            content=f"Too many active requests: {active_requests}"
        )

    return {"status": "ready"}
4. Graceful Shutdown State
During shutdown, mark as not ready while finishing existing requests:
class AppState:
    def __init__(self):
        self.accepting_traffic = True
        self.active_requests = 0

    def start_shutdown(self):
        self.accepting_traffic = False

app_state = AppState()

@app.get("/healthz/ready")
async def readiness_probe():
    if not app_state.accepting_traffic:
        return Response(status_code=503, content="Shutting down")
    return {"status": "ready"}

@app.on_event("shutdown")
async def shutdown_event():
    # Stop accepting new traffic
    app_state.start_shutdown()
    # Wait for existing requests to complete
    while app_state.active_requests > 0:
        await asyncio.sleep(0.1)
5. Combined Example (Production-Ready)
from datetime import datetime, timedelta

class ReadinessChecker:
    def __init__(self):
        self.is_shutting_down = False
        self.last_db_check = None
        self.db_healthy = True
        self.cache_timeout = timedelta(seconds=5)

    async def check_database(self):
        """Cached database check"""
        now = datetime.now()
        # Use cached result if recent
        if (self.last_db_check and
                now - self.last_db_check < self.cache_timeout):
            return self.db_healthy
        try:
            await db.execute("SELECT 1", timeout=2)
            self.db_healthy = True
            self.last_db_check = now
            return True
        except Exception as e:
            logger.warning(f"Database health check failed: {e}")
            self.db_healthy = False
            self.last_db_check = now
            return False

    async def is_ready(self):
        if self.is_shutting_down:
            return False, "shutting down"
        if not await self.check_database():
            return False, "database unavailable"
        # Check memory
        memory = psutil.virtual_memory()
        if memory.percent > 90:
            return False, f"high memory: {memory.percent}%"
        return True, "ok"

readiness = ReadinessChecker()

@app.get("/healthz/ready")
async def readiness_probe():
    ready, reason = await readiness.is_ready()
    if ready:
        return {"status": "ready"}
    else:
        return Response(
            status_code=503,
            content=json.dumps({"status": "not ready", "reason": reason})
        )
Key Principles
DO check:
- External dependencies (databases, caches, APIs)
- System resources (memory, CPU, disk)
- Application capacity (connection pools, queue sizes)
- Shutdown state
DON'T check:
- Things that would require a restart to fix (use liveness for that)
- Expensive operations (keep it under 1-2 seconds)
- Non-critical dependencies (if you can serve requests without it)
Timeouts are critical: Always use short timeouts (1-2 seconds) for dependency checks. A hanging check is worse than a failed check.
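A minimal sketch of that principle, assuming the same kind of hypothetical async db client as in the examples above: wrap the dependency call in asyncio.wait_for so a hanging check turns into a fast 503 instead of stalling the probe.
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/healthz/ready")
async def readiness_probe():
    try:
        # Hard 1-second cap on the dependency check; db is a hypothetical
        # async client like in the examples above
        await asyncio.wait_for(db.execute("SELECT 1"), timeout=1.0)
    except Exception:
        # Covers both the timeout and a failed check
        return Response(status_code=503, content="dependency check failed or timed out")
    return {"status": "ready"}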
When the readiness probe fails, Kubernetes removes the pod from the service load balancer, so traffic stops flowing to it. Once dependencies recover and the probe succeeds again, traffic resumes - no restart needed.
Key Differences
- Impact:
- Liveness restarts; Readiness stops traffic.
- Focus:
- Liveness is about the process being alive; Readiness is about being able to serve requests.
- Scenario:
- A liveness failure means the container is broken; a readiness failure means it's busy or temporarily unavailable.
Example Scenario
Imagine a web app:
- Startup: Uses a startup probe to wait for initial setup (slow).
- Running:
- Liveness Probe: Checks if the web server process is still running. If it crashes, restart it.
- Readiness Probe: Checks if the database connection is active and cache is warm. If not, stop traffic until it's ready, avoiding errors for users.
What to choose for probes endpoints?
Endpoints (or sometimes probe endpoints/paths) generally shouldn't be the same - each serves a distinct purpose.
Example implementation:
livenessProbe:
  httpGet:
    path: /healthz/live
readinessProbe:
  httpGet:
    path: /healthz/ready
startupProbe:
  httpGet:
    path: /healthz/startup
Typical endpoint logic:
/healthz/startup - Returns 200 only after full initialization is complete
/healthz/live - Returns 200 if the process is running and not deadlocked (lightweight check)
/healthz/ready - Returns 200 if the app can accept traffic right now (may check dependencies, load, etc.)
Many simple applications use the same endpoint for liveness and readiness, but separating them gives you more control over pod lifecycle management.
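A minimal sketch of what that separation can look like in one FastAPI app (the initialization work and state tracking are placeholders; real checks would follow the patterns shown earlier):
from fastapi import FastAPI, Response

app = FastAPI()
state = {"initialized": False, "accepting_traffic": True}

@app.on_event("startup")
async def init():
    # Placeholder for real initialization (config, DB connections, cache warm-up)
    state["initialized"] = True

@app.get("/healthz/startup")
async def startup_probe():
    # 200 only after full initialization is complete
    if state["initialized"]:
        return {"status": "ok"}
    return Response(status_code=503)

@app.get("/healthz/live")
async def liveness_probe():
    # Lightweight: if the server can answer at all, the process isn't frozen
    return {"status": "ok"}

@app.get("/healthz/ready")
async def readiness_probe():
    # Can we accept traffic right now? (dependencies, load, shutdown state)
    if state["initialized"] and state["accepting_traffic"]:
        return {"status": "ok"}
    return Response(status_code=503)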
Readiness Probe Times Out Scenario
Is it normal to have a deployment which pods frequently emit readiness probe error like this:
Readiness probe failed: Get "http://10.2.30.79:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
It’s not “normal”, but it is common in certain situations — and whether it’s acceptable depends on how often, when, and what impact it has.
The error above tells us:
- Kubelet could connect to the pod IP
- The HTTP request did not receive response headers within the probe timeout
- The container was not killed (readiness ≠ liveness)
- The pod was temporarily marked NotReady and removed from Service endpoints
So this is about latency or blocking, not a hard crash.
It is NOT normal if:
- It happens frequently under normal load
- Pods flap between Ready / NotReady
- Traffic drops or retries spike
- Deployments take a long time to become stable
- You see cascading failures in upstream services
This usually indicates:
- Slow or blocking health endpoint
- Thread starvation
- GC pauses
- DB / external dependency slowness
- CPU or memory pressure
- Too aggressive probe settings
It is somewhat normal if:
- It happens briefly during startup
- It happens during traffic spikes
- It coincides with:
- Rolling deployments
- Node pressure
- Cold caches
- JVM warm-up / Python startup
- Pods recover quickly and remain stable
Even then, it’s still a signal worth fixing, not something to ignore.
Common root causes (very common in real clusters)
1) Health endpoint is doing “real work”
Bad pattern:
/health → checks DB → checks Redis → checks downstream API
If any dependency is slow, readiness fails.
Best practice:
- Readiness = “Can I serve traffic?”
- Keep it fast and in-memory
- Avoid network calls
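One way to keep the handler itself free of network calls is sketched below, assuming the same kind of hypothetical async db client as earlier: a background task refreshes a cached flag, and the readiness endpoint only reads that flag.
import asyncio

dependency_healthy = True  # refreshed in the background, only read by the probe

async def refresh_dependency_state():
    # Launch once at startup, e.g. asyncio.create_task(refresh_dependency_state())
    global dependency_healthy
    while True:
        try:
            await asyncio.wait_for(db.ping(), timeout=2.0)  # hypothetical client
            dependency_healthy = True
        except Exception:
            dependency_healthy = False
        await asyncio.sleep(5)

@app.get("/healthz/ready")
async def readiness_probe():
    # No network calls here: just report the cached state
    if dependency_healthy:
        return {"status": "ready"}
    return Response(status_code=503)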
2) Probe timeout too low
Default:
timeoutSeconds: 1
That’s extremely aggressive.
If your app occasionally takes >1s to respond → you’ll see exactly this error.
Typical sane values:
timeoutSeconds: 2–5
failureThreshold: 3
periodSeconds: 10
3) CPU throttling or memory pressure
Symptoms:
- Happens under load
- Worse on smaller nodes
- Improves if you scale replicas
Check:
kubectl top pod
kubectl describe pod <pod>
Look for:
- CPU limits too low
- OOM kills nearby in time
- High GC time (JVM)
4) Thread pool exhaustion
Very common with:
- Java (Tomcat / Netty)
- Node.js
- Python (Gunicorn)
If all worker threads are busy:
- App accepts connection
- But doesn’t return headers in time
- Probe times out
5) Node-level issues
Less common but real:
- kubelet under pressure
- noisy neighbors
- CNI latency
- iptables overload on busy nodes
When does it become a real problem?
This is definitely a problem if:
- HPA keeps scaling up/down unnecessarily
- Ingress shows intermittent 502/503
- You see “endpoint removed” events frequently
- Rolling updates never fully settle
- You rely on readiness for zero-downtime deploys
What to do immediately (practical checklist)
Hit the endpoint manually:
kubectl exec -it pod -- curl -v http://localhost:8080/health
Time it under load.
Check probe config:
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 1   # 🚩 likely too low
Check resource limits:
resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "200m"   # 🚩 possible throttling
Check events:
kubectl describe pod
Check if readiness == liveness
If they point to the same endpoint → that’s usually wrong.
Rule of thumb:
A readiness probe should almost never fail once the pod is fully running and warmed up.
Occasional blips under extreme load might happen, but frequent failures mean something is mis-sized, misconfigured, or doing too much work.
How can we check whether the readiness probe timeout failure (mentioned above) happens during the pod's startup (there is no startup probe) or later, during regular running?
You can determine very reliably whether readiness probe timeouts happen only at startup or later during steady state, even without a startup probe.
1) Check when readiness failures happen relative to pod start
Look at pod events with timestamps:
kubectl describe pod <pod-name> -n staging
Focus on:
Events:
Type Reason Age Message
---- ------ ---- -------
Warning Unhealthy 3m (x12 over 30m) Readiness probe failed: ...
Now compare that with:
Start Time: <timestamp>
Interpret:
❌ Failures minutes/hours after Start Time → NOT startup
⚠️ Failures only within first ~30–120s → startup-related
🚨 Failures continue long after pod is Ready → runtime issue
This alone often answers the question.
2) Correlate readiness failures with container restarts
Even though readiness doesn’t restart pods, startup-related readiness failures usually coincide with restarts caused by something else.
kubectl get pod <pod-name> -n staging -o wide
kubectl describe pod <pod-name> -n staging
Look for:
Restart Count: 0
Interpretation:
- Restart Count = 0 and failures happen later → runtime slowness
- Restarts + readiness failures early → startup / initialization
3) Use kubectl get events sorted by time (very useful)
kubectl get events -n staging \
--field-selector involvedObject.name=<pod-name> \
--sort-by='.lastTimestamp'
This gives you a timeline:
Normal Scheduled pod/xxx ...
Normal Pulled pod/xxx ...
Normal Started pod/xxx ...
Warning Unhealthy pod/xxx Readiness probe failed
Normal Ready pod/xxx ...
What you want to see:
Do Unhealthy events appear after the pod has already been Ready once?
If yes → definitely not startup.
4) Watch readiness transitions live
If the issue is intermittent:
kubectl get pod <pod-name> -n staging -w
You’ll see:
READY STATUS
1/1 Running
0/1 Running ← readiness failed
1/1 Running
Now compare with pod age:
kubectl get pod <pod-name> -n staging
If the pod is:
AGE=2h and readiness flips → runtime issue
AGE=20s → startup
5) Check application logs around the failure time
Readiness timeouts often align with:
- GC pauses
- Thread pool exhaustion
- Slow DB calls
kubectl logs <pod-name> -n staging --timestamps
Then compare timestamps with event timestamps from step 3.
If you see:
- Long pauses
- Request backlogs
- “Waiting for DB” messages
→ not startup.
6) Check if readiness fails during deployments only
This is a big clue.
kubectl rollout history deployment next-api-staging -n staging
Then check:
Do readiness failures cluster around rollout times?
Or happen randomly hours later?
Random = runtime
Rollout-only = startup / warm-up
7) (Optional but powerful) Add logging to the health endpoint
If you can change the app:
- Log request duration for /health
- Log thread pool usage
- Log dependency latency
This gives you absolute proof.
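A minimal sketch of the duration-logging idea, shown in Python/FastAPI like the earlier examples (the logger name and the "/health" path prefix are illustrative):
import logging
import time

from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger("health")

@app.middleware("http")
async def log_health_duration(request: Request, call_next):
    start = time.monotonic()
    response = await call_next(request)
    elapsed_ms = (time.monotonic() - start) * 1000
    if request.url.path.startswith("/health"):
        # Compare these durations/timestamps with the Unhealthy events from kubectl
        logger.info("health check %s took %.1f ms", request.url.path, elapsed_ms)
    return response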
8) Why a startupProbe would help (even now)
If failures are startup-only, the clean fix is:
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 5
This:
- Disables readiness + liveness until startup completes
- Removes false negatives during cold start
Decision table:
- Failures only in first 1–2 min → startup issue
- Failures after pod was Ready → runtime issue
- Pod age is hours/days → runtime issue
- HPA / traffic spikes correlate → runtime issue
- GC / DB slowness in logs → runtime issue
Bottom line
If you ever see a pod go Ready → NotReady → Ready again long after startup, it is NOT a startup problem — startup probes won’t fix it.
Let's say that describe pod output contains:
Start Time: Tue, 23 Dec 2025 15:41:04 +0000
and
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 33m (x6 over 3h42m) kubelet Readiness probe failed: Get "http://10.2.44.98:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
What this data proves
Pod start time: Start Time: Tue, 23 Dec 2025 15:41:04 +0000
Readiness failures: Warning Unhealthy 33m (x6 over 3h42m)
This means:
- The pod has been running for hours
- Readiness probe failures happened repeatedly over ~4 hours
- Failures occurred as recently as 33 minutes ago
Conclusion:
- This is not related to pod startup
- A startupProbe would not fix this
What is happening
Your pod is occasionally too slow to respond to GET /health within the probe timeout. During those moments:
- The pod stays Running
- It flips to NotReady
- It is removed from Service endpoints
- Traffic is temporarily stopped
- Then it recovers
This is classic runtime degradation, not initialization.
Most likely root causes (ranked)
1) Health endpoint is blocking or slow
Very common:
/health waits on a DB, cache, or downstream API call
Under load → response time exceeds the probe timeout
Strong indicator:
“awaiting headers” → handler accepted connection but couldn’t respond
2) Thread / worker starvation
Especially if this is:
- Node.js
- Java (Tomcat / Spring Boot)
- Python (Gunicorn / uWSGI)
Symptoms:
- All workers busy handling traffic
- Health endpoint queued
- Probe times out
3) CPU throttling
If CPU limits are tight:
- App runs
- But occasionally gets throttled
- Health check misses deadline
Check:
kubectl top pod <pod> -n staging
4) GC pauses / memory pressure
Common with JVM apps:
- Stop-the-world GC pauses
- Health endpoint blocked briefly
What will NOT help:
- startupProbe
- increasing initialDelaySeconds only
- ignoring the warnings
Recommended Immediate actions:
1) Relax readiness probe (quick mitigation)
If current config is aggressive (e.g. timeoutSeconds: 1):
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 3
  periodSeconds: 10
  failureThreshold: 3
This reduces false negatives without hiding real failures.
2) Ensure readiness endpoint is lightweight
Best practice:
- In-memory checks only
- No DB / Redis / network calls
- No locks or long async work
If you need deeper checks:
- Put them behind /health/deep
- Use them for diagnostics, not readiness
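A sketch of that split, using Python/FastAPI like the earlier examples (endpoint names and the deep checks are illustrative; only the lightweight endpoint would be referenced by the readinessProbe):
import json

from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/healthz/ready")
async def readiness_probe():
    # Lightweight, in-memory only: this is what the readinessProbe points at
    return {"status": "ready"}

@app.get("/health/deep")
async def deep_health():
    # Slower diagnostics (DB, cache, downstream APIs) for humans and dashboards,
    # not for kubelet probes; db and redis are hypothetical async clients
    results = {}
    try:
        await db.execute("SELECT 1")
        results["database"] = "ok"
    except Exception:
        results["database"] = "failed"
    try:
        await redis.ping()
        results["redis"] = "ok"
    except Exception:
        results["redis"] = "failed"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return Response(status_code=status, content=json.dumps(results))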
3) Check resource limits carefully
Look for this pattern:
limits:
  cpu: 200m   # 🚩 common cause
If you see CPU throttling:
- Increase limit
- Or remove limit and rely on requests
4) Correlate with traffic & deploys
Ask:
- Do failures align with traffic spikes?
- With cron jobs?
- With background workers?
This helps pinpoint starvation.
How to prove the cause (next steps)
If you want to be 100% sure:
- Log request duration for /health
- Log worker pool usage
- Compare timestamps with readiness failures
- Temporarily increase timeoutSeconds
If failures disappear → latency problem confirmed
Bottom line:
Your readiness probe is doing its job correctly — it’s telling you the app is occasionally not ready to serve traffic. This is a runtime performance or design issue, not Kubernetes misbehaviour.
Let's say we have this config:
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 9
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 30
  timeoutSeconds: 60
and
router.get("/health", (_req: Request, res: Response) => {
return res.status(200).json({});
});
With this config the problem is almost certainly not Kubernetes probe tuning and not your health handler logic.
Let's walk through it carefully and see what is left.
1) This probe configuration is sane (even generous)
Readiness:
periodSeconds: 10
timeoutSeconds: 9
That means:
Kubelet waits up to 9 seconds
Only 1 second “gap” before next probe
Failure happens only if the app does not send headers for 9s
This is not aggressive at all.
Liveness:
timeoutSeconds: 60
Very forgiving. And you're not seeing restarts, which confirms the app does eventually respond; it's just occasionally slow.
2) Your /health handler is trivial
router.get("/health", (_req, res) => {
return res.status(200).json({});
});
This:
- Does no I/O
- No DB calls
- No async work
- Should respond in microseconds
So if this times out: the event loop was blocked, or the process wasn’t scheduled. That narrows the root causes dramatically.
Given everything above, only a few things can explain a 9-second no-headers timeout.
3) Most likely causes (ranked)
1) Node.js event loop blockage
If your app:
- Does CPU-heavy work
- JSON serialization of big objects
- Synchronous crypto / compression
- Large regexes
- Blocking filesystem calls
Then:
- Event loop freezes
- /health cannot respond
- Probe times out
This is the #1 cause in Node apps.
2) CPU throttling at container level
Even if your code is fine:
limits:
  cpu: 200m
Under load:
- Node is throttled
- Event loop starves
- Requests queue
- Health check misses deadline
Very common in staging.
3) Too few Node workers
If you are running:
- A single Node process
- No cluster / PM2
- Handling concurrent traffic
Then a single hot path can stall everything.
4) Long GC pauses
If:
- Memory limit is tight
- Heap pressure increases
Node’s GC can block the event loop for seconds.
4) What it is definitely not
- Not startup
- Not probe timeout config
- Not the health handler
- Not networking
- Not kubelet flakiness
This is application runtime starvation.
How to prove it conclusively (very actionable)
A. Measure event loop lag
Add something like:
setInterval(() => {
  const start = Date.now();
  setImmediate(() => {
    const lag = Date.now() - start;
    if (lag > 1000) {
      console.warn("Event loop lag:", lag);
    }
  });
}, 1000);
You’ll see spikes that align with readiness failures.
B. Check CPU throttling
kubectl describe pod <pod> -n staging | grep -i throttle
or:
kubectl top pod <pod> -n staging
C. Increase replicas temporarily
If readiness failures disappear when you scale up:
kubectl scale deploy next-api-staging -n staging --replicas=10
→ confirms saturation/starvation.
Immediate mitigations (practical)
1) Increase CPU limit
If you have:
limits:
  cpu: 200m
Try:
limits:
  cpu: 500m
or remove limit entirely (often better for Node).
2) Run multiple Node workers
Use:
- cluster
- pm2 -i max
- one pod = one core (best)
3) Keep readiness & liveness separate (minor improvement)
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 10
No need for 60s liveness; it hides deadlocks.
Why this is actually good news
Your readiness probe is doing exactly what it should:
- Detecting real inability to serve traffic
- Preventing bad pods from receiving requests
- You’ve uncovered a real production issue that just happens to surface in staging first.
Bottom line
A trivial /health endpoint timing out for 9 seconds in Node.js means the event loop is blocked or starved.