Friday, 2 January 2026

Kubernetes Probes

 


Kubernetes restarts containers when they exit or crash, but that alone does not guarantee the apps are running fine: an application can stall or get into an error state while its process keeps running indefinitely.

This is why we need to proactively probe the pods to check that the app is alive and ready to serve requests.

Kubernetes probes are tests the cluster runs against pods to check whether they are fit to serve traffic, or whether corrective action is needed.
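
Probes are defined per container in the pod template. A minimal sketch of where they live (the pod name, image and port here are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: web-app                        # illustrative name
spec:
  containers:
    - name: web
      image: example.com/web-app:1.0   # illustrative image
      ports:
        - containerPort: 8080
      livenessProbe:                   # readinessProbe and startupProbe
        httpGet:                       # go at this same level
          path: /health
          port: 8080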


Liveness probe

  • Purpose: 
    • Checks if a container is alive (running and healthy/responsive)
    • Detects situations where an application is running but unresponsive (e.g., deadlock) and needs a restart to recover. 
      • For example: because of a bug or some other reason, instead of 2xx or 3xx responses (both of which k8s treats as success), the app's endpoints suddenly start returning other status codes, which indicate failure. Without a liveness probe, the pod will not restart on its own, and we won't notice that something is wrong unless we monitor the pod's logs.
    • Ensures basic functionality by restarting the container. Each time the container restarts, the pod's RESTARTS counter is incremented by 1.
  • When to Use: 
    • For fundamental health checks, like ensuring a web server process is running. 
    • To force a restart of a container that has neither crashed (terminated with a non-zero exit code) nor exited gracefully (terminated with a zero exit code). We can use this to get the app killed and restarted if an unrecoverable or unforeseen problem occurs. Sometimes, though, killing the container is a bad idea: it may not be dead, just taking very long to respond while processing a big user request. In that case we should not kill it.
  • Action on Failure: 
    • Restarts the container

Example:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 1
  timeoutSeconds: 1


The configuration above tells the kubelet to send an HTTP GET request to the pod's /health endpoint 10 seconds (initialDelaySeconds) after the container starts (or restarts), to keep probing every 10 seconds (periodSeconds), to consider a probe failed if a 2xx or 3xx response is not returned within 1 second (timeoutSeconds; 1 second is the default), and to declare the container unhealthy on the 1st failed probe (failureThreshold).
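
httpGet is not the only probe mechanism. A liveness probe can instead run a command inside the container (exec) or simply check that a TCP port accepts connections (tcpSocket). Two alternative sketches (the file path and port are illustrative):

livenessProbe:
  exec:
    command:            # probe fails if the command exits with a non-zero code
      - cat
      - /tmp/healthy    # illustrative: the app touches this file while healthy
  periodSeconds: 10

livenessProbe:
  tcpSocket:
    port: 8080          # probe fails if nothing accepts the TCP connection
  periodSeconds: 10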


Startup probe

  • Purpose: 
    • Delays the start of the other probes when the pod is created or restarted.
  • When to Use: 
    • When the app needs some time to start up, we don't want to set an arbitrary initialDelaySeconds value on the liveness probe. A small initial delay means some apps may not be ready yet when the liveness probe kicks in, and they will get killed unnecessarily. A large initial delay creates the opposite problem: bad containers run for longer before Kubernetes notices they are bad. The startup probe addresses both cases: containers that get ready quickly are put to work quickly, and containers that take longer to get ready still have a chance to become ready instead of being terminated too soon.
The liveness probe starts after the first successful startup probe (if one is configured): the startup probe's main job is to delay liveness and readiness checks until the application is fully initialized, preventing premature restarts of slow-starting containers. Once the startup probe succeeds, Kubernetes runs the regular liveness and readiness probes for the rest of the container's lifecycle.

If a startup probe is defined, the initialDelaySeconds configuration for the liveness and readiness probes is irrelevant (specifically, those probes are disabled) until the startup probe succeeds. 

The startup probe acts as a gatekeeper: 
  • While the startup probe is running and failing, the liveness and readiness probes are paused and do not begin their checks.
  • Only after the startup probe has successfully passed does the standard logic for liveness and readiness probes begin, at which point their respective initialDelaySeconds (if set) will then be respected. 
This mechanism is designed to handle applications with variable or long startup times, preventing the liveness probe from prematurely killing the container or the readiness probe from marking it as "not ready" before it has had a chance to fully initialize. 

Here's the sequence:
  • Container Starts: The container begins to run.
  • Startup Probe Runs: The kubelet executes the startup probe.
  • Startup Probe Succeeds: If successful, the application is considered started, and the startup probe stops running.
  • Liveness & Readiness Start: Kubernetes then begins periodically running the configured liveness and readiness probes (respecting their initialDelaySeconds, if set).
  • Liveness Probe Fails: If the liveness probe fails, Kubernetes restarts the container.
  • Readiness Probe Fails: If the readiness probe fails, the pod is removed from service endpoints but not restarted. 
In essence: The startup probe acts as a gatekeeper, allowing slow applications to get ready before the health checks (liveness/readiness) kick in and potentially kill them. 


Example configuration:

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 10
  periodSeconds: 5
  timeoutSeconds: 1 

This configuration tells k8s to probe this pod every 5 seconds (periodSeconds), to wait 1 second for a response (timeoutSeconds; 1 second is the default value) and to fail only after 10 unsuccessful attempts (failureThreshold). So this app has failureThreshold x periodSeconds = 10 x 5 = 50 seconds to initialize before the startup probe fails.
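
To make the gatekeeper relationship concrete, here is a sketch of a startup probe combined with a liveness probe on the same container (the timings are illustrative):

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 10    # up to 10 x 5 = 50 seconds allowed for startup
  periodSeconds: 5
livenessProbe:            # paused until the startup probe succeeds
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 1     # once running, a single failure triggers a restart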

Readiness probe

  • Purpose: 
    • Determines if an application is fully initialized and prepared to handle incoming network requests.
    • Checks if a container is ready to serve traffic (e.g., database connected, cache loaded)
    • Manages traffic flow for gradual rollouts and recovery from temporary issues
  • Action on Failure: 
    • Stops sending new traffic to the pod: it is temporarily removed from the Service endpoints while not ready, which prevents new requests from reaching it, but the container is not restarted.
    • If the Deployment has only 1 replica and this single pod is taken out of ready status, for a while no pods will be READY and no users can be served. This can be worked around by setting multiple replicas on the Deployment or by using a PodDisruptionBudget object (see the sketch after this list).
  • When to Use: 
    • When an app needs time to load data, connect to databases, or initialize, preventing users from hitting a partially ready instance.
    • We need this probe in addition to the liveness probe because there can be a case when the app needs extra time to process some request, which means the next liveness query would fail (if the request processing time exceeds the liveness probe's periodSeconds + timeoutSeconds). In that case, the failed liveness probe would restart a healthy pod, just because it took a long time to process a request. Setting a large failureThreshold or timeoutSeconds on the liveness probe is not a good fix, because it also delays detection of genuinely dead containers.
    • The benefit of the readiness probe is that, by not restarting the container, the user still gets their response, and whatever processing was halfway through is not lost. Also, by taking the pod out of service, we avoid sending more traffic to a blocked container that is, at least right now, responding slowly to all requests.
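
A minimal PodDisruptionBudget sketch for the single-replica concern above (the name and label are illustrative; note that a PDB guards against voluntary disruptions such as node drains, so running multiple replicas remains the primary fix):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb          # illustrative name
spec:
  minAvailable: 1            # keep at least one pod up during disruptions
  selector:
    matchLabels:
      app: web-app           # must match the Deployment's pod labels
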
Example:

livenessProbe:
  ...
  failureThreshold: 10    # increased

readinessProbe:
  httpGet:
    path: /healthy
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 1


The readiness probe configuration above instructs the kubelet to wait 10 seconds (initialDelaySeconds) after the container starts or restarts, then query the pod every 10 seconds (periodSeconds) and fail after the 1st unsuccessful attempt (failureThreshold). This fits if we consider it acceptable for the pod to take up to 10 seconds to process a request. If this probe fails, k8s takes the pod out of service: it is marked as not ready (the READY column will show e.g. 0/1) but the pod is NOT restarted (the RESTARTS value remains the same). As soon as the long request is served and the next readiness probe succeeds, k8s puts the pod back into service, meaning it again starts receiving requests and the READY column shows that the pod is ready to serve them.

The liveness and readiness probes run at the same time. If they have the same settings (failureThreshold, periodSeconds, timeoutSeconds), then when the readiness probe fails and the pod is about to be taken out of ready status, the liveness probe will also fail and the container gets restarted. It therefore does not make much sense to give the readiness and liveness probes identical configurations: the readiness probe becomes ineffective, as there is no point in taking a pod out of service while it is being restarted.


Example #2:

Deployment configuration

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 9
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 30
  timeoutSeconds: 60


20 seconds (initialDelaySeconds) after the container (re)starts, both the liveness and readiness probes kick in. The readiness probe then runs every 10 seconds with a 9-second timeout, while the liveness probe runs every 30 seconds with a far more tolerant 60-second timeout, so a pod that responds slowly is taken out of service well before it is restarted.

Key Differences


  • Impact:
    • Liveness restarts; Readiness stops traffic.
  • Focus: 
    • Liveness is about the process being alive; Readiness is about being able to serve requests.
  • Scenario: 
    • A liveness failure means the container is broken; a readiness failure means it's busy or temporarily unavailable. 

Example Scenario


Imagine a web app (a combined sketch of all three probes follows this list):
  • Startup: Uses a startup probe to wait for initial setup (slow).
  • Running:
    • Liveness Probe: Checks if the web server process is still running. If it crashes, restart it.
    • Readiness Probe: Checks if the database connection is active and cache is warm. If not, stop traffic until it's ready, avoiding errors for users. 
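
Putting this scenario together, the probes section of the container spec might look roughly like this (a sketch; paths, port and timings are illustrative, and /ready is assumed to be a separate endpoint that also checks the database and cache):

startupProbe:             # gatekeeper: liveness/readiness wait for this
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30    # up to 30 x 5 = 150 seconds for slow initial setup
  periodSeconds: 5
livenessProbe:            # is the web server process still responsive?
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 30
  failureThreshold: 3     # tolerant, because a restart is disruptive
readinessProbe:           # can it serve traffic right now?
  httpGet:
    path: /ready          # illustrative endpoint
    port: 8080
  periodSeconds: 10
  failureThreshold: 1     # strict, because leaving service is cheap and reversible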

What to choose for tests?


...

