Episode 4: The one with probes.

Concurrency and Distributed Systems: A guide with Kubernetes

--

In the previous episode we talked about fault tolerance and resiliency. In this episode we want to talk about startup, liveness and readiness probes. These probes are feedback and control mechanisms which, in their own way, contribute to resiliency and fault tolerance.

These Kubernetes probes, while very useful, can lead to daunting, hard-to-debug problems in the system if not configured correctly. They are signals to the K8s management system that tell it the following:

  • Whether a container has started or not.
  • Whether a container is ready or not.
  • Whether a container is alive or not.

Each of these signals would entail an action somewhere in the system.

  • Once the startup probe passes, K8s considers the container started, begins running the other probes, and starts to direct traffic to it.
  • If the readiness probe fails, K8s will stop traffic to the container until the probe passes again.
  • If the liveness probe fails, K8s will restart the container.
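
To make these moving parts concrete, here is a minimal sketch of a container spec with all three probes configured. The pod name, image, endpoints, ports, and timing values are illustrative placeholders, not a recommended configuration:

    apiVersion: v1
    kind: Pod
    metadata:
      name: probe-demo                            # hypothetical pod name
    spec:
      containers:
        - name: app
          image: registry.example.com/app:latest  # placeholder image
          ports:
            - containerPort: 8080                 # assumed application port
          startupProbe:            # gates the other two probes until the app has started
            httpGet:
              path: /startup                      # assumed endpoint
              port: 8080
            periodSeconds: 5
            failureThreshold: 30
          readinessProbe:          # controls whether traffic is routed to the container
            httpGet:
              path: /ready                        # assumed endpoint
              port: 8080
            periodSeconds: 10
          livenessProbe:           # restarts the container when it keeps failing
            httpGet:
              path: /healthz                      # assumed endpoint
              port: 8080
            periodSeconds: 10
            failureThreshold: 3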

Startup

This probe is used to signal whether an app or container has successfully started. If this probe is configured, K8s will ignore the other probes (liveness and readiness) until the startup probe succeeds. We highly recommend that you configure this probe. When we auto-scale and add new nodes to a cluster, we need a sure way to know whether a worker’s container has started before we begin sending traffic to it. Some apps are straightforward and fast to start. On the other hand, there are apps and services which require seconds to minutes before they are able to serve. This startup time could be due to the following:

  • Process loading into memory. Java apps for example may take tens of seconds to start and load into memory.
  • The application needs to load data over the network. For example, it might need to load configuration files or machine learning models from an S3 bucket; all of that takes an indeterminate amount of time.
  • It might be waiting on other downstream services to start.

Having a reliable startup signal helps with a more robust system.

Timeline of a startup probe.

Here are some examples of defining startup probes:

Examples of startup probe configuration.
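
A rough reconstruction of what those two snippets typically look like is shown below; the path, port, command, marker file, and timing values are assumptions for illustration, not taken from the original snippets:

    # Example 1: http startup probe
    startupProbe:
      httpGet:
        path: /startup           # assumed health endpoint
        port: 8080               # assumed application port
      periodSeconds: 5
      failureThreshold: 30

    # Example 2: command (exec) startup probe
    startupProbe:
      exec:
        command:
          - cat
          - /tmp/app-started     # assumed marker file written by the app once it is up
      periodSeconds: 5
      failureThreshold: 30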

In the above example, K8s will run the probe by sending an http request on the specified path and port, or by running the command (second example) inside the container. In the case of http, it looks for a successful response code (anything from 200 up to, but not including, 400), and in the case of the command, it expects a successful run with an exit code of 0.

Make sure you choose reasonable values for periodSeconds, failureThreshold and initialDelaySeconds. Account for possible failures and set thresholds generous enough to cover them; this all depends on the app and your specific requirements.
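
For example, a configuration along these lines (values are illustrative) gives the app roughly initialDelaySeconds + failureThreshold × periodSeconds = 20 + 30 × 10 = 320 seconds to start before the container is killed and restarted:

    startupProbe:
      httpGet:
        path: /startup           # assumed endpoint
        port: 8080               # assumed port
      initialDelaySeconds: 20    # do not probe at all for the first 20 seconds
      periodSeconds: 10          # then probe every 10 seconds
      failureThreshold: 30       # tolerate up to 30 consecutive failures
      # Startup budget: roughly 20 + 30 * 10 = 320 seconds before a restart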

There is a small subtlety when it comes to configuring this probe (as well as the readiness probe) in some applications. If you have a coordinator/worker architecture, where a coordinator has workers doing tasks for it, you need to be careful about how you configure the startup probe. In such applications, the worker nodes usually send a signal to the coordinator or to a discovery server and announce that they are ready to work.

If the application-specific readiness signal is not aligned properly with the K8s startup probe, it could lead to issues. That is because K8s only allows connections to the container after it sees the startup probe succeed. To demonstrate this, let’s look at the following diagram:

Mis-aligned startup probe and application signal.

As can be seen in the above diagram, the worker sends a “ready to do work” signal to the coordinator. The coordinator, assuming it can reach the worker’s container, sends requests to the worker, but they fail because the container is not yet accepting connections. Hence the first two requests, in this example, fail. Make sure that your application-specific startup/readiness signals are aligned properly with the K8s probes:

Timeline of a properly aligned startup probe.

It’s worthwhile to emphasize that when we are dealing with service dependencies like these, we need to be careful when configuring these probes. To see more examples of how to define startup probes, visit this link.

Readiness

The readiness probe is very similar to the startup probe, but it is used throughout the lifetime of the container. It is meant to check periodically whether the app is still ready to serve traffic. If it is not, K8s will stop sending requests to the container until the probe passes again. Why would the app not be ready?

  • It is overloaded.
  • It might need to do some housekeeping or cleanup.
  • Some of its dependencies (a cache, a database, etc.) are not working or are unreachable.

Timeline of a readiness probe.

Defining a readiness probe is very similar to defining a startup probe, and you can find some examples of it at this link.
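
As a rough sketch (the endpoint, port, and timing values here are assumptions for illustration):

    readinessProbe:
      httpGet:
        path: /ready          # assumed endpoint; its handler can check dependencies such as a cache or database
        port: 8080            # assumed application port
      periodSeconds: 10       # re-check readiness every 10 seconds
      failureThreshold: 3     # mark the pod not-ready after 3 consecutive failures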

Liveness

This probe is designed to test the status of a container and restart the container if it is not alive. Your app/container may become unresponsive for various reasons:

  • A bug in the system.
  • Waiting for another resource indefinitely (a deadlock, which can be considered a hard-to-track bug).
  • Becoming too overloaded.

In such situations, sometimes the best way to automatically resolve the problem is to restart the process (container). That is when this probe comes in handy. Syntax-wise, configuring this probe is similar to the previous two, but watch how you set the various parameters: set a reasonable initialDelaySeconds and make sure to allow a large enough failureThreshold before declaring the container dead.

Let’s look at an example of a liveness probe:

Code snippet for liveness probe.
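
A reconstruction of that snippet, based on the description below, might look like this; the port and timing values come from the text, while the path is an assumption:

    livenessProbe:
      httpGet:
        path: /healthz       # assumed path; the text only specifies the port
        port: 8888
      periodSeconds: 10      # probe every 10 seconds
      failureThreshold: 2    # give up after 2 consecutive failures
      timeoutSeconds: 4      # each probe attempt times out after 4 seconds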

This is an http probe hosted on port 8888. The probe is checked every 10 seconds and will be tried 2 times before giving up. Every time K8s checks this probe, it waits up to timeoutSeconds = 4 seconds before declaring that attempt a failure. Here is an example of this probe in action which leads to restarting the container:

Timeline of a failed liveness probe.

In summary, these are important probes to configure in K8s. However, one has to be careful when configuring them; otherwise they could lead to unforeseen issues in a distributed system with inter-service dependencies.
