Kubernetes Events and Warnings That Won’t Go Away

Chris Cunningham
Appsbroker CTS Google Cloud Tech Blog
Dec 11, 2023


(well …you try finding an interesting image for this blog topic!)

Recently I was deploying an application into a cluster, and, whilst looking into a deployment issue, I noticed something that I couldn’t shake off until I investigated it. The result was a better understanding of the Kubernetes event model and how it can lead to confusion when naively debugging through the time-honoured method of “tail all the logs and fix all the errors”.

The application I was deploying is a Python-based web service that needs to connect to a database. If the whole service is deployed from scratch, the database might take a while to become ready, on top of the overhead of application startup. In addition, if the service becomes unavailable for some reason (such as a container fault, or a bug in the application logic), we want the pod to be marked as unhealthy so that traffic is routed only to healthy pods. Here’s a quick look at how this might be configured:

containers:
  - name: my-app
    image: my-image:v1
    ports:
      - name: http
        containerPort: 5000
        protocol: TCP
    readinessProbe:
      httpGet:
        path: /
        port: http

We’re leaving the default values for most of the readinessProbe’s configuration, but here they are for completeness:

  • initialDelaySeconds: 0 — this could be tweaked but we expect the service to be available pretty soon so it is okay if the first probe fails.
  • periodSeconds: 10 — the probe runs every ten seconds.
  • successThreshold: 1 — a single successful probe is enough to mark the pod as ready.
  • failureThreshold: 3 — the pod will have its Ready condition set to false if three probes in a row all fail.
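
For reference, here’s a sketch of the same probe with those defaults spelled out explicitly; you’d only set them if you wanted different values:

readinessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 0   # start probing as soon as the container starts
  periodSeconds: 10        # probe every ten seconds
  successThreshold: 1      # one success marks the pod Ready
  failureThreshold: 3      # three consecutive failures mark it not Ready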

So I apply my configuration and watch the service get deployed, and a few seconds later, I have a fully working application responding to my requests. But I notice something in the event log:

Warning  Unhealthy  85s (x2 over 86s)  kubelet  Readiness probe failed: Get "http://10.244.201.221:5000/": dial tcp 10.244.201.221:5000: connect: connection refused
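
That warning comes from the pod’s event stream; you can see it with kubectl describe pod, or query events directly. The pod name here is just a placeholder:

kubectl describe pod my-app-6d4b75cb6d-xxxxx
kubectl get events --field-selector involvedObject.name=my-app-6d4b75cb6d-xxxxx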

My pod is unhealthy? Shouldn’t it not be receiving traffic then?

Even curiouser was when I looked a few minutes later:

Warning  Unhealthy  9m48s (x2 over 9m49s)  kubelet  Readiness probe failed: Get "http://10.244.201.221:5000/": dial tcp 10.244.201.221:5000: connect: connection refused

Ten minutes! And yet my application is happily serving requests. Why is it marked as unhealthy?

The answer, of course, is that the pod isn’t unhealthy:

Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
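
If you just want that Ready condition without the full describe output, a jsonpath query does the job (again, the pod name is a placeholder):

kubectl get pod my-app-6d4b75cb6d-xxxxx \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'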

And so let’s talk about Kubernetes events.

Kubernetes events are closely associated with state changes: that is, a resource has transitioned from one state to another. But they can also be generated directly, such as the readinessProbe failures above. The important thing is that there is nothing intrinsically tying the two together: events are not necessarily generated by state changes, and crucially, state changes do not necessarily generate events.
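
Events are also objects in their own right, and repeated occurrences are typically deduplicated: the existing Event gets its counter and timestamp bumped rather than a new object being created each time. Here’s a rough sketch of the Event behind the warning above, with illustrative values, showing how the “85s (x2 over 86s)” rendering maps onto its fields:

apiVersion: v1
kind: Event
type: Warning
reason: Unhealthy
message: 'Readiness probe failed: ... connection refused'
count: 2                               # the "x2"
firstTimestamp: "2023-12-11T10:00:00Z" # the "over 86s" (first occurrence)
lastTimestamp: "2023-12-11T10:00:01Z"  # the "85s" (most recent occurrence)
involvedObject:
  kind: Pod
  name: my-app-6d4b75cb6d-xxxxx        # placeholder pod name
source:
  component: kubelet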

You’ll be familiar with the events that get generated when a container starts:

  Type     Reason    Age   From     Message
  ----     ------    ---   ----     -------
  Normal   Pulling   11m   kubelet  pulling image "my-image:v1"
  Normal   Pulled    11m   kubelet  Successfully pulled image "my-image:v1"
  Normal   Created   11m   kubelet  Created container
  Normal   Started   11m   kubelet  Started container

These are one-offs: a container is only pulled, created, and started once. But readinessProbes don’t only run during startup; they run continuously (so that traffic can be redirected away from containers that stop responding). And readinessProbes only generate events on failure. Why is that? Because otherwise, every probe would send a new event every ten seconds just to say that your service was operating as expected. The result is that a readinessProbe will only ever appear in your event log when it isn’t working, and the only way to deduce that things are working again is either to look directly at the pod status, or to work out from the Age and recurrence count of the event that it’s stale.

(If you truly need to see the results of every readiness probe, you can raise the logging level of the kubelet service on each node and look there. But you won’t get it from the pod event log.)
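
As a rough sketch, on a node where the kubelet runs under systemd that might look something like the following; the exact verbosity level that surfaces probe results is an assumption, and how you set kubelet flags varies by distribution:

# raise kubelet verbosity, e.g. add --v=4 to its flags, then:
sudo systemctl restart kubelet
# follow the kubelet logs and filter for probe output
sudo journalctl -u kubelet -f | grep -i probe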

Takeaways here:

  • Events don’t tell the whole story about resource state: inspect the resource’s status directly.
  • Not every state change generates an event.
  • Sometimes a big red tab at the top of your ArgoCD resource summary really can just be ignored.
