Solving the mystery of pod health check failures in Kubernetes

Roman Kuchin
Published in Pipedrive R&D Blog
Jan 10, 2023 · 4 min read


Pipedrive Infra operates self-managed Kubernetes clusters in different clouds, mainly AWS and on-premise OpenStack.

At the time of writing, we manage over 20 different clusters, some more specialized than others, big and small.

The history of failing pod health checks

We noticed a long time ago that pod health checks would sometimes fail without any apparent reason, then recover almost immediately. Still, no one considered it an issue, since it happened rarely and didn’t affect anything.

We manage our infra as cattle, keeping in mind that everything dies eventually: pods get OOMkilled and recreated, deployments are shrunk by HPA, Kubernetes workers can disappear sometimes and so on. In the end, clouds are someone else’s computers, which is why we build fault-tolerant infrastructure.

In short, failing health checks were negligible compared to other events.

But then it started happening more often, and developers began getting more frequent alerts about their deployments’ health. When this happens, the first team to ask is usually infra, but we had nothing to report: everything looked healthy.

The search begins

One day, we decided to find the root cause of this problem. At first, it was so rare that it was pretty hard to catch the event.

Step 1: logs

  • Kubernetes workers’ syslog — nothing there.
  • Kubelet logs — nothing there.
  • Containerd logs — nothing.
  • CNI logs — nothing.
  • Logs in pods with recently failed checks — nothing: They seemed perfectly happy.
  • Logs in the failed pods’ “friends” — nothing: They didn’t detect any downtime of the friendly service.
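
For reference, the checks above boiled down to commands along these lines (unit names, paths and the pod name are typical defaults and placeholders, not an exact transcript):

# worker syslog plus the kubelet and containerd systemd units
grep -i error /var/log/syslog
journalctl -u kubelet --since "1 hour ago"
journalctl -u containerd --since "1 hour ago"
# logs of a pod that just failed its check
kubectl logs <pod-name> -n <namespace>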

But health checks were still failing, happening more often and in more places!

Things to keep in mind:

  • It’s not tied to the cloud: We experienced the same cadence of failures in AWS EC2 and in on-premise VMs.
  • Not tied to CNI: They’re different in different clouds. We use Calico on-premises and the AWS CNI in AWS.
  • Not tied to Containerd or Kubernetes version: It was failing everywhere.
  • Didn’t depend on cluster load: It happened in test environments, at peak times and during the night.

Taking a deeper look

As it seemed to be a “network problem,” the case was assigned to the networking team, which wanted to see the actual traffic and turned to tcpdump.

Step 2: tcpdump

In the traffic capture, we noticed that when a TCP SYN went from Kubelet to the pod, the pod replied with a TCP SYN-ACK, but no TCP ACK followed from Kubelet. After some retries, Kubelet established a TCP session with no issues: a completely random failure.

Just in case, we checked seq and ack numbers in failed TCP flows, and everything was perfectly fine.
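
For context, the capture was of this general shape (the pod IP and probe port below are placeholders, not our exact command):

# capture probe traffic between the node and a suspect pod
tcpdump -ni any 'host 10.10.1.15 and tcp port 8080' -w probes.pcap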

We started to suspect the source process on the worker node: What if something happens to Kubelet and it doesn’t want to continue?

Step 3: ss

We checked the output of “ss -natp” every second. This should have shown something — at least connections per process.

We quickly found that failed connections were stuck in SYN-SENT. That didn’t match what we saw in tcpdump, which suggested the state should have been SYN-RECV.
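
The polling itself was nothing fancy; a minimal sketch along these lines does the job:

# log non-established TCP sockets with their owning process, once per second
while true; do date; ss -natp | grep -v ESTAB; sleep 1; done >> ss.log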

Step 4: conntrack

In conntrack, broken connections were stuck in SYN-RECV, which at least was expected.
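
A filtered listing is enough to spot such entries, assuming conntrack-tools is installed on the node:

# list TCP conntrack entries stuck in SYN_RECV
conntrack -L -p tcp --state SYN_RECV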

What might happen to return traffic to Kubelet after it passes the firewall? What can prevent TCP SYN-ACK from reaching the socket that Kubelet opened?

At this point, we were out of ideas, but we noticed that the connections stuck in SYN-SENT or SYN-RECV were not entirely random: all the source ports looked similar.

We allow a wide range of ports to be used as source ports:

net.ipv4.ip_local_port_range=12000 65001

Problematic ones appeared as 30XXX or 31XXX, a range that looked pretty familiar.

Step 5: ipvs

We checked our IPVS config with ipvsadm and discovered that all the ports in stuck connections were reserved by Kubernetes NodePorts.
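
Checking whether a given source port is actually an IPVS virtual server is a one-liner (31055 below is just an example port):

# list IPVS virtual servers and look for the suspect port
ipvsadm -Ln | grep 31055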

Uncovering the root cause

So, Kubelet was initiating a TCP session to a pod using a random source port, for example, 31055. The TCP SYN reached the pod, and the pod replied with a TCP SYN-ACK to port 31055.

The reply hit IPVS, where we have a load balancer for a Kubernetes service with NodePort 31055. The TCP SYN-ACK got redirected to the service endpoints (other pods). The result was predictable: nobody answered back.
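
Finding the Kubernetes service that owns the colliding NodePort is just as quick (31055 again being the example port):

# NodePorts show up in the PORT(S) column as <port>:<nodeport>/TCP
kubectl get svc -A | grep 31055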

Solution

The solution for the failing health checks was to disallow the use of the NodePort range as source ports for TCP sessions.

Luckily, it’s just a single line:

net.ipv4.ip_local_reserved_ports=30000-32768
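
To roll it out consistently, the sysctl can be applied live and persisted via a drop-in file; a minimal sketch (the file name is arbitrary):

# apply on a running node
sysctl -w net.ipv4.ip_local_reserved_ports=30000-32768
# persist across reboots
echo "net.ipv4.ip_local_reserved_ports = 30000-32768" > /etc/sysctl.d/99-reserved-ports.conf
sysctl --system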

After we implemented it on all nodes, the problem was resolved right away. Ironically, several hours of troubleshooting resulted in only one line of code.
