Troubleshooting Kube-Proxy and RKE2-Canal Pods Restart Loop: Lessons from Upgrading Ubuntu 20.04 to 22.04

Özkan Poyrazoğlu
Aug 10, 2023

As Ubuntu 20.04 starts to fall behind, the painful OS upgrade of the VMs hosting a Kubernetes cluster draws closer. If you are running an RKE2 (Rancher Kubernetes Engine) cluster with containerd and want to upgrade your hosts to Ubuntu 22.04, this article describes our upgrade process and experiences, with details that should help you avoid ending up in the same situation.

First of all, some information about our setup: as you can imagine, we were using Ubuntu 20.04 host VMs and RKE2 (v1.25.8) with containerd as the container runtime, and all of the workloads that would be affected were in production, so we had to make a smooth transition. The cluster has 3 masters (with etcd) and 17 worker nodes. The last point worth mentioning is that EKS-Distro images are used in the cluster.

To describe the process we followed: we created a test worker node with a NoExecute taint, to prevent existing production workloads from being scheduled on it, and registered the node to the RKE2 cluster. The goal was to simulate all the actions we would take on the other nodes, without exposing our workloads.

As a side note, it is possible to pass taints via the config file in RKE2 or K3s clusters. To achieve this, add the following block to /etc/rancher/rke2/config.yaml:

node-taint:
- "UpgradeTesting=true:NoExecute"

“So why are you doing this in the production cluster, isn’t there a test cluster for that?” I can hear you asking. The answer is yes; the transition did go smoothly in the test cluster, but we preferred to create a control worker in production and proceed through it, so that we would not hit any surprise problems during the upgrade of the production nodes.

After installing the RKE2 agent and registering it to the existing cluster, we verified that the node was up, in Ready state, and running without any fatal errors. Of course, up to this point we were still on Ubuntu 20.04. After making sure the whole setup was healthy, we started the do-release-upgrade process, and the VM was upgraded to Ubuntu 22.04 after about 45 minutes.
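
For reference, the upgrade itself is standard Ubuntu procedure; a rough sketch for a node that carries no workloads, like our tainted tester:

sudo apt update && sudo apt full-upgrade -y
sudo do-release-upgrade
# on workers carrying pods, drain before upgrading:
# kubectl drain <node> --ignore-daemonsets --delete-emptydir-data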

With Ubuntu 22.04 came some issues at the CNI level: the kube-proxy and rke2-canal pods kept restarting even though the node looked Ready. Briefly, our findings were:

  • The kube-proxy pod does not throw any error logs when restarting. Also, when we tried to run kube-proxy manually (we edited the static pod manifest and ran it in the container), it worked properly until iptables entries started coming up.
  • When we examined the rules attached in iptables/nftables, there did not seem to be a serious problem. The iptables version is v1.8.6 (nf_tables) in the container and v1.8.7 (nf_tables) on the host.
  • We manually deleted the kube-proxy static pod and restarted rke2-agent on the host VM; we also examined rke2-canal (especially the kube-flannel container), and none of this solved the issue.
  • We checked containerd and kubelet logs and could not find anything pointing to the cause of the restart loop.
  • Deleting the node from kube-api and re-registering it did not solve the issue.
  • Running update-alternatives --set iptables /usr/sbin/iptables-legacy and restarting the node did not resolve the issue either. (Some of these checks are sketched as commands right after this list.)
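
For reference, these findings came from checks along these lines; a hedged sketch assuming the default RKE2 paths (the container ID is a placeholder):

# iptables version and mode on the host
iptables --version

# inspect the kube-proxy container through RKE2's bundled crictl
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps --name kube-proxy
/var/lib/rancher/rke2/bin/crictl logs <kube-proxy-container-id>

# dump the KUBE-* chains that kube-proxy programs
iptables-save | grep '^-A KUBE-' | head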

We compared the node against a fresh rke2-agent installation on a fresh Ubuntu 22.04 VM and found some differences in the containerd configuration: the plugin API definitions, the version definition, and the enable_unprivileged_ports and enable_unprivileged_icmp settings were different.
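
How we compared them: RKE2 renders the live containerd config from the template at agent startup, so diffing the rendered file on the upgraded node against the one from the fresh node makes the drift obvious. A minimal sketch (the /tmp path is just wherever you copied the fresh node's file):

# config.toml is rendered from config.toml.tmpl when rke2-agent starts
diff -u /var/lib/rancher/rke2/agent/etc/containerd/config.toml /tmp/config.toml.fresh-22.04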

Before changing the configuration files, stop the rke2-agent or rke2-server service and run the rke2-killall script. In our case, we proceeded as follows:

systemctl stop rke2-agent

bash /usr/local/bin/rke2-killall.sh
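
Before editing anything, it does not hurt to confirm the killall script really stopped everything:

# no kubelet or containerd processes should be left after rke2-killall.sh
ps -ef | grep -E 'kubelet|containerd' | grep -v grep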

You can check and change the containerd configuration file at /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl. Comparing the configurations:

The older configuration file:

[plugins.opt]
path = "/var/lib/rancher/rke2/agent/containerd"

[plugins.cri]
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
enable_selinux = false
sandbox_image = "index.docker.io/rancher/pause:3.6"

[plugins.cri.containerd]
disable_snapshot_annotations = true
snapshotter = "overlayfs"

[plugins.cri.containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

The newer configuration file we changed to:

version = 2

[plugins."io.containerd.internal.v1.opt"]
path = "/var/lib/rancher/rke2/agent/containerd"

[plugins."io.containerd.grpc.v1.cri"]
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
enable_selinux = false
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
sandbox_image = "index.docker.io/rancher/pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
disable_snapshot_annotations = true
snapshotter = "overlayfs"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true

After you have customized the configuration, start rke2-agent again. As part of this, also delete the rke2-canal pod running on the same node; since it is managed by a DaemonSet, a new pod will be created with the new containerd configuration.
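
A minimal sketch of that step (the canal pod name is a placeholder you would look up first):

systemctl start rke2-agent

# find the rke2-canal pod scheduled on this node and delete it so the
# DaemonSet recreates it on top of the new containerd configuration
kubectl -n kube-system get pods -o wide | grep rke2-canal
kubectl -n kube-system delete pod <rke2-canal-xxxxx>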

At the end of the process, the CNI-related pods on the node, in particular kube-proxy and rke2-canal, showed no new restarts and no errors.
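
To double-check, you can watch the restart counters of the pods on that node for a while (<node-name> is a placeholder):

# the RESTARTS column should stop increasing for kube-proxy and rke2-canal
kubectl -n kube-system get pods -o wide --field-selector spec.nodeName=<node-name> -w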
