For the love of god, learn when to use CPU limits on Kubernetes.

Eliran Cohen
7 min read · Mar 5, 2023

The world of Kubernetes is full of opinionated articles trying to tell you what’s best for your workload. One of the most controversial topics is CPU limits. In this article, we’ll explore how CPU requests and limits work, why they were introduced, and how to monitor CPU usage.
Remember that there’s no one-size-fits-all approach, and it’s up to you to determine what works best for your workload. Hopefully, by the end of this article, you’ll be inspired to learn more about how things work.

How does the CPU work on Linux?

Just like our brains (at least my brain), a CPU core can only handle one task at a time. However, unlike my brain, the CPU can switch between tasks so quickly that it seems to be doing many things at once.
The CFS (Completely Fair Scheduler) is the Linux scheduler responsible for deciding which program gets the CPU, and for how long.
The CFS works by dividing the available processing time into short time slices and tracking how much CPU time each program has accumulated. It ensures that every runnable program gets its fair share of processing time, so no program is starved.
To understand how the CFS works, imagine three siblings trying to talk to their parent simultaneously before bedtime. The parent can only listen to one sibling at a time, just like the CPU can only handle one task. The CFS acts as the parent in this scenario, carefully switching attention between the different siblings to ensure each one gets a fair chance to speak before sleep. Like the CFS, the parent tries to be completely fair, keeping track of how long each sibling has spoken and ensuring that no one gets all the attention.
If two siblings fall asleep during this scenario, the parent will give all their attention to the remaining awake sibling. Similarly, if only one program needs to use the CPU, it will receive 100% of the available processing time.
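If you want to see this fairness for yourself, here is a quick sketch you can run on any Linux box (it assumes a single-core machine, so the two processes actually have to compete):

yes > /dev/null &    # a trivial CPU-bound process
yes > /dev/null &    # a second one competing for the same core
top                  # on a one-core machine, each "yes" settles around ~50% CPU
kill %1 %2           # clean up both background jobs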

How does the CPU work on Kubernetes?

When it comes to servers and containerized environments, prioritizing certain services over others is often necessary instead of just allocating processing time based on a “fair share” model that might be more suitable for desktop users. To meet this need and maximize the potential of containerized environments, Google engineers created a feature called Cgroups in 2006, which was later integrated into the main Linux kernel in 2008.

Cgroups is a kernel feature that limits, tracks, and isolates resource usage (such as CPU, memory, I/O, network, and so on) for one or more processes. The CPU subsystem of Cgroups enables you to set a new “share” value for a container, which the CFS will consider when managing the CPU time.
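If you are curious what this looks like on a host, on a machine using cgroup v2 (the unified hierarchy) you can list the controllers the kernel exposes with something like the following; the path and layout differ on older cgroup v1 hosts:

cat /sys/fs/cgroup/cgroup.controllers
# typically prints something like: cpuset cpu io memory pids ...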

The “share” value is expressed in milliCPU (1000m equals one full core) and acts as a guaranteed minimum for the container. For instance, a container with a “share” value of 200m is guaranteed at least 20% of a core’s time when other containers are competing for the CPU, and can use up to 100% of the CPU if no other containers need it.

In Kubernetes, you set the “share” value using resources.requests.cpu. When you specify this value, the Kubernetes scheduler places your pod on a node with enough unreserved CPU to satisfy the request. From that point on, the CFS on that node ensures that the pod receives at least the amount of CPU it requested.
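As a minimal sketch (the name and image are illustrative), a pod that requests 200m looks like this. Under the hood, the kubelet translates the millicore value into cgroup CPU shares (roughly milliCPU × 1024 / 1000, so 200m ends up as about 204 shares):

apiVersion: v1
kind: Pod
metadata:
  name: share-demo            # illustrative name
spec:
  containers:
    - name: app
      image: nginx            # any image works for this example
      resources:
        requests:
          cpu: 200m           # guaranteed minimum; no limit is set here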

Bad Pod

Before we move on, I want to clear up a common misconception about the necessity of CPU limits. Some believe that setting CPU limits is what prevents a single workload from using up all the CPU time and starving other pods. That is not accurate: as long as a pod requests CPU, it will always receive at least the requested amount when it needs it, so other pods are still guaranteed the resources they asked for. But you don’t have to take my word for it; you can read about it in the Cgroups documentation:

Cgroups can be guaranteed a minimum number of "CPU shares"
when a system is busy. This does not limit a cgroup's CPU
usage if the CPUs are not busy.

Let’s run a quick demo using stress-ng, a stress-testing tool for Linux that is perfect for our purposes because it can simulate heavy CPU-bound workloads.

  1. Launch a new Ubuntu instance (such as a free-tier t2.micro instance) and SSH/SSM into it.
  2. Install stress-ng by running the following command:
sudo apt update && sudo apt install stress-ng

3. Create our first Cgroup cg1 using the systemd-run command:

sudo systemd-run --unit=cg1 -p "CPUShares=300" stress-ng --cpu 1

This command creates a new Cgroup named cg1 with a CPU share value of 300 (roughly 300m in Kubernetes terms) and runs stress-ng with a single CPU-bound worker.

4. Monitor the resource usage of the Cgroups using sudo systemd-cgtop.
systemd-cgtop displays a real-time view of the resource usage of all the Cgroups on the system.
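The output should look roughly like this (illustrative values, trimmed columns; your numbers will differ):

Control Group                    Tasks   %CPU
system.slice/cg1.service             2   99.7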

As the instance is idle, cg1 can use 100% of the CPU.

5. Create a second Cgroup by running the following command:

sudo systemd-run --unit=cg2 -p "CPUShares=700" stress-ng --cpu 1

6. Run sudo systemd-cgtop again.
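This time the output should show a split matching the 300:700 share ratio (again, illustrative and trimmed):

Control Group                    Tasks   %CPU
system.slice/cg1.service             2   30.2
system.slice/cg2.service             2   69.8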

As expected, each service gets CPU time in proportion to its shares: roughly 30% for cg1 and 70% for cg2.

7. When you’re done with the test, stop the stress-ng processes by running the following command:

sudo systemctl stop cg1 cg2

The Pay-Per-Use model

Perhaps you’re wondering why CPU limits are necessary now that you understand how CPU requests work. To shed some light on this matter, let’s explore the Cgroups documentation for further insights:

In Linux 3.2, this controller was extended to provide CPU
“bandwidth” control. If the kernel is configured with
CONFIG_CFS_BANDWIDTH, then within each scheduling period
(defined via a file in the cgroup directory), it is
possible to define an upper limit on the CPU time
allocated to the processes in a cgroup. This upper limit
applies even if there is no other competition for the CPU.

Why would we want to limit CPU time if there’s no competition for resources? In 2009, Google extended the Cgroups CPU subsystem to include CPU bandwidth (limit) control. The primary motivation was predictability: as a cloud vendor with a pay-per-use model, Google needed a way to predict usage and keep customers from consuming more resources than they paid for.

From the Google white paper:

“…In enterprise systems that cater to multiple clients/customers, a customer pays for, and is provisioned with a specific share of CPU resources….In this case CPU bandwidth provisioning could be used directly to constrain the customers usage and provide soft bandwidth to interval guarantees. Such pay-per-use scenarios are frequently seen in cloud systems where service is priced by the required CPU capacity.”

There it is! CPU limits are essential for predictable workload usage in a multi-tenant environment where customers pay for resources based on their usage. By setting CPU limits, you can ensure that customers stay within the resources they have been allocated.

The dark side of CPU limits

The issue with applying CPU limits to workloads that prioritize performance or high CPU utilization is that the limits can be counterproductive.
In these cases, CPU limits end up throttling the workload even when the CPU would otherwise sit idle, causing more harm than good.
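To make the cost of throttling concrete, here is a rough back-of-the-envelope sketch (assuming the default CFS scheduling period of 100ms, which is what Kubernetes uses when it translates a limit into a quota):

limit: 200m  ->  quota: 20ms of CPU time per 100ms period

A single-threaded request that needs 50ms of CPU runs for 20ms, sits throttled for the remaining 80ms of the period, runs another 20ms, is throttled again, and finishes its last 10ms in the third period. That is roughly 210ms of wall-clock latency for 50ms of actual work, even on an otherwise idle node.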

Thankfully, the Cgroups CPU subsystem exposes statistics (through the cpu.stat file) that offer insight into how much a Cgroup’s processes are being throttled; the metrics are listed below, followed by a sample Prometheus query:

  • nr_periods — the number of scheduling (enforcement) periods the Cgroup has gone through. (container_cpu_cfs_periods_total in Prometheus)
  • nr_throttled — the number of periods in which the Cgroup was throttled because it had used up its CPU quota. (container_cpu_cfs_throttled_periods_total in Prometheus)
  • throttled_usec — the total time, in microseconds, the Cgroup has spent throttled. (container_cpu_cfs_throttled_seconds_total in Prometheus)
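As mentioned above, a common way to turn these counters into a throttling signal is to compare throttled periods to total periods over a time window. A hedged sketch (adjust the label selectors to your own setup):

sum(rate(container_cpu_cfs_throttled_periods_total{namespace="my-namespace", pod="my-pod"}[5m]))
  /
sum(rate(container_cpu_cfs_periods_total{namespace="my-namespace", pod="my-pod"}[5m]))

A value close to 1 means the container is being throttled in nearly every period.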

To see these statistics in action, we can return to our demo environment and follow these steps:

  1. Create a Cgroup as before, this time with a CPU limit:
sudo systemd-run --unit=limited-cg -p "CPUShares=300" -p "CPUQuota=50%" stress-ng --cpu 1

2. Run sudo systemd-cgtop

Although the instance is idle and stress-ng wants a full CPU, the Cgroup is limited to 50%.

3. Check the Cgroup CPU statistics by running the following:

sudo cat /sys/fs/cgroup/system.slice/limited-cg.service/cpu.stat
nr_periods 5072
nr_throttled 5071
throttled_usec 250328499

In this example, the Cgroup was throttled in 5,071 of its 5,072 scheduling periods, for a total of roughly 250 seconds. That is 250 seconds during which the Cgroup could have used the otherwise idle CPU.

4. If you want to check the statistics for your real workloads, you can query Prometheus or use the following kubectl command to retrieve the CPU stats:

kubectl exec -it $pod -n $namespace -- cat /sys/fs/cgroup/cpu/cpu.stat
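Note that the path above assumes nodes running cgroup v1. On clusters whose nodes use cgroup v2, the counters live in a unified cpu.stat file, so the equivalent command would look roughly like this:

kubectl exec -it $pod -n $namespace -- cat /sys/fs/cgroup/cpu.stat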

Let me know in the comments how surprised you were :)

Unravel the mystery

Kubernetes is a complicated system, and managing resources can be tricky. CPU limits are just one aspect of this complex web, and what works for one workload may not be suitable for another.
After all, we wouldn’t have a job if managing Kubernetes were as easy as following a 5-minute blog post.

Therefore, investing time in learning how things work and what works best for your specific workload is crucial. Collecting data and monitoring your changes is also critical to ensure optimal performance.
Ultimately, you are responsible for your workload performance, and it’s up to you to make informed decisions based on your needs and requirements.
