A Guide to Kubernetes Application Resource Tuning — part 1

V. Sevel
7 min read · Jan 4, 2023

This article aims to provide a good understanding of container resource sizing in Kubernetes.

This is the first of 3 parts:

  • Part 1 (this article): from bare metal and VMs to containers
  • Part 2: Resources in Kubernetes
  • Part 3: Sizing methodology and best practices

From bare metal to virtualization

Running multiple workloads on bare metal requires sharing resources between the different processes on a given host. Processes are not protected from each other (for CPU, fairness depends on the OS scheduler).

If a process decides at some point to go through a lot of processing, it will increase its usage of, say, memory and CPU, possibly starving the other processes running on the same host. This might impact these other processes and stop them from meeting their service level objectives. Conversely, the application requiring more resources might fail to absorb its peak of activity if the other applications running on the host are themselves in a phase where they need additional resources.

Several strategies can be used to limit the impact or probability of this happening:

  • Isolation: move workload that needs a higher level of guarantee, or is likely to be a bad neighbor, onto dedicated hosts.
  • Carefully and progressively add new workload, making sure “it fits in”.
  • Capacity planning: rely on observability (monitoring and metrics) to figure out if a VM has space left for new workload. If not, optionally offload some workload to other hosts.

One aspect that has worked well with traditional workload is that it is static: it is either installed on the host, or not. Obviously, usage can be dynamic, but new workload does not appear suddenly; it is the result of provisioning. This makes it somewhat more manageable and predictable than if workloads came and went as they pleased.

But one fundamental characteristic of running workload on bare metal is that unused resources (such as memory and CPU) are wasted. As the industry shifted to virtualization, people quickly realized that, besides agility, running workload on VMs provided a way to not waste those resources, through overcommitting. However, there are 2 challenges with this:

  • The virtualization service provider may not know the type of workload that is going to run on its ESX hosts. As a result, it will be difficult to figure out whether it can be aggressive on the overcommitting ratio or should be conservative.
  • The VM owner has an expectation on the resources given by the provider, and is not necessarily aware of the use, the level, or even the impact of the overcommitting ratio configured on the virtualization cluster that its VM is running on.

The only guarantee that virtualization service providers will offer is that a VM will not be able to go over the number of vCPUs it is configured for. But unless there is a 1:1 ratio, it is not guaranteed to always be able to use all of these vCPUs at any point in time (this will depend on the behavior of the other VMs sharing the same underlying physical host).

Of course, VM owners and providers are encouraged to collaborate, to make sure there is a good mutual understanding. However, the expectation of the VM owner on one side, and the partial workload knowledge of the provider on the other, will usually end up in conservative choices, such as a low overcommitting ratio. As a result, resources may stay idle, leading to inefficiencies and most likely suboptimal investments (e.g. paying for cores that you don’t use).

Still, overcommitting at the virtualization layer should be an implementation detail, whose responsibility lies in the hands of the infrastructure team, which will treat its cluster as a whole, trying to maximize resource efficiency to lower the cost of the vCPUs, but with a primary focus on making the requested resources available (i.e. near-guaranteed resources). The corollary is that virtualization overcommit should not be used as a substitute for proper capacity management at the application level.

Running workload in containers

Traditional workload offers little protection in terms of resource access, or ability to limit noisy neighbors. Container technology was introduced more than 10 years ago, and offered significant improvements by exposing some features available in Linux cgroups.

Container runtimes offer the added ability to configure resource access and limit options. For instance, in docker: -m for memory, and --cpus, --cpu-period, --cpu-quota and --cpu-shares for CPU.

The -m memory parameter is a limit that the container cannot go over. Since memory is not a compressible resource, if the container tries to go over, it will trigger the kernel OOM killer and get killed.
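
As an illustration (a minimal sketch; the stress-ng flags and sizes are assumptions chosen for the demo, and --memory-swap is set to the same value so the excess cannot spill into swap), a container limited to 256 MiB can be asked to allocate more than that:

docker run -it --rm -m 256m --memory-swap=256m alexeiled/stress-ng --vm 1 --vm-bytes 512M --timeout 30s --metrics-brief

The vm worker should get killed by the OOM killer as soon as it touches more memory than the limit allows.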

The CPU shares value defines a proportional weight for host CPU access, relative to the other containers. The default value in docker is 1024, but it is important to note that this is not equal to a number of millicores. Let’s take an example to illustrate:

Start by launching docker stats to observe resource consumption:

docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

Assuming a docker host with 2 cores, let’s launch a stress test that will consume both cores for 60 seconds:

docker run -it --rm alexeiled/stress-ng --cpu 2 --timeout 60s --metrics-brief

While executing, the stats show:

NAME              CPU %     MEM USAGE / LIMIT       MEM %
sleepy_torvalds   202.42%   7.059MiB / 7.776GiB     0.09%

Now let’s run the same container with CPU shares set to 150, and another container with 50:

docker run -it --rm -d --cpu-shares=50 --name=container_50 alexeiled/stress-ng --cpu 2 --timeout 60s --metrics-brief
docker run -it --rm -d --cpu-shares=150 --name=container_150 alexeiled/stress-ng --cpu 2 --timeout 60s --metrics-brief

We can see that the 50-shares container is using 25% of the total capacity (i.e. 50/200 shares), and the other container uses the remaining 75%. The total is 200% because we are using a 2-core docker host:

NAME            CPU %     MEM USAGE / LIMIT       MEM %
container_50    49.94%    8.988MiB / 7.776GiB     0.11%
container_150   151.81%   8.992MiB / 7.776GiB     0.11%

Without waiting for the 2 containers to stop, launch a third container with CPU shares equal to 300:

docker run -it --rm -d --cpu-shares=300 --name=container_300 alexeiled/stress-ng --cpu 2 --timeout 60s --metrics-brief

Stats now show:

NAME            CPU %     MEM USAGE / LIMIT       MEM %
container_50    19.99%    8.984MiB / 7.776GiB     0.11%
container_150   61.01%    8.965MiB / 7.776GiB     0.11%
container_300   122.93%   8.969MiB / 7.776GiB     0.11%

We can see that the first 2 containers were readjusted to consume a fraction of the total number of shares defined across all containers:

  • 50/500: 10% of total capacity (i.e. 2 cores)
  • 150/500: 30%
  • 300/500: 60%

The results would have been the same if shares had been 500, 1500 and 3000, instead of 50, 150, and 300. So the absolute values do not matter by themselves; what matters is how they relate to one another.

CPU shares are used when the system is under contention (i.e. processes ask for more than the total capacity). Without contention, processes are free to use whatever capacity is available on the host. As a result, CPU shares cannot be used to limit access to CPU, but only to guarantee some access to it.
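
This is easy to verify (a sketch reusing the same image, with an arbitrarily low shares value): run a single container with nothing else competing for the CPU:

docker run -it --rm --cpu-shares=50 --name=container_alone alexeiled/stress-ng --cpu 2 --timeout 60s --metrics-brief

docker stats should report close to 200% CPU for container_alone despite its low weight, since shares only matter under contention.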

Recognizing this as an issue, Paul Turner, Bharata B. Rao and Nikhil Rao introduced CPU bandwidth control for CFS at the 2010 Linux Symposium, as a means to limit access to CPU for processes: the CPU quota and CPU period.

The CPU period is by default 100000 microseconds (i.e. 100 ms). The CPU quota defines the number of microseconds per CPU period that the container is limited to. In docker, --cpu-period="100000" --cpu-quota="150000" means that the process is limited to 1.5 CPUs, which can be expressed more conveniently with --cpus="1.5" since docker 1.13.
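
For example, the following two commands should configure the same 1.5-CPU cap (reusing the stress image from the previous runs for illustration):

docker run -it --rm --cpu-period=100000 --cpu-quota=150000 alexeiled/stress-ng --cpu 2 --timeout 60s --metrics-brief
docker run -it --rm --cpus=1.5 alexeiled/stress-ng --cpu 2 --timeout 60s --metrics-brief

In both cases, docker stats should report roughly 150% CPU on our 2-core host.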

Now run the first 2 containers again as before, and add a limit of 900 millicores on the third one:

docker run -it --rm -d --cpu-shares=50 --name=container_50 alexeiled/stress-ng --cpu 2 --timeout 60s --metrics-brief
docker run -it --rm -d --cpu-shares=150 --name=container_150 alexeiled/stress-ng --cpu 2 --timeout 60s --metrics-brief
docker run -it --rm -d --cpu-shares=300 --cpus=0.9 --name=container_300 alexeiled/stress-ng --cpu 2 --timeout 60s --metrics-brief

The stats show:

NAME            CPU %     MEM USAGE / LIMIT       MEM %
container_50    26.96%    8.996MiB / 7.776GiB     0.11%
container_150   80.76%    8.973MiB / 7.776GiB     0.11%
container_300   90.58%    8.969MiB / 7.776GiB     0.11%

In terms of millicores, the allocation is:

  • container_50: 27/200*2000 = 270m
  • container_150: 81/200*2000 = 810m
  • container_300: 91/200*2000 = 910m

The first thing we see is that container_300 is effectively limited to roughly 900m. But where do the limits come from for the 2 other containers?

When container_300 reaches its limit, the other containers have received:

  • container_150: half of container_300, which is roughly 450m.
  • container_50: a third of container_150, which is 150m.

So we end up with 900 + 450 + 150 = 1500m being used out of a total capacity of 2000m. This leaves 500m to be shared between container_50 (25%) and container_150 (75%), in proportion to their shares (50 vs 150):

  • container_50: 0.25*500 + 150 = 275m
  • container_150: 0.75*500 + 450 = 825m
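
These numbers can also be cross-checked against what docker wrote into the cgroup filesystem. The paths below assume cgroup v1 with the default cgroupfs driver (they differ with cgroup v2 or the systemd driver), and <container_id> stands for the full id of container_300:

cat /sys/fs/cgroup/cpu/docker/<container_id>/cpu.shares         # expected: 300
cat /sys/fs/cgroup/cpu/docker/<container_id>/cpu.cfs_quota_us   # expected: 90000 (i.e. --cpus=0.9)
cat /sys/fs/cgroup/cpu/docker/<container_id>/cpu.cfs_period_us  # expected: 100000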

As we can see, the container runtime engine provides options to guarantee (through the CPU shares) and limit (through the CPU quota) access to CPU. Guaranteed access does not mean, however, that CPU cycles go to waste if not used. If a container reserves 1 core but stays idle, those CPU cycles will go to a global shared pool, and get redistributed by the CFS to whoever needs them.

These mechanisms offer a huge improvement compared to running workload on VMs, as it is now possible to tune access to resources. They can be used to define lower and upper bounds for CPU usage, and help with different scenarios (a docker translation of each is sketched after the list):

  • Provide dedicated resources, and limit the process to those resources.
  • Provide a minimal amount of guaranteed resources, and define an upper bound allowing the process to go above the minimal amount.
  • Provide no minimal amount, and define an upper bound.
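
Expressed with the docker flags seen above (a sketch; the image name my-app and the chosen values are purely illustrative):

docker run -d --cpu-shares=1024 --cpus=1 my-app    # scenario 1: a weight matching the cap, hard limited to 1 core
docker run -d --cpu-shares=512 --cpus=2 my-app     # scenario 2: a guaranteed weight, with an upper bound above it
docker run -d --cpus=0.5 my-app                    # scenario 3: no guaranteed weight, only an upper bound

Kubernetes builds on these same primitives, which part 2 will cover.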

This brings a lot of flexibility. However, running static containers on hosts can be tedious (e.g. translating CPU shares into millicores), and workload cannot be moved easily. For that reason, the industry started working on container orchestrators, to allow running containers at scale, while providing resource efficiency and ease of resource configuration for the individual workloads.

See also: Docker CPU resource limits.

This is the end of part 1. Continue reading with upcoming parts 2 (Resources in Kubernetes) and 3 (Sizing methodology and best practices).
