Setting the right requests and limits in Kubernetes
September 2020
TL;DR: In Kubernetes, resource constraints are used to schedule the Pod on the right node, and they also affect which Pod is killed or starved at times of high load. In this article, you will explore setting resource limits for a Flask web service automatically using the Vertical Pod Autoscaler and the metrics server.
There are two different types of resource configurations that can be set on each container of a pod.
They are requests and limits.
Requests define the minimum amount of resources that containers need.
If you think that your app requires at least 256MB of memory to operate, this is the request value.
The application can use more than 256MB, but Kubernetes guarantees a minimum of 256MB to the container.
On the other hand, limits define the max amount of resources that the container can consume.
Your application might require at least 256MB of memory, but you might want to be sure that it doesn't consume more than 1GB of memory.
That's your limit.
Notice how your application has 256MB of memory guaranteed, but it can grow up until 1GB of memory.
After that, it is stopped or throttled by Kubernetes.
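As a rough illustration, the 256MB request and 1GB limit described above could appear in the `resources` section of a container spec. The sketch below builds that section as a plain Python dictionary and prints it as JSON. The container name and image are hypothetical placeholders, and writing the values in binary units (256Mi and 1Gi) is an assumption.

```python
import json

# Hypothetical container spec: the name and image are placeholders for illustration.
container = {
    "name": "app",
    "image": "example/app:1.0",
    "resources": {
        # Guaranteed minimum: this much memory is reserved for the container.
        "requests": {"memory": "256Mi"},
        # Hard ceiling: the container cannot use more memory than this.
        "limits": {"memory": "1Gi"},
    },
}

print(json.dumps(container, indent=2))
```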
Setting limits is useful to stop over-committing resources and protect other deployments from resource starvation.
You might want to prevent a single rogue app from using all resources available and leaving only breadcrumbs to the rest of the cluster.
If limits are used to stop your greedy containers, what are requests for?
Requests affect how the pods are scheduled in Kubernetes.
When a Pod is created, the scheduler finds the nodes which can accommodate the Pod.
But how does it know how much CPU and memory is needed?
The app hasn't started yet, and the scheduler can't inspect memory and CPU usage at this point.
This is where requests come in.
The scheduler reads the requests for each container in your Pods, aggregates them and finds the best node that can fit that Pod.
Some applications might use more memory than CPU.
Others the opposite.
It doesn't matter, Kubernetes checks the requests and finds the best Node for that Pod.
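The filtering step described above can be sketched in a few lines of Python. This is a simplified model, not the scheduler's actual implementation: the node names and capacities are made up, and the real scheduler also scores the feasible nodes and takes many other constraints into account.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_m: int      # unreserved CPU in millicores
    free_memory_mi: int  # unreserved memory in Mi


def pod_requests(containers):
    """Sum the CPU and memory requests of every container in the Pod."""
    cpu = sum(c["cpu_m"] for c in containers)
    memory = sum(c["memory_mi"] for c in containers)
    return cpu, memory


def feasible_nodes(containers, nodes):
    """Return the names of the nodes with enough unreserved capacity for the Pod."""
    cpu, memory = pod_requests(containers)
    return [n.name for n in nodes if n.free_cpu_m >= cpu and n.free_memory_mi >= memory]


# Hypothetical Pod with two containers and three hypothetical nodes.
pod = [{"cpu_m": 50, "memory_mi": 50}, {"cpu_m": 200, "memory_mi": 300}]
nodes = [Node("node-a", 100, 4096), Node("node-b", 500, 256), Node("node-c", 1000, 2048)]

print(pod_requests(pod))           # (250, 350)
print(feasible_nodes(pod, nodes))  # ['node-c']
```

Only node-c has both enough CPU and enough memory left, so it is the only candidate in this example.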
You could visualise the Kubernetes scheduler as a skilled Tetris player.
For each block, Kubernetes finds the best Node to optimise your resource utilisation.
CPU and memory requests define the width and height of each block, and based on the size Kubernetes finds the best Tetris board to fit the block.
It's important to always set your requests (width and height of the blocks).
Without those the block has no size, and how does one play Tetris with sizeless blocks?
You could fit an infinite number of blocks in your Tetris board.
And if your Tetris board is a real server, you might end up scheduling unlimited processes.
Of course, processes still have CPU and memory requirements.
So if you don't set requests, you end up overcommitting resources.
Let's play Tetris with Kubernetes with an example.
You can create an interactive busybox pod with CPU and memory requests using the following command:
bash
kubectl run -i --tty --rm busybox \
--image=busybox \
--restart=Never \
--requests='cpu=50m,memory=50Mi' -- sh
What do these numbers actually mean?
Understanding CPU and memory units
Imagine you have a computer with a single CPU and wish to run three containers in it.
You might want to assign a third of CPU each — or 33.33%.
In Kubernetes, the CPU is not assigned in percentages, but in thousandths of a CPU (also called millicores or millicpu).
One CPU is equal to 1000 millicores.
If you wish to assign a third of a CPU, you should assign 333m (millicores) to your container.
Memory is a bit more straightforward, and it is measured in bytes.
Kubernetes accepts both SI notation (k, M, G, T, P, E) and Binary notation (Ki, Mi, Gi, Ti, Pi, Ei) for memory definition.
To limit memory at 256 mebibytes (268,435,456 bytes), you can assign 268.4M (SI notation) or 256Mi (Binary notation).
If you are unsure which notation to use, stick to the Binary notation as it is the one used widely to measure hardware.
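To make the units concrete, here is a minimal sketch that converts CPU and memory quantities into millicores and bytes. The function names are hypothetical, and the parser only covers the plain and suffixed forms mentioned above; it is not a full implementation of the Kubernetes quantity format.

```python
from decimal import Decimal

# Multipliers for the SI and Binary memory suffixes.
SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
}


def cpu_to_millicores(value: str) -> int:
    """Convert a CPU quantity such as '333m' or '0.5' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(Decimal(value) * 1000)


def memory_to_bytes(value: str) -> int:
    """Convert a memory quantity such as '256Mi' or '268.4M' to bytes."""
    # Check the two-letter binary suffixes before the one-letter SI suffixes.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if value.endswith(suffix):
            return int(Decimal(value[: -len(suffix)]) * SUFFIXES[suffix])
    return int(Decimal(value))  # no suffix: plain bytes


print(cpu_to_millicores("333m"))   # 333
print(cpu_to_millicores("0.5"))    # 500
print(memory_to_bytes("256Mi"))    # 268435456
print(memory_to_bytes("268.4M"))   # 268400000
```

The last two lines show that 256Mi and 268.4M describe almost the same amount of memory, with the SI value being a rounded approximation.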
Now that you have created the Pod with resource requests, let's explore the memory and CPU used by a process.
Inspecting and collecting metrics with the metrics server
In the previous example, you launched an idle busybox container.
It's currently using close to zero memory and CPU.
But how do you know for sure?
Is there a component in Kubernetes that measures the actual CPU and memory?
Kubernetes has several components designed to collect metrics, but two are essential in this case:
- The kubelet collects metrics such as CPU and memory from your Pods.
- The metrics server collects and aggregates metrics from all kubelets.
Inspecting the kubelet for metrics isn't convenient — particularly if you run clusters with thousands of nodes.
When you want to know the memory and CPU usage for your pod, you should retrieve the data from the metrics server.
Not all clusters come with metrics server enabled by default. For example, EKS (the managed Kubernetes offering from Amazon Web Services) does not come with a metrics server installed by default.
How can you check the actual CPU and memory usage with the metrics server?
Since the busybox container is idle, let's artificially generate a few metrics.
Let's fill the memory with:
bash
dd if=/dev/zero of=/dev/shm/fill bs=1k count=1024k
And let's increase the CPU with an infinite loop:
bash
while true; do true; done
In another terminal run the following command to inspect the resources used by the pod:
bash
kubectl top pods
NAME CPU(cores) MEMORY(bytes)
busybox 462m 64Mi
From the output you can see that the memory utilised is 64Mi and the total CPU used is 462m.
The kubectl top command consumes the metrics exposed by the metrics server.
Also, notice how the current values for CPU and memory are greater than the requests that you defined earlier (cpu=50m,memory=50Mi).
And that's fine because the Pod can use more memory and CPU than what is defined in the requests.
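If you want to compare usage against requests in a script, a small parser for the `kubectl top pods` text is enough. The sketch below is an illustration only: the function name is hypothetical, and it assumes CPU is always reported in millicores and memory in Mi, as in the output above.

```python
def parse_top(output: str) -> dict:
    """Parse `kubectl top pods` text into {pod_name: (cpu_millicores, memory_mi)}."""
    usage = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        name, cpu, memory = line.split()
        # Assumes values shaped like "462m" and "64Mi".
        usage[name] = (int(cpu[:-1]), int(memory[:-2]))
    return usage


sample = """NAME CPU(cores) MEMORY(bytes)
busybox 462m 64Mi"""

requests = {"busybox": (50, 50)}  # the requests set with kubectl run earlier

for pod, (cpu, memory) in parse_top(sample).items():
    cpu_request, memory_request = requests[pod]
    print(pod, "CPU above request:", cpu > cpu_request,
          "memory above request:", memory > memory_request)
```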
However, why is the container consuming only 462 millicores?
Since the Pod is running an infinite loop, you might expect it to consume 100% of the available CPU (or 1000 millicores).
Why is it not running at 100% CPU?
When you define a CPU request in Kubernetes, that doesn't only describe the minimum amount of CPU but also establishes a share of CPU for that container.
All containers share the same CPU, but they are nice to each other, and they split the CPU time based on their shares.
Let's have a look at an example.
Imagine having three containers that have a CPU request set to 60 millicores, 20 millicores and 20 millicores.
The total request is only 100 millicores, but what happens when all three processes start using as much CPU as possible (i.e. 100%)?
If you have a single CPU, the processes will grow to 600 millicores, 200 millicores and 200 millicores (i.e. 60%, 20%, 20%).
All of them increased by a factor of 10 until they used all the available CPU.
If you have 2 CPUs (or 2000 millicores), they will use 1200 millicores, 400 millicores and 400 millicores (i.e. 60%, 20%, 20%).
As they compete for resources, they are careful to divide the CPU based on the shares assigned.
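The arithmetic above can be sketched as a proportional split. The function name is hypothetical, and the model assumes every container is trying to use as much CPU as it can and has enough threads to do so.

```python
def split_cpu(requests_m, capacity_m):
    """Divide capacity_m millicores among busy containers in proportion to their requests."""
    total = sum(requests_m)
    return [capacity_m * request // total for request in requests_m]


print(split_cpu([60, 20, 20], 1000))  # [600, 200, 200]
print(split_cpu([60, 20, 20], 2000))  # [1200, 400, 400]
```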
In the previous example, the Pod is consuming 462 millicores because it has to compete for CPU time with the rest of the processes in the cluster such as the kubelet, the API server, the controller manager, etc.
Let's have a look at another example to understand CPU shares better.
CPU requests and CPU shares
Please notice that the following example is executed on a system with 2 vCPUs.
To see the number of cores in your system, you can use:
bash
docker info | grep CPUs
Now, let's run a container that consumes all available CPU and assign it a CPU share of 1024.
bash
docker run -d --rm --name stresser-1024 \
--cpu-shares 1024 \
containerstack/cpustress --cpu 2
The containerstack/cpustress image is engineered to consume all available CPU, but it has to know how many CPUs are currently available (in this case only 2, hence --cpu 2).
The command uses a few flags:
- --rm to delete the container once it's stopped.
- --name to assign a friendly name to the container.
- -d to run the container in the background as a daemon.
- --cpu-shares to define the weight of the container.
You can run docker stats to see the resources utilised by the container:
bash
docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM %
446bde82ad8a stresser-1024 198.01% 4.562MiB / 3.848GiB 0.12%
The container is using 198% of the available CPU — all of it considering that you have only 2 cores available.
But how can the CPU usage be more than 100%?
Here the CPU percentage is the sum of the percentage per core.
If you are running the same example in a 6 vCPU machine, it might be around 590%.
Let's create another container with CPU share of 2048.
bash
docker run -d --rm --name stresser-2048 \
--cpu-shares 2048 \
containerstack/cpustress --cpu 2
Is there enough CPU to run a second container?
You should inspect the container and check.
bash
docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM %
270ac57e5cbf stresser-2048 133.27% 4.605MiB / 3.848GiB 0.12%
446bde82ad8a stresser-1024 66.66% 4.562MiB / 3.848GiB 0.12%
The docker stats command shows that the stresser-2048 container uses 133% of CPU, and the stresser-1024 container uses 66%.
When two containers are running in a 2 vCPU node, the stresser-2048 container gets twice the share of the available CPU.
The two containers are assigned 133.27% and 66.66% share of the available CPU, respectively.
In other words, processes are assigned CPU shares, and when they compete for CPU time, they compare their shares and increase their usage accordingly.
Can you guess what happens when you launch a third container that is as CPU hungry as the first two combined?
bash
docker run -d --name stresser-3072 \
--cpu-shares 3072 \
containerstack/cpustress --cpu 2
Let's have a look at the metrics:
bash
docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM %
270ac57e5cbf stresser-3072 101.17% 4.605MiB / 3.848GiB 0.12%
446bde82ad8a stresser-2048 66.31% 4.562MiB / 3.848GiB 0.12%
e5cbfs82270a stresser-1024 32.98% 4.602MiB / 3.848GiB 0.12%
The third container is using close to 100% CPU, whereas the other two use ~66% and ~33%.
Since all containers want to use all available CPU, they will divide the 2 CPU cores available according to their shares (3072, 2048, and 1024).
So the total is 6144 shares, and with 200% of CPU to divide, each share is worth about 0.0326% CPU.
So the CPU time is divided as follows:
- 1024 shares (or 33.33% CPU) to the first container.
- 2048 shares (or 66.66% CPU) to the second container.
- 3072 shares (or 100% CPU) to the third container.
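The same split can be checked with a short calculation. The sketch below also shows how a Kubernetes CPU request could translate into a share value, assuming the commonly documented conversion on cgroup v1 nodes, where 1000 millicores map to 1024 shares. Both function names are hypothetical, and the conversion is an assumption rather than something shown in this article.

```python
def cpu_percent_by_shares(shares, cores):
    """Split `cores` worth of CPU (100% per core) in proportion to the shares."""
    total = sum(shares)
    return [round(100 * cores * s / total, 2) for s in shares]


def millicores_to_shares(millicores: int) -> int:
    """Assumed conversion from a CPU request to cgroup v1 shares: 1000m -> 1024 shares."""
    return max(2, millicores * 1024 // 1000)  # 2 is assumed to be the smallest valid value


print(cpu_percent_by_shares([1024, 2048, 3072], cores=2))  # [33.33, 66.67, 100.0]
print(millicores_to_shares(1000))  # 1024
print(millicores_to_shares(50))    # 51
```

Under that assumption, the 50m request from the busybox example would correspond to a weight of about 51 shares.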
Now that you're familiar with CPU and memory requests, let's have a look at limits.
Memory and CPU limits
Limits define the hard limit for the container and make sure the process doesn't consume all resources in the Node.
Let's imagine you have an application with a limit of 250Mi of memory.
When the application uses more than the limit, Kubernetes kills the process with an OOMKilling (Out of Memory Killing) message.
In other words, nothing prevents the process from trying to allocate more memory, and it could cross the threshold of 250Mi.
However, as soon as that happens, the process is killed.
Now that you know what happens with memory limits, let's have a look at CPU limits.
Is the Pod killed when it's using more CPU than the limit?
No, it's not.
In reality, CPU is measured as a function of time.
When you say 1 CPU limit, what you really mean is that the app runs up to 1 CPU second, every second.
If your application has a single thread, you will consume at most 1 CPU second every second.
However, if your application uses two threads, it is twice as fast, and it can complete the work in half of the time.
Also, the CPU quota is used in half of the time.
If you have two threads, you can consume 1 CPU second in 0.5 seconds.
Eight threads can consume 1 CPU second in 0.125 seconds.
What happens for the remaining 0.875 seconds?
Your process is throttled: it has to wait for the next period before it can use the CPU again.
In the following scenario there are three processes with 1, 2 and 8 threads.
The single thread process consumes 1 CPU second every second.
The process with two threads consumes the same quota of 1 CPU second in half of the time.
The process with eight threads consumes the available quota in 1/8 of the time.
In the next second, the quota is allocated, and the processes can consume the new allocation.
Notice how the last process is frequently throttled as it consumes its allocation too quickly.
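The scenario can be reproduced with a small calculation. This is a simplified model with a hypothetical function name: it uses a one-second period to match the explanation above, whereas the Linux scheduler typically enforces the quota over much shorter periods (100ms by default), and it assumes every thread is busy all the time.

```python
def throttle_profile(limit_cpus: float, threads: int, period_s: float = 1.0):
    """Return (running, throttled) seconds per period for a fully busy process."""
    quota = limit_cpus * period_s              # CPU seconds available in each period
    running = min(period_s, quota / threads)   # all threads burn the quota in parallel
    return running, period_s - running


for threads in (1, 2, 8):
    running, throttled = throttle_profile(limit_cpus=1, threads=threads)
    print(f"{threads} thread(s): runs {running}s, throttled {throttled}s")
```

With a limit of 1 CPU, the single-threaded process is never throttled, while the process with eight threads runs for 0.125 seconds and waits for the remaining 0.875 seconds.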
Let's revisit the example discussed earlier to understand how CPU limits differ from requests.
Now, let's run the same cpustress image with half a CPU.
You can set a CPU limit with the --cpus flag.
bash
docker run --rm -d --name stresser-.5 \
--cpus .5 \
containerstack/cpustress --cpu 2
Run docker stats to inspect the CPU usage:
bash
docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM %
c445bbdb46aa stresser-.5 49.33% 4.672MiB / 3.848GiB 0.12%
The container only uses half a CPU core.
Of course, that's the limit.
Let's repeat the experiment with a full CPU:
bash
docker run --rm -d --name stresser-1 \
--cpus 1 \
containerstack/cpustress --cpu 2
Run docker stats to inspect the CPU usage:
bash
docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM %
9c64c2d99be6 stresser-1 105.34% 4.648MiB / 3.848GiB 0.12%
c445bbdb46aa stresser-.5 51.25% 4.609MiB / 3.848GiB 0.12%
Unlike CPU requests, the limits of one container do not affect the CPU usage of other containers.
That's precisely what happens in Kubernetes as well.
Defining the CPU limit sets a max on how much CPU a process can use.
Please notice that setting limits doesn't make the container see only the defined amount of memory or CPU.
The container can see all of the resources of the node.
If the application sizes its memory usage or its number of threads from the resources it sees, it can lead to a fatal issue.
One such example is when you set the memory limits for a container running a Java application, and the JVM uses the amount of memory in the node to set the heap size.
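One way for an application to avoid this trap is to read the limit enforced on the container instead of the memory of the node. The sketch below is a minimal illustration, assuming the usual cgroup file locations inside a Linux container (cgroup v2 first, then cgroup v1). The function names and the 75% fraction are hypothetical choices, and on cgroup v1 an unlimited container reports a very large number rather than no value.

```python
from pathlib import Path

# Assumed locations of the memory limit inside a container: cgroup v2, then cgroup v1.
CGROUP_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def container_memory_limit(files=CGROUP_FILES):
    """Return the cgroup memory limit in bytes, or None if no numeric limit is found."""
    for path in files:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue  # file missing or unreadable: try the next location
        if raw.isdigit():  # cgroup v2 contains "max" when there is no limit
            return int(raw)
    return None


def heap_budget(limit_bytes, fraction=0.75):
    """Size a heap or cache as a fraction of the limit, leaving room for other memory."""
    return None if limit_bytes is None else int(limit_bytes * fraction)


limit = container_memory_limit()
print("memory limit:", limit, "suggested heap:", heap_budget(limit))
```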
Now that you understand how requests and limits work, it's time to put them in practice.
How do you find the right values for CPU and memory requests and limits?
Let's explore the CPU and memory used by a real app.
Limits and requests in practice
You will use a simple cache service which has two endpoints, one to cache the data and another for retrieving it.
The service is written in Python using the Flask framework.
You can find the complete code for this application here.
Before you start, make sure that your cluster has the metrics server installed.
If you're using minikube, you can enable the metrics server with:
bash
minikube addons enable metrics-server
You might also need an Ingress controller to route the traffic to the app.
In minikube, you can enable the ingress-nginx controller with:
bash
minikube addons enable ingress
You can verify that the ingress and metrics servers are installed correctly with:
bash
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS
kube-system coredns-66bff467f8-nclrr 1/1 Running
kube-system etcd-minikube 1/1 Running
kube-system ingress-nginx-controller-69ccf5d9d8-n6lqp 1/1 Running
kube-system kube-apiserver-minikube 1/1 Running
kube-system kube-controller-manager-minikube 1/1 Running
kube-system kube-proxy-cvkcg 1/1 Running
kube-system kube-scheduler-minikube 1/1 Running
kube-system metrics-server-7bc6d75975-54twv 1/1 Running
It's time to deploy the application.
You can use the following YAML file:
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-cache
spec:
  replicas: 1
  selector:
    matchLabels:
      name: flask-cache
  template:
    metadata:
      labels:
        name: flask-cache
    spec:
      containers:
        - name: cache-service
          image: xasag94215/flask-cache
          ports:
            - containerPort: 5000
              name: rest
---
apiVersion: v1
kind: Service
metadata:
  name: flask-cache
spec:
  selector:
    name: flask-cache
  ports:
    - port: 80
      targetPort: 5000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flask-cache
spec:
  rules:
    - http:
        paths:
          - backend:
              service:
                name: flask-cache
                port:
                  number: 80
            path: /
            pathType: Prefix
You might recognise the three components:
- The Deployment definition with a Pod template.
- A Service to route traffic to the Pods.
- An Ingress manifest to route external traffic to the Pods.
You can submit the resources with:
bash
kubectl apply -f deployment.yaml
If the metrics server is installed correctly, you should be able to inspect the memory and CPU consumption for the Pod with:
bash
kubectl top pods
NAME CPU(cores) MEMORY(bytes)
flask-cache-85b94f6865-tvbg8 6m 150Mi
Please notice that the container does not define requests or limits for CPU or memory at the moment.
You can finally access the app by visiting the cluster IP address:
bash
minikube ip
Open your browser on http://<minikube ip> and you should be greeted by the running application.
Now that you have the application running, it's time to find the right value for requests and limits.
But before you dive into the tooling needed, let's lay down the plan.
A plan for finding the right requests and limits
Requests and limits depend on how much memory and CPU the application uses.
Those values are also affected by how the application is used.
An application that serves static pages might have mostly static memory and CPU usage.
However, an application that stores documents in the database might behave differently as more traffic is ingested.
The best way to decide requests and limits for an application is to observe its behaviour at runtime.
So you will need:
- A mechanism to programmatically generate traffic for your application.
- A mechanism to collect metrics and decide how to derive requests and limits for CPU and memory.
Let's start with generating the traffic.
Generating traffic with Locust
There are many tools available for load testing apps such as ab, k6, BlazeMeter etc.
In this tutorial, you will use Locust — an open-source load testing tool.
Locust includes a convenient dashboard where you can inspect the traffic generated as well as see the performance of your app in real-time.
In Locust, you can generate traffic by writing Python scripts.
Writing code is ideal in this case because you can simulate calls to the cache service and create and retrieve the cached value from the app.
The following script does just that:
load_test.py
from locust import HttpUser, task, constant
import json
import uuid
import random


class cacheService(HttpUser):
    wait_time = constant(1)  # each simulated user waits 1 second between tasks
    ids = []  # ids of the cached entries, shared by all simulated users

    @task
    def create(self):
        # Store a random username in the cache and remember the returned id.
        id = uuid.uuid4()
        payload = {"username": str(id)}
        headers = {'content-type': 'application/json'}
        resp = self.client.post("/cache/new", data=json.dumps(payload), headers=headers)
        if resp.status_code == 200:
            out = resp.json()
            cache_id = out["_id"]
            self.ids.append(cache_id)

    @task
    def get(self):
        # Read back a random entry, or create one first if nothing is cached yet.
        if len(self.ids) == 0:
            self.create()
        else:
            rid = random.choice(self.ids)
            self.client.get(f"/cache/{rid}")
Even if you're not proficient in Python, you might recognise the two blocks that start with @task:
- The first block creates an entry in the cache.
- The second block retrieves an entry from the cache using its id.
The load testing script executed by Locust will write and retrieve items from the Flask service using this code.
If you save the file locally, you can start Locust as a container with:
bash
docker run -p 8089:8089 \
-v $PWD:/mnt/locust \
locustio/locust -f /mnt/locust/load_test.py
When it starts, the container binds on port 8089 on your computer.
You can open your browser on http://localhost:8089 to access the web interface.
It's time to start the first test!
You should simulate 1000 users with a hatch rate of 10.
As the URL of the app, you should use the same URL that was exposed by the cluster.
If you forgot, you could retrieve the IP address of the cluster with:
bash
minikube ip
The host field should be http://<minikube ip>.
Click on start and switch over to the graph section.
The real-time graph shows the requests per second received by the app, as well as failure rate, response codes, etc.
Now that you have a mechanism to generate load, it's time to take a look at the application.
Have the CPU and memory usage increased?
Let's have a look:
bash
kubectl top pods
NAME CPU(cores) MEMORY(bytes)
flask-cache-79bb7c7d79-lpqm5 461m 182Mi
The application is under load, and it's using CPU and memory to respond to the traffic.
The app doesn't have requests and limits yet.
Is there a way to collect those metrics and use them to compute a value for requests and limits?
Analysing requests and limits for running apps automatically
It's common to have a metrics server and a database to store your metrics.
If you can collect all of the metrics in a database, you could take the average, max and min of the CPU and memory and extrapolate requests and limits.
You could then use those values in your containers.
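As a rough sketch of that manual approach, the function below derives a request and a limit from a list of usage samples. The heuristic (median for the request, peak plus 20% headroom for the limit), the function name and the sample values are all made up for illustration; this is not the model used by any Kubernetes component.

```python
import statistics


def suggest_resources(samples, headroom=1.2):
    """Derive a request and a limit from usage samples with a naive heuristic."""
    request = statistics.median(samples)  # typical usage
    limit = max(samples) * headroom       # peak usage plus a safety margin
    return {"request": round(request), "limit": round(limit)}


# Hypothetical CPU samples in millicores collected while the app was under load.
cpu_samples_m = [6, 120, 380, 410, 461, 455, 430]
print(suggest_resources(cpu_samples_m))
```

A heuristic like this is only as good as the samples it receives, so it needs data collected over a representative period of traffic.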
But there's a quicker way.
SIG Autoscaling (the group in charge of looking after the autoscaling part of Kubernetes) developed a tool that can do that automatically: the Vertical Pod Autoscaler (VPA).
The Vertical Pod Autoscaler is a component that you install in the cluster and that estimates the correct requests and limits for a Pod.
In other words, you don't have to come up with an algorithm to extrapolate limits and requests.
The Vertical Pod Autoscaler applies a statistical model to the data collected by the metrics server.
So as long as you have:
- Traffic hitting the application
- A metrics server installed and
- The Vertical Pod Autoscaler (VPA) installed in your cluster
You don't need to come up with requests and limits for CPU and memory.
The Vertical Pod Autoscaler (VPA) does that for you!
Let's have a look at how it works.
First, you should install the Vertical Pod Autoscaler.
You can download the code from the official repository.
bash
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
You can install the autoscaler in your cluster with the following command:
bash
./hack/vpa-up.sh
The script creates several resources in Kubernetes, but, more importantly, creates a Custom Resource Definition (CRD).
The new Custom Resource Definition (CRD) is called VerticalPodAutoscaler, and you can use it to track your Deployments.
So if you want the Vertical Pod Autoscaler (VPA) to estimate limits and requests for your Flask app, you should create the following YAML file:
vpa.yaml
apiVersion: "autoscaling.k8s.io/v1beta2"
kind: VerticalPodAutoscaler
metadata:
  name: flask-cache
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: flask-cache
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 10m
          memory: 50Mi
        maxAllowed:
          cpu: 1
          memory: 500Mi
        controlledResources: ["cpu", "memory"]
You can submit the resource to the cluster with:
bash
kubectl apply -f vpa.yaml
It might take a few minutes before the Vertical Pod Autoscaler (VPA) can predict values for your Deployment.
Once it's ready you can query the vpa object with:
bash
kubectl describe vpa flask-cache
# more output
Status:
  Conditions:
    Last Transition Time: 2020-09-01T06:52:21Z
    Status: True
    Type: RecommendationProvided
  Recommendation:
    Container Recommendations:
      Container Name: cache-service
      Lower Bound:
        Cpu: 25m
        Memory: 60194k
      Target:
        Cpu: 410m
        Memory: 262144k
      Uncapped Target:
        Cpu: 410m
        Memory: 262144k
      Upper Bound:
        Cpu: 1
        Memory: 500Mi
In the lower part of the output, the autoscaler has four sections:
- Lower Bound: the minimum resource recommended for the container.
- Target: the resource recommended for the container, within minAllowed and maxAllowed.
- Uncapped Target: the target resource recommended if minAllowed and maxAllowed are not set.
- Upper Bound: the maximum resource recommended for the container.
In this case, the recommended numbers are a bit skewed to the lower end because you haven't load tested the app for a sustained period.
You can repeat the experiment with Locust and keep inspecting the Vertical Pod Autoscaler (VPA) recommendation.
Once the recommendations are stable, you can apply them back to your deployment.
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-cache
spec:
  replicas: 1
  selector:
    matchLabels:
      name: flask-cache
  template:
    metadata:
      labels:
        name: flask-cache
    spec:
      containers:
        - name: cache-service
          image: xasag94215/flask-cache
          ports:
            - containerPort: 5000
              name: rest
          resources:
            requests:
              cpu: 25m
              memory: 64Mi
            limits:
              cpu: 410m
              memory: 512Mi
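You can use the Lower Bound as your requests and the Upper Bound as your limits.
In the manifest above, the memory values are rounded to 64Mi and 512Mi, and the CPU limit uses the more conservative Target of 410m rather than the Upper Bound of 1 CPU.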
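The mapping from a recommendation to a `resources` block can be sketched in Python. The example assumes the object was exported as JSON (for example with `kubectl get vpa flask-cache -o json`) and that the recommendation is found under `status.recommendation.containerRecommendations` with `lowerBound` and `upperBound` fields. Treat that structure and the function name as assumptions to verify against your cluster; the values are copied from the output above.

```python
import json


def resources_from_recommendation(vpa: dict, container: str) -> dict:
    """Use the lower bound as requests and the upper bound as limits."""
    recommendations = vpa["status"]["recommendation"]["containerRecommendations"]
    rec = next(r for r in recommendations if r["containerName"] == container)
    return {"requests": dict(rec["lowerBound"]), "limits": dict(rec["upperBound"])}


# Assumed shape of the exported object, filled with the values shown above.
vpa = {
    "status": {
        "recommendation": {
            "containerRecommendations": [
                {
                    "containerName": "cache-service",
                    "lowerBound": {"cpu": "25m", "memory": "60194k"},
                    "target": {"cpu": "410m", "memory": "262144k"},
                    "upperBound": {"cpu": "1", "memory": "500Mi"},
                }
            ]
        }
    }
}

print(json.dumps(resources_from_recommendation(vpa, "cache-service"), indent=2))
```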
Great!
You just set requests and limits for a brand new application even if you were not familiar with it.
You could extend the same techniques to your apps and set the right requests and limits even if you haven't used them before.
Visualising limits and requests recommendations
Inspecting the VPA object is a bit annoying.
If you prefer a visual tool to inspect the limit and request recommendations, you can install the Goldilocks dashboard.
The Goldilocks dashboard creates VPA objects and serves the recommendations through a web interface.
Let's install it and see how it works.
Since Goldilocks manages the Vertical Pod Autoscaler (VPA) object on your behalf, let's delete the existing Vertical Pod Autoscaler with:
bash
kubectl delete vpa flask-cache
Next, let's install the dashboard.
Goldilocks is packaged as a Helm chart.
So you should head over to the official website and download Helm.
You can verify that Helm is installed correctly by printing the version:
bash
helm version
version.BuildInfo{Version:"v3.3.0"}
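At this point you can add the Fairwinds chart repository and install the dashboard with:
bash
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
--set dashboard.service.type=NodePort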
You can visit the dashboard by typing the following command:
bash
minikube service goldilocks-dashboard --url
You should notice an empty page in your browser.
If you want Goldilocks to display Vertical Pod Autoscaler (VPA) recommendations, you should tag the namespace with a particular label:
bash
kubectl label namespace default goldilocks.fairwinds.com/enabled=true
At this point, Goldilocks creates the Vertical Pod Autoscaler (VPA) object for each Deployment in the namespace and displays a convenient recap in the dashboard.
Time to load test the app with Locust.
If you repeat the experiment and flood the application with requests, you should be able to see the Goldilocks dashboard recommending limits and requests for your Pods.
Summary
Defining requests and limits in your containers is hard.
Getting them right can be a daunting task unless you rely on a proven scientific model to extrapolate the data.
The Vertical Pod Autoscaler (VPA) paired with the metrics server is an excellent combo to remove any sort of guesstimation from choosing requests and limits.
But why stop at the recommendations?
If you don't want to update requests and limits by hand after each Vertical Pod Autoscaler (VPA) recommendation, you can also configure the VPA to propagate the values to the Deployment automatically.
Using this setup, you can be sure that your Pods always have the right requests and limits as they are updated and adjusted in real-time.
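As a hedged sketch of what that configuration could look like, the snippet below generates a VerticalPodAutoscaler manifest with an update policy and prints it as JSON, which kubectl also accepts. It assumes the `updatePolicy.updateMode` field with the value `Auto`, and it reuses the apiVersion from the earlier example. The function name is hypothetical, so check the field names against the VPA version installed in your cluster.

```python
import json


def vpa_manifest(name: str, update_mode: str = "Auto") -> dict:
    """Build a VerticalPodAutoscaler manifest that targets a Deployment called `name`."""
    return {
        "apiVersion": "autoscaling.k8s.io/v1beta2",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            # Assumed field: "Off" only records recommendations, "Auto" applies them.
            "updatePolicy": {"updateMode": update_mode},
        },
    }


print(json.dumps(vpa_manifest("flask-cache"), indent=2))
```

You could save the output to a file and submit it with kubectl apply, in the same way as the vpa.yaml file above.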
If you wish to know more about the updater mechanism in the Vertical Pod Autoscaler (VPA), you can read the official documentation.