Tolerations & NodeAffinity for Deterministic Pod Scheduling in Kubernetes
A pod relies on the Kubernetes scheduler to be placed on a node. In simple terms, a Kubernetes node is a machine that runs containerized applications as part of a Kubernetes cluster. Think of a node as a worker machine responsible for executing the containers that make up your application. How the Kubernetes scheduler decides which node a pod will be placed on depends on several factors, including:
- the pod's resource requirements
- which nodes have enough free resources
- the scheduling rules applied to the pod
- the rules applied to the node
This post focuses on how the rules applied to pods and nodes affect pod scheduling.
Imagine an organization where node occupancy is regulated by the team a workload belongs to and by its compute resource requirements. Each team is only allowed to deploy its workloads onto its designated nodes, sized for the respective resource requirements. The following diagram illustrates the situation.
The organization cannot afford to have the oortcloud team wrongly deploy their apps onto a node designated for the kuiperbelt team, for example.
tolerations and nodeAffinity come to the rescue
For tolerations to take effect on a pod, a taint needs to be applied to the node. A tainted node only accepts pods that tolerate its taint. For the situation above, the first step of the solution could be to taint the nodes so that they repel pods unless the pods are owned by the oortcloud team. This means all nodes serving workloads for the oortcloud team carry an identical taint, while those serving the kuiperbelt team carry a different one. With this approach, there are 3 nodes with an identical taint, serving 3 different purposes indicated by their labels: small, medium, and large compute resource requirements, respectively. If a web app pod is scheduled, there is no guarantee that the Kubernetes scheduler will place it on the node with small compute resources; it may end up on a node with large compute resources, where it is not supposed to run. The conclusion is that tolerations alone will not solve our problem; nodeAffinity comes into the picture. The following diagram gives more clarity.
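The multi-node setup described above can be sketched as follows. This is only an illustration; the node names (node-a through node-d) are hypothetical and should be replaced with the actual nodes in your cluster:

```shell
# All oortcloud nodes get the identical taint...
kubectl taint node node-a owner=oortcloud:NoSchedule
kubectl taint node node-b owner=oortcloud:NoSchedule
kubectl taint node node-c owner=oortcloud:NoSchedule
# ...while a label distinguishes their compute size.
kubectl label node node-a compute=SMALL
kubectl label node node-b compute=MEDIUM
kubectl label node node-c compute=LARGE
# The kuiperbelt node gets its own, different taint.
kubectl taint node node-d owner=kuiperbelt:NoSchedule
```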
Let’s experiment with how tolerations and nodeAffinity affect scheduling. The scenarios in this post were tested in a killercoda environment and are backed by a github repo. The interesting part is that the different scheduling scenarios can be validated in a cluster with a single node, such as docker desktop or minikube.
Inspect what nodes are there in the cluster.
controlplane $ kubectl get nodes -o name
node/controlplane
Let’s taint the node,
kubectl taint node controlplane owner=oortcloud:NoSchedule
We make sure that the node only accepts pods owned by the oortcloud team, through a taint defined as owner=oortcloud:NoSchedule. To be more specific about which kind of pod is accepted, we add a label, compute=SMALL, on the node.
kubectl label node controlplane compute=SMALL
Verify if taint and label are properly applied
controlplane $ kubectl describe node controlplane |grep "Taints:"
Taints: owner=oortcloud:NoSchedule
...
controlplane $ kubectl describe node controlplane |grep -A 7 "Labels:"|grep "compute="
compute=SMALL
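As an alternative to grepping the describe output, the same information can be read directly with jsonpath. This is a sketch; the exact output formatting may vary by kubectl version:

```shell
# Print the node's taints as a JSON array.
kubectl get node controlplane -o jsonpath='{.spec.taints}'
# Print just the value of the compute label.
kubectl get node controlplane -o jsonpath='{.metadata.labels.compute}'
```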
Scenario 1: Verify that a non-tolerant pod does not schedule on the node
Create namespace
kubectl create namespace kuiperbelt
Create a deployment any-kuiperbelt in the kuiperbelt namespace with the nginx:alpine image. Initiate the manifest creation with the imperative approach, and save the manifest into a file, /tmp/any-kuiperbelt.yaml
kubectl create deployment \
any-kuiperbelt --image=nginx:alpine -n kuiperbelt \
--dry-run=client -o yaml > /tmp/any-kuiperbelt.yaml
Open the manifest in edit mode,
vim /tmp/any-kuiperbelt.yaml
In the pod section, spec.template.spec, add the tolerations.
Check the Kubernetes documentation for the specification of pod.spec.tolerations, or simply run kubectl explain pod.spec.tolerations
tolerations is an array of objects, as given below,
tolerations:
- key: owner
  operator: Equal
  value: kuiperbelt
  effect: NoSchedule
To reiterate the intention: we intentionally created a pod which is not tolerant to the taint applied on the node. The node is currently tainted with,
taints:
- key: owner
  value: oortcloud
  effect: NoSchedule
The next step is adding the nodeAffinity in the manifest.
Please have a look at the Kubernetes documentation for the pod’s affinity, or simply run kubectl explain pod.spec.affinity.nodeAffinity
The snippet of pod.spec.affinity.nodeAffinity which is used here is,
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: compute
          operator: Exists
With the above node affinity definition, the Kubernetes scheduler will look for any node that has compute as a key in its labels. Once added, the complete deployment manifest should look similar to the following,
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: any-kuiperbelt
  name: any-kuiperbelt
  namespace: kuiperbelt
spec:
  replicas: 1
  selector:
    matchLabels:
      app: any-kuiperbelt
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: any-kuiperbelt
    spec:
      containers:
      - image: nginx:alpine
        name: nginx
        resources: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: compute
                operator: Exists
      tolerations:
      - key: owner
        operator: Equal
        value: kuiperbelt
        effect: NoSchedule
status: {}
Apply it,
kubectl apply -f /tmp/any-kuiperbelt.yaml
Verify that the scheduling failed, by extracting the events in the kuiperbelt namespace.
kubectl get events -n kuiperbelt -o \
go-template='{{ range .items }}{{ .involvedObject.kind}}{{"/"}}{{.involvedObject.name}}{{"\t"}}{{.message}}{{"\n"}}{{end}}' |grep Pod
The output is something similar to the following,
Pod/any-kuiperbelt-7c596d75bf-ksxhf
0/1 nodes are available: 1 node(s) had untolerated taint {owner: oortcloud}.
preemption: 0/1 nodes are available:
1 Preemption is not helpful for scheduling..
The error message is self-descriptive, isn’t it? “Pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 nodes are available: 1 node(s) had untolerated taint {owner: oortcloud}”
As expected, the pod is Pending, and the deployment is not in a ready state, as shown below,
kubectl get deployment,pod -A \
-l 'app in (any-kuiperbelt, small-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 0/1 1 0 5m53s
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 Pending 0 5m53s
Scenario 2: Verify that a tolerant pod with matching nodeAffinity is able to schedule
Let’s create a deployment for the oortcloud team with small resource requirements in another namespace, called oortcloud.
kubectl create namespace oortcloud
The following is the imperative approach to create the deployment manifest,
kubectl create deployment small-oortcloud-tolerant \
--image=nginx:alpine -n oortcloud --dry-run=client -o yaml \
> /tmp/small-oortcloud.yaml
Edit the /tmp/small-oortcloud.yaml file, then add the tolerations and nodeAffinity. The final manifest looks as below,
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: small-oortcloud-tolerant
  name: small-oortcloud-tolerant
  namespace: oortcloud
spec:
  replicas: 1
  selector:
    matchLabels:
      app: small-oortcloud-tolerant
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: small-oortcloud-tolerant
    spec:
      containers:
      - image: nginx:alpine
        name: nginx
        resources: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "compute"
                operator: "In"
                values:
                - "SMALL"
                - "MEDIUM"
      tolerations:
      - key: owner
        operator: Equal
        value: oortcloud
        effect: NoSchedule
status: {}
As we can see, the tolerations are declared to make sure that the pod tolerates the node's taint, owner=oortcloud:NoSchedule. The nodeAffinity above means that this pod can be scheduled on a node with either the compute=SMALL or the compute=MEDIUM label.
Apply the manifest,
kubectl apply -f /tmp/small-oortcloud.yaml
Check the events in oortcloud namespace
Pod/small-oortcloud-tolerant-6546b85b4d-p6jsg
Successfully assigned oortcloud/small-oortcloud-tolerant-6546b85b4d-p6jsg
to controlplane
...
The small-oortcloud-tolerant pod is able to schedule.
Check all the deployments and pods
controlplane $ kubectl get deployment,pod -A -l 'app in (any-kuiperbelt, small-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 0/1 1 0 13m
oortcloud deployment.apps/small-oortcloud-tolerant 1/1 1 1 4m40s
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 Pending 0 13m
oortcloud pod/small-oortcloud-tolerant-6546b85b4d-p6jsg 1/1 Running 0 4m40s
Scenario 3: Validate that the pod with the same rules still schedules on a node with the label compute=MEDIUM
Recall that the pod definition has nodeAffinity with matchExpression,
- matchExpressions:
  - key: compute
    operator: In
    values:
    - SMALL
    - MEDIUM
The same pod should be able to be scheduled on a node with the label compute=MEDIUM. Let’s change the node’s label,
kubectl label node controlplane compute=MEDIUM --overwrite
Make sure the node label has changed,
kubectl describe node controlplane |grep -A 7 "Labels:" | grep "compute="
compute=MEDIUM
Note that requiredDuringSchedulingIgnoredDuringExecution is, as the name says, ignored during execution, so the label change does not affect the running pod. Scale the small-oortcloud-tolerant deployment down and back up to force the pod to be rescheduled,
controlplane $ kubectl scale deployment -n oortcloud small-oortcloud-tolerant --replicas 0
deployment.apps/small-oortcloud-tolerant scaled
controlplane $ kubectl scale deployment -n oortcloud small-oortcloud-tolerant --replicas 1
deployment.apps/small-oortcloud-tolerant scaled
Check the events in the oortcloud namespace,
controlplane $ kubectl get events -n oortcloud \
-o go-template='{{ range .items }}{{ .involvedObject.kind}}{{"/"}}{{.involvedObject.name}}{{"\t"}}{{.message}}{{"\n"}}{{end}}' |grep Pod
---
Pod/small-oortcloud-tolerant-6546b85b4d-nvhs2 Successfully assigned oortcloud/small-oortcloud-tolerant-6546b85b4d-nvhs2 to controlplane
...
Check the deployment and pod
controlplane $ kubectl get deployment,pod -A -l 'app in (any-kuiperbelt, small-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 0/1 1 0 19m
oortcloud deployment.apps/small-oortcloud-tolerant 1/1 1 1 10m
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 Pending 0 19m
oortcloud pod/small-oortcloud-tolerant-6546b85b4d-nvhs2 1/1 Running 0 61s
It is interesting to see what happens if the node label changes to compute=LARGE
controlplane $ kubectl label node controlplane compute=LARGE --overwrite
node/controlplane labeled
Scale the small-oortcloud-tolerant deployment down and back up again, to force the pod to be rescheduled,
controlplane $ kubectl scale deployment \
-n oortcloud small-oortcloud-tolerant --replicas 0
---
deployment.apps/small-oortcloud-tolerant scaled
controlplane $ kubectl scale deployment \
-n oortcloud small-oortcloud-tolerant --replicas 1
---
deployment.apps/small-oortcloud-tolerant scaled
Check the events in the oortcloud namespace,
kubectl get events -n oortcloud -o \
go-template='{{ range .items }}{{ .involvedObject.kind}}{{"/"}}{{.involvedObject.name}}{{"\t"}}{{.message}}{{"\n"}}{{end}}' |grep Pod
---
Pod/small-oortcloud-tolerant-6546b85b4d-hkcmr
0/1 nodes are available: 1 node(s)
didn't match Pod's node affinity/selector.
preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
The error is self descriptive, “Pod/small-oortcloud-tolerant-6546b85b4d-hkcmr 0/1 nodes are available: 1 node(s) didn’t match Pod’s node affinity/selector”
Check the deployment and pod status in both the oortcloud and kuiperbelt namespaces
controlplane $ kubectl get deployment,pod -A -l 'app in (any-kuiperbelt, small-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 0/1 1 0 25m
oortcloud deployment.apps/small-oortcloud-tolerant 0/1 1 0 16m
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 Pending 0 25m
oortcloud pod/small-oortcloud-tolerant-6546b85b4d-hkcmr 0/1 Pending 0 4m19s
As expected, nodeAffinity with matchExpression,
- matchExpressions:
  - key: compute
    operator: In
    values:
    - SMALL
    - MEDIUM
will not find any match when the only node carries the label compute=LARGE. As a result, the pod cannot be scheduled.
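The same In semantics can be previewed with a set-based label selector; this lists the nodes the affinity term would match, and with only compute=LARGE on the single node, the result is empty:

```shell
# Set-based selector mirroring the nodeAffinity matchExpressions;
# returns no nodes while the label is compute=LARGE.
kubectl get nodes -l 'compute in (SMALL,MEDIUM)'
```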
Let’s create another pod in the oortcloud namespace that tolerates the node's taint and has a matching nodeAffinity definition.
kubectl create deployment large-oortcloud-tolerant \
--image=nginx:alpine -n oortcloud --dry-run=client -o yaml > /tmp/large-oortcloud-tolerant.yaml
As usual, edit /tmp/large-oortcloud-tolerant.yaml, then add the toleration and nodeAffinity,
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: large-oortcloud-tolerant
  name: large-oortcloud-tolerant
  namespace: oortcloud
spec:
  replicas: 1
  selector:
    matchLabels:
      app: large-oortcloud-tolerant
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: large-oortcloud-tolerant
    spec:
      containers:
      - image: nginx:alpine
        name: nginx
        resources: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "compute"
                operator: "In"
                values:
                - "LARGE"
      tolerations:
      - key: owner
        operator: Equal
        value: oortcloud
        effect: NoSchedule
status: {}
Apply it,
kubectl apply -f /tmp/large-oortcloud-tolerant.yaml
Pod scheduling is successful,
Pod/large-oortcloud-tolerant-57c74d458c-5748r
Successfully assigned oortcloud/large-oortcloud-tolerant-57c74d458c-5748r
to controlplane
Check the deployment status and pod,
controlplane $ kubectl get deployment,pod -A -l 'app in (any-kuiperbelt, small-oortcloud-tolerant, large-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 0/1 1 0 29m
oortcloud deployment.apps/large-oortcloud-tolerant 1/1 1 1 103s
oortcloud deployment.apps/small-oortcloud-tolerant 0/1 1 0 20m
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 Pending 0 29m
oortcloud pod/large-oortcloud-tolerant-57c74d458c-5748r 1/1 Running 0 103s
oortcloud pod/small-oortcloud-tolerant-6546b85b4d-hkcmr 0/1 Pending 0 8m33s
As expected, the pod with toleration
tolerations:
- key: owner
  operator: Equal
  value: oortcloud
  effect: NoSchedule
and nodeAffinity
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "compute"
          operator: "In"
          values:
          - "LARGE"
is able to schedule in the cluster.
Scenario 4: Validate that a deployment from the kuiperbelt team can schedule on a node with the owner=kuiperbelt taint
Let’s re-taint the node so that the any-kuiperbelt deployment is able to schedule.
controlplane $ kubectl taint node controlplane owner=kuiperbelt:NoSchedule --overwrite
node/controlplane modified
A NoSchedule taint does not evict pods that are already running, so scale the any-kuiperbelt, small-oortcloud-tolerant and large-oortcloud-tolerant deployments down and back up so the recreated pods are scheduled against the new taint,
controlplane $ kubectl scale deployment -n kuiperbelt any-kuiperbelt --replicas 0
deployment.apps/any-kuiperbelt scaled
controlplane $ kubectl scale deployment -n kuiperbelt any-kuiperbelt --replicas 1
deployment.apps/any-kuiperbelt scaled
controlplane $ kubectl scale deployment -n oortcloud small-oortcloud-tolerant --replicas 0
deployment.apps/small-oortcloud-tolerant scaled
controlplane $ kubectl scale deployment -n oortcloud small-oortcloud-tolerant --replicas 1
deployment.apps/small-oortcloud-tolerant scaled
controlplane $ kubectl scale deployment -n oortcloud large-oortcloud-tolerant --replicas 0
deployment.apps/large-oortcloud-tolerant scaled
controlplane $ kubectl scale deployment -n oortcloud large-oortcloud-tolerant --replicas 1
deployment.apps/large-oortcloud-tolerant scaled
Check the events in the kuiperbelt namespace,
Pod/any-kuiperbelt-7c596d75bf-ksxhf
Successfully assigned kuiperbelt/any-kuiperbelt-7c596d75bf-ksxhf to
controlplane
...
It is able to schedule, as expected.
Meanwhile, in the oortcloud namespace, both pods are unable to schedule, with the untolerated-taint reason,
Pod/large-oortcloud-tolerant-57c74d458c-nljr8 0/1 nodes are available:
1 node(s) had untolerated taint {owner: kuiperbelt}.
preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Pod/small-oortcloud-tolerant-6546b85b4d-58mwn 0/1 nodes are available:
1 node(s) had untolerated taint {owner: kuiperbelt}.
preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Check the deployment and pod
controlplane $ kubectl get deployment,pod -A -l 'app in (any-kuiperbelt, small-oortcloud-tolerant,large-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 1/1 1 1 35m
oortcloud deployment.apps/large-oortcloud-tolerant 0/1 1 0 7m44s
oortcloud deployment.apps/small-oortcloud-tolerant 0/1 1 0 26m
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-nxg4c 1/1 Running 0 4m49s
oortcloud pod/large-oortcloud-tolerant-57c74d458c-nljr8 0/1 Pending 0 4m41s
oortcloud pod/small-oortcloud-tolerant-6546b85b4d-58mwn 0/1 Pending 0 4m36s
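To restore the original setup after the experiment, the taint can be overwritten back, or removed entirely; this is a cleanup sketch using the standard kubectl taint syntax:

```shell
# Overwrite the taint back to the oortcloud team...
kubectl taint node controlplane owner=oortcloud:NoSchedule --overwrite
# ...or remove the owner taint altogether (note the trailing '-').
kubectl taint node controlplane owner:NoSchedule-
```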
With both tolerations and nodeAffinity in the pod manifest, pod scheduling becomes more deterministic, since more complex rules can be expressed. A cluster with 4 nodes can properly schedule 4 workloads with different compute requirements and tolerations. Here is the summary,
+------------+------------------------------------------------+--------------------------------------------+--------+
| Team       | Pod Toleration and Node Affinity               | Node Taint and Label                       | Status |
+------------+------------------------------------------------+--------------------------------------------+--------+
| oortcloud  | owner=oortcloud AND compute in (SMALL, MEDIUM) | owner=oortcloud:NoSchedule, compute=SMALL  | OK     |
| oortcloud  | owner=oortcloud AND compute in (SMALL, MEDIUM) | owner=oortcloud:NoSchedule, compute=MEDIUM | OK     |
| oortcloud  | owner=oortcloud AND compute in (LARGE)         | owner=oortcloud:NoSchedule, compute=LARGE  | OK     |
| kuiperbelt | owner=kuiperbelt AND compute exists            | owner=kuiperbelt:NoSchedule, compute=ANY   | OK     |
+------------+------------------------------------------------+--------------------------------------------+--------+