Tolerations & NodeAffinity for Deterministic Pod Scheduling in Kubernetes
A pod relies on the Kubernetes scheduler to be placed on a node. In simple terms, a Kubernetes node is a machine that runs containerized applications as part of a Kubernetes cluster. Think of a node as a worker machine responsible for executing the containers that make up your application. How the Kubernetes scheduler decides which node a pod will be placed on depends on several factors, including:
- the pod's resource requirements
- which nodes have enough free resources
- the scheduling rules applied to the pod
- the rules applied to the node
This post focuses on how the rules applied to pods and nodes affect pod scheduling.
Imagine an organization where node occupancy is regulated by the team a workload belongs to and by its compute resource requirements. Each team is only allowed to deploy its workloads onto its designated nodes, sized for the respective resource requirements. The following diagram illustrates the situation.
The organization cannot afford to have the oortcloud team wrongly deploy their apps onto a node designated for the kuiperbelt team, for example.
tolerations and nodeAffinity come to the rescue
For tolerations to take effect on a pod, a taint needs to be applied to the node. A tainted node only accepts pods that tolerate its taint. For the situation above, the first step of the solution could be to taint the nodes so that they repel pods unless the pods are owned by the oortcloud team. This means all nodes serving workloads for the oortcloud team carry an identical taint, while those serving the kuiperbelt team carry a different one. With this approach, there are 3 nodes with an identical taint, serving 3 different purposes indicated by their labels: small, medium, and large compute resource requirements, respectively. If a web app pod is scheduled, there is no guarantee that the Kubernetes scheduler will place it on the node with small compute resources; it may end up on a node with large compute resources, where it is not supposed to run. The conclusion is that tolerations alone will not solve our problem; nodeAffinity comes into the picture. The following diagram gives more clarity.
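The multi-node setup described above can be sketched as follows. This is only an illustration; the node names (node-a through node-d) are hypothetical and should be replaced with the actual nodes in your cluster:

```shell
# All oortcloud nodes get the identical taint...
kubectl taint node node-a owner=oortcloud:NoSchedule
kubectl taint node node-b owner=oortcloud:NoSchedule
kubectl taint node node-c owner=oortcloud:NoSchedule
# ...while a label distinguishes their compute size.
kubectl label node node-a compute=SMALL
kubectl label node node-b compute=MEDIUM
kubectl label node node-c compute=LARGE
# The kuiperbelt node gets its own, different taint.
kubectl taint node node-d owner=kuiperbelt:NoSchedule
```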
Let’s experiment with how tolerations and nodeAffinity affect scheduling. The scenarios in this post were tested in a killercoda environment and are backed by a github repo. The interesting part is that the different scheduling scenarios can be validated in a cluster with a single node, such as docker desktop or minikube.
Inspect what nodes are there in the cluster.
controlplane $ kubectl get nodes -o name
node/controlplane
Let’s taint the node,
kubectl taint node controlplane owner=oortcloud:NoSchedule
We make sure that the node only accepts pods owned by the oortcloud team, through a taint defined as owner=oortcloud:NoSchedule. To be more specific about which kind of pod is accepted, we add a label, compute=SMALL, on the node.
kubectl label node controlplane compute=SMALL
Verify if taint and label are properly applied
controlplane $ kubectl describe node controlplane |grep "Taints:"
Taints: owner=oortcloud:NoSchedule
...
controlplane $ kubectl describe node controlplane |grep -A 7 "Labels:"|grep "compute="
compute=SMALL
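As an alternative to grepping the describe output, the same information can be read directly with jsonpath. This is a sketch; the exact output formatting may vary by kubectl version:

```shell
# Print the node's taints as a JSON array.
kubectl get node controlplane -o jsonpath='{.spec.taints}'
# Print just the value of the compute label.
kubectl get node controlplane -o jsonpath='{.metadata.labels.compute}'
```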
Scenario 1: Verify that a non-tolerant pod does not schedule on the node
Create namespace
kubectl create namespace kuiperbelt
Create a deployment any-kuiperbelt in the kuiperbelt namespace with the nginx:alpine image. Initiate the manifest creation with the imperative approach, and save the manifest into a file, /tmp/any-kuiperbelt.yaml
kubectl create deployment \
any-kuiperbelt --image=nginx:alpine -n kuiperbelt \
--dry-run=client -o yaml > /tmp/any-kuiperbelt.yaml
Open the manifest in edit mode,
vim /tmp/any-kuiperbelt.yaml
In the pod section, spec.template.spec, add the tolerations.
Check the Kubernetes documentation for the specification of pod.spec.tolerations, or simply run kubectl explain pod.spec.tolerations
tolerations is an array of objects, as given below,
tolerations:
- key: owner
  operator: Equal
  value: kuiperbelt
  effect: NoSchedule
To reiterate the intention: we intentionally created a pod which is not tolerant to the taint applied on the node. The node is currently tainted with,
taints:
- key: owner
  value: oortcloud
  effect: NoSchedule
The next step is adding the nodeAffinity in the manifest.
Please have a look at the Kubernetes documentation for the pod’s affinity, or simply run kubectl explain pod.spec.affinity.nodeAffinity
The snippet of pod.spec.affinity.nodeAffinity which is used here is,
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: compute
          operator: Exists
With the above node affinity definition, the Kubernetes scheduler will look for any node that has compute as a key in its labels. Once added, the complete deployment manifest should look similar to the following,
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: any-kuiperbelt
  name: any-kuiperbelt
  namespace: kuiperbelt
spec:
  replicas: 1
  selector:
    matchLabels:
      app: any-kuiperbelt
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: any-kuiperbelt
    spec:
      containers:
      - image: nginx:alpine
        name: nginx
        resources: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: compute
                operator: Exists
      tolerations:
      - key: owner
        operator: Equal
        value: kuiperbelt
        effect: NoSchedule
status: {}
Apply it,
kubectl apply -f /tmp/any-kuiperbelt.yaml
Verify that the scheduling failed, by extracting the events in the kuiperbelt namespace.
kubectl get events -n kuiperbelt -o \
go-template='{{ range .items }}{{ .involvedObject.kind}}{{"/"}}{{.involvedObject.name}}{{"\t"}}{{.message}}{{"\n"}}{{end}}' |grep Pod
The output is something similar to the following,
Pod/any-kuiperbelt-7c596d75bf-ksxhf
0/1 nodes are available: 1 node(s) had untolerated taint {owner: oortcloud}.
preemption: 0/1 nodes are available:
1 Preemption is not helpful for scheduling..
The error message is self-descriptive, isn’t it? “Pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 nodes are available: 1 node(s) had untolerated taint {owner: oortcloud}”
As expected, the pod is Pending, and the deployment is not in a ready state, as shown below,
kubectl get deployment,pod -A \
-l 'app in (any-kuiperbelt, small-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 0/1 1 0 5m53s
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 Pending 0 5m53s
Scenario 2: Verify that a tolerant pod with matching nodeAffinity is able to schedule
Let’s create a deployment for the oortcloud team with small resource requirements in another namespace, called oortcloud.
kubectl create namespace oortcloud
The following is the imperative approach to create the deployment manifest,
kubectl create deployment small-oortcloud-tolerant \
--image=nginx:alpine -n oortcloud --dry-run=client -o yaml \
> /tmp/small-oortcloud.yaml
Edit the /tmp/small-oortcloud.yaml file, then add the tolerations and nodeAffinity. The final manifest looks as below,
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: small-oortcloud-tolerant
  name: small-oortcloud-tolerant
  namespace: oortcloud
spec:
  replicas: 1
  selector:
    matchLabels:
      app: small-oortcloud-tolerant
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: small-oortcloud-tolerant
    spec:
      containers:
      - image: nginx:alpine
        name: nginx
        resources: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "compute"
                operator: "In"
                values:
                - "SMALL"
                - "MEDIUM"
      tolerations:
      - key: owner
        operator: Equal
        value: oortcloud
        effect: NoSchedule
status: {}
As we can see, the tolerations are declared to make sure that the pod tolerates the node's taint, owner=oortcloud:NoSchedule. The nodeAffinity above means that this pod can be scheduled on a node with either the compute=SMALL or the compute=MEDIUM label.
Apply the manifest,
kubectl apply -f /tmp/small-oortcloud.yaml
Check the events in oortcloud namespace
Pod/small-oortcloud-tolerant-6546b85b4d-p6jsg
Successfully assigned oortcloud/small-oortcloud-tolerant-6546b85b4d-p6jsg
to controlplane
...
The small-oortcloud-tolerant pod is able to schedule.
Check all the deployments and pods
controlplane $ kubectl get deployment,pod -A -l 'app in (any-kuiperbelt, small-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 0/1 1 0 13m
oortcloud deployment.apps/small-oortcloud-tolerant 1/1 1 1 4m40s
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 Pending 0 13m
oortcloud pod/small-oortcloud-tolerant-6546b85b4d-p6jsg 1/1 Running 0 4m40s
Scenario 3: Validate that the pod with the same rules still schedules on a node with the label compute=MEDIUM
Recall that the pod definition has nodeAffinity with matchExpression,
- matchExpressions:
  - key: compute
    operator: In
    values:
    - SMALL
    - MEDIUM
The same pod should be able to be scheduled on a node with the label compute=MEDIUM. Let’s change the node’s label,
kubectl label node controlplane compute=MEDIUM --overwrite
Make sure the node label has changed,
kubectl describe node controlplane |grep -A 7 "Labels:" | grep "compute="
compute=MEDIUM
Note that requiredDuringSchedulingIgnoredDuringExecution is, as the name says, ignored during execution, so the label change does not affect the running pod. Scale the small-oortcloud-tolerant deployment down and back up to force the pod to be rescheduled,
controlplane $ kubectl scale deployment -n oortcloud small-oortcloud-tolerant --replicas 0
deployment.apps/small-oortcloud-tolerant scaled
controlplane $ kubectl scale deployment -n oortcloud small-oortcloud-tolerant --replicas 1
deployment.apps/small-oortcloud-tolerant scaled
Check the events in the oortcloud namespace,
controlplane $ kubectl get events -n oortcloud \
-o go-template='{{ range .items }}{{ .involvedObject.kind}}{{"/"}}{{.involvedObject.name}}{{"\t"}}{{.message}}{{"\n"}}{{end}}' |grep Pod
---
Pod/small-oortcloud-tolerant-6546b85b4d-nvhs2 Successfully assigned oortcloud/small-oortcloud-tolerant-6546b85b4d-nvhs2 to controlplane
...
Check the deployment and pod
controlplane $ kubectl get deployment,pod -A -l 'app in (any-kuiperbelt, small-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 0/1 1 0 19m
oortcloud deployment.apps/small-oortcloud-tolerant 1/1 1 1 10m
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 Pending 0 19m
oortcloud pod/small-oortcloud-tolerant-6546b85b4d-nvhs2 1/1 Running 0 61s
It is interesting to see what happens if the node label changes to compute=LARGE
controlplane $ kubectl label node controlplane compute=LARGE --overwrite
node/controlplane labeled
Scale the small-oortcloud-tolerant deployment down and back up again, to force the pod to be rescheduled,
controlplane $ kubectl scale deployment \
-n oortcloud small-oortcloud-tolerant --replicas 0
---
deployment.apps/small-oortcloud-tolerant scaled
controlplane $ kubectl scale deployment \
-n oortcloud small-oortcloud-tolerant --replicas 1
---
deployment.apps/small-oortcloud-tolerant scaled
Check the events in the oortcloud namespace,
kubectl get events -n oortcloud -o \
go-template='{{ range .items }}{{ .involvedObject.kind}}{{"/"}}{{.involvedObject.name}}{{"\t"}}{{.message}}{{"\n"}}{{end}}' |grep Pod
---
Pod/small-oortcloud-tolerant-6546b85b4d-hkcmr
0/1 nodes are available: 1 node(s)
didn't match Pod's node affinity/selector.
preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
The error is self descriptive, “Pod/small-oortcloud-tolerant-6546b85b4d-hkcmr 0/1 nodes are available: 1 node(s) didn’t match Pod’s node affinity/selector”
Check the deployment and pod status in both the oortcloud and kuiperbelt namespaces
controlplane $ kubectl get deployment,pod -A -l 'app in (any-kuiperbelt, small-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 0/1 1 0 25m
oortcloud deployment.apps/small-oortcloud-tolerant 0/1 1 0 16m
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 Pending 0 25m
oortcloud pod/small-oortcloud-tolerant-6546b85b4d-hkcmr 0/1 Pending 0 4m19s
As expected, nodeAffinity with matchExpression,
- matchExpressions:
  - key: compute
    operator: In
    values:
    - SMALL
    - MEDIUM
will not find any match when the only node carries the label compute=LARGE. As a result, the pod cannot be scheduled.
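The same In semantics can be previewed with a set-based label selector; this lists the nodes the affinity term would match, and with only compute=LARGE on the single node, the result is empty:

```shell
# Set-based selector mirroring the nodeAffinity matchExpressions;
# returns no nodes while the label is compute=LARGE.
kubectl get nodes -l 'compute in (SMALL,MEDIUM)'
```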
Let’s create another pod in the oortcloud namespace that tolerates the node's taint and has a matching nodeAffinity definition.
kubectl create deployment large-oortcloud-tolerant \
--image=nginx:alpine -n oortcloud --dry-run=client -o yaml > /tmp/large-oortcloud-tolerant.yaml
As usual, edit /tmp/large-oortcloud-tolerant.yaml, then add the toleration and nodeAffinity,
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: large-oortcloud-tolerant
  name: large-oortcloud-tolerant
  namespace: oortcloud
spec:
  replicas: 1
  selector:
    matchLabels:
      app: large-oortcloud-tolerant
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: large-oortcloud-tolerant
    spec:
      containers:
      - image: nginx:alpine
        name: nginx
        resources: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "compute"
                operator: "In"
                values:
                - "LARGE"
      tolerations:
      - key: owner
        operator: Equal
        value: oortcloud
        effect: NoSchedule
status: {}
Apply it,
kubectl apply -f /tmp/large-oortcloud-tolerant.yaml
Pod scheduling is successful,
Pod/large-oortcloud-tolerant-57c74d458c-5748r
Successfully assigned oortcloud/large-oortcloud-tolerant-57c74d458c-5748r
to controlplane
Check the deployment status and pod,
controlplane $ kubectl get deployment,pod -A -l 'app in (any-kuiperbelt, small-oortcloud-tolerant, large-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 0/1 1 0 29m
oortcloud deployment.apps/large-oortcloud-tolerant 1/1 1 1 103s
oortcloud deployment.apps/small-oortcloud-tolerant 0/1 1 0 20m
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-ksxhf 0/1 Pending 0 29m
oortcloud pod/large-oortcloud-tolerant-57c74d458c-5748r 1/1 Running 0 103s
oortcloud pod/small-oortcloud-tolerant-6546b85b4d-hkcmr 0/1 Pending 0 8m33s
As expected, the pod with toleration
tolerations:
- key: owner
  operator: Equal
  value: oortcloud
  effect: NoSchedule
and nodeAffinity
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "compute"
          operator: "In"
          values:
          - "LARGE"
is able to schedule in the cluster.
Scenario 4: Validate that a deployment from the kuiperbelt team can schedule on a node with the owner=kuiperbelt taint
Let’s re-taint the node so that the any-kuiperbelt deployment is able to schedule.
controlplane $ kubectl taint node controlplane owner=kuiperbelt:NoSchedule --overwrite
node/controlplane modified
A NoSchedule taint does not evict pods that are already running, so scale the any-kuiperbelt, small-oortcloud-tolerant and large-oortcloud-tolerant deployments down and back up so the recreated pods are scheduled against the new taint,
controlplane $ kubectl scale deployment -n kuiperbelt any-kuiperbelt --replicas 0
deployment.apps/any-kuiperbelt scaled
controlplane $ kubectl scale deployment -n kuiperbelt any-kuiperbelt --replicas 1
deployment.apps/any-kuiperbelt scaled
controlplane $ kubectl scale deployment -n oortcloud small-oortcloud-tolerant --replicas 0
deployment.apps/small-oortcloud-tolerant scaled
controlplane $ kubectl scale deployment -n oortcloud small-oortcloud-tolerant --replicas 1
deployment.apps/small-oortcloud-tolerant scaled
controlplane $ kubectl scale deployment -n oortcloud large-oortcloud-tolerant --replicas 0
deployment.apps/large-oortcloud-tolerant scaled
controlplane $ kubectl scale deployment -n oortcloud large-oortcloud-tolerant --replicas 1
deployment.apps/large-oortcloud-tolerant scaled
Check the events in the kuiperbelt namespace,
Pod/any-kuiperbelt-7c596d75bf-ksxhf
Successfully assigned kuiperbelt/any-kuiperbelt-7c596d75bf-ksxhf to
controlplane
...
It is able to schedule, as expected.
Meanwhile, in the oortcloud namespace, both pods are unable to schedule, with the untolerated-taint reason,
Pod/large-oortcloud-tolerant-57c74d458c-nljr8 0/1 nodes are available:
1 node(s) had untolerated taint {owner: kuiperbelt}.
preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Pod/small-oortcloud-tolerant-6546b85b4d-58mwn 0/1 nodes are available:
1 node(s) had untolerated taint {owner: kuiperbelt}.
preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Check the deployment and pod
controlplane $ kubectl get deployment,pod -A -l 'app in (any-kuiperbelt, small-oortcloud-tolerant,large-oortcloud-tolerant)'
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kuiperbelt deployment.apps/any-kuiperbelt 1/1 1 1 35m
oortcloud deployment.apps/large-oortcloud-tolerant 0/1 1 0 7m44s
oortcloud deployment.apps/small-oortcloud-tolerant 0/1 1 0 26m
NAMESPACE NAME READY STATUS RESTARTS AGE
kuiperbelt pod/any-kuiperbelt-7c596d75bf-nxg4c 1/1 Running 0 4m49s
oortcloud pod/large-oortcloud-tolerant-57c74d458c-nljr8 0/1 Pending 0 4m41s
oortcloud pod/small-oortcloud-tolerant-6546b85b4d-58mwn 0/1 Pending 0 4m36s
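To restore the original setup after the experiment, the taint can be overwritten back, or removed entirely; this is a cleanup sketch using the standard kubectl taint syntax:

```shell
# Overwrite the taint back to the oortcloud team...
kubectl taint node controlplane owner=oortcloud:NoSchedule --overwrite
# ...or remove the owner taint altogether (note the trailing '-').
kubectl taint node controlplane owner:NoSchedule-
```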
With both tolerations and nodeAffinity in the pod manifest, pod scheduling becomes more deterministic, since more complex rules can be expressed. A cluster with 4 nodes can properly schedule 4 workloads with different compute requirements and tolerations. Here is the summary,
+------------+------------------------------------------------+--------------------------------------------+--------+
| Team       | Pod Toleration and Node Affinity               | Node Taint and Label                       | Status |
+------------+------------------------------------------------+--------------------------------------------+--------+
| oortcloud  | owner=oortcloud AND compute in (SMALL, MEDIUM) | owner=oortcloud:NoSchedule, compute=SMALL  | OK     |
| oortcloud  | owner=oortcloud AND compute in (SMALL, MEDIUM) | owner=oortcloud:NoSchedule, compute=MEDIUM | OK     |
| oortcloud  | owner=oortcloud AND compute in (LARGE)         | owner=oortcloud:NoSchedule, compute=LARGE  | OK     |
| kuiperbelt | owner=kuiperbelt AND compute exists            | owner=kuiperbelt:NoSchedule, compute=ANY   | OK     |
+------------+------------------------------------------------+--------------------------------------------+--------+