Exploring Kubernetes Descheduler

HungWei Chiu
Nov 12, 2023 · 6 min read

This article documents what Kubernetes Descheduler is and how to use this project to balance workloads running on Kubernetes.

Use Cases

Kubernetes provides various built-in resources that let us influence scheduling decisions, such as NodeSelector, NodeAffinity, PodAffinity, PodAntiAffinity, and Pod Topology Spread Constraints. Readers familiar with these may recall a field called requiredDuringSchedulingIgnoredDuringExecution. As the name implies, the rule is enforced only at the moment the Scheduler places the Pod; once the Pod is running on a node, the rule no longer affects it. Likewise, Taints/Tolerations have no effect on Pods that are already running.
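For example, a typical NodeAffinity rule looks like the following sketch (the disktype label and nginx image are illustrative, not from the original article):

apiVersion: v1
kind: Pod
metadata:
  name: affinity-example
spec:
  affinity:
    nodeAffinity:
      # Enforced only while scheduling; ignored once the Pod is running,
      # even if the node's labels change later.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype        # illustrative label
                operator: In
                values:
                  - ssd
  containers:
    - name: app
      image: nginx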

In most cases, this poses no problem. In certain situations, however, these Pods may need to be redistributed. For example, when a new node is added to the cluster, most Pods remain on the existing nodes, and resource utilization becomes uneven. Similarly, when a node's labels change, Pods already running on it are unaffected, even though it may be desirable to re-evaluate them against the latest labels.

Typically, we can trigger rescheduling by actively updating or redeploying these applications. The goal of Descheduler is to provide an automated mechanism for achieving runtime rescheduling.

Installation and Usage

Descheduler provides various deployment methods, including Job, CronJob, and Deployment. The Deployment option uses the Leader Election mechanism to implement a High Availability (HA) architecture. This architecture favors the CP (Consistency and Partition tolerance) model, because at any given time only one descheduler instance should be making eviction decisions.

Additionally, the server exposes metrics that Prometheus can scrape for later debugging.

The following demonstrates the deployment and installation using Helm, with the HA mechanism enabled.

$ helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/

Prepare the following values.yaml:

replicas: 3
leaderElection:
  enabled: true
kind: Deployment

$ helm upgrade --install descheduler --namespace kube-system descheduler/descheduler --version=0.28.0 -f values.yaml

After installation, observe the system state. There are three replicas, and the Lease object shows that the blf4l Pod has won the election.

$ kubectl -n kube-system get lease
NAME          HOLDER                                                              AGE
descheduler   descheduler-845867b84b-blf4l_f48c09d6-2633-4ec5-95aa-d0b62ffe412c   5m9s

$ kubectl -n kube-system get pods -l "app.kubernetes.io/instance"="descheduler"
NAME                           READY   STATUS    RESTARTS   AGE
descheduler-845867b84b-7kv9z   1/1     Running   0          2m15s
descheduler-845867b84b-blf4l   1/1     Running   0          2m15s
descheduler-845867b84b-d4tw7   1/1     Running   0          2m15s

Descheduler Architecture

The configuration of Descheduler primarily consists of Policies, with Policies composed of Evictors and Strategies.

Strategies determine under what circumstances Pods should be redistributed, while Evictors determine which Pods are eligible for redistribution.
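Conceptually, a v1alpha2 policy wires the two together. A minimal sketch (the profile name and plugin choices here are illustrative):

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: example
    pluginConfig:
      - name: "DefaultEvictor"    # Evictor: decides which Pods may be evicted
      - name: "RemoveDuplicates"  # Strategy plugin: decides when to evict
    plugins:
      balance:
        enabled:
          - "RemoveDuplicates"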

Evictor

The Evictor currently provides two built-in mechanisms, namely Filter and PreEvictionFilter.

Filter is used to exclude certain Pods from consideration for “rescheduling.”

Examples include:

  • Static Pods
  • DaemonSet Pods
  • Terminating Pods
  • Pods meeting specific Namespace/Priority/Label conditions (configurable; see the sketch below)
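Those configurable conditions are passed as arguments to the DefaultEvictor plugin. A sketch, assuming the v1alpha2 DefaultEvictor argument names; the values are illustrative:

pluginConfig:
  - name: "DefaultEvictor"
    args:
      evictSystemCriticalPods: false   # never evict system-critical Pods
      evictLocalStoragePods: false     # skip Pods using local storage
      priorityThreshold:
        value: 10000                   # illustrative: only Pods below this priority are evictable
      labelSelector:                   # only consider Pods matching these labels
        matchLabels:
          evictable: "true"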

By examining the code, you can observe that some rules are hardcoded, while others are configured through constraints.

func (d *DefaultEvictor) Filter(pod *v1.Pod) bool {
    checkErrs := []error{}

    // Pods carrying the eviction annotation are always evictable.
    if HaveEvictAnnotation(pod) {
        return true
    }

    ownerRefList := podutil.OwnerRef(pod)
    if utils.IsDaemonsetPod(ownerRefList) {
        checkErrs = append(checkErrs, fmt.Errorf("pod is a DaemonSet pod"))
    }

    if utils.IsMirrorPod(pod) {
        checkErrs = append(checkErrs, fmt.Errorf("pod is a mirror pod"))
    }

    if utils.IsStaticPod(pod) {
        checkErrs = append(checkErrs, fmt.Errorf("pod is a static pod"))
    }

    if utils.IsPodTerminating(pod) {
        checkErrs = append(checkErrs, fmt.Errorf("pod is terminating"))
    }

    // Constraints built from the evictor configuration.
    for _, c := range d.constraints {
        if err := c(pod); err != nil {
            checkErrs = append(checkErrs, err)
        }
    }

    if len(checkErrs) > 0 {
        klog.V(4).InfoS("Pod fails the following checks", "pod", klog.KObj(pod), "checks", errors.NewAggregate(checkErrs).Error())
        return false
    }

    return true
}

The PreEvictionFilter currently implements a mechanism called "NodeFit": before a Pod is selected for eviction, it checks whether some node could accommodate the Pod after eviction, taking its NodeSelector/Affinity/Taints/Requests into account.

The implementation logic can be seen in the following code. When the user sets nodeFit: true in the configuration, the function checks whether any other ready node could host the Pod.

func (d *DefaultEvictor) PreEvictionFilter(pod *v1.Pod) bool {
    defaultEvictorArgs := d.args.(*DefaultEvictorArgs)
    if defaultEvictorArgs.NodeFit {
        // Consider only ready nodes that match the configured node selector.
        nodes, err := nodeutil.ReadyNodes(context.TODO(), d.handle.ClientSet(), d.handle.SharedInformerFactory().Core().V1().Nodes().Lister(), defaultEvictorArgs.NodeSelector)
        if err != nil {
            klog.ErrorS(err, "unable to list ready nodes", "pod", klog.KObj(pod))
            return false
        }
        if !nodeutil.PodFitsAnyOtherNode(d.handle.GetPodsAssignedToNodeFunc(), pod, nodes) {
            klog.InfoS("pod does not fit on any other node because of nodeSelector(s), Taint(s), or nodes marked as unschedulable", "pod", klog.KObj(pod))
            return false
        }
        return true
    }
    return true
}
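Enabling this check comes down to a single flag on the DefaultEvictor; a minimal sketch:

pluginConfig:
  - name: "DefaultEvictor"
    args:
      nodeFit: true   # verify an evicted Pod would fit on another ready node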

Strategy

The structure of a Strategy mainly comprises Strategy -> Strategy Plugin, where the Strategy Plugin is the algorithmic component responsible for determining when to reschedule Pods. These Strategy Plugins are categorized under the same Strategy based on their common purpose. Currently, there are two built-in Strategies: Deschedule and Balance.

There are approximately 10 built-in Strategy Plugins, and details can be found in the official documentation. Examples include:

  • RemoveDuplicates
  • RemovePodsViolatingNodeAffinity
  • RemovePodsHavingTooManyRestarts
  • HighNodeUtilization
  • … and so on.

Deschedule primarily focuses on conditions like Label/Taint/Affinity for rescheduling Pods, while Balance aims to reschedule Pods based on node utilization or the distribution of Pods.
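As a concrete Balance example, LowNodeUtilization classifies nodes as under- or over-utilized by thresholds. A sketch with illustrative percentages (the threshold values are not from the original article):

pluginConfig:
  - name: "LowNodeUtilization"
    args:
      thresholds:          # nodes below all of these are "underutilized"
        cpu: 20
        memory: 20
        pods: 20
      targetThresholds:    # nodes above any of these are "overutilized"
        cpu: 50
        memory: 50
        pods: 50
plugins:
  balance:
    enabled:
      - "LowNodeUtilization"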

The diagram below attempts to consolidate all the mentioned rules and presents the corresponding rules and Strategy Plugins for each different Filter/Strategy.

Descheduler Example

Configuring Descheduler involves modifying the relevant ConfigMap. When installed via Helm, the ConfigMap is generated from the chart's values, so the configuration should be supplied through values.yaml.

It's essential to note the apiVersion in the configuration. The Helm chart currently defaults to descheduler/v1alpha1, while the latest version is descheduler/v1alpha2. The examples in the official documentation are based on descheduler/v1alpha2, so take care not to mix the two formats.
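For reference, the same strategy expressed in the older v1alpha1 format looks roughly like this (note the strategies map instead of profiles):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 5
        includingInitContainers: true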

Modify the values.yaml file with the following content and redeploy. This file enables the RemovePodsHavingTooManyRestarts Strategy Plugin and sets the threshold to 5.

replicas: 3

leaderElection:
  enabled: true

kind: Deployment

deschedulerPolicyAPIVersion: "descheduler/v1alpha2"
deschedulerPolicy:
  profiles:
    - name: test
      pluginConfig:
        - name: "RemovePodsHavingTooManyRestarts"
          args:
            podRestartThreshold: 5
            includingInitContainers: true
      plugins:
        deschedule:
          enabled:
            - "RemovePodsHavingTooManyRestarts"

Next, deploy an unstable Pod and observe what happens after the Pod restarts five times.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: www-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: www
  template:
    metadata:
      labels:
        app: www
    spec:
      containers:
        - name: www-server
          image: hwchiu/python-example
          command: ['sh', '-c', 'date && no']   # "no" is not a valid command, so the container keeps crashing
          ports:
            - containerPort: 5000
              protocol: "TCP"

From the watch output, it can be observed that once the Pod's restart count reaches the configured threshold of five, the Pod is evicted and a new Pod is successfully deployed.

$ kubectl get pods -w
NAME                              READY   STATUS              RESTARTS       AGE
www-deployment-77d6574796-bhktm   0/1     CrashLoopBackOff    4 (53s ago)    2m37s
www-deployment-77d6574796-bhktm   0/1     Error               5 (92s ago)    3m16s
www-deployment-77d6574796-bhktm   0/1     CrashLoopBackOff    5 (12s ago)    3m28s
www-deployment-77d6574796-bhktm   0/1     CrashLoopBackOff    5 (104s ago)   5m
www-deployment-77d6574796-bhktm   0/1     Terminating         5 (104s ago)   5m
www-deployment-77d6574796-rfkws   0/1     Pending             0              0s
www-deployment-77d6574796-bhktm   0/1     Terminating         5              5m
www-deployment-77d6574796-rfkws   0/1     Pending             0              0s
www-deployment-77d6574796-rfkws   0/1     ContainerCreating   0              0s
www-deployment-77d6574796-bhktm   0/1     Terminating         5              5m
www-deployment-77d6574796-bhktm   0/1     Terminating         5              5m
www-deployment-77d6574796-bhktm   0/1     Terminating         5              5m
www-deployment-77d6574796-bhktm   0/1     Terminating         5              5m

Simultaneously, the Descheduler logs clearly show which Pod was evicted, from which node, and for what reason.

I1112 13:08:25.209203       1 descheduler.go:156] Building a pod evictor
I1112 13:08:25.209408       1 toomanyrestarts.go:110] "Processing node" node="kind-control-plane"
I1112 13:08:25.209605       1 toomanyrestarts.go:110] "Processing node" node="kind-worker"
I1112 13:08:25.209648       1 toomanyrestarts.go:110] "Processing node" node="kind-worker2"
I1112 13:08:25.209687       1 toomanyrestarts.go:110] "Processing node" node="kind-worker3"
I1112 13:08:25.209735       1 toomanyrestarts.go:110] "Processing node" node="kind-worker4"
I1112 13:08:25.261639       1 evictions.go:171] "Evicted pod" pod="default/www-deployment-77d6574796-bhktm" reason="" strategy="RemovePodsHavingTooManyRestarts" node="kind-worker4"
I1112 13:08:25.261811       1 profile.go:323] "Total number of pods evicted" extension point="Deschedule" evictedPods=1
I1112 13:08:25.261856       1 descheduler.go:170] "Number of evicted pods" totalEvicted=1

The official documentation provides various usage examples for different Strategy Plugins, which will not be repeated here.

Summary

Descheduler provides a mechanism for dynamically redistributing Pods. Some strategies react to deployment conditions such as Labels/Taints/Affinity, while others act on node utilization. The latter can spread Pods more evenly across nodes, or work together with the Cluster Autoscaler to drain under-utilized nodes and save costs.
