Customizing Kubernetes Resource Management using NRI

Feruzjon Muyassarov
11 min read · Jan 12, 2024

Background

Kubernetes runs in a wide variety of environments, anywhere from virtual machines (VMs) in public Cloud Service Provider (CSP) environments to bare metal hardware in 5G edge networks. Inevitably, the stock Kubernetes resource assignment algorithms are a compromise that tries to provide reasonable performance while fulfilling often contradictory requirements across such a wide range of environments. The introduction of the Node Resource Interface (NRI) as a cross-runtime extension mechanism now offers an alternative for plugging in custom resource assignment algorithms better tailored to environments that clearly benefit from them, for instance application-specific bare metal cloud environments.

In Kubernetes, hardware resources are often oversimplified. For instance, all vCPUs are considered equal, memory is treated uniformly, and there is a single, extensive pool of resources for all cluster activities, resulting in minimal isolation. In reality, hardware resources consist of multiple components and zones, and the way workloads are distributed among those zones can significantly influence overall performance and resource consumption.

To better understand the role of NRI, let's revisit the flow by which a pod is assigned to a node. When a new pod gets created, the scheduler is notified and seeks the best node, taking into account the pod's resource requests and limits, affinity rules, and so on. Once the scheduler picks the optimal node, it informs the node's agent, kubelet, about the need to create containers. Kubelet then assigns some resources to the pod and passes the container creation request down to the container runtime. Generally speaking, kubelet manages the lifecycle of containers on the node, but it doesn't do so directly by itself; instead, it coordinates the request with the container runtime. Although kubelet has already assigned some resources to the container, it is still possible to alter the resource assignment at the runtime level via NRI and its pluggable algorithms (a.k.a. plugins).

NRI allows plugging domain- or vendor-specific custom logic into Open Container Initiative (OCI) compatible runtimes. This logic can hook into lifecycle events of pods and containers to make controlled changes to the configuration of containers. Both the containerd and CRI-O container runtimes now include the NRI feature, enabling users to integrate their custom logic. Traditionally, containers are immutable once created, except for modifications to compute resources like CPU and memory. The ability to adjust these resources is valuable, especially when focusing on considerations such as resource allocation, performance, and cost efficiency.

Let’s explore how NRI and its plugins come into play during the pod creation. Consider the following scenario:

  1. A Pod spec is applied with kubectl.
  2. Kubernetes then communicates with the container runtime (containerd/CRI-O) through the Container Runtime Interface (CRI), instructing it to create container(s) based on the provided pod spec (specifically, the OCI Spec).
  3. Upon receiving the OCI Spec, the container runtime consults the NRI, giving registered NRI plugins an opportunity to alter resource assignments and other selected aspects of the container.
  4. NRI, in itself, doesn’t actively modify the received OCI Spec; instead, it serves as an interface for incorporating external logic. NRI then passes the OCI Spec to each registered NRI plugin currently in use (more details on plugin registration later).
  5. The plugin applies its logic to modify a subset of resources within the OCI Spec.
Pod creation flow when using NRI & NRI plugin

Enabling NRI in container runtimes

By default, NRI is disabled in both containerd and CRI-O. Before installing any NRI plugin, it is necessary to enable NRI. The specific method of enabling it depends on your cluster provisioner, and there are various approaches available.

  • Kops configuration
  • Kubespray configuration
  • For CSP-managed, OpenShift, bare metal, local, or any other type of cluster, you can rely on the plugin installation process itself. The installation includes an optional flag that, when passed, automatically enables NRI on cluster nodes before starting the plugin. Further details on this can be found in Deploying NRI Plugins.

To learn more about all the available NRI configuration options, please refer to containerd’s NRI documentation or CRI-O’s NRI table.
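For reference, on containerd 1.7+ enabling NRI manually boils down to one setting in /etc/containerd/config.toml, followed by a runtime restart; the snippet below follows containerd's documented NRI plugin section (CRI-O has an analogous [crio.nri] table in its configuration):

```toml
# /etc/containerd/config.toml (containerd 1.7+)
version = 2

[plugins."io.containerd.nri.v1.nri"]
  # NRI support is compiled in but disabled by default.
  disable = false
```

After editing the file, restart the runtime (e.g. systemctl restart containerd) for the change to take effect.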

NRI Plugins

Several community-maintained NRI plugins cater to different use cases, providing various configuration options to fine-tune plugin behavior and resource allocation. The containerd/nri project comes with a set of plugins, such as examples and debug helpers, including fully functional plugins like device-injector and ulimit-adjuster. Furthermore, the nri-plugins GitHub repository currently hosts the source code for community-maintained plugins like Topology-Aware, Balloons, Memtierd, Memory-QoS, and SGX-EPC. While these plugins already cover many use cases, you always have the flexibility to develop your own if your specific needs are not addressed by the existing ones.

If you consume any of the community-maintained plugins, you can install them as a Kubernetes application using the corresponding Helm chart for each plugin. These Helm charts are published on artifacthub.io. An alternative to Helm is the nri-plugins operator, a Kubernetes operator that manages the lifecycle of the community-maintained plugin of your choice. Each plugin offers a set of parameters or configuration options, allowing users to fine-tune how resources are allocated. These parameters are instrumental in optimizing resource allocation, even in complex setups. For a comprehensive understanding of how these plugins operate, and to explore all possible configurations, refer to the official documentation.

Deploying NRI Plugins

NRI plugins can be installed using either Helm or the Kubernetes operator. In this example, we will use Helm. The Helm charts for all available plugins can be found on artifacthub.io.

Cluster configuration
In the example below, we will be using a single-node cluster with the following configuration:

  • cluster provisioner: kubeadm
  • kubernetes version: 1.28.2
  • number of nodes: 1
  • containerd version: 1.7.6
  • node image: Fedora 37
  • Helm version: 3.0.0+

Hardware architecture
Before we install any NRI plugin, let’s take a look at the hardware layout of our cluster node. To get a snapshot of how the hardware is organized, we will use the lstopo utility, a command-line tool that provides a graphical representation of the underlying hardware architecture of a system, including the topology of processors, memory, caches, and interconnects.

Node hardware architecture

The architecture diagram above describes a machine with two CPU packages, each with two NUMA nodes; each NUMA node consists of two CPU cores with associated caches and processing units. The total memory (7935MB) is distributed among the NUMA nodes. We can see similar information in the lscpu output below.

$ lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i7-1270P
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 2
Stepping: 3
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s): 4-7
NUMA node2 CPU(s): 8-11
NUMA node3 CPU(s): 12-15

Topology-aware resource policy plugin

In this setup, we will use the topology-aware plugin on a single-node kubeadm cluster. Note that you can also run NRI plugins on major cloud providers’ managed clusters like AWS EKS, Google GKE, or Azure AKS. The topology-aware plugin builds a tree of pools based on the detected hardware topology without any special or extra configuration (a.k.a. zero configuration). Each pool consists of a set of CPUs and memory zones. The topology-aware plugin allocates resources by first picking the pool that is considered the best fit for the resource requirements of the workload, and then assigning CPU and memory from this pool.

Enable NRI

Before installing the NRI plugin, it is essential to enable the NRI feature in the container runtime on each node of our cluster. Thus, as part of the plugin installation via Helm, we will pass the nri.patchRuntimeConfig Helm parameter, which results in running an init container before starting the actual plugin. The init container automatically detects the runtime on the host, modifies its configuration file to enable NRI, and finally restarts the runtime systemd unit. To be able to re-configure the container runtime on the host, the init container runs with securityContext.privileged set to true, so that it runs with all available Linux capabilities and with access to all devices on the host system. Since the plugin runs as a DaemonSet, we can be sure that NRI gets activated on every node of our cluster, unless taints on some nodes prevent the DaemonSet pods from being scheduled there. If there are such taints, it is possible to pass the corresponding tolerations via Helm.

The next steps are as follows:

  • deploy our workload pod in the default namespace
  • verify resource allocation for each container
  • install the topology-aware NRI plugin in kube-system
  • check resource (memory & CPU) allocation again

We will use the example workload.yaml Pod spec below, with Guaranteed, BestEffort, and Burstable containers, as our workload.

apiVersion: v1
kind: Pod
metadata:
  name: multicontainer
spec:
  containers:
  - name: c0-burstable
    image: nginx
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "256Mi"
        cpu: "200m"
    volumeMounts:
    - name: html
      mountPath: /usr/share/nginx/html
  - name: c1-guaranteed
    image: debian
    resources:
      requests:
        memory: "512Mi"
        cpu: "400m"
      limits:
        memory: "512Mi"
        cpu: "400m"
    volumeMounts:
    - name: html
      mountPath: /html
    command: ["/bin/sh", "-c"]
    args:
    - while true; do
        date >> /html/index.html;
        sleep 1;
      done
  - name: c2-besteffort
    image: busybox:latest
    command: ["/bin/sh"]
    args: ["-c", "while true; do sleep 3600; done"]
  volumes:
  - name: html
    emptyDir: {}

Let’s start by checking the currently running pods:

kubectl get pods -A

NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cilium-7csmq 1/1 Running 0 5h44m
kube-system cilium-operator-89b79bd9f-wpvps 1/1 Running 0 5h44m
kube-system coredns-5dd5756b68-7gvs9 1/1 Running 0 5h44m
kube-system coredns-5dd5756b68-fh6nr 1/1 Running 0 5h44m
kube-system etcd-n4c16-generic-fedora37-containerd 1/1 Running 0 5h45m
kube-system kube-apiserver-n4c16-generic-fedora37-containerd 1/1 Running 0 5h45m
kube-system kube-controller-manager-n4c16-generic-fedora37-containerd 1/1 Running 0 5h45m
kube-system kube-proxy-kkl2n 1/1 Running 0 5h44m
kube-system kube-scheduler-n4c16-generic-fedora37-containerd 1/1 Running 0 5h45m

Let’s deploy the workload pod and wait until it enters the Running state:

kubectl apply -f workload.yaml

Next, run our custom script to verify resource allocation for the workload pod. You can find the custom script here.

./verify.sh

* resources for multicontainer:c0-burstable
Cpus_allowed_list: 0-15
Mems_allowed_list: 0-3
* resources for multicontainer:c1-guaranteed
Cpus_allowed_list: 0-15
Mems_allowed_list: 0-3
* resources for multicontainer:c2-besteffort
Cpus_allowed_list: 0-15
Mems_allowed_list: 0-3

Currently, without the topology-aware NRI plugin, all the CPUs are shared between the three containers, which is the default behavior of Kubernetes. Kubernetes is not aware of NUMA nodes and sockets and thus cannot make resource-optimized workload placement decisions. Let’s see how the NRI plugin can help in this particular scenario.

Now, let’s install the topology-aware NRI plugin via Helm. Remember that we will use the nri.patchRuntimeConfig parameter to enable NRI on every node before starting to run the NRI plugin.

helm repo add nri-plugins https://containers.github.io/nri-plugins

helm install topology-aware \
nri-plugins/nri-resource-policy-topology-aware \
--set nri.patchRuntimeConfig=true \
--namespace kube-system
  • topology-aware: the release name we gave to our chart
  • nri-plugins/nri-resource-policy-topology-aware: the official name of the public chart
  • --set nri.patchRuntimeConfig=true: requests NRI to be enabled on the runtime
  • --namespace kube-system: the namespace into which the plugin gets deployed

Let’s check that the plugin pod is running in the kube-system namespace.

kubectl get pods -n kube-system | grep nri

nri-resource-policy-topology-aware-x5tr4 1/1 Running 0 6s

Let’s run our custom script again to check resource allocation after installing the plugin.

./verify.sh
* resources for multicontainer:c0-burstable
Cpus_allowed_list: 4-7
Mems_allowed_list: 1
* resources for multicontainer:c1-guaranteed
Cpus_allowed_list: 8-11
Mems_allowed_list: 2
* resources for multicontainer:c2-besteffort
Cpus_allowed_list: 12-15
Mems_allowed_list: 3

After installing the topology-aware plugin, the resource assignments were reorganized: each container was allocated its own set of CPUs and its own memory zone.

To see all of this in action, check out the following recording.

FAQ

  1. Is the project open source and what license does it have?
    Both nri and nri-plugins are open source projects protected by the Apache 2.0 license.
  2. What benefits do I get by using NRI/NRI-plugins?
    - Possibility to plug in domain or vendor-specific workload resource assignment logic
    - Improved resource isolation
    - Freedom of modifying container resource allocation
    - More granular resource management
    - Improved performance
  3. Do I need to enable NRI in containerd/CRI-O?
    For now, yes, because NRI is currently disabled by default in both runtimes.
  4. How to enable NRI in containerd/CRI-O?
    Follow the Enabling NRI in container runtimes section
  5. Are NRI plugins limited to only resource management?
    No, NRI plugins can do much more than resource management, for example resource isolation.
  6. Does enabling NRI without running NRI plugins do something?
    No, since the actual logic is in the plugin.
  7. Can I run NRI Plugins on managed clusters?
    Yes.
  8. Do I need privileged access to run NRI plugins?
    Yes.
  9. What security aspects should be considered when using NRI?
    From a security perspective NRI plugins should be considered part of the container runtime.
  10. How do I register NRI plugins?
    Registering here means deploying. NRI plugins run as a Kubernetes application, and every publicly available NRI plugin can be installed by simply deploying its Kubernetes objects. All the plugins have their Helm charts published on artifacthub.io.
  11. Can I run multiple NRI Plugins at the same time?
    Yes.
  12. Are NRI plugins production ready?
    Not yet; all the community-maintained plugins are currently at v0.X.X. We are, however, seeking your feedback.
  13. Where are the public NRI plugins published?
    Source code for the reference plugins is in https://github.com/containers/nri-plugins, but you can find the corresponding Helm charts on artifacthub.io.
  14. Can I develop/contribute a new NRI plugin?
    You are more than welcome to develop and contribute your own custom plugin, and the nri-plugins project maintainers would be happy to guide you. As a starting point, you can check a sample plugin like the template and read more in the developer guide here.
  15. Is there a template plugin to help me develop my custom NRI plugin?
    Yes, the nri-plugins project maintainers have created a template plugin, as well as some other simple plugins such as logger and ulimit-adjuster.
  16. Does it make sense to run NRI plugins if my cluster nodes are virtual machines instead of bare metal?
    Yes. NRI plugins can be used beyond resource management, extending their functionality to tackle various issues, such as mitigating the impact of noisy neighbors.
  17. Does it matter what version of Kubernetes I’m running?
    It is more about the containerd or CRI-O version, because NRI is implemented in the runtimes rather than in Kubernetes itself. containerd must be at least v1.7.0 and CRI-O at least v1.26.0.

Summary

In this blog post, we discussed the intricacies of customizing Kubernetes resource management through the Node Resource Interface (NRI). This interface serves as a cross-runtime extension mechanism, empowering users to integrate custom resource assignment algorithms seamlessly. We also walked through the step-by-step process of enabling NRI within container runtimes, followed by an introduction to community-maintained plugins and a discussion of their deployment models. Finally, we addressed frequently asked questions to provide clarity and further insight.
