GPU-powered Kubernetes clusters

A comprehensive and incremental hands-on guide

Gleb Vazhenin
Bumble Tech

--

GPUs (graphics processing units) are specialised hardware devices designed to perform rapid, highly parallel calculations, which makes them particularly well-suited for training machine learning models. They can also be used to speed up other workloads, such as complex simulations or data-intensive tasks.

Kubernetes is an open-source platform for deploying, scaling, and managing containerised applications in a consistent and efficient manner. It now also supports graphics cards, providing a convenient and transparent way to schedule GPU resources for ML workloads.

At Bumble Tech, we’ve created a GPU-powered Kubernetes cluster to schedule ML workloads (such as model training) and perform fast real-time inference.

Why GPUs?

While CPUs process many general tasks quickly in a largely sequential manner, GPUs use parallel computing to break massively complex problems down into many smaller, simultaneous calculations. This makes them ideal for the distributed computation required by machine learning: a huge number of processing units can run calculations in parallel. For example, an NVIDIA A100 GPU has 6,912 CUDA cores that can execute tasks simultaneously. In most cases, it would cost more to set up CPU-based infrastructure that achieves the same parallel performance (a single workstation CPU usually has no more than 128 cores).

GPUs are a great fit for speeding up model training and reducing inference latency.

In most cases, GPUs outperform CPUs in both training and serving. We found the summary prepared by AI-Benchmark useful: it provides detailed latency benchmarks across different network architectures, batch sizes and resolutions for many well-known processing units, including both CPUs and GPUs.

GPU serving

Many of the available open-source serving frameworks ship with native GPU support. At Bumble Tech, among the different serving frameworks, we use Triton Inference Server on KServe to attach GPUs to our models.

However, it is important to note that GPU-based serving isn't always the fastest option. Situations in which you might prefer CPU-based serving over GPU-based serving include:

  • Low-RPS (requests per second) inference without dynamic batching, where transferring the data to the GPU takes more time than the inference itself (see Data Transfer Matters for GPU Computing).
  • Real-time inference for algorithms that don't parallelise easily, e.g. recurrent neural networks.
  • Cases where a CPU simply gives a better trade-off in terms of cost per RPS.
  • Inference on power-limited devices such as phones and embedded computers.

Sometimes, much better performance can be achieved with GPU-based serving than with CPU-based serving. Still, an MLOps engineer should take SLA requirements and costs into account before making a decision. When it isn't obvious which processing unit is better for a particular task (considering the situations above), a proper load test should be run.

GPU training

GPUs can be used for both model serving and model training. While CPUs aren't considered as efficient for data-intensive machine learning workloads, they are still a cost-effective option in cases where using a GPU isn't ideal, such as:

  • Some algorithms are optimised for CPUs rather than GPUs.
  • When you intend to train inherently sequential (non-parallel) algorithms, GPUs may not be the best option (put simply, "P-complete" programs are not meant to use GPUs).
  • When working with datasets or models that don't fit into GPU memory (e.g. a recommender system with large embedding layers).

Outside of these scenarios, you can significantly improve performance when using GPUs for training.

NVIDIA GPUs on Kubernetes cluster: Installation guide

1. Prerequisites

  • A running Kubernetes cluster. Follow another article from the MLOps at Bumble series to find out more about how we set up our GitOps approach.
  • At least one worker node with a GPU connected to the cluster.

2. Install NVIDIA drivers on worker hosts

A GPU isn't a fully autonomous device: it requires drivers and a management tool to be installed on the host machine in order to operate properly.

First, download and run the suitable driver installer from the NVIDIA website. For example, to install driver version 525.60.11 for Linux x86_64:

wget https://us.download.nvidia.com/XFree86/Linux-x86_64/525.60.11/NVIDIA-Linux-x86_64-525.60.11.run && chmod +x NVIDIA-Linux-x86_64-525.60.11.run
./NVIDIA-Linux-x86_64-525.60.11.run

After unpacking, the interactive installer will walk you through the installation. It will also suggest removing the existing driver if there is one.

After installation, check that your device is visible via the nvidia-smi tool:

nvidia-smi output
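
For example, the following commands should list the installed card once the driver is loaded:

nvidia-smi       # full status table with driver version, CUDA version and detected GPUs
nvidia-smi -L    # one line per detected GPU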

3. NVIDIA GPU Operator

Now that our host machines are ready, let’s look at the NVIDIA GPU Operator.

The GPU Operator allows administrators of Kubernetes clusters to manage GPU nodes just like CPU nodes in the cluster. Instead of provisioning a special OS image for GPU nodes, administrators can rely on a standard OS image for both CPU and GPU nodes and then rely on the GPU Operator to provide the required software components for GPUs.

In general, installing just the device plugin is enough to run Kubernetes applications with GPU support. However, we recommend installing the full GPU Operator, as it comes with a number of useful additional components that help with operating GPU workloads at scale. They include:

  • k8s-device-plugin — Device Plugin for Kubernetes
  • container-toolkit — Set of tools to run GPU operated containers
  • dcgm-exporter — DCGM exporter used for monitoring and telemetry
  • gpu-feature-discovery — a component that allows you to automatically generate labels for the set of GPUs available on a node
  • mig-manager — component capable of repartitioning GPUs into different MIG configurations

The GPU Operator requires Helm to be installed; for installation tips, see the official documentation. You can install an out-of-the-box operator using the following chart and values:

Chart.yaml
values.yaml

Though there is a gpu-operator component that can manage the NVIDIA drivers, we prefer managing them separately through Puppet.
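
As a rough sketch (the wrapper chart name and operator version are placeholders, and the values keys assume the upstream gpu-operator chart), the two files could look like this:

# Chart.yaml: wrapper chart that pulls in the NVIDIA GPU Operator as a dependency
apiVersion: v2
name: gpu-operator-deployment
version: 1.0.0
dependencies:
  - name: gpu-operator
    version: v23.3.2                               # example; pin to the operator release you need
    repository: https://helm.ngc.nvidia.com/nvidia

# values.yaml: values passed down to the gpu-operator subchart
gpu-operator:
  driver:
    enabled: false                                 # drivers are installed on the hosts (via Puppet), not by the operator
  toolkit:
    enabled: true                                  # let the operator install the NVIDIA container toolkit

After running helm dependency update, the wrapper chart can be installed into a dedicated gpu-operator namespace with helm install.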

After deploying this Helm chart, ensure all the containers are running and the validators have completed:

gpu-operator pods
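
A quick way to check this from the command line:

kubectl get pods -n gpu-operator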

4. (Optional) MIG Management

The NVIDIA GPU Operator comes with a MIG (Multi-Instance GPU) manager that allows GPUs to be partitioned. According to the NVIDIA guide, the following GPUs support MIG:

  • H100-SXM5
  • H100-PCIE
  • A100-SXM4-80GB
  • A100-SXM4-40GB
  • A100-PCIE-80GB
  • A100-PCIE-40GB
  • A30

To apply a MIG configuration, you simply need to label the node with the name of an existing configuration. However, changing the MIG profile is only supported when the GPU is free of load.

Here's an example of how we apply the 1g.5gb MIG profile configuration to all GPUs on the node ds-node1 (A100-PCIE-40GB):

kubectl label nodes ds-node1 nvidia.com/mig.config=all-1g.5gb

After a short while, we can see the repartitioned GPU:

nvidia-smi output
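
You can also confirm that the new MIG devices are advertised to Kubernetes by inspecting the node's resources (the exact resource names depend on the MIG strategy configured in the operator):

kubectl describe node ds-node1 | grep nvidia.com/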

The MIG manager uses the default-mig-parted-config ConfigMap in the GPU Operator namespace to define the supported MIG profiles. However, you can specify your own profiles as well. We encourage everyone to follow GitOps practices, but for the sake of this specific example:

1. Create a new mig-parted-config ConfigMap based on the default one. The default-mig-parted-config ConfigMap can also be found in the gpu-operator namespace, but any manual changes to it will be overwritten by the operator, so a separate ConfigMap should be created instead.

2. Open the new ConfigMap for editing:

kubectl edit cm -n gpu-operator mig-parted-config

3. Add your desired configuration:

additional MIG config
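
As an illustration of the expected format (a sketch based on the mig-parted config layout; this profile splits each A100-40GB into three 1g.5gb and two 2g.10gb instances):

version: v1
mig-configs:
  all-3x1g.5gb_2x2g.10gb:
    - devices: all              # apply to every GPU on the node
      mig-enabled: true
      mig-devices:
        "1g.5gb": 3
        "2g.10gb": 2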

4. Apply new configuration:

kubectl label nodes ds-node1 nvidia.com/mig.config=all-3x1g.5gb_2x2g.10gb

5. GPU scheduling

Once the GPU Operator components are up and running, we can schedule a workload. First, let's check how the gpu-feature-discovery component has labelled our nodes (we're particularly interested in the nvidia.com/gpu.product label value):

node labels
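
One way to check this from the command line:

kubectl get nodes -L nvidia.com/gpu.product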

Schedule a pod that’s going to utilise the NVIDIA A100 GPU:

pod-a100.yaml
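
A minimal manifest along these lines (the image is an example, and the product label value should be taken from your own node's nvidia.com/gpu.product label):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-a100
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-PCIE-40GB   # value reported by gpu-feature-discovery
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04   # example CUDA sample image
      resources:
        limits:
          nvidia.com/gpu: 1   # request one whole GPU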

If your GPU is MIG-partitioned, you can schedule a specific MIG partition; the manifest would be the following:

pod-a100-mig.yaml
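
A sketch assuming the mixed MIG strategy, where each MIG profile is exposed as its own extended resource:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-a100-mig
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04   # example CUDA sample image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # request a single 1g.5gb MIG slice instead of a whole GPU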

Go to the assigned node and check that nvidia-smi shows the running process:

GPU scheduled pod

The assigned node is ds-node1. Running nvidia-smi on this node shows that the GPU is busy running the sample task:

nvidia-smi output

We've now completed the task of running a GPU application on the Kubernetes cluster. Moreover, we now have a full toolkit for running applications at scale, including feature discovery, a metrics exporter (dcgm-exporter), and the multi-instance GPU manager.

6. GPU monitoring

In order to gather GPU metrics, we need to add a custom scrape config. If you're running Prometheus as part of kube-prometheus-stack, simply add the following to additionalScrapeConfigs, ensuring your gpu-operator is running in the gpu-operator namespace:
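
A sketch along the lines of the scrape job in the NVIDIA documentation (the job name and scrape interval are examples):

additionalScrapeConfigs:
  - job_name: gpu-metrics
    scrape_interval: 1s
    metrics_path: /metrics
    scheme: http
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - gpu-operator                       # namespace where the dcgm-exporter endpoints live
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: kubernetes_node            # attach the node name to every GPU metric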

For more information, see the official NVIDIA documentation.

Let’s check that metrics are being successfully delivered to Prometheus:

DCGM_FI_DEV_GPU_TEMP{modelName="NVIDIA A100-PCIE-40GB"}
DCGM GPU Metrics in Prometheus

We've also prepared a Grafana dashboard that monitors GPU load, requests, and limits. Combined with the NVIDIA DCGM Exporter dashboard below, you'll have full visibility of your cluster's GPUs, as well as more convenient capacity planning.

GPU Capacity Dashboard

NVIDIA DCGM Exporter Dashboard

7. Bin Packing (Bonus)

When it comes to large-scale operation, you might face a bin packing problem. Imagine you have 10 nodes with 4 GPUs installed in each, for a total of 40 GPUs. If no affinity or taints are specified, the kube-scheduler will place GPU pods without regard to packing them tightly, sometimes leading to a situation where you can't schedule 4 GPUs simultaneously for a single pod (at least one GPU is busy on every node).

Kubernetes 1.24 introduces the MostAllocated scoring strategy for bin packing, which is intended to solve this problem. The MostAllocated strategy scores nodes based on their resource utilisation, favouring the ones with higher allocation. For each resource type, you can set a weight to modify its influence on the node score. If we're only considering GPUs, the KubeSchedulerConfiguration would be the following:

scheduler-conf.yaml
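
A sketch of such a configuration (the scheduler name is an example, and the API version depends on your Kubernetes release):

apiVersion: kubescheduler.config.k8s.io/v1beta3   # kubescheduler.config.k8s.io/v1 on Kubernetes 1.25+
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpu-binpacking-scheduler       # example name for an additional scheduler profile
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated                   # favour nodes whose GPUs are already partially allocated
            resources:
              - name: nvidia.com/gpu
                weight: 1                         # add cpu/memory with their own weights if they should also affect the score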

With that config, kube-scheduler will pack GPU nodes until GPUs are fully allocated, meaning you’ll have multiple (4, in our case) GPUs available simultaneously — as long as you’re not operating at the resource limits. Kubernetes also allows you to have multiple schedulers, which makes it easier to operate in a mixed CPU/GPU environment.

Conclusion

GPUs are an integral element of a Data Science team's infrastructure. Using Kubernetes can help teams utilise GPUs in an optimal manner. A streamlined process for scheduling GPU usage is crucial to enhancing the productivity of the Data Science team, as well as to reducing operational expenses.

Given that this area is relatively new, there isn't much shared knowledge available. Nevertheless, the NVIDIA GPU Operator significantly simplifies the deployment of all the necessary components. The guide above should assist with the setup process; I have sought not only to provide the necessary instructions but also to highlight potential challenges that may arise.

We would love to hear more about your approach to handling extensive ML workloads. Let us know what you think in the comments section, or reach out to me on LinkedIn.

Special thanks to Andrei Potapkin and Bumble MLE team.

Disclaimer: None of the external links above are endorsed by us and are used merely as examples and what the author found useful to share.
