LVM+QCOW2: creating a perfect CSI driver for shared SAN in Kubernetes

Flant staff · Deckhouse blog · Nov 20, 2023

A few months ago, we were looking for a Kubernetes CSI driver to store virtual machine disks in Deckhouse Virtualization. The driver also needed to support regular Kubernetes containers. The hardware our customers use typically imposes certain technical requirements. In most cases, that’s a classic SAN (Storage Area Network) with external storage and a shared LUN allocated to several different nodes, so multiple virtual machines or containers use the same LUN at the same time.

Among the tasks the driver needed to accomplish was supporting a number of CoW features, such as snapshots, thin provisioning, and the option to live-migrate virtual machines in Kubernetes. Upon examining the existing solutions out there, we realized that none of them provided all the features we needed (although there are a number of Open Source projects well deserving of a mention). On top of that, they all suffered from clear scalability issues.

We are going to talk a lot about virtualization, so I encourage you to read the translation of the article “The Evolution of Network Virtualization Technologies in Linux,” written by the ByteDance technical team, the developers of the VDUSE technology.

In search of a suitable backend

Cluster file systems such as GFS2 and OCFS2 didn’t work for us, since their maximum cluster size is limited to 32 nodes. On top of that, configuring them requires a distributed lock manager (DLM), which depends on Corosync. Architecturally, these are rather old and complex technologies that don’t quite fit into the Kubernetes paradigm.

A POSIX-compliant file system could simplify things somewhat, but it would add another layer of abstraction and thus a performance penalty. And since no suitable Open Source solutions were on offer, we decided to keep looking.

Clustered LVM is the primary alternative to clustered file systems. It can run in cluster mode, is relatively easy to configure with the new lock manager (sanlock), and supports cluster sizes up to 2,000 nodes. Sadly, the only Kubernetes CSI driver we’ve managed to find so far that could handle clustered LVM turned out to be an unofficial Google project maintained by one lone person.
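
To give a sense of what the cluster mode involves in practice, here is a minimal sketch of how a node plugin (CSI drivers are typically written in Go) might carve a volume out of the shared LUN. The device path and the VG/LV names are hypothetical, and it assumes lvmlockd with sanlock is already running on every node.

```go
package main

import (
	"fmt"
	"os/exec"
)

// run executes an LVM command on the node and surfaces its output in the
// error, roughly the way a CSI node plugin would.
func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %v: %s", name, args, err, out)
	}
	return nil
}

func main() {
	// Assumptions: lvmlockd and sanlock are already running on every node,
	// and /dev/mapper/mpatha is the multipath device of the shared LUN.
	steps := [][]string{
		// Create a shared (clustered) VG: --shared tells LVM to coordinate
		// access through lvmlockd instead of assuming exclusive ownership.
		{"vgcreate", "--shared", "vg-san", "/dev/mapper/mpatha"},
		// Start the lockspace for this VG on the current node.
		{"vgchange", "--lockstart", "vg-san"},
		// Carve out a thick LV that will later hold a QCOW2 image.
		{"lvcreate", "-n", "pvc-demo", "-L", "10G", "vg-san"},
		// Activate the LV exclusively on this node.
		{"lvchange", "-aey", "vg-san/pvc-demo"},
	}
	for _, s := range steps {
		if err := run(s[0], s[1:]...); err != nil {
			panic(err)
		}
	}
}
```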

It seems that in developing it, the author went all out: he implemented snapshots, fencing, and even disk RPC, which allows driver components to communicate by sending commands directly through a special partition on the disk device.

By the way, if you’re interested in gaining a better understanding of what clustered LVM is, I recommend this great article which covers the topic.

Another issue is that snapshots in LVM have a profound impact on I/O performance — refer to our benchmark below (see a table of the raw values here):

In this benchmark, we have compared three different technologies for implementing snapshots:

  • classic LVM;
  • LVM Thin (an LVM extension that supports Copy-on-write);
  • QCOW2 (a virtual machine image format used in QEMU).

We used the same benchmarks we mentioned in one of our previous articles. In it, we compared the ways LINSTOR, Ceph, Mayastor, and Vitastor perform in Kubernetes.

The graphs visualize how latency increases and performance declines as snapshots are created for each of the technologies mentioned above. Note that the regular file-based QCOW2 with external snapshots shows consistently steady performance in the same tests. This means that QCOW2 is in fact not too shabby for implementing snapshots, especially when the backend doesn’t support the Copy-on-write mechanism.

Interestingly, in some tests, LVM exhibits faster read speeds when creating snapshots than without creating them. I don’t know why this is. I assume that in the case of non-mapped extents, LVM returns zeros right away without actually reading from the disk.

What we were primarily interested in was write performance, so we decided to go with classic LVM for splitting the physical LUN into virtual volumes and the QCOW2 file format for snapshots and thin-provisioned volumes.

Using a file format without a file system

“But QCOW2 is a file format,” you might say. “Is there any way to use it without a file system?” It turns out there is. While exploring the solutions various vendors provide, we encountered an interesting document. It details a mechanism for implementing thin provisioning in oVirt.

In fact, the authors of this document use LVM as well and simply write QCOW2 over the block device, then set up an additional handler in libvirt to monitor the virtual volume’s size and extend the underlying LV in advance.

That is, LVM is used for volume partitioning, while QCOW2 is used for thin provisioning. Meanwhile, a snapshot is always a separate QCOW2 volume that links to the previous one via the backing-chain mechanism.
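
To illustrate the idea, here is a rough sketch of how such a volume could be prepared and then grown on demand. The LV path and sizes are hypothetical; in a real driver, the growth would be triggered by the storage daemon’s write-threshold event (the same kind of mechanism the oVirt/libvirt handler relies on) rather than called directly.

```go
package main

import (
	"fmt"
	"os/exec"
)

// Hypothetical names: the LV that backs one PVC and the amount we grow it by
// whenever the QCOW2 allocation approaches the end of the device.
const (
	lvPath   = "/dev/vg-san/pvc-demo"
	growStep = "5G"
)

func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %v: %s", name, args, err, out)
	}
	return nil
}

// formatVolume writes a QCOW2 header directly onto the raw LV: no file system
// in between. The virtual size may be much larger than the LV itself, which is
// exactly what gives us thin provisioning.
func formatVolume(virtualSize string) error {
	return run("qemu-img", "create", "-f", "qcow2", lvPath, virtualSize)
}

// onWriteThreshold is what we would call when the storage daemon reports (via
// the QMP BLOCK_WRITE_THRESHOLD event) that the guest has written close to the
// end of the currently allocated space: extend the LV ahead of time.
func onWriteThreshold() error {
	return run("lvextend", "-L", "+"+growStep, lvPath)
}

func main() {
	if err := formatVolume("100G"); err != nil {
		panic(err)
	}
	// In a real driver, a goroutine would listen on the QMP socket, arm the
	// threshold with block-set-write-threshold, and call onWriteThreshold()
	// every time the event fires. Here we only show the extension call.
	if err := onWriteThreshold(); err != nil {
		panic(err)
	}
	fmt.Println("volume formatted and extended")
}
```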

We found this approach to be quite promising: combining versatility and performance. On top of that, technically, you can apply it not only to SANs, but also to regular local disks on nodes.

It sounds doable, but as always, there is a catch. Unlike Kubernetes, the basic entity that oVirt works with is the virtual machine. And it is quite logical that it leverages all the libvirt capabilities to manage the entire VM lifecycle as well as the related systems (networking, storage). Libvirt, on the other hand, runs as a system daemon and has full access to both the virtual machine and the disk subsystem.

Thus, oVirt uses libvirt to handle all the tasks required to set up and run a virtual machine: it sets up the network and storage, and starts the VM itself (and does it all as a unified tool). Kubernetes focuses on containers. To manage them, it provides a whole stack of loosely coupled (by design) interfaces, such as CRI, CSI, and CNI. Each interface has its own, well-defined function: e.g., CSI is responsible for storage, CNI is responsible for networking, while CRI is responsible for runtime and setting up a sandbox to run processes in the container. Therefore, in Kubernetes (in the KubeVirt extension, in particular), libvirt is used solely as a means of starting virtual machines. It runs as a separate process in the container and does not handle network and storage management, relying entirely on CNI and CSI.

That’s why, when developing a CSI driver for Kubernetes, you must keep in mind that the main consumer of the volume is not libvirt but whatever process the runtime starts in the sandbox. That is, the CSI driver’s role should not be limited to serving virtual machines. Instead, the driver should provide the system with a raw block device and ensure that it can be used by standard host OS tools, for example to:

  • Create a file system on the device and mount it (the default option). This usually means a block device on top of which a file system is created; that file system is then mounted into the container as a directory where it can store its files.
  • Forward the block device into the container “as is”. The point is that virtual machines don’t need a file system per se. What they need is a virtual disk, either as a file or as a separate block device. That’s why Kubernetes supports forwarding block devices into the container “as is”, so the virtual machine can work with them directly, without a file system interlayer.

But how do you “mount” a QCOW2 file so that the host system sees it as a raw block device? Currently, one of the easiest and most commonly used ways to access the contents of a QCOW2 without starting the virtual machine is qemu-nbd. But if you dig a little deeper, you’ll find that qemu-nbd has serious limitations and is not designed for anything more than one-off tasks or debugging (like pulling a file out of an image). For example, it does not let you create and delete snapshots or resize the block device on the fly.
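
For reference, the qemu-nbd approach is essentially a two-command affair. The sketch below (with hypothetical paths) shows how a driver might attach and detach an image; it requires the nbd kernel module and offers none of the dynamic management we were after.

```go
package main

import "os/exec"

func main() {
	// Requires the nbd kernel module to be loaded (modprobe nbd) and a free
	// /dev/nbdN device; the image path is hypothetical.
	attach := exec.Command("qemu-nbd", "--connect=/dev/nbd0",
		"/var/lib/volumes/pvc-demo.qcow2")
	if err := attach.Run(); err != nil {
		panic(err)
	}

	// ... /dev/nbd0 can now be used as a raw block device ...

	detach := exec.Command("qemu-nbd", "--disconnect", "/dev/nbd0")
	if err := detach.Run(); err != nil {
		panic(err)
	}
}
```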

Then there is a more advanced solution: qemu-storage-daemon. This is a standalone daemon, split out of QEMU, that handles the storage subsystem only. You can communicate with it over a Unix socket using the QMP protocol and dynamically execute various commands like “open file”, “close file”, “create snapshot”, “export device”, and so on.
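
To show what that interaction looks like, here is a hedged sketch of a QMP session with qemu-storage-daemon over its Unix socket. The socket path, node names, and image paths are hypothetical; blockdev-add and blockdev-snapshot-sync are real QMP commands, but a production driver would use a proper QMP client and handle replies and events far more carefully.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net"
)

// qmpCmd is the generic shape of a QMP command.
type qmpCmd struct {
	Execute   string      `json:"execute"`
	Arguments interface{} `json:"arguments,omitempty"`
}

func main() {
	conn, err := net.Dial("unix", "/run/qsd/pvc-demo.sock")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	reader := bufio.NewReader(conn)
	enc := json.NewEncoder(conn)

	send := func(c qmpCmd) {
		if err := enc.Encode(c); err != nil {
			panic(err)
		}
		reply, _ := reader.ReadString('\n') // naive: assume one JSON line per reply
		fmt.Print(reply)
	}

	reader.ReadString('\n')                   // consume the QMP greeting banner
	send(qmpCmd{Execute: "qmp_capabilities"}) // enter command mode

	// "Open file": attach the QCOW2 that lives directly on the LV.
	send(qmpCmd{Execute: "blockdev-add", Arguments: map[string]interface{}{
		"driver":    "qcow2",
		"node-name": "disk0",
		"file": map[string]string{
			"driver":   "host_device",
			"filename": "/dev/vg-san/pvc-demo",
		},
	}})

	// "Create snapshot": redirect new writes into an external QCOW2 overlay.
	send(qmpCmd{Execute: "blockdev-snapshot-sync", Arguments: map[string]interface{}{
		"node-name":          "disk0",
		"snapshot-file":      "/dev/vg-san/pvc-demo-snap1",
		"snapshot-node-name": "snap1",
		"format":             "qcow2",
	}})
}
```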

By the way, the KubeVirt community has already made an attempt to write a CSI-driver using qemu-storage-daemon. However, the project focuses on providing COW on top of a regular file system, while we emphasize SAN support and the cluster mode.

However, this is far from the only option. Linux kernel 6.0 introduced the ublk (io_uring-based userspace block device driver) subsystem. You can attach a QCOW2 file to it using ubdsrv: the kernel will treat it as a raw block device, and the I/O path has fewer layers to pass through.

At that point, we were able to test ublk and compare it to other attachment methods. Let me assure you: it is one of the most efficient methods so far. Unfortunately, it is still experimental and was not yet supported by qemu-storage-daemon at the time of writing this article.
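
For comparison, here is roughly what attaching an image through ubdsrv looks like. Treat the target name and flags as assumptions: they are based on ubdsrv’s qcow2 demo target and may differ between versions, and the image path is hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Requires Linux 6.0+ with the ublk_drv module loaded. The target name
	// and flags below are assumptions and may vary across ubdsrv releases.
	out, err := exec.Command("ublk", "add", "-t", "qcow2", "-f",
		"/var/lib/volumes/pvc-demo.qcow2").CombinedOutput()
	if err != nil {
		panic(fmt.Errorf("ublk add: %v: %s", err, out))
	}
	// On success the kernel exposes a /dev/ublkbN block device.
	fmt.Printf("%s", out)
}
```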

Exports in qemu-storage-daemon

The qemu-storage-daemon supports the following methods for exporting virtual devices to the system: NBD, FUSE, vhost-user, and VDUSE.

While NBD is fairly well-known, seeing FUSE among these can be a bit confusing, since its primary job is to act as a userspace file system interface, not a block device interface. In Linux, however, a file system can be mounted not only on a directory but on a regular file as well, which in a way makes it analogous to a virtualized block device. Granted, this is not quite the way to go, since the idea is to end up with a genuine block device that Kubernetes can grant to our container through the device cgroup and mount into it.

Vhost-user is also a fairly niche protocol that allows you to establish a direct communication channel between the virtual machine and the storage system (when they are two separate processes). This way, the virtual machine does not require a separate controller to translate VirtIO requests; they can be passed straight to the storage instead. On the host side, this type of export looks like a regular Unix socket that is simply passed to the QEMU process the virtual machine is running in. Given the lack of any interlayers, vhost-user is supposed to be the most efficient protocol for connecting storage to virtual machines.
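
To make this more concrete, here is a hedged sketch of exporting a volume over vhost-user-blk from qemu-storage-daemon. The paths and IDs are hypothetical, and the QEMU side is shown only as a comment, since in our case KubeVirt would be the one starting it.

```go
package main

import "os/exec"

func main() {
	// Export the QCOW2 (sitting directly on the LV) over a vhost-user socket.
	qsd := exec.Command("qemu-storage-daemon",
		"--blockdev", "host_device,node-name=file0,filename=/dev/vg-san/pvc-demo",
		"--blockdev", "qcow2,node-name=disk0,file=file0",
		"--export", "vhost-user-blk,id=vub0,node-name=disk0,writable=on,"+
			"addr.type=unix,addr.path=/run/qsd/pvc-demo-vhost.sock",
	)
	// The VM side (QEMU) would then consume the socket with something like:
	//   -object memory-backend-memfd,id=mem0,size=4G,share=on -numa node,memdev=mem0
	//   -chardev socket,id=vub0,path=/run/qsd/pvc-demo-vhost.sock
	//   -device vhost-user-blk-pci,chardev=vub0
	if err := qsd.Start(); err != nil {
		panic(err)
	}
}
```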

However, KubeVirt does not currently support vhost-user for the storage subsystem, while our goal is to build a universal solution that works with both virtual machines and containers. So we decided not to rely solely on vhost-user, but just add the option to use it, assuming that KubeVirt will introduce the corresponding interface in the future.

For the time being, we decided to base our solution on the latest VDUSE protocol, which was recently added to the qemu-storage-daemon. It is essentially an interface for the vDPA bus in the Linux kernel.

If you are not yet familiar with these virtualization technologies, I strongly recommend reading our translation of the article “The Evolution of Network Virtualization Technologies in Linux”.

What is vDPA?

The vDPA (VirtIO Data Path Acceleration) technology in the Linux kernel enables device drivers to provide a fully VirtIO-compatible interface (vhost). This way, the virtual machine can communicate directly with physical devices, without any need to create an extra control plane to translate system calls.

In practice, the vDPA bus has a backend and a frontend. The backend can be either a device driver or VDUSE (vDPA Device in Userspace). The latter is another Linux kernel module that shifts the vDPA backend completely into userspace. VDUSE creates a character device under /dev/vduse/ so that software-defined storage or networking implementations running in userspace can communicate with it. qemu-storage-daemon is the only implementation of this storage interface that I know of at the moment. As of version 7.1.0, QSD can export a QCOW2 file to VDUSE, making it available on the vDPA bus as a ready-to-connect device.

As for the frontend, there are two choices (implemented as kernel modules as well):

  • vhost-vdpa is used for virtual machines. From the OS perspective, the exported device resembles another character device (e.g., /dev/vhost-vdpa-0) that can be forwarded to a QEMU process or other application that can handle the vhost protocol.
  • virtio-vdpa is used for containers. The device shows up as a regular block device (e.g., /dev/vda), just as it would on bare metal.
Source: article on infoq.cn (in Chinese)

Therefore, the basic implementation of our CSI driver is supposed to export a chain of QCOW2 files as a VirtIO block device and funnel that device into a container, where it can back both virtual machines and regular containers.
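
Here is a hedged sketch of that VDUSE path for a single volume: qemu-storage-daemon opens the QCOW2 chain and exposes it as a VDUSE device, the vdpa tool registers it on the bus, and the virtio-vdpa (or vhost-vdpa) frontend makes it usable. Paths and names are hypothetical, and error handling is kept to a minimum.

```go
package main

import "os/exec"

func run(name string, args ...string) error {
	return exec.Command(name, args...).Run()
}

func main() {
	// 1. Start QSD (7.1.0+) and create a VDUSE device named "pvc-demo".
	qsd := exec.Command("qemu-storage-daemon",
		"--blockdev", "host_device,node-name=file0,filename=/dev/vg-san/pvc-demo",
		"--blockdev", "qcow2,node-name=disk0,file=file0",
		"--export", "vduse-blk,id=vduse0,node-name=disk0,writable=on,name=pvc-demo",
	)
	if err := qsd.Start(); err != nil {
		panic(err)
	}

	// 2. Register the VDUSE device on the vDPA bus.
	if err := run("vdpa", "dev", "add", "name", "pvc-demo", "mgmtdev", "vduse"); err != nil {
		panic(err)
	}

	// 3. With the virtio-vdpa module loaded, the kernel now exposes a regular
	//    virtio block device (e.g. /dev/vdX) that can be handed to a container;
	//    with vhost-vdpa loaded instead, a /dev/vhost-vdpa-N appears for QEMU.
}
```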

Upon benchmarking VDUSE, we found that it performs at least as well as ublk and in some cases even outperforms it (see the results):

Implementing ReadWriteMany for QCOW2

It seemed that we were nearing the logical culmination of our journey when we stumbled upon another interesting point related to concurrent data access. Recall that on the driver side, we must provide a block device or file system regardless of whether it is used by a virtual machine or a container.

Meanwhile, virtual machines must be able to migrate between hosts (so the driver must support ReadWriteMany). This raised a reasonable question: “How do you implement ReadWriteMany for QCOW2, when QEMU itself does not support this feature?” When live-migrating a virtual machine, QEMU simply drops all the caches during the switchover, and the virtual machine continues running against a freshly reopened QCOW2 on the destination host.

This seemed to pose a serious problem, since in the Kubernetes world the disk subsystem must be clearly separated from the workloads: the former provides a block device, and the latter uses it. That is, on the CSI driver side, we cannot rely on the “live migration finished” event to reset the caches, because the driver knows nothing about the virtual machine or any other process using the device in the container.

But we were lucky enough to find a solution: setting the cache.direct parameter to true makes QEMU bypass the host page cache entirely, allowing us to sidestep the aforementioned limitation.

Thus, when the volume is attached to a single node only, the parameter keeps its default value of false for best performance. However, when a second attachment is required, the parameter is switched to true. This degrades the disk subsystem’s performance during the migration, but it avoids the problems associated with cache invalidation.
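
In terms of the qemu-storage-daemon configuration, this toggle boils down to a single cache.direct property on the QCOW2 block node. Below is a minimal sketch of how our driver might build that option; the node names are hypothetical, and the rest of the daemon invocation is omitted.

```go
package main

import "fmt"

// blockdevArgs builds the --blockdev value for the QCOW2 node. With
// multiAttach=true we set cache.direct=on (O_DIRECT, bypassing the host page
// cache), which is what makes attaching the volume from a second node during
// live migration safe; otherwise we keep the default cached mode for speed.
func blockdevArgs(multiAttach bool) string {
	direct := "off"
	if multiAttach {
		direct = "on"
	}
	return fmt.Sprintf("qcow2,node-name=disk0,file=file0,cache.direct=%s", direct)
}

func main() {
	fmt.Println("--blockdev", blockdevArgs(false)) // normal single attachment
	fmt.Println("--blockdev", blockdevArgs(true))  // while a live migration is in flight
}
```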

I must stress that this is not a full-fledged ReadWriteMany: if you write from both hosts at the same time, you risk corrupting the structure of the QCOW2 file. Still, this solution is enough to guarantee the smooth live migration of virtual machines in Kubernetes.

Conclusion

This article has detailed how you can enjoy all the features of thin provisioning with little to no loss in performance using clustered LVM and the QCOW2 file format. With these tools, you can implement a fast and, most importantly, universal driver for connecting and using any SAN-like storage system in Kubernetes.

That is all for now. We decided on a technology stack and went ahead with developing the driver.
