The internals and the latest trends of container runtimes (2023)

Akihiro Suda

Published in nttlabs, Jun 21, 2023

Last week I had an opportunity to give an online lecture about containers to students at Kyoto University.

The slide deck can be found here (PDF):

Contents:

Introduction to containers
Internals of container runtimes
Latest trends in container runtimes

1. Introduction to containers

What are containers?

Containers are a set of various lightweight methods to isolate filesystems, CPU resources, memory resources, system permissions, etc. Containers are similar to virtual machines in many senses, but they are more efficient and often less secure than virtual machines. (Slide 5)

An interesting thing is that there is still no strict definition of “containers”. Even virtual machines can be called "containers" when they provide container-like interfaces, e.g., when they implement the OCI (Open Container Initiative) specs. Such "non-container" containers are discussed later in Section 3.

Docker

Docker is the most popular container engine. Docker natively supports Linux containers and Windows containers, but Windows containers are out of the scope of this talk.

A typical command line to start a Docker container is as follows:

docker run -p 8080:80 -v .:/usr/share/nginx/html nginx:1.25

After executing this command, the content of `index.html` in the current directory will be visible at http://<the host’s IP>:8080/ .

The `-p 8080:80` part of the command line forwards TCP port 8080 on the host to port 80 in the container.

The `-v .:/usr/share/nginx/html` part mounts the current directory on the host onto `/usr/share/nginx/html` in the container.

The `nginx:1.25` part specifies the official nginx image on Docker Hub. Docker images are somewhat similar to virtual machine images, but they usually do not contain additional daemons such as systemd and sshd.

You can find official images for other applications on Docker Hub too. You can also build your own images, using a language called Dockerfile:

FROM debian:12
RUN  apt-get update && apt-get install -y openjdk-17-jre
COPY myapp.jar /myapp.jar
CMD  ["java", "-jar", "/myapp.jar"]

An image can be built with the `docker build` command, and can be pushed to Docker Hub or other registry services with the `docker push` command.

Kubernetes

Kubernetes clusters multiple container hosts such as (but not limited to) Docker hosts to provide load balancing and fault tolerance (Slide 10).

It is noteworthy that Kubernetes is also an abstraction framework for interacting with objects such as Pods (groups of containers that are always co-scheduled on the same host), Services (entities for network connectivity), and any other kind of object, but this aspect is beyond the scope of this talk.

Docker vs pre-Docker containers

While containers didn't get much attention until the release of Docker in 2013, Docker wasn’t the first container platform:

1999: FreeBSD Jail
2000: Virtual Environment system for Linux (precursor to Virtuozzo and OpenVZ)
2001: Linux Vserver
2002: Virtuozzo
2004: BSD Jail for Linux
2004: Solaris Containers (Apparently, the term "container" was coined at this time)
2005: OpenVZ
2008: LXC
2013: Docker

It is widely considered that FreeBSD Jail (circa 1999) is the first practical container implementation for Unix-like operating systems, although the term "container" wasn't coined at that time.

Since then, several implementations appeared for Linux too. However, pre-Docker containers were fundamentally different from Docker containers; they focused on mimicking an entire machine with System V init, sshd, syslogd, etc., inside it. It was also common to put a Web server, an application server, a database server, and everything else into a single container.

Docker changed the paradigm. In the case of Docker, a container usually contains only a single service (Slide 14) so that containers can be stateless and immutable. This design significantly reduces maintenance costs, as containers are now disposable: when something needs to be updated, you can just remove the container and recreate it from the latest image. You no longer need to install sshd and other utilities inside the container either, as you never need shell access to it. This also simplifies load balancing and fault tolerance for multi-host clusters.

2. Internals of container runtimes

This section assumes Docker v24 with its default configuration, but most parts are applicable to non-Docker containers too.

Docker under the hood

Docker consists of the client program (`docker` CLI) and the daemon program (`dockerd`). The `docker` CLI connects to the `dockerd` daemon via a Unix socket (`/var/run/docker.sock`) to create containers.

However, the `dockerd` daemon doesn't create containers by itself. It delegates control to the `containerd` (/container-dee/) daemon to create containers (Slide 17). But `containerd` doesn't create containers either; it further delegates control to the `runc` (/run-see/) runtime, which composes multiple Linux kernel features such as Namespaces, Cgroups, and Capabilities to implement the concept of "containers". There is no "container" object in the Linux kernel.
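The first hop of this chain (CLI to daemon) is plain HTTP over the Unix socket, so it can be sketched without the `docker` CLI at all. The following is a minimal illustration, not how the real CLI is implemented: the helper names `build_request` and `query_daemon` are made up for this example, and it assumes that the daemon listens on the default socket path and that the Engine API's `/_ping` endpoint is reachable by the current user.

```python
import socket

DOCKER_SOCK = "/var/run/docker.sock"  # default socket path mentioned in the text


def build_request(path):
    """Build a minimal HTTP/1.0 GET request for the Docker Engine API."""
    # HTTP/1.0 makes the daemon close the connection after one response.
    return f"GET {path} HTTP/1.0\r\nHost: docker\r\n\r\n".encode("ascii")


def query_daemon(path="/_ping", sock_path=DOCKER_SOCK):
    """Send one request over the Unix socket and return the raw response text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(build_request(path))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # the daemon closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling `query_daemon()` on a host where `dockerd` is running should return an HTTP response from the daemon; on a host without Docker it raises an `OSError`.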

Namespaces

Namespaces isolate resources from the host and from other containers.

The most well-known namespaces are mount namespaces (Slide 19). Mount namespaces isolate the filesystem view so that a container can change the rootfs to `/var/lib/docker/.../<container's rootfs>` using the `pivot_root(2)` syscall. This syscall is similar to traditional `chroot(2)` but more secure.

The container's rootfs has a structure very similar to the host's, but it has several restrictions on `/proc`, `/sys`, and `/dev`. For example:

The `/proc/sys` directory is remounted as a read-only bind mount to prohibit sysctl.
The `/proc/kcore` file (RAM) is masked by mounting `/dev/null` over it.
The `/sys/firmware` directory (firmware data) is masked by mounting an empty read-only tmpfs over it.
Access to device nodes in the `/dev` directory is restricted by Cgroups (discussed later).

Network namespaces (Slide 21) allow assigning dedicated IP addresses to containers so that they can talk to each other by IP.

PID namespaces (Slide 23) isolate process trees so that a container can't control processes outside it.

User namespaces (Slide 24; not to be confused with "user spaces") isolate the root privilege by mapping a non-root user on the host to the pseudo "root" in a container. The pseudo root can behave like the root in the container to run `apt-get`, `dnf`, etc., but it doesn't have privileged access to resources outside the container.

User namespaces significantly mitigate potential container breakout attacks, but they are still not used by default in Docker.
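The mapping between the pseudo root and a non-root host user can be sketched as a lookup over the entries of `/proc/<pid>/uid_map`, where each line holds three numbers: the first ID inside the namespace, the first ID outside, and the length of the range. The function names and the example range starting at host UID 100000 are assumptions made for illustration only.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(x) for x in line.split())
            entries.append((inside, outside, length))
    return entries


def to_host_id(container_id, id_map):
    """Translate an ID inside the namespace to the host ID, or None if unmapped."""
    for inside, outside, length in id_map:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


# Hypothetical map: container UIDs 0..65535 are backed by host UIDs 100000..165535.
EXAMPLE_MAP = parse_id_map("0 100000 65536\n")
```

With this example map, UID 0 (the pseudo root) in the container is an unprivileged UID on the host, which is why a breakout from such a container does not yield the host's root.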

Other namespaces:

IPC namespaces: Isolate System V inter-process communication objects, etc.
UTS namespaces: Isolate the hostname. "UTS" (Unix Time Sharing system) seems to be a misnomer for this namespace.
(Optional) Cgroup namespaces: Isolate the `/sys/fs/cgroup` hierarchy.
(Optional) Time namespaces: Isolate clocks. Not used by most containers yet.
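The namespaces a process belongs to can be inspected through the symbolic links under `/proc/<pid>/ns`, whose targets look like `mnt:[4026531841]`. Two processes are in the same namespace when these inode numbers match. Below is a minimal sketch for Linux hosts; the function names are made up for this example.

```python
import os
import re

# Link targets look like "mnt:[4026531841]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, inode number)."""
    m = NS_LINK.match(target)
    if m is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return m.group("kind"), int(m.group("inode"))


def list_namespaces(pid="self"):
    """Map each entry under /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result
```

Comparing `list_namespaces()` on the host with the output for a containerized process should show different inode numbers for `mnt`, `net`, `pid`, etc., and, under Docker's default configuration, the same number for `user`.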

Cgroups

Cgroups (control groups) impose several resource quotas such as CPU usage, memory usage, block I/O, and number of processes in a container.

Cgroups also control access to device nodes. The default configuration of Docker allows unlimited access to `/dev/null`, `/dev/zero`, `/dev/urandom`, etc., and disallows access to `/dev/sda` (disk devices), `/dev/mem` (memory), etc.
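To make the quotas concrete, here is a small sketch that interprets the contents of the cgroup v2 interface files `cpu.max`, `memory.max`, and `pids.max`. It assumes a cgroup v2 host (the text above does not say which cgroup version is in use), and the function names are made up for illustration.

```python
def parse_cpu_max(text):
    """Return the CPU limit in cores from cgroup v2 cpu.max, or None if unlimited.

    The file holds "<quota> <period>" in microseconds, or "max <period>".
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_limit(text):
    """Parse memory.max or pids.max: "max" means unlimited (None)."""
    value = text.strip()
    return None if value == "max" else int(value)
```

For example, a `cpu.max` of `50000 100000` means the container may use 50 ms of CPU time per 100 ms period, i.e., half a core.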

Capabilities

On Linux, the root privilege is represented by a 64-bit capability flag set. 41 bits are in use today.

The default configuration of Docker drops system-wide administration capabilities such as `CAP_SYS_ADMIN`.

The retained capabilities include:

`CAP_CHOWN`: for running `chown` inside containers.
`CAP_NET_BIND_SERVICE`: for binding TCP and UDP ports below 1024 inside containers.
`CAP_NET_RAW`: for running legacy `ping` implementations that need to craft raw Ethernet packets. This capability is quite dangerous, as it allows ARP spoofing and DNS spoofing in the container's network. A future version of Docker may disable it by default.
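The capability flag set is visible as a hexadecimal mask in the `CapEff` line of `/proc/<pid>/status`. The sketch below decodes such a mask for the four capabilities named in this section; the bit numbers come from `<linux/capability.h>`. The example mask is an assumed value of the kind typically seen in a container with Docker's default configuration, so read the real value in your own environment rather than relying on it.

```python
# Capability numbers from <linux/capability.h>; only a small subset is listed here.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_RAW": 13,
    "CAP_SYS_ADMIN": 21,
}


def parse_cap_line(line):
    """Parse a status line such as 'CapEff: 00000000a80425fb' into an integer mask."""
    _, hex_mask = line.split(":", 1)
    return int(hex_mask.strip(), 16)


def has_cap(mask, name):
    """Return True if the capability bit for `name` is set in `mask`."""
    return bool(mask & (1 << CAP_BITS[name]))


def known_caps(mask):
    """List the capabilities from CAP_BITS that are set in `mask`."""
    return [name for name in CAP_BITS if has_cap(mask, name)]


# Assumed example value; read the real one from /proc/self/status in your container.
EXAMPLE = "CapEff:\t00000000a80425fb"
```

In this example mask, the bits for `CAP_CHOWN`, `CAP_NET_BIND_SERVICE`, and `CAP_NET_RAW` are set, while the bit for `CAP_SYS_ADMIN` is cleared.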

(Optional) Seccomp

Seccomp (Secure computing) allows specifying an explicit allowlist (or a denylist) of syscalls. The default configuration of Docker allows about 350 syscalls.

Seccomp is used for defense in depth; it is not a hard requirement for containers. For the sake of backward compatibility, Kubernetes still does not use seccomp by default, and it is unlikely to change the default configuration in the foreseeable future. Users can still opt in to seccomp via `KubeletConfiguration`.
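The allowlist idea can be modeled in a few lines. The dictionary below follows the general shape of a Docker seccomp profile (`defaultAction` plus a list of rules with `names` and `action`), but it is a heavily simplified toy: real profiles also carry architectures and argument conditions, the four allowed syscalls are chosen arbitrarily for illustration, and the actual filtering is done by the kernel, not by userspace code like this.

```python
# Toy profile: deny everything by default, allow a handful of syscalls.
PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "openat", "close"], "action": "SCMP_ACT_ALLOW"},
    ],
}


def action_for(profile, syscall):
    """Return the action the profile assigns to a syscall name."""
    for rule in profile.get("syscalls", []):
        if syscall in rule["names"]:
            return rule["action"]
    # No rule matched: fall back to the default action.
    return profile["defaultAction"]
```

With this profile, `read` is allowed, while anything not listed, such as `reboot`, falls through to the default action and fails with an error.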

(Optional) AppArmor XOR SELinux

AppArmor and SELinux (Security Enhanced Linux) are LSMs (Linux Security Modules) that provide further fine-grained configuration knobs.

These are mutually exclusive; one is chosen by host OS distributors (not by container image distributors):

AppArmor: chosen by Debian, Ubuntu, SUSE, etc.
SELinux: chosen by Fedora, Red Hat Enterprise Linux, and similar host OS distributions.

Docker's default AppArmor profile largely overlaps with its default configuration for capabilities, mount masks, etc., for the sake of defense in depth. Users may add custom settings for further security.

But the story is different for SELinux. To run containers in the `selinux-enabled` mode, you have to append the option `:z` (lowercase) or `:Z` (uppercase) to a bind mount, or run complex `chcon` commands yourself, to avoid permission errors.

The `:z` (lowercase) option is used for Type Enforcement (Slide 32). Type Enforcement protects host files from containers by assigning "types" to processes and files. A process running with the `container_t` type can read files with the `container_share_t` type, and read/write files with the `container_file_t` type, but it can't access files with other types.

The `:Z` (uppercase) option is used for Multi-category Security (Slide 33). Multi-category Security protects a container from other containers by assigning category numbers to processes and files. For example, a process with Category 42 can't access files labeled with Category 43.
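The two checks described above can be combined into a toy access model. This is only a sketch of the rules as stated in this section, not of the real SELinux policy engine: the `Label` class and function names are invented for the example, and the category rule (a file is accessible when its categories are a subset of the process's categories) is a simplifying assumption.

```python
from dataclasses import dataclass

READ_ONLY_TYPES = {"container_share_t"}
READ_WRITE_TYPES = {"container_file_t"}


@dataclass(frozen=True)
class Label:
    type: str
    categories: frozenset = frozenset()


def type_allows(proc, file, write):
    """Type Enforcement: what a container_t process may do with a file type."""
    if proc.type != "container_t":
        return False  # other process types are outside this toy model
    if file.type in READ_WRITE_TYPES:
        return True
    if file.type in READ_ONLY_TYPES:
        return not write
    return False


def mcs_allows(proc, file):
    """Multi-category Security: the file's categories must all be held by the process."""
    return file.categories <= proc.categories


def can_access(proc, file, write=False):
    return type_allows(proc, file, write) and mcs_allows(proc, file)
```

In this model, a `container_t` process with Category 42 can read and write a `container_file_t` file labeled with Category 42, but not one labeled with Category 43, and it cannot touch a host file of any other type regardless of categories.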

What about Docker for Mac/Win?

Docker Desktop products support running Linux containers on Mac and Windows, but they are just running a Linux virtual machine under the hood to run containers on it. The containers are not directly running on macOS and Windows.

3. Latest trends in container runtimes

Alternatives to Docker (as Kubernetes runtimes)

The first version of Kubernetes (2014) was made solely for Docker (Slide 37). Kubernetes v1.3 (2016) added interim support for an alternative container runtime called rkt, but rkt was retired in 2019. The effort to support alternative container runtimes yielded the Container Runtime Interface (CRI) API in Kubernetes v1.5 (2016). After the debut of CRI, the industry converged on two alternative runtimes: containerd (/container-dee/) and CRI-O (/cry-oh/, /cree-oh/, or /see-er-eye-oh/).

Kubernetes still had built-in support for Docker (Slide 38), but it was finally removed in Kubernetes v1.24 (2022). Docker continues to work with Kubernetes as a third-party runtime (via the `cri-dockerd` shim), but Docker is now seeing less adoption for Kubernetes.

The big names in the industry have already switched away from Docker to containerd or CRI-O:

Adopters of containerd: Amazon Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), k3s, ... (many)
Adopters of CRI-O: Red Hat OpenShift, Oracle Container Engine for Kubernetes (OKE), ...

containerd focuses on extensibility and supports non-Kubernetes workloads as well as Kubernetes workloads. In contrast, CRI-O focuses on simplicity and solely supports Kubernetes.

Alternatives to Docker (as CLI)

While Kubernetes has become the standard for multi-node production clusters, users still want a Docker-like CLI for building and testing containers locally on their laptops. Docker basically satisfies this demand, but runtime developers in the community wanted to build their own "lab" CLIs to incubate new features ahead of Docker and Kubernetes, as it was often hard to propose new features to Docker and Kubernetes, for several technical reasons.

Podman (formerly called kpod in 2016) is a Docker-compatible standalone container engine created by Red Hat and others. Its main difference from Docker is that it does not have a daemon process by default. Also, Podman is unique in the sense that it provides first-class support for managing Pods (groups of containers that share the same network namespace and often data volumes on the same host for efficient communication) as well as containers. However, most users seem to just use Podman for non-pod containers.

nerdctl (/nerd-see-tee-el/, founded by myself in 2020) is a Docker-compatible CLI for containerd (/container-dee/). nerdctl was originally made for experimenting with new features such as lazy-pulling (discussed later), but it is also useful for debugging Kubernetes nodes that are running containerd.

See also my blog article "Released nerdctl v1.0" (October 2022) for further information.

Running containers on Mac

Docker Desktop products for Mac and Windows are proprietary. Windows users can just run the Linux version of Docker (Apache License 2.0, no GUI) in WSL2, but there had been no equivalent for Mac users.

Lima (/lee-mah/, founded by myself too in 2021) is a command line tool to create a WSL2-like environment on macOS for running containers. Lima uses nerdctl by default, but it supports Docker and Podman too.

See also my blog article "Lima is now a CNCF project" (October 2022).

Lima is also adopted by third party projects such as colima (2021), Rancher Desktop (2021), and Finch (2022).

The Podman community released Podman Machine (command line tool, 2021) and Podman Desktop (GUI, 2022) as alternatives to Docker Desktop. Podman Desktop optionally supports Lima too.

Docker being refactored

containerd mainly provides two subsystems: the runtime subsystem and the image subsystem. However, the latter one is not used by Docker. This is problematic because Docker's own legacy image subsystem is far behind containerd's modern image subsystem (and it caused me to launch the nerdctl project):

No support for lazy-pulling (on-demand image pulling)
Limited support for multi-platform images (e.g., AMD64/ARM64 dual-platform images)
Limited compliance with the OCI Image Spec

This long-standing problem is finally being resolved. Docker v24 (2023) added experimental support for using containerd's image subsystem with an undocumented option (subject to change) in `/etc/docker/daemon.json`:

{"features":{"containerd-snapshotter": true}}

A future version of Docker (2024? 2025?) is likely to use containerd's image subsystem by default.
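Since `/etc/docker/daemon.json` often already contains other settings, the option above is better merged into the existing file than pasted over it. The following is a minimal sketch of such a merge; the function name is made up for this example, and because the option is undocumented and subject to change, treat the key name as valid only for the Docker version discussed here.

```python
import json


def enable_containerd_snapshotter(config_text):
    """Return daemon.json text with features.containerd-snapshotter set to true.

    Existing settings are preserved; an empty input is treated as an empty config.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    features = config.setdefault("features", {})
    features["containerd-snapshotter"] = True
    return json.dumps(config, indent=2, sort_keys=True)
```

The function only transforms text; writing the result back to the file and restarting the daemon are left to the reader.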

Lazy-pulling

Most files in container images are never used:

“pulling packages accounts for 76% of container start time, but only 6.4% of that data is read”
From “Slacker: Fast Distribution with Lazy Docker Containers” (Harter, et al., FAST 2016)

"Lazy-pulling" is a technique to reduce container startup time by pulling partial image contents on demand. This is not possible with OCI-standard tar.gz images, as they do not support `seek()` operations. Several alternative formats are being proposed to support lazy-pulling:

eStargz (2019): Optimizes gzip granularity for seek()-ability; forward compatible with OCI v1 tar.gz.
SOCI (2022): Captures a checkpoint of tar.gz decoder state; forward compatible with OCI v1 tar.gz.
Nydus (2022): An alternate image format; not compatible with OCI v1 tar.gz.
OverlayBD (2021): Block devices as container images; not compatible with OCI v1 tar.gz.
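The idea of "optimizing gzip granularity" can be illustrated with the standard library alone. A gzip stream may consist of several concatenated members, so if each file is compressed as its own member and its offset is recorded, a single file can be decompressed without touching the rest, while ordinary tools still see one valid gzip stream. The sketch below shows only this underlying trick; it is not the actual eStargz format (which works on tar entries and has its own table of contents), and the function names are invented for the example.

```python
import gzip


def build_archive(files):
    """Compress each file as its own gzip member; return (blob, index).

    index maps a file name to the (offset, length) of its member in blob.
    """
    blob = bytearray()
    index = {}
    for name, data in files.items():
        member = gzip.compress(data)
        index[name] = (len(blob), len(member))
        blob += member
    return bytes(blob), index


def read_one(blob, index, name):
    """Decompress a single file by seeking directly to its gzip member."""
    offset, length = index[name]
    return gzip.decompress(blob[offset:offset + length])
```

A lazy-pulling client would fetch only the byte range `offset..offset+length` from the registry (e.g., with an HTTP range request) instead of holding the whole `blob` locally as this sketch does.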

Slide 51 shows a benchmark result of eStargz. Lazy-pulling (+additional optimizations) can reduce the container startup time to 1/9.

See also articles from my colleague Kohei Tokunaga:

Speeding Up Pulling Container Images on a Variety of Tools with eStargz
P2P Container Image Distribution on IPFS With Containerd

Expanding adoption of User namespaces

User namespaces are still rarely used in the Docker and Kubernetes ecosystem, although Docker has supported them since v1.9 (2015).

One of the reasons is the complexity and overhead of “chowning” the container rootfs for a pseudo root. Linux kernel v5.12 (2021) added “idmapped mounts” to eliminate the need for chowning. Support for this is planned for runc v1.2.

After the release of runc v1.2, user namespaces are expected to become more popular for Docker and Kubernetes; the latter just added preliminary support for user namespaces in v1.25 (2022). For compatibility's sake, it is unlikely that Kubernetes will ever enable user namespaces by default. However, Docker may still enable user namespaces by default in the future. Nothing is decided yet, though.

Rootless containers

Rootless containers are a technique for putting container runtimes, as well as containers, in a user namespace created by a non-root user, to mitigate potential vulnerabilities in the runtimes.

Even if a container runtime has a bug that allows an attacker to escape from a container, the attacker can't gain privileged access to other users' files, the kernel, firmware, or devices.

Here is a brief history of rootless containers:

2014: LXC v1.0 introduced support for rootless containers. At that time, rootless containers were called "unprivileged containers". LXC's unprivileged containers are slightly different from modern rootless containers, as they require a SETUID binary for bringing up networks.
2017: runc v1.0-rc4 gained initial support for rootless containers.
2018: Work began on supporting rootless containers in containerd, BuildKit (backend of `docker build`), Docker, Podman, etc. slirp4netns (Slide 56) was created (by myself) to allow SETUID-less networking by translating Ethernet packets to unprivileged socket syscalls.
2019: Docker v19.03 was released with experimental support for rootless containers. Podman v1.1 was also released with the same feature in this year, slightly ahead of Docker v19.03.
2020: Docker v20.10 was released with general availability of rootless containers.

From 2020 to 2022, we also worked on bypass4netns (Slide 57) to eliminate the overhead of slirp4netns, by hooking socket file descriptors inside a container and reconstructing them outside the container. The achieved throughput is even higher than that of "rootful" containers.

Rootless containers have successfully gained popularity, but there has also been criticism of them. In particular, it is controversial whether non-root users should be allowed to create the user namespaces that are required for running rootless containers. I'd answer yes for container users, because rootless containers are at least much safer than running everything as root. However, I'd rather answer no for those who don't use containers, because user namespaces can also be an attack surface, e.g., CVE-2023-32233: "Privilege escalation in Linux Kernel due to a Netfilter nf_tables vulnerability".

The community has already been seeking remedies for this dilemma. Ubuntu (since 13.10) and Debian provide a sysctl knob `kernel.unprivileged_userns_clone=<bool>` to specify whether to allow or disallow creating unprivileged user namespaces. However, their patch has not been merged into the upstream Linux kernel.
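Sysctl knobs are exposed as files under `/proc/sys`, so the state of this one can be checked by reading `/proc/sys/kernel/unprivileged_userns_clone`. The helper below is a small sketch with a made-up name; it returns `None` when the file is absent, which is the expected situation on kernels that do not carry the distribution patch.

```python
from pathlib import Path

USERNS_CLONE = "/proc/sys/kernel/unprivileged_userns_clone"


def unprivileged_userns_allowed(path=USERNS_CLONE):
    """Return True/False according to the knob, or None if the knob does not exist."""
    try:
        return Path(path).read_text().strip() == "1"
    except FileNotFoundError:
        return None
```

A result of `None` does not mean that unprivileged user namespaces are disabled; it only means this particular knob is not available on the host.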

Instead, the upstream kernel introduced a new LSM (Linux Security Module) hook `userns_create` in Linux v6.1 (2022) so that an LSM can dynamically decide whether to allow or disallow creating a user namespace. This hook is callable from eBPF (`bpf_program__attach_lsm()`), so it is expected that there will be a fine-grained and non-distribution-specific knob that depends on neither AppArmor nor SELinux. However, userspace utilities for eBPF + LSM are not yet mature enough to provide a good user experience for this.

More LSMs

Landlock LSM was merged into Linux v5.13 (2021). Landlock is similar to AppArmor in the sense that it restricts file accesses by paths (`LANDLOCK_ACCESS_FS_EXECUTE`, `LANDLOCK_ACCESS_FS_READ_FILE`, etc.), but Landlock does not require the root privilege for setting up a new profile. Landlock is also very similar to OpenBSD's `pledge(2)`.

Landlock is still not supported by the OCI Runtime Spec, but I guess it can be included in the OCI Runtime Spec v1.2.

Kata Containers

As I mentioned in Section 1, "containers" is not a well-defined term. Anything can be called a "container" when it provides good compatibility with the existing container ecosystem.

Kata Containers (2017) are this sort of "containers" that are not actually containers in the narrower sense. Kata Containers are actually virtual machines, but with support for the OCI Runtime Spec. Kata Containers are much more secure than runc containers. However, they have performance drawbacks, and they do not work well on typical non-bare-metal IaaS instances that do not support nested virtualization.

Kata Containers works as a containerd runtime plugin, and receives the same images and runtime configurations as runc containers. Its user experience is almost indistinguishable from that of runc containers.

gVisor

gVisor (2018) is yet another exotic container runtime. gVisor traps syscalls and executes them in a Linux-compatible user-mode kernel to mitigate attacks. gVisor currently has three modes for trapping syscalls:

KVM mode: rarely used, but the best option for bare-metal hosts
ptrace mode: the most common option but slow
SIGSYS trap mode (since 2023): expected to replace ptrace mode eventually

gVisor has been used in several Google products, including Google Cloud Run. However, Google Cloud Run switched away from gVisor to microVMs in 2023:

“This means that software that previously didn’t run in Cloud Run due to unimplemented system call issues can now run in Cloud Run’s second-generation execution environment.”
From https://cloud.google.com/blog/products/serverless/cloud-run-jobs-and-second-generation-execution-environment-ga/?hl=en

This implies that gVisor's performance and compatibility issues are not negligible for their business.

WebAssembly

WebAssembly (WASM) is a platform-independent bytecode format that was originally designed for Web browsers in 2015. WebAssembly is somewhat similar to Java applets (1995) but it puts more focus on portability and security. One interesting aspect of WebAssembly is that it splits the code address space from the data address space; there are no instructions like `JMP <immediate>` and `JMP *<reg>`. It only supports jumping to labels that are resolved at compile time. This design reduces arbitrary code execution bugs, although it also sacrifices the feasibility of JIT-compiling other bytecode formats into WebAssembly.

WebAssembly is also in the spotlight as a potential alternative to containers. For running WebAssembly outside browsers, WASI (WebAssembly System Interface) was proposed in 2019 to provide a low-level API (e.g., `fd_read()`, `fd_write()`, `sock_recv()`, `sock_send()`) that can be used for implementing POSIX-like layers on it. containerd added the "runWASI" plugin in 2022 to treat WASI workloads as containers.

In 2023, WASIX was proposed to extend WASI to provide more convenient (and somewhat controversial) functions:

Threads: `thread_spawn()`, `thread_join()`, ...
Processes: `proc_fork()`, `proc_exec()`, ...
Sockets: `sock_listen()`, `sock_connect()`, ...

Eventually, these movements may replace a huge (but non-100%) portion of containers. Solomon Hykes, the founder of Docker, says that "If WASM+WASI existed in 2008, we wouldn’t have needed to created Docker".

Recap

Containers are more efficient, but often less secure, than virtual machines. Lots of security technologies are being introduced to harden containers (user namespaces, rootless containers, Linux security modules, ...).
Alternatives to Docker are arising (containerd, CRI-O, Podman, nerdctl, Finch, ...), but Docker isn’t fading out.
“Non-container” containers are a trend too (Kata: VM-based, gVisor: user-mode kernel, runWASI: WebAssembly, ...).

Slide 71 shows the landscape of the well-known runtimes.

See also the rest of the slides for further topics that could not be covered in the talk.

NTT is hiring!

We at NTT have been proudly leading the trends of containers and other open source software. Visit https://www.rd.ntt/e/sic/recruit/ to see how to join us.

We at NTT take pride in leading the trends in containers and other open source software. Please see our recruitment page (Japanese): https://www.rd.ntt/sic/recruit/
