Is Your Go Application Really Using the Correct Number of CPU Cores?

4 min readJul 3, 2023

In the world of modern cloud-native languages, Go has gained significant popularity for its efficiency and concurrency support through goroutines. However, when it comes to utilizing the appropriate number of CPU cores, Go applications face some challenges. This article explores the underlying issues and sheds light on why Go applications may fail to leverage the correct number of CPU cores, particularly in containerized environments like Kubernetes.

Go’s GMP Scheduler

Go employs a unique scheduling model called GMP, where G represents goroutines, M represents kernel-level threads, and P represents available CPU cores. Multiple goroutines can share a single thread (M), which needs to be bound to a specific P during execution. If the number of CPUs perceived by the operating system differs from the number of available CPUs perceived by a Go process, even binding an M to a specific P may not guarantee its execution. Therefore, the ability to obtain an accurate count of available CPU cores directly affects Go’s scheduling efficiency.

Resource Limits in Kubernetes

Suppose a user sets resource limits for a Go application in a Kubernetes environment using the following specification:

spec:
  containers:
  - name: "my-go-app"
    resources:
      limits:
        cpu: "4"

One might assume that Go, being a popular language in the cloud-native era, would have built-in support for Kubernetes and automatically recognize these limits. However, that assumption is not entirely accurate. Let’s conduct a small experiment:

If the GOMAXPROCS environment variable is not explicitly set when starting the Go process, it defaults to the output of the runtime.NumCPU() function, which represents the number of available CPU cores (i.e., the value of P). However, adding a print statement for NumCPU reveals that it actually outputs the number of CPUs on the node, regardless of the limits.cpu value.

package main

import "runtime"

func main() {
  print(runtime.NumCPU()) // equals the number of CPUs of the node
}

Limitations of runtime.NumCPU()

On Linux, the runtime.NumCPU() function relies on the sched_getaffinity system call, which only considers CPU core bindings and doesn’t take into account CPU resource restrictions imposed by cgroups in container environments.

For instance, in Docker, the --cpuset-cpus flag defines the CPU core numbers available for a container at runtime, while restricting CPU resource usage primarily relies on --cpus=. Only the former (cpuset) can be recognized by sched_getaffinity.

For more information, refer to the Docker documentation on resource constraints.

Accurate Calculation of CPU Cores

Fortunately, there is a Go library, that accurately calculates GOMAXPROCS by reading the cgroup file system.

This library supports both cgroup v1 and v2. In cgroup v1, it reads the cpu.cfs_quota_us and cpu.cfs_period_us files and calculates their quotient. Typically, these files are located under /sys/fs/cgroup/cpu/ (automaxprocs retrieves the mount information to determine the actual location). In cgroup v2, it reads the corresponding fields representing quota and period in the cpu.max file and calculates their quotient.

Since the division result may not be an integer, an additional step of rounding down is performed.

Challenges Beyond Go

It’s important to note that Go is not the only language facing difficulties in recognizing the available CPU cores accurately within container environments. Nginx, Envoy, and other frameworks also struggle with recognizing cgroup configurations. However, it’s been reported that Rust and Java (OpenJDK implementation) have dedicated solutions to handle cgroup configurations.

If your application meets the following criteria: Performs compute-intensive tasks or has critical business logic executed within a fixed number of workers. Deployed in a containerized environment. It’s worth investigating whether the framework you are using properly handles cgroup configurations.

Performance Impact

Let’s examine the benchmark results of the uber automaxprocs library to see the impact of setting the correct GOMAXPROCS configuration on performance.

Data measured from Uber’s internal load balancer with 200% CPU quota (i.e., 2 cores):

When GOMAXPROCS is increased above the CPU quota, we see P50 decrease slightly, but see significant increases to P99. We also see that the total RPS handled also decreases.

Conclusion

The mismatch between perceived and actual CPU core availability in container environments poses a challenge for Go applications. Although Go relies on the runtime.NumCPU() function, it doesn’t consider the CPU resource limitations imposed by groups.

Developers must be aware of this limitation and consider utilizing libraries like automaxprocs to accurately determine the optimal number of CPU cores for their Go applications. As the cloud-native landscape evolves, addressing these challenges will ensure efficient resource utilization and enhance the performance of Go applications in containerized environments.

Good luck and happy coding!