How to autoscale apps on Kubernetes with custom metrics

October 2019

How do you scale apps on Kubernetes?

Deploying an app to production with a static configuration is not optimal.

Traffic patterns can change quickly, and the app should be able to adapt to them:

When demand increases, the app should scale up (increasing the number of replicas) to stay responsive.
When demand decreases, the app should scale down (decreasing the number of replicas) to not waste resources.

Kubernetes provides excellent support for autoscaling applications in the form of the Horizontal Pod Autoscaler.

In the following, you will learn how to use it.

Different types of autoscaling

First of all, to eliminate any misconceptions, let's clarify the use of the term "autoscaling" in Kubernetes.

In Kubernetes, several things are referred to as "autoscaling", including:

Horizontal Pod Autoscaler: adjusts the number of replicas of an application
Vertical Pod Autoscaler: adjusts the resource requests and limits of a container
Cluster Autoscaler: adjusts the number of nodes of a cluster

While these components all "autoscale" something, they are completely unrelated to each other.

They all address very different use cases and use different concepts and mechanisms.

They are developed in separate projects and can be used independently from each other.

This article covers the Horizontal Pod Autoscaler.

What is the Horizontal Pod Autoscaler?

The Horizontal Pod Autoscaler is a built-in Kubernetes feature that lets you horizontally scale applications based on one or more monitored metrics.

Horizontal scaling means increasing and decreasing the number of replicas. Vertical scaling means increasing and decreasing the compute resources of a single replica.

Technically, the Horizontal Pod Autoscaler is a controller in the Kubernetes controller manager, and it is configured by HorizontalPodAutoscaler resource objects.

The Horizontal Pod Autoscaler can monitor a metric about an app and continuously adjust the number of replicas to optimally meet the current demand.

Resources that can be scaled by the Horizontal Pod Autoscaler include the Deployment, StatefulSet, ReplicaSet, and ReplicationController.

To autoscale an app, the Horizontal Pod Autoscaler runs a continuous control loop with the following steps:

Query the scaling metric
Calculate the desired number of replicas
Scale the app to the desired number of replicas

The default period of the control loop is 15 seconds (configurable with the --horizontal-pod-autoscaler-sync-period flag of the controller manager).
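The three steps can be sketched as a small Python loop. This is only an illustration of the idea, not the controller's actual code (which is written in Go): the name run_control_loop and all the callables passed to it are hypothetical stand-ins for querying the metric, reading the replica count, calculating the new count, and scaling the app.

```python
import time
from typing import Callable, Optional


def run_control_loop(
    query_metric: Callable[[], float],
    get_replicas: Callable[[], int],
    set_replicas: Callable[[int], None],
    compute_desired: Callable[[int, float], int],
    period_seconds: float = 15.0,
    iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run the query/calculate/scale loop (hypothetical sketch).

    With iterations=None the loop runs forever, like a real controller.
    """
    done = 0
    while iterations is None or done < iterations:
        # 1. Query the scaling metric
        metric_value = query_metric()
        # 2. Calculate the desired number of replicas
        current = get_replicas()
        desired = compute_desired(current, metric_value)
        # 3. Scale the app, but only if something has to change
        if desired != current:
            set_replicas(desired)
        sleep(period_seconds)
        done += 1


if __name__ == "__main__":
    # Fake app state and a toy rule: double the replicas while the metric is above 10.
    state = {"replicas": 2}
    run_control_loop(
        query_metric=lambda: 20.0,
        get_replicas=lambda: state["replicas"],
        set_replicas=lambda n: state.update(replicas=n),
        compute_desired=lambda current, value: current * 2 if value > 10 else current,
        iterations=1,
        sleep=lambda seconds: None,  # skip the real 15 second wait in the demo
    )
    print(state["replicas"])
```

Passing the collaborators in as functions keeps the sketch independent of any cluster, so you can run it locally with fake values.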

The calculation of the desired number of replicas is based on the scaling metric and a user-provided target value for this metric.

The goal is to calculate a replica count that brings the metric value as close as possible to the target value.

For example, imagine that the scaling metric is the per-second request rate per replica:

If the target value is 10 req/sec and the current value is 20 req/sec, the Horizontal Pod Autoscaler will scale the app up (i.e. increasing the number of replicas) to make the metric decrease and get closer to the target value.
If the target value is 10 req/sec and the current value is 2 req/sec, the Horizontal Pod Autoscaler will scale the app down (i.e. decreasing the number of replicas) to make the metric increase and get closer to the target value.

The algorithm for calculating the desired number of replicas is based on the following formula:

X = N * (c/t)

Where X is the desired number of replicas, N is the current number of replicas, c is the current value of the metric, and t is the target value. The result is rounded up to the next whole number.

You can find the details about the algorithm in the documentation.
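To get a feeling for the formula, here is a minimal Python sketch of it. The function name desired_replicas and its parameters are made up for this illustration, and the sketch assumes that the result is rounded up and then limited to the configured minimum and maximum number of replicas. The real algorithm applies further rules that are left out here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10):
    """X = N * (c/t), rounded up and clamped to [min_replicas, max_replicas]."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # Multiply before dividing to keep the arithmetic exact for whole numbers.
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Target 10 req/sec, current 20 req/sec: scale up from 4 to 8 replicas.
    print(desired_replicas(4, 20, 10))
    # Target 10 req/sec, current 2 req/sec: scale down from 5 to 1 replica.
    print(desired_replicas(5, 2, 10))
```

Note that the two calls mirror the scale-up and scale-down examples from above: a metric value above the target yields more replicas, a value below the target yields fewer.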

That's how the Horizontal Pod Autoscaler works, but how do you use it?

How to configure the Horizontal Pod Autoscaler?

Configuring the Horizontal Pod Autoscaler to autoscale your app is done by creating a HorizontalPodAutoscaler resource.

This resource allows you to specify the following parameters:

The resource to scale (e.g. a Deployment)
The minimum and maximum number of replicas
The scaling metric
The target value for the scaling metric

As soon as you create this resource, the Horizontal Pod Autoscaler starts executing the above-mentioned control loop against your app with the provided parameters.

A concrete HorizontalPodAutoscaler resource looks like this:

hpa.yaml

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myhpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: myapp_requests_per_second
        target:
          type: AverageValue
          averageValue: 2

Different versions of the HorizontalPodAutoscaler resource exist, and they differ in their manifest structure. The above example uses version v2beta2, which is the most recent one at the time of this writing.

This resource specifies a Deployment named myapp to be autoscaled between 1 and 10 replicas based on a metric named myapp_requests_per_second with a target value of 2.

You can imagine that the myapp_requests_per_second metric represents the request rate of the individual Pods of this Deployment — so the intention of this specification is to autoscale the Deployment with the goal of maintaining a request rate of 2 requests per second for each of the Pods.
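To see what this specification means in practice, you can simulate it. The following Python sketch is a simplified model and not the real controller: the function simulate is hypothetical, it assumes a constant total request rate that is spread evenly across all Pods, and it uses the values from the manifest above (target of 2, between 1 and 10 replicas) as defaults.

```python
import math


def simulate(total_rate, target_per_pod=2.0, min_replicas=1, max_replicas=10,
             start_replicas=1, steps=5):
    """Return the replica count after each control loop iteration.

    Assumption: total_rate (req/sec) stays constant and is shared evenly by all Pods.
    """
    replicas = start_replicas
    history = []
    for _ in range(steps):
        # The metric value the autoscaler would see: average request rate per Pod
        per_pod = total_rate / replicas
        desired = math.ceil(replicas * per_pod / target_per_pod)
        replicas = max(min_replicas, min(max_replicas, desired))
        history.append(replicas)
    return history


if __name__ == "__main__":
    print(simulate(15, steps=3))  # [8, 8, 8]
    print(simulate(50, steps=3))  # [10, 10, 10]
```

With a total of 15 requests per second, the app settles at 8 replicas that each handle slightly less than 2 requests per second. With 50 requests per second, the app would need 25 replicas, but it stays at the configured maximum of 10.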

So far, this all sounds good and nice — but there's a catch.

Where do the metrics come from?

What is the metrics registry?

The entire autoscaling mechanism is based on metrics that represent the current load of an application.

When you define a HorizontalPodAutoscaler resource you have to specify such a metric.

But how does the Horizontal Pod Autoscaler know how to obtain these metrics?

It turns out that there's another component in play — the metrics registry.

The Horizontal Pod Autoscaler queries metrics from the metrics registry.

The metrics registry is a central place in the cluster where metrics (of any kind) are exposed to clients (of any kind).

The Horizontal Pod Autoscaler is one of these clients.

The purpose of the metrics registry is to provide a standard interface for clients to query metrics from.

The interface of the metrics registry consists of three separate APIs, each designed to serve a different type of metric:

Resource Metrics API: predefined resource usage metrics (CPU and memory) of Pods and Nodes
Custom Metrics API: custom metrics associated with a Kubernetes object
External Metrics API: custom metrics not associated with a Kubernetes object

All of these metric APIs are extension APIs.

That means they are extensions to the core Kubernetes API that are accessible through the Kubernetes API server.
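Since the metric APIs are served by the API server, each of them lives under its own URL path. The following Python sketch builds these paths. The helper function names are made up for this example, and the sketch assumes the v1beta1 version of each API and metrics that belong to Pods in a namespace, so the paths may differ in your cluster.

```python
# Assumption: all three APIs are served in version v1beta1 in your cluster.
API_VERSION = "v1beta1"


def resource_metrics_path(namespace):
    """Path for the CPU and memory usage of all Pods in a namespace."""
    return f"/apis/metrics.k8s.io/{API_VERSION}/namespaces/{namespace}/pods"


def custom_metrics_path(namespace, metric, pod="*"):
    """Path for a custom metric of one Pod, or of all Pods with the '*' wildcard."""
    return (f"/apis/custom.metrics.k8s.io/{API_VERSION}"
            f"/namespaces/{namespace}/pods/{pod}/{metric}")


def external_metrics_path(namespace, metric):
    """Path for an external metric, which is not tied to a Kubernetes object."""
    return f"/apis/external.metrics.k8s.io/{API_VERSION}/namespaces/{namespace}/{metric}"


if __name__ == "__main__":
    print(resource_metrics_path("default"))
    print(custom_metrics_path("default", "myapp_requests_per_second"))
    print(external_metrics_path("default", "queue_length"))  # hypothetical metric name
```

You can pass such a path to kubectl get --raw to query a metric API by hand. This only returns data if a corresponding metric API server is installed in the cluster.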

What does that mean for you if you want to autoscale an app?

Any metric that you want to use as a scaling metric must be exposed through one of these three metric APIs.

Only then are they accessible to the Horizontal Pod Autoscaler.

So, to autoscale an app, your task is now not only to configure the Horizontal Pod Autoscaler...

You also have to expose your desired scaling metric through the metrics registry.

How do you expose a metric through a metric API?

By installing and configuring additional components in your cluster.

For each metric API you need a corresponding metric API server and you need to configure it to expose a specific metric through the metric API.

By default, no metric API servers are installed in Kubernetes, which means that the metric APIs are not enabled by default.

Furthermore, you need a metrics collector that collects the desired metrics from the sources (e.g. from the Pods of the target app) and provides them to the metric API server.

There are different choices of metric API servers and metric collectors for the different metrics APIs.

Resource Metrics API:

The metrics collector is cAdvisor, which runs as part of the kubelet on every worker node (so it's already installed by default)
The official metric API server for the Resource Metrics API is the Metrics Server

Custom Metrics API and External Metrics API:

A popular choice for the metrics collector is Prometheus — however, other metrics systems like Datadog or Google Stackdriver may be used instead
The Prometheus Adapter is a metric API server that integrates with Prometheus as a metric collector — however, other metric collectors have their own metric API servers

So, to expose a metric through one of the metric APIs, you have to go through these steps:

Install a metrics collector (e.g. Prometheus) and configure it to collect the desired metric (e.g. from the Pods of your app)
Install a metric API server (e.g. the Prometheus Adapter) and configure it to expose the desired metric from the metrics collector through the corresponding metrics API

Note that this applies specifically to the Custom Metrics API and External Metrics API, which serve custom metrics. The Resource Metrics API only serves default metrics and can't be configured to serve custom metrics.

This was a lot of information, so let's put the bits together.

Putting everything together

Let's go through a full example of configuring an app to be autoscaled by the Horizontal Pod Autoscaler.

Imagine you want to autoscale a web app based on the average per-second request rate of the replicas.

Also, assume that you want to use a Prometheus-based setup for exposing the request rate metric through the Custom Metrics API.

The request rate is a custom metric associated with a Kubernetes object (Pods), so it must be exposed through the Custom Metrics API.

Here's a sequence of steps to reach your goal:

Instrument your app to expose the total number of received requests as a Prometheus metric
Install Prometheus and configure it to collect this metric from all the Pods of your app
Install the Prometheus Adapter and configure it to turn the metric from Prometheus into a per-second request rate (using PromQL) and expose that metric as myapp_requests_per_second through the Custom Metrics API
Create a HorizontalPodAutoscaler resource (as shown above) specifying myapp_requests_per_second as the scaling metric and an appropriate target value
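The first and third step revolve around the same idea: the app only counts requests, and the per-second rate is derived from that counter later. The following Python sketch illustrates this with the standard library only. The class RequestCounter, the function per_second_rate, and the metric name myapp_requests_total are assumptions made for this example. In a real app you would probably use a Prometheus client library, and the rate would be calculated by a PromQL query, which also handles counter resets.

```python
import threading


class RequestCounter:
    """A monotonically increasing counter rendered in the Prometheus text format."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self):
        # Call this once for every received request.
        with self._lock:
            self._value += 1

    def render(self):
        # This is the text a metrics endpoint of the app would return.
        with self._lock:
            value = self._value
        return (f"# HELP {self.name} {self.help_text}\n"
                f"# TYPE {self.name} counter\n"
                f"{self.name} {value}\n")


def per_second_rate(earlier, later):
    """Average rate between two (timestamp_seconds, counter_value) samples.

    Simplification: counter resets (e.g. after a Pod restart) are rejected
    instead of being compensated for.
    """
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in chronological order")
    if v1 < v0:
        raise ValueError("counter decreased, which this sketch does not handle")
    return (v1 - v0) / (t1 - t0)


if __name__ == "__main__":
    counter = RequestCounter("myapp_requests_total",
                             "Total number of received requests.")
    for _ in range(3):
        counter.inc()
    print(counter.render(), end="")
    # 120 requests within 60 seconds are 2 requests per second.
    print(per_second_rate((0.0, 100), (60.0, 220)))
```

Exposing a plain counter keeps the app simple, and it leaves the choice of the time window for the rate to the component that queries the metric.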

As soon as the HorizontalPodAutoscaler resource is created, the Horizontal Pod Autoscaler kicks in and starts autoscaling your app according to your configuration.

And you can sit back and watch your app adapt to traffic.

This article sets the theoretical framework for autoscaling an application based on a custom metric.

In a future article, you will put this knowledge into practice and execute the above steps with your own app on your own cluster.

From zero to a fully autoscaled application.

Stay tuned!
