Cost-Effective Scalability with Latency-Based Autoscaling

hadican · May 11, 2023

One of the biggest challenges in modern cloud-native applications is achieving high scalability while keeping costs low. Traditional autoscaling approaches based on CPU or memory usage can be suboptimal, as they do not consider the latency of requests. In this blog post, we will explore how to implement a latency-based autoscaling strategy on Kubernetes using Linkerd2, Prometheus, and custom metrics.

Linkerd2 is a service mesh for Kubernetes that provides observability, security, and reliability features for microservice architectures. Prometheus is a popular monitoring and alerting system that can collect metrics from various sources, including Linkerd2. You can read about our Linkerd2 journey here and here.

By combining these two tools, you can implement a latency-based autoscaling strategy on Kubernetes: set a target threshold for the response time of requests, then adjust the number of instances of an application to keep the response time below that threshold.

Image from learnk8s.io

To enable this type of autoscaling, we first need to understand a few concepts: “Custom & External Metrics”, the “Prometheus Adapter”, and the “Horizontal Pod Autoscaler (HPA)”, assuming you’re already running Linkerd2 and Prometheus.

Image from learnk8s.io

Custom and external metrics are important features in Kubernetes for monitoring and scaling applications. Custom metrics are application-specific metrics (e.g. the number of requests per second, the average response time, or the number of errors) served through the Custom Metrics API (custom.metrics.k8s.io), while external metrics come from systems outside the cluster (e.g. Apache Pulsar, Buildkite, etc.) and are served through the External Metrics API (external.metrics.k8s.io).

Once you have defined these custom or external metrics, you can use a “Prometheus Adapter” to expose them to Kubernetes. The Prometheus Adapter acts as a bridge between Kubernetes and Prometheus, allowing you to use custom or external metrics to drive autoscaling.
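If you want to confirm that the adapter is wired up correctly: it registers itself with the Kubernetes API server as an aggregated API service, so a quick sanity check is to list those services (the exact group versions depend on your adapter version and configuration):

kubectl get apiservice | grep metrics.k8s.io

You should see an entry for custom.metrics.k8s.io (and external.metrics.k8s.io, if configured) backed by the Prometheus Adapter’s service.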

Finally, you can use the Horizontal Pod Autoscaler (HPA) to automatically scale the number of pods in your deployment based on the custom or external metrics exposed by the Prometheus Adapter. The HPA continuously monitors the metrics and adjusts the number of pods up or down as needed to maintain the desired level of performance.

Linkerd2 exposes a bunch of metrics, all of which you can find here. You can use “response_latency_ms_bucket” to define a custom metric rule. There are various ways to configure the Prometheus Adapter to expose custom metrics, depending on how you installed Prometheus. Here is how to do it if you’ve installed it via “kube-prometheus”:

prometheusAdapter+: {
  namespace: 'monitoring',
  config+:: {
    rules: [
      {
        seriesQuery: 'response_latency_ms_bucket{namespace="deep-learning", pod!=""}',
        seriesFilters: [],
        resources: {
          template: '<<.Resource>>',
        },
        name: { matches: '^(.*)_bucket$', as: '${1}_90th' },
        metricsQuery: 'histogram_quantile(0.90, sum(rate(<<.Series>>{<<.LabelMatchers>>, direction="inbound"}[5m])) by (le, <<.GroupBy>>))'
      }
    ],
  },
}

The rule in the snippet above uses the “response_latency_ms_bucket” metric to calculate and expose a new metric, which it also renames to “response_latency_ms_90th”. You can find an awesome Prometheus Adapter configuration walkthrough here, which explains every line of the above configuration one by one, along with keywords like “<<.Resource>>”, “seriesQuery”, etc.
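To make the template concrete: when the HPA asks for this metric for the pods of a deployment in the “deep-learning” namespace, the adapter expands the templated query into something roughly like the one below. The label matchers and grouping are generated by the adapter at query time, and the pod name pattern here is only illustrative:

histogram_quantile(
  0.90,
  sum(rate(response_latency_ms_bucket{namespace="deep-learning", pod=~"inference-server-.*", direction="inbound"}[5m])) by (le, pod)
)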

After adding the custom rule and restarting the Prometheus Adapter, you should be able to see the new metric via:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta2

The response should look like this:

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta2",
  "resources": [
    {
      "name": "jobs.batch/response_latency_ms_90th",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "namespaces/response_latency_ms_90th",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "pods/response_latency_ms_90th",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "deployments.apps/response_latency_ms_90th",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
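You can also ask for the actual values the adapter computes for individual pods, which is handy for sanity-checking the numbers before wiring them into an autoscaler. Assuming the “deep-learning” namespace from the rule above:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/deep-learning/pods/*/response_latency_ms_90th"

The response is a MetricValueList with one entry per pod, each carrying the current 90th-percentile latency for that pod.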

Now you can use the exported “response_latency_ms_90th” metric to automatically scale the number of pods:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server
  namespace: deep-learning
spec:
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 180
      policies:
        - type: Pods
          value: 1
          periodSeconds: 180
  metrics:
    - type: Pods
      pods:
        metric:
          name: response_latency_ms_90th
          selector: {}
        target:
          type: AverageValue
          averageValue: "100000m"
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server

In the above resource, the HPA has two scaling behaviors, scale-up and scale-down, with stabilization windows of 2 minutes and 3 minutes, respectively. The target of “100000m” is Kubernetes’ milli-units notation for 100, i.e. 100ms, since the metric is reported in milliseconds. The scale-up policy adds at most one replica every 120 seconds while the 90th-percentile response latency averaged across the pods is above 100ms, and the scale-down policy removes at most one replica every 180 seconds while it is at or below 100ms.

You can find an awesome HPA configuration walkthrough here, which explains every line of the above configuration one by one, along with keywords like “stabilizationWindowSeconds”, “periodSeconds”, etc.

Chart: x-axis is time, y-axis is latency (ms)

After combining everything, as you can see in the screenshot, when latency increased the autoscaler created new pods and brought latency back down to around 100ms, within a range of ±10%, which is the default tolerance value in Kubernetes and is globally configurable via a parameter. Check here for the algorithm details.
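Under the hood, the HPA compares the current metric value against the target and computes the desired replica count roughly as shown below; if the ratio stays within the tolerance (0.1 by default, configurable via the kube-controller-manager’s --horizontal-pod-autoscaler-tolerance flag), no scaling happens:

desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]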

If you run the command below:

kubectl describe horizontalpodautoscalers -n deep-learning inference-server

You’ll probably see something like this, with ups and downs:
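In addition to describe, you can watch the HPA’s reported metric value and replica count change live as load comes and goes:

kubectl get hpa inference-server -n deep-learning --watch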

In conclusion, latency-based autoscaling via Linkerd2 and Prometheus can be a powerful tool for achieving cost-effective scalability and improving the performance of modern cloud-native applications. By dynamically adjusting the number of instances based on actual traffic demand, organizations can optimize their resource usage and provide better user experiences.
