Performance evaluation of vertical and horizontal autoscaling strategies in Kubernetes

Kewyn Akshlley
14 min read · Jul 28, 2022

Scalable applications may adopt horizontal or vertical autoscaling to dynamically provision resources in the cloud. To help choose the best strategy, this work compares the performance of horizontal and vertical autoscaling in Kubernetes. In measurement experiments that applied synthetic load to a web application, horizontal autoscaling proved more efficient, reacting faster to load variation and having a lower impact on the application's response time.

Thanks to Marcus Carvalho and Raquel Lopes for guiding me on this work.

Introduction

The workload received by a cloud-based application can fluctuate over time. To remain scalable, an application can adopt autoscaling strategies that automatically adjust its processing power according to specific metrics (e.g. CPU). The goal is to meet Quality of Service (QoS) targets for the application while reducing infrastructure costs in the cloud, which follows a pay-as-you-go model.

There are two types of autoscaling: horizontal, where the number of servers is increased or decreased depending on the workload; and vertical, where the computational resources of a server (e.g. CPU and memory) are upgraded or downgraded, also based on the workload. Even so, there are no well-defined criteria for deciding which one to use in a given scenario, and there is little analysis of vertical autoscaling and how it compares to horizontal autoscaling in terms of performance and cost benefit.

So, to evaluate the performance of both approaches, a measurement experiment was set up with Kubernetes, and a benchmark tool was used to send workloads in a controlled way to a "busy-wait" application. The mechanisms were evaluated on the time to make an autoscaling decision after a change in the workload, the CPU capacity requested by each decision, and the impact on the application's response time.

Autoscaling strategies in Kubernetes

Kubernetes (k8s) is an open-source project based on Borg. It is focused on container orchestration, allows container-based applications to run on a cluster, and facilitates the configuration of different kinds of environments (production, development, and staging). In general, k8s manages a group of physical and virtual machines (nodes), where a master node controls and allocates tasks to worker nodes, as shown in Figure 1. [1]

Figure 1. Representation of container orchestration in Kubernetes.

In the k8s world, a pod is the smallest unit of allocation on a node; it packages one or more containers and defines execution rules. Note that a node can run any number of pods, as long as it has enough resources.

Communication with k8s can be done through kubectl (kube control), a CLI tool for interacting with the kube-apiserver. Through the API server it is possible to obtain detailed information about the cluster (e.g. resource utilization, event history, and activity times) and to apply CRUD operations to k8s objects. It is also possible to define utilization limits for pods (e.g. CPU and memory), the number of replicas, and network policies.
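For example, the following illustrative commands cover the interactions described above (the file name is a placeholder):

```sh
kubectl get nodes                    # list the nodes of the cluster
kubectl get events                   # recent cluster events
kubectl top pods                     # CPU/memory utilization per pod (requires metrics-server)
kubectl apply -f autoscaler.yaml     # create or update an object from a configuration file
kubectl delete -f autoscaler.yaml    # remove the object
```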

To create an object in k8s, a configuration file containing all the necessary specifications is written. Objects serve different purposes (monitoring, network configuration, autoscaling, etc.), so each one is identified by a type that distinguishes it according to its objective. In this work, the types used were:

  • Horizontal Pod Autoscaler (HPA)
  • Vertical Pod Autoscaler (VPA)

Horizontal Pod Autoscaler

The goal of horizontal autoscaling is to increase or decrease the number of Pods so that cluster resources are used efficiently and the application's demand is met. This approach works with metrics such as CPU and memory, custom metrics (e.g. response time or queue size), or external metrics (based on the workload of an application outside Kubernetes). [2][3]

To use horizontal autoscaling, a configuration file with a HorizontalPodAutoscaler object is created, defining a target CPU utilization; if Pod utilization exceeds this target, more replicas are created. The check is done periodically: by default, the HPA evaluates the utilization of the Pods every 15 seconds (this interval can be changed) to decide whether new Pods must be created. A minimal manifest is sketched below.
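This sketch is only illustrative; the object names, replica bounds, and CPU target are assumptions, not the exact configuration used in the experiments:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: busy-wait-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: busy-wait                # hypothetical deployment running the web application
  minReplicas: 2                   # the experiments start with 2 Pods
  maxReplicas: 10                  # assumed upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 100  # add replicas when average CPU utilization exceeds this target
```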

The algorithm behind the HPA is based on the average current utilization of all Pods watched by the HPA (Uₐ), the desired utilization of the Pods (U𝒹), and the current number of replicas (Nₐ). The desired number of replicas (N𝒹) is computed with the following formula:

Figure 2: Formula of the HPA algorithm.
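In symbols, the formula shown in Figure 2 is N𝒹 = ⌈Nₐ × (Uₐ / U𝒹)⌉: the current number of replicas is multiplied by the ratio between current and desired utilization, and the result is rounded up.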

To better understand this, let’s suppose the following scenario:

  1. There is a cluster with 5 replicas (Nₐ = 5) and a target average utilization of 100 millicores, or 0.1 CPU-core (U𝒹 = 100).
  2. After a workload peak, the average utilization of all pods increases to 200m (Uₐ = 200).
  3. Applying the equation of Figure 2, we get N𝒹 = 5 * (200 / 100) = 10, i.e. N𝒹 = 10 is the ideal number of Pods to bring the average utilization back to 100m, respecting the threshold.

This shows that the HPA can even double the number of replicas at once, instead of adding one at a time, in order to keep the cluster stable, which makes the HPA very precise.

The HPA has a default delay (cool-down period) of 5 minutes before downscaling when the workload is reduced. This delay only applies while the utilization is below the defined limit. Although this value can be configured, this work uses the default.

Vertical Pod Autoscaler

The goal of vertical autoscaling is to increase or decrease the resources (e.g. CPU or memory) allocated to existing Pods. In Kubernetes, this means changing the amount of resources requested by a Pod. [4]

To use this approach, an object of type VerticalPodAutoscaler is created, pointing to the deployment targeted by the autoscaling (a minimal manifest is sketched after the component list below). The approach has three main components:

  • Updater: acts as a sentinel that verifies whether the Pods have the correct resources; if not, the Pods are recreated with the desired resources.
  • Admission controller: works together with the updater, setting the correct amount of resources requested by the Pod.
  • Recommender: watches resource usage and provides recommendations to upscale or downscale memory or CPU based on past and current utilization.
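A minimal sketch of such a manifest, only for illustration (the API version and object names are assumptions; the VPA was still in beta when the experiments were run):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: busy-wait-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: busy-wait                # hypothetical deployment targeted by the autoscaling
  updatePolicy:
    updateMode: "Auto"             # the Updater may recreate Pods with the Recommender's values
```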

Currently, the VPA provides 3 types of recommendations:

  1. Target: recommends the ideal amount of memory and CPU for the Pod to meet its goals.
  2. Upper bound: recommends an upper limit on the requested resources. If the request is greater than this limit, considering the confidence factor, the Pod will be scaled down.
  3. Lower bound: recommends a lower limit on the requested resources. If the request is lower than this limit, considering the confidence factor, the Pod will be scaled up.

The confidence factor is a way of keeping the VPA conservative in its autoscaling decisions. It uses the following variables: the Pod's current CPU request (Rₐ); the lower bound (Bₗ) and its confidence factor (aₗ); and the upper bound (Bᵤ) and its confidence factor (aᵤ).

The VPA scales resources down when Rₐ > (Bᵤ * aᵤ), where the confidence factor aᵤ decreases as the Pod's uptime increases, slowly converging to 1. The formula for the upper-bound confidence factor is aᵤ = 1 + 1/Δₜ, where Δₜ is the time in days since the Pod's creation.

On the other hand, the VPA scales resources up when Rₐ < (Bₗ * aₗ), where the confidence factor aₗ increases as the Pod's uptime increases, converging to 1. The formula for the lower-bound confidence factor is aₗ = (1 + 0.001/Δₜ)⁻². Because this factor converges quickly, the VPA makes scale-up decisions faster than scale-down decisions.

To better understand this, consider a Pod whose current CPU request is Rₐ = 100, whose current lower bound is Bₗ = 150, and which has been running for 5 minutes. Converting the time from minutes to days gives Δₜ = 5/60/24 ≈ 0.00347. The lower-bound confidence factor is then aₗ = (1 + 0.001/0.00347)⁻² ≈ 0.6. Since 100 < 150 * 0.6 ⇒ 100 < 90 is false, the Pod's capacity will not be increased. For the Pod to be recreated, the confidence factor would have to be at least aₗ ≈ 0.67, which would take approximately 7 minutes.
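Summing up the two rules in the notation above: the VPA scales up when Rₐ < Bₗ × aₗ, with aₗ = (1 + 0.001/Δₜ)⁻², and scales down when Rₐ > Bᵤ × aᵤ, with aᵤ = 1 + 1/Δₜ. In the example, scale-up requires aₗ ≥ 100/150 ≈ 0.67; solving (1 + 0.001/Δₜ)⁻² = 0.67 gives Δₜ ≈ 0.0045 days, i.e. roughly 6.5 minutes of uptime, which is where the "approximately 7 minutes" above comes from.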

The environment of experiments

To generate and analyze the results of the experiment, it was necessary to create a test environment, define a way to generate resource utilization that triggers the autoscaling strategies, automate all experiments, and finally save and organize the experiment data. The architecture of the environment and its components is presented in Figure 3.

Figure 3: Architecture of the test environment. (Labels translated from the figure: Eventos = Events; Aplicação = Application; Legenda = Legend. Arrow colors: green = workload delivery; blue = cluster information gathering; red = log storage.)

The container orchestration environment was created using Minikube, which makes it easy to run a Kubernetes cluster on a local machine. It provides a cluster whose resources are limited to the machine on which it runs. Although the experiments were run locally, results obtained with Minikube can be replicated on cloud providers.

The tool used to generate the workload was Hey, a modern benchmark tool written in Go that can send large numbers of requests in parallel. It provides all the necessary parameters (an example invocation is shown after the list):

  • -z: defines for how long the requests will be sent.
  • -c: defines the number of workers running concurrently.
  • -q: defines the request rate per worker.
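For instance, a single 2-minute stage with λ = 8 requests per second could be generated with a command like this (the URL and the `time` query parameter are placeholders for the application endpoint):

```sh
# 8 concurrent clients, 1 request per second each, for 2 minutes (λ = 8 req/s)
hey -z 2m -c 8 -q 1 "http://<minikube-ip>:<port>/busy-wait?time=175"
```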

To receive the workload on Minikube, a web application was developed in Node.js. It exposes a REST endpoint that calls a busy-wait function which uses 100% of a CPU core for a given service time in milliseconds. As shown in Figure 4, the function receives the service time and keeps the CPU busy until that time has elapsed (a sketch of such an endpoint is shown after Figure 4).

Figure 4. Algorithm of the busy wait function.
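For reference, a minimal sketch of what such an endpoint could look like (illustrative only: it assumes Express and a `time` query parameter, which may differ from the actual application linked at the end of the article):

```javascript
const express = require('express');
const app = express();

// Busy-wait: keep one CPU core at 100% until `ms` milliseconds have elapsed.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // intentionally empty loop body
  }
}

// REST endpoint that receives the service time in milliseconds and burns CPU for that long.
app.get('/busy-wait', (req, res) => {
  const serviceTime = Number(req.query.time) || 175;
  busyWait(serviceTime);
  res.send(`busy for ${serviceTime} ms`);
});

app.listen(8080);
```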

Evaluation scenarios

Since vertical autoscaling needs at least one healthy Pod, 2 initial Pods were configured for each autoscaling strategy to keep the configurations similar. Each Pod was also given an initial CPU request of 0.15 CPU-cores and a limit of 1.5 CPU-cores.

In all evaluated scenarios, the service time (the time the endpoint takes to process one request) was constant at S = 0.175 seconds. The workload intensity was controlled by the request rate (λ), implemented as the number of concurrent clients each sending 1 request per second. Each experiment scenario was divided into 9 stages, each with a different workload level. Each stage runs for 2 minutes, adding up to 18 minutes per scenario.

To reach the desired CPU utilization in each stage, the request rate was defined based on operational laws from queueing theory. The traffic intensity, according to the utilization law, is ρ = λ ∗ S. For example, to reach a utilization of 2 cores (ρ = 2) with a service time of S = 0.1, the rate has to be λ = ρ / S = 2 / 0.1 = 20 requests per second. If this rate were raised to 40, for instance, the system would become unbalanced, because 4 cores would be needed to handle the workload. [5]
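Applying the same law with the service time used in the experiments (S = 0.175 s): ρ = λ × S gives λ = 2 → ρ = 0.35, λ = 4 → ρ = 0.7, λ = 6 → ρ = 1.05, and λ = 8 → ρ = 1.4 CPU-cores, which are exactly the demands shown in Figure 5.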

Figure 5. Flow of the stages of the experiment.

As shown in Figure 5, the request rate varied between λ = 2 (demanding ρ = 0.35 CPU-cores), λ = 4 (ρ = 0.7 CPU-cores), λ = 6 (ρ = 1.05 CPU-cores), and λ = 8 (ρ = 1.4 CPU-cores). The scenarios are intended to capture the following behaviors:

  1. In the first scenario, λ = [2, 2, 4, 6, 8, 6, 4, 2, 2], the objective is to increase and decrease the workload gradually, in a non-aggressive way.
  2. In the second scenario, λ = [2, 2, 8, 8, 8, 2, 2, 2, 2], the objective is to increase the workload abruptly, keep it at its peak for 3 stages, and drop it to its lowest level for the remaining stages.
  3. In the third scenario, λ = [2, 2, 8, 8, 2, 2, 2, 2, 2], the objective is the same as the second, but the workload stays at its peak for less time.
  4. In the fourth scenario, λ = [2, 2, 8, 2, 8, 2, 8, 2, 2], the objective is to vary the workload with multiple peaks.

Once these scenarios were defined, everything was automated with shell scripts, as sketched below.
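A simplified sketch of what the automation of one scenario might look like (illustrative; variable names and the URL are placeholders, and the real scripts in the repository linked at the end may differ):

```sh
#!/bin/sh
# Scenario 1: gradual increase and decrease of the request rate, 9 stages of 2 minutes each
RATES="2 2 4 6 8 6 4 2 2"
URL="http://<minikube-ip>:<port>/busy-wait?time=175"   # placeholder endpoint

for RATE in $RATES; do
  echo "$(date -u +%T) starting stage with lambda=$RATE req/s"
  # one concurrent client per unit of rate, each sending 1 request per second
  hey -z 2m -c "$RATE" -q 1 "$URL" >> hey-scenario1.log
done
```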

During the execution of the experiment, the Kubernetes API provides data that is essential to the evaluation: 1) CPU usage; 2) autoscaler recommendations; and 3) the amount of CPU requested by each Pod. These data are retrieved every 10 seconds and saved in log files, which makes it possible to track the CPU requested by each Pod over time.
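A hedged sketch of how this periodic collection could be done with kubectl (the VPA object name is hypothetical, and the actual scripts may query the API differently):

```sh
#!/bin/sh
# Poll cluster state every 10 seconds and append it to log files
while true; do
  date -u +%FT%TZ >> usage.log
  kubectl top pods >> usage.log          # current CPU usage (requires metrics-server)
  kubectl get pods -o custom-columns=NAME:.metadata.name,CPUREQ:.spec.containers[0].resources.requests.cpu >> requests.log
  kubectl describe vpa busy-wait-vpa >> recommendations.log   # hypothetical VPA object name
  sleep 10
done
```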

Also, every time the scripts executed a Hey command to generate workload, the metrics reported for the requests were saved in log files, providing the data needed to analyze the behavior of the application during the tests.

Results

The four experimental scenarios were executed for each autoscaling strategy. Both approaches started with 2 initial pods, each with 0.15 CPU-core, to be resized by the autoscaler over time. Figures 6 and 7 show the CPU requested by each Pod over the course of the experiment. The dashed line indicates the CPU capacity needed to reach 100% utilization in each workload stage.

Figure 6. CPU requested by each Pod under vertical autoscaling.

With the VPA there was a delay in resizing resources, so the CPU capacity stayed below what was needed most of the time (colored bars below the dashed line). The delay in autoscaling decisions was greater in scenario 1, where the workload increases gradually, and smaller in scenarios 2 and 3, where the workload changes abruptly. In scenario 4, with short peaks, resources were only provisioned in stage 8. It was also noticed that when scaling up the VPA requests more CPU than necessary, while when scaling down it is more conservative. [Figure 6]

Furthermore, even though the last 5 stages received a low workload, the VPA did not downscale, causing unnecessary over-provisioning of resources. The reason for this delay is the mechanism's confidence factor, which requires more time to gain confidence in its recommendations. The presence of 3 pods at some points is because, when resizing a Pod, the VPA creates a new one with the desired resources and only terminates the previous Pod once the new one is fully ready. The confidence factor exists precisely to reduce the overhead of recreating Pods many times.

Figure 7. CPU requested by each Pod under horizontal autoscaling.

The HPA reacted effectively to workload changes most of the time, although it kept the requested CPU slightly above what was needed. The time to make a scaling decision when the load increased was 40 seconds on average. Only in the scale-up transitions into stage 3 of all scenarios, and in stages 4 and 5 of scenario 1, did the CPU stay below the required level for approximately 1 minute.

The HPA was able to downscale the Pods after the 5-minute delay, unlike the VPA, which did not downscale at all. In scenario 4, the HPA kept the requested CPU over-provisioned, which is positive for dealing with short peaks, but in the long term can be harmful to infrastructure costs.

Figure 8. Response time of the application under vertical and horizontal autoscaling.

Figure 8 shows boxplots comparing the response time of the requests made to the web application across the workload stages of each scenario. The middle line of each box represents the median, while the dot and triangle mark the mean response time in each stage.

Horizontal autoscaling produced response times very close to the service time (0.175 seconds) in almost all scenarios, with only the mean and 3rd quartile slightly higher in a couple of stages where the workload increases. On the other hand, vertical autoscaling produced response times much greater than the service time in several stages, both for the mean and the quartiles, because of the delay in resizing the Pods.

In summary, at the time this evaluation was made and using the default configuration of both autoscaling strategies, the HPA proved more effective, responding faster to workload changes with an adequate number of Pods to serve the requests, while the VPA was negatively impacted by the delay in resizing the Pods.

Conclusion

This work analyzed the performance of vertical and horizontal autoscaling in Kubernetes through measurement experiments. To do this, it was necessary to generate and control load with a benchmarking tool and to create workload scenarios that reveal the behavior of the autoscaling approaches, focusing on metrics of response time, Pod CPU requests, and the time between autoscaling events.

The experiments showed that horizontal autoscaling is less conservative and more effective at resizing resources in the evaluated scenarios. It is important to remember that this precision comes from the objectivity of the Horizontal Pod Autoscaler algorithm, whose goal is simply to keep the average resource utilization of the Pods at the defined target.

In contrast, vertical autoscaling is more conservative in its provisioning decisions, because it relies on a confidence factor that only grows as execution history accumulates over time. The conclusion is that, in longer experiments where more historical data about Pod execution can be generated, vertical autoscaling would become more effective in its decisions.

With the parameters and scenarios used in this work, horizontal autoscaling proved more effective than the vertical approach because of the precision of its autoscaling decisions, its agility in provisioning resources, and consequently the faster response time of the web application. It is important to highlight that, at the time of the experiment, vertical autoscaling was in beta and receiving frequent improvements, which may make this approach more effective in the future. Also, the default Kubernetes configuration was used for both autoscaling mechanisms, so changing the parameters can produce different results.

As future work, we intend to evaluate newer versions of Kubernetes with real workloads from public datasets and on larger-scale infrastructure, and to investigate a strategy that chooses the best autoscaling approach for each situation.

Note: at the time the tests were made, the Kubernetes Vertical Pod Autoscaler was in beta, while the Horizontal Pod Autoscaler was already considered stable.

The scripts of the experiment: https://github.com/kewynakshlley/k8s-autoscaler-analysis

Busywait web application: https://github.com/kewynakshlley/busy-wait-clustering

This article was accepted and published at the Brazilian Symposium on Computer Networks and Distributed Systems. It is available here: https://sol.sbc.org.br/index.php/sbrc_estendido/article/view/21437

[1] Borg: The Predecessor to Kubernetes: https://kubernetes.io/blog/2015/04/borg-predecessor-to-kubernetes/

[2] Gandhi, A., Dube, P., Karve, A. et al. Model-driven optimal resource scaling in cloud. Softw Syst Model 17, 509–526 (2018). https://doi.org/10.1007/s10270-017-0584-y

[3] Horizontal Pod Autoscaler: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

[4] C. Liu, M. Shie, Y. Lee, Y. Lin, and K. Lai. Vertical/horizontal resource scaling mechanism for federated clouds. In 2014 International Conference on Information Science Applications (ICISA), pages 1–4, 2014. doi: 10.1109/ICISA.2014.6847479

[5] Daniel A. Menasce, Lawrence W. Dowdy, and Virgilio A. F. Almeida. Performance by Design: Computer Capacity Planning By Example. Prentice Hall PTR, USA, 2004. ISBN 0130906735.
