DevOps at Agoda: How we move 3 million CI jobs from VMs to Kubernetes

Agoda Engineering
Agoda Engineering & Design
Apr 24, 2023


Introduction

At Agoda, we use GitLab in multiple critical areas, ranging from source control and CI/CD to the package and container registries, as well as Agile planning. Our developers leverage GitLab to build and test their applications, deploy to Kubernetes, run batch jobs, and perform other functions.

The CI/CD system, which powers over 3 million jobs monthly, is among the core systems enabling our developers to deliver features rapidly and reliably. Initially, our GitLab Runner architecture relied on Docker Machine runners deployed on our internal OpenStack infrastructure with Terraform and Ansible.

While it performed well with a few runners, scaling it up to over 800 revealed its limitations. It became increasingly complex and challenging to ensure the desired state on OpenStack, optimize hardware utilization, manage fast and easy scaling, and monitor for self-healing.

Most of the issues mentioned above can be solved with features and abstractions provided by Kubernetes. After evaluating our options, we decided to shift our runners to Kubernetes.

Our goal is to increase stability, reduce maintenance effort, enhance visibility, and optimize hardware usage and cost. Although we are still in the implementation process, we will share our progress thus far and how it has contributed to the continuous improvement of our CI/CD system at Agoda.

The problem with our VM-based GitLab Runners

Over the past year, we transitioned from a self-hosted GitHub/TeamCity setup to self-hosted GitLab with GitLab CI/CD.

As part of this migration, we introduced the VM-based Docker Machine runners for CI, as recommended by GitLab. This solution runs each build in a Docker container on a dedicated virtual machine, giving each build a clean, ephemeral environment with good security and isolation.

The implementation on our side looked like this:

  • Creating the base image of the runner with Packer.
  • Deploying all runner managers with temporary shell runners using Terraform, then destroying the temporary runners once the deployment was complete.
  • Storing the Terraform state in GitLab's built-in Terraform HTTP backend.
  • Running external scripts on cron jobs to monitor OpenStack resources and clean up any leftover runners in a bad state.

The entire Terraform process ran on a nightly schedule to maintain the desired state on OpenStack.
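
As a rough illustration, the scheduled deployment pipeline looked something like the sketch below; the job name, paths, and the GitLab terraform-images wrapper shown here are assumptions rather than our exact configuration:

```yaml
# Hypothetical sketch of the nightly Terraform pipeline; the job name, paths,
# and the terraform-images wrapper are illustrative rather than our exact setup.
variables:
  TF_ROOT: terraform/runners          # assumed location of the runner definitions
  TF_STATE_NAME: openstack-runners    # state name in GitLab's Terraform HTTP backend

deploy-runners:
  image: registry.gitlab.com/gitlab-org/terraform-images/stable:latest
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # only run on the nightly schedule
  script:
    - cd "$TF_ROOT"
    - gitlab-terraform init     # state is stored in GitLab's built-in HTTP backend
    - gitlab-terraform plan     # compute the drift against the desired state
    - gitlab-terraform apply    # converge OpenStack back to the desired state
```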

During the early days of our migration to GitLab CI, this setup proved sufficient to meet our needs.

However, as most of Agoda’s CI workload shifted to GitLab CI over time, we encountered issues that hindered our ability to scale up or down quickly. This made it challenging to maintain low wait times for CI jobs while optimizing resource utilization:

  • Slow startup time

The Docker Machine managers are responsible for provisioning Runners on demand. Each manager is set up with a maximum number of Runners it can provision, and within this limit, it will scale OpenStack VMs up or down based on the number of CI jobs running.

On average, it takes 2.5 minutes for a Runner to start up. If a job is waiting for a new runner to be spawned, it will be queued for an average of 2.5 minutes before running.

Average OpenStack machine creation time is about 2 minutes

  • Slow propagation of changes

Using Terraform meant we had two built-in ways to propagate changes to our 800+ runners: sequentially or all at once.

Opting for the latter would have resulted in excessive disruption to the CI platform, as there would be a period with no runners available whatsoever. Instead, we chose to deploy the managers sequentially, reducing the impact on our developers but resulting in a slower change propagation.

Updating a single configuration parameter could take several hours. It could take even longer if managers need to be fully recreated (as a manager must wait for all jobs running on its runners to finish before shutting down and being recreated).

  • Dead leftover Runners

Some Runner VMs were left behind, still online, even after their Runner Manager was deleted. We called these Ghost Runners.

Upon deletion, the Runner Manager waits for the Runner to finish its current job, shuts down the Runner process, then sends a request to delete the VM. Any issue between the request and the VM's deletion could leave us with a Ghost Runner.

Their numbers grew over time, consuming actual hardware resources on our OpenStack and reducing our future capacity to scale up. We had to run a custom cleanup script on a schedule to mitigate this issue.

  • Dependency on the CI runner itself

The deployment of the Runners is done through GitLab CI/CD, which depends on already having available Runners to run all those Terraform jobs.

This presents a problem in two situations. The first is when our Runners reach 100% utilization and we need additional managers: the deployment is blocked because no Runners are available to execute our jobs. The second is when we need to bootstrap a runner from scratch, without any pre-existing runners.

To avoid being blocked from making changes, we set up a temporary dedicated runner that solely runs our Terraform jobs.

  • Monitoring

We had to write a few custom Python scrapers to extract OpenStack and VM metrics related to the Runners. Even with those, we lacked visibility into real-time CPU utilization on OpenStack and could not observe the CPU utilization of the child Runners spawned by each manager. This made capacity planning and scaling up complicated.

To overcome those challenges, we looked into Kubernetes as an alternative to the VM-based Docker Machine Runner.

What are the alternatives for running GitLab Runners on Kubernetes?

GitLab Kubernetes executor

We began our journey with the GitLab Kubernetes executor, which proved fast and effective. However, we encountered a major drawback: security.

The container had to run in privileged mode to support DinD (Docker-in-Docker), a common practice at Agoda for integration testing using docker-compose.
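
As a rough illustration, such a job looks something like the sketch below (image tags, job name, and compose files are placeholders); the dind service only works if the runner starts the build containers in privileged mode:

```yaml
# Illustrative DinD job; image tags, job name, and compose files are placeholders.
integration-tests:
  image: docker:24.0                      # assumes the Compose plugin ships with this image
  services:
    - docker:24.0-dind                    # requires privileged = true on the runner
  variables:
    DOCKER_HOST: tcp://docker:2376
    DOCKER_TLS_CERTDIR: "/certs"          # runner must share the /certs/client volume
    DOCKER_TLS_VERIFY: 1
    DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
  script:
    - docker compose up -d                # start the service dependencies
    - docker compose run --rm tests       # run the integration test suite against them
```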

A privileged container can mount the host's devices into itself, gaining full access to the host. Since CI allows developers to run arbitrary code, a malicious CI job could gain complete control of the Kubernetes worker node, read most of the CI secrets, and move laterally through our network.

It was a deal breaker for our security and infrastructure teams. To avoid this privileged-container vulnerability, we continued our journey by evaluating alternative container runtimes, such as Sysbox and Kata Containers.

Kata Containers runtime

Kata Containers is an open-source container runtime that can replace runc (the default low-level container runtime in Kubernetes). It is OCI (Open Container Initiative) compliant and adds a layer of security by running each pod in a lightweight virtual machine, achieving VM-level isolation.

Kata containers overview

Traditional containers run directly on the Kubernetes worker node under the runc runtime. The Kata runtime runs each pod in its own virtual machine, making privileged containers safe to run: there is no need to worry about a container using privilege escalation to escape and run commands on the worker node.
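
In Kubernetes terms, a pod opts into Kata through a RuntimeClass. The sketch below is a minimal illustration; the handler name and image are assumptions that depend on how containerd is configured on the worker nodes:

```yaml
# Minimal sketch: a RuntimeClass pointing at the Kata handler, and a CI pod opting into it.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata                    # containerd runtime handler configured on the worker nodes
---
apiVersion: v1
kind: Pod
metadata:
  name: ci-build
spec:
  runtimeClassName: kata         # this pod runs inside its own lightweight VM
  containers:
    - name: build
      image: docker:24.0-dind    # illustrative image
      securityContext:
        privileged: true         # privileged inside the Kata VM, not on the worker node itself
```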

Kata container runtime provides two storage driver options: overlay and block storage with a device mapper. We evaluated and compared both for performance:

  • Overlay Storage

This option works fine for most jobs but requires additional virtio-fs arguments for DinD. After adding them, our measurements showed a major performance hit; in some cases, jobs froze until they reached the timeout.

  • Block Storage with Device Mapper

Having ruled out overlay storage, we tried the other storage driver supported by Kata Containers, which improves storage I/O speed. However, enabling block storage requires an additional block device on the Kubernetes worker node. Behind the scenes, containerd requests block storage from the device mapper pool, and the newly created device is formatted and mounted as the root filesystem.

While this setup is suitable for smaller file writes, it can cause memory usage problems. Specifically, the write-back caching may lead to excessive use of worker memory, which could result in the hypervisor restarting the worker node due to an OOM (Out-of-Memory) error.

Memory utilization going up until the worker node is restarted due to OOM

Overlay storage does not seamlessly support DinD builds, and block storage with device mapper causes noticeable memory overhead; a common issue we encountered with both was a significant increase in overall build time:

This increase is caused by the drop in storage driver I/O performance. It was enough for us to rule out Kata Containers, as some CI jobs rely on fast underlying storage to run efficiently (compilation, cache extraction, building container images).

We found that Kata containers fell short in job build time, increasing it by a factor of 4, and were also more resource-intensive, particularly regarding memory usage.

As a result, we needed to identify an alternative solution that could meet our requirements while still ensuring acceptable performance for our CI.

The secure and performant solution: KubeVirt

The next option was a project incubated under the CNCF (Cloud Native Computing Foundation): KubeVirt. It allows running VMs inside Kubernetes pods using KVM (Kernel-based Virtual Machine).

Each virtual machine is wrapped in a Kubernetes pod, making it easy to manage with the native kubectl tooling.

We measured the build time performance the same way we did with Kata containers:

The results were far superior to Kata Containers: performance is close to the OpenStack VMs while achieving the security isolation layer we needed. As an extra benefit, we also inherit all the features Kubernetes provides, like orchestration.

The way we deploy our Runners completely changed after our shift to Kubernetes and KubeVirt:

  • We created the QEMU Linux base image using Packer and stored it as a container image in our registry.
  • We deployed a Helm chart through GitLab CI in our dedicated self-hosted Kubernetes cluster. It deploys a KubeVirt VirtualMachineInstanceReplicaSet resource for each Runner type we provide (a trimmed-down sketch follows the diagram below).
  • The KubeVirt operator creates the pods using the previously built QEMU Linux container image. Resources (CPU, memory, ephemeral disk) depend on the type of Runner created.
  • Post-provisioning setup to configure and register the runner is done by passing UserData and having cloud-init run at boot.
  • Monitoring is done through Prometheus/Loki, and Kubernetes enforces the desired state at all times.
Deployment flow for our Linux GitLab Runners on Kubernetes
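
To give an idea of what the Helm chart renders, here is a trimmed-down sketch of a VirtualMachineInstanceReplicaSet for one Runner flavor; the names, sizes, replica count, and image reference are illustrative:

```yaml
# Trimmed-down sketch of one Runner flavor; labels, sizes, and image are illustrative.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceReplicaSet
metadata:
  name: linux-runner-m1-large
spec:
  replicas: 50
  selector:
    matchLabels:
      runner-flavor: m1-large
  template:
    metadata:
      labels:
        runner-flavor: m1-large
    spec:
      domain:
        cpu:
          cores: 4
        resources:
          requests:
            memory: 6Gi
        devices:
          disks:
            - name: containerdisk
              disk:
                bus: virtio
            - name: cloudinitdisk
              disk:
                bus: virtio
      volumes:
        - name: containerdisk
          containerDisk:
            image: registry.example.com/ci/linux-runner-base:latest  # QEMU base image built with Packer
        - name: cloudinitdisk
          cloudInitNoCloud:
            secretRef:
              name: linux-runner-cloud-init   # UserData that configures and registers the runner at boot
```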

While most CI jobs at Agoda run in the container executor on Linux-based Runners, a small percentage of them require a Windows-based PowerShell runner. KubeVirt also provides the capability to run Windows VMs within Kubernetes, so we gave it a try.

The initial process was similar to what we described in the first part of this blog: a Windows image in QCOW2 format created with Packer, and Terraform/Ansible to provision the VM Runners on OpenStack.

We faced two challenges while trying to shift our Windows PowerShell Runners from VMs to KubeVirt pods:

  1. The Windows VM image type and size: for the Linux runners, we can store the image as a container image and mount it into the Linux KubeVirt VM; Windows does not support this. The QCOW Windows image is also considerably larger (~60 GB), as it includes software dependencies like multiple Visual Studio versions. To store it in Kubernetes, we use KubeVirt's CDI (Containerized Data Importer) add-on, which allows the image to be stored in a PVC (Persistent Volume Claim) that can be used as a KubeVirt disk template. When creating a Windows PowerShell runner, we clone the PVC and mount it in the runner (see the sketch after this list).
  2. The post-provisioning setup of the runner: for the Linux runners, we use cloud-init stored as a ConfigMap to configure the runner automatically once the pod starts. This approach is not supported for KubeVirt Windows VMs with cloudbase-init. KubeVirt supports a sidecar pod that could be used to run an Ansible playbook, but after testing, the support for Windows pods was not there. To handle the Runner's dynamic configuration after the pod is spawned, we created a new CRD (Custom Resource Definition) whose controller listens for Windows pod creation events. When an event is caught, a sidecar pod is created to run an Ansible playbook before shutting down.
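
The image import and per-runner clone look roughly like the sketch below; the namespace, source URL, and sizes are placeholders:

```yaml
# Sketch: import the Packer-built Windows QCOW2 image into a base PVC with CDI...
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: windows-runner-base
  namespace: ci-runners
spec:
  source:
    http:
      url: https://images.example.com/windows-runner.qcow2   # placeholder for the Packer-built image
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 80Gi          # large enough for the ~60 GB image
---
# ...and clone that base PVC for each Windows PowerShell runner instead of re-importing it.
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: windows-runner-01
  namespace: ci-runners
spec:
  source:
    pvc:
      name: windows-runner-base
      namespace: ci-runners
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 80Gi
```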

The Windows Runner deployment ended up being a bit different from the Linux one described previously:

  • We created the QCOW2 Windows Core base image using Packer and stored it as a PVC in the Kubernetes cluster using KubeVirt's CDI.
  • We deployed a Helm chart through GitLab CI in our dedicated self-hosted Kubernetes cluster. It created a volume based on the image PVC and deployed our custom WindowsRunnerReplicaSet resource (an Agoda CRD).
  • KubeVirt created the Windows pod using the previously created volume.
  • When the KubeVirt pod was up, our CRD's controller launched another pod, which runs an Ansible playbook against the Windows VM to configure the Runner.
Deployment flow for our Windows GitLab Runners on Kubernetes

Switching to KubeVirt on Kubernetes allowed us to reduce Runner startup time to 45 seconds and increase the speed and reliability with which we can modify our entire fleet of Runners. Recreating a runner is now simply a matter of deleting a pod and letting Kubernetes recreate it automatically. The desired state is continuously enforced by Kubernetes; no scheduled or Terraform jobs are involved. Monitoring is simplified and provides more visibility and granularity through the Prometheus Operator.

Taking advantage of Kubernetes Autoscaling

As of today, we only provide three different Runner flavors with the following specs:

  • m1.large (4 vCPU, 6 GB RAM, 80 GB disk)
  • m1.xlarge (8 vCPU, 16 GB RAM, 160 GB disk)
  • m1.xxlarge (16 vCPU, 30 GB RAM, 160 GB disk)

Developers can select the Runner that fits their workload requirements.

During peak hours, we sometimes see higher wait times than usual for CI jobs on some of the most used flavors. To reduce the wait time, we wanted to rebalance the number of runners of each flavor. For example, if there is strong demand for m1.large but not for m1.xxlarge, we would reduce the number of m1.xxlarge Runners and use the freed resources to add more m1.large Runners.

As this rebalancing is limited by the physical resources of our on-premise Kubernetes cluster, we also wanted to be able to scale up in GCP (Google Cloud Platform) if we exhausted our private resources or needed to absorb a sudden peak in demand.

Achieving this with our former setup was painful, static, and manual. Now that we have shifted to Kubernetes, we can use its features and tooling to tackle these issues.

We use KEDA, an event-driven autoscaler that reads metrics from external data sources and adjusts the number of replicas based on specified target thresholds.

In our case, GitLab job and Runner utilization metrics are exposed through Graphite and Prometheus. KEDA is configured to monitor those metrics and update the Runner replica count in the KubeVirt manifests as needed.

KEDA CRD to start scaling KubeVirt replicas when runner utilization is above 30%

Each of our Runner flavors has at least one KEDA ScaledObject using a custom utilization metric, allowing it to scale up and down on-premise based on demand. If we want a flavor to also scale into GCP during a sudden demand burst, we can add another ScaledObject using a different custom metric.
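
A simplified on-premise ScaledObject looks roughly like the following; the Prometheus query, metric names, replica bounds, and threshold are illustrative rather than our exact configuration:

```yaml
# Simplified sketch of a ScaledObject for one flavor; query, names, and bounds are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: linux-runner-m1-large
spec:
  scaleTargetRef:
    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstanceReplicaSet   # KubeVirt replica sets expose a /scale subresource
    name: linux-runner-m1-large
  minReplicaCount: 20
  maxReplicaCount: 200
  pollingInterval: 30                        # seconds between metric checks
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: |
          100 * sum(gitlab_runner_jobs{flavor="m1.large"})
              / sum(gitlab_runner_limit{flavor="m1.large"})
        threshold: "30"                      # scale out when utilization goes above 30%
```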

With this approach, we rarely reach a situation where many CI jobs are pending for a specific flavor. It allows us to maintain low wait times, giving our developers a better experience in their day-to-day CI usage.

Our journey and what’s next for the CI in Agoda

Shifting our GitLab Runners from VMs to Kubernetes was a long but exciting journey for our team. We learned a lot along the way, and it was worth the time and effort: it solved most of the issues and shortcomings of the VM Runners while remaining seamless and transparent for our users.

Here are some highlights made possible by shifting to KubeVirt Runner in Kubernetes:

  • Reduced creation and startup time: average Runner creation and startup time reduced from 2 minutes to 45 seconds.
  • Faster upgrade, recreation, and maintenance of our fleet of Runners: our Runners are now Kubernetes pods, so we can leverage Kubernetes/KubeVirt primitives (ReplicaSet, pod, replica, rollout…) to recreate or update them easily.
  • Desired state over time: the desired state is continuously enforced by Kubernetes. No external action is needed (CI schedules, cron jobs…).
  • Autoscaling: using KEDA, our different Runner flavors are continuously rebalanced and scaled up or down dynamically without any manual work.
  • Auto-healing: native Kubernetes pod health checks and probes help us automatically recreate a Runner if it goes down or is degraded.
  • Better monitoring: metrics are automatically exposed via the Prometheus Kubernetes Operator, providing more insight into the resource utilization of each Runner pod.

Our journey is not complete yet. We still have a long way to go and a few areas we want to improve. The migration is ongoing, as we have yet to reach the state where 100% of our Runners are in Kubernetes.

The deployment of our Runners still relies on GitLab CI and the Runners themselves, so we could find ourselves in a chicken-and-egg situation. To mitigate this, we could switch to a pull model and use KAS (the GitLab agent server for Kubernetes), Argo CD, or Flux. Finally, to better serve our developers, we want to extend the Runner flavors we offer with some geared toward memory- or CPU-heavy CI jobs.

Acknowledgments

Thanks to Jinna Chodchoy, Passakorn Chueaphanich, Chanpol Kongsute, and Guillaume Lefevre for contributing to this blog.

