How Agoda Transitioned to Private Cloud

Johan Tiesinga
Agoda Engineering & Design
Nov 17, 2022 · 11 min read


At Agoda, we are moving away from our traditional Virtual Machine deployments and towards our Private Cloud platform. The transition stems from our desire to improve flexibility in how we utilize resources in our data centers: we want to increase average hardware utilization while also being able to scale services up and down rapidly.

This is where moving Kubernetes into our data centers as a “Private Cloud” comes in. Throughout this transition, our systems faced various challenges with resource management and JVM configuration, in some cases even leading to a 30% capacity loss in critical services.

How we transitioned from Puppet to dockerized deployments

Initially, we deployed our applications “straight onto” the VMs, using Puppet to manage the required system packages and application configuration.

Over time, our QA environments started shifting towards containerized deployments. Eventually, our production deployments followed suit: out went Puppet, and Docker images now ensured repeatable deployments without configuration drift on the VMs.

At this point, our resource allocation model had not yet changed. We still used VMs to control how much memory and CPU were available, and we often even disabled Docker networking since we didn’t use any of its features. For most systems, one VM ran exactly one container.

Our transition from IaaS to PaaS

The next step in our journey to Private Cloud was to change the deployment model from “Infrastructure as a Service” to “Platform as a Service.” Teams were given a Kubernetes cluster to deploy on and no longer needed to maintain the infrastructure themselves. Besides that, the platform also provided a baseline level of visibility into running applications, regardless of their framework or language, directly from the service mesh. Previously, this had to be handled in every app, as service discovery was client-side.

But this is where a significant change in resource management was introduced: the container, rather than the VM, is now responsible for constraining the application’s resource use.

Let’s look at how Kubernetes resource allocation works compared to our traditional VMs.

Kubernetes memory allocation

In a Kubernetes deployment, you’re expected to specify how many resources you need for your application. This is separated into two sections: Request and Limit.

When “requesting” a certain amount of memory, the Kubernetes scheduler will ensure that your pod gets scheduled onto a worker node whose total requested memory does not exceed its allocatable memory.

The second part is the “limit,” which caps the amount of memory your pod can use. Once you exceed this limit, Kubernetes will kill your pod and restart it. If your limit is higher than your request, the difference is not guaranteed to be available to your application: since pod scheduling is based on requests, a densely packed worker node might not have enough free memory to let pods “burst” above their requests for long.
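As a concrete illustration, here is roughly what such a memory specification looks like when built with the fabric8 kubernetes-client model (the library choice and the 4Gi values are ours, purely for illustration):

```java
import io.fabric8.kubernetes.api.model.Quantity;
import io.fabric8.kubernetes.api.model.ResourceRequirements;
import io.fabric8.kubernetes.api.model.ResourceRequirementsBuilder;

public class PodMemorySpec {
    public static void main(String[] args) {
        // The scheduler places the pod based on "requests"; the kubelet enforces
        // "limits" and kills the container when the limit is exceeded.
        ResourceRequirements resources = new ResourceRequirementsBuilder()
                .addToRequests("memory", new Quantity("4Gi")) // used for scheduling
                .addToLimits("memory", new Quantity("4Gi"))   // exceeding this restarts the pod
                .build();
        System.out.println(resources);
    }
}
```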

Comparing this to VMs, the behavior is fairly similar if your memory request and limit match, mainly because we also set “restart: always” on our Docker deployments: if the application on the VM ran out of memory, it would restart, much like when Kubernetes kills the application for exceeding its memory limit. However, in the past, we rarely enforced memory limits through Docker, so the way memory usage is controlled is slightly different, even though the behavior is similar.

The JVM and -Xmx

Memory utilization graph at 12GB while Max Heap is 10GB

One common misconception when configuring the JVM is that `-Xmx` (max heap memory) controls the maximum memory the JVM will use. However, the JVM has several other components that allocate memory “Off Heap” (i.e., not controlled by the Xmx flag).

Some of the most familiar ones are threads. On a 64-bit JVM, the default stack size is 2MB per thread. Some of our applications spawn hundreds of threads, which can easily add a few hundred MB of memory usage on top of the heap.
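To get a feel for this overhead, here is a rough sketch (the class name is ours, and the 2MB figure is the default quoted above; the real value depends on the platform and on `-Xss`):

```java
public class StackOverhead {
    public static void main(String[] args) {
        // Count live threads and estimate their combined stack reservation,
        // none of which is covered by -Xmx.
        int threadCount = Thread.getAllStackTraces().size();
        long assumedStackMb = 2; // illustrative: assumed stack size per thread
        System.out.printf("Live threads: %d, approx. stack memory: %d MB%n",
                threadCount, threadCount * assumedStackMb);
        System.out.printf("Max heap (-Xmx): %d MB%n",
                Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```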

Another component that tends to eat quite some memory for larger applications is Metaspace. Every class loader has a little overhead, and as applications grow bigger, they’re bound to load many classes (either from their own code or from the various libraries they use).

Some of the other, smaller consumers of off-heap memory are the native code produced by the C1/C2 JIT compilers and the garbage collector.

There’s one more library we love to use that relies on off-heap memory, too: our favorite lz4 library. As it bridges to C code, it needs off-heap memory to give the native library a fixed memory address to work with. Since some of our systems process quite a lot of compressed data, the extra memory used by this library can be quite noticeable.
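The exact mechanism depends on the binding, but direct ByteBuffers are the typical way to hand native code a fixed address. A minimal sketch of this kind of allocation (plain JDK; the size is illustrative):

```java
import java.nio.ByteBuffer;

public class OffHeapBuffer {
    public static void main(String[] args) {
        // A direct buffer lives outside the Java heap, so none of these 256 MB
        // are governed by -Xmx, yet they still count against the container's
        // memory limit.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(256 * 1024 * 1024);
        System.out.println("Direct buffer capacity: " + offHeap.capacity() + " bytes");
    }
}
```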

Overall, all these small to few-hundred-MB overheads can add up to a gigabyte or more of “unexpected” memory usage when looking only at `-Xmx`. Ideally, in a containerized world, your application is flexible enough to adjust to the memory requested in the Kubernetes deployment. This usually means no longer adding a “buffer” on top of `-Xmx`, but rather using `-XX:MaxRAMPercentage=75.0` to set the max heap dynamically while leaving some space free for the non-heap components.
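A quick way to sanity-check the resulting split from inside the container is to compare the JVM’s max heap with the cgroup memory limit. A small sketch (assuming the cgroup v1 layout; paths and numbers are illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class HeapVsContainer {
    public static void main(String[] args) throws Exception {
        // cgroup v1 path; cgroup v2 exposes /sys/fs/cgroup/memory.max instead.
        long containerLimit = Long.parseLong(
                Files.readString(Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes")).trim());
        long maxHeap = Runtime.getRuntime().maxMemory();
        // With -XX:MaxRAMPercentage=75.0 the ratio below should land around 75%,
        // leaving the remainder for threads, Metaspace, JIT code, and friends.
        System.out.printf("Max heap: %d MiB, container limit: %d MiB (%.0f%%)%n",
                maxHeap >> 20, containerLimit >> 20, 100.0 * maxHeap / containerLimit);
    }
}
```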

A side note on memory allocation

Even after taking the above into account, we still faced issues with our containers breaching the memory limit. We found a significant mismatch between what the OS reported as allocated by the JVM and what the JVM itself tracked as allocated via Native Memory Tracking. We concluded that this was a side effect of “arena allocation.”

So what is arena allocation? Allocating and freeing memory is a relatively expensive operation for the OS. A trick allocators use to improve allocation throughput is to grab a bigger chunk of memory (an arena) and handle individual allocations within that chunk inside the process rather than going to the OS each time. This also has a benefit once memory needs to be freed: you only need to free a single arena instead of many smaller allocations.

However, the glibc implementation we used had a flaw here: memory fragmentation. The JVM would “release” memory, yet the memory wasn’t actually freed, because only parts of each arena were released. As a mitigation, we switched from glibc’s allocator to jemalloc, which aims to minimize fragmentation, and this successfully stabilized our memory utilization.

Kubernetes CPU allocation

Similar to memory, the CPU also has request and limit configurations. The request is used for scheduling pods to worker nodes (same as memory). It also controls priority on the worker node level to provide the pod with the expected cores even under extreme loads.

The limit is a slightly different story, though. Instead of being killed for using too much CPU, your pod will be paused for a bit. As with memory, if your limit is higher than your request, you are not guaranteed to be able to use more than your requested cores.

However, the units differ between VMs and Kubernetes. On a VM, one core means one core available to the OS. On Kubernetes, 1 core means 1000 millicpu worth of CPU time (which can be spent across multiple physical cores or threads). This is a big difference in how resources get isolated!
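This difference also shows up in what the JVM itself reports. A tiny sketch (the exact behavior depends on the JVM version and its container-support settings):

```java
public class VisibleCores {
    public static void main(String[] args) {
        // On recent JVMs with container support enabled, this value is derived
        // from the cgroup CPU quota rather than the worker node's physical core
        // count, so a pod with a 2-core limit typically sees 2 here, not 44/64/96.
        System.out.println("availableProcessors = "
                + Runtime.getRuntime().availableProcessors());
    }
}
```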

Threads, cores, and you

So as a start, let’s clarify the terms:

  • Concurrency is making progress on more than one task at a time, but not necessarily executing them at the same instant.
  • Parallelism (or multithreading) is executing more than one task at the same instant.

Achieving concurrency and parallelism on the JVM is usually done using thread pools. A thread pool (very generalized) is a queue of tasks executed on a “pool” of one or more threads. Each CPU core can run one thread at a time, and switching between threads requires a “context switch,” which is a fairly expensive operation: the current thread’s state must be stored, another thread’s state restored, and execution resumed. Let’s take a small sample on a single-core machine with four threads trying to do work concurrently.

Graph showing each thread getting 25ms of CPU time sequentially

Under fair conditions, each thread would get a slice of time to do its work. Now, how do we get cores to run these threads on?

Kubernetes vs. VM CPU isolation

On our typical VMs, we usually have 8 or 16 cores available, while the Kubernetes worker nodes have 44, 64, or 96 cores. The difference is that on Kubernetes, the limit doesn’t prevent you from using more than `limit` cores: you can still use all the cores of the worker node in parallel if you want. You will just get penalized by CFS (the Completely Fair Scheduler, the underlying Linux component that Kubernetes relies on) if you consistently use more CPU time than your limit allows (remember, it’s based on CPU time used, not cores!). On VMs, on the other hand, you’re isolated to the virtualized cores, so you can’t utilize any more than you were assigned.

CFS can be confusing

CPU utilization correlating with CFS throttling

The picture above shows a sample scenario we encountered in production. This application had a request of 7 cores and a limit of 16 cores (on VMs, it used 8 cores). As you can see, even when running at only 43% of the requested cores, we still see a significant amount of throttling from CFS. But why? We don’t even use half of the requested cores, and we have a big buffer between request and limit.

CFS enforces the limit in periods. By default, it uses 100ms periods, so in every period we have 16 (our core limit) * 100ms = 1,600ms of CPU time available. Our application was configured with 24 threads to do the number crunching that gets you the best price, which means that when all of them are busy, we try to use 24 * 100ms = 2,400ms of CPU time per period, but only for short moments.

On average, we appear to be well within the request, but at a finer granularity, we exceed our limit. The last part of the graph shows when we deployed the change that aligned our compute threads with what we asked from Kubernetes: the throttling immediately disappeared, and application throughput went up (pretty much matching our old VM performance). Let’s put this throttling into pictures to show the behavior.
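The quota the kernel actually enforces can be inspected from inside the container. A small sketch (assuming the cgroup v1 file layout; purely illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class CfsQuota {
    public static void main(String[] args) throws Exception {
        // cgroup v1 layout; cgroup v2 exposes a combined "cpu.max" file instead.
        // A quota of -1 means no CPU limit is applied to this container.
        long quotaUs = Long.parseLong(
                Files.readString(Path.of("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")).trim());
        long periodUs = Long.parseLong(
                Files.readString(Path.of("/sys/fs/cgroup/cpu/cpu.cfs_period_us")).trim());
        System.out.printf("Quota: %dus of CPU time per %dus period (~%.1f cores)%n",
                quotaUs, periodUs, (double) quotaUs / periodUs);
    }
}
```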

CFS as a resource limiter

Our simple sample from earlier would look the same under CFS: there is only one core, so there is no way to use more than we are allotted because the system can’t provide it. Let’s scale it up to a dual-core worker but ask for only a one-core limit for our pod.

Graph showing two threads taking 25ms of CPU time at a time, sequentially, twice, before all threads are paused by CFS for 50ms

Suddenly, we can use the two cores for 50ms each, and then we’ve used our 1 core * 100ms = 100 ms quota, which results in CFS pausing us for the remainder of the period.

Now let’s put this into a bigger sample with an eight-core worker node and a pod with two-core limits and compare it to a two-core VM.

Graph showing VM threads taking 25ms of CPU time twice each, continuously, while Kubernetes threads are all active for 25ms before being paused for 75ms

Here you can see quite a different picture of when work is executed on the threads. On VMs, work is performed constantly, while on Kubernetes, the workload is bursty (we use the entire quota in 25ms). The total work done is still the same; we get the same amount of total CPU time in the end. But the long pauses on Kubernetes can cause requests to fail sooner. Remember that we use a queue of tasks executed on a thread pool, and usually a single user request is broken up into multiple (sequential or parallel) tasks that may jump from thread to thread depending on the thread pool schedule.

On a VM, the constant activity might mean the first part of a request is done in the 0–25ms window and the second part in the 50–75ms window; on Kubernetes, the first part still lands in the 0–25ms window, but the second part only runs in the 100–125ms window. Now, what if this was an API call with an 80ms timeout? On the VM, we barely made it; on Kubernetes, because we didn’t use our quota smartly, we missed it.

The need for threads

So why do we need threads in the first place? Nowadays, single-core speeds aren’t growing as rapidly as they used to. To reduce latencies, we often split requests into smaller independent blocks so we can use multiple threads in parallel to perform the work and scale horizontally better.

Another very common reason is blocking IO (SQL drivers are often still implemented in a blocking fashion). These threads are usually parked, waiting for the kernel to wake them up when some network data has arrived, so they generally contribute little to overall CPU usage. It’s more resource-friendly to use asynchronous clients, as they remove the need for one thread per concurrent IO call.

A less common pattern is to use thread pools as bulkheads to avoid potential deadlocks from resource or thread starvation. This is usually reserved for specialized use cases (e.g., using the thread pool size to limit a connection pool).

Tuning for VMs vs. Containerized

Due to the differences in CPU isolation (along with CFS), the impact of tuning the application incorrectly can be quite big. Tricks that show small benefits on VMs might even become the root cause of issues under CFS limits: a VM can time-slice threads freely, so using a few extra threads might lower internal latency there, while under a CFS limit, those same extra threads just burn through the quota faster.

However, in general, the following rules will result in optimal performance in both VM and Kubernetes environments (a sketch of such a setup follows the list):

  • Your “main” CPU-bound pool should match 1:1 with the number of cores you expect to use. Generally, we recommend a ForkJoinPool here for optimal latencies.
  • Blocking IO can run on (nearly) unbounded pools, as these threads don’t consume much CPU; left truly unbounded, however, they can be dangerous memory-wise.
  • Non-blocking/Async IO usually only needs 1 or 2 threads to run on a fixed-size pool.
  • Scheduling/Timers and similar should run on a single dedicated thread (if it starts doing actual compute work, it should offload that to the core compute pool).
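A sketch of what such a layout can look like with plain JDK executors (names and sizes are illustrative, not a prescription):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ScheduledExecutorService;

public class PoolLayout {
    // On modern JVMs with container support, availableProcessors() reflects the
    // CPU limit we asked Kubernetes for, so the compute pool matches it 1:1.
    static final int CORES = Runtime.getRuntime().availableProcessors();

    // "Main" CPU-bound work: one thread per core we expect to use.
    static final ForkJoinPool compute = new ForkJoinPool(CORES);

    // Blocking IO: threads mostly sit parked, so the pool can grow, but bound the
    // work elsewhere (connection pools, semaphores) to keep memory in check.
    static final ExecutorService blockingIo = Executors.newCachedThreadPool();

    // Non-blocking/async IO callbacks need only a thread or two.
    static final ExecutorService asyncIo = Executors.newFixedThreadPool(2);

    // Timers and scheduling on a single dedicated thread; real work is handed
    // off to the compute pool.
    static final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
}
```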

We’ve already provided the tools to follow these guidelines easily for the Agoda libraries, and the bootstrapping libraries have been updated to follow these patterns by default.

Final note: VMs on Kubernetes?

This sounds like a lot of work if all applications need to adjust to this. Do we have any alternatives to reduce this burden?

We have a few options we can apply to make VM-first applications work better:

  • Another container runtime like Kata Containers could help to bring better isolation.
  • Can we reduce the CFS period to minimize application stalls?
  • Should we rely on applications using our standard libraries/bootstrapping and thus do mostly the right thing by default?
  • Are we even sure that using the CFS limit benefits us, or are we just making it needlessly harder for ourselves?

We mostly went with the last two options: we are making it simple for applications to be sane by default while also reconsidering whether the CFS limit should be applied to our workloads at all.

Latency-sensitive applications are probably better off without limits. Also, many applications don’t handle enough volume for the performance difference to be a major concern for their replica counts. So, investing in getting every application perfectly tuned wouldn’t be a good return on investment.
