Managed K8s still needs to be managed: How we utilize Karpenter to handle resource issues

Osher Levi
Gloat
7 min read · Mar 13, 2023


“We have to choose a managed service for this infrastructure. The operational cost of managing it ourselves is huge.” If you are an infrastructure engineer, you have probably used this sentence a few times in technical discussions over your career.

I have no doubt that in most cases this sentence is correct, and that choosing managed infrastructure does reduce operational costs. Still, you have to remember that managed infrastructure has its own downsides and complexity, which you should take into account.

Managed services are intended to make our lives easier by letting us focus on the main flow while they take care of the internals and the management of the infrastructure. They simplify the steps required to set up the infrastructure, and therefore the “bits and bytes” of the managed infrastructure are hidden. Not all logs are available to you, so it can be hard to understand how things work and, especially, why they are not working as expected.

As part of Gloat’s migration journey to a managed K8s infrastructure, we faced some resource challenges and needed to dive into the hidden parts of a managed infrastructure. In this blog post, we will go through those issues and how we utilized Karpenter to handle them and improve the way we do auto-scaling on K8s.

Our Transition

Over the last year and a half, Gloat’s infrastructure has transitioned from a VM-based application into a dockerized microservices architecture running on K8s. We chose to run our workloads on a managed K8s service (EKS) and faced several challenges along the way: how to manage the node groups, which applications could be stateless and which had to be stateful, requests and limits, sizing, HPA, observability, CI/CD strategy, networking, and more.

All of the above are technical considerations where we can follow best practices, learn from case studies, and so on. Still, the hardest part was building a K8s culture: reducing knowledge gaps within the team, sharing knowledge across R&D, learning how to debug issues on the new infrastructure and, most of all, learning how to dive deep into the infrastructure and understand its internals. This is always a big challenge, and it’s an even bigger one when using a managed service.

We grew, and more and more clusters came to life. Within a year, we found ourselves managing 20 clusters across 6 regions. The number of new microservices and workload types tripled each month, and the challenges kept growing with them.

What are the 3 effects of starvation?

I don’t recommend searching Google for the answer, as that will probably lead you to medical content, but on our journey, the 3 effects of starvation (there are probably more) showed up as 3 pain points in our clusters:

1. Flapping nodes — nodes going from Ready to NotReady state
Improper resource allocation causes nodes to become overcommitted, which results in the kubelet not having enough resources to function.

2. Inability to mount volumes
As a result of the kubelet being unavailable, volumes are not detached/attached properly from the nodes, which surfaces as the following error:
“Kubelet: Unable to attach or mount volumes:… “

3. Pods stuck in the Terminating state
Once again, as a result of the kubelet being unavailable, pods are unable to terminate properly and stay stuck in the Terminating state until they are deleted forcibly.

So the obvious question is, why does it happen?

We connected the dots between the issues, and the obvious conclusion was that the kubelet was STARVING!

The main reason was overcommitment: here at Gloat, we run science-related workloads in pods. These workloads load large models into application memory, which requires a high memory allocation at application startup; after that, resource usage drops significantly.

This forces us to keep a big gap between Requests and Limits at the pod level, which causes nodes to become extremely overcommitted.
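
To make that gap concrete, here is a minimal sketch of what such a workload spec can look like. The name, image, and numbers are purely illustrative placeholders, not our production values:

    # Illustrative only: a container that needs a lot of memory while loading a
    # model at startup, but far less at steady state. The wide gap between
    # requests and limits is what lets nodes become heavily overcommitted.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: model-server            # hypothetical workload name
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: model-server
      template:
        metadata:
          labels:
            app: model-server
        spec:
          containers:
            - name: model-server
              image: registry.example.com/model-server:latest   # placeholder image
              resources:
                requests:
                  cpu: "500m"
                  memory: "2Gi"     # what the scheduler reserves on the node
                limits:
                  cpu: "2"
                  memory: "12Gi"    # what the pod may actually consume at startup

The scheduler packs nodes based on requests, so when several such pods hit their startup peak at the same time, the node runs out of memory and the kubelet is among the first to suffer.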

How to stop feeling hungry

The plan was to find a way to reserve resources for the critical components running on our nodes, avoid starvation, and increase our clusters’ stability. We were under the impression that a managed service would handle such a configuration, but we found out that it does not exist by default on our nodes.

We found a solution using the following configurations:

  • kube-reserved is meant to capture resource reservations for Kubernetes system daemons like the kubelet, container runtime, node problem detector, etc.
  • system-reserved is meant to capture resource reservations for OS system daemons like sshd, udev, etc. system-reserved should reserve memory for the kernel too, since kernel memory is not accounted to pods in Kubernetes at this time.
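
For reference, this is roughly how those reservations are expressed in a kubelet configuration file; the amounts below are placeholders and should be tuned to node size and workload profile:

    # Sketch of a KubeletConfiguration with resource reservations (values are placeholders).
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    kubeReserved:          # reserved for Kubernetes daemons (kubelet, container runtime, ...)
      cpu: "250m"
      memory: "1Gi"
      ephemeral-storage: "1Gi"
    systemReserved:        # reserved for OS daemons (sshd, udev, kernel, ...)
      cpu: "250m"
      memory: "500Mi"
    evictionHard:          # optional hard eviction thresholds
      memory.available: "200Mi"

The node’s allocatable capacity becomes its capacity minus kubeReserved, systemReserved, and the eviction thresholds, so pods can no longer consume the resources the kubelet itself needs.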

Cluster autoscaler: the skeletons in the closet

Autoscaling is one of the most important tasks in a K8s infrastructure, and when using managed K8s, the default auto-scaling will probably not meet your needs. There are a couple of add-ons you can install on your cluster to handle this task; the best known one is the cluster autoscaler.

At the time we started our journey with K8s, the cluster autoscaler was the de facto standard for node auto-scaling on EKS. It utilizes the well-known auto-scaling group (ASG) mechanism on AWS to scale nodes up or down using a launch template. Although this seems like the right approach to auto-scaling, once again, the “a managed service will do the work for me” mindset has some downsides.

In the case of the cluster autoscaler, we don’t have direct control over the EC2 instances; we control them through Auto Scaling Groups.
For public cloud providers like AWS, diversification and availability considerations dictate the use of multiple instance types. However, the cluster autoscaler only functions correctly with node groups whose nodes have the same capacity, which might not be optimal.

When a new pod has different requirements, it might be necessary to configure a whole new node group.

The cluster autoscaler way of working

In our research, we found that it takes the cluster autoscaler about 7 minutes to respond to a scaling event, measured from the moment a pod is waiting for scheduling until a new node is provisioned and reaches a Ready state where the pod can be allocated to it.

We were very limited in making those adjustments because of the way we manage our node pools and because the cluster autoscaler itself has limited capabilities. Therefore, we decided to run a POC with Karpenter, which supports setting kube-reserved/system-reserved allocations and also gives us much better flexibility.

Karpenter — the next-gen auto scaler for EKS

Karpenter is an open-source node provisioning project built for Kubernetes. Karpenter improves the efficiency and cost of running workloads on Kubernetes clusters by:

  • Watching for pods that the Kubernetes scheduler has marked as unschedulable
  • Evaluating scheduling constraints (resource requests, node selectors, affinities, tolerations, and topology spread constraints) requested by the pods
  • Provisioning nodes that meet the requirements of the pods
  • Removing the nodes when the nodes are no longer needed

The main difference between Karpenter and the cluster autoscaler is that Karpenter manages nodes directly, which enables it to retry in milliseconds instead of minutes when capacity is unavailable. It also allows Karpenter to leverage diverse instance types, availability zones, and purchase options without the creation of hundreds of node groups.
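
To illustrate that flexibility, here is a minimal sketch of a Karpenter Provisioner using the v1alpha5 API that was current at the time of our POC. The instance types, limits, and reservation values are illustrative, not our production configuration:

    # Sketch of a Karpenter Provisioner (v1alpha5 API); values are illustrative.
    apiVersion: karpenter.sh/v1alpha5
    kind: Provisioner
    metadata:
      name: default
    spec:
      requirements:
        # Let Karpenter choose between Spot and On-Demand capacity
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Allow a diverse set of instance types instead of one fixed node group
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.xlarge", "m5.2xlarge", "r5.xlarge", "r5.2xlarge"]
      # Reserve resources for the kubelet and OS daemons on every node it launches
      kubeletConfiguration:
        kubeReserved:
          cpu: "250m"
          memory: "1Gi"
        systemReserved:
          cpu: "250m"
          memory: "500Mi"
      limits:
        resources:
          cpu: "1000"             # cap on total provisioned CPU
      providerRef:
        name: default             # AWSNodeTemplate with subnet/security-group selectors
      ttlSecondsAfterEmpty: 30    # remove empty nodes quickly

A single Provisioner like this replaces the many capacity-specific node groups we would otherwise have to maintain for the cluster autoscaler.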

Karpenter auto-scaling mechanism

We’ve seen a significant improvement in the time it takes for pods to be scheduled on a node when a scale-up event is triggered: in our POC with Karpenter, the average time was 1–1.5 minutes. This improvement comes mainly from Karpenter spinning up new nodes directly, without going through an auto-scaling group, and from Karpenter binding pending pods to the new node as soon as it is launched. The kubelet doesn’t have to wait for the scheduler or for the node to become ready; it can start preparing the container runtime immediately, including pre-pulling the image, which shaves seconds off node startup latency.

The POC with Karpenter was a success: we got the performance improvements shown above, cost-optimized nodes by mixing various instance types including Spot instances, and, most of all, we were able to reserve resources for the kubelet and prevent starvation. After monitoring our infrastructure with Karpenter, we found that the issues we had observed (flapping nodes, inability to mount volumes, pods stuck in the Terminating state) stopped happening, which was great progress for us.

You’ve got a lot on your plate

Our journey with a managed K8s infrastructure was full of unknowns, but the lessons learned gave us a new perspective and led us to abandon the “plug-and-play” approach. When it comes to a managed service, you should always research and dive deep into the internals of the service, understand what is still on your plate, and know which tools and add-ons you can utilize in order to end up with a more mature managed infrastructure.

This blog gave insight into specific aspects like auto-scaling and system-critical components, but there is much more to take care of when it comes to managed K8s.

This story was written in collaboration with Aviel Revivo.
