The Culture of Cost Optimization — Reducing Kubernetes cost by $300,000

Kanishk Soni
Razorpay Engineering
9 min read · Nov 17, 2023


Authors: Kanishk Soni, Simon Rajan and Soji Antony

This blog is part 3 of a series of posts that we plan to publish on cost optimization.

In the last blog, we took a deep dive into Graviton adoption for enhanced performance at effectively lower compute expenses. If you haven’t already, please feel free to go through part-1 and part-2.

As we move from a monolith to a microservices architecture, more and more microservices come into the picture; as a result, we now run thousands of microservices across multiple Kubernetes clusters. Continuing our cost optimization journey, we found that EC2 compute spend is the major cost contributor in our Kubernetes clusters.

This article explains how we tackled overprovisioning, optimized our Kubernetes workloads, spread awareness, and established ownership of EC2 cost among teams by building in-house tools.

The following topics will be covered in the article.

  • The Manual Way
  • Searching for Automation
  • In-house tool Orchestrator for optimal CPU/Memory Requests

The Manual Way

The approach

Observing cluster-level data through Grafana and comparing request and usage metrics, it was evident that there was overprovisioning in the system. Resource requests could be fine-tuned to reduce wastage and optimize consumption.

Note: Overprovisioning is when the resources provisioned for an application exceed its actual utilization.
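Below is a minimal sketch, in Python, of the kind of check this observation boils down to: comparing per-namespace CPU usage with CPU requests via the Prometheus HTTP API. The Prometheus address, the 50% threshold, and the exact metric names (standard cAdvisor and kube-state-metrics series) are assumptions for illustration, not our production queries.

```python
# Minimal sketch: flag overprovisioned namespaces by comparing CPU usage with requests.
# Assumes a reachable Prometheus endpoint and standard cAdvisor/kube-state-metrics metrics.
import requests

PROM_URL = "http://prometheus.monitoring:9090"  # hypothetical in-cluster address

# Efficiency = actual usage / requested capacity, aggregated per namespace.
QUERY = (
    "sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))"
    " / "
    'sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    namespace = series["metric"]["namespace"]
    efficiency = float(series["value"][1])
    if efficiency < 0.5:  # using less than half of what is requested
        print(f"{namespace}: CPU efficiency {efficiency:.0%} -> likely overprovisioned")
```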

The challenges

There were many blockers along the way:

  1. We manage thousands of microservices across multiple clusters.
  2. There was a lack of clarity on who owns these microservices.
  3. Usage trends (CPU/memory efficiency) had to be mapped and presented to a wider audience.

The Remediation

To solve the problem, we broke it down into multiple components: visualization, ownership mapping, and tracking utilization at the microservice level.

Just Because You Can’t See It

Doesn’t Mean It’s Not There.

- Myron Golden

For visualization, we created detailed dashboards drilling down into CPU/memory efficiency at the namespace, deployment, and container level. This gives an idea of whether a microservice is overprovisioned and, if it is, by how much.

For ownership mapping, our program management team compiled a consolidated list of namespaces and services and mapped them to the owner teams and managers who would take the final call on the optimization.

For utilization, we set up a cron to track the overprovisioned microservices and present their usage data in an Excel sheet so that application owners could refer to it, review the usage, and make changes accordingly.

This had to be followed up regularly to nudge teams to take action on their overprovisioned applications.
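A minimal sketch of what the export step of such a cron might look like, using openpyxl to write the Excel sheet. The service names, owners, and numbers here are hypothetical, and fetching the usage data itself is assumed to happen separately.

```python
# Minimal sketch of the cron output: dump overprovisioned services into an Excel
# sheet for application owners to review. Input rows are assumed to come from
# the Grafana/Prometheus queries described above.
from openpyxl import Workbook

def export_overprovisioned(rows, path="overprovisioned_services.xlsx"):
    """rows: list of dicts with service, owner, requested and actual CPU usage."""
    wb = Workbook()
    ws = wb.active
    ws.title = "Overprovisioned"
    ws.append(["Service", "Owner", "CPU request (m)", "CPU usage p99 (m)", "Efficiency"])
    for r in rows:
        efficiency = r["cpu_usage_p99_m"] / r["cpu_request_m"]
        ws.append([r["service"], r["owner"], r["cpu_request_m"],
                   r["cpu_usage_p99_m"], round(efficiency, 2)])
    wb.save(path)

# Hypothetical example row.
export_overprovisioned([
    {"service": "payments-api", "owner": "team-payments",
     "cpu_request_m": 2000, "cpu_usage_p99_m": 600},
])
```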

Conclusion:

It took about a month for the first set of optimizations to go live. After multiple follow-ups and calls, we achieved gains of about $250,000 annually, but the process was tedious and required a lot of manual intervention to achieve the desired output.

We figured the ideal setting would be usage plus a 30% buffer (e.g. 500m of observed CPU usage would map to a 650m request), with some exceptions based on the use case, and explored whether there was a way to automate this process.

Searching for Automation

Third-Party POC

We explored a third-party offering that claims to cut infrastructure costs and improve application performance with autonomous, continuous workload optimization. It involves zero code changes; the agent is installed as a DaemonSet on the nodes.

It works in three phases:

  1. Learning: Agents passively learn the service’s data flows, processing patterns, and resource contentions.
  2. Optimizing: Activating the agents will immediately start tailoring resource scheduling decisions to adapt to the service, resolving inefficiencies and increasing performance.
  3. Cost saving: Leverage the workload’s performance gains to reduce CPU/memory/HPA requests at the deployment level for better resource provisioning.

Conclusion:
However, the POC did not go through because the access and permissions required by the SaaS model to achieve optimization and cost savings were intrusive; moreover, the results were not that attractive in the staging environment.

The primary reason for the poor results was that a good portion of our compute-heavy workload was not eligible for optimization due to language limitations, e.g. Java, PHP, and native code.

Adding to that, the on-premises version did not have all the features promised for the SaaS offering, so we decided to drop the POC.

Exploring Open-Source Solutions

  1. VPA — Vertical Pod Autoscaler
     • VPA is a powerful tool for autoscaling Kubernetes workloads.
     • It did solve the primary purpose, but it did not wholly satisfy our use case.
     • We require more granular control and more conditions to be met before making a change.
     • It did help us devise our basic architecture, i.e. recommender, updater, and webhook.
  2. Goldilocks
     • Works on top of VPA in recommendation mode.
     • It is a UI on top of the VPA.
     • It only provides recommendations, and changes have to be made manually.
Conclusion:
Although VPA is an amazing tool, it does not fit our use case because it optimizes at the pod level and not at the deployment level. The optimization requires pod restarts and can lead to downtime if not done during non-peak hours, and VPA does not provide control over when pod eviction takes place.

The tool also missed some important features, like a buffer value on top of usage or the exclusion of critical services. It lacked API support and cooldown periods, and, most importantly, VPA could not be used with HPA, which we use extensively in our infrastructure to scale applications horizontally.

In-house tool Orchestrator for optimal CPU/Memory Requests

After all this struggle, we decided to build the tool in-house from scratch in Python, based on our use case. It consists of three main components: recommender, updater, and mutating webhook.

Recommender

The purpose of the recommender is to analyze the usage of services at the deployment, pod, and container level by leveraging CPU and memory usage metrics and setting up recording rules on top of them. This is done over a period of time, along with some custom calculations, to arrive at the recommendations, which are exposed as an API and as metrics in Grafana.

Note: Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. Querying the precomputed result is often much faster than executing the original expression whenever needed.
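For illustration, here is a minimal sketch of the kind of recording rule the recommender could rely on, expressed as a Python dict rendered to the Prometheus rule-file format with PyYAML. The rule name, group, and interval are illustrative assumptions, not our actual rules.

```python
# Minimal sketch: a recording rule that precomputes per-container CPU usage so the
# recommender can query a cheap, already-aggregated series instead of raw cAdvisor data.
import yaml

rule_group = {
    "groups": [{
        "name": "orchestrator.container.usage",  # hypothetical group name
        "interval": "1m",
        "rules": [{
            "record": "namespace_pod_container:cpu_usage:rate5m",
            "expr": (
                "sum by (namespace, pod, container) "
                "(rate(container_cpu_usage_seconds_total[5m]))"
            ),
        }],
    }]
}

# Render to the YAML shape Prometheus expects in a rule file.
print(yaml.safe_dump(rule_group, sort_keys=False))
```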

Functional requirements:

  1. Lookback period: the timeframe over which the recommender observes the CPU and memory usage of the service and calculates the recommendations. A service becomes available for optimization only once this period has passed a minimum threshold, which in our case is 14 days.
  2. Buffer: the additional capacity allocated on top of the usage to cater to unpredictable spikes in workloads, with a default of 30%.
  3. Each service gets four types of recommendations, each derived from usage in the lookback period:
     • Max: based on the maximum usage
     • p99: based on the 99th percentile of usage
     • p95: based on the 95th percentile of usage
     • p90: based on the 90th percentile of usage

Recommendations can be chosen based on application criticality and nature.

  1. Exclude deployment: Feature to exclude a particular deployment.
  2. Change management: Any change in CPU/memory should again be analyzed over the lookback period before further optimization.
  3. If the same service is deployed on different clusters in a blue-green fashion (explained below), the recommendation should be the same for both clusters.
  4. Global config: All the above parameters are configurable and can be fine-tuned on a case-to-case basis at a service level using a global configuration.

Note: Blue-green deployment: a deployment strategy in which you create two separate but identical environments; normally, both environments handle 50% of the load each. A blue/green strategy increases application availability and reduces deployment risk by simplifying the rollback process if a deployment fails, since load is shifted gradually to one environment during deployments.

The recommender runs as a cron every 5 minutes on each cluster, fetching data from the recording rules and feeding it into the RDS database.
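Here is a minimal sketch of the recommendation math described above: percentiles of usage over the lookback period plus the configured buffer. Fetching samples from the recording rules and persisting results to RDS are assumed to happen elsewhere, and the sample values are hypothetical.

```python
# Minimal sketch: given CPU usage samples for a container over the lookback period,
# produce max/p99/p95/p90 recommendations with the configured buffer on top.
import math

LOOKBACK_DAYS = 14   # minimum observation window before a service is eligible
BUFFER = 0.30        # default 30% headroom on top of observed usage

def percentile(samples, pct):
    """Nearest-rank percentile of the usage samples."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[idx]

def recommendations(cpu_samples_millicores, lookback_days, buffer=BUFFER):
    if lookback_days < LOOKBACK_DAYS:
        return None  # not enough history yet, skip this service
    base = {
        "max": max(cpu_samples_millicores),
        "p99": percentile(cpu_samples_millicores, 99),
        "p95": percentile(cpu_samples_millicores, 95),
        "p90": percentile(cpu_samples_millicores, 90),
    }
    # Recommendation = observed usage + buffer, rounded up to whole millicores.
    return {k: math.ceil(v * (1 + buffer)) for k, v in base.items()}

# Example usage with hypothetical millicore samples over the lookback window.
print(recommendations([420, 460, 500, 480, 390, 510, 450], lookback_days=14))
```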

Updater

The purpose of the updater is to fetch data from the database, check whether there is a service where the updater is enabled and an optimization is available, compare it with the currently deployed values, and patch the deployment if the values differ.

Functional requirements:

  1. Control over when changes are deployed: As these patches lead to pod restarts, they must be done at the right time to avoid potential downtime. The default is 12 midnight.
  2. Change logs must be pushed to the logging infrastructure and sent as Slack notifications for better visibility and tracking.
  3. Namespace exclusion: As a second layer of safety and control, namespace-level exclusion skips updates for all the microservices in a particular namespace.

The updater runs as a cron once at night to patch deployments wherever optimizations are available.
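A minimal sketch of the updater's patch step, assuming the recommendation has already been read from the database. It uses the official Kubernetes Python client; the namespace, deployment, and values shown are hypothetical.

```python
# Minimal sketch: compare the deployed requests with the recommendation and patch
# the deployment only if they differ (which triggers a rollout).
from kubernetes import client, config

config.load_incluster_config()  # the updater runs inside the cluster
apps = client.AppsV1Api()

def apply_recommendation(namespace, deployment, container, cpu_request, memory_request):
    dep = apps.read_namespaced_deployment(deployment, namespace)
    for c in dep.spec.template.spec.containers:
        if c.name != container:
            continue
        current = (c.resources.requests if c.resources else None) or {}
        desired = {"cpu": cpu_request, "memory": memory_request}
        if current == desired:
            return  # nothing to do, values already optimized
        patch = {"spec": {"template": {"spec": {"containers": [
            {"name": container, "resources": {"requests": desired}}
        ]}}}}
        # Strategic merge patch on containers (matched by name).
        apps.patch_namespaced_deployment(deployment, namespace, patch)

# Example: invoked from the nightly cron so pod restarts happen off-peak.
apply_recommendation("payments", "payments-api", "app", "650m", "512Mi")
```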

Mutating Webhook

The purpose of the mutating webhook is to track all the changes going into the cluster and patch the resource requests to the recommended values for services that the orchestrator manages.

Why do we need a mutating webhook?

There were challenges in keeping the CI/code in sync with the optimized requests. To overcome this, the mutating webhook is required in the CI/CD flow to ensure that optimizations are not reverted on the next deployment, e.g. an image update or configuration change.

All patch changes in the Kubernetes cluster go via this mutating webhook, which checks whether the deployment has been patched for optimization and takes a decision accordingly.

We use hashes to check this:

  • Master hash: the request values set in the CI/code
  • Recommender hash: the recommended request values post-analysis
  • Incoming request hash: the request values coming from the deployment patch

Comparing the incoming hash with the master hash helps us decide whether we should mutate the requests to the optimized values or not.
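A minimal sketch of that comparison is below. How the master and recommender values are stored (e.g. as deployment annotations or in the orchestrator's database) is assumed; only the decision logic is shown.

```python
# Minimal sketch of the webhook's hash check: keep the optimization unless the
# requests were intentionally changed in CI/code.
import hashlib
import json

def resource_hash(requests: dict) -> str:
    """Stable hash of a container's resource requests, e.g. {"cpu": "650m", "memory": "512Mi"}."""
    return hashlib.sha256(json.dumps(requests, sort_keys=True).encode()).hexdigest()

def decide(incoming_requests, master_requests, recommended_requests):
    if resource_hash(incoming_requests) == resource_hash(master_requests):
        # The developer did not change requests in CI/code: keep the optimization
        # by mutating the incoming object to the recommended values.
        return recommended_requests
    # Requests were intentionally changed in CI/code: let the new values through;
    # they will be re-analysed over the next lookback period.
    return incoming_requests
```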

Functional requirements:

  1. Patch requests for services that the orchestrator manages.
  2. If there are changes in the resource requests (the incoming hash differs from the master hash), they should be deployed as-is.
  3. If there are no changes in the resource requests, they should be patched to the recommended values and deployed.
  4. The CI/CD flow should not be affected by this.

Orchestrator UI

The Orchestrator UI provides visibility into what the tool is doing.

Functional Requirements:

  1. Recommendation API: provides visibility into the recommendations available for the services (a minimal response sketch follows this list).
  2. UI in tabular format:
     • Search functionality so that app owners can look into their applications.
     • All the types of recommendations for each service.
     • Data on the optimizations already applied to the services.
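As referenced above, here is a minimal sketch of what a recommendation API response might look like. Flask is used purely for illustration; the route, field names, and values are hypothetical.

```python
# Minimal sketch of a recommendation API endpoint backed by the recommender's database.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/recommendations/<namespace>/<deployment>")
def get_recommendations(namespace, deployment):
    # In the real tool this would be read from the RDS database populated by the
    # recommender cron; hard-coded here for illustration.
    return jsonify({
        "namespace": namespace,
        "deployment": deployment,
        "lookback_days": 14,
        "buffer": "30%",
        "cpu_recommendations_millicores": {"max": 910, "p99": 650, "p95": 585, "p90": 520},
        "last_optimized_at": "2023-11-01T00:00:00Z",
    })

if __name__ == "__main__":
    app.run(port=8080)
```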

Results and Conclusion

We saved a good amount of Kubernetes expenditure through manual and automated orchestration, bringing down the cost by about $300,000 annually.

We were able to take a proactive approach to optimizing Kubernetes resources based on their real-time utilization.

Apart from request optimization, other significant factors leading to overprovisioning are anti-affinities set at the deployment level and custom use cases where we need a certain number of nodes up irrespective of usage.

Limitations

  • Orchestration is limited to Kubernetes.
  • HPA needs to be configured with a percentage-of-utilization target so that it rebalances automatically when CPU/memory requests change.

Future Scope

  • The current automated update scope is limited to single-container deployments. In the future, we would like to extend this to multi-container deployments.
  • It should also recommend and optimize workloads like CronJobs, DaemonSets, StatefulSets, etc.
  • Open Source the application
