How did we upgrade our EKS clusters from 1.15 to 1.22 without K8s knowledge?

Safaa Alnabulsi
Scout24 Engineering
9 min read · Aug 23, 2022


In mid-2020, I joined a newly formed team, along with two other colleagues, as part of a restructuring of Application Platform after the AutoScout24 carve-out project. It was a super exciting opportunity, as this team owned a business-critical product: the compute platform at Scout24, where almost 700 Scout24 services run. It is built on top of Amazon Elastic Kubernetes Service (AWS EKS).

However, I knew nothing about Kubernetes (K8s) back then except for its fame in the tech world and its steep learning curve. Our EKS clusters were on version 1.15: five versions behind the latest upstream release (v1.20) and approaching the end of AWS EKS support! We were drowning in technical debt, and the team was short on both capacity and Kubernetes knowledge. Within 1.5 years, we upgraded to the latest version supported by AWS, gained deeper knowledge of the product, and improved it along the way.

It had not occurred to me that tackling technical debt could be interesting to others; however, attending KubeCon 2022 in Valencia changed my perspective. In many panels and talks, the K8s upgrade was brought up. We heard stories of struggling teams who were stuck on outdated versions, having not upgraded for over two years. They were desperately searching for tips and tricks to magically jump to the latest version.

In this article, I will walk you through not only our EKS upgrade journey but also how you can help your team gain confidence in new, unknown technical territory.

“The Kubernetes project is continually integrating new features, design updates, and bug fixes. The community releases new Kubernetes minor versions. New version updates are available on average every three months. Each minor version is supported for approximately twelve months after it’s first released.” (Source: AWS docs)

Let’s dive in!

A short background about the product

Our compute platform enables our users to run their services at scale without worrying about the complexity of the infrastructure or needing deep AWS experience. It provides features out of the box such as load balancing, human-readable DNS records, logging, events, monitoring, and tracing. Under the hood, the compute platform consists of several EKS clusters isolated across different AWS accounts. We provide a custom resource which our users deploy in their accounts; a new service is then created in one of our EKS clusters.

In order to provide the custom resource and the feature set it offers, we maintain a set of custom controllers in-house (written in Go using client-go and Kubebuilder), alongside third-party controllers such as Velero, Datadog, etc.
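
To give a rough idea of what such a controller looks like, here is a minimal controller-runtime (Kubebuilder) reconciler skeleton. It is purely illustrative: the ComputeServiceReconciler name is hypothetical and this is not our actual controller code.

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ComputeServiceReconciler is a hypothetical reconciler for a custom resource.
// It only sketches the Kubebuilder/controller-runtime pattern.
type ComputeServiceReconciler struct {
	client.Client
}

// Reconcile is called whenever the watched custom resource changes
// (signature of recent controller-runtime versions).
func (r *ComputeServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// A real controller would fetch the object with r.Get(ctx, req.NamespacedName, &obj)
	// and create or update the Deployments, Services, DNS records, etc. that back it.
	return ctrl.Result{}, nil
}
```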

The upgrade journey

When tackling an AWS EKS upgrade, keep in mind that it is not just a click of the “Update now” button you see in the AWS console. Once you upgrade your EKS control plane to a newer version, you cannot roll back to an older one without re-creating the whole cluster from scratch. This means the services running in your cluster would either have to move to another cluster or face downtime. Some of the main controllers have to be updated either before, with, or after the EKS control plane. The more customization you have, the more effort and thought you have to put in.

Jumping to the latest version supported by AWS (v1.22 at the time of writing) didn’t happen overnight. It took us around 1.5 years.

I will walk through the three main rounds of upgrades, grouped by our level of confidence and experience, with the main takeaways from each round.

First round: Low Confidence

Confidence 5–10%

Time needed 1–2.5 months

With a new team, a new product, and a new technology, we approached the EKS upgrade with caution and strategy. We created a “spike” ticket to investigate the effort needed to upgrade to a newer version. The outcome of this spike was to understand the different work packages that needed to be done in order to reach a newer version “safely”, without causing an incident or downtime for any of our users, and to define a “workflow” of upgrade steps.

1. Research

Discover all changes between your current K8s version and the one you intend to upgrade to, such as API deprecations, version differences of controllers, and possible bugs or takeaways from other people’s upgrade journeys. The outcome of your research may be an “Epic” ticket with clearly defined stories.

There are different resources to check for new changes, such as the upstream Kubernetes changelog and deprecation guide, the AWS EKS documentation, and the release notes of the controllers you run.
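
As a concrete example of such a check, the sketch below (an illustration, not our actual tooling) uses client-go’s discovery API to list the API group/versions the current cluster still serves, which you can then compare against the deprecations announced for the target version:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig in the default location; adjust for your setup.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Ask the API server which group/versions it currently serves,
	// e.g. "apps/v1" or the deprecated "extensions/v1beta1".
	groups, err := clientset.Discovery().ServerGroups()
	if err != nil {
		panic(err)
	}
	for _, g := range groups.Groups {
		for _, v := range g.Versions {
			fmt.Println(v.GroupVersion)
		}
	}
}
```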

2. Controllers Upgrade

This is where the heavy workload resides. If your research was done well, you will save a lot of time when upgrading the controllers.
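
One way to take stock of what you are currently running is to list the container images of the controllers’ Deployments. The sketch below is an illustration, not our tooling, and assumes the controllers live in the kube-system namespace; adjust it to wherever yours are deployed:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Print each controller Deployment and its image tag, i.e. the version currently running.
	deployments, err := clientset.AppsV1().Deployments("kube-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, d := range deployments.Items {
		for _, c := range d.Spec.Template.Spec.Containers {
			fmt.Printf("%s: %s\n", d.Name, c.Image)
		}
	}
}
```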

3. Nodes AMI update

If you are using managed nodes, you need to build your AMI with the newest EKS version. You can either roll out the AMI update first, as it is usually backward compatible, or roll out the AMI node update and the EKS control plane upgrade together in one release. We followed both approaches, depending on the complexity of the version upgrade. For example, in the v1.15 to v1.16 upgrade we had to touch the labels of the nodes, so we preferred to update them separately from the control plane upgrade.
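
If your AMI is based on the EKS-optimized image, AWS publishes the recommended image ID for each Kubernetes version as a public SSM parameter. The sketch below (AWS SDK for Go v2, shown only as an illustration) looks up the Amazon Linux 2 image for v1.16, which you could use directly or as the base for your own build:

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ssm"
)

func main() {
	ctx := context.TODO()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}

	// AWS publishes the recommended EKS-optimized AMI per Kubernetes version
	// as a public SSM parameter; here we ask for the v1.16 Amazon Linux 2 image.
	out, err := ssm.NewFromConfig(cfg).GetParameter(ctx, &ssm.GetParameterInput{
		Name: aws.String("/aws/service/eks/optimized-ami/1.16/amazon-linux-2/recommended/image_id"),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("recommended AMI:", *out.Parameter.Value)
}
```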

4. EKS Control plane upgrade

This step used to be the last one in our upgrade journey, as there is no going back once you upgrade the EKS control plane. The only way back would be to destroy the cluster and create a new one, which would mean downtime for all services running in the cluster if there is no way to move them to another cluster. Hence, we took our time testing on a test EKS cluster.
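
For reference, the control plane upgrade itself boils down to a single, irreversible API call. The sketch below uses the AWS SDK for Go v2 against a hypothetical cluster name; in practice you would trigger this through your usual release pipeline rather than a one-off program:

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/eks"
)

func main() {
	ctx := context.TODO()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}

	// Kick off the control plane upgrade. There is no rollback: once it succeeds,
	// the only way back is re-creating the cluster from scratch.
	out, err := eks.NewFromConfig(cfg).UpdateClusterVersion(ctx, &eks.UpdateClusterVersionInput{
		Name:    aws.String("my-test-cluster"), // hypothetical cluster name
		Version: aws.String("1.16"),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("update id:", *out.Update.Id, "status:", out.Update.Status)
}
```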

5. Post upgrade tasks

Verify everything works properly by keeping an eye on your monitoring dashboards. Finally, document your notes about the upgrade (e.g., the process, learnings, and challenges).
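
Besides the dashboards, one quick sanity check (a sketch, not part of our actual monitoring) is to confirm that every node reports the expected kubelet version after the AMI roll:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Every node should report a kubelet version matching the new Kubernetes minor version.
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		fmt.Printf("%s: %s\n", n.Name, n.Status.NodeInfo.KubeletVersion)
	}
}
```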

Takeaways from this round

  • Research is needed at the beginning.
  • EKS control plane at the end: we started upgrading all components of the product before reaching EKS itself.
  • The first upgrade we did as a new team (v1.15 -> v1.16), including research and execution, took around 2.5 months. At this level of confidence, later upgrades improved to about one month.
  • We blocked releases for almost 2.5 months because we were unsure how the components would affect each other. The learning was to break the work down into smaller releases.
  • With so many components, we needed to learn the purpose of each one of them and to improve the documentation of the product.

Second round: Medium Confidence

Confidence 60%

Time needed 2–3 weeks

After performing several upgrades, including one with breaking API changes (v1.15 -> v1.16), our confidence increased significantly. We were able to test freely, and whenever we broke the test cluster, we either fixed it or nuked it and recreated it. We knew our pipelines well. The EKS upgrade became regular work that followed a workflow, with a predictable outcome. However, we still needed more structure and further improvement.

In this round, we differentiated between three types of controllers:

  • EKS system controllers: must be upgraded with or before the EKS upgrade, such as kube-proxy and CoreDNS.
  • Third-party controllers: can be upgraded after the EKS upgrade, such as Velero and Datadog.
  • Custom-written controllers: can be upgraded after the EKS upgrade, keeping the Kubernetes client-go compatibility in mind.

With this categorization in mind, the overhead before reaching the EKS control plane upgrade was significantly reduced.

To keep track of all our changes in the product, we created a tracking table as part of the spike ticket, to be filled out during research. The definition of each column follows.

  • Current Version: which version of this controller are we running at the moment?
  • Works with the new version? [Yes, No]: Check release notes, issues and articles related to the K8s version.
  • Does it have newer versions? [Yes, No]: Check the main repo of the controller.
  • Latest version: the last release of this controller. Check the main repo of the controller.
  • Version to update to: the latest version is not always the one we aim for. Make sure to read all release notes to learn which version works with the K8s version we are upgrading to.
  • Our Decision: [Update, Don’t update, Update after EKS control-plane]
  • Notes & links: state here what info and data led you to that decision.

Note: the client-go version has to be compatible with the K8s version. Thus, all Go-written controllers should be updated and recompiled.
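
A small sketch of that check: the cluster’s server version should line up with the client-go release your controllers are compiled against (the k8s.io/client-go module v0.X.Y corresponds to Kubernetes v1.X.Y):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Ask the API server for its version; controllers should be built against a
	// client-go release that is compatible with this version.
	v, err := clientset.Discovery().ServerVersion()
	if err != nil {
		panic(err)
	}
	fmt.Println("server version:", v.GitVersion)
}
```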

Upgrading controllers breakdown

In this round, we had new members joining the team who also didn’t know K8s yet. It was our chance to test whether the workflow we had created was useful. We let them drive the spikes while pairing with them, and kept rotating that way until two of the upgrades were handled fully by the new team members.

Takeaways from this round

  • EKS control plane upgrade is no longer the last item: by categorizing controllers as before/with/after the EKS upgrade, we decreased the overhead of EKS upgrades.
  • Seeing all components as independent made it much easier to perform the actual upgrade. We didn’t need to worry about any known incompatibilities.
  • Time: the EKS upgrade time dropped by half, and we were able to do it in one sprint (our sprints are two weeks long)!
  • The post-upgrade tasks were estimated and worked on in later sprints as independent maintenance tickets.
  • Smaller releases and smaller iterations are better.

Third round: High Confidence

Confidence 95%

Time needed 1–2 days

A new version with breaking API changes was already approaching its deadline, and we were two versions behind. However, we were again short on capacity and had other deadlines. The upgrade was still a burden for us, and we felt the need for further improvement.

So we shifted our focus from doing the upgrade to rethinking the workflow. We met, brainstormed our doubts and fears about the product, and prioritized improvements. With the help of an external K8s expert working with us in-house, we succeeded in improving our upgrade workflow further.

Below are some of the improvements we made:

  • We carved out the releases of our custom-written controllers into their own pipelines.
  • We improved the tests for the product and its smaller components (the custom-written controllers).
  • We removed controllers which were no longer needed.
  • We refactored our custom controllers and gained more hands-on experience.
  • We enabled Dependabot with GitHub Actions in our repositories to automatically pick up the latest npm package and Docker image updates.
  • Furthermore, we improved our monitoring and added more dashboards. Please read how we defined SLOs/SLIs in this article.
  • Lastly, we are now enabling automatic releases for the whole product to increase release speed.

At the time of writing, we no longer do the EKS upgrade as sprint work but rather as technical debt tasks which we classify as “interruptible tasks”. The “interruptible” person is responsible for unplanned sprint tickets, ad-hoc requests, and customer support questions. The “interruptible” rotation is two days per team member.

To sum it up, this round consists of three main steps:

  • EKS control plane upgrade, AMI node update, and kube-proxy
  • Evaluating the upgrade
  • Controllers Upgrade

Takeaways from this round

  • A research spike is no longer needed, although it is still necessary to check for API deprecations.
  • We upgraded the EKS control plane at the beginning, along with a couple of controllers which must be aligned with EKS’s version, such as kube-proxy.
  • We handle all tickets related to the EKS upgrade as interruptible tasks.

Finally,

I’d like to list some tips if you are reading this and don’t know where to start:

  • Iterate, iterate, iterate!
  • Research first
  • Have a trusted testing environment
  • Always aim for smaller releases
  • Know your product well and document it
  • Track your progress in writing: time spent, notes, etc.
  • Optimize along the way: code, releases, flows, processes, etc.
  • Knowledge sharing is essential
  • Train your employees and improve their knowledge base

Acknowledgement

Multiple people contributed to getting us to the confidence level we are at today, and the credit goes to all of them, not only to me, the writer of this article. Over the years we have had the following teams:

Team 1: Safaa AlNabulsi, Raffaello Perez De Vera, Ulrich Kautz, John Ford (EM)

Team 2: Safaa AlNabulsi, Raffaello Perez De Vera, Ulrich Kautz, Nashwan Abou Zidan, Ajay kumar mandapati, Vibhor Kainth, Gerd Doliwa (EM)

Team 3: Safaa AlNabulsi, Florian Kasper (external), Vibhor Kainth, Joseph Welling, Martin Lefringhausen (EM)

Interested in working with us? Check the open positions on our careers page.
