The Application Gateway Ingress Controller is broken

Daniel Jimenez Garcia
7 min read · Sep 30, 2022

TLDR

The Application Gateway Ingress Controller is a great idea. It allows exposing applications hosted in Kubernetes to the outside world via Azure’s native Application Gateway, taking care of most of the necessary configuration and rules.

However, it has important design flaws, which can cause minutes of downtime when updating your workloads! These flaws are still present at the time of writing, Sept’ 22.

If you are planning on using it for production workloads, I would advise you to read through the article, check the status of the GitHub issues and run your own tests/benchmarks.

Note the article describes issues with the Application Gateway Ingress Controller. These do not necessarily translate to the Application Gateway itself. You can still leverage the Application Gateway and avoid these flaws, as long as you take care of the configuration that the Ingress Controller would manage for you.

Update Sep’23: Microsoft has announced a public preview of a new Application Gateway tier designed specifically for AKS: https://learn.microsoft.com/en-gb/azure/application-gateway/for-containers/overview. I have only briefly looked at the docs, but it appears to address the flaw discussed in this article.

A fundamental design flaw causes downtime when updating Pods

Understanding the issue

The design of the Application Gateway Ingress Controller is based around the idea of sending traffic from the gateway directly to the Pods. That is not a bad idea per se, and not too different from other solutions.

But this means the Ingress Controller needs to create a backend pool per Ingress. And it needs to keep it up to date with the IPs of the Pods behind that Ingress.
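To make this concrete, here is a minimal sketch of an Ingress handed to the controller via the documented `kubernetes.io/ingress.class: azure/application-gateway` annotation; the app name, host and port are illustrative. The controller resolves the Service to the individual Pod IPs and writes those IPs into the gateway’s backend pool:

```yaml
# Minimal Ingress handled by AGIC; name, host and port are illustrative.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app   # AGIC resolves this Service to the Pod IPs behind it
                port:
                  number: 80
```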

What happens when you upgrade/redeploy a workload?

  • The current Pods are terminated and new Pods replace them.
  • The Ingress Controller updates the Gateway’s backend pool for the affected Ingress, replacing the IPs of the old Pods with the new Pod IPs.

So far so good. Here is the catch: the Application Gateway does not guarantee fast updates of the IP addresses. In my experience, 10-minute updates were normal, sometimes even longer. Other users have reported up to 40 minutes! And according to Azure support, these times are within the expectations of the Application Gateway, which does not make any guarantees regarding update times.

This is the main design flaw. The failure to account for the long update times means users experience minutes of downtime. Once the old Pods are terminated, your application won’t be reachable until the Application Gateway update completes. While the update is in progress, the Application Gateway rules still point to the old Pods’ IPs, sending traffic to Pods that no longer exist.

The following diagram shows this sequence, assuming the usual 10m update times I have experienced:

downtime sequence after updating a deployment workload

As you can see, several minutes of downtime can be experienced. Let’s assume a rolling upgrade process whereby all your old replicas have disappeared after X minutes. If the Application Gateway update takes 10 minutes, you could experience 10 - X minutes of downtime. For example, if the last old Pod is gone after 2 minutes, that leaves roughly 8 minutes during which the gateway points at Pods that no longer exist.

This issue isn’t recent. It was noticed as early as 2018, and explicitly called out multiple times like in Jan’20 or Feb’21. Users noted the flaws of the current design, as it depends on fast gateway updates which are not guaranteed.

The official guidelines to avoid downtime are not enough

The Application Gateway Ingress Controller docs publish some guidelines to minimize downtime during deployments.

Note the word minimize. The docs admit you cannot guarantee avoiding downtime, not even when following these guidelines:

This document offers a method to alleviate this problem. Since the method described in this document relies on correctly aligning the timing of deployment events it is not possible to guarantee 100% elimination of the probability of running into a 502 error. Even with this method there will be a non-zero chance for a period of time where Application Gateway backends could lag behind the most recent updates applied by a rolling update to the Kubernetes pods.

Here are the guidelines:

  • Add a preStop lifecycle hook to the Pods, in which you execute a sleep of at least 90 seconds
  • Add terminationGracePeriodSeconds to the Pod, with a value greater than the preStop sleep. The example uses 101 seconds
  • Decrease the Pod’s probe period/timeout, so failing Pods are marked unhealthy sooner
  • Set up the connection draining annotations, so in-flight connections are handled gracefully before the Pod is removed from the Gateway backend pool (see the sketch after this list)
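
Putting the guidelines together, here is a hedged sketch of what the resulting manifests might look like. The sleep and grace period values mirror the docs’ example (90 and 101 seconds); the image, names, probe endpoint and draining timeout are placeholders:

```yaml
# Hedged sketch of the official mitigations applied to a Deployment and its Ingress.
# The image, names, probe endpoint and draining timeout are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 101          # must outlast the preStop sleep
      containers:
        - name: my-app
          image: myregistry.azurecr.io/my-app:1.0.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 90"]   # keep the old Pod alive while the gateway catches up
          readinessProbe:                          # tighter probe so Pods are marked unhealthy sooner
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 1
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/connection-draining: "true"
    appgw.ingress.kubernetes.io/connection-draining-timeout: "30"
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```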

The issue with them is that they are betting on the Application Gateway update finishing before the terminationGracePeriodSeconds expires. That is, we keep the old Pods around for a while longer before they are finally removed, hoping this gives the Application Gateway update enough extra time to finish.

suggested mitigations are not enough

As you can see in the diagram above, for users experiencing updates that consistently take 10 minutes or even longer, this won’t be enough. The guidelines make the same mistake: they assume updates to the Application Gateway are fast!

  • Also be aware that preStop isn’t called in situations like a Pod crashing or being OOM killed. In those situations the workarounds won’t be able to mitigate the issue at all

OK, let’s say you are fortunate and your updates take 60 seconds instead of 10 minutes. Or that you are willing to keep your old Pods around for 10 minutes via the preStop lifecycle hook and terminationGracePeriodSeconds. Even then, we still need to account for updates/redeployments of multiple workloads!

This is when avoiding downtimes gets way more complicated. The updates to the Application Gateway have to be done one at a time. Once a workload starts the update process, any other updated workload will need to wait for the first update to complete:

  1. Imagine you update/redeploy workload A, which begins an update to the gateway. We know the update can take a long time, 10 minutes or longer.
  2. Imagine after 2 minutes, you update/redeploy another workload B. This will require updating the Application Gateway, but it will first need to wait 8 minutes for the first update to complete, then another 10 minutes for its own update
  3. Meanwhile Kubernetes begins the process to replace the Pods in workload B.
  4. Even if you were prepared to keep old Pods around for 10 minutes, in the case of workload B this won’t be enough. You had an extra 8-minute wait that you couldn’t account for!

updates to multiple workloads result in even longer downtimes

There are many users reporting updates that frequently take 10, 20 or more minutes. With these slow updates, downtimes can become quite severe once multiple workloads are updated/redeployed.

It can get even worse in some scenarios

Knowing that multiple updates to workloads causes even longer downtimes, think about these scenarios:

  • Horizontal Pod Autoscaler (HPA). You configured HPA for scalability and resiliency, but in combination with the issues you have seen, the automated removal/addition of Pods might end up causing intermittent 502 errors (see the sketch after this list)
  • A node needs to be drained. Given all Pods in that node have to be terminated, there will be a long queue of updates to the Application Gateway!
  • All the cluster nodes have to be updated. A not-so-uncommon scenario when processing OS updates on the cluster nodes by cordoning, draining and rebooting each in sequence. This will eventually see all of your Pods being terminated as the update cycles through all the nodes. Just imagine the queue of pending Application Gateway updates!
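
For the HPA scenario, even an unremarkable autoscaler definition is enough to trigger this. A minimal sketch follows (autoscaling/v2; the target name and thresholds are illustrative). Every scale-out or scale-in changes the set of Pod IPs and therefore queues yet another backend pool update:

```yaml
# Minimal HPA sketch; target name and thresholds are illustrative.
# Each scale event adds/removes Pods, so each one queues another
# Application Gateway backend pool update behind any in-flight ones.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```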

I cannot but agree with this comment:

The AGIC design requires AppGW to make guarantees that it simply doesn’t. The design is in itself broken.

What can you do instead

As it is today, I would not recommend relying on the Application Gateway Ingress Controller for any production workloads.

Without deviating much from an architecture based on it, you could consider:

  • An Application Gateway (managed by you) in front of a load balancer managed by the NGINX ingress controller. This way you expose the AppGw to the outside world and keep WAF happening before traffic reaches your cluster. The downsides are that you need to be the one updating rules/certificates/etc. in the AppGw, and you have extra network hops plus CPU/memory consumed by the NGINX ingress (see the sketches after this list)
  • Just a load balancer managed by the NGINX ingress controller. The load balancer is directly exposed to the outside world, and WAF moves to the NGINX ingress via ModSecurity. Everything is managed for you, but now WAF happens within your cluster, so you might want to account for extra CPU/memory and perhaps use dedicated nodes, separate from your actual workloads.
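
As a rough illustration of both options (names, host and ports are placeholders, and this assumes the ingress-nginx controller is already installed): in the first one, the controller’s Service can be kept off the public internet with the Azure internal load balancer annotation, so only your self-managed AppGw reaches it; in the second, WAF can be enabled per Ingress via ingress-nginx’s ModSecurity annotations.

```yaml
# Option 1 (sketch): keep the NGINX ingress controller's Service internal, so only
# the self-managed Application Gateway in front of it can reach the cluster.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https
---
# Option 2 (sketch): expose the app through ingress-nginx and enable ModSecurity
# with the OWASP core rules, so WAF happens inside the cluster.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    nginx.ingress.kubernetes.io/enable-modsecurity: "true"
    nginx.ingress.kubernetes.io/enable-owasp-core-rules: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```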

Of course, these are mere pointers in directions to explore, and not the only alternatives. However, I wanted to close the article by providing a way out for anyone who based their design on the Application Gateway Ingress Controller!
