Case study: container health checks crash the application when it is overloaded with requests

This is a case study of a production incident that happened to our team. The cause of the crash was an overload of HTTP requests sent to one of our micro-services. The health check mechanism didn't function under high stress and made the situation worse. I detail the investigation steps and share interesting conclusions about scaling and health check mechanisms.

Idan Friedman
12 min read · Nov 11, 2023

Intro:

This is a case study of a production incident my team faced. The service stopped functioning and all of its pods crashed without warning. No warning or error logs were found.

Later, we found the reason for the incident: a batch of requests sent to our cluster that was bigger than the service's ability to handle. We also found that the health check mechanism didn't function under the heavy load on our service, and made everything crash.

I share the conclusions and lessons learned from the incident.

Disclaimer: a few small changes were made to the story to keep the identity of our team a secret.

Our architecture:

Our team has a micro-service serving internal clients. It is written in Python, served by an Apache WSGI server, and fully containerized using Docker. The service runs in an EKS K8S cluster. It has a MySQL database behind it, used to store and fetch data. User traffic to the cluster is accepted through an AWS Application Load Balancer (ALB).

Flow: end-to-end architecture of the micro-service

A sudden crash:

At some point, several K8S pods running the application crashed. Ideally, you would expect the K8S controller to replace the terminated pods by creating new ones instead. After a short initialization time, new pods should be ready to serve clients' requests again.

That didn't happen as planned… Instead, after a minute or so all the pods crashed as well. The K8S controller kept trying to create new pods non-stop, but they all kept crashing. A quick glance showed they failed because their health checks did not return an HTTP 200 OK response.

This situation lasted for around 40 minutes, and then the service returned to normal: all pods were ready, stable, healthy and serving clients.

So what happened?

Graph: number of pods over the incident's time period. Note that at some points we had zero pods running, meaning the service was totally unresponsive

Debugging:

The first thing we did was check the application logs to understand what had happened, but they didn't show any errors at all. They literally all showed normal application activity: no warnings, no errors, nothing!

Checking the K8S cluster and the MySQL database metrics and event logs showed they were fine and had normal activity. Furthermore, there had been no major changes to our application code over the last few weeks.

Lastly, we checked the AWS Application Load Balancer (ALB) logs; yes, this well-known AWS resource has a logging system of its own. This is where we found the first indication of an error. (Again, the application logs didn't show any errors.)

Comparing the service statistics between a normal period and the incident period showed a major difference in the distribution of HTTP response codes. Normally, 99.9% of the responses are OK, returning an HTTP 200 response code. During this incident, only 20% of the responses were OK (200) and almost 80% were 503.

At first we thought our application was returning the 503 responses because of a bug, but after re-checking our code we didn't find any problem.

The ALB logs showed that many requests passed from the ALB to the application never got a response from the application itself. After a timeout period with no response from our application, the forwarded requests timed out, causing the ALB to return a 503 HTTP response to the clients.

That was it: the application (deployed as a micro-service in a K8S cluster) didn't return any response. The 503 responses the users received weren't returned by the application; they were actually returned by the ALB itself due to a timeout.

Checking the ALB logs, we could see the micro-service had received more than one million requests within a 15-minute window, around 6,000 requests per minute. For comparison, in a normal 15-minute time frame we serve around 1,000 requests every minute.
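
For anyone wanting to reproduce this kind of analysis, a rough sketch for tallying the ELB status codes from downloaded ALB access-log files might look like the following (the field position is an assumption based on the documented ALB access-log format; verify it against the log version you use):

```python
import glob
import gzip
from collections import Counter

# Tally ELB status codes from downloaded ALB access logs (rough sketch).
# ALB access logs are space-delimited; the ELB status code is assumed to be
# the 9th field -- double-check against the access-log format you use.
ELB_STATUS_FIELD = 8

codes = Counter()
for path in glob.glob("alb-logs/*.log.gz"):  # hypothetical local path
    with gzip.open(path, "rt") as fh:
        for line in fh:
            fields = line.split(" ")
            if len(fields) > ELB_STATUS_FIELD:
                codes[fields[ELB_STATUS_FIELD]] += 1

print(codes.most_common(5))  # e.g. how many 200s vs 503s the ALB returned
```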

The hypothesis: something made our application fail its health checks, causing the pods to terminate. Because there were no available pods to serve requests, the ALB didn't get responses from the application and had to return 503 responses to the clients. Whatever this "something" was, it made the application fail health checks for a period of 40 minutes. The only question left was: what made the health checks fail?

Back to the basics — how does a web application serve HTTP requests?

Before continuing with our debugging efforts, I want to discuss how an application is actually served by a server.

The application is a piece of code capable of accepting a single HTTP request as input and returning an HTTP response as output. A single instance of the code can serve a single HTTP request at a time.

In order to serve multiple requests we use a Web Server Gateway Interface (WSGI) server, which behind the scenes runs multiple instances of the code. Depending on its configuration it can run anywhere from one to hundreds of workers, each running a single instance of our code. In our case, we opted to have our WSGI server run 50 workers, which in turn run 50 instances of our application code. That is, every WSGI server we use can handle 50 user requests at a time.
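
To make the worker model concrete, here is a minimal sketch of such a configuration. Our service actually runs on an Apache WSGI server, so this gunicorn-style config file (which happens to be plain Python) is only an illustration of the same idea; the port and timeout values are assumptions:

```python
# gunicorn.conf.py -- illustrative sketch only; our service runs on an
# Apache WSGI server, but the fixed-size worker pool concept is the same.
bind = "0.0.0.0:8000"  # port the container exposes (assumed value)
workers = 50           # 50 workers -> 50 requests handled concurrently
timeout = 300          # must exceed our slowest route (~3 minutes)
```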

We then containerized our application using Docker. This doesn't affect how the WSGI server works at all; it simply gives us an isolated environment for running our application securely and makes it easy to deploy on the cloud using K8S.

Our WSGI server can serve 50 HTTP requests at a time, so every container running our micro-service code can handle 50 requests. We run between 8 and 16 pods in our cluster, so we can serve between 400 and 800 requests at any single moment (50 × 8 = 400 requests and 50 × 16 = 800 requests).

Flow: end-to-end path of a request, including the WSGI server and application code. This flow shows only 2 pods, though their number can be scaled higher.

The service was over-saturated:

Using the ALB logs, we could see that almost all the failing requests sent to the application were for a specific API route. This specific route is heavily resource-consuming, requiring around 3 minutes to return a response.

The sudden batch of 6,000 requests per minute occupied every available request handler of our micro-service. The WSGI workers serving the application were all busy serving requests for this specific API, which takes around 3 minutes to respond. Because all of the workers were tied up for 3 minutes, new requests were denied by the WSGI server, telling the client to try again later.

The K8S controller sends a health check every few seconds to every pod running our micro-service. In our case, the health checks sent by the K8S controller were rejected as well by the WSGI server, because it was busy serving other clients' requests.

The WSGI server simply ignores such requests, and it is the client's responsibility to retry them or to take action when no response is given at all. When the health check requests sent by the K8S controller are denied, the controller thinks the micro-service is not functioning, declares it unhealthy, and kills the pod running it.
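
To illustrate why the probes failed even though the application code was fine, here is a minimal health-check endpoint (a hedged sketch; the real route name and framework in our service may differ). The handler itself is trivial, but it still needs a free WSGI worker to run it, and during the incident there was none, so the probe timed out:

```python
# Minimal health-check endpoint (illustrative sketch; route name and
# framework are assumptions). Even this trivial handler only runs once a
# WSGI worker is free to pick the request up.
from flask import Flask

app = Flask(__name__)

@app.route("/health")
def health():
    # Returns immediately -- but never runs if all workers are busy
    # serving 3-minute requests, so the probe times out.
    return "OK", 200
```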

When the incident happened, the micro-service was over-saturated with requests and all of our pods (and their WSGI servers, respectively) were busy and couldn't handle new requests. This included the health checks sent by the K8S cluster controller.

Shockingly, the health check mechanism, which is supposed to make sure our application is healthy and running, was the cause of our incident. It kept concluding the application was unhealthy while in fact it was healthy and busy serving customers. This resulted in the phenomenon of pods crashing and being re-deployed again and again without any actual error in the application running on them.

Every time a pod crashed, a new one was brought up, and immediately a batch of client requests arrived until the pod was over-saturated and couldn't serve any more requests. Then the health check requests were rejected and the pod was killed.

Because the pod was killed without a warning or a real bug originating from the application code itself, no warning or critical logs were written. This explained why we couldn't find any trace of the issue.

Because our pods were terminated before they had finished processing the clients' requests, they didn't return an answer to the application load balancer (ALB), which in turn returned 503 to the clients.

And worst of all: because our pods were terminated and no answer was given to the clients, their code retried the requests. Retrying the requests added more load on our service, as it had to handle those as well as new ones. The overload saturated all the WSGI workers, leaving no worker free to answer the health checks sent by our K8S cluster controller. That made K8S declare all the pods unhealthy and terminate them for a long period of 40 minutes. Our service's ability to serve requests shrank while the number of requests grew. The result was that pods couldn't become healthy until all the requests had timed out.
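
We don't control our clients' code, but for illustration, a friendlier retry policy with exponential backoff and jitter might look roughly like the sketch below (the function name and parameters are hypothetical, not taken from the client's code):

```python
import random
import time

import requests

def get_with_backoff(url, max_attempts=5, base_delay=1.0, timeout=30):
    """Hypothetical client-side retry: exponential backoff plus jitter,
    instead of hammering the service again immediately after a failure."""
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code < 500:
                return resp  # success, or a client error worth surfacing
        except requests.RequestException:
            pass  # timeout or connection error: fall through to the backoff
        # Sleep 1s, 2s, 4s, ... plus jitter so retries don't arrive in waves.
        time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    raise RuntimeError(f"{url} still failing after {max_attempts} attempts")
```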

To conclude our incident: too many requests were sent in a short time window, far more than our micro-service could handle. This left the service unable to handle more requests, which in turn made the K8S controller's health checks fail, causing it to restart the pods non-stop. Eventually, when the overload ended, the health checks passed and the pods became stable again.
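
For reference, this is roughly what such a liveness probe looks like when defined with the official Kubernetes Python client. The path, port and thresholds below are illustrative assumptions, not our production values; they are the knobs that determine how quickly a busy pod gets killed:

```python
from kubernetes import client

# Rough shape of a liveness probe built with the official Kubernetes Python
# client. Path, port and thresholds are illustrative assumptions.
liveness_probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/health", port=8000),
    period_seconds=10,    # how often the probe is sent
    timeout_seconds=5,    # how long to wait for a 200 OK response
    failure_threshold=3,  # consecutive failures before the pod is killed
)
```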

How do you handle heavy stress on an application?

Systems get saturated. Period. This goes for any system: power grids facing high demand for electricity, trains without enough space for all their passengers, and yes, also web applications that can't handle all their HTTP requests at a certain point in time.

When a system is under stress, two major solutions should be considered:

  1. Increase the maximum saturation point, so the system can serve more clients. (Like adding more trains, or building new power stations, for the examples above.)
  2. Change how client requests are handled, so they are spread over a wider time window, with no peaks that reach the saturation point. (In the examples above, a solution could be offering cheaper tickets for off-peak hours or running a campaign promoting energy-saving appliances, etc.)

Eventually, every system reaches a maximum point beyond which it can't handle more clients. This point is called the saturation point. Think about it for a minute.

Lucky for us, we don't need to manage power grids or trains, but web applications. There are many solutions for application stress and overload, mainly focusing on two major aspects:

  1. Increase the application's ability to handle more client requests. This can be done, for example, by scaling it and running more instances of it; in our case, by running more pods or increasing the number of workers per WSGI server.
  2. Reduce the number of requests that reach the application itself, by controlling the timing of the requests or by offloading some of the HTTP requests to other components that handle them. For example, we can set up a cache server to cache frequently used HTTP requests and return the same HTTP response to multiple clients, instead of the application processing the same request multiple times.

1st attempt — scale the application to be able to handle more clients’ requests:

The first solution we tried was to scale the application out horizontally by spinning up more pods running the application. This solution is nice, but it is not limitless.

Very quickly we hit the maximum number of pods we could run. As a reminder, our application has a MySQL DB at its back end, and every pod contains multiple instances of our application, each of which connects to the DB. Eventually we had too many connections to the database and it simply couldn't handle more.

Conclusion: the maximum number of running instances of the code is limited by the number of connections the database can accept; therefore we couldn't scale out any more.
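
A quick back-of-the-envelope calculation shows why the database caps horizontal scaling. The numbers below are illustrative assumptions, not our exact settings:

```python
# Back-of-the-envelope check of why the database limits scale-out.
# All numbers are illustrative assumptions.
workers_per_pod = 50        # WSGI workers per pod
connections_per_worker = 1  # assume each worker keeps one DB connection open
max_db_connections = 1000   # e.g. MySQL's max_connections setting

max_pods = max_db_connections // (workers_per_pod * connections_per_worker)
print(max_pods)  # -> 20 pods at most, regardless of how much compute we add
```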

Sadly, even when we scaled out as much as we could, the load on our micro-service was too high, and we still saw our service crash. We needed a better solution.

2nd attempt — use caching to reduce the number of requests sent to the application itself:

The second thing we did was use a caching server. We used it to cache frequently used HTTP responses, processed by the application only once. When other clients requested the same thing, the cache server sent the cached response instead of having the application re-process the request. This solution is great when the response is the same for multiple HTTP requests.

The API route our clients required was indeed cacheable. We set up a caching server that examines every request sent to our application; if the request is cacheable, it returns a cached result.

The caching server absorbed the bulk of the requests, reducing the load on our application. Meanwhile, our application kept handling the HTTP requests that can't be cached, such as those manipulating data in our database. Overall, the overload on our application was cut off, all health checks are now handled in time, and no pod terminations occur anymore. We can summarize it by saying we raised the saturation point of our application.
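
Our actual setup places a dedicated caching server in front of the application, but the idea can be sketched in a few lines of Python with a simple TTL cache decorator (illustrative only; the names and TTL values are assumptions):

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds=60):
    """Cache a function's result per set of arguments for ttl_seconds.
    Illustrative in-process sketch of the idea; in production the caching
    is done by a dedicated server in front of the application."""
    def decorator(func):
        store = {}  # key -> (expiry_timestamp, cached_result)

        @wraps(func)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = time.time()
            hit = store.get(key)
            if hit and hit[0] > now:
                return hit[1]               # fresh cached response
            result = func(*args, **kwargs)  # the expensive ~3-minute work
            store[key] = (now + ttl_seconds, result)
            return result
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=300)
def heavy_report(customer_id):
    ...  # the resource-heavy API route's logic would live here
```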

This change solved our incident for good!

The positive feedback effect of over-loading a micro-service:

Some time after solving the incident, we discovered that one of our clients had made a change to their code, which caused the high demand of HTTP requests on our application.

Checking the customer's code, we saw that when requests failed, the code retried sending them multiple times, with a timeout of 30–40 minutes! When the micro-service was over-saturated, the K8S controller started killing pods. This created a serious issue: our micro-service's maximum stress point dropped because of the missing pods, reducing its ability to handle requests (8 pods = 400 requests can be served, 7 pods = 350 requests, and so on). Eventually our micro-service had zero running pods, and so couldn't handle any requests at all.

This created a positive feedback effect where the number of client requests increased because of HTTP request retries, while fewer pods were available. More requests, fewer pods. This effect made the recovery of the service harder and longer.

Only after all the requests had been handled and all the retries had timed out was the stress on our micro-service released, giving new pods enough time to pass health checks and become able to respond to our clients.

Epilogue — it is our responsibility:

It is true that we could have asked, and did ask, our clients to work more efficiently with their code too. But still, it is our responsibility to make sure the application can serve requests for anyone.

Following this incident, we changed the architecture of our application so it can handle more requests, work more efficiently with the MySQL DB, and cache existing responses.

A final thought

Health checks are meant to monitor the containers and the service, so the K8S controller can restart the pods when needed. In this case, we can see that the health check functionality didn't work well under high load. Alas, it made the service become unavailable.

This shows how complex and how hard it is to operate at a big scale. Many companies and teams will eventually reach this point of stress, and will have to figure out how to extend it for their product.

I can give these recommendations to whoever reads this article:

  1. Make sure to run a stress test on your application. It will show you your application's maximum stress point (see the sketch after this list).
  2. When planning an application, consider designing it in a way that allows you to cache some of the results.
  3. Check the way you handle connections to other dependencies, mainly databases, as they will eventually limit your ability to scale out.
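
To make the first recommendation concrete, here is a minimal load-test sketch using Locust; the tool choice, host and endpoint are assumptions for illustration:

```python
# locustfile.py -- minimal load-test sketch (tool, host and endpoint are
# illustrative assumptions). Run with:
#   locust -f locustfile.py --host https://my-service.example.com
from locust import HttpUser, task, between

class ServiceUser(HttpUser):
    wait_time = between(1, 3)  # seconds each simulated user waits between tasks

    @task
    def heavy_route(self):
        # Hit the expensive route; ramp the user count up until responses
        # degrade to find the service's saturation point.
        self.client.get("/api/heavy-report")
```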
