A more resource-efficient HA setup for metrics collection agents using Kubernetes leader election

Julien Levesy
Voi Engineering
Jun 9, 2023

Setting up the scene

Greetings from the Voi cloud infrastructure team! Our team is in charge of maintaining and running the core infrastructure supporting our fleet of electric scooters and e-bikes deployed across Europe, and all the rides that happen on them every day!

One of our key missions as an infrastructure team is to provide zero-effort observability to our developers. At Voi, every existing and newly scaffolded microservice comes with pre-provisioned dashboards and predefined alerts, so that developers can easily understand their application's state at any point in time.

To support a safe development workflow, we maintain two environments in addition to production, so that our developers have everything they need to test and validate their changes before shipping them! Each environment means at least one full Kubernetes cluster running roughly 200 applications (~500 pods), all scraped every 15s! And I’m leaving out our other dedicated infrastructure (data and Machine Learning), which is also very metrics oriented.

As you can see, we rely heavily on time series for observability, and we’re dealing with quite a lot of them: roughly 2 million active time series in total just for the three environments supporting our main application (dev, stage and prod).

In an effort to provide an easy to use and reliable monitoring system, in January 2023 we deployed an instance of Grafana Mimir in a dedicated Kubernetes cluster, acting as a shared metrics storage backend for all our environments. This lets us provide highly available time series storage with (almost) unlimited retention, since Mimir offloads cold samples to object storage.

We’re incredibly happy with this solution as it has proven to be a very reliable asset and easy to work with!

The following diagram sums up our current metrics monitoring architecture. Long story short, in each environment we run a pair of Prometheus instances in agent mode, as well as the environment’s Grafana, which reads from Mimir.

Highly available metrics collection

We strive to achieve high availability (HA) for each component of that monitoring stack. As our Kubernetes clusters are automatically managed (upgrades or autoscaling), at any point in time a pod can go down because it is rescheduled to another node. If this happens to a metrics collection agent, we lose observability for the environment while the pod is being rescheduled. Sadly, scheduling and starting a pod can take a non-negligible amount of time depending on the current state of the cluster.

So running a single agent instance isn’t acceptable for us.

The recommended approach to achieve HA on the agent side is to run a pair of agents at the same time on every monitored Kubernetes cluster, and then have the Mimir distributors, through a system called the HA tracker, choose which replica’s samples are accepted, to avoid ingesting the same samples twice.
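For reference, the HA tracker deduplicates based on two external labels that each agent attaches to its samples: `cluster` (identifying the HA pair) and `__replica__` (unique per replica). A minimal sketch of the Prometheus side, with placeholder values:

# Sketch of the external labels used by Mimir's HA tracker for deduplication.
# Label names are Mimir's defaults; the values are placeholders.
global:
  external_labels:
    cluster: prod-environment        # identifies the HA pair
    __replica__: prometheus-agent-0  # unique per replica, dropped by Mimir on ingestion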

The following figure summarizes this setup: both agents collect and send samples, but only one stream of samples is actually ingested (the green one).

This approach, while being the most reliable, comes with a few drawbacks:

  • It requires, for now, running an etcd cluster alongside the existing Mimir installation to store the currently active replica across all distributors.
  • It consumes a lot of resources and is very chatty over the network, only for half of the samples sent by a Kubernetes cluster to be dropped by the distributors because they are redundant.
    For the record, we measure roughly ~4GB of memory usage, ~1 vCPU and ~3Mbps transmitted for a single Prometheus agent replica in production (~1 million active time series). Double that for a pair of agents: this means money!

(For full transparency, all our measurements are made on Prometheus 2.44.0 running in agent mode with GOGC set to 50)
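(Tuning GOGC is just a matter of setting an environment variable on the agent container; an illustrative pod-spec excerpt:)

# Illustrative excerpt: tuning the Go garbage collector on the agent container.
containers:
  - name: prometheus
    image: prom/prometheus:v2.44.0
    env:
      - name: GOGC
        value: "50"  # more aggressive garbage collection, trading CPU for lower memory usage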

In an effort to provide a simpler and more efficient setup for agents while still maintaining decent guarantees regarding availability, we started toying with another approach.
We want to elect a single agent as the leader, just like the Mimir HA tracker does on the server side, but this time using the Kubernetes API, directly from the observed cluster.

The following figure summarizes this idea: only the current leader (prometheus-agent-0) collects and sends samples. The follower watches the lease and periodically tries to claim it.

This alternative is interesting because it doesn’t require running etcd alongside Mimir and still guarantees a decent failover time if the active pod starts misbehaving or gets rescheduled. Bonus point: only one pod does the heavy lifting at a time, which means fewer resources consumed overall!

Now that we have been using it for a little while, we’re happy to introduce in this blog post the result of this experimentation: prometheus-elector.

Interested? Let’s dive in!

Kubernetes leader election 101

The Kubernetes API offers a set of endpoints that makes it easy to implement leader election among multiple replicas of an app running in a cluster.

The main idea is that the replicas participating in the election race against each other to acquire a leader lease, represented as a Kubernetes resource. The first replica to grab the lease becomes the leader of the group and then renews it periodically for as long as it wants to keep the leadership. The followers periodically try to claim the lease and become the leader as soon as they manage to do so.
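Concretely, the lease is a plain coordination.k8s.io/v1 Lease object. Here is roughly what it looks like while held (the name, namespace and values below are illustrative):

# Roughly what the leader lease looks like while held; values are illustrative.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: prometheus-elector
  namespace: monitoring
spec:
  holderIdentity: prometheus-agent-0  # the current leader
  leaseDurationSeconds: 10            # followers may claim the lease once it expires
  renewTime: "2023-06-09T10:00:00.000000Z"
  leaseTransitions: 3                 # how many times leadership changed hands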

All of this is also backed by etcd, but this time it is our cloud provider that is running it for us 😀!!

configmap-reload + k8s leader election = prometheus-elector

Running Prometheus alongside a sidecar container that watches for configuration changes (configmap-reload, or more recently prometheus-config-reloader) is a common pattern in the Kubernetes ecosystem. The sidecar watches the configuration file and, as soon as it changes, tells Prometheus to reload its configuration through the reload endpoint of its management API, avoiding a time-consuming restart.
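For this pattern to work, Prometheus only needs its lifecycle API enabled, so the sidecar can POST to the reload endpoint (/-/reload) when the file changes. A minimal sketch of the relevant container arguments:

# Minimal sketch: flags allowing a sidecar to trigger hot configuration reloads.
containers:
  - name: prometheus
    image: prom/prometheus:v2.44.0
    args:
      - --config.file=/etc/prometheus/config.yaml
      - --web.enable-lifecycle  # exposes POST /-/reload on the web port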

The following figure summarizes this setup.

prometheus-elector is built around the same idea, but lets you express the configuration based on the replica’s current status in the ongoing election. If the replica is a follower, the follower configuration is applied; if it is the leader, the leader configuration is applied.
To be specific, prometheus-elector generates a Prometheus configuration file and writes it to a volume shared between the two containers. It then tells Prometheus to reload its configuration through the management API, just like `configmap-reload` or `prometheus-config-reloader` would.

The next figure shows a Prometheus pod running prometheus-elector.

It is also worth mentioning that during pod startup, an init container generates an initial follower configuration file, to make sure that Prometheus can start and always starts as a follower until it is ready.
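Putting it all together, the pod layout looks roughly like the sketch below. The prometheus-elector image, flags and mount paths are illustrative, not its exact CLI:

# Rough sketch of the pod layout; prometheus-elector details are illustrative.
initContainers:
  - name: elector-init
    image: prometheus-elector  # placeholder image name
    volumeMounts:              # writes the initial follower configuration
      - name: generated-config
        mountPath: /etc/prometheus
containers:
  - name: prometheus-elector   # runs the election, rewrites the config, triggers reloads
    image: prometheus-elector  # placeholder image name
    volumeMounts:
      - name: generated-config
        mountPath: /etc/prometheus
  - name: prometheus
    image: prom/prometheus:v2.44.0
    args:
      - --enable-feature=agent
      - --config.file=/etc/prometheus/config.yaml
      - --web.enable-lifecycle
    volumeMounts:
      - name: generated-config
        mountPath: /etc/prometheus
volumes:
  - name: generated-config
    emptyDir: {}               # shared between prometheus-elector and Prometheus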

Last but not least, prometheus-elector continuously monitors the local Prometheus’s state to intelligently manage its participation in the leader election. For example:

  • On startup, prometheus-elector waits for Prometheus to report a ready status before starting to participate in the election. With existing data on disk, Prometheus might take some time to become ready after starting (sometimes around a minute), and we don’t want to risk holding the leadership during that time.
  • At runtime, if the Prometheus container reports an unhealthy state or repeatedly fails to answer a healthcheck, prometheus-elector stops participating in the election and, if it is the leader, releases the leadership to let another replica take over.
    This allows it to detect nasty cases like the Prometheus container being OOM killed or running out of disk space (the health endpoints involved are sketched below).
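For reference, these checks rely on Prometheus’s standard /-/healthy and /-/ready endpoints, the same ones you would normally point Kubernetes probes at. A sketch for the Prometheus container, with illustrative probe timings:

# Prometheus's standard health and readiness endpoints; probe timings are illustrative.
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  periodSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  periodSeconds: 5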

Configuration file

The prometheus-elector configuration lets you specify a configuration applied when the replica is a follower, and another applied when it is the leader. To limit duplication, the leader section only defines overrides, not the full configuration.

The recursive merge of the two configurations is done using the great imdario/mergo library.

For example, the following prometheus-elector configuration tells every replica to scrape the target localhost:8080/metrics every 5s, but only the current leader actually sends the collected metrics to a remote write storage.

# /etc/prometheus-elector/config.yaml
follower:
  scrape_configs:
    - job_name: 'somejob'
      scrape_interval: 5s
      static_configs:
        - targets: ['localhost:8080']

leader:
  remote_write:
    - url: http://remote.write.com

If the replica is a follower, the generated configuration will look like this:

# /etc/prometheus/config.yaml
# only the `follower` part is present.
scrape_configs:
  - job_name: 'somejob'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8080']

If the replica is the election leader, the generated configuration will look like this:


# /etc/prometheus/config.yaml
# The `follower` part is still present, but is merged with the `leader` one.
remote_write:
  - url: http://remote.write.com
scrape_configs:
  - job_name: 'somejob'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8080']

An Active/Passive setup of Prometheus Agents

The previous configuration example already kinda hinted at it 😀. But yes, prometheus-elector allows us to solve our initial agent high availability problem in an efficient manner. We can leverage the leader election to make sure that only one active agent actually performs the metrics scraping and sending.

In our case, we decided that the follower scrapes nothing except itself and the prometheus-elector metrics. Only the leader does the full scraping and sample pushing. This is mainly motivated by avoiding unnecessary network and resource usage, at the cost of risking the loss of a few scrapes. But prometheus-elector lets you do whatever you want, as long as you can express it in the configuration file!

Our scrape interval is globally set to 15s for all our jobs, and the prometheus-elector lease duration is set to a slightly shorter 10s.
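As a rough sketch, our elector configuration looks like the following. Job names, service discovery and the remote write URL are placeholders, and how the leader’s scrape_configs combine with the follower’s depends on the recursive merge, so treat this purely as an illustration:

# Rough sketch of our active/passive setup; names and URLs are placeholders.
follower:
  global:
    scrape_interval: 15s
  scrape_configs:
    - job_name: self  # followers only watch themselves (Prometheus and prometheus-elector)
      static_configs:
        - targets: ['localhost:9090']
leader:
  scrape_configs:
    - job_name: kubernetes-pods  # the leader performs the full collection
      kubernetes_sd_configs:
        - role: pod
  remote_write:
    - url: https://mimir.example.com/api/v1/push  # placeholder URL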

The following graph represents the cumulative network traffic of the 2 agents before and after we rolled out prometheus-elector:

  • On the left side, running the standard HA setup (2 active agents): the blue line is transmitted bytes per second, the green line is received bytes per second multiplied by -1.
  • On the right side, with only one active agent using prometheus-elector: the orange line is transmitted bytes per second for all the replicas, the yellow line is received bytes per second multiplied by -1.

As expected, we cut the amount of received and transmitted data by half… even a little more than half, but to be 100% transparent, we also rolled out a configuration change at the same time that dropped a few time series we didn’t care about.

Failover and Resource management

The following graphs show the resource usage when a failover happens; in this case, replica-0 was rescheduled due to a node scale-down. The pod received a SIGTERM and performed a graceful shutdown.

Regarding CPU usage, the replica taking over ends up consuming more or less the same as the previous leader, while the follower’s CPU usage drops to almost nothing.
On the memory side, we can see that the first replica (the one that got rescheduled) still consumed a bit of memory after starting back up, and slowly released it.
Another interesting fact: Prometheus clears its disk after becoming a follower!
Finally, network traffic clearly switches to the other replica, and the follower’s network traffic becomes negligible.

As we can see, as soon as the follower configuration is applied, Prometheus nicely releases its allocated resources (CPU, memory, network and even disk) when it goes from leader to follower.

Failover and missed scrapes

Just like Mimir’s distributor HA tracker, prometheus-elector does not guarantee that the collection agents are not going to miss a single scrape.

How many scrapes you might miss depends on the failure scenario on the agent side:

  • If the pod is gracefully terminated, prometheus-elector releases the leadership before shutting down. This means a takeover happens as soon as another replica claims the lease; replicas retry every 2 seconds by default.
  • If the pod is terminated ungracefully, the lease needs to expire before another replica can claim the leadership. In that case, you will likely lose a single scrape with the default lease duration (10s) and a scrape interval of 15 seconds.
  • If the Prometheus container becomes unhealthy or gets restarted, things get a bit more complicated: prometheus-elector has to observe multiple health check failures (by default 3, checked every 5 seconds) before releasing the leadership. In that case, we would expect the Mimir HA tracker to be more reactive, as it would immediately start consuming the other stream of samples.

All of this timing configuration is fully customizable, so it can be optimized!
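To give an idea of the knobs involved and the defaults mentioned above (the key names below are illustrative, not prometheus-elector’s actual flags):

# Illustrative summary of the election and healthcheck timings discussed above;
# key names are not prometheus-elector's actual flags.
leaseDuration: 10s        # how long an unrenewed lease stays valid before followers can claim it
retryPeriod: 2s           # how often followers try to claim the lease
healthcheckPeriod: 5s     # how often prometheus-elector checks the local Prometheus
healthcheckFailures: 3    # consecutive failures before releasing the leadership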

Overall, prometheus-elector offers similar guarantees to Mimir’s HA tracker, but with lower resource consumption, and it saves us from running an etcd cluster alongside Mimir!
In our case we can afford to miss a few scrapes, so that’s a win.

To scrape or not to scrape when following?

prometheus-elector also allows specifying scrape targets for followers, but does it make sense? One could imagine that, during an unexpectedly long failover, the newly promoted replica could resend the samples it has already collected and allow the remote storage to fill the gap.

I have to be honest here: I don’t have a strong answer. I tend to think that, for now, Prometheus doesn’t behave like this: it ignores all samples recorded before the instant the remote write target is added to the configuration. I asked the community about it, so feel free to monitor this thread, as I could have completely gone off the rails here 😀.

Try it!

An example Helm chart can be found in the repository; we recommend starting from there and building your own, as we don’t plan to release an official Helm chart for now.
If you wish to play with it, the repository also comes with a development environment that lets you run this agent setup on a local k3d cluster.

Please note that you need a Prometheus version newer than v2.32 if you want to start Prometheus in agent mode without any remote_write target configured.
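(Agent mode itself is enabled through a feature flag, as in the pod sketch earlier:)

# Agent mode is enabled via a feature flag on the Prometheus binary.
args:
  - --enable-feature=agent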

That’s all folks!

We hope that you enjoyed this article! If you’re interested in prometheus-elector feel free to check out the repository. If you’re interested in joining a great and innovative company on a mission to make cities more liveable, we have a ton of open positions in our research and development department!
