Why and How We Use Prometheus to Monitor Kubernetes

ZipRecruiter’s journey from Icinga and Graphite to cloud-native open-source monitoring

Engineering Team
10 min read · May 10, 2023

Site Reliability Engineers and other operators constantly monitor their systems to ensure they are healthy and functioning. This can be as simple as glancing at a laptop’s screen to see that it’s still turned on, or as complex as having an entirely separate distributed monitoring system running alongside the main business system.

Icinga is one such monitoring system that has provided great service to organizations for over a decade, ZipRecruiter included. However, as we moved our production services (along with our development and staging environments) to Kubernetes (K8s), Icinga was no longer the best choice for monitoring.

Here we cover the reasoning that led us to Prometheus, along with some hard-earned insights on how to implement it successfully.

Icinga’s server-centric model clashes with K8s’ fluid nodes

Icinga is a popular open source monitoring tool that evolved from Nagios, originally known as NetSaint. It was designed in the days when one installed and configured servers, deployed software onto them, and let it run that way for months or years at a time. Icinga has the “server” as a central piece of its data monitoring model, and adding new servers requires regenerating Icinga’s configuration and restarting its instances.

Kubernetes, however, manages compute power with much less emphasis on unique, long-term identity. Instead of dedicating “these 5 servers to run the web site, these 2 to run the databases, etc.” now and forevermore, K8s deals with a pool of “nodes” that it treats as interchangeable servers.

The Kubernetes Scheduler can choose to migrate workloads across nodes. For example, when using cloud servers (such as AWS EC2, as we do at ZipRecruiter), Kubernetes can determine that some nodes are under-utilized, decommission a few, and move the workloads from the to-be-terminated nodes onto remaining nodes with available capacity. Icinga’s ability to monitor processes in such a fluid environment is very limited.

A cumbersome workaround with Icinga merited a new solution

Before Kubernetes, we were running our server software directly on EC2 instances. Although Icinga has some rudimentary performance data collection features built into it, we had chosen Graphite to collect and store metrics about our services. We used these metrics both for dashboard displays on Grafana, and for alerts via Icinga.

We had an Icinga plug-in to query Graphite metrics and alert when they exceeded a specified threshold. We’d also added some extra bells and whistles to support slightly more complex Graphite queries, but it wasn’t very flexible. Our alerts had to be shoehorned into “pseudo-hosts” in order to fit Icinga’s rigid preconceptions, even when we were querying metrics that had been aggregated across all pods providing a service.

Since Kubernetes emits its own metrics in a format intended to be consumed by Prometheus, moving away from Graphite was the logical choice, but there were other reasons too.

Monitoring architecture: side-by-side diagrams comparing Icinga and Graphite in the old system to Prometheus AlertManager and the Prometheus metrics database in the new system, with Grafana dashboards as the user interface in both designs.

Graphite vs. Prometheus — hierarchical metrics vs. unstructured labels

There are several conceptual differences between Prometheus and Graphite. First, Graphite uses hierarchical names for metrics, while Prometheus uses an unstructured set of labels (key/value pairs). If one Graphite time series is named “service.$hostname.responsetime.$appname” and another is named “response.$appname.$hostname”, it’s quite hard to write a query that combines both time series in any sensible way.

In Prometheus, each label on a time series is an independent dimension, and a query can join two time series on a set of labels, regardless of what other labels the two series might not have in common. So in Prometheus you could have “service_response{host=$hostname, app=$appname, other=$thing}” (where “host”, “app”, and “other” are the labels) and “other_kind_of_response{category=$cat, appname=$appname, server=$hostname}” and you can easily join the two time series on whichever labels you choose.
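As a concrete (and hedged) illustration, the metric names, label names, and Prometheus address below are hypothetical rather than taken from our setup, but they show how such a label-based join can be issued programmatically with the official Go client, github.com/prometheus/client_golang:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// The address is illustrative; point it at whichever Prometheus instance you use.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example.internal:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Average response time per app: the division matches the two vectors on the
	// "app" label, regardless of any other labels the raw series carry.
	query := `sum by (app) (rate(service_response_seconds_sum[5m]))
	            / on (app)
	          sum by (app) (rate(service_response_seconds_count[5m]))`

	result, warnings, err := promAPI.Query(context.Background(), query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}

The same PromQL expression can be used unchanged in Grafana panels or in recording and alerting rules.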

Graphite vs. Prometheus — system push vs. monitor pull

Another difference between Graphite and Prometheus is how they collect information. Graphite is push-driven; that is, the application code connects to the Graphite server and sends it metrics. This can be done over a TCP connection, or via connectionless UDP packets. Both methods have their respective advantages and drawbacks.

UDP is high-performance, but it can be difficult to manage your network to reliably deliver UDP packets under all conditions, and if UDP packets end up getting dropped, it can be challenging to determine why. TCP is designed to be reliable, but Graphite requires your source of metrics to connect to the monitoring system to push metrics to it. This means that if the monitoring system goes down, it can affect the performance of the system being monitored, and also means that it’s more difficult to have redundant instances of the monitoring system.

With Prometheus’ pull-driven approach, the system being monitored doesn’t have to do anything other than respond to requests on a specific port, serving its current set of metrics. The monitoring system(s) can scrape the system being monitored as frequently or infrequently as they choose. Scraping in this way is significantly more scalable. Applications can update metrics as frequently as needed and it won’t put any additional load on Prometheus, and if a Prometheus instance goes away, the application won’t care. You can have more than one Prometheus instance scraping the same application, and the application won’t care, or even need to know.
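To make the pull model concrete, here is a minimal sketch (not our production code, and the port is arbitrary) of an application exposing its metrics for scraping with the standard prometheus/client_golang library:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default registry (which already includes Go runtime and process
	// metrics) on /metrics. That is all the application has to do: any number of
	// Prometheus instances can scrape this endpoint, as often or as rarely as
	// they like, without the application knowing or caring.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}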

Dealing with High Cardinality in Prometheus

Despite Graphite’s aforementioned drawbacks, it does have two advantages over Prometheus that we had to deal with. First, Graphite doesn’t care how finely you subdivide your time series, since each leaf of its hierarchical structure is stored in its own file on disk. Prometheus, on the other hand, creates a separate time series for every unique combination of label values and keeps them all together, so it can run into performance problems (and even crash) when labels have “high cardinality”, meaning a metric has too many distinct combinations of label values. As a result, we had to be more careful with the cardinality of our Prometheus metrics than we had been with Graphite.

3 guidelines to limit metric cardinality in Prometheus

  • Avoid server specific labels — We avoid using ephemeral data like IP addresses or node names as labels, because these come and go, and over time can gradually add up to very high cardinality.
  • Avoid application specific labels — We also avoid high-cardinality application data like the names or ids of employers, job-seekers, or jobs in our metric labels, because we have hundreds of thousands to many millions of these, and we don’t need to monitor metrics at such a detailed granularity.
  • Group web data labels — Even for HTTP status codes, we try to use only the first digit as a metric label. We can track 3xx redirects versus 4xx client errors versus 5xx server errors, but not exactly which of the roughly 20 specific status codes we served.

The general rule of thumb we follow for any one metric is to keep the product of the cardinalities of all the labels for that metric under a thousand.

In this example:

{host=$hostname, app=$appname, other=$thing}

this would mean (number of hostnames * number of appnames * number of other things) < 1,000.
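As an illustration of the third guideline (the metric and label names here are hypothetical, not a ZipRecruiter convention), an instrumented handler can record only the status class rather than the exact status code:

package metrics

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// httpResponses deliberately carries only a "status_class" label; adding the
// full status code, URL path, or hostname would multiply the number of series.
var httpResponses = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_responses_total",
		Help: "HTTP responses served, by status class.",
	},
	[]string{"status_class"},
)

// statusClass collapses an HTTP status code to its first digit ("2xx", "4xx",
// "5xx", ...), keeping the label at a handful of values instead of ~20.
func statusClass(code int) string {
	return fmt.Sprintf("%dxx", code/100)
}

// RecordResponse increments the counter for the given status code.
func RecordResponse(code int) {
	httpResponses.WithLabelValues(statusClass(code)).Inc()
}

With a single label holding at most a handful of values, this metric’s cardinality stays far below the thousand-series rule of thumb described above.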

Storing metrics over long time periods in Prometheus is RAM intensive

The second advantage of Graphite is that it is designed to store metrics over long time periods, aggregating data over longer intervals as the data gets older. Prometheus doesn’t have this capability. It stores its data at a single sampling frequency for a fixed amount of time, and then deletes old data past that point.

We currently store 30 days of metrics sampled every 15 seconds for most purposes. We recently added a lot of RAM and disk space to a few of our Prometheus instances so we could increase their retention period to 120 days, in order to see month-over-month trends for capacity planning purposes.

There are several projects that claim to add longer-term storage to Prometheus. We are watching these projects mature, and hope to implement one of them soon. This will allow us to migrate the last few use cases where we still use Graphite to store metrics over a span of a year or more.

Serverless cloud meta-monitoring of multiple Prometheus instances

Our philosophy at ZipRecruiter is that every development team is responsible for their application’s maintenance and functionality for the entirety of its lifecycle. This means that when our SREs chose Prometheus as the main monitoring system, they made sure that each team could create their own instances and alert themselves if something goes wrong.

However, the SRE team did set up meta-monitoring systems to watch all the Prometheus instances. If a considerable number of alerts pops up at once, it is likely that there is a broader issue, and the SRE team can intervene with a cross-organizational solution.

We achieve this by using AWS Lambda to run the meta-monitoring code serverlessly in the cloud, rather than relying on Prometheus itself.

Below is the main skeleton of the Go code that runs in AWS Lambda. It is configurable to check Prometheus and AlertManager instances, as well as arbitrary HTTP or TCP services.

package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"os"
	"sync"
	"time"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/service/secretsmanager"

	"ziprecruiter-internal/common/go/aws/session"
	"ziprecruiter-internal/common/go/buildinfo"
	"ziprecruiter-internal/common/go/config"
	"ziprecruiter-internal/common/go/errutil"
	"ziprecruiter-internal/common/go/interfaces"
	"ziprecruiter-internal/common/go/log"
	"ziprecruiter-internal/monitoring/meta/internal/checkers"
)

func main() {
	ctx := context.Background()
	l := log.New("@tag", "monitoring--meta")
	log.SetStandardLoggerOutput(l)
	defer log.PanicMessagesTo(ctx, l)

	log.Info(ctx, l, "starting up metamonitoring version "+buildinfo.GitCommit)

	var c configuration
	if err := config.Read(&c); err != nil {
		fmt.Printf("Couldn't load config: %s\n", err)
		os.Exit(1)
	}

	cs := buildCheckers(c.Metamon.Monitors, c.Metamon.HistorySize)

	sess, err := session.New(session.Options{Region: c.Pagerduty.APIKeySecretRegion})
	if err != nil {
		log.Err(ctx, l, errutil.Wrap(err, "couldn't create an AWS session"))
		os.Exit(1)
	}

	asm := secretsmanager.New(sess)
	pdik, err := lookupPagerdutyIntegrationKey(asm, c.Pagerduty.APIKeySecretName, c.Pagerduty.ServiceID)
	if err != nil {
		log.Err(ctx, l, err)
		os.Exit(1)
	}

	log.Info(ctx, l, "successfully loaded pagerduty integration key")

	log.Info(ctx, l, "starting lambda function handler")
	lambda.Start(func(ctx context.Context) error {
		checkAll(ctx, l, pdik, cs, c)
		return nil
	})
}

// buildCheckers constructs one checker per configured monitor, based on its type.
func buildCheckers(ms []monitor, sz int) []checkers.CheckFailer {
	cs := make([]checkers.CheckFailer, 0)
	for _, m := range ms {
		switch m.Type {
		case "alertmanager":
			cs = append(cs, checkers.Alertmanager(m.Label, m.Host, m.Insecure, m.Peers, sz))
		case "prometheus":
			cs = append(cs, checkers.Prometheus(m.Label, m.Host, m.Namespace, m.Insecure, sz))
		case "http":
			cs = append(cs, checkers.HTTP(m.Label, m.Host, m.Insecure, sz))
		case "tcp":
			cs = append(cs, checkers.TCP(m.Label, m.Host, sz))
		}
	}
	return cs
}

// checkAll runs every checker concurrently, then evaluates each checker's recent
// failure history and triggers or resolves PagerDuty alerts accordingly.
func checkAll(ctx context.Context, l interfaces.Logger, pagerdutyKey string, cfs []checkers.CheckFailer, cfg configuration) {
	client := &http.Client{}
	client.Transport = &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:   1 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		ForceAttemptHTTP2:     true,
		MaxIdleConns:          100,
		IdleConnTimeout:       90 * time.Second,
		TLSHandshakeTimeout:   10 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
	}
	wg := sync.WaitGroup{}

	wg.Add(len(cfs))
	log.Info(ctx, l, "running checkers", "count", len(cfs))
	for _, cf := range cfs {
		go func(c checkers.Checker) {
			pctx, cancel := context.WithTimeout(ctx, time.Duration(cfg.Metamon.Timeout)*time.Second)
			defer cancel()

			c.Check(pctx, l, client)
			wg.Done()
		}(cf)
	}

	log.Info(ctx, l, "waiting for all checkers to complete")
	wg.Wait()
	log.Info(ctx, l, "checkers finished")

	wg.Add(len(cfs))
	log.Info(ctx, l, "evaluating history", "count", len(cfs))
	for _, cf := range cfs {
		go func(f checkers.Failer) {
			fi := f.Failure(cfg.Metamon.MaxFailureWindow)
			dl := log.WithDetails(l, "pagerduty_dedupe_key", fi.DedupeKey)
			if fi.Failures >= cfg.Metamon.MinFailureThreshold {
				// Enough recent failures: trigger a PagerDuty alert.
				log.Info(ctx, dl, "triggering alert", "triggering_failure", fi)
				if err := emitPagerdutyEvent(cfg, pagerdutyKey, true, fi); err != nil {
					log.Err(ctx, dl, err)
				}
			} else if fi.Failures == 0 {
				// No failures in the window: resolve any open alert.
				log.Info(ctx, dl, "resolving alert")
				if err := emitPagerdutyEvent(cfg, pagerdutyKey, false, fi); err != nil {
					log.Err(ctx, dl, err)
				}
			} else {
				log.Info(ctx, dl, "within failure window, waiting for more observations", "failure_progress", fi)
			}
			wg.Done()
		}(cf)
	}

	log.Info(ctx, l, "waiting for all evaluations to complete")
	wg.Wait()
	log.Info(ctx, l, "evaluations finished")
}

4 Key insights for a successful migration to Prometheus

  1. Take it slow — We made the transition from Icinga to Prometheus fairly slowly, like one should when deploying any new service.
  2. Teach — We gave each team their own Prometheus server, and provided extensive examples of how to configure alerts using the Prometheus AlertManager.
  3. Auto-discover — We make sure to label every application in Kubernetes with the team it belongs to. A simple piece of automation auto-discovers Kubernetes services that expose a metrics port and adds them to the Prometheus instance belonging to the labeled team. See the example ServiceMonitor manifest below.
  4. Monitor stragglers and give them a deadline — As applications were re-structured to run in Kubernetes, their developers also wrote Prometheus alerts for them, and removed the old Icinga alerts. Eventually, very few useful alerts remained in Icinga that weren’t duplicated in Prometheus, and we gave their owners a month to migrate their alerts before we shut down Icinga and Graphite.

Example ServiceMonitor manifest, labeled with the owning team:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    core.zr.org/team: myteam
  name: app-monitor
spec:
  endpoints:
  - honorLabels: true
    port: http-metrics
  - honorLabels: true
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    any: true

At ZipRecruiter, building is just half the path to success; monitoring and maintenance at scale are equally important, and the systems we create to do so are just as innovative. A big part of this is using open source projects that allow us to tailor and tweak things to our specific needs.

If you’re interested in working on solutions to problems like these, please visit our Careers page to see open roles.

About the Author
Jeremy is a Senior Software Engineer at ZipRecruiter, with decades of experience in search engine advertising, game development, and even mainframes. He loves the challenge of creating tools and infrastructure that allow others to build, deploy, and manage their systems while scaling with ZipRecruiter’s rapid growth.
