Monitor Kubernetes Clusters With Prometheus/PromQL and Grafana

Site Reliability Engineering

(λx.x)eranga
Effectz.AI

--

1. Introduction

With the growing complexity of modern applications, it is crucial to have a reliable and efficient monitoring system in place to ensure the smooth operation of Kubernetes clusters. Prometheus is a powerful open-source monitoring tool that is widely used to monitor Kubernetes clusters. In this blog post, we will explore how to monitor a Kubernetes cluster (control plane and worker nodes) using Prometheus. We will cover the key metrics that should be monitored and provide examples of how to query them with PromQL.

To monitor a Kubernetes cluster comprehensively, it is important to collect not only application-level metrics but also metrics related to the Kubernetes infrastructure itself, including the Kubernetes services, the nodes, and the orchestration status. To achieve this, we use three sources of metrics:

  1. Node-exporter, which collects classical host-level metrics such as CPU, memory, and network usage.
  2. Kube-state-metrics, which collects orchestration and cluster-level metrics, such as information about deployments, pod metrics, and resource reservations.
  3. Kubernetes control plane metrics, including the kubelet, etcd, DNS, scheduler, and others, which are collected to monitor the health and performance of the Kubernetes infrastructure.

2. System Architecture

The following figure illustrates the system architecture for collecting Kubernetes metrics using Kube-state-metrics, Node-exporter, and Kubernetes control plane metrics.

2.1 Kubernetes cluster metrics

Kube-state-metrics is a service that watches the Kubernetes API server and generates metrics about the state of objects such as deployments, nodes, and pods. It is essential to note that kube-state-metrics only provides a metrics endpoint; another component, such as the Prometheus server, needs to scrape it and provide long-term storage. The full list of metrics exposed by kube-state-metrics is available in its documentation.

Node-exporter is a Prometheus exporter for hardware and OS metrics. It measures machine-level resources such as memory, disk, and CPU utilization. It is deployed as a DaemonSet (the Prometheus Operator's kube-prometheus stack deploys it automatically), so it scales with the cluster as nodes are added or removed.

Additionally, we use Kubernetes control plane metrics to monitor the control plane itself. Each component (API server, kubelet, etcd, kube-dns, kube-proxy) is instrumented and exposes a /metrics endpoint, which can be scraped by Prometheus.
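If you are deploying node-exporter yourself rather than through the Prometheus Operator, a minimal DaemonSet could look like the following sketch. The monitoring namespace and the image tag are assumptions for illustration and should be adjusted to your environment.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring        # assumed namespace
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true        # report the node's own network and host metrics
      hostPID: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.7.0   # example tag
          args:
            - --path.rootfs=/host
          ports:
            - containerPort: 9100
              name: metrics
          volumeMounts:
            - name: root
              mountPath: /host
              readOnly: true
      volumes:
        - name: root
          hostPath:
            path: /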

2.2 Scrape metrics

The proposed architecture enables Prometheus to collect Kubernetes metrics by leveraging Kube-state-metrics, Node-exporter, and Kubernetes control plane metrics. Prometheus utilizes the HTTP pull model, which involves pulling metric data from the services at predetermined intervals instead of the services pushing the data to it. During each scrape, Prometheus retrieves metric data from all endpoints and stores it in a local time-series database. The scraping process is highly customizable, allowing users to adjust parameters such as scraping interval and timeout to meet the system’s requirements.
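As a sketch of what this looks like in practice, the snippet below shows a minimal prometheus.yml with a custom scrape interval and timeout, a static scrape job for kube-state-metrics, and a Kubernetes service-discovery job for node-exporter. The service names and the monitoring namespace are assumptions for illustration.

global:
  scrape_interval: 30s     # how often Prometheus pulls metrics from each target
  scrape_timeout: 10s      # how long a single scrape may take

scrape_configs:
  # kube-state-metrics exposes cluster/orchestration metrics on its service port
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.monitoring.svc:8080']   # assumed service name

  # node-exporter endpoints are discovered through the Kubernetes API
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: node-exporter
        action: keep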

2.3 Visualizing

After Prometheus collects the Kubernetes metrics data, it is time to visualize the data with Grafana dashboards. Grafana is a popular open-source data visualization tool that allows users to create custom dashboards and graphs for time-series data. With Grafana, we can create and share complex visualization dashboards that display real-time monitoring data in a user-friendly way.

To get started with Grafana, we first need to configure a data source that connects to the Prometheus server. Once the connection is established, we can start building custom dashboards and panels to display the Kubernetes metrics data. Each panel runs PromQL queries against Prometheus, so we can apply various statistical and mathematical functions to the collected metrics to extract meaningful insights.
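The data source can be configured in the Grafana UI or declaratively with a provisioning file. The sketch below assumes Prometheus is reachable inside the cluster at prometheus-server.monitoring.svc:9090; adjust the URL to your deployment.

# Grafana provisioning file, e.g. /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                                      # the Grafana backend proxies the queries
    url: http://prometheus-server.monitoring.svc:9090  # assumed in-cluster service URL
    isDefault: true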

Grafana also supports alerting based on the metrics data, which is useful for triggering notifications and taking appropriate actions in case of threshold breaches. We can set up various alert rules based on the Kubernetes metrics data and configure Grafana to send notifications via various channels like email, Slack, or PagerDuty.

2.4 Alerting

Alerting based on the metrics data is a crucial aspect of monitoring any system. In the case of monitoring a Kubernetes cluster with Prometheus, we can use the Prometheus Alertmanager to create and manage alerts based on the collected metrics data. Alertmanager allows users to define alerting rules using PromQL expressions and configure different notification channels such as email, Slack, PagerDuty, and more to receive notifications when an alert is fired.

For example, if the Prometheus metrics data shows that a pod is down or a node is unreachable, an alert can be configured to fire and send a notification to the designated channels. Alertmanager also allows users to configure silence periods for certain alerts or suppress alerts during maintenance periods. This helps reduce alert fatigue and ensures that the alerts that are fired are meaningful and require attention.
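The sketch below shows what such a rule could look like as a Prometheus rule file for the node-unreachable case. The alert name, the 10-minute duration, and the severity label are illustrative choices, not values from this setup.

groups:
  - name: kubernetes-nodes
    rules:
      - alert: KubeNodeNotReady
        # fires when a node's Ready condition has not been true for 10 minutes
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} has been NotReady for more than 10 minutes"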

With Alertmanager, it is also possible to configure alert cascading and grouping, which enables users to group similar alerts and manage them together. This makes it easier to manage and respond to multiple alerts firing at the same time.
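Grouping and routing are defined in the Alertmanager configuration. The sketch below groups alerts by alert name and namespace and routes them to a single Slack receiver; the channel and webhook URL are placeholders, and the timing values are illustrative.

route:
  receiver: slack-notifications
  group_by: ['alertname', 'namespace']  # similar alerts are bundled into one notification
  group_wait: 30s                       # wait before sending the first notification of a group
  group_interval: 5m                    # wait before sending updates for the same group
  repeat_interval: 4h                   # re-send unresolved alerts after this interval

receivers:
  - name: slack-notifications
    slack_configs:
      - channel: '#k8s-alerts'                          # placeholder channel
        api_url: 'https://hooks.slack.com/services/XXX' # placeholder webhook URL
        send_resolved: true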

3. PromQL Examples

Once we have set up the system according to the architecture described above, Prometheus will automatically scrape the Kubernetes metrics. We can then analyze them with PromQL, Prometheus's flexible and powerful query language, which makes it simple to filter and aggregate metrics, calculate rates and ratios, and perform complex operations on the data. Here are some examples of PromQL queries that can be used to monitor the Kubernetes cluster using the scraped metrics.

3.1 Count pods per cluster by namespace

The kube_pod_info metric provides information about each pod, including its namespace, name, IP address, and status. The sum by (namespace) part of the query groups the pods by their namespace and adds up the total number of pods in each namespace. The resulting output will show a breakdown of the number of pods running in each namespace of the cluster.

sum by (namespace) (kube_pod_info)

3.2 Pod restart by namespace

This PromQL query calculates the sum of changes in the ready status of Kubernetes pods over the last 5 minutes, grouped by namespace. It selects the “true” condition of the kube_pod_status_ready metric, which indicates that the pod is ready to serve requests. The “changes” function returns the number of times this condition has changed (from false to true or vice versa) during the specified time range, and the sum by (namespace) clause groups the results by the namespace of the pods. In other words, this query can be used to monitor the stability and availability of pods in each namespace, by counting how many times their readiness status changed over the last 5 minutes.

sum by (namespace)(changes(kube_pod_status_ready{condition="true"}[5m]))

3.3 Pods not ready

This query sums up the number of pods in the not-ready state for each namespace. It selects the kube_pod_status_ready series with condition="false", whose value is 1 for a pod whose Ready condition is false and 0 otherwise. The sum by clause groups the results by namespace, providing a breakdown of the number of pods that are not ready in each namespace. This query is useful for identifying namespace-specific issues related to pod readiness, such as when a particular namespace is experiencing a higher number of failing or unhealthy pods compared to others.

sum by (namespace) (kube_pod_status_ready{condition="false"})

3.4 CPU overcommit

This PromQL query compares the total CPU limits configured for Kubernetes pods in the cluster with the total CPU capacity of the nodes. The sum(kube_pod_container_resource_limits{resource="cpu"}) part of the query calculates the sum of CPU limits across all pods in the cluster, while the sum(kube_node_status_capacity_cpu_cores) part calculates the sum of CPU capacity across all nodes. Subtracting the two gives the amount by which CPU limits exceed the cluster's capacity: a positive result means the cluster is overcommitted on CPU, while a negative result indicates spare capacity. This query can be used to identify CPU overcommitment and to plan for scaling up the cluster or optimizing pod resource allocation. (Note that on kube-state-metrics v2.x, node capacity is exposed as kube_node_status_capacity{resource="cpu"} instead of kube_node_status_capacity_cpu_cores.)

sum(kube_pod_container_resource_limits{resource="cpu"}) - sum(kube_node_status_capacity_cpu_cores)

3.5 Memory overcommit

This PromQL query compares the total amount of memory that Kubernetes pods in the cluster are allowed to use (their memory limits) with the total amount of memory available on the worker nodes. It does this by summing the kube_pod_container_resource_limits metric for the memory resource type across all pods in the cluster, and then subtracting the sum of the kube_node_status_capacity_memory_bytes metric, which reports the total amount of memory available on each node.

This query is useful for identifying cases where pods are collectively allowed to use more memory than the nodes can actually provide, which can lead to performance issues or even pod failures. If the result of this query is consistently positive, it may indicate that additional nodes or node resources are needed to support the current workload.

sum(kube_pod_container_resource_limits{resource="memory"}) - sum(kube_node_status_capacity_memory_bytes)

3.6 Number of healthy cluster nodes

This PromQL query counts the number of nodes that are currently ready in the Kubernetes cluster. It looks at the kube_node_status_condition metric, filtered to the Ready condition with status="true". The ==1 comparison keeps only the series whose value is 1, i.e. nodes whose Ready condition is currently true, and summing those series yields the total. By running this query, you can easily determine the number of nodes in the cluster that are ready and available for use.

sum(kube_node_status_condition{condition="Ready", status="true"}==1)

3.7 Number of cluster nodes that may not work correctly

This PromQL query is used to detect Kubernetes nodes that may not be working correctly. The changes function counts how many times each node's Ready condition (status="true") has changed value over the last 15 minutes, and the by (node) clause groups the results per node. The > 2 comparison then keeps only nodes whose Ready condition changed more than twice in that window, i.e. nodes that are flapping between Ready and NotReady. If the query returns a result for any node, it indicates that the node is unstable, and an alert can be sent to the Alertmanager.

sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (node) > 2

3.8 CPU idle by cluster

This PromQL query estimates how much requested CPU is sitting idle across the cluster. It first calculates the per-second CPU usage rate of each container over the last 30 minutes using the container_cpu_usage_seconds_total metric, excluding the pause containers (container="POD") and series with an empty container name.

Next, each usage series is matched against the container's CPU request from the kube_pod_container_resource_requests metric, using on (namespace,pod,container) to match on those three labels. The avg by (namespace,pod,container) aggregation collapses the request metric to one value per container, and the group_left modifier allows several usage series on the left-hand side to match that single request series.

Finally, subtracting the request from the usage and multiplying by -1 gives the amount of requested CPU that each container is not using. The >0 filter keeps only containers that are consuming less CPU than they requested, and the outer sum adds these values up, giving the total number of requested-but-idle CPU cores in the cluster.

sum((rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[30m]) - on (namespace,pod,container) group_left avg by (namespace,pod,container)(kube_pod_container_resource_requests{resource="cpu"})) * -1 >0)

3.9 Memory idle by cluster

This query estimates how much requested memory is sitting idle across the cluster. container_memory_usage_bytes{container!="POD",container!=""} selects the memory usage of all containers except the pause containers and series with empty container names. The on (namespace,pod,container) clause matches each usage series with the container's memory request, calculated with avg by (namespace,pod,container)(kube_pod_container_resource_requests{resource="memory"}). Subtracting the request from the actual usage and multiplying by -1 gives the amount of requested memory that each container is not using; the >0 filter keeps only containers that are using less memory than they requested. Finally, the outer sum adds these values up, and the result is divided by (1024*1024*1024) to convert it from bytes to GiB.

sum((container_memory_usage_bytes{container!="POD",container!=""} - on (namespace,pod,container) avg by (namespace,pod,container)(kube_pod_container_resource_requests{resource="memory"})) * -1 >0 ) / (1024*1024*1024)

3.10 Number of containers by cluster and namespace without CPU limits

This PromQL query counts, per namespace, the number of containers that do not have CPU limits set. The sum by (namespace,pod,container)(kube_pod_container_info{container!=""}) expression produces one series per container in the cluster. The unless operator then removes every container that also appears in kube_pod_container_resource_limits{resource="cpu"}, i.e. every container that has a CPU limit configured. Finally, count by (namespace) counts the remaining series, giving the number of containers without CPU limits in each namespace.

count by (namespace)(sum by (namespace,pod,container)(kube_pod_container_info{container!=""}) unless sum by (namespace,pod,container)(kube_pod_container_resource_limits{resource="cpu"}))

