Recap: Elasticity in Kubernetes/GKE

Pavan kumar Bijjala
Nov 2, 2022 · 9 min read

I happened to come across client stories and articles about how Kubernetes autoscaling didn’t work for them, or about patterns such as large “sawtooth” spikes in the number of Pods that are out of proportion to a fairly regular, not especially spiky or bursty load. These cases are interesting to review in detail, because such symptoms can result in unnecessary cost.

Scaling in Kubernetes is multi-dimensional (I will recap the details below), and it works unlike traditional servers. In on-prem or typical VM-based environments, autoscaling is driven by metrics (CPU, memory, etc., or the number of requests at the cluster frontend, such as a load balancer), mostly in either the vertical or the horizontal dimension.

Let’s first understand the domain of controllability in Kubernetes, and then move on to how elasticity is managed (i.e., how workloads are scheduled, scaled, or evicted). Along the way we will briefly touch upon best practices, so that we can analyze scaling events and optimize the cluster.

I am taking the view that elasticity is not just a function of autoscaling but also of Pod scheduling behavior.

Domain(s) of Elasticity

Scaling in Kubernetes is multi-dimensional (four dimensions), as depicted below, along with the levers, aka HPA/VPA/CA/NAP, that support each dimension.

Scaling happens both at the node and at the Pod level, each horizontally and vertically. Namely:

  • More Pods via the Horizontal Pod Autoscaler (HPA)
  • Different Pod sizes via the Vertical Pod Autoscaler (VPA)
  • More nodes via the Cluster Autoscaler (CA)
  • Different node sizes via Node auto-provisioning (NAP)

Apart from the above four dimensions, which make up the autoscaling controllability definitions, applications can also define their scheduling controllability choices, in terms of scheduling constraints such as priorityClassName, PodDisruptionBudget, nodeSelector, Pod/node affinity, anti-affinity, topologySpreadConstraints and tolerations, applied based on the characteristics of the workloads.

While trying to find the best node for a Pod to run on, kube-scheduler first eliminates ineligible nodes and then scores the remaining ones; each node can be labeled and/or tainted by the cluster operator, for example via:

--node-taints KEY=VALUE:EFFECT
--node-labels KEY=VALUE
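
As an illustration, the sketch below shows how a cluster operator might apply such taints and labels when creating a GKE node pool; the pool name and the key/value pairs are hypothetical placeholders.

gcloud container node-pools create batch-pool \
  --cluster=CLUSTER_NAME \
  --node-taints=dedicated=batch:NoSchedule \
  --node-labels=workload=batch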
  • Node affinity: nodeSelector is the simplest way to constrain Pods to nodes with specified labels; node affinity expresses the same idea with richer match rules.
  • Pod affinity ensures that two Pods are co-located on a single node. Whenever higher availability is desired, anti-affinity settings can be used to place Pods on separate nodes.
Example manifests for node/Pod affinity are sketched right after this list.
  • Taints and tolerations: taints are the opposite; they allow a node to repel a set of Pods. Tolerations are applied to Pods and allow the scheduler to schedule them onto nodes with matching taints.
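
A minimal sketch of such a manifest (the app label, the node label disktype=ssd and the image are hypothetical placeholders): it pins the Pod to SSD nodes via nodeSelector and spreads replicas across nodes via Pod anti-affinity.

apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  nodeSelector:
    disktype: ssd                # simplest constraint: only nodes with this label
  affinity:
    podAntiAffinity:             # keep replicas of app=web on different nodes
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        topologyKey: kubernetes.io/hostname
  containers:
  - name: web
    image: nginx                 # hypothetical image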

A Pod that is already running on a node when a taint is added continues to run, unless the taint’s effect is NoExecute, in which case the running Pod is evicted (unless it tolerates the taint).

Example: affinity for a Pod that is not expected to be deployed on a preemptible VM, where cloud.google.com/gke-preemptible is a built-in node label in GKE.
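
A minimal sketch of that affinity, assuming GKE applies the cloud.google.com/gke-preemptible label to preemptible nodes; it keeps the Pod off those nodes.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-preemptible
          operator: DoesNotExist     # schedule only on nodes without this label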

In large clusters, you can speed up the scheduler, balancing latency (the time it takes to place new Pods) against accuracy (how rarely the scheduler makes poor placement decisions).

  • You configure this via the kube-scheduler setting percentageOfNodesToScore. By default the scheduler picks an adaptive percentage based on cluster size, scoring at least 5% of the nodes in your cluster.
  • There are two scoring strategies that support the bin packing of resources: MostAllocated and RequestedToCapacityRatio.
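
For self-managed clusters (GKE does not expose the kube-scheduler configuration), a sketch of both settings might look like the following; the 50% value and the resource weights are illustrative only.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50       # score only half of the feasible nodes
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated        # bin packing: prefer already-busy nodes
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1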

Having understood the constraints coming from workloads, along with scheduling policies and profiles, let’s review the autoscaling dimensions and make sense of them so as to define best practices, one dimension at a time.

Horizontal Pod Autoscaler

Horizontal scaling means that the response to increased load is to deploy more Pods, based on configured metric thresholds. Workload resources that support horizontal scaling are Deployments and StatefulSets; DaemonSets do not.

When Pods cannot be scheduled due to a lack of resources, the Cluster Autoscaler adds new nodes to your cluster.

HPA examples follow: the first is based on CPU utilization and the second on traffic capacity.
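
Minimal sketches of both (the Deployment/Service names, replica counts and targets are hypothetical; the traffic-capacity metric name follows GKE's traffic-based autoscaling documentation and should be verified for your GKE version).

# 1) CPU-utilization based HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-cpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

# 2) Traffic-capacity based HPA (GKE Gateway / traffic-based autoscaling)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-traffic
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      describedObject:
        kind: Service
        name: my-app
      metric:
        name: "autoscaling.googleapis.com|gclb-capacity-utilization"
      target:
        type: AverageValue
        averageValue: 70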

Traffic-based autoscaling in GKE is enabled by the Gateway controller and its global traffic management capabilities.

Horizontal Pod Autoscalers can be driven by multiple metrics, custom metrics and external metrics. There is also multidimensional Pod autoscaling, where you can use horizontal scaling based on CPU and vertical scaling based on memory at the same time.

Keep in mind that HPA cannot scale down below 1 replica, even though scaling to zero could be useful when the workload is not used at all, such as staging environments during non-working hours. [Serverless can be an option in these cases.]

Vertical Pod Autoscaler

When you specify a Pod, you can optionally specify how much of each resource (CPU or memory) a container needs. Refer to Resources in Kubernetes for details.

Ephemeral (local) storage as a resource is stable from v1.25. The kubelet can provide scratch space (for logs or caching) to Pods by using local ephemeral storage to mount emptyDir volumes into containers.

It is possible (and allowed) for a container to use more of a resource than its request. However, a container is not allowed to use more than its resource limit.

So limits are a hard ceiling and requests are soft.

The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node. The container runtime typically configures kernel cgroups that apply and enforce the limits you defined: a container exceeding its memory limit is killed, CPU usage is throttled, and a Pod exceeding its ephemeral-storage limit is evicted.
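
A minimal sketch of such a spec (the container name, image and sizes are placeholders), following the guidance below of an equal memory request/limit and a looser CPU limit:

apiVersion: v1
kind: Pod
metadata:
  name: sized-pod
spec:
  containers:
  - name: app
    image: nginx                   # hypothetical image
    resources:
      requests:
        cpu: "250m"
        memory: "512Mi"
        ephemeral-storage: "1Gi"   # scratch space for logs/caching
      limits:
        cpu: "1"                   # larger CPU limit than the request
        memory: "512Mi"            # memory limit equal to the request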

Follow these best practices for enabling VPA, either in Initial or Auto mode, in your application:

  • Don’t use VPA either Initial or Auto mode if you need to handle sudden spikes in traffic. Use HPA instead.
  • A good practice for setting your container resources is to use the same amount of memory for requests and limits, and a larger or unbounded CPU limit.
  • Add Pod Disruption Budget (PDB) policy object to control how many Pods can be taken down at the same time.
  • You can opt individual containers out of VPA with resourcePolicy, as in the sketch below.
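
A minimal VPA sketch under those practices (the Deployment and container names are hypothetical): Initial mode applies recommendations only at Pod creation, and the sidecar container is opted out.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Initial"          # or "Auto"; "Off" gives recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: istio-proxy   # opt this container out of VPA
      mode: "Off"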

In an overcommitted system (where the sum of limits > machine capacity), containers might eventually have to be killed, starting with the less important ones. Pods are divided into 3 QoS classes: Guaranteed, Burstable, and BestEffort, in decreasing order of priority. More details are in the Kubernetes documentation on node-pressure eviction.

For more information about VPA limitations, see Limitations for Vertical Pod autoscaling.

Cluster Autoscaler

CA provides nodes for Pods. Unlike HPA and VPA, CA doesn’t depend on CPU/GPU/memory usage metrics; it acts on the resource requests of the Pods running on the node pool’s nodes, with a default scale-down threshold of 50% request utilization.

Most application teams over-provision Pods via their replica counts, so request utilization tends to be low; hence in practice we see aggressive upscaling and conservative downscaling.

A best practice is to define a Pod Disruption Budget (PDB), limiting the number of Pods that can be taken down simultaneously, for all your applications; not setting a PDB implies the application can tolerate occasional downtime. Typical PDB use cases are ‘do not reduce the number of instances below quorum’ and ‘do not terminate this application without talking to me (maxUnavailable=0)’.
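
A minimal PDB sketch (the name and app label are placeholders) keeping at least two replicas available during voluntary disruptions:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2          # alternatively maxUnavailable: 0 to block disruption
  selector:
    matchLabels:
      app: my-app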

Why is downscaling conservative? If any of the node- or Pod-level checks below do not pass, the node will not be turned down.

  • Pods with restrictive PodDisruptionBudget.
  • Pods using local storage cannot be evicted, as the storage would be lost.
  • System Pods (metrics-server, kube-dns) are not evicted, unless a PodDisruptionBudget is specified.
  • Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false", Pods not backed by a controller object, etc.) block scale-down of their nodes.

To scale down the cluster more aggressively in GKE, use the optimize-utilization autoscaling profile. The default is balanced.

gcloud container clusters update CLUSTER_NAME \
--autoscaling-profile optimize-utilization

In GKE clusters with control plane version 1.22 or later, Pods with local storage no longer block scaling down.

As a best practice,

  • Define a PodDisruptionBudget as appropriate.
  • To avoid temporary disruption, don’t set a PDB for system Pods that have only 1 replica (such as metrics-server).
  • Keep your pod requests close to actual utilization. Consider using the VPA to set your requests close to actual utilization automatically.
  • Avoid using local storage for pods
  • Use autoscaling limits: set --min-nodes and --max-nodes on your node pools to bound capacity, as in the example below.
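
A sketch of setting those limits on an existing node pool (the cluster and pool names, and the bounds, are placeholders):

gcloud container clusters update CLUSTER_NAME \
  --node-pool=POOL_NAME \
  --enable-autoscaling \
  --min-nodes=1 --max-nodes=10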

Node auto-provisioning

Node auto-provisioning is a Cluster Autoscaler capability that automatically adds new node pools, in addition to managing their size, on the user’s behalf. It creates a Compute Engine Managed Instance Group (MIG) for each node pool based on the specifications of the Pods waiting to be scheduled.

When you apply a taint or a label at the node pool level, any new nodes, such as those created by autoscaling, will automatically get the specified taints and labels.
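
A sketch of enabling node auto-provisioning with cluster-wide resource bounds (the bounds are illustrative; check the gcloud reference for the flags supported by your version):

gcloud container clusters update CLUSTER_NAME \
  --enable-autoprovisioning \
  --min-cpu=1 --max-cpu=64 \
  --min-memory=1 --max-memory=256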

Metrics Server

Metrics Server is the source of the container resource metrics for GKE built-in autoscaling pipelines. Metrics Server retrieves metrics from kubelets and exposes them through the Kubernetes Metrics API. HPA and VPA then use these metrics to determine when to trigger autoscaling.

With the GKE metrics-server deployment, a resizer nanny is installed, which grows or shrinks the Metrics Server container vertically by adding or removing CPU and memory according to the cluster's node count. This means the metrics-server Pod gets restarted, causing latencies in fast-growing clusters.

Pick a GKE version that supports metrics-server resize delays (i.e., it delays resizing of the metrics-server), available starting with GKE 1.15.11-gke.9.

Conclusion

In conclusion, having reviewed the aspects of elasticity in Kubernetes, and in GKE in particular, where it is a first-class citizen, I want to keep my workloads simple (not getting fancy with other schedulability features or tricks), with the recommendations below.

As an Application owner

  • Size your application correctly by setting appropriate resource requests and limits as mentioned in VPA.
  • Use HPA for horizontal scaling based on available metrics. The official recommendation is that you must not mix VPA and HPA on either CPU or memory. However, you can mix them safely when using recommendation mode in VPA or custom metrics in HPA (for example, requests per second). Let VPA learn your Pod’s resource needs while keeping HPA in place initially.
  • Set meaningful readiness and liveness probes.
  • Evenly distribute traffic to Pods, using Container-native load balancing (available in GKE)
  • Apply the required affinity: for example, separate batch workloads into different node pools by using labels and selectors or by using taints and tolerations.
  • Define Priority class for Pod only as needed. Priority indicates the importance of a Pod relative to other Pods.
  • Implement exponential backoff retries for handling transient issues with your dependencies [or use a service mesh feature]. This is fundamental for any fail-safe application design.
  • Make sure your application (and hence its Pods) can be restarted and is eviction-ready while receiving traffic. This also makes it easier to use preemptible VMs, an excellent lever for your FinOps.
  • Make sure your application starts as quickly as possible and shuts down according to Kubernetes expectations.

See my other blog on treating containers as first-class citizens.

  • Use pause Pods only if over-provisioning is required and you want to avoid your workload waiting on node scale-up latencies. Pause Pods require a fair amount of configuration, so treat them as a last-resort option.

Pause Pods are low-priority deployments that do nothing but reserve room in your cluster. Whenever a high-priority Pod is waiting to be scheduled, pause Pods get evicted. The evicted pause Pods are then rescheduled, and if there is no room in the cluster, the Cluster Autoscaler spins up new nodes to fit them. Thus high-priority workloads can avoid node-creation latencies.
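
A minimal pause-Pod sketch under those assumptions (the PriorityClass value, replica count and reserved resources are placeholders to be tuned per cluster):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                         # lower than any real workload
globalDefault: false
description: "Priority class for pause Pods that only reserve headroom"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: registry.k8s.io/pause
        resources:
          requests:
            cpu: "500m"            # headroom reserved per pause Pod
            memory: "512Mi"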

As a Cluster operator

  • Use resource quotas. Resource quotas manage the amount of resources used by objects in a namespace. You can set quotas in terms of compute (CPU and memory) and storage resources, or in terms of object counts; a sample quota is sketched after this list.
  • Configure the NodeLocal DNSCache in your cluster. NodeLocal DNSCache is an optional GKE add-on that improves DNS lookup latency and reduces the number of DNS queries to kube-dns by running a DNS cache on each cluster node.
  • Use Gatekeeper (or Anthos Policy Controller) to create policy constraints that enforce cluster compliance and accept or reject deployments.
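
A minimal ResourceQuota sketch for a hypothetical team namespace (names and amounts are placeholders):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"                     # object-count cap as well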

Further readings

  1. This writeup is inspired by the document Best practices for running cost-optimized Kubernetes applications on GKE.
  2. Efficiency in Node Out-of-Resource Management in Kubernetes
  3. Common problems in over-provisioned GKE: https://cloud.google.com/blog/products/containers-kubernetes/gke-best-practices-to-lessen-over-provisioning
  4. Monitoring GKE clusters for cost optimization using Cloud Monitoring


Pavan kumar Bijjala

Architect @Accenture | Cloud as your next Enterprise | App modernization | Product Engineering