GKE Multi-Cluster Services — one bad probe away from disaster

Dan Williams
Published in loveholidays tech
12 min read · Dec 12, 2023


The loveholidays platform continues to grow with millions of holidays sold in 2023, processing 755 billion hotel + flight combinations per day and serving ~7,000 requests per second. Our cloud infrastructure scales on demand and is continuously being tested to 10x throughput (see load testing in production), but we’re vulnerable to regional outages as we currently operate out of a single Google Cloud region. Our only major outage in the past 24 months occurred when Google’s London data centre overheated in July ‘22.

We have been working on increasing availability by introducing a second region to our infrastructure using the Kubernetes multi-cluster features available in GCP. However, we recently discovered a breaking issue with Google’s Multi-Cluster Services that required involvement from Google’s support and product teams to diagnose. This issue is (at the time of writing, at least!) undocumented and undiscoverable, so we wanted to share our findings with the community.

If you are already familiar with the inner workings of GKE’s MCS offering and just want to skip straight to the issue we discovered, scroll down to “One bad probe away from disaster”. Otherwise, in the following section we will cover the basics of MCS along with a real-world example.

Understanding GKE MCS

GCP has a number of features available to support running workloads across multiple clusters (see the setup sketch after this list):

  • Multiple clusters across GCP regions can be configured as part of an Anthos Fleet.
  • Clusters in an Anthos Fleet can use Multi-Cluster Services (MCS) to “share” Kubernetes services/workloads across clusters.
  • GKE’s Gateway API can be used to do geographical load balancing / distribution between clusters/regions.
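
As a rough sketch, enabling these features boils down to registering each cluster to a Fleet and switching on MCS for that Fleet. The cluster names, locations and flags below are illustrative, so check the current gcloud documentation before copying:

> gcloud container fleet memberships register london-cluster \
    --gke-cluster=europe-west2/london-cluster \
    --enable-workload-identity

> gcloud container fleet memberships register netherlands-cluster \
    --gke-cluster=europe-west4/netherlands-cluster \
    --enable-workload-identity

> gcloud container fleet multi-cluster-services enable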

There is already an excellent series of posts from Kishore Jagannath that goes into detail on how GKE MCS works. I would recommend reading his content before continuing:

Use case: High Availability using MCS

Identical workloads in multiple clusters

A quick sketch here shows that we can join N Kubernetes clusters in an Anthos Fleet. We deploy an identical workload to each of our clusters (service, deployment/pods) along with a ServiceExport object in each cluster. Creating a ServiceExport will automatically create a ServiceImport object in all Fleet clusters, containing the Endpoints (Pod IPs) of the matching Pods across all Fleet clusters. Workloads inside the cluster can target this ServiceImport to send traffic to any pod that matches across all clusters, providing HA across regions.

The ServiceImport object can also be targeted by a Global Load Balancer (using the Gateway API) to automatically distribute traffic to the most appropriate cluster containing that workload. Combining the ServiceImport with a global Gateway API load balancer gives us automatic geographical load balancing across clusters/regions, along with a number of other benefits and capabilities.
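
To illustrate how that fits together with the Gateway API, an HTTPRoute attached to a multi-cluster Gateway can reference the ServiceImport directly as a backend. The sketch below is a minimal example rather than our production configuration; the Gateway name and port are assumptions:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: hotel-search
  namespace: sas
spec:
  parentRefs:
  - name: external-http              # hypothetical multi-cluster Gateway
  rules:
  - backendRefs:
    - group: net.gke.io              # ServiceImports live in the net.gke.io API group
      kind: ServiceImport
      name: hotel-search
      port: 8080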

Use case: Exporting a stateful application to other clusters/regions

Accessing stateful services cross-cluster

We can also use MCS to share workloads between Fleet clusters, without requiring the workload to be deployed in all clusters.

In the above diagram, we are running RabbitMQ in the London cluster only.

We export the RabbitMQ service to the other Fleet clusters by creating a ServiceExport in just the London cluster. This automatically creates a ServiceImport in all Fleet clusters, along with a new DNS record in the clusterset.local domain (instead of cluster.local) resolvable in all Fleet clusters: <service>.<namespace>.svc.clusterset.local

Any traffic destined for this clusterset.local address in any Fleet cluster will be sent to the RabbitMQ Pods in the London cluster, so we can consume messages from any cluster just by changing the DNS host in the config of any application that talks to RabbitMQ.
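
In practice that change is as small as it sounds: a consumer only needs its broker hostname switched to the clusterset record. A hypothetical snippet (the namespace, service name and RABBITMQ_HOST variable are assumptions, not our actual config):

env:
  - name: RABBITMQ_HOST
    # was: rabbitmq.messaging.svc.cluster.local (resolvable only where RabbitMQ runs)
    value: rabbitmq.messaging.svc.clusterset.local   # resolvable in every Fleet cluster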

The stateful nature of RabbitMQ, and the complexity of federating RabbitMQ across clusters, means we’d like to continue running it in London only and consume it from our other clusters. This lets us deal with that complexity at a future date without blocking the build-out of our second region, giving us the flexibility to deliver incrementally.

See consuming cross-cluster services for more information, and we also cover this in more detail in the following GKE MCS Example section.

GKE MCS Example

Let’s run through a real example using a service exported from our (staging!) London Cluster to be used by workloads in our Netherlands Cluster. The following YAML manifests will showcase all the aspects of the following diagram:

The hotel-search service exists only in the London cluster (note the cluster context we specify in the following commands):

> kubectl get svc hotel-search -n sas -o yaml --context london-cluster

apiVersion: v1
kind: Service
metadata:
  name: hotel-search
  namespace: sas
spec:
  clusterIP: 10.224.0.148
  clusterIPs:
  - 10.224.0.148
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http-iap
    port: 8080
    protocol: TCP
    targetPort: 8080
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: hotel-search
    app.kubernetes.io/instance: hotel-search
  type: ClusterIP

This service targets a deployment which has two replicas, as we can see in the endpoints registered on the service:

> kubectl describe svc hotel-search -n sas --context london-cluster

Name: hotel-search
Namespace: sas
Selector: app.kubernetes.io/instance=hotel-search,app=hotel-search
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.224.0.148
IPs: 10.224.0.148
Port: http-iap 8080/TCP
TargetPort: 8080/TCP
Endpoints: 10.225.27.10:8080,10.225.27.9:8080
Port: http 80/TCP
TargetPort: 8080/TCP
Endpoints: 10.225.27.10:8080,10.225.27.9:8080
Session Affinity: None
Events: <none>

The two Pods of the hotel-search deployment have the IP addresses 10.225.27.10 and 10.225.27.9.

We want this service to be accessible from workloads in the Netherlands, so we create a ServiceExport in the London cluster:

> kubectl apply -f service-export.yaml --context london-cluster

# service-export.yaml
kind: ServiceExport
apiVersion: net.gke.io/v1
metadata:
  namespace: sas
  name: hotel-search

Creating this ServiceExport in the London cluster automatically creates a ServiceImport in the Netherlands cluster (and any other registered Fleet clusters) roughly five minutes later:

> kubectl get serviceimport hotel-search -n sas -o yaml --context netherlands-cluster

apiVersion: net.gke.io/v1
kind: ServiceImport
metadata:
  annotations:
    net.gke.io/derived-service: gke-mcs-v21p1i8kc5
  labels:
    app.kubernetes.io/managed-by: gke-mcs-controller.gke.io
    net.gke.io/backend-service-name: gkemcs-sas-hotel-search
  name: hotel-search
  namespace: sas
spec:
  ips:
  - 10.229.39.167
  ports:
  - name: http-iap
    port: 8080
    protocol: TCP
  - name: http
    port: 80
    protocol: TCP
  sessionAffinity: None
  type: ClusterSetIP
status:
  clusters:
  - cluster: projects/<gcp-projectID>/locations/global/memberships/london-cluster

The ServiceImport object tells us:

  • The exported service originates from the london-cluster (listed under status)
  • The exported service has two ports: 80, 8080
  • It has automatically created a new ClusterIP service in the Netherlands cluster: gke-mcs-v21p1i8kc5
  • The IP address of the new ClusterIP service in the Netherlands cluster is: 10.229.39.167

If we look into the new ClusterIP service that has been automatically created, we can see the Endpoints from the London cluster:

> kubectl describe svc gke-mcs-v21p1i8kc5 -n sas --context netherlands-cluster

Name: gke-mcs-v21p1i8kc5
Namespace: sas
Labels: app.kubernetes.io/managed-by=gke-mcs-controller.gke.io
Annotations: net.gke.io/service-import: hotel-search
networking.istio.io/exportTo: -
Selector: <none>
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.229.39.167
IPs: 10.229.39.167
Port: http-iap 8080/TCP
TargetPort: 8080/TCP
Endpoints: 10.225.27.10:8080,10.225.27.9:8080
Port: http 80/TCP
TargetPort: 8080/TCP
Endpoints: 10.225.27.10:8080,10.225.27.9:8080
Session Affinity: None
Events: <none>

This service has two endpoints associated with it, with IP addresses matching the Pods inside the London cluster. We can also see that an Endpoints object has been created inside the Netherlands cluster, containing the London Pod IPs:

> kubectl get endpoint gke-mcs-v21p1i8kc5 -n sas -o yaml --context netherlands-cluster

apiVersion: v1
kind: Endpoints
metadata:
  creationTimestamp: "2023-11-13T10:05:51Z"
  labels:
    app.kubernetes.io/managed-by: gke-mcs-importer
  name: gke-mcs-v21p1i8kc5
  namespace: sas
subsets:
- addresses:
  - ip: 10.225.27.10
  - ip: 10.225.27.9
  ports:
  - name: http
    port: 8080
    protocol: TCP
- addresses:
  - ip: 10.225.27.10
  - ip: 10.225.27.9
  ports:
  - name: http-iap
    port: 8080
    protocol: TCP

This is how it all looks if everything is working well. Pods in the Netherlands cluster can send traffic to the hotel-search pods in London either by IP address, or using the new hotel-search clusterset DNS record that is created within the Netherlands cluster.
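
A quick way to sanity-check this from the Netherlands cluster is to curl the clusterset record from a throwaway Pod. This is an illustrative command; the curl image and request path are assumptions about the service:

> kubectl run mcs-test --rm -it --restart=Never --image=curlimages/curl \
    -n sas --context netherlands-cluster -- \
    curl -s -o /dev/null -w "%{http_code}\n" http://hotel-search.sas.svc.clusterset.local/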

A little DNS magic

Using the clusterset DNS record in the Netherlands cluster would require changing our application configurations to point to the new clusterset domain instead of the cluster domain. Instead of making this change at the application level, we create ExternalName Services that alias the cluster.local name to the clusterset.local record for imported services:

> kubectl get service hotel-search -n sas -o yaml --context netherlands-cluster

apiVersion: v1
kind: Service
metadata:
  name: hotel-search
  namespace: sas
spec:
  externalName: hotel-search.sas.svc.clusterset.local
  sessionAffinity: None
  type: ExternalName

This means applications can be deployed into any cluster without special configuration telling them when to use cluster.local and when to use clusterset.local. We do need to deploy these ExternalName Services in the other Fleet clusters, but this is a one-off activity per exported service, compared to updating every deployment configuration that uses a service.

For more information on setting up MCS in your own environment, see here.

Incremental Delivery

With the above strategies in mind, our initial goal for production is simple: create a cluster in another region, deploy all of our stateless applications to both clusters, and let the stateless applications talk to the stateful applications in the original cluster / region via Multi-Cluster Services. This means we don’t have to worry initially about how to migrate, replicate or otherwise deal with anything stateful (things like RabbitMQ, Redis Cluster etc), and we can fully utilise our existing Prometheus / Mimir / Grafana / Tempo / Loki monitoring stack by simply pushing observability data from the new cluster(s) to the old cluster.

Once this has been proven to work reliably, we can then begin to think about how we deal with stateful applications in a multi-region fashion.

We trialled this approach in staging for a while, and we have some of our lowest priority production workloads now running across two clusters (London and Netherlands) using a combination of GKE Gateway API, a global Multi-Cluster load balancer, and the same workloads running in both clusters exported to each other as described in the first diagram of this post.

One bad probe away from disaster

We noticed strange things starting to happen with workloads we were trying to export. We would create a ServiceExport; the ServiceImport would be created in the other clusters along with a new derived service, and everything would appear healthy, but the newly created services would have no Endpoints, meaning no Pods to route traffic to:

> kubectl describe svc gke-mcs-<id> -n <namespace> --context new-region-cluster

Name: gke-mcs-<id>
Namespace: <namespace>
Labels: app.kubernetes.io/managed-by=gke-mcs-controller.gke.io
Annotations: net.gke.io/service-import: <service-name>
networking.istio.io/exportTo: -
Selector: <none>
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.229.39.167
IPs: 10.229.39.167
Port: http 80/TCP
TargetPort: 8080/TCP
Endpoints: None
Session Affinity: None
Events: <none>

We exhausted every avenue we could think of to try to diagnose why this was happening. There is a list of known limitations and issues with MCS, but none of them applied to us. We considered whether there was an undocumented limitation around things like:

  • Services that had BackendConfigs
  • Services that had FrontendConfigs
  • Services that were already part of an Ingress
  • Services that were already part of a Network Endpoint Group

But our testing yielded nothing conclusive. In short, we couldn’t reliably reproduce the error: the majority of our exported services continued to work, but ones that had worked previously were now silently failing.

The only indication we could find that something wasn’t working correctly was inside Traffic Director; the ServiceExports / Imports that were being created with no Endpoints did not have a valid routing map associated with them:

For reference, here is what a healthy exported service looks like in Traffic Director:

We enabled audit logs on the Traffic Director API but had no logs indicating that anything was wrong, and there were no error logs in the gke-mcs-importer workload in the clusters.

With no other options for troubleshooting, we reached out to Google support. After going through first-line support, a technical consultation via Meet, and finally a deep dive by the product team, here is the response we received:

Hello,

Thank you for your patience.

The product engineering team looked into this issue in depth using internal tooling and was able to isolate the root cause of the issue. The MCS pipeline is broken due to misconfigured workloads in your fleet. Some of the workloads in your fleet have their readiness probes configured with (periodSeconds < timeoutSeconds) meaning that a new readiness probe can be initiated before the previous one is completed. An example deployment with this issue is web/hoarder, and can be found at [0].

Furthermore, while Kubernetes allows such settings, the backend system we configure on behalf of the user is stricter and does not accept these values. If this type of issue appears with any service in the fleet, MCS will not configure anything. You are required to fix the issues with all the services to unblock the configuration pipeline.

I hope that the provided information will prove useful and if you have any other inquiries regarding this matter, please do not hesitate to contact me as I will be more than happy to assist you.

Best regards,

To reiterate, when a readinessProbe has a timeoutSeconds that is greater than (or most importantly, equal to) periodSeconds, Traffic Director will stop processing ServiceExports and fail completely silently.

A timeoutSeconds equal to or greater than periodSeconds is a valid configuration inside Kubernetes, but Traffic Director considers it invalid because it allows a probe to start before the previous one has finished. This condition applies across all workloads in your Fleet: the services we first noticed with no Endpoints did not themselves have invalid probes, but were being blocked from being exported by otherwise unrelated workloads.

It was very common for our engineers to use a configuration like the following where the timeoutSeconds and periodSeconds are equal:

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /ready
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 30
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 5
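
The fix is to keep timeoutSeconds strictly below periodSeconds so that a probe can never start before the previous one has finished. A compliant version of the same probe might look like this (the exact values are illustrative):

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /ready
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 30
  periodSeconds: 10   # strictly greater than timeoutSeconds
  successThreshold: 1
  timeoutSeconds: 5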

If we had pushed further into our production multi-cluster journey, we could have easily found our clusters no longer updating Endpoints from exported services, leading to failed requests and a big loss in revenue. Thankfully, we approached this cautiously and gave ourselves enough time to be stung by any edge cases.

We use Conftest as part of our GitOps repository, so we were able to quickly add another validation to our pipeline to catch any readinessProbes that aren’t compliant with Traffic Director:

package main

name = input.metadata.name
kind = input.kind
namespace = input.metadata.namespace

deny[msg] {
  readinessProbe := input.spec.template.spec.containers[_].readinessProbe

  not readinessProbe.periodSeconds > readinessProbe.timeoutSeconds

  msg = sprintf("readinessProbe in %s/%s/%s is invalid. periodSeconds must be greater than timeoutSeconds to not have conflicting probes. [Learn more.](https://github.com/<internal-docs>#readinessProbes)", [namespace, kind, name])
}
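
Running the policy locally against a rendered manifest looks something like this (the file and policy directory paths are assumptions about our repository layout):

> conftest test rendered/hotel-search-deployment.yaml --policy policy/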

This highlighted approximately 50 instances in our production cluster which would be invalid for Traffic Director and would be breaking our ServiceExports.

Want to (crudely) check your own clusters for this “misconfiguration”?

kubectl get deployment -A -o json | jq '.items[] | .metadata.namespace as $NS | .metadata.name as $NAME | .spec.template.spec.containers[] | select(.readinessProbe != null and (.readinessProbe.periodSeconds <= .readinessProbe.timeoutSeconds)) | $NS + "/" + $NAME'

We have asked Google to surface these errors back to users, or at the very least to add a reference in the MCS known limitations documentation. At the moment, this failure mode is completely invisible as far as we can tell. If / when it is surfaced back to the user, we will update this blog with where you can find the new information. For now, we are waiting on support to create an issue in the public tracker that we can subscribe to.

Please reach out if you are already using MCS, Fleets and Gateway API in your production setup, as we’d love to compare notes / battle scars.

One last thing, we are hiring! Take a look at our open positions here.
