Testing Business continuity of a sample application using GKE and GCP Cloud SQL

Harinderjit Singh
Published in ITNEXT · 15 min read · Jan 27, 2023


The Objective

This post is the second in a series on testing the business continuity of an application with a database backend. We will deploy the sample application on Kubernetes and use SQL Server as the database.

Our first test was based on Azure. This second test is based on GCP, so we will use GKE, GCP Cloud SQL, and a few other GCP services to build an architecture that can provide resiliency against a region failure.

Cloud SQL Business Continuity

Business continuity refers to the mechanisms, policies, and procedures that enable your business to continue operating in the face of disruption, particularly to its computing infrastructure.

Even when configured for HA, Cloud SQL is a regional service, so if the region goes down the instance is unavailable. As of now (Jan 2023), the best way to achieve business continuity for GCP Cloud SQL instances is to use a cross-region read replica (in another region). A cross-region read replica for Cloud SQL can be created only if the primary instance runs the SQL Server Enterprise edition.

To continue processing, you must make the database available in a secondary region as soon as possible. The DR plan requires you to configure a cross-region read replica in Cloud SQL. A failover based on export/import is also possible, but that approach takes longer, especially for large databases.
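For illustration, creating a cross-region read replica with gcloud looks roughly like the following. The instance names here are placeholders, not values from this exercise (Terraform provisions the actual replica later in this post), and extra flags such as --tier may be needed depending on your setup.

# Placeholder instance names; additional flags may be required
gcloud sql instances create sampleapp-dr-replica \
  --master-instance-name=sampleapp-primary \
  --region=us-east4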

For the most business-critical application databases, enabling a read replica in a secondary region is a must. When a failover occurs, you can promote the replica to ensure business continuity (provided the application is also available in the other region). This decision depends on how much RTO is acceptable to the business. Once the replica is promoted, it is detached from the original primary.

If the primary instance becomes available again after the replica has been promoted and some clients are still connected to the old primary, those clients might read and write data (even accidentally) on the original primary instance. In this case, a split-brain situation can develop, where some clients access stale data in the old primary database while others access fresh data in the new primary database, leading to problems in your business application. To avoid a split-brain situation, you must ensure that clients can no longer access the original primary instance after the original primary region becomes available.

The case I am particularly interested in is the primary region becoming unavailable; we are not testing high availability, which is mostly achieved through zonal redundancy. Because we can't actually trigger a region failure, we will perform a manual failover (by promoting the replica), so in this case the RTO depends on the steps involved in that process.

What do you gain?

While we prepare everything needed for the test, we will learn some basic Go (from the application code), Terraform and Terragrunt (for infrastructure provisioning), and Helm (to deploy the application on Kubernetes). On the GCP side, besides GKE and Cloud SQL, we will get familiar with VPC, Cloud DNS, and the GCP load balancer.

We will also be able to compare the Azure and GCP cloud platforms with respect to the business continuity of an application deployed on Kubernetes with a SQL Server database backend.

The Architecture

  • Two GCP Regions are used: US Central1 (primary) and US East4 (DR)
  • One GCP VPC with one subnet in each region
  • One Private DNS Zone
  • The application will be deployed on GKE in both regions.
  • GCP Cloud SQL Instance is created in the primary region (US Central1).
  • Read Replica is created in the secondary region (US East4)
  • SQL database is created in the primary Region.
  • The application on GKE (in both regions) always connects to the Cloud SQL read-write endpoint through a DNS record that always points to the primary instance (a sketch of this record follows this list).
  • HTTP Load Balancer is used to send all HTTP traffic to the NodePort services associated with the application on both GKE clusters. It's an active-active configuration as far as the application is concerned.
  • We didn't use the Cloud SQL Auth Proxy to connect to the database from the application; the failover steps would differ in that case.
  • This setup doesn't consider application integrations and the complexities around them, nor does it consider monitoring failover.
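As a concrete (and purely illustrative) example of that private DNS record: the zone name, record name, and IP address below are placeholders, and in this exercise Terraform creates the real record. A short TTL keeps the failover switch quick.

# Placeholder zone, record name, and IP; Terraform creates the actual record here
gcloud dns record-sets create sqlserver.db.internal. \
  --zone=sampleapp-private-zone \
  --type=A \
  --ttl=300 \
  --rrdatas=10.0.0.5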

The complete database DR process consists of the following steps:

  1. The primary region (US Central1), which is running the primary database, becomes unavailable.
  2. The operations team recognizes and formally acknowledges the disaster and decides whether a failover is required.
  3. If a failover is required, the cross-region read replica in the secondary region (US East4) is promoted to the new primary instance.
  4. The DNS record is updated to point to the new primary (the old read replica).
  5. Client connections from GKE (US East4) can access and process on the new primary instance (US East4) without any reconfiguration.
  6. Traffic is only routed to GKE (US East4)

What happens if the Old primary region (US Central1) comes back online?

  1. GKE and Database Instances in the Original Primary region (US Central1) come back online.
  2. User Traffic is routed to GKE in the Old primary region (US Central1)
  3. GKE can access the new primary database Instance (US East4) as the DNS record points to the new primary
  4. The Cloud SQL instance in US Central1 still acts as a primary, but no client connections are forwarded to it. Had any client connections been forwarded to this instance, it would have caused a split-brain condition. We avoided this by using a private DNS record instead of an IP address in the clients' connection strings.
  5. We must delete the Cloud SQL instance in US Central1.
  6. Create a read replica of the new primary Cloud SQL instance (US East4) in the secondary region (US Central1). A command sketch for steps 5 and 6 follows this list.
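The instance names below are placeholders; confirm that no clients can still reach the old primary before deleting it.

# Replace the placeholders with your instance names
gcloud sql instances delete <old-primary-instance-us-central1>   # step 5
gcloud sql instances create <new-replica-us-central1> \
  --master-instance-name=<new-primary-instance-us-east4> \
  --region=us-central1                                           # step 6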

If you don't have time or are not interested in testing, skip ahead to "The key takeaways".

Provision the Infrastructure

Using Terraform (with Terragrunt), the following will be provisioned:

  1. Create a VPC with two subnets (one in US Central1 and one in US East4)
  2. Create two service accounts (one for GCR and one for GKE)
  3. Create a Private Cloud DNS zone
  4. Create a GCP Cloud SQL (SQL Server) instance (in US Central1)
  5. Create a database in Cloud SQL (in US Central1)
  6. Create a GCP Cloud SQL read replica (in US East4)
  7. Create GCR
  8. Create GKE clusters (one in US Central1 and one in US East4)
  9. Create an HTTP Load Balancer whose backend service has the GKE instance groups as backends

Let’s get Hands-on

The code for this exercise is hosted in a GitHub repository.

Prerequisites:

cd
git clone https://github.com/harinderjits-git/sampleapppub.git

The YAML file below holds the configuration of the infrastructure:

~/sampleapppub/terraform/gcp/terragrunt/orchestration/config_env_sampleapp.yaml

Replace the GCP billing account, project folder, project name and ID, and DB password in this YAML file with your own values.

db_password: &common_password stays-overhung-reconcile #replace this value
billing_account: 00E-077BE7 #replace this value
project:
  parent: folders/5489839 #replace this value
  name: mysampleappproj1-ffgd #replace this value
  id: mysampleappproj1-ffgd12345 #replace this value

Initialize the Terraform remote state. This creates a storage bucket for the remote state.

gcloud auth login #follow the prompts
cd ~/sampleapppub/terraform/gcp/rundir_init
terraform init
terraform apply -auto-approve

Create all GCP resources using Terragrunt and Terraform:

cd ~/sampleapppub/terraform/gcp/terragrunt
. ./set-env.sh
make apply-all-ha

This will deploy all GCP resources required for this test.

Application and Database Deployment

Load Data into the database

  • Download and install SSMS or Azure Data Studio
  • Connect to the master database using the "sqlserver" login
  • When the database is created using Terraform, the console, or Config Connector in the Cloud SQL SQL Server instance, its owner is always the sqlserver login. Because of this, the applogin login has no permission to act on the newly created database.
  • When logins are created using the console or Config Connector, they cannot be mapped to a database user; that mapping has to be done separately using the sqlserver login (for example in SSMS).
  • So we have to map the appuser database user to the applogin login and assign it the db_owner role:
USE [sampleappdb]
GO
CREATE USER [appuser] FOR LOGIN [applogin] with default_schema=[dbo]
GO
USE [sampleappdb]
GO
ALTER Role [db_owner] ADD member [appuser]
GO
  • Connect to the sampleappdb database using “applogin”
  • In Object Explorer, select "sampleappdb", then click File -> Open -> File and select the file ~/sampleapppub/src/sql/hr_cre_mssql.sql (or use the command-line alternative shown after this list).
  • Execute the statements in the file; this creates a table and inserts some dummy data into it.
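If you prefer the command line over SSMS, something like the following should also work, assuming sqlcmd is installed and the instance's private IP is reachable from where you run it (for example, a VM on the same VPC):

# Assumes sqlcmd is installed and the Cloud SQL private IP is reachable
sqlcmd -S <cloud_sql_private_ip> -U applogin -P "$PASSWORD" \
  -d sampleappdb -i ~/sampleapppub/src/sql/hr_cre_mssql.sql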

Prepare the application image

  • Build a docker image and push it to GCR
# update <project_id>
cd ~/sampleapppub/src/go
docker build -t "sampleaapp01" .
cat /tmp/private_key.pem | docker login -u _json_key --password-stdin https://us.gcr.io
docker tag sampleaapp01:latest us.gcr.io/<project_id>/httpapp:latest
docker push us.gcr.io/<project_id>/httpapp
  • The docker image is pushed to GCR.
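To double-check that the image landed in GCR, you can list its tags (same <project_id> placeholder as above):

# update <project_id>
gcloud container images list-tags us.gcr.io/<project_id>/httpapp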

Deploy the app using Helm

  • Authenticate to the primary GKE cluster. Navigate to the cluster in the primary region in the console, click Connect, and follow the instructions, or run:
# update <project_id>
gcloud container clusters get-credentials sampleappprodprimarygkeue --region us-central1 --project <project_id>
  • Execute the following to deploy the application to the primary GKE cluster:
cd ~/sampleapppub/helm
read -s -p " database appuser login password:" PASSWORD
echo $PASSWORD
helm install httpappgcp httpappgcp --values httpappgcp/values.yaml --set dbsecretpassword=$PASSWORD
kubectl get pods,svc,secret -n httpapp
kubectl logs --selector=app.kubernetes.io/instance=httpappgcp -n httpapp
  • Repeat for DR GKE
# update <project_id>
gcloud container clusters get-credentials sampleappproddrgkeue2 --region us-east4 --project <project_id>
helm install httpappgcp httpappgcp --values httpappgcp/values.yaml --set dbsecretpassword=$PASSWORD
kubectl get pods,svc,secret -n httpapp
kubectl logs --selector=app.kubernetes.io/instance=httpappgcp -n httpapp

The Testing

GKE in the primary region goes down

  • We can't actually turn off GKE in US Central1, so we replicate this scenario by scaling the node pool down to 0 nodes.
gcloud container clusters resize sampleappprodprimarygkeue --node-pool private-np-1  --num-nodes 0 --region us-central1
#follow prompts
  • Once the GKE node pool starts shutting down, the load balancer health check marks the primary GKE backend as "Unhealthy".
  • The load balancer then redirects all ingress traffic to the instance group and NodePort of the httpappgcp service in the secondary region (US East4).
  • There is a brief loss of availability for already-connected sessions and for a small number of incoming requests.
  • We ran status_checker to generate some traffic and used the query below in Logs Explorer (streaming logs) to observe it.
httpRequest.status>=200 
resource.type=http_load_balancer
  • We observed that the application was unavailable for at most 18 seconds.
  • We performed an "add employee" transaction to verify the application.
  • Once the primary GKE node pool is brought back up, the load balancer detects that it is healthy again and starts routing traffic to both the primary and secondary GKE clusters (the resize command is sketched below).
  • When traffic is redistributed across the primary (once it is available again) and secondary GKE clusters, there is no downtime at all.
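The scale-up is simply the reverse of the earlier scale-down; the node count below is an assumption, so use whatever your node pool originally had.

# Restore the node pool to its original size (adjust --num-nodes)
gcloud container clusters resize sampleappprodprimarygkeue --node-pool private-np-1 \
  --num-nodes 1 --region us-central1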

The primary region goes down

  • You run status_checker.go to generate some traffic.
cd ~/sampleapppub/src/status_checker

go run status_checker.go
  • For GKE: we reduce the number of nodes in the node pool to 0, as in the previous test.
  • For Cloud SQL: we can't simulate a Cloud SQL instance crash, nor can we stop the primary instance because it has a read replica, so we manually promote the read replica. The RTO here is therefore the time taken by our process: promoting the replica, updating DNS, and mapping database users to any newer logins.
  • Before promoting the replica, update the DNS record to the private IP of the read replica instance (US East4). If you update the DNS record only after promoting the replica, you risk data loss: the application would keep writing to the original primary between the promotion and the DNS update, and those writes would never reach the new primary. So update DNS first, then promote the read replica.
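A minimal sketch of that DNS switch, reusing the hypothetical zone and record names from the architecture section; the replica's private IP below is a placeholder.

# Placeholder zone, record name, and IP; point the record at the read replica's private IP first
gcloud dns record-sets update sqlserver.db.internal. \
  --zone=sampleapp-private-zone \
  --type=A \
  --ttl=300 \
  --rrdatas=10.1.0.5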
  • Promote the replica: once promoted, the old read replica (the new primary in US East4) has read-write capability.
gcloud sql instances promote-replica  sampleappprodprimarydbserver-1-primary-replica1
  • The application should now be able to connect to its database on the new primary Cloud SQL instance (US East4). Since the application uses the DNS record for the database in its connection string, no changes are required.
  • If the applogin login was created after the read replica, you will notice the application is still not up, with the error below in the container logs:
2023/01/07 03:57:03 mssql: login error: Login failed for user 'applogin'.
  • This is because the applogin login does not have privileges to use the appuser user in the sampleappdb database. Users and their roles are managed at the database level, while logins are maintained at the master database level; master databases are specific to each SQL Server instance and are not replicated.
  • Logins created on the primary instance before the replica was created are synced to the read replica.
  • So every time we fail over a database, its users need to be mapped to any newly created logins. This can be done with a SQL statement like the one below (repeat for every new login):
ALTER USER appuser WITH LOGIN = applogin;
  • Results: we observed that the application was unavailable for only a few seconds, which is just the time it takes to promote the replica.

What if you used SQL Auth Proxy for connecting to DB?

Note: the code shared in the GitHub repository covers the case where we used the Cloud SQL DNS record (private IP), not the SQL Auth Proxy.

In some environments, SQL Auth Proxy is preferred over other methods of Database connectivity for security reasons.

The Cloud SQL Auth proxy does not provide a new connectivity path; it relies on existing IP connectivity. To connect to a Cloud SQL instance using private IP, the Cloud SQL Auth proxy must be on a resource with access to the same VPC network as the instance.

Because the proxy addresses the instance by its connection name rather than by an IP or DNS name, we cannot rely on DNS record switching during a failover (read replica promotion); instead we have to update the INSTANCE_CONNECTION_NAME in the Auth Proxy sidecar's YAML. This means more manual intervention during a failover to the DR region if both GKE clusters need to connect to the primary Cloud SQL instance. Alternatively, each GKE cluster can connect to the Cloud SQL instance in its own region.
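To illustrate, with v2 of the Cloud SQL Auth Proxy the instance connection name is a command-line argument, so a failover means changing that argument (in GKE, the sidecar container's args). The instance names below are assumptions based on the naming used in this post.

# Before failover: the proxy targets the primary in us-central1 (placeholder names)
cloud-sql-proxy <project_id>:us-central1:<primary-instance> --private-ip --port 1433
# After failover: the sidecar args must point at the promoted replica in us-east4
cloud-sql-proxy <project_id>:us-east4:<promoted-replica-instance> --private-ip --port 1433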

Before Failover

  • GKE in the primary region connects to Database Instance in Primary Region
  • GKE in the Secondary or DR region connects to Database Instance in the Secondary or DR Region
  • The DNS Record for the application points to the Load balancer associated with the application service in Primary GKE.
Figure: both regions are available

During Failover

  • The primary region goes down.
  • That means the application's DNS record, which points to the load balancer associated with the application service in the primary GKE cluster, no longer resolves to an available application.
  • Hence users cannot access the application.
Figure: the primary region is unavailable

After Failover

  • Replica Cloud SQL Instance in the DR region is promoted after the decision.
  • Replica Cloud SQL Instance becomes the Primary.
  • All client connection strings must be updated.
  • GKE in DR already points to the new primary (old replica)
  • DNS Record must be updated to point to the load balancer associated with the application service in DR GKE.
  • Users can access the application again.
  • The old primary Cloud SQL instance should be deleted whenever it comes back up, to avoid a split-brain condition.
Figure: the primary region is unavailable

To keep this cost-effective, you can run the secondary GKE cluster small and scale it up (add nodes to the node pools or add new node pools) only after failing over to the DR region. This increases the number of steps in a DR failover, but it saves money, so it all comes down to the RTO and RPO of the application in question.
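Scaling out the DR node pool after a failover might look like this; the node pool name and target size are assumptions, not values from the repository.

# Hypothetical node pool name and target size
gcloud container clusters resize sampleappproddrgkeue2 --node-pool private-np-1 \
  --num-nodes 3 --region us-east4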

Falling back to an original primary region

This section explains the steps involved in getting back to the state we were in before the failover (replica promotion): the primary database running in the primary region (US Central1) with HA enabled and a replica instance in the DR region (US East4). It does not cover how client connections are handled for the primary instance at any given time, which was discussed earlier. Fallback is usually required as part of regulatory compliance.

Starting from the original state of infrastructure before Failover.

Before the Failover

  • The primary and its standby (not visible in the console) are in the UC1 (us-central1) region
  • The read replica is in the UE4 (us-east4) region

Primary Region Failure

  • The primary region (UC1) experiences a failure and is unavailable
  • The replica in the DR region (UE4) is promoted and becomes the new primary instance
  • (Optional for less critical workloads) Enable regional HA for the new primary Cloud SQL instance, which creates a standby instance (not visible in the console). This requires an instance restart, so it is best done right after the replica promotion.

We covered everything up to this step in the failover testing exercise above. The following are the additional steps needed to fall back to the original primary region.

Create New Replica

  • Suppose UC1 becomes available again
  • Delete the old primary instance in UC1
  • Create a read replica of the new primary instance (UE4) in the old primary region (UC1)

Promote Replica

  • Promote the read replica in UC1 (the original primary region) to primary
  • (Optional for less critical workloads) Enable regional HA for the new primary Cloud SQL instance, which creates a standby instance (not visible in the console). This requires an instance restart, so it is best done right after the replica promotion.
  • Delete the old primary in UE4

Create New Replica again

  • Create a read replica of the new primary instance (in UC1, the original primary region) in UE4, the original DR region.

We are back to the same state we were in before the failover. It is a tedious procedure and requires a lot of manual effort. You could also use another region as a transient region while UC1 is down.
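For reference, the fallback sequence above maps roughly onto the gcloud calls below. Every instance name is a placeholder, and replication lag and client traffic should be verified before each destructive step.

# Placeholder instance names; verify each step before running the next
gcloud sql instances delete <old-primary-us-central1>
gcloud sql instances create <fallback-replica-us-central1> \
  --master-instance-name=<current-primary-us-east4> --region=us-central1
gcloud sql instances promote-replica <fallback-replica-us-central1>
gcloud sql instances patch <fallback-replica-us-central1> --availability-type=REGIONAL   # optional HA; restarts the instance
gcloud sql instances delete <current-primary-us-east4>
gcloud sql instances create <new-dr-replica-us-east4> \
  --master-instance-name=<fallback-replica-us-central1> --region=us-east4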

The key takeaways

  • A cross-region read replica can be created only if the primary Cloud SQL instance runs the SQL Server Enterprise edition.
  • The GCP Cloud SQL cross-region read replica might support an RPO “close to zero” which is great for Business critical applications.
  • This architecture can be used for planned maintenance of the Primary Region’s GKE.
  • This architecture cannot be used for planned maintenance of the primary region's database instances, because once we promote the replica we lose the original primary, and the fallback process is tedious.
  • Promotion of the read replica is a manual step that can be automated with a Cloud Function. Azure's failover group provides a fully managed auto-failover capability, which in my opinion is a superior solution that scales very well. Imagine an organization failing over 50 production databases during a region failure: doing that manually is prone to mistakes, and the RTO for all databases might not be achievable.
  • We didn't test the "primary database instance goes down" scenario on its own, because the database failback/fallback (having the primary instance back in the primary region with a read replica in the secondary/DR region once the primary region recovers after a failover) is a long process; it is not as simple as with Azure failover groups.
  • A tedious failback/fallback process makes it harder to automate the cross-regional failover with failback/fallback requirements. In Azure, this is smoother and we have an end-to-end automated fully managed failover procedure.
  • The time taken to update DNS affects the RTO and can be automated with Cloud Functions (using the Go client libraries for GCP). We didn't have to do this in the previous post because fully managed Azure failover groups take care of this part.
  • Mapping logins and users also affects the RTO and can be automated to reduce it.
  • We could not find a GCP load balancer that can route traffic based on endpoint priority the way Azure Traffic Manager does.
  • Recovery of the application depends not only on the availability of the database after failover but also on the application's retry logic.
  • Firewall rules are auto-generated for GKE services and that saves you from the hassle of creating rules for each service.
  • Even though we set the root password in Terraform when creating the database server and database, that doesn't mean the SQL Server login password is set; it has to be set just like for any other login you want to create.

I highly recommend you go through my last post “Testing Business Continuity of a sample application using AKS and Azure SQL Server” where you can repeat this exercise on Azure. I hope it helps anyone looking to compare both cloud platforms for Kubernetes plus SQL Server needs.

A reliable DR setup is one where you can still access and use the backups taken in the old primary region. Is that the case with GCP Cloud SQL? Let's check that out in my post on that topic.

Please read my last blog about Testing Business Continuity using AKS and Azure SQL Server.

Please read my other articles as well and share your feedback. If you like the content shared please like, comment, and subscribe for new articles.

