Disaster Recovery Solution for Azure Kubernetes Service (AKS) Persistent Volume Storage

Shivam Gupta | KloudSaga
10 min read · Nov 1, 2022

Disaster recovery (DR) capabilities should be a key consideration when choosing a cloud platform. Leveraging the cloud as a secondary data center/region for DR is often the first step in cloud adoption.

A disaster recovery (cross-region backup and restore) solution for Azure Kubernetes Service (AKS) persistent volume storage can be simple to architect, cloud-native, highly available, and resilient.

We will demonstrate the disaster recovery solution for Azure Kubernetes Service (AKS) persistent volume storage in five simple steps:

1- Install the cross-region backup and restore tool
2- Set up an external backup target for the Rancher Longhorn tool
3- Deploy a stateful application in the AKS cluster
4- Back up persistent volume data from Longhorn in the primary region
5- Restore persistent volume data from Longhorn in the secondary region

Prerequisites:
• Two AKS clusters in paired regions (we are using clusters in West Europe and North Europe in this demonstration)
• Helm CLI
• Azure NetApp Files
• A dedicated subnet for Azure NetApp Files

1- Installation of the cross-region backup and restore persistent storage tool

We went through several tools and are sharing the comparison below:

We chose Rancher Longhorn and are deploying it on both AKS clusters, in West Europe and North Europe.
• Longhorn is a lightweight, reliable and easy-to-use distributed block storage system for Kubernetes.
• Longhorn is free, open source software.
• Originally developed by Rancher Labs, it is now being developed as an incubating project of the Cloud Native Computing Foundation.
Run the commands below to install Rancher Longhorn:

• helm repo add longhorn https://charts.longhorn.io
• helm repo update
• helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace
• kubectl -n longhorn-system get pod

You should see the list of pods running in the longhorn-system namespace. Alternative installation methods are described at the link below.

https://longhorn.io/docs/1.2.4/deploy/install/

You can edit the longhorn-frontend service and expose it as a LoadBalancer to access the Longhorn UI dashboard on both clusters. Run the command below to edit the Kubernetes service:

• kubectl edit svc longhorn-frontend -n longhorn-system

Change the service type from ClusterIP to LoadBalancer. Once the service type is changed, a public IP will be assigned to the Longhorn UI service and you can access the portal from your local machine.
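If you prefer a non-interactive change, here is a minimal sketch of the same edit using kubectl patch, assuming the default service name and namespace from the Helm chart:

• kubectl patch svc longhorn-frontend -n longhorn-system -p '{"spec": {"type": "LoadBalancer"}}'
• kubectl get svc longhorn-frontend -n longhorn-system --watch

The second command lets you watch until the EXTERNAL-IP column is populated.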

2- Set up an external backup target for the Rancher Longhorn tool

A backup target is the endpoint used to access a backupstore in Longhorn. A backupstore is an NFS server or S3-compatible server that stores the backups of Longhorn volumes. The backup target can be set at Settings > General > Backup Target.

Note: To use an NFS server as the backupstore, the NFS server must support NFSv4.

We need to set up an NFS service in Azure NetApp Files for Longhorn. Azure NetApp Files is a fully managed Microsoft service that supports built-in local HA and cross-region replication, helping move your business, applications, and workflows to the cloud faster and more securely.

To use the Azure NetApp Files service, you need to register the NetApp resource provider using the az CLI:

• az provider register --namespace Microsoft.NetApp --wait
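You can confirm the registration has completed before continuing; a quick check, assuming you are logged in to the right subscription:

• az provider show --namespace Microsoft.NetApp --query registrationState --output tsv

This should print Registered.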

After registration, you can create the Azure NetApp Files service in the West Europe region. Azure NetApp Files also requires a dedicated subnet in the VNet.

First, create a subnet in the same VNet where the AKS cluster is located in West Europe, because the AKS cluster needs reachability to the Azure NetApp Files private IP. Make sure you delegate the subnet to the service Microsoft.NetApp/volumes. If there is no available IP space in the AKS VNet, you can add a new address range to the VNet and create the subnet from it.
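As a sketch, the delegated subnet can also be created from the az CLI; the resource group, VNet name, and address prefix below are placeholders for your environment:

• az network vnet subnet create --resource-group myRG --vnet-name aks-vnet --name anf-subnet --address-prefixes 10.0.2.0/24 --delegations "Microsoft.NetApp/volumes"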

Once you have created the subnet in the VNet, create the Azure NetApp Files service in the West Europe region.

After creating the NetApp account, create a storage capacity pool according to your requirements, then create a volume inside the capacity pool and define your quota limit. Select the same VNet where you recently added the subnet; the delegated subnet will be selected automatically. Now click Review + create to create the volume.
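The same resources can be scripted with the az CLI. A minimal sketch with placeholder names; the sizes assume a recent CLI version where the pool size is expressed in TiB and the volume quota in GiB:

• az netappfiles account create --resource-group myRG --name myanf --location westeurope
• az netappfiles pool create --resource-group myRG --account-name myanf --name mypool --location westeurope --size 4 --service-level Premium
• az netappfiles volume create --resource-group myRG --account-name myanf --pool-name mypool --name anfvolume --location westeurope --service-level Premium --usage-threshold 100 --file-path anfvolume --vnet aks-vnet --subnet anf-subnet --protocol-types NFSv4.1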

Now follow the same steps in the North Europe region, but this time create a data replication volume in the capacity pool, with the same quota limit as defined previously.

While creating the replication volume, provide the full source volume ID (the West Europe volume ID) and select the replication schedule Every 10 minutes.

Once the volumes are created, you need to authorize replication from the source volume in West Europe. You can review the Microsoft official documentation below for Azure NetApp Files cross-region replication.

https://docs.microsoft.com/en-us/azure/azure-netapp-files/cross-region-replication-create-peering
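For reference, the destination (data protection) volume and the replication authorization can also be scripted; a sketch with placeholder names, where <west-europe-volume-id> and <north-europe-volume-id> stand for the full resource IDs of the source and destination volumes:

• az netappfiles volume create --resource-group myRG-ne --account-name myanf-ne --pool-name mypool-ne --name anfvolume-dr --location northeurope --service-level Premium --usage-threshold 100 --file-path anfvolume-dr --vnet aks-vnet-ne --subnet anf-subnet-ne --protocol-types NFSv4.1 --endpoint-type dst --remote-volume-resource-id <west-europe-volume-id> --replication-schedule _10minutely
• az netappfiles volume replication approve --resource-group myRG --account-name myanf --pool-name mypool --name anfvolume --remote-volume-resource-id <north-europe-volume-id>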

You can see the Azure NetApp Files resources below.

Now we are ready on the Azure side. We can check the NFS private IP and the mount instructions inside the volume.

You can use any VM in the same VNet, or temporarily use an AKS worker node terminal, to mount the NFS service. After mounting, create a folder named backup and create some temporary files inside it. You will see the same data replicated to the North Europe replication volume after 10 minutes. Now we need to set up the external backup target in Rancher Longhorn in both regions.
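A minimal sketch of the mount and the test files, assuming 10.0.2.4 is the ANF private IP and anfvolume is the volume's file path (both shown in the volume's mount instructions):

• sudo mkdir -p /mnt/anf
• sudo mount -t nfs -o rw,hard,tcp,vers=4.1 10.0.2.4:/anfvolume /mnt/anf
• sudo mkdir -p /mnt/anf/backup
• sudo touch /mnt/anf/backup/test-file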

Click Setting, go to General, and enter the NFS path in Backup Target under the backup settings. Please see the screenshot below.
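The value follows Longhorn's NFS backup target format; for example, with the same placeholder IP and path as above:

• nfs://10.0.2.4:/anfvolume/backup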

The West Europe NFS service is configured/mapped in the Longhorn instance deployed in the West Europe AKS cluster.

The North Europe NFS service is configured/mapped in the Longhorn instance deployed in the North Europe AKS cluster.

Now we are ready for the stateful application deployment in the West Europe region.

3- Deploy a stateful application in the AKS cluster

We are deploying a sample RabbitMQ application in the West Europe region and will add some data to RabbitMQ, such as a virtual host, a message queue, and a user. You can also deploy a different sample application of your choice.

NOTE: Before installing any application in your cluster, make sure you have set the longhorn storage class as the default. In the screenshot below, if any other class also shows the default tag, edit that storage class and set the annotation storageclass.kubernetes.io/is-default-class: "false", as in the sketch after this note.
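A sketch of the same change using kubectl patch, assuming the pre-existing default class is named default (as it commonly is on AKS):

• kubectl patch storageclass default -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
• kubectl patch storageclass longhorn -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'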

Install the sample RabbitMQ application using the commands below:

• helm repo add bitnami https://charts.bitnami.com/bitnami
• kubectl create ns rabbitmq
• helm install rabbitmq bitnami/rabbitmq -n rabbitmq

You can see below that the RabbitMQ service is running and a persistent volume has been created for RabbitMQ, shown in the Longhorn dashboard.
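You can confirm the same from the CLI:

• kubectl get svc,pvc -n rabbitmq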

Now we need to add some data to RabbitMQ. I edited the RabbitMQ service and exposed it as a LoadBalancer to access the RabbitMQ portal from my local machine:

• kubectl edit svc rabbitmq -n rabbitmq

Retrieve the RabbitMQ login password with the command below:

• kubectl get secret --namespace rabbitmq rabbitmq -o jsonpath="{.data.rabbitmq-password}" | base64 --decode

Once a public IP is assigned to the RabbitMQ service, you can access the RabbitMQ portal at publicIP:15672 and log in with the username user and the password obtained from the command above.
I created one virtual host and one queue, and published 51 messages with the Persistent delivery mode from the portal.

4- Back up persistent volume storage from Longhorn in West Europe

Click on the RabbitMQ volume in the Longhorn portal, as you can see below.

Click the Create Backup option.

Once the backup is created, you can see it in the screen below.

In a real environment you can set up a recurring job schedule for backups, so that backups are created automatically according to the cron schedule and saved to the external NFS service; a sketch follows below.
Once a backup is created, you can see it under the Backup option.
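A minimal sketch of such a recurring backup job as a Longhorn custom resource (Longhorn 1.2+); the job name, cron schedule, and retention are placeholders:

kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta1
kind: RecurringJob
metadata:
  name: backup-every-6h            # placeholder job name
  namespace: longhorn-system
spec:
  cron: "0 */6 * * *"              # run every 6 hours
  task: "backup"                   # create backups, not just snapshots
  groups: ["default"]              # applies to volumes in the default group
  retain: 3                        # keep the last 3 backups per volume
  concurrency: 1
EOF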

The backup will replicate automatically to North Europe every 10 minutes, and the same backup will be visible in the North Europe Longhorn instance.

5- Restore persistent volume data from Longhorn in North Europe

You can see in the screenshot below that the backup has replicated to the North Europe Longhorn dashboard.

You need to follow the same Step 3 to deploy the stateful application on the North Europe AKS cluster.
We have deployed the RabbitMQ application. Note down the PV name in a text file; we will need it later.

We can now start restoring the backup in the North Europe region.
First, we need to stop the application on the cluster, using the command below to scale it down.

• kubectl scale --replicas=0 statefulset.apps/rabbitmq -n rabbitmq

After that, we need to delete the persistent volume claim from the cluster and then create a new persistent volume from the backup using the same name. Run the command below to delete the claim from the cluster:

• kubectl delete pvc data-rabbitmq-0 -n rabbitmq

Before creating a volume from the backup, we need to stop replication, because the replication volume was created in read-only mode, and creating a volume from the backup will write information to the North Europe NFS service.
Go to the North Europe volume, select Replication, and click Break peering.

Once replication is broken, you will see a Resync option. Click the Resync button once your volume restoration is complete in North Europe; both actions can also be scripted, as sketched below.
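For reference, breaking and resyncing the peering can also be done with the az CLI against the North Europe destination volume; a sketch with the placeholder names used earlier:

• az netappfiles volume replication suspend --resource-group myRG-ne --account-name myanf-ne --pool-name mypool-ne --name anfvolume-dr
• az netappfiles volume replication resume --resource-group myRG-ne --account-name myanf-ne --pool-name mypool-ne --name anfvolume-dr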

Now we can create the volume from the backup in the Longhorn portal.
Go to the Longhorn dashboard, select the backup, and create a Disaster Recovery Volume.

The backup name and the volume name will be the same; click OK.

In the screenshot below you can see the volume created and listed under the Volume section of the Longhorn dashboard.
Click Activate Disaster Recovery Volume and click OK.

Once the Disaster Recovery Volume is activated, you can see the Create PV/PVC option in the drop-down menu.

Click Create PV/PVC and provide the details. We provide the same PV name, PVC name, and namespace that we deleted from the North Europe cluster.

Click OK.
You can also check from the kubectl CLI that the PVC has been created.
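For example, using the PVC name from earlier:

• kubectl get pvc data-rabbitmq-0 -n rabbitmq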

Now that the persistent volume has been created, we can start the RabbitMQ application by scaling it back up, as shown below.
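A sketch of the scale-up, reversing the earlier scale-down (one replica, as the chart deployed by default):

• kubectl scale --replicas=1 statefulset.apps/rabbitmq -n rabbitmq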
You can see in the screenshot below that the application has started and is running.

Now we can validate the data in the North Europe application.
Access the RabbitMQ portal at the North Europe public IP:15672 and log in using the same username and password that you used previously when accessing the West Europe RabbitMQ service.

As the screenshot above shows, the RabbitMQ virtual host, queue, and queue messages were restored successfully on the North Europe cluster.

We used the RabbitMQ application for this demonstration, but you can follow the same procedure for any stateful application to achieve cross-region backup and restore; it is a straightforward way to protect your Kubernetes cluster. Rancher Longhorn also allows you to migrate your applications to another Kubernetes cluster at regular intervals.

Azure does not provide a built-in solution for AKS persistent volume disaster recovery or cross-region backup and restore. To protect against regional outages, you should consider multi-region deployments that leverage Azure Traffic Manager to route traffic to available regions, as per the Azure official documentation.

Thank you for reading my article. I really appreciate your response.

For more information about us, visit https://kloudsaga.com/

Visit us on Udemy: https://www.udemy.com/user/kloudsagatutorials/
Udemy vs KloudSaga practice sets: https://kloudsaga.com/udemy-vs-kloudsaga


Shivam Gupta | KloudSaga

10x Microsoft Azure Certified | Microsoft Certified Trainer | Azure DevOps Expert | Azure Architect | Azure Security Expert | AWS Solution Architect| Kubernetes