ArgoWorkflows for Distributed MongoDB Logical Backup

Yossi Cohn
Sep 27, 2023

In this post, I will describe how we used ArgoWorkflows to create a MongoDB Logical Backup in a practical manner that is applicable to big MongoDB clusters as well as small ones.

This is a high-level article that outlines how we can do the Logical Backup in a safe manner, especially when we have many TB of data.


What is MongoDB Logical backup — Mongodump

Physical Backup is the process of backing up the raw files on the file system.
Logical Backup is the process of backing up the logical content.
Mongodump is the MongoDB utility for Logical Backup.
Mongorestore is the MongoDB utility for restoring a logical dump created by Mongodump.

The data is extracted from the MongoDB replicas and compressed with gzip, ready to be stored for a later restore with Mongorestore.

mongodump is a utility that creates a binary export of a database's contents.
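
For illustration, here is a minimal sketch of what a per-database dump invocation could look like inside a container spec; the host, port, database name, and paths are hypothetical placeholders, not our actual values.

```yaml
# Minimal sketch of a per-database mongodump invocation as it could appear in a
# container spec. Host, port, DB name, and paths are hypothetical placeholders.
command: ["mongodump"]
args:
  - "--host=mongodb-replica-0.example.internal"  # read from a secondary replica
  - "--port=27017"
  - "--db=orders"                                # one database per process
  - "--readPreference=secondary"
  - "--gzip"                                     # compress the dump
  - "--archive=/backup/orders.archive.gz"        # written to a mounted volume
```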

The Motivation — Unsustainable Mongodump dump process

When you have a MongoDB cluster with many TB of data, the MongoDB logical backup, which uses the Mongodump utility, is not practical.

The reason is that Mongodump is a CPU-intensive process that takes a long time, and when Mongodump runs as a single process dumping all the databases, it takes a very long time; in our case, around 5 days!

Having a Mongodump process running for a few days is not really practical.
There is a high chance of failure, which would mean that you need to start the process all over again!

Finding the right infrastructure

We need to find the infra that would be used to create the logical backup (and the restore as well).

The current MongoDB dump we use is based on Jenkins.
Jenkins serves us well in many cases, but by now we can already see its rigid behavior and limitations.
Moreover, Jenkins has its own instability issues and a dependency on plugins, which are often the root cause of Jenkins issues and failures.

Like most companies, we are using Kubernetes, which is the de facto orchestration infrastructure adopted by organizations today, ours included.

Defining the Solution

Since we understand that having a single mongodump process is a problem, moving to a distributed backup process should be part of the solution.

Distributed processes

Assuming that the databases we have are usually in the range of tens to hundreds of GB, we can handle a process per database.

Note that:

  • Distributed small processes would be more sustainable (as they are small and short, hence a lower chance of failure)
  • A failure would mean that only the failing processes need to run again (instead of the whole big process).

ArgoWorkflows as Infrastructure

The huge Kubernetes community is thriving, and the Linux Foundation/CNCF organizations are promoting many projects, which is a good starting point.

One of the mature, graduated CNCF projects is the Argo Project.

ArgoWorkflows is part of the Argo Project umbrella and is defined as

workflow engine supporting DAG and step-based workflows.

Moving to use this infrastructure seemed like it would answer the needs of our use case as well as existing and future use cases.

Implementing the Solution

So far, we have handled the Why: the motivation and the main use case.

In the following section, we will go through the How.

Instead of going through the details, we will try to explain the concept.

I assume you have your own ArgoWorkflows installation and that you understand the basics.

Using Workflows (Cron)

Argo is implemented as a Kubernetes CRD.

The base object is the Workflow, but since we would like this backup to run on a schedule, we need cron behavior, which is provided by the CronWorkflow object.

CronWorkflows are workflows that run on a preset schedule.

They are designed to be converted from Workflows easily and to mimic the same options as Kubernetes CronJob.

In essence, CronWorkflow = Workflow + some specific cron options.
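
As a rough sketch of that idea (the name, schedule, image, and placeholder step below are illustrative assumptions, and the real backup templates are outlined later), a CronWorkflow simply wraps a regular Workflow spec with cron options:

```yaml
# Sketch of a CronWorkflow skeleton: cron options on top, a plain Workflow
# spec underneath. Name, schedule, and image are illustrative placeholders.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: mongodump-orders            # hypothetical: one CronWorkflow per DB
spec:
  schedule: "0 2 * * *"             # cron-specific options live here
  concurrencyPolicy: Forbid
  workflowSpec:                     # everything below is a regular Workflow spec
    entrypoint: backup-db
    templates:
      - name: backup-db
        container:
          image: alpine:3.19
          command: ["echo", "backup steps go here"]   # placeholder step
```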

Our CronWorkflows should do the following:

  • Get the MongoDB Cluster Metadata
    - DNS & port
    - DB name (to be backed up)
    - Credentials
  • Get the MongoDB DB Metadata
    - Size of the database
  • Prepare
    - Create a PVC for the dump
  • Process: run the Mongodump process
    - done for the specific DB, using the credentials
  • Persist the data
    - Create a VolumeSnapshot for the PVC, making the data available for restoration when we need it

In the following schema we can see a very high-level view of the flow.
The yellow blocks are the metadata for the flow,
and the orange blocks are the creation of Kubernetes entities (PVC, Pod, VolumeSnapshot) that implement the process.

In the real production flow we would have additional steps for other metadata and cleanup of the resources, but this is the main skeleton.

After we have this schema, we can define the ArgoWorkflows Workflow manifest, and from that define a template for the CronWorkflow.
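
A sketch of such a Workflow manifest for a single DB could look like the following; the parameter names, image, sizes, and resource manifests are illustrative assumptions rather than our production manifests.

```yaml
# Sketch of the Workflow behind the CronWorkflow for a single DB.
# Parameter names, image, sizes, and snapshot class are illustrative assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: mongodump-orders-
spec:
  entrypoint: backup-db
  arguments:
    parameters:
      - name: db-name
        value: orders                     # hypothetical DB
  templates:
    - name: backup-db
      steps:
        - - name: create-pvc              # Prepare: PVC for the dump
            template: create-pvc
        - - name: mongodump               # Process: run mongodump for this DB
            template: mongodump
        - - name: snapshot-pvc            # Persist: VolumeSnapshot of the PVC
            template: snapshot-pvc

    - name: create-pvc
      resource:
        action: create
        manifest: |
          apiVersion: v1
          kind: PersistentVolumeClaim
          metadata:
            name: dump-{{workflow.parameters.db-name}}
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 200Gi            # sized from the DB metadata step

    - name: mongodump
      container:
        image: mongo:6.0
        command: ["mongodump"]
        args:
          - "--db={{workflow.parameters.db-name}}"
          - "--gzip"
          - "--archive=/backup/{{workflow.parameters.db-name}}.archive.gz"
        volumeMounts:
          - name: dump
            mountPath: /backup
      volumes:
        - name: dump
          persistentVolumeClaim:
            claimName: dump-{{workflow.parameters.db-name}}

    - name: snapshot-pvc
      resource:
        action: create
        manifest: |
          apiVersion: snapshot.storage.k8s.io/v1
          kind: VolumeSnapshot
          metadata:
            name: dump-{{workflow.parameters.db-name}}
          spec:
            volumeSnapshotClassName: csi-snapclass   # depends on your CSI driver
            source:
              persistentVolumeClaimName: dump-{{workflow.parameters.db-name}}
```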

Below is the schema of the Kubernetes objects that build the mongodump pipeline:

Kubernetes Objects in the Backup Flow

Note that each DB has a different CronWorkflow, which eventually creates, for the K'th DB, the Pod DB-Xk.

Additional details

Helm

After we have the basic building blocks, we can easily use Helm to define a Chart with a template that receives the specific metadata via the values files and renders the different CronWorkflows we need. This way, the Mongodump is distributed per DB instead of per cluster, as in the initial state.
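
As a sketch of the idea (the chart structure, value names, and the shared WorkflowTemplate name mongodump-backup are hypothetical), the values file can list the databases, and a template can range over them to render one CronWorkflow per DB:

```yaml
# Hypothetical values.yaml: one entry per database to back up.
databases:
  - name: orders
    schedule: "0 2 * * *"
    storage: 200Gi
  - name: users
    schedule: "0 3 * * *"
    storage: 50Gi
```

```yaml
# Hypothetical templates/cronworkflow.yaml: renders one CronWorkflow per DB,
# assuming the backup steps are factored into a shared WorkflowTemplate.
{{- range .Values.databases }}
---
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: mongodump-{{ .name }}
spec:
  schedule: {{ .schedule | quote }}
  workflowSpec:
    arguments:
      parameters:
        - name: db-name
          value: {{ .name | quote }}
        - name: storage
          value: {{ .storage | quote }}
    workflowTemplateRef:
      name: mongodump-backup        # shared WorkflowTemplate with the steps
{{- end }}
```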

Observability

With ArgoWorkflows we can get the logs of the different steps the same way we do for any other Pod, as well as add Prometheus metrics to observe the different aspects of the process.
We can also connect notifications to Slack, email, etc.
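
For example, an ArgoWorkflows template can emit Prometheus metrics directly from its spec; below is a minimal sketch, where the metric names, labels, and the db-name parameter are illustrative assumptions.

```yaml
# Sketch: template-level Prometheus metrics for the dump step.
# Metric names, labels, and the db-name parameter are illustrative assumptions.
metrics:
  prometheus:
    - name: mongodump_duration_seconds
      help: "Duration of the mongodump step"
      labels:
        - key: db
          value: "{{workflow.parameters.db-name}}"
      gauge:
        realtime: true
        value: "{{duration}}"
    - name: mongodump_failures_total
      help: "Count of failed mongodump steps"
      when: "{{status}} == Failed"
      counter:
        value: "1"
```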

Restore phase

The restore is typically an on-demand scenario, whether it is a single DB we need to recover or many.
In any case, this flow is again distributed and done in parallel: since we have a VolumeSnapshot per DB, we can deploy multiple Mongorestore Pods running in parallel, each one restoring a specific DB.
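
As a rough sketch of the restore side for one DB (names, sizes, and the target host are placeholders), a new PVC can be provisioned from the VolumeSnapshot and mounted by a Mongorestore Pod:

```yaml
# Sketch of the restore path for one DB: a PVC provisioned from the
# VolumeSnapshot, mounted by a mongorestore Pod. Names/sizes are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-orders
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi
  dataSource:                         # populate the PVC from the snapshot
    name: dump-orders
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
---
apiVersion: v1
kind: Pod
metadata:
  name: mongorestore-orders
spec:
  restartPolicy: Never
  containers:
    - name: mongorestore
      image: mongo:6.0
      command: ["mongorestore"]
      args:
        - "--host=mongodb.example.internal"   # hypothetical target cluster
        - "--gzip"
        - "--archive=/backup/orders.archive.gz"
      volumeMounts:
        - name: dump
          mountPath: /backup
          readOnly: true
  volumes:
    - name: dump
      persistentVolumeClaim:
        claimName: restore-orders
```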

Summary

In this article I described how we can create a distributed flow of MongoDB Logical Backup while using ArgoWorkflows as the underlying infrastructure.

We got a sustainable, safe, and observable pipeline to back up a MongoDB cluster.

This example shows that we can change the way we do things (e.g., with Jenkins), adopt modern infrastructure like ArgoWorkflows in many other scenarios, and use Kubernetes capabilities to implement sustainable flows and pipelines.
