K8s Workflow Management for Software Developers Using Argo Workflows

Stav Levinger
Published in Riskified Tech · Sep 12, 2022 · 8 min read


Many tech companies face challenges as their data volume gradually increases over time. This natural growth might occasionally require running historical data reprocessing: the procedure of retroactively modifying existing records or loading new data into them, to make it available for current needs.

Historical data reprocessing (in short: “Backfill”) might seem like a day-to-day task for Data Engineers, but in this post, I’d like to discuss how we cracked this challenge in my Full Stack team here at Riskified.
We managed to do that using Argo Workflows, a k8s-based tool that we found was almost tailor-made for our use case. Later we’ll look at some examples of how to get started with it, but first, let’s see how it all began…

Why did we have to run Historical Data Reprocessing?

We first encountered the need to run relatively high-scale historical data reprocessing in my team about 1.5 years ago. It all started when we decided to gradually index more fields into our application’s Elasticsearch, and fetch them from there instead of from a relational database (PostgreSQL).

Being responsible for Riskified’s primary client-facing application, which presents KPI metrics reports, we understood the importance of accurately acquiring and indexing our data.
So we carefully mapped all the relevant fields that hadn’t been indexed in Elasticsearch from the beginning, and started indexing them from that moment on.

This was quite trivial thanks to our existing async Producer-Consumer architecture, but what about past data? For example — how could we support presenting the new fields for old orders that hadn’t been originally indexed with them?

What challenges did we face?

We had all historical data saved in our PostgreSQL database, and we’d already run some small-scale backfills in the past — but never before had we faced the challenge of running them on all of our existing data, starting from the very first “live” order, which practically meant running on billions of records.

In addition, this time we had to index computed fields, meaning it was no longer possible to just fetch them from the DB and publish them “as-is”. To prevent discrepancies between new and old data, we had to call the exact same Ruby logic that calculates these fields in memory when publishing their indexing requests.

This made it less of a conventional task for our Data team, and more closely coupled to my Full Stack team’s domains and responsibilities. But it also meant facing new challenges that inevitably came with our growing scale and complexity. Let’s review why the flow we’d used until then could no longer serve us.

Pitfalls of our old backfill flow

The old “orchestration” solution was simply running the backfill scripts on several Docker containers at once — each assigned a different batch — on a single EC2 machine.
It wasn’t optimal, but it worked, mainly thanks to the small number of orders and the straightforward “fetch & publish” logic. At that point, it probably could’ve been performed equally well by Data teams without our support.

But with the new scale and complexity described earlier, it quickly stopped being a feasible solution, with some obvious pitfalls:

  • Awkward parallel run — difficult to control and monitor, unstable
  • Unreasonable run time — would’ve taken us several weeks (!) to complete due to its cumbersomeness
  • No clear, unified visibility of the current status and failures
  • Manual, error-prone reruns (due to miscalculations of failed batches’ id ranges)
  • High-cost and resource-consuming — the EC2 machine had to be kept alive throughout the entire process

Those obstacles weren’t clear to us at first glance. We hadn’t realized how unrealistic it would be to insist on using the existing mechanism, and hoped that investing some extra time in writing generic orchestration scripts would suffice.

Luckily, at some point we realized it couldn’t be solved “the Agile way”, so we went back to the drawing board to reassess our possibilities, defined a new set of requirements, and started looking for new solutions.

The chosen solution: using Argo Workflows

Thanks to some successful brainstorming with our DevOps teams, we identified that our requirements almost perfectly matched a platform that was already integrated into our infra, but was patiently waiting to prove itself with its first applicative use case: Argo Workflows.

Argo Workflows, according to its documentation, “is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes”.
At first we felt intimidated by all the new k8s terms and concepts, but as we dug deeper into its docs, with our DevOps teams’ support, we saw its huge potential, including some highly relevant core features:

  • Modeling multi-step workflows, where one step’s output serves as the next step’s input
  • Parallelism
  • Retry mechanism
  • Easy injection of environment variables and secrets (see the snippet right after this list)
  • Cost-efficient, dedicated k8s pod-based orchestration
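
As a small taste of the environment variables and secrets bullet, this is roughly what exposing a value from an existing Kubernetes Secret inside a step’s pod looks like. It’s only a fragment of a step’s container spec, and the Secret name and key below are hypothetical:

```yaml
# Fragment of a step's container spec: expose a value from an existing
# k8s Secret as an environment variable inside the pod
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: backfill-db-credentials   # hypothetical Secret name
        key: url                        # hypothetical key inside that Secret
```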

So we decided to overcome the fear of many software developers — and dive into this “DevOps tool” and its k8s-based implementation 😱

Let’s take a closer look at Argo Workflows’s syntax, and show how we can use it to build an orchestration solution.

Defining the workflow with a YAML template

Argo Workflows is a YAML-based platform, meaning that defining a workflow with the steps and features mentioned above involves writing a YAML template, starting with the most basic object, of kind: Workflow.

We were partially familiar with this syntax, as we use YAML-based Helm charts to manage our k8s resources with our DevOps engineers. The main difference was that Argo Workflows introduced us to new abstractions (“objects”) for wrapping different types of k8s resources. Ultimately, it all boils down to defining the pods we want to run, and their relations and dependencies. So let’s dive into our workflow’s template and explain its parts:

NOTE: This is a simplified version of our template that is suggested as a boilerplate. Irrelevant parts are omitted for the sake of simplicity and readability.

PART #1 (Lines 1–9): General workflow attributes
In this part, we’re defining an object of kind: Workflow and setting its general attributes, such as its entrypoint, parallelism (max number of pods to run in parallel), etc.
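
Our original template isn’t reproduced here (the line ranges in these PART headings refer to it), but a minimal sketch of this first part could look roughly like this — all names and values below are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: orders-backfill-   # hypothetical name; k8s appends a random suffix per run
spec:
  entrypoint: main                 # the template the workflow starts running from
  parallelism: 20                  # max number of pods to run in parallel
```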

PART #2 (Lines 10–21): Job definitions and relations
Here we’re defining the steps (jobs) to run; in our example we have 2 steps: generate-ids & runner. They will run sequentially, and as we can see — generate-ids’s output (start & end ids to use for the current pod’s batch) will be used as runner’s input arguments.
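
Continuing the sketch under spec, and assuming generate-ids prints a JSON array of {start, end} batches to stdout (which Argo exposes as that step’s result output), the two steps could be wired roughly like this:

```yaml
  templates:
    - name: main
      steps:
        # Step 1: compute the id batches
        - - name: generate-ids
            template: generate-ids
        # Step 2: fan out one runner pod per batch emitted by generate-ids
        - - name: runner
            template: runner
            arguments:
              parameters:
                - name: start-id
                  value: "{{item.start}}"
                - name: end-id
                  value: "{{item.end}}"
            withParam: "{{steps.generate-ids.outputs.result}}"
```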

PART #3 (Lines 22–52): Job templates
Finally, we’re referencing the jobs defined in PART #2 (see above) and specifying each referenced job’s commands and specific attributes that determine how it will run, for example: command, retryStrategy, inputs, pod env variables, etc.
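
And a sketch of the job templates themselves, still under templates. The image name and rake tasks are placeholders for whatever actually fetches, computes, and publishes the records in your setup:

```yaml
    # Emits a JSON array like [{"start":1,"end":1000}, ...] to stdout,
    # which Argo captures as steps.generate-ids.outputs.result
    - name: generate-ids
      container:
        image: our-backfill-image:latest                         # hypothetical image
        command: [bundle, exec, rake, "backfill:generate_ids"]   # hypothetical task

    # Processes a single batch; retried up to 4 times before being marked as failed
    - name: runner
      retryStrategy:
        limit: 4
      inputs:
        parameters:
          - name: start-id
          - name: end-id
      container:
        image: our-backfill-image:latest                         # hypothetical image
        command: [bundle, exec, rake]
        args: ["backfill:run[{{inputs.parameters.start-id}},{{inputs.parameters.end-id}}]"]
        env:
          - name: RAILS_ENV          # example of a plain env variable injected into the pod
            value: production
```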

Bringing the template to life

We ran the workflow by pasting its YAML template into Argo Workflows’s UI and clicking on ‘SUBMIT’.

NOTE: I warmly suggest starting with a small number of records, to minimize the risk if something doesn’t work as expected. Afterward, you can safely increase your range.

The generated workflow ended up looking like this:

We can see that the workflow started with the first job — generate-ids — and used its outputs to start running multiple runner jobs on multiple pods in parallel. Each pod got assigned its own id range and had a retry limit of 4 attempts, after which it was marked as failed.
Scrolling to the left and right allowed us to see the currently running pods along with the completed ones, and to get a quick view of the failed pods (after exceeding their retry limit), in order to understand what ranges need to be re-run.

And… that’s it!

From this moment on we only occasionally monitored it, allowing Argo Workflows to orchestrate the entire flow for us. Its UI provided a convenient view for the failed batches while reducing the friction to a minimum and encapsulating complex behaviors, such as the parallelism and retry mechanisms that come out of the box.

We did have some tweaking to do to find the right publishing & indexing rate — such as splitting into several consecutive workflows, which also reduced the load on Argo’s UI when presenting thousands of pods at once — but our mission was accomplished:

We completed the entire backfill within just a few days (instead of several weeks!) and the old flow was gone for good! 🎉

Next steps

Once we had run several high-scale backfills with our workflow, word spread and more teams at Riskified adopted this platform to run their applicative historical data reprocessing jobs.

Later on, an applicative infra team gathered our requirements and, based on our workflow, built generic templates that are available through our k8s charts for all dev teams to use!
In addition, DevOps added some attributes to those generic templates so they can be configured to run on AWS Spot Instances, which significantly reduced the running cost.

As a next step, we’re considering writing or looking for a tool that will crawl over our data stores (e.g. PostgreSQL, Elasticsearch) to identify differences between the orders stored in them. This could help us verify our data integrity after completing a backfill, and it could be useful beyond historical data reprocessing as well.

Wrapping up

This effort was made possible, and went relatively smoothly, thanks to having Argo Workflows already integrated into our organization’s infrastructure, and to being partially familiar with YAML. But I believe this is a good, straightforward, “developer-friendly” solution nonetheless.

I encourage you to read more about what to consider before choosing Argo Workflows in this great article, which also compares it to other popular workflow-management tools in the market (such as Airflow).

For me, implementing this almost tailor-made solution was a great opportunity to broaden my horizons and get more comfortable with k8s. It made me dive deeper into k8s object/resource types, get familiar with Helm charts as their wrappers, gain independence when maintaining or writing new templates for our k8s services, and more!

I personally believe it’s important for us as software developers to acquire these skills, so we understand not only what our code is doing, but also how it gets wrapped up into deployable applications, and what great tools and platforms can enable this while supporting an ever-growing scale.
