Optimizing and running AI & Data: why Plastic Origins chose Azure Kubernetes Service and Airflow for sustainable operations.

Clement Le Roux
6 min read · Apr 11, 2023

Plastic waste is a growing problem that affects rivers, oceans, and wildlife all over the world. To address this issue, Surfrider Foundation Europe has been working on the Plastic Origins R&D program, powered by a cloud-native digital platform hosted on Azure, with Microsoft supporting the project.

From day one, Surfrider challenged us to design and build an optimized solution that could deliver both efficient and sustainable operations. It has been a three-year journey to reach that point; here is the story of our data and artificial intelligence solution.

Plastic Data lifecycle

First, we needed data about plastic in rivers. We did not exactly start from scratch, as there was an existing manual process of gathering data during kayak cruises on rivers. Building on that, the Plastic Origins mobile application was developed to allow volunteers to collect more data.

The trash data identifies different plastic categories such as bottles, fragments, sheets, or tires found in rivers, alongside their GPS coordinates. We have been continuously extracting this data from two sources: JSON files produced by a manual process and MP4 videos processed with artificial intelligence.

After moving raw data from mobile devices to the cloud, we stored the semi-structured (JSON) and unstructured (MP4) data in a simple data lake built on top of the Azure Storage service. We then had to figure out how to process this data so it could serve Surfrider's end goal: analytics and insights to take action.
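
For context, here is a minimal sketch of how raw files can land in such a data lake, assuming the azure-storage-blob Python SDK; the container name and blob layout are placeholders for the example, not the actual Plastic Origins storage layout.

```python
# Minimal sketch: land a raw JSON trace or MP4 video in the data lake.
# The "raw" container and the blob path are assumptions for this example.
from pathlib import Path

from azure.storage.blob import BlobServiceClient


def upload_raw_file(connection_string: str, local_path: str) -> None:
    """Upload a raw file to a hypothetical 'raw' container."""
    service = BlobServiceClient.from_connection_string(connection_string)
    container = service.get_container_client("raw")  # hypothetical container name
    blob_name = f"incoming/{Path(local_path).name}"
    with open(local_path, "rb") as data:
        container.upload_blob(name=blob_name, data=data, overwrite=True)
```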

To that end, we developed a custom Extract, Transform, Load (ETL) process able to transform both the manual and video data and load it into a structured data store, a PostgreSQL database, which became our new ground truth for plastic pollution in rivers.
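
The load step itself can be sketched with pandas and SQLAlchemy; the connection string, table name, and columns below are assumptions for illustration, not the actual Plastic Origins schema.

```python
# Illustrative load step: append transformed trash records to PostgreSQL.
# The database URL, table name and columns are placeholders for the sketch.
import pandas as pd
from sqlalchemy import create_engine


def load_trash_records(records: pd.DataFrame, db_url: str) -> None:
    """Append cleaned trash records (category, latitude, longitude, ...) to the database."""
    engine = create_engine(db_url)  # e.g. "postgresql+psycopg2://user:pwd@host:5432/plastic"
    records.to_sql("trash", engine, if_exists="append", index=False)
```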

What a ride from plastic in rivers to actual trash data!

Manual acquisition of plastic trash as JSON data

Plastic Origins ETL process

Our ETL process handles the two types of trash data as described below:

  • The first data type comes in handy small JSON files containing trash categories along with their GPS coordinates, easy to store and transform with basic data engineering techniques using standard Python libraries like pandas (a minimal sketch follows this list).
  • The second data type comes as MP4 videos, recorded either on mobile or GoPro, together with an additional GPS JSON file when the video was produced by the Plastic Origins mobile application (the GoPro video format embeds GPS data directly within the video stream). To process these videos, we used our home-made deep learning technology named Surfnet to automatically extract plastic trash from the optical flow and join it with GPS coordinates in a post-processing step.
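
As a rough illustration of the first branch, here is a minimal pandas sketch; the JSON field names ("trashes", "lat", "lng", "timestamp") are assumptions made for the example, not the real schema.

```python
# Minimal sketch of the "manual" branch: flatten a JSON trace of trash
# observations into a tidy DataFrame. Field names are assumptions.
import json

import pandas as pd


def transform_manual_trace(json_path: str) -> pd.DataFrame:
    with open(json_path) as f:
        trace = json.load(f)
    # Each record is assumed to carry a category plus GPS coordinates.
    df = pd.json_normalize(trace["trashes"])
    df = df.rename(columns={"lat": "latitude", "lng": "longitude"})
    return df[["category", "latitude", "longitude", "timestamp"]]
```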

With a common PostgreSQL data store as the target, we were challenged by the double requirement of processing two types of data that are very different by nature. We put effort into defining a similar stack of operations, typically by packaging both our ETL and AI as containers and making them accessible as APIs. However, although processing the manual JSON files in real time was quite easy, deploying a real-time AI infrastructure was not a good match, neither in terms of cost nor in terms of environmental impact. At this point we felt a little frustrated: we had put a lot of engineering into making sure our solution was state of the art, but we still hit limitations due to the variety of data to process. We realized that the real-time approach was leading to a dead end and that we would need extra engineering to implement our efficient and sustainable data and artificial intelligence platform.
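
For illustration, here is what exposing a containerized processing step as a small API can look like; FastAPI and the endpoint shape are assumptions for the sketch, as the article does not state which web framework was used.

```python
# Sketch of an ETL or AI step exposed as a small HTTP API inside a container.
# FastAPI and the request schema are assumptions, not the actual implementation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ProcessRequest(BaseModel):
    blob_url: str  # location of the JSON trace or MP4 video in the data lake


@app.post("/process")
def process(request: ProcessRequest) -> dict:
    # Placeholder for the actual ETL / Surfnet inference logic.
    return {"status": "accepted", "blob_url": request.blob_url}
```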

Azure Kubernetes Service and Airflow

Our ETL and AI APIs were ready, but we needed to make some improvements to run efficient operations over the long term.

First we had to improve our AI API. We replaced our deep learning backbone with a lighter-footprint one and switched from the TensorFlow to the PyTorch framework. By doing that, we got a much more performant AI in terms of iterations per second (it/s) and, most of all, an AI able to run inference on CPU, not only on GPU. Since our AI was shipped as a Docker container, this immediately opened up new infrastructure deployment options such as Azure Kubernetes Service (AKS). Because our ETL process was shipped as a container as well, we could design a new deployment approach for both AI and ETL on AKS, which is a great match for stateless and serverless-style components. Why did we not choose Azure Functions then, one could argue? We actually used them in the past, but for some long-running ETL processes we had more freedom managing our own infrastructure with an AKS cluster.
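
A minimal sketch of what CPU-capable inference looks like with PyTorch follows: the device is chosen at runtime so the same container runs on a GPU node pool or a plain CPU node. The scripted model file and the frame pre-processing are hypothetical placeholders, not the actual Surfnet code.

```python
# Sketch: pick the inference device at runtime, then run a detection model
# frame by frame over a video. Model artifact and pre-processing are placeholders.
import cv2
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.jit.load("surfnet_scripted.pt", map_location=device).eval()  # hypothetical artifact


def detect_on_video(video_path: str):
    capture = cv2.VideoCapture(video_path)
    detections = []
    with torch.no_grad():
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            # HWC uint8 BGR frame -> normalized CHW float tensor on the chosen device
            tensor = torch.from_numpy(frame).permute(2, 0, 1).float().div(255).unsqueeze(0).to(device)
            detections.append(model(tensor))
    capture.release()
    return detections
```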

Moving from real-time to batch processing was quite a change as well. We had to redefine the triggering of the ETL process, which used to be driven by an event in the data lake, basically whenever a JSON file or a video was uploaded. We had the intuition that Airflow could be our friend, but we were not sure to what extent. It was actually the perfect time to redesign our ETL process to batch the data instead of processing it in real time. We wrote a Directed Acyclic Graph (DAG), the basic batch unit of Airflow, to process our files once a week. This also gave Surfrider admins a user-friendly interface to monitor the execution of the scheduled ETL process, or even to trigger it manually when necessary. In addition, we leveraged a nice Airflow feature, its ability to manage many different types of processing targets, to manually scale up and down the AKS node pool on which the AI was running. This cut the cost of our operations by a factor of 10! Last but not least, to push our consolidation further, we decided to deploy the Airflow environment itself on the always-running nodes of the AKS cluster. The final benefit was that we could also move our Business Intelligence workload (or BI, a workflow providing more meaningful insights on plastic river pollution) to the same Airflow instance.
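
A condensed sketch of what such a weekly DAG can look like is shown below: scale the AKS node pool up, run the AI and ETL steps, then scale back down. The resource group, cluster, node pool names, and API endpoints are invented placeholders; the production DAG certainly differs.

```python
# Sketch of a weekly Airflow DAG: scale up the AI node pool, run the batch,
# scale it back down. All resource names and endpoints are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="plastic_origins_weekly_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    scale_up = BashOperator(
        task_id="scale_up_ai_nodepool",
        bash_command=(
            "az aks nodepool scale --resource-group plastic-rg "
            "--cluster-name plastic-aks --name aipool --node-count 2"
        ),
    )
    run_ai = BashOperator(
        task_id="run_ai_inference",
        bash_command="curl -sf -X POST http://surfnet-api/process-pending",
    )
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command="curl -sf -X POST http://etl-api/run-batch",
    )
    scale_down = BashOperator(
        task_id="scale_down_ai_nodepool",
        bash_command=(
            "az aks nodepool scale --resource-group plastic-rg "
            "--cluster-name plastic-aks --name aipool --node-count 0"
        ),
    )

    scale_up >> run_ai >> run_etl >> scale_down
```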

You read it right: we were finally able to deploy Airflow, our ETL, and our AI on the same AKS cluster, delivering an efficient and sustainable platform providing BI dashboards and data to map and monitor plastic pollution.

Future map of plastic pollution in rivers

Final Thoughts

It’s been quite a cruise to put this backend of the Plastic Origins platform into production, but we learned a lot from the journey. We considered almost every possible Azure deployment option, from Virtual Machines and VM Scale Sets to Azure Functions, and finally ended up with one single Azure Kubernetes Service cluster. Although we had to re-architect some parts of the solution along the way, there are three main benefits we can highlight from the result.

  • First, we eased our operations by merging our ETL, AI, and BI workloads into one single point of management, our Azure Kubernetes Service cluster.
  • Second, we reduced our energy and environmental footprint by moving our AI from GPU to CPU and scaling the nodes only when required.
  • Finally, we cut our expenses by a factor of 10 by moving from real-time to batch processing.

One could argue that our technical prerequisites are high-standard: it’s true that running deep learning AI, ETL, and BI with Airflow on Kubernetes is not a piece of cake for everyone. But given our budget, environmental, and operational requirements, this was the best we could do to support Surfrider Foundation Europe in kicking plastic pollution out of rivers :)
