Application migration from Docker Compose to Kubernetes. How, why, and what problems we’ve encountered. Part 1

Written by Ronald Ramazanov, Head of DevOps, Loovatech

Jun 14, 2023

Often, an application whose user base grows rapidly isn't ready for that growth. Requirements for speed and availability rise, while the infrastructure and application architecture don't allow them to be met.

My task was to improve the application's infrastructure and quality. The app had grown from an MVP serving a single client into a popular SaaS, and the lack of fault tolerance and scalability of its components began to seriously affect users. It was time to adapt the application to run in cluster mode.

The application was deployed in containers orchestrated by Docker Compose, and its components were not designed to run in cluster mode. I guess this is the case for many apps in their early stages: the business is focused on shipping new features, and it isn't always possible, or even necessary, to spend time on what would be premature optimization. At some point, though, stability and speed become the top priorities.

What does a migration to cluster mode look like for an app with an existing tech stack and live users? What are the options? Any pitfalls? Is it feasible from a cost and effort standpoint? In this article, I share my first-hand experience of migrating an application from Docker Compose to Kubernetes.

The app

First of all, let me do a quick scene set-up. Picvario is an enterprise platform built to store, search, and share digital assets: images, audio, video, text documents, spreadsheets, and other formats. It started with a focus on DAM (digital asset management) and eventually targeted the wider CSP (content services platform) market.

Features:

  • A multi-tenant architecture with Django as the backend. Clients share the same backend and database but are isolated from each other in separate database schemas (a minimal sketch of this approach follows the list).
  • A server-side rendered (SSR) frontend built on Nuxt.js.
  • Many asynchronous and background tasks implemented with Celery, each with its own queues and priorities: media asset loading and processing, external storage scanning, exporting, face recognition, etc.
  • Nginx routes traffic to the components, serves static files, and proxies file downloads from S3 via the X-Accel mechanism.
  • Elasticsearch is used for full-text search.
  • Statistics are visualized in Grafana: a tenant administrator can view a detailed report on their environment, while an application administrator sees a higher-level summary across all tenants. This data is stored in a separate Elasticsearch instance and shipped there via Logstash.
  • A significant volume of media data: tens of terabytes, potentially growing to several hundred within a couple of years.
  • The media import load fluctuates.
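To make the first point a bit more concrete, here is a minimal sketch of schema-per-tenant isolation in Django. The hostnames, schema names, and middleware are purely illustrative; the actual Picvario implementation isn't shown in this article and may rely on a dedicated library such as django-tenants.

```python
# Schema-per-tenant sketch: route each request to its tenant's PostgreSQL schema.
# The hostname-to-schema mapping and the middleware itself are hypothetical.
from django.db import connection

TENANT_SCHEMAS = {
    "acme.picvario.example": "tenant_acme",
    "globex.picvario.example": "tenant_globex",
}

class TenantSchemaMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        schema = TENANT_SCHEMAS.get(request.get_host(), "public")
        with connection.cursor() as cursor:
            # All ORM queries in this request now resolve tables in the tenant's schema.
            cursor.execute("SET search_path TO %s, public", [schema])
        return self.get_response(request)
```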

How the infrastructure looked before and why a rebuild was required

All application components ran on a single virtual machine under Docker Compose, 24 containers in total. PostgreSQL and Elasticsearch had already been moved to separate instances, and media asset storage to an S3 bucket.
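To give a rough idea of the starting point, here is a heavily simplified sketch of what such a single-host Compose file might look like. The service names, images, and commands are illustrative, not the real Picvario configuration with its 24 containers.

```yaml
# docker-compose.yml sketch of the pre-migration layout (names and images are hypothetical).
version: "3.8"
services:
  backend:
    image: picvario/backend:latest      # Django API
    env_file: .env
    depends_on: [redis]
  frontend:
    image: picvario/frontend:latest     # Nuxt.js SSR
  worker:
    image: picvario/backend:latest      # Celery worker sharing the backend image
    command: celery -A picvario worker -Q default
    depends_on: [redis]
  redis:
    image: redis:7                      # Celery broker (assumption for illustration)
  nginx:
    image: nginx:1.25                   # routes traffic, serves static files
    ports: ["80:80"]
    depends_on: [backend, frontend]
```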

While such an approach was relatively cheap and easy, it had a number of drawbacks.

For instance,

  • Lack of fault tolerance. A virtual machine outage or even a minor network failure brought the whole service down.
  • Performance. It's more efficient to distribute processes across multiple instances than to run them all on a single instance, even one with a decent resource surplus.
  • No isolation between components. A failure in one component, e.g. a memory leak or high CPU usage, affected the others. Losing a single worker is survivable, but lags on the frontend or backend side (or worse, both being down) could become a major issue.
  • No autoscaling. Docker Compose offers no container or instance scaling, so when many users generated heavy load at the same time, performance dropped sharply. I'll share more on this in the scalability section.

First of all, I had to choose a solution that would orchestrate containers across several hosts and support automatic scaling.

The platform choice. Why Kubernetes?

For me, the key option to consider was Kubernetes, mainly because it is the de facto standard for container orchestration. Like it or not, it is one of the most feature-rich platforms and it keeps improving, so it covers most of our current requirements and, I expect, any future ones. The obvious disadvantage compared with other platforms is its configuration and operational complexity. But since I had already picked up some Kubernetes skills and experience on another project, this didn't scare me. On the contrary, it was interesting to deepen my knowledge of K8s by solving the migration problem.

The team also had reservations about Kubernetes becoming the platform of choice. It looked a bit like overkill for our needs, and some team members thought we would be better off with something simpler and cheaper: for example, Ansible, used to manage containers across several hosts, with traffic distributed between them by a load balancer. The lack of autoscaling could be offset by launching extra worker instances for the heaviest workloads. I guess the Ansible option could have addressed some of our challenges, but only for a short time; to me it looked like a compromise, and it would be just a matter of time before the same problems came back. In the table below I tried to highlight some pros and cons of the tools we considered.

For us, Kubernetes turned out to be the preferred option, suitable both for the current requirements and for the application's further development plans. The cloud we use offers Kubernetes as a managed service, which was another point in favor of this orchestrator.

However, each of the tools above has its own niche, and it would be wrong to assume that Kubernetes is the best fit for every application. More often than not, simpler tools fully meet the requirements, especially for apps in the early stages of their life.

Changes in application architecture

Before diving into the details of Kubernetes and running the service on it, I would like to talk about changes made to the Picvario code itself. For the application to work correctly in the new environment, the file-handling logic was revised together with the development team, along with a few other changes.

The diagram shows how file loading and processing are arranged in the application.

The Kubernetes migration raised the following questions:

  • How do we give Celery workers access to the original file if they aren't running on the same node as the backend?
  • How should merging of an uploaded file be handled (steps №2 and №3 in the diagram), now that the backend has several replicas and the file's parts end up in different pods?

The most obvious solution was to provide shared disk space between containers, as before. I read about Persistent Volumes in Kubernetes and tried to set one up and attach it to the pods. The result turned out to be different from what I expected: instead of one shared PV, a separate volume was created for each pod. Digging into this, I learned about the access modes parameter, whose default value, ReadWriteOnce, allows a volume to be mounted by only a single node. What I wanted was ReadWriteMany, which allows a volume to be mounted read-write by several nodes. As it turned out, our cloud provider didn't support creating volumes with this access mode. There were other options as well, such as deploying and maintaining our own NFS cluster and mounting it into the pods.
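For reference, here is a sketch of the kind of PersistentVolumeClaim I was after. The claim name, size, and storage class are made up for illustration, and ReadWriteMany only works if the underlying storage class (for example, an NFS or CSI driver) supports it.

```yaml
# PVC sketch: shared read-write storage across several nodes.
# Name, size, and storageClassName are hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-media
spec:
  accessModes:
    - ReadWriteMany        # the common default, ReadWriteOnce, allows a single node only
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-shared
```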

In the end, after discussing it with the development team, we decided to go a different way and use an S3 bucket to address the problem:

  • The backend uploads the original file to S3 storage, and Celery workers then download it from there.
  • Instead of merging file parts locally, we use S3 multipart upload: each backend pod immediately uploads its part to the S3 bucket, and the parts are assembled into a single file on the S3 side (see the sketch below).
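As an illustration of the second point, here is a minimal boto3 sketch of an S3 multipart upload. The bucket and key names and the chunk data are placeholders; the real backend works with parts received from clients rather than hard-coded byte strings.

```python
# S3 multipart upload sketch (bucket, key, and data are hypothetical).
# Parts can be uploaded independently, e.g. by different backend pods;
# S3 assembles them into one object on complete_multipart_upload.
import boto3

s3 = boto3.client("s3")
bucket, key = "picvario-uploads", "assets/original/video.mp4"

upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

# Every part except the last must be at least 5 MiB.
chunks = [b"\x00" * (5 * 1024 * 1024), b"final part"]

parts = []
for number, chunk in enumerate(chunks, start=1):
    resp = s3.upload_part(
        Bucket=bucket, Key=key, UploadId=upload_id,
        PartNumber=number, Body=chunk,
    )
    parts.append({"PartNumber": number, "ETag": resp["ETag"]})

s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```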

That resolved the asset loading and processing issue, but we had more changes to make. Here are a few more nuances we sorted out while migrating the test environment to Kubernetes:

  • We have many background tasks performed by Celery workers. Previously, they ran in a mostly static environment: a host available 24/7, where containers were interrupted only during scheduled deployments. Under the new conditions, a container's life cycle became fluid: it moves from host to host, gets removed during scale-down, and is stopped when limits are exceeded. We found that some tasks started getting lost: a worker executing a task would be interrupted, and there was no mechanism for another instance to pick the task up. The solution was to change Celery's acks_late parameter. By default, a task is acknowledged and removed from the broker's queue just before it starts; with acks_late, this happens only after the task completes, so if a worker dies mid-task another worker will pick it up. Because our tasks are idempotent, running them again doesn't cause problems (see the config sketch after this list).
  • Django static files were stored on local disk and served via Nginx. We moved this storage to the S3 bucket as well, by changing the STATIC_URL and STATICFILES_STORAGE parameters in Django's settings.py (a settings sketch follows the list). You can read more about moving static files to S3 in the article Storing Django Static and Media Files on Amazon S3.
  • There was no mechanism for the components to report their own status. We had to make sure that traffic doesn't reach container instances that haven't finished initializing or have failed. To do this, the developers added health checks to the components, and I set up readiness and liveness probes for the pods to verify container readiness and health (an example manifest fragment is shown below).
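Here is a minimal sketch of the Celery setting mentioned in the first point. The app name and broker URL are illustrative, not Picvario's actual configuration.

```python
# acks_late sketch: acknowledge a task only after it finishes, so a task
# interrupted by a pod shutdown is redelivered to another worker.
# App name and broker URL are hypothetical.
from celery import Celery

app = Celery("picvario", broker="redis://redis:6379/0")
app.conf.task_acks_late = True   # ack after completion instead of before start

@app.task
def process_asset(asset_id):
    # Must be idempotent: with acks_late it may run more than once.
    ...
```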
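The static files change from the second point boils down to a couple of lines in settings.py. This sketch assumes the django-storages package as the S3 backend; the article doesn't name the library, and the bucket name and region are placeholders.

```python
# settings.py fragment: serve Django static files from S3 instead of local disk.
# Assumes django-storages; bucket name and region are placeholders.
AWS_STORAGE_BUCKET_NAME = "picvario-static"
AWS_S3_REGION_NAME = "eu-west-1"

STATICFILES_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
STATIC_URL = f"https://{AWS_STORAGE_BUCKET_NAME}.s3.amazonaws.com/"
```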
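Finally, a sketch of what the probes from the third point look like in a Deployment manifest. The endpoint paths, port, and timings are illustrative rather than the actual Picvario values.

```yaml
# Probe sketch for a backend container (paths, port, and timings are hypothetical).
# The readiness probe keeps traffic away from pods that aren't initialized yet;
# the liveness probe restarts containers that stop responding.
containers:
  - name: backend
    image: picvario/backend:latest
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8000
      initialDelaySeconds: 10
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 3
```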

As a result of these improvements and changes, the application components were fully ready to launch and worked correctly in Kubernetes.

In the next article, I will share more about preparing the cluster itself and running the application in it.

If you like what you just read, please hit the green “Recommend” button below so that others might stumble upon this essay.

Loovatech is an outsourced development company. We help startups and businesses create successful digital products. Find us on LinkedIn
