Unleashing the Power of Cloud-Native Storage: A Journey to Seamlessly Orchestrated Storage on Kubernetes

Fabien Marliac
Peaksys Engineering
Jul 18, 2023 · 8 min read


How we deployed the Rook operator on our Kubernetes clusters to enable orchestration and management of Ceph-based distributed storage, ensuring efficient and reliable data storage.

TL;DR:

Our company embraced cloud-native storage by implementing the on-premises Ceph solution on Kubernetes, which brought advantages in scalability, cost-effectiveness, flexibility, data protection, archiving, and compression. We initially deployed a successful bare-metal Ceph cluster but ran into challenges with maintenance and updates. We therefore created a new cluster, integrating it with Kubernetes using the Rook operator for simplified deployment, scaling, and management. Performance was optimized and lifecycle features were introduced. Challenges with CephFS and RBD mounts were overcome through network configuration and the IPVS proxy mode. The integration of Rook and Ceph on Kubernetes improved storage management, scalability, synchronization, and cost optimization, enabling us to provide our users with a better service and to adapt to evolving needs.

Cloud native architectures

Peaksys, the technology subsidiary of Cdiscount, the French e-commerce leader, is by definition a tech company: 1 billion queries a year on the search engine, over 5,000 servers, 1,000+ orders per minute on Black Friday, and one change in production every 7 minutes.

We operate two geographically distant datacenters in France that are almost exact replicas. Each runs multiple Kubernetes clusters, and the applications hosted on them are load balanced in an active/active setup.

The first question you could ask is: why move to a cloud native storage?

While it largely depends on your organization and maturity, here are some of the answers we got.

Cloud-native software-based storage offers several advantages over traditional storage solutions:

  • Scalability: Cloud-native solutions are designed for scalability, allowing for seamless expansion as storage needs grow. NAS, on the other hand, is limited by the hardware and the capacity of the storage devices.
  • Cost-effectiveness: Cloud native solutions mostly use commodity hardware instead of proprietary storage appliances. This makes it easier to deploy and maintain, and also reduces the total cost of ownership.
  • Flexibility: Commodity hardware and software-based solutions provide flexible volume management, allowing users to add or remove storage as needed. This enables them to allocate storage resources more efficiently, affording better resource utilization.
  • Data protection: It provides better data protection than NAS, with some using checksums to ensure data integrity. It can also offer built-in redundancy, ensuring that data is not lost in the event of hardware failure.
  • Archiving: Long-term archiving is enabled, as most cloud native solutions provide built-in support for data retention policies. This allows users to keep data for longer periods of time, without having to worry about data loss or corruption.
  • Compression: Some also support data compression, which reduces the amount of storage space required for data. This not only reduces storage costs, but also improves the storage system performance.

Bare-metal Ceph: what we needed when we needed it

Without going into detail, our company chose the open-source, software-defined Ceph storage solution.

With over 1.5 billion images, Cdiscount is a heavy consumer of data storage, so we need to anticipate our future needs. Our first goal was to replace NAS usage for our media storage to benefit from the scalability and cost-effectiveness of Ceph.

And it worked. So well, in fact, that it gave us the time we needed to fully embrace the cloud-native world. Indeed, Cdiscount was one of the first French platforms to deploy a bare-metal vanilla Kubernetes cluster.

Our bare-metal cluster worked perfectly, and adoption was so easy with the S3 API that over the last few years we have had to put little effort into managing and maintaining it.

Everything has its pros and cons though, and while the team deployed and used a well-oiled Kubernetes cluster, the Ceph cluster just stayed up to date without any improvements.

Our first cluster went from brand new to legacy

For three years, we did nothing but update the cluster, without having to perform any major maintenance. The downside is that we gradually lost a bit of expertise and stopped tracking our users’ needs.

At this point, our Ceph cluster only provided object storage (through the S3 API), and with the success of Kubernetes, our users’ needs were growing exponentially. Most of our users were keen to get proper stateful support on Kubernetes. In short, statefulness is the ability of a Kubernetes workload to have persistent storage at its disposal (by default, Kubernetes pods are stateless).

Additionally, we faced new challenges as the available cluster space shrank over time. To be more precise, a hybrid cluster mixing SSDs and HDDs of different capacities can run into trouble when it no longer has enough free space.

For all these reasons, we decided to create a new Ceph cluster. Learning from our mistakes, we knew we needed a strong platform, fully integrated with Kubernetes, to keep up with our users’ needs.

Integrating Ceph into Kubernetes for improved data management and scalability

By sheer coincidence, Rook, a Kubernetes operator for Ceph, reached the “graduated” maturity level at the Cloud Native Computing Foundation right when we needed it 😊.

“Rook uses the power of the Kubernetes platform to deliver its services via a Kubernetes Operator for CEPH.”

https://rook.io/docs/rook/v1.11/Getting-Started/storage-architecture/#design

Below are the reasons why we ‘instantly’ adopted it:

  • Ease of deployment: a Kubernetes operator automates the deployment. We simply define our storage requirements in Kubernetes manifests and let Rook take care of the rest (see the manifest sketch after this list).
  • Scale the deployment: we run many Kubernetes clusters, and the operator lets us roll out the same storage deployment across all of them.
  • Ease of updating: the Rook operator also greatly simplifies Ceph upgrades.
  • Scale the resources: need more space? Just add a disk and let Rook take care of the rest.
  • Orchestration and management: Rook provides additional features such as automatic repair and resiliency, which help ensure the reliability of our storage infrastructure.
  • Observability: Rook also provides a rich set of monitoring and alerting capabilities, enabling us to proactively identify and address any issues with our storage infrastructure.
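
To make that concrete, here is a minimal, hypothetical CephCluster manifest of the kind the operator reconciles; the namespace, Ceph image version and disk-selection settings are illustrative values, not our production configuration:

```yaml
# Minimal, illustrative CephCluster manifest for the Rook operator.
# Namespace, image version and storage selection are example values only.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.6   # example Ceph release
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                           # three monitors for quorum
    allowMultiplePerNode: false
  dashboard:
    enabled: true
  storage:
    useAllNodes: true                  # let Rook discover nodes...
    useAllDevices: true                # ...and raw devices; adding a disk grows the cluster
```

Once applied with kubectl, the operator turns this spec into monitor, manager and OSD pods; with useAllDevices enabled, adding a disk to a node is detected and provisioned automatically.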

We still provide the object storage (S3) feature, but better

Creating a new cluster allowed us to fully understand the features we offer, and even to improve them.

S3 synchronization view

Pump up that performance!

We separated the RADOS Gateways (RGWs) used for synchronization from the RGWs used by clients. In doing so, we reduced the workload on each deployment and ensured the synchronization process was not impacted by other workloads. It also makes troubleshooting easier.
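
As a hedged sketch of what such a split can look like with Rook (not necessarily how we wired it), two CephObjectStore resources can be attached to the same multisite zone, one dedicated to synchronization and one serving clients; names, zone and ports are placeholders:

```yaml
# Illustrative only: two object stores in the same zone, one per role.
# Zone, namespace and port values are placeholders.
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: store-sync            # gateway dedicated to multisite synchronization
  namespace: rook-ceph
spec:
  zone:
    name: my-zone             # existing CephObjectZone
  gateway:
    port: 80
    instances: 1              # keep a single sync RGW (see the warning below)
---
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: store-clients         # gateways serving user traffic
  namespace: rook-ceph
spec:
  zone:
    name: my-zone
  gateway:
    port: 80
    instances: 3              # scale client-facing RGWs independently
```

Depending on your Rook and Ceph versions, you may also need to make sure the sync threads run only on the dedicated gateway (for instance via the Ceph rgw_run_sync_thread option), so treat this as a starting point rather than a drop-in configuration.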

Helping the community:

While we’re on the subject of performance, be careful: Kubernetes reflexes can be misleading here. Don’t run more than one pod for the sync RGW, or you’ll get object lock contention instead of faster synchronization (you’re welcome 😊). https://github.com/rook/rook/issues/12272

Take control of the data:

We also took the opportunity to introduce lifecycle features. We can now automatically delete stale data after a specified period, which lets our users take control of their storage space and reduce costs by automatically dropping obsolete data.
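
For illustration, a lifecycle rule can be as small as the policy below, here expressed in the JSON form accepted by the standard S3 lifecycle API (the prefix and the 30-day retention are made-up values):

```json
{
  "Rules": [
    {
      "ID": "expire-stale-objects",
      "Status": "Enabled",
      "Filter": { "Prefix": "tmp/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```

A user can apply it with a standard S3 client, for example aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration file://lifecycle.json, and the RGW then expires matching objects on its own.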

State performance:

This time we decided to standardize our offering, especially regarding cluster performance: we built our own metric tools to monitor synchronization performance between our datacenters (read and write performance were tested with traditional tools).

Test Synchronization tool and standardized SLA

Trouble integrating stateful workloads through CephFS and RBD on Kubernetes

Let’s go further and offer new features!

We decided to provide stateful features by deploying the Ceph File System (CephFS) and the Ceph RADOS Block Device (RBD).
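
As a rough sketch of what exposing RBD to Kubernetes users involves with Rook (CephFS follows the same pattern with a CephFilesystem resource), a block pool plus a StorageClass backed by the Ceph CSI driver is enough for pods to claim persistent volumes; the names, namespace and replica count below are illustrative:

```yaml
# Illustrative RBD pool and StorageClass; names, namespace and replica count are examples.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3                                 # keep three copies of each object
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com     # Ceph CSI RBD driver; the prefix matches the operator namespace
parameters:
  clusterID: rook-ceph
  pool: replicapool
  csi.storage.k8s.io/fstype: ext4
  # The CSI secret parameters from the Rook sample StorageClass are omitted here for brevity.
reclaimPolicy: Delete
```

A PersistentVolumeClaim that references storageClassName: rook-ceph-block then gives a pod a persistent, RBD-backed volume.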

During our deployment of CephFS and RBD on our Kubernetes cluster, we encountered an unexpected problem with the CSI (Container Storage Interface) plugins used to mount the storage on our pods: restarting the CSI plugins caused client pods to lose their CephFS and RBD mounts (and access to their data).

Understanding the HostNetwork Configuration and CSI Plugin Restart

In Kubernetes, the HostNetwork option allows containers to use the host’s network stack. By default, containers use an isolated network stack that provides network connectivity between the containers and other Kubernetes services.
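
For reference, this is what the option looks like on an ordinary pod (a trivial, made-up example, not one of our manifests):

```yaml
# Trivial example: a pod that shares the node's network stack.
apiVersion: v1
kind: Pod
metadata:
  name: host-network-demo
spec:
  hostNetwork: true          # the container sees the host's interfaces and IP addresses
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]
```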

The CSI plugins we use to mount CephFS and RBD volumes require access to the Ceph nodes (which host monitor pods). However, since we did not use the HostNetwork option on the Ceph Monitor deployments, the CSI plugins were unable to communicate with the Ceph Monitors after a restart.

IPVS as Proxy-Mode in kube-proxy

In Kubernetes, kube-proxy is responsible for routing network traffic to the correct Kubernetes service endpoints. It can operate in different modes, such as iptables, IPVS, or userspace. In our Kubernetes cluster, we were using the IPVS proxy-mode, which gives better performance and scalability than the iptables mode.
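
For context, the proxy mode is selected in the kube-proxy configuration; a minimal sketch (the scheduler value is illustrative):

```yaml
# Minimal kube-proxy configuration selecting the IPVS proxier.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"            # round-robin load balancing across endpoints
```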

Adding the --masquerade-all argument for Persistent Routes to Ceph Monitors when HostNetwork is not enabled

Despite the IPVS proxy-mode being an excellent choice for our Kubernetes cluster, it created an unexpected problem with our CephFS and RBD deployments.

To address this issue, we added the --masquerade-all argument to kube-proxy in IPVS mode. This ensures that all source IP addresses in outgoing packets are replaced with the IP address of the node’s primary network interface. In so doing, the persistent routes to the Ceph Monitors were maintained, even after restarting the CSI plugins, allowing the client pods to retain their CephFS and RBD mounts.
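
In configuration-file form, the flag corresponds to the masqueradeAll setting, which sits under the iptables section of the kube-proxy configuration even when the IPVS proxier is selected; this is a sketch, not our exact configuration:

```yaml
# kube-proxy configuration sketch: IPVS mode with masquerade-all enabled.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
iptables:
  masqueradeAll: true        # equivalent of the --masquerade-all command-line flag
```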

Lessons Learned and Future Considerations about Stateful with Rook

One lesson we learned is that if we had installed Rook in hostNetwork mode from the start, we would not have needed to add the --masquerade-all parameter to kube-proxy in order to get both CephFS and RBD to work. However, retrofitting hostNetwork mode after the initial installation requires a complete reinstallation of the Kube cluster, which was not an option for us at the time.
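
For anyone starting from scratch, host networking in Rook is a one-time choice made in the CephCluster spec; a sketch of the relevant fragment (the cluster name and namespace are the usual defaults, shown here for illustration):

```yaml
# Sketch: enabling host networking for the whole Rook-managed Ceph cluster.
# This has to be chosen before the cluster is created; it cannot be toggled on an existing cluster.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  network:
    provider: host           # mons, OSDs and RGWs bind directly to the host network
```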

Deploying and managing storage solutions on a Kubernetes cluster requires a thorough understanding of the underlying technologies and careful consideration of the tradeoffs between performance, reliability, and ease of management. By continuing to learn and adapt, we can ensure that our storage infrastructure meets the evolving needs of our business.

Overcoming Challenges: A Successful Journey to Implement Rook and Ceph on Kubernetes

In conclusion, adopting Rook has revolutionized our storage management and scalability on Kubernetes. By transitioning from our outdated Ceph cluster to a Rook-managed one, we overcame various challenges such as complex cluster administration, synchronization issues, and the limitations of hybrid SSD/HDD setups. Additionally, the introduction of lifecycle-based retention management allowed us to optimize storage usage and significantly reduce costs.

Rook’s seamless integration with Kubernetes empowered us to streamline storage infrastructure management both efficiently and effectively. Leveraging Rook’s adaptable and scalable architecture, we achieved enhanced performance and reliability for our S3 infrastructure, ultimately improving the user experience.

We are highly satisfied with the numerous benefits Rook has delivered, and have full confidence that it will continue to play a vital role in our storage management and scalability on Kubernetes, helping to propel our business forward.
