Different Kinds of Managed Kubernetes

Explore a new World with Kubernetes

Artem Lajko
ITNEXT


Note: This is not a tutorial on how to set up clusters with Cluster-API. It is an overview and an entry point into a new universe. I am not an expert on Cluster-API, and some of my statements may be wrong. My goal is to show that Kubernetes is not just Kubernetes, what that means if you have established (or are trying to establish) a platform team, an IDP, etc., and to show what is available on the market and how the solutions differ from each other.

Introduction

With years of hands-on experience under my belt, I felt confident in my understanding of Kubernetes and the myriad solutions it has spawned. However, the advent of Cluster-API felt akin to uncovering a hidden passageway, leading me to a whole new expanse of possibilities that I hadn't explored before. This revelation has reshaped my perception of container orchestration and opened up avenues for innovation and optimization that were previously off my radar. Don't get me wrong: I've known Cluster API for a while and have been working with an implementation of it (TKGs and Tanzu) on my own for over three years. But I wasn't aware of its scope.

In this blog, I’ll employ high-level abstractions for my diagrams because the intricacies of implementation are not our focus here. The solutions may seem similar at a glance, but trust me — they are distinctively different, and we’re about to uncover just how much.

Choosing the Right Kubernetes Strategy for Your Organization

Why am I sharing this? I’m not being compensated to promote products from various companies. My task, as assigned internally, is to survey the market for products that can benefit our platform and foster its growth. After interacting with multiple companies, I now have a clearer perspective, and I appreciate their insights.

Here are the key criteria:

  • Continuous 24/7 support for any platform incidents.
  • Avoidance of vendor lock-in with respect to Kubernetes, if feasible.
  • The solution should represent the latest in industry best practices.
  • An application service catalog tailored to our platform’s context should be available, offering self-service, built-in maintenance, and tested versions among other features.
  • A user interface that integrates with SSO for managing clusters based on role concepts, such as projects.
  • The capability to set templates so that our development team can simply select a relevant template, for instance, to obtain a Kubernetes cluster along with an external DNS (this would be an added advantage).
  • Security and compliance measures should be stringent and guided by established policies.

Managed Kubernetes Service (AKS)

A Managed Kubernetes Service is a cloud-based offering in which the cloud provider handles the underlying infrastructure, setup, and operational aspects of a Kubernetes cluster, allowing users to focus primarily on deploying and managing their applications within the cluster.

Azure Kubernetes Service (AKS) is a managed container orchestration service provided by Microsoft Azure. I will use it as a simple example that lets me draw a clear distinction from the solutions that follow.

Comparison Keypoints for Azure Kubernetes Service:

  • Vendor-Lock: High (the more integration you use, the higher the lock)
  • Costs: Cluster Management (Standard tier $0.10 per cluster per hour), Nodes (pay as you go)
  • Application-Service-Catalog: partial (GitOps with Flux, monitoring, eBPF, etc.)
  • Self-Service: Azure API and UI
  • Required Skill-Set | Entry-Level: Azure Basics | Beginner, Starter
  • Type: fully-managed
  • Identity and Access Management: Azure Active Directory (now Microsoft Entra ID), SSO/OIDC integration via an Enterprise Application.

Opting for AKS (Azure Kubernetes Service) entails committing to a specific hyperscaler and cloud provider, potentially resulting in a vendor lock-in. While AKS simplifies many aspects of Kubernetes management, it also means you’re fully reliant on Microsoft’s Cluster Provider and its implementation. This can limit customizations, such as directly configuring the Control Plane for specifics like RBAC Authorization. Therefore, it’s crucial to weigh the ease of use against the flexibility and control requirements before deciding on AKS.

I’ve been working with AKS for over three years, and the quality of the service has improved significantly. Microsoft has made significant strides to offer an accessible entry point while maintaining high standards for security, governance, and compliance.

PS: Companies such as iits have gone beyond creating scalable platforms; they’ve developed environments grounded in AKS or Cloud Container Engine (CCE) that furnish developers with comprehensive tools, enabling them to concentrate solely on their development work.

Cluster-API

The Kubernetes Special Interest Groups, or SIGs, are communities within the Kubernetes project focused on specific aspects or areas. One of these SIGs is the SIG Cluster-API.

Here are some key points about Cluster-API:

  1. Purpose: The primary goal of SIG Cluster-API is to simplify cluster lifecycle management in the Kubernetes ecosystem by defining a declarative, Kubernetes-style API for cluster creation, configuration, and management.
  2. Cluster-API Project: At the core of this SIG is the Cluster-API project, which defines custom resources (Cluster, Machine, MachineSet, etc.) and controllers to manage the lifecycle of Kubernetes clusters across different infrastructure providers.
  3. Kubernetes Native: Everything in Cluster-API is designed to be Kubernetes-native. This means that managing cluster lifecycles can be done using familiar kubectl commands and manifests, as sketched right after this list.
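
To make that concrete, here is a minimal, hedged sketch of the kind of manifests Cluster-API works with. Cluster and MachineDeployment are real Cluster-API kinds, but the names, the Kubernetes version, and the Docker (CAPD) infrastructure provider referenced below are illustrative assumptions; a real setup also needs matching bootstrap, control-plane, and infrastructure objects.

```yaml
# Sketch only: a Cluster plus a MachineDeployment for worker nodes.
# Names and the referenced provider kinds are illustrative assumptions.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: demo-cluster                # hypothetical cluster name
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:                  # e.g. a KubeadmControlPlane object
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: demo-cluster-control-plane
  infrastructureRef:                # provider-specific cluster object (Azure, AWS, vSphere, ...)
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: DockerCluster             # assumption: CAPD used purely for illustration
    name: demo-cluster
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: demo-cluster-md-0
spec:
  clusterName: demo-cluster
  replicas: 3                       # desired number of worker nodes
  template:
    spec:
      clusterName: demo-cluster
      version: v1.28.0              # Kubernetes version for the workers (placeholder)
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: demo-cluster-md-0
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachineTemplate
        name: demo-cluster-md-0
```

Applied with a plain kubectl apply -f, the management cluster's controllers reconcile these objects just like Deployments or Services.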

Within this range of products, several utilize the most recent versions of Cluster API, including VMware Tanzu, Spectro Cloud, Giant Swarm, and Airship. Conversely, products like OpenShift, Anthos, and Kubermatic are based on forks from earlier Cluster API versions. However, all these products share a common principle: leveraging Kubernetes to manage Kubernetes.

SIG Cluster-API is community-driven. They have regular meetings, discussions, and collaborations on design proposals and implementations to drive the project forward. As you can see, it has a strong community backed by notable companies.

Contributors' overview over the last 5 years

What exactly does Cluster-API do? As previously mentioned, the concept revolves around using Kubernetes to manage Kubernetes. When you initialize your cluster, regardless of the solution you choose, this initialization step converts a regular Kubernetes cluster into a management cluster running the Cluster-API operator. The subsequent diagram illustrates this process.

After the initialization step, the operator will manage, or allow you to bootstrap, different clusters on different target providers like:

Cluster-API Concepts

I will use this simplified diagram for the next steps.

Next, we will look at different solutions that use Cluster-API as an upstream project and the different benefits they derive from it.

VMware Kubernetes Solution (vSphere with Tanzu)

VMware’s Kubernetes solution, specifically vSphere with Tanzu, integrates Kubernetes directly into the vSphere platform. The underlying mechanism for this Kubernetes integration and management is heavily influenced by the Cluster API (CAPI) project.

With the introduction of vSphere with Tanzu, VMware has embedded Kubernetes into the vSphere control plane. This allows developers to deploy and manage Kubernetes clusters right from vSphere, turning vSphere into a Kubernetes-native platform.

How it Works with CAPI:

  • Management Cluster or Supervisor Cluster: vSphere with Tanzu has a management cluster, which is responsible for the creation and management of other Kubernetes clusters.
  • Workload Clusters: When a user desires a new Kubernetes cluster, they define it using the Cluster API's CRDs. The management cluster's controllers then see this desired state and begin the provisioning process within vSphere.

This is what happens when a cluster admin or platform engineer applies a TanzuKubernetesCluster custom resource.
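
For illustration, here is a hedged sketch of such a TanzuKubernetesCluster resource. The kind and API group exist in vSphere with Tanzu, but the exact API version and field names vary between releases (v1alpha1 vs. v1alpha3), and the release version, VM classes, and storage class below are placeholders from a typical setup, not a copy of any real environment.

```yaml
# Sketch only: field names differ between TKGS releases; values are placeholders.
apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: dev-cluster                # hypothetical guest cluster name
  namespace: dev-namespace         # vSphere Namespace the developers were granted
spec:
  distribution:
    version: v1.23                 # Tanzu Kubernetes release to deploy (placeholder)
  topology:
    controlPlane:
      count: 3                     # number of control plane nodes
      class: best-effort-small     # VM class defined by the vSphere admin (assumption)
      storageClass: vsan-default   # storage policy exposed to the namespace (assumption)
    workers:
      count: 3                     # number of worker nodes
      class: best-effort-medium
      storageClass: vsan-default
```

Once applied, the Supervisor Cluster's controllers translate this into the underlying Cluster-API objects and provision the guest cluster.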

The controller will deploy a guest cluster and, depending on your RBAC setup and integration into your environment, the developer will get access to the guest cluster. The guest cluster consists of control planes and worker nodes.

VMware distinguishes between the roles. The different roles do not have visibility or control over each other's environments:

Here, you can observe a deeply integrated utilization of the Cluster-API, specifically with the vSphere provider.

Certainly, while vSphere with Tanzu has made significant strides in integrating Kubernetes with vSphere, there are some challenges or disadvantages associated with it, such as:

  1. Limited Customization: While vSphere with Tanzu aims to provide a comprehensive solution, it might not allow for as much customization or flexibility as a hand-crafted, vanilla Kubernetes setup or other third-party solutions.
  2. Dependency on vSphere: While the Cluster API promotes a cloud-agnostic approach to Kubernetes, using vSphere with Tanzu ties you to the VMware ecosystem. This could limit flexibility if you want to use other cloud providers or infrastructures.

Comparison Keypoints for vSphere with Tanzu:

  • Vendor-Lock: High
  • Costs: depends on your vSphere Setup (HAProxy Mode or NSX-T)
  • Application-Service-Catalog: yes, the Tanzu Mission Control Catalog
  • Self-Service: Cluster-API Provider for vSphere and UI
  • Required Skill-Set | Entry-Level: Kubernetes Basics | Beginner, Starter
  • Type: fully-managed, leverages Cluster-API
  • Identity and Access Management: Active Directory, LDAP, etc. (depends on your infrastructure)

I’ve been working with VMware Tanzu for over two years, and the caliber of the service has seen marked enhancements. VMware has not only ensured an approachable entry point but also upheld stringent standards for security, governance, and compliance. Additionally, they’ve been significant contributors to the Cluster-API project, and it’s evident in the evolution of their product, such as the introduction of node pools.

Kubernetes based on SIG Cluster-API (100% Downstream)

In this part, I am going to share different approaches showing how various companies implement their solutions 100% based on Cluster-API as the upstream project.

— — — — — — — — — — — — — — — CAPZ — — — — — — — — — — — — — — — — — — —

CAPZ (Cluster API Provider Azure) is an implementation of Cluster API (CAPI) for Microsoft Azure. CAPZ has replaced Microsoft's AKS Engine for self-managed Kubernetes clusters. The Cluster API project is an initiative by Kubernetes SIG Cluster Lifecycle to bring declarative, Kubernetes-style APIs to cluster creation, configuration, and management.

CAPZ implements the CAPI model specifically for Azure. This means translating the CAPI resources (Cluster, Machine, etc.) into Azure-specific actions, such as creating VMs, setting up networking, and so on.

Let’s dive into the components:

  • AzureCluster: Represents the Azure infrastructure of a Kubernetes cluster. This mainly includes networking components, such as the VNet, subnets, and network security groups.
  • AzureMachine: Represents an individual Azure Virtual Machine and its configuration. For a control plane node or a worker node, an AzureMachine resource will be created.
  • AzureMachineTemplate: A template to describe AzureMachine configurations, often used with MachineDeployments to describe a group of similar machines.
  • Other components: There are more resources for Azure specifics, like AzureIdentity to bridge with Azure Active Directory.

You specify Custom Resources, and the Operator handles the tasks. Ultimately, you receive a target cluster comprising Control Planes and Worker Nodes.
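
As a hedged sketch, a CAPZ workload cluster pairs the generic Cluster resource with an AzureCluster that describes the Azure-side infrastructure. The names, region, and network values below are illustrative assumptions, and a complete cluster additionally needs control-plane and machine-template objects.

```yaml
# Sketch only: a Cluster referencing an AzureCluster; all values are placeholders.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: capz-demo
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: capz-demo-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureCluster
    name: capz-demo
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureCluster
metadata:
  name: capz-demo
spec:
  location: westeurope                 # Azure region (assumption)
  resourceGroup: capz-demo-rg          # resource group CAPZ creates/uses
  subscriptionID: "00000000-0000-0000-0000-000000000000"  # placeholder
  networkSpec:
    vnet:
      name: capz-demo-vnet             # VNet the cluster nodes will live in
```

The CAPZ controllers then translate these objects into the corresponding Azure resources, such as the resource group, VNet, VMs, and load balancer.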

  • Vendor-Lock: High
  • Costs: Control planes (pay as you go), worker nodes (pay as you go)
  • Application-Service-Catalog: no
  • Self-Service: Cluster-API Provider Azure
  • Required Skill-Set | Entry-Level: Kubernetes and Azure intermediate | Intermediate
  • Type: fully-managed, leverages Cluster-API
  • Identity and Access Management: Kubernetes RBAC

I don't have real long-term production experience with CAPZ, so I can't tell you how it works in practice or which challenges you will face with this implementation.

— — — — — — — Giant Swarm (Kubernetes Platform) — — — — — —

Note (Update): There seems to have been a change. To avoid vendor lock-in, Giant Swarm has switched from its own operator to a bootstrap process, which in the end uses the Cluster API 100% and thus also the corresponding CRDs and CRs. The principle described below has not changed. The Cluster API operator is rolled out through the bootstrap process. I hope I have reproduced this correctly (no liability); otherwise, contact Giant Swarm directly.

Note-2 (Update 24.11.2023): To avoid any misunderstanding: what you get with Giant Swarm is not a self-managed Kubernetes that you have to take care of yourself, but a fully managed platform.

From a customer perspective, I get a managed Kubernetes solution and much more. As a customer, I focus on my business and the Giant Swarm platform takes care of the rest.

The Kubernetes Platform from Giant Swarm is like Cluster-API on Steroids.

Why? Giant Swarm harnesses the power of the Cluster-API, employing an operator that integrates various Cluster-API implementations such as CAPZ, CAPA, and CAPV (the vSphere provider), among others. This provides a unified application point, enabling the creation of multiple self-managed Kubernetes clusters across diverse cloud provider endpoints, including Azure, AWS, and vSphere.

The unique approach taken by Giant Swarm involves modeling organizations as Custom Resources (CRs). These organizations can be used to cleanly separate resources for different projects, business units, teams, and more, all within a single Giant Swarm management cluster. Management of these is facilitated through the management API provided by Giant Swarm's management clusters. This approach not only offers enhanced isolation through Role-Based Access Control (RBAC) but also delivers greater control overall. Furthermore, it provides the flexibility to deploy to various target cloud providers. The end result is a self-managed Kubernetes cluster comprised of control planes and worker nodes.
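
To make the organization concept concrete, here is a short, hedged sketch of such a CR. Giant Swarm documents an Organization resource, but the exact API group/version and the resulting behaviour may differ between releases; the name below is hypothetical.

```yaml
# Sketch only: an Organization CR separating one business unit's resources.
apiVersion: security.giantswarm.io/v1alpha1   # assumption: group/version may differ
kind: Organization
metadata:
  name: acme-payments                          # hypothetical business unit
# Clusters and apps for this unit then live in an "org-acme-payments" namespace
# on the management cluster, and access is scoped to it via RBAC.
```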

  • Vendor-Lock: Low
  • Costs: Subscription based on vCPU and vRAM. Additionally, you pay the target provider for control planes (pay as you go) and worker nodes (pay as you go).
  • Application-Service-Catalog: yes
  • Self-Service: Giant Swarm Kubernetes Operator (deprecated), Giant Swarm Catalog, App Platform, CLI and UI
  • Required Skill-Set | Entry-Level: Kubernetes Basics | Beginner
  • Type: fully-managed, leverages Cluster-API
  • Identity and Access Management: DEX + OIDC Provider like AAD

Giant Swarm offers a range of solutions tailored to your needs. You can opt for a fully managed approach where "we implement and manage everything for you," choose a collaborative model where "we work alongside your team in tandem," or prefer a guided setup with "our assistance in the setup, followed by management by your team." Whichever path you select, rest assured that 24/7 support for the management cluster is provided by Giant Swarm. Beyond just a Kubernetes platform, you'll also benefit from an App Platform and a comprehensive catalog of apps. You can define your own catalogs or use the two already provided: giantswarm (Giant Swarm Catalog) contains applications that they know how to manage; giantswarm-playground (Playground) contains applications that they have integrated but do not manage. Moreover, it feels like Giant Swarm has focused their platform approach on "SRE as a Service."
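
The App Platform follows the same declarative pattern: an App custom resource points at a catalog entry and a target cluster. The sketch below reflects my assumption of the typical shape (catalog, app name, version, and namespaces are placeholders), not an exact copy of Giant Swarm's current API.

```yaml
# Sketch only: deploying a catalog application into a workload cluster.
apiVersion: application.giantswarm.io/v1alpha1   # assumption: group/version may differ
kind: App
metadata:
  name: ingress-nginx
  namespace: org-acme-payments                   # hypothetical organization namespace
spec:
  catalog: giantswarm              # the managed Giant Swarm catalog
  name: ingress-nginx-app          # placeholder entry from that catalog
  namespace: kube-system           # target namespace inside the workload cluster
  version: 3.0.0                   # placeholder version
  kubeConfig:
    inCluster: false               # deploy into a workload cluster, not the management cluster
```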

— — — — — — — — — — CLASTIX (Kamaji) — — — — — — — — — — — —

Kamaji's implementation of the Cluster-API is entirely open source, free of charge, and a 100% downstream version. With Kamaji, CLASTIX illustrates the ease of scaling control planes on a management cluster. This not only facilitates rapid scaling but also helps reduce costs, as it conserves the OS resources typically consumed when control planes run on VMs.

As you can see, you are tasked with creating the worker nodes and enabling them to join the respective tenants. The tenant produces the essential join key, and you apply it to the worker VMs.
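
A hedged sketch of the central Kamaji resource: a TenantControlPlane describes one tenant's control plane, which Kamaji then runs as pods on the management cluster. The kind and API group exist in Kamaji, but the exact fields and values below are assumptions based on a typical configuration, not a verified manifest.

```yaml
# Sketch only: a tenant control plane running as pods on the management cluster.
apiVersion: kamaji.clastix.io/v1alpha1
kind: TenantControlPlane
metadata:
  name: tenant-a
  namespace: tenants
spec:
  controlPlane:
    deployment:
      replicas: 2                  # API server / controller-manager / scheduler pods
    service:
      serviceType: LoadBalancer    # how worker VMs and users reach the API server
  kubernetes:
    version: v1.28.0               # the tenant's Kubernetes version (placeholder)
  networkProfile:
    port: 6443                     # API server port exposed to the worker VMs
  addons:
    coreDNS: {}                    # addons Kamaji installs into the tenant
    kubeProxy: {}
```

Workers created elsewhere (VMSS, Auto Scaling Groups, plain VMs) then join this control plane using the join material the tenant exposes.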

  • Vendor-Lock: Low
  • Costs: FOSS (truly open source). Additionally, you pay the target provider for control planes (pay as you go) and worker nodes (pay as you go).
  • Application-Service-Catalog: no
  • Self-Service: CLASTIX Kamaji Operator
  • Required Skill-Set | Entry-Level: Kubernetes Basics/Intermediate | Beginner/Advanced
  • Type: fully-managed (Controlplanes), Worker will be managed by yourself “Bring your own Worker”, leverages Cluster-API
  • Identity and Access Management: Kubernetes RBAC

When you contrast its performance with the time taken to create a nodepool on various cloud providers, you’ll see a notable difference. With Kamaji’s approach, the TenantControlPlane launches in mere seconds, followed swiftly by a quick setup of VMSS (for Azure) or Auto Scaling Groups (for AWS).

For those whose applications or infrastructure operate on an event-driven basis and are time-sensitive, Kamaji could be an ideal solution.

There are multiple methods to create VMs and subsequently join them to the tenant. Among these methods are using kubeadm or the Cluster-API.

Additionally, various combinations can be explored, such as:

  • Crossplane with VMSS and a custom joining script.
  • Flux with the Terraform Operator and a custom joining script.

CLASTIX showcases the scalability and efficiency of their solution on their website, highlighting a benchmark where 100 control planes were rapidly scaled against a single datastore instance (etcd), optimally utilizing the resources of the management cluster.

Forks from Cluster API versions (100% Downstream to specific version)

"Forks from Cluster API versions" means creating a divergent, separate branch from a particular Cluster API version, for example for customization, experimentation, or specific use cases. Saying a fork is "100% downstream of a specific version" means it is entirely derived from that version of the Cluster API and maintains a downstream relationship to it, which can help ensure compatibility and consistency with that version while tailoring the fork to specific needs or requirements.

— — — — — — — — Kubermatic Kubernetes Platform — — — — — — — —

Note (Update 30.11.2023): To prevent any potential misunderstanding, it's important to clarify that the Kubermatic Kubernetes Platform is a versatile product. It can be self-hosted as open source, providing a feature set available without the need for enterprise licensing; enterprise features are accessible through licensing for those seeking additional capabilities. KKP itself is not the "we manage it for you" Managed Kubernetes Service that Kubermatic also offers.

Note-2 (Update 30.11.2023): Kubermatic had already implemented a similar solution and gained experience in this area before the Cluster API was available. They were therefore able to benefit from their experience and incorporate it into the Cluster API at an early stage.

Note-3 (Update 28.01.2024): The SSH connection is optional, for end users who want access to the worker VMs via SSH, e.g. for debugging. For the bootstrapping and management of workers, an SSH connection is not required. Every machine the machine-controller creates is immutable and created via cloud-init bootstrapping. So from the network perspective, only HTTPS access from the workers to the seed cluster is required.

If I understood the story right, Kubermatic started very early with the first version of Cluster-API. That version did not have the necessary implementation to allow a seamless self-managed Kubernetes approach, so Kubermatic forked the project and extended the Cluster-API with a machine-controller.

The Kubermatic machine-controller is an open-source Cluster API implementation (a hedged example follows the list below) that takes care of:

  • Creating and managing instances for worker nodes
  • Joining worker nodes to a cluster
  • Reconciling worker nodes and ensuring they are healthy
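
As an illustration, worker nodes are described declaratively through MachineDeployment objects from the forked cluster.k8s.io API that the machine-controller reconciles. The provider, operating system, and sizing values below are hedged placeholders rather than a copy of a real Kubermatic manifest.

```yaml
# Sketch only: a MachineDeployment the Kubermatic machine-controller turns into worker VMs.
apiVersion: cluster.k8s.io/v1alpha1    # forked API group used by the machine-controller
kind: MachineDeployment
metadata:
  name: workers-eu
  namespace: kube-system
spec:
  replicas: 3                          # desired number of worker nodes
  template:
    spec:
      versions:
        kubelet: 1.28.0                # kubelet version for the workers (placeholder)
      providerSpec:
        value:
          cloudProvider: azure         # target provider (assumption)
          cloudProviderSpec:
            location: westeurope
            vmSize: Standard_D4s_v3    # placeholder VM size
          operatingSystem: ubuntu
          operatingSystemSpec:
            distUpgradeOnBoot: false
```

The machine-controller creates the VMs via the provider API, bootstraps them with cloud-init, and joins them to the user cluster whose control plane runs in the seed cluster.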

KKP architecture has two main components:

  • Master Cluster: This runs the main KKP components, such as the Kubermatic Dashboard, the Kubermatic API, and the Kubermatic controller manager.
  • Seed Cluster: This runs the control planes of the user clusters, including the API server, scheduler, machine-controller, etc.

This is a simplified overview; for a more in-depth look at the architecture, consider checking out the Kubermatic blog.

The seed cluster contains essential components, including a unique feature called the "machine-controller" in its namespace. This controller is notable for adapting and extending Cluster-API Machine CR management in a specific manner. Contrary to previous understanding, the configuration and management of worker node VMs don't rely primarily on SSH access. Instead, SSH connections are optional and mainly used for tasks like debugging. The machine-controller creates immutable machines through cloud-init bootstrapping, requiring only HTTPS access from the workers to the seed cluster for network and joining operations. This approach facilitates efficient management of worker nodes and control planes, optimizing resource utilization. The control planes run as pods on the management cluster, so you save the cost and resources of spinning up VMs that act as control planes.

  • Vendor-Lock: Medium
  • Costs: For enterprise features, subscription based on vCPU and vRAM. Additionally, you pay the target provider for control planes (pay as you go) and worker nodes (pay as you go).
  • Application-Service-Catalog: yes (enterprise app store)
  • Self-Service: KKP Operator, CLI and UI
  • Required Skill-Set | Entry-Level: Kubernetes Basics | Beginner
  • Type: fully-managed, leverages Cluster-API specific version and extend it (but still 100% downstream)
  • Identity and Access Management: OIDC Account + RBAC

Worth mentioning specifically:

  • Only VMs or a cloud API are needed. The VMs are bootstrapped (via cloud-init; SSH keys are optional, see the note above), and after that the operator does its work to join the VMs as nodes to the cluster whose control plane runs on the seed cluster.
  • The control planes run as pods on the management cluster (saving your resources and your wallet).

So you get something like a combination of the Kamaji approach, extended by a managed way of joining the worker nodes. I don't know how the two products benchmark against each other, and I don't want to compare them directly. Kubermatic allows 6000+ seed clusters with only one master cluster. My intention is to explain the concepts behind the companies' products as well as I can.

Final Thoughts

The prevailing question today is no longer whether to adopt Kubernetes, but rather which managed solution aligns with your organization's long-term strategy. Fully managed services such as AKS, EKS, GKE, and others offer tightly integrated cloud ecosystems, whereas if cost savings, avoiding vendor lock-in, and more flexibility are your priorities, you might lean towards self-managed Kubernetes solutions.

No single solution fits all use cases. The choice depends on your organization’s long-term goals, as well as the capacity and skill set of your teams. It also hinges on critical factors such as:

  • Maintenance requirements
  • Utilizing Kubernetes service catalog as an Internal Developer Platform (IDP)
  • Availability of 24/7 support
  • Service Level Agreements (SLAs), time-to-market considerations, and the hardened service catalog necessary for your operations
  • Governance, compliance, and more

The decision also influences the role of Platform Engineering or IDP within your organization. Consider asking yourself:

  • How many engineers do you currently have, and how many will you need?
  • Do you prefer a Platform Engineering approach or an IDP approach?
  • What aspects should be managed by a service provider?
  • How does demographic change affect your industry?

If you compare the different implementations of the Cluster-API to a managed Kubernetes solution like AKS, I would sum up each solution in one sentence:

  • vSphere with Tanzu: This solution is designed to integrate a managed Kubernetes service into the VMware vSphere ecosystem, facilitating a seamless workflow.
  • Kamaji: It’s an open-source solution that prioritizes rapid scaling and high performance while being resource-efficient. It empowers developers to create custom solutions and contribute to the project.
  • Kubermatic Kubernetes Platform: This platform emphasizes resource conservation and offers robust integration with various environments, including on-premises and edge computing, where hardware may limit resources.
  • Giant Swarm: The focus here is on avoiding vendor lock-in, maintaining compatibility with the latest Cluster-API versions, and providing a strong emphasis on Site Reliability Engineering (SRE). It features high integration with managed services and allows development teams to deploy through a service catalog.

Each solution has distinct advantages tailored to different operational needs and strategic objectives.

Contact Information

If you have questions, would like to have a friendly chat, or just want to network so you don't miss any topics, then don't use the comment function on Medium; feel free to add me to your LinkedIn network!
