Breaking and Rebuilding Kubernetes: Understanding Resilience in Production-Ready Clusters

How does Kubernetes actually recover when nodes fail?

What happens to your workloads when the control plane becomes unavailable?

In this webinar, you will explore the real-world resilience of Kubernetes by seeing how a cluster is built from scratch and methodically broken to reveal its recovery mechanisms.

You will learn:

The process of bootstrapping a Kubernetes cluster without relying on managed services
The inner workings of etcd, API server, scheduler, and kubelet and how they interact
How Kubernetes detects and responds to node failures in real-time
Why the system takes 5 minutes to reschedule pods after a node failure
Critical insights about workload distribution when nodes rejoin the cluster

By the end of the session, you will:

Have a deep understanding of Kubernetes' fault-tolerance mechanisms
Understand how to evaluate cluster resilience through controlled experiments
Learn troubleshooting approaches for common failure scenarios
Develop practical insights for designing robust production deployments

👤 Who is this for? DevOps engineers, SREs, platform engineers, and Kubernetes administrators who want to move beyond theoretical knowledge and gain deeper insights into Kubernetes' resilience features.

🧑🏻‍🏫 Who is the speaker? Daniele is an instructor at Learnk8s, teaching Kubernetes and containers to small and large enterprises.

🔗 Useful links

The content presented during this webinar is drawn from the advanced courses that we run at Learnk8s https://learnk8s.io/training