Breaking and Rebuilding Kubernetes: Understanding Resilience in Production-Ready Clusters

Presenter:

  • Daniele Polencic

Friday, May 9, 2025
1:00 PM UTC

How does Kubernetes actually recover when nodes fail?

What happens to your workloads when the control plane becomes unavailable?

In this webinar, you will explore the real-world resilience of Kubernetes by seeing how a cluster is built from scratch and methodically broken to reveal its recovery mechanisms.

You will learn:

  • The process of bootstrapping a Kubernetes cluster without relying on managed services
  • The inner workings of etcd, the API server, the scheduler, and the kubelet, and how they interact
  • How Kubernetes detects and responds to node failures in real time
  • Why the system takes about 5 minutes, by default, to reschedule pods after a node failure
  • Critical insights about workload distribution when nodes rejoin the cluster
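
As a rough illustration of where the 5-minute figure comes from, the sketch below adds up the two delays that usually dominate: the time before the control plane marks a silent node as unhealthy, and the default 300-second toleration that pods receive for the not-ready and unreachable node taints. The names `EvictionTimings` and `seconds_until_eviction` are hypothetical, and the 40-second grace period is an assumed value that depends on the Kubernetes version and cluster configuration.

```python
from dataclasses import dataclass


@dataclass
class EvictionTimings:
    # Assumption: seconds without a kubelet heartbeat before the node is
    # marked unhealthy. The real value depends on version and configuration.
    node_monitor_grace_seconds: int = 40
    # Default tolerationSeconds that pods receive for the
    # node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints.
    toleration_seconds: int = 300


def seconds_until_eviction(timings: EvictionTimings) -> int:
    """Approximate delay between a node going silent and its pods being evicted."""
    return timings.node_monitor_grace_seconds + timings.toleration_seconds


if __name__ == "__main__":
    defaults = EvictionTimings()
    total = seconds_until_eviction(defaults)
    print(f"Approximate time until eviction: {total} s ({total / 60:.1f} min)")
```

Under these assumptions, lowering `tolerationSeconds` on a pod shortens the second term only; detecting the failure still takes the grace period.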

By the end of the session, you will:

  • Have a deep understanding of Kubernetes' fault-tolerance mechanisms
  • Understand how to evaluate cluster resilience through controlled experiments
  • Learn troubleshooting approaches for common failure scenarios
  • Develop practical insights for designing robust production deployments

👤 Who is this for? DevOps engineers, SREs, platform engineers, and Kubernetes administrators who want to move beyond theoretical knowledge and gain deeper insights into Kubernetes' resilience features.

🧑🏻‍🏫 Who is the speaker? Daniele is an instructor at Learnk8s, teaching Kubernetes and containers to small and large enterprises.

🔗 Useful links

The content presented during this webinar is drawn from the advanced courses that we run at Learnk8s: https://learnk8s.io/training