How does Kubernetes actually recover when nodes fail?
What happens to your workloads when the control plane becomes unavailable?
In this webinar, you will explore the real-world resilience of Kubernetes by seeing how a cluster is built from scratch and methodically broken to reveal its recovery mechanisms.
You will learn:
- The process of bootstrapping a Kubernetes cluster without relying on managed services
- The inner workings of etcd, API server, scheduler, and kubelet and how they interact
- How Kubernetes detects and responds to node failures in real-time
- Why the system takes 5 minutes to reschedule pods after a node failure
- Critical insights about workload distribution when nodes rejoin the cluster
By the end of the session, you will:
- Have a deep understanding of Kubernetes' fault-tolerance mechanisms
- Understand how to evaluate cluster resilience through controlled experiments
- Learn troubleshooting approaches for common failure scenarios
- Develop practical insights for designing robust production deployments
👤 Who is this for? DevOps engineers, SREs, platform engineers, and Kubernetes administrators who want to move beyond theoretical knowledge and gain deeper insights into Kubernetes' resilience features.
🧑🏻🏫 Who is the speaker? Daniele is an instructor at Learnk8s, teaching Kubernetes and containers to small and large enterprises.
🔗 Useful links
- Troubleshooting Kubernetes deployments
- Kubernetes instance calculator
- Learn Kubernetes Weekly newsletter
The content presented during this webinar is drawn from the advanced courses that we run at Learnk8s https://learnk8s.io/training