An SRE's main task is to keep operations up and running. An SRE working with Kubernetes faces many challenges in keeping resilience at the desired level and improving it over time. In this talk we will go through techniques to measure and improve the resilience of Kubernetes platforms in a cloud-native way.
The number of microservices in a Kubernetes environment can easily grow into the hundreds. With continuous upgrades of these microservices and of the Kubernetes platform itself, a system is needed to measure the resilience of the deployment. SREs need to practice chaos engineering in a cloud-native way that is easily manageable and reuses as many chaos experiments and workflows as possible. This talk is intended for SREs who would like to practice, or are already practicing, chaos engineering in their environments. We will introduce a chaos hub for SREs and discuss how to construct complex chaos workflows using the Litmus and Argo projects. A live demo will take the audience through the construction of an end-to-end chaos workflow involving a Kubernetes node failure, a CPU hog, and network slowness in an e-commerce application, and will show how resilience is measured and monitored during this process.