Chaos Experiments under the lens of AIOps

Imagine this: you’re a Site Reliability Engineer (SRE) at a tech giant, responsible for the overall health of a system running in production. Numerous alerts, server crashes, Jira tickets, incidents and an avalanche of responsibilities can sometimes feel like a ticking time bomb. These are just some of the daily struggles an average SRE goes through. But why should it be like that? Well, it shouldn’t - thanks to a term coined by Gartner in 2016. AIOps, meet audience. Audience, meet AIOps.

Let’s extend this scenario. On top of all the above-mentioned issues, our poor SRE needs to watch out for potential security breaches and make sure nothing ever slips through the cracks. However, through proactive experimentation, continuous verification and improvement, the SRE makes sure that the system can withstand the turbulent and malicious times we’re living in. Do these notions ring any bells? They sure do! Chaos Engineering, meet audience. Audience, meet Chaos Engineering.

What’s our angle, you’re wondering? AIOps and Chaos Engineering (CE) are two concepts that are often kept separate. In this talk, we will discuss (and show you!) how combining both practices can significantly increase cyber resiliency while maintaining full end-to-end (E2E) transparency and observability of your entire system.

For this session, we have prepared and analyzed several use cases, followed the main principles, summarized best practices and prepared a live demo that combines CE and AIOps tools.

Above all, we are SREs. As such, during this session we will stay close to the SRE principles and best practices that we used to achieve our goals, e.g. reduce organizational silos, measure everything, learn from failures and analyze changes holistically. As we proceed with our talk, the audience will be able to identify how these relate to AIOps as well as CE, and finally, how it all ties together.

