Abstract
Modern cloud systems are becoming increasingly complex, making traditional failure testing methods inefficient and reactive. AI-driven Chaos Engineering introduces automation, intelligence, and adaptability to fault injection, enabling predictive failure detection and self-healing capabilities. By leveraging machine learning, SREs can identify failure patterns, optimize chaos experiments dynamically, and accelerate incident response. This talk explores how AI enhances Chaos Engineering, reducing downtime, improving resilience, and enabling proactive reliability strategies. Attendees will gain insights into real-world implementations of AI-powered failure testing and how to integrate it into their SRE practices.
Transcript
Hello everyone.
Welcome to my talk.
My name is Rahul Amte and I work as a cloud architect.
I've been working in IT and in cloud technologies for more than a decade now.
Today my topic is AI and chaos engineering, and how we can implement smarter failure testing for resilient systems.
We'll start with an introduction.
AI is transforming chaos engineering, moving from manual failure testing to intelligent, predictive, and automated resilience validation.
We'll cover foundational concepts and then look into AI integration
and real world use cases.
So what is our agenda today?
This is the flow.
We'll begin with chaos engineering and its evolution, then explore why and how AI enhances it.
We'll go through the architecture, tools, and case studies, and close with best practices and future trends.
With that, we'll move on to explain what is chaos engineering.
Can you imagine your systems going down at 3 AM in the night, but coming back by themselves, recovering on their own?
What if there is no human intervention needed?
And before that, what if you could test the whole scenario?
That, in simple words, is chaos engineering.
Chaos engineering involves running controlled experiments
to test the system's ability to withstand turbulent conditions.
Netflix was the first one to introduce this to ensure system reliability,
especially for distributed systems.
Next, we'll move on to core principles for chaos engineering.
These principles guide the practice, so we start with a steady state hypothesis,
introduce real world disruptions, run in production, automate the process, and
always minimize the blast radius.
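To make these principles a bit more concrete, here is a minimal Python sketch of what such a controlled experiment could look like. The health endpoint, instance names, and recovery window are purely illustrative assumptions, not part of any specific tool.

```python
import random
import time
import urllib.request

SERVICE_URL = "http://localhost:8080/health"  # hypothetical service endpoint

def steady_state_ok(url: str = SERVICE_URL, timeout: float = 2.0) -> bool:
    """Steady-state hypothesis: the health endpoint answers 200 within 2 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def inject_fault(candidates: list) -> str:
    """Introduce a real-world disruption on ONE instance to keep the blast radius small.
    Placeholder: in practice this would call your chaos tool or cloud API."""
    victim = random.choice(candidates)
    print(f"[chaos] simulating failure on {victim}")
    return victim

def run_experiment() -> None:
    # 1. Verify the steady state before doing anything (minimize risk).
    if not steady_state_ok():
        print("System not in steady state; aborting experiment.")
        return
    # 2. Inject a single, small fault (minimal blast radius).
    inject_fault(["instance-a", "instance-b", "instance-c"])
    # 3. Give the system time to self-heal, then re-check the hypothesis.
    time.sleep(30)
    print("Recovered" if steady_state_ok() else "Hypothesis violated: investigate")

if __name__ == "__main__":
    run_experiment()
```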
Let's now talk about the evolution of chaos engineering.
How did it start?
With Netflix's Chaos Monkey.
From a single tool to a full ecosystem of tools like Gremlin and Litmus Chaos, chaos engineering has matured into a disciplined engineering practice.
It is now integrated with CI/CD pipelines and operational workflows to induce chaos and test systems.
Going on to the limitations.
So what are the limitations of traditional chaos engineering?
Traditional methods are limited.
They rely heavily on manual scenario selection and lack automation in analyzing results.
This slows down learning and scale, and this is where AI can jump in and help solve the problem.
We'll move on to why AI in chaos engineering.
How does it help?
AI can actually enhance chaos engineering by making predictions, generating dynamic test scenarios, and analyzing system behavior in real time, reducing the manual overhead significantly.
But what are the key AI capabilities for resilience engineering?
We'll look into that.
Key capabilities include anomaly detection through machine learning, forecasting component failures, and reinforcement learning for scenario planning.
And of course NLP for analyzing incident reports.
Incident reports, as in data, are definitely a key AI capability for chaos engineering.
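As a rough illustration of the anomaly detection capability, here is a small sketch using scikit-learn's IsolationForest on synthetic latency data; the data, contamination rate, and values are assumptions for the example only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic latency samples in milliseconds: mostly normal, a few failure-like spikes.
rng = np.random.default_rng(42)
normal = rng.normal(loc=120, scale=15, size=(500, 1))   # typical latencies
spikes = rng.normal(loc=900, scale=100, size=(5, 1))    # outliers worth investigating
latencies = np.vstack([normal, spikes])

# Fit an unsupervised anomaly detector on the observability data.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(latencies)   # -1 = anomaly, 1 = normal

anomalies = latencies[labels == -1].ravel()
print(f"Flagged {len(anomalies)} anomalous latency samples, e.g. {anomalies[:3]}")
```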
How does the workflow work?
Let's look into AI and the chaos engineering workflow together.
So this is how AI fits in.
From collecting observability data to generating hypotheses, designing chaos experiments, injecting faults, and using real-time analytics for insights and remediation.
The bottom line: your observability data, as in your logging and monitoring data, is a very important component for this.
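Here is a hypothetical sketch of that loop in Python, just to show where each step sits; the service names, metrics, and stubbed functions are assumptions standing in for your real observability stack, AI model, and chaos tool.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    target: str       # e.g. "checkout-service"
    fault: str        # e.g. "inject 500ms latency"
    expectation: str  # e.g. "p99 latency stays under 1s"

def collect_observability_data() -> dict:
    # Placeholder: pull metrics/logs/traces from your monitoring stack.
    return {"checkout-service": {"p99_ms": 480, "error_rate": 0.002}}

def generate_hypothesis(metrics: dict) -> Hypothesis:
    # Placeholder for an AI model that picks the riskiest component.
    riskiest = max(metrics, key=lambda s: metrics[s]["p99_ms"])
    return Hypothesis(riskiest, "inject 500ms latency", "p99 latency stays under 1s")

def run_chaos_experiment(h: Hypothesis) -> dict:
    # Placeholder: call your chaos tool here and return observed results.
    print(f"[chaos] {h.fault} on {h.target}")
    return {"p99_ms": 910, "error_rate": 0.004}

def analyze(h: Hypothesis, result: dict) -> None:
    # Real-time analytics step: compare observation against the expectation.
    verdict = "held" if result["p99_ms"] < 1000 else "violated"
    print(f"Hypothesis '{h.expectation}' {verdict}; remediate if violated")

if __name__ == "__main__":
    metrics = collect_observability_data()
    hypothesis = generate_hypothesis(metrics)
    analyze(hypothesis, run_chaos_experiment(hypothesis))
```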
So let's now actually look at a sample AI-powered architecture on the AWS cloud provider.
This is how the whole integration of inducing Chaos Monkey works for your systems, like EC2.
There are also multiple other use case architectures where you can include telemetry from OpenTelemetry as input and AI/ML models for anomaly detection.
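As a sketch of the EC2 piece, a Chaos-Monkey-style fault injection could be as simple as terminating one opted-in instance via the AWS SDK; the tag convention and dry-run default below are my own assumptions for safety, not a prescribed setup.

```python
import random

import boto3  # AWS SDK; requires credentials with EC2 permissions
from botocore.exceptions import ClientError

def terminate_random_instance(tag_key: str = "chaos-enabled",
                              tag_value: str = "true",
                              dry_run: bool = True) -> None:
    """Chaos-Monkey-style fault injection: terminate one randomly chosen
    EC2 instance that is explicitly opted in via a tag (limits blast radius)."""
    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances(Filters=[
        {"Name": f"tag:{tag_key}", "Values": [tag_value]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    instances = [i["InstanceId"]
                 for r in resp["Reservations"] for i in r["Instances"]]
    if not instances:
        print("No opted-in instances found; nothing to do.")
        return
    victim = random.choice(instances)
    print(f"[chaos] terminating {victim} (dry_run={dry_run})")
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # With DryRun=True, AWS validates permissions and raises instead of terminating.
        print(f"Dry run result: {err}")

if __name__ == "__main__":
    terminate_random_instance()  # keep dry_run=True until you trust the setup
```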
So what are some of the tools in chaos engineering?
Let's look at some popular tools in chaos engineering.
A snapshot of popular tools includes Chaos Monkey.
As I mentioned, Netflix started it all.
Basically, when you're watching a show on Netflix, you will not even know that systems were down on the backend at Netflix.
How?
Because they have tested it multiple times using Chaos Monkey and made sure it's seamless streaming for the audience.
Gremlin also offers a SaaS model and is another chaos engineering tool.
Litmus Chaos is a Kubernetes-native tool, and Chaos Toolkit is fully open source.
And again, all these tools have their own strengths and varying degrees of AI integration.
Let's look at how AI integrates into the existing tools which your enterprises, companies, or organizations might already be using.
AI doesn't need to replace existing tools.
It can enhance them.
For instance, Chaos Toolkit can use AI-based probes for validation.
Gremlin can use ML insights for targeted attacks.
Litmus Chaos, on the other side, supports AI-driven workflows as we speak.
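For example, Chaos Toolkit probes can be plain Python functions referenced from the experiment definition. Below is an illustrative sketch of a steady-state probe that could later be backed by an ML model; the module name, arguments, and threshold are assumptions, not part of the toolkit itself.

```python
# my_probes.py -- hypothetical module referenced from a Chaos Toolkit experiment
def latency_looks_normal(p99_ms: float, threshold_ms: float = 1000.0) -> bool:
    """Steady-state probe: in a real setup this could ask an ML anomaly model;
    here it is a simple threshold stand-in."""
    return p99_ms < threshold_ms

# Illustrative slice of a Chaos Toolkit experiment that calls the probe above.
experiment_snippet = {
    "steady-state-hypothesis": {
        "title": "p99 latency is healthy",
        "probes": [{
            "type": "probe",
            "name": "latency-looks-normal",
            "tolerance": True,
            "provider": {
                "type": "python",
                "module": "my_probes",
                "func": "latency_looks_normal",
                "arguments": {"p99_ms": 480},
            },
        }],
    }
}

if __name__ == "__main__":
    print(latency_looks_normal(480))  # True -> hypothesis holds
```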
Let's look at a couple of quick case studies.
As I mentioned, Netflix was the first to try this.
The tool is called Chaos Monkey.
It's just one part, though.
They now use predictive analysis and AI-based simulations to test the resilience of their global streaming services.
That's why at least 99.9% of the time we don't see Netflix being down, resulting in highly available systems.
Another case study we can talk about is Gremlin, which is another AI-enhanced chaos engineering tool.
It comes integrated with ML. It was used at a Fortune 500 firm, where it proactively predicted at-risk systems and states, reducing recovery time by 60% and avoiding major outages.
This is a very big win for them, since they operate critical trading systems.
So now, what are the benefits if you want to integrate AI into chaos engineering?
There is faster detection of failures, reduced downtime, more accurate tests, and self-learning systems that adapt to changing environments.
It's a big leap from the traditional static testing we have today to AI-driven testing.
But of course, every benefit comes with challenges and risks.
Some challenges and risks: AI brings complexity.
Model tuning is very important.
You'll have to build trust in its decisions.
There are data quality issues and governance concerns.
So it is critical to keep a human in the loop when AI is used for resilience testing, as well as to make sure the models are fine-tuned as per the observability data and KPI metrics.
Next, let's move on to best practices for implementation.
Always start small: experiment in your dev and sandbox environments, and make AI explainable.
Collaborate with multiple teams, like SRE, ML/AI, and DevOps teams.
And most importantly, track KPIs like availability, latency, and MTTR to measure the real impact.
That's how we can put this into practice and make the best use of it.
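As a tiny sketch of the KPI tracking idea, availability and MTTR can be derived from incident windows like this; the incident timestamps below are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident windows (start, end) collected over one 30-day period.
incidents = [
    (datetime(2024, 5, 3, 2, 10), datetime(2024, 5, 3, 2, 40)),
    (datetime(2024, 5, 17, 14, 0), datetime(2024, 5, 17, 14, 12)),
]
period = timedelta(days=30)

# Total downtime, availability as a percentage, and mean time to recovery.
downtime = sum((end - start for start, end in incidents), timedelta())
availability = 100 * (1 - downtime / period)
mttr = downtime / len(incidents)

print(f"Availability: {availability:.3f}%")
print(f"MTTR: {mttr}")
```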
And there are some amazing resources from Awesome Chaos Engineering.
It covers tools, papers, community projects, and open source initiatives.
It's a very good reference for anyone interested to check out.
To conclude, what is the future of chaos engineering?
Think self-healing systems, autonomous chaos agents, root cause analysis powered by Gen AI, and chaos testing across hybrid and multi-cloud platforms.
AI will lead this evolution, and you won't have to answer a call at 3:00 AM anymore, because your chaos engineering software is taking care of it, and because you have experimented and tested it multiple times.
Thank you for your time today.
I appreciate it.