Conf42 Chaos Engineering 2025 - Online

- Premiere: 5 PM GMT

AI-Driven Chaos Engineering: Automating Resilience Testing with Predictive Insights


Abstract

Discover how AI-powered predictive analytics can enhance chaos engineering by leveraging historical data to automate fault injection and improve system resilience.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. I hope you are all doing well. Welcome, and thank you for joining this session on AI-Driven Chaos Engineering: Automating Resilience Testing with Predictive Insights. I'm Srimaan Yarram, Senior Engineering Manager in Test at Coupa Software. I specialize in AI, automation, and resilience testing, with a particular focus on DevSecOps and security testing, and over the years I have worked extensively with distributed systems.

Today I'm going to talk about chaos engineering and AI-driven testing strategies that help teams build more reliable, failure-tolerant architectures and infrastructure. My focus in today's session is on making chaos engineering smarter. Instead of just breaking systems to see how they fail, I believe AI can help us predict future failures, automate chaos testing, and continuously improve system resilience. I'll walk you through how AI can enhance chaos testing, shifting us from reacting to failures to anticipating and preventing them before they impact users. Let's dive in.

First, an introduction to chaos engineering. Most of us already know what chaos engineering is, but let's take a moment to recap and refresh our knowledge. Chaos engineering is all about testing system resilience by intentionally introducing failures. Instead of waiting for something to go wrong in production, we simulate real-world failures in a controlled way and observe how the system responds. The purpose is to identify weaknesses before they cause actual issues in production for our users and customers. By proactively testing different failure scenarios, we can fix problems early, in the development phase itself, and make our systems more reliable rather than discovering those problems in production.

At its core, chaos engineering follows one principle: embrace failure to build robust systems. Instead of fearing failures, we use them as opportunities to learn, adapt, and improve system resilience. This approach helps us move from being reactive, fixing things when they break, to being proactive, designing systems that can handle failures gracefully.

Many of my colleagues and fellow speakers have already covered what chaos engineering is, so I don't want to dwell on that. Instead, I want to show you how traditional chaos testing is done, why it's not enough, and how AI can take us to the next level.

Let's look at the challenges in predicting real-time failures. First, unpredictable failures: even with all the chaos testing we do, unexpected failures still happen, often at the worst possible times. Engineers, SREs, DevOps, and SecOps teams work hard to anticipate issues, but real-world systems are unpredictable, and failures can still slip through. One big reason is complexity. Modern systems are made up of countless services, dependencies, and external integrations, AWS S3, you name it. In the microservices world, teams use a lot of third-party software and systems and integrate with them. At that scale, predicting every possible failure mode becomes nearly impossible. And then there is cost, in both time and resources.
Running chaos tests continuously takes effort, infrastructure, and careful coordination. It's not always feasible to simulate every failure scenario at scale; frankly, it's impossible. So while chaos engineering is a great way to uncover weaknesses, it also has limitations. The question becomes: how do we go beyond traditional chaos testing? How can we make this process smarter and more efficient? That's where AI comes into the picture. Let's dive into how AI can help us predict and manage failures in a better way.

Before we go deeper into the idea, let's break down the difference between traditional chaos engineering and AI-driven chaos testing.

In traditional chaos engineering, the usual approach is to manually inject faults into the systems the engineer or the team already knows: shutting down a server, adding latency to an API, simulating a database outage, or simulating a third-party system outage. These tests are based on what an engineer or a team can think of, meaning we are relying entirely on existing knowledge and past experience. And there is a problem here: we are not covering real-world failures. We do not know what kinds of problems we will face once the system goes into production. Real-world failures do not always follow our assumptions, because systems are complex and unknown failures can occur at any time and slip past us. And because this approach is mostly manual, the scope is often limited to the engineering team's knowledge, so we can only run a limited set of real-world test cases. After the tests are done, engineers have to review the results manually, which takes time and leaves room for human error. And here is the biggest issue: the whole process is reactive. We identify failures after they happen rather than preventing them in the first place. We do catch some failures this way, but only at a static level, not a dynamic one. A sketch of what such a manually defined experiment typically looks like in code follows below.

So what can we do about this? Here is my thought process, and we have started putting it into practice. Let's look at the AI-driven approach. Instead of engineers manually introducing failures, why not let AI simulate faults based on real-world data? What do I mean by real-world data? We'll get there, but first consider what real-world data can offer. This is not just following predefined test cases. In the traditional approach, we go by knowledge; here, we want to be adaptive and dynamic. From real-world data, we continuously monitor the logs, the real production logs, analyze the traffic patterns, analyze how the systems behave in the production environment, and learn from past incidents to uncover failure patterns we are unfamiliar with, patterns the team might not even be aware of.
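To make the contrast concrete before we get to the example, here is a minimal sketch of a traditional, manually defined chaos experiment: it injects a fixed latency in front of a service call and checks a steady-state hypothesis afterward. The endpoint, latency value, and SLO threshold are illustrative assumptions, not details from the talk.

```python
import time
import requests

# Hypothetical manual chaos experiment: inject latency in front of a service
# call and verify a steady-state hypothesis (the call still meets its SLO).
SERVICE_URL = "http://localhost:8080/health"  # assumed endpoint, for illustration
INJECTED_LATENCY_S = 2.0                      # fault chosen from team knowledge
MAX_ACCEPTABLE_S = 5.0                        # steady-state hypothesis (SLO)

def call_with_injected_latency(url: str) -> float:
    """Simulate an upstream slowdown by sleeping before the real call."""
    start = time.monotonic()
    time.sleep(INJECTED_LATENCY_S)  # the manually injected fault
    requests.get(url, timeout=MAX_ACCEPTABLE_S)
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = call_with_injected_latency(SERVICE_URL)
    if elapsed <= MAX_ACCEPTABLE_S:
        print(f"Steady state held: {elapsed:.2f}s <= {MAX_ACCEPTABLE_S}s")
    else:
        print(f"Weakness found: {elapsed:.2f}s exceeded the SLO")
```

Notice that every number in this sketch comes from the team's assumptions; that static, knowledge-bound quality is exactly what the AI-driven approach tries to move past.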
For example, in this context, let's take an e-commerce platform that suddenly sees a traffic spike during Black Friday. AI can observe how the infrastructure behaves under such load, detect subtle issues, maybe the payment service goes down or the shipping service can't keep up, and dynamically create chaos tests based on those findings. The learnings from Black Friday become data that we convert into new chaos engineering experiments, continuously integrated into our CI/CD process. This means that instead of repeating the same old chaos experiments, AI adapts the tests based on what actually happens in production. How? Using real-world data. And what do I mean by real-world data? All the system logs, all the production logs: we take them and use AI to make the testing adaptive.

This approach also scales across different systems. Since production logs are flowing in continuously, we can automate the entire testing lifecycle, from injecting failures to analyzing results and even suggesting improvements. Is this the biggest shift? Yes: AI moves us from reactive to proactive. With the traditional approach, we wait for failures and design experiments from limited knowledge; with the AI-driven approach, we use production logs to generate dynamic test cases as part of a continuous loop, anticipating risks and testing for them before they cause real-world issues.

The bottom line: traditional chaos testing is like guessing what might break and testing for that. AI-driven chaos testing is like watching what actually breaks in the real world and testing for those patterns dynamically, as part of our continuous development cycle. In my view, AI makes chaos engineering smarter, faster, and more effective, and it makes our systems more resilient.

That said, my suggestion is that this should not replace the traditional approach. Use AI as an extension of what we are already doing. The team's knowledge base helps us design the initial scenarios; then, once the system goes into production, use the production logs to become more effective and to run and continuously improve your chaos testing. That is exactly the philosophy of chaos engineering, continuously improving your experiments; we are simply doing it now with AI and real-world data. The best approach, in my view, is to combine both methods: keep running well-designed manual chaos experiments while using AI to expand and improve upon them.

Now let's look at integrating AI to predict and automate failures. Now that we've seen that AI-driven chaos engineering is more adaptive, let's break down how AI actually makes this happen; the skeleton after this paragraph sketches the overall loop.
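As a rough map of that continuous loop, here is a skeleton of how the stages could be wired together. Every function body here is a placeholder of my own, an assumption meant to show the shape of the pipeline rather than the talk's actual implementation; real versions would query your observability stack and drive your fault-injection tooling.

```python
# Hypothetical skeleton of the continuous AI-driven chaos loop.
# Each stage is a stub; real implementations would query observability tools
# (Splunk, New Relic, CloudWatch, ...) and drive chaos tooling (e.g., AWS FIS).

def fetch_production_logs() -> list[dict]:
    """Pull recent logs and metrics from the observability pipeline."""
    return []  # placeholder

def learn_failure_patterns(logs: list[dict]) -> list[dict]:
    """Mine logs for anomalies and recurring failure patterns."""
    return []  # placeholder: e.g., anomaly detection over CPU/latency metrics

def generate_experiments(patterns: list[dict]) -> list[dict]:
    """Turn each learned pattern into a targeted chaos experiment config."""
    return [{"target": p["service"], "fault": p["failure_mode"]} for p in patterns]

def run_in_pipeline(experiments: list[dict]) -> None:
    """Execute experiments as a CI/CD stage; results feed the next learning pass."""
    for exp in experiments:
        print(f"Running chaos experiment: {exp}")

if __name__ == "__main__":
    logs = fetch_production_logs()
    patterns = learn_failure_patterns(logs)
    run_in_pipeline(generate_experiments(patterns))
```

The point of the skeleton is the feedback loop: each run's results become more data for the next learning pass.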
The first stage is AI for analysis: learning from historical failures. We talked about this; we have historical data in the sense of historical logs captured from multiple systems. The very first step is to analyze past failures. AI can go through the historical logs, everything from system crashes, slowdowns, and error spikes to CPU and memory spikes, you name it, and identify patterns humans might have missed in their static experiments under the traditional approach. Instead of relying on intuition, AI can surface hidden trends in the failure data, because we have a large data set from the last months or years, and dig deeper and deeper into it to understand how the system may behave under stress.

Again, an example: suppose our database frequently slows down on Friday afternoons. We may not immediately notice that pattern, but AI, after analyzing months of logs, can flag that the slowdowns coincide with the weekly traffic surge. We have knowledge of our own internal systems, but consider an external payment gateway we integrate with: our team won't have that external team's knowledge. Hypothetically, say the provider, Visa or Mastercard if you use those systems, scales down its servers on weekends as part of a cost-improvement plan, because weekend usage is low. We would never know that pattern unless we analyzed the logs. That's where we want our systems to act proactively, and that's where the beauty of this idea comes in. In other words, if we know a storm is coming, we can prepare ahead of time instead of waiting to get soaked in the rain. Similarly, AI warns us about potential system failures before they escalate, giving the team time to reinforce those areas in advance.

The next stage is automated fault injection: smarter chaos experiments. First we did the AI analysis; then we apply predictive modeling to forecast what could fail; and then we automatically generate the fault injections, or experiments, as we call them in the chaos world. That's the exciting part: AI does not just predict, it acts. Based on the insights learned from analysis and modeling, AI can automatically help us design and inject targeted failures into the areas that are most likely to break, based on external factors and internal ones too, traffic and so on, all dynamic in nature. If AI detects that the payment service is slow during peak hours or on weekends, it learns that pattern and can inject controlled latency into the service before the real traffic surge happens. This lets engineers test fixes before users are affected. And it's a continuous process in which we keep on learning.

To bring everything together: AI learns from historical data instead of relying on assumptions, which we covered in the slide contrasting the two approaches. AI predicts future weak points based on past data, helping teams stay ahead of failures. And AI automates the chaos experiments, making the tests more relevant and effective. That means we are no longer testing randomly with static knowledge; we are running chaos experiments where they actually matter.
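As a small illustration of the kind of pattern mining just described, the sketch below buckets error rates by day of week and hour to surface recurring weak windows, like the hypothetical Friday-afternoon database slowdowns. The column names, toy data, and the choice of a simple group-by (rather than any particular model) are my own assumptions for illustration.

```python
import pandas as pd

# Toy log extract: a timestamp plus a flag marking failed requests.
logs = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2025-01-03 14:05", "2025-01-03 14:40",   # Friday afternoon errors
        "2025-01-06 09:10", "2025-01-10 14:20",   # another Friday afternoon error
    ]),
    "is_error": [1, 1, 0, 1],
})

logs["day_of_week"] = logs["timestamp"].dt.day_name()
logs["hour"] = logs["timestamp"].dt.hour

# Error rate per (day-of-week, hour) bucket: recurring hot spots become
# candidate targets for proactive chaos experiments.
hot_spots = (
    logs.groupby(["day_of_week", "hour"])["is_error"]
        .mean()
        .sort_values(ascending=False)
)
print(hot_spots.head())
```

On real data, you would replace the toy frame with months of production logs and feed the flagged windows into the experiment generator.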
That's the point. Now let's explore how AI-driven chaos testing can be integrated into a real-world workflow. I want to show why this works using relatable examples, database slowdowns, weather forecasting, payment service issues, shipping services slowing down so that we need alternative shipping mechanisms, all of which really matter for an e-commerce platform, and connect AI's role at each stage, learning, predicting, and acting, while keeping it practical and easy to follow.

If you look at most applications, these are the popular tools they use, and this is our entry point: observability tools and logs. To make AI-driven chaos engineering effective, we need the right observability. If we don't have a proper logging mechanism, this is very hard to achieve; for any application, logs and observability tools are essential for troubleshooting and debugging. Tools like New Relic, Splunk, ELK, and AWS CloudWatch, and there are hundreds of observability tools, commercial and non-commercial. New Relic can help us monitor application performance, Splunk can aggregate and analyze logs, and AWS CloudWatch provides deeper insights into both infrastructure and application health. Each of these tools helps us collect comprehensive data, which we then pass to the AI model to learn from real-world system behavior and finally generate the experiment files.

Now, here is the model, with some parts kept at a very high level. The first step in the program is to initialize an S3 client. This could be replaced with the Splunk SDK or the New Relic SDK, which would let us pull data directly from the real-world system, but for easy understanding, let's assume there is an S3 bucket containing all the system logs, or a job that pulls the data from Splunk or New Relic, you name it, and lands it in S3. I wrote a small sample program to show how this works. Around lines 8 to 12 of the sample, we create the S3 client with boto3, set the bucket name, and fetch the log file from S3 by bucket name, file key, and date; then we read the CSV file into a DataFrame and convert the timestamp column into a date format for proper time-based analysis. Think of S3 as a giant warehouse storing system logs; the fetch section pulls all the logs and organizes them into a structured format. Once again, this is a sample piece meant to convey the idea of using existing, historical data.
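The sample program itself isn't reproduced on this page, so here is a minimal reconstruction of the fetch step as described: a boto3 S3 client pulls a CSV log export into a pandas DataFrame and converts the timestamp column. The bucket name, file key, and column name are placeholders of mine, not values from the talk.

```python
from io import BytesIO

import boto3
import pandas as pd

# Reconstruction of the fetch step described above (names are placeholders).
s3_client = boto3.client("s3")
BUCKET_NAME = "my-system-logs"                 # assumed log bucket
LOG_FILE_KEY = "logs/system-2025-01-10.csv"    # assumed daily log export

def fetch_logs(bucket: str, key: str) -> pd.DataFrame:
    """Fetch a CSV log file from S3 and load it into a structured DataFrame."""
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(BytesIO(obj["Body"].read()))
    # Convert the timestamp column for proper time-based analysis.
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    return df

logs_df = fetch_logs(BUCKET_NAME, LOG_FILE_KEY)
print(logs_df.head())
```

As noted in the talk, the S3 client could just as easily be swapped for a Splunk or New Relic SDK call, or a separate job that keeps exporting logs into the bucket.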
There are many ways to do this: instead of S3 or Splunk, you can use another fetching mechanism, or write a separate client, a separate code block or program, that continuously fetches the data and passes it to the AI model. This is just an example.

Next, you define the features for AI analysis. Lines 15 to 17 of the sample select the key performance indicators, KPIs like CPU usage and response time. These metrics are like health indicators for your system, like checking heart rate, blood pressure, and temperature, and you can use other metrics as well; again, this is just one choice.

Then comes detecting anomalies with AI. I'm using an isolation forest here. Lines 20 to 22 label ordinary logs as normal and outlier logs as anomalies; for example, you could label based on info, error, and debug levels. If I had to give an analogy, it's like airport security: most people pass through normally, but suspicious behavior, the anomalies, gets flagged for deeper inspection. Then we extract the anomaly logs, filtering the data set down to only the meaningful, flagged anomalies.

Finally, on lines 25 to 28, we inject the experiments. On the right side of the slide, you can see the generated experiment; it's an AWS FIS experiment I've taken for demonstration purposes. It captures, for example, that CPU usage is high, and a corresponding experiment can be applied. CPU is just the example I picked; there could be many scenarios. Say DDoS attacks are happening, causing slowness and driving CPU up: how do we simulate DDoS attacks? Maybe use the Faker library or something similar to fake the calls, and you can build up that kind of scenario. Finally, we save the experiment configuration file and visualize the results.

To summarize the program: fetch the logs from AWS, though that could be anything, you can use SDK clients such as Splunk clients; detect anomalies; generate the fault-injection experiments; and visualize, displaying the detected anomalies. That's exactly what this program does. This approach automates chaos engineering, making it smarter and more proactive, instead of relying on predefined scenarios.
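Continuing the reconstruction from the fetch sketch above (reusing logs_df), here is how the feature selection, isolation-forest anomaly detection, and experiment generation could look. The feature columns, the CPU threshold, and the experiment structure are my own illustrative assumptions; in particular, the config below is a simplified stand-in, not the exact AWS FIS experiment-template schema.

```python
import json

from sklearn.ensemble import IsolationForest

# Features for AI analysis: KPIs acting as system "health indicators".
FEATURES = ["cpu_usage", "response_time"]  # assumed column names

# The isolation forest flags outliers (-1) versus normal samples (1),
# like airport security waving most people through and pulling a few aside.
model = IsolationForest(contamination=0.05, random_state=42)
logs_df["anomaly"] = model.fit_predict(logs_df[FEATURES])
anomalies = logs_df[logs_df["anomaly"] == -1]  # keep only flagged rows

# If the anomalies involve high CPU, emit a CPU-stress chaos experiment config.
if (anomalies["cpu_usage"] > 80).any():  # assumed threshold
    experiment = {
        "description": "AI-generated CPU stress test from log anomalies",
        "action": "cpu-stress",            # simplified, FIS-like action name
        "target": "app-server-fleet",      # assumed target group
        "parameters": {"load_percent": 90, "duration": "PT5M"},
    }
    # Save the experiment configuration file for the chaos tooling to pick up.
    with open("chaos_experiment.json", "w") as f:
        json.dump(experiment, f, indent=2)
    print(f"Generated experiment from {len(anomalies)} anomalous samples")
```

A real pipeline would hand this JSON to AWS FIS (or another chaos toolkit) and chart the flagged anomalies for the visualization step.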
Now, to summarize the overall benefits of AI-driven chaos engineering. First, proactive resilience: we anticipate and mitigate potential failures. Instead of waiting for failures to happen and reacting, we shift to predicting where failures could happen and preventing them. AI-driven chaos engineering analyzes past failures and simulates potential issues through experiment files, for example with AWS FIS or other chaos toolkits, and this helps us strengthen the system.

Second, efficiency: we streamline testing through automation. In the traditional approach, testing is manual, and it's time-consuming to see what's going on and how the application behaves. With AI, we prepare the experiments dynamically based on our past learnings: analyzing the logs, detecting risks, and generating the fault-injection experiments dynamically. This reduces the need for constant human monitoring and allows teams to focus on improving the system instead of just testing it.

Third, continuous improvement: since we are leveraging AI for insights and ongoing system enhancement, the system can learn from every test and every failure, making future predictions more accurate. Instead of static testing, our resilience strategy adapts over time, ensuring stronger, more reliable systems as new challenges arise.

Before I close out: you can use different models for prediction. In our example I used an isolation forest, but you can use other approaches, such as time-series models that understand temporal behavior; many models are available online, and a quick search will help.

My closing thought: AI-driven chaos engineering is not just about breaking things. It's about learning, improving, and building systems that can handle failures before they even happen. It's a new layer of testing we add on top of the traditional approach. By combining prediction, automation, and continuous learning, we create a smarter, proactive approach to the resilience problem.

Thank you for attending. If you have any questions, please feel free to send them to me. Thank you, have a great day, everyone. Bye.

Srimaan Yarram

Senior Engineering Manager - Quality @ Coupa Software

Srimaan Yarram's LinkedIn account


