Conf42 Chaos Engineering 2025 - Online

- premiere 5PM GMT

Chaos Engineering in AI: Predicting and Preventing System Outages


Abstract

Discover how Chaos Engineering can revolutionize AI system resilience by proactively identifying weaknesses and preventing costly outages. Learn how strategic failure injection can safeguard AI-driven applications, ensuring scalability, reliability, and continuous performance.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, this is Srinivasa Rao Bittla. Today I'm going to cover chaos engineering in AI: predicting and preventing system outages. Before I start my presentation, a quick disclaimer: all views expressed here are my own and do not reflect the opinions of any affiliated organization.

In today's agenda, I'm going to cover what chaos engineering is, with a little history on how it started and evolved. Then I'll talk about the system architecture for AI applications and how AI systems can fail in a cloud native architecture. I'll go into detail on chaos engineering in an AI context: how to design chaos experiments for AI-based applications, and how you can make AI applications stronger by deliberately introducing faults to break them. Then I'll cover the principles you need to follow to make AI applications more resilient, the steps for implementing chaos engineering in your organization, the future of chaos engineering, the challenges of implementing it within your organization, the tools you can use, and finally the key takeaways.

So what is chaos engineering? Chaos engineering is a discipline focused on improving system reliability by proactively testing failures in a controlled environment. Why would we deliberately make a system fail? To get to that point, I want to give a little history of how chaos engineering became a practice. In 2008, Netflix had a prolonged outage where the service was down and people could not use it, which impacted Netflix's revenue and the loyalty of its customers. That is when they turned failure injection into an industry practice called chaos engineering. The main intent is to make the system fail while it is running live, so that any issue it surfaces can be found and fixed proactively. Earlier, recovery used to take days; once they started practicing these failures in a real environment, in a controlled way, the recovery time kept shrinking, because they knew how to revert. First you introduce the fault, then you go back and fix it. Since they already knew how and where the system was failing, they could fix it while observing the real-time behavior of the application. This is how chaos engineering came into practice for most cloud native businesses and applications.

The major principles behind chaos engineering: build a hypothesis, meaning ask what can go wrong; inject the failure; measure the impact, meaning how the business and the users are affected; then improve the system to ensure it is resilient when the real problem arrives.

Now let's get into the AI system architecture. I divide it into four layers.
The first is the generative AI and ML layer, which the user interacts with; whatever the users ask for, they get from this layer. Then the data layer, where the data predominantly sits. Then the enterprise foundation, where you have networking, identity, and the other shared infrastructure. Finally the computational infrastructure layer, with systems like Kubernetes, where you can scale the whole environment up and down. Why am I introducing the architecture? Because failures can come in any of these layers, and the whole thing runs as one application. When we do chaos engineering, we need to understand the layer, inject the failures there, and then work on fixing what breaks. I'll deep dive into this later.

So, following the architecture diagram, where can failures come from in AI systems? One is infrastructure failures: network latency, GPU or TPU bottlenecks, or cloud disruptions when you are hosting on cloud infrastructure. Then there are data pipeline failures: a network link going down between two systems while data is flowing, missing values, data that is not ingested properly or is corrupted, real-time ingestion issues, and failures on the client and validation side. And there are model failures: concept drift in the data, a model trained the wrong way on the wrong data, adversarial attacks, and incorrect feature engineering. Any of these can cause the system to fail. Since we cannot prevent most of these issues outright, the question is how we simulate them realistically and make the system more resilient. That is what we are going to cover next.

Now, chaos engineering in an AI context versus traditional chaos engineering. In traditional chaos engineering for cloud native applications, anything can fail anywhere in the architecture: the data layer, the computational layer, the UI layer. If you bring the same thinking to AI applications, everything that can fail in a cloud native application can still fail, but you add extra components: data pipelines, data drift, and GPUs. In a typical application you may not really see GPUs, but in AI applications GPUs come up all the time. So on top of the usual cloud native failures, you have model-specific and GPU-specific issues. I know I'm repeating myself here, but that is the key distinction.

Next, designing chaos experiments. Whether you are simulating or running in a real environment, you can follow this five-step approach; of course, you can adapt it to your organization's needs. First, design a hypothesis, meaning decide what experiment you want to run: what happens if the AI model gets 20 percent corrupted data? What happens if the network goes down? Second, select the target: choose which component might be impacted. Is it a data pipeline issue, is the data itself the issue, or is model inference the issue? Decide which layer of the architecture is going to have the problem. Third, inject the failure: introduce synthetic faults such as missing data or delayed model responses. Fourth, observe how the system fails and measure the impact: is the system degrading, is wrong data coming back? Fifth, improve the resilience of the system, meaning if any of these faults becomes reality, the system should recover automatically and get better by itself. Those are the five steps you follow to design your AI chaos experiments; a sketch of what such an experiment definition might look like follows below.
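To make the five steps concrete, here is a minimal sketch of how an experiment could be expressed in Python. The `ChaosExperiment` dataclass, its field names, and the `run` logic are hypothetical illustrations of the loop described above, not a real framework's API.

```python
# A minimal, hypothetical sketch of the five-step experiment design.
# None of these names come from a real chaos framework; they only
# illustrate the hypothesis -> target -> inject -> observe -> improve loop.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChaosExperiment:
    hypothesis: str                 # step 1: what we believe should happen
    target: str                     # step 2: which layer/component we attack
    inject: Callable[[], None]      # step 3: turn the synthetic fault on
    revert: Callable[[], None]      # undo the fault (controlled environment)
    measure: Callable[[], float]    # step 4: a metric such as error rate

    def run(self, budget: float) -> bool:
        """Inject the fault, measure the impact, always revert, report pass/fail."""
        baseline = self.measure()
        self.inject()
        try:
            impacted = self.measure()
        finally:
            self.revert()           # never leave the fault in place
        # Step 5 happens outside this sketch: if the check fails,
        # you go improve the system and re-run the experiment.
        return (impacted - baseline) <= budget
```

An experiment for the "20 percent corrupted data" hypothesis would then plug in an `inject` that corrupts a slice of the input stream and a `measure` that reads model accuracy from your monitoring stack.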
So how do you make your AI system stronger? In the previous step I covered how to design the chaos experiments; here, the question is which faults you can actually introduce into AI systems. I'm going to talk about six different kinds of faults you can inject into the AI system architecture, so you can observe how the system behaves and come up with solutions that let it automatically self-heal and minimize the issues the user faces. This is breaking AI to make it stronger: we introduce the chaos, and the AI system comes out more robust.

The first fault we can introduce is adversarial attack testing: how an intruder can manipulate a system's behavior by injecting external perturbations. In this case, we tamper with the images used for facial recognition. Using the code highlighted in red on the slide, we use the Fox tool to modify the existing images and try to disrupt the facial recognition model. The expected outcome is that training with those perturbed images should have no impact on recognizing a person's face; the model should still recognize the original faces, the original class. Otherwise, we say there is a vulnerability in the system. When we compute it, the cosine similarity should be close to one: if the model still recognizes the face properly, nothing has changed, the system is not vulnerable, and the intruder could not alter the model's behavior. But if the cosine similarity comes out close to zero, the system is vulnerable: it no longer recognizes the face of the person who is supposed to be authenticated. This is how you introduce an adversarial attack and check whether the system is vulnerable; the expected behavior is that it should not be, and that is where you work on the system's resilience.
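The slide's code is not reproduced in this transcript, so here is a minimal sketch of the same idea, assuming a PyTorch embedding model: perturb an input with an FGSM-style step toward a decoy identity, then compare embeddings with cosine similarity. The `FaceEmbedder` network is a toy stand-in, not a real face recognition model, and the epsilon value is an illustrative assumption.

```python
# Hedged sketch of adversarial attack testing: an FGSM-style perturbation
# plus a cosine-similarity check. FaceEmbedder is a toy stand-in model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceEmbedder(nn.Module):
    """Toy embedding network standing in for a facial recognition model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def fgsm_perturb(model, image, epsilon=0.03):
    """One FGSM step pushing the embedding toward a random decoy identity."""
    decoy = F.normalize(torch.randn(1, 128), dim=-1)   # impersonation target
    image = image.clone().requires_grad_(True)
    loss = F.cosine_similarity(model(image), decoy).mean()
    loss.backward()
    # Step in the direction that increases similarity to the decoy.
    return (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

model = FaceEmbedder().eval()
original = torch.rand(1, 3, 64, 64)                    # placeholder face image
adversarial = fgsm_perturb(model, original)

sim = F.cosine_similarity(model(original), model(adversarial)).item()
# Close to 1.0 -> the model still "sees" the same face (resilient);
# close to 0.0 -> the perturbation changed the identity (vulnerable).
print(f"cosine similarity after attack: {sim:.3f}")
```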
Next is the GPU failure simulation. Here we simulate memory leaks on a specific GPU and see whether the system can automatically survive out-of-memory errors: whether it can transfer the affected processing to the next GPU, or to a CPU, so the system keeps working as expected. With the code on the slide, when I run it against a specific GPU, you can see the out-of-memory error and the system failing; that means it is vulnerable. So you need to write your code in such a way that if this memory leak happens, the task is transferred to the next GPU or to the CPU. That is the expected outcome: if the work transfers properly, the GPU failure simulation passed, and the system is resilient to that kind of outage.
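Again, the slide's code is not in the transcript, so here is a minimal sketch assuming PyTorch and a CUDA device: hold onto allocations until the GPU throws an out-of-memory error, then fail over to the CPU. The chunk size and the simple device-failover policy are illustrative assumptions.

```python
# Hedged sketch of a GPU out-of-memory simulation with a CPU fallback.
# Assumes PyTorch; the leak loop and failover policy are illustrative only.
import torch

def leak_until_oom(device="cuda:0", chunk_mb=256):
    """Deliberately hold onto allocations until the device runs out of memory."""
    leaked = []
    try:
        while True:
            # Each chunk is ~chunk_mb of float32 that we never release.
            leaked.append(torch.empty(chunk_mb * 1024 * 1024 // 4, device=device))
    except RuntimeError as err:             # CUDA OOM surfaces as a RuntimeError
        if "out of memory" not in str(err).lower():
            raise
        return leaked                       # keep references: the "leak" persists

def run_inference(x, preferred="cuda:0"):
    """Try the preferred GPU; on OOM, fail over to CPU instead of crashing."""
    try:
        return (x.to(preferred) @ x.to(preferred).T).cpu()
    except RuntimeError as err:
        if "out of memory" not in str(err).lower():
            raise
        print(f"{preferred} exhausted, falling back to CPU")
        return x.cpu() @ x.cpu().T

if torch.cuda.is_available():
    hoard = leak_until_oom()                      # inject the fault
    result = run_inference(torch.rand(512, 512))  # should survive via fallback
```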
The next one is data pipeline latency simulation: how do you introduce delays into the system? Say you want to simulate network latency; you write the code shown on the right-hand side. Here I'm injecting 500 milliseconds of latency, repeating it for 60 seconds, and checking whether my system still gets a response. That is the chaos you inject into the AI data pipelines, to see whether the model handles the delays, fails gracefully, or just fails without any warning. The expected outcome is that the system serves whatever information is available in a cache or some intermediate layer, or else fails over to the nearest data center or nearest deployment. But in the current environment you see real failures. These are real-world failures that were captured: here a DeepSeek model failing with a network issue, and here ChatGPT failing with a similar kind of error. So even with the huge infrastructure these two companies run, the systems were not resilient. Data pipeline latency simulation, and automatic recovery from it, is very important for any AI-driven system.
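As a stand-in for the slide's code, here is a minimal latency-injection sketch in Python: a wrapper that adds a 500 ms delay in front of a pipeline call for 60 seconds and reports whether each request still completes within its deadline. The `fetch_features` function and the one-second deadline are assumptions for illustration.

```python
# Hedged sketch of data pipeline latency injection: add 500 ms in front of
# every call for 60 seconds and see whether requests still meet a deadline.
# fetch_features() and the 1-second deadline are illustrative assumptions.
import time

def fetch_features(query: str) -> dict:
    """Stand-in for a real feature/data-pipeline call."""
    return {"query": query, "features": [0.1, 0.2, 0.3]}

def with_latency(fn, delay_s=0.5):
    """Wrap a pipeline call so every invocation pays an injected delay."""
    def slow(*args, **kwargs):
        time.sleep(delay_s)                 # the injected 500 ms of chaos
        return fn(*args, **kwargs)
    return slow

slow_fetch = with_latency(fetch_features)

deadline_s, failures, total = 1.0, 0, 0
end = time.monotonic() + 60                 # repeat for 60 seconds
while time.monotonic() < end:
    start = time.monotonic()
    slow_fetch("user-123")
    total += 1
    if time.monotonic() - start > deadline_s:
        failures += 1                       # degraded: deadline missed

print(f"{failures}/{total} requests missed the {deadline_s}s deadline")
```

In a real experiment you would more likely inject the delay at the network layer (for example with `tc netem` on Linux) rather than in process, but the observation step is the same: count how many requests degrade.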
The next one is data corruption in AI training. If you train your AI models on corrupted data, how is the system going to behave? In this scenario, you load corrupted data into the system over a period of time: each round you train with 10 percent corrupted data, and over time corruption spreads through the system. When you train on corrupted data, the system should be able to tell whether it is getting genuine or corrupted data. If it cannot distinguish the two, it will absorb the bad data, the model and the training will both be polluted, and it will give wrong output. For example, take this picture of MLOps layers: I asked one of the ChatGPT models to generate a machine learning operations layer architecture, and what I got was a sphere. I had to laugh looking at this image, because machine learning operations layers can be many things, but a featureless sphere says nothing. A machine learning system architecture is what I wanted to generate; this is what I got. If you feed the model only squares and spheres, then whatever architecture you ask for, that is all you will get back. If this is the outcome, the system is not resilient. If the system gives the expected output despite the corruption, the system is resilient, it is doing the right thing, and users can rely on it. Even when you inject corrupted data, the AI model should take only the valid data and still generate the expected outcome.

Then there is model drift simulation. Say your model is drifting: it is becoming biased, making wrong decisions over time, slowly building up behavior it was never supposed to have. That means the accuracy of the model is going down, sliding gradually on a scale. On the chart you see months on the x-axis and accuracy percentage on the y-axis; the model slowly starts drifting, and after a while you see only inaccurate information. When the model is drifting, you need to retrain and take the necessary remedies to stop the drift, and the system should have automatic recovery from drift as well. You can use code like the following to simulate the drift.
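Here is a minimal sketch of such a drift simulation, assuming scikit-learn: train a classifier once, then shift the input distribution a little more each "month" and watch accuracy decay, firing a retraining alert when it crosses a threshold. The shift size and the 90 percent threshold are illustrative assumptions.

```python
# Hedged sketch of model drift simulation with scikit-learn: shift the input
# distribution a bit more each "month" and watch accuracy decay. The shift
# size and the 90% alert threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X[:2000], y[:2000])

X_test, y_test = X[2000:], y[2000:]
for month in range(1, 13):
    drift = 0.15 * month                       # distribution shifts each month
    X_drifted = X_test + rng.normal(drift, 0.1, X_test.shape)
    acc = model.score(X_drifted, y_test)
    print(f"month {month:2d}: accuracy {acc:.2%}")
    if acc < 0.90:                             # alert threshold: retrain time
        print(f"  drift detected in month {month}: trigger retraining")
        break
```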
Then there is AI model fallback testing. If an AI model is not giving the intended outcome, if faulty information is coming out, the system should not keep using the latest model. If the latest model has an issue, it has to fall back to the stable model. Here I have multiple layers of decisions, with several model versions interconnected. Say you have five or six versions of a model: the latest one comes first; if its output is not valid, you take the decision to step back; if the next one is not accurate either, you take another decision, and so on through multiple decision points, so that the user always gets accurate information. That is model fallback testing: if the current model is not giving accurate results, the system must always fall back to the latest stable model.

Now, chaos engineering principles for AI applications: how do you actually apply all of this? We have seen the different kinds of chaos you can introduce into AI. Can we do all of them at once, even in a controlled environment? I say no: start small and start slow. Even though you run the chaos in a controlled environment, attack only one aspect of the system at a time, and restrict the impact to a small set of users or a small region of the system. Control the blast radius, meaning the number of users affected, because you don't want to put your business at risk; limit the whole impact to a very small segment. Then automate the recovery: the system should have self-healing in place whenever a failure comes; the whole intent is to recover without any human intervention. And measure the impact: how much time recovery takes, how much business value you are losing, and what challenges you will face when the impact really happens. Those are the principles of chaos engineering.

Next, the tools you can use for chaos engineering. The first set, such as Gremlin, Chaos Monkey, and Litmus Chaos, you can use for cloud native applications as well as AI applications. Then there are AI-specific tools, such as TensorFlow Model Analysis and AI explainability tools, which you use to verify that models are accurate and giving predictable results. And for observability, to monitor the system's behavior, processing, and usage metrics, you have Prometheus, Grafana, and the ELK Stack.

Given everything we have seen, system reliability in AI is very important. Failures can come in different layers of your AI systems architecture: in the models, meaning the model may not be giving you accurate information; in the data itself, meaning what data you feed it, how you train it, and what cleanup you do; in the data pipelines, meaning how the information flows and where it comes from; and in the infrastructure. These are the main reliability concerns. For the system to give accurate answers and stay trustworthy for any given query, you need the chaos experiments, the remedies, and the self-healing in place.

So, implementing AI chaos engineering in your organization: how do you do it? First, build a culture of resilience testing: your dev and engineering teams should have a culture of auto-healing for any issue that comes up. Preventing issues is very hard, but auto-healing is what makes the difference, because failures will come; it is very difficult to avoid them, so what matters is how the system heals and keeps performing by itself. Second, define the key AI failure scenarios: work out which AI failures might occur, simulate them as part of your chaos engineering, and have all the remedies in place. Third, use automated chaos testing frameworks, as in the sketch below: while developing the system itself, you need to keep these perspectives in mind. If I'm writing this specific code and this middleware fails, what happens? If this specific component fails, what happens? That thought process should be in place. Fourth, continuously improve the system design: every API, every program, every component you build, you need to think through in all these aspects.
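As one way to bake that thinking into an automated framework, here is a minimal sketch of a chaos test in pytest. The scoring service, its feature-store dependency, and the failure we inject are all hypothetical; the pattern is simply that a dependency outage becomes an assertion you run on every build.

```python
# Hedged sketch of an automated chaos test with pytest. The scoring service,
# its feature-store dependency, and the injected failure are hypothetical;
# the pattern is: break a dependency, assert the system degrades gracefully.
import sys

def fetch_features(user_id: str) -> list:
    """Stand-in for a feature-store call that can fail in production."""
    return [0.5, 0.1, 0.9]

def score(user_id: str) -> float:
    """Inference endpoint: falls back to a default score if features fail."""
    try:
        features = fetch_features(user_id)
    except ConnectionError:
        return 0.0                       # documented safe default, not a crash
    return sum(features) / len(features)

def test_score_survives_feature_store_outage(monkeypatch):
    """Chaos test: the feature store is down; scoring must not raise."""
    def broken(_user_id):
        raise ConnectionError("feature store unreachable")
    monkeypatch.setattr(sys.modules[__name__], "fetch_features", broken)
    assert score("user-123") == 0.0      # degraded but alive
```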
Now, the challenges in chaos engineering, especially for AI systems. First of all, there are already plenty of trust issues around AI models, and if you introduce chaos into real-world applications it can become a real problem. Say a specific user's data is sitting in the system, your chaos experiment manipulates it, and that data is later retrieved by that user in a specific region: they get wrong data. You have intruded on a specific user's data privacy and created a security risk. That is the problem with doing chaos engineering in real-world applications. It is also very difficult to track cascading failures, the dependencies where one issue leads to another; monitoring and tracking those is another major challenge. How do you get rid of these problems? Going back to the earlier principles: be conservative, make sure the complexity and the data privacy concerns are taken care of, and run everything in a well-controlled environment to eliminate these issues.

On the future of chaos engineering in AI: as I mentioned, you need to build robust systems by implementing self-healing, and integrate AI operations for proactive failure prevention. Like DevOps, you need AI operations: for any failure that comes, proactively prevent it. Self-healing with proactive failure prevention will help systems become stable and work reliably. You also need continuous monitoring and adaptive learning in the AI models: use your observability metrics and alerting mechanisms, so that if something goes wrong, the right people are alerted and the remedies are applied.

So what are the key takeaways? Chaos engineering is essential for AI resilience: for AI systems to stand up to whatever hits them, you need to plan the chaos, run it, inject the faults, and make sure your AI systems are resilient. Focus on data pipelines, model inference, and cloud infrastructure: whatever AI components you have, make sure each one is tested with these kinds of chaos. Use controlled failure injection to strengthen AI reliability: you should not attack the whole system at once; stay in a controlled environment, in a specific region, with a specific user group, and notify the affected users, QA, and everyone who needs to know before you inject the fault, so you minimize the impact. And leverage automation to improve scalability and recovery: doing these things manually is very difficult, so you need automated ways for the system to heal itself, like falling back to a stable model, or moving a task to a different GPU, CPU, or host when there is a GPU issue. All of that should happen automatically.

Those are the key takeaways on chaos engineering in the AI world. With this I'm concluding. Thank you very much for attending my talk. If you have any questions, you can connect with me on LinkedIn and reach me directly there. Thank you. Thanks for listening.

Srinivasa Rao Bittla

Performance Specialist @ Adobe

Srinivasa Rao Bittla's LinkedIn account


