Transcript
Hello everyone, this is Srinivasa Rao Bithla.
In today's talk, I'm going to cover chaos engineering in AI:
predicting and preventing system outages.
Before I start my presentation, I want to make a quick disclaimer.
All views expressed here are my own and do not reflect the opinions
of any affiliated organization.
In today's agenda, I'm going to cover what chaos engineering is.
I'll give a little history of how chaos engineering started and evolved.
I'm going to talk about the system architecture for AI applications,
and about how AI systems can fail in a given cloud-native architecture.
Then I'm going to go into detail about chaos engineering in the AI context:
how you design chaos experiments for AI-based applications, and how you can
make AI applications stronger by deliberately introducing faults
to break them.
Then I'm going to cover the principles you need to follow to make AI
applications more resilient, and implementing chaos engineering in your
organization: which steps and remedies you need to follow. I'll talk about
the future of chaos engineering, how you can make your AI applications
more robust going forward, the challenges of implementing chaos engineering
within your organization, and the tools you can use for chaos engineering.
Finally, I'll cover the key takeaways, and we'll close from there.
So now let's move to what is chaos engineering.
So chaos engineering is a discipline focused on improving system
reliability by proactively testing failures in a controlled environment.
So why do we deliberately make a system fail in a controlled environment?
To get to that point, I want to give a little history of how
chaos engineering became a practice.
In 2008, Netflix had a prolonged outage: Netflix was down, users could not
get their content, and that impacted Netflix's revenue and the loyalty
of the customers involved.
That's when they introduced what became the industry
practice of chaos engineering.
The main intent is to make the system fail while it is running live,
so that whatever issues come up, the team can go and proactively fix them.
Earlier, recovery used to take days. Eventually that time started
shrinking: when you practice these failures in a real environment,
in a controlled way, you know how to revert.
First you introduce the fault, then you go back and fix it.
Since you already know how and where the system is failing, you can start
fixing it, and you get the real-time behavior of the application as it fails.
This is how chaos engineering came into practice for
most cloud-native businesses and cloud-native applications.
The major principles behind chaos engineering are: you build a hypothesis,
meaning what can go wrong; you inject the failure; then you measure it:
what is the impact, how the business is affected, how the users are
affected; then you improve the system to ensure it becomes resilient
when the real problem comes into the picture.
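To make those four principles concrete, here is a minimal Python sketch of the hypothesize / inject / measure / improve loop; the function names, the callables, and the 30-second observation window are my own illustration, not something from the talk.

```python
# A minimal sketch of the chaos loop: hypothesize -> inject -> measure
# -> improve. All names here are illustrative.
import time

def run_chaos_experiment(hypothesis, inject_fault, revert_fault, measure_impact):
    """Run one controlled chaos experiment and report the observed impact."""
    print(f"Hypothesis: {hypothesis}")
    baseline = measure_impact()      # steady-state metric before the fault
    inject_fault()                   # introduce the failure
    try:
        time.sleep(30)               # let the fault take effect (tune as needed)
        impacted = measure_impact()  # same metric, measured under failure
    finally:
        revert_fault()               # always roll back in a controlled way
    print(f"Impact: baseline={baseline}, under_fault={impacted}")
    return impacted - baseline       # feed this into the "improve" step
```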
So now let's get into the AI system architecture.
I've divided it into four layers.
One is the generative AI and ML layer: whatever the user interacts with,
the information the user gets comes from this layer.
Then the data layer, where the data predominantly sits.
Then the enterprise foundation, where you have other infrastructure
like networking, identity, and so on.
Then there is the computational or infrastructure layer, where you have
systems like Kubernetes that let you scale up your whole
environment and set things up.
The reason I'm introducing the architecture is that failures can come
in any of these layers. The whole thing runs as one single application,
so when we do chaos engineering, we need to understand which layer
we're targeting, inject failures there, and then work on fixing
what we find. That is the role of chaos engineering in the AI systems
architecture; I'm going to deep dive into these things later.
So if you look at AI systems alongside the architecture diagram,
where can failures come from?
One is infrastructure failures: latency from the network, GPU or TPU
bottlenecks, or cloud disruptions when you're hosting on cloud
infrastructure. Then there are data pipeline failures: while data is
flowing, the network between systems or between two intranets suddenly
goes down, or there are missing values because data wasn't ingested
properly, or the data is corrupted. There are real-time ingestion issues
and failures on the client and validation side. And there are model
failures: concept drift in the data, a model trained the wrong way with
wrong data, adversarial attacks, and incorrect feature engineering.
Any of these things can cause the system to fail.
Now, we cannot prevent most of these issues. So how do we simulate them
realistically, and how do we make the system more resilient?
That is what we are going to cover next.
So, chaos engineering in the AI context.
Let's compare chaos engineering in AI versus traditional chaos engineering.
In traditional chaos engineering, in any cloud-native application,
systems can fail anywhere in the architecture: in the data layer,
the computational layer, or the UI layer.
Anything, anywhere, can fail.
If you bring the same thing to AI applications:
in addition to all the ways cloud-native applications can fail,
AI adds extra components like pipelines, data drift, and GPUs.
In a typical application you may not really see GPUs, but in
AI applications, GPUs come up much more often.
Of course, as I said, whatever failures can occur in typical cloud-native
applications can also occur in AI. On top of that, you will have
model-specific issues and GPU-specific issues.
That's what I want to highlight. I know I'm repeating myself,
but that's the point, just to give you clarity.
Now, designing chaos experiments.
How do you design a chaos experiment? Whether you're simulating or
running in the real environment, you need to follow this
five-step approach.
Of course, you can adapt it to your organization's needs.
First, you need to design a hypothesis, meaning
what experiment you want to run.
What happens if the AI model gets 20 percent corrupted data?
So that is the hypothesis.
What happens if the network goes down?
That is another hypothesis, right?
The next step is to select the target, meaning choose the component
that might be impacted by the change.
Is it a data pipeline issue?
Is the data itself the issue? Is model inference the issue?
Select the target candidate: which layer of the architecture
is going to have the problem.
Then you inject the failures: you introduce synthetic faults,
such as missing data, or delays in model response time,
the model responding slowly.
Then you observe how the system fails and measure the impact.
Is the system degrading? Is wrong data coming out?
That's what you're going to observe.
Then you work on improving the resilience of the system, meaning:
if any of these failures happens in reality, how can the system
automatically recover and make itself better?
That is what you actually expect your AI system to do.
These are the five steps you follow to design your AI chaos
experiments, captured in the sketch below.
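As a rough illustration of these five steps, here is a hedged Python sketch; the ChaosExperiment fields and the placeholder experiment are my own naming, not a specific framework.

```python
# A sketch capturing the five steps as a declarative experiment
# definition; field names are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChaosExperiment:
    hypothesis: str              # step 1: what do we expect to happen?
    target: str                  # step 2: which layer/component is attacked?
    inject: Callable[[], None]   # step 3: introduce the synthetic fault
    observe: Callable[[], dict]  # step 4: measure degradation and impact
    improve: str                 # step 5: planned resilience fix if it fails

corrupted_data_experiment = ChaosExperiment(
    hypothesis="Model accuracy stays above 90% with 20% corrupted inputs",
    target="data pipeline",
    inject=lambda: None,                # placeholder: swap in a real fault injector
    observe=lambda: {"accuracy": 0.0},  # placeholder: swap in a metrics collector
    improve="add input validation and quarantine bad records",
)
```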
So how do you make your AI system stronger?
In the previous step I described how you design the whole chaos
experiment; here, the question is what faults you can introduce
into AI systems.
I'm going to talk about six different kinds of faults you can introduce
into the AI system architecture, then see how the system behaves and
come up with a solution for how it should automatically self-heal
and minimize the issues that users face.
This is what we do as part of breaking AI to make it stronger:
we introduce the chaos, and then we make the AI system stronger;
breaking AI to make it better.
One of the kinds of chaos we can introduce into the system
is adversarial attack testing: how an intruder can manipulate
a system's behavior by injecting external perturbations.
In this case, we try to tamper with existing images,
the images used for facial recognition.
Here you use the code shown in red on the slide.
We use the Fawkes tool to modify the existing image and disrupt
the AI facial recognition images and models.
The expected outcome is that training with those images should not
have any impact on recognizing the person's face: the system should
still recognize the original face, the original class.
Otherwise we say the system has a vulnerability.
When we compute it, the cosine similarity should be close to one.
If the system recognizes the face properly, that means there is no
change: the system is not vulnerable, the intruder could not alter
the model's behavior, and it is still behaving as expected.
But if the cosine similarity comes close to zero, the system is
vulnerable: it is not recognizing the face of the person
who is supposed to be authenticated.
This is how you introduce an adversarial attack and see whether
the system is vulnerable or not.
The expected behavior is that it should not be.
That's where you need to work on the system's resilience.
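For illustration, here is a minimal Python sketch of the cosine check just described; it assumes some face-embedding function embed() (for example, from a FaceNet-style model) and that the Fawkes-style perturbation has already been applied to the image, both of which are my assumptions, not the slide code.

```python
# A minimal sketch of the adversarial robustness check: compare
# embeddings of the original and perturbed image via cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_robustness(embed, original_img, perturbed_img, threshold=0.9):
    """Return True if recognition survives the perturbation (cosine ~ 1)."""
    sim = cosine_similarity(embed(original_img), embed(perturbed_img))
    if sim >= threshold:
        print(f"cos={sim:.3f}: face still recognized, system not vulnerable")
        return True
    print(f"cos={sim:.3f}: recognition broke, system is vulnerable")
    return False
```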
Next is the GPU failure simulation.
Here we try to simulate memory leaks on a specific GPU and see whether
the system can automatically handle the resulting out-of-memory errors:
how it can transfer the affected processing to another GPU or a CPU
so the system keeps working as expected.
With this code, when I run it on a specific GPU, you see the
out-of-memory error and the system fails.
That means it is vulnerable.
So you need to write code such that if this memory leak happens,
the task is transferred to another GPU or a CPU.
That is the expected outcome.
If it transfers properly, the GPU failure simulation worked
and the system is resilient to this kind of outage.
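Here is a hedged sketch of such a simulation using PyTorch; this is not the exact slide code, and the tensor sizes are arbitrary and need tuning to your hardware (the CPU fallback is likewise only illustrative of the recovery policy).

```python
# A sketch: deliberately leak GPU memory, catch the OOM, and re-run
# the work on a fallback device instead of crashing.
import torch

def leaky_task(device: torch.device, blocks: int = 64) -> int:
    # Hold references so nothing is freed, mimicking a memory leak.
    held = [torch.empty(1024, 1024, 64, device=device) for _ in range(blocks)]
    return len(held)

def run_with_fallback(task) -> int:
    try:
        return task(torch.device("cuda:0"))
    except torch.cuda.OutOfMemoryError:
        # The resilient behavior we want: release GPU memory and
        # transfer the task to a fallback device.
        torch.cuda.empty_cache()
        print("OOM on cuda:0, falling back to CPU")
        return task(torch.device("cpu"))

if torch.cuda.is_available():
    print(run_with_fallback(leaky_task))
```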
The next one is the data pipeline latency simulation:
how do you introduce a delay into the system?
Say you want to simulate network latency; you write the code
shown on the right-hand side of the slide.
Here I'm injecting 500 milliseconds of latency, keeping it in place
for 60 seconds, and seeing whether my system still gets a response.
That is the chaos you inject into the AI data pipelines, to see
whether the model handles the delays, fails gracefully, or just
fails without any warning. A sketch of one way to do this follows.
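This is a minimal sketch of that latency injection, assuming a Linux host where tc netem is available, an interface named eth0, and root privileges in a disposable test environment; all three are my assumptions.

```python
# A sketch: add 500 ms of network delay with `tc netem`, hold it for
# 60 seconds, then always revert the fault.
import subprocess
import time

IFACE = "eth0"  # assumption: adjust to your test host's interface

def inject_latency(delay_ms: int = 500, duration_s: int = 60) -> None:
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms"],
        check=True)
    try:
        # Observe here whether the pipeline degrades gracefully.
        time.sleep(duration_s)
    finally:
        # Revert the fault even if observation fails.
        subprocess.run(
            ["tc", "qdisc", "del", "dev", IFACE, "root", "netem"],
            check=True)

if __name__ == "__main__":
    inject_latency()
```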
The expected outcome: if the information is available in a cache
or some other layer, the system should serve it from there;
otherwise it should transfer the request to the nearest data center
or the nearest deployment of the system.
But in the current environment, here you see failures.
These are real failures that were captured.
If you look at it, you have a DeepSeek model failing with a network
issue, and here ChatGPT failing with a similar kind of issue.
So even these are not fully resilient, despite the huge
infrastructure these two companies run on.
So data pipeline latency simulation and automatic recovery
are very important for any AI-driven system.
The next one is data corruption in AI training.
If you train AI models with corrupted data, how is the system
actually going to behave?
In this scenario, you load corrupted data into the system over
a period of time.
Each time you train with 10 percent corrupted data, and over time
you end up with mostly corrupted data in the system.
When you're training with corrupted data, the system should really
understand whether it is getting genuine data or corrupted data.
If the system is unable to distinguish corrupted from uncorrupted
data, it will take the data, the model and the training will all be
poisoned, and it will give the wrong output.
For example, here are the MLOps layers.
I used one of ChatGPT's models to generate a picture of machine
learning operations layers, and what I got was a sphere.
I was laughing looking at this image, because machine learning
operations layers should look like multiple layers;
a sphere tells you nothing.
A machine learning system architecture is what I wanted to generate,
but this is what I got.
If you feed the model only data about squares and spheres, then
whatever architecture you ask for, that's the only input it has,
and you'll end up seeing something like this.
That is the problem.
If this is the outcome, the system is not resilient.
If the system gives the expected output, what you're supposed to get,
the system is resilient, it is doing the right thing,
and users can rely on it.
Data corruption should not get through: even if you feed in corrupted
data, the AI model should take only the valid data and generate
the expected outcome properly.
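As a hedged illustration of this experiment, the following Python sketch flips a fraction of labels and shows the kind of validation gate a resilient pipeline should apply; the range check is a deliberately simple stand-in for real data validation.

```python
# A sketch: corrupt 10% of training labels, then quarantine records
# that fail a basic validity check before they reach training.
import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(labels: np.ndarray, fraction: float = 0.10) -> np.ndarray:
    out = labels.copy()
    idx = rng.choice(len(out), size=int(fraction * len(out)), replace=False)
    out[idx] = -1                      # an obviously invalid class id
    return out

def validate_batch(features, labels, num_classes):
    # Resilient pipelines feed only records that pass validation
    # into training, instead of silently ingesting corruption.
    ok = (labels >= 0) & (labels < num_classes)
    print(f"quarantined {np.count_nonzero(~ok)} corrupted records")
    return features[ok], labels[ok]

labels = rng.integers(0, 10, size=1_000)
features = rng.normal(size=(1_000, 8))
clean_x, clean_y = validate_batch(features, corrupt_labels(labels), num_classes=10)
```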
Then there is model drift simulation.
Say your model is drifting: it is becoming biased, making wrong
decisions over time, slowly building behavior that is not what
it is supposed to be.
That means the accuracy of the model is going down,
slowly drifting downward on a scale.
Here you see months on the x-axis and accuracy percentage on the y-axis.
You see the model slowly start drifting; after some time,
you see only inaccurate information.
When the model is drifting, you need to retrain it; you need to take
the necessary remedies to ensure the drift is corrected.
The system should have automatic recovery from drift as well.
You can use code like the sketch below to simulate drift.
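This is a small sketch of what such a drift simulation could look like; the decay curve and the 0.85 retraining threshold are synthetic numbers of my own.

```python
# A sketch: simulate accuracy decaying month over month and flag
# when it crosses a retraining threshold.
import numpy as np

months = np.arange(1, 13)
accuracy = 0.95 * np.exp(-0.03 * months)   # synthetic slow downward drift

THRESHOLD = 0.85                            # illustrative retraining trigger
for month, acc in zip(months, accuracy):
    print(f"month {month:2d}: accuracy={acc:.3f}")
    if acc < THRESHOLD:
        print("drift detected: trigger retraining / remediation")
        break
```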
Then, AI model fallback testing.
If the AI model is not giving the intended outcome, if faulty
information is coming out, the system should not keep using
the latest model.
If the latest model has an issue, it has to fall back
to the stable model.
Here I have multiple layers of decisions, with multiple models
already interconnected.
Say you have five or six versions of a model.
The first one you try is the latest.
If its output is not valid, you take a decision and fall back;
if that one is not accurate either, you take another decision.
It walks through multiple decisions to ensure that the user
always gets accurate information.
This is how you do model fallback testing: if a model is not giving
accurate results, the system must always fall back
to the latest stable model.
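Here is a minimal sketch of such a fallback chain; the is_valid check and the toy models are my own illustration of the idea, not the speaker's implementation.

```python
# A sketch: try the newest model first; if its answer fails a validity
# check, walk back toward the last stable version.
def predict_with_fallback(models, request, is_valid):
    """`models` is ordered newest-first, ending at the stable baseline."""
    for version, model in models:
        result = model(request)
        if is_valid(result):
            return version, result
    raise RuntimeError("all model versions failed validation")

# Usage sketch: broken v3 and v2 fall through to the stable v1.
models = [
    ("v3", lambda r: None),                 # latest, misbehaving
    ("v2", lambda r: None),                 # also failing validation
    ("v1-stable", lambda r: f"answer for {r}"),
]
version, answer = predict_with_fallback(models, "query", lambda x: x is not None)
print(version, answer)
```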
Then, chaos engineering principles for AI applications.
How do you actually implement this?
Now we have seen the different kinds of chaos you can introduce in AI.
Can we do all of these things together, even in a controlled
environment? Can we implement everything at once?
I say no: you have to start small and start slow.
Even though you run the chaos in a controlled environment,
target only one aspect of the system at a time.
Also restrict the impact on the system to a small set of users
and a small region of the system.
Control the blast radius, meaning the number of users affected,
because you don't want to put your business at risk.
Confine the whole impact to a very limited segment,
and then automate the recovery.
The system should have self-healing in place whenever a failure comes;
the whole intent is for it to recover by itself without any
human intervention.
Then you need to measure the impact: how long recovery takes,
how much business value you are losing, and what challenges you
will face when the impact really happens.
This is what you need to do, and the sketch below shows one way
to encode these guardrails.
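One hedged way to capture these principles is as an experiment configuration; every field name and limit here is illustrative, not from a particular tool.

```python
# A sketch of "start small, limit the blast radius" guardrails
# for a single chaos experiment.
guardrails = {
    "target_fault": "data_pipeline_latency",  # one fault at a time
    "user_segment_pct": 1,            # expose at most 1% of users
    "regions": ["test-region-1"],     # never the whole fleet
    "max_duration_s": 300,            # hard stop for the experiment
    "abort_conditions": {             # automatic rollback triggers
        "error_rate_pct": 5,
        "p99_latency_ms": 2000,
    },
    "auto_recover": True,             # self-heal without human intervention
}
```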
Next, you need to find out which tools to use for chaos engineering.
Some of them you may also see for cloud-native applications,
and some are definitely for AI.
The first set of tools, like Gremlin, Chaos Monkey, and LitmusChaos,
you can use for cloud-native as well as AI applications.
Then there are AI-specific tools: TensorFlow Model Analysis
and AI explainability tools.
These are specific to AI, and you use them to ensure that the models
are accurate and giving predictable results.
And then observability, to monitor the metrics and the system's
behavior, its processing and usage.
You use observability tools to capture metrics like these;
you have Prometheus, Grafana, and the ELK Stack,
and you can use these tools to monitor the systems.
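As a small illustration of this kind of observability, here is a hedged sketch that exports a model-latency metric with the prometheus_client library (to be scraped by Prometheus and graphed in Grafana); the metric name and the simulated workload are my own choices.

```python
# A sketch: expose an inference-latency histogram on :8000/metrics.
import random
import time
from prometheus_client import Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "model_inference_latency_seconds",
    "Time spent serving one model inference",
)

def serve_inference():
    with INFERENCE_LATENCY.time():             # records duration into the histogram
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real model work

if __name__ == "__main__":
    start_http_server(8000)                    # metrics endpoint for Prometheus
    while True:
        serve_inference()
```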
Given everything we have seen, system reliability in AI
is very important.
Failures can come in different layers of your AI systems architecture.
Failures can be in the models, meaning the models may not give you
accurate information, maybe a lot of hallucination.
The data itself can also be a problem: what data you're feeding in,
what training you're doing, and what cleanup you're doing;
all of that is very important.
Then data pipelines: what pipelines the information is flowing
through and where it is coming from.
And infrastructure.
These are the main sources of issues in system reliability.
You need to ensure that the system always gives accurate information
and is trustworthy for any given query. To do that, you need to run
the chaos experiments and have the remedies and self-healing in place.
So, implementing AI chaos engineering in your organization:
how do you do it?
Again, you need to build a culture of resilience testing, meaning your
dev and engineering teams should have a culture of auto-healing
for any kind of issue that comes up.
Preventing issues is very hard, but auto-healing makes more sense:
failures will come, and they are very difficult to avoid, but how the
system heals by itself and performs better by itself
is what's very important.
That's where you need that culture.
Define the key AI failure scenarios: come up with the kinds of AI
failures that may occur, simulate them as part of your chaos
engineering, and have all your remedies in place.
Use automated chaos testing frameworks; you may need to build them,
and while developing the system itself you need to keep all of
these perspectives in mind.
If I'm developing this specific code, and this middleware fails,
what will happen? If this specific component fails, what will happen?
That thought process should be in place, and you should continuously
improve the system design.
This is where the improvement comes into the picture:
for every API, every piece of code, every program, every component
that you build, you need to think through all these aspects.
Now, the challenges in chaos engineering, especially for AI systems.
First of all, we already have a lot of trust issues with AI models.
If you introduce chaos into real-world applications, it can become
a real problem, because the chaos might have real impact.
Say there is specific user data sitting on the system, you introduce
the chaos, and it manipulates the data; then the data is retrieved
by a specific user from a specific region,
and they may get the wrong data.
That's the problem: you are intruding on the data privacy of
a specific user and causing security risks.
This is the danger of doing chaos engineering
in real-world applications.
It is also very difficult to track cascading failures, the
dependencies where one issue leads to another; monitoring and
tracking those is one of the major challenges.
So how do you get rid of these things?
You need to be very conservative: going back to the previous slide,
ensure that these complexities and data privacy concerns are taken
care of, and run everything in a very controlled environment
to eliminate these issues.
Now, the future of chaos engineering in AI.
As I mentioned, you need to build robust systems by implementing
self-healing, and integrate AI operations for proactive
failure prevention.
Like DevOps, you need artificial intelligence operations (AIOps):
for any failure that comes, how can we proactively prevent it?
Self-healing with proactive failure protection helps systems become
stable and work reliably. You also need continuous monitoring and
adaptive learning in AI models.
Use your observability metrics and alerting mechanisms, so that if
something goes wrong, the responsible person is alerted
and remedies are taken care of.
So what are the key takeaways?
Chaos engineering is essential for AI resilience.
For AI systems to become resilient against any of these intrusions
that may happen, you need to plan the chaos, run it, inject those
failures into the system, and make sure your AI systems
really are resilient.
Focus on data pipelines, model inference, and cloud infrastructure,
whatever AI components you have.
Ensure that each of them is tested with this chaos, and use
controlled failure injection to strengthen AI reliability.
You should not do it across the whole system.
Stay in a controlled environment, with a specific region and a
specific user group; ensure you notify the affected users, the QA
team, and everybody who is supposed to know before you inject
faults into the system.
That way you minimize the impact.
Leverage automation to improve scalability and recovery.
Doing these things manually is very difficult, so you need automated
scripts to fall back, automated ways to make the system heal itself:
falling back to a stable model, or if there is a GPU issue, moving
the specific task onto a different GPU or a different host.
All of those things should be in place and happen automatically.
These are the key takeaways when you talk about chaos engineering
in the AI world.
With this, I'm concluding.
Thank you very much for attending my talk.
If you have any questions, you can connect with me on LinkedIn
and reach me directly there.
Thank you.
Thanks for listening.