Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and thank you for joining me today.
My name is Garish.
Today we'll go beyond SLOs and see how AI is revolutionizing
site reliability engineering.
So before we dive into specifics, let's take a moment to look at the roadmap
for our discussion. Today we're gonna cover a lot of important topics related
to SRE and how AI is transforming it.
We'll start by talking about the core challenges that SRE faces today.
That will set the stage for talking about why we need some of these
advanced solutions for them.
We'll talk about the what and the how: what the different
solutions are, how they're applied, and where they're applicable.
We'll look at anomaly detection in a little bit of detail.
We'll break down what anomalies are, the different types, and how we detect them.
This is crucial for understanding how we can start to predict them.
And then we'll look at some case studies of big companies out there
who use some of these techniques effectively.
We'll look at tools and technologies.
We'll discuss the major open source and commercial options available for
applying some of these solutions.
We'll look at the future of AI and SRE.
What the trends in this field are.
And finally we'll wrap up with some practical recommendations for getting
started and applying these practices.
Alright, so what are the key metrics that we are going to rely upon as
we start building these systems?
In the world of SRE, metrics like availability, latency,
throughput, and error rates form the foundation for a lot of
the analysis that we do. These metrics are the North Star, and they help
us build reliable user experiences.
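To make these concrete, here's a minimal sketch of how these core metrics fall out of raw request data. The sample records and field names are made up for illustration, not any particular monitoring format:

```python
# Hypothetical request records: latency plus a success flag per request.
records = [
    {"latency_ms": 120, "ok": True},
    {"latency_ms": 95, "ok": True},
    {"latency_ms": 480, "ok": False},
    {"latency_ms": 110, "ok": True},
]

total = len(records)
successes = sum(1 for r in records if r["ok"])
availability = successes / total          # fraction of successful requests
error_rate = 1 - availability             # fraction of failed requests

latencies = sorted(r["latency_ms"] for r in records)
p50 = latencies[len(latencies) // 2]      # rough median latency

print(availability, error_rate, p50)
```

Throughput would just be `total` divided by the window the records cover; real systems compute all of these over sliding time windows rather than a fixed list.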
We'll also talk about the monitoring explosion that is happening.
So it's almost as if we have become too good at collecting data and
now there's an explosion of it.
There's too many signals.
And too much noise.
So that makes it difficult for us to pinpoint problems and
understand what's really going on.
It also causes alert fatigue, right?
If there are just too many signals and too much noise coming in, we don't
know what the false positives are, or which pieces of data are significant
or insignificant. This can be problematic because it delays how quickly we
can implement solutions to these problems.
So then we'll talk about the shift in predictive reliability.
So traditionally SRE has been reactive.
We wait for things to break and then we fix them.
But today we are moving more towards being able to predict problems
and be ready for them before they actually happen.
And this is where AI and ML also come in and we'll dive
into these in the next section.
Alright, firstly let's talk about the what: what can AI do for us?
So one of the things it can do is intelligent alerting.
What that means is AI can be used to rein in the flood of alerts that we can
get if we are gathering too many metrics, because AI can filter out the noise
and identify the really crucial bits of data that we should be focusing on.
This reduces alert fatigue and helps us focus on what is important.
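AI-based filtering goes well beyond this, but as a minimal illustration of the idea, here's a sketch of the simplest form of noise reduction: deduplicating repeated alerts inside a time window. The alert names and window size are hypothetical:

```python
def dedupe_alerts(alerts, window=300):
    """alerts: list of (timestamp_seconds, alert_name), in arrival order.
    Keep the first alert of each name per time window; suppress repeats."""
    last_kept = {}
    kept = []
    for ts, name in alerts:
        if name not in last_kept or ts - last_kept[name] >= window:
            kept.append((ts, name))
            last_kept[name] = ts   # only kept alerts reset the window
    return kept

# The same failing check fires four times; only two pages go out.
alerts = [(0, "disk_full"), (60, "disk_full"), (120, "disk_full"),
          (400, "disk_full"), (10, "oom")]
print(dedupe_alerts(alerts))   # [(0, 'disk_full'), (400, 'disk_full'), (10, 'oom')]
```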
The next "what" we should look at is predictive maintenance.
So this is one of the major benefits of AI: its ability to look at
past data and help us identify patterns that we can use to forecast
problems before they happen.
So this proactive approach helps us avoid outages altogether.
The next thing we should look at is automated remediation.
So this is an exciting topic.
So AI can also remediate many of the problems that we face in an automated way.
When an issue is detected, AI can trigger corrective actions automatically.
This significantly reduces the time it takes to solve problems,
and it minimizes downtime.
And systems can become self-healing.
AI also helps us in capacity planning.
So this is another crucial area.
AI can predict future demand and help us optimize resource
allocation accordingly.
This ensures that we have enough resources to handle peak loads without over
provisioning and without wasting money.
It also helps us be efficient and cost effective.
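As a hedged, much-simplified stand-in for the models real platforms use, here's a sketch of demand forecasting with a plain linear trend fit. The daily peak numbers are made up:

```python
def linear_forecast(history, steps_ahead):
    """Fit y = a*t + b by least squares over the history, then extrapolate."""
    n = len(history)
    ts = range(n)
    mean_t = sum(ts) / n
    mean_y = sum(history) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, history))
    var = sum((t - mean_t) ** 2 for t in ts)
    a = cov / var                 # slope: demand growth per step
    b = mean_y - a * mean_t       # intercept
    return a * (n - 1 + steps_ahead) + b

# Daily peak requests-per-second over the past week (hypothetical):
demand = [100, 110, 120, 130, 140, 150, 160]
print(linear_forecast(demand, 1))   # next day's predicted peak: 170.0
```

Real capacity planning models would also account for seasonality and uncertainty; this only captures the trend.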
We can use AI to do root cause analysis. AI can assist us in quickly
identifying the root cause of problems.
It can do this by analyzing various data points and correlations from
the past, and it can identify and point to the source of issues, saving
a lot of time and effort that you would've otherwise had to spend
doing this using traditional approaches.
We can also use AI for incident management.
AI can automate various incident management tasks, such as ticket
creation and prioritization.
This streamlines the incident response process and ensures that
critical issues are addressed promptly.
And let's talk about ChatOps enhancement.
So finally, AI can enhance ChatOps.
By using natural language processing to interact with systems
and run diagnostics via chat.
So this makes it easier for teams to troubleshoot and
manage systems in real time.
Excellent.
Alright, now that we've seen what AI can do in the field of SRE,
let's talk a little bit about the how.
Let's look at the different machine learning algorithms and
how they're applicable here.
So there's of course, supervised learning.
So think of this as AI learning by looking at past examples.
So we provide labeled data, that is, data where we've had the correct outcome,
or data about incidents from the past, and AI can use that to identify
what issues look like and what the resolutions have been in the past.
There's also unsupervised learning. This is where AI can find patterns
on its own, without labeled data.
So this is incredibly useful for anomaly detection.
AI can look for unusual behavior and deviations from the norm.
It's like saying, hey, does something look off here?
Let me know if something looks off so I can investigate.
So for example, if there is a sudden spike in traffic, anomaly
detection algorithms can identify it because it looks different from
the normal, non-spike behavior.
There's also reinforcement learning.
This is where we are training our AI through trial and error.
AI takes an action and receives feedback from us, and it uses that to improve how
it does remediation the next time around.
Using this, it can find optimal ways to
respond to different situations.
For example, AI can try different ways to restart a service; if one method fails,
it can realize that and look for a different approach.
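As a toy sketch of this trial-and-error loop, here's an epsilon-greedy bandit that learns which of several hypothetical restart strategies tends to succeed. The strategy names and success rates are simulated, not from any real system:

```python
import random

class RemediationBandit:
    """Epsilon-greedy learner over a set of remediation actions."""
    def __init__(self, actions, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in actions}
        self.values = {a: 0.0 for a in actions}   # running success estimate

    def choose(self):
        if random.random() < self.epsilon:        # explore occasionally
            return random.choice(list(self.counts))
        return max(self.values, key=self.values.get)  # otherwise exploit

    def record(self, action, reward):
        # Incremental average of observed rewards for this action.
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n

bandit = RemediationBandit(["soft_restart", "hard_restart", "failover"])
for _ in range(200):
    action = bandit.choose()
    # Simulated feedback: pretend "failover" succeeds 90% of the time,
    # the other strategies only 30%.
    success = random.random() < (0.9 if action == "failover" else 0.3)
    bandit.record(action, 1.0 if success else 0.0)

print(max(bandit.values, key=bandit.values.get))  # usually "failover"
```

Real reinforcement learning for remediation is far richer (states, delayed rewards), but the feedback-driven update is the same idea.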
Let's talk about the different anomaly detection techniques.
So there are statistical methods.
These are based on statistical techniques where we are looking at
our data, calculating various scores, and using those to identify anomalies.
So for example with the Z score, we are measuring how a specific data
point deviates from the average.
With the MAD, the median absolute deviation, we are measuring how it deviates from the median.
And this can help us identify outliers like individual data points that
significantly differ from the rest.
For example, if the average response time is a hundred milliseconds and we see a
sudden spike of 500, we know this is an outlier because it significantly
deviates from the average or median.
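Here's a minimal sketch of both checks using only the standard library; the latency numbers mirror the 100 ms baseline and 500 ms spike example:

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / stdev > threshold]

def mad_outliers(data, threshold=3.5):
    """Flag points whose scaled deviation from the median exceeds `threshold`."""
    median = statistics.median(data)
    mad = statistics.median(abs(x - median) for x in data)
    # 0.6745 scales MAD to be comparable to a standard deviation.
    return [x for x in data if mad and 0.6745 * abs(x - median) / mad > threshold]

# Response times in milliseconds, with one clear spike.
latencies = [100, 98, 102, 101, 99, 103, 97, 500]
print(mad_outliers(latencies))   # [500]
```

Note that on this data the MAD check is the more robust of the two: the spike itself inflates the mean and standard deviation, so the plain z-score needs a lower threshold to flag it.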
We can also use clustering techniques: we can look at groups of data points,
cluster them together, and see what doesn't fit in any cluster.
Whatever doesn't fit becomes an outlier. It's like finding the odd one out.
So an example would be CPU utilization: we are seeing consistently
some utilization, and then suddenly there's a different utilization
number. It falls outside the cluster, and we know we should focus on that.
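As a toy illustration of the "odd one out" idea (a tiny one-dimensional k-means with two clusters, not a production technique), the CPU example can be sketched like this:

```python
def kmeans_1d(data, iters=20):
    """Tiny k-means with k=2 over one-dimensional data."""
    centroids = [min(data), max(data)]          # simple initialization
    for _ in range(iters):
        clusters = [[], []]
        for x in data:
            # Assign each point to its nearest centroid.
            nearest = min((0, 1), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# CPU utilization samples (hypothetical): steady around 40%, one burst at 95%.
cpu = [38, 40, 41, 39, 42, 40, 95]
clusters = kmeans_1d(cpu)
outliers = min(clusters, key=len)   # the much smaller cluster stands out
print(outliers)   # [95]
```

Real clustering-based detectors (DBSCAN and similar) work in many dimensions and don't fix the number of clusters up front, but the principle is the same: points that don't sit inside a dense group deserve attention.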
We also have different AI ops approaches that we can use.
So these are different ways in which AI can be incorporated in our SRE practices.
For example, we can have the overlay model, which is pretty much about
taking our AI capabilities and running them alongside our existing tools.
Our current monitoring systems and alerting systems stay in place, but we
add an AI layer that gives us additional signals, additional insights
that we can use to make decisions.
This is a good on-ramp to AI in SRE because it doesn't require replacing
a lot of existing infrastructure, and we can go step by step.
The other approach is with embedded models.
Think of this as monitoring platforms that have AI
capabilities already built into them.
Many modern platforms now offer AI-driven anomaly detection, predictive
analytics, or automated alerting.
These help us create streamlined workflows
and give a more integrated experience.
And one very crucial aspect of how AI works in SRE is with automated remediation.
AI can go beyond just detecting problems and can also
take corrective action, right?
This could involve things like restarting services, scaling resources, or
triggering other automated workflows.
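A minimal sketch of that detect-then-act idea follows. The issue names, service names, and actions here are hypothetical placeholders, not a real platform's API:

```python
# Map detected issue types to corrective actions (a "playbook").
PLAYBOOK = {
    "high_error_rate": lambda svc: f"restarting {svc}",
    "high_latency": lambda svc: f"scaling up {svc}",
}

def remediate(issue, service):
    """Run the playbook action for a detected issue, or escalate."""
    action = PLAYBOOK.get(issue)
    if action is None:
        return f"paging on-call for {service}"   # fall back to a human
    return action(service)

print(remediate("high_error_rate", "checkout"))  # restarting checkout
print(remediate("disk_full", "checkout"))        # paging on-call for checkout
```

The important design point is the explicit fallback: automated remediation only handles issues it has a vetted action for, and everything else still goes to a person.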
Alright.
Now, what are some of the challenges that we face implementing this?
One is, of course, data quality and data requirements.
For implementing any of these approaches, we need access to a lot of high-quality
data, which may or may not be available.
AI is a very data-intensive set of approaches.
If data is incomplete, inaccurate, or noisy, AI performance suffers.
So we need to invest in robust data collection, cleaning,
and management practices.
Another challenge we run into implementing this is with
the training and tuning of models.
This is an ongoing thing.
We can put systems in place, but we also need to invest in making sure that
these systems are continuously trained with data as it evolves.
The data that a system sees evolves over time, and so our systems also
need to be aware of the new normal.
Alright, let's look at anomaly detection in a little more detail.
Anomaly detection at its core is just identifying deviation
from the expected, right?
We first need to identify what expected behavior is, and we can do
that by learning from training data, so we understand what normal looks like
and can spot what is not normal.
Here are some types of anomalies.
For example, a point anomaly is a single data point that is an outlier.
It stands on its own.
A sudden spike in CPU utilization on one server
is an example of a point anomaly.
We could also have contextual anomalies that are unusual in a given context.
Some system behavior might be normal in one situation
and become anomalous in a different situation, right?
For example, if we see various spikes in our system utilization during a normal
workday, that might not be unusual.
But if there is a spike in CPU activity at, say, 3:00
AM, then that might be abnormal.
That's an example of a contextual anomaly.
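One simple way to sketch a contextual check is to keep a separate baseline per hour of day, so a 3:00 AM reading is judged against 3:00 AM history. The CPU samples here are made up:

```python
from statistics import mean, pstdev

def contextual_anomaly(hour, value, history, threshold=3.0):
    """history maps hour-of-day -> list of past readings for that hour.
    Flag the value if it's far from that hour's own baseline."""
    baseline = history[hour]
    mu, sigma = mean(baseline), pstdev(baseline)
    return sigma > 0 and abs(value - mu) / sigma > threshold

# Hypothetical CPU% samples: quiet at 3 AM, busy at 2 PM.
history = {3: [5, 6, 4, 5], 14: [60, 65, 55, 62]}

print(contextual_anomaly(3, 60, history))    # True: 60% at 3 AM is way off
print(contextual_anomaly(14, 60, history))   # False: 60% at 2 PM is normal
```

The same 60% reading is anomalous or normal purely depending on its context, which is exactly what distinguishes contextual anomalies from point anomalies.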
Another set of anomalies, collective anomalies, are behaviors that we see only
when we collectively look at various data points.
For example, a gradual increase in latency across multiple
services over time, right?
Any one of those points by itself may not be an anomaly,
but taken together they give us a signal that something might be wrong.
Okay, what are some of the ways in which we can detect them?
There are statistical methods.
These involve analyzing data distributions and identifying outliers.
As the name suggests, we apply this by using various statistical
techniques, like moving averages and exponential smoothing, to identify what
an outlier is in a given set of data.
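As a sketch of the smoothing idea, assuming a fixed tolerance rather than anything adaptive: maintain an exponentially weighted moving average as the baseline and flag points that stray too far from it. Anomalous points are skipped when updating the baseline so a single spike doesn't shift what we consider normal:

```python
def ewma_anomalies(series, alpha=0.3, tolerance=50):
    """Return (index, value) pairs that deviate from the smoothed baseline."""
    anomalies = []
    avg = series[0]                      # seed the baseline
    for i, x in enumerate(series[1:], start=1):
        if abs(x - avg) > tolerance:
            anomalies.append((i, x))     # flag, and don't pollute the baseline
        else:
            avg = alpha * x + (1 - alpha) * avg   # update the smoothed baseline
    return anomalies

# Latency samples (ms): a single spike against an otherwise stable baseline.
print(ewma_anomalies([100, 102, 99, 101, 480, 100, 98]))   # [(4, 480)]
```

A production detector would derive the tolerance from the data (for example, a multiple of a smoothed deviation) instead of hardcoding it.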
We can also use machine learning methods.
We talked about clustering just now.
Clustering, time series forecasting, and other machine learning
methodologies let us cluster information or create a sense
of what normal is, so that we can identify when something is outside that normal.
One of the challenges we run into with anomaly detection is, of course,
seasonality. Many businesses have seasonality in their system usage.
So that needs to be factored in: what constitutes normal changes
at different times of the year or different times of the day.
Another problem is with data that might be noisy.
We might need some pre-processing to filter out the noise so we can accurately
identify what a true anomaly is.
We can also have situations where what constitutes normal evolves over time.
As systems grow and usage patterns change,
what might be normal today might not be normal tomorrow.
So we need to invest in ongoing training and tuning of our
algorithms to make sure that they are current in terms of what is normal.
And there might be cultural effort involved in making sure that our
data is labeled correctly, so that AI systems can use it effectively.
But despite all these challenges, anomaly detection is crucial and
is a very important technique for catching issues early.
Alright let's talk about a few case studies.
These are all publicly available, and they'll be worth reading
about if you're interested.
First, Netflix, and how Netflix used anomaly detection to prevent streaming outages.
Netflix, as we know, is a massive streaming service.
They have probably one of the largest streaming
infrastructures in the world.
And they rely heavily on anomaly detection to identify when something
is going wrong with their system.
So they continuously monitor their vast network.
They look for anomalies or any unusual patterns, and by identifying these
things early, they can take proactive steps and, have been by and large, good
at preventing streaming outages and ensuring smooth viewing experience.
So whenever they have a new launch of a blockbuster movie, they're able to predict
ahead of time the kind of spikes that they might see, and they plan for it.
So the other system we should talk about is LinkedIn.
So LinkedIn is, of course, the huge professional network that perhaps
all of us use, and they face the challenge of quickly resolving incidents.
So they have an AI-powered correlation system that's able to analyze a lot
of different data sources and identify relationships between these events.
This helps to pinpoint root causes of problems much faster than
manual investigation would have.
As a result, they significantly reduced their MTTR and minimized
the impact of incidents on their users.
So this is another showcase of how AI can streamline incident
response and improve efficiency.
Uber, the ride sharing service many of us use is another case study in how AI was
used successfully in the realm of SRE.
Uber faces a lot of fluctuating demand, and they use machine learning to predict
future traffic patterns and rider demands.
So with this, they're able to automatically scale their infrastructure up
and down as needed.
By predicting demand, they ensure they have enough resources to handle peak
times, and they don't waste money over-provisioning during off-peak times.
This demonstrates how AI can be used for proactive capacity planning and
resource optimization, thereby saving cost and improving performance.
Okay, let's look at some of the most famous and popular
tools in this space, right?
And we have open source as well as commercial tools and platforms that can
be used depending upon our situation.
So some of the really popular ones: Prometheus is a metrics
collection and storage system.
It's excellent for gathering data,
which is the foundation for any anomaly detection platform.
Grafana is a visualization and dashboarding system.
Using that, we can visualize some of the metrics that we are capturing
and we can, visualize anomalies.
There is the ELK stack of tools: Elasticsearch, Logstash, and Kibana.
With this stack, we can do log management and analysis.
We can collect, process, and search our logs,
and this is very useful for root cause analysis.
There's Prophet from Facebook, a useful tool for
predicting future trends and patterns.
Using this tool, we can do things like capacity planning and monitoring.
Then there are of course libraries like scikit-learn, TensorFlow, and PyTorch.
These can be used if we want to build our AI/ML systems from scratch and
apply some of these algorithms ourselves.
Then there are commercial platforms like Datadog, for
anomaly detection and forecasting;
New Relic, which has automated incident intelligence built into it;
Dynatrace, for automated monitoring; and Splunk, a log capturing system
with very advanced log analysis and anomaly detection built into it.
And depending upon whether we are on the Amazon ecosystem
or the Google Cloud ecosystem,
there are AI tools built into these cloud providers that help us with anomaly
detection, forecasting, log analysis, and a variety of these practices.
Alright, so where are we headed?
We know that there's going to be increased automation, especially in
the field of incident response.
AI will not just detect problems, but will also trigger and execute
corrective actions and corrective workflows.
It can do a variety of things before human intervention is needed.
Sometimes human intervention is not even needed, and that can speed up recovery
times. We know that there are gonna be more and more self-healing systems:
systems that can detect, diagnose, and fix problems all on their own,
without, or with minimal, human intervention. Self-healing
is becoming more and more of a buzzword, with AI and ML powering it.
We're gonna have gradually more and more sophisticated models that have
the ability to detect more complex problems.
And then, using deep learning, they are able to analyze vast amounts
of data and identify problems that otherwise may not have been caught.
Another trend is the rise of explainable AI.
We don't just want AI to make a decision or make a prediction.
We also want to understand how it got to that point, why
it made a certain decision.
Explainable AI is the approach of looking inside
the black box to see how AI is working, so that we are able to have
greater confidence in its decisions.
And of course we will have AI driven observability as well.
So AI will help us gain a more holistic comprehensive
understanding of our systems.
AI can help us correlate data from various sources, identify
dependencies, and provide a deeper level of insight into our system behavior.
Alright, so how can you get started?
Step one, focus on specific high value use cases.
So rather than trying to boil the ocean and fix everything on day
one, we should identify specific high-value use cases and leverage AI to
make an impact with those.
Think about what the biggest pain points of our systems are.
Where are we spending the most money and effort?
We start with those areas. For example, if the problem that you're facing
is alert fatigue, then you focus on that first.
The next step is building the right foundation.
Any AI system is only as good as the data we train it on, right?
So these systems need a solid foundation of data.
This means that we need the right metrics,
we have to store them properly,
and they should be available for analysis. We have to invest in good
monitoring tools and data pipelines.
Clean, comprehensive, relevant data is essential. We also need
to invest in skill development.
So implementing AI isn't just about adopting tools.
We must understand and build our own skills, and identify
what tool or what technique is applicable in a given situation.
And this is also an ongoing thing: we train, we hire accordingly,
and we keep ourselves updated, right?
So in essence, we start small, we create a roadmap, and we follow the roadmap.
All right, so to recap, we saw how AI, and anomaly detection in
particular, can significantly transform SRE practices.
By leveraging these techniques, we can achieve several benefits.
We can reduce MTTR, we can identify and address our issues quickly,
we can improve reliability, and we can increase our efficiency.
So I encourage all of us to explore the possibilities that AI,
and anomaly detection in particular, bring to our SRE workflows.
This may seem intimidating at first, but the potential for reward is very high.
And of course, we should start small: focus on specific use cases,
have a roadmap, and then follow the roadmap. This is going to be an
iterative process, so we're gonna try something, then improve it,
and repeat.
Alright.
Thank you so much for joining me today.