Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Beyond SLOs: Proactive Failure Prediction Through Anomaly Detection

Abstract

Learn how to shift from reactive monitoring to proactive prediction. Dive into the world of anomaly detection and find signals hidden in your data before they become outages.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and thank you for joining me today. My name is Gaurav. Today we'll go beyond SLOs and see how AI is revolutionizing site reliability engineering. Before we dive into specifics, let's take a moment to look at the roadmap for our discussion. Today we're going to cover a lot of important topics related to SRE and how AI is transforming it. We'll start by talking about the core challenges that SRE faces today; that sets the stage for why we need some of these advanced solutions. Then we'll talk about the what and the how: what the different solutions are, and how they're applied. We'll look at anomaly detection in a little bit of detail, breaking down what anomalies are, the different types, and how we detect them. This is crucial for understanding how we can start to predict them. Then we'll look at some case studies of big companies out there who use these techniques effectively. We'll look at tools and technologies, discussing the major open-source and commercial options available for applying some of these solutions. We'll look at the future of AI in SRE and what the trends in this field are. And finally, we'll wrap up with some practical recommendations for getting started and applying these practices. Alright, so what are the key metrics that we're going to rely upon as we start building these systems? In the world of SRE, metrics like availability, latency, throughput, and error rates form the foundation for a lot of the analysis that we do. These metrics are our North Star, and they help us build reliable user experiences. We'll also talk about the monitoring explosion that is happening. It's almost as if we have become too good at collecting data, and now there's an explosion of it: too many signals and too much noise. That makes it difficult for us to pinpoint problems and understand what's really going on. It also causes alert fatigue, right?
If there are just too many signals and too much noise coming in, we don't know what the false positives are, or which pieces of data are significant and which are insignificant, and this can be problematic because it delays how quickly we can implement solutions to these problems. Then we'll talk about the shift to predictive reliability. Traditionally, SRE has been reactive: we wait for things to break and then we fix them. But today we are moving toward being able to predict problems and be ready for them before they actually happen. This is where AI and ML come in, and we'll dive into these in the next section. Alright, first let's talk about the what: what can AI do for us? One of the things it can do is intelligent alerting. AI can be used to rein in the flood of alerts we get when gathering too many metrics, because AI can filter out the noise and identify the really crucial bits of data that we should be focusing on. This reduces alert fatigue and helps us focus on what is important. The next "what" we should look at is predictive maintenance. One of the major benefits of AI is its ability to look at past data and help us identify patterns that we can use to forecast problems before they happen. This proactive approach helps us avoid outages altogether. The next thing we should look at is automated remediation, which is an exciting topic. AI can also remediate many of the problems we face in an automated way: when an issue is detected, AI can trigger corrective actions automatically. This significantly reduces the time it takes to solve problems, it minimizes downtime, and systems can become self-healing. AI also helps us with capacity planning, which is another crucial area. AI can predict future demand and help us optimize resource allocation accordingly.
This ensures that we have enough resources to handle peak loads without over-provisioning and without wasting money; it helps us be efficient and cost-effective. We can also use AI for root cause analysis. AI can assist us in quickly identifying the root cause of problems by analyzing various data points and correlations from the past, and it can point to the source of issues, saving a lot of the time and effort we would otherwise have spent doing this with traditional approaches. We can also use AI for incident management. AI can automate various incident management tasks such as ticket creation and prioritization. This streamlines the incident response process and ensures that critical issues are addressed promptly. And let's talk about ChatOps enhancement. Finally, AI can enhance ChatOps by using natural language processing to interact with systems and run diagnostics via chat. This makes it easier for teams to troubleshoot and manage systems in real time. Excellent. Alright, now that we've seen what AI can do in this field, let's talk a little bit about the how. Let's look at the different machine learning algorithms and how they're applicable here. There's, of course, supervised learning. Think of this as AI learning by looking at past examples. We provide labeled data, that is, data where we know the correct outcome, or data about incidents from the past, and AI can use that to identify what issues look like and what the resolutions have been. There's also unsupervised learning, where AI can find patterns on its own without labeled data. This is incredibly useful for anomaly detection: AI can look for unusual behavior and deviations from the norm. It's like saying, "Hey, does something look off here? Let me know so I can investigate."
For example, if there is a sudden spike in traffic, anomaly detection algorithms can identify it because it looks different from the non-spike behavior. There's also reinforcement learning. This is where we train AI through trial and error: AI takes an action, receives feedback, and uses that to improve how it does remediation the next time around. Using this, it can find optimal ways to respond to different situations. For example, AI can try different ways to restart a service; if one method fails, it can recognize that and look for a different approach. Let's talk about the different anomaly detection techniques. First, there are statistical methods. These are based on statistical techniques where we look at our data, calculate various scores, and use those to identify anomalies. For example, with the Z-score, we measure how much a specific data point deviates from the average. With the MAD (median absolute deviation), we measure how much it deviates from the median. This helps us identify outliers: individual data points that significantly differ from the rest. For example, if the average response time is a hundred milliseconds and we see a spike of 500, we know this is an outlier because it significantly deviates from the average or median. We can also use clustering techniques: we group data points into clusters and see what doesn't fit. Whatever falls outside a cluster becomes an anomaly. An example would be CPU utilization: if we consistently see utilization in some range, and then suddenly there's a very different number, it falls outside the cluster and we know we should focus on it. We also have different AIOps approaches that we can use. These are different ways in which AI can be incorporated into our SRE practices.
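The Z-score and MAD checks just described can be sketched in a few lines of stdlib Python. The thresholds and latency numbers below are illustrative, not recommendations:

```python
import statistics

def zscore_outliers(samples, threshold=2.0):
    """Flag points whose Z-score (distance from the mean, measured in
    standard deviations) exceeds the threshold."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return [x for x in samples if abs(x - mean) / stdev > threshold]

def mad_outliers(samples, threshold=3.5):
    """Flag points far from the median, scaled by the median absolute
    deviation (MAD), which is robust to the outliers themselves."""
    med = statistics.median(samples)
    mad = statistics.median([abs(x - med) for x in samples])
    return [x for x in samples if mad and abs(x - med) / mad > threshold]

# Hypothetical response times: steady around 100 ms, with one 500 ms spike.
latencies_ms = [100, 98, 102, 101, 99, 103, 500]
print(zscore_outliers(latencies_ms))  # [500]
print(mad_outliers(latencies_ms))     # [500]
```

One thing this small example shows: a single large spike inflates the mean and standard deviation, which can mask the very outlier a Z-score is meant to catch; the median-based MAD check is far less affected, which is why it is often preferred for latency data.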
For example, we can have the overlay model, which is essentially taking AI capabilities and running them alongside our existing tools. Our current monitoring and alerting systems stay in place, but we add an AI layer that gives us additional signals and insights that we can use to make decisions. This is a good on-ramp to AI in SRE because it doesn't require replacing a lot of existing infrastructure, and we can go step by step. The other approach is embedded models. Think of this as monitoring platforms that have AI capabilities already built into them. Many modern platforms now offer AI-driven anomaly detection, predictive analytics, or automated remediation. These help us create streamlined workflows and give a more integrated experience. And then one very crucial aspect of how AI works in SRE is automated remediation. AI can go beyond just detecting problems and can also take corrective action, right? This could involve things like restarting services, scaling resources, or triggering other automated workflows. Alright, now what are some of the challenges that we face implementing this? One is, of course, data quality and data requirements. For any of these approaches, we need access to a lot of high-quality data, which may or may not be available. AI is a very data-intensive set of approaches; if data is incomplete, inaccurate, or noisy, AI performance suffers. So we need to invest in robust data collection, cleaning, and management practices. Another challenge we run into is the ongoing training and tuning of models. We can put systems in place, but we also need to make sure these systems are continuously trained as the data evolves. The data a system sees changes over time, so our systems need to be aware of the new normal. Alright, let's look at anomaly detection in a little more detail.
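As a rough illustration of the automated-remediation idea above, here is a minimal playbook-dispatch sketch. The condition names, service names, and actions are all hypothetical stand-ins; in a real system these functions would call an orchestration layer (Kubernetes, an internal API, a runbook tool):

```python
import logging

logging.basicConfig(level=logging.INFO)

# Hypothetical corrective actions.
def restart_service(name):
    logging.info("restarting %s", name)
    return f"restarted {name}"

def scale_out(name):
    logging.info("adding capacity for %s", name)
    return f"scaled out {name}"

# Playbook mapping a detected condition to a safe automated response.
PLAYBOOK = {
    "crash_loop": restart_service,
    "cpu_saturation": scale_out,
}

def auto_remediate(condition, service):
    """Run the playbook action for a detected condition, or escalate
    to a human when no safe automation is known."""
    action = PLAYBOOK.get(condition)
    if action is None:
        return f"escalated {service} to on-call"
    return action(service)

print(auto_remediate("crash_loop", "checkout"))
print(auto_remediate("disk_pressure", "checkout"))
```

The design point worth noting is the explicit escalation path: automated remediation should only fire for conditions with a known safe response, and everything else still goes to a human.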
Anomaly detection at its core is just identifying deviations from the expected, right? We first need to establish what expected behavior is, and we can do that by learning from historical data: we understand what normal looks like so we can spot what is not normal. Here are some types of anomalies. A point anomaly is a single data point that is an outlier; it stands on its own. For example, a sudden spike in CPU utilization on one server is a point anomaly. We can also have contextual anomalies, which are unusual in a given context. Some system behavior might be normal in one situation and anomalous in a different one. For example, if we see various spikes in system utilization during a normal workday, that might not be unusual; but if there is a spike in CPU activity at 3:00 AM, that might be abnormal. That's an example of a contextual anomaly. Another set of anomalies, collective anomalies, are behaviors we only see when we look at various data points together. For example, a gradual increase in latency across multiple services over time: any one of those points by itself may not be an anomaly, but taken together they give us a signal that something might be wrong. Okay, what are some of the ways in which we can detect them? There are statistical methods, which involve analyzing data distributions and identifying outliers. As the name suggests, we apply various statistical techniques, like moving averages and exponential smoothing, to identify what an outlier is in a given set of data. We can also use machine learning methods. We talked about clustering just now; clustering, time series forecasting, and other machine learning methodologies can be used to group information, or to build a sense of what normal is, so that we can identify when something is outside that normal.
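The moving-average technique just mentioned can be sketched as follows. The window size, tolerance, and latency series are hypothetical and would need tuning per metric:

```python
def moving_average_anomalies(series, window=5, tolerance=0.5):
    """Flag points that deviate from the trailing moving average by more
    than `tolerance`, expressed as a fraction of that average."""
    anomalies = []
    for i in range(window, len(series)):
        # Baseline is the average of the previous `window` points.
        baseline = sum(series[i - window:i]) / window
        if abs(series[i] - baseline) > tolerance * baseline:
            anomalies.append((i, series[i]))
    return anomalies

# Hypothetical latency series: steady around 100 ms with one large spike.
series = [100, 101, 99, 100, 102, 100, 105, 98, 510, 101]
print(moving_average_anomalies(series))  # [(8, 510)]
```

Note how the spike briefly inflates the trailing baseline for the points after it; exponential smoothing or a robust (median-based) baseline are common refinements for exactly that reason.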
One of the challenges we run into with anomaly detection is, of course, seasonality. Many businesses have seasonality in their system usage, and that needs to be factored in: what constitutes normal changes at different times of the year or different times of the day. Another problem is data that might be noisy; we might need some pre-processing to filter out the noise so we can accurately identify what a true anomaly is. We can also have situations where what constitutes normal evolves over time. As systems grow and usage patterns change, what is normal today might not be normal tomorrow, so we need to invest in ongoing training and tuning of our algorithms to keep them current. And there might be considerable effort involved in making sure that our data is labeled correctly so that AI systems can use it effectively. But despite all these challenges, anomaly detection is crucial and is a very important technique for catching issues early. Alright, let's talk about a few case studies. These are all publicly available, and they're worth reading about if you're interested. First, Netflix, and how Netflix used anomaly detection to prevent streaming outages. Netflix, as we know, is a massive streaming service; they have one of the largest streaming infrastructures in the world, and they rely heavily on anomaly detection to identify when something is going wrong with their systems. They continuously monitor their vast network, look for anomalies or any unusual patterns, and by identifying these things early they can take proactive steps. They have, by and large, been good at preventing streaming outages and ensuring a smooth viewing experience. Whenever they have a new launch of a blockbuster movie, they're able to predict ahead of time the kind of spikes they might see, and they plan for it.
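Returning briefly to the seasonality challenge mentioned above: one simple way to handle it is to keep a separate baseline per hour of day, so a value is judged against what is normal for its own context. The samples and threshold factor below are hypothetical:

```python
from collections import defaultdict
import statistics

def hourly_baselines(samples):
    """Build a per-hour-of-day baseline so 'normal' is context dependent.
    `samples` is a list of (hour, value) pairs."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: statistics.mean(values) for hour, values in by_hour.items()}

def is_contextual_anomaly(hour, value, baselines, factor=3.0):
    """A value is anomalous only relative to what is normal for that hour."""
    baseline = baselines.get(hour)
    return baseline is not None and value > factor * baseline

# Hypothetical CPU samples: quiet at 3 AM, busy at 2 PM.
history = [(3, 10), (3, 12), (14, 100), (14, 110)]
baselines = hourly_baselines(history)
print(is_contextual_anomaly(3, 90, baselines))   # True: 90 is huge for 3 AM
print(is_contextual_anomaly(14, 90, baselines))  # False: 90 is normal at 2 PM
```

The same value, 90, is flagged at 3 AM but not at 2 PM, which is exactly the contextual-anomaly behavior described earlier.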
The next case study we should talk about is LinkedIn. LinkedIn is, of course, the huge professional network that perhaps all of us use, and they face the challenge of quickly resolving incidents. They have an AI-powered correlation system that's able to analyze a lot of different data sources and identify relationships between events. This helps pinpoint root causes of problems much faster than manual investigation would. As a result, they significantly reduced their MTTR and minimized the impact of incidents on their users. This is another showcase of how AI can streamline incident response and improve efficiency. Uber, the ride-sharing service many of us use, is another case study in how AI was used successfully in the realm of SRE. Uber faces a lot of fluctuating demand, and they use machine learning to predict future traffic patterns and rider demand. With this, they're able to automatically scale their infrastructure up and down as needed. By predicting demand, they ensure they have enough resources to handle peak times, and they don't waste money over-provisioning during off-peak times. This demonstrates how AI can be used for proactive capacity planning and resource optimization, thereby saving cost and improving performance. Okay, let's look at some of the most famous and popular tools in this space. We have open-source as well as commercial tools and platforms that can be used depending upon our situation. Among the really popular ones, Prometheus is a metrics collection and storage system. It's excellent for gathering data, which is the foundation for anomaly detection. Grafana is a visualization and dashboarding system; using it, we can visualize some of the metrics we are capturing, and we can visualize anomalies. Then there is the ELK stack: Elasticsearch, Logstash, and Kibana. With this stack, we can do log management and analysis.
We can collect, process, and search our logs, and this is very useful for root cause analysis. There's also Prophet from Facebook, a useful tool for predicting future trends and patterns; using it, we can do things like capacity planning and monitoring. Then there are, of course, libraries like scikit-learn, TensorFlow, and PyTorch. These can be used if we want to build our AI/ML systems from scratch and apply some of these algorithms ourselves. Then there are commercial platforms: Datadog, with anomaly detection and forecasting; New Relic, which has automated incident intelligence built into it; Dynatrace, with automated monitoring; and Splunk, a log-capture system with very advanced log analysis and anomaly detection built in. And depending upon whether we are in the Amazon ecosystem or the Google Cloud ecosystem, there are AI tools built into these cloud providers that help us do anomaly detection, forecasting, log analysis, and a variety of these practices. Alright, so where are we headed? We know that there's going to be increased automation, especially in the field of incident response. AI will not just detect problems, but will also trigger and execute corrective actions and corrective workflows. It can do a variety of things before human intervention is needed; sometimes human intervention is not even needed at all, and this can speed up recovery times. We know there are going to be more and more self-healing systems: systems that can detect, diagnose, and fix problems all on their own, with minimal or no human intervention. "Self-healing system" is becoming more and more of a buzzword, with AI and ML powering it. We're going to have gradually more sophisticated models with the ability to detect more complex problems; using deep learning, they'll be able to analyze vast data sets and identify problems that otherwise may not have been caught.
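As a toy illustration of the trend-forecasting idea behind tools like Prophet, here is a plain least-squares extrapolation over hypothetical demand data. Real forecasting tools also model seasonality, changepoints, and uncertainty; this sketch only captures the linear trend:

```python
def forecast_demand(history, steps_ahead):
    """Fit a least-squares linear trend to past demand and extrapolate.
    A deliberately tiny stand-in for a real forecasting library."""
    n = len(history)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(history) / n
    # Ordinary least squares: slope = cov(x, y) / var(x).
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n + k) for k in range(steps_ahead)]

# Hypothetical requests-per-minute, growing roughly linearly week over week.
print(forecast_demand([100, 110, 120, 130], 2))  # [140.0, 150.0]
```

A capacity planner would compare such a forecast against provisioned capacity and scale ahead of the predicted peak rather than reacting to it.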
Another trend is the rise of explainable AI. We don't just want AI to make a decision or a prediction; we also want to understand how it got there, why it made a certain decision. Explainable AI is the approach of looking inside the black box to see how the AI is working, so that we can have greater confidence in its decisions. And of course we will have AI-driven observability as well. AI will help us gain a more holistic, comprehensive understanding of our systems: it can help us correlate data from various sources, identify dependencies, and provide a deeper level of insight into system behavior. Alright, so how can you get started? Step one: focus on specific high-value use cases. Rather than trying to boil the ocean and fix everything on day one, we should identify specific high-value use cases and leverage AI to make an impact there. Think about what the biggest pain points of our systems are, and where we are spending the most money and effort, and start with those areas. For example, if the problem you're facing is alert fatigue, focus on that first. Then, build the right foundation. Any AI system is only as good as the data we train it on, right? So it needs a solid foundation of data. This means we need the right metrics, we have to store them properly, and they should be available for analysis; we have to invest in good monitoring tools and data pipelines. Clean, comprehensive, relevant data is essential. We also need to invest in skill development. Implementing AI isn't just about tool adoption; we must build our own skills and learn to identify which tool or technique is applicable in a given situation. And this is an ongoing thing: we train, we hire accordingly, and we keep ourselves updated, right? So in essence, we start small, we create a roadmap, and we follow the roadmap.
All right, so to recap, we saw how AI, and anomaly detection in particular, can significantly transform SRE practices. By leveraging these techniques, we can achieve several benefits: we can reduce MTTR, we can identify and address issues quickly, we can improve reliability, and we can increase our efficiency. So I encourage all of us to explore the possibilities that AI and anomaly detection bring to our SRE workflows. This may seem intimidating at first, but the potential reward is very high. And of course, we should start small, focus on specific use cases, have a roadmap, and then follow the roadmap. This is going to be an iterative process: we try something, we improve it, and we repeat. Alright, thank you so much for joining me today.

Gaurav Mittal

Software Development Engineer @ Nordstrom



