Transcript
This transcript was autogenerated. To make changes, submit a PR.
Okay, thanks for joining my session. This is Tackling Alert Fatigue with SLOs.
My name is Mandy Walls.
I am a developer advocate for PagerDuty.
And if you'd like to get in touch with me, I'm happy to chat anytime.
You can email me.
I'm mWalls@pagerduty.com.
You can reach out on Blue Sky, you can hit me up on LinkedIn if you'd like.
You can reach out to our whole team.
We're community-team@pagerduty.com.
Or you can scan that QR code and join our community forums if you are a PagerDuty
customer, or maybe in the future. You can join us there and hang out, ask
questions, answer questions, share fun stuff.
It's all good stuff.
So that's all I'm really gonna talk about as far as PagerDuty goes.
So if you have other questions about our product, please
join us in those other forums.
So I'm gonna be talking about alert fatigue and using SLOs
to help you deal with that.
So let's set a baseline for what alert fatigue actually is.
A lot of you out there may have actually experienced alert fatigue, maybe even
if you didn't really know that term.
But basically, alert fatigue can happen when your team is constantly inundated
with alerts, events, alarms, notifications, sirens, klaxons, blinking lights,
whatever the case might be, from your various monitoring systems.
And we know folks have a lot of different monitoring systems depending on the
solutions that they need for all of the services in their environment.
So there's lots of stuff out there that can be sending you noisy notifications,
and it doesn't really matter if the alert is a positive event, a negative event,
high priority, low priority, a false positive, whatever it is.
We're really just dealing with the sheer volume of alerts, distractions,
things that pull you out of your work, that contribute
to what we call alert fatigue.
Unfortunately, our entire working life is full of things that alert at us, right?
Your chat application, your email, your phone.
Like, I turn a lot of notifications off, but I know lots of folks don't;
they fear they're gonna miss an important notification or whatever it is.
But I'm in 15 different Slacks, and none of them get to play any sounds.
So when I hear someone else's Slack noises on a call,
I'm like, do you really need to hear that noise?
Oh my gosh, that would drive me crazy.
So you're already in a place where we're set up to be distracted all the time anyway.
And then if your team has production responsibilities, you have all this
other stuff that piles on top of that, and it really can just drive people crazy.
When we talk to folks who are maybe pre-on-call, folks who aren't on call yet
but are going to be on call, right? They don't currently have on-call
responsibilities, but it's coming. One of the things they worry about is,
how many alerts are there going to be? How often am I going to be pulled out
of the other things I'm doing, the other work I'm doing, or even my personal
time, my sleep, and all that kind of stuff?
It's the biggest part of dealing with being on call:
just the volume of things that come through.
What we want then is to make sure that the stuff that comes through to notify
human beings about the production environment is the right stuff, going to the
right people at the right time, and it's going to be important.
And we find that too many teams alert on stuff that doesn't really meet that criteria.
So there are lots of concerns about the volume of alerts that teams
get and how they deal with them.
For the rest of this talk, I'm gonna be referring to alerts as specifically the
real time events that send notifications to some human responder, right?
So there's lots of different terms for that.
Different products call them different things.
We're just gonna refer to those generically as alerts,
but specifically the things that need to be actioned in real time.
They represent, or hopefully represent, some kind of
issue in your production environments.
We're gonna focus on that sort of definition.
So there's plenty of signs of alert fatigue, some of the main ones.
Maybe you've experienced these, maybe you see them on your team right now.
The big one is going to be delayed response time.
That's going to impact your customer experience, right?
And potentially your bottom line as well.
If there's a lot of alerts, it takes longer to deal with the volume
and find the real problems, right?
Because you're constantly sifting through all this stuff
that's maybe junk, not helpful,
not informational enough to do anything about.
We also find that folks just simply miss alerts.
It looks like all of these 10 things are the same, but it turns out that
the ninth one was just a little bit different and actually a problem.
But again, if you're drowning under a huge volume of alerts, it's a lot more
likely you're gonna miss something.
Folks also ignore false positives for the same reason.
Everything looks the same, and you just assume that it's the same
as those other ones, and maybe it wasn't, right?
New alerts may look like they're not a problem when they are.
All of this then leads to increased stress and burnout.
We see this across the board for very busy teams with very high rates of
incidents and notifications during their on-call shifts anyway, and the volume of
alerts is definitely a contributor to that.
Based on market conditions, sometimes folks will leave, sometimes they feel like they can't.
There's a bit of that, but at the same time, job satisfaction,
stress, burnout, especially if you are a team manager or team lead, those are
things that should concern you, right?
Folks' job quality will decrease if they are under a lot of stress and
maybe not sleeping, or they're feeling burned out.
So that's something to keep an eye on. Overall, then, it all contributes
to decreased productivity.
That's probably the most obvious one, right?
There's lots of alerts coming in, there's more distractions, there's
less useful work being done.
And while one of our primary recommendations for folks who are going
to be on call is that they don't carry a regular workload during their on-call
shift for that week or whatever it is, they still should be able to do something, right?
We don't want them constantly inundated, just buried under a flood of alerts 24/7
while they're on call, because there's always work that can be done as
far as documentation or updating monitors and those kinds of things.
But overall, what we end up seeing on teams that have too many alerts is a great
decrease in their productivity overall.
Then that gives you less ability to learn from all the stuff that is coming in, right?
You have lower-quality incident reviews and documentation,
'cause there simply isn't enough time to deal with all of the stuff that's coming in:
sift through it, find the things that are important, find the things that you
haven't seen before, document those, update monitors, whatever the case might be.
Folks just never catch up to do that kind of work, and that
is the stuff that we really want to get out of having
incidents in the first place.
We bring these alerts into our systems because there is something wrong, and
we want to get to a place where we can shore up the systems so that we
don't see those problems the same way another time. We run out of time
to do that if folks are too buried.
So there are some basic things we can be doing to improve your alert
volume and overall alert health that are going to help us before we even get to using SLOs.
Even if you don't have access to SLOs, these things are gonna help you, right?
So if you haven't gotten to a place yet where you're going to deploy,
or even think about, SLOs, this stuff is still gonna help you if you feel like
you have too many alerts coming in.
So we'll set a good baseline for things all teams can really be
doing or thinking about to improve their overall alert volume.
The first one's a big one: just get rid of anything that's alerting you when something succeeds.
There are very few cases where this additional noise is helpful.
Like, we're all dopamine fanatics, but get your dopamine hit on Instagram or
social media or somewhere else; don't have your systems constantly pinging you
with success, right? Unless you absolutely need to know in real time that something succeeded.
Maybe you've got a long-running pipeline, or you've got some ETLs that are important,
or some reports or something that runs. None of that stuff really should
come to your real-time notifications.
And by that I mean the stuff that comes to your phone, pages you in the middle
of the night, sends a text message, that kind of immediate stuff.
Things that go to chat channels, that can be responded to asynchronously, that are
deferred and not necessarily real time, that's great.
That's a great place to put those kinds of notifications.
Where they don't need to be coming to is basically what you
think of as your pager, right?
For your emergencies.
Number two, get rid of anything that isn't actionable.
I know there's stuff out there; I was a sysadmin for a long time.
There are chronic issues that sneak into the production systems, and if it's
stuff that happens all the time and can't be fixed, and the people that you're
notifying about it can't do anything about it anyway, don't notify them about it.
Track it as a metric to say, oh yeah, last week this thing happened 17 times;
this week it's looking like it's gonna happen 25 times.
Use that as leverage to get things fixed with your product managers.
Have that discussion, especially if you're the manager or team lead; advocate for
time to fix these chronic issues, but don't notify people who can't fix them.
Part of that too is making sure alerts go to the right people.
So if you have a bunch of stuff that comes into your team that's like,
this isn't really our responsibility, we don't have access to fix this stuff,
maybe it should go to somebody else. Have a talk with that other team
that should be getting those notifications.
I know that's challenging in lots of environments, because you've got lots
of different platform teams, and someone's responsible for the cloud account,
maybe somebody else is running your Kubernetes, and then you
have your application teams and they know about the code.
There's a lot of division of labor, and you wanna make sure all those
alerts go to the right places.
But if stuff is coming in that your team can't do anything about,
you don't want it; turn it off.
Anything else that's low urgency, that's just an informational notification,
shouldn't be going to real-time notifications to humans.
Again, put it somewhere else: Slack or Teams or whatever,
something asynchronous that's just there for information, right?
Have a channel for your bots on your chat and pipe it there,
not to a place that is going to notify someone at three o'clock in the morning.
Right? Low urgency.
Maybe it comes into your regular system, but it doesn't notify people.
In PagerDuty in particular, you can set the notifications to not
notify people via SMS or app push for low-urgency alerts,
and they just will hang out there.
You could have a goalie or a triage time in the morning to say,
okay, it's nine thirty, we're here, we're gonna go through everything
that came in overnight that was low urgency and deal with it that way.
You can also push things to a ticketing system instead of your real-time
notifications if it's stuff that can really be deferred and decided on later.
You can be super aggressive with these.
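To make that split concrete, here's a minimal sketch, assuming the PagerDuty Events API v2, of sending an event with a chosen severity. The integration key is a placeholder, and whether a given severity ends up as low urgency depends on how the service itself is configured on your side.

```python
# Minimal sketch: send a PagerDuty event with a chosen severity.
# Assumes an Events API v2 integration key ("routing key") for the service,
# and that the service is configured to treat low severities as low urgency
# (no SMS/push) -- that mapping is a setting you make in PagerDuty itself.
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_KEY"  # hypothetical placeholder

def send_event(summary: str, source: str, severity: str = "info") -> None:
    """severity is one of: critical, error, warning, info."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    resp = requests.post(EVENTS_URL, json=payload, timeout=10)
    resp.raise_for_status()

# Informational event: keep it out of anyone's 3:00 AM notifications.
send_event("Nightly ETL finished with 3 retried batches", "etl-runner-01", "info")

# Something a human needs to act on right now.
send_event("Checkout API error rate above SLO threshold", "checkout-api", "critical")
```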
I know everyone thinks their baby is beautiful, but if you put stuff out into
production and your users aren't engaging with it, maybe you can think about
lowering the urgency of your response to those parts of the service, and
reevaluating those on a regular basis to say, has this reached a threshold
where we need to up it to 24/7, high-urgency alerts, or things like that?
Be aggressive with all of those, right?
And then finally, delete or suspend broken alerts.
It happens: code changes somewhere and there's a mess, and the monitors
don't quite work, but you haven't had time to go back and really
distinguish why they don't work.
Good practice means all the alerts get updated when the code changes, but if
you've got janky stuff sitting around, you might have some of these old alerts
that aren't being super helpful and don't necessarily reflect the state of the
current code. Just dump them.
If you wanna save them for a while, just to make sure that maybe you can go back and
fix them, that's fine, but disable them.
You don't have to leave them on to ping people in the middle
of the night if they don't need to.
What you're doing with this whole practice is helping with alert fatigue by
creating mental space, right?
You're increasing the cognitive capacity of your team to deal with the things
that are real and urgent and impactful by getting rid of all the stuff that
doesn't meet that bar, right?
Everything that is extraneous.
So if the alert isn't real, dump it.
If it's not urgent, create a record. Make a ticket.
When you're left with just the stuff you care about, then
you can be more productive.
So how can you go about doing this if you feel like your team
is really drowning under a lot of alerts that aren't useful to you?
Try and declare an alert cleanup sprint, whatever it takes.
The impacts of alert fatigue are real, right?
We talked about them in the beginning.
They are going to be impactful for the reliability of your services and systems.
So if you are in danger of experiencing alert fatigue, it is worth investing the
time, and it is an investment, right?
It does take resources to fix the things that are going to impact your reliability.
That means you cannot be missing important alerts.
You cannot be ignoring the things that are coming in.
But if you have too many alerts, you are gonna have to deal with that.
So give your team some time to take stock of the alerts you are getting,
if you feel like you have too many, right?
If you start with, say, the 10 noisiest alerts, you walk through
the list by frequency, you deal with the noisiest first, and there's a
kind of snowball effect, right?
We deal with teams that have tens of thousands of alerts in a week, and
they use AIOps and some of the other tools to group things together
into a limited number of incidents.
But taking stock of all those alerts and making sure that they are meeting
your criteria is going to be super helpful.
Just go through them, list by list, for every alert.
Marie Kondo that thing, right?
Is it actionable?
If not, we're not gonna alert; we will maybe track it as a metric,
leave it in the dashboard, but it doesn't have to go to somebody's
pager in the middle of the night.
Is it urgent?
If yes, fine, get it in front of a responder.
If not, log it as future work.
Send it to the Slack channel, create a low-urgency queue, whatever it takes,
but get it out of the flow that's gonna wake someone up in the middle of the night.
Is it telling you something important about the performance of the service?
That is a big one, and we're gonna talk about that more.
If it isn't, turn it off, set it aside for discussion, maybe revisit
it later if it might be helpful, right?
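If it helps to make that checklist concrete, here's a hypothetical little helper that encodes those three questions; the names and categories are purely illustrative, not any product's feature.

```python
# Hypothetical triage helper encoding the three questions above.
# Walk your alert inventory through it during a cleanup sprint.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    actionable: bool           # can the receiving team actually do something?
    urgent: bool               # does it need a human in real time?
    signals_performance: bool  # does it say something about service health?

def triage(alert: Alert) -> str:
    if not alert.actionable:
        return "track-as-metric"      # dashboard it, don't page anyone
    if not alert.urgent:
        return "low-urgency-queue"    # Slack channel, ticket, morning triage
    if not alert.signals_performance:
        return "review-or-disable"    # set aside for discussion
    return "page-a-responder"         # the only bucket that should wake people up

for a in [
    Alert("disk 80% on batch host", actionable=True, urgent=False, signals_performance=False),
    Alert("checkout error rate spike", actionable=True, urgent=True, signals_performance=True),
]:
    print(a.name, "->", triage(a))
```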
You want to take a good look at what you're alerting on and
how you're alerting about it.
If you have a bunch of baseline alerts that come in all the time,
it's maybe time to have another sprint about best practices
for your services, right?
Stuff like running out of capacity: have a discussion about how to set up the
systems so they rightsize automatically, how to see the trends that would lead to
services being out of capacity.
If you're running in Kubernetes environments, all that stuff is available, right?
You just need to make sure you're deploying it correctly.
Things like memory leaks and other chronic issues, you might have to have a
discussion with your product managers about.
Put that stuff in the backlog for engineering investigation
to get it fixed, right?
I used to work on Java applications, and memory leaks were par for
the course at that time. Deal with those in a more productive way.
It's easy to feel like you're not gonna make any progress with these
if they have been alerting for a very long time,
but you can push back on your product managers and have that discussion,
'cause again, it's gonna impact your reliability.
Sometimes these things get hidden, right?
If you've got a service that has a thousand instances across however many
availability zones, it is easy to hide that one of them is restarting for a
memory leak every so often, but it can still be a chronic issue.
You may have a lot of legacy alerts that no longer make sense, right?
Depending on the mix of newer versus legacy services in your environment,
you may still have stuff that's alerting on disk full and things like that.
Really take a hard look at whether or not those things are super helpful to you.
They feel comfortable, right?
They're well understood, if nothing else, but they may not be anything
that actually gives you a hint of the performance of the system.
There may be some that are helpful, but there's gonna
be a whole lot more that maybe aren't.
So once you've culled these alerts, take a hard look at the ones that
you've determined are important.
Part of the problem we see folks deal with is that a lot of their alerts look the same.
You look through a list of incidents that have been created, and you cannot
tell from one incident to the next exactly why they're different,
because the first couple of lines are all exactly the same.
Are your alerts giving you enough information?
Can responders tell that something important is going on from the
first line of the report?
This is easier to do in some monitoring systems than others.
Some of them are very verbose; they give you a lot of stuff, but the stuff
that's important and indicates what's going on might be hidden or hard to find,
or hard to surface in a way that makes it immediately apparent to your responders.
And we get a lot of questions about this, because folks are relying on
SMS in particular, and the questions always are: can we add more to the SMS?
Can we add links to the SMS?
Can we do this? Can we do that?
The long answer is a whole other talk, but the short answer, unfortunately, is no,
because SMS is governed basically by each country in the world; everybody
has their own different regulations.
Some places you're not allowed to send a link at all.
In the US we get a lot of spam, so there's other things that are allowed
and disallowed, and there's all these rules about it.
So we've dialed back the SMS on our delivery systems to the lowest common
denominator, and you're very limited in what can go through,
'cause we're a global company and we have customers all over the world.
To avoid those kinds of undeliverability errors, we've cut things way back.
So thinking about how your notifications will get to your humans
is also part of making sure that your alerts are working correctly.
The other part, like I mentioned earlier, is making sure they're
getting to the right team, if you have division of responsibility
across the different layers of your platforms.
Making sure that all those actionable things are going to the
team that can actually do the action is also super important, right?
If your application development team knows nothing about Kubernetes other than
that their CI/CD tool packages things up into a container,
you can't ask them to respond to errors in the Kubernetes layer.
Same with the virtualization layer, whatever it is, right?
However many different teams you might have responsible for things,
making sure that when that alert comes in, the responders that receive it
have the access and knowledge to fix it, is part of all of this as well.
And if you need to ship things off to another team, that's also helpful.
If you are in a more advisory, SRE kind of role,
putting together good guidelines about that sort of separation of duties and
separation of alerts and what alerts should look like should be part of
the things you provide for folks.
It's part of your golden path, right?
It's part of your best practices, your recommendations for getting things into
production: what these alerts should look like, where they should go, what kind of
information they should convey, right?
It's all part of that advisory role, if that is how your SRE team is set up.
So there's a lot of things to think about there and things to work on,
if you are coming into this from a place where you have a lot of noise in your
alerts, maybe a lot of legacy systems to deal with that are being
modernized or reworked or whatever.
It feels like it should be super simple, and unfortunately it never is, right?
Because the systems themselves are complex, making sure they're running
correctly is also going to be complex, and that's where SLOs come into play.
Once we've cleaned up our junk alerts, it's time to talk about improving
the alerts that aren't junk, right?
So if you're not currently working with SLOs and SLIs, or maybe you're planning on
looking at those to improve your position, your reliability, there's lots of
resources; I'll mention a couple at the end that you can take a look at.
But SLOs essentially are the team goals that you set for your production metrics.
They usually sit inside any contractually obligated SLA.
The SLA often has lawyers involved, there's contracts involved;
that's the legal part, right?
Service level agreements between your company and your customers.
Your SLIs, your SLOs sit inside of that.
And what I mean by that is, if you are promising 99.9% reliability
on some system, you are setting your personal goals to something like 99.99%, right?
Whatever that is.
And they may be focused on certain parts of your customer workflow.
You may have other components that fall outside of the SLA,
because maybe they are not GA yet, maybe they aren't used as much,
maybe they're premium features, and there's other things going on there;
that's a whole other discussion.
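As a back-of-the-envelope illustration (with made-up numbers, not anyone's actual contract), here's how much downtime a 99.9% target versus a tighter internal 99.99% target actually allows over 30 days:

```python
# Back-of-the-envelope: allowed downtime for a given availability target.
# The targets here are illustrative, not anyone's real contract numbers.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

for label, target in [("SLA 99.9%", 0.999), ("internal SLO 99.99%", 0.9999)]:
    budget_minutes = MINUTES_PER_30_DAYS * (1 - target)
    print(f"{label}: ~{budget_minutes:.1f} minutes of downtime per 30 days")

# SLA 99.9%: ~43.2 minutes of downtime per 30 days
# internal SLO 99.99%: ~4.3 minutes of downtime per 30 days
```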
But SLOs, like I said, are your team's goals for your production metrics.
They are the level of reliability your team commits to delivering for that service.
They can encompass anything that you think is important,
anything that your customers determine is important, right?
Service speed, payload size, any metrics that matter to your users,
and they may be different for every component in the ecosystem.
There's no universal best set of indicators to set your objectives to.
It's going to vary depending on what your users do, what things
they interact with, what they expect performance-wise of your services.
So they can be very individualized, and in that way take some work and
may require some tooling, right?
Others, you can back into from system data, right?
If you have an application that slows down when some system resource threshold
is met, you can monitor that system resource as a proxy for the
behavior of that service, right?
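As a rough sketch of that proxy idea, assuming the psutil library and a made-up memory threshold, you could poll the resource and count breaches against the SLO instead of paging on each one:

```python
# Sketch: use a host resource as a proxy SLI when the application slows down
# whenever that resource gets tight. Threshold and resource are assumptions.
import time
import psutil  # third-party; pip install psutil

MEMORY_PROXY_THRESHOLD = 85.0  # percent; tune from your own observed data

def proxy_sli_breached() -> bool:
    # Host memory pressure standing in for "the service is about to slow down".
    return psutil.virtual_memory().percent >= MEMORY_PROXY_THRESHOLD

breaches = 0
for _ in range(60):            # poll once a minute for an hour, as an example
    if proxy_sli_breached():
        breaches += 1          # count it against the SLO instead of paging
    time.sleep(60)
print(f"proxy SLI breaches this hour: {breaches}")
```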
Some of your SLOs are going to be around metrics that are easy to count,
but also transient, noisy, right?
Especially if you have a system that has a very large number of hits or interactions,
one or two little errors pop up occasionally anyway, because
that's just the nature of packet-switched systems, right?
So if you've ever been called by an executive because they couldn't access
a service that's working for someone else, and you're like, all our monitors
say it's fine and I have no idea why it's not working for you, from your
pineapple farm in Hawaii, when everything's hosted in Dulles:
the Internet's in the way, there's ocean in the way,
there's lots of things in the middle that are preventing that performance.
The magic of SLOs, though, is that they allow you to account for
some percentage of those errors before blowing up everybody's phone, right?
Everybody gets 500 errors, and you wanna alert on them at some point.
But do you have to alert on every single 500 error?
That's the question, right?
If you have a site with a very large capacity, very large traffic numbers,
one 500 error out of 10,000 hits is probably not something to worry about,
unless it keeps recurring in a small timeframe, right?
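One generic way to express "a single 500 is fine, a burst inside five minutes is not" is a windowed error-rate check; this sketch uses made-up window and threshold values and isn't any particular product's feature.

```python
# Generic sketch: only treat 500s as alert-worthy when the error *rate*
# over a short window crosses a threshold. Numbers here are assumptions.
from collections import deque
import time

WINDOW_SECONDS = 300          # look at the last 5 minutes
ERROR_RATE_THRESHOLD = 0.01   # 1% of requests failing within the window
MIN_SAMPLES = 1000            # don't judge a rate on a handful of requests

recent = deque()              # (timestamp, was_error) per request

def record(was_error: bool) -> bool:
    """Record one request; return True if the windowed error rate is breached."""
    now = time.time()
    recent.append((now, was_error))
    while recent and recent[0][0] < now - WINDOW_SECONDS:
        recent.popleft()      # drop requests older than the window
    if len(recent) < MIN_SAMPLES:
        return False
    errors = sum(1 for _, e in recent if e)
    return errors / len(recent) >= ERROR_RATE_THRESHOLD

# One 500 among 10,000 hits keeps this False; a burst of 500s inside five
# minutes flips it True, and that's the point where a human gets involved.
```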
So when you're combating alert fatigue, SLOs are gonna help you with two things.
They're gonna help you determine urgency, and they're gonna help you reduce the
overall volume of notifications and alerts that are getting to your human responders.
Having these SLOs attached to your key metrics means that you know how many
failures you can have before you can really say, oh yes, this is a problem.
There's no need in a lot of cases to alert on every single failure.
You don't wanna burn through your entire error budget on a single instance
of an alert, but you'll be able to better manage the importance of a single alert
versus a group of them in a given time period: one every couple of hours versus
10 in the next five minutes, right?
Those are extreme cases, but an example of how varied the situation can be.
So if your SLO is, say, 95% success on some key metric, you can alert your
humans starting maybe at 97% instead of at 100% success minus one error, right?
That's going to lower the volume of overall alerts your team is
receiving on these key indicators.
So what does an error budget look like?
It's really your wiggle room, right?
It's the place where you can be a little bit unreliable
and your customers are okay with it, right?
Some of these are gonna be a discussion you need to have
with your product managers,
'cause they're gonna know, hopefully, what the data looks like for how
customers engage with your systems, where customers abandon stuff, right?
But the error budget is where the magic of SLOs meets the annoyance
of real-time alerting, right?
Once you decide on your SLOs for a particular service, you can determine how
much of your error budget you wanna spend on any given series of errors, right?
A single isolated error on a very busy service shouldn't wake anybody
up in the middle of the night.
You want a cluster of them before you wake someone up.
You might wanna note it, right?
And it comes in maybe as low urgency until there's a whole bunch of them, and then
it gets escalated to a high urgency.
So you have some record that it happened, but you don't necessarily need
to wake someone up at 3:00 AM about it.
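Here's one hedged way that escalation could look: track how much of the period's error budget has been consumed, record failures as low urgency below a cutoff, and only page once the cutoff is crossed. The budget window, traffic estimate, and cutoff below are all assumptions, not prescriptions.

```python
# Sketch: escalate from "note it" to "page someone" based on how much of the
# error budget has been consumed. Numbers and categories are placeholders.
SLO_TARGET = 0.999              # 99.9% success over the budget window
EXPECTED_REQUESTS = 10_000_000  # expected traffic in the window (assumption)
ERROR_BUDGET = EXPECTED_REQUESTS * (1 - SLO_TARGET)  # ~10,000 allowed failures
PAGE_AT_BUDGET_FRACTION = 0.25  # page once a quarter of the budget is gone

def handle_failures(failures_so_far: int) -> str:
    consumed = failures_so_far / ERROR_BUDGET
    if consumed >= PAGE_AT_BUDGET_FRACTION:
        return "high-urgency: page a responder"
    return "low-urgency: record it, review in the morning"

print(handle_failures(200))    # low-urgency: record it, review in the morning
print(handle_failures(3_000))  # high-urgency: page a responder
```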
So your users really are the point of this, right?
The entire reason we have all these alerts and notifications and all this stuff
that comes in is because we have users.
We wanna serve them a good experience, so they keep coming
back and engaging with our stuff and maybe paying us for it, right?
So we're not going through this exercise simply because it's fun.
We're saving our human team members some sanity, right?
By cleaning up all the alerts and giving our customers, our users,
the best experience by focusing on what they care about, right?
You might have to talk to users.
You definitely wanna talk to your product managers.
You might wanna do some experiments.
The good thing about using SLOs in particular is that they're movable, right?
Like I said, as long as they're inside the boundaries of your SLA,
you can move them around and do what you need to do with them in
order to flex around how the system is performing, the goals that you can
actually meet, and how it's impacting your actual human team.
Knowing what your users are expecting is gonna help you determine:
is future work necessary here?
Are we slipping out of our tolerances for these particular metrics?
'Cause maybe there's no amount of notifying people overnight
that's gonna fix the thing that is giving you a chronic issue,
and it's time to go back to the product manager and have a
discussion about reliability and how it's impacting the system.
So there's a lot of flexibility here.
These tools can be fairly sophisticated, so you might wanna
talk to a vendor about them.
So overall, our goals are: number one, we want to reduce our overall alert load.
That is going to help our alert fatigue greatly,
even if we aren't in a place to deploy SLOs right off the bat.
We'll have a lower volume of high-urgency alerts, and they will
legitimately be important, right?
They're going to be things that we need to see in real time in order to preserve the
reliability of the system for the users.
Every alert that comes to a human responder should be actionable and
it should require human intervention.
There are lots of other things that we can deploy in this
workflow, like auto-remediation, other points of automation, maybe some
machine learning to help us out.
But everything that comes to our humans needs to be actionable.
Over time, as you clean this up and keep things rolling with good hygiene,
mostly the alerts that come to your responders are going to be novel.
They're gonna be stuff you haven't seen before,
things that are new to you or completely dependent on the climate in your environment
at that time, not chronic issues that you can't fix or can't deal with.
So, in summary: number one, clean up your alerts, even if you have
no intention of deploying SLOs or making use of those as a tool.
Helping your team out by making sure your alerts are good, make sense,
are helpful, and come in to people that can actually work on them
is going to help everyone out anyway.
Like I mentioned at the beginning, we all have enough notifications
without production responsibilities.
So if you do have production responsibilities, you're already shoving
all this information onto people who are already maybe overwhelmed by how
many notifications they get.
So we're gonna focus on the users with SLOs, and we're gonna prioritize the
things that are important to the users.
And it doesn't matter if your users are internal or external
or somewhere in between, right?
Everyone has things that they care about in the systems that we run,
so we work towards that.
You can then think about getting the junk out of your workflow with
automation and machine learning.
If those things are in your tool set and you can deploy them,
you're gonna get more out of them after you've cleaned up your alerts
than you would if you just throw the fire hose of all your alerts at them.
It'll be more helpful that way.
If you get to a place where your manager is not bought in on this, remind them of
the six points I mentioned at the beginning about missing alerts and that
impacting your reliability, and burnout, and all those kinds of things, right?
Negative impacts, especially if you're already seeing them on your
team, and you might be, right? You might have folks who are burned out.
You might have folks that feel like they're drowning
during their on-call shift.
If you're getting more than a handful of major alerts during a
week-long shift, you're basically drowning people, right?
There should be a limit.
So make sure you're focusing on the language around your reliability,
your customer experience, your bottom line, right?
You are running these things to make money in some way,
and if they aren't running well, then that's impacting
your potential bottom line.
So, some resources for you.
We have plenty of stuff on our blog about alert fatigue;
you can check all those out.
If you are new to SLOs and haven't encountered them before, the Google SRE
workbook has an entire chapter about them.
You can also reach out to our friends at Nobl9;
we use their stuff to manage our own SLOs.
They have lots of documentation on their site,
they wrote an O'Reilly book about SLOs, and they can get you going there.
So if you're totally new to this and you're like, I don't understand
what you're talking about, there's plenty of information out there for you,
lots of resources to take a look at.
So I hope that was helpful, and I hope you enjoy the rest of the sessions at Conf42 SRE.