Conf42 Site Reliability Engineering 2021 - Online

Reducing Trauma in Production with SLOs and Chaos Engineering

Customer experience is the responsibility of the entire team. Many organizations leave reliability up to the SRE team, however reliability should be built in from the very beginning.

In this talk Julie and Mandi will discuss what Service Level Objectives are, why they are important to the organization, and how to define and set them. Going beyond SLOs, attendees will learn what Chaos Engineering is and practical ways to ensure compliance and resilience with best practices. We’ll show you how to focus your goals and error budgets with examples that will lead to reliability and improved user experience.


  • You can enable your DevOps for reliability with ChaosNative. This is reducing trauma in organizations with SLOs and chaos engineering. Downtime costs money, with quantifiable and unquantifiable costs as well.
  • Mandi, why don't you kick us off with some information on SLOs and SLIs? The whole point of the whole exercise is about keeping your users happy. When we add chaos engineering into the mix, we can look at this SRE pyramid.
  • There are lots of things that you probably already have in your metrics that are important but not primary to your user experience. You can use chaos engineering, which is the practice of injecting failure into your systems, to understand and to validate your monitoring.
  • There's some operational maturity that you'll want to have in place before you go down the path to your SLIs and your SLOs. You want a lot of telemetry already available. And then observability tools are going to help you underpin all of these components and give you a more complete picture of the ecosystem.
  • Some people say that you can only do chaos engineering in production. But you can actually adopt the practice in development, right? You're architecting for failure. You just want to work iteratively like you would with code.
  • Mandi, talk about working with upstream dependencies. Make sure you're prioritizing the user experience first. Work with your error budgets. Focusing on user experience is a real downstream tool of setting goals.


This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud. Hi, welcome to our talk. This is Reducing Trauma in Organizations with SLOs and Chaos Engineering. My name is Mandi Walls. I am a DevOps advocate at PagerDuty. And I'm Julie Gunderson, a senior reliability advocate at Gremlin. Mandi and I are really excited to be here with you today because we both worked together at PagerDuty and are now at different organizations, and we really see ways that you can combine some of these practices to make for really reliable organizations. So, Mandi. Awesome. Yeah, thanks, Julie. One of the things we talk about at PagerDuty we call full service ownership, and it's about focusing on the reliability of the services that your team creates once they get into their production life, whatever that is, whether it's internal production or external customers. So knowing how your services are performing and setting goals around that performance is really a key to keeping all of your users happy, whether they are external customers or internal customers. So we're going to talk about using tools like service level indicators and service level objectives and how they can help you focus on what your users need. And then, adding to that, using chaos engineering practices to make sure you're hitting those goals as you're working on your service and as you're making those improvements. One of the things that we often talk about is the cost of downtime. So downtime costs money, with quantifiable and unquantifiable costs as well. On the quantifiable side, you've got your revenue, which you can just ask your accounting or your sales folks about, plus employee productivity, customer chargebacks, and breaching those SLOs, and that's where we really want to focus this talk.
But just so that the other things don't get lost, we also want to remember the unquantifiable costs, such as brand defamation and employee attrition. And so with that, Mandi, why don't you kick us off with some information on SLOs and SLIs? Yeah. So, a lot of vocabulary on this one. Let's just set some baseline, so we'll go to the next slide there. We're not going to talk too much about SLAs. That's sort of the realm of lawyers, right? If you've worked for a software vendor like both of us do, you probably have some SLAs that your legal team and maybe your insurance company and a bunch of other people get together to set, so that it's part of the contractual agreement between your company and your customers. And there might be places where there are, like Julie mentioned, chargebacks or some kind of remuneration for outages and things like that. That's beyond what most of our SRE practitioners can get into. So we're going to focus on the other pieces of this. The first one here is our indicators. Our service level indicators are the metrics that we're going to work towards: the things where we're going to figure out how important they are to our users and where they need to be. And then our objectives are going to be the parts of those metrics that we can set goals for. They're going to be the places where we know that beyond a certain point we're going to start losing users. So we're looking at the tolerance level for where we can experiment and push a little bit of risk and still keep the customer base happy. So then we have the rest of the time, right? We think about, okay, well, we've got some sort of uptime. That's kind of what we're measuring against. We will have some sort of goals that we're going to set. And the rest of that, whatever's left out of the pie, is going to be our error budget. And you can kind of see that from this sort of silly diagram.
The pie chart is 99% things are good, and 1% is our wiggle room, that place where we're able to maybe try something out. We're not sure it's going to work 100%, or it might be a little bit out of range for what our goals actually are. But it gives you this place to measure yourself against your goals and implement changes in a way that still preserves your customer experience. So it gives you a measurement for where maybe you need to improve things, like removing services from load balancers before you restart them, or using blue-green deploys, or all these kinds of things, to maintain this error budget goal that you set for yourself. And when we add chaos engineering into the mix, we can look at this SRE pyramid. You really start with the monitoring and observability, right? Just like what Mandi was talking about with the SLOs and the SLIs. And then you move into incident response, and then we get to the post-incident analysis. So let's say you've had an incident. Now you've run a blameless post mortem on it. One of the things that you want to do is then obviously work towards those fixes that you have. You want to actually have an internal time frame that you've agreed to as a team to work towards remediation efforts so that those incidents don't occur again. But then you want to test that. So you want to repeat those incidents with chaos engineering, and you want to automate that so that you can make sure that those fixes have worked and that they continue to work. And then, in your testing and release procedures, you want to bring your chaos engineering practices in as well. And then you move to capacity planning and development, and then product, which, again, goes back to what Mandi was talking about with error budgets and setting those. And so then we go ahead and talk about how we center on that customer experience.
Yeah, really, Julie, the whole point of the whole exercise is about keeping your users happy. Right? We're looking for the things that they care about. And if we go to the next slide, we have to figure out what users actually care about. You might have users that are really sensitive to slow-loading pages. You might have users that are sensitive to large payloads on mobile because maybe their mobile traffic is too heavy for what their connectivity is. There might be lots of different behaviors that they have, and you want to be able to test that and take a look at what the user behavior looks like. Well, and Mandi, we've seen it too, with the State of X, Y, and Z reports and multiple other reports. We've seen it in our Gremlin reports and in the PagerDuty reports: users won't wait. If that app or website takes a long time to load, they're out of there, and they're on to a competitor who's really built that user experience in. Yeah, absolutely. So maybe there's a place where certain user behaviors only apply when folks are logged in, or maybe they have different behavior when they're a guest. Maybe there are certain things that are super important, like your search function and your shopping cart. But maybe management or updating of billing information isn't quite as important because people aren't using it all the time. So you're really looking for the things that people are gravitating to or using a lot in your applications, and the behaviors that they really want to be fast, right? Like Julie mentioned, people are going to leave. There are certain places where, okay, I'm not going to change my bank if their app is slow, but if I'm shopping for something or I'm looking for some music or something like that, and you're not responsive, I'm going somewhere else, right? So some of the things that you might find that your users care about, right?
No errors on your main page, or some module loads first. And you can actually see this as a user. As you're floating around the Internet and you pull up a site, you can notice which modules load first. YouTube is very obvious, right? The video player loads and then the rest of the page builds around it, because that video load and that actual video itself is the main part of the experience. So when you're thinking about that for your own services, how the page loads, the pieces that build in, where they come from, how long they take, that's all part of your user experience. So there are lots of things to think about when you're working with your users and what they care about. You might actually have to ask them about the things that they care about. Sometimes that's something we tend to forget. We want to develop things for ourselves that we think our users are going to love, right? But then upon releasing it, we find out that our users are not using that new feature and they're using some random thing that we thought nobody would like. It's really about understanding that user experience. Definitely. And there are lots of things that you probably already have in your metrics that are important but not primary to your user experience. Some stuff is going to be maybe early warning, right? CPU utilization, memory allocation: those things are super important in that they will impact user experience down the road. But you want to broaden your definition of the things that you want to look at. As you're setting these SLIs, you're looking for the actual behavioral aspects of your application and your services when they hit the user. The user has no idea that your CPU utilization is at 85%. They just know that your queries are slow, like they're not getting their searches back when they're searching on your site.
So thinking about what the user experience is, how that's going to translate into the metrics that you're collecting, and what metrics you might have to add is another super important step of the whole exercise. And we talk a lot about metrics. Some people get a little bit scared when it comes to chaos engineering because maybe they don't have a ton of baseline metrics, right? And that's okay, because you can actually use chaos engineering, which is the practice of injecting failure into your systems, to understand and to validate your monitoring and your metrics. And so when we want to know, okay, our SLOs, our SLIs, are they right? We can go ahead and practice that. We can imagine what that customer experience is going to look like. Let's say the email server goes down. So I check out, I purchase my item, but I don't get that immediate feedback as a customer with that email confirmation. That's okay. Was I still able to check out? Was I still able to complete my purchase? Were those things able to happen within the defined timeline that we have agreed to? So we want to inject that failure proactively so that we can validate: are these objectives that we have set the right ones for our team? Are they what we should be holding the expectations to? And that goes along with the customer experience. And so, Mandi, talk to us a little bit about that. Yeah, there's a bit of a place where, before you embark on this journey, there's some operational maturity that you'll want to have in place before you go down the path to these SLIs and your SLOs. So some of the things we've already talked about around user experience and those kinds of things, you want to already have in your pocket, right? You want to have a good idea of the impact of things like new features or changes or degradations and how users are responding to those.
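
The email-server scenario described above can be sketched as a small failure-injection experiment. This is only an illustration under stated assumptions: `checkout`, `broken_email_sender`, `EmailServerDown`, and the 2-second objective are hypothetical names and numbers, not something from the talk or from any real chaos engineering tool. It injects an email failure and checks that the purchase still completes within the agreed time.

```python
import time


class EmailServerDown(Exception):
    """Raised by the (hypothetical) email client when the server is unreachable."""


def broken_email_sender(order_id: str) -> None:
    # Injected failure: every confirmation email fails.
    raise EmailServerDown(f"cannot send confirmation for {order_id}")


def checkout(order_id: str, send_email) -> dict:
    """Complete a purchase; a failed confirmation email must not block it."""
    start = time.monotonic()
    result = {"order_id": order_id, "paid": True, "email_sent": False}
    try:
        send_email(order_id)
        result["email_sent"] = True
    except EmailServerDown:
        # Degrade gracefully: the order stands, the email can be retried later.
        pass
    result["elapsed_seconds"] = time.monotonic() - start
    return result


def run_experiment(objective_seconds: float = 2.0) -> bool:
    """Return True if checkout met the objective while email was down."""
    result = checkout("order-123", broken_email_sender)
    return result["paid"] and result["elapsed_seconds"] <= objective_seconds


print(run_experiment())
```

If an experiment like this fails, that suggests either the service needs a fix or the objective itself needs revisiting, which is the validation loop the speakers describe.
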
And you also want to have the mechanics, via your chaos engineering tools, to be able to work with that. So you want to have a lot of telemetry already available, right? You might have open source or commercial solutions or whatever it is, but you're going to be collecting your user-facing metrics. That's going to give you perspective on what users are experiencing throughout your services. You might have a set of synthetic monitors that are going to tell you the things you know about the things you know, right? That's the place where you've already pre-populated what needs to be known about those potential components. You probably have some logging set up so that you're tracking things on a post hoc basis, so that you can collate behaviors and other events as they happen throughout the ecosystem. So it's a good place to do that. You get all the text on it, all the timestamps, all that fun stuff, and then you can put those two together with a tracing tool. That gives you a place to start dealing with the complexity, especially if you've got a wide distributed system. If you've got a monolith, you might already be in a good enough place, right? You can kind of cheat a little bit because things aren't moving around a whole lot. But if you do have a widely distributed ecosystem, you're going to want to have some tracing so that you can follow user requests throughout all of your services. And then observability tools are going to help you underpin all of these components and give you a more complete picture of the ecosystem. In a more generic sense, it helps you dig down to the unknown unknowns, the things that you weren't expecting, because you're poking that black box of your software with some inputs and seeing what pops out the other side. And that gives you a lot more ability to say, okay, this user behavior is indicative of this set of requests and this set of otherwise hidden back end requests and things like that.
So I can know then, as these users come in with this particular use case and behavior, that they're going to hit all of these things, and I can start tracking those down for my SLIs and my SLOs. Another thing: unfortunately, Julie and I have both seen places where folks don't exactly have a really good picture of all their dependencies. And this is super important. You want to be able to know what your services are consuming. You want to know if they're eating bad stuff, right? And if you've got a back end dependency whose SLO is really low, your services can't have a more stringent requirement if your back ends aren't up to performing to that requirement. So it gives you a place to really start thinking about the other teams that you're working with and things that you can do more defensively. Maybe you can red button something when it goes out of your range for tolerance and those kinds of things, and do some more advanced techniques to protect your service from things that aren't up to your requirements. Well, and another thing too is sometimes you don't realize that there is a service that's actually on a critical path. You may think, this is not a critical path. If the Redis cart goes down, we're fine, right? But then when you test, when you're purposefully injecting this failure, you might find a critical path that all of a sudden makes you realize you need to redefine your SLOs based on that. Absolutely. Let's take a look at this; it's going to look like math, right? So if we go to the next slide there, we've got kind of a generic model for goals. You have your service level indicator, which is going to be some text, and then you have your service level objective, which is going to be some numbers, probably. And then we've got a period of time, which is represented by T. And our SLI is going to be the number of good things that happen divided by the number of all the things that happened.
And we're going to multiply by 100 to get a percentage. And then our error budget is going to be whatever's left over. So if I'm going to say, okay, my service level objective for 500 errors on my main page is 99%, then my error budget is the 1% of those requests that can be out of that range before my customers start to get really unhappy. And we have some examples on the next slide of how these numbers fit together, right? The bigger the pool of events, the more wiggle room you get at the same percentage points. It's just math, right? So at 100,000 events, if I have 99,000 good events, that means I can have 1,000 events where maybe I'm trying something different out, maybe I'm doing some experimentation, but I know that I have that error budget, that wiggle room, to do a few things that maybe are outside our goal parameters, and our users are going to cope with those in an okay way. They're going to be more tolerant. Yeah. So I know that none of us expected a math lesson today, so thanks for that, Mandi. But really, reliability is obviously very important to organizations now. So, right, we're perfect 100% of the time. Our web requests have zero milliseconds of latency all the time. Right? But not necessarily, not in the real world. So that's why we talk about SLAs, SLOs, and SLIs. So maybe we have an SLA that 90% of the web requests have a latency of 500 milliseconds for the month, and if we miss that, that's when the customer gets their money back. Then we set a buffer in for our SLO: we've got a 95% SLO. So we've got this 5% buffer between our SLA and our SLO, and that's what we can use to play with. That's what we can use to experiment with. That's where we can start getting creative with maybe new features that we want to release to our customers. But it's really important that we are staying within those ranges that we have set for our organization.
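
The arithmetic in this part of the talk can be written down as a short sketch. The numbers (100,000 events, 99,000 good, a 99% objective, a 90% SLA with a 95% SLO) come from the talk; the function names are illustrative choices, not an established API.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """SLI: good events divided by all events in the period, as a percentage."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 100 * good_events / total_events


def error_budget_events(slo_percent: float, total_events: int) -> int:
    """Events that may miss the objective in the period: whatever the SLO leaves over."""
    # round() guards against floating-point noise in the percentage arithmetic.
    return round(total_events * (100 - slo_percent) / 100)


def sla_buffer_points(sla_percent: float, slo_percent: float) -> float:
    """Percentage points between the internal SLO and the contractual SLA."""
    return slo_percent - sla_percent


print(sli_percent(99_000, 100_000))         # 99.0
print(error_budget_events(99.0, 100_000))   # 1000
print(sla_buffer_points(90.0, 95.0))        # 5.0
```

The same 99% objective over 1,000,000 events would leave 10,000 events of budget, which is the "bigger pool, more wiggle room" point made above.
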
And so when we look at this in play, here's a little bit of a scenario that you can run through, right? You can look at the SLO scenario in staging. You can do that with Gremlin. Maybe an instance downtime occurs, right? Datadog is picking up that instance downtime and calculating it, and then PagerDuty is firing off that alert, letting you know that that SLO has been breached. So these are ways that you can put all of these things together and use the tools that are at your disposal to make sure that you are maintaining not only those contractual obligations that you have, but that customer experience. Yeah. One thing to remember, though, as you're working on these, is that while your SLA is your public-facing customer contract, that thing that the lawyers put together for you, your SLOs and your SLIs are really for you. They're for your team to work against and to budget and prioritize for. They're not meant to be a cudgel or any kind of punishment, because we don't want to disincentivize people from making changes. Making changes, shipping features, getting all that stuff out there for our customers is how we get more users. It's how we provide them with delightful things. So we don't want to punish people or beat them over the head with their SLO if they're not meeting it. However, it is a good thing to revisit after a post mortem or talk about during a review. Where are we on the error budget for this quarter? Where are we on the error budget for this particular feature? That way you can be really conscientious about the changes that you're making, the work that you're doing, and how it impacts your users. Yeah, we've seen it where some organizations will stop releasing new features if they're getting close to that. Right? When you're looking at that overall math equation and you know you're close to breaching, you're going to say, okay, we're going to pause and we're going to work on the reliability of what we have now.
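
The "where are we on the error budget?" review question, and the decision to pause releases when the budget is nearly spent, could look something like the sketch below. The 10% pause threshold and all names here are assumptions made for illustration; the talk does not prescribe a specific policy or number.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_bad_events: int   # error budget for the period
    observed_bad_events: int  # bad events seen so far in the period

    @property
    def remaining_fraction(self) -> float:
        """Share of the error budget still unspent, clamped to the range 0..1."""
        if self.allowed_bad_events <= 0:
            return 0.0
        spent = self.observed_bad_events / self.allowed_bad_events
        return max(0.0, 1.0 - spent)


def release_decision(status: BudgetStatus, pause_below: float = 0.10) -> str:
    """Hypothetical policy: pause feature work when little budget remains."""
    if status.remaining_fraction <= pause_below:
        return "pause releases and work on reliability"
    return "keep shipping"


print(release_decision(BudgetStatus(1000, 200)))
print(release_decision(BudgetStatus(1000, 950)))
```

In keeping with the point that SLOs are not a cudgel, a check like this is meant as an input to team prioritization, not as a gate used to punish anyone.
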
We're not going to make any more changes, so that you can make sure that you're keeping up with what your goals are. And you can also then use chaos engineering to test out the new features and to make sure that you're focusing on those SLOs. So some people say that you can only do chaos engineering in production. That's the only way. So if you're not doing it in production, you might as well just not do it at all. And that is absolutely not true. I mean, we've had experiences where we've practiced chaos engineering in tabletop experiments where we're just writing ideas on a piece of paper. Mandi had some fun with that at one of the summits a little while back. But you can actually adopt the practice in development, right? So if you think about it, you're architecting for failure. You're keeping that in mind, and you can get confident testing in development. Then you can move to staging, where you can start small and expand your blast radius as you are releasing these new features. And then finally you can move on to production, and you can start small with these experiments, and then you can increase the magnitude, you can increase the blast radius. So, in all reality, this is just how we do development, right? You don't actually have to overthink it. You just want to work iteratively like you would with code, and move up your environments like you would with code. We all know how to do this. And so, Mandi, I'm going to pass it to you to talk a little bit about working with upstream dependencies. Yeah, upstream dependencies can be tough, right? If they are things that are owned by your organization, you might have some ability to put some pressure on your colleagues and other business units to say, look, man, users really love this thing that we're consuming off of you, but you're not up to what they're expecting. You can have those kinds of discussions when you have external dependencies.
With third-party pieces, look to see if they even publish what they're going to deliver to you, if they've got published SLOs. If it's something that you're buying and they have a contractual SLA, look at those, because then you're going to use that as part of your own math to say, well, service A is reliant on service B, and service B can only ship us this particular availability. We can't be better than that, right? If you want to be better than that, you have to think about defensively coding around bad performance, looking at turning things off, or taking things out of the user experience if they're not performing. So really look at it from the user's perspective to say, would they rather not see something than have it be slow? Or do I need to look at alternatives? Do I need to consume this from another provider? And really be proactive about it. And I love the term defensive coding, too, because that's really what we're thinking about. Again, it's going back to that architecting for failure. And I know we mentioned this earlier, so I'm just briefly going to touch on using chaos engineering to validate those dependencies and those critical paths. There are specific attacks that you can run, so maybe you can inject some latency, since SLOs are time sensitive. Let's see what happens if this application is required for my application to serve its core function. Because again, we want to serve our customers. And even though an application might continue to work in some capacity, it might not be the capacity that supports the goals that we have set for ourselves. Absolutely. And that comes into things like unplanned work as well, right? Your incidents can indicate that work needs to be done on your reliability. Your SLOs and your error budgets, you really need to make them part of your post mortem process. You can sit down and say, we had this particular incident happen, and this is what we blew out of our error budget for this particular service.
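
The "service A is reliant on service B, so we can't be better than that" math can be made concrete with a small sketch. It assumes every dependency is a hard one (the request fails if the dependency fails) and that failures are independent; both are simplifying assumptions, and the availability figures are made up for illustration.

```python
from math import prod
from typing import Sequence


def availability_ceiling(own: float, hard_dependencies: Sequence[float]) -> float:
    """Best-case availability of a service that needs every listed dependency.

    All values are fractions between 0 and 1. With independent failures,
    the availabilities multiply, so the result can never exceed the
    weakest dependency.
    """
    return own * prod(hard_dependencies)


# Service A is 99.9% available on its own and hard-depends on service B at 99.5%.
ceiling = availability_ceiling(0.999, [0.995])
print(f"{ceiling:.4%}")
```

Defensive coding, such as hiding a slow component instead of failing the whole page, is what turns a hard dependency into a soft one and takes it out of this product.
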
And then you can make that decision. What work needs to be done in our next sprint to prioritize fixing this thing? How is it going to affect our error budget going forward? Can we even ship new things for the duration of this time period if we're far beyond our error budget based on the last incident? Planning around those things and focusing on the user experience is a real downstream tool of setting these goals, these objectives with your indicators, so that you can really plan defensively for the changes that you need to make to make the whole experience better. So your lifecycle is going to be: you start with your user behavior, and you look for your reliability and performance metrics, the things that your users care about. Then we're going to set all of our goals: define our SLIs and establish our SLOs. And we're going to work over time on keeping those SLOs in the green, right? And with that, we're going to practice our chaos engineering. We introduce a new feature. It's going to go through testing. It may also go through chaos testing as well, so that we know that it's not pushing our service out of tolerance for what our users expect from us. So it's a maturity process. So make sure you're prioritizing the user experience first. That's the whole reason we're going through the whole process. You're going to quantify what's good and bad via your experiences. Work with your error budgets. That's really just to tell your team where you are in your time frame, and then it all feeds back into how you prioritize and organize work. And if they're not working for you anymore, change them. And so we've got some great resources for you. There are the talks that you can find at SLOconf, and Google's SRE book. It's available online if you haven't read it. It's an amazing book. I haven't read it cover to cover yet. It's one that I'm working through, kind of like Lord of the Rings.
You can also check out Gremlin for free if you go to SlOS buttons so that you can practice with chaos engineering. But Mandi, I just realized we didn't tell people how to get a hold of us. So you can find me on Twitter at Gund. And Mandi? I'm lnxch. And so thank you for taking the time to hang out with us today, and hopefully your trauma is a little bit reduced. Yeah, thanks very much.

Julie Gunderson

Senior Reliability Advocate @ Gremlin

Mandi Walls

DevOps Advocate @ PagerDuty
