Conf42 Site Reliability Engineering 2021 - Online

Pragmatic Incident Response: 5 lessons learned from failures

Abstract

Incident response is overwhelming. So where do you start? There’s a lot of advice out there, but it’s mostly theories that aren’t taking reality into account. So how do you get a process in place that actually works and scales?

In this session, FireHydrant CEO and Co-Founder, Robert Ross, will share stories (good and bad) from his experience as an SRE and the 5 pragmatic tips he’s learned along the way for building a successful incident response process.

Summary

  • Robert Ross is the CEO and co-founder of FireHydrant, a reliability platform for teams of all sizes. He shares five pragmatic incident response lessons learned from failures. The habit of ignoring small issues often leads to bigger ones.
  • Start small: run retros for tiny incidents rather than tackling a giant incident retrospective for the first time. Low-stakes incidents are the best place to get better at doing retros.
  • Track your mean time to retro (MTTR): it is a great statistic for measuring and improving incident response on your team. The easiest way to improve it is to hold prompt, consistent retros on your incidents.
  • The severity of your incidents is directly linked to customer pain. Alert when a customer is feeling a problem, not when the computer is just doing its job. Create service level objectives that are tied to customer experience and alert on those.
  • Who do you notify about an incident? Can a small set of people tackle an incident without disrupting the rest of the company? When do customer success and support get involved? Too many cooks spoil the broth.
  • What teams own your services? You need to think about service ownership before you actually have an outage. Knowing which team to send to the fire is critical. Build a service catalog and assign clear ownership lines; it will greatly help your incident response.
  • Declare and run retros for the small incidents; action items are more actionable. Alert on pain felt by people, not computers. Assign clear service owners: without a clear line of ownership, you're going to have longer incidents.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.

Welcome everyone to my Conf42 talk, Pragmatic Incident Response: lessons learned from failures. Let's go ahead and get started here. My name is Robert Ross, but people like to call me Bobby Tables. I am the CEO and co-founder of FireHydrant. We are a reliability platform for teams of all sizes looking to up-level their service ownership, incident management, incident retrospectives, postmortems, whatever you call them. My experience has led me to start this company and ultimately to collect a bunch of tips on things I think you can do to help improve incident response at your company, your team, maybe your life. Really up to you. So what we're going to talk about are five tips on pragmatic incident response. Pragmatic being the key word here: there's a whole swath of things you can do to improve incident response, but we're just going to talk about some relatively easy things that you can start to push into the company today to help you with your incident response and management.

So one of the things we're going to start with is small things. And if you recognize these lyrics, it's because they're from a band called Blink-182, one of my favorites and from my hometown of San Diego. But we're going to talk about my elevator instead for a second. So I live in an old candy factory. This building, what you see behind me, it's an old candy factory in Williamsburg, Brooklyn, New York City. And in my elevator, some of the buttons don't light up when you push them. They certainly work, and they dispatch an elevator and send it to the correct floor, but they don't light up. Another problem is that the LED display inside the elevator car sometimes shows the wrong thing for the floors. So if you push basement, it'll actually say cellar on the LED screen. Just a minor inconsistency. And another fun one that I discovered very recently is that it has a race condition: if you push a button, sometimes two elevators will show up simultaneously. No one's in them, just two show up. And generally it's actually kind of a piece of crap, but it goes up and down, and that's what's important, right, for an elevator? It goes up and down, the door opens, I step in and out, whatever.

But then on April 8th of this year, I walked up to my elevator and there was an out of order sign (and actors were used in this photograph). And I was like, okay, well, my elevator, the buttons are mislabeled, it has a race condition and a bunch of other problems, and it's dirty a lot of the time. Shocking, right? Oh my God, no way the elevator is broken right now, right? I was completely shocked. Not at all. It made sense. It all lined up. The story told itself. And where I'm going with this is that the habit of ignoring small issues is often going to lead to way bigger issues. When little things start to pile up, it's no longer a bunch of small things; it's going to become a big thing a lot of the time. It's important to start seeing small things as indicators of bad things to come if we're not going to go and handle them.
I'm sure we've all worked at a company, maybe you work at one right now and you don't want to admit it, that has an error tracking system with thousands of errors, thousands of exceptions in there that nobody is going and looking at. And one day, one of those exceptions is actually going to matter. And it will be so hard to wade through all of the exceptions in that tool because there are just so many of them.

So my first tip here is that you should be focusing on small incidents first. Don't try to change the world for your incident response methodology on your team or at your company with huge incidents; let's talk small first. Run retros for tiny incidents. Don't run them for the biggest ones. You're not going to move mountains there. Don't go into a giant incident retrospective for the first time, where we're all excited that we're going to change our process and make the world a better place, and inevitably you end up with a Jira ticket that says rearchitect async pipeline. That's never going to happen. That's not going to happen because of that incident. It's going to take a way bigger movement than that. So create really small, actionable items from your incidents. And low-stakes incidents are the best place to get better at doing retros, right? They're low stakes. Nobody is feeling really down on themselves for this incident. Maybe it was just a minor data inconsistency in an export to another system. No user data was lost. No users even knew there was a problem. Those are the best ones to start instituting behavioral change for your incident response and retrospective practices. And Heidi Waterhouse, who works at LaunchDarkly, has a great quote: a plane with many malfunctioning call buttons may also be poorly maintained in other ways, like faulty checking for turbine blade microfractures or landing gear behavior. So next time you're on a plane and the buttons don't work and maybe the window is super scratched and your seat doesn't lean back, maybe you should start to wonder certain things. And sorry if I just gave anyone crippling anxiety about flying, but I think that this quote is really good.

Moving on, let's talk about things that you can measure. So that was tip number one. Tip number two: what do you measure? Knowing how much you're improving is really important in improving your incident response and your overall reliability as a team, as a company, as an organization. Are you improving your MTTR? Are you actually, measurably making sure that your MTTR is going down? So my second tip, really quick, is check your MTTR. MTTR is a great statistic to measure and improve incident response on your team. Really, really critical that you measure your MTTR. And for the incident and human safety factor nerds out there who love learning about MTTR and why it's the worst metric to track: I am in fact trolling you. The thing that I'm actually saying is mean time to retro. I'm sure you didn't think I was saying mean time to resolution, I hope. But we're talking about mean time to retro. Retrospective. Our memory up here, it goes quick. It fades very fast. Hermann Ebbinghaus discovered that memory loss is way faster than we think. The moment you learn something, you're going to start to forget it pretty quickly. You need to really institute knowledge through dedicated learning for it to stick around.
So where I'm going with this is that your mean time to retro really matters, because one of the easiest ways to have a bad incident retrospective is to wait two weeks. You could do all of your process right, you could have all the right people in the right room, you could have your whole timeline correct, and you could still produce a bad retrospective, learning document, whatever your outcome is. And the reason is that we forget things very, very quickly, faster than you think. So don't wait weeks. If you have a sev one incident, that is a moment where calendars should change: calendar events the next day should go away to make room for that retrospective. Because a sev one incident, that's a big deal. Maybe a sev five gets a little bit more leniency, but even still, do it within a couple of days, because our memory just is that susceptible to being corrupted or lost entirely. So tip number two, with a little bit less trolling, is track your mean time to retro. Hold prompt and consistent retros on your incidents; the easiest way is, again, don't wait, do it super fast. Another metric that I really like is tracking the ratio of retrospectives to incidents. And that's a number that should go up. If you are tracking this number, let's say you have ten sev one incidents, but you only did two retrospectives. That's a pretty bad ratio. You should be getting that number up. So track that number. It's a really good one to actually, measurably see if your company behavior is changing. And that's one of the most important things: getting people on the same page.
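To make those two numbers concrete, here is a minimal sketch of how you might compute them from whatever incident tracker you already use. The Incident record and its field names are illustrative assumptions, not any particular product's API.

    # Minimal sketch: two incident-process metrics from the talk, computed over a
    # list of incident records. The Incident shape and field names are assumptions.
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class Incident:
        severity: str                        # e.g. "sev1" ... "sev5"
        resolved_at: datetime
        retro_at: Optional[datetime] = None  # None if no retrospective was held

    def mean_time_to_retro(incidents: list[Incident]) -> Optional[float]:
        """Average days between resolving an incident and holding its retro."""
        gaps = [
            (i.retro_at - i.resolved_at).total_seconds() / 86400
            for i in incidents
            if i.retro_at is not None
        ]
        return sum(gaps) / len(gaps) if gaps else None

    def retro_ratio(incidents: list[Incident], severity: str = "sev1") -> float:
        """Share of incidents at a given severity that actually got a retrospective."""
        relevant = [i for i in incidents if i.severity == severity]
        if not relevant:
            return 0.0
        return sum(1 for i in relevant if i.retro_at is not None) / len(relevant)

With this shape, the talk's example of ten sev one incidents and only two retrospectives comes out of retro_ratio as 0.2; the goal is to drive that toward 1.0 while mean_time_to_retro trends toward a day or two.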
So let's talk about things that you can alert on. A hot topic, I'm sure, for a lot of people: maybe you have an alert fatigue problem. Maybe you have systems that fire red herring alerts all the time. So let's talk about things that you can alert on. One thing that we should all take note of here is that we don't just declare incidents. We only really declare incidents because our customers, or the people that receive value from our systems and software, are feeling some level of pain. Whether it be slow checkout times on your online ordering form, or just errors entirely, or an image isn't loading for them and that's why they're there, they want to look at it, maybe you have a photo website. The severity of your incidents is directly linked to customer pain. We would not open incidents unless there was a customer on the other side having some level of pain.

And where I'm going with this is that computer vitals are what a lot of companies alert on. A lot of companies do this, and it's okay, you can get better at this. But when you are alerting on your computer's CPU, that is a bad metric to alert on, because CPUs are going to get hot. They're going to be utilized more at certain times of the day and utilized less maybe in the middle of the night, wherever your peak traffic is. And this is a big deal, because if you think about it: if I go for a run, if I go down to my street, and if my elevator works and I get on the street and I sprint down the block, my heart is going to beat faster. It is designed to do that. It means, hey, you are exerting a lot of energy, I'm going to give your muscles more oxygen. That's what it does. I think that is what I was taught in school. I need to maybe go verify that before I so confidently say it in a talk. But my point is that when you page people at 2:00 a.m. because disk is maybe at 80% capacity, or your CPU is running at 100%, let's say 95%, but no one is feeling any pain, your customers don't know anything, don't notice any difference, there's no reason to wake anyone up. The easiest way to lose great teammates is to page them for things that no one knew about, that no one actually felt any pain from.

So my third tip here is to alert on degraded experience with the service and not much else. A CPU burning hot, again at 90%, is not necessarily a bad thing. It might indicate problems down the road, but correlation does not equal causation all the time. What we should be doing is creating service level objectives that are tied to a customer experience and alerting on those. Alert when a customer is feeling a problem, not when the computer is potentially just doing its job. People experiencing problems are the only thing you should alert on, period. Really, there's no reason to wake anyone up unless someone is feeling a problem with your system. SoundCloud has a great blog post on this, I reference it all the time, called Alerting on SLOs, and what they say is that you should alert on symptoms and not causes. So going back to the running analogy: if I go downstairs and I sprint down the block, my heart's going to start beating faster. That's what it does. But if all of a sudden I pass out and I fall and I hit my head, that's maybe a problem, right? I shouldn't be passing out just because I sprinted down the block. That means that something went wrong, and I should maybe go to the hospital and page someone. So you should only be alerting on symptoms, not the causes.
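As a rough illustration of alerting on symptoms rather than causes, here is a minimal paging-check sketch. The window record, thresholds, and field names are assumptions for the example, and this single-window check is a simplified stand-in for the burn-rate style SLO alerting described in the literature the talk references.

    # Minimal sketch of symptom-based alerting: page on customer-facing objectives
    # (error rate, latency), never on raw machine vitals like CPU.
    from dataclasses import dataclass

    @dataclass
    class WindowStats:
        total_requests: int
        failed_requests: int    # user-visible failures (e.g. 5xx)
        slow_requests: int      # responses over your latency objective
        cpu_utilization: float  # collected for debugging, deliberately not paged on

    def should_page(stats: WindowStats,
                    error_budget: float = 0.001,   # e.g. a 99.9% availability objective
                    latency_budget: float = 0.05) -> bool:
        """Page only when customers are feeling pain in this window."""
        if stats.total_requests == 0:
            return False
        error_rate = stats.failed_requests / stats.total_requests
        slow_rate = stats.slow_requests / stats.total_requests
        # cpu_utilization is intentionally ignored here: a hot CPU is the computer
        # doing its job, and it only matters once it shows up as errors or latency.
        return error_rate > error_budget or slow_rate > latency_budget

The design choice is the return line: only customer-visible symptoms decide whether a human gets woken up at 2:00 a.m.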
Who do you notify about an incident? This is an interesting topic, I think, because at a lot of companies, incidents are high blast radius events. Everyone in the company knows that you're having an incident in some way, shape or form. I used to work for a payroll provider, and if our payroll went down, that was a big deal. A lot of people knew about that. Our HR team knew about it, our engineers knew about it. Customer success knew about it because their accounts were calling them saying, hey, I can't run payroll. The sales team knew about it because they couldn't do demos. It was a big deal for the payroll system to no longer be running, and so it became an issue when you have 500, 600 employees and everyone wants to be involved. Everyone. They all feel the pain of it internally at the company, so they all want to join an incident channel or a Zoom call or whatever it is. And incident notifications can be scary. As an engineering team, do you really need to tell your entire team about an incident? Are you going to cause panic and pandemonium by telling your team that there's an incident? I certainly have, even though it was a benign incident. Simply saying, hey, this isn't working right now, people don't always know how to take that. So can a small set of people tackle an incident without disrupting the rest of the company? Can you have a tiger team that just gets in there really quick, fixes the problem, and gets out? When do customer success and support get involved? That's another big deal. Do they need to be attached at the hip to engineering's communication lines during an incident, or can there be a push mechanism? Can we give the success and support teams the information they need the moment it becomes available, without them being in, let's say, the figurative kitchen?

And another question is, when do you escalate? When do you actually say, you know what, maybe we do need to tell everyone right now? That's another challenge that should have a well defined process behind it. So my fourth tip is that focused teams will perform the best mitigation. There is such a thing as over-communicating about an incident to a company. I've done it, and I've seen it done. It can cause ridiculous distractions and really massive logical fallacies among the team. Again, incidents are big blast radius events. We should be very careful about how much we actually tell people about those incidents. Even in New York City, when there's a fire in the big, tall buildings, where there's cement in between each floor, you are actually told by the fire department, if you are above the fire, i.e. if there's a fire on floor ten and you're on floor 13, the fire department will actually tell you, don't leave the building. Don't leave the building. And as insane as that sounds, the reason the fire department does that is, one, because you're probably not in that much danger. There's a lot of cement between you and the floors below you. And two, when you have, let's say, a 25-story building, that's what, 15 floors of people, with each floor having hundreds of people, suddenly on the street down below, the fire trucks can't get there. They can't even get to the building. So that's why the fire department says, if you are two floors or more above the fire in New York City, in certain buildings, don't leave your floor, because you're going to make it really hard for the fire department to even get to the fire to put it out, and then you actually make it worse. So be careful who you actually loop into incidents. It really can make a big difference. And a great quote: too many cooks spoil the broth. Too many people mitigating by committee will produce inferior results.

All right, the last tip that I have for everyone here: what teams own your services? If you're at a bigger company, or even a small one with only a few things or product areas, this is a big thing for people to think about, and you need to be considering your service ownership before you actually have an outage. And not only for the purposes of this team builds these features on this service. That's not all I mean; that is a part of it, but really it's more important because people need to feel like they own something to properly know how to fix it, or call the right people, or be the right person to fix those problems. For example, if there's an authentication service that my team owns that goes down, it doesn't make sense to have a different team from maybe a different product area coming to my neighborhood to solve the problem. They might be completely capable of doing it; it's not a capability thing. But they might take a little bit longer to get there. They might take a little bit longer to get their bearings. Imagine if a fire department from the northern part of Manhattan had to go to the southern part of Brooklyn. That doesn't make a ton of sense. They maybe don't know the roads as much, so that's a big deal. And also, the service owner knows all the dependencies. They know where all of the dead bodies are in the application. They are going to be the best people to resolve that incident. But you're not going to be able to define that the moment an incident starts. This needs to be clear beforehand. Everyone should know: when this service has a problem, I'm the person that resolves it, and there should be no question about that. Because knowing which team to send to the fire is critical when seconds matter. There's a reason that there's a fire department in each neighborhood: it's the fastest way to get to fires. They know the neighborhood, they know the nuances of the streets, they are the better team to get there. And seconds do matter, of course, in real-life fires, but also with software.

So my fifth tip is to assign clear ownership, very crystal clear ownership. Ad hoc firefighting will only continue unless there is a clear line of who owns what. Providing a service catalog with clear ownership will help you route people to fires faster, I promise you. A service catalog is also valuable because it's a thought exercise. It makes you think, well, here's everything we own, and you don't actually need a big database to do that thinking. It can be a great exercise even just to build a service catalog and actually start to think about your system at a high level. So build a service catalog, assign clear ownership lines, and it will greatly help you in your incident response.
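As a tiny illustration of what clear ownership lines can look like, here is a minimal, hypothetical service catalog sketch. The services, team names, and rotation identifiers are made up for the example; the point is only that the ownership question is answerable before an incident starts.

    # A minimal, hypothetical service catalog: enough structure to answer
    # "who do we page when this service is on fire?" before the fire starts.
    # Service names, teams, and rotation identifiers below are illustrative assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class Service:
        name: str
        owner_team: str          # the one team accountable for this service
        oncall_rotation: str     # e.g. a paging schedule identifier
        dependencies: list[str] = field(default_factory=list)

    CATALOG = {
        s.name: s
        for s in [
            Service("auth", owner_team="identity", oncall_rotation="identity-primary"),
            Service("payments", owner_team="billing", oncall_rotation="billing-primary",
                    dependencies=["auth"]),
            Service("exports", owner_team="data-platform", oncall_rotation="data-primary",
                    dependencies=["auth", "payments"]),
        ]
    }

    def who_owns(service_name: str) -> Service:
        # Route responders to the neighborhood fire department, not a random crew.
        return CATALOG[service_name]

With something like this in place, who_owns("payments").oncall_rotation answers the routing question in one lookup, and the dependencies field gives the responding team a head start on where to look next.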
So let's recap really quickly here. One: declare and run retros for the small incidents. It's less stressful, and the action items are more actionable. Don't do retros for giant incidents yet; do the small incidents first, I promise, trust me. Two: decrease the time you take before you even analyze an incident. Good retros take practice, and that's going to take you a little bit of time, but one of the easiest things you can do is just decrease the time before you even have the retro. Those retros will be more valuable. Three: alert on pain felt by people, not computers. Computers, we might feel bad for them. My computer is running hot right now recording this, but it's not sentient, it doesn't have feelings. If I suddenly lost this recording, that's something I would alert on, and I'd probably be very upset. So alert on the pain felt by people, not computers. And really, it's worth saying: the only reason we declare incidents at all is because of the people on the other side using our software. Four: consciously mitigate without overdoing the communication. Again, bringing a lot of people into an incident is not the best way to solve an incident. The fire department doesn't want everyone on the street, even though they're pretty close to the fire. It's safer for people to stay above it and out of responders' way down below on the street. And number five: assign clear service owners. If you don't have a clear line of ownership between people and the services they build and run, you are going to have longer incidents. You're going to have maybe even more impactful incidents. Maybe it starts to spread, maybe you have a thundering herd problem. It is really important to have your service ownership very well defined.

And with that, I will say thank you to the organizers and to everyone that watched this. If you are looking to implement maybe some of these practices, maybe your ears perked up a little bit on some of them: firehydrant.com. This is what we do. We started to build this software to solve our own problems, the problems that led to these tips. So check us out. I'm also bobbytables on Twitter and GitHub, and we're also at FireHydrant on Twitter as well. And with that, I hope everyone enjoys the rest of your day. And thanks for watching.

Robert Ross

CEO @ FireHydrant
