Conf42 Incident Management 2022 - Online

Varieties of Incident Response

Abstract

Have you ever wondered if there was a better way to respond to incidents? When you are in the midst of an incident, does “the process” help you and your teammates or is it more of a burden?

There have been a variety of approaches to organizing people and teams over the 30+ years of online services. Each of them has benefits and drawbacks. This talk will dive into a representative set of these approaches, examine them, and give the audience a wider context in which to evaluate their own arrangements for incident response.

The talk will also look at incident response from a more abstract, task/intent-focused perspective to give a framework against which processes can be examined and adjusted to be more enabling, less burdensome. (And no, this is not a lite beer commercial ;-))

Summary

  • A concept called the OODA loop involves observing, orienting, deciding, and acting. With a multitude of people involved, even if it's just three or four, there is coordination needed. How do you organize this functional approach to responding to incidents?
  • Another model, the Incident Command System (ICS), came into the tech industry about ten years ago, adapted from emergency response. In its strict form the incident commander is literally in command, which doesn't fit well with an agile, everybody-takes-responsibility engineering culture. In the adaptive form, the incident commander and the communications lead end up being a wrapper or a protection for the technical teams.
  • The OODA loop (observe, orient, decide, and act) underlies the functional side of incident response: diagnosis, therapy, recruiting, coordinating, and communicating. No one size or structure is correct all the time. All hands on deck can be the right approach in particular scenarios.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Thank you for joining me for my talk. I want to go over some aspects of incident response that maybe you haven't thought of before, as well as talk about different ways that companies can organize their approaches to incident response. Taking a cue from a popular narrative approach, let's start in the middle of the story. It was a calm, dark night, and our hero was fast asleep when all of a sudden the pager went off.
Now, when the pager goes off, what normally happens? Typically it will bring in a technical responder. It's going to go to a service team person who's on call for the problem service and is needed. That person might bring in additional resources, call a colleague, call a friend, and try to get the problem solved. Especially if it's happening during the daytime or business hours, it's very common that they'll recruit other folks into the event with them. So what will these folks do when they join the incident bridge, call, channel, whatever it is that they're using as a communications method? Perhaps they're even in a physical situation room, working together to solve this.
I want to introduce you to a concept called the OODA loop. OODA: observe, orient, decide, and act. This is a fundamental approach that's going to give us a way to look at what everybody's doing along the way of handling an incident. They're going to start with observing: seeing what signals are available to them, what's happening, and what the system tells them about what is happening. After that, they're going to try to orient. They're going to make sense, or try to make sense, of what these signals mean, to enrich their observations with context. What's been deployed recently? What other changes do they know may or may not have been going on? Then they're going to have to make a decision. They're going to have to evaluate their response options, pick an approach, and then ultimately act: execute the decision and, in turn, repeat, looking at what effect their action has.
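As a rough illustration of the observe, orient, decide, act cycle described above, here is a minimal Python sketch. It is not from the talk: the function names, the `error_rate` signal, the 5% threshold, and the rollback and recruiting actions are all assumptions chosen only to make the four steps concrete.

```python
from typing import Optional

ERROR_THRESHOLD = 0.05  # assumed threshold for "something is wrong"


def observe(system: dict[str, float]) -> dict[str, float]:
    # Observe: snapshot whatever signals the (toy) system exposes right now.
    return dict(system)


def orient(signals: dict[str, float], recent_changes: list[str]) -> str:
    # Orient: enrich the raw signals with context, such as recent deploys.
    if signals["error_rate"] > ERROR_THRESHOLD and recent_changes:
        return f"regression from {recent_changes[-1]}"
    if signals["error_rate"] > ERROR_THRESHOLD:
        return "unexplained errors"
    return "healthy"


def decide(hypothesis: str) -> Optional[str]:
    # Decide: pick a response option for the current hypothesis.
    if hypothesis.startswith("regression from "):
        return "rollback"
    if hypothesis == "unexplained errors":
        return "recruit help"
    return None  # nothing to do, the loop can stop


def act(action: str, system: dict[str, float], recent_changes: list[str]) -> None:
    # Act: execute the decision. In this toy model a rollback undoes the
    # latest change and restores a low error rate; recruiting changes nothing.
    if action == "rollback":
        recent_changes.pop()
        system["error_rate"] = 0.01


def ooda_loop(system: dict[str, float], recent_changes: list[str],
              max_iterations: int = 5) -> list[str]:
    # Repeat the cycle so that each action's effect is observed again.
    actions: list[str] = []
    for _ in range(max_iterations):
        signals = observe(system)
        hypothesis = orient(signals, recent_changes)
        action = decide(hypothesis)
        if action is None:
            break
        act(action, system, recent_changes)
        actions.append(action)
    return actions
```

The point of the sketch is the shape of the loop rather than the toy rules: every action is followed by a fresh observation, so the responder checks whether the therapy actually worked before deciding what to do next.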
After acting, they orient again as far as what the result means, then make new decisions and take new actions. Each person is going to be doing this individually, as well as the group doing it together. So what is it that they're doing? They're doing diagnosis: figuring out what is happening and figuring out what can be done about it, with the goal, ultimately, of therapy. They want to take action to mitigate the issue and to ensure that the action was effective.
So what's missing from this picture? Let's back up just a little bit, because I ran a little bit fast going through all of this with the initial responder engaged, the person who receives the page. I glossed over the work of recruiting additional people, and the role of incident commander is often one that doesn't fit well on the shoulders of a technical responder. This work of getting the right people involved is something that can be handed over, and often is, to a role that's known as incident commander or incident coordinator. I use incident commander because it's kind of the traditional term. How does the IC themselves get engaged? It could be that the technical responder throws up a flare and says, hey, I need an incident commander to help me out here. Or it could be that the pager calls the incident commander, either initially, and they bring the technical responder in, or at the same time as the technical responder is paged. In turn, the incident commander can then work on recruiting the additional people that are needed to help out in the scenario.
With a multitude of people involved, even if it's just three or four, there is coordination that's needed. Who does what and when? Who's in charge? This is where the organizational models come into play. So we've got this structure right now: a technical responder that's engaged with technical compatriots, and an incident commander who's looking over things.
We also have, in any incident, other concerned people. These might be executives, they could be managers, they could be customers if this is a customer-impacting incident. They want updates. They want to know what's going on. With an incident commander, they know who to talk to, because that is the designated contact. But sometimes there's so much work going on, and so many things to coordinate within the response team, that delegating that communication role to a communications lead can really help out the incident commander. So this covers the spectrum, roughly, in very functional terms, of the work that needs to be done: diagnosis and therapy in the technical realm, recruiting and communication in the logistical or administrative realm, and coordination, bringing it all together. This is the functional approach.
How do you organize this, and how do teams or companies typically approach structuring the way that people respond to incidents? Let's map this functional model into an organizational approach. So there's one approach that I term the one at a time, or pass the baton, method. The page comes in to team A. Perhaps there is an alert that was triggered on a service that's owned by team A, and team A looks into it and decides: our system is doing the right thing; it must be because team B's system is giving us bad data, or returning an error, or not returning at all. So they pass the baton to team B, who in turn looks at it and passes it to team C, and ultimately they decide, ah, it's got to be the DNS, so we'll pass it to the DNS team, since that starts with a D.
One of the problems with the one at a time, pass the baton method is: where does the incident commander come from? We've passed the responsibility from team A to B to C to D. What happens if team D finds that their problem is actually caused by something back on team B? We've gone around in a loop here, perhaps with no great resolution. The lack of an incident commander is typical in this kind of structure. It leads to confusion, longer response times, and longer resolution times, and it leaves the other concerned people kind of floating out there in the fog somewhere. They don't know who to talk to. One minute they're talking to somebody on team B, and the next minute, oh no, we're not involved in that anymore. It's not a great experience for the outside folks.
An alternative that avoids this problem of passing the baton from team to team is the all hands on deck approach. The page may go out to all four teams at once, plus an incident commander. This can have different variations: maybe it goes to the on-call engineer from each of the teams. Perhaps it goes to a single cross-functional team that has all the capabilities that are needed within that one team. Or literally it could go to everyone. There are organizations where an incident will trigger 100 or 200 people to join the situation room, the channel, whatever the communication framework is. This can lead to teams that are seeking to escape. If you essentially summon four teams into every incident, their first thought is going to be, okay, how do I get out of this and leave some of the other teams holding the bag? That's not necessarily the greatest collegial experience and can lead to some unfortunate dynamics, but at least we have an incident commander. And so the outside parties that are wondering know who to talk to, and they can work through that process in order to understand what's happening with the event.
An alternative approach is an escalation or tiered model. This is very common in organizations that have a NOC, a network operations center, or some variation thereof. Calls and alerts come into the NOC.
They triage those calls and either decide which tier two team to escalate to or, ideally, try to solve some large percentage of them without requiring the escalation. Tier two gets involved when necessary. They try to take out another chunk of the calls and only bother the expert tier three team on occasion. This model has the question of where the incident commander comes from. Some organizations will have incident command be a responsibility of the NOC. They have the expertise, they have people trained in doing incident command, and they will follow through the escalation process. But this isn't always the case, and again, it's an open question that organizations will solve in different ways. If there's not an incident commander, the outside concerned people have the same challenge of knowing who to talk to as in the one at a time model. If there is an incident commander, that mitigates the problem for the other parties that are involved.
There's another model that came into the tech industry maybe about ten years ago, adapted from emergency response (firefighting, FEMA emergency response), called ICS, or Incident Command System. The first adaptations of this into the tech world I'm calling strict ICS. The incident commander is literally in command, and if they don't do their job, nobody else is going to do anything at all. This can be thought of in the scenario of wildland fire response, where the fire chief tells each of the teams what they're going to be doing. They're very directive and very tightly organized, I guess is the best way to put it. Other concerned people are kept informed because you've got a communications lead role. The problem is that this doesn't really fit well with an agile, everybody-takes-responsibility engineering culture. And so there's an adaptation from strict ICS that has evolved in the tech industry, which has come to be known as adaptive ICS, thanks to the work of Laura Maguire.
Essentially, the technical teams undertake their own work, largely in a self-directed manner. The incident commander and the communications lead end up being a wrapper, or a protection, for the technical teams from the outside world. They provide the communications in and out, and the incident commander is responsible, again, for focusing on how the event is being conducted: making sure the teams have who and what they need, and making sure that the teams are staying functionally operational, that people aren't getting too tired out and wiped out from the process. My colleague Matt Davis has coined the term response trio: the incident commander, the communications lead, and the problem-solving team, all working together in concert to accomplish the goal of solving the problem as quickly and effectively as possible. The incident commander's key role is to be focused on the coordination aspects of the problem: how the response is being conducted, not the incident itself. They're not diving into the metrics, the graphs, and the logs. That's the technical team, the problem solvers. The incident commander is responsible for upholding and maintaining the common ground throughout the response team and throughout the response period.
Let's recap some of these pieces quickly as I draw this to an end. We have the OODA loop: observe, orient, decide, and act, which is the process that everybody is doing. Even the incident commander is observing how the individual responders are interacting with each other and seeking to make sure that, let's say, when a new person comes in to join, they are brought up to speed quickly on the common ground that has been established and maintained amongst the previous responding teams, without having to distract the technical responders. Everybody is orienting, understanding the meaning of what they're seeing and observing, making decisions, acting, and then repeating the loop regularly.
This is all in service of the technical sides of diagnosis and therapy: understanding the incident and then handling it, solving it. Again, recruiting people is a key component, along with coordinating who does what and communicating outside of the team. This is the functional side of incident response, and pretty much everything that happens can fit into one of these categories. It can be helpful to think about: is what I'm doing therapeutic, or is it diagnostic, or is it an attempt at therapy where, if it works, we'll conclude, because it worked, oh, that must have really been the problem? That's not an unusual scenario.
For organizational models, we've looked at one at a time, all hands on deck, an escalation model, and strict and adaptive ICS. I want to point out that while adaptive ICS is kind of the latest approach, and the one that is generally widely accepted amongst companies and organizations, no one size or structure is correct all the time. There are scenarios and organizations where one at a time is the best that you can do. Perhaps you have teams that are geographically distributed. One model of pass the baton is follow the sun: team A works for 8 hours, then they hand it off to team B, and then they hand it off to team C. These are not functional teams; they're time zone teams. And again, handoffs: there's a lot of literature and analysis around effective and ineffective handoffs that I don't have time to go into. But this is one aspect of pass the baton. You can pass it around the world. All hands on deck can be the right approach in particular scenarios. And so it's important that you don't hear any of this as being down on a particular model. Understand the strengths, understand the weaknesses of each model, and then adapt them to your scenario. That's the best way to do it.
You recall at the beginning that I said I was going to jump into the middle of the story.
There's much more to explore regarding inter-team dynamics, as well as the organizational context in which those dynamics happen, and I'd encourage anyone who's interested in this topic to dive into the literature of resilience engineering, adaptive capacity, and Safety-II. I have some resources for further reading on the next slide. As a starting point, I'd like you to consider the questions: why is this thing making noise, and is it actionable? So here are some suggestions for further reading, and I'd invite you to reach out to me, either through the conference Discord or through email, with any questions that you have. Thank you very much for considering these varieties of incident response.

Kurt Andersen

SRE Architect @ Blameless
