Conf42 Site Reliability Engineering 2021 - Online

Don't Panic! Effective Incident Response


Abstract

Incidents are the pits. They’re frequently unexpected and many of us scramble to respond to the alarm. But what if you could refine your response process so that it felt routine?

Today I’ll be talking about how to do just that - by adding formalized structure to your process.

Summary

  • An incident is any unplanned disruption or event that requires immediate attention or action. The main goal of incident response is to replace chaos with calm. Even with a disorganized response you'll likely still find a solution, but it will take longer. Automation is also very important.
  • The incident command system was developed as part of wildfire response in Southern California. In 2004, the National Incident Management System, or NIMS, was established by FEMA and is now the standard for all emergency management by public agencies in the United States.
  • When defining incidents, it's important to codify what constitutes a major incident. Similar to an incident, it needs to be defined in a way that can be used across teams in your organization. You might want to have a metric or two that you tie to it.
  • The timing of a major incident is a surprise, so there's little to no warning, and the time of the response matters. It also requires mobilizing a team of the correct responders to resolve. Anyone needs to be able to trigger an incident at any time. We need to switch from peacetime to wartime.
  • The lifecycle of an incident has four steps: triage, mobilize, resolve, and prevent. In order for these steps to happen relatively smoothly, you want to make sure you have some roles defined.
  • The incident commander is basically running the incident. The deputy might help with other tasks, like the liaisons or the scribe. And then there are all the subject matter experts: the people with the relevant expertise for whatever the latency, outage, or breach is.
  • A typical sequence of events: something is triggered, either by monitoring or by a human, and it goes to a subject matter expert. The roles scale down; maybe your incident commander and other roles can be rolled up in a way that's relatively effective. If you notice that a pairing isn't working, don't keep using it.
  • The incident commander role ensures the incident response is calm and organized, because they're making the calls. The incident commander also makes sure to assign tasks to a specific person. Making a wrong decision in this situation is better than making no decision.
  • Some tips for new incident commanders: introduce yourself on the call with your name. Avoid acronyms, because you don't know what different teams are doing. Speak slowly and with purpose on the call. Kick people off if they're being disruptive. In summary, with an incident commander everyone stays focused.
  • How do I prepare to manage an incident response team? Step one, ensure explicit processes and expectations exist. Step two, practice running a major incident as a team. Don't neglect the post mortem while things are fresh.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud. Hello everyone, I hope you're having a great conference. My name is Quintessence and I'm here to talk to you about effective incident response.

Jumping right in: before we talk about incident response, let's standardize on the definition of an incident. An incident is any unplanned disruption or event that requires immediate attention or action. This encapsulates different types of incidents; they can be internal- or external-facing: security breaches, data loss, et cetera. Your definition may differ from ours, and this is how we define incidents at PagerDuty, but as long as it's standardized in your organization, that's fine.

The main goal of incident response is to replace chaos with calm, because during an incident people might be talking over each other or not knowing who to reach out to, basically trying to get that incident resolved, and there's a bunch of chaos in getting that accomplished. And that chaos is not good for an incident: it's going to increase your frustration and confusion, and it's going to make it slower for you to get to that resolution. The goal of incident response, therefore, is not only about resolving that incident, but doing it in an organized way. You're trying to reduce how much confusion there is around the response process itself and put as much energy as possible into the actual problem. Said another way, the goal of incident response is to handle the situation in a way that limits damage and reduces recovery time and costs. Because even with a disorganized response you'll likely still find a solution, but it'll take longer. The incident might get more severe over time, there might be more data loss if that's the situation, or there might be a longer outage, and there's just more damage overall, more difficulty, as the incident continues to escalate.

So in order to accomplish the goal of effective, calm incident response, you want to make sure you're mobilizing the correct people. That means the right responders, knowing who to page for a specific service or whatever the situation is. They need to have the right skill set, and they need to have enough autonomy to take action on whatever's happening. You also need to make sure that you're learning and improving from each situation, so you can build off of your mistakes and avoid them in the future. Automation is also very important, because if you can build automation into your response, it's possible to start triggering the automated response before a visible, impacting incident occurs.

Rolling back a little bit: when we're talking about incident response, we're talking a little bit about what's known as the incident command system. And who invented this? Certainly not us, right? It was actually based on what's called ICS, which was developed as part of wildfire response in Southern California. The problem they encountered was that thousands of firefighters needed to respond to massive wildfires.
And while each of them individually knew how to handle that type of fire, they did not know how to handle it at scale, across the land but also between each other, so they could not effectively coordinate over the scope of the entire situation. After those fires happened, an agency called FIRESCOPE was formed to develop a model for large-scale response to a natural disaster, and one of the things that came out of it was ICS, which is what we're talking about right now. This became more broadly used for any major incident, and in 2004 the National Incident Management System, or NIMS, was established by FEMA and is now the standard for all emergency management by public agencies in the United States. So this means that ICS is used by everyone, for every response, from an individual house fire to a natural disaster. And that standardization means that everyone familiar with the process, with the response procedure, is able to put more mental energy on the problem, on the fire or the flooding or whatever the situation is, rather than scrambling to define a process on the fly while the disaster is going on.

This might not feel directly relevant to IT incidents, but the principle is the same. Recall how we defined an incident as an unplanned disruption that requires immediate action or attention. The idea is that this definition, and the response pattern, is as similar across teams as possible, so when they need to coordinate, they can. This is very similar to what was happening with ICS in the fire response.

Building off of that a little further: when defining incidents, it's important to codify what constitutes a major incident. Similar to an incident, it needs to be defined in a way that can be used across teams in your organization. You want to make sure that your definitions are short and simple and prevent the stress of "is this an incident or isn't it? And if it is, is it a major one?" Keep all of that in mind when you're writing these definitions. Here at PagerDuty, we define a major incident as any incident that requires a coordinated response across multiple teams. Depending on your needs or your organization, you might want to have a metric or two that you tie to that in order to differentiate a major incident from a so-called regular incident, as long as it doesn't become too granular. An example of this might be: it becomes a major incident when more than X percent of customers are impacted. Again, this is all because you can't respond to that incident until you know what it is. If a person or team isn't in alignment with the organization at large, either for an incident or a major incident, you're going to have confusion and inconsistency in that response process.

There are four things that major incidents have in common. The timing is a surprise, so there's little to no warning. The time of the response matters: you need to respond quickly, because with services degraded or down, or data loss, or breaches, et cetera, the more time that's going on, the more severe it is, and possibly the more money that's being lost. The situation is rarely completely understood at the beginning: you don't normally know why there's an outage. You don't necessarily know, for example, to run a simple setup script or a reset script that might be hanging out somewhere that'll just kick off a process or reboot a cluster, right? You don't know why this situation is happening. And finally, a major incident requires mobilizing a team of the correct responders to resolve, and that is the cross-functional coordination.
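Since the talk suggests tying the major-incident boundary to a metric such as "more than X percent of customers impacted," here is a minimal sketch of what codifying that could look like. The thresholds, labels, and function are hypothetical illustrations, not PagerDuty's actual definitions; the point is only that "is this a major incident?" becomes a lookup rather than a debate.

    # Hypothetical severity thresholds keyed to a single business metric:
    # percent of customers currently impacted. Tune these to your own definitions.
    SEVERITY_THRESHOLDS = [
        (50.0, "SEV-1"),
        (20.0, "SEV-2"),   # the major-incident boundary in this sketch
        (5.0, "SEV-3"),
        (1.0, "SEV-4"),
        (0.0, "SEV-5"),
    ]

    MAJOR_SEVERITIES = {"SEV-1", "SEV-2"}


    def classify(percent_customers_impacted: float) -> tuple[str, bool]:
        """Return (severity, is_major_incident) for a given impact percentage."""
        for threshold, severity in SEVERITY_THRESHOLDS:
            if percent_customers_impacted >= threshold:
                return severity, severity in MAJOR_SEVERITIES
        return "SEV-5", False


    if __name__ == "__main__":
        severity, is_major = classify(22.5)
        print(severity, "- major incident" if is_major else "- regular incident")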
Typically you can determine the severity based on how drastically your metrics are affected. So, for example, as the traffic to your website drops, severity can increase. It might start off as a sev five. That might look a little bit like, let's pick on Amazon for a second: you can add something to a cart, you just can't see the image of it. You can see the item, you can see the cart, you can even check out of the cart; you just can't see the image. So that's a type of outage, but people are still able to accomplish what they're trying to accomplish. If it becomes more severe, maybe a sev four or up to a sev three, now there might be interference with the checkout process, or you can't link to the details: even though you can see the line items, you can't actually see what you've added to compensate for not seeing the picture, et cetera. So as it gets more severe, now you're at that sev three, and maybe responders have been pulled in up to this point. Once we pass this boundary from sev three to sev two, that is what we codify as a major incident. Severity two and severity one (in some cases people use sev zero or P zero; any convention is fine), those very severe incidents are what are called major incidents. That is when you would fire off what's called the major incident response process.

Anyone needs to be able to trigger an incident at any time, and this is because even well-designed monitors can miss things. You want to make sure that people are empowered to trigger incidents, rather than someone knowing something's wrong and just sitting there worrying about retribution, about what would happen if they're wrong, or other consequences, instead of just setting off the incident, triggering it, and having people pulled in who can actually look at it, see if it's a problem, or see if they can just close it. It's less risky to open the incident and close it than to not open it, or to wait for something else to catch it, if it's even catchable. In order for anyone to be able to trigger an incident, you need to set up your tooling to make this possible. So, for example, if you integrate with Slack, anyone in Slack can trigger the incident, and that doesn't require them to have a specific account for a specific tool. If you have PagerDuty in particular, it will look a lot like this: someone just sends off an IC page command and it pages the relevant incident commanders.
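The same "anyone, or anything, can trigger an incident" idea extends to the automation mentioned earlier. Purely as a hedged sketch, assuming a service with a PagerDuty Events API v2 integration (verify against the current API documentation; the routing key, summary, and source below are placeholders), a programmatic trigger could look roughly like this:

    # Minimal sketch: trigger an incident via the PagerDuty Events API v2.
    import requests

    EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
    ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder integration key


    def trigger_incident(summary: str, source: str, severity: str = "critical") -> str:
        """Send a trigger event and return the dedup key assigned to it."""
        event = {
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,    # short, human-readable description
                "source": source,      # what detected the problem
                "severity": severity,  # critical, error, warning, or info
            },
        }
        response = requests.post(EVENTS_API_URL, json=event, timeout=10)
        response.raise_for_status()
        return response.json().get("dedup_key", "")


    if __name__ == "__main__":
        print(trigger_incident("Checkout latency above threshold", "synthetic-monitor-eu-1"))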
Once an incident is triggered, we need to switch our mode of thinking. This mentality shift is the differentiation between normal operations and something being wrong. We need to switch from what's called peacetime to wartime: from day-to-day operations to defending against something that's wrong. Something that would be considered completely unacceptable during normal operations, like deploying code with minimal or no testing, might become more acceptable during a major incident, again depending on a lot of factors that I'm not going to get into right here: the way you operate, your role hierarchy, and the level of risk. All of these things shift when you go from so-called peacetime to wartime. Peacetime and wartime is terminology that got inherited from the fire response, because they are a paramilitary organization, so they might use that wording. We don't need to do that. We can say normal and emergency, okay versus not okay. Really, it doesn't matter how you term them, so long as it's clear and immediate and people don't have to figure out which side they're on. You just want to make sure that you're using terms that the organization is okay with.

Part of what we want to do for that differentiation is tying in those metrics for business impact. Metrics can be very useful, and often work best when they're tied to business impact rather than something very granular. For example, one of the metrics we monitor here at PagerDuty is the number of outbound notifications per second. That makes sense, right? Amazon might monitor orders, Netflix streams: things that are relevant to their business. You want to make sure that you're tracking whether something visible and impacting is happening or not, and how severely, because again, that goes back to: is this an incident or not, and is it a major incident or a regular incident?

The mindset shift to emergency also causes the other shifts to happen as well. But something that can happen is what's called decision paralysis. Responding teams do need to have the ability to make decisions on their own and try to reduce impact and resolve as quickly as possible; again, that goes back to not wanting the incident to escalate over time. But it's also important for all responders to avoid decision paralysis. This is what happens when you spend so much time debating two, maybe more, options that are just similar enough, or not quite different enough, that you can't quite choose one. And all the while, the incident is still ongoing, possibly escalating. Making a wrong decision in this case is better than making no decision. If one option was clearly better than the other, you would know by now. A wrong decision is at least going to give you some information to work with, whereas no decision gets you nowhere.

Now, let's talk a little bit about the roles in the incident response process and the incident categorization. The lifecycle of an incident, or its steps, are triage, mobilize, resolve, and prevent. Triage is what happens when you see the incident, whether it's triggered via Slack, in a page, or caught by one of your monitors. It's what happens when you first see it and you need to figure out how severe it is. It's when you differentiate: am I responding to this? Is this going to be a cross-functional response? Once you get through that decision process, that's when you start to mobilize people: you want to make sure the right subject matter experts are there. Then there's the resolution phase; that's the one everyone likes to focus on, right? Everyone's cranking away on whatever's wrong and trying to get everything working again. And the last step is the prevention stage. Usually this incorporates notes, post mortems, all that good stuff, so we can learn and move forward.

In order for these steps to happen relatively smoothly, especially in a cross-functional response, you want to make sure you have some roles defined, and this slide is going to get a little heavy. Okay, so let's start with the incident commander. The incident commander is basically running the incident. Think of them as the captain of a ship or something like this: they're guiding the ship of the incident response process. They are the highest-ranked person on the incident response call. We actually do recommend that, in the context of the incident, they might outrank the CEO.
Don't catch anyone by surprise with this, though; make sure you get buy-in before you make that decision. But the idea is that they're the single source of truth during an incident and that they're able to respond and make decisions in a responsible and effective way.

Next up we have the scribe. The scribe is taking notes. Depending on your tooling, you might have some timelining built in for what decisions are being made or what actions are being taken. We also need the scribe to be writing down relevant context. You don't want it to be so long that they're writing a novel no one's going to read, but you do want to write down "rebooted cluster in order to..." in simple, quick sentences, so that when you're looking back through the incident you can see why a decision was made and what the impact of that decision was, and keep reading through.

The deputy is the support role for the incident commander. They're not a shadow, so they're not an observational role; they're expected to perform tasks during the incident. Depending on the size of the company and the incident, the deputy might help with other tasks like the liaisons, which I'll get to in a minute, or the scribe, or something like that.

The internal and external liaison, or internal and customer liaison, are the communications roles. During an incident it's not uncommon for people within the company, or outside of it if it's customer facing, to want to know what's happening. Think AWS status and Twitter, right? A region goes down, you check their status page, you check their Twitter. You want to make sure that someone is communicating in whatever mechanisms you're using, be it social media or status pages or whatever, and you want to make sure it's a defined role. Otherwise everyone in the layer underneath, all the subject matter experts, is focused on fixing the issue and not keeping track of time to make sure the last update went out in a timely fashion. So the liaisons do this. You might have one or both; again, it depends on what's happening with the incident.

And then there are all the subject matter experts. These are the people that people normally associate with incident response: the people with the relevant expertise for whatever the latency, outage, breach, et cetera, is. These are the response-level roles.

So, setting this up at scale: when you have a department-wide incident, you want to make sure that you have an on-call schedule with a primary and a backup. The idea here is that if that page goes out and someone doesn't respond within whatever SLA is defined for it, it needs to escalate somewhere, or else it's just going to keep paging and going nowhere. So you need that primary and the backup. One of the things that we do when we have a primary and a secondary on call is inherit from the previous week: last week's primary rolls into this week's secondary, so that they have context for any past incidents that were relatively recent. You also want to make sure that you have a primary and backup for the subject matter experts; same reason, and actually the same rotation, secondary from primary. And then if there are any other roles, scribe, et cetera, as you're defining them, you want to make sure that they're defined.
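To illustrate that "last week's primary becomes this week's secondary" rotation, here is a small hypothetical sketch. The roster and week arithmetic are made up for illustration; in practice your paging tool's schedules and escalation policies would handle this for you.

    from datetime import date

    # Hypothetical on-call roster; in practice this lives in your paging tool.
    ROSTER = ["alice", "bikram", "carmen", "dana"]


    def on_call_pair(for_date: date, roster: list[str] = ROSTER) -> tuple[str, str]:
        """Return (primary, secondary), where the secondary is last week's primary."""
        week = for_date.isocalendar()[1]               # ISO week number
        primary = roster[week % len(roster)]
        secondary = roster[(week - 1) % len(roster)]   # inherited from the previous week
        return primary, secondary


    if __name__ == "__main__":
        # Print this week's pairing for the incident-commander rotation.
        print(on_call_pair(date.today()))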
Again, when you send out that page: let's say I'm responding to what seems to be a database incident and I'm doing my DBA thing, and then I realize it's exceeding the scope of my area of expertise. I don't want to look up who's on front end, who's the SRE, who's this, who's that, and find all these people individually. I want to be able to send out a correct page to the right teams. All these teams need to have those on-call rotations, or you need a method of paging them that's consistent, a single source of truth, and relatively easy, so you don't lose time in the incident.

Responding to the incident: let's talk about the response itself. A typical sequence of events is that something is triggered, either by monitoring or by a human, and that goes to a subject matter expert. Initially they're usually the first to respond and see what the situation is. It might be simple enough for them to solve right away, a very low-level incident, maybe a sev four or five, and they might resolve it, or they might need to escalate it. If they need to escalate it, they send out a page. It pages the command level, which is, again, the incident commander, scribe, and deputy. It also pages out the liaisons and pages out to operations, whatever SME levels need to go out, depending on your setup.

This might seem like a lot, and you might be thinking, "we only have 20 engineers in total, we can't have this huge incident response process." Not a problem: the roles scale down. You might find, for example, that you don't need two liaisons because you only need one, or your comms aren't going out that frequently, or you're going to be relying really heavily on your subject matter experts. But maybe your incident commander and other roles can be rolled up in a way that's relatively effective. I misspoke there a little bit, though, because you really don't want the incident commander rolling up into other roles; that one should remain dedicated. But you might share a role with a scribe. You might have liaisons sharing roles, so maybe a scribe and a liaison are the same person, or maybe, depending on the subject matter expert response, a scribe and an SME are a shared role. Make sure that this is consistent in a way that makes sense. If you notice that a pairing isn't working as you try to scale down, don't keep using it just because; separate them out in a way that makes sense for you.

Let's review all the different roles and responsibilities, starting with the incident commander, just to provide some more context for what everyone's doing. Remember that we want to replace chaos with calm; this isn't the first time and it won't be the last time I say this. The incident commander role ensures the incident response is calm and organized because they're making the calls. Think about it this way: let's say the subject matter expert says, "I want to reboot Cassandra," and the incident commander says okay. If two subject matter experts are giving differing opinions and you're in that decision-paralysis mode, the incident commander breaks the tie. You have someone who's making a decision to make sure action keeps getting taken. They're also a single source of reference. It's one of the most important roles, and as such, even if you don't have the deputy, scribe, and liaisons all broken out as separate roles and you're starting to combine them in ways that make sense, the incident commander is the one you should get first and the one that should be dedicated: the single source of truth during an incident, and the one in charge.

I keep mentioning decision paralysis; it happens a lot. One of the things that we talk about at PagerDuty is having the "are there any strong objections?" model.
The idea is that you want to make sure you don't have people later going, "oh, I knew this wasn't going to work." Maybe you did, maybe you didn't, but whatever everyone knew at the time wasn't strong enough to break the tie between whatever decisions were on the table at that moment. So in order to gain consensus, you ask: are there strong objections? Is there something that outweighs the decision we're proposing as opposed to the other ones on the table? And if not, we're just going to pick one, move forward, and get more information, even if we fail. All of this is about making sure you're making that decision. Again, making a wrong decision in this situation is better than making no decision. Making no decision doesn't help you move forward and you don't learn anything, and the incident is still going on, possibly getting worse. Making a decision, even the wrong one, gives you information. If it turns out to be wrong, you can learn from how it went wrong to make the next decision, choose between the other options, what have you.

The incident commander also makes sure to assign tasks to a specific person. It's one of those things where if everyone owns it, nothing gets done, and it's not malicious; it's just that people focus on what they're actively assigned to. This can happen during the incident and it can happen after the incident. For example, the post mortem: if no one owns the post mortem (to be clear, not writing the whole post mortem by themselves, just collecting the people and getting them to write their sections of it), it's probably not going to get done, because there's no one checking on the progress of it. So the incident commander, within the scope of the incident, assigning the post mortem is assigning a task to a specific person.

I mentioned this before and I'm going to talk about it now: they are the highest authority within the context of the incident. No matter their day-to-day role, an incident commander is always the highest-ranking person on the call. And, as I just mentioned, no matter their role: we actually don't always recommend, and frequently don't recommend, that the incident commander be one of the subject matter experts in the relevant expertise. You don't want the incident commander to shift roles into solving the incident while they're trying to, for example, answer exec questions. If execs come in on the call and say, "hey, can you solve this in ten minutes?", "hey, what's going on?", "hey, I didn't see any comms," or whatever they're asking, the incident commander's job is to shield the responders from all of that, and they can't be solving the incident while they do that. So it's actually not uncommon for organizations to have non-engineering roles as incident commanders.

I'm just going to reinforce this a little bit: we did use to require that all of the incident commanders be engineers. It probably makes sense; you think of engineering with an incident response, so all the roles should presumably be engineers. But that was actually a big mistake, because, again, ICs aren't responders in that way. They're not fixing the problem; they don't need the deep technical knowledge. It also limited our response pool. For an organization that's maybe larger now, that's not so much a problem, but when you're small and you have that thought I mentioned earlier, "we only have 20 engineers," why are you using up your on-call pool for a role that does not require the technical knowledge?
When you can branch it out into other teams, that might make sense, maybe product, so that they have more connection with the response process that's happening. There are other roles that can be brought in from across the organization to do this.

Handoff procedures are encouraged. You want to make sure that people aren't staying on an incident if it keeps going on and on, for hours, 12 hours, right? You need to know at what time frame you need to start handing things off, and this is for all the relevant roles. You need a handoff procedure, and you need to know who you're handing off to.

When you're going through this, you're asking for status repeatedly, deciding on actions to gain consensus, assigning tasks, and then following up to make sure they're completed. Within the context of an incident, that means if I say I'm going to reboot a cluster and do some queries or whatever I'm going to do, it's like, "great, how long do you think that'll take you?" "10 to 15 minutes, tops." The commander will check in in 15 minutes to see what the status is, whether something changed in that time, whether I need to be doing something else, and just keep things updated. Within the lifecycle of the incident, it looks a little more like this: this stabilizes the resolution phase, which is why it has that little separate circle there. You're just iterating, iterating until it's complete.

Some tips for new incident commanders. Introduce yourself on the call with your name and say you're the incident commander: "my name is Quintessence and I'm the incident commander." Avoid acronyms, because you don't know what different teams are doing on the call. An easy example: let's say I'm not an engineer, and as an incident commander I start using "PR." I mean press release, and engineering means pull request, right? Avoid acronyms; you're running a cross-functional response, and it is very common to have collisions in that space. Speak slowly and with purpose on the call. Kick people off if they're being disruptive: you can change roles, you can escalate to different roles, you can ask people who are supposed to be shadowing but aren't shadowing to leave, that sort of thing. Time-box tasks. You may recall earlier when I said, "oh, I'm going to be rebooting the cluster and running a query and it'll take me ten to fifteen minutes." That's a time box. If I'm the incident commander, I know to check in in ten to fifteen minutes. If I don't know roughly how long something will take, I don't know how often to check in, and I'll be interrupting what that person is doing. You also need to be able to explicitly declare when the response has ended, because once that has happened you don't want people to say, "oh, what's this over here? Is this still connected to this incident? Is this something new?" When the response has ended, that incident is ended; if something else is popping up, it might require a different response process. In summary, with an incident commander everyone is focused, decision making keeps moving, you help avoid the bystander effect, and you're moving forward towards a resolution.

Now, all of that was just the incident commander. The rest of the roles aren't going to take as long to explain, so let's go through them: we have the scribe, the deputy, and then the liaisons and the SMEs. As mentioned before, the deputy keeps the incident commander focused and takes on any additional tasks. This is for the scaling problem: if there's too much going on for the IC to do by themselves, if there are too many execs coming in to ask for status updates, whatever, all of that goes to the deputy.
They also serve to follow up on reminders to ensure tasks aren't missed. So in that time-boxing example, if I as the incident commander am not following up with the DBA in ten to fifteen minutes, the deputy says, "hey, it's been 18 minutes." Or they act as a hot standby, per the previous comments about being able to have handoffs.

The importance of the scribe: they're documenting the timeline with context, and the incident log is used to augment what's written in the postmortem process. They note when important actions are taken, follow-up items, any status updates, et cetera. And similar to the deputy and the IC, anyone can be a scribe. You do not need the technical knowledge or the resolution knowledge to write down what's happening. So again, don't box yourself into just engineering for all of these roles.

Similar for the liaison roles. You can have internal and external liaisons, both as a single role or as separate roles; it depends on what's happening and what you need. They're notifying the relevant parties of what's happening. We usually recommend this happen at about a 30-minute cadence. If it's too frequent, people aren't taking it in and they're just getting overwhelmed with information. And if it's too infrequent, people start popping in because they don't have information. I pick on execs a lot because they're the ones that are going to be getting calls if this is a massive incident. It is common to have this role be a member of support.

Now, some pitfalls. Executive swoop is the one I mention the most; we sometimes nickname it "executive swoop and poop." The idea is that people are swooping in because they don't have enough information, and as a result they're trying to get information out of you while you're solving the incident. Or sometimes they try to be helpful: "can we resolve this in some number of minutes, please?" We all wish we could, right? Again, this isn't meant to be malicious. People just want to be encouraging, or to ask "have you thought about this?", or they want more information so they can talk to people. Maybe they have a dependency: "can I at least know who's impacted so I can start making phone calls?" That's all stuff you want to buffer in the incident response process; you want those command and communication roles to buffer the responder roles. There's also a certain amount of "do what I say": "oh, did you reboot the cluster? Oh, did you do this?" Maybe, but we on the incident call have been having a conversation for however long, and the person who just joined has none of that history yet. One of the things you can do in this case is ask people if they wish to take command, because if they're an incident commander, trained in that role, and they take command, sure, then it's not your problem. And if they say no, just say, "great, in that case I need you to..." and then give them some direction: I need you to wait for the next status update, I need you to wait until we send out the information, or whatever. So make sure you always ask this question if someone's being a little bit persistent, because again, it's not normally malicious; they're as anxious as everybody else.

Another common pitfall is the failure to notify. That's why the liaison role is separated out. If you don't notify people, they're going to ping you in Slack, on the call, or wherever, to get information.
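The talk doesn't prescribe any tooling for this, but as a purely hypothetical sketch of the deputy's "it's been 18 minutes" nudge and the liaison's roughly 30-minute update cadence, a simple reminder helper might look like the following; the messages and intervals are made up, and in real use it would page or post to chat rather than print.

    import threading
    import time


    def remind(delay_seconds: float, message: str) -> threading.Timer:
        """Schedule a one-shot reminder; swap print() for a page or chat post in real use."""
        timer = threading.Timer(delay_seconds, lambda: print(f"REMINDER: {message}"))
        timer.start()
        return timer


    if __name__ == "__main__":
        # Real intervals would be minutes (e.g. 15 * 60); shortened so the demo finishes quickly.
        remind(2, "Check in with the DBA on the cluster reboot (time box: 15 minutes)")
        remind(4, "Send the next stakeholder / status page update (30 minute cadence)")
        time.sleep(5)  # keep the script alive long enough for both reminders to fire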
On the other side, if you ping them too often, it's usually not necessary: there isn't enough new information, or the status updates are too similar to warrant the frequency.

Red herrings are things that happen in incidents that are just misleading. That isn't really the incident response process breaking down; it's the resolution process breaking down, where you think you need to do something and it ends up being a dependency, or something that looks like it's happening over here is actually happening over there. I'm not going to go through all the anti-patterns here; suffice it to say there are lots of things that happen during an incident that can bog down the response. Debating the severity is something we covered a lot earlier: if you spend your time saying "is this an incident? Is this not an incident? Is it a four or a five? Is it a two? Is it a three?" you're going to lose time. If you don't have the process defined in advance, you're going to lose time. If you don't send out policy changes or have them documented in a single place, people are going to lose time trying to find them, and so on. What I do want to call out, though, is assuming silence means no progress. If you think about it: heads down, or reading a book, even decoupling from the incident, right? Silence doesn't necessarily mean stopping. Silence just means "I haven't been able to send you anything yet."

How do I prepare to manage an incident response team? Step one: ensure explicit processes and expectations exist; see this entire presentation. Step two: practice running a major incident as a team. This is kind of like Failure Fridays, which is something we run at PagerDuty to practice chaos experiments. Run your major incident response as part of it, why not? If you have rare incidents, that's awesome, but if you don't practice what you're going to do when they do happen, you're not going to be as agile. It'll also give you the opportunity to tune your process and know what works. You don't want to find out in the first real incident with your new process that it doesn't work for you. Something that can help with this is checklists; that way you know what you've done, what you haven't done, and what you need to change next time. It can look a little bit like this. Again, I'm not going to go through each one of these, but please feel free to screenshot or look at the slides; I'll have a link at the end.

Something else I want to call out: don't neglect the post mortem. It's really easy to forget, because the incident is done, we go back to regularly scheduled activities or we do the action items that we took away from it, and writing everything down, just like docs, kind of gets lost. But if you don't document things in the post mortem while they're fresh, you're going to lose the learning opportunity. As an overview, you want to make sure you're covering the high level of the impact. This is just a quick "our shopping cart service was down and customers could not purchase new items," something like that; it's not meant to be very in-depth at that level. Then, in the "what happened" sections, you have detailed descriptions of what happened and what response efforts took place; this is where you can really make use of those scribe notes. Then you review the response process itself, what went well and what didn't, and write down any action items.
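To make that structure concrete, here is a hypothetical sketch that renders an empty postmortem skeleton from the sections just described. The section names are only an illustration; the postmortem ops guide linked at the end of the talk is the fuller reference.

    # Hypothetical postmortem skeleton built from the sections mentioned in the talk.
    SECTIONS = [
        ("Impact (high level)", "e.g. 'Shopping cart service was down; customers could not purchase items.'"),
        ("What happened", "Detailed timeline and response efforts, pulled from the scribe's notes."),
        ("What went well", ""),
        ("What didn't go well", ""),
        ("Action items", "Each with a single named owner and a due date."),
    ]


    def postmortem_skeleton(incident_title: str) -> str:
        """Return an empty postmortem document for responders to fill in while it's fresh."""
        lines = [f"Postmortem: {incident_title}", ""]
        for heading, hint in SECTIONS:
            lines.append(heading)
            if hint:
                lines.append(f"  ({hint})")
            lines.append("")  # blank line between sections
        return "\n".join(lines)


    if __name__ == "__main__":
        print(postmortem_skeleton("Checkout latency incident"))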
As a slide to screenshot, here are some things that you can include in a more detailed post mortem. I'll actually have a link to a post mortem ops guide that we wrote up here at PagerDuty at the end, so hold on to your seats.

As a quick summary: use the incident command system for managing incidents. An incident commander takes charge during wartime scenarios. Set expectations upward; that specifically refers to the CEO comment about the incident commander being the top authority, but it applies to basically everything else too. You want to communicate outward, work with your team to set explicit expectations and process, and practice. And don't forget to review all of the links, including the post mortem one I just mentioned; they're going to be on this notes page, so go ahead and go there. I have four guides there about response, post mortems, and the meetings associated with them. The topmost link, in case you want to see it right now, is the response ops guide. That ops guide is an expansion of this presentation, so if you want to read more or want to recap everything I've mentioned here, it's on this slide. And with that, I'll go ahead and go to Q&A. I'll be available in the Discord, and again, all of the links I've mentioned in this presentation are going to be on that notes page. Thank you and have a great conference.
...

Quintessence Anx

Developer Advocate @ PagerDuty



