Conf42 Chaos Engineering 2021 - Online

Postmortems, Continuous Learning and Enabling Blameless Culture

Abstract

So you’ve had an incident. Restoring service is just the first step—your team should also be prepared to learn from incidents and outages. In this talk you will learn some best practices around postmortems/post incident reviews to help your team and organization see incidents as a learning opportunity and not just a disruption in service. In this workshop, attendees will:

  • Get an overview of blameless postmortems
  • Learn techniques for effective information sharing
  • Learn how to run a postmortem meeting effectively
  • Understand the difference between “blame” and “accountability”

Summary

  • Julie Gunderson, DevOps advocate at PagerDuty, talks about postmortems, continuous learning and enabling a blameless culture. Postmortems provide a peacetime opportunity to reflect once an issue is no longer impacting users. Delaying the postmortem delays key learnings that will prevent the incident from reoccurring.
  • For the postmortem process to result in learning and system improvements, the new view of human error must be followed. The tendency to blame is hardwired through millions of years of evolutionary neurobiology. Another pervasive bias is the tendency to favor information that reinforces existing beliefs.
  • Ask what and how questions rather than who or why. Asking what questions grounds the analysis in the big picture of the contributing factors to the incident. When inquiring about human action, abstract it to a nonspecific responder.
  • The inclination to blame individuals when faced with a surprising failure is ingrained deeply in all of us. Punishing individuals for causing incidents discourages people from speaking up when problems occur. Organizations can rapidly improve the resilience of their systems by eliminating the fear of blame. Culture change doesn't happen overnight.
  • For action items to get done, they have to have clear owners. Engineering leadership helps clarify what parts of the system each team owns. We've also seen improved accountability for completing action items by involving the responsible leaders. The most important outcome of a postmortem meeting is to gain buy-in for the action plan.
  • You want to document the facts of what happened during the incident. Avoid evaluating what should or should not have been done and coming to conclusions about what caused the incident. The goal in analyzing the incident is not to identify a root cause but to understand the multiple factors that created an environment where failure became possible.
  • The external postmortem is a summarized and sanitized version of the information used for the internal postmortem. After you've completed the written postmortem, follow up with a meeting to discuss the incident. The purpose of this meeting is to deepen the postmortem analysis through direct communication.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm Julie Gunderson, DevOps advocate here at PagerDuty, and I'm here to talk to you today about postmortems, continuous learning and enabling a blameless culture. So when we talk about postmortems, a postmortem is a process, and it's intended to help you learn from past incidents. It typically involves a blame-free analysis and discussion soon after an event has taken place, and you produce an artifact. That artifact includes a detailed description of exactly what went wrong in order to cause the incident. It also includes a list of steps to take in order to prevent a similar incident from occurring again in the future. You also want to include an analysis of how the incident response process itself worked. The value of having a postmortem comes from helping institutionalize that culture of continuous improvement. The reason that we do postmortems is because during incident response, the team is 100% focused on restoring service. They cannot and they should not be wasting time and mental energy thinking about how to do something optimally or performing a deep dive on what caused the incident. That's why postmortems are essential. They provide that peacetime opportunity to reflect once an issue is no longer impacting users, and organizations tend to refer to the postmortem process in slightly different ways: after-action reviews, post-incident reviews, learning reviews, incident reviews, incident reports. It really doesn't matter what you call it, except for this last one, root cause analysis, because we will talk about why words matter. It just matters that you do this. The postmortem process drives that culture of learning. Without a postmortem, you fail to recognize what you're doing right or where you could improve, and most importantly, how to avoid the same mistakes in the future. Writing an effective postmortem allows you to learn quickly and effectively from mistakes and improve your systems and processes. And a well designed, blameless postmortem allows teams to continuously learn, serving as a way to iteratively improve your infrastructure and incident response process. So you want to be sure to write detailed and accurate postmortems in order to get the most benefit out of them. Now, there are certain times that you should do a postmortem. You should do a postmortem after every major incident. This actually includes any time the incident response process is triggered, even if it's later discovered that the severity was actually lower, or it was a false alarm, or it quickly recovered without intervention. A postmortem shouldn't be neglected in those cases, because it's still an opportunity to review what did and did not work well during the incident response process. If the incident shouldn't have triggered the incident response process, it's worthwhile to understand why it did, so monitoring can be tuned to avoid unnecessarily triggering the incident response process in the future, and doing this analysis and follow-up action will help prevent that alert fatigue going forward. And just as restoring service during a major incident becomes top priority when it occurs, completing the postmortem is prioritized over planned work. Completing the postmortem is the final step of your incident response process. Delaying the postmortem delays key learnings that will prevent the incident from reoccurring. So PagerDuty's internal policy for completing postmortems is three business days for a sev one and five business days for a sev two. 
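To make that kind of SLA concrete, here is a minimal, hypothetical Python sketch of how you might compute a postmortem due date from a business-day policy like the one described above. This is not PagerDuty tooling; the function name and policy table are invented for illustration, and it simply skips weekends rather than handling holidays.

# Hypothetical sketch: compute a postmortem due date from the example policy of
# 3 business days for a sev one and 5 business days for a sev two.
from datetime import date, timedelta

BUSINESS_DAYS_BY_SEVERITY = {"sev1": 3, "sev2": 5}  # example policy from the talk

def postmortem_due_date(incident_resolved: date, severity: str) -> date:
    """Walk forward one calendar day at a time, counting only weekdays."""
    remaining = BUSINESS_DAYS_BY_SEVERITY[severity]
    due = incident_resolved
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return due

# Example: a sev one resolved on a Friday is due the following Wednesday.
print(postmortem_due_date(date(2021, 3, 5), "sev1"))  # 2021-03-10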
And because scheduling a time when everyone is available can be difficult, the expectation is actually that people will adjust their calendars to attend the postmortem meeting within that time frame. At the end of every major incident call, or very shortly after, the incident commander, that's what we call the command role here at PagerDuty, selects and directly notifies one responder to own completing the postmortem. And note that the postmortem owner is not solely responsible for completing the postmortem themselves. Writing a postmortem is a collaborative effort and it should include everyone involved in the incident response process. While engineering will lead the analysis, the postmortem process should involve management and customer support and the business communications teams. The postmortem owner coordinates with everyone who needs to be involved to ensure the postmortem is completed in a timely manner. And it's really important to designate a specific owner to avoid what we call the bystander effect. If you ask all responders or a team to own completing the postmortem, you risk everyone assuming that someone else is doing it, and therefore no one does. So it's really important to designate an owner for the postmortem process, and some of the criteria can be someone who took a leadership role during the incident or performed a task that led to stabilizing the service. Maybe they were the primary on call, or maybe they manually triggered the incident to initiate that incident response process. It is very important to note, though, that postmortems are not a punishment and the owner of the postmortem is not the person that caused the incident. Effective postmortems are blameless. In complex systems, there's never a single cause but a combination of factors that lead to failure, and the owner is simply an accountable individual who performs select administrative tasks, follows up for information, and drives that postmortem to completion. So writing the postmortem will ultimately be a collaborative effort, but selecting a single owner to orchestrate this collaboration is what ensures that it gets done. So let's talk a little bit about blame. As IT professionals, we understand that failure is inevitable in complex systems, and how we respond to failure when it occurs is what matters. So in The Field Guide to Understanding Human Error, Sidney Dekker describes two views on human error. There's the old view, which asserts that people's mistakes cause failure, and then the new view, which treats human error as a symptom of a systemic problem. The old view subscribes to that bad apple theory, which believes that by removing the bad actors, you're going to prevent failure. And this view attaches an individual's character to their actions, assuming negligence or bad intent is what led to the error. An organization that follows this old view of human error may respond to an incident by finding the careless individual who caused the incident so that they can be reprimanded. And in that case, engineers will hesitate to speak up when incidents occur for fear of being blamed. And that silence can increase the overall mean time to acknowledge and mean time to resolve, and it really exacerbates the impact of incidents. For the postmortem process to result in learning and system improvements, the new view of human error must be followed. In complex systems of software development, a variety of conditions interact to lead to failure. 
And the goal of the postmortem is to understand what systemic factors led to the incident and to identify actions that can prevent this kind of failure from reoccurring. So a blameless postmortem stays focused on how a mistake was made instead of who made the mistake. And this is a crucial mindset, leveraged by many leading organizations, for ensuring postmortems have the right tone. And it empowers engineers to give truly objective accounts of what happened, because you're eliminating that fear of punishment. But blamelessness is hard. It's really easy to agree that we want a culture of continuous improvement, but it's difficult to practice the blamelessness required for learning. The unexpected nature of failure naturally leads humans to react in ways that interfere with our understanding of it. When processing information, the human mind unconsciously takes shortcuts, and by applying general rules of thumb, the mind optimizes for timeliness over accuracy. And when this happens and produces an incorrect conclusion, it's called a cognitive bias. So J. Paul Reed argues that the blameless postmortem is actually a myth, because the tendency to blame is hardwired through millions of years of evolutionary neurobiology, and ignoring this tendency or trying to eliminate it entirely is impossible, so it's more productive to be blame-aware. I'll touch on this and some of the biases next, but for more details, read Lindsay Holmwood's article on cognitive biases that we must be aware of when we're performing postmortems. So one of the errors that we see is the fundamental attribution error, and it's the tendency to believe that what people do reflects their character rather than their circumstances. And this goes back to that old view of human error, assigning responsibility for a failure to bad actors who are careless and clearly incompetent. Ironically, we tend to explain our own actions by our context, not our personality. So you can combat that tendency to blame by intentionally focusing the analysis on the situational causes rather than the discrete actions that individuals took. Another pervasive cognitive bias is confirmation bias, which is the tendency to favor information that reinforces existing beliefs. So when presented with ambiguous information, we tend to interpret it in a way that supports our existing assumptions. And when combined with that old view of human error, this bias is dangerous for postmortems because it seeks to blame that bad apple. And so when approaching the analysis with the assumption that an individual is at fault, you're going to find a way to support that belief despite evidence to the contrary. So, to combat confirmation bias, Lindsay Holmwood suggests appointing somebody to play devil's advocate, to take a contrarian viewpoint during an investigation. Be careful, though, when doing this, because you want to be cautious of introducing negativity or combativeness with the devil's advocate. You can also counter confirmation bias by inviting someone from another team to ask any and all questions that come to their mind, because this will help surface lines of inquiry that the team may have learned to take for granted. Now, hindsight bias is a type of memory distortion where we recall events to form a judgment. So knowing the outcome, it's easy to see the event as being predictable, despite there having been little or no objective basis for predicting it. 
Often we recall events in a way that makes us look better. An example is when a person is analyzing the causes of an incident and they believe they knew it was going to happen like that. Enacting this bias can lead to defensiveness and a division within the team. So Holmwood actually suggests avoiding the hindsight bias by explaining events in terms of foresight instead. So start your timeline analysis at a point before the incident and work your way forward instead of backward from resolution. And then there's negativity bias. And this is the notion that things of a more negative nature have a greater effect on one's mental state than those of a neutral or even positive nature. Research on social judgments has shown that negative information disproportionately impacts a person's impression of others. And this relates back to the bad apple theory, the belief that there are negative actors in your organization to blame for failures. Studies have also shown that people are more likely to attribute negative outcomes to the intentions of another person than neutral or positive outcomes. And this also explains our tendency to blame individuals' characters to explain a major incident. In reality, things go right more often than they go wrong, but we tend to focus on and magnify the importance of negative events. Exaggerating and internalizing incidents as negative events can be demoralizing and it can lead to burnout. Reframing incidents as learning opportunities and remembering to describe what was handled well in the response process can help to balance the perspective. And so these biases can damage team relationships if they go unchecked. It's important to be aware of these tendencies so we can acknowledge bias when it occurs. By making postmortems a collaborative process, teams can work as a group to identify blame and then constantly dig deeper in the analysis. So acknowledging blame and working past it is easier said than done. Think about what behaviors we can adopt to move towards a blameless culture, and I'm going to share a few of those with you. So ask what and how questions rather than who or why. Going back to Lindsay Holmwood, what questions are like, what did you think was happening? Or what did you do next? Asking what questions grounds the analysis in the big picture of the contributing factors to the incident. In his article The Infinite Hows, John Allspaw encourages us to ask how questions because they get people to describe at least some of the conditions that allowed an event to take place. Holmwood also notes that how questions can help clarify technical details, distancing people from the actions they took. So avoid asking why questions, because it forces people to justify their actions. It really does attribute blame, like, well, why did you just do that? Or why did you take that action? So think about how you can reframe those in the what and how perspective, and then consider multiple and diverse perspectives. This is actually a great time to bring in maybe an intern. They tend to think about things in ways that people who have been practicing this for 20 years may not. So consider the different perspectives and ask why a reasonable, rational, and decent person may have taken a particular action. When analyzing failure, we may fall into the victim, villain, and helpless stories that propel those emotions and attempt to justify our worst behaviors. 
And you can move beyond blame by telling the rest of the story. Ask yourself, why might somebody have done this? Because you want to put yourself in that person's shoes. This thinking will help turn attention to the multiple systemic factors that led to the incident versus the who. Also, when inquiring about human action, abstract it to a nonspecific responder. I mean, anybody could have made that same mistake. We've all made mistakes. These are not intentional. It's not that bad apple theory. And whether you're introducing postmortems as an entirely new practice at your organization or working to improve an existing process, culture change is hard. And change doesn't have to be driven by management. Oftentimes bottom-up changes are more successful than top-down mandates. Anyway, no matter your role, the first step to introducing a new process is to get buy-in from leadership and individual contributors. To practice blameless postmortems and encourage that culture of continuous improvement, you do need commitment from leadership that no individuals will be reprimanded in any way after an incident. And it can be difficult to get this buy-in when management holds that old view of human error, believing that bad actors cause failures and that they'll never have failures if they just remove those bad actors. So to convince management to support a shift to blameless analysis, clarify how blame is harmful to the business and explain the business value of blamelessness. Punishing individuals for causing incidents discourages people from speaking up when problems occur, because they're afraid of being blamed. And as I talked about earlier, this silence just increases that mean time to acknowledge and mean time to resolve. You want people to speak up when problems occur. And organizations can rapidly improve the resilience of their systems and increase the speed of innovation by eliminating the fear of blame and encouraging collaborative learning. The inclination to blame individuals when faced with a surprising failure is ingrained deeply in all of us. And it may sound silly, but when selling a new blameless postmortem process to management, do try to avoid blaming them for blaming others. Acknowledge that the practice of blamelessness is difficult for everyone, and teams can help hold each other accountable by calling each other out when blame is observed in response to failure. Ask leadership if they would be receptive to receiving that feedback if and when they accidentally suggest blame after an incident. A verbal commitment from management to refrain from punishing people for causing incidents is important to start introducing blameless postmortems, but that alone isn't going to eliminate the fear of blame. Once you have leadership support, you also need buy-in from the individual contributors who will be performing that postmortem analysis. Share that you have commitment from management that no one will be punished after an incident, because the tendency to blame, it turns out, is not unique to managers and leadership. Explain to the team why blame is harmful to trust and collaboration, agree to work together to become more and more blame-aware, and kindly call each other out when blame is being observed. People need to feel safe talking about failure before they're willing to speak up about incidents. 
And when Google studied their teams to learn what behaviors made groups successful, they found that psychological safety was the most critical factor for a team to work well together. And Harvard Business School professor Amy Edmondson defines psychological safety as a sense of confidence that a team will not embarrass, reject, or punish someone for speaking up. And this also describes what we're trying to achieve with blameless postmortems. The team does not fear punishment for speaking up about incidents. A sense of safety makes people feel comfortable enough to share information about incidents, which allows for deeper analysis and then results in learning that improves the resilience of the systems. And the DORA State of DevOps report actually showed that psychological safety is a key driver of high performing software delivery teams. So the number one thing that you can do for your teams is to build that culture of psychological safety. That can be your gateway to culture change. So how do you do this? Again, culture change doesn't happen overnight. You want to start small, iteratively introduce new practices to the organization, and share the successful results of experimenting with those practices. And then you slowly expand those practices across teams. Similar to how we talk about blast radius with chaos engineering, you can start experimenting with blameless postmortems with just a single team. And to get started, you can actually use our how to write a postmortem guide at postmortems.pagerduty.com to see some tips. It's also easy to start practicing blameless postmortems by analyzing smaller incidents, maybe before tackling major ones. Because the business impact of that incident is lower, there's less pressure to scapegoat an individual as the cause of an incident. If someone does fall back on blaming an individual, there are also lower repercussions for causing a minor incident. Simply put, the stakes are lower when analyzing a minor incident. So doing postmortems for smaller incidents allows the team to develop that skill of deeper system analysis that goes beyond how people contributed to the incident. And remember that this is a skill, and skills require practice. Also, sharing the results of postmortems is very important. It has a couple of benefits. It increases the system knowledge across the organization, and it also reinforces a blameless culture. When postmortems are shared, teams will see that individuals are not blamed or punished for incidents, and this reduces that fear of speaking up when incidents inevitably occur. So creating a culture where information can be confidently shared leads to that culture of continuous learning in which teams can work together to continuously improve. We also encourage teams to learn postmortem best practices from each other by hosting a community of experienced postmortem writers. So we have a community of experienced postmortem writers who are available to review postmortems before they're shared more widely, and this ensures that the blameless analysis is there, through feedback and coaching, while those postmortems are being written. And you can scale culture through sharing. So the State of DevOps report told us that operationally mature organizations adopt practices that promote sharing. People want to share their successes, and when people see something that's going well, they want to replicate that success. 
And it may seem counterintuitive to share incident reports, because it seems like you're sharing a story of failure rather than success. But the truth is, practicing blameless postmortems leads to success because it enables teams to learn from failure and to improve systems to reduce the prevalence of failure. So framing incidents as learning opportunities with concrete results and improvements, rather than a personal failure, also increases morale, which increases employee retention and productivity. One thing we do at PagerDuty is we actually schedule all postmortem meetings on a shared calendar, and the calendar is visible to the entire company and anyone is welcome to join. This gives engineering teams the opportunity to learn from each other on how to practice blamelessness and deeply analyze incident causes. It also makes clear that incidents aren't shameful failures that should be kept quiet. By sharing learnings from the incident analysis, you help the entire organization learn, not just the affected teams responsible for the remediation. So PagerDuty also sends completed postmortems via email to an incident reports distribution list that includes all of engineering and product and support, as well as our incident commanders, who may or may not be in one of those departments. And this widens system knowledge for everyone involved in the incident response process. Information sharing and transparency also support an environment that cultivates accountability. A common challenge to effective postmortems is that after analyzing the incident and creating action items to prevent reoccurrence, that follow-up work never gets done. So I'm just going to talk about some quick tips for success when you're starting this off. For action items to get done, they have to have clear owners. So because we're an agile and DevOps shop, the cross-functional teams responsible for the affected services are also responsible for implementing improvements expected to reduce the likelihood of failure. Engineering leadership helps clarify what parts of the system each team owns and sets expectations for which team owns new development and operational improvements. Ownership designations are communicated across the organization so all teams understand who owns what and where ownership gaps can be identified, and we document this information for future reference and new hires. Any uncertainty about ownership of an incident's action items is discussed in the postmortem meeting with representatives for all teams that may own the action item. So start by setting a policy for when postmortem action items should be completed. At PagerDuty, our VP of engineering has set the expectation that high priority action items needed to prevent a sev one from reoccurring should be completed within 15 days of the incident, and action items from a sev two should be addressed within 30 days. It's really important to communicate this expectation to all of engineering and make sure it's documented for future reference. We've also seen improved accountability for completing action items by involving the responsible leaders, product managers and engineering managers. Those leaders need to be involved in the postmortem meeting because they're prioritizing that team's work. I mean, product managers are responsible for defining a good customer experience. 
Incidents cause a poor customer experience, so engage product managers in postmortem discussions by explaining that it will provide a wider picture of threats to that customer experience and ideas on how to improve that experience. Doing so gives engineering a chance to explain the importance of these action items so that the product managers will prioritize the work accordingly. And similarly, getting engineering leadership more involved in the postmortem discussion gives them a better understanding of system weaknesses to inform how and where they should invest technical resources. So sharing this context with the leaders that prioritize the work allows them to support the team's effort to quickly complete high priority action items that result from the incident analysis. And really, the most important outcome of a postmortem meeting is to gain that buy-in for the action plan. So this is an opportunity to discuss proposed action items, brainstorm other options, and gain that consensus among the team and leadership. Sometimes the ROI of a proposed action item isn't great enough to justify the work, or it might need to be delayed for other priorities. The postmortem meeting is a time to discuss these difficult decisions and make clear what work will and will not be done, and the expected implications of those choices. And whereas the written postmortem is intended to be shared widely in the organization, the primary audience for the postmortem meeting is the teams directly involved with the incident. This meeting gives the team a chance to align on what happened and what to do about it, and how they'll communicate about the incident to internal and external stakeholders. So participants in this meeting might be the incident commander, maybe any shadowees, service owners and key engineers involved in the incident, engineering managers for impacted systems, product managers, and any customer liaisons, internal or external. And if not already done by the incident commander, one of the first steps is to create a new and empty postmortem for the incident. Go through the history in Slack, so we use Slack, or whatever tool you're using, to identify the responders and add them to the page so that they can help populate that postmortem. Include the roles of incident commander and deputy and scribe as well. And if you're confused about those roles, check out response.pagerduty.com, where we break down how the incident response process works at PagerDuty and for some other organizations. You also want to add a link to the incident call recording, invite stakeholders from related or impacted teams, and then schedule that postmortem meeting for 30 minutes to an hour, depending on whatever the complexity of that incident was. And scheduling the meeting at the beginning of the process helps make sure that the postmortem is completed within the SLA. And so again, these are some of the people that should attend: service owners, key engineers, engineering managers, product managers, customer liaisons, an incident commander or a facilitator, and any of the other roles that you may have had during that incident response call. And then you want to begin by focusing on the timeline. You want to document the facts of what happened during the incident. Avoid evaluating what should or should not have been done and coming to conclusions about what caused the incident. 
Present only the facts here, and that will help avoid blame and support that deeper analysis. And note that the incident may have actually started before responders became aware of it and began that response effort. So the timeline includes important changes in status and impact, and key actions taken by responders. It helps to avoid hindsight bias if you start your timeline at a point before the incident happened and work your way forward instead of working your way backwards from resolution. Also review the incident log and whatever chat tool you're using, hopefully it's not email, to find key decisions that were made and actions taken during the response effort, and include information the team didn't know during the incident that, in hindsight, you wish you had known. You can find this additional information by looking at monitoring and logs and deployments related to affected services. You'll take a deeper look at monitoring during the analysis step, but start here by adding key events related to the incident, like deploys or customer tickets filed. Maybe it was a hypothesis being tested during a chaos engineering experiment. And include changes to the incident status and the impact in that timeline. And for each item in the timeline, identify a metric or some third party page where the data came from, because this helps illustrate each point clearly and it ensures that you remain rooted in fact rather than opinions. So this could be a link to a monitoring graph or a log search. Maybe it was a bunch of tweets, anything that shows the data point that you're trying to illustrate in the timeline. And so just a few tips for the timeline: stick to the facts and include changes to the incident status. And then we are going to go to documenting the impact. Impact should be described from a few perspectives. How long was the impact visible? In other words, what was the length of time users or customers were affected? And note that the length of impact may differ from the length of the response effort. Impact may have started some time before it was detected and the incident response began. How many customers were affected? Support may need a list of all affected customers so they can reach out individually, and then maybe how many customers wrote or called support about the incident. What functionality was affected, and how severely? You want to quantify impact with a business metric specific to the product. So for PagerDuty this includes things like delayed event ingestion and processing or slow notification delivery. And now that you have an understanding of what happened during the incident, look further back in time to find the contributing factors that led to the incident. Technology is a complex system with a network of relationships, from organizational to human to technical, that is continuously changing. In his paper How Complex Systems Fail, Dr. Richard Cook says that because complex systems are heavily defended against failure, it is a unique combination of apparently innocuous failures that join to create a catastrophic failure. Furthermore, because overt failures require multiple faults, attributing a root cause is fundamentally wrong. There's no single root cause of a major failure in a complex system, but a combination of contributing factors that together led to that failure. And your goal in analyzing the incident is not to identify a root cause, but to understand the multiple factors that created an environment where failure became possible. 
So Cook also says that the effort to find a root cause does not reflect an understanding of the system, but rather the cultural need to blame specific, localized forces for events. Blamelessness is essential for an effective postmortem. An individual's actions should never be considered a root cause. Effective analysis goes deeper than human action. In the cases where someone's mistake did contribute to a failure, it's worth anonymizing this in your analysis to avoid attaching blame to an individual. Assume any team member could have made the same mistake. According to Cook, all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. Now, you want to start your analysis by looking at your monitoring for the affected services. Search for irregularities, like sudden spikes or flatlining, when the incident began and leading up to the incident. Include any commands or queries you used to look up that data, and the graph images or links from your monitoring tool, alongside this analysis so others can see how that data was gathered. This level of analysis is going to uncover the superficial causes of the incident. Next, ask why was the system designed in a way to make this possible? Why did those design decisions seem to be the best decisions at the time? And answering these questions will help you to uncover those contributing factors. And some helpful questions are: is this an isolated incident or part of a trend? Was it a specific bug, a failure, something that we anticipated? Or did it uncover a class of issue we weren't even aware of? Was there work that some team didn't do in the past that contributed to this incident? Were there any similar related incidents in the past? Does this incident demonstrate a larger trend? And will this class of issue get worse as we continue to grow and scale and use the service? Now, it may not be possible, or worth the effort, to completely eliminate the possibility of the same or a similar incident occurring again. So also consider: how can you improve detection and mitigation of future incidents? Do you need better monitoring and alerting around this class of problems so that you can respond faster in the future? If this class of incident does happen again, how can you decrease the severity or the duration? Remember to identify any actions that can make your incident response process better too. Go through the incident history to find any to-do action items raised during the incident and make sure that those are documented as tickets as well. At this phase, you are only opening tickets. There's no expectation that tasks will be completed before the postmortem meeting. And so Google writes that postmortem action items, to ensure that they're completed, should be actionable, specific, and bounded. For actionable, you should phrase each item as a sentence starting with a verb, and the action should result in a useful outcome. For specific, you want to define each action item's scope as narrowly as possible, making it clear what is and is not in scope. And for bounded, you want to word each action item to indicate how to tell when it's finished, as opposed to open-ended or ongoing tasks. And so here is an idea for better wording: think about the poorly worded "investigate monitoring for this scenario" versus an actionably worded item, which is "add alerting for all cases where this service returns greater than 1% errors." 
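As a purely illustrative aside, here is a minimal Python sketch of the kind of irregularity scan described above, flagging sudden spikes and flatlined stretches in a metric series pulled from a monitoring tool. The function name, thresholds, and sample data are all invented for illustration; they are not part of PagerDuty's process or any particular monitoring product.

# Hypothetical sketch: flag spikes (samples far from the mean) and flatlines
# (the metric stops changing) in a list of metric samples.
from statistics import mean, pstdev

def find_irregularities(samples, spike_sigma=2.5, flatline_len=10):
    """Return (spike_indexes, flatline_start_indexes) for a list of metric samples."""
    mu, sigma = mean(samples), pstdev(samples)
    # Spikes: samples far from the mean of the whole window.
    spikes = [i for i, v in enumerate(samples)
              if sigma and abs(v - mu) > spike_sigma * sigma]
    # Flatlines: the metric stopped changing for `flatline_len` samples in a row.
    flatlines, run = [], 1
    for i in range(1, len(samples)):
        run = run + 1 if samples[i] == samples[i - 1] else 1
        if run == flatline_len:
            flatlines.append(i - flatline_len + 1)
    return spikes, flatlines

# Example: per-minute error counts leading up to and during an incident.
errors = [0.2, 0.3, 0.2, 0.25, 0.3, 9.8, 10.1, 0.0, 0.0, 0.0,
          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(find_irregularities(errors))  # spikes at indexes 5 and 6, flatline starting at 7

In a real postmortem you would link the equivalent query and graph from your monitoring tool next to each finding, as the talk recommends, so readers can see how the data was gathered.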
Next, you want to move on to external messaging. And the goal of external messaging is to build trust by giving customers enough information about what happened and what you're doing about it, without giving away all of your proprietary information about your technology or your organization. There are parts of your internal analysis that primarily benefit the internal audience, and those don't need to be included in your external postmortem. The external postmortem is a summarized and sanitized version of the information used for the internal postmortem. So some of the components that you want to include with external postmortems are a summary, which is just two to three sentences to summarize what happened; what happened, so what were those contributing factors; and what are we doing about it? It can be pretty short and sweet, and remember, it's sanitized from what you would share internally. Now at PagerDuty, we have a community of experienced postmortem writers available to review postmortems for style and content, as I mentioned, and this avoids wasted time during the meeting. So we post a link to our postmortems in Slack to receive feedback at least 24 hours before the meeting is scheduled, and some of the things that we look for are: does it provide enough detail? Rather than just pointing out what went wrong, does it drill down to the underlying causes of the issue? Does it separate the what happened from the how to fix it? Do the proposed action items make sense? Are they well scoped enough? Is the postmortem understandable and well written? Does the external message resonate well with customers, or is it likely to cause outrage? So a few things to do are: make sure that the timeline is an accurate representation of the events, separate the what happened from the how to fix it, and write follow-up items that are, again, actionable, specific, and bounded. And then things that you don't want to do: don't use the word outage unless it really was an outage. Accurately reflect the impact of an incident. Outage is usually too broad of a term to use, and it can lead customers to think that the product was fully unavailable, and likely that wasn't the case. Also, don't change details or events to make things look better. Be honest in your postmortems. Don't name and shame folks. Keep your postmortems blameless. If someone deployed a change that broke things, it's not their fault. Everyone is collectively responsible for building a system that allowed them to deploy a breaking change. And also avoid that concept of human error. Very rarely is a mistake rooted in a human performing an action. There are often several contributing factors that can and should be addressed. And then also, don't point out just what went wrong. Drill down to the underlying causes of the issue, and also point out what went right. And so, after you've completed the written postmortem, follow up with a meeting to discuss the incident. And the purpose of this meeting is to deepen the postmortem analysis through direct communication and to gain buy-in for the action items. You want to send a link to the postmortem document to the meeting attendees at least 24 hours before the meeting. The postmortem may not be complete at the time when it's sent to the attendees, and it should be finished before the meeting, but it's still worth sending an incomplete postmortem to meeting attendees in advance so that they can start reading through the document. 
This is an opportunity to discuss proposed action items and brainstorm other options. And remember, gain that consensus among leadership. As I mentioned, the ROI of proposed action items may not justify the work, right? The postmortem meeting is a time to discuss that. And then one other thing is, if you can develop good facilitators, that's really helpful in the postmortem meeting. The facilitator role in the postmortem meeting is different from the other participants. So you may be used to a certain level of participation in meetings, but that will change if you take on the role of a facilitator. The facilitator isn't voicing their own ideas during the meeting. They encourage the group to speak up and they keep that discussion on track. And this requires enough cognitive load that it's difficult to perform when you're attempting to also contribute personal thoughts to the discussion. For a successful postmortem meeting, it's helpful to have a designated facilitator that's not trying to participate in that discussion. So good facilitators tend to have a high level of emotional intelligence. That means that they can easily read nonverbal cues to understand how people are feeling, and they use this sense to cultivate an environment where everyone is comfortable speaking. Agile coaches and project managers are often skilled facilitators. At PagerDuty, we have a guild of confident facilitators who coach individuals interested in learning how to facilitate. So some of the things that facilitators do is they keep people on topic. They might need to interrupt to remind the team of meeting goals, or ask if it's valuable to continue with this topic or if it can be taken offline. They can also time-box agenda items, and once the time is done, they can take a vote on whether to keep talking about it or move on. They also keep one person from dominating. So as a facilitator, you want to say up front that participation from everyone is important. You want to explain what the roles and responsibilities of your job as a facilitator are, so that people won't be offended if you tell them to stop talking or if you ask somebody to speak up, and you want to pay attention to how much people are talking throughout the meeting. Some facilitator tips are things like, "Oh, I wasn't able to hear what person A was saying. Let's give them a moment." Really, the facilitator is acting as a mediator to call out when people are getting interrupted. The facilitator is also encouraging contributions, and if they see that a team member hasn't said anything, they can get them to contribute by saying things like, let's go around the room and hear from everyone, or what stood out to you so far, or what else do we need to consider? I do want to point out again that there's no single root cause of failure, because systems are complex. So remember that during this meeting we're not looking for one single person or a single root cause. We're looking for all of those contributing factors. And it's really important, to go back to avoiding that blame, to not use the term root cause. And so practice makes perfect. Use your postmortem practice even for mock incidents; for chaos engineering, complete postmortems at the end of every one of those experiments. And so a few of the key takeaways here are that postmortems should drive a culture of continuous improvement and help you to understand those systemic factors that led to the incident. Blame is bad. It disincentivizes knowledge sharing. 
Individual actions should never be considered a root cause. There's no single root cause, just contributing factors. And if you would like to learn more, you can head over to postmortems.pagerduty.com and learn more about our process. There are also some templates in there. And thank you all for your time.
...

Julie Gunderson

Advocating DevOps @ PagerDuty
