Conf42 Chaos Engineering 2021 - Online

Blameless Postmortem Culture

Abstract

Psychological safety has been identified as the topmost feature of a successful and innovative organization. At the same time, we need to learn from failure and prevent recurrence of mistakes. These two practices seem to contradict each other, but is there a way to achieve them both?

Introduction to Postmortem Culture

  • Goal: System shouldn’t break in the same way again.
  • Growth mindset.
  • Learn from mistakes.
  • Document everything for posterity!

What does a postmortem look like?

  • Various sections.
  • All sections for a reason.
  • Best practices for each section.
  • When does it make sense to write one?

Psychological safety & Blamelessness

  • Postmortems, if not written well, can be counterproductive to your culture.
  • Why is blamelessness important? The most important?
  • Leadership and promoting this culture.
  • Celebrating failure!
  • Going beyond technology.

Ok, I wrote one, what next?

  • Going above and beyond just writing some documents.
  • Making things run like clockwork, while self-sustaining!

Summary

  • Failure and what you do with it can have a transformational effect on your organization. Do you blame people, ignore indicators of underlying problems, and continue business as usual until the next crisis occurs? Or do you encourage feedback and try to find ways to transform failures into learning opportunities?
  • What's a good postmortem? Why is it important to write them? Also, how can a well-written postmortem be an effective tool to influence the culture around you? Towards the end, I'll cover some possibilities around what you can do after you've written a bunch of good postmortems.
  • I am currently a program manager at Google within site reliability engineering. By education, I'm an electrical engineer and Bollywood dance instructor. I love traveling and I love food. 2021 looks very optimistic with respect to vaccines and hopefully some sort of normalcy.
  • Postmortem culture can address a variety of failures for personal growth, growth of others, encouraging accountability, sharing knowledge and best practices. It can also be a very effective tool for leadership to be transparent about popular decisions. Introducing such a culture can seem intimidating at first, but making this change incrementally is possible.
  • A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause, and the follow-up actions to prevent the incident from reoccurring. Teams need to define postmortem criteria before an incident occurs. Writing a postmortem should ideally be looked at as a learning opportunity for the entire company.
  • Have some predefined postmortem writing criteria. What matters is that these criteria are defined in the first place and periodically revisited. A solid understanding of your own self will guide you to know when to write a postmortem.
  • If not written well, a postmortem can be counterproductive to your culture. Human nature makes accepting failure very difficult. The best practice to combat this fear is to keep your postmortems blameless.
  • A better way to express the gravity of the situation would be to gather facts and document them objectively with metrics. Phrases such as "disaster" or "of course" add absolutely no value to your document. What they do is erode psychological safety and make your culture toxic.
  • As leaders, you can start by writing a few postmortems yourself. Use this wealth of knowledge as training material for future leaders. Create a culture of rewarding people for doing the right things. Reward closure of postmortem action items.
  • The next big stretch goal could be the creation of tools to aggregate these postmortems and identify common themes and areas for improvement across product boundaries. Without intentional nurturing, any sort of culture can ultimately fizzle out.
  • The cost of failure is education. Focus on the system, not the people. Keep your postmortems blameless. When written well, acted upon, and widely shared, blameless postmortems can be a very effective tool for driving positive cultural changes.
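
The talk does not describe any specific tooling, so the following is only a minimal sketch of what "aggregating postmortems to identify common themes" and tracking unclosed action items could look like. The record layout, the `themes` tags, and the function names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical records: each postmortem carries reviewer-assigned theme tags
# and a count of action items that are still open.
postmortems = [
    {"id": "PM-1", "themes": ["missing alerting", "capacity"], "open_action_items": 2},
    {"id": "PM-2", "themes": ["missing alerting", "risky rollout"], "open_action_items": 0},
    {"id": "PM-3", "themes": ["capacity", "missing alerting"], "open_action_items": 1},
]


def recurring_themes(records, min_count=2):
    """Return (theme, count) pairs seen in at least `min_count` postmortems."""
    # set() ensures a theme is counted once per postmortem.
    counts = Counter(theme for r in records for theme in set(r["themes"]))
    frequent = [(theme, n) for theme, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


def with_open_action_items(records):
    """Return IDs of postmortems whose action items are not all closed."""
    return [r["id"] for r in records if r["open_action_items"] > 0]


print(recurring_themes(postmortems))
print(with_open_action_items(postmortems))
```

A periodic review could start from these two lists: recurring themes suggest candidates for a larger cross-team project, and postmortems with open action items show where follow-up has stalled.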

Transcript

This transcript was autogenerated. To make changes, submit a PR.
We humans make mistakes, and at some point in time, errors in judgment will occur despite everybody's best intentions and domain expertise. Failure and what you do with it can have a transformational effect on your organization. So what do you do when things go wrong? Do you blame people, ignore indicators of underlying problems, and continue business as usual until the next crisis occurs? Or do you encourage feedback and try to find ways to transform failures into learning opportunities? I'll cover some of these fantastic topics and a little bit more in this talk on blameless postmortem culture. So very quickly, what this talk is not: this is not a lecture on SRE. This is not a lecture on outages and how to resolve them. And I will not be teaching you how to write a postmortem step by step. What this talk is: I'll go deeper into some of the following topics. So what's a good postmortem? Why is it important to write them? And then why is writing a good postmortem not easy? I'll also go over how a well-written postmortem can be an effective tool to influence the culture around you. And then towards the end, I'll cover some possibilities around what you can do after you've written a bunch of good postmortems. So here's a little bit about myself. I am currently a program manager at Google within site reliability engineering, or SRE as we know it. I've been at Google for about four and a half years and I've worked on multiple teams: mobile client infrastructure, currently Firebase SRE, and I also had some opportunity to work with Gmail spam, abuse, and delivery SRE. I have co-authored a chapter on blameless postmortem culture in the second Site Reliability Engineering book, which is called The Site Reliability Workbook. And then I also co-authored a report titled Engineering Reliable Mobile Applications. Both of these are available on the SRE site for free.
And in my previous experience, I juggled multiple roles at a company called Brightidea. By education, I'm an electrical engineer, and I've also been a Bollywood dance instructor and a primary dancer. And I love traveling and I love food. 2021 looks very optimistic with respect to vaccines and hopefully some sort of normalcy. So I just cannot wait to board my next flight and go to a new destination soon. All right, so that's enough context about me. In terms of philosophy, let's bring everybody onto the same page and talk a little bit about the general philosophy behind postmortem culture. Now, whenever an incident occurs, the first priority of the person who's on call is to mitigate the incident, fix the underlying issue, and then ensure that services return to their normal operating conditions. The next big responsibility is to make sure that the system doesn't break in the same way again. Now, unless we have some sort of formalized process of learning from these incidents in place, they may reoccur. Left unchecked, incidents can multiply in complexity or they can sometimes cascade. This can overwhelm a system and ultimately impact our users. Now, one of the reasons I decided to share this talk with you today is because I want to dispel the myth that postmortem culture is only applicable to big organizations or deep technical problems. I think as long as one understands the basic framework, they can use it to address a variety of failures for personal growth, growth of others, encouraging accountability, sharing knowledge and best practices, and documenting facts for the long term so others can learn from them. I know I've personally used the framework to think through a variety of tough situations in my personal life also, when things didn't go according to plan. It can also be a very effective tool for leadership to be transparent about popular decisions and document some rationale behind their unpopular choices.
In almost all the cases I've seen, people really appreciate when leaders lead by example and are transparent. Now, introducing such a culture can seem intimidating at first, but making this change incrementally is possible, and one can gradually tune the process based on their organization's needs. One of the best ways to make use of this talk today is to think of a recent high-impact failure that you experienced and think about how that situation was handled. Then apply the principles and best practices of blameless postmortem culture to that situation and imagine if things could have been done a little bit differently. So what is this postmortem I keep talking about? This is especially for newer folks. A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause, and then, very importantly, the follow-up actions to prevent the incident from reoccurring. Now, let's talk through a simple, high-level template. The first section is all about a quick executive summary. This can include incident details, which could be simple facts such as the incident date, the postmortem writing date, when it was published, and who has access to the document. Another subset of these exec updates could be the problem summary, duration of the problem, estimated user impact, detection mechanisms, and resolution. After exec updates, the next section is background. This is specifically meant for people who don't have a lot of context around your postmortem. They could be new employees or people who've just recently joined your team, or people from a completely different part of the company, or people who think your postmortem could in a way be related to an outage they are dealing with currently. This section is also important because one of the best practices is that once a postmortem is written, ideally everybody throughout the company should be able to read it.
Next we go into sections where concrete details are expected: impact of the outage, root cause, trigger, and then recovery efforts. After that, let's do some deliberate introspection and document lessons learned: things that went well, things that did not go well, and then my personal favorite, where did we get lucky? And this unfortunately happens more than one would like. Next, we document action items and what we want to get out of them. One common problem with action items is that usually they're not very measurable. It's important that they are objectively measurable and that they have owners, a priority, and some sort of tracking ID. And then the last section is the glossary, again meant for people who don't have a lot of context about your postmortem with respect to the terminology you're using and may need a quick reference. In addition to the basic template, usually it's encouraged that you overcommunicate in terms of making the other person understand what your postmortem is all about. All of these sections need careful attention, and there are best practices around how to handle each one of them. So now you may be wondering, this sounds like too much work, so do we need to write a postmortem for every single failure? Well, the answer is no, we don't. Teams need to define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary. In addition to these objective triggers, one may also request a postmortem from another team if they think it's warranted. Personally speaking, over the years, postmortem culture has become such an innate part of the SRE culture that we organically know when we can expect one. Another idea could be to group small incidents that are similar in nature into one postmortem for resolution rather than writing one little postmortem for each incident. Writing a postmortem should ideally be looked at as a learning opportunity for the entire company and not as a punishment.
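
The high-level template described above (summary, background, impact, root cause, trigger, recovery, lessons learned, action items, glossary) can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, field names, and the priority format are hypothetical and do not come from the talk or from any Google template. The one rule it encodes is the speaker's point that every action item needs an owner, a priority, and a tracking ID.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    """One follow-up task. Field names are illustrative assumptions."""
    description: str
    owner: str = ""
    priority: str = ""      # e.g. "P1"; the scale itself is hypothetical
    tracking_id: str = ""   # e.g. a bug or ticket ID

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that are still empty."""
        required = ("description", "owner", "priority", "tracking_id")
        return [name for name in required if not getattr(self, name).strip()]


@dataclass
class Postmortem:
    """Sections mirror the template walked through in the talk."""
    title: str
    incident_date: date
    summary: str
    background: str
    impact: str
    root_cause: str
    trigger: str
    recovery: str
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    got_lucky: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)
    glossary: dict[str, str] = field(default_factory=dict)

    def incomplete_action_items(self) -> dict[str, list[str]]:
        """Map each under-specified action item to its missing fields."""
        return {
            item.description: item.missing_fields()
            for item in self.action_items
            if item.missing_fields()
        }
```

A reviewer could call `incomplete_action_items()` before publishing to catch action items that nobody owns or that cannot be tracked.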
It's definitely worth acknowledging that a good postmortem process does present an inherent cost in terms of time and effort, so one needs to be deliberate in choosing when to write one. So the next question is, when do you write one? Well, it depends on your needs and circumstances. A one-criteria-fits-all approach doesn't really work here. For example, a team whose primary responsibility is ensuring the reliability of a website's ecommerce infrastructure will have distinctly different success criteria or failure metrics than a team whose primary responsibility is product development. The two teams will also have a distinct set of personalities, making incident management nuances somewhat different. So some example scenarios could include business reasons, where X number of users are affected, Y amount of money is lost in revenue, or there could be loss in user trust due to a marketing error, or some other reasons. There could be product reasons for writing a postmortem. For example, the latest canary shows significant regression in a metric, or a risky change was pushed out to production during a production freeze. Or there could be people reasons. So let's say a disruptive reorg impacted people's careers, or improper project management overloaded people for a long time. What matters is that these criteria are defined in the first place and periodically revisited, so one knows when to escalate an incident and when it becomes an all-hands-on-deck situation. One can also write a postmortem for personal reasons, for example, that travel plan went completely awry, or I'm feeling burnt out, et cetera. In terms of defining criteria for your personal growth also, a solid understanding of your own self will guide you, and then you'll know when to write a postmortem. So far we've talked in detail about what we should do. It sounds pretty straightforward. Have some predefined postmortem writing criteria.
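
Predefined criteria like the business and product examples above could be written down as explicit, reviewable rules. The sketch below is a hypothetical illustration: the thresholds, field names, and reason strings are placeholders, since the talk stresses that each team must define and periodically revisit its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Facts about an incident; all fields are illustrative assumptions."""
    users_affected: int = 0
    revenue_lost: float = 0.0
    canary_regression: bool = False
    change_during_freeze: bool = False
    requested_by_other_team: bool = False


@dataclass(frozen=True)
class PostmortemCriteria:
    """Placeholder thresholds; each team sets and revisits its own."""
    users_affected_threshold: int = 1000
    revenue_lost_threshold: float = 10_000.0


def postmortem_triggers(incident: Incident, criteria: PostmortemCriteria) -> list[str]:
    """Return the reasons a postmortem is required (empty list if none)."""
    reasons = []
    if incident.users_affected >= criteria.users_affected_threshold:
        reasons.append("user impact above threshold")
    if incident.revenue_lost >= criteria.revenue_lost_threshold:
        reasons.append("revenue loss above threshold")
    if incident.canary_regression:
        reasons.append("significant canary regression")
    if incident.change_during_freeze:
        reasons.append("risky change during production freeze")
    if incident.requested_by_other_team:
        reasons.append("requested by another team")
    return reasons


incident = Incident(users_affected=5000, change_during_freeze=True)
print(postmortem_triggers(incident, PostmortemCriteria()))
```

Returning the list of reasons, rather than a bare yes or no, keeps the decision transparent and gives the postmortem a factual starting point.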
If an incident occurs, fill out a template, create a written record of the incident, create some action items, and then magically all our problems are solved. But is it really that simple? To answer the question, the next thing I want to emphasize is the importance of ensuring psychological safety. Now, writing a postmortem just for the sake of writing a document is not enough. In fact, if not written well, it can be counterproductive to your culture. It could lead to an atmosphere where incidents or issues are swept under the rug, leading to a greater risk for your company. So to make the best of postmortem culture, it's absolutely imperative to keep postmortems blameless. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. It focuses on what went wrong instead of who did wrong. This is important because basic human nature makes accepting failure very difficult. Sometimes these mistakes can cost your company millions of dollars or euros, and accepting that something you did caused it can be very embarrassing and distressing for individuals. In a very toxic organization, publicly acknowledging mistakes can sometimes cost people their careers. Now, all of these fears will make it pretty much impossible for a postmortem to be valuable, fact-based, and objective. The best practice to combat this fear is to keep your postmortems blameless. But why? We as humans fear public humiliation. Speaking up under normal circumstances is hard enough. Speaking up under immense pressure, as in when an incident is ongoing, is much harder, and people generally tend to avoid the spotlight for fear of being ridiculed. Now, ironically, a crisis demands that someone step up, think differently, and also speak. Also remember that postmortems are artifacts. So now we've increased the complexity of the problem.
Not only does one have to risk isolating themselves in difficult circumstances, they also fear being documented in history. So what if I make a mistake? Everyone will know. Will I be fired? Will I not get promoted? Will my future employees make fun of me? These are all very valid concerns. And then there's the potential to curb innovation and autonomy. Now, dreaming big and working on revolutionary ideas brings a certain amount of associated risk, which may very well result in failures. If a person doesn't feel psychologically safe in taking calculated risks, they may never act on those ideas, and then innovation will probably stagnate. This also takes away from that person's autonomy, as they'll tend to just follow instructions from their managers or someone else instead of being creative themselves. We cannot create future visionaries and leaders like this. Now, it may seem like a good idea to highlight individuals while describing an outage in a postmortem. Instinctively, it feels like we are assigning ownership to someone, which may then motivate the individual to take responsibility. But the big risk in doing so is that individuals would become risk-averse because they may fear public humiliation. This can lead to people covering up facts, risking transparency, which could be critical to understanding an issue and preventing it from reoccurring. And when it comes to personal circumstances, blaming yourself for, let's say, a personal outage is not great either. Have faith in yourself, assume best intent, and know that you acted to the best of your abilities in the circumstances and the context around you. Blameful behavior can be pretty detrimental from a people perspective, but it can be very problematic from a business perspective also. When mistakes are hidden, systematic fixes are harder and problems are more likely to reoccur. Also, blaming humans sometimes tends to result in "fire human" as an action item.
Now, even if we ignore for a moment that that may not be the right thing to do, it still doesn't prevent reoccurrence of a problem. If the system is set up to allow the human to make the mistake in the first place, there's a higher probability that a newly hired and less experienced human will repeat that outage. So blameful behavior is pretty detrimental from a business perspective also. All right, so we all get the idea of blamelessness. Let's work on an example together and see how we can transform accusations into simple facts. In terms of this exercise, remember to focus on what went wrong instead of who did wrong. So let's say there's a hypothetical incident, and then there's a hypothetical postmortem that was written, and this was noted as the root cause: "Dylan did not bother to set up alerting for our storage cluster or check our hard drives manually in case of doubt. Of course, we ran out of space and then this disaster ensued." Now, maybe pause this video, take a couple of minutes, and think about whether there's anything wrong with the way this is written in the postmortem. All right, I hope you've thought about it a little bit, so I can give my perspective here. Not only is this example very blameful, where clearly one person is being scapegoated, it's also quite dramatic. Phrases such as "disaster" or "of course" add absolutely no value to your document. What they do is erode psychological safety and make your culture toxic. You can be sure that Dylan is not going to speak up when another outage happens, and the company could probably lose out on some valuable information. A better way to express the gravity of the situation would be to gather facts and document them objectively with metrics. So here's an example of a more objective, blameless version: the hard drives in one of the trading storage clusters ran out of space, and this was not noticed due to a lack of alerting.
This led to trades from that cluster being redirected to other trading clusters, which were also almost full. And an action item that could come out of this is to set up alerting for this scenario, and that action item must have an owner and some sort of tracking ID. Let's move on to leadership. Now, I'm assuming we're all leaders here, so let's take a look at some of the ideas we can propose and implement to foster a blameless postmortem culture in our surroundings. One of the best things all of us can do is lead by example. You as leaders can start by writing a few postmortems yourself. This will give you an opportunity to experiment with what format or template works best for you. This is also an opportunity to set the tone of the postmortem culture you'd like to see. Once you've written an artifact, share it openly, let the readers digest the content of your documents, and then come back to you with questions and feedback. This is another opportunity to experiment with your postmortem format. Once people warm up to the concept and see you doing the right thing, they'll probably be encouraged to write some themselves. Now, once folks have accumulated a few postmortems over time, you can use this wealth of knowledge as training material for future leaders. One example is an exercise called the Wheel of Misfortune. Think of it as disaster role-play in which a previous postmortem is reenacted with a cast of employees playing roles as laid out in the postmortem. The aim is to make the experience as real as possible and make the educational portion of ramping up as fun as possible. Now, if you or your senior management think it's appropriate, you can go a step further and create a culture of rewarding people for doing the right things. Some examples could be: highlight well-written postmortems in company or team meetings. Reward closure of postmortem action items. If action items are documented but not acted upon, what's the point?
Also acknowledge that making systems more reliable is high-impact work, and then reward such work in your performance assessments. Regarding postmortem owners as leaders can be very motivating. Setting up the postmortem owner as the go-to person for resolving or mitigating a failure can be very rewarding for some people. If your senior management is on board, it is okay to even publicly celebrate a person who actually was the proximal cause of a big outage, but who also acted effectively and sensibly to escalate, troubleshoot, mitigate, or resolve that outage, and who then honestly documented the system failures that allowed them to make that mistake. This is again to emphasize that even if there is clearly a person who can be held responsible, it was always a set of conditions and system failures that allowed things to happen in the first place. You can also celebrate people for finding and fixing reliability bugs. Another idea is to invite folks to talk about failures and lessons learned on a high-visibility platform. This could help normalize conversations around failure. All of these are great ways to instill psychological safety in current and especially newer members of the team. You can do even more, so let's talk about that. Remember this slide we looked at previously. So now you've done all this postmortem work, you're all experts at blamelessness, and you have a magically blameless org, but the work is not done yet. We still haven't completed the last step. We still haven't made sure that the same incident never occurs again. Now, as a part of writing your blameless postmortem, you documented many action items. So what happened to those? Remember, the overarching reason we are doing all this work is to prevent incidents from reoccurring.
And if action items remain unaddressed, you can be sure that incidents are going to reoccur, and maybe on a much bigger scale, since the complexity of our systems and processes is not going down anytime soon. So in order to prevent that, review postmortems and action items periodically. These action items can be around adding more monitoring, automated rollbacks, automated release gating, or even more refactoring of the existing code base. If a company doesn't have preexisting tooling around postmortem action item management, one can look at third-party tools. Also, what process you follow, and how simple or hard it is, can depend on your team, but it's important that you do it in the first place. Otherwise, what's the point of accumulating these action items in your postmortem? Sometimes recurring postmortem reviews can also give you a sneak peek into some themes that are emerging, and you can then combine those themes into a bigger project and prioritize fixes accordingly. Now, with a growing number of postmortems being written every quarter, the next big stretch goal could be the creation of tools to aggregate these postmortems and identify common themes and areas for improvement across product boundaries. One can work on machine learning projects where they can teach a model to preempt outages based on past patterns. Encourage folks to keep their postmortems accessible by all as a default. This will encourage transparency and reinforce the concept of using failures or postmortems as a learning opportunity for the company. And then let's not forget that without intentional nurturing, any sort of culture can ultimately fizzle out. The breakdown of postmortem culture may not always be obvious. Following are some common failure patterns and recommended solutions that might work. The first is lack of time: quality postmortems take time to write, and that is hard when a team is overloaded with other tasks.
The quality of postmortems suffers, so prioritize postmortem work, track its completion and review, and allow teams adequate time to implement the associated plan. If teams are experiencing failures that mirror previous incidents, it could be time to dig a bit deeper. So consider asking questions like: are action items taking too long to close? Are we biased towards feature velocity over reliability? Are the right action items being captured in the first place? Or maybe the service is overdue for a refactor? The next point is blameful language while speaking with each other. Responding when someone uses blameful language can be very challenging, especially if a person more senior than you is doing so. One trick is to mitigate the damage by moving the narrative in a more constructive direction, reminding them that investigating the source of misleading information is much more beneficial than assigning blame. And then, disengaging from the postmortem process is another sign that postmortem culture is starting to fail. If folks seem content with not discussing failures and maybe avoid the issue altogether, it may be time to reinforce blamelessness and ensure that high-visibility postmortems are reviewed for possible blameful wording. All right, this brings me to my final slide. In case you don't remember anything I said in the last 30 to 40 minutes, I hope these key takeaways stay with you. Number one, the cost of failure is education. Number two, keep your postmortems blameless. Focus on the system, not the people. And then lastly, when written well, acted upon, and widely shared, blameless postmortems can be a very effective tool for driving positive cultural changes and also preventing reoccurring errors. Thank you.

Pranjal Deo

Engineering Program Manager @ Google

Pranjal Deo's LinkedIn account


