Postmortem Culture at Google: How do we learn from failures and how can you too?

Abstract

Writing postmortems after incidents and outages is an essential part of Google’s SRE culture. They are blameless, widely shared internally, and allow us as an organization to maximize the insights from failures. We touch on how postmortems are written and used at Google, as well as how they can help in making decisions and driving improved reliability. We also show how you can get started with your own lightweight postmortem process.

Summary

When something fails, we typically write a postmortem. Postmortems are written records of an incident. They are a great tool for learning about the system and for reasoning about how it works. It's very important that postmortems are blameless.
When you have an outage, you have to write a postmortem and analyze what happened. It's important to define the criteria beforehand to guide the person supporting the service. Producing a good postmortem is something that takes time.
The key idea is asking why until the root causes are understood and actionable. The action items need follow-up, execution, and closure. You need to prioritize those action items within your project.
Google has review clubs and a postmortem of the month. These are useful for socializing postmortems and for helping other people understand what failure modes a system has. How do we execute on action plans?
There are many more scenarios and many more angles for site reliability engineering. So we have these two books. The first one is the original SRE book, which covers the principles and the general practices. The second one, the workbook, is an extension of the SRE book that tells you how to put them into practice.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hey, folks, how are you doing? I'm Ramon. I'm a site reliability engineer at Google in Zurich, Switzerland, and we are going to be talking about postmortem culture at Google today. So first of all, we're going to cover an introduction to what postmortems are and how we write them. At Google, we have embraced failure, meaning that we know that everything is failing underneath us, right? We have disks, machines, network links failing all the time. Therefore, 100% is the wrong reliability target for basically everything. And what we do is use a reliability target under 100% to our favor. So the complement of the reliability target, the SLO, is going to be our error budget. We use that for taking risks in our services: adding new features, changes, rollouts, et cetera. So when something fails, what we typically do is write a postmortem. Postmortems are written records of an incident. So whenever something happens, either an outage, a privacy incident, a security vulnerability, or a near miss, where we have a problem with our service but it doesn't translate into an actual outage that customers see, we write a postmortem. A postmortem is the documentation of an incident. So exactly what happened? What was the state of the system? What were the actions that were taken before and after the incident, and what was the impact? That is, what the customers or our users saw as a result of the outage. We want to detail a summary of the root causes and the triggers. The root cause is the actual reason why the incident occurred, and the trigger is the event that actually activated that root cause. For example, you might have a bug that was written in your code base years before, and it never materializes until you make a certain change in your system and exercise it, and then you get an outage, right? Another key part of a postmortem is the action items.
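The relationship between an SLO and its error budget described above can be sketched in a few lines of Python. This is a minimal illustration, not Google's tooling: the function names, the 30-day window, and the request counts are assumptions made for the example.

```python
def error_budget(slo: float) -> float:
    """Return the error budget, the complement of the SLO (e.g. 0.999 -> 0.001)."""
    if not 0.0 < slo < 1.0:
        # 100% is the wrong target for basically everything, so reject it here too.
        raise ValueError("SLO must be strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, period_days: int = 30) -> float:
    """Minutes of full outage the budget tolerates over an assumed period."""
    return error_budget(slo) * period_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent; a negative value means the SLO was missed."""
    allowed_failures = error_budget(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days tolerates roughly 43.2 minutes of downtime.
    print(round(allowed_downtime_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spend a quarter of that budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

Whatever budget remains is what a team can spend on risky changes such as rollouts and new features.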
So it's very important that within your postmortem you specify not only the root cause, the incident, and the status of the system, but also what you are doing so that this outage never occurs again. Postmortems are gaining popularity in the IT industry, but they are very, very common in other industries, like aviation and medicine. For example, when an airplane has a near miss at an airport, there is going to be a detailed analysis that has the same shape as a postmortem. So why are we going to write postmortems? Basically, it's a learning exercise. We want to learn how the system behaves when changes, interactions, or problems in our backends happen. There are many root causes, and we need to understand how our systems behave. The reason that we do this learning exercise is to prevent outages from happening again. So postmortems are a great tool for learning about the system and for reasoning about how certain complex systems work and react. And then they enable us to take qualitative and quantitative measures and actions to prevent the system from responding in an unexpected or undesirable way to changes in backends, changes in the system itself, et cetera. Right. It's very important that postmortems are blameless. Blameless postmortems means that we want to fix our systems and our processes, but not people, right? At some point in time we will have, effectively, someone pushing a production release or pressing a button or whatever. But that's not the root cause of the problem. The root cause of the problem is that our system was vulnerable on certain code paths, or our systems incorporated a library that has a bug, or whatever it was. Right. And the trigger was that something changed in the system: a new version came out, some customer behavior changed, et cetera.
In general, what we want to do with postmortems is learn and make our systems more resilient and more reliable. Right. Another thing that we want to take into account for postmortem writing and analysis is: don't celebrate heroism. Heroes are people who are just available all the time. You can page them all the time. They will put long hours into postmortems, et cetera. What we want to have is systems and processes that do not depend on people overworking or overstretching or whatever, right? Because that by itself might be a flaw in our system: if that person is not available for some reason, are we able to sustain the system? Is the system healthy? Right? So just to emphasize, let's fix systems and processes, not people, right? So when do you write a postmortem? First of all, when you have an outage. If you have an outage, you have to write a postmortem and analyze what happened. Did your outage affect so many users? Did you have so much revenue loss, et cetera, et cetera? Classify it by severity so you can understand more or less how important this outage was, and then write it up and have an internal review. It's important to define the criteria beforehand to guide the person supporting the service, so they know when to write one and when not to. Right. And be flexible as well, because there are times when you have some criteria, but then things change or the company grows, et cetera, and you might want to write a postmortem anyway. If you have a near miss, I would recommend writing a postmortem as well, because even if the incident didn't immediately translate into customers seeing your service down or your data not available, it's interesting because you can use it to prevent the outage from happening for real in the future. And another occasion for a postmortem is when you have a learning opportunity.
So if there is some interesting kind of alert that you handled, if you see other customers that are interested, if you see that this potential near miss that you had, or this potential alert, could escalate in the future into something that our customers could see or that could become an actual outage, write a postmortem. You might want to make it a bit more lightweight, with less review, or have a postmortem that is just internal to the team. Right. That's totally okay. Right. It's also nice documentation for the risk assessments for your service. And it's a nice way of training new members of the team who are going to be supporting the service on how you write postmortems. They will then have a trail of postmortems that they can analyze and read to understand how the system works and the risks that the service has. So who writes a postmortem? In principle, there's going to be an owner who is going to contribute or coordinate the contributions. That doesn't necessarily mean that this person is going to write every single line of the postmortem. Right. The popular choices are usually the incident commander who was responding to the outage, or anyone in the dev or SRE team who is a TL or a manager, depending on how your company is organized. This owner will ask for contributors. Right. There's going to be an SRE team on another service that was affected, or there was a backend that was in some way affecting your service. So they will need to contribute things like the timeline of events, the root cause, and the action items that they're going to be taking on. Right. Then there are other dev teams or other SRE teams that were impacted. So it's not only the people or the teams whose services impacted your service's reliability, but also how your service affected other products in the company, right?
All of this is a collaborative effort, and producing a good postmortem is something that takes time. It takes effort and will need reviews and iterations until it's an informative and useful document. Who reads the postmortem? The first class of audience is going to be the team. The team that supports a service will have to read and understand every single detail of all the postmortems that happen for that service, basically because that's how they understand which action items they will have to produce, what the priorities are for their own project prioritization process, et cetera. Then the company. If you have postmortems that have impact, near misses, or cross-team postmortems, it's interesting to have some people outside of the team review them, for example directors, VPs, architects, whoever in the company has a role in understanding how the architecture of the services and the products works. In this case, the details might not be as necessary. An executive summary, for example, would certainly help in understanding the impact and how this relates to other systems, but it's definitely a worthy exercise to do. And then customers and the public: they are another audience for postmortems that is interesting to consider. If you, for example, run a public cloud or a software-as-a-service company, you will have customers that trust you with their data, their processing, whatever it is that you offer to them, right? When you have an outage, the trust between you and your customers might be affected, and the postmortem is a nice exercise to regain that trust. Additionally, if you have SLAs, for example, and you are not able to meet them, postmortems might become not only something that is useful for your customers, but something that is required by your agreements. So what will you include in the postmortem? This is the bare minimum postmortem that you can write.
The minimal postmortem will include the root cause, which is in red. In this case, it was a service or product that had some canary metrics that didn't detect a bug in a feature, right? The feature was enabled at some point. That was the trigger. Keep in mind that the root cause might just be sitting in your code for a long time, and until the trigger happens, it will not be exercised and therefore the outage will not materialize. And then you have an impact measure. In this case, the product ordering feature for this website or software as a service was unavailable for 4 hours and therefore yielded a revenue loss of so many dollars. It's interesting to always link your impact to business metrics, because that makes everyone in the company able to understand exactly where the impact is, even if they are not directly from the same team. Additionally, we have an action plan. In this case that's two AIs, two action items. The first one is to implement better canary analysis so we can detect problems, right? This is a reactive measure: when something happens, you are able to detect it. And then there is a preventive measure, which is to avoid things happening again: all features will have a rollout checklist, for example, or a review, or whatever it is. What you incorporate there as well is the lessons learned, which are very interesting to have: what went well, what went poorly, where we got lucky, right? And those are interesting questions because it's never the case that the outage is all negative. There may be things that worked well. For example, the team was well trained, we had proper escalation measures, et cetera, et cetera. And then some supporting materials. Chat logs: when you were responding to the outage, you were typing into IRC, into your Slack, into whatever chat you use in your company, right?
Metrics, for example, showing screenshots or links to your monitoring system that show the metrics and so on, for posterity, to understand, for example, how to detect them and so on. Documentation, links to commits, excerpts of code that were actually, for example, part of the root cause, or that incorporated measures, like when you review some commits, et cetera. That's interesting to have in your PM as well. Useful metadata to capture: who is the owner, the collaborators, what is the review status, like whether there are reviews happening, and who is signing off the postmortem so the list of action items is validated and can move into implementation. Right. And then we should have, for example, impact and root cause. That's important. We have a quantification of that impact: what was the SLO violation, the impact in terms of revenue or customers affected, et cetera. The timeline is something that is very interesting: it's a description of everything that happened, from the root cause being introduced, to the trigger, to what the response was. And that's a nice learning exercise for the team that is supporting a service, to understand how the response should go in the future. Right. This is the postmortem metadata. You have things like, for example, the date and the authors. There is an impact measure. In this case, I think it's very interesting because the impact is measured in queries lost, right? But there is no revenue impact. There could be, for example, postmortems that do have an actual hard revenue impact in there. And then you have the trigger and the root cause. See that the root cause, for example, in this case is a cascading failure through some high load and blah, blah, blah, right? That was in the system. That vulnerability that the system had to this complex behavior was there and only materialized when the trigger, the increase in traffic, actually exercised that latent bug that you have in your code, right?
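The sections of a minimal postmortem described above (impact, root cause, trigger, action items, lessons learned) can be sketched as a small record type. This is only an illustrative sketch: the class names, field names, and the `missing_sections` helper are hypothetical and are not the template used at Google. The sample values paraphrase the example from the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    kind: str  # "mitigation" or "prevention" (hypothetical labels)
    owner: str = "unassigned"
    done: bool = False


@dataclass
class Postmortem:
    title: str
    impact: str = ""
    root_cause: str = ""
    trigger: str = ""
    action_items: List[ActionItem] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    went_poorly: List[str] = field(default_factory=list)
    got_lucky: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of the bare-minimum sections that are still empty."""
        required = ["impact", "root_cause", "trigger", "action_items"]
        return [name for name in required if not getattr(self, name)]


draft = Postmortem(
    title="Product ordering outage",
    impact="Product ordering unavailable for 4 hours",
    root_cause="Canary metrics did not detect a bug in a feature",
    trigger="The feature was enabled",
)
print(draft.missing_sections())  # the draft has no action items yet

draft.action_items.append(
    ActionItem("Implement better canary analysis", kind="mitigation")
)
draft.action_items.append(
    ActionItem("Add a rollout checklist for all features", kind="prevention")
)
print(draft.missing_sections())
```

A check like `missing_sections` is one way a reviewer could hold back sign-off until a draft has at least one action item.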
And you have detection, that is, who told you: it could be your customer, your monitoring system sending you an alert, et cetera. Then the action plan. In this case, we have five action items. I think it's important to classify them by type, here as mitigation and prevention. I think it's interesting because you will have action items that reduce the risk, right? Risk is always the probability of something materializing and what the impact of that would be. So mitigation would reduce either the probability of something happening or its impact. And then you have prevention, where we want to reduce the probability of something happening, ideally down to zero, so it never happens again. So, learnings and timeline. This is something that is very interesting, at least for me. The timeline is my favorite part of a postmortem. Lessons learned covers things that went well. For example, in this case, the monitoring was working well for this service, right? Things that went wrong are stuff that is a prime candidate for becoming action items to solve. Right? And you are always lucky; we just need to realize that. There are some places where we got lucky in this case as well. Those are prime candidates for action items. You don't want to depend on luck for your system's reliability. And then you have the timeline. You see that the timeline covers many action items. Sorry, many items, right? And the moment the outage begins is not exactly at the beginning. You see that there are some reports happening about the Shakespeare sonnet, and there could even be entries that are older, like: this commit was incorporated in the code base, and it contained the actual bug that was latent for months even, right? And then there was a trigger and the outage actually began. So, the postmortem process. First of all, how do you go about it? Do we need a postmortem, yes or no? Yes. Then let's write a draft.
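The difference between mitigation and prevention described above can be made concrete with a small calculation. Multiplying probability by impact is one common way to express risk as a single number. It is an assumption here, not something stated in the talk, and all names and figures below are made up for illustration.

```python
def risk(probability: float, impact: float) -> float:
    """Expected loss: chance of the failure materializing times its impact."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if impact < 0:
        raise ValueError("impact must be non-negative")
    return probability * impact


# Hypothetical failure mode: 20% chance per quarter, 4 hours of outage if it hits.
baseline = risk(0.2, 4.0)

# Mitigation lowers the probability or the impact (here: faster detection
# halves the outage duration).
mitigated = risk(0.2, 2.0)

# Prevention aims to push the probability down to zero.
prevented = risk(0.0, 4.0)

print(baseline, mitigated, prevented)
```

Ranking candidate action items by how much they lower this number is one possible way to compare a mitigation against a prevention.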
The draft is something you need to put together very quickly with whatever forensics you can gather from the incident response, like logs and the timeline. Just dump everything into the document. Everything. Even if it's ugly or disorganized, just dump it so you don't lose it. Then you can work on it and make it a bit prettier. Then analyze the root cause, do internal reviews, clarify, add action items, et cetera, and then publish it. When you understand the root cause and you have reviewed the action plan, publish it and have reviews. Right. And then there is the last part, which is the most important. You need to prioritize those action items within the project work for your team. Right. Because a postmortem without action items is no different from having nothing at all. The action items need follow-up, execution, and closure, so that the system actually improves. So, AIs, action items. As I was saying, a postmortem without action items is indistinguishable from no postmortem for our customers. And that's true, because you might have a postmortem, right, but if it doesn't have action items, the customer won't see any improvement in the service. And if you have an action item list that you don't follow up on, right, the system is in the same state as it was prior to the outage. So how are you going to go about understanding your root causes? Five whys. Right. The key idea is asking why until the root causes are understood and actionable. This is very important because the first root causes you find might just be red herrings that are not the actual one. So you need to keep asking until you know what the root cause was, because that's how you are going to derive action items that are good and actually improve your system. In this case, the users couldn't order products worldwide. Why? Because feature X had a bug. But why did it have a bug, right?
Because the feature was rolled out globally in one step, or we were missing a test case for X. Both can happen. Right. Why? Because the canaries didn't evaluate that, and blah, blah, blah, until you have it well defined and crystal clear. Now, best practices for the action plan. There are some action items that are going to be band-aids, short-term stuff. Those are valid, but they shouldn't be the only thing that you do. It's nice to have some action items that make the system incrementally better or more resilient in the short term. Right. But you need to do the long-term follow-up as well. We need to think beyond prevention as well, because there might be cases where you can prevent something from happening 100%. That's ideal, right? But you might also want to mitigate: reduce the probability of something happening, but also, if some risk materializes, reduce the impact of it on your service, right? And then humans as a root cause is a problem, because you can't have action items fixing humans. It should be the processes or the system. Remember that. So don't fix humans; fix the documentation, fix the processes for rolling out new binaries, fix the monitoring that is going to tell you that something is broken. So you have your postmortem done and published, right? And it's excellent. So you have it. We have some review clubs, we have a postmortem of the month, and so on in the company, especially in a company as large as Google. I think it's interesting for socializing it and for other people to understand what failure modes a system has. Because if my systems, for example, which are the authentication stack, have some failure modes that I'm subject to, perhaps other systems that are similar will have them too. So it's an interesting exercise to read how other teams fail, sorry, how other services fail, right? So I can say, wait, am I subject to that? And then I can prevent it.
And the wheel of misfortune is a nice replay for training as well. When a new team member joins, we say, let's just take this postmortem and replay it and see how the response would go and how we would approach it. Right? So it's a nice learning exercise as well. So how do we execute on action plans? First of all, we need to pick the right priorities, right? Not all of the action items in your postmortem are going to be of the highest priority, because you have a limited capacity to execute on them. So you perhaps need to choose and address them sequentially. Right. Reviews are very important, so you have to review how you are progressing and whether your burn rate of action items is actually getting you to completion, right? And then have some focus from the executives, even if your postmortem is one of those that are not reviewed by the executives for some reason, right? It's nice to have high-level visibility as well because of your customers. Your customers, either teams in the company or your actual external customers, can see that you are making progress to make the service better. So that's all I have for today. This was all about postmortems. But there are many more scenarios and many more angles for site reliability engineering. So we have these two books. The first one is the original SRE book, which covers the principles and the general practices. And the second one, the workbook, is an extension of the SRE book that will tell you how to put them into practice. We cover a lot of space in these books about postmortems, action items, and incident response that might be interesting for you to read if you have enjoyed this talk. And that's all. Thank you very much for watching.
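The advice above on executing action plans (pick priorities within limited capacity, then review progress) could be tracked with something as small as the following sketch. The `TrackedItem` type, the priority convention, and the sample items are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrackedItem:
    title: str
    priority: int  # 0 is the highest priority (assumed convention)
    done: bool = False


def next_up(items: List[TrackedItem], capacity: int) -> List[TrackedItem]:
    """Open items to work on now, highest priority first, limited by capacity."""
    open_items = [item for item in items if not item.done]
    return sorted(open_items, key=lambda item: item.priority)[:capacity]


def completion_ratio(items: List[TrackedItem]) -> float:
    """Share of action items already closed, for use in periodic reviews."""
    if not items:
        return 1.0
    return sum(item.done for item in items) / len(items)


backlog = [
    TrackedItem("Update rollout documentation", priority=1, done=True),
    TrackedItem("Improve canary analysis", priority=0),
    TrackedItem("Add load test for peak traffic", priority=2),
    TrackedItem("Add rollout checklist", priority=1),
]

print([item.title for item in next_up(backlog, capacity=2)])
print(completion_ratio(backlog))
```

Reviewing the completion ratio at a regular cadence shows whether the action items are actually converging on closure.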

Conf42 Site Reliability Engineering 2022 - Online

June 09 2022

Ramon Medrano Llamas

Senior Staff SRE @ Google
