Transcript
            
            
            
            
            
            
Hey, folks, how are you doing? I'm Ramon. I'm a site reliability engineer at Google in Zurich, Switzerland, and we are going to be talking about postmortem culture at Google today. First of all, we're going to cover an introduction of what postmortems are and how we write them. At Google, we have embraced failure as part of our culture, meaning that we know that everything is failing underneath us, right? We have disks, machines, network links failing all the time. Therefore, 100% is the wrong reliability target for basically everything. So what we do is set a reliability target under 100%, and the gap between that target and 100% is our allowance for failure. The complement of the reliability target, of the SLO, is going to be our error budget. We use that budget for taking risk in our services: adding new features, making changes, doing rollouts, et cetera.
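To make that concrete, here is a small illustrative sketch (not from the talk; the 99.9% target and the 30-day window are made-up example numbers) of how an availability SLO turns into an error budget:

# Sketch: turning an availability SLO into an error budget.
# The 99.9% target and the 30-day window are made-up example values.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given availability SLO."""
    budget_fraction = 1.0 - slo              # e.g. 1 - 0.999 = 0.001
    window_minutes = window_days * 24 * 60
    return budget_fraction * window_minutes

# A 99.9% SLO over 30 days leaves roughly 43 minutes of budget that can be
# spent on risky changes, rollouts, experiments, and plain bad luck.
print(f"{error_budget_minutes(0.999):.1f} minutes per 30 days")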
            
            
            
So when something fails, what we do is we typically write a postmortem. Postmortems are written records of an incident. Whenever something happens, either an outage, a privacy incident, a security vulnerability, or a near miss (a problem with our service that didn't translate into an actual outage that customers see), we write a postmortem. A postmortem is a documentation of an incident: exactly what happened, what was the state of the system, what actions were taken before and after the incident, and what was the impact, meaning what the customers or our users saw as a result of the outage. Then we detail the summary of the root causes and the triggers. The root cause is the actual reason why the incident occurred, and the trigger was the event that actually activated that root cause. For example, you might have a bug that was written into your code base years before, and it never materializes until you make a certain change in your system that exercises it, and then you get an outage, right? Another key part of a postmortem is the action items. It's very important that within your postmortem you not only specify the root cause, the incident, and the status of the system, but also what you are doing so this outage never occurs again.
            
            
            
Postmortems are gaining popularity in the IT industry, but they are very, very common in other industries, like aviation and medicine. For example, when an airplane has a near miss at an airport, there is going to be a detailed analysis that has the same shape as a postmortem.
            
            
            
So why are we going to write postmortems? Basically, it's a learning exercise. We want to learn how the system behaves when certain changes, certain interactions, or certain problems in our backends happen. There are many root causes, and we need to understand how our systems behave under them. The reason we do this learning exercise is to prevent outages from happening again. Postmortems are a great tool for learning about the system and for reasoning about how it works and reacts, which is hard for complex systems. And then they enable us to take qualitative and quantitative measures and actions to prevent the system from responding in an unexpected or undesirable way to changes in backends, changes in the system itself, et cetera. Right.
            
            
            
It's very important that postmortems are blameless. Blameless means that we want to fix our systems and our processes, not people, right? At some point in time there will effectively be someone pushing a production release or pressing a button or whatever, but that's not the root cause of the problem. The root cause of the problem is that our system was vulnerable to certain code paths, or that the system incorporated a library that has a bug, or whatever it was, right? And the trigger was that something changed: a new version came out, some customer behavior changed, et cetera. In general, what we want to do with postmortems is learn and make our system more resilient and more reliable.
            
            
            
Right. Another thing that we want to take into account for postmortem writing and analysis is: don't celebrate heroism. Heroes are people who are available all the time, you can page them all the time, they will put in long hours, et cetera. What we want to have is systems and processes that do not depend on people overworking or overstretching, right? Because that by itself might be a flaw in our system: if that person is not available for some reason, are we able to sustain the system? Is the system healthy? So just to emphasize: let's fix systems and processes, not people.
            
            
            
So when do you write a postmortem? First of all, when you have an outage. If you have an outage, you have to write a postmortem and analyze what happened. Did your outage affect this many users? Did you have this much revenue loss, et cetera? Classify it with a severity so you can understand, more or less, the importance of this outage, and then write it up and have an internal review. It's important to define the criteria beforehand to guide the people supporting the service, so they know when to write one or not. And be flexible as well, because there are times when you have some criteria, but then things change, or the company grows, et cetera, and you might want to write a postmortem anyway.
            
            
            
If you have a near miss, I would recommend writing a postmortem as well, because even if the incident didn't immediately translate into customers seeing your service down or your data being unavailable, it's interesting because you can actually use it to prevent the outage from happening for real in the future.
            
            
            
Another occasion for a postmortem is when you have a learning opportunity. If there is some interesting kind of alert that you hunted down, if there are other teams or customers that are interested, or if you see that this near miss you had, or this alert, could escalate in the future into something that your customers could see or that could become an actual outage, write a postmortem. You might want to do it a bit lightweight, have a lighter review, or have a postmortem that is just internal to the team. That's totally okay. It's also nice documentation for the risk assessment of your service. And it's a nice way of training new members of the team who are going to be supporting the service on how you write postmortems; they will have a trail of postmortems that they can analyze and read to understand how the system works and the risks that the service has.
            
            
            
So who writes a postmortem? In principle, there's going to be an owner who is going to contribute to it, or coordinate the contributions. That doesn't necessarily mean that this person is going to write every single line of the postmortem. The popular choices are usually the incident commander who was responding to the outage, or anyone on the dev or SRE team, like a TL or a manager, depending on how your company is organized.
            
            
            
This owner will ask for contributors. There's going to be an SRE team for another service that was affected, or for a backend that was somehow affecting your service, so they will need to contribute the timeline of events, the root cause, and the action items that they are going to take away. The same goes for other dev teams or SRE teams that were impacted: it's not only the people or teams whose services impacted your service's reliability, but also how your service affected other products in the company, right? All of this is a collaborative effort, and producing a good postmortem is something that takes time. It takes effort, and it will need reviews and iterations until it's an informative and useful document.
            
            
            
Who reads the postmortem? The first class of audience is going to be the team. The team that supports the service will have to read and understand every single detail of all postmortems that happen for the service, basically because that's how they understand which action items they will have to produce, what the priorities are for their own project prioritization process, et cetera.
            
            
            
The company: if you have postmortems with large impact, near misses, or cross-team postmortems, it is interesting to have some people outside of the team review them, like, for example, directors, VPs, architects, whoever in the company has a role in understanding how the architecture of the services and the products fits together. In this case, the details might not be as necessary; an executive summary, for example, would certainly help in understanding the impact and how it relates to other systems. But it's definitely a worthwhile exercise to do.
            
            
            
And then customers and the public are another part of the audience for postmortems that is interesting to consider as well. If you, for example, run a public cloud or a software-as-a-service company, you will have customers who trust you with their data, their processing, whatever it is you offer to them, right? When you have an outage, the trust between you and your customers might be affected, and the postmortem is a nice exercise to actually regain that trust. Additionally, if you have SLAs, for example, and you are not able to meet them, postmortems might become not only something that is useful for your customers, but something that is required by your agreements.
            
            
            
So what will you include in the postmortem? This is the bare minimum postmortem that you can write, the minimal postmortem. It will include the root cause, which is in red here: in this case, the service, a product, had some canary metrics that didn't detect a bug in a feature, right? The feature was enabled at some point; that was the trigger. Keep in mind that the root cause might be just sitting in your code for a long time, and until the trigger happens it will not be exercised, and therefore the outage will not materialize. And then you have an impact measure. In this case, the product ordering feature for this web product, this software as a service, was unavailable for 4 hours and therefore yielded a revenue loss of so many dollars. It's interesting to always link your impact to business metrics, because that makes everyone in the company able to understand exactly where the impact is, even if they are not directly from the same team.
            
            
            
Additionally, we have an action plan. In this case there are two AIs, two action items. The first is to implement better canary analysis so we can detect problems, right. That is a reactive measure: when something happens, you are able to detect it. And then there is a preventive measure, which is there to avoid things happening again; in this case, all features will have a rollout checklist, for example, or a review, or whatever it is. What you incorporate there as well is the lessons learned, which are very interesting to have: what went well, what went poorly, where we got lucky, right? And those are interesting questions, because it's never the case that an outage is all negative. There may be things that worked well; for example, the team was well trained, we had proper escalation measures, et cetera.
            
            
            
And then some supporting materials. Chat logs: when you were responding to the outage, you were typing into IRC, into your Slack, into whatever chat you use in your company, right. Metrics: for example, screenshots of or links to your monitoring system showing the relevant metrics and so on, for posterity, so you can understand, for example, how to detect these problems. Documentation, links to commits, excerpts of code: the code that was, for example, part of the root cause, or part of the measures you incorporated, like when you reviewed some commits, et cetera. That is also interesting to have in your postmortem.
            
            
            
Useful metadata to capture: who is the owner, who are the collaborators, what is the review status if there are reviews happening, and who is signing off on the postmortem so that the list of action items is validated and can move into implementation. And then we should have, for example, the impact and the root cause. That's important. We want a quantification of that impact: what was the SLO violation, the impact in terms of revenue or customers affected, et cetera. The timeline is something that is very interesting: a description of everything that happened, from the root cause being incorporated, to the trigger, to what the response was. And that's a nice learning exercise for the team that is supporting the service, to understand how the response should go in the future.
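As a purely illustrative sketch (the field names and example values below are mine, not an official template from the talk), the pieces discussed so far could be captured in a minimal postmortem record like this:

# Sketch of a minimal postmortem record with the fields discussed above.
# Field names and example values are illustrative, not an official template.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ActionItem:
    description: str
    kind: str          # "mitigation" (detect, limit damage) or "prevention" (stop recurrence)
    owner: str
    done: bool = False

@dataclass
class Postmortem:
    title: str
    owner: str
    collaborators: List[str]
    status: str                # e.g. "draft", "in review", "published"
    impact: str                # tie this to a business metric where possible
    root_cause: str
    trigger: str
    detection: str             # who or what told you: alert, customer report, ...
    action_items: List[ActionItem] = field(default_factory=list)
    lessons_learned: Dict[str, str] = field(default_factory=dict)  # went well, went poorly, got lucky
    timeline: List[str] = field(default_factory=list)              # timestamped events, oldest first

pm = Postmortem(
    title="Product ordering unavailable",
    owner="incident-commander",
    collaborators=["dev-team", "sre-team"],
    status="draft",
    impact="Ordering feature down for 4 hours; estimated revenue loss of $N",
    root_cause="Latent bug in feature X that canary metrics did not detect",
    trigger="Feature X enabled globally in one step",
    detection="Monitoring alert on order error rate",
    action_items=[
        ActionItem("Improve canary analysis for the ordering path", "mitigation", "sre-team"),
        ActionItem("Require a rollout checklist for new features", "prevention", "dev-team"),
    ],
)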
            
            
            
This is the postmortem metadata. You have things like, for example, the date and the authors. There is an impact measure in this case, and I think it's very interesting because the impact is measured in queries lost, right, but there is no revenue impact; there could be, for example, postmortems that do have an actual hard revenue impact in there. And then you have a trigger and the root cause. See that the root cause, for example, in this case is a cascading failure through some high load and so on, right? That vulnerability, which the system had for this complex behavior, was there and only materialized when the trigger, the increase in traffic, actually exercised that latent bug that you have in your code, right? And you have detection, that is, who told you: it could be your customer, your monitoring system sending you an alert, et cetera.
            
            
            
The action plan: in this case, we have five action items. I think it's important to classify them by type, and the classification into mitigation and prevention is interesting because you will have action items that reduce the risk, right? Risk is always the probability of something materializing combined with the impact of that risk. So mitigation reduces either the probability of something happening or the impact. And then you have prevention, which is where we want to reduce the probability of something happening, ideally down to zero, so it never happens again.
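One way to make that split concrete (my own illustrative framing with made-up numbers, not figures from the talk) is to treat risk as probability times impact and see which factor each class of action item moves:

# Illustrative sketch: risk as probability x impact, and how the two
# classes of action items act on it. All numbers are made up.

def expected_cost(probability: float, impact_dollars: float) -> float:
    """Expected cost of an outage scenario over some window."""
    return probability * impact_dollars

baseline = expected_cost(0.10, 50_000)   # 10% chance of a $50k outage -> $5,000
# Mitigation: faster detection and rollback halve the impact when it happens.
mitigated = expected_cost(0.10, 25_000)  # -> $2,500
# Prevention: a rollout checklist makes the trigger far less likely.
prevented = expected_cost(0.01, 50_000)  # -> $500

print(baseline, mitigated, prevented)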
            
            
            
So, learnings and the timeline. This is something that is very interesting, at least for me; the timeline is my favorite part of a postmortem. Lessons learned are the things that went well: for example, in this case, the monitoring was working well for this service, right. Things that went poorly are prime candidates for becoming action items to solve, right? And you are not always lucky, we just need to realize that; there are some places where we got lucky in this case as well, and those are also prime candidates for action items. You don't want to depend on luck for your system reliability.
            
            
            
And then you have the timeline. You see that the timeline covers many items, right, and the entry where the outage begins is not exactly at the beginning. You see that there are some reports happening for the Shakespeare service and the sonnet, and there could be even older entries, like: this commit was incorporated into the code base, and it contained the actual bug that was latent for months, even. And then there was the trigger, and the outage actually began.
            
            
            
So, the postmortem process. First of all, how do you go through it? Do you need a postmortem, yes or no? Yes? Then let's write a draft. The draft is something you need to put together very quickly with whatever forensics you can gather from the incident response, like logs and the timeline. Just dump everything into the document. Everything. Even if it's ugly or disorganized, just dump it so you don't lose it, and then you can work it over and make it a bit prettier. Then analyze the root cause: internal reviews, clarifications, adding action items, et cetera. Then publish it: when you understand the root cause and you have reviewed the action plan, publish it and have reviews. And then there is the last part, which is the most important: you need to prioritize those action items within your project work for your team. Because a postmortem without action items is no different from nothing at all. The action items need follow-up, need execution, need closure, so that the system actually improves.
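A rough sketch of that flow as plain data (the stage names are my own paraphrase of the steps just described, not an official process definition):

# Rough sketch of the postmortem lifecycle described above; stage names
# are a paraphrase, not an official process definition.
from typing import Optional

STAGES = [
    "decide: does this incident need a postmortem?",
    "draft: dump logs, chat excerpts, and a raw timeline before they are lost",
    "analyze: find the root cause, add and review action items",
    "publish: share with the team, and with wider audiences where appropriate",
    "follow up: prioritize, execute, and close every action item",
]

def next_stage(current: str) -> Optional[str]:
    """Return the stage after `current`, or None when the process is finished."""
    i = STAGES.index(current)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None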
            
            
            
So, AIs, action items. As I was saying, a postmortem without action items is indistinguishable from no postmortem at all for our customers. And that's true: you might have a postmortem, but if it doesn't have action items, the customer won't see any improvement in the service. And if you have an action item list that you don't follow up on, the system stays in the same state it was in prior to the outage.
            
            
            
So how are you going to go about understanding your root causes? The five whys. The key idea is asking why until the root causes are understood and actionable. This is very important, because a candidate root cause might just be a red herring that is not the actual one, so you need to keep asking until you know what the root cause was. That's how you are going to derive action items that are good and actually improve your system. In this case, the users couldn't order products worldwide. Why? Because feature X had a bug. But why did it have a bug that reached users, right? Because the feature was rolled out globally in one step, or we were missing a test case for X; both can happen. Why? Because the canaries didn't evaluate that, and so on, until you have it well defined and crystal clear.
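To show the shape of such a chain, here is a small sketch in plain data; the first entries restate the example above, and the later answers are invented just to fill out five levels:

# Sketch of a "five whys" chain as plain data. The first entries restate the
# example above; the later answers are invented to complete five levels.
five_whys = [
    ("Why couldn't users order products worldwide?",
     "Feature X had a bug."),
    ("Why did the bug in feature X reach users?",
     "The feature was rolled out globally in one step."),
    ("Why was it rolled out globally in one step?",
     "There was no staged-rollout requirement for new features."),
    ("Why didn't the canaries catch the bug?",
     "Canary analysis did not evaluate the ordering code path."),
    ("Why wasn't the ordering path covered?",
     "There was no test case for feature X's interaction with ordering."),
]

for question, answer in five_whys:
    print(f"{question}\n  -> {answer}")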
            
            
            
Best practices for the action plan: there are some action items that are going to be band-aids, short-term stuff. Those are valid, but they shouldn't be the only thing that you do. It's nice to have some action items that just make the system incrementally better or more resilient in the short term, right, but you need to do the long-term follow-up as well. We also need to think beyond prevention, because there might be cases where you can't just prevent something from happening 100% of the time. That would be ideal, right, but you might want to mitigate as well: reduce the probability of something happening, but also, if some risk materializes, reduce the impact of it affecting your service. And then, humans as a root cause are a problem, because you can't have action items that fix humans; it should be the processes or the system. Remember that. So don't fix the human; fix the documentation, fix the processes for rolling out new binaries, fix the monitoring that is going to tell you that something is broken.
            
            
            
So you have your postmortem done and published, right? And it's excellent. So you have it. We have some review clubs, we have a postmortem of the month, and so on in the company, and especially in a company as large as Google, I think that's interesting for socializing it and for other people to understand what failure modes a system has. My systems, for example, are the authentication stack; if there are some failure modes that I'm subject to, perhaps other systems that are similar will have them too. So it's an interesting exercise to read how other teams fail, how other services fail, sorry, right? So I can see, like, wait, am I subject to that? So I can prevent it. And as well, the wheel of misfortune is a nice replay for training. When a new team member joins, we say, let's just take this postmortem and replay it, and see how the response would go and how we would approach it. So it's a nice learning exercise as well.
            
            
            
So how do we execute on action plans? First of all, we need to pick the right priorities. Not all of the action items in your postmortem are going to be the highest priority they can be, because you have limited capacity to execute on them, so you perhaps need to choose and address them sequentially. Reviews are very important, so you have to review how you are progressing and whether your burn rate of action items is actually getting you to completion. And then have some focus from the executives; even if your postmortem might be one of those that are not reviewed by an executive for whatever reason, it's nice to have high-level visibility as well, because of your customers. Your customers, either teams in the company or your actual external customers, can see that, and can see that you are making progress to make the service better.
            
            
            
So that's all I have for today. This has all been about postmortems, but there are many more scenarios and many more angles to site reliability engineering. We have these two books: the first one is the original SRE book, which covers the principles and the general practices, and the second one, the workbook, is an extension of the SRE book that tells you how to put them into practice. We cover a lot of space in these books about postmortems, action items, and incident response that might be interesting for you to read if you have enjoyed this talk. And that's all, and thank you very much for watching.