Transcript
This transcript was autogenerated. To make changes, submit a PR.
We humans make mistakes,
and at some point in time, errors in judgment will occur
despite everybody's best intentions and domain expertise.
Failure and what you do with it can have a
transformational effect on your organization.
So what do you do when things go wrong?
Do you blame people, ignore indicators of underlying
problems, and continue business as usual until the next
crisis occurs? Or do you encourage feedback
and try to find ways to transform failures into learning opportunities?
I'll cover some of these fantastic topics and a little
bit more in this talk on blameless postmortem culture.
So very quickly, what this talk is not. Let's
understand that this is not a lecture on SRE.
This is not a lecture on outages and how to resolve
them. And I will not be teaching you how to
write a postmortem step by step.
However, what this talk is, is I'll
go deeper into some of these following topics.
So what's a good postmortem? Why is it important to write them?
And then why is writing a good postmortem not easy?
I'll also go over how a well-written postmortem can be an effective
tool to influence culture around you. And then towards
the end, I'll cover some possibilities around what you can
do after you've written a bunch of good postmortems.
So here's a little bit about myself. I am currently a program
manager at Google within site reliability engineering,
or SRE as we know it. I've been at Google for
about four and a half years and I've worked on multiple teams.
So mobile client infrastructure, currently Firebase SRE.
And then I also had some opportunity to work with Gmail
spam, abuse, and delivery SRE.
I have co-authored a chapter on blameless postmortem culture
in the second Site Reliability Engineering book, which is called The Site Reliability
Workbook. And then I also co-authored a report which
is titled Engineering Reliable Mobile Applications.
Both of these are available on the SRE site for free.
And in my previous experience, I juggled
multiple roles at a company called Bright Idea.
By education, I'm an electrical engineer, and then I've
also done a little bit of Bollywood dance instruction, and I was a primary
dancer for them. And I love traveling and I love food.
2021 looks very optimistic with respect to vaccines
and hopefully some sort of normalcy. So I just cannot wait
to board my next flight and go to a new destination
soon. All right, so that's enough context about
me. In terms of philosophy,
let's bring everybody on the same page and talk a little bit
about general philosophy behind postmortem
culture. Now, whenever an incident occurs,
the person who's on call, their first priority is to mitigate
the incident, fix the underlying issue,
and then ensure that services return back to their normal
operating conditions. The next big responsibility
is to make sure that the system doesn't break
in the same way again. Now, unless we
have some sort of a formalized process of learning
from these incidents in place, they may reoccur.
Left unchecked, incidents can multiply in complexity or they can sometimes
cascade. This can overwhelm a system and ultimately
impact our users. Now one of the reasons
I decided to share this talk with you today is because
I want to dispel the myth that postmortem culture is only applicable
to big organizations or deep technical problems.
I think as long as one understands the basic framework,
they can use it to address a variety of failures for personal growth,
growth of others, encouraging accountability, sharing knowledge
and best practices, and documenting facts for the long
term so others can learn from it. I know I've
personally used the framework to think through a variety of
tough situations in my personal life also,
when things didn't go according to plan. It can also
be a very effective tool for leadership to be transparent
about popular decisions and document some rationale
behind their unpopular choices.
In almost all the cases I've seen, people really appreciate when
leaders lead by example and are transparent.
Now, introducing such a culture can seem intimidating
at first, but making this change incrementally is possible,
and one can gradually tune the process based on their
organization's needs. One of the best
ways to make use of this talk today is
to think of a recent high impact failure that you
experienced and think about how that situation was
handled. Then apply the principles and best practices
of blameless postmortem culture to that situation and
then imagine if things could have been done a little bit differently.
So what is this postmortem I keep talking about?
This is especially for newer folks. So a postmortem
is a written record of an incident, its impact,
actions taken to mitigate or resolve it, the root cause,
and then very importantly, the follow up actions
to prevent the incident from reoccurring. Now,
let's talk through a simple, high level template. The first
section is all about a quick executive summary.
This can include sections like incident details which
could be simple facts such as incident date, postmortem writing
date, when it was published, and then who has access to the document.
Another subset of these exec updates could be
about the problem summary, duration of the problem, estimated
user impact, detection mechanisms, and resolution. After
exec updates, the next section is background.
This is specifically meant for people who don't have a lot
of context around your postmortem.
Could be new employees or people who've just recently joined your team,
or people from a completely different part of the company,
or people who think your postmortem in
a way could be related to an outage they are dealing with currently.
This section is also important because one of the best practices is
that once a postmortem is written, ideally everybody
should be able to read it throughout the company. Next we go
into sections where concrete details are expected: the
impact of the outage, root cause, trigger, and the recovery
efforts. After that, let's do some deliberate introspection
and document lessons learned: things that went well,
things that did not go well, and then my personal favorite,
where did we get lucky? And this unfortunately happens
more than one would like. Next, we document action items and then what
we want to get out of it. One important problem
with action items is usually they're not very measurable.
It's important that they are objectively measurable.
They should have owners, a priority, and some sort of a tracking
ID. And then the last section is a glossary, again
meant for people who don't have a lot of context about your postmortem
with respect to the terminology that you're using and who may need a
quick reference. In addition to the basic template,
usually it's encouraged that you
over communicate in terms of making the other person
understand what your postmortem is all about. All of these sections need
careful attention and there are best practices around how to
handle each one of them. So now you may
be wondering: this sounds like
too much work, so do we need to write a postmortem
for every single failure? Well, the answer is no, we don't.
Teams need to define postmortem criteria before an
incident occurs so that everyone knows when a postmortem is
necessary. In addition to these objective triggers,
one may also request a postmortem from another team if they
think it's warranted. Personally speaking, over the
years, postmortem culture has become such an innate part of the
SRE culture that we organically know when we can expect
one. Another idea could be to group
small incidents that are similar in nature into
one postmortem for resolution rather than writing one little postmortem
for each incident. Writing a postmortem should ideally
be looked at as a learning opportunity for the entire company and
not as a punishment. It's definitely worth acknowledging
that a good postmortem process does present an inherent
cost in terms of time and/or effort, so one
needs to be deliberate in choosing when to write one.
So the next question is, when do you write one?
Well, depends on your needs and circumstances.
A one-criteria-fits-all approach doesn't really work here.
For example, a team whose primary responsibility is ensuring
the reliability of a website's ecommerce infrastructure
will have distinctly different success
criteria or failure metrics than a team whose
primary responsibility is product development. The two
teams will also have a distinct set of personalities,
making incident management nuances somewhat different.
So some example scenarios could include maybe
business reasons where X number of users are affected,
Y amount of money is lost in revenue, or there
could be loss in user trust due to a marketing error,
or some other reasons. There could be product reasons
for writing a postmortem. For example,
the latest canary shows a significant regression in
a metric, or a risky change was pushed out to production
during a production freeze. Or there could be people reasons.
So let's say a disruptive reorg impacted people's
careers, or improper project management
overloaded people for a long time.
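As a rough illustration of what predefined postmortem criteria might look like when written down, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Incident` fields, the threshold values, and the function names are hypothetical and not taken from the talk. Each team would choose and periodically revisit its own triggers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """Hypothetical incident summary; the field names are illustrative only."""
    users_affected: int = 0
    revenue_lost: float = 0.0
    canary_regression: bool = False
    pushed_during_freeze: bool = False


# Assumed thresholds, standing in for the "X users" and "Y revenue" in the talk.
MAX_USERS_AFFECTED = 1000
MAX_REVENUE_LOST = 10_000.0


def postmortem_triggers(incident: Incident) -> List[str]:
    """Return the name of every predefined criterion that the incident meets."""
    triggers = []
    if incident.users_affected >= MAX_USERS_AFFECTED:
        triggers.append("users_affected")
    if incident.revenue_lost >= MAX_REVENUE_LOST:
        triggers.append("revenue_lost")
    if incident.canary_regression:
        triggers.append("canary_regression")
    if incident.pushed_during_freeze:
        triggers.append("pushed_during_freeze")
    return triggers


def needs_postmortem(incident: Incident) -> bool:
    """A postmortem is warranted when at least one criterion is met."""
    return bool(postmortem_triggers(incident))
```

Returning the list of matched triggers, rather than only a yes or no, keeps the decision explainable: the postmortem can state which agreed criterion caused it to be written.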
What matters is that these criteria are defined
in the first place and periodically revisited,
so one knows when to escalate an incident and when
it becomes an all hands on deck situation. One can also
write a postmortem for personal reasons, for example, that travel
plan went completely awry or I'm feeling burnt out,
et cetera. In terms of defining criteria for
your personal growth also, a solid understanding of your
own self will guide you, and then you'll know when to write a
postmortem. So, so far we've talked in detail about
what we should do. It sounds pretty straightforward. Have some
predefined postmortem writing criteria. If an incident
occurs, fill out a template, create a written record of an incident,
create some action items, and then magically all our problems are
solved. But is it really that simple? To
answer the question, the next thing I want to emphasize
is the importance of ensuring psychological safety.
Now, writing a postmortem just for the sake of writing a document
is not enough. In fact, if not written
well, it can be counterproductive to your culture.
It could lead to an atmosphere where incidents or issues are swept
under the rug, leading to a greater risk for your company.
So to make the best of postmortem culture, it's absolutely
imperative to keep them blameless.
So a blamelessly written postmortem assumes that
everyone involved in an incident had good intentions
and did the right thing with the information they had.
It focuses on what went wrong instead of who
did wrong. This is important
because basic human nature makes accepting failure
very difficult. Sometimes these mistakes can cost your
company millions of dollars or euros. And accepting
that something you did caused it can be
very embarrassing and distressing for individuals. In
a very toxic organization, publicly acknowledging
mistakes can sometimes cost people their careers.
Now, all of these fears will make it pretty much impossible
for a postmortem to be valuable, fact based
and objective. Now, the best practice to combat
this fear is to keep your postmortems blameless.
But why? Now, we as humans fear
public humiliation. Speaking up under normal
circumstances is hard enough. Speaking up under
immense pressure, as in when an ongoing incident
is happening, is much harder, and people
generally tend to avoid the spotlight for the fear of being ridiculed.
Now, ironically, a crisis demands that someone step
up, think differently, and also speak. Also remember
that postmortems are artifacts.
So now we've increased the complexity of the problem.
Not only does one have to risk isolating themselves in difficult
circumstances, they also fear being documented in history.
So what if I make a mistake? Everyone will know.
Will I be fired? Will I not get promoted? Will my future
employees make fun of me? These are all very valid
concerns. And then there's the
potential to curb innovation and autonomy.
Now, dreaming big and working on revolutionary ideas brings
a certain amount of associated risk, which may very well result
in failures. If a person doesn't feel psychologically
safe in taking calculated risks, they may never
act on those ideas and then innovation will probably stagnate.
This also takes away from that person's autonomy, as
they'll tend to just follow instructions from their managers or someone else
instead of being creative themselves. We cannot
create future visionaries and leaders like this. Now it
may seem like a good idea to highlight individuals while
describing an outage in a postmortem.
Instinctively, it feels like we are assigning ownership
to someone, which may then motivate the individual
to take responsibility. But the big
risk in doing so is individuals would become risk-averse
because they may fear public humiliation.
This can lead to people covering up facts, risking transparency,
which could be critical to understanding an issue and
preventing it from reoccurring. And when it comes to
personal circumstances, blaming yourself for, let's say
a personal outage is not great either.
Have faith in yourself, assume best intent,
and know that you acted to the best of your abilities, given the
circumstances and the context around you.
Blameful behavior can be pretty detrimental from a
people perspective, but blameful
behavior can be very problematic from a business
perspective also. When mistakes are
hidden, systemic fixes are harder and problems
are more likely to reoccur. Also, blaming humans
sometimes tends to result in "fire human" as an action
item. Now for a moment, even if we ignore
that that may not be the right thing to do. It still
doesn't prevent reoccurrence of a problem. If the
system is set up to allow the human to make the mistake in the
first place, there's a higher probability that a newly
hired and less experienced human will
repeat that outage. So blameful behavior is pretty
detrimental from a business perspective also.
All right, so we all get the idea of
blamelessness. Let's work on an example together and
see how we can transform accusations into simple
facts. So, in terms of this exercise,
remember to focus on what went wrong
instead of who did wrong. So let's
say there's a hypothetical incident, and then
there's a hypothetical postmortem that was written, and this
was noted as the root cause. So this says,
Dylan One did not bother to set up alerting for our storage
cluster or check our hard drives manually in case of doubt.
Of course, we ran out of space and then this disaster
ensued. Now, maybe pause
this video, take a couple of minutes, and then think
if there's anything wrong with the way this is written in the postmortem.
All right, I hope you've thought about it a little bit so
I can give my perspective here. So not only is this example
very blameful, where clearly one person is being scapegoated,
it's also quite dramatic. So phrases such as "disaster"
or "of course, that happened" add absolutely no value
to your document. What it does is erode psychological
safety and make your culture toxic. You can
be sure that Dylan One is not going to speak up when
another outage happens, and the company could probably lose
out on some valuable information.
A better way to express the gravity of the situation would
be to gather facts and document them objectively
with metrics. So here's an example
of a more objective, blameless version.
So the hard drives in one of the trading storage clusters ran out
of space, and this was not noticed due to a lack of alerting.
This led to trades from that cluster being redirected to other trading clusters,
which were also almost full. And an action
item that could come out of this is to set up alerting for this scenario,
and that action item must have an owner and some sort of a tracking
ID. Let's move on to
leadership. Now, I'm assuming we're all leaders
here, so let's take a look at some
of the ideas we can propose and implement to
foster a blameless postmortem culture in our surroundings.
Now, one of the best things all of us can do is lead by example.
You, as leaders, can start writing a few postmortems yourself.
This will give you an opportunity to experiment with
what format or template works best for you.
This is also an opportunity to set the tone of the postmortem
culture you'd like to see. Once you've
written an artifact, share it openly, let the
readers digest the content of your documents, and then come back
to you with questions and feedback. This is another opportunity
to experiment with your postmortem format.
Now once people warm up to the concept and see you doing
the right thing, they'll probably be encouraged to do some themselves.
Now, once folks have accumulated a few postmortems
over time, you can use this wealth of knowledge as training
material for future leaders. One example
is an exercise called the wheel of misfortune.
Think of it as disaster roleplay in which a previous
postmortem is reenacted with a cast of employees
playing roles as laid out in the postmortem.
The aim is to make the experience as real as possible and
make the educational portion of ramping up as fun as
possible. Now, if you or your senior
management think it's appropriate, you can go a step further
and create a culture of rewarding people for doing the
right things. So some examples could be
highlight well written postmortems in company or team meetings.
Reward closure of postmortem action items if
action items are documented but not acted upon,
what's the point? Also acknowledge
that making systems more reliable is high impact work,
and then reward such work in your performance assessments. Regarding
postmortem owners as leaders can be very motivating.
Setting up the postmortem owner as the go to person for
resolving or mitigating a failure can be very rewarding for
some people. If your senior management is on board,
it is okay to even publicly celebrate a person who
actually was the proximal cause of a big outage,
but then who also acted effectively and sensibly
to either escalate or troubleshoot or mitigate
or resolve that outage, and who honestly then documented system
failures that allowed them to make that mistake.
This is again to emphasize that even if it's clearly a person
who can be held responsible, it was always a
set of conditions and system failures that allowed things to
happen in the first place. You can also celebrate people
for finding and fixing reliability bugs.
Another idea is to invite folks to talk about failures and
lessons learned on a high visibility platform.
This could help normalization of conversations
around failure. All of these ways are
some great examples to instill psychological safety in
current and especially newer members of the team.
You can do even more. So let's talk about that.
So remember this slide we looked at previously. So now you've
done all this postmortem work, you're all experts at blamelessness,
and then you have a magically blameless org,
but the work is not done yet. We still haven't completed
the last step. We still haven't
made sure that the same incident never occurs again.
Now, as a part of writing your blameless postmortem,
you documented many action items. So what happened to those?
Remember, the overarching reason we are doing all this work is
to prevent incidents from reoccurring. And if action
items remain unaddressed, you can be sure that incidents
are going to reoccur and maybe on a much bigger scale,
since complexity of our systems and processes is not
going down anytime soon. So in order
to do so, review postmortems and action items periodically.
These action items can be around adding more monitoring, automated rollbacks,
automated release gating, or even more refactoring of
the existing code base. If a
company doesn't have preexisting tooling around postmortem action item
management, one can look at third party tools. Also,
what process you follow, how simple or hard it is
can depend on your team, but it's important that you
do it in the first place. Otherwise, what's the point of accumulating
these action items in your postmortem? Sometimes recurring
postmortem reviews can also give you a sneak peek into maybe
some themes that are emerging, and you can then combine those themes
into a bigger project and then prioritize fixes accordingly.
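To make the periodic review a little more concrete, here is a minimal Python sketch of tracking action items and surfacing both overdue work and emerging themes. It is an illustration under stated assumptions, not a description of any particular tool: the `ActionItem` fields (including the `theme` label and `due` date), the priority convention, and both function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    """Hypothetical action item with an owner, a priority, and a tracking ID."""
    tracking_id: str
    owner: str
    priority: int  # assumed convention: 0 is the most urgent
    theme: str     # free-form label such as "alerting" or "rollback"
    due: date
    closed: bool = False


def overdue_items(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items past their due date, most urgent first, then oldest first."""
    late = [item for item in items if not item.closed and item.due < today]
    return sorted(late, key=lambda item: (item.priority, item.due))


def recurring_themes(items: List[ActionItem], min_count: int = 2) -> List[str]:
    """Themes shared by at least min_count open items, most common first."""
    counts = Counter(item.theme for item in items if not item.closed)
    return [theme for theme, n in counts.most_common() if n >= min_count]
```

A review meeting could walk the output of `overdue_items` first, then use `recurring_themes` to decide whether several small fixes belong in one larger project.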
Now, with a growing number of postmortems being written every quarter,
the next big stretch goal could be creation of tools
to aggregate these postmortems and identify common themes
and areas for improvement across product boundaries.
One can work on machine learning projects where they
can teach a model to preempt outages based on past patterns.
Encourage folks
to keep their postmortems accessible by all
as a default. This will encourage transparency and
reinforce the concept of using failures or postmortems
as a learning opportunity for the company.
And then let's not forget that without intentional nurturing,
any sort of culture can ultimately fizzle out.
The breakdown of postmortem culture may not always be obvious.
Following are some common failure patterns and then recommended
solutions that might work. So, lack of
time: quality postmortems take time to write. When
a team is overloaded with other tasks, the quality
of postmortems suffers. So prioritize postmortem
work, track its completion and review, and allow teams
adequate time to implement the associated plan.
If teams are experiencing failures that mirror previous
incidents, it could be time to dig a bit deeper.
So consider asking questions like are
action items taking too long to close? Are we
biased towards feature velocity over reliability?
Are the right action items being captured in the first place?
Or maybe the service is overdue for a refactor?
Next point is blameful language while speaking with each other.
Responding when someone uses blameful language can be very
challenging, especially if a person more senior
than you is doing so. One trick is to
mitigate the damage by moving the narrative in a more constructive
direction, reminding them that investigating
the source of misleading information is much
more beneficial than assigning blame.
And then disengaging from the postmortem process is
another sign that postmortem culture is
starting to fail. If folks seem content with
not discussing failures and maybe avoid the issue
altogether, it may be time to reinforce blamelessness
and ensure that high-visibility postmortems are reviewed
for possible blameful wording.
All right, this brings me to my final slide.
In case you don't remember anything I said in the last
30 to 40 minutes, I hope these key takeaways
stay with you. So number one,
the cost of failure is education.
Number two, keep your postmortems blameless.
Focus on the system, not the people.
And then lastly, when written well,
acted upon and widely shared,
blameless postmortems can be a very effective tool
for driving positive cultural changes and also preventing
reoccurring errors. Thank you.