Transcript
Hello and welcome to my talk. We can't all be Shaq.
Today we'll dive into why incident heroes need to pass the ball. My
name is Malcolm Preston. I've been in the technology world for most of
my life, since my father bought a Xerox personal computer when I was a child.
I've held many different titles over the years, but all heading generally
towards the same outcome. My passion is to make complex systems work.
Currently, I work at FireHydrant, where we envision a world where all software
is reliable. FireHydrant operates in the incident management space, and
reliable software lines up perfectly with my other passion for getting a good night's
sleep. This talk will not focus on some specific
technical product we use or how we scale to accommodate a huge increase in customer
traffic. Instead, this talk will focus on basketball.
Now bear with me. I know this isn't the topic of most tech talks,
but my hope is within a few minutes it'll become clear how incident response
and basketball go together like copy and paste. I enjoy
watching sports, not so much to root for a particular team,
but I enjoy watching how teams adapt and work together, especially when
they're facing adversity. I've often been intrigued over
the years why certain companies seem better than others in various aspects
of their operations, similar to sports teams. At
a going-away party for a job I was leaving a few years back,
the VP of engineering told a story I didn't even remember, but that I know
subconsciously shaped how I viewed my role in the team.
While sitting in a new-hire orientation session toward the end
of my very first day at the company, an issue arose with some internal system.
With zero context, I pulled out my laptop, figured out what was going on,
and helped fix the issue. After that, every time there was
an incident, the VP said as soon as he saw my name in the Slack
chat, he felt a sense of relief and knew we'd be okay.
My story isn't unique. In fact, most organizations I talk
to have a handful of people or teams who swoop in and save the
day when a technical crisis arises. When the 2:00 a.m.
outage page goes out, they're the first ones to respond.
These nebulous heroes find the problem, determine the affected areas,
fix the issue, or know who to call to fix the issue. Wake up the
VP, draft messages to send to customers and other stakeholders, then create
tickets to address why things went bad. The next day at
9:00 a.m., they go back to the job they were hired to do. Their backs
are patted and life goes on, back to normal, until the next
2:00 a.m. page. I like to equate these people to Shaq.
Shaquille O'Neal is seven feet one inch tall and weighs well over 300
pounds. Any team he played for knew if they desperately
needed two points, they could pass the ball to Shaq and a thunderous
dunk was likely to follow. I'd wager most of you either know someone
who was a Shaq or you play that role yourself.
These incident response heroes aren't usually seven feet tall,
but when a moment arises when the team needs a quick victory, these people always
seem to come through in the clutch. If you've been this person at your company,
you know how good that can feel. You also know how exhausting it
can be. Why is this a problem? As individuals,
we bring our talents every day with a goal of some kind of personal satisfaction.
That satisfaction could be monetary compensation, pride in doing good
work, helping the team, changing the world, or any other
personal motivation. A team brings these individuals together
under some guidance with a goal of winning games while they're together.
In sports, that could be a year or a season; in your company, that
time frame could revolve around quarterly goals or maybe
longer-term projects. An organization continually
sponsors teams with the goal of repeating success on a long term
basis year over year. It's true when you have individuals
operating like Shaq, games are getting won,
incidents are getting resolved, but are you truly setting your organization
up for longer term success by just doing it for them every time?
The problem for a lot of organizations is Shaq makes
dunking look easy, incidents are being remediated,
and nobody else is really feeling the pain, so much so that your
company might not think there's a problem. And without the tools
or headcount to make a sweeping change, it turns into a self-fulfilling
prophecy. There's no great system in place, but you know
what to do, so you just end up continuing to do it.
This negates the need for a system and puts the onus back on you.
Even as an individual contributor, there needs to be an understanding of what
is best for the organization long term. While it
may feel good to be the center of attention and a key contributor to success,
establishing long term sustainable processes will help the organization
in the future. I'd like to share
another personal anecdote about the time I got a new observability tool adopted
at a previous employer. As a company, we became heavily
invested in self-hosting a NoSQL database. I had
used that product in the past and became one of the de facto resources
when anything bad or puzzling happened that might be related to that database
platform. At some point I realized I had become a bottleneck
in terms of support, and to be honest, I felt a little uncomfortable with the
amount of stress that brought knowing I'd be integral in just about any issue involving
these database deployments, which had proliferated to just about every
microservice we created. I researched and
advocated for a database monitoring analysis tool that could help
the rest of our teams self service and bring more reliability to the systems
that they were responsible for. When I was met with skepticism
around why we even needed a product like this, I realized no one
else thought there was a problem to solve, because every time there was an incident,
Shaq took care of it. In a pre-COVID world, I often gleaned
a lot of peripheral information around what other teams were doing. By happenstance,
someone might want to introduce a new infrastructure component they weren't familiar with,
and they might ask for advice. Or I might hear about some upcoming change or
challenge just from casual conversation.
I'd say the vast majority of the service dependency matrix in my head at past
companies originated from water cooler talk.
In a past era, it might have been through smoke break conversations.
Unfortunately, in a post-COVID world, most of us are working
in further siloed environments just because we rarely see our
coworkers face to face, and in some cases, there's less communication outside
of immediate work related conversations.
All of this heightens the need to codify what incident responders do
and who should be involved in various scenarios. One hero
knowing where all the bodies are buried, swooping in, and saving the
day is becoming less of a reality. Times change,
technology changes, and platforms are becoming more and more complex.
What worked in the past doesn't always work in the present or
in the future. Sometimes Shaq gets
tired. Sometimes Shaq gets hurt.
Burnout is a real thing, and let's
be honest, sometimes Shaq isn't the easiest team member to get
along with. Just ask Kobe. So
where do we go from here?
It's time to pass the ball. To get out of this cycle of heroes,
you as an individual have to commit to changing reliability culture
at your organization. I always recommend people start with
a few small steps and then grow from there.
First, if the contents of your company's service
catalog, dependencies, documentation, and incident management
communication workflow are all in your head, you're in trouble.
What happens when you're not around? Take the first step toward formalizing
an incident management runbook. This doesn't have to look like setting
aside a full day to write a step by step process. Instead,
take the start small approach of talking to yourself during an
incident. The next time you respond, start a thread in the incident channel where
you literally just think out loud, be over communicative,
and don't assume your teammates understand why you're taking the actions you are.
Think in terms of: I just got paged. What's the first thing I do?
Where are the places I look to check on the statuses of our services?
How do I know who to call when I discover what service is down?
How do I know how to revert the last deploy for that service? What impact
does this incident have on customers and internal teams?
What are my thoughts on how to fix this issue going forward?
These are all questions you're answering in your head in an organic way
because you're the one who knows how to do this. By documenting your process,
you're taking the first steps toward getting that info out of your head and
eventually into an incident management tool or company wiki, breaking that silo.
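To make that concrete, here's one way to nudge the think-out-loud habit along: a small script that opens a thread in the incident channel and drops those same prompts in for you to answer as you work. This is only a sketch; it assumes you're on Slack with the official slack_sdk Python client, and the channel name, token variable, and prompt wording are placeholders for whatever your team actually uses.

```python
# A minimal sketch of "talking to yourself" in the incident channel.
# Assumes Slack and the official slack_sdk client; the channel name,
# token env var, and prompt wording are illustrative placeholders.
import os

from slack_sdk import WebClient

PROMPTS = [
    "I just got paged. What's the first thing I do?",
    "Where do I look to check the status of our services?",
    "Who do I call once I know which service is down?",
    "How do I revert the last deploy for that service?",
    "What impact does this have on customers and internal teams?",
    "What are my thoughts on how to fix this going forward?",
]


def start_thinking_out_loud(channel: str = "#incident-example") -> None:
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

    # Start a thread so the narration stays in one place for the write-up later.
    parent = client.chat_postMessage(
        channel=channel,
        text="Responder notes: I'll narrate what I'm doing and why in this thread.",
    )

    # Post each prompt; answer them in replies as the incident unfolds.
    for prompt in PROMPTS:
        client.chat_postMessage(
            channel=channel,
            text=prompt,
            thread_ts=parent["ts"],
        )


if __name__ == "__main__":
    start_thinking_out_loud()
```

Even without automating anything, pasting those prompts into the thread by hand gets you the same raw material for a runbook.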
Step two:
you know who looks up to Shaq? Everyone. And the same
is probably true of you. Become an incident commander and give the other members
of your team the opportunity to learn to expand their own skills and
to bask in that hero's glow. A team benefits from players who
have different roles and specialties, all working together under
sound coaching and organizational direction. The best responders
facilitate communication and collaboration, so the next time an
incident arises, instead of taking on the Shaq role, be a point guard
or a coach. Simply hang back, provide assists
and guidance, give people the info they need, or better yet,
help them figure out where to find it, and lend a hand when you're
asked. Once you've done this a couple of times, maybe miss a
game or two: see if not being on call is an option, for the sake
of your own health but also to help others step up.
Instead, work on a special project, like building on the
documentation you started formalizing in the first step.
There's no better way to flag a weakness up the chain of command than by
demonstrating what happens when you're not there to fix things, and that's
a surefire way to direct some resources toward incident management.
Step three: question the status quo. Let's talk
about the big Os: on-call and observability.
What does on-call look like at your company? Are you alerting
on every down instance or only on the ones that impact your customers?
Are you using service level objectives? If not,
what's standing in the way of adopting them? Depending on how noisy your
alerting systems are, being on call for even a week might be a totally unreasonable
amount of time.
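If SLOs feel abstract, the arithmetic behind them isn't. Here's a rough sketch of how an error budget turns "is this worth paging someone for?" into a question with a numeric answer; the target, traffic numbers, and threshold below are invented for illustration, not taken from any real system.

```python
# A back-of-the-envelope look at the gap between "a host is down"
# and "customers are hurting." All numbers here are made up; plug in
# your own SLO target and telemetry.

SLO_TARGET = 0.999            # 99.9% of requests should succeed in the window
WINDOW_REQUESTS = 10_000_000  # total requests observed in the SLO window
FAILED_REQUESTS = 4_200       # requests that actually failed for customers

error_budget = (1 - SLO_TARGET) * WINDOW_REQUESTS  # failures we can "afford"
budget_spent = FAILED_REQUESTS / error_budget      # fraction of the budget used

print(f"Error budget for the window: {error_budget:,.0f} failed requests")
print(f"Budget consumed so far:      {budget_spent:.0%}")

# Alert on budget burn, not on every down instance: a dead replica behind a
# healthy load balancer spends no budget and shouldn't wake anyone up.
if budget_spent > 0.5:
    print("Page someone: customers are feeling this.")
else:
    print("Log it and fix it during business hours.")
```

The specific numbers don't matter; the point is that an instance falling over without customers noticing shouldn't cost anyone sleep.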
What observability tools are you using? Can other teammates easily find the relevant data they need to triage
an incident? Lastly, what tools
are you using to manage your incidents? How are incidents
declared and under what circumstances? Are you conducting retrospectives?
And if so, how do they influence product roadmaps?
Why were any of these tools chosen? When and by whom?
And do these tools fit the way your team works?
We have to improve the on call and incident response experience for our engineers in
order to reduce burnout and retain our colleagues. We also
want to be able to assemble a winning team even when the seven-foot-one-inch
star is unavailable. This will allow
us to thrill our customers with the projects that directly impact
revenue-producing products and features. Direct your hero energy to making the entire
process better, not just remediating ad hoc incidents.
These first steps are all moving toward a common goal, and that's to move away
from whack-a-mole-style incident response to more strategic and
holistic incident management. If you're playing the hero
role at your organization, you might be unintentionally masking the need
for better incident management practices.
This isn't your fault, though, and you're not alone. By helping
our companies shift toward a better incident management posture, we can improve
things for our customers, for our teammates, and for ourselves.
Thank you for joining.