Conf42 Site Reliability Engineering 2022 - Online

We can’t all be Shaq: why it’s time for the SRE hero to pass the ball and how to make it happen


Abstract

Shaquille O’Neal is one of the most celebrated NBA players of all time — and for good reason. When the team needed to put up quick points, they knew they could throw the ball to Shaq, and let him go to work. The skills are different, but there are a lot of engineers playing the Shaq role at their company. They’re the heroes who come in at 2 a.m. knowing just what to do to remediate fast and get back on track.

Although that might win games and resolve incidents, it’s not setting your team up for sustained success. Attend this talk to understand why it’s time to pass the ball and learn three ways you can take fast action to get there today.

Summary

  • Malcolm Preston talks about why incident heroes need to pass the ball. Most organizations have a handful of people who swoop in and save the day when a technical crisis arises. But are you truly setting your organization up for long-term success by doing it for them every time?
  • In a post-Covid world, most of us are working in increasingly siloed environments. Take the first step toward formalizing an incident management runbook. Become an incident commander and give the other members of your team the opportunity to learn and expand their own skills.
  • Step three: question the status quo. What does on call look like at your company? What tools are you using to manage your incidents? By helping our companies shift toward a better incident management posture, we can improve things for our customers, for our teammates, and for ourselves.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to my talk, We Can't All Be Shaq. Today we'll dive into why incident heroes need to pass the ball. My name is Malcolm Preston. I've been in the technology world for most of my life, since my father bought a Xerox personal computer when I was a child. I've held many different titles over the years, but all heading generally toward the same outcome: my passion is to make complex systems work. Currently, I work at FireHydrant, where we envision a world where all software is reliable. FireHydrant operates in the incident management space, and reliable software lines up perfectly with my other passion for getting a good night's sleep.

This talk will not focus on some specific technical product we use or how we scale to accommodate a huge increase in customer traffic. Instead, this talk will focus on basketball. Now bear with me. I know this isn't the topic of most tech talks, but my hope is that within a few minutes it'll become clear how incident response and basketball go together like copy and paste.

I enjoy watching sports, not so much to root for a particular team, but because I enjoy watching how teams adapt and work together, especially when they're facing adversity. I've often been intrigued over the years by why certain companies seem better than others in various aspects of their operations, similar to sports teams. At a going-away party for a job I was leaving a few years back, the VP of engineering told a story I didn't even remember, but that I know subconsciously shaped how I viewed my role in the team. While I was sitting in a new hire orientation session toward the end of my very first day at the company, an issue arose with some internal system. With zero context, I pulled out my laptop, figured out what was going on, and helped fix the issue. After that, every time there was an incident, the VP said, as soon as he saw my name in the Slack chat he felt a sense of relief and knew we'd be okay.

My story isn't unique. In fact, most organizations I talk to have a handful of people or teams who swoop in and save the day when a technical crisis arises. When the 2 a.m. outage page goes out, they're the first ones to respond. These nebulous heroes find the problem, determine the affected areas, fix the issue or know who to call to fix it, wake up the VP, draft messages to send to customers and other stakeholders, then create tickets to address why things went bad. The next day at 9 a.m. they go back to the job they were hired to do. Their backs are patted and life goes on, back to normal until the next 2 a.m. page.

I like to equate these people to Shaq. Shaquille O'Neal is seven feet one inch tall and weighs well over 300 pounds. Any team he played for knew if they desperately needed two points, they could pass the ball to Shaq and a thunderous dunk was likely to follow. I'd wager most of you either know someone who is a Shaq or you play that role yourself. These incident response heroes aren't usually seven feet tall, but when a moment arises when the team needs a quick victory, these people always seem to come through in the clutch. If you've been this person at your company, you know how good that can feel. You also know how exhausting it can be.

Why is this a problem? As individuals, we bring our talents every day with a goal of some kind of personal satisfaction. That satisfaction could be monetary compensation, pride in doing good work, helping the team, changing the world, or any other personal motivation.
A team brings these groups of individuals together under some guidance with a goal of winning games while they're together. In sports, that could be a year or a season; in your company, that time frame could revolve around quarterly goals or maybe longer-term projects. An organization continually sponsors teams with the goal of repeating success on a long-term basis, year over year.

It's true: when you have individuals operating like Shaq, games are getting won and incidents are getting resolved. But are you truly setting your organization up for longer-term success by just doing it for them every time? The problem for a lot of organizations is that Shaq makes dunking look easy. Incidents are being remediated and nobody else is really feeling the pain, so much so that your company might not think there's a problem. And without the tools or headcount to make a sweeping change, it turns into a self-fulfilling prophecy. There's no great system in place, but you know what to do, so you just end up continuing to do it. This negates the need for a system and puts the onus back on you. Even as an individual contributor, there needs to be an understanding of what is best for the organization long term. While it may feel good to be the center of attention and a key contributor to success, establishing long-term, sustainable processes will help the organization in the future.

I'd like to share another personal anecdote about the time I got a new observability tool adopted at a previous employer. As a company, we became heavily invested in self-hosting a NoSQL database. I had used that product in the past and became one of the de facto resources when anything bad or puzzling happened that might be related to that database platform. At some point I realized I had become a bottleneck in terms of support, and to be honest, I felt a little uncomfortable with the amount of stress that brought, knowing I'd be integral in just about any issue involving these database deployments, which had proliferated to just about every microservice we created. I researched and advocated for a database monitoring and analysis tool that could help the rest of our teams self-service and bring more reliability to the systems that they were responsible for. When I was met with skepticism around why we even needed a product like this, I realized no one else thought there was a problem to solve, because every time there was an incident, Shaq was there.

In a pre-Covid world, I often gleaned a lot of peripheral information around what other teams were doing. By happenstance, someone might want to introduce a new infrastructure component they weren't familiar with, and they might ask for advice. Or I might hear about some upcoming change or challenge just from casual conversation. I'd say the vast majority of the service dependency matrix in my head at past companies originated from water cooler talk. In a past era, it might have been through smoke break conversations. Unfortunately, in a post-Covid world, most of us are working in further siloed environments, simply because we rarely see our coworkers face to face, and in some cases there's less communication outside of immediate work-related conversations. This facilitates the need to codify what incident responders do and who should be involved in various scenarios. One hero knowing where all the bodies are buried, swooping in, and saving the day is becoming less of a reality. Times change, technology changes, and platforms are becoming more and more complex.
What worked in the past doesn't always work in the present or in the future. Sometimes Shaq gets tired. Sometimes Shaq gets hurt. Burnout is a real thing, and let's be honest, sometimes Shaq isn't the easiest team member to get along with. Just ask Kobe. So where do we go from here? It's time to pass the ball. To get out of this cycle of heroes, you as an individual have to commit to changing reliability culture at your organization. I always recommend people start with a few small steps and then grow from there.

First, if the contents of your company's service catalog, dependencies, documentation, and incident management communication workflow are all in your head, you're in trouble. What happens when you're not around? Take the first step toward formalizing an incident management runbook. This doesn't have to look like setting aside a full day to write a step-by-step process. Instead, take the start-small approach of talking to yourself during an incident. The next time you respond, start a thread in the incident channel where you literally just think out loud. Be over-communicative, and don't assume your teammates understand why you're taking the actions you are. Think in terms of: I just got paged. What's the first thing I do? Where are the places I look to check on the statuses of our services? How do I know who to call when I discover what service is down? How do I know how to revert the last deploy for that service? What impact does this incident have on customers and internal teams? What are my thoughts on how to fix this issue going forward? These are all questions you're answering in your head in an organic way because you're the one who knows how to do this. By documenting your process, you're taking the first steps to getting that info out of your head and eventually into an incident management tool or company wiki, breaking that silo.

Step two: you know who looks up to Shaq? Everyone. And the same is probably true of you. Become an incident commander and give the other members of your team the opportunity to learn, to expand their own skills, and to bask in that hero's glow. A team benefits from players who have differing specific roles and specialties, all working together under sound coaching and organizational direction. The best responders facilitate communication and collaboration, so the next time an incident arises, instead of taking on the Shaq role, be a point guard or a coach. Simply hang back, provide assists and guidance, give people the info they need, or better yet, help them figure out where to find it, and lend a hand when you're asked. Once you've done this a couple of times, maybe miss a game or two: see if not being on call is an option, for the sake of your own health, but also to help others step up. Instead, work on a special project, like taking the initial efforts toward formalizing the documentation you started from the first step. There's no better way to flag a weakness up the chain of command than by demonstrating what happens when you're not there to fix things, and that's a surefire way to direct some resources toward incident management.

Step three: question the status quo. Let's talk about the big Os: on call and observability. What does on call look like at your company? Are you alerting on every down instance, or only the ones that impact your customers? Are you using service level objectives? If not, what's standing in the way of adopting them?
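To put a rough number on that SLO question, here is a minimal sketch, with purely illustrative figures and not anything prescribed in the talk, of how an availability target translates into an error budget:

```python
# Illustrative sketch only: how an availability SLO becomes an error budget.
# The targets and "bad minutes" below are made-up example numbers.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over a rolling window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)

def budget_remaining(slo_target: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the budget is blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget

if __name__ == "__main__":
    # A 99.9% availability SLO over 30 days allows roughly 43 minutes of impact.
    print(round(error_budget_minutes(0.999), 1))              # ~43.2
    # If incidents have already burned 30 bad minutes, about 31% of the budget remains.
    print(round(budget_remaining(0.999, bad_minutes=30), 2))  # ~0.31
```

Alerting on error budget burn, rather than on every down instance, is one common way to cut pager noise.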
Depending on how noisy your alerting systems are, being on call for even a week might be a totally unreasonable amount of time. What observability tools are you using? Can other teammates easily find the relevant data they need to triage an incident? Lastly, what tools are you using to manage your incidents? How are incidents declared, and under what circumstances? Are you conducting retrospectives, and if so, how do they influence product roadmaps? Why were any of these tools chosen, when, and by whom? And do these tools fit the way your team works?

We have to improve the on-call and incident response experience for our engineers in order to reduce burnout and retain our colleagues. We also want to be able to assemble a winning team even when the seven-foot-one-inch star is unavailable. This will allow us to thrill our customers with the projects that directly impact revenue-producing products and features. Direct your hero energy toward making the entire process better, not just remediating ad hoc incidents. These first steps are all moving toward a common goal, and that's to move away from whack-a-mole-style incident response to more strategic and holistic incident management. If you're playing the hero role at your organization, you might be unintentionally masking the need for better incident management practices. This isn't your fault, though, and you're not alone. By helping our companies shift toward a better incident management posture, we can improve things for our customers, for our teammates, and for ourselves. Thank you for joining.
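As a companion to step one, here is one hypothetical way to capture the "think out loud" prompts from the talk as a reusable template; the service name and checklist format are made up for illustration and are not tied to any particular incident management tool:

```python
# Hypothetical sketch: turn the step-one prompts into a template that can be
# pasted into an incident channel or wiki, so the knowledge lives outside one
# person's head. Prompts are taken from the talk; everything else is illustrative.

RUNBOOK_PROMPTS = [
    "I just got paged. What's the first thing I do?",
    "Where do I check the statuses of our services?",
    "Who do I call once I know which service is down?",
    "How do I revert the last deploy for that service?",
    "What impact does this incident have on customers and internal teams?",
    "What are my thoughts on how to fix this going forward?",
]

def runbook_template(service: str) -> str:
    """Render the prompts as a checklist for one service."""
    lines = [f"Incident runbook: {service}", ""]
    lines += [f"- [ ] {prompt}" for prompt in RUNBOOK_PROMPTS]
    return "\n".join(lines)

if __name__ == "__main__":
    # "checkout-api" is a placeholder service name.
    print(runbook_template("checkout-api"))
```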
...

Malcolm Preston

Staff Software Engineer @ FireHydrant



