Conf42 Site Reliability Engineering 2021 - Online

Evangelizing the SRE mindset: Building a culture of reliability and ownership

Abstract

Most engineers respond to messages or emails from an SRE or security engineer with disdain. They often see the work of these teams as another hurdle to getting code out the door and a tax on their productivity.

We know they’re wrong. We need to spread the SRE mindset and approach to all engineering teams and pivot their thinking towards “How can I build a solution that is resilient, secure, and scalable?”, and “How can I partner with my SRE and security teams to make this a reality?”.

This talk will take a deep dive into the core principles of SRE thinking and how to create a culture of reliability and ownership, with practical takeaways that you can use with your own teams.

Key Discussion Points (Outline)

  • How do SREs define their role?
  • How do engineers define the SRE role?
  • How do we bring these two together?
  • What does it mean to foster a culture of reliability and ownership?
  • How can you apply this to your delivery machine?
  • What products or services can help achieve these goals?

Cortex is a Platinum Sponsor of the conference.

Summary

  • You can enable your DevOps for reliability with Chaos Native. Create a free account at Chaos Native.
  • Cristina is a founding engineer at Cortex. Cortex is building a tool to help teams manage their services. Cristina talks about evangelizing the SRE mindset and building a culture of reliability and ownership.
  • Today we're going to go through best practices to enable engineering organizations to deliver reliable, performant, and secure systems. Both SREs and engineers emphasize a focus on efficiency, scalability, and reliability. Today's leading technology companies are empowering SREs to level up their engineering organizations.
  • What does it mean to foster a culture of reliability and ownership? The first and most important thing is ownership. People should worry about letting their customers down, not about getting yelled at by their boss. At the end of the day, it's all about the customers and your users.
  • The average developer spends more than 17 hours a week dealing with maintenance issues. Bad code is debt, and debt is something that all engineering leaders understand. There are three ways to handle that debt.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE? A developer? A quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud.

Hi, welcome to my talk about evangelizing the SRE mindset and building a culture of reliability and ownership. I'm Cristina. I moved to the US from Colombia when I was seven, and I studied at the University of Pennsylvania, where I majored in systems engineering and computer science. I started my career at Bridgewater Associates, a hedge fund in Connecticut, and while I was there I was building a portfolio analytics platform and an editorial news site. Afterwards I joined Cortex as a founding engineer and the second hire, and I've been partnering with engineering leads and SREs ever since. Cortex is building a tool to help teams manage their services and define best practices across their organizations.

So enough about me, let's actually get into the talk. Today we're going to be covering the changing role of the SRE at leading technology companies and go through best practices to enable engineering organizations to deliver reliable, performant, and secure systems.

So how do SREs define their role? If you look at the quote above, you'll see that the focus is efficiency, scalability, reliability, and systemically delivered solutions. And how do engineers define the SRE role? Based on this quote, you'll see that there's a focus on help with daily technical challenges, providing tools and metrics, performance, resiliency, and management visibility into team performance. So these definitions are pretty similar. That's great. And obviously I don't want to overemphasize the findings in a sample of two quotes, but if you bear with me for a moment, you'll see that both SREs and engineers emphasize a focus on efficiency, scalability, and reliability. Let's unpack where these definitions actually differ. The engineer seems to have a broader definition of the role than the SRE does in this particular example. Both mention visibility into how the system works and fits together, but the engineer uses the word daily, which I find particularly interesting. They also mention that the SRE team is a conduit between engineering teams and management, and that's different. And that has changed recently. Before, when we thought about SRE teams, we would think of them as the Kubernetes expert or the AWS expert. Those days are pretty much over. Today's leading technology companies are empowering SREs to actually level up their engineering organizations and have an impact on how the engineering team works and what it means to work together.

So what does it mean to foster a culture of reliability and ownership? One of the first things that I learned at my first job was who our systems were for. There, I was working for under-resourced state pension funds, for teachers and firefighters who relied on our system to make allocation decisions that would impact millions of retirees, or sovereign wealth funds charged with the financial stability and security of nations. That made the expectations on my systems very real. We operated under the tried-and-true four nines principle, which allows for less than an hour of downtime a year. That's 53 minutes of downtime, to be exact. The first time that I brought down production, it was a gut-wrenching feeling.
I had this in mind, and it absolutely sucked. I knew that I had let down my team, I had let down myself, I had let down my manager, but also I had let down the analysts at our company using the tools, and I had let down the clients, the people who actually depend on our systems to make their decisions. One of the analysts I worked with described poor performance in our application as being sent into war without a sword. And so every time I brought down production, that's what I had in mind: that terrible feeling that you let people down, that you caused the problems.

So what do you do? How do you handle these situations? After the dust actually settles on these incidents, the app is back up and running, and we've had a day or two to let our emotions die down, it's super important to get together and talk about what went wrong, and that's exactly what my team did. We'd pull up the code that caused issues and ask: who wrote this? How was it tested? Should our automated testing have caught this? Who code reviewed it? Were they the right people to review it? Why wasn't the problem caught in a staging environment? Why wasn't it caught post-production? Who released it? Who went through the checklist that post-validated the release? And we did this in a way that didn't point fingers or make people feel bad. The whole point of this process is to bring the teams together so that you can learn, evolve, and figure out how to prevent this from happening in the future. People should worry about letting their customers down, not about getting yelled at by their boss or having a bad performance review. At the end of the day, it's all about the customers and your users. As a junior engineer, knowing how these problems would be handled when they did occur, because we all know that they're going to occur and you're never going to write perfect code, actually gave me the space to develop without letting me off the hook or relegating me to tasks that I maybe didn't want to do.

So ask yourself: does your engineering team know why the application matters? Are there explicit expectations about uptime? Do your engineers know how problems are handled in the organization? If you're not sure about the answers to any of these questions, or maybe you're not sure that the junior engineer who just joined your team six months ago knows them in the same way as someone who has been there for two years, it might be a good time to get your team together, talk about this, and just make sure that everyone's on the same page. It's even a good thing to do every couple of months, to make sure that expectations haven't changed and the team is operating as it should.

So how can you apply this to your delivery machine? The first and most important thing is ownership. If there's any confusion about who's responsible for what, you have absolutely no chance. If there are frequent problems with some part of the system, someone's neck needs to be on the line to fix it, someone who is both empowered to do so and knows that they're responsible for it. The owner of each piece of the application is also responsible for making sure that documentation is up to date, runbooks are clear and easy to follow, and dependencies are well defined. We all know that writing documentation kind of sucks, but in a few minutes I'll go through a case study about why this is so important.
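To make that ownership point concrete, here is a minimal sketch of what a per-service ownership record could look like if you codified it in one discoverable place. Everything in it (the ServiceRecord type, the field names, the example values) is a hypothetical illustration, not Cortex's actual schema; the point is simply that the owner, on-call rotation, runbook, and dependencies live together where the on-call engineer can find them.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ServiceRecord:
    """Hypothetical per-service ownership record kept somewhere discoverable."""
    name: str              # service name as it appears in alerts
    owner_team: str        # the team whose neck is on the line
    oncall_rotation: str   # where to find who is on call right now
    runbook_url: str       # step-by-step recovery instructions
    dependencies: List[str] = field(default_factory=list)  # upstream services this one relies on

# The record a 3:00 a.m. on-call engineer wishes existed (all values invented):
payments_api = ServiceRecord(
    name="payments-api",
    owner_team="payments",
    oncall_rotation="https://example.com/oncall/payments",
    runbook_url="https://example.com/runbooks/payments-api",
    dependencies=["auth-service", "ledger-db"],
)

print(f"{payments_api.name} is owned by {payments_api.owner_team}; runbook: {payments_api.runbook_url}")
```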
With that said, you can also build safeguards around these workflows to keep things from shipping that shouldn't. SREs need to go toe to toe with product managers to make sure that the work required to build a robust delivery system isn't getting kicked down to the backlog. It should be prioritized just as much as new features are prioritized. Finally, when things go wrong, talk about it. Get your team together, follow the principles of agile, and actually have retros: go through what went wrong (you can use the questions that I went through earlier), go through what the solution was, go through how you can prevent this problem from happening again, and make sure you know who was responsible. These retros should be a mundane part of day-to-day business. They should become as much of a non-event as prioritizing Jira tickets. That way, your team knows exactly what to expect when things go down. And again, the point isn't to blame someone. It's to figure out how to work together better as a team, how to evolve, and how to prevent these things from happening.

So let's go through a case study. Say you're the engineer on call. You went to bed, it's 3:00 a.m., and you get woken up. It's the middle of the night. You're super groggy, you're tired, you've been sleeping for a few hours, but it's not enough. And so you look at this message and you have no idea what it's referring to or what the service is. You grunt, you grumble a little, and you pull your laptop into bed with you and open it up. It's 3:15. It's been 15 minutes since you got the call, and you can't find the service that's down. You have no idea how to find it. You've never heard of it. You've looked everywhere you can think of. You keep looking through the documentation, and you finally find the service, but there's no documentation about it. There is no runbook. There's nothing you know of that can help you get it back up. At 3:45, because you can't find the service owner, you don't know who to contact, it's been 45 minutes, and you just want to go back to bed, you call a different engineer on your team: Mr. Fixit. We all know him. He's the one that gets called every time there's an incident, every time something goes down, the one you go to every time you have questions. Mr. Fixit answers your call. Thank God. And then you both work together. He hasn't seen this before, but he thinks that restarting the app will help; it's happened with similar services. So that's exactly what you do. You figure out how to restart the app. You're both there. You wait for it to come back up, and after 15 minutes of monitoring it, you go back to sleep. At 10:00 a.m. you log back onto your computer for your daily standup. Everything's up, everything's running. No one realizes that you and Mr. Fixit were up at 3:00 a.m. actually solving this. And in a month, when this all happens again, because there was no retro, there was no talk about fixing it, there was no talk about preventing this from happening again, you're the one on call once again. And this time, when you go to call Mr. Fixit, he's no longer there. He quit to go work at a company that actually cares about reliability. He was done with the 3:00 a.m. wake-up calls. So not a great scenario, and not one that I personally want to be a part of. So how do we actually take the principles that I laid out earlier and apply them to how the incident machine should go and how teams should actually handle this process? So again, let's reset.
Pretend again that you're the engineer on call. You went to bed, you get woken up at 3:00 a.m. You're groggy, you're annoyed. You open up the message and you look at it. You open up your computer, and you go straight to the service with that name. This time around, there's documentation. There's a clear process to figure out where the logs are, where the runbook is, what's going on. So you look at those logs, you determine that the app needs to be restarted, you pull up the runbook to do so, and after 45 minutes, the app is back up. Talk about a difference, right? It's already taken less time than it had in our previous scenario. Moreover, you didn't call anyone. You didn't have to call Mr. Fixit this time around. You were able to actually fix it yourself without necessarily knowing what the service does or having worked on it before. You go back to bed after monitoring for 15 minutes, and then at 10:00 a.m. the team responsible for the service gets together to actually prioritize a fix for the issue so that it won't happen again, and so that if it does happen next time, they can figure out why and what went wrong.

So why does this matter? Obviously, it's bad for morale. It's bad for the engineering team. You don't want to be woken up at 3:00 a.m. But it's also super important to the business. Stripe actually published a survey in 2018 about engineering efficiency and its $3 trillion impact on global GDP. And so, obviously, your engineering team will be annoyed by incidents, morale will be down, and you should prioritize fixing that. But in terms of translating this to leadership, the study found that the average developer spends more than 17 hours a week dealing with maintenance issues: about 13 hours a week on tech debt and 4 hours a week on bad code. And if you actually translate this into cost, you'll see that in an average work week of 40 hours, if you're spending 4 hours on bad code, that's 9.25% of productivity, and that equates to $85 billion a year (a rough version of this arithmetic is sketched after the transcript). And so let's think of this in a way that leaders can understand: bad code is debt, and debt is something that all leaders, regardless of their technical abilities, understand. We all know that there are a few ways to handle debt. You can refinance, you can pay the monthly minimum, you can take out another loan, but eventually the debt collector is always going to call, and those bills need to be paid. This framing is exactly what you can use for managers who aren't technical enough to understand the problems that they're dealing with when you talk about tech debt.

So, practically, as engineers, we have three approaches to solving this problem. One, we can pay down the debt slowly over time by carving out engineering capacity for it. That can mean you put 10% of engineering capacity into every sprint to handle tech debt tickets, or you make one ticket per engineer a tech debt ticket. That's a way to slowly chip away at the debt over time. Option two is you can focus on it: you could dedicate a specific team to fixing tech debt, or you can dedicate a specific sprint or time period every quarter to eliminating tech debt for all teams. Personally, I don't think this is a great option. No team wants to work solely on tech debt. It sucks. We all want to be working on the features that actually impact clients.
We want to see them use the tools. We want to see that aha moment when they're using your product and you know that it works. Moreover, the team who created the tech debt should work on their tech debt. That's how you learn from it, that's how you improve, and that's how you start thinking about how to eliminate it in the future as you're building out new products. And if you choose to go the route of doing a sprint a quarter, the problem with that is that you don't want to pause feature development for, say, two weeks to actually work on tech debt. There's always going to be a team that's working on something so critical that you just can't do that, and so at that point they're going to be exempt. It's going to be easy for a few people to ignore it, and it's not actually going to help the situation. The third option is, obviously, you go bankrupt. And this doesn't mean literally bankrupt, but it could mean a variety of bad outcomes. There could be security breaches; there could be a lot of downtime, so that rather than 1 hour of downtime a year, you have an hour of downtime a week; or there could be performance degradation, which is almost as bad as your app being down. So basically this is a good framing for thinking about how to handle bad code and tech debt and how to get it prioritized among your engineering leaders and your organization.

So remember, the simplest and most important takeaway from all of this should be that you want to create an engineering culture that cares about the users, and behind the users, the reliability and stability of the application. Management and teammates who embody these values are incredibly important. Good onboarding for new engineers that makes requirements and expectations explicit is key: new team members should understand how incidents are handled and how the on-call process works, even if they're not on the rotation yet. You hope that in six months' time they will be on it, and it won't be a huge effort to get trained up, but rather something they already know how to do. Having up-to-date documentation, runbooks, and easily accessible logs is key. Obviously the case study I went through isn't a concrete example, but I'm sure we've all been there; we've all restarted our apps to fix an incident. And the trickiest part to navigate here, and a spot where good technology can help, is actually defining ownership, having a place to store all that documentation and information about services, and helping map out the dependencies across complex systems. That's exactly what Cortex is built to do. I encourage you to take a look at our website or send me a message and we can do a quick demo. We've been helping engineering teams and SRE teams for a while, and the tool helps you do exactly this. Thank you. If you want to reach out, my email is on the screen. Hope you enjoyed the talk.
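As a rough sanity check on the numbers quoted from the Stripe survey, here is the back-of-the-envelope arithmetic behind the cost framing. The hours below are the rounded figures from the talk, not a re-derivation of the study; the quoted 9.25% presumably comes from the study's more precise hours, which is why the round numbers here land near 10% rather than exactly on it.

```python
# Back-of-the-envelope version of the maintenance-cost arithmetic from the talk.
# Hours are the rounded figures quoted above, not a re-derivation of the Stripe study.

work_week_hours = 40      # approximate work week
maintenance_hours = 17    # weekly time on maintenance overall
tech_debt_hours = 13      # of which: tech debt
bad_code_hours = 4        # of which: wrestling with bad code

maintenance_share = maintenance_hours / work_week_hours
bad_code_share = bad_code_hours / work_week_hours

print(f"Maintenance: {maintenance_share:.0%} of the week")  # ~43%
print(f"Bad code:    {bad_code_share:.0%} of the week")     # ~10%; the study's precise figures give ~9.25%
```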

Cristina Buenahora Bustamante

Founding Engineer @ Cortex



