Conf42 Site Reliability Engineering 2021 - Online

Elephant in the Blameless War Room - Accountability

Abstract

How do you reconcile the ideals of blamelessness with a demand for blame? When is accountability actually required? We’ll navigate these challenges by explaining:

How to empathize with blameful people - we’ll look at how their goals align with yours, even if their methods are archaic
How to skillfully respond to a demand for blame - blameful people's goals can be achieved blamelessly - here's how to communicate that
When is accountability necessary? - sometimes accountability is part of the best way forward - let’s figure out when
How to be blamelessly accountable - true accountability requires blamelessness - we’ll show you why

Summary

  • You can enable your DevOps for reliability with ChaosNative. We wanted to reconcile the idea of being totally blameless with still holding personal accountability. And we're really excited to share what we found today.
  • Christina: You can't overstate the importance of being blameless. Emily: Show that there is business value in having a blameless culture. What starts to dissolve conflict is when you start to see commonalities between the parties.
  • Leaders might assume that punishment will deter others from making the same error. How can we skillfully address it so that we challenge their thinking without triggering their defense mechanisms? Ask questions when both the leader and you are in a calm state of mind.
  • Try to create empathy between the leader and the engineer who was involved. It's possible that the engineer doesn't fully understand the business impact. Are there other ways to rebuild trust with stakeholders besides retribution?
  • The best way to respond to an incident is to be direct and succinct. Focus on common ground and creating psychological safety. Having shared goals is extremely important. A follow-up investigation can identify systemic changes that prevent incidents like this.
  • When is it fair to hold someone accountable in the traditional sense where you are actually letting someone go? Have you accounted for all other contributing factors? This is a very distinct and separate process from incident resolution. This is performance management.
  • At Twitter, we learned that accountability faces forward. A team that is accountable will take full ownership of improving reliability. There's no trade-off between blameless culture and accountability. Leadership is critical in fostering a psychologically safe culture. It takes incredible empathy, stress tolerance, and critical thinking to get blamelessness and accountability working together in harmony.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud. Hello, and welcome to our talk, the elephant in the blameless war room: accountability. This talk started when Emily and I were encountering executives in Fortune 500 companies, people who owned software reliability for their entire company, who didn't really believe in blameless culture. They would ask us point blank, well, somebody still has to get fired, right? That was very poignant, and it got me thinking, what about accountability? So that's something that Emily and I spent a lot of time thinking about, and we distilled the answers into this talk. We really wanted to reconcile the idea of being totally blameless with still holding personal accountability when that is the best solution. And we're really excited to share what we found today. First, a round of intros. I'm Christina. I'm on the strategy team at Blameless, strategizing for executive team cohesion and also market positioning. I'm really passionate about making blameless culture work, not only for engineers, but also for business leaders. And I'm Emily. I'm a content writer at Blameless. I'm originally an outsider to the world of SRE, but I've been really excited to learn about the space and to start sharing my perspectives with the community. So we started thinking about factors that have a huge impact on business value. And one of the major ones, agreed upon by every study, is developer velocity. Then we found that a major factor in developer velocity is psychological safety. And what do you think is a major factor in psychological safety? Blameless culture. That would be correct. So it really is a big deal. You can't overstate the importance of being blameless. Yeah. And especially when speaking with business leaders, it's really important to speak in their language; that is the currency of communication. And so showing that there is business value in having a blameless culture is tremendously powerful. So picture this. We have an engineer working one night on the testing environment admin panel. However, on this dark and stormy night, they slowly realize this isn't the testing environment. This is the production environment. Of course, these changes lead to a major incident. And just at this time, the door opens and the executive walks in and asks, what happened? Who's responsible? So this is a pretty chaotic situation. A lot of things have gone wrong, and a lot of emotions are running high. Let's break down what happened. The shared reality is pretty simple. Leadership walked in and asked, what happened? Who's responsible? Probably their forehead was pretty furrowed, their voice was raised. They're a little agitated, speaking faster, and physically hovering around people's desks, really trying to get to the bottom of this. Now, as members of the engineering team, how would this shared reality be interpreted? Well, it's very natural and human to feel blamed and frustrated, or even afraid, imagining all the different scenarios of what the possible repercussions could be, and that could make it really difficult to focus on resolving the issue. So even if that's what the engineering team's thinking, let's look inside the head of the leadership. What might they be going through? Exactly.
And here I wanted to say that in psychology, there's this idea of a fundamental attribution error, where when we feel hurt, we assume that the other person has bad intentions or is a bad person, whereas when we hurt someone else, we assume that it was an accident, that we were just having a bad day. And so it's very natural and human, again, to feel blamed and to assume that that is the objective reality, when really the only shared reality is that leadership asked, who's responsible? So let's see what might actually be going on with the leadership, without assigning intent based on the engineering team's feelings. Their goals are probably very similar to the engineering team's. They're really just focused on resolving the incident, preventing it from recurring, and restoring trust with the stakeholders. Now, sometimes we can have the same objectives and get there through different paths. And so, given what we each know as the engineering team versus leadership, we might think that for the leader, holding someone responsible is the best path forward, whereas for the engineer, there might be other paths. So really it's easy to think, oh, leadership doesn't respect psychological safety, they're wrong, this is not the way to resolve incidents, I hate this toxic culture. But that doesn't actually solve the problem. And what I found in my experience is that it really helps when you try to step into the other person's shoes and see how, under their set of assumptions, their conclusion may be reasonable or logical. And once you uncover those implicit assumptions, then you can directly speak to those as a starting point of influencing and causing change. So again, you can easily see this as a starting source of conflict. What starts to dissolve conflict is when you start to see commonalities between the two parties. So here we can see that not only do they have the same goals, but they're really feeling the same way too. Both of them are pretty tense, they're pretty stressed, and they're very afraid of the consequences this incident could have. They're both very motivated to resolve the incident and restore the service to functionality. So let's start breaking down: how can we bridge this gap? How can we bring these two groups together? This will be the main backbone for our talk. We're going to establish empathy for leadership, understand their goals and perspective, and look at the assumptions and perspective differences that might be driving their blameful behavior. We'll address their concerns in three major areas, looking at the incident, the engineers involved, and the stakeholders' trust. Then we'll cover how to be blamelessly accountable, how to incorporate accountability into the best solution. So what are the assumptions under which holding someone accountable would actually be the right way to resolve the incident? That's the question that Emily and I asked ourselves. So they might assume some things about the incident, like it just straight up should never have happened, or that the best way to deter other people from making the same error is to punish someone. And some of the assumptions about the engineer could include: a skilled engineer would never make this mistake; if someone made a mistake like this, it must mean that there's an issue with their competence or skill; removing the engineer will remove the problem; and without punishment, the engineer won't fully understand the impact of their mistake.
They could have some assumptions about how the stakeholders are feeling too. They might believe that the stakeholders want to see someone singled out and perhaps fired, that this is the most persuasive way to convince them that the incident is resolved. They might also think the stakeholders are expecting some sort of fairness, that because they've experienced pain from the incident, they'd want to see pain experienced by the engineering team as well. But we know this isn't really how things will play out. Even though blame seems like a good way to achieve your goals, given these assumptions, we know that systemic changes are far more enduring and beneficial. So how do we close this gap of understanding? Absolutely. And do remember that even at this point, both the engineering team and the leadership have the same shared goals. They both want to resolve the incident. It's just that, given what we each know about incidents, engineers, and stakeholders, we might have different ways of getting to that outcome. So first, let's understand leadership's perspective on the incident. If they assume that it should never have happened and that punishment will deter others from making the same error, then how can we skillfully address that so that we challenge their thinking without triggering their defense mechanisms? Emily and I came up with a way to essentially uncover these perspective differences. What better way to have open minded conversations than to ask questions? And keep in mind that these are not questions that we are expecting you to ask during the incident. Everyone is still stressed and tense and focused on the resolution during the incident. We highly suggest that you ask these probing questions when both the leader and you are in a calm state of mind, where you can meaningfully engage in a conversation and where you're both open to changing your mind. So these are some of the possible questions. One thing to ask is, is 100% reliability even possible? Is it worth the cost if it is? And what kind of tradeoff are you willing to make between trying to prevent incidents and preparing yourself to react to them, given that you only have so much engineering capacity? Another question to ask is, are there other ways of making people more careful than using punishment? So let's see how we could address leadership's concerns about the incident. One thing to emphasize when you're having these conversations is that systemic changes are more enduring and beneficial: if you actually change the system at the heart of the incident, then you'll do way more to prevent the incident from recurring than just swapping out people. And it helps to give a specific example of how in the past you've had a system change that was more effective than letting go of someone. Based on Aristotle's model of persuasion, ethos, pathos, logos, it's actually very important to strategically sequence your appeals based on how likely the person you're talking to is to agree with you. So, for example, if you anticipate that the leader is less likely to agree with you, it's more important to establish your credibility first, then use logic, then use emotional appeal at the end. So don't start with, oh, the team feels really guilty. Start with, I've been doing this for many years, and I've seen this example where the systemic change was a lot more effective and enduring than actually letting go of someone, and then, from an emotional standpoint, help the leader understand how the engineer might be feeling.
Another thing to really make clear is that there's no way that complex systems won't ever fail. There's no way to prevent incidents 100% of the time, as much as you might try. Yeah. So Dave Rensin, the former head of customer reliability engineering at Google, would often say in his talks: to err is human, and even the sun is not 100% reliable, because it will fail at the very end. Humans are not 100% reliable, and complex systems are inevitably going to fail. So helping the leader understand that software doesn't just work the way it probably did in the 90s is going to be very helpful. Another thing to really get across, and this is more on the emotional side, as you were saying, is that engineers are not at their best when they're stressed. If they're in a fight or flight mode where they think every mistake they make could lead to the end of their job, they're not going to be able to focus at all on actually solving the problem. Absolutely. While points one and two can wait until after the incident is resolved, you can see that point three is actually the key to helping the leader understand the situation, to give enough room for the engineers to focus on resolving the issue in the moment. So instead of letting leadership hover and ask what's going on, what's going to happen, who's responsible, you can say, hey, we're all here trying to resolve the problem as soon as possible, and engineers don't solve problems well in fight or flight mode, so it might be helpful if you could take some time and give the engineering team some space to resolve the issue first, and then we'll come back to looking at what actually happened. Now let's take a look at how the leader might be feeling about the engineers involved in the incident. They might have some assumptions about the engineer that we covered before, like a skilled engineer would never make a mistake like this, or that if you just get rid of this one bad apple, then everything should be fine, or that the engineers don't really understand the severity of an incident without punishment. So let's try to look at how we can bridge the gap in our perspectives that would lead them to have these assumptions. So ask them: do you think there are deeper causes of incidents beyond individuals? I love these open ended questions because they really give everyone in the conversation a chance to do creative problem solving together instead of fixating on things that have gone wrong. This question embeds some forward momentum towards solving the problem together. Another question is, do engineers understand the business impact of incidents? It could be that they don't, and there needs to be more of an open dialogue between leadership and engineering teams so that they can understand how their development choices will translate into money gained or lost by the company. More than likely, though, they do understand that the incident was severe, and they're probably already feeling plenty guilty. Yeah. So let's talk about addressing the leadership's concerns about the engineer. Anyone in this position could have made that mistake. From a conversation Emily and I had with an emergency physician at Stanford Emergency Medicine, we learned that there are more missed heart attacks in the US than we would expect, because doctors, even though they've gone through so many years of very tough training, will have interruptions. Sometimes their brain is holding a different set of contexts.
Maybe the engineer was helping another engineer with a deploy in production, and coming back to their own work immediately, under a time rush, perhaps they didn't realize that this was actually not the testing environment but the production environment. And another thing, again on more the emotional side of appeals: really try to create empathy between the leader and the engineer that was involved. Really emphasize that nobody wanted this outcome. It's very easy for leadership to feel isolated in their roles, that only they can grasp the magnitude of all of these incidents. But the engineer is obviously suffering very much as well, and closing that empathy gap will allow them to understand a lot better. Absolutely. And it's also about building common ground, because leaders are typically people who take extremely high ownership of company performance, and they can feel like, oh, I'm the only one that cares, because they're not sure. They're also in a mode of stress when they're trying to assign blame, if that is the case. And when that is happening, it's difficult to have the mental space to recognize that, no, we actually all feel bad. And it is possible that the engineer doesn't fully understand the business impact, and it is the leadership's responsibility to actually help the engineer understand. And what I found is that it actually builds tremendous trust when you can facilitate a conversation where the engineer does acknowledge their understanding of the business impact. It doesn't mean you're taking full responsibility for something that could have happened with other people, but it means that you understand the pain that leadership is experiencing too. Another thing from a business perspective to emphasize is that it's way more costly to hire someone new than to train the existing team, even if there are gaps in knowledge. It's way easier to bring someone up to speed than to hire someone brand new and teach them all of the intricacies of your system. As you can see, the first and third points here are logical appeals and the second one is an emotional appeal. So if the leader is likely to agree with you, you can start with an emotional appeal, but if you foresee them disagreeing with you, start with points one and three first. Finally, let's take a look at their perspective on stakeholder trust. Again, they might have some assumptions about what the stakeholders might want to see: that they might want to see someone get fired as a way to be convinced that the incident is resolved, or to maintain some fairness among different teams and stakeholders. So again, here are some questions to ask to try to figure out where your perspectives differ on the situation. Just ask, open endedly, are there other ways to rebuild trust with stakeholders besides retribution? Another thing to ask is, how will they respond to retribution versus being informed of your long term plans to resolve this incident and other incidents like it? As you can see, the first question here also allows everyone to come together and brainstorm different options, rather than just saying it's wrong to punish the engineer, because that actually cuts the momentum of the conversation and stops it at that point. Whereas you can direct that energy towards a new option and ask, okay, what are some other ways that we could build trust? So now that you've opened up this dialogue, how can you present your own perspective?
One thing to really try to convince leadership of is that your action plan will inspire confidence, that once it's explained to stakeholders, they'll see how it leads to a more enduring solution. The other point to mention is to really show that you hear and acknowledge the pain of your customers and also of all the stakeholders impacted. So who else could be impacted besides customers? Well, the customer success team is often the team that is responsible for retention metrics, and so if customers are impacted by incidents, they may be more at risk of churn, which makes it harder for the customer success team to do their job. And so extending empathy and understanding not only to customers but to internal stakeholders as well is very important. So let's return to this incident where the leader has come in and asked what happened and who's responsible. In the moment, what is the best way to respond to this? We think some of the elements it should have are to be very direct and succinct, no beating around the bush. Yeah, because beating around the bush could actually make you seem suspicious, like you're trying to hide something. Another thing to really focus on is building common ground, looking for things that you can both agree on, things that you're both feeling, and goals that you share. You also really want to create psychological safety, and if you see any rush to point fingers and blame, really try to alleviate that with some of the questions we mentioned before. Like I said, having shared goals is extremely important, so explicitly articulate them, make sure you're all on the same page and facing the same direction, and then give visibility into what the next steps will be. Now that you have set up the goals that you share, how are you going to achieve them without using blame? So let's go back to that moment where leadership walks in through the door. So what happened here? Who is responsible for this mess? It really could have been anyone. We're all focused on resolving the incident as quickly as possible, so why don't we give the team some time and space to focus on the resolution first? I understand the impact this has on customers, and we're committed to restoring stakeholder trust, and we'll take full ownership of working towards preventing incidents like this in the future through in-depth contributing factor analysis and follow-up actions over the next two weeks after this incident is resolved. That seems fair. I look forward to seeing what you find. Wow. That was actually a very scary and stressful experience for me, even hearing Emily say that, because I could feel the blame. I felt like I needed to hold someone accountable in that moment. But I wanted to actually ask, how did you feel as you were asking those questions? I tried to really embody the feeling that this was a big deal and that I had the entire company perhaps riding on resolving this quickly. I really wanted to get across how passionate I felt about this going wrong and convey the importance to everyone else. So if that came across as scary, we can see now where these gaps start to pop up. Yeah. Wow. That's powerful. See, even when I was scared, I actually lost the ability in that moment to understand that you were just really prioritizing this issue. It came off definitely feeling like you were trying to hold a specific person responsible. So yeah, that was very powerful. Thanks. So the immediate response is actually not enough. Let's look at what the follow-up investigation could look like.
Rather than saying this engineer screwed up, let's dig a little bit deeper and do the hard work and see what the other contributing factors are. So as an example, we can return to our story from the start and ask a few questions about how this may have happened. Like, why do the admin control panels for production and testing look really, really similar? Yeah. And should production have a big flashing banner saying, this is production, this is production? Yeah. And maybe a single person acting alone shouldn't be able to make these changes. Maybe there should be some oversight, someone who has to review changes before they go through. And should we maybe be selective about the engineers who can make changes on the production admin panel? So just by digging into it, we can come up with all sorts of enduring systemic changes that can prevent this specific incident and all sorts of other incidents like it going forward. It's so much better than just getting rid of one engineer. Yeah. So again, you should have this follow-up conversation a little while after the incident is resolved, once you've uncovered these perspective differences and the assumptions they might have had, and then you can start really meaningfully implementing these changes. So now that you've done the investigation and you've come up with really great systemic changes that will help prevent issues like this in the future, are you done? Well, as you can imagine, no. There's also follow-up planning for reliability overall. So looking at this incident, how does it inform the three pillars of planning: people, process, and tooling? And process includes prioritization as well. So for people, how do incidents inform headcount planning? Do we need more people? And for process, how can we update runbooks or production readiness checklists? Are there things that we can do to consistently improve our performance resolving incidents in the future? And also consider investments that will really up-level the effectiveness of the engineering team. So just from one incident, we can dive into major priorities for the entire organization. Absolutely. Incidents can uncover issues that maybe the company isn't even ready to hear about yet, because after asking enough whys of how something happened, and not just going down one path of the tree but exploring multiple options, it often ends up revealing something about leadership and also about how the team is structured. So there are really interesting insights when we dig deeply into incidents. So let's go back to the beginning of the talk, where executives asked, someone still has to get fired, right? Well, sometimes, yes. But when is it fair to hold someone accountable in the traditional sense, where you are actually letting someone go? Well, Emily and I came up with a number of prerequisite questions to ask as a starting point. So, were the expectations for this person's job clear? Were they realistic? Were they well documented? Did they know what they were supposed to be doing? And were the mistakes of the incident a result of a lack of skill, good intentions, or honest effort? Have you been sharing feedback about their gaps in performance on a real, consistent basis, making sure they know that they're not up to par? And also, have you accounted for all other contributing factors? When it is in the context of an incident, holding someone accountable shouldn't be the easy way out.
It shouldn't be something that you leap to as the simplest solution, but instead something that you resort to after accounting for every other circumstance that could have led to the mistake. And as you can see, this is a very distinct and separate process from incident resolution. This is performance management. And just because there are incidents, which are normal and natural and happen with every company and every system, it doesn't mean that they can be a substitute for proper performance management. So let's talk about being blamelessly accountable, having your cake and eating it too: being blameless in culture, but holding people accountable when necessary. Well, at Twitter, we learned that accountability faces forward. It means that the team that is accountable will take full ownership of improving reliability from the incident point moving forward. It also means that you're separating reliability outcomes from performance management. Like we were talking about before, performance management should never be a substitute for resolving the incident in the best possible way, and likewise, it shouldn't come at the cost of in-depth contributing factor analysis. You shouldn't give up on trying to find other causes of incidents just because you've decided to hold someone accountable. Absolutely. So really, there's no trade-off between blameless culture and accountability. It's not that if you are blameless, you sacrifice accountability. You can very much have both a blameless culture and also people feeling a tremendous sense of ownership about improving the system together as a whole. Leadership is critical in fostering a psychologically safe culture, and it takes incredible empathy, stress tolerance, and critical thinking to get blamelessness and accountability working together in harmony. But it is possible, as we've seen it done in this example. So the example we shared is actually based on a true story, and in real life, the engineer, of course, felt bad, but was not punished in any way as a result of the incident. And the team worked together to implement the systemic changes to make the distinction between the testing environment and the production environment clearer. A perfect example of blamelessness and accountability working harmoniously together. So, as we worked on this talk, Christina and I found a wealth of valuable resources. If this subject interests you, we encourage you to check them out. We learned a lot about empathy and conflict resolution. We looked at the reliability journeys of other companies and how they reached this point of maturity. And we learned a lot about just what it means to be blameless. So what do we do about the elephant in the blameless war room? We shouldn't hide it. Let's ride it. Yeah. Thanks for coming to our talk. Thank you.
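The systemic changes the speakers describe for this incident, making production unmistakable and requiring a second set of eyes before a change goes through, can be prototyped with very little code. Below is a minimal sketch in Python, assuming a hypothetical admin-panel change script with made-up flag names (--env, --reviewed-by); it is an illustration of the idea under those assumptions, not part of any Blameless or Conf42 tooling.

    #!/usr/bin/env python3
    """Hypothetical guard for an admin-panel change script (illustrative only).
    It makes production unmistakable and refuses to run without a named reviewer."""
    import argparse
    import sys

    PRODUCTION_ENVS = {"prod", "production"}  # assumed environment names

    def confirm_production(env, reviewer):
        """Require a typed confirmation and a reviewer before touching production."""
        if env not in PRODUCTION_ENVS:
            return  # testing/staging: no extra friction
        banner = "*** THIS IS PRODUCTION ***"
        print("\n" + "*" * len(banner))
        print(banner)
        print("*" * len(banner) + "\n")
        if not reviewer:
            sys.exit("Refusing to run against production without --reviewed-by <name>.")
        typed = input(f"Type the environment name ('{env}') to confirm: ")
        if typed.strip() != env:
            sys.exit("Confirmation did not match; aborting.")

    def main():
        parser = argparse.ArgumentParser(description="Admin panel change runner (sketch)")
        parser.add_argument("--env", required=True,
                            help="target environment, e.g. testing or production")
        parser.add_argument("--reviewed-by", dest="reviewer", default=None,
                            help="name of the person who reviewed this change")
        args = parser.parse_args()
        confirm_production(args.env, args.reviewer)
        print(f"Applying changes to {args.env}...")  # real change logic would go here

    if __name__ == "__main__":
        main()

The point of a guard like this is that the safeguard lives in the system rather than in any one engineer's vigilance, which is exactly the kind of enduring change the speakers contrast with letting someone go.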

Emily Arnott

Blog Content Writer @ Blameless

Christina Tan

Strategy @ Blameless



