Psychology of Chaos Engineering

Abstract

Chaos Engineering, failure injection, and similar practices have verified benefits to the resilience of systems and infrastructure. But can they provide similar resilience to teams and people? What are the effects and impacts on the humans involved in the systems? This talk will delve into both positive and negative outcomes to all the groups of people involved - including users, engineers, product, and business owners.

Using case studies from organizations where chaos engineering has been implemented, we will explore the changes in attitude that these practices create. This talk will include a brief overview of chaos engineering practices for audience members unfamiliar with them, but the main focus will be on the human elements. I will discuss successful implementations, as well as challenges faced in teams where chaos was a “success” from a technical perspective but had a negative impact on the people involved.

Summary

Matty Stratton is a DevOps advocate for PagerDuty and the founder of DevOpsDays Chicago, which is going to be in September. Sometimes all the things you wanted to say have already been said. The best thing about being the last talk is tying things together.
There are different perceptions of what chaos engineering is and what it provides. When you're trying to effect change in an organization, it's about understanding those perceptions, so perception really matters.
In Chef, one of the components that make up the code is called a recipe, and calling it a script changes how people think about it. Sometimes words matter, and sometimes they don't. A word matters when it affects how we think about something.
Choose words carefully when trying to effect change using chaos engineering within organizations; "root cause", for example, implies a single cause that complex systems don't have. People get nervous about things that seem to imply additional risk. A failure experiment is a great way to normalize the practice of incident response.
The blast radius is what matters. When teams understand the effects, chaos engineering can have a really positive effect on your delivery teams. This really boils down to education being helpful. All you want to do is get in the door to talk about it.
Matty Stratton: Never surprise anybody with your chaos experiment. Know which key business metrics will make you call off the experiment. And again, we want to build resiliency. At the end of all of this, there are always people. So trying to remember the humans is what matters.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Sometimes being the last talk... Actually, being the last talk is always super dope. But there's different reasons, right? The best thing about being the last talk is you can kind of tie things together. You know, I can listen to all the other talks that happen throughout the day and then tell the story. The other thing is sometimes all the things you wanted to say have already been said. So talk's over. We're done. Let's go. No, no, seriously. I'm going to try to be a little flexible as to how we're going, based on these ideas of things that we've already talked about. So I'm going to take this opportunity. I've got some... That's weird. Well, sorry. On that screen, I just noticed that you can't really see the top. Well, all the really good information is at the very top of the slide, so you're missing all the real thought leadership over there. This talk is called the Psychology of Chaos Engineering. So it's thinking along the lines of a lot of the human factors that come into play. My name is Matty Stratton. I'm a DevOps advocate for PagerDuty. So I'm not a huge fan of resume slides, but I really like this one. So you're just going to have to sit through it. It's cool. So anyway, I work for PagerDuty. How many people here have heard of PagerDuty? You already heard about it in some earlier slides today, too. So if you didn't raise your hand, you're lying. Cool. That's as much as we're going to talk about PagerDuty right now. Business. It's at the bottom of every slide. I do a podcast called Arrested DevOps. We're one of the longest running, still running DevOps podcasts. If you're a listener and you want a sticker, come see me later. I've got lots of stickers. I also have cool PagerDuty stickers, too, because DevRel life is giving away stickers. That's my real job, sticker engineer. I founded DevOpsDays Chicago, which is going to be in September. So if you're into DevOps and you're going to be in Chicago, you should come to it and we'll talk about it later.
And I help run DevOpsDays all over the world. And this used to be my license plate. So one might say I'm invested in DevOps. But then DevOps is like ten years old. It literally is, right? Like, the first DevOpsDays was ten years ago. I want to be cutting edge and everything, so I actually had to get a new license plate. So this is my current license plate. That's true, by the way, that is my car. So I might have $200 more than sense, because that's what a vanity license plate costs in Illinois. But in reality, what I like to do at this part of the talk is kind of get a level set and get some agreement so that we're kind of speaking a similar language. And in the case of this being towards the end of the day, we've seen all these talks. Fortunately, the level setting that I wanted to do has already been done. Right. Nobody really said anything today that concretely disagrees with anything I was going to say, so thank God, because that would have been awkward. But there are a couple of things that I do like to kind of stress. So the nice thing about doing this talk at the end of the day, especially at this conference, is that the things that you already know, those are review. It's like it's on purpose, right? We're tying it all back together. So, first of all, I don't always give this talk at chaos engineering conferences. So sometimes I have to tell people what I think chaos engineering is. And by I, I mean somebody else's definition that I like. So this is from Principles of Chaos. But really, there are a couple of things about that that I always like to bring out. We talked about experimenting. I think we're all pretty common in thinking about that, and we're looking at building confidence to be able to withstand these turbulent conditions. Right. You'll notice there was nothing in here about prediction. So that's just kind of my little thought. This isn't an old definition, but we can go back here.
I was just talking about it. It's almost ten years ago. So this was from the Netflix tech blog talking about Chaos Monkey. But there's three things I like to pick out of this definition that are generally interesting. And again, think about it: at some conferences where I'm giving this talk, this is a brand new idea. Here, we're reviewing, we're getting on the same page. Right. So they're running experiments in the middle of the business day. And this is similar to PagerDuty. We run our Failure Fridays during the day. They're run at lunchtime, Pacific time, because frankly, there's no good time for PagerDuty to be down. If there is any good time for us to take any kind of an incident, it's during the day in San Francisco when most of the people are in that particular office. People in our Toronto office will tell me that, no, the best time is during the day in Toronto because Canadians are just as important as Valley people. And that's totally true. A carefully monitored environment is a big key part of this. Right? And then again, having your engineers standing by. So this is all stuff that hopefully in this room is things we consider as table stakes. If we don't consider this table stakes... First of all, if any of this comes as a big surprise to you right now, you've probably been sleeping all day, right? Which I totally get. So cool. But perceptions. So here's the thing. We just talked about what we understand this to be, but there's different perceptions around what chaos engineering is and what it provides and also what's involved. And sometimes people say perception is reality. What you're trying to do is effect change in an organization, likely. So perception super matters. It's not about being right. That's what Twitter is for. When you're doing change inside your organization, it's about understanding the perceptions. So this has been alluded to before, but this drives me up a freaking wall.
Anytime I try to go somewhere and collect information about chaos engineering, several people think they are the most clever person in the world, that they'll be the first ones who ever make this joke, right? They're not unlike the people who come up to the PagerDuty booth and go, ha, I hate you, you wake me up in the middle of the night. We're like, clever. Never heard that one before. So this is generally my response to that. It's like, first of all, you're wrong, right? That's why it's on social media. But also, it's not even clever. So that's the other thing. It offends me. As Jerry Seinfeld said, to sort of paraphrase: it offends me as a chaos engineering advocate, and it offends me as a comedian, because it's not even funny. If you're going to troll, at least be funny. Okay? And again, it's not about breaking things, right? Our intent is not to actually break something and try to push it to its limit and go, kind of, haha, I figured out how to break your system, right? That's a different thing. And if you don't believe me, believe Silvia. Because if you know anything about Silvia Botros from SendGrid, she is the expert at breaking things. And if Silvia tells you it's not about breaking things, then it super isn't. So that's got to be true. Look, I know you know this, right? So why am I bringing it up? One is, it's kind of a nice way to round out the day. The other is we have to kind of continually remind ourselves of these points and these principles, because almost by virtue of you sitting in this room, you're at a certain level of understanding of these practices and of this field and the first principles surrounding it. The people you're working with in your organization may not be. And it's very easy, as we become further advanced in our understanding of something, to kind of forget where we came from, or not even necessarily where we came from, but where people at a different mode of understanding might be. Right.
So, again, I know you know this. I'm going to say it anyway. The good thing is this just sort of blends right into the end of the day. So this is almost just like bullshitting at the bar. We're kind of at a bar, so it's cool, right? All right. So like I said, you know this, but they're experiments. I think that's a really key thing, and it's a really key way to help with that understanding. We've talked about hypotheses, right? And our hypothesis should be that if we do this thing, if this condition exists, it will still work. If my hypothesis is, if I shut down this node, everything's going to go to shit, I probably shouldn't run that experiment, right? Again, we're testing out assumptions and hypotheses. Now, if your hypothesis is everything will go terribly, then, yeah, maybe you still want to run it, but you definitely should run that in a lower environment. That's a whole different talk, to talk about the myth of staging. So we're just not going to talk about that right now. That's cool, right? So, again, taking a scientific approach, I absolutely loved that convergence-divergence thing that Adrian talked about, and I'm going to steal it for the next iteration of this talk that I give when Adrian hasn't given that talk right before it. So then I will look like the clever one. I may actually attribute it to Adrian. We'll see how that goes. So you're like, we know this, Matty. We got it, right? And why does it matter? Because how we talk about things matters. And this is a little tricky. Sometimes words matter, and sometimes they don't, right? So getting nerd snipe points on Twitter for someone using a word wrong, just to be right, is not helpful. So sometimes a word matters, and a word matters when it affects how we think about something. So I'm going to take a couple of examples, not directly related, but to illustrate that point so we can see how it applies.
So before working at PagerDuty, I worked at Chef. I come from an infrastructure as code background, so automate all the things. That's amazing. But the components that make up Chef code, one of the elements of those are called recipes. Because of course they are. Because Chef. By the way, if you hate food puns, you definitely should go use Puppet or Ansible. Don't come to Chef, because Chef is a t-shirt company that also sells software. And then we also make a lot of... Oftentimes, instead of calling it a recipe, people, customers I was working with, users that were trying to adopt this, would talk about Chef scripts. And that's one that I would correct gently and with an explanation. Not because I'm like, oh no, it's a recipe because Chef and food and blah, blah, blah. But it's a different way of thinking about what that actual application of a concept is doing, right? A script is iterative, going through steps. It's a stepwise thing. A recipe... maybe it's not a perfect analogy, but the point is not that recipe is good and script is bad, but script is not the way we want to think, so we're going to use a different word. There's other things that I choose not to get pedantic about. I work at PagerDuty. We talk about post mortems a lot because incidents, right? So how many people are aware that there's some people that don't like calling them post mortems, and for good reason? Right? But calling it a post incident review or a retrospective or an after action report or whatever is not inherently changing how you think about it. You can have a very good reason for not wanting to call it a post mortem, but it's not related to a change in behavior of how you do it. How many of you follow me on Twitter? It's okay if the answer is no. But if you do, you can probably take a guess as to what the next word is that I'm going to say I'm picky about. And that's the word root cause, right? The reason... first of all, I'm not getting nerd points here.
If you want to call it root cause, that's great. You're a fine human being and I love you. But here's why I think, contributing factor, this one matters: because it changes how we think when we use that word. It makes us think about a singular cause, which in complex systems is not there. So my whole point of this is not about stop using the word root cause. By the way, I stopped using the word root cause. It's about the words we're going to use when we're trying to effect change using chaos engineering within organizations. Right. The thing is, people get nervous. It's kind of how we live. It's kind of how we've survived as a species, because we're worried about risk. If we weren't worried about risk, we'd have all been eaten by antelopes. Well, maybe not antelopes. I don't know. I'm not good at animals, but something bigger with teeth. So risk aversion is kind of baked into us. So we're going to get nervous about things that seem to imply additional risk. Right. You're going to do what in production? How many people can think about folks inside your organization whose entire focus, the lens by which they view their relationship to your company's business, is mitigating risk? There's a lot of people for whom that's their job, and that's great. If you are going to come to them and say, I would like to do this thing, and they hear it as, I would like to create more risk, the conversation is now over. The irony is that we're all sitting here going, but that's ridiculous. Chaos engineering helps lower risk. It's good. It's blah, blah, great. And they will love that once they understand it. But you want to be able to have that conversation. So Adrian talked about that a little bit before. Right? And again, this is beautiful. Everyone's already given my talk, but this is why it matters. And when we're thinking about mitigating risk, I also like to say, again, use your monitoring like it's for real.
We've already had conversations around this. For your chaos experiment to be successful, and by successful I mean not to prove your point, but to not actually get you fired, you need to be looking at the impact to production. You need to be monitoring like it's for real. And it's because it is. Another way to think about this, and we've seen some examples: at PagerDuty, we run our Failure Fridays like a regular incident. We already start incident response as part of the failure experiment, and there's two reasons for that. One is we're already tuned up for it. For us, it's also that we believe very strongly in always practicing incident response. And a failure experiment is a really great way to normalize the practice of incident response. This is a little different from when Ross was talking about how, like, a fire drill makes you complacent. The difference with this is just being used to the motions. Right. Because we do want to try to reduce some of the stress. So, for example, the way we train incident commanders: if you want to be an incident commander at PagerDuty, before you go on call as an incident commander, you have to have run a Failure Friday, because that's giving an opportunity to practice that under a relatively low stress situation. The reason I'm bringing this up is we might take this and say, like, oh, well, then a failure game day is a great way to practice a stressful situation. It's a great way to practice stress. It's a great way to practice trouble. No, it's not. No, don't do that. Because the first thing is your people don't need training and practice at being under stress. They get plenty of that already. Right. What we want to do is the opposite. So, because we have some insight into what's likely to happen, or at least the systems that are affected, and we know what we actually acted upon, it's a really great way to practice and go through the motions, which creates kind of a physiological...
So, you know, as Pagey says, something's broken, it's your fault. In this case, it actually is. You can be a little blameful in chaos days, but it's good blame, right? That's cool. Okay, so what about the people? So we've talked a lot about tech. We've talked a lot about the systems, the technology, the providers, the Terraforming, the Kubernetes. That's all the fun stuff, right? It's also the easy stuff. Then the human pieces come in. So when we think about the people that are involved, and there's lots of people that come in, there's your employees, right? There's your delivery teams that are responsible for the systems that we're running experiments on. There's the people that are engaging in the experiment. And, I don't know, you might have some customers or users that might be wanting to use these systems. Those are people. They're involved. Right. So I always kind of like to ask this question. If I said, how does it make you feel to know that someone like Netflix practices these principles? We're actually pretty down with it, right? We're like, that's cool, man. They're Netflix. They're DevOps and all that. You know, and that's cool. So I know that I can get my new movies, and I can binge all the things and whatever. Yeah. What about your bank? Again, everybody in this room buys into this, and this one still made you go, hmm. Right. But we totally know why it matters. And the thing is, the blast radius is what matters. I was pleased to see that Adrian also goes to Twitter like I do. So I had a very scientific survey and said, if you discover a service you consume uses chaos engineering in production, do you feel reassured or uneasy? And most people said they were reassured. This is not scientific at all, by the way. And also there's a little selection bias, but graphs mean results. So I had to add some data, and a little more data, such as it is. I did a few surveys around words that people might use to describe it.
So I thought it was interesting to ask what words describe your personal feeling towards use of chaos engineering on your team. And a lot of people were optimistic, but there was quite a fair amount of uneasy and cynical and everything. And then this is what tells you everything about engineers: when we flipped it and we actually said, what about products you use? They were like, oh, my God, that's fantastic. But not us, because we're terrible. Right. So I thought that was really interesting. But one of the things that happens is when there's an understanding of the effects, this can actually have a really positive effect on your delivery teams, because we feel more comfortable making changes, we feel we have greater trust in the system. But we have to understand it, because it has the opposite effect if we don't understand it, right? So when someone thinks it's about breaking things on purpose and all those things, it's going to make them feel uneasy, but not when there's a greater understanding. So this really boils down to education being helpful. And we talked about people getting nervous. Management can get very nervous. And we think about considering our words, and this is usually the part when I say, I don't have a great suggestion of what you should call it, business failure injection or chaos engineering, because it might be different. Fortunately, a bunch of people today have already given you a bunch of really good ones. And the thing that's important is that accuracy is not the most important part. So, like, Ross talked about system verification, and the nerd snipers in us went, well, technically, it's not system verification because that's a formal process. It doesn't matter. You're talking to your CFO, right? Because all you want to actually do is get in the door to talk about it, right? Have that conversation. I like what Adrian says: just call it engineering.
But that doesn't work as well for when you're trying to explain a new practice. So I invite everybody to kind of think on this, and I'd love to talk about it if people have examples or ideas, but it will vary depending upon your organization. Right. And really it comes down to the understanding of the philosophy. And when you're trying to bring people along for a ride, you want to be somewhat like minded. Right. So I really like this. So Cody doesn't really use chaos engineering fully, but had some interesting failures from interns, which I guess is some form of chaos engineering. And he says he's now honestly feeling excitement when confronted with a new error that hasn't occurred before. This is really a lot more about learning from incidents. But those things have to be coupled, and learning from incidents is a thing that we're all not very good at. By all, I don't mean all of us are bad at it, but I mean few of us are good at it, because we usually look at incidents as something to be avoided. Again, we don't encourage them. We don't want to say, like, boy, I sure wish that I had tons and tons of incidents. Although one potentially controversial way to say it, as it's been said, is: hey, you want to get better at incident response? Have you tried having more incidents? And that immediately sounds funny, but actually the way it's phrased is: scope more things to be considered as an incident, and you will practice it more. And so incidents are a gift. If that's a little hard to kind of swallow, maybe you say incidents are unplanned investment. But if we're not focusing on being a learning culture, we're not going to get a lot of value out of all the practices we've talked about today. It's about the learning. And that's why I thought it was so great that we had a talk about post mortems as they apply to that. We need to run all the things we would do for an incident on our experiments as well, especially.
But the only reason that makes sense is if your after incident review, your PIR, your post mortem, is focused on learning. If the whole reason you do a post mortem is to write down the root cause, it doesn't make any sense to do it after a chaos experiment, because you already know what the root cause was: it was, I turned off this thing. So I think these practices are so well coupled, which is why, in a lot of people's minds, they all kind of run under resiliency engineering practices, right? Learning from incidents. These things are all loosely coupled, right? They don't give you those things, but they all end up being connected because they all come back to us wanting to learn and have better understanding and be able to reason about our systems more broadly. I'm going a little quick. Part of it is because we're getting to the end of the day, and I know we started running behind, and I'd love to have a little more hallway track and everything like that. But just a couple of things I'd like you to consider. Again, safety first, right? And this is in the Safety-II idea of safety, but safety first. We want to think about all sorts of things we've talked about today, about minimizing blast radius, about making sure your responders... By the way, speaking of one thing I forgot: was I the only one who was waiting for the last slide of Muri's talk, about where the git repo was for Gabetta? And they're like, oh, no, we don't have that yet. Because I want it. Right. So that was exciting to me because of, again, the thoughts around the whole experiment, right? So a couple of things I think are very key to keep in mind, and these may seem like table stakes, but to use a terrible analogy, I used to do swing dancing, and we used to say about dancers: beginning dancers take intermediate classes, intermediate dancers take advanced classes, advanced dancers take beginner classes. So it's always helpful.
As much as we're big experts in this room, a couple of these things: if it is common sense, great, it's a good review. If it's table stakes, I assume you're already doing all of these. But know your conditions for when you shut down the experiment. And that's knowing what your key business indicators are, right? You're not shutting down the experiment because SQL Server is using too much memory all of a sudden. You're shutting down the experiment because your average cart size has dropped below where it's supposed to be. Know what your key business metrics are, the ones where you're just going to call it. So you need to know what those are. Right? And again, we want to build resiliency, and resiliency comes from people having adaptive capacity, right? And what we're not trying to do here is stress test people, right? Even if Adrian's going to break the necks of your key SREs, you're not trying to add stress, you're trying to find challenges. So it can be very tempting to look at a chaos experiment as a way to test people's ability to troubleshoot or to simulate the stress of on call. We want to have transparency on everything that's happened. Never surprise anybody with your chaos experiment. Right. And at the end of all these wonderful numbers about reducing MTTR and availability, numbers that are happening, everything, there are always people. And where I bring that to mind is that at some of the scale at which we work, a small number is not a small number. We're like, oh, well, we only impacted one tenth of a percent of our users. Okay, well, that might have been like 10,000 people that just had a shitty hour because we weren't really watching those key metrics. So at the other side of all of your Grafana and all of this, there are humans. And we always want to keep the humans involved. And thinking about this, it doesn't mean we don't take risks.
But remember that at the end of the day, it's really easy to look at a graph, but all those little things at the end are most likely a person. So trying to remember the humans is what matters. So, not that these slides are terribly exciting, but if you want to check them out, that's my speaking website. The slides are there, some supporting links, some other articles that I found interesting. There's a couple of articles that I linked to in there that I didn't talk about, that are about how we do Failure Fridays and stuff at PagerDuty, which is not saying you should do exactly what we do, but you're all interested in this domain, so it's more things to know, right? And yeah, if you like Twitter, that's where you can find me. I'm Matt Stratton, and Pagey says, follow me.
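
The shutdown advice near the end of the talk (call off an experiment when a key business metric such as average cart size drops, and remember that a small percentage can still be a lot of people) could be sketched roughly as follows. This is only an illustration, not anything shown in the talk: `AbortCondition`, `should_abort`, `affected_users`, the 40.0 floor, and the user count of 10,000,000 are all hypothetical names and numbers chosen to match the "one tenth of a percent" and "10,000 people" figures mentioned.

```python
from dataclasses import dataclass


@dataclass
class AbortCondition:
    """A business metric and the floor below which the experiment is called off."""
    metric: str
    minimum: float


def should_abort(readings: dict[str, float], conditions: list[AbortCondition]) -> list[str]:
    """Return the names of business metrics that breached their floor."""
    breached = []
    for cond in conditions:
        value = readings.get(cond.metric)
        # A missing reading counts as a breach (an assumption of this sketch):
        # if we cannot see the metric, we cannot claim it is safe to continue.
        if value is None or value < cond.minimum:
            breached.append(cond.metric)
    return breached


def affected_users(total_users: int, impacted_fraction: float) -> int:
    """Translate an impact fraction back into a head count."""
    return round(total_users * impacted_fraction)


if __name__ == "__main__":
    # Hypothetical floor for the cart-size example from the talk.
    conditions = [AbortCondition("average_cart_size", minimum=40.0)]
    print(should_abort({"average_cart_size": 35.5}, conditions))
    # One tenth of a percent of a hypothetical 10,000,000 users.
    print(affected_users(10_000_000, 0.001))
```

Keeping the abort list in terms of business metrics rather than system metrics mirrors the point made in the talk; which metrics and thresholds apply would depend on the organization.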

Conf42 Chaos Engineering 2020 - Online

January 23 2020 - premiere 5PM GMT

Matty Stratton

DevOps Advocate @ PagerDuty
