Conf42 Incident Management 2022 - Online

Plan for Unplanned Work: Game Days and Chaos Engineering

Abstract

How do you plan for unplanned incidents? You practice, with Chaos Engineering. Strong incident response doesn't just happen; you have to build the skills and train your team. Practicing for major incidents gives your team insight into how your applications will behave when something goes wrong, as well as how the team will interact to solve problems. Combining your Incident Response practices with Chaos Engineering roots your response practice in real-world scenarios, helping your team build confidence.

Summary

  • Mandi Walls is a DevOps Advocate at PagerDuty. PagerDuty is hiring, and you can get in touch with Mandi on Twitter or by email.
  • Having better incident response helps our reliability and our customer experience, but getting better at incident response is a double-edged sword: practice usually means incidents, and we don't want more incidents. The goal of combining the structure of chaos engineering and specific tests with a regular cadence of intentional practice is to give teams a place to improve.
  • Chaos testing and game days are an excellent place to practice your incident response protocols in a low-stakes environment. Too many teams spend time planning and executing game days but then fail to put the resulting process improvements back into their work stream.
  • There are benefits to practicing a range of scenarios across your teams over time. But be very explicit about what you want to find. You can also focus on larger problems like DDoS attacks or data center failures. Get your assumptions down in writing so you can use them as a baseline for improvement.
  • Afterwards, make sure you're recording everything: save your charts, your graphs, and the list of commands. This habit also helps in real-life incidents, since you're collecting the information you need to run your post-incident review. You can use these experiments to create new best practices.
  • Be very careful about deciding to run a surprise game day; not everybody likes surprises. When is a good time to assess your reliability and resilience? Almost any time, especially if you're in a distributed microservices environment.
  • Don't game day right after a reorg, and don't wait until your busiest season: plan ahead. These tests are a way for your team to improve reliability, so make sure that everyone on the team is on board with taking those lessons and internalizing them.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome to my session. This is Plan for Unplanned Work: Game Days and Chaos Engineering. My name is Mandi Walls. I am a DevOps Advocate at PagerDuty. If you'd like to get in touch with me, you can tweet at me, I'm lnxchk on Twitter, or you can email me at mwalls@pagerduty.com. PagerDuty is hiring, we've got some positions open; you can check those out at pagerduty.com/careers. That's the last time we'll talk about PagerDuty. So what do we do? We do a bit of incident response, and we want to get better at incident response. Having better incident response helps our reliability, it helps our customer experience. But getting better at incident response is a double-edged sword. We want to get better, and to get better at most things, what you do is practice. But to practice getting better at incident response, that means we want to have more incidents, and we really don't want to have more incidents. So what can we do to give ourselves a place to practice? Incident response is a muscle that has to be exercised like anything else, right? The workflows, your communication patterns, the who and where, don't just happen by magic. You train people how to do it and they get better at it over time, right? We don't want to do that in public any more than we absolutely have to, but we want to give folks a place to practice what they do when there's a real issue, right? So that's what we're going to take a look at. Many organizations, and even individual teams now, are using a game days practice to build those muscles around incident response and troubleshooting, and just getting used to what it means to triage and troubleshoot incidents in their environments. At PagerDuty, we refer to them as Failure Fridays, and we've been talking about this for almost ten years; honestly, I think the blog post I have here is from 2013. We refer to them as Failure Fridays, and sometimes they're failure-any-day, right, if they happen to not be on Friday. The Fridays are for the big ones, but regular teams can do whatever they need to during the week. The point of the exercise, though, is to make sure that production operational systems are doing what we need them to do when we need them to do it. And this can cover a number of different things attached to your production ecosystem: are metrics and observability tools giving us the right telemetry, in a way that we can access it? Are the escalation policies, notifications, incident rooms, and all that stuff working as we expect them to? Do folks know what is expected of them and where to go to get more information? You don't want them scrambling around during a real incident, right? And if we need to push a fix, how do we do that? Is there a short circuit for the regular process? Does it have to go back through the full pipeline? What are we going to do there? Real incidents can be super stressful, especially when they're customer-impacting, and not everyone has the temperament to be calm and think through things. People get super hyped up, right? So we also want to give folks a chance to experience what happens when something isn't running smoothly. We can do this with some Failure Fridays, some game days, and we can introduce a little bit of experimentation into those. We want to go into our game days and our Failure Fridays with a goal.
Maybe it sounds cool to just walk through your data center and pull cables, or scroll through the asset list in a public cloud and just delete something randomly. But that isn't where you should be starting a reliability and resilience improvement practice; that's not going to give you the best benefit when what you want to do is create an environment for everyone to improve on things. Chaos engineering lets you compare what you think will happen to what actually happens in your systems. You literally are going to break things on purpose to learn how to build more resilient systems. This is going to help you build better technical systems, and it's also going to help you build better sociotechnical systems, which is all of that human workflow that goes into responding to an incident: troubleshooting, resolving, and all of those pieces. Right. Chaos engineering can sound kind of scary or maybe irreverent for some teams. You might also find programs that call it fault injection, which sounds a little bit more serious. But what we're really after is taking the systems as we know them, changing things around a little bit, and seeing what happens. So we're going to use chaos engineering to validate the assumptions we have about how our systems are going to behave when they meet the users. Your best-laid plans might go out the window when you find out what the users are actually going to do with the product. Right. The goal of combining the structure of chaos engineering and specific tests with a regular cadence of intentional practice is really to make sure that what you're putting in front of your users is the best system possible. You want to make sure you're not forgetting about that dependency back there. You do a black hole test, which means pretending that back end is offline, and then: what happens to the front end? What does it do? How are you going to mitigate that? And you work through all of those things. So we want to be intentional about what we're doing, not just randomly picking things apart. When we're looking at what we want to actually improve, our goals are to learn, and we want to focus on the customer experience. That's the whole point: to keep customers and users happy and give them the best experience possible. So when we're planning out testing programs and failure scenarios, we want to keep the users in mind. We want to improve the reliability and resilience of the components that our users rely on the most and where they expect the best performance. A testing scenario that only flexes a part of the application that users aren't utilizing is like a tree falling in the forest: it's a lot of work with little impact on the overall success of your service. So we want to focus on components that are going to generate the most stress when they go down or become unreliable. To have a successful chaos testing and game day practice, you want to have some things in place first, right? It's kind of enticing to go in at the beginning and create one of these practices, but you need some tools aligned first. One, you're going to want at least some basic monitoring and telemetry. It's absolutely fine, and really kind of expected, to use your game days to flex your monitoring and try to figure out what you're missing. What other piece of data would have shown us this problem earlier, especially for log messages and other application output?
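A black hole test like the one described here can be as small as dropping traffic to one dependency for a fixed window and watching what the front end does. Below is a minimal sketch, assuming a Linux host where you can run iptables with root privileges; the dependency address, port, and duration are hypothetical placeholders, not PagerDuty tooling.

    """Minimal black hole test sketch: drop outbound traffic to one dependency,
    hold the failure for a fixed window, then restore. Assumes a Linux host with
    iptables and root privileges; the dependency host/port are hypothetical."""
    import subprocess
    import time

    DEPENDENCY_IP = "203.0.113.10"   # hypothetical back-end dependency
    DEPENDENCY_PORT = "443"
    BLACKHOLE_MINUTES = 10           # time-box the experiment

    rule = ["OUTPUT", "-d", DEPENDENCY_IP, "-p", "tcp",
            "--dport", DEPENDENCY_PORT, "-j", "DROP"]

    try:
        # Inject the failure: all traffic to the dependency is silently dropped.
        subprocess.run(["iptables", "-A", *rule], check=True)
        print(f"Black hole active for {BLACKHOLE_MINUTES} minutes; watch the front end.")
        time.sleep(BLACKHOLE_MINUTES * 60)
    finally:
        # Always restore, even if the experiment is interrupted.
        subprocess.run(["iptables", "-D", *rule], check=True)
        print("Black hole removed; dependency traffic restored.")

The try/finally structure matters more than the particular tool: however the failure is injected, the experiment should always put the system back, even if it gets interrupted partway through.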
You need to start with at least enough monitoring that you know your test registered, right, and then figure out what happens downstream. If you're looking at, say, how your application responds to a dependency being offline, a black hole test, have a basic set of monitors in place for that scenario, whatever it is that makes sense for your platform. If it's an internal dependency, maybe you're monitoring it directly. If it's an external dependency, maybe you're watching a status page on a periodic basis or something like that. Two, you want to set up your response process. Chaos testing and game days are an excellent place to practice your incident response protocols in a low-stakes environment. We're being intentional. We've set our goals in advance. We think we know what's going to happen. So we can take some time to practice: having an incident commander, assigning a scribe, preparing a post-incident review. A post-game-day review is just as useful as a post-incident review is for production. Then we want to establish our workflows for fixing things. Lay the groundwork for how much experimentation you want the team to do during the test. In a real incident, your SMEs may need to push code or configuration updates to remediate the issue, and they're going to keep working until they have restored service. You're not necessarily going to do that during a game day or a Failure Friday scenario. You have a set work time and some things that you want to experiment around, and then you can call time and be done, but establish what you're going to do there. Finally, you want to think about any improvements that you're going to discover in this process: what happens after the game day? Too many teams spend time planning and executing game days and then don't put those process improvements back into their work stream, right? If the reliability improvements you learn about during the exercise require changes to the code, you need to get those back into the workflow so that they become production code. Right? So make sure that your product team is on board, that things can be prioritized, and that you're learning and utilizing what you learn from the process. Improving reliability is a long-range process, so taking the time to do the planning ahead of a game day is going to help you get the most out of the work that you do after the game day. I won't go into the first two pieces deeply, right? Monitoring and telemetry: there are lots of folks that know more about that stuff than I do. But game days are an excellent opportunity to flex your incident response muscles. You don't have to mobilize a full incident response every game day, but it's an option you should keep in mind just to keep things flowing and keep all that muscle memory active. Responding to major incidents that impact users, especially if they cross team and service boundaries, takes coordinated effort, right? And practice. Not only do game days help your team work within the response framework, but they can also help your incident commanders practice managing an incident. They're skills like any other, right? The more we use them, the better they'll be. And for folks who are newly trained as, say, an incident commander or some other position you might be using in your response process, having the opportunity to practice in a not-real incident is super good for them, right? Your incident response practice is going to improve over time as well.
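For the "enough monitoring to know your test registered" prerequisite, even a throwaway probe run during the game day can confirm the injected failure is actually visible before you dig into downstream effects. Here is a minimal sketch using only the Python standard library; the endpoint and polling interval are assumptions for illustration, not a real service.

    """Minimal dependency probe sketch for a game day: confirm the injected
    failure registers before looking at downstream effects. Stop it with Ctrl-C
    when the exercise window ends. The endpoint below is a hypothetical placeholder."""
    import time
    import urllib.request

    ENDPOINT = "https://status.example.com/api/v1/health"  # hypothetical dependency
    INTERVAL_SECONDS = 15

    while True:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
                latency_ms = (time.monotonic() - start) * 1000
                print(f"{time.strftime('%H:%M:%S')} UP status={resp.status} latency={latency_ms:.0f}ms")
        except OSError as exc:
            # URLError, timeouts, and connection resets all land here; during a
            # black hole test this is exactly the signal we expect to see.
            print(f"{time.strftime('%H:%M:%S')} DOWN ({exc})")
        time.sleep(INTERVAL_SECONDS)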
And your game days are going to help you become more comfortable and more confident when you're handling incidents in production, right? So we organize in advance, we set out explicit expectations for our practice, and we invite the people that are going to be learning from it. Then we can set up what we need to know. Your game day is going to help you verify that all of the end-to-end components in your IR process are working appropriately. Sometimes they aren't, right? Things don't work. So when you choose your tests, your first checkpoint is: do we have the appropriate alert for this scenario? If I've taken a service offline in a black hole test, does the system alert me somehow? Does it alert for latency tests, for disk I/O? Where is it coming in from? How do I know that it's coming into the right service in my incident management software? All of those things can be exercised. When the alerts fire, then what happens? Are folks being notified appropriately? Have they actually gone and put their contact info into the platform? Are you coordinating in a team chat channel? Are they on a conference call? How does everyone find out all that information? Especially if you've got folks new to your team or you've changed platforms, take the opportunity to get everyone in line and practice how to find all this information. Finally, think about your troubleshooting before you're in a real failure, right? It's essential. Do folks have access to all the dashboards they need? Are they hidden behind a password? Are they locked down somewhere? Can responders access the hosts, or the repositories, or the configuration files, whatever else they might need to mitigate a real incident? You ought to practice that, right? Has everyone been added to the management tools for your platforms? Do they know how to use them? Can they do restarts? Can they do scale-ups? Any of those basic failure scenarios are super helpful as part of your game day to provide valuable experience for new team members. Decide when you're going to call the experiment over, when to put an end to it, right? It might be time-bound: we're going to look at this particular scenario for ten minutes. Or it could be more flexible: when we feel like we've learned what we wanted to know, we'll turn it off. These scenarios can be super quick, especially if the systems are already well defined, right? And if you use a lot of defensive practices, you've got graceful degradation and other things in place, so you may only be testing a little bit of that defense, maybe focusing on a new feature or something like that. And that's totally fine. They don't have to be blown-out, take-the-whole-site-down-for-a-while sorts of practices. It's fine to practice on smaller components. We build our game day with our goals in mind. We have the general, overarching goal: improve the reliability of our service. But to get the most out of the game day, we want to set something specific so that we can concentrate on it. Maybe we have some code that we've introduced to fix something that happened previously. Right. We want to make sure then that as we introduce that code, we have actually fixed the thing. Right. And that's hard to do in a staging environment, especially if it's reliant on production-level load. Right. Do we need to test how a new database index impacts a slowdown? Are users reporting that the sign-up flow is slow, but you're not sure why and you can't really catch it in staging?
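One way to make the "ten minutes, or until we've learned what we wanted to know" rule concrete is to wrap the scenario in a small runner that enforces a time box and an abort condition. This is a sketch under assumed names: the inject, revert, and should_abort callables are hypothetical hooks for whatever failure you are injecting and whatever user-impact signal you watch.

    """Sketch of a time-boxed experiment runner: run an injected failure until
    either the time box expires or an abort condition (e.g., real user impact)
    is hit. The inject/revert/should_abort callables are hypothetical placeholders."""
    import time
    from typing import Callable

    def run_experiment(inject: Callable[[], None],
                       revert: Callable[[], None],
                       should_abort: Callable[[], bool],
                       time_box_minutes: float = 10,
                       check_every_seconds: float = 30) -> None:
        deadline = time.monotonic() + time_box_minutes * 60
        inject()
        try:
            while time.monotonic() < deadline:
                if should_abort():
                    print("Abort condition met; ending the experiment early.")
                    break
                time.sleep(check_every_seconds)
            else:
                print("Time box expired; ending the experiment.")
        finally:
            revert()  # always put the system back, even on an exception

A team might pass in the black hole injection and removal from earlier as inject and revert, and a check against the front end's error rate as should_abort.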
And there are plenty of scenarios that you can dig into using these kinds of practices, but be very explicit about what you want to find. You can also focus on larger problems like DDoS attacks or data-center-level failures, depending on which parts of your ecosystem you want to investigate and which teams are involved in the game day practice. There are benefits to practicing a range of scenarios across your teams over time. And it's good, too, to mix it up: have teams that own a couple of services practice on their own stuff independently before introducing a larger cross-team experiment. So we're going to set up our hypotheses. What do we expect to find? If we already have defensive coding measures, do they kick in? Is there a failover? Is there a scale-up that should happen? Is there some other automation that's going to take care of some things? Maybe we expect the whole thing to fall over. That's fine; it's a place to start improving from, right? But definitely set those initial hypotheses. You have assumptions about how the systems will behave, so get them down in writing so you can use them as a baseline for improvement. Afterwards, we're going to talk about what happened, so make sure you're recording it. Save your charts, save your graphs, save the list of commands. This is also helpful in real-life incidents, so that you are collecting information to run your post-incident review. Think about not just the things you first looked at, but other related information that might help in the future. This is another part of your practice that's going to help you build up a better incident response practice, especially if you take the time to write a post-game-day review. It's going to give your team a place to organize their thoughts and then improve the practice for the next time. Say, well, we thought we were going to be relying on this particular piece of information; it turns out that metric wasn't as helpful as we thought, but here's this other thing that we're monitoring over here that was actually much more helpful. So the next time you have a production incident relating to that service, you can go right to the one that you learned about, right? So talk about what you learned. Right. We went to all of this trouble. We did the planning, it's on the Failure Friday calendar, we put all this stuff together, we planned all these scenarios. Then we want to talk about what we learned and share it with other teams in our organization who can benefit from it, so that everybody is gaining knowledge through these practices. We talk about our improvements. We're going to put the things that we learned to good use. Improving the reliability and resilience of our system is a requirement. We're going to balance non-feature operational improvements with feature work in order to provide the best experience for the users. So the findings from your game day might generate work that should go into the backlog for your service to improve it over time. You might want better error messages in the application logs. You might want fewer messages in the application logs, right. You might need a new default for memory allocation or garbage collection or other kinds of subsystems. You might need better timeouts or feature flags or other mechanisms for dark launching. Whatever it is, any number of new improvements might be uncovered in this practice. So don't abandon them. Get them documented and into the planning.
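Getting hypotheses and observations down in writing does not need heavy tooling; an append-only log that the scribe writes to during the exercise is enough to anchor the post-game-day review. A minimal sketch follows; the field names, file path, and example entries are hypothetical.

    """Sketch of a game day log: capture the hypothesis, the commands run, and
    what was actually observed, so the post-game-day review has a written
    baseline. Field names and the file path are assumptions for illustration."""
    import json
    import time
    from pathlib import Path

    LOG_FILE = Path("gameday-notes.jsonl")  # hypothetical location

    def record(kind: str, text: str) -> None:
        entry = {"ts": time.strftime("%Y-%m-%dT%H:%M:%S"), "kind": kind, "text": text}
        with LOG_FILE.open("a") as fh:
            fh.write(json.dumps(entry) + "\n")

    # Example usage during the exercise (illustrative entries only):
    record("hypothesis", "If the payments back end is black-holed, checkout degrades gracefully within 30s.")
    record("command", "iptables -A OUTPUT -d 203.0.113.10 -p tcp --dport 443 -j DROP")
    record("observation", "Checkout error rate spiked to 40%; the fallback cache never kicked in.")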
And over time that's going to help you work around all of these processes. As your team gets comfortable with defensive resilience techniques, it's going to be easier to use those practices regularly when you're developing new features, and then they're going to be in there from the beginning instead of waiting for test day to unveil them. Right. You can use these experiments to create new best practices, common shared libraries, and standards for your whole organization. We get some questions when we talk to customers about these kinds of practices, and one of the big ones is: should you run a surprise game day? I know it seems like that's a thing you should do, but often you probably don't want to, right? Maybe, we'll say, once you have your patterns and practices well honed. We've been talking about being deliberate and explicit about what our goals are for running a game day in the first place. If you're looking to run a surprise game day, be really clear about what you're hoping to accomplish, knowing that there is much more risk for this kind of testing when folks aren't expecting it. It can feel more real, I guess you could say, to run surprise game days; we don't know when real incidents are going to happen. But if your team isn't consistently getting through planned testing scenarios, a surprise isn't going to magically make them better. It actually could have negative effects on the team and how it works together. So be very careful about deciding to run a surprise game day. Not everybody likes surprises, for sure. And then, when should we game day? When's a good time to do this assessment of our reliability and resilience? The truth is, kind of anytime, right? Especially if you're in a distributed microservices environment. Teams that have ownership of singular services can probably run a failure scenario across those services at any time. But also, you want to think about doing it when things have been going well, right? You don't want to run production-environment game days when you've already been blasting through your SLOs on a service. If you've had a lot of incidents already, you've maybe had some downtime, and your users are already unhappy with the reliability of the application, doing production testing isn't a way to make them super happy, even if you mean well and even if your goal is improvement. Right. You don't want to blow through the rest of your error budget on testing, as you might need it for your real incidents. So look for times when you can do shorter, focused tests when things are calm and running well, right? The bottom image is an example from our bot that's notifying folks that someone on the team is running a Failure Friday exercise. It lasts from 1:07 p.m. to 1:31 p.m. It's a short, focused experiment that can give you a lot of information, and it's in our chat. You can follow the channel and figure out what that team was doing, and maybe that will help your team as well. But make plans to run some chaos testing against new features, especially after they've had some time to burn in. You want to have an idea of the baseline performance before you go injecting new faults into it. So work on setting a good practice that, as you invoke new features and put new things into production, you're also going back and looking at the impact there and digging into those tests and additional performance issues. The converse of that is: when should you not game day?
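The bot notification mentioned above is worth copying even without PagerDuty's internal tooling: announcing the window in chat keeps an injected failure from being mistaken for a real incident. Here is a rough sketch that posts to a generic incoming webhook; the URL, channel, and message format are assumptions and would need to match your chat platform's API.

    """Sketch of announcing a Failure Friday window in chat so nobody mistakes
    the injected failure for a real incident. The webhook URL, channel, and
    payload shape are hypothetical; adapt them to your chat platform."""
    import json
    import urllib.request

    WEBHOOK_URL = "https://chat.example.com/hooks/failure-friday"  # hypothetical

    message = {
        "channel": "#failure-friday",
        "text": ("Heads up: running a black hole test against the payments back end "
                 "from 13:07 to 13:31 today. Follow along here; page the IC if you "
                 "see unexpected customer impact."),
    }

    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        print("Announcement posted:", resp.status)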
And this is going to vary in different organizations, but there are a couple of things that we encourage people to keep in mind. Don't game day right after a reorg. We know big organizations like to reorg from time to time, but give folks time to get acclimated to the services and to the teams they've been assigned to. And don't wait until your busiest season and then game day; you're going to plan ahead. You know stuff is coming, so keep those two things in mind: not right after a reorg, and not right at the beginning of your busiest time. I'm looking at you, retail: you've been planning this since June. You know your biggest season is coming. You want to be doing your practicing earlier, not later. Also, reconsider spending your time doing game days if your business partners and product managers aren't on board with taking input for the backlog based on what you learn. A game day can still have value if it's just for your team to practice, but you won't be getting the full value of the exercise, right? So you want to keep all those things in mind. As you do more of these, even as you do more small ones, your team gets used to all of the components that are important for your incident response process. They're going to know where all the dashboards are. They're going to see how your chatbots and your other automation operate, and know where to log into the conference call and what channel to follow in the chat application. All of those things, if folks aren't used to them, are even more difficult when people are stressed out during a real incident. So you have this wonderful opportunity to get people into the mindset that we're worried about our reliability, we want to work on this, and here's how our practice goes. So, to summarize just a little bit for you: one, have a plan. Think about it in advance. Don't just do it, right? You want to say, hey, we have introduced a new table into the database, and we want to do some Failure Friday scenarios around the performance of that. We have put a new dependency on the back end, and we want to make sure our defensive coding is okay. We have put some new features in, and we want to make sure all the logging works so that folks know where to find information when things go wrong. Any kind of new feature, new code, improvement, whatever it is: what does it look like when it gets into production? There are going to be things that you can see when you get to staging, and that's fine. But there are also going to be things that you're not going to be fully comfortable with until you've seen them actually perform in prod. And you need that data, you need that user flow, you need all that activity going on to make sure that you've done what you thought you were going to do. So you're going to be intentional about all of these things. These aren't an accident, they aren't a surprise. They are a way for your team to say: we're going to improve our reliability via X, Y, and Z. We're going to improve our incident management practices via additional practice and workflow, and give ourselves an opportunity here to get better at all of the things that are going to have a downstream effect on our incident response and our overall reliability. And then we're going to use what we learn. There are plenty of things we can learn about the performance of our non-feature work, all of our operational requirements, whether they are database indexes or timeouts or red-button mechanisms or whatever you're doing with your services and your reliability there.
We want to use all of those things, so make sure that everyone on the team is on board with taking those lessons and internalizing them. So, a couple of resources: the first is a really nice article on creating effective game day tests and what they've worked on. The second one is from the folks at Azure, on advancing resilience through chaos engineering and fault injection, another really good article to read if you're kind of new to all this and thinking about what you might want there. If you want to learn more about incident response methodologies and how to handle that with your team, you can check out response.pagerduty.com. And just for fun, we have a podcast called Page It to the Limit, and we'd love to have you as a listener there. We cover incident management, but also all kinds of other things. So if that's interesting, throw it in your favorite podcatcher. I hope you enjoy the rest of the event, and thanks for coming to my session.

Mandi Walls

DevOps Advocate @ PagerDuty



