Conf42 Chaos Engineering 2021 - Online

Incident Ready: How to Chaos Engineer Your Incident Response Process

Abstract

I started to build FireHydrant because the solutions for incident management were just bad (to put it nicely). My goal is to build tech for engineers, by engineers. So I incorporated my experiences, and feedback from the community, to implement researched and vetted recommended practices into an incident management platform. The chaos engineering I'm speaking to comes from actual experiences that my team and I have had, and we want to make sure other engineers can focus on building without BS.

We’re pretty sure using a real incident to test a new response process is not the best idea. So, how do you test your process ahead of time?

This talk will share how to leverage best practices to break, mitigate, resolve, and fireproof incident processes. We’ll show you how to use chaos engineering philosophies to stress test 3 critical parts of a great process:

  1. When and how you declare an incident
  2. How you communicate an incident - internally and externally
  3. When and if you should escalate an incident to your stakeholders

Summary

  • Robert Ross is the CEO and co-founder of FireHydrant, an incident management platform. Ross talks about chaos engineering and how to run a chaos experiment, but for incident response. He also talks about how to add process that doesn't weigh down teams.
  • Think of each technique as an individual Lego brick. When someone knows how to use a single brick in multiple situations, they can mitigate different incidents more effectively. How can you practice the techniques? Break something on purpose.
  • You need to identify the core techniques people use during incidents. Break your system and watch people fix it. You need to practice the individual techniques all the time. If you're looking for an incident management tool, firehydrant.com is a great resource.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
My name is Robert Ross, but people like to call me Bobby Tables. I am the CEO and co-founder of FireHydrant. We are an incident management platform. Previously, I was a site reliability engineer at Namely, and I also worked at DigitalOcean in the pretty early days. I've been on call for several years, and I'm still on call, even as CEO. I've put out fires; a lot of the time they were my own that I had to put out. I'm also a cocktail enthusiast, as you can kind of tell behind me, and a live coder. What I've been doing every Thursday is building a Terraform provider live, writing a bunch of Go and having a bunch of fun. That's 5:00 p.m. Eastern on Thursdays.

Really quickly, here's what we're going to go over. We're going to talk about some of the basics of chaos engineering and how to run a chaos experiment, but for incident response. That's going to be the core of this entire presentation. We're also going to talk about how to add process that doesn't weigh down teams, which is a really hard thing to do.

So, a quick overview of chaos engineering. We're going to start off with a quote that comes directly from principlesofchaos.org: chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. Another way this is commonly put, and I think the Gremlin team has popularized this, is that you can think of it like a flu shot: you're injecting a failure to create immunity. Another way to think about it is a controlled burn: you burn down part of the forest so a fire doesn't take the entire forest. At certain points, you're trying to make things fail. Sometimes they may not. What it does is let you find the weak spots and ultimately reinforce them.

A few examples of chaos experiments you might run, and these are technical chaos experiments: adding latency to a Postgres instance is a really simple one, and a possible outcome is that requests to your app servers start queuing and the whole site topples over. I've experienced that a couple of times in my career. Maybe you intentionally black-hole traffic to Redis, and caching breaks, but the site continues to operate, just a little more slowly. Or you kill processes randomly, requests are half fulfilled, and data is left in an inconsistent state. Never a fun situation. All of these examples are super simple, but hopefully the idea comes across.

Now, process experiments can be run in a very similar fashion. Chaos engineering obviously focuses on the computers and how they interact with each other when thrown into a certain state. But with a process experiment, there are so many other outcomes that could possibly happen. The reason I like process experiments is that people are going to solve problems in different ways. Myself, I'm going to look for logs and find what I need by a different route than maybe my colleague would. If you have a sev1 outage and you ask someone to update the status page, there could be a bunch of ways they go about doing that. Rolling back a deploy: maybe I revert a merge, but somebody else uses Spinnaker to roll back. There are so many different ways we can change our systems with only our keyboards. So how do you run a chaos experiment on process? That's an interesting question.
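To make those technical examples concrete before we move on: here is a minimal sketch of the Redis black-hole experiment, assuming a Linux host where the application reaches Redis on the default port 6379, root access, and iptables available. It's an illustration of the idea, not a feature of any particular tool.

```go
// blackhole_redis.go: a minimal chaos experiment sketch.
// Assumptions: run as root on a Linux host with iptables, and the app's
// Redis is on the default port 6379. Outbound traffic to Redis is dropped
// for a fixed window, then the rule is removed so the system recovers.
package main

import (
	"log"
	"os/exec"
	"time"
)

func iptables(args ...string) {
	out, err := exec.Command("iptables", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("iptables %v: %v (%s)", args, err, out)
	}
}

func main() {
	const window = 5 * time.Minute

	// Drop all outbound TCP traffic to the Redis port: the "black hole".
	iptables("-A", "OUTPUT", "-p", "tcp", "--dport", "6379", "-j", "DROP")
	log.Printf("black-holing Redis traffic for %s; watch how caching degrades", window)

	time.Sleep(window)

	// Always clean up: delete the rule so traffic flows again.
	iptables("-D", "OUTPUT", "-p", "tcp", "--dport", "6379", "-j", "DROP")
	log.Println("experiment over, Redis traffic restored")
}
```

During that window you'd watch whether caching fails gracefully or the whole site topples over. That break-it-on-purpose-and-observe loop is exactly what we're about to apply to process.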
And one of the ways you can do it is to have a surprise meeting. This is something we actually did last year: I put my co-founder Dylan in a room and asked him to update our status page. I did not tell him I was going to do this ahead of time. I just said, I'd like you to pretend there is a sev1 incident for FireHydrant and update our status page. And he says: okay, great, I'm going to go to statuspage.io. And I'm not logged in, because only Bobby has credentials for Statuspage. And now I'm stuck and don't know what to do. About 30 seconds is all it took to find that an on-call engineer could not update our status page. That's all it took, and it immediately revealed that Dylan had no access. We have our own status page product now, but the point surfaced very quickly: he would have had no way to publicly disclose incidents to customers. And that's part of our SLA to our customers: we will tell you when we have incidents, as fast as possible.

You can do this for any common operation you do during an incident. Any of them. And there are a ton of different operations if you really take a microscope to how you operate during an incident. You've probably had to roll back a bad deploy, update a replica config, purge a cache, find a log, or skip a test suite just so something could get to production faster. There are so many things we do when we respond to incidents, and when you start to think of them as individual techniques, you create boxes that you can then practice. That's what we're talking about when we ask: how do you master the techniques that then become part of your process?

Because the reality is that people mitigate incidents, not process. Process, in a lot of ways, can hamper mitigation of incidents. If a process is too heavy, like, oh, you have to update the status page and create a Jira ticket and create a Slack room and do all these things manually, that's not helping you solve the problem. And all of that is what FireHydrant is really for: helping institutionalize and automate that process. But that's not why we're here.

The other thing processes get wrong a lot of the time is that they prescribe how to do something. This is the only way you can add storage to the database, right? Like, oh, if we're getting out-of-storage errors, here's the runbook to add storage. But what if it was a red herring? What if storage wasn't actually running out? Maybe an error was just mislabeled. There are a lot of reasons why you don't want to prescribe processes. You just want to teach the techniques and test those.

So think of each technique as an individual Lego brick, those fun little plastic pieces we maybe all played with as kids. I still play with them as an adult. What's interesting about Lego bricks is that when you have a lot of them, you can create shapes, then break them down and build a different shape with the same bricks. That's really important, because when someone knows how to use a single brick in multiple situations (and a brick, in this analogy, is a technique), they can mitigate different incidents more effectively. If someone only knows how to build an entire set, they only know how to make the one spaceship, the one building, whatever it is, front to back. That's the only thing they've ever learned, and that's not very helpful when a different type of incident comes along. So you should teach how to use the mitigation techniques individually, and then practice using them within scenarios.
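One lightweight way to turn those operations into practicable boxes is simply to keep them in a list and draw one on a regular cadence for the team to drill. A minimal sketch, where the technique names are only examples and not a prescribed set:

```go
// drill_picker.go: pick one incident-response "brick" to practice this week.
// The technique list below is an example inventory; replace it with the
// bricks you pull out of your own retrospectives.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	techniques := []string{
		"Roll back a bad deploy",
		"Update a replica config",
		"Purge a cache",
		"Find the logs for a failing request",
		"Skip a test suite to ship a fix faster",
		"Update the public status page",
	}

	// Pick one at random. Run this weekly: one teammate performs the
	// technique on a screen-shared call while everyone else watches.
	pick := techniques[rand.Intn(len(techniques))]
	fmt.Printf("This week's drill: %s\n", pick)
}
```

The script isn't the point; the point is that the bricks are written down somewhere and get exercised on a schedule, not only during real incidents.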
Now, all of these are what I'm calling technique bricks. I'm deeming them technique bricks. If you look at the bottom half here, we have finding logs. Maybe we have an incident, and I'm like, I've got to find the logs, and the logs say, oh, it looks like caches are maybe stale. So I SSH into a box, I purge a cache, and then I roll back the bad deploy that was causing the stale caches. None of those individual things was created specifically for an incident with stale-cache problems. But because we have individual Lego bricks, we can resolve this incident, because we've practiced those different things. Now we can get creative and get really fast at putting together a set to solve an incident very quickly.

Now, the bricks are boring. They're so boring. It's why we don't do them. You don't practice: hey, go practice finding logs, go practice rolling back a deploy, go practice doing this. That's not a thing we do. It's not something we do in our industry. And that's not really how any other high-performing discipline works.

So, fair warning: I did a lot of marching band. I did Drum Corps International, and my co-founder did as well. We did things in marching band that were really boring, and this is one of them. What we did is this: we just marched forward, put our horns up, marched forward eight steps, put them down. And sometimes that was the entire three hours. That's all we would do. Now, that's important. Marching in step with the right technique, holding your horn correctly: those were individual techniques we were required to master. And the only reason we did that was so that, once we had mastered them, we could take the field and do things like this. If you look at each individual on the screen, they are marching, they're running while staying in step. They're using the same techniques they practiced for hours on end, and they're able to create different shapes, do it faster, do it slower, to create shows.

And you need to identify which techniques lack understanding first. That's really important. So how can you practice the techniques? Break something on purpose. Right? That's the purpose of this: it's chaos. You want to break something on purpose, but this time you're going to watch a teammate try to fix it.

This is something we did at Namely. We would find someone who was going to be, basically, a secret agent, and they were going to break staging. It was decided among a small group of people who all knew what was going on, and they would decide how they were going to break it and when. Then we scheduled time on the calendar for when we were going to break the environment. And we had a team that knew this was going to happen. We just told them: hey, don't have a meeting at this time, be at your computer. We're going to break staging, we're not going to tell you how, and we're not going to give you any hints. We're just going to see what happens. And then we watched the team, and we watched how they identified the problem and mitigated it along the way.
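For the scheduled break itself, the secret agent's part can be tiny. Here is a minimal sketch of one way to do it, assuming a hypothetical staging host reachable over SSH and a hypothetical cache-worker service; the host, service, time, and failure mode are all placeholders for whatever your small group agrees on.

```go
// gameday.go: wait for the agreed time, then break staging on purpose.
// Everything here is a placeholder: the host, the service unit, and the
// failure mode should come from the small group planning the exercise.
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	// The calendar slot the responding team was told to keep free.
	start, err := time.Parse(time.RFC3339, "2021-03-25T15:00:00-04:00")
	if err != nil {
		log.Fatal(err)
	}

	if wait := time.Until(start); wait > 0 {
		log.Printf("waiting %s until the scheduled break", wait)
		time.Sleep(wait)
	}

	// The actual break: stop one service on the staging host.
	// (Hypothetical host and unit names; swap in your own.)
	cmd := exec.Command("ssh", "staging-01.internal",
		"sudo", "systemctl", "stop", "cache-worker")
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("failed to break staging: %v (%s)", err, out)
	}

	log.Println("staging is broken; start the clock and just watch the team")
}
```

The injection is the least interesting part. The value is in watching, without interrupting, how the team figures out what happened.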
You're going to hit walls. People are going to run into walls; they're going to break their noses. Inevitably they're going to hit some bottleneck. It might be that they have to take an alleyway to get where they need to go and it just slows them down. Or they just outright can't do something: they don't have permissions, they don't know how, a runbook doesn't exist, the capability doesn't exist in the system, maybe we can't roll something back. There are a lot of things you will identify when you create these scenarios. What you're breaking is your process. You're trying to break your process.

So, going back to that list of Lego bricks: you need to create your own. You need to find all of the different individual bricks. You can do this by going through your retrospectives, or post-mortems, depending on your nomenclature, and finding the individual techniques that people used to help resolve the incident. Break those down, make them molecular. Oh, we rolled back a bad deploy: that's a technique we used. We found a log on this system: that's a technique we used. Right.

Now what you can do is find your newest teammates and ask them to do it. Roll out a benign change to an environment. Do it in production, because production is rarely the same as staging or QA. Get on a Zoom call with that teammate, set it to record, ask the teammate to share their screen, and then tell them to roll back the change. That's it. Then just watch what happens. This is really interesting because, going back to what I was saying earlier, people are going to have different ways of rolling things back. There might be one person who just pushes a new image to a Docker registry, another person who reverts a commit on GitHub, and another person who knows how to roll it back using our CI tool. There are a bunch of different ways to do this. You might actually find someone who has an extremely efficient way to do it that you didn't know about, and all of a sudden you've revealed, oh, there's another way to do this that's even better. You can put that out to the team, and now that's the technique you use.

You can break it down further. Take finding logs: do the same thing, Zoom, record, share screen, and ask a teammate to find all the logs for 500 status code requests. Because the last thing someone needs to be learning during an incident is Lucene syntax. Are we doing status_code, or status with an uppercase C in Code? And what is the index name? There are a lot of things that slow you down during incident response, and those are the minute details; the shavings of those add up to a pile of time that can really hamper your response. It's important to practice the techniques.

So, with that: you need to identify the core techniques people use during incidents, and you can do this through chaos engineering principles. Break your system and watch people fix it. It's not the typical chaos engineering we think of, where we break the system and just look at the adverse effects on the system itself; we rarely use that same technique to test process. Once you find the techniques people are using during these intentional incidents, write them down, get them on paper, put them in Notion, Confluence, whatever you're using, to really institutionalize that knowledge. And then practice them regularly. Your logging system is going to change. Inevitably, something is going to change: the way you find metrics, the way metrics are labeled. You need to practice the individual techniques all the time, the same way we did in marching band, where we would do very boring things every single day before we did anything that mattered, to our audience at least.
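As one concrete example of a technique worth writing down and drilling, here is a minimal sketch of the find-the-500s exercise against an Elasticsearch-backed logging setup. The index name (app-logs) and the status_code and @timestamp field names are assumptions about your mapping; knowing what they're really called in your system, before an incident, is the whole point of the drill.

```go
// find_500s.go: fetch recent requests with a 500 status code from Elasticsearch.
// Assumptions: logs live in an index called "app-logs", the HTTP status is
// indexed as "status_code", and timestamps are in "@timestamp". Adjust these
// to your own mapping; learning those names is part of the drill.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Roughly the Lucene query string equivalent:
	//   status_code:500 AND @timestamp:[now-15m TO now]
	query := `{
	  "query": {
	    "bool": {
	      "filter": [
	        {"term":  {"status_code": 500}},
	        {"range": {"@timestamp": {"gte": "now-15m"}}}
	      ]
	    }
	  },
	  "size": 20
	}`

	resp, err := http.Post(
		"http://localhost:9200/app-logs/_search",
		"application/json",
		strings.NewReader(query),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}
```

Whether the fastest path for your team turns out to be this query, a saved search in Kibana, or something else entirely, the drill is the same: record the screen, watch where people hesitate, and write down the quickest route you find.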
So what did we cover? We got through this pretty quickly. We learned that techniques matter in incident response; I really strongly emphasize this. It's important that you boil down how you are doing things and practice those individual things. I highly encourage breaking something and just observing how a team fixes it. You will reveal a lot through this. You'll find that someone doesn't have access to something. You'll find that someone has a different way of doing something that's better, just by watching. This will take you an hour to do with a small team.

So thank you. My name is Robert Ross; people call me Bobby Tables. I'm on Twitter and GitHub as bobbytables. Feel free to email me at Robert@firehydrant.com. And as always, if you're looking for an incident management tool that can help you automate a lot of the process around incident response, firehydrant.com is a great place and resource.

Robert Ross

CEO @ FireHydrant



