Conf42 Chaos Engineering 2022 - Online

The Road to Reliability

Abstract

Over the years a lot of research has been conducted and many books have been written on how to improve the resilience of our software. This talk will dive deep into the three key practices identified by the authors of Accelerate to improve reliability: Chaos Engineering, GameDays, and Disaster Recovery. We will discuss the key measures of tempo and stability, and how practicing Chaos Engineering will increase both.

We will be walking through the Google Cloud open source Bank of Anthos application to illustrate why teams should focus on the customer experience and how to test for failures.

Attendees will learn practical tips they can put into action, focused on resource consumption, capacity planning, region failover, decoupling services, and deployment pain.

Summary

  • Julie Gunderson: How can you make your systems more reliable and improve the reliability of everything? She says two key practices that improve the key measures of tempo and stability are chaos engineering and game days. At the end of this, she will let you know how you can get a free chaos engineering certification.
  • Tempo is measured by deployment frequency and change lead time, and stability is measured by mean time to recover (MTTR) and change failure rate. These are the types of metrics that Accelerate recommends that you measure.
  • According to Accelerate, you should focus on improving reliability. It's the foundation that enables you to have a really great tempo. Engineers feel more confident when software is reliable. That makes you actually able to ship code faster and more reliably.
  • Chaos engineering is about simulating the chaos of the real world in a controlled environment. You start with a hypothesis and experiment to validate that hypothesis. Communication is important, so you want to make sure you share your plans with everyone. Sharing what you learn makes everyone better engineers.
  • To measure how systems change during an experiment, it's important to understand how they behave now. This involves collecting relevant metrics from target systems under normal load. Through these experiments, you're learning about your systems and validating your hypothesis. Share the results of your chaos engineering experiments with your teams.
  • You can use any chaos engineering tool. Start small in staging, then expand your blast radius and move to production. Ask yourself, does black holing a critical path service result in a graceful degradation of the customer experience?
  • Use kubectl to scale the transaction history deployment to two replicas instead of just one, then send 50% of the transaction history pods into a black hole. Will things be okay? Will things get worse? You never really know until you run the experiment.
  • Game days are a great way to build relationships within an organization. The goal is cooperative, proactive testing of our systems to enhance reliability. In the real world, plan for a minimum of 30 minutes. Document the results and then share widely in the organization.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Getting real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence but respond instantly to get things working again. Today we're going to talk about stability, chaos engineering, and game days. I'm really excited to be with all of you today, so let's just jump right in. If we've learned anything over the years, it's that the road to reliability isn't an accident: reliability takes work, a plan, a strategy, and a lot of technical actions, which include teamwork and collaboration. So how can you make your systems more reliable and improve the reliability of everything? Let's talk about that today. One of the things that we saw a few weeks ago was this massive outage at AWS, and think about what that meant for customers. Look at all of the folks that were impacted. Our systems are complex, and when we see something like a major five-hour outage, how can we plan or prepare for those types of things in the best way possible? So when we talk about complex systems, what do we mean? I'll get there in just a minute. First, I want to let you know a little bit about me. I'm Julie Gunderson, a senior reliability advocate here at Gremlin. You can find me on Twitter at Gund or email me at julie@gremlin.com. Prior to joining Gremlin, I was over at PagerDuty as a DevOps advocate, so I've been in the reliability space for quite some time. Other than that, I live in the great state of Idaho. Now that you know who I am, let's jump right in. A great resource, if you haven't checked it out, is the book Accelerate. It's a fantastic book written by Dr. Nicole Forsgren, who has a PhD and is responsible for the DORA report; Gene Kim, who wrote The Phoenix Project; and Jez Humble, who's an SRE over at Google. They've done an amazing job of collecting all the research they've gathered from practicing DevOps over the years and also creating the DevOps survey and report. They took four years of research and came together to analyze that data, bubble it up, and say what the most important things are that you need to do to create high-performing technology organizations. When it comes to building and scaling high-performing teams, in Accelerate they focus a lot on tempo and stability. Now, a lot of times folks ask, how do you measure chaos engineering? How do you make sure that what you're doing is the right thing? When you're doing this type of work, how do you prove that the work you did actually makes a difference and moves that needle, especially if you have a lot of different people doing different projects? It's really important to be able to measure back and say: this is the work that we did, this is how we moved that needle, and this is the ROI that we got from doing that work. Over the years, a lot of research has been conducted and books have been written on how to improve the resilience of our software. So today I want to dive into two key practices that improve the key measures of tempo and stability that are outlined in the book Accelerate, and those two practices are chaos engineering and game days. You'll learn practical tips that you can put into action focused on resource consumption and capacity planning, decoupling services, large-scale outages, and deployment pain. And at the end of this, I will let you know how you can get a free chaos engineering certification. So what are tempo and stability? Tempo is measured by deployment frequency and change lead time.
And stability is measured by mean time to recover (MTTR) and change failure rate. If we break that down even further, deployment frequency would be the rate at which software is deployed to production or to an app store, so for example anywhere within a range of multiple times a day to maybe a really long deployment cadence, like once a year. The other tempo measure is change lead time: the time it takes to go from a customer making a request to that feature being built and rolled out to where the customer can use it. So again, the time it takes from making the request to the request being satisfied. Then for stability there's mean time to recover, the mean time it takes a company to recover from downtime in their software, and the other stability measure, change failure rate, which is the likelihood that a change introduces a defect. So if you roll out five changes, how many of those will have a defect? How many of those might you need to roll back, or might you need to apply a patch for? Is it one out of five or two out of five? These are the types of metrics that Accelerate recommends that you measure. So back to the key practices. As I mentioned, the key practices are chaos engineering and game days. And according to Accelerate, you should focus on improving reliability, because it's the foundation that enables you to have a really great tempo. It enables you to improve developer velocity, which makes so much sense, but what's really nice is that they're backing this up with data from years of research. And we know that engineers feel more confident when software is reliable. They feel more confident to write new code, and more confident that the features they're building are going to work well. They understand the different failure modes because they've been focusing on reliability and have been trying to improve reliability. So that makes you actually able to ship code faster and more reliably. It's really exciting when you can ship new features that work, that you can trust, and that meet requirements. In Accelerate, they say that we should build systems that are designed to be deployed easily, that can detect and tolerate failures, and that can have various components of the system updated independently. Those are a lot of different things that are really great to work towards, and I want to focus on detecting and tolerating failures specifically. So what's the best way to know if your system can detect and tolerate failures? Chaos engineering. And that's the best way, because you're purposefully injecting real failures to see how your system handles them. So let's look at the basics of chaos engineering. Chaos engineering is a bit of a misnomer: we're really simulating the chaos of the real world in a controlled environment. Introducing chaos is methodical and scientific. You start with a hypothesis and an experiment to validate that hypothesis. You start with the smallest increment that will yield a signal, and then you move safely from small scale to large scale, and safely from dev to staging to production. Communication is important, so you want to make sure you share your plans with everyone. You want to think through: what if there's an incident? You don't want to negatively affect other teams. You also want to share what you learn, because chaos engineering is about learning, and sharing what you learn makes everyone better engineers. So share internally and externally. There is a Chaos Engineering Slack, which is a great online resource you can find at gremlin.com/community.
You can talk to other people who are practicing chaos engineering, or, if allowed, write a blog about it. Share this with people so that they can learn the best practices of chaos engineering. Chaos engineering is about iteration. You're creating a hypothesis, running an experiment, then creating tasks to improve your software and processes, updating that hypothesis, and repeating. And then you increase your blast radius and keep repeating. To sum it up, this is what chaos engineering is: thoughtful, planned experiments to reveal weaknesses in systems, both technical and human. And so we want to think through: where is our tech broken or insufficient? Where does that user experience break? Does our auto scaling work? Is our monitoring and alerting set up? And we also want to think about our human systems and processes. Are they broken or ill-equipped? Is that alert rotation working? Are the documentation and playbooks up to date? How about the escalation process? These are all things that we should be thinking through, and now is the best time, because systems are complex and they become more complex over time. So let's start when things are less complex. Now, I want to talk a bit about applying the scientific method to chaos engineering. To measure how systems change during an experiment, it's important to understand how they behave now, and this involves collecting relevant metrics from target systems under normal load, which provides that baseline for comparison. Using that data, you can measure exactly how your systems change in response to an attack. And if you don't have baseline metrics, that's okay. You can start collecting those metrics now, and then use chaos engineering to validate them. One of the most powerful questions in chaos engineering is: does this work the way I think it does? Once you've got an idea of how your system will work, think about how you're going to validate that. What type of failure could you inject to help prove or disprove your hypothesis? What happens if your systems don't respond the way you expected? So you've chosen a scenario, the exact failure to simulate. What happens next? This is an excellent thought exercise to work through as a team, because by discussing a scenario, you can hypothesize on the expected outcome when running it in production, and you can think through what the impact would be to customers and to dependencies. Once you have a hypothesis, you'll want to determine which metrics to measure in order to verify or disprove it. It's good to have a key performance metric that correlates to customer success, such as orders per minute or stream starts per second. As a rule of thumb, if you ever see an impact to these metrics, you want to make sure that you halt the experiment immediately. After you've formed your hypothesis, you want to look at how you can minimize your blast radius prior to the experiment. I'm going to talk about blast radius a little bit more in a few minutes, but blast radius is usually measured in customer impact; like, maybe 10% of the customers could be impacted. It can also be expressed in hosts or services or other discrete parts of your infrastructure. So when running a chaos experiment, you want to think about that blast radius. You always want to have a backup plan in case things go wrong, and you need to accept that sometimes even the best backup plan can fail. So talk through how you're going to revert the impact.
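To make that halt condition concrete, here is a minimal shell sketch of an automated abort check. Everything in it is an assumption for illustration: the frontend URL, the choice of page-load time as the customer-facing metric, and the two-second threshold. In practice you would watch your own key performance metric in your monitoring tool and stop the attack in whatever chaos tool you are using.

# Minimal abort-condition check (illustrative names and thresholds).
FRONTEND_URL="http://bank-of-anthos.example.com"   # hypothetical address of the demo frontend
THRESHOLD=2                                        # seconds of page-load time treated as customer impact

baseline=$(curl -s -o /dev/null -w '%{time_total}' "$FRONTEND_URL/home")
echo "Baseline page load: ${baseline}s"

# ...start the experiment in your chaos tool of choice, then poll while it runs...
for i in $(seq 1 60); do
  now=$(curl -s -o /dev/null -w '%{time_total}' "$FRONTEND_URL/home")
  echo "Current page load: ${now}s"
  if (( $(echo "$now > $THRESHOLD" | bc -l) )); then
    echo "Customer-facing metric degraded beyond the threshold: halt the experiment and revert."
    break
  fi
  sleep 5
done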
One of the important things with chaos engineering is safety. So you want to make sure the impacts can be reverted, allowing you to safely abort and return to that steady state if things go wrong. After you run your first experiment, there's likely going to be one of two outcomes: either you've verified that your system is resilient to the failure you've introduced, or you've found a problem that needs to be fixed. Both of these are great outcomes, because on one hand you've increased your confidence in the system and its behavior, and on the other hand you've found a problem before it caused an outage. Make sure that you have documented the experiments and the results. And as I mentioned before, a key outcome of chaos engineering is learning. Through these experiments, you're learning about your systems, you're validating your hypothesis, you're teaching your teammates. So share the results of your chaos engineering experiments with your teams, because you can help them understand how to run their own experiments and where the weaknesses in their systems are. So, let's talk about how you get started. As I mentioned, you want to pay attention to the blast radius. So you want to start small. You want to be careful. You want to start on a single host or service, not the whole application or fleet. You want to start in a controlled environment with a team that's ready; we're not trying to catch folks off guard. Then, once you've done that, you want to expand that blast radius and adopt the practice in development, so engineers are architecting for failure. Get confident testing in development, and then move to staging. Start small in staging, then expand your blast radius and move to production. Start small and increase. This is really similar to how you do development, so you don't need to overthink it. You can work iteratively, like with code, and move up the environments, like with code. You do know how to do this, and so I want to show you a real demo of what this actually looks like. Now, we'll be using an open source application and Gremlin to do this demo (there is a free tier of Gremlin), but I want to let you know, this is just to show you what it looks like. You can use any chaos engineering tool. So this is a really cool open source project, and you can find it on the Google Cloud Platform GitHub page. It's a repo called Bank of Anthos, and I'll send some links so that you can check it out. So, this is the architecture diagram. It's really great for learning how to practice chaos engineering because it's an Internet banking application, and it has multiple languages, both Python and Java, which is pretty common these days: when you're working on a system, there are oftentimes multiple languages. There are also two databases, the accounts database and a ledger database, which are running on Postgres. And then we have our transaction history service, a balance reader service, a ledger writer service, a contacts service, a user service, our front end, and a load generator. So this is what it looks like.
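If you want to follow along, the demo app can be spun up on any Kubernetes cluster. The commands below are a sketch based on the Bank of Anthos repository's README; double-check the paths and manifest locations against the current repo before running them.

# Clone the demo application and deploy it to the cluster kubectl currently points at.
git clone https://github.com/GoogleCloudPlatform/bank-of-anthos.git
cd bank-of-anthos

# The services sign JWTs, so the secret has to exist before the app comes up.
kubectl apply -f extras/jwt/jwt-secret.yaml

# Deploy all of the services from the architecture diagram.
kubectl apply -f kubernetes-manifests

# You should see pods for the frontend, user service, contacts, balance reader,
# transaction history, ledger writer, the two databases, and the load generator.
kubectl get pods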
Now, one of the questions we want to ask ourselves when we're starting to practice chaos engineering is: what is our hypothesis? We want to think through what is going to happen. Does black holing a critical-path service like the balance reader result in a graceful degradation of the customer experience? You want to think through that. What would happen if we make the balance reader unavailable? So we make it unreachable, which is what we call a black hole attack in Gremlin. If we make that unavailable, what is going to happen to the application? Do you have any ideas? You want to think through those questions, to think through what will happen. Will we be able to use the website? Will we be able to see what the balance is? Will we be able to deposit funds? These are the questions we should ask ourselves. My guess would be that if we black hole the balance reader service, we might see an error message, like unable to read the balance; we might get a user-friendly message. We would also hope to get some kind of loading indicator, something saying there was an issue with the balance reader and that it was no longer available, or maybe it was built really well and we could just fail over automatically. Maybe, if that one service was unavailable, because it was running on Kubernetes there would be some redundancy there and we'd be able to actually fail over. Let's see what we can do. So, within Gremlin, you can select the balance reader service, and we're selecting it as a Kubernetes replica set. Now, we can already learn a lot here, because we can see in the visualization that there's only one pod impacted. If there were multiple pods for the balance reader, we'd actually see two pods impacted. That shows us already that when we run this chaos engineering experiment, we're going to make all of the balance reader service unavailable, because there's no secondary pod, which is a pretty large blast radius if you think about it. So this is what it looks like when we run our experiment: the balance shows up as a raw error, and this can be really confusing for the user. If I was building this as a real Internet banking app for a bank, they'd likely say, no way, we're going to get so many support tickets if this really happens. That's not even a real error message, and this could make users really confused. And then they're going to pick up that phone and start calling the call center, and that increases the cost of having to answer all of those calls, and it causes additional problems. But we also want to check other functionality, too. Does this affect other dependencies? The user is actually still able to make a deposit of $1,000 while the balance reader service is in this black hole, or unreachable, state. And you can see this in the transaction history: we've added money to our bank account and we get a success message, but the balance still doesn't load. But if we try to make a payment, we're unable to do it, and then we get this awesome engineer-friendly message, but not really a friendly customer-facing message. You've got to kind of think about that. Do you think your friends and family who don't work in tech would know what this means? Or are they going to start calling you and asking what's going on? I mean, I can tell you that every time there's an outage, especially with Facebook, Hulu, or rarely Netflix, my mom calls me to see if I can figure out what's wrong, even though I have repeatedly told her that I do not work at these organizations. So we want to think about that user experience. It's something good to think through. I used to get errors like these all the time, and I actually still do at my current bank, and they're errors that make sense to them. But when I can't buy my new Lego set and I have this error, it is really frustrating for me, and it makes me realize that I need to go to an entirely new bank.
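If you want to reproduce the balance reader black hole from this demo without a commercial tool, a deny-all NetworkPolicy scoped to its pods gets you a rough stand-in. This is not how Gremlin's black hole attack works under the hood; it is just a sketch, under two assumptions: that the pods carry the label app=balancereader (the default in the Bank of Anthos manifests) and that your cluster's network plugin actually enforces NetworkPolicy.

# Make the balance reader unreachable by dropping all inbound traffic to its pods.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: blackhole-balancereader
spec:
  podSelector:
    matchLabels:
      app: balancereader
  policyTypes:
    - Ingress
  ingress: []   # no allow rules, so every inbound connection is dropped
EOF

# The backup plan: revert as soon as you've observed the impact.
kubectl delete networkpolicy blackhole-balancereader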
So that's why we want to think through the experience of reliability: how does the user see the issue? How do we represent the problems to them? Do they even notice that there is a problem? Or is there a way to hide that problem from the user and gracefully degrade? Maybe even, if possible, remove the component from the web page if it's not working at the moment. There are a lot of things that you can do to allow for graceful degradation that makes a better user experience, and we want to think through this always from that user perspective: how can we reduce stress on those users? So if you're interested in learning more about the free demo environment, check it out. It's totally free to spin up on Google Cloud Shell; here's the information and the URL. But I want to show you another example of something that we're looking for, so let's go back to our demo environment. Another interesting thing to look at is different types of failure modes. So let's have a look at the transaction history service and see how a black hole might impact that. This is the transaction history here, and you can see what it looks like when things are good. You can see the credits and the debits and who made the transactions. What you want to see is whether we're going to get a graceful degradation of this service if it's made unavailable or not. During our experiment we can see that the transaction history is not there. We get an error that says could not load transactions, but we can also see that our deposit was successful. So this is better than the balance reader, because at least we're not getting that raw error, and we have more of a friendly user experience that doesn't talk about GET requests. So how can we mitigate against a black hole? That's the next question, because before, if you remember, on the balance reader there was only one pod; now we're going to make it two. So we'll scale our replicas, and if you're interested in Kubernetes and learning more about this, then this is for you. We can run this command, kubectl scale deployment transaction history, and give it two replicas instead of just one. We will do this for the transaction history service for this example. So run the command, and now we can look in the terminal, run kubectl get pods, and yep, now we have two pods running for the transaction history that can show us that data. So now what we can do is a smaller blast radius chaos engineering experiment. Let's send 50% of the transaction history pods into a black hole, so just one of the two. And now you can see at the bottom on the right that 50%, one of the two pods, is impacted, and I've selected the transaction history replica set. You can see the two pods; there are two little green dots in the visualization. So now you want to think through: what's your hypothesis? If we send one of the two transaction history pods into a black hole, will we get an error message? Will things be okay? Will things get worse? The interesting thing here is you never really know until you run the experiment. You're never going to be able to just guess exactly, because that's very hard to do, and if it was easy to do, we'd probably all be Powerball winners right now. So now let's look at the architecture diagram, just to be able to really understand this and understand what's happening. We have our two transaction history pods, we have the two replicas, and what actually happens is that there will be a very short outage that's not visible to the human eye, and then the other pod is going to take over. It will say, this pod is not reachable, so I'm going to flip over to the other pod. And that's what occurs when you test this out with chaos engineering in real time, which is great, because there's no error message visible to the user. They just get the data that they wanted. You're still able to see all of the transaction history, you no longer receive error messages, and your mom's probably not calling you. So this is a really nice way to fail over that service.
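Here is roughly what that scaling step looks like as runnable commands. The deployment name transactionhistory is the default in the Bank of Anthos manifests, and the label selector is an assumption; verify both against your copy of the repo.

# Give the transaction history service a second replica so traffic can fail over between pods.
kubectl scale deployment transactionhistory --replicas=2

# Wait for the new replica to become ready, then confirm both pods are running.
kubectl rollout status deployment transactionhistory
kubectl get pods -l app=transactionhistory

# Now re-run the black hole experiment against just 50% of the pods (one of the two)
# in your chaos tool and watch whether the remaining pod picks up the traffic.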
So again, here is the URL if you're interested in looking at the Google Cloud Platform Bank of Anthos. Now, as I mentioned, there are two key practices, and the next key practice I want to talk about is game days. Game days are a really great team-building exercise. They're something that Accelerate talks a lot about in the book, because game days are a great way to build relationships within an organization, and it is so true. They not only help you improve stability and reliability, but also definitely tempo, because they enable you to work closely with other teams that you want to have better relationships with, and this is especially important if you're in larger engineering organizations. The goal is cooperative, proactive testing of our systems to enhance reliability. By getting the team together and thinking through your system architecture, you can test these hypotheses. You can evaluate whether your systems are resilient to different kinds of failure, and if they're not resilient, you can fix them before those weaknesses impact your customers. So now, this is an example of a game day. What you would want to do is invite four or more people to attend, and it's always good to have at least two teams. You want to know, if you have some type of failure, how it appears to the other team, or if they have some sort of failure, you want to see how that affects your systems. And that's why these are so great, because you're doing it in real time. They're not happening during an incident; this is planned. You're doing this at 10:00 a.m. with lots of coffee and a Zoom, and you don't have to spend hours doing this. Some folks do want to plan game days for an entire day or a half day, but you can really have much shorter game day experiences; I would definitely say plan for a minimum of 30 minutes, though, in the real world. Oftentimes we see folks spike load, maybe using Gatling or JMeter, and then introduce different types of failure modes. So for example, if you use JMeter, you could say, all right, let's send a bunch of requests, but now I'm going to spike CPU, because our auto scaling is based on CPU, or maybe it's based on a mix of CPU and requests. This allows you to create the situation where auto scaling should occur, and then you can see that it works. And then when traffic subsides and the CPU attack finishes, maybe after 60 seconds, it will go back to normal and it'll scale back down. And that's going to help with other things too, if you're thinking about cost management with infrastructure and keeping those costs low, and a lot of SREs really care about that too: how can I say that we saved this much money because we were able to quickly scale down when we don't need to be running a large number of machines in our fleet?
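As a sketch of that load-plus-CPU game day scenario: the commands below assume the frontend deployment is the service under test, that an image with stress-ng (or your chaos tool's CPU attack) is available in the pod, and that hey is standing in for JMeter or Gatling. Treat the URL and all the numbers as placeholders to tune for your own environment.

# 1. Make sure autoscaling is actually configured for the target before the game day starts.
kubectl autoscale deployment frontend --cpu-percent=70 --min=2 --max=10
kubectl get hpa frontend

# 2. Spike request load against the app (any load generator works: JMeter, Gatling, hey, ...).
hey -z 60s -c 50 "http://bank-of-anthos.example.com/home"   # hypothetical URL

# 3. In parallel, spike CPU on the target pods for about 60 seconds so the HPA's signal crosses its threshold.
kubectl exec deploy/frontend -- stress-ng --cpu 2 --timeout 60s

# 4. Watch the scale-up, then confirm the deployment scales back down after the attack ends.
kubectl get hpa frontend --watch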
So when we look at how to run a game day: sessions of at least 30 minutes, and 30 minutes to one hour, are fabulous. You want to include two or more teams. Make sure you decide on your communication tool: is it Slack? Is it Confluence? Is it a Google Doc? We use Slack here at Gremlin; when I was at PagerDuty, we also used Slack. You want to plan and design two to three use cases, and you want to make sure that you have documented this. You want to assign roles such as commander, general observer, and scribe, and make sure people understand what their role during the day is. Document the results and then share widely in the organization, and if possible, share your learnings externally as well. Here are other examples of game days you can look at: dependency testing, capacity planning testing, and auto scaling testing. Capacity planning is an interesting one, because we've seen outages when folks tried to scale up during a peak traffic day and they actually didn't have the limits set correctly. They weren't even allowed to do that; they didn't have the permissions set up, there were caps, and they just couldn't scale. That's a bad situation, and you definitely want to test that your caps are not in place in a way that impacts you badly. These are things that you're thinking through when you're going through these game days. Also, here's a scenario planning template. You can find all of this at gremlin.com/gamedays, and again, we're documenting the attack target, the attack type, the failure we're simulating, the expected behavior and risk, and then the post-attack state and impact. And that's important because, going back to the very beginning of what we talked about, you're creating a hypothesis and you're testing it. We're not just randomly running around shutting things off. And yes, you can automate all of this, and you want to. You want to codify your chaos engineering experiments because you want to make them shareable, you want to do version control, and you want to show the history of your experiments. So look at how you can integrate chaos engineering experiments into your CI/CD pipeline (there's a sketch of what such a gate could look like below). You might do this for production readiness: they might say you need to pass this number of chaos engineering reliability experiments before you can ship your new service into production, or before you can ship your change. Or if you're going multi-cloud, you might want to make sure that all the code can pass through a set of chaos engineering experiments, because we all know that lift and shift really isn't a thing that works out well, since environments are so different and there's a lot of fine-grained detail. So it's great if you can automate your chaos engineering experiments, because you don't have to go around teaching everyone the SRE best practices. You can just build a system that can quickly go in and check, and you're giving people tips and knowledge so that they can build better systems in the future and going forward. So with that, as I mentioned, I would let you know how to get a free certification. We have a chaos engineering practitioner and professional certification, so you can head over to gremlin.com/certification to learn more about that. And then I want to thank everyone for being here, and I wish you a great day.
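Relating back to the point above about integrating experiments into a CI/CD pipeline, here is a minimal sketch of what a reliability gate could look like. The experiment invocation is deliberately left as a placeholder, since every chaos tool exposes its own CLI or API, and the staging URL and endpoints are assumptions for illustration.

#!/usr/bin/env bash
# Fail the pipeline if the service does not degrade gracefully during a pre-agreed experiment.
set -euo pipefail

SERVICE_URL="http://staging.bank-of-anthos.example.com"   # hypothetical staging address

# 1. Confirm steady state before injecting anything.
curl -fsS --max-time 5 "$SERVICE_URL/home" > /dev/null

# 2. Kick off the agreed experiment (placeholder for your chaos tool's CLI or API call).
# run_chaos_experiment --scenario blackhole-balancereader --duration 60s

# 3. While the fault is active, verify the customer-facing path still responds.
if ! curl -fsS --max-time 5 "$SERVICE_URL/home" > /dev/null; then
  echo "Service did not tolerate the injected failure; blocking this release."
  exit 1
fi

echo "Chaos gate passed."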
...

Julie Gunderson

Reliability Advocate @ Gremlin
