Conf42 Chaos Engineering 2024 - Online

Enhancing Hypothesis Development Through Resilience Modeling


Abstract

In this session, you will learn how to build a resilience model to create valuable hypotheses that enable you to maximize the value of your chaos engineering experiments.

Summary

  • Resilience modeling guides teams to anticipate scenarios before they lead to incidents. Teams then prioritize these scenarios as hypotheses to test through experimentation. Experiments allow observation of controls to understand their effectiveness in detecting and preventing incidents. In this session, you will learn how to build a resilience model to create valuable hypotheses.
  • AWS has worked with numerous customers in North America, Latin America, Europe and Asia to anticipate incidents using the practices that I'm going to share with you today. These customers have shared that they could probably have avoided over 65 incidents if they had created a resilience model prior to going live.
  • In order to prevent what could go wrong with specific applications, we want to dive deeper into user journeys or system functions. Different workloads might be related to multiple different user journeys. This helps us understand which services are in the critical path, and to explore how that path could fail.
  • The worked example is an architecture diagram for an online storefront based on Kubernetes running in Amazon EKS. What are the failure modes, what scenarios would cause the failure, and what controls are in place to mitigate these scenarios? With these hypotheses, you're able to design and run high-quality chaos engineering experiments.
  • Gunnar Grosch: Please check out the new resilience space we have over at community.aws. We keep adding resources for how to build and operate resilient applications on AWS. I'd be happy to connect with all of you on social media.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Resilience modeling guides teams to anticipate scenarios before they lead to incidents. Teams then prioritize these scenarios as hypotheses to test through experimentation. A resilience model documents the scenarios that may impact the system and the controls in place to guard against such impact. Experiments allow observation of these controls to understand their effectiveness in detecting and preventing incidents. My name is Gunnar Grosch and I am a principal developer advocate at AWS, focusing on resilience, chaos engineering and architecture. In this session, you will learn how to build a resilience model to create valuable hypotheses, and they allow you to maximize the value of your chaos engineering experiments. Most of you probably recognize this flywheel showing the phases of chaos engineering. It takes you from understanding the steady state of your system, to forming a hypothesis, to designing and running an experiment, to verifying your experiment by comparing to the steady state, and then finally to learning from the experiment and improving the system. Many sessions around chaos engineering focus on this phase, how to design and run experiments, and that's great. I love seeing examples of how chaos engineering tools work, or stories from real-world use cases of chaos engineering. Sometimes sessions also cover the verify phase, where we look at the results of these experiments. In this session, though, I want to focus on the phase before that, the hypothesis phase. How do you actually create valuable hypotheses that allow you to maximize the value of your chaos engineering experiments? Spending time before running experiments allows you to use your resources more effectively. And this goes back to the four key capabilities that a system needs to have in order to be resilient: anticipate, monitor, respond, and finally learn. Chaos engineering, well, that mostly falls in that fourth capability, learning about our systems.
But in order to prevent failures, we need to be able to anticipate, and that's where resilience modeling comes in. So, to simplify a bit here, we want to anticipate to learn better. So I want to start by sharing a quick story from AWS. This was published as a service event, which I would encourage you all to read; you can access the article using the QR code on this slide. In 2012, the ELB service team had tasked one of their operators to perform a routine maintenance procedure on the ELB control plane. The operator performed the procedure as instructed, but this resulted in the inadvertent deletion of configuration data from the control plane. Without this data, the control plane lost the ability to manage existing ELB resources, and this meant that any calls to modify existing load balancers began to fail, while calls to create and manage new ELB resources continued to succeed. So the service team took time to troubleshoot and identify the cause of the behavior. And when they realized what had happened, they also realized they didn't really have any recovery procedure to restore that deleted data. So they had to develop a recovery plan on the spot. And after putting it into action, they were able to recreate the missing data and then finally restore the service. So, using resilience modeling, this event could have been anticipated and prevented, because based on that resilience modeling, we could have seen that we were lacking these recovery procedures. And using the resilience model, we could have formed a hypothesis, and we could then have used the hypothesis to perform different chaos engineering experiments. So we have worked with numerous customers in North America, Latin America, Europe and Asia to anticipate incidents using the practices that I'm going to share with you today. And these customers have shared that they could probably have avoided over 65 incidents if they had created a resilience model prior to go-live.
And we've also seen that creating a resilience model creates a shared understanding of how a system works in a team, and it becomes a vehicle for shared learning about what can go wrong. So by creating a resilience model, these Fortune 500 companies are preventing incidents and gaining confidence in the resilience of their systems and in how they operate the systems. So, before we get into the process of resilience modeling, let's first define some important terms. First, the term system. So, this is a reference architecture diagram for a container-based e-commerce application running on AWS. An architecture diagram like this shows the components in the IT stack for hosting the application, but it doesn't show the entire system. For example, this diagram doesn't highlight the version control system for the infrastructure as code, or perhaps the monitoring platform for understanding the system state. It also doesn't show the human operators, the people who are managing the system at runtime. So even if we sometimes think of a system as fully represented by an architecture diagram, that's really not the case. A lot is missing from that architecture diagram. A system includes the tech stack, which hosts the e-commerce application. It's the components that the users of the system communicate with. But there are additional controls in place to help the system continue to function over time. Things like auto scaling groups, circuit breakers in the code, and failover logic for the RDS database. Those are just some of the automated controls which respond to signals coming from the IT stack and then change the IT stack to ensure continued availability. And in addition to the automation, there are human operators, the people who receive alerts from the system. They have dashboards for observing the health of the system. They can respond by restarting or redeploying components as needed to ensure that users can continue to access the system.
And all of these things make up a system. And this is important because both the automation and the operators can change the IT stack, and that could result in either causing downtime or preventing downtime. So when modeling the resilience of the system, we need to consider these elements in addition to the components in the architecture diagram that we're used to seeing. So what is a system function? Well, many of the systems you work with are probably quite large and diverse in terms of their components, the dependencies, and the team supporting the system. So, to reduce the number of elements that are considered during modeling, we align our thinking with the key functions of the system. So in order to prevent what could go wrong with specific applications, we want to dive deeper into user journeys or system functions. A user journey or a system function is related to the capability that a specific workload has to deliver value to the business. Different workloads might be related to multiple different user journeys. If we take an e-commerce platform, for example, we should be able to break it down into multiple areas: authentication, personalization, ordering, delivery, and so on. Focusing on all of those areas at once is mostly painful, and it would lead the application owners into an infinite engagement. And this is why we want to break it down into these smaller pieces. So, thinking of the example we just looked at, we can take selling an item as the main focus. It helps us understand which services are in that critical path, and to explore how that path could fail. And this is ultimately the goal of breaking the system into these smaller user journeys. So after we've identified the system functions for a system, we need to understand how each system function might behave and how it might fail to completely meet the business objectives.
And this then enables us to begin to associate a cost with these failures. Failure modes should be written from the perspective of the business or the business process the system supports, and they shouldn't call out the cause of the actual failure. It's typical that a failure mode will have more than one potential cause as well. So for each system function, you should try to consider what happens if the function were to fail, or if it were to over- or underperform. Consider if the function succeeded intermittently, or if the function were to execute when it shouldn't. So if we then look at a function, let's say we're looking at a user login. Well, no function means that we have login failure. Over function might mean that users log in and have administrative rights. Under function, well, that could mean that users are logging in, but they only get read-only access. Intermittent function might mean that only some of the users are able to log into the system, and unintentional function that the wrong user gets logged in. So, we then determine the loss to the business of each of these failure modes. After all, preventing loss to the business is the main reason why we're trying to improve resilience in the first place. For each loss, we should then try to calculate the cost to the business of that loss. Often this will be quantified through customer satisfaction, customer trust, lost sales, and it might even be fines. And the application team can then later weigh the implementation and operation cost of controls against what they are actually mitigating. So then, after identifying failure modes and their cost to the business, the team will anticipate the scenarios that would lead to each failure mode, and can then begin to align controls to these different scenarios. And these controls, well, they will be one of four different types. Detective controls.
They are used to understand when the failure scenario has occurred or when it's about to occur. That could, for instance, be alarms on the number of error responses sent to clients, or it might be a decline in the number of orders per second in our system. Preventive controls. Those are the type of mechanisms that we put in place to prevent impairment to the system when the failure scenario has occurred. I mentioned circuit breakers in the code earlier. That's one example. It might also be different types of redundancy that we put in place in the architecture. Corrective controls. These are mechanisms or procedures in place to clear the system of impairment if it has been affected by the failure scenario. And then testing controls. Those are the type of tests that we have in place to try to detect whether the system is susceptible to a failure scenario. So these are the four types. Detective controls: how can you detect that this happens? Preventive controls: are you taking any measures to avoid this failure? Corrective, or recovery, controls: if it happens, what do you do? How do you recover? And testing controls: do you have any processes to test against this failure? And when you're creating a resilience model, you want to map the losses to their failure modes and the failure modes to their failure scenarios. Each failure mode may have one or more failure scenarios. And each failure scenario is going to have multiple controls. So let's look at an example where a failure mode was anticipated for a data distribution system. If data was not fully transferred to clients, there was a potential for fines to be issued to the business. Two causes of this failure mode were identified, and then the controls for detection, prevention and correction were also identified. The team in this case didn't have any mechanism for testing this scenario. So we have the business loss, the failure mode, the scenario and the controls.
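The mapping from loss to failure mode to scenario to controls can be written down as a small data structure. The following is a minimal sketch, not a format prescribed in the talk: the class and field names are hypothetical, and only the one scenario and two controls that the talk names for the data distribution example are filled in. Which control type each of those two controls belongs to is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ControlType(Enum):
    DETECTIVE = "detective"
    PREVENTIVE = "preventive"
    CORRECTIVE = "corrective"  # the talk also calls these recovery controls
    TESTING = "testing"


@dataclass
class Control:
    type: ControlType
    description: str


@dataclass
class FailureScenario:
    description: str
    controls: list[Control] = field(default_factory=list)


@dataclass
class FailureMode:
    description: str  # written from the business perspective, without a cause
    loss: str         # what the business stands to lose
    scenarios: list[FailureScenario] = field(default_factory=list)


# Data distribution example. The classification of the two controls
# as preventive and detective is an assumption for this sketch.
data_transfer = FailureMode(
    description="Data is not fully transferred to clients",
    loss="Potential fines issued to the business",
    scenarios=[
        FailureScenario(
            description="Network mutates the response",
            controls=[
                Control(ControlType.PREVENTIVE,
                        "Checksum verifies the message content"),
                Control(ControlType.DETECTIVE,
                        "Application alerts if the checksum mismatches"),
            ],
        )
    ],
)


def describe(mode: FailureMode) -> list[str]:
    """Flatten the model into indented lines, one per element."""
    lines = [f"Loss: {mode.loss}", f"Failure mode: {mode.description}"]
    for scenario in mode.scenarios:
        lines.append(f"  Scenario: {scenario.description}")
        for control in scenario.controls:
            lines.append(f"    {control.type.value} control: {control.description}")
    return lines


print("\n".join(describe(data_transfer)))
```

Keeping the loss on the failure mode, rather than on each scenario, mirrors the idea that failure modes are written from the business perspective while scenarios carry the causes.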
Now, as I think most of you know, a hypothesis is a proposed explanation for any type of phenomenon. For a hypothesis to be a scientific hypothesis, the scientific method requires that one can test it. And the way we test them, well, that is of course through chaos engineering experiments. A hypothesis is usually written in the form of an if-then statement: the if gives us a possibility, and the then explains what may happen because of that possibility. And we then make use of the failure scenario and the controls. If failure scenario, then preventive control. If failure scenario, then detective control and recovery control, for instance. So going back to our previous example with the data distribution system, we can start to create high-quality hypotheses from our model. If the network mutates the response, then there is a checksum to verify the message content, and the application will alert if the checksum mismatches. And this then helps us to create a very clear and very testable hypothesis. We understand the scenario we want to test and we know the controls that are in place for that specific scenario. We can now use this hypothesis in our chaos engineering experiments. So let's now use the rest of this time to begin building a resilience model that we can then use to create hypotheses. And we're going to use our online storefront. So this is an architecture diagram for an online storefront based on Kubernetes running in Amazon EKS. The application also uses DynamoDB, Aurora, MySQL, ElastiCache for Redis and RabbitMQ. In addition to the components shown here, there is a GitHub repository which provides CI/CD, and we have an operations team which can access any part of the system in production during operations. So then we have to ask, what are some of the key system functions for this system? Well, in this case, we can see the critical path for the submit order function. Orders are sent by the user through the storefront system.
Custom code communicates with the payments processor, the pricing API and the inventory API to process the order. For the submit order function, what are the different failure modes? We can now use no function, over function, under function, intermittent function and unintentional function as a template when we're creating this. So if we were to model this submit order function, we need to ask ourselves these questions: what are the failure modes, what scenarios would cause the failure modes, and what controls are in place to mitigate these scenarios? So let's start the model. The failure mode is order submission fails. Our first failure scenario is that the TLS certificate on the application load balancer is expired. We have detective controls in place: alarms will notify operators if certificate errors occur. Preventive controls: well, our TLS certificates are rotated annually. Recovery control: the support department coordinates with operations to troubleshoot. And testing controls: we actually don't have any testing controls in place to be able to test for this scenario. The second failure scenario is that the storefront is unable to find user sessions in cache. Detective controls: well, we don't have any. We also don't have any preventive controls for this failure scenario. But the recovery control is that users are then redirected to the login page and their shopping basket is maintained. The testing control for this failure scenario is that we have automated testing in place to verify that logged-out users are redirected to the login page. And the third failure scenario that we think of is that under high load, the cache evicts recent user sessions. We have detective controls in place: alarms notify the operators if the number of evictions is nonzero. We also have preventive controls: the cache is right-sized through load testing. Our recovery control is that the operations team needs to grow the size of ElastiCache.
We don't have any testing controls for this failure scenario. So with these three failure scenarios that we have for our failure mode, order submission fails, we can now use the technique we looked at earlier and start forming our hypotheses. So for the first failure scenario, the TLS certificate on the application load balancer is expired, we can create a hypothesis that would be: if the TLS certificate on the ALB is expired, operators are notified and troubleshooting starts. The second failure scenario, the storefront is unable to find the user session in cache. Well, we can form a hypothesis, that is: if the storefront is unable to find the user session in cache, the user is redirected to the login page and their shopping basket is maintained. And then for the third failure scenario that we found, under high load the cache evicts recent user sessions, we can form a hypothesis, that is: if the cache evicts recent user sessions, operators are notified and the cache is right-sized. All three of these are very valuable and testable hypotheses. They allow us to maximize the value of our upcoming chaos engineering experiments. So if we go back to our flywheel of the phases of chaos engineering, we've now spent time forming valuable hypotheses with the help of resilience modeling. And now, well, we can move to the run experiment phase, where with these hypotheses, you're able to design and run high-quality chaos engineering experiments. So we've now maximized our chaos engineering efforts. By spending time before actually running experiments, you can use your chaos engineering and people resources much more effectively. So let's look at some key takeaways from this. First off, you should consider the entire system, not only the things you see in the architecture diagram. Consider everything around it, everything from the people to the observability, to where you get your code from, and so on.
Next, try to find the system functions and identify the critical path within your system. That helps you then zoom in and find these failure scenarios. Write your failure modes from a business perspective. Think from the business side, think in terms of loss, when you are forming or writing your failure modes. Anticipate the scenarios that would lead to each of these failure modes. How would one of these failures happen? Then anticipate and write down that scenario. For each of these scenarios, try to identify the controls that you have in place based on those four different control types that we looked at. And finally, create your hypothesis based on that failure scenario and the controls that you have in place, and you can then use that hypothesis to run higher-quality chaos engineering experiments. So, before I leave you, I want to show you this. Please check out the new resilience space we have over at community.aws. We've gathered, and we keep adding, resources for how to build and operate resilient applications on AWS. And with that, I want to thank you for joining this session. We've looked at how to build a resilience model to help us create more valuable hypotheses, and that allows us to maximize the value of our chaos engineering experiments. My name is Gunnar Grosch. I'd be happy to connect with all of you on social media. You can find me on most of them using the details shown on screen right now. Thank you all very much.
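The storefront model built in the session, and the if-then hypotheses derived from it, can be sketched as plain data plus two small helpers. This is an illustrative sketch rather than tooling shown in the talk: the dictionary layout and the function names `missing_controls` and `hypothesis` are hypothetical, and the choice to build each hypothesis from the detective and recovery controls by default is an assumption. The scenarios and controls themselves are the ones listed in the walkthrough.

```python
CONTROL_TYPES = ("detective", "preventive", "recovery", "testing")

# Resilience model for the "submit order" system function.
# None marks a control type the team does not have for that scenario.
SUBMIT_ORDER = {
    "system_function": "Submit order",
    "failure_mode": "Order submission fails",
    "scenarios": [
        {
            "scenario": "the TLS certificate on the ALB is expired",
            "detective": "alarms notify operators of certificate errors",
            "preventive": "TLS certificates are rotated annually",
            "recovery": "support coordinates with operations to troubleshoot",
            "testing": None,
        },
        {
            "scenario": "the storefront is unable to find the user session in cache",
            "detective": None,
            "preventive": None,
            "recovery": "the user is redirected to the login page "
                        "and the shopping basket is maintained",
            "testing": "automated tests verify that logged-out users "
                       "are redirected to the login page",
        },
        {
            "scenario": "the cache evicts recent user sessions under high load",
            "detective": "alarms notify operators when evictions are nonzero",
            "preventive": "the cache is right-sized through load testing",
            "recovery": "the operations team grows the ElastiCache cluster",
            "testing": None,
        },
    ],
}


def missing_controls(scenario):
    """Return the control types that have no entry for this scenario."""
    return [t for t in CONTROL_TYPES if scenario.get(t) is None]


def hypothesis(scenario, control_types=("detective", "recovery")):
    """Build an if-then hypothesis from a scenario and selected controls."""
    expected = [scenario[t] for t in control_types if scenario.get(t)]
    if not expected:
        return None  # nothing to assert, so there is no testable hypothesis
    return f"If {scenario['scenario']}, then {' and '.join(expected)}."


for s in SUBMIT_ORDER["scenarios"]:
    print(hypothesis(s))
    print("  missing controls:", ", ".join(missing_controls(s)) or "none")
```

Listing the missing control types next to each hypothesis makes the gaps visible, such as the absent testing controls for the first and third scenarios, which is where a chaos engineering experiment adds the most information.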
...

Gunnar Grosch

Principal Developer Advocate @ AWS



