Conf42 Chaos Engineering 2024 - Online

Chaos in the cloud

Abstract

Beyond Traditional Testing: Cloud infrastructure adds an additional layer for testing in our traditional application landscape. We will demonstrate how to run Chaos Experiments at the Cloud Infrastructure Layer, using the AWS Fault Injection Simulator (Demo). Audience: engineers/architects.

Summary

  • We are going to talk about chaos engineering in the cloud. We will talk about the AWS Fault Injection Simulator. Then we're going to share some best practices on working with the Fault Injection Simulator and architecting for resiliency in general.
  • What is chaos engineering and why do we do it? We generally talk about kind of a loop, stressing a system, observing how it reacts, and then improving the system. And we do that to improve resiliency and performance.
  • Chaos engineering locally, on your own machine or server, versus in the cloud: why would we want to move chaos engineering into the cloud? We believe that using cloud services we gain a more holistic monitoring view of how our system works.
  • A region is made up of multiple availability zones, abbreviated as AZ. Each AZ is made up of one or more data centers. As you distribute an application across AZs, it becomes resilient against single-AZ failure.
  • The AWS Fault Injection Simulator is our fully managed chaos engineering service. It lets you reproduce real-world failures, whether they are very simple, like stopping an instance, or more complex, like throttling APIs. The service fully embraces the idea of safeguards, so you can make sure that a running chaos experiment does not impact your production deployment.
  • The Fault Injection Simulator injects faults into supported resources. There is a host of targets and actions supported across the categories of compute, storage, networking, databases, and management services. A demonstration then shows how the Fault Injection Simulator works in practice.
  • The first new scenario is about cross-region connectivity disruptions. Another new scenario is around availability zone power interruptions. We will take you through a journey of testing an application and improving its resiliency when it comes to availability zone failures.
  • We are going to set up a load testing tool that sends frequent requests against our application so we can inspect metrics like availability and the response time of our requests. Then we set up FIS to introduce some chaos in our application. And then we revisit our load testing tool to see the impact of our chaos experiment.
  • So now let's delete the old deployment and apply the updated one. Let's configure the same number of simulated users and run the load test. We give it a few seconds to reach a stable rate of requests per second. And as we can see, with this update our application reacted much better to the outage.
  • I want to share some best practices to get started with chaos engineering for your own applications. Always try to test as close to production as possible, maybe even in production. To start collecting your first hands-on experience, I can recommend this workshop.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to our talk, Chaos in the Cloud. I'm going to start by telling you a little bit about what we are going to do in the next 45 minutes. We are going to start with a recap of what chaos engineering is. Then we are going to talk about chaos engineering in the cloud, taking an aside into what the cloud is and what its resiliency features are. We're going to talk about the AWS Fault Injection Simulator, which is the AWS fully managed chaos engineering service. Then we are going to have a demonstration of how that looks in practice, and finally we're going to share some best practices on working with the Fault Injection Simulator and architecting for resiliency in general. What is chaos engineering and why do we do it? I like to start this talk with a quote from Werner Vogels. He's the CTO of Amazon.com, and he said: failures are a given, and everything will eventually fail over time. And that's so because with the rise of microservices and distributed architectures, the web and our applications have grown increasingly complex, and as a result, random failures have grown difficult to predict. At the same time, our dependence on those systems has only ever increased. As we moved away from monolithic architectures towards microservices, to become more agile, build faster, and be able to scale, we naturally added some complexity between those microservices. And chaos engineering is experimenting on those distributed systems to build confidence in them, to make sure they can survive failure, and to get a better understanding of their failure conditions and of their performance envelopes. That's all contained in the quote I have on the slide from principlesofchaos.org. So, diving a little bit deeper into chaos engineering: what is it? We generally talk about a kind of loop: stressing a system, observing how it reacts, and then improving the system. And we do that to improve resiliency and performance, uncover hidden issues in our architectures, and expose blind spots, for example in monitoring, observability, and alarming. Usually we also achieve a degree of improvement in recovery times, improving our operational skills with the system and the culture in our tech org. I talked about that being a loop, and going deeper into it, we see a cycle of setting an objective: we want to achieve something with our system, for example making it resilient against single-AZ failure. We set that objective, then we design and implement our system. We design an experiment to test whether our objective is actually achieved. We run the test, we operate the system, and from the experiments and the system's operation we learn about the system. We can respond to failure conditions, improve the system, set new objectives, and get better over time. So, chaos engineering locally, on your local machine or your server, versus in the cloud. I'm going to start by talking about chaos engineering in Java, and I chose Java because it's one of the most widely used languages and it has awesome support for chaos testing, both via libraries and via tooling. What you generally see here is that a lot of those tests are either JVM-based, running inside the local JVM; agent-based, running as a sidecar next to the JVM on the same machine; or moved into the service mesh connecting many systems together.
And that's really an awesome way to test your application code, to make sure it's resilient, and to be able to control those experiments from your application code. And if that is so awesome and already works quite well, it's a great improvement and everybody should do more of it, why would we want to move chaos engineering into the cloud? We believe that using cloud services we gain a more holistic monitoring view of how our system works. We can identify cloud-based challenges, for example around limits, around scaling, around interacting with other accounts. For example, we can run dedicated scaling experiments to learn where the failure points of our application are if we scale to a very large degree or very fast. And finally, we can validate disaster recovery strategies, where we gain the ability to run chaos experiments on a system and see if it can, for example, be switched over to another region without downtime or within a specified time frame. And this is the point where I'm going to take an aside into the resiliency features of the cloud, of AWS, and how they interact with applications. What we see on the slide is the region design of AWS: a region is made up of multiple availability zones, abbreviated as AZ, and each AZ is made up of one or more data centers. And that's the only time I'm actually going to talk about data centers. The fundamental logical unit of resilience you should think about is an availability zone, because availability zones are independent from each other. They have independent network providers, they have independent power connections, and if there is a geographic feature in the city where the region is located, the AZs will be placed while being mindful of that feature. For example, if there is a river flowing through the city, not all of the AZs will be close to the river, so we can be sure that in case of a flooding, not all of the AZs go down. And so as you distribute an application across AZs, it gains a degree of resilience: it becomes resilient against single availability zone failure, even though those failures are quite unlikely in practice. What does that actually look like? How do we distribute an application across multiple availability zones? What we see here is a simplified architecture of a web application: a load balancer distributing traffic to different instances, EC2 instances in our case, which act as application servers. Underneath that we see a primary and a standby DB instance, and those are also distributed across availability zones. In the event that, for example, availability zone A goes down, the load balancer would distribute traffic to B and C, and there would be an automatic DB failover to make the current standby instance in C the primary instance, and your application would continue to work.
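As a rough illustration of that pattern, here is a minimal CloudFormation sketch, with hypothetical resource names (the subnets are assumed to be defined elsewhere in the template): a load balancer spanning subnets in three availability zones, and a Multi-AZ database that fails over to its standby automatically. It sketches the idea, not the exact architecture from the slide.

    Resources:
      WebLoadBalancer:
        Type: AWS::ElasticLoadBalancingV2::LoadBalancer
        Properties:
          Subnets:                 # one subnet per availability zone (A, B, C)
            - !Ref PublicSubnetA
            - !Ref PublicSubnetB
            - !Ref PublicSubnetC
      Database:
        Type: AWS::RDS::DBInstance
        Properties:
          Engine: postgres
          DBInstanceClass: db.t3.medium
          AllocatedStorage: "20"
          MultiAZ: true            # provisions a standby in another AZ with automatic failover
          MasterUsername: dbadmin
          MasterUserPassword: "{{resolve:secretsmanager:db-secret}}"   # placeholder secret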
Now that we have a certain understanding of resiliency in the cloud and why we would want to test it, let me go deeper into the AWS Fault Injection Simulator and how it supports you in running chaos experiments. The AWS Fault Injection Simulator is our fully managed chaos engineering service. It's an easy way to get started, and it lets you reproduce real-world failures, whether they are very simple, like stopping an instance, or more complex, like throttling APIs. And finally, the service fully embraces the idea of safeguards, so you can make sure that a running chaos experiment does not impact your production deployment. I'm going to go into detail about all three of those features. It's easy to get started because we actually spent a lot of time making it easy to get started: when we talk to customers about chaos engineering, what they repeatedly told us is that it is a little bit hard to get started with chaos engineering, and so we wanted to make that as easy as possible. You can use the console to get familiar with the service and try things out, and then you can use the CLI to take advantage of templates and integrate the service with your CI/CD pipelines. Those templates are JSON or YAML files that you can share with your team; you can version control them and use all of the benefits and best practices associated with version control, like code review. And you have conditions: you can run experiment actions in sequence or in parallel. Sequences, for example, are used to test the impact of gradual degradation, like increasing latency, and parallel actions are used to test the impact of multiple concurrent issues, which is actually how a lot of real-world outages happen: you don't see outages because of a single failure, but because of a cascade of single failures leading up to a real-world outage. It currently supports services like EC2, RDS, ECS, and EKS, so virtual instances, databases, container runtimes, and managed Kubernetes, and we are working all the time to provide support for more services and for more complex conditions. And just to hit the nail on the head here: those faults are really happening at the service control plane level. An instance might actually be terminated, memory is actually being utilized, APIs are actually being throttled. It's not faking something with metric manipulation; it's actually impacting how things work at the control plane level. So you will have to use extra caution when using the service, and to enable you to do that, we have safeguards. Safeguards act as the automated stop button: a way to monitor the blast radius of your experiments, make sure it's contained, and ensure that failures created by the experiment are rolled back if alarms go off. That's the runtime control, what happens during an experiment. And the service of course integrates with identity and access management, IAM, and IAM controls can be used to restrict what fault types can be used in an experiment and what resources can be affected. That works with tag-based policies, so, for example, only EC2 instances with a tag environment=test can be affected. That's one of many possible safeguards you can implement. What kinds of targets and actions are supported? There's a host of targets and actions supported across the categories of compute, storage, networking, databases, and management services. And now we will dive somewhat deeper into the architecture of the Fault Injection Simulator and how it interacts with the different components of the AWS cloud. We see here a diagram of how the service works. At a high level, we start with an experiment template, which contains different fault injection actions, targets that will be affected, and safeguards to be run during the experiment.
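To make that concrete, here is a minimal sketch of what such a template can look like in YAML, as you would pass it to the API or CLI. The role and alarm ARNs are placeholders; the single action stops one tagged test instance, with a CloudWatch alarm as the stop condition, combining the tag-based scoping and the automated stop button just described.

    description: Stop one tagged test instance, guarded by an alarm
    roleArn: arn:aws:iam::123456789012:role/my-fis-role   # placeholder role
    targets:
      testInstances:
        resourceType: aws:ec2:instance
        resourceTags:
          environment: test          # tag-scoped: only test instances can be affected
        selectionMode: COUNT(1)      # affect at most one instance
    actions:
      stopInstance:
        actionId: aws:ec2:stop-instances
        targets:
          Instances: testInstances
    stopConditions:
      - source: aws:cloudwatch:alarm # the automated stop button
        value: arn:aws:cloudwatch:us-east-1:123456789012:alarm:my-alarm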
And you can see that here, slightly to the left of center, labeled experiment template, in white. When we start an experiment, the Fault Injection Simulator performs the actions, so it injects faults into the supported resources that are specified as the targets. Those faults then interact with your resources, and that will change what happens in your monitoring, in Amazon CloudWatch, for example, or in your third-party monitoring solution. And then you can take action based on the observability metrics, alarms, and logs you have there by using Amazon EventBridge, for example to stop an experiment if an alarm is triggered, or to start a second experiment at that point in time.
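One way such a reaction can be wired up is sketched below in CloudFormation YAML, with hypothetical names; the event detail-type is our understanding of the FIS EventBridge integration, so treat it as an assumption. The rule matches experiment state changes and forwards them to a target such as an SNS topic.

    Resources:
      FisStateChangeRule:
        Type: AWS::Events::Rule
        Properties:
          Description: React to FIS experiment state changes
          EventPattern:
            source:
              - aws.fis
            detail-type:
              - FIS Experiment State Change   # assumed detail-type string
          Targets:
            - Arn: !Ref AlertTopic            # hypothetical SNS topic defined elsewhere
              Id: notify-on-experiment-state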
So now we have an understanding of what chaos engineering is, why we want to do it in the cloud, and how the AWS Fault Injection Simulator works. This is the point where I'm going to hand over to Bent to tell you about some exciting new scenarios we launched at re:Invent, and to show you a demonstration of how the Fault Injection Simulator works in practice. All right, thank you very much, Oliver. We're now going to take a closer look at those two new scenarios we launched at re:Invent. The first one is about cross-region connectivity disruptions. We have customers who need to operate their application at the highest possible availability, and those customers typically architect their applications to span two regions. Those customers also asked us whether we could help make chaos testing easier for them, to test whether their applications can really withstand a connectivity disruption between two regions. Think, for instance, of an active-active or active-passive setup between two regions where all of a sudden your database, like a DynamoDB table, won't replicate new data. This is now possible with the new cross-region connectivity disruption scenario that is available with the Fault Injection Simulator. Another new scenario that we launched is around availability zone power interruptions. We already mentioned that those kinds of events have a really low likelihood of happening; however, there are customers who still want to experience how their application would behave in such an event. In the event of a power interruption, you would see, for instance, EC2 instances or containers stopping out of nowhere, and with the new availability zone power interruption scenario, you will be able to test how your application behaves under those kinds of events. But now we want to have a look at our own demo. We will take you through a journey of testing an application and improving its resiliency when it comes to availability zone failures. What we brought for today is a simple workload that is currently running in a single container: a simple API that responds with a pong to every request sent to it. It is running on EKS, as a single pod hosted on the single node we have in our cluster, which is running in availability zone A. I personally am not the most Kubernetes-experienced guy, so I'm not really sure how Kubernetes behaves in the case of an availability zone disruption, and I personally want to keep the cost of my application really low. So I want to check whether running a single pod is enough to tolerate the failure of an AZ, or whether Kubernetes will, for instance, automatically reschedule this pod on a new node hosted in a different availability zone. But before we jump into the console, I already want to show you the result of the experiment: of course this is not going to happen. We won't see an automatic pod reassignment to a different node in a different availability zone; that is not how it actually works. Instead, what we want to do to make the system more resilient is update our deployment to run in at least two different availability zones, and this is what we are going to take a look at now in the demo. Now it's time to see the Fault Injection Simulator in action. For this we are first going to take a look at the application we deployed in Kubernetes. Then we are going to set up a load testing tool that sends frequent requests against our application, so we can inspect metrics like availability and the response time of our requests. Then we are setting up FIS to introduce some chaos into our application, and finally we are revisiting our load testing tool to see the impact of our chaos experiment. So let's start by taking a look at the Kubernetes manifests. We can quickly see our deployment here; this is the Kubernetes resource that you need to deploy containers in your cluster. We can see that we are deploying one container on our cluster, pulling the container image for my application from an Elastic Container Registry in my account. I'm also using a node selector here to ensure that this container will be scheduled on a node running in the us-east-1a availability zone. Besides our deployment, we also have a service and an ingress resource; those are required to expose our deployment to the Internet, which we need in order to run our load test. One thing to highlight is the ingress section: we are using the AWS Load Balancer Controller to create an Application Load Balancer in the cloud from this ingress resource. Both the ingress and the deployment YAML files are already deployed.
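A condensed sketch of what the deployment manifest can look like is shown below; the name, image URI, and port are placeholders rather than the exact files from the demo. The node selector on the well-known zone label is what pins the single pod to us-east-1a, and therefore what makes it vulnerable to a single-AZ failure.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ping-api                  # hypothetical name
    spec:
      replicas: 1                     # a single pod, pinned to one AZ
      selector:
        matchLabels:
          app: ping-api
      template:
        metadata:
          labels:
            app: ping-api
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-east-1a   # schedule only in AZ a
          containers:
            - name: ping-api
              image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ping-api:latest
              ports:
                - containerPort: 8080

A Service plus an Ingress annotated for the AWS Load Balancer Controller (for example alb.ingress.kubernetes.io/scheme: internet-facing with target-type ip) would then expose the pod through an Application Load Balancer, as described above.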
So let's have a look at whether that was successful. I'm jumping into the console now and running kubectl get pod; this should return one pod. This looks good. Now we want to double-check whether this node is really running in the correct availability zone, and this looks good too: the node is running in us-east-1a. That means the container is running in the availability zone that we want to run a chaos experiment against. Now let's get the URL of the ingress, or to be more precise, of the load balancer: kubectl get ingresses. That looks good, and here is the address. Let's send a curl request, and we get a response: pong. Now let's copy this URL and set up the load testing tool. I'm using Locust for this. Locust runs on my local machine and will send a good amount of requests per second to my application. Let's start it. In the top right corner we can see the requests per second sent against our application, and here we can see the failures; failures would indicate a 4xx status code. Now we can take a look at the chart: we slowly ramped up, and we are now sending around 240 requests per second. This looks good: we have no failures, and our response time is quite stable at around 120 milliseconds. So I would say this is a successful deployment of our application. Now let's check what the resiliency of our application really looks like and see the impact of a chaos experiment. For this I'm going to the AWS console and searching for the service called AWS FIS. I'm opening it up and taking a look at the experiment templates. Here we can see one template that I already created, called disrupt-az-a, and it does exactly that. Let's have a look at it through the update wizard; I think that is really good for visualizing what's happening here. We can see that this template is quite small: we have one action, also called disrupt-az-a. Let's have a look at it first. Here we specify the action type, which is disrupt connectivity; it will prevent packets from leaving and entering the given subnets. Down here we can see the duration, in this case two minutes, and we have configured the target; the targets are the subnets that are impacted by this event. Now let's have a look at the targets. Opening up the target, we can see subnet-target-1. We are using a resource filter here to ensure that not every single subnet is targeted; instead, only subnets in the us-east-1a availability zone are in scope of this experiment. A subnet in AWS is a zonal resource, meaning a subnet is deployed in exactly one availability zone, so this configuration makes sure that no network traffic leaves or enters any subnet in us-east-1a. So let's go back and start this experiment to take a look at what's going on. I'm clicking start, and it takes some time until the experiment is actually executed, so you can take a look at the timeline to see what's going on. Currently it is pending; it just takes a couple of seconds to deploy the disruption. So let's wait. There we go, now it's actually running. If we go back to the Locust page, there we go: we see a drop in availability. You can see that straight away we have a heavily reduced number of requests per second, and now we can see that no requests are successful anymore, which shows us that not a single request is going through. What's interesting, if we go back here, is that those requests are left dangling: there is no bad response code like an HTTP 4xx, but the connection is just never fully opened and closed. This is really the impact of our availability zone outage. So let's wait until the experiment is finished. Here we go, the experiment is now finished. Let's go back, and we can see that straight away, once the experiment finished, our network was available again and our application continued to serve traffic. So here we can see that our current architecture, with one container deployed in one availability zone, is not really able to handle this situation well.
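Expressed as a template definition, the experiment we just ran looks roughly like the YAML sketch below. The role ARN is a placeholder, the filter path is our assumption of the attribute exposed for subnet targets, and a real experiment should use a CloudWatch alarm as the stop condition instead of none.

    description: Disrupt network connectivity in availability zone us-east-1a
    roleArn: arn:aws:iam::123456789012:role/my-fis-role   # placeholder role
    targets:
      subnet-target-1:
        resourceType: aws:ec2:subnet
        selectionMode: ALL
        filters:
          - path: AvailabilityZone      # subnets are zonal, so this selects the whole AZ
            values:
              - us-east-1a
    actions:
      disrupt-az-a:
        actionId: aws:network:disrupt-connectivity
        parameters:
          scope: all                    # block all traffic entering or leaving
          duration: PT2M                # the two minutes configured in the demo
        targets:
          Subnets: subnet-target-1
    stopConditions:
      - source: none                    # demo only; prefer a CloudWatch alarm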
So let's see how we can improve the availability. I will first stop the load test, and we are now jumping back into the editor, because I already prepared an updated deployment. Here we have the updated deployment. It's pretty similar to the previous one, apart from two major updates. First of all, we have two replicas, which means we deploy two containers. And then, what's even more important, we have an affinity rule: this affinity rule will ensure that those containers are spread across the nodes, and specifically spread across the different availability zones. So now let's go back to the terminal, delete the old deployment, and apply the updated deployment. This will take just a couple of seconds to deploy our pods. Let's have a look at them: there they are, running on two different nodes. And if we now again open up the nodes, we will see that those two nodes are actually running in two different availability zones.
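The relevant part of the updated manifest can be sketched like this, again with placeholder names: a required pod anti-affinity on the zone topology key forces the two replicas onto nodes in different availability zones.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ping-api
    spec:
      replicas: 2                       # one pod per availability zone
      selector:
        matchLabels:
          app: ping-api
      template:
        metadata:
          labels:
            app: ping-api
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - topologyKey: topology.kubernetes.io/zone   # spread across AZs
                  labelSelector:
                    matchLabels:
                      app: ping-api
          containers:
            - name: ping-api
              image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ping-api:latest

On newer clusters, topologySpreadConstraints would be an alternative way to achieve the same zonal spread.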
So now we can test again. Let's go back to the load testing tool, start a new test, configure the same number of simulated users, and run the load test. We give it a few seconds to reach a stable rate of requests per second. Here we go, this looks good. Now let's go back into the AWS console to rerun the experiment. We are going to the experiment templates, opening up the template, and starting another experiment. Let's start this one. This again just takes some time to initiate. Still pending. Let's see. Now it's running, so let's also wait for some time to check what is happening. And we can see quite a different result now. What we see here is a very, very short time period where our service wasn't available, and this is because of the load balancer health check, which checks the availability of its targets every 5 seconds. This is just the small time gap where the load balancer still thought that the container in the disrupted zone was available, and then, after the next evaluation, figured out that it could not send any further requests to that target. And as we can see here, with this update our application reacted way better to the outage. And this shows you the full lifecycle of running an experiment: we made an assumption, figured out that it was not correct, and updated our application to see an improvement in resiliency. And this concludes the demo for today. I now also want to share some best practices to get started with chaos engineering for your own applications. I would recommend starting with very small templates in the beginning, templates that, as in our demo, include maybe only one single action, because this allows you to very quickly understand the impact of a certain action. The second tip I have for you is testing close to your production environment. Let's say you have a containerized workload that runs on Docker Compose in your staging environment, while in production those containers run on a fully fledged EKS cluster. Here I would recommend adding a test environment that is architecturally much closer to what you have in production, because otherwise you won't be able to catch the flaws in your application architecture that actually apply in production, and you won't be able to increase the resiliency of your production application. So always try to test as close to production as possible, maybe even in production. The next one is about minimizing the blast radius. We mentioned that with the Fault Injection Simulator on AWS it's possible to minimize the blast radius in two ways. The first is limiting the access that the service has to your resources, for instance by applying the principle of least privilege in your IAM policies: you can limit the resources the Fault Injection Simulator can touch by, for instance, making sure that only the application servers and databases of the staging environment are accessible to the service. And we would also recommend using the health check capabilities, the emergency stop conditions, to stop an experiment when you see that you really have degraded health in your application, for instance when you test in production. To get started with collecting your first hands-on experience, I can recommend the workshop that we have for you here. With this workshop, you have a guided, step-by-step experience where you will learn and understand the different functionalities of the Fault Injection Simulator firsthand. I would recommend you check it out, either by scanning the QR code or by visiting the URL on the screen. And this concludes the session. You see another QR code on the screen; this one is really important. If you scan this QR code or visit the URL, you can give us feedback, and we really need your feedback. We want to understand whether you liked the session and what we could improve next time. So please take a minute and fill out the form; it would mean a lot to us. We thank you very much, and we wish you a great day ahead and fun with all the other interesting sessions you have the chance to explore today. Thank you very much.
...

Bent Krause

Associate Solutions Architect @ AWS

Bent Krause's LinkedIn account

Oliver Steenbuck

Solutions Architect @ AWS

Oliver Steenbuck's LinkedIn account


