Conf42 Site Reliability Engineering 2021 - Online

Improve Resilience with Automated Chaos Engineering

The transition into more complex systems is accelerating, and chaos engineering has proved to be a great option in our toolbox for handling this complexity. But the speed at which we're developing and deploying makes it hard to keep up through manual chaos experiments, so we turn to automation.

In this session, we'll look at how automated chaos experiments help us cover a more extensive set of experiments than we can cover manually, and how they allow us to verify our assumptions over time as unknown parts of the system change.


  • Chaos engineering is the process of stressing an application in a testing or production environment. It's not just about improving the resilience of your application, but also its performance. Use cases include game days and frequent deployments.
  • Applications consist of multiple services that do different things, with dependencies between those services and other parts of the system. Frequent deployments are hard to cover manually with chaos engineering experiments. Systems are becoming more complex, and it's hard to build and keep a mental model of how they work.
  • Automation helps us cover a larger set of experiments than what we can cover manually. Automated experiments verify our assumptions over time as unknown parts of the system change. There are three ways we can automate our chaos engineering experiments.
  • Continuous integration and continuous delivery encourage frequent deployments. By adding chaos engineering experiments to our delivery pipelines, we can continuously verify the output and behavior of our system. A stop condition watches your application's behavior and stops an experiment if an alarm is triggered.


This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps practice? You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud. Hi everyone. Chaos engineering has proved to be a great option in the SRE and SDE toolbox, but the transition into more complex systems is accelerating. My name is Gunnar Grosch and I am a developer advocate at Amazon Web Services. In this session we'll look at how automated chaos engineering experiments can help us cover a more extensive set of experiments than what we can cover manually, and how they allow us to verify our assumptions over time as unknown parts of the system change. Chaos engineering, as most of you know, is the process of stressing an application in a testing or production environment by creating disruptive events, such as server outages or API throttling, then observing how the system responds and implementing improvements. We do that to prove or disprove our assumptions about our system's capability to handle these disruptive events. But rather than letting them happen in the middle of the night or during the weekend, we can create them in a controlled environment and during working hours. And it's important to note that chaos engineering is not just about improving the resilience of your application, but also its performance: uncovering hidden issues, exposing monitoring, observability, and alarm blind spots, and more, like improving recovery time, operational skills, culture, and so on. Chaos engineering is about breaking things, but in a controlled environment. We create these well-planned experiments in order to build confidence in our application, and in the tools we're using, to withstand these turbulent conditions.
And to do that, we follow a well-defined scientific method that takes us from understanding the steady state of the system we're dealing with, to articulating a hypothesis, running an experiment (often using fault injection), verifying the results, and finally learning from the experiment in order to improve the system: improvements such as resilience to failure, performance, monitoring, alarms, operations, well, the overall system. Today we're seeing customers using chaos engineering quite a lot, and that usage is really growing. Two clear use cases have emerged. Perhaps the most common way of doing chaos engineering experiments is the one-off experiment. This is when you create an experiment by, for instance, looking at a previous outage or other events in your system. You can also identify the services that have the biggest impact on your end users or customers if they go down or don't function properly, and then create experiments for those. Or maybe you've built a new feature, added a new service, or just made changes to the code or the architecture, and you create an experiment to verify that the system works as intended. Companies are doing this in different ways. Some have dedicated chaos engineers creating and running the experiments; for others, it's part of the SRE's responsibilities; or, as we partly do at AWS, chaos engineering is done by the engineering teams themselves on their own services. The other very common use case is to use chaos engineering as part of your game days. A game day is the process of rehearsing ahead of an event by creating the anticipated conditions and then observing how effectively the team and the system respond. An event could be an unusually high-traffic day, a new launch, a failure, or something else.
And you can use chaos engineering experiments to run a game day by creating the event conditions and monitoring the performance of your team and your system. Doing these one-off experiments, and perhaps the occasional game day now and then, gets us very far on the road to improving the resilience of our system. So isn't this enough? Well, it definitely can be, and it is for many, but let's look at an example: an e-commerce web application. This is our application, a simple e-commerce site where we have tons and tons of end users buying things off the site continuously. We've built it using well-architected principles: we set it up using multiple instances running in Auto Scaling groups spread over multiple Availability Zones, and our database instances use read replicas and replication across Availability Zones as well. So we're trying to build for resilience and reliability. And next? Well, this is just one part of the system, of course; this is the product service. We've also added chaos engineering as a practice to our example; in this case, we use it to verify the resilience of our service and to learn and gain confidence in the application. And we have, of course, introduced CI/CD practices. Continuous integration and continuous delivery have not only made frequent deployments possible, they even encourage them, and that's what we do in our use case example. As we know, frequent deployments are less likely to break, and it's more likely that we'll be able to catch any bugs or gaps early on. But frequent deployments, when done daily, multiple times a day, or perhaps even hourly in some cases, are really hard to cover manually with chaos engineering experiments; it's just hard to keep up with the pace. And, of course, we also have multiple services within our application, different services that do different things.
So besides the product service we have an order service, a user service, a cart service, a recommendation service, and a search service, all built using slightly different architectures. They have different code bases, and perhaps even different teams building and running these parts of our application. And then there are dependencies between these services and different parts of our system. The cart service is, of course, dependent on the user service; the product service is also used by the cart service and the order service; the order service and the cart service work together; the search service needs to be able to search our products; and the recommendation engine uses our products as well. So we create these dependencies between our microservices or services within the application. And that's also hard, because teams operate differently; they might make changes to the application at different times, and you depend on that other service being there. Creating experiments that are able to cover those changes is also quite hard. Based on what we just looked at, some learnings from this very simple use case: frequent deployments are hard to cover manually with chaos engineering experiments, just because we do them often; it's hard to create experiments and run them as frequently in a manual fashion, and covering a more extensive set of experiments is time consuming. And even though you might have full control of the service or microservice that you're working on, unknown parts of the system might change because other teams are making changes. Finally, systems are becoming more complex; it's hard for anyone to build and keep a mental model of how the system works, let alone keep documentation up to date. And that brings us to automated experiments. Automation helps us cover a larger set of experiments than what we can cover manually.
Automated experiments verify our assumptions over time as unknown parts of the system change. And doing automated experiments really goes back to the scientific part of chaos engineering: repeating experiments is standard scientific practice in most fields, and repeating an experiment more than once helps us determine whether the data was just a fluke or whether it represents the normal case. It helps us guard against jumping to conclusions without enough evidence. So let's take a look at three different ways that we can automate our chaos engineering experiments. First off, let's think about how our system evolves. I mentioned it in the use case example: even when we might have full control over our service or microservice, the one we're working on, other teams or even third parties are making changes, delivering new code, and releasing new versions of their services, and those might be services that you depend on. So the verification you got from doing a one-off chaos experiment a week ago or a month ago might quickly become obsolete because of these other changes. By scheduling experiments to run on a recurring schedule, you can get that verification over and over again as unknown parts of the system change. So let's take a look at how this can quite easily be achieved. I'm building a simple scheduling service for my chaos engineering experiments, using a simple serverless application: a scheduled CloudWatch Events rule, based on the schedule that I define, basically a cron expression for when it should run, and a simple AWS Lambda function that will take the experiment template that I define and run it on that schedule. In this case I'm using AWS Fault Injection Simulator, AWS FIS, but the same principle works no matter whether you're using another system or your own scripts for chaos engineering experiments.
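The scheduler Lambda described above can be sketched as below. This is a minimal, hedged example, not the exact demo code: the `EXPERIMENT_TEMPLATE_ID` environment variable name is an assumption, and the FIS client is injected so the helper can be exercised without an AWS account.

```python
import os
import uuid


def start_scheduled_experiment(fis_client, template_id):
    """Start an AWS FIS experiment from a scheduled invocation
    and return the new experiment's id."""
    response = fis_client.start_experiment(
        experimentTemplateId=template_id,
        clientToken=str(uuid.uuid4()),  # idempotency token for safe retries
    )
    return response["experiment"]["id"]


def handler(event, context):
    """Lambda entry point (env var name is an assumption)."""
    import boto3  # deferred import keeps the helper above testable offline
    fis = boto3.client("fis")
    return start_scheduled_experiment(fis, os.environ["EXPERIMENT_TEMPLATE_ID"])
```

In a real deployment the Lambda's execution role would also need `fis:StartExperiment` permission on the template.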
So this is my application: the instances running, let's say, the product service that we looked at before, multiple instances running in different Availability Zones. I've created a simple experiment for that, so let's create the template. It uses CPU and memory stress on our instances. I can now take this simple experiment template and, using my scheduler, my Lambda function, deploy it and define which experiment should run on the schedule: just pasting in my experiment template ID, setting up the schedule, in this case once per day with simple cron syntax, and deploying. The deployment has started; let's switch to the AWS Lambda console. All right, the deployment is done; switching back. This is Amazon EventBridge, our event bus, so we can have a look at what actually got deployed. This is our schedule. I'm copying a sample scheduling event, and let's try this out: back to AWS Lambda, pasting in that sample event, and now we can test it. This is as if my schedule were to run right now. The Lambda function kicks off, and that should then start an experiment. Yes, in AWS FIS we can see that an experiment is running. This is a very simple example using CPU stress. Here is one of the instances I'm logged into, and if we watch the CPU levels, we can see that the instance is being stressed by my chaos experiment; it's also using up memory on the instance. In this case my experiment is doing the steps that I defined in my template, but since this is an automated, recurring experiment, it will run over and over again and verify the same set of conditions for me. And since it is automated, we need to have stop conditions in place, meaning alarms that will stop the experiment if anything goes wrong. So that was our first example.
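The recurring schedule itself takes only a couple of EventBridge calls; here is a minimal sketch, where the rule name, the 09:00 UTC cron expression, and the Lambda ARN are all illustrative, and the client is injected for offline testing.

```python
def schedule_experiment_lambda(events_client, rule_name, lambda_arn):
    """Create a daily EventBridge rule targeting the scheduler Lambda."""
    events_client.put_rule(
        Name=rule_name,
        # cron(minutes hours day-of-month month day-of-week year)
        ScheduleExpression="cron(0 9 * * ? *)",  # example: 09:00 UTC daily
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "chaos-scheduler", "Arn": lambda_arn}],
    )
```

Note that EventBridge also needs permission to invoke the function (granted with `lambda:AddPermission`), which is omitted here for brevity.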
So now let's look at the second example. This second approach to automation is to run chaos experiments based on events, and an event is basically anything that happens within your system. It could be an event related to the tech stack, for instance, that latency is added whenever there's an Auto Scaling event and new instances are started. Or maybe it's a business-related event, like an API being throttled when items are added to the cart. Building automation around these types of experiments can help you answer those quite hard-to-test questions: what if this happens while that is happening, even when that is an event in a totally different part of the system? So let's look at an example of that as well. Once again, a simple automation setup using a serverless application; in this case it's an event-triggered experiment. We set up an event pattern in EventBridge, and in this case the pattern is EC2 Auto Scaling: whenever an EC2 instance is launched, it will kick off a Lambda function, and that Lambda function is pretty much the same as in the previous example, meaning that it will start an experiment. If we look in EventBridge, we can see that besides EC2 Auto Scaling events, we can create these types of event patterns for a whole bunch of different AWS services, or we can create custom patterns, meaning it could be a pattern based on a business metric, something that happens, as I mentioned before, like items added to the cart, or a third-party service as well. All right, we have that in place; we just need to define which Auto Scaling group it should be based on. And of course I have stop conditions in place. Once again, these are automated experiments; we won't watch them manually every time, so we need to make sure that stop conditions are in place to stop the experiment if an alarm is triggered. So, creating this new experiment template, we now deploy my event-triggered experiment automation, and we have it in the AWS Lambda console. Here we go.
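The EventBridge pattern for the Auto Scaling trigger, as I understand the demo, looks like the sketch below. The `matches` helper is a hypothetical local approximation of EventBridge's exact-match semantics, included only so the pattern can be tested without AWS.

```python
# Pattern for the demo's trigger: successful instance launches
# reported by EC2 Auto Scaling.
LAUNCH_PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance Launch Successful"],
}


def matches(pattern, event):
    """Minimal local check of EventBridge-style exact matching:
    every pattern field must be present with one of the listed values."""
    return all(event.get(field) in allowed for field, allowed in pattern.items())
```

A real rule would pass a pattern like `LAUNCH_PATTERN` (as JSON) to EventBridge via `put_rule(EventPattern=...)`, with the experiment-starting Lambda as its target.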
So, switching to EC2, let's make a change to one of our Auto Scaling groups: changing the desired capacity from two instances to three, which is an Auto Scaling event. That gets picked up by EventBridge, which triggers our AWS Lambda function, which in turn triggers our AWS FIS experiment. We can see that one experiment is running, and looking at a logged-in instance, we can once again see that the instance is using CPU and memory, meaning that our experiment is successful. And with the stop conditions in place, we don't need to watch the experiment manually; this can happen over and over again, and if an alarm is triggered, it will automatically stop, so our customers and end users aren't affected by the experiment. And that gets us to our third example. The third way of doing automated experiments is perhaps the most popular one so far, and the one I'm definitely getting the most questions about. Continuous integration and continuous delivery, as I said before, encourage frequent deployments, and this means that the application is less likely to break. But we have the problem shown in the previous use case: we have frequent deployments, but aren't able to do chaos engineering experiments as frequently. By adding chaos engineering experiments as part of our delivery pipelines, we're able to continuously verify the output or behavior of our system. So let's look at an example of that as well. This is our pipeline; it simply deploys to staging and then to production, a demo-purpose pipeline, built using infrastructure as code, of course. So we have our pipeline with its stages: fetching the source, deploying to staging, and deploying to production. Now we're adding an experiment stage for the staging environment, so after deploying to staging, let's kick off the deployment of this updated template.
So after deploying to staging, the pipeline will run an experiment on the staging environment. What it does is simply run a state machine using AWS Step Functions that starts the experiment and then monitors it to make sure that it either succeeds, or, if it fails, stops the experiment. Back to the pipeline, and now we have the new experiment stage in place. Let's give it a try. And as one does for a demo, let's just edit straight in GitHub, make a small change to our code base, and commit. That straight away kicks off the pipeline: fetching the source, deploying to staging, which is a quick process in our demo environment. Then it gets to the experiment stage, where it initiates our AWS Step Functions workflow, which in turn starts our AWS FIS experiment. That is running, as we can see. For the purposes of this demo, this is a really quick experiment, so it will quickly finish and complete so we can see what happens. All right, it's already completed. Let's switch back to the pipeline: succeeded, and then it moves on to the next stage, which is deploying to production. So, a very simple example of how we can add our chaos engineering experiments to a pipeline. Let's do another one: what if the experiment fails? What if an alarm is triggered and it doesn't work as intended? Let's release a new change: fetching the source from GitHub once again, then deploying to our staging environment. As soon as that is done, it kicks off our experiment once again. There we go, it's in progress. Let's check AWS FIS: the experiment is running. What I can do now is use the AWS CLI and set the alarm to be triggered. So I'm setting the alarm state for our stop condition. Let's try that. This means it will act as if an alarm was triggered, and FIS straight away stops our experiment. Switching back to the pipeline, we can see that it failed.
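The start-then-monitor behavior of that Step Functions workflow can be sketched as a simple polling loop. This is an assumption-laden sketch, not the demo's state machine: the poll interval is arbitrary, and `sleep` is injectable so the loop can be tested without waiting.

```python
import time

# Terminal experiment states per the AWS FIS API;
# "completed" is the only successful outcome.
SUCCESS = {"completed"}
FAILURE = {"stopped", "failed"}


def run_and_watch(fis_client, template_id, poll_seconds=10, sleep=time.sleep):
    """Start an FIS experiment and poll until it reaches a terminal
    state, returning True on success so the pipeline can proceed."""
    exp_id = fis_client.start_experiment(
        experimentTemplateId=template_id,
    )["experiment"]["id"]
    while True:
        status = fis_client.get_experiment(
            id=exp_id)["experiment"]["state"]["status"]
        if status in SUCCESS:
            return True   # stage succeeds; pipeline moves on to production
        if status in FAILURE:
            return False  # stage fails; deployment halts
        sleep(poll_seconds)
```

A stopped experiment (for example, one halted by a stop-condition alarm) counts as a failure here, which is exactly what keeps a bad change out of production.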
And the failed experiment in this case means that the pipeline won't proceed to the next step, which would be deploying to production. For some reason our experiment failed: it might be that something is wrong with the code, something doesn't work, we have more latency, or whatever it is we're testing with our experiment. In this case, we won't move that change into production. We can build on this as well, of course, by adding an experiment stage after deploying to production, as an extra way of testing and making sure that everything works as intended. So this was an example of how to add experiments to your pipelines: first showing what happens when it works (it just proceeds to the next step) and then when it fails (it stops the pipeline). And that shows the value of having stop conditions in place: a stop condition watches your application's behavior and stops an experiment if an alarm is triggered. What kind of stop conditions you'll use is very much up to you and the use case, the traditional "it depends" answer. For instance, it might be that you're seeing fewer users adding items to the cart, or it might be a very technical metric, such as CPU levels above a certain threshold. So with these three options, the recurring scheduled experiments, the event-triggered experiments, and the continuous delivery experiments, we have three different ways to automate our chaos engineering experiments. Should you then stop doing one-off experiments and the periodic game day? Well, no, you shouldn't. They should still be at the core of your chaos engineering practice; they are a super important source of learning, and they help your organization build confidence. But now you have yet another tool to help you improve the resilience of your system: the automated chaos experiments.
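The stop-condition mechanics described above come down to two small pieces: a stop-condition entry in the experiment template pointing at a CloudWatch alarm, and (for the failure demo) forcing that alarm into the ALARM state. A hedged sketch, with placeholder names and an injected client:

```python
def stop_conditions(alarm_arn):
    """Stop-condition entry for an FIS experiment template: FIS halts
    the experiment whenever this CloudWatch alarm is in ALARM state."""
    return [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]


def force_alarm(cloudwatch_client, alarm_name):
    """Flip the alarm into ALARM state, mirroring the
    `aws cloudwatch set-alarm-state` CLI call from the demo."""
    cloudwatch_client.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="ALARM",
        StateReason="Manually triggered to verify the stop condition",
    )
```

The alarm itself would typically watch a business metric (items added to cart) or a technical one (CPU above a threshold), as discussed above.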
So one way to think about it is that experiments you start off creating as one-offs, or as part of your game days, can then turn into experiments that you run automated: after doing the experiment manually to start with, they can be set to run every day, every hour, or on every code deploy. And that brings us to a summary with a recap of some takeaways. Automation helps us cover a larger set of experiments than what we can cover manually; automated experiments verify our assumptions over time as unknown parts of the system change; and safeguards and stop conditions are key to safe automation. And introducing automated chaos engineering experiments does not mean that you should stop doing manual experiments. If you just can't get enough of chaos engineering to improve resilience, I've gathered some code samples, the ones used in the demos, and some additional resources for you at the link shown on the screen; just scan the QR code. And with that, I want to thank you all for watching. We've looked at how to improve resilience with automated chaos engineering. As ever, if you have any questions or comments, do reach out on Twitter as shown on screen, or connect on LinkedIn; I am happy to connect. Thank you all for watching.

Gunnar Grosch

Developer Advocate @ AWS

