Conf42 Chaos Engineering 2023 - Online

Building Resilience through Chaos Engineering


In this session, I will introduce you to the what, why, and how of chaos engineering. I will dive deep into the principles behind chaos engineering, how to keep your microservices architectures up and running, and how to apply these principles in the context of cloud applications using AWS services.


  • Chaos engineering focuses on operational excellence, reliability, and performance efficiency. Chaos engineering is a process of stressing an application by creating disruptive events and observing how the system responds to those events. More and more companies are turning to chaos engineering.
  • The AWS Fault Injection Simulator, or FIS, is a fully managed service provided by AWS for conducting chaos engineering experiments. It allows you to test your systems for real-world failures. FIS embraces the idea of a blast radius and of monitoring that blast radius.
  • How to build highly available, fault-tolerant systems on AWS. Set up chaos experiments using AWS Fault Injection Simulator. Create CPU stress as one of the scenarios you want to experiment on. Observe the impact on the resources from the CloudWatch dashboard.
  • I invite you to start testing the reliability of your systems using the chaos engineering techniques you've learned today. I'll share some resources in the next couple of slides that will help you on this journey. If you have not created an AWS account, this link will tell you how to.


This transcript was autogenerated. To make changes, submit a PR.
Everyone, in this session I'll introduce you to the what, why, and how of chaos engineering. I'll dive deep into the principles behind chaos engineering, how to keep your services up and running, and how to apply these principles in the context of applications built on top of AWS. Before we move on, I want to start with Werner Vogels's quote: everything fails all the time, and hence we need to build systems that embrace failure as a natural occurrence.
Creating technology solutions is a lot like constructing a physical building. If the foundations aren't solid, it may cause structural problems that undermine the integrity and the function of the building. The AWS Well-Architected Framework is a set of design principles and architectural best practices for designing and running services in the cloud. The framework is built on years of experience architecting solutions across a wide variety of business verticals and use cases, and on designing and reviewing architectures with thousands of customers on AWS. The framework has a set of questions to drive better outcomes for anyone who wants to build and operate applications in the cloud.
There are six pillars in the AWS Well-Architected Framework: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. Following these guidelines will enable you to build a system that delivers the functional requirements and meets your expectations. In this session, I'll touch on only the three pillars that are relevant to today's topic: operational excellence, reliability, and performance efficiency. First, what does operational excellence mean? Operational excellence is the ability to support development, run and monitor systems effectively, gain insight into your operations, and deliver business value by continually improving the supporting processes.
Reliability encompasses the ability of an application or a service to perform its intended function correctly and consistently. Performance efficiency is the ability of the system to use computing resources in the most efficient manner to meet your system requirements, and to maintain that efficiency as demand increases or technology evolves. In the reliability pillar of the Well-Architected Framework, there is a segment that talks about testing your system through failure injection, and this comes as a recommendation from Amazon's many years of experience building and operating large distributed systems. This practice of using fault injection to test your environments is better known as chaos engineering.
Let's go into the details of the what, why, and how of chaos engineering. Let's understand the what first. It's about designing your system to work despite failures, building stability into your system's behavior, and proactively looking for problems instead of waiting for them to happen and being surprised by them. Above all, chaos engineering needs a cultural shift for organizations to adopt the approach. Chaos engineering is a process of stressing an application by creating disruptive events, observing how the system responds to those events, and finally implementing improvements. So it's an approach to learning how your system behaves by subjecting it to scientific experimentation and finding evidence.
So let's talk about why chaos engineering. The rise of microservices and distributed cloud architectures, and the pace of innovation, development, and deployment of software, mean that systems are growing increasingly complex. While individual components may work in a development cycle, unexpected faults can appear when they are integrated, and these failures can be costly to businesses. Even brief outages could impact a company's bottom line.
So the cost of downtime is becoming a key performance indicator for many engineering teams. For an online retail business, even a few minutes of outage could have a large impact on revenue. So companies need a solution to this challenge. Waiting for the next costly outage is not an option. To meet this challenge head on, more and more companies are turning to chaos engineering.
So let's learn what's involved in adopting this approach. We make an assumption about our system, and we conduct experiments in a controlled environment to prove or disprove our assumptions about the system's capability to handle such disruptive events. Rather than let those disruptive events happen at 3:00 a.m., during the weekend, or in a production environment, we create them in a controlled development environment during work hours, and we experiment and see how the system behaves. We repeat these experiments at regular intervals, and thus learn more and more about the ability of the system to withstand interruptions, and we improve our systems so that they bounce back and provide the best possible service. So we are building reliability into our systems by using an approach called chaos engineering.
So let's talk about how the engineering approach facilitates building this resilience. There are five phases to chaos engineering. First, understand the steady state of the system you're dealing with. Second, articulate a hypothesis about your system. Third, run an experiment, often using fault injection. Fourth, verify the results. And finally, learn from your experiments in order to improve the system further. In order to build resilience into our systems, we should be able to identify under what circumstances and in what scenarios our systems are likely to fail. Then we can translate these scenarios into a set of experiments and learn to build stability around the system.
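The five phases above can be sketched as a small loop. This is a minimal illustration, not part of the talk: `run_chaos_experiment`, `ExperimentResult`, and `ToyService` are hypothetical names, and the "system" is an in-memory toy rather than real infrastructure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    steady_state_before: bool
    steady_state_during: bool
    recovered: bool

    @property
    def hypothesis_holds(self) -> bool:
        # The hypothesis holds only if the system stayed healthy throughout.
        return self.steady_state_before and self.steady_state_during and self.recovered


def run_chaos_experiment(
    hypothesis: str,
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    # Phase 1: confirm the steady state before touching anything.
    before = steady_state()
    if not before:
        # Never inject faults into a system that is already unhealthy.
        return ExperimentResult(hypothesis, False, False, False)
    # Phase 2 is the hypothesis itself, passed in by the caller.
    # Phase 3: run the experiment by injecting the fault.
    inject_fault()
    try:
        # Phase 4: verify the steady state while the fault is active.
        during = steady_state()
    finally:
        # Always undo the fault, even if verification raises.
        rollback()
    recovered = steady_state()
    # Phase 5: the caller learns from the returned result.
    return ExperimentResult(hypothesis, before, during, recovered)


class ToyService:
    """Hypothetical stand-in for a fleet of instances behind a load balancer."""

    def __init__(self, instances: int = 4, min_healthy: int = 2) -> None:
        self.desired = instances
        self.running = instances
        self.min_healthy = min_healthy

    def is_healthy(self) -> bool:
        return self.running >= self.min_healthy

    def terminate_half(self) -> None:
        self.running -= self.running // 2

    def restore(self) -> None:
        self.running = self.desired


service = ToyService()
result = run_chaos_experiment(
    "Losing half of the instances does not affect availability",
    service.is_healthy,
    service.terminate_half,
    service.restore,
)
print(result.hypothesis_holds)
```

Passing the steady-state check, the fault, and the rollback as plain callables keeps the loop independent of any particular tool, so the same shape could wrap a real fault injection service.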
For example, the chaos experiments you conduct could cover hardware failures, where a server goes down, or non-functional requirements, such as a spike in traffic. Or they could test your software services, where you send malformed responses and learn how your system responds.
Now, let's take a quick look at the kind of tooling available for conducting chaos engineering experiments. Before we go into the tooling itself, a little bit of history as to how it originated. Back in 2010, Netflix's engineering teams created a tool called Chaos Monkey, which they used to force failures into their systems and learn how to stabilize them, by fault injection such as terminating services or terminating a server that's running a particular service. Next, in 2011, the Simian Army added additional failure modes to provide a fuller set of failure testing capabilities. In 2017, a chaos engineering toolkit for developers emerged, mostly to give developers an open API for integrating chaos experiments into their systems and for automating them in CI/CD pipelines. In 2019, another powerful chaos engineering platform emerged for Kubernetes, for testing container-based services. It offers the ability to perform chaos experiments without modifying the deployment logic of the application itself. In 2009, actually, Kolton Andrus had built fault injection at Amazon. He later co-founded Gremlin, a failure-as-a-service platform, which launched in 2019.
So Gremlin helps build resiliency into your systems by turning failure into resilience. It offers engineers a fully hosted solution to safely conduct experiments on simple or complex systems, in order to identify weaknesses before they can impact the customer experience, and so reduce revenue loss. It allows developers to run experiments against hosts, containers, functions, or Kubernetes primitives. Gremlin is available on AWS Marketplace as well.
Another tool that I want to talk about today is the AWS Fault Injection Simulator, or FIS for short. It's a fully managed service provided by AWS for conducting chaos engineering experiments. It's designed to be easy for developers to use. It allows you to test your systems for real-world failures, whether with a simple test or a complex one. FIS embraces the idea of a blast radius and of monitoring that blast radius. It does so by giving you the ability to set up conditions around your experiments. So it's the idea of safeguarding your servers, even against mistakes, so that you can reduce the blast radius of the experiment. Alarms can be set off if those conditions are met.
So now let's take a quick look at the components that make up FIS. First are actions. Actions are the fault injection activities that you want to conduct experiments with, and these actions act on targets. Targets are the resources, not necessarily EC2 but any supported AWS resources, that you want the actions to be performed on. These resources can also be identified via tags. Then you have experiment templates, which form the basis for conducting a simple experiment first, and these templates can further be used to develop multiple experiments. Now let's talk about how to build highly available, fault-tolerant systems on AWS.
Before we do that, let me go through the details of the AWS global infrastructure. An AWS Region is a physical geographical location consisting of two or more Availability Zones, and an Availability Zone consists of two or more data centers with redundant power, networking, and connectivity. These Availability Zones are interconnected with low-latency networking.
Now let's suppose we have this three-tier architecture hosted on AWS. The web tier is hosted on Elastic Container Service, or ECS for short. Traffic arrives from the internet through an internet gateway and is distributed by an Elastic Load Balancer to the ECS cluster that forms the web tier. The web tier further distributes the traffic through another Elastic Load Balancer to another ECS cluster, which is the application layer, and we're using Amazon Aurora as the database tier.
Now let's take a look at how to set up chaos experiments using AWS Fault Injection Simulator. For our first scenario, let's ask what happens when servers experience CPU load. Our hypothesis is that if the CPU of a compute resource were put under stress, the availability of our website would not be impacted, because of the built-in resilience of the system. Now let's go through the steps to run this experiment. I'm assuming that you already have an AWS account. Go to your AWS console and search for AWS FIS. On the left-hand side of the FIS console you should find experiment templates, similar to what you're seeing on this slide. Using the experiment templates option you can create various experiments. Let's take, for example, causing CPU stress as one of the scenarios you want to run experiments on.
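The FIS building blocks described earlier (actions, targets, stop conditions, experiment templates) can be put together for the CPU stress scenario as a plain data structure. The sketch below only builds a dictionary and does not call AWS. The field names and the `AWSFIS-Run-CPU-Stress` document name follow my understanding of the FIS `CreateExperimentTemplate` request and should be checked against the current AWS documentation. The function name, the `Tier=web` tag, the `COUNT(1)` selection, and the ARNs are assumptions made for illustration.

```python
import json


def build_cpu_stress_template(
    role_arn: str,
    alarm_arn: str,
    region: str = "us-east-1",
    tag_key: str = "Tier",      # hypothetical tag used to select targets
    tag_value: str = "web",
    stress_seconds: int = 120,
) -> dict:
    """Return a dict shaped like an FIS experiment template (assumed shape)."""
    return {
        "description": "CPU stress on tagged EC2 instances",
        "roleArn": role_arn,
        # Stop condition: halt the experiment if this CloudWatch alarm fires,
        # which is how the blast radius is kept under control.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        # Target: resources identified by tag rather than by explicit ID.
        "targets": {
            "webInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",
            }
        },
        # Action: run the CPU stress document through Systems Manager.
        "actions": {
            "cpuStress": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress",
                    "documentParameters": json.dumps(
                        {"DurationSeconds": str(stress_seconds)}
                    ),
                    "duration": "PT5M",
                },
                "targets": {"Instances": "webInstances"},
            }
        },
    }


template = build_cpu_stress_template(
    role_arn="arn:aws:iam::123456789012:role/hypothetical-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:hypothetical-alarm",
)
print(json.dumps(template, indent=2))
```

Keeping the template as data makes it easy to review and version alongside application code. If the shape matches the service's API, a dictionary like this could be passed to an SDK call instead of being entered by hand in the console.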
When you navigate to create experiment template, you can choose from the prebuilt set of actions in FIS, such as one that uses AWS Systems Manager Run Command under the hood to create the CPU stress. In the top right corner of this slide you can see the Actions drop-down menu, which allows you to run the experiment. If you click on the run experiment option, you will be prompted with an input box to type the word "start" to confirm running the experiment. You will then be taken to the experiment details page. Make sure the experiment is in the running state, and note that the state of the experiment changes over time. To observe the impact on the resources, navigate to the monitoring dashboard of the ECS cluster, or prebuild a custom dashboard to view the impact of the CPU load on your application. You should observe a spike in the monitoring graph.
Similarly, you can conduct other experiments using the prebuilt actions. For example, you can use SSM commands for network packet loss. This action can be used as a basis for conducting a network stress experiment. Or say you want to run an experiment that mimics an Availability Zone (AZ) failure, by removing the Availability Zone from the underlying Auto Scaling group configuration and triggering a database failover, and observe the application's response. Here is the hypothesis for this scenario: terminating 50% of our EC2 instances will not affect the availability of our website, and the application will remediate itself to get back to the desired healthy capacity because of the Auto Scaling group under the hood. Running this experiment is going to terminate 50% of our total EC2 instances, in both the app and the web layer. The steps to run this experiment are as follows. Navigate to the AWS FIS console as shown earlier. Navigate to the experiment templates section on the AWS FIS page, select the experiment template, and in the top right corner, select the Actions drop-down and hit run.
Make sure the experiment is in the running state, and then observe the impact on the resources from the CloudWatch dashboard. The experiment template we are using here is terminate EC2, which is based on the action that terminates instances from the EC2 service. To observe the impact on the resources, you can either go to the CloudWatch dashboard or go to the ECS cluster and note the healthy hosts section. You should see a different number from the original steady state. You can also go to the instances section of the EC2 service dashboard, where you can see the EC2 instances getting terminated and, eventually, some new EC2 instances coming back up. And from the Auto Scaling menu you can navigate to the Auto Scaling groups section, where you will see the instance count decrease and then automatically restore.
Finally, to observe the impact on user experience, you can navigate to the CloudWatch service and set up a CloudWatch Synthetics canary. If you have this set up in advance, you can observe the change in the user experience, because a Synthetics canary monitors the user endpoints and the APIs. Basically, a Synthetics canary is a configurable script that runs on a schedule and monitors those endpoints. You can also navigate to the ECS service endpoint manually in the browser and check the user experience impact during the auto scaling process. Once this experiment completes, the application should return to its steady state.
Now let's take a look at how to make this a reality. It's not sufficient to run these experiments once and leave it at that. Your chaos scenarios, and hence the experiments, and hence your recovery design, are based on certain assumptions. For example, think of a data replication scenario where you assume that the system will complete a set of replication tasks within a set steady-state time.
As the data grows organically, the replication times may not hold anymore. It's hence important to conduct these experiments regularly, validate them, improve your results, and enhance your customer experience. Reality may differ, as I said earlier. One way to test your systems and bring the tests as close to reality as possible is through running game days. It's a concept where you bring in a set of people from different disciplines, or people who have not used your system much, so they do not have an idea of how it is going to work. You give them a brief overview and the scenario of events to run, gather the feedback, analyze the results, follow up with items to improve your system, and run these tests regularly. Conduct these game days regularly. Integrate these tests as part of your CI/CD pipeline, as I mentioned. If you have runbooks that need to be checked manually, they could very easily get out of date and run into issues once they are in production.
Finally, I want to leave you with this quote regarding chaos engineering: it isn't about creating chaos, it is about making the chaos inherent in the system visible. I invite you to start testing the reliability of your systems using the chaos engineering techniques you've learned today. I'll share some resources in the next couple of slides that will help you on this journey. If you have not created an AWS account, this link will tell you how to, and here are some links to go over and understand the AWS Well-Architected Framework, along with some hands-on labs.
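The console steps from the walkthrough (run the experiment, confirm it reaches the running state, wait for it to finish) could also be scripted, which is one way to fit experiments into a CI/CD pipeline. The sketch below is an assumption-laden illustration: `run_experiment` and `FakeFisClient` are hypothetical names, and the fake client only mimics the response shape I would expect from the `start_experiment` and `get_experiment` calls of an AWS SDK FIS client, so it runs without AWS access.

```python
import time
import uuid

# Statuses after which polling should stop (assumed status names).
TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment(fis_client, template_id: str,
                   poll_seconds: float = 10.0, max_polls: int = 60) -> list:
    """Start an experiment and return the distinct statuses observed, in order."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    observed = []
    for _ in range(max_polls):
        response = fis_client.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        # Record each status change once, like watching the details page.
        if not observed or observed[-1] != status:
            observed.append(status)
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return observed


class FakeFisClient:
    """Hypothetical stand-in for a real FIS client, used only for this sketch."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def start_experiment(self, clientToken, experimentTemplateId):
        return {"experiment": {"id": "EXP-hypothetical"}}

    def get_experiment(self, id):
        return {"experiment": {"state": {"status": next(self._statuses)}}}


fake = FakeFisClient(["initiating", "running", "running", "completed"])
states = run_experiment(fake, "EXT-hypothetical", poll_seconds=0)
print(states)
```

A pipeline step could fail the build when the final status is not `completed`, or when the steady-state checks that follow the experiment do not pass. Injecting the client as a parameter is what lets the same function run against a fake in tests and a real client in a pipeline.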

Shirisha Vivekanand

Solutions Architect @ AWS

Shirisha Vivekanand's LinkedIn account
