Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Hi. Welcome, everyone. In this session today, we'll discuss building confidence through chaos engineering on AWS. We'll learn what chaos engineering is, what it isn't, and what value chaos engineering provides, and how you can get started with chaos engineering within your own firms. But more importantly, I will show you
            
            
            
how you can combine the power of chaos engineering and continuous resilience and build a process that lets you scale chaos engineering across your organization in a controlled and secure way, so that your developers and engineers can build secure, reliable, and robust workloads that ultimately lead to a great customer experience. Right.
            
            
            
              My name is Narender Gakka. I'm a solutions architect at AWS,
            
            
            
              and my area of expertise is resilience as well.
            
            
            
So, let's get started. First, let's look at our agenda. I'll introduce you to chaos engineering, and we'll see what it is and, more importantly, what it isn't. I will also take you through the various aspects to think about as prerequisites for chaos engineering and what you need to get started in your own workloads and environments. We'll then dive deep into
            
            
            
continuous resilience and why continuous resilience is so important when we are thinking about resilient applications on AWS. Combining chaos engineering and continuous resilience, I will also take you through our Chaos Engineering and Continuous Resilience program that we use to help our customers build chaos engineering practices and programs that can scale across their organizations.
            
            
            
And lastly, I will share with you some resources and great workshops that we have, so that you can get started with chaos engineering on AWS on your own.
            
            
            
So when we are thinking about chaos engineering, it's not really a new concept. It has been around for over a decade now. And there are many companies that have already successfully adopted and embraced chaos engineering and have built mechanisms to find out what we call known unknowns.
            
            
            
These are things that we are aware of but don't fully understand in our systems. It could be weaknesses within our system or resilience issues. They also chase the unknown unknowns, which are things that we are neither aware of nor fully understand.
            
            
            
And through chaos engineering, these various companies were basically able to find deficiencies within their environments and protect their users from what we at AWS call large-scale events, and therefore ultimately deliver a better experience for their customers.
            
            
            
And yet, when we are talking about or thinking about chaos engineering, in many ways that is still not how people see it. Right? There is still a perception that chaos engineering is that thing which blows up production, or where we randomly just shut down things within an environment. That is not what chaos engineering is about. It's not just about blowing up production or randomly stopping or removing things.
            
            
            
              But when we are thinking about chaos engineering at AWS,
            
            
            
              we should look at it from a much different perspective.
            
            
            
Many of you have probably seen the AWS shared responsibility model for security. This one is basically for resilience. And there are two sections,
            
            
            
the blue and the orange part. For the resilience of the cloud, we at AWS are responsible for the resilience of the facilities, for example the networking, the storage, or the database aspects. These are basically the services which you consume. But you as a customer are responsible for how and what services you use, and where you place them.
            
            
            
Think, for example, about your workloads. Think about zonal services like EC2, where you place your data, and how you fail over if something happens within your environment.
            
            
            
But also think about the challenges that come up when you are looking at a shared responsibility model. How can you make sure that if a service you are consuming in the orange space fails, your workload is still resilient?
            
            
            
              Right. What happens if an availability zone
            
            
            
goes down at AWS? Is your workload or application able
            
            
            
              to recover from those things? How do you know if your
            
            
            
              workloads can fail over to another AZ?
            
            
            
And this is where chaos engineering comes into play and helps you with those aspects. So when you are thinking about workloads that you are running in the blue, what you can influence is the primary dependencies that you're consuming in AWS. If you're using EC2, if you're using Lambda, if you're using SQS, if you're using caching services like ElastiCache, these are the services that you can impact with chaos engineering in a safe and controlled way. And you can also figure out mechanisms for how components within your application can gracefully fail over to another service.
            
            
            
Sorry. So when we are thinking about chaos engineering, what it provides you is improved operational readiness, because with chaos engineering your teams will get trained on what to do if a service fails, and you will have mechanisms in place to be able to fail over automatically. You will also have great observability in place, because by doing chaos engineering you will realize what is missing within the observability you currently have and what you haven't seen. And when you are running these experiments in a controlled way, you'll continuously improve the observability part as well. And ultimately you will build resilience, so that the workloads which you build will be more resilient on AWS.
            
            
            
And when you're thinking about all of this put together, what it leads to is, of course, a happy customer and a better application. Right? So that's what chaos engineering is about: building resilient workloads that ultimately lead to a great customer experience.
            
            
            
              And so when you think about chaos engineering,
            
            
            
              it's all about building controlled experiments.
            
            
            
And if we already know that an experiment will fail, we're not going to run that experiment, because we already know why it fails and there is no point in running it; it's better to invest the time in fixing that issue. And if we know that we're going to inject a fault and that fault will trigger a bug that brings our system down, we're also not going to run the experiment, because we already know what happens when you trigger that bug. It's better to go and fix that bug. So what we want to make sure is that if we have an experiment, by definition that experiment should be tolerated by the system and should also be fail-safe, so that it doesn't lead you to issues.
            
            
            
And many of you might have a similar workload with a similar architecture, wherein you have the external DNS pointing to your load balancer, where you have a service running which is getting customer data from either a cache or a database, depending on your data freshness, et cetera. But when you're thinking about it,
            
            
            
let's say you're using Redis on EC2 or ElastiCache. What is your confidence level if Redis fails? Right? What happens if Redis fails? Do you have mechanisms in place to make sure that your database does not get fully overrun by all these requests which are suddenly no longer being served from the cache?
            
            
            
Or what if latency suddenly gets injected between two of your microservices and you create a retry storm? Right?
            
            
            
Do you have mechanisms to mitigate such an issue? What about backoff and jitter, et cetera?
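As a small illustration of the kind of mechanism being discussed here, the sketch below shows a retry wrapper with exponential backoff and full jitter; the function name, attempt counts, and delays are assumptions made up for the example, not anything prescribed in the talk.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry `operation` with exponential backoff and full jitter.

    Randomizing the wait between retries spreads the load out, which is
    what helps avoid the retry storms mentioned above when latency is
    injected between two microservices.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Exponential backoff, capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```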
            
            
            
And also, let's assume that you have, say, a cascading failure where everything in an availability zone goes down. Are you confident that you can fail over to a different Availability Zone? And think about the impacts that you might have on a regional service: what is your confidence if the whole region, the entire region, basically goes down?
            
            
            
Is your service able to recover in another region within the given SLA of your application? Right. So what is your confidence level with the current architecture that you have? Basically, do you have those runbooks and playbooks which will let you do this cross-region or cross-AZ failover seamlessly? And can you run through them? Right.
            
            
            
And so when you're thinking about chaos engineering, and about the services that we build on a daily basis, they're all based on trade-offs that we make every single day, right? We basically all want to build great, awesome workloads. But the reality is we are under pressure: we can only use a certain budget, we have time constraints, and we need to get certain features released on time. But in a distributed system, there is no way that every single person understands the hundreds of microservices that communicate with each other. And ultimately,
            
            
            
what happens if I think that I'm depending on a soft dependency, and someone suddenly changes the code so that it becomes a hard dependency? You suddenly have an event.
            
            
            
And when you're thinking about these events, usually they happen when you're somewhere in a restaurant or maybe somewhere outside, and you get called at an odd hour, and everybody runs and tries to fix the issue and bring the system back up. And the challenge with such a system is that once you go back to business as usual, you might hit that same challenge again. Right?
            
            
            
And it's not because we don't want to fix it, but because good intentions don't work; that's what we say at AWS, right? You need mechanisms which come into play, and those mechanisms can be built using chaos engineering and continuous resilience.
            
            
            
Now, as I mentioned in the beginning, there are many companies that have already adopted chaos engineering, across so many verticals, and some of them started quite early and have seen tremendous benefits in the overall improvement of resilience within their workloads. These are some of the industries we see on the screen, and there are many case studies or customer stories in that link. So please feel free to go through them if you belong to those industries, to see how they have leveraged chaos engineering and improved their architectures overall.
            
            
            
There are many customers that will adopt chaos engineering in the years to come. And there is a great study by Gartner, done for their infrastructure and operations leaders guide, which said that 40% of companies will adopt chaos engineering in the next year alone. And they are doing that because they think they can improve customer experience by almost 20%. Think about how many more happy customers you're going to have with such a number; it's a significant number, this 20%. So let's get to the prerequisites now on how you can get started with chaos engineering.
            
            
            
Okay, let's go through some of the prerequisites on how you can get started. First, you need basic monitoring, and if you already have observability, that's a great starting point. Then you need to have organizational awareness as well. Third, you need to think about what sort of real-world events or faults we are injecting into our environment. And fourth, of course, once we find those faults within the environment, once we find a deficiency, we remediate: we actively commit ourselves and have the resources to basically go and fix those, so that it improves either the security or the resiliency of your workloads. There's no point finding it but not fixing it, right? So that is the fourth prerequisite.
            
            
            
So when you're thinking about metrics, many of you already have great metrics. If you're using AWS, you have the CloudWatch integration and you already have the metrics coming in. In chaos engineering, we call metrics the known knowns: these are the things that we are already aware of and fully understand. And when you're thinking about metrics, for example CPU percentage or memory, that's all great, but in a distributed system you're going to look at many, many different dashboards and metrics to figure out what's going on within the environment, because no single one gives you a comprehensive view; each gives its own view.
            
            
            
So when you are starting with chaos engineering, many times when we are running the first experiments, even if we are trying to make sure that we are seeing everything, we realize we can't see it all. And this is what leads us to observability.
            
            
            
So observability basically helps us find that needle in a haystack. By collating all the information, we start looking at our baseline at the highest level instead of looking at a particular graph. And even if we have absolutely no idea what's going on, we're going to understand where we are. So basically, at a high level we know what our application health is, what sort of customer interaction we are having, et cetera, and from there we can drill down all the way to tracing. We can use services like AWS X-Ray to understand it. But there are also many open-source options, and if you already use them, that's perfectly fine as well, right. So when
            
            
            
you're thinking about observability, this is the key. Observability is based on what we call three key pillars: you have, as we already mentioned, the metrics, then you have the logging, and then you have the tracing. Now, why these three are important is because we want to make sure that you embed, for example, metrics within your logs, so that if you're looking at a high-level steady state, you can drill in, and as soon as you move from a trace to a log you see what's going on and can also correlate between all those components end to end. And so at that point you can understand where your application is.
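As a hedged illustration of embedding metrics within logs, the sketch below emits a single structured log line in CloudWatch Embedded Metric Format, so the same record carries a metric, a status code, and a trace id you could pivot into X-Ray; the namespace, dimension, and field names are placeholders invented for this example.

```python
import json
import time


def log_request_metric(latency_ms, status_code, trace_id):
    """Emit one structured log line in CloudWatch Embedded Metric Format.

    When this line lands in CloudWatch Logs, CloudWatch extracts LatencyMs
    as a metric from the "_aws" metadata, while the same line keeps the
    status code and trace id so metrics, logs, and traces can be
    correlated end to end.
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "PaymentsService",  # placeholder namespace
                "Dimensions": [["Operation"]],
                "Metrics": [{"Name": "LatencyMs", "Unit": "Milliseconds"}],
            }],
        },
        "Operation": "Checkout",   # placeholder dimension value
        "LatencyMs": latency_ms,
        "StatusCode": status_code,
        "TraceId": trace_id,       # lets you jump from the log into the trace
    }
    print(json.dumps(record))


log_request_metric(latency_ms=42.0, status_code=200,
                   trace_id="1-67891233-abcdef012345678912345678")
```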
            
            
            
So let's take an example of observability. When we are looking at this graph, even a person who has absolutely no idea what this workload is can see there are a few issues here. If you look at the spikes there, you're going to say, okay, something happened here. And if we drill down, we would see that we have a process which ran out of control, or there is a CPU spike, right?
            
            
            
And every one of you is able to look at the graph down here and say, wait a minute, why did the disk utilization drop? And if you drill down, you will realize that there was an issue with my Kubernetes cluster: the pods, right, the nodes suddenly started restarting, and that leads to a lot of 500 errors. And as you know, HTTP 500s are obviously not a good thing. So if we can correlate this, that is good observability: that because of such and such issue, this is my end user experience.
            
            
            
And you want to provide developers with an understanding of the interactions within the microservices, especially when you're thinking about chaos engineering and experiments; you want them to understand what the impact of an experiment is. And we shouldn't forget the user experience and what users see when you are running these experiments, because if you're thinking about the baseline and we are running an experiment and the baseline doesn't move, that means the customer is super happy, because everything is green, it's all working. If everything is fine from the end user's perspective even while you are doing the experiment, that is a successful, reliable, resilient application, right?
            
            
            
Now that we understand the observability aspects and have seen what basic monitoring and observability are, let's move on to the next prerequisite, which is organizational awareness.
            
            
            
What we found is that when you start with a small team, enable that team on chaos engineering, have them build common faults that can be injected across the organization, and then decentralize chaos engineering to the development teams, that works fairly well. Now why
            
            
            
is that? Why does a small team work well? If you're thinking about it, depending on the scale and size of your organization, you might have hundreds if not thousands of development teams who are building applications. There's no way that a central team will understand every single workload that is around. And there is
            
            
            
also no way that the central team will get the power to basically inject failures everywhere. But those development teams already have IAM permissions to access their environments and do things in their own environments. So it's much easier to help them run experiments than having a central team run them for them. Right. So you decentralize chaos engineering so that they can embrace it as part of the development cycle itself. That also helps with building customized experiments suitable for the workloads they are designing and building.
            
            
            
And the key for all of this to work is having that executive sponsorship, the management sponsorship, that helps you make resilience part of the software development lifecycle, and also shift the responsibility for resilience to those development teams who know their own application, their own piece of code, better than anybody else.
            
            
            
And then they can also think about the real-world failures and faults which their application can suffer or has dependencies on.
            
            
            
Now, what we see when we say real-world experiments is that some of the key experiments are around code and configuration errors. So think about the common faults you can inject when you are thinking about deployments, or think about experiments where you ask: okay, do we even realize that we have a faulty deployment? Do we see it within our observability if my deployment fails, or if it is causing a customer transaction to fail, et cetera? How do we run those experiments? And second
            
            
            
is when we are thinking about infrastructure. What if you have an EC2 instance that fails within your environment? Or suddenly, in a microservices deployment, you have an EKS cluster where a load balancer doesn't pass traffic? Are you able to mitigate such infrastructure events within your architecture? And what about
            
            
            
the data and state? Right, this is also a critical resource for your application. This is not just about cache drift: what if suddenly your database runs out of, let's say, disk space or memory? Do we have mechanisms to not only detect it and inform you that this happened, but also to automatically mitigate it so that your application keeps working resiliently, right? And then of
            
            
            
course you have dependencies. Do you understand the dependencies of your application on any third parties you have? It could be an identity provider or a third-party API which your application consumes every time a user logs in or, let's say, does a transaction. Do you understand those dependencies, and what happens if they suffer any issues? Do you have mechanisms to first test that, and also prove that your application is resilient enough to tolerate it and can work without them as well?
            
            
            
And then of course, although highly unlikely but technically feasible, there are natural disasters, or maybe human errors where something happened because a user did something. How can you fail over, or how can you simulate those events, and that too in a controlled way, through chaos engineering? Right. So these are some of the real-world experiments which you can do with chaos engineering.
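To make that "controlled way" concrete, here is a minimal sketch of one way such an experiment could be run with AWS Fault Injection Simulator (FIS) through boto3: stop one tagged EC2 instance and abort automatically if a CloudWatch alarm fires. The role ARN, alarm ARN, and the chaos-ready tag are assumptions for the example, not values from the talk.

```python
import uuid

import boto3

fis = boto3.client("fis")

# A minimal FIS experiment template: stop one tagged EC2 instance for five
# minutes. The stop condition ties the experiment to a CloudWatch alarm so
# it aborts as soon as customer impact shows up.
template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Stop one tagged instance to rehearse instance failure",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # assumed role
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:steady-state",  # assumed alarm
    }],
    targets={
        "one-instance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},  # assumed opt-in tag
            "selectionMode": "COUNT(1)",
        }
    },
    actions={
        "stop-instance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "one-instance"},
        }
    },
)

# Start the experiment once the template has been reviewed.
fis.start_experiment(experimentTemplateId=template["experimentTemplate"]["id"])
```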
            
            
            
And then the last prerequisite, of course, is about making sure that when we find a deficiency within our systems, which could be related to security or resilience, we can go and basically remediate
            
            
            
it, because it's basically worth nothing if you build new features but our service is not available. Right. So we need to have that executive sponsorship as well, so that we are able to prioritize these issues which come up through chaos engineering, fix them, and improve the resilience of the architecture in a continuous fashion. So that basically
            
            
            
now brings us to continuous resilience. When we are thinking about continuous resilience, resilience is not just a one-time thing, because every
            
            
            
day you're building new features, releasing them to your customers, and your architecture changes. So resilience should be part of everyday life when we are thinking about building resilient workloads, right from the bottom all the way up to the application itself.
            
            
            
And so continuous resilience is basically a lifecycle that helps us think about a workload from a steady-state point of view and work towards mitigating events like the ones we just went through, from code and configuration all the way to the very unlikely events like natural disasters, et cetera. We also need to build safe experimentation for these within our pipelines, and actually outside our pipelines as well, because errors happen all the time, not just when we provision new code, and we need to make sure that we also learn from the faults that surface during those experiments. And so
            
            
            
when you take continuous resilience and chaos engineering and put them together, that leads us to the Chaos Engineering and Continuous Resilience program, a program that we have built over the last few years at AWS. We have helped many customers run through it, which has enabled them, as I was saying earlier, to build a chaos engineering program within their own firm and scale it across various organizations and development teams, so that they can build controlled experiments within their environment and also improve resilience.
            
            
            
Usually when we are starting on this journey, we start with a game day that we prepare for. A game day is not, as you might think, just a two-hour session where we run something and check whether it was fine or not. Especially when we are starting out with chaos engineering, it's important to truly plan what we want to execute, so setting expectations is a big part of it.
            
            
            
So the key to that, because you're going to need quite a few people that you want to invite, is project planning. Usually the first time we do this, it might take between one and three weeks to plan the game day and the various people we want in it, like the chaos champion who will advocate for the game day throughout the company. It could be the development teams; if there are site reliability engineers, SREs, we're going to bring them in as well, along with the observability and incident response teams. And then
            
            
            
once we have all the roles and responsibilities for the game day, we're going to think about what it is that we want to run experiments on. And when we are thinking about chaos engineering, it's not just about resilience; it can be about security or other aspects of the architecture as well. So you put together a list of what's important to you. That can be resilience, that can be availability, or that can be security. It could also be durability for some customers. That's something which you can define. And then of
            
            
            
course, we want to make sure that there is a clear outcome for what we want to achieve with this chaos experiment. In our case, when we are starting out, what we actually prove to the organization and the sponsors is that we can run an experiment in a safe and controlled way without impacting the customers. That's the key. And then we take these learnings and share them, whether we found something or not, to make sure that the business units understand how to mitigate these failures if we found something, or have the confidence that we are resilient to the faults we injected.
            
            
            
So then we basically go to the next step, where we select a workload. For this presentation, let's have an example application. Because we are talking about a bank, this could be something like a payments workload. It's running on EKS, where EKS is deployed across multiple Availability Zones, and there is Route 53 and an Application Load Balancer which is taking in the traffic, et cetera. And there is also an Aurora database and Kafka for managed streaming, et cetera. It's important
            
            
            
that when you are choosing a workload, make sure that when you are starting out you are not choosing a critical workload that you already have and then impacting it. Obviously, no one would be happy if you start with such a critical system and something goes wrong. So choose something which is not critical, so that even if it is degraded, even if it has some customer impact, it is still fine because it's not critical. And usually we have metrics that allow for that when you're thinking about SLOs for your service. So once
            
            
            
you have chosen a workload, we're going to make sure that the chaos experiments we want to run are safe. And we do that through a discovery phase of the workload. That discovery phase will involve quite a bit of architecture review, right? So we're going to dive deep into it. You all know the Well-Architected review; it helps
            
            
            
customers build secure, high-performing, resilient, and efficient workloads on AWS. It has six pillars: operational excellence, security, reliability, performance efficiency, and cost optimization, as well as the newly added sustainability pillar.
            
            
            
And when we are thinking about the Well-Architected review, it's not just about clicking the buttons in the tool. We are talking through the various designs of the architecture, and we want to understand how the architecture, the workloads, and the components within your workloads speak to one another. Right. And what mechanisms do you have in place?
            
            
            
Like, when one component fails, can it retry again or not? What mechanisms do they have in regard to circuit breakers, have you implemented them, have you tested them, et cetera? And do you have runbooks or playbooks in place in case we have to roll back a particular experiment? We also want to make sure that you have observability in place, and health checks for example, when we execute something, so that your system can automatically recover from it.
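As one hedged example of the kind of guardrail meant here, the sketch below creates a CloudWatch alarm on an assumed ALB 5XX-error metric; an alarm like this can act both as the health check you watch during an experiment and as the stop condition that halts it automatically. The alarm name, threshold, and load balancer dimension are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the ALB 5XX target error count; if it fires during a chaos
# experiment, an FIS stop condition can halt the fault injection automatically.
cloudwatch.put_metric_alarm(
    AlarmName="payments-steady-state",  # placeholder name
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer",
                 "Value": "app/payments/0123456789abcdef"}],  # placeholder ALB
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=5,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```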
            
            
            
And if, with all that information, we see that there is a deficiency that might impact internal or external customers, that's where we basically stop. When we see an impact to customers, we stop that experiment. And if we have known issues, we're going to have to fix those first before we move
            
            
            
on within that process. Right. Now, once and only if everything is fine, we're going to say, okay, let's move on to the definition of an experiment. So the next phase is defining the experiment. When you are thinking about your system, the sample application which we just saw before, we can think about what
            
            
            
              can go wrong within this environment, right? So if we already
            
            
            
              have or have not mechanisms in place,
            
            
            
              for example, if you have a third party identity provider, in our
            
            
            
case, do we have a break glass account wherein I can prove that I can still log in if something happens, right? If that identity provider goes down,
            
            
            
              can I still log in with a break glass account? And let's say,
            
            
            
what about my EKS cluster? If I have a node that fails within that cluster, do my pods get rescheduled on the other nodes? Right? Do I know how long it takes or what
            
            
            
              would be my end customer impact if it happens?
            
            
            
              Or it could
            
            
            
be that someone misconfigured an auto scaling group and its health checks, which suddenly marks most of the
            
            
            
              instances in that zone unhealthy. So do we have
            
            
            
              mechanisms to detect that? And what does that mean again
            
            
            
              for customers and the teams that operate the environment as
            
            
            
              well? Right? And think about the scenario where someone
            
            
            
pushed a configuration change and ECR, your container registry, is no longer accessible.
            
            
            
              So that means you cannot basically launch new containers.
            
            
            
              Do we have mechanisms to detect that
            
            
            
              and recover from that? And what
            
            
            
about issues with the Kafka cluster which is managing our streams? Are we going to lose any active messages?
            
            
            
              What would be the data loss there? What if it
            
            
            
              loses a partition or it loses its connectivity, or basically
            
            
            
              it may reboot, et cetera. So do we have mechanisms
            
            
            
to mitigate that? And what about our Aurora database?
            
            
            
              What if the writer instance is not accessible or has gone
            
            
            
              down for whatever reason? Can it automatically
            
            
            
              and seamlessly fail over to the other
            
            
            
node? And while all of this is happening, what happens to the latency or the jitter of the application
            
            
            
              when you implement all this?
            
            
            
              Yeah,
            
            
            
and with fault injection and controlled experiments,
            
            
            
              we are able to do all of this. And then lastly, think about challenges that
            
            
            
your clients might have connecting to your environment while all of this is happening. So for our experiment, what we wanted to achieve
            
            
            
              is that we can execute and understand a brownout
            
            
            
scenario. So what a brownout scenario is, is that a client that connects
            
            
            
              to us expects a response in a certain
            
            
            
amount of time, let's say milliseconds, depending on the environment. And if we do not provide that, the client is just going to back off. But the challenge when you have a brownout is that your server is still trying to compute whatever the client requested, but the client is no longer there, and those are wasted cycles. So that
            
            
            
              inflection point is basically called the brownout.
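
As an illustration of the client side of a brownout, here is a minimal sketch: the client enforces a deadline and backs off, while the server may still be burning cycles on the abandoned request. The endpoint URL and the 100 millisecond budget are hypothetical placeholders.

```python
import requests

# Hypothetical endpoint; the 100 ms timeout stands in for the client's response budget.
try:
    resp = requests.get("https://api.example.com/orders", timeout=0.1)
except requests.exceptions.Timeout:
    # The client gives up and backs off here, but the server may still be
    # computing the abandoned response - those are the wasted cycles
    # that characterize a brownout.
    pass
```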
            
            
            
Now, before we can actually think about an experiment to simulate a brownout within our EKS environment, we need to understand the steady state
            
            
            
              and what the steady state is and what it isn't.
            
            
            
              So when you're thinking about defining a steady state for our
            
            
            
workload, that's the high-level top metric, right, that you're thinking about for your service.
            
            
            
              So for example, for a payment system, it could be transactions per second,
            
            
            
              for a retail system it can be orders per second, for streaming
            
            
            
it can be stream starts per second, et cetera. And when you're looking at that line, you can see very quickly, if you have an order drop or a transaction drop, that something you injected within the environment probably caused it. So we need to have
            
            
            
that steady state metric defined or already available, so that when we run these chaos experiments,
            
            
            
              we would immediately know something happened.
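
As a sketch of what watching a steady state metric can look like in code, the snippet below pulls a transactions-per-second style metric from CloudWatch and checks it against a threshold. The namespace, metric name, and threshold are hypothetical placeholders, not from the talk.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical namespace/metric; in practice this is your workload's top-line metric
# (transactions, orders, or stream starts per second).
resp = cloudwatch.get_metric_statistics(
    Namespace="PaymentApp",
    MetricName="TransactionsPerSecond",
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(minutes=10),
    EndTime=datetime.datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
latest = datapoints[-1]["Average"] if datapoints else 0.0

# Hypothetical tolerance band around the expected ~300 TPS steady state.
if latest < 290:
    print(f"Steady state deviation detected: {latest:.1f} TPS")
```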
            
            
            
              So the hypothesis is key as well when we are thinking
            
            
            
about the experiment, because the hypothesis will define, at the end, whether your experiment turned out as expected, or whether you learned something new that you didn't expect. And the important thing here is, as you see,
            
            
            
              we are saying that we are expecting a transaction
            
            
            
rate of 300 transactions per second, and we think that even if 40% of our nodes fail within our environment, still 99% of all requests
            
            
            
to our API should be successful, and at the 99th percentile they should return a
            
            
            
              response within 100 milliseconds. So what we also
            
            
            
want to define is, because we know our systems, we're going to say, okay, based on our experience, the nodes should come back within five minutes and the pods will get scheduled and process traffic within eight minutes after the initiation of the experiment.
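
To make the hypothesis unambiguous before the experiment starts, it can help to write it down as data. Below is a minimal sketch of that hypothesis expressed as constants with a simple check; the numbers mirror the ones above, and the measurement inputs are hypothetical.

```python
# The hypothesis from above, written down as explicit thresholds.
HYPOTHESIS = {
    "expected_tps": 300,          # steady state transaction rate
    "node_failure_percent": 40,   # fault we inject
    "min_success_rate": 0.99,     # 99% of API requests still succeed
    "p99_latency_ms": 100,        # p99 response time stays under 100 ms
    "node_recovery_minutes": 5,   # nodes back within five minutes
    "pod_recovery_minutes": 8,    # pods scheduled and serving within eight minutes
}

def hypothesis_holds(success_rate: float, p99_ms: float, recovery_minutes: float) -> bool:
    """Evaluate measured results (hypothetical inputs) against the hypothesis."""
    return (
        success_rate >= HYPOTHESIS["min_success_rate"]
        and p99_ms <= HYPOTHESIS["p99_latency_ms"]
        and recovery_minutes <= HYPOTHESIS["pod_recovery_minutes"]
    )
```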
            
            
            
And once we are all agreeing on that hypothesis, then we're going to go and fill out an experiment template. And when you're thinking about the experiment template itself, we're going to make sure that we very clearly define what we want to
            
            
            
              run, we're going to have the definition of the workload itself
            
            
            
              and what experiment and action we want to run.
            
            
            
              And you might want to run the experiment where you say, I'm going to
            
            
            
run for 30 minutes with five-minute intervals to make sure that you can look at the graphs for the staggered experiments you are running and understand the impact of the
            
            
            
              experiments. And then of course, because we want to do this in a controlled
            
            
            
              way, we need to be very clear on what the fault isolation
            
            
            
              boundary is for our experiment.
            
            
            
              And we're going to clearly define that as well.
            
            
            
              So we're going to have the alarms that are in place that would trigger the
            
            
            
              experiment to roll back if it gets
            
            
            
              out of control or if it causes any issues with
            
            
            
customer transactions. And that's key, because we want to make sure that we are practicing safe chaos engineering experiments, right? And we also make sure that
            
            
            
we understand what the observability is and what we are looking
            
            
            
              at when we are running the experiment. So we need to keep an eye on
            
            
            
              the observability and the key steady state metrics. And then
            
            
            
you would add the hypothesis again to the template as well. Yeah, as you see on the right side, we have the two empty
            
            
            
              lines for that.
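
To make the experiment template concrete, here is a minimal sketch of creating one with AWS Fault Injection Simulator through boto3: a target (a percentage of EKS node group instances), the terminate action, and a CloudWatch alarm as the stop condition that halts and rolls things back if the steady state is breached. All ARNs, tags, and values are hypothetical placeholders, and the exact action parameter names should be verified against the FIS documentation.

```python
import boto3

fis = boto3.client("fis")

# Minimal sketch of an experiment template (all ARNs/values are hypothetical).
template = fis.create_experiment_template(
    description="Brownout test: terminate 40% of EKS node group instances",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    targets={
        "api-nodes": {
            "resourceType": "aws:eks:nodegroup",
            "resourceTags": {"workload": "payments-api"},
            "selectionMode": "ALL",
        }
    },
    actions={
        "terminate-nodes": {
            "actionId": "aws:eks:terminate-nodegroup-instances",
            # Parameter and target names per the FIS EKS action; verify against current docs.
            "parameters": {"instanceTerminationPercentage": "40"},
            "targets": {"Nodegroups": "api-nodes"},
        }
    },
    # Stop condition: a CloudWatch alarm on the steady state metric.
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:tps-below-steady-state",
        }
    ],
    tags={"team": "payments", "environment": "staging"},
)
print(template["experimentTemplate"]["id"])
```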
            
            
            
              When we are thinking about the experiment itself,
            
            
            
whether it goes well or badly, we are always going to have an end report where we might celebrate that our system is resilient to such a failure, or we might celebrate that we found something that we didn't know before and have just helped our application and the organization by mitigating an issue or an
            
            
            
              event which could have happened in the
            
            
            
              future. Right? So once we have that experiment ready,
            
            
            
              we're going to think about basically preparing or priming the environment
            
            
            
              for our experiment. But before we go there, I just want to touch upon
            
            
            
how we go through the entire cycle of executing an experiment, because that's also critical. So the execution flow is like this. First we have to check if the system
            
            
            
              is actually in a healthy state. Because if you remember in the beginning,
            
            
            
              I was saying that if we already know the system is unhealthy or it's
            
            
            
              going to fail, we're not going to run that experiment. So we immediately stop that.
            
            
            
              So once the system is healthy, we'll see if the experiment is still valid,
            
            
            
because the issue we are testing might have already been fixed. You don't want to run that experiment if a developer has already fixed those bugs or improved that resilience.
            
            
            
And if the experiment is still valid, then we're going to create a control group and an experiment group, which we make sure are well defined, and I'm going to go into that in a few seconds. And if we see that the control and experiment groups are defined and up and running, then we start generating load against the control and experiment groups in our environment.
            
            
            
And we are checking again whether the steady state that we have is within the tolerance that we think it should be or not. If it is within tolerance, then finally we can go ahead and run the experiment against the target, and then we check if it is still within tolerance based on what we expect. And if it isn't, then the stop condition is going to kick in and it's going to roll back.
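
Put together, the execution flow above can be sketched as a small orchestration function. The helper callables here are hypothetical placeholders for the checks and actions just described, not an actual FIS or library API.

```python
from typing import Callable

def run_experiment_flow(
    system_is_healthy: Callable[[], bool],
    experiment_still_valid: Callable[[], bool],
    provision_groups: Callable[[], None],
    generate_load: Callable[[], None],
    steady_state_in_tolerance: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Hypothetical orchestration of the execution flow described above."""
    if not system_is_healthy():
        return "aborted: system already unhealthy"
    if not experiment_still_valid():
        return "skipped: the issue under test was already fixed"
    provision_groups()            # control group + experiment group
    generate_load()               # synthetic load against both groups
    if not steady_state_in_tolerance():
        return "aborted: steady state out of tolerance before injection"
    inject_fault()                # run the experiment against the target
    if not steady_state_in_tolerance():
        rollback()                # stop condition kicks in
        return "stopped: rolled back after breaching tolerance"
    return "completed: hypothesis held"
```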
            
            
            
So, as I was saying, in the previous slide I mentioned the aspects of the control and experiment groups. So when you're thinking about chaos engineering and running experiments, the goal is, one, that it's controlled, and two, that you have minimal or no impact on your customers when you're running it. So one way you can do that is, as we call it,
            
            
            
              not just having synthetic load that you generate, but also synthetic resources.
            
            
            
For example, you spin up a new EKS cluster, a synthetic one, into which you inject a fault, alongside the other one which is healthy and still serving your customers, right. So you're not impacting existing resources that are already being used by customers, but a new resource with exactly the same code base, where you can understand what happens in a certain failure scenario. So once
            
            
            
              we prime the experiment and we see that control and experiment
            
            
            
groups are healthy and we see a steady
            
            
            
              state, we can then move on and think about running the experiment
            
            
            
              itself. Now, running a chaos engineering
            
            
            
              experiment requires great tools that are safe
            
            
            
to experiment with. So when we are thinking about tools, there are various tools out there that you can use and consume.
            
            
            
In AWS, we have Fault Injection Simulator. When you're thinking about one of the first slides, with the shared responsibility model for resilience, Fault Injection Simulator helps you quite a bit with that, because the faults that you can inject with FIS run against the AWS APIs directly. And you
            
            
            
can inject these faults against your primary dependencies to make sure that you can create mechanisms so that you can
            
            
            
              survive a component failure within your system, et cetera.
            
            
            
Now, second, the faults and actions that I want to highlight are the integrations with Litmus Chaos and Chaos Mesh. And the great thing about this is that it provides you with a widened scope of faults that you can inject, for example in our example architecture into your Kubernetes cluster, through Fault Injection Simulator via a single pane of glass.
            
            
            
              And then it also has various integrations.
            
            
            
              Now, if you want to run experiments
            
            
            
against, let's say, EC2 instances, you have the capability to run these through AWS Systems Manager via the SSM agent.
            
            
            
              Now think about where these
            
            
            
              come into play. So when we are thinking about running experiments, these are
            
            
            
the ways in which you can create disruptions within the system.
            
            
            
              Let's say you have various microservices that run and consume a database.
            
            
            
Now you might ask how we can create a fault within the database without impacting all those microservices,
            
            
            
              right? And the answer to that is that you can inject faults within these
            
            
            
microservices themselves, for example packet loss, and they would result in exactly the same effect as the application not being able to talk to or write to the database, because it's not going to reach the database, without you bringing down the database itself. And so it's important to widen
            
            
            
              the scope and think about the experiments that you can run and
            
            
            
see what actions you have for how you can simulate those various failures.
            
            
            
So in our example case, because we want to do that brownout that I showed before, we use the EKS action with which we can terminate a certain number of nodes, a percentage of nodes within our cluster, and we would run that, right? So if you go to the
            
            
            
tool itself, the way it runs, if you use the tool, we can trust FIS that if something goes wrong, it can alert automatically and also help us roll back the experiment, right? And Fault Injection Simulator has these mechanisms built in. So when you build an experiment with FIS, you can define what my key alarms are, which define that steady state, and they should kick in if they find any deviation. Right. And if something goes wrong during that experiment,
            
            
            
              it should basically stop and then roll back the whole experiment.
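
A minimal sketch of starting that experiment from the template and watching its state is below; if a stop condition alarm fires, FIS moves the experiment into a stopped state. The template ID is a hypothetical placeholder.

```python
import time
import boto3

fis = boto3.client("fis")

# Hypothetical experiment template ID created earlier.
experiment = fis.start_experiment(experimentTemplateId="EXT1a2b3c4d5e6f7")
experiment_id = experiment["experiment"]["id"]

# Poll the experiment; a triggered stop condition shows up as a stopped state.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    print(state["status"], state.get("reason", ""))
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(30)
```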
            
            
            
              So in our case everything was fine and we said that, okay, well, now we
            
            
            
              have confidence based on our observability that we have for this experiment.
            
            
            
              Now let's move to the next environment,
            
            
            
which is obviously taking this into production.
            
            
            
              So you have to think about the guardrails that are important in your
            
            
            
production environment. So when you are running chaos experiments in production, especially when you are thinking about running them for the first time, please don't run them during peak hours. It's probably not the best idea. Also keep in mind that when you're running these experiments in lower environments, your permissions are quite permissive compared to production. So you've got to make sure that you have the observability in place and that you have the permissions to execute these various experiments as well, because in production permissions are always more restricted. And it's also key to understand that
            
            
            
the fault isolation boundary has changed because we are in production now.
            
            
            
              So we need to make sure that we understand that as well.
            
            
            
And we also need to understand the risk of running them in the production environment, because we need to make sure that we are not impacting our customers.
            
            
            
              That's the key. So once we
            
            
            
understand this and have the runbooks and playbooks
            
            
            
              which are up to date, we are finally at a stage where we
            
            
            
              can think about moving to production. And here again,
            
            
            
we want to think about experimenting in production with a canary. We'll check that in a second. So,
            
            
            
              as you have seen this picture before, in our lower environment,
            
            
            
              we're going to do the same thing in production. But we
            
            
            
don't have a mirrored environment, right? That's something some customers do, where they split traffic and have a chaos engineering environment alongside their production environment. So what we use in this case is a canary, to say that we're going to take the real traffic, a tiny percentage of it.
            
            
            
We're going to start bringing that real user traffic into the control and experiment groups.
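
One way to shift that tiny slice of real traffic is weighted forwarding on an Application Load Balancer listener, as sketched below. The listener and target group ARNs and the 1% weight are hypothetical placeholders, and other mechanisms (Route 53 weighted records, service mesh routing) work just as well.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical ARNs: send 99% of traffic to the existing production target group
# and 1% to the canary target group backing the control/experiment setup.
elbv2.modify_listener(
    ListenerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/prod/abc/def",
    DefaultActions=[
        {
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/prod/111", "Weight": 99},
                    {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/canary/222", "Weight": 1},
                ]
            },
        }
    ],
)
```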
            
            
            
Now keep in mind, at this point nothing should go wrong. We have the control and experiment groups
            
            
            
              here as well. We haven't injected the fault
            
            
            
yet. And we should be able to see from an observability
            
            
            
              perspective that everything is good,
            
            
            
              because we haven't created any experiments
            
            
            
              yet. And once we see that truly happen,
            
            
            
              that's where we start. That's where we kick in the
            
            
            
              experiment. Right. So we're going to
            
            
            
start running the experiment in production.
            
            
            
              But when we are thinking about running this
            
            
            
              in production, we want to make sure that we have all the
            
            
            
workload experts, the engineering teams,
            
            
            
              observability operators, incident response teams,
            
            
            
              everybody in a room before we actually do this in production.
            
            
            
              So that if something goes wrong or if you have seen
            
            
            
any unforeseen incidents during that chaos engineering experiment,
            
            
            
              you can quickly roll back and make sure that the system is back up and
            
            
            
running in no time. Right.
            
            
            
              And the final stage is basically going into
            
            
            
the correction of error stage, where we are basically listing out all the key findings and learnings from the experiment which we have run, and then we'll see,
            
            
            
              okay, how did we communicate between the teams?
            
            
            
Did we have any people whom we needed
            
            
            
              in the room, but they were not there? Was there any documentation
            
            
            
missing, et cetera? How can we improve the overall process? How do
            
            
            
we then basically take these learnings and share them across the organization so that teams can further improve their
            
            
            
              overall workloads, et cetera?
            
            
            
So that is the final phase. The next step is basically the automation part, because we are not running this just once. Right. So we want to basically take these learnings and automate them so that we can run them in pipelines. So we need
            
            
            
              to make sure that experiments also run in the pipeline
            
            
            
              and also outside the pipeline because the faults happen all the time.
            
            
            
They don't just happen when you push code;
            
            
            
              they happen like day in and day out within the production environment as well sometimes.
            
            
            
              Right. And then we can also use game days to
            
            
            
              bring in the teams together to help
            
            
            
              them understand the overall architecture and recover the apps, et cetera,
            
            
            
and test that those processes work. And are people alerted
            
            
            
              in a way that if something goes wrong, they're able to work together
            
            
            
and then bring in that continuous resilience culture. To make it easier for our customers, we have built a lot of templates and handbooks that we go through the experiments with. So we share things like the Chaos Engineering Handbook
            
            
            
              that shows the business value of chaos engineering and how it helps
            
            
            
with resilience, the chaos engineering templates as well as the correction of error templates we have, and also various aspects of the reports that we share with customers when we are running the program. Now, next, I just want to share some resources which we have. We have workshops which you can use.
            
            
            
For example, on the screen you see that you basically start with an observability workshop. And the reason for that is because the workshop
            
            
            
              builds an entire system that provides you with everything
            
            
            
in the observability stack. And you have to do absolutely nothing other than pressing a button, right? And once we have that and we have the observability, from top to bottom, from tracing to
            
            
            
              logging to metrics, then go for the chaos engineering workshop
            
            
            
and look at the various experiments there, and start with some database fault injection, containers and EC2, and it shows you how
            
            
            
              you can do that in the pipeline as well. And you can take those experiments
            
            
            
and run them against a sample application within the observability
            
            
            
              workshop and it gives you a great view of what's going on within
            
            
            
              your system. And if you inject these failures or faults,
            
            
            
              you'll see them right away within those dashboards with no effort at
            
            
            
              all. So these are the QR codes
            
            
            
              for those workshops. Please do
            
            
            
              get started and reach out to any of your AWS
            
            
            
              representatives or contacts for further information
            
            
            
              on these. You can also reach me
            
            
            
on my Twitter account. With that, I just want to thank you for your time. I know it's been a long
            
            
            
              session, but I hope you found it insightful.
            
            
            
              Please do share your feedback and let me know
            
            
            
              if you want more information on this. Thank you.