Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Hi. Welcome, everyone. In this session today, we'll discuss building confidence through chaos engineering on AWS. We'll learn what chaos engineering is, what it isn't, and what value chaos engineering provides, and how you can get started with chaos engineering within your own firms. But more importantly, I will show you
            
            
            
how you can combine the power of chaos engineering and continuous resilience and build a process that lets you scale chaos engineering across your organization in a controlled and secure way, so that your developers and engineers can build secure, reliable, and robust workloads that ultimately lead to a great customer experience. Right.
            
            
            
              My name is Narender Gakka. I'm a solutions architect at AWS,
            
            
            
              and my area of expertise is resilience as well.
            
            
            
So, let's get started. First, let's look at our agenda. I'll introduce you to chaos engineering, and we'll see what it is and, more importantly, what it isn't. I will also take you through the various aspects to think about as prerequisites for chaos engineering and what you need to get started in your own workloads and environments. We'll then dive deep into
            
            
            
continuous resilience and why continuous resilience is so important when we are thinking about resilient applications on AWS. Combining chaos engineering and continuous resilience, I will also take you through our Chaos Engineering and Continuous Resilience program that we use to help our customers build chaos engineering practices and programs that can scale across their organizations.
            
            
            
And lastly, I will share with you some resources and great workshops that we have, so that you can get started with chaos engineering on AWS on your own.
            
            
            
So when we are thinking about chaos engineering, it's not really a new concept. It has been around for over a decade now. And there are many companies that have already successfully adopted and embraced chaos engineering and have built mechanisms to find out what we call known unknowns.
            
            
            
These are things that we are aware of but don't fully understand in our systems. It could be weaknesses within our system or resilience issues. They also chase the unknown unknowns, which are things that we are neither aware of nor fully understand.
            
            
            
And through chaos engineering, these various companies were basically able to find deficiencies within their environments and protect their users from what we at AWS call large-scale events, and therefore ultimately deliver a better experience for their customers.
            
            
            
And yet, when we are talking about or thinking about chaos engineering, in many ways that is still not how people see it. Right? There is still a perception that chaos engineering is that thing which blows up production, or where we randomly just shut down things within an environment. That is not what chaos engineering is about. It's not just about blowing up production or randomly stopping or removing things.
            
            
            
              But when we are thinking about chaos engineering at AWS,
            
            
            
              we should look at it from a much different perspective.
            
            
            
Many of you have probably seen the AWS shared responsibility model for security. This one is basically for resilience. And there are two sections,
            
            
            
the blue and the orange part. For the resilience of the cloud, we at AWS are responsible for the resilience of the facilities, for example the networking, the storage, or the database aspects. These are basically the services which you consume. But you as a customer are responsible for how and what services you use, and where you place them.
            
            
            
Think, for example, about your workloads. Think about zonal services like EC2, where you place your data, and how you fail over if something happens within your environment.
            
            
            
But also think about the challenges that come up when you are looking at a shared responsibility model. How can you make sure that if a service you are consuming in the orange space fails, your workload is still resilient?
            
            
            
              Right. What happens if an availability zone
            
            
            
goes down at AWS? Is your workload or application able
            
            
            
              to recover from those things? How do you know if your
            
            
            
              workloads can fail over to another AZ?
            
            
            
And this is where chaos engineering comes into play and helps you with those aspects. So when you are thinking about workloads that you are running in the blue, what you can influence is the primary dependencies that you're consuming in AWS. If you're using EC2, if you're using Lambda, if you're using SQS, if you're using caching services like ElastiCache, these are the services that you can impact with chaos engineering in a safe and controlled way. And you can also figure out mechanisms for how components within your application can gracefully fail over to another service.
            
            
            
Sorry. So when we are thinking about chaos engineering, what it provides you is improved operational readiness, because with chaos engineering your teams will get trained on what to do if a service fails, and you will have mechanisms in place to be able to fail over automatically. You will also have great observability in place, because by doing chaos engineering you will realize what is missing within the observability you currently have and what you haven't seen. And when you are running these experiments in a controlled way, you'll continuously improve the observability part as well. And ultimately you will build resilience, so that the workloads which you build will be more resilient on AWS.
            
            
            
And when you're thinking about all of this put together, what it leads to is, of course, a happy customer and a better application. Right? So that's what chaos engineering is about: building resilient workloads that ultimately lead to a great customer experience.
            
            
            
              And so when you think about chaos engineering,
            
            
            
              it's all about building controlled experiments.
            
            
            
And if we already know that an experiment will fail, we're not going to run that experiment, because we already know why it fails and there is no point in running it; it's better to invest the time in fixing that issue. And if we know that we're going to inject a fault and that fault will trigger a bug that brings our system down, we're also not going to run the experiment, because we already know what happens when you trigger that bug. It's better to go and fix that bug. So what we want to make sure is that if we have an experiment, by definition that experiment should be tolerated by the system and should also be fail-safe, so that it doesn't lead you to issues.
            
            
            
And many of you might have a similar workload with a similar architecture, wherein you have the external DNS pointing to your load balancer, where you have a service running which is getting customer data from either a cache or a database, depending on your data freshness, et cetera. But when you're thinking about it,
            
            
            
let's say you're using Redis on EC2 or ElastiCache. What is your confidence level if Redis fails? Right? What happens if Redis fails? Do you have mechanisms in place to make sure that your database does not get fully overrun by all these requests which are suddenly no longer being served from the cache?
            
            
            
Or what if latency suddenly gets injected between two of your microservices and you create a retry storm? Right?
            
            
            
Do you have mechanisms to mitigate such an issue? What about backoff and jitter, et cetera?
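As a small illustration of the kind of mechanism being discussed here, the sketch below shows a retry wrapper with exponential backoff and full jitter; the function name, attempt counts, and delays are assumptions made up for the example, not anything prescribed in the talk.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry `operation` with exponential backoff and full jitter.

    Randomizing the wait between retries spreads the load out, which is
    what helps avoid the retry storms mentioned above when latency is
    injected between two microservices.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Exponential backoff, capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```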
            
            
            
And also, let's assume that you have, say, a cascading failure where everything in an availability zone goes down. Are you confident that you can fail over to a different Availability Zone? And think about the impacts that you might have on a regional service: what is your confidence if the whole region, the entire region, basically goes down?
            
            
            
Is your service able to recover in another region within the given SLA of your application? Right. So what is your confidence level with the current architecture that you have? Basically, do you have those runbooks and playbooks which will let you do this cross-region or cross-AZ failover seamlessly? And can you run through them? Right.
            
            
            
And so when you're thinking about chaos engineering, and about the services that we build on a daily basis, they're all based on trade-offs that we make every single day, right? We basically all want to build great, awesome workloads. But the reality is we are under pressure: we can only use a certain budget, we have time constraints, and we need to get certain features released on time. But in a distributed system, there is no way that every single person understands the hundreds of microservices that communicate with each other. And ultimately,
            
            
            
what happens if I think that I'm depending on a soft dependency, and someone suddenly changes the code so that it becomes a hard dependency? You suddenly have an event.
            
            
            
And when you're thinking about these events, usually they happen when you're somewhere in a restaurant or maybe somewhere outside, and you get called at an odd hour, and everybody runs and tries to fix the issue and bring the system back up. And the challenge with such a system is that once you go back to business as usual, you might hit that same challenge again. Right?
            
            
            
And it's not because we don't want to fix it, but because good intentions don't work; that's what we say at AWS, right? You need mechanisms which come into play, and those mechanisms can be built using chaos engineering and continuous resilience.
            
            
            
Now, as I mentioned in the beginning, there are many companies that have already adopted chaos engineering, across so many verticals, and some of them started quite early and have seen tremendous benefits in the overall improvement of resilience within their workloads. These are some of the industries we see on the screen, and there are many case studies or customer stories in that link. So please feel free to go through them if you belong to those industries, to see how they have leveraged chaos engineering and improved their architectures overall.
            
            
            
There are many customers that will adopt chaos engineering in the years to come. And there is a great study by Gartner, done for their infrastructure and operations leaders guide, which said that 40% of companies will adopt chaos engineering in the next year alone. And they are doing that because they think they can improve customer experience by almost 20%. Think about how many more happy customers you're going to have with such a number; it's a significant number, this 20%. So let's get to the prerequisites now on how you can get started with chaos engineering.
            
            
            
Okay, let's go through some of the prerequisites on how you can get started. First, you need basic monitoring, and if you already have observability, that's a great starting point. Then you need to have organizational awareness as well. Third, you need to think about what sort of real-world events or faults we are injecting into our environment. And fourth, of course, once we find those faults within the environment, once we find a deficiency, we remediate: we actively commit ourselves and have the resources to basically go and fix those, so that it improves either the security or the resiliency of your workloads. There's no point finding it but not fixing it, right? So that is the fourth prerequisite.
            
            
            
So when you're thinking about metrics, many of you already have great metrics. If you're using AWS, you have the CloudWatch integration and you already have the metrics coming in. In chaos engineering, we call metrics the known knowns: these are the things that we are already aware of and fully understand. And when you're thinking about metrics, for example CPU percentage or memory, that's all great, but in a distributed system you're going to look at many, many different dashboards and metrics to figure out what's going on within the environment, because no single one gives you a comprehensive view; each gives its own view.
            
            
            
So when you are starting with chaos engineering, many times when we are running the first experiments, even if we are trying to make sure that we are seeing everything, we realize we can't see it all. And this is what leads us to observability.
            
            
            
So observability basically helps us find that needle in a haystack. By collating all the information, we start looking at our baseline at the highest level instead of looking at a particular graph. And even if we have absolutely no idea what's going on, we're going to understand where we are. So basically, at a high level we know what our application health is, what sort of customer interaction we are having, et cetera, and from there we can drill down all the way to tracing. We can use services like AWS X-Ray to understand it. But there are also many open-source options, and if you already use them, that's perfectly fine as well, right. So when
            
            
            
you're thinking about observability, this is the key. Observability is based on what we call three key pillars: you have, as we already mentioned, the metrics, then you have the logging, and then you have the tracing. Now, why these three are important is because we want to make sure that you embed, for example, metrics within your logs, so that if you're looking at a high-level steady state, you can drill in, and as soon as you move from a trace to a log you see what's going on and can also correlate between all those components end to end. And so at that point you can understand where your application is.
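As a hedged illustration of embedding metrics within logs, the sketch below emits a single structured log line in CloudWatch Embedded Metric Format, so the same record carries a metric, a status code, and a trace id you could pivot into X-Ray; the namespace, dimension, and field names are placeholders invented for this example.

```python
import json
import time


def log_request_metric(latency_ms, status_code, trace_id):
    """Emit one structured log line in CloudWatch Embedded Metric Format.

    When this line lands in CloudWatch Logs, CloudWatch extracts LatencyMs
    as a metric from the "_aws" metadata, while the same line keeps the
    status code and trace id so metrics, logs, and traces can be
    correlated end to end.
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "PaymentsService",  # placeholder namespace
                "Dimensions": [["Operation"]],
                "Metrics": [{"Name": "LatencyMs", "Unit": "Milliseconds"}],
            }],
        },
        "Operation": "Checkout",   # placeholder dimension value
        "LatencyMs": latency_ms,
        "StatusCode": status_code,
        "TraceId": trace_id,       # lets you jump from the log into the trace
    }
    print(json.dumps(record))


log_request_metric(latency_ms=42.0, status_code=200,
                   trace_id="1-67891233-abcdef012345678912345678")
```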
            
            
            
So let's take an example of observability. When we are looking at this graph, even a person who has absolutely no idea what this workload is can see there are a few issues here. If you look at the spikes there, you're going to say, okay, something happened here. And if we drill down, we would see that we have a process which ran out of control, or there is a CPU spike, right?
            
            
            
And every one of you is able to look at the graph down here and say, wait a minute, why did the disk utilization drop? And if you drill down, you will realize that there was an issue with my Kubernetes cluster: the pods, right, the nodes suddenly started restarting, and that leads to a lot of 500 errors. And as you know, HTTP 500s are obviously not a good thing. So if we can correlate this, that is good observability: that because of such and such issue, this is my end user experience.
            
            
            
And you want to provide developers with an understanding of the interactions within the microservices, especially when you're thinking about chaos engineering and experiments; you want them to understand what the impact of an experiment is. And we shouldn't forget the user experience and what users see when you are running these experiments, because if you're thinking about the baseline and we are running an experiment and the baseline doesn't move, that means the customer is super happy, because everything is green, it's all working. If everything is fine from the end user's perspective even while you are doing the experiment, that is a successful, reliable, resilient application, right?
            
            
            
Now that we understand the observability aspects and have seen what basic monitoring and observability are, let's move on to the next prerequisite, which is organizational awareness.
            
            
            
What we found is that when you start with a small team, enable that team on chaos engineering, have them build common faults that can be injected across the organization, and then decentralize chaos engineering to the development teams, that works fairly well. Now why
            
            
            
is that? Why does a small team work well? If you're thinking about it, depending on the scale and size of your organization, you might have hundreds if not thousands of development teams who are building applications. There's no way that a central team will understand every single workload that is around. And there is
            
            
            
also no way that the central team will get the power to basically inject failures everywhere. But those development teams already have IAM permissions to access their environments and do things in their own environments. So it's much easier to help them run experiments than having a central team run them for them. Right. So you decentralize chaos engineering so that they can embrace it as part of the development cycle itself. That also helps with building customized experiments suitable for the workloads they are designing and building.
            
            
            
And the key for all of this to work is having that executive sponsorship, the management sponsorship, that helps you make resilience part of the software development lifecycle, and also shift the responsibility for resilience to those development teams who know their own application, their own piece of code, better than anybody else.
            
            
            
And then they can also think about the real-world failures and faults which their application can suffer or has dependencies on.
            
            
            
Now, what we see when we say real-world experiments is that some of the key experiments are around code and configuration errors. So think about the common faults you can inject when you are thinking about deployments, or think about experiments where you ask: okay, do we even realize that we have a faulty deployment? Do we see it within our observability if my deployment fails, or if it is causing a customer transaction to fail, et cetera? How do we run those experiments? And second
            
            
            
is when we are thinking about infrastructure. What if you have an EC2 instance that fails within your environment? Or suddenly, in a microservices deployment, you have an EKS cluster where a load balancer doesn't pass traffic? Are you able to mitigate such infrastructure events within your architecture? And what about
            
            
            
the data and state? Right, this is also a critical resource for your application. This is not just about cache drift: what if suddenly your database runs out of, let's say, disk space or memory? Do we have mechanisms to not only detect it and inform you that this happened, but also to automatically mitigate it so that your application keeps working resiliently, right? And then of
            
            
            
course you have dependencies. Do you understand the dependencies of your application on any third parties you have? It could be an identity provider or a third-party API which your application consumes every time a user logs in or, let's say, does a transaction. Do you understand those dependencies, and what happens if they suffer any issues? Do you have mechanisms to first test that, and also prove that your application is resilient enough to tolerate it and can work without them as well?
            
            
            
And then of course, although highly unlikely but technically feasible, there are natural disasters, or maybe human errors where something happened because a user did something. How can you fail over, or how can you simulate those events, and that too in a controlled way, through chaos engineering? Right. So these are some of the real-world experiments which you can do with chaos engineering.
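To make that "controlled way" concrete, here is a minimal sketch of one way such an experiment could be run with AWS Fault Injection Simulator (FIS) through boto3: stop one tagged EC2 instance and abort automatically if a CloudWatch alarm fires. The role ARN, alarm ARN, and the chaos-ready tag are assumptions for the example, not values from the talk.

```python
import uuid

import boto3

fis = boto3.client("fis")

# A minimal FIS experiment template: stop one tagged EC2 instance for five
# minutes. The stop condition ties the experiment to a CloudWatch alarm so
# it aborts as soon as customer impact shows up.
template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Stop one tagged instance to rehearse instance failure",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # assumed role
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:steady-state",  # assumed alarm
    }],
    targets={
        "one-instance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},  # assumed opt-in tag
            "selectionMode": "COUNT(1)",
        }
    },
    actions={
        "stop-instance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "one-instance"},
        }
    },
)

# Start the experiment once the template has been reviewed.
fis.start_experiment(experimentTemplateId=template["experimentTemplate"]["id"])
```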
            
            
            
And then the last prerequisite, of course, is about making sure that when we find a deficiency within our systems, which could be related to security or resilience, we can go and basically remediate
            
            
            
it, because it's basically worth nothing if you build new features but our service is not available. Right. So we need to have that executive sponsorship as well, so that we are able to prioritize these issues which come up through chaos engineering, fix them, and improve the resilience of the architecture in a continuous fashion. So that basically
            
            
            
now brings us to continuous resilience. When we are thinking about continuous resilience, resilience is not just a one-time thing, because every
            
            
            
day you're building new features, releasing them to your customers, and your architecture changes. So resilience should be part of everyday life when we are thinking about building resilient workloads, right from the bottom all the way up to the application itself.
            
            
            
And so continuous resilience is basically a lifecycle that helps us think about a workload from a steady-state point of view and work towards mitigating events like the ones we just went through, from code and configuration all the way to the very unlikely events like natural disasters, et cetera. We also need to build safe experimentation for these within our pipelines, and actually outside our pipelines as well, because errors happen all the time, not just when we provision new code, and we need to make sure that we also learn from the faults that surface during those experiments. And so
            
            
            
when you take continuous resilience and chaos engineering and put them together, that leads us to the Chaos Engineering and Continuous Resilience program, a program that we have built over the last few years at AWS. We have helped many customers run through it, which has enabled them, as I was saying earlier, to build a chaos engineering program within their own firm and scale it across various organizations and development teams, so that they can build controlled experiments within their environment and also improve resilience.
            
            
            
Usually when we are starting on this journey, we start with a game day that we prepare for. A game day is not, as you might think, just a two-hour session where we run something and check whether it was fine or not. Especially when we are starting out with chaos engineering, it's important to truly plan what we want to execute, so setting expectations is a big part of it.
            
            
            
So the key to that, because you're going to need quite a few people that you want to invite, is project planning. Usually the first time we do this, it might take between one and three weeks to plan the game day and the various people we want in it, like the chaos champion who will advocate for the game day throughout the company. It could be the development teams; if there are site reliability engineers, SREs, we're going to bring them in as well, along with the observability and incident response teams. And then
            
            
            
once we have all the roles and responsibilities for the game day, we're going to think about what it is that we want to run experiments on. And when we are thinking about chaos engineering, it's not just about resilience; it can be about security or other aspects of the architecture as well. So you put together a list of what's important to you. That can be resilience, that can be availability, or that can be security. It could also be durability for some customers. That's something which you can define. And then of
            
            
            
course, we want to make sure that there is a clear outcome for what we want to achieve with this chaos experiment. In our case, when we are starting out, what we actually prove to the organization and the sponsors is that we can run an experiment in a safe and controlled way without impacting the customers. That's the key. And then we take these learnings and share them, whether we found something or not, to make sure that the business units understand how to mitigate these failures if we found something, or have the confidence that we are resilient to the faults we injected.
            
            
            
So then we basically go to the next step, where we select a workload. For this presentation, let's have an example application. Because we are talking about a bank, this could be something like a payments workload. It's running on EKS, where EKS is deployed across multiple Availability Zones, and there is Route 53 and an Application Load Balancer which is taking in the traffic, et cetera. And there is also an Aurora database and Kafka for managed streaming, et cetera. It's important
            
            
            
that when you are choosing a workload, make sure that when you are starting out you are not choosing a critical workload that you already have and then impacting it. Obviously, no one would be happy if you start with such a critical system and something goes wrong. So choose something which is not critical, so that even if it is degraded, even if it has some customer impact, it is still fine because it's not critical. And usually we have metrics that allow for that when you're thinking about SLOs for your service. So once
            
            
            
you have chosen a workload, we're going to make sure that the chaos experiments we want to run are safe. And we do that through a discovery phase of the workload. That discovery phase will involve quite a bit of architecture review, right? So we're going to dive deep into it. You all know the Well-Architected review; it helps
            
            
            
customers build secure, high-performing, resilient, and efficient workloads on AWS. It has six pillars: operational excellence, security, reliability, performance efficiency, and cost optimization, as well as the newly added sustainability pillar.
            
            
            
And when we are thinking about the Well-Architected review, it's not just about clicking the buttons in the tool. We are talking through the various designs of the architecture, and we want to understand how the architecture, the workloads, and the components within your workloads speak to one another. Right. And what mechanisms do you have in place?
            
            
            
Like, when one component fails, can it retry again or not? What mechanisms do they have in regard to circuit breakers, have you implemented them, have you tested them, et cetera? And do you have runbooks or playbooks in place in case we have to roll back a particular experiment? We also want to make sure that you have observability in place, and health checks for example, when we execute something, so that your system can automatically recover from it.
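As one hedged example of the kind of guardrail meant here, the sketch below creates a CloudWatch alarm on an assumed ALB 5XX-error metric; an alarm like this can act both as the health check you watch during an experiment and as the stop condition that halts it automatically. The alarm name, threshold, and load balancer dimension are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the ALB 5XX target error count; if it fires during a chaos
# experiment, an FIS stop condition can halt the fault injection automatically.
cloudwatch.put_metric_alarm(
    AlarmName="payments-steady-state",  # placeholder name
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer",
                 "Value": "app/payments/0123456789abcdef"}],  # placeholder ALB
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=5,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```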
            
            
            
And if, with all that information, we see that there is a deficiency that might impact internal or external customers, that's where we basically stop. When we see an impact to customers, we stop that experiment. And if we have known issues, we're going to have to fix those first before we move
            
            
            
on within that process. Right. Now, once and only if everything is fine, we're going to say, okay, let's move on to the definition of an experiment. So the next phase is defining the experiment. When you are thinking about your system, the sample application which we just saw before, we can think about what
            
            
            
              can go wrong within this environment, right? So if we already
            
            
            
              have or have not mechanisms in place,
            
            
            
              for example, if you have a third party identity provider, in our
            
            
            
case, do we have a break glass account wherein I can prove that I can still log in if something happens, right? If that identity provider goes down,
            
            
            
              can I still log in with a break glass account? And let's say,
            
            
            
what about my EKS cluster? If I have a node that fails within that cluster, do my pods get rescheduled on the other nodes? Right? Do I know how long it takes or what
            
            
            
              would be my end customer impact if it happens?
            
            
            
              Or it could
            
            
            
be that someone misconfigured an auto scaling group and its health checks, which suddenly marks most of the
            
            
            
              instances in that zone unhealthy. So do we have
            
            
            
              mechanisms to detect that? And what does that mean again
            
            
            
              for customers and the teams that operate the environment as
            
            
            
              well? Right? And think about the scenario where someone
            
            
            
pushed a configuration change and ECR, your container registry, is no longer accessible.
            
            
            
              So that means you cannot basically launch new containers.
            
            
            
              Do we have mechanisms to detect that
            
            
            
              and recover from that? And what
            
            
            
about issues with the Kafka cluster which is managing our streams? Are we going to lose any active messages?
            
            
            
              What would be the data loss there? What if it
            
            
            
              loses a partition or it loses its connectivity, or basically
            
            
            
              it may reboot, et cetera. So do we have mechanisms
            
            
            
to mitigate that? And what about our Aurora database?
            
            
            
              What if the writer instance is not accessible or has gone
            
            
            
              down for whatever reason? Can it automatically
            
            
            
              and seamlessly fail over to the other
            
            
            
node? And while all of this is happening, what happens to the latency or the jitter of the application
            
            
            
              when you implement all this?
            
            
            
              Yeah,
            
            
            
and with fault injection and controlled experiments,
            
            
            
              we are able to do all of this. And then lastly, think about challenges that
            
            
            
your clients might have connecting to your environment while all of this is happening. So for our experiment, what we wanted to achieve
            
            
            
              is that we can execute and understand a brownout
            
            
            
scenario. So what a brownout scenario is, is that a client that connects
            
            
            
              to us expects a response in a certain
            
            
            
amount of time, let's say milliseconds, depending on the environment. And if we do not provide that, the client is just going to back off. But the challenge when you have a brownout is that your server is still trying to compute whatever the client requested, but the client is no longer there, and those are wasted cycles. So that
            
            
            
              inflection point is basically called the brownout.
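
As an illustration of the client side of a brownout, here is a minimal sketch: the client enforces a deadline and backs off, while the server may still be burning cycles on the abandoned request. The endpoint URL and the 100 millisecond budget are hypothetical placeholders.

```python
import requests

# Hypothetical endpoint; the 100 ms timeout stands in for the client's response budget.
try:
    resp = requests.get("https://api.example.com/orders", timeout=0.1)
except requests.exceptions.Timeout:
    # The client gives up and backs off here, but the server may still be
    # computing the abandoned response - those are the wasted cycles
    # that characterize a brownout.
    pass
```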
            
            
            
Now, before we can actually think about an experiment to simulate a brownout within our EKS environment, we need to understand the steady state
            
            
            
              and what the steady state is and what it isn't.
            
            
            
              So when you're thinking about defining a steady state for our
            
            
            
workload, that's the high-level top metric, right, that you're thinking about for your service.
            
            
            
              So for example, for a payment system, it could be transactions per second,
            
            
            
              for a retail system it can be orders per second, for streaming
            
            
            
it can be stream starts per second, et cetera. And when you're looking at that line, you can see very quickly, if you have an order drop or a transaction drop, that something you injected within the environment probably caused it. So we need to have
            
            
            
that steady state metric defined or already available, so that when we run these chaos experiments,
            
            
            
              we would immediately know something happened.
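
As a sketch of what watching a steady state metric can look like in code, the snippet below pulls a transactions-per-second style metric from CloudWatch and checks it against a threshold. The namespace, metric name, and threshold are hypothetical placeholders, not from the talk.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical namespace/metric; in practice this is your workload's top-line metric
# (transactions, orders, or stream starts per second).
resp = cloudwatch.get_metric_statistics(
    Namespace="PaymentApp",
    MetricName="TransactionsPerSecond",
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(minutes=10),
    EndTime=datetime.datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
latest = datapoints[-1]["Average"] if datapoints else 0.0

# Hypothetical tolerance band around the expected ~300 TPS steady state.
if latest < 290:
    print(f"Steady state deviation detected: {latest:.1f} TPS")
```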
            
            
            
              So the hypothesis is key as well when we are thinking
            
            
            
about the experiment, because the hypothesis will define, at the end, whether your experiment turned out as expected, or whether you learned something new that you didn't expect. And the important thing here is, as you see,
            
            
            
              we are saying that we are expecting a transaction
            
            
            
rate of 300 transactions per second, and we think that even if 40% of our nodes fail within our environment, still 99% of all requests
            
            
            
to our API should be successful, and at the 99th percentile they should return a
            
            
            
              response within 100 milliseconds. So what we also
            
            
            
want to define is, because we know our systems, we're going to say, okay, based on our experience, the nodes should come back within five minutes and the pods will get scheduled and process traffic within eight minutes after the initiation of the experiment.
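
To make the hypothesis unambiguous before the experiment starts, it can help to write it down as data. Below is a minimal sketch of that hypothesis expressed as constants with a simple check; the numbers mirror the ones above, and the measurement inputs are hypothetical.

```python
# The hypothesis from above, written down as explicit thresholds.
HYPOTHESIS = {
    "expected_tps": 300,          # steady state transaction rate
    "node_failure_percent": 40,   # fault we inject
    "min_success_rate": 0.99,     # 99% of API requests still succeed
    "p99_latency_ms": 100,        # p99 response time stays under 100 ms
    "node_recovery_minutes": 5,   # nodes back within five minutes
    "pod_recovery_minutes": 8,    # pods scheduled and serving within eight minutes
}

def hypothesis_holds(success_rate: float, p99_ms: float, recovery_minutes: float) -> bool:
    """Evaluate measured results (hypothetical inputs) against the hypothesis."""
    return (
        success_rate >= HYPOTHESIS["min_success_rate"]
        and p99_ms <= HYPOTHESIS["p99_latency_ms"]
        and recovery_minutes <= HYPOTHESIS["pod_recovery_minutes"]
    )
```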
            
            
            
And once we are all agreeing on that hypothesis, then we're going to go and fill out an experiment template. And when you're thinking about the experiment template itself, we're going to make sure that we very clearly define what we want to
            
            
            
              run, we're going to have the definition of the workload itself
            
            
            
              and what experiment and action we want to run.
            
            
            
              And you might want to run the experiment where you say, I'm going to
            
            
            
run for 30 minutes with five-minute intervals to make sure that you can look at the graphs for the staggered experiments you are running and understand the impact of the
            
            
            
              experiments. And then of course, because we want to do this in a controlled
            
            
            
              way, we need to be very clear on what the fault isolation
            
            
            
              boundary is for our experiment.
            
            
            
              And we're going to clearly define that as well.
            
            
            
              So we're going to have the alarms that are in place that would trigger the
            
            
            
              experiment to roll back if it gets
            
            
            
              out of control or if it causes any issues with
            
            
            
customer transactions. And that's key, because we want to make sure that we are practicing safe chaos engineering experiments, right? And we also make sure that
            
            
            
we understand what the observability is and what we are looking
            
            
            
              at when we are running the experiment. So we need to keep an eye on
            
            
            
              the observability and the key steady state metrics. And then
            
            
            
you would add the hypothesis again to the template as well. Yeah, as you see on the right side, we have the two empty
            
            
            
              lines for that.
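
To make the experiment template concrete, here is a minimal sketch of creating one with AWS Fault Injection Simulator through boto3: a target (a percentage of EKS node group instances), the terminate action, and a CloudWatch alarm as the stop condition that halts and rolls things back if the steady state is breached. All ARNs, tags, and values are hypothetical placeholders, and the exact action parameter names should be verified against the FIS documentation.

```python
import boto3

fis = boto3.client("fis")

# Minimal sketch of an experiment template (all ARNs/values are hypothetical).
template = fis.create_experiment_template(
    description="Brownout test: terminate 40% of EKS node group instances",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    targets={
        "api-nodes": {
            "resourceType": "aws:eks:nodegroup",
            "resourceTags": {"workload": "payments-api"},
            "selectionMode": "ALL",
        }
    },
    actions={
        "terminate-nodes": {
            "actionId": "aws:eks:terminate-nodegroup-instances",
            # Parameter and target names per the FIS EKS action; verify against current docs.
            "parameters": {"instanceTerminationPercentage": "40"},
            "targets": {"Nodegroups": "api-nodes"},
        }
    },
    # Stop condition: a CloudWatch alarm on the steady state metric.
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:tps-below-steady-state",
        }
    ],
    tags={"team": "payments", "environment": "staging"},
)
print(template["experimentTemplate"]["id"])
```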
            
            
            
              When we are thinking about the experiment itself,
            
            
            
whether it goes well or badly, we are always going to have an end report where we might celebrate that our system is resilient to such a failure, or we might celebrate that we found something that we didn't know before and have just helped our application and the organization by mitigating an issue or an
            
            
            
              event which could have happened in the
            
            
            
              future. Right? So once we have that experiment ready,
            
            
            
              we're going to think about basically preparing or priming the environment
            
            
            
              for our experiment. But before we go there, I just want to touch upon
            
            
            
how we go through the entire cycle of executing an experiment, because that's also critical. So the execution flow is like this. First we have to check if the system
            
            
            
              is actually in a healthy state. Because if you remember in the beginning,
            
            
            
              I was saying that if we already know the system is unhealthy or it's
            
            
            
              going to fail, we're not going to run that experiment. So we immediately stop that.
            
            
            
              So once the system is healthy, we'll see if the experiment is still valid,
            
            
            
because the issue we are testing might have already been fixed. You don't want to run that experiment if a developer has already fixed those bugs or improved that resilience.
            
            
            
And if the experiment is still valid, then we're going to create a control group and an experiment group, which we make sure are well defined, and I'm going to go into that in a few seconds. And if we see that the control and experiment groups are defined and up and running, then we start generating load against the control and experiment groups in our environment.
            
            
            
And we are checking again whether the steady state that we have is within the tolerance that we think it should be or not. If it is within tolerance, then finally we can go ahead and run the experiment against the target, and then we check if it is still within tolerance based on what we expect. And if it isn't, then the stop condition is going to kick in and it's going to roll back.
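
Put together, the execution flow above can be sketched as a small orchestration function. The helper callables here are hypothetical placeholders for the checks and actions just described, not an actual FIS or library API.

```python
from typing import Callable

def run_experiment_flow(
    system_is_healthy: Callable[[], bool],
    experiment_still_valid: Callable[[], bool],
    provision_groups: Callable[[], None],
    generate_load: Callable[[], None],
    steady_state_in_tolerance: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Hypothetical orchestration of the execution flow described above."""
    if not system_is_healthy():
        return "aborted: system already unhealthy"
    if not experiment_still_valid():
        return "skipped: the issue under test was already fixed"
    provision_groups()            # control group + experiment group
    generate_load()               # synthetic load against both groups
    if not steady_state_in_tolerance():
        return "aborted: steady state out of tolerance before injection"
    inject_fault()                # run the experiment against the target
    if not steady_state_in_tolerance():
        rollback()                # stop condition kicks in
        return "stopped: rolled back after breaching tolerance"
    return "completed: hypothesis held"
```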
            
            
            
So, as I was saying, in the previous slide I mentioned the aspects of the control and experiment groups. So when you're thinking about chaos engineering and running experiments, the goal is, one, that it's controlled, and two, that you have minimal or no impact on your customers when you're running it. So one way you can do that is, as we call it,
            
            
            
              not just having synthetic load that you generate, but also synthetic resources.
            
            
            
For example, you spin up a new EKS cluster, a synthetic one, into which you inject a fault, alongside the other one which is healthy and still serving your customers, right. So you're not impacting existing resources that are already being used by customers, but a new resource with exactly the same code base, where you can understand what happens in a certain failure scenario. So once
            
            
            
              we prime the experiment and we see that control and experiment
            
            
            
groups are healthy and we see a steady
            
            
            
              state, we can then move on and think about running the experiment
            
            
            
              itself. Now, running a chaos engineering
            
            
            
              experiment requires great tools that are safe
            
            
            
to experiment with. So when we are thinking about tools, there are various tools out there that you can use and consume.
            
            
            
In AWS, we have Fault Injection Simulator. When you're thinking about one of the first slides, with the shared responsibility model for resilience, Fault Injection Simulator helps you quite a bit with that, because the faults that you can inject with FIS run against the AWS APIs directly. And you
            
            
            
can inject these faults against your primary dependencies to make sure that you can create mechanisms so that you can
            
            
            
              survive a component failure within your system, et cetera.
            
            
            
Now, second, the faults and actions that I want to highlight are the integrations with Litmus Chaos and Chaos Mesh. And the great thing about this is that it provides you with a widened scope of faults that you can inject, for example in our example architecture into your Kubernetes cluster, through Fault Injection Simulator via a single pane of glass.
            
            
            
              And then it also has various integrations.
            
            
            
              Now, if you want to run experiments
            
            
            
against, let's say, EC2 instances, you have the capability to run these through AWS Systems Manager via the SSM agent.
            
            
            
              Now think about where these
            
            
            
              come into play. So when we are thinking about running experiments, these are
            
            
            
the ways in which you can create disruptions within the system.
            
            
            
              Let's say you have various microservices that run and consume a database.
            
            
            
Now you might ask how we can create a fault within the database without impacting all those microservices,
            
            
            
              right? And the answer to that is that you can inject faults within these
            
            
            
microservices themselves, for example packet loss, and they would result in exactly the same effect as the application not being able to talk to or write to the database, because it's not going to reach the database, without you bringing down the database itself. And so it's important to widen
            
            
            
              the scope and think about the experiments that you can run and
            
            
            
see what actions you have for how you can simulate those various failures.
            
            
            
So in our example case, because we want to do that brownout that I showed before, we use the EKS action with which we can terminate a certain number of nodes, a percentage of nodes within our cluster, and we would run that, right? So if you go to the
            
            
            
tool itself, the way it runs, if you use the tool, we can trust FIS that if something goes wrong, it can alert automatically and also help us roll back the experiment, right? And Fault Injection Simulator has these mechanisms built in. So when you build an experiment with FIS, you can define what my key alarms are, which define that steady state, and they should kick in if they find any deviation. Right. And if something goes wrong during that experiment,
            
            
            
              it should basically stop and then roll back the whole experiment.
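
A minimal sketch of starting that experiment from the template and watching its state is below; if a stop condition alarm fires, FIS moves the experiment into a stopped state. The template ID is a hypothetical placeholder.

```python
import time
import boto3

fis = boto3.client("fis")

# Hypothetical experiment template ID created earlier.
experiment = fis.start_experiment(experimentTemplateId="EXT1a2b3c4d5e6f7")
experiment_id = experiment["experiment"]["id"]

# Poll the experiment; a triggered stop condition shows up as a stopped state.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    print(state["status"], state.get("reason", ""))
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(30)
```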
            
            
            
              So in our case everything was fine and we said that, okay, well, now we
            
            
            
              have confidence based on our observability that we have for this experiment.
            
            
            
              Now let's move to the next environment,
            
            
            
which is obviously taking this into production.
            
            
            
              So you have to think about the guardrails that are important in your
            
            
            
production environment. So when you are running chaos experiments in production, especially when you are thinking about running them for the first time, please don't run them during peak hours. It's probably not the best idea. Also keep in mind that when you're running these experiments in lower environments, your permissions are quite permissive compared to production. So you've got to make sure that you have the observability in place and that you have the permissions to execute these various experiments as well, because in production permissions are always more restricted. And it's also key to understand that
            
            
            
the fault isolation boundary has changed because we are in production now.
            
            
            
              So we need to make sure that we understand that as well.
            
            
            
And we also need to understand the risk of running them in the production environment, because we need to make sure that we are not impacting our customers.
            
            
            
              That's the key. So once we
            
            
            
understand this and have the runbooks and playbooks
            
            
            
              which are up to date, we are finally at a stage where we
            
            
            
              can think about moving to production. And here again,
            
            
            
we want to think about experimenting in production with a canary. We'll check that in a second. So,
            
            
            
              as you have seen this picture before, in our lower environment,
            
            
            
              we're going to do the same thing in production. But we
            
            
            
don't have a mirrored environment, right? That's something some customers do, where they split traffic and have a chaos engineering environment alongside their production environment. So what we use in this case is a canary, to say that we're going to take the real traffic, a tiny percentage of it.
            
            
            
We're going to start bringing that real user traffic into the control and experiment groups.
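
One way to shift that tiny slice of real traffic is weighted forwarding on an Application Load Balancer listener, as sketched below. The listener and target group ARNs and the 1% weight are hypothetical placeholders, and other mechanisms (Route 53 weighted records, service mesh routing) work just as well.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical ARNs: send 99% of traffic to the existing production target group
# and 1% to the canary target group backing the control/experiment setup.
elbv2.modify_listener(
    ListenerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/prod/abc/def",
    DefaultActions=[
        {
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/prod/111", "Weight": 99},
                    {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/canary/222", "Weight": 1},
                ]
            },
        }
    ],
)
```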
            
            
            
Now keep in mind, at this point nothing should go wrong. We have the control and experiment groups
            
            
            
              here as well. We haven't injected the fault
            
            
            
yet. And we should be able to see from an observability
            
            
            
              perspective that everything is good,
            
            
            
              because we haven't created any experiments
            
            
            
              yet. And once we see that truly happen,
            
            
            
              that's where we start. That's where we kick in the
            
            
            
              experiment. Right. So we're going to
            
            
            
start running the experiment in production.
            
            
            
              But when we are thinking about running this
            
            
            
              in production, we want to make sure that we have all the
            
            
            
workload experts, the engineering teams,
            
            
            
              observability operators, incident response teams,
            
            
            
              everybody in a room before we actually do this in production.
            
            
            
              So that if something goes wrong or if you have seen
            
            
            
any unforeseen incidents during that chaos engineering experiment,
            
            
            
              you can quickly roll back and make sure that the system is back up and
            
            
            
running in no time. Right.
            
            
            
              And the final stage is basically going into
            
            
            
the correction of error stage, where we are basically listing out all the key findings and learnings from the experiment which we have run, and then we'll see,
            
            
            
              okay, how did we communicate between the teams?
            
            
            
Did we have any people whom we needed
            
            
            
              in the room, but they were not there? Was there any documentation
            
            
            
missing, et cetera? How can we improve the overall process? How do
            
            
            
we then basically take these learnings and share them across the organization so that teams can further improve their
            
            
            
              overall workloads, et cetera?
            
            
            
So that is the final phase. The next step is basically the automation part, because we are not running this just once. Right. So we want to basically take these learnings and automate them so that we can run them in pipelines. So we need
            
            
            
              to make sure that experiments also run in the pipeline
            
            
            
              and also outside the pipeline because the faults happen all the time.
            
            
            
They don't just happen when you push code;
            
            
            
              they happen like day in and day out within the production environment as well sometimes.
            
            
            
              Right. And then we can also use game days to
            
            
            
              bring in the teams together to help
            
            
            
              them understand the overall architecture and recover the apps, et cetera,
            
            
            
and test that those processes work. And are people alerted
            
            
            
              in a way that if something goes wrong, they're able to work together
            
            
            
and then bring in that continuous resilience culture. To make it easier for our customers, we have built a lot of templates and handbooks that we go through the experiments with. So we share things like the Chaos Engineering Handbook
            
            
            
              that shows the business value of chaos engineering and how it helps
            
            
            
with resilience, the chaos engineering templates as well as the correction of error templates we have, and also various aspects of the reports that we share with customers when we are running the program. Now, next, I just want to share some resources which we have. We have workshops which you can use.
            
            
            
For example, on the screen you see that you basically start with an observability workshop. And the reason for that is because the workshop
            
            
            
              builds an entire system that provides you with everything
            
            
            
in the observability stack. And you have to do absolutely nothing other than pressing a button, right? And once we have that and we have the observability, from top to bottom, from tracing to
            
            
            
              logging to metrics, then go for the chaos engineering workshop
            
            
            
and look at the various experiments there, and start with some database fault injection, containers and EC2, and it shows you how
            
            
            
              you can do that in the pipeline as well. And you can take those experiments
            
            
            
and run them against a sample application within the observability
            
            
            
              workshop and it gives you a great view of what's going on within
            
            
            
              your system. And if you inject these failures or faults,
            
            
            
              you'll see them right away within those dashboards with no effort at
            
            
            
              all. So these are the QR codes
            
            
            
              for those workshops. Please do
            
            
            
              get started and reach out to any of your AWS
            
            
            
              representatives or contacts for further information
            
            
            
              on these. You can also reach me
            
            
            
on my Twitter account. With that, I just want to thank you for your time. I know it's been a long
            
            
            
              session, but I hope you found it insightful.
            
            
            
              Please do share your feedback and let me know
            
            
            
              if you want more information on this. Thank you.