Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Everyone, good morning, good afternoon and good evening. Happy to be here at the Conf42 Chaos Engineering conference.

Before we delve into the topic of continuous resilience, a bit about me: I am Uma Mukara, head of Chaos Engineering at Harness. I am also a maintainer and co-founder of the LitmusChaos CNCF project, which is at incubating stage. At Harness, I've been helping customers adopt chaos engineering for a wide number of use cases, and in that process I have learned a little bit about how chaos engineering is being adopted and which use cases are more prominent and more appealing. So here is an opportunity for me to talk about what I learned in the last few years of trying to push chaos engineering into more practical environments in the cloud native space.

Innovation is a continuous process in software, right? And we're all trying to innovate something in software, either to improve governance, quality, efficiency, control, reliability, et cetera. So in this specific talk, let's talk about how we can innovate more in the space of reliability, or resilience.
            
            
            
So before we actually get to the topic of innovating in the area of resilience, let's talk about the software development costs that apply to software developers overall. In the world today, we have about 27 million software developers, costing about $100k on average on an annual basis. That leads to a total spend of about $2.7 trillion. That's a huge amount of money being spent on software development. If that much money is being spent, what are the software developers doing? In this poll you can see that more than 50% of software developers indicate that they actually spend less than 3 hours a day writing code.
            
            
            
Where are they spending the remaining time? They could be spending it trying to build environments, or build the deployment environments, or debugging the existing software or the software they just wrote, or production issues, et cetera. So this all adds up to more toil for software developers. And is there a way we can actually reduce this toil by as much as about 50%? That could free up more money for the actual software development.

And that's a huge spend, right? So this is the overall market space for developers, but you can apply the same thing to your own organization. You're spending a lot of money on developers, but developers are actually not spending enough time and effort writing code, right?
            
            
            
That's an opportunity to reduce that toil and increase innovation in different areas. The opportunity is to innovate, to increase developer productivity and hence save cost, and you can put that cost back into more development and ship more products, or more code, or faster code, et cetera. So let's see how this applies to resilience as a use case.

You can actually reduce the developer toil. This toil comprises build time, deployment time, and debug time. You need to reduce this toil, and that's where you can actually improve productivity. In this specific talk we are going to look at where these developers are spending their time in debugging, why they are doing that, and how we can actually reduce that amount of debug time. Eventually that leads to more time for innovation. And because you are reducing the debug time in the area of resilience, that also improves the resilience of the products.
            
            
            
So why are they spending time in debugging? In other words, why are the bugs being introduced? It could be plain oversight. Developers are humans, so there is a possibility that something is overlooked. Even the smartest developers could overlook some cases and then introduce bugs, or leak bugs to the right, and that cannot always be avoided. But the more common pattern that can be observed is that a lot of dependencies in the practical world are not tested, and they've been released to the right. It's also possible that there's a lot of developer churn, a lot of hiring. In that case, you are not the person who has written the product from the beginning, the product has scaled up so much, and you don't necessarily understand the entire architecture of the product, yet you are rolling out some of the new features. That leads to a bit of a lack of understanding of the product architecture: some of the intricacies are not well understood, and then design bugs or code bugs can trickle in. And even if you take care of all that, you assume that the product will run in a certain environment, but the environment can be totally different, or it can keep changing, and your code may not work as expected in that environment. So these are the reasons why you, as a developer, end up spending more time. And these reasons become more common in cloud native.
            
            
            
But before that, let's look at the cost of debugging. You can end up costing the organization much more if you actually find these bugs in production and start fixing them. The cost of fixing bugs in production is more than ten times what you incur if you debug and fix them in QA or within the code. It's a well known fact, nothing new: it's always good to find the bugs before they go into production. So that's another way to look at this. But the reasons for introducing these issues, or overlooking these causes, are becoming more and more common in the case of cloud native developers.
            
            
            
So in cloud native, two things are happening. By default, you are expected to ship things faster, because the whole ecosystem supports faster shipment of your builds: each developer has a small amount of code to look at, and there are well defined boundaries around containers and the entire cloud native ecosystem. The pipelines are better, and the tools surrounding shipment, like GitHub and similar tooling, are helping you ship things faster. Added to that, containers are helping developers wrap up features faster. Because they are microservices, you only need to look at things within a limited scope, inside a container, surrounded by APIs. So you are able to look at things very objectively, finish the coding, and ship things faster.

But because you are doing this very fast, the chances of not doing the dependency testing, or of not understanding the product very well, are high. And that can actually cause a lot of issues. If the faults are happening in the infrastructure, the impact of an outage can be very high, compared to a fault happening within just your container, where the impact at that level can be very low, right?
            
            
            
So the summary here is: you are testing the code as much as possible and shipping as fast as you can, and you may or may not be looking at the entire set of new testing that is needed. It's possible that the faults happening in the deep dependencies are not tested. What typically happens in the case of cloud native shipments, in such cases, is that the end service is impacted. Developers then jump in to debug, and finally they come to discover that there's a dependent component or service that incurs a fault within it, or multiple faults, and because of that a given service is affected. That's a weakness within a given service: it's not resilient enough, and you find it and fix it. So this is typically a case of increased cost, and there's a good opportunity to find such issues much earlier and avoid the cost.
            
            
            
The kind of testing you need to be doing before you ship, to avoid such cost, is this: you have to assume that faults can happen within your code, within the APIs that your code or application is consuming, or in other services such as databases, message queues, or other network services. There are faults that could be happening, and your application has to be tested for such faults. And of course there are the infrastructure faults, which are pretty common; infrastructure faults can happen within Kubernetes, and your code has to be resilient to such faults. These are the dependent-fault tests that you need to be aware of and run.
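To make that concrete, here is a minimal sketch of what one such dependent-fault test could look like, assuming a hypothetical checkout service whose database dependency runs in a Docker container named orders-db; the container name, URL, timings, and expectations are illustrative assumptions, not something prescribed by the talk or by any particular tool.

```python
# Hedged sketch of a dependent-fault test: pause a dependency and check that
# the service still meets its steady-state hypothesis. Names are illustrative.
import subprocess
import time

import requests

DB_CONTAINER = "orders-db"                          # assumed dependency container
SERVICE_URL = "http://checkout.local:8080/health"   # assumed service endpoint


def test_service_survives_db_outage():
    # Steady state before the fault: the service should be healthy.
    assert requests.get(SERVICE_URL, timeout=2).status_code == 200

    # Inject the dependency fault: pause the database container.
    subprocess.run(["docker", "pause", DB_CONTAINER], check=True)
    try:
        time.sleep(5)  # let the fault take effect
        # Hypothesis: the service degrades gracefully instead of crashing,
        # e.g. it still answers its health endpoint within the timeout.
        assert requests.get(SERVICE_URL, timeout=2).status_code == 200
    finally:
        # Always roll the fault back so later tests see a healthy system.
        subprocess.run(["docker", "unpause", DB_CONTAINER], check=True)
```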
            
            
            
So what this really means is that cloud native developers need to do chaos testing. This is exactly what chaos testing typically means: some fault can happen in my dependent component or infrastructure, and my service, which depends on my code, needs to be resilient enough. Chaos testing is needed by the very nature of the cloud native architecture to achieve high resilience. And we are basically saying that developers end up spending a lot of time debugging, and that's not good for developer productivity. So, if you need to do chaos testing, let's actually see what the typical definition of chaos engineering is.
            
            
            
Chaos engineering has been around for quite some time. We all know that, and we are all more or less told that chaos engineering is about introducing controlled faults to reduce expensive outages. If you are basically reducing outages, you are looking at doing chaos testing in production, and that really comes with a high barrier. This is one reason why, even though chaos engineering has been around for quite some time, its adoption has only started increasing rapidly in recent years. The typical understanding of chaos testing, or chaos engineering, is that it applies to production that's changing very fast. And that's exactly what we are talking about here.

Traditional chaos engineering is also about introducing game days: you try to find the right champions within the organization who are open to doing these game days, find any resilience issues, and then keep doing more game days. That's typically how chaos engineering has been practiced. It's been a reactive approach: either some major incidents have happened and, as a solution, you're trying to adopt chaos engineering, or sometimes it can be driven by regulations as well, especially in the case of DR, where chaos engineering comes into the picture in the banking sector and so on. But these are the typical needs or patterns that drove chaos engineering until a couple of years ago.
            
            
            
But modern chaos engineering is driven not just by the need to reduce outages, but also by the need to increase developer productivity. If my developers are spending a lot of time debugging production issues, that's a major loss, and you need to avoid that. How can I do that? Use chaos testing. Similarly for the QA teams: QA teams are coming in and looking for more ways to test the many components that now arrive in the form of microservices. Earlier it was easy enough: you got a monolithic application with very clear boundaries, and you could write a better, more complete set of test cases. But now microservices can pose a challenge for QA teams: there are so many containers, and they're coming in so fast. So how do I actually make sure that the quality is right in many aspects? That can be achieved through chaos testing. It's also possible that a whole big monolith, a traditional application that's working well and is business critical in nature, is being moved to cloud native. How do you ensure that everything works fine on the other side, on cloud native? One way to ensure that is by employing chaos engineering practices. So the need for chaos engineering in modern times is really defined, or driven, by these needs, rather than just "hey, I incurred some outages, let's go fix them." While that is still true, there are more drivers driving the adoption of chaos engineering.
            
            
            
These needs are leading to a new concept called continuous resilience. So what is continuous resilience? It's basically verifying the resilience of your code or component through automated chaos testing, and doing that continuously. Chaos engineering, done in an automated way across your DevOps spectrum, is how you achieve continuous resilience, and that approach is called the continuous resilience approach. So just to summarize: you do chaos engineering in dev, QA, pre-prod and prod, continuously, all the time, involving all the personas. And that leads to continuous resilience as a concept.
            
            
            
So what are the typical metrics that you look for in the continuous resilience space, or model? They are the resilience score and resilience coverage. You always measure the resilience score of a given chaos experiment, or of a component or a service itself. It can be defined as the average success of the steady-state checks of whatever you are measuring: the steady-state checks that are done during a given experiment, or for a given component, or for a given service. This is the resilience score; typically it can be out of 100, or a percentage.
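As a rough illustration of that definition, here is a minimal sketch of a resilience score computed as the average pass rate of steady-state checks; the probe names and the equal weighting are assumptions for the example, not the exact formula of any particular tool.

```python
# Minimal sketch: resilience score as the average success of the steady-state
# checks run during a chaos experiment. Probe names are illustrative.
from dataclasses import dataclass


@dataclass
class ProbeResult:
    name: str
    passed: bool


def resilience_score(probes: list[ProbeResult]) -> float:
    """Percentage of steady-state checks that passed (0-100)."""
    if not probes:
        return 0.0
    passed = sum(1 for p in probes if p.passed)
    return 100.0 * passed / len(probes)


results = [
    ProbeResult("pod-healthy-during-cpu-hog", True),
    ProbeResult("http-endpoint-available", True),
    ProbeResult("frontend-latency-under-500ms", False),
]
print(resilience_score(results))  # ~66.7: 2 of 3 steady-state checks passed
```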
            
            
            
And the more important metric in continuous resilience is what you can think of as resilience coverage. Because you are looking at the whole spectrum, you can come up with a total number of possible chaos tests. Basically, you can compute them by asking: what are the total resources that my service is composed of? And you can take multiple combinations of those. The resources can be infrastructure resources, API resources, network resources, or the resources that make up the service itself, like container resources, et cetera. So you can come up with a large number of tests that are possible, and then you start introducing such chaos tests into your pipelines; those are the ones that you actually cover. That gives you a very clear way of measuring which chaos tests you have done out of the possible chaos tests, and that leads to a coverage. Think of it as code coverage in the traditional developer spectrum, with resilience coverage being applied to resilience and chaos experiments.
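Continuing the sketch above, resilience coverage could be approximated as the fraction of possible fault-and-resource combinations that have actually been automated; the way possible tests are enumerated here is an assumption for illustration only.

```python
# Minimal sketch: resilience coverage as (tests automated) / (tests possible).
# The resources and faults listed here are illustrative assumptions.
from itertools import product

resources = ["checkout-pod", "orders-db", "payments-api", "ingress"]
faults = ["cpu-hog", "memory-hog", "pod-delete", "network-latency"]

possible_tests = {f"{fault}@{res}" for fault, res in product(faults, resources)}

# Chaos tests that have actually been added to pipelines so far.
covered_tests = {
    "cpu-hog@checkout-pod",
    "pod-delete@checkout-pod",
    "network-latency@payments-api",
}

coverage = 100.0 * len(covered_tests & possible_tests) / len(possible_tests)
print(f"resilience coverage: {coverage:.1f}%")  # 3 of 16 combinations -> 18.8%
```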
            
            
            
Many people call this approach "hey, let's do chaos in pipelines." That's almost the same, except that continuous resilience does not limit you just to pipelines: you can also automate the chaos tests on the production side as your maturity improves. So it's more than just a pipelines approach. So what are the general differences between the traditional chaos engineering approach and the pipelines approach, or the continuous resilience approach?
            
            
            
Traditionally, in the game days model, you are executing on demand with a lot of preparation: you need to assign certain dates, take permissions, and then execute the tests. In pipelines, you are executing continuously, without much thought or preparation. The experiments are supposed to work, and if one doesn't, it doesn't hurt that much; it's actually a good thing that you can go and look at it whenever it fails. Maybe it just slows down the delivery of your builds, but that's okay. So this leads to greater adoption overall. Game days are targeted towards SREs; SREs are the ones that think of, and budget, the entire game days model. But in the chaos pipelines model, all personas are involved: shift left is possible, but shift right is also possible in this approach. That's another major difference. So, as you can guess, in the chaos game days model the barrier to adoption is very high. The barrier for pipelines is much lower, because you're doing it in a non-prod environment, the bandwidth associated with development is available, and developers are the ones writing the tests. So it becomes kind of natural to adopt such a model.
            
            
            
When it comes to writing the chaos experiments themselves, traditionally that's been a challenge, because the code itself keeps changing, and if SREs are the ones writing them, that bandwidth is usually not budgeted or planned. SREs are typically pulled into other pressing needs, such as attending to incidents and the corresponding action tracking, et cetera, so it may not always be possible to be proactive in writing a lot of chaos experiments. And in general, because you are not measuring anything like resilience coverage and you are just following the game day model, it's not very clear how many more chaos experiments you need to develop before you can say that you have covered all your resilience issues.
            
            
            
In the continuous resilience approach, these are exactly the opposite. You are basically relying on each other's help in a team-sport model, and you're extending your regular tests. Developers would be writing integration tests, and now you add some more tests that introduce faults on the dependent components. Those tests can be reused by QA, and QA will add a few more tests, which can be reused by developers or by SREs, et cetera. So there is increased sharing of the tests in central repositories, or what are generally called chaos hubs. You tend to manage these chaos experiments as code in Git, and that increases adoption. And with resilience coverage as a concept, you know exactly how much more coverage you need, or how many more tests you need to write, which also helps from a planning perspective. So that's really a new pattern for how organizations need to adopt chaos engineering.
            
            
            
That's what I've been observing in the last few years, including at Harness, where we are seeing good growth in the adoption of chaos, both for developer productivity and to increase resilience as an innovation metric. So let's take a look at a couple of demos. One is on how you can inject a chaos experiment into a pipeline and potentially cause a rollback depending on the resilience score that is achieved. The other is a quick demo of how, at Harness, the development teams are using chaos experiments in the pipeline a little more liberally before the code is shipped to a pre-prod environment or a QA environment.
            
            
            
In this demo, we're going to take a look at how to achieve continuous resilience using chaos experiments with a sample chaos engineering tool. In this case we are using Harness Chaos Engineering, but you can use any other pipeline tool and chaos engineering tool together to achieve the same continuous resilience. So let's start.

I have the chaos engineering tool from Harness, Harness Chaos Engineering. This has the concept of chaos experiments, which are stored in chaos hubs. These chaos hubs are generally a way to share the experiments across teams, because in continuous resilience you are talking about multiple teams across different pipeline stages, whether that's dev, QA, pre-prod or prod. Everyone will be using this tool, and they will have access either to common chaos hubs, or they'll be maintaining their own chaos hubs. A chaos hub maintains the chaos faults that are developed and the chaos experiments that are created, which in turn use the chaos faults.
            
            
            
A chaos fault in this case is nothing but the actual chaos injection, plus the addition of certain resilience probes to check a steady-state hypothesis. So let me show how, in this Harness Chaos Engineering tool, a particular chaos experiment is constructed. If I take a look at a given chaos experiment, it has multiple chaos faults; it can have multiple chaos faults either in series or in parallel. A given chaos fault usually specifies where you are injecting this fault in your target application, and what the characteristics of the chaos itself are: how long you want to run it, how many times you want to repeat it, et cetera. And then there is the probe. Different tools call probes different things; a probe is basically a way to check your steady state while the chaos injection is going on. In the case of Harness Chaos Engineering, we use probes to define the resilience of a given experiment, or of a given service, module or component. You can add any number of probes to a given fault, so you're not depending on just one probe to check the resilience; you're checking a whole lot of things while you inject chaos against a given resource.
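To make the probe idea concrete, here is a small hedged sketch of a steady-state probe runner that polls HTTP endpoints while a fault is active and records a pass or fail per check; the endpoints, latency threshold and polling loop are illustrative assumptions, not the internals of any specific tool.

```python
# Minimal sketch of steady-state probes evaluated while a fault is running.
# URLs, thresholds and timings are illustrative assumptions.
import time

import requests


def http_probe(url: str, max_latency_s: float = 0.5) -> bool:
    """One steady-state check: the endpoint responds OK and fast enough."""
    try:
        start = time.monotonic()
        resp = requests.get(url, timeout=2)
        return resp.status_code == 200 and (time.monotonic() - start) <= max_latency_s
    except requests.RequestException:
        return False


def run_probes_during_fault(urls: list[str], duration_s: int = 30, interval_s: int = 5) -> dict[str, bool]:
    """Poll each probe for the fault window; a probe passes only if every
    observation made while the chaos was running passed."""
    results = {url: True for url in urls}
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for url in urls:
            results[url] = results[url] and http_probe(url)
        time.sleep(interval_s)
    return results


# Example: check the target pod's endpoint, another service, and the front end
# while (elsewhere) a CPU-hog fault is injected into the target pod.
print(run_probes_during_fault([
    "http://checkout.local:8080/health",
    "http://catalog.local:8080/health",
    "http://frontend.local/",
]))
```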
            
            
            
So in the case of this particular chaos experiment, for example, you can see that it resulted in 100% resilience. The chaos that was injected was a CPU hog against a given pod, and while that CPU hog was injected, three probes were checked: whether the pods were okay, whether some other service was available at its HTTP endpoint, and, for a completely different service, the latency response from the front-end web service. So you should generally look at the larger picture while gauging the steady-state hypothesis during chaos fault injection. Because everything passed and there's only one fault, you see the resilience score as 100%. This is how you would generally go and score resilience with a given chaos experiment. These chaos experiments should then generally be movable back into a chaos hub, and you should be able to launch them from a given chaos hub, et cetera.
            
            
            
In general, the chaos tool should also have the ability to do some access control. For example, in the case of Harness Chaos Engineering, you will have default access control over who can access the centralized library of chaos hubs and who can execute a given chaos experiment. Chaos infrastructure is your target agent area. And if there are game days, who can run those game days; typically nobody should have the ability to remove the reports of game days, so there's no delete option for anyone. With this kind of access control, plus the capability of chaos hubs and probes, you will be able to score the resilience of a given chaos experiment for a given resource, and also share such developed chaos experiments across multiple teams. Now let's go and take a look at how you can inject these chaos experiments into pipelines.
            
            
            
Or, to look at it the other way: how do you achieve continuous resilience during the deployment stage? In this example, the pipeline is meant for deploying a given service. That means somebody has kicked off a deployment of a given service, which could be a complicated process or a complex job in itself, and once it is deployed, we should in general add more tests. This deployment is supposed to involve some functional tests as well, but in addition to that, you can add more chaos tests. For example, here, each step in a Harness pipeline can be a chaos experiment. And if you go and look at this chaos experiment step, it's integrated well enough that you can browse, in your same workspace, the chaos experiments that are available. So I'm just going to go and select a certain chaos experiment here, and then you can set the expected resilience score against it.
            
            
            
If that resilience score is not met, you can implement a failure strategy: just observe, take some manual intervention, roll back the entire stage, et cetera. For example, in this actual case, we have configured the failure strategy as a rollback.
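The gating logic described here boils down to a simple comparison; the following is a rough sketch of that decision, with the threshold, the score source and the rollback hook as assumed placeholders rather than the actual pipeline implementation.

```python
# Minimal sketch of the pipeline gate: compare the achieved resilience score
# with the expected score and apply a failure strategy. Names are placeholders.

EXPECTED_RESILIENCE_SCORE = 100.0


def apply_failure_strategy(achieved_score: float, strategy: str = "rollback") -> str:
    if achieved_score >= EXPECTED_RESILIENCE_SCORE:
        return "proceed"            # resilience target met, continue the pipeline
    if strategy == "rollback":
        return "rollback-stage"     # undo the deployment that was just made
    if strategy == "manual":
        return "pause-for-manual-intervention"
    return "mark-failed"


# Example: one probe out of three failed, so the score came in at ~66.7.
print(apply_failure_strategy(66.7))  # -> rollback-stage
```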
            
            
            
Typically, you can go and see the past executions of this pipeline. Let's say that this is a failed instance of the pipeline. You can see that the pipeline was deploying this service, the chaos experiment executed, and the achieved resilience was not good enough. If you take a look at the resilience scores, or the probe details, you see that one particular probe failed. In this case, when the CPU hog was injected, the pod was fine, and the core service where the CPU was injected continued to be available, but another service showed a latency issue, so that probe was not good. That eventually caused the step to fail, and the pipeline was rolled back.
            
            
            
So that is an example of how you could add more and more chaos experiments into a pipeline and stop leaking resilience bugs to the right. Primarily, what we are trying to say here is that we should encourage the idea of injecting chaos experiments into pipelines and sharing those chaos experiments across teams once someone has developed them, most likely developers in this case, or QA team members. In any large deployment or development system, there are a lot of common services, the teams are distributed, and there are a lot of processes involved. Just like you share common test cases, you could share the chaos tests as well. When you do that, it becomes a practice. The practice of injecting chaos experiments whenever you test something becomes common, it increases the adoption of chaos engineering within the organization and across teams, and it eventually leads to more stability and fewer resilience issues or bugs.
            
            
            
So that's a quick way of looking at how you can use a chaos experimentation tool and its chaos experiments to inject chaos in pipelines and verify resilience before the changes actually go to the right, or to the next stage.

Now let's look at another demo for continuous resilience, where you can inject multiple chaos experiments and use the resilience score to decide whether to move forward or not.
            
            
            
In this demo, we have a pipeline which is used internally at Harness, in one of the module pipelines. So let's take a look at this particular pipeline. What we have done here is that the existing pipeline is not touched at all; it is kept as is. The maintainer of this particular stage can continue to focus on the regular deployment and the functional tests associated with it. Once the functional tests are completed after deployment, you can add more chaos tests in separate stages. In fact, in this particular example, there are two stages: one to verify the code changes related to the chaos module (the CE module), and another stage related to the platform module itself. You can put all of them into a group; here it's called a step group. So you can dedicate one single separate stage to group all the chaos experiments together, and you can set them up to run in parallel if needed, depending on your use case. Individually, each chaos experiment will return some resilience score, and you can take all the resilience scores into account and decide at the end whether you want to continue or take some action such as a rollback. In this case, the expected resilience was all good, so nothing needed to be done, and the pipeline proceeded.
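As a rough sketch of that end-of-stage decision, the scores coming back from the parallel experiments could be collected and compared against per-experiment expectations; the experiment names and expected scores below are assumptions for illustration.

```python
# Minimal sketch: aggregate resilience scores from a step group of parallel
# chaos experiments and decide whether the stage may proceed.
# Experiment names and expected scores are illustrative assumptions.

achieved = {
    "chaos-module-pod-cpu-hog": 100.0,
    "chaos-module-pod-delete": 100.0,
    "platform-module-network-latency": 100.0,
}
expected = {
    "chaos-module-pod-cpu-hog": 100.0,
    "chaos-module-pod-delete": 80.0,
    "platform-module-network-latency": 90.0,
}

failed = [name for name, score in achieved.items() if score < expected[name]]
decision = "proceed" if not failed else "rollback"
print(decision, failed)  # -> proceed []  (every experiment met its expectation)
```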
            
            
            
This is another example of how you can use step groups to put multiple chaos experiments into a separate stage and then take a decision based on the resilience score. I hope this helps; it's another simple demo of how you can use multiple chaos experiments together.

Well, you have looked at those two demos.
            
            
            
So in summary, resilience is a real challenge, and there's an opportunity to increase resilience by bringing developers into the game and introducing chaos experimentation into the pipeline. You can get ahead of this challenge of resilience by involving the entire DevOps organization, rather than just involving the SREs alone on an as-needed basis. The DevOps culture of chaos engineering is more scalable, and it actually makes it easier to adopt chaos engineering at scale. So thank you very much for watching this talk. I'm available at this Twitter handle, or on the Litmus Slack channel; feel free to reach out if you want to talk about more practical use cases and what I've been seeing in the field with chaos engineering adoption. Thank you, and have a great conference.