Conf42 Chaos Engineering 2023 - Online

Continuous Resilience with Chaos Engineering

Chaos Engineering has been helping SREs improve resilience for a while. Adoption has grown so much that we are now seeing more patterns of usage of Chaos Engineering, and one of them is the use of chaos experiments across the board in the software delivery process. In this talk, I introduce the details and best practices of the pattern of developing and using chaos experiments in CD pipelines and in production by both developers and SREs, through a collaborative approach that gives rise to a new concept in DevOps called Continuous Resilience. With the continuous resilience pattern in place, there is an opportunity to resolve the traditional issues in the implementation of chaos engineering, such as low chaos coverage and SREs' dependency on developers.


  • Uma Mukkara is Head of Chaos Engineering at Harness and co-founder of the LitmusChaos CNCF project, which is at the incubating stage. He talks about how we can innovate more in the space of reliability and resilience.
  • In the world today, we have about 20 million software developers costing $100k on average annually. This leads to a total spend of about $2.7 trillion on software development. How can we reduce developer toil and increase innovation in different areas?
  • The adoption of chaos engineering has been increasing rapidly in recent years. Modern chaos engineering is driven not just by the need to reduce outages, but also by the need to increase developer productivity. These needs are leading to a new concept called continuous resilience.
  • Resilience coverage applies the familiar idea of code coverage from the traditional developer spectrum to chaos experiments. Continuous resilience does not limit itself just to pipelines: you can automate chaos tests on the production side as your maturity improves. This leads to greater adoption overall.
  • In continuous resilience, you are talking about multiple teams across different pipeline stages. Chaos hubs are generally a way to share experiments across those teams. So how are you supposed to achieve continuous resilience during the deployment stage?


This transcript was autogenerated. To make changes, submit a PR.
Everyone, good morning, good afternoon and good evening. Happy to be here at the Conf42 Chaos Engineering conference. Before we delve into the topic of continuous resilience, a bit about me: I am Uma Mukkara, Head of Chaos Engineering at Harness. I am also a maintainer and co-founder of the LitmusChaos CNCF project, which is at the incubating stage. I've been working with customers on a wide number of chaos engineering use cases, and in that process I have learned a little bit about how chaos engineering is being adopted: what are the use cases that are more prominent, more appealing? So here is an opportunity for me to talk about what I learned in the last few years of trying to push chaos engineering into more practical environments in the cloud native space. Innovation is a continuous process in software, right? We're all trying to innovate something in software, either to improve governance, quality, efficiency, control, reliability, et cetera. So in this specific talk, let's talk about how we can innovate more in the space of reliability or resilience. Before we actually reach the topic of innovating in the area of resilience, let's talk about software development costs overall. In the world today, we have about 20 million software developers costing $100k on average annually. That leads to a total spend of about $2.7 trillion. That's huge money being spent on software development. If that big an amount of money is being spent, what are the software developers doing? In this poll you can see that more than 50% of software developers indicate that they actually spend less than three hours a day writing code. Where are they spending the remaining time?
They could be spending the time trying to build the environments, or build the deployment environments, or debugging the existing software, or the software that they just wrote, or production issues, et cetera. This all adds up to more toil for software developers. And is there a way we can actually reduce this toil by as much as about 50%? That could free up more money for the actual software development, and that's a huge spend, right? So this is the overall market space for developers, but you can apply the same thing to your own organization. You're spending a lot of money on developers, but developers are actually not spending enough time and effort writing code. That's an opportunity to reduce toil and increase innovation in different areas. The opportunity is to innovate to increase developer productivity and hence save cost, and you can put that cost back into more development and ship more products, or more code, or faster code, et cetera. So let's see how this applies to resilience as a use case. You can actually reduce the developer toil. This toil comprises build time, deployment time and debug time, and reducing it is where you can actually improve productivity. In this specific topic we are going to look at: where are these developers spending their time in debugging? Why are they doing that? And how can we actually reduce that amount of debug time? Eventually that leads to more time for innovation, and because you are reducing the debug time in the area of resilience, that also improves the resilience of the products. So why are they spending time in debugging? In other words, why are the bugs being introduced? It could be plain oversight. Developers are humans, so there is a possibility that something is overlooked.
Even the smartest developers can overlook some cases and introduce bugs, or leak bugs to the right, and that cannot be avoided entirely. But the more common pattern that can be observed is that a lot of dependencies in the practical world are not tested, and they are released to the right. It's also possible that there's a lot of churn among developers, a lot of hiring. In that case, you are not the person who has written the product from the beginning; the product has scaled up so much and you don't necessarily understand the entire architecture of the product, but you are rolling out some of the new features. That leads to a bit of a lack of understanding of the product architecture, so some of the intricacies are not well understood, and then design bugs or code bugs can trickle in. Even if you take care of all that, you assume that the product will run in a certain environment, but the environment can be totally different, or it can keep changing, and your code may not work as expected in that environment. These are the reasons why, as a developer, you end up spending more time, and these reasons become more common in cloud native. But before that, let's look at the cost of debugging. You can end up costing the organization much more if you actually find these bugs in production and start fixing them. The cost of fixing bugs in production is almost ten times more than what you incur if you debug and fix them in QA or within the code. It's a well known factor, nothing new: it's always good to find the bugs before they go into production. So that's another way to look at this. But the reasons for introducing these issues, or overlooking these causes, are becoming more and more common in the case of cloud native development. In cloud native, two things are happening.
By default, you are expected to ship things faster because the total ecosystem supports faster shipment of your builds: each developer has a small amount of code to look at, with well-defined boundaries, and there is an entire cloud native ecosystem of continuous delivery around you. The pipelines are better, and the tools surrounding shipment, like GitOps, are helping you ship things faster. Added to that, containers are helping developers wrap up features faster because they are microservices: you need to look at things objectively, only within the limited scope of a container surrounded by APIs, so you are able to finish the coding and ship things faster. Because you are doing this very fast, the chances of not doing the dependency testing, or of not understanding the product very well, are high, and that can cause a lot of issues. The impact also varies: if the faults are happening in infrastructure, the impact of an outage can be very high, whereas a fault happening within just your container may have a very low impact at that level. The summary here is that you are testing the code as much as possible and then shipping as fast as you can, and you may or may not be looking at the entire set of new testing that is needed. It's possible that faults happening in the deep dependencies are not tested. So what typically happens in cloud native shipments in such cases is that the end service is impacted, developers jump in to debug, and finally they come to discover that there's a dependent component or service that incurs a fault within it, or multiple faults, and because of that a given service is affected. That's a weakness within the given service: it's not resilient enough, and you find it and fix it.
So this is typically a case of increased cost, and there's a good opportunity to find such issues much earlier and avoid the cost. The kind of testing you need to do before you ship, to avoid such cost, is to assume that faults can happen within your code, or within the APIs that your code or application is consuming, or in other services such as databases, message queues or other network services. There are faults that could be happening, and your application has to be tested for such faults. And of course there are infrastructure faults, which are pretty common; infrastructure faults can happen within Kubernetes, and your code has to be resilient to such faults. This is the set of dependent-fault testing that you need to be aware of and run. What this really means is that cloud native developers need to do chaos testing. This is exactly what chaos testing typically means: some fault can happen in my dependent component or infrastructure, and the service that depends on my code needs to be resilient enough. Chaos testing is needed by the nature of the cloud native architecture itself to achieve high resilience. And we are basically saying that developers end up spending a lot of time debugging, and that's not good for developer productivity. So if you need to do chaos testing, let's actually see what the typical definition of chaos engineering is. Chaos engineering has been there for quite some time. We all know that, and we have all been told that chaos engineering is about introducing controlled faults to reduce expensive outages. If you are reducing outages, you are looking at doing chaos testing in production, and that comes with a high barrier. This is one reason why, even though chaos engineering has been around for quite some time, its adoption has started increasing rapidly only in recent years.
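As a tiny illustration of that definition, consider a hypothetical service that reads through a cache and is expected to survive a cache outage. Everything here (the `FlakyCache` class, the fallback design) is a made-up sketch, not any real tool's API:

```python
# Minimal chaos-test illustration: inject a fault into a dependency
# (a cache) and assert that the dependent service stays resilient.
# The cache/fallback design here is a made-up example.

class FlakyCache:
    def __init__(self, failed: bool = False):
        self.failed = failed

    def get(self, key: str) -> str:
        if self.failed:  # the injected fault: dependency unavailable
            raise ConnectionError("cache unavailable")
        return f"cached:{key}"

def lookup(cache: FlakyCache, db: dict, key: str) -> str:
    """Service code under test: must survive a cache outage."""
    try:
        return cache.get(key)
    except ConnectionError:
        return db[key]  # resilience: fall back to the database

# Chaos test: with the cache down, the service still answers from the DB.
assert lookup(FlakyCache(failed=True), {"user:1": "Ada"}, "user:1") == "Ada"
```

The same shape scales up: real chaos tools inject the fault at the container, network or infrastructure level instead of a stub, and the "assert" becomes a steady-state probe.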
The typical understanding of chaos testing or chaos engineering is that it applies to production; that understanding is changing very fast, and that's exactly what we are talking about here. Traditional chaos engineering is also about introducing game days: you try to find the right champions within the organization who are open to running these game days, find any resilience issues, and then keep doing more game days. That's typically how chaos engineering has been practiced. It's been a reactive approach: either some major incidents have happened and, as a solution, you try out chaos engineering, or sometimes it is driven by regulations, especially in the case of DR, where chaos engineering comes into the picture in the banking sector. These were the typical needs and patterns that drove chaos engineering until a couple of years ago. But modern chaos engineering is really driven not necessarily just by the need to reduce outages, but also by the need to increase developer productivity. If my developers are spending a lot of time debugging production issues, that's a major loss, and you need to avoid that. How can I do that? Use chaos testing. Similarly for the QA teams: QA teams are coming in and looking for more ways to test the many components that are arriving in the form of microservices. Earlier it was easy enough: you got a monolithic application with very clear boundaries, and you could write a better, more complete set of test cases. But now microservices can pose a challenge for QA teams. There are so many containers, and they're coming in so fast. How do I actually make sure the quality is right in many aspects? That can be achieved through chaos testing. It's also possible that a whole big monolith, a traditional application that's working well and is business critical in nature, is being moved to cloud native.
How do you ensure that everything works fine on the other side, on cloud native? One way to ensure it is by employing chaos engineering practices. So the need for chaos engineering in modern times is really driven by these needs, rather than just "hey, I incurred some outages, let's go fix them." While that is still true, there are more drivers behind the adoption of chaos engineering, and these needs are leading to a new concept called continuous resilience. So what is continuous resilience? It's basically verifying the resilience of your code or component through automated chaos testing, and doing that continuously. Chaos engineering, done in an automated way across your DevOps spectrum, is how you achieve continuous resilience, and that approach is called the continuous resilience approach. Just to summarize: you do chaos engineering in dev, QA, pre-prod and prod, continuously, all the time, involving all the personas, and that leads to continuous resilience as a concept. The typical metrics that you look for in the continuous resilience model are the resilience score and the resilience coverage. You always measure the resilience score of a given chaos experiment, or of a component or a service itself. It can be defined as the average success of the steady state checks of whatever you are measuring: the steady state checks that are done during a given experiment, or of a given component or service. Typically it is expressed out of 100, or as a percentage. The more important metric in continuous resilience is resilience coverage: because you are looking at the whole spectrum, you can come up with a total number of possible chaos tests. Basically, you compute them from the total resources that your service comprises.
And you can do multiple combinations of that. The resources can be infrastructure resources, API resources, network resources, or the resources that make up the service itself, like container resources, et cetera. Basically, you come up with a large number of possible tests, and then you start introducing such chaos tests into your pipelines; those are the ones that you actually cover. So you have a very clear way of measuring which chaos tests you have done out of the possible chaos tests, and that gives you a coverage. Think of it as code coverage in the traditional developer spectrum, applied to resilience and chaos experiments. Many people call this approach "chaos in pipelines." That's almost the same, except that continuous resilience does not limit itself just to pipelines: you can automate the chaos tests on the production side as your maturity improves. So what are the general differences between the traditional chaos engineering approach and the pipelines, or continuous resilience, approach? Traditionally, in the game days model, you execute on demand with a lot of preparation: you need to assign certain dates, take permissions, and then execute the tests. In pipelines, you execute continuously without much thought or preparation. The tests are supposed to work, and if one doesn't, it doesn't hurt so much; it's actually a good thing that you can go and look at it whenever it fails. Maybe it just slows down the delivery of your builds, but that's okay. This leads to greater adoption overall. Game days are targeted towards SREs; SREs are the ones who budget the entire game days model. But in the chaos pipelines model, all personas are involved: shift left is possible, but shift right is also possible in this approach.
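The two metrics just described can be sketched in a few lines. This is only an illustration of the definitions, not any tool's actual implementation; the function names and test identifiers are assumptions:

```python
# Sketch of the two continuous-resilience metrics described above:
# resilience score (average steady-state probe success) and resilience
# coverage (executed chaos tests vs. possible ones). Illustrative only.

def resilience_score(probe_results):
    """Average success of the steady-state checks, as a score out of 100."""
    if not probe_results:
        return 0.0
    return 100.0 * sum(probe_results) / len(probe_results)

def resilience_coverage(executed, possible):
    """Share of the possible chaos tests actually run, like code coverage."""
    possible = set(possible)
    if not possible:
        return 0.0
    return 100.0 * len(set(executed) & possible) / len(possible)

# One fault with three steady-state probes, all passing:
print(resilience_score([True, True, True]))  # 100.0

# Two of four possible fault tests wired into the pipelines:
print(resilience_coverage(["pod-cpu-hog", "pod-delete"],
                          ["pod-cpu-hog", "pod-delete",
                           "network-latency", "node-drain"]))  # 50.0
```

The coverage number is what gives planning traction: it tells you how many of the possible dependent-fault tests your pipelines actually exercise.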
So that's another major difference. As you can imagine, in the chaos game day model the barrier to adoption is very high. The barrier for pipelines is much lower, because you're working in a non-prod environment, you have the bandwidth that is associated with development, and developers are the ones writing the tests. So it becomes kind of natural to adopt such a model. When it comes to writing the chaos experiments themselves, traditionally it's been a challenge because the code itself keeps changing, and if SREs are the ones writing them, that bandwidth is usually not budgeted or planned; SREs are typically pulled into other pressing needs, such as attending to incidents and the corresponding action tracking, so it may not always be possible to be proactive in writing a lot of chaos experiments. And in general, because you are not measuring anything like resilience coverage and are just running the game day model, it's not very clear how many more chaos experiments you need to develop before you can say you have covered all your resilience issues. In the continuous resilience approach, it is exactly the opposite. You look to each other for help in a team-sport model, and you extend your regular tests. Developers write integration tests, and now you add some more tests that introduce faults in the dependent components; those tests can be reused by QA, and QA will add a few more tests, which can in turn be reused by developers or SREs, et cetera. So there is increased sharing of the tests in central repositories, or what are generally called chaos hubs. You tend to manage these chaos experiments as code in Git, and that increases adoption.
And with resilience coverage as the concept, you know exactly how much more coverage you need, or how many more tests you need to write, which also helps from a planning perspective. So that's really a new pattern for thinking about how teams need to adopt chaos engineering. That's what I've been observing in the last few years, including at Harness, where we are seeing good growth in the adoption of chaos for the purpose of both developer productivity and increasing resilience as an innovation metric. So let's take a look at a couple of demos. One shows how you can inject a chaos experiment into a pipeline and possibly cause a rollback depending on the resilience score that is achieved. The other is a quick demo of how our development teams at Harness use chaos experiments in the pipeline a little more liberally before the code is shipped to a pre-prod environment or a QA environment. In this demo, we're going to take a look at how to achieve continuous resilience using chaos experiments with a sample chaos engineering tool. In this case we are using Harness Chaos Engineering, but you can use any other chaos engineering tool together with a pipeline tool to achieve the same continuous resilience. So let's start. I have the chaos engineering tool from Harness, Harness Chaos Engineering. It has the concept of chaos experiments, which are stored in chaos hubs. These chaos hubs are generally a way to share experiments across teams, because in continuous resilience you are talking about multiple teams across different pipeline stages, whether dev, QA, pre-prod or prod. Everyone will be using this tool, and they will either have access to common chaos hubs or maintain their own chaos hubs.
A chaos hub maintains the chaos faults that are developed and the chaos experiments that are created, which in turn use the chaos faults. A chaos fault in this case is nothing but the actual chaos injection, plus the addition of certain resilience probes to check a steady state hypothesis. So let me show how a particular chaos experiment is constructed in this Harness Chaos Engineering tool. If I take a look at a given chaos experiment, it has multiple chaos faults, either in series or in parallel. A given chaos fault usually specifies where you are injecting the fault in your target application, and the characteristics of the chaos itself: how long you want to run it, how many times you need to repeat it, et cetera. And then there is the probe; different tools call probes by different names. This is basically a way to check your steady state while the chaos injection is going on. In the case of Harness Chaos Engineering, we use probes to define the resilience of a given experiment, or of a given service, module or component. You can add any number of probes to a given fault, so you're not just depending on one probe to check the resilience; you're checking a whole lot of things while you inject chaos against a given resource. In the case of this particular chaos experiment, for example, you can see that it resulted in 100% resilience. The chaos that was injected was a CPU hog against a given pod, and while that CPU hog was injected, there were three probes: one checked whether the pods were okay, another checked whether some other service was available at its HTTP endpoint, and a third checked a completely different service, the latency response from the front end web service.
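The three probes just described can be modeled roughly as follows. This is a conceptual sketch, not the Harness probe API; the check names, inputs and thresholds are assumptions for illustration:

```python
# Rough model of resilience probes gauging a steady-state hypothesis
# while a CPU-hog fault runs. Names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class ProbeResult:
    name: str
    passed: bool

def pod_status_probe(ready_pods: int, desired_pods: int) -> ProbeResult:
    # Were the target pods still okay during the fault?
    return ProbeResult("pod-status", ready_pods == desired_pods)

def http_probe(status_code: int) -> ProbeResult:
    # Was the other service's HTTP endpoint still available?
    return ProbeResult("http-availability", status_code == 200)

def latency_probe(observed_ms: float, threshold_ms: float) -> ProbeResult:
    # Did the front-end web service keep responding fast enough?
    return ProbeResult("frontend-latency", observed_ms <= threshold_ms)

def steady_state_holds(results) -> bool:
    # The hypothesis passes only if every probe passes.
    return all(r.passed for r in results)

# During the CPU hog: pods stayed ready, the endpoint answered 200,
# and the front-end latency stayed under the threshold.
results = [pod_status_probe(3, 3), http_probe(200), latency_probe(150, 200)]
print(steady_state_holds(results))  # True
```

Multiple independent probes per fault are the point: a single check can look healthy while a neighboring service quietly degrades.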
So you should generally look at the larger picture while gauging the steady state hypothesis during the injection of a chaos fault. Because everything passed and there's only one fault, you see the resilience score as 100%. This is how you would generally score the resilience of a given chaos experiment. These chaos experiments should generally be saved back into a chaos hub, and you should be able to launch them from a given chaos hub, et cetera. In general, the chaos tool should have the ability to do some access control. For example, in the case of Harness Chaos Engineering, you have default access control over who can access the centralized library of chaos hubs and who can execute a given chaos experiment. Chaos infrastructure is your target agent area, and if there are game days, there is control over who can run those game days; typically nobody should have the ability to remove the reports of game days, so there's no delete option for anyone. With this kind of access control, plus the capabilities of chaos hubs and probes, you are able to score the resilience of a given chaos experiment for a given resource and also share the developed chaos experiments across multiple teams. Now let's take a look at how you can inject these chaos experiments into pipelines, or, to look at it the other way, how you are supposed to achieve continuous resilience during the deployment stage. In this example, the pipeline is meant for deploying a given service. That means somebody has kicked off a deployment of a given service, which could be a complicated process or a complex job in itself. Once it's deployed, we should in general add more tests. This deployment is supposed to involve some functional tests as well, but in addition to that, you can add more chaos tests.
For example, here each step in a Harness pipeline can be a chaos experiment, and if you go and look at this chaos experiment, it's integrated well enough that you can browse, in your same workspace, the chaos experiments that are available. I'm just going to go and select a certain chaos experiment here, and then you can set the expected resilience score against it. In case that resilience score is not met, you can implement some failure strategy: observe, take some manual intervention, roll back the entire stage, et cetera. In this actual case, we have configured the failure strategy as a rollback. You can also go and see the past executions of this pipeline. Let's say there is a failed instance of the pipeline: you can see that this pipeline was deploying the service, the chaos experiment executed, and the expected resilience was not met. If you take a look at the resilience scores, or the probe details, you see that one particular probe failed. In this case, when the CPU was increased, the pod was good, and the core service where the high CPU injection happened continued to be available, but some other service showed a latency issue, so that was not good. That eventually caused the step to fail, and the pipeline was rolled back. So that is an example of how you could run more and more chaos experiments in a pipeline and stop leaking resilience bugs to the right. Primarily what we are trying to say here is that we should encourage the idea of injecting chaos experiments into pipelines and sharing these chaos experiments, which someone has developed, most likely developers or QA team members, across teams.
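The gating logic of such a pipeline step can be sketched as below. This is a minimal sketch under assumed names; the strategy labels and threshold are hypothetical, not Harness's actual configuration schema:

```python
# Sketch of gating a pipeline stage on a chaos step's resilience score.
# Strategy names and the threshold are hypothetical examples.

PROCEED, ROLLBACK, MANUAL = "proceed", "rollback", "manual-intervention"

def gate_on_resilience(score: float, expected: float,
                       failure_strategy: str = ROLLBACK) -> str:
    """Decide what the pipeline does after a chaos-experiment step."""
    return PROCEED if score >= expected else failure_strategy

# The failed run described above: one probe failed, so the score came in
# below the expected resilience and the configured strategy fires.
print(gate_on_resilience(score=50, expected=100))  # rollback
```

The useful property is that the decision is a pure function of the score, so the same experiment can be reused across stages with different thresholds and failure strategies.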
In any large deployment or development system, there are a lot of common services, the teams are distributed, and there are a lot of processes involved. Just like you share common test cases, you can share the chaos tests as well. When you do that, it becomes a practice, and the practice of injecting chaos experiments whenever you test something becomes common; it increases the adoption of chaos engineering within the organization, across teams, and eventually leads to more stability and fewer resilience issues or bugs. So that's a quick look at how you can use a chaos experimentation tool to inject chaos in pipelines and verify the resilience before the changes go to the right, or go to the next stage. Let's look at another demo for continuous resilience, where you inject multiple chaos experiments and use the resilience score to decide whether to move forward or not. In this demo, we have a pipeline that is being used internally at Harness for one of the modules. Let's take a look at this particular pipeline. What we have done here is that the existing pipeline is not touched at all; it is kept as is. The maintainer of this particular stage can continue to focus on the regular deployment and the functional tests associated with it. Once the functional tests are completed after deployment, you can add more chaos tests in separate stages. In fact, in this particular example, there are two stages: one to verify the code changes related to the chaos module (the CE module), and another stage related to the platform module itself. So you can put all of them into a group; here it's called a step group. You can dedicate one single separate stage to group all the chaos experiments together, and you can set them up to run in parallel if needed.
Depending on your use case, each chaos experiment returns some resilience score, and you can take all the resilience scores into account and decide at the end whether you want to continue or take some action such as a rollback. In this case, the expected resilience was all good, so nothing needed to be done and the pipeline proceeded. This is another example of how you can put step groups with multiple chaos experiments into a separate stage and then take a decision based on the resilience score. I hope this helps. Well, you have now looked at those two demos. So in summary, resilience is a real challenge, and there's an opportunity to increase resilience by bringing developers into the game and introducing chaos experimentation in the pipeline. You can get ahead of this challenge by involving the entire DevOps spectrum, rather than involving only the SREs on a need basis. The DevOps culture of chaos engineering is more scalable and easier to adopt; it makes chaos engineering at scale easier. So thank you very much for watching this talk. I'm available at this Twitter handle or in the Litmus Slack channel; feel free to reach out to me if you want to talk about more practical use cases from what I've been seeing in the field with chaos engineering adoption. Thank you and have a great conference.

Uma Mukkara

Head of Chaos Engineering @ Harness

