Conf42 Chaos Engineering 2023 - Online

Shifting left chaos testing


We all agree reliability is one of the most important features of any application. Then why are we not routinely including reliability as part of the testing process? Why have we “shifted it to the right”, relying (maybe too much) on chaos testing in production? Can we shift it back to the left?


  • Pablo Chacin is the chaos engineering lead at k6, Grafana Labs. He will talk about how organizations can build confidence in their ability to withstand failures by shifting left chaos testing. He will also discuss some of the obstacles organizations may face when trying to adopt chaos engineering.
  • Modern applications follow a microservices architecture that leverages cloud native technologies. Under these conditions, applications frequently fail in unpredictable ways. Chaos engineering is a discipline that emerged as a response to this need. But obstacles still remain for chaos engineering to be adopted by most organizations.
  • The main challenge of modern distributed applications is their unpredictable behavior. It is valid to ask what benefits we can expect from testing known faults in controlled development environments. Will this really contribute to improving the reliability of the applications, or will it only create a sense of false confidence?
  • Load or functional tests can be reused to test the system under turbulent conditions. The xk6-disruptor extension adds fault injection capabilities to k6. Chaos engineering can be democratized by promoting the adoption of chaos testing.


This transcript was autogenerated. To make changes, submit a PR.
Hello. Thank you for joining me in this talk. My name is Pablo Chacin. I'm the chaos engineering lead at k6, Grafana Labs. Today, I will be talking about how organizations can build confidence in their ability to withstand failures by shifting left chaos testing. I will be talking about why achieving reliability in modern applications is hard and how chaos engineering emerged in response to this reality, to help organizations build confidence in their ability to continue operating in the presence of failures. I will also discuss some of the obstacles organizations may face when trying to adopt chaos engineering. I will then introduce chaos testing as a foundation to facilitate the adoption of chaos engineering. Finally, I will exemplify the principles of chaos testing with a case study and will demonstrate how k6, an open source reliability testing tool, can be used for developing chaos tests.
So let's start. Why is achieving reliability in modern applications hard? Modern applications follow a microservices architecture that leverages cloud native technologies. This de facto standard has many benefits, but it also increases the complexity of these applications. This complexity is often beyond the ability of engineers to fully predict how the application will behave in production and how it will react to unexpected conditions like the failure of a dependency, network congestion, resource depletion, and others. Under these conditions, applications frequently fail in unpredictable ways. In many cases, these failures are the consequence of misconfigured timeouts and inadequate fallback strategies that create retry storms and cascading failures. These can be considered defects in the applications. Unfortunately, traditional testing methodologies and tools do not help in finding them, mostly because they manifest in the interaction between services and are triggered by very specific conditions.
Implementing tests that reproduce these conditions is difficult and time consuming, and frequently the resulting tests are themselves unreliable because they produce unpredictable results.
So how can organizations build confidence in their ability to withstand these failures? The most common way is by battle testing their applications, procedures and people by going through incidents. By implementing well-structured post-incident reviews and adopting a blameless culture, organizations can learn from incidents and improve their ability to handle them. But incidents don't make a good learning tool. They are unpredictable. They induce stress in the people involved. You cannot decide what or when to learn, not to mention their potential impact on the users and the business.
So why not induce these incidents on purpose? In this way, incident response teams can be prepared in advance and test procedures with less stress. This is a better way of learning, but there are still risks, mostly in the initial stages when the procedures are not well tested. Also, there is a limit on the number of incidents an organization can try before affecting its service level objectives. Another limitation is that they are preparing the organization for situations it has already experienced or can somehow predict. But as we discussed previously, modern systems sometimes fail in unpredictable ways. Therefore, we need a way to experiment with the system and learn more about how it fails. Chaos engineering is a discipline that emerged as a response to this need.
It builds on the idea of experimenting on a system by injecting different types of faults to uncover systemic weaknesses: for instance, killing or overloading random compute instances, or disrupting the network traffic. The idea is to do this in a continuous way, making faults the norm instead of the exception, with the intention that developers get used to facing them and therefore consider recovery mechanisms in the design of their applications instead of introducing them later in response to incidents. This approach has been championed by companies such as Netflix with the iconic Chaos Monkey, but despite its promises, some obstacles still remain for chaos engineering to be adopted by most organizations.
First, chaos engineering sets a high adoption bar by focusing on experimenting in production, and we cannot argue against this principle. Nothing can substitute for testing the real stuff. Unfortunately, many organizations are not prepared for this. They don't have battle-tested procedures, and the teams may lack confidence in their ability to contain the effects of such experiments.
Another significant issue is the unpredictability of the results of these experiments. Killing or overloading instances, or disrupting the network, may affect multiple application components, introducing unexpected side effects and making the blast radius hard to predict. Moreover, modern infrastructure has many recovery mechanisms that may come into play and interact in complex ways. All these factors make the result of the experiment hard to predict, and this is in part the idea. This is why it is called chaos engineering, after all. But it is difficult to test recovery strategies for a specific situation if you cannot reproduce it consistently.
Finally, adopting chaos engineering tools can also be challenging. Installing and using them sometimes requires considerable knowledge of infrastructure and operations.
They seem designed by and for SREs and DevOps, and it makes sense, as chaos engineering has its roots in these communities. However, this complexity raises the adoption bar for most developers, who cannot be self-sufficient when using these tools. In summary, chaos engineering presupposes a level of technical proficiency and maturity that many teams and organizations do not have.
So how can more organizations start building confidence in their ability to withstand failures? Is there an alternative to bungee jumping into chaos engineering in production? We propose shifting chaos testing to the left: incorporating chaos testing as part of the regular testing practices early in the development process, subjecting the application to faults that have been identified from incidents, validating whether it can handle them in an acceptable way, and implementing and testing recovery mechanisms if not.
At the core of chaos testing is fault injection. Fault injection is the software testing technique of introducing errors into a system to ensure it can withstand and recover from error conditions. This is not a novel idea. It has been used extensively in the development of safety-critical systems. However, it has generally been used for testing how applications handle isolated errors, such as processing corrupted data. The challenge for modern applications is to inject the complex error patterns they will experience in their interaction with other components. Fortunately, as explained in this quote from two former members of the Netflix chaos engineering team, "from the distributed system perspective, almost all interesting availability experiments can be driven by affecting latency or the response type." Later in this presentation we will discuss how this can be achieved using k6.
But at the beginning of this presentation, we said that the main challenge of modern distributed applications was their unpredictable behavior under turbulent conditions.
Therefore, it is valid to ask: what benefits can we expect from testing known faults in controlled development environments? Will this really contribute to improving the reliability of the applications, or will it only create a sense of false confidence? According to a study of failures in real-world distributed systems, 92% of the catastrophic system failures were the result of incorrect handling of non-fatal errors, and in 58% of these cases, the resulting faults could have been detected through simple testing of the error handling code. And how hard is it to improve this error handling code? According to the same study, in 35% of the cases the error handling code fell into one of three patterns: it overreacted, aborting the system on non-fatal errors; it was empty or only contained a log printing statement; or it contained expressions like FIXME or TODO in the comments.
What this study tells us is that there is significant room for improvement in the reliability of complex distributed applications by just testing the error handling code, and this is what chaos testing proposes: incorporate the principles of chaos engineering early into the development process as an integral part of the testing practices, shifting the emphasis from experimentation to verification, from uncovering unknown faults to ensuring proper handling of known faults. By adopting chaos testing, teams can build confidence for moving forward to chaos experiments in production, and then use the insights obtained from these experiments and from incidents to improve their chaos tests, creating a process of continuous reliability improvement. In order to achieve its goal, chaos testing rests on four guiding principles.
Incremental adoption: organizations should be able to incorporate chaos testing into their existing teams and development processes in an incremental manner, starting with simple tests so they can better understand how their system handles faults, and then building more sophisticated test cases.
Application-centric testing: developers should be able to reproduce in their tests the same fault patterns observed in their applications, using familiar terms such as latency and error rates, without having to understand the underlying infrastructure.
Chaos testing as code: switching between application testing tools and chaos testing tools will create friction in the process and, as we discussed before, it may reduce the autonomy of developers for creating chaos tests. Therefore, developers should be able to implement chaos tests using the same automation tools they are familiar with. But the adoption of chaos as code has other benefits. Developers can reuse load patterns and user journeys from their existing tests. In this way, they can ensure they are testing how the application reacts to faults under realistic use cases.
Controlled chaos: faults introduced by chaos tests should be reproducible and predictable to ensure the tests are reliable. You cannot build confidence from flaky tests. Tests should also have a minimal blast radius. It should be possible to run them in shared infrastructure, for example a staging environment, with little interference between teams and services.
Let's put these principles into action using a fictional case study. This case study uses the Sock Shop. This is a demo application that implements an e-commerce site that allows users to browse a catalog of products and buy items from it. It follows a polyglot microservice architecture. Microservices communicate using HTTP requests, and it is deployed in Kubernetes. The front end service works both as a backend for the web interface and also exposes the APIs of other services, working as a kind of API gateway.
Let's now imagine an incident that affected the Sock Shop. In this incident, the catalog service database was overloaded by long-running queries. This overload caused delays in the requests of up to 100 milliseconds over the normal response time and eventually made some requests fail and return an HTTP 500 error. The catalog service team will investigate the incident to address the root cause. However, the front end team wonders how a similar incident will affect their service and the end users.
To investigate this, let's start with a load test for the front end service that will serve as a baseline. This test applies a load to the front end service, requesting products from the catalog. The front end service will make requests to the catalog service. The front end service is therefore the system under test. We will measure two metrics for the requests to the front end service: the failure rate and the 95th percentile of the response time. We will send these metrics to our Grafana dashboard for visualization, and we will implement this test using k6.
k6 is an open source reliability testing tool. In k6, tests are implemented using JavaScript. k6 covers different types of testing needs such as load testing, end-to-end testing, synthetic testing, and chaos testing. It can send test results to common backends such as Prometheus. Its capabilities can be extended using a growing catalog of extensions, including Kafka, NoSQL databases, Kubernetes, SQL, and many others.
Even though we are not going into too much detail in this example, there are some concepts that are useful for understanding the code we will discuss next. In k6, user flows are implemented as JavaScript functions that make requests to the system under test, generally using a protocol such as HTTP if we are testing an API, or a simulated browser session if we are testing the user interface. The results of these requests are validated using checks.
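The two metrics chosen for the baseline, the failure rate and the 95th percentile of the response time, can be illustrated with a short plain JavaScript sketch. This is only an illustration of what the numbers mean: k6 computes these metrics itself, and the sample shape (`status`, `durationMs`), the function names, and the nearest-rank percentile method used here are assumptions made for the example, not how k6 implements them.

```javascript
// Fraction of requests that failed. "Failed" is assumed here to mean
// a 5xx status code, as in the incident described in the talk.
function failureRate(samples) {
  if (samples.length === 0) return 0;
  const failed = samples.filter((s) => s.status >= 500).length;
  return failed / samples.length;
}

// Nearest-rank percentile: the smallest value such that at least
// p percent of the values are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical samples: 100 requests taking 1..100 ms, two of them failing.
const samples = Array.from({ length: 100 }, (_, i) => ({
  status: i < 2 ? 500 : 200,
  durationMs: i + 1,
}));

console.log('failure rate:', failureRate(samples));
console.log('p95 (ms):', percentile(samples.map((s) => s.durationMs), 95));
```

The 95th percentile is used instead of the average because a small share of slow requests, such as the ones caused by an overloaded database, would be hidden by an average.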
Scenarios describe a workload in terms of a user flow, the number of concurrent users, the rate at which the users make requests, and the duration of the load. Thresholds are used for specifying SLOs for metrics such as latency and error rates.
Let's walk through the test code. Don't worry, we will just skim over the code, highlighting the most relevant parts. At the end of the presentation, you will find additional resources that explain this code in detail. The test has two main parts: a function that makes the call to the front end service and checks for errors, and a scenario that describes how much load will be applied and for how long. Let's run this test and check the performance metrics. We can see the error rate was zero. That is, all requests were successful, and the latency was around 50 milliseconds. We will use this result as a baseline.
Now let's add some chaos to this test. We will repeat the same load test, but this time, while the load is applied to the front end service, we will inject faults in the requests served by the catalog service, reproducing the pattern observed in the incident. More specifically, we will increase the latency and inject a certain amount of errors in the responses. Notice that the front end service is still the system under test.
For doing so, we will be using the xk6-disruptor extension. The disruptor is an extension that adds fault injection capabilities to k6. We are not going into the technical details of how this extension works. For now, it is sufficient to say that it works by installing an agent into the targets of the chaos test, for example a group of Kubernetes pods. These agents have the ability to inject different types of faults, such as protocol-level errors, and this is done from the test code, as we will see next, without any external tool or setup. At the end of the presentation, you will find resources for exploring this extension in detail, including its architecture.
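Going back to the baseline load test, the scenario and threshold concepts described above could be written down roughly as follows. This is a hedged sketch, not the code shown in the talk: the object follows the shape of k6's `options` (in a real k6 script it would be exported with `export const options`), but the scenario name, the function name `requestProduct`, and every numeric value are assumptions. The `plannedIterations` helper is hypothetical and exists only to make the arithmetic of the workload explicit; it only understands durations written in whole seconds.

```javascript
// Shape modeled on k6's `options` object. All values are assumed for illustration.
const options = {
  scenarios: {
    // Hypothetical scenario name.
    load: {
      executor: 'constant-arrival-rate', // start iterations at a fixed rate
      rate: 100,                         // iterations started per timeUnit
      timeUnit: '1s',
      duration: '30s',                   // how long the load is applied
      preAllocatedVUs: 5,                // virtual users created up front
      maxVUs: 100,                       // upper bound on virtual users
      exec: 'requestProduct',            // hypothetical user flow function
    },
  },
  thresholds: {
    // SLO-style limits on built-in k6 metrics.
    http_req_failed: ['rate<0.01'],      // less than 1% failed requests
    http_req_duration: ['p(95)<200'],    // 95th percentile under 200 ms
  },
};

// Hypothetical helper: parse a duration such as '30s' into seconds.
function parseSeconds(text) {
  const match = /^(\d+)s$/.exec(text);
  if (!match) throw new Error(`unsupported duration: ${text}`);
  return Number(match[1]);
}

// Number of iterations the scenario is expected to start in total.
function plannedIterations(scenario) {
  const perSecond = scenario.rate / parseSeconds(scenario.timeUnit);
  return perSecond * parseSeconds(scenario.duration);
}

console.log('planned iterations:', plannedIterations(options.scenarios.load));
```

Keeping the workload and the SLOs in one declarative object is what allows the same load test to be reused later as the basis for the chaos test.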
Let's see how this works in the code. We add a function that injects faults in a service. This function defines a fault in terms of a latency that will be added to each request and the rate of requests that will return a given error. In this case, 10% of requests will return a 500. Then it selects the catalog service as the target for the fault injection. This instructs the disruptor to install the agents in the pods that back this service. Finally, it injects the fault for a given period of time, in this case the total duration of the test. Then we add a scenario that invokes this function at a given point during the execution of the test. In this case, we will inject the fault from the beginning of the test, and this is all that we need.
Let's run this test. We can see that the latency reflects the additional 100 milliseconds that we injected. We can also observe that now we have an error rate of almost 12%, slightly over the 10% that we defined in the fault description. It's important to remark that we are injecting the faults into the catalog service, but we are measuring the error rate at the front end service, so we can see the front end service is not handling the errors in the requests to the catalog service. Apparently there are no retries of failed requests. I wouldn't be surprised if we found a TODO comment in the error handling code.
How does this test help the front end team? First, by uncovering improper error handling logic, as we just saw, and then by enabling them to validate different solutions until they obtain an acceptable error rate, for example by introducing retries. They can also easily modify the test to reflect other situations, like higher error rates, in order to fine-tune the solution and avoid issues such as retry storms.
This brief example shows the principles of chaos testing in action. Load or functional tests can be reused to test the system under turbulent conditions. These conditions are defined in terms that are familiar to developers:
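The effect observed in this test, and the retry fix the front end team might try, can be reproduced with a small in-process simulation. This is a plain JavaScript sketch under stated assumptions, not the xk6-disruptor API and not the test shown in the talk: the functions `catalog`, `withFault`, `frontend` and `frontendErrorRate`, and the fault field names, are hypothetical. To keep the result repeatable, errors are injected on every Nth call instead of at random.

```javascript
// Hypothetical stand-in for the catalog service: always succeeds in 50 ms.
function catalog() {
  return { status: 200, latencyMs: 50 };
}

// Wrap a service so every response gets extra latency and every Nth
// response is replaced by an error (N = 1 / errorRate, deterministic).
function withFault(service, fault) {
  const everyNth = Math.round(1 / fault.errorRate);
  let calls = 0;
  return () => {
    calls += 1;
    const res = service();
    const status = calls % everyNth === 0 ? fault.errorCode : res.status;
    return { status, latencyMs: res.latencyMs + fault.delayMs };
  };
}

// Hypothetical front end: calls the catalog and retries failed requests
// up to `retries` times, accumulating the latency of every attempt.
function frontend(callCatalog, retries) {
  let res = callCatalog();
  let latencyMs = res.latencyMs;
  for (let i = 0; i < retries && res.status >= 500; i += 1) {
    res = callCatalog();
    latencyMs += res.latencyMs;
  }
  return { status: res.status, latencyMs };
}

// Error rate seen at the front end while the fault is active.
function frontendErrorRate(retries, requests = 1000) {
  const fault = { delayMs: 100, errorRate: 0.1, errorCode: 500 };
  const faultyCatalog = withFault(catalog, fault);
  let failed = 0;
  for (let i = 0; i < requests; i += 1) {
    if (frontend(faultyCatalog, retries).status >= 500) failed += 1;
  }
  return failed / requests;
}

console.log('no retries:', frontendErrorRate(0));
console.log('one retry:', frontendErrorRate(1));
```

Without retries, the front end passes the injected errors straight through to its callers, which matches what the test in the talk revealed. In this deterministic model a single retry removes all errors; with randomly distributed faults some retries would fail too, and each retry adds latency and load, which is why the talk warns about retry storms.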
latency and error rate. The test has a controlled effect on the target system. The test is repeatable and the results are predictable. The fault injection is coordinated from the test code. The fault injection does not add any operational complexity. There is no need to install any additional component or define an additional pipeline for triggering the fault injection.
To conclude, let me make some final remarks. We firmly believe that the ability to operate reliably shouldn't be a privilege of the technology elite. Chaos engineering can be democratized by promoting the adoption of chaos testing, but to be effective, chaos testing must be adapted to the existing testing practices. At Grafana k6, we are committed to making this possible, making chaos engineering practices accessible to a broad spectrum of organizations by building a solid foundation from which they can progress toward more reliable applications.
Thank you very much for attending. I hope you have found this presentation useful. If you want to learn more about chaos testing using k6, you may find these resources useful. You will find an in-depth walkthrough of the example we saw today and more technical details about the disruptor extension.

Pablo Chacin

Technical Lead Chaos Engineering @ Grafana Labs / K6

