Conf42 Chaos Engineering 2023 - Online

Vaccinate your Software, Build its Immunity


Abstract

Testing on production is always considered a challenge, and mostly the answer is a no. It is hard to convince bosses, and the risk is huge. Chaos testing is a highly specialized testing type for testing large distributed systems, and in my talk I'll showcase how it can unearth bugs that couldn't be found in staging.

Summary

  • Vipin Jain: Testing in production systems is covered under a new stream of testing called chaos testing. Like a vaccine, we inject harm into the system to build the system's immunity. It simulates failures before they lead to unplanned downtime or a negative user experience.
  • The advanced principles emphasize building your hypothesis around a steady-state definition. The second part of chaos testing planning is varying real-world events. Engineers tend to focus on variables that reflect their own experiences rather than the user's experience. The most advanced chaos engineering takes place on the production system.
  • Automation provides a means to scale out the search for vulnerabilities that could contribute to undesirable systemic behaviors. It also helps in empirically verifying our assumptions over time. And finally, the last principle: keep the blast radius as small as possible.
  • A simple two-by-two matrix of knowns and unknowns. On the left-hand side, knowns are the things we are aware of, and unknowns are the things we are not aware of. Let's take a real-life chaos testing scenario.
  • If we shut down the two replicas of the cluster at the same time, we don't know how long it would take to clone two new replicas. We have never tried that, and we don't know how much time it will take for the primary to be cloned twice. But what is known is that we have a pseudo-primary.
  • Chaos testing relies on identifying errors in a system in production. It has benefits for customers, for technical people and for the business. There is increased availability and durability of the service, and the incident management process improves overall.
  • There is something called the eight fallacies of distributed systems. To give a user a smooth network experience, these fallacies need to be tested continuously. These fallacies drive the design of chaos engineering experiments.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello all. My name is Vipin Jain, and I am going to speak on a very interesting yet controversial topic: testing in production systems. Most managers and customers, if you ask them, are not very keen to allow testers to run certain tests on production systems. There are a lot of security issues, there are a lot of stability issues, and they don't want their system to be shut down just because some test case failed. This concept of testing on production systems is covered well under a new stream of testing called chaos testing. I wrote this paper some time back, when the entire world was talking about vaccines and immunity due to the pandemic, and hence I used these words for my topic: vaccinate your software and build its immunity. Before we begin, who am I? I am a son, a husband and a father; you can see my wife, my daughter and my son in this picture. At heart I am a tester, although in my job I am now more into deliveries and providing various solutions to customers all across the world. Having said that, I am always a speaker by choice, because that has let me travel all across the world, meet people, hear their ideas and, of course, use the best of those ideas in my own work. I am a process advocate for delivering quality. In the last couple of years I have gotten into blogging and have written lots of blogs for various websites. My contacts are here: you can follow me on LinkedIn, you can follow me on Twitter, and of course you can always email me if you want to have any kind of discussion with me. Now, this image depicts real-life chaos in a system. You can see the boss is shouting, papers are flying here and there, people are running, so there is no system as such; it looks like the whole thing has gone into disarray. How does this chaos look in a software system? Let's see. First, let's talk about what exactly chaos engineering is. As the image depicts, in chaos engineering, like a vaccine, we inject harm to build immunity into the system. If you go by a definition you will find on Google, chaos engineering, or chaos testing, is a highly disciplined approach to testing a system's integrity. How does it do that? It proactively simulates and identifies failures in a given environment before they lead to unplanned downtime or a negative user experience. Think of a vaccine or a flu shot, where you inject yourself with a small amount of a potentially harmful foreign body in order to build resistance and prevent illness. Chaos engineering is a tool we use to build that kind of immunity in our technical systems by injecting harm. And what kind of harm do we inject into a system? It can be latency, it can be a CPU failure, or it can be a network black hole, injected in order to find and mitigate potential weaknesses. So this is exactly what is termed chaos engineering. And as I said, like a vaccine, we inject harm into the system to build the system's immunity.
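To make that idea of injected harm concrete, here is a minimal sketch of application-level fault injection in Python. The dependency call fetch_recommendations, the probabilities and the delay are all illustrative assumptions, not a specific tool's API; the point is only that the injected latency and errors are controlled, observable and easy to switch off.

    import random
    import time

    # Illustrative fault-injection wrapper: with a small probability, add latency
    # or fail the call outright, so we can observe how the rest of the system reacts.
    def inject_harm(func, latency_prob=0.1, error_prob=0.05, delay_seconds=2.0):
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < error_prob:
                raise ConnectionError("chaos: simulated dependency outage")
            if roll < error_prob + latency_prob:
                time.sleep(delay_seconds)   # chaos: simulated network latency
            return func(*args, **kwargs)
        return wrapper

    def fetch_recommendations(user_id):
        # Hypothetical dependency call; stands in for a real downstream service.
        return ["movie-a", "movie-b"]

    # Wrap the dependency while the experiment runs.
    fetch_recommendations = inject_harm(fetch_recommendations)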
Let's take a brief look at the history of how chaos engineering was born and where it is right now. It began in the year 2010, when Netflix designed what is called Chaos Monkey to test system stability by enforcing failures through the pseudo-random termination of instances and services within Netflix's architecture. When we are watching a movie on Netflix, what we don't want is any lag between frames or between videos. The video should run as a constant, regular stream, and we should get a very smooth movie-watching experience, right? Following their migration to the cloud, Netflix's service was now reliant upon AWS. Netflix in 2010 was not able to sustain its own hardware infrastructure, and therefore it moved to AWS, which helped Netflix scale up. It needed a technology that could show how the system responded when critical components of the production service infrastructure were taken down. Intentionally causing single failures would expose weaknesses in the system and guide the team towards automated solutions that handle failures in a much better way. That was the original aim, and that is why in 2010 Netflix used the first tool, called Chaos Monkey. Then it grew, and in 2011, on top of Chaos Monkey, Netflix created what is called the Simian Army. It had a Janitor Monkey, which identifies and disposes of unused resources; Chaos Kong, which drops a full AWS region; Conformity Monkey, which shuts down instances that are not adhering to best practices; and similarly Chaos Gorilla, Security Monkey, Doctor Monkey and Latency Monkey. So the entire Simian Army was born, adding failure-injection modes on top of Chaos Monkey that allowed testing a more complete suite of failure states, and thus building resilience to those as well. And in 2020, chaos engineering became part of the AWS Well-Architected Framework. The framework, currently in its eighth update, was recently announced to include chaos engineering as a requirement for a reliable system. So, as you can see, within ten years of its beginnings, AWS recognized chaos engineering as a requirement for a reliable system. And when I talk about all these historical things and reliable products, I am not talking about simple software. I am talking about software that is distributed across the world, heavy systems, I would say, with a huge user footfall. In these places, chaos engineering has become very, very crucial. Out of this grew what are called the chaos testing principles, which we will look at one at a time. Just to give you a summary: the first principle states that you have to build a hypothesis around steady-state behavior; then you have to vary real-world events; experiments need to be run on the production system; you should automate these experiments to run continuously; and ultimately you must minimize the blast radius. I will take you through all five to give you a real-world idea of how chaos testing happens and how it is planned. So, let's first begin with "build a hypothesis around steady-state behavior" and what exactly that means. Look at this example before I go into any kind of detail: "Under ___ circumstances, the security team is notified." It is a simple sentence with a blank to fill in. Filled in, it reads: "Under security-control-violation circumstances, the security team is notified." Now, what is happening here? The blank space is filled by the variables that you determine. The advanced principles emphasize building your hypothesis around a steady-state definition.
This means focusing on the way the system is expected to behave and capturing that in a measurement. You have an idea of what can go wrong, and you have chosen the exact failure to inject; what happens next? This is an excellent thought exercise to work through as a team. By discussing the scenario, you can hypothesize on the expected outcome before running it live. What will be the impact on customers, on your service, or on your dependencies? This exercise answers exactly that. So, in this particular example, the entire team sits together and says: our hypothesis is "Under ___ circumstances, the security team is notified." When all of them agree that under security-control-violation circumstances the security team is notified, everyone is on the same page: the security team is notified only when there is a security control violation. The second part of chaos testing planning is that you have to vary the real-world events. Now, what does this mean? It is an advanced principle which states that the variables in experiments should reflect real-world events. For example, in the previous example around the hypothesis, we talked about a security violation. That is a real-world event; it is not something hypothetical. While it might seem obvious in hindsight that everyone uses real-world events, it is still very important to call this out, and I'll give you two very good reasons why I have decided to explain it here. What happens is that people just don't focus on these things. People will say: if I need to gather variables, let's pick all of them. Going back to the previous hypothesis example, without focusing on the exact reason the security alarm gets raised, they say: take all the possible reasons where this security notification has to be made. Or sometimes they say: pick any, how does it matter, we just need to run the scenario. So variables are often chosen for what is easy to do, rather than for what provides the most learning value. That is the first problem. The second is that engineers always have a tendency to focus on variables that reflect their own experiences rather than the user's experience. Engineers tend to stay in a practical, experimental frame of mind rather than thinking about how a real-world user will see the entire situation. So, just to summarize my two points: either people pick any variable, or the engineers decide to pick all the variables, and both of these approaches are wrong. You have to decide what exactly you want and pick only those variables which have the most learning value. This becomes the second principle of chaos testing: vary the real-world events.
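As a minimal sketch of turning such a hypothesis into an executable check, assuming hypothetical helpers raise_security_control_violation and alerts_received_by_security_team (the talk does not name a specific tool): the experiment injects the real-world event and then asserts the steady-state expectation within a time budget.

    import time

    def raise_security_control_violation():
        # Hypothetical: trigger a benign, clearly-labelled policy violation
        # (e.g. create a test credential that breaks a known control).
        ...

    def alerts_received_by_security_team(since):
        # Hypothetical: query the alerting system for notifications created after `since`.
        return []

    def run_hypothesis_check(timeout_seconds=300, poll_seconds=10):
        """Hypothesis: under security-control-violation circumstances,
        the security team is notified."""
        started = time.time()
        raise_security_control_violation()
        while time.time() - started < timeout_seconds:
            if alerts_received_by_security_team(since=started):
                return True          # steady-state expectation held
            time.sleep(poll_seconds)
        return False                 # hypothesis falsified: nobody was notified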
So, what is the next one? We decide that tonight we test in production. Wow. Is this a scene from Spartacus, or from 300? It is from 300, yes, King Leonidas. Experimentation teaches us about the system we are studying. We all know that exploratory testing and all the other types of testing basically ask us to experiment while testing a system. If we are experimenting in a staging environment, then we build confidence in the staging environment. If we are experimenting in a pre-build phase, then we are building confidence in the pre-build phase. How are we going to build confidence in the production system? We are not doing any experiments on the production system, or I would say we are not allowed to do any kind of experimentation on the production system. To the extent that the staging and production environments differ, often in ways a human cannot predict, we are not building confidence in the environment we really care about. And which do you care about more, production or staging? Production, yes. Every user, the entire business, is built upon the interactions of real-world users with the production system. But all the experiments, all the thinking and planning testers do about how end users would behave, actually happens on staging. So how are we going to build confidence about the production system? For this reason, the most advanced chaos engineering takes place in production. I know it is very difficult to convince any senior manager, owner or stakeholder to allow testers to run tests on the production system, but this is what chaos testing is all about: the most advanced chaos testing experiments always run on the production system. That becomes our third principle: use production to run your experiments. What you see here is someone testing a bulletproof vest in real life, and how is it tested? Someone just wears it and the other guy stands in front and shoots. Wow, that is a real production experiment. If it fails, the poor guy dies. If a chaos engineering test fails, the production system can stop working, affecting millions of users across the world. But that is again what chaos engineering talks about: we have to use production to run our experiments. Now, why is it constantly pushed that chaos testing has to run in production? It is a common belief that there is a set of bugs and a set of vulnerabilities that can be found only in the production environment, which uses live data and live traffic. This principle is not without controversy, as I have already told you. Certainly, in some fields there are regulatory requirements that preclude the possibility of affecting the production system, and in some situations there are insurmountable technical barriers to running these experiments. So it is important to remember that the point of chaos engineering is to uncover the chaos inherent in a complex system, not to cause it. If we know that an experiment is going to generate an undesirable effect on the production system or its outcomes, then we should not run that experiment. Remember that.
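A minimal sketch of what that caution can look like in practice: guard every production experiment with the steady-state metric, and abort and roll back the injected fault the moment the metric drifts. The metric source, the start_fault/stop_fault helpers and the thresholds below are assumptions for illustration, not a prescribed implementation.

    import time

    ERROR_RATE_BUDGET = 0.02        # steady state: < 2% of requests failing
    CHECK_INTERVAL_S = 15
    EXPERIMENT_DURATION_S = 600

    def current_error_rate():
        # Hypothetical: read the request error rate from your monitoring system.
        return 0.001

    def start_fault():
        ...   # hypothetical: begin injecting the chosen failure

    def stop_fault():
        ...   # hypothetical: remove the injected failure / roll back

    def run_guarded_experiment():
        start_fault()
        deadline = time.time() + EXPERIMENT_DURATION_S
        try:
            while time.time() < deadline:
                if current_error_rate() > ERROR_RATE_BUDGET:
                    print("Steady state violated, aborting experiment")
                    return False
                time.sleep(CHECK_INTERVAL_S)
            return True
        finally:
            stop_fault()            # always clean up, even on abort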
And how? Once we have finalized an experiment, the next step is to automate your experiments and run them continuously. Fine, we all know what automation is and why it came into existence, right? This is quite a straightforward thing, but here it is important for such systems in two distinct ways. Automation provides a means to scale out the search for vulnerabilities that could contribute to undesirable systemic outcomes, and it helps in empirically verifying our assumptions over time as the unknown parts of the system change. In other words, it covers a far larger set of experiments than humans can cover manually. In complex systems, the conditions that could possibly contribute to an incident are so numerous that they cannot be planned for; in fact, they cannot even be counted, because they are unknowable in advance, which means humans cannot reliably search the solution space of possible contributing factors in a reasonable amount of time. So imagine a system where the functionality of a given component relies on some other component which is outside the scope of the test. This is the case for almost all complex systems, because of the third-party controls and tools that get connected and communicate through web services. Without tight coupling between the given functionality and all its dependencies, it is entirely possible that one of the dependencies will be changed in such a way that it creates a vulnerability in the entire system. Continuous experimentation, provided by automation, can catch these issues and teach the primary operators how the behavior of their own system is changing over time. This could be a change in performance, for example the network becoming saturated by noisy neighbors; or a change in functionality, for example the response bodies of downstream services including extra information that could impact how they are parsed; or it may just be a change in human expectations, for example the original engineers leave the team and the new engineers are not as familiar with how the system behaves. So automation definitely has to be there, because it will continuously check all the possible scenarios the experiment is made up of and give us constant feedback. And then finally, the last principle: you have to keep the blast radius as small as possible. What is the blast radius? When we run a chaos test on a production system, we always run it on a small part of the system, because we always have to keep in mind that if something goes wrong, the production system can stop. So rather than having a test which can affect the entire production system, the test should be as small as possible, so that the blast caused by the test not working correctly affects just a very small part of the system, which can be corrected and the right fix put in place very quickly. You have to use a tightly orchestrated control group to compare against a variable group. Experiments can be constructed in such a way that the impact of the hypothesis under test on customer traffic is minimal. How a team goes about achieving this is highly context-sensitive to the complex system: for some systems it may mean using shadow traffic, or excluding requests that have high business impact, like transactions over a hundred dollars, or implementing automated retry logic for requests that failed during the experiment. In the case of the chaos team's work at Netflix, sampling of requests, sticky sessions and similar functions not only limited the blast radius, they had the added benefit of strengthening signal detection. However it is achieved, this advanced principle emphasizes that in truly sophisticated implementations of chaos engineering, the potential impact of the experiments has to be limited by design.
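Here is a minimal sketch of one way to keep the blast radius small: only a tiny, deterministic sample of requests is routed into the experiment, and high-business-impact requests (for example transactions over $100, as mentioned above) are excluded outright. The request fields and the 0.5% sample rate are illustrative assumptions.

    import hashlib

    SAMPLE_RATE = 0.005                # 0.5% of eligible traffic
    HIGH_VALUE_THRESHOLD = 100.00      # never experiment on large transactions

    def in_experiment(request_id, amount):
        """Decide whether this request participates in the chaos experiment."""
        if amount >= HIGH_VALUE_THRESHOLD:
            return False               # exclude high-business-impact requests
        # A deterministic hash keeps a given request/user in the same group for
        # the whole experiment, a sticky-session-like effect.
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
        return bucket < SAMPLE_RATE * 10_000

    # Usage: if in_experiment(req_id, amount) is True, inject the fault for that
    # request; otherwise serve it normally as part of the control group.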
So, these are the five principles on which chaos testing is built and which help in your chaos testing planning and final execution. Now, I will take you to a real-life scenario and how everything gets built up there. This is one of my favorite parts of this entire presentation, which I call the chaos testing execution rules. It is a simple two-by-two matrix of knowns and unknowns. On the left-hand side, known are the things we are aware of, and unknown are the things we are not aware of. And on the bottom, knowns are the things we understand, and unknowns are the things we don't understand. So the difference is simply between things we are aware of and things we understand or don't understand. Quadrant number one is called known knowns: things you are aware of and that you understand. Quadrant number two is called known unknowns: things you are aware of but do not fully understand. Unknown knowns is quadrant number three: things you understand but are not aware of. And finally, quadrant number four is unknown unknowns, which means things you are neither aware of nor fully understand. Don't get confused; I'll take a real-world example and then try to explain all four. But as I've said, this is a very, very simple matrix, which you will see in the next slide, and the difference is just between the things we are aware of and understand, and the things we are not aware of and do not understand. Let's take a real-life chaos testing scenario. What does this scenario mean? There is a region A in the system. What we have is a primary database host with two replicas, using semi-synchronous replication. We also have a pseudo-primary and two pseudo-replicas in a different region. So the entire region A gets replicated into a region B. The primary, replica one and replica two are the real functional ones, and then there is a pseudo-primary with its own replica one and replica two. Simple thing: region A, region B, and everything in region A is duplicated into region B. Now, let's try to build the known/unknown matrix on this scenario. Setting up the knowns and unknowns, first the known knowns: when a replica shuts down, it will be removed from the cluster; a new replica will then be cloned from the primary and added back to the cluster. So, going back to the diagram, if either of these two replicas gets shut down, it is removed from the cluster, a new clone is made from the primary and added back as a new replica. That is the process, and this becomes the known knowns. What is the known unknown here? The clone will occur; we know that, as we have logs that confirm whether it succeeds or fails. So when a replica shuts down and we try to re-clone it, even if the process fails, there are logs which confirm it. This is something we know. But what we don't know is the weekly average of the mean time it takes from experiencing a failure to adding a clone back to the cluster. It may effectively take a few minutes, it may take an entire hour, or it may take an entire day. This we don't know, so this becomes the known unknowns.
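That known unknown is exactly the kind of thing a small automated experiment can turn into a number. A minimal sketch, assuming a hypothetical cluster admin API (list_replicas, terminate_replica) rather than any specific database tooling:

    import time

    def list_replicas():
        # Hypothetical: return the healthy replica IDs currently in the cluster.
        return ["replica-1", "replica-2"]

    def terminate_replica(replica_id):
        ...   # hypothetical: shut one replica down

    def measure_replica_recovery(expected_count=2, poll_seconds=30):
        """Terminate one replica and measure how long the cluster takes
        to clone a replacement and return to full strength."""
        victim = list_replicas()[0]
        started = time.time()
        terminate_replica(victim)
        while len(list_replicas()) < expected_count:
            time.sleep(poll_seconds)
        return time.time() - started   # seconds from failure to full recovery

    # Run this regularly (automated, small blast radius: one replica at a time)
    # to turn "we don't know how long cloning takes" into a tracked weekly average.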
Then let's go to the unknown knowns. What is the situation here? If we shut down both replicas of the cluster at the same time, we don't know exactly the mean time it would take to clone two new replicas from the existing primary. Just imagine replica one and replica two both get shut down; we have never tried that, so we don't know how much time it will take for the primary to be cloned twice and for both replicas to be put back into the system. But what is known is that we have a pseudo-primary and two pseudo-replicas which are also recording all the transactions happening here. It is a standby, and we know about that. So this becomes the unknown knowns. And finally, the last one: unknown unknowns. What would happen if we shut down this entire cluster, the primary, replica one and replica two? What would happen if this entire thing goes down? Will the pseudo region be able to fail over effectively? We have not run this scenario yet, so if the entire system goes down, we don't know whether the pseudo region will take over gracefully and effectively. We have never tried that. Why? Because the primary and its replicas are the production system, and we have never tried to shut it down completely. Hence, we don't even know whether failover to the pseudo region would work effectively or not.
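If the team ever earns enough confidence to probe that unknown unknown, the experiment itself can be small and scripted. A minimal sketch, again with hypothetical helpers (shutdown_region, current_primary, restore_region) and an illustrative five-minute failover budget:

    import time

    FAILOVER_BUDGET_S = 300   # illustrative: region B must take over within 5 minutes

    def shutdown_region(region):
        ...   # hypothetical: stop the primary and both replicas in the given region

    def restore_region(region):
        ...   # hypothetical: bring region A's cluster back up

    def current_primary():
        # Hypothetical: ask the service which host is currently accepting writes.
        return {"region": "A", "host": "db-primary-a"}

    def region_failover_experiment():
        """Shut down the whole region A cluster and check that the
        pseudo-primary in region B is promoted within the budget."""
        started = time.time()
        shutdown_region("A")
        try:
            while time.time() - started < FAILOVER_BUDGET_S:
                if current_primary()["region"] == "B":
                    return True        # failover worked within the budget
                time.sleep(10)
            return False               # unknown unknown answered: it did not
        finally:
            restore_region("A")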
So chaos testing is a highly disciplined approach to testing a system's integrity; this I have already talked about. And chaos testing relies on identifying errors in a system in production. With this matrix of knowns and unknowns that I have created, things that you understand and things that you are aware of, you can build your entire hypothesis and your entire plan for how to go about performing chaos testing. Okay, we have said many things about chaos testing and its principles, but the point is: does it have any benefits, apart from finding certain bugs which can be present only in production because of the live traffic coming in? Or is it more beneficial not to run those scenarios, so that the system stays up and everything which works fine on your pre-prod systems is just replicated in production? The answer is that chaos testing definitely has benefits, and it has benefits for customers, for technical people and for the business. Here are the benefits of performing chaos testing for the customers: there is increased availability and durability of the service, which means no outages disrupting their day-to-day lives. That's number one, because most of the harmful effects on the production system due to live traffic will be caught and corrected, and the system is made more and more resilient, which means the customers will face, I won't say no downtime, but definitely very rare downtime. So that is for the customer. What about the businesses? The businesses prevent big losses in revenue and maintenance cost because the systems are up all the time. There are happier and more engaged engineers, because they don't have to spend extra weekends and long hours correcting production failures. The incident management process also improves because of the chaos testing results: whatever issues chaos testing identifies are logged in the incident management system. If chaos testing is not executed on the production system, then when a real user uses that production system and finds an error, he will call the incident management team and say: hey, this software is not working fine, or: I was trying to make a payment and the payment is not going through, can you please look into it urgently? Because chaos testing is running on the production system, many of these issues will be uncovered during the testing phase, which means the incident management team already knows a lot of the issues that may come from real-world users, because it has already seen those things as the output of chaos testing. That is a big advantage. And finally, for the technical teams, insights from chaos testing mean a reduction in incidents, because a lot of real-world incidents are already caught and corrected. There is a reduction in the on-call burden, because of course happy customers call less. And there is an increased understanding of the system's failure modes. Every chaos finding, I would call it a chaos bug, won't be an easy bug, because all the easy bugs are identified pre-prod by the QAs; a chaos bug tends to be a deep, complex user-journey bug. So more bugs identified on the production system using live traffic always help in understanding the system's failure modes better, and as I said, the incident management process improves overall. These are the various benefits of performing chaos testing. There is also something called the eight fallacies of distributed systems, which I picked up from a website called Architecture Notes. What exactly are these fallacies, and why have I put this slide into a chaos testing discussion? Look at them. The first one: the network is reliable. We all use the Internet, we download things, we upload things, we make voice calls, we watch movies, and we never think about whether the network will break down; in the back of our minds there is always this notion that the network is reliable and won't break. Similarly: latency is zero; there is only one administrator of the entire Internet; the bandwidth is infinite; the network is secure; the topology never changes; the network is homogeneous; and finally, the cost of transporting a packet from one place to another is zero. These are the fallacies every user of the Internet lives with. But are any of them correct? Is there only one admin? Is the bandwidth infinite? Is the network reliable? Is it secure? We know the answer is no, this is not the case. Yet to give a user a smooth experience of the network, these fallacies need to be tested continuously, and that can be done if we do chaos testing on our systems. Many of these fallacies drive the design of chaos engineering experiments, for example packet-loss attacks or latency attacks. A network outage can cause a range of failures in applications that severely impact customers: applications may stall while they wait endlessly for a packet; applications may permanently consume resources on, let's say, a Linux system; and even after a network outage has passed, an application may fail to retry the stalled operations, or it may retry too aggressively; applications may even require a manual restart. Each of these examples needs to be tested and prepared for, and that can be done only if we are running chaos testing on our production systems.
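As a minimal sketch of such a latency attack on a Linux host, hedged because the interface name, the target URL and the 400 ms delay are assumptions, the tc netem commands need root, and this should only ever be pointed at a host you are allowed to degrade: inject delay on the network interface and verify the client times out and retries a bounded number of times instead of stalling forever.

    import subprocess
    import requests

    IFACE = "eth0"                                    # assumed interface name
    HEALTH_URL = "https://example.internal/health"    # hypothetical endpoint

    def add_latency(ms=400):
        # Linux traffic control: delay every packet leaving IFACE by `ms` milliseconds.
        subprocess.run(["tc", "qdisc", "add", "dev", IFACE, "root",
                        "netem", "delay", f"{ms}ms"], check=True)

    def remove_latency():
        subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)

    def call_with_bounded_retries(url, attempts=3, timeout_s=1.0):
        """The behavior we want under latency: time out, retry a bounded number
        of times, and surface a clear failure instead of hanging indefinitely."""
        for attempt in range(attempts):
            try:
                return requests.get(url, timeout=timeout_s)
            except requests.exceptions.RequestException:
                if attempt == attempts - 1:
                    raise

    add_latency(400)
    try:
        call_with_bounded_retries(HEALTH_URL)
    finally:
        remove_latency()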
So I know that it is not easy to convince managers, but try to convince them by explaining these things. Tell them: look, every production system has certain issues, because the production system receives its data live and we have no control over that live data; we may uncover some very valuable bugs, which makes our system more and more reliable for end users across the world. Allow us to perform chaos testing. Then plan the chaos testing in a very efficient way: keep your blast radius very small, plan for automation, build your hypothesis, all the five points I explained earlier, then prepare your own matrix, and finally run your experiments. And then show the managers, show the stakeholders, that yes, the time and money we have invested in chaos testing has actually made the system more and more reliable. With this, I come to the end of this talk. Thank you for hearing me patiently.
...

Vipin Jain

QA Head & Product Delivery Manager @ Metacube Software



