Conf42 Chaos Engineering 2023 - Online

Reliability in the Face of Uncertainty


The success of today's businesses depends on the stability and reliability of their systems. Downtime is painful: it costs you the trust of your customers and costs your business money. Providing reliable services to your customers is essential to success. Despite the best efforts, the one sure thing about undesirable events like outages is that they will always occur, and you do not get to choose the moment.

Waiting and hoping that an outage will not occur is not a strategy. Benjamin shows how to deal with today's conditions in a proactive manner to improve your reliability and keep your customers happy.


  • Benjamin is the co-founder and CEO of Steadybit. He talks about reliability in the face of uncertainty. If you would like to reach out to Benjamin after the talk, please feel free to do so via Twitter or LinkedIn.
  • What is reliable? One very important characteristic of our systems is reliability. A software system is reliable if it is consistently good in quality and performance. All of us here have a common goal that we work on every day: happy and satisfied customers who use our software.
  • What systems do we need to build today to meet our customers' expectations? As the complexity of a system increases, so does the number of incidents. Chaos engineering is necessary to improve our systems and train them to deal with failures.
  • To be honest, happy path testing is not enough. We need to test under real conditions, as early as possible. This is the true reason why chaos engineering exists: to check what's going on under normal conditions in production. We will run an experiment in a preproduction environment to test how our system behaves under such conditions.


This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Benjamin. I'm the co-founder and CEO of Steadybit, and welcome to my talk, Reliability in the Face of Uncertainty. If you would like to reach out to me after the talk, please feel free to do so via Twitter or LinkedIn. I'm happy to answer all your questions, and hopefully you will give me some feedback. So let's get started.
Let's start the talk with some important questions. What do we strive for? What goals does each of us pursue every day? Or in other words, what's the mission of software development? I assume that we are all part of the software development process. We are maybe developers, SREs, people who are running and operating the systems. So we are all part of it, and we are all on the mission. And the mission is to continuously improve and deliver a software solution that reliably delivers value to its customers. That's the important part. All of us here have a common goal that we work on every day: happy and satisfied customers who use our software. Our customers have high expectations of the software they use, and therefore of us. From the point of view of our customers, the software must always work; they are not interested in the technical complexity that this requires, or in outages.
What is reliable? One very important characteristic of our systems is reliability. So let's take a look at this characteristic and see what the definition is. A software system is reliable if it's consistently good in quality and performance. It's important to note that this characteristic must exist from the end user's point of view. We all know that our systems are in a constantly changing environment, and as a customer, I expect a result from the system that fits my request in a reasonable time.
And if the system does not give me an answer, gives me an error, or even makes me wait too long, it does not fulfill this characteristic, and I as a customer will maybe move on to another vendor. But if the system is able to answer me in a reasonable time and with a suitable result, I as a customer build up trust, and I use this system gladly and repeatedly, because I'm able to trust it. So in summary, we as customers trust a system when it's consistently good in quality and performance.
Today's systems: what are the systems we need to build today to meet our customers' expectations? Let's take a look at today's systems. I have brought you two diagrams of a complex system where it's easy to see how many dependencies exist in a microservice architecture. And from this quickly comes the question: why is it so complex? Why do we need to build it so complex? The answer is that we need to meet different requirements. We need to handle high load, the system and the services within it need to be resilient, and the load needs to be distributed. We need to be able to do rolling updates under load, and maybe during load peaks. We also need to create new instances; they must be added automatically and then cleaned up afterwards. If not, your cloud provider is getting happy, but not you. The resulting dynamics must be considered during development, and appropriate precautions must be taken. However, the two diagrams only show a part of it and obscure the underlying layers we add, like Kubernetes or cloud providers, or maybe third-party providers for, let's say, authentication or other things. That's something we have to handle and take care of. I really assume that everyone here can share one or two, or maybe more, positive and also negative experiences with today's complex systems.
So with this in mind, with this knowledge and experience we all have, I really like the quote from Dr. Richard Cook, which is very applicable: it's not surprising that your system sometimes fails; what is surprising is that it ever works at all. And yeah, we haven't solved the requirements of our customers and the associated expectations. They make it very difficult for us to meet them. And why am I claiming it's so hard? Let's take a look at the incidents we are all dealing with, maybe every day. There was a report done by the people at FireHydrant. They analyzed about 50,000 incidents. And you can see that even in small organizations there are ten incidents each month they have to deal with because something is not working normally. If the organization is growing, the complexity of the system is increasing too, and therefore the number of incidents as well. And you can see that the numbers are going up for bigger companies as well. Now let's take a look at the time we are spending on each incident. It's about 24 hours on average for each incident. So we are spending 24 hours from creating an incident, to analyzing it, to finding a fix, to deploying the fix, and to closing the incident. And that's just the average time we are spending. It does not mean that our customer is not able to work with our product for 24 hours. No, it's the time we are spending from opening to closing the incident.
There's a famous sentence by Werner Vogels: everything fails all the time. That's something we have all realized in our daily life. So we need to ask ourselves what's normal. The conditions I described earlier give the impression that everything is chaotic and unpredictable, and we must ask ourselves: is this really normal? And failures are normal. Failures are normal and are part of our daily life. But errors or failures are much more.
They contain valuable information from which we can derive knowledge and bring improvements to our systems. We can improve our systems based on failures. But I would go one step further and interpret failures as the attempt of our systems to get in touch with us. They would like to tell us something. Our systems want to point us to problems where we should optimize, where we should fix something. So it's like a communication channel from our systems, and the system is calling for help: please support me, something here is not working, that's something we can do better. And yeah, we are working under chaotic conditions. Why chaotic? Because we are not able to know when the next failure will occur. Under such chaotic conditions, chaos engineering is necessary to improve our systems and train them to deal with failures. And failures are the foundation. With failures, we know our starting point, and they help us to develop the appropriate test cases. So we can take a failure and transform it into a test case, into an experiment.
What can we do exactly? We need to proactively improve the reliability of our system. We need to ask specific questions before we are in production, and we need to know the answers before we go into production. Here are three examples. The first question is maybe: can our users continue shopping while we are doing a rolling update? That's something you should figure out before you're in production: whether you are able to do a rolling update under load or not. The second example: will our service start when its dependencies are unreachable? That's something we can check, and we can validate whether the service is restarting or not. Or do we need to start things in a different order? If so, that's not a good place to be, not a good choice. And the last example: can our users buy products in the event of a cloud zone outage?
Maybe you are running on a specific cloud provider, and sometimes there are zone outages. Are you able to shift the load to a different zone? Is that done automatically? Have you checked it? Have you tested it? So take care.
Now, let's take an example. Let's take a look at this showcase. It's a Kubernetes-based example with a microservice architecture inside, where multiple services are connected to each other. Now let's identify our key services. We need to know our key players, the services we have to deal with and take care of. You can see there are some lines in between, some connections. So let's check on a deeper level. You can see the entry service is a gateway. The gateway is connected to four internal services. It is, of course, one of our key services, because it's the entry point. In front of the gateway there's a load balancer, and the load balancer is distributing the load across all the instances of the gateway, in this example just two. But what's going on inside after the gateway was called? There are hot deals, fashion and toys, and they are all connected to one specific service, the inventory service. So that's a key service, because if the inventory is not working, or maybe responding slowly, there's an impact, maybe a high impact, on hot deals, toys and bestseller. And that's something we need to know. It's not something where we say, let's hope everything is fine. No, hope is not a strategy. So we have to check it up front.
How can we do this proactively? We need to test under real conditions, as early as possible. And to be honest, happy path testing is not enough. We need to improve the confidence in our systems. So we need to test under real conditions, like the ones we see in production, because production is not a happy place. Not everything is fine in production. And this is where experimentation comes in.
And I believe this is the true reason why chaos engineering exists: to check what's going on under normal conditions in production. Let's talk about the real conditions. Let's go one step deeper and use a very technical example, based on the already known showcase from some slides ago. We know that the inventory service is one of our key services, and therefore we need to know in advance what effects a non-normal behavior of the inventory service will have. We know from our monitoring, and also from our load tests, that the average response time of the inventory service is about 25 milliseconds. Under these conditions, we have a reliable behavior in the whole system. That's what we already know from production and can see in our monitoring solution. But from our monitoring we have also learned that there were some spikes in the response time of the inventory service. This data tells us that we had some response times of up to 500 milliseconds, and that's knowledge, that's data we can use. So let's use this knowledge and create an experiment to simulate the high response time of the inventory service by injecting latency of up to 500 milliseconds. What we don't know is what impact the high response times have and whether the services that depend on the inventory service can handle them. That is exactly what we are now testing. We will run an experiment in a preproduction environment and proactively test how our system behaves under such conditions.
Let's take a look at the experiment we are running. It's a complex failure scenario; we can recreate such scenarios with the right tools and generate knowledge about how our system reacts. You can see there's a gray element; it's just a wait step. Then the blue line is a load test, so we can, and should, reuse our existing load tests.
But now they run in combination with some bad behavior that is injected during the execution, and that's no longer happy path testing; it's testing under real conditions. The yellow ones are verification steps: for example, check whether every instance is up and running, or whether a specific endpoint is responding. And there's a special one. We need to define the expected behavior of the system. If it's not given, the experiment fails, because the reliability is not given, so the system is not working as it should. That's something we can test. We can write a test case for it, and the experiment will fail if this check is not true.
Let's take a look at the experiment run. The experiment was successfully executed, and our system proved that it can handle short-term increased response times from the inventory service. If we now encounter these conditions in production, we know that our system can handle them, so we are now safer. We have tested a specific, complex scenario from the past in a preproduction environment, and the system is able to survive, which is good.
Next example. Now, the real condition is something we also found in the past data. In the failure data of the past, another failure appears in our monitoring solution, and it hit our system quite hard. Now we need to transform this failure into an experiment to validate whether we can handle the situation or not. This time the error was in the hot deals service. The hot deals service was not responding in a specific time, so there was a delay. This delay was limited to two zones of our cloud provider and was followed by a DNS outage in our Kubernetes cluster. That was the failure scenario. Now let's recreate this failure scenario, let's rebuild it, and we will inject the failure events that occurred. And we will also check whether the expected behavior is provided or not. Again, the yellow one is a verification check.
A specific endpoint is called, and it needs to be reachable and needs to respond with an HTTP 200 status code. If not, there's a success rate we can tweak, and we would like to get 100%. Again, the gray one is a wait step. Then in the second line there is a green one, where we inject latency in two specific zones for the hot deals service only, followed in the third line by a DNS outage for our Kubernetes cluster. So no DNS communication inside of this Kubernetes cluster was possible. That is what the data from the monitoring was telling us. Now let's execute. And you can see a lot of red elements, so something went wrong, and you can see in the top corner that the experiment failed. So it's clear that our system is once again unable to handle this failure; the behavior we expect is not given, and our customers are not able to use our service. Now it's up to us to adapt our service to deal with the failure; we need to mitigate the impact. As we said in the beginning, failures are normal and will always occur, but we have it in our hands to control the effects. Failures happen often, and our system can learn to deal with them.
When should we run these experiments? As an integral part of our software development process, is the answer. And this question, when should we run these tests, is not new; it used to arise with unit tests and integration tests as well. Over time, we and the industry have started to run these tests before we check in code or, in the case of integration tests, before we merge or cut a new version of our system. So I look at an experiment as, let's say, a real end-to-end integration test based on production conditions that we want to be resilient to, that we would like to be able to handle or mitigate. With this approach of creating experiments and running them in a preproduction environment, we test the reliability of our systems, and this results in a list of experiments.
So we are continuously creating new experiments that we can integrate into our deployment process and run continuously after a new deployment to check how big the risk is that we are taking. Is the system still able to survive this past incident or not? Is it able to survive a specific scenario? To make it an integral part of your development process, you can trigger the experiment execution via API or via something like a GitHub Action that automatically starts one or more experiments after a new deployment to a preproduction stage. After the execution of, let's say, ten experiments, you will get a result back. Maybe five of them were successful and five of them failed, and now you can handle the risk before you go into production.
Let's recap. Let's summarize once again. Failures are normal and they will always occur. It's how we handle them and how we build and operate our systems that makes the difference. Just sitting there and waiting until the next error or failure occurs in production is not a strategy. We need to proactively deal with possible failures and test what impact they have on our system and, of course, on our customers. So embrace failures and turn them into experiments to understand how your system reacts under such conditions. As I said, chaos engineering is necessary. And with our tool Steadybit, there's a very easy way to get started with chaos engineering to improve your system. If you have any questions, please reach out to me. And thanks a lot for having me, and enjoy the conference.
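
Editor's sketch: the idea from the talk of running a list of experiments after a preproduction deployment and using the results to judge the risk could look roughly like the Python below. This is a minimal illustration under stated assumptions, not Steadybit's API. The names `Experiment`, `ExperimentResult`, `make_latency_probe`, `unreachable_probe` and `gate` are hypothetical. The probes are simulated instead of calling real HTTP endpoints, and the 1000 ms caller timeout is an assumed value that does not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ExperimentResult:
    name: str
    success_rate: float  # fraction of checks that passed, 0.0 to 1.0
    passed: bool


@dataclass
class Experiment:
    """Hypothetical experiment: repeat a check while a failure is injected."""
    name: str
    probe: Callable[[], bool]           # True if the endpoint answered as expected
    attempts: int = 10                  # how often the check is repeated
    required_success_rate: float = 1.0  # the talk asks for 100%

    def run(self) -> ExperimentResult:
        successes = sum(1 for _ in range(self.attempts) if self.probe())
        rate = successes / self.attempts
        return ExperimentResult(self.name, rate, rate >= self.required_success_rate)


def make_latency_probe(latency_ms: float, timeout_ms: float) -> Callable[[], bool]:
    """Simulated check: succeeds if the dependency answers within the caller timeout."""
    return lambda: latency_ms <= timeout_ms


def unreachable_probe() -> bool:
    """Simulated check for a total outage (for example DNS): the call never succeeds."""
    return False


def gate(experiments: List[Experiment]) -> Tuple[bool, List[ExperimentResult]]:
    """Run all experiments; the gate is open only if every experiment passed."""
    results = [experiment.run() for experiment in experiments]
    return all(result.passed for result in results), results


# Assumed caller timeout of 1000 ms; latencies of 25 ms and 500 ms are from the talk.
experiments = [
    Experiment("inventory baseline (25 ms)", make_latency_probe(25, 1000)),
    Experiment("inventory latency spike (500 ms)", make_latency_probe(500, 1000)),
    Experiment("cluster DNS outage", unreachable_probe),
]

ok, results = gate(experiments)
for result in results:
    status = "passed" if result.passed else "FAILED"
    print(f"{result.name}: {status} (success rate {result.success_rate:.0%})")
print("safe to promote" if ok else "risk found before production")
```

In a real pipeline, each probe would be replaced by an actual call against the preproduction system while the failure is being injected, and the boolean returned by `gate` would decide whether the deployment moves on.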

Benjamin Wilms

Co-Founder & CEO @ Steadybit

