Transcript
Welcome everyone.
I'm excited to be here today to talk about how to build reliable systems
under unpredictable conditions, a challenge that every engineering team faces.
We will explore smart approaches in chaos engineering, observability and
incident management and how they can work together to create resilient systems
that can withstand real world failures.
Throughout this session, we will discuss practical strategies and tools,
including SteadyBit, that help teams proactively uncover weaknesses,
strengthen reliability, and ensure both systems and organizations
are prepared for unexpected disruptions.
Let's dive in.
My name is Benjamin.
I'm one of the founders of SteadyBit, a chaos engineering platform.
I started with chaos engineering around nine years ago,
and with SteadyBit, we have been in that space for five years.
Setting the scene, let's dive in.
Let's start with an important question.
What do we strive for?
What goal does each of us pursue every day in our work?
In other words, what is the mission of software development?
When we write code, deploy applications, or optimize systems, we are all
working towards something bigger.
But what is that overarching purpose?
Is it about delivering features, or is there something deeper?
The mission of software development, at its core, is to continuously improve
and deliver a software solution that provides real value to its users.
Software isn't just about writing code or deploying applications.
It's a living, evolving entity.
The best software is never done.
It adapts, improves, and meets users' needs over time.
But achieving this mission isn't simple.
There are challenges: complexity, changing requirements, reliability concerns.
That's why we need to focus not just on building software,
but on building resilient, maintainable, and user-centric software.
Now, why does this continuous improvement matter?
Because customers trust a system when it's consistently good
in both quality and performance.
It's not enough for software to work well from time to time.
Users expect reliability and stability every single time they interact with it.
If a system is fast today but slow tomorrow, or if it works well in some
cases but fails in others, trust erodes.
And once that trust is lost, it's incredibly hard to regain.
So as engineers, our job isn't just to build features;
it's to ensure that those features work predictably
and reliably under real-world conditions.
Why is it so hard?
Now, let's talk about one of the biggest challenges we
face in achieving reliability.
Complexity.
The complexity of today's systems is massive.
And it's only growing.
Think about how modern applications are built.
We are no longer dealing with monolithic, self-contained systems.
Instead, we have distributed architectures, microservices, cloud
platforms, third party dependencies, and constantly evolving infrastructures.
This complexity makes our job even harder.
It's no longer just about writing code.
It's about understanding how all the moving parts interact, fail, and,
fingers crossed, hopefully recover.
But wait, when we talk about a system, are we all talking about
the same definition of a system?
Let's have a look at the definition of a system from my point of view.
Of course, there is software, something we have created,
but it needs to be deployed on some infrastructure and hardware.
People are needed to build, maintain, and operate such software and infrastructure.
There is a pipeline, a build process, a way to actually
create and deliver the applications and software.
All of that happens inside an organization,
which means there are also processes and many more components.
So it's quite complex and a lot of stuff is going on in such a system.
And now we also must continuously improve our systems, not just to add new features,
but to make them more fail safe and capable of handling failures gracefully.
And to be honest, that's freaking hard.
And one thing must be clear.
We will never have a 100 percent error free system.
No matter how much testing we do, no matter how many best practices
we follow, failures will happen.
Servers will crash.
Networks will lag.
Dependencies will break.
So instead of chasing this impossible dream of a failure-proof system,
we should focus on building resilience
by using the power of chaos engineering.
You all know the definition of chaos engineering, so nothing new in here.
Let's continue.
But as mentioned earlier, the definition of a system is much
more than just technology.
The term socio technical system expresses it very well.
It's not just technology.
It's a combination of people, processes, and tools working together
to produce a specific outcome.
This means failures aren't just software bugs.
They can come from misconfigurations, human errors, unexpected
interactions between services or even misaligned team processes.
Chaos engineering helps us proactively test, learn, and improve.
We are not just reacting to failures when they happen,
but anticipating them and strengthening our systems in advance.
We are getting into something more proactive.
Let's now take a look at such a socio technical system and
how it evolves over time.
This simple website shows the result of a collaboration between many
different people and technologies.
I would like to address the question of what and who is needed to build
and operate this online shop.
Starting on the right-hand side, we now see the individual services
that are needed to operate the shop shown on the left-hand side.
Let's group the elements by, let's say, subject area to get a better overview.
It's not getting much better; it's still quite complex, with many moving parts.
These components are provided by more than one team or one person,
and they require coordinated interaction between the individual teams
for the overarching common goal: to provide the customer with a functioning
system at all times, consistently good in quality and performance.
But it gets even more complex.
Within the teams, people work together to build and operate this online shop.
There's a lot of interaction between the teams and each individual
person is interacting as well.
So it is difficult to keep everyone in the loop and to keep this
constantly changing system stable.
Now that we are all clear on the preconditions and the whole system,
let's get into the subject in detail.
I want to increase systemic resilience, and there are three phases
we have to go through to ensure that our system constantly delivers
good quality and performance.
Let's go through the phases one by one and discuss how we proceed in them.
The first one: uncover risk.
Let's talk about how we identify and address system vulnerabilities.
One of the key challenges in modern distributed systems, especially
in Kubernetes environments, is understanding where the risk is.
Dependencies, misconfigurations, or weak points in redundancy
can all lead to failures.
This is where SteadyBit's approach comes in.
By analyzing system configuration, we can uncover hidden risks and
provide precise recommendations to enhance system robustness.
What you see here is a risk analysis, highlighting potential weak spots.
Red areas indicate critical issues that could impact availability, while
green areas show resilient components.
With this kind of analysis and insight, we are not just reacting to failures;
we are getting ahead of them, turning it into a proactive approach.
Phase number two, understand impact.
Once we have identified hidden risks, the next step is to verify
our resilience strategy to ensure our system can truly withstand disruptions.
This is where chaos engineering experiments come in. We don't just assume
that our redundancy measures work;
we test them under real-world conditions before failures happen.
Here in the picture, you can see an experiment focused
on validating Kubernetes pod redundancy across AWS zones.
The system, or SteadyBit, flags a potential risk:
pods are distributed across zones, but we still need to verify whether
an outage in one zone impacts our performance and quality.
The best part: these experiments don't disrupt production.
They are controlled, observable, and measurable early in the development cycle,
helping teams build confidence in their infrastructure before
they deploy into production.
Want to see a real demo, a real example?
Let's jump into SteadyBit.
On the right side, you can again see this bubble with all
the moving parts of the system.
I would like to do a grouping again: I'll group the elements by the
Kubernetes deployment name, just to get a better picture of the system.
And now I want to identify a potential risk in my system.
I can do that with some colors: I can ask SteadyBit to color all the
elements on the right side by the zone they are running in.
SteadyBit now tells me, okay, you're running in two zones.
And with just one look, I can identify that the toys bestseller service is
distributed across two zones, and the checkout service as well.
But here in the center, our hot-deals service isn't distributed
across all the zones we are running in.
This is already a risk, and I was able to identify it without running
any experiments, just based on the data.
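To make that idea concrete, here is a minimal, hypothetical sketch (not how SteadyBit does it internally) of checking zone distribution yourself with the Kubernetes Python client. It assumes the standard topology.kubernetes.io/zone node label; the "default" namespace and the "app" pod label are placeholders for your own setup.

```python
# Hypothetical sketch: map each workload to the zones its pods run in,
# and flag anything that is pinned to a single zone.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Node name -> zone, taken from the standard topology label.
node_zone = {
    n.metadata.name: (n.metadata.labels or {}).get("topology.kubernetes.io/zone", "unknown")
    for n in core.list_node().items
}

zones_per_app = defaultdict(set)
for pod in core.list_namespaced_pod("default").items:  # namespace is a placeholder
    app = (pod.metadata.labels or {}).get("app", pod.metadata.name)
    if pod.spec.node_name:
        zones_per_app[app].add(node_zone.get(pod.spec.node_name, "unknown"))

for app, zones in sorted(zones_per_app.items()):
    status = "ok" if len(zones) > 1 else "RISK: single zone"
    print(f"{app}: {sorted(zones)} -> {status}")
```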
Taking one step back, grouped by all the deployments, I'm now using SteadyBit
and the Advice feature to help me filter my moving targets on the right side.
SteadyBit is now able to tell me where a potential risk already exists.
So, getting into the checkout service again, SteadyBit tells me,
okay, here are some orange ones.
So what's going on?
Let's get into this example, the pod deployment.
The advice from SteadyBit was: hey, you should run this with more instances,
and here is how to fix it: take this snippet and reconfigure your elements.
This is already something where SteadyBit can help
me identify an issue and improve my system.
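The exact snippet from the demo isn't reproduced here, but such a fix typically boils down to raising the replica count of the deployment. A hedged sketch using the Kubernetes Python client, where the deployment name "checkout" and the namespace "shop" are illustrative placeholders:

```python
# Hypothetical sketch: scale a single-instance deployment up so it can
# tolerate losing one pod. Deployment name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

apps.patch_namespaced_deployment(
    name="checkout",
    namespace="shop",
    body={"spec": {"replicas": 3}},  # strategic merge patch: run three replicas
)
print("checkout scaled to 3 replicas")
```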
And the last element is to verify whether this actually fixed the situation.
So there is an experiment created based on the data SteadyBit has discovered.
And that's just one example.
So let's zoom in.
What SteadyBit is doing here, first of all, is a pre-check:
SteadyBit will check if all the pods are ready.
If not, don't execute this experiment,
because the system isn't in a healthy state.
Next, we will simulate a pod failure of the checkout service
by deleting a specific pod.
Our expectation is, first of all, that we want to see this pod failing,
but also that it returns within a couple of seconds, a minute at most.
So our expectation is: even with one pod failing, the system under test is
still working and recovers.
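For illustration, here is a rough, standalone sketch of the same experiment flow, pre-check, delete a pod, verify recovery within 60 seconds, scripted against the Kubernetes Python client. SteadyBit runs this through its platform; the namespace, deployment name, and "app" label selector below are placeholders.

```python
# Hypothetical sketch of the experiment flow; names and labels are placeholders.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

NAMESPACE, DEPLOYMENT, TIMEOUT_S = "shop", "checkout", 60

def readiness():
    dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    return (dep.status.ready_replicas or 0), dep.spec.replicas

# 1. Pre-check: only attack a healthy system.
ready, desired = readiness()
assert ready == desired, "system not healthy, aborting experiment"

# 2. Attack: delete one pod of the deployment (assumes an app=<deployment> label).
victim = core.list_namespaced_pod(NAMESPACE, label_selector=f"app={DEPLOYMENT}").items[0]
core.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
time.sleep(5)  # give the controller a moment to notice the missing pod

# 3. Hypothesis: all replicas are ready again within 60 seconds.
deadline = time.time() + TIMEOUT_S
while time.time() < deadline:
    ready, desired = readiness()
    if ready == desired:
        print("experiment passed: pod came back in time")
        break
    time.sleep(2)
else:
    raise SystemExit("experiment failed: pod did not recover within 60 seconds")
```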
Getting back to the slides, because now we can execute this experiment.
It's the same experiment, but this time with the fashion bestseller service.
The pre-check was done, everything is fine.
Now we are injecting the delete-pod attack, and then we want to see
the pod coming back within 60 seconds.
If not, the experiment will fail, and we have already identified a risk in the system.
It's now being executed in fast forward to save some time.
SteadyBit is also collecting some internals about Kubernetes,
so Kubernetes events are coming in, I'm getting some information about the
deployment readiness, and everything ends up in one central report.
Four, three seconds left.
Here we go.
The experiment was successful in the end.
So the first job is done, and we are quite okay.
Now let's get into the second part.
We have established the preconditions: understanding the system's complexity,
uncovering vulnerabilities and risks, and verifying resilience more proactively.
It's time to dive deeper into this subject.
So let's shift our focus from why we should do it to how we can do it:
how we can apply these principles in practice, run meaningful
experiments, and truly enhance system resilience in our organization.
And that's on another level; it's not only about the technical side of things.
Now we want to increase the systemic resilience of our
organization, which is the last element, the last building block.
We have already done the hard work of improving our system's resilience,
identifying weaknesses, validating redundancies, and running experiments
to strengthen our technical setup.
But there is still one crucial question left.
How will our teams respond when things go wrong?
Even the most robust systems can't prevent all failures.
What really determines success is how well the organization, our people and
processes can react under pressure.
This is where socio technical resilience comes in.
It's not just about the technology, it's about how different teams
interact, coordinate and make decisions when things break.
The unpleasant feeling of uncertainty, of not knowing how well we
will handle an incident, is real.
And the best way to eliminate that uncertainty?
We test it.
We simulate incidents, we practice responses, and we ensure that
our organization is just as resilient as our technology.
Now you can see how both worlds are coming together.
What you see here are two complex systems, one on the left-hand side
and one on the right-hand side.
But what's important to understand is that both are essential to running and
operating a truly resilient system.
On the left, we have the technical system: the architecture, infrastructure,
and software that power our application.
On the right, we have the organizational system:
the teams, communication channels, and processes that ensure smooth operations
when incidents occur and that help us keep our systems up and running.
So now, let's jump into another demo and see how a tool like SteadyBit can help you
harness the power of chaos engineering to measure and improve the reliability
of both systems and organizations.
Let's assume there was an incident a while ago, and with all
the data from the incident, we were able to identify a specific scenario.
The scenario I'm talking about is a latency spike in one
of the backend services, called hot-deals, followed by a total outage
of the central gateway component.
So this was really a cascading error: something went wrong in the
backend, but it also went wrong in earlier elements
of the system, in the gateway component.
My expectation, or in other words my hypothesis, is that our
monitoring recognizes the scenario within 90 seconds and reports it to the relevant
on-call team, we are using PagerDuty, and that the incident is opened
and acknowledged within three minutes.
That's the first element.
That is something I need to see from my system, but also from my organization.
But then, getting more to the technical side of things:
the technical system normalizes within three minutes, our on-call team
determines that the resilient system has recovered and is working normally,
and so the incident is resolved within four minutes.
So you can now really see both worlds: the technical system, where we have
implemented resilience patterns so that our system can recover from failures,
but also, on the other side, our organization, our teams, our people,
who are able to understand what's going on and how they should
react under those conditions.
And to truly strengthen our organization's resilience, we need to focus on the
interaction, not just within teams, but also between the tools and platforms
that keep everything running smoothly.
Here, we see three key players, PagerDuty, Datadog, and SteadyBit.
PagerDuty helps us orchestrate incident response, ensuring the right
people are alerted at the right time.
Datadog provides observability, giving us real-time insights into
system performance and failure patterns.
And finally, SteadyBit brings in chaos engineering, allowing us to
proactively test, validate, and strengthen both our technical and
organizational response strategies.
And by integrating these tools, we create a holistic approach to resilience
where we don't just detect failures, but also understand them, react to them,
and ultimately prevent them from impacting users.
And yeah, resilience isn't about a single tool or strategy.
It's about making sure everything and everyone works together seamlessly
to handle whatever comes our way.
All right.
Getting into the details and the timeline: everything will start with a latency
spike in our hot-deals service, followed by an outage of our gateway component.
We want to see this event, or rather these failures, in our monitoring
tool, Datadog, within 90 seconds.
Datadog will tell PagerDuty that something is wrong and will trigger an incident;
PagerDuty knows who's on call and will notify those people.
We want to see those people on call jumping in, getting first insights,
and acknowledging the incident.
Then, getting back to the technical side of things and our resilience
and reliability patterns: the system is able to recover.
It's turning back into an okay status, which is recognized by Datadog.
This also means that our team is able to see this improvement.
So the system is back to normal, and the incident needs to be resolved as well.
A lot of stuff is going on.
Before we start the experiment, let me get into something very important:
I don't like finger pointing.
It's not about testing how fast an individual person fixes an outage.
It's really about how good the organization is at detecting and fixing
faults and how well processes are coordinated.
So now it's time to zoom in and take a look at the experiment we will run.
First, SteadyBit is integrated with Datadog, and as a pre-check
we will verify that the system is in a healthy state.
If not, don't execute this experiment.
Next, SteadyBit will inject latency into all hot-deals services, followed by a
total outage of the gateway component.
Then our expectation in the hypothesis is: after 90 seconds,
there needs to be an alert in our monitoring tool, Datadog.
So we want to see this alert coming in,
and this alert needs to trigger an incident.
We are integrated with PagerDuty from SteadyBit's point of view,
so we can check whether an incident was triggered.
If so, everything is okay, everything as designed.
Now, our system is able to recover; it's getting back into an okay status.
So far, so good.
But we are also checking again in PagerDuty whether someone on call
is jumping into that incident and doing some research
to get insights into what's going on.
If the system is stable, back in an okay status,
the incident should be resolved.
And that is again something we will check automatically in PagerDuty.
And that's the experiment we will execute now.
You can really see how many interactions are needed to verify that this process
is designed right and working as intended.
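As an illustration of what these checks boil down to, here is a hedged sketch that polls the public Datadog and PagerDuty REST APIs directly with the timing budgets from the hypothesis. SteadyBit provides these as built-in integrations; the monitor ID, service ID, and API keys below are placeholders.

```python
# Hypothetical sketch: poll Datadog and PagerDuty with the timing budgets
# from the hypothesis. Monitor ID, service ID, and keys are placeholders.
import time
import requests

DD_HEADERS = {"DD-API-KEY": "<datadog-api-key>", "DD-APPLICATION-KEY": "<datadog-app-key>"}
PD_HEADERS = {
    "Authorization": "Token token=<pagerduty-api-token>",
    "Accept": "application/vnd.pagerduty+json;version=2",
}
MONITOR_ID, SERVICE_ID = 123456, "PXXXXXX"  # placeholders

def datadog_monitor_state():
    r = requests.get(f"https://api.datadoghq.com/api/v1/monitor/{MONITOR_ID}", headers=DD_HEADERS)
    return r.json()["overall_state"]  # e.g. "OK", "Warn", "Alert"

def pagerduty_statuses():
    r = requests.get("https://api.pagerduty.com/incidents",
                     headers=PD_HEADERS, params={"service_ids[]": SERVICE_ID})
    return [i["status"] for i in r.json()["incidents"]]  # triggered / acknowledged / resolved

def wait_for(condition, timeout_s, step):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if condition():
            print(f"{step}: ok")
            return
        time.sleep(5)
    raise SystemExit(f"{step}: hypothesis not met within {timeout_s}s")

# The four checks from the hypothesis, in order:
wait_for(lambda: datadog_monitor_state() == "Alert", 90, "Datadog raises an alert")
wait_for(lambda: "acknowledged" in pagerduty_statuses(), 180, "on-call acknowledges the incident")
wait_for(lambda: datadog_monitor_state() == "OK", 180, "system recovers")
wait_for(lambda: "resolved" in pagerduty_statuses(), 240, "incident is resolved")
```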
So here we go.
Let's take a look at a recording of this experiment, because it normally
takes about five minutes.
The pre-checks are executed.
Datadog has been integrated into SteadyBit, so we can ask the system
if everything is fine.
At the bottom of the screen, you can see a little green bar coming up;
this tells us that the status is okay.
And we are checking it for a couple of seconds, so it's not just a one-shot check;
we want to see that everything is okay over a specific period of time.
Now SteadyBit is injecting the latency, followed by a total
outage of the gateway component.
The expectation now is that after 90 seconds, and we are in fast-forward mode here,
we want to see an alert inside of Datadog.
So we are calling Datadog again: hey, the status needs to be alert.
And you can see a red bar coming up at the bottom.
We are also now checking PagerDuty for a triggered incident,
and we found an incident related to that outage.
Now the outage is gone and the system should recover.
We want to see, after some seconds, that first of all someone
is jumping into the incident, so the incident gets acknowledged.
And we also want to see that Datadog recognizes that the system has
recovered and that the status of the system is turning back into an okay status.
Monitoring is okay.
Scrolling down, you can see the bar.
And now the last element: the system is back to normal.
The incident needs to be resolved, maybe by Datadog automatically inside
of PagerDuty, or by someone on call.
And now our scenario is successful, because our system was able to handle it and
the people on call were able to react as needed, so they are well trained.
Let's quickly recap what we have covered today.
We started by defining the mission of software development:
delivering continuous value while ensuring reliability.
It's hard, and it's not just about features.
We saw that complexity is massive, failures are inevitable,
and trust is built on consistently good quality and performance.
Recovering from a bad outage is quite hard,
and getting customers' trust back is very hard.
To tackle this, we explored chaos engineering as a way to uncover
weaknesses in both technical and organizational systems, called socio-technical
systems, helping us verify resilience before failures happen.
And we did all of this not in production, but in an early-stage environment.
And finally, we saw how a tool like SteadyBit empowers teams to test,
validate, and strengthen both their infrastructure and their response
strategy at the organizational level.
Because in the end, resilience is not just about preventing failures;
it's also about preparing for them, because they will happen.
So thank you again for listening.
I hope you enjoyed the session.
If you have any questions, please reach out to me.
You can find me online on LinkedIn and at SteadyBit.
And again, thank you very much for your time.
Bye.