Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

Engineering Failure-Resilient Systems: Proactive Strategies for Distributed Network Reliability


Abstract

When systems fail, millions lose access, companies lose millions, and engineers lose sleep. This talk covers strategies for building self-healing systems that survive failures. Learn practical chaos engineering techniques, recovery automation, and architecture that keeps systems running when, not if, components fail.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good morning everyone. My name is Mark Irozuru. I'm the lead DevOps engineer at Botanix Labs. I want to start with a confession: I love failures. Not because I enjoy watching systems crash and burn, but because each failure teaches us something profound about our systems that success never could. That's just a personal opinion, but think about it: when was the last time we learned a deep engineering lesson from a system that was working perfectly? That's right. It's the failures that shape us. It's the failures that force us to grow, that reveal the hidden weaknesses in our most carefully crafted, most stable systems. It's the failures that teach us the flaws in our designs. It's not the success stories, it's the failures, and the fixing of those failures, that force us to grow.

I want to thank you for joining me in my talk today on engineering failure-resilient systems and proactive strategies for distributed network reliability. I've spent the last couple of years in the trenches building, and sometimes breaking, distributed systems, but that's just for the fun of it. Today we're going to cover why failure is inevitable in distributed systems, common failure patterns, proactive resilience strategies, and building resilient teams and culture. It's a meaningful presentation; I'll try not to be too technical and will stay at a high level.

It's 3 AM. The systems are down, services are halted, alerts are blowing up. What now? Your boss is on your neck, the whole team is gathering, the dashboard is all painted red, services are down. What do you do? Microservices can't connect, there's no external access. We've all had that point in our careers where we've had to wake up to those annoying alerts. It's unavoidable. It's really unavoidable. It's what makes us engineers, essentially.

Failure is not the exception, it's the rule. "Everything fails, all the time," as Werner Vogels, the Amazon CTO, said. In distributed systems, failure isn't just possible, it's inevitable. The question isn't if your systems will fail, but when, how, and at what cost. What would be the cost of your system failure? What is in that service level agreement? What are those indicators? What is the cost of one hour without access to the system? The key point here is that in systems with thousands of components, something is always failing. One service is receiving more traffic than the rest; something is always happening, essentially.

I'd ask this question: how many of us here have experienced a failure in production? You can see all hands are up. That just tells us how common it is and what it feels like to have a system failure. We've all experienced that outage in production. We've all experienced that quick fix, that deployment gone wrong, that node running hot. Every hand here represents a lesson learned about resilience. As the Spider-Man meme indicates, we're all pointing fingers, but we shouldn't point fingers at each other. These systems are bound to fail.
It's all about engineering the system to be failure resilient and knowing what to do next, what the next line of action is, essentially. Then we look at the cost of downtime. What is the real cost of downtime? As asked earlier: when Amazon S3 experienced its four-hour outage in 2017, it cost companies over $150 million. These failures create ripples across industries reliant on cloud services. Think about the customers and how they felt during this outage. This shows how interconnected, even hyperconnected, we are, and how reliant we are on certain services. And it asks the question: what is the real cost of downtime in your system, in your service?

Beyond traditional reliability engineering, which focuses on maximizing uptime through redundancy and fault tolerance: these approaches are necessary, but are they sufficient for today's complex distributed environments? We ask ourselves about the limits of uptime metrics, about resilience over redundancy, and about preparing for the unknown. Metrics like uptime and availability are no longer sufficient to ensure system reliability. Modern infrastructure must plan for failure proactively. These systems must be designed and architected for resilience; they must be architected to plan for failure within them, essentially, because failure is not the exception, it's the rule.

Here we talk about the five pillars of resilience engineering, as outlined here: antifragile architecture, chaos engineering practices, circuit breaker design patterns, dynamic resource allocation, and of course monitoring and observability, because you cannot fix what you do not know about; you cannot fix what you can't monitor, essentially.

So we look at antifragility. What does antifragility look like? Unlike resilient systems that merely resist failure, antifragile systems grow stronger through disruption. They use adversity to evolve and adapt. This means incorporating chaos input, real-time feedback, and diversification to create systems that optimize under stress. The key phrase here: optimize under stress.

Then we look at chaos engineering practices. The case study here is Netflix's Chaos Monkey tool, which randomly terminates production instances. This wasn't madness, it was survival. Look at it this way: your systems are going to fail either way. There's always going to be that increase in traffic or that increase in system load. So why not recreate it yourself? Why not plan for it yourself, where you have total visibility into the system, where you can monitor what is going on and see what failure looks like, where you are in charge of the failure, where you are orchestrating the failure? That's what Chaos Monkey is about, and I'd advise you to look into it, because that's learning through deliberate failure. The key points: it terminates random instances in production, it ensures systems can handle a failed component, and it grew into an entire discipline of chaos engineering that companies all around the world now incorporate into their services. Because you have to learn through deliberate failure. You have to plan for the failure.
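As a rough illustration of the idea behind Chaos Monkey, not Netflix's actual implementation, here is a minimal Python sketch of a chaos experiment that randomly terminates one instance from a hypothetical fleet, and only during working hours so the team is around to observe. The fleet list and the `terminate_instance` helper are placeholders standing in for your own platform's API.

```python
import datetime
import random

# Hypothetical fleet; in practice you would query your cloud provider
# or orchestrator for the list of running instances.
FLEET = ["api-1", "api-2", "api-3", "worker-1", "worker-2"]


def terminate_instance(instance_id: str) -> None:
    """Placeholder for a real termination call (e.g. your cloud SDK)."""
    print(f"[chaos] terminating {instance_id}")


def chaos_round(probability: float = 0.25) -> None:
    """Randomly kill at most one instance, only during weekday working
    hours so engineers are available to observe and respond."""
    now = datetime.datetime.now()
    if not (9 <= now.hour < 17 and now.weekday() < 5):
        return  # only run chaos experiments when the team is watching
    if random.random() < probability:
        victim = random.choice(FLEET)
        terminate_instance(victim)


if __name__ == "__main__":
    chaos_round()
```

The point is the discipline, not the script: you decide when and where failure is injected, and you watch how the system responds.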
You need to take control of what failure looks like. Then we look at circuit breaker design patterns (a minimal code sketch follows this part of the transcript). A circuit breaker is a protective safety mechanism that prevents the application from continuously making requests to a service that has problems or is down. This is basically isolation as a strategy. Think about a distributed system of ten nodes where two nodes are down and you have a circuit breaker in front of the system. When you implement the circuit breaker design pattern, traffic is never directed to the faulty nodes, because the circuit breaker is aware that those nodes are faulty, and traffic is directed to the healthy nodes instead. Circuit breakers prevent cascading failures by failing fast when downstream services degrade.

Then we look at dynamic resource allocation. Here we are concerned with self-healing systems that automatically respond to changing conditions. We look at Kubernetes horizontal pod autoscaling as a case study, which scales based on custom metrics; predictive scaling systems that analyze historical patterns and scale preemptively; resource pools that automatically redistribute capacity during degraded performance; and intelligent load shedding that prioritizes traffic based on business impact.

Then, finally, monitoring and observability. Like I said earlier, you can't fix what you can't see. Logs, traces, and metrics together form the backbone of the modern observability stack that gives engineers actionable insight. Monitoring must always establish baseline behavior and detect anomalies automatically, and your monitoring and observability stack should always be able to distinguish between noise and actionable signals, because you do not want your engineers waking up in the middle of the night for noisy or unactionable alerts. That's why you have to build a stable and proactive monitoring and observability stack for your service.

Now we look at common failure patterns. This is fine. Failure pattern number one: cascading failures. We look at multiple case studies here, which will be provided in the documentation: Slack's 2021 global outage and the Netflix Christmas Eve outage in 2012, which are examples of retry storms, and failures due to resource contention, such as the Robinhood trading outage in 2020. These are public examples of what common failure patterns look like, essentially.

Secondly, we look at operational failures: configuration drift, deployment problems, and human error. Configuration drift, firstly, is when there is a slight change in configuration, maybe from a third-party provider or even from a user; the case study here is the Salesforce database outage in 2019. For deployment problems, look at the TSB Bank migration failure in 2018, which occurred when they tried to migrate their services from one provider to another and ran into deployment issues because the migration was not planned for, not preemptive. Look at the Knight Capital trading loss in 2012 and the Amazon S3 outage in 2017, on which a whole lot of services globally were heavily reliant. Then we look at operational failures due to human error: the GitLab data loss in 2017, where a GitLab engineer mistakenly deleted a production database and services were down for hours.
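To make the circuit breaker mechanics described above concrete, here is a minimal sketch in Python. It is not tied to any specific framework or library from the talk; the failure threshold and reset timeout are illustrative values, and the three-state (closed, open, half-open) flow is the standard form of the pattern. It is a single-threaded sketch, not a production implementation.

```python
import time


class CircuitBreakerOpen(Exception):
    """Raised when the breaker refuses to forward a call."""


class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN after repeated failures,
    then HALF_OPEN after a cooldown to probe whether the service recovered."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"  # allow one probe request through
            else:
                raise CircuitBreakerOpen("failing fast; downstream is degraded")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

    def _record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"
```

Wrapping an outbound call as, say, `breaker.call(fetch_user, user_id)` (a hypothetical downstream call) means that once the dependency starts failing, callers fail fast instead of piling more load onto a degraded service.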
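The retry storms mentioned above usually happen when every client retries immediately and in lockstep against a struggling dependency. A common mitigation, sketched here as a general technique rather than anything prescribed in the talk, is capped exponential backoff with jitter:

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky operation with capped exponential backoff plus full jitter,
    so thousands of clients do not hammer a recovering service in lockstep."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # full jitter: sleep a random amount up to the capped backoff
            backoff = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, backoff))
```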
These human-error incidents are things we do not want to happen, but they are a common failure pattern. Then look at failure pattern number three, which is software failures: failures due to resource exhaustion and dependency failures. I would encourage you to look at the GitHub incident in 2018 and the Reddit outage in 2016. These are failures due to resource exhaustion, when systems aren't architected or built for resilience, or to scale when there is an increase in traffic. Then dependency failures: when a system is updated or upgraded and there is no accounting for what its dependencies look like. The case studies here are the Stripe API outage in 2019 and the Fastly CDN outage in 2021.

What does achieving 99.999% reliability look like? Five-nines reliability means about five minutes of downtime per year (a worked example of this arithmetic follows the transcript). Is that possible? Is that achievable? What does achieving five minutes of downtime look like? It looks like eliminating all single points of failure, implementing zero-downtime deployments and rollbacks, designing for partial availability during degradation, redundant infrastructure, resilient network architecture, comprehensive monitoring, and, at the code level, microservice design with resilience patterns. That is what achieving five nines means, essentially: five minutes of downtime in a year. It's achievable; it's possible when we plan for a reliable system.

So we look at the resilience engineering toolkit: how we can identify fragility, code and patterns, and monitoring and measurement. To identify fragility, we use architectural reviews and failure modeling to look at single points of failure and fragile dependencies. By fragile dependencies, I mean dependencies that change frequently or are not actively maintained; we don't want to design our systems around those kinds of dependencies, because we need to be able to account for those changes. We look at how we architect our code and the kind of software patterns we use in designing our applications, because that is usually where the bottleneck is if we don't have proper foundational patterns. Then we also need monitoring and measurement in our resilience toolkit, because we need to define our SLIs and SLOs, the indicators and the objectives, and the error budgets, and we also need proactive dashboards and alerts to observe system health in real time.

In conclusion, failure can only be embraced. That's why we need to embrace the inevitable: is your system built to break and bounce back stronger? Failure is not a risk, it's a certainty. Building failure-resilient distributed systems requires bold design, continuous experimentation, and a culture of resilience. With the right tools and mindsets, we can build systems that don't just survive chaos, but thrive in it. Thank you very much, and I will be willing to answer any questions about system resilience. Thank you very much once again.
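As a worked example of the five-nines arithmetic and the error budgets mentioned in the talk, here is a small sketch; the availability targets are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(slo: float) -> float:
    """Annual error budget (allowed downtime) implied by an availability SLO."""
    return (1 - slo) * MINUTES_PER_YEAR


for slo in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{slo:.5%} availability -> {downtime_budget_minutes(slo):7.1f} min/year")

# 99.999% availability works out to roughly 5.3 minutes of downtime per year,
# which is the "five minutes per year" figure from the talk.
```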

Onyebuchi Mark Irozuru

DevOps Lead @ Botanixlabs

Onyebuchi Mark Irozuru's LinkedIn account


