Transcript
This transcript was autogenerated.
Hello everyone.
This is Sathi, and today I'm going to talk about designing for failure
and how to build resilient systems that can withstand failures.
Let's start with a quick thought experiment.
Imagine it's your busiest day: Black Friday if you're in retail, month-end close if you're in finance, or tax season, with everyone filing on April 15th, the last day. Suddenly a critical part of your system goes offline.
Customers are frustrated, maybe revenue stops flowing. This scenario isn't hypothetical; it's the reality of complex systems today.
Things will fail.
Hardware breaks, networks get congested, software has bugs, dependencies become unavailable, humans make mistakes.
So for decades, our primary focus has been to keep systems from failing.
We try to prevent failure at all costs.
We aim for perfect systems that will never go down.
But here's the reality.
Preventing failures is impossible, especially as systems grow
more complex and distributed.
So what if, instead of trying to prevent failure, we designed for it?
What if we accepted failure as inevitable and built systems
that could gracefully handle it?
So that's what we are going to talk about today: designing for failure.
In the next 10 minutes, we'll explore why this mindset is crucial and look
at some of the core principles for building truly resilient systems.
Okay, getting to the next slide.
So why design for failure?
So let's talk about that.
Firstly, as I said, prevention of failure is a myth.
Chasing a hundred percent uptime as a prevention tactic leads to diminishing returns: it becomes exponentially harder, more complex, and more expensive to keep systems up a hundred percent of the time.
Secondly, the impact of failure in unprepared systems is often catastrophic.
For example, a small glitch in one service can cascade and bring down the entire application.
Think of something like a domino effect.
This leads to prolonged outages, terrible user experience, and significant business and customer impact.
Designing for failure, on the other hand, aims to limit the blast radius.
It's about building systems that can withstand turbulence without fully collapsing, gracefully degrade functionality instead of completely failing, and recover quickly when failures inevitably occur.
It's not about giving up on stability, but about achieving true reliability by acknowledging and planning for imperfection.
So moving on, let's talk about the core principles and techniques that we can use.
This boils down to: how do we actually do this?
Now that we've agreed that preparing for failure is a good idea, let's talk about the different ways we could do it and what works for your systems.
On a high level, there are several key principles and
patterns that we could follow.
Let's touch on a few core ones.
First is redundancy.
So this is one of the fundamental pillars.
Never have a single point of failure for critical components.
Run multiple instances of your services, databases, and load balancers, ideally across different physical locations or availability zones.
If one instance fails, traffic can be routed to a healthy redundant or backup system.
This is a simple concept, but it has a massive impact.
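As a rough sketch of how redundancy plays out in code, here is a minimal Python example that fails over across redundant replicas; the endpoint URLs are hypothetical, and in practice a load balancer or service mesh would usually do this routing for you.

```python
# Minimal failover sketch: try redundant replicas in different zones.
# The internal endpoint URLs below are hypothetical examples.
import requests

REPLICAS = [
    "https://checkout-us-east-1a.internal",
    "https://checkout-us-east-1b.internal",
    "https://checkout-us-west-2a.internal",
]

def call_with_failover(path, timeout=2):
    """Try each replica in turn; skip any that are down or unhealthy."""
    last_error = None
    for base_url in REPLICAS:
        try:
            response = requests.get(base_url + path, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc  # this replica failed, try the next one
    raise RuntimeError("all replicas failed") from last_error
```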
Okay, next is isolation.
So think of bulk heads on a ship.
If one compartment floods, the bulk heads contain the water and prevent
the entire ship from sinking.
Similarly, in the distributed or complex systems that we are building, let's isolate components.
A failure in a user profile service shouldn't bring down the payment processing service, for example.
We can achieve this through techniques like assigning separate resource pools to different parts of the service, following a microservice architecture, or using queue-based communication, and things of that sort.
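One simple way to build such a bulkhead is to cap the concurrency allowed for each downstream dependency, so one slow or failing dependency can't exhaust all your workers. Here is a minimal Python sketch; the pool sizes and service names are assumptions for the example.

```python
# Bulkhead sketch: a bounded semaphore per dependency limits how many
# concurrent calls that dependency can consume.
import threading

class Bulkhead:
    def __init__(self, max_concurrent_calls):
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing when the compartment is full.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()

# Separate compartments: the profile service can saturate its own pool
# without starving the payment service of capacity. Sizes are illustrative.
profile_bulkhead = Bulkhead(max_concurrent_calls=10)
payment_bulkhead = Bulkhead(max_concurrent_calls=50)
```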
Next, let's talk about circuit breakers.
Imagine calling a service that is currently overloaded or down.
Continuously hammering it makes things worse for everyone and, in the worst cases, makes it unrecoverable. The circuit breaker pattern is used to address this kind of situation.
For example, once failures reach a threshold, the breaker trips and stops sending requests for a short period, returning an immediate error or a fallback instead.
This protects the calling service from wasting resources, and it also gives the failing service time to recover.
After a timeout, it might let a few test requests go through to check whether the system is healthy again.
So think of this like a circuit breaker in electrical terms.
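Here is a minimal Python sketch of the circuit breaker idea; the threshold and timeout values are illustrative assumptions, and a production version would also need thread safety.

```python
# Minimal circuit breaker sketch: closed -> open -> half-open -> closed.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None   # timestamp when the breaker tripped

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: the timeout elapsed, allow a test request through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.opened_at is not None or self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()   # trip (or re-trip) the breaker
            raise
        # Success: close the breaker and reset the failure count.
        self.opened_at = None
        self.failure_count = 0
        return result
```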
Next is timeouts and retries.
Don't let your service wait indefinitely for a response from another service.
Implement aggressive timeouts.
If a call times out, you might retry, but do it carefully.
You can use techniques like exponential backoff and a limited number of retries.
Otherwise, you can accidentally create a retry storm that amplifies the traffic to the point that, instead of solving the problem, it creates a new one.
Okay, next, let's talk about graceful degradation.
Sometimes offering partial functionality is much better than offering nothing.
If a non-critical feature fails, for example, displaying some additional news articles on a news site, can the core functionality, like reading the main article, still work?
Design your systems keeping this in mind: try to keep your core functionality separate from the non-critical functionality, so that if the non-critical parts fail, they don't affect the core functionality.
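As a rough illustration of graceful degradation, here is a minimal Python sketch where the page still renders if a non-critical call fails; function names like fetch_article and fetch_related_articles are hypothetical, not from the talk.

```python
# Graceful degradation sketch: the core feature (the article) always
# renders; the non-critical sidebar simply disappears on failure.

def fetch_article(article_id):
    return {"id": article_id, "body": "..."}  # placeholder for the core call

def fetch_related_articles(article_id):
    raise TimeoutError("recommendation service is down")  # simulated failure

def render_article_page(article_id):
    article = fetch_article(article_id)               # core: must succeed
    try:
        related = fetch_related_articles(article_id)  # non-critical feature
    except Exception:
        related = []  # degrade: render the page without the sidebar
    return {"article": article, "related": related}
```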
Next, now that we have discussed various strategies, let's come to the monitoring and observability aspect, which is an absolute must for understanding your system's health.
Add robust logging and metrics for failure detection, understanding the impact, diagnosing the root cause, and verifying that your resilience mechanisms are working as expected.
You can't really manage what you can't measure.
It will help both in recovering from a failure and in doing a root cause analysis once the failure has been recovered, so that you can prevent it from happening again.
So monitoring is one of the most important things that you should definitely add to your system.
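To make this concrete, here is a minimal Python sketch of wrapping a call with basic observability using only the standard library; the logger name and metric keys are assumptions for the example, and a real system would export these metrics to a monitoring backend.

```python
# Observability sketch: log every call with its duration and count
# successes and failures in a simple in-process counter.
import logging
import time
from collections import Counter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")
metrics = Counter()

def observed_call(name, func, *args, **kwargs):
    start = time.time()
    try:
        result = func(*args, **kwargs)
        metrics[f"{name}.success"] += 1
        return result
    except Exception:
        metrics[f"{name}.failure"] += 1
        logger.exception("call failed: %s", name)
        raise
    finally:
        duration_ms = (time.time() - start) * 1000
        logger.info("call=%s duration_ms=%.1f", name, duration_ms)
```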
Alright, now let's move on to the next slide.
Now that we understand the different ways you can make your system resilient to failure, let's see how to shift your mindset, because we are entirely shifting from preventing errors to making sure your system actually keeps working when an error occurs.
Let's talk about that.
Implementing these patterns is not just about code; it's a cultural, architectural, and mindset shift.
It means developers, testers, operations teams, and SREs should be constantly asking: what happens if this part that I'm working on fails?
How will the system behave?
How can we contain the impact?
It encourages practices like chaos engineering, where you intentionally inject controlled failures into production environments to test how the system actually responds and uncover weaknesses before they cause real outages.
So it's about proactively validating your resilience.
Alright, moving to the next slide.
To conclude: failure in our complex systems, especially distributed systems and systems that include machine learning engineering, isn't an if but a when, and trying to prevent every single failure is a losing battle.
By embracing the designing-for-failure mindset and implementing patterns like redundancy, isolation, circuit breakers, graceful degradation, and timeouts and retries, all supported by strong monitoring and observability, we can build systems that are truly resilient: systems that handle failures gracefully, minimize user impact, and recover quickly.
This leads to higher availability, better customer trust, and ultimately a more successful product.
Go back to your teams and start asking this question: what happens if this fails?
Start small.
Pick one critical component and analyze its failure modes.
Introduce one resilience pattern and build that muscle.
Expect failure, test it out by injecting failures, design for it, and build resilience.
Alright, that's all I have.
Thank you.
Let me know if you have any questions, or you can contact me.