Transcript
This transcript was autogenerated.
Hello everyone.
This is Sathi, and today I'm going to talk about designing for failure
and how to build resilient systems that can withstand failures.
Let's start with a quick thought experiment.
Imagine it's your busiest day: Black Friday if you're in retail, month-end close if you're in finance, or tax season, with everyone filing on April 15th, the last day. Suddenly a critical part of your system goes offline.
Customers are frustrated, maybe revenue stops flowing. This scenario isn't hypothetical; it's the reality of complex systems today.
Things will fail.
Hardware breaks, networks get congested, software has bugs, dependencies become unavailable, humans make mistakes.
So for decades, our primary focus has been to keep systems from failing.
We try to prevent failure at all costs.
We aim for perfect systems that will never go down.
But here's the reality.
Preventing failures is impossible, especially as systems grow
more complex and distributed.
So what if, instead of trying to prevent failure, we designed for it?
What if we accepted failure as inevitable and built systems
that could gracefully handle it?
So that's what we are going to talk about today: designing for failure.
In the next 10 minutes, we'll explore why this mindset is crucial and look
at some of the core principles for building truly resilient systems.
Okay, getting to the next slide.
So why design for failure?
So let's talk about that.
Firstly, as I said, prevention of failure is a myth.
Chasing a hundred percent uptime as a prevention tactic leads to diminishing returns: it becomes exponentially harder, more complex, and more expensive to keep systems up a hundred percent of the time.
Secondly, the impact of failure in unprepared systems is often catastrophic.
For example, a small glitch in one service can cascade and bring down the entire application.
Think of something like a domino effect.
This leads to prolonged outages, terrible user experience, and significant business and customer impact.
Designing for failure, on the other hand, aims to limit the blast radius.
It's about building systems that can withstand turbulence without fully collapsing, gracefully degrade functionality instead of completely failing, and recover quickly when failures inevitably occur.
It's not about giving up on stability, but about achieving true reliability by acknowledging and planning for imperfection.
So moving on, let's talk about the core principles and techniques that we can use.
This boils down to: how do we actually do this?
Now that we've agreed that preparing for failure is a good idea, let's talk about the different ways we could do it and what works for your systems.
On a high level, there are several key principles and
patterns that we could follow.
Let's touch on a few core ones.
First is redundancy.
So this is one of the fundamental pillars.
Never have a single point of failure for critical components.
Run multiple instances of your services, databases, and load balancers, ideally across different physical locations or availability zones.
If one instance fails, traffic can be routed to a healthy redundant or backup system.
This is a simple concept, but it has a massive impact.
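As a rough sketch of how redundancy plays out in code, here is a minimal Python example that fails over across redundant replicas; the endpoint URLs are hypothetical, and in practice a load balancer or service mesh would usually do this routing for you.

```python
# Minimal failover sketch: try redundant replicas in different zones.
# The internal endpoint URLs below are hypothetical examples.
import requests

REPLICAS = [
    "https://checkout-us-east-1a.internal",
    "https://checkout-us-east-1b.internal",
    "https://checkout-us-west-2a.internal",
]

def call_with_failover(path, timeout=2):
    """Try each replica in turn; skip any that are down or unhealthy."""
    last_error = None
    for base_url in REPLICAS:
        try:
            response = requests.get(base_url + path, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc  # this replica failed, try the next one
    raise RuntimeError("all replicas failed") from last_error
```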
Okay, next is isolation.
So think of bulk heads on a ship.
If one compartment floods, the bulk heads contain the water and prevent
the entire ship from sinking.
Similarly, in the distributed or complex systems that we are building, let's isolate components.
A failure in a user profile service shouldn't bring down the payment processing service, for example.
We can achieve this through techniques like assigning separate resource pools to different parts of the service, following a microservice architecture, or using queue-based communication, and things of that sort.
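One simple way to build such a bulkhead is to cap the concurrency allowed for each downstream dependency, so one slow or failing dependency can't exhaust all your workers. Here is a minimal Python sketch; the pool sizes and service names are assumptions for the example.

```python
# Bulkhead sketch: a bounded semaphore per dependency limits how many
# concurrent calls that dependency can consume.
import threading

class Bulkhead:
    def __init__(self, max_concurrent_calls):
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing when the compartment is full.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()

# Separate compartments: the profile service can saturate its own pool
# without starving the payment service of capacity. Sizes are illustrative.
profile_bulkhead = Bulkhead(max_concurrent_calls=10)
payment_bulkhead = Bulkhead(max_concurrent_calls=50)
```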
Next, let's talk about circuit breakers.
Imagine calling a service that is currently overloaded or down.
Continuously hammering it makes things worse for everyone and, in the worst cases, makes it unrecoverable. The circuit breaker pattern is used to address this kind of situation.
For example, once failures reach a threshold, the breaker trips and stops sending requests for a short period, returning an immediate error or a fallback instead.
This protects the calling service from wasting resources, and it also gives the failing service time to recover.
After a timeout, it might let a few test requests go through to check whether the system is healthy again.
So think of this like a circuit breaker in electrical terms.
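Here is a minimal Python sketch of the circuit breaker idea; the threshold and timeout values are illustrative assumptions, and a production version would also need thread safety.

```python
# Minimal circuit breaker sketch: closed -> open -> half-open -> closed.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None   # timestamp when the breaker tripped

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: the timeout elapsed, allow a test request through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.opened_at is not None or self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()   # trip (or re-trip) the breaker
            raise
        # Success: close the breaker and reset the failure count.
        self.opened_at = None
        self.failure_count = 0
        return result
```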
Next is timeouts and retries.
Don't let your service wait indefinitely for a response from another service.
Implement aggressive timeouts.
If a call times out, you might retry, but do it carefully.
You can use techniques like exponential backoff and a limited number of retries.
Otherwise, you can accidentally create a retry storm that amplifies the traffic to the point that, instead of solving the problem, it creates a new one.
Okay, next, let's talk about graceful degradation.
Sometimes offering partial functionality is much better than offering nothing.
If a non-critical feature fails, for example, displaying some additional news articles on a news site, can the core functionality, like reading the main article, still work?
Design your systems keeping this in mind: try to keep your core functionality separate from the non-critical functionality, so that if the non-critical parts fail, they don't affect the core functionality.
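As a rough illustration of graceful degradation, here is a minimal Python sketch where the page still renders if a non-critical call fails; function names like fetch_article and fetch_related_articles are hypothetical, not from the talk.

```python
# Graceful degradation sketch: the core feature (the article) always
# renders; the non-critical sidebar simply disappears on failure.

def fetch_article(article_id):
    return {"id": article_id, "body": "..."}  # placeholder for the core call

def fetch_related_articles(article_id):
    raise TimeoutError("recommendation service is down")  # simulated failure

def render_article_page(article_id):
    article = fetch_article(article_id)               # core: must succeed
    try:
        related = fetch_related_articles(article_id)  # non-critical feature
    except Exception:
        related = []  # degrade: render the page without the sidebar
    return {"article": article, "related": related}
```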
Next, now that we have discussed various strategies, let's come to the monitoring and observability aspect, which is an absolute must for understanding your system's health.
Add robust logging and metrics for failure detection, understanding the impact, diagnosing the root cause, and verifying that your resilience mechanisms are working as expected.
You can't really manage what you can't measure.
It will help both in recovering from a failure and in doing a root cause analysis once the failure has been recovered, so that you can prevent it from happening again.
So monitoring is one of the most important things that you should definitely add to your system.
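To make this concrete, here is a minimal Python sketch of wrapping a call with basic observability using only the standard library; the logger name and metric keys are assumptions for the example, and a real system would export these metrics to a monitoring backend.

```python
# Observability sketch: log every call with its duration and count
# successes and failures in a simple in-process counter.
import logging
import time
from collections import Counter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")
metrics = Counter()

def observed_call(name, func, *args, **kwargs):
    start = time.time()
    try:
        result = func(*args, **kwargs)
        metrics[f"{name}.success"] += 1
        return result
    except Exception:
        metrics[f"{name}.failure"] += 1
        logger.exception("call failed: %s", name)
        raise
    finally:
        duration_ms = (time.time() - start) * 1000
        logger.info("call=%s duration_ms=%.1f", name, duration_ms)
```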
Alright, now let's move on to the next slide.
Now that we understand the different ways you can make your system resilient to failure, let's see how to shift your mindset, because we are entirely shifting from preventing errors to making sure your system actually keeps working when an error occurs.
Let's talk about that.
Implementing these patterns is not just about code; it's a cultural, architectural, and mindset shift.
It means developers, testers, operations teams, and SREs should be constantly asking: what happens if this part that I'm working on fails?
How will the system behave?
How can we contain the impact?
It encourages practices like chaos engineering, where you intentionally inject controlled failures into production environments to test how the system actually responds and uncover weaknesses before they cause real outages.
So it's about proactively validating your resilience.
Alright, moving to the next slide.
To conclude: failure in our complex systems, especially distributed systems and systems that include machine learning engineering, isn't an if but a when, and trying to prevent every single failure is a losing battle.
By embracing the designing-for-failure mindset and implementing patterns like redundancy, isolation, circuit breakers, graceful degradation, and timeouts and retries, all supported by strong monitoring and observability, we can build systems that are truly resilient: systems that handle failures gracefully, minimize user impact, and recover quickly.
This leads to higher availability, better customer trust, and ultimately a more successful product.
Go back to your teams and start asking this question: what happens if this fails?
Start small.
Pick one critical component and analyze its failure modes.
Introduce one resilience pattern and build that muscle.
Expect failure, test it out by injecting failures, design for it, and build resilience.
Alright, that's all I have.
Thank you.
Let me know if you have any questions, or you can contact me.