Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi.
Today we are discussing resilience for secure and fault-tolerant enterprise systems.
First, welcome to this deep dive into resilience engineering, a three-part plan for modern enterprises.
Today we are exploring how to build systems that not only survive failures,
but actually thrive under pressure.
This is a practical session grounded in real-world distributed systems challenges at enterprise scale.
By the end, you will have actionable strategies to implement
in your own architectures.
Let me introduce myself.
I am working as a quality and performance engineer, and I bring ten years of experience in enterprise-scale distributed systems.
My focus is mainly on performance optimization, reliability
engineering, and DevSecOps practices.
I specialize in building resilient architectures that
balance three competing demands.
Security, fault tolerance, and operational efficiency.
I work extensively with complex microservices environments where one failure can cascade across your entire platform.
I have debugged systems serving 30K requests per minute. So, the challenge: modern distributed systems under pressure.
There are three typical pressures.
One is security threats.
Second is cascading failures.
Third is performance demands.
The first is security threats. We are seeing evolving attack vectors specifically targeting distributed architectures.
For example, attackers exploit microservice boundaries where security controls are inconsistent.
A vulnerability in one service can propagate across your entire mesh if it is not properly isolated.
Second, cascading failures. Single-point failures don't stay isolated anymore.
They propagate across microservices: database connection pool exhaustion, memory leaks in independent services, and ultimately your entire platform cascades.
It is like saying service A fails, service B times out, and because of this service C retries aggressively, which causes resource starvation, and the entire system collapses.
The third threat is performance demands: you must maintain reliability under variable workloads.
For example, Black Friday traffic is 10x your baseline, and security incidents create traffic spikes.
Your system must handle both gracefully. You may want to scale down to save costs, but you cannot compromise reliability.
What is resilience engineering?
So the core definition: resilience engineering is about engineering systems to maintain secure and reliable operation under adverse conditions.
This goes beyond traditional fault tolerance.
The key insight is that it's not just about surviving failures; it's about adapting to threats and recovering gracefully. There is a three-part foundation.
First is proactive design.
Don't wait for failures to happen; design the system assuming failures will happen.
Second, security integration: combine security controls with reliability mechanisms, not as afterthoughts.
Third, robust distributed systems: systems that work reliably across complex microservices environments.
An example of this: the traditional approach is to build a service, monitor it, and react to errors.
With the resilience engineering approach, assume the service will fail.
How does the system handle it?
What mechanisms prevent cascades?
Can we recover automatically?
So there are core fault tolerance mechanisms, and we will cover three of them.
One is circuit breaker patterns, second is retry logic, third is redundancy configurations.
Let's discuss the first one, circuit breaker patterns.
The problem we are solving here is that when a downstream service fails, clients keep bombarding it with requests, making recovery harder.
Resources get wasted, recovery is slow, and customers face timeouts.
The solution is a state-based model that detects failures and prevents cascading errors by temporarily halting requests to failing services.
Think of it like an electrical circuit breaker: when the current gets too high, it trips and stops the flow.
There are three states: closed state, open state, and half-open state.
In closed state, normal operations: requests flow through.
In open state, the failure threshold has been exceeded, so the circuit trips and stops sending requests.
In half-open state, we test recovery with limited requests, like sending one or two requests and checking whether the service is up or not.
Let's take an example.
Your microservice normally handles a thousand requests per second.
The database crashes, so response times spike to 30 seconds.
After five consecutive failures, the circuit opens, so no more requests are sent to the failing service.
After 30 seconds in the open state, we send one or two requests to the database to check: if they succeed, the circuit closes and traffic is restored.
If they fail, the circuit stays open and we try again after another 30 seconds.
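To make this concrete, here is a minimal sketch of that three-state circuit breaker in Python, assuming the five-failure threshold and 30-second open window from the example; the class and method names are illustrative, not taken from any particular library.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=5, open_seconds=30):
        self.failure_threshold = failure_threshold  # e.g. five consecutive failures
        self.open_seconds = open_seconds            # how long to stay open before probing
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.opened_at >= self.open_seconds:
                self.state = "half-open"            # probe the service with one request
            else:
                raise RuntimeError("circuit open: service temporarily unavailable")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"                 # trip the circuit, stop sending traffic
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "closed"                       # success: full recovery
        return result
```

You would wrap each downstream call, for example breaker.call(query_database), so clients fail fast instead of piling onto a struggling service.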
Coming to the second mechanism, retry logic.
The problem here is that transient failures are common in distributed systems: network blips, temporary resource exhaustion, and so on. They can create a thundering herd, where all clients retry simultaneously and overwhelm the system.
The solution is exponential backoff strategies that intelligently retry operations without overwhelming the system.
Each retry waits longer than the previous one, exponentially: one second, two seconds, four seconds, eight seconds, 16 seconds. On top of that, jitter adds randomness so all the clients don't retry at the exact same moment.
An example from our system: an API call fails due to a network error.
The first retry happens after a hundred milliseconds.
If that fails, the next happens after 200 milliseconds, and the third after 400 milliseconds.
By the time you get to the fifth retry, you are waiting 1.6 seconds, giving the system time to recover.
Progressive delay intervals prevent cascading stress.
The key benefit: for transient failures, around 70% succeed on the first retry.
Exponential backoff recovers from temporary failures without adding load to already struggling systems.
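Here is a minimal sketch of retry with exponential backoff and jitter, assuming a 100 ms base delay and a cap on the wait; the function name and defaults are illustrative.

```python
import random
import time

def retry_with_backoff(operation, max_retries=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky operation, doubling the wait each time and adding jitter."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise                                           # out of retries, surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
            time.sleep(random.uniform(0, delay))                 # jitter spreads out the retries
```

Full jitter, sleeping a random amount up to the computed delay, is what keeps thousands of clients from retrying in the same instant.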
The third mechanism is redundancy configurations.
The concept is active-passive and active-active deployments that ensure continuous availability: no single machine carries the entire load, so you spread the risk.
There are two deployment patterns.
Active-passive: the primary server handles all traffic while the secondary server stays on standby. A standard example is a master-slave database setup.
On primary failure, we fail over to the secondary, with roughly ten to 30 seconds of recovery time.
Active-active: multiple servers actively handle traffic simultaneously, and a load balancer distributes the requests.
For example, three identical payment processors, each handling one third of the traffic. If one fails, the remaining two absorb the traffic gracefully, perhaps 50% slower, but it is not a complete outage.
Then there is geographic distribution.
This is a real example. If we take AWS, there are data centers in different regions: US East, US West, EU, China, everywhere.
So if AWS US East has an outage, US West and EU keep serving customers.
The RTO, the recovery time objective, is near zero; users don't notice.
The RPO, the recovery point objective, is minimal data loss thanks to ongoing replication.
Then there is load balancing across replicas: three identical services, so traffic will be roughly 33% to each.
Health checks ensure failed instances get removed from the load balancer.
Automated scaling means that as load increases, it spins up more instances.
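As an illustration of active-active load balancing with health checks, here is a small sketch; the instance names and the health_check callback are hypothetical hooks into your own environment.

```python
import itertools

class LoadBalancer:
    """Round-robin across active replicas, skipping instances that fail health checks."""

    def __init__(self, instances, health_check):
        self.instances = instances            # e.g. ["payments-1", "payments-2", "payments-3"]
        self.health_check = health_check      # callable: instance -> True if healthy
        self._cycle = itertools.cycle(instances)

    def pick(self):
        # Look at each instance at most once; unhealthy ones are effectively removed.
        for _ in range(len(self.instances)):
            candidate = next(self._cycle)
            if self.health_check(candidate):
                return candidate
        raise RuntimeError("no healthy instances available")
```

If one of the three payment replicas fails its health check, the other two keep absorbing the traffic, which is the graceful degradation described above.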
Let's go in depth on the circuit breaker pattern in action.
The first state is closed: normal operations, requests are flowing through, the system is healthy, metrics are green, customers are happy.
This is 99% of the time; a healthy system stays closed. Stage two is the open state: failure detected.
The failure threshold is exceeded and the circuit trips: five consecutive failures happened, so the circuit opened.
No requests are sent to the failing service; instead, clients get a fast response saying the service is temporarily unavailable.
This is crucial: stop wasting time and resources. Stage three is half-open: testing recovery.
We test recovery with limited requests, like sending one or two requests to the service.
If both succeed, we reset the circuit to closed, which is full recovery.
If not, the circuit stays open and we try again later.
Stage four is recovery, the return to normal. The service has recovered and normal flow is successfully restored.
Once the service is healthy, the circuit closes completely, full traffic resumes, and customers see normal service levels.
The business impact: without a circuit breaker, the outage lasts five minutes plus 15 minutes of recovery, so 20 minutes of pain in total.
With a circuit breaker, the outage is recognized in 10 seconds, clients get fast-fail responses, recovery happens automatically, and the user experience impact is minimal.
Next is chaos engineering: testing resilience through controlled failure.
The philosophy is that chaos engineering is a disciplined approach to discovering system weaknesses.
The counterintuitive idea is to intentionally break your system before real failures happen.
This is not randomly breaking things; it is hypothesis-driven.
The definition: chaos engineering proactively injects failures into production-like environments to validate resilience mechanisms and security posture before real incidents occur.
For example, instead of discovering that your disaster recovery doesn't work during an actual disaster, test it intentionally.
Key practices: one, controlled failure injection.
Example: turn off one database replica.
You are simulating a hardware failure here.
Run the experiment for a defined period and observe system behavior.
The system should automatically fail over to the remaining replicas.
Two, hypothesis-driven experiments.
Start with a hypothesis: if the cache cluster goes down, our API should still serve requests from the database, slower but functional.
Kill the cache intentionally and observe.
Does the hypothesis hold?
Can we handle it, and for how long?
Three, gradual blast radius expansion.
We test in stages: in week one, test a single node failure.
Week two, test a network partition between two regions.
Week three, failure of multiple critical services. Week four, simulate a security incident and verify that we detect and block the malicious traffic.
Four, continuous learning and improvement.
Every chaos experiment reveals gaps.
For example, when the cache failed, the database was overwhelmed, so we need better capacity there. Fix the gap, then rerun the experiment.
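A hypothesis-driven experiment like the cache one above can be scripted. Here is a small sketch where inject_failure, restore, and check_slo are hypothetical hooks you would wire to your own platform, for example stopping the cache node and probing API latency.

```python
import time

def run_chaos_experiment(inject_failure, restore, check_slo, duration_seconds=300):
    """Inject a controlled failure, watch the SLO for a while, then always restore."""
    observations = []
    inject_failure()                           # e.g. stop the cache cluster
    try:
        deadline = time.time() + duration_seconds
        while time.time() < deadline:
            observations.append(check_slo())   # e.g. does the API still answer in time?
            time.sleep(5)
    finally:
        restore()                              # end the blast radius no matter what
    hypothesis_held = all(observations)
    print("hypothesis held" if hypothesis_held else "gap found: fix it, then rerun")
    return hypothesis_held
```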
Netflix runs chaos experiments constantly; they pioneered this.
They intentionally kill services in production using Chaos Monkey and similar chaos experiments and tools.
Because of this, they are ultra-resilient compared to competitors.
Now let's discuss the CAP theorem. CAP means consistency, availability, and partition tolerance.
The fundamental trade-off is that distributed systems must choose two of three properties.
This is a hard law of distributed systems.
You cannot have all three.
Let's explain each property. Consistency: all nodes see the same data.
An example: you transfer a hundred dollars from account A to account B.
Consistency means that before the transfer A has $1,000 and B has $500, and after it A has $900 and B has $600, never anything in between.
Every query returns the most recent write.
Availability: every request receives a response.
Your system always responds; it never returns "I don't know."
Availability is measured in nines: for example, 99.99% availability equals a maximum of about 52 minutes of downtime per year.
Partition tolerance: the system continues operating despite network failures.
For example, your data centers are split across continents and a fiber cable is cut; both sides keep operating even though the two halves cannot contact each other.
The trade-off in practice:
CA systems are very rare. They are consistent plus available, but they fail if there is a partition.
An example is a traditional single-data-center relational database; it is not partition tolerant, so one network failure is a complete outage.
CP systems are consistent plus partition tolerant, but may not be available.
For example, some financial systems prioritize consistency, because money must never be wrong, so the trade-off is that if a partition happens, they stop serving requests rather than serve inconsistent data.
AP systems are available plus partition tolerant, but may not be consistent.
For example, most modern web services, like Twitter, stay up during a partition.
You might see slightly stale data, but the service keeps functioning.
The enterprise guidance: enterprise architectures typically prioritize partition tolerance; it is non-negotiable.
They then balance consistency and availability based on business needs.
For money, choose CP and sacrifice availability for consistency.
For social feeds, choose AP, where availability is more important than consistency.
Next is observability: unified security and performance monitoring.
The core concept is that you can't manage what you can't measure; you understand your system's behavior through data.
There are four pillars of observability. One, metrics collection: performance counters and latency measurements.
Examples of these metrics: CPU is at 45%, memory is at 72%, request latency P99 is 250 milliseconds, meaning 99% of requests complete within 250 milliseconds, and the error rate is 0.02%, which is one error per 5,000 requests.
These metrics are collected every few seconds using tools like Prometheus, Datadog, New Relic, and so on.
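As a sketch of metrics collection with the Prometheus Python client mentioned above, here is a minimal example; the metric names, the port, and the simulated 0.02% error rate are illustrative.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("request_latency_seconds", "Request latency in seconds")
REQUEST_ERRORS = Counter("request_errors_total", "Total failed requests")

def handle_request():
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.2))   # stand-in for real work
        if random.random() < 0.0002:            # roughly 0.02% of requests fail
            raise RuntimeError("simulated failure")
    except RuntimeError:
        REQUEST_ERRORS.inc()
    finally:
        REQUEST_LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request()
```

A dashboard can then derive P99 latency and the error rate from these series and alert on them.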
Second is centralized logging.
Aggregated logs with security event correlation.
Every service generates logs.
You collect them in one place.
Security correlation: do we see suspicious patterns, such as multiple failed logins from the same IP, and so on?
Third, distributed tracing across microservice boundaries.
One customer clicks Buy Now and the request goes through seven microservices.
The trace shows order service to payment service to inventory service to fulfillment service, and each leg's timing: order is 50 milliseconds, payment is 200 milliseconds, inventory is 30 milliseconds.
That is how you find bottlenecks: why is payment taking 200 milliseconds when it should be 50 milliseconds?
Fourth, intelligent alerting: anomaly detection that combines security and reliability signals.
It is not just threshold alerts like CPU greater than 80%.
Instead, detect anomalies: the error rate suddenly jumps from 0.02% to 5%, which could be an attack or an outage; latency increases gradually; there is an unusual geographic pattern, which could be a DDoS from a new country.
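One simple way to move from static thresholds to anomaly detection is to compare the latest value against a rolling baseline. This sketch uses a hypothetical 10x spike factor on the error rate; the class name and defaults are illustrative.

```python
from collections import deque

class ErrorRateAnomalyDetector:
    """Flag an anomaly when the error rate jumps well above its rolling baseline."""

    def __init__(self, window=60, spike_factor=10.0):
        self.samples = deque(maxlen=window)   # e.g. one error-rate sample per minute
        self.spike_factor = spike_factor      # alert on a 10x jump over the baseline

    def observe(self, error_rate):
        baseline = sum(self.samples) / len(self.samples) if self.samples else 0.0
        self.samples.append(error_rate)
        if baseline > 0 and error_rate > baseline * self.spike_factor:
            return f"ALERT: error rate {error_rate:.2%} vs baseline {baseline:.2%}"
        return None
```

The same pattern can be applied to latency drift or login failures, so the alert reflects a change in behavior rather than a single fixed threshold.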
Why this matters for resilience: when a failure happens, observability lets us detect it in seconds, not hours, and understand not just what happened but why.
Faster detection means less customer impact.
Next, data-driven resilience analysis: we can use mathematics to predict and prevent failures, and move from intuition to data-founded models.
There are two mathematical models.
The first is Markov chains: model system states and transition probabilities to predict failure scenarios and recovery paths.
For example, state A is healthy, with roughly a 95% chance of staying healthy and a 5% chance of degrading.
State B is degraded, with a 60% chance it recovers and a 40% chance it fails completely.
State C is failed, with a 100% chance it must go through recovery.
You then use probability math to predict outcomes: if one service has a 5% failure probability, the probability that three independent services all fail simultaneously is about 0.0125%, which is extremely rare.
This helps you understand where redundancy helps most.
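Here is a small sketch of that model, using the illustrative transition numbers above to compute the long-run share of time in each state, plus the independent-failure calculation.

```python
import numpy as np

states = ["healthy", "degraded", "failed"]
# Rows are the current state, columns the next state (illustrative probabilities).
P = np.array([
    [0.95, 0.05, 0.00],   # healthy: 95% stays healthy, 5% degrades
    [0.60, 0.00, 0.40],   # degraded: 60% recovers, 40% fails completely
    [1.00, 0.00, 0.00],   # failed: must recover
])

# Steady-state distribution: the eigenvector of P.T for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
steady = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
steady /= steady.sum()
print(dict(zip(states, steady.round(4))))

# Independent failures: chance that three services with 5% failure probability fail at once.
p_fail = 0.05
print(f"three simultaneous failures: {p_fail ** 3:.4%}")   # 0.0125%
```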
The second mathematical model is queueing theory: model request arrival patterns, service rates, and resources to optimize capacity.
For example, our current request rate is 1,000 per second.
If the service capacity is only slightly above that, the queue builds up, latency climbs into the hundreds of milliseconds, and customers feel it.
Queueing theory tells you that you need at least 1,400 requests per second of capacity to stay below 50 milliseconds of latency.
A real example from our metrics: say the system can handle 25,000 requests per minute, which is roughly 400 per second. If traffic spikes toward a thousand requests per second, you need to scale up.
Queueing theory models the request rate, the service rate, and the capacity you need.
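As an illustration, here is the simplest queueing model, M/M/1 (an assumption on my part; the talk does not name a specific model), where the mean time in the system is 1 / (mu - lambda). With a 1,000 per second arrival rate, the 1,400 per second capacity mentioned above sits comfortably below 50 ms.

```python
def mm1_mean_latency(arrival_rate, service_rate):
    """Mean time in an M/M/1 system, in seconds: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        return float("inf")                    # queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

arrival = 1000.0                               # requests per second, from the example above
for capacity in (1050, 1200, 1400):
    latency_ms = mm1_mean_latency(arrival, capacity) * 1000
    print(f"capacity {capacity}/s -> mean latency {latency_ms:.1f} ms")
# capacity 1050/s -> 20.0 ms, 1200/s -> 5.0 ms, 1400/s -> 2.5 ms
```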
The business value is that these quantitative models enable objective comparison of resilience strategies.
For example, should we spend $50K on active-active redundancy or $30K on a simpler option?
The math shows how many hours of downtime each option avoids per year, so you can compare them objectively.
Next: integration challenges in enterprise environments.
So the first challenge is technology stack diversity.
We have multiple languages and platforms, all requiring unified resilience support.
In a real scenario, one application will have different frameworks: the payment service might be Java Spring Boot, another service will be in Python, the front-end API is mostly Node, the databases will be Postgres or Mongo, and so on, plus messaging on top.
Each has different failure modes, monitoring tools, and recovery procedures.
How do you implement a unified pattern across all of this?
The solution is to use a sidecar or proxy pattern; an example is a service mesh, which handles things like circuit breaking at the proxy.
Challenge two: legacy system constraints.
Older components have limited fault tolerance capabilities and become reliability bottlenecks. For example, you have a 20-year-old mainframe that cannot support retry logic or distributed tracing, and you cannot modify it.
The solution is to wrap it with a modern resilience layer, such as an API gateway that handles retries, circuit breaking, and timeouts in front of it. Challenge three: security policy enforcement.
Enforce consistent security controls across disparate systems, boundaries, and trust zones.
The challenge is how we ensure all services enforce encryption, authentication, and audit logging.
One service might use OAuth tokens, another might use API keys, and a legacy system might have no auth at all.
The solution is to implement security policies at the API gateway and service mesh layer. Challenge four: operational complexity, managing resilience mechanisms across engineering teams.
Across engineering teams, when we implement circuit breakers, timeouts, redundancy, and chaos testing, each team needs to understand these concepts and each service needs its own configuration and tuning.
The problem is configuration sprawl, inconsistency, and teams getting overwhelmed.
The solution is a standardized platform: a service mesh with managed governance handles the common concerns.
Next is resource optimization without compromising resilience.
You need to scale for peak traffic, for example 50,000 requests per minute on the platform, but you cannot afford to run at peak capacity 365 days per year.
So the goal is to optimize resources without compromising resilience.
Step one is baseline measurement: establish performance and cost benchmarks.
Run for a month and collect data such as: average traffic 2,000 requests per second, peak traffic 8,000 requests per second, cost $50,000 per month, availability 99.94%, and error rate 0.02%.
This will be your baseline.
Step two: identify inefficiencies.
Look for over-provisioned or under-utilized resources.
Example inefficiencies: the database has 16 CPU cores but sits at 30% utilization, so it is over-provisioned; the API gateway runs on expensive instances, but the traffic only needs about 20% of that capacity; the caching layer evicts data too aggressively and should use a bigger cache.
The opportunity: reduce the database to eight cores and it still handles the load.
Step three: right-sizing with autoscaling and safety margins.
At 60% CPU utilization, add one server; at low utilization, remove one server.
For example, run three servers normally, scale up to eight servers for Black Friday, and in January scale back down to three.
Don't scale to the exact amount; keep a 20% safety margin.
Step four: continuous validation.
Ensure optimizations maintain SLAs. After optimizing, measure: is availability still 99.94%?
If it drops, the optimization went too far, so scale the optimization back. Is P99 latency still below 250 milliseconds?
If it increased, we cut too much capacity. Is the error rate still around 0.02%?
If it increased, revisit the change.
A progressive example: the new configuration maintains the SLAs and cuts cost from $50K to $38K per month, which is about $144K per year saved.
Now, actionable resilience strategies.
We have five strategies; let's go through them.
Strategy one is defense in depth: layer multiple security and fault tolerance controls, and don't rely on a single control.
For example, at layer one a network firewall with rate limiting blocks abusive traffic; at layer three circuit breakers block cascading failures; at layer four retries with backoff handle transient failures; at layer six automatic failover kicks in, backed by health checks; and at layer seven chaos testing validates layers one through six, so if one layer fails the others still hold.
Strategy two: run regular chaos testing.
For example, in Q1 take down a data center and the system should survive; in Q2 simulate a network partition between regions and the system should keep operating; later, simulate a security breach.
Hold monthly incident postmortems to learn from the root causes.
Strategy three:
Automated recovery: implement self-healing patterns and runbooks.
Examples of self-healing patterns: a crashed process is automatically restarted; the database fails over to its replica; a memory leak triggers a service restart that automatically drains connections first.
Runbook examples: if latency exceeds one second, first check the caching layer, second scale up, third clear the cache, fourth trigger an incident.
If the error rate exceeds 1%, first roll back the latest deploy, then check the logs, then open an incident.
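A runbook like the latency one can be encoded as ordered, automated steps. In this sketch the four callables are hypothetical hooks into your monitoring and orchestration tooling.

```python
def latency_runbook(p99_latency_ms, check_caching, scale_up, clear_cache, trigger_incident):
    """Run the 'latency exceeds one second' runbook as ordered, automated steps."""
    if p99_latency_ms <= 1000:
        return "healthy: no action needed"
    check_caching()        # step 1: inspect cache health and hit rate
    scale_up()             # step 2: add capacity, e.g. more instances or pods
    clear_cache()          # step 3: evict stale or poisoned entries
    trigger_incident()     # step 4: page a human if the problem persists
    return "runbook executed"
```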
Strategy four: unified observability, combining security and performance data.
For example, we have dashboards where, for performance, we track latency, error rate, and throughput, and for security we track login failures, unauthorized access, and malicious traffic.
Correlation looks like this: an error rate spike that correlates with a traffic spike can indicate a DDoS.
Unified alerts trigger on combined signals, security plus reliability.
Strategy five:
Continuous improvement: learn from incidents and remediate quickly.
After every incident, a blameless postmortem should be done; not who failed, but why it failed and what failed.
For example, the circuit breaker didn't open fast enough, so let's tune its configuration and add a chaos test to verify it.
A security incident shows we are missing a control, so let's add it.
Track time to detect, time to respond, and time to recover, and improve each quarter.
The key takeaways. Takeaway one: resilience engineering integrates security and reliability.
Fault tolerance mechanisms must work in harmony with security controls; don't build resilience and security separately.
Security must work with reliability, because even a resilience mechanism can be abused: if an attacker can force failures, the circuit breaker keeps opening and keeps rejecting traffic to the service, which is effectively a denial of service.
So monitor whether circuit-open patterns look suspicious.
Takeaway two: observability enables faster detection and response.
Unified monitoring of security and performance signals accelerates incident resolution; with observability data you detect the problem in seconds rather than tens of minutes.
Faster detection equals faster response equals less customer impact.
For example, without observability it can take many minutes just to figure out which service is down; with it, around 30 seconds. Takeaway three:
Chaos engineering validates your assumptions before production failures do.
Failure injection reveals weaknesses in the system before real incidents discover them for you.
Every chaos experiment can prevent a real incident.
For example, will the database backup actually restore? Find out and fix it before it matters; don't wait till it becomes a real issue.
Takeaway four: data-driven models provide objective benchmarks. Mathematical approaches enable quantitative comparison of resilience strategies.
Use Markov chains and queueing data to make objective decisions. Instead of saying "I think we should scale to 10 servers," you can say "the math shows we need eight to 10 servers for this traffic pattern."
That enables fact-based conversations with stakeholders.