Transcript
Good morning everyone.
I work as a director of software engineering at MasterCard,
and my primary focus is on system resiliency, scalability, and performance.
I bring with me about 15 years of industry experience in these
domains, with a particular focus on site reliability engineering.
And today I'd like to talk about one of the patterns that improves system
resilience and helps us fail fast and recover faster, which is to harness
cascading timeouts throughout the system.
In this session today, we're gonna talk about how these timeouts play
an important role, especially in large scale distributed systems.
Timeouts are vital for performance and resilience.
When they're misconfigured or poorly managed, they can introduce critical issues.
Resource exhaustion: incorrectly set timeouts can hold
vital system resources, leading to their depletion and instability.
Delayed failure detection: overly generous timeouts prevent rapid identification
of unresponsive services, masking the actual problems, prolonging
outages, and hindering recovery.
Cascading service degradation: improper timeouts allow single
service failures to propagate, leading to widespread system degradation.
These challenges are particularly acute in highly transactional environments
where system stability is paramount.
And in today's session, we will detail how these prevalent timeout
anti-patterns can be tackled.
We will explore what a principled implementation of cascading
timeouts looks like: a strategy for building more resilient, performant,
and robust infrastructure.
We'll go over some foundational principles and practical deployment strategies
across distributed system layers and the measurable impact of these changes.
You will learn to proactively design systems that fail fast when issues
arise, enabling them to recover faster and minimize downtime.
Today's agenda: we'll talk about the problem, the timeout
anti-patterns, exploring why common timeout strategies fail and their impact on
system stability; cascading timeout principles, understanding the core
theory and methodology behind cascading timeouts; how we implement them across
system layers, with practical examples and strategies for deployment in a production
environment; and lessons learned,
key insights and guidance on implementation.
So let's talk about the problem, which is timeout anti-patterns.
There are a few common timeout anti-patterns that we have observed.
Uniform timeouts: using identical timeout values
across all service layers regardless of hierarchy or dependency chain.
Inverted timeouts: configuring outer services to time out before inner
services, leading to resources getting blocked and clients getting timed
out before they could get a response.
Excessive timeouts: setting excessively long timeouts, for example 30-plus
seconds, that exhaust threads and connection pools during system failures.
These anti-patterns are particularly problematic in ML inference pipelines
where resource efficiency is critical.
The impact of poor timeout management is stress
on the system in various ways,
such as thread pool exhaustion, database connection starvation, slow failure
detection, complete service outages, and degraded user experience.
So prior to implementing cascading timeouts,
we recommend performing an analysis of incident and latency data to understand
how poorly configured timeouts are responsible for major incidents.
So what are cascading timeouts?
Cascading timeouts define a systematic approach to configuring timeout
values in distributed systems based on one fundamental rule:
all upstream service calls must have a longer timeout than
their downstream dependencies.
This is the opposite of the inverted timeouts issue that we just discussed,
where an upstream client times out before its downstream dependency. When we
do not ensure that the downstream systems are releasing their resources properly,
we end up in a situation where a downstream system is still waiting
for data while the upstream system has already timed out,
giving the end user a really bad experience.
So having these timeouts helps release resources promptly at every layer,
and systematically longer timeouts higher up the call stack enable rapid
fault isolation, efficient resource management, and graceful degradation.
This leads to a more resilient and responsive microservice architecture.
Let's look at the timeout principle, the cascading timeout principle.
It dictates assigning progressively shorter timeouts from the
system's outermost edge inward.
So what does that look like?
Let's take a simple system where we have a load balancer, a
microservice gateway, and then a database, right?
As you see, the timeouts are set in a progressively shorter manner,
ensuring that the end user who's connecting through the load balancer
is given enough time for each layer to actually fetch the data and cascade it up,
so that the user eventually gets the data.
And if for some reason the data cannot be fetched properly, a
timeout occurs, ensuring that the layers don't hold up resources:
they're able to retry, try to fetch the data again, or send a
proper error code, whatever ensures a smooth user experience.
Failures get surfaced quickly to the client, downstream systems can recover
gracefully without tying up resources,
exhaustion is prevented during partial outages, and the system
fails in a predictable, controlled manner, which is really the
lifeblood of our engineers, right?
To be able to predict during chaos.
Implementing these cascading timeouts establishes a strict hierarchy,
assigning each successive downstream layer a timeout at least 500
milliseconds shorter than the layer above.
This crucial buffer accounts for network latency and processing overhead.
It ensures efficient resource management and prevents bottlenecks.
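To make that concrete, here is a minimal Java sketch of such a hierarchy; the layer names and millisecond values are illustrative placeholders rather than settings from any particular production system, and the check simply enforces the 500 millisecond buffer between adjacent layers.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative cascading timeout hierarchy: each layer gets a timeout at least
// 500 ms longer than the layer below it (values in milliseconds, outermost first).
public class TimeoutHierarchy {
    static final long BUFFER_MS = 500;

    public static void main(String[] args) {
        Map<String, Long> timeouts = new LinkedHashMap<>();
        timeouts.put("edge-proxy",    3000L);  // outermost edge: longest timeout
        timeouts.put("backend-proxy", 2500L);
        timeouts.put("api-gateway",   2000L);
        timeouts.put("app-service",   1500L);
        timeouts.put("database",      1000L);  // innermost layer: shortest timeout

        long previous = Long.MAX_VALUE;
        for (Map.Entry<String, Long> layer : timeouts.entrySet()) {
            if (previous - layer.getValue() < BUFFER_MS) {
                throw new IllegalStateException(
                        layer.getKey() + " violates the 500 ms cascading buffer");
            }
            previous = layer.getValue();
        }
        System.out.println("Timeout hierarchy is consistent: " + timeouts);
    }
}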
This methodology prevents thread pool exhaustion and connection
starvation by ensuring upstream services release resources promptly.
This disciplined reduction in timeout values contains issues
and prevents them from propagating, mitigating bottlenecks.
And the systematic approach enables swift failure detection and
controlled responses, leading to a more resilient and responsive microservice
architecture.
Now let's go through the implementation layers.
Let's look at the edge first, for example, the ingress or the load balancer.
You could implement something similar: configure the proxy
route timeout to 3,000 milliseconds
and the backend proxy timeout to 2,500 milliseconds.
We could integrate circuit breakers to prevent excessive reconnections,
and we could introduce custom response codes for timeout scenarios,
enhancing observability. As the outermost layer,
these timeouts are the longest in the system, so we have to ensure that
these systems are designed to have enough resources to be able to hold
transactions for that long.
This also ensures the clients, the end users, receive timely responses even
during internal system degradation.
Now let's look at the service layer, which could be the
API gateway, application services, and database connections, right?
We are setting progressively shorter timeouts, right from the load balancer.
As you get to the gateway, you're setting a timeout of 2,000 milliseconds.
You could implement a retry policy with one retry and exponential
backoff, so that we are not failing
the transaction on the first try.
We're giving it another chance, but we are not creating a retry storm.
And also enable error response caching for known failures so
the system can respond faster.
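As a rough sketch of that retry policy, assuming a hypothetical downstream URL and per-attempt budget, the following Java snippet bounds each attempt so that one retry plus backoff still fits inside the gateway's overall 2,000 millisecond budget.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// Gateway-layer call: each attempt is bounded so that one retry plus backoff
// still fits inside the gateway's overall 2,000 ms budget (URL is hypothetical).
public class GatewayClient {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(300))
            .build();

    public static String callDownstream() throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://orders.internal/api/v1/orders"))
                .timeout(Duration.ofMillis(900))      // per-attempt timeout
                .GET()
                .build();

        long backoffMs = 100;                         // initial backoff
        for (int attempt = 1; attempt <= 2; attempt++) {  // one try plus one retry
            try {
                HttpResponse<String> response =
                        CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
                return response.body();
            } catch (HttpTimeoutException e) {
                if (attempt == 2) {
                    throw e;                          // surface the failure upstream
                }
                Thread.sleep(backoffMs);              // back off before the single retry
                backoffMs *= 2;                       // exponential backoff if extended
            }
        }
        throw new IllegalStateException("unreachable");
    }
}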
At the application layer, we could set a timeout of 1,500 milliseconds.
We can size thread pools based on anticipated request patterns and implement
a Resilience4j-style circuit breaker pattern to recognize failures
and to implement resiliency, right?
Taking a service out of rotation, or back into rotation, whatever
is possible in your case.
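For illustration, a minimal Resilience4j circuit breaker at this layer might look roughly like the sketch below; the thresholds and the fetchInventory() call are hypothetical, and real values should come from your own traffic patterns.

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

// Application-layer circuit breaker with Resilience4j; thresholds and the
// fetchInventory() downstream call are illustrative.
public class InventoryCaller {
    private static final CircuitBreaker BREAKER = CircuitBreaker.of(
            "inventory-service",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)                         // open at 50% failures
                    .slidingWindowSize(20)                            // measured over 20 calls
                    .waitDurationInOpenState(Duration.ofSeconds(10))  // time out of rotation
                    .build());

    public static String getInventory() {
        Supplier<String> guarded =
                CircuitBreaker.decorateSupplier(BREAKER, InventoryCaller::fetchInventory);
        // while the breaker is open, calls fail fast with CallNotPermittedException
        return guarded.get();
    }

    private static String fetchInventory() {
        // hypothetical downstream call, itself bounded by a 1,500 ms client timeout
        return "...";
    }
}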
And now coming to the database, which is supposed to be the fastest layer, right?
We typically have databases responding in milliseconds,
sub-second for sure.
We typically manage JDBC connection timeouts
via a pool such as the HikariCP connection pool, so we could set statement
timeouts at the connection pool level, configure separate query timeouts for
read and write operations, and thereby manage the database connectivity
and ensure that, if for some reason the database is beginning to degrade,
we're able to let the application know
and get ourselves out of a situation where we're stuck waiting and
never telling the client what's going on.
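A minimal sketch of these database-layer settings with HikariCP and plain JDBC could look like the following; the JDBC URL, the query, and the numbers are placeholders to be tuned against your own baselines.

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Database-layer timeouts with HikariCP and JDBC; the URL, query, and values
// are placeholders and should be tuned against your own baselines.
public class OrderRepository {
    private static final HikariDataSource POOL = buildPool();

    private static HikariDataSource buildPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db.internal/orders");
        config.setMaximumPoolSize(20);
        config.setConnectionTimeout(500);    // max wait for a pooled connection (ms)
        config.setValidationTimeout(250);    // max wait for a connection validity check (ms)
        return new HikariDataSource(config);
    }

    public static String findOrderStatus(long id) throws Exception {
        try (Connection conn = POOL.getConnection();
             PreparedStatement stmt =
                     conn.prepareStatement("SELECT status FROM orders WHERE id = ?")) {
            stmt.setQueryTimeout(1);         // JDBC statement timeout, in whole seconds;
                                             // writes could use a separate, longer value
            stmt.setLong(1, id);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString("status") : null;
            }
        }
    }
}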
Now, coming to special considerations for ML pipelines, right?
Machine learning inference services require specific timeout handling due
to their pretty unique characteristics.
Separate the model loading operations from inference, each with distinct
timeout strategies, and configure batch prediction jobs with longer timeouts
compared to real-time inference.
Introduce graceful degradation through fallback models when the
primary models experience timeouts.
Introduce a caching layer for recent predictions to maintain service
availability during degraded states.
And additionally, implement separate circuit breakers for different
model types, tailoring them to their specific resource consumption profiles.
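One possible shape for the fallback and caching pieces, sketched in plain Java with illustrative model names and an assumed 800 millisecond real-time budget, is a primary-model call bounded by a timeout, with a cached result or a lightweight fallback model on expiry.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Inference call with a tight real-time budget, a cache of recent predictions,
// and a lightweight fallback model on timeout; names and budgets are illustrative.
public class InferenceService {
    private static final ExecutorService WORKERS = Executors.newFixedThreadPool(8);
    private static final Map<String, Double> RECENT = new ConcurrentHashMap<>();

    public static double predict(String featureKey) {
        Future<Double> primary = WORKERS.submit(() -> scorePrimaryModel(featureKey));
        try {
            double score = primary.get(800, TimeUnit.MILLISECONDS);  // real-time budget
            RECENT.put(featureKey, score);                           // cache recent predictions
            return score;
        } catch (TimeoutException e) {
            primary.cancel(true);                                    // free the worker thread
            Double cached = RECENT.get(featureKey);                  // graceful degradation
            return cached != null ? cached : scoreFallbackModel(featureKey);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    private static double scorePrimaryModel(String key) { return 0.0; }   // heavier model
    private static double scoreFallbackModel(String key) { return 0.0; }  // lighter model
}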
So we talked about a lot of different layers, how we could implement timeouts
everywhere, what considerations we need to make at every layer,
and how it applies to machine learning models.
Let's just do a recap, right?
What are the lessons that we learned through this journey?
Everything starts and ends with data, right?
We need data to look at production volumes and production patterns,
and then we need to test and tune, continuously look at the data, and
simulate delays to verify the behavior.
It's not enough to just set timeouts progressively lower.
We need to ensure that we are giving each layer enough time
to do its core work, and add a buffer on top of it to account for any
degradation due to increased load, or due to some bad queries or bad plans
at the database layer, for example.
So always simulate delays, verify behavior, and refine configurations
iteratively based on actual performance data, load tests, and incident analysis.
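As one way to simulate such a delay, here is a small self-contained Java check, with illustrative 1,500 and 2,500 millisecond values, that injects a downstream delay beyond the caller's budget and confirms the caller times out instead of hanging.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Delay-injection check: simulate a downstream that responds slower than the
// caller's budget and confirm the caller times out instead of hanging.
public class TimeoutBehaviorCheck {
    public static void main(String[] args) throws Exception {
        long callerBudgetMs = 1500;
        long injectedDelayMs = 2500;   // deliberately slower than the caller's budget

        CompletableFuture<String> slowDownstream = CompletableFuture.supplyAsync(
                () -> "response",
                CompletableFuture.delayedExecutor(injectedDelayMs, TimeUnit.MILLISECONDS));

        long start = System.nanoTime();
        try {
            slowDownstream.get(callerBudgetMs, TimeUnit.MILLISECONDS);
            throw new AssertionError("expected the caller to time out");
        } catch (TimeoutException expected) {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Timed out after " + elapsedMs + " ms, within budget");
        }
    }
}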
Tailor your strategies to different service types and business contexts.
For example, data processing pipelines require longer timeouts than latency
sensitive transactional services, right?
It's the simple batch versus OLTP processing difference.
So always account for those.
Standardize and monitor, right?
Your system can be doing really well or really badly, but without
proper observability,
we really don't know what's going on. So establish clear, documented
timeout policies for new services, and implement robust monitoring and
alerting to quickly identify and proactively resolve timeout-related
issues, enabling continuous refinement.
All of this needs to be iterative.
Everything goes hand in hand, and that's how in the end, we're
able to realize resiliency.
We'll just go over the implementation playbook one more time, right?
Step one is to map the service topology: understand all the dependencies,
document them, think through the flow of requests through
your system, and identify critical paths and potential bottlenecks.
Step two, measure baselines.
Look at your current metrics, right?
Be it response times, error codes,
thread pool usage,
JVM usage, everything that tells you how the system behaves under
average load conditions, and baseline them.
Step three, design the timeout strategy.
Assign preliminary timeout values by starting from the innermost layers,
for example the databases and caches, and progressively working outward,
adding buffer time at each step.
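A small sketch of that derivation, assuming you already have per-layer p99 latency baselines (the figures below are placeholders), might look like this.

import java.util.LinkedHashMap;
import java.util.Map;

// Derive preliminary timeouts from measured p99 latencies, starting at the
// innermost layer and working outward: each layer gets at least 500 ms more
// than both its downstream budget and its own observed p99 (figures are placeholders).
public class TimeoutBudgetPlanner {
    static final long BUFFER_MS = 500;

    public static void main(String[] args) {
        Map<String, Long> p99ByLayer = new LinkedHashMap<>();   // innermost to outermost
        p99ByLayer.put("database",      120L);
        p99ByLayer.put("app-service",   300L);
        p99ByLayer.put("api-gateway",   350L);
        p99ByLayer.put("load-balancer", 380L);

        long downstreamBudget = 0;
        for (Map.Entry<String, Long> layer : p99ByLayer.entrySet()) {
            long budget = Math.max(downstreamBudget, layer.getValue()) + BUFFER_MS;
            System.out.printf("%-13s timeout = %d ms%n", layer.getKey(), budget);
            downstreamBudget = budget;
        }
    }
}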
Step four, implement and test iteratively: deploy changes incrementally and conduct
thorough performance and chaos testing to verify system behavior under
various failure scenarios. And step five, monitor and continuously refine.
Observe the system behavior and
iteratively adjust timeout values based on real-world performance
data and production insights.
The key takeaways: always design for resilience,
and this is one such strategy to achieve resilience in your system,
implementing cascading timeouts from inward to outward.
This design prevents resource exhaustion and cascading failures,
and it ensures graceful degradation and rapid system recovery, right?
Validate and monitor proactively.
Test timeout behaviors, and
ensure you have robust monitoring for real-time observability
and swift issue resolution.
Iterate and refine: analyze performance data and incident reports,
allow for refinement throughout your process, and keep optimizing
system resilience and performance.
Thank you for this opportunity and please feel free to reach out
to me if you have any questions.
Thank you.