Conf42 MLOps 2025 - Online

- premiere 5PM GMT

Fail Fast, Recover Faster: Harnessing Cascading Timeouts for Scalable System Resilience


Abstract

Struggling with slow failures and system crashes at scale? Learn how cascading timeouts transformed our infrastructure into a high-speed, failure-resilient machine. Walk away with proven tactics to boost throughput, cut latency, and build bulletproof MLOps systems.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good morning everyone. My name is Madhavi Bhairavabhatla, I work as a director of software engineering at Mastercard, and my primary focus is system resiliency, scalability, and performance. I bring with me about 15 years of industry experience in these domains, with a particular focus on site reliability engineering. Today I'd like to talk about one of the patterns that improves system resilience and helps us fail fast and recover faster, which is to harness cascading timeouts throughout the system.
In this session we're going to talk about how these timeouts play an important role, especially in large-scale distributed systems. Timeouts are vital for performance and resilience, but when they're misconfigured or poorly managed, they can introduce critical issues. Resource exhaustion: incorrectly set timeouts can hold vital system resources, leading to their depletion and instability. Delayed failure detection: overly generous timeouts prevent rapid identification of unresponsive services, masking the actual problems, prolonging outages, and hindering recovery. Cascading service degradation: improper timeouts allow single-service failures to propagate, leading to widespread system degradation. These challenges are particularly acute in highly transactional environments where system stability is paramount.
In today's session, we will detail how these prevalent timeout anti-patterns can be tackled. We will explore what a principled implementation of cascading timeouts looks like: a strategy for building more resilient, performant, and robust infrastructure. We'll go over some foundational principles, practical deployment strategies across distributed system layers, and the measurable impact of these changes. You will learn to proactively design systems that fail fast when issues arise, enabling them to recover faster and minimize downtime.
Today's agenda: the problem, which is the timeout anti-patterns, exploring why common timeout strategies fail and their impact on system stability; cascading timeout principles, understanding the core theory and methodology behind cascading timeouts; how we implement them across system layers, with practical examples and strategies for deployment in a production environment; and lessons learned, key insights on implementation and guidance.
So let's talk about the problem, which is the timeout anti-patterns. There are a few common timeout anti-patterns that we have observed. Uniform timeouts: using identical timeout values across all service layers regardless of hierarchy or dependency chain. Inverted timeouts: configuring outer services to time out before inner services, leading to resources getting blocked and clients getting timed out before they could get a response. Excessive timeouts: setting excessively long timeouts, for example 30-plus seconds, that exhaust threads and connection pools during system failures. These anti-patterns are particularly problematic in ML inference pipelines where resource efficiency is critical. The impact of poor timeout management is stress on the system in various ways, such as thread pool exhaustion, database connection starvation, slow failure detection, complete service outages, and degraded user experience. So prior to implementing cascading timeouts, we recommend performing an analysis of incident and latency data to understand how poorly configured timeouts have been responsible for major incidents.
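To make the inverted-timeout anti-pattern concrete, here is a minimal, hypothetical Java sketch; the class name, endpoint, and timeout values are illustrative and not taken from the talk. The outer HTTP call gives up after 2 seconds while the inner database query is allowed 10, so the caller walks away while the downstream layer keeps holding a pooled connection.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical illustration of the "inverted timeouts" anti-pattern:
// the outer call times out BEFORE the inner dependency does.
public class InvertedTimeoutExample {

    // Outer layer: the API client gives up after 2 seconds.
    static final Duration OUTER_TIMEOUT = Duration.ofSeconds(2);

    // Inner layer: the service's own database query is allowed 10 seconds.
    // While it keeps running, the caller has already timed out, so the thread
    // and pooled connection it holds are doing work nobody will ever see.
    static final int INNER_QUERY_TIMEOUT_SECONDS = 10;

    public static void main(String[] args) {
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://inner-service.example/api"))
                .timeout(OUTER_TIMEOUT) // the client walks away after 2s ...
                .build();
        // ... while the inner service would still honor
        // statement.setQueryTimeout(INNER_QUERY_TIMEOUT_SECONDS) for up to 10s.
        System.out.println(request.uri() + " -> outer " + OUTER_TIMEOUT
                + " vs inner " + INNER_QUERY_TIMEOUT_SECONDS + "s (inverted)");
    }
}
```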
So what are cascading timeouts? Cascading timeouts define a systematic approach to configuring timeout values in distributed systems based on one fundamental rule: all upstream service calls must have a longer timeout than their downstream dependencies. This is the opposite of the inverted-timeouts issue that we just discussed, where an upstream client times out before its downstream dependency. When we do not ensure that the downstream systems are releasing their resources properly, we end up in a situation where a downstream system is still waiting for data while the upstream system has already timed out, giving the end user a really bad experience. Having these timeouts helps release resources promptly at every layer, and systematically longer timeouts higher up the call stack enable rapid fault isolation, efficient resource management, and graceful degradation. This leads to a more resilient and responsive microservice architecture.
Let's look at the cascading timeout principle. It dictates assigning progressively shorter timeouts from the system's outermost edge inward. So what does that look like? Take a simple system where we have a load balancer, a microservice gateway, and then a database. As you can see, the timeouts are set in a progressively shorter manner, ensuring that the user connecting through the load balancer gives each layer enough time to actually fetch the data and cascade it back up, so that the user eventually gets the data. And if for some reason the data cannot be fetched properly, a timeout occurs, ensuring that the layers don't hold up resources: they're able to retry, try to fetch the data again, or send a proper error code, whatever it is that ensures a smooth user experience. So failures get surfaced quickly to the client, downstream systems can recover gracefully without tying up resources, resource exhaustion is prevented during partial outages, and the system fails in a predictable, controlled manner, which is really the lifeblood of our engineers: to be able to predict during chaos.
Implementing these cascading timeouts establishes a strict hierarchy, assigning each successive downstream layer a timeout at least 500 milliseconds shorter than the layer above. This crucial buffer accounts for network latency and processing overhead. It ensures efficient resource management and prevents bottlenecks. This methodology prevents thread pool exhaustion and connection starvation by ensuring upstream services release resources promptly. This disciplined reduction in timeout values contains issues and prevents them from propagating, mitigating bottlenecks. The systematic approach enables swift failure detection and controlled responses, leading to a more resilient and responsive microservice architecture.
Now for the implementation layers. Let's look at the edge, for example the ingress or the load balancer. You could implement something like this: configure the proxy read timeout to 3000 milliseconds and the backend proxy timeout to 2500 milliseconds. We could integrate circuit breakers to prevent excessive reconnections, and we could introduce custom response codes for timeout scenarios, enhancing observability. As the outermost layer, these timeouts are the longest in the system, so we have to ensure these systems are designed with enough resources to hold transactions for that long. This also ensures that clients, the end users, receive timely responses even during internal system degradation.
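As a rough illustration of that hierarchy, here is a minimal Java sketch using the layer values mentioned in the talk (3000 ms and 2500 ms at the edge, and, as described in the next section, 2000 ms at the gateway, 1500 ms at the application, and a sub-second database budget). The class name, map structure, and the 800 ms database value are my own illustrative choices; the check simply enforces the 500 ms rule stated above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: declare per-layer timeouts from the edge inward and
// verify the cascading rule (each layer at least 500 ms shorter than the one above).
public class CascadingTimeoutPolicy {

    static final long REQUIRED_BUFFER_MS = 500;

    // Ordered from the outermost layer inward, values in milliseconds.
    static final Map<String, Long> TIMEOUTS_MS = new LinkedHashMap<>();
    static {
        TIMEOUTS_MS.put("load-balancer (proxy read)", 3000L);
        TIMEOUTS_MS.put("load-balancer (backend proxy)", 2500L);
        TIMEOUTS_MS.put("api-gateway", 2000L);
        TIMEOUTS_MS.put("application-service", 1500L);
        TIMEOUTS_MS.put("database (query)", 800L); // sub-second, illustrative value
    }

    // Fails fast at startup if any layer violates the cascading rule.
    static void validate() {
        Long previous = null;
        for (Map.Entry<String, Long> layer : TIMEOUTS_MS.entrySet()) {
            if (previous != null && previous - layer.getValue() < REQUIRED_BUFFER_MS) {
                throw new IllegalStateException("Timeout for " + layer.getKey()
                        + " is not at least " + REQUIRED_BUFFER_MS + " ms shorter than the layer above");
            }
            previous = layer.getValue();
        }
    }

    public static void main(String[] args) {
        validate();
        TIMEOUTS_MS.forEach((layer, ms) -> System.out.println(layer + " -> " + ms + " ms"));
    }
}
```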
Now let's look at the service layer, which covers the API gateway, application services, and database connections. We set progressively shorter timeouts as we move inward from the load balancer. As you get to the gateway, you set a timeout of 2000 milliseconds. You could implement a retry policy with one retry and exponential backoff, so that we are not failing the transaction on the first try, we're giving it another chance, but we are also not creating a retry storm. Also enable error response caching for known failures so the system can respond faster. At the application layer, we could set a timeout of 1500 milliseconds, size thread pools based on anticipated request patterns, and implement a Resilience4j-style circuit breaker pattern to recognize failures and implement resiliency, such as taking a service out of rotation, or whatever is possible in your case.
Now coming to the database, which is supposed to be the fastest layer. We typically have databases responding in milliseconds, sub-second for sure. We typically manage JDBC connection timeouts via a pool such as the HikariCP connection pool, so we could set statement timeouts at the connection pool level and configure separate query timeouts for read and write operations. That way we manage database connectivity and ensure that if for some reason the database is beginning to degrade, we're able to let the application know and get ourselves out of a situation where we're stuck waiting and never telling the client what's going on.
Now, coming to special considerations for ML pipelines. Machine learning inference services require specific timeout handling due to their pretty unique characteristics. Separate the model loading operations from inference, each with distinct timeout strategies, and configure batch prediction jobs with longer timeouts compared to real-time inference. Introduce graceful degradation through fallback models when the primary models experience timeouts. Introduce a caching layer for recent predictions to maintain service availability during degraded states. Additionally, implement separate circuit breakers for different model types, tailoring them to their specific resource consumption profiles.
So we talked about a lot of different layers, how we could implement timeouts everywhere, what considerations we need to make at every layer, and how it applies to machine learning models. Let's do a recap: what are the lessons that we learned through this journey? Everything starts and ends with data. We need data to look at production volumes and production patterns, and then we need to test and tune, continuously look at the data, and simulate delays to verify the behavior. It's not enough to just set timeouts progressively lower; we need to ensure that we are giving each layer enough time to do its core work, plus a buffer on top of it to account for any degradation due to increased load or due to bad queries and bad plans at the database layer, for example. So always simulate delays, verify behavior, and refine configurations iteratively based on actual performance data, load tests, and incident analysis.
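Going back to the service-layer and database settings described above, here is a hedged Java sketch of what they might look like. It assumes Resilience4j and HikariCP are on the classpath, as mentioned in the talk; the names, the backoff interval, the JDBC URL, and the exact read-query timeout are illustrative choices of mine, not prescriptions from the talk.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.time.Duration;

public class ServiceLayerTimeouts {

    // Gateway: 2000 ms budget, one retry with exponential backoff (no retry storms).
    static final Retry GATEWAY_RETRY = Retry.of("gateway", RetryConfig.custom()
            .maxAttempts(2) // the original call plus one retry
            .intervalFunction(IntervalFunction.ofExponentialBackoff(100, 2.0))
            .build());

    // Application layer: 1500 ms time limit plus a circuit breaker that can
    // take a misbehaving dependency out of rotation.
    static final TimeLimiter APP_TIME_LIMIT = TimeLimiter.of(TimeLimiterConfig.custom()
            .timeoutDuration(Duration.ofMillis(1500))
            .cancelRunningFuture(true)
            .build());

    static final CircuitBreaker APP_BREAKER = CircuitBreaker.of("app", CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .slowCallDurationThreshold(Duration.ofMillis(1500))
            .waitDurationInOpenState(Duration.ofSeconds(10))
            .build());

    // Database: sub-second budget enforced at the connection pool and statement level.
    static HikariDataSource dataSource() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:postgresql://db.example:5432/app"); // placeholder URL
        cfg.setConnectionTimeout(800);  // ms to wait for a pooled connection
        cfg.setValidationTimeout(500);  // ms to validate a connection before use
        return new HikariDataSource(cfg);
    }

    // Separate read vs. write statement timeouts (JDBC counts whole seconds).
    static void readWithTimeout(Connection conn) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement("SELECT 1")) {
            ps.setQueryTimeout(1); // reads get the tightest budget
            ps.executeQuery();
        }
    }
}
```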
Tailor your strategies to different service types and business contexts. For example, data processing pipelines require longer timeouts than latency-sensitive transactional services; it's the simple batch versus OLTP processing difference. So always account for those differences and use them.
Standardize and monitor. Your system can be doing really well or really badly, but without proper observability we really don't know what's going on. So establish clear, documented timeout policies for new services, and implement robust monitoring and alerting to quickly identify and proactively resolve timeout-related issues, enabling continuous refinement. All of this needs to be iterative. Everything goes hand in hand, and that's how, in the end, we're able to realize resiliency.
Let's go over the implementation playbook one more time. Step one, map the service topology: understand all the dependencies, document them, think through the flow of requests through your system, and identify critical paths and potential bottlenecks. Step two, measure baselines: look at your current metrics, be it response times, error codes, thread pool usage, or JVM usage, everything that tells you how the system behaves under average load conditions, and baseline them. Step three, design the timeout strategy: assign preliminary timeout values by starting from the innermost layers, for example the databases and caches, and progressively working outward, adding buffer time at each step. Step four, implement and test iteratively: deploy changes incrementally and conduct thorough performance and chaos testing to verify system behavior under various failure scenarios. And step five, monitor and continuously refine: observe the system behavior and iteratively adjust timeout values based on real-world performance data and production insights.
The key takeaways: always design for resilience, and this is one such strategy to achieve resilience in your system. Implementing cascading timeouts from the inside outward prevents resource exhaustion and cascading failures; it ensures graceful degradation and rapid system recovery. Validate and monitor proactively: test timeout behaviors, and ensure you have robust monitoring for real-time observability and swift issue resolution. Iterate and refine: analyze performance data and incident reports, allow for refinement throughout your process, and keep optimizing system resilience and performance. Thank you for this opportunity, and please feel free to reach out to me if you have any questions. Thank you.
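As a rough illustration of step three of the playbook above, here is a small Java sketch; the method names, layer list, and example database budget are my own, not from the talk. It derives preliminary timeout values by starting from the innermost layer's measured work budget and adding a buffer while walking outward, which by construction yields a valid cascade.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of playbook step three: derive preliminary timeouts
// from the innermost layer outward, adding a buffer at each step.
public class TimeoutBudgetPlanner {

    static final long BUFFER_MS = 500; // per-layer buffer for network and processing overhead

    // Layers listed from innermost to outermost, with a measured "core work"
    // budget for the innermost layer only (e.g. p99 query latency from baselining).
    static Map<String, Long> planTimeouts(List<String> layersInnerToOuter, long innermostBudgetMs) {
        Map<String, Long> plan = new LinkedHashMap<>();
        long current = innermostBudgetMs;
        for (String layer : layersInnerToOuter) {
            plan.put(layer, current);
            current += BUFFER_MS; // each enclosing layer gets a longer timeout
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Long> plan = planTimeouts(
                List.of("database", "application-service", "api-gateway", "load-balancer"),
                800); // illustrative sub-second database budget
        plan.forEach((layer, ms) -> System.out.println(layer + " -> " + ms + " ms"));
        // database -> 800 ms, application-service -> 1300 ms,
        // api-gateway -> 1800 ms, load-balancer -> 2300 ms
    }
}
```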

Madhavi Bhairavabhatla

Director, Software Engineering @ Mastercard

Madhavi Bhairavabhatla's LinkedIn account


