Conf42 Platform Engineering 2025 - Online

- premiere 5PM GMT

Building Bulletproof Error Detection: A Middleware-First Approach to Reducing MTTR by 88.5%

Abstract

From 162-minute detection delays to instant alerts. From angry customer calls to proactive fixes. One middleware strategy saved $2.3M and cut MTTR by 88.5%. Learn the exact architecture that transformed chaos into operational excellence across 230K users.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, this is Sreelatha. I have 18-plus years of experience in information technology, and I currently work as a Principal Software Engineer at Liberty Mutual Insurance. Today I'm going to present how we can build bulletproof error detection: a middleware-first approach to dramatically reducing mean time to resolution. When your application crashes at 2:00 AM, every second counts. This guide shows how a strategic middleware implementation transformed error response across multiple enterprise applications, reducing alert noise, improving detection accuracy, and dramatically cutting mean time to resolution.

Next, let's talk about the crisis that changed everything. A Tuesday morning brought a stark realization: our flagship e-commerce platform had been silently failing for over two hours, incorrectly processing orders and corrupting critical customer data. The discovery came not from our extensive monitoring systems, but from an escalating flood of support tickets and an angry call from our largest enterprise client. Despite significant investments in infrastructure monitoring, application performance management, and synthetic testing, our first indication of critical issues continued to be direct customer complaints. This incident exposed a fundamental flaw in our existing error detection strategy.

Next, the state of alert fatigue. The average developer received over 40 alerts daily, with the vast majority being false positives or low-priority issues. More than three quarters of all alerts required no action or were duplicates of known issues. Operations team turnover rose as on-call engineers reported high stress levels and sleep disruption. Engineers were becoming desensitized to alerts, often ignoring or delaying responses to what might be critical issues.

Next, the hidden costs. On the left-hand side I'm showing four different categories. Engineering productivity: teams spent excessive time investigating false positives and manually correlating data from multiple monitoring tools. Customer trust: it eroded with each incident that went undetected, putting contract renewals at risk. Support burden: teams fielded calls about issues that should have been detected and resolved before customers ever noticed. Business impact: sales provided credits, marketing dealt with negative social media, and the company's reputation suffered. We had dozens of monitoring tools and hundreds of dashboards, yet lacked the fundamental capability to detect when things were actually broken from the user's perspective. The catalyst: a database corruption issue went undetected for nearly three hours, affecting thousands of customers.

Next, rethinking error detection at the source. The breakthrough insight: real errors happen in code, not in infrastructure metrics. Traditional monitoring approaches place sensors at the infrastructure level, watching CPU, memory, network, and database performance. These are lagging indicators of user-impacting issues. By moving our detection logic into the application layer through strategic middleware implementation, we could capture errors at the moment they occur, with full context about what the user was trying to accomplish. Our middleware captures rich contextual information about each error: the user's session, the specific operation being performed, the input data, and the complete execution path.
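To make the idea concrete, here is a minimal, framework-agnostic sketch of an error-capture middleware in the spirit of what is described above. It is illustrative only: the names (ErrorCaptureMiddleware, report_error) and the request shape are assumptions, not the production code from the talk.

```python
# Minimal sketch of "capture errors at the source with full context".
# Hypothetical names and request shape; not the speaker's production code.
import time
import traceback
import uuid


def report_error(event: dict) -> None:
    """Placeholder sink: a real system would enqueue the event for
    enrichment, classification, and routing."""
    print("error event:", event)


class ErrorCaptureMiddleware:
    """Wraps a request handler and records rich context when it raises."""

    def __init__(self, handler):
        self.handler = handler

    def __call__(self, request: dict):
        started = time.time()
        try:
            return self.handler(request)
        except Exception as exc:
            event = {
                "error_id": str(uuid.uuid4()),
                "error_type": type(exc).__name__,
                "message": str(exc),
                "stack": traceback.format_exc(),
                # Context about what the user was trying to accomplish:
                "user_id": request.get("user_id"),
                "session_id": request.get("session_id"),
                "operation": request.get("operation"),
                "input_summary": dict(request.get("params", {})),
                "duration_ms": round((time.time() - started) * 1000, 2),
            }
            try:
                report_error(event)
            except Exception:
                # Fail safely: the reporting path must never break the request.
                pass
            raise  # preserve the application's normal error handling
```

The key design choice mirrors the talk: the reporting path is itself wrapped, so a failure in monitoring never interferes with the original request.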
Next, let's talk about the architecture philosophy. Here we have four different pieces. The request pipeline: every request flows through a chain of middleware components for authentication, validation, business logic, data persistence, and response formatting. The error-capture middleware operates with surgical precision, capturing rich contextual information about each error as it occurs. The next piece is classification: an ML algorithm determines the category of the error in real time based on historical patterns. The next piece is intelligent routing: alerts are sent to the appropriate teams through their preferred channels, with complete context. The middleware operates in multiple phases: immediate capture with minimal processing, enrichment with additional context gathered asynchronously, classification with ML algorithms, and routing to the appropriate teams. Each phase is designed to fail safely: if the error-reporting system itself encounters problems, the original application request continues processing normally.

Next, machine learning for intelligent alerting. Traditional alerting systems rely on static thresholds and simple rule-based logic. Our machine learning approach analyzes error patterns across multiple dimensions: temporal patterns that identify unusual behavior at specific times or days, correlation analysis that identifies relationships between different types of errors and system events, user cohort analysis that detects when errors affect particular segments differently, and continuous learning from historical incident data and engineering feedback.

Next, dynamic priority assignment. Not all errors are created equal. A database timeout affecting one user might be a minor blip, but the same error affecting hundreds of users simultaneously indicates a serious infrastructure problem. Our intelligent classification system considers multiple factors when assigning priority levels, as sketched after this section. On the right-hand side we show four of those factors. The first is error patterns: frequency, distribution, and correlation with other errors. The next is business context: affected user segments and their business value. The next is historical data: similar past errors and their resolution paths. The next is system state: current load, performance metrics, and scheduled maintenance. This dynamic approach transformed our alert quality: high-priority alerts now consistently represent genuine emergencies, while lower-priority issues are appropriately batched for business hours.

Next, multi-platform notification strategies. Different teams have distinct communication preferences, often varying by role, time of day, and incident severity. For developers: Slack channels for team coordination, GitHub issues for tracking resolution, and IDE plugins for immediate visibility. For operations: integration with runbooks, incident management platforms, and PagerDuty escalations. For management: executive dashboards, email summaries, and scheduled reports. Our system learns which channels generate the fastest response times for different types of alerts and adjusts routing accordingly.

Next, intelligent noise reduction: the science of signal detection. Our approach combines multiple techniques to separate signal from noise: baselines that learn normal behavior patterns for each application and service, temporal patterns that account for daily, weekly, and seasonal variations, load-based patterns that capture how error rates change with traffic, and contextual awareness of deployments, maintenance windows, and business events. These dynamic baselines enable the system to detect anomalies that would be invisible to static, threshold-based approaches.

Next, cascade prevention and root cause identification. When systems fail, they often fail in cascading patterns where one issue triggers multiple downstream problems. Traditional monitoring treats each symptom as an independent issue, generating dozens of alerts for what is fundamentally a single problem. Take the example of a database failure, as sketched after this section: connection pool exhaustion raises the primary alert; the resulting service timeouts raise related alerts that are suppressed and linked to the root cause; the system maintains a service dependency map in real time; and front-end UI exceptions are correlated with the backend issue. Our middleware captures sufficient context to identify these relationships in real time, dramatically reducing alert noise during major incidents.
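Here is a toy version of the dynamic priority assignment described above. The weights, thresholds, and field names are illustrative assumptions; the actual system described in the talk uses an ML classifier rather than hand-tuned scores.

```python
# Illustrative sketch of dynamic priority assignment under the factors
# listed above (error patterns, business context, historical data,
# system state). Weights and field names are assumptions, not the
# production model.
from dataclasses import dataclass


@dataclass
class ErrorSignal:
    affected_users: int          # how many users hit this error recently
    frequency_per_min: float     # error pattern: current rate
    enterprise_segment: bool     # business context: high-value customers affected?
    similar_past_p1: bool        # historical data: similar errors escalated before?
    in_maintenance_window: bool  # system state: planned work in progress?


def assign_priority(signal: ErrorSignal) -> str:
    """Return P1/P2/P3 instead of firing the same alert for every occurrence."""
    score = 0.0
    score += min(signal.affected_users / 100.0, 3.0)       # broad impact dominates
    score += min(signal.frequency_per_min / 10.0, 2.0)     # sustained bursts matter
    score += 1.5 if signal.enterprise_segment else 0.0
    score += 1.0 if signal.similar_past_p1 else 0.0
    score -= 1.0 if signal.in_maintenance_window else 0.0  # expected noise

    if score >= 4.0:
        return "P1"   # page immediately
    if score >= 2.0:
        return "P2"   # notify during business hours
    return "P3"       # batch into a daily digest


# Example: one user hitting a timeout vs. hundreds hitting it at once.
print(assign_priority(ErrorSignal(1, 0.2, False, False, False)))   # -> P3
print(assign_priority(ErrorSignal(400, 25.0, True, True, False)))  # -> P1
```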
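Next, a minimal sketch of the preference-based, severity-aware routing outlined above. The channel names, team names, and notify() stub are assumptions; real integrations would call the Slack, PagerDuty, or GitHub APIs.

```python
# Illustrative routing table: per-team channel preferences by priority,
# in the spirit of the multi-platform strategy described above.
def notify(channel: str, team: str, alert: dict) -> None:
    # Placeholder: real integrations would call Slack, PagerDuty, GitHub, etc.
    print(f"[{channel}] -> {team}: {alert['title']} ({alert['priority']})")


ROUTES: dict[str, dict[str, list[str]]] = {
    "developers": {"P1": ["slack", "github_issue"], "P2": ["slack"], "P3": ["github_issue"]},
    "operations": {"P1": ["pagerduty", "slack"],    "P2": ["slack"], "P3": ["email_digest"]},
    "management": {"P1": ["email"],                 "P2": ["email_digest"], "P3": []},
}


def route_alert(alert: dict) -> None:
    """Fan an alert out to each team's preferred channels for its priority."""
    for team, by_priority in ROUTES.items():
        for channel in by_priority.get(alert["priority"], []):
            notify(channel, team, alert)


route_alert({"title": "Connection pool exhaustion on orders-db", "priority": "P1"})
```

A system like the one described would also adjust this table over time based on which channels produce the fastest responses.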
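Finally, a toy version of the database-failure cascade referenced above: a service dependency map is used to suppress downstream symptom alerts and link them to the root cause. Service names and data structures are illustrative assumptions.

```python
# Toy illustration of cascade suppression: link downstream symptoms to
# one root cause using a real-time service dependency map.
from collections import deque

# Edges point from a service to the services that depend on it.
DEPENDENCY_MAP = {
    "orders-db": ["order-service"],
    "order-service": ["checkout-api"],
    "checkout-api": ["web-frontend"],
}


def downstream_of(root: str) -> set[str]:
    """All services that can be impacted if `root` fails."""
    seen, queue = set(), deque([root])
    while queue:
        for dependent in DEPENDENCY_MAP.get(queue.popleft(), []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def correlate(alerts: list[dict], root_cause: str) -> list[dict]:
    """Keep the primary alert; suppress symptoms and link them to the root."""
    impacted = downstream_of(root_cause)
    result = []
    for alert in alerts:
        if alert["service"] in impacted:
            alert = {**alert, "suppressed": True, "linked_to": root_cause}
        result.append(alert)
    return result


alerts = [
    {"service": "orders-db", "error": "connection pool exhausted"},   # primary
    {"service": "order-service", "error": "timeout calling orders-db"},
    {"service": "web-frontend", "error": "500 rendering checkout"},
]
for a in correlate(alerts, root_cause="orders-db"):
    print(a)
```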
Next, real production monitoring metrics and lessons learned. Issues that previously took hours to identify are now detected within minutes of occurrence. The machine learning classification system has virtually eliminated false-positive alerts. Customer-reported errors decreased as issues were caught before customer impact. The key success factors: several factors proved critical to the success of our middleware implementation. Executive support through the initial investment period, when development velocity temporarily decreased. Minimal performance overhead, allowing instrumentation of even performance-sensitive code paths. Middleware that fails safely, ensuring monitoring does not introduce new points of failure. A cultural shift toward treating error detection as integral to application design, not an afterthought.

Next, the implementation roadmap. During the foundation phase, select a single, well-understood application with moderate complexity and clear success metrics for your pilot implementation: basic error capture with rich context collection, comprehensive logging of session info and request parameters, and synchronous processing with manual alert routing. When scaling across applications, develop standardized approaches for rollout across your application portfolio: create reusable middleware components, establish standards for classification and routing, and provide training and documentation. During the advanced features phase, after establishing basic error detection, focus on advanced capabilities: machine learning classification, intelligent notification routing, cascade detection and suppression, and performance optimization.

Next, the future of operational excellence: beyond error detection to predictive operations. The middleware-first approach represents just the beginning of a broader transformation toward predictive operations. Rich contextual data provides the foundation for advanced analytics that can predict problems before they occur. Machine learning algorithms identify patterns that precede system failures, enabling proactive intervention before customers are affected. The question is not whether to begin this transformation, but how quickly you can start building bulletproof error detection for your organization. Start small, measure everything, and prepare to be amazed by the transformation. Your future self and your on-call engineers will thank you for taking the first step toward operational excellence today. Thank you.
...

Sreelatha Pasuparthi

Principal Software Engineer @ Liberty Mutual Insurance

Sreelatha Pasuparthi's LinkedIn account


