Abstract
We turned alert chaos into $2.3M savings using ML that predicts which errors will break your business. 88.5% faster incident response, 93.8% accuracy, zero custom dev. I’ll show you the exact models, Slack bots, and production secrets that made it happen. Ready to sleep again?
Transcript
Hi. I have a total of 18-plus years of experience in information technology.
Currently, I am working as a principal software engineer at Liberty Mutual Insurance.
Today I'm going to talk about how ML-driven Slack integration cut mean time to resolution by 88.5 percent across 17 enterprise systems.
Now we are going to talk about the problem.
Enterprise applications processing millions of daily transactions generated an overwhelming volume of errors and alerts.
This led to critical issues being buried in noise and severely impacting our operations.
With manual monitoring across 17 production systems supporting over 230,000 daily active users, our teams faced significant challenges, which are mentioned below: slow average detection times of 162 minutes, a staggering 68.2 percent of critical issues discovered first by customers, after-hours incidents averaging a prolonged 131 minutes to detection, and SLA compliance plummeting to 62.3 percent.
This slide is the agenda.
The challenge, alert fatigue: understanding the pervasive issue of alert fatigue and its enterprise-wide impact.
The ML solution architecture slide is going to talk about designing a robust middleware system for efficient and low-overhead exception capture.
The intelligent alert orchestration slide is going to talk about leveraging ML for context-aware routing of critical alerts to optimal channels.
The results and business impact slide is going to talk about demonstrating quantifiable improvements in mean time to resolution, cost savings, and developer productivity.
The implementation guide is going to talk about practical MLOps patterns for deploying your own intelligent error monitoring system.
The challenge, alert fatigue: the hidden cost of scale.
Our 17 enterprise systems served 230K+ daily active users and generated 3.7 million daily transactions, 12K+ daily error events, and 860+ daily alerts.
On the right-hand side, we have represented the operational impact: the mean time to resolution in minutes, the customer-reported issues, the SLA compliance percentage, and the alert-to-action ratio.
This alert overload meant engineers spent 23 percent of their time triaging alerts, diverting crucial effort from building new features and resolving core issues.
Next, we are going to talk about the ML solution architecture: a middleware system capturing exceptions with minimal overhead.
The key technical specifications: a 99.7 percent exception capture rate across all systems, only 3.2 milliseconds of overhead per transaction, distributed tracing for context preservation, an event-driven architecture with a Kafka backbone, and containerized ML inference endpoints.
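To make this concrete, here is a minimal sketch of what such a low-overhead capture middleware could look like, assuming the kafka-python client and a hypothetical "error-events" topic; the names and payload fields are illustrative, not the production implementation described in the talk.

```python
# Minimal sketch of an exception-capture middleware that publishes error
# events to a Kafka backbone. Assumes the kafka-python package and a
# hypothetical "error-events" topic; all names here are illustrative.
import functools
import json
import time
import traceback
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def capture_exceptions(system_name):
    """Decorator that records any exception with trace context, then re-raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            trace_id = str(uuid.uuid4())  # stand-in for a distributed-tracing ID
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                event = {
                    "trace_id": trace_id,
                    "system": system_name,
                    "exception_type": type(exc).__name__,
                    "message": str(exc),
                    "stack_trace": traceback.format_exc(),
                    "timestamp": time.time(),
                }
                # Fire-and-forget publish keeps per-transaction overhead low.
                producer.send("error-events", event)
                raise
        return wrapper
    return decorator

@capture_exceptions(system_name="policy-service")
def process_transaction(payload):
    ...  # business logic goes here
```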
The ML classification models achieve 93.8 percent accuracy in prioritizing errors by business impact.
The ensemble approach combines random forest for categorical features, LSTM for stack trace analysis, and gradient boosting for time-series patterns, with a continuous retraining pipeline driven by human feedback.
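As a rough illustration of the ensemble idea rather than the exact production models, the sketch below trains a random forest on categorical and business-context features and a gradient boosting model on temporal features, then averages their class probabilities; the LSTM branch over raw stack traces is omitted, and the feature split and weights are assumptions.

```python
# Illustrative ensemble sketch: random forest over categorical/context
# features plus gradient boosting over temporal features, with averaged
# class probabilities. The LSTM stack-trace branch is omitted; feature
# names and weights are assumptions, not the production configuration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

class BusinessImpactEnsemble:
    def __init__(self, rf_weight=0.5, gb_weight=0.5):
        self.rf = RandomForestClassifier(n_estimators=200, random_state=42)
        self.gb = GradientBoostingClassifier(random_state=42)
        self.rf_weight = rf_weight
        self.gb_weight = gb_weight

    def fit(self, X_categorical, X_temporal, y):
        # X_categorical: encoded exception/business-context features
        # X_temporal: error-frequency and time-of-day features
        # y: labeled business impact (e.g., 0 = low, 1 = high)
        self.rf.fit(X_categorical, y)
        self.gb.fit(X_temporal, y)
        return self

    def predict_proba(self, X_categorical, X_temporal):
        p_rf = self.rf.predict_proba(X_categorical)
        p_gb = self.gb.predict_proba(X_temporal)
        return self.rf_weight * p_rf + self.gb_weight * p_gb

    def predict(self, X_categorical, X_temporal):
        return np.argmax(self.predict_proba(X_categorical, X_temporal), axis=1)

# Example with random data standing in for the labeled incident history.
rng = np.random.default_rng(0)
X_cat, X_tmp = rng.random((500, 8)), rng.random((500, 5))
y = rng.integers(0, 2, 500)
model = BusinessImpactEnsemble().fit(X_cat, X_tmp, y)
print(model.predict(X_cat[:3], X_tmp[:3]))
```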
The ML model features: what makes an alert critical?
We have divided them into four categories.
Exception characteristics is the first category: stack trace pattern matching, exception type classification, message semantic analysis, and code path frequency.
The next category is business context, based on affected user count estimation, transaction financial value, business process criticality, and data integrity impact.
Temporal patterns is the third category, based on time-of-day correlation, error frequency trends, business hour weighting, and seasonal pattern matching.
The fourth category is based on historical response: prior resolution times, SLA breach prediction, historical escalation rate, and developer response patterns.
Models are trained using 18 months of historical incident data, encompassing 12,387 resolved incidents with complete resolution workflows and outcomes.
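As a hedged sketch of how the four feature categories might be assembled into one record per error event (the field names and helper inputs are illustrative, not the talk's actual schema):

```python
# Sketch: assembling the four feature categories for one error event.
# Field names and the "history" lookup are illustrative assumptions.
from datetime import datetime

def build_features(event, history):
    """event: captured error payload; history: per-signature incident stats."""
    ts = datetime.fromtimestamp(event["timestamp"])
    return {
        # 1. Exception characteristics
        "exception_type": event["exception_type"],
        "stack_depth": event["stack_trace"].count("\n"),
        "message_length": len(event["message"]),
        # 2. Business context
        "affected_users_estimate": event.get("affected_users", 0),
        "transaction_value": event.get("transaction_value", 0.0),
        "process_criticality": event.get("criticality", "medium"),
        # 3. Temporal patterns
        "hour_of_day": ts.hour,
        "is_business_hours": 9 <= ts.hour < 17,
        "errors_last_hour": history.get("errors_last_hour", 0),
        # 4. Historical response
        "median_prior_resolution_min": history.get("median_resolution", 0),
        "past_sla_breach_rate": history.get("sla_breach_rate", 0.0),
        "past_escalation_rate": history.get("escalation_rate", 0.0),
    }
```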
This slide is going to talk about intelligent alert orchestration, the adaptive routing: getting the right alert to the right person.
Contextual evaluation means each alert is evaluated across 14 parameters, including severity, system impact, time of day, and team workload.
The ML decision engine leverages ML to determine the optimal notification channel and urgency level based on predicted business impact.
The channel selection routes alerts to the most appropriate channel, whether Slack channels, direct messages, Microsoft Teams, or email, based on context and urgency.
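The routing logic itself was not shown in the talk; as a rough sketch, a rule-assisted decision layered on top of the model's predicted impact might look like this (thresholds, channel names, and workload rules are assumptions):

```python
# Sketch of channel selection from predicted impact plus context.
# Thresholds, channel names, and workload rules are assumptions.
def select_channel(impact_score, is_business_hours, oncall_workload):
    """Return (channel, urgency) for a single alert."""
    if impact_score >= 0.9:
        # Page the on-call engineer directly for likely critical issues.
        return ("slack-dm:oncall", "critical")
    if impact_score >= 0.7:
        channel = "#prod-incidents" if is_business_hours else "slack-dm:oncall"
        return (channel, "high")
    if impact_score >= 0.4:
        # Defer to email if the team is already saturated with alerts.
        return ("email:team" if oncall_workload > 10 else "#team-alerts", "medium")
    return ("#alert-digest", "low")

print(select_channel(0.93, is_business_hours=False, oncall_workload=4))
# ('slack-dm:oncall', 'critical')
```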
The alert delivery formats alerts with actionable information and includes severity-appropriate urgency signals.
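For the delivery step, here is a minimal sketch using the slack_sdk client and Block Kit formatting; the token handling, channel, and message layout are illustrative, not the production Slack bot.

```python
# Sketch: posting a formatted alert to Slack with slack_sdk Block Kit.
# Token, channel, and layout are illustrative, not the production bot.
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def deliver_alert(channel, alert):
    blocks = [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f":rotating_light: *{alert['severity'].upper()}* "
                    f"{alert['exception_type']} in *{alert['system']}*\n"
                    f"Predicted impact: {alert['impact_score']:.2f} | "
                    f"Affected users (est.): {alert['affected_users']}"
                ),
            },
        },
        {
            "type": "context",
            "elements": [
                {"type": "mrkdwn", "text": f"Trace: `{alert['trace_id']}`"}
            ],
        },
    ]
    client.chat_postMessage(channel=channel, text="Production alert", blocks=blocks)
```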
Intelligent rate limiting: our ML-powered rate limiting algorithms effectively prevented alert storms during major incidents, achieving a 68.2 percent reduction in overall alert volume during incidents while maintaining a 99.8 percent detection rate for unique issues, along with automated clustering of related errors and predictive suppression of cascading failures.
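A minimal sketch of the clustering-plus-rate-limiting idea is shown below: repeated instances of the same error collapse into one alert per window. The fingerprint recipe and the five-minute window are assumptions, not the production algorithm.

```python
# Sketch: fingerprint-based clustering with a per-cluster rate limit,
# so repeated instances of the same error collapse into one alert.
# The fingerprint recipe and 5-minute window are assumptions.
import hashlib
import time

class AlertRateLimiter:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_sent = {}   # fingerprint -> last alert timestamp
        self.suppressed = {}  # fingerprint -> count suppressed in window

    def fingerprint(self, event):
        last_line = event["stack_trace"].strip().splitlines()[-1]
        key = f"{event['system']}|{event['exception_type']}|{last_line}"
        return hashlib.sha256(key.encode()).hexdigest()[:16]

    def should_alert(self, event):
        fp = self.fingerprint(event)
        now = time.time()
        if now - self.last_sent.get(fp, 0) >= self.window:
            self.last_sent[fp] = now
            suppressed = self.suppressed.pop(fp, 0)
            return True, suppressed  # alert, plus how many duplicates it rolls up
        self.suppressed[fp] = self.suppressed.get(fp, 0) + 1
        return False, 0
```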
Now we are going to talk about the results and business impact: the transformative results across the enterprise.
On the left-hand side, we have represented the key performance improvements.
On the right-hand side, we show the tangible business impact: an 88.5 percent reduction in mean time to resolution, $2.3 million in annual savings, a 23.5 percent gain in developer productivity, and a 73.4 percent reduction in alert volume.
The mean time to resolution decreased from 162 minutes to just 18.6 minutes.
The $2.3 million in annual savings is achieved through reduced downtime and enhanced operational efficiency.
Developer productivity is boosted by 23.5 percent by significantly reducing the time spent on alert triage.
The 73.4 percent alert volume reduction results from intelligent clustering of related issues.
Now we are going to talk about the key takeaways and implementation guide.
On the left-hand side, we have represented the MLOps best practices.
Start small, scale gradually: begin with one high-volume system and expand as models mature.
Human-in-the-loop feedback: create explicit feedback mechanisms for continuous model improvement and accuracy.
Measure the business outcomes: focus on mean time to resolution and cost savings, not just model accuracy metrics.
Build for transparency: ensure engineers understand why an alert was triggered and how it was processed.
Your implementation journey is represented on the right-hand side.
Secure data foundation: implement robust exception capture with minimal overhead and distributed tracing across all enterprise systems.
Model training and validation: curate historical incident data for training and continuously validate the ML classification models.
Orchestration integration: connect your ML engine to intelligent routing and notification channels like Slack, Teams, and email.
Iterative deployment: roll out the solution incrementally, gathering feedback and refining the system with each iteration.
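To ground the human-in-the-loop practice mentioned above, here is a hedged sketch of recording engineer feedback on each alert into a retraining dataset; the JSONL file and field names are assumptions, not the talk's pipeline.

```python
# Sketch: recording engineer feedback on alerts for the continuous
# retraining pipeline. The JSONL file and field names are assumptions.
import json
import time

FEEDBACK_LOG = "alert_feedback.jsonl"

def record_feedback(alert_id, predicted_severity, engineer_verdict, resolution_minutes):
    """Append one labeled example; verdict is e.g. 'true_critical' or 'noise'."""
    entry = {
        "alert_id": alert_id,
        "predicted_severity": predicted_severity,
        "engineer_verdict": engineer_verdict,
        "resolution_minutes": resolution_minutes,
        "recorded_at": time.time(),
    }
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

# Example: an alert flagged high that an engineer confirmed as critical.
record_feedback("a1b2c3", "high", "true_critical", resolution_minutes=14)
```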
Thank you.