Abstract
We turned alert chaos into $2.3M savings using ML that predicts which errors will break your business. 88.5% faster incident response, 93.8% accuracy, zero custom dev. I’ll show you the exact models, Slack bots, and production secrets that made it happen. Ready to sleep again?
Transcript
Hi. I have a total of 18-plus years of experience in information technology.
Currently, I am working as a principal software engineer at Liberty Mutual Insurance.
Today I'm going to talk about how ML-driven Slack integration cut mean time to resolution by 88.5 percent across 17 enterprise systems.
Now we are going to talk about the problem.
Enterprise applications processing millions of daily transactions generated an overwhelming volume of errors and alerts.
This led to critical issues being buried in noise and severely impacting our operations.
With manual monitoring across 17 production systems supporting over 230,000 daily active users, our teams faced significant challenges, which are mentioned below: slow average detection times of 162 minutes, a staggering 68.2 percent of critical issues discovered first by customers, after-hours incidents averaging a prolonged 131 minutes to detection, and SLA compliance plummeting to 62.3 percent.
This slide is the agenda.
The challenge, alert fatigue: understanding the pervasive issue of alert fatigue and its enterprise-wide impact.
The ML solution architecture slide is going to talk about designing a robust middleware system for efficient and low-overhead exception capture.
The intelligent alert orchestration slide is going to talk about leveraging ML for context-aware routing of critical alerts to optimal channels.
The results and business impact slide is going to talk about demonstrating quantifiable improvements in mean time to resolution, cost savings, and developer productivity.
The implementation guide is going to talk about practical MLOps patterns for deploying your own intelligent error monitoring system.
The challenge, alert fatigue: the hidden cost of scale.
Our 17 enterprise systems served 230K+ daily active users and generated 3.7 million daily transactions, 12K+ daily error events, and 860+ daily alerts.
On the right-hand side, we have represented the operational impact: the mean time to resolution in minutes, the customer-reported issues, the SLA compliance percentage, and the alert-to-action ratio.
This alert overload meant engineers spent 23 percent of their time triaging alerts, diverting crucial effort from building new features and resolving core issues.
Next, we are going to talk about the ML solution architecture: a middleware system capturing exceptions with minimal overhead.
The key technical specifications: a 99.7 percent exception capture rate across all systems, only 3.2 milliseconds of overhead per transaction, distributed tracing for context preservation, an event-driven architecture with a Kafka backbone, and containerized ML inference endpoints.
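To make this concrete, here is a minimal sketch of what such a low-overhead capture middleware could look like, assuming the kafka-python client and a hypothetical "error-events" topic; the names and payload fields are illustrative, not the production implementation described in the talk.

```python
# Minimal sketch of an exception-capture middleware that publishes error
# events to a Kafka backbone. Assumes the kafka-python package and a
# hypothetical "error-events" topic; all names here are illustrative.
import functools
import json
import time
import traceback
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def capture_exceptions(system_name):
    """Decorator that records any exception with trace context, then re-raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            trace_id = str(uuid.uuid4())  # stand-in for a distributed-tracing ID
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                event = {
                    "trace_id": trace_id,
                    "system": system_name,
                    "exception_type": type(exc).__name__,
                    "message": str(exc),
                    "stack_trace": traceback.format_exc(),
                    "timestamp": time.time(),
                }
                # Fire-and-forget publish keeps per-transaction overhead low.
                producer.send("error-events", event)
                raise
        return wrapper
    return decorator

@capture_exceptions(system_name="policy-service")
def process_transaction(payload):
    ...  # business logic goes here
```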
The ML classification models achieve 93.8 percent accuracy in prioritizing errors by business impact.
The ensemble approach combines random forest for categorical features, LSTM for stack trace analysis, and gradient boosting for time-series patterns, with a continuous retraining pipeline driven by human feedback.
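As a rough illustration of the ensemble idea rather than the exact production models, the sketch below trains a random forest on categorical and business-context features and a gradient boosting model on temporal features, then averages their class probabilities; the LSTM branch over raw stack traces is omitted, and the feature split and weights are assumptions.

```python
# Illustrative ensemble sketch: random forest over categorical/context
# features plus gradient boosting over temporal features, with averaged
# class probabilities. The LSTM stack-trace branch is omitted; feature
# names and weights are assumptions, not the production configuration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

class BusinessImpactEnsemble:
    def __init__(self, rf_weight=0.5, gb_weight=0.5):
        self.rf = RandomForestClassifier(n_estimators=200, random_state=42)
        self.gb = GradientBoostingClassifier(random_state=42)
        self.rf_weight = rf_weight
        self.gb_weight = gb_weight

    def fit(self, X_categorical, X_temporal, y):
        # X_categorical: encoded exception/business-context features
        # X_temporal: error-frequency and time-of-day features
        # y: labeled business impact (e.g., 0 = low, 1 = high)
        self.rf.fit(X_categorical, y)
        self.gb.fit(X_temporal, y)
        return self

    def predict_proba(self, X_categorical, X_temporal):
        p_rf = self.rf.predict_proba(X_categorical)
        p_gb = self.gb.predict_proba(X_temporal)
        return self.rf_weight * p_rf + self.gb_weight * p_gb

    def predict(self, X_categorical, X_temporal):
        return np.argmax(self.predict_proba(X_categorical, X_temporal), axis=1)

# Example with random data standing in for the labeled incident history.
rng = np.random.default_rng(0)
X_cat, X_tmp = rng.random((500, 8)), rng.random((500, 5))
y = rng.integers(0, 2, 500)
model = BusinessImpactEnsemble().fit(X_cat, X_tmp, y)
print(model.predict(X_cat[:3], X_tmp[:3]))
```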
The ML model features: what makes an alert critical?
We have divided them into four categories.
Exception characteristics is the first category: stack trace pattern matching, exception type classification, message semantic analysis, and code path frequency.
The next category is business context, based on affected user count estimation, transaction financial value, business process criticality, and data integrity impact.
Temporal patterns is the third category, based on time-of-day correlation, error frequency trends, business hour weighting, and seasonal pattern matching.
The fourth category is based on historical response: prior resolution times, SLA breach prediction, historical escalation rate, and developer response patterns.
Models are trained using 18 months of historical incident data, encompassing 12,387 resolved incidents with complete resolution workflows and outcomes.
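As a hedged sketch of how the four feature categories might be assembled into one record per error event (the field names and helper inputs are illustrative, not the talk's actual schema):

```python
# Sketch: assembling the four feature categories for one error event.
# Field names and the "history" lookup are illustrative assumptions.
from datetime import datetime

def build_features(event, history):
    """event: captured error payload; history: per-signature incident stats."""
    ts = datetime.fromtimestamp(event["timestamp"])
    return {
        # 1. Exception characteristics
        "exception_type": event["exception_type"],
        "stack_depth": event["stack_trace"].count("\n"),
        "message_length": len(event["message"]),
        # 2. Business context
        "affected_users_estimate": event.get("affected_users", 0),
        "transaction_value": event.get("transaction_value", 0.0),
        "process_criticality": event.get("criticality", "medium"),
        # 3. Temporal patterns
        "hour_of_day": ts.hour,
        "is_business_hours": 9 <= ts.hour < 17,
        "errors_last_hour": history.get("errors_last_hour", 0),
        # 4. Historical response
        "median_prior_resolution_min": history.get("median_resolution", 0),
        "past_sla_breach_rate": history.get("sla_breach_rate", 0.0),
        "past_escalation_rate": history.get("escalation_rate", 0.0),
    }
```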
This slide is going to talk about intelligent alert orchestration, the adaptive routing: getting the right alert to the right person.
Contextual evaluation means each alert is evaluated across 14 parameters, including severity, system impact, time of day, and team workload.
The ML decision engine leverages ML to determine the optimal notification channel and urgency level based on predicted business impact.
The channel selection routes alerts to the most appropriate channel, whether Slack channels, direct messages, Microsoft Teams, or email, based on context and urgency.
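The routing logic itself was not shown in the talk; as a rough sketch, a rule-assisted decision layered on top of the model's predicted impact might look like this (thresholds, channel names, and workload rules are assumptions):

```python
# Sketch of channel selection from predicted impact plus context.
# Thresholds, channel names, and workload rules are assumptions.
def select_channel(impact_score, is_business_hours, oncall_workload):
    """Return (channel, urgency) for a single alert."""
    if impact_score >= 0.9:
        # Page the on-call engineer directly for likely critical issues.
        return ("slack-dm:oncall", "critical")
    if impact_score >= 0.7:
        channel = "#prod-incidents" if is_business_hours else "slack-dm:oncall"
        return (channel, "high")
    if impact_score >= 0.4:
        # Defer to email if the team is already saturated with alerts.
        return ("email:team" if oncall_workload > 10 else "#team-alerts", "medium")
    return ("#alert-digest", "low")

print(select_channel(0.93, is_business_hours=False, oncall_workload=4))
# ('slack-dm:oncall', 'critical')
```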
The alert delivery formats alerts with actionable information and includes severity-appropriate urgency signals.
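For the delivery step, here is a minimal sketch using the slack_sdk client and Block Kit formatting; the token handling, channel, and message layout are illustrative, not the production Slack bot.

```python
# Sketch: posting a formatted alert to Slack with slack_sdk Block Kit.
# Token, channel, and layout are illustrative, not the production bot.
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def deliver_alert(channel, alert):
    blocks = [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f":rotating_light: *{alert['severity'].upper()}* "
                    f"{alert['exception_type']} in *{alert['system']}*\n"
                    f"Predicted impact: {alert['impact_score']:.2f} | "
                    f"Affected users (est.): {alert['affected_users']}"
                ),
            },
        },
        {
            "type": "context",
            "elements": [
                {"type": "mrkdwn", "text": f"Trace: `{alert['trace_id']}`"}
            ],
        },
    ]
    client.chat_postMessage(channel=channel, text="Production alert", blocks=blocks)
```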
Intelligent rate limiting: our ML-powered rate limiting algorithms effectively prevented alert storms during major incidents, achieving a 68.2 percent reduction in overall alert volume during incidents while maintaining a 99.8 percent detection rate for unique issues, along with automated clustering of related errors and predictive suppression of cascading failures.
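A minimal sketch of the clustering-plus-rate-limiting idea is shown below: repeated instances of the same error collapse into one alert per window. The fingerprint recipe and the five-minute window are assumptions, not the production algorithm.

```python
# Sketch: fingerprint-based clustering with a per-cluster rate limit,
# so repeated instances of the same error collapse into one alert.
# The fingerprint recipe and 5-minute window are assumptions.
import hashlib
import time

class AlertRateLimiter:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_sent = {}   # fingerprint -> last alert timestamp
        self.suppressed = {}  # fingerprint -> count suppressed in window

    def fingerprint(self, event):
        last_line = event["stack_trace"].strip().splitlines()[-1]
        key = f"{event['system']}|{event['exception_type']}|{last_line}"
        return hashlib.sha256(key.encode()).hexdigest()[:16]

    def should_alert(self, event):
        fp = self.fingerprint(event)
        now = time.time()
        if now - self.last_sent.get(fp, 0) >= self.window:
            self.last_sent[fp] = now
            suppressed = self.suppressed.pop(fp, 0)
            return True, suppressed  # alert, plus how many duplicates it rolls up
        self.suppressed[fp] = self.suppressed.get(fp, 0) + 1
        return False, 0
```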
Now we are going to talk about the results and business impact: the transformative results across the enterprise.
On the left-hand side, we have represented the key performance improvements.
On the right-hand side, we show the tangible business impact: an 88.5 percent reduction in mean time to resolution, $2.3 million in annual savings, a 23.5 percent gain in developer productivity, and a 73.4 percent reduction in alert volume.
The mean time to resolution decreased from 162 minutes to just 18.6 minutes.
The $2.3 million in annual savings is achieved through reduced downtime and enhanced operational efficiency.
Developer productivity is boosted by 23.5 percent by significantly reducing the time spent on alert triage.
The 73.4 percent alert volume reduction results from intelligent clustering of related issues.
Now we are going to talk about the key takeaways and implementation guide.
On the left-hand side, we have represented the MLOps best practices.
Start small, scale gradually: begin with one high-volume system and expand as models mature.
Human-in-the-loop feedback: create explicit feedback mechanisms for continuous model improvement and accuracy.
Measure the business outcomes: focus on mean time to resolution and cost savings, not just model accuracy metrics.
Build for transparency: ensure engineers understand why an alert was triggered and how it was processed.
Your implementation journey is represented on the right-hand side.
Secure data foundation: implement robust exception capture with minimal overhead and distributed tracing across all enterprise systems.
Model training and validation: curate historical incident data for training and continuously validate the ML classification models.
Orchestration integration: connect your ML engine to intelligent routing and notification channels like Slack, Teams, and email.
Iterative deployment: roll out the solution incrementally, gathering feedback and refining the system with each iteration.
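To ground the human-in-the-loop practice mentioned above, here is a hedged sketch of recording engineer feedback on each alert into a retraining dataset; the JSONL file and field names are assumptions, not the talk's pipeline.

```python
# Sketch: recording engineer feedback on alerts for the continuous
# retraining pipeline. The JSONL file and field names are assumptions.
import json
import time

FEEDBACK_LOG = "alert_feedback.jsonl"

def record_feedback(alert_id, predicted_severity, engineer_verdict, resolution_minutes):
    """Append one labeled example; verdict is e.g. 'true_critical' or 'noise'."""
    entry = {
        "alert_id": alert_id,
        "predicted_severity": predicted_severity,
        "engineer_verdict": engineer_verdict,
        "resolution_minutes": resolution_minutes,
        "recorded_at": time.time(),
    }
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

# Example: an alert flagged high that an engineer confirmed as critical.
record_feedback("a1b2c3", "high", "true_critical", resolution_minutes=14)
```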
Thank you.