Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to Conf42 Incident Management 2025.
I'm Pri Dewa, a senior frontend engineer at ThoughtSpot, where I've spent the last
several years working on complex data visualizations and analytics products.
My work is all about one thing: taking overwhelming amounts of data
and presenting it in a way that's meaningful, timely, and actionable.
And if there's one area where that matters most, it's incident response
because when something breaks, and it always does, every minute counts.
In this talk, we'll explore how AI is reshaping
the way we respond to incidents.
We'll cover why static dashboards are failing us, what AI-powered visual
intelligence brings to the table, which key technologies make this possible,
and a practical approach for implementing this in your organization.
And as you listen, I invite you to think about the last
major incident your team faced.
How much time did you spend finding the right data?
How many dashboards did you click through?
How many times did someone say, wait, let me check another system.
By the end of the talk, I want you to have a mental picture of what a future
intelligent AI powered incident response workflow could look like for you.
Let's set the stage.
Today's incident response teams are working in highly distributed
and highly complex systems.
There are dozens or hundreds of microservices, cloud
infrastructure spanning regions, and data streaming from everywhere.
When an incident hits, the flood of information is enormous.
Logs, traces, alerts, status pages, support tickets.
And yet the tools we rely on are still fundamentally reactive.
They tell us what happened after the fact.
That means teams spend the first critical minutes just figuring out what's broken,
who is impacted, and where to look next.
We've all been there: bridge calls where five people are looking
at five different dashboards, cross-referencing time ranges,
trying to line up events manually.
Meanwhile, customers are feeling the pain and executives are asking for updates.
This reactive approach doesn't just delay resolution, it's costly.
Research shows downtime costs can easily exceed $5,000 per minute,
sometimes far more in high volume businesses like E-commerce or FinTech.
So this isn't just an engineering challenge, it's a
business continuity challenge.
We need to stop playing catch-up and get ahead of incidents.
One of the reasons this is so hard is the scale and speed
of data we deal with today.
Enterprise systems generate massive streams of telemetry,
logs, and metrics every second.
If you think of observability as drinking from a fire hose,
we have now connected many fire hoses and turned them all on at once.
Traditional tools weren't built to process this volume in real time.
They sample data, they add latency, they create blind spots, which
is the last thing you want when you're in a high severity incident.
And then we have information silos: logs in one tool, metrics in another,
deployment history somewhere else.
To piece together what happened, responders are forced to swivel
between systems and manually correlate clues.
This leads to context loss and delays.
The incident clock keeps ticking and we are still just gathering evidence.
Let's talk about dashboards.
Dashboards were a huge leap forward when they became mainstream.
They gave teams a way to visualize data trends and quickly share information,
but they were designed for monitoring, not for rapid crisis response.
They are reactive by design.
They show historical data and force you to reconstruct the
current state from the past.
They also create cognitive overload.
Imagine 20 different charts all updating every few seconds.
Your brain is trying to figure out what changed first, which metrics
matter most, and whether a signal is a symptom or a root cause.
They require manual analysis.
Someone has to click filters, adjust time windows, pull in additional context.
That manual effort slows everything down and leaves room for human error
exactly when you can least afford it.
So how do we fix this?
This is where AI powered visual intelligence comes in.
Instead of simply showing you all the data, these systems interpret it.
They adapt dynamically as conditions change. They don't just display anomalies,
they correlate them, group related alerts, and even suggest likely root causes.
Think of it as going from static maps to a GPS navigation system.
A map tells you where the roads are.
A GPS guides you, reroutes you when traffic builds up and helps you
get to your destination faster.
That's the leap AI enables for incident response.
Moving from passive dashboards to active, context-aware assistance
that works with you to resolve issues.
Let's break this down into three key capabilities that make this work.
First, intelligent alert prioritization.
AI learns from historical incidents and uses context to reduce noise and focus
your attention on what's most critical.
Instead of triaging 500 alerts manually, you get a ranked list
of the five that matter most.
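To make that concrete, here's a minimal sketch of the ranking idea. The alert fields, weights, and scoring rules are all illustrative assumptions, not any real product's schema; a production system would learn these from historical incidents rather than hard-code them.

```python
# Toy alert triage: score alerts by severity, service criticality, and
# recency, then surface only the top few. All fields and weights here
# are invented for illustration.

def score(alert, critical_services):
    s = {"critical": 3, "warning": 2, "info": 1}[alert["severity"]]
    if alert["service"] in critical_services:
        s += 2  # business-critical services get a boost
    if alert["age_min"] < 5:
        s += 1  # fresh alerts are more likely part of the active incident
    return s

def top_alerts(alerts, critical_services, k=5):
    return sorted(alerts, key=lambda a: score(a, critical_services),
                  reverse=True)[:k]

alerts = [
    {"id": 1, "severity": "info", "service": "search", "age_min": 60},
    {"id": 2, "severity": "critical", "service": "checkout", "age_min": 2},
    {"id": 3, "severity": "warning", "service": "checkout", "age_min": 3},
]
print([a["id"] for a in top_alerts(alerts, {"checkout"}, k=2)])  # → [2, 3]
```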
Second, automated visual recommendations.
The system picks the right way to display data for the problem at hand.
For example, during a network outage,
it might autogenerate a topology map showing the affected nodes in red, helping
you see the blast radius instantly.
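Under the hood, a blast-radius computation like that can be pictured as a walk over a service dependency graph. Here's a small sketch; the topology and service names are made up, and the graph maps each service to the services that depend on it.

```python
from collections import deque

# Sketch of a "blast radius" computation: given a map from each service
# to its dependents, walk downstream from the failing node to find
# everything potentially affected. The topology is invented.
dependents = {
    "db":        ["orders", "inventory"],
    "orders":    ["checkout"],
    "inventory": ["checkout"],
    "checkout":  ["web"],
}

def blast_radius(failed, dependents):
    seen, queue = {failed}, deque([failed])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(blast_radius("db", dependents)))
# → ['checkout', 'db', 'inventory', 'orders', 'web']
```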
Third, contextual summaries.
This is where natural language processing shines: complex technical
data gets turned into a plain-language incident summary that can be shared
with stakeholders, saving responders
from spending 15 minutes writing status updates every hour.
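As a deliberately simple stand-in for that summarization step, here is a template-based sketch. A real system would likely use a language model; the incident fields below are assumptions for illustration.

```python
# Template-based stand-in for the NLP summary step: turn structured
# incident facts into a stakeholder-friendly status line.
# Field names are illustrative assumptions.

def incident_summary(incident):
    return (
        f"[{incident['severity']}] {incident['title']}: "
        f"{incident['impact']} Detected at {incident['detected_at']} UTC; "
        f"current status: {incident['status']}."
    )

print(incident_summary({
    "severity": "SEV-1",
    "title": "Checkout latency spike",
    "impact": "Roughly 12% of checkout requests are timing out.",
    "detected_at": "14:02",
    "status": "mitigation in progress",
}))
```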
Now let's reimagine this in action.
You open your incident dashboard and, instead of the same static layout, it
reconfigures itself in real time.
The most relevant metrics bubble to the top, related
anomalies are grouped together, and
probable root causes are highlighted with confidence scores.
Meanwhile, predictive models are scanning the telemetry and warning you of possible
cascading failures, giving you a window to act before the incident spreads.
This turns incident response from a forensic exercise into a
guided and proactive process.
And this is just the beginning.
The future is multimodal, where responders can use voice commands or natural
language queries to interact with data.
Imagine saying, show me all error spikes for the checkout service
in the last 30 minutes, and having the dashboard instantly update.
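To show the shape of that natural-language-to-query step, here's a toy parser for queries like the one above. A real system would use a language model or a proper query grammar; this regex pattern is purely an assumption for the sketch.

```python
import re

# Toy parser: map a natural-language query to a structured filter.
# The pattern only handles queries shaped like the example above.
def parse_query(q):
    m = re.search(r"for the (\w+) service in the last (\d+) minutes", q)
    if not m:
        return None
    return {"service": m.group(1), "window_min": int(m.group(2))}

print(parse_query(
    "show me all error spikes for the checkout service in the last 30 minutes"))
# → {'service': 'checkout', 'window_min': 30}
```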
We'll see AR overlays for physical infrastructure inspections and
collaborative interfaces where multiple teams can explore
data together in real time.
But the most powerful part is AI-human collaboration.
AI handles the data crunching at scale.
Humans focus on judgment, communication, and decision making.
Together, they close the loop much faster than either could alone.
How do you get from where you are today to this future state?
Start with assessment and planning.
Measure your mean time to detection, mean time to resolution, and
the number of handoffs during an incident.
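Establishing that baseline is simple arithmetic over your incident records. Here's a sketch; the timestamps and field names are invented examples.

```python
from datetime import datetime

# Baseline metrics sketch: mean time to detection (start -> detected)
# and mean time to resolution (start -> resolved) from incident records.
# The records below are made-up examples.
incidents = [
    {"start": "2025-01-10 09:00", "detected": "2025-01-10 09:08",
     "resolved": "2025-01-10 10:00"},
    {"start": "2025-02-02 14:00", "detected": "2025-02-02 14:04",
     "resolved": "2025-02-02 14:46"},
]

def minutes(a, b):
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).seconds / 60

mttd = sum(minutes(i["start"], i["detected"]) for i in incidents) / len(incidents)
mttr = sum(minutes(i["start"], i["resolved"]) for i in incidents) / len(incidents)
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")  # → MTTD: 6 min, MTTR: 53 min
```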
Then focus on data integration.
Build unified pipelines that feed logs, metrics, and alerts into a single system.
Without this foundation, AI won't have a complete picture.
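One way to picture that unification is a normalization step that maps every source into a single event shape. The source formats and field names below are invented; the point is only that downstream AI sees one consistent stream.

```python
# Sketch of the "unified pipeline" idea: normalize heterogeneous sources
# (logs, metrics, alerts) into one common event shape.
# All source schemas here are illustrative assumptions.

def normalize(source, raw):
    if source == "log":
        return {"ts": raw["timestamp"], "service": raw["svc"],
                "kind": "log", "detail": raw["message"]}
    if source == "metric":
        return {"ts": raw["time"], "service": raw["service"],
                "kind": "metric", "detail": f"{raw['name']}={raw['value']}"}
    if source == "alert":
        return {"ts": raw["fired_at"], "service": raw["target"],
                "kind": "alert", "detail": raw["summary"]}
    raise ValueError(f"unknown source: {source}")

event = normalize("metric", {"time": "14:02", "service": "checkout",
                             "name": "p99_latency_ms", "value": 4200})
print(event["detail"])  # → p99_latency_ms=4200
```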
Next, move to AI model development.
Train your models on historical incidents so that they can recognize familiar
failure patterns, predict likely next steps, and suggest actions.
Then prioritize interface design; the interface is where value becomes real.
It should be intuitive, role-based, and designed for stressful situations.
Your SREs should get deep technical detail while executives
see business impact summaries.
Finally, build in continuous learning.
Every incident is a chance to improve your model and your workflows.
Treat the AI like any other team member, one who gets smarter
with every retrospective.
And of course, implementing this is not without challenges.
You'll need to address data quality first.
AI models are only as good as the data you feed them.
Invest in standardized logging practices, consistent naming conventions,
and validation pipelines.
Then think about team training: responders need to understand what the AI is
suggesting and trust its recommendations.
Start with a shadow mode, where the AI provides recommendations alongside human
decisions so teams can compare and build confidence.
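Shadow mode gives you a concrete number to watch: how often the AI and the humans agree. A minimal sketch, with made-up paired decisions:

```python
# Shadow-mode evaluation sketch: the AI recommends alongside humans,
# and we track agreement before giving it any authority.
# The paired decisions below are invented examples.

def agreement_rate(pairs):
    """pairs: list of (ai_recommendation, human_decision) tuples."""
    matches = sum(1 for ai, human in pairs if ai == human)
    return matches / len(pairs)

pairs = [
    ("rollback", "rollback"),
    ("scale_up", "scale_up"),
    ("rollback", "restart"),
    ("restart", "restart"),
]
print(f"{agreement_rate(pairs):.0%}")  # → 75%
```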
Lastly, don't forget change management.
Rolling out a system like this is as much a cultural change as a technical
one. Communicate the benefits clearly, involve incident commanders early, and
roll out gradually to avoid overwhelming the team.
Scalability is critical.
Your architecture should be cloud native so it can scale
elastically during major incidents.
Use microservices for modularity, stream data in real time to keep insights
fresh, and invest in a visualization layer that stays fast under load.
Because nothing's worse than a dashboard that freezes when everything is on fire.
Once implemented, measure the results carefully.
Track response time metrics, mean time to detection and mean time
to recovery, and compare them before and after the deployment.
Measure analyst productivity: how many incidents each responder
can handle, and whether manual correlation steps are decreasing.
Quantify the business impact: dollars saved from reduced downtime, customer
churn prevented, and revenue protected.
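That business-impact math can be as simple as a back-of-the-envelope calculation, using the $5,000-per-minute downtime figure cited earlier. The MTTR improvement and incident counts below are hypothetical.

```python
# Back-of-the-envelope business impact, using the $5,000/minute downtime
# figure cited earlier in the talk. All other numbers are hypothetical.
cost_per_minute = 5_000          # USD, from the downtime research cited above
mttr_before_min = 90             # hypothetical baseline
mttr_after_min = 55              # hypothetical post-deployment
incidents_per_quarter = 12       # hypothetical

saved = (mttr_before_min - mttr_after_min) * cost_per_minute * incidents_per_quarter
print(f"Estimated quarterly savings: ${saved:,}")  # → $2,100,000
```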
And don't forget the user experience.
Are responders happier?
Do they trust the system?
Are executives more confident in incident communication?
That is how you prove that AI-powered incident intelligence isn't just a
nice-to-have; it's a measurable driver of business resilience.
So let's summarize.
Transform your approach: move beyond reactive dashboards to proactive,
intelligent interfaces that guide your response.
Invest in intelligence: build or adopt AI capabilities that prioritize
alerts, surface context, and recommend actions.
Enable your teams: give responders the training and tools to trust
and use these systems effectively.
The future of incident response is intelligent, adaptive, and human-centered,
and the sooner we start this transformation, the more resilient, proactive,
and calm our incident response process will become.
I'd love for you to reflect on one question.
If your next P1 incident happened tomorrow, what would you want your
AI-powered dashboard to show you first?
Thank you.