Conf42 Incident Management 2025 - Online

- premiere 5PM GMT

Reimagining Incident Response with AI-Powered Visual Interfaces: From Static Dashboards to Real-Time Intelligence


Abstract

Discover how AI is revolutionizing incident response dashboards, cutting MTTR by 35%, automating visual insights, and enabling real-time decision-making. Learn from a $650B case study where smart visualization led to $42M in annual impact.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to Conf42 Incident Management 2025. I'm Priyanshi Deshwal, a senior frontend engineer at ThoughtSpot, where I've spent the last several years working on complex data visualizations and analytics products. My work is all about one thing: taking overwhelming amounts of data and presenting it in a way that's meaningful, timely, and actionable. And if there's one area where that matters most, it's incident response, because when something breaks, and it always does, every minute counts.

In this talk, we'll explore how AI is reshaping the way we respond to incidents. We'll cover why static dashboards are failing us, what AI-powered visual intelligence brings to the table, which key technologies make this possible, and a practical approach for implementing this in your organization. As you listen, I invite you to think about the last major incident your team faced. How much time did you spend finding the right data? How many dashboards did you click through? How many times did someone say, "Wait, let me check another system"? By the end of the talk, I want you to have a mental picture of what a future intelligent, AI-powered incident response workflow could look like for you.

Let's set the stage. Today's incident response teams are working in highly distributed and highly complex systems: dozens or hundreds of microservices, cloud infrastructure spanning regions, and data streaming from everywhere. When an incident hits, the flood of information is enormous: logs, traces, alerts, status pages, support tickets. And yet the tools we rely on are still fundamentally reactive. They tell us what happened after the fact. That means teams spend the first critical minutes just figuring out what's broken, who is impacted, and where to look next. We've all been there: bridge calls where five people are looking at five different dashboards, cross-referencing time ranges and trying to line up events manually. Meanwhile, customers are feeling the pain and executives are asking for updates.

This reactive approach doesn't just delay resolution, it's costly. Research shows downtime costs can easily exceed $5,000 per minute, sometimes far more in high-volume businesses like e-commerce or fintech. So this isn't just an engineering challenge, it's a business continuity challenge. We need to stop playing catch-up and get ahead of incidents.

One of the reasons this is so hard is the scale and speed of data we deal with today. Enterprise systems generate massive streams of telemetry, logs, and metrics every second. If you think about observability as drinking from a fire hose, we have now connected many fire hoses and turned them all on at once. Traditional tools weren't built to process this volume in real time. They sample data, they add latency, they create blind spots, which is the last thing you want when you're in a high-severity incident. And then we have information silos: logs in one tool, metrics in another, deployment history somewhere else. To piece together what happened, responders are forced to swivel between systems and manually correlate clues. This leads to context loss and delays. The incident clock keeps ticking and we are still just gathering evidence.

Let's talk about dashboards. Dashboards were a huge leap forward when they became mainstream. They gave teams a way to visualize data trends and quickly share information, but they were designed for monitoring, not for rapid crisis response. They are reactive by design.
They show historical data and force you to reconstruct the current state from the past. They also create cognitive overload: imagine 20 different charts all updating every few seconds while your brain is trying to figure out what changed first, which metrics matter most, and whether what you're seeing is a symptom or a root cause. And they require manual analysis. Someone has to click filters, adjust time windows, and pull in additional context. That manual effort slows everything down and leaves room for human error, exactly when you can least afford it.

So how do we fix this? This is where AI-powered visual intelligence comes in. Instead of simply showing you all the data, these systems interpret it. They adapt dynamically as conditions change; they don't just display anomalies, they correlate them, group related alerts, and even suggest likely root causes. Think of it as going from static maps to a GPS navigation system. A map tells you where the roads are. A GPS guides you, reroutes you when traffic builds up, and helps you get to your destination faster. That's the leap AI enables for incident response: moving from passive dashboards to active, context-aware assistance that works with you to resolve issues.

Let's break this down into three key capabilities that make this work. First, intelligent alert prioritization. AI learns from historical incidents and the current context to reduce noise and focus your attention on what's most critical. Instead of triaging 500 alerts manually, you get a ranked list of the five that matter most. Second, automated visual recommendations. The system picks the right way to display data for the problem at hand. For example, during a network outage it might auto-generate a topology map showing the affected nodes in red, helping you see the blast radius instantly. Third, contextual summaries. This is where natural language processing shines: complex technical data gets turned into a plain-language incident summary that can be shared with stakeholders, saving responders from spending 15 minutes writing status updates every hour.

Now let's reimagine this in action. You open your incident dashboard and, instead of the same static layout, it reconfigures itself in real time. The most relevant metrics bubble to the top, related anomalies are grouped together, and probable root causes are highlighted with confidence scores. Meanwhile, predictive models are scanning the telemetry and warning you of possible cascading failures, giving you a window to act before the incident spreads. This turns incident response from a forensic exercise into a guided, proactive process.

And this is just the beginning. The future is multimodal, where responders can use voice commands or natural language queries to interact with data. Imagine saying, "Show me all error spikes for the checkout service in the last 30 minutes," and having the dashboard instantly update. We'll see AR overlays for physical infrastructure inspections and collaborative interfaces where multiple teams can explore data together in real time. But the most powerful part is AI-human collaboration. AI handles the data crunching at scale; humans focus on judgment, communication, and decision making. Together, they close the loop much faster than either could alone.

How do you get from where you are today to this future state? Start with assessment and planning. Measure your mean time to detection, mean time to resolution, and the number of handoffs during an incident.
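To make that baseline concrete, here is a minimal sketch of how a team might compute these numbers from past incident records. The `Incident` data model, field names, and `baseline_metrics` helper are illustrative assumptions, not something prescribed in the talk.

```python
# Minimal sketch (illustrative assumptions): baselining MTTD, MTTR, and handoffs
# from historical incident records before introducing any AI tooling.
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started_at: datetime    # when the fault actually began
    detected_at: datetime   # when monitoring or on-call first noticed it
    resolved_at: datetime   # when service was fully restored
    handoffs: int           # how many times ownership changed during the response


def baseline_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Mean time to detection/resolution in minutes, plus average handoffs."""
    mttd = mean((i.detected_at - i.started_at).total_seconds() / 60 for i in incidents)
    mttr = mean((i.resolved_at - i.detected_at).total_seconds() / 60 for i in incidents)
    return {
        "mttd_minutes": round(mttd, 1),
        "mttr_minutes": round(mttr, 1),
        "avg_handoffs": round(mean(i.handoffs for i in incidents), 1),
    }
```

Re-running the same calculation after each rollout phase gives you the before-and-after comparison the talk recommends later on.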
Then focus on data integration. Build unified pipelines that feed logs, metrics, and alerts into a single system. Without this foundation, AI won't have a complete picture. Next, move to AI model development. Train your models on historical incidents so they can recognize familiar failure patterns, predict likely next steps, and suggest actions. Then prioritize interface design; the interface is where the value becomes real. It should be intuitive, role-based, and designed for stressful situations. Your SREs should get deep technical detail while executives see business-impact summaries. Finally, build in continuous learning. Every incident is a chance to improve your models and your workflows. Treat the AI like any other team member, one who gets smarter with every retrospective.

And of course, implementing this is not without challenges. Address data quality first. AI models are only as good as the data you feed them, so invest in standardized logging practices, consistent naming conventions, and validation pipelines. Then think about team training: responders need to understand what the AI is suggesting and trust its recommendations. Start with a shadow mode where the AI provides recommendations alongside human decisions, so the team can compare and build confidence. Lastly, don't forget change management. Rolling out a system like this is as much a cultural change as a technical one. Communicate the benefits clearly, involve incident commanders early, and roll out gradually to avoid overwhelming the team.

Scalability is critical. Your architecture should be cloud native so it can scale elastically during major incidents. Use microservices for modularity, stream data in real time to keep insights fresh, and invest in a visualization layer that stays fast under load, because nothing's worse than a dashboard that freezes when everything is on fire.

Once implemented, measure the results carefully. Track response-time metrics, mean time to detection and mean time to recovery, and compare them before and after the deployment. Measure analyst productivity: how many incidents each responder can handle, and whether manual correlation steps are decreasing. Quantify the business impact: dollars saved from reduced downtime, customer churn prevented, and revenue protected. And don't forget the user experience. Are responders happier? Do they trust the system? Are executives more confident in incident communication? That is how you prove that AI-powered incident intelligence isn't just a nice-to-have, it's a measurable driver of business resilience.

So let's summarize. Transform your approach: move beyond reactive dashboards to proactive, intelligent interfaces that guide your response. Invest in intelligence: build or adopt AI capabilities that prioritize alerts, surface context, and recommend actions. Enable your teams: give responders the training and tools to trust and use these systems effectively. The future of incident response is intelligent, adaptive, and human-centered, and the sooner we start this transformation, the more resilient, proactive, and calm our incident response process will become.

I'd love for you to reflect on one question: if your next P1 incident happened tomorrow, what would you want your AI-powered dashboard to show you first? Thank you.
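The roadmap above recommends starting in shadow mode before letting the AI drive. As a rough illustration of how a team might track whether the AI's suggestions are earning that trust, here is a minimal sketch; the record structure, names, and sample data are hypothetical and not from the talk.

```python
# Minimal sketch (hypothetical data): measuring how often shadow-mode AI suggestions
# agree with the root cause that the human-led retrospective later confirmed.
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    incident_id: str
    ai_suggested_cause: str      # what the model recommended during the incident
    human_confirmed_cause: str   # what the retrospective concluded


def shadow_agreement(records: list[ShadowRecord]) -> float:
    """Fraction of incidents where the AI's top suggestion matched the retrospective."""
    if not records:
        return 0.0
    hits = sum(r.ai_suggested_cause == r.human_confirmed_cause for r in records)
    return hits / len(records)


# Example: review this number in each retrospective cycle and only promote the AI
# out of shadow mode once the team is comfortable with its track record.
records = [
    ShadowRecord("INC-101", "checkout-db connection pool exhaustion",
                 "checkout-db connection pool exhaustion"),
    ShadowRecord("INC-102", "bad deploy of payments-api", "expired TLS certificate"),
]
print(f"shadow-mode agreement: {shadow_agreement(records):.0%}")
```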

Priyanshi Deshwal

Senior Software Engineer @ Mode

Priyanshi Deshwal's LinkedIn account


