Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello listeners.
My name is Gandhi Kumar.
I am a principal incident commander working for Twilio
based out of Seattle, Washington.
And I took interest in participating in Conf42 2025 to talk about how incident management is evolving with the advent of AI: there are so many amazing solutions readily available out of the box in the tools, and also things we can build ourselves to make life easier.
And today I'm gonna be talking specifically about how AI is changing incident management, specifically incident response communications at Twilio. The reason I chose this specific topic is that there is so much content about detection, resiliency, SRE, and DevOps, but very little content, I've personally noticed, on how incident response communication happens. You've got to remember, this is an absolutely critical aspect that's often overlooked.
So if you know what incidents are, it's essentially a disruption to service. And at any point in time, whenever something like that happens, customers are directly impacted. And if you have been a vendor or a customer, you know that communication is absolutely critical during an incident. You want to be notified of what's going on. You want to see progressive updates. You want updates on the status page for the specific product that's impacted, because you probably have end users asking you questions about when the product's gonna be up and running once again. And this is true regardless of whether it's a degradation or a product is down: communication during incidents is absolutely crucial, because that's pretty much what builds trust between parties.
So I'm gonna walk us through a few slides, but the majority of this is gonna be me talking, because I really love to share what we have done and how we've solved this problem at a large scale. And then towards the end, I'm gonna try to do a demo and address some of the potential questions that I think people may be having.
So I'll start off with this: what is incident management? You probably know this, but for folks who don't, a lot of this started with FEMA implementing NIMS, the National Incident Management System. And then eventually, as technology companies picked it up, everyone realized the need for a dedicated team of incident responders whose job essentially is nothing but getting the service back up and running, and there are so many other teams that collaborate today. You have your SREs, you have your DevOps engineers, you have your product owners, your incident and technology owners, and depending on which product is impacted, a lot of these teams sit together and collaborate on an incident, given what's impacted and who's required.
And the landscape essentially is changing. I'll give one specific example here. If you have worked at most big tech companies, or if you have at least spoken to people there, you will see the common theme is a lot of tools. It's unavoidable. Every tool provides something the others cannot, and you have to work with all of these tools during incidents. You can cut down the number of tools, but you can't consolidate all of that into one single tool.
So what I put there is a good example of the way incidents run at Twilio. We use the vendor FireHydrant to declare and run incidents. We have PagerDuty to alert and notify us during incidents. We use Zoom to actually run an incident bridge. We use Slack for a lot of async chat, and 90% of incidents essentially are run on Slack, because you need async communication with all the different parties that have to be present. We use Datadog for dashboards and for showing what the product and service status is. Atlassian Statuspage for status page updates to customers. Google Docs for running our post-incident retrospectives. We use Confluence, where we have tons and tons of theoretical knowledge and inner workings of products saved over 15, 18 years. And we use Slack canvases for tracking action items that happened during an incident.
So essentially you're seeing there's a lot of tools working simultaneously at play during incidents, and it's unavoidable. The challenge starts when you end up working the tools instead of working the incident. I'll repeat that and let the listeners think about it. As an incident responder, your prime focus is mitigation: bringing the customer services, the impacted products, back up and running. The challenge starts when you spend too much time working some of these tools, trying to gather information, understand impact, gather data, and document things, which is, don't get me wrong, absolutely critical. But when you spend too much time doing that, it takes away from your primary goal as an incident commander or incident manager: making sure teams are working on what is required, asking the right questions, ensuring the conversation or discussion doesn't go off topic and we stay focused on mitigation, pushing back if someone's too opinionated, and having everyone focus on the single goal, which is mitigation. All of this needs a ton of focus and attention, and if that's diverted, it's extremely hard to drive the incident and take it all the way to mitigation.
What is the solution? How can AI help with internal and external incident response communications before, during, and after incidents? That, as I said, is the primary focus I wanted to talk about today.
Here's a good snippet that I found on Atlassian's website. At the top: oh, something's wrong. The monitoring tool alerts and escalates. The on-call team joins, everyone swarms, things go back to normal, we document what failed, remediate, and move on, right? That's what we see, but here's what our customers see. Something's wrong. They don't know about it. What's going on? We haven't gotten communications. The status page is still green. They start panicking. They start calling their account managers. They start filing those Zendesk tickets, or whatever support tickets they have. Most companies seem to use Zendesk, but that's just an example. They start emailing, a flood of support tickets starts to come in, and they assume the worst. They get angry, which is very normal. You are paying a huge amount of money for those products and services, and for them to be down, and for your end users to feel the pain, it's frustrating. And then they're back to normal, but with less trust in this company, right?
Because here's the thing: I've been working across different industries over the course of 15 years. I worked in carrier, avionics, banking, finance, and now communications platform as a service, which is Twilio. One thing that's common is that if you do not communicate well enough with your customers, from the time an incident starts all the way till it ends, what you're gonna do to fix it, whether you're making sure this doesn't happen again, future prevention, your classic ITIL terminology, then customers lose trust. One incident, two incidents, three incidents, four incidents. As time progresses, trust is broken, and once broken, it's very hard to mend. So communication during incidents is absolutely critical.
So one of the hackathon projects this year that I chose to work on and build for Twilio was a bot: a bot that can gather information in real time and present it with the simplicity of a Slack command. One of the principles that I really like to stick to is, as I said at the start, when you're running an incident and juggling between different tools, you lose a lot of time. What if you could stay in a single tool, let's say Slack, and get all the answers you need without having to juggle and bounce around? I'm not saying the other tools are not important, but the frequency of how often you juggle and move around is what determines how quickly, and how much more present, you are on an incident as the incident commander, as the incident manager.
So I'll not dive into details, but I'll try to keep it high level here. If you are curious, I'll leave a link to my LinkedIn and we can chat more. I'm happy to share how we did this, but for the purposes of Conf42, I'm gonna share the idea, and I'll give as much detail as I can.
We used a tool called Windmill, and we ended up building a Slack app which makes a call to OpenAI and is integrated with every single tool I showed you on the previous slide: Zoom voice, Zoom chat, Slack chat, FireHydrant, our incident management tool, and our wiki pages, because all the theoretical knowledge about how customers use the products lives on our Confluence and wiki pages. That's a ton of information, but accessibility, getting exactly what you need when you need it in an incident, that's hard. And we've integrated it in such a way that we can tap into that information in 20 seconds.
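To make the shape of that concrete, here is a minimal sketch of what a Windmill-backed Slack command calling OpenAI could look like in Python. The helper name, model choice, and prompts are illustrative assumptions, not the actual Twilio implementation.

```python
# Minimal sketch: a Windmill Python script behind a Slack slash command.
# It gathers incident context, asks OpenAI, and replies via Slack's response URL.
import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def fetch_incident_context(incident_id: str) -> str:
    """Hypothetical stand-in for pulling FireHydrant data, Slack history,
    Zoom transcripts, and Confluence/wiki snippets for this incident."""
    return f"Timeline, chat excerpts, and runbook snippets for {incident_id}"


def main(incident_id: str, question: str, response_url: str) -> None:
    """Entry point Windmill invokes when the Slack command fires."""
    context = fetch_incident_context(incident_id)
    completion = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption for the example
        messages=[
            {"role": "system", "content": "You are an incident communications assistant."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    answer = completion.choices[0].message.content
    # Post the answer back into the Slack channel that issued the command.
    requests.post(response_url, json={"response_type": "in_channel", "text": answer})
```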
So: simple commands in Slack. Ask what you want, and it'll give you real-time, contextual information. And one can even give a persona, which is what we ended up doing, where each command gets a persona, and the persona essentially describes who you are asking that information for. So what I did over here is create one command saying, when I type this command with an incident number, give me what a technical customer would want to know about this incident; another command saying, give me a high-level status page post that I can go and post on statuspage.io; and then another command dedicated to asking questions, where I can literally go in and ask the dumbest of dumb questions, the silliest of silly questions that people are often reluctant to interrupt the triage bridge to ask, right? Because if the people who are fixing it start getting distracted and start answering you on what's broken, it's a slippery slope: you're getting updates now, but you're also losing precious time where something could actually be fixed.
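Here's a rough illustration of the persona-per-command idea, assuming each slash command simply selects a different system prompt; the prompt wording below is invented for the example, not the actual prompts we use.

```python
# Sketch: each command maps to a persona (a system prompt), so the same
# incident context gets rewritten for a different audience.
PERSONAS = {
    "hvc": (
        "You are writing for a technical customer. Explain impact, scope, and "
        "current mitigation status precisely, without internal code names."
    ),
    "sp": (
        "You are writing a public status page post. Keep it short and generic, "
        "free of internal details, and output a title and a body."
    ),
    "gen": (
        "You are a patient responder. Answer any question about the incident, "
        "however basic, using only the supplied context."
    ),
}


def build_messages(command: str, context: str, question: str) -> list[dict]:
    """Assemble the chat messages for whichever persona the command selects."""
    return [
        {"role": "system", "content": PERSONAS[command]},
        {"role": "user", "content": f"Incident context:\n{context}\n\nRequest: {question}"},
    ]
```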
I'll go ahead and give you a quick demo, and I'll try to speak about what we thought about, what the ideology behind it was, and how we are actually using it. So for the demo right now, if you see here, I've created a public Slack channel only for this demo, a Conf42 demo channel. People run this in incident channels as well, but just for the sake of this demo, I'll show you this. So right here are the three commands I mentioned that we ended up creating for this. And I keep saying we, because there's a team of six, seven people who worked on it within Twilio, just for the hackathon, to solve the common problem that was in front of us every single incident. /windmill hvc is for generating a technical summary for customers. /windmill sp is for a status page summary, or a simpler, more generic explanation. And then /windmill gen is our general command, where we can essentially ask whatever questions we want.
So for the purpose of this, I'll show you the /windmill gen command. Let's take an example. I can simply go ahead and say: /windmill gen, what are Twilio's core products and how do customers use them? Now if you look at it, all this knowledge and info is present internally in our help center. It's tons and tons, hundreds and thousands of documents. The accessibility is what this was about. And in the next few seconds, if you see here, it just gave me a response and started explaining, for every product, what it means and what customers use it for. Yeah, not impressed?
I'll show you one more.
What if we go a step further and say: /windmill gen, my customer is reporting a 503 error on all of their Twilio services. What could this mean? Again, the reason such features were important is that we have tons and tons of different teams who talk with different parties, like customers and account executives, and all of them have different varieties of questions. It was very important to have a channel where anyone can literally ask anything and get the answers they need. So if you see here, it was able to describe why a 503 error means service unavailable, the common causes, and it was also able to point to some troubleshooting articles in our internal product docs that we have for customers on how they can solve such an issue, or what logs they can gather when they encounter it.
Now, I have a few test incidents right here. These are incidents we have created in our incident tool, so I'm gonna be using them to show you the output. And again, just to remind you, when I generate these it's gonna be static, but during an actual incident the information keeps evolving. So for example, the way I'm talking right now: if this were an actual incident, everything I'm speaking in real time right here gets transcribed and made available to our AI model, and it can contextualize between what is being said in real time on Zoom and what's available in our wiki pages, in our theoretical knowledge base, understand what exactly the user is asking for, and come up with an apt summary.
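As a hypothetical sketch of that context-assembly step, the function below labels and truncates text from each source before it goes into the prompt. The source names mirror the tools mentioned earlier, but the structure is an assumption for illustration.

```python
# Sketch: stitch per-tool text (timeline, chat, live transcript, wiki) into one
# labeled context string the model can cite from.
def assemble_context(
    firehydrant_timeline: str,
    slack_messages: str,
    zoom_transcript: str,
    wiki_excerpts: str,
    max_chars_per_source: int = 4000,
) -> str:
    """Combine per-source text into a single labeled prompt section."""
    sources = {
        "FireHydrant timeline": firehydrant_timeline,
        "Slack channel messages": slack_messages,
        "Zoom bridge transcript": zoom_transcript,
        "Relevant wiki excerpts": wiki_excerpts,
    }
    # Label each source so the model can tell where a fact came from, and
    # truncate so the combined prompt stays within the context window.
    return "\n\n".join(
        f"## {name}\n{text[:max_chars_per_source]}"
        for name, text in sources.items()
        if text
    )
```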
So if I say /windmill hvc 15089, that's actually a test incident. And when I mention 15089 and ask for technical verbiage I can go speak to my customers with, it'll instantly say: this was a test incident, initiated and quickly resolved within just over two minutes, and there was no customer impact.
And then the best part is I can quiz it further: /windmill gen, what time did incident 15089 start? And not just the time, there are layers of questions one can ask, diving into extreme detail down to what each person did. When did we find out about this first? What was the anomaly? And it'll give exact details in real time.
And this is a lifesaver, because if you have been in incidents and you've seen incident chats, there are hundreds if not thousands of messages in a matter of minutes and hours. It's hard to keep track of what's being said and what exactly is the info you are looking for, especially when the incident starts and it's chaos, right? The first few minutes are often chaos because you're trying to wrap your head around where the issue is. Who caused this? Was it a vendor? Is it upstream? Is it downstream? Is it us? Is it a carrier? It's absolutely chaotic.
And the other thing that we found: this model is super real time. It generates responses within 20 seconds, and even just one minute into an incident, you can run this model and it generates answers super, super quickly.
So let's create another one. I'll say /windmill sp. This time I'm gonna ask it something different: create a status page post for me. This is one of the things we are actively exploring and doing prototype testing on, because the eventual goal is to automate this directly to our status page. So when we have an incident, we can be confident enough that we have trained the model to take the info, filter out internal versus external details, and automatically post on the status page. Imagine how impactful and efficient that would be for customers to get notifications. So if you see here, it literally creates it in a title and body format, which is the standard on a status page: there'll be a title, and there'll be a body that describes what the incident is.
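For that eventual automation, a minimal sketch of the final step might look like the following, assuming the Atlassian Statuspage REST API and environment variables for the page ID and API key. This is an illustrative sketch rather than the actual prototype.

```python
# Sketch: take the model's generated title/body and open an incident on
# Atlassian Statuspage via its REST API.
import os
import requests


def post_status_page_incident(title: str, body: str) -> dict:
    """Create a public Statuspage incident from the generated title and body."""
    page_id = os.environ["STATUSPAGE_PAGE_ID"]
    api_key = os.environ["STATUSPAGE_API_KEY"]
    response = requests.post(
        f"https://api.statuspage.io/v1/pages/{page_id}/incidents",
        headers={"Authorization": f"OAuth {api_key}"},
        # "investigating" as the initial status is an assumption for the example.
        json={"incident": {"name": title, "status": "investigating", "body": body}},
    )
    response.raise_for_status()
    return response.json()
```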
And it's super, super useful to take all the focus and all the noise away from the people who are actually troubleshooting, and put all of that in a summarized, systematic way so people can ask questions and get answers. Hey, it's a super simple tool, but it is absolutely effective.
And then the best part is, I won't be able to do it here, but we can generate post-incident retrospective drafts. We can literally ask for what we want, and the model will generate a lot of these for us to use for communications internally and externally.
And so, coming back: yeah, that's how quickly this model's been responding so far, less than 30 seconds, getting answers in a real-time manner. And yes, that's me.
You can scan that QR code, it'll take you to my LinkedIn.
I am based outta Seattle, so if you are in the area, let me know.
Would love to catch up for a coffee, lunch, whatever.
And yeah, I'm always open to collaborating, learning, and sharing what I know on how to make incidents better.
Incident response is something super near and dear to me.
I've been doing this for, as I said, like 15 years at this point.
And communication during incident response is something that I started exploring late last year, when I realized how important it is, and how prudent it is, to keep customers calm and earn that trust, or retain that trust, even in the toughest of times. So once again, thank you, Conf42, for this opportunity, and thank you all for listening.
Have a great rest of your day.