Conf42 Incident Management 2023 - Online

IT incident management training and assessment using generative AI

Abstract

At Uptime Labs, we focus on developing a simulation platform to train IT engineers and managers to respond effectively to incidents. In this talk you will see how we use AI to help practitioners form better hypotheses during an incident. Step-by-step thinking and reasoning will be showcased.

Summary

  • Mahdi from Uptime Labs leads generative AI projects in the space of incident management. Mahdi: The main components related to people management, which we all know, are missing from ChatGPT's definition. There are more people-management elements to incident management.
  • Uptime Labs is an immersive environment that enables people to practice people skills along with technical skills, such as time management. In a gamified environment you can practice, hone, and improve your skills. We believe this will help to hone incident response skills over time.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, my name is Mahdi, from Uptime Labs. Thanks so much for coming to my talk. At Uptime Labs, I lead generative AI projects in the space of incident management. Incident management is very important to me because, over the past couple of years, my team and I have been dealing with machine learning models and large volumes of data, and as you know, a lot of incidents happen in that space.
To start with, I asked ChatGPT to give me a definition of incident management in general. Although the short answer is correct, the main components that we all know are related to people management are missing from this definition. As we all know, there are more people-management elements to incident management, such as effective coordination, effective communication, managing people under stress and pressure, time management, and understanding incident mechanics. Of course, I provided these as feedback to ChatGPT, so hopefully you will get more complete answers about incident management if you ask ChatGPT.
Uptime Labs is a London-based startup, almost two years old at this point. So when you ask ChatGPT for information about Uptime Labs, of course it lacks that information. In this case the winner was Google Bard, which gives us brief information about Uptime Labs. We are an immersive environment that enables people to practice people skills along with technical skills: time management and, as I said, coordination and communication, in a safe environment, without worrying about breaking things. Everything is controlled, and in a gamified environment you can practice, hone, and improve your skills. Through this gamified process you can also improve incident response skills and muscle memory, identify weaknesses from the reports that you get, test incident response plans, et cetera.
So you're going to see in the demo, in a more specific way, what we mean by practicing these skills. Unlike many other solutions out there, our focus at the moment is people and processes. A lot of solutions focus on monitoring tools, monitoring incidents, and providing information about incidents. But we believe that by practicing and improving people skills, navigating complex systems becomes more manageable over time. So this immersive environment, we believe, will help to hone incident response skills over time.
The immersive solution provided by Uptime Labs is an environment where you see a full-blown e-commerce company. This is configurable, of course. You have several characters with different personalities and characteristics, powered by generative AI, and you and your team can interact with them. You also have a complete stack of microservices and Kubernetes, with realistic dashboards provided, and in the demo you're going to see these dashboards and how they work.
The main part of the immersive solution is the virtual characters: characters you can interact with, and who can help you improve your skills. You can provide feedback, and they love your feedback; they keep improving over time. This team is reconfigurable for different drills and different games: they change skill sets and personalities. Of course, they can be stubborn. They have different personalities. They can take several minutes to reply, which really happens in reality: during incidents, under pressure, people respond differently.
These are the critical incident management skills that you can practice. The main ones, the training points, are: identifying the scope of the problem, effective communication under time pressure, structured troubleshooting, understanding incident mechanics, forming effective hypotheses, step-by-step reasoning, and effective coordination. All of these are key to incident management.
And we believe that, with the help of this immersive, gamified environment, you can actually practice these one by one and move forward. I've prepared a demo for this talk where you can see how this gamified environment works. I also have a surprise for you toward the end, so bear with me.
The first part of the demo is about the roles. In your account, you can play different roles, such as incident manager or SRE. You have this virtual online boutique environment that is completely realistic. You can go and place an order, and a variety of products are sold on this platform. You have your team, a full hierarchy from the CEO to the development and platform teams and the escalation team. And this is the hierarchy of the microservices. A lot of different drills are designed and available for you to practice different skill sets. These drills are personalized based on requirements from our customers, and they can be tailored to the skills you want to practice over time.
The next part of the demo is the actual start of the drill. One of the drills I've selected for this talk is powered fully by generative AI. Here you can see the full dashboards. You can see the orders per minute, all the logs and CRFs, and the changes that are happening. In the main monitoring dashboard you can see the logs from across different teams and services. The main communication platform is Slack at the moment, and in Slack you are able to interact with your team. Support for audio and video is planned for the future, and we're working on it, but at the moment it's text-based and the communication happens on Slack. When you start a drill, two main channels are automatically generated for you: an incident bridge, basically, and a business communication channel. When you start the drill, your characters, your team, come and introduce themselves and introduce the company.
Bez is the CEO, the person who sets the tone for the company and gives an introduction to the training. As they're explaining, there's a bit of chat going on to give you some idea of the different characters. In the meantime, you're getting to understand the mindset of the people in your team. Then there's a point at which the incident starts. At that point Bob, who is the customer services manager, comes and reports that customers are not able to place orders. The incident manager's role is to check the dashboards. And you see that things really are going down, things are breaking, user activity is dropping, and you start to feel the pressure from the team. A lot of different messages are coming from different characters, and of course they have different responsibilities. At this point, declaring who the incident manager is is very important, so that the team knows who's taking command.
In this example, Daniel is my assistant, from the development team, and I'm interacting with Daniel to form better hypotheses. Because Dan has access to the logs and the patterns from previous incidents, I like to question Dan about the changes and get a bit of information about them. So Daniel, in this case, is a generative AI agent. In the meantime, as you move forward and practice, you're getting rewards and hints as well, and they go directly into your performance report on how you played the drill.
At this point, I'm asking Daniel about the latest changes, and I would like to know whether there are any similarities or patterns between the current incident and earlier ones. The response I'm getting from Dan lists the previous incidents we had in the last week, such as database capacity, ISP issues, and front-end cache issues. And Daniel says there is a similarity: based on the logs, it's able to match the current incident with the information there.
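Editor's illustration: the talk does not say how Daniel matches the current incident against previous ones. As a rough sketch only, the idea of ranking past incidents by how much their log text overlaps with the current log line can be written with a simple Jaccard similarity over word tokens. The function names, incident summaries, and the log line below are all hypothetical, and a generative AI agent would likely use something far richer than word overlap.

```python
import re


def tokens(text: str) -> set[str]:
    """Return the set of lowercase alphanumeric word tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_similar_incidents(current_log: str, past_incidents: dict[str, str]):
    """Rank past incidents by token overlap with the current log line."""
    current = tokens(current_log)
    scored = [
        (name, jaccard(current, tokens(summary)))
        for name, summary in past_incidents.items()
    ]
    # Highest similarity first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical summaries for the three incident types mentioned in the talk.
PAST_INCIDENTS = {
    "database capacity": "database connection pool exhausted capacity limit reached",
    "isp issue": "upstream network provider packet loss timeout",
    "frontend cache": "frontend cache miss storm requests per minute spike",
}

# Hypothetical log line for the current incident.
CURRENT_LOG = "frontend requests per minute spike after cache change"

ranking = rank_similar_incidents(CURRENT_LOG, PAST_INCIDENTS)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With this made-up data the front-end cache incident ranks first, which mirrors the hypothesis formed in the demo. The point of the sketch is the shape of the reasoning (compare, rank, then inspect the top match), not the scoring method.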
And it shares the exact logs so I can see what they are. I see that the requests per minute are increasing unexpectedly. That is the analysis I gather in the size-up stage, which is one of the main steps of incident mechanics; after size-up, I move on to triage, action, and review. At this point in the size-up stage, after gathering enough information, I come to the conclusion that it's time to revert a certain change. That change is the front-end change, and I believe reverting it is going to help. I give the command to Dan, and Dan takes care of all of that automatically for me. As the incident manager, I keep an eye on the dashboards, and I see that at some point order placement is back up and running. At this point I need to refresh Grafana, and after that change I can see that orders are being placed again and things are back online. So I report back to the team that customers are able to place orders online again.
In this demo I tried to show, within five minutes, the entire cycle of moving from size-up to triage to action to review, and how generative AI helps me better understand the logs, find the patterns, understand the changes, revert changes, and so on within a drill. I hope you enjoyed this demo.
In general, generative AI at this point is helping us with drill design, personalized drill design as I said. The other aspect is automation and being able to control the platform. We're also getting help from generative AI with reporting: pros and cons, performance reports showing where you could improve, and insights that are available in the data and that AI can help extract.
Finally, please feel free to scan this code; the first five people to sign up will receive one month of free access. I hope you enjoyed my talk. Thank you very much, and see you soon. Thank you.
...

Mahdi Jelodari

ML / IoT Specialist @ Uptime Labs
