Conf42 Incident Management 2023 - Online

How to use generative AI to finally fix outdated runbooks

Abstract

Incidents happen. The runbook process to deal with them is broken, so incident response teams escalate far more often than they should.

Generative AI offers an exciting new path to quickly create and edit up-to-date runbooks using copilot techniques. We’ll show you how to do it safely.

Summary

  • Shoreline aims to solve the problem of runbook automation: handle the false positives and the simple, repetitive issues, and make sure that even for a complex issue, all the diagnostics are in hand before a human has to come to the machine to take an action. Shoreline makes it easy to create runbooks, and that's the subject of this talk.
  • Because LLMs can hallucinate, Shoreline manually curates the runbooks it creates and those created by its customers. The core problem with runbooks is that they don't actually run; once runbooks run in a controlled environment, you can apply fine-grained access control and auditing.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

I started Shoreline to solve the problem of runbook automation: deal with the false positives, deal with the simple, repetitive issues, and make sure that even for a complex issue, you have all the diagnostics before a human has to come to the machine to take an action. It just made so much sense to me after running eight teams at AWS. There was one big problem, though: hardly anyone had runbooks, and even when they did, they were stale or inconsistent with other things out there. So what was someone to do? What we did here at Shoreline is decide to make it easy to create runbooks, and that's what we're talking about here today.

About a year ago, when LLMs first started becoming popular, even before ChatGPT took off, we started working on prompt engineering using the LLMs that were available then. Of course, we've made those prompts better over time, both for diagnostics and for repair.
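The talk doesn't show the prompts themselves, so as a rough illustration of the kind of prompt engineering being described, here is a minimal diagnostic-generation call sketched in Python. The OpenAI client, the model name, and the prompt wording are all assumptions for illustration, not Shoreline's actual implementation.

    # Illustrative only: a minimal "generate diagnostics for an alert" prompt.
    # The model, wording, and guardrails here are assumptions, not Shoreline's.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    alert = "Apache server down on web-01"
    prompt = (
        "You are an SRE copilot. For the alert below, produce a numbered "
        "runbook of read-only diagnostic shell commands (no destructive "
        "actions), each with a one-line explanation of what it checks.\n\n"
        f"Alert: {alert}\n"
        "Target: an Ubuntu VM running Apache behind a load balancer"
    )

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message.content)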
Let's take a look at what that looks like. Over here you'll see a top problem report, where we're pulling data in from a ticketing engine, in this case PagerDuty. We do that on a continuous basis and measure MTTA, MTTR, the number of people involved in an incident, how long it took, and the number of times we saw it. So you might say: let me work on the things that have the largest overall aggregate MTTR. In this case, that's the Apache server being down.

The way we calculate these groups is that we apply lightweight machine-learning clustering algorithms to see which tickets look similar, so that, for example, a hostname appearing in a ticket doesn't matter. Then we also apply some semantic understanding using an LLM to say: this ticket is talking about a disk being full, this other one is talking about a persistent volume claim being full, and those are actually the same issue, or at least they can be addressed with the same diagnostics and repair commands.
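As a minimal sketch of that grouping step (not Shoreline's implementation), here's one way to mask instance-specific tokens and then cluster ticket titles by text similarity using scikit-learn. The regex patterns, threshold, and example tickets are illustrative, and the LLM-based semantic matching just described would layer on top of this.

    # Mask host-specific tokens so they don't split otherwise-identical
    # tickets, then cluster by text similarity. Values are illustrative.
    import re
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import DBSCAN

    def normalize(title: str) -> str:
        # Replace instance-specific tokens so clustering keys on the symptom.
        t = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", "<ip>", title)  # IPv4 addresses
        t = re.sub(r"\b[\w-]+(?:\.[\w-]+){2,}\b", "<host>", t)     # dotted hostnames
        return t.lower()

    tickets = [
        "Apache server down on web-01.prod.example.com",
        "Apache server down on web-07.prod.example.com",
        "Root disk full on 10.0.3.12",
    ]

    vectors = TfidfVectorizer().fit_transform([normalize(t) for t in tickets])
    labels = DBSCAN(eps=0.5, min_samples=1, metric="cosine").fit_predict(vectors)
    print(dict(zip(tickets, labels)))  # same label => same incident group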
So let's take this particular issue, the Apache server being down, and generate a runbook. What you can see happening is that, second by second, it's running and getting me all the diagnostics: check the status of the VM, check the logs, see whether the deployment is running, see whether the VM itself is accessible, whether the necessary ports are open, and so on. I can also pick other things. Maybe I want to add a script to ask: did it crash? Do I have to restart it? Now it's going to take a second or two and generate that script for me, I hope. And there we are: a bash script that does exactly that.
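The generated script itself isn't reproduced here, but as a rough sketch of the crash-check-and-restart logic being described, this Python equivalent assumes a systemd-managed Apache; the service name and log depth are illustrative and vary by distribution.

    # Check whether Apache is down, capture recent logs as diagnostics,
    # and restart it if needed. Assumes systemd and an "apache2" unit.
    import subprocess

    SERVICE = "apache2"  # "httpd" on RHEL-family systems

    def is_active(service: str) -> bool:
        # `systemctl is-active --quiet` exits 0 only when the unit is running.
        return subprocess.run(["systemctl", "is-active", "--quiet", service]).returncode == 0

    def diagnose_and_restart() -> None:
        if is_active(SERVICE):
            print(f"{SERVICE} is running; nothing to do")
            return
        # Grab the last 50 unit log lines before acting, for the audit trail.
        logs = subprocess.run(
            ["journalctl", "-u", SERVICE, "-n", "50", "--no-pager"],
            capture_output=True, text=True,
        ).stdout
        print(f"{SERVICE} is down; recent logs:\n{logs}")
        subprocess.run(["systemctl", "restart", SERVICE], check=True)
        print("restarted" if is_active(SERVICE) else "restart did not recover the service")

    if __name__ == "__main__":
        diagnose_and_restart()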
You can also add your own diagnostic prompts here, and you can add remediations. In this case, let me go and do a restart and get that into my bash script. There we are. You can run this on Kubernetes, on VMs, on Linux or Windows, on the three major cloud providers, and so on. And once you're happy with it, after adding cards, replacing things, and so forth, you can export it to Markdown, in this case maybe a Confluence wiki, or you can export it to Shoreline. We'll talk about that in a moment.

One of the problems with LLMs is that they can hallucinate; we've all seen that over the last year. So one of the things we do here at Shoreline is manually curate the runbooks that we create or that are created by our customers. We're at a little over 300 right now, and I hope to get to about 1,000. That doesn't mean you're going to use all thousand. It means that across the variety of things you use, there's a runbook for you that's already been created and tested, so you can start from a clean place, and you don't have to run into an issue just to have the repair ready at hand. For example, here we're seeing MongoDB issues. I'm not an expert at MongoDB, but isn't it nice that we have eleven runbooks from people who are?

Next: for me, the core problem with runbooks is that they don't actually run. You have to cut and paste into every node that you want to modify, and that makes it super inefficient. So wouldn't it be nice if our runbooks were like Jupyter notebooks and had both the markdown and the repair actions right there? This one is for an application load balancer that's running into 500 errors, and it basically says: across all my hosts that are running an ALB, go grab the load balancer names, get their details, describe the ingress paths, and so on, and eventually maybe even do some remediation with a heap dump and a rolling restart.

Last but certainly not least, once you have these things in a controlled environment, you can apply fine-grained access control and audit capabilities. For example, this is a notebook run for high CPU, and there's a set of actions we want to take: make sure we have the right number of bookstore instances, make sure the release is correct, check them for high CPU over a minute so that we aren't bothering anyone if the issue has already gone away, keep the metrics over time, and list the top processes or threads using a top command. In this case, I see that it's a JVM thread, so I probably have a JVM issue, and I might go look at the logs. Yes, it's definitely running into allocation failures, so let me do a dump and restart, and afterwards validate that the issue is gone. All of that was done in the past, but I have all of the data from an auditability perspective, including what was returned on standard out and standard error. That's really important. The other thing that's important, which I'm not showing here, is that you can also apply fine-grained access control: not just who can run a runbook, but who can run which actions on which resources at what time, for example only when on call; what approval workflows the more sensitive commands require, and where those approvals should go; and how to integrate all of this with your observability tool of choice, with Slack or Teams, and with your incident management and ticketing tools.

So hopefully that gives you a quick flyby overview. I hope you found it interesting. You can always reach out to me at anurag@shoreline.io if you want to hear more or have any questions or feedback. Thank you so much.

Anurag Gupta

Founder & CEO @ Shoreline.io
