Conf42 Cloud Native 2023 - Online

Pairing with a ChatBot: Building DevOps Automations with a little help

Abstract

Automation: we know it'll save time in the long run, but sometimes there just isn't enough time to create the automation. This is a recipe for burnout. In this talk, we'll introduce an automation library and create new automations using an AI chatbot.

Summary

  • Doug Sillars: I'm using an AI bot, basically ChatGPT, to create DevOps runbook automations. In DevOps, runbooks are checklists of the things you need to do to accomplish a task. You don't want people to get stuck in their runbooks or checklists.
  • Some Ops teams spend up to 55% of their time on manual, tedious work. Wouldn't it be great if we could automate some of this away? That's where runbooks come in. Build the runbook, and then as you get time, automate more of the steps.
  • Our open source runbook automation is built on top of Jupyter notebooks. We have hundreds of prebuilt actions that you can just drag into your runbook, for AWS, Google Cloud, Kubernetes and more. It's pretty straightforward and easy to get started with.
  • What makes unSkript open source easy to use: if you create a credential, which for AWS is an access key and a secret key, you can reuse it without knowing what the key or the secret is. Once you have actions created, you can just drag and drop them into your Jupyter notebook.
  • Runbooks are a form of internal documentation. They're the checklists that you use when you need to provision something or when you have an outage. Good internal documentation improves outcomes. There are hundreds of built-in automations to help you get started.
  • I'm really happy to have been a part of the Cloud Native 2023 conference. If you have any questions, feel free to reach out to me on the Discord. I would be happy to help with any sort of automation, DevOps or runbook questions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
I'm Doug Sillars, and today I'm going to give a talk about how I'm using an AI bot, basically ChatGPT, to create DevOps runbook automations: building runbooks, automating them, and writing the code with the help of my pair programmer, ChatGPT. So let's just jump right in.
The way I think about runbooks in DevOps is that they're checklists. They're checklists of the things that you need to do to accomplish a task. When there's an outage, your runbook lists all the things you need to do to bring the system back online. Or maybe it's just to provision something: you've got a runbook, blah blah blah blah blah, and at the end you've got it done.
Being someone who works in DevRel, I am really passionate about documentation. If you're coming in to use a product and the documentation is not good, it's incredibly frustrating. The same thing happens internally. You've got an outage and someone says, oh, we have a runbook for that, and you go through the steps and the last step doesn't work, because something changed and the runbook wasn't updated.
So let's just give an example. A few weeks ago I was looking at my GitHub data, and there's this great chart showing you how many people have visited your repo and how many pages they visited every single day. Being in DevRel, this is data that's really interesting, and I don't want it to be ephemeral up at GitHub. I want to load it into a database so that I can keep a history over a longer period of time than it is stored at GitHub, and so that I can control how I'm displaying it. And what's really cool is GitHub has an API. So I was looking at the API and I'm like, this is going to work. I'm going to take this data, I'm going to suck it into Postgres, and that'll have this data.
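As a rough illustration of that idea, here is a minimal sketch of pulling daily view counts from GitHub's repository traffic endpoint and keeping them in a table. It is not the code from the talk. The function names and the `repo_views` table are made up for this example, and SQLite stands in for Postgres so the sketch runs with only the standard library. The token passed in is assumed to have enough access to read traffic data for the repository.

```python
import json
import sqlite3
import urllib.request


def fetch_traffic_views(owner, repo, token):
    """Call GitHub's repository traffic endpoint and return the parsed JSON."""
    url = f"https://api.github.com/repos/{owner}/{repo}/traffic/views"
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def views_to_rows(repo, payload):
    """Flatten the 'views' list into (repo, day, count, uniques) tuples."""
    return [
        (repo, view["timestamp"][:10], view["count"], view["uniques"])
        for view in payload.get("views", [])
    ]


def store_rows(connection, rows):
    """Upsert one row per repo and day so daily reruns do not duplicate data."""
    # Hypothetical table layout; SQLite is a stand-in for Postgres here.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS repo_views ("
        "repo TEXT, day TEXT, views INTEGER, uniques INTEGER, "
        "PRIMARY KEY (repo, day))"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO repo_views VALUES (?, ?, ?, ?)", rows
    )
    connection.commit()
```

Keying the table on repository and day means the job can run every day over GitHub's overlapping reporting window without creating duplicate rows.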
So I followed all the instructions in the documentation, and of course I got a 403 error saying that this resource is not accessible by a personal access token. Of course, the documentation says you need a personal access token. So I banged my head against the wall for a couple of days, or a couple of hours, and what I discovered is that I didn't have the right access in my personal access token. The error message was just wrong, and I eventually got it working. Now this is getting sucked into a Postgres database every single day, and I can collect this data, which is really awesome. But we got stuck. And when you have documentation or you have a runbook, you don't want people to get stuck in their runbooks or their checklists.
Another stat that's out there for cloud native, for SREs, for DevOps, is that the Ops teams, the application support teams and the SREs are spending up to 55% of their time just doing stuff: the stuff that gets into your inbox and that you've got to do, but that isn't the most important thing you need to get done. It's that manual, tedious stuff. Wouldn't it be great if we could automate some of this away? This is where runbooks come into place. If you have a good runbook, boop, boop, boop, boop, boop, maybe you can finish it faster. If we can automate some of those steps, then we're getting even a step further.
So there's this great blog post. The URL is down there at the bottom, and it'll be at the end and in the slides. This guy works for a gambling website, and if a gambling website goes down, they're losing money, because really, all a gambling site is there for is to take money from people. He found that when he built a thorough library of runbooks, all of their issues were getting resolved faster. He found that escalations were easier, because the NOC or the team that was on call could run through the runbooks before they called the person.
So rather than just calling Tom, they could try the runbook. Boop, boop, boop, boop, boop, boop, boop. And if it doesn't work, that last step is: call Tom. He found that hiring new developers and new DevOps members of the team was easier, because if they had a question, the answer was, oh, we've got a runbook here, read how it works. They could read how everything is provisioned and how it all runs, and that just goes into the training, right? If things work right, it's easier to train people on how to use them, because you already have everything documented. And finally, as people found how useful the runbooks were once they worked, they kept at it and kept updating them, so you never ran into this bad documentation where the runbook was out of date. Then the next step is they started automating them, and that was speeding things up even further.
Now, when you read this guy's post, he got seven months. When he started at the company, his boss gave him seven months to focus solely on every single issue, every single outage, and create a runbook for everything. And then ongoing, there was this 10% work to keep them going. But by then the whole team was on board, right, because they had found how great these runbooks were.
Now, the thing is, of course, very few of us have seven months to do this. So another approach is from this other blog post here, which is called "Do-nothing scripting: the key to gradual automation." His whole goal is: let's open up a notebook, say a Jupyter notebook, or even just a text or code editor, and write out all the steps. These are the steps we need to do. So build the runbook, and then as you get time, automate some of the steps. So it's like: do this manually, do this manually, run this code, do this manually, do this manually, run this code.
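The "do this manually, run this code" pattern can be sketched in a few lines of Python. This is an illustration of the idea rather than code from the blog post or from unSkript: the helper names `manual`, `automated` and `run_runbook`, and the provisioning steps themselves, are invented for the example.

```python
def manual(instruction):
    """Build a step that only tells the operator what to do, then waits."""
    def step(ask=input):
        ask(f"MANUAL: {instruction}\nPress Enter when done. ")
        return "manual"
    return step


def automated(description, action):
    """Build a step that runs code instead of asking a human."""
    def step(ask=input):
        print(f"AUTO: {description}")
        action()
        return "automated"
    return step


def run_runbook(steps, ask=input):
    """Run every step in order and report how each one was carried out."""
    return [step(ask) for step in steps]


# Hypothetical provisioning runbook: one step automated, two still manual.
created = []
runbook = [
    manual("Request a new database user from the DBA team."),
    automated("Record the new user locally.", lambda: created.append("db-user")),
    manual("Send the onboarding link to the new hire."),
]
```

Calling `run_runbook(runbook)` walks an operator through the list. Converting a manual step later means swapping `manual(...)` for `automated(...)` on that one line, which is what makes the automation gradual.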
And if you automate some of those steps as you use the runbook more and more often, you just automate more and more steps of it, and gradually you come to the state where the runbook becomes fully automated. I would add one more step here: if you have all these runbooks, that's great, but if they're not somewhere where the rest of the team can get to them, you're still going to be the person who's getting paged.
So let's talk a little bit about unSkript and the open source tooling that we have to help you automate your runbooks. It's all open source. You can see the URL of our GitHub repo down here at the bottom, so you can check that out if you're interested. Our open source runbook automation is built on top of Jupyter notebooks. What's great about that, if you go back to that whole do-nothing scripting idea, is that you can add text sections in the middle and then add code in between as you want to automate stuff. The other advantage is these are online, so it's easy to share amongst the team. In our open source you have a Docker image that everyone can have access to. We also have an enterprise version that is in the cloud.
You've got your text and markdown fields, as I was alluding to a second ago, where you can write down your do-nothing scripting: these are the steps. And then in between, and I've minimized the code here just so that it all fits on the screen, we've got automation fields. It's all Python based, so it's pretty straightforward and easy to get started with. You don't need that seven-month kickoff. You can start building your automations bit by bit when you have a couple of minutes, and you can build these runbooks very, very quickly. Another great advantage is that we have hundreds of prebuilt actions that you can just drag into your runbook, for AWS, Google Cloud, Kubernetes, lots of databases, Jira. There's about 30 of them.
And we have about 400 actions that can just be dragged in and used by wiring them up to your credentials.
So here's an example runbook, and this is a Kubernetes health check. The first action lists all the pods in our namespace, then we get the logs. Then I wrote this code here, just in Python, to parse the logs and look for warnings, right? And if there's a warning in the logs, we post a message to Slack: hey guys, found a warning in the logs, here's what it is, here's the pod that's having issues, we can resolve this. This is the beginning of a full automation. Once we've diagnosed that something happened, maybe we could add some actions to auto-remediate the issue with that Kubernetes pod. The cool thing is, when I built this, these three actions right here were all prebuilt. I just had to drag them in and wire them up with the configuration and the credentials to log into my Kubernetes namespace, and then I had to write this one. This whole runbook is available in our open source, so if you want to use it, you can just use the entire runbook, it's there. You just have to wire up the four different steps here, and you can run this on the regular to see if there are any issues with your Kubernetes deployments.
So let's talk a bit about these actions. In this screenshot here, you can see it says 342. We're bordering right at almost 400 right now, so it's growing rapidly. They're all Python, so it's very straightforward, and if the desired action doesn't exist, you can write your own. You can see here that when I took this screenshot I had 24 that I had written. You can also extend an existing action. An example I like to give is that we have an action that will list all the open pull requests at GitHub, but we don't have one that lists all the closed pull requests.
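To make the open-versus-closed example concrete, here is a small sketch built directly on GitHub's REST endpoint for listing pull requests, where the state is a single query parameter. This is not the unSkript action's source. The function names are hypothetical, and only the first page of results is fetched.

```python
import json
import urllib.parse
import urllib.request


def pulls_url(owner, repo, state="open"):
    """Build the GitHub REST URL that lists pull requests in a given state."""
    if state not in ("open", "closed", "all"):
        raise ValueError(f"unsupported state: {state}")
    query = urllib.parse.urlencode({"state": state, "per_page": 100})
    return f"https://api.github.com/repos/{owner}/{repo}/pulls?{query}"


def list_pull_requests(owner, repo, state="open", token=None):
    """Fetch one page of pull requests and return (number, title) pairs."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(pulls_url(owner, repo, state), headers=headers)
    with urllib.request.urlopen(request) as response:
        return [(pr["number"], pr["title"]) for pr in json.load(response)]
```

With the state exposed as an argument, one action covers both cases: `list_pull_requests(owner, repo, state="closed")` returns closed pull requests without any change to the code.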
But if you go into the Python code, you can see where it says open. You can change that to closed, and you've changed the entire functionality of the action: it will now list all of your closed GitHub pull requests. Or, if there's not one that's close enough, you might just have to write some Python code to create a new action, and this is where I got started with ChatGPT. A new action can connect to an external service: do something with Jira, or with GitHub, or with Google Cloud or Azure or AWS. Or you could create one of these glue actions, like I did in the Kubernetes runbook, where I just took the output of the logs and did some parsing: are there any warnings? And then send the message to Slack if there are warnings.
Here are two actions that I wanted to create. I wanted to be able to tag an EC2 instance. Tagging is a common way for people to understand what that instance, that virtual machine, is being used for. If you have good tagging, then it's pretty obvious which ones are for production, which ones are for staging, and which ones need to stay up. It's also a way of managing cost: if a project shuts down and this EC2 cluster is tagged with that project, you can turn it down, it won't hurt anything, and it'll save the company money. The other one: I wanted to look at all of my Google Cloud virtual machines and know whether they were public or not. You can't go a week without hearing about some company's S3 buckets or virtual machines being exposed to the Internet and getting hacked. So if you wanted to write some security runbooks, this is one way you could do that.
And for my copilot to help me write these, I thought it would be fun to have ChatGPT help me out. If you haven't heard of ChatGPT, the URL is down there: chat.openai.com/chat.
You can log in and ask it questions, and it will write poetry for you. It will write essays for your school. Don't do that, because your teachers know and your professors know, but you could. And it also writes code. When I created this talk, we were using ChatGPT-3; number four is coming out in beta right now, and I'm super excited to check it out, but I don't have access yet.
So here's how it works. If you have ChatGPT right here, you can just ask it a question. You can see I said, can you write a Python script to add a cost center tag with the value marketing to an EC2 instance? And ChatGPT comes back and says, sure, I can do that for you. It's importing boto3, which is the AWS SDK for Python. It sets up your instance ID, puts a key and a value in variables, and then makes the API call. And just like that, here's your code. I like that it's all commented nicely, and it also gives you a description telling you exactly what it thinks it needs to do to make this work.
So there's the code, and I could take this code and drag it right into my unSkript runbook. When I ran it, it didn't work quite right. The reason is that when you create your boto3 client, you also need to put a region in here. So ChatGPT got it this close, like, we're almost there. Of course, the error message told me exactly what was wrong. So then I said, hey, ChatGPT, doesn't it need a region? Let's use a variable for the region, and let's give that variable the value us-west-2. And ChatGPT says, oh yeah, I made a mistake. That's right, it does require that. So now we've got the region, and you can see it's setting the region name equal to region. The rest of the code is similar to what we saw in the first video. And when we do this, we can see that the response comes back and we get a 200 response, meaning OK, it actually worked. This is super, super cool.
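The corrected script is only shown on screen, so the following is a reconstruction of its general shape under stated assumptions, not the generated code itself. It uses two real boto3 calls, `boto3.client("ec2", region_name=...)` and `create_tags`. The function names are invented, and boto3 is imported lazily so the tagging logic can be exercised with a stand-in client.

```python
def make_ec2_client(region):
    """Create an EC2 client for one region (the piece the first draft missed)."""
    import boto3  # third-party dependency: pip install boto3
    return boto3.client("ec2", region_name=region)


def tag_instance(ec2_client, instance_id, key, value):
    """Add one key/value tag to an instance and return the HTTP status code."""
    response = ec2_client.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": key, "Value": value}],
    )
    # A status of 200 means the API call was accepted.
    return response["ResponseMetadata"]["HTTPStatusCode"]
```

A call would look like `tag_instance(make_ec2_client("us-west-2"), instance_id, "CostCenter", "marketing")`, where `instance_id` is the ID of your own instance and the tag key spelling is an assumption. Passing the client in as an argument keeps credentials and region handling separate from the tagging step.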
What I then did, as you can see, is set inputs for my key, my value and my region, and that makes it even more modular. So I have this action, and I can feed variables into it, almost like an API or a microservice. It takes those variables, makes the connection to EC2, and creates the tags. And this works. If you go into unSkript today, this action is there, and it was built using ChatGPT. You can see the result now. I ran it twice, once with cost center and marketing, and then I added a one and a one just to show that it happened twice. So it does actually add the key-value pair that you want to your EC2 instance.
So let's walk through how I set this all up once you have the code written. What makes unSkript open source so easy to use is that if you create a credential, which for AWS is an access key and a secret key, you can reuse that without knowing what the key or the secret is, just by selecting your AWS credential by whatever name you gave it. You can select this for all of your actions, and it'll just run. It'll say, oh, I know which key value to use, it's stored over here in vault. And now I can run this action, and you can see my variables: I have a region variable, an instance, and a key and value. By setting these all as variables, I can run this action and it will tag this instance in us-west-2 with cost center one and marketing one. Once you have actions like this created, you can just drag and drop them into your Jupyter notebook, and it makes it really, really easy to use.
So then the second one I'm going to build here in this video is to get all of the Google Cloud virtual machines and then see if they're publicly available. I asked this question, and I had to be a little bit more specific here, because when we build things here at unSkript or with our open source platform, I want to use the same SDK, and it was using a different SDK.
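Whichever SDK fetches the list, the check itself is small. The sketch below sidesteps the SDK question by working on plain dictionaries shaped like the Compute Engine REST API's instance resources, where an external address appears as `natIP` under `networkInterfaces[].accessConfigs[]`. The function names are hypothetical, and "has an external IP" is used as a simple stand-in for "public"; firewall rules are not considered.

```python
def external_ips(instance):
    """Return every external (NAT) IP attached to one instance resource."""
    return [
        access["natIP"]
        for interface in instance.get("networkInterfaces", [])
        for access in interface.get("accessConfigs", [])
        if access.get("natIP")
    ]


def public_instances(instances):
    """Map instance name to its external IPs, keeping only exposed VMs."""
    exposed = {}
    for instance in instances:
        ips = external_ips(instance)
        if ips:
            exposed[instance["name"]] = ips
    return exposed
```

Keeping the check a pure function over dictionaries means it can be tried against saved API output before it is wired to live credentials.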
So I said, hey, can you just make sure that you're using the Google Cloud project library to do this? And then here's my project, here's my region. Actually, first it says, no, I can't, because I don't know how to do that and I can't execute the code, but here, I can give you example code of how it would actually work. So it instantiates the Compute Engine client, it sets the region and the project, and for all of the VMs it just checks to see if each one is publicly available or not. And if it is publicly available, it prints it in the list. It says you need to install the library and you need to have authentication. And that's the cool thing about unSkript: we take care of all the authentication and we take care of the library.
So in summary, runbooks are a form of internal documentation. They're the checklists that you use when you need to provision something or when you have an outage: the steps that you need to do to resolve the issues. When you have good internal documentation, it improves the outcomes, right? It's just like product documentation: when a company has good documentation, they make it really easy to learn, and I get started really easily. If you have good documentation, you have good outcomes. You're going to lower the mean time to response. Compare that with: step three isn't working. Oh yeah, we changed something. Right? If the stress levels are already really high, that just moves them even higher. Good runbooks also improve your team collaboration: This isn't working. I built a runbook for that, try this. Oh, that solved it, thanks. And then we can automate them. If we can automate them, then we don't have to go through all the manual tasks ourselves.
We can zip through at least a few of those steps because we've automated them, and we let the computer do the repetitive, boring bits. By automating those steps, we're reducing that toil, that day-to-day work that sometimes takes as much as 55% of some professionals' time: just doing the mundane things we need to do to keep everything running. You could build auto-remediations. Say a Kubernetes pod is unhealthy, or your VM is publicly available: you could just make it not publicly available, right? Hide it. Turn off that IP forwarding, don't let it be accessed. It could be automatically remediated, so we don't have a problem. By increasing the observability, by testing these things on the regular, we're going to be alerted if something has changed. We also have runbooks now in unSkript that look at your cost spend every single day, so you don't get a surprise AWS bill at the end of the month. You'll get an alert within 24 hours, or maybe 48 hours, saying hey, your spend went up earlier this week, did you know that? And you can go back and say, oh yeah, I turned on a bunch of xlarge machines, I should spin those down now.
So unSkript has this neat niche where we're open source. We're built on top of Jupyter notebooks, which makes the runbooks available to the whole team. They help you automate. There are hundreds of built-in automations to help you get started really, really quickly, and they help you build these runbooks in an automated way. So you're improving your outcomes, you're lowering your MTTR, you're reducing your toil, and you're increasing your observability.
If you use this along with ChatGPT, you get this prototyping of your automation. While unSkript is really fast at getting you to the state where you have a runbook, once you add ChatGPT in there, you actually get there even faster, because if you have to write an action, ChatGPT can take you there, shaving off 80% of the time, and then you've got your automation even faster.
So with that, thank you so much for watching the talk. Go check out unSkript. We're at runbooks.sh, and while you're there you can see the Docker instructions to download, install and run it. Give us a star while you're there. If you want to read more, we have lots of blog posts and documentation at unskript.com. If you want to play around with ChatGPT, it's a lot of fun, I recommend it: chat.openai.com. And then there are the two blog posts I talked about, the guy who built 1800 runbooks over seven months, and do-nothing scripting. Those are the links there.
And with that, thank you so much for watching. I'm really happy to have been a part of the Cloud Native 2023 conference. If you have any questions, feel free to reach out to me on the Discord. I am there, so ask me questions. I would be happy to help with any sort of automation, DevOps or runbook questions. Thanks again and I'll see you in the Discord server.
...

Doug Sillars

Head of Developer Relations @ unSkript
