Conf42 DevOps 2024 - Online

Reduce Toil by Improving Your Automation



Toil is a four-letter word. No one likes it, but it has to get done. You create some scripts for repetitive tasks for your team. That’s just the first step for tackling toil and reducing interruptions. Provide everyone in your organization with access to your expertise in a safe, auditable way with Rundeck.


  • Mandi Walls is a DevOps advocate at PagerDuty. We'll talk about some automation and hopefully controlling some toil with automation. If you like, get in touch with me. I'm happy to chat about stuff.
  • When organizations get larger and their services get more complex, you start to naturally see some of this siloing off. How much organizational expertise gets shared varies from team to team, right? We need a way of dealing with getting through all this stuff.
  • We can use AI to help build automation. You may have more or less success with this, depending on the nature of your environment. But automation itself isn't AI. Even if we're using AI tools, we still need to invest people hours into maintaining it.
  • We also want things to be repeatable. I want the tool to do the same thing every time it runs. And I want that to be centralized, so there's not a lot of confusion that I then have to deal with.
  • We are designing for delegation to another team. Part of that is going to be supporting how our users work. We also want to provide results that make sense whether things are successful or not. This leads really well into folks on a platform engineering journey.
  • You can also think about automating the tasks that you do most often or the stuff that takes the most time. And get yourself to the point where you figure out, okay, we're going to get back x number of person hours every week by automating this thing.
  • Starting on the left we have automation opportunities. The most common phase is human-initiated automation. Automation with oversight runs on its own in response to some environmental trigger. Not all of your tasks are going to reach all of these phases. If you want to talk about PagerDuty, we are always available.


This transcript was autogenerated. To make changes, submit a PR.
All right, thanks for listening to my session. We're going to talk about some automation and hopefully controlling some toil with automation. I'm Mandi Walls. I am a DevOps advocate at PagerDuty. I'm lnxchk on most of the social medias. If you like, get in touch with me. I'm happy to chat about stuff and you can always email me, which is great too, right? We all get email. So I'm going to talk broadly about automation and complex environments. But let's start with a task that is pretty common, maybe more common in larger organizations than smaller ones, but as an example, hopefully it makes sense to people. So I have a developer. Her name's Alice, and Alice works on a microservice. It's a feature in a customer-facing application. When Alice or one of her team members is ready to push some amount of work into the main pipelines, they're responsible for doing some first sort of sanity checks on their code. And because of the way it fits into this microservices environment, they need a bigger integration environment than they could just run on their laptops. So they do this in the cloud. So to ensure that her code is going to work, Alice needs the sanity check environment in the development cloud account. Unfortunately, because of compliance or requirements or whatever, Alice doesn't have direct access to the cloud account. That's restricted. So she has to put in a ticket with another team and they'll do it for her. So her request is submitted to the cloud operations team. It goes into a queue with everybody else's requests, whatever might be happening for cloud ops, and it'll get handled eventually, right? The first subtask for her ticket is to get approval from FinOps. The FinOps team reviews the request tickets, and if a team is over budget, or they have too many environments already, or they haven't cleaned up all of their old stuff, the ticket's rejected and gets sent back to the requester to try again.
If FinOps approves the ticket, it proceeds to cloud ops. At some point, hopefully, Alice will get her environment, but it feels so clunky. We want to make sure that all the boxes are checked, that we're using the most recent security profiles, that we're using the right firewall rules, that we're following all the rules and regulations and financial guidelines and all of those other things. But this takes time away from getting work done, and Alice should be continuing on her development journey, getting more stuff done. She wants to get back to work so she can ship more awesome features. So we need a way of dealing with getting through all this stuff. The hard part is, you know, all of the folks in our example have goals and projects and things that they're responsible for. They have different products and services that they're permitted to have access to, but most of all, and sort of most importantly, they have specific expertise for their roles, right? So how much organizational expertise gets shared varies from team to team, right? Sharing is definitely common in small companies. If you work in a startup, you might be one day helping fix a printer, and the next day you're provisioning stuff in the cloud, and the next day you're debugging the mobile app. It totally happens. But when organizations get larger and their services get more complex, you start to naturally see some of this siloing off, because their requirements now need somebody to go really deep in that product, in that environment, or whatever it happens to be. So folks need to cultivate more expertise. In our fictitious organization, the development team isn't permitted to directly provision assets in the main cloud accounts. They don't necessarily even know how to do that. They're in there, they're writing the code, they know their libraries and the runtimes and the things that impact their application, and they just want to get code into production. Right.
Then we've got our FinOps folks, and they're analysts mostly, right? They may not even do a whole lot of development work other than some tooling, but they're there to watch the spend and estimate costs and make some recommendations on the budgets and say, well, we've been deploying this over here, but it might be cheaper if we moved it somewhere else. It doesn't really impact us. They do that kind of analysis. So they know their stuff. And then cloud ops is there. They maintain the access and security and make sure everything is constantly updated for best practices or recommended practices for the environment that they're using. And they're sort of responsible for how all those things fit together. And then they provide environments back to all the other teams. So each of these teams knows their stuff, but they all have enough stuff that they're responsible for to keep them busy. And it's challenging to try and cross-train folks across all these tasks. So unfortunately, we get to a point where we've got these important skills that get siloed off, and it happens kind of naturally. Eventually, teams get too large to share all of this stuff. We want to get to the point where teams are spending time doing important work, shipping code to production and keeping customers happy. But we still have to get all the boring stuff accomplished, and that's where we're going to need some automation. So what is automation? We know it sounds pretty good, right? We know that we probably want some. And the dirty secret of software development is really that a lot of the work that goes into getting software from idea to production has nothing to do with the application code itself, right? It's all that other stuff that has to happen. We want our code to have a nice, safe, cozy place to live when it gets to production. That has nothing to do with what the application is going to do, right? So our cloud ops team knows how to get that done.
But once you've done that once or twice and you know how to do it, it's not that exciting. We need to get stuff there the right way, with the right controls, in the right environments. We don't want to do it ourselves. The application engineers don't really have the time to get well versed in all the hundreds of different things available in our cloud platform, and we might not be able to give that access to them anyway. So we look for another layer, another process, to provide access in a safe way. So we're going to dig a bit into automation and we're going to dig a bit into building for delegation. First we'll take a short side quest, because we get this question from time to time. Both of these: is AI automation? And is automation AI? Well, let's take a step back. We can use AI to help build automation. You may have more or less success with this, depending on the nature of your environment. If you use a lot of open source software, a lot of publicly available stuff, you'll have more luck than if you're using a bunch of COTS stuff or everything you have is homegrown, obviously, because these large language models that you might be asking questions of are using publicly available information. And if the information on your software is behind a login or a support portal that they can't get into, they're not going to know stuff, right? So you're basically limited to what's been published online. So if you are using a lot of open source, you might be okay, right? So it can at least help you get your automation built. But automation itself isn't AI. It's not at a point yet where we can build some automation and have it learn and grow with your systems. It's not at a point where it finds an error code and then goes out and looks up that error code and then creates more code around it, to the point where you don't even know what is in there. So we're not at that point yet.
So we're not at the point where automation is going to let you just sit back, relax, have a margarita, and hope that the automation is going to take care of all of this for you. There have been a number of research trends over the years in complex systems, stuff about self-healing, stuff about intelligent systems, and there are lots of pathways forward for these to happen, but we're not there yet. So we're going to continue to write most of this stuff ourselves and keep an eye on it, because we're looking for a handful of basic benefits, right? Because we are going to invest in automation. Even if we're using AI tools to help us generate it, we still need to invest people hours into maintaining it, making sure it works, all that great stuff. So it is an investment, right? It doesn't happen overnight. Hopefully you're not doing it in your off hours. You want it to be part of the actual products that you're working on. So you've got these complex environments. They're built with hundreds or even thousands of individual components. So you've got all this complexity and all this stuff, and it's more than most humans are capable of rationalizing across all the environments. So we want to be able to contain all of that so that we don't get too many people confused about it. Then we want to be able to cope with all the change that comes downstream to us from all those components, because things are updating constantly. You might be getting regular emails: hey, we're sunsetting XYZ. Hey, the deprecation date on this is April 30, or whatever it is. You want to be able to maintain all of your components in a way that reduces risk, right? Because we don't want to be running stuff that's end of life, but we still want to be able to maintain things in a sane and rational way. To do that we're probably going to need some automation to help us out. And then the hard part is that with all these components, they're all slightly different.
Some of them are configured in YAML, and some of them are in their own markup language, and some of them are maybe still using a whole lot of GUIs or dot files and that kind of stuff. And it's hard to do all of that work without making mistakes. And we're used to bugs. We're software developers, it happens. But we want to get to a point where we're reducing that risk as much as possible. And if we're doing the same process over and over and over again, and multiple people have to be able to do it, the best way to do that is to encode it somewhere that makes it easier, so people aren't copying and pasting and accidentally missing a line or missing the long strings of options at the end of a command. And obviously we want to get to the point where we're reducing toil. If you're not familiar with the word toil, it's kind of the way that the Google SRE books talk about work that is, let's say, boring, right? They don't usually use the word boring, but it's the work that has to get done. And it expands sort of linearly as your environment grows, right? So things like doing your security updates, or making sure that all of your containers are in line, all that kind of boring stuff that has to happen. You're rerolling images for security updates, all that kind of stuff. So we know it has to get done. It's like brushing your teeth. If you skip a couple of days, maybe that's fine, but if you skip months, you're going to have a problem, right? So you don't want your teeth falling out because you're not doing your toil work. So there are some potential drawbacks to automating too much stuff, and I'll say too much, but you kind of get to a point where things are more comfortable. One to look out for is loss of expertise, because you're not going to be able to completely replace everybody with automation. But you don't want to get to a point where folks don't know what's going on.
There's a lot of research in systems engineering and automation engineering that talks about what you do with junior engineers and new team members in automated environments. You want to make sure that they are constantly in the workflow and constantly working with the products, to make sure that they can maintain the automation and they know what's going on. When the more complex problems come through that the automation isn't built to work through, they still understand what's going on. And then you have a little bit of risk from brittleness, right? Automation is part of your lifecycle. The systems are part of the automation as much as your automation code is. It needs to change when the services change. It will probably need to be updated when the operating system or dependencies are updated. And you want to be in a place where you can maintain those tools for a service as well as you maintain the service itself. So you don't want to get to a point where you cannot relaunch a thing because you've lost the start scripts, or you don't know where that stuff got to, or nobody knows how to maintain it. We don't want to get to that point where we're breaking things because the automation hasn't kept up. So we want to keep up on that. And another big one is more political and social than technical, but it's so prevalent it's basically a bad management meme, like the dream of kicking back and drinking a margarita while the automation does all the work. On the opposite side of that is the manager who says, why are you guys all here if we've got everything automated? You're all fired, right? So we don't want to get to that point either. We're not at a point where the system is going to run fine without you. We're trying to make more time in your day by automating some stuff. So when we go for automation, it doesn't matter what you're writing the automation in.
And depending on your platform, you could do things in something as simple as PowerShell or Bash or your favorite programming language, whether that's Go or Python or whatever else. You might be deploying other automation-focused tools, right? Like infrastructure as code components and that kind of stuff. So there are lots of ways of putting automation into your workflow that don't necessarily mean you have to build the whole thing yourself. There are lots of building blocks for it, but you should be looking for some basic characteristics to make things a little easier on yourself over the lifecycle of the services that you're supporting. And these probably look like what you want for software development, because they totally are. So I've pulled these from Lee Atchison's Architecting for Scale, which is a great book. You should take a look at that one if you haven't. But it's ways to think about the automation for a service as part of the product itself, and the characteristics that are going to help you maintain things going forward. And these should hopefully be pretty familiar to you as you have gone on your DevOps journey. We want things to be testable, whether there's a test harness for the language that we've chosen to write our application in, or there's a whole other Test Kitchen kind of application that lives along with it. We want to get to that. You can apply test-driven development methods if that's part of your culture. That's great too. You also want things to be flexible. You want them to be implemented for future improvements. So we're not hard-coding things. Maybe we're going out to the configuration database to pull more information about services and connectivity and things like that. We also want to put all of it in the version control system so that we can review it and we have access to its history if we ever need to go back and look at something. We also want to keep this automation for related systems the same.
And I say that and you think, oh yeah, that's obvious. But I have worked with customers in different places over the years where you could have two teams sitting on opposite sides of a wall of cubicles, and they're both running the same platform, but their stuff is binary incompatible because they've been allowed to recompile everything separately, and then they can't share scripts and tools and all that kind of stuff. And that's crazy. That's such a waste of time and resources. So from a political standpoint, you might have to have some hard talks with the folks that have recompiled things or have built their own stuff or have kind of gone on side quests to make their stuff special, because that's going to be harder for absolutely everybody in the long run. You gain more benefit from your automation if you can share things across different teams. Right. So a little bit of a rant there. We also want things to be repeatable. We definitely want them to be auditable. Right. Every time the tool runs, I want it to do the same thing. And I want to get to the point where if I run it, I know I ran it. I want that to be centralized. And folks know that, because the goal here is for me to give this to Alice, right? So I want to know when Alice ran the thing, and I want to know that every time Alice runs the thing, it's going to act the same way for her too. So there's not a lot of confusion that I then have to deal with. So let's take a look at what it might mean to automate for delegation. For folks who are used to sort of writing for themselves, there are some things that might be a little bit different when we think about it, right? Because we have a bunch of different teams, right? So all those folks on the left-hand side of the screen there, they're not dumb. They know their own thing, right? They've got all their own set of tools and all their own set of knowledge and all that kind of stuff.
And then we've got all of our tools on the right-hand side there that are doing all the cool stuff in our environments, but it's hard to share all of the things on the right with all of the people on the left, because they don't know all of the same things and they don't have all of exactly the same skills. And importantly, they also don't all have exactly the same access, right, which is another issue that always pops up. So as we are building for the goal of taking toil off my plate and giving it to Alice as maybe a click button somewhere, I want to make sure that I am resolving all of her questions via the automation and taking care of all that stuff. So I want to be able to say, oh, I know Alice knows X, Y and Z and will understand it this way. And I can address those kinds of gaps for her before I pass this automation over to her team. So we're basically building an internal product, right, when we build this kind of automation. And like I mentioned before, this leads really well into folks that are on a platform engineering journey, right? We are designing for delegation to another team, and part of that is going to be supporting how our users work. Some of our engineering teams are super comfortable on the command line, and some of them aren't. A lot of them really just want to log into a web page and click a button, or maybe they're working in something like Backstage and they want a module there, or they want things in a certain place. So that stuff is super easy for them, and that's great, but you need to know that and think about it. We also want to provide results that make sense whether things are successful or not. If it produces an error, it should be a helpful error. Oh hey, you're over budget, or hey, you've already got six environments deployed in your account, you can't have any more. Please go clean something up. Right. Rather than just saying error, failing, right.
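As a rough illustration of the "helpful error" idea above, here is a minimal sketch of a pre-provisioning check. Everything in it is an assumption made for illustration: the `TeamAccount` fields, the `check_environment_request` function, and the limit of six environments (borrowed from the example in the talk) are hypothetical and do not come from Rundeck, PagerDuty, or any real API.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical per-team cap, echoing the "six environments" example in the talk.
MAX_ENVIRONMENTS = 6


@dataclass
class TeamAccount:
    """Illustrative view of what the automation knows about a requesting team."""
    name: str
    monthly_budget: float
    spend_to_date: float
    active_environments: int


def check_environment_request(team: TeamAccount, estimated_cost: float) -> tuple[bool, str]:
    """Return (approved, message). A rejection message says why and what to do next."""
    remaining = team.monthly_budget - team.spend_to_date
    if estimated_cost > remaining:
        return False, (
            f"Request rejected: team '{team.name}' has ${remaining:.2f} of budget left, "
            f"but this environment is estimated at ${estimated_cost:.2f}. "
            "Shut down unused resources or ask FinOps to review the budget."
        )
    if team.active_environments >= MAX_ENVIRONMENTS:
        return False, (
            f"Request rejected: team '{team.name}' already has "
            f"{team.active_environments} environments deployed (limit {MAX_ENVIRONMENTS}). "
            "Please clean one up and try again."
        )
    return True, f"Request approved for team '{team.name}'."


if __name__ == "__main__":
    team = TeamAccount("checkout", monthly_budget=1000.0, spend_to_date=400.0,
                       active_environments=6)
    approved, message = check_environment_request(team, estimated_cost=50.0)
    print(approved, message)
```

The point of the sketch is only that each rejection names the rule that failed and suggests a next step, so the requester does not have to open a ticket to find out what went wrong.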
So being helpful to the users is, for folks who have worked in the back end for a long time, a skill that you might have to build as you're thinking about how this delegation is going to work. You also want consistency of experience. You want the same kind of error messages, you want the same kind of options for the commands, you want things to be named similarly. You don't want to use a lot of jargon if you don't have to. You certainly don't want to use inside jokes, right, and that kind of stuff in these kinds of internal products, because you want as many people as possible to be able to use them and be successful with them without having a lot of insider knowledge, right? So make things as accessible as possible to as wide an audience as you can, knowing that they're all just your coworkers, right? If they don't understand something, they're going to be pinging you on Slack or Teams or whatever. But at the same time they really just want to get their work done and not be burdened with a bunch of silly stuff. We also want to include as much documentation and context as we can, whether that's hover-overs or whatever it looks like for your tools, so that, like I say, folks don't have to come back to you on Slack all the time. So we are designing for other people that kind of look like us, right? They are other folks that are technically minded, they are doing technical work, and they are headed in the same direction we are, right? We want to create awesome products for the customers. It's just that some of us work on the back side and they work on the front side. So like any software design, we are focusing on empathy for the users. It just happens to be that the users are the next team over in the org chart. So we want to turn our expertise, the things we know, things cloud ops knows, things FinOps knows, into automation.
So that when Alice wants to request that environment, she logs into our automation platform, whatever it is, and requests her environment: provision environment. And it says, oh, you're Alice, you're on this team, you have these permissions to these jobs in these environments, and it just goes through the workflow, right? Oh, here's the jobs for Alice's team. Click the button, off it goes. It goes through the budget check. Doesn't need a human. It automatically creates the environment, pops it back up to Alice, and says, here's your environment, please log in at da da da. And every job gets logged for audit, so that I know Alice provisioned an environment on Thursday. Because then when her coworker Ben tries to provision something on Friday and the account is full, he can say, oh, Alice has one, let's see if she's done. We can turn things off. So what kinds of things should we automate? You may be thinking in your head, well, I've got all this junk that I do. Which of it makes the most sense to automate? So there are a couple of ways of framing up thinking about automation that you can take a look at. We're going to look at two of them. So the first one is to automate based on what feels most comfortable, to sort of automate around a specific set of tasks. Now, we're PagerDuty, so we took a look at some tasks that you might do during an incident response, right? Something's wrong. There might be some things that you need to do. Happens all the time, right? So looking at tasks on a set of axes, on the x axis there on the bottom, we are looking at things that go from simple to sophisticated or complex, right? So simple things are like maybe one single command in one set of services, and at the other end, sophisticated stuff might be looking across lots of different services or trying to coordinate something. And then as we look at the y axis, at the bottom, it's stuff that has no impact. This is your read-only kind of stuff, right?
And then when we get higher and higher on the y axis, things get a little bit more high impact and might be a little scary to start automating, especially at the beginning, right? So thinking about the incident response context, things in the green there are super easy maybe, and you might actually already have access to them. It's just a matter of putting them somewhere more easily available. Things like just info gathering, right? Memory, CPU, error messages, get that performance check, that kind of thing. Getting out into the yellow, things get a little bit more complex and we might need some help from our environment for service restarts or maybe multi-step restarts or other more complex diagnostics, like maybe Kafka is pushing something wrong or there's a weird error message there. And when we get into the red, there are some things here that folks aren't super comfortable automating, and some of it fortunately will be handled by your cloud. But it might not be. Whether it's things like scaling up a complex environment, where you push a bunch of new containers into a pod or something like that. Or you need to do a multi-service rolling restart, maybe across different locations geographically, so that things all sync up and reconnect. Or you might have a process for rollback or redeploy, and you get to the point where you really want to automate those kinds of things, right, for incident response. But you can sort of take that and break it down. In the Alice example too, we could put the same sort of tasks in a graph based on provisioning environments in the cloud and figure out what to do there. Another way to look at automating tasks is through this XKCD example. There's always an XKCD example, right? So you can also think about automating the tasks that you do most often or the stuff that takes the most time. Right. Because we want to look at getting time back out of our day by not doing these sort of toily tasks.
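The "how often, how long" framing above comes down to simple arithmetic. The following is a minimal sketch; the function names and all of the numbers in the usage are made-up placeholders, not measurements from the talk or from any real team.

```python
def weekly_hours_saved(runs_per_week: float,
                       manual_minutes_per_run: float,
                       automated_minutes_per_run: float = 0.0) -> float:
    """Person hours returned each week once the task is automated."""
    saved_minutes = runs_per_week * (manual_minutes_per_run - automated_minutes_per_run)
    return saved_minutes / 60.0


def weeks_to_break_even(build_hours: float, hours_saved_per_week: float) -> float:
    """Weeks until the time spent building the automation is paid back."""
    if hours_saved_per_week <= 0:
        return float("inf")  # the automation never pays for itself on time alone
    return build_hours / hours_saved_per_week


if __name__ == "__main__":
    # Placeholder inputs: a task run 40 times a week, 30 minutes by hand,
    # 3 minutes of human attention once automated, 36 hours to build.
    saved = weekly_hours_saved(40, 30, 3)
    print(f"hours saved per week: {saved}")
    print(f"weeks to break even: {weeks_to_break_even(36, saved)}")
```

A calculation like this only counts time. It does not capture the other benefits of automating a task, such as fewer mistakes on work that is done rarely.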
So you can actually do this kind of graph for yourself and think about how often you do a task. Well, cloud ops has to provision environments like 40 times a week. It's crazy. Oh my gosh. And how long that actually takes. And get yourself to the point where you figure out, okay, we're going to get back x number of person hours every week by automating this thing. And some stuff might be hard to justify in this way, right? You might have some jobs that you only run once a quarter. Maybe it's reporting or something like that. You think, oh, we only run this once every three months and it doesn't run that long. But even that stuff you can benefit from automating, because of the other things that automation gives you. Because you're only running it once a quarter, it's easy for people to forget how to do it and mess it up. So having some automation around that can prevent that stuff too. So you have all of those sort of benefits that you want to think about when you're thinking about what kind of things to automate and what you're going to get back out of all of that. And eventually you get to the point where you've got a bunch of automation, or you have a bunch of targets for automation, and you could think about what exactly the end state is for that automation. Because not everything's going to get to a perfect run-on-its-own kind of life, right? So as you're building your automation portfolio, think about the evolution of those automation components. Whatever you've written them in doesn't even matter. Starting on the left we have automation opportunities. That's just a fancy way of saying things we haven't automated yet, right? They're just in the backlog. We'll get to them. But then we get to the most common one, which is human-initiated automation. This is your common scripts and other tools the team members are running on demand. This is Alice getting her environment, right?
We want something done, we need to complete a task, and it's unscheduled. This is going to happen when some other work has been completed. So it's human dependent, human initiated. A lot of stuff hangs out in that part, right? That's where a lot of automation kind of lives, because that's the workflow, that's the nature of this kind of work. If you get to the point where you've got stuff that you want to kick off, say, during an incident, we might get some automation with oversight. This is automation that runs on its own in response to some environmental trigger. Some event happens. And it might be simple things like your cron jobs that rotate logs, or other stuff like restarting a service when it stops responding to queries, or whatever. The automation still requires maybe some humans to make sure it runs okay. It might require permission, to be able to say, hey, this alert came in, should we try this remediation? And some human responder has to press the button. You could totally do that. And when we get to the point where we're pretty comfortable with that, we might get to the point where the automation runs on its own with a fallback stage. It runs and it only requires human intervention if something goes wrong. So it might just keep running in the background, and that one time out of ten that it fails with some weird error code, it pops back up to the human responder and says, hey, I tried this, it didn't work. You need to take a look at it, right? It will let humans know when it's not able to fix a thing. If all goes well, the other nine times out of ten, it just does what it needs to do and doesn't necessarily need to report. Eventually you might get to the last phase on some things: monitor and evaluate. Tasks get done with automation. Edge cases are pretty well managed. You don't get errors all that often, and instead of tasks coming in and creating tickets or alerts, they just create metrics, right?
Rather than saying Brian and April cleared n requests for this error code this week, you might know this error code was completed by automation n times this week. So you can monitor it over time, right? And rather than relying on how Brian and April feel about having restarted this thing, or whatever the task is, you actually have metrics based on the alerts coming in and the automation handling things. So you end up being able to monitor and manage the performance of the system a little bit better, because you've got the actual raw data there. Not all of your tasks are going to reach all of these phases. Some things, like I said, are more naturally going to hang out in that human-initiated automation part, but even that is going to gain back so much time for your team, to get all those sort of toily tasks out of your way and put them into your automation portfolio so that you can do more interesting stuff and learn new things and do whatever you need to do otherwise. So that is most of my talk. Hopefully, that was helpful for folks. We love to talk about this stuff. I didn't talk much about PagerDuty. If you want to talk about PagerDuty, we are always available. You can find more We also have an automation platform that was built off of Rundeck, which there's information There's also a community version if you want to try that out. And there are some links there on our community site for our automation resources and code and all kinds of examples and things like that, too. So we love to talk about this stuff, and it would be great to hear from any of you. If you'd like to reach out on our community site, we'd love to hear from you. So I hope you have a great rest of Conf42.

Mandi Walls

DevOps Advocate @ PagerDuty

