Conf42 DevOps 2024 - Online

Developers, don't toil, Automate!

Abstract

DevEx is about removing the suck, and the suck is toil, friction, and time-wasting in the day to day life of a developer. Toil is arguably worse than crisis, because a crisis is temporary and firefighting can feel rewarding when it’s over. Toil is more like a death march - an insidious force that eventually leads to burnout.

Automations - guardrails, notifications, actions, or workflows - are the best way to remove toil. I’ll walk you through the best automations and show you how you can use them as experiments to enable a loop of continuous learning and improvement.

Summary

  • Dylan: This talk is about delegating engineering toil. Toil is the repetitive tasks that your dev team needs to do to get things done right. Dylan: My argument is that automations remove toil. This is how we move our industry forward.
  • What the traits of good automations look like in relation to developer experience workflow. I would argue that today we can do far more exciting and powerful things for developers. Because of all of the things that have come before, we can start to unlock this higher level of automation.
  • Automation is going to help drive your culture. Tooling helps define culture, right? It helps build these things like guardrails. A good automation enforces these norms and it also allows you to try new things.
  • Automations should be low lift. Almost nine times out of ten, I start working on the automation and it's far more complicated than I thought. It takes more time. Whereas if you have low lift automations, you can treat them as an experiment.
  • The idea of low lift automations means that you can really do a paradigm shift. You have to be able to use the tools that your teams are using already today in these automations. If you can do these things in a tight loop, you can continuously iterate towards a better and better process.
  • A story from our startup, Sleuth, about something that I think fulfills these categories. A great example of taking all of those traits of an automation and building them into something that we can adopt really quickly.
  • There are four major buckets, basically: guardrails, notifications, actions and workflows. For those familiar with the State of DevOps and DORA metrics, frequency is something that DevOps teams are trying to maximize. The way that you maximize that is by driving your batch size down.
  • A good notification is context sensitive. You can use notifications not just to keep people aware of what's going on, but you can nudge behavior in the right direction. Finally, we have workflows. It's how you get work from concept all the way through to launch.
  • You need to experiment, measure and repeat. It's the same with automations. We're on that staircase of automations where we can start to do these things. Not every automation is made for every team. Check out our Sleuth automations marketplace.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
My name is Dylan, and I am co-founder and CEO of a company called Sleuth, and this talk is about delegating engineering toil to the... You know, it's been a long day. You've all had a conference session. I want you to visualize a beautiful and perfect world. A world where developers get to spend their time on producing engineering value, right? Not wasting their time, not working on busy work, not doing these things, having the focus time that they need to actually ship the things that are going to make a difference for your customers and your organization. Now, can this world exist? Perhaps. But there's this thing, there's this thing called toil.
Now, generally speaking, in the English language, toil means to work extremely hard or incessantly, to do exhausting physical labor. But more specifically, in our engineering world, in the software engineering world, we mean something slightly different. What do we mean by developer experience toil? Well, these are the repetitive tasks that your dev team needs to do to get things done right. They're repetitive tasks that aren't necessarily unimportant, but they are tasks that, again, you have to do over and over again. You have to do them correctly. And generally speaking, at some point it's wasting your time. It's not giving your customers that end value. It's something that you likely can do without.
And how do you know if you've got developer experience toil? There's a couple of categories. There's a couple of questions you can ask to understand if you do. One is, you know, if our team always does XYZ. In other words, like, I'm a developer, I need to remember to transition this issue into this bucket, and I need to remember when it goes out that I need to ping the PM and let him know this other thing. It's a manual task. I have to do it every time, but it's me. I have to do it.
I have to change context, change flow. Manual steps in your workflow, things like: I'm going to do a deploy, and at some point before I merge things into production, I should go figure out all the people that were involved in the change. I should find them, mention them in Slack, ask them, is this okay to do? That's a manual step. That's context switching. It's toil. I have to do it over and over and over again. And another giant category is syncing state between different systems. So again, your Jira backlog, the place that is never in touch with reality, right, because people forget to transition state. Or you've got to go into Slack and mention people, or you've got to update the wiki with your release notes, or whatever it is. You're pushing state from one place to the other. Are your customers getting benefit from that? Likely not. Are your developers getting frustrated? Likely so.
All right. My argument is that automations remove toil. So when we talk about how we even just move our software industry forward, it is through automations. Automation is this lovely, wonderful thing. When we do something 20 times and we realize, hey, this is a repetitive task, we can delegate that off to robots, and that can be super exciting. Don't just take my word for it. I mean, we've got a timeline here of examples of automations just transforming our industry.
So circa 2001: automated unit and synthetic tests. I worked at Atlassian back in the day, and right around then we were writing loads and heaps of tests, but they were running on my laptop. They didn't necessarily gate my changes. We decided to move to automated execution of that, and that changed the game. Suddenly we could run them in a consistent manner. We could break them up and execute them. It really changed how we could work. Fast forward a little further. In 2005: ephemeral infrastructure, defined as code. Puppet, Chef, eventually evolving into things like Terraform. Again, it maybe even got rid of an entire role.
I knew ops people whose job was to understand the drift between this server and that server, and I knew engineers who'd lost days of time debugging issues that were specific to a server. So you move to infrastructure as code, automate that, and suddenly we've changed the game again. And then fast forward a little further: the DevOps pipelines with no manual steps. I mean, huge game changer, right? DevOps revolution right there in a can, being able to deploy and make it a non-event. This is how we move our industry forward.
All right, so maybe I've convinced you that automations have a lot of potential for change. Let's talk about what the traits of good automations in relation to developer experience workflow look like. Because it's not just any old automation. I mean, obviously, if it's helping you, great, go for it. But I will argue that there are some very key characteristics that allow you to level things up. And actually, one more point. If I go back to this slide here, I put these things in a sort of staircase because I think it's important to understand that the automations that we have developed continue to build upon each other. I would argue that today we can do far more exciting and powerful things for developer experience workflow because of all of these things that have come before, and it's truly now that we can start to unlock this higher level of automation.
All right, so talking about traits. The number one trait of a good developer experience automation is that it's going to help drive your culture. Now I feel like when I've talked to people about this one, initially the reaction is: you're talking about tooling. How is tooling culture? Well, tooling helps define culture, right? It helps build these things like guardrails. It says, if we are going to always do a thing, it helps define how we work, because we agree as people that we're going to work in a certain way, and then the automation can enforce those norms.
For a very simple example, if I'm going to say there needs to be an issue key in the title of every pull request so that people have the context about why we're making a change, they have the context to understand for the review cycle and that sort of thing. That's an agreement amongst a developer team, and you can enforce that with an automation. And then rather than having somebody be the bad guy, you have basically automations being the bad guy. And it's not really a bad guy, because you said we want to do this and it's there to remind you of these things. So a good automation enforces these norms and it also allows you to try new things. So if you want to try something different, again, you can say, let's agree as a group, let's put in place some automation, and then the automation is there to support you and build the layers that you're going to build upon as a team.
Okay, the next excellent automation trait is that automations should be low lift. Now I love these little XKCD things because I think they get it super exactly right in all instances. As an engineer, I am guilty of having done this a million times, which is to say I'm working on a problem. It's a problem to solve things for customer value. I realize, hey, I could add an automation here and make my life way better. And what I think is going to happen is I'm going to spend a little time on the automation and then I'm really going to get into the flow and get to do things. What happens almost nine times out of ten is that I start working on the automation. It's far more complicated than I thought. It takes more time. I've run into a bug, it's some ongoing development, and now I have no time to actually do the real work behind the scenes. So you can imagine if that's the world that every automation lives in. It's hard to make progress because you never quite know the depth of that automation.
Whereas if you have low lift automations, you can treat them as an experiment and basically iterate on them a lot quicker. The other thing I would say too is that the comic on the left is also true: sometimes there's things that you do that don't take a lot of time, it's just a context switch. And if you were to ask yourself, if this is going to take me a week to automate, when am I going to get paid back for that? The answer might be never. But you add twelve of those up and they're going to add up over time. So if you can make your automation super low lift, you can get a lot of benefit and you can start to attack some of that death by a thousand paper cuts.
Which leads me to the next topic. The idea of low lift automations means that you can really do a paradigm shift. And this is honestly the crux of, I believe, the way that modern teams are working: everybody is trying to achieve this idea of continuous improvement, continuous learning, and it's not rocket science to realize that. It's just straight up science, right? You need to be able to perform an experiment. You have a hypothesis, you run an experiment, you measure. If the hypothesis was wrong, you try again. But if your experiment is going to take forever, you're only going to get one or two shots on goal. But if you can do these things in a tight loop and you can really measure these things, you can continuously iterate towards a better and better process. So a good automation is an experiment. It's your team saying, what if we did x? And you're going to measure and you're going to check it out and you're going to try.
Which, actually, I thought there was a different slide coming. But another trait that I think you will find in this day and age that is huge is that you have to be able to use the tools that your teams are already using today in these automations.
So if you think back to that slide where we had the stairstep of how automations have sort of built upon each other over time, we're at this place now where we're all using these cloud-based tools, right? We're using some sort of Git repository with code review. We're using an issue tracker somewhere, we're using a CI/CD system. My guess would be that 90% of you are using some subset of these tools that are up on the screen right here. Now, what that means is, if we want to make an impactful change to the way that your development teams are working, you need to be able to automate across all of this, because things start in Jira and then they move into a pull request, and then they move into a production environment via CI/CD, and then they move into PagerDuty when you've messed up. And you need to be able to talk to all of these systems to really take best practice workflows and implement them in your services.
So let's talk about this in terms of a story. We talked about the traits of a good automation. Now, I have a number of different stories, but I like this one the best. So for our startup, Sleuth... Question? Is that right? Okay, well, maybe I meant to say four then. Yeah, sorry about that. So, telling a story about something that I think fulfills these categories. Now, we cheated a little bit in the sense that Sleuth is built to do a lot of these things and to basically bring the low lift side of this. But there's a number of different best practices that are out there in the world. One of these is to say, I'm going to make a change and I'm going to deploy that change to a pre-production environment. And I want to have a culture on my team of people identifying that their change has hit a pre-production environment. Take a hot minute, do a smoke test, check that it works in that pre-production environment the way that we want it to before we go ahead and merge that into a production environment. Right.
That's a reasonably common workflow, but you can imagine that that's a little difficult. Right. We have to merge a thing. We have to know that CI/CD has deployed that to a specific environment. I need to know who it is that I need to mention, and then I need to hopefully collect that information in an environment where the people are working, so something like Slack. And then I need to trigger a CI/CD pipeline in some other system to promote things to production. A complicated automation, but you can see how that's going to drive reliability. It's going to drive accountability, it's going to promote my team doing smoke tests and those sorts of things. So it's a great example of taking all of those traits of an automation and building them into something where, if we can adopt it really quickly, we can understand how that impacts our flow. And it is something that really defines the culture of our team. It's holding up so many pillars of what we want to do.
And in our case, actually, we used that as an experiment, and we decided we needed some nuance to that. We started with just straight up approvals, and then we said, if it's no big deal and it's a small change, after 10 minutes, if everything's okay based off of the monitoring that's coming in from other systems, let's auto promote it. And then we went, oh, how about if we have a label on a pull request that's like quick fix? It doesn't even stop at all. It just goes straight out. But because we could experiment with that automation and move really quickly, we could see how it would fit the flow that we were arguably trying to attack ourselves.
Okay, so that's a little bit about the why, like the theoretical, like how we should build automations to be really impactful. Let's talk a little bit about what teams are doing today.
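The promotion rules from this story (straight approvals first, then auto-promotion for small healthy changes after 10 minutes, then a quick-fix label that skips the wait) can be written down as one small decision function. This is only an illustrative sketch, not Sleuth's implementation: the `StagedChange` fields, the `quick-fix` label spelling, and the five-file threshold for a "small change" are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    WAIT = "wait"
    NEEDS_APPROVAL = "needs_approval"


@dataclass
class StagedChange:
    """A change sitting in pre-production (all fields are hypothetical)."""
    labels: frozenset            # labels on the pull request
    changed_files: int           # rough proxy for "small change"
    minutes_in_staging: float    # time since the pre-production deploy
    monitoring_healthy: bool     # rolled-up signal from monitoring tools
    approved: bool = False       # someone confirmed the smoke test in chat


def decide(change, small_change_max_files=5, soak_minutes=10):
    """Return what the pipeline should do with this change right now."""
    # A quick-fix label skips the stop entirely.
    if "quick-fix" in change.labels:
        return Decision.PROMOTE
    # An explicit human approval always promotes.
    if change.approved:
        return Decision.PROMOTE
    # Small and healthy changes auto-promote once they have soaked long enough.
    if change.changed_files <= small_change_max_files and change.monitoring_healthy:
        if change.minutes_in_staging >= soak_minutes:
            return Decision.PROMOTE
        return Decision.WAIT
    # Everything else waits for a person.
    return Decision.NEEDS_APPROVAL
```

Keeping the rules in one pure function is what makes the experiment cheap: changing the soak time or the size threshold is a one-line edit that can be tried and measured.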
So the good news is you all have been doing these things for the last 10 or 15 years, so we have a ton of best practices out there, and teams have adopted these things, and they tend to fall into a bunch of different buckets. Four major buckets, basically: guardrails, notifications, actions and workflows. So let's walk through these, and I'll tell you a little bit about what they mean, and I'll give you an example of each.
Okay. First off, guardrails. We were kind of talking about this a little bit before. Think of a guardrail as defining the boundary that your team agrees not to cross. So it's when you say "we always" or "we never". For example, we never merge a pull request when we're in an incident, right. It's a reasonable thing. Lots of teams do that. Or we never open a pull request without an issue key. Right. It's saying, these are the guardrails. As an organization, as a team, we want to live up to a certain level of excellence, and we won't cross these guardrails. These are types of automations that tend to be somewhat binary, right. They're going to either keep you from doing a thing or keep you doing a thing.
And a great example of this is batch size. Right? So for those who are familiar with the State of DevOps and DORA metrics, frequency is something that DevOps teams are trying to maximize. And the way that you maximize that is by driving your batch size down. You want to make the smallest amount of change that's going to have an impact, but have a very small blast radius in case something goes wrong. And as a team, you can agree, hey, we want to try and keep our batch size down, so you could add an automation that says a pull request can't have more than x changed files, where x makes sense for your team. And you can see how that's going to say, as a group, if I've opened one that's too big, I need to split it. I need to go back. I need to just decide how I am going to stay within these boundaries.
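As a rough illustration of the two guardrails mentioned here (an issue key in every pull request title, and a cap of x changed files), the check below returns a list of violations for a pull request. It is a sketch under stated assumptions: the `PullRequest` shape, the Jira-style key pattern, the default limit of 20 files, and the `skip-size-check` override label are all hypothetical choices, not taken from any particular tool.

```python
import re
from dataclasses import dataclass

# Assumed Jira-style key such as "ABC-123"; adjust to your tracker's format.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


@dataclass
class PullRequest:
    title: str
    changed_files: int
    labels: tuple = ()


def check_guardrails(pr, max_changed_files=20, override_label="skip-size-check"):
    """Return a list of human-readable violations; an empty list means the PR passes."""
    violations = []
    # "We never open a pull request without an issue key."
    if not ISSUE_KEY.search(pr.title):
        violations.append("title has no issue key")
    # "A pull request can't have more than x changed files", unless the team
    # has explicitly agreed to an exception by applying the override label.
    if pr.changed_files > max_changed_files and override_label not in pr.labels:
        violations.append(
            f"{pr.changed_files} changed files exceeds limit of {max_changed_files}"
        )
    return violations
```

For example, `check_guardrails(PullRequest("Fix login", 3))` reports the missing issue key, while a title such as "ABC-123 Fix login" with three changed files passes. Counting how often the override label is used gives the team a way to judge whether the limit still makes sense.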
And of course, with any of these sorts of things, maybe you've reformatted all of your code or run some sort of new linter, and you decide, this is the time where we're going to ignore this check. But again, that's the exception, not the rule. And that allows you to understand, how often are we exceeding this thing? Is this a rule that makes sense for our team?
All right, next up is notifications. My guess is, like, 100% of you have a cell phone, so you probably know what notifications are. A notification brings visibility and attention to critical information. But critically, a good notification, just like on your phone, is context sensitive. It hits you at the right time, it hits you with the right information, and it's hitting the right people. Right. Because I'm sure you all have notification fatigue. I'm one of those people who can't have 400 little red dots on my desktop. I know some of you can. I don't understand how you do it, but I can't. But you can use notifications not just to keep people aware of what's going on, but you can nudge behavior in the right direction.
So one of the examples that I like is that a PR must not be open for more than a certain number of days. A draft PR. Right. So if you want to have this culture of keeping your work flowing and making sure that you're not spending a ton of time on this dead branch or whatever, you can open a draft pull request. And maybe you say, we don't want that to be open for more than ten days. Well, the notification comes along and says, it's been eight days for this thing being open. Perhaps you should either move this into real work so that you can start to get this reviewed and get this into your flow, or move on to something else and close this draft PR. You're nudging the behavior in the direction that you want people to move. Another great example of this is goals, where you say, I want everybody to be reviewing a PR that's been opened within, say, like 10 hours or something like that.
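The draft PR nudge described above (a ten-day limit with a reminder at eight days) could be sketched as a function that decides whether a message is due and what it should say. This is a minimal sketch: the function name, the message wording, and the idea of passing `now` in explicitly are assumptions for the example, and actually delivering the message to chat is left out.

```python
from datetime import datetime, timedelta, timezone


def draft_pr_nudge(opened_at, now,
                   max_age=timedelta(days=10),
                   warn_after=timedelta(days=8)):
    """Return a nudge message for a draft PR, or None if no nudge is due yet."""
    age = now - opened_at
    if age >= max_age:
        return (f"This draft has been open {age.days} days, past the agreed "
                f"{max_age.days}-day limit. Mark it ready for review or close it.")
    if age >= warn_after:
        return (f"This draft has been open {age.days} days. Consider moving it "
                f"into review or closing it before the {max_age.days}-day limit.")
    return None


# Example: a draft opened eight days ago gets the early reminder.
now = datetime(2024, 1, 20, tzinfo=timezone.utc)
message = draft_pr_nudge(now - timedelta(days=8), now)
```

Passing `now` as an argument rather than reading the clock inside the function keeps the rule easy to test and easy to tune when the team changes its mind about the thresholds.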
You identify and notify people at the seven hour mark. Well, cool. I've got three hours to hit this thing that we all agreed on. Notify them again an hour before. You can nudge their behavior with a notification into the direction that you want people to be behaving.
All right, actions. I don't know, I've heard people call this a different thing. I call it actions, but basically it's an if-this-then-that. So I detect a condition in some sort of system. I am confident enough in the signal of that condition that I want to change the state in another system as a result. Right. So maybe I notice an incident in PagerDuty and I want to lock all of the pull requests in GitHub. Right. It's an if this, then that. So again, attacking that thing from way back at the start of this talk, where we're saying one of the types of toil is keeping these systems in sync, you can have something that you trust instigate changing that state rather than having somebody have to toil after that.
One that is really, really popular with a lot of teams, that we see people use all the time, is the issue tracker stuff. So like I said, I don't think any of you would think I'm crazy to say that Jira is where things go to be out of sync. So you do a deploy, you get that deploy into your production environment, you automatically transition an issue into deployed to production. And why is this cool? Well, you've got support engineers who are looking to get back to your customer and let them know that an issue was fixed. Great. They can see that the pull request was merged, but maybe at your organization that could be anywhere between two hours and seven days before it actually gets into the customer's hands. So you're using this to keep work in sync so that you can have visibility, so that you can better service your people.
Finally, we have workflows. And workflows are just what you think. It's how you get work from concept all the way through to launch.
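The two actions described above, locking merges while an incident is open and transitioning issues once a deploy reaches production, both follow the same if-this-then-that shape. The sketch below models that shape with plain in-memory state. The event names and fields are invented for illustration and do not correspond to real PagerDuty, GitHub, or Jira payloads; a real setup would replace the state updates with calls to those systems.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """In-memory stand-in for the systems being kept in sync."""
    merges_locked: bool = False
    issue_status: dict = field(default_factory=dict)


def handle_event(event, state):
    """Apply one if-this-then-that rule per event type (event shapes are hypothetical)."""
    kind = event["type"]
    if kind == "incident.opened":
        # If there is an incident, then block merges.
        state.merges_locked = True
    elif kind == "incident.resolved":
        # If the incident is resolved, then lift the block.
        state.merges_locked = False
    elif kind == "deploy.succeeded" and event.get("environment") == "production":
        # If a deploy reached production, then transition its issues.
        for key in event.get("issue_keys", []):
            state.issue_status[key] = "Deployed to Production"
    return state


# Example: a production deploy followed by an incident.
state = State()
handle_event({"type": "deploy.succeeded", "environment": "production",
              "issue_keys": ["ABC-1", "ABC-2"]}, state)
handle_event({"type": "incident.opened"}, state)
```

After these two events, both issues read "Deployed to Production" and merges are locked until an `incident.resolved` event arrives.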
It can be something really small that guides the way that you're doing work, or it can be something really big, like I was mentioning with the Slack-based approvals. Right. It's a fairly complicated flow, but it's saying this is how we work. This is the flow with which we are going to take work from one place to the other. Now, automations that tend to be workflow automations are often a combination of the other ones. Like, it's saying I'm going to have some sort of notification, I'm going to have some sort of if this, then that, I'm going to have some guardrails that check the thing, and I'm going to bring it all together into a larger process that we can adopt that's going to help the flow of our team.
One of the favorite ones that I have is that idea of auto locking a project when your environment is in an incident. Now I've done this with teams that I've run for probably ten years or something like that, but I've done it in ways that were inconsistent and not super effective. Right. Like you have the SRE team say, hey, I'm going to throw a Slack notification in there and be like, everybody don't deploy. We're in the middle of a, you know... And well, what's the first thing that a developer does? I didn't read Slack, and I hit merge. Right? Now you've got more change on top. Or better yet, oh, this one I've done a thousand times, which is you think, oh, everything's on fire. It's like all the CPUs are at like 90%. It's a total mess. We need to get a handle on this, and somebody does a deploy and nukes your old infrastructure and spins up new infrastructure. Now it's not 90%, it's like 160%, and it's really on fire. Now it's like hard down, right? So the cost of not doing this right every time and getting it into a place where developers live is high. So being able to take this and say, I'm going to use my PagerDuty as the source of truth for nastiness.
I am going to put a merge block in GitHub so that developers, where they live, they're going to go in there and they're going to try and click that button and it's going to say no, right? And yeah, I'm going to add some notifications in there as well. And then when the thing gets resolved, I'm going to make it all better. Right. And that's just how we work. So it's a good example of how you take a somewhat simple situation, bring together a lot of different tools and make that work in a way that's going to help your team and help your customers.
Okay. So hopefully with all of this you can sense a theme, and it's a pretty simple theme. You need to experiment, measure and repeat, and you just need to do that over and over and over again. And automations, they're just like performance issues. Maybe you've worked with a junior dev who you ask to do some sort of performance issue. And they look at the code and they say, I think that it's inefficient in this area, right? And your red flags are flying off the shelf because you realize that's not how you understand a performance problem. You measure it and then you look and you go, oh, it looks like that's the area that's slow. And you try something and you push that change out and you measure it again and you go, turns out I didn't get it. And you do it again, right? And then you say, turns out I did get it. And it was not the thing that the junior developer thought was potentially slow. It's the same with automations. It's the same with the software development flow. We just so happen to be in a place now, on that staircase of automations, where we can start to do these things. These were really hard to do in the past; it's a lot easier to do now. And we have a lot of best practices and things in place that we can utilize.
So it wouldn't be a sponsored talk if I didn't shill my own product at some point.
So if you like the way that I'm talking about these things, go check out our Sleuth automations marketplace. Obviously I'm passionate about these things. And so we've built a product that works in very much that same way. It's covering all of those traits. We make it one click so that you can treat it as an experiment. We'll give you the efficacy of the thing based on what you used to be doing and now what you're doing here. And probably most importantly, it's a catalog. It's a catalog of what is the art of the possible. And I will remind everybody that not every automation is made for every team. It might not be the right automation for your team. You can try it and see how it fits, see how it feels. I would guarantee that there will be something in there. There will be like three to five that work for you and probably 25 that don't. But yeah, take a look, check it out.
And as we move into this golden age of automation, maybe that original vision that we had of developers actually getting to work and code and not having to have distractions and do obnoxious toil, we can get there. We can get a lot closer than we ever could by embracing automations, the right type of automations, and treating them as a continuous learning and experimentation framework. That's pretty much all I got. If you want to go check out more, you can go download our book. You can check out the marketplace. Like I say, you don't have to be a Sleuth user to see that. It's just visible for everybody. So you can go browse and see if there's anything that works for you. And if you like it, give us a try, or chat to us and tell us why you don't. Either way, that works for us. All right, well, thank you all so much. I really appreciate it.

Dylan Etkin

Co-Founder & CEO @ Sleuth
