Conf42 Site Reliability Engineering 2022 - Online

Is your team any good? 4 key metrics for measuring your team's performance


Abstract

DORA metrics have become the standard for gauging the efficacy of your software development teams, and can provide crucial insights into areas for growth. These metrics are essential for organizations looking to modernize, as well as those looking to gain an edge against competitors. In this talk, we’ll dive into each one, and discuss what these metrics can reveal about your development teams (and my engineering skills).

Summary

  • Cristina is a founding engineer at Cortex, a Sequoia-backed startup. She talks about four key metrics for measuring your team's performance. Cortex gives organizations visibility into the status and quality of their microservices.
  • DORA metrics are used by DevOps teams to measure their performance. The four key metrics are lead time for changes, deployment frequency, mean time to recovery, and change failure rate. CircleCI is partnering with DevOps Research and Assessment to put out the next report.
  • Lead time for changes is the amount of time between a commit and production. Elite performers can have less than an hour of lead time, while low performers take six-plus months. How do you actually measure and improve lead time for changes?
  • Deployment frequency is how often you ship changes and how consistent your software delivery is. A high deployment frequency ends up reducing your overall risk, even though you are deploying more often. You really want to drive a DevOps ethos across your whole team.
  • Mean time to recovery is the average amount of time it takes your team to restore service when there's a service disruption. It offers a look into the stability of your software and the agility of your team in the face of a challenge. Avoiding circular dependencies in your observability stack is something to think about as your team works on mean time to recovery.
  • Change failure rate can include bugs that affect customers or releases that result in downtime. It can also be a good indicator of how much time your team is spending fixing processes rather than working on new features. The key is to empower your developers and give them the tools that they need to succeed.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, I'm super excited to be here at Conf42, talking about four key metrics for measuring your team's performance. But before I get started, let me just introduce myself. My name is Cristina. I'm currently a founding engineer at Cortex, where I've been for over a year now. We're a Sequoia-backed startup, and basically we're giving organizations visibility into the status and quality of their microservices and helping teams drive adoption of best practices so that they can deliver higher quality software. Before joining Cortex, I was a front-end team lead at Bridgewater Associates for four years, and I also previously interned at Microsoft. I studied at the University of Pennsylvania, and I'm originally from Colombia.

So what I'll be talking about today is DORA metrics. DORA metrics are used by DevOps teams to measure their performance, and their origin is pretty cool: they come from an organization called DevOps Research and Assessment, hence DORA. It was a team put together by Google to survey thousands of development teams across many different industries to understand what a high performing team is, what a low performing team is, and the key differences between those. Out of that work, DORA metrics, the four metrics that I'll be talking about, came to life. You might have heard of them before. If any of you are using CircleCI, I actually just got an email from them two days ago asking me to fill out a survey, because they're partnering with DevOps Research and Assessment to put out the next report. The four key metrics are lead time for changes, deployment frequency, mean time to recovery, and change failure rate.

The first one I'll be talking about and digging into is lead time for changes. This is basically the amount of time between a commit and production. Different teams choose to measure this differently: you could also choose to have it be from the time that a ticket gets added to a sprint and is actually in progress to getting to production, or from the time it's merged to getting to production. That's up to you and your team to decide and figure out what works best for you. But this is a good indicator of how agile and responsive your team is. From the DevOps report that was put out in 2021, what they found is that elite performers can have less than an hour in lead time for changes, and low performers have six-plus months.

So how do you actually measure and improve lead time for changes? We've all been in this situation: opening up a pull request, being ready to review it, seeing that it has hundreds of files changed and many commits and lines of code, and just closing it back up, thinking, not going to do this right now. That's something that's actually going to increase your lead time for changes and mean that you're not doing so great. This image is a perfect example of what can lead to long lead time for changes. If your team is trying to make huge changes in one go, it's going to take a lot longer to review it, a lot longer to test it, and a lot longer to be confident before you get it to production. It also means that your code reviews might sit around for too long, so you want to avoid making huge changes like this. Another thing that can lead to long lead time for changes is potentially changing requirements.
So once you open up a pull request and you're testing it out, say, with your design team or your product manager or your users, and they're like, oh, but can you just add this one extra thing or this other thing? That's where it's on you as an engineer to say, no, these were my requirements going in, this is what my PR is going to do, and just make follow-up tickets for those additional tasks. And then the fourth thing that can lead to long lead time for changes is an insufficient CI/CD pipeline. You could be merging often, but if releasing is actually a really long and painful process, you're probably not releasing that often. I'll talk more about that later when I talk about deployment frequency.

So what does short lead time for changes look like? You want to make sure that everyone on your team can review PRs and that there are no bottlenecks waiting on that one person to do the review. You also want to reject tickets that aren't fully fleshed out. If something doesn't make sense and they're like, oh, we'll get the designs to you later, that's going to mean that your ticket is going to be open for a long time, you're going to have merge conflicts, and you're going to be going back and forth. It's not worth starting development on it if it's not fully fleshed out and the requirements aren't clear. And then, again, you want to escalate changing requirements. If you see that something is taking super long and people keep adding things to it, you probably want to say, let's pause on development and make sure that we go back and flesh out the tickets before starting.

And the way you can actually improve this lead time for changes is by breaking it up into buckets. You can see the time that a developer is taking to work on the change and see if that's what's taking the longest, or if it's the time that the pull request is open while review and testing happen, or if it's actually after it's merged and getting to production. You can identify which of those three buckets is taking the longest and focus on decreasing that amount of time (see the short sketch below for one way to compute these buckets). A way to measure this, again, is just using Jira. Look at your tickets, look at how long they've been open, look at the status of your tickets and how long it's taking to go from column to column. You can see, sprint over sprint, if these numbers are getting longer, decreasing, or just staying the same, and, again, spot where those bottlenecks are and figure out how your team can improve on them.

Moving on to my second metric, which I touched briefly upon in lead time for changes: deployment frequency. Deployment frequency is how often you ship changes and how consistent your software delivery is. You want to be shipping small changes to production as often as you can. A common misconception is that by shipping to production more often, you're creating more risk and might have more incidents. But actually it's the opposite, because it's going to be easier to figure out what caused those incidents when the changes are small. You'll be able to pinpoint incidents faster and get your mean time to recovery (another metric I'll be talking about later) to decrease. Basically, the idea is that if you ship to production often, you deeply understand the small changes going into each release, and you'll be able to improve upon that. A high deployment frequency will end up actually reducing your overall risk, even though you are deploying more often.
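To make the bucketed lead-time measurement above a bit more concrete, here is a minimal sketch in Python. The Ticket fields and timestamps are hypothetical stand-ins for whatever your own Jira, GitHub, and deployment tooling can export; the point is just to see which bucket dominates, sprint over sprint.

```python
# A sketch of splitting lead time for changes into development, review, and
# deploy buckets. The Ticket fields are hypothetical stand-ins for whatever
# timestamps your Jira / GitHub / deployment tooling can export.
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Ticket:
    started_at: datetime    # ticket moved to "In Progress"
    pr_opened_at: datetime  # pull request opened for review
    merged_at: datetime     # pull request merged
    deployed_at: datetime   # change reached production


def hours(delta) -> float:
    return delta.total_seconds() / 3600


def lead_time_buckets(tickets: list[Ticket]) -> dict[str, float]:
    """Median hours spent in each bucket across a batch of tickets."""
    return {
        "development": median(hours(t.pr_opened_at - t.started_at) for t in tickets),
        "review": median(hours(t.merged_at - t.pr_opened_at) for t in tickets),
        "deploy": median(hours(t.deployed_at - t.merged_at) for t in tickets),
    }


if __name__ == "__main__":
    # Hypothetical sprint of two tickets.
    sprint = [
        Ticket(datetime(2022, 3, 1, 9), datetime(2022, 3, 2, 15),
               datetime(2022, 3, 3, 11), datetime(2022, 3, 3, 18)),
        Ticket(datetime(2022, 3, 7, 10), datetime(2022, 3, 7, 16),
               datetime(2022, 3, 9, 9), datetime(2022, 3, 9, 10)),
    ]
    # Which of the three buckets dominates your lead time?
    print(lead_time_buckets(sprint))
```

Tracking numbers like these per sprint makes it obvious whether development, review, or deploy time is the bottleneck to attack first.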
Deployment frequency is also useful for determining whether your team is meeting its goals for continuous delivery and whether you're actually continuously improving the customer experience. Deployment frequency has an impact on the end user, right? You're just getting stuff out to them way more quickly than if you're waiting on many releases to deploy, and then, again, you don't know how those changes play with each other, and potentially that can create problems. Going back to the report put out in 2021, they found that elite performers deploy multiple times a day and low performers do it every six months. So basically, you want to encourage your engineers and your QA team to work closely together to make sure that you can deploy often, and you want to build out good automated tests so that you are confident in your releases as you're going through. And here's another image that we've all seen before; it kind of looks like that cascading waterfall. It reminds me of back when Microsoft Office used to sell CDs and put out releases every three years or so, and we would all go and buy the software and upgrade. We're not in that world anymore. We're in a world where you can be deploying and releasing changes to customers often, and so you don't really want a waterfall-looking thing like this.

Low deployment frequency can be the result of having insufficient CI/CD pipelines. It can be that people are bottlenecked: if you only have, say, three engineers who know how to deploy to production, you're taking up their time, they might not be around, they might be on vacation, and that can mean that you're deploying less often. And if you have a lengthy manual testing process, that's also going to mean that you deploy less frequently, because it's going to be taking up your engineers' time. Whereas a high deployment frequency comes from making it super easy to release. You want to be shipping each PR to production, in an ideal world, on its own, so that you know exactly what the change is. And I totally get that this might not work for big teams with a monolith, but in that case you can use a technique called release trains, where you ship to production at fixed intervals throughout the day, and that can also help increase your deployment frequency. You want to make sure that you're setting up good integration and end-to-end tests so that you're confident in your deployments and aren't spending a long time manually testing each use case and each application. You want to make sure that you have good testing environments with accurate data, once again just so that you're more confident in these releases. And you really want to drive a DevOps ethos across your whole team so that everyone knows that this is how things work.

As for ways to actually measure deployment frequency: you can look at the number of releases in a sprint. Everyone has different sprints; I've seen one week, I've seen two weeks, I've seen three weeks. Whatever it is that your team is doing, just measure how often you are actually releasing every sprint. Is your average once a day, or is it once a week, and how can you get that to be more frequent? You can do this by looking at GitHub, or by looking at your deployments and seeing your pods. There are various ways to measure deployment frequency using whatever tools you're using today. Moving on to our third metric: mean time to recovery.
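Before digging into mean time to recovery, here is a minimal sketch of the kind of deployment counting just described, assuming you can export a list of production deployment dates from GitHub or your CD tool; the example history is made up.

```python
# A sketch of counting deployment frequency per ISO week from a list of
# production deployment dates, e.g. exported from GitHub or your CD tool.
from collections import Counter
from datetime import date


def deploys_per_week(deploy_dates: list[date]) -> dict[str, int]:
    """Map ISO week labels like '2022-W14' to the number of deployments."""
    counts = Counter(
        f"{d.isocalendar().year}-W{d.isocalendar().week:02d}" for d in deploy_dates
    )
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical deployment history covering two weeks.
    history = [
        date(2022, 4, 4), date(2022, 4, 4), date(2022, 4, 6),
        date(2022, 4, 12), date(2022, 4, 14), date(2022, 4, 14),
    ]
    weekly = deploys_per_week(history)
    print(weekly)                              # {'2022-W14': 3, '2022-W15': 3}
    print(sum(weekly.values()) / len(weekly))  # average deployments per week
```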
Mean time to recovery is the average amount of time that it takes your team to restore service when there's a service disruption, like an outage. This one offers a look into the stability of your software and the agility of your team in the face of a challenge. Again, the DevOps report found that elite performers have less than an hour of mean time to recovery, and low performers can take over six months to actually get back up. And by that point, you've probably lost all of your customers and should really evaluate why it took you that long to get back up and running.

To dive into why this metric is important a little bit more than I've done for the other two, I'll use a concrete example: Meta's outage from October 4 that lasted five and a half hours. Whether you use Facebook Messenger, Instagram, or WhatsApp, you were probably impacted by this outage. I know I was. I have all of my family in Colombia and couldn't talk to them during that day because WhatsApp was down. But a lot of businesses actually run on WhatsApp, and so a lot of businesses were impacted by this outage as well. The outage was triggered by the system that manages the global backbone network capacity for Facebook. Basically, it's built to connect all their computing facilities together, and it consists of tens of thousands of miles of fiber-optic cables crossing the globe, linking all their data centers. The problem was that during a routine maintenance job for these routers, a command was issued with the intention of assessing the availability of that global backbone capacity, which unintentionally took down all of the connections and effectively disconnected all of Facebook's data centers. Commands like this are designed to be audited to prevent mistakes from happening, but there was a bug in the audit tool that did not catch this one. And as the engineers were working to figure out what was going wrong and how to get it back up, they faced two main obstacles. The first is that it was not possible to physically access the data centers because they were protected. And then also the total loss of DNS ended up breaking many of the internal tools that would help them diagnose the problem. Facebook put out a long post-mortem on this and a long article about what they're going to do to prevent it from happening in the future, and I encourage you to take a look at it if you're interested. But at the end of the day, this outage cost Facebook over $60 million and, again, lasted five and a half hours. It's the longest outage they've ever had.

Another popular tool that had a similar issue, also in October of last year, is Roblox. I was at a party recently with a bunch of kids, and the seven-year-olds were talking my ear off about Roblox. What happened was that they had an outage that lasted over three days. You may be saying, yeah, it's just a kids' game that's impacted, but it actually cost them about $25 million. So once again, a huge cost associated with this outage. What happened was two issues, and once again they put out a long post-mortem on this and what they're going to do to fix it, so I encourage you to take a look at it. They were enabling a new feature that created unusually high read and write load and led to excessive contention and poor performance.
These load conditions triggered a pathological performance issue in an open source system that is used to manage the write-ahead logs. What this did is it brought down critical monitoring systems that provide visibility into these tools. This circular dependency, where the thing that's down is the thing that would help you diagnose the problem, is exactly what Roblox said they're going to fix going forward. And it's something that you need to be thinking about as your team thinks about mean time to recovery: you don't want your observability stack to be tied to everything that your tool is, because at the end of the day, it's just going to make it harder for you to bring things back up when these outages do occur.

So again, if we look at what could cause a long mean time to recovery: risky infrastructure and a poor ability to actually roll back changes. You want to make sure you always have a plan in place so that if there is an outage, you can roll back while you figure out what's wrong with that latest release. There's having a bad incident management process, where potentially you don't know who's on call or who the owner is or who to call, and then having tribal knowledge or insufficient documentation. You want to make sure that you have clear documentation for all the services that you have, runbooks, and logs that are accessible to everyone, basically anything that could be needed to actually debug what's going wrong, and you want to make sure your team is trained to use them. This is actually something that Cortex helps with: we have a service catalog feature where you can see all this information about your services and basically have one spot to go to as you are dealing with an incident and looking for this information.

For a short mean time to recovery, the big difference is having a tight incident management process: again, knowing who to call when, having the ability to roll back quickly, having the tools needed to diagnose what's wrong, and having those clear runbooks easily accessible. And a thing that I personally learned from hearing about these two outages I went through is that you probably don't want the DNS for your status page to be the same as the DNS for your website. If your website's down, so is your status page, and you want to make sure that you're thinking about those things and keeping them separate.

Ways to actually measure mean time to recovery: using whatever on-call provider you have, for example PagerDuty, VictorOps, or Opsgenie, you want to measure how long the outage was, how much time passed before the fix was discovered, and how much time until it was released. Again, if you have insufficient CI/CD pipelines, it might take longer to get the fix out, even if you know what it is. And then you also want to look at how long it took you to discover the outage. Do you have the proper alerting so that when an outage happens, you know immediately, or is it taking a few hours and a customer calling it out for you to see the outage? You can use whatever tools you're using to measure this and see where those gaps are in order to improve your incident management process going forward.

And that brings me to my fourth metric, which is change failure rate. This is the percentage of failures. It can include bugs that affect customers or releases that result in downtime, degraded service, or rollbacks.
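Before going further into change failure rate, here is a minimal sketch of measuring mean time to recovery from incident records, assuming your on-call provider can export when each incident was detected and when service was restored; the Incident fields and the example times are hypothetical.

```python
# A sketch of computing mean time to recovery from incident records. In
# practice the timestamps would come from your on-call provider (PagerDuty,
# Opsgenie, etc.); the Incident fields here are hypothetical.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    detected_at: datetime  # when alerting (or a customer) surfaced the outage
    resolved_at: datetime  # when service was fully restored


def mean_time_to_recovery_hours(incidents: list[Incident]) -> float:
    """Average hours from detection to restoration across incidents."""
    durations = [
        (i.resolved_at - i.detected_at).total_seconds() / 3600 for i in incidents
    ]
    return sum(durations) / len(durations)


if __name__ == "__main__":
    incidents = [
        # A five-and-a-half-hour outage, like the one discussed above.
        Incident(datetime(2021, 10, 4, 12, 0), datetime(2021, 10, 4, 17, 30)),
        Incident(datetime(2021, 11, 2, 9, 0), datetime(2021, 11, 2, 9, 45)),
    ]
    print(f"MTTR: {mean_time_to_recovery_hours(incidents):.2f} hours")
```

Splitting each incident further into time-to-detect, time-to-diagnose, and time-to-release-the-fix, as described above, tells you which part of the incident process to tighten.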
What you include in that change failure rate is again up to your team to define, and you need to figure out which parts of it you want to measure. A common mistake that teams make when measuring this is to just look at the total number of failures rather than the rate. But the rate is actually pretty important, because the goal is to ship quickly. If you look at the number of failures and you're shipping super often, that number might be higher, right? But actually you want to have more deployments so that it's easier, again, to have that mean time to recovery be lower. So you want to look at the rate, not just the number of failures (there's a small sketch of this below). This can also be a good indicator of how much time your team is spending fixing processes rather than working on new features. Looking at this report again, the State of DevOps 2021 report found that elite performers have anywhere from zero to 15% change failure rate; anything 15% or higher isn't great. We've all seen these memes, we've all kind of laughed at them. But you know that moment when you're looking at your code and you're just patching up bug after bug after bug? That's something that you want to evaluate, because it is increasing the number of bugs that your customers see and creating a poor customer experience.

A high change failure rate can be the result of sloppy code reviews, where maybe people are just looking at the code but not thinking about all the use cases or actually testing it out. Again, insufficient testing, whether it's unit tests, integration tests, or end-to-end tests, and then having staging environments with insufficient test data. If your staging environment doesn't reflect the data that customers are using, at the end of the day it may not be a good representation for testing your changes before actually rolling them out. The way you get to a low change failure rate is by promoting an ethos that is focused on DevOps: basically creating that culture of quality, making sure that you have representative development and staging environments so that you can test things before they get to production, and having a strong partnership between product and engineering so that you deeply understand the use cases and actually know what to test before going forward. Make sure you've handled all the potential edge cases and that you write tests for those edge cases, anything, basically, to make you more confident that the features you're releasing work in the way that they're meant to.

As for ways that you can measure this change failure rate: you can look at how many releases have caused downtime, how many tickets have actually resulted in incidents, and how many tickets have follow-up bug tickets attached to them, because, again, ideally you would catch those bugs before they go out. Even if they don't necessarily cause an outage, they still cause a bad customer experience. And then you can honestly dig a step further and see how many of these issues are a result of not having unit tests in place. Would a unit test have caught the issue? Would an end-to-end test have caught the issue? Or was it bad data? Then make sure you actually update that data in your staging environment, so that going forward you can catch issues similar to whatever it is that caused the problem.
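As a minimal sketch of treating change failure rate as a rate rather than a raw count, the snippet below assumes each production deployment can be tagged with whether it led to downtime, a rollback, or a customer-facing bug; the Deployment record is a hypothetical stand-in for your CD and incident data.

```python
# A sketch of change failure rate as a rate, not a raw count. The Deployment
# record is a hypothetical stand-in for whatever your CD pipeline and incident
# tracker can export about each production release.
from dataclasses import dataclass


@dataclass
class Deployment:
    version: str
    caused_failure: bool  # downtime, degraded service, rollback, or customer-facing bug


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Percentage of production deployments that resulted in a failure."""
    failures = sum(1 for d in deployments if d.caused_failure)
    return 100 * failures / len(deployments)


if __name__ == "__main__":
    history = [
        Deployment("v1.4.0", caused_failure=False),
        Deployment("v1.4.1", caused_failure=True),   # rolled back after an alert
        Deployment("v1.4.2", caused_failure=False),
        Deployment("v1.5.0", caused_failure=False),
    ]
    print(f"Change failure rate: {change_failure_rate(history):.0f}%")  # 25%
```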
And so that was a broad overview of all four metrics, but now if we put them together, again they're lead time for changes, deployment frequency, mean time to recovery, and change failure rate. What they're really looking at is speed versus stability. Lead time for changes and deployment frequency are really looking at speed: how fast are you getting these changes out to your users? And stability is mean time to recovery and change failure rate: how often is your app unstable due to changes that you have gotten out? The key is to empower your developers and give them the tools that they need to succeed. At the end of the day, it's not literally about these metrics; it's about your team, and about using these metrics to improve their performance. Your developers are the ones who are able to make the changes to help your team reach its goals, and you want to make sure that they understand these metrics, why they're important, and that they are using them to improve their processes day to day.

And to give a more concrete example of literally measuring this: as I mentioned earlier, I'm an engineer at Cortex, and we help teams define standards and measure how they're doing. What you see on the screen right now is one of our features, which is called scorecards. It allows you to create rules for your team, and it will measure all your services and how they're doing and give you scores based on the rules that you created and on your integrations for your services. From this, you can create initiatives to help improve those things going forward. So you can say, by Q3 I really want to improve my deployment frequency, and I want to make sure that the CI/CD pipelines are sufficient and that we have better testing. You can measure things like test coverage, and you can use scorecards to make this a moving target across your teams. And so this is exactly what we do. Thank you very much. I hope you enjoyed learning about DORA metrics, and feel free to put any questions in the chat.
...

Cristina Buenahora

Founding Engineer @ Cortex



