Conf42 DevOps 2023 - Online

The State of DevOps - Capabilities for Building High-performing Technology Teams


Abstract

Technology drives value and innovation in every organization. At Google Cloud, we have learned a lot about what it takes to build and scale high-performing technology teams. Our own lived experience combined with a multi-year research program led by the DevOps Research and Assessment (DORA) team can be used to help you and your team transform into a high-performing technology team. This talk will dive into some of the findings of the 2022 DORA research program. We will couple these findings with stories from the field about how teams are putting these ideas into practice. There will be success stories and cautionary tales: let’s all learn from one another. Spoiler alert! The best teams focus on getting better at getting better. You can do this, too!

Summary

  • The 2022 Accelerate State of DevOps Report. One of the key findings from the report is that context matters. Let's bring the report to life through story time.
  • All names, characters, and incidents portrayed in this production are fictitious. No identification with actual persons, living or deceased, places, buildings, and products is intended. No animals were harmed in the telling of this story.
  • Nathen: Let's go back to December 10, 2021. Where were you when you heard about Log4Shell? Your Friday was looking like this, but then it changed. Would you hold my coffee?
  • A vulnerability in logging software affected production systems. The incident commander went through the five stages of grief: denial, anger, bargaining, depression, and acceptance. He used one of his favorite tools, the OODA loop, to get the team to work.
  • About 400 different production applications were impacted by the vulnerability. By Monday morning, we'd identified the two most critical: the order management system and the ecommerce front end.
  • There are about 27 different microservices that make up this application. There was no automated build process or testing. It took us the entire weekend to identify which systems to prioritize for remediation.
  • The changes for this site have to go through a change approval board (CAB), which only meets on Tuesdays and Thursdays. The CAB suggested holding off on any updates to production until the Log4j releases stabilized. The last one wasn't even released until the 28th of December: 20-plus days. That sounds rough.
  • What about the order management system? That system is the heart of the business. Over the previous two years, the OMS team was able to go from quarterly releases to deploying updates to the system on a weekly basis. How do we help the website team have more of an experience like the order management team in the future?
  • DORA is an ongoing research program that's been around for about eight years. The research focuses on capabilities that span technical, process, and cultural capabilities. Through predictive analysis, we're able to show that these capabilities are predictive of, or drive, software delivery and operations performance. Investment takes time and practice.
  • Culture was one of the top predictors of whether or not a team was embracing these security practices. Having good continuous integration and good security is a real driver for your organization. Adoption of the technical aspects of software supply chain security appears to hinge on the use of good continuous integration practices.
  • The idea is that you start with the outcomes that you want to improve and then work backwards to find the capabilities where you need to get better. To really effect change in our teams, we cannot change them overnight. We have to start out slow.
  • Amanda Lewis: I'm a developer advocate with Google Cloud, focused on the DORA research program. In September, we also launched a community of practice around DORA. We hope that all of you out there will come and join us. Before you go, make sure you grab your very own copy of the 2022 Accelerate State of DevOps Report.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
The 2022 Accelerate State of DevOps Report. Executive summary. For the last eight years, we have produced the Accelerate State of DevOps... Hey, Nathen, you know, the report, it's interesting, but I had something else in mind that we could do today. Amanda, what is this? Well, you see, Nathen, one of the key findings from the report is that context matters. And so I know we were planning to share the highlights from the 2022 State of DevOps Report, but instead, let's bring the report to life through story time. Okay. As long as it's clear to you and to me and to all of you that are watching, this is a completely fictional story. Oh, absolutely. Right. And we probably should add some disclaimers for the lawyers. Oh, yes, of course. The story: all names, characters, and incidents portrayed in this production are fictitious. No identification with actual persons, living or deceased, places, buildings, and products is intended or should be inferred. And also, no animals were harmed in the telling of this story. So I was thinking, good stories have a protagonist, an antagonist, an inciting action. Then there's conflict, challenges, and we get to a resolution. So when you think about that, the protagonist must face obstacles and setbacks throughout the story before they can reach their goal. So, for today's story time, let's talk about Log4Shell, because, really, isn't it the gift we all received in December of 2021? So let's go back there, Nathen. Let's go back to December 10, 2021. Where were you when you heard about Log4Shell? Let's see. Friday, obviously, I was planning a pretty lightweight day. You know, hashtag no-deploy Fridays. And it was December, so I'm sure I had some holiday shopping to do. So your Friday was looking like this, but then it changed. Yeah. In fact, everything did change. Would you hold my coffee? Sure. All right, so walk me through it. You started here, with the CVE? Yeah, kind of. It wasn't really like that, though. It was more like a roller coaster. You know, I went through the five stages of grief. Denial: I mean, look, Twitter was the first place I heard about this issue. Was it really a thing? And then anger: shoot, it sure is. It's a real issue. Then began bargaining: but, I mean, a bug in the logging software? How bad could it be? Isn't this something that can wait until Monday? Or better yet, that really quiet week that's coming up? Can't I just put it on the backlog until, I don't know, December 27? But as I dug deeper into the issue, depression really started to set in. I was talking about it with my colleagues and the rest of the team, and I realized that my weekend was about to take a turn for the worse. Finally, the fifth stage of grief is acceptance. I declared an incident, and myself as the incident commander. I started our incident response procedures, which included firing up a Slack channel, gathering representatives from each team working on all of our applications, and starting up some tracking documents. So it didn't really look like a single point in time. It was really more of a flow. A roller coaster, maybe. It was definitely a roller coaster like that. So the next thing I did, well, I picked up my phone to call my family and let them know that another CVE was going to change virtually everything about my plans for the weekend. We needed a plan, and so I went with one of my favorite tools, the OODA loop. Do you know the OODA loop? Can you remind me? I always forget what the second O stands for. Right, the OODA loop: observe, orient, decide, and act. 
We observed that there was a vulnerability. Next, we had to orient: which of our production systems were going to be impacted, or were currently impacted? Then we had to decide what we were going to do. Well, actually, that was the easiest part: we were going to upgrade Log4j to remediate this vulnerability. And then act. That's the last step: we get the team to work. Of course, it is a loop, so we act and then we go back through the loop. All right, so how many production systems were impacted? Oh, yeah, our production systems. So let's see, there was one, two... about 400 different production applications, and most of them were going to be impacted by this vulnerability, we thought. Wow, so this was a gift. It was like 400 gifts, right? So how long did it take you to assess 400 production systems? Oh, yeah, it took about two minutes. We just did some querying through our SBOMs, our software bills of materials, to find out which would be impacted. It was pretty easy, really. I mean, that is amazing. Then what did you do next? Yeah, sorry, I wish it was that amazing. SBOMs are pretty awesome, but honestly, we haven't deployed them everywhere. There's maybe one application that's not yet in production, but we have a good SBOM for that, don't worry. So what really happened is we had to manually inspect all 400 of those applications, which meant calling in subject matter experts for each of those applications and asking them to do some work over the weekend. But by Monday morning, we'd identified two applications that were the most critical. We knew we needed to fix those. You know, Nathen, I've got to tell you, I love the jokes, the way you're layering them in there. And I'm curious if you were telling jokes like this when you were going through this. So which two applications were the most critical? Well, first, it was no laughing matter, for sure. But the two most critical applications that we knew we needed to fix were our order management system, which has been around forever. It's truly the heart of our business. If it's offline, customers can't buy anything and we can't ship anything. And the other system that was top of mind was our ecommerce site. This is the face of the business. It's where our customers come to purchase things. So if it's down or not working, we can't serve any of our customers. So our two applications: the order management system and our ecommerce front end. All right, so I'm going to say let's start with discussing the ecommerce website. I expect it was easier to tackle than the order management system, since that one is older and the website is newer. And also, since that's where an order starts, it seems like a great place to go next. So can you tell me about the ecommerce website? Yeah, you're totally right. It is a good place for us to start. As you mentioned, it's the front end, and the application itself was built using microservices. So maybe it's the right place to go. But as it turns out, we actually did not have an easy resolution for the website. You see, decisions were made years ago, and those decisions came back to haunt us. When we built this site several years ago, our team didn't actually have any expertise with microservices, but we knew we wanted a modern architecture. And a modern architecture requires microservices. So what did we do? Easy. We hired in some consultants and a vendor to help build and ship the site. Ultimately, we paid for functionality, not knowledge or documentation. 
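An aside on the SBOM query mentioned above: for teams that do have software bills of materials in place, the two-minute triage Nathen describes can be close to reality. Below is a minimal Python sketch of that idea, assuming CycloneDX-style JSON SBOMs collected one per application in a single directory. The directory layout, the version cutoff, and the helper names are illustrative assumptions, not the tooling from the story.

    # Sketch only: triage a directory of CycloneDX-style JSON SBOMs for vulnerable
    # log4j-core versions. Paths and file layout are illustrative assumptions.
    import json
    from pathlib import Path

    # Treat anything below 2.17.1 (the last of the four December 2021 Log4j 2.x
    # releases mentioned in the story) as needing remediation.
    FIXED = (2, 17, 1)

    def parse_version(version: str) -> tuple:
        """Turn '2.14.1' into (2, 14, 1); tolerate odd suffixes like '2.0-beta9'."""
        parts = []
        for piece in version.split("."):
            digits = "".join(ch for ch in piece if ch.isdigit())
            parts.append(int(digits) if digits else 0)
        return tuple(parts)

    def vulnerable_components(sbom_path: Path):
        """Yield (name, version) for any log4j-core component below the fixed version."""
        sbom = json.loads(sbom_path.read_text())
        for component in sbom.get("components", []):
            if component.get("name") == "log4j-core":
                version = component.get("version", "0")
                if parse_version(version) < FIXED:
                    yield component.get("name"), version

    if __name__ == "__main__":
        # Assume one SBOM per application, e.g. sboms/order-management.json
        for sbom_file in sorted(Path("sboms").glob("*.json")):
            findings = list(vulnerable_components(sbom_file))
            if findings:
                print(f"{sbom_file.stem}: needs remediation -> {findings}")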
Well, I mean, I imagine at the time this trade-off made sense. Bringing in a partner is an awesome solution when it's done in collaboration with the organization's team, and then they're upskilled after that engagement. So do you have access to the code, or do you need to work with that vendor to make these updates? Well, the website is on our infrastructure and the code is in source control. The microservices: there are about 27 different microservices that make up this application, so the code is spread across about 27 different repositories. But since the marketing team has a UI for adding, modifying, and removing content, basically managing the content and the offerings that we have on the site, we don't really have to touch the code base very frequently. In fact, we only put out changes once or twice a year, and those updates are each strategically planned. They take at least two months to get through all of our manual testing. But we were able to quickly identify across these 27 where we had the vulnerability. Unfortunately, that was the only quick and easy part of remediating this microservices application. You see, there was no automated build process. So when we found a Log4j dependency that needed patching, we had to update it and then manually execute those builds. And there was no testing in place. No automated testing in place, anyhow. Wow, there's a lot to unpack here. I guess I'm a little bit surprised. Okay, so you've really only been making updates to the application a couple of times a year, without any automated build process or testing. I mean, I can only imagine that the likelihood of failure is going to be very high. Exactly. We found one microservice first that had Log4j and we tried to upgrade it. I mean, we upgraded Log4j on that microservice, we deployed everything to a staging environment, and everything broke. It turned out that all of our microservices are very tightly coupled together. Interesting. So how did you know everything was broken? And I don't know if I even want to ask this question, but how long did it take you to fix it? Well, we knew everything was broken because we would deploy it and refresh the site to check to see if anything was broken. And what we saw was 500. 500 was not the number of orders we received. Instead, it was the server error code that we got. So we entered this process of build, deploy, see it fail, try the next microservice. It was pretty painful. This does not sound like very much fun. No, no. Remember, Amanda, we didn't even start working on these changes until Monday. It took us the entire weekend to identify which systems we should prioritize for remediation. In the end, the team spent all week updating and testing those 27 services that made up our front end website. By Friday afternoon, though, they were ready to deploy the changes. 
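It is worth pausing on that manual "deploy, refresh the site, look for a 500" loop. A minimal automated smoke check, the kind of safety net the website team was missing, might look like the sketch below. The staging URL and the list of endpoints are hypothetical placeholders, not details from the story; a non-zero exit code is what would let a pipeline stop a bad deployment automatically.

    # Sketch only: a minimal smoke test that automates the "refresh and look for 500s" check.
    # The staging URL and endpoint list are hypothetical placeholders.
    import sys
    import urllib.error
    import urllib.request

    STAGING_BASE = "https://staging.example.com"          # hypothetical staging environment
    ENDPOINTS = ["/", "/catalog", "/cart", "/checkout"]   # hypothetical key pages

    def check(path: str) -> bool:
        """Return True if the page responds with a 2xx status instead of an error."""
        try:
            with urllib.request.urlopen(STAGING_BASE + path, timeout=10) as response:
                return 200 <= response.status < 300
        except urllib.error.HTTPError as error:
            print(f"{path}: HTTP {error.code}")
            return False
        except urllib.error.URLError as error:
            print(f"{path}: unreachable ({error.reason})")
            return False

    if __name__ == "__main__":
        failures = [path for path in ENDPOINTS if not check(path)]
        if failures:
            print(f"Smoke test failed for: {failures}")
            sys.exit(1)  # a non-zero exit lets a CI job halt the deployment
        print("All smoke checks passed.")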
All right, so then it was about a week until you had everything fixed? No. Remember, hashtag no-deploy Fridays. So the team was ready on Friday, but we couldn't deploy. And the changes for this site, they have to go through a change approval board, a CAB, which only meets on Tuesdays and Thursdays. Luckily, though, we can call an emergency CAB, especially for high-risk security incidents like this particular one. They agreed to meet on Monday. So after looking over all the changes, the change approval board was a bit uncomfortable with this deployment. They asked that the development team do some additional manual testing. And I'll tell you, it was a good thing, too, because one good vulnerability deserves another. It turns out there were a few releases of Log4j in rapid succession. The CAB suggested holding off on any updates to production until the Log4j releases stabilized. This way we could batch up all of the changes into one single release. Turns out there were four updates. The last one wasn't even released until the 28th of December. Wow. So 20-plus days. I can totally understand why the CAB made the decision they did, but I have to tell you, I'm listening to this and I'm thinking about it and putting myself in the story, and I just feel a little burnt out. I mean, I imagine your team was burnt out at the end of this. Absolutely. Everyone on the ecommerce team was definitely feeling pretty crispy by that Wednesday morning after the final deployment. Long hours obviously contributed to that, as well as the stress of the vulnerability itself and the uncertainty of whether the changes would work. But probably one of the biggest stressors was the heavyweight change approval process. In the end, it was difficult for the team to understand and assess what it would take to go from commit to approved, much less deployed and working. That sounds rough. So I guess, how has the team fared? You know, I know a couple of months ago we had the OpenSSL CVE. Was that an easier experience? Have you learned from this? I mean, the team has definitely learned a lot over the last year and a half or so. And we were spared from the OpenSSL CVE because, well, that team is still on the 1.1 branch of OpenSSL. So having advance warning of the pending vulnerability did help some, but it also reminded the team of the progress that we still need to make. This is a journey. It's true. We all sometimes need that reminder, right? So that was a great recap of the ecommerce team. What about the order management system? That system, you said, is the heart of the business. And I imagine, since it's been around forever, that it was even slower to update than the microservices-based front end. Oh yeah, I can see how you would think that. Of course the OMS is older, it's larger, and it follows more of a macroservice than a microservice architectural pattern. But unlike the ecommerce system, the OMS is something that our internal teams have been actively developing over the years. In fact, over the previous two years, the OMS team was able to go from quarterly releases to deploying updates to the system on a weekly basis. So in many respects, they were better prepared for Log4Shell than the ecommerce team was. Wow, that's fantastic. And I love to hear that they've been iterating and improving. So how did it go? Well, on Monday morning, the team identified the three components that were impacted. They upgraded the Log4j library in one of the components, and then their continuous integration process automatically kicked in. A JAR file was built, automated tests were run, and that JAR file was automatically deployed to a test environment where some additional tests were run. The team took their passing tests and their proven pipeline to the CAB for approval, and the deploy was rubber-stamped. Ship it, they said, and the team did. 
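For readers picturing that pipeline, the overall shape is simple: build and test the artifact, push it to a test environment, test it again, and only then ask for approval. The sketch below illustrates that flow driven from a single Python script; the Maven invocation is a standard one, but the deploy and environment-test scripts are hypothetical stand-ins for whatever a real CI system would run.

    # Sketch only: the shape of the OMS team's CI flow as described in the story.
    # "mvn -B package" is a standard Maven build; the shell scripts below are
    # hypothetical placeholders for the team's real pipeline steps.
    import subprocess
    import sys

    STEPS = [
        ("build the JAR and run unit tests", ["mvn", "-B", "package"]),
        ("deploy the JAR to the test environment", ["./deploy.sh", "test"]),            # hypothetical
        ("run additional tests against that environment", ["./run-env-tests.sh", "test"]),  # hypothetical
    ]

    def run_step(description, command):
        """Run one pipeline step and stop the whole pipeline on the first failure."""
        print(f"==> {description}: {' '.join(command)}")
        result = subprocess.run(command)
        if result.returncode != 0:
            sys.exit(result.returncode)

    if __name__ == "__main__":
        for description, command in STEPS:
            run_step(description, command)
        print("Pipeline green: ready to take to the CAB.")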
So wait, okay, that is incredible. But didn't you say that there were three components, and they only updated and shipped one of those components? Yes, that is true. But the components are built in a way that they can be independently tested and deployed. And everyone is comfortable with this because, well, frankly, that's how we've been working in practice for well over a year now. Okay, so one down, two to go. Those must have been pretty easy. So this team had it all fixed by Wednesday? Almost. I mean, the tests failed on the second component. When the second component was updated and the tests ran, they failed. So it took a while to track down and fix that bug. So I feel like you told me once about this team, and they were the ones that had that habit of prioritizing their broken builds. Wasn't this that team? Oh, yeah, that's exactly right. So after the first component was fixed, we split up the team and said, work on components two and three. When the tests failed on component two, the entire team swarmed: let's figure out what broke this test. And it was a good thing, too, because it took the team most of the day to actually track it down. It was kind of a hidden bug. It was elusive, if you will. But they were ready to deploy by Wednesday morning. Thursday came around, and the third component was updated and released. My goodness. I have to say, do they have any open positions on this team? Because it almost sounds fun. Like, I would have enjoyed being a part of this process. Instead of something scary, it sounds exciting and thrilling. This has just been incredible. So I appreciate you sharing all of this with me, and I think I may be able to help. Oh, really? So, Amanda, how do we help the website team have more of an experience like the order management team in the future? Well, DORA. Dora the Explorer? No, not that Dora. Oh, the Digital Operational Resilience Act? Not that one. Oh, I know, the designated outdoor refreshment area. Nathen, it's not even that. You know, it does look like they have a lot of fun in Ohio. Cheers to that. Right. So the DORA for our purposes today is DevOps Research and Assessment. DORA is an ongoing research program that's been around for about eight years. The research program has primarily been funded by a number of different organizations over those years. For a few years, the research program was funded by the organization of that same name, DORA. DORA was founded by Dr. Nicole Forsgren, Jez Humble, and Gene Kim. Then in 2018, DORA, the company, was acquired by Google Cloud. The DORA team at Google Cloud has continued the research into the capabilities and practices that predict the outcomes we consider central to DevOps. The research has remained platform and tool agnostic. And personally, it has been an incredible experience to work with the research team, not only because of the learnings, but also for a better understanding of the research practice, the ethics, and the passion they bring to this body of work. Yeah, I think it's super cool. And one of the things that's really important is that focus on capabilities. In fact, through the research, we're able to investigate capabilities that span technical, process, and cultural capabilities. And through our predictive analysis, we're able to show that these capabilities are predictive of, or drive, software delivery and operations performance. Which, by the way, predicts better organizational performance. Oh, so, Nathen, it's like a maturity model with a built-in... No, no, Amanda, context matters. And in fact, there is no one-size-fits-all roadmap or maturity model for you to follow. You have to understand your team's context and focus on the right capabilities. That's right. 
In previous years, we had learned that delivery performance drives organizational performance. But like you said, context matters. The additional context this year from the findings was that delivery performance drives organizational performance, but only when operational performance is also high. That's right. And operational performance, we oftentimes talk about that as reliability. But reliability itself is a very context-specific thing that's hard to measure. In fact, reliability is a multifaceted measure of how well a team upholds their commitments to their customers. And this year, we continued our explorations into reliability as a factor in that software delivery and operations performance. We looked at some of those things, like, how does a team reduce toil? How do they use their reliability to prioritize or reprioritize the work that they're doing? And one of the most interesting things that we found there is that reliability is required. As you said, software delivery doesn't really predict organizational success without that operational performance as well. But we also saw that SRE investment takes time. Teams that are newly adopting some of these practices or capabilities, or have only adopted one or two of them, may see some initial setbacks in their reliability. But as a team sticks with it, they can see this curve really start to take effect, where they will start ramping up their overall reliability. Investment takes time and practice. This is a journey. So while it's not a roadmap, these technical capabilities build on one another, right? What I'm hearing you say is that teams improve as they get better at additional capabilities. That's right. And when you look at a number of capabilities together, this is where you really start to see that multiplicative effect. So, for example, teams that are embracing and improving their capability with technical practices like version control and loosely coupled architecture show 3.8 times higher organizational performance. And then security is a big part of this as well. And of course, security fits very well into our story about Log4j. And the truth is, we're all facing similar pressures and similar constraints and capabilities. So one of the things that we looked into this year was supply chain security, and specifically software supply chain security. And we used a number of different practices to measure that. But what we've seen is that adoption has already begun. So that's really good to see. Of course, there's room for lots more. Another thing that we see is that healthier cultures have a head start. Culture was one of the top predictors of whether or not a team was embracing these security practices. So when you say healthier cultures, you're really talking about generative cultures, right? Characterized by that high trust and free flow of information. These kinds of performance-oriented cultures are more likely to establish those security practices than lower-trust organizational cultures. That's right, Amanda. And it turns out that security also provides some unexpected benefits when thinking about the security of your supply chain. So sure, you're going to have a reduction in security risks; that's not an unexpected benefit, that's the hoped-for benefit. But better security practices can also carry additional advantages, such as reducing burnout on the team. Oh, and there's also a key integration point. 
Adoption of the technical aspects of software supply chain security appears to hinge on the use of good continuous integration practices, which provide the integration platform for many supply chain security practices. So I guess here again is another example of how capabilities really interact with each other and build upon each other. Because when we compared the two, continuous integration and security, we found that the teams that were above average on both had the best overall organizational performance. So having good continuous integration and good security is a real driver for your organization. And I think we saw this in practice as well. Think back to that order management system team. They had a really good continuous integration practice on that team. And as a result, they were able to really assess how this updated library was going to impact the application. The continuous integration was building and running tests and building their confidence, whereas the website team, without any continuous integration to speak of, had to do everything manually. Right. It's interesting, because in both of these cases they had change approval boards, but on one side you have this kind of mysterious, spooky CAB that is just blocking all of your changes. Right. They don't appear out of nowhere, but maybe there's more for us to think about: their role and how they show up in our organization, who's on the board, how many people are on it, who gets the final say, and what happens if that person goes on vacation. So I think we should also look at when it was formed, why it was formed, how things have changed since then, and whether our oversight needs to change as well. I think we can see the OMS team clearly had a very different experience with their CAB, and I'm going to guess that it has changed over time, whereas for the website team, perhaps that wasn't the case. I've heard of a story where, after process changes, the CAB is no longer on the critical path. They only deal with those outlying challenges, and as a result, deployment frequency increased 800x. Yeah, it is really startling to see that type of improvement. I have worked with a team that saw exactly those results. But you're right. In each of these cases, both teams had to go through the CAB, the change approval board. It is really that demonstration of why context matters so much. Amanda, remember, we are just talking about two of the 400 applications that needed updating. There were a lot of meetings, negotiations, blood, sweat, and tears that went into getting the rest of the fleet updated. Oh, there were also spreadsheets. Lots and lots of spreadsheets. But in short, it was a very long tail to get everything fully up to date. I might not have wanted to be along for the whole journey, but you know my love for spreadsheets, so thank you for letting me know about that. All right, so tell me about... whoa, Amanda, this is too small. I can't read anything. Hmm. Maybe you need new glasses. Or let me zoom in a little bit for you. Oh, thank you. So that previous chart was a bunch of the capabilities that we've investigated as part of the research. And here we are zoomed in on a couple of those capabilities, for example, continuous integration and loosely coupled architecture. We can see that these capabilities drive better security practices. Our culture also drives better security practices. And those security practices and culture together can help reduce burnout. 
They can help reduce the errors that we see in our systems and lead to a bunch of other really interesting outcomes. So when you think about how to apply the research to your own team and your own organization, the idea is that you start with the outcomes that you want to improve and then work backwards to find the capabilities where you need to get better. And the idea then is to understand which capability is holding us back and to make an investment in improving that capability. All right, so we had zoomed in, and thank you for explaining how we can look at this and how to move through it. So now I've zoomed back out so we can view all of the capabilities. But I would say we've got all this potential of things that we could change. What's important is that we remember not to boil the ocean, right? We can't go do all of these things tomorrow. You've inspired me, Nathen. I want to go do that. I want to be on that team. But the truth of the matter is that to really effect change in our team, we cannot change it overnight. We have to remember that it's an investment and we should start out slow, and that really we're going to reach an inflection point where we start to see that improvement. But there might be some pain along the way, and we really need to support one another through that J-curve that you showed us earlier. Absolutely. And it is team specific. That order management system team, they still have areas to improve, but they're different areas than what the ecommerce team has to improve. So you cannot use this as a roadmap, but you can use it to help identify which capabilities are holding your team back, and then commit to addressing and improving those capabilities and watching as your outcomes improve. You know, Nathen, I just realized there's one thing that we didn't do today. Oh, what's that? Well, we forgot to introduce ourselves. So, I'm Amanda Lewis. I'm a developer advocate with Google Cloud, focused on the DORA research program. Hi. And I'm Nathen Harvey. I'm also a developer advocate focused on the DORA research program and helping teams improve using the insights and findings from the research itself. One of my favorite parts about my role as a DORA advocate is working with the community. And so back in September, when we launched the 2022 report, we also launched a community of practice around DORA. So I hope that all of you out there will come and join us. If you go to dora.community, you can join the Google Group, and that will give you the ability to join in on some of the asynchronous conversations that are going on, as well as invitations to the open discussions that we're having periodically. And Nathen, do you want to share maybe some experiences you've had in some of our lean coffee discussions, and the topics and things that we've been covering with the community? You know, my favorite part about these discussions is that we really cater them to the people that show up each time for the discussions. That's one of the benefits of using the lean coffee format. But the other thing that is really beneficial is that we don't always know exactly where the conversation will go. I like to say that we need to be prepared to be surprised. And so we've had really interesting conversations and perspectives from practitioners that are putting these capabilities to work. But we're also hearing from leaders and, importantly, researchers, both the researchers on the DORA project and also other researchers across the software delivery field, the developer productivity field, and so forth. 
So it truly is a community where we can bring together practitioners, leaders, and researchers to help us all improve. Absolutely. And I think, as we've seen, Nathen, as we're working with teams and helping them apply and use the research, we realized that we really needed to connect people together, because you are the experts in your business and you can bring that experience and how you've applied it together. And I have learned so much since September. It's been absolutely incredible. Absolutely. So thank you all so much for tuning in to our presentation today. We hope that we will see you in the DORA community. And before you go, make sure you grab the URL or QR code so that you can download your very own copy of the 2022 Accelerate State of DevOps Report. Now, Amanda, can you give me that report back so I can continue on with my reading aloud of the report? Okay, I like reading it, but... all right, you can have it. All right, well, maybe we'll save that for another time. Thank you so much, everyone. Thanks, Amanda.

Nathen Harvey

Developer Advocate @ Google Cloud


Amanda Lewis

Developer Advocate @ Google Cloud



