Conf42 Site Reliability Engineering 2021 - Online

Lean Product Development Through SLOs

Abstract

  • How can you strip over-processing out of your value stream?
  • How can you understand and communicate what “good enough” looks like?
  • How can you systematically challenge and remove assumptions from your consumers?

In supporting enterprises through DevOps and agile transformations across three continents, I discovered a common theme: stories talk of what is to be enabled, but not how well it needs to perform.

Recently, upon joining a platform team drowning under the weight of their backlog, I experimented with SLOs being the beating heart of our product development flow.

To do this, we:

  • Revisited all previous core use cases and determined meaningful SLIs and SLOs
  • Made SLOs an inherent part of defining features
  • Built a lightweight infrastructure to translate technical changes to business events
  • Made a dashboard to transparently broadcast our SLOs
  • Made SLOs the core of conversations with our consumers

By doing this, we:

  • Removed assumptions from consumers, driving more meaningful conversations about value
  • Created a beachhead for a more data driven culture
  • Created a common understanding of what “good enough” looked like
  • Became better focused on maximising value delivery, over redundant gold plating of features

You’ll learn:

  • How to adopt an SLO driven product development flow
  • From the mistakes we made along the way
  • Tips and tricks for defining SLOs for low frequency use cases
  • How to drive more impactful conversations

Summary

  • Today I'm going to talk through how you can do lean product delivery through SLOs. We're going to be talking lean theory, product development, and SRE, all in about 25 to 30 minutes. One of the powerful things about SLOs is they convert expectations into concrete outcomes.
  • I'm the AWS practice lead at Contino. I'm an AWS APN ambassador, which, if anything, means I do lots of stuff like this. And kind of most poignantly for this talk, I'm a bit of a lean junkie. I love diving into everything about it.
  • We're going to be talking specifically about AWS landing zones. Today we're just going to look at one thing and that's account vending. Taking an SLO first approach really helps take some of the ambiguity out of this.
  • One of the core concepts of lean is this idea of waste. Defects, overproduction, waiting, unused talent, extra processing, transportation, motion and inventory. Product management is tension management, and that's the tension between delivery and consumers.
  • An SLO defines how frequently you can fail: out of every 100 or 1,000 or 10,000 requests, how many are allowed to fail. And then you get down to an SLA, which defines the penalty for exceeding the failure goal. In order to have an SLO, you have to have an SLI.
  • An SLO is a goal that you're trying to hit. It should be specific, measurable, attainable, relevant, and time boxed. What do we think is realistic and attainable?
  • Maintaining a system at three nines is ten times harder than maintaining a system at two nines. You want to negotiate the minimum SLOs for your features. By tracking how you are working and what the ongoing maintenance burden is on your team, you can make better decisions.
  • Another thing that comes with SLOs is ever-inflating expectations. When you have this error budget, take risks. Make sure you remind people that systems can fail and your systems will fail. By making these things explicit, you can make these things real.
  • Eliminating waste drives better outcomes. The second takeaway is that implicit expectations exist. SLOs are not only for the Googles of the world, they're for everyone. Make sure you're building products that are fit for purpose.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi. Today I'm going to talk through how you can do lean product delivery through SLOs. So you'll have to bear with me a little bit here. There's a bit of buzzword bingo going on. We're going to be talking lean theory, we're going to be talking product development, and we're going to be talking SRE, all in about 25 to 30 minutes. So I always like to start these off with the key takeaways, so in case you don't get to the end, at least you'll know what the points I was trying to make were. The first is very much coming from lean theory: if you can eliminate waste, you can drive better outcomes as you go along. The second is that implicit expectations exist, and I'll be diving into a little bit more of what I mean by that as we go through. And the third is that one of the powerful things about SLOs is they convert these expectations into concrete outcomes. So we're taking the implicit into the explicit. Hopefully by the end of this, you'll get an idea of how powerful that is.
Just very quickly, the obligatory about-me slide. I'm the AWS practice lead at Contino. We're a global digital transformation consultancy. I'm an AWS APN ambassador, which, if anything, means I do lots of stuff like this, and I write and talk and do all those kind of things. And I'm a HashiCorp ambassador as well. And kind of most poignantly for this talk, I'm a bit of a lean junkie. I love my lean theory. I love diving into everything about it. And you'll see, hopefully, why I enjoy doing that so much as we go through.
So I'm going to talk through a story of something I went through with a team, and we're going to be talking specifically about AWS landing zones. Now, why have I picked a landing zone as an example?
And this is maybe a little bit because, when you look into SLOs and SRE, normally the kind of systems where those are applied handle millions of transactions a day, super high throughput. And I wanted to prove the point that SLOs are valuable even in low-throughput systems, and an AWS landing zone is one. Really, today we're going to follow that lean theory of focus. When we talk about landing zones, they do account vending, account provisioning, security guardrails, CI/CD pipelines, compliance, all manner of things. But really today we're just going to look at one thing, and that's account vending, or project vending or subscription vending, depending on your cloud of choice. Because if you can't do this, if you can't make these accounts and projects for other people, you do not have a landing zone. So that's where we're going to narrow in today, because this is pretty much feature one as you build out a landing zone.
Now, I think everyone's probably seen user stories, and you'll get something like this. I just looked this one up for what an account vending user story might look like: as a development team, I want to be able to provision AWS accounts so I can deploy my application. Awesome. Cool. We all have a pretty half-decent shared understanding of that. As we dive in, hopefully you'll see that there's a lot left unsaid with this, and we'll show how taking an SLO-first approach really helps take some of the ambiguity out of it.
Okay, I said I was a bit of a lean junkie, so let's dive a little bit more into lean. One of the core concepts of lean is this idea of waste, and there are eight kinds of waste: defects, overproduction, waiting, unused talent, extra processing, transportation, motion and inventory. All these are things which you can look to target.
And just by the by, to put this into context, lean theory came out of the Toyota Production System. When you look at the Toyota Production System and where they've got to in their value stream, when they produce a car, they reckon they're up to about 33% value, which means they're 67% waste. The general idea in the software space is we're at 1% value. So you can either take that as a sad thing or as a lot of room for improvement. I try and take it the second way around.
Really, one of the things we're going to look at today is the waste of extra processing. This is building something to better quality than is required. This is where you're building something that is fantastic when merely good would do. You're spending all this time making something amazing when really you need to be working on other things. So when we're talking about quality, if we come back to this user story, this can be done so many different ways. It doesn't talk to quality, it talks to output. And that's a fundamental understanding you have to get: once the user story is written, you need more. I can take this and I can implement it a million different ways. I can fully automate it to the nth degree to make it as fast as physically possible. Or I can just put a mechanical turk on it and have it take a while, at the expense of human effort.
So you really need to answer this fundamental question of what is good enough. At what point is my account vending solution good enough that I can stop working on it and move on to the next thing? And even to answer this question, there's another question you have to answer, right? Who defines good enough? Who gets to say what that is? And this ends up boiling down to the whole concept of product management.
And really, when you look at product management, and this is not my term, product management is tension management. You've got tension between the product owner, the scrum master and the delivery team. You've got tension between the team and the consumers. You've got tension between the team and the other teams they rely upon. There are all these tensions in all these different areas. And we're just going to focus on the one particular kind of tension where I think SLOs really help, and that's the tension between delivery and consumers: what you're trying to build and the people who are asking you to build things. And obviously it's a one-to-many relationship. You've got one delivery team and many consumers, all with competing priorities, all pushing those priorities on you, right? So being able to be mindful and explicit about how you manage that tension can end up being really powerful.
Now, potentially, not everyone has spent a lot of time in SRE. Maybe they're watching this because they want to learn what an SLO is. Don't worry, I'm going to go over that now. Where you start is with an SLI, a service level indicator. This dictates the difference between success and failure. It's a binary distinction. So given whatever it is your feature is supposed to do, when it does the thing that it's there to do, you'll very easily be able to classify that as either a success or a failure. What your service level objective does is define how frequently you can fail: out of every 100 or 1,000 or 10,000 requests, how many of those are allowed to fail. And there needs to be some number there that's allowed to fail. That number can never be zero. And then you get down to an SLA, which defines the penalty for exceeding the failure goal. So if you break too much or break too often, what's the penalty? Now, often with internal systems, you don't go down the full path to an SLA. You're not really going to be passing money around in the business.
But if you look at AWS or GCP or Azure, whenever you look at a service, there's an SLO that goes alongside it, which says how often in any given month that service is allowed to be down. And if they're down too much, they give you service credits back in exchange for that. So that's the service level agreement, right? And fundamentally, in order to have an SLA, you have to have an SLO. In order to have an SLO, you have to have an SLI. And a really common problem I see out there in my job is that there are SLAs and SLOs that are written on paper, but a lot of the time there aren't the SLIs to go along with them. So you have, say, my network traffic to my on-premise environment, which has an SLA of 99.9%. But you ask, how do you measure that? What's failure? What's success? No one ever knows, right? They just put this on paper. And the term I've coined for that is Schrodinger's SLA. If you can't see your SLA, if you can't measure your SLA, you don't know whether your system's alive or dead. So how can you make decisions based on that? Right? So it's a common antipattern out there in the space, and I'll talk towards the end about how you can get away from it.
Okay, again, I know I keep coming back to the user story, but bear with me. So we have this user story, provisioning AWS accounts. Cool. What's the SLI here? What dictates whether the account was successfully provisioned or not, from the consumer's point of view, from the user's point of view? If you go and talk to your consumers and ask them, okay, you've asked for an AWS account, how quickly do you want it? They'll come back with something like five minutes. Now, for those people who haven't worked in AWS a lot, I can quite happily tell you that this is pretty much impossible. Even if it is possible, the amount of effort that's going to have to go into getting it to this point is huge, right? So, yeah, maybe we can do this.
But we can't really do much else. This is going to take weeks, maybe months of effort. So you counter, right? This is a negotiation. As you're managing that tension, it's about negotiating. So you say, okay, when you request an account, I'll get you one within a week. Is that okay? When I think about this, as someone that's built my fair share of landing zones, I could just have this as a weekly task where someone goes through and creates all the accounts for everyone. It's just a task that someone picks up through the week, and we get it done within a week. Cool. That seems pretty easy, right? Maybe it's a late Friday thing, I don't know. But now the consumers are bottlenecking on us, because a week is quite a long time to wait. If they're rapidly trying to provision and they realize they need another account, and it's going to be a week, now they're twiddling thumbs, killing time, whatever else. This doesn't feel quite right.
So take, for example, where I ended up with a client: a business day. Is a business day pretty reasonable? I'm going to have to build some automation around this to do it. But if a team can't predict they're going to need an account a day in advance, there are probably other things that are maybe a little bit wrong. So this is generally fairly acceptable within the domain. And just for reference, this is 100 times slower than what the consumer wanted, and it's five times faster than what you proposed. But it's something that's workable, and it's something that's explicit and measurable.
When we come to SLOs, I use the idea of SMART goals. An SLO is a goal that you're trying to hit, and I quite like SMART goals. I know they're more commonly seen in the personal development, career goal setting kind of space, right? And you get the five ideas: it should be specific, measurable, attainable, relevant, and time-boxed. Cool. All right.
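The binary SLI described here, "was the account vended within one business day?", can be sketched in a few lines of Python. This is only an illustration, not the implementation from the talk: the function names are hypothetical, and "one business day" is assumed to mean "by the same time of day on the next weekday", ignoring public holidays and time zones.

```python
from datetime import datetime, timedelta
from typing import Optional


def business_day_deadline(requested: datetime) -> datetime:
    """Same time of day on the next weekday after the request (assumed rule)."""
    deadline = requested + timedelta(days=1)
    while deadline.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        deadline += timedelta(days=1)
    return deadline


def is_good_event(requested: datetime, completed: Optional[datetime]) -> bool:
    """SLI: True if the account was vended within one business day."""
    if completed is None:  # an account that never arrived counts as a failure
        return False
    return completed <= business_day_deadline(requested)


friday = datetime(2021, 6, 4, 10, 0)  # request raised on a Friday morning
print(is_good_event(friday, datetime(2021, 6, 7, 9, 0)))   # delivered Monday 09:00
print(is_good_event(friday, datetime(2021, 6, 7, 11, 0)))  # delivered Monday 11:00
```

Under the assumed rule, the first delivery counts as a success and the second as a failure. The useful property is that every request is classified as exactly one of the two, which is what makes the later percentage meaningful.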
The SLO that we had before was: I want an account vended in one business day. That's specific, it's measurable, it's relevant, it's time-boxed. Cool. But what about attainable? What's actually attainable here? What do we think we can actually do? Are we going to get every single account within a day for the rest of time? Do we really think that's possible? So we start to narrow down this attainability bit. We have our SLI, and once we add this attainable portion to it, it becomes an SLO. Maybe it's this: account vending is done in one business day 99% of the time, measured over a rolling 30-day window. Maybe that's where we're at.
One of the things I often find is that when you look at AWS or Azure or GCP or anything else, and you look at their SLAs, they're normally quoted in three nines, four nines, five nines. Some of the stuff is at ten nines now, and it's obscene, right? And I've found this has led to a lot of people dropping nines on stuff where the nine doesn't really make all that much sense. So let's add a dose of realism to this. Let's not slap the number on because the number looks good. Let's work out what's feasible and attainable, what we think is realistic. If you put 99% on here, then if we vend 100 accounts a month, one's allowed to fail. Now, again, I appreciate not everyone has spent a lot of time in the cloud or with this kind of problem, but trust me, 100 accounts is a lot. That is an unreasonable amount. I always like to think of it this way: how many do we average, and then how many do we think we want to let fail? What will give us enough room to fail? When you look at that 99%, if you say, oh, I don't mind a couple of accounts failing a month, I think we want to have that freedom, you need to be vending 200 accounts a month to be able to do that. That is not reasonable. You can't do that. No business on earth is vending 200 accounts a month. Definitely not consistently. So.
Think about historical data, look within the industry, look elsewhere, or go and ask someone who's done it before. Ten accounts a month, that's a bit more reasonable. I'd expect that to average out over time. That ends up being a reasonable assumption. We still want the two failures, so our SLO has now ended up at 80%. Okay, cool. So now we're working out these numbers based on realistic expectations, realistic numbers and everything else. It's starting to get a lot more real and possible and feasible, right? And hey, would you look at this? We now have an SLO that we're kind of happy with: account vending happens within a business day, 80% of the time, over a 30-day window. Cool. All right. Now it feels like we're getting somewhere. Hopefully you feel that too.
One thing I always want to talk about with this is the general rule of thumb, since I was talking about nines. People are very nine-happy, but adding an extra nine is generally ten times harder than adding the previous one. So going from a system that's up 99% of the time to 99.9% of the time is ten times harder. It's ten times more investment. You need better architecture, better monitoring, better observability, better people. Everything around it just gets way harder, right? And this is an enduring investment. Software entropy is real. System entropy is real. Maintaining a system at three nines is ten times harder than maintaining a system at two nines. And when you come down to it, team capacity is limited. If you're building many systems that all have four or five nines attached to everything, you're going to blow your team capacity out of the water. They'll just end up in this maintenance period where they're trying to keep stuff alive and no longer have any room to move forward with things.
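The arithmetic behind "two failures out of ten accounts gives 80%" can be written down directly. The sketch below is illustrative only: `slo_target` and `attainment` are hypothetical helper names, the events are assumed to be `(timestamp, good)` pairs produced by whatever SLI you use, and an empty window is assumed to count as fully compliant.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def slo_target(expected_events: int, allowed_failures: int) -> float:
    """Target implied by the expected monthly volume and the failures you want room for."""
    if expected_events <= 0 or not 0 <= allowed_failures <= expected_events:
        raise ValueError("need 0 <= allowed_failures <= expected_events and expected_events > 0")
    return (expected_events - allowed_failures) / expected_events


def attainment(events: Iterable[Tuple[datetime, bool]], now: datetime,
               window_days: int = 30) -> float:
    """Fraction of good events inside the rolling window that ends at `now`."""
    start = now - timedelta(days=window_days)
    in_window = [good for when, good in events if start < when <= now]
    if not in_window:
        return 1.0  # assumption: no requests in the window means nothing failed
    return sum(in_window) / len(in_window)


print(slo_target(100, 1))  # one failure allowed in 100 accounts
print(slo_target(200, 2))  # two failures need 200 accounts to stay at the same target
print(slo_target(10, 2))   # two failures at a realistic 10 accounts a month
```

The three calls reproduce the numbers in the talk: 0.99, 0.99 and 0.8. Comparing `attainment(...)` against `slo_target(...)` over the same 30-day window is what turns the SLO from a number on paper into something you can measure.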
Less really is more. As the team that's building out the product, what you want to do is negotiate the minimum SLOs for your features, because the lower those SLOs are, the more time you have to do other things. You can do more things at a lower SLO or fewer things at a higher SLO. When it comes to a landing zone, ideally you want more. Potentially, in your context, you actually want fewer features with higher SLOs. It depends on the business case. But really, by tracking how you are working and what the ongoing maintenance burden is on your team from the stuff you've already built, you can make better decisions about where to spend your time tomorrow or next week, or even just today.
And really, I am an engineer at heart, right? Say I'm given these two things: the classic user story on the left-hand side, and on the right-hand side, my SLO. I have to design the architecture, I have to code the solution, I have to build the system that enables this. If you give the one on the right to ten engineers, you're going to end up with stuff that looks a lot more similar and that actually fits the business requirement. If you give the one on the left to ten engineers in isolation, you're potentially going to get ten very different solutions with different characteristics. Their performance characteristics, and the way they interact with the business both now and in an enduring sense, will differ.
Now, as I said before, I raised the antipattern of Schrodinger's SLA but haven't talked about the fix yet. How do you stop these SLOs and everything else from being paper tigers that don't really drive any change and don't drive accountability? So, Schrodinger's SLA: how do we avoid it?
You know, as much as I don't think you can debug a system with a dashboard, you can tell system health from a dashboard a lot of the time. It's not going to help you debug it, but it's going to help you know what the health is like, and it's going to help you make the right decisions. This is just an artistic rendition of something I built at a client. Right? So when you've got all these portions and features and parts of the system and services you're providing, and you're measuring them here, then you can start to make the right decisions. I've listed some AWS services on the left-hand side that you can use to build this stuff cheaply, with really low total cost of ownership. It's something that's really easy to build.
And this is something that I fundamentally believe: if you're working in sprints or Kanban or whatever else, what you want to do is build this into your process. Make it so this isn't some dashboard that no one ever looks at again, but build it into how you function as a team. Every sprint review, look at how your system and your services operated and performed in the last sprint. Did we let stuff fall by the wayside? Is everything we built working well enough that in the next sprint we can focus on features? What does it look like? What do our consumers expect, and how are we operating against those expectations? Finding the place where this fits in your natural flow is really important to make sure it drives the right behaviors.
When we look at these, and we'll dive in a little, you can see the compliance scans number is low and account vending is high. I've tried to use the colors to indicate stuff here, but let's look at what the numbers drive in terms of decisions. Look at account vending. We had the SLO before at 80%. We're at 95%. Cool. We've got a big error budget here.
We can afford to drop quite a few accounts before we start going under the expectation for people. So is there a big refactor we want to do, because although this is working, there are things we really don't like about it? Is there stuff we wanted to add to account vending or tack onto it, which may make it a little bit more unstable while you mature those new features? Or do you go, okay, account vending is good enough, no one's touching account vending for now, we're going to focus on other stuff? It's working well enough for now. Let's go and look at the next thing that we need to.
Another thing that comes with SLOs is ever-inflating expectations. You're building a system to match expectations. If you overshoot the expectation, like in this case, where we're at 95 against 80, and you operate at 95 for six months, unfortunately everyone's going to start thinking that although 80 is what it says and 80 is what's calculated, it's going to be 95. Because humans like to mistake correlation for causation. They like to draw patterns in things, right? So when you have this error budget, take risks. When you have this space, make sure you remind people that systems can fail and your systems will fail, and they need to be prepared for that eventuality. Don't let them get too comfortable, because that leads to massive problems down the road, where all of a sudden they expect more than you're really willing to give and commit to.
Okay, let's look at another quick example: access provisioning. Here we've only got a 2% error budget. We're only slightly above where we want to be, but we are above it. Cool. So we don't necessarily need to go and look at this right now. But this is the kind of thing where we probably don't want to add new features if we can avoid it, because that's probably going to make it a little bit more unstable. Is there tech debt we can pay down?
Can we increase the testing? What can we do to make this a little bit more stable, to give us error budget? How can we build the error budget up so we feel like we have enough room to take risks, to add new features, or to leave this alone if that's what we want to do?
And the last quick example is the compliance scans. We're way below our budget here. Ideally, you should never end up 32% below, but who knows, these things can happen, right? And this way you can use it to drive things forward. So right now it's: okay, we're really underperforming here. We need to invest in stability. This needs to be in the next sprint, or in the next tickets that get picked up off the queue. We need to make sure that we're getting the right tickets in to bring this up to the level that we want it to be. Or potentially, and this is probably not the best example of it, is the SLO that we set too high? Is it turning out to be infeasible to maintain over time? Did we misjudge things? Did we make the wrong trade-offs? Do we need to go back and renegotiate with our consumers and say: we know we said we were going to be up 95% of the time, but we've actually found that that's a massive amount of effort for us to maintain, and we want to drop that a little bit to give us time to do other things? Then you can have these conversations and really drive forward, right? As opposed to constantly being stuck not knowing what's going on.
And you can do this for almost anything. Again, this is just a table of some SLOs that I came up with for a hypothetical customer. These aren't anything particular, and I don't want you to read all of these now. Feel free to go and look at the slides later, but I would expect that in here there are things you disagree with, and that's fine. Really, at the end of it, that's the point.
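The three dashboard examples above (well above target, just above target, well below target) amount to a small decision rule on the error budget. The sketch below is one hedged way to encode it. The function name, the 5-point comfort margin and the wording of the outcomes are all assumptions. Only the account vending figures (80% target, 95% measured) come from the talk; the access provisioning and compliance scan targets are invented so that they sit 2 points above and 32 points below.

```python
def error_budget_decision(target: float, measured: float,
                          comfort_margin: float = 0.05) -> str:
    """Map the gap between measured attainment and the SLO target to a planning decision."""
    margin = measured - target
    if margin < 0:
        # Below the SLO: reliability work comes first, or the SLO needs renegotiating.
        return "invest in stability or renegotiate the SLO"
    if margin < comfort_margin:
        # Meeting the SLO, but with little room to absorb a risky change.
        return "hold new features and rebuild the budget"
    # Plenty of budget: spend it deliberately rather than inflate expectations.
    return "budget available: refactor, add features, or move on"


services = {
    "account vending": (0.80, 0.95),      # figures quoted in the talk
    "access provisioning": (0.90, 0.92),  # hypothetical target, 2 points above
    "compliance scans": (0.95, 0.63),     # hypothetical target, 32 points below
}

for name, (target, measured) in services.items():
    print(f"{name}: {error_budget_decision(target, measured)}")
```

Running a rule like this at every sprint review, on the same numbers the dashboard shows, is one way to make the SLOs drive prioritisation instead of sitting on paper.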
By making these things explicit, putting them down on paper, having the conversations around them, making them measurable and actionable, and making them the source of where you drive your priorities, you make these things real. And instead of having all these people with different ideas of what is okay and what isn't okay, everyone comes away with a common understanding of what's okay. It's the same thing as having ubiquitous language in domain-driven design. If you take away ambiguity and arrive at a common consensus, you can really move forward knowing that you are all on the same page. And that's one of the real powers of this.
Now, if you're on AWS and you want to build out an example, I do have a tiny little GitHub repo that goes from the account events all the way through to a Grafana dashboard at the other side. You can go and check it out. It's pretty extensible. Go and have some fun with it. Or don't. I'd like you to, but obviously there's no obligation.
So with that, I'll just quickly come back around to the key takeaways. The first: eliminating waste drives better outcomes. If we can stop gold plating and over-engineering services and make sure we are building stuff that's good enough to do what the client wants, we can drive better outcomes for the product that we're building, because we're better able to reprioritize and make sure that our time is always spent in the best place, at the best time. The second is that implicit expectations exist, even in that table of potential SLOs I put at the end, right? We all have slightly different ideas of what success and failure mean in the network space, for example. So you really need to look at these and find a way to surface them. And the best way I've found of surfacing them is with SLOs: taking these expectations and converting them into concrete outcomes that drive the way you build your product. SLOs are not only for the Googles of the world, they're for everyone.
I think they're fundamentally about making sure you're building products that are fit for purpose, and that really live and evolve with the customer, their differing demands, and the ability of your team. Not these static systems that just live forever, but something that you can see grow around the problem space you're trying to explore, right? So hopefully everyone found that useful. Again, the slides will be available afterwards, and obviously the video is available. If you want to find me on any of the socials, Twitter, LinkedIn, all that good stuff, I'm around and always up for a chat about this stuff. I hope you have a wonderful rest of your day.

Josh Armitage

AWS Practice Lead @ Contino



