Transcript
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud.
Hi. Today I'm going to talk through how you can do lean product delivery through SLOs. You'll have to bear with me a little bit here; there's a bit of buzzword bingo going on. We're going to be talking lean theory, we're going to be talking product development, and we're going to be talking SRE, all in about 25 to 30 minutes.
So I always like to start these off with the key takeaways, so in case you don't get to the end, at least you'll know what points I was trying to make. The first, coming very much from lean theory, is that if you can eliminate waste, you can drive better outcomes as you go along. The second is that implicit expectations exist, and I'll be diving into a little bit more of what I mean by that as we go through. And the third is that one of the powerful things about SLOs is that they convert these expectations into concrete outcomes. We're turning the implicit into the explicit. Hopefully by the end of this, you'll get an idea of how powerful that is.
Just very quickly, the obligatory about-me slide. I'm the AWS practice lead at Contino; we're a global digital transformation consultancy. I'm an AWS APN Ambassador, which means I do lots of stuff like this, and I write and talk and do all those kinds of things. And I'm a HashiCorp Ambassador as well.
And most pertinently for this talk, I'm a bit of a lean junkie. I love my lean theory. I love diving into everything about it, and you'll see, hopefully, why I enjoy doing that so much as we go through.
So I'm going to talk through the story of a journey I went through with a team, and we're going to be talking specifically about AWS landing zones. Now, why have I picked a landing zone as an example? When you look into SLOs and SRE, the kinds of systems where those are normally applied do millions of transactions a day; they're super high throughput. And I wanted to prove the point that SLOs are valuable even in low-throughput systems, and an AWS landing zone is one. Really, today we're just going to follow that lean idea of focus.
And when we talk about landing zones, we do account vending, account provisioning, security guardrails, CI pipelines, compliance, and all manner of things. But today we're just going to look at one thing, and that's account vending (or project vending or subscription vending, depending on your cloud of choice). Because really, if you can't do this, if you can't make these accounts and projects for other people, you do not have a landing zone. So that's where we're going to narrow in today, because this is pretty much feature one, almost, as
you build out a landing zone. Now, I think everyone's
probably seen user stories, and you'll get something like this. I just looked this one up for what an account vending user story might look like: as a development team, I want to be able to provision AWS accounts so I can deploy my application. Awesome. Cool. We all have a pretty half-decent shared understanding of that. But as we dive in, hopefully you'll see that there's a lot left unsaid with this, and we'll dive into that and show how taking an SLO-first approach really helps take some of the ambiguity out of it.
Okay, I said I was a bit of a lean junkie, so let's dive a little bit more into lean. One of the core concepts of lean is this idea of waste, and there are eight kinds of waste: defects, overproduction, waiting, unused talent, extra processing, transportation, motion, and inventory.
All of these are things you can look to target. And just by the by, to put this into context: lean theory came out of the Toyota Production System. When you look at the Toyota Production System and where they've got to in their value stream, when they produce a car they reckon they're up to about 33% value, which means they're 67% waste. The general idea in the software space is that we're at about 1% value. You can either take that as a sad thing or as a lot of room for improvement; I try to take it that way around. And really, the one we're going to look at today is the waste of extra processing. This is building something to a better quality than is required. This is where you're building something that is fantastic when merely good would do, spending all this time making something amazing when really you need to be working on other things.
So when we're talking about quality, let's come back to this user story. This can be done in so many different ways. It doesn't talk to quality, it talks to output. And that's a fundamental understanding you have to get: once the user story is written, you need more. I can take this and implement it a million different ways. I can fully automate to the nth degree to make it as fast as physically possible, or I can just let people turn the crank, mechanical-turk it, and have it take a while at the expense of human effort. So from the story alone you really can't answer this fundamental question: what is good enough? At what point is my account vending solution good enough that I can stop working on it and move on to the next thing? And even to answer this question, there's another question you have to answer first: who defines good enough? Who gets to say what that is?
And this ends up boiling down to the whole concept of product management. And really, product management (and this is not my term) is tension management. You've got tension between the product owner, the scrum master, and the delivery team. You've got tension between the team and the consumers. You've got tension between the team and the other teams they rely upon. There's tension in all these different areas, and we're going to focus on one particular kind of tension where I think SLOs really drive value, and that's the tension between delivery and consumers: what you're trying to build and the people who are asking you to build things. And obviously it's a one-to-many relationship. You've got one delivery team and many consumers, all with competing priorities, all pushing those priorities on you, right? So being mindful and explicit about how you manage that tension can end up being really powerful.
Now, potentially not everyone has spent a lot of time in SRE. Maybe you're watching this because you want to learn what an SLO is. Don't worry, I'm going to go over that now.
So where you start is with an SLI, a service level indicator. This dictates the difference between success and failure. It's a binary distinction: given whatever it is your feature is supposed to do, when it does the thing it's there to do, you can very easily classify whether that was a success or a failure. What your service level objective does is define how frequently you're allowed to fail: of every 100 or 1,000 or 10,000 requests, how many of those are allowed to fail. And there needs to be some amount that's allowed to fail; that number can never be zero. And then you get down to an SLA, which defines the penalty for exceeding the failure goal. If you break too much or break too often, what's the penalty?
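To make that layering concrete, here's a minimal sketch in Python of how an SLI, an SLO, and an SLA relate. The 500 ms threshold, the 99% target, and the "service credits" penalty are illustrative placeholders I've picked, not numbers from the talk:

```python
# SLI: a binary success/failure classification for each individual request.
def sli(latency_ms: float, errored: bool) -> bool:
    """Illustrative indicator: the request succeeds if it didn't error and was fast enough."""
    return (not errored) and latency_ms <= 500

# SLO: of every 100, 1,000, or 10,000 requests, how many are allowed to fail (never zero).
SLO_TARGET = 0.99  # 99% of requests must pass the SLI over the measurement window

def slo_met(outcomes: list[bool]) -> bool:
    """True if the observed success ratio is within the objective."""
    return sum(outcomes) / len(outcomes) >= SLO_TARGET

# SLA: the penalty for exceeding the failure goal (service credits, for example).
def sla_penalty(outcomes: list[bool]) -> str:
    return "no penalty" if slo_met(outcomes) else "issue service credits"
```

The point of the sketch is just the dependency: the SLA can only be evaluated because the SLO exists, and the SLO can only be evaluated because every request gets a binary SLI verdict.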
Now, often with internal systems you don't go down the full path to an SLA; you're not really going to be passing money around inside the business. But if you look at AWS or GCP or Azure, whenever you look at a service there's an SLO that goes alongside it which says how often, in any given month, that service is allowed to be down. And if it's down too much, they give you service credits back in exchange. So that's the service level agreement, right?
And fundamentally, in order to have an SLA you have to have an SLO, and in order to have an SLO you have to have an SLI. A really common problem I see out there in my job is that there are SLAs and SLOs written down on paper, but a lot of the time there aren't the SLIs to go along with them. So you have, say, network traffic to my on-premise environment that has an SLA of 99.9%. But you ask: how do you measure that? What's failure? What's success? No one ever knows, right? They just put it on paper. The term I've coined for that is Schrödinger's SLA: if you can't measure your SLA, you don't know whether your system is alive or dead. So how can you make decisions based on that? It's a common antipattern out there in the space, and I'll talk towards the end about how you can get away from it.
Okay, again, I know I keep coming back to the user story, but bear with me. So we have this user story about provisioning AWS accounts. Cool. What's the SLI here? What dictates whether the account was successfully provisioned or not, from the consumer's point of view? If you go and talk to your consumers and ask them, okay, you've asked for an AWS account, how quickly do you want it, they'll come back with something like five minutes. Now, for those people who haven't worked in AWS a lot, I can quite happily tell you that this is pretty much impossible. And even if it is possible, the amount of effort that's going to have to go into getting it to that point is huge, right? So, yeah, maybe we can do this, but we can't really do much else; this is going to take weeks, maybe months of effort. So you counter. This is a negotiation; as you're managing that tension, it's about negotiating. So you say: okay, how about I get you one within a week? Is that okay?
When I think about this, as someone that's built my fair share of landing zones, I could just have this as a weekly task: someone goes through and creates all the accounts for everyone, it gets picked up through the week, and we get it done within a week. Cool. That seems pretty easy, right? Maybe it's a late-Friday thing, I don't know. But now the consumers are bottlenecking on us, because a week is quite a long time to wait. If they're rapidly trying to provision and they realize they need another account, and it's like, oh, it's going to be a week, now they're twiddling thumbs, killing time, whatever else. This doesn't feel quite right. So take, as an example, where I ended up with a client: is a business day pretty reasonable? I'm going to have to build some automation around this to do it. And if a team can't predict they're going to need an account a day in advance, there are probably other things that are a little bit wrong. So this is generally fairly acceptable within the domain. And just for reference, this is roughly 100 times slower than what the consumer wanted, and five times faster than what you first proposed.
But this is something that's workable, something that's explicit, and something that's measurable. And when we come to SLOs, I use the idea of SMART goals. An SLO is a goal that you're trying to hit, and I quite like SMART goals, even though I know they're more commonly seen in the personal development, career-goal-setting kind of space. You get the five ideas: it should be specific, measurable, attainable, relevant, and time-boxed.
Cool. All right. The objective we had before was: I want an account vended in one business day. That's specific, it's measurable, it's relevant, it's time-boxed. But what about attainable? What's actually attainable here? What do we think we can actually do? Are we going to get every single account out within a day for the rest of time? Do we really think that's possible? So we start to narrow down this attainability bit. Okay, so we have our SLI, and once we add this attainable portion to it, it becomes an SLO. Maybe it's this: account vending happens within one business day, 99% of the time, measured over a rolling 30-day window. Maybe that's where we're at.
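As an aside, the fuzziest part of that statement to actually measure is "one business day". One way you could operationalize it, sketched here with NumPy's business-day helpers; the cutoff semantics and the empty holiday calendar are assumptions you'd want to agree with your consumers:

```python
from datetime import datetime
import numpy as np

def vended_within_one_business_day(requested: datetime, completed: datetime) -> bool:
    """SLI verdict for a single vend: success if the account landed within one
    business day of the request. Weekends are excluded by NumPy's default
    weekmask; pass holidays=[...] to np.busday_count if those matter to you."""
    return np.busday_count(requested.date(), completed.date()) <= 1

# Example: requested Friday afternoon, delivered Monday morning still passes,
# because only Friday counts as a business day in between.
print(vended_within_one_business_day(datetime(2021, 7, 2, 16, 0),
                                     datetime(2021, 7, 5, 9, 0)))   # True
```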
And one of the things I often find is that when you look at AWS or Azure or GCP or anything else and you look at their SLAs, they're normally quoted in three nines, four nines, five nines. Some of the stuff is up at ten nines now, and it's obscene, right? And I've found this has led to a lot of people dropping nines on stuff where the nine doesn't really make all that much sense. So let's just add a dose of realism to this. Let's not slap a number on because the number looks good; let's work out what's feasible and attainable, what we think is realistic.
If you put 99% on here, then if we vend 100 accounts a month, one is allowed to fail. Now, again, I appreciate not everyone has spent a lot of time in the cloud or with this kind of problem, but trust me, 100 accounts a month is a lot. That is an unreasonable amount. I always like to think of it in this space: how many do we vend on average, and how many do we think we want to let fail? What will give us enough room to fail? When you look at that 99%, if you say, oh, I don't mind a couple of accounts failing a month, I think we want that freedom, then you need to be vending 200 accounts a month to be able to do that. That is not reasonable. You can't do that. No business on earth is vending 200 accounts a month, definitely not consistently.
So think about historical data, look within the industry, look elsewhere, or go ask someone who's done it before. Ten accounts a month is a bit more reasonable; I'd expect that to average out over time, so it ends up being a reasonable assumption. We still want the two failures, so our SLO has now ended up at 80%.
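The arithmetic behind those two numbers is worth making explicit. A tiny sketch (the function names are mine, not from the talk):

```python
def required_volume(tolerated_failures: int, slo: float) -> float:
    """How many events per window you need before `tolerated_failures` fits inside the SLO."""
    return tolerated_failures / (1 - slo)

def slo_from_volume(expected_volume: int, tolerated_failures: int) -> float:
    """Work backwards from a realistic volume and the failures you want room for."""
    return (expected_volume - tolerated_failures) / expected_volume

print(required_volume(2, 0.99))   # 200.0 -> you'd need 200 vends a month to afford 2 failures at 99%
print(slo_from_volume(10, 2))     # 0.8   -> 10 vends a month with 2 tolerated failures gives the 80% SLO
```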
Okay, cool. So now we're working out these numbers based on realistic expectations, realistic numbers, and everything else. It's starting to get a lot more real and possible and feasible, right? And hey, would you look at that: we now have an SLO that we're kind of happy with. Account vending happens within a business day, 80% of the time, over a 30-day window. Cool. Now it feels like we're getting somewhere. Hopefully you kind of feel that, too.
One thing I always want to talk about here is a general rule of thumb, since I was talking about nines and people are very nine-happy: adding an extra nine is generally ten times harder than adding the previous one. So going from a system that's up 99% of the time to 99.9% of the time is ten times harder. It's ten times more investment. You'll need better architecture, better monitoring and observability, better people; everything around it just gets way harder. And this is an enduring investment. Software entropy is real, system entropy is real. Maintaining a system at three nines is ten times harder than maintaining a system at two nines. And when you come down to it, team capacity is limited. If you're building many systems that all have four or five nines attached to them, you're going to blow your team capacity out of the water, and they'll just end up in maintenance mode, trying to keep stuff alive with no room left to move anything forward.
Less really is more. As the team that's building out the product, what you want to do is negotiate the minimum SLOs for your features, because the lower those SLOs are, the more time you have to do other things. You can do more things at a lower SLO, or fewer things at a higher SLO. When it comes to a landing zone, ideally you want more; potentially, in your context, you actually want fewer features with higher SLOs. It depends on the business case. But really, it's by tracking how you are working, and what the ongoing maintenance burden of the stuff you've already built is on your team, that you can make better decisions about where to spend your time tomorrow, or next week, or even just today.
And really, I am an engineer at heart, right? If I'm given these two things, the classic user story on the left-hand side and my SLO on the right-hand side, and I have to design the architecture, code the solution, and build the system that enables this: if you give the one on the right to ten engineers, you're going to end up with stuff that looks a lot more similar and actually fits the business requirement. If you give the one on the left to ten engineers in isolation, you're going to get ten potentially very different solutions with different characteristics, in their performance and in the way they interact with the business both now and in an enduring sense.
Now, I raised the antipattern of Schrödinger's SLA before, but I haven't yet talked about how you stop it, so that these SLOs and everything else stop being paper tigers that don't really drive any change and don't drive accountability. So, Schrödinger's SLA: how do we avoid it?
You know, as much as I don't think you can debug a system from a dashboard, you can tell system health from a dashboard a lot of the time. It's not going to help you debug it, but it's going to help you know what the health is like, and that helps you make the right decisions. And this is just an artistic rendition of something I built at a client. When you've got all these portions and features and parts of the system and services you're providing, and you're measuring them here, then you can start to make the right decisions. And I've listed some AWS services on the left-hand side that you can use to build this stuff cheaply, with really low total cost of ownership; it's something that's really easy to build.
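The talk doesn't name the exact services on that slide, but as one assumed example of how cheap the plumbing can be: each SLI verdict can be published as a CloudWatch custom metric with boto3, and the dashboard (CloudWatch itself, or Grafana on top) just plots the success ratio against the objective. The namespace and metric name below are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_vend_outcome(success: bool) -> None:
    """Publish one account-vend SLI verdict as a custom metric data point."""
    cloudwatch.put_metric_data(
        Namespace="LandingZone/SLI",           # illustrative namespace
        MetricData=[{
            "MetricName": "AccountVendWithinOneBusinessDay",
            "Value": 1.0 if success else 0.0,  # the average of this metric is your SLO compliance
            "Unit": "Count",
        }],
    )
```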
And this is something that I fundamentally believe: if you're working in sprints or Kanban or whatever else, what you want to do is build this into your process. Make it so this isn't some dashboard that no one ever looks at again, but build it into how you function as a team. Every sprint review, look at how the systems and services you've created performed in the last sprint. Did we let stuff fall by the wayside? Is everything we built working well enough that in the next sprint we can focus on features? What does it look like? What do our consumers expect, and are we operating according to those expectations? Finding the place where this fits in your natural flow is really important to make sure it drives the right behaviors.
And when we look at these (and we'll dive in a little), you can see the ISC compliance number is low and account vending is high. I've tried to use the goals to indicate stuff here, but let's just look at what the numbers start to drive in terms of decisions. Look at that account vending: we had the SLO before at 80%, and we're at 95%. Cool, we've got a big error budget here. We can afford to drop quite a few accounts before we start going under people's expectations. So is there a big refactor we want to do, because although this is working there are things we really don't like about it? Is there stuff we wanted to add to account vending, or tack onto it, which may make it a little bit more unstable while we mature those new features? Or do we go: okay, account vending is good enough, no one's touching account vending for now, we're going to focus on other stuff. It's working well enough for now; let's go look at the next thing that we need to.
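That "big error budget" judgement is just a comparison you can automate as part of the sprint-review ritual. A rough sketch, assuming you already have the measured compliance from your dashboard (the threshold logic is mine, not prescribed by the talk):

```python
def error_budget_remaining(measured: float, objective: float) -> float:
    """Fraction of the error budget still unspent: 1.0 means untouched, 0 or below means blown."""
    budget = 1 - objective   # e.g. 20% of vends may fail against an 80% objective
    spent = 1 - measured     # e.g. 5% actually failed
    return (budget - spent) / budget

# The account-vending example: measured 95% against an 80% objective.
remaining = error_budget_remaining(measured=0.95, objective=0.80)   # 0.75
print("room to take risks / add features" if remaining > 0.5 else
      "spend the next tickets on stability")
```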
Another thing that comes with SLOs is ever-inflating expectations. You're building a system to match expectations. If you overshoot the expectation, like in this case where we're at 95% against 80%, and you operate at 95% for six months, then unfortunately, although 80% is what it says and 80% is what's calculated, people are going to start to expect that it's going to be 95%, because humans like to mistake correlation for causation. They like to draw patterns and things, right? So when you have this error budget, take risks. When you have this space, make sure you remind people that systems can fail and your systems will fail, and they need to be prepared for that eventuality. In a way, don't let them get too comfortable, because that leads to massive problems down the road where all of a sudden they expect more than you're really willing to give and commit to.
Okay, let's look at another quick example: access provisioning. Here we've only got a 2% error budget; we're only slightly above where we want to be. We are above where we want to be, cool, so we don't necessarily need to go and look at this right now. But this is the kind of thing where we probably don't want to add new features if we can avoid it, because that's probably going to make it a little bit more unstable. Is there tech debt we can pay down? Can we increase the testing? What can we do to make this a little bit more stable and give us error budget back? How can we build the error budget up so we feel like we have enough room to take risks, to add new features, or to leave this alone if that's what we want to
do? And the last quick example is the compliance scans, where we're way below our budget. Ideally you should never end up 32% below, but who knows, these things can happen, right? But this way you can use it to drive things forward. So right now it's: okay, we're really underperforming here, we need to invest in stability. That needs to be the next sprint, or the next tickets to get picked up off the queue; we need to make sure we're getting the right tickets in here to bring this up to the level we want it to be at. Or, potentially (probably not, this isn't the best example of it, but potentially), the SLO that we set is too high. Is it turning out infeasible to maintain over time? Did we misjudge things? Did we make the wrong trade-offs? Do we need to go back and renegotiate with our consumers and say: we know we said we'd hit this 95% of the time, but we've actually found that's a massive amount of effort for us to maintain, and we want to drop it a little bit to give us time to do other things? Then you can have these conversations and really drive forward, as opposed to constantly being stuck not knowing what's going on.
And you can do this for almost anything. Again, this is just a table of some SLOs that I came up with for a hypothetical customer; they aren't anything in particular, and I don't expect you to read all of them (feel free to go look in the slides later). I would expect that there are things in here you disagree with, and that's fine. Really, at the end of it, that's the point. By making these things explicit, putting them down on paper, having the conversations around them, making them measurable and actionable, and making them the source of where you drive your priorities from, you make these things real. And instead of having all these people with different ideas of what is okay and what isn't, everyone comes away with a common understanding of what's okay. It's the same thing as having a ubiquitous language in domain-driven design: if you take away ambiguity and arrive at a common consensus, you can really move forward knowing that you're all on the same page. And that's one of the real powers of this.
Now, if you're on AWS and you want to build out an example, I do have a tiny little GitHub repo that goes from the account events all the way through to a Grafana dashboard on the other side. You can go and check it out; it's pretty extensible. Go and have some fun with it. Or don't. I'd like you to, but obviously there's no pressure.
So with that, I'll just quickly come back around on the key takeaways. Eliminating waste drives better outcomes: if we can stop gold-plating and over-engineering services, and make sure we are building stuff that's good enough to do what the client wants, we can drive better outcomes for the product we're building, because we're better able to reprioritize and make sure our time is always spent in the best place at the best time. The second is that implicit expectations exist. Even in that table of potential SLOs I put at the end, we all have slightly different ideas of what success and failure mean, in the network space for example. So you really need to look at these and find a way to surface them, and the best way I've found of surfacing them is with SLOs: taking these expectations and converting them into concrete outcomes that drive the way you build your product.
SLOs are not only for the Googles of the world; they're for everyone. I think they're fundamental to making sure you're building products that are fit for purpose and that really live and evolve with the customer, their differing demands, and the ability of your team. Not static systems that just live forever, but something you can see grow around the problem space you're trying to explore. So hopefully everyone found that useful. Again, the slides will be available afterwards, and obviously the video is available. If you want to find me on any of the socials, Twitter, LinkedIn, all that good stuff, I'm around and always happy to have a chat about this. I hope you have a wonderful rest of your day.