Transcript
            
            
            
            
            
            
              Hi, my name is Spoons and thanks for joining. I'm here to talk about driving
            
            
            
              service ownership with distributed tracing.
            
            
            
              Before I get started, just a little bit about myself. So my name is Spoons.
            
            
            
              Although my full name is Daniel Spoonhauer, no one calls me that. I'm CTO
            
            
            
              and a co founder at Lightstep, where we provide simple observability for
            
            
            
              deep systems based on distributed tracing. And I spend a lot of time at Lightstep
            
            
            
              thinking and working on both service ownership and, not surprisingly,
            
            
            
              distributed tracing. Before I helped found Lightstep, I was
            
            
            
              a software engineer at Google. I worked both on Google's internal infrastructure team
            
            
            
              and as part of Google's cloud platform team. I worked closely
            
            
            
              with SRE in both cases to build processes and roll out
            
            
            
tools to improve reliability and reduce the amount of work for teams
in both cases. Subsequently, at Lightstep,
            
            
            
              I carried a pager for many years, although they don't let me do that anymore.
            
            
            
              So great. So before I get too deep, just to kind of set some context,
            
            
            
              I want to talk about this question. What changed? I think this is a really
            
            
            
              important question generally for SRE, but I want to talk
            
            
            
              about what changed in the kinds of technologies that we use,
            
            
            
the kinds of architectures that we build, and the ways that we work together
            
            
            
              as well. Many engineering
            
            
            
              organizations are adopting services or other kinds of similar systems,
            
            
            
often alongside DevOps practices. But what does that mean for how we work
            
            
            
              together? So there's been a series of technical
            
            
            
              changes as we move from bare metal to virtual machines, to containers,
            
            
            
and finally to orchestration tools like Kubernetes.
            
            
            
              And each of these has provided additional abstractions, right?
            
            
            
              They've raised the level of the primitives
            
            
            
              that we can work with. They've also introduced additional complexity.
            
            
            
And more ways for those systems to fail. So there's a trade
            
            
            
              off there. And of course, even though we have those abstractions,
            
            
            
someone still needs to understand what's happening under the hood. Partly enabled
by these abstractions, we've also changed the way that we build these systems, right? So we've
            
            
            
              moved to microservices, maybe we're using serverless.
            
            
            
              These are all ways of building more loosely coupled
            
            
            
              applications where different pieces of it can be deployed independently, can be scaled independently.
            
            
            
              And speaking of independently, we've also
            
            
            
adopted methodologies like agile and DevOps.
            
            
            
              I'll talk a bit more about DevOps, but I really think of DevOps as a
            
            
            
              way to allow teams to work more autonomously, more independently,
            
            
            
              and to boost developer velocity. And while
            
            
            
              that's good in a lot of ways, that's also created some challenges. So I
            
            
            
              like to put this kind of in the form of this feedback loop, right?
            
            
            
              So if we look at the bigger picture in any software system, and really
            
            
            
              this goes for a whole bunch of other systems as well, we have a kind
            
            
            
of feedback loop, and on one half of this loop we've got control.
            
            
            
              Control is the systems, the levers, the tools that we
            
            
            
use to effect change. And in a software system today, that's things
like Kubernetes and associated tools. It could be things like a service mesh,
            
            
            
configuration management. These are all ways of effecting change.
            
            
            
              But there's another half to this feedback loop as well, and that's how we
            
            
            
              observe those changes. Observing is another
            
            
            
              set of tools we can use to understand what's happening. And it's really an important
            
            
            
              part of that feedback loop because it tells us what we should do next.
            
            
            
              Right. And I think maybe there's some good reasons
            
            
            
              for this, but the amount of investment
            
            
            
              that we've seen in the tools on the top half of this loop, I think
            
            
            
has really outpaced the investment and really the innovation
            
            
            
              as well on the observability side of things. And I know we've had tools
            
            
            
              that allow us to observe our systems for a long time. Maybe we didn't call
            
            
            
them observability tools, but I think with the shift to more loosely
            
            
            
              coupled services, with the shifts that we've made to our organization,
            
            
            
              it really requires a new way of thinking about how
            
            
            
              we observe these systems. And really,
            
            
            
              if we haven't adapted those systems, if we haven't innovated on those observability
            
            
            
              systems, it's like we've built this amazing car that goes super fast.
            
            
            
              We've got a great gas pedal, but we don't have a speedometer. So we have
            
            
            
              no way of knowing how fast we're going. And this has consequences at all kinds
            
            
            
              of different scales, both at the small scale. When we think about how we're going
            
            
            
              to do auto scaling and what are the metrics that we're going to use to
            
            
            
inform that, to how we decide when to peel off functionality into
            
            
            
              a new service, to thinking about the application architecture as a whole?
            
            
            
              And often the idea here is that
            
            
            
              we have given control and we've built out these control systems to allow
            
            
            
              teams to move independently. But what's happened is
            
            
            
              that we've lost the ability to understand performance or reliability as a system
            
            
            
              as a whole. That kind of brings me to the crux
            
            
            
              of what we've done here. Kubernetes and other control systems
            
            
            
              like that have given each team more power, right,
            
            
            
              more decision making, more control. But that control is now distributed,
            
            
            
              right? And DevOps likewise means that each team has more control of their own service
            
            
            
              when they push code. But each of those teams depends on a lot of other
            
            
            
teams, right? So say you're the team that's responsible for the service at the
top of this diagram. You're beholden to
            
            
            
              end users, say, to deliver a certain amount of reliability and performance,
            
            
            
              and you have control over that service, right? You decide when to roll out new
            
            
            
              code, you decide when to roll it back as well. But you depend
            
            
            
              on all of these services below you. And ultimately you're also responsible
            
            
            
for the performance of those services, right? If those services are slow,
and they're part of the functionality that you provide,
your service is going to be slow as well. But even though you have responsibility
            
            
            
              for that performance, you don't have control. Right? You don't have any way of rolling
            
            
            
              back those services other than reaching out to those teams and asking them to do
            
            
            
it. And it's not just other services. The things lower
            
            
            
              on the diagram here could be infrastructure, they could be managed services. Things outside
            
            
            
of your organization as well. But this gap that we've created,
            
            
            
              for better or worse, and I think for worse, between control and
            
            
            
              responsibility, it's really the textbook definition of stress.
            
            
            
              And so kind of what I want to talk about today is how can we
            
            
            
              use service ownership to lower that stress and
            
            
            
              to fill that gap between control and responsibility?
            
            
            
              Okay, how are we going to do that? Well, I'm going to talk a
            
            
            
              bunch of specifics, but at a very, very high level. To me,
            
            
            
              ownership really has two parts, and the two parts are, I think,
            
            
            
              both really important. The first one is accountability, and maybe that's not
            
            
            
super surprising. Of course, if we give people ownership, we have to hold them accountable
            
            
            
              to it. But another really important part of that is giving folks agency,
            
            
            
              right, the ability, the means to make things better.
            
            
            
              And I'm going to touch on both of these a number of times throughout the
            
            
            
              talk. At the same time, when we're talking about a loosely coupled
            
            
            
              system, we really need to think about the way that we're observing
            
            
            
              that system. And distributed tracing, as you may have guessed,
            
            
            
              is going to play a really big role in how we
            
            
            
              do that. Okay, so let me dive in and talk a bit about distributed tracing,
            
            
            
              and I'll come back and then we'll see how we can build better service ownership
            
            
            
              just to get everyone on the same page. Distributed tracing
            
            
            
              was sort of built and popularized by a bunch of
            
            
            
larger Internet companies, web search and
            
            
            
              social media companies as a way for them to understand their systems
            
            
            
as they built out a microservice-like architecture, even though
            
            
            
              we didn't call it that at the time. But just to give you a picture
            
            
            
              of what distributed tracing might look like, this is a trace.
            
            
            
              So almost any distributed tracing tool will show you something like this,
            
            
            
              or like one of these traces. And you can think of these
            
            
            
              kind of like a Gantt chart, right? So time is moving from left to
            
            
            
right? And each of these bars represents the work done by some service
            
            
            
              as part of handling an end user request. And then
            
            
            
as you go from top to bottom, you're kind of going down the stack,
            
            
            
              so you can see where time is being spent in servicing those requests.
            
            
            
              As I'll explain a bit more in a minute, though, a trace is really just
            
            
            
              a building block of what we can do with distributed tracing.
            
            
            
              And so I'll speak a bit more about that, but I want to kind
            
            
            
              of dive into what the building block is exactly.
            
            
            
So again, we have these bars that represent the work that's being done.
            
            
            
              And you can think of the arrows that are going sort of down the
            
            
            
              stack, right? These are the calls that are being made when one
            
            
            
              service has delegated responsibility for the request to a service below it.
            
            
            
              And likewise the arrows that are coming back up are when those things return.
            
            
            
So what the trace is really doing is encoding the
causal relationships between callers and callees, right? It lets us know who's
            
            
            
              responsible for servicing requests at a given moment in time.
            
            
            
              And therefore we can understand where to attribute failures.
            
            
            
              We can understand where to attribute things like slowness.
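To make that concrete, here's a minimal sketch, not any particular tracing library's API, of how a trace can encode those caller and callee relationships: each span records its parent, and walking that tree tells you where time is actually being spent. The services and numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """One bar in the trace: work done by one service as part of a request."""
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    operation: str
    start_ms: float
    duration_ms: float

# A toy trace: a frontend calls auth and checkout, and checkout calls payments.
trace = [
    Span("a", None, "frontend", "GET /checkout", 0, 120),
    Span("b", "a", "auth", "VerifyToken", 5, 10),
    Span("c", "a", "checkout", "PlaceOrder", 20, 90),
    Span("d", "c", "payments", "Charge", 30, 70),
]

def children(spans, parent):
    return [s for s in spans if s.parent_id == parent.span_id]

def self_time_ms(spans, span):
    # Time attributable to this service itself, assuming callees don't overlap.
    return span.duration_ms - sum(c.duration_ms for c in children(spans, span))

for s in trace:
    print(f"{s.service:<10} total={s.duration_ms:6.0f}ms  self={self_time_ms(trace, s):6.0f}ms")
```

Real tracing systems add trace IDs propagated across process boundaries, tags, and logs, but the parent and child structure is the core idea.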
            
            
            
I'm not going to say too much more about tracing here.
            
            
            
              If you want to understand more,
            
            
            
Google put out a paper a while back, about ten years ago,
about Dapper, Google's internal system, that
gives a lot of details about what was important at Google in rolling
out the system and some of its use cases, especially about the
            
            
            
              mechanics of how they collected and managed those traces.
            
            
            
              And even though that's a bit old, I think there's a great kind of perspective
            
            
            
              on the history of it. I also want to say I wrote
            
            
            
              the book on distributed tracing, so if there's more you want to learn about it,
            
            
            
              I strongly recommend this book. I don't get any
            
            
            
              royalties from it. We donated all the royalties to a good cause,
            
            
            
              but the book can obviously cover a lot more than I can cover today.
            
            
            
              From the very basics of distributed tracing, to how to think about costs
            
            
            
              and implementation, to getting value from tracing and even what
            
            
            
              tracing might offer in the future. Okay, I said traces are just
            
            
            
              the building block, though. What do I mean? Well, traces are really the raw material,
            
            
            
right? So distributed traces, those are just, you can think of them as structs,
right? It's a data type, it's this collection of data, but they're not the
            
            
            
              finished product. Right? And distributed tracing is really
            
            
            
that process, that art and science of deriving value from those traces.
            
            
            
              So just to give an example of that at Google,
            
            
            
              and you can read about this in the dapper paper, some of the most
            
            
            
valuable uses of this tracing data were not looking at individual traces, but looking
            
            
            
              at aggregates. So there was at the time, maybe there still
            
            
            
              is a weekly Mapreduce that was run that looked at the
            
            
            
              entire collection of traces for web search requests over the past week,
            
            
            
and it detailed how each team and their service had contributed
            
            
            
              to performance for web search. So whether that
            
            
            
              service was searching web documents or news or images or video or whatever,
            
            
            
              that report would essentially say what percent of the latency was
            
            
            
              due to each of those services. And then that can be used to then prioritize
            
            
            
              work done by those teams. And in fact, when it wasn't used,
            
            
            
              we actually had teams that would go off and they would spend a month or
            
            
            
              more time on optimization, which might have improved performance
            
            
            
              for their individual service, but had no effect on the overall performance as
            
            
            
observed by users.
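Just as a rough illustration of that kind of aggregate analysis, and not the actual pipeline described in the Dapper paper, here's a sketch that totals up each service's share of end-to-end latency across a batch of traces. The trace data here is invented.

```python
from collections import defaultdict

# Each trace is a list of (service, self_time_ms) pairs, where self time is the
# latency attributable to that service alone (as in the earlier span sketch).
traces = [
    [("web-search", 20), ("docs-index", 150), ("images", 40)],
    [("web-search", 25), ("docs-index", 90), ("news", 60)],
    [("web-search", 15), ("docs-index", 120), ("images", 80), ("video", 30)],
]

totals = defaultdict(float)
for trace in traces:
    for service, self_time_ms in trace:
        totals[service] += self_time_ms

grand_total = sum(totals.values())
print("Share of end-to-end latency by service:")
for service, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"  {service:<12} {100 * ms / grand_total:5.1f}%")
```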
            
            
            
So again, distributed tracing is about thinking about what the value is, and that's really what I
            
            
            
              want to talk about in the context of service ownership today. So service
            
            
            
              ownership, let's talk about what I think that means and
            
            
            
              some of the benefits and maybe the risks associated with moving forward
            
            
            
              and pushing that. So just
            
            
            
to define it, service ownership is really making teams
responsible for the delivery of their software and their services.
            
            
            
              Right. And just to be concrete, it can include responsibilities like
            
            
            
              incident response. I think that's what a lot of people think of, but it
            
            
            
              also probably involves paying for the infrastructure that those services are
            
            
            
              using, for the storage that they might be using. And of course, someone needs
            
            
            
              to fix bugs. And I think service ownership is an important part of figuring
            
            
            
              out how to triage and allocate bugs. Service ownership,
            
            
            
              I think, comes up a lot in the context of DevOps. So I kind of
            
            
            
              wanted to spend a minute just to compare those two.
            
            
            
              And there's obviously a lot of overlap and
            
            
            
              a tight relationship. But DevOps also means a lot of different things to different people.
            
            
            
              So DevOps can mean an engineering culture, a culture
            
            
            
              that's really based on cooperation between the
            
            
            
people that are developing and operating software. In some cases they
            
            
            
              might be the same people, but not always. DevOps can also mean a set of
            
            
            
tools. So it might mean the latest CI/CD tool,
it might mean some of the other infrastructure that you're using in order to provide
a platform for the rest of your organization. I really like
            
            
            
              to think a little bit higher. I mean those are great definitions as well,
            
            
            
              but I like to think of DevOps as I mentioned, as a feedback loop between
            
            
            
              developers and their users, right. And creating a
            
            
            
tight feedback loop so that those developers see immediately
            
            
            
              what the effects of the code that they're writing and deploying are and
            
            
            
              they're able to take that and then use it in terms of how they continue
            
            
            
to do product definition and software development. So in a lot of ways I think
            
            
            
              DevOps can be a bit broader and apply more to a larger
            
            
            
set of processes. And I think service ownership is really just
thinking about, for a team that both develops and
operates, what the set of responsibilities is that they have in order to make sure that
they're doing that reliably and as
their customers expect.
            
            
            
              Okay, so what are some of the good things about service ownership?
            
            
            
              Well, one thing is by giving teams ownership over their
            
            
            
              services, you allow them to be more independent and hopefully that'll raise the developer
            
            
            
              velocity for your organization. They'll be able to coordinate
            
            
            
              less and focus more on the functionality that they're providing.
            
            
            
              It also is a way for organizations to hold engineering
            
            
            
              teams accountable and to tie their performance to real business metrics.
            
            
            
              I'll talk a bit more about this in a few minutes. But for application
            
            
            
              developers, this is really about what their customers are
            
            
            
adopting and perceiving and how they're using the product. For platform engineers,
            
            
            
              it's really thinking about the organization as a whole and how other
            
            
            
              application developers, teams within the organization are leveraging tools and
            
            
            
              infrastructure and delivering on those promises that they're making to their customers.
            
            
            
Now, there's obviously risks that come along with this as
well. If you make teams more independent, you allow them to make independent choices,
they might make different choices, right? And you can have divergence.
Then you have more frameworks, more tools, and that
            
            
            
              can have some downsides, right? So one of those downsides is that you have higher
            
            
            
              vendor costs. You're not getting the same economy of scale that you might get
            
            
            
              if you're just using one tool consistently throughout your organization.
            
            
            
              It also might mean that there's more training not only for new team members,
            
            
            
              but as developers transfer within teams,
            
            
            
              they need to learn a new set of tools, a new set of processes and
            
            
            
              thinking about it from the point of view of the organization as a whole or
            
            
            
              maybe from a platform engineering team, it's harder to get a sense of the big
            
            
            
              picture of your application. So if
            
            
            
              we think about how to kind of balance these benefits and
            
            
            
              risks and think about those trade offs on both sides,
            
            
            
              they come from this idea of this independence, right? That we are allowing teams to
            
            
            
              make their own choices. So I think the way that we're going to manage these
            
            
            
              trade offs is by allowing for that independence, but at the same
            
            
            
              time defining clear responsibilities and goals for those teams.
            
            
            
              We allow them to make choices, but we give them some guardrails about how they
            
            
            
              can make those choices. And then at the same time it's about
            
            
            
              ensuring consistency, right? So maybe there are some kinds of tools that we
            
            
            
              allow teams to make independent choices about, but maybe not for other tools.
            
            
            
And when we talk about measuring the results of their work, we want
            
            
            
              to make sure that we're doing that consistently and we're measuring progress towards those
            
            
            
              goals and we're holding those teams accountable. Okay,
            
            
            
so I think, thinking about service ownership in this context
            
            
            
              now, how do we allow for this independence? How do we provide for this consistency?
            
            
            
              And like I said, this is going to come back to accountability and agency.
            
            
            
              Like I mentioned at the beginning. Okay, how do we drive towards service
            
            
            
              ownership? I kind of have three pieces to this puzzle that I
            
            
            
want to talk about, each in turn. Those are documentation, on-call,
not surprisingly, and then service level objectives.
            
            
            
              Okay, to start with documentation,
            
            
            
              the first step is really creating consistent and centralized documentation
            
            
            
              specifically around services in your application. As we grew
            
            
            
at Lightstep, and I'm sure as a lot of your organizations have, we looked
            
            
            
              to define responsibilities for services as the teams got
            
            
            
              larger, we split teams as the number of services grew.
            
            
            
              But before you can split those responsibilities, I think you need to know who the
            
            
            
              experts are. I mean, knowing that is valuable in itself,
            
            
            
but those experts are really going to serve as the seeds for
            
            
            
              these clusters that will take over ownership of the services.
            
            
            
              That documentation is also a way to share that expertise. Right.
            
            
            
And a way for others to find related information. That could come in
the form of finding telemetry and dashboards. It could be to find
            
            
            
              alert definitions or when an alert fires, to find the playbook that helps you do
            
            
            
that. One of the things that we found useful was to use a template
for this kind of documentation, right? So you know that you have some
consistency, and an engineer or developer knows that when they
            
            
            
              go to it, they'll be able to find links to say a dashboard or to
            
            
            
logs or to traces that help them understand what's happening.
            
            
            
              The other thing about a template that's I think, really interesting is that it allows
            
            
            
              you to run reports over that and extract information from that documentation,
            
            
            
              right? So now we can ask questions like how is expertise divided,
            
            
            
              right? Are there certain people that are listed as experts for
            
            
            
              more of the service? How does that change over time? And if we've written the
            
            
            
              documentation in a totally ad hoc and unstructured way,
            
            
            
              it's a really manual process to discover that. But if we've built a kind of
            
            
            
              template, a form, we can extract that information much more easily.
            
            
            
              The other thing is if we put all the documentation in one place, it's really
            
            
            
              easy to audit how often it changes, right? And you can require
            
            
            
              periodic updates. You can ask when was the last time for the documentation
            
            
            
              for a service x updated? Okay, that was too long ago. Someone on that team
            
            
            
              is going to need to be responsible for updating it.
            
            
            
              Now, centralized documentation is great. Even better is if you can make it machine readable,
            
            
            
              right? So if you do that, you can use
            
            
            
the documentation actually as part of building and
            
            
            
              deploying these services, right? So you can use it to generate dashboard config,
            
            
            
you can use it to define escalation policies, and to define
how deployments work. And that's great, one,
            
            
            
              because it saves a lot of time, it makes it easier to define new services,
            
            
            
              there's less work to do there, but it also makes
            
            
            
              documentation necessary as part of the day to day work
            
            
            
              of a developer. It makes it necessary for them to get their job done.
            
            
            
              And if the only way to add a service to the CI pipeline is
            
            
            
              to add it to the documentation, then you can be sure that the documentation is
            
            
            
              going to be up to date, right? And that's really what we want. We want
            
            
            
the documentation to be up to date, because if it's not up to date, people will
            
            
            
              lose trust in it and it won't be valuable, they won't go to it,
            
            
            
              and there's sort of a downward spiral that we'll be in.
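As a hedged example of what that might look like, here's a sketch of a machine-readable service record and the kind of config you could derive from it. The field names, URLs, and helper functions are hypothetical, not from any particular tool or from Lightstep's own setup.

```python
# Hypothetical machine-readable service record; fields are illustrative only.
SERVICES = {
    "checkout": {
        "owner_team": "payments-team",
        "oncall_rotation": "payments-primary",
        "dashboard": "https://dashboards.example.com/checkout",
        "playbook": "https://wiki.example.com/playbooks/checkout",
        "slo": {"sli": "p99_latency_ms", "threshold": 5000, "window_minutes": 5},
    },
}

def escalation_policy(service: str) -> dict:
    """Derive an escalation policy entry from the service record."""
    record = SERVICES[service]
    return {
        "service": service,
        "notify": record["oncall_rotation"],
        "runbook_url": record["playbook"],
    }

def alert_definition(service: str) -> dict:
    """Derive an alert from the documented SLO so the two can't drift apart."""
    slo = SERVICES[service]["slo"]
    return {
        "name": f"{service}-{slo['sli']}-breach",
        "condition": f"{slo['sli']} > {slo['threshold']} over {slo['window_minutes']}m",
        "link": SERVICES[service]["dashboard"],
    }

print(escalation_policy("checkout"))
print(alert_definition("checkout"))
```

The point of a sketch like this is that the record feeds both humans and tooling, so keeping it current is part of getting work done rather than a separate chore.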
            
            
            
The other thing to think about in keeping documentation up to date is just really focusing
            
            
            
              on which documentation should be written by humans, right? Not all
            
            
            
of it should be. Some of it really should be dynamic.
            
            
            
              If we're talking about which team owns a given service, fine, like humans
            
            
            
              need to update that. But one of the notorious problems that we
            
            
            
looked to tackle over and over again at Google was trying to record service
            
            
            
              dependencies and it was just incredibly hard to get teams to do
            
            
            
              that because it was constantly changing. It's a function of the software itself,
            
            
            
              not of the humans involved. And so asking humans to do that,
            
            
            
              I think, was, well, I mean, the conclusion we came to
            
            
            
              in the end was that it was never going to work, but doing it programmatically
            
            
            
              makes a lot more sense, I think. And that's one of the ways that distributed
            
            
            
              tracing can come into play. So this is a service
            
            
            
              diagram that I pulled from our own system,
            
            
            
              and Aggie is one of our internal services that we run as
            
            
            
part of Lightstep's product. And this is an automatically
            
            
            
              generated diagram from a set of traces that tells us the dependencies of that,
            
            
            
              not only the immediate ones, but the
            
            
            
              transitive dependencies as well. So we can discover dependencies that are two, three, four,
            
            
            
              even more hops away. And we can actually annotate that
            
            
            
              with other information, like which of those services is actually contributing to latency for
            
            
            
              my service. So that, say I just got paged
            
            
            
              for latency. Even without any additional information, I can
            
            
            
              already have a guess just based upon this kind of dynamic documentation about
            
            
            
where I should start looking.
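Here's a rough sketch of that idea: deriving immediate and transitive dependencies from the caller and callee edges observed in traces, instead of asking humans to keep them current. Apart from Aggie, the service names are invented for illustration.

```python
from collections import defaultdict

# (caller service, callee service) edges observed in a batch of traces;
# in practice you'd derive these from span parent/child relationships.
edges = [
    ("aggie", "query-planner"),
    ("aggie", "storage-frontend"),
    ("query-planner", "index"),
    ("storage-frontend", "blob-store"),
]

graph = defaultdict(set)
for caller, callee in edges:
    graph[caller].add(callee)

def transitive_deps(service, graph, seen=None):
    """All services reachable from `service`, however many hops away."""
    seen = set() if seen is None else seen
    for dep in graph.get(service, ()):
        if dep not in seen:
            seen.add(dep)
            transitive_deps(dep, graph, seen)
    return seen

print(sorted(transitive_deps("aggie", graph)))
# ['blob-store', 'index', 'query-planner', 'storage-frontend']
```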
            
            
            
Like I said, when it comes to documentation, wikis are great for people
processes, but don't try to record information about the software there, right? Because it's just
            
            
            
              going to change too fast and it won't be useful.
            
            
            
              So as a whole, why is
            
            
            
              documentation important? Well, like I said,
            
            
            
              it's a shared database of ownership, right? It's about recording who
            
            
            
              is accountable in a way that everyone can see. But more than
            
            
            
              that, you can also use it to automate a lot of mundane tasks. So having
            
            
            
              up to date documentation can also be quite valuable just for reducing toil.
            
            
            
              It can also be used to train new team members, obviously. But I
            
            
            
think one thing that was important to us at Lightstep in
            
            
            
              really improving a lot of our internal documentation was building confidence in
            
            
            
the developer and engineering teams. In previous roles that I've been in,
when we've tried to change responsibilities, especially around production
systems, developers can be pretty unsure.
            
            
            
              This is kind of going back to this definition of stress.
            
            
            
              They want to do a good job, right? They want to be delivering
            
            
            
              great service, but if they don't feel like they have the information to do that,
            
            
            
              well, that can be a really stressful situation for them. And so having
            
            
            
              documentation goes a long way towards building that confidence, towards giving
            
            
            
              them that certainty and making them comfortable with those changes in responsibility.
            
            
            
              And of course, there's no place that developers
            
            
            
probably feel more stressed, at least many developers, than around on-call
rotations, right?
            
            
            
              Obviously, like I said, one of the most stressful moments for a lot of engineers,
            
            
            
              maybe not all of you, but certainly a lot of folks that I've worked with.
            
            
            
              And if you're going to establish service ownership,
            
            
            
              really, this is one part that you absolutely have to do
            
            
            
              right. So just to kind of lay out what
            
            
            
I think on-call can mean, or at least what on-call has meant in organizations that
            
            
            
              I've worked in, obviously incident response is a big piece of that,
            
            
            
              but I think not the only one. And like I said,
            
            
            
this might not apply to every organization, but at least in one organization
I've worked in, on-call has been responsible for a bunch of other things as well.
            
            
            
So one of those is communicating status internally within the
            
            
            
              organization and externally to customers. Oncall is often responsible for
            
            
            
              managing changes within production, whether that's deploying new code themselves
            
            
            
or being kind of a traffic cop for deployments or
            
            
            
              other infrastructure changes within the production environment. On call
            
            
            
              is often responsible for sort of passively monitoring dashboards
            
            
            
              and also handling low urgency alerts, customer requests,
            
            
            
              and other kinds of interrupt driven work. In one role
            
            
            
we thought of on call as just the person who's getting interrupted all the time,
            
            
            
              and they ended up just getting all the interruptions. But in addition to
            
            
            
              that, they're also responsible for handoffs between oncall
            
            
            
              shifts. So transferring information to the next on call,
            
            
            
and in the case where there are incidents, writing postmortems
so that we can address those. So thinking about how to
improve all of these and to do them well, I think, is really going to be
            
            
            
              critical to doing service ownership well. So I wanted to
            
            
            
              kind of start with incident response, since that's
            
            
            
              certainly the biggest one. And if you think about
            
            
            
              service ownership, yeah, doing this well is really going to be important. And there's a
            
            
            
              lot of ways that we can do incident response well or improve incident response
            
            
            
              as it exists today. One of those is making pages more actionable,
            
            
            
              making it easier to mitigate those problems
            
            
            
              or ignore them if they're not problems. Another one is to deliver pages to the
            
            
            
              right teams. And finally, we can also just reduce the number of pages overall.
            
            
            
              So I said we can
            
            
            
              make alerts more actionable. Really, that's about understanding root
            
            
            
              causes, right? Like how do we get more quickly to what the root cause
            
            
            
              is, or root causes are so that we can take action to address and mitigate
            
            
            
              those things? And one of the things that we found
            
            
            
to be really useful at Lightstep is to actually annotate
            
            
            
alerts, not only with the condition
that was triggered, obviously, but to add in additional information
            
            
            
              that helps us understand why that happened. Right. And so
            
            
            
              if I were to receive this page, I know that latency has gone
            
            
            
              up, but if I click on this link here, I also get an example.
            
            
            
              This is evidence of latency going up. This is a slow request,
            
            
            
              and now I can look and try to understand what's happening not only in this
            
            
            
              service, but in services that are deeper down the stack. And maybe in
            
            
            
              this case, I can look and see that the work that's being done
            
            
            
              by the service is actually being sharded. Right. It's divided into a bunch of
            
            
            
              pieces, and it turns out that those shards are not very equally balanced.
            
            
            
              One of them is taking a lot longer, and that's really what's driving up latency
            
            
            
              in this case. So being able to do this root cause analysis quickly without
            
            
            
              digging through lots of information is a way to improve that experience of oncall.
            
            
            
Of course, you know, even better than having to dig
            
            
            
              through a bunch of that stuff is making sure that the right people are involved.
            
            
            
              Just from the beginning. I know one of the teams that I worked at,
            
            
            
              at Google, we were relatively high on the stack, and when we would get paged,
            
            
            
              often the only thing that we could do was to turn around and page another
            
            
            
team to tell them that it was actually their problem. And I
            
            
            
              wanted to give credit. This is from a talk that Luis Monero
            
            
            
              did last year at Srecon, based upon some work
            
            
            
that he and others did at Zalando, which is an e-commerce company based in
            
            
            
              Europe. And I think it's really cool work.
            
            
            
              So, like I said, at Google, my team was often responsible for sort of
            
            
            
              page routing in a way, which is a horrible thing for a human
            
            
            
to do, especially at 3:00 in the morning. And so what
            
            
            
              they've built at Zalando is actually programmatically doing that
            
            
            
              routing. So they still alert based upon symptoms, right, as you should alert
            
            
            
              based upon things that their end users are observing. But what
            
            
            
              they've done is that when that happens, they actually look at traces from
            
            
            
              the application itself, and if there's an error that triggered that alert,
            
            
            
              they look at all of the immediate dependencies of the service
            
            
            
              that triggered the alert and say, do any of those dependencies also show
            
            
            
              errors in this trace? If yes, repeat and go and look at each
            
            
            
              of their dependencies. If any of those dependencies have errors, repeat, go and look
            
            
            
              at the next service down the stack and keep going until we find a service
            
            
            
              or services that don't have any immediate dependencies with errors.
            
            
            
And then page those teams. They found that this is the best place to start.
            
            
            
              It might not always be the right place, but it's better than starting at the
            
            
            
              top of the stack and going down one service at a time.
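Here's a sketch of that routing idea as described, not Zalando's actual implementation: starting from the service whose symptom-based alert fired, keep descending through dependencies that also show errors in the trace, and page the deepest services whose own dependencies look healthy. The service names and error data are made up.

```python
# errors: which services reported an error in this trace.
# deps: immediate dependencies of each service, as observed in the same trace.
errors = {"frontend", "checkout", "payments"}
deps = {
    "frontend": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "payments": ["card-gateway"],
}

def teams_to_page(alerting_service, deps, errors):
    """Walk down the dependency tree; page the deepest erroring services
    whose own dependencies show no errors in this trace."""
    frontier, to_page, seen = [alerting_service], set(), set()
    while frontier:
        service = frontier.pop()
        if service in seen:
            continue
        seen.add(service)
        erroring_deps = [d for d in deps.get(service, []) if d in errors]
        if erroring_deps:
            frontier.extend(erroring_deps)   # keep descending the stack
        else:
            to_page.add(service)             # likely root-cause candidate
    return to_page

print(teams_to_page("frontend", deps, errors))  # {'payments'}
```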
            
            
            
              And yeah, like I said, I would have loved to have this kind of thing
            
            
            
              on the team that I worked at. It's a great way to get information
            
            
            
              to the right people. And like I said, this is sort of a function of
            
            
            
              the way that we've distributed ownership and the way that we've
            
            
            
distributed the code itself, right? And as we've broken apart
the application into these more loosely coupled parts, the trace is really
            
            
            
              critical to understanding how to respond to these events.
            
            
            
I want to touch briefly on one other part of being on call, and
            
            
            
              that's writing, sharing and reviewing postmortems.
            
            
            
Postmortems, I think, are a really important part, even if they're not sort of the
            
            
            
              same adrenaline rush that being paged is.
            
            
            
But it's really about not repeating issues that might
come up again and, maybe more importantly, about improving responses,
because the same issues are not always going to come up over and over
again. So how can we respond better to a novel issue next time?
            
            
            
              And for post mortems to really be blameless,
            
            
            
              establishing what happened in an objective way is really important.
            
            
            
              And I've seen again and again that doing this through real telemetry,
            
            
            
              especially in a distributed system, using tracing, is really important.
            
            
            
              So I can think of a number of times
            
            
            
              when in the writing or the reviewing of a post mortem,
            
            
            
              there is essentially a disagreement about whose fault a latency problem is.
            
            
            
Right? Is it that service A is making an incorrect call
or is configured incorrectly? Or is it that service B is too slow in servicing that
            
            
            
              request? And if you look at aggregates, if you're
            
            
            
just looking at something like p50 latency,
            
            
            
              those two teams can have a pretty different perspective of what's going on, especially if
            
            
            
              they're not accounting for things like the network in between, and if they're not
            
            
            
              really making sure that they're pairing up slow requests on one side with
            
            
            
              the same kind of corresponding requests on the other side. And what tracing helps
            
            
            
              you do is really understand those causal relationships, right? It allows you to
            
            
            
              pair up a slow request on one side with one service
            
            
            
              with the response that was part of that request on the other side, and really
            
            
            
              understand if that slowness is responsible there.
            
            
            
              And look at the logs, look at the request parameters to understand
            
            
            
              what service needs to change in order to improve things.
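As a small illustration of that pairing, here's a sketch that matches a caller's client-side span with the callee's server-side span for the same request and splits the latency between the callee and everything in between. The request IDs and numbers are invented.

```python
# Client-side spans from service A and server-side spans from service B for
# the same requests, matched by a shared request/span id.
client_spans = {"req-1": {"duration_ms": 480}, "req-2": {"duration_ms": 95}}
server_spans = {"req-1": {"duration_ms": 460}, "req-2": {"duration_ms": 30}}

for req_id, client in client_spans.items():
    server = server_spans.get(req_id)
    if server is None:
        print(f"{req_id}: no server span; the request may never have arrived")
        continue
    overhead = client["duration_ms"] - server["duration_ms"]
    print(f"{req_id}: callee spent {server['duration_ms']}ms, "
          f"network/queueing accounted for {overhead}ms")
```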
            
            
            
              So yeah, obviously improving on call is important,
            
            
            
not just for the obvious reasons, right, that it
has real impact on your customers' experience and on revenue and reputation
            
            
            
              and things like that. But it has a cost internally as well,
            
            
            
              right? Because time spent handling pages, writing post mortems,
            
            
            
              handling those interrupts, that's time that developers and
            
            
            
              engineers are not spending building new features or doing proactive optimization.
            
            
            
              Right? So there's a cost to that. And then the
            
            
            
              stress of being on call has a major impact, I think, on job satisfaction
            
            
            
              for a lot of developers. And so thinking
            
            
            
about that stress: it can be mitigated by
            
            
            
              improving on call, it can be mitigated by having good documentation.
            
            
            
              And I mentioned reducing the number of pages is also a great
            
            
            
              way of improving on call. Like giving teams the agency
            
            
            
              to do that. Right? Like giving teams the agency to say, hey, look, this alert
            
            
            
              is not valuable, right? It's not helping us meet
            
            
            
              our goals and so we want to delete it. And that's actually going
            
            
            
              to make our lives better and make us more productive.
            
            
            
              But like I said, we need to understand their goals, right? So how do we
            
            
            
              think about holding teams accountable for on call? Like what are the goals
            
            
            
              in a way that we can measure? Right. Well, that brings me to my next
            
            
            
topic, which is to talk about SLOs. So service level
            
            
            
              objectives, again, I'm just going to give kind of
            
            
            
              a whirlwind kind of intro to these. There's a lot
            
            
            
              more that could be said, obviously, but these are promises that service owners make
            
            
            
              to their customers, right? And those could both be
            
            
            
internal customers, that is, other people within your organization, or end users, people external
            
            
            
to your organization. And what's important about an SLO is that it's stated
            
            
            
              in a way that can be measured on relatively short timescales.
            
            
            
              So to give an example of what an SLO looks like,
            
            
            
              it might be something like 99th percentile latency should be less than 5
            
            
            
              seconds over the last five minutes. And to kind of break this down.
            
            
            
              So the first part is the service level indicator. That's the metric,
            
            
            
the thing you're measuring, right. The second part is the threshold.
            
            
            
              That's kind of the goal in a way. And usually this is expressed as an
            
            
            
              inequality, right. We want to keep latency down and then finally
            
            
            
              we have the evaluation window. And I'll say a bit more about that
            
            
            
              in a second, but that's really important for making sure that we're measuring things in
            
            
            
              a consistent and precise way. So just to give some examples of
            
            
            
              other sorts of indicators. So I mentioned latency. You might choose different
            
            
            
              percentiles depending on what's important to your customers. You might
            
            
            
              measure error rate. That's important for a lot of folks. Availability is often
            
            
            
something that is promised to customers as well. Depending on your business,
            
            
            
you might also measure something like durability or throughput as well.
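To make those three parts concrete, here is a rough sketch of how an SLO could be written down as data. The class and field names are illustrative, not any particular tool's schema.

    from dataclasses import dataclass

    @dataclass
    class SLO:
        sli: str             # the indicator: the thing you measure
        threshold: float     # the goal, read as an inequality against the SLI
        window_minutes: int  # the evaluation window

    latency_slo = SLO(sli="p99 latency in seconds", threshold=5.0, window_minutes=5)
    error_rate_slo = SLO(sli="error rate", threshold=0.001, window_minutes=30)
    print(latency_slo, error_rate_slo)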
            
            
            
              So I mentioned the way that you measure
            
            
            
these SLIs is important, and this idea of
            
            
            
              a window. So when we look at a dashboard that's
            
            
            
              showing something like latency, usually what that's showing is what you might call instantaneous
            
            
            
latency. And that's good. That's usually the
            
            
            
              default and that's what we want to see when we're in the middle of an
            
            
            
              incident. Right. Because that's going to be the most responsive way of measuring this.
            
            
            
But if you're trying to measure an SLO, the problem with instantaneous
            
            
            
              latency is if you look on narrower and narrower timescales,
            
            
            
              it can actually significantly change the value of it. And if there's
            
            
            
one thing that's important about SLOs, it's that we all agree on what
            
            
            
the definition is and that we're all measuring it in the same way.
            
            
            
              And so when we look at something like latency for an
            
            
            
              SLO, we're really going to talk about measuring it over something like the last five
            
            
            
              minutes or over a five minute window. And really what that's doing is looking at
            
            
            
all of the requests over that five minute window. If we're looking at p99,
            
            
            
              then looking at the fastest 99% of
            
            
            
those requests and making sure that all of them fit under some threshold.
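As a rough illustration of that windowed calculation, here is a small sketch; a real SLO pipeline would stream this rather than buffer every request, and the numbers are made up.

    def p99_under_threshold(latencies_s, threshold_s=5.0):
        # Look at every request in the window, keep the fastest 99%,
        # and check that the slowest of those still fits under the threshold.
        if not latencies_s:
            return True
        ordered = sorted(latencies_s)
        keep = max(1, (len(ordered) * 99) // 100)
        return ordered[keep - 1] < threshold_s

    window = [0.2, 0.3, 0.4, 1.1, 6.2] + [0.5] * 95  # pretend this is five minutes of requests
    print(p99_under_threshold(window))  # True: the 6.2s outlier falls in the slowest 1%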
            
            
            
              Okay, great. So how do we determine
            
            
            
SLOs? Well, there's a bunch of questions that you need to ask yourself.
            
            
            
              The first one is, what do your customers expect? What have you promised them already?
            
            
            
              Right. You might be legally bound to provide a certain level
            
            
            
              of service, or it might just be that there's an expectation and you can
            
            
            
              measure conversion and things like
            
            
            
that to understand that users get bored and leave if
            
            
            
it takes too long to service requests. You should also ask what you can provide
            
            
            
              today, right. There's no reason to set an Slo that you're not going
            
            
            
              to be able to meet or that you're not going to be able to meet
            
            
            
anytime soon. And so thinking about: what does the
            
            
            
              product roadmap look like? How much time do we have on
            
            
            
              the engineering team to make changes to improve performance or reliability and
            
            
            
              making sure that these all line up so that we're doing the best we can
            
            
            
              for our customers while providing the functionality that they need at the same time.
            
            
            
              Okay, so how do we actually do
            
            
            
              that? Right. Let me take a really small, simple example. So say
            
            
            
              here's a simple microservice
            
            
            
              based application, just three services in this case,
            
            
            
and say that we've promised our customers that
            
            
            
we'll serve requests 99%
            
            
            
              of the time within 5 seconds. And of course under some evaluation
            
            
            
window. And for service A,
            
            
            
the one that's labeled A at the top here, that sort of translates
            
            
            
              immediately to what they're on the hook to provide.
            
            
            
              But what about internal services? Right? How should this map to
            
            
            
service B? Right, so let's look at a trace, right?
            
            
            
              So how does a request actually flow through these things?
            
            
            
              And you probably want to look at more than one trace,
            
            
            
              in fact. But I just pulled out one here just as an example.
            
            
            
              So now that we see this, we can see, it looks
            
            
            
like today, at least in this example, service B is actually responsible for
            
            
            
a lot of the latency of service A. So we can also
            
            
            
give a kind of similar bound to service B in a lot of ways.
            
            
            
That is, it also needs to be able to serve p99 latency in less
            
            
            
              than 5 seconds. But what's interesting is that, sure,
            
            
            
in the kind of services diagram, there's one arrow between B and C,
            
            
            
but in this request, there's actually two requests from B to C that happen
            
            
            
in serial, which means that we need C to
            
            
            
be twice as fast. Right? So maybe this is what
            
            
            
you were thinking, that p99 latency for
            
            
            
              C needs to be less than two and a half seconds. If you think
            
            
            
              about it, maybe for another minute, you realize that that's not quite
            
            
            
              correct either. In fact, there's two chances for C to
            
            
            
              fail in this case as well, right? So there's two chances for C
            
            
            
to serve a request in more than two and a half seconds.
            
            
            
So we actually need the bound to be even tighter than that. It's around the 99.5th
            
            
            
              percentile latency, and that's
            
            
            
sort of how we can pass that down to C.
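Here is the back-of-the-envelope arithmetic behind that, assuming the two serial calls to C behave independently.

    # B must finish within 5 seconds for 99% of requests, and each request
    # makes two serial calls to C, so each call gets half the latency budget...
    per_call_budget_s = 5.0 / 2        # 2.5 seconds per call to C
    # ...and each call must hit that budget often enough that both do so ~99% of the time.
    per_call_quantile = 0.99 ** 0.5    # ~0.995, i.e. roughly a p99.5 target for C
    print(per_call_budget_s, round(per_call_quantile, 4))   # 2.5 0.995
    print(round(per_call_quantile ** 2, 4))                 # ~0.99 for the combined request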
            
            
            
              Now what's kind of interesting in this is it might be that in some other
            
            
            
cases that B also depends on another service D.
            
            
            
              But at least in terms of this request, in terms of servicing
            
            
            
              a request that came from A, B doesn't depend
            
            
            
on D at all, right? And so thinking about D's SLOs,
            
            
            
              actually, we don't have any information to do that from this case. So looking at
            
            
            
              traces is really important. It's not just enough to look at the service diagram.
            
            
            
              The trace is really going to tell you what's going to help you there.
            
            
            
Okay, so why are SLOs important? They are
            
            
            
really about measuring success in delivering a service; they're about measuring success for on
            
            
            
call. Right. Teams can use them as a guide
            
            
            
to prioritize work. So if we've established an SLO, we can now
            
            
            
              understand how much improvement we need to make and we can use that to
            
            
            
              trade off against, say, new feature development.
            
            
            
              And it's a way of really holding
            
            
            
teams accountable consistently across your organization, right. So you want to make sure
            
            
            
              that as folks move from one team to another that they're not learning new ways
            
            
            
of doing this. And if you're going to measure teams' performance
            
            
            
by their ability to meet their SLOs, it's really important that you do that consistently
            
            
            
              as well. And then
            
            
            
finally, yeah, these are all ways of thinking about
            
            
            
accountability. But agency is also really important too. And SLOs
            
            
            
              are really a way of giving folks a budget for thinking about how much room
            
            
            
              they have to push more
            
            
            
deployments out there. Right? Like how close are they to hitting their SLOs?
            
            
            
And that's really a way for them to build an error budget as well.
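As a small illustration of that budget arithmetic, with made-up numbers:

    slo_target = 0.99            # promise: 99% of requests meet the objective
    total_requests = 1_000_000   # requests seen in the evaluation window
    bad_requests = 3_200         # requests that missed the objective

    budget = (1 - slo_target) * total_requests   # 10,000 requests are allowed to miss
    remaining = budget - bad_requests
    print(f"error budget: {budget:.0f}, used: {bad_requests}, remaining: {remaining:.0f}")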
            
            
            
Okay, so just to kind of review my three-piece
            
            
            
              puzzle here. So documentation obviously is
            
            
            
              an important part of that. I think more important than just documentation for documentation's sake.
            
            
            
But it's a way of establishing ownership and knowing who is going to be held
            
            
            
              accountable. But if you're going to do that, it absolutely has to
            
            
            
be up to date, right? You can't hold people accountable
            
            
            
              based upon documentation that's out of date.
            
            
            
              It's also really critical in building confidence within those teams
            
            
            
and, along with tools
            
            
            
that describe the dynamic state of a system, it's critical information for
            
            
            
folks that are on call or need to understand how a system is actually behaving.
            
            
            
On call: obviously, you can't do business without it.
            
            
            
              Incident management is often the part that people think about
            
            
            
most when you say service ownership. But I
            
            
            
want to call out that on call has a lot of other components to it
            
            
            
too. A lot of those are really tools and processes.
            
            
            
Finally, SLOs, right. These are really like how you hold teams
            
            
            
accountable, like I said, how you measure their success. And
            
            
            
in all of this, I think in a system where you have a loosely coupled
            
            
            
architecture, where you have teams that are moving independently, tracing is really
            
            
            
              critical to understanding causality in that system, to understanding who
            
            
            
is responsible and, at a given moment in time, which services are actually contributing
            
            
            
              to latency. And if you don't have that information, you're not going to be able
            
            
            
              to keep your documentation up to date. You're not going to be able to make
            
            
            
              good decisions while you're on call and you're not going to be able to set
            
            
            
SLOs in a way that actually reflects what your customers expect.
            
            
            
              Okay, so I mentioned error budgets.
            
            
            
              That's really just one kind of budget. And I think
            
            
            
              giving folks budget to improve reliability and giving them agency
            
            
            
              to do that will help them hit their goals and will lower their stress.
            
            
            
              But that agency requires them having the right information and the time to do it.
            
            
            
And so it really comes down to this: ownership doesn't come for free. You've got to
            
            
            
              give your teams time to actually invest.
            
            
            
              You've got to give them time to improve and to make things better.
            
            
            
              Okay, so, sounds great. How do we get this right? Where do we start?
            
            
            
              Well, making changes
            
            
            
              in a DevOps organization, it's hard,
            
            
            
              right? Rolling out new tools and new processes always has to
            
            
            
              be a bottom up thing.
            
            
            
              And whether that's how you run your sprints, which tools
            
            
            
you choose for development, what observability tools you use.
            
            
            
              If they're going to be adopted, they really have to provide value to
            
            
            
              those application development teams. And ideally more than
            
            
            
              provide value, they would be a necessary part of their day to day work.
            
            
            
              If you don't have those things, at least in my experience, it's just
            
            
            
              going to be a long, long uphill road to get those
            
            
            
              things deployed and adopted.
            
            
            
Then to establish and maintain service ownership, use a
            
            
            
combination of documentation, on call process and SLOs, and manufacture
            
            
            
a need for those tools and processes where necessary. Right. So what I mean by
            
            
            
              that is just to say make it a requirement to have service
            
            
            
              ownership defined within the documentation before a service can
            
            
            
be deployed, before it can be part of the deployment pipeline, like I mentioned.
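As a sketch of what that kind of gate might look like, using a made-up service manifest format rather than any particular pipeline's configuration:

    # Hypothetical pipeline check: refuse to deploy a service whose manifest
    # doesn't declare an owning team, an on-call channel, and a runbook.
    REQUIRED_OWNER_FIELDS = ("team", "oncall_channel", "runbook_url")

    def missing_ownership_fields(manifest):
        owner = manifest.get("owner", {})
        return [field for field in REQUIRED_OWNER_FIELDS if not owner.get(field)]

    manifest = {
        "name": "checkout",
        "owner": {"team": "payments", "oncall_channel": "#payments-oncall"},
    }
    missing = missing_ownership_fields(manifest)
    if missing:
        print(f"refusing to deploy {manifest['name']}: missing ownership fields {missing}")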
            
            
            
              So as a platform team, as part of
            
            
            
              engineering leadership, you have the ability to actually make
            
            
            
              these processes required in a way, and if you do that, that'll actually go a
            
            
            
long way towards them being adopted and becoming part of the
            
            
            
              tool set of the folks in your organization.
            
            
            
Okay, so just to kind of sum up, I think of ownership as
            
            
            
              really having two parts. Obviously, accountability is a big
            
            
            
              part of it. I think that's what folks think about a lot when they think
            
            
            
about ownership. And that's really setting the
            
            
            
              deliverables and the goals for the owners within your organization
            
            
            
and making sure that you're evaluating their performance based upon those goals and deliverables.
            
            
            
Right. That's really how you make those things sink
            
            
            
              in. And I think a second and equally important part is to give
            
            
            
those teams agency, agency to make change. Right.
            
            
            
So they're going to be a lot more inclined and a lot happier on call
            
            
            
              if they're able to control and make changes to the kinds of alerts they get,
            
            
            
              if they're able to make changes to the architecture itself. Right. So making sure
            
            
            
that you're offering them the information, allowing them to build confidence
            
            
            
and giving them the budget to improve is really critical, I think,
            
            
            
              in establishing service ownership. So with
            
            
            
              that, I wanted to thank everyone for your attention. You can find me
            
            
            
at Dave Spoons on Twitter. You can find me at lightstep.com.
            
            
            
              I'm always excited to talk about service ownership. I'm always excited to talk about distributed
            
            
            
              tracing.