Conf42 Site Reliability Engineering 2023 - Online

Introducing a New Reliability

Reliability is important. It’s what keeps your service from having costly, infuriating outages. Sociotechnical resilience is the often overlooked heart of reliability. If your team isn’t trained, equipped, and supported, you’ll be scrambling to keep up with outages.


  • Emily Arnott: What is reliability? It can be a surprisingly nebulous question. Orgs of all sizes are realizing that reliability should be a focus. She argues that product health should be contextualized by customer expectations and sociotechnical resilience. How do you measure this new reliability?
  • High demand is also a customer oriented factor that puts a lot of strain on these systems. If airlines can't maintain their standard of communication, there is inconsistent messaging. This creates a sociotechnical resilience crisis. Think of any service you're unhappy with, and you can see this threefold breakdown at work.
  • Product health, sociotechnical resilience and customer happiness are all important. It's not entirely clear just from this diagram which service you should prioritize. This can be a guiding framework to really uncover and highlight deficiencies.
  • The new definition of reliability is your system's health, contextualized by customer expectations and sociotechnical resilience. It transforms reliability into something quantitative from something totally qualitative. It's not always trivial to get these numbers, but it's in the process of gathering this data that you can figure out where your strategic priorities should lie.


This transcript was autogenerated.
Hi, I'm Emily Arnott, the content marketing manager at Blameless. And today I'm going to share with you something pretty exciting, a new definition of reliability. In this talk, I'm going to break down a surprisingly difficult question. What is reliability? We're going to look at why we need to align on a single definition, even though it can be kind of nebulous. We're going to zoom out a little bit and take a look at how this definition applies to the real world to get a little bit more context. Then we'll zoom back into the technical and look at three different services and how this definition of reliability applies to them and can help you prioritize. Then we'll take a look at the real gritty question of how do you measure this new reliability? How do you know you're moving towards it? So what is reliability? We can all agree that it's very important, right? It lives in the very center of what we do, site reliability engineering. But once you actually start asking people, it becomes a surprisingly nebulous question. If you ask ten different engineers, give me your definition of reliability, you'll probably get ten different answers. Maybe someone will say it's basically just your uptime, right? Like if you're reliably available, that means you're reliable, and that makes sense, but maybe you should include the error rate. Like, sure, the site is up, but if it's giving the wrong information, that kind of makes it unreliable. And then if you're kind of thinking along those lines, well, the speed matters too, if the site is giving the right information, but at a snail's pace, are people really going to think of it as reliable? Maybe some of the more SRE inclined people could cite Google's SRE book and say that it all comes down to the customer's expectations. They set the context and the standard of what is reliable enough based on their expectations. But that kind of just opens up more questions. Like what customers? How do you know these expectations? 
How do you know when they're happy? Do any of these really cover everything? I think even if you can answer all of these questions, I'm going to put forth that this still isn't holistic enough. It still doesn't consider one key ingredient that we're going to get into. And this stuff is important. Orgs of all sizes are realizing that reliability should be a focus. In many cases, it should be priority number one. And even if they don't know what reliability exactly is, they know when it's missing. Here are some examples of recent major outages among major tech corporations. Within minutes, decades of goodwill, of customer service, of customer happiness, can be wiped out by one bad incident. The costs can be astronomical for big companies and even more devastating for small companies that don't have the leeway to recover from a major outage. If you aren't considering these sorts of factors, incidents can easily overwhelm you and put you deep into the red. So what is our working definition of reliability? What is the one that I want to explain to you? Well, it starts with product health, which is something that I think we all understand. It's what's being spit out by your monitoring data. It's everything that you have in terms of how your system is functioning, how fast it's functioning, how accurate it is. But we like to contextualize that with customer happiness. As it says in the Google SRE book, these numbers really mean nothing in a vacuum. They have to be compared to what your customer's expectations are. And then we have our third ingredient, which I think is all too overlooked in the technical world, and what we consider kind of the secret sauce of this new perspective, sociotechnical resilience. This is the ability for your teams to respond quickly, confidently and effectively to any incident that may occur. And that's what we're really going to be focusing on in this presentation.
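As an aside, the competing answers from the start of the talk (uptime, error rate, speed) can all be computed from the same monitoring data and still tell different stories. Here is a minimal sketch with made-up probe records; the field layout and thresholds are illustrative assumptions, not anything from the talk:

```python
# Each hypothetical probe record: (responded, returned_correct_data, latency_ms)
probes = [
    (True, True, 120), (True, True, 95), (True, False, 110),
    (True, True, 4800), (False, False, 0), (True, True, 130),
]

responded = [p for p in probes if p[0]]

# "Reliability is just uptime": fraction of probes that got any response.
uptime = len(responded) / len(probes)
# "But wrong answers count too": fraction of responses with bad data.
error_rate = sum(not p[1] for p in responded) / len(responded)
# "But a slow answer isn't reliable either": fraction slower than 1 second.
slow = sum(p[2] > 1000 for p in responded) / len(responded)

print(f"uptime: {uptime:.0%}, error rate: {error_rate:.0%}, too slow: {slow:.0%}")
```

The same six samples look 83% reliable by one definition and only 80% by the others, which is exactly why agreeing on a definition matters.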
So think about the real world, for example, and you'll find that this definition is probably already deeply entrenched into how you make decisions, how you judge the reliability of something, where you put your confidence. So consider flying. We'll set up our three buckets here, and then think about all the different assumptions you make when you choose an airline. When you step onto a flight, in terms of customer happiness, you feel like the airline is prioritizing your needs, that they have some sort of picture of what would make you a happy, satisfied customer, and the choices they make are aligned with that. You also assume that the airline systems are working properly, that whatever computer programs are assigning you tickets, are printing your boarding pass, are making sure that there is in fact a seat for everyone who bought a ticket for that flight, are displaying the correct gates. All of these little things that build up the workings of the airline and the airport are functioning properly. You trust these things. You also trust that all the systems in place to stock the airline, the actual airplane itself, are consistently functioning. That there's going to be food and water, that the bathrooms are going to work, that there's going to be toilet paper, all these little things that you never even consider you're implicitly assuming are happening behind the scenes consistently. And when it comes to these stocking choices, when it comes to the things they prepare, once again, there's this implicit assumption that the airline is making these choices based on what you want, that the airline understands what makes you a happy customer and is working along those lines. But there's a major factor here, and this is something so implicit that we don't really even think about it. We also put a lot of faith that all the people who are working at the airline, that are flying the airplane, that are serving you, know what they're doing.
You assume that the pilot knows how to fly, right? That goes without saying. You also assume that the crew will show up on time and be ready to work. You assume that the crew will be in good spirits, that they'll cooperate with each other and with you, that they'll be able to execute on things that you need them to do throughout the entire flight. These things go without saying, and yet our understanding of what makes a good flight is so rooted in them. It's so essential to the operation of the entire industry. The airport staff itself also needs to know what to do. We extend this sort of thinking about sociotechnical resilience and capability to everyone we interact with in the process of flying the plane. So let's take a look at when this breaks down. In the previous holiday season, if you tried to fly out to see family or friends or go on a vacation, you may have had a rough time. You wouldn't be alone. There was a once in a generation winter storm that passed through the continental United States. Over 3000 flights were canceled. We can see here a dramatic spike in cancellations over many years. And we can also kind of examine how these things failed through the lens of our three pronged definition of reliability. So we bring up our buckets again. Something like bad weather we can kind of contextualize as product health. And this is something that's not always in the control of the people working at the airline. Something like bad weather certainly isn't. High demand is also a customer oriented factor that puts a lot of strain on these systems. Suddenly there were so many people wanting to fly out for the holidays. There were way more flights needing to be scheduled and filled and processed than on any typical day. And then as a result, the systems start hitting their limits.
The algorithms and programs they have set up, to do things like automatic seat assignment, to coordinate different schedulings and layovers, to make sure that everyone can actually get on the flight that they're meant to, start breaking down under this unreasonable amount of strain and sudden changes. As a result, flights start getting canceled. And this creates a huge domino effect where now different connecting flights are being canceled, which causes flights that they were meant to connect to, to be effectively canceled. And people who have complex journeys of multiple layovers suddenly have their whole house of cards falling apart. And what this resulted in was a lot of unhappy customers, in part because this was communicated poorly. People weren't finding out that their flights were canceled until it was way too late to make alternative plans, after they had already arrived at the airport. And now they're just stuck. It really broke down this trust between the customers of the airline and the airline itself. If the airline can't maintain its standard of communication, there is inconsistent messaging. People received completely conflicting requests and confirmations and were told one thing by one person and a different thing by another person. And then on the sociotechnical side, downstream of all of these changes in the customers' demands and their expectations and their happiness, and of the systems that they were relying on, we see a sociotechnical resilience crisis. In many cases, things that were done automatically by the systems now had to be performed manually by people who probably weren't trained to do these sorts of things. They weren't trained to have to process these crisis situations. Also, as a result of the pandemic, a lot of airlines significantly downstaffed. They laid off people. They reduced in size as travel wasn't really an option for a few years.
And now, as demand has spiked back to where it was before, if not higher, they're finding themselves understaffed. And people are stretched too thin and unable to perform all of their normal duties, let alone all of the additional duties of this crisis. And flying is just one example where we can see a crisis of unreliability play out in these three distinct areas. But it is really everywhere. In fact, I would say if you think of any sort of service you're unhappy with, you can very quickly start contextualizing it in these three categories to see how it really is always this threefold breakdown that leads to an undesirable outcome. For example, let's say you have a terrible cell phone provider. Well, what makes them so terrible? Sure, maybe the phone itself is not so great. It's missing features, it crashes, it's slow. Maybe the networking functionality isn't very good. Your cell phone provider doesn't have good service where you live. Maybe it's a customer happiness issue where they're prioritizing profits too much at the cost of having good customer service, good customer feedback loops, responding to the needs and market demands. And maybe it's a sociotechnical resilience failure where, even if they have the best intentions, the people who are meant to support you simply do not have the training to deliver on what you want. How often have you called technical support and found that the person on the other end really doesn't know how to answer your questions? They've been undertrained, they're probably overstressed, overworked and underpaid, and they can't live up to the expectations you have for them. There are many other examples. A car: when you're choosing a car, sure, you're thinking about the product health of the car, that is, its functionality. Is it top of the line in its features, its fuel economy?
You're maybe also thinking about how well the vision of the car company, and what they want to deliver to you, matches your expectations and what would make you happy. But you also really consider, is this car repairable? Am I going to be able to find mechanics that know how to fix it? Am I going to know that the people who I'm going to rely on when my car breaks will be trained enough to know what's going on? Think about your apartment building. An elevator goes out of service. Well, that's immediately a system health issue, but it also becomes a customer happiness issue, in that you need to trust that your building understands how debilitating it is to not have elevators, how inconvenient that is, understands how much of a priority it should be, and that they actually have a function within themselves: there's someone who's either trained to fix it, or trained to know who to ask to fix it, who has the time and the confidence to move this along quickly. If you don't have confidence in all three of these areas, a lot of these services that we take for granted in our lives can easily fall apart. So now that we've kind of seen how naturally this definition occurs in the real world, I want to challenge people to think, why don't we have the same standards for our technical solutions? Why aren't we considering the programmers behind the products we choose, the operators that are resolving bugs and incidents, and generally just understanding that in order to have a good product, there needs to be a confident team of people behind it? So in order to kind of illustrate what this looks like in a tech example and how it can help you strategize and prioritize in terms of improving your products, let's imagine three services. This is going to be a pretty simplified example. Nothing ever is so cut and dry in the real world, but hopefully it can illustrate the way that this sort of definition can inform your strategic thinking.
So we have three services, A, B and C, and we're going to look at them in these three buckets of reliability. So in terms of service health, that is all of your typical monitoring metrics. Service A is pretty good. It has a lot of uptime, it always runs smoothly, it's always returning the correct data. Service B is okay; maybe it has the occasional outage, maybe it has some slowdowns, but it's chugging along. And service C, that's bad. It has frequent major outages, it works inconsistently, and even when people are able to get requests through to it, it's very slow. So you're thinking, all right, I have some free engineering cycles, I want to shore up the problems of my product and make my customers happy. Which one should I look at first? You're going to say service C, right? That seems like an obvious choice, but let's throw in the next ingredient and contextualize this based on what the user expectations are for each of the services. Well, service A, it's kind of popular. It's a feature of your product that, let's say, around half of your users make use of semi-infrequently. It's not super critical for them to enjoy your product, but it's certainly an expectation they have that they'll be able to use it when they need to. Service B is something that's actually in very high demand. It's something that every customer, no matter who they are, no matter what their use cases are, is interacting with, and they need it to work or their experience is as good as ruined. And service C, let's say that's like a brand new feature. It's something that is only being rolled out to certain customers. The usage of it isn't all that high yet, and nobody really is hinging their continued usage of your product on whether or not service C is functioning. So now you're thinking a little differently, right? If you improve service C, sure, that's great, but maybe not that many customers are even going to notice.
It's going to have pretty small returns on your customer satisfaction, whereas A, and especially B, really do need to see some attention to meet up with these increased customer demands. So now maybe you're thinking, yeah, service B, that's the one I should take a look at. Finally, let's look at this bucket of sociotechnical resilience. Service A is a product that is very different than anything else that's offered in your company. It's something where the engineers aren't that trained, they aren't that prepared. They haven't dealt with many outages of a service like it. There are no runbooks in place, there's no typical escalation procedure for it, there are no communications that you've already written up around it. Whereas service B and service C, even though service C is new, it's very closely modeled after something like service B. And all of your engineers are very confident that they're able to resolve any issues that arise with it. They've been trained, they've been through it all before. All of this is kind of old hat to them. So now you're kind of thinking, well, maybe I should be spending these engineering cycles on shoring up service A proactively, writing some guides on how to fix it. Sure, its system health is good now, but as demands change and users' expectations change, suddenly it could be very unacceptable, and you have to kind of proactively prepare for that. Now, at the end of the day, it's not entirely clear just from this diagram which service you should prioritize. And obviously, in real life, things aren't going to be so cut and dry with such clean, single metrics of mid, high or low. But I hope you can see how this can be a guiding framework to really uncover and highlight where there are deficiencies. It's not just as simple as looking at monitoring data, but understanding this bigger picture of how things would actually unfold when something breaks.
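The three-bucket comparison above can be sketched as a toy scoring matrix. The service names and ratings mirror the talk's simplified example, but the risk heuristic and numeric weights are my own illustrative assumptions, not a formula the talk prescribes:

```python
# Map the talk's qualitative ratings onto numbers for a rough comparison.
LEVELS = {"low": 0, "mid": 1, "high": 2}

services = {
    "A": {"product_health": "high", "customer_demand": "mid",  "team_readiness": "low"},
    "B": {"product_health": "mid",  "customer_demand": "high", "team_readiness": "high"},
    "C": {"product_health": "low",  "customer_demand": "low",  "team_readiness": "high"},
}

def risk(buckets):
    # Crude heuristic: risk rises with customer demand and falls with
    # product health and team readiness.
    return (LEVELS[buckets["customer_demand"]]
            - LEVELS[buckets["product_health"]]
            - LEVELS[buckets["team_readiness"]])

# Print services from riskiest to least risky.
for name, buckets in sorted(services.items(), key=lambda kv: -risk(kv[1])):
    print(name, risk(buckets))
```

With these made-up weights, A and B come out tied and C trails well behind, which mirrors the talk's point: the diagram alone doesn't settle the priority question, but quantifying it frames the conversation.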
So when we're talking about these buckets in the technical context, they're all made up of smaller factors. And usually these are kind of questions you can be asking about your service. So let's break these down a little bit further. In the first bucket, product health, these are the things we know very well: all the data that comes out of your product as it runs, things that your telemetry is capturing, looking at the stability of the code base. Sometimes it requires some meta analysis, but it's all of this sort of objective, cold data about how your product is actually functioning. This can be embedded in the program as it runs, or it can be sort of a more black boxed approach where you're querying your product in a production environment and seeing how well it responds in terms of latency or the error rate, the traffic you're getting, the saturation of your resources, these being your typical four golden signals. These are all very measurable, very trackable facts about the way that your service is functioning. Customer happiness starts to become a little bit more nuanced, but it really boils down to: are your customers satisfied? Is the product healthy enough that they're happy to continue to use it? What does their user experience look like? When we think about users, it's maybe not so helpful to think about an individual as much as a function. What are the steps that they take when they use your product? How critical is each of those steps to their overall satisfaction? What types of steps link together, such that if one breaks, another one will break? Breaking down this user journey can really make these questions about user happiness a lot more tangible. You can start putting numbers to them and say, well, the login page has to work 99% of the time in under five milliseconds, or else users will start feeling like it's too slow or too unstable. You'll have to really understand this idea of importance.
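The hypothetical login target just mentioned (working 99% of the time, in under five milliseconds) can be written as a concrete check. The request records below are made up for illustration:

```python
# Hypothetical login request records: (latency_ms, succeeded)
login_requests = [
    (3.2, True), (4.1, True), (2.8, True),
    (4.9, True), (6.3, True), (3.0, False),
]

# A request counts as "good" only if it both succeeded and was fast enough.
good = sum(ok and latency_ms < 5.0 for latency_ms, ok in login_requests)
slo_attainment = good / len(login_requests)

print(f"attainment: {slo_attainment:.1%}, target met: {slo_attainment >= 0.99}")
```

Counting success and speed together is the key move: a response that is correct but slow still fails the target, which matches the intuition that a slow site doesn't feel reliable.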
Probably logging in is more important than searching, which is more important than adding something to a favorites list. But there's no universal prescriptive formula to this. It's something that you have to get from your users, from observing their behavior, from surveying them, to build up this kind of statistical understanding of importance. This one is definitely more of a nebulous question, but do they feel confident in your product? Do they feel like it will continue to work to their satisfaction? Does your business have a good reputation for a reliable product? Do they feel supported, informed? Are there places that they can ask questions? Are there FAQs where they can learn about failure modes and understand what they can expect from the product? Do they feel connected to you? Finally, if we look at this sociotechnical resilience bucket, this is where we turn inwards and think about the teams that are operating, supporting, improving and developing your products. So think about when something goes wrong. How effective is your team? Is there a lot of toilsome work that's repeated every time? Are they moving swiftly? Are they moving effectively? Are they repeating work because there isn't good communication between people, and people are trying the same solution independently over and over again? Is there clear service ownership? When something goes wrong, are the right people called in? Is there a clear understanding of who should be the subject matter expert, of who knows the answers to the questions? Are teams aligned on their priorities and responsibilities? When something breaks, do you have a clear sense of how important that breakage is? Is there a clear severity scale, and is there a consistent response that's proportional to that problem? Are the on call loads balanced? Are there people that are overworked, overstressed?
You have to kind of look at this not just through the lens of on call shifts, but the actual expected number of incidents for different types of services, for different teams of different specializations. In the end, you want something that feels fair and transparent, where everyone feels like they're putting in roughly the same amount of work. Are people burnt out? Are there too many incidents? Are people endlessly fighting fires? Do they feel like they're disconnected from the work that they want to do, which is probably more planned, novel feature work? Are people equipped with the tools and knowledge that they need? When something breaks, are they scrambling to figure out what to do? Do they have no way of consulting previous cases? Are there runbooks in place for them to work through? Or is everything coming from the top of their head? And does your team still function if somebody is suddenly away? Are there critical points of singular failure where, if so-and-so from the login team is missing, well, they're the only one that understands how that code works at all; if it breaks, nobody else has any clue what they're doing and everything falls into disarray? These are some of the questions you can start asking yourself to start getting a picture of where your overall reliability is now with this new definition. And it's in this last bucket that we really want people to focus, because the first two things, I think, most organizations already have some function to capture. But this looking inwards and really understanding, what are the stress points for your teams? Where do your teams feel confident and where do they feel insecure? That takes a lot of deliberate work, a lot of reflection, and should really motivate a lot of change. And once you're aligned on this, once you've decided that you really should prioritize looking at these three buckets, what does it do for you? What benefits should you expect to get from this alignment on reliability?
Well, first off, the value of just aligning in the first place is huge. So often in this world of distributed teams and microservices, where system architecture is so complex, there's a lot of inconsistent communication, prioritization and understanding between different teams. But if you have this singular definition, where any incident can be viewed through a lens of constantly thinking about what is the overall impact on our product's health, what is the impact of that on our customers' happiness, and how equipped are our teams to remedy that, then all different issues across all services can be kind of put into this one singular language, which creates a whole lot more alignment, motivation and engagement to improve these things. This motivation for impactful changes is also absolutely huge, because it's sometimes hard to get your teams to want to invest in reliability. There is an eagerness to want to launch new features, to gain that competitive edge, to get ahead of competitors on having all the latest and greatest tech. But by showing that this definition links into the health of your product, the happiness of your customers, and the confidence and capability of your teams, it becomes clear: what could really be more important than that? Finally, it prioritizes where these changes are needed by turning the question of, is our product reliable, into something that's a lot more measurable, that's a lot more monitorable, that's easier to assess. It creates a situation where you can see what improving reliability actually looks like in practice. It's not just this sort of vague goal of, let's make our product more reliable; it'll show you clear targets: we need to fix this so it meets customer expectations, we need to train this team so we don't have another twelve hour outage. It points you towards the changes that are most impactful and needed most.
So I mentioned a lot that one of the goals of thinking about reliability like this is that it makes it more measurable; it transforms it into something quantitative from something totally qualitative. And how you do that exactly will vary a lot from organization to organization. But what we think this definition offers a lot of is sort of a question and answer format that points you to the numbers that will be most valuable. So here are some examples you can think about. What are the sources of manual labor for each type of incident? Think about all of your most common types of outages or slowdowns or errors or whatever. When those things happen and they're being resolved, where is there a lot of toil? Where are there a lot of steps being taken manually? You can look at each of your engineers that work on call and think about how many hours they have spent actually in the trenches of responding to incidents, not just on call shifts, but actual on call work where they were paged and had to suddenly jump to their computer. How much time has your team spent fixing each service? Think about all your different service types and just come up with a total number of hours spent fixing them. Now, it's not always trivial to get these numbers. It's not like there's going to be a program that can spit them out for you. And it's not like these numbers will actually tell you the entire story, but instead it's in the process of gathering this data and having these conversations that you can start to really figure out where your strategic priorities should lie. So let's say you're trying to list off all the different sources of manual labor for each type of incident, and one has a lot. Oh, jeez. This thing, it goes down every couple of weeks. And every time it does, we have to do this, we have to restart that, we have to run this script, we have to coordinate with this guy, that guy has to run this script. It becomes very tedious and toilsome.
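The tallying just described can be sketched in a few lines. The incident records and service names here are hypothetical, and the two-times-average and half-average cutoffs are arbitrary assumptions chosen only to make outliers jump out:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical incident log: (service, hands-on hours spent resolving it)
incidents = [
    ("checkout", 6.0), ("checkout", 9.5), ("checkout", 7.0),
    ("search", 1.5), ("billing", 0.5),
]

# Sum hands-on hours per service.
hours = defaultdict(float)
for service, h in incidents:
    hours[service] += h

# Flag outliers in both directions: services eating too many hours, and
# services the team has almost no hands-on experience with.
avg = mean(hours.values())
for service, total in sorted(hours.items(), key=lambda kv: -kv[1]):
    if total > 2 * avg:
        note = "heavy toil: consider automation or more staff"
    elif total < avg / 2:
        note = "little hands-on experience: consider drills and runbooks"
    else:
        note = "about average"
    print(f"{service}: {total:.1f}h ({note})")
```

Note that both tails of the distribution are interesting: the high end points to automation candidates, while the low end points to services where, exactly as the talk argues, teams may need proactive practice before a real crisis hits.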
And once you're starting to tabulate these lists, once you're starting to quantify it, that leaps right out. You can see that that's an outlier, and you should go, yes, it's time to invest in automating this. When you're looking at the number of incident hours, you can see easily, wow, the team for this incident type is always busy. We should consider investing a few more people into this to pick up that slack, to make sure that there's nothing slipping through the cracks, to make sure this team isn't going to get burnt out. Once again, just trying to quantify things tells you this story that you can investigate further. And this one: every time you log some hours on a service, fixing it, repairing it, upgrading it, you become more experienced with it, you become more confident, you're capable of dealing with a greater range of issues that can emerge from it. So if you find an outlier again, where, geez, this service, thankfully, it hasn't gone down much at all, and as a result our teams have really no experience at all in fixing it, that's a sign that maybe you should proactively practice. You can run some drills, you can write some runbooks, you can prepare resources and processes that will get people confident even in the absence of a real world crisis. So measuring this thing, this brand new definition of reliability, it isn't trivial. It's something that varies from organization to organization, from product to product, but it's something that's really worth investing your time in. And the goal, again, isn't to come up with one magic formula that answers every question, but to reveal what the right questions are, to find outliers and to decide priorities in the process of investigating the numbers behind them. So, in conclusion, our new definition of reliability is your system's health, which we know very well, contextualized by the expectations of your users, what are adequate numbers, and then prioritized based on your engineers' sociotechnical resilience.
Where do they have confidence, and where do they need training and help in order to keep everything running smoothly? Here are a few citations around the cost of downtime in tech companies and also the crises among airlines last holiday season, so feel free to investigate those further if you want to see just how important this question has become to so many organizations. But I hope you've enjoyed my talk. I hope I've opened your eyes to a new way of thinking about reliability and motivated you into investing some serious time into building up that practice in your organization, too. Have a wonderful day. I've been Emily.

Emily Arnott

Community Manager @ Blameless

