Conf42 Site Reliability Engineering 2022 - Online

Reliability 101

Abstract

It is 7 AM; you wake after a night of uninterrupted slumber. Being on-call, you check for issues: was your pager out of batteries? Nope, things are quiet.

Imagine a world where outages are a myth, where a failure occurs but there is no customer impact and no engineer is engaged. This is the aspiration of Reliability Engineering: to operate complex distributed systems effectively, without customer-facing outages or heavy operational burden.

In this 101 talk, I will share the basics every team should know to start their reliability journey off on the right foot.

Summary

  • Kolton Andrus is CTO of a chaos engineering company named Gremlin. He worked at Amazon and Netflix, where he was responsible for ensuring reliability. He says the key to building reliable systems is to expect failure and keep it simple.
  • In the cloud, hosts can fail, and we have to be able to handle that contingency. Every time we add a new dependency to our service is a great opportunity to ask ourselves: is this a critical dependency? Can I gracefully degrade? Do I really need it?
  • There's more to our dependencies than just the services we rely on. Do we know what we rely upon? The second piece is knowing what's critical. In many cases, this is a bit of a guess, and it's a great place where failure testing comes into play.
  • James Hamilton: If we're unwilling to do testing and to validate our recovery mechanisms, then likely those recovery mechanisms won't work when needed. We have to debug, diagnose and fix at the same time. We want to make sure they work and have confidence in them.
  • Go test your redundancy. Graduate to being able to lose a zone or a data center. Turn your disaster recovery plan from an academic paper exercise into a real, live exercise. Lastly, and not to be understated, we need to be able to train our teams.

Transcript

This transcript was autogenerated.
Hi, welcome to my talk. It's a pleasure to come share with you some of the lessons I've learned in building reliable systems along the way. Let me jump in by introducing myself. My name is Kolton Andrus, CTO of a chaos engineering company named Gremlin. Prior to that, I worked at Amazon and Netflix, where I was responsible for helping ensure that our systems were reliable, and that when they weren't, we were acting and fixing them as quickly as possible.

Why are we here today? We're here because we don't want to end up on this page. We don't want to end up in the news for an outage that occurred. Unfortunately, this is becoming more and more commonplace in this day and age, and we feel for the folks who find themselves in these circumstances. We're operating very complex systems with lots of moving pieces and interconnected dependencies, and this makes the job of building a reliable system fairly difficult. In the past, one architect could hold all of the pieces in their head and help make sure the right decisions were being made. But today, reliability is really everyone's problem. It's up to us to mitigate issues and failures early and often along the path, so that they don't compound into larger, cascading failures. So I'd love to talk a little bit about what I've learned and what I think is most effective in building complex distributed systems that are reliable.

The first of a couple of simple tenets is to just expect failure. Sometimes, as engineers, we're thoughtful about the happy path, when we really need to be considering the alternate paths that our users may experience. Failure can occur at any time, at any level, and we need to ensure that we're handling that failure gracefully. By doing that, we're simplifying the surface area and making it so that when failures do occur, there's less for us to figure out.

This rolls right into the second tenet, which is that we need to keep our systems simple. Where possible, we should try to do the simpler, easier thing, as opposed to the clever or more complex thing. As we saw in the previous slide, our systems are complex enough. A bit of a rule of thumb here: if we think some fix or optimization is going to yield an order of magnitude better performance, then it's probably worth the overhead or complexity. But if we're seeing only an incremental gain, we may want to think about the side effects of that additional complexity and choose to leave it out.

The next tenet is probably also fairly straightforward, and one you've heard before, but we need to talk about it: we need redundancy. We need redundancy in our hosts and containers. We need redundancy in our data. We need redundancy in the ways in which we can do work within our system. Ultimately, a lot of failure testing and a lot of reliability comes down to: is there another way to get it done, or are we at a hard failure point? Thinking about our single points of failure and how to mitigate them along the way helps us balance that concern. Now, I will mention that keeping things simple is somewhat in conflict with this: if you make everything redundant, you actually have a much more complicated system than with a single instance of everything. So we need to weigh the pros and cons of that approach. How do we test to know if we're in compliance with this?
The question is, are we comfortable bringing down a host, a service, a zone, a region? If we are, then we can feel good that our system will continue operating and the right things will happen. If we don't have that confidence, if we're unwilling to do it, then we're probably not there yet. By the way, this is why Chaos Monkey was built: in the cloud, hosts can fail, and we have to be able to handle that contingency. By forcing that failure to occur regularly, a social strategy as much as a technical one, we expect our engineers to consider it and address it, and we become able to handle those host failures.

Next, I want to think a little bit about how we design for operations. We often don't think about operations when we're designing our services. We wait until a service is built or deployed, or until there's an issue, to come back and put a fine-tooth comb through things. If we consider this earlier on, we can think about how to observe the service. How do we know that it's behaving well? And beyond just error rates at the front door, how do we know that the underlying components are healthy and performing the way we expect? We need to expose configurability that allows us to tune the system and make sure it has the right edges and the right points to protect itself. And all of these things we're building should likely be treated as first-class citizens alongside our source code, checked into source control: our configurations, our runbooks, our alerts, our monitors, our scripts.

We also need to think about this a little earlier and more consistently throughout the process. If, similar to test-driven development, we're thinking about the failure modes up front, we're more likely to leave room for them, design for them, or be able to address them than if we have to come back and duct-tape a solution on after the fact. A case in point: every time we add a new dependency to our service is a great opportunity to ask ourselves, is this a critical dependency? Can I gracefully degrade? Do I really need it?

Lastly, we want to be thoughtful about traffic control in our services. We never want to allow in more work than we're able to complete. So in the interest of protecting ourselves and the work we've already taken on, we need to shed work if we're unable to do it well. This creates a faster failure loop for our consumers that helps them understand the state of our service, as opposed to a delay and a wait while things are unknown. Similarly, when we're calling our dependencies, it's an opportunity for us to (a) be good citizens and (b) protect ourselves if that dependency fails, and the circuit breaker pattern works wonders here. First, if that service is failing, we don't want to continue calling it. We don't want to make things worse, and if we know the outcome is going to be a failure, we don't want to waste our time. Second, if that service is failing, circuit breakers often help us think about how we could gracefully degrade. What do we do if it does fail? What other source of truth could we find for that information?

To steal a page from the security world, I think that failure modeling is a very useful exercise as we're designing and building our systems, and it's worth the team's time to spend a little while together brainstorming: What are the types of things that could go wrong? What happens if two things go wrong at the same time?
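
One lightweight way to seed that brainstorm is simply to enumerate the single and pairwise failure scenarios for your dependencies and walk through them as a team. The sketch below assumes a hypothetical ecommerce dependency list (the checkout and recommendation services come from the talk's example; the rest are invented for illustration):

```python
from itertools import combinations

# Hypothetical dependency list for an ecommerce service; substitute your own.
dependencies = ["checkout", "payments", "recommendations", "search", "user-profile"]

# Enumerate every single failure and every pair of simultaneous failures,
# so the team can walk through each one and decide: is this a non-starter
# we must live with, or something we can gracefully degrade around?
scenarios = [(d,) for d in dependencies] + list(combinations(dependencies, 2))

for scenario in scenarios:
    print("What happens if we lose:", " + ".join(scenario))
```

The output is nothing more than a checklist for the team to argue over, which is exactly the point of the exercise.
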
Which of these failures are non-starters, things that we just must live with and cannot change? And which failures can we turn into non-failures by gracefully degrading, so that we can ensure they behave the way we expect? Ultimately, this is a risk assessment, and it's both a business and a technological decision about what we think is most appropriate. The one bit of advice I'll give here is that many things can fail, and a combination of failures might seem rare. But when we're dealing with components with thousands of moving pieces, perhaps many more, failure becomes commonplace, and something that seemed unlikely may occur more frequently than you think.

A little visual for how we think about this is a service diagram. From it we get a view of the dependencies we rely upon, and we can reason about how important those dependencies are. In the case of an ecommerce app, we know that the checkout service is going to be critical, and we're not going to be able to perform meaningful work without it. But something like the recommendation service, which offers other things we might like to buy, could fail; we could hide that from the user, and they could still continue on, completing the mission they came to us to accomplish.

Which leads us to talk a little bit about our dependencies. There's more to our dependencies than just the services we rely on. There's infrastructure, there's the network, there are our cloud providers, there are our external services and tools, whether that's a database or whether we need to hit the IRS or the government's financial services to get the latest interest rate. There are lots of moving pieces here, and the first step is just being aware of them: do we know what we rely upon? The second piece is knowing what's critical. For those things we rely upon, do we need them? Must we have them? In many cases, this is a bit of a guess, and from my experience, this is a great place where failure testing comes into play. By going and actually causing the failure of a dependency, we can see if the system is able to withstand it, and we can test whether we're able to gracefully degrade. That's how we turn as many critical failures into non-critical failures as possible.

Now, some dependencies we're going to have to rely upon, and in those cases we want to work with those teams to make sure they're doing the best they can to build a highly reliable service. If we're trying to build a service with four nines of uptime on top of a bunch of services that have three nines of uptime, we're probably going to be disappointed, because as we do the math on the availability of those services, it actually gets a little worse as we compose them, not better: three independent three-nines dependencies compose to roughly 0.999³ ≈ 99.7%, already well short of four nines.

So I want to share one of my personal favorite quotes, from James Hamilton; it comes from On Designing and Deploying Internet-Scale Services. It's about 15 years old, but the advice in there is still applicable, and it's a great read worth your time. He essentially challenges us: if we're unwilling to go fail our data centers and cause those failures ourselves, then we're not actually confident our system can withstand them. And if we're unwilling to do this testing and to validate our recovery mechanisms, then likely those recovery mechanisms won't work when needed.
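
In that spirit, here is a minimal, illustrative sketch of validating graceful degradation by injecting a dependency failure in a test rather than discovering it in production. The toy render_home_page handler and the test are invented for this example; at larger scale, the same assertion is what game days and chaos experiments exercise against real infrastructure.

```python
import unittest
from unittest import mock


def render_home_page(get_recommendations):
    """Toy handler: recommendations are a non-critical dependency,
    so a failure there should degrade gracefully, not break the page."""
    try:
        recs = get_recommendations()
    except Exception:
        recs = []  # graceful degradation: hide the widget from the user
    return {"status": 200, "recommendations": recs}


class RecommendationOutageTest(unittest.TestCase):
    def test_page_survives_recommendation_outage(self):
        # Inject the failure instead of waiting for it to happen in production.
        failing_dependency = mock.Mock(side_effect=TimeoutError("injected failure"))
        response = render_home_page(failing_dependency)
        self.assertEqual(response["status"], 200)
        self.assertEqual(response["recommendations"], [])


if __name__ == "__main__":
    unittest.main()
```
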
As someone who's been in this situation in the past: if two failures are happening at the same time, and we go out to mitigate them, and we have a recovery mechanism that doesn't work, now we're dealing with multiple failures. We have to debug, diagnose, and fix all at once, in the middle of dealing with the incident, in the middle of the night, urgently. So for those recovery mechanisms that are there to save us and to make things go smoothly, we want to make sure they work and have confidence in them.

So, great. What should you do today? What are the most effective places to begin for a team that's new to this, or a team that wants to get better at this? The first is to go test your redundancy. Start simple: make sure you can lose a host. Graduate to being able to lose a zone or a data center, not your entire application, but a big piece of it. And then turn your disaster recovery plan from an academic paper exercise into a real, live exercise that your company executes. Do a failover between zones or regions, do a traffic shift, and I promise you, you'll uncover lots of little things you weren't aware of that would have caused problems if the real-world event had occurred. Having lived through that, you can make it so that when the real event occurs, it runs smoothly. And that's what we want here.

The second thing is to understand our dependencies. First, we must know them all. Next, we must know which of them are critical, which we can do by going and testing them with a hard failure. And then lastly, we can tune how we operate and interact with those dependencies to ensure that we back off, that we time out, or that we stop calling them if things are going wrong.

Lastly, and not to be understated, we need to train our teams. Some folks have been living this for many years; some folks are new to this approach. Giving them a place to ask questions, a place to practice, a place to ensure that they can find the runbooks and that they're up to date, that they know what the alerts and the monitors look like, and that they have access to their systems is paramount. It's key to allow people to practice and train during the day, after the coffee's kicked in, as opposed to 2:00 in the morning when things are going wrong. So with that, I want to thank you.
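
As a closing footnote to the dependency-tuning advice above (back off, time out, stop calling), here is a minimal sketch in Python. The class, thresholds, and the flaky stand-in dependency are all invented for illustration, not taken from the talk.

```python
import random
import time


class DependencyCaller:
    """Illustrative wrapper for calling a dependency: bound how long we wait,
    back off between retries, and stop calling (a crude circuit breaker)
    after repeated failures rather than piling load onto a struggling service."""

    def __init__(self, timeout_s=1.0, attempts=3, base_backoff_s=0.2, trip_after=5):
        self.timeout_s = timeout_s
        self.attempts = attempts
        self.base_backoff_s = base_backoff_s
        self.trip_after = trip_after        # consecutive failures before we stop calling
        self.consecutive_failures = 0

    def call(self, fn, fallback):
        if self.consecutive_failures >= self.trip_after:
            return fallback()               # circuit is open: don't even try
        for attempt in range(self.attempts):
            try:
                result = fn(timeout=self.timeout_s)  # time out: never wait unbounded
                self.consecutive_failures = 0
                return result
            except Exception:
                self.consecutive_failures += 1
                if attempt < self.attempts - 1:
                    # back off: exponential delay plus jitter so retries don't synchronize
                    time.sleep(self.base_backoff_s * (2 ** attempt)
                               + random.uniform(0, self.base_backoff_s))
        return fallback()                   # degrade gracefully instead of failing the caller


# Hypothetical stand-in for an external dependency (for example, the
# interest-rate lookup mentioned in the talk) that fails half the time.
def flaky_interest_rate_lookup(timeout):
    if random.random() < 0.5:
        raise TimeoutError("upstream did not answer in time")
    return 0.0425


if __name__ == "__main__":
    caller = DependencyCaller()
    rate = caller.call(flaky_interest_rate_lookup, fallback=lambda: 0.04)  # cached value
    print("interest rate:", rate)
```

In a real system you would likely reach for an existing resilience library rather than hand-rolling this, but the knobs to tune are the same.
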

Kolton Andrus

Founder & CTO @ Gremlin
