Conf42 Site Reliability Engineering 2021 - Online

Fault isolation using shuffle sharding


Distributing user requests to resources in a combinatorial way to reduce the impact of failures and provide logical isolation.


  • In this session we're going to be looking at fault isolation using shuffle sharding.
  • Reliability is the ability of a workload to perform its required function correctly and consistently over an expected period of time. We need to build systems that embrace failure as a natural occurrence. The AWS Well-Architected Labs are a collection of over 60 labs helping you to get hands on with some of the best practices.
  • Shuffle sharding is a technique for sharding workloads. By minimizing the number of components that a single customer can interact with, we can limit the impact of any potential issues. The scope of the impact could be related to the number of customers divided by the number of shards.


This transcript was autogenerated. To make changes, submit a PR.
Hi folks, and welcome to this session as part of the Conf42 online conference. My name is Andrew Robinson, and in this session we're going to be looking at fault isolation using shuffle sharding. Before we get into that, a quick bit of background on who I am: I'm a principal solutions architect at Amazon Web Services, part of the AWS Well-Architected team. My day job is helping our customers and our partners to build secure, high-performing, resilient and efficient infrastructure for their workloads. I've been working in the technology field for the last 14 years, and most of my focus has been on infrastructure reliability and data management.
Getting into this session, I think it's always worth quickly defining what we mean by reliability, just so that we're all on the same page. Reliability is the ability of a workload to perform its required function correctly and consistently over an expected period of time. One of the quotes that I really like to use here is "we need to build systems that embrace failure as a natural occurrence", which is from Werner Vogels, the CTO of Amazon. So what we need to do is adapt the way that we think about building these systems so that we embrace the potential failures that are going to happen and treat them as a natural occurrence within our systems.
A couple of resources to mention before we go into this, which I will be referring to. The first of these is the Amazon Builders' Library, which provides articles and video guides on how Amazon has implemented different best practices across our architecture and software delivery. In short, it's sharing what we've learned over the last 20 years with our customers and our partners.
I'd also like to talk about the AWS Well-Architected Labs. There's a collection of over 60 labs here helping you to get hands on with some of the best practices that I'm going to be speaking about. All the labs come with detailed walkthroughs, and the content is all available on the linked GitHub repo if you'd like to contribute or provide feedback on them. So I'll be referring to both of those as we talk through what I'm going to be covering.
The main topic to cover here is sharding of workloads. We'll start out with this as a concept, and then we'll dive a little bit more into what shuffle sharding means and how we can go about implementing it. Think of sharding a workload as similar to how you would shard a database. For one entire workload, we'll shard the overall capacity and segment it so that each shard is responsible for a subset of the customers. By minimizing the number of components that a single customer can interact with, we can limit the impact of any potential issues. If we had a workload without any sharding, where any worker could handle any request, a poisonous request or a flood of requests from one single user could spread through your entire fleet. So in short, the scope of the impact could be related to the number of customers divided by the number of shards. As the number of customers increases, we can increase the number of shards, which helps to bring the scope of impact that potential failures could have down to a more manageable level, meaning that a failure is going to have less of an impact on customers using our systems.
So let's talk through an example of what this sharding can look like. Let's imagine we have a horizontally scalable system. In this case, it's made up of eight worker nodes. Each worker node receives requests that come in, maybe through a load balancer.
There's no sharding, so each worker can handle any request that happens to come into the system. Now, if one of those workers fails, the other seven can absorb the work. Only a small amount of slack capacity is needed within this system for it to be able to function. However, what happens if we get a poisonous or a bad request? Maybe a denial of service, which causes a cascading failure that has an impact across all of those worker nodes. In this scenario, the failure affects everything and everyone: the whole service goes down and every customer using it is impacted.
So what we could look at doing is sharding that workload. In this case, one customer only uses one specific shard, or set of sharded resources. We can limit the impact of that poisonous request, as it would only impact that one shard. In this example, you can see that we've now sharded this workload into orange, blue, green and red. Because the poisonous request landed on the orange shard, only that shard has been impacted. The other shards that exist within our workload, and therefore the other customers using them, haven't been impacted. So we've been able to reduce the impact of the failure here from 100% to just 25%.
Now, of course, this does mean we need to have more slack capacity within each of those shards in case we had a single node failure. In this example, if we did have that, we would lose 50% of that shard's capacity. So it might be that we need more pre-provisioned capacity, or we need to be able to scale more quickly to account for changes in load. But it does mean that we'll be able to limit the impact of that failure from affecting 100% of our customers to just 25% of our customers using this system.
Now, we can evolve this further by using a mechanism called shuffle sharding. How this works: it's a fairly similar scenario.
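As a rough illustration of the regular (non-shuffle) sharding example, here is a minimal Python sketch. The worker names, the `shard_for` hashing scheme and the `impacted_fraction` helper are assumptions made for illustration; the talk does not say how customers are mapped to shards.

```python
import hashlib

# Eight workers split into four colour-coded shards, as in the example.
WORKERS = [f"worker-{i}" for i in range(1, 9)]
SHARD_NAMES = ["orange", "blue", "green", "red"]


def build_shards(workers, shard_names):
    """Split the fleet into equally sized, non-overlapping shards."""
    size = len(workers) // len(shard_names)
    return {
        name: workers[i * size:(i + 1) * size]
        for i, name in enumerate(shard_names)
    }


def shard_for(customer_id, shard_names):
    """Map a customer to exactly one shard with a stable hash.

    Hypothetical scheme: any deterministic mapping would do.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return shard_names[int.from_bytes(digest[:8], "big") % len(shard_names)]


def impacted_fraction(shards, poisoned_shard):
    """Fraction of the fleet taken down when one shard is poisoned."""
    total = sum(len(members) for members in shards.values())
    return len(shards[poisoned_shard]) / total


shards = build_shards(WORKERS, SHARD_NAMES)
print(shards["orange"])                      # the two workers in the orange shard
print(impacted_fraction(shards, "orange"))   # 0.25
```

With four shards of two workers each, a poisonous request confined to one shard takes out a quarter of the fleet, which matches the drop from 100% to 25% described above. It also shows why each shard needs more slack: losing one worker halves that shard's capacity.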
So we've got eight worker nodes, and we've got eight customers represented by the letters A, B, C, D, E, F, G and H. So eight workers, eight customers. This time, we'll virtually shard those worker nodes so that every customer is spread across two of them. So, as with the previous example, all customers have access to two worker nodes. Each customer is represented by a different letter, and for the purpose of this example, we're going to focus on customer A and customer B.
Customer A, you can see, is using the first shard, and they're sharing that with customer B. They're also sharing the fourth shard with customer F. Customer B is sharing the first shard with customer A, and they're also sharing the eighth shard with customer G. So what would happen in this scenario if we had that same failure? Customer A submits a poisonous request, or maybe there's a flood of requests, or possibly some kind of denial of service attack. That problem will impact the virtual shards that customer A has access to, so shard one and shard four. But it won't impact any of the shards that other customers are working on. So customers on the remaining shards continue to operate as normal. They don't suffer from any interruption to their service.
Now, that's great, but what about the customers who are sharing those shards, customer B and customer F? What do they do? Well, because we've shuffle sharded, those customers also have access to other shards. So although both customer B and customer F share a shard with customer A, because their other shards are different, they can continue to operate, albeit at a reduced capacity.
This was a small example so that it would fit on a slide. But as you grow out, the level of potential impact drops significantly. With eight workers, there are 28 unique combinations of two different workers, so 28 possible shuffle shards.
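The eight-worker example can be sketched in Python as follows. The talk only states the placements for customers A and B and one worker each for F and G; the remaining pairs in `ASSIGNMENT` are hypothetical, chosen so that every worker serves two customers, and `healthy_workers` is an illustrative helper rather than part of any library.

```python
from itertools import combinations

WORKERS = range(1, 9)

# Every way of picking 2 distinct workers out of 8: the possible shuffle shards.
ALL_SHUFFLE_SHARDS = list(combinations(WORKERS, 2))
print(len(ALL_SHUFFLE_SHARDS))  # 28

# A and B follow the talk; the other pairs are assumed for illustration.
ASSIGNMENT = {
    "A": (1, 4), "B": (1, 8), "C": (2, 3), "D": (2, 5),
    "E": (3, 6), "F": (4, 5), "G": (7, 8), "H": (6, 7),
}


def healthy_workers(assignment, poisoned_customer):
    """Return, per customer, the workers left once the poisoned
    customer's shuffle shard has been taken down."""
    down = set(assignment[poisoned_customer])
    return {
        customer: tuple(w for w in workers if w not in down)
        for customer, workers in assignment.items()
    }


remaining = healthy_workers(ASSIGNMENT, "A")
fully_down = [c for c, w in remaining.items() if not w]
degraded = [c for c, w in remaining.items() if len(w) == 1]
print(fully_down)  # ['A']
print(degraded)    # ['B', 'F']
```

Only customer A loses both workers. B and F each keep one worker and run at reduced capacity, and the other five customers are untouched.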
So the scope of impact here is just 1/28, which is seven times better than what regular sharding would offer you. Distributing the customer requests that are coming in using a shuffle sharding mechanism means that if you do have a failure, a poisonous request or a denial of service that takes out those shards, the impact is going to be greatly reduced. Now, customer A still has the problem that they've lost access to their resources. But for the rest of the customers, we're maintaining access to the system so they can still do what they need to do, limiting the impact that the failure actually has.
So that's great in theory, but what could that look like in practice? And how could you actually get hands on and see how this really works? One quick resource to talk about here is using shuffle sharding with Amazon Route 53. For those who aren't familiar, Amazon Route 53 is a highly available and scalable DNS web service. Amazon has built the Route 53 Infima library, which works with the AWS Java SDK. It uses the theory of shuffle sharding that we just talked about to compartmentalize your endpoints, and then bakes in the decision tree logic to remove failed endpoints, which can help with isolating denial of service or poison pill requests. As that decision tree logic is pre-baked, Route 53 can react quickly to any endpoint failures. Combined with the compartmentalization through shuffle sharding, that means we can handle failures of any subservice serving that endpoint. So it allows us to isolate those requests. And by having a pre-built decision tree for which endpoints we're going to remove, once those endpoints fail, we can quickly remove them from our DNS service. That means we're not going to be routing customer requests to those endpoints anymore.
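To make the idea of a pre-built decision for failed endpoints more concrete, here is a minimal Python sketch. It is not the Route 53 Infima API: the function names, the lookup-table approach (a simplified stand-in for a decision tree) and the addresses (taken from the 192.0.2.0/24 documentation range) are all assumptions for illustration.

```python
from itertools import combinations


def precompute_answers(shard_endpoints):
    """Build, ahead of time, the answer for every possible set of failed
    endpoints in one customer's shuffle shard.

    Simplified stand-in for a pre-built decision tree: a lookup table keyed
    by the set of failed endpoints.
    """
    table = {}
    for r in range(len(shard_endpoints) + 1):
        for failed in combinations(shard_endpoints, r):
            table[frozenset(failed)] = tuple(
                e for e in shard_endpoints if e not in failed
            )
    return table


def resolve(table, shard_endpoints, failed_now):
    """Return the endpoints to serve, ignoring failures outside this shard."""
    relevant = frozenset(failed_now) & frozenset(shard_endpoints)
    return table[relevant]


# Hypothetical shuffle shard for one customer.
customer_a_shard = ("192.0.2.1", "192.0.2.4")
answers = precompute_answers(customer_a_shard)

print(resolve(answers, customer_a_shard, set()))                       # both endpoints
print(resolve(answers, customer_a_shard, {"192.0.2.1"}))               # only 192.0.2.4
print(resolve(answers, customer_a_shard, {"192.0.2.1", "192.0.2.8"}))  # only 192.0.2.4
```

Because every outcome is worked out before anything fails, reacting to a failed health check is a single lookup rather than a recalculation, which is the property the talk attributes to the pre-baked decision tree.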
So this takes the theory of how to shuffle shard requests coming into different endpoints and bakes it into an easy to use Java library that you can use with the AWS Java SDK and the Amazon Route 53 service to get hands on and test this out. I know this is quite theoretical, but there is an opportunity here to get hands on with it as well.
So to conclude: using sharding is a great way to help limit impact by distributing different requests to different endpoints. It means that if we have a poison pill or a denial of service from a specific customer, it's only going to impact the specific endpoints that they have access to, and the rest of our customers are still going to be able to access the resources that they need. That helps us to reduce the impact. In the example we saw, we went from 100% to 25%. As you scale out and add more worker nodes, you'll still be looking at a similar percentage, but that can then have an impact on far fewer customers. We can then use shuffle sharding to exponentially limit that impact further. Remember, in the example that we had, we were talking about a 1/28 impact on customers. And as you scale that out with more worker nodes and more customers, that impact exponentially decreases. So sharding is a great place to start, and moving on to shuffle sharding can then help to exponentially limit that impact.
As I mentioned, there's the Route 53 Infima library, which can help you manage this yourself and allows you to get some hands on experience with what this could look like in the real world. This example was done with the Amazon Route 53 service, but you could achieve something similar with other DNS services, or even with your own type of load balancer, to distribute those requests and remove any of the failed endpoints.
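The numbers quoted in the conclusion can be checked with a short Python sketch. It assumes that the scope of impact for regular sharding is one shard out of the total, and that for shuffle sharding it is the single fully overlapping combination out of all possible ones. The fleet sizes beyond eight workers are illustrative, not figures from the talk.

```python
from fractions import Fraction
from math import comb


def regular_scope(workers, shard_size):
    """Share of the fleet in one non-overlapping shard."""
    return Fraction(shard_size, workers)


def shuffle_scope(workers, shard_size):
    """Share of shuffle shards that fully overlap a poisoned one:
    1 out of C(workers, shard_size)."""
    return Fraction(1, comb(workers, shard_size))


# The example from the talk: 8 workers, 2 workers per customer.
print(regular_scope(8, 2))                        # 1/4
print(shuffle_scope(8, 2))                        # 1/28
print(regular_scope(8, 2) / shuffle_scope(8, 2))  # 7

# Illustrative larger fleets (assumed sizes).
for workers, shard_size in [(16, 2), (16, 4), (100, 5)]:
    print(workers, shard_size, shuffle_scope(workers, shard_size))
```

Exact fractions are used instead of floats so that the "seven times better" ratio comes out as exactly 7.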
If you'd like to learn a little bit more about this, there is an article in the Amazon Builders' Library on workload isolation using shuffle sharding, which goes into much more depth than I've been able to cover in this session and explains in more technical detail how Amazon has built this technology. There's also an opportunity to get hands on with the AWS Well-Architected Labs. In the reliability pillar, there's a lab on fault isolation with shuffle sharding, which goes through an example and allows you to get hands on with building this using AWS resources, including our load balancers. And then, of course, there's the Route 53 Infima library. If you want to get a little bit more hands on with the code side of this, you can go and grab that from the AWS Labs GitHub repository. As I mentioned, it works with the AWS Java SDK. It's based on Maven, so it's fairly straightforward to deploy, and you can get started with it in just a matter of minutes.
So that concludes the session on fault isolation using shuffle sharding. I hope that's been useful for everybody. Thank you ever so much for watching this session, and I hope you enjoy the rest of the Conf42 conference. Thanks, folks, and stay safe.

Andrew Robinson

Principal Solution Architect @ AWS

