Transcript
This transcript was autogenerated. To make changes, submit a PR.
My name is Rahing Taku, and I am working as a lead software engineer at Salesforce.
I wanted to talk about how we can scale a service.
So I want to talk about how we can enhance cloud scalability and fault tolerance using sharding.
Before I start, I will give a little background.
Nowadays it is very common practice to deploy a service as microservices, right?
When we deploy microservices, there are replicas of the service which run, and this is mainly done because we want to make sure that if there is an issue with one replica of the service, the others still continue to serve the traffic.
So my idea is along the same lines.
But it extends that idea.
Instead of having just one deployment of the service, we can shard the service itself: we can have multiple instances of the service, and each instance serves some portion of the traffic, right?
And using this architecture, we can improve the scalability and actually have high availability for our service.
Yeah.
So I will go deep into the different aspects of it.
The first thing is why sharding matters, right?
If we do not shard, then you have one single point of failure: everything goes to the same place, which means that if anything goes wrong, it will impact all the incoming traffic.
And since everything is coming to a single place, you cannot have intelligent load distribution.
It also affects your availability: if everything is in one place and that one place goes down, it reduces the availability of the whole service.
In my proposed architecture, I am introducing a metadata service.
In this architecture there are different instances of the service, not only one instance.
And since there are different instances of the service, a given request always needs to go to the same instance.
Because of that, we need a metadata service.
The responsibility of the metadata service is to make sure that a given request, identified as the same request using the IP address or the customer credentials, always goes to the same shard of the service, right?
That is why we need the metadata service.
The metadata service can also be used for managing more metadata about all the services that participate in this, so it knows about all the shards of the services, and at the end it redirects the request to the particular shard.
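As a rough illustration of that routing, here is a minimal sketch in Python. The class name, the in-memory shard map, and the use of a SHA-256 hash over the customer identifier are my own assumptions for the example, not details from the talk; a real metadata service would keep the assignment table in a durable store rather than in memory.

```python
import hashlib

class MetadataService:
    """Minimal sketch: maps a request key (customer id or IP) to a stable shard."""

    def __init__(self, shard_ids):
        self.shard_ids = list(shard_ids)      # e.g. ["shard-1", ..., "shard-5"]
        self.assignments = {}                 # remembered customer -> shard mapping

    def shard_for(self, request_key: str) -> str:
        # Return the remembered shard so the same customer always lands on the
        # same instance, even if the shard list changes later.
        if request_key in self.assignments:
            return self.assignments[request_key]
        # First time we see this key: pick a shard with a stable hash
        # (hashlib rather than hash(), so the result is the same across processes).
        digest = int(hashlib.sha256(request_key.encode()).hexdigest(), 16)
        shard = self.shard_ids[digest % len(self.shard_ids)]
        self.assignments[request_key] = shard
        return shard

# Usage: the router asks the metadata service where to send the request.
metadata = MetadataService([f"shard-{i}" for i in range(1, 6)])
print(metadata.shard_for("customer-42"))   # always the same shard for this customer
```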
Okay, so before I go further, I just want to focus one more time on the benefits of it.
Since we now have multiple instances of the service running, it improves the availability of the service.
If one instance is down and we have, let's say, five instances of the service, there are still four instances running, so the availability is reduced only by 20%, right?
It also helps with the locality of the requests.
For example, if we have five different instances, each one serves a fraction, like 20% of the customers.
And then if we have a local cache, it can serve the customer better, right?
We have a smaller instance, and we know the requests from a given customer will always come to this particular instance, so the cache will be much more effective.
And the third one is that it can optimize the deployment, right?
You can configure your pipeline to do a staged deployment: whenever you are releasing a change, do not release it on all five instances at once, but release it on one instance first and do all kinds of validation before it goes to the rest of production, so that it is not causing any regression.
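A minimal sketch of that kind of staged rollout is below. The deploy_to_shard and validate_shard functions are hypothetical placeholders for whatever your pipeline actually calls; the only point is the ordering: one shard first, validate, then the rest.

```python
# Hypothetical pipeline step: roll out to one shard, validate, then continue.
SHARDS = ["shard-1", "shard-2", "shard-3", "shard-4", "shard-5"]

def deploy_to_shard(shard: str, version: str) -> None:
    print(f"deploying {version} to {shard}")     # placeholder for the real deploy call

def validate_shard(shard: str) -> bool:
    return True                                  # placeholder for smoke tests / canary checks

def staged_rollout(version: str) -> None:
    first, rest = SHARDS[0], SHARDS[1:]
    deploy_to_shard(first, version)
    if not validate_shard(first):
        raise RuntimeError(f"validation failed on {first}, stopping the rollout")
    for shard in rest:                           # only reached if the first shard looks healthy
        deploy_to_shard(shard, version)

staged_rollout("v2.0.1")
```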
I just want to add more on the benefits.
Since we have multiple instances of the service, we can actually have intelligent routing.
If you want to scale up or scale down one particular instance of the service to serve a different purpose, then you can do that while you keep the others as they are.
For example, let's say you have five instances of the service, and one instance serves the requests for the current day's data, while the other four instances serve the requests for the historic data, right?
Then you can scale the instance which serves the current data to be more performant.
So yeah, this is just an example, but the idea is that if you have different shards, you can manage them separately and you can have intelligent load distribution.
It can also be done for isolation.
For example, again going with the same example, we have five instances, and they are running in five different availability zones.
Let's say there is an issue with one availability zone.
It does not impact the other four instances, which are in different availability zones, right?
So you can actually isolate the fault and minimize the blast radius.
Then also, if we use it correctly, we can use it for cost efficiency.
If we had only one instance, you would have to keep that instance scaled up all the time; in this scenario, you can scale up or scale down the different instances at different points in time.
So it gives you the flexibility of managing it better.
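To make the independent-scaling point concrete, here is a small sketch. The shard names, roles, and replica counts are invented for the example, and apply_replica_count stands in for whatever your platform actually uses (an autoscaler API, a deployment update, and so on).

```python
# Hypothetical per-shard scaling policy: the shard serving current data gets
# more capacity than the shards that only serve historic data.
SHARD_ROLES = {
    "shard-1": "current",
    "shard-2": "historic",
    "shard-3": "historic",
    "shard-4": "historic",
    "shard-5": "historic",
}

REPLICAS_BY_ROLE = {"current": 6, "historic": 2}      # invented numbers

def apply_replica_count(shard: str, replicas: int) -> None:
    print(f"scaling {shard} to {replicas} replicas")  # placeholder for the real scaling call

for shard, role in SHARD_ROLES.items():
    apply_replica_count(shard, REPLICAS_BY_ROLE[role])
```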
Okay.
Now I want to talk a little about the implementation.
We need a way to dynamically provision these instances, right?
Since we have more than one instance, and the number of instances can grow or shrink as the product grows or shrinks, it is very important for us to have a service which can provision the service instances dynamically.
It needs to be configurable, and the configuration needs to be something we can do separately for each instance, right?
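As a rough sketch of what such a provisioner could look like, the reconcile loop below compares a desired shard count to the instances that exist and creates or removes instances to close the gap. The create_instance and terminate_instance functions are hypothetical stand-ins for your infrastructure API.

```python
# Hypothetical provisioner: reconcile the running shard instances against a
# desired count that can grow or shrink over time.
running = {"shard-1", "shard-2", "shard-3"}          # pretend this came from the platform

def create_instance(name: str) -> None:
    print(f"provisioning {name}")                    # placeholder for the real create call
    running.add(name)

def terminate_instance(name: str) -> None:
    print(f"tearing down {name}")                    # placeholder for the real delete call
    running.remove(name)

def reconcile(desired_count: int) -> None:
    desired = {f"shard-{i}" for i in range(1, desired_count + 1)}
    for name in sorted(desired - running):
        create_instance(name)
    for name in sorted(running - desired):
        terminate_instance(name)

reconcile(5)   # grow to five shards
reconcile(4)   # later, shrink back to four
```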
Going next, we also need to maintain some kind of configuration for all these instances, right?
Since these are different instances (it is the same service, but they are still different instances), there may be some runtime properties which are different for the different instances.
And what we want to make sure is that those are configurable.
For that we need a configuration service.
For each instance we configure the properties, the same or differently depending on the use case, and those are then used when we deploy the service instance.
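A minimal sketch of that per-shard configuration, assuming a simple defaults-plus-overrides model (the property names and values here are invented):

```python
# Shared defaults for the service, with per-shard overrides layered on top.
DEFAULTS = {
    "cache_size_mb": 256,
    "max_connections": 100,
    "data_window": "historic",
}

PER_SHARD_OVERRIDES = {
    "shard-1": {"data_window": "current_day", "cache_size_mb": 1024},
    # shards without an entry just use the defaults
}

def config_for(shard: str) -> dict:
    # The effective config handed to the deployment of this shard instance.
    return {**DEFAULTS, **PER_SHARD_OVERRIDES.get(shard, {})}

print(config_for("shard-1"))   # current-day shard with a bigger cache
print(config_for("shard-3"))   # falls back to the shared defaults
```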
Okay.
As I said before, we also need a metadata service.
The responsibility of the metadata service is to make sure that we are able to route the request to the same shard every time, which means that we need to have consistent routing.
I also just want to touch upon stateful migration.
If your service is stateful, which means it holds data for the customer, then you may want to move a customer from one instance to another instance for any reason, right?
There can be multiple reasons.
For example, one of the instances is not able to serve any more traffic, so you want to redistribute the load and move some of the traffic to the other instances.
Or maybe for some other reason: for example, you have a logical distribution of the customers, but then for some reason you decide to move one customer.
This is open-ended, but the point is that for any reason, at any point, you may decide to change the shard for a customer.
If it is a stateful service, then we have to take care of the state, the data migration, as well.
And we all know data migration is not easy.
But I think if it is done correctly, we can handle this too.
The point is that we have to consider stateful migration when we design the sharded service.
Okay, now I would like to talk about some of the challenges.
One is the debugging complexity, right?
When we want to debug an issue, the first thing we need to know is whether the routing to the shard happened correctly, and then we need to know which shard the customer belongs to.
And then, since each shard has a different configuration, we also need to check that, so it adds to the complexity of debugging, right?
Then there is migration complexity.
I think this is one of the biggest challenges.
As I explained before, we have different instances, and if for any reason we decide to change the shard for a customer, we have to move their data.
Moving data in a live production system is not easy.
That is why I recommend, if we ever try to do it, to consider dual ingestion.
You keep ingesting the data into the old shard, and you also start to ingest the data into the new shard, so the data stays in sync.
Then you move the customer over, and once the customer is being served from the new shard, you can think about cleanup of the old shard.
And I would also suggest keeping retention-based cleanup; that is nice in the sense that you do not have to worry about explicitly cleaning up the data, the old copy just ages out when it reaches its end of life.
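Here is a rough sketch of that dual-ingestion cutover, just to show the ordering of the steps. Everything in it (the ShardClient class, the in-memory stores, the way "in sync" is checked) is an assumption for illustration, not how any particular system does it.

```python
# Hypothetical dual-ingestion migration: write to both shards, confirm they are
# in sync, switch reads to the new shard, then let retention clean up the old copy.
class ShardClient:
    def __init__(self, name: str):
        self.name = name
        self.store = {}                      # stand-in for the shard's real storage

    def write(self, key: str, value) -> None:
        self.store[key] = value

old_shard, new_shard = ShardClient("shard-2"), ShardClient("shard-6")
routing = {"customer-42": old_shard}         # reads currently go to the old shard

def dual_write(key: str, value) -> None:
    # Dual ingestion phase: every write goes to both the old and the new shard.
    old_shard.write(key, value)
    new_shard.write(key, value)

def cut_over(customer: str) -> None:
    # Only flip reads once the new shard has caught up with the old one.
    assert new_shard.store == old_shard.store, "shards not in sync yet"
    routing[customer] = new_shard
    # No explicit delete here: the old copy ages out via retention-based cleanup.

dual_write("order-1", {"total": 10})
cut_over("customer-42")
print(routing["customer-42"].name)           # shard-6
```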
The third is infrastructure complexity.
Since we now have multiple instances, you need to have a service which can spin up the new instances and take care of them, right?
So this adds extra complexity.
Then there is data consistency.
I think data consistency mainly matters when we are actually moving a customer from one shard to another, so we just want to make sure that the data stays consistent.
Network latency is in the same sense: when we are actually moving the customer from one shard to another, we are doing the dual write, right?
And that might add to the latency for the customer.
Now I want to suggest some mitigations for the above problems.
The first thing is robust monitoring: for any given customer, we should be able to see which shard the customer goes to and all the configurations for that shard.
And then having proper dashboards and alarms to monitor each service instance, right?
This helps to reduce the debugging effort, and debugging can be done very fast.
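One cheap way to get part of that visibility is to tag every log line (and, by extension, every metric) with the shard that handled the request, so a customer issue can be traced to a shard immediately. The sketch below uses only Python's standard logging module; the field names are my own.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s shard=%(shard_id)s customer=%(customer_id)s %(message)s",
)
base_logger = logging.getLogger("sharded-service")

def request_logger(shard_id: str, customer_id: str) -> logging.LoggerAdapter:
    # Attach the shard and customer to every log record for this request,
    # so dashboards and alarms can be sliced per shard.
    return logging.LoggerAdapter(base_logger, {"shard_id": shard_id, "customer_id": customer_id})

log = request_logger("shard-3", "customer-42")
log.info("request served in %d ms", 37)
```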
The second one is the caching strategy.
The caching strategy helps a lot in the metadata service, when we want to decide which shard a customer should go to.
I think caching plays an important role here because adding or removing a shard is not a frequent operation, right?
So we can very well have a cache which gets invalidated whenever a shard is added or removed, and read from the cache everywhere else.
This helps us do the routing without adding much latency, because we will be doing it from the cache.
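A minimal sketch of that cache, assuming the authoritative shard map lives in some backing store (represented here by a plain dict) and is only reloaded when a shard is added or removed:

```python
# Hypothetical shard-map cache inside the metadata service: routing lookups are
# served from memory, and the cache is only rebuilt when the shard set changes.
class ShardMapCache:
    def __init__(self, load_shard_map):
        self._load = load_shard_map          # callable that reads the authoritative map
        self._cache = None

    def lookup(self, customer_id: str) -> str:
        if self._cache is None:              # (re)load lazily after an invalidation
            self._cache = self._load()
        return self._cache[customer_id]

    def invalidate(self) -> None:
        # Called when a shard is added or removed, which is rare.
        self._cache = None

backing_store = {"customer-42": "shard-3"}   # stand-in for the durable shard map
cache = ShardMapCache(lambda: dict(backing_store))

print(cache.lookup("customer-42"))           # served from memory after the first load
backing_store["customer-42"] = "shard-6"     # shard set changed, e.g. after a migration
cache.invalidate()
print(cache.lookup("customer-42"))           # now picks up the new mapping
```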
Then the dual-write system.
The dual-write system is very important when we want to move a customer from one shard to another.
I think, if it is done right, we can make sure that the customer does not see any downtime during this process, and the customer would not even know there was any change in the backend.
Yeah.
Yeah, so I just want to conclude.
The idea is to have multiple shards of the service, so that each shard serves some of the data or some of the customers.
It helps you increase the availability of the service, do better load management, and improve on the fault tolerance.
Yes.
So I think, yeah, that's it.
And thank you very much.