Kube-Native Sharding: Multi-Zone Service Architecture for 99.99% Uptime

Abstract

Practical guide to implementing multi-zone sharded microservices architecture achieving 99.99% uptime. Covers real-world Kubernetes deployment patterns, intelligent request routing, and operational strategies for fault-tolerant cloud-native systems.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Good morning. My name is Rahul Singh Thakur, and I work at Salesforce as a lead engineer. Today I want to talk about sharding across multiple zones to achieve high availability. I have been working in the software industry for the past 13 years, mostly as a backend engineer. Recently I have been working on infrastructure services, and that's where I got the opportunity to work on a problem where we wanted to make our infrastructure service highly available, even if there is a complete outage of one of the availability zones.

With that context, I will start with the slides. The idea here is that a modern application has to have very high availability of its services, right? And then, how can we achieve it? For the agenda, first I will talk about the importance of highly available, fault-tolerant systems, then the foundational principles, the different components in the architecture, and how we can use Kubernetes to implement them. Then I will also touch on how we can have consistent routing, and at the end the operational best practices.

Before we actually talk about highly available systems, the first question is: why do we need them? Today there are services which, if they are not up, lose money. And if a service is not up, you are also losing customer trust, so in another way you are again losing money, right? So it is very important to make sure that your service is highly available, and to see how we can achieve that with microservices. The problem with a monolithic service is that it is a single point of failure. That's where the microservices architecture comes into the picture. With a microservices architecture, we can shard our services across multiple zones. Let me explain this.
When I say multiple zones: most cloud providers have multiple availability zones. You can think of an availability zone as a self-contained set of infrastructure. When I say self-contained, it means that if something goes wrong in one zone, it will not affect the other availability zones, right? So with that logic, if we have a service deployed in multiple availability zones, what happens if one availability zone is down? Your service is still functional; your service is still available.

So, the foundational principles. As I explained, there are multiple availability zones, and we can treat each one separately: one going down should not affect the others. With stateful services, we have to be really careful when we shard across multiple availability zones, because a stateful service generally means there is data state associated with the pod. So we have to think about how we maintain that state when the service scales up or down. Data locality: we should try to keep data from flowing between zones, to minimize cross-zone traffic. Partial failure design: when there is an issue with one zone, the service should still be functional. And we need operational transparency: we need enough monitoring and alerting to track the status of the different services in the different availability zones.

So, the architectural components. Now that our service is deployed in multiple zones, we need to know which customer request will be served from which zone. That means we need some kind of metadata service to provide that information. Think of this metadata service as keeping track of which zone serves which customer. It is mainly responsible for shard assignment.
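The shard-assignment role of the metadata service can be sketched as follows. This is a minimal in-memory illustration, not the implementation from the talk: the `ShardDirectory` class, the zone names, and the least-loaded placement rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ShardDirectory:
    """Hypothetical in-memory stand-in for the metadata service."""

    zones: List[str]
    # Maps customer ID -> zone that serves the customer.
    assignments: Dict[str, str] = field(default_factory=dict)

    def assign(self, customer_id: str) -> str:
        """Return the customer's zone, picking the least-loaded zone if new."""
        if customer_id in self.assignments:
            return self.assignments[customer_id]
        load = {zone: 0 for zone in self.zones}
        for zone in self.assignments.values():
            if zone in load:
                load[zone] += 1
        # min() keeps the first zone on ties, so placement is deterministic.
        chosen = min(self.zones, key=lambda zone: load[zone])
        self.assignments[customer_id] = chosen
        return chosen

    def lookup(self, customer_id: str) -> Optional[str]:
        """Return the zone for an existing customer, or None if unknown."""
        return self.assignments.get(customer_id)


directory = ShardDirectory(zones=["zone-a", "zone-b", "zone-c"])
for customer in ["acme", "globex", "initech", "umbrella"]:
    print(customer, "->", directory.assign(customer))
```

A real metadata service would persist these assignments and replicate them across zones, since the routing layer depends on it being available.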
So whenever a new customer comes in, the metadata service will be able to assign that customer to one of the existing shards. It will also be helpful for monitoring the health of the services, and for routing policies: whenever a request comes in for an existing customer, it determines which zone serves the request. Sharding across multiple zones also helps reduce the blast radius.

Okay, so service design for multi-zone resilience. Service registration that includes zone information, capacity indicators, and dependency relationships enables intelligent client-side load balancing and availability. Stateful component isolation: a clear separation of stateful components from stateless processing logic. And configuration management helps us maintain the global configuration while keeping zone-specific configuration as an option to override.

If you are implementing your microservices on Kubernetes, these are the best practices to follow. You need to have a pod distribution strategy. You also need persistent volume management. Whenever we create a stateful pod, we can use pod affinity rules keyed on the topology to tell the different pods to come up in different topology domains, right? There are also service considerations. And we can have zone-aware scaling of deployments. This is helpful when one zone is down and you want to redirect traffic to the other zones. If you have zone-aware scaling, you don't have to worry about whether the other zones are able to handle all the traffic.

Okay. I do want to talk a little bit more about request routing and load balancing. The metadata service knows the state of all the zones. It also knows which customer is being served from which zone, so it can act as an intelligent router and also as a load balancer.
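One way to express the pod distribution strategy mentioned above is a Kubernetes topology spread constraint on the standard `topology.kubernetes.io/zone` node label. The sketch below builds such a Deployment manifest as a Python dictionary and prints it as JSON. The talk does not name a specific mechanism, so the choice of `topologySpreadConstraints` is an assumption, and the service name `shard-service` and the image are hypothetical.

```python
import json


def zone_spread_deployment(name: str, image: str, replicas: int) -> dict:
    """Build a Deployment manifest whose pods are spread evenly across zones."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "topologySpreadConstraints": [
                        {
                            # Allow at most one pod of difference between zones.
                            "maxSkew": 1,
                            "topologyKey": "topology.kubernetes.io/zone",
                            # Leave pods pending rather than pile into one zone.
                            "whenUnsatisfiable": "DoNotSchedule",
                            "labelSelector": {"matchLabels": labels},
                        }
                    ],
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


# Hypothetical service name and image, used only for illustration.
manifest = zone_spread_deployment(
    "shard-service", "example.com/shard-service:1.0", replicas=6
)
print(json.dumps(manifest, indent=2))
```

Since `kubectl apply -f` accepts JSON as well as YAML, the printed manifest could be applied directly. Setting `whenUnsatisfiable` to `ScheduleAnyway` instead would favor availability over an even spread when a zone has no free capacity.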
We also need to make sure that we are using consistent hashing, because the number of zones is not fixed. We might have a requirement to add more zones or to remove a zone. In that case we need consistent hashing, and we need to make sure that the data is moved to the correct zone so that the customer can be served correctly.

Okay, so data consistency and state management. If you have strong consistency requirements, then the request will only be successful once the data is replicated across all the zones, which is highly consistent. From the CAP theorem you know that consistency and availability trade off against each other. If you try to make the system highly consistent, your availability goes for a toss. And the other way around: if you want to keep the system highly available, consistency goes for a toss, right? If you go for high consistency, you are saying that you want to write to all the zones before you confirm completion of the request, which means that if any zone is not available, you will not be able to write. So you are going for high consistency, but you are reducing availability. My suggestion is to go with eventual consistency, because our focus is high availability. It should be okay if the other zones eventually get the data.

And then zone-aware database sharding: we can place complete data partitions within a zone while maintaining global consistency. For conflict resolution, we can have a simple last-write-wins policy, or a more complicated algorithm, depending on the requirements.

It's also very important to have monitoring and observability. We need to know the capacity of the different zones.
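Returning to the consistent hashing point above, a minimal hash-ring sketch shows why it suits a changing number of zones: adding a zone moves only the customers that the new zone takes over. The `ZoneRing` class, the virtual-node count, and the zone and customer names are assumptions for illustration, not details from the talk.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ZoneRing:
    """Hypothetical consistent-hash ring mapping customers to zones."""

    def __init__(self, zones, vnodes=100):
        self._vnodes = vnodes  # virtual nodes per zone, smooths the distribution
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> zone
        for zone in zones:
            self.add_zone(zone)

    def add_zone(self, zone):
        for i in range(self._vnodes):
            point = _hash(f"{zone}#{i}")
            self._owner[point] = zone
            bisect.insort(self._points, point)

    def remove_zone(self, zone):
        self._points = [p for p in self._points if self._owner[p] != zone]
        self._owner = {p: z for p, z in self._owner.items() if z != zone}

    def zone_for(self, customer_id):
        if not self._points:
            raise LookupError("no zones in the ring")
        # First ring position clockwise from the customer's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(customer_id))
        return self._owner[self._points[idx % len(self._points)]]


ring = ZoneRing(["zone-a", "zone-b", "zone-c"])
customers = [f"customer-{i}" for i in range(1000)]
before = {c: ring.zone_for(c) for c in customers}

ring.add_zone("zone-d")
after = {c: ring.zone_for(c) for c in customers}
moved = [c for c in customers if before[c] != after[c]]
print(f"{len(moved)} of {len(customers)} customers moved, all to zone-d")
```

The moved customers are exactly the ones whose data has to be migrated to the new zone before requests for them are routed there.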
Based on that, we have to pre-provision another zone or increase the capacity of an existing zone. So it's very important to monitor the health of the services in each zone. Since there are now multiple zones, debugging any issue is harder. That's why distributed tracing helps you trace the issue, and having logs helps you find it faster.

Cost optimization is also very important. When we go for high availability, what we are actually doing is duplicating everything, right? We are duplicating infrastructure, which means more cost. There are a few things we can do to keep the cost in check. First, based on perf tests, we can right-size the instances in the different zones. There is also the concept of reserved instances, which means we book the instances in advance at a heavily discounted rate, so that can save us money as well. We have to be careful about how we communicate between the zones; there are different options, and we can choose one based on cost. A zone-aware autoscaling policy helps us avoid over-provisioning a zone and autoscale only when it is needed.

So, operational best practices. We can have a CI/CD pipeline that includes a canary stage, so before we deploy to every zone we can test our change and reduce the blast radius. We also need to make sure that the changes we make are not zone dependent, where your change works in one zone and not in another. Your service needs to be zone agnostic and should be able to work in any zone.

Okay, so in conclusion: sharding your service across multiple zones will help you achieve higher availability. But then there is additional complexity in maintaining the metadata service, which is required for routing.
There is additional cost, and there is additional complexity when you want to debug an issue. So depending on the application's use case, we need to see whether sharding across multiple zones is required. One use case I can think of is a database: when you have database services where you actually store the data, you shard your service across multiple zones and then you replicate, right? In this case, even if one of the zones is down, you will still be able to serve the data from the other zones. That matters if it's serving mission-critical services, or tier-0 services. For those use cases, it makes sense. So I think I will stop here. Thank you very much.

Conf42 Kube Native 2025 - Online

October 16 2025 - premiere 5PM GMT

Rahul Singh Thakur

Lead Member of Technical Staff @ Salesforce
