Conf42 Prompt Engineering 2025 - Online

- premiere 5PM GMT

Prompt-Driven Infrastructure: Sharding AI Services for Scale & Resilience

Abstract

Learn how to scale AI infrastructure using proven sharding strategies. Discover techniques for distributing prompt processing across zones, achieving 99.99% uptime, and building fault-tolerant systems that handle enterprise AI workloads efficiently.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey, my name is Rahul, and I want to talk about prompt-driven infrastructure: sharding AI services for scale and resilience. Given the increased adoption of AI, it is very important to make sure the infrastructure those models run on can also scale to the demand. Here I will discuss how you can shard AI services to make them scale and stay highly available.

Traditional AI infrastructure has performance bottlenecks, single points of failure, and resource contention. Before we solve the problem, we need to understand the challenges that are specific to AI services. First, there is extreme variability in traffic. Second, conversations are stateful, so we have to store state. Third, there is resource intensity: some tasks require far more resources than others. And finally, there are unexpected failures.

The sharding solution I want to propose is this: instead of serving all the traffic from a single instance of the service, we run multiple shards of the service, each serving a percentage of the traffic, with intelligent logic to route requests and to fail over in case of any failure.

Let me talk about the basic components. First are the AI service shards: multiple copies of the service, each working independently. Next is the metadata service, which is very important; it keeps track of all the shards we have and their dependencies. Since we now have multiple shards, we also have to route each request to the shard appropriate for that request. We also want the shards spread across multiple availability zones, so that if there is an outage in one availability zone, the service is still functional.
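As a rough illustration of the metadata routing service described above, here is a minimal sketch in Python. All names (shard IDs, method names) are my own assumptions, not from the talk; a real system would back this with a replicated metadata store rather than in-process state.

```python
import hashlib

class ShardRouter:
    """Minimal sketch of a metadata routing service for AI service shards.

    Each shard serves a slice of the traffic; a stable hash of the user ID
    keeps requests from the same user on the same shard, and unhealthy
    shards (e.g. an availability-zone outage) are skipped automatically.
    """

    def __init__(self, shards):
        # Metadata: shard name -> healthy flag.
        self.healthy = {name: True for name in shards}

    def mark_unhealthy(self, name):
        self.healthy[name] = False

    def route(self, user_id):
        live = sorted(name for name, ok in self.healthy.items() if ok)
        if not live:
            raise RuntimeError("no healthy shards available")
        digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
        return live[digest % len(live)]

# Shards spread across availability zones (illustrative names).
router = ShardRouter(["shard-us-east-1a", "shard-us-east-1b", "shard-eu-west-1a"])
primary = router.route("user-42")
router.mark_unhealthy(primary)    # simulate an outage in that zone
backup = router.route("user-42")  # the same user fails over to a healthy shard
```

Note that plain modulo hashing reshuffles some users whenever a shard is removed; a production router would typically use consistent hashing to minimize that movement.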
Then there is state management: conversations with the AI are stateful, so we have to take care of the state as well. And there is the observability system, which becomes more important once we have multiple shards, because we need good observability to debug issues.

I want to go deeper on some of these. The first is how we maintain session affinity: requests from the same user should go to the same shard every time. That is where the intelligent metadata routing service comes into the picture; it keeps an affinity toward the shard that is already serving that user's traffic. Second, if some shards are overloaded, the routing needs to be intelligent enough to send traffic to the shards with less load. We can also have priority-based routing, which can have multiple implementations, based on the subscription model or on the type of workload. And there is geographic optimization: since we have shards in different regions, we can use geo-based routing to serve each request from the nearest region.

Now, how will we manage the state? We already said that conversations are stateful: there is context, and there are customer settings. Whenever a customer is being served on one particular shard and for some reason we need to fail over to another, we also have to move all the state that goes with that conversation. One solution is a centralized database: we run multiple shards of the service, but they share a centralized database for the context. To be clear, the compute is sharded, but the state is not. That brings us to conversation migration.
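To make the "sharded compute, shared state" idea concrete, here is a minimal sketch of a centralized conversation store. The in-memory dict stands in for the centralized database the talk describes (in practice something like Redis or a relational store), and all field names are illustrative assumptions.

```python
class ConversationStore:
    """Sketch of a centralized context store shared by every shard.

    Compute is sharded, but conversation state lives in one place, so any
    shard can resume a conversation after a failover.
    """

    def __init__(self):
        self._state = {}  # conversation_id -> {"context": [...], "settings": {...}}

    def save(self, conversation_id, context, settings):
        self._state[conversation_id] = {"context": list(context),
                                        "settings": dict(settings)}

    def load(self, conversation_id):
        # Unknown conversations start with empty context and default settings.
        return self._state.get(conversation_id,
                               {"context": [], "settings": {}})

store = ConversationStore()
# Shard A handles the conversation and persists its state...
store.save("conv-1", context=["Hi", "Hello! How can I help?"],
           settings={"lang": "en"})
# ...and after a failover, shard B picks it up from the shared store.
resumed = store.load("conv-1")
```

The design choice here is that failover never has to copy state between shards directly: whichever shard is routed the next request simply loads from the shared store.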
If for some reason one shard is not working correctly, we have to migrate the conversation to another shard. The process is: first retrieve the state, then find the new shard, then update the routing, and we can also do some pre-caching.

There are a few different strategies to consider when we want to scale. One is adding more shards, which is horizontal scaling: every time you want to scale, you add another shard. The second is that if one shard by its nature handles more load, you can scale it vertically. There is also capacity reservation, where we reserve hosts to adapt to expected capacity; and that ties into cost optimization, because using reserved capacity helps keep costs down.

The important thing is how you auto-scale. There can be different indicators that tell the system it needs to scale up or scale down. We can have a queue where we monitor customer requests, and based on the queue size decide to scale up or down. We can look at the latency to serve requests: if requests are taking longer than some threshold, we scale up. Predictive models, again AI models, can predict the usage pattern from past usage history, so we can scale up or down ahead of time. One more is cost-aware triggers, which help keep costs in check: if you have shards that are not being utilized, you can bring them down and control the cost.

That brings us to fault tolerance and resilience. Now that we have multiple shards, it is very important to make sure they are all running fine.
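The auto-scaling triggers above (queue depth, latency, cost awareness) can be sketched as a single decision function. The thresholds below are made-up illustrative numbers, not recommendations from the talk:

```python
def scaling_decision(queue_depth, p95_latency_ms, utilization):
    """Decide whether the shard pool should grow, shrink, or hold.

    queue_depth    -- pending requests in the customer-request queue
    p95_latency_ms -- 95th-percentile time to serve a request
    utilization    -- fraction of shard capacity in use (0.0-1.0)
    """
    # Queue- and latency-based triggers: scale up under pressure.
    if queue_depth > 1000 or p95_latency_ms > 2000:
        return "scale_up"
    # Cost-aware trigger: a mostly idle pool can be shrunk.
    if queue_depth < 50 and utilization < 0.2:
        return "scale_down"
    return "hold"

print(scaling_decision(5000, 300, 0.9))   # heavy queue -> "scale_up"
print(scaling_decision(10, 100, 0.05))    # idle pool   -> "scale_down"
```

A predictive model would feed the same function with forecast values instead of live metrics, so the pool grows before the traffic spike arrives.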
We can have health checks for each shard, disaster recovery, and automatic failover. These are a few of the things we can do to make sure the service is always available.

On operational complexity and monitoring: since we now have multiple shards, it becomes very important to make sure we can operate all of them correctly. The key is that if we collect all kinds of metrics, we can build automated systems to monitor them and make sure every shard is functional.

The key benefits of this approach: you get close to one-hundred-percent availability; you can scale your service as much as you want; and there is cost reduction, because you don't have to maintain a monolithic system. You can customize each shard to its needs, so if it only requires very basic resources, you provision just that and don't pay extra. And each shard operates on its own, so there is no dependency between them.

For implementation: first build a few shards and test that your code is functional, that you can scale the shards up and down, and that your metadata service is working fine; then you can expand the solution to more shards.

On the future of AI infrastructure: I think infrastructure becomes very important if we want to support the increasing demand for AI. We have to make sure the infrastructure is always available, and that is where the sharding strategy will help. Thank you.
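As a closing illustration, the health checks and automatic failover mentioned in the talk could be sketched as a periodic pass over the shard pool. The probe interface and status shapes here are my own illustrative assumptions:

```python
def run_health_checks(shard_status, probe):
    """Sketch of one health-check pass with automatic failover.

    shard_status -- dict mapping shard name -> healthy flag (the routing
                    metadata); probe(name) returns True if the shard
                    answers its health endpoint.
    Returns the list of shards newly taken out of rotation.
    """
    failed_over = []
    for name in shard_status:
        ok = probe(name)
        if shard_status[name] and not ok:
            shard_status[name] = False   # take the shard out of rotation;
            failed_over.append(name)     # its traffic reroutes to healthy shards
        elif not shard_status[name] and ok:
            shard_status[name] = True    # a recovered shard rejoins the pool
    return failed_over

status = {"shard-a": True, "shard-b": True}
down = run_health_checks(status, probe=lambda name: name != "shard-b")
# shard-b fails its probe and is removed from rotation; shard-a keeps serving.
```

In a real deployment this pass would run on a schedule and write the updated flags back to the metadata service, so the router stops sending traffic to failed shards automatically.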

Rahul Thakur

Lead Member of Technical Staff at Salesforce



