Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey, my name is Raul and I want to talk about problem-driven infrastructure:
sharding AI services for scaling.
Given the increased adoption of AI, it is very important to make sure the
infrastructure those models run on is also able to scale to the requirements.
Here I will be discussing how you can scale and make your AI services highly available.
The challenges that traditional AI infrastructure has are performance
bottlenecks, single points of failure, and resource contention.
So before we solve the problem, we need to understand the different
challenges that are very specific to AI services.
The first thing is that there is extreme variability in the traffic.
Second is that the conversations are stateful.
So we have to store some state.
The third is resource intensity.
Some of the tasks require a lot more resources than others.
And then there are unexpected failures.
The sharding solution that I want to propose is: instead of serving all the
traffic from a single instance of the service, we can have multiple shards
of the service, each serving some percentage of the traffic. Then we can
have intelligent logic to route the traffic, and also to fail over
in case of any failure.
Okay, so I want to talk about some basic components.
The first is the AI service shards, which means that we will have
multiple shards of these services, each working independently.
Then the metadata service, which is very important.
This is the service that actually keeps track of all the shards
we have and their dependencies.
Since we now have multiple shards, we also have to route each request to the
shard appropriate for that request.
We also want our service to be sharded across multiple availability zones, so that
if there is an outage in one availability zone, the service is still functional.
Then state management: since the conversation with the AI is
stateful, we have to take care of the state as well.
And then the observability system. This becomes more important
since we have multiple shards,
so we have to have the correct observability to debug issues.
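As a rough sketch of what that metadata service might keep track of (this is not code from the talk, just an illustration with hypothetical field and method names), a shard registry could look something like this in Python:

```python
from dataclasses import dataclass

@dataclass
class Shard:
    # Hypothetical fields a metadata service might track per shard.
    shard_id: str
    availability_zone: str
    region: str
    capacity: int             # max concurrent conversations this shard can hold
    active_sessions: int = 0
    healthy: bool = True

class MetadataService:
    """Keeps track of all shards and their state (illustrative only)."""

    def __init__(self) -> None:
        self._shards: dict[str, Shard] = {}

    def register_shard(self, shard: Shard) -> None:
        self._shards[shard.shard_id] = shard

    def mark_unhealthy(self, shard_id: str) -> None:
        self._shards[shard_id].healthy = False

    def healthy_shards(self) -> list[Shard]:
        return [s for s in self._shards.values() if s.healthy]
```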
Okay, I want to go deeper on some of these.
The first one is how we will maintain session affinity, right?
Requests from the same user should go to the same shard every time, right?
That's where the intelligent metadata routing service comes into the picture.
It will have affinity towards the shard which is already serving that traffic.
Second, if some shards are being overloaded, the routing needs to be
intelligent enough to route the traffic to the shards which have less traffic.
We can also have priority-based routing.
This can have multiple implementations, based on the subscription
model or the type of the workload, and many other things.
And also geographical optimization.
Since we have multiple shards in different regions, we can actually
have better routing to support it.
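To illustrate that routing logic, here is a small sketch combining session affinity with a least-loaded fallback, reusing the Shard and MetadataService sketch above; the load threshold and the in-memory affinity map are assumptions for illustration, not details from the talk.

```python
class Router:
    """Illustrative routing: sticky sessions first, then the least-loaded shard."""

    def __init__(self, metadata: MetadataService, load_threshold: float = 0.8):
        self.metadata = metadata
        self.load_threshold = load_threshold   # assumed overload cutoff
        self.affinity: dict[str, str] = {}     # user_id -> shard_id

    def route(self, user_id: str) -> Shard:
        shards = self.metadata.healthy_shards()
        by_id = {s.shard_id: s for s in shards}

        # 1. Session affinity: keep the user on the shard already serving them,
        #    as long as that shard is healthy and not overloaded.
        sticky_id = self.affinity.get(user_id)
        if sticky_id in by_id:
            shard = by_id[sticky_id]
            if shard.active_sessions / shard.capacity < self.load_threshold:
                return shard

        # 2. Otherwise pick the least-loaded healthy shard and remember it.
        shard = min(shards, key=lambda s: s.active_sessions / s.capacity)
        self.affinity[user_id] = shard.shard_id
        return shard
```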
So yeah, I think we need to talk about how we will be managing the
state, since we already said that these conversations are stateful.
There is context, and there are some customer settings.
So if the customer is already being served on one particular shard,
and then for some reason we need to fail over to another,
we also have to move all the state corresponding to it, right?
One solution is that we can have a centralized database.
So what we are saying is that we will have multiple shards of
the service running, but they will have a centralized database
for the context, so that the compute is sharded
but not the database, to be clear.
Right.
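As a minimal sketch of that idea, assuming a simple key-value store keyed by conversation ID (in practice this would be a managed database or cache, and the field names here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    # Hypothetical record holding the context and customer settings.
    conversation_id: str
    user_id: str
    context: list[str] = field(default_factory=list)    # prior turns
    settings: dict[str, str] = field(default_factory=dict)

class StateStore:
    """Stands in for the centralized database shared by all shards."""

    def __init__(self) -> None:
        self._store: dict[str, ConversationState] = {}

    def save(self, state: ConversationState) -> None:
        self._store[state.conversation_id] = state

    def load(self, conversation_id: str) -> ConversationState | None:
        return self._store.get(conversation_id)
```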
Then conversation migration.
If for some reason one shard is not working correctly, we have to
migrate: we have to communicate that this conversation is
being migrated to the other shard.
So the process will be: we will first retrieve the state, find the
new shard, update the routing, and we can also do some pre-caching.
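Putting those steps together, here is a hedged sketch of the migration flow (retrieve the state, pick a new shard, update the routing, then pre-cache), reusing the Router and StateStore sketches above; it only shows the ordering of the steps, not a production implementation.

```python
def migrate_conversation(conversation_id: str,
                         router: Router,
                         store: StateStore) -> Shard:
    # 1. Retrieve the conversation state from the centralized store.
    state = store.load(conversation_id)
    if state is None:
        raise KeyError(f"unknown conversation {conversation_id}")

    # 2. Find a new healthy shard for this user (least-loaded fallback).
    new_shard = router.route(state.user_id)

    # 3. Update the routing so future requests stick to the new shard.
    router.affinity[state.user_id] = new_shard.shard_id

    # 4. Optionally pre-cache the context on the new shard before cutover
    #    (represented here by re-saving the state; a real system would push
    #    the context into the new shard's local cache).
    store.save(state)
    return new_shard
```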
Okay.
There are a few different strategies that can be considered when we want to scale.
One is that we can have more shards, which is horizontal scaling, right?
Every time you want to scale, you can add more shards.
The second one is that if the nature of one shard is such that it is
handling more load, you can actually vertically scale it, right?
Then capacity reservation: we can reserve hosts to adapt to the capacity.
And cost optimization: if you use the reserved capacity, it will
help with cost optimization.
The important thing is how you can auto-scale, right?
There can be different indicators to the system that say
the system needs to be scaled up or scaled down.
We will have a queue where we are monitoring the customer requests;
based on the queue size, we can decide to scale the system up or down.
We will have the latency to serve the requests; based on some threshold,
if the requests are taking time, we can scale the system up or down.
Predictive models: these are, again, AI models which, based on the past
history of usage, predict the usage pattern,
so that we can scale up or down accordingly.
One more is cost-aware triggers, which will help you keep the cost in
check: if you have a shard which is not being utilized, you can actually
bring it down and control the cost.
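To make those triggers concrete, here is a small sketch of an autoscaling decision that looks at queue depth, latency, and utilization; all of the thresholds are invented for illustration and are not numbers from the talk.

```python
def scaling_decision(queue_depth: int,
                     p95_latency_ms: float,
                     avg_utilization: float,
                     max_queue: int = 100,          # assumed thresholds
                     max_latency_ms: float = 2000,
                     min_utilization: float = 0.2) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' (illustrative only)."""
    # Scale up when the request queue or the serving latency crosses a threshold.
    if queue_depth > max_queue or p95_latency_ms > max_latency_ms:
        return "scale_up"
    # Cost-aware trigger: shards sitting mostly idle can be brought down.
    if avg_utilization < min_utilization and queue_depth == 0:
        return "scale_down"
    return "hold"

# Example: a long queue forces a scale-up regardless of utilization.
print(scaling_decision(queue_depth=250, p95_latency_ms=900, avg_utilization=0.5))
```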
Fault tolerance and resilience.
Yeah, so since we now have multiple shards, it is very important to
make sure they are all running fine.
We can have a health check corresponding to each shard.
We can have disaster recovery.
Automatic failover.
These are a few things we can do to make sure that the
service is always available.
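One possible shape of the per-shard health check with automatic failover, again reusing the earlier sketches; the probe function is a placeholder for whatever liveness endpoint the shards would expose.

```python
from typing import Callable

def check_and_fail_over(metadata: MetadataService,
                        router: Router,
                        probe: Callable[[Shard], bool]) -> None:
    """Mark failed shards unhealthy and move their users to healthy shards."""
    for shard in metadata.healthy_shards():
        if probe(shard):          # probe() is a hypothetical liveness check
            continue
        metadata.mark_unhealthy(shard.shard_id)
        # Re-route every user pinned to the failed shard.
        for user_id, shard_id in list(router.affinity.items()):
            if shard_id == shard.shard_id:
                del router.affinity[user_id]
                router.route(user_id)   # picks a healthy, least-loaded shard
```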
So, operational complexity and monitoring: since we now have multiple
shards, it becomes very important to make sure that we are able to
operate all the shards correctly.
The most important thing is that if we are collecting all kinds of metrics,
we can actually build an automated system to monitor them and make sure
that all the systems are functional.
The key benefit of using this approach is that you get close to
a hundred percent availability.
You can scale your service as much as you want.
There is a cost reduction because you don't have to
maintain a monolithic system.
You can actually customize each shard to its needs.
So if it needs only very basic resources, you can provision
just that, so you don't pay extra.
And each shard actually operates on its own,
so there is no dependency on each other.
The implementation: we have to first build a few shards and then test
that your code is functional, that you are able to
scale the shards up and down,
and that your metadata service is working fine. Then you can actually
expand the solution to more shards.
The future of AI infrastructure: I think the infrastructure becomes very
important if we want to support the increased demand for AI.
We have to make sure that the infrastructure is always
available, and that's where the sharding strategy will help.
Thank you.