Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey.
Hi everyone.
Today I'm going to talk about how to build a developer-centric machine learning inference platform.
The major things I'm going to cover are the challenges you face while developing an ML-centric inference platform, how you can better serve models at scale, and the basic practices we can follow as a team.
Kubernetes is the underlying infrastructure we use for machine learning models, so we'll also look at how to better leverage that infrastructure, the surrounding tools and processes, and how well the whole ML platform can scale and serve the customer.
Those are the things we are going to talk about today.
Let's dive deeper in the next slides.
Yeah.
If you look at the evolution of ML platform engineering, it has three main components.
One is the resource-intensive nature: how well you can estimate what resources are required for a model to serve the production load.
For example, when you're developing a model, you might need only one pod.
But when you deploy the model and serve it for some use case, say an e-commerce or marketing use case, you might need a hundred or a thousand pods.
It depends on the latency requirements and on what features the model needs.
That's the second point: the computational requirements of an ML platform. The third is the rapid experimentation cycles involved.
Basically, you can't have the same model work for all use cases or all users.
You experiment with different versions of the same model, or even different models, and see which is the best fit.
A model might work for one country but not another; some models may work for a teen population but not for other groups, and some models are built specifically for, say, a female population.
It depends on the use case, and it changes, so we need rapid experimentation cycles.
Kubernetes provides the base, an abstraction layer, but you should have robust automation and a deep understanding of both machine learning workflows and developer experience principles so you can keep evolving the machine learning platform.
Let me dive deeper into what's involved here.
Yeah.
When we talk about architectural foundations, designing for scale and flexibility is very important.
There are three pillars here.
One is the orchestration layer, the second is the model serving infrastructure, and the third, last but not least, is the data pipeline layer.
All three go hand in hand.
The orchestration layer is built on Kubernetes.
It should provide foundational capabilities for scheduling, resource management, container lifecycle, resource allocation, and service discovery, which is where you handle all the machine learning workloads the model requires.
In some cases you might even need GPUs, so GPU scheduling is also part of the orchestration layer.
The model serving infrastructure is where you take a trained model and make it readily available to be consumed by some API, by your homepage, or wherever the model is going to be used.
You need a way to load the model, route requests to it, and optimize it.
Sometimes that requires batch optimization, and then there is response formatting: the model might return a response in one format, but the client or application might need it in a different one.
You should have some way to format that response; much of the complexity stems from the diversity of model formats and serving requirements.
Those things are all part of the model serving infrastructure.
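As a rough illustration of that serving path, here is a minimal sketch of a single prediction endpoint: the model is preloaded once, a request is routed to it, and the raw output is reformatted for the client. The FastAPI framework, the joblib-saved model, and the path are assumptions for illustration, not the specific stack described in this talk.

```python
# Minimal serving sketch: load once at startup, predict, reformat the output.
# Assumes a scikit-learn style model saved with joblib at MODEL_PATH (hypothetical).
from fastapi import FastAPI
import joblib

MODEL_PATH = "/models/recommender/v3/model.joblib"  # hypothetical path

app = FastAPI()
model = joblib.load(MODEL_PATH)  # preload so individual requests never pay the load cost

@app.post("/predict")
def predict(payload: dict):
    features = payload["features"]            # client sends a raw feature vector
    raw = model.predict_proba([features])[0]  # model's native output: class probabilities
    # Response formatting: the client wants a ranked list of item ids with scores,
    # not the raw probability vector the model produces.
    ranked = sorted(enumerate(raw), key=lambda x: x[1], reverse=True)
    return {"items": [{"item_id": i, "score": float(s)} for i, s in ranked]}
```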
The data pipeline layer comes in because developing a model requires a lot of features.
Some are real-time features, some are near-real-time features, and some are offline features.
Let's take an example where you go to a shop and purchase an item, or similarly purchase an item online, and you're trying to develop a model to understand a customer's buying pattern.
That requires a lot of data.
It may require historical data about past purchases and about the items the user is browsing, bidding on, or buying.
Sometimes you also need near-real-time data: you listen to a Kafka stream or some similar stream, ingest those topics in near real time, and serve them as features for the model.
In some cases you have real-time data: even before the model output is served to the user, we check whether an item is still available.
For example, the model finds ten different items the user could buy, and you have to check whether each item is in stock before showing it to the customer.
Those real-time checks, and the caching along the way, are part of this data pipeline.
So these are the three things we need to consider with respect to architecture.
I was talking about Kubernetes orchestration before.
If you dive deeper into that, CRDs play a pivotal role in extending Kubernetes to support machine-learning-specific concepts.
These extensions enable teams to define higher-level abstractions like model deployments, feature pipelines, and experimentation runs, all through custom resource definitions, and there are different patterns around them.
The operator pattern is particularly valuable for managing ML workload lifecycles.
Custom operators can handle deployment workflows, auto scaling based on inference load, and integration with external systems like feature stores or a model registry.
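As a hedged sketch of what such an abstraction can look like, the snippet below creates a hypothetical ModelDeployment custom resource through the Kubernetes Python client. The CRD group, kind, and fields are illustrative, not a real project, and assume an operator in the cluster that reconciles them.

```python
# Illustrative only: apply a hypothetical "ModelDeployment" custom resource.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a pod
api = client.CustomObjectsApi()

model_deployment = {
    "apiVersion": "mlplatform.example.com/v1alpha1",   # hypothetical CRD group/version
    "kind": "ModelDeployment",
    "metadata": {"name": "recommender-v3", "namespace": "ml-serving"},
    "spec": {
        "modelUri": "s3://models/recommender/v3",   # pulled from the model registry
        "replicas": {"min": 10, "max": 1000},        # bounds for the autoscaler
        "resources": {"nvidia.com/gpu": 1},          # GPU scheduling handled by Kubernetes
        "canary": {"initialTrafficPercent": 1},
    },
}

api.create_namespaced_custom_object(
    group="mlplatform.example.com",
    version="v1alpha1",
    namespace="ml-serving",
    plural="modeldeployments",
    body=model_deployment,
)
```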
If you're familiar with Databricks, it's one external product on the market where you have an end-to-end model development cycle in place: you have training infrastructure, and you have a feature store where you can write data into a schema.
It can also integrate with external systems like Redis, or with whatever Kafka or near-real-time data source you're listening to.
It can interact with Azure Kubernetes Service or AWS through the connectors available, so integration with other platforms is seamless.
That's the orchestration layer; next is developer experience.
When I talk about developer experience, it's about how easily you provide these capabilities: abstracting the complexity behind a simple UI or self-service capabilities.
When a developer is trying to deploy a model, they should not have to run multiple commands or go through multiple pages.
It should be seamless: if I check code into my develop branch and there is an automated way to deploy it to the development namespace or development environment, that's the best thing to do.
And if code is merged to main and there are automated pipelines integrated with CI/CD that can fully test, validate, and deploy to production at scale, that speeds up the model development lifecycle by ten or even a hundred times.
You can do faster deployments and faster testing with these kinds of automated approaches, and we'll see how to achieve that in the upcoming slides.
I mentioned automated CI/CD for ML on the previous slide.
If you look at automated CI/CD, at a high level there are three different pieces.
The first is model validation.
When you push a new change to a model, there should be some way to validate it.
You can manually validate the model before pushing, but that's a crude way of doing it.
If you have regression, integration, sanity, and smoke tests that are already automated as part of your pipeline, you don't need that manual check.
Whenever you deploy the model, the automated validation kicks in, and if it fails, the deployment won't even happen.
Post-deployment, you can also have validations in place to monitor, and if failures go above some threshold, there should be an automated rollback option.
That's where artifact management plays a key role.
When you keep multiple versions of the model, at least the past three or four versions, or the past ten depending on your requirements, it's easy to roll back to the last stable version.
That way you're not disrupting the production environment.
And you have several deployment strategies.
For example, canary deployment: say you have a thousand pods serving the model.
You can first deploy to one sanity-test pod and see how the traffic hitting that pod behaves.
If there are too many failures on that pod, there's an automated rollback, so you are not disrupting the whole set of users.
Maybe a few users are still impacted, but not everyone.
Similarly, gradual traffic shifting is important because when you're deploying a model at scale, you can't affect a hundred percent of the traffic at once.
If there is a way to go to 1%, 5%, 10%, and then a hundred percent, that kind of deployment is very important.
Or suppose you're pushing a hotfix, and without that fix the whole functionality breaks; in that case you should be able to deploy the change to a hundred percent immediately.
Those are the kinds of flexibility we need to provide in terms of deployment strategy.
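Below is a rough sketch of gradual traffic shifting with automated rollback. The set_traffic_split() and error_rate() callbacks are hypothetical stand-ins for whatever your service mesh, ingress, or deployment tooling actually exposes.

```python
# Gradual traffic shifting with automated rollback (illustrative sketch).
import time

STEPS = [1, 5, 10, 50, 100]   # percent of traffic sent to the canary
ERROR_THRESHOLD = 0.02        # roll back if the canary error rate exceeds 2%
SOAK_SECONDS = 300            # how long to observe each step

def rollout(set_traffic_split, error_rate) -> bool:
    for pct in STEPS:
        set_traffic_split(canary_percent=pct)
        time.sleep(SOAK_SECONDS)                  # let real traffic hit the canary
        if error_rate("canary") > ERROR_THRESHOLD:
            set_traffic_split(canary_percent=0)   # automated rollback to the stable version
            return False
    return True                                   # canary now serves 100% of traffic
```

A hotfix path can simply skip the intermediate steps and go straight to a hundred percent, which is the flexibility mentioned above.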
Next: feature store and data pipeline architecture.
We talked about this at a high level.
When you dive deeper, the feature store addresses the fundamental challenge of serving features for both batch training and real-time inference.
Basically you can consider it as two different things: online and offline.
Offline is for historical data and batch processing; you can have a nightly or weekly job that ingests those features.
Online is for real-time or near-real-time inference: you listen to a Kafka topic or directly hit an API even before the actual output is served to the customer.
Those are two different kinds of data, and your architecture should accommodate both the online and offline patterns so it can better serve the model to customers.
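As an illustrative sketch of the online path, the snippet below keeps the latest value per user from a near-real-time stream in a low-latency store. The topic name, store host, and libraries (kafka-python, redis) are assumptions; the offline path would be a separate batch job.

```python
# Online feature path sketch: consume a near-real-time stream and keep the
# latest value per user so the serving layer can read it in one key lookup.
import json
import redis
from kafka import KafkaConsumer  # assumes kafka-python is installed

online = redis.Redis(host="online-feature-store", port=6379)  # hypothetical store

consumer = KafkaConsumer(
    "user-browse-events",                         # hypothetical topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v),
)
for event in consumer:
    user_id = event.value["user_id"]
    online.hset(f"user:{user_id}", "last_viewed_item", event.value["item_id"])

# Offline path (a separate nightly job, not shown): aggregate historical
# purchases in the warehouse and materialize them for batch training.
```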
Next is how you are going to serve the model.
In general, there are three different areas here.
One is container optimization for machine learning, the second is serverless serving frameworks, and the third is cold start optimization.
Container optimization requires special techniques that account for large model artifacts and GPU dependencies.
In some cases you need a multi-stage build process, which can minimize the container image size by separating build dependencies from the runtime environment.
You may need certain things at runtime, but during the build you have to make sure the image size is optimized so you reduce the footprint.
For serverless serving, you abstract away cluster management while providing automatic scaling capabilities.
Say you deploy a model with a thousand pods but receive only five requests per second; a thousand pods are not required for that model.
But there are times when you may receive 20,000 or 30,000 requests per second.
If a pod can handle five requests per second and you have 5,000 requests per second, you need a thousand pods.
So how can you do this? That's where automatic scaling plays a key role.
For example, when you have pods that can automatically scale, a minimum number of pods is always there, maybe one, ten, or a hundred pods serving the traffic.
But when there is a huge traffic spike, you see the infrastructure scale up the number of pods required.
That way there is no disruption, and we are able to serve customers without any latency issues.
Cold start optimization is another important thing to consider in model serving: model preloading, lazy initialization, and shared model caches.
For example, if your pod restarts, your cache is gone; whatever you had locally is lost.
If you have a persistent, multi-level cache concept in your design, you can still get the data from the shared persistent cache, and latency-wise it's not that bad.
It might not be as good as the local cache, but it's still a few milliseconds, and you are not hitting the actual database, source, or API for that particular data.
Those are the things to consider to avoid latency issues.
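Here is a minimal sketch of that multi-level idea: a pod-local cache in front of a shared persistent cache, falling back to the real source only on a double miss. Redis and the five-minute TTL are assumptions for illustration.

```python
# Two-level cache sketch: in-process dict (dies with the pod) backed by a
# shared Redis cache (survives restarts); fetch_from_source stands in for the
# real database or API lookup.
import json
import redis  # assumes a reachable shared Redis instance

local_cache: dict = {}
shared = redis.Redis(host="feature-cache", port=6379)  # hypothetical host

def get_feature(key: str, fetch_from_source):
    if key in local_cache:                  # fastest: pod-local memory
        return local_cache[key]
    cached = shared.get(key)                # still a few ms: shared cache
    if cached is not None:
        value = json.loads(cached)
    else:
        value = fetch_from_source(key)      # slowest: actual DB / API call
        shared.setex(key, 300, json.dumps(value))  # 5-minute TTL, illustrative
    local_cache[key] = value
    return value
```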
Once you have the model in production and you've handled latency and the other deployment issues, the next thing that comes into the picture is monitoring.
You constantly monitor the model for failures: maybe the input data changed, the pattern in which customers use the model changed, or the model's performance is deteriorating over time.
You catch these things through constant monitoring, observability, and performance optimization.
There are several metrics you need to monitor.
Some are accuracy metrics, along with prediction distribution analysis; feature drift detection is one of the key ones.
Suppose there is drift in a feature: then you can retrain the model.
You don't need to retrain on a fixed schedule or on a daily basis; you can monitor for feature drift or model drift and then act accordingly, deciding whether to retrain the model or replace it with a new one.
Those are the kinds of things you need to have.
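One common way to detect that kind of drift is a population stability index; the sketch below compares the live distribution of a feature against its training baseline and flags retraining above a rule-of-thumb threshold. The data and the 0.2 cutoff are illustrative assumptions.

```python
# Feature drift sketch using a population stability index (PSI).
import numpy as np

def psi(baseline: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    b_pct = np.histogram(baseline, edges)[0] / len(baseline) + 1e-6
    l_pct = np.histogram(live, edges)[0] / len(live) + 1e-6
    return float(np.sum((l_pct - b_pct) * np.log(l_pct / b_pct)))

baseline = np.random.normal(0, 1, 10_000)   # distribution seen at training time
live = np.random.normal(0.5, 1, 10_000)     # shifted distribution in production
if psi(baseline, live) > 0.2:               # 0.2 is a common rule-of-thumb cutoff
    print("Feature drift detected: schedule retraining")
```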
In this slide we're mainly talking about organizational excellence.
Once you have the monitoring capabilities, the deployment capabilities, and the automated scaling capabilities in place, the question is how well you work as a team.
Nothing here can be done by a single person; there are several teams you'll work with for a particular use case.
For example, when you're developing a model, you need to know what problem you are trying to solve, so you work with the business to see what problems they are currently facing.
Take Starbucks as an example: say you have 50,000 stores, and some store is constantly busy with coffee serving times of more than ten or fifteen minutes.
How can a machine learning model help in that case, to optimize resources and bring more stores or more baristas to serve coffee at that location?
These are the problems we first need to understand before finding a solution, and that requires collaboration with different teams, stores, and locations to understand what they're facing.
In general, you can divide all this broadly into three categories.
One is the platform team composition.
When you deploy a model, you basically need a platform where you build, train, deploy, serve, and monitor the model.
There's a platform team that works on improving the infrastructure, making sure it holds up over time, revamping the platform whenever it needs it, and doing constant updates.
The second is cross-functional collaboration.
There are teams who use the platform, teams who build the platform, scientists who develop models on top of the platform, and domain teams who consume the model output.
There should be alignment across these teams when you're trying to solve a problem.
The third is adoption strategies.
Whenever you're trying to push a change, it's very important that there is governance around it, with security and other checks in place.
Suppose you are pushing a change: it's important that you create a pull request and that there is a review mechanism.
Before a major change there should be a design review, and even small bug fixes should be reviewed.
Once the reviews are approved, there should be a security scan that can detect security or vulnerability issues and block the merge even before the change reaches the development environment.
Then there should be automated strategies for working with teams, so every use case is tested and the whole workflow runs seamlessly.
These kinds of adoption strategies need to be there.
For example, if you're trying to migrate to a higher version as a team, you need a proper way to communicate that the migration is going to happen, so the teams using your platform are ready for that change.
Those are the things we need to take into account for organizational excellence, and the next and most important thing is the scaling challenges.
I'm telling you this based on my past experience: when we started building the ML platform, we started with a very small pilot model, and then we finally scaled up to handle three, four, or more models.
When you're trying to do that and evolve along with the teams and the scientists, it requires a lot of optimization strategies and a lot of caching strategies.
Take request batching: when you go from single requests to many requests, throughput matters and latency matters even more, so batching is something you need to take care of, as in the sketch below.
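Here is a rough sketch of server-side micro-batching: requests are queued and the model is called once per batch, trading a few milliseconds of wait for much higher throughput. The batch size and wait time are illustrative assumptions.

```python
# Micro-batching sketch: collect requests for up to MAX_WAIT_S or MAX_BATCH,
# then make a single batched model call.
import queue
import threading
import time

MAX_BATCH = 32
MAX_WAIT_S = 0.01        # flush a partial batch after 10 ms (illustrative)

pending = queue.Queue()  # request handlers put feature payloads here

def batcher(predict_batch):
    while True:
        batch = [pending.get()]                  # block until one request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(pending.get(timeout=remaining))
            except queue.Empty:
                break
        predict_batch(batch)                     # one model call for the whole batch

# In a real server this thread runs alongside the request handlers;
# print stands in for the actual batched model call.
threading.Thread(target=batcher, args=(print,), daemon=True).start()
```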
Then caching strategies, which we discussed in the previous slides.
Basically, you should support multi-level caches so you can eliminate a lot of API calls and database calls, and when you have multiple models that require transformations, like model chaining, caching is always useful.
You don't hit the model unless the result isn't already in the cache.
That helps a lot when it comes to model serving.
Hardware-aware optimization is another thing: whether you use GPUs or TPUs depends on how large your model is and how large your scale is, so that kind of hardware-aware optimization is required.
Multi-tenancy is another thing.
You can't have a single point of failure, so you should have multi-tenancy, and it involves multiple sites or multiple countries.
Your servers should be located as close to the users as possible so the latency stays good.
Those are the things you need to consider: geographic distribution, optimizing global latency, and then edge deployment patterns.
Take CDNs, with Netflix as an example.
If you look at Netflix as a company, CDNs play a major role: their serving infrastructure is built on top of CDNs, so requests rarely need to go beyond the CDN to the actual origin servers.
That's one of the reasons streaming is so fast on Netflix.
Now we are going to talk about security, compliance, and governance.
I'll tell you about some of the cases I've encountered in the past.
There are cases where you'll handle data with PII, like credit card information, or other very sensitive data.
You should have proper security and compliance in place before even dealing with that kind of data, so it's very important that the data is masked and not leaked anywhere outside.
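As a small illustration of the masking idea, the sketch below redacts obvious card numbers and email addresses before a record is logged or fed into a pipeline; real compliance work of course needs much more than regexes.

```python
# Minimal PII-masking sketch: redact card-like numbers and emails before logging.
import re

CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask(text: str) -> str:
    text = CARD.sub("[REDACTED-CARD]", text)
    return EMAIL.sub("[REDACTED-EMAIL]", text)

print(mask("order by jane.doe@example.com, card 4111 1111 1111 1111"))
```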
In some cases you also have to manage legal and regulatory requirements, when you're dealing with healthcare or financial models.
You should make sure all these regulatory requirements are covered: GDPR, HIPAA, financial regulations, and automated compliance checking.
These are all very important.
Even though you try to automate everything, there should be a proper audit trail of what is being done.
Data lineage is another thing: for any feature you're using, we should know where it is generated, who is consuming it, and all the different places it will go.
That way you can backtrack: if there is ever a leak, you can see where it happened, and even if there's an error, you can find where it is.
It's very important to involve the security team at each stage, even if you're just adopting an open source tool or introducing a new component into your platform.
Security and governance need to be in place for all those kinds of changes so you don't have any surprises or leaks.
A leak creates a lot of financial and other problems, so to avoid that, it's better to have the security checks in place beforehand.
When it comes to future directions and emerging trends, there are four things to account for: edge computing, real-time learning, federated learning, and AutoML integration.
For example, AutoML reduces the expertise required for model development.
Suppose you're trying to push models rapidly: it's always good to have some AutoML integration in place, and your platform should support it.
That way it automatically trains and tries to deploy new versions of models, and you should have proper monitoring capabilities so the AutoML-generated models are properly validated.
Edge computing means running close to data sources and end users.
The platform architecture should manage distributed deployments across diverse hardware; that's where edge computing plays a key role.
Then there's real-time learning: your model should be able to learn from new data and adapt to it.
Your platform should support online learning algorithms, streaming data processing, and dynamic model updates to enable this kind of real-time learning.
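As one hedged example of online learning, scikit-learn's partial_fit can update a model incrementally as batches arrive instead of retraining from scratch; the streaming source below is simulated.

```python
# Online learning sketch: incremental updates with partial_fit, no full retrain.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # must be declared on the first partial_fit call

def on_new_batch(X: np.ndarray, y: np.ndarray):
    model.partial_fit(X, y, classes=classes)  # incremental update

# Simulated stream: in production these batches come from your event pipeline.
for _ in range(100):
    X = np.random.rand(32, 8)
    y = np.random.randint(0, 2, 32)
    on_new_batch(X, y)
```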
Federated learning enables model development across distributed datasets without centralizing the data.
It requires specific coordination mechanisms such as privacy-preserving techniques and distributed model aggregation capabilities.
These are some of the components for future and emerging trends: once your platform is at scale, you need these kinds of capabilities in place to keep up with what's coming.
To add on to what we discussed in the previous slides: to create a successful ML inference platform, you need a balance between multiple things.
One is performance, the second is cost, and the third is developer productivity and operational control.
Innovation is also important, especially now with generative AI, and so is reliability.
How reliable are the results you give?
For example, people sign up for an Amazon Prime membership because they find that whatever they search for on Amazon is relevant; if that reliability goes away, it causes a lot of other concerns.
Similarly, a platform earns trust when it can serve all kinds of traffic without issues and autoscale when needed.
That's why reliability is so important.
You achieve that balance through thoughtful architecture, robust automation, and strong organizational alignment.
This investment is important for building sophisticated ML platforms: you reduce the operational and other overheads that can arise in the platform and make sure your deployments are fast and seamless, so you get to market on time or even ahead of time.
That's the main goal, and cross-team collaboration, platform excellence, and the security practices we discussed in the previous slides are all building blocks to achieve it.
Yeah.
Thank you, everyone, for listening patiently. If you have any doubts or questions, feel free to ask me and I can get back to you.
Thank you.