Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. This is Han. I work as a software engineer at Meta. In this talk, I will be covering scalable MLOps pipelines that actually work: essentially, some of the best practices that can help move model development from the laptop to production.
First, we'll talk about the existing challenges. If you look at the global landscape, we can see that ML deployments are growing at a rapid pace, almost 40% annually, or probably more than that right now. And yet success remains extremely low. If you look at some of the statistics, almost 87% of ML projects never reach production. A lot of them die down in or before the experimentation phase. Even for the 13% that actually reach production, there are a lot of operational challenges, and eventually they may not deliver the success metrics that we want to see in production environments. It could be challenges around model drift, infrastructure bottlenecks, or even the operational complexity of how the whole pipeline has been built.
So what will we be covering as part of this talk? This is a brief agenda. First, we'll look at the existing state of enterprise MLOps: what does the current landscape look like? Then we'll talk about the MLOps architectures we can look at, all oriented toward efficient model development, deployment, management, monitoring, and the other aspects of it. The next important aspect is implementation strategies, mostly the technical solutions: how do we have the right methods in place for model validation, testing of ML workflows, seamless ML CI/CD deployment pipelines, and all that. Another important aspect is organizational transformation. It's not only about the tools, it's also about how we function organizationally, so that data science, engineering, and operations teams work together in collaboration to make this whole journey a success. Eventually, we'll also touch upon the emerging trends we can look at, so that we are building these pipelines and our MLOps for the future.
Starting with the current state, there are two quadrants we can look at. One extreme is purely manual, low-maturity processes, and the other side is full automation: high-maturity MLOps systems. Looking at the manual side of things, starting from the bottom left, there is limited pipeline automation. These are the different setups we usually see in production. Mostly it starts with ad hoc experimentation, done manually: we just do some kind of experimentation, mostly manual, with no automation, primarily engineers or scientists building pipelines and testing them, with ad hoc work as the core. The other aspect is minimal automation: limited pipeline automation for scheduled jobs, maybe small cron jobs or something like that, but that's about it. These are mostly manual, low-maturity kinds of setups.
If you look at high maturity on the other end of the spectrum, there are team-specific MLOps systems. Every team or every company has a lot of its own internal tools: how we package the models, deploy the models, and also test them. All those constructs are relatively different in terms of how companies adopt such methodologies. So having consistent, repeatable workflows through the whole lifecycle of model development, deployment, and testing is extremely important; having that is, in a way, the maturity level of any MLOps system. The other important aspect is enterprise MLOps with full automation: data-driven, where we have metrics captured at every stage of the model, whether that is development, testing, or deployment. We can have a lot of testing baked into every phase, and eventually, looking at the metrics, we proceed with the model deployment, or roll back, or go to a different version, and all those different aspects, which we will touch upon briefly now.
To elaborate on what I've been touching on, level zero is mostly ad hoc experimentation. People just start with something new. It's purely manual, with no standardized workflows at all, and fragmented collaboration: the way people collaborate is also mostly fragmented. There are no success metrics defined, and model performance also lags a lot.

The next level in terms of maturity is pipeline automation. Where we previously had no automation in place, now we come up with some kind of automation, basic automation we can say. But even though we've made some progress in terms of pipeline automation, there are still gaps: we don't have really good governance frameworks, teams are pretty siloed, and the monitoring is not robust enough to identify model drift or the other kinds of performance issues we can see. That is the state of level one, briefly.

Level two is a little more advanced or mature, mostly around continuous integration. We have standardized, repeatable pipelines: you can bake in how you want a model to go through the system, with seamless, repeatable pipelines. We also have robust version control: every model gets a version, and all that. At the same time, automated testing is extremely important. When we deploy a model, having automated testing at every phase is extremely important, and it reduces a lot of manual effort. Proactive, really sound monitoring is also extremely important when we talk about enhancing these MLOps pipelines.

The next level is the best in class. You have end-to-end automation with very low or no manual effort at all, comprehensive governance and compliance policies established, centralized MLOps pipelines, and advanced observability with continuous optimization. These are extremely sophisticated MLOps pipelines built with monitoring in mind: during model deployment and testing, we have metrics baked into the whole pipeline, and all the checks around governance and compliance are also baked into the end-to-end pipeline.
Now let's talk in a little more detail about the primary obstacles to ML success. If you look at some of the charts here, the first is organizational silos. For any success, it's extremely important for teams to collaborate and work together. In some of the studies, we can see that almost 73% of failed ML initiatives are directly linked to inadequate collaboration and governance. There are other limitations we can think of: the infrastructure is not strong enough, or not built for, what we want to serve to customers, and there is not enough monitoring. If we don't have monitoring, we don't know what's going on, so it's extremely important to have robust monitoring. Then there are data quality issues: garbage in, garbage out. Having really high-quality data to train on and build those models with is also an extremely important aspect. There is also the lack of standardization: without standard tools, technologies, and pipelines to deploy models at scale, it becomes extremely difficult. And governance gaps are another important obstacle.

Looking at some of the metrics here, about 56% of production ML failures are directly linked to data quality issues. As I was mentioning, data quality is extremely important in production ML systems. Similarly, almost 38% are linked to insufficient monitoring, and 45% to lack of standardization. These are ballpark numbers that we have gathered, but they show how important it is to have these systems in place, so that instead of building toward these obstacles, we can build for success. That's what I wanted to cover on this slide.
The next important thing is the architectural patterns we can put in place to build a system that is really robust. The first thing I want to touch on here is modular component architecture: decomposable components. The core construct is that every component should be able to be iterated on its own. It's not only the model; the MLOps pipeline has multiple steps or components, and every component should be able to be iterated on and improved independently.

The other important thing is a centralized feature store. For any model development pipeline or team, features are the most important things: these are what gets baked into the model and used for predictions. So having a consistent, standardized feature store is extremely important, because it provides guaranteed consistency between the training and inference environments.
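To make that training/serving consistency point concrete, here is a minimal sketch in Python of the idea behind a centralized feature store. The class and feature names are hypothetical, not the speaker's system, and in practice an off-the-shelf feature store usually plays this role; the key property is that one feature definition backs both the offline training path and the online inference path.

```python
from dataclasses import dataclass
from typing import Callable
import math

import pandas as pd


@dataclass
class FeatureDefinition:
    name: str
    transform: Callable[[pd.DataFrame], pd.Series]  # one transform shared by both paths


class FeatureStore:
    """Toy centralized store: one feature definition, two access paths."""

    def __init__(self, definitions: list[FeatureDefinition]):
        self.definitions = definitions

    def historical_features(self, entity_df: pd.DataFrame) -> pd.DataFrame:
        """Offline path: build a training frame from raw entity rows."""
        out = entity_df.copy()
        for d in self.definitions:
            out[d.name] = d.transform(entity_df)
        return out

    def online_features(self, entity_row: dict) -> dict:
        """Online path: serve the same features for one inference request."""
        row_df = pd.DataFrame([entity_row])
        return {d.name: d.transform(row_df).iloc[0] for d in self.definitions}


store = FeatureStore([
    FeatureDefinition("spend_log", lambda df: df["spend_usd"].apply(lambda v: math.log(v + 1))),
])
train_df = store.historical_features(pd.DataFrame({"spend_usd": [10.0, 99.0]}))
serving_feats = store.online_features({"spend_usd": 42.0})
```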
Another important thing is reproducible training pipelines. We really want to deploy models at scale and release new models as frequently as possible, so we are giving the best to our customers. With that as the goal, having reproducible training pipelines means a lot of manual effort can be avoided, and having consistent, reproducible pipelines eventually helps accelerate development iteration.
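As one illustration of what "reproducible" can mean in practice, here is a minimal Python sketch; the file names and config fields are assumptions, not the speaker's pipeline. The idea is to pin the random seeds, fingerprint the exact data snapshot, and write a manifest so any run can be replayed later.

```python
import hashlib
import json
import random

import numpy as np


def data_fingerprint(path: str) -> str:
    """Content hash of the exact data snapshot used for this run."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def run_training(config: dict, data_path: str) -> dict:
    # Pin every source of randomness up front.
    random.seed(config["seed"])
    np.random.seed(config["seed"])

    # ... actual model training would happen here ...

    # Record everything needed to replay this run.
    manifest = {"config": config, "data_sha256": data_fingerprint(data_path)}
    with open("run_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest


# Example invocation (assumes the snapshot file exists):
# run_training({"seed": 42, "learning_rate": 1e-3, "epochs": 5}, "train.parquet")
```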
Another important thing is having automated validation baked in: having validations automated at every step of every phase is extremely important as we scale our systems.
So let's talk about how to build a robust MLOps pipeline. Having model lifecycle management in place is extremely important: it helps deliver almost 3x better quality monitoring and, at the same time, reduces production issues by almost half. So what are the various stages we usually look at when we think about the model lifecycle?

We have data engineering, which provides all the essential data, whether it's the data required for feature engineering or the other aspects of data. Once we have data, the next important part is model development: primarily ML engineers or applied scientists take this data and build robust pipelines where we train ML models to come up with a good model for the use case.

The next important thing is that once a model is developed, we have to validate whether it is working. Models are usually trained on a training data set, and then we have to validate how they perform on real, production data. Are we seeing the right kind of predictions? If it's not performing well, it can be various things: it could be model accuracy, or it could be performance in production, and so on. Having that validation step is extremely important.

Eventually, once the model clears all the validation aspects, it comes to deploying it in production and scaling it. And once we deploy it in production, the most important thing is monitoring. Now that we've deployed the model, how is it performing? What do the predictions look like for this particular model? For the users of our systems, is the sentiment positive, or not so much, or negative? Having those feedback signals fed back into our systems would help us improve our models tremendously in the future. So having end-to-end monitoring, and feedback systems that can be used for model development and deployment, is extremely important in building any MLOps pipeline.
The next aspect I want to touch upon is the technical implementation of a model validation framework. How do we validate, and what do we actually want to validate? As I was mentioning, we now have a model that has been developed, and while we are building these MLOps pipelines we need to think about what testing we need to build.

The first important thing is data: everything starts with the data in ML. How good our data is determines how good our models are: garbage in, garbage out. So it's extremely important to start with data validation. Whatever the use case, we need some kind of data set, and while we are creating this data set we need schema enforcement, as well as input distribution checks, drift detection, and so on. Making sure we have very clean data is the first step of any validation.
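A minimal sketch of what such a data validation step could look like in Python follows; the column names, dtypes, and null budget are assumptions for illustration, not the speaker's actual checks. A check like this would typically run before every training or scoring job.

```python
import pandas as pd

# Assumed schema and thresholds, purely for illustration.
EXPECTED_SCHEMA = {"user_id": "int64", "spend_usd": "float64", "country": "object"}
MAX_NULL_RATE = 0.05


def validate_batch(df: pd.DataFrame) -> list[str]:
    errors = []
    # Schema enforcement: every expected column present with the right dtype.
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"wrong dtype for {col}: {df[col].dtype} != {dtype}")
    # Basic sanity checks on values and completeness.
    if "spend_usd" in df.columns and (df["spend_usd"] < 0).any():
        errors.append("spend_usd contains negative values")
    if df.isna().mean().max() > MAX_NULL_RATE:
        errors.append("null rate above threshold")
    return errors


batch = pd.DataFrame({"user_id": [1, 2], "spend_usd": [12.5, 80.0], "country": ["US", "BR"]})
problems = validate_batch(batch)
if problems:
    raise ValueError(f"data validation failed: {problems}")
```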
The next thing to think about is once the model is deployed in production: how do we measure accuracy, how good is the production performance, and does it align with our key performance metrics? That is another aspect of validation.

Then there are the operational aspects, such as latency. Even if the accuracy is super good, if the model predictions are taking too long, customers may not like it. In this world, users like to see fast responses, so another important thing to validate is how quickly the model provides its predictions; providing timely predictions is extremely important. And it's not only latency metrics: you also look at resource utilization, throughput analysis, and all that.

The last thing I want to touch upon is ethical validation. Is there any bias coming out of this model? That is something we need to validate as well. Whether the responses are consistent and fair is also an extremely important metric we need to capture and test our models against.
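As one example of the kind of bias check this could mean, here is a minimal Python sketch that measures a demographic parity gap over scored predictions. The group column, toy data, and the 0.20 budget are assumptions for illustration; real fairness validation typically uses several metrics chosen for the use case.

```python
import pandas as pd


def demographic_parity_gap(preds: pd.DataFrame, group_col: str = "country") -> float:
    """Largest difference in positive-prediction rate between any two groups."""
    rates = preds.groupby(group_col)["prediction"].mean()
    return float(rates.max() - rates.min())


scored = pd.DataFrame({
    "country": ["US", "US", "BR", "BR", "IN", "IN"],  # toy group column
    "prediction": [1, 0, 1, 1, 0, 0],                 # binary model outputs
})
gap = demographic_parity_gap(scored)
print(f"demographic parity gap: {gap:.2f} (assumed budget: 0.20)")
```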
So what are the best practices here? Having clear acceptance criteria, with pass or fail thresholds for each validation step, is extremely important when we build these production systems. We also want these integrated into CI/CD pipelines: we don't want to make it manual, so we build the checks into consistent CI/CD pipelines that actually enforce them every time. And keeping a history to track the quality of the models, and how model deployment and validation delivery is going, is also a very good indicator of our maturity and how we are progressing as MLOps teams.
The next thing I want to touch on is the technical implementation of CI/CD. For any pipeline, continuous integration and continuous delivery are extremely important. For continuous integration, there are a few things: model quality validation during the integration step; versioning of model artifacts, so that when we build the artifacts we have versioning in place with the right metadata attached; and dependency and environment management baked into the same process as well.
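A minimal sketch of what versioning a model artifact during CI can look like in Python, assuming the job runs inside the repo checkout and the build step has produced a file named model.pkl (both assumptions): content-hash the artifact and record the code and dependency state it came from.

```python
import hashlib
import json
import subprocess
import sys


def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


artifact = "model.pkl"  # assumed artifact name produced by the build step
metadata = {
    "artifact_sha256": sha256_of(artifact),
    "git_commit": subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip(),
    "python": sys.version.split()[0],
    "dependencies": subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
    ).stdout.splitlines(),
}
with open("model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```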
The other important thing I want to touch upon is continuous delivery. When we are deploying these models at scale into a production environment, it's usually containerized. There are a lot of tools we typically use: it could be a model serving framework such as Triton Server, Docker containers, and various other things. One thing we need to keep in mind is how we build consistent containerization of the model serving framework. The other pieces are environment-specific configurations and infrastructure-as-code deployments.

Another important thing is canary or blue/green deployment. When we want to deploy a model into production, there are various deployment strategies we can look at. For example, with a canary deployment, in a production system we may expose some percentage of production traffic to the new version of the model and see how it is performing. Based on the metrics emitted by this new model, we can promote it to a higher percentage of production traffic and eventually take a hundred percent of the traffic. Blue/green deployment is creating a simultaneous green environment alongside the blue production environment, passing some of the traffic to the new model version, and, if the metrics look good, promoting the green deployment to be the production version. Those are the methodologies we primarily use with microservice deployment architectures.
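As a rough illustration of the canary idea in Python (the fractions, sample sizes, and the stubbed serving call are all assumptions; in practice the traffic split usually lives in the load balancer or service mesh rather than application code):

```python
import random
from collections import defaultdict

CANARY_FRACTION = 0.05   # assumed starting slice for the new version
MIN_SAMPLES = 1000       # assumed minimum traffic before deciding

metrics = defaultdict(lambda: {"requests": 0, "errors": 0, "latency_ms": []})


def call_model(version: str, request: dict) -> float:
    """Stub standing in for the real serving call; returns latency in ms."""
    return random.uniform(20, 40)


def route(request: dict) -> str:
    version = "candidate" if random.random() < CANARY_FRACTION else "stable"
    metrics[version]["requests"] += 1
    try:
        metrics[version]["latency_ms"].append(call_model(version, request))
    except Exception:
        metrics[version]["errors"] += 1
    return version


def should_promote() -> bool:
    cand, stable = metrics["candidate"], metrics["stable"]
    if cand["requests"] < MIN_SAMPLES:
        return False
    cand_err = cand["errors"] / cand["requests"]
    stable_err = stable["errors"] / max(stable["requests"], 1)
    return cand_err <= stable_err * 1.1  # allow at most a 10% relative regression
```

The same per-version metrics would also feed the monitoring discussed next.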
Once we build and deploy, the other important thing, as we've talked about, is monitoring: having real-time performance metrics, drift detection and alerts, and feature distribution monitoring. All of these should also be correlated with business metrics; that correlation is extremely important. Having consistent monitoring gives a better view of how the system is performing and what we can improve upon as well.
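One common way to implement the feature-distribution part of this is a two-sample statistical test between training data and recent serving data. Here is a minimal Python sketch using a Kolmogorov-Smirnov test; the feature name, alert threshold, and synthetic data are assumptions, and in practice this would run on a schedule over serving logs and feed an alerting system.

```python
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_ALERT = 0.01  # assumed alert threshold


def drift_report(training: dict[str, np.ndarray], serving: dict[str, np.ndarray]) -> dict:
    report = {}
    for feature, baseline in training.items():
        stat, p_value = ks_2samp(baseline, serving[feature])
        report[feature] = {
            "ks_stat": float(stat),
            "p_value": float(p_value),
            "drifted": p_value < P_VALUE_ALERT,
        }
    return report


rng = np.random.default_rng(0)
baseline = {"spend_usd": rng.normal(50, 10, 5000)}
current = {"spend_usd": rng.normal(58, 10, 5000)}  # deliberately shifted
print(drift_report(baseline, current))
```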
The other thing I want to touch on here is A/B testing of models. For ML models, this is a very well-known framework used by ML teams. Enterprises with automated A/B testing achieve almost 40% faster model iteration cycles and 25% higher performance improvements compared to manual testing approaches.

So what is A/B testing? When we have two different versions of a model, we expose customers to both versions and look at user sentiment to see which is performing better, whether it's a recommendation system or any other use case we're trying to build, and then use those signals to promote one of them. That's what we usually call A/B testing.

There are a few aspects to it. First, traffic allocation: dynamically routing traffic to model variants, which can be configurable. Similarly, performance measurement: accurate tracking across both business and technical metrics. As I was saying, latency and those other things are the technical metrics, and the business side is things like what the sentiment looks like, whether the experience is positive or negative, and the other user-facing aspects. The other thing is statistical analysis, which is mostly about data-driven deployment decisions: having rigorous analysis is extremely useful in these kinds of systems.
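To make the statistical-analysis step concrete, here is a minimal Python sketch of a two-proportion z-test on a success metric such as click-through rate; the counts and the 0.05 significance level are illustrative assumptions, and real experimentation platforms add guardrails like sequential testing and minimum sample sizes.

```python
import math
from scipy.stats import norm


def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Two-sided z-test comparing success rates of variants A and B."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value


# Variant A = current model, variant B = candidate (counts are made up).
z, p = two_proportion_z(successes_a=1180, n_a=20000, successes_b=1285, n_b=20000)
print(f"z={z:.2f}, p={p:.4f}, promote B: {p < 0.05 and z > 0}")
```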
The other aspect I want to talk about: we've covered the challenges, what we can put in place, and what's important in terms of testing and the testing frameworks like A/B testing. Now, organizationally, how do we build this in such a way that it is set up for success? In any ML organization there are primarily two teams, the data science team and the ML engineering team, and the collaboration between both of them is what leads to MLOps excellence. Essentially, these are people who have mixed knowledge of both the data science side and the engineering side, and that core collaboration is what helps build the typical aspects of an MLOps pipeline.

A few things I want to mention here. Cross-functional MLOps teams are extremely important for working across teams to build consistent, end-to-end ML products. There should be a shared accountability model with joint ownership across ML performance, operational health, and business outcomes: it's not one team's or one person's responsibility, it's a joint responsibility. The other thing is an MLOps Center of Excellence: a central team that owns best practices and governance frameworks, and that helps other teams do self-service MLOps.
Then I want to touch on a few aspects of cost optimization as well. All this GPU hardware is expensive, so when building any ML pipeline there are a few things to keep in mind so that we are building in a much more efficient way. The key cost drivers: compute cost, as we all know, since GPUs are extremely expensive, we definitely need to consider that; data storage and processing, meaning the data pipelines and feature stores and how we generate this data, which is the next place cost comes in; tools and platforms, including MLOps pipelines, monitoring systems, and the specialized tools that help us build this monitoring at scale; and operational overhead, such as system maintenance, support incidents, and the rest.

A few things in terms of optimization strategies. Resource autoscaling: a lot of cloud providers provide this out of the box; we should be able to autoscale the infrastructure to dynamically match our workload demands, so if requests grow, we have the automatic knobs in place to scale up. Model efficiency: there are ML techniques we commonly use, like model pruning, quantization, and distillation, to reduce the footprint and size of the models while still getting similar, or close to similar, performance.
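As one concrete instance of the model-efficiency techniques named here, this is a minimal sketch of post-training dynamic quantization in PyTorch. The toy model is an assumption, and the actual size and latency gains depend heavily on the architecture, so any quantized model should go back through the same validation gates.

```python
import torch
import torch.nn as nn

# Stand-in for a trained model; a real model would be loaded from a checkpoint.
model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)

# Quantize the Linear layers' weights to int8 for a smaller footprint and
# faster CPU inference; accuracy impact should be validated before rollout.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
print(quantized(x))
```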
The other thing is process automation: having complete end-to-end automation is an extremely important part of these strategies as well. Enterprises that actually implement these strategies achieve significant cost reductions and, most importantly, visibility. Knowing where the spend is going is extremely important while we're building these systems, so that we know where to increase our spending and how to optimize; that's an extremely important decision we can look at.
Since we've covered a lot of the other aspects, now I want to touch upon the future. This is an evolving space which is growing rapidly, so how do we build systems with the future in mind, making sure we are building for the future? A few things I want to touch upon. MLOps observability: gaining insights into model behavior, as I was saying, with complete monitoring and end-to-end metrics, is extremely important; that's one thing to keep in mind. AI governance: implementing frameworks for responsible AI deployment is extremely important nowadays, along with auditing, bias detection, and compliance controls. Federated learning: this is about models being trained across decentralized data sources, which is an extremely important aspect and obviously improves privacy and security. And AutoML: autonomous, continuous model improvement, looking at feature selection, taking the feedback, and hyperparameter tuning; it's a complete cycle of using that end-to-end loop to actually improve our models at scale, and something we can keep in mind as well.
I want to briefly talk about the key takeaways at a high level. Start with clear governance: establish robust governance and model lifecycle management practices, and make sure we have a solid foundation for the ML infrastructure. Build modular architectures: make sure these are decomposable units that we can build, deploy, and scale independently; those are extremely important as well. Automate ruthlessly: automation is very important as we scale our infrastructure and systems; for any manual process, look for an opportunity to automate. And integrate teams: if the teams are in silos, it's extremely important that they work together, with consistent expectations and collaboration with each other; that matters a lot in these kinds of systems.

Looking at some of the statistics, we can see that enterprises with mature MLOps deploy almost 5x faster and achieve 60% higher model performance in production. Those are extremely important statistics. Having these tools and processes, and building these consistent pipelines, can definitely improve ML systems at scale.

I think that's everything I wanted to cover for this talk. Thank you very much for tuning in. Have a nice day.