Transcript
Good morning guys.
Welcome to my session.
The aim for today is to explore the synergies between the cloud
native world and the AI world.
So let's start.
My name is Graciano.
I am a DevRel engineer at Mia Platform, an Italian company that develops a
platform builder which, among other things, simplifies the adoption of the
technology we will be discussing today.
This talk actually comes from a conversation I had a few months ago with
a close friend of mine, Leonardo, who works as a data scientist and deals with
models and data every day. One evening, after a slightly too alcoholic aperitivo,
what do you think two engineers will talk about?
Work, of course.
So Leonardo began explaining all the problems he faces in his daily work.
He started with this sentence: when everything seems to be going smoothly,
something always goes wrong. And then he added that it takes longer to put the
model into production than to develop it. As I found out later, he's
not the only one with this problem,
as you can see from this screenshot of a tweet.
So guys, I am not an expert in this kind of stuff, but I'm extremely curious,
so I asked him to explain his workflow.
He immediately started talking about feature management and inference,
and I'm like, inference what?
Give me the beginner version, please.
So he simplified as much as possible, explaining that the workflow
follows four steps.
As you can see on the slide: data preparation, which is the preliminary
phase where the datasets are prepared; model training and tuning,
where the model is actually developed; then deployment and monitoring, which is
when the model is put into production and its behavior is observed;
and finally, the process is repeated to improve the model.
As I soon learned, none of these steps was free from issues.
The first problem Leo pointed out is dealing with data.
Managing large volumes of data without clear
governance is a huge challenge: tracing the history or lineage of the
datasets, identifying the data owners, and understanding data sensitivity
is all a nightmare for him.
Then there is the gap between the local development and the production release
of the model, which is actually the biggest trouble for Leonardo.
He explains that when he does his magic on his local Jupyter notebook, everything
works fine and runs perfectly, but then, when it's taken over by the delivery team,
it enters a whole new world.
The last one is the observability nightmare, once they survive
the production release.
Monitoring the model behavior to spot issues as early as possible becomes a
challenging task, and they often don't have a sense of full control over
what's happening, risking being too late in resolving critical issues.
The conversation ended with Leonardo feeling frustrated and me heading home,
still thinking about what Leonardo said.
And the question that arose in my mind was: how can I apply my
expertise to help a friend of mine?
In my mind I had something like a versus meme.
On one side, me, frustrated thinking about the models and excited
thinking about the infrastructure stuff.
And on the other side, the data scientists, excited about their models and frustrated
thinking about deployment activities.
However, this situation wasn't new to me.
This is exactly how operations guys and developers felt years ago, before the
introduction of the DevOps culture.
So here is my intuition: can I apply my knowledge about application development, DevOps, and
the cloud native world to solve issues that are actually similar to those we
faced in our context some years ago?
Spoiler: the answer is yes, of course,
but let's see together how.
First, we need tools, and fortunately the Cloud Native Computing Foundation,
and specifically its AI working group, did an excellent job mapping all the
tools in the cloud native landscape that could help the ML world, creating
the Cloud Native AI landscape.
But adding tools alone into the delivery chain
risks just adding a lot of complexity, actually increasing the issues.
So we also need a methodology, a framework to govern all these tools and make
their adoption simple and seamless.
This methodology is platform engineering: thanks to a
platform, we provide a layer between engineers and their dependencies,
helping them consume what they need in an easy and self-service way.
The first problem pointed out by Leonardo is dealing with data, and that's
actually not a problem related to tools; it's a problem of culture.
The lack of data governance is a clear reflection of the lack of a proper
data culture inside the company.
I suggest starting to cultivate the data culture inside the
organization, adding some tools that can help during the process.
The key point here is to understand that AI is only as effective
as the data used to train it.
A model fed with poor quality or inaccurate data will produce poor results.
A tool that simplifies all the activities involved in dealing with data, like discovering,
profiling, versioning, and tracing, is essential to ensure high quality and
consistent data is being used and produced.
The tool we are talking about is a data catalog, which is essentially
a hierarchical archive of all the metadata of your data sources.
A data catalog contains all the technical metadata, like data types,
storage location, and lineage, which explains how the datasets are
connected to each other, and also the business metadata, like ownership or
the compliance level with specific regulations.
Since with a data catalog all the information about our datasets is
centralized, it becomes easily searchable and well organized.
It's the Swiss Army knife for everyone who needs to deal
with data inside the organization.
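To make this a bit more concrete, here is a tiny sketch in Python of what a single catalog entry could look like; the class and field names are purely illustrative and not taken from any specific data catalog product.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a data catalog entry: technical metadata
# (schema, storage, lineage) plus business metadata (owner, sensitivity).
@dataclass
class CatalogEntry:
    name: str                                    # dataset name
    storage_location: str                        # where the data physically lives
    schema: dict                                 # column name -> data type
    lineage: list = field(default_factory=list)  # upstream datasets this one derives from
    owner: str = "unknown"                       # business owner of the dataset
    sensitivity: str = "internal"                # e.g. public / internal / PII

# Example: a feature dataset derived from two upstream sources.
orders_features = CatalogEntry(
    name="orders_features_v3",
    storage_location="s3://datalake/features/orders/v3/",
    schema={"order_id": "string", "amount": "float", "churned": "bool"},
    lineage=["raw_orders", "customer_profiles"],
    owner="data-team@example.com",
    sensitivity="PII",
)

# With entries centralized like this, discovery becomes a simple search.
catalog = [orders_features]
print([entry.name for entry in catalog if "orders" in entry.name])
```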
Now it's time to deal with the main issue pointed out by Leonardo.
We have to understand how to bridge the gap between the local development
and the production release of the model.
Here, many tools from the cloud native AI landscape can help us.
One of them is Kubeflow.
The problem here is due to the fact that, on one side, we have the
data scientists, who work in their own environment, like a Jupyter notebook
or something else.
And on the other side we have the ops guys, who have to deal with a
completely different environment.
And this opposition of ecosystems is the core of the problem.
So, to solve this challenge, we need to bridge the gap
between these two ecosystems.
This bridge is Kubeflow, a collection of open source projects designed to support
each stage of the machine learning lifecycle, bringing ML onto Kubernetes
in a simple, portable, and scalable way.
In this way, we bring the world of the data scientists inside something
that delivery teams know well: containers and Kubernetes.
The first project I want to discuss is Kubeflow Notebooks.
The core problem underlined by Leonardo is the gap between developing the models
locally during the experimentation phase and then moving them into production.
This is because development happens in a completely different environment
from production, and reproducing all the development conditions can
be a huge challenge.
Kubeflow Notebooks helps solve this issue by providing a way to run
web-based development environments inside the Kubernetes cluster.
The notebook servers actually run as containers inside Kubernetes
pods, so users can create their notebooks directly inside
the cluster rather than locally.
And admins can provide standard notebook images for their organization,
with the required packages installed, for example. Working in this way, data
scientists develop the model in a standardized way that the ops engineers
can then understand and replicate in the production environment, since both
phases actually happen in the same environment, which is Kubernetes itself.
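As an illustration, this is roughly what creating such a notebook could look like with the official Kubernetes Python client, assuming a cluster where the Kubeflow Notebook CRD is installed; the namespace and image are placeholders to adapt to your own setup.

```python
# Minimal sketch: create a Kubeflow Notebook custom resource in the cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

notebook = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "Notebook",
    "metadata": {"name": "leonardo-notebook", "namespace": "kubeflow-user"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    # an admin-approved, standardized image with the required packages
                    "name": "leonardo-notebook",
                    "image": "kubeflownotebookswg/jupyter-scipy:v1.8.0",
                }]
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="kubeflow-user",
    plural="notebooks",
    body=notebook,
)
```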
Another major challenge in this context is managing distributed training, and
Kubeflow helps simplify these activities by providing the Kubeflow
Training Operator, which is a Kubernetes operator for fine-tuning and scalable
distributed training of ML models.
It works by implementing a
central Kubernetes controller that coordinates distributed
training jobs, also supporting high-performance computing tasks.
The Training Operator allows scaling the model training from a single machine
to a large distributed Kubernetes cluster by using its APIs and interfaces.
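To give an idea of how that scaling works in practice, here is a sketch of submitting a PyTorchJob to the Training Operator, again with the Kubernetes Python client; the image, namespace, and replica counts are illustrative, and the operator must already be installed in the cluster.

```python
# Minimal sketch: scale training out by submitting a PyTorchJob custom resource.
from kubernetes import client, config

config.load_kube_config()

training_container = {
    "name": "pytorch",  # the Training Operator expects this container name
    "image": "registry.example.com/churn-training:latest",  # placeholder training image
}

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "churn-model-training", "namespace": "kubeflow-user"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "template": {"spec": {"containers": [training_container]}},
            },
            "Worker": {
                "replicas": 3,  # scaling out is mostly a matter of raising this number
                "template": {"spec": {"containers": [training_container]}},
            },
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow-user",
    plural="pytorchjobs", body=pytorch_job,
)
```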
Another important use case to cover is the one in which we just want to create
an AI-powered product, or to add AI-powered features to our existing product, without
developing the model, but using a third-party model like GPT, Mistral, or Llama.
In this situation, our application has to handle various aspects, including requests,
authentication and authorization for accessing the model,
prompt management, caching, and spending checks with the AI provider.
This approach not only adds complexity to our application with tasks outside its
core function, but also makes switching between models challenging, and it couples
the application to a specific model.
The solution to this problem is to use an AI gateway,
similar to a common API gateway.
An AI gateway is a middleware between the application and the
model, managing their interaction.
By introducing an AI gateway, we can centralize and delegate
tasks like model authentication and authorization, prompt management,
caching, and spending checks.
This way, these responsibilities are handled by the AI gateway,
not within the application itself,
reducing the coupling between the application and the model.
This setup also makes it easy to switch to a different model or even use
multiple models based on specific needs.
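From the application's point of view, the result could look like the following sketch: the endpoint, payload shape, and header are hypothetical, since every gateway exposes its own API, but the key idea is that the application only talks to the gateway.

```python
# Minimal sketch: the application calls the AI gateway, not the model provider.
import requests

AI_GATEWAY_URL = "https://ai-gateway.internal.example.com/v1/chat"  # hypothetical endpoint

def ask_model(prompt: str, use_case: str) -> str:
    response = requests.post(
        AI_GATEWAY_URL,
        json={
            "use_case": use_case,  # the gateway maps the use case to a model and prompt template
            "prompt": prompt,
        },
        headers={"Authorization": "Bearer <app-token>"},  # the app authenticates to the gateway only
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["answer"]

# Switching provider or model becomes a gateway configuration change,
# not an application change.
print(ask_model("Summarize this ticket for the support team", use_case="ticket-summary"))
```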
The last issue involves model monitoring.
My suggestion here is to use some of the best practices already used
in the cloud native world, with a few adjustments to adapt them to the
specific needs of model monitoring.
Monitoring a model is essential to identify and solve
issues before it's too late.
But identifying what to monitor is the key.
The most immediate metrics to collect are the operational
metrics, like CPU or RAM usage,
which indicate if the system is working properly. But
that is not enough in the space of model monitoring: we also need to take a
look at the inputs the model receives, to be sure it receives the right data
in the right form to perform predictions efficiently, and at the outputs of the
model, to evaluate its performance.
All this attention is needed to spot any drift that
could happen in our system.
In this context, drift refers to any change that could impact
the model accuracy, and it can essentially be of three types: data drift,
prediction drift,
and concept drift.
In simple terms, data drift occurs when the statistical properties of the input data
change, leading to decreasing accuracy even if the model itself doesn't change.
Prediction drift instead refers to changes in the predictions, even if the data
doesn't change, due to the model itself
or to external factors. Then concept drift refers to changes in the
relationship between the input data and the target variable, affecting the
patterns that the model has learned.
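As a small, concrete example of spotting data drift, here is a sketch that compares the distribution of one numeric feature at training time with the one observed in production, using a two-sample Kolmogorov-Smirnov test from SciPy; the threshold is an arbitrary, illustrative choice.

```python
# Minimal sketch: detect data drift on a single numeric feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # feature values seen at training time
production = rng.normal(loc=0.4, scale=1.0, size=5_000)  # same feature observed in production

statistic, p_value = ks_2samp(reference, production)

if p_value < 0.01:  # illustrative threshold, tune per feature and use case
    print(f"Data drift suspected (KS statistic={statistic:.3f}, p={p_value:.2e})")
else:
    print("No significant drift detected on this feature")
```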
To keep track of all these aspects, we can use two open source software:
Prometheus, which is a monitoring toolkit to collect real-time
metrics in a time-series database,
and Grafana, which offers extensive options for dashboard creation
and data visualization,
complementing Prometheus and allowing us to visualize the collected metrics.
Prometheus and Grafana together offer many benefits in the
realm of monitoring ML systems.
In fact, they allow us to collect a wide range of metrics, from system
performance to model-specific ones, in a customizable, real-time way,
also offering the possibility to set up alerts on specific events or thresholds.
Both are designed to handle large-scale data,
providing a highly flexible system for model monitoring, even when the
complexity grows and the needs change.
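For instance, an inference service could expose both kinds of metrics with the official prometheus_client Python library, roughly like this; the metric names and the drift value are placeholders for whatever your model actually computes.

```python
# Minimal sketch: expose operational and model-specific metrics for Prometheus to scrape.
import random
import time
from prometheus_client import Gauge, Histogram, start_http_server

PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds", "Time spent producing a prediction"
)
DATA_DRIFT_SCORE = Gauge(
    "model_input_drift_score", "Latest drift score computed on input features"
)

def predict(features):
    with PREDICTION_LATENCY.time():             # records how long inference takes
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real model inference
        return 0.5

if __name__ == "__main__":
    start_http_server(8000)                     # metrics exposed at :8000/metrics
    while True:
        predict({"amount": random.random()})
        DATA_DRIFT_SCORE.set(random.random())   # stand-in for a periodic drift check
        time.sleep(1)
```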
To wrap up, we discussed many tools, but just adding tools to the chain only
increases the complexity of the system.
We don't need just tools; we need a way to adopt them with a self-service
approach and with the least friction possible.
This way is through internal developer platforms and platform engineering:
by building an internal developer platform that leverages these
tools, we can standardize and simplify their adoption while creating golden
paths for common use cases that can be consumed through a catalog
of resources in a self-service way.
Just for reference, if Leonardo needs a development environment
to run his experiments, he just needs to access the catalog and choose
the one with the configuration and libraries that best fit his needs.
The same happens when Leonardo needs a cluster for distributed
training, or if delivery teams need to set up the monitoring dashboard
for a new service, for example.
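Purely as an illustration of that self-service flow, the request could be as simple as the sketch below; the catalog endpoint, item names, and parameters are hypothetical, since every internal developer platform exposes its own interface.

```python
# Hypothetical sketch: requesting a ready-to-use notebook environment from a platform catalog.
import requests

PLATFORM_CATALOG_URL = "https://platform.internal.example.com/api/catalog/requests"  # placeholder

order = {
    "item": "jupyter-notebook-environment",  # golden-path item published by the platform team
    "parameters": {
        "image_profile": "scipy-standard",   # admin-curated image with approved libraries
        "cpu": "2",
        "memory": "8Gi",
    },
    "requested_by": "leonardo",
}

response = requests.post(PLATFORM_CATALOG_URL, json=order, timeout=30)
response.raise_for_status()
print("Environment being provisioned:", response.json().get("status"))
```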
Analyzing all these factors, a question arose in my mind: could the cloud native world and the
application development world benefit from the recent explosion of AI?
And if yes, how? I deal with platforms every day,
so, trying to find an answer to this question, I started thinking about them.
Platforms are a heterogeneous set of elements made up of metadata,
repositories, pipelines, documentation, and many other assets in various formats.
And using a platform today actually means managing all these assets and
being able to learn and navigate them.
So what if, instead of having to work against these assets, we made them
work together, allowing us to access them through a unified interface?
This is where conversational DevX comes into play. With
conversational DevX I refer to a way to elevate the
use of platforms and thus enhance the developer experience of those who use them.
It involves the use of a RAG system applied to the platform assets,
allowing us to leverage the capabilities of GenAI to analyze and query the
platform assets using natural language.
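To make the idea tangible, here is a deliberately tiny sketch of the retrieval step of such a system; real platform assets, embeddings, and the LLM call are replaced by toy data, naive keyword matching, and a printed prompt, so the snippet stays self-contained.

```python
# Toy sketch of retrieval-augmented querying over platform assets.
platform_assets = {
    "orders_features_v3 (data catalog)": "Owner: data team. Lineage: raw_orders, customer_profiles. Contains PII.",
    "churn-model-training (pipeline)": "PyTorchJob, 3 workers, runs nightly, pushes the model to the registry.",
    "monitoring runbook": "Grafana dashboard 'ML overview'; alert fires when drift score > 0.3.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k assets whose description shares the most words with the question."""
    words = set(question.lower().split())
    scored = sorted(
        platform_assets.items(),
        key=lambda item: len(words & set(item[1].lower().split())),
        reverse=True,
    )
    return [f"{name}: {text}" for name, text in scored[:k]]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    # In the real system this prompt would be sent to an LLM; here we only show it.
    return f"[prompt for the LLM]\nContext:\n{context}\n\nQuestion: {question}"

print(answer("Which pipeline trains the churn model and where does its data come from?"))
```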
That's not just a vision.
It's already available in an early-stage version, and I invite you to try it on
your own assets by scanning the QR code in the last slide to access the repo.
Imagine a world where you have a virtual assistant that
speaks like you, with the added benefit of perfectly understanding
the context of your platform, because it actually is your platform.
Now, imagine what you could do with this assistant that knows all the models, their
lineage, and the pipelines around them.
It knows the tools you can use, the best practices for using them,
and even how others have used them.
Troubleshooting production issues or conducting a proof of concept in such
a context is on a whole new level.
The presentation is almost over, so before saying goodbye,
let me recap the key points from the session. First,
applying old solutions to new problems can be a smart way to solve them
without reinventing the wheel.
Second, platforms are highly customizable tools that can meet
specific needs and can streamline the work, even in the realm of MLOps.
Finally, I don't know if AI will steal our jobs in the future, but right now it's a powerful
tool to enhance the developer experience.
Thank you guys.
Here is a feedback form where you can leave your feedback.
See you.