Conf42 Large Language Models (LLMs) 2025 - Online

- premiere 5PM GMT

Challenges and takeaways of managing AI workloads on cloud environments


Abstract

Cloud Native and AI integration offers innovation but exposes ecosystem gaps. The Cloud Native Artificial Intelligence paradigm optimizes AI workloads using K8s, serverless and microservices. This talk explores CNAI’s potential, addressing challenges and unlocking future AI-driven opportunities.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Good morning guys, and welcome to my session. The aim for today is to explore the synergies between the cloud native world and the AI world. So let's start. My name is Graziano, I am a DevRel engineer at Mia-Platform, an Italian company that develops, among other things, a platform builder that simplifies the adoption of the technology we will be discussing today.

This talk actually comes from a conversation I had a few months ago with a close friend of mine, Leonardo, who works as a data scientist and deals with models and data every day. It happened one evening, over a slightly too alcoholic aperitivo, because what do you think two engineers talk about outside of work? Work, of course. So Leonardo began explaining all the problems he faces in his daily work. It started with this sentence: "when everything seems to be going smoothly, something always goes wrong". And then he added: "it takes longer to put the model into production than to develop it". As I found out later, he's not the only one with this problem, as you can see from this screenshot of a tweet.

Now, guys, I am not an expert in this kind of stuff, but I'm extremely curious, so I asked him to explain his workflow. He immediately started talking about feature management and inference, and I'm like: inference what? Give me the beginner version, please. So he simplified as much as possible, explaining that the workflow follows four steps, as you can see on the slide: data preparation, the preliminary phase where the datasets are prepared; model training and tuning, where the model is actually developed; deployment, when the model is put into production; and monitoring, when its behavior is observed. Finally, the process is repeated to improve the model. As I soon learned, none of these steps is free from issues.

The first problem Leo pointed out is dealing with data. Managing large volumes of data without clear governance is a huge challenge: tracing the history, or lineage, of a dataset, identifying the data owners and understanding data sensitivity is all a nightmare for him. Then there is the gap between the local development and the production release of the model, which is actually the biggest trouble for Leonardo. He explained that when he does his magic on his local Jupyter notebook everything works fine and runs perfectly, but when it's taken over by the delivery team it enters a whole new world. The last one is the observability nightmare, for when they survive the production release: monitoring the model's behavior to spot issues as early as possible becomes a challenging task, and they often don't have a sense of full control over what's happening, risking being too late in resolving critical issues.

The conversation ended with Leonardo feeling frustrated and me heading home still thinking about what he had said. The question that arose in my mind was: how can I apply my expertise to help a friend of mine? In my mind I pictured something like a meme, with the ops engineers on one side, frustrated when thinking about the models and excited when thinking about the infrastructure stuff, and the data scientists on the other side, excited about their models and frustrated when thinking about deployment activities. However, this situation wasn't new to me: this is exactly how operations guys and developers felt years ago, before the introduction of the DevOps culture. So here is my intuition.
Can I apply my knowledge of application development, DevOps and the cloud native world to solve issues that are actually similar to those we faced in our own context some years ago? Spoiler: the answer is yes, of course, but let's see together how.

First, we need tools, and fortunately the Cloud Native Computing Foundation, and specifically its AI working group, did an excellent job mapping all the tools in the cloud native landscape that could help the ML world, creating the Cloud Native AI landscape. But adding tools alone into the delivery chain risks just adding a lot of complexity, actually increasing the issues. So we also need a methodology, a framework to govern all these tools and make their adoption simple and seamless. This methodology is platform engineering: the platform provides a place between engineers and their dependencies, helping them consume what they need in an easy, self-service way.

The first problem pointed out by Leonardo is dealing with data, and that's actually not a problem of tools but a problem of culture. The lack of data governance is a clear reflection of the lack of a proper data culture inside the company. My suggestion is to start cultivating a data culture inside the organization, adding some tools that can help along the way. The key point here is to understand that AI is only as effective as the data used to train it: a model fed with poor quality or inaccurate data will produce poor results.

A tool that simplifies all the activities around data, like discovering, profiling, versioning and tracing, is essential to ensure that high quality, consistent data is being used and produced. The tool we are talking about is a data catalog, which is essentially a hierarchical archive of all the metadata of your data sources. A data catalog contains the technical metadata, like data types, storage location and lineage, which explains how datasets are connected to each other, as well as the business metadata, like ownership or the level of compliance with specific regulations. With a data catalog, all the information about our datasets is centralized, easily searchable and well organized. It's the Swiss Army knife for everyone who needs to deal with data inside the organization.
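To make this more concrete, here is a minimal sketch, in Python, of the kind of record a data catalog could hold for a single dataset, combining technical and business metadata. The field names, dataset and paths are hypothetical, not taken from the talk or from any specific catalog product:

```python
from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class CatalogEntry:
    """One hypothetical data catalog record for a dataset."""
    name: str                        # logical dataset name
    storage_location: str            # where the data physically lives
    schema: Dict[str, str]           # column name -> data type (technical metadata)
    lineage: List[str] = field(default_factory=list)      # upstream datasets it is derived from
    owner: str = ""                  # business metadata: who is accountable for the data
    sensitivity: str = "internal"    # e.g. public / internal / PII
    compliance: List[str] = field(default_factory=list)   # regulations it must satisfy, e.g. GDPR

# Example entry a team like Leonardo's might register before training a model
churn_features = CatalogEntry(
    name="customer_churn_features",
    storage_location="s3://datalake/churn/features/v3/",
    schema={"customer_id": "string", "tenure_months": "int", "churned": "bool"},
    lineage=["crm.customers", "billing.invoices"],
    owner="data-platform-team",
    sensitivity="PII",
    compliance=["GDPR"],
)
```

A real data catalog adds search, versioning and profiling on top of records like this; the point is simply that technical and business metadata end up in one searchable place.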
Now it's time to deal with the main issue pointed out by Leonardo: bridging the gap between the local development and the production release of the model. Here, many tools from the Cloud Native AI landscape could help us, and one of them is Kubeflow. The problem comes from the fact that on one side we have the data scientists, who work in their own environment, like a Jupyter notebook or something else, while on the other side we have the ops guys, who deal with a completely different environment. This opposition of ecosystems is the core of the problem, so to solve the challenge we need a bridge between the two. This bridge is Kubeflow, a collection of open source projects designed to support each stage of the machine learning lifecycle, bringing ML onto Kubernetes in a simple, portable and scalable way. In this way, we bring the world of the data scientists inside something that delivery teams know well: containers and Kubernetes.

The first project I want to discuss is Kubeflow Notebooks. The core problem underlined by Leonardo is the gap between developing models locally during the experimentation phase and then moving them into production: development happens in a completely different environment from production, and reproducing all the development conditions can be a huge challenge. Kubeflow Notebooks helps solve this issue by providing a way to run web-based development environments inside the Kubernetes cluster. The notebook servers actually run as containers inside Kubernetes pods, so users can create their notebooks directly inside the cluster rather than locally, and admins can provide standard notebook images for their organization, with the required packages preinstalled, for example. Working this way, data scientists develop the model in a standardized setup that the ops engineers can then understand and replicate in the production environment, since both phases actually happen in the same environment, which is Kubernetes itself.

Another major challenge in this context is managing distributed training, and Kubeflow helps simplify this activity with the Kubeflow Training Operator, a Kubernetes operator for fine-tuning and scalable distributed training of ML models. It works by implementing a central Kubernetes controller that coordinates distributed training jobs, also supporting high performance computing tasks. The Training Operator allows you to scale model training from a single machine to a large distributed Kubernetes cluster by using its APIs and interfaces.

Another important use case to cover is when we just want to create an AI-powered product, or add AI-powered features to an existing product, without developing the model ourselves but using a third-party model like GPT, Mistral or Llama. In this situation, our application has to handle various aspects, including authentication and authorization of requests towards the model, prompt management, caching, and spending checks with the AI provider. This approach not only adds complexity to our application with tasks outside its core function, but also makes switching between models challenging and couples the application to a specific model. The solution to this problem is to use an AI gateway, similar to a common API gateway. An AI gateway is a middleware between the application and the model, managing their interaction. By introducing an AI gateway, we can centralize and offload tasks like model authentication, authorization, prompt management, caching and spending checks. This way, these responsibilities are handled by the AI gateway, not within the application itself, reducing the coupling between the application and the model. This setup also makes it easy to switch to a different model, or even to use multiple models based on specific needs.
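To make the Kubeflow Notebooks idea a bit more concrete, here is a hedged sketch of creating a notebook server inside the cluster instead of running Jupyter locally. It assumes the Kubeflow Notebooks controller is installed and that its Notebook custom resource is served at `kubeflow.org/v1`; the names, namespace and image are placeholders you would replace with your own:

```python
# Sketch: create a Kubeflow Notebook server as a custom resource in the cluster.
# Verify the CRD group/version against your Kubeflow installation.
from kubernetes import client, config

config.load_kube_config()

notebook = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "Notebook",
    "metadata": {"name": "leonardo-notebook", "namespace": "data-science"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "leonardo-notebook",
                    # a standard, admin-provided image with the approved packages
                    "image": "registry.example.com/notebooks/jupyter-scipy:stable",
                    "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
                }]
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="data-science", plural="notebooks",
    body=notebook,
)
```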
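Similarly, a distributed training job for the Training Operator could be sketched as below, assuming the PyTorchJob resource at `kubeflow.org/v1`; the training image and command are hypothetical:

```python
# Sketch: scale training from one machine to a master plus two workers via the
# Kubeflow Training Operator's PyTorchJob resource (check the exact schema for
# your Training Operator version).
from kubernetes import client, config

config.load_kube_config()

def replica(count: int) -> dict:
    """Pod template shared by master and workers; the container is named 'pytorch'."""
    return {
        "replicas": count,
        "restartPolicy": "OnFailure",
        "template": {"spec": {"containers": [{
            "name": "pytorch",
            "image": "registry.example.com/ml/churn-train:1.0",   # placeholder image
            "command": ["python", "train.py", "--epochs", "10"],  # placeholder entrypoint
        }]}},
    }

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "churn-distributed-training", "namespace": "data-science"},
    "spec": {"pytorchReplicaSpecs": {"Master": replica(1), "Worker": replica(2)}},
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="data-science", plural="pytorchjobs",
    body=pytorch_job,
)
```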
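And here is a deliberately small sketch of the AI gateway idea itself: a piece of middleware that owns the provider credentials, caching and spend tracking, so the application only ever talks to the gateway. Everything here (routes, provider URLs, budget) is hypothetical and not a specific product's API:

```python
# Sketch of an AI gateway: one place for provider auth, caching and spending checks.
# Real gateways add prompt management, rate limiting, observability and routing.
import os
import hashlib

import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()

PROVIDERS = {  # switching models means changing config here, not the application
    "gpt": {"url": "https://api.openai.com/v1/chat/completions", "key": os.environ.get("OPENAI_KEY", "")},
    "mistral": {"url": "https://api.mistral.ai/v1/chat/completions", "key": os.environ.get("MISTRAL_KEY", "")},
}
CACHE: dict = {}                           # naive in-memory response cache
SPEND = {"requests": 0, "budget": 10_000}  # naive spending/usage check

@app.post("/v1/chat/{provider}")
async def chat(provider: str, payload: dict):
    if provider not in PROVIDERS:
        raise HTTPException(404, "unknown provider")
    if SPEND["requests"] >= SPEND["budget"]:
        raise HTTPException(429, "budget exhausted")

    cache_key = hashlib.sha256(f"{provider}:{payload}".encode()).hexdigest()
    if cache_key in CACHE:                 # serve repeated prompts from cache
        return CACHE[cache_key]

    creds = PROVIDERS[provider]
    async with httpx.AsyncClient(timeout=60) as http:
        resp = await http.post(
            creds["url"],
            headers={"Authorization": f"Bearer {creds['key']}"},
            json=payload,
        )
    SPEND["requests"] += 1
    CACHE[cache_key] = resp.json()
    return CACHE[cache_key]
```

The application simply calls the gateway's endpoint; swapping GPT for Mistral, or adding a second provider, never touches application code.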
The last issue involves model monitoring. My suggestion here is to use some of the best practices already established in the cloud native world, with a few adjustments to adapt them to the specific needs of model monitoring. Monitoring a model is essential to identify and solve issues before it's too late, but identifying what to monitor is the key. The most immediate metrics to collect are the operational ones, like CPU or RAM usage, which indicate whether the system is working properly, but that is not enough. In the space of model monitoring we also need to look at the inputs the model receives, to be sure it gets the right data in the right form to perform predictions effectively, and at the outputs of the model, to evaluate its performance. All this attention is needed to spot any drift that could happen in our system.

In this context, drift refers to any change that could impact the model's accuracy, and it can essentially be of three types: data drift, prediction drift and concept drift. In simple terms, data drift occurs when the statistical properties of the input data change, leading to decreasing accuracy even if the model itself doesn't change. Prediction drift instead refers to changes in the predictions even if the data doesn't change, due to the model itself or to external factors. Concept drift refers to changes in the relationship between the input data and the target variable, affecting the patterns that the model has learned.

To keep track of all these aspects, we can use two open source pieces of software: Prometheus, a monitoring toolkit that collects real-time metrics in a time-series database, and Grafana, which offers extensive options for dashboard creation and data visualization, complementing Prometheus by allowing us to visualize the collected metrics. Prometheus and Grafana together offer many benefits in the realm of monitoring ML systems. They allow us to collect a wide range of metrics, from system performance to model-specific ones, in a customizable, real-time way, also offering the possibility to set up alerts on specific events or thresholds. Both are designed to handle large-scale data, providing a highly flexible system for model monitoring, even when complexity grows and needs change.
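As a hedged illustration of how these pieces could fit together, the sketch below computes a simple data drift signal (a Kolmogorov-Smirnov statistic comparing recent inputs against the training distribution) and exposes it as a Prometheus metric, which a Grafana dashboard or alert rule could then pick up. The feature names, port and data-loading functions are placeholders invented for the example:

```python
# Sketch: expose a data drift signal for Prometheus to scrape.
# Assumes prometheus_client, numpy and scipy are installed; the reference/recent
# data loaders are placeholders you would implement for your own pipeline.
import time

import numpy as np
from prometheus_client import Gauge, start_http_server
from scipy.stats import ks_2samp

DRIFT_STAT = Gauge(
    "model_feature_drift_ks_statistic",
    "KS statistic comparing recent feature values with the training distribution",
    ["feature"],
)

def load_reference(feature: str) -> np.ndarray:
    """Placeholder: values of this feature in the training dataset."""
    return np.random.normal(loc=0.0, scale=1.0, size=5_000)

def load_recent(feature: str) -> np.ndarray:
    """Placeholder: values of this feature seen by the model in the last hour."""
    return np.random.normal(loc=0.3, scale=1.1, size=1_000)

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://<host>:9100/metrics
    while True:
        for feature in ("tenure_months", "monthly_charges"):
            stat, _pvalue = ks_2samp(load_reference(feature), load_recent(feature))
            DRIFT_STAT.labels(feature=feature).set(stat)
        time.sleep(300)  # recompute every 5 minutes
```

A Prometheus alert rule or a Grafana alert on this gauge crossing a threshold is then the "spot the drift before it's too late" part of the story.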
To wrap up so far: we discussed many tools, but just adding tools to the chain only increases the complexity of the system. We don't need just tools, we need a way to adopt them with a self-service approach and with as little friction as possible. That way is the internal developer platform and platform engineering. By building an internal developer platform that leverages these tools, we can standardize and simplify their adoption while creating paved paths for common use cases, consumable through a catalog of resources in a self-service way. Just for reference: if Leonardo needs a development environment to run his experiments, he just accesses the catalog and chooses the one with the configuration and libraries that best fit his needs. The same happens when Leonardo needs a cluster for distributed training, or when delivery teams need to set up the monitoring dashboards for a new service, for example.

Analyzing all these factors, a question arose in my mind: could the cloud native world and the application development world benefit from the recent explosion of AI? And if yes, how? I deal with platforms every day, so trying to find an answer to this question I started thinking about them. Platforms are a heterogeneous set of elements made up of metadata, repositories, pipelines, documentation and many other assets in various formats, and using a platform today actually means managing all these assets and being able to learn and navigate them. So what if, instead of having to work against these assets, we made them work together, allowing us to access them through a unified interface?

This is where conversational devex comes into play. With conversational devex I refer to a way to elevate the use of platforms and enhance the developer experience of those who use them. It involves applying a RAG system on top of the platform assets, allowing us to leverage the capabilities of generative AI to analyze and query those assets using natural language. And that's not just a vision: it is already available in an early-stage version, and I invite you to try it over your own assets by scanning the QR code in the last slide to access the repo. Imagine having a virtual assistant that speaks like you, with the added benefit of perfectly understanding the context of your platform, because it is actually your platform. Now imagine what you could do with an assistant that knows all the models, their lineage, the pipelines around them, the tools you can use, the best practices for using them, and even how others have used them. Troubleshooting production issues or conducting a proof of concept in such a context is on a whole new level.

The presentation ends here, and before saying goodbye let me recap the key points from the session. First, applying old solutions to new problems can be a smart way to solve them without reinventing the wheel. Second, platforms are highly customizable tools that can meet specific needs and streamline the work, even in the realm of MLOps. Finally, I don't know if AI will steal our jobs in the future, but right now it's a powerful tool to enhance the developer experience. Thank you guys. Here is a feedback form where you can leave your feedback. See you.

Graziano Casto

DevRel Engineer @ Mia-Platform

Graziano Casto's LinkedIn account


