Conf42 Platform Engineering 2025 - Online

- premiere 5PM GMT

Building Developer-Centric ML Inference Platforms: From Kubernetes Orchestration to Production Excellence


Abstract

Transform your ML infrastructure from chaos to clarity! Learn battle-tested patterns for building Kubernetes-native ML platforms that scale to 30+ billion daily inferences while keeping developers happy. Real-world insights from Starbucks & eBay’s production systems.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey, hi everyone. I'm Gangadharan, and today I'm going to talk about how to build a developer-centric machine learning inference platform. The main things I'm going to cover are the different challenges you face while developing an ML-centric inference platform, how you can serve models at scale, and what basic practices we can follow as a team. Kubernetes is the underlying infrastructure we use for machine learning models, but the question is how well we can leverage that infrastructure, the surrounding tools, and the processes involved so that the whole ML platform can scale and serve customers. Those are the things we are going to talk about today. Let's dive deeper in the next slides.

If you look at the evolution of ML platform engineering, it has three main components. The first is the resource-intensive nature: how well you are able to see what kind of resources are required for a model to serve the production load. For example, when you're developing a model you might need only one pod, but when you deploy the model to serve some use case, say an e-commerce or marketing use case, you might need a hundred or a thousand pods. It depends on the latency requirements and on what kind of features the model needs, and that is the second component, the computational requirements of the ML platform. The third is the rapid experimentation cycles involved. You can't have one model that works for all use cases or all users; you experiment with different versions of the same model, or even different models, and see which is the best fit. A model may work for a particular country but not for another, some models may work for a teenage population but not for other segments, and maybe some models are built specifically for, say, a women's segment. It depends on the use case, so you need rapid experimentation cycles. Kubernetes provides you with the base, the abstraction layer, but you need robust automation and a deep understanding of both machine learning workflows and developer experience principles so the platform can keep evolving. So let me dive deeper into the different things involved here.

When we talk about the architectural foundations, designing for scale and flexibility is something very important. There are three pillars here: one is the orchestration layer, the second is the model serving infrastructure, and third, last but not least, is the data pipeline layer. All three go hand in hand. The orchestration layer is built on Kubernetes, and it should provide foundational capabilities for scheduling, resource management, container lifecycle, resource allocation, and service discovery, where you handle all the machine learning workloads required for the model. In some cases you might even need GPUs, so GPU scheduling is also part of the orchestration layer.
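As a rough illustration of what the orchestration layer is responsible for, here is a minimal sketch, using the official kubernetes Python client, of an inference Deployment that requests CPU, memory, and a GPU so the scheduler can place it on a suitable node. The name, image tag, and resource numbers are hypothetical, not from the talk.

```python
# Minimal sketch of an inference Deployment with GPU scheduling and resource
# allocation, expressed via the official `kubernetes` Python client.
from kubernetes import client

def inference_deployment(name: str = "recsys-infer", replicas: int = 3) -> client.V1Deployment:
    container = client.V1Container(
        name=name,
        image="registry.example.com/recsys-infer:1.4.2",  # hypothetical image
        resources=client.V1ResourceRequirements(
            requests={"cpu": "2", "memory": "8Gi"},
            limits={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},  # GPU scheduling
        ),
        ports=[client.V1ContainerPort(container_port=8080)],
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": name}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=replicas,
        selector=client.V1LabelSelector(match_labels={"app": name}),
        template=template,
    )
    return client.V1Deployment(metadata=client.V1ObjectMeta(name=name), spec=spec)

# To submit it from inside the cluster you would load the in-cluster config and call:
# client.AppsV1Api().create_namespaced_deployment(namespace="ml-serving", body=inference_deployment())
```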
The model serving infrastructure is where you have a trained model and you're trying to make it readily available to be consumed by an API, by your homepage, or anywhere else the model is going to be used. You need a way to load the model, route requests to it, and optimize it; sometimes it requires batch optimization, and then there's response formatting. The model might give you a response in one format, but the client or the application might need a different one, so you need a way to format that response. The complexity stems from the diversity of model formats and serving requirements; all of that is part of the model serving infrastructure.

Then there is the data pipeline layer. When you're developing a model it requires a lot of features: some are real-time features, some are near-real-time features, and some are offline features. Take an example where you go to a shop and purchase an item, or similarly purchase an item online, and you're trying to develop a model to understand the buying pattern of a customer. It requires a lot of data: historical data about what kinds of purchases the user made and which items the user browsed, bid on, or bought. Sometimes you need near-real-time data, where you listen to a Kafka topic or some other near-real-time stream, ingest it, and serve it as a feature for the model. And in some cases you have real-time data: even before the model output is served to the actual user, we check whether an item is still available. For example, you find ten different items the user could buy, but you have to check whether each item is in stock before showing it to the customer. Those kinds of real-time checks, and the caching along the path that feeds data into the model, are part of this data pipeline layer. So these are the three architectural pillars we need to consider.

I was talking about Kubernetes orchestration before. If you go deeper into that, CRDs play a pivotal role in extending Kubernetes to support machine-learning-specific concepts. These extensions enable teams to define higher-level abstractions like model deployments, feature pipelines, and experimentation runs, all through custom resource definitions. There are different patterns here, and the operator pattern is the one that is really valuable for managing ML workload lifecycles. Custom operators can handle deployment workflows, autoscaling based on inference load, and integration with external systems like feature stores or model registries. If you're familiar with Databricks, it's one external product on the market where you have an end-to-end model development cycle in place: you have a training infrastructure, you have a feature store where you can write data into some schema, and it can also integrate with external systems like Redis, or a Kafka or other near-real-time stream you're listening to. It can also interact with Microsoft Azure Kubernetes Service or AWS, and they have connectors available, so the integration with other platforms is seamless. That's part of the orchestration layer.
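To make the CRD and operator idea concrete, here is a hedged sketch of what a higher-level "ModelDeployment" custom resource could look like and how a client might create it with the kubernetes Python library. The group, version, kind, and spec fields are hypothetical illustrations rather than a real product API; a custom operator watching resources like this would reconcile the underlying Deployments, autoscalers, and feature-store wiring.

```python
# Sketch of a hypothetical higher-level custom resource for model deployments.
from kubernetes import client, config

model_deployment = {
    "apiVersion": "mlplatform.example.com/v1alpha1",  # hypothetical CRD group/version
    "kind": "ModelDeployment",
    "metadata": {"name": "ranker-v7", "namespace": "ml-serving"},
    "spec": {
        "modelUri": "s3://models/ranker/v7",           # pulled from a model registry
        "framework": "xgboost",
        "replicas": {"min": 10, "max": 1000},           # autoscaling bounds on inference load
        "featureStore": {"onlineStore": "redis", "entity": "customer_id"},
    },
}

def apply(resource: dict) -> None:
    """Create the custom resource; an operator picks it up and reconciles it."""
    config.load_kube_config()  # use load_incluster_config() when running inside the cluster
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="mlplatform.example.com",
        version="v1alpha1",
        namespace=resource["metadata"]["namespace"],
        plural="modeldeployments",
        body=resource,
    )
```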
Next is developer experience. When I talk about developer experience, it's about how easily you provide these kinds of capabilities: abstracting the complexity behind a simple UI or self-service capabilities. When a developer is trying to deploy a model, they should not have to run multiple commands or go to multiple pages to deploy; it should be seamless. If I check code into my develop branch and there is an automated way to deploy it to the development namespace or development environment, that's the best thing to do. And if code merged to main flows through automated CI/CD pipelines that completely test, validate, and deploy to production at scale, that enhances the model development lifecycle by ten times, maybe a hundred times, because you can do faster deployments and faster testing with these kinds of automated approaches. How to achieve that, we'll see in the upcoming slides.

I mentioned automated CI/CD for ML in the previous slide. At a high level there are three different pieces. The first is model validation. When you push a new change to the model, there should be a way to validate it. You can validate the model manually before pushing, but that's a crude way of doing it. If you have regression, integration, sanity, and smoke tests already automated as part of your pipeline, you don't need that manual check: whenever you deploy the model, the automated validation kicks in, and if it fails, the deployment won't even happen. Post-deployment you can also have validations in place to monitor, and if failures go above some threshold there should be an automated rollback option. That's where artifact management plays a key role: when you keep multiple versions of the model, at least the past three, four, or ten versions depending on your requirement, it's easy to roll back to the last stable version of the model without disrupting the production environment.

You also have several deployment strategies. Canary deployment is one: say you have a thousand pods serving the model; you first deploy to one canary test pod and see how the traffic hitting that pod behaves. If there are more failures in that pod, the automated rollback kicks in, so you are not disrupting the whole set of users; maybe a few users are still impacted, but not everyone. Similarly, gradual traffic shifting is most important, because when you deploy a model at scale you cannot hit a hundred percent of the traffic at once; instead you shift 1%, 5%, 10%, and then a hundred percent. But it should also be flexible: suppose you're deploying a hotfix and the whole functionality is broken without it; in that case you should be able to push the change to a hundred percent of traffic right away. Those are the kinds of flexibility we need to provide in terms of deployment strategy.
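Here is a minimal sketch of the gradual traffic shifting and automated rollback idea described above. The set_canary_weight and error_rate hooks, the step percentages, and the thresholds are assumptions for illustration; in a real platform they would map to your service mesh or ingress API and your metrics backend.

```python
import time

TRAFFIC_STEPS = [1, 5, 10, 50, 100]  # percent of traffic sent to the new model version
ERROR_THRESHOLD = 0.02               # roll back if more than 2% of requests fail
SOAK_SECONDS = 300                   # observe each step before promoting further

def set_canary_weight(percent: int) -> None:
    """Hypothetical hook: in practice this would call your mesh/ingress API."""
    print(f"routing {percent}% of traffic to the canary")

def error_rate(window_seconds: int) -> float:
    """Hypothetical hook: in practice this would query your metrics backend."""
    return 0.0

def progressive_rollout() -> bool:
    """Promote the canary step by step; roll back automatically if it misbehaves."""
    for step in TRAFFIC_STEPS:
        set_canary_weight(step)
        time.sleep(SOAK_SECONDS)
        if error_rate(SOAK_SECONDS) > ERROR_THRESHOLD:
            set_canary_weight(0)     # automated rollback to the last stable version
            return False
    return True
```

For a hotfix, the same machinery can simply be told to skip the intermediate steps and go straight to a hundred percent.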
Next is the feature store and data pipeline architecture, which we already touched on at a high level. When you go deeper into the feature store, it addresses the fundamental challenge of serving features for both batch training and real-time inference. You can basically consider it as two different things: online and offline. Offline is for historical data and batch processing; you can have a nightly or weekly job that ingests those features. Online is for real-time or near-real-time inferencing, where you listen to a Kafka topic or directly hit an API even before the actual output is served to the customer. Those are two different kinds of data, and your architecture should accommodate both patterns, online and offline, so it can serve the model to customers well.

Then there is how you are going to serve the model. In general there are three different concerns here: one is container optimization for machine learning, the second is the serverless serving framework, and the third is cold start optimization. Container optimization requires special techniques that account for large model artifacts and GPU dependencies. In some cases you need a multi-stage build process that minimizes the container image size by separating build dependencies from the runtime environment; at runtime you may need certain things, but during the build you have to make sure the image size is optimized so you reduce the space.

For serverless serving, you abstract away cluster management while providing automatic scaling capabilities. Say you deploy a model with a thousand pods but you only receive five requests per second; a thousand pods are not required for that model. But there are times when you receive 20,000 or 30,000 requests per second, and if a pod can handle five requests per second and you are getting 5,000 requests per second, you need a thousand pods. That's where automatic scaling plays a key role: the minimum number of pods, one, ten, or a hundred, is always there serving the traffic, and when there is a huge traffic spike you see the infrastructure slowly scale out to the number of pods required. That way there is no disruption, and we are able to serve customers without latency issues.

Cold start optimization is another important thing to consider for model serving: model preloading, lazy initialization, and shared model caches. For example, if your pod restarts, whatever you have in your local cache is gone. If you have a persistent, multi-level cache concept in your design, you can still get the data from the shared persistent cache, and latency-wise it's not that bad; it might not be as good as the local cache, but it's still a few milliseconds, and you are not hitting the actual database, the actual source, or an API call for that particular data. Those are the things you need to consider to avoid latency issues.
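Below is a minimal sketch of that multi-level cache idea, assuming a local in-process dictionary in front of a shared Redis cache in front of the source of truth. The Redis usage follows the standard redis Python package; the host name, TTL, and the load_from_source callable are hypothetical.

```python
import json
from typing import Any, Callable, Optional

import redis

# Level 1: in-process cache (lost whenever the pod restarts).
local_cache: dict[str, Any] = {}

# Level 2: shared persistent cache that survives pod restarts (hypothetical host).
shared_cache = redis.Redis(host="feature-cache.ml-serving.svc", port=6379)

TTL_SECONDS = 300

def get_feature(key: str, load_from_source: Callable[[str], Any]) -> Any:
    """Look a feature up locally, then in the shared cache, then at the source."""
    if key in local_cache:
        return local_cache[key]

    cached: Optional[bytes] = shared_cache.get(key)
    if cached is not None:
        value = json.loads(cached)
        local_cache[key] = value          # repopulate the local level
        return value

    value = load_from_source(key)         # the expensive DB / API call we try to avoid
    shared_cache.set(key, json.dumps(value), ex=TTL_SECONDS)
    local_cache[key] = value
    return value
```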
Once you have the model in production and you've handled the latency and other issues during deployment, the next thing that comes into the picture is monitoring. You constantly monitor the model for failures: maybe the input data changed, the pattern in which customers use the model changed, or the performance of the model is deteriorating over time. These are things you catch through constant monitoring, and that's where observability and performance optimization come in. There are several metrics you need to watch: accuracy metrics, prediction distribution analysis, and feature drift detection, which is one of the key ones. Suppose there is drift in a feature; then you can retrain the model. You don't need to retrain the model on a schedule or on a daily basis; you can monitor for feature drift or model drift and then take action accordingly, deciding whether you need to retrain the model or replace it with a new one. Those are the kinds of checks you need to have.
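As one concrete way to implement that feature drift check, here is a hedged sketch using the Population Stability Index; the 0.2 threshold is a common rule of thumb rather than something from the talk, and a signal like this can gate retraining instead of retraining on a fixed schedule.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training-time and a live feature distribution."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Clamp live values into the training range so outliers land in the outer bins.
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) / division by zero for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

def needs_retraining(training_feature: np.ndarray, live_feature: np.ndarray) -> bool:
    """Gate retraining on observed drift instead of a fixed schedule."""
    return psi(training_feature, live_feature) > 0.2  # >0.2 is often treated as significant drift
```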
The next slide is mainly about organizational excellence. Once you have the monitoring capabilities in place, the deployment capabilities in place, and automated scaling capabilities, the question is how well you work as a team. Nothing here can be done by a single person; there are several teams you're going to work with for a particular use case. For example, when you're trying to develop a model, you need to know what problem you are trying to solve, so you work with the business to see what kinds of problems they're currently facing. Take the example of Starbucks: you have 50,000 stores, and you might have a problem where some store is constantly busy and the coffee serving time is more than ten or fifteen minutes. How can a machine learning model help in this case, to optimize resources and bring more people, like baristas, to serve coffee at that store? Those are the problems we first need to find solutions for, and that requires collaboration with different teams and different store locations to understand what problems they're facing.

Broadly, you can divide this into three categories. One is the platform team composition: when you deploy a model, you basically need a platform where you build, train, deploy, serve, and monitor the model, so there's a platform team that works on improving the infrastructure, making sure it stands up over time, revamping it whenever required, and doing constant updates to it. Next is cross-functional collaboration: there are teams who use the platform, teams who build the platform, scientists who develop models on top of it, and domain teams who consume the model output from the platform, so there should be alignment across teams when you're trying to solve the problem. And then adoption strategies: whenever there's a change you're trying to push, it's very important that there's a lot of governance around it, like security and other checks in place. Suppose you're pushing a change; it's very important that you create a pull request for it, and then there's a review mechanism. For a major change there should be a design review in place even before the change, and even for small bug fixes you should have reviews in place. Once the reviews are approved, there should be some kind of security scan that can detect security or vulnerability issues and block the merge even before the change gets pushed to the development environment. And there should be automated strategies so you can work with teams, every use case is tested, and the whole workflow runs seamlessly. These kinds of adoption strategies need to be there. For example, if you're trying to migrate as a team to some higher version of the platform, you need a proper way to communicate that the migration is going to happen, so the teams who are using your platform are ready for that particular change. Those are the things we need to take into account for organizational excellence.

The next and most important thing here is the scaling challenges. I'm telling you this based on my past experience: when we started building the ML platform, we started with a very small pilot model, and finally we went to scale, handling three-to-four-plus models. When you're trying to evolve along with the teams and the scientists, it requires a lot of optimization strategies and a lot of caching strategies. Request batching is one: when you go from single requests to many requests, throughput is important and latency even more so, so batching is something you need to take care of (there's a small sketch of this after this part). Caching strategies we discussed in the previous slides: you should support multi-level caches so you can eliminate a lot of API calls and database calls, and when you have multiple models that require transformations, like model chaining, caching is always useful; you don't hit the model unless the entry has fallen out of the cache or you don't have a cache in place, so that helps a lot when it comes to model serving. Hardware-aware optimization is another thing: whether you use GPUs or TPUs depends on how large your model is and how large your scale is. Multi-tenancy is another: you cannot have a single point of failure, so you need multi-tenancy, and it involves multiple sites or multiple countries; your servers should be located as close to the site as possible so the latency is good. Those are the things you need to consider: geographic distribution, optimizing global latency, and then edge deployment patterns and CDNs. Take Netflix, for example: the way they serve, their infrastructure is built on top of CDNs, and there's no call that goes beyond the CDN to the actual server. That's one of the reasons the streaming in Netflix is so fast.
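Here is the small request-batching sketch mentioned above: individual inference requests are queued for a few milliseconds and then scored as one batch, trading a little latency for much higher throughput. The batch size, wait time, and the model.predict call are assumptions for illustration.

```python
import asyncio
from typing import Any, List, Tuple

MAX_BATCH = 32     # score at most this many requests per model call
MAX_WAIT_MS = 5    # or flush the batch after this long, whichever comes first

# Pending requests: (features, future that will receive the prediction).
queue: "asyncio.Queue[Tuple[Any, asyncio.Future]]" = asyncio.Queue()

async def predict(features: Any) -> Any:
    """Called once per incoming request; resolves when its batch has been scored."""
    fut: asyncio.Future = asyncio.get_running_loop().create_future()
    await queue.put((features, fut))
    return await fut

async def batcher(model: Any) -> None:
    """Background task: drain the queue into batches and score each with one model call."""
    while True:
        items: List[Tuple[Any, asyncio.Future]] = [await queue.get()]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(items) < MAX_BATCH:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        outputs = model.predict([features for features, _ in items])  # one call for the batch
        for (_, fut), output in zip(items, outputs):
            fut.set_result(output)
```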
Now we are going to talk about security, compliance, and governance. I'll tell you about some of the use cases I encountered in the past. There are cases where you handle data with PII, like credit card information, and in some cases very sensitive data. You should have proper security and compliance in place before even dealing with that kind of data, so it's very important that the data is masked and not leaked anywhere outside. In some cases you also have to manage legal and regulatory requirements, when you're dealing with healthcare or financial models; you should make sure all these regulatory requirements, GDPR, HIPAA, financial regulations, and automated compliance checking, are in place. Even though you try to automate everything, there should be a proper audit trail of what is being done. Data lineage is another thing: suppose there is a feature you're using; we should know where the feature is generated, who is consuming it, and all the different places the feature will go, so you can backtrack. If there's a leak, you can see where the leak is; even if there's just an error, you know where the error is. And it's very important to involve the security team at each stage. Even if you're trying to use an open-source component or introduce a new component into your platform, it's very important that security and governance are in place for all those kinds of changes. That way you don't have any surprises or any leaks, because a leak creates a lot of financial and other problems, so it's better to have the security checks in place beforehand.

When it comes to future directions and emerging trends, there are four different things: edge computing, real-time learning, federated learning, and AutoML integration are what you need to account for in the future. For example, AutoML reduces the expertise required for model development. If you're trying to push models rapidly, it's always good to have some AutoML integration in place; your platform should support it so it automatically trains and deploys new versions of models, and you should have proper monitoring capabilities so the AutoML models that get generated are properly validated. Edge computing is about being close to data sources and end users; the platform architecture should manage distributed deployments across diverse hardware, and that's where edge computing plays a key role. Real-time learning means that, based on new data, your model should be able to learn, so we can adjust the model and adapt to the new data; your platform should support online learning algorithms, streaming data processing, and dynamic model updates to enable that. Federated learning enables model development across distributed datasets without centralizing the data; it requires specific coordination mechanisms like privacy-preserving techniques and distributed model aggregation capabilities. These are some of the components of the future emerging trends: once your model is at scale, you need these kinds of capabilities in place to keep up with what's coming.
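To illustrate the distributed model aggregation piece of federated learning, here is a hedged sketch of federated averaging: each site trains locally and shares only weight arrays and sample counts, which a coordinator averages. The data structures and names are assumptions for illustration, not part of any specific framework.

```python
from typing import Dict, List, Tuple

import numpy as np

# Each participating site reports (number of local samples, model weights per layer)
# without ever sending its raw data to the coordinator.
SiteUpdate = Tuple[int, Dict[str, np.ndarray]]

def federated_average(updates: List[SiteUpdate]) -> Dict[str, np.ndarray]:
    """Combine per-site weights into a global model, weighted by local sample counts."""
    total_samples = sum(n for n, _ in updates)
    layer_names = updates[0][1].keys()
    return {
        name: sum((n / total_samples) * weights[name] for n, weights in updates)
        for name in layer_names
    }

# Example: two sites with different amounts of local data.
site_a = (8_000, {"dense": np.array([0.2, 0.4]), "bias": np.array([0.1])})
site_b = (2_000, {"dense": np.array([0.6, 0.0]), "bias": np.array([0.3])})
print(federated_average([site_a, site_b]))  # {'dense': array([0.28, 0.32]), 'bias': array([0.14])}
```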
To add on to what we discussed in the previous slides: in order to create a successful ML inference platform, you require a balance between multiple things. One is performance, the second is cost, the third is developer productivity, and then operational control. Sometimes innovation is so important, especially with generative AI now; innovation plays a key role, and so does reliability. How reliable are the results you serve? For example, people go for an Amazon Prime membership because they find that whatever they search for on Amazon is relevant; if that reliability goes out of the picture, it causes a lot of other concerns. Similarly, how can you rely on a platform unless it's able to serve all kinds of traffic without issues and autoscale when needed? That's why reliability is most important. The other thing is that you should achieve this balance through thoughtful architecture, robust automation, and strong organizational alignment. That investment is what matters for building sophisticated ML platforms: you try to reduce the operational and other overheads that can arise in the platform, and make sure your deployments are faster and seamless, so you go to market on time or even ahead of time. That's the main goal, and cross-team collaboration and platform excellence are part of that goal; security and everything else we discussed in the previous slides are the building blocks to achieve it. Thank you, everyone, for listening to me patiently, and if you have any doubts or questions, feel free to ask me; I can get back to you. Thank you.
...

Gangadharan Venkataraman

Technical Lead, AI/ML Platform @ Starbucks

Gangadharan Venkataraman's LinkedIn account


