Conf42 Site Reliability Engineering (SRE) 2024 - Online

From DevOps to MLOps: Scaling Machine Learning Models to 2 Million+ Requests per Day

Abstract

Learn how to deploy and scale Machine Learning models to 2 Million+ requests/day using MLOps best practices. In this talk, you'll learn how to go from data preparation to deployment to running ML models at large scale - all without breaking the bank.

Summary

  • Today I am going to talk about going from DevOps to MLOps: a journey of scaling machine learning models to 2 million API requests per day. I am a founder at a company called One2N, where we help startups and enterprises with backend and site reliability engineering.
  • MLOps is operationalizing data science. It is all about moving machine learning workloads to production. Fundamentally, this is a feedback loop, just like in software. Here is what the simplest MLOps workflow looks like.
  • We were working with a company that was building eKYC SaaS APIs. These needed to scale up to 2 million API requests per day to the model. We applied DevOps and MLOps best practices in production. We set out for two nines of availability during peak hours, that is, during business hours.
  • What did our scaling journey look like, and how did we go from zero to 2 million API requests per day? I want to talk about four or five important points and then drill down on each one of them, so hopefully you all can learn from it.
  • One of the things we did was add a high availability mode for RabbitMQ, along with a cross-AZ deployment of the RabbitMQ cluster. Where possible, we used SaaS offerings for some of the important stateful systems. Managing and scaling ML was one of the bigger challenges.
  • Let's talk about capacity planning. Where do we think the bottleneck is? Obviously it was on the ML worker side, because that's the component that takes the most time in the request path. We set out to figure out how many ML model requests a single node can handle.
  • How do you go about optimizing and auto scaling? What is the parameter on which you can auto scale? Could it be memory utilization, or the number of incoming requests? We've been using this autoscaler for years now and it's just been working very well.
  • One of the issues we encountered was GPU utilization in Nomad. Another issue was high latency on the API side. One workaround was using the raw_exec driver instead of the Docker driver. That allowed us to run not one but four workers per GPU machine, which straight away brought our cost down by 4x.
  • Data quality and how you train the model are super important. I would highly recommend using a cloud agnostic architecture to build scalable systems. Treat production, maintenance, and operations as first class citizens in the software building and delivery process.

Transcript

This transcript was autogenerated.
Today I am going to talk about going from DevOps to MLOps: a journey of scaling machine learning models to 2 million API requests per day. Before we dive in, a brief about me. I am Chinmay. You can find me on Twitter, LinkedIn, and so on as chinmay185. I am a founder at a company called One2N, where we help startups and enterprises with backend and site reliability engineering. I write stories of our work as what I call pragmatic software engineering, and I publish them on Twitter, LinkedIn, and elsewhere. I love engineering, psychology, and percussion, and I am a huge fan of a game called Age of Empires. All right, so let's start. What are we covering today? We are covering three things, fundamentally: what MLOps is, how to think about MLOps as a DevOps practitioner, and, where I want to spend most of the time, a real-world production case study that we worked on, walking through all the learnings we had along the way. So what is MLOps, fundamentally? MLOps is operationalizing data science. We all know what DevOps is: DevOps is operationalizing software delivery and software engineering. MLOps is the equivalent of DevOps for data science workloads, think machine learning and AI workloads. That means it is all about moving machine learning workloads to production. Just like we have DevOps phases, we have various phases in MLOps: you build the models and manage their versions, you deploy these models to production, you monitor them, you take feedback, and you continuously improve the models. Those are the four broad steps of MLOps. Let's look at some of them in more detail. Just like software engineering is about building and shipping code to production, MLOps is about building and shipping machine learning models. Now, what do you need to build machine learning models? You need data. You need to extract this data in various forms, analyze it, and prune some parts of it; essentially you are doing data gathering and preparation. Then you feed this data to your ML model: you train the model, evaluate its responses, test and validate whether it works correctly, and fine-tune this process over time. You have test data segregation, you have production data, and so on. That is the machine learning part of it, which is what data scientists work on. The operational parts are model serving: how do you serve this model in production? Do you run it on GPUs or CPUs? Which cloud provider do you use? How do you monitor whether the model is performing as expected? How do you manage scale-up and scale-down of that model? All of that is the operational concern, the ops part of it. Fundamentally, this is a feedback loop, just like in software. We have CI/CD, continuous integration and continuous delivery/deployment; in MLOps there is a third item called continuous testing and training, where you continuously monitor, train, and improve the model over time. Here is what the simplest MLOps workflow looks like. This diagram is from Google's MLOps guide; you can find the link in the description.
Fundamentally, again, it starts with getting data. We are trying to map all of the previous steps and phases we looked at onto this workflow. We get data from various sources; it could be offline data, it could be real-time data, and so on. For now, we keep the diagram very simple and just look at some offline data. We extract this data, analyze it, and prepare it; that's essentially the first step. Then the whole model training step comes in, where you train the model, evaluate it, check its performance, and manage versions, validations, and so on. Finally, you have a trained model that you put into a model registry. Once the model registry has the model, the operational part of MLOps comes into the picture, which is serving the model: you have, for example, a prediction service that you run in production. Then you have to monitor and scale that service, run it on GPUs, figure out cloud cost optimizations, and so on. That's operationalizing data science; that's the simplest MLOps flow you can think of. You can also map this onto the classic DevOps infinity loop. The typical DevOps infinity loop is code, build, test, plan, then release, deploy, operate, monitor, done in a loop consistently over long periods of time. Similarly, MLOps starts with data preparation: you prepare the data and train the model, which is the build and test part of it. The release goes into the model registry. You then deploy the model and monitor its performance, and all of this together becomes continuous training and testing of the model. Enough about theory. What I want to talk more about is a use case that we worked on. This is production work, and I'm going to cover the use case at a high level: the work we did, how we applied DevOps and MLOps best practices in production, and the kinds of issues we faced along the way. So let's start with the case study. We were working with a company that was building eKYC SaaS APIs, accessible to B2B and B2C customers. This needed to scale up to 2 million API requests per day to the model. Among the three or four SaaS APIs we had, one was a face matching API: you provide two images to the model, it matches the faces between them, and the model outputs a score between, say, zero and one. Based on that matching score you can decide whether the two people in the images are the same person or not. Then we had face liveness detection: given an image of a face, is this the face of a live person or not? And we had OCR, optical character recognition, from an image: for example, you upload a photo of a passport or any identity card and you can extract the text information from it. Imagine this KYC use case for an insurance company, a telecom, or any other domain: people would otherwise have to manually enter a lot of information for the user, like their name, their date of birth, their address, and so on.
All of that information gets captured via the OCR API and returned in the response in a structured fashion, which eliminates having to type (and mistype) information. We had some other small APIs as well, which I'll ignore for now. So fundamentally we had this ML system, and the architecture of that system, along with the other components, looked something like this. It mainly served B2B use cases; we also had some things for B2C, but we'll ignore that for now. Imagine a B2B use case: I'm an insurance or telecom company, I have my own app, and there is a client SDK that I package as part of that app. This SDK talks to our backend, which exposes these ML APIs over HTTP. So it connects to, for example, a load balancer, and then we have an API layer. To serve the models, we used RabbitMQ as a queue where we would push messages. For example, if we want to match faces across two images, we create a message and push it onto RabbitMQ. Then there are background workers, the ML workers, which accept messages from RabbitMQ, do the processing, update the results in the database, and maybe also save the results in a cache like Redis, which we had. The images themselves are stored in a distributed file store; it could be Minio, it could be S3, things like that. Essentially, the workers perform the bulk of the work and then the API returns the results to the user. This is the kind of architecture we had. Now, what were the requirements from an SLO point of view? We set out to achieve at least two nines of availability, that is, two nines of uptime during peak or business hours. Because we were dealing with a B2B company, there typically were business hours: stores would open at 9:00 a.m. and stay open until about 10:00 p.m. local time, for example. So we had promised two nines of uptime during that window. We obviously also had to worry about costs and cost optimization as an important requirement. We hadn't defined every SLO metric precisely, but we'll get to that later. And from an API point of view, these were synchronous APIs as far as the user and the SDK are concerned, so we targeted less than three seconds of API latency at the 95th percentile. That was the goal we had set.
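To make that request path concrete, here is a minimal sketch, not the actual production code, of how an API handler might publish a face-match job onto RabbitMQ for the ML workers to pick up, using the community Go client github.com/rabbitmq/amqp091-go. The queue name and message fields are illustrative assumptions.

package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	amqp "github.com/rabbitmq/amqp091-go"
)

// FaceMatchJob is a hypothetical message format: the worker fetches both
// images from the object store and writes the score to the database.
type FaceMatchJob struct {
	RequestID string `json:"request_id"`
	ImageAKey string `json:"image_a_key"` // object-store key, e.g. in S3/GCS/Minio
	ImageBKey string `json:"image_b_key"`
}

func publishFaceMatch(ch *amqp.Channel, job FaceMatchJob) error {
	body, err := json.Marshal(job)
	if err != nil {
		return err
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// Publish to a durable work queue consumed by the ML workers.
	return ch.PublishWithContext(ctx,
		"",                // default exchange
		"face_match_jobs", // routing key = queue name (assumed)
		false, false,
		amqp.Publishing{
			ContentType:  "application/json",
			DeliveryMode: amqp.Persistent, // survive broker restarts
			Body:         body,
		})
}

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	// Durable queue so jobs are not lost if a broker node restarts.
	if _, err := ch.QueueDeclare("face_match_jobs", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}
	if err := publishFaceMatch(ch, FaceMatchJob{RequestID: "req-123", ImageAKey: "a.jpg", ImageBKey: "b.jpg"}); err != nil {
		log.Fatal(err)
	}
}

The API then responds to the caller once a worker has processed the job and published the result back, which is what keeps the user-facing API synchronous while the heavy lifting stays asynchronous.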
Given this, let's talk about how we set out to build a cloud agnostic architecture, and why. For storing the actual images, which were ephemeral and held only for a short time because we were dealing with sensitive data, we used S3 when deployed on AWS, GCS when deployed on GCP, and Minio when deployed on premise. One of the reasons was that we wanted the same code to be able to use different image stores without having to change the code much, or ideally at all. Why? Because then we could deploy the entire stack on any cloud, run it on premise, even run it on a customer's premises, or on AWS, GCP, or any other cloud for that matter. That was the main thing we wanted to achieve, and that's why we set out for a cloud agnostic architecture where we don't depend on a very cloud-specific component and end up tied to that particular cloud as a code dependency. That covers Minio. For Redis, which is basically a cache storing the latest computations so the API can quickly return results to users, we could go with self-hosted Redis, or ElastiCache if we were on one of the cloud providers. For the queue we chose RabbitMQ, purely so that we could run it on premise easily; on the cloud we could have used something like SQS or an equivalent managed queue. For the GPUs and workers, which were predominantly GPU workloads, we would run them on one of the cloud providers, or we could get our own custom GPUs. For this production use case we were on one of the cloud providers, so we used many of the cloud components, but our code was not tightly coupled to the cloud at all. For the database we used Postgres, which could be RDS or Cloud SQL or something else depending on the cloud provider. Fundamentally we had Nomad as the orchestrator, which orchestrated all the deployments and the scaling of components. Back then we were not using Kubernetes, again purely from a simplicity point of view: we wanted to deploy this whole stack, including the orchestrator, on premise, and the team did not have enough Kubernetes expertise to manage self-hosted Kubernetes on premise ourselves. So we chose the HashiCorp stack, which is a fairly simple single binary, easy to manage and easy to run, and we already had expertise in the team for it. Why cloud agnostic? I think it's a very important point to highlight. Because we were cloud agnostic, we could package the same wine in a different bottle, so to speak: we could package the same stack and run it in any environment we wanted, even air-gapped environments if we had to. This was the main reason we went cloud agnostic, and I think one of the lessons is to build more cloud agnostic systems. That way you're not tied to any particular cloud provider. Using cloud providers obviously simplifies a lot of things for you, but you want your architecture and code not to be coupled to the provider, so that you can move freely and migrate to a different cloud without having to redo a lot of effort.
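As a rough illustration of what "cloud agnostic" can look like in code, here is a minimal Go sketch of an object-store abstraction with interchangeable backends. The interface and the config-driven selection are assumptions on my part, not the team's actual code; in a real implementation each backend would wrap the corresponding SDK (AWS S3, GCS, Minio).

package storage

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// ObjectStore is the only storage interface the API and ML workers depend on.
// Concrete backends (S3, GCS, Minio) sit behind it, so the same code can run
// on AWS, GCP, or on premise without changes.
type ObjectStore interface {
	Put(ctx context.Context, key string, r io.Reader) error
	Get(ctx context.Context, key string) (io.ReadCloser, error)
	Delete(ctx context.Context, key string) error
}

// NewObjectStore selects a backend from configuration (e.g. an env var).
// The cases return stubs here to keep the sketch self-contained.
func NewObjectStore(backend string) (ObjectStore, error) {
	switch backend {
	case "s3", "gcs", "minio":
		return &stubStore{name: backend}, nil
	default:
		return nil, fmt.Errorf("unknown object store backend %q", backend)
	}
}

type stubStore struct{ name string }

func (s *stubStore) Put(ctx context.Context, key string, r io.Reader) error {
	return errors.New(s.name + ": Put not implemented in this sketch")
}

func (s *stubStore) Get(ctx context.Context, key string) (io.ReadCloser, error) {
	return nil, errors.New(s.name + ": Get not implemented in this sketch")
}

func (s *stubStore) Delete(ctx context.Context, key string) error {
	return errors.New(s.name + ": Delete not implemented in this sketch")
}

Because the application code only ever sees ObjectStore, moving from S3 to Minio (or to a customer's air-gapped environment) becomes a configuration change rather than a code change, which is the flexibility described above.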
What did our scaling journey look like, and how did we go from zero to 2 million API requests per day? Let's talk about that. Obviously it wasn't zero requests one day and 2 million the next; it was a gradual scaling journey. We would roll out to a few regions or a few stores, then slowly increase the traffic, observe it, and so on. From our scaling journey I want to talk about four or five important points and then drill down on each one as we go through the talk. One is eliminating single points of failure: to be able to scale, you want zero single points of failure, so that your system is more resilient to changes and to failures. You also need to do good capacity planning, so that you can scale up and down easily and save on cloud costs; otherwise, if scaling requires changes to the architecture, it causes problems. Cost optimization and auto scaling go hand in hand with capacity planning, so I'll cover those too. Then come a lot of operational aspects: deployments, observability, being able to debug something, dealing with production issues, and so on. And of course this journey wasn't straightforward; it was fraught with challenges, so I'm going to cover two interesting challenges we encountered along the way, and hopefully you can learn from them. In the next part of the talk I'll take each of these points and drill down on them. Let's start with eliminating single points of failure, with the architecture shown here in the background. One of the things we did was add a high availability mode for RabbitMQ. What does that mean? It means queue replication: whichever queue exists on one machine gets replicated, or mirrored, onto the other machines. We were running RabbitMQ as a three-node cluster instead of a single node. We also had a cross-AZ deployment for RabbitMQ, so each of the three nodes ran in its own AZ. This was for an on-premise setup, or any setup where we wanted to host and manage RabbitMQ ourselves; if we were using a cloud managed service, we would use something like SQS instead. Another thing we did to eliminate single points of failure was to run the ML workloads in multiple AZs. Back then, only two AZs had the ML GPUs available; the third zone from the cloud provider did not yet offer those GPUs. So we had to tweak our logic, deployment, and automation to spin up and load balance between those two AZs: we adjusted auto scaling and deployment automation to run workloads on two zones instead of three. For most of the other components in the architecture, workloads ran across all three AZs. The other thing we did was, wherever possible, use SaaS offerings for some of the important stateful systems, like the databases, Redis and Postgres, just so that we didn't have to manage and scale those components too. Managing and scaling ML was already one of the bigger challenges, so we wanted to offload the lower hanging fruit to the cloud providers. Fundamentally, the idea is that scaling and managing stateful components is hard, while stateless is much easier: it's easy to automate scaling of stateless applications and components, and stateful ones make that difficult. So that's eliminating single points of failure.
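On the client side, talking to a clustered RabbitMQ usually means being able to try more than one node address (one per AZ) so a single broker going down doesn't become its own single point of failure. A minimal sketch of that pattern, again with github.com/rabbitmq/amqp091-go and made-up addresses:

package main

import (
	"fmt"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

// dialCluster tries each RabbitMQ node address in turn and returns the first
// connection that succeeds, so one node (or one AZ) being down does not take
// the publishers and consumers down with it.
func dialCluster(urls []string) (*amqp.Connection, error) {
	var lastErr error
	for _, u := range urls {
		conn, err := amqp.Dial(u)
		if err == nil {
			return conn, nil
		}
		lastErr = err
		log.Printf("rabbitmq node %s unavailable: %v", u, err)
	}
	return nil, fmt.Errorf("no RabbitMQ node reachable (last error: %v)", lastErr)
}

func main() {
	// Hypothetical node addresses, one per availability zone.
	nodes := []string{
		"amqp://guest:guest@rabbit-az1:5672/",
		"amqp://guest:guest@rabbit-az2:5672/",
		"amqp://guest:guest@rabbit-az3:5672/",
	}
	conn, err := dialCluster(nodes)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	log.Println("connected to RabbitMQ cluster")
}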
Let's talk about capacity planning. When you think about capacity planning, you think about the bottleneck, because the strength of a chain is the strength of its weakest link; you want to find the weakest component and improve its strength. If you look at the various components in our architecture, there is the API, a simple service that talks to the database and returns results; there are the stateful components, which could be Redis, RabbitMQ, Postgres, and so on; and there are the ML workers. Where do we think the bottleneck is? Obviously it was on the ML worker side, because that's the component that takes the most time in the request path. So we set out to figure out how many ML model requests a single node can handle. For example, if you have a single node with one GPU, 16 GB of GPU memory, and a certain number of cores, how many workers can I run on it, and how many requests per second or per hour can I get out of it? Each ML model may give different results: face matching may be faster than OCR, or face liveness detection may be faster than face matching, for example. So we would run each of these models on each node type and put them through a load test to find the maximum throughput we could sustain over a long period, say an hour or two. I've broken this down in more detail, and in a more generic format, in another talk I gave on optimizing application performance from first principles; if you're interested, you can check that out, and I'll provide the link in the description.
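As an illustration of that kind of capacity test, here is a small, self-contained Go sketch that hammers a single node with a fixed number of concurrent requests for a while and reports sustained throughput. The endpoint, concurrency, and duration are placeholders; a real run would use a proper load-testing tool and the actual model payloads.

package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		target      = "http://ml-node-under-test:8080/v1/face-match" // hypothetical endpoint
		concurrency = 16                                             // parallel callers hitting the node
		duration    = 1 * time.Hour                                  // long enough to see sustained, not burst, throughput
	)

	var ok, failed int64
	deadline := time.Now().Add(duration)
	var wg sync.WaitGroup

	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client := &http.Client{Timeout: 30 * time.Second}
			for time.Now().Before(deadline) {
				// A real test would POST the actual image payloads for each model.
				resp, err := client.Get(target)
				if err != nil {
					atomic.AddInt64(&failed, 1)
					continue
				}
				if resp.StatusCode == http.StatusOK {
					atomic.AddInt64(&ok, 1)
				} else {
					atomic.AddInt64(&failed, 1)
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()

	fmt.Printf("ok=%d failed=%d sustained throughput=%.1f req/s\n",
		ok, failed, float64(ok)/duration.Seconds())
}

The per-node throughput number that falls out of a test like this is what feeds the capacity math and the autoscaler discussed next.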
Now, cost optimization and auto scaling; that is one of my pet peeves given the current market. If you use GPUs in the cloud, you will have seen that they cost a lot. So how do you go about optimizing and auto scaling? You have to think about the parameter on which you auto scale: is it CPU or GPU utilization? Could it be memory utilization, or the number of incoming requests? Or, since we were using RabbitMQ, could it be queue depth, or something else? We used a combination of some of these signals, and I'll talk about how we went about it. This is how our autoscaler works. On the top left you have the current request rate: a graph of how many requests we received over the last 20 or 30 minutes. We have a capacity predictor component that runs every 20 minutes; it fetches this request rate and the current node count, that is, the current number of workers. Based on the number of requests and their growth rate, the capacity predictor predicts the desired node count using simple linear regression. For example, if the stores open at 9:00 a.m. and we know we need at least 50 machines at that point, we use time-based scaling and simply spin up 50 machines so they are ready about ten minutes before the stores open. If the stores open and traffic keeps increasing, we spin up more nodes; if traffic decreases, typically around lunchtime, we spin down a couple of nodes. So this capacity predictor takes the rate of growth or decline in requests and the current node count, does the linear regression math with a few other parameters, and predicts the desired node count. That count goes as input to the autoscaler. The autoscaler then does a couple of things: it updates the Nomad cluster configuration, telling Nomad to spin up that many workers, and Nomad correspondingly brings up more nodes and deploys the model on them. It also notifies us on Slack, for example that two nodes were added, or that two nodes were destroyed because traffic was low. One of the reasons we built it this way is so that we have a manual override at any point: if we knew a big campaign was coming, or some business constraint meant we had to scale the machines at a particular time, we could manually override the value and change the configuration to spin up that many nodes. This gave us a lot of control, and we've been using this autoscaler for years now; it just works very well. It's a simple, low-effort piece of work, but it has worked flawlessly for us, and we've never had issues with it. There's more about this in one of our recent blog posts and case studies; you should check it out if you're interested in this kind of thing.
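To make the capacity predictor idea concrete, here is a rough Go sketch of the core calculation: fit a simple linear regression to recent request-rate samples, extrapolate one interval ahead, and convert that into a desired worker count. The per-node capacity, the bounds, and the override handling are illustrative assumptions, not the actual implementation.

package main

import (
	"fmt"
	"math"
)

// predictRate fits y = a + b*x by least squares over recent request-rate
// samples (requests/minute, oldest first) and extrapolates one step ahead.
func predictRate(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	n := float64(len(samples))
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return samples[len(samples)-1] // flat or single-point history: keep current rate
	}
	b := (n*sumXY - sumX*sumY) / denom
	a := (sumY - b*sumX) / n
	return a + b*n // predicted rate for the next interval
}

// desiredNodes converts a predicted request rate into a worker count,
// clamped between a business-hours floor and a cost ceiling.
func desiredNodes(predictedRate, perNodeRate float64, minNodes, maxNodes, override int) int {
	if override > 0 { // manual override for known campaigns, store onboardings, etc.
		return override
	}
	n := int(math.Ceil(predictedRate / perNodeRate))
	if n < minNodes {
		n = minNodes
	}
	if n > maxNodes {
		n = maxNodes
	}
	return n
}

func main() {
	// Request rate over the last few 20-minute intervals (made-up numbers).
	history := []float64{1200, 1400, 1650, 1900}
	pred := predictRate(history)
	nodes := desiredNodes(pred, 120 /* req/min per node, from load tests */, 5, 100, 0)
	fmt.Printf("predicted rate %.0f req/min -> %d workers\n", pred, nodes)
}

The output of something like desiredNodes is what the autoscaler then hands to Nomad as the new worker count, and echoes to Slack.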
Now, our journey wasn't smooth. It was fraught with issues and errors, so I'm going to talk about some of the issues we encountered, how we navigated them, and what impact they had on downtime, SLOs, and so on. One issue we encountered was GPU utilization in Nomad. Imagine the top box here is a GPU, with its memory and cores. Using Nomad with the Docker driver, we could run one ML worker per GPU. What we noticed was that the GPU wasn't fully utilized: with one worker it was only about 20% utilized, and a lot of the machine's other resources were simply left unused. We ran load tests to figure out how much throughput we could get out of a single machine running a single ML worker in a Docker container, and it wasn't impressive. With that kind of hardware our cost was going through the roof, and we really had to figure out how to fix it. So we dug deep into Nomad GitHub issues and pull requests and some fairly obscure documentation, and what we eventually discovered was a workaround: instead of the Docker driver, use the raw_exec driver, which lets you run any kind of component. It doesn't give you the same guarantees; you have to handle a lot of the scheduling concerns yourself. But with raw_exec we could run docker compose, and spin up multiple Docker containers via docker compose from Nomad. That allowed us to run not one but four workers on the same GPU machine, which took us to about 80% GPU utilization and straight away brought our cost down by 4x. Imagine spending $100,000 per month on GPUs alone; we could cut that to a quarter. Last I checked, that Nomad issue is still open; I've highlighted the workaround here for you to look at. One thing to note: if you use the raw_exec driver, you have to handle the shutdown path yourself. Nomad normally handles SIGTERM and graceful termination of components; with raw_exec you need waits and timeouts and you have to put in some work to tear the components down correctly. We invested in that and wrote some bash scripts that could both start the components and tear them down cleanly when we wanted to. The other issue we encountered was high latency. We had steady traffic, new regions and new stores were opening up and being migrated to use our APIs, and the rollout was progressing, when suddenly, on one of those days, we saw response times of more than 25 seconds. Our SLA was under 3 seconds at the 95th percentile, so 25 seconds was unacceptable. We debugged the issue, assuming there must be a problem with the ML workers, because they are the slowest component in the chain. But we found there was no queue depth: the workers were keeping up, with no backlog of jobs waiting to be processed, so worker scaling wasn't the problem. Then why was latency so high on the API side when there was no processing lag on the workers? We spent a lot of time debugging and ultimately discovered it was the number of goroutines on the API side that process the results coming back from RabbitMQ. Our flow was: the client SDK calls our NLB or ALB, the API sends a message to RabbitMQ, the workers consume that message, produce the result, update the database, and publish another message to RabbitMQ saying the processing is done, along with some metadata. There are consumers running inside the API that read this metadata from RabbitMQ and then return the response to the user. That part was running just five goroutines. We were running three nodes, three Docker containers, for the API layer and had never seen any CPU utilization issue there. But when the request count increased, those five goroutines were not enough, and that is where we actually saw queue depth: on the API side, on the queue the API goroutines consume from RabbitMQ. It took us a while to figure that out and fix it, because our preconceived notion, and our first response, was to look at the workers as the problem. We changed the goroutine count to 30, and suddenly the traffic went back to normal, with response times under 3 seconds again.
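The fix amounts to raising the concurrency of the result consumers on the API side. Here is a minimal sketch of that pattern with github.com/rabbitmq/amqp091-go, where the pool size is just a configuration value; the queue name and handler are placeholders, not the actual service code.

package main

import (
	"log"
	"sync"

	amqp "github.com/rabbitmq/amqp091-go"
)

// consumeResults starts `workers` goroutines that all read from the same
// delivery channel. Going from 5 to 30 consumers is a one-line config change.
func consumeResults(ch *amqp.Channel, queue string, workers int, handle func([]byte)) error {
	deliveries, err := ch.Consume(queue, "", false, false, false, false, nil)
	if err != nil {
		return err
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for d := range deliveries {
				handle(d.Body) // e.g. mark the request complete and respond to the caller
				d.Ack(false)
			}
		}()
	}
	wg.Wait() // returns when the channel is closed
	return nil
}

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := ch.QueueDeclare("ml_results", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}
	// Let the broker hand out up to one unacked message per consumer goroutine.
	if err := ch.Qos(30, 0, false); err != nil {
		log.Fatal(err)
	}
	if err := consumeResults(ch, "ml_results", 30, func(body []byte) {
		log.Printf("result: %s", body)
	}); err != nil {
		log.Fatal(err)
	}
}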
That really shows why you should understand your components and how the data pipeline works; you need to think carefully about where latency is actually being introduced. After that, we also added a lot of other observability signals and metrics to track this issue and catch any further ones, and we invested heavily in tracking cross-service latencies and similar signals. I've written about this as one of our pragmatic engineering stories; you can check it out on Twitter if you follow me there. I want to conclude this session with some of the lessons we learned along the way. One: for ML workloads, it goes without saying but is worth repeating that data quality and how you train the model are super important. A lot depends on the quality and volume of the data, how you train your models, how you carve out test data from the data you run the models on, and how you label the data. Two: I would highly recommend a cloud agnostic architecture for building scalable systems; it gives you the flexibility to deploy on any kind of underlying cloud or environment. For us that has been a massive win, and I would strongly encourage you to think about building cloud agnostic systems. Three: one thing I've seen a lot is that people do not treat operational workloads as first class citizens; they just slap operations on top of the existing software and assume somebody will manage it. Treating operational work as a first class citizen really helps in automating a lot of your day-to-day tasks, especially when you are first launching. You should really treat production, maintenance, and operations as first class citizens in the software building and delivery process. Lastly, from a team point of view, we had really good collaboration across teams: data scientists, the mobile SDK team, the backend engineering team, SREs, and even the business folks. For example, whenever there was a new campaign, or we got information from the business team that a new region or a new batch of stores was being onboarded, we would preemptively scale the nodes to handle that traffic. Working closely with the data scientists, we also optimized the Docker image size for the models. Earlier, we shipped all versions of the models in the final Docker image, which meant the image itself was around ten gigabytes; later, in close collaboration with the data scientists, we brought it down to less than about 3 GB. That speeds up node warm-up, the deployment process, and the time it takes for nodes to be ready to serve traffic. So ensuring good collaboration between teams is super, super important. I think that's it from my side. Connect with me on LinkedIn, Twitter, and so on, and check out the Go and SRE bootcamps we've built at One2N, along with the software engineering stories I write. Here's the QR code; you can scan it and check these out.
...

Chinmay Naik

Founder @ One2N



