Conf42 Machine Learning 2025 - Online

- premiere 5PM GMT

Harnessing Microservices for Scalable, Fault-Tolerant ML Systems: Trends, Challenges & Best Practices

Abstract

Unlock the power of microservices to create scalable, fault-tolerant ML systems! Learn strategies for resilient apps, tackle challenges like data consistency & performance overhead, and explore trends like AI-driven orchestration & serverless computing. Don’t miss out!

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey guys, this is Abhishek Walia, and I'm here to talk about how you harness microservices architectures for highly scalable, fault-tolerant ML ecosystems. Let's begin. Monolithic architectures have been really prevalent, right? It's the traditional approach: a very tightly coupled code base where everything is packaged into a single executable. Your data processing, your model training, your serving layer, your inference layer, all of that goes into a single deployable unit, which means that at the end of the day, what you're scaling is that one cohesive unit. On top of that, when you run into a hotspot in any of those layers, you have to scale the whole service out, and the RAM and CPU usage is dictated by what that one deployable unit needs, because everything is bundled into it. Another big thing is that there is very limited, or actually no, fault-tolerance isolation, because everything is deployed together. So if one of those processes segfaults, you're toast at that point. Basically, to prevent those kinds of system-wide failures and to ease scaling, things evolved toward microservices architectures. It's modern and convenient: smaller, independently deployable, capsule-sized services which you deploy and operate as really discrete units. Because these units are discrete and independently deployed, you are able to scale them by themselves, and you are able to isolate faults and any issues that happen over the lifetime of a service. Hotspots are also very isolated.
The scaling is really good, because if your inference layer needs a lot more CPU and RAM, you don't have to scale the entire service; you can just scale the inference layer and you're good, right? I personally believe the microservices architecture is a pure win-win for any kind of deployment, especially for ML system deployments. These are the key benefits we've been talking about. It gives you dynamic scalability: because everything is independently deployed, you scale each component according to its own needs. What I mean by that is, if one component needs X amount of RAM and Y number of instances, not every component has to get X amount of RAM and Y number of instances; you can dynamically scale each and every one of them. You get enhanced resilience, right? A segfault in one service isn't going to affect another service, because they're independently deployed. Big win. Technology flexibility: this is a big difference you get. Let's say you deployed everything in Python and suddenly you want your API layer to go to Golang instead, just because Golang is a little faster, it's easier to parallelize things in Golang, and an API layer is very easy to build in Golang. If you're deployed in the monolithic pattern, there's no way you're going to move just that one layer out. To do that, you'd want to move to the microservices pattern, and then you just rewrite your API layer in Golang, deploy those services, and you're good, right? And the other big win is the accelerated deployment process. Because services deploy on their own, your CI/CD processes can decide when each of those services gets deployed. For one, their schedules can be completely different.
Second, you can actually link it to something like a GitHub hook so services continuously keep getting deployed, and you don't have to redeploy your training modules just because your APIs are getting refreshed or getting faster in the backend. So it gives you all those flexibilities as well. All good with microservices, but there are some challenges too, right? Data consistency is a big one. Because you now have all those service boundaries, and the services depend on data from one another, you will have data consistency challenges, and you will have to make sure your data is consistent across all those services and all the instances of those processes you have deployed. There is definitely a communication overhead, right? You're not a monolithic service anymore, so it's not in-process communication; now you're talking about TCP- and gRPC-based communication, which carries some amount of network overhead. Maybe not much, because you might be on the same box, but there's still an inter-process layer, which still has a little more effect on communication throughput. So yeah, it does change some of that. System complexity gets higher, definitely a lot higher, because now you have so many more components to look at, so many more components to monitor, and if, god forbid, you run into issues, you have to debug six different services in a single time-series fashion so you can understand which service failed and what pattern led to that failure. Deployment architectures change a lot too: you now have to make sure the services scale independently and have their resources tuned independently. All that stuff changes as well.
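One common way to cope with the data consistency challenge just mentioned is to make cross-service updates idempotent, so that at-least-once event delivery can't corrupt state. Here's a minimal sketch of that idea; the names (`InventoryView`, `apply_event`) and the event shape are illustrative, not from the talk.

```python
# Sketch: idempotent event application to keep a service-local view
# consistent under duplicate delivery. Names and fields are hypothetical.

class InventoryView:
    """A service-local read model updated from a shared event stream."""

    def __init__(self):
        self.stock = {}
        self._seen = set()  # event IDs already applied

    def apply_event(self, event):
        # At-least-once delivery means duplicates happen; skip replays.
        if event["id"] in self._seen:
            return False
        self._seen.add(event["id"])
        item = event["item"]
        self.stock[item] = self.stock.get(item, 0) + event["delta"]
        return True

view = InventoryView()
e = {"id": "evt-1", "item": "gpu-node", "delta": 3}
view.apply_event(e)
view.apply_event(e)  # duplicate delivery is a no-op
print(view.stock["gpu-node"])  # → 3, not 6
```

Real brokers (Kafka, SQS) can redeliver messages, so tracking processed event IDs, or making the update itself naturally idempotent, is the usual safeguard.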
So I agree there are amazing benefits to the microservices architecture pattern, as long as you make sure all these challenges are taken care of as well. It's not a free buffet unless you address some of these things. From a communication perspective, there are like four different options that come to mind. One: you can almost always say my services are just going to talk to each other via REST or gRPC; it's an API layer, and that's how they're going to communicate, full stop. You don't have to worry about it too much, and it gives you strong consistency guarantees because of the synchronous request/response nature. With gRPC it gets a little easier: you establish an HTTP/2 connection once, and then you reuse that particular connection for all the requests you're going to send. That's much easier because you don't have to construct and tear down an HTTP connection for every single request you make. Win-win, but you just have to make sure you're doing that; the reusable HTTP/2 connection is one of the things you get with gRPC. Another thing that has been happening in recent times is event streaming. People want to reuse the continuous data pipelines they have, that constant barrage of incoming data, and utilize that high-volume data as model inputs. For those use cases, where the incoming data never stops, never ceases to exist, you want to look at event streaming patterns. But again, it's up to your use case specifically whether event streaming is the right pattern for you or not. Another option is the shared data store, where all your instances are stateless and use a single shared data store in the background.
So basically, all the services do is talk to that single data store via transactions, and everybody gets updated through that shared data store. It's a lot heavier, and a lot harder, to do the shared data store model, because you are relying on that one service which holds all the data sets, so be a little careful when you do that. I generally recommend event streaming patterns for these things: an event happened, X happened, this is the current state, please update it everywhere. It could feed a data store, it could be anything. I prefer those patterns, but again, that's just my preference; you have to look at your use case to see if that's useful or not. And message queues have been around for years, decades at this point. They basically ease the decoupling of services from each other. All you do is talk to the message queue: somebody's listening, somebody's sending. It basically allows you to do asynchronous processing while scaling independently, so you get the best of both worlds. But yeah, like I said, for all of them, which one you use depends on your use case. ML-specific deployment approaches: these are not generic, but I have seen them used really heavily in the ecosystems I've worked in. Containers have been really good for formalizing the deployment model; a container is the unit that gets deployed anywhere and everywhere, and it allows you to orchestrate really well. Kubernetes in recent times has done it really beautifully: you don't have to worry about how things are going to happen, the network topologies, the load balancers and how they interact, the domains, the ingress. Kubernetes takes care of all of that really beautifully, and in a well-defined, codified manner, more or less YAML files for almost all of it.
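The message-queue decoupling described above can be sketched with the stdlib `queue` module standing in for a real broker like Kafka or RabbitMQ; the producer and consumer never call each other directly, only the queue. The "inference" here is a toy placeholder, not a real model.

```python
# Sketch: message-queue decoupling between a producer (feature service)
# and a consumer (inference worker). queue.Queue is a stand-in for a
# durable broker; the scoring logic is a deliberately toy function.
import queue
import threading

broker = queue.Queue()  # stand-in for Kafka / RabbitMQ / SQS
results = []

def inference_worker():
    """Consumes messages at its own pace, independent of the producer."""
    while True:
        msg = broker.get()
        if msg is None:  # sentinel: shut down cleanly
            break
        # Toy "inference": score derived from payload length.
        results.append({"id": msg["id"], "score": len(msg["payload"]) * 0.1})

worker = threading.Thread(target=inference_worker)
worker.start()

# The producer only enqueues; it has no reference to the consumer.
for i, payload in enumerate(["short", "a longer payload"]):
    broker.put({"id": i, "payload": payload})

broker.put(None)  # signal shutdown
worker.join()
print(len(results))  # → 2
```

Because the two sides share only the queue, either one can be scaled, redeployed, or rewritten without touching the other, which is exactly the decoupling benefit the talk is after.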
Kubernetes is something amazing. All that needs to happen is that you understand that, with all those great things, there is some overhead in the operational paradigm. As amazing as Kubernetes is, operationally it's a little bit challenging, so make sure you have some experts at hand who can deal with Kubernetes issues and network resolution. And the DNS: it's almost always DNS, I think everybody agrees that whenever there's an issue, it's almost always DNS, right? Kubernetes allows you to intelligently scale all those things, it allows you to run those operations in a fault-tolerant way, it's amazing at scheduling resources, and it's great for A/B testing as well. Blue-green deployments, A/B testing, Kubernetes can do it pretty easily; all you need to know is how to configure those and you're done. Another big one is serverless functions. AWS Lambda comes to mind, and Google Cloud Functions is another; they're more or less the same, but serverless eliminates all that infrastructure management overhead for you. With Kubernetes, one way or the other, you are still allocating resources, making sure the machines have enough capacity, doing some node pooling and related things. Serverless just gets rid of all that complexity: all you do is deploy your function or container in that serverless manner, and you're set. Then there are very specialized ML platforms: KServe, formerly known as KFServing, if you've heard of it, is a really good one, and SageMaker is another. These are frameworks optimized for model serving and for making sure it's performant. If you're already using any of these, I think you're good. If you're evaluating, maybe start with serverless functions, those are the easiest to approach; then look at the specialized platforms, and container orchestration if you have to.
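To make the serverless option concrete, here is a minimal handler in the AWS Lambda `handler(event, context)` shape. The "model" is a hypothetical stand-in, a hard-coded linear scorer instead of a real artifact, and the event format is invented for illustration.

```python
# Sketch: a serverless-style inference handler. The weights below are a
# hypothetical stand-in for a model artifact a real service would load.
import json

WEIGHTS = [0.4, 0.6]  # would normally come from a model registry / S3
BIAS = 0.1

def handler(event, context):
    """Entry point a serverless platform would invoke once per request."""
    features = event["features"]
    score = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return {
        "statusCode": 200,
        "body": json.dumps({"score": round(score, 4)}),
    }

# Local smoke test: no platform needed, it's just a function call.
resp = handler({"features": [1.0, 2.0]}, None)
print(resp["statusCode"])  # → 200
```

This shape is part of the appeal: the platform handles scaling to zero, concurrency, and capacity, and the service shrinks to one function you can also test locally.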
Another big thing that I am a big proponent of is observability and monitoring, right? It was easy in the monolithic fashion, where you deploy that one process and everything is contained in there, so you monitor that one process and you're golden. With microservices sprawl, you now have to monitor so many services. You have to make sure logs are available from all the services, and that resource utilization is well handled so you can find the hotspots. All that stuff keeps adding up, so operationally it's a little more involved compared to the monolithic architecture. But the pros you get from it, I think, outweigh the monolithic way of deploying stuff. From that perspective: model performance and metrics, you want all that stuff in some time-series store like Prometheus, so you can graph it out in something like Grafana, or anywhere else you prefer. Distributed tracing is another one, and this probably comes a little after logging and resource utilization. Once you have distributed tracing, you'll be able to trace a specific event going through your microservices orchestration. It'll have a specific trace ID, and you'll be able to pinpoint performance bottlenecks if you need to; it's really good at that. There are frameworks like Jaeger and Zipkin which help you do that. Please take a look at them if you're looking at distributed tracing; they're pretty well-organized libraries, and they solve a really significant problem which a lot of systems face once they move into the microservices deployment pattern. Like I said, resource utilization and centralized logging are cornerstones. You will want them, you will need them in the long run, to make sure you are able to understand your resource utilization.
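The core idea behind distributed tracing, one trace ID following a request across every service it touches, can be sketched without any tracing library. Here the "services" are plain functions, spans are collected in a list instead of being exported to Jaeger or Zipkin, and all names are illustrative.

```python
# Sketch: trace-ID propagation across service boundaries, the mechanism
# tools like Jaeger and Zipkin build on. Spans go into a list here; a
# real tracer would export them to a collector.
import contextvars
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default=None)
spans = []

def traced(service_name):
    """Record a span per call, reusing the caller's trace ID if present."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            tid = trace_id_var.get()
            if tid is None:  # first hop starts the trace
                tid = uuid.uuid4().hex
                trace_id_var.set(tid)
            spans.append({"trace_id": tid, "service": service_name})
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@traced("feature-service")
def fetch_features(x):
    return [x, x * 2]

@traced("inference-service")
def predict(x):
    feats = fetch_features(x)  # downstream hop inherits the trace ID
    return sum(feats)

predict(3)
print(len({s["trace_id"] for s in spans}))  # → 1: one trace, two services
```

In a real deployment the trace ID travels in request headers (e.g. W3C `traceparent`) rather than a context variable, but the pinpointing-bottlenecks payoff the talk describes comes from exactly this correlation.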
You want to know where hotspots are, if one service is causing too much ruckus, or if there is another service that needs attention. Centralized logging is a must, because now you have six, seven, ten, fifteen different services running, and if there's a fault and you're tracing it through these services, you'll want time-series-style logs which can tell you what happened, in which service, and what the prior or adjacent activities were at the same time in the other services, so you can figure out what's going on and what went wrong. But yeah, it's up to each and every team how you want to handle that. Among the emerging trends I have seen recently are service meshes. Istio is pretty great at creating that service mesh with zero-trust security and fault tolerance, while requiring actually minimal code changes. It deploys itself as sidecars within Kubernetes. I've used Istio; I've not used Linkerd yet, though maybe it's used somewhere internally as well, I honestly don't remember. The Istio framework is pretty good at that. Another thing that has started coming up recently is AI-driven orchestration, with the new frameworks that have been emerging, Google's A2A, Anthropic's service patterns, all those things are gaining steam now. Once these agentic systems actually start doing these orchestrations really well, I think this is going to pick up really heavily, because they'll be able to understand what needs to happen and eventually take the right actions from the details they have available. With that said, I'm just going to move on to the takeaways for a real-world implementation. If you're actually interested in doing a microservices-based deployment, please take a look at these. This is the TL;DR kind of slide I want to present.
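One concrete piece of the fault tolerance a mesh like Istio provides is automatic retries with backoff. The mesh applies this policy in the sidecar with no code changes; the sketch below shows the same policy in application code purely to illustrate what the mesh is doing for you.

```python
# Sketch: retry with exponential backoff, the kind of policy a service
# mesh sidecar applies transparently. Shown in-process for illustration.
import time

def call_with_retries(fn, max_attempts=3, base_delay=0.01):
    """Retry a flaky call, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def flaky_upstream():
    """Simulates an upstream that fails twice, then recovers."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream unavailable")
    return "ok"

result = call_with_retries(flaky_upstream)
print(result)  # → "ok" after two transparent retries
```

Retry budgets matter: unbounded retries can amplify an outage into a retry storm, which is one reason pushing this policy into the mesh, where it can be tuned centrally, is attractive.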
The first and foremost step is going to be identification. You'll want to make sure you identify the right service to start with. One rule of thumb I have is: pick the service that takes the least amount of effort but gives you the most bang for your buck. So if there is a service which is the most painful, but it's going to take a lot more steps to decompose, I would not start there. I would start with a service that is probably a little easier to decompose but is going to give you almost the same amount of result. You're not touching a really critical service right out of the gate, but you're still doing something that's 80% there: take 20% of the effort to get 80% of the gain, and probably start there, right? Service decomposition, this is huge, and this is something you will have to think about. When you are partitioning the whole system, you want to partition it with some things in mind, decomposing it just enough that it doesn't get overdone. What I mean by that is: do not decompose a single process into 15 services just because you want to segregate them into 15. Start with maybe four or five decompositions, see if you need to extract more processes out of them, and then move forward. You don't want to overdo your microservices architecture, because that's not going to help; it's just going to make you skip from one service to another, bouncing around when you're looking at logs and debugging any of those things. So think really carefully about how and where you decide the boundaries and what your decomposition is going to look like. Another big one: look at the orchestrated deployment models. Nowadays Kubernetes is awesome, right?
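The 80/20 rule of thumb above can be made slightly more mechanical by scoring each extraction candidate on expected gain versus effort and starting with the best ratio. This is a toy encoding of the heuristic, not a method from the talk, and the scores are made-up illustrative numbers.

```python
# Sketch: ranking candidate services to extract first by a simple
# benefit-to-effort ratio. Candidate names and scores are hypothetical.

def rank_candidates(candidates):
    """Sort extraction candidates by expected gain per unit of effort."""
    return sorted(candidates,
                  key=lambda c: c["gain"] / c["effort"],
                  reverse=True)

candidates = [
    {"name": "inference-api", "gain": 8, "effort": 2},  # easy, high payoff
    {"name": "training-core", "gain": 9, "effort": 9},  # painful to split
    {"name": "feature-store", "gain": 5, "effort": 4},
]

best_first = rank_candidates(candidates)
print(best_first[0]["name"])  # → inference-api: cheap, high-gain extraction
```

The numbers are a team judgment call in practice, but even rough scores make the "don't start with the most painful service" rule explicit and debatable.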
Like I said, but maybe start with serverless. And if you are a fan of Docker like me, my home lab runs purely on Docker, there's no Kubernetes in my home lab. Docker has some amount of resource allocation, but I don't think there's enough auto-scaling in standalone Docker. With Docker Swarm you get some of those capabilities, yes, but with plain Docker on a single host it's not that great. Measure your results, that's another big one. Once you decompose all your services, make sure you're measuring some of them, to understand whether you have over-decomposed, or whether it's good enough, or whether there are still hotspots where you should change your patterns. All those results are going to come from your observability. If I have to pick and choose the most critical ones, it's these: identification, service decomposition, and observability are my picks for the three biggest. You need the observability, and you need to know how to decompose the services, with pretty well-set boundaries and well-set interfaces for them to talk to each other. Infrastructure modernization can probably wait a little while you figure out those three things. With Kubernetes you'll be able to do observability improvements really quickly as well, but if you have never used Kubernetes before, you'll be learning Kubernetes before you can do all that. So if you are already using Kubernetes, you're good: get Kubernetes going first, and observability is almost free in Kubernetes to a degree, with Prometheus and labels and things like that. Continuous evolution, this is a huge one. Again, it's optional and can probably wait at the back of the line, but you'll want to think about it at some point. You want well-established CI/CD patterns and pipelines which auto-deploy some of these things.
Once you have the first five in play, you'll want continuous evolution to be part of your play as well, so that your systems are robust, fast, and almost always current, instead of: hey, which version are we running? Let's see if we can deploy the latest version. Those are probably my takeaways. With that, thank you for your time. I hope you learned something from this, and if you have any questions, feel free to ping me, feel free to reach out. Thank you again.

Abhishek Walia

Staff Customer Success Technical Architect @ Confluent



