Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey guys, this is Abha Alia, and I'm here to talk about how you can
harness microservices architectures for highly scalable and
fault-tolerant ML ecosystems.
Let's begin.
Monolithic architectures have been really prevalent, right?
It's the traditional approach, with a very tightly coupled code base,
where everything is packaged into a single executable.
Your data processing, your model training, your serving layer, your inference
layer, all that stuff goes into a single deployable unit, which means that
at the end of the day, what you're scaling is that one cohesive unit.
What that also means, on top of that, is that when you run into a hotspot
in any of those layers, you have to scale the whole service out, because
it's a single deployable unit, and the RAM, the CPU, all the resource
usage of that service is going to be dictated by what that one deployable
unit needs, because everything is bundled into it.
Another big thing is that there is very limited, or actually no,
fault isolation, because everything is deployed together.
So if one of those processes seg faults,
I think you're toast at that point.
So basically, to prevent those kinds of system-wide failures and to
ease scaling, things evolved toward microservices architectures.
It's really convenient.
It's modern: smaller, independently deployable, capsule-sized
services which you deploy and operate as really discrete units.
And because these units are discrete and independently
deployed, you are able to scale them by themselves.
You are able to isolate the faults.
You are able to isolate any issues that are going to happen.
Over the lifetime of the service, hotspots are also very isolated.
The scaling story is really good, because if your inference layer needs
a lot more CPU and RAM, you don't have to scale the entire service.
You can just scale the inference layer and you're good, right?
I personally believe microservices architecture is a pure win for
any kind of deployment, especially for ML system deployments.
As we talked about, these are the key benefits, and
we have been talking about this.
It gives you dynamic scalability: because everything is independently
deployed, you scale each component based on the needs of that component.
What I mean by that is, if one component needs
X amount of RAM and Y amount of instances,
not every component has to get X amount of RAM and Y amount of instances.
Now you can dynamically scale each and every one of them.
You get enhanced resilience, right?
A seg fault in one service is not going to affect the other
services, because they're independently deployed.
Big one, big win.
Technology flexibility.
This is a big difference which you get.
Let's say you were deploying everything in Python, and suddenly
you want maybe your API layer to go into Golang instead of Python,
just because Golang is a little faster.
It's easier to parallelize things in Golang as well,
and an API layer is pretty easy to build in Golang.
You want to get to Golang, but you can't, because if you're deployed in
the monolithic pattern, there's no way you're going to move just one layer out.
To do that, you'll want to move to the microservices pattern.
That way, you just rewrite your API layer in Golang, deploy those
services, and you're good, right?
And then the big win is the accelerated deployment process.
Because you deploy them on their own, your CI/CD
processes can decide when each of those services gets deployed.
One, their schedules can be completely different.
Two, you can actually have it linked to something like a GitHub hook, so
they just continuously keep getting deployed, and you don't have to
redeploy your training modules just because your APIs are getting
refreshed or getting faster in the backend.
So it gives you all those flexibilities as well.
All good with microservices, but there are some challenges there as well, right?
Data consistency is a big one of them, right?
Because you now have all those different service boundaries, and those
services depend on data from one another, they're interdependent.
What that leads to is data consistency challenges: you will have to make
sure that your data is consistent across all those services and all the
instances of those services that you have deployed.
There is definitely a communication overhead, right?
Because you're not a monolithic service anymore, it's no longer
in-process communication; now you're talking about TCP- or gRPC-based
communication, which will have some amount of network overhead as well, right?
It may not, because you might be on the same box, but there's still
an interprocess layer, which still has a little bit more effect
on the communication throughput.
So yeah, it does change some of that stuff.
System complexity gets higher.
Definitely a lot higher, because now you have so many more components to look
at, so many more components to monitor, and if, god forbid, you run into
issues, you have to debug like six different services in a single
time-series fashion so that you can understand which service failed
and what pattern led to that failure.
Deployment architectures change a lot.
You now have to make sure they're scaling independently,
that they have their resources tuned independently.
All that stuff changes as well.
So I agree that there are amazing benefits to the microservices architecture
pattern, as long as you make sure that all these challenges are taken care
of as well.
It's not a free buffet unless you address some of these things.
From a communication perspective, there are like four different options, right,
which come to mind.
One: you can almost always say that my services are just
going to talk to each other via REST; it's an API layer, and I'm
going to use REST or gRPC, and that's how they're going to communicate.
Full stop, right?
You don't have to worry about it too much.
It gives you strong consistency guarantees because of the synchronous
nature of the calls, which makes things a little easier.
You can establish an HTTP/2 connection once,
and then reuse that particular connection for all the multiple
requests that you're going to send.
It gets much easier, because now you don't have to construct and
deconstruct an HTTP connection for every single request.
So win-win, but you just have to make sure that you're doing that
with a gRPC layer; that HTTP/2 connection is one of the things you get
with gRPC.
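The connection-reuse idea can be sketched with the standard library: gRPC gives you a single multiplexed HTTP/2 channel out of the box, but the same effect is visible with plain HTTP/1.1 keep-alive. The endpoint name and payload here are made up for the demo.

```python
# Sketch of connection reuse between services: open one connection,
# then send multiple requests over it instead of reconnecting per call.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class PredictHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive: one connection, many requests

    def do_GET(self):
        body = b'{"prediction": 0.92}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One connection, reused for every request: no per-request TCP handshake.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
responses = []
for _ in range(3):
    conn.request("GET", "/predict")
    responses.append(conn.getresponse().read())
conn.close()
server.shutdown()
print(responses)
```

With real gRPC you would create one channel per downstream service at startup and share it across stubs; the win is the same, just handled for you.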
Another thing which has been happening in recent times is event streaming.
People want to reuse those continuous data pipelines that they have,
the constant barrage of data that's coming in, and they want to utilize
that high-volume data as model inputs.
In those use cases, where this incoming data never stops,
never ceases to exist, you want to look at the event streaming
patterns to a degree.
But again, it's up to your use case specifically whether event
streaming is the right pattern for you or not.
Another one is the shared data store pattern, where all your instances
are kind of stateless and use a single shared data store in the background.
Basically, all they do is talk to that single data store via
transactions, and everybody gets updated through that shared data store.
It's a lot heavier, and it's a lot harder to do the shared data store
model, because you are relying on that one service
which holds all those data sets.
So be a little careful when you do that.
I generally recommend event streaming patterns for all these things:
yeah, event happened, X happened.
This is the current state.
Please update it everywhere.
It could be a data store, it could be anything.
But I prefer those patterns because, again, that's just my preference;
you have to look at your use case to see if that's useful or not.
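A minimal sketch of the event-streaming pattern: a producer emits a stream of events and a consumer turns them into model inputs as they arrive. In production the broker would be Kafka or Kinesis; here `queue.Queue` stands in, and the event fields are invented for illustration.

```python
# Event streaming sketch: producer and consumer are decoupled by the
# stream; the consumer does feature extraction as events flow in.
import queue
import threading

events = queue.Queue()
SENTINEL = None  # real streams never stop; the demo stops with this

def producer():
    for i in range(5):
        events.put({"user_id": i, "clicks": i * 2})
    events.put(SENTINEL)

model_inputs = []

def consumer():
    while True:
        event = events.get()
        if event is SENTINEL:
            break
        # turn the raw event into a model input row
        model_inputs.append([event["user_id"], event["clicks"]])

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(model_inputs)
```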
And message queues have been there for years now, right?
Decades, at this point.
They basically ease the decoupling of services from each other.
All you do is talk to the message queue: somebody's listening,
somebody's sending.
That basically allows you to do asynchronous processing
while scaling independently.
So you get the best of both worlds.
But yeah.
Again, like I said, for all of them, it depends on your use case
which one you want to use.
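The decoupling point above can be shown in a few lines: the producer only knows about the queue, not the workers, and the worker count can scale independently. This is a stdlib sketch, not a real broker; the squaring step is a stand-in for actual work like scoring.

```python
# Message-queue sketch: asynchronous processing with independent scaling.
import queue
import threading

tasks = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        item = tasks.get()
        if item is None:  # demo-only shutdown signal
            break
        results.put(item * item)  # stand-in for real work (e.g. scoring)
        tasks.task_done()

NUM_WORKERS = 3  # scale workers without touching the producer
workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

for n in range(10):   # producer side: just enqueue and move on
    tasks.put(n)
tasks.join()          # wait until every message has been processed
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()

print(sorted(results.queue))
```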
ML-specific deployment approaches: these are not something generic, but I
have seen them used really heavily in the ecosystems which I've seen, right?
Containers have been really good
for formalizing the deployment model; a container is that unit
which gets deployed anywhere and everywhere.
It allows you to orchestrate things really well, and
Kubernetes in recent times has done it really beautifully, where
you don't have to worry about how things are going to happen: the network
topologies, the load balancers and how they're going to interact, the
domains, the ingress, all that stuff.
Kubernetes takes care of it really beautifully, and again, in a really
well-defined, codified manner.
More or less YAML files for almost all of it.
It's something which is amazing.
All that needs to happen is that you understand that with all those
great things, there might be some overhead in the operational
paradigm for these things.
Kubernetes, as amazing as it is, is operationally
a little bit challenging.
So make sure that you have some experts at hand who can deal with
Kubernetes issues, network resolutions, the DNS.
It's almost always DNS; I think everybody agrees that whenever there's
an issue, it's almost always DNS, right?
Otherwise, it allows you to intelligently scale all those things.
It allows you to run those operations in a fault-tolerant way.
It's amazing at scheduling resources, and it's amazing for
A/B testing as well, right?
Blue-green deployments, A/B testing.
Yeah, Kubernetes can do it pretty easily.
All you need to know is how to configure those, and you're done.
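Here is a hedged sketch of what a blue-green setup looks like, expressed as plain Python dicts with the same shape as the YAML Kubernetes actually consumes. All names (`model-api`, the `track` label, the image tags) are hypothetical choices for the demo.

```python
# Blue-green sketch: two Deployments differ only by a "track" label and
# image; the Service selector decides which one receives traffic.
def deployment(color, image, replicas):
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"model-api-{color}"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": "model-api", "track": color}},
            "template": {
                "metadata": {"labels": {"app": "model-api", "track": color}},
                "spec": {"containers": [{"name": "api", "image": image}]},
            },
        },
    }

blue = deployment("blue", "registry.example.com/model-api:v1", 3)
green = deployment("green", "registry.example.com/model-api:v2", 3)

# The Service routes traffic; flipping "track" is the blue-green cutover.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "model-api"},
    "spec": {
        "selector": {"app": "model-api", "track": "blue"},
        "ports": [{"port": 80, "targetPort": 8080}],
    },
}
service["spec"]["selector"]["track"] = "green"  # cut over to v2
print(service["spec"]["selector"])
```

The nice part is that the cutover is a one-field change, and rolling back is flipping it back to blue while the old Deployment is still running.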
Another big one is serverless functions.
AWS Lambda comes to mind, right?
Google Cloud Functions is another one.
Again, they're more or less the same, but they eliminate all the
infrastructure management overhead for you, right?
With Kubernetes, one way or the other, you are either
allocating resources, making sure the machines have enough
capacity, or still doing some node pooling and all those related things.
Serverless just gets rid of all that complexity.
All you do is deploy the containers in that serverless
function manner, and you're good.
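An inference endpoint as a serverless function can be sketched in the AWS Lambda handler shape: an event dict in, a response dict out. The "model" here is a stand-in linear scorer with made-up weights; a real function would load the model artifact once at cold start, outside the handler.

```python
# Serverless inference sketch in the Lambda handler shape.
import json

WEIGHTS = [0.4, 0.2, 0.1]  # pretend these were loaded from storage at cold start

def handler(event, context):
    # parse an API-Gateway-style event body into features
    features = json.loads(event["body"])["features"]
    score = sum(w * x for w, x in zip(WEIGHTS, features))
    return {"statusCode": 200, "body": json.dumps({"score": round(score, 4)})}

# Local invocation with a fake event, the way you'd unit-test the handler
resp = handler({"body": json.dumps({"features": [1.0, 2.0, 3.0]})}, None)
print(resp)
```

The platform handles scaling from zero to many concurrent invocations; your only unit of deployment is this function (or a container wrapping it).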
Then there are specific, very specialized ML platforms: KServe, formerly
known as KFServing, if you have heard of that.
That's a really good one.
SageMaker is another one.
These are frameworks optimized for model serving, right,
and for making sure it's performant.
So if you are using any of these, I think you are good.
If you are looking at adopting any of these,
maybe start with the serverless functions.
Those are the easiest to approach.
Then look at the specialized platforms, and the container
orchestration if you have to.
Another big thing that I am a big proponent of is observability
and monitoring, right?
Monitoring was easy in the monolithic fashion, where
you just deploy that one process and everything is contained in there.
So you monitor that one process and you're golden, right?
With microservices sprawl, what happens is
now you have to monitor so many services.
You have to make sure that logs are available from all the services,
and that resource utilization is pretty well handled so that you can find
the hotspots.
All that stuff keeps adding up.
So operationally it's a little bit more involved compared
to the monolithic architecture.
But the pros that you get from it, I think, outweigh the
monolithic way of deploying stuff.
So from that perspective: model performance and metrics, right?
You want all that stuff in some time-series store like Prometheus, right?
So you can graph it out in something like Grafana or Kibana, or
anywhere else which you prefer, right?
Distributed tracing is another one, right?
This probably comes a little bit after logging and making sure that
you have the resource utilization.
But once you have distributed tracing, you'll be able to trace
a specific event going through your microservices orchestration.
It'll have a specific trace ID, and you'll be able to pinpoint performance
bottlenecks if you need to, right?
It's really good at that.
There are frameworks like Jaeger and Zipkin which help you do that.
Please take a look at them if you're looking at distributed tracing;
they're pretty well-organized libraries, and they solve a really
decent problem which a lot of systems will face once they move into that
microservices deployment pattern.
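The core idea behind those tools can be shown in a few lines: one trace ID is minted at the edge and carried through every downstream call, so logs from all services can be stitched back together. Jaeger and Zipkin add spans, timing, and storage on top; the service names here are hypothetical.

```python
# Trace propagation sketch: mint the trace id once at the edge,
# pass it through every hop, and log it everywhere.
import uuid

def make_headers(trace_id=None):
    # reuse the incoming trace id, or mint one at the entry point
    return {"x-trace-id": trace_id or uuid.uuid4().hex}

log = []

def feature_service(headers):
    log.append(("feature-service", headers["x-trace-id"]))
    return [0.1, 0.2]

def inference_service(headers):
    log.append(("inference-service", headers["x-trace-id"]))
    features = feature_service(headers)  # propagate, don't mint anew
    return sum(features)

def api_gateway():
    headers = make_headers()  # edge of the system: mint the trace id
    log.append(("api-gateway", headers["x-trace-id"]))
    return inference_service(headers)

api_gateway()
# every hop logged the same trace id, so one query finds the whole request
print(log)
```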
Like I said, resource utilization and centralized logging are cornerstones;
you will want them, you will need them in the long run to make sure that
you are able to understand your resource utilization:
where hotspots are, if one service is causing too much ruckus, if there is
another service which needs to be handled.
Centralized logging is a must, because now you have 6, 7, 10, 15 different
services which will be running, and
if there is a fault and you're tracing out the fault in these services,
you'll want to have time-series-styled logs which can tell you what happened,
in which service, and what the prior activities were, or the adjacent
activities happening at the same time in other services, so that you can
figure out what's going on and what went wrong.
But yeah, it's up to each and every one of the teams
how you want to handle that.
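The time-series-styled log idea can be sketched as structured JSON lines: every service emits a timestamp and service name, so a central store can interleave them into one view. The sink is a plain list here; in practice it would be something like ELK or Loki, and the field names are illustrative.

```python
# Structured, centralized logging sketch: JSON lines with timestamps,
# interleaved across services at query time.
import json
import time

log_lines = []  # stand-in for a central log sink

def log_event(service, message, **fields):
    log_lines.append(json.dumps({
        "ts": time.time(),
        "service": service,
        "message": message,
        **fields,
    }))

log_event("feature-service", "features computed", user_id=42)
log_event("inference-service", "prediction served", latency_ms=18)
log_event("api-gateway", "request completed", status=200)

# the central view: parse and interleave by timestamp across services
events = sorted((json.loads(l) for l in log_lines), key=lambda e: e["ts"])
print([e["service"] for e in events])
```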
An emerging trend which I have seen recently is service meshes, right?
Istio is pretty great at creating that service mesh with
zero-trust security and fault tolerance, while requiring actually
minimal code implementation.
They deploy themselves as sidecars within Kubernetes; I have used Istio.
I've not used Linkerd yet, but maybe it's used somewhere internally as well.
I don't know, and I don't remember honestly.
But the Istio framework is pretty good at that.
Another thing which has started coming up recently is AI-driven
orchestration, with all the new frameworks that have been coming
up: Google's A2A, Anthropic's service patterns, all those things are
gaining steam now.
Once these agentic systems actually start doing
these orchestrations really well, I think this is going to pick up really
heavily, because such a system is going to be able to understand what it
needs to do and eventually take the right actions, because of those
details which it has available, right?
With that said, I'm just gonna move on to the takeaways for
a real world implementation.
If you're actually interested in doing a microservices based deployment
please take a look at these.
And this is the TLDR kind of a slide deck, which I wanna present.
The first and foremost step is going to be identification.
You'll want to make sure that you identify the right service to start with.
One rule of thumb that I have is: pick a service which will be the
least amount of effort for you, but will give you the most
bang for your buck, right?
So if there is a service which is the most painful, but it's
going to take a lot more steps to decompose that service and do all
that stuff, I would not start there.
I would start with the service which is probably a little easier to
decompose, but is going to give you almost the same amount of result.
So you're not doing the really critical service right out of the gate, but
you're still doing something which is 80% there; if it's going to take
20% of the effort to get 80% of the gain, probably start there, right?
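That 80/20 pick can be sketched as a simple ranking: score each candidate service by expected gain divided by decomposition effort, and start with the best ratio rather than the most painful service. The service names and scores are made up for illustration.

```python
# Identification sketch: rank candidate services by bang-for-buck.
candidates = [
    {"name": "inference-api", "gain": 9, "effort": 9},  # painful, big win
    {"name": "feature-store", "gain": 8, "effort": 3},  # big win, cheap
    {"name": "training-jobs", "gain": 4, "effort": 6},
    {"name": "report-export", "gain": 2, "effort": 1},
]

def bang_for_buck(service):
    return service["gain"] / service["effort"]

first_pick = max(candidates, key=bang_for_buck)
print(first_pick["name"])  # the cheap, high-gain service wins
```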
Service decomposition: this is huge, right?
And this is something you will have to think about.
When you are partitioning that whole system, you want to partition
it with some things in mind, where you are probably just decomposing it
enough that it doesn't get overdone.
What I mean by that is: do not decompose a single process into
15 services just because you want to segregate them out into 15.
Start with maybe four or five decomposed services,
see if you need more processes and extractions out of them,
and then move forward from there.
You don't want to overdo your microservices architecture,
because that's not going to help.
It's just going to make you skip from one service to another.
You just keep moving and bouncing around when you're looking at the logs
and when you're debugging any of those things.
But yeah, so think carefully, really carefully, about how and where you
decide the boundary and what your decomposition is going to look like.
Another big one: you have to look at the orchestrated
deployment models.
Nowadays Kubernetes is awesome, right, like I said,
but maybe start with serverless.
And if you are a fan of Docker like me: my home lab runs purely on Docker.
There's no Kubernetes in my home lab, but I use Docker, and it has some
amount of resource allocation, but I don't think there's enough
auto-scaling in Docker standing by itself.
With Docker Swarm, you get some of these things.
Yes, the capabilities are there, but with plain Docker on
the same host, it's not that great.
Measure your results.
That's another big one, right?
Make sure that once you decompose all your services, you are measuring some
of them to understand: have you over-decomposed, or is it good enough,
or are there still some hotspots where you can change your patterns?
All those results are going to come from your observability.
If I have to pick and choose, the most critical ones are these:
identification, service decomposition, and observability are my picks
for the three biggest, right?
You need the observability.
You need to know how to decompose the services, and have pretty well-set
boundaries and pretty well-set signatures for them to talk to each other.
Infrastructure modernization can probably wait a little while
you figure out these three things.
With Kubernetes, you will be able to do observability improvements
really quickly as well.
But if you have never used Kubernetes before, you'll probably be busy
learning Kubernetes before you do all that stuff.
So if you already are using Kubernetes, you're good, right?
Get Kubernetes going first.
Observability is almost free in Kubernetes, to a degree, right,
with Prometheus and labels and things like that.
Continuous evolution: this is a huge one, right?
Again, it's an optional thing and can probably wait at the back of
the line, but you'll want to think about it at some point in time.
You want good, established CI/CD patterns and pipelines
which auto-deploy some of these things.
Once you have the first five in play, you will want continuous
evolution to be part of your play as well, so that your systems are robust,
fast, and almost always current.
Instead of, "Hey, which version are we running?
Let's see.
And let's see if we can deploy the latest versions."
Those are probably my takeaways.
And with that, thank you for your time.
I hope you learned something from this, and if you have any
questions, feel free to ping me.
Feel free to reach out to me.
Thank you again.