Conf42 MLOps 2025 - Online

- premiere 5PM GMT

MLOps at Trillion-Parameter Scale: Operationalizing Large Language Model Training Pipelines


Abstract

Master MLOps for trillion-parameter LLMs: automated training pipelines, failure recovery, cost optimization, and production deployment strategies from real-world large-scale AI systems.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Anjan Dash. I studied at the Indian Institute of Management, and I'm currently a technical leader at Meta, where I work on ads ranking systems. We use large ML models to find the right ads to show to users on Facebook, Instagram, and so on. Today I'm here to talk about MLOps at massive scale: training trillion-parameter models, and the operational challenges that come up while training these really large language models. Okay, let's dive right in.

The first item on the agenda is how to deal with the scale, and what the challenges are in training, operationalizing, serving, and running inference on these very large machine learning models. The second is the infrastructure components required to build and serve models of this kind. Next is pipeline orchestration and automation: we'll talk about frameworks and techniques for managing training cycles that run for weeks or sometimes months. Then monitoring and failure recovery: these models train for a very long time, so which metrics should we track, because otherwise we are flying blind, and how do we keep training stable so that if an error happens the loss is not catastrophic. Last is resource optimization and cost management: strategies to get the most out of the hardware so that we can bring the cost down.

So the first topic is the scale challenge, and the scale here is very high, because we are talking about trillion-parameter models. The first challenge, which we already touched on, is really long training cycles. Because of these extended cycles, it takes a long time to find out whether anything worked: you have data, you train for weeks, and only after those weeks do you find out whether the model is good or not. The second is numerous distributed GPUs working in a synchronized, parallel manner: you need thousands and thousands of GPUs, and they all have to stay synchronized, otherwise you will not get the model trained. The third challenge is cost, because this training happens in a data center with a lot of GPUs; the GPU cost is massive, the power cost is massive, and on top of that the people with the technical expertise for these systems cost companies a lot, so the total training cost is substantial. The fourth is the massive amount of training data: you have a very big cluster and a very big model you want to train, but what about the data? These models are extremely data-hungry, so you need really massive datasets to feed them. And the last challenge is the considerable volume of model checkpoints. You have a really big trillion-parameter model, and if you want to save a snapshot, say you train for one hour and want to save the weights as they are right now, you have to save the whole model, and saving that means you need that much storage. These are the biggest challenges of training models this big.
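To make the checkpoint-volume challenge concrete, here is a rough back-of-the-envelope sketch. The byte counts assume FP16 weights plus FP32 Adam optimizer state, which is a common setup but an assumption on my part rather than a figure from the talk.

```python
# Back-of-the-envelope checkpoint size for a trillion-parameter model.
# Assumptions (not from the talk): FP16 weights plus FP32 Adam optimizer
# state (master weights, momentum, variance) are saved in each checkpoint.

PARAMS = 1e12                                # one trillion parameters
fp16_weight_bytes = 2 * PARAMS               # 2 bytes per FP16 weight
adam_state_bytes = (4 + 4 + 4) * PARAMS      # FP32 master copy + momentum + variance

total_bytes = fp16_weight_bytes + adam_state_bytes
print(f"Weights only:    {fp16_weight_bytes / 1e12:.1f} TB")
print(f"Full checkpoint: {total_bytes / 1e12:.1f} TB")
# -> roughly 2 TB of weights and on the order of 14 TB per full checkpoint,
#    which is why checkpoint storage and write bandwidth become first-class concerns.
```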
Now let's take a look at the financial realities: what costs are involved here. We already said this takes a lot of time, so the weekly training cost is very high; that's the first thing. The second is the cost per failed hour. If training fails for some reason while it is running, the cost of that failure is high: your GPUs sit idle, the power you've already invested in running them is gone, and you might have to go back to the last checkpoint, so any kind of downtime costs a lot. Then there is potential waste: the data isn't arriving so training stalls, or the power goes out or isn't sufficient. There is always a chance of waste, and we have to make sure we minimize it. The last one is training duration: since training goes on for a really long time, you need the right strategies to minimize cost, because a long run means a lot of money.
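As a rough illustration of the cost-per-failed-hour point, here is a small sketch. The cluster size, hourly rate, and rollback window are hypothetical numbers I'm assuming for illustration, not figures from the talk.

```python
# Illustrative cost of an hour of downtime on a large training cluster.
# All numbers below are hypothetical assumptions, not figures from the talk.

NUM_GPUS = 8192          # hypothetical cluster size
GPU_HOUR_COST = 2.50     # hypothetical all-in $/GPU-hour (hardware + power + facility)
ROLLBACK_HOURS = 1.5     # hypothetical amount of work lost when rolling back to the last checkpoint

idle_cost_per_hour = NUM_GPUS * GPU_HOUR_COST
lost_progress_cost = NUM_GPUS * GPU_HOUR_COST * ROLLBACK_HOURS

print(f"Idle cluster cost per failed hour: ${idle_cost_per_hour:,.0f}")
print(f"Cost of recomputing lost progress: ${lost_progress_cost:,.0f}")
# With thousands of GPUs, every hour of downtime plus the rolled-back work quickly
# reaches tens of thousands of dollars, which is why failure recovery and checkpoint
# strategy are treated as cost problems, not just reliability problems.
```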
Okay, moving on. Let's look at the second agenda item: the infrastructure components required to build and operationalize these large language models. The first is obvious: you need distributed training infrastructure, because these models are so large that they cannot fit on one machine, and you want a very big cluster just to load and train them. And since training is distributed, you need very high-bandwidth connections between GPUs and between nodes so that training is actually effective; otherwise you lose a lot of time just moving data. On top of that, you can use custom kernels, programs written for the specific accelerator, to harness the hardware-specific acceleration capabilities these devices provide and train the model more efficiently. The next component is the data pipeline system: you have your distributed training infrastructure, but these models are data-hungry, so you need a pipeline that feeds data into the cluster continuously. The third is orchestration frameworks: you need frameworks that are GPU-aware and can schedule the right kind of jobs onto the right kind of hardware; orchestration is very important. The last one is monitoring and observability, and this component is very important too: which metrics do you want to track, and if those metrics go haywire, up or down, do you have the right kind of alerts set up? That way, as the training progresses, you know what's going on, because without the metrics you are basically flying blind.

So let's move on and take a deeper look at the four components: distributed training frameworks, data pipelines, automation and orchestration frameworks, and monitoring. Starting with distributed training frameworks, here we are showing two: Megatron from NVIDIA and DeepSpeed from Microsoft. Megatron has tensor parallelism for efficient layer distribution: a neural network has a lot of layers, and tensor parallelism splits the computation within those layers across GPUs to make it more efficient. The next technique is sequence parallelism. What is a sequence? The prompt, the text you send to a large language model, is a sequence. Input sequences can be really long and have a very big memory footprint, so with sequence parallelism we can chunk the sequence and assign each chunk to a specific GPU, and that way the memory footprint stays manageable. Megatron is optimized for NVIDIA GPUs, such as A100 and H100, though it can target other GPUs as well. Similarly, DeepSpeed has the ZeRO optimizer, which optimizes memory usage; pipeline parallelism for balanced computation; and offloading techniques that extend memory to the CPU and NVMe. These are the different techniques distributed frameworks use to make training faster. Modern massive-scale training combines multiple parallelism strategies simultaneously: tensor parallelism, pipeline parallelism, data parallelism, and sequence parallelism all working together to train the model more efficiently.
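To give a sense of how some of these techniques are switched on in practice, here is a minimal sketch of a DeepSpeed-style configuration combining ZeRO memory partitioning, mixed precision, and CPU offloading. The specific values, batch sizes, and the stand-in model are illustrative assumptions, not settings from the talk; a real run would be launched with the DeepSpeed launcher across many nodes.

```python
import deepspeed
import torch

# Illustrative DeepSpeed configuration: ZeRO stage 3 partitions parameters,
# gradients, and optimizer state; FP16 enables mixed precision; the optimizer
# state is offloaded to CPU memory. Values here are assumptions for illustration.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 64,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
    },
}

model = torch.nn.Linear(4096, 4096)  # stand-in for a real transformer model

# deepspeed.initialize wraps the model in a distributed engine that applies
# the configuration above; training then calls engine.backward()/engine.step().
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```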
Okay, next let's talk about pipeline orchestration. We've covered the distributed components, the distributed architecture, and the distributed frameworks; now, how does the training pipeline itself fit together? The first step is data preparation: without data you don't have a model, so you prepare the data and feed it to these data-hungry training clusters where the models are being trained. The next step is training orchestration: you orchestrate the training across these very big GPU clusters. After that you have a model, and you have to evaluate it and see whether it is good or bad. If it's good, you register it: you save the model, version it, and record it in a way that, when you have thousands of versions, you can find the right one and know which version worked and which didn't. The good model checkpoints, the snapshots, are stored in the model registry. All of these pipeline orchestration components are what keep the whole thing running smoothly for a long duration, and we are talking about weeks or sometimes months.

Okay, moving on. The third topic is automated recovery strategies: what kinds of failures can happen, and what can we do when a failure happens? The first piece is failure detection: continuously monitoring for hardware failures, for gradient explosions (you are training, you have the forward pass and the backward pass, you are computing gradients, and a sudden gradient spike has to be flagged), and for data corruption issues. That's very important. The next piece is checkpoint management, which we touched on a little earlier: your training is going on for two weeks, but it's not that you only see an output after two weeks; you get a checkpoint after every slice of training, and you keep saving checkpoints as training progresses. You need the right strategy so that if something fails, you know which checkpoint to look at and how many checkpoints you can afford to fall back. The next one is resource reallocation: dynamic provisioning and cluster reconfiguration happen here. Your resources are allocated for one training run, but you should be able to reallocate them when you need to. Then training resumption: if something fails, training should resume automatically; it's not that you have to go and fix something manually, it should happen on its own. Robust recovery automation is mission-critical.
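As a minimal sketch of the checkpoint-management and automatic-resumption ideas, here is a plain PyTorch pattern. The path, save interval, and single-file format are illustrative assumptions; a real trillion-parameter run would use sharded, distributed checkpoints rather than one file.

```python
import os
import torch

CKPT_PATH = "checkpoints/latest.pt"   # hypothetical path, single file for illustration
SAVE_EVERY = 500                      # hypothetical checkpoint interval in steps

def save_checkpoint(step, model, optimizer):
    # Persist everything needed to resume: weights, optimizer state, and step counter.
    tmp = CKPT_PATH + ".tmp"
    torch.save(
        {"step": step, "model": model.state_dict(), "optim": optimizer.state_dict()},
        tmp,
    )
    os.replace(tmp, CKPT_PATH)  # atomic rename so a crash mid-write can't corrupt the file

def load_checkpoint(model, optimizer):
    # On (re)start, fall back to the latest good checkpoint instead of step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"] + 1

def train(model, optimizer, data_loader, loss_fn, total_steps):
    step = load_checkpoint(model, optimizer)  # automatic resumption after a failure
    for batch, target in data_loader:
        if step >= total_steps:
            break
        optimizer.zero_grad()
        loss = loss_fn(model(batch), target)
        loss.backward()
        optimizer.step()
        if step % SAVE_EVERY == 0:
            save_checkpoint(step, model, optimizer)
        step += 1
```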
Okay, moving on. What kind of monitoring are we doing, and what metrics are we tracking? The first group is system metrics, which are basically hardware metrics: GPU utilization, network bandwidth, disk I/O, and power. But you are also training a model, so there are training metrics: loss, gradient norms, learning rate, and the evolution of the parameters, how they are doing. And for all these system and training metrics you have to have some kind of alerting: if something is not right or not trending right, an alert should fire. Anomaly detection algorithms can notify folks whenever there is an anomaly, and since you have the data, you can also do predictive failure analysis on top of it. You have to notify over multiple channels: it's not only a phone that rings, you have the phone, Slack, instant messages, email, calls, everything. And you need an automated escalation policy: if somebody doesn't pick up the phone and fix it, keep escalating. So that is the real-time monitoring system.

Now let's move on to advanced monitoring, the early warning systems. The first signal is gradient explosions. The gradient is the way the weights get adjusted: training the model is nothing but adjusting the weights of the parameters, and when the gradient suddenly shoots up really high, it just messes up the training; the stability of the run goes sideways. So monitoring for sudden gradient spikes is super important. The next is activation distribution shifts: tracking layer output distribution statistics for unexpected pattern changes. If you are tracking the activations and there is a very large change, the system should take action there too. Then there is validation and training divergence: you have trained the model and you validate it to see whether the trained model actually performs; if the difference between training and validation is really high, something is wrong and we need to be notified. Similarly, learning rate and loss correlation: monitoring for unexpected loss behavior during learning rate changes also needs to be handled. Early detection basically enables intervention before a catastrophic failure happens. If something is going wrong, you can intervene: a temporary learning rate reduction; selective gradient clipping adjustments, so if the gradient is running high you clip it a little; emergency checkpointing, so when things are going bad you quickly save a good snapshot of the weights and can train from there; or some automatic hyperparameter tuning.

Now let's talk about computational efficiency optimizations. We touched on hardware acceleration and those kinds of things, but here are some specific optimization techniques. The first is mixed precision training: using 16-bit floating point instead of 32-bit reduces the memory footprint and increases the throughput of the system, because fewer bits mean less computation and less memory, so the overall throughput improves. The next is activation checkpointing, which is basically a mechanism to trade extra compute for a smaller memory footprint: during training you create activations and normally save them so they can be used in backpropagation; with activation checkpointing, instead of saving them, you recompute them on the fly during the backward pass. Then there is FlashAttention, a special technique that makes the attention computation much faster with a lower memory footprint, which can really help speed up model training. With all these computational efficiency optimizations you get faster training cycles and lower costs.
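As a minimal sketch of mixed precision training, with a gradient-clipping step of the kind mentioned above, here is the standard PyTorch AMP pattern. The stand-in model, batch shapes, and clipping threshold are illustrative assumptions.

```python
import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).to(device)   # stand-in for a real transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()             # keeps FP16 gradients numerically stable
MAX_GRAD_NORM = 1.0                              # illustrative clipping threshold

def train_step(batch, target):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in FP16 where safe, cutting memory use and boosting throughput.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()
    # Unscale before clipping so the threshold applies to the true gradient norm.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```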
Okay, moving on to resource optimization strategies. This approach is basically a dynamic resource allocation strategy: your resource requirements change throughout the training run. In the initial phases you need a lot of memory; in the middle phases you need more compute and stable memory; and in the final phases you might scale things down, so the resource requirements there can look different again. There are charts that show how things change across these phases: the early phase, the middle phase, then scaling down and refinement. Each phase has a different resource profile, and if we can manage that, it improves the overall cost and resource usage.

Okay. The next topic is model deployment. You have the whole thing ready: you have trained for weeks and finally have a model that can be used in production. But to use a model in production, you need to do a lot of things. First, serving these very big models requires a lot of memory, which means you need substantial GPU capacity just to serve prediction requests against them. Next is inference latency: you have a model, but when a request comes in, users expect a super fast response, and if the inference latency is high, because you have too few GPUs or your bandwidth is not good, it will not work, so we have to be very careful about it. Inference latency increases as the sequence length increases: when the input sequence is very long, the attention computation in the transformer layers scales up a lot. Then there is throughput optimization: you can use batching strategies to balance latency against throughput. So the key deployment strategies are, first, tensor parallelism across multiple GPUs within a node, so within a node you have multiple GPUs and you split the model's tensors across them; second, quantization, which reduces the memory footprint because you store the weights as lower-precision numbers; third, KV cache optimization for efficient context handling, so you cache the attention keys and values instead of recomputing them; and finally, specialized high-performance hardware, which you can always use to improve the performance of your system.
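To illustrate the quantization idea, here is a minimal hand-rolled sketch of symmetric int8 weight quantization in plain PyTorch. Production systems would rely on a serving stack's built-in quantization rather than code like this, and the layer size is an arbitrary assumption.

```python
import torch

def quantize_int8(weight: torch.Tensor):
    # Symmetric per-tensor quantization: map FP32 weights to int8 with one scale factor.
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximate FP32 weight for use in the matmul.
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)  # stand-in for one transformer weight matrix

q, scale = quantize_int8(w)
fp32_bytes = w.numel() * 4
int8_bytes = q.numel() * 1
print(f"FP32: {fp32_bytes / 1e6:.0f} MB  ->  int8: {int8_bytes / 1e6:.0f} MB (4x smaller)")

# Quantization error stays small relative to the weight magnitudes.
print(f"Max abs error: {(w - dequantize(q, scale)).abs().max().item():.5f}")
```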
Okay, so let's come to the key takeaways. For MLOps to work for you, you have to take an infrastructure-first approach: at massive scale, MLOps is not an afterthought, it has to be the critical foundation that makes training possible, so you have to invest in distributed systems and in distributed-systems expertise, people who know what they are doing. Next, automate everything: manual intervention during extended training runs is impractical, you simply can't do it by hand, so you need comprehensive automation for the monitoring, the recovery, and the optimization we talked about. Then multi-level monitoring: you have to integrate hardware-level, system-level monitoring and alerting, and similarly training-level metrics and training-level alerting; multi-level monitoring and alerting has to be there. Last but not least is cost: cost is a first-class metric. At this scale, with trillion-parameter large language models, MLOps efficiency directly impacts substantial training costs, and you can reduce the running cost significantly by improving your MLOps. So you have to track it, optimize it, report it, and keep improving it, and it has to be a goal alongside model performance: you track how well the model you train performs, but you should be tracking the cost too. Okay, so those are the key takeaways, and with that I would like to say thank you very much and have a nice day.
...

Anjan Dash

Software Engineer @ Meta

Anjan Dash's LinkedIn account


