Conf42 Golang 2025 - Online

- premiere 5PM GMT

Evolving Architectures in Distributed Deep Learning: From Parameter Servers to ZeRO Systems

Abstract

Discover the future of AI! Dive into the evolution of distributed training, from parameter servers to ZeRO systems. Explore how innovations like Ring-AllReduce, ZeRO optimizations, and advanced algorithms are transforming scalability, efficiency, and speed for massive AI models!

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, thanks for tuning in. Today we'll talk about the architectural evolution that has happened in distributed training over the past decade. We'll explore how distributed training has evolved to keep up with the explosive growth in model size. Deep learning models have grown from millions of parameters to trillions of parameters, and as models have scaled, so have the challenges: memory limits, communication overhead, and resource usage all became problems, and many things have been built to get around those issues. We'll walk through the architectural milestones, from parameter servers to Ring-AllReduce, pipeline parallelism, and ZeRO. Along the way, we'll see how these advances work alongside optimizers like LARS and LAMB to train larger models faster and more efficiently. Let's begin.

Between 2012 and 2020, model sizes skyrocketed. A single GPU just couldn't keep up, so we needed distributed training, splitting work across devices, but that created new problems. Early systems used data parallelism: each device processes a batch and syncs with the others. Sounds simple, but syncing meant a lot of communication, especially for big models, and as GPUs waited on each other or on the network, efficiency tanked. The other issue was memory. As models grew, so did the need to store parameters, gradients, and optimizer states, so memory often became the bottleneck even with multiple devices.

Then came the parameter server architecture, which was introduced to coordinate multiple workers. A central node holds the model; workers compute gradients and send them to the server, which then updates the model. It had two sync modes, synchronous and asynchronous. In the synchronous mode, all workers wait and update together: simple but slow. In the asynchronous mode, workers move at their own pace: faster, but it doesn't optimize as well and there can be inconsistencies. So while useful, the centralized server became a choke point, and you guessed it, because there was a centralized server, as you scaled to dozens or hundreds of workers the server got overloaded, much like everyone trying to exit a stadium through one door. Plus, it introduced a single point of failure. So as we scaled up, the architecture started to fall apart.

A few early optimization challenges stood out. First, communication overhead: the frequent syncing of massive tensors overwhelmed the bandwidth. Second, the straggler effect: in a synchronous setup, slow workers held everyone back, like waiting for the last hiker in the group. Third, idle resources: GPUs spent time waiting instead of computing, especially in mixed hardware setups. We needed smarter ways to share data, manage memory, and keep the hardware busy.

Then came Ring-AllReduce, which is basically a better way to sync. To get rid of the central server bottleneck, we turn to Ring-AllReduce. Imagine arranging all GPUs in a ring: instead of syncing with a server, each GPU talks only to its neighbors, so gradients are passed around the ring, gradually summed, and shared. This reduces network congestion and scales well. Each node does equal work, with no bottlenecks. It's like replacing a traffic jam at one big intersection with a smooth-flowing roundabout. How's that for an analogy?
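To make the push/pull pattern behind the parameter server concrete, here is a minimal single-process sketch of the synchronous mode. The class name, the three-worker setup, and the toy linear-regression workload are illustrative assumptions, not any particular framework's API.

```python
import numpy as np

rng = np.random.default_rng(0)

class ParameterServer:
    """Central node: holds the weights and applies averaged gradients."""
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.weights.copy()           # workers fetch the latest weights

    def push(self, gradients):
        # Synchronous mode: wait for every worker, then apply the mean gradient.
        self.weights -= self.lr * np.mean(gradients, axis=0)

def worker_gradient(weights, data, targets):
    """Toy linear-regression gradient computed on one worker's mini-batch."""
    preds = data @ weights
    return 2 * data.T @ (preds - targets) / len(targets)

server = ParameterServer(dim=4)
for step in range(100):
    grads = []
    for _ in range(3):                        # three workers, one batch each
        x = rng.normal(size=(16, 4))
        y = x @ np.array([1.0, -2.0, 0.5, 3.0])
        grads.append(worker_gradient(server.pull(), x, y))
    server.push(grads)                        # every update funnels through one node
print(server.weights)                         # approaches [1, -2, 0.5, 3]
```

Every pull and push goes through the single server object, which is exactly the choke point and single point of failure described above.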
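And here is a minimal, single-process NumPy simulation of the reduce-scatter / all-gather pattern behind Ring-AllReduce. The worker count and chunk sizes are made up for illustration; real libraries do this with actual peer-to-peer transfers between devices, this only shows the data movement.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate Ring-AllReduce over a list of per-worker gradient vectors.

    Phase 1 (reduce-scatter): after N-1 steps each worker owns the full
    sum of one chunk. Phase 2 (all-gather): the summed chunks circulate
    for another N-1 steps until every worker has the complete result.
    """
    n = len(grads)                                   # number of workers in the ring
    chunks = [np.array_split(g, n) for g in grads]   # each gradient split into N chunks

    # Reduce-scatter: in step s, worker i sends chunk (i - s) to worker i+1,
    # which adds it into its own copy of that chunk.
    for step in range(n - 1):
        for i in range(n):
            src, dst = i, (i + 1) % n
            c = (i - step) % n
            chunks[dst][c] = chunks[dst][c] + chunks[src][c]

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        for i in range(n):
            src, dst = i, (i + 1) % n
            c = (i - step + 1) % n
            chunks[dst][c] = chunks[src][c]

    return [np.concatenate(c) for c in chunks]

# Four workers, each with a local gradient; every worker ends up with the sum.
local = [np.full(8, fill_value=w, dtype=np.float64) for w in range(4)]
reduced = ring_allreduce(local)
assert all(np.allclose(r, sum(local)) for r in reduced)
```

Each worker only ever exchanges one chunk per step with its neighbor, so the per-link traffic stays constant as you add workers, which is where the near-linear scaling comes from.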
So the result: better bandwidth use and near-linear scaling. Add more GPUs and training speeds up proportionally, up to a point. Sound good?

Next came pipeline parallelism, and this was basically done to fit bigger models. Even with Ring-AllReduce, fitting huge models on a single device was still tough. That's where pipeline parallelism helps. In this approach, we split the model's layers across devices. So let's say GPU 1 handles layers 1 to 10, GPU 2 handles layers 11 to 20, and so on. You take the model, divide up the layers, and hand them off to different compute nodes or GPUs. It's like an assembly line: data flows from one GPU to the next during the forward and backward passes. We use micro-batches to keep all GPUs busy, just like feeding items down a pipeline so no station is idle. This dramatically reduced memory usage per device, letting us train much larger models without hitting memory walls.

Next came ZeRO, and this was done to basically kill the redundancy. Traditional data parallelism meant each GPU would keep full copies of the model parameters, gradients, and optimizer states, and as you can see, that was wasteful. ZeRO fixes this by splitting those elements across devices. Now each GPU only stores a slice of the model and data. It's like dividing the textbooks for the day across students: the history book is carried by one student, the maths book by another, instead of every student carrying a copy of every book. This leads to roughly eight times better memory efficiency, making trillion-parameter models trainable. The models that could be trained got significantly bigger, and it didn't hurt performance either: it's smart enough to fetch or share data as needed while keeping everything stable. So this was a good thing to come along.

Next, we enabled large batch training, not large-scale but large batch. Larger batch sizes increase efficiency, and faster training is achieved through bigger batches, but they can hurt convergence if you're not careful. There are two optimizers specifically designed for this: LARS and LAMB. LARS adjusts the learning rate for each layer based on the weight-to-gradient ratio, keeping layers balanced during training, and LAMB builds on that with added adaptivity for deep models like transformers. So LAMB is specifically there to help with transformer-type models. These optimizers make training with 32,000-plus batch sizes practical without losing accuracy or spending weeks on tuning, frankly. So they are key to keeping training stable at scale.

Next, communication optimization: basically, less talk, more work. Even with good architectures, communication is still costly. So what we did was compress and quantize the gradients. Gradient compression drops or delays communicating tiny values, and quantization uses fewer bits for each number, like JPEG for gradients. Both of these reduce communication without hurting accuracy if tuned properly. The goal is to spend less time waiting on data and more time training the model. Sounds good? So what did we cover?
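Here is a forward-only sketch of the layer split and micro-batching behind pipeline parallelism. The two stages, layer counts, and shapes are made up; a real scheduler (GPipe-style, for instance) overlaps the stages so GPU 0 starts micro-batch k+1 while GPU 1 is still on micro-batch k, which this sequential simulation only describes in comments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two pipeline stages standing in for two GPUs: stage 0 owns the first
# half of the layers, stage 1 owns the second half.
stage0 = [rng.normal(scale=0.1, size=(32, 32)) for _ in range(5)]   # "layers 1-5"
stage1 = [rng.normal(scale=0.1, size=(32, 32)) for _ in range(5)]   # "layers 6-10"

def run_stage(layers, x):
    for w in layers:
        x = np.maximum(x @ w, 0.0)            # linear layer + ReLU
    return x

def pipelined_forward(batch, n_micro=4):
    """Split the batch into micro-batches so both stages can stay busy.

    In a real system the stages run concurrently on different devices;
    here we only show how the work is divided.
    """
    outputs = []
    for micro in np.array_split(batch, n_micro):
        activations = run_stage(stage0, micro)            # would run on GPU 0
        outputs.append(run_stage(stage1, activations))    # would run on GPU 1
    return np.concatenate(outputs)

batch = rng.normal(size=(64, 32))
print(pipelined_forward(batch).shape)          # (64, 32)
```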
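To ground the "roughly eight times" figure for ZeRO, here is the back-of-envelope memory accounting from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states (master weights, momentum, variance). The 7.5B-parameter, 64-GPU example is just for illustration, and activation memory is ignored.

```python
def per_gpu_memory_gb(params_billion, n_gpus, stage):
    """Approximate per-GPU memory (GB) for mixed-precision Adam training.

    Baseline data parallelism replicates everything; ZeRO stages 1-3
    partition optimizer states, then gradients, then parameters.
    """
    psi = params_billion * 1e9
    p, g, opt = 2 * psi, 2 * psi, 12 * psi            # bytes for params / grads / optimizer
    if stage == 0:   total = p + g + opt              # plain data parallelism
    elif stage == 1: total = p + g + opt / n_gpus     # partition optimizer states
    elif stage == 2: total = p + (g + opt) / n_gpus   # ... and gradients
    else:            total = (p + g + opt) / n_gpus   # ... and parameters (ZeRO-3)
    return total / 1e9

for stage in range(4):
    print(f"7.5B model, 64 GPUs, ZeRO stage {stage}: "
          f"{per_gpu_memory_gb(7.5, 64, stage):.1f} GB per GPU")
# Stage 0 needs ~120 GB per GPU; stage 2 is ~16.6 GB (roughly 7-8x less),
# and stage 3 drops to ~1.9 GB, which is how trillion-parameter training
# becomes feasible once activations are handled separately.
```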
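The layer-wise scaling mentioned for LARS is usually called the "trust ratio": each layer's step is scaled by the ratio of its weight norm to its gradient norm, and LAMB applies the same idea on top of Adam-style statistics. Below is a simplified sketch of that update rule; the epsilon and weight-decay handling are simplifying assumptions, not the exact published formulation.

```python
import numpy as np

def lars_update(weights, grads, base_lr=0.01, weight_decay=1e-4, eps=1e-9):
    """One simplified LARS step over a list of per-layer weight arrays.

    local_lr = base_lr * ||w|| / (||g + wd * w|| + eps), so layers whose
    gradients are large relative to their weights take smaller steps.
    """
    new_weights = []
    for w, g in zip(weights, grads):
        g = g + weight_decay * w
        w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
        trust_ratio = w_norm / (g_norm + eps) if w_norm > 0 else 1.0
        new_weights.append(w - base_lr * trust_ratio * g)
    return new_weights

# Two toy "layers" with very different gradient scales end up with
# comparable relative step sizes, which is what keeps training stable
# at very large batch sizes.
layers = [np.ones((4, 4)), 0.01 * np.ones(8)]
grads  = [100.0 * np.ones((4, 4)), 1e-4 * np.ones(8)]
for old, new in zip(layers, lars_update(layers, grads)):
    print(np.linalg.norm(old - new) / np.linalg.norm(old))   # both ~0.01
```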
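Finally, a toy version of the "fewer bits per number" idea: quantize an fp32 gradient tensor to int8 with a single scale factor, send the small tensor, and dequantize on the other side. Real schemes (error feedback, per-chunk scales, top-k sparsification) are more involved; this only shows the size/accuracy trade-off.

```python
import numpy as np

def quantize_int8(grad):
    """Symmetric int8 quantization with one scale per tensor."""
    scale = max(np.max(np.abs(grad)) / 127.0, 1e-12)   # avoid divide-by-zero
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
grad = rng.normal(scale=0.01, size=1_000_000).astype(np.float32)

q, scale = quantize_int8(grad)
restored = dequantize_int8(q, scale)

print("bytes on the wire:", grad.nbytes, "->", q.nbytes)   # 4x smaller
print("max abs error:", np.max(np.abs(grad - restored)))   # about half a quantization step
```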
We covered communication optimization. Yeah. So putting it all together, what's the impact? The impact shows up in training time and memory efficiency for the same size of model: the training time has come down with each one of the milestones we covered, starting from parameter servers. For this comparison, the model being trained is kept the same. If we only had the earliest milestone, the parameter server, and this model were trained with that approach, the training time would be roughly 30 days, a month, and the memory efficiency would be low. With each one of these milestones, the training time drops and the memory efficiency increases. These numbers are not hypothetical. Thanks to these methods, trillion-parameter models are now a reality, and they're powering cutting-edge science, language models, and climate simulations. So distributed training is no longer a limiting factor. It's a launchpad.
...

Aditya Singh

MS in Computer Science @ UW - Madison

Aditya Singh's LinkedIn account


