Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
Thanks for tuning in.
We will talk about the architectural evolution that has happened in distributed training over the past decade.
So today we'll explore how distributed training has evolved to keep up with the explosive growth in model size.
Deep learning models have grown from millions of parameters to trillions of parameters, and as models have scaled, so have the challenges.
So memory limits, communication overhead, resource usage: all of these pose issues, and many techniques have been built to get around those issues and fix them.
We'll walk through some architectural milestones along the way, from parameter servers to ring all-reduce, pipeline parallelism, and ZeRO.
We'll see how these advances work alongside optimizers like LARS and LAMB to train larger models faster and more efficiently.
Let's begin.
Between 2012 and 2020, model sizes skyrocketed.
A single GPU just couldn't keep up.
We needed distributed training, splitting work across devices, but that created new problems.
Early systems used data parallelism: each device processes a batch and syncs with the others.
Sounds simple, but syncing meant a lot of communication, especially for big models. As GPUs waited on each other or on the network, efficiency tanked.
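To make the data-parallel idea concrete, here's a minimal sketch in plain Python/NumPy, simulating the workers in one process. The toy linear model and the shard format are illustrative assumptions, not any particular framework's API.

```python
import numpy as np

def worker_gradient(weights, x_batch, y_batch):
    # Toy linear-regression gradient computed on one worker's local shard.
    errors = x_batch @ weights - y_batch
    return x_batch.T @ errors / len(x_batch)

def data_parallel_step(weights, shards, lr=0.01):
    # Every "device" computes a gradient on its own shard of the batch...
    grads = [worker_gradient(weights, x, y) for x, y in shards]
    # ...then all gradients are averaged. This averaging is the sync step
    # that gets expensive as models (and gradient tensors) grow.
    avg_grad = np.mean(grads, axis=0)
    return weights - lr * avg_grad
```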
And the other issue was memory.
As models grew, so did the need to store parameters, gradients, and optimizer states. That meant memory often became the bottleneck, even with multiple devices.
So then came the parameter server architecture, which came in to coordinate multiple workers.
The parameter server emerged as a central node holding the model; workers would compute gradients and send them to the server, which would then update the model.
It had two sync modes, synchronous and asynchronous. In the synchronous mode, all workers wait and update together.
Simple but slow.
In the asynchronous mode, workers move at their own pace. Faster, but it doesn't optimize as well and there are some inconsistencies.
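Here's a toy parameter-server sketch, single process and no real networking; the class and function names are made up purely to show the push/pull pattern and the two sync modes.

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.01):
        self.weights = np.zeros(dim)   # the central copy of the model
        self.lr = lr

    def push(self, grad):
        # A worker sends a gradient; the server updates the central weights.
        self.weights -= self.lr * grad

    def pull(self):
        # A worker fetches the latest weights before computing its next gradient.
        return self.weights.copy()

def synchronous_round(server, worker_grads):
    # Synchronous mode: wait for every worker, then apply one averaged update.
    server.push(np.mean(worker_grads, axis=0))

def asynchronous_push(server, grad):
    # Asynchronous mode: each worker pushes as soon as it finishes, possibly
    # using stale weights -- faster, but the updates can be inconsistent.
    server.push(grad)
```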
So, while useful, relying on the centralized server became a choke point. And you guessed it: because there was a centralized server, as you scaled to dozens or hundreds of workers, the server got overloaded.
Much like everyone trying to exit a stadium through one door.
Plus it introduced a single point of failure.
So as we scaled up, the architecture started to fall apart.
Then, three early optimization challenges specifically stood out.
The first was communication overhead: the frequent syncing of massive tensors overwhelmed the bandwidth.
The second was the straggler effect: in the synchronous setup, slow workers held everyone back, like waiting for the last hiker in the group.
And the third one was idle resources: GPUs spent time waiting instead of computing, especially in mixed hardware setups.
So we needed smarter ways to share data, manage memory, and keep the hardware busy.
Then came ring all-reduce, which was basically a better way to sync.
To get rid of the central server bottleneck, we turned to ring all-reduce. Imagine arranging all GPUs in a ring instead of syncing with a server.
Each GPU talks only to its neighbors, so gradients are passed around the ring, gradually summed and shared.
This reduces network congestion and scales well.
Each node does equal work.
No bottlenecks.
It's like replacing a traffic jam at one big intersection with a smooth-flowing roundabout.
How's that for an analogy?
So the result: better bandwidth use and near-linear scaling. Add more GPUs and training speeds up proportionally, up to a point.
Sound good?
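Here's a rough sketch of ring all-reduce on N simulated "GPUs", in plain Python/NumPy rather than real NCCL calls, just to show the two phases: reduce-scatter and all-gather.

```python
import numpy as np

def ring_all_reduce(grads):
    """grads: one equal-length gradient vector per simulated device."""
    n = len(grads)
    # Each device splits its gradient into n chunks.
    chunks = [np.array_split(np.asarray(g, dtype=float), n) for g in grads]

    # Phase 1: reduce-scatter. In step s, device i sends chunk (i - s) mod n
    # to its right neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [chunks[i][(i - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - s) % n] += sends[i]

    # Phase 2: all-gather. The fully summed chunks travel around the ring so
    # every device ends up holding the complete summed gradient.
    for s in range(n - 1):
        sends = [chunks[i][(i + 1 - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - s) % n] = sends[i]

    return [np.concatenate(c) for c in chunks]   # sum; divide by n for the average
```

Running it on a few random vectors and comparing against `sum(grads)` is an easy sanity check.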
Next came pipeline parallelism.
This was basically done to fit bigger models.
Even with ring all-reduce, fitting huge models on a single device was still tough.
That's where pipeline parallelism helps.
In this approach, we split the model layers across devices.
So let's say GPU 1 handles layers 1 to 10, GPU 2 handles layers 11 to 20, and so on.
You take the model, divide up the layers, and hand them off to different compute nodes or GPUs.
It's like an assembly line.
Data flows from one GPU to the next during forward and backward passes.
We use micro-batches to keep all GPUs busy, just like feeding items down the pipeline so no station is idle.
This dramatically reduced memory usage per device, letting us train much larger models without hitting memory walls or memory constraints.
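Here's a toy, forward-only pipeline schedule, simulated in one process; the "stages" are just functions standing in for groups of layers on different GPUs, so the names and the example inputs are purely illustrative.

```python
def run_pipeline(stages, micro_batches):
    """stages: one callable per 'GPU'; micro_batches: the split-up input batch."""
    n_stages, n_micro = len(stages), len(micro_batches)
    acts = list(micro_batches)   # acts[m] = current activation of micro-batch m

    # At clock tick t, stage s works on micro-batch (t - s): the staggered
    # schedule that keeps every stage busy once the pipeline has filled up.
    for t in range(n_micro + n_stages - 1):
        for s in range(n_stages):
            m = t - s
            if 0 <= m < n_micro:
                acts[m] = stages[s](acts[m])
    return acts   # outputs of the final stage for every micro-batch

# Pretend these three functions are "layers 1-10", "layers 11-20", "layers 21-30".
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_pipeline(stages, [1, 2, 3, 4]))   # -> [1, 3, 5, 7]
```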
Next came the ZeRO approach, and this was done to basically kill the redundancy.
Traditional data parallelism meant each GPU would keep full copies of the model parameters, gradients, and optimizer states.
And as you can see, it was wasteful.
So ZeRO fixes this by splitting those elements across devices.
Now each GPU only stores a slice of the model state and data.
It's like dividing the textbooks for a day across students: the history book is carried by one student, the maths book by another student, instead of every student carrying a copy of every book.
This leads to eight-x better memory efficiency, making trillion-parameter models more trainable.
So it led to significant memory efficiency, and the models that could be trained got significantly bigger in size.
It also didn't hurt performance: ZeRO is smart enough to fetch or share data as needed while keeping everything stable.
So this was a good thing to come along.
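To sketch the ZeRO idea, here's a tiny NumPy example that shards just the optimizer state (momentum buffers) across simulated devices instead of replicating it. Real ZeRO also shards gradients and parameters and does this with collective communication, so treat this as a simplification with made-up names.

```python
import numpy as np

class ShardedMomentumSGD:
    """Each 'device' keeps momentum only for its own slice of the parameters."""
    def __init__(self, n_params, n_devices, lr=0.01, beta=0.9):
        self.bounds = np.linspace(0, n_params, n_devices + 1, dtype=int)
        self.momenta = [np.zeros(b - a)
                        for a, b in zip(self.bounds[:-1], self.bounds[1:])]
        self.lr, self.beta = lr, beta

    def step(self, weights, full_grad):
        # Gradients have already been all-reduced; each device then updates
        # only its own parameter slice using its local momentum shard...
        for d, (a, b) in enumerate(zip(self.bounds[:-1], self.bounds[1:])):
            self.momenta[d] = self.beta * self.momenta[d] + full_grad[a:b]
            weights[a:b] -= self.lr * self.momenta[d]
        # ...and in a real system the updated slices are all-gathered so every
        # device again sees the full, freshly updated parameter vector.
        return weights
```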
Now, next, what happened was we enabled large batch training.
Not large scale, but large batch training.
Larger batch sizes increase efficiency, and faster training is achieved through bigger batch sizes, but they can hurt convergence if you're not careful.
So there are two optimizers specifically designed for this: LARS and LAMB. LARS adjusts learning rates for each layer based on the weight-to-gradient ratio, keeping layers balanced during training, and then LAMB builds on that with added adaptivity for deep models like transformers.
So LAMB is specifically there to help with transformer-type models.
Now these optimizers make training with 32,000-plus batch sizes practical without losing accuracy or spending weeks on tuning, frankly.
So they are key to keeping training stable at scale.
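Here's the core LARS idea in a few lines: scale each layer's update by the ratio of the weight norm to the gradient norm. This is a simplification; the real optimizer also folds in weight decay and momentum, and LAMB applies a similar trust ratio on top of Adam-style updates.

```python
import numpy as np

def lars_update(weights_per_layer, grads_per_layer, base_lr=0.1, eps=1e-9):
    new_weights = []
    for w, g in zip(weights_per_layer, grads_per_layer):
        w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
        # Trust ratio: a layer whose gradient is large relative to its weights
        # gets a smaller effective learning rate, and vice versa, which keeps
        # layers balanced at very large batch sizes.
        trust_ratio = w_norm / (g_norm + eps) if w_norm > 0 else 1.0
        new_weights.append(w - base_lr * trust_ratio * g)
    return new_weights
```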
Next, communication optimization.
Basically: less talk, more work.
So even with good architectures, communication is still costly.
And so what we did was compress and quantize the gradients. Gradient compression drops or delays communicating tiny values, and quantization uses fewer bits for each number, like JPEG for gradients.
Both of these reduce communication without hurting accuracy, if tuned properly.
And the goal is to spend less time waiting on data and more time training the model.
Sounds good.
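Here's a toy version of those two tricks, top-k sparsification plus 8-bit quantization, in plain NumPy. The exact scheme (and the error feedback a real system would add) is just for illustration, not any specific library's method.

```python
import numpy as np

def compress(grad, k, levels=127):
    # Keep only the k largest-magnitude entries (drop/delay the tiny values)...
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    values = grad[idx]
    # ...and quantize the survivors to 8 bits -- "JPEG for gradients".
    scale = np.abs(values).max() / levels or 1.0
    quantized = np.round(values / scale).astype(np.int8)
    return idx, quantized, scale

def decompress(idx, quantized, scale, size):
    grad = np.zeros(size)
    grad[idx] = quantized.astype(np.float32) * scale
    return grad
```

The receiver rebuilds an approximate gradient with `decompress`, so instead of sending every float you send k indices, k bytes, and one scale factor.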
What did we cover?
We covered communication optimization.
Yeah.
Putting it all together, what's the impact?
So the impact is that, for the same size of model, the training time has come down and the memory efficiency has gone up with each one of these milestones that we covered, starting from parameter servers.
For this reference, the model that has been used to compare is the same model throughout.
So if we just had the earlier milestone of the parameter server, and this model is trained using that approach, the training time would be basically 30 days, a month, and the memory efficiency would be low.
And with each one of these milestones, the training time drops and the memory efficiency increases.
These numbers are not hypothetical. Thanks to these methods, trillion-parameter models are now a reality, and they're powering cutting-edge science, language models, and climate simulations.
So distributed training is no longer a limiting factor.
It's a launchpad.