Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, I'm Jo Mundi.
I'm going to talk about reinforcement learning and how to scale RLHF-based systems for distributed learning across multi-GPU environments.
There have been a lot of improvements recently in models, whether the Sonnet models or the Opus models, where instruction-based RLHF training continuously improves their performance.
I'm going to talk about the challenges there, and also the core capabilities that are used to build these platforms and systems.
So the agenda is: a short summary of reinforcement learning, from theoretical frameworks to production systems; the foundations of distributed RL and the challenges involved there; then the RLHF principles and practices that are normally applied in production systems for training and inference; and the cloud-native architectures. Mostly these systems are trained on Kubernetes with a multi-GPU setup and proper networking using NVLink. I'll also talk about some of the scaling frameworks that I've used, mostly IMPALA, Ray, Lightning AI, and DeepSpeed, some of the modern frameworks that are being utilized, and the challenges of monitoring and deploying these pipelines at scale.
So, reinforcement learning: the earlier generation of RL-based systems came from the MDP, the Markov Decision Process, which is a sequential learning formulation where you maximize the cumulative reward at each step. But with deep neural networks, and now transformer-based networks, the way the input gets transformed is much more sophisticated compared to just a sequential process in an MDP, so the system can learn much more sophisticated functions and can be multimodal, whether text, image, video, or audio.
For the foundations of distributed reinforcement learning: training can be done in a multi-node, multi-pod system where there is a main server that accumulates the weights as the system learns through each epoch and each batch. The core challenge there is networking. How do the workers, where the learning is happening, communicate with the server, and how does the server aggregate all the weights to make sure the weights of the entire network are updated after each learning event?
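As a minimal sketch of that server-side aggregation in plain PyTorch (hypothetical names, not any specific framework): the workers send their state dicts after a learning step, and the main server averages them and redistributes the merged weights.

```python
import torch

def aggregate_worker_weights(worker_state_dicts):
    """Average the parameters reported by each worker after a learning step."""
    merged = {}
    for name in worker_state_dicts[0]:
        stacked = torch.stack([sd[name].float() for sd in worker_state_dicts])
        merged[name] = stacked.mean(dim=0)
    return merged

# On the server pod (sketch): collect weights from workers, merge, redistribute.
# merged = aggregate_worker_weights(received_state_dicts)
# for worker in workers:
#     worker.load_state_dict(merged)
```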
Now, there are different kinds of loss functions and policy optimization algorithms that get applied, like PPO, GRPO, or DPO. These are the policies that are mainly applied. There are also actor-critic models, where you learn from examples, and these examples are either hand labeled or sometimes generated by the machines themselves, so it becomes a continuous learning feedback loop.
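As a rough illustration of one of these objectives, here is the PPO clipped surrogate loss in plain PyTorch (a sketch with hypothetical tensor names, not tied to any particular library):

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective: limit how far the new policy
    can move from the old one on each update."""
    ratio = torch.exp(log_probs_new - log_probs_old)           # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # negate to maximize
```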
Learning can happen in multiple ways. The most common is learning from human feedback. One form of that is supervised fine-tuning: you have a pre-trained model that you train on a large cluster, and then on top of that you fine-tune it on your own instructions, which for LLMs are basically sets of questions and answers. For vision models, it can be the images and the labels themselves. These are human labels, created by experts in specific domains, for example doctors, physicians, or math olympiad medalists. They create a very high-quality supervised dataset that can be used for supervised fine-tuning, and that improves the performance of these models.
To train on this data at scale, you also have to make sure you have a reward model or reward function defined, so that the entire network can optimize for that function.
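As a small sketch of what a reward model objective can look like, here is the standard pairwise preference loss (hypothetical tensor names; not a specific implementation from this talk):

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Train the reward model so the response humans preferred
    scores higher than the one they rejected."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```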
The policies you can apply for optimizing against the reward function include PPO, as I mentioned, and also DPO and GRPO. Some of these policies can do much better than others depending on the type of dataset and reward functions you have defined. You need to be very careful in defining the reward functions, because there are challenges involved, such as reward hacking: if the functions are not defined properly, the model might hack the reward function, and that can lead to unintended consequences, which are not desirable.
Now, the distributed implementation of RLHF. Most of these systems are trained on different kinds of large Kubernetes clusters, and there are different kinds of frameworks; each has its own advantages and disadvantages. The most common approaches are data parallelism and model parallelism. Data parallelism means the same model is replicated across all the pods in the cluster, and the data, which is very large, gets split across those nodes. Learning happens in the individual pods: after each batch, each worker sends its weights back to the server, the server aggregates those weights, the neural network weights are updated, and the result is sent back to the workers. There are various kinds of gradient aggregation strategies that can happen on the server pod, which is the main pod monitoring all the training and learning across these worker pods. You also have to make sure the models are checkpointed, because some of this training can take a long time, weeks to train, and you don't want to restart the training after, let's say, 50% of it is done. You want it to be able to restart from those checkpoints, to minimize the compute resources needed for the entire training process.
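A minimal sketch of what data-parallel training with periodic checkpointing can look like in PyTorch (hypothetical model, loss, and paths; assumes launch via torchrun, which sets the rendezvous environment variables):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, dataloader, optimizer, epochs, ckpt_path="checkpoint.pt"):
    dist.init_process_group(backend="nccl")          # NCCL for multi-GPU communication
    model = DDP(model.cuda())
    for epoch in range(epochs):
        for batch in dataloader:
            loss = model(batch.cuda()).mean()         # placeholder loss for the sketch
            optimizer.zero_grad()
            loss.backward()                           # gradients all-reduced across workers
            optimizer.step()
        if dist.get_rank() == 0:                      # checkpoint from the main process only
            torch.save({"epoch": epoch,
                        "model": model.module.state_dict(),
                        "optimizer": optimizer.state_dict()}, ckpt_path)
```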
Human feedback collection is another challenge. Sometimes the feedback loop can be very long: depending on the use case, some feedback is fast to collect, and some can take up to months. Generally there are a lot of companies, like Scale AI and a few others, that help create these kinds of labels based on human feedback. If you create millions of examples, those examples become a dataset for training RLHF models at scale. You can control the quality, and also make sure the evaluation systems you have are in line with the controls and the quality you want to achieve after training, so that at inference time the model performs much better than the previous models.
There are different kinds of cloud-native architectures for distributed RL, and I'll talk more about them. There are different frameworks, but most of this runs on top of Kubernetes, which orchestrates all the pods: the resource and compute management, failover recovery based on checkpoints, replay buffers for the actor-critic pattern, and the stateful management of memory as well as the data, which can sometimes be terabytes in size. All this state management has to be done properly so that training can restart, or communicate state across all the training pods, for a consistent and reliable training run.
Network optimization is another very important thing. Your network needs to be really optimized for multi-GPU training. NVIDIA has NVLink, which provides a highly optimized interconnect so that multi-GPU communication can be done at scale, and there are a lot of ongoing improvements for low-latency communication across multiple GPUs, which reduce compute time when data and state get passed across these pods.
There are different kinds of frameworks: IMPALA is one, Ray is another, DeepSpeed is another. Each of these has its own pros and cons, and depending on the use case you can decide which framework you want to use. For multi-GPU RL-based training, Ray is often used; it gives you a lot of ready-made components, and it gives you metrics on how the training is going out of the box. DeepSpeed is another one, normally used for very large-scale transformer networks when you cannot fit the model on one GPU, so the model itself needs to be distributed across multiple GPUs. The learning then happens across multiple pods, not just one. For very large networks, you cannot load the entire model in one pod, so the model gets distributed across multiple pods, and the learning happens in a distributed way across all of them, from one part of the neural network to another.
This is an example of GPU scheduling: Kubernetes itself doesn't natively do GPU scheduling. NVIDIA has device plugins, so you need to install them as part of the pod setup. Then you can say how many GPUs you want in one pod; normally four is the number, but you can also go up to eight depending on the resources you have.
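A rough sketch of requesting GPUs for a training pod with the Kubernetes Python client (hypothetical image and pod names; assumes the NVIDIA device plugin is already installed so the "nvidia.com/gpu" resource exists):

```python
from kubernetes import client, config

config.load_kube_config()

# Request 4 GPUs for a single training pod.
container = client.V1Container(
    name="rlhf-trainer",
    image="my-registry/rlhf-trainer:latest",   # hypothetical image
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "4"}),
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rlhf-trainer-0"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```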
Then there are the distributed training frameworks, where the training actually happens. Some of them I already mentioned: DDP is there for data parallelism; there's DeepSpeed, which is from Microsoft; there's Ray Train, which is from Anyscale; and Kubeflow also provides PyTorchJob and TFJob operators that you can use for distributed training. The most important thing when you're doing multi-GPU training is the communication: for fast communication you need to use NVIDIA's NCCL library to make sure the network is fully optimized for training at scale.
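As a small sketch, initializing the process group with the NCCL backend in PyTorch typically looks something like this (the environment variable values are placeholders; in a Kubernetes job they are usually injected by the operator or launcher):

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "trainer-master")   # placeholder rendezvous host
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl",                   # GPU-to-GPU communication via NCCL
                        rank=int(os.environ["RANK"]),
                        world_size=int(os.environ["WORLD_SIZE"]))
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))      # pin each process to one GPU
```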
For production deployment strategies, we need to monitor GPU usage, because if the data is not properly distributed, some of the GPUs can really underperform. So you need very close monitoring tools; you can integrate with platforms like Datadog and similar to understand GPU usage. It takes some trial and error to understand the right distribution and sharding of the data so that all the GPUs are maximally utilized and there is no high imbalance in GPU utilization when you're doing distributed training.
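A tiny sketch of pulling per-GPU utilization with NVML through the pynvml package, which you could then forward to whatever monitoring platform you use:

```python
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # % GPU and memory activity
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: util={util.gpu}% mem={mem.used / mem.total:.0%}")
pynvml.nvmlShutdown()
```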
Then there are safety mechanisms, like circuit breaker patterns for unstable policies. If pods go down, you need robust, highly reliable failover mechanisms, so that a failed pod does not break the entire training: it can recover from the checkpoint at which it failed and pick up from there, and the training across the rest of the cluster can still continue.
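A minimal sketch of that resume-from-checkpoint behavior (hypothetical path and names, continuing the earlier checkpointing example):

```python
import os
import torch

def restore_if_available(model, optimizer, ckpt_path="checkpoint.pt"):
    """Resume from the last saved checkpoint instead of restarting from scratch."""
    if not os.path.exists(ckpt_path):
        return 0                                   # no checkpoint yet: start at epoch 0
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1                      # continue from the next epoch
```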
So it needs very thorough monitoring of utilization across GPU, CPU, and the network across the cluster, so that you have a good understanding of utilization while training is happening and can optimize based on the bottlenecks you see. The bottlenecks can be of various types: it can be the network, it can be the data, or sometimes the model itself is learning slowly. You have to make sure the learning parameters are set up in an optimal way, and you have to test and tune that multiple times to get optimal learning for the entire model, so that the loss function you're applying, or the policies you're applying, are really improving the learning performance of the entire model. There are a lot of things, like learning rate parameters, that also need to be tuned depending on the type of data and the type of cluster you have.
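For instance, a short sketch of a tunable learning-rate schedule in PyTorch; warmup followed by cosine decay is just one common choice, not something prescribed here, and the model is a placeholder:

```python
import torch

model = torch.nn.Linear(10, 1)                                # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=500)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, decay], milestones=[500])

# Call scheduler.step() once per optimizer step during training.
```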
Now, case studies for RLHF. The most common use case is large language models, which are commonly trained using labeled data. There's SFT, supervised fine-tuning, which I talked about, but there are also new techniques coming up, like reinforcement-based fine-tuning. That can really speed up training, because it's very difficult to get labeled data from experts like doctors, physicians, or math and science olympiad medalists; getting labeled data takes a long time. If you can define reward functions, and define them properly, the network can itself come up with new kinds of labeled data that can be fed back to the model. In that case, you don't need a very large set of labeled data: you can have an initial seed, the model can learn from that seed and generate new kinds of datasets itself as part of training, and that can improve the speed of your training, and you don't have to wait for a long cycle of getting human feedback data.
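A very rough sketch of that self-generation loop, under the assumption that you already have a model with a generate method, a programmable reward function, and a fine-tuning step; all the names below are hypothetical, not any specific framework's API:

```python
def reinforcement_fine_tune(model, seed_prompts, reward_fn, fine_tune, rounds=3):
    """Hypothetical loop: the model proposes candidates, a programmable reward
    function scores them, and only high-reward samples become new training data."""
    dataset = []
    for _ in range(rounds):
        for prompt in seed_prompts:
            candidates = model.generate(prompt, num_samples=8)       # hypothetical API
            best = max(candidates, key=lambda c: reward_fn(prompt, c))
            if reward_fn(prompt, best) > 0.8:                        # keep only high-reward samples
                dataset.append((prompt, best))
        fine_tune(model, dataset)                                    # hypothetical training step
    return model
```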
Large language models are the most common use case, but there are also use cases for autonomous vehicles and robotics applications, where this data is being collected. There are a lot of challenges in collecting image data, but as you collect many different kinds of datasets, the model becomes more robust to new use cases.
The challenges I'll talk about: first, human feedback scalability. Human feedback is the golden dataset; if you can get that data quickly and reliably, it's the highest-quality data, but that's the challenge. I think the next two or three years will be needed to make sure human data is collected across very specialized tasks that the model can learn and specialize on, so that the quality of the model can increase over time.
Communication overhead is about how you make sure the policy parameters, or the weight parameters, are shared across the network in a reliable and fast way, so that coordination stays safe. It needs very high network bandwidth: as the model is training, the coordination of all this stateful data has to happen very quickly, because it can become the bottleneck for your entire cluster and can increase the cost of your compute.
Then there are other challenges, like temporal dynamics: how do you make sure that temporal dynamics, in terms of changing preferences, are captured by the network? That depends on the policy you're trying to build, on the loss functions, and on the network architecture as well.
Then there are consistency guarantees and bias and representation, which are about how we make sure that when the model is trained, it gives consistent results: the variability of the results should stay within some band, so that each time you train, it should improve. If you have inconsistency, then you cannot trust the model every time you train. So you have to make sure there's consistency in terms of data availability, the compute time needed to run, and the model performance itself; when you train it multiple times, it should give results that are close to the same. It might not be exactly the same, because it's a non-deterministic system, but it should be close enough.
For emerging trends, there is a new way of doing RLHF, which is federated RLHF: federated systems where the learning can happen on the edge itself instead of on a cluster in a Kubernetes pod. You can do learning directly on edge devices; there can be millions of edge devices where the data gets collected and training happens. Some of the training can happen directly on the devices for privacy preservation, and then the devices sync their learned weights to a central server, which accumulates all the learning parameters, merges them, and sends them back to the devices, so that you don't need a very large training run every time. Some of this continuous learning can happen on the devices, on the edge itself. That's the federated learning approach, and there's a lot of research going on in that area.
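A compact sketch of the federated averaging idea behind that (plain PyTorch tensors; device-side training and transport are abstracted away here):

```python
def federated_average(device_updates, sample_counts):
    """Merge weights sent by edge devices, weighting each device
    by how many samples it trained on (FedAvg-style)."""
    total = sum(sample_counts)
    merged = {}
    for name in device_updates[0]:
        merged[name] = sum(
            update[name].float() * (n / total)
            for update, n in zip(device_updates, sample_counts)
        )
    return merged
```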
Then there's multimodal, which covers image, audio, text, video, and so on. There's also automated feedback-based learning, where you collect the data automatically. There's a lot of promise there: a lot of companies are seeing feedback collected within the product itself, either good or bad, and those labels collected from the product are then used either as part of the training process, when you're doing instruction-based fine-tuning with RLHF, or some people use them as part of the RAG application itself, where the feedback is used as part of the prompt: when they call the LLMs, that feedback is given as references for chain-of-thought reasoning.
For the path forward, there is still a lot of work to do. For distributed RLHF there are a lot of frameworks out there, but I think the most challenging part right now is data collection: from researchers, from engineers like coders, from doctors, physicians, or teachers, or any professional in any domain. We need lots of high-quality data to be collected so that models can continuously be trained with reinforcement learning from human feedback. Getting this specialized feedback data collected at scale, across all the people in the world, let's say a billion people, could really improve the new versions of models coming out now, but that might not always be possible. That's why I think reinforcement-based fine-tuning, with custom reward functions, might be one of the new ways forward, where the model itself generates new kinds of data and hypotheses that can be used for training. But there's a lot of research going on in this area right now as well.
So that's it.
Thank you so much for listening to me.
I would love to take any questions and answer them based on my experience so far. Thank you.