Conf42 JavaScript 2025 - Online

- premiere 5PM GMT

Scaling Reinforcement Learning: From Human Feedback to Distributed Intelligence


Abstract

Discover how Reinforcement Learning is powering the next wave of AI, from aligning large language models like ChatGPT to scaling decision-making across fleets of autonomous agents. Learn practical strategies for building RL systems that adapt, cooperate, and scale in the real world.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, I'm Jyotirmoy Sundi. I'm going to talk about reinforcement learning and how to scale RLHF-based systems for distributed learning across multi-GPU environments. There are a lot of improvements you can see recently in models, whether it's the Sonnet models or the Opus models, where a lot of instruction-based RLHF training has been done, and that continuously improves the performance of these models. I'm going to talk about the challenges there and also the core capabilities that are used to build these platforms and systems.

So the agenda is: a short summary of reinforcement learning, from theoretical frameworks to production systems; the foundations of distributed RL and the challenges involved there; the RLHF principles and practices that are normally applied in production systems, for training and for inference; and the cloud-native architectures. Mostly these systems are trained on Kubernetes, with a multi-GPU setup and proper networking using NVLink. I'll also talk about some of the scaling frameworks that I've used, mostly IMPALA, Ray, Lightning AI, and DeepSpeed, some of the modern frameworks being utilized, and the challenges of monitoring and deploying these pipelines at scale.

The earlier generation of RL systems came from the MDP, the Markov Decision Process, which is a sequential learning formulation where you maximize the cumulative reward at each step. With deep neural networks, and now transformer-based networks, the way the input gets transformed is much more sophisticated than the sequential processing in an MDP, so the system can learn much more sophisticated functions, and it can be multimodal: text, image, video, or audio.

On the foundations of distributed reinforcement learning: training can be done in a multi-node, multi-pod system where a main server accumulates all the weights as the system learns through each epoch and each batch. The core challenge there is networking: how do the workers, where the learning is happening, communicate with the server, and how does the server accumulate all the weights so that the weights of the entire network are updated after each learning event? There are different kinds of loss functions and policy optimization algorithms that can be applied, like PPO, GRPO, or DDPO. There are also actor-critic models, where you learn from examples. These examples are either hand-labeled or sometimes generated by the machines themselves, so it becomes a continuous learning feedback loop.

Learning can happen in multiple ways. The most common is learning from human feedback. Part of that is supervised fine-tuning, where you take a pre-trained model that was trained on a large cluster and then fine-tune it on top of your own instructions, which for LLMs are basically sets of questions and answers. For vision models, it can be the images and the labels themselves. These are human labels, created by experts in specific domains, for example doctors, physicians, or math olympiad-level experts. They create this supervised dataset.
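As a rough illustration of what that supervised fine-tuning step looks like in code, here is a minimal sketch using PyTorch and Hugging Face Transformers. The model name, the tiny in-line dataset, and the hyperparameters are placeholders for this write-up, not details from the talk.

```python
# Minimal supervised fine-tuning (SFT) sketch on expert question/answer pairs.
# Assumptions: a small causal LM ("gpt2" as a stand-in) and a toy in-line
# dataset; a real run would stream a large expert-labeled corpus instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# Expert-labeled instruction data: prompt plus desired answer.
examples = [
    {"prompt": "Q: What is 2 + 2?\nA:", "answer": " 4"},
    {"prompt": "Q: Name a noble gas.\nA:", "answer": " Helium"},
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(3):
    for ex in examples:
        text = ex["prompt"] + ex["answer"] + tokenizer.eos_token
        batch = tokenizer(text, return_tensors="pt")
        # For causal LMs, passing input_ids as labels trains next-token prediction.
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss {out.loss.item():.4f}")
```

In practice the prompt tokens are usually masked out of the loss and the data is batched with padding, but the core idea, next-token cross-entropy on expert-written answers, is the same.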
That expert-created data is a very high-quality dataset that can be used for supervised fine-tuning, and it improves the performance of these models.

To train on this data at scale, you also have to make sure you have a reward model or reward function defined, so that the entire network can optimize for that function. The policies you can apply for optimizing against the reward function include PPO, as I mentioned, and also DDPO and GRPO, and some of these policies do much better than others depending on the type of dataset and the reward functions you have defined. You need to be very careful in defining the reward functions, because there are challenges involved, like reward hacking: if the functions are not defined properly, the model might hack the reward function, and that can lead to consequences that are not desirable.

For the distributed implementation of RLHF, most of these models are trained on large Kubernetes clusters of different kinds, and there are different frameworks, each with their own advantages and disadvantages. The most common patterns are data parallelism and model parallelism. Data parallelism means the same model is replicated across all the pods in the cluster, and the data, which is very large, gets split across those nodes. The learning happens in the individual pods: after each batch, the workers send their weights back to the server, the server accumulates those weights, the neural network weights are updated, and they are sent back to the workers. There are various gradient accumulation strategies that can happen on the server pod, which is the main pod monitoring all the training and learning across the worker pods.

You also have to make sure the models are checkpointed, because some of this training can take a long time, weeks, to run. You don't want to restart the training after, let's say, 50% of it is done; you want to be able to restart from a checkpoint so that you minimize the compute resources needed for the entire training process.

Human feedback collection is another challenge. Sometimes the feedback loop can be very long: depending on the use case, some feedback is fast to collect, and some can take up to months. There are a lot of companies, like Scale AI and a few others, that help create these kinds of labels based on human feedback. If you create millions of examples, those examples become a dataset for training RLHF models at scale, and you can control the quality and make sure your evaluation systems are in line with the controls and the quality you want to achieve after training, so that at inference time the model performs much better than the previous models.

There are different kinds of cloud-native architectures for distributed RL, and I'll talk more about them. There are different frameworks out there, but most of this runs on top of Kubernetes, which orchestrates all the pods.
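To make the data-parallel pattern described above a bit more concrete before getting into the Kubernetes details, here is a minimal PyTorch DistributedDataParallel worker sketch. It is a generic illustration rather than code from the talk: the toy model, dataset, and hyperparameters are placeholders, and DDP averages gradients with an all-reduce instead of routing weights through a separate parameter server, but the effect is the one described, every replica ends each step with the same updated weights.

```python
# Minimal data-parallel training worker (sketch). Launch with, for example:
#   torchrun --nproc_per_node=4 ddp_worker.py
# Assumes NVIDIA GPUs and the NCCL backend for inter-GPU communication.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")          # NCCL for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic dataset; a real job would load the
    # policy model and sharded RLHF data here.
    model = torch.nn.Linear(128, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    data = TensorDataset(torch.randn(10_000, 128), torch.randn(10_000, 1))

    # DistributedSampler gives each worker a disjoint shard of the data.
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = loss_fn(model(x), y)
            loss.backward()                          # gradients are all-reduced here
            opt.step()
            opt.zero_grad()
        if dist.get_rank() == 0:
            # Only rank 0 writes checkpoints, so a failed pod can resume later.
            torch.save(model.module.state_dict(), f"ckpt_epoch{epoch}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a Kubernetes setup like the one described next, each training pod would run one or more copies of this process, with the per-epoch checkpoint giving the restart point mentioned above.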
Kubernetes handles the resource management, the compute management, failover recovery based on checkpoints, replay buffers for the actor-critic pattern, and the stateful management of memory as well as data, which can sometimes be on the order of terabytes. All of this state management has to be done properly so that training can restart, or state can be communicated across the training pods, for a consistent and reliable training run.

Network optimization is another very important thing. Your network needs to be really optimized for multi-GPU training. NVIDIA has NVLink, which provides a highly optimized interconnect so that multi-GPU communication can be done at scale, and there are a lot of ongoing improvements for low-latency communication across GPUs, which reduces compute time when data and state get passed across these pods.

There are different kinds of frameworks: IMPALA is one, Ray is another, DeepSpeed is another. Each has its own pros and cons, and depending on the use case you can decide which framework you want to use. For multi-GPU RL-based training, Ray is often used; it gives you a lot of ready-made components, and it gives you metrics on how the training is going quickly, out of the box. DeepSpeed is another one, normally used for very large transformer networks when you cannot fit the model on one GPU. In that case the model itself needs to be distributed across multiple GPUs, and the learning happens in a distributed way across multiple pods, not just one, with one part of the neural network on one pod and another part on another.

One thing you have to set up explicitly is GPU scheduling. Kubernetes itself does not natively schedule GPUs; NVIDIA provides device plugins, so you need to install them as part of the cluster and pod setup, and then you can say how many GPUs you want in one pod. Normally four is the number, but you can also go up to eight, depending on the resources you have.

Then there are the distributed training frameworks where the training actually happens, some of which I already mentioned: DDP for data parallelism, DeepSpeed from Microsoft, Ray Train from Anyscale, and Kubeflow, which provides PyTorchJob and TFJob operators you can use for distributed training. The most important thing when you're doing multi-GPU training is communication: for fast communication you need to use NVIDIA's NCCL library to make sure the network is fully utilized for training at scale.

On production deployment strategies: you need to monitor GPU usage, because if the data is not properly distributed, some of the GPUs can really underperform. So you need very close monitoring, and you can integrate with platforms like Datadog to understand the GPU usage.
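As a small, node-level example of what that kind of monitoring can look like, here is a sketch that reads per-GPU utilization and memory through NVIDIA's NVML bindings (the pynvml package). The threshold and the idea of exporting the numbers to a dashboard are illustrative assumptions, not specifics from the talk.

```python
# Sketch: poll per-GPU utilization so imbalanced data sharding shows up quickly.
# Requires the NVIDIA driver and the `pynvml` (nvidia-ml-py) package.
import time
import pynvml

pynvml.nvmlInit()
num_gpus = pynvml.nvmlDeviceGetCount()

try:
    while True:
        for i in range(num_gpus):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % over the last sample window
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"gpu{i}: util={util.gpu}% mem={mem.used / mem.total:.0%}")
            if util.gpu < 30:  # placeholder threshold for "this GPU looks starved"
                print(f"gpu{i} is underutilized; check data sharding or the input pipeline")
        time.sleep(10)  # in practice these values get exported to Datadog, Prometheus, etc.
finally:
    pynvml.nvmlShutdown()
```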
Getting this balance right takes some trial and error: understanding the right distribution and sharding of the data so that all the GPUs are maximally utilized and there is no high imbalance in GPU utilization when you're doing distributed training.

There are also safety mechanisms, like circuit-breaker patterns for unstable policies. If there are pods that go down, you need robust failover mechanisms so that a pod going down does not break the entire training: it can recover from the specific checkpoint where it failed and pick up from there, and the training across the rest of the cluster can still continue. This needs good monitoring of utilization across GPU, CPU, and network for the whole cluster, so that you understand utilization while training is happening and can optimize based on the bottlenecks you see. The bottlenecks can be of various types: the network, the data, or sometimes the model itself is learning slowly. So you have to make sure the learning parameters are set up in an optimal way, and you have to test and tune them multiple times to get optimal learning for the entire model, checking that the loss function and the policies you're applying are really improving the learning performance. There are a lot of things, like the learning rate and other hyperparameters, that also need to be tuned depending on the type of data and the type of cluster you have.

As for case studies for RLHF, the most common use case is large language models, which are commonly trained using labeled data. There's SFT, the supervised fine-tuning I talked about, but there are also new techniques coming up, like reinforcement-based fine-tuning, and that can really speed up training, because it's very difficult to get labeled data from experts like doctors, physicians, or math and science olympiad-level experts; getting labeled data takes a long time. If you can define the reward functions properly, the network can itself come up with new kinds of labeled data that can be fed back into the model. In that case you don't need a very large set of labeled data: you can start from an initial seed, the model can learn from that seed and generate new kinds of datasets itself as part of training, and that can improve the speed of your training so you don't have to wait for a long cycle of collecting human feedback data.

So large language models are the most common use case, but there are also use cases in autonomous vehicles and robotics, where this kind of data is being collected for robotics applications. There are a lot of challenges in collecting image data, but as you collect many different kinds of datasets, the model becomes more robust to new use cases.

Now for the challenges. The first is human feedback scalability. Human feedback is the golden dataset, so if you can get that data quickly and in a reliable way, that is the highest-quality data. But that's the challenge.
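The reinforcement-based fine-tuning idea mentioned above, where a programmatic reward function lets the model grow its own training set from a small seed, can be sketched roughly as rejection sampling. Everything here is illustrative: the model name, the arithmetic-style reward check, and the acceptance threshold are assumptions for this write-up, not details from the talk.

```python
# Sketch: grow a training set from a seed of prompts using a verifiable reward
# function instead of human labels (a rejection-sampling flavor of RFT).
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Seed tasks where correctness can be checked programmatically.
seed = [{"prompt": "Q: What is 17 + 25?\nA:", "target": 42},
        {"prompt": "Q: What is 9 * 6?\nA:", "target": 54}]

def reward(completion: str, target: int) -> float:
    """1.0 if the first number in the completion matches the target, else 0.0."""
    m = re.search(r"-?\d+", completion)
    return 1.0 if m and int(m.group()) == target else 0.0

new_examples = []
for task in seed:
    inputs = tokenizer(task["prompt"], return_tensors="pt")
    # Sample several candidate answers per prompt.
    outputs = model.generate(**inputs, do_sample=True, num_return_sequences=4,
                             max_new_tokens=8, pad_token_id=tokenizer.eos_token_id)
    for seq in outputs:
        completion = tokenizer.decode(seq[inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        if reward(completion, task["target"]) > 0.5:
            # Accepted samples become new supervised fine-tuning data.
            new_examples.append({"prompt": task["prompt"], "answer": completion})

print(f"kept {len(new_examples)} model-generated examples for the next fine-tuning round")
```

In a full pipeline the accepted examples would be fed back through a fine-tuning loop like the one sketched earlier, and the reward function would be much richer than an arithmetic check.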
I think the next two to three years will be needed to make sure human data is collected across very specialized tasks that the model can learn and specialize on, so that the quality of these models can keep increasing over time.

Communication overhead is the next challenge: how do you make sure the policy parameters, the weight parameters, are shared across the network in a reliable and fast way, so that coordination stays safe? It needs very high network bandwidth, because as the model is training, the coordination of all this stateful data needs to happen very quickly, and it can become the bottleneck for your entire cluster and increase your compute cost.

Then there are challenges like temporal dynamics: how do you make sure changing preferences over time are captured in the network? That depends on the policy you're trying to build, on the loss functions, and on the network architecture as well.

Then there are consistency guarantees, and bias and representation. These are about making sure that when the model is trained, it gives consistent results: the uncertainty of the results should stay within some band, so that each time you train, it improves. If you have inconsistency, then you cannot trust the model every time you train it, so you have to make sure there is consistency in the data availability, in the compute time needed to run, and in the model performance itself. When you train it multiple times, it should give close to the same results. It might not be exactly the same, because it's a non-deterministic system, but it should be close enough.

On emerging trends: there is a new way of doing RLHF, which is federated RLHF, federated systems where the learning can happen on the edge itself instead of on a cluster in a Kubernetes pod. You can do the learning directly on edge devices; there can be millions of edge devices where the data gets collected, and some of the training can happen directly on the devices for privacy preservation. The devices then sync their learned weights to a central server, which accumulates all the learning parameters, merges them, and sends them back to the devices, so you don't need a very large training run every time; some of the continuous learning can happen on the edge itself. That's the federated learning approach, and there's a lot of research going on in that area.

Then there's multimodal RLHF, which means image, audio, text, video, and so on. There's also automated feedback-based learning, where you can collect the data automatically. A lot of companies are seeing a lot of promise here: feedback is collected within the product itself, either good or bad, and the labels collected from the product are then used either as part of the training process, when you're doing instruction-based fine-tuning with RLHF, or as part of a RAG application, where those feedback examples are used as part of the prompt, so that when the LLMs are called, the feedback serves as references for chain-of-thought reasoning.
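Going back to the federated approach mentioned above, here is a minimal federated-averaging (FedAvg-style) sketch in PyTorch: each simulated edge device fine-tunes a local copy on its own data, and a central server averages the weights and broadcasts them back. The toy model, the number of devices, and the round structure are assumptions for illustration, not details from the talk.

```python
# Minimal federated-averaging sketch: local training on "devices", weight
# averaging on a central server. In reality each device keeps its raw data
# private and only sends weight updates.
import copy
import torch

def local_update(global_model, device_data, epochs=1, lr=0.01):
    """Fine-tune a copy of the global model on one device's local data."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    x, y = device_data
    for _ in range(epochs):
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model.state_dict()

def federated_average(state_dicts):
    """Server-side merge: element-wise mean of every parameter tensor."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return avg

# Placeholder global model and three simulated edge devices with private data.
global_model = torch.nn.Linear(16, 1)
devices = [(torch.randn(32, 16), torch.randn(32, 1)) for _ in range(3)]

for round_idx in range(5):                      # a few federated rounds
    local_states = [local_update(global_model, d) for d in devices]
    global_model.load_state_dict(federated_average(local_states))
    print(f"round {round_idx}: global weights updated from {len(devices)} devices")
```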
So what is the path forward? There is still a lot of work to do. There are a lot of frameworks out there for doing distributed training, but I think the most challenging part right now is the data collection: from researchers, engineers and coders, doctors and physicians, teachers, or professionals in any domain. We need lots of high-quality data to be collected so that the models can continuously be trained with reinforcement learning from human feedback. Getting that specialized feedback data at scale, across all the people in the world, let's say a billion people, could really improve the new versions of models that are coming out now, but that might not always be possible to get. That's why I think reinforcement-based fine-tuning with custom reward functions might be one of the new ways forward, where the model itself generates new kinds of data and hypotheses that can be used for training. There's a lot of research going on in this area as well right now.

So that's it. Thank you so much for listening. I would love to get any questions and answer them as needed, based on my experience so far. Thank you
...

Jyotirmoy Sundi

Co-founder & CTO @ Votal AI INC

Jyotirmoy Sundi's LinkedIn account


