Transcript
This transcript was autogenerated. To make changes, submit a PR.
With new models being released regularly, do you ever wonder what goes into the making of these LLMs?
What makes these models so powerful?
Hello everyone.
This is Antra Zaha, and today we will be deep diving into the secret sauce behind making these LLMs.
We will be asking what refinements each new model makes, and discussing one of the most common practices used in all language models: the inclusion of code data.
Including code data in the pre-training data mixture, even for models not specifically designed for code, has become a very common practice.
For example, state-of-the-art models such as PaLM, Gopher, and BLOOM, which are not intended to support code generation, include a percentage of code data.
For instance, Llama 3 has more code data compared to Llama 2, which brings us to the question: to code, or not to code?
So to analyze the impact of code, this paper conducts extensive ablations and evaluates language models along different benchmarks, namely natural language reasoning tasks, world knowledge tasks, and code generation.
Before moving ahead, I want to give you a concise overview of the phases of training an LLM. It all starts with the pre-training phase.
In this phase, the model learns language structure and semantics from vast text data.
This results in a broad understanding of natural language.
Then the second step is the fine-tuning phase.
Here the model adapts to a specific task.
This helps with specialized task performance.
Third is continual pre-training.
Here, new or updated knowledge is added without losing the previous knowledge acquired in the pre-training phase.
Then we have the fourth stage, which is the cooldown phase.
During cooldown, the model's learning rate is gradually reduced and high-quality datasets are given more priority.
Last, we have the evaluation and fine-tuning phase.
The best analogy to relate to all this would be the phases of our own human learning.
The lifelong lessons learned from our parents and our school life are the pre-training and fine-tuning, the building blocks of our learning journey.
College is continual pre-training: knowledge is added or updated, but the previous knowledge we learned from our parents and at school is not lost.
What we continue further in our profession is the cooldown phase: the learning rate is gradually reduced compared to our school life, and the high-quality dataset, that is, our specialization, is given much more preference.
Lastly, we always fine-tune and evaluate our learning based on our experiences.
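To make the cooldown phase a little more concrete, here is a minimal Python sketch of the idea: the learning rate is annealed and the sampling weights shift toward higher-quality data. All the numbers, mixture names, and the linear schedule shape are my own illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a cooldown phase: the learning rate is linearly
# annealed toward a small value while high-quality data is sampled more often.
# Every number and name here is an illustrative assumption.

def cooldown_lr(step, cooldown_steps, peak_lr=3e-4, final_lr=3e-5):
    """Linearly decay the learning rate over the cooldown phase."""
    frac = min(step / cooldown_steps, 1.0)
    return peak_lr + frac * (final_lr - peak_lr)

# During cooldown, higher-quality sources get a larger sampling weight.
pretraining_mix = {"web_text": 0.80, "high_quality_text": 0.15, "code": 0.05}
cooldown_mix    = {"web_text": 0.40, "high_quality_text": 0.40, "code": 0.20}

for step in range(0, 1001, 250):
    print(step, round(cooldown_lr(step, cooldown_steps=1000), 6))
```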
So the overview of this experimental framework is that we evaluate the impact of code along several axes: code proportions, the stage at which code is introduced, code quality and properties, and the model scale.
In this experimental setup, pre-training is split into two phases: continued pre-training and the cooldown phase.
Continued pre-training refers to a model that is initialized from a pre-trained model and trained for a specific token budget.
The first experiment is the impact of initializing with code pre-trained models.
To understand this better, imagine four robots whose trainers have trained them in different ways.
First is the text-only language model, which has been trained on text-only data.
Second, we have the balanced language model, trained on an equal ratio of code and text data.
Third, we have the balanced-initialized text model: this robot has been trained on an equal ratio of code and text data and then continually pre-trained on text data.
Then we have the code-initialized text model: it is first trained only on code data and then continually pre-trained on text data.
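As a rough way to picture these four training recipes, here is a hedged Python sketch; the stage structure mirrors the four robots, but the dictionary names and the way I write them as stage-wise mixes are my own illustration, not the paper's exact configuration.

```python
# Four illustrative training recipes, mirroring the four "robots".
# Each stage is a code/text mixing ratio; names and structure are made up.

recipes = {
    "text_only":        [{"code": 0.0, "text": 1.0}],                 # text LM
    "balanced_only":    [{"code": 0.5, "text": 0.5}],                 # balanced LM
    "balanced_to_text": [{"code": 0.5, "text": 0.5},                  # balanced init,
                         {"code": 0.0, "text": 1.0}],                 # then text-only
    "code_to_text":     [{"code": 1.0, "text": 0.0},                  # code-only init,
                         {"code": 0.0, "text": 1.0}],                 # then text-only
}

for name, stages in recipes.items():
    print(name, "->", " then ".join(f"{s['code']:.0%} code" for s in stages))
```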
Let's see what happens.
During exam time, for natural language reasoning, the code-then-text robot has the best performance, followed by the balanced-then-text robot.
For the world knowledge task, the balanced-then-text robot has the best performance.
And for code generation, the balanced-only robot has the best performance.
Then we have the impact of scale.
We are considering two model sizes: a 470 million parameter robot and a 2.8 billion parameter robot.
Continuing with our robot analogy, it turns out the bigger robots did much better at everything.
Then we have code data proportion in pre-training.
Here we train six models for 200 billion tokens with increasing code proportions: 0%, 25%, 50%, 75%, 90%, and 100%, with the remaining proportion filled with text data.
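To see what these six mixtures look like in tokens, here is a small sketch; the 200 billion token budget and the percentages come from the talk, while the calculation and layout are my own illustration.

```python
# Six illustrative pre-training mixtures over a 200B-token budget, varying
# only the code share; the remainder is text.

TOKEN_BUDGET = 200_000_000_000  # 200B tokens

for code_frac in (0.00, 0.25, 0.50, 0.75, 0.90, 1.00):
    code_tokens = int(TOKEN_BUDGET * code_frac)
    text_tokens = TOKEN_BUDGET - code_tokens
    print(f"{code_frac:>4.0%} code -> {code_tokens / 1e9:5.0f}B code tokens, "
          f"{text_tokens / 1e9:5.0f}B text tokens")
```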
For natural language reasoning, adding code increases performance, and the best performance comes from 25% code data and 75% text data.
For world knowledge, we see an inverse relationship: performance is reduced with an increasing amount of code data, and the best is the no-code model.
For code generation, there is a linear increase in performance as the amount of code increases, so the best is the code-only model.
Then we have code quality and code properties.
So far we have been judging different proportions and combinations of code and text data, but what about the type of code data?
We compare four types of code models: the code model, trained on web-based code data; the code-plus-markup model, where 20% is markup languages such as HTML and CSS; the code-plus-adjacent model, where 15% is code-adjacent data such as GitHub issues, StackExchange, and Jupyter notebooks; and lastly the code-plus-synthetic model, where 10% is synthetically generated code.
Across the different tasks, natural language reasoning, world knowledge, and code generation, code-plus-synthetic gives the best performance.
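Here is a quick sketch of those four code-data variants as mixture dictionaries; only the headline percentages (20% markup, 15% code-adjacent, 10% synthetic) come from the talk, and the source names are placeholders I chose for illustration.

```python
# Rough sketch of the four code-data variants described above.
# Source names are illustrative; only the percentages echo the talk.

code_variants = {
    "web_code":       {"web_code": 1.00},
    "code_markup":    {"web_code": 0.80, "markup_html_css": 0.20},
    "code_adjacent":  {"web_code": 0.85, "issues_stackexchange_notebooks": 0.15},
    "code_synthetic": {"web_code": 0.90, "synthetic_code": 0.10},
}

for name, mix in code_variants.items():
    assert abs(sum(mix.values()) - 1.0) < 1e-9  # each variant sums to 100%
    print(name, mix)
```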
Next is code in the pre-training cooldown.
The paper evaluates the impact of including code in the cooldown by comparing three models: the pre-trained model before cooldown, cooldown without code data, and cooldown with 20% code data.
We find that cooldown with code data is the most beneficial.
So, the results: for the natural language reasoning task, adding 25% code data boosted natural language reasoning by 8.2%, so the balanced model is comparatively the best if we are going for natural language reasoning tasks.
For the world knowledge task, code provides a 10.1% boost, and cooldown with code was very crucial here.
Then for code performance, code-heavy models outperform text-only models by 12 times across all the code tasks.
The balanced model gave a strong performance but lagged compared to the code-only models.
So, the best recipe for code performance: on code benchmarks, code-only models were the clear winner, and balanced-then-text models were strong performers in natural language reasoning but lagged heavily in code generation.
Code data was the key differentiator in boosting overall code performance.
So the key recommendations for pre-training with code would be: include a balanced mix of code and text data from the start, use synthetic code data to improve both code and natural language tasks, and prioritize the inclusion of code in the cooldown phase to maximize the performance gains.
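Putting these recommendations together, here is a hedged sketch of what such a recipe might look like as a configuration; the 25% code share and the 20% code-in-cooldown figure echo numbers mentioned in the talk, while the remaining names and fractions are my own illustrative assumptions.

```python
# A hedged "recommended recipe" pulling together the talk's recommendations.
# The 25% code share and 20% code-in-cooldown echo the talk; the rest is
# an illustrative assumption, not a number from the paper.

recommended_recipe = {
    # 1) balanced mix of code and text from the start
    "pretraining_mix": {"text": 0.75, "code": 0.25},
    # 2) include synthetic code inside the code portion
    "code_composition": {"web_code": 0.90, "synthetic_code": 0.10},
    # 3) keep code in the cooldown phase
    "cooldown_mix": {"high_quality_text": 0.80, "code": 0.20},
}

for stage, mix in recommended_recipe.items():
    print(stage, mix)
```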
A future research area could be seeing what happens if we scale up synthetic code generation; right now we are only seeing up to 20%, so what if we increase it?
We can also explore, as new models are being released with different training recipes, what happens with any task-specific tuning, and we can always introduce more advanced cooldown phases.
So the final takeaways from my side would be: code data significantly improves AI models across all tasks, not just code-specific ones.
A very common example would be ChatGPT.
ChatGPT is heavily used by programmers as well as by a general audience, but it was not intended to support code-related tasks; it was a general model meant for the general public.
And not just ChatGPT: any other model released for general-purpose tasks includes some amount of code data in its training data.
So balanced models with both text and code are best for general tasks, while code-heavy models dominate the coding benchmarks.
If you want to go for a task that requires heavy coding, in a specific, coding-related area only, let's say you want to create a software engineer or automate any development pipelines, then my recommendation would be to go with code-heavy models, which are released for exactly these purposes.
And the cooldown phase, particularly with code, is critical for optimal model performance.
That's it from my side.
Thank you everyone.