Transcript
This transcript was autogenerated. To make changes, submit a PR.
With new models being released regularly, do you ever wonder what goes into the making of these LLMs?
What makes these models so powerful?
Hello everyone.
This is Antra Zaha, and today we will be deep diving into the secret sauce behind making these LLMs.
We will be asking what refinements each new model makes, and discussing one of the most common practices used in all language models: the inclusion of code data.
Including code data in the pre-training data mixture, even for models not specifically designed for code, has become a very common practice.
For example, state-of-the-art models such as PaLM, Gopher, and BLOOM, which are not intended to support code generation, include a percentage of code data.
For instance, Llama 3 has more code data compared to Llama 2, which brings us to the question: to code, or not to code?
So to analyze the impact of code, this paper conducts extensive ablations and evaluates language models along different benchmarks, namely natural language reasoning tasks, world knowledge tasks, and code generation.
Before moving ahead, I want to give you a concise overview of the phases of training an LLM. It all starts with the pre-training phase.
In this phase, the model learns language structure and semantics from vast text data.
This results in a broad understanding of natural language.
Then the second step is the fine-tuning phase.
Here the model adapts to a specific task.
This helps with specialized task performance.
Third is continual pre-training.
Here, new or updated knowledge is added without losing the previous knowledge acquired in the pre-training phase.
Then we have the fourth stage, which is the cooldown phase.
During cooldown, the model's learning rate is gradually reduced and high-quality datasets are given more priority.
Last, we have the evaluation and fine-tuning phase.
The best analogy to relate to all this would be the phases of our own human learning.
The lifelong lessons learned from our parents and our school life are the pre-training and fine-tuning, the building blocks of our learning journey.
College is continual pre-training: knowledge is added or updated, but the previous knowledge we learned from our parents and at school is not lost.
What we continue further in our profession is the cooldown phase: the learning rate is gradually reduced compared to our school life, and the high-quality dataset, that is, our specialization, is given much more preference.
Lastly, we always fine-tune and evaluate our learning based on our experiences.
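To make the cooldown phase a little more concrete, here is a minimal Python sketch of the idea: the learning rate is annealed and the sampling weights shift toward higher-quality data. All the numbers, mixture names, and the linear schedule shape are my own illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a cooldown phase: the learning rate is linearly
# annealed toward a small value while high-quality data is sampled more often.
# Every number and name here is an illustrative assumption.

def cooldown_lr(step, cooldown_steps, peak_lr=3e-4, final_lr=3e-5):
    """Linearly decay the learning rate over the cooldown phase."""
    frac = min(step / cooldown_steps, 1.0)
    return peak_lr + frac * (final_lr - peak_lr)

# During cooldown, higher-quality sources get a larger sampling weight.
pretraining_mix = {"web_text": 0.80, "high_quality_text": 0.15, "code": 0.05}
cooldown_mix    = {"web_text": 0.40, "high_quality_text": 0.40, "code": 0.20}

for step in range(0, 1001, 250):
    print(step, round(cooldown_lr(step, cooldown_steps=1000), 6))
```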
So the overview of this experimental framework is that we evaluate the impact of code along several axes: code proportions, the stage at which code is introduced, code quality and properties, and the model scale.
In this experimental setup, pre-training is split into two phases: continued pre-training and the cooldown phase.
Continued pre-training refers to a model that is initialized from a pre-trained model and trained for a specific token budget.
The first experiment is the impact of initializing with code pre-trained models.
To understand this better, imagine four robots whose trainers have trained them in different ways.
First is the text-only language model, which has been trained on text-only data.
Second, we have the balanced language model, trained on an equal ratio of code and text data.
Third, we have the balanced-initialized text model: this robot has been trained on an equal ratio of code and text data and then continually pre-trained on text data.
Then we have the code-initialized text model: it is first trained only on code data and then continually pre-trained on text data.
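As a rough way to picture these four training recipes, here is a hedged Python sketch; the stage structure mirrors the four robots, but the dictionary names and the way I write them as stage-wise mixes are my own illustration, not the paper's exact configuration.

```python
# Four illustrative training recipes, mirroring the four "robots".
# Each stage is a code/text mixing ratio; names and structure are made up.

recipes = {
    "text_only":        [{"code": 0.0, "text": 1.0}],                 # text LM
    "balanced_only":    [{"code": 0.5, "text": 0.5}],                 # balanced LM
    "balanced_to_text": [{"code": 0.5, "text": 0.5},                  # balanced init,
                         {"code": 0.0, "text": 1.0}],                 # then text-only
    "code_to_text":     [{"code": 1.0, "text": 0.0},                  # code-only init,
                         {"code": 0.0, "text": 1.0}],                 # then text-only
}

for name, stages in recipes.items():
    print(name, "->", " then ".join(f"{s['code']:.0%} code" for s in stages))
```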
Let's see what happens.
During exam time, for natural language reasoning, the code-then-text robot has the best performance, followed by the balanced-then-text robot.
For the world knowledge task, the balanced-then-text robot has the best performance.
And for code generation, the balanced-only robot has the best performance.
Then we have the impact of scale.
We are considering two model sizes: a 470 million parameter robot and a 2.8 billion parameter robot.
Continuing with our robot analogy, it turns out the bigger robots did much better at everything.
Then we have code data proportion in pre-training.
Here we train six models for 200 billion tokens with increasing code proportions: 0%, 25%, 50%, 75%, 90%, and 100%, with the remaining proportion filled with text data.
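To see what these six mixtures look like in tokens, here is a small sketch; the 200 billion token budget and the percentages come from the talk, while the calculation and layout are my own illustration.

```python
# Six illustrative pre-training mixtures over a 200B-token budget, varying
# only the code share; the remainder is text.

TOKEN_BUDGET = 200_000_000_000  # 200B tokens

for code_frac in (0.00, 0.25, 0.50, 0.75, 0.90, 1.00):
    code_tokens = int(TOKEN_BUDGET * code_frac)
    text_tokens = TOKEN_BUDGET - code_tokens
    print(f"{code_frac:>4.0%} code -> {code_tokens / 1e9:5.0f}B code tokens, "
          f"{text_tokens / 1e9:5.0f}B text tokens")
```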
For natural language reasoning, adding code increases performance, and the best performance comes from 25% code data and 75% text data.
For world knowledge, we see an inverse relationship: performance is reduced with an increasing amount of code data, and the best is the no-code model.
For code generation, there is a linear increase in performance as the amount of code increases, so the best is the code-only model.
Then we have code quality and code properties.
So far we have been judging different proportions and combinations of code and text data, but what about the type of code data?
We compare four types of code models: the code model, trained on web-based code data; the code-plus-markup model, where 20% is markup languages such as HTML and CSS; the code-plus-adjacent model, where 15% is code-adjacent data such as GitHub issues, StackExchange, and Jupyter notebooks; and lastly the code-plus-synthetic model, where 10% is synthetically generated code.
Across the different tasks, natural language reasoning, world knowledge, and code generation, code-plus-synthetic gives the best performance.
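Here is a quick sketch of those four code-data variants as mixture dictionaries; only the headline percentages (20% markup, 15% code-adjacent, 10% synthetic) come from the talk, and the source names are placeholders I chose for illustration.

```python
# Rough sketch of the four code-data variants described above.
# Source names are illustrative; only the percentages echo the talk.

code_variants = {
    "web_code":       {"web_code": 1.00},
    "code_markup":    {"web_code": 0.80, "markup_html_css": 0.20},
    "code_adjacent":  {"web_code": 0.85, "issues_stackexchange_notebooks": 0.15},
    "code_synthetic": {"web_code": 0.90, "synthetic_code": 0.10},
}

for name, mix in code_variants.items():
    assert abs(sum(mix.values()) - 1.0) < 1e-9  # each variant sums to 100%
    print(name, mix)
```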
Next is code in the pre-training cooldown.
The paper evaluates the impact of including code in the cooldown by comparing three models: the pre-trained model before cooldown, cooldown without code data, and cooldown with 20% code data.
We find that cooldown with code data is the most beneficial.
So, the results: for the natural language reasoning task, adding 25% code data boosted natural language reasoning by 8.2%, so the balanced model is comparatively the best if we are going for natural language reasoning tasks.
For the world knowledge task, code provides a 10.1% boost, and cooldown with code was very crucial here.
Then for code performance, code-heavy models outperform text-only models by 12 times across all the code tasks.
The balanced model gave a strong performance but lagged compared to the code-only models.
So, the best recipe for code performance: on code benchmarks, code-only models were the clear winner, and balanced-then-text models were strong performers in natural language reasoning but lagged heavily in code generation.
Code data was the key differentiator in boosting overall code performance.
So the key recommendations for pre-training with code would be: include a balanced mix of code and text data from the start, use synthetic code data to improve both code and natural language tasks, and prioritize the inclusion of code in the cooldown phase to maximize the performance gains.
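Putting these recommendations together, here is a hedged sketch of what such a recipe might look like as a configuration; the 25% code share and the 20% code-in-cooldown figure echo numbers mentioned in the talk, while the remaining names and fractions are my own illustrative assumptions.

```python
# A hedged "recommended recipe" pulling together the talk's recommendations.
# The 25% code share and 20% code-in-cooldown echo the talk; the rest is
# an illustrative assumption, not a number from the paper.

recommended_recipe = {
    # 1) balanced mix of code and text from the start
    "pretraining_mix": {"text": 0.75, "code": 0.25},
    # 2) include synthetic code inside the code portion
    "code_composition": {"web_code": 0.90, "synthetic_code": 0.10},
    # 3) keep code in the cooldown phase
    "cooldown_mix": {"high_quality_text": 0.80, "code": 0.20},
}

for stage, mix in recommended_recipe.items():
    print(stage, mix)
```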
A future research area could be seeing what happens if we scale up synthetic code generation; right now we are only seeing up to 20%, so what if we increase it?
We can also explore, as new models are being released with different training recipes, what happens with any task-specific tuning, and we can always introduce more advanced cooldown phases.
So the final takeaways from my side would be: code data significantly improves AI models across all tasks, not just code-specific ones.
A very common example would be ChatGPT.
ChatGPT is heavily used by programmers as well as by a general audience, but it was not intended to support code-related tasks; it was a general model meant for the general public.
And not just ChatGPT: any other model released for general-purpose tasks includes some amount of code data in its training data.
So balanced models with both text and code are best for general tasks, while code-heavy models dominate the coding benchmarks.
If you want to go for a task that requires heavy coding, in a specific, coding-related area only, let's say you want to create a software engineer or automate any development pipelines, then my recommendation would be to go with code-heavy models, which are released for exactly these purposes.
And the cooldown phase, particularly with code, is critical for optimal model performance.
That's it from my side.
Thank you everyone.