Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone.
Welcome to the session on the evolution of Natural Language Processing.
Here I would like to cover how we have progressed over the years from
counting words to conversing with AI.
The ability to communicate through language is one of the key
characteristics of humans. For decades, the dream of training machines
with a similar understanding has fueled the field of natural language processing.
What began as rudimentary attempts to analyze text has blossomed into a
sophisticated discipline, culminating in generative AI and multimodal models
that capture our imagination today.
This journey, marked by breakthroughs and paradigm shifts, is a testament to
human creativity and the relentless pursuit of artificial intelligence
that truly understands us.
Before we begin, a quick introduction about myself.
I'm the Director of IP Strategy and Technology at UnitedLex.
I'm based in Austin, Texas, in the United States of America.
I lead the IP data science team, where we create innovative
solutions at the intersection of intellectual property and technology.
One of the products that our team has developed is Vantage for IP.
It is a patent intelligence platform that processes hundreds of thousands of
patents using natural language processing.
It provides competitive intelligence, evaluates patent
strength, and uncovers licensing opportunities amongst other things.
Two of my key focus areas are developing deep learning models to evaluate
patents and designing user-friendly platforms to convey the results.
NLP being the hot field it is at the moment can be attributed to the surge in
investment in this domain and to improvements in computer hardware, which
have increased compute capabilities.
According to the 2023 AI Index Report by the Stanford Institute for
Human-Centered AI, global private AI investment was about $92 billion in 2022.
This amount was 18 times greater than it was 10 years back in 2013.
The amount of computational power used by significant AI and machine
learning systems has increased exponentially over the last half decade.
As pre-trained large language models have gotten bigger and more capable
over the years, they have required bigger and more elaborate training sets.
For example, when OpenAI released its original GPT-1 model in 2018,
the model was trained on BookCorpus, which is a collection of
about 7,000 unpublished books comprising about 5 GB of text.
GPT-2, launched the next year in 2019, was trained on an approximately 40
GB training set created from links scraped from Reddit.
The next year, GPT-3 was released, which was pre-trained on more than
500 GB of text from open sources, including BookCorpus, Common
Crawl, Wikipedia, and WebText2.
While official training set details are scant for GPT-4, which
launched in 2023, the training set has been reported to be 13 trillion tokens,
which would be a few terabytes of data.
The early days of NLP were characterized by statistical models. Techniques like
bag of words, prevalent in the late 20th and early 21st centuries,
treated text as an unordered collection of words, disregarding grammar and
syntax. While simple to implement, bag of words enabled basic tasks like
text classification and information retrieval by analyzing word frequencies.
Term frequency-inverse document frequency (TF-IDF) built upon this by weighing
words based on their importance within a document and across a corpus.
These models, though limited in their understanding of context
and the relationships between words, laid the foundational statistical
groundwork for future advancements.
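To make this concrete, here is a minimal sketch, assuming scikit-learn is installed, of building bag-of-words and TF-IDF representations; the toy documents are hypothetical.

```python
# Minimal sketch of bag-of-words and TF-IDF representations.
# Assumes scikit-learn is installed; the toy corpus is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the patent claims a wireless charging method",
    "the wireless method charges a battery",
    "a court reviewed the patent claims",
]

# Bag of words: raw word counts, with order and grammar ignored.
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: counts reweighted by how informative a word is across the corpus.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))
```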
The limitations of pure statistical approaches became increasingly apparent
when tackling more complex NLP tasks.
The inability to capture semantic meaning and the insensitivity to word
order led to the development of more sophisticated techniques.
N-gram models represented a step forward by considering sequences of N
consecutive words, thus incorporating some contextual information.
Perplexity is a common metric for evaluating language models.
It estimates how well a model can predict the next word in a sequence.
Lower perplexity scores generally indicate a better model, meaning the model is
more confident in its predictions.
A perplexity score of 120-plus for N-gram models highlights
the limitations of these early models.
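As a quick worked example, perplexity is simply the exponential of the average negative log-likelihood the model assigns to each word; the per-word probabilities below are made up purely for illustration.

```python
# Worked example of perplexity: exp of the average negative log-likelihood.
# The per-word probabilities below are made up for illustration.
import math

word_probs = [0.1, 0.05, 0.2, 0.01]  # P(word_i | previous words) from some model

avg_neg_log_likelihood = -sum(math.log(p) for p in word_probs) / len(word_probs)
perplexity = math.exp(avg_neg_log_likelihood)
print(f"perplexity = {perplexity:.1f}")  # lower is better
```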
These models also suffered from the curse of dimensionality as vocabulary
size increased, leading to data sparsity issues. For example, with a
vocabulary of 10,000 unique words, each document will be represented by a
vector of 10,000 dimensions. If we have only a few hundred or a few thousand
documents, each document vector will be mostly zeros. Trying to find
similarities or train a classifier on such sparse, high-dimensional
data can be very challenging.
With sparse data in a high dimensional space, the data points or document
vectors become increasingly far apart from each other.
This makes it difficult for distance-based algorithms to find meaningful
similarities between documents.
The concept of a neighborhood becomes less reliable.
Further, storing and manipulating these large sparse vectors can consume
vast amounts of memory, potentially exceeding the available resources.
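Here is a back-of-the-envelope sketch of that sparsity; the vocabulary size matches the example above, while the average number of unique words per document is an assumption.

```python
# Back-of-the-envelope illustration of sparsity in bag-of-words vectors.
# The average unique-word count per document is an assumed value.
vocab_size = 10_000
avg_unique_words_per_doc = 150  # assumed for short documents

nonzero_fraction = avg_unique_words_per_doc / vocab_size
print(f"{nonzero_fraction:.1%} of each {vocab_size:,}-dimensional vector is non-zero")
# -> 1.5% non-zero, i.e. roughly 98.5% of every document vector is zeros
```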
The late 2000s and early 2010s witnessed a paradigm shift towards machine
learning, particularly with the rise of deep learning. Word embeddings
such as Word2Vec and GloVe revolutionized how words were represented.
These techniques learned dense, low-dimensional vector
representations of words based on their co-occurrence in large text corpora.
Crucially, these embeddings capture semantic relationships, allowing models to
understand that "king" is similar to "queen" in a way that simple statistical
models could not.
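Here is a minimal sketch, assuming the gensim library and one of its downloadable pre-trained GloVe models, of how such embeddings expose the king and queen relationship.

```python
# Minimal sketch of querying pre-trained word embeddings with gensim.
# Assumes gensim is installed; "glove-wiki-gigaword-100" is one of its
# downloadable pre-trained models (downloaded on first use).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

print(vectors.similarity("king", "queen"))  # high cosine similarity
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# The classic analogy: king - man + woman lands close to queen.
```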
Designed for sequential data, recurrent neural networks (RNNs)
utilize a loop architecture that allows them to remember past
information, making them useful for tasks like language modeling and
predictive text, where context matters.
A significant limitation of RNNs is their struggle with long-range
dependencies, primarily due to the vanishing gradient problem, which
causes them to lose information from earlier parts of longer sequences.
Long short-term memory, or LSTM, networks were developed as an advanced RNN
architecture specifically to overcome the vanishing gradient problem.
Their introduction of memory cells, capable of retaining information for
extended durations, and intelligent gating mechanisms that control information
flow enables them to effectively learn long-term relationships.
The best reported perplexity score of RNN models on the Penn Treebank
dataset was around 73.4, and a few LSTM models have achieved
perplexities in the forties, such as 44.9.
While LSTMs offer a substantial improvement over RNNs in handling long
sequences, they come with a higher computational cost.
Their complex gating mechanisms require more computational
resources and training time.
Moreover, their inherent sequential processing still
hinders parallel computation and can limit their effectiveness
with extremely long dependencies.
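Here is a minimal sketch, using PyTorch, of an LSTM-based next-word language model; the vocabulary size and layer dimensions are placeholder values.

```python
# Minimal sketch of an LSTM language model in PyTorch.
# Vocabulary size and layer dimensions are placeholder values.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        output, _ = self.lstm(x)    # gated memory carries long-range context
        return self.head(output)    # logits over the next word at each step

model = LSTMLanguageModel()
dummy_batch = torch.randint(0, 10_000, (2, 20))  # 2 sequences of 20 tokens
logits = model(dummy_batch)
print(logits.shape)                               # torch.Size([2, 20, 10000])
```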
The transformer architecture, introduced in 2017 with the groundbreaking paper
"Attention Is All You Need," marked another significant leap.
By leveraging the attention mechanism, transformers allowed the model to
weigh the importance of different words in a sequence when processing
it, regardless of their distance.
This overcame the limitations of RNNs and enabled parallel processing,
leading to significantly improved performance and scalability.
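Here is a minimal sketch of the scaled dot-product attention at the heart of the transformer, written with plain PyTorch tensors; the shapes are illustrative, and a real transformer adds multi-head projections, residual connections, and layer normalization.

```python
# Minimal sketch of scaled dot-product attention with PyTorch tensors.
# Shapes are illustrative; a full transformer adds multiple heads,
# learned projections, residual connections, and layer normalization.
import math
import torch

def scaled_dot_product_attention(query, key, value):
    d_k = query.size(-1)
    # Every position attends to every other position, regardless of distance.
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ value, weights

seq_len, d_model = 6, 16
q = k = v = torch.randn(1, seq_len, d_model)   # self-attention: Q, K, V from the same sequence
context, attn = scaled_dot_product_attention(q, k, v)
print(context.shape, attn.shape)               # (1, 6, 16) (1, 6, 6)
```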
BLEU, which stands for bilingual evaluation understudy, is a metric used to
evaluate the quality of machine-translated text.
It measures how similar the machine-translated text is to a set of
high-quality human translations.
The transformer architecture achieved a BLEU score of 41.8 on
English-to-French translation.
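For reference, here is a minimal sketch of computing a sentence-level BLEU score with NLTK; the example sentences are made up, and reported results like the 41.8 above are corpus-level scores over a full test set.

```python
# Minimal sketch of a sentence-level BLEU score with NLTK.
# The sentences are made up; published scores such as 41.8 are
# corpus-level BLEU over an entire test set.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]
candidate = ["the", "cat", "is", "on", "the", "mat"]

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.2f}")  # measures n-gram overlap with the reference
```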
Here is an example of the attention mechanism following long distance
dependencies from the paper.
Different colors represent different heads of the multi-head attention.
Attention is shown only for the word "making."
As we can see, many of the attention heads attend to the distant
dependency of the word "making," completing the phrase "making ... more difficult."
Here is another example of two attention heads involved in anaphora resolution.
On the left side, we see full attentions for one of the heads, head five, and
on the right side we see isolated attentions from just the word "its" for
attention heads five and six.
While head five relates "its" to only the word "law," head six relates it to
both the words "law" and "application" fairly strongly.
BERT, or Bidirectional Encoder Representations from Transformers,
achieved 94.9% accuracy on sentiment analysis
benchmarks, surpassing previous models by a significant margin.
BERT introduced several key innovations that significantly advanced the field
of natural language understanding.
Bidirectional training.
Unlike previous language models that process text unidirectionally,
either left to right or right to left, BERT is designed to learn
deep bidirectional representations.
This means it considers the context of each word from both the words
that precede it and the words that follow it, simultaneously, in all
layers of the transformer network.
This bidirectional understanding allows BERT to grasp the meaning of words
based on their full context within a sequence, leading to a more nuanced and
accurate understanding of language.
For example, BERT can better differentiate the meaning of "bank" in
"river bank" versus "financial bank."
Masked language modeling.
To enable bidirectional training, BERT employs a novel pre-training
task called masked language modeling.
During pre-training, a certain percentage, typically 15 percent, of the words
in the input sequence are randomly masked.
The model's objective is to predict the original masked words based
on the context provided by the unmasked words in the sentence.
This forces the model to understand the meaning of each word by considering its
surrounding context from both directions.
This is crucial for learning deep bidirectional representations.
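Here is a minimal sketch, assuming the Hugging Face transformers library, of querying a pre-trained BERT through its masked language modeling objective.

```python
# Minimal sketch of masked language modeling with a pre-trained BERT.
# Assumes the Hugging Face transformers library (model downloads on first use).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token using context from both sides.
for prediction in fill_mask("The boat was tied up at the river [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```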
Pre-training and fine-tuning.
BERT established a powerful pre-training and fine-tuning paradigm.
The model is first pre-trained on a massive amount of unlabeled text data
using the masked language modeling task.
This allows the model to learn general purpose language representations.
Then the pre-trained BERT model can be fine-tuned on smaller,
task-specific labeled datasets for various downstream NLP tasks,
for example text classification, question answering, named entity recognition,
and so on, by adding a simple task-specific output layer. This approach
significantly reduced the need for large labeled datasets for each specific
task and led to substantial performance improvements across a wide range
of NLP applications.
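Here is a minimal sketch of that pattern, again assuming the transformers library: loading BERT with a sequence classification head attaches a small task-specific output layer on top of the pre-trained encoder. The label set and example sentence are hypothetical.

```python
# Minimal sketch of the pre-train / fine-tune pattern with transformers.
# Loading BERT with a sequence-classification head adds a small task-specific
# output layer on top of the pre-trained encoder; the labels are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # e.g. positive / negative sentiment

batch = tokenizer(["This patent claim is remarkably broad."],
                  return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)  # fine-tuning would backprop this loss
print(outputs.loss.item(), outputs.logits.shape)
```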
The GPT series developed by OpenAI has shown a remarkable evolution in
its capabilities, primarily driven by scaling model size and training data.
GPT-1, released in 2018, had 125 million parameters.
It demonstrated the effectiveness of pre-training a transformer model for
language understanding and generation, followed by task-specific fine-tuning.
GPT-2, released in 2019, had 1.5 billion parameters.
It showed significant improvements in text generation quality, coherence,
and the ability to perform various downstream tasks with little to
no task-specific fine-tuning.
GPT-3, released the next year in 2020, had 175 billion parameters.
It showed remarkable few-shot and even zero-shot learning capabilities.
It could perform a wide range of NLP tasks, including translation, question
answering, and code generation.
With a few examples or even natural language instructions, it showed
improved coherence and context understanding compared to its predecessors.
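To illustrate what few-shot prompting looks like in practice, here is a hypothetical prompt of the kind GPT-3 could complete without any task-specific fine-tuning.

```python
# Hypothetical few-shot prompt: the task is specified entirely in the input,
# with a handful of examples, and no task-specific fine-tuning is needed.
few_shot_prompt = """Translate English to French.

English: The contract expires next year.
French: Le contrat expire l'annee prochaine.

English: The patent was granted in 2020.
French: Le brevet a ete accorde en 2020.

English: The invention reduces power consumption.
French:"""

# The prompt would be sent to a GPT-style completion endpoint; the model is
# expected to continue with the French translation of the last sentence.
print(few_shot_prompt)
```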
SuperGLUE is a benchmark dataset for evaluating natural language understanding.
It focuses on tasks that require deeper reasoning and understanding of
language, including question answering, coreference resolution,
and natural language inference.
While GPT-1 had a SuperGLUE score of 45.2, GPT-3 had a SuperGLUE score of 71.8.
This scaling trend in the GPT series has been a major driving force behind the
advancements in generative AI and has paved the way for even larger and more
capable models like GPT-4 and beyond.
The latest frontier in NLP is the rise of multimodal models.
These models go beyond processing just text and can understand and generate
content across different modalities, such as images, audio, and video.
By jointly learning representations from diverse data sources, multimodal
models can achieve a more holistic understanding of the world and
enable exciting new applications.
Combining text and image data has improved healthcare diagnostic accuracy.
In a recent study, multimodal models were used to generate assessments based on
medical images and clinical observations, which were then evaluated against
physician-authored diagnoses.
In 80-plus percent of cases, the multimodal assessments outperformed the human
diagnoses. AI-driven diagnostics minimize misdiagnosis, leading to fewer
malpractice claims and unnecessary treatments.
Automated data processing speeds up disease detection, decreasing
patient wait times and hospital stays.
AI-enabled hospitals report a 30 to 40% improvement in workflow efficiency,
allowing providers to treat more patients without increasing staff workload.
Visual question answering, or VQA, models, which understand images
and answer questions about them, have achieved 84.3% accuracy.
This has wide applications in education, accessibility tools, and
the visually impaired community.
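Here is a minimal sketch of visual question answering, assuming the transformers library with a ViLT VQA checkpoint; the image path and question are placeholders.

```python
# Minimal sketch of visual question answering with transformers.
# Assumes the ViLT VQA checkpoint; the image path and question are placeholders.
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

answers = vqa(image="street_scene.jpg",  # any local image path or URL
              question="How many people are crossing the street?")
print(answers[0])  # top answer with a confidence score
```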
VideoChat2 achieved 51.1% on MVBench, surpassing the previous
state-of-the-art model by over 15% in accuracy.
This benchmark evaluates complex temporal reasoning in videos.
LMMs, or large multimodal models, show promise in generating accurate and
contextually relevant closed captions.
Their ability to understand the video context can lead to better
handling of similar sounding words, speaker identification
and overall caption quality.
Multimodal retrieval enhanced by LMMs aims to retrieve relevant
content, images, or videos based on complex, multifaceted queries,
including text and visual elements.
Large language models, while powerful, are computationally expensive and
memory-intensive. To make them more practical for various applications
and deployments, researchers have developed several efficiency techniques.
Here is an elaboration on some of these techniques.
Model compression or pruning.
Pruning aims to reduce the number of parameters in a model by identifying
and removing unimportant weights or even entire neurons and layers.
This leads to reduced model size, faster inference, and a lower memory footprint.
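Here is a minimal sketch of magnitude-based pruning using PyTorch's pruning utilities; the layer size and sparsity level are arbitrary choices for illustration.

```python
# Minimal sketch of magnitude-based pruning with PyTorch utilities.
# The layer size and sparsity level are arbitrary choices for illustration.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Zero out the 30% of weights with the smallest magnitude (least important).
prune.l1_unstructured(layer, name="weight", amount=0.3)
prune.remove(layer, "weight")  # make the pruning permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"share of zero weights: {sparsity:.0%}")
```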
Knowledge distillation.
Knowledge distillation involves training a smaller, more efficient
student model to mimic the behavior of a larger, more complex teacher model.
The teacher model, having learned from a vast amount of data, transfers
its knowledge to the student.
This leads to a smaller model size (the student model has significantly
fewer parameters), faster inference, and improved performance.
The student can sometimes generalize better than if trained directly on the
target task's limited data, as it benefits from the teacher's broader knowledge.
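Here is a minimal sketch of a distillation loss in PyTorch; the teacher and student logits are random stand-ins, and the temperature and mixing weight are typical but arbitrary hyperparameters.

```python
# Minimal sketch of a knowledge distillation loss in PyTorch.
# Teacher/student logits are random stand-ins; temperature T and the
# mixing weight alpha are typical but arbitrary hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the student matches the teacher's softened distribution.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard targets: the student still learns from the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(distillation_loss(student_logits, teacher_logits, labels).item())
```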
Quantization reduces the precision of the numerical representations used for
the model's weights and activations.
Instead of using floating point numbers like 32 bit or 16 bit, the
model uses lower precision numbers like eight bit or even four bit.
This leads to reduced model size, faster inference, and lower power consumption,
which is crucial for edge devices.
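Here is a minimal sketch of post-training dynamic quantization in PyTorch; the toy model stands in for a real network's linear layers.

```python
# Minimal sketch of post-training dynamic quantization in PyTorch.
# The toy model is a stand-in for a real transformer's linear layers.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

# Convert the Linear layers' weights from 32-bit floats to 8-bit integers.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

print(quantized)  # Linear layers are replaced by dynamically quantized versions
```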
Sparse attention.
It is an efficiency technique used in large language models to reduce
the computational cost of the self-attention mechanism, especially when
dealing with long input sequences.
The standard self attention mechanism in transformers has a
quadratic time and memory complexity.
Sparse attention aims to alleviate this quadratic complexity by only computing
attention scores for a subset of possible query-key pairs rather than all of them.
The selection of these pairs is based on a specific pattern or learned criteria,
aiming to retain the most important contextual information while significantly
reducing the number of computations.
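Here is a minimal sketch of one common sparsity pattern, a sliding-window mask, which restricts each position to attend only to its neighbors; the window size and sequence length are illustrative.

```python
# Minimal sketch of a sliding-window (local) sparse attention mask.
# Each position attends only to neighbors within a fixed window, cutting the
# quadratic cost; window size and sequence length are illustrative.
import torch

def sliding_window_mask(seq_len, window):
    idx = torch.arange(seq_len)
    # True where |i - j| <= window, i.e. the query-key pairs we actually compute.
    return (idx[:, None] - idx[None, :]).abs() <= window

mask = sliding_window_mask(seq_len=8, window=2)
print(mask.int())
print(f"computed pairs: {mask.sum().item()} of {8 * 8}")
```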
Throughout this session, I have mentioned various metrics and scores like
perplexity, BLEU, GLUE, and SuperGLUE.
As the models have evolved, the metrics have evolved too to
effectively compare different models.
Here are a few metrics which will be in focus for comparing future NLP models.
Training efficiency and sustainability.
Energy consumption will be a key metric to watch.
Studies have shown that training large language models can have
significant carbon footprint.
Researchers at the University of Massachusetts Amherst estimated that
training a single large language model could emit around 300 tons of CO2
equivalent.
Reasoning capabilities.
New benchmarks will emerge to assess a model's ability to perform multi-step
causal reasoning, which involves understanding how different actions
or events cause specific outcomes.
Current evaluations often miss these capabilities, which make
up a large portion of human reasoning.
Ethical alignment.
Future metrics will evaluate models on their ethical considerations,
including fairness, transparency, and safety.
This should lead to creating responsible AI systems that align with human
values across diverse cultural contexts.
This brings us to the end of this session.
I would like to thank every one of you for attending.
From the humble beginnings of counting words to the sophisticated
capabilities of generative and multimodal AI, the evolution of
NLP has been a remarkable journey.
Each stage built upon the limitations of its predecessors, driven by the desire
to create machines that can truly understand and interact with humans.
As we continue to push the boundaries of AI, the future of NLP promises even more
exciting developments, bringing us closer to truly natural human and
machine interaction.
Thank you.