Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone.
Welcome to the session on the evolution of Natural Language Processing.
Here I would like to cover how we have progressed over the years from
counting words to conversing with AI.
The ability to communicate through language is one of the key
characteristics of humans. For decades, the dream of training machines
with a similar understanding has fueled the field of natural language processing.
What began as rudimentary attempts to analyze text has blossomed into a
sophisticated discipline, culminating in generative AI and multimodal models
that capture our imagination today.
This journey, marked by breakthroughs and paradigm shifts, is a testament to
human creativity and the relentless pursuit of artificial intelligence
that truly understands us.
Before we begin, a quick introduction about myself.
I'm the Director of IP Strategy and Technology at UnitedLex.
I'm based in Austin, Texas, in the United States of America.
I lead the IP data science team, where we create innovative
solutions at the intersection of intellectual property and technology.
One of the products that our team has developed is Vantage for IP.
It is a patent intelligence platform that processes hundreds of thousands of
patents using natural language processing.
It provides competitive intelligence, evaluates patent
strength, and uncovers licensing opportunities amongst other things.
Two of my key focus areas are developing deep learning models to evaluate
patents and designing user-friendly platforms to convey the results.
NLP being the hot field it is at the moment can be attributed to the surge in
investment in this domain and to improvements in computer hardware, which
have increased compute capabilities.
According to the 2023 AI Index Report by the Stanford Institute for
Human-Centered AI, global private AI investment was about $92 billion in 2022.
This amount was 18 times greater than it was 10 years back in 2013.
The amount of computational power used by significant AI and machine
learning systems has increased exponentially over the last half decade.
As pre-trained large language models have gotten bigger and more capable
over the years, they have required bigger and more elaborate training sets.
For example, when OpenAI released its original GPT-1 model in 2018,
the model was trained on BookCorpus, which is a collection of
about 7,000 unpublished books comprising about 5 GB of text.
GPT-2, launched the next year in 2019, was trained on an approximately 40
GB training set created from links scraped from Reddit.
The next year, GPT-3 was released, which was pre-trained on more than
500 GB of text from open sources, including BookCorpus, Common
Crawl, Wikipedia, and WebText2.
While official training set details are scant for GPT-4, which
launched in 2023, the training set has been reported to be 13 trillion tokens,
which would be a few terabytes of data.
The early days of NLP were characterized by statistical models. Techniques like
bag of words, prevalent in the late 20th and early 21st centuries,
treated text as an unordered collection of words, disregarding grammar and
syntax. While simple to implement, bag of words enabled basic tasks like
text classification and information retrieval by analyzing word frequencies.
Term frequency-inverse document frequency (TF-IDF) built upon this by weighing
words based on their importance within a document and across a corpus.
These models, though limited in their understanding of context
and the relationships between words, laid the foundational statistical
groundwork for future advancements.
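To make this concrete, here is a minimal sketch, assuming scikit-learn is installed, of building bag-of-words and TF-IDF representations; the toy documents are hypothetical.

```python
# Minimal sketch of bag-of-words and TF-IDF representations.
# Assumes scikit-learn is installed; the toy corpus is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the patent claims a wireless charging method",
    "the wireless method charges a battery",
    "a court reviewed the patent claims",
]

# Bag of words: raw word counts, with order and grammar ignored.
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: counts reweighted by how informative a word is across the corpus.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))
```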
The limitations of pure statistical approaches became increasingly apparent
when tackling more complex NLP tasks.
The inability to capture semantic meaning and the insensitivity to word
order led to the development of more sophisticated techniques.
N-gram models represented a step forward by considering sequences of N
consecutive words, thus incorporating some contextual information.
Perplexity is a common metric for evaluating language models.
It estimates how well a model can predict the next word in a sequence.
Lower perplexity scores generally indicate a better model, meaning the model is
more confident in its predictions.
A perplexity score of 120-plus for N-gram models highlights
the limitations of these early models.
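As a quick worked example, perplexity is simply the exponential of the average negative log-likelihood the model assigns to each word; the per-word probabilities below are made up purely for illustration.

```python
# Worked example of perplexity: exp of the average negative log-likelihood.
# The per-word probabilities below are made up for illustration.
import math

word_probs = [0.1, 0.05, 0.2, 0.01]  # P(word_i | previous words) from some model

avg_neg_log_likelihood = -sum(math.log(p) for p in word_probs) / len(word_probs)
perplexity = math.exp(avg_neg_log_likelihood)
print(f"perplexity = {perplexity:.1f}")  # lower is better
```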
These models also suffered from the curse of dimensionality as vocabulary
size increased, leading to data sparsity issues. For example, with a
vocabulary of 10,000 unique words, each document will be represented by a
vector of 10,000 dimensions. If we have only a few hundred or a few thousand
documents, each document vector will be mostly zeros. Trying to find
similarities or train a classifier on such sparse, high-dimensional
data can be very challenging.
With sparse data in a high dimensional space, the data points or document
vectors become increasingly far apart from each other.
This makes it difficult for distance-based algorithms to find meaningful
similarities between documents.
The concept of a neighborhood becomes less reliable.
Further, storing and manipulating these large sparse vectors can consume
vast amounts of memory, potentially exceeding the available resources.
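Here is a back-of-the-envelope sketch of that sparsity; the vocabulary size matches the example above, while the average number of unique words per document is an assumption.

```python
# Back-of-the-envelope illustration of sparsity in bag-of-words vectors.
# The average unique-word count per document is an assumed value.
vocab_size = 10_000
avg_unique_words_per_doc = 150  # assumed for short documents

nonzero_fraction = avg_unique_words_per_doc / vocab_size
print(f"{nonzero_fraction:.1%} of each {vocab_size:,}-dimensional vector is non-zero")
# -> 1.5% non-zero, i.e. roughly 98.5% of every document vector is zeros
```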
The late 2000s and early 2010s witnessed a paradigm shift towards machine
learning, particularly with the rise of deep learning. Word embeddings
such as Word2Vec and GloVe revolutionized how words were represented.
These techniques learned dense, low-dimensional vector
representations of words based on their co-occurrence in large text corpora.
Crucially, these embeddings capture semantic relationships, allowing models to
understand that "king" is similar to "queen" in a way that simple statistical
models could not.
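Here is a minimal sketch, assuming the gensim library and one of its downloadable pre-trained GloVe models, of how such embeddings expose the king and queen relationship.

```python
# Minimal sketch of querying pre-trained word embeddings with gensim.
# Assumes gensim is installed; "glove-wiki-gigaword-100" is one of its
# downloadable pre-trained models (downloaded on first use).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

print(vectors.similarity("king", "queen"))  # high cosine similarity
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# The classic analogy: king - man + woman lands close to queen.
```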
Designed for sequential data, recurrent neural networks (RNNs)
utilize a loop architecture that allows them to remember past
information, making them useful for tasks like language modeling and
predictive text, where context matters.
A significant limitation of RNNs is their struggle with long-range
dependencies, primarily due to the vanishing gradient problem, which
causes them to lose information from earlier parts of longer sequences.
Long short-term memory, or LSTM, networks were developed as an advanced RNN
architecture specifically to overcome the vanishing gradient problem.
Their introduction of memory cells, capable of retaining information for
extended durations, and intelligent gating mechanisms that control information
flow enables them to effectively learn long-term relationships.
The best reported perplexity score of RNN models on the Penn Treebank
dataset was around 73.4, and a few LSTM models have achieved
perplexities in the forties, such as 44.9.
While LSTMs offer a substantial improvement over RNNs in handling long
sequences, they come with a higher computational cost.
Their complex gating mechanisms require more computational
resources and training time.
Moreover, their inherent sequential processing still
hinders parallel computation and can limit their effectiveness
with extremely long dependencies.
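Here is a minimal sketch, using PyTorch, of an LSTM-based next-word language model; the vocabulary size and layer dimensions are placeholder values.

```python
# Minimal sketch of an LSTM language model in PyTorch.
# Vocabulary size and layer dimensions are placeholder values.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        output, _ = self.lstm(x)    # gated memory carries long-range context
        return self.head(output)    # logits over the next word at each step

model = LSTMLanguageModel()
dummy_batch = torch.randint(0, 10_000, (2, 20))  # 2 sequences of 20 tokens
logits = model(dummy_batch)
print(logits.shape)                               # torch.Size([2, 20, 10000])
```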
The transformer architecture, introduced in 2017 with the groundbreaking paper
"Attention Is All You Need," marked another significant leap.
By leveraging the attention mechanism, transformers allowed the model to
weigh the importance of different words in a sequence when processing
it, regardless of their distance.
This overcame the limitations of RNNs and enabled parallel processing,
leading to significantly improved performance and scalability.
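Here is a minimal sketch of the scaled dot-product attention at the heart of the transformer, written with plain PyTorch tensors; the shapes are illustrative, and a real transformer adds multi-head projections, residual connections, and layer normalization.

```python
# Minimal sketch of scaled dot-product attention with PyTorch tensors.
# Shapes are illustrative; a full transformer adds multiple heads,
# learned projections, residual connections, and layer normalization.
import math
import torch

def scaled_dot_product_attention(query, key, value):
    d_k = query.size(-1)
    # Every position attends to every other position, regardless of distance.
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ value, weights

seq_len, d_model = 6, 16
q = k = v = torch.randn(1, seq_len, d_model)   # self-attention: Q, K, V from the same sequence
context, attn = scaled_dot_product_attention(q, k, v)
print(context.shape, attn.shape)               # (1, 6, 16) (1, 6, 6)
```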
BLEU, which stands for bilingual evaluation understudy, is a metric used to
evaluate the quality of machine-translated text.
It measures how similar the machine-translated text is to a set of
high-quality human translations.
The transformer architecture achieved a BLEU score of 41.8 on
English-to-French translation.
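For reference, here is a minimal sketch of computing a sentence-level BLEU score with NLTK; the example sentences are made up, and reported results like the 41.8 above are corpus-level scores over a full test set.

```python
# Minimal sketch of a sentence-level BLEU score with NLTK.
# The sentences are made up; published scores such as 41.8 are
# corpus-level BLEU over an entire test set.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]
candidate = ["the", "cat", "is", "on", "the", "mat"]

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.2f}")  # measures n-gram overlap with the reference
```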
Here is an example of the attention mechanism following long distance
dependencies from the paper.
Different colors represent different heads of the multi-head attention.
Attention is shown only for the word "making."
As we can see, many of the attention heads attend to the distant
dependency of the word "making," completing the phrase "making ... more difficult."
Here is another example of two attention heads involved in anaphora resolution.
On the left side, we see full attentions for one of the heads, head five, and
on the right side we see isolated attentions from just the word "its" for
attention heads five and six.
While head five relates "its" to only the word "law," head six relates it to
both the words "law" and "application" fairly strongly.
BERT, or Bidirectional Encoder Representations from Transformers,
achieved 94.9% accuracy on sentiment analysis
benchmarks, surpassing previous models by a significant margin.
BERT introduced several key innovations that significantly advanced the field
of natural language understanding.
Bidirectional training.
Unlike previous language models that process text unidirectionally,
either left to right or right to left, BERT is designed to learn
deep bidirectional representations.
This means it considers the context of each word from both the words
that precede it and the words that follow it, simultaneously, in all
layers of the transformer network.
This bidirectional understanding allows BERT to grasp the meaning of words
based on their full context within a sequence, leading to a more nuanced and
accurate understanding of language.
For example, BERT can better differentiate the meaning of "bank" in
"river bank" versus "financial bank."
Masked language modeling.
To enable bidirectional training, BERT employs a novel pre-training
task called masked language modeling.
During pre-training, a certain percentage, typically 15 percent, of the words
in the input sequence are randomly masked.
The model's objective is to predict the original masked words based
on the context provided by the unmasked words in the sentence.
This forces the model to understand the meaning of each word by considering its
surrounding context from both directions.
This is crucial for learning deep bidirectional representations.
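Here is a minimal sketch, assuming the Hugging Face transformers library, of querying a pre-trained BERT through its masked language modeling objective.

```python
# Minimal sketch of masked language modeling with a pre-trained BERT.
# Assumes the Hugging Face transformers library (model downloads on first use).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token using context from both sides.
for prediction in fill_mask("The boat was tied up at the river [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```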
Pre-training and fine-tuning.
BERT established a powerful pre-training and fine-tuning paradigm.
The model is first pre-trained on a massive amount of unlabeled text data
using the masked language modeling task.
This allows the model to learn general purpose language representations.
Then the pre-trained BERT model can be fine-tuned on smaller,
task-specific labeled datasets for various downstream NLP tasks,
for example text classification, question answering, named entity recognition,
and so on, by adding a simple task-specific output layer. This approach
significantly reduced the need for large labeled datasets for each specific
task and led to substantial performance improvements across a wide range
of NLP applications.
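Here is a minimal sketch of that pattern, again assuming the transformers library: loading BERT with a sequence classification head attaches a small task-specific output layer on top of the pre-trained encoder. The label set and example sentence are hypothetical.

```python
# Minimal sketch of the pre-train / fine-tune pattern with transformers.
# Loading BERT with a sequence-classification head adds a small task-specific
# output layer on top of the pre-trained encoder; the labels are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # e.g. positive / negative sentiment

batch = tokenizer(["This patent claim is remarkably broad."],
                  return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)  # fine-tuning would backprop this loss
print(outputs.loss.item(), outputs.logits.shape)
```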
The GPT series developed by OpenAI has shown a remarkable evolution in
its capabilities, primarily driven by scaling model size and training data.
GPT-1, released in 2018, had 125 million parameters.
It demonstrated the effectiveness of pre-training a transformer model for
language understanding and generation, followed by task-specific fine-tuning.
GPT-2, released in 2019, had 1.5 billion parameters.
It showed significant improvements in text generation quality, coherence,
and the ability to perform various downstream tasks with little to
no task-specific fine-tuning.
GPT-3, released the next year in 2020, had 175 billion parameters.
It showed remarkable few-shot and even zero-shot learning capabilities.
It could perform a wide range of NLP tasks, including translation, question
answering, and code generation.
With a few examples or even natural language instructions, it showed
improved coherence and context understanding compared to its predecessors.
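To illustrate what few-shot prompting looks like in practice, here is a hypothetical prompt of the kind GPT-3 could complete without any task-specific fine-tuning.

```python
# Hypothetical few-shot prompt: the task is specified entirely in the input,
# with a handful of examples, and no task-specific fine-tuning is needed.
few_shot_prompt = """Translate English to French.

English: The contract expires next year.
French: Le contrat expire l'annee prochaine.

English: The patent was granted in 2020.
French: Le brevet a ete accorde en 2020.

English: The invention reduces power consumption.
French:"""

# The prompt would be sent to a GPT-style completion endpoint; the model is
# expected to continue with the French translation of the last sentence.
print(few_shot_prompt)
```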
SuperGLUE is a benchmark dataset for evaluating natural language understanding.
It focuses on tasks that require deeper reasoning and understanding of
language, including question answering, coreference resolution,
and natural language inference.
While GPT-1 had a SuperGLUE score of 45.2, GPT-3 had a SuperGLUE score of 71.8.
This scaling trend in the GPT series has been a major driving force behind the
advancements in generative AI and has paved the way for even larger and more
capable models like GPT-4 and beyond.
The latest frontier in NLP is the rise of multimodal models.
These models go beyond processing just text and can understand and generate
content across different modalities, such as images, audio, and video.
By jointly learning representations from diverse data sources, multimodal
models can achieve a more holistic understanding of the world and
enable exciting new applications.
Combining text and image data has improved healthcare diagnostic accuracy.
In a recent study, multimodal models were used to generate assessments based on
medical images and clinical observations, which were then evaluated against
physician-authored diagnoses.
In 80-plus percent of cases, the multimodal assessments outperformed the human
diagnoses. AI-driven diagnostics minimize misdiagnosis, leading to fewer
malpractice claims and unnecessary treatments.
Automated data processing speeds up disease detection, decreasing
patient wait times and hospital stays.
AI-enabled hospitals report a 30 to 40% improvement in workflow efficiency,
allowing providers to treat more patients without increasing staff workload.
Visual question answering, or VQA, models, which understand images
and answer questions about them, have achieved 84.3% accuracy.
This has wide applications in education, accessibility tools, and
the visually impaired community.
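Here is a minimal sketch of visual question answering, assuming the transformers library with a ViLT VQA checkpoint; the image path and question are placeholders.

```python
# Minimal sketch of visual question answering with transformers.
# Assumes the ViLT VQA checkpoint; the image path and question are placeholders.
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

answers = vqa(image="street_scene.jpg",  # any local image path or URL
              question="How many people are crossing the street?")
print(answers[0])  # top answer with a confidence score
```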
VideoChat2 achieved 51.1% on MVBench, surpassing the previous
state-of-the-art model by over 15% in accuracy.
This benchmark evaluates complex temporal reasoning in videos.
LMMs, or large multimodal models, show promise in generating accurate and
contextually relevant closed captions.
Their ability to understand the video context can lead to better
handling of similar sounding words, speaker identification
and overall caption quality.
Multimodal retrieval enhanced by LMMs aims to retrieve relevant
content, images, or videos based on complex, multifaceted queries,
including text and visual elements.
Large language models, while powerful, are computationally expensive and
memory-intensive. To make them more practical for various applications
and deployments, researchers have developed several efficiency techniques.
Here is an elaboration on some of these techniques.
Model compression or pruning.
Pruning aims to reduce the number of parameters in a model by identifying
and removing unimportant weights or even entire neurons and layers.
This leads to reduced model size, faster inference, and a lower memory footprint.
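Here is a minimal sketch of magnitude-based pruning using PyTorch's pruning utilities; the layer size and sparsity level are arbitrary choices for illustration.

```python
# Minimal sketch of magnitude-based pruning with PyTorch utilities.
# The layer size and sparsity level are arbitrary choices for illustration.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Zero out the 30% of weights with the smallest magnitude (least important).
prune.l1_unstructured(layer, name="weight", amount=0.3)
prune.remove(layer, "weight")  # make the pruning permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"share of zero weights: {sparsity:.0%}")
```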
Knowledge distillation.
Knowledge distillation involves training a smaller, more efficient
student model to mimic the behavior of a larger, more complex teacher model.
The teacher model, having learned from a vast amount of data, transfers
its knowledge to the student.
This leads to a smaller model size (the student model has significantly
fewer parameters), faster inference, and improved performance.
The student can sometimes generalize better than if trained directly on the
target task's limited data, as it benefits from the teacher's broader knowledge.
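Here is a minimal sketch of a distillation loss in PyTorch; the teacher and student logits are random stand-ins, and the temperature and mixing weight are typical but arbitrary hyperparameters.

```python
# Minimal sketch of a knowledge distillation loss in PyTorch.
# Teacher/student logits are random stand-ins; temperature T and the
# mixing weight alpha are typical but arbitrary hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the student matches the teacher's softened distribution.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard targets: the student still learns from the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(distillation_loss(student_logits, teacher_logits, labels).item())
```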
Quantization reduces the precision of the numerical representations used for
the model's weights and activations.
Instead of using floating point numbers like 32 bit or 16 bit, the
model uses lower precision numbers like eight bit or even four bit.
This leads to reduced model size, faster inference, and lower power consumption,
which is crucial for edge devices.
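Here is a minimal sketch of post-training dynamic quantization in PyTorch; the toy model stands in for a real network's linear layers.

```python
# Minimal sketch of post-training dynamic quantization in PyTorch.
# The toy model is a stand-in for a real transformer's linear layers.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

# Convert the Linear layers' weights from 32-bit floats to 8-bit integers.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

print(quantized)  # Linear layers are replaced by dynamically quantized versions
```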
Sparse attention.
It is an efficiency technique used in large language models to reduce
the computational cost of the self-attention mechanism, especially when
dealing with long input sequences.
The standard self attention mechanism in transformers has a
quadratic time and memory complexity.
Sparse attention aims to alleviate this quadratic complexity by only computing
attention scores for a subset of possible query-key pairs rather than all of them.
The selection of these pairs is based on a specific pattern or learned criteria,
aiming to retain the most important contextual information while significantly
reducing the number of computations.
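Here is a minimal sketch of one common sparsity pattern, a sliding-window mask, which restricts each position to attend only to its neighbors; the window size and sequence length are illustrative.

```python
# Minimal sketch of a sliding-window (local) sparse attention mask.
# Each position attends only to neighbors within a fixed window, cutting the
# quadratic cost; window size and sequence length are illustrative.
import torch

def sliding_window_mask(seq_len, window):
    idx = torch.arange(seq_len)
    # True where |i - j| <= window, i.e. the query-key pairs we actually compute.
    return (idx[:, None] - idx[None, :]).abs() <= window

mask = sliding_window_mask(seq_len=8, window=2)
print(mask.int())
print(f"computed pairs: {mask.sum().item()} of {8 * 8}")
```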
Throughout this session, I have mentioned various metrics and scores like
perplexity, BLEU, GLUE, and SuperGLUE.
As the models have evolved, the metrics have evolved too to
effectively compare different models.
Here are a few metrics which will be in focus for comparing future NLP models.
Training efficiency and sustainability.
Energy consumption will be a key metric to watch.
Studies have shown that training large language models can have
significant carbon footprint.
Researchers at the University of Massachusetts Amherst estimated that
training a single large language model could emit around 300 tons of CO2
equivalent.
Reasoning capabilities.
New benchmarks will emerge to assess a model's ability to perform multi-step
causal reasoning, which involves understanding how different actions
or events cause specific outcomes.
Current evaluations often miss these capabilities, which make
up a large portion of human reasoning.
Ethical alignment.
Future metrics will evaluate models on their ethical considerations,
including fairness, transparency, and safety.
This should lead to creating responsible AI systems that align with human
values across diverse cultural contexts.
This brings us to the end of this session.
I would like to thank every one of you for attending.
From the humble beginnings of counting words to the sophisticated
capabilities of generative and multimodal AI, the evolution of
NLP has been a remarkable journey.
Each stage built upon the limitations of its predecessors, driven by the desire
to create machines that can truly understand and interact with humans.
As we continue to push the boundaries of AI, the future of NLP promises even more
exciting developments, bringing us closer to truly natural human and
machine interaction.
Thank you.