Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone and welcome to our session here at Conf42.
Today we are diving into a crucial but often overlooked aspect of working with
large language models and AI systems.
How do we evaluate them meaningfully? Traditional metrics like BLEU and ROUGE
were a good starting point, but as AI systems grow more powerful
and complex, we need smarter,
more human-centric approaches to truly measure their effectiveness.
So in this talk, we'll explore what's beyond BLEU and ROUGE, how the evaluation
landscape is evolving, and what modern techniques are emerging to keep pace
with rapid advancements in generative AI.
Before we dive deeper into our presentation, let me give
a quick introduction about ourselves.
I'm Aloen, currently leading engineering at Dropbox, where I work on storage
systems, scalable infrastructure, and the AI/ML platforms that power
some of our internal tooling.
I completed my master's at Carnegie Mellon University, and much of
my work today involves balancing performance, observability, and scale.
And I work as an applied AI engineer.
I build scalable machine learning solutions which
focus on real-world impact.
I specialize in education technology.
My current focus lies in distributed machine learning and the world of
agentic AI, where I'm trying to create systems that don't just respond to
you, but keep reasoning and adapting.
So together, Alo and I bring a shared passion for making AI
systems not just smarter, but more accountable and more measurable.
Let's begin with a simple scenario.
Imagine you're a part of a startup launching an AI powered
customer service chat bot.
Sounds familiar, right?
It's a classic use case: automate support, cut costs, and improve response times.
Now think about what your customers expect from this chat bot.
It's not just about answering queries.
They want answers that are fast, relevant, and feel human.
So how do you evaluate whether your model is doing a good job?
This is where the problem begins.
We can't rely solely on word overlap or rigid rules.
We need to measure effectiveness across three key dimensions.
Accuracy: is the information factually correct?
Usability: is the response phrased in a way the user can understand and act on?
And reliability: does
the system perform consistently across different user inputs,
contexts, and edge cases?
This sets the foundation for why we need to go beyond traditional metrics.
It's no longer just about language similarity.
It's about actual performance, trust and impact.
Now that we have set the stage, let's define what we are actually aiming to
evaluate when it comes to AI systems, especially large language models.
Traditionally, we have been good at measuring things like grammar
or overlap with reference answers.
But today's AI systems operate in open-ended real world context, so we
need a much richer set of objectives.
Here is what we believe every modern evaluation framework should aim to capture.
First, accuracy.
This one's obvious.
The AI should get facts right.
A model generating confident but wrong information is dangerous, not useful.
Second, bias mitigation.
We want systems that generate inclusive and fair outputs, and that means actively
identifying and minimizing harmful stereotypes or skewed viewpoints.
Third, coherence. It's not enough for an answer to be correct.
It needs to make sense in the flow of a conversation.
This is especially critical for applications like tutoring,
therapy, or customer service.
And finally, reliability.
The AI should perform consistently across different prompts, languages,
domains, and even user personas.
No surprises, no hallucinations, no brittle behavior.
These four accuracy, fairness, coherence, and reliability together form a much
more complete picture of what good AI actually means in the real world.
And measuring these effectively is where the challenge begins.
Before we talk about what's next, let's take a step back and quickly
revisit what BLEU and ROUGE actually are and why they have been so dominant
in NLP evaluation for so long.
BLEU, or Bilingual Evaluation Understudy, is essentially a precision-based score.
It compares the n-grams, or short chunks of words, in a model's output
to those in a reference sentence.
It's been a staple in machine translation tasks for years.
ROUGE, on the other hand, is more recall-oriented.
It looks at how much of the reference text is captured in the model's output.
It's widely used for evaluating summarization, where the goal
is to see whether the generated summary includes the important bits.
Now, why were these metrics adopted so widely?
They're fast, language-agnostic, and they don't require any human-labeled data.
You can just run them over a corpus and get numeric scores.
It's clean and scalable.
But, and here's the catch.
Just because something is easy to compute doesn't mean it truly
captures what quality looks like.
BLEU and ROUGE don't understand meaning, intent, tone, or factual correctness.
That's the real gap we need to fill.
Let's bring this down to earth with a real world example that
highlights just how fragile these traditional metrics can be.
Say we have built a customer support chatbot for an e-commerce store,
and the user asks a simple question: what's the size of this jacket?
The chatbot responds correctly, with just one word: 34.
That's a perfectly valid answer.
It's concise, accurate, and exactly what the user needs, but here is the catch.
The BLEU score for this response is just 0.016.
That's extremely low and completely misleading.
Why?
Because BLEU is comparing n-gram overlaps with reference sentences
like "it's XXL", or "it's small", or "it is 34", and since "34" doesn't share enough
bigrams or trigrams with any of these,
it gets penalized even though it's semantically perfect.
This is the kind of disconnect that makes BLEU and ROUGE unreliable in real
dialogue or open domain scenarios.
They care more about surface level similarity than actual meaning,
and if we are not careful, we end up punishing good responses.
Just because they're phrased differently.
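To make the point concrete, here is a minimal sketch of that scoring gap using NLTK's sentence_bleu. The reference strings are illustrative stand-ins for the ones on the slide, and the exact value depends on tokenization and smoothing, so don't expect the 0.016 figure to reproduce exactly; the shape of the result is what matters.

```python
# A minimal sketch of how a terse-but-correct answer gets crushed by BLEU.
# Assumes NLTK is installed (pip install nltk); the reference strings here
# are illustrative, not the exact ones from the slide.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "it's XXL".split(),
    "it's small".split(),
    "it is 34".split(),
]
candidate = "34".split()  # concise, accurate, exactly what the user asked for

# Smoothing avoids a hard zero when higher-order n-grams have no matches.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # a tiny number, despite being a perfect answer
```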
Here is another example that really drives home the limitations of these
surface-level metrics like BLEU.
Imagine we are evaluating a chatbot designed to explain HR policies, in
this case, remote work guidelines.
The reference sentence says employees are permitted to work
remotely up to three days per week, subject to manager approval.
The model responds with: staff members are allowed to telecommute for a
maximum of three days weekly, pending approval from their supervisor.
Now, pause for a second.
That response is perfectly accurate.
It conveys the exact same meaning just using slightly different phrasing, but
BLEU gives this a score of zero.
That's right, zero. Why? Because BLEU is looking for direct overlaps, exact words and n-gram matches.
It doesn't understand that staff members and employees mean the same
thing, or that supervisors are just managers with a different label.
This is the core flaw: BLEU and ROUGE miss semantic equivalence
and conversation quality.
And if we are going to build LLMs that actually communicate well,
we need evaluation methods
that reward meaning, not just matching.
Let's take a moment to summarize what we have seen so far about traditional
metrics like BLEU and ROUGE.
First, both of these metrics are built around n-gram overlap,
essentially counting how many words or phrases match between the model's
response and a reference answer.
This works well when the task
is rigid, like translating a sentence word for word or
summarizing with fixed phrasing.
And to be fair, BLEU and ROUGE have excelled in domains like machine
translation and text summarization.
They're efficient, widely adopted, and they helped
standardize benchmarks early on.
But this is the key.
They completely struggle when it comes to contextual understanding
or semantic variation.
They can't tell if a response makes sense in a conversation or if it's phrased
differently, but means the same thing.
In other words, BLEU and ROUGE reward surface similarity, not actual quality.
And that's why we need to evolve our evaluation toolkit as our
models get more sophisticated.
So what do we mean when we say n-gram overlap? Let's quickly
unpack that, because it's the foundation of both BLEU and ROUGE. An n-gram
is simply a sequence of N words. A one-gram, or unigram, just means
individual words like "cat", "runs", or "fast".
A two-gram, or bigram, is a pair of consecutive words like "the cat" or
"runs fast", and a three-gram, or trigram, combines three words, for example
"the black cat" or "runs very fast".
The more overlap your generated text has with these
kinds of sequences from a reference sentence, the higher your BLEU
or ROUGE score is going to be.
It's a way to measure fluency
and similarity without needing deep understanding, just matching patterns.
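To make that concrete, here is a tiny, self-contained sketch of counting n-gram overlap; the helper function and example sentences are just for illustration.

```python
# A tiny illustration of n-gram overlap, the mechanic underneath BLEU and ROUGE.
# The helper and example sentences are just for illustration.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

reference = "the black cat runs very fast".split()
candidate = "the cat runs fast".split()

for n in (1, 2, 3):
    ref_ngrams = set(ngrams(reference, n))
    cand_ngrams = ngrams(candidate, n)
    overlap = sum(1 for g in cand_ngrams if g in ref_ngrams)
    print(f"{n}-gram overlap: {overlap}/{len(cand_ngrams)}")
# Unigrams overlap heavily, but bigrams and trigrams drop off quickly
# as soon as the phrasing diverges -- which is exactly what BLEU/ROUGE count.
```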
Now this diagram below shows how we move from basic word representation
to increasing contextual complexity from individual overlapping
words to longer shared sequences.
But here is the limitation.
As we have seen earlier, this structure can completely miss meaning, intent,
or even factual accuracy if the phrasing is just slightly different.
So while n-gram overlap gives us something measurable, it's not the whole picture.
So now that we have seen what Ngram overlap actually measures,
let's talk about why it's not enough for today's AI systems.
Especially large language models.
First, it's a surface level measure.
It doesn't understand what a sentence means.
It just checks if the words look similar.
This is fine for basic translation tasks,
but it breaks down when the language gets more flexible or nuanced.
Second, it's easy to game.
You can bump up your BLEU score by repeating parts of the reference answer.
Or adding boilerplate phrases even if your actual output is poor.
Third, and this one's huge, it ignores factuality and coherence.
Your model might say something grammatically perfect, but factually
wrong, and BLEU won't catch it.
Or it might output something semantically spot on, but still get a low score
because the phrasing was different.
And finally, these metrics often penalize longer or more fluent responses.
If your model rephrases an idea eloquently or adds context, it can
actually hurt the score because the n-gram patterns don't match the reference
exactly. In short, BLEU and ROUGE give us a quick answer, but not a deep one.
And with modern LLMs, that's just not good enough anymore.
So now that we have seen the limitations of traditional metrics,
let's shift gears and look at what modern evaluation frameworks actually focus on.
Instead of counting word overlaps, these new methods try to assess what
really matters in a model's response.
Qualities that are more aligned with how humans judge quality.
The first and arguably most important is factual accuracy.
Is the information actually correct?
Especially in domains like healthcare, finance, or education?
This isn't just nice to have, it's critical.
Then we have semantic coherence.
Does the response flow well?
Is it logically structured and grammatically sound?
Or does it feel like a random text dump?
Next comes answer relevance.
Does the model stay on topic and directly address the question being asked?
This is especially important for systems like chatbots, tutors,
or customer service tools.
We also evaluate context precision.
Did the model respond with the most accurate detail from
the context it was given?
And conversely,
context recall: did it include all the key elements that were
necessary to build a complete answer?
Together these dimensions form a far more comprehensive and human-aligned way of
judging model output, and they are setting the foundation for next-gen AI benchmarks.
Now let's talk about how we evaluate the factual accuracy of responses
using a method called fact score.
The core idea here is to break down a generated response into atomic facts.
That is small, standalone pieces of information that
can be independently verified.
Each atomic fact is then checked against a reliable external knowledge
source, like a trusted database, verified document, or reference corpus.
For example, if the model says the Eiffel Tower is in Paris and it was built in
1889, that's two separate atomic facts, and we check each of them individually.
Once this checking is done, we compute a score based on the proportion of
facts that are supported by the source, and that becomes the FactScore.
This approach provides a granular, interpretable view of
factuality instead of just evaluating the sentence as a whole.
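A minimal sketch of that idea follows. The two helpers, extract_atomic_facts and is_supported, are hypothetical placeholders for what would in practice be an LLM-based fact splitter and a retrieval or NLI check against the knowledge source; this is not the official FactScore implementation.

```python
# A minimal sketch of the FactScore idea: split a response into atomic facts,
# verify each against a knowledge source, and score the supported proportion.
# extract_atomic_facts() and is_supported() are hypothetical placeholders; in
# practice they'd be an LLM-based splitter and a retrieval/NLI check.
from typing import Callable, List

def fact_score(response: str,
               extract_atomic_facts: Callable[[str], List[str]],
               is_supported: Callable[[str], bool]) -> float:
    facts = extract_atomic_facts(response)
    if not facts:
        return 0.0
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)

# Toy usage with hand-written stand-ins for the two components.
facts_of = lambda _: ["The Eiffel Tower is in Paris.",
                      "The Eiffel Tower was built in 1889."]
knowledge = {"The Eiffel Tower is in Paris.",
             "The Eiffel Tower was built in 1889."}
print(fact_score("The Eiffel Tower is in Paris and was built in 1889.",
                 facts_of, lambda fact: fact in knowledge))  # -> 1.0
```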
Alright, let's take a closer look at how factuality evaluation
actually works in practice, especially with model-graded systems.
This approach typically relies on three core inputs:
the prompt, the output, and the reference answer. The first input is the prompt.
That's what we send to the LLM.
It could be a question, a task instruction, or a real-world query like
what is the capital of Switzerland?
The second input is the output.
This is the model's response to that prompt.
For example, it might say Zurich or Bern, depending on how it was
trained or what it retrieved.
The third component is the reference.
This is the ideal response that the model should have given,
and this is typically crafted by the author of the evaluation.
Usually a human or another trusted system.
The factuality check is then done by comparing the model's
output to the reference.
Does the answer align with the verified facts?
Was the reasoning sound? Did it hallucinate or go off topic?
This approach gives us a more targeted lens.
We are no longer scoring for linguistic overlap,
but for truthfulness and correctness.
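Here is a hedged sketch of what such a model-graded check could look like with the OpenAI Python client. The grading prompt, the judge model name, and the one-word verdict format are illustrative choices, not the exact setup from the talk.

```python
# A sketch of a model-graded factuality check over the three inputs described
# above: prompt, model output, and reference answer. Assumes the OpenAI Python
# client is installed and OPENAI_API_KEY is set; the grading prompt and model
# name are illustrative choices.
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading a model answer for factual accuracy.
Question: {prompt}
Reference answer: {reference}
Model answer: {output}
Does the model answer agree with the reference on the facts?
Reply with exactly one word: CORRECT or INCORRECT."""

def grade_factuality(prompt: str, output: str, reference: str,
                     judge_model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            prompt=prompt, output=output, reference=reference)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(grade_factuality("What is the capital of Switzerland?",
                       "Zurich", "Bern"))  # expected: INCORRECT
```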
Now let's dive into a more advanced model-based approach for evaluating
factuality, one that goes far beyond keyword matching or basic overlap.
What you are seeing here is a factuality scoring pipeline that leverages
semantic alignment, entailment classification, and weighted aggregation
to generate what's called a FactScore.
We begin with the input text.
We use something like SBERT, a Sentence-BERT model, to identify how well the output
semantically aligns with the reference facts.
The next step is to break the reference answer into atomic facts,
labeled A1, and then possibly rephrase or reformulate them in
different forms, marked as A2.
This ensures we are not overly strict about phrasing.
Now we introduce NLI classification, natural language inference.
This checks whether the model's output entails, contradicts, or is
neutral towards each atomic fact.
This is where the green and red boxes come in.
Green for support, red for contradiction.
Then comes the aggregation phase where all of this gets pulled together.
We apply different weights to different fact types.
For example, facts classified as supported get a weight of 0.9, whereas
contradictions get a lower weight or even a zero penalty.
The final result is a FactScore, a weighted, structured view of how
factual the LLM's output really is.
It's nuanced, interpretable, and much more robust than traditional methods,
and most importantly, this method adapts well across domains and languages.
It evaluates not just what the model says, but how well it aligns with the truth.
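Below is a simplified sketch of just the entailment-plus-weighting step, using an off-the-shelf NLI model from Hugging Face. The 0.9 and 0.0 weights mirror the idea above, the neutral weight is an arbitrary illustrative value, and the SBERT alignment and fact-rewriting stages are omitted.

```python
# A simplified sketch of the entailment + weighted-aggregation step, using an
# off-the-shelf NLI model (roberta-large-mnli). The real pipeline also does
# SBERT alignment and fact rewriting, which are omitted here; the neutral
# weight below is an illustrative choice.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def nli_label(premise: str, hypothesis: str) -> str:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax())].lower()

WEIGHTS = {"entailment": 0.9, "neutral": 0.3, "contradiction": 0.0}

model_output = "The Eiffel Tower, finished in 1889, stands in Paris."
atomic_facts = ["The Eiffel Tower is in Paris.",
                "The Eiffel Tower was built in 1889."]

labels = [nli_label(model_output, fact) for fact in atomic_facts]
fact_score = sum(WEIGHTS[label] for label in labels) / len(labels)
print(labels, round(fact_score, 2))
```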
Let's now talk about a modern metric that actually goes beyond surface-level
word overlaps: BERTScore.
BERTScore is built on top of contextual embeddings.
It uses models like BERT to assign a vector representation to each token,
not just based on the word itself, but how it is used in context.
That's key.
In the example below, we have two sentences.
The reference says, the weather is cold today.
The candidate says it's freezing today.
Now, even though there is little direct overlap, a human would agree.
These two say almost the same thing.
So instead of relying on ngrams, we generate contextual embeddings
for each word using BERT.
These embeddings capture the meaning of each token in its context.
Then we calculate pairwise cosine similarity between tokens across the
reference and the candidate sentence.
From these, we select the best matching pairs, aggregate their similarity
scores, and compute a final BERTScore.
This gives us a much richer sense of how well the semantic flow is preserved,
regardless of wording.
BERTScore excels at evaluating coherence, paraphrasing, and even subtle rephrasing,
areas where BLEU and ROUGE simply fall short.
In other words, BERTScore helps us evaluate what the model means,
not just what words it uses.
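As a quick illustration, the bert-score package can score the paraphrase example above in a few lines; exact numbers depend on the underlying model, but the paraphrase scores far higher than its n-gram overlap would suggest.

```python
# A short sketch using the bert-score package (pip install bert-score) to
# score the paraphrase example above. Exact numbers depend on the underlying
# model; the point is that the paraphrase scores well despite low word overlap.
from bert_score import score

references = ["The weather is cold today."]
candidates = ["It is freezing today."]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.3f}")
```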
One of the most critical areas in AI evaluation today is
addressing toxicity and bias.
Why?
Because even a single harmful response from an LLM can damage trust,
safety, and inclusivity in the user experience.
Bias can creep in from training data, prompt phrasing, or even subtle
model behavior, and without careful evaluation, it often goes undetected.
To help tackle this, we have tools like LangBiTe, which stands for
language bias testing environment.
LangBiTe works in a systematic and scalable way.
First, it selects prompt templates from a predefined library, for example,
prompts about race, gender, or religion.
Then it generates test cases based on these templates.
These are structured prompts designed to probe for biased or toxic behavior.
Next, it executes those prompts across different LLMs and carefully
analyzes the responses. Finally,
it generates insights highlighting areas where the model may have shown biased
tendencies, either overt or subtle.
This lets developers pinpoint specific weaknesses and retrain
or fine-tune models to be more inclusive, respectful, and safe.
Bias and toxicity aren't just bugs.
They're ethical risks.
And tools like LangBiTe help us bring transparency and
accountability to this space.
Let's now look at how this evaluation process works
in practice using LangBiTe's structured pipeline. The flow starts with
understanding ethical concerns.
This includes input from sensitive communities and domain experts, which
helps define what types of harmful or biased behavior we need to test for.
This leads to the ethical requirement specification, which creates a formal
model of what needs to be avoided, such as hate speech, stereotyping, or toxicity.
Next, we move to test generation. Here,
we use prompt templates designed to target those ethical concerns.
From these templates, we generate real prompt instances, specific test cases.
These prompts are then fed into the LLMs.
In the test execution phase, the model responses are collected, and
these outputs are evaluated for signs of toxicity or bias.
The final step is reporting.
Here the responses are analyzed using prompt oracles or even an evaluator LLM.
The goal is to interpret the outputs and produce structured evaluation reports.
This end-to-end flow ensures that bias detection is not just an afterthought.
It's built into the system from the ground up, guided by real ethical priorities.
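Here is a schematic sketch of that template-to-report flow. This is not the LangBiTe API: query_llm is a placeholder for the model under test, and the templates, groups, and toy oracle are illustrative only.

```python
# A schematic sketch of the template -> test case -> execution -> report flow
# described above. query_llm() is a placeholder for whatever model you are
# testing, and the templates, groups, and flagging rule are examples only.
from itertools import product

TEMPLATES = [
    "Describe a typical {group} software engineer.",
    "Why are {group} people good at math?",
]
GROUPS = ["male", "female", "older", "younger"]

def generate_test_cases():
    """Instantiate each template with each sensitive-attribute value."""
    return [t.format(group=g) for t, g in product(TEMPLATES, GROUPS)]

def query_llm(prompt: str) -> str:
    # Placeholder: call the model under test here.
    return "..."

def flag_response(response: str) -> bool:
    # Toy oracle: flag responses that assert group-level generalizations.
    # A real setup would use an evaluator LLM or a trained classifier.
    return any(cue in response.lower() for cue in ("all ", "always", "naturally better"))

report = []
for prompt in generate_test_cases():
    response = query_llm(prompt)
    report.append({"prompt": prompt, "flagged": flag_response(response)})

print(f"{sum(r['flagged'] for r in report)} of {len(report)} prompts flagged")
```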
Despite advances in automated metrics, human judgment remains irreplaceable,
especially in evaluating tone, coherence, and helpfulness, which
current metrics can't always capture.
That's why we use Pairwise comparison, where two outputs are shown side by side,
and human judges select the better one.
This is more reliable than assigning absolute numeric scores.
We then apply probabilistic ranking models like Bradley-Terry or Elo
to convert these comparisons into meaningful rankings.
These methods are widely used in platforms like Chatbot Arena, and they consistently
reveal gaps that automated metrics like BLEU and ROUGE fail to detect.
In fact, two outputs might score equally in BLEU, but humans may
strongly prefer one over the other due to clarity, style, or tone.
So human-centric evaluation
does not just validate models; it helps us uncover what really
matters in user experience.
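As a sketch, here is a minimal Elo-style aggregation of pairwise human judgments into a ranking; the K-factor and starting rating are conventional illustrative values, and Bradley-Terry fitting would serve the same purpose with a different estimator.

```python
# A minimal Elo-style aggregation of pairwise human judgments into a model
# ranking. The K-factor and starting rating are conventional illustrative
# values; leaderboards in the Chatbot Arena style use the same family of
# methods (Elo / Bradley-Terry), though the details differ.
from collections import defaultdict

K = 32  # update step size
ratings = defaultdict(lambda: 1000.0)

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record(winner: str, loser: str) -> None:
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e)
    ratings[loser] -= K * (1.0 - e)

# Each tuple is one human judgment: (preferred model, other model).
judgments = [("model_a", "model_b"), ("model_a", "model_c"),
             ("model_c", "model_b"), ("model_a", "model_b")]
for winner, loser in judgments:
    record(winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```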
Let's bring it all together.
We start with fact verification, breaking model responses into atomic facts,
then verifying each one against trusted sources to generate an accuracy score.
In parallel, we conduct pairwise comparison, A versus B, C
versus A, where humans simply judge which response is better.
Using models like Bradley-Terry, these judgments are aggregated
into a global ranking, offering a probabilistic view of model quality.
We then compare this human ranking against scores from automated metrics
to identify where our evaluation systems align and, more
importantly, where they don't.
Human evaluations are increasingly taking center stage in measuring
conversational AI quality, not just complementing but also correcting
what automated metrics miss.
With that foundation laid, I'll now hand it over
to Soro, who will take us deeper into how modern metric systems
like G-Eval and real-world task evaluations are changing the landscape.
Alright, let's talk about one of the most promising shifts in
evaluation: using the LLM itself
as a judge. Large language models bring three very powerful strengths here.
The first is context awareness,
meaning they can interpret nuanced meaning and adapt to the domain.
Then scalability, since they can evaluate massive datasets quickly and very
efficiently, and consistency, eliminating the subjectivity and the
fatigue human evaluators often bring.
So here is how it works.
A benchmark dataset provides both input
prompts and the correct outputs.
The LLM under test generates a suggested output, and then, using a judge
prompt, we ask another LLM: given the input and the correct output,
is the suggested output acceptable? To make sure the judge stays reliable,
we loop in human experts to assess its judgments and refine
the judge through feedback.
This feedback loop, from LLM response to LLM judgment to expert
audit, helps us continually improve the evaluator itself.
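A sketch of that judge loop might look like the following. ask_judge is a placeholder for any LLM call (for example, the client pattern shown earlier), and the benchmark items and audit fraction are illustrative.

```python
# A sketch of the judge loop described above: for each benchmark item we ask a
# judge LLM whether the suggested output is acceptable given the input and the
# correct output, then set aside a random sample of its verdicts for human
# experts to audit. ask_judge() is a placeholder for a real LLM call.
import random

benchmark = [
    {"input": "What is the capital of Switzerland?",
     "correct": "Bern", "suggested": "Zurich"},
    {"input": "Who wrote 'Pride and Prejudice'?",
     "correct": "Jane Austen", "suggested": "Jane Austen"},
]

JUDGE_PROMPT = ("Input: {input}\nCorrect output: {correct}\n"
                "Suggested output: {suggested}\n"
                "Is the suggested output acceptable? Answer YES or NO.")

def ask_judge(prompt: str) -> str:
    # Placeholder: send the judge prompt to the judge LLM and return YES/NO.
    return "NO" if "Zurich" in prompt else "YES"

verdicts = []
for item in benchmark:
    verdict = ask_judge(JUDGE_PROMPT.format(**item))
    verdicts.append({**item, "judge_verdict": verdict})

# Sample a slice of judgments for expert review, closing the feedback loop.
audit_sample = random.sample(verdicts, k=max(1, len(verdicts) // 5))
print(verdicts)
print("for human audit:", audit_sample)
```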
So to wrap up the discussion on evaluation, let's talk
about why automated evaluation
is not just helpful, but essential.
First, it ensures we maintain consistent quality across multiple model versions.
Whether we are fine-tuning a base model or iterating on prompts, we need
stability in how we assess outputs.
Second, it provides an objective and reproducible performance signal,
removing the human bias and the ambiguity from the evaluation loop.
Third, it enables rapid testing at scale.
Something manual methods simply cannot match.
And finally, it supports continuous improvement cycles, which are foundational
for modern MLOps workflows.
So now that we have seen why automated evaluation matters, let's
also take a look at the key technologies powering this shift.
And whenever we talk about LLMs, the first thing that comes to mind is
OpenAI, and the OpenAI Evals framework.
This is a standardized way to define and run evaluations on
large language model outputs.
It's very extensible, modular, and integrates really well with other
components of the OpenAI ecosystem.
Then we have G-Eval, or generative evaluation.
This goes beyond fixed test cases.
It leverages generative prompts to dynamically evaluate model
performance across a variety of tasks.
It is especially useful for edge cases or nuanced reasoning.
And then we have RAG evaluation frameworks.
These are specifically tailored for retrieval-augmented generation.
They're not just assessing the output, but how well the model grounds its answers
in the retrieved context.
So it's about both the what and the why behind the answer.
And finally, we rely heavily on evaluation datasets, or eval sets.
These include curated prompts, gold-standard outputs, and even
adversarial examples to rigorously stress test model behavior.
Together, these tools form the backbone of how we benchmark LLMs at scale.
So let's dive a bit deeper and see how a prominent tool in this space,
the OpenAI Evals framework, works.
This framework helps us simplify all of these things: it helps us
build custom evaluation pipelines
and lets us evaluate summarization, code generation, reasoning, and dialogue.
One of its most important strengths is standardizing test methodology,
so that you can benchmark across models and iterations using consistent
metrics, and it helps ensure reproducibility.
It's also natively integrated into the OpenAI dashboard,
which makes it very seamless to set up evaluations, view the results,
and monitor changes over time.
In the screenshot below, you can see that you can choose from
multiple data sources: importing chat completions, uploading a JSONL
file, creating prompts manually, or even building custom evaluation logic.
So this flexibility allows teams to quickly spin up robust evaluation workflows
without reinventing the wheel.
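As a sketch, preparing an eval dataset usually comes down to writing a JSONL file of inputs and ideal answers. The input-plus-ideal schema below follows the format documented for basic match-style evals in the openai/evals repository; field names and registry details can differ between the open-source framework and the dashboard flow, so treat it as illustrative.

```python
# A sketch of preparing an eval dataset as JSONL. The "input" + "ideal" schema
# follows the format documented for basic match-style evals in the
# openai/evals repository; exact field names may differ by version and flow.
import json

samples = [
    {
        "input": [
            {"role": "system", "content": "Answer concisely."},
            {"role": "user", "content": "What is the capital of Switzerland?"},
        ],
        "ideal": "Bern",
    },
    {
        "input": [
            {"role": "system", "content": "Answer concisely."},
            {"role": "user", "content": "What is 17 * 3?"},
        ],
        "ideal": "51",
    },
]

with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

print(f"wrote {len(samples)} samples to samples.jsonl")
```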
So one of the most powerful aspects of the OpenAI Evals framework is the
flexibility in defining the criteria.
So you aren't limited to just accuracy.
You can choose what matters for your application the most.
Let's walk through some of these built-in options.
You have factuality and semantic similarity, which assess whether
the response is both factually correct and aligned in meaning
with the reference answer.
For tasks that require more emotional intelligence, sentiment evaluation
can flag toxic or otherwise problematic responses.
Now, let's say you need structural validation: we can use schema
matching to check whether the output is valid JSON or XML,
which is useful for things like function calling.
String checks and criteria matching let you test the presence
or absence of specific tokens or features, such as
forbidden phrases or required formats.
And if you're building custom metrics, you can define your own logic
using the custom prompt option.
And finally, if you need classic NLP evaluation, there are always
the text quality checks that support BLEU and ROUGE scores, or something as simple
as cosine-similarity-based scoring.
All of these tools are included in the OpenAI Evals framework,
and that is something we highly recommend you use.
So now let's look at how easy it actually was for us to run an
evaluation test with the OpenAI Evals framework using a single command.
In this scenario, we ran the oaieval command with GPT-3.5
Turbo on the Spider SQL benchmark, and we took a maximum of 25 samples.
The framework sends each prompt to the API and evaluates the responses in real time.
And as you can see in the log, it logs every single HTTP request and
tracks progress sample by sample.
Once all the samples are processed, it prints a summary report: in
this case, 20 responses were correct and five were incorrect,
giving us a score of 80% accuracy.
This output is also saved to a log, which makes it very easy
to review, compare across models, or visualize performance later.
And the best part about this: it's reproducible, testable, and scalable,
all from a single CLI command.
So let's work through a real evaluation sample from the SQL benchmark.
The prompt, or question, here was: how many countries have a republic
as their form of government?
The expected SQL query, shown in green, counts the rows in the country
table where the form of government is exactly equal to 'Republic',
so a precise match using equality.
The model's submission, in orange, uses a slightly broader
LIKE pattern on 'republic', which also matches values like a federal
republic or a republican system, which actually introduces
some ambiguity in the result.
So while the model might technically be correct in many cases, it's
not aligned with the strict expected output, and this will be flagged
as incorrect by the eval framework.
These small variations help us understand not just whether the
model is right, but how closely it aligns with the exact semantics
and precision, which is especially critical in structured tasks like SQL.
Below you can see the next test case queued up.
This happens automatically and repeatedly over the whole dataset.
So this is how we stress test the model with granular feedback,
especially in a domain as complex as structured queries.
So now let's take a look at G-Eval, short for generative evaluation.
It uses an LLM that evaluates another LLM's output.
G-Eval has three very simple steps.
First, we define a task and
the evaluation criteria.
For example, you might ask the model to judge an answer based on
its clarity or its correctness.
Then the model creates its own evaluation steps using chain-of-thought reasoning.
This is what gives G-Eval structure and explainability.
Next, it'll use these steps to analyze the output and produce a detailed judgment.
And finally, it applies a structured scoring function,
usually something that extracts a structured score from the evaluation text.
The powerful part here is that the LLM becomes both the evaluator and the
analyst, giving very rich, interpretable feedback, not just a raw number.
So let's take a look at how G-Eval actually applies scores
and criteria using assertions.
Each of these assertions defines a specific expectation that is
set for the large language model's output.
In the first one, we use a model-graded rubric to ensure
the response is not apologetic.
This could be used in scenarios where you want the model to answer
confidently, like in technical support or policy decision scenarios.
In the second use case, we use a factuality check to
verify whether the model output aligns with a specific known fact, like
Sacramento is the capital of California.
These assertions provide modular, composable tests that can
be reused across prompts and tasks.
And this is what gives G-Eval its flexibility.
Think of it as a unit test, but for large language models.
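In that spirit, here is a small sketch of unit-test-style assertions covering the two examples above. grade_with_rubric is a placeholder for an evaluator-LLM call (faked here with a keyword check so the snippet runs on its own), and the assertion helpers are not tied to any specific framework's syntax.

```python
# Unit-test-style assertions in the spirit of the two examples above: a
# model-graded rubric ("the response is not apologetic") and a simple
# factuality check. grade_with_rubric() is a placeholder for an evaluator-LLM
# call; real frameworks express the same idea in their own config syntax.
def grade_with_rubric(response: str, rubric: str) -> bool:
    # Placeholder: ask an evaluator LLM "Does this response satisfy: <rubric>?"
    # Here we fake it with a keyword check so the sketch runs on its own.
    if rubric == "The response is not apologetic.":
        return not any(w in response.lower() for w in ("sorry", "apologize"))
    raise NotImplementedError(rubric)

def assert_not_apologetic(response: str) -> None:
    assert grade_with_rubric(response, "The response is not apologetic."), \
        f"Response failed rubric: {response!r}"

def assert_fact(response: str, fact: str) -> None:
    # Minimal factuality check: the known fact must appear in the response.
    assert fact.lower() in response.lower(), \
        f"Expected fact {fact!r} in response: {response!r}"

# Example runs, as you would in a test suite.
assert_not_apologetic("Your order ships tomorrow and arrives Friday.")
assert_fact("The capital of California is Sacramento.", "Sacramento")
print("all assertions passed")
```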
So here is a practical use case where G-Eval helps enforce behavioral
constraints across customer interactions.
We have taken common support prompts,
from order tracking to product recommendations, and attached a very
simple grading rule: do not mention that you are an AI chatbot or AI assistant.
This is very useful in brand settings,
where anthropomorphizing the AI might conflict with design choices, tone
guidelines, or compliance policies.
So instead of manually reviewing each output, we can automate these tests and get instant
evaluation at scale using G-Eval.
This ensures brand-aligned responses without needing a human
in the loop every single time.
So now let's look at some of the same task prompts evaluated with
G-Eval, which can show pass or fail outcomes depending on the wording and the tone.
In the left-side column, we see the AI
responding correctly in terms of information, but failing due to
phrases like "as an AI language model" or "your e-commerce assistant", which violate the
grading constraint that we saw earlier.
In contrast, the right-side column uses a very different persona prompt,
a smart, bubbly customer service rep, and gives answers that are
contextually aligned and brand safe,
and thus pass the evaluation.
So this shows how G-Eval is not just judging accuracy, but also behavioral
consistency, and how even minor prompt engineering choices can
flip a result from a fail to a pass.
So it's a very important reminder for us that evaluation is not just about what is
being said, but the way it is being said.
So now let's shift gears and look at how evaluating retrieval-augmented
generation, or RAG, systems works.
A RAG, or retrieval-augmented generation,
pipeline has two components:
a retriever, which fetches relevant context from a knowledge base, and
a generator, which formulates the final response using that context.
Evaluating RAG requires looking at both stages separately.
For the retriever, we use metrics like contextual recall,
which asks whether we retrieved all the relevant documents, and contextual precision,
which asks whether the retrieved documents are actually useful to us or not.
Then we look at the generator, with metrics like answer
relevance, which asks whether the response
addresses the user's question, and faithfulness, which
checks if the output actually sticks to the facts in the retrieved context. This
two-part evaluation is essential because a great generator can still
hallucinate if the retriever fails, and a perfect retriever will not help if
the generation step misinterprets it.
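Here is a deliberately simple, set-based sketch of the retriever-side metrics; real RAG evaluation frameworks typically use LLM judges for faithfulness and relevance, and the chunk names below are made up.

```python
# A set-based sketch of the retriever-side metrics described above. Real RAG
# eval frameworks typically use LLM judges for faithfulness and relevance;
# here contextual precision/recall are plain set overlap so the idea is clear.
def contextual_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Of what we retrieved, how much was actually useful?"""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def contextual_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Of everything that was relevant, how much did we manage to fetch?"""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

retrieved_chunks = ["chunk_policy_remote", "chunk_policy_leave", "chunk_faq_misc"]
relevant_chunks = {"chunk_policy_remote", "chunk_policy_approval"}

print(f"precision: {contextual_precision(retrieved_chunks, relevant_chunks):.2f}")
print(f"recall:    {contextual_recall(retrieved_chunks, relevant_chunks):.2f}")
```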
So now that we understand the RAG architecture,
let's quickly zoom in and see how we actually evaluate the retriever component.
We use three very important metrics here. The first is contextual precision.
This tells us whether the retriever
is ranking relevant information higher than irrelevant information, which
means a higher score prioritizes the right context.
So as you can see, GPT-3.5 performed best with 92.23% using very basic RAG,
while multi-query with GPT struggles a bit,
showing how query expansion doesn't always help if not tuned properly.
The next very important metric is contextual recall.
This checks how much relevant information your
retriever can actually fetch.
Think of it as coverage. LLaMA 2 scores an impressive 90% with basic RAG,
while again, multi-query falls short.
A very interesting pattern here is RAG Fusion
with GPT-4 or GPT-3.5,
which in this case also hits 90%, showing how a fusion-based approach improves
recall through multi-query retrieval.
And then we have contextual relevancy.
This combines both precision and recall, but it adds a layer of nuance:
are we retrieving chunks that are both relevant and not bloated with noise?
RAG Fusion with LLaMA 2 leads here with 83.46%, with
GPT-3.5 very close at 87.22%.
The fusion approach helps balance precision and recall
while filtering irrelevant text.
So the very important key takeaway is that no single metric is enough.
You need to monitor all three, especially when choosing between basic,
multi-query, and fusion strategies.
So while selecting an LLM backend, whether that's GPT-3.5 or LLaMA 2,
it's very important to figure out these trade-offs as well.
So let's take a closer look at how different evaluation metrics align
with human judgment across four key dimensions: coherence,
consistency, fluency, and relevance.
Basically, what you are seeing is a benchmark comparison
using two very important correlation statistics, Spearman's rho and Kendall's tau,
both of which measure how well a metric's scores
agree with human ratings.
We'll start, of course, with something very traditional:
ROUGE-1, ROUGE-2, and ROUGE-L.
You'll notice they consistently perform very poorly, especially on
coherence and fluency, because they can't really correlate with these
semantic qualities of generated responses, given their focus on n-gram overlaps.
Next we have a set of embedding-based metrics like BERTScore or MoverScore,
which show some improvement.
They still fall short of truly understanding the meaning
across diverse responses.
But now let's look at UniEval.
It's a learned evaluation metric.
It performs noticeably better, especially on
coherence and fluency; still, it's not the top performer of all.
The real breakthrough comes from LLM-based evaluators like GPTScore and G-Eval.
You're right, so we'll see.
The GE EAL has the high correlation score.
It has the strongest overall average score.
Point by one, and these all matters because it shows the model cannot
evaluate other models in a way.
And it is very remarkable.
It's very remarkably close to human reasoning, especially when we prompt
them with structural criteria and we enable chain of thought reasoning.
So we also, we should also note this, you using probability,
you are using chain of thoughts.
It performs slightly worse, but combining both of these things
gives us really strong signals.
So the key takeaway that we have there is gval four with both chain
of thought and scoring probabilities currently lead the pack and
automating evaluation for elements.
So now let's look at the pipeline for generating synthetic evaluation
datasets, especially for retrieval-augmented generation systems.
We begin by uploading documents that can be in PDF format,
DOCX format, or other raw formats.
These documents are then chunked, or broken down into smaller segments.
Some chunks may be disqualified if they lack useful information or are too noisy.
Next, we generate contexts from the qualified chunks and essentially
prepare passages that would serve as grounding evidence. From those contexts,
we now generate goldens.
That is, gold ground-truth answers that the system should ideally produce if it had
access to the right information.
Any ambiguous or low-quality goldens are discarded as unqualified.
Once goldens are finalized, we evolve queries by increasing their complexity
and difficulty to stress test the system's retrieval and reasoning capabilities.
The result of this entire pipeline is a synthetic evaluation dataset that
can be used to benchmark both retrieval and generation performance.
And very importantly, this process is iterative.
We can loop back, refine edits, and continually improve the dataset.
So this kind of synthetic dataset creation allows us to simulate real-world
scenarios while we maintain full control over evaluation quality.
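A skeleton of that pipeline might look like the following; the LLM-backed steps (generate_golden, evolve_query) are placeholders, and the chunking, filtering, and example document are illustrative.

```python
# A skeleton of the synthetic-dataset pipeline described above: chunk the
# documents, drop low-quality chunks, build contexts, generate goldens, then
# evolve the queries. The LLM-backed steps (generate_golden, evolve_query) are
# placeholders; in practice each would be a prompt to a generator model.
def chunk(document: str, size: int = 200) -> list[str]:
    return [document[i:i + size] for i in range(0, len(document), size)]

def qualifies(chunk_text: str) -> bool:
    # Toy filter: drop chunks that are too short to be useful.
    return len(chunk_text.split()) >= 20

def generate_golden(context: str) -> dict:
    # Placeholder: ask an LLM for a (question, ground-truth answer) pair
    # grounded in this context.
    return {"context": context,
            "question": "What does this passage say about remote work?",
            "answer": "..."}

def evolve_query(golden: dict) -> dict:
    # Placeholder: rewrite the question to be harder (multi-hop, ambiguous
    # phrasing, added constraints) to stress test retrieval and reasoning.
    return {**golden, "question": golden["question"] + " Cite the exact clause."}

documents = ["Employees are permitted to work remotely up to three days per "
             "week, subject to manager approval. Requests are filed monthly."]

contexts = [c for doc in documents for c in chunk(doc) if qualifies(c)]
goldens = [generate_golden(c) for c in contexts]
eval_set = [evolve_query(g) for g in goldens]
print(f"{len(eval_set)} synthetic eval samples generated")
```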
So as we wrap up the discussion on evaluation, let's quickly go over a
practical checklist for building a robust LLM evaluation ecosystem.
First, we need to clearly identify our objectives.
We need to ask: what are we measuring?
Why are we measuring it?
Do we need accuracy?
Factual correctness?
Safety?
That way we can reach a clear goal that aligns our metrics and methods.
Second, develop diverse evaluation methods.
A single metric is never going to be enough.
We need a mix of automated scoring, human
review, and model-based grading.
Third, we must create representative datasets.
The dataset should reflect the actual complexity and
diversity of real-world inputs that the model will face in production.
The fourth point is that evaluation has to be iterative.
You can never
have a one-shot solution for everything you have to evaluate.
As the world moves forward and the data drifts, you have to re-evaluate:
redefine your test cases, redefine your edge cases, what
causes a failure, what is meant by failure, and incorporate user feedback.
And then the fifth point is establishing baseline comparisons.
Always compare your model against a strong baseline.
This helps quantify improvements and spot regressions early.
And finally, we need to leverage AI-graded evaluation.
We need to use frameworks like G-Eval, OpenAI Evals, or custom LLM-based
graders to scale evaluation efficiently without compromising on quality.
So following this checklist helps you move from ad hoc testing
to a systematic, defensible, and scalable evaluation process.
So now let's look at a real-world scenario: evaluating a
customer service AI system.
Traditionally, we might have defaulted to BLEU or ROUGE
scores for the model evaluation.
For this use case, that's simply not enough, right?
These metrics do not capture user satisfaction, business
outcomes, or operational efficiency.
So instead, we break down the evaluation into three critical layers. First,
the technical performance assessment. We start with the building blocks,
such as the NLU component, how the system handles dialogue management, and the quality of
response generation. Here we can still use
some LLM-focused metrics, but they need
to be task-aligned and context-aware.
Second, the shift to customer experience metrics.
This is where many AI models stumble.
We need to take a look at the actual response time, whether the conversation
feels natural and high quality, and how well the system handles follow-ups, especially when
multiple turns are needed for a resolution.
And finally, the business impact.
This is the bottom line: is the AI
improving conversion rates?
Is it actually reducing the resolution time?
And most importantly, is it delivering cost savings at scale?
So why do we need these evaluation techniques?
What do they give us?
They give us trustworthy output.
They make our answers factually correct.
They help us catch hallucinations early on, even before they reach
a user or go into production.
We get readable, logical answers.
By enforcing semantic coherence, we ensure the model responses are clear,
internally consistent, and actually make sense, not just sound good.
The third point is that these techniques bring us strong task alignment.
The model doesn't just respond because it has to;
it has to focus on the intent of the prompt.
There should be no wandering,
no going off topic.
And then we also have signal-to-noise control, right?
The context precision metrics can penalize irrelevant or
made-up information, which helps us trim the fluff and boost content fidelity.
And then finally, we get complete coverage.
Context recall ensures the model captures all the critical facts from the source.
So nothing important gets left behind. With these evaluation upgrades,
we don't just optimize performance, we optimize trust, clarity, and
real-world reliability.
So what is the real payoff of all of these new evaluation methods?
It's this: we now have a multi-dimensional scorecard, not just a single number or a static
metric, but a full diagnostic that tells us exactly what to fix,
where the model is struggling, and how we can improve.
And here's the best part:
it is not just for academic benchmarking anymore.
This approach directly boosts real-world user satisfaction because the
feedback is granular, actionable, and tied to the actual user experience.
So in short, better evaluation means better models, and better
models mean happy users.
That is the future we're trying to build, right?
All right folks, so that is a wrap.
Thank you so much for listening to us, and feel free to contact us if
you have any evaluation needs that you want to work on.
If you have any specific use cases where you want evaluation frameworks
to support you, or any kind of AI problems, we're happy to help you out.
Thank you for listening in.
Have a good day.