Conf42 Large Language Models (LLMs) 2024 - Online

You get an LLM, you get an LLM, everyone gets an LLM, but does it work?

Abstract

While everyone wants to say the magical words AI and LLM, have you evaluated your models to see if they actually work for you and your use case, or are you just using whatever the most trending model out there is? In this talk, we will explore language model choices and how to evaluate them.

Summary

  • Ashwin has been working in the field of machine learning, computer vision, and NLP for over three years. Today he will talk about large language model evaluations and how you can evaluate your model in a better way.
  • Will the language model speak the truth? That's the question that everyone's been asking. The bigger question is whether the language models will be truthful in answering the questions that we ask them.
  • These evaluations are an important measurement step in your overall large language model development workflow. Think of it as how we can effectively manage or leverage these language models. Understanding and measuring these models ultimately leads to how you can improve them.
  • The ideal framework will offer you metrics tailored to your different tasks. The other thing to consider for a good evaluation framework is the metric list. The framework you use should also be extensible and maintainable.
  • Human evaluation focuses on multiple geographies and multiple languages. The way people understand a particular response in a particular language determines a proper metric or score for your language model to be evaluated on. Human evaluation also comes with many challenges and has its own limitations.
  • Automatic evaluators assess how well specific parts of the output match your expectations. They struggle with creativity and overall quality judgment, but they are much better evaluators in terms of metrics. Which one to choose really depends on your particular use case.
  • One path is using the public benchmarks that are already available; the other is using golden datasets. Public benchmarks give you an understanding of the general capabilities of a model across different datasets, but they don't necessarily guarantee success.
  • Golden datasets are tailored to your specific needs. They are instrumental in building RAG-based workflows and will give you a really good idea of how your RAG workflow should behave.
  • Your use case is probably well defined: you're looking for a model that performs really well in a particular sector, and for evaluation metrics that actually work with these models. Once these metrics are defined, you really know how a language model is working for your particular flow.
  • How do we measure the effectiveness of a tailored large language model? We can start with traditional metrics. Another metric worth exploring is perplexity, which assesses the LLM's internal ability to predict the next word in a sequence.
  • What about LLMs evaluating LLMs? There's been some good research on whether an LLM can actually evaluate another LLM. The other approach is the metrics that you choose yourself. Despite the rise of automated metrics, we should also keep looking at the human aspect.
  • Public libraries or frameworks like Hugging Face Evaluate or the LM Eval Harness are a really good start. By leveraging the right metrics, you can unlock the potential of these models for your specific use case. As this field grows, there is ever more new research going into it.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, and thanks for joining me today. I'm excited to talk about large language models, specifically about LLM evaluations: how you can evaluate your model in a better way, and how you can use the frameworks that are available to understand how a particular language model is performing for your specific use case. For those of you who don't know me, I'm Ashwin. I've been working in the field of machine learning, computer vision, and NLP for over three years, and I've also worked in some other domains throughout my career. Today I'd like to dive into evaluations, of course, but also into why they're necessary and how we can arrive at a particular evaluation metric. So, to get started, let me just quickly move the screen. Okay. So, will the language model speak the truth? I guess that's the question everyone's been asking. Everyone is really concerned about the whole evaluation flow, about whether we can trust these language models to tell us something they definitely know, versus making things up and telling us something where even we aren't sure whether it's the truth. So the bigger question we'll try to answer today is whether the language model is going to be truthful in answering the questions we ask it, or in the summaries we get out of it. Moving ahead, let's jump into why we need these evaluations and why they are necessary. The reason is that evaluations are an important measurement step in your overall large language model development workflow. Think of it as how we can effectively manage and leverage these language models, while also making sure we are not letting them loose completely and creating a bad customer experience. There are three aspects to take care of here. The first is the management of these language models: maybe you're just using a few APIs that are available online, getting the results, and publishing them to your users, or maybe you're hosting your own models using frameworks like vLLM or llama.cpp, which also let you work on the overall performance of these models. The second and third are understanding and measuring these models, which ultimately leads to how you can improve them: where these LLMs fall short, so that we can refine our training data, start thinking about fine-tuning these language models, maybe with LoRA adapters, or, in general, test out different publicly available fine-tuned models. Moving ahead, we need to decide what kind of frameworks or measurement techniques would help us do this. There are three selection criteria, or let's say three evaluation criteria, that we can think of, and we can jump right into them, starting with task-specific scores that measure the right outcome from a particular language model. Imagine a toolbox where you have specifically designed metrics, specifically designed tools, that you use to assess LLMs.
When you consider that toolbox, your tools are the three things I just mentioned, starting with the task-specific scores. I must say that a one-size-fits-all approach usually doesn't work, so the ideal framework will offer you metrics tailored to your different tasks, let's say question answering or summarization, and will let you evaluate models on those specific tasks. The second thing we need to consider for a good evaluation framework is the metric list. You have the toolbox, you have the toolset, but if you don't have a reliable set of metrics that have been proven before, there is no way to really understand what a particular score or metric means when it is presented to you in an aggregated or abstracted manner. So having a good list of available metrics that you can easily implement is really crucial when deciding which framework, library, or line of research you're going to follow to evaluate your language models. The third part, which we could discuss broadly or narrowly depending on the context, is that the framework we're going to use should be extensible, it should be maintainable, and of course it should give you easy access to the underlying functions and classes. The reason is that the framework may be well adapted to a public task, but your task may require some specific understanding or knowledge, a specific way of loading the model weights, or a different tokenizer for tokenizing your requests and responses, and then this becomes a really crucial aspect in determining whether a particular framework is going to be useful for you. For example, the LM Evaluation Harness is a really good framework, really well designed for evaluating any supported model on public datasets. But if you have a task that's specific to your needs, it's a really difficult library to implement that in. And it's not just this library in particular; it's any evaluation library that abstracts these evaluation metrics away from us. So, moving ahead, by incorporating all of these elements, we can decide on a robust evaluation framework to use. Now, let's dive deeper into the two main approaches to LLM evaluation and why they are necessary. As you can see in this diagram, human evaluation ranks much higher than user testing, fine-tuning, public benchmarks, and auto-evaluation. The reason is that human evaluation covers multiple geographies and multiple languages, and the way people understand a particular response in a particular language really determines a proper metric or score for your language model to be evaluated on. As we know, a language has multiple dialects, and people talk or use different styles of grammar that are not the common way of speaking or understanding things in other countries, and that can make a lot of difference.
Let's say you're evaluating a model's response for someone in the US versus someone who is not a native English speaker. For them, understanding the context, understanding what's going on around a particular response, may differ, and they may or may not think that the answer suits them well. So having a human evaluation setup that is geography- and region-specific, and of course application-specific, is a really important aspect of evaluating these models. Now, once we've talked about human evaluators and why they're necessary, we should also consider that it is not always possible to collect all of this feedback, and it is not always possible to have your customers decide whether a particular answer was good or bad; customers are really concerned with the value that your application or use case is providing. So, in general, you can understand these evaluation approaches from two different perspectives: one is, of course, the human evaluator part we just discussed, and the other is the frameworks and libraries we'll be exploring in this talk, which we can call auto-evaluators, or anything that's a non-human evaluator. Traditionally, the way these language models came into being was by training on a large number of datasets, a large amount of text and material that was publicly available. So, as companies and use cases try to test these models, they've made human evaluators a kind of standard, a stop in the loop before giving users complete access to the language model use case, for example by releasing it as an experiment or a beta. When people interact with it and understand what's going on, sometimes you get a "no, this answer is totally false", and people will bash you for it and give you negative remarks, which in turn helps you serve the model better and evaluate what went wrong and where. The strengths here are that humans can provide nuanced feedback, judging not only whether an outcome is simply right or wrong based on their understanding, but also considering factors like creativity, coherence, and relevance to the task. Just as I mentioned before, what would be relevant, coherent, or creative to a native English speaker wouldn't necessarily be the same for a non-native speaker, just because of how they understand the language. There are, obviously, many challenges with human evaluation, and it has its own limitations. The choices, as I said, can be subjective, based on location, and defining success metrics only on what someone from a particular place said is obviously challenging and may raise questions about how that conclusion came about. Add to this the fact that human evaluation also takes time: it is an iterative process, and it is also expensive, because you will probably be putting this into the hands of your potential customers, who might get frustrated and stop using the app, and then you have to convince them to use it again and give feedback. That entire flow costs time and money. So let's move toward automatic evaluators.
Now, when I say automatic evaluators, I don't necessarily mean that everything happens by itself, you call a simple function, everyone's happy, and all the language models have achieved nirvana. What I mean by automatic evaluators is that the toolbox we set up with those three criteria allows you to evaluate a particular large language model on its own while you iterate over your use cases. Their strengths are that they're fast, they're efficient, and they're objective. They assess how well specific parts of the output match your expectations based on the datasets that you have, and the choices are usually based on known outcomes and well-defined metrics. How do these known outcomes come into the picture? It's the people who are actually serving these models who determine what the output for a particular question or summary should be, so that whenever you are close to it, with ROUGE scores for example, you can assume the answer is well understood rather than completely made up. However, we also cannot discount the limitations they possess. They can't replicate what we talked about before, which is both the great thing about a human evaluator and its limitation: the ability to understand the context and nuance of a response in an overall sense, what a particular answer should be, or what follow-up question might come based on that answer. So automatic evaluators do struggle with creativity and overall quality judgment, but they are much better evaluators in terms of metrics, let's say, as I said, ROUGE scores and n-gram matching. So the two somehow work well together. The real golden opportunity here is to mix human evaluators and auto-evaluators, building your entire flow in such a way that you leverage the best of the two and also overcome the, I would not say bad parts, but the less good parts of each. The takeaway from all of this is that a collaborative option is probably better, and which one to choose really depends on your particular use case. The ideal approach most likely involves a combination where humans provide valuable feedback on whether something was right or wrong, and the auto-evaluators offer consistency in checking whether the answer you're getting for the same question, or for the same summarization request, is always going to be somewhat similar.
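To make the idea of an automatic evaluator against known outcomes a bit more concrete, here is a minimal, hand-rolled sketch of exact match and token-level F1 scoring. This is my own illustration rather than something shown in the talk, and the example predictions and references are made up.

```python
# Minimal sketch: an "automatic evaluator" that scores model outputs against
# known reference answers using exact match and token-level F1.
# The predictions and references below are purely illustrative.
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

examples = [
    {"prediction": "Paris is the capital of France", "reference": "Paris"},
    {"prediction": "The capital is Berlin", "reference": "Berlin"},
]
em = sum(exact_match(e["prediction"], e["reference"]) for e in examples) / len(examples)
f1 = sum(token_f1(e["prediction"], e["reference"]) for e in examples) / len(examples)
print(f"exact match: {em:.2f}, token F1: {f1:.2f}")
```

Scores like these are cheap, repeatable, and objective, which is exactly the trade-off described above: great for consistency, blind to nuance.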
Now let's shift gears from the evaluators themselves to how these two kinds of evaluators work with different sorts of datasets. We have two dataset paths here. One is using the public benchmarks that are already available: you don't have to do much, you just trust the benchmarks published on public leaderboards, maybe on Hugging Face or in individual competitions. The other is using golden datasets. Now, when I say golden datasets, it doesn't mean they are literally the gold standard; it just means these are datasets whose values you control, values that are, I should say, almost 90% true, which makes them really effective at giving you an answer. Public benchmarks are predefined datasets; you can easily get them from Hugging Face. The more research that was put into creating these datasets and into training models to test on them, the fairer the assessment these public benchmarks will give you. They will give you an understanding of the general capabilities of a model on different sorts of data: how well a model performs at, say, abstractive summarization, question answering, or even answering questions about code, debugging, or explaining programming or scientific concepts, or any concepts in general. However, given the broad scope of datasets and formats they usually cover, public benchmarks don't necessarily guarantee success, and they don't give you a yes-or-no answer when you're looking at a particular model and deciding whether to use it, because your use case is a specific one, rather than people asking for travel tips or money-saving tips, which is what we usually see as the most popular uses of these language models. So the takeaway on public benchmarks: they do offer a valuable starting point, something to give you an edge, but they shouldn't be the sole measure of what you're trying to achieve or of the LLM's effectiveness for it. Moving ahead to the golden datasets: these, as I already said, are tailored to your specific needs. You control what output you expect; or, to put it a better way, you control what exact result you're expecting from the large language model. Of course, you do not control what the large language model actually gives you, beyond a certain extent. But what you do know is that the more the model's output matches a particular reference sentence, the more accurate I can assume it is, and the better it performs on the task I built it for. So these golden datasets allow you to evaluate how well an LLM performs on the tasks that matter most to you, rather than on a generic dataset of, say, Reddit comment chains that no longer make sense to you, even though that kind of data is part of the overall training set if it wasn't cleaned out. Some examples of golden-dataset use cases would be checking the semantic similarity of the generated content, measuring perplexity, or, as I said, ROUGE scores in summarization tasks, so that you can understand how well a summary was generated and how short or long it even was. These datasets are instrumental in building RAG-based workflows, that is, retrieval augmented generation workflows, and they will give you a really good idea of how your RAG workflow should behave, which is what we're trying to evaluate here.
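To make the golden-dataset idea a bit more concrete, here is a minimal sketch (mine, not from the talk) of scoring model outputs against a small hand-curated golden set using embedding-based semantic similarity. It assumes the sentence-transformers package is installed, and the model name, prompts, and outputs are illustrative.

```python
# Sketch: compare LLM outputs against a hand-curated "golden" set using
# embedding cosine similarity. Assumes `pip install sentence-transformers`;
# the golden examples and model outputs below are purely illustrative.
from sentence_transformers import SentenceTransformer, util

golden_set = [
    {"prompt": "Summarize our refund policy in one sentence.",
     "expected": "Customers can request a full refund within 30 days of purchase."},
]
llm_outputs = [
    "You can get your money back if you ask within 30 days of buying.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

for example, output in zip(golden_set, llm_outputs):
    expected_emb = model.encode(example["expected"], convert_to_tensor=True)
    output_emb = model.encode(output, convert_to_tensor=True)
    similarity = util.cos_sim(expected_emb, output_emb).item()
    print(f"{example['prompt'][:40]}... -> semantic similarity {similarity:.2f}")
```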
So the most effective approach, again, is to leverage both, as we discussed before: leverage the public benchmarks as well as the datasets you prepare yourself, because each provides a benchmark that you can then evaluate against and iterate over. Moving ahead, now that we understand what these evaluation methods and resources are and what usually works, let's talk about applying this knowledge to your specific needs. As I said before, your use case is probably well defined; you're looking for a model that performs really well in a particular sector rather than a general-purpose model. Think of the Chevrolet case, where someone prompted a dealership's chatbot and tried to get a brand new Chevrolet Tahoe for one dollar. If you are, say, in the scientific research community and your model is open to answering any and all sorts of questions, that's not really good for you, that's not conducive to your use case, and the evaluation metrics, when they're actually running against these models, will tell you how close people's conversations stay to the use case, how close the topics are, whether they really make sense, and whether the user feedback makes sense. So, in general, consider: do you need this use case to answer questions, to produce summaries, to provide citations and references, or just to be a general language model that answers from its own knowledge without any external source? And what content goes in and what content the LLM interacts with: is it backed by a document library, a vector database, or just a vast collection of text data? That's how your use case determines what the output should be and how well defined it should be. Once this is defined, you can really understand how the model is performing. You could involve some fine-tuning, or use topic modeling to restrict the model to answering particular questions based on your use case. It's also crucial to establish certain guardrails, certain input validation and output validation, so that you're not drifting into something you shouldn't be doing or talking about. Of course, prompt engineering, or in general defining the prompt, is also an aspect of this flow. But once the metrics that we'll discuss now are defined, you really know how a particular language model is working for your particular flow. Moving ahead, let's dive into the nitty-gritty, the specifics of what we've been talking about: how do we measure the effectiveness of a tailored large language model? We can first look at traditional metrics, the most familiar of these tools. These are effective for the kinds of outputs we get: the various flavors of ROUGE scores, the more classic natural language processing approach. How well the LLM's output matches on n-grams, or other scores you check for other use cases, will help you understand how well your model is performing. Another metric we should also be exploring is perplexity, which takes a slightly different approach: rather than comparing against a carefully put-together reference dataset, it assesses the LLM's internal ability to predict the next word in a sequence. It tells you how confident a particular LLM was in predicting the next word; the lower the overall score, the more confident the LLM is in its predictions and the less likely it is to generate surprising, or even completely nonsensical, results.
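Since perplexity came up, here is a minimal sketch of how you might compute it for a causal language model with Hugging Face transformers. The model name and text are illustrative, and real setups typically use a sliding window over longer documents.

```python
# Sketch: perplexity of a causal LM on a piece of text, via the average
# next-token cross-entropy loss. Model name and text are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works; swap in the model you are evaluating
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Large language model evaluation starts with a good test set."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy over tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"perplexity: {perplexity:.2f}")  # lower = the model is less "surprised"
```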
Beyond these traditional metrics, tools like RAGAS (I'm not sure how it's actually pronounced) are really good workflows for evaluating your retrieval augmented generation pipelines, and metrics like faithfulness, relevance, and the other aspects of that framework will let you understand much better how your whole RAG pipeline is working. Another good framework is QAEval by Daniel Deutsch; it also provides good insights, checking whether a particular LLM captures the key concepts or key information that was asked of it. Now that we have established a benchmark, I guess the next step is: what about LLMs evaluating LLMs? There's been some good research, some good findings, on whether an LLM can actually evaluate another LLM. This is mostly a trial-and-error approach, and you have to make sure you understand the use case. Something like Chatbot Arena basically takes the outputs generated by two different models and lets you evaluate whether those two models can compete with each other in generating text. You can also ask one model, "hey, is this factually correct based on this answer?", and iterate on that. I wouldn't put much stress on how efficient or correct these judgments are, but considering that most models were trained on somewhat similar datasets, you should expect the results to be fairly subjective while still giving you a good insight into how one model performs against another. The other approach is, of course, the metrics we discussed: the metrics you choose yourself. You compare those metrics, you have a definite set of metrics you've iterated on so far, you have a dataset you can test against, and you can turn that into a number, into a perspective, so that your customers, your management, the people you work with, or even the scientists you work with have a good benchmark of where they are right now and where they're looking to be. And then there's the human touch, the ultimate judge: what human evaluation and a multifaceted approach give you. Despite the rise of all these automated metrics, I believe we should also be looking at the human aspect of it, and be really sure we are closing the gap between what we are evaluating and what is necessary, so that instead of chasing the next cool model, the next best publicly available model, we make an informed decision about whether we are falling behind just because there was a new model launch, or because the dataset we fine-tuned on isn't that good, or because we need more of that data, or whether we are in fact fine with the metrics we have. By slowly iterating over that, we'll get good at this game and be able to clearly determine what is essentially required.
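To illustrate the LLM-as-a-judge idea mentioned a moment ago, here is a rough sketch of asking one model to grade another model's answer. It assumes an OpenAI-compatible endpoint via the openai Python client; the judge model name, prompt wording, and 1-to-5 scale are illustrative choices, not a standard.

```python
# Sketch: using one LLM to judge another LLM's answer for factual correctness.
# Assumes an OpenAI-compatible endpoint and the `openai` client; model name,
# prompt, and the 1-5 scale are illustrative, not a fixed standard.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "What year was the transformer architecture introduced?"
candidate_answer = "The transformer architecture was introduced in 2017."

judge_prompt = (
    "You are grading an answer for factual correctness.\n"
    f"Question: {question}\n"
    f"Answer: {candidate_answer}\n"
    "Reply with a score from 1 (wrong) to 5 (fully correct) and one short reason."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative judge model
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0,  # keep the judgment as deterministic as possible
)
print(response.choices[0].message.content)
```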
Moving ahead, these are some of the available frameworks. RAGAS and HELM, which is a really good benchmark: these will tell you how a particular model performs on a large corpus of datasets. There's also LangSmith by LangChain, with or without the Weights & Biases integration, for you to run evals. There are the OpenAI Evals, which give you a good idea of what evaluation frameworks and metrics are available. There's DeepEval, and there's the LM Evaluation Harness that I've really grown kind of fond of: it does a really good job of letting you evaluate a particular model on a particular public dataset with really little overhead, but, as I said, there's some overhead in making it work for your own custom use case. That's where I found that simply using basic Hugging Face functions to load your dataset and calculate scores is often easier or more efficient. So those are the frameworks that are available for us to try and test. At the end, all you need is a test or eval set and a multifaceted approach to using it: the dataset you have, or the one you will create, and an understanding of how you can develop a particular set of metrics around it and use them. The foundation is going to be your test or evaluation set, plus properties that are internal to the model or to the framework you're using, like perplexity, which can be calculated from the outputs you get, or any existing implementations for the public or specific task you're trying to cover. The benefit is that you really control what sort of metrics are used in the evaluation, what sort of metrics you quantify your application against, and you have an established set of datasets that can help you over the long term to understand how the model performed over a certain duration, and what you can do better after, say, you've fine-tuned the model multiple times or changed the model. It gives you a good flow of metrics to base your decisions on. I guess that's probably it; that's what I wanted to discuss. In general, I would say that adopting public libraries or frameworks like Hugging Face Evaluate or the LM Eval Harness is a really good start for getting metrics like aggregated F1 scores, BLEU scores, or ROUGE scores, and for defining an evaluation framework for a RAG-based flow; these will work seamlessly with your chosen datasets. Also, of course, including human evaluation, a human flow, in the application, and really comparing the human evaluation results with what you're getting against a particular benchmark, will be a good overall strategy for evaluating large language models.
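As a quick example of the "basic Hugging Face functions" route mentioned above, here is a minimal sketch using the evaluate library to get ROUGE and BLEU scores. It assumes `pip install evaluate rouge_score`, and the predictions and references are illustrative.

```python
# Sketch: ROUGE and BLEU with the Hugging Face `evaluate` library.
# Assumes `pip install evaluate rouge_score`; the data below is illustrative.
import evaluate

predictions = ["The cat sat on the mat.", "Transformers were introduced in 2017."]
references = ["The cat is sitting on the mat.", "The transformer architecture was introduced in 2017."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print("ROUGE:", rouge_scores)   # rouge1 / rouge2 / rougeL F-measures
print("BLEU:", bleu_scores["bleu"])
```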
In conclusion, we've explored, in a fairly abstract manner, how you can evaluate and why a particular evaluation type is necessary. In the end, it's not just about whether you get a score out of it. It's not about establishing a particular metric, charting out six months of metric data, and just saying, hey, this is the model that works for me. It's about establishing a foundation, so that as you develop iteratively and manage these large language models, you can continuously improve on the remarkable research that has already gone into putting these models out in public with their weights. By leveraging the right metrics, you can unlock the potential of these models for your specific use case, determine whether a particular model works for you, and if it doesn't, understand why not, and in general deliver real-world benefits to the user. As this field grows and more new research goes into it, we'll probably move ahead, or have probably already moved ahead, beyond these basic scores and into a more complex understanding of context: whether a particular model understands a particular context, grammar, and all of these ideas. It's going to be really exciting to see what new metrics and frameworks come up so that we can evaluate large language models better. So yeah, that's it from me, and I hope you got a really good starting point. This was meant to be just an introduction to what kinds of frameworks are available, what you can do with them, and which approaches work really well. I hope you enjoyed it and learned something, or maybe confirmed something you already knew. I'd be happy to connect and explore these topics more in depth, given the time constraint we have here. If you have any questions, or if you want to dive deep into any of these concepts, why you should make a particular choice, or what my previous experiences have been, you can connect with me on LinkedIn and we can discuss that as well. Again, thank you Conf42 for giving me this platform. I've learned a lot while researching this topic on top of my previous experience, and I hope you learned something too.
...

Ashwin Phadke

Senior Machine Learning Engineer @ Servient



