Conf42 Large Language Models (LLMs) 2024 - Online

Automated Evaluation for your RAG Chatbot or Other Generative Tool

Abstract

Retrieval Augmented Generation (RAG) chatbots and other generative tools that you want to build out for narrow use cases are software and can be tested like software. In this talk, I’ll discuss how to write automated evaluations of your tools by using LLMs to assess your LLM.

Summary

  • Last year, I spent part of a weekend making a little demo of a retrieval augmented generation (RAG) chatbot: a chatbot that uses an LLM to answer questions about a specific set of documents. I couldn't rate how unusable it was because I didn't have an evaluation strategy. Since then, I've been working on LLM model evals.
  • Why do we ever automate testing? Because you're going to break things. For LLM tools specifically, we also automate testing because you want quick feedback on the choices you make, and the more your tool is doing, the less feasible it becomes to test it manually.
  • We test to make sure that our tools are doing what we want them to do. With generative AI, like a chatbot, we're not dealing with a classification model whose output can be checked against a label. How can we assess free-form text? There are a few options, which I'll go through.
  • You can write a specific test for what you're looking for and let an LLM do the evaluation for you. It's not a substitute for user testing, but it can complement it. You'll also want more flexible testing that can be done on the fly with new questions.

Transcript

This transcript was autogenerated.
Last year, like a lot of people, I spent part of a weekend making a little demo of a retrieval augmented generation, or RAG, chatbot. This is a chatbot that uses an LLM to answer questions about a specific set of documents, where those documents aren't necessarily in the LLM's training data. The user asks something, your tool searches to find related text in your documents, and then it passes those chunks of text to an LLM to use in answering the question.

This was my weekend demo project. It used LangChain, it used Gradio, and it was probably 80% code I took from someone else's Colab notebook, though my code was a little cleaner and better organized, at least. But the actual chatbot was terrible. It didn't work. It was supposed to answer questions about Shiny for Python, a Python library for building dashboards that was too new to have its documentation in GPT-4's training data. My chatbot was making things up, it was getting things wrong, and it generally wasn't usable. I could tell you it wasn't usable from trying to use it, but I couldn't tell you how not usable it was. I couldn't rate its lack of usability, because I didn't have an evaluation strategy.

Then I made another one, this time to answer questions about the California traffic code. I scraped the data from the website and fought with it a little to get it parsed appropriately. This one was built with Streamlit, and someone else built a nice interface for it, and it was somewhat better. But when we were trying to make decisions about things like which open source model to use as our underlying LLM, I still didn't have an assessment strategy for figuring that out. How were we supposed to tell whether it was working when we demoed it to potential clients? Was the best we could do just to ask some questions and hope that whatever the client asked was something it could answer?

Now I've been working on LLM model evals for a while, and I've spent a lot of time evaluating LLM output in different ways. So I have at least the beginning of a strategy for how to evaluate your RAG chatbot, or another generative AI tool like it, that you're building for customers or maybe internally to do a specific thing. I'm going to tell you about that today.

First, why are we automating testing? Well, why do we ever automate testing? We automate testing because you're going to break things, and you want to find out that you broke something when you push your code to a branch that isn't your main branch, before you merge that code in and your product goes boom. We automate testing because we're human and therefore fallible. But in the context of your LLM tools specifically, we also automate testing because there are choices you're going to make about your tool, and you want quick feedback about how it works, or could work. For instance, like I mentioned, you might be trying to decide which underlying LLM to use, or what system prompt to use, if that's relevant for you. When you make or consider changes, you want to know how they affect performance. And the more your tool is doing, like with any kind of software, the less feasible it becomes to test things manually. What I was doing with my California traffic code chatbot, which was taking a series of questions, running them through multiple models, and then looking at the responses myself, isn't the worst approach, but it's not the greatest either.

Let's talk next about how to automate testing broadly.
We test to make sure that our tools are doing what we want them to do. So what do you want your tool to do? What are some questions you want it to be able to answer? What does a good answer look like? What does a bad answer look like? Now test that it's doing that. Easy, right? Just test that it's doing what you want it to be doing. We do that all the time for machine learning problems generally, and for NLP specifically.

But actually, this is not that easy. It's pretty hard. Why is it hard? It's hard because text is high dimensional data. It's complicated, and it has a lot of features. And with generative AI, like with a chatbot, we're not talking about a classification model where the result is pass or fail, spam or not spam.

As a digression: you can also use large language models to build classification tools or do entity extraction, and they're really good at that. It's my favorite use case for LLMs, in part because you can evaluate them super easily, the same way we've always evaluated these kinds of tools, by comparing the model output with our ground truth labeled data. So if you have the opportunity to do that instead, oh my gosh, do that: evaluate it with a confusion matrix and call it a day. But everyone wants a chatbot, so that's what we're talking about. And in the case of chatbots, we're asking a question and getting a whole sentence back, or a paragraph. How can we assess that?

We have a few options, which I'll go through. But first I want to note that the purpose of this kind of testing is not to comprehensively test everything someone might ask your tool about for accuracy. If you were going to generate every possible question someone could ask of your tool, along with criteria for evaluating answers specific to each question, then you wouldn't need a generative tool; you would need a frequently asked questions page and some search functionality. The purpose instead is to select some of the types of questions you want your tool to be able to answer and then test the content of those.

So, first option: string matching. We've got some choices here. We can look for an exact match, like the answer being an exact sentence, or for the response to contain a particular substring, like "Paris" appearing somewhere in the response if we ask for the capital of France. We can use regular expressions, or regex, if there's a pattern we want, like a particular substring that should only count as a standalone word, not as part of a bigger word. We can measure edit distance, or how syntactically close two pieces of text are, like how many characters we have to flip to get from one string to another. Or we can do a variation of exact matching where we want to find a list of keywords rather than just one.

Here's an example: a little unit test where we make a call to our tool's API, pass it a question, get back a response, and then test whether there's something formatted like an email address in it.

So, ship it? Does that look good? Is this a good way of evaluating high dimensional text data to see if it's got the answer we want? Nope, this isn't great. There's a lot we just can't do in terms of text evaluation with string matching. Maybe there are some test cases you can write like this if you want very short factual answers, but in general, don't ship it.
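As a rough illustration of the kind of string-matching test just described, here's a pytest-style sketch. The ask_chatbot helper and its localhost endpoint are hypothetical stand-ins for however your own tool exposes its API; they aren't from the talk.

```python
# A minimal sketch of string-matching tests for a chatbot, assuming a hypothetical
# HTTP API. Adjust ask_chatbot() to match however your own tool is actually exposed.
import re

import requests


def ask_chatbot(question: str) -> str:
    """Send a question to the (hypothetical) chatbot API and return its text answer."""
    resp = requests.post("http://localhost:8000/ask", json={"question": question}, timeout=60)
    resp.raise_for_status()
    return resp.json()["answer"]


def test_response_contains_email_address():
    # Ask a question whose answer should include a contact email address.
    answer = ask_chatbot("How do I contact the support team?")
    # Pass if anything formatted like an email address appears in the response.
    assert re.search(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", answer)


def test_response_contains_keywords():
    # Keyword variant: require each word to appear as a standalone word in the answer.
    answer = ask_chatbot("What is the capital of France?")
    for keyword in ["Paris"]:
        assert re.search(rf"\b{re.escape(keyword)}\b", answer, flags=re.IGNORECASE)
```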
Next, we have semantic similarity. With semantic similarity, we can test how close one string is to another string in a way that takes into account both synonyms and context. There are a lot of small models for this: we can project our text into something like 768-dimensional space, which is actually a major reduction in dimensionality, and then take the distance between two strings, the response we got from our model versus the response we wanted to get from it.

Here's an example. This isn't exactly real, but basically you download a model that's going to do your tokenizing, so that'll break your text up. You hit your tool's API with a particular prompt to get a response. You project that response into your n-dimensional space with your model, and you project the target text that you wanted your tool to produce into that space as well. Then you compare the two. Your test then uses a similarity threshold for how close they were and tells you whether it passed, that is, whether the two texts were sufficiently close according to that model.

So, ship it? Again, I don't want to say never, but there's a lot of nuance you're not necessarily going to capture with similarity, especially as your text responses get longer. Something can be important and you can still miss it.

Okay, so finally we have LLM-led evaluations. That's where you write a specific test for what you're looking for, and you let your LLM, or an LLM, do your evaluation for you. This doesn't need to be the large language model you're using behind your tool. For instance, maybe you use an open source or smaller LLM to power your actual chatbot, say, to save money. You might still want to use GPT-4 for your test cases, because it's still going to cost pennies or less to run them each time.

So what does this look like? It looks like whatever you want it to. Here's one I used to evaluate text closeness, so how close the text your tool output is to the text you wanted to see, and it gives back an answer on a scale of one to ten. Here's another one: you can write an actual grading rubric for each of your tests. This is a grading rubric for a set of instructions, where the answer passes if it contains all seven steps and fails otherwise. I'm using a package called Marvin, which I highly recommend, and it makes getting precise, structured outputs from OpenAI models really as easy as writing the rubric. You can also write rubrics which return scores instead of pass/fail, and then set a threshold for your test passing. For instance, you could make it pass if it returned four, five, six, or seven of the steps and fail otherwise. Again, this is a level of detail you can't get using string matching or semantic similarity.

I'm doing some other work involving LLM-led evals, so I'll show you another example of how complicated we can get with these. There's a classic logic problem about transporting a wolf, a goat, and a cabbage across a river, where you can only bring one at a time, and the goat can't be left alone with either the wolf or the cabbage, because something will get eaten. If you swap out those names, the wolf, goat, and cabbage, for other things, some LLMs get confused and can't answer accurately. On screen is a rubric for using an LLM to evaluate a response and decide whether it passed or failed the question. What it's possible to do here is write a rubric that works for both the five-step answer and the seven-step answer, the difference being that the seven-step answer includes two steps where you're traveling, or teleporting, alone without an object.
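To make the rubric idea concrete, here's a minimal hand-rolled sketch of an LLM-led, pass/fail rubric eval. The talk does this with Marvin, which turns a rubric into a structured prompt for you; the version below just calls the OpenAI chat API directly so the moving parts are visible. The rubric wording, the grader model choice, and the ask_chatbot helper (reused from the earlier sketch) are illustrative assumptions, not the speaker's actual code.

```python
# A sketch of a rubric-based, LLM-led eval: a grader model reads the rubric and the
# chatbot's answer, then returns PASS or FAIL. The rubric and question are illustrative.
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are grading a chatbot's answer against a rubric.
Rubric: the answer PASSES only if it includes all seven steps of the
wolf/goat/cabbage river-crossing solution, in a workable order.
Respond with exactly one word: PASS or FAIL."""


def grade_with_rubric(question: str, answer: str) -> str:
    """Ask a grader model to rate an answer against the rubric; returns 'PASS' or 'FAIL'."""
    completion = client.chat.completions.create(
        model="gpt-4",  # the grader doesn't have to be the model that powers your tool
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer to grade:\n{answer}"},
        ],
        temperature=0,
    )
    return completion.choices[0].message.content.strip().upper()


def test_river_crossing_answer_passes_rubric():
    question = "How do I get the wolf, the goat, and the cabbage across the river?"
    answer = ask_chatbot(question)  # hypothetical helper from the earlier sketch
    assert grade_with_rubric(question, answer) == "PASS"
```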
Given a rubric like that, the LLM is capable enough to accurately rate each answer as passing or failing. You can do the same thing with a select set of questions you want your generative tool to be able to answer: write the question, write the rubric, and play with different LLMs, system prompts, or any other parameter your tool uses to see how it changes the rate of accuracy. It's not a substitute for doing user testing, but it can complement it.

You're also going to want more flexible testing, that is, testing that doesn't rely on having specific rubrics and therefore can be done on the fly with new questions. There are tools for that too. For instance, you can use LLM-led evals to see whether your RAG chatbot is using your documents to answer the questions or whether it's making things up. That is, we ask the LLM: was the answer my tool gave based only on information contained in the context that I passed to it? We give it both the context, so the document chunks, and the answer the tool actually gave. That's really useful. You can run it on your log files, and you can even run it in real time as a step in your tool, so that you never give the user a response that didn't pass this test. You can also use LLM-led evals to assess completeness, that is, did the response fully answer the question? Now, neither of these will necessarily get you accuracy. For accuracy, you need to define what accuracy means, and that's individual to each question. But these checks can still get you a lot, and again, their strength is that you can run them on any question, even in real time. I got these from Athina AI, which is a startup in this space, but there are other companies in the space as well doing novel and cool things with monitoring LLMs in production. I do think you can roll your own on a lot of these evals, but if you don't want to, you don't have to.

So, ship it? Yeah, totally ship it. Write some tests and treat your LLM tool like real software, because it is real software.

Before I go, I want to quickly mention again two products that I'm in no way affiliated with but that I think are doing cool stuff. Marvin is a Python library where you can very quickly write evaluation rubrics to do classification or scoring, and it'll manage both the API interactions and transforming your rubrics into full prompts to send to the OpenAI models. And Athina AI, that's Athina with an I in the middle, is doing some really cool things with LLM-led evals, including specifically for RAG applications.

This is me on LinkedIn; please get in touch if you're interested. And here's my Substack: I wrote something on this topic specifically, but I'm also writing about LLM evals in general, red teaming, and other data science topics. Thank you for coming.
...

Abigail Haddad

Lead Data Scientist @ Capital Technology Group

Abigail Haddad's LinkedIn account


