Conf42 Large Language Models (LLMs) 2024 - Online

Measuring hallucinations in RAG

Abstract

Developers building RAG applications want to overcome the inherent problem of hallucinations in LLMs. In this talk, I'll explain why hallucinations occur, how RAG helps reduce them, and how HHEM (the Hallucination Evaluation Model) can help you measure them.

Summary

  • Ofer Mendelevitch: Today I'm going to talk about measuring hallucinations in RAG, or retrieval augmented generation. He says within five years we'll see a transformation of all applications, from consumer to enterprise. Use cases include question answering and chatbots.
  • How RAG is built when you want to do all the steps on your own: you first ingest the data into the system, chunk it, and then embed each chunk. The embeddings are used for neural search later on.
  • First sign up for a free account. Then you need to ingest some data; there are a lot of different ways to do that. You can use our APIs directly, or you can use some of the tools we have available.
  • Some of the apps built to demonstrate how to use this. The first one is an example called Ask News. It does the retrieval really quickly and gives you a response that answers the question. The next one is the same application, but now using HHEM.
  • A chatbot built with the Vectara APIs. It answers questions based on information crawled from the IRS website. Full disclosure and warning: please don't use it for anything other than demo purposes, and use your tax advisor to file your taxes.
  • I encourage you to sign up for our free account. It's pretty generous and lets you get started with up to 50 megabytes of text and 15,000 queries a month. If you're a startup, take a look at our startups program.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. My name is Ofer Mendelevitch, and I head developer relations at Vectara. Today I'm going to talk about measuring hallucinations in RAG, or retrieval augmented generation. A little bit about myself: I've been with Vectara for about a year, and I had the opportunity to work on LLMs early on, since the days of GPT-2, and it's been an incredible journey for me to see how this technology evolved to become so useful and help us be more productive. I truly believe what's stated in this slide, which is that the LLM and generative AI revolution in general is really important, and within five years we'll see a transformation of all applications, from consumer to enterprise. Every piece of knowledge we acquire will be able to come through this generative AI interface, so we'll be able to interact with computers in a way that's very different from what we do today. To me, this is a little bit like the transformation we saw when the iPhone came out: a very different user interface. You can swipe, you can use your hands and your fingers instead of the keyboard and the mouse. It's that level of transformation. Now, as I interact with customers of Vectara, I see a lot of different use cases, and I want to share some of those with you. We have use cases around chatbots, which are very popular; for customer support, for example, you can put up a chatbot based on LLMs to answer customer questions. There are a lot of question answering applications that are very useful, and I'll show a couple of examples here today. Product recommendations, again using the latest LLM and NLP capabilities to build recommendation engines. Semantic search, moving away from traditional keyword search toward a better search experience. Workplace search, and many others. Now, one of the problems with LLMs, at least today, is that they still hallucinate. A hallucination is when the LLM gives you a response that looks very authentic and very convincing, but is actually wrong. This is one of my favorite examples: did Will Smith ever hit anyone? You ask GPT-3.5 that, and it gives you this response: no, Will Smith is a decent guy, no known assault incidents, etcetera. And of course we all know that's wrong, because of what actually happened at the Oscars about two years ago. So that's an example of a hallucination, and there are a lot of them. The question is: how can we avoid, or at least reduce, the amount of hallucinations, to make the end application much better for the user? One of the ways you address hallucination is with RAG, and that's what made RAG so popular. RAG stands for retrieval augmented generation; let me walk you through how it works at a really high level. The idea behind RAG is that you augment the information the LLM has with some other information. It could be other public information, but in an enterprise context it is often private information that only exists within the firewall of your organization. A regular LLM takes a user query, thinks about it for a while, and gives you a response based only on its internal knowledge.
With retrieval augmented generation, the LLM holds on for a second and asks a state-of-the-art retrieval engine to look at the data you provided and come up with relevant pieces of text, chunks, or facts that the LLM can use to augment its internal knowledge and answer more accurately. Again, use cases for that include question answering and chatbots, like I mentioned earlier, and it has become a very common and very useful kind of application in the enterprise setting. Now, I apologize for this busy slide, but I wanted to share with you a little bit of how RAG is built when you actually want to build it yourself and do all the steps on your own. The blue arrow here walks through the data ingest flow. Initially you have some data, the data I described earlier. It could be in a database, like Microsoft SQL Server, or in AWS Redshift, Snowflake, or Databricks. It could also come from enterprise applications like Jira or Notion. And very often it's just a bunch of files: PDFs, PowerPoints, or documents of other kinds, sitting on S3 or on another platform like Box or Dropbox. You ingest that data into the system first. The first thing you need to do is take the document in its original form, let's say a PDF, and extract the text from it, turning it from binary into text. That text can be really long, so very commonly you chunk it into smaller chunks. A chunk could be a page, or three paragraphs, or two sentences; there are a lot of different strategies around that, and I encourage you to read more about this if you're building it yourself, because the way you chunk text actually impacts performance pretty significantly. By the way, before I move on: I'm mentioning a couple of different vendors or product names you can use for each of these steps. That's just a small list, it's not comprehensive; I just wanted to mention a couple of options. Once you finish with the chunking, you embed each chunk. What does embedding mean? It's a model, a different model than your GPT-4, called an embedding model. It takes the text and translates it into a vector of numbers, think a thousand floats, and that vector represents, in the embedding vector space, the semantic meaning of the text, which is going to be used for neural search later on. You take this vector and put it in something called a vector database, or vector store, which knows how to handle these vectors and search over them really well. Again, there are many, many options here; I'm mentioning just a few.
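To make that ingest flow a little more concrete, here is a minimal sketch in Python. It assumes the sentence-transformers library with an off-the-shelf embedding model, a naive word-count chunker, and a plain in-memory list standing in for a real vector database; the model name and chunk size are illustrative choices, not the specific products on the slide.

    # Minimal sketch of the DIY ingest flow: extract text, chunk it, embed each
    # chunk, and keep (chunk, vector) pairs in a simple in-memory "vector store".
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

    def chunk_text(text, max_words=120):
        # Naive chunking by word count; real systems often chunk by sentence or page.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    def ingest(documents):
        # Returns (chunk, embedding) pairs, i.e. what a vector database would index.
        store = []
        for doc in documents:
            chunks = chunk_text(doc)
            vectors = embedder.encode(chunks)  # one vector per chunk
            store.extend(zip(chunks, vectors))
        return store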
Okay, so now that you have the text and the vectors stored, you're ready to do the actual search, so let's go through the user query journey. There's some user interface, some application where the user has a box to enter their query. You enter the query, and the query also gets embedded, so there's a vector representing what the query is and what its semantic intent is, and you run that against the retrieval engine. The retrieval engine looks at the vector store, retrieves the most relevant matches from what was indexed before, and brings that text back as the facts, or candidates. Those get integrated into a prompt that essentially says something like: hey, here's a user query and here are some facts that can help you address this query; please respond to the query in the best way possible given these facts. You send that to an LLM, like GPT-4, Anthropic Claude, Llama 2, or anything else, and then the response gets sent back to the user. There's also an option here, especially in the enterprise context, to look at the response first; sometimes people use guardrail products that essentially make sure inappropriate content does not get shown back to the user. Now, I haven't mentioned the red arrow much, but the red arrow represents action. What I mean by that is that sometimes in the application you don't just show the response to the user, you also do something with it: you open a Jira ticket with this information, you send it in an email, etcetera. Those are all options you have at the end of this process.
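And here is the matching query-side sketch, under the same assumptions as the ingest sketch above (the store and embedder variables come from there); the prompt wording and the OpenAI model name are illustrative stand-ins, not Vectara's actual prompt or stack.

    # Minimal sketch of the DIY query flow: embed the query, retrieve the closest
    # chunks by cosine similarity, build a grounded prompt, and call an LLM.
    import numpy as np
    from openai import OpenAI

    def retrieve(query, store, embedder, top_k=5):
        q = embedder.encode([query])[0]
        cosine = lambda v: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        ranked = sorted(store, key=lambda pair: cosine(pair[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]

    def answer(query, store, embedder):
        facts = retrieve(query, store, embedder)
        prompt = (
            "Here is a user query and some facts that can help you address it.\n"
            f"Query: {query}\n"
            "Facts:\n" + "\n".join(f"- {f}" for f in facts) +
            "\nPlease respond to the query based only on these facts."
        )
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; any chat LLM works here
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content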
All right, so that's how do-it-yourself RAG works. As you can see, it's quite complex: there are a lot of steps you have to take and a lot of systems you have to set up. There's a cost to each of these systems, you need your DevOps and machine learning engineers, and you have to maintain everything. And especially when you go from one or two or ten documents to a million documents in a truly enterprise-scale application, it becomes quite difficult to do. That's why at Vectara we've created RAG as a service. What we mean by that is that we've taken all the complexity and put it in a box, behind an API. All you have to do with Vectara is index the text or documents you want; we'll do all the extraction and the chunking and the vector store and everything I've just shown you. Then there's the query API: it does all the matching and retrieval and gives you back the response. This makes building RAG applications very easy and very fast. It's robust, it can scale up and down, it's secure, it's got all the encryption and everything you need for enterprise, and you don't have to build it yourself. That is really helpful: you can build applications faster and more robustly, and move them from an MVP or POC stage into production really quickly. So that's what Vectara does. And again, to recap, why is retrieval augmented generation useful? Well, you augment the LLM with your own data. If you have private data, which most enterprises do, then ChatGPT would not know about that data; that's the main reason you start. But it also reduces hallucination likelihood: the amount of hallucinations is smaller simply because you give the LLM the right facts to base its response on, so this retrieval step is really key. RAG outputs are also explainable, by which I mean they come with citations, so you increase user trust; we'll see that in a demo. The information stays private, and there's no training in RAG: you haven't seen any training or fine-tuning step in this architecture, so your information is safe and doesn't leak into any future LLM. And then lastly, and this is one of my favorite reasons to use RAG, it allows you to do per-person permissioning, or access control. For example, if some of my documents are from the HR department and I still want to use them in RAG, but only for HR people or people who are allowed to see the results, I can ask the retrieval engine in Vectara not to include those documents in the set of facts it retrieves unless the person issuing the query has permission for them. That lets you create responses that are customized to a certain level of permission, which is really, really helpful. Okay, so why Vectara? Again, just to recap, building RAG is more complex than it seems, for all the reasons I mentioned. Doing retrieval in a robust way is usually harder than you think, and supporting multiple languages is hard. With Vectara, you don't have to worry about a lot of expertise that's very specific to the LLM space, like prompt engineering and machine learning operations. We handle citations very well, and everything is ready for enterprise scale. Furthermore, security, privacy, and permissioning are all taken care of by the platform, and you get a lower total cost of ownership than if you build it yourself. One other thing I wanted to highlight is HHEM, the Hallucination Evaluation Model. This is a model that is very easy to use. It's open source, you can download it, and it allows you to take a set of facts and a response from an LLM and detect whether the response is a hallucination or not. What we see here is the leaderboard that ranks different LLMs based on their hallucination rate. It's really useful to know, first of all, that there are differences between LLMs, and then what those differences are. So that's HHEM.
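To give a rough idea of how HHEM can be wired into your own application, here is a minimal sketch assuming the open source model published on Hugging Face as vectara/hallucination_evaluation_model, loaded through the sentence-transformers CrossEncoder interface; the loading code and the threshold are illustrative, and the model card is the authoritative reference since the recommended usage can differ between HHEM versions.

    # Minimal sketch of scoring an LLM response against its source facts with HHEM.
    # Scores near 1 mean the response is consistent with the facts; scores near 0
    # suggest a hallucination.
    from sentence_transformers import CrossEncoder

    hhem = CrossEncoder("vectara/hallucination_evaluation_model")

    facts = "The company reported revenue of $10M in Q3 2023."   # retrieved facts
    response = "The company's Q3 2023 revenue was $12M."          # LLM answer to check

    score = hhem.predict([[facts, response]])[0]
    print(f"Factual consistency score: {score:.2f}")
    if score < 0.5:  # threshold is an illustrative choice
        print("Likely hallucination relative to the facts")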
And again, to summarize how you build a RAG application with Vectara: first, sign up for a free account. Then you need to ingest some data, and there are a lot of different ways to do that. You can of course use our APIs directly: there's a standard indexing API and a file upload API, and you can also upload files from our console, which you get access to once you have an account. There are other ways too: you can use vectara-ingest, an open source project we created to help you with ingestion and indexing of data, including a few cool crawlers that crawl the data for you. And then there are integrations with companies like Airbyte and Unstructured.io that can also be used for a kind of no-code ingestion. So take a look at those tools. Once you have the data in, you can build the UI on your own using the query API, point it at the corpus, and run queries. Or you can use some of the tools we have available: we have an open source project called vectara-answer that can help you build question answering apps; there's Create-UI, which lets you build a whole application end to end in Node.js and JavaScript; and then React-Search and React-Chatbot, components you can drop into your React application to simplify some of this building process. I encourage you to take a look at those and build your app with them. So now let me show you some of the apps we've built, just to demonstrate how to use this. The first is an example called Ask News. Let me click on it and go to the actual application. Here we've used vectara-ingest to crawl a bunch of news sources: BBC, NPR, CNN, et cetera. As you can see, this crawling happens every day; it picks up the new articles, crawls their content, and adds them to the corpus. Now, when I run a query, let's say "should AI be regulated?", you can see that it does the retrieval really quickly and gives you a response here that answers the question. Not only that, it has, as I said earlier, these citations. You can click on one of the citations and see that this part of the answer came from this article, based on this information, and you can click through to the URL to see where it came from and investigate further. That builds a lot of trust, which is very useful. I also wanted to mention that we have an option here to use different languages. For example, I can get the answer in German. Of course I don't speak German, so I won't be able to tell you if it's correct or not, but you can see that the answer gets translated into German, which is really helpful. And this happens even though all the source text is in English, so it knows how to match across languages really well. So that's an example of a question answering application. The next one I want to show you is actually the same application, Ask News, but now using HHEM. We created a little demo of how you could use it, although there are many other ways. This is Ask News, but if I ask the same question, what you see happening is that the response is generated in the same way, and then after it's generated there's an evaluation of the confidence using HHEM. This little step runs HHEM, in this case using Hugging Face inference for the model, and generates an evaluation; in this case it's high confidence, which means this response is not a hallucination relative to the facts. So this is one way you can use HHEM on your own in your application. Moving on: that was question answering, but I also mentioned chatbots quite a bit, so let's look at a chatbot example. Oh, I didn't mean to click that. Here's a chatbot, hosted on Hugging Face, again built with the Vectara APIs. What we did here is create another corpus, crawled about 100 to 150 pages from the IRS website, and put them in that corpus. Now I can ask some questions about it. For example, I can go in and ask: is my college tuition tax deductible? It will go into the corpus and try to answer the question based on the information crawled from the website. Full disclosure and warning: please don't use this website for anything other than demo purposes, and use your tax advisor to file your taxes. I just have to say that; it's only meant as a demo. Okay, but again, you get this answer. And the nice thing about the chatbot is that you can then ask a follow-up question. For example, here it said college tuition and related expenses may be tax deductible under certain conditions, so I can ask: what conditions would make it tax deductible? The idea is that it will know that "it" probably refers to college tuition, right? It has the context of the previous question and the previous answer, and you can see that it answers with that context, it already knows. So this is a chatbot. I also want to emphasize again that this is all open source.
If you go to this particular page, you can actually see the files and all the code, including how we run the query, the whole application, everything like that. So feel free to use that as a reference to build your own app if you like. And with that, thank you for listening. I wanted to highlight a few other things on my final slide. First, again, I encourage you to sign up for our free account. It's actually pretty generous and allows you to get started with up to 50 megabytes of text and 15,000 queries a month, which is quite a bit to get started and try it out. We have a lot of resources for you: our documentation, which is pretty thorough; a Discord channel for the community, so you can join that and ask questions of fellow developers who build with Vectara, and a lot of us from Vectara are there all the time to answer questions; a GitHub organization where you can find the open source projects I mentioned, like React-Search, vectara-ingest, vectara-answer, etcetera; and a set of example notebooks. This one, for example, shows how to use Vectara with LlamaIndex, but we have others, and you can look through that repository. And if you're a startup, I encourage you to take a look at our startups program. It's a very good way to get started with Vectara while getting additional support in the form of credits, customer support, and other things; a really good way to get started if you want to use Vectara to power your product. And that's it. Thanks again for listening, and I hope you have a good rest of your Conf42 conference.

Ofer Mendelevitch

Developer Relations @ Vectara



