Building Chatbots with Advanced Retrieval-Augmented Generation Techniques


Abstract

I will present how you can build realistic chatbots capable of handling authentic conversations using advanced retrieval augmented generation (RAG) techniques such as hybrid search, re-ranking, query generation, and semantic text chunking.

Summary

Zain Hasan will talk about advanced retrieval-augmented generation (RAG) techniques. He will use the example of a chatbot that functions as a super doctor. The same techniques can be used to build advanced chatbots for your own applications.
The average human doctor sees about 100,000 patients in their lifetime. A large language model, acting as an AI-powered medical assistant, could help a human doctor with the diagnoses they have to perform. It would need access to a lot of patient cases, and it would have to be able to propose new or similar diagnoses for patients.
The next approach uses a technique called retrieval-augmented generation (RAG). It works as follows: you take relevant medical information along with the new patient's information, give both to a language model, and the language model then proposes potential diagnoses.
To execute this RAG doctor approach successfully, you need a lot of data. We have about 1.4 million published medical articles and about 170,000 patient cases. A search functionality retrieves the articles or patient cases most relevant to a new patient, and you then give that limited set of retrieved documents to a language model to reason over.
A vector database needs to capture every single patient case or medical article as a vector. For this, we use an open source medical-domain embedding model called MedCPT. You can also use it to embed the unique information for a new patient.
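As a rough illustration of what similarity search over such vectors involves, here is a minimal sketch in plain Python. The three-dimensional vectors, the case ids, and the helper names `cosine_similarity` and `top_k` are made up for illustration. Real MedCPT embeddings are much longer, and a vector database such as Weaviate relies on approximate indexes rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus, k=3):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (assumption, not real data).
corpus = {
    "case-001": [0.9, 0.1, 0.0],
    "case-002": [0.0, 1.0, 0.0],
    "case-003": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

results = top_k(query, corpus, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

The same function would serve both collections in the talk: one call over the patient-case vectors and one over the article vectors.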
Meditron 70B is a fine-tuned version of Llama 2 from Meta. It was also supervised fine-tuned on medical data. With this type of stack, you now have a data set, a vector database that searches over it, and a language model that generates potential diagnoses. Can we do even better? How can we improve on top of this?
A technique called query rewriting lets you rewrite the query that is sent to the vector database. Not only that, but we also want to rewrite how we prompt the language model to solve the problem. The DSPy framework lets you optimize and generate prompts for a large language model by iterative search.
The idea behind hybrid search is that medical data contains a lot of very specific keywords that you might be interested in. In hybrid search, you perform both vector search and keyword search, and then you mix the results together.
The third advanced retrieval-augmented generation technique is called autocut. The idea behind it is to cut off irrelevant results from the vector database. This gives the language model better information and sets it up for successful diagnostic generation later on.
Autocut relies on scores that show how similar each retrieved patient case is to your input patient case. The higher the score, the more similar the case. Implementing this is just one line of code.
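To make the cutting idea concrete, here is a minimal sketch of one possible heuristic: keep results until the score drops by much more than the average drop. This is an assumption made for illustration, not Weaviate's actual autocut algorithm, and the names `autocut`, `max_groups`, and `jump_factor` as well as the scores are hypothetical.

```python
def autocut(scored, max_groups=1, jump_factor=2.0):
    """Keep the first `max_groups` clusters of results.

    `scored` is a list of (doc_id, score) pairs sorted by descending score.
    A new cluster starts wherever the drop between neighbours exceeds
    `jump_factor` times the mean drop (a simplified, made-up rule).
    """
    if len(scored) < 2:
        return list(scored)
    gaps = [scored[i][1] - scored[i + 1][1] for i in range(len(scored) - 1)]
    mean_gap = sum(gaps) / len(gaps)
    jumps_seen = 0
    for i, gap in enumerate(gaps):
        if mean_gap > 0 and gap > jump_factor * mean_gap:
            jumps_seen += 1
            if jumps_seen == max_groups:
                return scored[: i + 1]
    return list(scored)


# Six retrieved cases with made-up similarity scores: four close, two far.
retrieved = [
    ("case-a", 0.92),
    ("case-b", 0.90),
    ("case-c", 0.89),
    ("case-d", 0.87),
    ("case-e", 0.55),
    ("case-f", 0.52),
]

kept = autocut(retrieved, max_groups=1)
print([doc_id for doc_id, _ in kept])
```

With these scores the only large drop is between the fourth and fifth results, so the first cluster of four cases is kept and the last two are discarded.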
The idea behind re-ranking is that you have already retrieved the other patient cases most similar to the new patient. Now you use a much more powerful and heavier machine learning model to re-rank them. This again increases the quality of the output that we eventually pass to our large language model.
The AI-based super doctor assistant has access to more knowledge than pretty much any doctor has experience of. It can retrieve from more than any human doctor can. It can also propose plausible diagnoses and cite its sources. Everything I've talked about in this entire talk is open source.
All right, so I wanted to thank everybody here. If you're interested in this, check us out. There's a QR code that you can use here as well. Feel free to connect with me on either Twitter or LinkedIn. I would be more than happy to talk to all of you.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hey, everybody. My name is Zain Hasan. I'm with Weaviate, where we build an open source vector database, and I'm very excited for this talk. In this talk, I'm going to be talking about advanced retrieval-augmented generation techniques, advanced RAG techniques, and the example I'm going to use to explain them is a chatbot that functions as a super doctor. I'll talk about what it does in just a second, but I'm going to use this example to explain a lot of the advanced RAG techniques that you could use to build advanced chatbots for your own applications as well. Okay, so let's dive right in.
Let's talk a little bit about the average human doctor. The average human doctor studies for at least seven years after undergrad. In North America, that's four years of studying during undergrad, and then seven-plus years after that. Once they're practicing, they see about 100,000 patients in their entire lifetime. If you think about what a doctor has to do when you go see them, you present some of the problems and symptoms you're having, and then they use their experience of similar patients they've seen, or knowledge they've gained over their career or during medical school, to present plausible diagnoses for you. So what I wanted to do was see if we could build a large language model, an AI-powered medical doctor or assistant, to help a human doctor with the diagnoses they have to perform.
If we unpack that statement, what would this AI-powered super doctor have to be able to do? First, it would need to have access to a lot of patient cases. It would need to have access to a lot of data. Not only that, but it would also need to be able to search over this data. If you have 200,000 or 300,000 patients' worth of data, it's not possible for it to reason over all of it.
You have to give it the top five or the top ten patients that are relevant to any new patient that you want to diagnose, right? So it has to be able to search for relevant patient cases. And not only that, but if you give it access to medical literature, publications, and articles, it has to be able to search over those and use those as well.
Because this is a medical domain application, it has to be able to cite its sources. It's not enough for it to just propose a diagnosis. It has to explain why it proposed that diagnosis. If it's using a couple of patients to learn about a disease and then diagnosing a new patient, it has to be able to cite that: these are the two patients whose histories I read, I saw how they were diagnosed, and I'm using those to propose a new diagnosis. So the system has to be fully explainable, otherwise it's not going to be used in the medical industry.
Lastly, obviously, it has to reason over those technical medical concepts. It has to be able to read over historical patient cases, published articles, and medical writing, and then it has to be able to propose new or similar diagnoses for patients. And it has to do all of this in real time, right? A patient comes to this AI-powered medical doctor and tells it their problems, and it has to propose diagnoses, or the next step, in real time. So we only have a few seconds to do all of this.
So let's dive in and see how I solved this problem. The simplest approach you could have is to take the patient information over here, feed it into a large language model, and ask it to provide a cheat sheet or a list of potential diagnoses for this patient. This is literally the simplest thing that you could do: open ChatGPT, type in the patient information, and say, what could potentially be wrong with this patient? And here you're leaving a lot of things to chance, right?
If the language model doesn't know anything about this particular patient's disease, then you're going to get irrelevant output. So there are definitely things we can do to improve on this. If we have a large language model that's fine-tuned on medical data, so if we improve the language model for this particular domain, we'll definitely get better results. And if we prompt the language model properly, we might get better results as well. Previous work has shown the difference between somebody who knows how to use a language model and somebody who doesn't, and by being able to use a language model, what I mean here is prompt engineering. If you know how to prompt engineer properly, you can get these things to do really wondrous things, right? So that's another approach that would make this simple framework quite successful. But I want to see how we can get even better than this.
The next approach uses a technique called RAG, retrieval-augmented generation, so I call this the RAG doctor approach. It works as follows. When a new patient comes to your office, you use this AI chatbot, you give it the new patient's information, and then you retrieve relevant cases. This vector database has all of the medical history stored inside of it. You query it with the new patient information, and out comes relevant medical information over here. Then you take that relevant medical information along with the new patient information, you give it to a language model, and the language model then has to propose potential diagnoses. This is called a RAG approach because you're retrieving a bunch of historical cases that might be similar to this new patient, and then you're passing them to the language model.
So the language model gets to read over the relevant cases. This is almost a research phase, where it gets to study previous patient cases, how they were diagnosed, and what happened to them. Then it has to reason over them and propose a diagnosis for the new patient that we're looking at. And the great thing about this RAG approach is that you can now cite sources as a result, right? The language model can now say: based on the information that you provided from this retrieval task, these are the diagnoses I'm proposing for this patient. So it becomes an explainable system.
So let's dive a little deeper into this RAG doctor approach. In order to successfully execute it, you need a lot of data. The first piece of this puzzle is a patient data set that is open source. You can have a look at it here. The PMC-Patients data set is approximately 170,000 patient summaries. These have been extracted from medical literature, and they talk about the problem that the patient had, how they were diagnosed, what the solution was, and so on. Each one is a paragraph or more that's related to one patient, their diagnosis, and everything that they went through in that diagnosis, so it's a pretty complete summary of that patient. It also has about one and a half million published medical articles and abstracts. This is all taken from an external data set that's openly available. If you want to have a look, you can click on this link here. It's all publicly released information that I used, and it became the knowledge base that my language model could retrieve from and reason over. So technically, the language model had access to all of these 1.4 million articles and 170,000 patient cases to learn from. This is what that data set looks like, and it's going to form the basis around which we build our AI-powered super doctor.
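The retrieve-then-generate flow of the RAG doctor approach, including the requirement to cite sources, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `build_prompt`, `rag_answer`, `fake_retrieve`, and `fake_generate` are hypothetical names, the two records are invented placeholders, and in a real system `retrieve` would query the vector database and `generate` would call a language model.

```python
def build_prompt(patient_note, retrieved):
    """Combine retrieved documents and the new patient note into one prompt."""
    lines = [
        "You are assisting a doctor. Use only the sources below.",
        "Cite the id of every source you rely on.",
        "",
        "Sources:",
    ]
    for doc in retrieved:
        lines.append(f"[{doc['id']}] {doc['text']}")
    lines += ["", f"New patient: {patient_note}", "", "Plausible diagnoses with citations:"]
    return "\n".join(lines)


def rag_answer(patient_note, retrieve, generate, k=3):
    """Retrieve k documents, build the prompt, return (answer, source ids)."""
    retrieved = retrieve(patient_note, k)
    prompt = build_prompt(patient_note, retrieved)
    return generate(prompt), [doc["id"] for doc in retrieved]


# Stand-ins so the sketch runs without a database or a model (made-up records).
def fake_retrieve(query, k):
    cases = [
        {"id": "case-17", "text": "Shoulder pain with thumb numbness; nerve root compression suspected."},
        {"id": "article-4", "text": "Review of nerve root compression symptoms."},
    ]
    return cases[:k]


def fake_generate(prompt):
    return "Possible nerve root compression [case-17]"


answer, sources = rag_answer(
    "Left shoulder hurts, numb thumb and index finger.", fake_retrieve, fake_generate
)
print(answer)
print(sources)
```

Returning the source ids alongside the answer is what lets the doctor check which retrieved cases a proposed diagnosis was based on.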
The next thing that we need, a very critical ingredient to this entire setup, is the vector database. Vector databases essentially give you the search functionality over all of this data. We've got about 1.4 million published medical articles and about 170,000 patient cases. The large language model can't read and reason over all of these. You have to use a search functionality to retrieve the articles or patient cases most relevant to a new patient, and then you give those limited retrieved articles to a language model to reason over. Vector databases can potentially store billions and billions of documents, medical articles, or patient cases. Then, when a new patient walks into your office, you can ask them about the problem they're having and use their unique information as a query to the vector database. The vector database performs a similarity search, where you say to it: here's the information for this new patient I have. I want you to retrieve for me the top five most relevant previous patients that you have in your repository, from the 170,000-case data set that we talked about, and potentially five medical literature articles that are related to this patient's case. So the vector database performs this similarity search for you and gives you the retrieved articles over here.
To serve this purpose we're going to use an open source vector database called Weaviate, and it's going to scale up nicely for us. If you want to learn more about it, I've linked some of the docs and the website over here. You can look into that as well.
The usage of the vector database looks something like this in practice. Let's say a patient comes to you and tells you that their left shoulder hurts and they have numbness in their thumb and index finger.
So you take that query and pass it into the vector database, and we'll talk about how this works in a second. The vector database retrieves for you the top three other patient cases that it has in its knowledge base. So from those 170,000 it retrieves the three most similar cases, and if you ask it, it might even retrieve the top three published medical articles that are relevant to diagnosing this patient. Now that you have the top six relevant data points, these can be fed into a language model, and we'll talk about that in a second.
Okay, so the next ingredient that we need here is an embedding model, right? A vector database needs to be able to capture every single patient case or medical article as a vector. No matter what type of data you have in your vector database, it needs to be turned into a vector embedding. Here I've represented a vector embedding. A vector embedding is just a large array of numbers, and it captures the meaning behind patient cases or articles. In this case, all the 170,000 patient cases and 1.4 million articles are going to be turned into vectors, so you'll have 1.4 million vectors for the articles and 170,000 vectors for the unique patient cases that you want to store in your vector database. For this, we need an embedding model, a model that generates these vectors and that understands the medical domain. We're going to use an open source medical-domain embedding model called MedCPT, and you can access it at the link over here. This one generates vectors for short texts, so we can use it on the unique information for this patient. Like I showed on the other slide: I have pain in my left shoulder and numbness in my thumb and index finger. You can take that short description and turn it into a vector.
Not only that, but you also have an article encoder, another embedding model, which can embed very large texts. So if you have a large abstract for a medical article, or a long historical patient description, you can use this type of embedding model to generate vectors for your large data set. Both of these are going to be very critical. This is also from an open source paper, and the code is released as well, so you can have a look at that.
Okay, so the next piece of the puzzle. We've talked about the vector database. We've talked about how the data gets turned into vectors. We're now going to talk about the large language model. For these experiments, I mainly used ChatGPT and GPT-4 as the underlying model. But because everything else I'm talking about here is open source, I wanted to present an open source alternative so that you could build this whole thing from scratch and run it in a private environment, and that alternative is a biomedical domain large language model. One of the more powerful open source biomedical large language models is a model called Meditron. Meditron is a large language model that's fine-tuned on a lot of medical domain data, anything from medical abstracts to patient histories and things like this. So this is where Meditron 70B comes in. Meditron 70B is a fine-tuned version of Llama 2 from Meta. What they did is they basically took around 50 billion tokens, 50 billion word pieces, from the medical domain, and these come from published articles, patient summaries, and patient-doctor interactions. They trained it on these 50 billion tokens on top of the training that Llama 2 got, and then they tried it on medical reasoning tasks.
And this fine-tuned version outperforms the base Llama 2 70B, as well as the GPT-3.5 model, which is significantly larger than the 70 billion parameters that Meditron 70B has. So this is an open source alternative to GPT-4 that you could potentially use to build this project. And not only was it pre-trained on 50 billion tokens, it was also supervised fine-tuned on medical data.
Okay, so that completes our entire RAG stack, right? With this type of stack, you now have a data set, you have a vector database that can search over that data set, and you have a large language model that can take those relevant patient cases as well as the new patient case and generate potential diagnoses for you. The question now is, can we do even better? How can we improve on top of this? The answer is that we need to dive into each of the pieces I've talked about, the vector database, the retrieval, the generation, see what the problems are with them, and see if we can apply an advanced version to improve the pipeline there. So now I'll propose about three to four more advanced techniques, and I'll explain the intuition behind each of them.
The first thing we're going to talk about is a technique called query rewriting. The main idea behind query rewriting is that if a new patient comes to you and describes all of the problems they're having, you might not know the best way to search for relevant articles in a vector database. Because you have to write the query, you might not write it appropriately, and you might get irrelevant results from the vector database. So the idea here is that we want to rewrite the query optimally for a vector database to retrieve the most relevant articles. Not only that, but we also want to rewrite how we prompt the language model to solve the problem.
We might not trust our ability to prompt engineer properly. So we want to rewrite both the query that goes to the vector database to search our data set and the prompt we give to the language model, and there are solutions for both of these steps. The first is a query rewriting step, which lets you go in and rewrite a query to the vector database. And the DSPy framework, which is also open source, lets you optimally generate the prompt that is given to a language model to ensure that it gives you good results.
So the first thing we're going to do is rewrite the query to the vector database. Initially, this was the query that we sent to the vector database: my left shoulder hurts and I have numbness in my thumb and index finger. This is what the patient told you, so that's what you try to retrieve articles with, and this is what that framework looks like, right? We're now going to modify this and pass the query through a language model. The language model's job is to rewrite the query so that our vector database can understand it better. So maybe it rewrites the query into this version, where it kind of chunks it up into smaller pieces and says: pain left shoulder, numbness thumb, numbness index finger. This is less understandable to a human, but maybe it's more understandable to a vector search engine like the Weaviate vector database. So it optimizes the query to be understood by the vector database, and this is going to help us retrieve more relevant cases.
Not only that, but we're also going to rewrite the prompt that we give to our language model. DSPy is a framework that allows you to optimize and generate prompts for a large language model by iterative search. So we can also use this open source framework to identify the best way to prompt a language model to generate a diagnosis. Okay, so that's our first technique.
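The query rewriting step described above can be sketched as a thin wrapper around a language model call. This is only an illustration under assumptions: `rewrite_query` and `REWRITE_INSTRUCTION` are hypothetical names, `llm` is any callable that maps a prompt string to a completion string, and `fake_llm` is a canned stand-in so the example runs offline. It does not use DSPy, and the instruction wording is not taken from the talk.

```python
REWRITE_INSTRUCTION = (
    "Rewrite the patient's description as short keyword phrases "
    "suitable for searching a medical case database. "
    "Return the phrases separated by semicolons."
)


def rewrite_query(patient_text, llm):
    """Ask a language model to turn free text into a search-friendly query."""
    prompt = f"{REWRITE_INSTRUCTION}\n\nPatient: {patient_text}\nRewritten query:"
    return llm(prompt).strip()


def fake_llm(prompt):
    # Canned reply standing in for a real model call (assumption).
    return " pain left shoulder; numbness thumb; numbness index finger \n"


rewritten = rewrite_query(
    "My left shoulder hurts and I have numbness in my thumb and index finger.",
    fake_llm,
)
print(rewritten)
```

The rewritten string, rather than the patient's original sentence, is what would then be sent to the vector database.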
We admit that we don't know how to query a vector database properly and we don't know how to prompt a language model appropriately, so we use language models to solve those tasks for us as well.
Okay, so the next technique I'm going to talk about is called hybrid search. The idea behind hybrid search is that if you are searching over medical data, medical data has a lot of very specific keywords that you might be interested in. They could be names of diseases, they could be names of medicines, they could be chemical compounds. There are very specific keywords that you want to respect and use in the field of medicine. A lot of the search that we've been talking about, how a vector database retrieves and knows which articles out of these millions are relevant for this patient, is based on similarity: how similar is what you tell me about this patient to the patient cases I have here? But for the medical domain, this might not be the best type of search. You might want to search over the words in those patient cases, right? If a patient was given a particular type of drug and this patient says that they're also taking that drug, then there's a match. I don't necessarily need to understand exactly what that drug is. If I can do simple keyword matching, that might be good enough.
So the idea behind hybrid search is, why don't we mix vector search and keyword search and do a little bit of both? We want to search not just over the meaning and the problems that this patient is having, but also over the keywords that are used in their description. So maybe numbness, or the type of medication they're on, or index finger, things like this that match well with medical literature and medical lingo. In hybrid search, you perform both vector search and keyword search, and then you mix the results together so that you get the best of both worlds, and you can re-rank them.
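The mixing step described above can be sketched as a weighted combination of normalized scores. This is a simplified illustration, not the fusion that any particular vector database implements: `normalize`, `hybrid_fuse`, the `alpha` weight, and all scores below are assumptions made for the example.

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_fuse(vector_scores, keyword_scores, alpha=0.5):
    """Blend vector and keyword scores; alpha=1.0 means vector search only."""
    v = normalize(vector_scores)
    kw = normalize(keyword_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(v) | set(kw)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Made-up scores: similarity from vector search, term-match score from keyword search.
vector_scores = {"case-a": 0.91, "case-b": 0.88, "case-c": 0.60}
keyword_scores = {"case-b": 7.2, "case-c": 3.1, "case-d": 1.0}

ranked = hybrid_fuse(vector_scores, keyword_scores, alpha=0.5)
print([doc_id for doc_id, _ in ranked])
```

In this toy data, case-b wins because it scores well on both meaning and keywords, while case-a matches only on meaning and case-d only on keywords.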
If we're talking about how to implement this in Weaviate, it's literally one line of code that you have to change, and you go from doing pure vector search to a hybrid search of vector and keyword.
The third approach that we're going to use here, the third advanced retrieval-augmented generation technique, is called autocut. The idea behind autocut is that if you do a search and you get irrelevant results from the vector database, then rather than give those to the language model and confuse it further, you want to cut those results off, right? You just want to throw them away. How you can do this is you retrieve relevant articles from the vector database, and each article has some sort of number for how similar it is to your patient information. Then you look over these and you say, okay, the top three returned results are very similar, but the fourth and fifth are very dissimilar, right? They're very far away compared to the top three results. So you automatically cut them out and you never pass them over to your language model. If you do this automatic cutting of irrelevant results, it's less likely that your language model gets irrelevant results, and it's less likely that it hallucinates as well. So you're giving it better information to set it up for successful diagnostic generation later on.
I want to dive into how this actually works. Let's say you start off with the vector database query that we talked about, now rewritten: pain left shoulder, numbness thumb, numbness index finger. That goes into your vector database, and let's say the vector database gives you these six patient cases. Not only does it give you these six patient cases, it also gives you a number for how similar each of them is to your input patient case. Each one is going to have a score.
The higher the score, the more similar the case, the better. But notice one thing about these scores. The top four returned patient cases are quite similar, right? They have high similarity scores. But for the fifth and sixth ones there's a big jump in similarity. So autocut comes in and says, I'm automatically going to cut these two cases, because they're quite different from the other four cases, and those four are quite similar to the thing that you're interested in. That's why it's called autocut. And again, if you want to implement this, it's literally just one line of code in your Weaviate vector database, where you say how many chunks of similar things you're interested in. If you're interested in one, then it gives you the one big chunk of four articles here, and you only keep those. The other two are disposed of in this example.
Okay, the next thing I'm going to talk about is called re-ranking. The idea behind re-ranking is that you've gone through this search process and you've retrieved the other patient cases most similar to this new patient. What you want to do now is have a closer look at all of the patient cases that have been retrieved, compare them one at a time to the input patient information, and ask: does this really deserve the top spot? Should I re-rank it to be lower, or should I re-rank something that's at the bottom to be a little higher? This is where you get to spend more compute and more time comparing the new patient to all of the retrieved patients that the vector database spat out. What you do to make this successful is initially tell the vector database: instead of just giving me the top three most similar cases, give me 20, 30, 40 cases that you think are similar. And then you compare this new patient to each one of those 20, 30, 40 cases that the vector database thinks are similar.
But now you use a much more powerful and heavier machine learning model to re-rank them, based on what it thinks of these 30 or 40 articles compared to this patient information. This again increases the quality of the output that we're eventually going to pass to our large language model.
So let's have a look at how this actually works. We go into our vector database, we take the query that we've already been passing in, and now let's say we do this over-fetching. We ask it to retrieve for us the top ten or 15 most similar patient cases. It gives us this long laundry list of similar patient cases along with the similarity scores for all of them. Now we ignore these similarity scores and go to another, more powerful model. The job of that model is to see how similar each one of these patient cases is to the original query, and it has the opportunity to re-rank, to promote or demote some of these patient cases if it thinks they're more similar or less similar, using its more advanced re-ranking and search functionality. So you get this re-ranking step where now maybe the more powerful model thought the third most similar patient case is actually a lot more similar, so it re-ranks it to be number one, right? It takes the first one and down-ranks it to spot three over here, and it promotes number five here to number two. The idea is that these re-ranked patient cases are a lot more similar to the initial query than the initially retrieved cases, and the new score is a lot more reliable because a more powerful model generated it for you. Now we can take these results, maybe the top five or top six here, and give them to a language model. So it has better-quality patient cases that are relevant to this new patient, and it can generate better diagnoses.
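The over-fetch and re-rank step described above can be sketched as follows. This is a minimal illustration under assumptions: `rerank` and `token_overlap` are hypothetical names, the candidate texts are invented, and the word-overlap scorer is only a cheap stand-in for the heavier re-ranking model the talk refers to, which would score each query and document pair with a learned model.

```python
def token_overlap(query, text):
    """Toy pairwise scorer: Jaccard overlap of lowercase word sets."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def rerank(query, candidates, score_pair, top_n=5):
    """Rescore every over-fetched candidate against the query, keep the best."""
    rescored = [(c["id"], score_pair(query, c["text"])) for c in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


# Candidates in the order the first-stage search returned them (made-up texts).
candidates = [
    {"id": "case-1", "text": "chronic knee pain after running"},
    {"id": "case-2", "text": "left shoulder pain with thumb numbness"},
    {"id": "case-3", "text": "numbness in index finger and thumb with left shoulder pain"},
]
query = "left shoulder pain numbness thumb index finger"

reranked = rerank(query, candidates, token_overlap, top_n=2)
print([doc_id for doc_id, _ in reranked])
```

Here the third candidate is promoted to the top spot and the first is dropped, mirroring the promote and demote behaviour described in the talk. Because `score_pair` is passed in, a stronger model can replace the toy scorer without changing `rerank`.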
To implement this in a vector database like Weaviate, this is also just one line of code, where you say, I want to re-rank based on a particular property, and you pass in the query as a string. It takes everything that was returned from this semantic near-text search, re-ranks it, and then passes it out to the language model.
Okay, so we talked about the retrieval-augmented generation stack, the RAG doctor approach, and then I proposed four improvements, more advanced retrieval-augmented generation techniques that you can use to improve the quality of the super doctor that we've created here. So let's do a quick summary of everything that we've accomplished. We've basically created an AI-based super doctor assistant. This super doctor assistant has access to more knowledge than pretty much any doctor has experience of, so it can retrieve from more than any human doctor can. It can also propose plausible diagnoses, and not only that, but it can also cite its sources. It can tell you why it's proposing that somebody has a viral infection or heart disease, based on historical patient cases that it has read and reasoned over. So it can learn from previous patterns in patient health and use those patterns to propose diagnoses for new patients. And to top it all off, it does this in real time. It does this in a few seconds, because the vector database search component takes a few milliseconds, and the generation portion takes maybe half a second to a second. So this can be done in real time. You can literally have a patient walk into your office, the doctor takes their symptoms, understands them, and passes them off to the super doctor chatbot assistant. The super doctor proposes some diagnoses, the doctor thinks about some diagnoses, and then these come together into a much more well-informed diagnosis for that patient.
And so that comes full circle. We've talked about the entire stack of how you would build this technology. And the plus point here is that everything that I've talked about in this entire talk is open source: the language model, the vector database, the retriever, the re-ranker, the autocut. Every single thing I've linked and sourced is fully open source. So if you wanted, you could build this tomorrow as well.
All right, so I wanted to thank everybody here. I was really excited to give this talk. If you're interested in this, check us out. There's a QR code that you can use here as well. And if you have any questions, feel free to connect with me on either Twitter or LinkedIn. I would be more than happy to talk to all of you. I'm also active on the Weaviate community Slack, so join us there, drop by. Thank you very much, and I'll leave you with this last QR code here. If you want to try this out, give it a go. Weaviate is open source, and we also provide free hosted services. Thanks, everybody, and I hope you're enjoying the conference.

Conf42 Large Language Models (LLMs) 2024 - Online

April 11 2024

Zain Hasan

Senior Developer Advocate @ Weaviate
