Conf42 Machine Learning 2025 - Online

- premiere 5PM GMT

Hallucination by Design: How Embedding Models Misunderstand Language


Abstract

Join me to discover how embedding models misunderstand human language. I'll reveal test results where models think "take medication" and "don't take medication" are identical. Learn the patterns and techniques to make GenAI systems more reliable. If you're using LLMs, you can't afford to miss this.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. I am Ritesh Modi, a principal engineer at Microsoft, and I am going to talk to you about removing hallucinations from your Gen AI applications from an embeddings perspective.

But what are embeddings? Computers don't understand the natural language we speak and type; they understand numbers. Embedding is the process of converting words and sentences into numerical values, and what makes those values useful is that they carry semantic meaning. On the screen there is a man and a woman, a king and a queen, and a dog and a puppy. Within this X-Y coordinate space the dog and the puppy sit closer to each other than the dog and the woman do, and in the same way the relative positions of man, woman, king and queen reflect how related those words are. Converting a sentence into numbers while preserving its semantic meaning in those numbers is what embedding is all about.

The topic today is that we shouldn't trust embeddings from a foundation model without experimentation and evaluation. We have been using large language models and small language models from various vendors — models from Hugging Face, OpenAI and other providers — and we have been converting our sentences into numbers. Should we trust the results? This talk is all about that.

Before we get into what we should do to remove hallucinations from the semantic meaning those numbers carry, let's understand the embedding process. Here are two sentences: "The cat jumped over fence" and "The cat jumped well". In the embedding process, each word maps to a number, also called an input ID. "The cat jumped over fence" gets input IDs 1, 2, 3, 4, 5, and "well", the one additional word, gets input ID 6. So when you send a sentence to a model it is converted into those numbers: 1, 2, 3, 4, 5 for the first sentence and 1, 2, 3, 6 for the second. The input then goes through the embedding step, which generates numbers for you.

Interestingly, you will notice there are five rows for the first sentence, one per word, and four rows for the second. Another interesting fact is that each row has four values. Why four? Because this embedding model generates four numerical values per word, which is known as its dimensionality. Dimensionality is simply the number of values in each output row, and it can range from roughly 32 to 4096. Higher is generally better: more dimensions means more nuance, and more of the word's semantic meaning can be captured in the embedding. Lower dimensions capture less semantic value, but they are faster and need less storage.

Another important concept is the max sequence length. Remember we sent "The cat jumped over fence" and "The cat jumped well". If the max sequence length is 10, we can send ten tokens at a time to this embedding model, and it will generate output according to its dimensionality.
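The on-screen code isn't included in the transcript, but a minimal sketch along these lines, assuming the all-mpnet-base-v2 checkpoint from Hugging Face (not necessarily the speaker's exact model), shows how dimensionality and max sequence length can be inspected. Note that sentence-transformers returns one pooled vector per sentence rather than the per-word rows used in the slide's toy example.

```python
# Minimal sketch (not the speaker's exact code): inspecting dimensionality
# and max sequence length of a sentence-transformers model.
from sentence_transformers import SentenceTransformer

# Any Hugging Face sentence-embedding checkpoint works; this one is assumed.
model = SentenceTransformer("all-mpnet-base-v2")

print(model.get_sentence_embedding_dimension())  # e.g. 768 values per embedding
print(model.max_seq_length)                      # tokens beyond this are truncated

sentences = ["The cat jumped over fence", "The cat jumped well"]
embeddings = model.encode(sentences)             # numpy array of shape (2, 768)
print(embeddings.shape)
```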
But if we send a sentence of twelve words, the model will only use the first ten and discard the remaining two. So we have to know the max sequence length of the model we're using. Strictly speaking, the words are converted into tokens, and it is the tokens that are counted against the max sequence length. Sentence models typically allow 128 to 512 tokens, whereas document models can handle from around 1,000 up to close to 10,000. Anything beyond the limit gets cut off; anything shorter gets padded and is fully used for the semantic meaning.

Another very important factor is the vocabulary and its size. Why does it matter? If the words we provide, like "cat" and "jump", are part of the vocabulary, then the model has been trained on those words and preserves their semantic meaning. But if a word is out of vocabulary — the model has never seen it — the model cannot assign it a semantic meaning. So it's good to have a rich vocabulary, which means the model has been trained on a wide variety of content. Size matters too: the bigger the vocabulary, the more training time is needed and the slower the model can be, because it has to look up the input IDs, convert them into tokens and then generate the embeddings. A smaller vocabulary means a faster model and a faster embedding process; a richer vocabulary means better-preserved semantic meaning.

What are the use cases for embeddings? Embeddings have become very popular for Gen AI applications, especially since the advent of retrieval augmented generation, the RAG pattern. In RAG we search, and sometimes keyword search is enough, but often we want to search by semantic meaning: the user may ask for something that is not available word-for-word in our data store, so we have to infer the intent of the query or utterance to find relevant results. This matters because these days almost every Gen AI application uses this technique. Then there are recommendation systems: if somebody searches for a cellular phone, you would want a mobile phone — or other electronics reflecting similar choices — to appear in the recommendations, while being aware that "cellular phone" could also refer to an automotive device. There is natural language processing, and plenty of other use cases as well.

So how do we compare embeddings? We now have numbers stored in our database and numbers coming in from a query. How do we compare numbers with numbers? Using various metrics, some of which are shown here.
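The slide itself isn't reproduced in the transcript, but the four metrics it mentions can be written out in a few lines. This is a generic sketch, not the talk's code:

```python
# Sketch of the comparison metrics mentioned: cosine similarity, Euclidean
# distance, Manhattan distance, and dot product, for two embedding vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based similarity in [-1, 1]; 1 means the vectors point the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance; lower means more similar.
    return float(np.linalg.norm(a - b))

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Sum of absolute coordinate differences; lower means more similar.
    return float(np.sum(np.abs(a - b)))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # Unnormalised similarity; sensitive to vector magnitude.
    return float(np.dot(a, b))
```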
The most famous one is cosine similarity. Remember the man and the woman in the first slide: if we draw straight lines from the origin to those two points, what is the angle between the lines? The smaller the angle, the more similar the items; the larger the angle, the less similar. The score ranges from -1 to 1. Similarly there are Euclidean distance, Manhattan distance and the dot product, and I will leave those for you to explore, but cosine similarity is the leading way of doing comparisons.

Let's see a small example of how we can generate embeddings using Sentence Transformers (the same idea applies to OpenAI's embedding APIs). Sentence Transformers is a Python package you can install and use to produce embeddings. In the code here I'm using a simple MPNet-based model available on Hugging Face, and I have two sentences: "What's the weather like in New York today?" and "Can you tell me the current weather in New York?", and I'm generating embeddings for both of them. I also have another example where I simply ask for the embeddings of "American pizza is one of the nation's greatest cultural exports". The result is an embedding with quite a few numbers. Why that many? These are the dimensions: this MS MARCO model produces 768 dimensions. Remember the rows I talked about earlier, with four values each in that first example; here there are 768, which is quite rich in semantic meaning. Those numbers represent what the sentence means, not an exact keyword match.

Now, to compare embeddings: I take those two weather sentences, use the Sentence Transformers encode method on each, and then apply the cosine similarity measure I mentioned at the beginning. When I do that, I get about 88% similarity between the two sentences. The words are not in the same order and the sentences don't even share all the same words, yet the model still finds them highly similar. That's the beauty of embeddings: you don't have to do an exact keyword or entity-based search; the semantic meaning of the query is what gets compared with the stored sentences. There are other metrics shown here as well, but we can ignore them for now and concentrate on cosine similarity. So that's how you compare embeddings.
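The demo described above is roughly the following; the model name is an assumption and the 0.88 score is the figure the speaker reports, not a guaranteed output:

```python
# Sketch of the demo described above: encode two paraphrases and compare
# them with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # assumed checkpoint

s1 = "What's the weather like in New York today?"
s2 = "Can you tell me the current weather in New York?"

emb1, emb2 = model.encode([s1, s2])
score = util.cos_sim(emb1, emb2).item()
print(f"cosine similarity: {score:.2f}")  # the talk reports roughly 0.88
```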
Now I can generate embeddings, but let's see how well we understand them. I have a small quiz for you; try it mentally or write your answers down. You saw that cosine similarity ranges from -1 to 1. Here are four pairs of sentences. Just by reading them — no code required — estimate how close each pair is, that is, what the cosine similarity should be.

Pair one: "The treatment was completely ineffective against the disease" versus "The treatment was absolutely effective against the disease". What does the human mind say — are these similar? No way, they are complete opposites, so the cosine similarity should be quite low because the angle between them should be large. Pair two: "Place the specimen in the refrigerator at exactly four degrees centigrade" versus "The sample must be stored at precisely 4°C in the cooling unit" (the word "degrees" in one, the symbol in the other). Pair three: "Results showed statistical significance" versus "The findings indicated a significant effect". Pair four: "The patient shows hypertension" versus "The patient shows hypotension". Think about how close each pair really is, and then let me show you what the model says.

Looking at the results I already ran, pair three scored 83% similarity, which is great because the two sentences mean almost the same thing. The finding that absolutely puzzled me was pair one: the sentences are complete opposites, yet I got 96% similarity. That means if I'm building a system whose documents contain content like the first sentence and a user sends a query like the second, the system returns the complete opposite as an answer. So when we implement RAG-based solutions, build our own copilot, or build anything based on search, how can we rely on an out-of-the-box embedding model like MPNet in a scenario like this? That is the crux of today's topic: how do we remove these hallucinations so we can work with this kind of sentence in our data?

Why is this happening? Think about it. Why doesn't a model like MPNet capture the difference between those two sentences and return a low cosine similarity? A few reasons are listed here. At the top: some models give more weight to the words they consider important, and if those words appear in both sentences, they report a high similarity. A model may also compare the words shared by both sentences and, if the overlap is high, report high similarity regardless of word order. It may also over-weight sentiment: if "enjoyed" appears in both sentences, it concludes they are very similar and misses that the second sentence actually says "not enjoyed". There are quite a few reasons why this happens, and it's our job to experiment with the base model, find out whether it hallucinates, and if it does, fix it.

What are the scenarios where I have found embedding models hallucinating? There are quite a few — 38 of them, starting with capitalization — and I'll show you some of these examples.
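As a concrete illustration, the quiz above could be scripted like this; the model name is an assumption and the scores reported in the talk (0.96 for the negation pair, 0.83 for the paraphrase pair) will vary by model:

```python
# Sketch reproducing the quiz: score each pair and print the cosine
# similarity so the surprising cases (e.g. negation) stand out.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # assumed base model

pairs = [
    ("The treatment was completely ineffective against the disease",
     "The treatment was absolutely effective against the disease"),
    ("Place the specimen in the refrigerator at exactly four degrees centigrade",
     "The sample must be stored at precisely 4°C in the cooling unit"),
    ("Results showed statistical significance",
     "The findings indicated a significant effect"),
    ("The patient shows hypertension",
     "The patient shows hypotension"),
]

for s1, s2 in pairs:
    e1, e2 = model.encode([s1, s2])
    print(f"{util.cos_sim(e1, e2).item():.2f}  {s1!r} vs {s2!r}")
```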
Whitespace variations between sentences — do they affect embeddings? Negation, which we just saw: "X" versus "not X". Special characters, word order, dates and times, and so on. Let's look at how this actually plays out. Back in my code I have quite a few sentence pairs, and let me show you some you can judge immediately. "She completed her degree before starting the job" versus "She started her job before completing her degree" — how similar are they? "The company barely exceeded earnings expectations" versus "The company significantly missed earnings expectations" — think about how closely those are related. "If the treatment works, symptoms should improve" versus "The treatment works and symptoms have improved" — one is hypothetical, the other is factual. "The meeting ran significantly shorter than planned" versus "The meeting ran significantly longer than planned". And there are many more, across different types of sentences and domains: legal, medical, attribution, units of time. Quite interesting: "The procedure takes about five minutes" versus "The procedure takes about five hours" — a world of difference between those two, and we'll see how similar the model thinks they are. The same goes for speed, kilometres per hour versus miles per hour; "The product costs between $50 and $100" versus $101; "The patient's fever was 101 degrees Fahrenheit" versus "The fever was 104 degrees Fahrenheit"; and so on.

When I execute this, I get results like these. Take dates, for example: "Her interview was on 10/12/2025" versus "12/10/2025" scores 99% similarity. Linguistically that is a two-month difference, but the embedding model thinks the sentences are practically identical. Across all these combinations the minimum score I get is about 80%, for spelling and typos, whereas all of them should have been quite low. So it's a big challenge. How do we solve it?

There are several ways, and I've listed some of them here; you should think about using one of them, or a combination, to solve this problem. First, use a model that is specific to your domain. If you're building a medical system — a RAG solution, a copilot, a chatbot — you could use a medical embedding model, and you'll find plenty of them on Hugging Face; if you're building a legal system, use a model trained on legal text and legal semantics. In short, find a model that is closest to the work you're doing. The second approach is to pre-process your data: if you have numbers, expand them into words; if you have dates, expand them into words; if you have entities like "apple" that could mean a fruit or a brand, add that metadata to the content itself. These are pre-processing steps, meaning that when you ingest documents into your database or data store, each chunk goes through a pipeline: you add metadata where needed, expand abbreviations, turn numbers into words and dates into sentences, and then store the result. A sketch of such a pipeline is shown below.
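A minimal sketch of that pre-processing idea follows. The rules, abbreviation table, and the assumption of day/month/year dates are illustrative, not from the talk:

```python
# Sketch of the pre-processing step: expand dates and abbreviations before
# chunks are embedded and stored. Helper names and rules are illustrative.
import re
from datetime import datetime

ABBREVIATIONS = {"BP": "blood pressure", "Rx": "prescription"}  # example entries

def expand_dates(text: str) -> str:
    # Turn an ambiguous 10/12/2025 into "10 December 2025"
    # (assuming day/month/year input).
    def repl(match):
        d = datetime.strptime(match.group(0), "%d/%m/%Y")
        return d.strftime("%d %B %Y")
    return re.sub(r"\b\d{2}/\d{2}/\d{4}\b", repl, text)

def expand_abbreviations(text: str) -> str:
    # Keep the abbreviation but append its expansion so both forms embed.
    for short, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", f"{short} ({full})", text)
    return text

def preprocess_chunk(chunk: str) -> str:
    return expand_abbreviations(expand_dates(chunk))

print(preprocess_chunk("Her interview was on 10/12/2025. Check BP daily."))
```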
Finally, you can fine-tune an existing foundation model on the data you already have. If you do that — and combine it with the other techniques — you will find that the cosine similarity between the problematic sentence pairs we saw earlier drops considerably. Let's look at that. I'm not going to fine-tune live, because fine-tuning is a fairly exhaustive process and takes time, but I'll walk through the process.

For fine-tuning, you need data first. I already have data — training, validation and test sets — and it is very simple. I need to provide it in a format that is accepted by Sentence Transformers, its loss functions and its evaluation function: sentence one, sentence two, and the level of similarity between them. For example, "The software includes new features" versus "Technology continues to evolve rapidly" — there might be some similarity, so I give it 0.5. "Only 15% of students failed the exam" versus "Up to 85% of students failed the exam" — big difference, so the similarity is zero. You can generate this kind of data with another large language model and curate it afterwards, or bring your own. I fine-tune on the training data; I test on the test data — which the model has never seen — to check how good the results are; and I use the validation data during training to evaluate as I go and check whether the model is overfitting or underfitting, whether it is heading in the right direction.

Now that I have data, how do I fine-tune? I load an existing model, load my data and convert it into the format Sentence Transformers expects — sentence one, sentence two and the similarity, which you saw in the CSV file. I set up an embedding similarity evaluator, because while fine-tuning I want to evaluate how the model behaves on held-out data. I define a loss: if the model scores a pair at 0.8 when I expect 0.0, that gap is the loss, and I want to reduce it, so I choose an appropriate loss function along with a handful of other settings; all of this is well documented for Hugging Face and the sentence-transformers package. Then I call fit, which starts training — fine-tuning — the model. It uses the data we've provided, the loss function and the evaluator; the number of epochs is how many times it loops over the data, and the output path is where the fine-tuned model is saved. I've named mine "fine-tuned-semantic-model", which is why you see a new model created here.
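The flow just described maps onto the classic sentence-transformers fit API roughly as follows. File names, the CSV column layout, the base model and the hyperparameters are assumptions, not the speaker's exact setup:

```python
# Sketch of the fine-tuning loop described above.
import csv
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

def load_pairs(path):
    # CSV with columns: sentence1, sentence2, score (0.0 - 1.0)
    with open(path, newline="", encoding="utf-8") as f:
        return [(r["sentence1"], r["sentence2"], float(r["score"]))
                for r in csv.DictReader(f)]

train = load_pairs("train.csv")
val = load_pairs("validation.csv")

model = SentenceTransformer("all-mpnet-base-v2")  # base model to fine-tune

train_examples = [InputExample(texts=[s1, s2], label=score) for s1, s2, score in train]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)  # penalises wrong similarity scores

# Evaluator runs on validation pairs during training.
evaluator = EmbeddingSimilarityEvaluator(
    [s1 for s1, _, _ in val], [s2 for _, s2, _ in val], [sc for _, _, sc in val]
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=4,
    warmup_steps=100,
    output_path="fine-tuned-semantic-model",  # the fine-tuned model lands here
)
```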
Once training is done, I can evaluate the new model: I load it along with my test data, run the same encode step — I'm generating embeddings again — send in sentence one and sentence two from each test pair, get both embeddings, use cosine similarity to see how close they are, and list the results. That is the entire fine-tuning process.

After fine-tuning, I come back and run the same code that was giving me those very high similarities, but with the fine-tuned model instead of MS MARCO. It computes similarities for the same pairs of sentences — all 38 scenarios I showed you before — and we can see the results. I've already executed this, and the fine-tuned model now even returns negative scores, whereas with the foundation model the minimum was around 80%. For negation, for example, I now get 46%, and anything below roughly 60-70% you can safely discard. For the majority of the pairs the similarity is now very low, so I have removed hallucination from my system to a very large extent by using the newly fine-tuned embedding model. You will still find some scores that remain high, which means you might need to fine-tune further with more or better data, or adjust parameters such as the number of epochs — or maybe capitalization simply doesn't matter for your use case: if it matters, fine-tune for it; if it doesn't, a high score there is fine. So it is also use-case dependent; you have to think about how you want to fine-tune, gather your data, and run the whole process.

What are the steps of fine-tuning, as a recap? Set up an environment — a virtual environment with conda or uv — install the packages such as sentence-transformers, and write some code. Prepare your training, validation and test data. Pick a model to fine-tune — MPNet and MS MARCO are a couple of examples; there are plenty of others, and you need to find which works best for you. Configure the hyperparameters: which loss function, which evaluator, how many epochs, how much data, and so on. Train the model, save it, and then evaluate it to see how much better it performs than the foundation model.

How do you choose a model? It can come down to domain and size. The bigger the model, the more storage it needs and the slower the inference, which hurts request-response latency. Domain alignment matters too: you can't use a medical model for a legal use case and expect good results. And consider how quickly you need responses: some use cases need fast answers, some don't, so if you need high inference speed you will want a relatively smaller model rather than a larger one.
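The before/after evaluation the talk describes can be scripted along these lines; the pair list, the base model name, and the output path (reused from the fit sketch above) are illustrative:

```python
# Sketch of the before/after comparison: score the same tricky pairs with
# the base model and the freshly fine-tuned one.
from sentence_transformers import SentenceTransformer, util

base = SentenceTransformer("all-mpnet-base-v2")           # assumed base model
tuned = SentenceTransformer("fine-tuned-semantic-model")  # path used in fit()

test_pairs = [
    ("The treatment was completely ineffective against the disease",
     "The treatment was absolutely effective against the disease"),
    ("The procedure takes about five minutes",
     "The procedure takes about five hours"),
]

def score(model, s1, s2):
    e1, e2 = model.encode([s1, s2])
    return util.cos_sim(e1, e2).item()

for s1, s2 in test_pairs:
    print(f"base: {score(base, s1, s2):.2f}  tuned: {score(tuned, s1, s2):.2f}  | {s1}")
```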
There are various other factors, but those are the main ones for choosing a foundation model before you fine-tune. The slide here lists some widely used models along with their sizes, strengths and best uses; I won't go into those details. I've already talked about evaluators and epochs, and you can read more in the Hugging Face documentation, but all of these settings affect the quality of the fine-tuned model. If the quality doesn't meet your standards, come back, tune them, generate a new model and compare it with the previous one to see whether you are actually improving. That is what the experimentation is for.

That's all I had, so here are the key takeaways. First, understand your data: if you don't know your data well, you can't decide which foundation model to bring in, how to fine-tune, or what your training, validation and test data should be, and whether it meets the quality bar. Identify the model to use — very important. Embedding models are not magic: try the embedding model on samples of your own data first, see how it performs, and if it doesn't perform well, think about the various ways to remove the hallucination. Data quality is critical; without it, nothing else you do will produce the right results. Only fine-tune if it's required — it is not the starting point for removing hallucinations, it sits towards the end; there are earlier steps such as pre-processing, or finding a model that already handles your cases, perhaps because someone else has already fine-tuned one you can reuse. Systematic evaluation is essential: evaluate your models thoroughly before using them in your Gen AI applications. And finally, keep iterating and exploring to see whether things can be improved further. Have a baseline; if you cross that threshold you are good to go, but don't let that stop further experimentation, because you might find even better results.

With that, I'll stop my presentation here. If you have any questions, feel free to reach out through any of these contacts: you can find me on LinkedIn, the code you saw is available on GitHub, and there's my Twitter account as well. Thank you very much — that was my time, and I hope you enjoy the rest of the conference.
...

Ritesh Modi

Principal AI Engineer @ Microsoft

Ritesh Modi's LinkedIn account · Ritesh Modi's Twitter account


