Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
I am ESH mdi.
I'm a principal engineer at Microsoft, and I am going to talk
to you about removing hallucinations in your GenAI applications
from an embeddings perspective.
But what are embeddings?
Computers don't understand the natural language that we speak and type; they understand numbers.
Embeddings are a process of converting sentences and words into numerical values.
But what is special about those numerical values is that they carry semantic meaning.
As you can see on the screen here, there is a man and a woman, a queen
and a king, and a dog and a puppy.
Interestingly, within this X and Y coordinate space, the dog and the puppy
are closer to each other than, say, a dog and a woman. A man and a woman
are closer than a king and a queen, while king and man are closer
than queen and woman.
So converting a sentence into numbers while preserving semantic
meaning in those numbers is what embeddings are all about.
But then, the topic today is that we shouldn't be trusting embeddings from a foundation model
without experimentation and evaluation.
We have been using large language models and small language models
from various vendors like Hugging Face and other
providers like OpenAI, for example.
We have been converting our sentences into numbers, but should we trust them?
This topic is all about that.
But before we get into the details of why this happens and what we should do to remove
hallucinations from the semantic meaning that these numbers provide after converting
a sentence, let's understand the embedding process itself.
So as you can see here, there are two sentences:
"The cat jumped over the fence" and "The cat jumped well."
In the embedding process, each word
has a mapping to a number, which is also called an input ID.
So "The cat jumped over the fence" has input IDs 1, 2, 3, 4, 5, and "well", which is an additional
word, has input ID 6.
When you send a sentence to a model, it gets mapped, converted into those numbers.
As you can see here, the first sentence maps to 1, 2, 3, 4, 5, whereas
the second sentence maps to 1, 2, 3, 6.
Then it goes through the embedding process, and it generates some numbers for you.
Interestingly, you would notice that there are five rows over here, one for
each word in the first sentence, while you have four rows over here, one for each word in
the second sentence.
Another interesting fact you would notice is that each row has
four values. Why four values?
Because this embedding model generates an output of four
numerical values per word, which is also known as the dimensionality.
So the dimensionality is nothing but the size of the output for each word.
The number of numbers in each row is the dimensionality, and it can range
anywhere from 32 to 4096.
Higher is generally better, because the more dimensions, the more nuance, the more of
the semantic meaning of that word can be captured within that embedding.
Lower dimensions mean less semantic value, but they are faster, and they
can be stored with lower storage requirements.
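If you want to check the dimensionality of a real model yourself, here is a minimal sketch, assuming the sentence-transformers package and the public all-mpnet-base-v2 checkpoint (the slide uses a toy four-dimensional model purely for illustration):

```python
# A minimal sketch, assuming sentence-transformers and the public
# "all-mpnet-base-v2" checkpoint; your model name may differ.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

embedding = model.encode("The cat jumped over the fence")
print(embedding.shape)                           # e.g. (768,) -> the dimensionality
print(model.get_sentence_embedding_dimension())  # the same number, reported by the model
```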
Another important concept in understanding embeddings is the max sequence length.
If you remember, we sent "The cat jumped over the fence" and "The cat jumped well."
If our max sequence length is 10, then we can send 10 tokens at a time
to this embedding model, and it will generate the output based
on the dimensionality.
But if we send 12 as the number of words in a sentence, then
it will only use the first 10 and simply discard the remaining two.
So we have to know the max sequence length of the model that we're using.
Basically, the words get converted into tokens, and the tokens
are what get evaluated against the max sequence length.
They can vary: as you can see here, sentence models can have 128 to 512 tokens,
whereas document models can range from around 1,000 to maybe close to 10,000.
Inputs get cut off if you go beyond that, but if they're shorter than that,
they get padded, and everything you sent is used in the semantic meaning.
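To see the limit for a concrete model, a sketch along these lines works, again assuming sentence-transformers and all-mpnet-base-v2:

```python
# A minimal sketch, assuming sentence-transformers and all-mpnet-base-v2;
# the exact limit depends on the checkpoint you pick.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
print(model.max_seq_length)   # e.g. 384 tokens for this checkpoint

# Anything that tokenizes to more than max_seq_length tokens is silently truncated.
long_text = "the cat jumped over the fence " * 100
tokens = model.tokenizer.tokenize(long_text)
print(len(tokens))            # far more tokens than the model will actually use
```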
Another very important factor is the vocabulary and its size when
we are generating embeddings.
Why is this important?
It is important because if we are providing words like "cat" and "jumped" and
they are part of the vocabulary, then the model has been well trained on those
words and it preserves the semantic meaning for them.
But if you have an out-of-vocabulary word which has never been
seen by the model, then the model cannot have a semantic meaning for it.
So it's always good to have a model with a rich vocabulary, which
means the model has been trained on data that covers a variety of content.
From a data perspective, the size is also important, because the bigger the vocabulary,
the more training time you need, and the slower the model might be,
because it has to look up those input IDs, convert them into
tokens, and then generate the embeddings.
The smaller the size, the faster the model and the faster the embedding process;
the richer the vocabulary, the better the semantic meaning preserved.
The fewer words in the vocabulary, the less semantic meaning you have.
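You can inspect the vocabulary directly; a minimal sketch, again assuming the all-mpnet-base-v2 checkpoint, shows how an in-vocabulary word stays whole while a rare word gets split into sub-word pieces:

```python
# A minimal sketch, assuming sentence-transformers and all-mpnet-base-v2.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
print(model.tokenizer.vocab_size)       # size of the model's vocabulary

# An in-vocabulary word stays whole; a rare or out-of-vocabulary word gets
# broken into sub-word pieces, which weakens its semantic signal.
print(model.tokenizer.tokenize("cat"))
print(model.tokenizer.tokenize("floccinaucinihilipilification"))
```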
But then, what are the use cases for embeddings?
Embeddings have become very popular for GenAI-based applications,
especially after the advent of retrieval-augmented generation, which
is also known as the RAG pattern.
In RAG we basically try to search, and sometimes you want to search with
keywords, but at other times you want to search using semantic meaning,
which means we might ask for something that is not exactly available
within our data store.
We might have to infer the intent, the meaning of the sentence, the
utterance, or the query the user is sending, to find the relevant results
to return as part of the search results, which is very important because
these days almost every GenAI application is using this technology.
Then we have recommendation systems: if somebody is searching for a product
like a cellular phone, would you like to have a mobile phone displayed as
part of the recommendations, or maybe other electronic devices that reflect
similar choices? You also have to know that a cellular phone might
mean something different, such as an automotive device.
Similarly, natural language processing, and there are other use cases as well.
So how do we compare embeddings?
Now we have embeddings, which is great: we have numbers stored in our
database and we have numbers coming in.
How do we compare numbers to numbers?
The way to compare them is using various metrics, and some of them are shown over here.
The most famous one is cosine similarity. If you remember the man
and the woman in the first slide: if we draw straight lines from the origin (0, 0), what is
the angle between those two lines?
The smaller the angle, the more similar they are; the larger the
angle, the less similar, and the value can range from minus one to one.
Similarly, you have Euclidean distance, Manhattan distance, and dot product, and
I will leave it to you to explore them.
But cosine similarity is the leading way of doing comparisons.
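To make the metrics concrete, here is a minimal sketch using plain NumPy; the four-dimensional vectors are made up purely for illustration:

```python
# A minimal sketch of the metrics themselves; the vectors are illustrative.
import numpy as np

a = np.array([0.2, 0.8, 0.1, 0.4])
b = np.array([0.25, 0.75, 0.05, 0.5])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # -1..1, higher = more similar
euclidean = np.linalg.norm(a - b)                                # lower = more similar
manhattan = np.abs(a - b).sum()                                  # lower = more similar
dot = np.dot(a, b)                                               # unnormalized similarity

print(cosine, euclidean, manhattan, dot)
```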
So let's see a small example of how we can find embeddings using OpenAI
and Sentence Transformers.
Sentence Transformers is basically a Python package that you can
download or install and use to find embeddings.
As you can see here, I have some code, and I am going to use a
simple model called MPNet, which is available on Hugging Face.
And I have two sentences:
"What's the weather like in New York today?"
and "Can you tell me the current weather in New York?"
And I'm trying to find embeddings for both of them.
I could also, for example, come here and show you another example
where I am just trying to find the embeddings for
"American pizza is one of the nation's greatest cultural exports."
Just give me the embeddings.
The result you would see here is an embedding with quite
a few numbers. Why these numbers?
You might know by now that these are the dimensions, and this embedding model,
MS MARCO, gives us 768 dimensions.
If you remember the rows I was talking about, there were four numbers
per row in that first example.
Here there are 768, which is quite rich in semantic meaning, and as you
can see here, those numbers semantically represent something about this sentence.
So it might not be an exact keyword match, but semantically, what does it mean?
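A minimal sketch of this demo, assuming sentence-transformers; all-mpnet-base-v2 stands in here for whichever MPNet or MS MARCO checkpoint you actually use:

```python
# A minimal sketch of the demo just described; the model name is an assumption.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "What's the weather like in New York today?",
    "Can you tell me the current weather in New York?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)   # (2, 768) -> one 768-dimensional vector per sentence

single = model.encode("American pizza is one of the nation's greatest cultural exports.")
print(single[:5])         # first few of the 768 values
```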
Now, if we want to find and compare embeddings, which is what I was trying to show you before,
if we want to compare two sentences and find how close they are
from an embeddings perspective, I have those two sentences I was talking about.
I can use the sentence-transformers package and its encode method, and
once I do that, I use cosine similarity, that measure I was
talking about in the beginning.
Once I do that, you can see the results.
If I use cosine similarity, I get 88% similarity between these two sentences.
The words are not in the same order; they might not even be the same words,
but I am able to find a lot of similarity between these two sentences.
And that's the beauty of embeddings: you don't have to do an exact
keyword search or an entity-based search.
You can do a simple search, and what gets compared against the other
sentence is its semantic meaning.
There are other metrics shown over here; we can ignore them for now and
just concentrate on cosine similarity.
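A minimal sketch of this comparison step, assuming sentence-transformers, whose util.cos_sim helper computes the cosine similarity:

```python
# A minimal sketch of the comparison step; the model name is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

emb1 = model.encode("What's the weather like in New York today?")
emb2 = model.encode("Can you tell me the current weather in New York?")

score = util.cos_sim(emb1, emb2)
print(f"cosine similarity: {score.item():.2f}")   # somewhere in the region of 0.8-0.9
```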
So that's how you compare embeddings.
Now I can compute embeddings, but let's see how well we understand them.
I have a small quiz for you; try it mentally, or write it down somewhere.
You saw that cosine similarity ranges from minus one to one.
Now there are four different pairs of sentences over here.
Just by reading them, without writing any code, tell me how close these
two sentences should be, or what the cosine similarity should be.
"The treatment was completely ineffective against the disease"
versus "The treatment was absolutely effective against the disease."
What does the human mind say? Are these two sentences similar?
No way. They're complete opposites.
So the cosine similarity should be quite low, because the angle
between them should be quite high.
Similarly: "Place the specimen in the refrigerator at exactly four degrees
centigrade" versus "The sample must be stored at precisely four degrees centigrade
in the cooling unit," where one writes out the word "degrees" rather than the symbol.
"Results showed statistical significance" versus "The findings indicated a significant effect."
"The patient shows hypertension" versus "The patient shows hypotension."
Feel free to think about how close they are, but let me
show you what they eventually score.
If you look at the quiz results, which I have already run, you
would see that quiz three had an 83% similarity, which is great because
those two sentences mean almost the same thing.
The finding that absolutely puzzled me was quiz one.
This sentence is the complete opposite of the other, but I got a 96% similarity.
That means if I am building a system where I have stored some data in my
documents, and the content of my documents looks like this,
and a user sends me a query like this,
the system is returning the complete opposite as an answer.
So when we are implementing RAG-based solutions, or building our own copilot,
or building anything based on search,
how can we rely on an out-of-the-box embedding model like MPNet
in such a scenario? That is the crux of the whole topic today:
how do we remove such hallucinations so that we can work with these
kinds of sentences within our data?
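You can reproduce this kind of experiment yourself; a minimal sketch, assuming sentence-transformers and the all-mpnet-base-v2 checkpoint, runs a few "tricky" pairs through a base model and prints the cosine similarities:

```python
# A minimal sketch of the experiment: score a few tricky pairs with a base model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

pairs = [
    ("The treatment was completely ineffective against the disease",
     "The treatment was absolutely effective against the disease"),
    ("The patient shows hypertension",
     "The patient shows hypotension"),
]

for s1, s2 in pairs:
    sim = util.cos_sim(model.encode(s1), model.encode(s2)).item()
    print(f"{sim:.2f}  {s1}  <->  {s2}")
# Opposite meanings often still score very high, which is exactly the problem.
```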
Why is this happening?
Think about it.
We have an embedding model like MPNet, which you see here.
Now, why doesn't MPNet capture the difference between these two sentences
and say that the cosine similarity should be quite low?
Because of some of the reasons mentioned over here.
The top one is that some models give more weight to words they consider
important, and if those words appear in both sentences, they conclude
it is a high cosine similarity.
Similarly, a model might compare the words available in both sentences
and, if the overlap is high, simply report a high cosine similarity,
regardless of the order of the words within the two sentences.
It might also give more weight to a positive sentiment, for example,
so that if "enjoyed" appears in both sentences it assumes they are very similar;
it does not notice that there might be a negation, "not enjoyed,"
in the second sentence.
So there are quite a few reasons why this happens, and it is our job to make
sure that we experiment with the base model, find out whether it is hallucinating,
and if it is, make sure we remove those hallucinations.
What are the various scenarios I have found where embedding models hallucinate?
There are quite a few; as you can see, there are 38 of them,
starting from capitalization, and I'll show you some of these examples.
Whitespace variations between sentences: do they impact embeddings?
Negation, which we just saw an example of; special characters;
word order; date and time; and all of those things.
So we will go and look at how that actually plays out.
So let me come back to my code here.
I have quite a few sentences over here, and let me show you some
of the ones you would be able to immediately judge, where you can
think clearly about whether they should be similar or not.
"She completed her degree before starting the job" versus
"She started her job before completing her degree": how similar are they?
"The company barely exceeded earnings expectations" versus
"The company significantly missed earnings expectations":
think about how closely they are related to each other.
"If the treatment works, symptoms should improve" versus
"The treatment works and symptoms have improved":
one is hypothetical, one is factual.
"The meeting ran significantly shorter than planned" versus
"The meeting ran significantly longer than planned," and
there are quite a few of them.
As you can see, they cover different types of sentences:
legal domain, medical domain, attribution, units of time, which is quite interesting.
"The procedure takes about five minutes" versus
"The procedure takes about five hours":
there's a ton of difference between these two sentences, and we'll
see how similar the model thinks they are.
The same goes for speed, kilometers per hour versus miles per hour;
"The product costs range between $50 and $100" versus "$101";
"The patient's fever was 101 degrees Fahrenheit" versus
"The fever was 104 degrees Fahrenheit," and things like that.
If I execute this, you see a result like this. Take date and time, for example:
"Her interview was on 10/12/2025" versus "12/10/2025." The similarity is 99%.
That is a two-month difference: from a language perspective the dates are two months apart,
but from a numbers perspective our embedding model thinks they are exactly the same thing,
because the tokens look almost identical.
All of these are different combinations, and as you can see,
the minimum I'm getting is around 80% for spelling and typos,
whereas many of these pairs should have scored quite low.
So it's a big challenge, right?
How do we solve this?
There are quite a few ways, and I've listed some of them over here; you should
think about using one or the other, or a combination of several of them,
to solve this problem.
First, you can use a model for a domain that is very specific to your use case.
If you are building a medical system, with RAG or a copilot or an interactive
chatbot, then you could use a medical embedding model, and you will find a
ton of them on Hugging Face, for example. If you're building a legal system,
then you can use a model that has been trained on legal text and legal
semantic meanings, and so on.
So that's one way to look at it:
try to find a model that is closest to the work that you're doing.
The second thing is to pre-process your data.
If you have numbers, expand them into words.
If you have dates, expand them into words.
If you have entities like "apple," where apple could mean a
fruit or a brand, provide additional metadata within the content itself.
Those are all pre-processing steps, which means that when you are ingesting
these documents into your database or data store, you take them through a process:
each chunk goes through this process, you add more metadata if needed,
expand abbreviations, turn numbers into words, turn dates into sentences,
and all of those things.
Then you store them in your database, and eventually you fine-tune
an existing foundation model on the data that you already have.
If you do that, or a combination of these (see the sketch just below),
you would find that the cosine similarity between the same sentences
we've seen before falls down considerably.
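To make the pre-processing idea concrete, here is a minimal, illustrative sketch of an ingestion-time pass. The helper functions, the abbreviation map, and the metadata convention are all hypothetical; adapt them to your own chunking pipeline.

```python
# A hypothetical pre-processing pass run at ingestion time (illustrative only).
import re

ABBREVIATIONS = {"approx.": "approximately", "w/": "with"}          # example mappings
NUM_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}

def expand_numbers(text: str) -> str:
    # Naive example: spell out small integers alongside the digit.
    return re.sub(r"\b([1-5])\b", lambda m: f"{m.group(1)} ({NUM_WORDS[m.group(1)]})", text)

def preprocess_chunk(text: str, entity_hints: dict) -> str:
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    text = expand_numbers(text)
    # Append lightweight metadata so the embedding "sees" the disambiguation.
    hints = "; ".join(f"{k} means {v}" for k, v in entity_hints.items())
    return f"{text}\n[context: {hints}]" if hints else text

print(preprocess_chunk("Apple released approx. 3 models.",
                       {"Apple": "the technology brand"}))
```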
So let's see that.
I'm not going to run the fine-tuning here, because fine-tuning is a pretty
exhaustive process and it takes time, but I'll talk through the process.
For fine-tuning, you need data first, right?
I already have data: test data, training data, and validation data, and they're very simple.
I need to provide them in a format that is acceptable to Sentence
Transformers, to the loss function, and to the evaluation function.
So for example, I have sentence one, sentence two, and the level of
similarity between them.
If I look at this: "The software includes new features" and
"Technology continues to evolve rapidly."
There might be some similarity, so I'm giving it 0.5.
"Only 15% of students fail the exam" versus "Up to 85% of students fail the exam":
big difference, the similarity is zero.
This kind of data you can either generate using another large language
model and curate afterwards, or you can bring your own data like this,
and then use it for fine-tuning.
So I have training data on which I will fine-tune the model, and then I have
test data on which I will test the model and see how good the results are.
The test data is something the model has never seen.
The validation data I use while training, evaluating at the same time
to check whether things are overfitting or underfitting,
whether my model is going in the right direction or not.
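A minimal sketch of the data format sentence-transformers expects for similarity fine-tuning: pairs of sentences plus a target score in [0, 1]. The example pairs mirror the ones from the CSV just described:

```python
# A minimal sketch of the training-data format for similarity fine-tuning.
from sentence_transformers import InputExample

train_examples = [
    InputExample(texts=["The software includes new features",
                        "Technology continues to evolve rapidly"], label=0.5),
    InputExample(texts=["Only 15% of students failed the exam",
                        "Up to 85% of students failed the exam"], label=0.0),
]
# The same structure is used for the validation and test splits,
# typically loaded from CSV files with sentence1, sentence2, score columns.
```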
So I have data now; how do I do the fine-tuning?
What I do is load up an existing model and load my data,
converting it into the format that Sentence Transformers understands for
fine-tuning, which means sentence one, sentence two, and the similarity,
as you saw in the CSV file.
Then I have an embedding similarity evaluator, because when I fine-tune,
I want to evaluate how the model is behaving on new data.
I have a loss, because if I find the similarity for a pair to be 0.8
when I expect it to be 0.0, then that loss is 0.8, and I want to reduce that loss.
So I choose what kind of loss to use, plus a host of other attributes to make it run;
these are all well documented and available in the Hugging Face and
Sentence Transformers packages.
Then we call fit, and this fit is what starts training,
fine-tuning the model. It uses the data that we have given it,
it uses the loss function and the evaluator we provided,
the number of times you want to loop through the data is the
number of epochs, and the output path is where you get your fine-tuned model.
I have named my model "fine-tuned semantic model," and that's why you see a new
model created over here.
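Here is a minimal fine-tuning sketch using the classic model.fit API of sentence-transformers. The base model name, the inlined examples, the hyperparameters, and the output path are illustrative; in practice the examples would be loaded from the train and validation CSVs.

```python
# A minimal fine-tuning sketch; file names and hyperparameters are illustrative.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

train_examples = [
    InputExample(texts=["The software includes new features",
                        "Technology continues to evolve rapidly"], label=0.5),
    # ... loaded from your training CSV in practice
]
val_examples = [
    InputExample(texts=["Only 15% of students failed the exam",
                        "Up to 85% of students failed the exam"], label=0.0),
    # ... loaded from your validation CSV in practice
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)                 # penalizes wrong similarities
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(val_examples, name="val")

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=4,
    output_path="fine-tuned-semantic-model",   # where the tuned model is saved
)
```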
Now, if I want to evaluate my new model after training, I can load that model,
load my test data as well, and then run the same encoding,
because I'm generating embeddings again.
I can take sentence one and sentence two from the test data, get both
embeddings, use cosine similarity to see how close they are,
and list the results over here.
So that is the entire process of fine-tuning.
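A minimal sketch of that evaluation step, loading the freshly saved model from the (illustrative) output path and scoring held-out pairs:

```python
# A minimal sketch of evaluating the fine-tuned model on held-out pairs;
# the path and the test pair are illustrative.
from sentence_transformers import SentenceTransformer, util

tuned = SentenceTransformer("fine-tuned-semantic-model")

test_pairs = [
    ("The treatment was completely ineffective against the disease",
     "The treatment was absolutely effective against the disease"),
]
for s1, s2 in test_pairs:
    sim = util.cos_sim(tuned.encode(s1), tuned.encode(s2)).item()
    print(f"{sim:.2f}  {s1}  <->  {s2}")   # should now be much lower than before
```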
After fine-tuning, what I can do is come back over here and run the same code
you saw before, the code that was giving me such high similarities,
but now using the fine-tuned model.
If you look here, I had used MS MARCO before; now I'm not using MS MARCO
but the fine-tuned semantic model, which is going to find embeddings for the
same pairs of sentences, all 38 of the scenarios I showed you before,
and we can see what the results are.
If you see here, I've executed this already,
and now you will see that the embedding model is even giving me negative scores.
With the previous foundation model, the minimum was 80%.
Now for negation, for example, I get 46%,
and anything less than 60 or 70% similarity you can obviously discard.
For a majority of them I now have very low similarity,
so I have removed hallucination from my system to a very large extent
by using the new embedding model I've just fine-tuned.
Obviously you will find some of them quite high even now, which means
you might have to fine-tune further with more or better data,
or maybe I haven't configured the fine-tuning appropriately and need to change
some parameters like the number of epochs.
Or maybe capitalization does not matter to me: if capitalization,
whether lowercase or uppercase, matters, then I have to fine-tune for it,
but if it does not matter, then a high similarity there is fine for me.
So it's also use-case dependent.
You have to think about how you want to fine-tune, get your data,
and run through the entire process to make this happen.
So what are the steps of fine-tuning?
Just as a recap: you set up an environment, which means you need something
like a virtual environment, conda, or uv.
You download all the packages, like sentence-transformers,
and you obviously need to write some code.
You prepare your training data, validation data, and test data for fine-tuning the model.
You pick a model to fine-tune; it could be MPNet or MS MARCO, to name a couple of examples.
There are a ton of them. You can use any of them, but you need to find
which model works best for you, and then you can go for fine-tuning.
You provide the configuration and hyperparameters: which kind of loss
function, what kind of evaluator, how many epochs,
how much data, and things like that.
You train your model, save it, and then evaluate it to see how much better it is
performing compared to the foundation model.
How do you choose a model?
It could be based on domain or on size.
The bigger the size, the more space and storage it needs, and the lower the
performance, because it takes more time to respond
from a request-response perspective.
And you can't use a medical domain model on a legal use case
and then expect it to give a good response,
so you have to think about domain alignment as well.
Then there is the speed at which you want the response back:
some use cases need a quick response, some do not,
so if you need higher speed in terms of inference, you need a relatively
smaller model compared to the larger ones.
There are various other considerations, but these are some of the top ones
for how you choose a foundation model before you go for fine-tuning.
And this slide is the same idea, showing some of the top models, or some of the
models that are quite widely used,
with their sizes, strengths, and how they are best used in different capacities.
I won't go into those details.
I already talked about the evaluators and epochs, but you can
obviously dig into the details in the Hugging Face documentation.
These all impact the quality of the new fine-tuned model.
So if you think the quality is not meeting your standards, you have to come
back and tune them, generate a new model, and compare with
the previous model to find out whether you are improving or not.
That's what the experimentation is all for.
So that's all I had.
What are the key takeaways?
First, understand your data.
If you don't know your data well, you won't be able to decide what kind
of model to use, what kind of foundation model to bring in,
how to fine-tune if you have to, what your training data is,
what your validation data is, what your test data is,
and whether they meet the required quality. Identifying the model to use
is very important.
Then, embedding models are not magic.
You have to first try the embedding model on some examples or samples of
data that you have, see how it performs, and if it's not
performing well, think about the various ways to remove that hallucination.
Data quality is super important: if the quality isn't there, no matter what you
do, you will not get the right results.
And only fine-tune if it's required.
It's not the starting point for removing hallucinations;
it is one of the later steps. There
are other steps I mentioned before, like pre-processing, or finding a good
model that doesn't have these problems in the first place, or somebody might
already have done the fine-tuning, and then you can use their model instead
of doing the fine-tuning yourself.
Systematic evaluation is quite important,
so you should evaluate your models well before using
them in your GenAI applications.
And then finally, keep iterating, keep exploring, and find out whether
things can be further improved.
Have a baseline.
If you are crossing that threshold, you should be good to go,
but that doesn't mean you have to stop experimenting further, because who knows,
you might find even better results.
So with that, I'll stop my presentation here. Obviously, if you
have any questions, feel free to reach out through any of these contacts;
you can reach out to me on LinkedIn.
The code that you saw here is available on GitHub,
so you can obviously go and check it out.
Next is my account on Twitter.
So thank you very much.
That was my time, and I hope to see you at the conference and
that you're enjoying it a lot.
Thank you very much.