Conf42 Machine Learning 2024 - Online

Vector Databases 101 - How do they work?

Abstract

Discover how vector databases work behind the scenes and how they are powered. A lot of people use them, but knowing how they work isn't that common. We will go over the different implementations, indexes, and metrics, and you should have a better idea of how vector databases work after this talk.

Summary

  • Stephen Batifol is a developer advocate at Zilliz. Today he will talk to you about vector databases 101. His objective is to explain to you how they work, in detail.
  • Unstructured data is about 80% of the data worldwide, and vector databases are specifically built to work with this kind of data. Plain vector search libraries usually lack the infrastructure to help you scale, deploy, and manage your apps in production, and you might also encounter problems with search quality.
  • The most popular use case is retrieval-augmented generation (RAG). Multimodal similarity search lets you search over multiple types of data, such as text and images. Anomaly detection is also a common one. But it's not all about embeddings: when should you use a vector database?
  • Similarity search is the core concept behind a vector database. The inner product measures not only the angle but also the magnitude of the two vectors. If you want to narrow your query, you can also filter on metadata fields.
  • The flat index is the most elementary similarity search. Then you have inverted file flat, also called IVF_FLAT. To address memory constraints, IVF can be combined with product quantization, which addresses the limits of scalar quantization.
  • HNSW, supported by Milvus, combines skip lists and navigable small world (NSW) graphs, which can really speed up your search process. There's no good or bad index; it really depends on your application requirements.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Stephen Batifol. I'm a developer advocate at Zilliz, and today I'm here to talk to you about vector databases 101. Everyone talks about vector databases, and today my objective is to explain to you how they work, in detail actually. So let's get started. First, what is Zilliz? Zilliz is the maintainer of Milvus, which is an open source vector database. We also maintain Knowhere, which is a vector search engine, and GPTCache, which can be very useful if you want a semantic cache for LLM queries. We have VectorDBBench, which is a benchmark tool for vector databases. We have Zilliz Cloud, which is our cloud offering, but I'm not here to talk about that. We also partner with different industry leaders such as AWS, Google Cloud, and Azure, with different GenAI tooling, and directly with chip manufacturers such as Nvidia or Intel. But today we're going to start with why vector databases, why you may need one. Also, where do vectors come from, and the different use cases. How do vector databases work? What is similarity search, and what are the different indexes? Then I'm going to show you the basic Milvus architecture, and hopefully after that you should understand how databases work, and vector databases in particular. So let's start with the why. Why do you think you need a vector database? Well, one reason is that unstructured data is about 80% of the data you have worldwide. And the way vector databases work is a bit different from the usual SQL and NoSQL databases. The search pattern in a traditional database is that you go to find a very specific result: you go with SELECT * from a specific table, or select some specific columns, because you want a result that is exact. With a vector database, it's a bit different.
We're doing a lot of vector comparisons. If you don't know, vectors are just representations of unstructured data; they are basically a long series of numbers. And vector databases are specifically built to work with this kind of unstructured data, including text, images, videos and audio. One thing people always ask me is: can I just use NumPy? Could I just write my own and make it work? The answer is that, well, technically yes you can, but you might encounter a problem once you go to scale. At the beginning it's fine, you only have 1,000 embeddings, it's very small, you don't really need more. Once you go really high in the number of embeddings, then you can really struggle, and search can take a long time; basically, it doesn't scale. That's usually my answer when people ask me about vector search libraries in general: yeah, they can work, and I'm not going to say you always need a vector database, but you might encounter some problems with search quality. What do you do if you need hybrid search, or if you need filtering? Also scalability: what happens when you have billions of vectors? What about multi-tenancy? What about the cost? Are you putting everything in memory, everything on disk, everything on S3? Security is also a big concern: what about data safety and privacy? And what if you have to update your data, or update your embeddings? It can be a problem. TL;DR: vector search libraries usually lack the infrastructure to help you scale, deploy and manage your apps in production. They can be really cool for POCs, but once you go more into scale and into production, they can really struggle. Why not use a SQL or NoSQL database? Well, there's also a problem here: they're usually inefficient in high-dimensional spaces, which is exactly what you have with vector embeddings.
They also have sub-optimal indexes: they don't support the different indexes that you might need for vector search. Query support can be inadequate, scalability can be a problem, you have data conversion issues, and so on. Vector search and vector operations in general are a bit too intensive, from a compute point of view, for traditional databases. Vector databases, and Milvus in particular, are really built for scale. I won't go into the details of everything, but thanks to vector databases you can have advanced filtering, hybrid search, multi-vector search; you can also have automatic backups, replication, high availability, aggregations. They are really good for billion-plus scale, so you can really do all of those. And some vector databases, like Milvus in particular, have GPU indexes, which can be very useful in some cases: if you have a lot of embeddings and you need very high accuracy and very low latency, then you may want a GPU index. So as a takeaway: vector databases are purpose-built to handle indexing, storing and querying vector data, and they are usually designed for billion-plus use cases. It doesn't mean you can't use them for smaller scale; it just means you can start from the ground up, with a couple of hundred embeddings, and then we can help you go to billion-plus. But then, where do vectors come from? It's a very good question; they don't just pop up out of the blue. Imagine you have your knowledge base, which can be text, pictures, internal documents, and the data gets passed into a deep learning model. It's very important that this is the right type of model. What I mean by that is that it has to be trained on the right type of data.
So if you're working with images, you've got to have a model that has been trained on image data; if you're working with text, it has to be trained on text data. If you're working on images and you want to identify types of dogs, then, well, the model has to be trained on that. What happens then is that you take your image data, for example, you run it through a deep learning model, and then you cut off the last layer, because usually the last layer does a prediction, and we're not interested in the prediction; we're interested in what the model has learned about the data. What we want is the numerical representation, so that later on, when we work with that kind of data, we have a way to work with it and understand it. That's why we cut off before the last layer. And why not take the second layer from the beginning, or one in the middle? It's because as data passes through the model, each layer learns something new, and the second-to-last layer contains all the semantic information without doing the prediction. That's what a vector is in our context; that's what we use all the time. So then you have your vectors, you put them in a vector database like Milvus, and then you can query them. How does it work in practice? Usually you have embedding models, and there are a lot of different ones, and they are actually quite important. You have to keep in mind that embedding models will be key for you. There are embedding models that have been trained on very, very specific things. So let's say you're working with English data and nothing specific: maybe the OpenAI embeddings will work. But in my case, I live in Germany and we have German documents, so we need embedding models that have been trained on German data.
And in that case, for example, I would usually use Jina AI, because it's quite good. I would suggest, if you don't know where to start, go on Hugging Face, look at the leaderboards and check the different embedding models they have, because they have so many that it can be really good. Hugging Face also makes it very easy for you to use embedding models. So maybe try at first with OpenAI, try different models; but those are very important, and you have to pick a specific one. And how does this work? Here we have a very simple example: we have a text, which is "happy dog wagging tail". You put it through an embedding model and it generates a vector. Then you can imagine a vector space; that's very simplified, but you can imagine you have vectors everywhere, in many dimensions. What happens is that everything that is similar ends up close together in the vector space. So if you have a building with a small window, and then another one, which is a big building, once they are transformed, the vectors will be very close in vector space, and that's the same for text. And if you have something that is completely unrelated, it will be far from the others. So what about the use cases? We've been talking about why you should use a vector database, but it's more like: when should you use it? It's not all about the embeddings; it's about the different use cases. The first one, and the main one, is RAG. That's the one that has been the most popular recently, and it means retrieval-augmented generation. If I were to explain it: you want to extend your LLM with knowledge that is external to the LLM. Let's say you work for a company and you want to add company data to the LLM. Usually you would use RAG for that.
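The "cut off the last layer" idea can be sketched with a toy network in NumPy. The layer sizes and the random (untrained) weights here are made up purely for illustration; a real model would be a trained network, and the point is only to show where the embedding comes from:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network standing in for a real deep learning model.
W_hidden = rng.normal(size=(784, 128))   # input -> hidden (second-to-last layer)
W_out = rng.normal(size=(128, 10))       # hidden -> prediction layer

x = rng.normal(size=(784,))              # one input, e.g. a flattened image

prediction = np.tanh(x @ W_hidden) @ W_out   # the class scores we don't want

# "Cut off the last layer": keep the hidden activations as the embedding.
embedding = np.tanh(x @ W_hidden)

print(prediction.shape, embedding.shape)     # (10,) vs (128,)
```

The 128-dimensional `embedding` is what would be stored in the vector database, while the 10-dimensional `prediction` is discarded.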
So you take all your data, you process it, you put everything in a vector database, and then once you make a query, it goes through the vector database to see if there is something similar, which is then given back as context to the LLM. That's what RAG is. But that's only one of the usual use cases for a vector database. You have the good old recommender system: product recommendation is actually one of the most common use cases for Milvus in production. You can compare products and users, and use a vector database to quantify and compare them. Then you have text and semantic search, image similarity search, video similarity search, and audio as well. You can also have molecular similarity search, which can be really useful for companies working with proteins, for example. Anomaly detection is also a common one: how different are two user actions, how similar are the actions of two users? That's very important when it comes to fraud detection, for example. And then you have multimodal similarity search, which allows you to search over multiple types of data, text and images, and that can be really useful. You can also combine multimodal similarity search with RAG, and then you have multimodal RAG. Instead of just giving text to your RAG system, you can give it an image, which is then transformed, and the system can return an image, or text; it can return whatever you want it to return. That can be very, very useful, especially in the era of GPT-4 or Gemini 1.5: with what has been announced, you can actually do it yourself too. So those are the common use cases. Then, how does it actually work? Now we're going to go a bit more into the details.
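The RAG flow just described, embed the query, retrieve similar documents, and hand them to the LLM as context, can be sketched end to end. The `embed` function here is a hypothetical stand-in (a bag-of-letters vector) so the retrieval step is runnable without a real embedding model or a real vector database:

```python
import numpy as np

# Toy "knowledge base": in practice these embeddings would come from a real
# embedding model and live in a vector database such as Milvus.
docs = ["Milvus is an open source vector database.",
        "Paris is the capital of France.",
        "RAG augments an LLM with external knowledge."]

def embed(text):
    # Hypothetical embedding model: counts of each letter, normalized.
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    return v / (np.linalg.norm(v) or 1.0)

index = np.stack([embed(d) for d in docs])

def retrieve(query, k=1):
    # Vectors are normalized, so the inner product acts as a cosine score.
    scores = index @ embed(query)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

query = "What is a vector database?"
context = retrieve(query, k=2)
prompt = f"Context: {context}\n\nQuestion: {query}"  # this goes to the LLM
```

Swapping `embed` for a real model and `index`/`retrieve` for a vector database query gives you the production version of the same loop.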
So we're going to talk about similarity search: how do you do it? Let's go. Here we have an example entry, and this is what you may store in a vector database. The two most important pieces are the id and the embedding. The id is how the vector database keeps a unique identifier for your entry, and the embedding is the vector embedding; this is what gets compared when we search with a vector database. Vector databases like Milvus are a specific kind of database that make all of this automatic for you: you compare embeddings, and it finds the best ones, the closest ones to your query. Then you also have a bunch of metadata fields, and you may filter on those as well if you want to narrow your query, to narrow your search when you run it. So let's say you only want publications from Towards Data Science: then you can put a filter for that. If you want publications from a different one, you can put a filter on that too, and that can give you better search results. And how do you know that something is similar to something else? Well, that's the whole concept of similarity search. What is it, how does it work? As a basic rule of thumb, you always have a similarity metric for your index. People always ask me, which one should I use? Usually, match the one that was used to train your embedding model. It can be written on Hugging Face, for example, depending on the model you use, or in the research paper. So if it's been trained with cosine, go with cosine; just try to match the one your embedding model was trained with.
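The entry layout and metadata filtering described above can be sketched in a few lines. The entries and field names here are invented for the example; a real vector database applies the filter and the distance computation for you:

```python
import numpy as np

# Toy entries shaped like the example: an id, an embedding, and metadata
# fields you can filter on before the vector comparison runs.
entries = [
    {"id": 1, "embedding": np.array([0.9, 0.1]), "publication": "towards data science"},
    {"id": 2, "embedding": np.array([0.1, 0.9]), "publication": "towards data science"},
    {"id": 3, "embedding": np.array([0.95, 0.05]), "publication": "other"},
]

def search(query, publication=None, k=1):
    # 1. Narrow the candidates with the metadata filter (if any).
    candidates = [e for e in entries
                  if publication is None or e["publication"] == publication]
    # 2. Rank the remaining candidates by distance to the query embedding.
    candidates.sort(key=lambda e: np.linalg.norm(e["embedding"] - query))
    return [e["id"] for e in candidates[:k]]

q = np.array([1.0, 0.0])
print(search(q))                                      # unfiltered: id 3 is closest
print(search(q, publication="towards data science"))  # filtered: id 1
```

The filter changes which vectors are even eligible, which is why filtering can both speed up a query and change its results.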
As an overview of the search: you have a text, which is "find the two most similar images, documents or videos". That is processed through an embedding model, you get a vector, and then you're going to find the most similar ones. You can see that the most similar ones to our request are the first one in the database and the second-to-last one. That's basically what happens; then it gives you the results back. In a nutshell, that's how vector search works. But I've been telling you I'd tell you about the metrics, so let's go over the metrics. You have the first one, which is called L2, or Euclidean, and this measures distance in space: if you have a right triangle, it measures the direct distance in space. You can imagine this as the amount of space between two objects; say, how far my face is from the screen. That's what it's measuring. One problem, though, is that this metric is sensitive to scale, as well as to the vectors' relative location in space. What that means is that vectors with large values will have a larger Euclidean distance than vectors with smaller values, even if the vectors are very similar in direction. So in most cases you don't use it with a deep learning model; you would rather use it with more basic vector encoding schemes like LSH, which means locality-sensitive hashing. You very likely won't see it if you work with LLMs in general. Then you have the inner product. It's a bit more complicated than the Euclidean distance: it measures the projection of one vector onto the other, and the bigger the angle between the two vectors, the smaller the inner product. What we do is project these two points from the origin, and the inner product is the projection of one point onto the other, at a right angle.
This measures not only the angle difference between the two vectors, but also the magnitude: it measures the difference in orientation and magnitude, whereas L2 is only magnitude. And then the last one, which is very likely the most famous one: cosine. This one measures orientation. Cosine is the difference in angles between two vectors: you just measure how big the angle is between your two vectors. It is not affected by the size of the vectors, only by the angle between them. So it's a really good one: no matter how big your vector is, it doesn't really matter. This one is mostly used in NLP; usually, if you work with LLMs, the embedding models are going to use cosine. Also, as you may have noticed, cosine is the normalized inner product. So if the model was trained using cosine similarity, you can either use cosine similarity, or you can normalize your vectors and use the inner product. It's usually not suitable, though, when you have data where the magnitude of the vectors is very important; you have to take that into account. For example, it's not really appropriate for comparing the similarity of image embeddings based on pixel intensities. So you may choose different types of embeddings. You have sparse embeddings, depending on the model: SPLADE, BGE-M3, and there are specific kinds of indexes that go with them very well. If you have dense embeddings, with OpenAI, Cohere, or different models like that, then you have different distances, IP, L2 or cosine, and you can use Faiss or HNSW. And then you have binary embeddings as well; Cohere has binary embeddings, for example, and you may use different indexes and a different distance for those.
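The three metrics, and the relation between cosine and the normalized inner product mentioned above, can be checked directly in NumPy with two vectors that point the same way but differ in magnitude:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude

l2 = np.linalg.norm(a - b)                                       # Euclidean distance
ip = np.dot(a, b)                                                # inner product
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity

print(l2, ip, cosine)   # L2 and IP see the scale difference; cosine is 1.0

# Cosine is the inner product of the normalized vectors:
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(np.dot(a_n, b_n), cosine)
```

L2 reports a nonzero distance and IP grows with magnitude, while cosine says the two vectors are identical in orientation, which is exactly why cosine is the usual choice when only direction carries meaning.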
That's kind of it for the similarity metrics. Now we go a bit deeper: we go into indexes. Keep in mind, all of them have pros and cons. There's not one index that is absolutely the best for everything, that is going to be the cheapest and have the best recall; that doesn't happen. So first, what does an index do? It facilitates efficient searching and retrieval of similar embeddings. It's a special data structure that is built on top of the embeddings; that's basically what an index is. You have different strategies: tree-based, graph-based, hash-based and cluster-based indexes. I will only talk about graph-based and cluster-based; you can look up the others, but they're not really relevant for us. These are some of the indexes that Milvus supports; you can see we have a lot of them. But no worries: there are many options, and as I've said before, they all have their pros and cons. So I'm going to go a bit deeper into the different indexes so you can have an understanding of how they work. All right, the first one is the flat index, and this one is quite straightforward, I would say; it's the most elementary similarity search. It utilizes the k-nearest neighbors algorithm, also called kNN. Basically, consider a database with a collection of, say, a hundred embeddings, and a query embedding, which we're going to call Q. If you want to find the most similar embeddings to Q, then kNN will calculate the distance between Q and each embedding in the database, using a specific metric that you define. Then it identifies the embedding with the smallest distance as the most similar to Q. It works amazingly for accuracy; it's very accurate. The problem is that it's an exhaustive search, so scalability is a concern: it's very accurate, but it's very slow.
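The flat index described above is just brute-force kNN, and a minimal sketch of it fits in a few lines (random data here, L2 as the metric):

```python
import numpy as np

rng = np.random.default_rng(42)
database = rng.normal(size=(100, 8))   # 100 embeddings, 8 dimensions
q = rng.normal(size=(8,))              # the query embedding Q

def flat_search(q, database, k=3):
    # Exhaustive kNN: compute the distance from q to every embedding,
    # then take the k smallest. Exact results, but O(n) work per query.
    distances = np.linalg.norm(database - q, axis=1)
    return np.argsort(distances)[:k]

top = flat_search(q, database)
print(top)   # indices of the 3 nearest embeddings
```

Every query touches every vector, which is why flat gives perfect recall but scales linearly with the collection size.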
Also, the time complexity increases linearly with the number of vector embeddings in the database. So that's flat: very straightforward, but it doesn't really scale. Then you have inverted file flat, also called IVF_FLAT, and it's very similar to flat, but instead of using k-nearest neighbors, it uses an ANN algorithm, which is approximate nearest neighbors. How it works is that it divides the embeddings into several partitions, and those partitions don't intersect; that's very important. Each partition has a centroid, and every vector embedding in the database is associated with a specific partition, based on the nearest centroid. So when a query embedding Q is provided, IVF will only calculate the distance between Q and each centroid, rather than the entire set of embeddings in your database, so you already have a tremendous gain in time. The centroid with the smallest distance is then selected, and the embeddings associated with that partition are used as candidates for the final search. Once you know which partition to use, you can search over all the embeddings in that partition. So that's nice: it really speeds up the search process. You may have a problem, though, which is that the candidate that is found might not be the true nearest neighbor. Then you can play with the hyperparameters: there is a hyperparameter called nprobe, which is the number of partitions you search. Searching more partitions can really help ensure that the nearest embedding to the query is not missed, even if it's in a different partition. With that, the cost of search increases, but it's still going to be better than flat anyway. Then you have another one, which is inverted file with quantization.
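The IVF search just described, compare against centroids first, then search only the `nprobe` nearest partitions, can be sketched as follows. For brevity the centroids are random data points rather than fully trained k-means centers; this is a toy, not Milvus's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 8))

# Partition the database: pick centroids (random seeds here; real IVF runs
# k-means to convergence) and assign every vector to its nearest centroid.
n_partitions = 10
centroids = database[rng.choice(len(database), n_partitions, replace=False)]
assignments = np.argmin(
    np.linalg.norm(database[:, None] - centroids[None], axis=2), axis=1)

def ivf_search(q, nprobe=2, k=3):
    # 1. Compare q only against the centroids, not the whole database.
    centroid_dist = np.linalg.norm(centroids - q, axis=1)
    probe = np.argsort(centroid_dist)[:nprobe]       # nprobe nearest partitions
    # 2. Exhaustive search, restricted to those candidate partitions.
    candidates = np.where(np.isin(assignments, probe))[0]
    dist = np.linalg.norm(database[candidates] - q, axis=1)
    return candidates[np.argsort(dist)[:k]]

q = rng.normal(size=(8,))
print(ivf_search(q, nprobe=2))
```

With `nprobe` equal to the number of partitions, the search degenerates back to an exact flat scan; smaller values trade recall for speed.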
It's the same idea, but when memory is limited, plain IVF may not be optimal. To address memory constraints, what we can do is combine IVF with quantization: mapping the values in each vector dimension to lower-precision integers. There are two kinds of quantization: the first one is scalar quantization, and the second one is product quantization. Scalar quantization basically maps floats to integers. The first step is to determine and store the maximum and minimum values of each dimension of the vectors in the database, because that's what you need to calculate the step size. The step size is crucial to scale the floating-point numbers in each dimension to their integer representation. You can see the formula here: it's the maximum value minus the minimum value, divided by the number of bins, which is something you decide. To get the quantized version of the nth vector dimension, you subtract the minimum value of that dimension from its value, and then you divide the result by the step size. Then you have another one, which is product quantization. This one addresses the limitations of scalar quantization: you consider the distribution of each dimension, you divide the vector embeddings into sub-vectors, you perform clustering within each sub-vector space to create centroids, and then you encode each sub-vector with the id of its nearest centroid. So again, here you have the query vector, which is a float vector.
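The scalar quantization steps above, per-dimension min/max, step size = (max − min) / bins, then subtract-and-divide, can be sketched directly (256 bins, so each dimension fits in one byte):

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 4)).astype(np.float32)

bins = 256   # quantize each dimension to one byte

# Per-dimension min and max, then the step size from the formula.
mins = vectors.min(axis=0)
maxs = vectors.max(axis=0)
steps = (maxs - mins) / bins

def quantize(v):
    # Subtract the dimension's minimum, divide by the step size.
    q = np.floor((v - mins) / steps)
    return np.clip(q, 0, bins - 1).astype(np.uint8)

def dequantize(q):
    # Reconstruct each value as the midpoint of its bin.
    return mins + (q + 0.5) * steps

codes = quantize(vectors)     # 4 bytes per vector instead of 16
error = np.abs(dequantize(codes) - vectors).max()
print(codes.dtype, error)     # uint8, and error stays below one step size
```

A 4x memory reduction here (float32 to uint8), at the cost of a bounded per-dimension rounding error.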
Then you consider the distribution, you divide the vector into sub-vectors, and you perform clustering, which is what you have in the image at the third step: you perform clustering to create centroids, and then you encode each sub-vector with the id of its nearest centroid. That's how you can save a lot of memory; product quantization offers much more powerful memory compression, so it can be really good if you have massive memory constraints, for example. Then, last but not least, one that is usually called HNSW, a bit of a mouthful. This one is a graph-based indexing algorithm, and it combines two key concepts: skip lists and navigable small world graphs, otherwise known as NSW. So it's a graph. A skip list is a probabilistic data structure that is composed of multiple layers of linked lists. And NSW graphs are built by randomly shuffling data points and inserting them one by one, where each point is connected to a predefined number of edges. This creates a graph structure that exhibits the small-world phenomenon, where any two points are always connected through a relatively short path. That's basically how it works. So if we go to the skip list: again, it's a probabilistic data structure composed of multiple layers of linked lists. The lowest layer, layer zero, contains the original linked list with all the elements. As we move to the higher layers, the linked lists progressively skip over elements, and you have fewer and fewer elements with each higher layer. During the search process, you start from the highest layer and gradually descend to lower layers until you find the desired element. Because of this, skip lists can really speed up your search process.
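The product quantization steps, split into sub-vectors, cluster each sub-space, store only centroid ids, can be sketched like this. As in the IVF sketch, the "clustering" is a single assignment step against randomly seeded centroids, standing in for full k-means:

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.normal(size=(1000, 8))

n_sub = 2          # split each 8-d vector into 2 sub-vectors of 4 dims
n_centroids = 16   # 16 centroids per sub-space, so each code fits in 4 bits

sub_dim = vectors.shape[1] // n_sub
codebooks, code_cols = [], []
for s in range(n_sub):
    sub = vectors[:, s * sub_dim:(s + 1) * sub_dim]
    # One assignment step from random seeds stands in for full k-means here.
    centroids = sub[rng.choice(len(sub), n_centroids, replace=False)]
    ids = np.argmin(np.linalg.norm(sub[:, None] - centroids[None], axis=2), axis=1)
    codebooks.append(centroids)
    code_cols.append(ids.astype(np.uint8))

codes = np.stack(code_cols, axis=1)   # each vector is now just 2 centroid ids

def reconstruct(i):
    # Rebuild an approximate vector by concatenating its nearest centroids.
    return np.concatenate([codebooks[s][codes[i, s]] for s in range(n_sub)])

print(codes.shape, reconstruct(0).shape)
```

Each 8-float vector is compressed to two small ids plus a shared codebook, which is where PQ's much stronger memory compression comes from.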
As an example, let's say we have a skip list with three layers, as you can see here, and eight elements in the original linked list. If we want to find the element seven, the search process looks like what we have in the image. We look and we say: OK, seven is higher than four, so we go to the right of four; then six is lower than seven, so we go to the right of six. And that's how you find seven; that's basically how it works for a skip list. Then you have NSW graphs. Again, as I said, they are built by randomly shuffling data points and inserting them one by one, and this creates a graph structure that exhibits the small-world phenomenon: any two points are always connected through a short path. That's how it can be really good and really efficient. If I were to show you a bit of an example of how it works, this is it: a multi-layer graph where the lowest layer contains the complete set of data and the higher layers contain only a small fraction of the points, just as we've seen for the linked lists. HNSW starts by assigning a random integer from zero to l to each data point; the points can be referred to as nodes, as we call them in graph algorithms, and l is the maximum layer at which the data point can be present. So if a data point has l equal to two, and the total number of layers in the graph is five, then the data point will only be present up to the second layer, and it won't be available in layer three and upwards. During the search process, HNSW starts by selecting an entry point, which should be a data point present in the highest layer.
Then it searches for the nearest neighbor closest to the query point, and recursively continues the search in lower layers, again and again, until the nearest data point is found. So now we have this image again, and hopefully it makes a bit more sense. And again, there's no good or bad index; it really depends on your application requirements. Are you more interested in query speed? Do you have a lot of inserts or deletes? Are you constrained by the machine type? It really depends. For example, if you need 100% recall, then flat would be the best. Then, depending on the size of your index, you might use a standard IVF; if your index is between 2 GB and 20 GB, maybe consider product quantization or HNSW. You also have composite indexes, and disk-based indexes as well, which can be really useful. And I'll finish with the Milvus architecture: Milvus has been built to be fully distributed, and that's what you see here. On the bottom part you see the S3 storage, the object storage. Above that you have the worker nodes, and those are fully distributed: you have the query nodes, the data nodes and the index nodes. That's how we can scale up to billions, thanks to those, because they are fully independent, so we can scale them up and down. And that's how Milvus works; that's how vector databases work. I hope this talk was useful to you. If you have any questions, you can chat with me on Discord directly. Also check out our GitHub: Milvus is fully open source, so you can check our repository. You can find me on LinkedIn as well. Thank you very much for watching this talk, and I hope you have a good day.
...

Stephen Batifol

Developer Advocate @ Zilliz



