Conf42 Machine Learning 2022 - Online

Build your own search application with vector search engine Weaviate

Abstract

In machine learning applications such as recommendation tools or data classification, data is often represented as high-dimensional vectors. These vectors are stored in so-called vector databases. With vector databases you can efficiently run search, ranking and recommendation algorithms. As a result, vector databases have become the backbone of ML deployments in industry.

This session is all about vector databases. If you are a data scientist or a data/software engineer, this session will be interesting for you. You will learn how you can easily run your favorite ML models with the vector database Weaviate. You’ll get an overview of what a vector database like Weaviate can offer, such as semantic search, question answering, data classification, named entity recognition, multimodal search, and much more. After this session, you will be able to load in your own data and query it with your preferred ML model!

Session outline

1: What is a vector database? You’ll learn the basic principles of vector databases. How data is stored, retrieved, and how that differs from other database types (SQL, knowledge graphs, etc).

2: Performing your first semantic search with the vector database Weaviate. In this phase, you will learn how to set up a Weaviate vector database, how to make a data schema, how to load in data, and how to query data. You can follow along with examples or you can use your own dataset.

3: Advanced search with the vector database Weaviate. Finally, we will cover other functionalities of Weaviate: multi-modal search, data classification, connecting custom ML models, etc.

Summary

  • Laura Ham is a machine learning product researcher at SeMI Technologies, the company that builds the vector search engine Weaviate. In this talk she covers vector search and vector databases, and she shows how they work in live demos.
  • Unstructured data is what I call data you find in the wild, such as longer text documents with a lot of unstructured text in them. Building a traditional search engine on structured data is really easy. With unstructured data, it's difficult to find the information that's hidden in it.
  • Vector databases are slightly different from graph databases. In a graph database, data is not stored in traditional rows and columns but in models or classes. By adding vectors, you add some context or meaning to your data. This allows you to also search through it semantically.
  • nearText is a function that we defined which uses the semantic search principles. You can also use machine learning models that support multimodal search, which means you can mix media types within Weaviate. There's also a demo with the complete English Wikipedia indexed.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to my presentation about the vector search engine Weaviate. In this presentation, I will explain to you what vector search and vector databases are, and I will show you how it works in live demos. I am Laura Ham. I am a machine learning product researcher at SeMI Technologies, and at SeMI Technologies we build the vector search engine Weaviate. You can connect with me on LinkedIn, or you can join the Weaviate Slack channel, which is the Slack channel of the open source community. Today I will talk about vector search and vector databases. Before I dive into vector databases, I want to explain what structured and unstructured data are and what the challenges with unstructured data are. Then I will explain everything about vector search and vector search engines, and I will continue by showing this in live demos. So first, before we get to what a vector database is: what's difficult about unstructured data? If we compare structured data with unstructured data, we see that structured data is what you find in a typical database. A typical database is a relational database with rows and columns and connections between the tables. So it's mostly quantitative data. In this example, we have an ID number, a name, which is a short string, and a city, which is also a short string. This is what typical quantitative data looks like in a relational database. On the other hand, we have unstructured data, and that is what I call data you find in the wild. That means longer text documents with a lot of unstructured text in them, raw sensor data, images or videos, audio files, and also social media data. What you see in all those data sources is that there's a long piece of text, or an image, or a video, and in this data there's a lot of information hidden.
So in one image, there can be a lot of information hidden, but it's difficult to capture that in a traditional structured database. With structured data, on the left-hand side, it's easy to search through, easy to store, and easy to draw conclusions from, because you have quantitative data. But if you have the unstructured data on the right, it's difficult to find the information that's hidden in it. Let's take the example of a search engine on top of these data sources. With a search engine on structured, relational data, it's really easy, because it's quantitative data stored in those tables. If you have unstructured data, it becomes a lot more challenging. Here we have an example. Let's say we have a data set of news articles, and all those news articles have a title and a longer piece of text, which is the actual article. We have this one article which is about dogs, so the title is "The origins of dogs: how dogs were domesticated". If you then have a search engine and you search with the natural language query "animals", in a traditional search engine with structured or relational data you may not be able to find the article about dogs. This is because we search for the word "animals", while the article is not actually using the word "animals" but the word "dogs". As humans, we know that animals are related to dogs, that dogs are a type of animal, so we probably want to see this article as a result. But if we don't use any semantic layer or machine learning that understands natural language, you may not be able to find this article in a traditional search engine.
Then if we have a vector search engine, which uses machine learning to understand the semantics behind language, you might actually be able to find this article when you search for "animals". That is because, as I explained before, we know as humans that dogs and animals are semantically similar. This is hidden in the English language, and the machine learning model is able to connect those two. Before I explain in detail how vector search actually works, I want to show you that you already know one vector search engine, and that is Google. Google is a vector search engine too. With Google, you can put in a very abstract question, like in this example: what color of wine is Chardonnay? Google will browse through all the openly available web pages, and it tries not only to find the web page that is relevant to this query, but also to extract an exact answer from this web page. So the question is very abstract, Google is able to find a concrete answer in a piece of unstructured data, and it also does it really fast. In this case, you put in the query "what color of wine is Chardonnay?" Google first browses through all its indexed web pages and finds web pages that might contain the answer. It found almost 13 million results in less than a second, and it's then also able to extract the answer, "white wine", from a paragraph in one of these web pages. So that's pretty impressive, and we're all really happy with that, of course. But Google does this with the unstructured data that is available on the public web, so all the web pages that are accessible by everyone. This is only a very small portion of the data that's available in the world, because most data is of course in the hands of private companies or of people themselves.
And we don't want Google to see what your company has in terms of data. So if you want to search through your own data, you cannot use Google for that. Then the question becomes: what if you could do the same thing, searching through unstructured data, on your own data, in a simple and secure way? That's where open source vector search engines come in. Weaviate is such a vector search engine, and it is open source. As it says here, instead of just storing raw data like traditional databases do, Weaviate leverages the power of machine learning models to vectorize data. What vectorization means is that machine learning models try to understand what your data is about while it is stored. Also, when you search through it using Weaviate, your search query will be put through this machine learning model and be vectorized, and Weaviate is able to find data objects that are near your search query. So you can discover or classify similar search results in your data set. In short, Weaviate is a vector search engine that tries to understand what your data is about. On the right you see a three-dimensional space, and this is a simplified example of how data is stored in a vector search engine or a vector database. In this case we have three dimensions. In reality this goes up to 300 or even 3000 dimensions, and all these dimensions capture some kind of meaning of the data. So on the right we have some words and some images. These words and images have a meaning to a human, and the meaning determines where the data object is stored in the database. For example, on the left we have the word "chicken" and an image of a chicken. Those have more or less the same meaning, so they are close together in the database. They are also relatively close to other animals like wolf, dog, cat, and an image of a cat, but far away from other things.
So all these animals are stored far away from things like an apple or a banana or the company Apple, because they're not semantically similar. This is how data objects are stored in a vector database. When you then do a search query, like "animals" in the earlier example, or in this case the search query "kitten", it also puts the query in the high-dimensional space, and it returns data objects that are close to this search query. So in this case, if you search for "kitten", you will see data objects like "cat" or the image of a cat returned. I will explain this in more detail in the coming minutes. But first, I want to show you how vector databases are slightly different from graph databases. With graph databases, as you might know, data is also not stored in traditional rows and columns, but in models or classes, and all those data objects can have relations to each other. If we take the example of news articles again, we might have a data object Article with the title "The origins of dogs". It is written by some author and has some category, which are separate data objects, and there are relations between those data objects. So the article has an author, this person, John Doe, and this person wrote some articles, for example this particular article, and this article has a category, Nature in this case. So this is a graph database. This is different from vector databases. What you do with vector databases is add dimensions to it. Here I have added two dimensions. What happens is that the data objects in this graph database will be stored exactly where the meaning of the data object is. In this case, we have indexed, for example, the English language.
And you see that the category Nature, this data object, will be stored close to, for example, the concept of environment and the concept of animals, but far away from technology and laptop. The same goes for this article about dogs. It's an article, so it's close to newspaper, and it's close to dog and also to cat and animals, because they're all semantically related. So by adding these vectors, which are essentially just coordinates in a high-dimensional space, you add some context or meaning to your data, and this allows you to also search through it semantically. So how does this work from a technical perspective? The first step of doing vector search, or storing data in a vector database, is choosing some kind of model that can map your data to vectors, to coordinates in this high-dimensional space. For this, you can for example use machine learning models. So the first step is an encoder model. Going from data to a vector, to a coordinate, is encoding: it transforms data into vectors. These models are also called retriever models. One example is dense retrievers. Dense retrievers are deep neural networks, or machine learning models, that calculate the vector from a word. These can be language models. You'll see an example here on the right. Examples of language models are BERT models or Sentence Transformers models. You can also do this for images, and one example there is a ResNet-50 model. All these models can calculate a vector position from a piece of data. You can also do this without deep neural networks, by using sparse retrievers like TF-IDF or BM25. They don't use machine learning models, but use word frequencies in the documents instead, so they are a bit more lightweight. So now we have a model that is able to transform data, so natural language or images or videos or whatever, into vectors.
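As a rough illustration of the sparse retriever idea described above, here is a minimal TF-IDF sketch in plain Python. It is not how Weaviate or any particular library implements it; the three toy documents, the whitespace tokenizer, and the function names `build_index` and `encode` are assumptions made up for this example.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()

def build_index(docs):
    """Return (vocab, idf, vectors) for a list of non-empty documents."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in vocab}
    vectors = [encode(d, vocab, idf) for d in docs]
    return vocab, idf, vectors

def encode(text, vocab, idf):
    """Map a text to a sparse-style TF-IDF vector over the known vocabulary."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[t] / total * idf[t] for t in vocab]

# Toy documents, invented for illustration.
docs = ["the origins of dogs", "dogs and wolves", "apple and banana"]
vocab, idf, vectors = build_index(docs)

dogs_vec = encode("dogs", vocab, idf)
animals_vec = encode("animals", vocab, idf)
```

Because "animals" never occurs in the toy documents, `animals_vec` is all zeros. A purely frequency-based encoder has no way to relate "animals" to "dogs", which is the gap that dense retrievers are meant to close.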
The second step is that you can use Weaviate, for example, to import all this data and actually store it in the vector database. Weaviate looks at all the data objects and uses this encoder machine learning model to vectorize the data, and then each object will be placed in this hyperspace, this high-dimensional space. You end up with something like what I showed before. So we have a cat that is semantically related to dog and to animals and to an image of a cat, but it's far away from apple and banana. Then, if you use this database as a search engine, Weaviate will also take your search query, which you can put in as natural language or as an image, for example. It will also use the encoder or retriever model to vectorize this query, so the query also gets a vector position, like you see here on the right. Then it does an approximate nearest neighbor search, an ANN search, to retrieve data objects that are close to this query. So if I search for "kitten", I will get back results that are, for example, "cat" or the image of a cat, and so on. This is called approximate nearest neighbor search. It works, for example, by calculating the cosine distance between the data objects and the search query, and you only want to retrieve the results that are closest to the query. With a vector search engine like Weaviate, this is done very efficiently. Even if you have millions or billions of data objects or search queries, you can still retrieve results very efficiently. This is because it uses, for example, the indexing algorithm HNSW to search for these data objects very efficiently. So to summarize: you have a pretrained machine learning model, for example a BERT or Sentence Transformers model from Hugging Face. You have your own data, and with Weaviate you can index it and store it in this vector database. You can do a search query, and then it will do an approximate nearest neighbor search to retrieve the data objects close to this query.
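To make the nearest neighbor step concrete, here is a small sketch that ranks stored vectors by cosine similarity to a query vector. Note that this is an exact, brute-force search, not an approximate one: an index such as HNSW exists to avoid comparing the query against every object. The three-dimensional vectors and the names `index` and `nearest` are invented for illustration; real embeddings would come from an encoder model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, index, k=2):
    """Brute-force search: score every stored vector, return the k closest names."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Toy 3-dimensional "embeddings", made up for this sketch.
index = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "banana": [0.2, 0.1, 0.8],
}
kitten = [0.95, 0.75, 0.1]  # hypothetical vector for the query "kitten"

print(nearest(kitten, index, k=2))  # ['cat', 'dog']
```

The brute-force version costs one comparison per stored object for every query, which is why large collections rely on an approximate index instead.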
So this is what a relatively simple search pipeline looks like, and you can extend this pipeline. So far it was just a very simple search, and you use retriever models for that. But you can extend this vector search pipeline with, for example, reader or generator models. Reader models are models that extract information from the retrieved data objects. That means you can do question answering. With question answering, you put in a natural language query, really a question, as your search query, and then Weaviate is able not only to find a relevant data object, but also to extract an answer from these relevant data objects. That's done by a reader model on top of the retriever model. Another example is named entity recognition. You can also extend the pipeline with generator models. Generator models use natural language generation to generate an answer, for example from the retrieved data objects. So for question answering, a reader model only searches in a data object for a particular answer but doesn't modify it. It just retrieves a piece of data. A generator model can actually generate language from this data object. For example, it can summarize a piece of text. Now that I have explained a bit how this works, the question becomes: how do you use it? How do you interact with such a vector database? Weaviate has two types of API endpoints. First, it has all the usual RESTful API endpoints for CRUD operations, and additionally it has a GraphQL API for intuitive querying. So you can of course retrieve all the data objects with a Get query. You can do semantic search or, for example, question answering, depending on what kind of machine learning models you have attached. I will show you this. There are two demos that are always available for you. The first one is a relatively small data set of news articles, with fewer than 4000 news articles in it.
So that's really small if we talk about machine learning, but this is just for demo purposes. The second one is the complete English Wikipedia, indexed. That's millions of pages from English Wikipedia, and you can also search through it using Weaviate. But for now I will just show you the small data set. Here I'm connected to this data set, which has news articles with natural language text already indexed in Weaviate. What you see here, I'll make it a bit bigger, is a graphical user interface for querying, and you can query this data set using GraphQL. I will build this query step by step to explain. On the left-hand side I build the query. On the right-hand side you will see the results. So I can do a simple query to get all the articles, the news articles that are in the data set, and I just query, it's not the name, it's title. I just query the title. This is a really simple Get query, just to show you how it works. On the right you see the result. Now I have a list of all the articles that are in there, and I just see the title. I can ask for more data properties here, for example the summary of the article. With this query there's of course no semantic search happening, so there's no machine learning happening here. But I can add this to the query. Let's say I want to find the articles that are near the concept of animals, like in the example in the slides. I can do a nearText search. nearText is a function that we defined which uses the semantic search principles. For example, I can go for "animals". So now I want to get all the articles that are near the concept of animals, and I want to see the title and the summary. When I run this, on the right I get back the articles ordered by relevance to the query, so by how close they are in the vector space to the search query.
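The query built in the demo can be sketched roughly as follows. This snippet only assembles the GraphQL text and a JSON request body and does not contact any server. The class name `Article`, the properties `title` and `summary`, and the helper `build_near_text_query` are assumptions based on what is shown in the talk; check the schema of your own Weaviate instance before reusing them.

```python
import json

def build_near_text_query(class_name, properties, concept, limit=5):
    """Assemble a Weaviate-style GraphQL Get query with a nearText filter."""
    fields = "\n      ".join(properties)
    return (
        "{\n"
        "  Get {\n"
        f"    {class_name}(nearText: {{concepts: [{json.dumps(concept)}]}}, limit: {limit}) {{\n"
        f"      {fields}\n"
        "    }\n"
        "  }\n"
        "}"
    )

# Class and property names are assumptions taken from the demo.
query = build_near_text_query("Article", ["title", "summary"], "animals")

# A GraphQL request is usually sent as a JSON body with a "query" field.
payload = json.dumps({"query": query})

print(query)
```

Dropping the `nearText` argument would give the plain Get query from the first step of the demo, which lists articles without any semantic ranking.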
And as you can see, the first result is the example that I used in the slides: "The origins of dogs", a new idea about how dogs were domesticated. You can also see that this article is about dogs and predating domestication, and it's all about wolves. So it's all about animals, but it doesn't literally use the word "animals". This is how you can see that you really do a semantic search. The second result is about cattle in Nigeria. I'm not sure if the word "animals" is used here, but I don't see it. So Weaviate, or the machine learning model behind it, in this case a Sentence Transformers model, understands that with "animals" I also want to see results about dogs or cattle, those kinds of things. I can also show you this: I can ask for the certainty. Certainty is a number ranging from zero to one indicating how closely this data object is related to the search query. So in this case you can say that Weaviate is 79% sure that this result is relevant to what I'm searching for. Okay, so as I showed you, you can extend the search pipeline with reader models, and one of the reader models is a question answering model. So I can also ask Weaviate, or this data set, a detailed question. Here I have an exact question: who is the king of the Netherlands? I'm really looking for one specific answer. I don't want to see the whole data item, I just want to know who the king of the Netherlands is, and I need to ask for the answer here. One answer is enough. If I do this, what happens is that Weaviate will still do a search query first. It gets articles that are near this search query, so it will find articles that might contain something about "king" and "the Netherlands". It found this result about Dutch royals. Then, with this result and the search query, it uses a question answering machine learning model to extract an exact answer.
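The question answering request from the demo might look roughly like the sketch below. It again only builds the query text. The `ask` argument and the `_additional { answer { ... } }` fields follow Weaviate's question answering module as far as we know, but treat the exact field names, the class name `Article`, and the helper `build_ask_query` as assumptions to verify against the documentation of your Weaviate version.

```python
import json

def build_ask_query(class_name, question, limit=1):
    """Assemble a Weaviate-style GraphQL query that asks for an extracted answer."""
    return f"""{{
  Get {{
    {class_name}(ask: {{question: {json.dumps(question)}}}, limit: {limit}) {{
      title
      _additional {{
        answer {{ result certainty }}
      }}
    }}
  }}
}}"""

# One answer is enough, as in the demo, so limit defaults to 1.
query = build_ask_query("Article", "Who is the king of the Netherlands?")
print(query)
```

The retriever still selects the candidate article. The reader model then fills in `result` with a span of text taken from that article.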
So here it found this answer, and you can see that it's found already in the first sentence: King Willem-Alexander of the Netherlands. So the machine learning model understands that this king is the king of the Netherlands. There's also the complete English Wikipedia index, as I mentioned, so you can also ask questions to Wikipedia, and it works similarly. You can play around with this if you want. You can find the links in our documentation on weaviate.io. This was a very simple example, with just indexing text, but you can do more. You can also use machine learning models that support multimodal search, so you can mix media types within Weaviate. This means you can index images at the same time as text, so you can search by an image and retrieve text, or the other way around. Then there is what I showed you before, question answering. As I explained, Weaviate works with machine learning models to vectorize data. We have multiple machine learning models available out of the box, for example most of the models from Hugging Face, or our own trained machine learning model called the Contextionary. We have models from OpenAI, and so on. But you can also use Weaviate without any machine learning models, if you want to use it as a pure vector search engine. Or you can use custom machine learning models: if you have your own trained machine learning model but want to, for example, scale it or make a search engine from it, you can use Weaviate for that. And finally, because Weaviate uses these vectors, you can also do automatic data classification, for example kNN classification if you have training data or previously classified data available. Or you can do zero-shot classification if you just want to use the context in Weaviate and you don't have any training data available.
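The kNN classification idea mentioned above can be sketched in a few lines: a new object gets the label that is most common among its k nearest previously classified neighbors. This is a generic illustration, not Weaviate's implementation; the toy vectors, the labels, and the function `knn_classify` are invented for this example.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_classify(vector, labeled, k=3):
    """Return the majority label among the k labeled vectors closest to `vector`."""
    ranked = sorted(
        labeled,
        key=lambda item: cosine_similarity(vector, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Previously classified toy data (vector, label), made up for this sketch.
labeled = [
    ([0.9, 0.8, 0.1], "animal"),
    ([0.8, 0.9, 0.2], "animal"),
    ([0.85, 0.7, 0.15], "animal"),
    ([0.1, 0.2, 0.9], "fruit"),
    ([0.2, 0.1, 0.8], "fruit"),
]

print(knn_classify([0.95, 0.75, 0.1], labeled))   # animal
print(knn_classify([0.15, 0.15, 0.85], labeled))  # fruit
```

Zero-shot classification differs in that there is no labeled set to vote with: the object is instead compared directly against the vectors of the candidate labels.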
So that was my presentation about the vector search engine Weaviate. If you have questions, I'm available in the chat of this conference. Or you can of course join our community Slack channel, which is very active, and there are a lot of people there who can help you out with questions. You can always shoot me an email, of course. And if you want to find out more, you can go to our website, which is weaviate.io. Okay, thank you very much.

Laura Ham

ML Product Researcher @ SeMI Technologies
