Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Everyone, welcome to my talk on an introduction
            
            
            
to vector databases. So over the next 30 to 40 minutes,
            
            
            
              I want to take you through a general introduction of what vector databases
            
            
            
              are and how they help you search over your data
            
            
            
in a much more organic manner.
            
            
            
              And then I want to introduce you to the basic concepts of vector databases,
            
            
            
              how we take those basic concepts and scale them up so that you can search
            
            
            
over hundreds of millions, if not billions, of data objects.
            
            
            
And then I'll end off with what's hopefully an engaging
            
            
            
demonstration of the power of vector databases.
            
            
            
              So before we jump into vector databases, a little bit about myself.
            
            
            
              I'm a developer advocate at Weaviate and I'm an engineer and data scientist
            
            
            
              by training. And the first time I really learned
            
            
            
              about vector databases, that was a bit of a light bulb moment for me
            
            
            
              where I thought, why doesn't everybody just use vector
            
            
            
              databases? They're just a much more natural way to
            
            
            
              search over and deal with your unstructured data.
            
            
            
              And so if the only thing that you get out of this talk is that
            
            
            
              nugget of insight around what vector databases can
            
            
            
              help you accomplish and how they can help improve the projects that you're working on,
            
            
            
              then I'll have considered my job done.
            
            
            
              Okay, so the essence or the main idea
            
            
            
              behind vector databases is that these allow you to go
            
            
            
from a classical approach of keyword search, where, in
            
            
            
              let's say a SQL structured database,
            
            
            
              if you want to look for an item, you would have to do some sort
            
            
            
              of keyword search or exact matching search.
            
            
            
              Vector databases transition from that idea
            
            
            
              and allow you to do a similarity search or a semantic search over
            
            
            
              your data.
            
            
            
              This is a fairly abstract definition,
            
            
            
              so let me further explain this using an example.
            
            
            
              So imagine you've got a bunch of
            
            
            
              documents that you store in your database. The first one here as an
            
            
            
example is how to build a REST API. Imagine you have thousands
            
            
            
              and thousands of these and you want to search over these documents.
            
            
            
              The traditional approach would be you go into your search
            
            
            
              bar, you type out the word python, and the
            
            
            
              traditional keyword search would go in and
            
            
            
              would essentially look for the word python anywhere in
            
            
            
              your document, right? So imagine you only have two documents here.
            
            
            
The word python is nowhere to be found. So the search
            
            
            
returns nothing, and there are no articles that have this keyword.
            
            
            
              The more fundamental problem with this traditional keyword search is that
            
            
            
              it technically doesn't even understand whether you mean
            
            
            
Python the animal, the snake, or Python the programming
            
            
            
              language. And that gets to the heart of the problem here.
            
            
            
Keyword search is just doing string matching to find
            
            
            
the document that you are looking for; it doesn't actually
            
            
            
              understand the context and the meaning behind your
            
            
            
              search. And so on the other hand, if we
            
            
            
              do a semantic search, we go in, we type out the exact same
            
            
            
              word, and now it recognizes that in
            
            
            
              the context of the documents that you've provided me, you're probably not looking for the
            
            
            
              animal, you're looking for a programming language for data scientists,
            
            
            
              which is the Python programming language. And the essence here
            
            
            
              is that it understands your query and it matches it
            
            
            
              with a similar document that's in your
            
            
            
              database. And in order for this to happen, it has to understand
            
            
            
              everything that's in your database, the documents as well as your query, and then it
            
            
            
              has to match it appropriately. And this functionality of
            
            
            
              understanding your data and then going in and matching it is exactly
            
            
            
              what a vector database does. And instead of just being able to
            
            
            
              search over a couple of documents or thousands of documents, it can search over
            
            
            
              hundreds of millions or billions of documents. And that's the
            
            
            
              scalability of the vector database that we'll talk about in a bit.
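To make the keyword-search failure concrete, here's a toy sketch. The two documents mirror the example above; the code is purely illustrative, not a real search engine.

```python
# Toy keyword search: literal substring matching only.
documents = [
    "How to build a REST API",
    "A guide to a popular programming language for data scientists",
]

def keyword_search(query, docs):
    """Return every document containing the query as a literal substring."""
    return [d for d in docs if query.lower() in d.lower()]

# The word "python" appears nowhere, so keyword search returns nothing,
# even though the second document is clearly about Python.
print(keyword_search("python", documents))  # -> []
```

A semantic search would instead compare vectors for the query and the documents, so the second document could still match even though it never contains the literal word.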
            
            
            
              So really, if we want to achieve
            
            
            
              this end goal of semantic search, similarity search
            
            
            
              over our documents, there's a couple of hurdles in our way.
            
            
            
              The first hurdle being that we need to take our unstructured data,
            
            
            
              we need to be able to process it, understand it and search through it.
            
            
            
              And that's a difficult task. The second tall order,
            
            
            
              the second hurdle in the way is that it's not enough to just be able
            
            
            
              to search over and understand a small amount
            
            
            
              of data. We have to be able to do this in a scalable
            
            
            
              and secure way so that we can search over hundreds of millions or billions
            
            
            
              of documents. And so those are the two hurdles that are in the
            
            
            
              way before we can achieve this idea of semantic search that
            
            
            
              I just presented. So we'll go through and see how vector databases
            
            
            
              address each one of these issues.
            
            
            
              So the first issue, how do we understand and search through
            
            
            
              our data? Well, we use machine learning models to understand the
            
            
            
              context of our unstructured data. And so every time I say unstructured data,
            
            
            
              I mean either natural language, text. So this can be documents,
            
            
            
              books, emails, or it can also be
            
            
            
              images, video, audio. All of your unstructured data,
            
            
            
              we pass them through a machine learning model. It understands the context of that
            
            
            
              data and then we use the output of the machine learning model to
            
            
            
learn about that context in our vector database. So let's
            
            
            
              dig deeper into this concept of using ML.
            
            
            
              So let's say you've got your data over here
            
            
            
              on the left hand side, upper left. This can be emails, videos,
            
            
            
              audio, images. We're going to pass each
            
            
            
              one of these through our machine learning model.
            
            
            
              And our machine learning model is going to spit out a vector
            
            
            
corresponding to each one of our unstructured
            
            
            
              data objects. And the way that I like to think about this is that our
            
            
            
              unstructured data is in human understandable format.
            
            
            
              We can read an email, we can view and understand an image,
            
            
            
              but the machine or the computer can't understand it. So these
            
            
            
              vectors that the computer generates are machine understandable.
            
            
            
So it's much easier for a machine to
            
            
            
              understand these vectors for our data.
            
            
            
              So now that we have our data translated into machine language,
            
            
            
              now what we're going to do is project this and plot this out
            
            
            
              and create vector representations. So each object
            
            
            
              that we had a vector for here is now represented
            
            
            
              using a green dot on the right hand side. And so we call
            
            
            
              this a vector space or an embedding
            
            
            
              space. So we take all of our vectors and we embed them
            
            
            
              into this vector space that I've demonstrated
            
            
            
using a 3D graph. There are two
            
            
            
              important things about this vector space. The first one is that
            
            
            
              it preserves similarity between objects.
            
            
            
So, for example, if you look at the vector for the word cat,
            
            
            
              it is going to be close by to the vector for the image
            
            
            
              of the cat. And the machine learning model understands
            
            
            
              that the word cat and the image of the cat are similar in nature
            
            
            
as unstructured data. So it's going to keep the vectors closer together.
            
            
            
              So similarly, if you have the word chicken and the image of a chicken,
            
            
            
              those are going to be closer together in vector space. On the
            
            
            
              other hand, if you take a look at the word cat and
            
            
            
              the word banana, the vector for the word banana, those two concepts
            
            
            
              are further away, and so the corresponding vectors are also further away.
            
            
            
And so in this way, you can take unstructured



data objects, show them to a human,
            
            
            
              and you can ask them which two of these things are similar and which two
            
            
            
of these things are different. Now, you can do the same thing by computing a
            
            
            
distance metric between your vector objects
            
            
            
              in vector space, and a machine can do this much more easily.
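As a toy illustration of that distance idea, here's a cosine-similarity check over hand-made 3-dimensional vectors. The numbers are invented for the example; real models emit hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-d "embeddings", invented purely to illustrate the geometry.
vectors = {
    "cat (word)":    [0.9, 0.8, 0.1],
    "cat (image)":   [0.8, 0.9, 0.2],
    "banana (word)": [0.1, 0.2, 0.9],
}

print(cosine_similarity(vectors["cat (word)"], vectors["cat (image)"]))    # -> ~0.99, very similar
print(cosine_similarity(vectors["cat (word)"], vectors["banana (word)"]))  # -> ~0.30, dissimilar
```

The machine never needs to "look at" a cat picture; it just compares numbers, which is why this judgment is cheap to make at scale.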
            
            
            
              The second thing that's important to keep in mind is that even though I'm showing
            
            
            
you these green dots in a 3D space where I have three axes,
            
            
            
              in reality these vector spaces are often very high
            
            
            
              dimensional. They can be 300 dimensional, 600 dimensional, 2000 dimensional,
            
            
            
              depending on how much data you want to preserve, how accurate you
            
            
            
              want the vector representation to be, you're going to choose a higher dimensional
            
            
            
              vector space. It's kind of like asking a person to read
            
            
            
              a book and summarize it into one page. And that
            
            
            
              summarization is going to lose a lot of data. If you ask them to summarize
            
            
            
              it into one chapter, they'll be able to preserve a lot more information in that
            
            
            
              one chapter. So the higher the dimensionality of the vectors, the more
            
            
            
              descriptive and the more information that they are going to preserve from the
            
            
            
              original unstructured data format.
            
            
            
              Okay, so now that we have these vector representations,
            
            
            
              that's not enough. What we need to be able to do is we need to
            
            
            
              be able to have millions and millions, if not billions,
            
            
            
              of these vector representations, along with their
            
            
            
              unstructured data object counterparts.
            
            
            
              And then we need to be able to search over and query these objects
            
            
            
thousands and thousands of times per second. And it's this idea
            
            
            
              of being able to scale your vectors and search
            
            
            
              over them that a vector database really excels
            
            
            
              at. So in order
            
            
            
              to be able to search over our data, we need to first
            
            
            
              of all store billions of these
            
            
            
data objects, both the vectors as well as the non-vector
            
            
            
unstructured counterparts. For this to happen,
            
            
            
              we need to have our machine learning models work at scale in a production level
            
            
            
              environment. And so the idea is that every time you have a
            
            
            
              data object, an unstructured data object come in, it has to go through
            
            
            
              the machine learning model. You have to perform inference, you have to
            
            
            
              take the vector that comes out, pair it with
            
            
            
the object that went in to produce that vector, and then you have to store both
            
            
            
              of these things into your vector database. And then of
            
            
            
course the vector database has to support CRUD
            
            
            
              operations. Any of these green objects that you have on the left
            
            
            
hand side here could need to be deleted or updated.
            
            
            
              You could need to add new data objects depending on the
            
            
            
              velocity of your data that you're dealing with. So a
            
            
            
vector database has to have CRUD operation
            
            
            
              support.
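A minimal sketch of that storage-plus-CRUD idea: each record pairs the original object with its vector, and records can be created, read, updated, and deleted. This is an in-memory stand-in for illustration, not how a production vector database is implemented.

```python
import uuid

class InMemoryVectorStore:
    """Toy storage layer: every record pairs an unstructured object
    with its vector embedding, and supports CRUD operations."""

    def __init__(self):
        self._records = {}  # id -> (object, vector)

    def create(self, obj, vector):
        rid = str(uuid.uuid4())
        self._records[rid] = (obj, vector)
        return rid

    def read(self, rid):
        return self._records[rid]

    def update(self, rid, obj, vector):
        # A changed object must be re-vectorized, so both halves are replaced.
        self._records[rid] = (obj, vector)

    def delete(self, rid):
        del self._records[rid]

store = InMemoryVectorStore()
rid = store.create("an email about invoices", [0.1, 0.7])
store.update(rid, "an email about paid invoices", [0.2, 0.6])
print(store.read(rid))  # -> ('an email about paid invoices', [0.2, 0.6])
store.delete(rid)
```

Keeping the object and its vector together is what lets the database return the original email or image later, not just an opaque list of numbers.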
            
            
            
              So when we're talking about scaling vector search and
            
            
            
scaling the ability to embed vector objects in
            
            
            
              our vector databases, that's exactly what
            
            
            
              a vector database excels at. So I want to go through the
            
            
            
              definition of a vector database, and then the problems that
            
            
            
you need to tackle as you scale up:
            
            
            
              storing vector objects on the billion object scale
            
            
            
              as well as searching over them on a billion object scale.
            
            
            
So when we talk about a vector database, the first
            
            
            
              fundamental is that you need to store the unstructured data objects
            
            
            
and all of their corresponding vector embeddings.
            
            
            
              Anything that you want to be able to perform vector search over, you need to
            
            
            
              have the original unstructured object as well as its vector embedding.
            
            
            
              And then it's not enough to just do this at
            
            
            
              a small scale, you have to be able to scale this up efficiently
            
            
            
              to millions and millions of data objects.
            
            
            
              And so if we look at some of the challenges that a vector database needs
            
            
            
              to solve, the first thing that it needs to do is to be able to
            
            
            
              store both your data objects, your unstructured data
            
            
            
              objects, as well as the vector representations. You have
            
            
            
to be able to perform similarity search, this idea of semantic search over
            
            
            
your objects, as well as structured filtering.
            
            
            
              So for example, in this picture on the right hand side, let's say I wanted
            
            
            
              to tell my vector database to query and search through
            
            
            
              all of the animal objects that I've got in my database and
            
            
            
              tell me which one of these objects I
            
            
            
can make a recipe out of. So in this case
            
            
            
it would sub-filter, it would do this structured filtering where
            
            
            
it would filter all of the company logos and all of the fruits out,
            
            
            
              and then it would allow only the animal
            
            
            
              objects to pass through. And then you would go through and then
            
            
            
              say okay, which ones can we use in a recipe? And then
            
            
            
              it would let the chicken pass through that filter. And then you could do some
            
            
            
              sort of vector search over the
            
            
            
              post filtered objects. And then of course
            
            
            
              let's say you have new data that's coming in, you have to be able
            
            
            
to take that unstructured data object, turn it into a vector, and store



both the unstructured data object as well as the vector. You have to support
            
            
            
create operations, you have to support delete



operations as well as update operations.
            
            
            
And the main idea is that in order to make this scalable and
            
            
            
efficient, you have to implement a nearest neighbors algorithm,
            
            
            
              which we'll talk about in a second.
            
            
            
              So when we're vectorizing data, when we're vectorizing images
            
            
            
              or natural language, we use machine learning for that. And usually
            
            
            
              those are neural network or deep learning models. And you
            
            
            
              can use any open source model or any one
            
            
            
of the companies that we've partnered with, for example Cohere,
            
            
            
              OpenAI, Google's new large language model,
            
            
            
              to take in your data, vectorize it,
            
            
            
              and then the vector database stores that data.
            
            
            
              So once you've done that, this is similar to what your data
            
            
            
              would look like. And then what we need to do is go ahead and
            
            
            
              query our data. So let me show you an example of what querying a vector
            
            
            
              database looks like. And it's quite different from something
            
            
            
              like a structured query language query that you would
            
            
            
              see. So all of the green dots on the right hand side are your data
            
            
            
              points that are embedded in vector space. So now let's say a user
            
            
            
              comes along and they have a question. We're going to take that question,
            
            
            
              that query, and we're going to pass it through the same machine learning model
            
            
            
              and create a vector for that question, which shows
            
            
            
              up as that red dot on the right hand side.
            
            
            
              And the human understandable representation
            
            
            
              for this query is nothing other than a word. So in
            
            
            
              this case, the word being kitten. So if a user comes along and
            
            
            
              queries your vector database with the word kitten, that word goes
            
            
            
              through the same machine learning model, it gets projected into vector space and it creates
            
            
            
              that red dot. And the act of vector searching over
            
            
            
              your vector space is essentially the act of
            
            
            
              going through and saying, which green dots are the closest
            
            
            
to my red dot over here? Let's say the
            
            
            
              user wanted the top three closest vectors
            
            
            
              that are in my database. It's going to go through and pick
            
            
            
the closest three green dots to my red dot and return
            
            
            
those to the user as the
            
            
            
              corresponding images or text for those dots.
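That "pick the closest green dots" step, done literally as a brute-force search, might look like this. The records and the query vector for "kitten" are made up for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vector, records, k=3):
    """Brute-force nearest neighbors: rank every stored (object, vector)
    pair by distance to the query vector, return the k closest objects."""
    ranked = sorted(records, key=lambda rec: euclidean(query_vector, rec[1]))
    return [obj for obj, _ in ranked[:k]]

# Toy database: each green dot pairs an object with its vector.
records = [
    ("photo of a cat",   [0.9, 0.1]),
    ("photo of a dog",   [0.7, 0.3]),
    ("photo of a truck", [0.1, 0.9]),
    ("photo of a tiger", [0.8, 0.2]),
]

kitten_vector = [0.95, 0.05]  # the "red dot" for the query word "kitten"
print(top_k(kitten_vector, records, k=3))
# -> ['photo of a cat', 'photo of a tiger', 'photo of a dog']
```

Comparing the query against every single stored vector like this is exactly what becomes infeasible once you have hundreds of millions of objects.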
            
            
            
              In reality, we don't actually go through and do
            
            
            
              a brute force search where we compare the distance of the pink dot to every
            
            
            
              other green dot. That would take hours and hours depending on how much data you
            
            
            
              have. What we do instead of doing
            
            
            
              a full brute force nearest neighbor search is what's known as
            
            
            
an ANN, an approximate nearest neighbor search. And the idea
            
            
            
              behind this is that instead of searching through all
            
            
            
              of the green objects that you have in your database, you kind of set up
            
            
            
              a hierarchy of layers over here where you enter
            
            
            
              your search at the very top and you look through a very small
            
            
            
              subset of all of your objects that
            
            
            
              you have in your vector database. And then as you go down this hierarchy
            
            
            
              of search, you search over more and more data points.
            
            
            
              And so in order to get to the final nearest neighbors, it takes a significantly
            
            
            
smaller amount of time. You have to do significantly fewer comparisons and
            
            
            
              compute. The advantage here is that you can
            
            
            
              search over hundreds of millions of objects without actually having to do hundreds
            
            
            
              of millions of distance computations. The disadvantage
            
            
            
over here is the A in this algorithm's name:
            
            
            
              it's approximate, so you're not guaranteed to get the
            
            
            
              closest nearest neighbors. For that, you would have to do a brute force search.
            
            
            
              But given the amount of time that we save doing approximate nearest
            
            
            
              neighbors, this is the
            
            
            
              only way to reliably do vector
            
            
            
              search at scale. You have to have some sort of approximate algorithm
            
            
            
              that can search over all of your data objects.
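Here's a toy sketch of that "enter at a coarse top layer, then narrow down" idea. Real ANN indexes such as HNSW use multiple layers and graph links; this two-level version (a handful of coarse representatives on top, then one cluster of actual points below) only shows why there are far fewer comparisons, and why the result is approximate. All vectors are invented.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def approximate_search(query, clusters, k=1):
    """Two-level narrowing search. Like real ANN indexes, it can miss the
    true nearest neighbor if that neighbor lives in a non-chosen cluster."""
    # Layer 1: compare against a few representatives, not every point.
    best = min(clusters, key=lambda c: euclidean(query, c["centroid"]))
    # Layer 2: exhaustive search inside just that one cluster.
    ranked = sorted(best["members"], key=lambda m: euclidean(query, m[1]))
    return [obj for obj, _ in ranked[:k]]

clusters = [
    {"centroid": [0.9, 0.1],
     "members": [("cat", [0.9, 0.1]), ("tiger", [0.8, 0.2])]},
    {"centroid": [0.1, 0.9],
     "members": [("truck", [0.1, 0.9]), ("bus", [0.2, 0.8])]},
]

print(approximate_search([0.88, 0.12], clusters, k=2))  # -> ['cat', 'tiger']
```

With 2 clusters of 2 points each, this does 2 centroid comparisons plus 2 member comparisons instead of 4 full comparisons; with millions of points per cluster, the savings become the difference between milliseconds and hours.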
            
            
            
              Okay? So with those fundamental concepts of being
            
            
            
              able to embed your unstructured data objects
            
            
            
              into vectors and then put that into your vector database and then scalably search
            
            
            
over them while supporting CRUD operations, that defines
            
            
            
              the foundations of what a vector database is.
            
            
            
Weaviate, which I'm going to introduce now, is an open source vector database
            
            
            
              and you can try it, you can go to our website and play around
            
            
            
              with it. And the main idea behind Weaviate is that it understands
            
            
            
              your data because it vectorizes your data. It understands
            
            
            
              your data in this vector space and it allows you to search over
            
            
            
              your data in a way that it understands it.
            
            
            
              So I want to show you a typical vector search
            
            
            
pipeline with Weaviate just to give you an idea of how all
            
            
            
              of these moving pieces come together.
            
            
            
So everything kind of revolves around the machine learning model.



The machine learning model can be open source, so it can be ResNet 50
            
            
            
              that's widely available. It can be a model that a friend
            
            
            
uploaded to Hugging Face and made available to the wider community. It can
            
            
            
              be a model that, for example, OpenAI has
            
            
            
created and provides an API for. So all of these models
            
            
            
              allow you to take your data over
            
            
            
              here and the data can be in whatever format the model
            
            
            
              understands and generate vectors for that data.
            
            
            
And then you take those vectors and you pop them into Weaviate, and Weaviate stores
            
            
            
              both the unstructured data object as well as the
            
            
            
              vector representation for that object. So now let's
            
            
            
              say a user comes by and they have a query they want to
            
            
            
              search over your data using a concept.
            
            
            
              We're going to take that query that
            
            
            
              has to be in the same modality as the data modality
            
            
            
that the machine learning model supports. We're going to take that query, pass that
            
            
            
              through the machine learning model and generate a vector for that query similar to
            
            
            
              that red dot that I just showed a couple of minutes ago,
            
            
            
and we're going to pop that into Weaviate as well,
            
            
            
              and we're going to do a nearest neighbor search around
            
            
            
              that query. And depending on the results, depending on the closest
            
            
            
              objects that we have in our vector database, we take that and we return
            
            
            
              these results to the user.
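That pipeline can be sketched end to end with a stand-in for the model. `toy_embed` below just counts letters, which is of course not a real embedding; in practice you'd call ResNet-50, a Hugging Face model, or the OpenAI API, but the flow (embed the data, embed the query the same way, return the nearest stored object) is the same:

```python
def toy_embed(text):
    # stand-in "model": a 26-dimensional letter-frequency vector; a real
    # model produces dense vectors that capture meaning, not spelling
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# ingest: embed each object and store (object, vector) together
documents = ["cheetah", "avocado", "gazelle"]
index = [(doc, toy_embed(doc)) for doc in documents]

# query: embed with the SAME model, then nearest-neighbor search
query_vec = toy_embed("cheetahs")
best = max(index, key=lambda item: dot(query_vec, item[1]))
print(best[0])  # cheetah
```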
            
            
            
              And the cool thing about Weaviate is that it's modular
            
            
            
              in the sense that you can go ahead and plug and play whichever
            
            
            
              models you want to use, whether that's your own proprietary model,
            
            
            
              you can plug that in, you can plug and play any one
            
            
            
              of the partner companies models that are available. So if you want to use an
            
            
            
OpenAI model, you can plug that in. We've integrated with Google's PaLM
            
            
            
              model that was released just last week.
            
            
            
              You can vectorize your data using that model.
            
            
            
              You just have to specify which model you want to use. The other option
            
            
            
that's also popular is that you can bring your data pre-vectorized to
            
            
            
us and then we can store that in Weaviate.
            
            
            
              The different data modalities that we support are limited only
            
            
            
              by what the machine learning model understands. So if you have a machine learning model
            
            
            
              that understands image, video, audio,
            
            
            
              all of these different data modalities, you can take all of those different data modalities,
            
            
            
vectorize them using the model, and plug them into Weaviate.
            
            
            
Probably one of the things that I'm most interested in is what happens when you
            
            
            
              take the results, the list of results that are the closest in vector space.
            
            
            
What can you do with them? And Weaviate supports this modular output
            
            
            
              where instead of just displaying the results, you can pipe them
            
            
            
through any other functionality, so you can re-rank
            
            
            
              them. You can do some question answering with them. You can
            
            
            
even pass them to a large language model like



ChatGPT, which is a really interesting application
            
            
            
              that a lot of people are interested in. So I'm going to dive into that
            
            
            
              a little bit over the next few slides.
            
            
            
              So the main idea behind this is we want to use vector search,
            
            
            
this idea of searching over hundreds of millions of
            
            
            
              unstructured data objects, and returning the most relevant data objects.
            
            
            
              And instead of showing the results to you, we send them off to
            
            
            
              a large language model. And the whole idea behind this is that you
            
            
            
              prompt a large language model and you say, answer this question,
            
            
            
              given the information that is returned by a vector database.
            
            
            
              So the information returned by a vector database provides
            
            
            
              relevant context that the large language model can then use
            
            
            
              in formulating a response to your prompt.
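As a sketch, that prompt construction is just string assembly; the wording of the template below is my own, not a fixed API:

```python
def build_rag_prompt(question, retrieved_docs):
    # the vector database's top results become the context section of the prompt
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        f"Answer this question: {question}\n\n"
        "Here is everything relevant that you need to know:\n"
        f"{context}"
    )

prompt = build_rag_prompt(
    "What is the fastest land animal?",
    ["The cheetah can reach speeds of around 100 km/h.",
     "Gazelles are known for bursts of speed."],
)
```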
            
            
            
And so imagine you have a browser tab with ChatGPT
            
            
            
              open. You would say, answer my question. Whatever your question is, you would type it
            
            
            
              in, and then you would say, the vector database
            
            
            
              returns all of this relevant information that you need to know.
            
            
            
              And you say, here's everything relevant that you need to know to answer my original
            
            
            
              question. And so now the workflow
            
            
            
of giving your ChatGPT model a prompt is a
            
            
            
              bit different, where you can say you have a prompt
            
            
            
              that comes in, you pass it to your large language model,
            
            
            
              but instead of just letting it rely on its general knowledge
            
            
            
              of information, you also pass it relevant documents
            
            
            
              or relevant information. Usually how I've seen
            
            
            
              people solve this problem is they go through, they see whatever
            
            
            
information is relevant, they copy-paste that and they
            
            
            
              pass it along with the prompt into the same window.
            
            
            
              The problem with that is that there's only a limited amount of information that you
            
            
            
can copy-paste in. And the other problem with that is
            
            
            
              you have to do all of the relevant information identification
            
            
            
              and filtering yourself. So imagine you've got
            
            
            
              thousands of documents here and only some are relevant
            
            
            
              to the prompt that you're asking. You have to sit there and go through
            
            
            
              each one at a time and say, this is relevant. That's not relevant.
            
            
            
Once you've done this, you'll have a subset of the relevant filtered
            
            
            
              documents. Then you can go ahead and pass the prompt
            
            
            
              over to your large language model along with the documents
            
            
            
that you filtered and copy-pasted in.
            
            
            
But if you think about what the role of the user is over here,
            
            
            
              they're just performing similarity search over your documents. They're reading the
            
            
            
              documents one by one. They have some concept that they're looking for,
            
            
            
              whether it's present in the documents, and then if it is, it goes through.
            
            
            
              If not, it gets rejected. It's not going to end up in
            
            
            
              the list of relevant documents. That's exactly what a vector
            
            
            
database excels at. So if we want to scale this approach,
            
            
            
              we can replace that user that was sitting there and filtering our
            
            
            
              documents with a vector database. And the vector database,
            
            
            
              its main job is to house hundreds of millions
            
            
            
of documents and search over them to provide



the ones that are most relevant to your prompt.
            
            
            
And in the demonstration, I'll actually show you how



this works: I'll have 100,000 objects stored



in my vector database, I'll run



some queries over those documents, and then I'll filter the



documents, send them to a large language model, and use generative search



over the documents.
            
            
            
              And so in the example that I'll show you,
            
            
            
              I'm using Weaviate. The cool thing about Weaviate is that we've implemented
            
            
            
              an endpoint whereby you can pass in all
            
            
            
              the documents and then instead of returning the results, you can
            
            
            
say, take these results that Weaviate returns, these filtered relevant documents,
            
            
            
              and send them to a large language model. And what you see at the end
            
            
            
              of that is the generated text as a result
            
            
            
              of the answer. So you see the customized response that the large
            
            
            
              language model is going to provide to your prompt as
            
            
            
              well as the relevant documents that you've passed in.
            
            
            
              So that's a lot of talking from my side. So what I want to do
            
            
            
              is I want to go over a short demo that
            
            
            
              shows you the power of vector databases and what they enable.
            
            
            
              Okay,
            
            
            
              so let me
            
            
            
go into my Jupyter notebook IDE
            
            
            
              over here. Let's start off all the way at the top.
            
            
            
              So to get this demo working you're
            
            
            
              going to need the Python Weaviate client. You're also going to
            
            
            
              need to go to the link to download the data.
            
            
            
              And then the other thing that you're going to need is to go to
            
            
            
              weaviate cloud services over here and
            
            
            
then log in and you're going to need to create a cluster. So this is



a remote cluster, a remote instance of Weaviate
            
            
            
              that is going to store your data and then perform vector search over it.
            
            
            
              I've already created this cluster over here and you
            
            
            
              can see that I've already uploaded about 100,000 unstructured
            
            
            
              data objects in there ready for us to query.
            
            
            
              So we're going to go in here and then we're going to go in,
            
            
            
              import our Python Weaviate client.
            
            
            
              The other thing we're going to need to provide is a WCS token. So the
            
            
            
              WCS token is important because once you create your cluster,
            
            
            
              if you enable authentication over here, you'll see that you have a
            
            
            
              token that you're going to need to provide to be able to remotely hook into
            
            
            
              that instance. So over here this token
            
            
            
is provided from the console.
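Hooking into the cluster might look like the sketch below. The URL, environment-variable names, and header key are placeholders, not values from the talk; the commented-out lines show the corresponding weaviate-client (v3-style) calls:

```python
import os

# placeholder connection settings for a Weaviate Cloud Services cluster
connection = {
    "url": "https://my-cluster.weaviate.network",
    # the WCS token from the console, needed once authentication is enabled
    "auth_api_key": os.environ.get("WCS_TOKEN", "<your-wcs-token>"),
    # API keys for third-party vectorizer models travel as extra headers
    "headers": {"X-OpenAI-Api-Key": os.environ.get("OPENAI_API_KEY", "<key>")},
}

# With the weaviate-client package and a live cluster this would become:
# client = weaviate.Client(
#     url=connection["url"],
#     auth_client_secret=weaviate.AuthApiKey(api_key=connection["auth_api_key"]),
#     additional_headers=connection["headers"],
# )
# client.is_ready()  # True once the client can reach the cluster
```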
            
            
            
              And then if you're using any third party machine learning models to vectorize your data
            
            
            
              or query your data, you're also going to need to provide API
            
            
            
              keys for them. So here I've provided the relevant keys
            
            
            
              and then once the client is ready
            
            
            
it'll print out True. So a
            
            
            
              little bit about the 100,000 objects that I've uploaded, these are just Wikipedia
            
            
            
articles that we've chunked up into paragraphs



and we've got a bunch of metadata around them. So for example, we have the
            
            
            
ID for the Wikipedia article, the title, and then we have the actual text



from the paragraph, the URL, the wiki ID, so on
            
            
            
              and so forth. We're going to take all this data and
            
            
            
              we're going to batch it and upload it to our remote
            
            
            
              instance that we just created.
            
            
            
              Before we can do that, we have to create a database schema that
            
            
            
lets Weaviate know what data is coming in and how we vectorize
            
            
            
              it. So to do that we define a class and we give
            
            
            
it a name. So these are going to be Wikipedia articles, with a description over here,
            
            
            
              and then we specify the vectorizer. So because our data is
            
            
            
              text, we want to vectorize text. So we use a specific
            
            
            
text2vec machine learning model. In this particular case,
            
            
            
              this is a multilingual model and that's
            
            
            
              going to be very advantageous later down the road. So I'll come back to that
            
            
            
              in a second. We're also going to specify exactly
            
            
            
              the distance metric that we want to use when we're
            
            
            
              conducting vector search. So in this case it's going to be a dot product.
            
            
            
              So when it's comparing how far away your query vector is
            
            
            
from all the other vectors, it's going to conduct a dot product between
            
            
            
              those two vectors to find out which two data points are closer
            
            
            
              together, which ones are farther apart, and then we go in and we name
            
            
            
              all of our properties over here. So we've got the text of
            
            
            
              the Wikipedia article, we've got the title, the URL, the metadata
            
            
            
              that we went through in our pandas data frame. So once
            
            
            
              we're ready with the schema, we're going to go ahead and create
            
            
            
              the schema and then we go ahead and batch upload data into
            
            
            
Weaviate. So we're going to start off, and we're
            
            
            
              going to say start off with a batch size of 100. So upload 100
            
            
            
              article paragraphs at a time and then we'll set this to dynamic
            
            
            
              so that this can increase over time if necessary.
            
            
            
              And then we're going to go in and loop through our
            
            
            
              data and upload 100 objects at
            
            
            
              a time. So this takes some time.
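Put together, the schema and batch-upload steps might be sketched like this. The class name, property names, and vectorizer module are assumptions (the talk doesn't name the exact multilingual model), and the commented lines show the corresponding client calls:

```python
# Hypothetical schema for the Wikipedia paragraphs class. "Article" and the
# "text2vec-cohere" multilingual module are illustrative choices, not
# confirmed by the talk.
article_class = {
    "class": "Article",
    "description": "Wikipedia article paragraphs",
    "vectorizer": "text2vec-cohere",            # multilingual text-to-vector model
    "vectorIndexConfig": {"distance": "dot"},   # compare vectors via dot product
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "text", "dataType": ["text"]},
        {"name": "url", "dataType": ["text"]},
        {"name": "wiki_id", "dataType": ["int"]},
    ],
}

def batches(rows, batch_size=100):
    """Yield fixed-size chunks; the client's dynamic batching can grow this."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

# With a live cluster this would become:
# client.schema.create_class(article_class)
# client.batch.configure(batch_size=100, dynamic=True)
# with client.batch as batch:
#     for row in rows:
#         batch.add_data_object(row, "Article")
```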
            
            
            
              So in prep for this demo, I've already uploaded all of
            
            
            
              that data and you can see that all
            
            
            
              100,000 uploaded correctly. Once the data is in there,
            
            
            
you want to do a quick check to make sure that all of



the unstructured data objects are registered and accounted for. The way
            
            
            
              you can do that is just by doing this simple query and
            
            
            
              counting how many objects there are. There's 100,000. Another way
            
            
            
              to verify that is, as I showed, if you go into the console,
            
            
            
              you'll see how many objects you've got in here using
            
            
            
              this object count. So now that all of
            
            
            
our data is in there, I want to go through and show you the different
            
            
            
              ways that you can search over your data. And I'm going to start off
            
            
            
              with the way that I started this talk with classic
            
            
            
              word search. So this is boring old word search where I'm going to go
            
            
            
              in and create a filter where I'm
            
            
            
              going to look for a specific word in the titles property
            
            
            
of my objects in my vector database, and I pass in
            
            
            
              the exact word that I want to match. And then I'm going
            
            
            
              to print out only the first one. So I slice out and I print
            
            
            
              out the first object that it finds. So for
            
            
            
              example, if I go ahead and if I do a word search for the word
            
            
            
              avocado, it matches with the word avocado that's
            
            
            
              found in this document, that's in the title, and it returns
            
            
            
              this text from that Wikipedia paragraph, I can
            
            
            
              go ahead and do the same word search for data science and it goes
            
            
            
              ahead and it matches the word data science over here with a string in between,
            
            
            
              and it returns this. So nothing exciting, nothing interesting.
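The word-search filter itself can be sketched as a small payload builder. The property name and operator mirror Weaviate's where-filter syntax; "Article" is the same assumed class name as before:

```python
# Sketch of the keyword (where-filter) search: an exact string match on one
# property, with no semantic understanding at all.
def title_filter(word):
    return {
        "path": ["title"],        # which property to match against
        "operator": "Equal",      # exact string match
        "valueString": word,
    }

# With a live cluster:
# client.query.get("Article", ["title", "text"]) \
#     .with_where(title_filter("Avocado")).do()
```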
            
            
            
              But notice what happens when I do a word search for fast animals.
            
            
            
Suppose I'm looking for one



of the paragraphs that describes a fast animal. It goes ahead and
            
            
            
              gives me this error, and the code is not wrong,
            
            
            
the approach is wrong. The problem here is that if
            
            
            
              I take away this word search function and I show you the actual query,
            
            
            
              I go ahead and I do a value string of
            
            
            
              fast animal and it looks for this string.
            
            
            
              What I actually get back is an empty
            
            
            
              list. And the reason why is because nowhere in my 100,000 objects do
            
            
            
              I find this string fast space animal.
            
            
            
              And that's one of the downfalls of keyword search that I was showing you before.
            
            
            
              If the actual string does not exist, you get nothing back.
            
            
            
              The database doesn't understand what you're trying to ask for. It doesn't understand the data,
            
            
            
              so you don't get this back. So now what I want to show you
            
            
            
              is with vector search or semantic search, you can really
            
            
            
remedy this problem quickly.
            
            
            
              What we want to do instead is go over and
            
            
            
              search for a concept. Instead of searching for a value string or
            
            
            
              a keyword search, we want to search for a concept and
            
            
            
              that concept is going to be vectorized and we look for the closest concepts.
            
            
            
              So we have a parameter here,
            
            
            
              concept, and we're going to go through. And one
            
            
            
              other thing that's important here is we're going to call the near text search.
            
            
            
              It's going to go through and it looks for the text objects
            
            
            
              that are the closest to my query object over here
            
            
            
              that I'm providing. And I also limit the return to three because it'll
            
            
            
              be nicer to look at, it won't overpopulate
            
            
            
              the page in front of me. So I'm going to only return the three most
            
            
            
              relevant results, the three vectors that are the closest to my query vector.
            
            
            
              And then I have a function that's going to format and style this nicely.
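The nearText query itself is compact; this sketch reuses the assumed "Article" class name, with the live client calls shown in comments:

```python
# Sketch of the semantic (nearText) search from the demo: the concept is
# vectorized by the same model as the data, and the closest vectors win.
near_text = {"concepts": ["a programming language used for machine learning"]}
limit = 3  # return only the three closest vectors

# With a live cluster:
# result = (
#     client.query.get("Article", ["title", "text"])
#     .with_near_text(near_text)
#     .with_limit(limit)
#     .do()
# )
```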
            
            
            
              Let's say I go through and I search for this
            
            
            
              concept, a programming language used for machine learning.
            
            
            
              And the first result, the closest vector
            
            
            
to my query vector is Python, and then it's C++, and then
            
            
            
              central processing unit. And if you look at what we're doing here,
            
            
            
              we're literally just typing and chatting with our
            
            
            
              vector database. I queried it using this concept
            
            
            
              programming language used for machine learning, and it realized even
            
            
            
              though the exact string was nowhere to be found in
            
            
            
              any of these texts, it found the closest concept
            
            
            
              to what I was asking and it returned that. And that's the
            
            
            
              power of vector databases. You can never do this with a
            
            
            
regular SQL or
            
            
            
              NoSQL database. The only way that you can do
            
            
            
              this is if your vector database understands your data as well as
            
            
            
the query, what you're asking for. And it has a way to quantify how
            
            
            
              similar or dissimilar those two things are, which a vector database
            
            
            
              exactly does. That's the main power of a vector database.
            
            
            
              Again, if I go back to my original query of fast animals and I
            
            
            
              conduct a vector search now, now it gives me relevant answers,
            
            
            
              right? So it goes through and it vectorizes fast animals and it realizes that
            
            
            
              a gazelle, a cheetah and a bobcat are fast animals,
            
            
            
              because when I vectorized these objects, there was
            
            
            
              some mention or it understood that these
            
            
            
              unstructured definitions, text definitions,
            
            
            
are affiliated with fast animals. It could be mentioned somewhere
            
            
            
              in the object itself, or so on and so forth.
            
            
            
              And this is the power of vector databases. And if
            
            
            
              you remember, I told you that this was a multilingual model
            
            
            
              that we were using to vectorize our data. And that means that
            
            
            
              you can query it in any language that you want. So here I've
            
            
            
queried it with a phrase that means great movies, in Chinese,
            
            
            
              and it comes back and it shows me the most relevant
            
            
            
data objects that I've stored in here. So we've got Goodfellas, Totally Spies.
            
            
            
              You can also query in Hindi. So this is
            
            
            
              the same query, great movies in Hindi,
            
            
            
you get Schindler's List, The Dark Knight.
            
            
            
              So this is quite powerful. The flexibility of your vector database
            
            
            
              is only limited by the modalities and
            
            
            
              data types that your machine learning model understands and can vectorize.
            
            
            
              And again, vacation spots. This is in Farsi and it
            
            
            
              understands that as well. For the
            
            
            
              next part here, what I want to show you is this idea of generative
            
            
            
              search, searching over your vector database and then instead of just
            
            
            
              returning the results to you, piping those to a large language
            
            
            
              model, and then providing those results
            
            
            
              as context to the large language model so that it can answer a query
            
            
            
              or a prompt using those documents
            
            
            
              as context. So the context that I'm going to provide
            
            
            
              here is going to be from this return query.
            
            
            
              So I do a semantic search over my vector database for famous basketball players
            
            
            
and it returns to me three famous basketball players. You've got Wilt Chamberlain,



Magic Johnson and Wilt Chamberlain. The reason why it repeats



Wilt Chamberlain here is because the same Wikipedia article can
            
            
            
              be chunked multiple times. So I've got different paragraphs from the exact
            
            
            
              same Wikipedia article. So now what I'm
            
            
            
going to do is, instead of showing you the results of this query, I'm going
            
            
            
              to tell Weaviate to take these results,
            
            
            
              send them to OpenAI's large language model,
            
            
            
answer a question using these results, and then
            
            
            
              give me the generated text back. And this process is known
            
            
            
              as generative search or retrieval augmented
            
            
            
              generation. So here
            
            
            
the interesting thing I want to show you is the prompt that
            
            
            
I would give to ChatGPT. So I want it to



write me some interview questions that I can ask {title} (there's something funny



going on here that I'll explain in a bit)



and also how {title} would answer them. Here's some information about them.



And then I've got {text} in here. So to understand where these titles and
            
            
            
              text come from, let's look at the actual query here.
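The query chain described here can be sketched with a stub in place of a live Weaviate instance. The builder-style method names (get, with_near_text, with_limit, do) mirror the Weaviate v3 Python client, but the exact signatures, the "Article" class name, and the canned data below are assumptions for illustration, not the demo's actual code:

```python
# Sketch of a near-text query against a stub client (no live Weaviate needed).
# Method names mirror the Weaviate v3 Python client; signatures are assumptions.

class StubQuery:
    def __init__(self, data):
        self._data = data
        self._class_name = None
        self._limit = None

    def get(self, class_name, properties):
        self._class_name = class_name
        self._properties = properties
        return self

    def with_near_text(self, near_text):
        self._concepts = near_text["concepts"]
        return self

    def with_limit(self, n):
        self._limit = n
        return self

    def do(self):
        # A real vector database ranks objects by embedding similarity;
        # here we just return canned objects to show the response shape.
        hits = self._data[: self._limit]
        return {"data": {"Get": {self._class_name: hits}}}

articles = [
    {"title": "Wilt Chamberlain", "text": "Wilt Chamberlain was a center..."},
    {"title": "Magic Johnson", "text": "Magic Johnson was a point guard..."},
    {"title": "Wilt Chamberlain", "text": "Another chunk of the same article..."},
]

response = (
    StubQuery(articles)
    .get("Article", ["title", "text"])
    .with_near_text({"concepts": ["famous basketball players"]})
    .with_limit(3)
    .do()
)
print([hit["title"] for hit in response["data"]["Get"]["Article"]])
# → ['Wilt Chamberlain', 'Magic Johnson', 'Wilt Chamberlain']
```

Note the repeated title in the results: because articles are chunked, two different paragraphs of the same Wikipedia page can both match the query.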
            
            
            
              I start off and I query my client and I say that I want to
            
            
            
              do a near text search where the concept is famous
            
            
            
              basketball players. And this will return something
            
            
            
              like this that I've shown here. And what I want to extract
            
            
            
              from this is the title as well as the
            
            
            
              text. So it's going to give me the title of the Wikipedia article which
            
            
            
is shown here as an example, and the text of the Wikipedia article
            
            
            
              which as an example I've highlighted over here. And instead of showing
            
            
            
              you the returned objects,
            
            
            
              I am going to pass them over to the generative
            
            
            
model using the with_generate command here.
            
            
            
And so in this case what happens is the {title} placeholder gets replaced with



whatever title was returned by the vector database. The {text} placeholder gets replaced



with whatever text was returned by the vector database.
            
            
            
              And that creates the whole prompt. And that prompt gets sent to the large
            
            
            
              language model. And this single prompt
            
            
            
              parameter that I've provided here essentially makes it so that every single
            
            
            
              data object that the vector database returned gets passed
            
            
            
              one by one to the large language model. And then you get
            
            
            
              the equivalent number of responses back that you can show.
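The per-object behavior just described can be sketched in a few lines: the prompt template is filled in once for each retrieved object, producing one generation per object. The {title}/{text} placeholder syntax follows the talk's prompt, while the stub generate() stands in for the actual call to OpenAI's model and is an assumption for illustration:

```python
# Sketch of "single prompt" generative search: one LLM call per retrieved object.

def generate(prompt):
    # Stand-in for a call to a large language model such as OpenAI's.
    return "LLM ANSWER [" + prompt + "]"

def single_prompt(objects, template):
    responses = []
    for obj in objects:
        # The {title}/{text} placeholders get replaced per object.
        prompt = template.format(title=obj["title"], text=obj["text"])
        responses.append(generate(prompt))
    return responses

retrieved = [
    {"title": "Wilt Chamberlain", "text": "Wilt Chamberlain was a center..."},
    {"title": "Magic Johnson", "text": "Magic Johnson was a point guard..."},
]
template = ("Write interview questions I could ask {title}, and how {title} "
            "would answer them. Here is some information about them: {text}")

answers = single_prompt(retrieved, template)
print(len(answers))  # one response per retrieved object, so 2
```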
            
            
            
              So here I'm just going to show you one of the responses back. So if
            
            
            
              we scroll down here, you can see that the relevant
            
            
            
context that was provided was Wilt Chamberlain and
            
            
            
              the text for this Wikipedia article title. And then the
            
            
            
              generated text that I got back from OpenAI was what was
            
            
            
your biggest challenge as a basketball player? So it's essentially made a question-and-answer



session where I ask questions to Wilt Chamberlain and
            
            
            
              he responds based on what
            
            
            
              the large language model understands of the context that I provided
            
            
            
              it. So this is the power of generative search that
            
            
            
comes built in with Weaviate.
            
            
            
              And this is just one query that I can do all of
            
            
            
              this with. The other interesting thing that
            
            
            
I can do is say, write me a heroic tale about,
            
            
            
              again, take the results from the vector database and then pipe
            
            
            
              them to the generative model and here's some context about them. This is the actual
            
            
            
              paragraph. And so it goes in and it generates
            
            
            
me a story about Wilt Chamberlain and
            
            
            
all of the context that's provided here. It comes from the general knowledge



of the large language model



as well as the context, the text and title, that the vector database
            
            
            
              provided it with.
            
            
            
              Another interesting thing that you can do is instead of
            
            
            
passing the unstructured objects in one at a time and generating



a response for each object, you can group all of
            
            
            
              the objects that the vector database returns and pass these all together
            
            
            
              to your large language model to answer a more complicated question.
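The grouping just described can be sketched the same way: all retrieved objects are concatenated into a single prompt, so one LLM call can reason across the whole set. The stub generate() and the exact context formatting are assumptions for illustration:

```python
# Sketch of a "grouped task": one prompt containing every retrieved object,
# and a single generation for the whole group.

def generate(prompt):
    # Stand-in for a single call to a large language model.
    return "LLM ANSWER [" + prompt + "]"

def grouped_task(objects, task):
    # Concatenate every object into one shared context block.
    context = "\n\n".join(f"{o['title']}: {o['text']}" for o in objects)
    return generate(task + "\n\n" + context)

retrieved = [
    {"title": "Wilt Chamberlain", "text": "Wilt Chamberlain was a center..."},
    {"title": "Magic Johnson", "text": "Magic Johnson was a point guard..."},
    {"title": "Scottie Pippen", "text": "Scottie Pippen was a forward..."},
]
answer = grouped_task(
    retrieved,
    "Which of these basketball players is the most accomplished, and why?",
)
print(type(answer).__name__)  # a single response, not one per object
```

Contrast this with the single-prompt case: there you get one generation per object; here the model sees all the objects at once, which is what lets it compare the players against each other.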
            
            
            
So for example here, my question to ChatGPT



is which of these basketball players mentioned in the text
            
            
            
              is the most accomplished? And it has to choose at
            
            
            
              least one and explain why. And the titles and the text here are going
            
            
            
to be replaced by 15 of the retrieved documents that Weaviate returns
            
            
            
              to me. And so here I go through
            
            
            
              and I'm going to show you the articles that were provided as context.
            
            
            
First of all, these are the names of the articles that we provided
            
            
            
              the large language model as context. And then
            
            
            
              this is the answer that the
            
            
            
              large language model provided to us. So all the way over here
            
            
            
and it shows that Wilt Chamberlain was
            
            
            
              the greatest basketball player, the most accomplished basketball player. And it
            
            
            
              explains why, given the context of information of not
            
            
            
just Wilt Chamberlain but all of the other basketball players that
            
            
            
              we asked it to compare Will Chamberlain against.
            
            
            
              And then one last example here, kind of got excited when
            
            
            
              I was putting this together. I go ahead and I say,
            
            
            
              give me five famous basketball players.
            
            
            
Weaviate goes and searches over my Wikipedia documents.
            
            
            
              It returns the titles for those basketball players as well
            
            
            
              as the context, the paragraph for those basketball players.
            
            
            
And then I ask it to



tell me a story where all these basketball players fight
            
            
            
              each other and then I give it context. I give it information that the vector
            
            
            
              database returned to me and I pass that in over here.
            
            
            
              And so if we look at the results now, first let's look at
            
            
            
the context that was provided to ChatGPT here. So here
            
            
            
we gave it the information and the name for Wilt Chamberlain,



Magic Johnson. Again, a repeat of Wilt Chamberlain, Scottie Pippen. We also
            
            
            
              gave it information for James Naismith, which is what the vector database
            
            
            
              returns. James Naismith is not a basketball player, he's the inventor
            
            
            
              of basketball. But we'll let that slide for a second. Let's have a look at
            
            
            
              the generated story that we got.
            
            
            
              So in this generated story we're
            
            
            
essentially customizing the response of



ChatGPT to the context that we provided. So now the story starts off



and it's saying that Wilt Chamberlain and Magic Johnson were both legends,
            
            
            
              so on and so forth. And it tells a very intricate story.
            
            
            
              It's got Scottie Pippen in there and it's
            
            
            
kind of pitting them against each other. And then it says that Wilt Chamberlain stood tall,



meaning it thinks that Wilt Chamberlain is the best. So of course he's
            
            
            
              going to win at the end of the day. But that's the power
            
            
            
              of vector databases and what you can do with the inputs,
            
            
            
              the outputs, how you can chain them with large language models to
            
            
            
              get these large language models to answer your prompts
            
            
            
              grounded in the context that the vector database provides.
            
            
            
              And so this is one of the most exciting things around vector databases
            
            
            
              right now, where vector databases essentially act like
            
            
            
              long term memory for these large language models.
            
            
            
              They can go and retrieve ten of the most relevant documents
            
            
            
              to a prompt over millions of documents, and then answer your prompt,
            
            
            
              given that information as context.
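The "retrieve the ten most relevant documents over millions" step can be sketched as a brute-force cosine-similarity top-k search. The toy two-dimensional vectors below are made up for illustration; a real vector database replaces the linear scan with an approximate nearest-neighbor index (such as HNSW) so the search stays fast at hundreds of millions of objects:

```python
# Minimal sketch of top-k vector retrieval via cosine similarity.
import math

def cosine(a, b):
    # Cosine similarity: dot product of the vectors over the product
    # of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, docs, k):
    # docs: list of (text, vector) pairs. Brute-force scan; real vector
    # databases use ANN indexes instead of comparing against every object.
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

docs = [
    ("doc about basketball", [0.9, 0.1]),
    ("doc about cooking",    [0.1, 0.9]),
    ("doc about the NBA",    [0.8, 0.3]),
]
print(top_k([1.0, 0.0], docs, 2))
# → ['doc about basketball', 'doc about the NBA']
```

The retrieved texts are then pasted into the prompt as context, which is the whole retrieval augmented generation loop in miniature.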
            
            
            
              Okay, that's the end of my demo, so I'm going to go back
            
            
            
              and go over here.
            
            
            
              So I hope everybody enjoyed this intro to vector
            
            
            
              databases. If you have any questions, feel free
            
            
            
to connect with me on Twitter or LinkedIn. You can join our Slack
            
            
            
              community. The entire team is around, we're happy to help.
            
            
            
We'd love to get you to try and use



Weaviate. If you have any cool ideas, let us know.
            
            
            
And then if you do have any questions as you're using Weaviate,
            
            
            
              feel free to shoot us a message. If you come up with anything interesting,
            
            
            
              if you come up with cool implementations, cool projects that
            
            
            
              you're using Weaviate with, give us a shout out.
            
            
            
              We'd love to talk to you. You can also blog about
            
            
            
it. We always love to see how the community uses this open



source tool. That's it for me. Thank you to



Conf42. This was great. I really enjoyed this,
            
            
            
              and I hope you guys enjoyed this as much as I did.
            
            
            
              Thank you, everybody. Take care.