Conf42 Machine Learning 2023 - Online

Vector Ops: How to run vector embedding-powered apps in production

Abstract

Vector embeddings are everywhere - from recommender systems to search engines and generative AI. It’s easy to use them to build a quick demo in a notebook, but getting vector-powered systems to work reliably in production is a challenge. Let’s run through everything you’ll need to get started!

Summary

  • Daniel: Why are vectors useful? Why use them to make your app better? We'll talk about which features of your app benefit from being vector powered. And finally, how you unlock the benefit for your end users, which is why we are doing this.
  • When humans adopted language, they kind of lost something in the process. Words are not a very useful way to represent information for computers. And here is where we introduce vectors, basically because vectors give us the resolution that we need.
  • Before using vector representations to build search interfaces, it was all about keywords. Now, how does this look on the recommendation side of the problem? These are two sides of the same coin. We still get recommendations wrong, same as search.
  • The purpose of the nearest neighbor index is to help us find similar pictures or similar vectors in the space really fast. Technology is maturing that can store a bunch of vectors on each node. Production builders are interested in building on top of the vector databases.
  • Right now there is this explosion of different vector databases. When you choose a vector database, I wouldn't focus so much on the speed. What you care about in search and recommendation use cases is recall around 80 to 90%.
  • The idea is to represent your content and users as vectors and then do a vector search that basically for a given user finds the nearest content. The new thing with vectors is that you can take all these features and literally just concatenate them into the content vector. This allows you to do cool stuff later.
  • Another aspect of this: let's say you want to improve the diversity of authors for the content that you recommend. For that, again, the query manager on top of your vector based retrieval is useful. And finally, it's an approximate system, approximate nearest neighbor retrieval.
  • We want to build a vector powered system. What will you need to add to this to build an MVP? The first step will be just eyeballing the results. The second step is quantitative evaluation. And then finally, probably most important, the analytics.
  • The next stage is where this all gets very complicated, because you are now assembling all those different constraints. You'll need a sort of serving system that has APIs. And then through an API layer again, you surface it into your product's search and recommendations.
  • The touch point between what we talked about before, vector representations of data, and generative AI is this aspect of memory. There are libraries that are useful for building on top of generative AI use cases.
  • Today we looked at the connection between vector embeddings and generative AI. We are building a vector ops platform that makes building search and recommendations easier. We would love to learn from each other and deliver something useful. Connect on LinkedIn and take it from there.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, everybody. Welcome to my talk. I'm Daniel. I used to be a tech lead at YouTube for seven years, and now I work in the vector ops space with Superlinked. Let's just dive right in. So, in today's talk, we'll cover the three most important questions. Why are vectors useful? Why use them to make your app better? We'll talk about which features of your app benefit from being vector powered, and then finally, we'll talk about actually building a vector powered feature of your app in a way that makes it easy to put into production reliably. And finally, you unlock the benefit for your end users, which is why we are doing this: we are using new technology to make your app better. So let's go into the why first, though, because I think vectors can be not so intuitive, and it helps to have a mental model for why they are different from other representations of data that you may have used before. We have to start from the basement here, really, and first look at what happened when humans adopted language: we kind of lost something in that process. I'll try to illustrate that with a few pictures here. If you look at this slide, on the left side we have a famous landscape picture, for the Windows fans out there, and on the right side we have a representation of that picture as the RGB values of the individual pixels of that photo, right? We have about 1 million pixels in there, and each is defined by these three values. So we have a lot of data, a book's worth of data, that represents what each of these pixels looks like. Now, a human would translate this picture into words, in order to communicate with others or to remember it. And this is very efficient on the surface, right? We would say, let's say, "field of grass" to describe what this picture contains, and that's just three words. So we went from a million values to something very short. This looks pretty efficient. However, the problem is that when you say "field of grass", you may mean lots of different things. We have lost the resolution: lots of similar pictures end up being called the same thing. All of these different photos collapse to those three words. So let's say I did this experiment with Midjourney: I took "field of grass" and put it in as the input, and I got this field of grass. Looks very different from where we started. So then you might want to start refining that, to regain the resolution we lost by going from a million values to three words. We start to specify more words: we say "field of grass on a summer day", and that looks a bit closer to what we had before, but still worlds apart. And we can keep adding more and more words to try to regain that resolution, and after a while we get to "field of grass on a summer day, a few clouds in the blue sky, slightly hilly". It sort of gets somewhere, but it's still a very different picture. So on the left we have the original, and on the right we have the approximate reconstruction from those 16 words or whatever it was. In this demonstration we see how natural language is a bottleneck of communication. It's a very imperfect way to describe stuff, and it misses the nuance: it's very hard to describe small changes in the input in a way that is preserved in the output. So this is a problem of language, right?
It's not only hard to reconstruct what we described, but the descriptions are also ambiguous, right? All three of these images are constructed from those same 16 words: field of grass, hilly, summer day and so on. So basically what I'm saying is that words are not a very useful way to represent information for computers. And it's not just words, it's any sort of structured data that you are using to represent stuff out there in the real world. Once you go from analog audio recordings and pictures and try to represent them with some structured information, you'll just lose a bunch of the resolution. And that makes those structured representations not very useful for actually exploring the different concepts and, for example, training machine learning models. And here is where we introduce vectors, basically because vectors give us the resolution that we need. Just imagine we have a set of two dimensional points on the xy axis, and we have some sort of function that, given a picture, outputs an xy coordinate. We feed the pictures that we have been looking at into that function, and that function can represent similarities and differences between the different pictures of a grassy field in that space. This kind of resolution and nuance is what words, or any sort of structured data, are basically lacking, and this is why a vector representation is closer to the truth, in a way. So, just to summarize, the reason vectors are useful is that they are expressive: they allow you to capture differences between things that you cannot otherwise express. And they're smooth: things that are close by in the vector space are similar, and items gradually get less and less similar as you go further in that space. Smoothness means you can explore the space, you can navigate, you can get closer to certain parts of it, and that makes it very useful for any sort of machine learning, which is basically doing that exploration. The downside is that vectors are very difficult to work with, because they are just a list of numbers, and when you print out a vector, it doesn't really tell you anything. So that sort of answers the question of why vectors: they're useful representations of data because of those properties. Now, in terms of what you can build with vectors: obviously the world of search and recommendations existed before this latest wave of transformer models that help us generate these vectors, and before this whole new hype around vector databases. The search and recommendation space has existed for decades. So let's have a look at how that space was doing before this type of technology, and let's look at search and recommendations specifically. Before using vector representations to build search interfaces, it was all about keywords. You had some sort of natural language processing pipeline that tried to normalize the keywords in your documents and in your search queries, perhaps expand them to synonyms, and do all kinds of precomputation and expansion. So you had a bunch of natural language processing functionality in there. Then you had some sort of index that, given a keyword or a set of keywords, pointed you to the documents that contain them. That was the core piece that powered your search.
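Just to make that concrete, here's a toy sketch of the keyword-index idea (the normalization function is a stand-in for the real NLP pipeline of stemming, synonym expansion and so on, and the mini catalog is made up):

```python
from collections import defaultdict

docs = {
    1: "Extra strength pain relief tablets",
    2: "Migraine relief caplets",
    3: "Lavender essential oil",
}

def normalize(token: str) -> str:
    # stand-in for the real NLP pipeline: stemming, synonyms, spell correction...
    return token.lower().strip(".,!?")

index = defaultdict(set)  # keyword -> ids of documents containing it
for doc_id, text in docs.items():
    for token in text.split():
        index[normalize(token)].add(doc_id)

def search(query: str) -> set:
    # a document must contain every query keyword to match
    hits = [index.get(normalize(t), set()) for t in query.split()]
    return set.intersection(*hits) if hits else set()

print(search("migraine relief"))     # {2}
print(search("splitting headache"))  # set(): relevant products exist, but the keywords don't overlap
```

Miss the exact words and you miss the document, which is exactly the failure mode in the example coming up.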
So that's why it's really important to get those keywords right, normalized and processed. And then finally, these types of systems are very much fine tuned by hand: observing queries that you haven't seen before, trying to figure out how to tweak your keyword based rules to get good results. This has been happening for decades, and the outcome is that it still kind of doesn't work that well. Here I have a Walgreens example, but I could have used anything. This is an online pharmacy, basically, and I type in "splitting headache", and I'm getting exactly one result, which is an essential oil: no pain medication, nothing for migraines. Because I used keywords that don't exactly match the results, the keyword based system gets it wrong. Now, how does this look on the recommendation side of the problem? Because search and recommendations are very much linked, right? They are two sides of the same coin. I'm sure you have been to LinkedIn and you have seen your job recommendations. This is a screenshot of mine, and they're particularly funny: I not so rarely get recommendations for intern jobs, after being in the industry for 15 plus years. And then also this "Stealth" company. It's kind of a meme: if you search for Stealth on LinkedIn, you'll see that there are tens of thousands of people working at "Stealth", and this entity shouldn't really be recommended. So basically we still get recommendations wrong, same as search. The problems are somewhat related, but recommendations actually have their own pile of problems. Let's look at a few of them. So how would you build a recommender system before vectors? You would try to combine two aspects of your data. You would have content features: metadata for your content that helps you understand what that content is about, and the same for your users, stuff they tell you during sign up or any metadata they create about themselves. Then you would have collaborative features, which are features based on user behavior: the "similar users like similar things" type of signal. You would build the collaborative features and the content features and try to marry them together in a model that responds well if you don't have enough behavioral data for the user, responds well if you don't have enough content features, and kind of gets the best of both worlds. Usually this work is quite custom, so companies do it in house with a lot of effort and time. After managing to marry those two aspects, a recommender engine typically has two steps. First, you use some sort of feature store, some sort of database, to retrieve candidates that roughly match what you think would be a good recommendation. But because it's a rough, broad strokes type of search, you have to retrieve a lot of candidates, because you know the candidates will be low quality. So for the content you want to recommend, you take a broad brush and somehow fetch 10,000 candidates, hoping that some of them will actually make sense. That's the retrieval, typically. And then you have the ranking, or the scoring, right?
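In skeleton form, that classic two-step pipeline looks something like this (the feature store and the scoring model here are stand-ins; the scoring step is what's described next):

```python
def recommend(user, feature_store, score_model, k=3, n_candidates=10_000):
    # step 1: broad-strokes retrieval; quantity over quality at this stage
    candidates = feature_store.fetch_candidates(user, limit=n_candidates)  # stand-in API
    # step 2: run the (expensive) scoring model once per candidate, keep the top k
    ranked = sorted(candidates, key=lambda item: score_model(user, item), reverse=True)
    return ranked[:k]
```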
So you have a model that, given a candidate piece of content and the user, predicts the probability that this user will actually like the content or engage with it, or, let's say, if it's a job, that they'll apply. You run this model over all the candidates, 10,000 times, then you sort the candidates by this score, and finally you find, let's say, the three jobs out of those 10,000 that have the highest score, and those you return to the user. This whole system is a project that takes months to build, is typically done in a custom way, and even for a very large company it can still miss terribly. So basically, search and recommendations are still an open problem. And this is the before-vectors world. Obviously, I'm not saying that with vectors it's a solved problem, but we have some new approaches to tackle it, so let's see what we can do there. So let's look at the new world: how do you build search and recommendations with vectors? If you want to treat your vectors as just features that you extracted, and then build the old school stack with retrieval and ranking on top, that's fine, that works. What I want to look at here is new ways to use the fact that we can actually index these vectors and find similar vectors really fast. So we'll focus on that aspect of the problem. If you look at the stack of this new generation search and recommender system, we'll look at three aspects of it. We'll look at how to strengthen the retrieval part, given that we can do nearest neighbor search. We'll look at representations of content, and also of users, as vectors that allow us to do that. And then we'll talk about having a component on top of this that is much simpler and faster than the usual ranking that sits on top of retrieval; we'll talk about why there is probably still a need for that component, but a much simpler one. The benefit is that the whole stack is much simplified, because you basically just have the retrieval. The results are also better: you don't rely on that up-front candidate retrieval, which probably misses a lot of the stuff you actually wanted to recommend. In terms of recall, retrieve-then-rank is not ideal, because what if you missed something out there, even though you retrieved 10,000 items? In that sense, approximate nearest neighbor search is closer to a global optimum, because you are basically indexing the whole database; you are not just running your ranking on some 10,000 candidates. And then finally, it's faster, because you are not doing ranking, you are not running some big model on top of the candidates, you are not fetching 10,000 candidates from a database. You push everything into the approximate nearest neighbor index, and it is then much faster to actually generate the recommendations. And if there is a user waiting for a feed to load or an ad to serve, or maybe you have a bot detection use case where you want to catch a misbehaving user as fast as possible before they do damage, the speed really matters. So this is the basic setup. Let's dive into each of these three aspects of a vector powered search and recommendation system. All right, so first of all, I mentioned approximate nearest neighbor search. So what the hell is that?
We have seen, on some of the previous slides, this two dimensional space where similar pictures were closer to each other, and an autumn looking grass field was a bit further apart. The purpose of the nearest neighbor index is to help us find similar pictures, or similar vectors, in that space really fast. And we have to talk about what "nearest" means and what "fast" means, because those are the important aspects here. In terms of quantifying the difference between vectors, what we normally use is the angle between the vectors, because that's the distinguishing feature. This obviously depends on the model that produces the vector representation for a given piece of content. That model has certain properties; for example, it may say that the distance should be scale independent, so two vectors pointing the same way that just have different lengths should be considered the same. That translates into using the angle between vectors as the distance measure, let's say a cosine similarity. So that's what we mean by "nearest". And then "fast": there is a bunch of vector databases out there, and there are benchmarks you can review, but a rule of thumb is that you can do thousands of queries per second per machine for tens of thousands of vectors in the index. This technology is maturing: you can store a bunch of vectors on each node, then shard, and get to millions and even billions of vectors. That's definitely possible, and there is a lot of progress happening in this approximate nearest neighbor index layer. What we, the production builders who want to build on top of the vector databases, are interested in is the question: how do I take my problem, search and recommendations, and express it as a vector search? Because for vector search, there is a bunch of databases out there that can help me. So let's talk about that. But before we get there, I want to reflect on something that's happening in the space, which is these benchmarks. Typically, when you identify a metric that is easy to measure, like a queries-per-second benchmark, everything will coalesce around that metric. Right now there is this explosion of different vector databases, and somehow there is a lot of emphasis on speed. When you choose a vector database to work with, I wouldn't focus so much on the speed; you just need good enough. Just to give you an idea, this chart shows recall on the x axis. What do we mean by recall? This is approximate nearest neighbor search, so it's going to miss some neighbors. If you take a data set of 10,000 vectors and compute the actual 100 nearest neighbors for each vector, the recall tells you what percentage of those true nearest neighbors the index was able to retrieve within its first hundred results. What you care about in search and recommendation use cases is recall around 80 to 90%, though it really depends on your use case. So you look at this area here, and the y axis is queries per second, from 1,000 queries per second here up to 10,000 queries per second there, right?
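To make that recall number concrete, here's a small sketch of how you might measure it yourself against a brute-force ground truth (`ann_search` is a stand-in for whatever vector database or index you are evaluating):

```python
import numpy as np

def exact_top_k(query: np.ndarray, corpus: np.ndarray, k: int) -> set:
    # ground truth: brute-force cosine similarity against the whole corpus
    sims = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    return set(np.argsort(-sims)[:k].tolist())

def recall_at_k(query: np.ndarray, corpus: np.ndarray, ann_search, k: int = 100) -> float:
    truth = exact_top_k(query, corpus, k)
    approx = set(ann_search(query, k))  # ids returned by the approximate index
    return len(truth & approx) / k      # 0.9 means it found 90% of the true neighbors
```

Averaged over many queries, that number plotted against queries per second is exactly the trade-off the benchmark chart shows.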
So that's the region, and these are 800-dimensional vectors, which is a good rule of thumb: you want your vectors to be up to, let's say, 2,000 dimensions for a good, healthy system. All right, perfect. Okay, we talked about how a search and recommendations use case can make progress by representing your content and users as vectors and then doing a vector search that, for a given user, finds the nearest content. Now, the devil is in the details, obviously, and it's in how you construct these vectors. We are doing nearest neighbor search, but how are we constructing the vectors that we are doing it with? Here is how, basically. You want to capture the aspects of the content, and also of the user, that are relevant for the trade-off of what the user then sees in your app. So let's talk about a social network use case. In this use case there is a feed of content, and as a user you have certain expectations about what you'll see in your feed. Let's just imagine LinkedIn; I already gave the example of the job recommendations, so let's talk about the LinkedIn feed. You expect it to be roughly ordered by recency, the age of the content. You expect that it will contain content that is topically relevant for you, some combination of the content that you liked or engaged with before. You also expect that the platform will recommend content that is interesting to users who behave similarly to you; this is the collaborative aspect. And there is maybe some measure of quality: certain users are more or less sensitive to the quality of the content, so you want to capture this as well. Now, this is nothing new. In the past, you would have these features extracted, you would assemble them, you would do some filtering on them to generate the candidates, then ship them to the ranker. All custom-built, in-house recommender engine stuff. The new thing with vectors is that you can actually take all these features and literally just concatenate them into the content vector. You normalize each of these vector parts and then you literally concatenate them together. And this allows you to do cool stuff later. It allows you to do a search that actually balances between these different objectives, and it completely offloads the navigation of that trade-off to the nearest neighbor search, because at the end of the day you just have one vector and you do nearest neighbor search on it; but it's assembled from these different parts that are important for your specific use case. So that's the content vector construction. And then, in the same vector space, you do the same for the user. The user will have some topical preference, some representation of the content they liked before in terms of topics. They'll have some popularity preference: is this user mostly interested in the highest popularity content, or do they venture beyond just the most hyped content? How well do they tolerate, let's say, quality degradation, which might come from moderation signals and so on. And then also the recency preference, right?
Are they after only the most recent, news-type content, or are they happy to venture broader into the catalog, driven more by topic and less by popularity and recency? So you can again represent all these different preferences of the user in one vector that has the same parts as the content vector. That means the two are aligned, and now you can take the user vector and search with it in the content space, and that's your retrieval. What you can also do with this, and I have this note here in the bottom right corner, is user-to-user search: you can discover users that behave similarly. This is useful for lookalike audiences, for bot detection, for recommending people to each other to engage; dating apps are the obvious use case. So you have a bunch of opportunities on that front. But the core idea is that I'm extracting different features of both the user and the content, I'm stuffing them all together into one vector, I'm putting different weights on them, and I'm normalizing them so they can be combined in this way. That gives me the basis for my content and user vector representations. Here is a visualization of how we then use these vectors. If a user comes in and I want to generate a feed of content for them, I literally just take the user vector and search in the content space, and what comes up is the content to show. The key thing to understand, which is different from the very basic overviews of vector powered retrieval that you see online, is this: typically people just use one model, just the relevance part. They use some semantic embedding model, do the semantic embedding, and if you visualize it, you have topically similar content clustered together. That's what you see out there. But what people are realizing now is that you can blend all those other features in there too. So this is not just topically relevant content for user two; it's also very recent content, and user two likes that, and that's why they are closer together. All these other things can be expressed in the space. This picture is a projection of that space into 2D, so it's hard to visualize, but in its original 1,000-dimensional space it is navigating all those different trade-offs, and you let the approximate nearest neighbor index do the heavy lifting. All right, cool. So that's the retrieval step. That covers: for a given user, here is a bunch of content that really matches their preferences. Now, there are still some aspects of the recommendation and search problem that can't quite be represented, or as easily represented, with the vector embedding. For those, you might want a module on top of your retrieval that manages the queries, potentially manipulating the search vector. A user comes in, I grab their user vector, I tweak it, and then I use it to search in the content space. This tweak could, for example, turn a recommender engine into a search engine: I get a search query from the user, I create a semantic embedding of the query, and I use it to overwrite or augment the topical preference part of the user vector.
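Here's a small sketch of that whole construction: normalize each part, weight it, concatenate, and swap the topic slice for a query embedding when you want search instead of recommendations. The part names, sizes and weights are made up for illustration, and the embeddings themselves would come from whatever models you choose:

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(v)
    return v / n if n else v

# illustrative layout: [topic (384 dims) | recency (2) | popularity (1) | quality (1)]
PARTS = {"topic": (0, 384), "recency": (384, 386), "popularity": (386, 387), "quality": (387, 388)}
WEIGHTS = {"topic": 1.0, "recency": 0.5, "popularity": 0.3, "quality": 0.2}

def build_vector(**parts) -> np.ndarray:
    # normalize each part so no single signal dominates, then weight and concatenate;
    # works for content vectors and user vectors alike, so the two spaces stay aligned
    return np.concatenate(
        [WEIGHTS[name] * unit(np.asarray(parts[name], dtype=float)) for name in PARTS]
    )

def to_search_vector(user_vec: np.ndarray, query_embedding: np.ndarray) -> np.ndarray:
    # recommendations -> search: overwrite the topic slice with the query embedding,
    # keeping the user's recency/popularity/quality preferences intact
    start, end = PARTS["topic"]
    out = user_vec.copy()
    out[start:end] = WEIGHTS["topic"] * unit(query_embedding)
    return out
```

One nearest neighbor search with `to_search_vector(...)` then blends the query with everything else we know about the user.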
So I'm doing personalized search out of the box, because all the other preference aspects of the user are still in the user vector, and I'm now also putting in the current context: all right, this user is searching for "splitting headache", or whatever example we saw before. That's search vector manipulation; that's how we can build search on top of the same idea. Another aspect of this: let's say you want to improve the diversity of authors for the content that you recommend. This is difficult to express as a set of linear constraints, so you might want this query manager on top to issue multiple queries, for example into different clusters of the authors of the content, and then assemble the results together. This is basically creating multiple searches out of that one initial search, to satisfy some additional constraint around recommendation variety, diversity, that sort of thing. And then finally, this is an approximate system, approximate nearest neighbor retrieval, so you can't immediately guarantee that there won't be something weird in the result set. To actually have guarantees, you need to post filter the results. You should do this minimally, this part shouldn't do the heavy lifting, but you might still want to combine some of the results, or filter out the small percentage of them that slipped through the nearest neighbor approximation. And for that, again, the query manager on top of your vector based retrieval is useful. Okay, so now we have talked about why vectors, and about the types of things you can build; we focused on search and recommendations, and we'll touch on generative AI in a minute. But let's talk about how you will actually get this done, because this is the interesting part. We are motivated, we want to build a vector powered system. I split this section into three parts. First: what do you need to get started? This is some basic demo; you are just playing with vector embeddings and you want to experience how they work. For this you'll need four items, very simple. You'll need your data, ideally unstructured text or images. You need to load this data from wherever you have it right now, maybe on your computer or in a database, into, ideally, a Python notebook; Colaboratory or one of the other online providers of Python notebooks will totally work. Once you load the data, you need to turn it into vectors, and for this you'll need a vector embedding model. There are a bunch of open source models out there; you can, for example, go to Hugging Face and find something that's popular, which tells you it's probably an interesting place to start. There are also APIs, for example the famous OpenAI API, which can generate the vector embedding for a given piece of content. There you'll have to pay, of course, but you can embed thousands and thousands of pieces of content for cents, or tens of cents; the cost is minimal until you start to work with millions or tens of millions of pieces of data. And then finally: you have your content, you have the vectors that represent the content, and you want to find similar vectors, just to see, okay, for this image, which ones are similar? You don't need any extra infrastructure for this.
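For example, a minimal notebook sketch (the model name is just one popular open source choice from Hugging Face, and the toy corpus is obviously made up):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "How to run vector search in production",
    "Pasta recipes for busy weeknights",
    "Scaling approximate nearest neighbor indexes",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # one possible embedding model
vectors = model.encode(corpus)                   # a numpy array, one row per document

query_vec = model.encode(["deploying embeddings at scale"])
scores = cosine_similarity(query_vec, vectors)[0]  # brute force against everything
for text, score in sorted(zip(corpus, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {text}")
```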
There are libraries in Python like scikit-learn with built-in cosine similarity, as in that sketch: you literally have your vectors as numpy arrays and you do a similarity calculation against all your content. There is no indexing going on, this is brute force, and it totally works for even thousands of vectors. It's a great way to get started and get a feel for how this works. You'll have access to the slides, and I have these examples linked; there is a Colaboratory Jupyter notebook that contains an example like that. Okay, so that's getting started, the first steps. Now, the second part: what will you need to add to this to build an MVP? I would advocate for two things. Sometimes people just go with a vector database and call it a day. A vector database is a place where, once you create your vectors, you store them and index them, and then in your product you can query the vector database instead of doing the cosine similarity directly in the notebook. So that's the basic setup for the MVP. But what I would add to that basic package is some approach to evaluation. We are working with vectors, we are experimenting with vectors, because we want to improve the end user experience; this is the whole point. So you need some way to keep an eye on the quality, and on what the users think about it. The first step is just eyeballing the results, sanity checking: do they make sense? Let's look at 20 random inputs, let's look at the top ten results for each. This sanity checking is priceless; you definitely need to start there. You'll find a bunch of issues, you'll need to tweak your models, choose different embedding models and so on. The second step is quantitative evaluation. If you have data labels or annotations from before, let's say for searches, which results people actually clicked on, or a similar data set for recommendations, you can back test the vector powered implementation of your search or recommendations and see: do the things people tend to click on appear high up often enough in the vector powered results? Is there a chance this will actually work? Then, once you have the MVP out in the product, it's useful to collect user feedback: are people interested in the results, do they think they make sense? And then finally, and probably most importantly, the analytics: are people actually interacting more with the content, are you achieving your goal, are you getting more people to apply for a job, or whatever your recommendation or search use case is? The analytics is the be all and end all. That's where A/B testing comes in and all of that other stuff, but maybe not for the MVP; some basic analytics setup, however, is something I would definitely recommend, because that's how you learn whether your MVP should transition into the next stage. And the next stage is where this all gets very complicated, because you are not just vectorizing pieces of content and letting those vectors be; you are now assembling all those different constraints. Previously we might have just worked with one embedding model, the semantic embeddings: vectorizing the content, vectorizing the queries, matching the two. Simple, right?
But if you want to do the stuff I described before, where you assemble content signals and collaborative signals and all this other stuff into one vector, and have a state of the art system, you need a few more components. I'll quickly run through this. You'll need a sort of serving system that has APIs, and on one side you'll push data into those APIs: you'll push your content and user metadata, and you'll push in your events, which is what the users are actually doing in the app. We are using the user history as well, not just the semantic embedding of the content and the query, and we are also using the user metadata; that's why the user data is also on the input. So all of this should be coming into this vector powered system that you have built. The next step from there is the transformation. All of this data has properties, different parts of it, that ultimately should make it into the vectorization. Let's say you have a job and you want to classify the seniority required for that position; the job title is probably pretty important for that. The transform step is about pulling that seniority-relevant key from the content object and making sure it reaches the classifier or vectorizer that resolves, okay, what is the seniority required for this job? Then you have the vectorization, which is where you'll be loading the embedding models from Hugging Face. But you'll probably also have some of your own models, because there is always some aspect of your data that's special to your product, and your embedding models need to be able to support that. Sometimes this is just loading a pretrained large language model, but sometimes it's, say, vectorizing recency, like we showed on one of the previous slides: your users care about fresh content, so you want to add that as one of the features. And now you have a problem: how do I do that? How do I express recency in a way that doesn't force me to recompute the age of the content? I don't want to put the age of the content into the content vector and then have to run a pipeline every night to update it; that's not good. So you need to be really mindful about how each of these data properties makes it into the vector, so that it is available during the nearest neighbor lookup to guide the recommendation. That's the vectorization step. And then we have the database, which holds the user and content vectors. Obviously you are building this on top of the vector database: you have your user and content vectors in there, and they're up to date. Ideally, if you want TikTok-level recommendation performance, these vectors are updated in real time, so that when the user clicks on something or does something, you immediately feed that event into recomputing your vectors. And then finally you do the retrieval: you do the nearest neighbor search to power your recommendations, and through an API layer you surface it into your product's search and recommendations. We talked about bot detection through user-to-user similarity, user segmentation, and other use cases. Sometimes, when the user is searching, for example, you might have to feed the query back through the vectorization before you can do the retrieval.
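As an aside, coming back to that recency question: one possible trick (purely an illustration, not necessarily how any particular system does it) is to encode the creation timestamp as a point on a slowly rotating circle, so the stored vector never needs updating and its similarity to a "now" vector naturally decays with age:

```python
import time
import numpy as np

def recency_vector(created_at: float, period_days: float = 90.0) -> np.ndarray:
    # map the timestamp to an angle; one full turn of the circle per `period_days`
    angle = 2 * np.pi * created_at / (period_days * 86400)
    return np.array([np.cos(angle), np.sin(angle)])

# similarity to "now" drops as content ages, with no nightly recompute of an "age" field
now = recency_vector(time.time())
day_old = recency_vector(time.time() - 1 * 86400)
week_old = recency_vector(time.time() - 7 * 86400)
print(np.dot(now, day_old), np.dot(now, week_old))  # the week-old score is lower

# caveat: beyond half a period the angle wraps around, so in practice you would
# cap very old content or pick the period to match your catalog's lifetime
```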
So that is roughly the anatomy of a system you would need to really pull this off in production at scale; you can expect something looking like that. All right, and finally, let's look at the generative part. What does all this have to do with generative AI? This is the current hype, and it's actually, in large part, driving the vector hype, so let's connect the dots. What is generative AI? I'll just talk about text, but this obviously applies to other modalities like images. If you think about these large language models, they are basically a function that takes text, a prompt, on one side, and outputs some sort of response to that prompt. Now, the next step after people played with this very basic setup, I prompt you, you give me a response, is to take that response and feed it back into the model. That's where, for example, ChatGPT really took off: because you had that sustained conversation back and forth, the model saw its own previous responses to you at every step of the way, and therefore was able to build a response that feels like you are actually talking to the model. But it's critical to understand that the model itself doesn't have any memory; it only responds to the text it receives on the input. In the chat use case, you are constantly feeding in the whole chat history, or a part of it; if the chat is too long, then just the recent part. This is why, in ChatGPT, if the conversation is long enough, it forgets the stuff that was said a while ago: the context window can only take so much of the history, and it has no other memory, no other place to store, let's say, things it learned about you. So that's the ChatGPT situation. There are libraries that are useful for building on top of generative AI use cases; I would recommend checking out LangChain, and there is a 13-minute explainer video that I found really useful. So again, check the links I added to most slides, there is something useful there. Now, what's up with this screenshot? Here I demonstrate that the system kind of pretends it has memory, to illustrate the point that for it to feel like a chat, it has to constantly feed back the whole chat history and then generate the next response. You can play a game with it where you ask it to make up a number and not tell you, and then you guess, and it gives you feedback on whether the secret number is higher or lower. You can actually play through a game like that: you guess, it tells you if you are too high or too low, and eventually you get the result. It's just that technically it couldn't have made up a number and kept it somewhere in memory, because the chat is its only memory. So this is an interesting case of the model pretending that it thought of a number and didn't tell you, while it has no place to store that number. You can try this yourself; it's actually quite interesting. And so, okay, I obviously set this up: the touch point between what we talked about before, vector representations of data, and generative AI, is this aspect of memory. It's the aspect of being able to use the large language model to generate a vector representation, because that can also be a byproduct.
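For example, here's a minimal sketch of that memory idea (the paragraph chunking rule, the embedding model, and the in-memory "database" are all stand-ins for illustration):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def build_memory(document: str):
    # chunk the document (here: by blank-line paragraphs) and embed each chunk
    chunks = [p.strip() for p in document.split("\n\n") if p.strip()]
    vectors = model.encode(chunks, normalize_embeddings=True)
    return chunks, vectors

def recall_memory(query: str, chunks, vectors, k: int = 3):
    # the "memory lookup": the k chunks most similar to the query,
    # which you would then paste into the next prompt as context
    q = model.encode([query], normalize_embeddings=True)[0]
    best = np.argsort(-(vectors @ q))[:k]
    return [chunks[i] for i in best]
```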
So the output doesn't have to be text; it can be a vector that you then store. Let's say you are processing a large document that doesn't fit into the input text all at once. You might want to chunk it up by chapter or paragraph and then generate vectors that you store in some vector database, so that for each paragraph you have a vector that represents its meaning. Then you can do clustering, similarity search, all kinds of different things on this corpus of memory that you build up with the large language model. And there has been this word floating around: agents. Sometimes the agents can be autonomous; they feed the output text back into the large language model and also use the large language model's output to decide what the next action should be. Should I retrieve something from memory? Should I feed another new input to the model? Those would be autonomous agents. But any sort of agent, or controller module, that you run on top of a large language model can also just have manually controlled logic: all right, first I'll chunk up a document, I'll generate those paragraph vectors, and then I'll use them for search. That's the touch point between vector embeddings and generative AI use cases. Here I recommend checking out AutoGPT, if you haven't yet; that's the most famous example of an autonomous agent. All right. So today we covered why vector embeddings, vector representations of your content but also of your users, are useful; how you can use them to build search and recommendation features for your product; and the different levels of the software stack that you need to pull this off. And then finally, we also looked at the connection between vector embeddings and generative AI. Thanks a lot for joining me. Obviously I like to talk about this topic, and I would love to learn about your use cases for vector embeddings and vector ops, and about the things you are struggling with in this space. As I mentioned, I'm a founder at a company called Superlinked. We are building a vector ops platform that makes building search and recommendations on top of vector embeddings easier. We are interested in talking to people who are in the space, experimenting with search and recommendations, so that we can learn from each other and deliver something useful. So let's connect on LinkedIn and take it from there.
...

Daniel Svonava

Co-Founder @ Superlinked
