Conf42 Machine Learning 2025 - Online

- premiere 5PM GMT

Machine Learning for Advanced AI Search

Abstract

This session explores the use of generative AI to build a product catalog similarity search solution with machine learning and PostgreSQL vector databases. Learn how to leverage AWS services to create powerful search solutions that enhance user experiences and drive better results across industries.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. I'm Sowjanya Pandruju, a cloud application architect at Amazon Web Services, and today I'm here to talk about machine learning for advanced AI search. Before I dive deep into advanced AI search, I want to go over some of the terms we'll be using throughout the presentation. So let's get started.

What is generative artificial intelligence, also known as gen AI? It refers to a class of AI models designed to create new content, such as text, images, audio, video, or code, by learning patterns and structures from existing data. Unlike traditional AI systems that analyze or classify data, gen AI can actually produce original outputs that mimic human creativity, and that's why it's the buzz right now. These models, often based on deep learning architectures, are trained on vast data sets and can generate responses, complete tasks, or even simulate scenarios with remarkable realism. That's why gen AI is so widely used today in applications ranging from content creation and design to personalized recommendations, coding assistance (a big one), and natural language interfaces.

Now that we understand gen AI at a high level, let's talk about how it is powered by foundation models, which I'll refer to as FMs throughout this session. Foundation models are pre-trained on massive amounts of unstructured data: text, images, audio, and so on. These models contain a very large number of parameters, which enables them to learn complex patterns and concepts.
Because of that broad knowledge and flexibility, foundation models can be applied across a wide range of tasks and industries. Organizations can also customize these models: they can use their own data to fine-tune them for specific domains and very specialized applications, which unlocks greater value for those organizations. Really cool.

Now, when we talk about foundation models, the first thing that comes to mind is: what are the inputs and outputs of a foundation model? The inputs can be almost anything; they're typically unstructured data, as we discussed, like text, images, audio, video, or code, depending on the model type. These inputs are used during the pre-training phase to build the model's general capabilities, and again during inference, when users provide prompts, questions, or content to process. And the outputs? They're the newly generated content based on the input: it could be text, a translation into another language, a generated image, a summarized document, an answer to a question you asked, or a prediction of the next step in a sequence, depending on how you phrase your input. The important thing about a foundation model's output is that it is designed to be coherent, contextually relevant, and often creative, reflecting the model's learned understanding of complex patterns and how well it responds to your input. Here's a quick recap.
The input is typically raw or unstructured data or a prompt, such as text, images, or audio, and the output is the generated content, text, images, audio, or the answer to your question, based on the model's learned knowledge.

Now that we've covered the basics, let's talk about how we can customize these foundation models. As I was saying, you give a foundation model a prompt, and the prompt can be anything. In the example I'm using here, the prompt is as simple as "Explain what thermodynamics is to a middle school student." The foundation model, based on its pre-training, uses everything it has learned to generate a response, our output, that explains thermodynamics in terms a middle school student can understand. That's how smart an answer from a foundation model should be.

So how do we customize a foundation model? Foundation models can be customized to better suit a specific task or company, and there are several approaches; I'm going to talk about three main ones. The first is instruction-based fine-tuning: adjusting the model by training it on example prompts and desired outputs, teaching it how to respond to specific instructions more effectively, just as the name suggests. The next one is domain adaptation, which makes a lot of sense for domain-specific needs: you fine-tune the model with domain-specific data, say medical, legal, or financial content, so that it better understands specialized vocabulary and context.
It can play a crucial role in such use cases. The third approach is information retrieval: enhancing the model by integrating it with external databases or search systems, which allows it to fetch and incorporate up-to-date information when generating responses. With this approach you don't have to fully retrain the model; you augment the prompt with retrieved data so that your responses stay accurate. Information retrieval stands out prominently for a lot of organizations, because not everyone can afford the time it takes to retrain a model for specific use cases; this lets them augment the data as they go without retraining. These are some of the customizations we can apply to a foundation model, and they address the question of why we customize in the first place: depending on the use case, these methods allow organizations to tailor foundation models for relevance, higher accuracy, and effectiveness.

Now that we've covered gen AI, foundation models, inputs and outputs, and customization, let's slowly get into our topic. Here I want to introduce the word vectors. In machine learning, a vector is simply an ordered list of numbers. These numbers represent data in a mathematical form that algorithms can easily process; each number in the vector is a component, or dimension. Vectors allow machines to perform calculations like similarity measurement, clustering, classification, and more. Say you have a customer profile represented as a vector: age, income, and number of purchases can all be grouped together as the dimensions of one vector.
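To make that concrete, here's a minimal sketch in plain Python (the profile fields and numbers are invented for illustration) showing customer profiles as vectors and a similarity calculation over them:

```python
import math

# Hypothetical customer profiles as ordered lists of numbers (vectors):
# [age, income in thousands, number of purchases]
alice = [34, 72.0, 15]
bob   = [36, 70.0, 12]
carol = [58, 120.0, 2]

def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Similar customers end up close together in vector space.
print(euclidean(alice, bob))    # small distance: similar profiles
print(euclidean(alice, carol))  # large distance: dissimilar profiles
```

In practice the components would be scaled or normalized first, since a raw income value would otherwise dominate the distance.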
The key point here is that vectors turn real-world data into a format, an array of numbers (floats, to be precise), that machine learning models can understand, so they can compare and manipulate the data efficiently. That's the huge role vectors play in machine learning.

Now that we've talked about vectors, what is a vector embedding? That's the next question on our minds. A vector embedding, in machine learning, is a way of representing complex data, like words, images, products, or users, each with its own structure and format, as numerical vectors in a continuous, high-dimensional space. Each embedding captures the meaning, features, or relationships of the original data: similar items are placed close together in the space, while dissimilar items are placed further apart. In a recommendation system, for example, two similar movies will have embeddings located next to each other. The key points to remember are that embeddings are learned from data using models, they preserve important properties like similarity and structure, and they allow machines to perform operations such as similarity search, clustering, and classification very efficiently. That's the huge role vector embeddings play here.

The next topic I have is the vector store, and I want to talk about it using a setup that shows what the whole flow looks like. In this setup, raw data such as audio, images, and documents is first sent to a machine learning model for embedding; here I'm using Amazon Bedrock, and Bedrock processes
this data and generates embeddings: dense vector representations that capture the essential meaning and features of the input. As you can see here, that representation is what the vector embeddings look like. These dense vectors are then stored in a vector database; in this sample I'm using Amazon OpenSearch Service as the vector database, which supports vector search capabilities. Storing the embeddings enables efficient similarity search, and it also supports semantic retrieval and recommendation systems, so users can quickly find data points that are meaningfully related based on their vector proximity.

Now that we understand foundation models, vectors, vector embeddings, and vector stores, let's look into some of the vector databases, and I want to quickly highlight PostgreSQL here. Before I do, let me also talk about vector data stores in general. In this talk I'm covering mostly AWS services, because that's what I'm most familiar with. In the previous example I showed Amazon OpenSearch Service, but that's not the only data store available; there are multiple other vector stores as well, and they all work efficiently, but depending on your use case you might pick one over the other.

The first one is Amazon Kendra. It's an intelligent search service that now supports semantic search using vector embeddings. It helps users find relevant information based on meaning, not just keywords, so it's a very powerful tool. This would be a good fit for you if you're looking for a low-code or no-code solution and rapid deployment, because it supports a lot of data ingestion sources.
It has 40-plus connectors and more, so you don't have to deal with things like data chunking, embeddings, and choosing matching algorithms; it does all of that for you. A really powerful service to go with.

The next one is the Amazon OpenSearch Service that I used in the previous example. It's an open-source search and analytics engine that supports dense vector fields and k-nearest-neighbor (k-NN) search. It's useful for use cases like recommendation systems, semantic search, and retrieval. This would be a good fit if you're already using Amazon OpenSearch Service, you prefer a NoSQL-style solution where you don't want to deal with SQL, and you need low-latency queries and high search accuracy.

The option most relevant to our use case for this presentation is Amazon RDS and Amazon Aurora. These offer PostgreSQL with the pgvector extension, which I'll talk about in the upcoming slides. This is a managed relational database that now supports vector storage and querying as well, so it's great for applications that need tight integration of structured data with vector search, and it also helps with personalized recommendations. It gives you hybrid search, as I mentioned: you can combine your vector search with your structured data search. This is a good fit when you prefer SQL, you're already running RDS or Aurora PostgreSQL, and you want to keep the application data and the AI/ML vectors in the same database for better governance and faster joins.
That, I think, is the biggest advantage you get with this solution: keeping the application data and the AI/ML vectors together in the same database.

Now that we've discussed some of the vector data stores, I want to dive deep into PostgreSQL and pgvector, and why and how they're beneficial to what we're trying to achieve. One more thing about all of these services: one of the advantages of using them is that they come with all the bells and whistles. You get role-based access control, authorization, and authentication, and they are fully managed services with serverless options available, so you don't have to manage the infrastructure yourself and can work on the actual meat of the problem instead.

For this use case, as I mentioned, we're going to focus on PostgreSQL in Amazon Aurora or Amazon RDS, with the pgvector extension. So why PostgreSQL? Why have I chosen it for this use case? PostgreSQL is becoming a popular choice as a vector database because it combines the reliability of a traditional relational database with new capabilities to store and query vector embeddings. If you ask me why we might want PostgreSQL as a vector database: first, there is the pgvector extension, which adds native support for vector data types and similarity search. Second, flexibility: it lets you store structured data, like customer data and product metadata, alongside vectors in one database, which enables really powerful hybrid search.
You can combine filters and similarity, and toward the end I'll show you an example of what that looks like with PostgreSQL and pgvector. Third, scalability and maturity: PostgreSQL is battle-tested, highly scalable, and already trusted for mission-critical applications. It's not something new; it's been around for a while and does a great job. And it's cost effective. Why do I say cost effective? RDS might not sound cost effective, but for many use cases it avoids the need for a separate, specialized vector database, which simplifies the architecture and reduces cost, because vectors are now supported natively. So PostgreSQL with pgvector can give you a very powerful, unified platform for building intelligent, vector-driven applications with familiar tools and minimal overhead.

Let's talk a little bit about pgvector. It's an open-source extension for vector search in PostgreSQL. To spell out the key features: it adds a vector data type, which enables storing vectors, and as I mentioned, you can think of a vector as an array of numbers, specifically floats, supported as a native data type. It provides similarity search, finding the vectors closest to a given vector based on metrics like cosine similarity, Euclidean distance, or inner product. It can build indexes for fast search. And it integrates seamlessly with SQL.
Given all of these advantages, it's a good choice, and there are many good use cases, as I mentioned: semantic search, recommendation engines, personalized content delivery, and so on.

Now I want to walk through a pgvector example using a nearest-neighbor query. First, let's talk about the similarity metrics we have at hand. pgvector enables nearest-neighbor searches in PostgreSQL using three common similarity metrics. The first is Euclidean distance, which measures the straight-line distance between two vectors. It's very useful when physical closeness or raw differences really matter, and in pgvector it's written as the `<->` operator, the hyphen inside angle brackets that you see on the slide; we'll see how that plays into the picture in the next slide, where I'll give you the actual sample. The second metric is cosine similarity. As you can see in the picture, it measures the angle between two vectors, focusing on their orientation rather than their magnitude. One of the most common use cases is text embeddings, where the direction of the meaning matters more than the magnitude; pgvector's cosine distance operator is `<=>`. The third is the dot product, which I also refer to as the inner product, written `<#>` in pgvector. It multiplies the two vectors element by element and sums the results. It's often used when embeddings are normalized, and a higher dot product means greater similarity.

A typical query pattern in pgvector is finding the vectors closest to a given vector using one of these distance or similarity measures, which is the sample I'm going to show you right now. You can also combine similarity search with SQL filters
like product categories or user segments. That's the sample we're going to look at now.

So I'm going to give an example of a nearest-neighbor query. To do this, I create a table called test_embeddings with two columns, product_id and embedding. The embedding column uses the vector data type, which is possible because I have the pgvector extension installed. Next, I insert some data: four data points, each with a product_id and an embedding. The product_id is a bigint, and the embedding is the vector data type. This is sample test data: the first item is product 1 with its vector embedding, the second is product 2 with its embedding, and so on. In the real world, these embeddings, the [1,2,3] and [2,3,4] values you're seeing, would come from your machine learning model.

Now, a nearest-neighbor search looks like this: a SELECT query for product_id and embedding, ordered by how similar the embedding is to a given vector. You see that `<->` right there beside embedding? That's what we saw on the previous slide, where each metric has its own symbol: `<->`, `<#>`, and `<=>`. It means my sample is using the Euclidean distance metric. So we're selecting product_id and embedding from the test_embeddings table and ordering by the embedding's Euclidean distance to the given vector. You can see how easily we were able to integrate this vector data type into our regular structured SQL.
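To tie the pieces together, here's a plain-Python sketch of what that query does. The SQL in the comments is my reconstruction of the slide (the talk mentions the vectors [1,2,3] and [2,3,4]; the other two rows are made up), and the three metric functions mirror pgvector's distance operators:

```python
import math

# Equivalent pgvector SQL (reconstructed from the talk; assumes the
# extension is installed with CREATE EXTENSION vector):
#
#   CREATE TABLE test_embeddings (product_id bigint, embedding vector(3));
#   INSERT INTO test_embeddings VALUES
#       (1, '[1,2,3]'), (2, '[2,3,4]'), (3, '[3,4,5]'), (4, '[10,10,10]');
#   SELECT product_id, embedding FROM test_embeddings
#   ORDER BY embedding <-> '[1,2,3]' LIMIT 2;
#
# pgvector's three distance operators:
#   <->  Euclidean distance
#   <=>  cosine distance
#   <#>  negative inner product

rows = [(1, [1, 2, 3]), (2, [2, 3, 4]), (3, [3, 4, 5]), (4, [10, 10, 10])]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

# Nearest-neighbor search: ORDER BY embedding <-> query LIMIT 2
query = [1, 2, 3]
nearest = sorted(rows, key=lambda r: euclidean(r[1], query))[:2]
print([pid for pid, _ in nearest])  # → [1, 2]
```

Running this without a database reproduces the LIMIT 2 result from the talk: the two products whose embeddings are closest to the query vector.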
And how powerful and easy it is: if you already have SQL knowledge, there isn't a big learning curve or maintenance burden for you either. When I run this query, I get a response with only two rows, because my query says to limit it to two results; it finds them and returns the data correctly, as you can see.

In summary, pgvector helps you build a product catalog similarity search this way, by allowing you to store product embeddings, the vector representations of products that you get from your foundation models, directly in your PostgreSQL database. You can generate embeddings from product attributes like titles, descriptions, and images using your machine learning models, and with pgvector you can efficiently perform nearest-neighbor searches to find the products that are most similar based on semantic meaning, not just keywords.

As a recap of what we've discussed, this approach gives you a lot of key benefits. You have unified storage: product metadata and embeddings live together in one database. You get flexible similarity search: as we just saw, we used Euclidean distance to find related items. You can write hybrid queries: I can combine traditional SQL filtering with vector-based similarity search, which gives highly relevant results. And I didn't need a separate vector database; I was able to extend my existing PostgreSQL setup. All together, in short, pgvector enables fast, intelligent, and very scalable product similarity search.
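The hybrid-query benefit can be sketched the same way. The category column here is hypothetical (the talk's table only had product_id and embedding), and the SQL in the comment shows what the equivalent filtered pgvector query would look like:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical product rows: (product_id, category, embedding)
products = [
    (1, "shoes",  [1.0, 2.0, 3.0]),
    (2, "shoes",  [9.0, 9.0, 9.0]),
    (3, "shirts", [1.0, 2.0, 3.1]),  # closest overall, but wrong category
]

# Hybrid search: filter on structured metadata first, then rank by
# vector similarity -- roughly what this SQL does:
#   SELECT product_id FROM products WHERE category = 'shoes'
#   ORDER BY embedding <-> '[1,2,3]' LIMIT 1;
query = [1.0, 2.0, 3.0]
candidates = [p for p in products if p[1] == "shoes"]
best = min(candidates, key=lambda p: euclidean(p[2], query))
print(best[0])  # → 1
```

The structured filter narrows the candidate set before the similarity ranking, which is what makes keeping metadata and vectors in one database so convenient.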
It brings powerful AI vector search capabilities into already familiar workflows. I hope I was able to explain the use case and how to build advanced AI search using PostgreSQL with pgvector as the data store, and I hope you learned something new through this presentation. Thank you, and have a good day. Thanks a lot.

Sowjanya Pandruju

@ AWS



