Conf42 Machine Learning 2024 - Online

Enriching Generative AI as Events in Real-Time Streaming Pipelines

Abstract

Let’s build streaming pipelines that convert streaming events into prompts and call LLMs and process the results.

Summary

  • Tim Spann: My talk is "Enriching Generative AI as Events in Real-Time Streaming Pipelines." There are a lot of different types of data. Fortunately, there are open source tools out there that make this a little easier.
  • There are lots of different compute types, and we'll see these expand as more types of GPUs and more advanced compute come out. One thing that we've been doing is real-time data pipelines. Being able to query your database with audio is pretty awesome.
  • I have a Medium account with some articles there, and very fortunately Medium gives you an RSS feed of your latest couple of articles. I convert the HTML into a format that can be parsed, and from there parse it out into small chunks. Optimizing those chunks is always a fun exercise.
  • We've got a universal Slack listener tied to my Slack group, which is open to all. If you want to ask questions of my bots, go ahead; one of the accounts is me. There are plugins and things you could add to ChatGPT and other tools.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
How's it going? Tim Spann here, principal developer advocate, covering some interesting topics in generative AI, real-time streaming, vector databases, unstructured data, lots of cool stuff. My talk is "Enriching Generative AI as Events in Real-Time Streaming Pipelines" because I couldn't fit any more words in the title, and I haven't thought of a cool, catchy way to put all that into something a little simpler. So we'll give that a try through the talk and make this as straightforward as possible. If you're trying to contact me or want to know more about the interesting things we're doing with generative AI, reach out to me: GitHub, Medium, DZone, lots of places you'll find me if you look. I do a weekly newsletter covering a ton of different projects: open source for data, unstructured data, vectors, streaming, real time, IoT, Python, Java, JavaScript, lots of cool stuff. Check it out. You don't have to subscribe; you could just check episodes as they show up in the GitHub. And we'll be adding more multimedia and multimodal content real soon, so check that out.
So let's get into it. We're going to build some streaming pipelines. Hopefully everything's going to be going fast. As you might imagine, if you haven't been around the last couple of years, there's a lot of different types of data. Like, a lot, more than you'd expect. You have the data we've been used to for a while, which is huge enough; the structured data coming out of databases and other things is big. But then when you start adding things like text, images, video, logs, all kinds of really cool advanced 3D models, chemical structures, emails, social media data, all of a sudden there's a lot of data, and that's just growing. And now, with the explosion of things like ChatGPT, there are some serious use cases for this data, and it can all come together.
I can search my text data using an image, search my videos, and bring all this together, the structured and the unstructured, and make sense of all the data. There's going to be some really cool stuff coming out. Right now there's stuff everywhere, and it's not easy to manage, especially when you may have some cats distracting you. Whether it's local, in containers or Docker, you're running on Amazon, you're trying to do video analytics, or you're not sure which cloud you're running on. Got to feed stuff to Slack, grab stuff from Slack. There's a lot going on.
Fortunately, there are some tools out there in open source to make this a little easier. You don't have to choose between easy and open source anymore. The power of some open source projects out there is making this a lot easier, so you don't have to worry about it. And not just open source, but open source with a powerful community, where a lot of people are working together and a lot of people are using the same code, which is great. A big shout-out to the Linux Foundation for AI and Data, which is really making this possible, especially within the hybrid cloud environments that we're seeing with all the different Kubernetes installs and big environments. You could start off with a couple of nodes, and before you know it you've got billions of vectors out there, tons of data.
But what's nice is that it's easy to get started. I could just work in a notebook (I'll show you that), reuse code, use it fast, and integrate with all the cool tools you want to use, whether it's something like OpenAI, LangChain, or LlamaIndex. There's a ton of cool stuff out there, and you could do some really advanced things. Everyone starts off with maybe a little RAG, maybe just some text in there. It's a really good use case: maybe just all the documents that you need to search. I do that for me.
With all my articles and different content, you could find my stuff really fast, ask it questions, and maybe get better answers than I'd give you. Depends what the data is. There are lots of other things we're not going to go into, like the difference between dense and sparse embeddings, or filtering. There are so many features coming. Whether you're going to just run on your laptop, run in the big cloud somewhere, use all the big-name tools, or just use a couple of open source ones, it's all out there. Cool stuff going on, and a ton of different use cases supported.
Again, we won't go into that, but I just wanted to mention this quickly: there are a ton of different index types. If you use some of the entry-level vector systems, you'll be like, oh, we've got an index. And one index is great until you start looking at different types of data, different use cases, how you're going to use that data, and how you're going to search it. There are a lot of things to think about when you start going into production and you start adding a ton of different data sources. Coming from the world of traditional big data, it took us a long time to figure out how to access the data, what type of data it is, and what the best ways are to index, search, and find data wherever it is. And that's certainly evolving in the unstructured realm as well.
You need to be able to support all the different types of searches you might want to do, not just the top-k ones, not just grouping or filtering. There are lots of different things you could do and combine. Being able to have multi-tenancy, with all these collections and partitions, is really important once you start moving into production. Whether that's production on premises or in any of the clouds, wherever you need to run, you're going to need that. Some data just can't live with other data, maybe for performance, maybe for legal reasons, maybe because it just shouldn't cross the barrier.
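As a rough illustration of the top-k search combined with filtering and tenant separation mentioned above, here is a minimal sketch in plain Python. It is a brute-force scan, not how a vector database indexes data, and the record layout (`id`, `tenant`, `vector`) and the function names are assumptions made up for this example rather than any product's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, records, k=3, tenant=None):
    """Return the k records most similar to `query`.

    `tenant` is a hypothetical metadata filter: when given, only records
    belonging to that tenant are considered, so data never crosses tenants.
    """
    candidates = [r for r in records if tenant is None or r["tenant"] == tenant]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query, r["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Tiny made-up collection with two tenants.
records = [
    {"id": "a", "tenant": "blog", "vector": [1.0, 0.0]},
    {"id": "b", "tenant": "blog", "vector": [0.7, 0.7]},
    {"id": "c", "tenant": "legal", "vector": [1.0, 0.1]},
]

blog_hits = top_k([1.0, 0.0], records, k=2, tenant="blog")
all_hits = top_k([1.0, 0.0], records, k=2)
print([r["id"] for r in blog_hits], [r["id"] for r in all_hits])
```

The filter runs before the ranking, so a record from another tenant can never appear in the results even when it is the closer match.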
There are lots of different compute types, and we'll see these expand as more types of GPUs and more types of advanced compute come out. There's already a ton, and fortunately the open source tools are taking advantage of that, which is important.
Now, if you remember some of the architectures I've walked through in the past, this is going to look familiar. The way Milvus works is really similar to some other advanced systems out there, because there really aren't that many ways to write these so that they work and scale out across all the different layers; you've got to scale out compute separately from storage. If you were around when we did some of the Pulsar talks, it's going to sound pretty familiar. We've got workers that handle querying your data and indexing your data. What's nice is storing things out to an object store. Even in a small use case, you could use MinIO. That works pretty awesome. Give people access to your data. We've got etcd in there. Again, as part of a cloud native system, it makes sense to use that for your metadata so you can scale out really easily. We're using Kafka or Pulsar to distribute messages between all these different systems. Makes a lot of sense. It's the type of app we've been building anyway, so it makes sense that an advanced system would use that.
We'll go through some of these use cases really fast, but there are a couple of them you don't think of right away. Certainly the augmented retrieval; we all have to do that now with text, chat, blogs and stuff. But there are also things like molecular similarity search. I was talking to a guy a couple of days ago, and he was super excited about this because they're trying to find some uses for some of the materials they have, and that stuff works out really well. That's really hard to do in most systems, but we'll dive into that in future talks. Just a couple of quick ones.
Show you some cats. Scalability, different types of indexing, support for all the major languages for clients. It's all out there.
One thing that we've been doing is real-time data pipelines. We've seen that certainly some data is going to be loaded in batch like it was before, especially getting all your existing content, documents, and websites loaded that first time. That's going to be a batch, and a batch should be done pretty carefully. Make sure you get everything; double-check it. Certainly that can be done with the same tools, but you've got to watch that first round. After that, a new document comes in, something comes out of Medium, something comes off a Slack channel, and I need to handle that right away. I want to be able to get this data, whether it's to get it into a vector store like Milvus so I have it available to build up a prompt, or I'm enriching it, transforming it, or that's how I'm getting in a request to look at the data. We also have to integrate with whoever needs to ask questions of the AI, or even just do a search of those vectors. Very often that's my final answer; I don't have to go any deeper. If I want something like, well, how do I build this? How do I build my first novice app? I don't need artificial intelligence to do that. There are a number of great articles that will come up right away, so you don't have to do that.
But, like I said, it's about building up the prompts and getting the proper context, so when we do need to write a question, we write it correctly. It's also about connecting with things like GPTCache, because sometimes that question has been answered already. Why spend the money or the time calling out to something like ChatGPT, or even one of the free models, if I don't have to? Let's save energy, money, time, and network bandwidth. Don't call out if you don't need to; it's already been done. Let's make it smart.
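The "don't call out if the question has already been answered" idea can be sketched with a small in-memory cache. This is not the GPTCache API, which also supports similarity-based matching; it is a simpler exact-match stand-in, and `PromptCache`, `ask_model`, and `fake_model` are hypothetical names introduced only for this example.

```python
class PromptCache:
    """Exact-match prompt cache: a simplified stand-in for a tool like GPTCache."""

    def __init__(self, ask_model):
        # ask_model is any callable that takes a prompt string and returns an answer.
        self._ask_model = ask_model
        self._answers = {}
        self.misses = 0  # how many times the model actually had to be called

    @staticmethod
    def _key(prompt):
        # Normalize case and whitespace so trivially different prompts share an entry.
        return " ".join(prompt.lower().split())

    def ask(self, prompt):
        key = self._key(prompt)
        if key not in self._answers:
            self.misses += 1
            self._answers[key] = self._ask_model(prompt)
        return self._answers[key]


def fake_model(prompt):
    """Placeholder for a real (slow, possibly paid) model call."""
    return f"answer to: {prompt}"


cache = PromptCache(fake_model)
first = cache.ask("What is a vector database?")
second = cache.ask("  what is a VECTOR database? ")  # served from the cache
print(first == second, cache.misses)
```

An exact-match key only helps with repeated or near-identical questions; catching paraphrases would need an embedding lookup in front of the model, which is the part this sketch leaves out.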
Let's work with whatever we need to work with, quickly. We can work with unstructured data with things like NiFi or with Towhee. There are a ton of different things out there to do it. But the days of just looking at CSV, JSON, Protobuf, Parquet, and Avro, that sort of data, are over. You're going to be looking at zip files, every type of image, every type of document, some of them human-readable, some of them not, some of them binary. You'll be looking at videos, maybe live streams, sound. Being able to query your database with audio is pretty awesome. So is being able to have those advanced UIs that you only thought of as maybe something Apple can do with a phone. Now I can have that in my basic applications, where I just talk to you and you give me the results, or I send in an image and you give me back a document. There are so many options now. We've only just touched on them. It's pretty awesome. But we need to be able to work with all these different unstructured data types, and there are going to be more. There are probably going to be some new optimized ones. I'm sure GenAI is going to create some awesome new format that combines multimedia in a very rich, easy-to-search, compressed format. Waiting for that.
If you haven't seen NiFi 2.0, we just got the M3 release, and this adds some serious features that make it a very nice open source tool for the streaming part of the house, where I'm getting data, regardless of whether it's structured, unstructured, or semi-structured, and getting it to whoever needs it as quickly as possible. Often there's Kafka or Pulsar in there; drop that off to my vector store, and we're off and running. What's nice is being able to leverage Python in additional places. NiFi now makes Python a first-class citizen, so you can write full libraries in Python and make them available to your entire NiFi cluster, which is awesome.
So as long as you have Python 3.10 or greater. If you're not running Python 3.10 or newer, you should look at your infrastructure, because there are security issues out there. Upgrade to the latest or near latest. Trying to get all these libraries working is sometimes fun, but fortunately with Kubernetes and other Python options you can control some of that. I wrote one for taking an address and turning it into latitude and longitude. I found I needed this because I wanted to do something like have someone ask me a question, and before I sent it to GenAI I wanted to see: is this something where geo is important? Again, that crossing of the data types. Geo data is structured data, but it's real world and it has implications for a lot of things, so we need to get this part right. So I have a library that uses some pretty awesome libraries out there, especially one from OpenStreetMap that does a pretty good job of getting you to the right place. So that's pretty good.
I'm going to show you a couple of different demos. I don't want to show you too many slides; you'll get these slides. What I have is my Medium. I have NiFi read that in, write back to Slack (and we'll show you some other interactions we could do with Slack), write some stuff to Kafka and Flink to do some analytics, and write everything that needs to be in Milvus there, so I can do RAG later. I could do a lot of things, especially when people ask me questions about my articles. Hey, let's have a prompt-enhanced GenAI do that for me. And then I'll show you a quick little one for images using the Towhee library, which is pretty simple. The more you look into Python, the more you're like, where were these when I was doing all that Java? I still like Java, but some of these Python libraries are amazing. So we're going to show you just a quick one on working with image data in a database like it's nothing. Pretty cool.
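The address-to-coordinates step described above can be sketched as a small function that takes the geocoder as a parameter. The talk does not show the actual processor code, so everything here is an assumption: `address_to_lat_lon` and the stub geocoder are hypothetical, and the comment about geopy's Nominatim only names one OpenStreetMap-based option that fits the same shape.

```python
from types import SimpleNamespace


def address_to_lat_lon(address, geocode):
    """Turn a free-text address into a (latitude, longitude) tuple, or None.

    `geocode` is any callable that returns an object with `.latitude` and
    `.longitude` attributes, or None when nothing is found. As an assumption,
    geopy's Nominatim(user_agent="...").geocode has this shape; it is not
    called here so the sketch stays offline.
    """
    cleaned = address.strip()
    if not cleaned:
        return None
    location = geocode(cleaned)
    if location is None:
        return None
    return (float(location.latitude), float(location.longitude))


def stub_geocode(query):
    """Offline stand-in for a real geocoder; coordinates are rough placeholders."""
    known = {"Wildwood, NJ": SimpleNamespace(latitude=38.99, longitude=-74.81)}
    return known.get(query)


print(address_to_lat_lon(" Wildwood, NJ ", stub_geocode))
print(address_to_lat_lon("nowhere in particular", stub_geocode))
```

Passing the geocoder in keeps the lookup swappable, so the same function can be exercised with a stub and then pointed at a real service inside a pipeline.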
We'll touch on RAG today. You've seen that before. And I'll leave you some contact stuff so you can start doing some cool stuff once we get into some demos. Hopefully I haven't spent so much time that everything in the world timed out. I'm trying to run this as close to live as I can, so it's like you're there with me at the conference. Hopefully I'll see you at the next one.
So I have a Medium account with some data there, and very fortunately they'll give you an RSS feed of your latest couple of articles, which would be awesome if I didn't have so many articles. So I also download my old articles and have NiFi load them as well. But for most use cases, just the last couple of articles is enough: get that batch processed, however long it takes, get that into your vector store, and then from there grab current data as it changes. So we're just going to grab some data from there, which is surprisingly easy.
Now we are in NiFi. You could download this and run it on your own, whether it's in Docker, Kubernetes, or just on a JVM on your laptop. Pretty simple. I have some code here that is just going to grab my feed from Medium. We could also do this with Python, and I'm looking at rewriting this in Python to see what the difference is in the amount of code. You can see here it's Atom-format RSS, which is basically a type of XML. Again, another format that won't die. For every new format we get, there are three old ones that never die as well. We've got this data and we're going to grab it. I grabbed it here in NiFi, and then I have something that converts that RSS data into JSON, which is a little easier to work with. Then I split it on the channel items, and we can look at the data provenance here and see that happen. It gives you ten articles at a time. I grab the fields I like here, the most important ones, and then I just send that along to build up a new JSON file.
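The talk mentions that the same feed handling could be written in Python. A minimal sketch of the convert, split, and pick-fields steps, using only the standard library, might look like the following. The sample feed, the chosen field names, and `rss_items_to_json` are assumptions for illustration; a real Medium feed carries more elements and namespaces than this.

```python
import json
import xml.etree.ElementTree as ET

# Small made-up RSS 2.0 document standing in for a real feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First article</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second article</title>
      <link>https://example.com/second</link>
      <pubDate>Tue, 02 Jan 2024 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def rss_items_to_json(feed_xml, fields=("title", "link", "pubDate")):
    """Split a feed on channel/item and emit one JSON string per article."""
    root = ET.fromstring(feed_xml)
    records = []
    for item in root.findall("./channel/item"):
        # Keep only the selected fields; a missing element becomes an empty string.
        record = {name: (item.findtext(name) or "").strip() for name in fields}
        records.append(json.dumps(record))
    return records


json_records = rss_items_to_json(SAMPLE_FEED)
for line in json_records:
    print(line)
```

Each JSON string corresponds to one article, mirroring the one-record-per-item split described above, so every record can travel through the rest of the pipeline independently.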
And if I look at this, I've got ten results here that I just ran, and I could take a look at the data. I switch that over into a format to be parsed, because it's HTML, and from there parse it out into small chunks. And I've got my small chunks here. Optimizing those chunks is always a fun exercise, and there are definitely a lot of different ways to do that. I can see this is part of my article about the Irish transit system, which is pretty cool. Love Ireland, nice place to go. Love castles.
So what I have here is that I'm going to send that record into Milvus. What's nice with Milvus is that there's an awesome open source product called Attu, which lets you query all your data and see what's going on. I can see here that I've loaded some articles, and I can see some other things I have here. There's also security and all that kind of stuff. But the main thing is that we see another record come in, and I'm extracting the text that came out. I'm just going to push that to Kafka so that I can distribute it. Also, if there's anyone who needs to know that I published a new article, I can have a Kafka consumer somewhere that sends out a Slack message, a Discord message, an email, a fax, whatever weird thing you want to send. I'm sending out a Slack; I'll probably send out a Discord, maybe one to the Milvus Discord if it's an article about Milvus. I'll probably send that to GenAI to tell me where I should publish this and use its choice. Let's see where it tells me to send my articles. Maybe it tells me to throw them away.
Okay, we can see here that I got a new Slack message. And this is the article, not well formatted, since it was pulled out of RSS, but it is the content of the article posted into my Slack channel for Milvus. You can see it's got links and different stuff embedded in there from the article, and all the code, whatever was in that article, just as a way to distribute that.
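One simple way to go from article HTML to small chunks, as described above, is to strip the tags and then cut the text into overlapping word windows. This is only one of many chunking strategies and not necessarily the one used in the demo; the function names, chunk size, and overlap below are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Note: text inside <script> or <style> would also be collected here.
        if data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


def chunk_words(text, max_words=50, overlap=10):
    """Split text into windows of at most max_words words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


article_html = "<h1>Irish transit</h1><p>one two three four five six</p>"
article_text = html_to_text(article_html)
article_chunks = chunk_words(article_text, max_words=4, overlap=1)
print(article_chunks)
```

The overlap keeps a little shared context between neighboring chunks; tuning `max_words` and `overlap` against retrieval quality is the "optimizing those chunks" exercise.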
And again, we sent some of that content to Kafka, but that's pretty straightforward. We'll just let all those run. It only does ten at a time, but if you take a look, I may have more than ten articles. So when I do the bulk one, I'll go back and do all of last year, maybe just the most recent articles. You probably don't need to see articles from multiple years ago, though there are a couple of cool ones in there, like how to automate all the transit systems in the world with no code. That's kind of cool, I guess, if you're into that. Okay, so it's running here. I've got a couple more going into Milvus, so we'll show you another part of this before this times out.
Now, the other thing I have running here is a universal Slack listener. This is tied to my Slack group here, which is open to all, and I've got the links out there if you want to join it, if you want to ask questions of my bots, or if you want to ask questions of me. One of these is me. I don't know which one. I have like three me's on here. Some of them are bots, some of them are me. See if you can guess. Sometimes I forget which one's me. Whoever gives you the best answer, that's me.
So we have one that just came in, and it's getting processed, coming through the system. Some of them I throw away, because some of them are the results of bots like you saw here, or they're in the wrong channel, or they're just not relevant to anyone. "It would be nice to know the current stock price." Now, that is not taking any AI. I do send everything through a couple of open source AIs thanks to Hugging Face, but sometimes it just doesn't make sense. What are they going to tell me about the current stock price? There are plugins and things you could add to ChatGPT and other tools, and depending on your pipeline you can get that data, but just sending that raw to ChatGPT doesn't make a lot of sense.
One thing I figured out is that if I'm going to let this be open to everyone, I should check things out and make sure it's listening only on one channel. I don't want to look at every channel, especially since I use a lot of them just for debugging. And make sure it's not one of my bots. I really try not to pick up one of my bots, but they do have a lot of users. The other thing I figured out is to clean up my prompts. I found a model on Hugging Face that works most of the time. I've got to look and see if I can find a better model, or maybe train it on my own stuff, because sometimes it's not sure. Like the not-safe ones: a lot of these not-safe ones are safe. It picked a word I didn't think was a problem. I thought maybe it was a curse or something was wrong with it, and usually there's nothing wrong with it, so I don't know. Or sometimes it takes text that's really bot text and it goes, oh, clearly this isn't a human. So sometimes that's actually helpful, but other times it's not.
So we have that coming out. We've got the prompt filtered to make sure it's asking the right question, and then, which I think is kind of cool, I send it to multiple models at once. If you saw my article, why not four models? Doing four models at once, I've got Mistral 8. TinyLlama; the TinyLlama has been having problems recently. Mistral 7. I know, the models change a lot. Microsoft Phi-3, which has been pretty good. And then I'm like, can I call even more models at once? That might be too much for my system, but I also have one that translates to Spanish. It's nice to have content in other languages, as you do: French, German, Hindi. Give me some suggestions for the languages. There are a lot of models out there. I try H2O, Llama 3. So I've got a bunch of models running, and when they get the results back, if they're not total junk, I send them out to Slack. And then I've got a Slack receiver here taking anything I send in, and then I'm publishing it.
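The filter-then-fan-out flow described above (one channel only, skip bot messages, send the prompt to several models at once, drop junk answers) can be sketched with the standard library. The message fields, channel and user IDs, and model callables below are all placeholders invented for this example; the real flow runs in NiFi and calls hosted models.

```python
from concurrent.futures import ThreadPoolExecutor


def should_process(message, allowed_channel, bot_user_ids):
    """Keep only non-empty human messages from the one channel being watched."""
    if message.get("channel") != allowed_channel:
        return False
    if message.get("user") in bot_user_ids:
        return False
    return bool(message.get("text", "").strip())


def ask_all_models(prompt, models):
    """Send one prompt to every model concurrently and keep the usable answers.

    `models` maps a display name to a callable taking the prompt. Models that
    raise or return an empty answer are dropped instead of failing the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in models.items()}
        for name, future in futures.items():
            try:
                answer = future.result()
            except Exception:
                continue  # one failing model should not block the others
            if answer and answer.strip():
                results[name] = answer.strip()
    return results


def broken_model(prompt):
    raise RuntimeError("model unavailable")


# Stand-in models; real ones would call an inference endpoint.
stub_models = {
    "model_a": lambda prompt: f"A says: {prompt}",
    "model_b": lambda prompt: "",  # empty answer, filtered out as junk
    "model_c": broken_model,       # raises, skipped
}

message = {"channel": "C-QUESTIONS", "user": "U-HUMAN", "text": "What is RAG?"}
answers = {}
if should_process(message, "C-QUESTIONS", {"U-BOT"}):
    answers = ask_all_models(message["text"], stub_models)
print(answers)
```

Checking the channel and sender before any model is called keeps bot output from looping back into the pipeline, and isolating each model's failure means a single flaky model does not hold up the others.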
Pretty straightforward, but it gives you an idea. Here also there's some error; I might be out of permissions for one of the stock systems. Yeah, I might have to change a token somewhere. But what it does is find out what company your stock is and then send it on its way. Let's see, maybe we can get weather. Did we get any results back? Oh, we got some stuff. Like we mentioned, all those models: Phi-3 gave me nothing. Mistral 7 told me who AMD was; I guess that's good. Meta Llama 3. And then there's the translation.
What about the weather? What is the weather in... I'm going to Wildwood. It's a nice beach town here in New Jersey. We have beaches. And we're getting some results back really quickly, so that's cool. Here's the weather from a weather forecast. This is probably the smartest one to look at. It's showing me some of the weather from the Philadelphia area, which is somewhat close, from one of the stations there, and gives me the current weather. I did some parsing to get the lat/long from a name, and everyone wants a lat/long. There are some paid services that don't, but it's pretty easy. It tells me what the weather is, generally, from some of the other models. Kind of a cool way to do that, just to give you an idea.
We can also do some other things in here. I can upload an image, and I often will have images of my cat. Obviously an image is not text; it's not a question. It can be a question if I have the right kind of vector store, and I can say, give me pictures similar to this. That's certainly one of the demos we'll do in the future, because that's pretty cool. So that works slightly differently. I have some universal code over here that gets that message in from Slack, makes sure that it's actually an image, downloads it with the proper permissions, and, if it matches, sends it over to image processing, where something failed here. I might have forgotten to install one of the libraries; I should take a look at that. Otherwise it's sending Slack reports if things make it through the system, depending on what that image looks like, whether there are security issues, whatever. And then it just pops up in our reports here. Here we did a little bit of analytics on it: "purple bean bag chair with cat on it, and a colorful room." That's pretty accurate. And you can see we've had other reports; I'm doing some things with real-time traffic cameras. That's in future demos. That'll be the next one, so stay tuned for the conference in six months. How to get started with Milvus, just to make it easier for you: automate everything.
Thank you for coming to my talk. Hopefully it wasn't too long. If you have questions, definitely reach out to me. I am always on LinkedIn, Twitter, Medium, GitHub, Discord; wherever cool data is, I'll be out there. Thanks for watching. See you next time.

Tim Spann

Principal Developer Advocate @ Cloudera



