Conf42 Python 2024 - Online

SuperDuperDB: Bring AI to your favourite database! Integrate, train and manage any AI models and APIs directly with your database and your data


Abstract

Bring AI to your database. Instead of moving your data to complex MLOps pipelines and specialized vector databases, implement AI more efficiently. Integrate and train AI directly with your preferred database using only Python. This framework transforms your database into an AI powerhouse!

Summary

  • SuperDuperDB aims to become the centerpiece of the modern data-centric AI stack. It brings AI to your database's data deployment, so you can build AI without moving data. It is open source on GitHub, licensed under Apache 2.0.
  • SuperDuperDB is a Python package, but at the same time it's a deployment system. It allows you to link different types of computation together, which may or may not involve traditional AI models. You can define whatever functionality you need via its system of wrappers, encoders and connectors.
  • SuperDuperDB allows you to easily create high-level AI functionality by composing components into stacks. You can even parameterize these stacks to make a higher-level interface to your AI. The talk closes with a demo of how this works.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Duncan Blythe, and I'm the founder and CTO of SuperDuperDB. In this talk I'll present SuperDuperDB: our vision and mission, the way we work, how the technology works, Python snippets showing how to get started, and a short demo.

The quotation you see on the screen perfectly describes what we're aiming to do, and that is to transform the simplicity and ease with which developers can get started using AI together with their data. A fundamental problem is that data and AI live in separate silos. Current state-of-the-art methods for productionizing AI require a lot of data migration into complex pipelines and infrastructure. That means maintaining duplicated data in multiple locations, as well as numerous steps and tools, including specialized vector databases. As a result, bringing machine learning and AI to production is very complex. What often happens is that deployments take on a character similar to this depiction, where databases are the initial input nodes to a complex graph of deployments, tools, steps and processes. And the current trend of adding vector databases to this setup only makes things worse.

In 2024, AI needs to come into contact with data in order to be simple to use. Our thesis at SuperDuperDB is that this can be greatly simplified by providing a unified data and AI environment in which no duplication, migration, ETL transformations, MLOps pipelines or extra infrastructure are necessary: one environment combining AI, data and vector search. We do that by bringing your AI to your database's data deployment. It's the environment in which data and AI are unified, which greatly simplifies AI development and adoption and allows you to unlock the full potential of your existing data. With SuperDuperDB, you're able to build AI without moving data. And by building AI, I really mean current state-of-the-art AI: generative AI including LLMs, standard machine-learning use cases, and custom workflows that combine these things.

SuperDuperDB aims to become the centerpiece of the modern data-centric AI stack. This is what it looks like: on the left we have data, so databases, data warehouses, your data. We would like to connect this with AI, vector search, and indeed the Python ecosystem. Currently this is not possible with full generality; with SuperDuperDB, it is. SuperDuperDB acts as a centerpiece, orchestrating and connecting these diverse components. By that I mean you can bring any piece of code from the open-source ecosystem of Python libraries and integrate vector search completely flexibly. It's an all-in-one platform for all data-centric AI use cases. We have a deployment that allows for inference and scalable model training; you can develop models in combination with the platform, putting together very complex workflows; and you can also use the platform for search, navigation and analytics, including modern document Q&A systems. It's built for developers with the ecosystem in mind, and it lets you integrate any AI models and AI APIs directly with your database, leveraging the full power of the open-source ecosystem for AI in Python, which is substantial. We're open source, licensed under Apache 2.0 on GitHub. Please take a look, star the repo and contribute; that's very important.
We are very keen to get contributors on board to improve the quality and the features of the project.

So how does it work? SuperDuperDB operates in combination with your database's data deployment. You can install models and AI APIs and configure them either to perform inference on your data as it arrives, or to fine-tune themselves on your data. Developers can interact with the system from Jupyter notebooks and Python scripts, and we're working on other SDKs, which will hopefully arrive in the coming months. We're also working on REST APIs so that you can easily integrate this with your downstream applications; and since we're Python-first, you can easily interact with it from frameworks such as FastAPI.

Now let's get a little more technical: what does the underlying architecture of a SuperDuperDB system look like? You'll see later that SuperDuperDB is a Python package, but at the same time it's a deployment system. For the Python package to operate, it interacts with the components you see on this diagram. The principal component is the data backend, which corresponds to your traditional database. There are also a metadata store and an artifact store, which save information about models and the model data itself. These three things together get wrapped in the db variable that you'll see in the subsequent code snippets. Work is carried out by a scalable master-worker scheduler (we're using Ray to do this), and work is submitted to the system either directly from developer requests or from a change-data-capture daemon that listens for incoming data. You can also set up a vector-search component, which interacts with the query API: when you select data, you can optionally link the query to the vector-search component. That's still a very high-level view of what's going on; suffice it to say, you can read more in our docs and get into as much detail as you like by exploring our examples and the code base.

So let's have a look at the code. To connect to SuperDuperDB, you simply wrap your standard database URI with our wrapper, and you get an object db with which you can do many of the standard things you would do with a database client, but much, much more. It's sort of the 'super duper' version of your database. Querying the system is very similar to a standard database query: you connect with the backend of your choice, in this case MongoDB, and you execute your query object. The query object, shown here between the brackets, is completely analogous to a standard PyMongo query, for instance. SQL is a bit more involved, because you first need to set up a schema and add a table to the system; but once you have the table, you can perform SQL queries via the Ibis library.

You can also create custom data types, which allows you to do much more than you would with a standard database. For instance, here I'm creating an MP3 data type. The way this works is that you have an encoder object, and you tell the encoder how it should handle the bytes of a data point. Here we're just doing a simple encoding via pickle, but you can do whatever you like. And that's a theme in SuperDuperDB: you can define whatever functionality you need via our system of wrappers, encoders and connectors.
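To give a concrete flavour of the snippets described here, this is a minimal sketch of the connection, query and encoder steps, based on the superduperdb 0.1-era Python API; import paths and signatures may differ between versions:

    import pickle

    from superduperdb import Encoder, superduper
    from superduperdb.backends.mongodb import Collection

    # Wrap a standard database URI to get the `db` object used throughout.
    db = superduper('mongodb://localhost:27017/documents')

    # Queries are completely analogous to PyMongo queries.
    documents = Collection('documents')
    results = db.execute(documents.find({}))

    # A custom MP3 data type: the encoder is told how to turn a data point
    # into bytes and back again. Here we simply pickle, as in the talk.
    mp3 = Encoder(
        identifier='mp3',
        encoder=lambda x: pickle.dumps(x),
        decoder=lambda x: pickle.loads(x),
    )
    db.add(mp3)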
Creating a model is very flexible. Here, for instance, is a very simple model involving just a regular expression. Models in SuperDuperDB are not just PyTorch or Hugging Face models, but really generalized computations. Such a computation can have auxiliary data and can be trainable or not; the whole sense of the project is to link different types of computation together, which may or may not involve traditional AI models. So here, for instance, we import the ObjectModel wrapper, give it the name 'my-extractor', and pass, as the heavy-lifting component of the model, a mapping which extracts URLs from an input string.

You can get much more involved than this. For instance, here we are using spaCy to parse text and essentially do named-entity recognition on it. SuperDuperDB handles saving these diverse bits of data and code into the system, so you're now able to use spaCy to do parsing. The cool thing about this is that there's no need for us, as SuperDuperDB maintainers, to have already built a spaCy integration: you can really just bring it to your SuperDuperDB deployment. And it can go even deeper than this. For instance, you could implement custom API-request handling, with logic for exactly how individual data points (via this predict function) or multiple data points are handled by your model. So it's completely versatile and completely flexible. You'll see throughout that we use the dataclass decorator around our classes; the reason is that this gives a very nice way to expose these models via REST API functionality, so that you can build front ends on top of them.

Applying a model to data in a database is simple via predict-in-db: you simply say which key you would like to operate over and which data you would like to select, and SuperDuperDB will then, under the hood, load the data, efficiently pass it through your model and save the outputs back to the database. This can even happen in an asynchronous, streaming fashion, where you don't need to activate the model yourself. The model essentially takes on a life of its own via the Listener wrapper: you wrap your model with a listener and tell the listener to listen to a certain query. The system then listens for incoming data on that query, and when data arrives, it applies the model and populates the database with outputs over that data.

A vector index operates together with this listener component: a vector index needs to be kept up to date at all times, which is why it works with a listener. You wrap a listener with a vector index, and you instantly make the data underneath the select query searchable.

Creating more complex functionality in which multiple models interact happens via the stack API. You simply list the components you would like to add to your stack and, as before, add the stack to the system. You can even parameterize these stacks to make a higher-level interface to your AI functionality. What you do is essentially perform surgery on your stack, replacing certain values with variables, and those variables become available as parameters in the higher-level app API. So now this app has two free variables, identifier and select, and from it I can very easily share high-level AI functionality.
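Pulling the snippets from this section together, here is a rough sketch of a generalized model, a listener and a vector index. Method names and signatures follow the API as described in the talk and vary between superduperdb versions; my_embedding_model is a placeholder for whatever vectorizing model you bring:

    import re

    from superduperdb import Listener, ObjectModel, VectorIndex
    from superduperdb.backends.mongodb import Collection

    documents = Collection('documents')

    # A "model" is any generalized computation: here, a regex URL extractor.
    url_extractor = ObjectModel(
        identifier='my-extractor',
        object=lambda text: re.findall(r'https?://\S+', text),
    )

    # Apply the model to data already in the database: SuperDuperDB loads
    # the selected data, passes it through the model and saves the outputs
    # back to the database.
    url_extractor.predict_in_db(X='txt', db=db, select=documents.find())

    # A Listener applies the model to incoming data on the query; wrapping
    # it in a VectorIndex makes the selected data instantly searchable.
    db.add(
        VectorIndex(
            identifier='my-index',
            indexing_listener=Listener(
                model=my_embedding_model,  # placeholder embedding model
                key='txt',
                select=documents.find(),
            ),
        )
    )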
And how do I share it? Simply export it, and SuperDuperDB will inspect the model or models you've created and produce a nice JSON-serialized format with references to artifacts.

We're now going to see a demo of this system. Imagine you have a library of video recordings and you would like to search the videos for important scenes using natural language. These could be sensitive recordings, so you don't necessarily want to send requests off to externally hosted APIs. In this case, with SuperDuperDB and very few Python commands, we can simply add the videos to the database, specifying only the URIs of where they are located; create our own custom model to extract and subsample frames from the videos; and vectorize those frames using computer-vision models via PyTorch. Once this is set up, we're ready to search through the frames and return answers to queries such as 'show me scenes where the main actor throws a ball', to take a simple example, and we'll get results in the form of references to places in the video where this may have happened. This is just one example of what you can do with SuperDuperDB; there are numerous examples on our website. Suffice it to say that if you can think of it, you can probably do it.

So let's start. This is a Jupyter notebook, and we're going to be interacting with SuperDuperDB from it. Let's connect: we're connecting to MongoDB, and we're going to use the collection 'videos' to store the data about the videos; we have this db connector. Let's create a special data type, video-on-file, which essentially tells us where our videos are. You'll see here that we have the URL of a video; let's have a look at it. It's a video of lots of different animals. We could potentially add multiple videos here, but we're just going to add this one to the database. Adding a video from a public URI (it could also be an S3 URI) is a simple matter of inserting that URI. Under the hood, the system is downloading and caching the data so it can be used in computations; all of this happens automatically, and you don't need to specify anything. We can see that we now have a single document in the collection, containing the reference to the URI and the data on file; you can see that in more detail here, where there is the cached video.

Now I'm going to use the OpenCV library to create my own custom model, which takes the data from this video and subsamples frames, saving those frames back in the database. The logic here isn't important; suffice it to say that I can create any custom logic I like. Now that the model has been created, let's apply it to the data in the database. You'll see it's iterated through the frames, and we've extracted one data point here and verified that a frame has been saved in that data. You can see that in more detail here: if I take one document out of the outputs collection, you'll see there's a Python-native image inside, which we extract with this execute query.

Now I would like to make those frames searchable. We're going to use a PyTorch model imported from OpenAI's CLIP package, so this is a self-hostable model, and we're actually going to use two model components: one for the visual part and one for the textual part.
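A rough sketch of these ingestion steps follows; the URI, the frame-sampling logic and the encoder options are illustrative, and only the overall pattern follows the demo:

    import cv2  # OpenCV, for reading and subsampling frames

    from superduperdb import Document, Encoder, ObjectModel
    from superduperdb.backends.mongodb import Collection

    videos = Collection('videos')

    # A data type that keeps videos 'on file': the database stores only a
    # reference, while the bytes are downloaded and cached automatically.
    video_on_file = Encoder(identifier='video_on_file')
    db.add(video_on_file)

    # Insert a document that merely references the video's URI (a public
    # URL here; an S3 URI works the same way).
    db.execute(
        videos.insert_one(
            Document({'video': video_on_file(uri='https://example.com/animals.mp4')})
        )
    )

    # A custom model that keeps roughly one frame per second with OpenCV.
    def subsample_frames(path):
        cap = cv2.VideoCapture(path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25  # fall back if FPS is unknown
        step = max(int(fps), 1)
        frames, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:  # keep roughly one frame per second
                frames.append({'image': frame, 'timestamp': index / fps})
            index += 1
        cap.release()
        return frames

    frame_extractor = ObjectModel(identifier='frame-extractor', object=subsample_frames)
    frame_extractor.predict_in_db(X='video', db=db, select=videos.find())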
Now that those models have been set up, let's wrap them with a vector index. We're going to create a vector index which is essentially multimodal: it has an indexing listener and a compatible listener, which means the images can be searched either through a textual interface or an image interface. Now the vector index has been set up and the images have been vectorized, so we can search through those frames. Let's look for, for instance, 'elephants in the woods'. This query searches the outputs collection using the search term, referring to the index we've created, and we're able to extract the timestamp from the results, which are simply MongoDB documents. Once we have that timestamp, we can find the position in the original video. So let's confirm: we searched for elephants in the woods, and we have an elephant in the woods. Let's check this wasn't a fluke: 'monkey is playing'.
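For readers following along, the vector-index and search steps of the demo might look roughly like this, assuming the db connection from earlier; visual_model and text_model stand in for the two CLIP components, and the outputs collection name is illustrative:

    from superduperdb import Document, Listener, VectorIndex
    from superduperdb.backends.mongodb import Collection

    frames = Collection('_outputs.video.frame-extractor')  # illustrative name

    # A multimodal vector index: the indexing listener vectorizes the
    # stored frames with the visual CLIP component, while the compatible
    # listener lets plain-text queries hit the same index.
    db.add(
        VectorIndex(
            identifier='video-search',
            indexing_listener=Listener(
                model=visual_model,
                key='image',
                select=frames.find(),
            ),
            compatible_listener=Listener(
                model=text_model,
                key='text',
                select=None,
                active=False,
            ),
        )
    )

    # Search the frames with natural language and read off the timestamps.
    results = db.execute(
        frames
        .like(Document({'text': 'elephants in the woods'}), vector_index='video-search', n=3)
        .find()
    )
    for r in results:
        print(r['timestamp'])  # position in the original video, as stored above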
So there you have it: in very few Python commands, videos made searchable, completely configurable and self-hosted. You configure all the steps of logic yourself via Python, following this template, and save the results in SuperDuperDB. Would you like to know more about SuperDuperDB? Find us on GitHub at superduperdb/superduperdb, check out our documentation and example use cases at docs.superduperdb.com, and try out the code with a simple pip install: pip install superduperdb. Happy coding!
...

Duncan Blythe

Founder & CTO @ SuperDuperDB

Duncan Blythe's LinkedIn account


