Conf42 Large Language Models (LLMs) 2025 - Online


LLM-Enhanced Multimodal AI: Revolutionizing Audio & Video Interaction Technologies


Abstract

Discover how LLM-enhanced AI revolutionizes audio and video media with advanced speaker diarization and topic segmentation. Experience precise navigation and personalized content through sophisticated multimodal technologies that enhance user engagement and accessibility.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Welcome to my presentation today on the topic of LLM-enhanced multimodal AI and how it is revolutionizing audio interaction technologies. A little bit about me: I'm a Senior Staff Software Engineer at Intuit with over 17 years of full-stack software development experience, specializing in mobile and AI technologies. Currently I lead the development of GenAI mobile applications. I'm passionate about technology, and I love learning and sharing my knowledge through blogs, webinars, publications, and patented innovations.

Let's get started. As we have seen, there has been a significant surge in audio content in the past few years, with podcasts, audiobooks, and online courses becoming much more mainstream. With that, an essential challenge has emerged: content overload. With so much material available, listeners get overwhelmed by the sheer volume of audio content, and this leads to navigational difficulties. Unlike text-based mediums, where you can easily search or skim for specific information with a simple Command-F or Ctrl-F, audio is inherently linear, and listeners find it cumbersome to manually navigate through hours of content without a clear way to locate the specific information they're looking for. Ultimately, this creates a gap between what listeners want and what is currently available. Users are looking for a solution that lets them access relevant segments quickly, so they can enjoy the benefits of audio content without feeling lost in the noise. Addressing these challenges is what our AI-powered solution aims to accomplish.

Before we get into the solution, let's take a look at what multimodal AI is. Multimodal AI refers to a system that integrates and analyzes multiple types of data, such as text, images, audio, and video, simultaneously, to enhance understanding or improve decision making. Our multimodal AI-powered solution addresses these challenges with the following key AI-driven technologies: speaker diarization, which automatically identifies who spoke when in an audio file; topic segmentation, which divides an audio recording into meaningful segments based on the content; and a multimodal search interface, which allows users to interact with audio via text and voice queries. We leverage advanced AI models for this, like OpenAI Whisper for speech-to-text transcription, or Google Gemini and various NLP algorithms for content indexing and other tasks.

Speaker diarization is the process that determines who is speaking at a given point in time, and this is essential in multi-speaker settings such as podcasts and panel discussions. Traditional systems often struggle with accuracy because they identify speakers based on the sound alone, not on the context and the conversation history. The proposed AI-powered approach aims to reduce the diarization error rate and improve speaker identification. Our system also creates dynamic speaker profiles and metrics showing each speaker's bio and speaking frequency. This means a listener can see all the topics a speaker has covered and jump straight to a particular segment, skipping the sections that don't interest them. The user could issue simple queries like "watch all segments where Jill spoke" or "take me to the segment where John is speaking."
So these are some examples, and here is an example of how a traditional audio-based system looks, with multiple speakers discussing in the top part of the screen: they introduce themselves and talk about inflation, mortgages, the job market, and so on. Once speaker diarization happens, you get the timestamps for the speaker segments, as you can see in the example at the bottom of the screen, with Imani speaking from zero to two minutes and Jill speaking in different segments at two minutes, ten minutes, twenty minutes, and so on. At the end of the diarization process, you expect a response returned with these values.

Topic segmentation helps organize audio content into meaningful segments. Our system applies NLP techniques like cosine similarity and term frequency-inverse document frequency (TF-IDF) to detect topic boundaries and group related content together. For example, in a two-hour-long podcast, if you want to listen only to the discussions about inflation, you should be able to instantly jump to the relevant sections instead of skimming through the entire episode.

Our multimodal search interface allows users to search for content using text or voice queries, and the AI retrieves answers based on context instead of relying on generic keyword matches. For example, a listener can ask a very generic question like "what's the latest on rising prices?" The question doesn't explicitly say it is about inflation, but because the system uses NLP and AI, it should be able to map the query to the relevant sections of the video that match this criterion. Another query could be "did anyone on the panel talk about interest rates?" or "did Jill express hope for a better economy?" As you can see, these are all generic queries, and the system should be robust enough to handle them and return the segments as needed.

To further enhance engagement, our system includes dynamic annotations, with key points appearing during playback; follow-up links, so the user can explore related content without searching manually; and integrated note taking, so listeners can add timestamped notes for future reference. This transforms passive listening into a very interactive experience. The system also automatically indexes content based on various criteria, including speakers, topics, and timestamps, so users can retrieve structured segments efficiently. For example, educators can use this to organize their lecture recordings, picking topics from various videos and making it easier for students to find specific discussions.

User feedback is another crucial layer. Our system integrates a rating system to evaluate segment relevance. You might want to know how a certain segment of a video was received, rather than base your opinion on the feedback for the entire video; it happens all the time that people like certain portions of a video but not the whole thing. Users are able to rate segments of the video, not just the entire video, which helps you make much more data-driven and sentiment-driven decisions. There is also a comments section, which is pretty standard, and an analytics dashboard for content creators to understand their audience's preferences. Now let's dive into the technical details that power this audio navigation system.
This system essentially consists of four layers: an input layer, which converts raw audio into structured text with timestamps; a processing layer, which uses AI-driven speaker diarization to identify speakers and topic segmentation to break content into meaningful sections; an indexing layer, which stores the structured metadata for quick search and retrieval; and an interaction layer, or feedback layer, which enables search and playback functionality using multimodal input such as text queries, voice commands, and contextual recommendations, and also allows users to provide detailed feedback on various aspects of a video, which helps content creators.

All right, let's talk a little bit about the input layer. Audio processing is the first step in our pipeline, where we convert the raw audio into structured text using a speech transcription model such as OpenAI Whisper or Google Gemini; there are quite a few others. Now, why is this important? Audio files by themselves are not very useful for search, because they have no structure. If we have timestamps for accurate playback, we can build a high-quality UI where, based on these APIs, users can jump to a particular portion of a video. High-quality speech-to-text transcription definitely helps here: if the transcription is accurately laid out and the playback position matches the input audio timing, the experience works. Whisper tends to perform better, at least in my experience; it gives you exact timestamps and playback positions, so that if you're building experiences, you land accurately on the right point when you search for a keyword.

Let's check out the code a little bit. We send the audio file to the Whisper API and it returns structured text; the example that follows is just a sketch. The response includes word-level timestamps, which means we can align spoken words to actual audio moments. The output also includes the entire transcript plus metadata such as duration, detected language, and word timing, all of which are used in the next layers. Why is this important? Because if a user searches for a phrase, we should be able to jump straight to the exact moment it was spoken. This step is essential for the layers that follow.
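As a concrete illustration, here is a minimal sketch of that transcription call, assuming the OpenAI Python SDK and the hosted whisper-1 model; the file name and the helper function are hypothetical, and the talk's actual implementation may differ.

```python
# Minimal transcription sketch (assumption: OpenAI Python SDK, hosted Whisper).
# It requests word-level timestamps so later layers can align text to audio.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def transcribe(audio_path: str) -> dict:
    """Return the full transcript plus word-level timing metadata."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",        # includes duration and language
            timestamp_granularities=["word"],      # word-level start/end times
        )
    return {
        "text": result.text,
        "language": result.language,
        "duration": result.duration,
        # Each entry: {"word": ..., "start": seconds, "end": seconds}
        "words": [{"word": w.word, "start": w.start, "end": w.end} for w in result.words],
    }


if __name__ == "__main__":
    transcript = transcribe("episode.mp3")  # hypothetical file name
    print(transcript["language"], transcript["duration"])
    print(transcript["words"][:5])
```

The word-level start and end times returned here are what the later layers use to land playback on the exact moment a phrase was spoken.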
Next, let's talk about speaker diarization, which is the next step in the processing layer, and how we are making it better using NLP and LLMs. Normally, speaker diarization is done with acoustic models that try to figure out who is speaking based on voice characteristics alone. But sometimes multiple people sound similar, or the audio quality isn't great. So here's what we do instead: we start with the transcription, so we have the words, timestamps, and all the metadata we just discussed; we chunk the transcript into segments, with the system making some calculated assumptions based on pauses; and instead of relying only on voice characteristics, the system also analyzes the conversation history, what is being said, and the surrounding context.

This gives you a structured JSON that tells you who spoke when. An example query in this layer, if you want to build a UI on top of these APIs, could be as simple as "who spoke in each segment of this podcast?", and that should give you a list of segments for each speaker. In simple pseudocode terms: we get the transcript, chunk it based on calculated guesses, call the LLM with our structured prompt, get back speaker assignments, and parse that output into a JSON format that is easy to use for building experiences (a sketch follows below). This is great for podcasts, meetings, interviews, basically any conversation where traditional speaker diarization struggles, and it is more accurate because it understands more than just the voice; it analyzes the context of what is being spoken and discussed.

Topic segmentation uses NLP and LLMs to divide and classify topics: it segments transcripts using timestamps, detects topic shifts in conversations, categorizes segments into themes, and uses LLMs to assign topic labels, so developers can build multifaceted UIs on top of these APIs. An example query for this layer would be "find all sections discussing AI ethics in this podcast," which should return a list of all the sections that talk about that topic. Once you have these APIs and the timestamps, the next steps are usually faster, because you're not analyzing the entire video again; you're just referring to your metadata and playing back from a certain timestamp based on the given criteria, whether that's a speaker you want to land on, a speaker discussing a particular topic, or a segmented topic itself. It becomes much faster because you're dealing only with metadata and not analyzing the video again and again.

Let's discuss how we handle topic segmentation using the transcript from the input layer. In long-form audio, topics shift frequently, and users want to jump directly to relevant sections without scrubbing through recordings, so our solution automates topic detection and labeling for seamless navigation. We chunk the transcript, using the word-timestamped output received from the previous layer, and group the words into regular intervals for adequate context. We compute text similarity: convert the segment text into TF-IDF vectors and calculate the similarity between adjacent segments to assess how related they are. We detect topic shifts, assign topics using LLMs by sending the segmented text to GPT-4 or another suitable model for descriptive topic labels, and structure the output for search and playback (a sketch of this step also follows below). Users should then be able to quickly search and jump to any topic they like.
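Here is a rough sketch of the LLM-assisted diarization step described above, assuming the word-timestamped transcript from the input layer; the pause threshold, prompt wording, and model name are illustrative assumptions rather than the exact values used in the system.

```python
# LLM-assisted diarization sketch (assumptions: word-timestamped transcript as
# produced earlier, OpenAI chat completions, illustrative pause threshold/prompt).
import json
from openai import OpenAI

client = OpenAI()


def chunk_by_pauses(words: list[dict], pause: float = 1.0) -> list[dict]:
    """Split the word stream into segments wherever there is a long pause."""
    segments, current = [], []
    for prev, word in zip([None] + words[:-1], words):
        if prev and word["start"] - prev["end"] > pause and current:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return [
        {"start": seg[0]["start"], "end": seg[-1]["end"],
         "text": " ".join(w["word"] for w in seg)}
        for seg in segments
    ]


def assign_speakers(segments: list[dict]) -> list[dict]:
    """Ask an LLM to label each segment with a speaker, using conversational context."""
    # In practice long transcripts would be batched; this sketch sends everything at once.
    prompt = (
        "You are labeling a podcast transcript. Based on introductions and "
        "conversational context, assign a speaker name to every segment. Reply "
        'with JSON: {"segments": [{"start": number, "end": number, "speaker": string}, ...]}\n\n'
        + json.dumps(segments)
    )
    response = client.chat.completions.create(
        model="gpt-4o",                           # illustrative model choice
        response_format={"type": "json_object"},  # force parseable JSON
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)["segments"]
```

Because the model sees the conversation text rather than just the audio, it can use introductions and context to assign names, which is what makes this approach more robust than purely acoustic diarization.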
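And here is a minimal sketch of the TF-IDF and cosine-similarity step described above, using scikit-learn; the window size and similarity threshold are illustrative assumptions. Each run of windows between detected boundaries would then be sent to an LLM for a descriptive topic label, as outlined in the talk.

```python
# Topic-boundary sketch (assumptions: word-timestamped transcript, scikit-learn,
# illustrative 60-second windows and 0.2 cosine-similarity threshold).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def window_transcript(words: list[dict], window_s: float = 60.0) -> list[dict]:
    """Group words into fixed-length windows so each chunk has enough context."""
    windows, current = [], []
    window_start = words[0]["start"] if words else 0.0
    for w in words:
        if w["start"] - window_start >= window_s and current:
            windows.append({"start": window_start, "end": current[-1]["end"],
                            "text": " ".join(x["word"] for x in current)})
            current, window_start = [], w["start"]
        current.append(w)
    if current:
        windows.append({"start": window_start, "end": current[-1]["end"],
                        "text": " ".join(x["word"] for x in current)})
    return windows


def detect_topic_shifts(windows: list[dict], threshold: float = 0.2) -> list[int]:
    """Return window indices where vocabulary changes enough to mark a new topic."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(
        [w["text"] for w in windows]
    )
    boundaries = [0]
    for i in range(1, len(windows)):
        similarity = cosine_similarity(tfidf[i - 1], tfidf[i])[0][0]
        if similarity < threshold:  # low lexical overlap with the previous window
            boundaries.append(i)
    return boundaries
```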
Now, the indexing layer. Once we have the structured data from the processing layer, which includes speaker-labeled text and topic-segmented content, we need a fast and scalable way to search and retrieve it, and since this now operates only on metadata, it should be fast. How does the indexing layer work? It stores the segments efficiently: each segment is stored in a database with a topic, a speaker, and start and end times, which lets us map the conversation to structured metadata for easier lookup. The next step is creating fast indexes, where we index the data by topic, speaker, and timestamps; this enables quick full-text searches so users can jump to any section instantly. It also supports retrieving segments with multiple filters, so users can search by topic, speaker, timestamp, and so on. The layer is optimized for speed and scalability, and it exposes APIs for clients to build those search experiences based on different criteria. Why is this powerful? Because users can find exactly what they need without scrubbing through long videos, and they have different criteria to search by.

Here is a high-level depiction of the indexing layer. The input to this layer is the data retrieved from the processing layer: topic, speaker, start and end times, and so on. The output is an indexed store that enables fast search. The first step is to save the data, then index it by key attributes such as topic, speaker, and timestamps; enable full-text search capabilities to handle keyword queries effectively; and support time-based queries so you can jump to a particular time in the video. To optimize performance, we use caching strategies to accelerate frequent queries and implement pagination to manage large sets of data efficiently. Or, if you want to go advanced, you can build a RAG pipeline on top of this data, storing and retrieving information from a vector database for accurate and relevant results. Finally, this layer provides API endpoints for externally fetching the indexed segments, enabling seamless content navigation (a sketch of a minimal index appears below).

Let's go over the final layer, which is the interaction and feedback layer. This layer is very important because if a user is listening to an hour-long podcast and has a question they want answered, they can simply ask something like "what did John say about AI ethics?" The multimodal interaction for seamless search lets them query the system the way they would query any text-based system. It also provides real-time feedback to improve accuracy, because you want the system to learn as more users adopt it and watch videos, and it powers AI-driven personalization and recommendations: over time the system learns user preferences, and you can make data-driven decisions based on that. At a high level, this feedback layer takes user queries, playback interactions, and ratings as input, and generates enhanced recommendations using those queries and the indexes from the indexing layer. It also ensures improved accuracy and facilitates learning: as a user interacts or searches, feedback is collected in real time and used to adjust the AI models, generate personalized recommendations, and improve accuracy and precision over time (a small sketch of this feedback loop also appears below).
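To make the indexing layer concrete, here is a minimal sketch using SQLite; the schema, column names, and the simple LIKE-based keyword filter are illustrative assumptions, and a production system might instead use full-text search or a vector database, as mentioned above.

```python
# Indexing-layer sketch (assumptions: SQLite, illustrative schema, LIKE-based
# keyword filter; a real deployment could swap in full-text or vector search).
import sqlite3


def build_index(db_path: str, segments: list[dict]) -> sqlite3.Connection:
    """Store speaker/topic segments and index the attributes used for lookup."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS segments (
            id INTEGER PRIMARY KEY,
            topic TEXT, speaker TEXT,
            start_s REAL, end_s REAL, text TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_topic ON segments(topic)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_speaker ON segments(speaker)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_time ON segments(start_s)")
    conn.executemany(
        "INSERT INTO segments (topic, speaker, start_s, end_s, text) VALUES (?,?,?,?,?)",
        [(s["topic"], s["speaker"], s["start"], s["end"], s["text"]) for s in segments],
    )
    conn.commit()
    return conn


def search(conn, keyword=None, speaker=None, topic=None) -> list[tuple]:
    """Retrieve segments by any combination of keyword, speaker, and topic."""
    query, params = "SELECT speaker, topic, start_s, end_s FROM segments WHERE 1=1", []
    if keyword:
        query += " AND text LIKE ?"
        params.append(f"%{keyword}%")
    if speaker:
        query += " AND speaker = ?"
        params.append(speaker)
    if topic:
        query += " AND topic = ?"
        params.append(topic)
    return conn.execute(query + " ORDER BY start_s", params).fetchall()
```

A call like search(conn, keyword="inflation", speaker="Jill") would return just the time ranges to seek to, which is why queries over this metadata stay fast regardless of episode length.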
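Finally, as a small illustration of the feedback layer, here is one way segment-level ratings could be folded back into result ranking; the 1-to-5 rating scale and the blending weight are purely illustrative assumptions, not the talk's actual recommendation model.

```python
# Feedback-layer sketch (assumptions: 1-5 star ratings per segment; the blending
# weight below is illustrative, not the system's actual personalization model).
from collections import defaultdict

ratings: dict[int, list[int]] = defaultdict(list)  # segment_id -> star ratings


def record_rating(segment_id: int, stars: int) -> None:
    """Collect a listener's 1-5 rating for one segment (not the whole episode)."""
    ratings[segment_id].append(stars)


def rerank(results: list[dict], weight: float = 0.3) -> list[dict]:
    """Nudge search results toward segments the audience found most relevant."""
    def score(seg: dict) -> float:
        seg_ratings = ratings.get(seg["id"], [])
        avg = sum(seg_ratings) / len(seg_ratings) / 5 if seg_ratings else 0.5
        return (1 - weight) * seg.get("relevance", 0.5) + weight * avg

    return sorted(results, key=score, reverse=True)
```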
The system can leverage these APIs to build features that expose search and personalized playback, providing an interactive user experience. To summarize, our proposed system offers several key benefits. It transforms passive listening into an interactive and structured experience where listeners can effectively find and engage with the audio content. It is efficient in terms of topic and speaker classification, gives you real-time search and a learning mechanism, and includes a feedback system that is continuously learning and enables advanced personalization. The framework is scalable, because the system is designed to accommodate various audio formats, such as podcasts and webinars, making it versatile and applicable across different use cases. Improved accessibility is often overlooked, but the system provides features such as voice queries and easy navigation aids, supporting diverse user needs and ensuring that all listeners have access to the audio content.

In conclusion, research from Edison indicates that more than 45% of the audio-driven platforms consumers listen to are on demand, spanning podcasts, meetings, education, and enterprise applications. These present an opportunity for AI-powered navigation systems that transform unstructured speech into searchable, interactive content. This approach enhances user engagement, improves accessibility, and drives intelligent content discovery at scale. That's all from me. Thank you for joining.
...

Waseem Syed

Senior Staff Software Engineer



