Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
Welcome to my presentation today on the topic of LLM-enhanced multimodal AI and how it is revolutionizing audio interaction technologies.
A little bit about me.
I'm a senior staff software engineer at Intuit with over 17 years of software development experience across the full stack, specializing in mobile and AI technologies. Currently I lead the development of Gen AI mobile applications. I'm passionate about technology, and I love learning and sharing my knowledge through blogs, webinars, publications, and patented innovations.
Let's get started.
As we see, there has been a significant surge in audio content in the past few years, with podcasts, audiobooks, and online courses becoming much more mainstream. This has given rise to an essential challenge, which is content overload. With so much material available, listeners get pretty overwhelmed by the sheer volume of audio content, and this can lead to navigational difficulties. Unlike text-based mediums, where you can easily search or skim for specific information with a simple Ctrl+F, audio is inherently linear, and listeners find it cumbersome to manually navigate through hours of content without a clear way to locate the specific information they're looking for.
Ultimately, this creates a gap between what listeners want and
what is currently available.
Users are looking for a solution that allows them to access relevant segments quickly, enabling them to enjoy the benefits of audio content without feeling lost in the noise. Addressing these challenges is what our AI-powered solution aims to accomplish.
Before we get into the solution, let's take a look at what multimodal AI is.
Multimodal AI refers to a system that integrates and analyzes multiple types of data, such as text, images, audio, and video, simultaneously to enhance understanding or improve decision making.
So our multimodal AI-powered solution addresses these challenges with the following key AI-driven technologies: speaker diarization, which automatically identifies who spoke when in an audio file; topic segmentation, which divides an audio recording into meaningful segments based on the content; and a multimodal search interface, which allows users to interact with audio via text and voice queries. We leverage advanced AI models for this, like OpenAI Whisper for speech-to-text transcription, or Google Gemini, and various NLP algorithms for content indexing and other tasks.
Speaker diarization is the process that determines who's speaking at a given point in time, and this is essential in a multi-speaker setting such as podcasts and panel discussions. Traditional systems often struggle with accuracy because they identify speakers based on the sounds, but not the context and the conversation history. The proposed AI-powered approach aims to reduce the diarization error rate and improve speaker identification. Our system also creates dynamic speaker profiles and metrics showing a speaker's bio and their speaking frequency. This means a listener can look at all the topics a speaker has spoken about and jump to a particular segment, skipping all the sections that do not interest them. The user could make simple queries like "Watch all segments where Jill spoke" or "Take me to a segment where John is speaking."
So these are some examples, and here is an example of what a traditional audio-based system looks like, with multiple speakers having a discussion in the top part of the screen, where they're introducing themselves, talking about inflation, mortgages, the job market, and so on. Once the speaker diarization happens, you get the timestamps or the speaker segments, as you can see at the bottom of the screen, with Imani speaking from zero to two minutes and Jill speaking in different segments at two minutes, ten minutes, twenty minutes, and so on. At the end of the diarization process, you expect a response to be returned with these values.
So topic segmentation helps organize audio content into meaningful segments. Our system applies NLP techniques like cosine similarity or term frequency-inverse document frequency (TF-IDF) to detect topic boundaries and group related content together.
For example, in a two hour long podcast, if you wanna listen to only
discussions about inflation, you should be able to instantly jump
to the relevant sections instead of skimming through the entire episode.
Our multimodal search interface allows users to search for content using text or voice queries. The AI retrieves answers based on context instead of relying on generic keyword matches. For example, a listener can ask a very generic question, something like "What are their thoughts on rising prices?" The question doesn't necessarily tell you that this is about inflation, but since the system is using NLP and AI, it should be able to map it and get to the relevant sections of the video that match this criteria. Another query could be "Did anyone on the panel talk about interest rates?" or "Did Jill express hope for a better economy?" As you can see, these are all generic queries, and the system should be robust enough to handle them and give you the segments as needed.
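To make that concrete, here is a minimal sketch of this kind of context-based matching, assuming the sentence-transformers library, a placeholder model name, and hand-written segments; it is illustrative, not the exact production code.

```python
# Minimal sketch: map a free-form query to the most semantically related
# segment using sentence embeddings. Model name and segments are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

segments = [
    {"start": 0, "end": 120, "text": "Inflation has pushed the cost of living up sharply this year."},
    {"start": 120, "end": 600, "text": "The job market added far more positions than expected."},
]

def search(query: str) -> dict:
    query_vec = model.encode(query, convert_to_tensor=True)
    segment_vecs = model.encode([s["text"] for s in segments], convert_to_tensor=True)
    scores = util.cos_sim(query_vec, segment_vecs)[0]
    return segments[int(scores.argmax())]

# A generic question about "rising prices" is expected to surface the inflation segment.
print(search("What are their thoughts on rising prices?"))
```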
To further enhance engagement, our system includes dynamic annotations, with key points appearing during playback; follow-up links, so the user can explore related content without searching manually; and integrated note taking, so listeners can add timestamped notes for future reference. This whole thing transforms passive learning into a very interactive experience. The system also automatically indexes based on various criteria, including speakers, topics, and timestamps, so users can retrieve structured segments efficiently. For example, educators can use this for organizing their lecture recordings, picking topics from various videos and making it easier for students to find specific discussions.
User feedback is another crucial layer. Our system integrates a rating system to evaluate segment relevance. You might want to know how a certain segment of a video is received, rather than base your opinion on the feedback for the entire video; it happens all the time that people like certain portions of a video but not the whole thing. Users would be able to rate segments of the video, not just the entire video. This helps you make much more data-driven or sentiment-driven decisions. There is also a comments section, which is pretty standard, and an analytics dashboard for content creators to understand their audience preferences.
Now let's dive into the technical details that power the audio navigation system. The system essentially consists of four layers: the input layer, which converts raw audio into structured text with timestamps; the processing layer, which uses AI-driven speaker diarization to identify speakers and topic segmentation to break content into meaningful sections; the indexing layer, which stores the structured metadata for quick search and retrieval; and the interaction layer, or the feedback layer, which enables search and playback functionality using multimodal input such as text queries, voice commands, and contextual recommendations, and also allows users to provide detailed feedback on various aspects of a video, which helps the content creators.
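As a high-level sketch, and purely as an assumption of how these layers could be wired together (the function and field names below are placeholders, not the actual system), the flow looks roughly like this:

```python
# Hypothetical sketch of the four-layer flow; every name here is a placeholder.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    start: float                 # seconds
    end: float
    text: str
    speaker: Optional[str] = None
    topic: Optional[str] = None

def transcribe(audio_path: str) -> list[Segment]:
    """Input layer: convert raw audio into timestamped text (e.g. via Whisper)."""
    return [Segment(0.0, 120.0, "Welcome everyone, today we talk about inflation...")]

def process(segments: list[Segment]) -> list[Segment]:
    """Processing layer: attach speaker and topic labels (diarization + segmentation)."""
    for seg in segments:
        seg.speaker, seg.topic = "Imani", "Inflation"
    return segments

def build_index(segments: list[Segment]) -> dict[str, list[Segment]]:
    """Indexing layer: keep structured metadata keyed for quick lookup."""
    by_topic: dict[str, list[Segment]] = {}
    for seg in segments:
        by_topic.setdefault(seg.topic or "unknown", []).append(seg)
    return by_topic

# Interaction layer: text or voice queries hit the index instead of the raw audio.
index = build_index(process(transcribe("episode.mp3")))
print(index["Inflation"][0].start)
```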
Alright, let's talk a little bit about the input layer. Here, audio processing is the first step in our pipeline, where we convert the raw audio into structured text using a speech transcription model such as OpenAI Whisper or Google Gemini; there are quite a few other ones. Now, why is this important? Audio files by themselves are not very useful for search because they don't have structure. If we have timestamps for accurate playback, we can build high-quality UIs on top of these APIs where users can go straight to a particular portion of a video, so providing high-quality speech-to-text transcription definitely helps. When it comes to having the transcription accurately laid out, and having the transcription playback match the input audio speed, Whisper tends to perform better, at least in my experience. It gives you the exact timestamps as well as the exact playback timing, so that if you're building experiences, you accurately land on the right point when you search for a keyword.
So let's check out the code a little bit. We send the audio file to the Whisper API, and it returns structured text; this is just a pseudocode example. The response includes word-level timestamps, which means we can align spoken words to actual audio moments. The output also has things like the entire transcript plus metadata, like duration, detected language, and word timing, and all of these will be used in the next layers.
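As a rough sketch of that call, assuming the OpenAI Python SDK, a placeholder file name, and no error handling:

```python
# Sketch of the input layer: send an audio file to the Whisper API and get back
# the transcript plus word-level timestamps and metadata. Assumes OPENAI_API_KEY
# is set in the environment; "episode.mp3" is a placeholder.
from openai import OpenAI

client = OpenAI()

with open("episode.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",        # includes duration and detected language
        timestamp_granularities=["word"],      # word-level timestamps for playback alignment
    )

print(transcript.text)                          # the entire transcript
print(transcript.duration, transcript.language)
for word in transcript.words:                   # each entry carries word, start, end
    print(word.start, word.end, word.word)
```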
And why is this important?
Because if a user searches for a phrase, we should be able to jump straight
to the exact moment it was spoken.
And this step is essential for the next layers.
Let's talk about speaker diarization, which is the next step, in the processing layer, and how we are making it better using NLP and LLMs. Normally, speaker diarization is done with acoustic models that just try to figure out who's speaking based on voice characteristics. But sometimes multiple people sound similar, or the audio quality isn't great all the time. So here's what we do instead: we start with the transcription, so we get the words, timestamps, and the entire metadata that we just discussed. We chunk the transcript into segments; the system makes some calculated assumptions based on pauses. And instead of relying only on voice characteristics, the system also analyzes the conversation history, or what's being spoken, based on the context. This gives you a structured JSON that tells you who spoke when. An example query at this layer, if you want to build a UI on top of these APIs, would look something like "Who spoke in each segment of this podcast?" This could be a simple query, and it should give you a list of segments for each speaker.
And here's a simple pseudocode example of how this works. We get the transcript, chunk it based on some calculated guesses, call the LLM with a structured prompt, and it returns speaker assignments. We parse that output into a JSON format that's easier to use for building any experiences. This is great for podcasts, meetings, interviews, basically any conversation where traditional speaker diarization struggles, and it's more accurate because it actually understands more than just the voice: it uses the context and does a lot of analysis on what's being spoken and what's being discussed.
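Here is a hedged sketch of that flow, assuming word-level timestamps shaped like the input layer's output; the pause threshold, prompt wording, and model name are illustrative choices rather than the exact system.

```python
# Sketch of LLM-assisted diarization: chunk the transcript at long pauses, ask
# the model to infer speakers from conversational context, and parse the JSON.
# The pause threshold, prompt, and model name are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()
PAUSE_THRESHOLD = 1.5  # seconds of silence treated as a likely speaker change

def chunk_on_pauses(words: list[dict]) -> list[dict]:
    """Group word-level timestamps into segments, splitting at long pauses."""
    groups, current = [], []
    for prev, word in zip([None] + words[:-1], words):
        if prev and word["start"] - prev["end"] > PAUSE_THRESHOLD and current:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    return [
        {"start": g[0]["start"], "end": g[-1]["end"],
         "text": " ".join(w["word"] for w in g)}
        for g in groups
    ]

def assign_speakers(segments: list[dict]) -> list[dict]:
    """Ask an LLM to label each segment with a speaker based on context."""
    prompt = (
        "Here are transcript segments from a podcast. Using the conversational "
        "context, infer who is speaking in each segment. Return only a JSON "
        "array of objects with 'start', 'end', and 'speaker'.\n"
        + json.dumps(segments)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```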
Topic segmentation uses NLP and LLMs to divide and classify the topics: it segments transcripts using timestamps and detects topic shifts in conversations. It can also categorize segments into themes and use LLMs to assign topic labels for audio segments, and developers can build multifaceted UIs using these APIs. An example query for this layer, showing how it would be useful, is if you want to find all sections discussing AI ethics in a podcast: "Find all sections discussing AI ethics in this podcast." That should get you a list of all the portions or sections that talk about this topic. And once you have these APIs and the timestamps, the next steps are usually faster, because you're not analyzing the entire video; you're just referring to your metadata, and you're playing back from a certain timestamp based on the criteria that's given: either a speaker you want to land on, a speaker speaking on a particular topic you want to land on, or a segmented topic itself you want to land on. So it becomes much faster, because you're just dealing with the metadata and not analyzing the video again and again.
Let's discuss how we handle topic segmentation using the transcript from the input layer. In long-form audio, topics shift frequently, and users want to jump directly to relevant sections without scrubbing through recordings, so our solution automates topic detection and labeling for seamless navigation. We chunk the transcript, using the output with word timestamps that we received from the previous layer, and group the words together at regular intervals for adequate context. We compute text similarity: we convert the segment text into TF-IDF vectors and calculate the similarity between adjacent segments to assess relevance. We detect the topic shifts and assign topics using LLMs, sending the segmented text to GPT-4 or other relevant models for descriptive topic labeling, and structure the output for search and playback. Users should be able to quickly search and jump to any topic they like.
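A minimal sketch of the boundary-detection step, using scikit-learn's TF-IDF vectorizer and cosine similarity, is below; the threshold and sample chunks are assumptions, and in the full flow the detected groups would then go to an LLM for descriptive labels.

```python
# Sketch of topic-shift detection: vectorize fixed-interval transcript chunks
# with TF-IDF and mark a new topic wherever adjacent chunks are too dissimilar.
# The threshold and sample chunks are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SHIFT_THRESHOLD = 0.2  # below this, adjacent chunks are treated as different topics

def detect_topic_shifts(chunks: list[str]) -> list[int]:
    """Return the chunk indices where a new topic is assumed to begin."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(chunks)
    boundaries = [0]
    for i in range(1, len(chunks)):
        if cosine_similarity(vectors[i - 1], vectors[i])[0, 0] < SHIFT_THRESHOLD:
            boundaries.append(i)
    return boundaries

chunks = [
    "Inflation is rising and consumer prices keep climbing across the economy.",
    "Rising consumer prices are squeezing household budgets this quarter.",
    "Switching gears, the job market added far more jobs than expected.",
]
print(detect_topic_shifts(chunks))  # [0, 2]: the third chunk starts a new topic
```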
Indexing layer. Once we have the structured data from the processing layer, which includes speaker-labeled text and topic-segmented content, we need a fast and scalable way to search and retrieve it. This now works mostly on metadata, so it should be faster. So how does the indexing layer work? It stores the segments efficiently: each segment is stored in a database with a topic, speaker, start time, and end time, and this allows us to map conversations to structured metadata for easier lookup. The next step is creating fast indexes, where we index the data by topic, speaker, and timestamps. This enables quick full-text searches, so users can jump to any section instantly, and it supports retrieving segments with multiple filters: users can search by different things like topic, speaker, timestamp, et cetera. It is optimized for speed and scalability, and it exposes APIs for clients to build those experiences, with search based on different criteria. And why is this powerful? Because users can find exactly what they need without scrubbing through long videos, and they have different criteria to search from.
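As one illustrative way to store and search segments (not necessarily the storage used in the actual system), they can go into SQLite with an FTS5 full-text index; the schema below is an assumption and requires an SQLite build with FTS5 enabled.

```python
# Illustrative sketch of the indexing layer using SQLite's FTS5 full-text index.
# Schema and field names are assumptions; requires an SQLite build with FTS5.
import sqlite3

conn = sqlite3.connect("audio_index.db")
conn.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS segments USING fts5(
        episode, speaker, topic, text,
        start_time UNINDEXED, end_time UNINDEXED
    )
""")

conn.execute(
    "INSERT INTO segments VALUES (?, ?, ?, ?, ?, ?)",
    ("ep-01", "Jill", "Inflation",
     "Inflation is eating into household budgets this quarter.", 120.0, 600.0),
)
conn.commit()

# Full-text search across speaker, topic, and transcript text, returning the
# timestamps the player can jump to directly.
rows = conn.execute(
    "SELECT speaker, topic, start_time, end_time FROM segments WHERE segments MATCH ?",
    ("inflation",),
).fetchall()
print(rows)  # e.g. [('Jill', 'Inflation', 120.0, 600.0)]
```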
And here's a high-level depiction of the different stages of this indexing layer. The input to this layer is the data retrieved from the processing layer: things like topic, speaker, start and end times, et cetera. The output of this layer is expected to be indexed storage that enables faster search. The first step is to save the data and then index it by key attributes such as topic, speaker, and timestamps; enable full-text search capabilities to handle keyword queries effectively; and support time-based queries, so that you can jump to a particular time in the video. For optimizing performance, we use caching strategies to accelerate frequent queries, or implement pagination to manage large sets of data efficiently. Or, if you want to go advanced, you can build a RAG with the data to store and retrieve information from a vector database for accurate and relevant results. And finally, this layer provides API endpoints for externally fetching the indexed segments, and it enables seamless content navigation.
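For that optional RAG path, here is a hedged sketch using Chroma as one example of a vector database; the collection name, documents, and metadata fields are assumptions for illustration.

```python
# Sketch of the RAG option: store segment text in a vector database (Chroma as
# one example) and retrieve the closest segments for a natural-language query.
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("audio_segments")

collection.add(
    ids=["ep01-seg02", "ep01-seg05"],
    documents=[
        "Jill says she is hopeful the economy will recover by next spring.",
        "Imani walks through how mortgage rates affect monthly payments.",
    ],
    metadatas=[
        {"speaker": "Jill", "topic": "Economy", "start": 120, "end": 600},
        {"speaker": "Imani", "topic": "Mortgage", "start": 600, "end": 1200},
    ],
)

results = collection.query(
    query_texts=["Did anyone express hope for a better economy?"],
    n_results=1,
)
print(results["metadatas"][0])  # speaker, topic, and timestamps of the best match
```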
And let's go over the final layer, which is the interaction and feedback layer. This layer is very important because if you have, let's say, a user listening to an hour-long podcast and they have some questions they want answered, they could simply ask a question like "What did John say about AI ethics?" The multimodal interaction for seamless search essentially lets you query the way you would query any text-based system. It also provides real-time feedback to improve accuracy, because you want the system to learn as more users adopt it and watch the videos. Then there is AI-powered personalization and recommendations: over time the system learns user preferences as well, and you can make data-driven decisions based on it. And here's a high-level depiction of how this feedback layer would look. It takes the user queries, playback interactions, and ratings as input and generates enhanced recommendations using the queries and the indexes from the indexing layer. It also ensures improved accuracy and facilitates learning based on the feedback: as a user interacts or searches, feedback is collected in real time, which is used to adjust the AI models and generate personalized recommendations with improved accuracy and precision over time. The system can leverage these APIs to build features that expose search and personalized playback, providing an interactive user experience.
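As an illustrative sketch of that feedback loop (the names, weights, and data shapes below are assumptions, not the production logic), segment-level ratings can be blended back into how search results are ranked:

```python
# Sketch of the feedback layer: collect per-segment ratings and blend them with
# search relevance for personalized ranking. Weights and fields are assumptions.
from collections import defaultdict

ratings: dict[str, list[int]] = defaultdict(list)

def record_feedback(segment_id: str, rating: int) -> None:
    """Collect a 1-5 rating for an individual segment rather than the whole episode."""
    ratings[segment_id].append(rating)

def rerank(candidates: list[dict]) -> list[dict]:
    """Blend search relevance with average segment rating for personalized ordering."""
    def score(seg: dict) -> float:
        seg_ratings = ratings.get(seg["id"], [])
        avg = sum(seg_ratings) / len(seg_ratings) if seg_ratings else 3.0  # neutral default
        return 0.8 * seg["relevance"] + 0.2 * (avg / 5.0)                  # assumed weights
    return sorted(candidates, key=score, reverse=True)

record_feedback("ep01-seg02", 5)
candidates = [
    {"id": "ep01-seg02", "relevance": 0.71, "speaker": "John", "topic": "AI ethics"},
    {"id": "ep01-seg05", "relevance": 0.74, "speaker": "Jill", "topic": "Economy"},
]
print(rerank(candidates)[0]["id"])  # the highly rated segment moves to the top
```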
So to summarize, our proposed system offers several key benefits. It transforms passive listening into an interactive and structured experience where listeners can effectively find and engage with the audio content. It is efficient in terms of topic and speaker classification. It gives you a real-time search and learning mechanism, and also a feedback system that is continuously learning and helps you with advanced personalization. The framework is scalable, because the system is designed to accommodate various audio formats, such as podcasts and webinars, making it versatile and applicable across different use cases. Improved accessibility is often overlooked, but the system provides features such as voice queries and easy navigation aids, supporting diverse user needs and ensuring that all listeners have access to the audio content.
In conclusion, there's research from Edison that says more than 45% of the audio-driven platforms consumers listen to are on demand, including podcasts, meetings, education, and enterprise applications. These present an opportunity for AI-powered navigation systems, by transforming unstructured speech into searchable, interactive content. This approach enhances user engagement, improves accessibility, and drives intelligent content discovery at scale.
So that's all from me.
Thank you for joining.