Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Ruta.
Welcome to this talk on automating podcast localization using OpenAI's API stack.
In this talk, we'll explore how to convert English podcasts into Russian while maintaining tone and style.
We'll go through architecture, technical challenges, and real examples.
Let's get started.
The main question is why we want to automate podcast localization, and there are two main reasons.
The first reason is to make content accessible to a wider audience.
Podcasts which are only in English can now be consumed by Russian-speaking people without any knowledge of English. The second reason is to reduce manual effort while ensuring high-quality results.
In other words, automation makes it simpler to translate podcasts
without human intervention.
While doing so, we want to maintain tone and style in translation, and GPT-4o helps with that by allowing us to translate English speech to Russian while preserving the original tone and style.
A solution overview.
Here is the overall structure of our localization system.
Each step plays a critical role in transforming an English podcast into a Russian one.
The first step is podcast download, where we download the podcast metadata, such as title and description, and the MP3 file.
Then there is the second step, transcription: we have to transcribe this MP3 file into text. After that comes text processing, where we give this text to the GPT-4o model to process.
The processing includes adding all the punctuation marks, fixing the grammar, and so on, so that the speech sounds more natural and more lively.
Then we have speech synthesis, and this is where we take the TTS-1 model and produce the Russian track from this translated text.
And audio assembly. This step is needed because the TTS-1 model has a restriction of about 4,000 characters at once, so we can only give it about 4,000 characters at a time to make an MP3 track from. And if the podcast is long enough, we have to split it into multiple chunks and then merge these audio files.
And the next step is RSS generation, where we generate an RSS feed, which is then consumed by podcast platforms.
They can check this RSS feed later on and see whether there is a new episode of a podcast. Then they add it to their platform, notify users, and so on.
And here is the whole pipeline, and as you can see, all the steps run one after another.
So we download the track, then we transcribe it, then we enhance the transcription, then we translate the enhanced transcript, and so on.
And this design not only helps to maintain clarity, but also allows
for easy scalability and improvement.
Each phase in the pipeline plays a specific role in ensuring the content is processed, translated, and delivered in the best possible way.
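As a rough illustration of this sequential design, here is a minimal Kotlin sketch; every function name is a hypothetical placeholder for a pipeline stage, not the project's real API.

```kotlin
// Every function here is a hypothetical placeholder (TODO) standing in for a
// pipeline stage; each stage only consumes the output of the previous one.
fun downloadEpisode(id: String): ByteArray = TODO("podcast metadata + MP3 download")
fun transcribe(audio: ByteArray): String = TODO("speech-to-text with Whisper-1")
fun improveTranscription(text: String): String = TODO("grammar and punctuation via GPT-4o")
fun translateToRussian(text: String): String = TODO("translation via GPT-4o")
fun synthesizeSpeech(text: String): ByteArray = TODO("TTS-1, in chunks of up to 4,000 characters")
fun assembleAudio(audio: ByteArray): ByteArray = TODO("merge the MP3 chunks")
fun publishToRss(id: String, audio: ByteArray): Unit = TODO("RSS feed generation")

fun localizeEpisode(id: String) {
    val audio = downloadEpisode(id)
    val russianText = translateToRussian(improveTranscription(transcribe(audio)))
    publishToRss(id, assembleAudio(synthesizeSpeech(russianText)))
}
```

Because each stage only depends on the output of the previous one, the stages can be developed and swapped independently.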
All these steps are independent of each other, and this plays a huge role in this pipeline.
Key technologies.
So we use Kotlin here, because we're most familiar with Kotlin.
We use podcast4j as a framework for working with podcastindex.org, where podcastindex.org is a huge database of podcasts, and you can ask it about any podcast: title, description, track, and, for example, thumbnail. A lot of info you can take from there.
And the OpenAI API, and we use three models from there.
We use Whisper-1 to transcribe audio from the original track.
We use GPT-4o to work on the transcribed text, to enhance it and, basically, to translate it.
And then we use TTS-1 to convert this translated text into the Russian MP3 track.
We use the Ktor HTTP client to make all the HTTP requests; while podcast4j has its own client, for the OpenAI API we use Ktor.
We use Jackson for all the JSON data.
We use the javax.xml APIs for building the RSS feed.
And we are planning to use FFmpeg to merge the chunks of MP3 files. This will allow us to merge these chunks in the correct way, because if you just merge their content, there might be problems with, for example, playback. You cannot play it from the middle of the track, because each MP3 file has its own metadata, including the length of the track, so you may jump to a wrong place, basically.
And FFmpeg allows us to merge these tracks, to merge the metadata, and to bring the audio levels to the same level, so you don't hear big changes in loudness; it just makes the audio files more listenable.
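A minimal sketch of what that planned FFmpeg step could look like, assuming an ffmpeg binary is available on the PATH; the concat demuxer joins the chunks and the loudnorm filter evens out the volume. This is not the project's current code, which still merges raw bytes.

```kotlin
import java.io.File

fun mergeWithFfmpeg(chunks: List<File>, output: File) {
    // FFmpeg's concat demuxer reads the list of inputs from a text file.
    val list = File.createTempFile("chunks", ".txt").apply {
        writeText(chunks.joinToString("\n") { "file '${it.absolutePath}'" })
    }
    val process = ProcessBuilder(
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list.absolutePath,
        "-af", "loudnorm",      // normalize loudness across chunks
        "-c:a", "libmp3lame",   // re-encode so metadata and duration are rewritten
        output.absolutePath
    ).inheritIO().start()
    check(process.waitFor() == 0) { "ffmpeg failed" }
}
```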
Podcast downloading.
The first step of the pipeline is downloading podcast metadata for new episodes from podcastindex.org, using the podcast4j library.
Once the metadata is retrieved, we can then download the corresponding MP3 audio files.
These files are the core content that will be processed in the subsequent stages of the pipeline: transcription, translation, et cetera.
Now that we have the podcast information, we can move to the next step, which is speech-to-text with Whisper.
In this step of the pipeline, we use the Whisper-1 API to convert the downloaded podcast audio into text.
Basically, we just give the MP3 track to the Whisper-1 API, and it returns us the text.
One problem with that is that it works only for files under 25 megabytes.
Most of the podcasts we work with are less than 25 megabytes, but there are bigger ones.
And if the file is bigger, we have to split it into multiple chunks and then process each chunk independently.
This is just a requirement of Whisper-1: it does not accept files bigger than 25 megabytes.
It's not a big problem, but it is what it is.
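The talk does not show how the split is done; as an assumption, a naive byte-level split like the hypothetical helper below would mirror the byte-level merge described later, though a time-based split (for example with FFmpeg's segment muxer) would be cleaner.

```kotlin
import java.io.File

// Hypothetical helper: split a large MP3 into byte chunks under Whisper-1's
// 25 MB upload limit. A crude byte-level split; decoders usually resync on
// the next MP3 frame header, but this is only a sketch of one possible approach.
fun splitForWhisper(file: File, maxBytes: Int = 24 * 1024 * 1024): List<ByteArray> {
    val bytes = file.readBytes()
    if (bytes.size <= maxBytes) return listOf(bytes)
    return (bytes.indices step maxBytes).map { start ->
        bytes.copyOfRange(start, minOf(start + maxBytes, bytes.size))
    }
}
```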
Text enhancement and translation.
First, GPT-4o enhances grammar, punctuation, and readability.
For the prompt, we say: "Context: this is the transcription of a podcast in English. Fix the grammar and punctuation, or translate it to English if it's in the wrong language. Detect where the podcast starts and cut unrelated content at the start. Output format: output only the text, with no introduction."
Once the text is polished, we send another request to GPT-4o, this time to translate the text into Russian, and the prompt is: "Translate the text below to the Russian language. Keep the translation as close to the original in tone and style as you can."
And we just give it the text.
This way, GPT-4o translates the text into Russian while preserving the nuances of the original text.
This ensures that the podcast's conversational style, humor, and personality are preserved, making it engaging and relatable to the Russian-speaking audience, just as it is to the English-speaking one.
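As a hedged sketch of what that translation request might look like in Kotlin with Ktor and Jackson (the tools mentioned earlier), here is one possible call to the public chat-completions endpoint; the prompt is the one quoted above, and error handling and retries are omitted.

```kotlin
import com.fasterxml.jackson.databind.ObjectMapper
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import io.ktor.http.*
import kotlinx.coroutines.runBlocking

// Hypothetical helper: one GPT-4o call with the translation prompt from the talk.
fun translateToRussian(text: String, apiKey: String): String = runBlocking {
    val mapper = ObjectMapper()
    val client = HttpClient(CIO)
    val payload = mapper.writeValueAsString(
        mapOf(
            "model" to "gpt-4o",
            "messages" to listOf(
                mapOf(
                    "role" to "user",
                    "content" to "Translate the text below to the Russian language. " +
                        "Keep the translation as close to the original in tone and style as you can.\n\n$text"
                )
            )
        )
    )
    val response = client.post("https://api.openai.com/v1/chat/completions") {
        header(HttpHeaders.Authorization, "Bearer $apiKey")
        contentType(ContentType.Application.Json)
        setBody(payload)
    }.bodyAsText()
    // The translated text comes back in choices[0].message.content.
    mapper.readTree(response).path("choices").path(0).path("message").path("content").asText()
}
```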
Speech synthesis with TTS-1.
Next, let's dive into the process of converting the translated text into Russian audio using the OpenAI TTS-1 model, along with the challenges faced in this step.
In this step, we just use the TTS-1 model with the Russian text, and it returns us MP3 tracks with a Russian voice.
While the synthesized voices are quite good, they still carry a slight American accent, which may not always sound perfectly natural for a Russian-speaking audience.
For example, the intro in English might be "Welcome to the Nature Podcast, sponsored by X," which is translated into the Russian "Добро пожаловать в подкаст Nature…".
When comparing the original audio, "AI is transforming science," with the localized version, you may notice that while the translation is accurate, the TTS voice still carries an American tone.
This slight accent issue can impact the overall feel of the podcast.
Audio assembly challenges.
Next, let's dive into one of the key challenges in the pipeline: merging audio files.
It may seem straightforward, but the process of combining multiple MP3 files presents some technical hurdles. And yes, as I already said, MP3 metadata can conflict, and conflicts can create playback issues. Currently we're just merging the MP3 files together, taking their byte arrays and concatenating them.
It may not work properly in every case, but it works if you just listen to the podcast from start to finish. If you try to listen to it from the middle, you may have issues in some players; in others, it might work.
Also, different MP3 files may have varying bitrates, sample rates, encoding methods, volume levels, all this stuff.
We need to work on that with FFmpeg, but this is not the priority right now.
RSS feed generation.
Now let's look at the final step in our localization pipeline, which is publishing the translated podcast through an RSS feed. Two important aspects of the process for distributing the translated podcast are XML-based RSS feed creation and translated metadata for better discoverability.
The first one, XML-based RSS feed creation, is about creating the XML with the RSS feed and then feeding this RSS feed into the podcast platforms so that they can access it, see if we have new episodes, and then add new episodes to their platform.
The second thing, translated metadata for better discoverability, basically means that we take all the titles and all the descriptions and translate them to Russian, so people can more easily find what they want to find, because it would be weird to have a Russian podcast with an English description, right?
Some of the core challenges so far were file size limits in Whisper, so we sometimes have to split the original MP3 files into smaller chunks.
The second one is style preservation in translation, and here we rely on the OpenAI models and on our prompts, so that we do not just translate the text as, for example, Google Translate or any other translation platform would, but also preserve the style, the terminology, and everything else using the OpenAI models.
And then there are all the merging complexities; FFmpeg should solve our problems there, but it'll take some time.
In order to ensure high-quality translations, we use various techniques to fine-tune GPT-4o.
These include a grammar-correction prompt to fix Whisper's output; handling mixed-language content, where GPT-4o can translate the text into Russian so there are only Russian words; and tone preservation techniques, where we ask GPT-4o to preserve the tone of the text while it translates into Russian.
Let's look at an example.
The input is "AI is cool, and in Russian — круто." Here you can see that we have both English and Russian words. In the transcript we will have "AI is cool and in Russian kruto," which is all in English letters, and then we translate it into Russian and get the fully Russian version.
This example highlights how fine-tuning the prompts ensures that GPT-4o preserves not just the meaning, but also the tone and style of the original message.
In this case, it ensures that the use of "cool" and "круто" fits naturally within the context and keeps the conversational tone intact in the translation.
Maintaining the original tone of a podcast is essential for keeping the audience engaged, and we do this by fine-tuning GPT-4o's translation with specific prompts to preserve the tone.
In addition, we selected voices that match the energy level of the speaker.
Let's explore some examples that demonstrate how we address challenges related to tone and personalization in translation.
On the left, you can see that the voice has a slight American accent, so the result sounds non-native.
It's usable, but we will improve the voice later on.
We translate all intros, outros, and ads. We have plans to either remove ads or adapt them for the Russian audience, but for now, we just translate them, as they are part of the podcast.
On the right side, you can see a before-and-after comparison example: "AI is transforming science," localized into Russian.
It has this American accent, and it doesn't sound that natural to Russian-speaking listeners. We also have to work on this part.
Quality assurance measures.
Currently, we ensure translation accuracy through manual reviews, where we check for nuances, inconsistencies, and tone shifts that automated systems might miss.
This helps maintain the original intent and style of the podcast, but the process is time-consuming and not easily scalable as we expand to more podcasts and languages.
To improve efficiency and consistency in the future, we plan to add automated quality scoring measures like BLEU scores, which compare machine-generated translations to human translations, and native speaker evaluations.
BLEU scores provide a quantitative measure of translation accuracy, while native speaker feedback ensures that the translation sounds natural and culturally appropriate.
These improvements will speed up the review process and help maintain high
translation quality as the system scales.
And here I will show some of the key code snippets that we have. This is one of the first parts, which is downloading podcast episodes.
We have a podcast ID for the podcast that we want to download.
Here we take this podcast, download all the episodes for this podcast, and later we download all the MP3 files for this podcast.
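The snippet itself is not reproduced in the transcript; here is a minimal stand-in sketch, assuming a hypothetical fetchEpisodeUrls helper in place of the podcast4j calls (the library's real API is not shown in the talk), with the MP3 download done through the JDK HTTP client.

```kotlin
import java.io.File
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Hypothetical stand-in for the podcast4j lookup against podcastindex.org.
fun fetchEpisodeUrls(podcastId: Long): List<String> = TODO("episodes by podcast ID")

fun downloadEpisodes(podcastId: Long, targetDir: File): List<File> {
    val http = HttpClient.newHttpClient()
    return fetchEpisodeUrls(podcastId).mapIndexed { index, url ->
        val target = File(targetDir, "episode-$index.mp3")
        val request = HttpRequest.newBuilder(URI.create(url)).GET().build()
        // Stream each episode's MP3 straight to disk.
        http.send(request, HttpResponse.BodyHandlers.ofFile(target.toPath()))
        target
    }
}
```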
Another part is transcribing audio with Whisper, which is pretty simple.
Here we have to take the audio file and send it to the OpenAI client, without any prompts, without anything.
We have to say that we want to use the Whisper-1 model and that we use this specific audio file, and it just transcribes the file and returns us the text.
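A minimal sketch of that call, assuming the request goes straight to the public OpenAI transcription endpoint via Ktor: a multipart upload with just the model name and the file, no prompt. The actual project code may differ.

```kotlin
import com.fasterxml.jackson.databind.ObjectMapper
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.request.forms.*
import io.ktor.client.statement.*
import io.ktor.http.*
import kotlinx.coroutines.runBlocking
import java.io.File

fun transcribe(audio: File, apiKey: String): String = runBlocking {
    val client = HttpClient(CIO)
    val response = client.submitFormWithBinaryData(
        url = "https://api.openai.com/v1/audio/transcriptions",
        formData = formData {
            append("model", "whisper-1")
            append("file", audio.readBytes(), Headers.build {
                append(HttpHeaders.ContentType, "audio/mpeg")
                append(HttpHeaders.ContentDisposition, "filename=\"${audio.name}\"")
            })
        }
    ) {
        header(HttpHeaders.Authorization, "Bearer $apiKey")
    }.bodyAsText()
    // The endpoint answers with JSON like {"text": "..."}.
    ObjectMapper().readTree(response).path("text").asText()
}
```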
Another one is the improveTranscription method, which takes the transcription and the source language and improves this transcription.
As you can see in the prompt here, it improves grammar and punctuation and translates everything to the source language, which is English, if it's in the wrong language.
In a podcast you may have multiple languages, let's say English and French, or English and Russian.
For better understanding, you have to first translate everything to English and then everything from English to Russian; it just works better.
And here is another one, generating speech with TTS-1.
Here, we take the translated text that we want to turn into speech.
We split it into chunks of not more than 4,000 characters each, because this is a restriction of the TTS-1 model: you cannot generate speech for more than 4,000-something characters at once.
So we split it into chunks of smaller text. Then we generate MP3 files and merge them together into a single byte array.
And this will be our MP3 file in Russian, which we will later use for distribution.
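A hedged sketch of that step: naive 4,000-character chunking, one call to the public /v1/audio/speech endpoint per chunk, and a plain byte-array merge, which is the current pre-FFmpeg merging approach described above. The "alloy" voice is an assumption; the talk does not name the voice used.

```kotlin
import com.fasterxml.jackson.databind.ObjectMapper
import io.ktor.client.*
import io.ktor.client.call.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.http.*
import kotlinx.coroutines.runBlocking

fun synthesizeRussianSpeech(text: String, apiKey: String): ByteArray = runBlocking {
    val client = HttpClient(CIO)
    val mapper = ObjectMapper()
    text.chunked(4000).map { chunk ->
        val payload = mapper.writeValueAsString(
            mapOf("model" to "tts-1", "voice" to "alloy", "input" to chunk)
        )
        // Each request returns the MP3 bytes for one chunk of text.
        client.post("https://api.openai.com/v1/audio/speech") {
            header(HttpHeaders.Authorization, "Bearer $apiKey")
            contentType(ContentType.Application.Json)
            setBody(payload)
        }.body<ByteArray>()
    }.fold(ByteArray(0)) { merged, next -> merged + next } // naive byte-level merge
}
```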
Generating the RSS feed is pretty simple here, as we use the javax.xml APIs, which are included in Java, so we don't need any other libraries.
Here we create the RSS feed in XML format, add the podcast info such as title, description, link, language, everything, and then we add all the info for the episodes.
That includes title, description, publication date, an enclosure, which is a link to the file, and a GUID, which has to be unique among all the episodes in this podcast.
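A minimal sketch of that step using only javax.xml, as mentioned in the talk; Episode is a hypothetical data holder for the fields listed above, and a real feed would carry more elements (image, iTunes tags, and so on).

```kotlin
import java.io.File
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult

data class Episode(val title: String, val description: String, val pubDate: String,
                   val fileUrl: String, val guid: String)

fun writeRssFeed(podcastTitle: String, episodes: List<Episode>, target: File) {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument()
    val rss = doc.createElement("rss").apply { setAttribute("version", "2.0") }
    doc.appendChild(rss)
    val channel = doc.createElement("channel")
    rss.appendChild(channel)
    // Podcast-level info: title and language of the localized feed.
    channel.appendChild(doc.createElement("title").apply { textContent = podcastTitle })
    channel.appendChild(doc.createElement("language").apply { textContent = "ru" })
    for (e in episodes) {
        val item = doc.createElement("item")
        item.appendChild(doc.createElement("title").apply { textContent = e.title })
        item.appendChild(doc.createElement("description").apply { textContent = e.description })
        item.appendChild(doc.createElement("pubDate").apply { textContent = e.pubDate })
        item.appendChild(doc.createElement("enclosure").apply {
            setAttribute("url", e.fileUrl)       // link to the MP3 file
            setAttribute("type", "audio/mpeg")
        })
        item.appendChild(doc.createElement("guid").apply { textContent = e.guid })
        channel.appendChild(item)
    }
    TransformerFactory.newInstance().newTransformer()
        .transform(DOMSource(doc), StreamResult(target))
}
```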
Here is an example of the output.
To wrap up, this is the result of translating a part of an English podcast into Russian.
Notice the American accent in the Russian speech.
We're also creating the RSS feed behind the scenes for distribution.
And now let's hear it.
So we have this example in English. Let's hear it.
This is episode two of What's in a Name.
In the previous episode, we learned how scientists name species and
the controversies that can result from those names, but names
don't just matter to scientists.
They can impact all of us.
In this episode, we are moving out of the universities and scientific publications where names are chosen and into the public realm, where names chosen by scientists meet non-scientists.
This was the English version, and now we're going to the Russian version.
Here you can hear that the Russian speech has this slight American accent, but the translation is pretty close.
The text is pretty close, so a Russian-speaking audience can just listen to this speech and understand what's going on in the original podcast.
Future improvements.
We've come a long way, but there is still room for growth.
Future updates will make this even better.
As we refine our podcast localization pipeline, we're looking ahead to key enhancements that will improve efficiency and audio quality.
Our next steps include automating episode splitting for better segment control, integrating FFmpeg for seamless audio merging, and developing custom-trained TTS voices to enhance naturalness and authenticity.
These improvements will help create a more polished and engaging listening
experience for localized content.
As we wrap up, let's reflect on key takeaways from this presentation.
First, deploying machine learning in real-world applications is an iterative process.
We continuously refine our approach based on feedback, performance evaluations, and advancements in AI.
This work doesn't stop at deployment; it evolves to meet new challenges.
Second, by strategically combining OpenAI tools, including Whisper for transcription, GPT-4o for translation and text refinement, and TTS-1 for voice synthesis, we can achieve high-quality localization while preserving the original content's intent and tone.
These AI models work together to streamline the process and maintain consistency across different podcast elements.
And finally, there is still room for improvement.
Future enhancements such as automated episode splitting, FFmpeg-based merging, and custom voice training for TTS will refine the process even further.
As we continue developing these solutions, our goal is to make
localized podcasts feel as natural and seamless as their original versions.
Thank you all for your attention.
This project is an exciting step forward in breaking language barriers in podcasting, and we're looking forward to what comes next.
Thank you.