Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Ruta.
Welcome to this talk on automating podcast localization using OpenAI's API stack.
In this talk, we'll explore how to convert English podcasts into Russian while maintaining tone and style.
We'll go through architecture, technical challenges, and real examples.
Let's get started.
The main question is why we want to automate podcast localization, and there are two main reasons.
The first reason is to make content accessible to a wider audience.
Podcasts which are only in English can now be consumed by Russian-speaking people without any knowledge of English. The second reason is to reduce manual effort while ensuring high-quality results.
In other words, automation makes it simpler to translate podcasts
without human intervention.
While doing so, we want to maintain tone and style in translation, and GPT-4o helps with that by allowing us to translate English speech to Russian while preserving the original tone and style.
A solution overview.
Here is the overall structure of our localization system.
Each step plays a critical role in transforming an English podcast into a Russian one.
The first step is podcast download, where we download the podcast metadata, such as title and description, and the MP3 file.
Then there is the second step, transcription: we have to transcribe this MP3 file into text. After that comes text processing, where we give this text to the GPT-4o model to process.
The processing includes adding all the punctuation marks, fixing the grammar, and so on, so that the speech sounds more natural and more lively.
Then we have speech synthesis, and this is where we take the TTS-1 model and produce the Russian track from this translated text.
And audio assembly. This step is needed because the TTS-1 model has a restriction of about 4,000 characters at once, so we can only give it about 4,000 characters at a time to make an MP3 track from. And if the podcast is long enough, we have to split it into multiple chunks and then merge these audio files.
And the next step is RSS generation, where we generate an RSS feed, which is then consumed by podcast platforms.
They can check this RSS feed later on and see whether there is a new episode of a podcast. Then they add it to their platform, notify users, and so on.
And here is the whole pipeline, and as you can see, all the steps run one after another.
So we download the track, then we transcribe it, then we enhance the transcription, then we translate the enhanced transcript, and so on.
And this design not only helps to maintain clarity, but also allows
for easy scalability and improvement.
Each phase in the pipeline plays a specific role in ensuring the content is processed, translated, and delivered in the best possible way.
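As a rough illustration of this sequential design, here is a minimal Kotlin sketch; every function name is a hypothetical placeholder for a pipeline stage, not the project's real API.

```kotlin
// Every function here is a hypothetical placeholder (TODO) standing in for a
// pipeline stage; each stage only consumes the output of the previous one.
fun downloadEpisode(id: String): ByteArray = TODO("podcast metadata + MP3 download")
fun transcribe(audio: ByteArray): String = TODO("speech-to-text with Whisper-1")
fun improveTranscription(text: String): String = TODO("grammar and punctuation via GPT-4o")
fun translateToRussian(text: String): String = TODO("translation via GPT-4o")
fun synthesizeSpeech(text: String): ByteArray = TODO("TTS-1, in chunks of up to 4,000 characters")
fun assembleAudio(audio: ByteArray): ByteArray = TODO("merge the MP3 chunks")
fun publishToRss(id: String, audio: ByteArray): Unit = TODO("RSS feed generation")

fun localizeEpisode(id: String) {
    val audio = downloadEpisode(id)
    val russianText = translateToRussian(improveTranscription(transcribe(audio)))
    publishToRss(id, assembleAudio(synthesizeSpeech(russianText)))
}
```

Because each stage only depends on the output of the previous one, the stages can be developed and swapped independently.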
All these steps are independent of each other, and this plays a huge role in this pipeline.
Key technologies.
So we use Kotlin here, because we're most familiar with Kotlin.
We use podcast4j as a framework for working with podcastindex.org, where podcastindex.org is a huge database of podcasts, and you can ask it about any podcast: title, description, track, and, for example, thumbnail. A lot of info you can take from there.
And the OpenAI API, and we use three models from there.
We use Whisper-1 to transcribe audio from the original track.
We use GPT-4o to work on the transcribed text, to enhance it and, basically, to translate it.
And then we use TTS-1 to convert this translated text into the Russian MP3 track.
We use the Ktor HTTP client to make all the HTTP requests; while podcast4j has its own client, for the OpenAI API we use Ktor.
We use Jackson for all the JSON data.
We use the javax.xml APIs for building the RSS feed.
And we are planning to use FFmpeg to merge the chunks of MP3 files. This will allow us to merge these chunks in the correct way, because if you just merge their content, there might be problems with, for example, playback. You cannot play it from the middle of the track, because each MP3 file has its own metadata, including the length of the track, so you may jump to a wrong place, basically.
And FFmpeg allows us to merge these tracks, to merge the metadata, and to bring the audio levels to the same level, so you don't hear big changes in loudness; it just makes the audio files more listenable.
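A minimal sketch of what that planned FFmpeg step could look like, assuming an ffmpeg binary is available on the PATH; the concat demuxer joins the chunks and the loudnorm filter evens out the volume. This is not the project's current code, which still merges raw bytes.

```kotlin
import java.io.File

fun mergeWithFfmpeg(chunks: List<File>, output: File) {
    // FFmpeg's concat demuxer reads the list of inputs from a text file.
    val list = File.createTempFile("chunks", ".txt").apply {
        writeText(chunks.joinToString("\n") { "file '${it.absolutePath}'" })
    }
    val process = ProcessBuilder(
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list.absolutePath,
        "-af", "loudnorm",      // normalize loudness across chunks
        "-c:a", "libmp3lame",   // re-encode so metadata and duration are rewritten
        output.absolutePath
    ).inheritIO().start()
    check(process.waitFor() == 0) { "ffmpeg failed" }
}
```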
Podcast downloading.
The first step of the pipeline is downloading podcast metadata for new episodes from podcastindex.org, using the podcast4j library.
Once the metadata is retrieved, we can then download the corresponding MP3 audio files.
These files are the core content that will be processed in the subsequent stages of the pipeline: transcription, translation, et cetera.
Now that we have the podcast information, we can move to the next step, which is speech-to-text with Whisper.
In this step of the pipeline, we use the Whisper-1 API to convert the downloaded podcast audio into text.
Basically, we just give the MP3 track to the Whisper-1 API, and it returns us the text.
One problem with that is that it works only for files under 25 megabytes.
Most of the podcasts we work with are less than 25 megabytes, but there are bigger ones.
And if the file is bigger, we have to split it into multiple chunks and then process each chunk independently.
This is just a requirement of Whisper-1: it does not accept files bigger than 25 megabytes.
It's not a big problem, but it is what it is.
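The talk does not show how the split is done; as an assumption, a naive byte-level split like the hypothetical helper below would mirror the byte-level merge described later, though a time-based split (for example with FFmpeg's segment muxer) would be cleaner.

```kotlin
import java.io.File

// Hypothetical helper: split a large MP3 into byte chunks under Whisper-1's
// 25 MB upload limit. A crude byte-level split; decoders usually resync on
// the next MP3 frame header, but this is only a sketch of one possible approach.
fun splitForWhisper(file: File, maxBytes: Int = 24 * 1024 * 1024): List<ByteArray> {
    val bytes = file.readBytes()
    if (bytes.size <= maxBytes) return listOf(bytes)
    return (bytes.indices step maxBytes).map { start ->
        bytes.copyOfRange(start, minOf(start + maxBytes, bytes.size))
    }
}
```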
Text enhancement and translation.
First, GPT-4o enhances grammar, punctuation, and readability.
For the prompt, we say: "Context: this is the transcription of a podcast in English. Fix the grammar and punctuation, or translate it to English if it's in the wrong language. Detect where the podcast starts and cut unrelated content at the start. Output format: output only the text, with no introduction."
Once the text is polished, we send another request to GPT-4o, this time to translate the text into Russian, and the prompt is: "Translate the text below to the Russian language. Keep the translation as close to the original in tone and style as you can."
And we just give it the text.
This way, GPT-4o translates the text into Russian while preserving the nuances of the original text.
This ensures that the podcast's conversational style, humor, and personality are preserved, making it engaging and relatable to the Russian-speaking audience, just as it is to the English-speaking one.
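As a hedged sketch of what that translation request might look like in Kotlin with Ktor and Jackson (the tools mentioned earlier), here is one possible call to the public chat-completions endpoint; the prompt is the one quoted above, and error handling and retries are omitted.

```kotlin
import com.fasterxml.jackson.databind.ObjectMapper
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import io.ktor.http.*
import kotlinx.coroutines.runBlocking

// Hypothetical helper: one GPT-4o call with the translation prompt from the talk.
fun translateToRussian(text: String, apiKey: String): String = runBlocking {
    val mapper = ObjectMapper()
    val client = HttpClient(CIO)
    val payload = mapper.writeValueAsString(
        mapOf(
            "model" to "gpt-4o",
            "messages" to listOf(
                mapOf(
                    "role" to "user",
                    "content" to "Translate the text below to the Russian language. " +
                        "Keep the translation as close to the original in tone and style as you can.\n\n$text"
                )
            )
        )
    )
    val response = client.post("https://api.openai.com/v1/chat/completions") {
        header(HttpHeaders.Authorization, "Bearer $apiKey")
        contentType(ContentType.Application.Json)
        setBody(payload)
    }.bodyAsText()
    // The translated text comes back in choices[0].message.content.
    mapper.readTree(response).path("choices").path(0).path("message").path("content").asText()
}
```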
Speech synthesis with TTS-1.
Next, let's dive into the process of converting the translated text into Russian audio using the OpenAI TTS-1 model, along with the challenges faced in this step.
In this step, we just use the TTS-1 model with the Russian text, and it returns us MP3 tracks with a Russian voice.
While the synthesized voices are quite good, they still carry a slight American accent, which may not always sound perfectly natural for a Russian-speaking audience.
For example, the intro in English might be "Welcome to the Nature Podcast, sponsored by X," which is translated into the Russian "Добро пожаловать в подкаст Nature…".
When comparing the original audio, "AI is transforming science," with the localized version, you may notice that while the translation is accurate, the TTS voice still carries an American tone.
This slight accent issue can impact the overall feel of the podcast.
Audio assembly challenges.
Next, let's dive into one of the key challenges in the pipeline: merging audio files.
It may seem straightforward, but the process of combining multiple MP3 files presents some technical hurdles. And yes, as I already said, MP3 metadata can conflict, and conflicts can create playback issues. Currently we're just merging the MP3 files together, taking their byte arrays and concatenating them.
It may not work properly in every case, but it works if you just listen to the podcast from start to finish. If you try to listen to it from the middle, you may have issues in some players; in others, it might work.
Also, different MP3 files may have varying bitrates, sample rates, encoding methods, volume levels, all this stuff.
We need to work on that with FFmpeg, but this is not the priority right now.
RSS feed generation.
Now let's look at the final step in our localization pipeline, which is publishing the translated podcast through an RSS feed. Two important aspects of the process for distributing the translated podcast are XML-based RSS feed creation and translated metadata for better discoverability.
The first one, XML-based RSS feed creation, is about creating the XML with the RSS feed and then feeding this RSS feed into the podcast platforms so that they can access it, see if we have new episodes, and then add new episodes to their platform.
The second thing, translated metadata for better discoverability, basically means that we take all the titles and all the descriptions and translate them to Russian, so people can more easily find what they want to find, because it would be weird to have a Russian podcast with an English description, right?
Some of the core challenges so far were file size limits in Whisper, so we sometimes have to split the original MP3 files into smaller chunks.
The second one is style preservation in translation, and here we rely on the OpenAI models and on our prompts, so that we do not just translate the text as, for example, Google Translate or any other translation platform would, but also preserve the style, the terminology, and everything else using the OpenAI models.
And then there are all the merging complexities; FFmpeg should solve our problems there, but it'll take some time.
In order to ensure high-quality translations, we use various techniques to fine-tune GPT-4o.
These include a grammar-correction prompt to fix Whisper's output; handling mixed-language content, where GPT-4o can translate the text into Russian so there are only Russian words; and tone preservation techniques, where we ask GPT-4o to preserve the tone of the text while it translates into Russian.
Let's look at an example.
The input is "AI is cool, and in Russian — круто." Here you can see that we have both English and Russian words. In the transcript we will have "AI is cool and in Russian kruto," which is all in English letters, and then we translate it into Russian and get the fully Russian version.
This example highlights how fine-tuning the prompts ensures that GPT-4o preserves not just the meaning, but also the tone and style of the original message.
In this case, it ensures that the use of "cool" and "круто" fits naturally within the context and keeps the conversational tone intact in the translation.
Maintaining the original tone of a podcast is essential for keeping the audience engaged, and we do this by fine-tuning GPT-4o's translation with specific prompts to preserve the tone.
In addition, we selected voices that match the energy level of the speaker.
Let's explore some examples that demonstrate how we address challenges related to tone and personalization in translation.
On the left, you can see that the voice has a slight American accent, so the result sounds non-native.
It's usable, but we will improve the voice later on.
We translate all intros, outros, and ads. We have plans to either remove ads or adapt them for the Russian audience, but for now, we just translate them, as they are part of the podcast.
On the right side, you can see a before-and-after comparison example: "AI is transforming science," localized into Russian.
It has this American accent, and it doesn't sound that natural to Russian-speaking listeners. We also have to work on this part.
Quality assurance measures.
Currently, we ensure translation accuracy through manual reviews, where we check for nuances, inconsistencies, and tone shifts that automated systems might miss.
This helps maintain the original intent and style of the podcast, but the process is time-consuming and not easily scalable as we expand to more podcasts and languages.
To improve efficiency and consistency in the future, we plan to add automated quality scoring measures like BLEU scores, which compare machine-generated translations to human translations, and native speaker evaluations.
BLEU scores provide a quantitative measure of translation accuracy, while native speaker feedback ensures that the translation sounds natural and culturally appropriate.
These improvements will speed up the review process and help maintain high
translation quality as the system scales.
And here I will show some of the key code snippets that we have. This is one of the first parts, which is downloading podcast episodes.
We have a podcast ID for the podcast that we want to download.
Here we take this podcast, download all the episodes for this podcast, and later we download all the MP3 files for this podcast.
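The snippet itself is not reproduced in the transcript; here is a minimal stand-in sketch, assuming a hypothetical fetchEpisodeUrls helper in place of the podcast4j calls (the library's real API is not shown in the talk), with the MP3 download done through the JDK HTTP client.

```kotlin
import java.io.File
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Hypothetical stand-in for the podcast4j lookup against podcastindex.org.
fun fetchEpisodeUrls(podcastId: Long): List<String> = TODO("episodes by podcast ID")

fun downloadEpisodes(podcastId: Long, targetDir: File): List<File> {
    val http = HttpClient.newHttpClient()
    return fetchEpisodeUrls(podcastId).mapIndexed { index, url ->
        val target = File(targetDir, "episode-$index.mp3")
        val request = HttpRequest.newBuilder(URI.create(url)).GET().build()
        // Stream each episode's MP3 straight to disk.
        http.send(request, HttpResponse.BodyHandlers.ofFile(target.toPath()))
        target
    }
}
```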
Another part is transcribing audio with Whisper, which is pretty simple.
Here we have to take the audio file and send it to the OpenAI client, without any prompts, without anything.
We have to say that we want to use the Whisper-1 model and that we use this specific audio file, and it just transcribes the file and returns us the text.
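A minimal sketch of that call, assuming the request goes straight to the public OpenAI transcription endpoint via Ktor: a multipart upload with just the model name and the file, no prompt. The actual project code may differ.

```kotlin
import com.fasterxml.jackson.databind.ObjectMapper
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.request.forms.*
import io.ktor.client.statement.*
import io.ktor.http.*
import kotlinx.coroutines.runBlocking
import java.io.File

fun transcribe(audio: File, apiKey: String): String = runBlocking {
    val client = HttpClient(CIO)
    val response = client.submitFormWithBinaryData(
        url = "https://api.openai.com/v1/audio/transcriptions",
        formData = formData {
            append("model", "whisper-1")
            append("file", audio.readBytes(), Headers.build {
                append(HttpHeaders.ContentType, "audio/mpeg")
                append(HttpHeaders.ContentDisposition, "filename=\"${audio.name}\"")
            })
        }
    ) {
        header(HttpHeaders.Authorization, "Bearer $apiKey")
    }.bodyAsText()
    // The endpoint answers with JSON like {"text": "..."}.
    ObjectMapper().readTree(response).path("text").asText()
}
```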
Another one is the improveTranscription method, which takes the transcription and the source language and improves this transcription.
As you can see in the prompt here, it improves grammar and punctuation and translates everything to the source language, which is English, if it's in the wrong language.
In a podcast you may have multiple languages, let's say English and French, or English and Russian.
For better understanding, you have to first translate everything to English and then everything from English to Russian; it just works better.
And here is another one, generating speech with TTS-1.
Here, we take the translated text that we want to turn into speech.
We split it into chunks of not more than 4,000 characters each, because this is a restriction of the TTS-1 model: you cannot generate speech for more than 4,000-something characters at once.
So we split it into chunks of smaller text. Then we generate MP3 files and merge them together into a single byte array.
And this will be our MP3 file in Russian, which we will later use for distribution.
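A hedged sketch of that step: naive 4,000-character chunking, one call to the public /v1/audio/speech endpoint per chunk, and a plain byte-array merge, which is the current pre-FFmpeg merging approach described above. The "alloy" voice is an assumption; the talk does not name the voice used.

```kotlin
import com.fasterxml.jackson.databind.ObjectMapper
import io.ktor.client.*
import io.ktor.client.call.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.http.*
import kotlinx.coroutines.runBlocking

fun synthesizeRussianSpeech(text: String, apiKey: String): ByteArray = runBlocking {
    val client = HttpClient(CIO)
    val mapper = ObjectMapper()
    text.chunked(4000).map { chunk ->
        val payload = mapper.writeValueAsString(
            mapOf("model" to "tts-1", "voice" to "alloy", "input" to chunk)
        )
        // Each request returns the MP3 bytes for one chunk of text.
        client.post("https://api.openai.com/v1/audio/speech") {
            header(HttpHeaders.Authorization, "Bearer $apiKey")
            contentType(ContentType.Application.Json)
            setBody(payload)
        }.body<ByteArray>()
    }.fold(ByteArray(0)) { merged, next -> merged + next } // naive byte-level merge
}
```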
Generating the RSS feed is pretty simple here, as we use the javax.xml APIs, which are included in Java, so we don't need any other libraries.
Here we create the RSS feed in XML format, add the podcast info such as title, description, link, language, everything, and then we add all the info for the episodes.
That includes title, description, publication date, an enclosure, which is a link to the file, and a GUID, which has to be unique among all the episodes in this podcast.
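A minimal sketch of that step using only javax.xml, as mentioned in the talk; Episode is a hypothetical data holder for the fields listed above, and a real feed would carry more elements (image, iTunes tags, and so on).

```kotlin
import java.io.File
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult

data class Episode(val title: String, val description: String, val pubDate: String,
                   val fileUrl: String, val guid: String)

fun writeRssFeed(podcastTitle: String, episodes: List<Episode>, target: File) {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument()
    val rss = doc.createElement("rss").apply { setAttribute("version", "2.0") }
    doc.appendChild(rss)
    val channel = doc.createElement("channel")
    rss.appendChild(channel)
    // Podcast-level info: title and language of the localized feed.
    channel.appendChild(doc.createElement("title").apply { textContent = podcastTitle })
    channel.appendChild(doc.createElement("language").apply { textContent = "ru" })
    for (e in episodes) {
        val item = doc.createElement("item")
        item.appendChild(doc.createElement("title").apply { textContent = e.title })
        item.appendChild(doc.createElement("description").apply { textContent = e.description })
        item.appendChild(doc.createElement("pubDate").apply { textContent = e.pubDate })
        item.appendChild(doc.createElement("enclosure").apply {
            setAttribute("url", e.fileUrl)       // link to the MP3 file
            setAttribute("type", "audio/mpeg")
        })
        item.appendChild(doc.createElement("guid").apply { textContent = e.guid })
        channel.appendChild(item)
    }
    TransformerFactory.newInstance().newTransformer()
        .transform(DOMSource(doc), StreamResult(target))
}
```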
Here is an example of the output.
To wrap up, this is the result of translating a part of an English podcast into Russian.
Notice the American accent in the Russian speech.
We're also creating the RSS feed behind the scenes for distribution.
And now let's hear it.
So we have this example in English. Let's hear it.
This is episode two of What's in a Name.
In the previous episode, we learned how scientists name species and
the controversies that can result from those names, but names
don't just matter to scientists.
They can impact all of us.
In this episode, we are moving out of the universities and scientific publications where names are chosen and into the public realm, where names chosen by scientists meet non-scientists.
This was the English version, and now we're going to the Russian version.
Here you can hear that the Russian speech has this slight American accent, but the translation is pretty close.
The text is pretty close, so a Russian-speaking audience can just listen to this speech and understand what's going on in the original podcast.
Future improvements.
We've come a long way, but there is still room for growth.
Future updates will make this even better.
As we refine our podcast localization pipeline, we're looking ahead to key enhancements that will improve efficiency and audio quality.
Our next steps include automating episode splitting for better segment control, integrating FFmpeg for seamless audio merging, and developing custom-trained TTS voices to enhance naturalness and authenticity.
These improvements will help create a more polished and engaging listening
experience for localized content.
As we wrap up, let's reflect on key takeaways from this presentation.
First, deploying machine learning in real-world applications is an iterative process.
We continuously refine our approach based on feedback, performance evaluations, and advancements in AI.
This work doesn't stop at deployment; it evolves to meet new challenges.
Second, by strategically combining OpenAI tools, including Whisper for transcription, GPT-4o for translation and text refinement, and TTS-1 for voice synthesis, we can achieve high-quality localization while preserving the original content's intent and tone.
These AI models work together to streamline the process and maintain consistency across different podcast elements.
And finally, there is still room for improvement.
Future enhancements such as automated episode splitting, FFmpeg-based merging, and custom voice training for TTS will refine the process even further.
As we continue developing these solutions, our goal is to make
localized podcasts feel as natural and seamless as their original versions.
Thank you all for your attention.
This project is an exciting step forward in breaking language barriers in podcasting, and we're looking forward to what comes next.
Thank you.