One hundred eighteen thousand plays. That's a number that stopped me mid-scroll when I saw it. France sitting at number two, Israel at number three. Our analytics are pulling straight from the R two bucket, so these aren't inflated platform numbers. These are actual downloads.
Here's the thing that keeps me up at night. Every single one of those listeners is hearing us in English, regardless of what language they dream in. We're global, but we're monolingual.
Daniel sent us this one. He's been thinking about doing episodes in Hebrew, found a TTS system that can clone our voices, but then he hit the obvious wall. Creating parallel podcasts in every language is impossible. The question is whether there are emerging standards in the podcast XML schema that could handle localization on the front end. Automatic dubbing from transcripts, no per-language feeds, no exponential cost spiral. The goal is letting anyone access the show in their language without breaking the operation.
By the way, DeepSeek V four Pro is generating our script today. So if anything comes out especially coherent, you know who to thank.
That's how these things work.
But let's talk about why this matters right now. Podcasting has exploded globally. Edison Research's latest data shows that over four hundred sixty million people listen to podcasts monthly worldwide. Yet the vast majority of shows are produced in a single language. The gap between who could be listening and who can actually understand us is enormous.
It's not just about us. Think about what it means for knowledge transfer. A medical podcast produced in English is useless to a rural doctor in Peru who only speaks Spanish. A history show in Japanese never reaches a curious student in Kenya. We're building this incredible repository of human conversation and expertise, and we're locking most of humanity out of it.
That's exactly the right framing. And what Daniel's asking about isn't science fiction. The Podcasting Two Point Zero namespace has been quietly building the plumbing for this for years. We're talking about XML tags that live inside your RSS feed, tags that tell podcast apps here's a transcript in French, here's one in Hebrew, here's an audio track in Arabic. The player does the work, not the creator.
The fundamental shift is moving localization from the server side, where the creator has to manually produce and host everything, to the client side, where the app or player handles it automatically. And that distinction is everything.
Let me ground this concretely. Right now, if I want to release My Weird Prompts in Hebrew, I have two options. Option one, I create a completely separate RSS feed with Hebrew audio files, host them, maintain them, update them in parallel. That doubles or triples our operational complexity for every language we add.
Option two, we hire voice actors or use TTS to dub every episode, embed the translated audio alongside the original, and hope whatever app our listeners use can figure out which file to play. Also a nightmare.
And that's the problem Daniel put his finger on. But the emerging approach flips the model entirely. Instead of us doing the work, we publish the ingredients, and the player assembles the meal.
I like that. Though now I'm hungry.
You're always hungry. But stay with me. The key ingredients are transcripts with language attributes, emerging tags for specifying TTS voice profiles, and a newly proposed tag called alternative enclosure that lets you attach multiple audio files to a single episode, each tagged with a language code.
Let's get into the actual plumbing. The Podcast Index has been maintaining what's called the Podcasting Two Point Zero namespace. As of April twenty twenty-six, there are over sixty tags in that namespace. Some of them are widely adopted, like the transcript tag. Some are still in draft. But collectively, they're building the infrastructure for exactly what Daniel's asking about.
The transcript tag is the foundation. It's deceptively simple. Inside your RSS feed, for each episode, you add a podcast colon transcript element. Two of its attributes are critical here. One is the URL pointing to the transcript file. The other is the language attribute, using standard BCP forty-seven language codes. So en for English, he for Hebrew, fr for French, ar for Arabic.
The transcript format matters too. You can specify whether it's plain text, HTML, SRT, or VTT. The VTT format, Web Video Text Tracks, is particularly important here because it includes timing information. Each line of dialogue is stamped with when it starts and ends.
Which means a podcast app doesn't just display the transcript. It can synchronize it with playback. And here's where the localization magic starts. If you publish a VTT transcript in English with timing data, a machine translation system can translate the text while preserving the timestamps. Then a TTS engine can read that translated text in the target language, using the same timing.
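As a toy illustration of that idea, here's a minimal sketch. It assumes a simplified VTT file with single-line cues, and the translate function is a stand-in for a real machine-translation call. The point is that the cue timestamps pass through untouched:

```python
import re

# Tiny VTT sample standing in for a real episode transcript.
VTT = """WEBVTT

00:00:01.000 --> 00:00:04.000
Welcome back to the show.

00:00:04.500 --> 00:00:08.000
Today we talk about RSS.
"""

# Deliberately simple cue pattern: no cue settings, no multi-line cues.
CUE_RE = re.compile(
    r"(\d\d:\d\d:\d\d\.\d{3}) --> (\d\d:\d\d:\d\d\.\d{3})\n(.+?)(?:\n\n|\n?$)",
    re.S,
)

def parse_cues(vtt_text):
    """Return (start, end, text) tuples; timing stays untouched."""
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in CUE_RE.finditer(vtt_text)]

def translate(text, target_lang):
    # Stub for an MT API call; only the text changes, never the clock.
    fake_fr = {
        "Welcome back to the show.": "Bon retour dans l'émission.",
        "Today we talk about RSS.": "Aujourd'hui, nous parlons de RSS.",
    }
    return fake_fr.get(text, text)

def translate_vtt(vtt_text, target_lang):
    """Re-emit the VTT with translated text under the original timestamps."""
    out = ["WEBVTT", ""]
    for start, end, text in parse_cues(vtt_text):
        out += [f"{start} --> {end}", translate(text, target_lang), ""]
    return "\n".join(out)

french = translate_vtt(VTT, "fr")
```

A real pipeline would swap the stub for an actual MT call, but the shape is the same: the timing grid is the skeleton, and only the flesh changes language.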
The listener hits play, their app checks what language their phone is set to, sees that there's a Hebrew transcript available, and either displays it as captions or feeds it to a TTS engine that dubs the episode in real time. The creator uploaded one file, and the app did the rest.
That's the vision. And the pieces are coming together faster than most people realize. In September twenty twenty-five, the Podcast Index published a draft specification for a new tag called podcast colon alternative enclosure. This is a big deal.
The current enclosure tag is how RSS feeds point to the actual audio file. One enclosure per episode. If you want to offer multiple languages, you're stuck. The alternative enclosure tag changes that. It lets you specify multiple audio files for the same episode, each with a language attribute. So you could have the original English recording, plus a Hebrew TTS-generated dub, plus a French dub, all referenced in the same item element.
The player picks the right one based on the listener's preferences.
The draft spec also includes a rel attribute that can indicate whether the alternative is a dub, a translation, or a different version entirely. It's been under community review since the September proposal, and several major hosting platforms have expressed interest.
Here's the question I think Daniel would ask. Who's actually implementing this? Because standards are great, but adoption is what matters.
That's the right question. And there's real movement. In January twenty twenty-six, Podlove Publisher released version four point zero, which added native support for multilingual transcripts. Podlove is one of the major open-source podcast publishing platforms, used by thousands of independent creators. They now let you upload transcripts in multiple languages and automatically generate the correct podcast colon transcript tags with language attributes.
The tooling is arriving. What about the player side?
Apple Podcasts added support for the podcast colon transcript tag in iOS eighteen point three, which shipped in February twenty twenty-five. Currently, they only use it for display, showing the transcript text as the episode plays. They don't yet do TTS dubbing from the transcript. But the infrastructure is there. Once the transcript is being parsed and displayed, adding a TTS layer is a feature update, not a complete rebuild.
Apple's not alone. Overcast, Pocket Casts, Podcast Addict, they've all been gradually adopting Podcasting Two Point Zero tags. The transcript tag is now supported by most major players. The alternative enclosure tag is the next frontier.
There's another tag worth mentioning here. It's still early stage, but there's a proposal circulating for something called podcast colon voice. The idea is that you can specify TTS voice profiles in your RSS feed. So instead of the app using whatever default voice it has, you can say for Hebrew, use this specific ElevenLabs voice ID that's been trained on our voices. For French, use this one.
That's where Daniel's work on voice cloning connects. He's already found a TTS system that can clone our voices in Hebrew. If the podcast colon voice tag becomes a standard, he could publish those voice profiles in the feed, and any compatible app could use them to generate dubs that actually sound like us, not like generic robot voices.
ElevenLabs, which is the leading player in this space, reported in Q one twenty twenty-six that their voice cloning TTS now supports twenty-nine languages with over ninety-five percent intelligibility in user tests. That's a staggering number. Two years ago, cross-language voice cloning was experimental. Now it's production-ready.
Let me synthesize what we're describing. An RSS feed that includes a transcript in ten languages, each with timing data. An alternative enclosure tag pointing to TTS-generated audio files in those languages. A voice tag specifying the voice profiles to use for on-the-fly generation. And a podcast app that reads all of this and presents the listener with their native language automatically.
That's the target architecture. And none of it requires the creator to maintain ten separate feeds or manually dub every episode. You produce your show once, in your native language. A pipeline handles transcription, translation, and TTS generation. The RSS feed becomes a multilingual delivery mechanism rather than a single-language syndication format.
The cost question is central. Daniel was explicit about not wanting exponentially increasing operating costs. Let's put numbers on this.
The numbers are surprisingly manageable. DeepL's API for machine translation runs about twenty dollars per million characters. A typical episode transcript is maybe fifty thousand characters. So you're looking at roughly a dollar per episode per language for translation. The TTS generation is the bigger cost driver.
ElevenLabs charges by the character for TTS generation. Their standard rate is about thirty cents per thousand characters. Same fifty thousand character transcript, that's fifteen dollars per language per episode. Not nothing, but for a show with a hundred eighteen thousand plays, we're talking about serving an audience that currently can't access the show at all.
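Putting those published rates together, with the episode's own round numbers as inputs, so treat the outputs as estimates:

```python
# Back-of-envelope costs: DeepL-style translation at ~$20 per million
# characters, TTS generation at ~$0.30 per thousand characters.
chars_per_episode = 50_000

translation_cost = chars_per_episode / 1_000_000 * 20.00   # per language
tts_cost = chars_per_episode / 1_000 * 0.30                # per language

print(f"translation: ${translation_cost:.2f} per episode per language")
print(f"TTS dub:     ${tts_cost:.2f} per episode per language")
# For five languages, that's 5 * (1 + 15) = $80/episode pre-rendered,
# or $5/episode if only the text is translated up front.
```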
That's the server-side approach, where you pre-generate the audio files and host them. The client-side approach is even cheaper for the creator, because the listener's device does the TTS work. You just publish the translated transcript and the voice profile, and the app handles the rest.
The core distinction Daniel needs to understand is this. Traditional localization means the creator does everything. Record in Hebrew, host Hebrew files, maintain a Hebrew feed. Front-end localization means the creator publishes the raw materials, and the player assembles the experience. It's the difference between shipping assembled furniture and shipping a flat pack with instructions.
That flat pack analogy works, so let me crack open the actual instruction manual. The podcast colon transcript tag is the simplest piece to implement, and it's already widely supported. Here's what it looks like in the RSS XML. You've got your item element for the episode, and inside it you add a podcast colon transcript element with attributes. The URL points to the transcript file, the type attribute says whether it's plain text or VTT or SRT, and the language attribute uses BCP forty-seven codes.
For a Hebrew listener, the feed would include something like podcast colon transcript URL equals slash transcripts slash episode two three six two dot vtt, type equals text slash vtt, language equals he. And then another one for French, same file translated, language equals fr.
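Spelled out as actual markup, those two tags might look like the sketch below. The domain and file paths are placeholders, and the namespace URI is the Podcasting Two Point Zero one; the short Python check just confirms the XML parses and exposes the language attributes a player would filter on:

```python
import xml.etree.ElementTree as ET

# Hypothetical <item> showing two transcript tags for the same episode.
item_xml = """<item xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <title>Episode 2362</title>
  <podcast:transcript
      url="https://example.com/transcripts/episode2362.he.vtt"
      type="text/vtt"
      language="he"/>
  <podcast:transcript
      url="https://example.com/transcripts/episode2362.fr.vtt"
      type="text/vtt"
      language="fr"/>
</item>"""

item = ET.fromstring(item_xml)
ns = {"podcast": "https://podcastindex.org/namespace/1.0"}
# A player would filter this list against the device language setting.
langs = [t.get("language") for t in item.findall("podcast:transcript", ns)]
print(langs)
```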
And a smart podcast app reads all of those on fetch, checks the listener's device language setting, and selects the matching transcript. If the app supports TTS dubbing, it feeds that Hebrew VTT file to the TTS engine. The timing data in the VTT ensures the Hebrew speech lines up with the original pacing.
Which raises a subtle point. Translated text is almost never the same length as the original. Hebrew, for example, tends to be more compact than English. So if you're doing real-time TTS dubbing, the app has to handle those timing mismatches. Either by adjusting playback speed or by inserting brief pauses.
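One plausible way a player could reconcile the two clocks, sketched under an assumed average TTS speaking rate. The fifteen-characters-per-second figure is invented for illustration, as are the rate bounds:

```python
# Assumed average speaking rate of the TTS voice (illustrative, not measured).
TTS_CHARS_PER_SEC = 15.0

def fit_cue(cue_seconds, translated_text, min_rate=0.85, max_rate=1.25):
    """Return (playback_rate, trailing_silence_seconds) for one cue.

    Estimate how long the dub will run, nudge the playback rate within
    tolerable bounds to fit the original slot, and pad with silence when
    the translated line runs short (compact languages like Hebrew often do).
    """
    est = len(translated_text) / TTS_CHARS_PER_SEC  # estimated dub length
    if est <= 0 or cue_seconds <= 0:
        return 1.0, cue_seconds
    rate = est / cue_seconds            # >1 means speed up to fit the slot
    rate = max(min_rate, min(max_rate, rate))
    silence = max(0.0, cue_seconds - est / rate)
    return round(rate, 2), round(silence, 2)
```

For a three-second cue whose translation is much shorter than the original line, this yields the slowest allowed rate plus a stretch of trailing silence; a much longer translation gets capped at the fastest allowed rate instead.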
That's a real engineering challenge, and it's one of the reasons the alternative enclosure tag is so important as a complement. The September twenty twenty-five draft from the Podcast Index lets you specify multiple pre-rendered audio files. So instead of relying on the listener's device to do TTS in real time, you pre-generate the Hebrew audio using ElevenLabs, host it alongside your original English file, and the alternative enclosure tag points to it with a language equals he attribute.
The rel attribute tells the player this is a dub, not just a different version.
The draft specifies rel values like quote translated unquote, quote dub unquote, quote alternative unquote. So the app knows this isn't a bonus episode or a director's cut. It's the same content in another language.
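Putting the draft together, an item with pre-rendered dubs might look like the following sketch. The attribute names follow the draft as described in this episode and could change before the spec is finalized; all URLs are placeholders:

```python
import xml.etree.ElementTree as ET

# Hypothetical <item>: one original enclosure plus two language-tagged dubs.
item_xml = """<item xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <enclosure url="https://example.com/audio/episode2362.en.mp3"
             type="audio/mpeg" length="60000000"/>
  <podcast:alternativeEnclosure
      url="https://example.com/audio/episode2362.he.mp3"
      type="audio/mpeg" language="he" rel="dub"/>
  <podcast:alternativeEnclosure
      url="https://example.com/audio/episode2362.fr.mp3"
      type="audio/mpeg" language="fr" rel="dub"/>
</item>"""

item = ET.fromstring(item_xml)
ns = {"podcast": "https://podcastindex.org/namespace/1.0"}

def pick_audio(item, device_lang):
    """Prefer a dub matching the device language, else the original file."""
    for alt in item.findall("podcast:alternativeEnclosure", ns):
        if alt.get("language") == device_lang:
            return alt.get("url")
    return item.find("enclosure").get("url")

print(pick_audio(item, "he"))  # Hebrew device settings get the Hebrew dub
```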
Let me push on the tradeoffs here. Client-side TTS dubbing means zero additional storage and bandwidth costs for the creator. The listener's device does the work. But the quality depends entirely on the device's TTS engine. Pre-rendered dubbing using something like ElevenLabs gives you studio-quality cloned voices, but you're hosting potentially gigabytes of additional audio.
That's where the podcast colon voice tag gets interesting. It's still an early proposal, not yet formally adopted, but the concept is that you specify a TTS voice profile in your feed. So even for client-side generation, the app doesn't use some generic system voice. It calls out to an API with your specified voice ID. ElevenLabs supports this. You could have a voice profile trained on my voice and your voice, and the app fetches it dynamically.
The creator doesn't host the audio files, but they also don't surrender quality to whatever robotic voice ships with the operating system. It's a middle path.
Podlove Publisher four point zero, which shipped in January twenty twenty-six, is the first major publishing tool to natively support multilingual transcript workflows. You upload your English transcript, your Hebrew transcript, your French transcript, and it generates all the correct podcast colon transcript tags with the right language attributes automatically. That lowers the technical barrier enormously.
There's also the podcast colon medium tag, which has been around for a while but takes on new significance here. It signals to directories and apps what kind of content this is. Combined with the podcast colon feed tag, which can carry language metadata, you're essentially telling the entire ecosystem this feed contains multilingual content without needing separate feeds for each language.
The full picture is an RSS item that contains a podcast colon transcript in five languages, a podcast colon alternative enclosure pointing to pre-rendered dubs in those same languages, a podcast colon voice tag specifying the TTS profiles to use if the app wants to generate audio on the fly, and feed-level metadata indicating this is a multilingual podcast. The player reads all of this and presents the listener with a language selector, or just picks automatically based on their system settings.
The creator's workflow is write once, transcribe once, translate via API, optionally pre-render via TTS, and publish a single feed. That's the architecture Daniel was asking about.
With that architecture in mind, let's talk about what happens when you actually deploy this at scale. Say you publish transcripts in ten languages. The transcripts are linked by URL, not embedded, so the feed itself only grows by the tags, a few hundred bytes each. The transcript files on your host run maybe fifty kilobytes apiece, so all ten together are about half a megabyte. That's negligible. Your RSS host won't even blink.
Bandwidth for the feed itself is trivial. The real question is audio. If you go the pre-rendered route with alternative enclosures, now you're hosting ten copies of every episode. A typical hour-long MP3 at high quality is maybe sixty megabytes. Ten languages, that's six hundred megabytes per episode. For a weekly show, you're adding two point four gigabytes of storage per month.
Bandwidth costs scale with downloads. If ten percent of our hundred eighteen thousand plays switch to a dubbed version, that's nearly twelve thousand additional downloads per episode at sixty megs each. You're talking over seven hundred gigabytes of additional bandwidth per episode.
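The scale math, spelled out with the round numbers above, so the outputs are order-of-magnitude estimates rather than a forecast:

```python
# Inputs are the episode's own round figures.
mb_per_episode = 60
languages = 10
episodes_per_month = 4          # weekly show
plays_per_episode = 118_000
dub_share = 0.10                # 10% of listeners pick a dubbed version

# Storage: one extra pre-rendered file per language, per episode.
storage_per_episode_mb = mb_per_episode * languages
storage_per_month_gb = storage_per_episode_mb * episodes_per_month / 1000

# Bandwidth: extra downloads from listeners switching to a dub.
dub_downloads = plays_per_episode * dub_share
bandwidth_per_episode_gb = dub_downloads * mb_per_episode / 1000

print(f"storage: {storage_per_month_gb:.1f} GB/month added")
print(f"dub bandwidth: {bandwidth_per_episode_gb:.0f} GB/episode added")
```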
The pre-rendered approach gets expensive fast. Which is why the client-side TTS model matters so much. The creator publishes text, not audio. The storage cost is essentially zero. The bandwidth cost is zero beyond the transcript file itself.
There's a middle ground that's emerging, and it connects to something YouTube has already proven works. YouTube launched auto-dubbing in March twenty twenty-five. They use AI to dub videos into nine languages, and it's entirely server-side. Creators just check a box. The key difference is YouTube can absorb those compute costs because they monetize at scale.
Podcasting doesn't have a single platform absorbing costs. It's a distributed ecosystem. So the question becomes who pays for the TTS generation.
That's where the podcast colon value tag gets genuinely interesting. It's already part of the Podcasting Two Point Zero namespace, designed for micropayments and value-for-value models. The idea is that a listener's app can automatically send a tiny payment, we're talking fractions of a cent, to the creator when they consume an episode.
Extend that to localization. A listener in France requests a French dub. The app generates it on the fly using the transcript and voice profile from the feed, and simultaneously triggers a micropayment that covers the TTS API cost. The listener pays maybe ten or twenty cents for the dub, the creator pays nothing.
The transaction is invisible to the user. Their app already knows their language preference. The TTS generation happens in seconds. The payment happens in the background via the Lightning Network. The entire experience feels like the podcast was just available in French.
The Lex Fridman model is instructive here. His podcast provides transcripts in over twenty languages, all done by a volunteer community. It's impressive, but it's still just text. A listener in Ukraine can read the transcript, but they can't listen to the conversation in Ukrainian while driving or cooking.
That's the gap YouTube is closing with auto-dubbing and that podcasting is starting to address. The infrastructure is different. YouTube controls the player, the hosting, the monetization. Podcasting has to build these capabilities into an open standard that dozens of different apps can implement independently.
Which brings us back to adoption. Apple Podcasts added transcript display in iOS eighteen point three, but they haven't touched TTS dubbing yet. Spotify has been experimenting with auto-translation features, but they're doing it inside their walled garden, not through RSS standards.
The risk of fragmentation is real. If Spotify builds proprietary auto-dubbing and Apple builds their own incompatible version, the open RSS approach loses momentum. Creators would have to choose which ecosystem to optimize for, which defeats the whole purpose.
For a show like ours with a hundred eighteen thousand plays, the math tilts toward the open approach. We're not big enough for Spotify to build custom features for us. But we're big enough that serving the French and Israeli audiences properly could meaningfully grow the show. The RSS-based approach lets us do it without platform dependency.
The cost structure actually makes sense at our scale. If five percent of our audience requests a dubbed version and the micropayment model covers the TTS cost, we're looking at maybe fifty to a hundred dollars per month in listener-paid dubbing fees versus thousands in server-side pre-rendering costs.
The key insight is that Podcasting Two Point Zero isn't just about adding tags to RSS. It's about building an economic layer into the protocol. Localization becomes a feature that pays for itself per use, rather than a production cost the creator has to fund upfront.
Let's make that concrete. If you're running a podcast and you want to start doing this today, what do you actually do? Step one is add podcast colon transcript tags with language attributes to your RSS feed. Castopod supports this natively. Transistor added it in their March update. Even if your hosting platform doesn't have a GUI for it, you can inject custom tags into most modern podcast hosts.
The transcript doesn't need to be perfect. Machine-generated is fine as a starting point. Descript or Otter dot ai will get you a solid English transcript. Then you pipe it through the DeepL API with target language equals he, target language equals fr, whatever you need. Their pricing is roughly twenty dollars per million characters.
A typical episode transcript for us runs about fifty thousand characters. That's one dollar per language for the translation. If you're doing five languages, you're spending five dollars per episode. For a show with our audience size, that's trivial.
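The translate step, shaped like DeepL's v2 REST API (endpoint and field names as documented at the time of writing; check their current docs and supported language list before relying on this). Nothing is sent here; the sketch only assembles the request:

```python
import json

API_URL = "https://api.deepl.com/v2/translate"

def build_translate_request(segments, target_lang, api_key):
    """Assemble a DeepL-style translate request without sending it."""
    headers = {
        "Authorization": f"DeepL-Auth-Key {api_key}",
        "Content-Type": "application/json",
    }
    body = {"text": segments, "target_lang": target_lang.upper()}
    return API_URL, headers, json.dumps(body)

url, headers, body = build_translate_request(
    ["Welcome back to the show."], "fr", "YOUR_KEY_HERE")
# Sending it is one call with any HTTP client, e.g.:
#   requests.post(url, headers=headers, data=body)
```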
The TTS generation for listeners who want audio dubbing — that cost hits at consumption time, not at publish time. The micropayment model via podcast colon value means the listener covers it. You're not fronting fifteen dollars per language per episode for pre-rendered audio.
There's one more piece listeners can actually help with right now. If you use Overcast, Pocket Casts, Apple Podcasts — reach out to the developers and ask them to implement client-side TTS dubbing using the transcript tags. The infrastructure exists. The tags are standardized. The TTS engines are good enough. What's missing is the feature in the app.
To be clear, we're not asking for charity here. This is a feature that benefits everyone. A Spanish-speaking listener in Argentina discovers a niche American history podcast because their app automatically offers it in Spanish. The creator gets a new audience. The app gets more engagement. It's a genuine win across the board.
The path is publish transcripts with language tags today, translate via API for pennies, and push your app developer to close the last mile with TTS playback. That's the practical roadmap.
There's a question that's going to determine whether any of this actually works. Will the industry converge on a single standard for multilingual audio enclosures, or are we heading toward fragmentation?
That's the tension. Right now you've got the Podcasting Two Point Zero namespace pushing open standards, and you've got Spotify and Apple each experimenting with their own approaches inside walled gardens. If they diverge, creators are back to square one — optimizing for individual platforms instead of publishing once.
The optimistic case is that Apple adopting the transcript tag for display created a precedent. They didn't invent their own thing. They implemented the existing RSS standard. If the alternative enclosure and voice tags gain enough traction among independent apps like Podverse and Fountain first, the big players might follow rather than build incompatible versions.
The pessimistic case is that Spotify's auto-translation experiments stay proprietary because they want it as a competitive moat. And Apple takes three years to decide whether client-side TTS is worth the engineering effort. Meanwhile, the open standard exists but only in niche apps that reach five percent of listeners.
I think the thing that tips the scale is cost. Server-side dubbing at Spotify's scale is expensive. Client-side TTS using RSS standards pushes the compute cost to the listener's device or to micropayment-funded API calls. Even a giant platform eventually notices that math.
The vision here is compelling. Imagine a world where you open your podcast app, search for any show in any language, and it just plays in yours. Not because the creator recorded it that way, but because the infrastructure handles translation and dubbing automatically. That's what these standards are building toward.
It's not science fiction. The tags exist. The TTS quality is there — ElevenLabs hit ninety-five percent intelligibility across twenty-nine languages in their Q one twenty twenty-six benchmarks. The transcript translation pipeline costs pennies. The missing piece is adoption in the apps people actually use.
That's what makes this moment interesting. We're not waiting for a breakthrough. We're waiting for implementation.
Which is why we're talking about it on a show about weird prompts and podcast infrastructure. This is the plumbing that determines whether a listener in Lyon or Tel Aviv gets to hear what we're saying right now.
For that, we thank our producer Hilbert Flumingtop, who keeps the RSS feed clean and the transcripts flowing.
This has been My Weird Prompts. If you enjoyed this episode, head to myweirdprompts dot com and sign up for the newsletter. We'll be back soon with whatever Daniel sends us next.