Swepod: the AI-Powered Podcast Generator
An AI-powered SaaS platform that converts any written content into professional podcast episodes with natural-sounding AI voices, emotion variation, and 30+ language support.

I consume a lot of written content (articles, documentation, research papers), but I do my best thinking while walking or commuting. I kept wishing I could listen to things instead of reading them, but existing text-to-speech tools produce flat, robotic output that's hard to stay focused on.
The idea for Swepod was simple: what if I could turn any piece of writing into an actual podcast with two hosts, natural conversation, and real emotional range? Not just narration, but something you'd actually want to listen to.
That question became a full SaaS product. Swepod lets anyone transform written content (a blog post, a PDF, a URL, or a custom script) into a fully produced, two-person podcast episode in minutes. No microphone, no editor, no studio required. It's live in production at swepod.com with real Stripe billing and a growing user base.
The core of Swepod is a two-step pipeline: script generation and audio synthesis.
Step 1 - Script Generation (Gemini 2.5 Flash): The user's input, whether a topic, custom script, URL, or PDF, is fed into Google Gemini with a carefully engineered prompt. The model returns a two-person dialogue where each line is tagged with one of 45+ emotion markers like [excited], [thoughtful], or [laughs]. For large inputs (over 32KB), Gemini first summarizes the content before writing the script, keeping generation fast without losing the key ideas.
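To make Step 1 a bit more concrete, here is a minimal sketch of the two pieces around the Gemini call: deciding whether an input crosses the summarization threshold, and turning the tagged script into structured lines. This is not Swepod's actual code. The names (`needsSummary`, `parseScript`, `DialogueLine`), the reading of "32KB" as 32 * 1024 UTF-8 bytes, and the `Speaker: [emotion] text` line format are all assumptions made for illustration.

```typescript
// Assumption: "32KB" is read as 32 * 1024 bytes of UTF-8 text.
const SUMMARIZE_THRESHOLD_BYTES = 32 * 1024;

// True when the input is large enough to be summarized before scripting.
function needsSummary(input: string): boolean {
  return new TextEncoder().encode(input).length > SUMMARIZE_THRESHOLD_BYTES;
}

interface DialogueLine {
  speaker: string;
  emotion: string | null; // null when the line carries no emotion tag
  text: string;
}

// Hypothetical line format: "Speaker: [emotion] spoken text".
// The emotion tag is optional so untagged lines still parse.
const LINE_PATTERN = /^([^:]+):\s*(?:\[([a-z ]+)\]\s*)?(.+)$/i;

function parseScript(script: string): DialogueLine[] {
  const lines: DialogueLine[] = [];
  for (const raw of script.split("\n")) {
    const match = LINE_PATTERN.exec(raw.trim());
    if (!match) continue; // skip blank or malformed lines
    lines.push({
      speaker: match[1].trim(),
      emotion: match[2] ? match[2].toLowerCase() : null,
      text: match[3].trim(),
    });
  }
  return lines;
}
```

Parsing into a structure like this keeps the emotion separate from the spoken text, so the audio step can pass each as its own field rather than re-reading the raw script.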
Step 2 - Audio Synthesis (Fish Audio): Each dialogue line is sent to the Fish Audio API with the selected voice and emotion context. The segments come back as individual audio buffers, which are concatenated server-side into a single MP3: the final podcast episode.
The emotion tags were the hardest part to get right. Too vague and the output sounds flat; too specific and the AI starts overacting. I went through dozens of prompt iterations before landing on a system where Gemini produces consistent, believable emotion tagging across a wide range of content types.
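The concatenation step can be sketched roughly as follows. It assumes every segment is already MP3 data with the same encoding parameters, and it uses plain `Uint8Array` so the example runs anywhere; `concatSegments` is a hypothetical name. In Node, `Buffer.concat` does the same job.

```typescript
// Join audio segments, in dialogue order, into one contiguous byte array.
// Assumption: all segments share the same MP3 encoding parameters, so
// placing their bytes back to back yields a playable file.
function concatSegments(segments: Uint8Array[]): Uint8Array {
  // Allocate the output once, instead of growing it per segment.
  const totalLength = segments.reduce((sum, s) => sum + s.length, 0);
  const output = new Uint8Array(totalLength);

  let offset = 0;
  for (const segment of segments) {
    output.set(segment, offset); // copy the segment at its position
    offset += segment.length;
  }
  return output;
}
```

Sizing the output up front means each byte is copied exactly once, which matters more as episodes get longer.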
Supporting four different input methods meant building four separate parsing paths that all converge on the same generation pipeline.
Topic Prompt: Straightforward - Gemini generates both the content and the script from a user-provided subject.
Custom Script: The user writes their own dialogue. Gemini formats it, adds emotion tags, and the audio pipeline takes over.
URL: Cheerio scrapes the target page for its main text content. SSRF prevention validates the URL against a blocklist of private IP ranges before the fetch is made.
PDF: pdf-parse extracts the raw text from the uploaded file. Large PDFs hit the 32KB summarization threshold quickly, so most PDF-based episodes go through the summarize-then-generate flow.
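The URL check mentioned above can be sketched as a small guard that runs before any fetch. This is an illustrative version, not Swepod's actual validator: `isUrlAllowed` and its helpers are hypothetical names, and the sketch inspects only the hostname as written. A production check would also need to validate the IP address the hostname resolves to, and re-check on redirects.

```typescript
// Parse a dotted-quad IPv4 literal into four octets, or null if it is not one.
function parseIPv4(host: string): number[] | null {
  const parts = host.split(".");
  if (parts.length !== 4) return null;
  const octets: number[] = [];
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const value = Number(part);
    if (value > 255) return null;
    octets.push(value);
  }
  return octets;
}

// Blocklist of non-public IPv4 ranges.
function isBlockedIPv4(octets: number[]): boolean {
  const [a, b] = octets;
  return (
    a === 0 || // 0.0.0.0/8, "this network"
    a === 10 || // 10.0.0.0/8, private
    a === 127 || // 127.0.0.0/8, loopback
    (a === 169 && b === 254) || // 169.254.0.0/16, link-local
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12, private
    (a === 192 && b === 168) // 192.168.0.0/16, private
  );
}

function isUrlAllowed(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return false;
  // Simplification for this sketch: reject IPv6 literals outright.
  if (host.startsWith("[")) return false;

  const octets = parseIPv4(host);
  return octets === null || !isBlockedIPv4(octets);
}
```

For example, `isUrlAllowed("https://example.com/post")` passes, while `isUrlAllowed("http://169.254.169.254/")` is rejected because it falls in the link-local range.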
1000+ Voice Personas
Fish Audio provides access to over 1000 voice personas, including celebrity and character voices. Voice previews are cached server-side to avoid redundant API calls and reduce latency.
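A preview cache along these lines can be sketched as a small in-memory map with expiry. The source only says previews are cached server-side, so the time-to-live, the `TtlCache` name, and the in-memory storage are all assumptions here. On a serverless host, an in-process cache does not survive across instances, so a real deployment may well use shared storage instead.

```typescript
// Minimal in-memory cache with a time-to-live per entry.
// Hypothetical sketch: the clock is injectable so expiry can be tested.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value, or undefined if missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A route handler would call `get(voiceId)` first and only hit the Fish Audio API on a miss, storing the result with `set` afterwards.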
30+ Languages & Configurable Duration
Users can generate episodes in over 30 languages and configure the target episode duration. Gemini adjusts script length accordingly.
Stripe Subscription Tiers
Three tiers - Free (30 min/mo), Basic at $4.99/mo (120 min), and Premium at $9.99/mo (300 min).
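As a sketch, the tiers map naturally onto a small config table plus a quota check that runs before generation starts. The prices and minute allowances come from the tiers above; the names (`PLANS`, `remainingMinutes`, `canGenerate`) and the choice to check the quota up front are assumptions for illustration.

```typescript
type Tier = "free" | "basic" | "premium";

interface Plan {
  priceUsd: number; // monthly price
  minutesPerMonth: number; // audio minutes included per month
}

const PLANS: Record<Tier, Plan> = {
  free: { priceUsd: 0, minutesPerMonth: 30 },
  basic: { priceUsd: 4.99, minutesPerMonth: 120 },
  premium: { priceUsd: 9.99, minutesPerMonth: 300 },
};

// Minutes left in the current period, never negative.
function remainingMinutes(tier: Tier, usedMinutes: number): number {
  return Math.max(0, PLANS[tier].minutesPerMonth - usedMinutes);
}

// Whether an episode of the requested length fits in the remaining quota.
function canGenerate(tier: Tier, usedMinutes: number, requestedMinutes: number): boolean {
  return requestedMinutes <= remainingMinutes(tier, usedMinutes);
}
```

Keeping the limits in one table means the pricing page, the quota check, and the billing code can all read from the same source.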
Podcast Dashboard & Library
Users get a personal dashboard with their full podcast library, in-browser audio playback, and episode management. Audio files are stored in Supabase Storage buckets with row-level security policies.
Frontend
- Next.js (App Router)
- React
- TypeScript
- Tailwind CSS
Backend & Data
- PostgreSQL
- Prisma ORM
- Supabase Auth
- Next.js route handlers
- Supabase Storage (RLS)
- Stripe (subscriptions)
- Zod (server-side)
AI & Services
- Google Gemini 2.5 Flash
- Fish Audio API
- PostHog
- pdf-parse
- Cheerio (URL scraping)
- Vercel
Prompt Engineering is a Product Decision
Getting Gemini to consistently produce natural, engaging two-person dialogue was harder than I expected. The emotion tagging system went through dozens of iterations: too vague and the output sounded flat, too prescriptive and it felt robotic. I learned that prompt design is less about writing instructions and more about shaping a product experience. Small wording changes had a huge impact on the final audio quality.
Building a Full Billing System from Scratch
Integrating Stripe end-to-end - subscriptions, webhooks, usage tracking, and the billing portal - was one of the most complex parts of this project. Webhooks in particular were tricky because they can arrive out of order or more than once, which required thinking carefully about how usage resets should work and when they should trigger.
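One way to make usage resets safe against duplicate and out-of-order deliveries is sketched below. It is not Swepod's actual handler: `BillingEvent` is a hypothetical, flattened shape rather than Stripe's real event payload, and the in-memory `Set` and `Map` stand in for database tables (a unique constraint on the event ID would play the role of the `Set`).

```typescript
// Hypothetical, simplified view of a renewal event.
interface BillingEvent {
  id: string; // unique per event, used for de-duplication
  userId: string;
  periodStart: number; // start of the billing period the event refers to
}

interface UsageState {
  usedMinutes: number;
  periodStart: number;
}

type HandleResult = "reset" | "duplicate" | "stale";

class UsageResetter {
  private seenEventIds = new Set<string>();
  private usage = new Map<string, UsageState>();

  handle(event: BillingEvent): HandleResult {
    // Same event delivered twice: do nothing the second time.
    if (this.seenEventIds.has(event.id)) return "duplicate";
    this.seenEventIds.add(event.id);

    // An event for the current or an older period arrived late:
    // resetting now would wipe usage from the newer period.
    const current = this.usage.get(event.userId);
    if (current && event.periodStart <= current.periodStart) return "stale";

    this.usage.set(event.userId, { usedMinutes: 0, periodStart: event.periodStart });
    return "reset";
  }

  addUsage(userId: string, minutes: number): void {
    const current = this.usage.get(userId);
    if (current) current.usedMinutes += minutes;
  }

  usedMinutes(userId: string): number {
    const current = this.usage.get(userId);
    return current ? current.usedMinutes : 0;
  }
}
```

The key design choice is that a reset is driven by the billing period the event describes, not by the order in which events happen to arrive.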
Assembling Audio Server-Side
Working with binary buffers was new territory for me. Handling the timing between Fish Audio API calls, managing memory for longer episodes, and making sure the final file played back cleanly across different devices took a lot of iteration to get right.
Engineering Serves the Product, Not the Other Way Around
Swepod pushed me to think beyond the code. Every technical decision had a product consequence: how long generation takes affects whether users stick around, how the pricing tiers are structured affects what people are willing to pay for, and how clearly you communicate value on the landing page determines whether someone even tries the free tier. I found myself thinking as much about user motivation and retention as I did about architecture. That shift from "does this work" to "does this create value" is what I took away most from building Swepod.
The most valuable thing I built in Swepod wasn't the AI pipeline; it was an understanding of what makes a feature worth building in the first place. Good engineering gets you to launch. Product thinking is what keeps people coming back.
Try Swepod
Turn any topic, PDF, blog post, or idea into a podcast episode. No microphone required.