VibePod
A text-to-speech podcast generator powered by VibeVoice 0.5B. Paste a script, tune a couple of sliders, and get a WAV back.
Architecture
VibePod/
├── web/ Next.js 15 frontend (React 19, Tailwind CSS 4, TypeScript)
└── server/ FastAPI TTS backend (Python 3.10+, VibeVoice, UV)
The Next.js app proxies audio generation requests to the FastAPI server, keeping CORS out of the picture and the Python model off the browser.
Prerequisites
| Tool | Install |
|---|---|
| Node.js 20+ | winget install OpenJS.NodeJS.LTS |
| pnpm | npm i -g pnpm |
| Python 3.10+ | winget install Python.Python.3.13 |
| uv | winget install astral-sh.uv |
Getting started
# 1. Clone
git clone https://github.com/JezzWTF/vibepod.git
cd vibepod
# 2. Install Node dependencies (root + web workspace)
pnpm install
# 3. Copy env file and fill in values
cp .env.example .env.local
# 4. Start everything
pnpm dev # CUDA (requires NVIDIA GPU + driver >= 525.60)
pnpm dev:cpu # CPU-only (no GPU required)
pnpm dev:rocm # ROCm (requires AMD GPU + ROCm 6.2+, Linux only)
pnpm dev, pnpm dev:cpu, and pnpm dev:rocm each start both services concurrently:
- SERVER (http://localhost:8000): on first run uv creates the Python venv and downloads the ~1 GB VibeVoice model from HuggingFace
- WEB (http://localhost:3000): Next.js dev server with Turbopack
The frontend shows a loading indicator while the model downloads. Once the server reports status: online, generation is available.
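The readiness check described above can also be scripted, for example before running an automated smoke test. The following is a minimal sketch, assuming the server's /health endpoint returns a JSON object with a status field whose values are loading, online, or error. The names fetch_health and wait_until_online and the timeout defaults are hypothetical, chosen for illustration, and are not part of VibePod.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_online(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 600.0,   # hypothetical default; the model download can be slow
    interval_s: float = 2.0,    # hypothetical polling interval
    fetch: Callable[[str], dict] = fetch_health,
) -> dict:
    """Poll /health until status is "online"; raise on "error" or on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch(base_url)
        except OSError:
            # Connection refused: the server process is not listening yet.
            health = {"status": "loading"}
        if health.get("status") == "online":
            return health
        if health.get("status") == "error":
            raise RuntimeError(f"server reported an error: {health}")
        if time.monotonic() >= deadline:
            raise TimeoutError("server did not come online in time")
        time.sleep(interval_s)
```

The fetch parameter exists only so the loop can be exercised without a running server; in normal use, calling wait_until_online() with no arguments polls http://localhost:8000/health.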
CUDA vs CPU vs ROCm
VibePod maintains three completely separate Python virtual environments so torch installs never conflict:
| Mode | Command | venv | torch source |
|---|---|---|---|
| CUDA (default) | pnpm dev | server/.venv | PyTorch CUDA 12.4 index |
| CPU-only | pnpm dev:cpu | server/.venv-cpu | PyPI (CPU wheel) |
| ROCm (AMD GPU) | pnpm dev:rocm | server/.venv-rocm | PyTorch ROCm 6.2 index |
On first run, each mode creates its own venv automatically. You can switch between them freely — they are fully independent. The active device is reported by the /health endpoint as "device": "cpu", "device": "cuda", or "device": "rocm".
CUDA requirement: driver >= 525.60 (RTX 30/40 series all qualify). Run nvidia-smi to check.
ROCm requirement: ROCm 6.2+ installed on Linux. Supported GPUs: AMD RX 6000 series (RDNA2) or newer, RX 7000 series (RDNA3), and Instinct accelerators. ROCm is not supported on Windows. Flash attention is not available on ROCm — SDPA is used instead.
Individual commands
pnpm dev # CUDA — server + web
pnpm dev:cpu # CPU — server + web
pnpm dev:rocm # ROCm — server + web
pnpm dev:server # CUDA — Python server only
pnpm dev:server:cpu # CPU — Python server only
pnpm dev:server:rocm # ROCm — Python server only
pnpm dev:web # Next.js only (no Python server)
pnpm build # Production build of the frontend
Environment variables
Copy .env.example to .env.local and set:
| Variable | Default | Description |
|---|---|---|
| VIBEVOICE_SERVER_URL | http://localhost:8000 | URL the Next.js API routes use to reach the Python server |
| HF_TOKEN | (none) | HuggingFace token (required if the model repo is gated) |
| HF_HOME | (none) | Override the HuggingFace model cache directory |
Project structure
web/
├── app/
│ ├── api/generate/ Proxies POST requests to the Python server
│ ├── api/health/ Proxies health checks (status: loading | online | error)
│ ├── page.tsx Main UI — script input, controls, audio player
│ └── layout.tsx
├── components/
│ ├── Header.tsx
│ ├── TextInputPanel.tsx
│ ├── GenerationControls.tsx cfg_scale and inference_steps sliders
│ ├── AudioPlayer.tsx
│ └── StatusLog.tsx
└── hooks/
└── useAudioPlayer.ts
server/
├── vibevoice_server.py FastAPI app — /health and /generate endpoints
├── download_model.py One-shot HuggingFace model prefetch
├── start.sh Entry point: uv sync → model check → uvicorn
└── pyproject.toml Python deps managed by uv
Generation parameters
| Parameter | Range | Default | Effect |
|---|---|---|---|
| speaker | carter, davis, emma, frank, grace, mike | carter | Voice preset used for the generated audio |
| cfg_scale | 0.5 – 4.0 | 1.5 | Higher = more expressive guidance |
| inference_steps | 5 – 20 | 10 | More steps = higher quality, slower generation |
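A client that calls the server directly may want to reject out-of-range values before sending a request. The sketch below encodes the ranges and defaults from the table as a dataclass. The class name GenerationParams and the text field are hypothetical; the actual request schema of /generate is not documented here, so treat the field layout as an assumption.

```python
from dataclasses import asdict, dataclass

# Voice presets listed in the parameter table.
SPEAKERS = ("carter", "davis", "emma", "frank", "grace", "mike")


@dataclass(frozen=True)
class GenerationParams:
    text: str                   # hypothetical field name for the script
    speaker: str = "carter"
    cfg_scale: float = 1.5
    inference_steps: int = 10

    def __post_init__(self) -> None:
        # Validate against the documented ranges; raise early on bad input.
        if self.speaker not in SPEAKERS:
            raise ValueError(f"speaker must be one of {SPEAKERS}")
        if not 0.5 <= self.cfg_scale <= 4.0:
            raise ValueError("cfg_scale must be between 0.5 and 4.0")
        if not 5 <= self.inference_steps <= 20:
            raise ValueError("inference_steps must be between 5 and 20")


# Build a payload with the documented defaults.
payload = asdict(GenerationParams(text="Welcome to the show."))
```

Validating on construction keeps invalid combinations from ever reaching the network layer; asdict then yields a plain dictionary that can be serialized as JSON.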
How it works
- The user pastes a script and hits Generate
- The Next.js /api/generate route forwards the request to FastAPI on port 8000
- FastAPI runs the text through the VibeVoice streaming processor and inference model
- Audio chunks stream back to the browser as SSE events containing base64 float32 PCM
- The browser plays the chunks live, assembles a WAV Blob, and loads it into the audio player
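The last two steps can be illustrated outside the browser. The sketch below decodes base64 float32 PCM chunks and assembles them into a 16-bit WAV file in memory, which is roughly what the frontend does when it builds its Blob. Several details are assumptions, not taken from the VibePod source: little-endian samples, mono audio, a 24 kHz sample rate, and chunks arriving as bare base64 strings (the real SSE event payload may wrap them in other fields). The function names are hypothetical.

```python
import base64
import io
import struct
import wave

SAMPLE_RATE = 24_000  # assumption: the sample rate is not stated in this README


def decode_chunk(b64: str) -> list[float]:
    """Decode one base64 chunk into float32 samples (assumed little-endian)."""
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def chunks_to_wav(chunks_b64: list[str], sample_rate: int = SAMPLE_RATE) -> bytes:
    """Concatenate chunks and encode them as a mono 16-bit PCM WAV file."""
    pcm: list[int] = []
    for b64 in chunks_b64:
        for sample in decode_chunk(b64):
            clipped = max(-1.0, min(1.0, sample))  # guard against overshoot
            pcm.append(int(clipped * 32767))       # scale float to int16 range
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # assumption: mono output
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buf.getvalue()
```

Converting to 16-bit PCM is a choice made here for compatibility with the standard library's wave module, which does not write float WAV files; the browser implementation may keep float32 samples instead.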
Python dependencies
Managed by uv. The server/uv.lock is committed so installs are fully reproducible.
# Add a package
cd server && uv add <package>
# Upgrade all dependencies
cd server && uv lock --upgrade
Note: The [tool.uv.sources] block in pyproject.toml pulls torch from the PyTorch CUDA 12.4 index by default. Running with --cpu or --rocm (or uv sync --no-sources) bypasses this and installs the standard PyPI CPU wheel first; for ROCm, the torch wheel is then replaced with the PyTorch ROCm 6.2 build.
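Because the ROCm path installs a CPU wheel first and then swaps it out, it can be worth confirming which torch build actually ended up in a given venv. The following is a small sketch that relies on torch.version.cuda and torch.version.hip, which PyTorch sets to a version string or None depending on the build. The helper names classify_build and installed_build are hypothetical.

```python
from typing import Optional


def classify_build(cuda_version: Optional[str], hip_version: Optional[str]) -> str:
    """Map torch's build metadata to one of "rocm", "cuda", or "cpu"."""
    if hip_version:
        return "rocm"   # ROCm builds set torch.version.hip
    if cuda_version:
        return "cuda"   # CUDA builds set torch.version.cuda
    return "cpu"        # neither is set on CPU-only builds


def installed_build() -> str:
    """Inspect the torch package visible to the current interpreter."""
    import torch  # imported lazily so classify_build works without torch

    return classify_build(torch.version.cuda, getattr(torch.version, "hip", None))
```

Run installed_build() with the interpreter of the venv in question (for example the Python inside server/.venv-rocm); a result of "cpu" there would suggest the ROCm wheel was not installed over the CPU one.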