Overlap acoustic_decode with forward_tts_lm calls using a background ThreadPoolExecutor, hiding ~72s of decode cost behind tts_lm work. Achieved 0.67x realtime (up from 0.43x, ~56% improvement).
- vibevoice_generate_patch.py: patched generate() loop reordered to submit decode to a thread before running connector + tts_lm×2, then resolve the future. Installed as an instance method via types.MethodType so uv sync reinstalling the package cannot revert the patch.
- Dynamic INT8 quantization of Linear layers (VIBEPOD_QUANTIZE=1, default on CPU). prediction_head is excluded: its small fixed-size tensors regressed ~20% with INT8 due to pack/unpack overhead.
- Auto-detect AVX512_BF16 and load the model in bfloat16 if supported (VIBEPOD_CPU_BF16=auto, overridable with 0/1).
- CPU thread count auto-configured from the logical CPU count; OMP/MKL env vars set accordingly. Lock file preserved around uv sync --no-sources so CPU mode does not alter the shared uv.lock.
- torch.compile retained as opt-in (VIBEPOD_COMPILE=1) but marked not recommended: dynamic KV cache shapes prevent kernel reuse.
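The overlap described above can be illustrated with a minimal, standard-library-only sketch. It is not the code in vibevoice_generate_patch.py: `acoustic_decode` and `forward_tts_lm` here are hypothetical stand-ins that sleep instead of running a model, and `Model` and `patched_generate` are placeholder names chosen for the example.

```python
import time
import types
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the real model calls; sleeps mimic compute time.
def acoustic_decode(latent):
    time.sleep(0.01)
    return f"audio({latent})"


def forward_tts_lm(latent):
    time.sleep(0.01)
    return latent + 1


class Model:
    """Placeholder for the packaged model class."""


def patched_generate(self, steps):
    """Decode each step on a worker thread while the main thread runs tts_lm."""
    chunks, latent = [], 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(steps):
            future = pool.submit(acoustic_decode, latent)  # start decode first
            nxt = forward_tts_lm(forward_tts_lm(latent))   # tts_lm x2 runs meanwhile
            chunks.append(future.result())                 # resolve before next step
            latent = nxt
    return chunks


model = Model()
# Bind to this instance only, so the class shipped by the package is untouched.
model.generate = types.MethodType(patched_generate, model)
```

Because the function is attached to the instance rather than edited into the installed package, reinstalling the package leaves the patch in place. Real overlap also depends on the heavy calls releasing the GIL, which PyTorch operators generally do.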
VibePod
A text-to-speech podcast generator powered by VibeVoice 0.5B. Paste a script, tune a couple of sliders, and get a WAV back.
Architecture
VibePod/
├── web/ Next.js 15 frontend (React 19, Tailwind CSS 4, TypeScript)
└── server/ FastAPI TTS backend (Python 3.10+, VibeVoice, UV)
The Next.js app proxies audio generation requests to the FastAPI server, keeping CORS out of the picture and the Python model off the browser.
Prerequisites
| Tool | Install |
|---|---|
| Node.js 20+ | winget install OpenJS.NodeJS.LTS |
| pnpm | npm i -g pnpm |
| Python 3.10+ | winget install Python.Python.3.13 |
| uv | winget install astral-sh.uv |
Getting started
# 1. Clone
git clone https://github.com/JezzWTF/vibepod.git
cd vibepod
# 2. Install Node dependencies (root + web workspace)
pnpm install
# 3. Copy env file and fill in values
cp .env.example .env.local
# 4. Start everything (run one of the two commands below)
pnpm dev # CUDA (requires NVIDIA GPU + driver >= 525.60)
pnpm dev:cpu # CPU-only (no GPU required)
pnpm dev / pnpm dev:cpu start both services concurrently:
- SERVER (http://localhost:8000): on first run uv creates the Python venv and downloads the ~1 GB VibeVoice model from HuggingFace
- WEB (http://localhost:3000): Next.js dev server with Turbopack
The frontend shows a loading indicator while the model downloads. Once the server reports status: online, generation is available.
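For scripts that talk to the server directly, the same wait can be sketched in Python. This is a rough example under assumptions: the /health response is taken to be a JSON object with a `status` field (loading, online, or error), and the function names, polling interval, and retry count are invented for illustration.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:8000"):
    """GET /health and parse the JSON body (response shape is assumed)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_online(fetch=fetch_health, interval=2.0, max_tries=150):
    """Poll until status is 'online'; raise on 'error' or after max_tries."""
    for _ in range(max_tries):
        health = fetch()
        status = health.get("status")
        if status == "online":
            return health
        if status == "error":
            raise RuntimeError("server reported status: error")
        time.sleep(interval)  # still loading, try again
    raise TimeoutError("model did not come online in time")
```

Calling `wait_until_online()` before the first generation request avoids hitting the server while the model is still downloading. The sketch does not handle connection errors raised while the server process itself is starting.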
CUDA vs CPU
VibePod maintains two completely separate Python virtual environments so CUDA and CPU torch installs never conflict:
| Mode | Command | venv | torch source |
|---|---|---|---|
| CUDA (default) | pnpm dev | server/.venv | PyTorch CUDA 12.4 index |
| CPU-only | pnpm dev:cpu | server/.venv-cpu | PyPI (CPU wheel) |
On first run, each mode creates its own venv automatically. You can switch between them freely — they are fully independent. The active device is reported by the /health endpoint as "device": "cpu" or "device": "cuda".
CUDA requirement: driver >= 525.60 (RTX 30/40 series all qualify). Run nvidia-smi to check.
Individual commands
pnpm dev # CUDA — server + web
pnpm dev:cpu # CPU — server + web
pnpm dev:server # CUDA — Python server only
pnpm dev:server:cpu # CPU — Python server only
pnpm dev:web # Next.js only (no Python server)
pnpm build # Production build of the frontend
Environment variables
Copy .env.example to .env.local and set:
| Variable | Default | Description |
|---|---|---|
| VIBEVOICE_SERVER_URL | http://localhost:8000 | URL the Next.js API routes use to reach the Python server |
| HF_TOKEN | (none) | HuggingFace token (required if the model repo is gated) |
| HF_HOME | (none) | Override the HuggingFace model cache directory |
Project structure
web/
├── app/
│ ├── api/generate/ Proxies POST requests to the Python server
│ ├── api/health/ Proxies health checks (status: loading | online | error)
│ ├── page.tsx Main UI — script input, controls, audio player
│ └── layout.tsx
├── components/
│ ├── Header.tsx
│ ├── TextInputPanel.tsx
│ ├── GenerationControls.tsx cfg_scale and inference_steps sliders
│ ├── AudioPlayer.tsx
│ └── StatusLog.tsx
└── hooks/
└── useAudioPlayer.ts
server/
├── vibevoice_server.py FastAPI app — /health and /generate endpoints
├── download_model.py One-shot HuggingFace model prefetch
├── start.sh Entry point: uv sync → model check → uvicorn
└── pyproject.toml Python deps managed by uv
Generation parameters
| Parameter | Range | Default | Effect |
|---|---|---|---|
| speaker | carter, davis, emma, frank, grace, mike | carter | Voice preset used for the generated audio |
| cfg_scale | 0.5 – 4.0 | 1.5 | Higher = more expressive guidance |
| inference_steps | 5 – 20 | 10 | More steps = higher quality, slower generation |
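A client that builds requests by hand may want to check these ranges before sending anything. The sketch below encodes the table as a small dataclass; `GenerationParams` and `SPEAKERS` are hypothetical names for this example, and whether the server validates the same way is not stated here.

```python
from dataclasses import dataclass

SPEAKERS = ("carter", "davis", "emma", "frank", "grace", "mike")


@dataclass(frozen=True)
class GenerationParams:
    """Defaults and limits copied from the parameter table."""

    speaker: str = "carter"
    cfg_scale: float = 1.5
    inference_steps: int = 10

    def __post_init__(self):
        if self.speaker not in SPEAKERS:
            raise ValueError(f"unknown speaker: {self.speaker!r}")
        if not 0.5 <= self.cfg_scale <= 4.0:
            raise ValueError("cfg_scale must be within 0.5 to 4.0")
        if not 5 <= self.inference_steps <= 20:
            raise ValueError("inference_steps must be within 5 to 20")
```

`GenerationParams()` yields the documented defaults, while something like `GenerationParams(cfg_scale=9.0)` raises `ValueError` before any request is made.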
How it works
- The user pastes a script and hits Generate
- The Next.js /api/generate route forwards the request to FastAPI on port 8000
- FastAPI runs the text through the VibeVoice streaming processor and inference model
- Audio chunks stream back to the browser as SSE events containing base64 float32 PCM
- The browser plays the chunks live, assembles a WAV Blob, and loads it into the audio player
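The last two steps happen in the browser, but the same decoding can be sketched in Python for a script-based client. Several details are assumptions: the samples are taken to be little-endian, the 24000 Hz sample rate is a guess rather than a documented value, and the function names are invented for this example. The output is converted to 16-bit PCM because Python's `wave` module writes integer PCM only.

```python
import base64
import io
import struct
import wave


def decode_chunk(b64_pcm):
    """Decode one base64 payload into float32 samples (little-endian assumed)."""
    raw = base64.b64decode(b64_pcm)
    count = len(raw) // 4  # 4 bytes per float32 sample
    return list(struct.unpack(f"<{count}f", raw[: count * 4]))


def to_wav_bytes(samples, sample_rate=24000):
    """Pack float samples into a mono 16-bit PCM WAV (sample_rate is assumed)."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm16 = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm16)
    return buf.getvalue()
```

A client would call `decode_chunk` on each event payload as it arrives, extend one running list of samples, and call `to_wav_bytes` once the stream ends.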
Python dependencies
Managed by uv. The server/uv.lock is committed so installs are fully reproducible.
# Add a package
cd server && uv add <package>
# Upgrade all dependencies
cd server && uv lock --upgrade
Note: The [tool.uv.sources] block in pyproject.toml pulls torch from the PyTorch CUDA 12.4 index by default. Running with --cpu (or uv sync --no-sources) bypasses this and installs the standard PyPI CPU wheel instead.