vibepod/README.md
Claude bb6da662de Add AMD ROCm GPU support
Introduces a third hardware mode alongside CUDA and CPU: ROCm (AMD GPU).
AMD GPUs present as CUDA devices under PyTorch ROCm, so the existing GPU
path is reused with minimal changes — the main additions are wheel management,
device detection, and suppressing flash_attn (unsupported on ROCm).

- server/vibevoice_server.py: extend _resolve_device() to recognise 'rocm'
  (auto-detected via torch.version.hip); add _torch_device() helper that maps
  'rocm' → 'cuda' for all PyTorch API calls; apply GPU optimisations for both
  cuda and rocm in _init_model(); always use sdpa on ROCm; propagate
  _torch_device() to _load_voice_presets() map_location.
- server/start.sh: add --rocm flag; sync .venv-rocm with uv sync --no-sources
  then replace torch with the ROCm 6.2 wheel via uv pip install; set
  VIBEPOD_DEVICE=rocm for uvicorn.
- server/pyproject.toml: register pytorch-rocm62 index (explicit); add
  .venv-rocm to ruff excludes.
- package.json: add dev:rocm and dev:server:rocm scripts.
- README.md: document ROCm mode, prerequisites (RX 6000+, ROCm 6.2+, Linux),
  and new commands; expand CUDA vs CPU section to CUDA vs CPU vs ROCm.

https://claude.ai/code/session_0168pSswiaoEf6LEx6UQWfBu
2026-05-04 01:54:57 +00:00

# VibePod
A text-to-speech podcast generator powered by [VibeVoice 0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B). Paste a script, tune a couple of sliders, and get a WAV back.
## Architecture
```
VibePod/
├── web/ Next.js 15 frontend (React 19, Tailwind CSS 4, TypeScript)
└── server/ FastAPI TTS backend (Python 3.10+, VibeVoice, UV)
```
The Next.js app proxies audio generation requests to the FastAPI server, keeping CORS out of the picture and the Python model off the browser.
## Prerequisites
| Tool | Install |
| ---------------------------------- | ----------------------------------- |
| [Node.js 20+](https://nodejs.org) | `winget install OpenJS.NodeJS.LTS` |
| [pnpm](https://pnpm.io) | `npm i -g pnpm` |
| [Python 3.10+](https://python.org) | `winget install Python.Python.3.13` |
| [uv](https://docs.astral.sh/uv/) | `winget install astral-sh.uv` |
## Getting started
```bash
# 1. Clone
git clone https://github.com/JezzWTF/vibepod.git
cd vibepod
# 2. Install Node dependencies (root + web workspace)
pnpm install
# 3. Copy env file and fill in values
cp .env.example .env.local
# 4. Start everything
pnpm dev # CUDA (requires NVIDIA GPU + driver >= 525.60)
pnpm dev:cpu # CPU-only (no GPU required)
pnpm dev:rocm # ROCm (requires AMD GPU + ROCm 6.2+, Linux only)
```
`pnpm dev`, `pnpm dev:cpu`, and `pnpm dev:rocm` each start both services concurrently:
- **SERVER** — `http://localhost:8000` — on first run uv creates the Python venv and downloads the ~1 GB VibeVoice model from HuggingFace
- **WEB** — `http://localhost:3000` — Next.js dev server with Turbopack
The frontend shows a loading indicator while the model downloads. Once the server reports `status: online`, generation is available.
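
For scripts that need to wait for the model outside the browser, the same check can be sketched in Python. This is a minimal example, not code from the repository: the helper names (`fetch_health`, `wait_until_online`), the polling interval, and the assumption that `/health` returns a JSON object with a `status` field are all illustrative.

```python
import json
import time
from typing import Callable
from urllib.request import urlopen


def fetch_health(base_url: str = "http://localhost:8000") -> dict:
    """GET /health and parse the body (assumed to be a JSON object)."""
    with urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_online(
    fetch: Callable[[], dict] = fetch_health,
    interval_s: float = 2.0,
    max_attempts: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until status is 'online'; raise on 'error' or after max_attempts."""
    for _ in range(max_attempts):
        try:
            health = fetch()
        except OSError:
            # Connection refused or timed out: the server is probably still
            # creating its venv, so treat it the same as "loading".
            health = {"status": "loading"}
        if health.get("status") == "online":
            return health
        if health.get("status") == "error":
            raise RuntimeError(f"server reported an error: {health}")
        sleep(interval_s)
    raise TimeoutError("server did not come online in time")
```

Calling `wait_until_online()` with no arguments polls `http://localhost:8000/health`; passing a custom `fetch` callable makes the loop easy to exercise without a running server.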
## CUDA vs CPU vs ROCm
VibePod maintains three completely separate Python virtual environments so torch installs never conflict:
| Mode | Command | venv | torch source |
| -------------- | ---------------- | --------------------- | ------------------------- |
| CUDA (default) | `pnpm dev` | `server/.venv` | PyTorch CUDA 12.4 index |
| CPU-only | `pnpm dev:cpu` | `server/.venv-cpu` | PyPI (CPU wheel) |
| ROCm (AMD GPU) | `pnpm dev:rocm` | `server/.venv-rocm` | PyTorch ROCm 6.2 index |
On first run, each mode creates its own venv automatically. You can switch between them freely — they are fully independent. The active device is reported by the `/health` endpoint as `"device": "cpu"`, `"device": "cuda"`, or `"device": "rocm"`.
> **CUDA requirement:** driver >= 525.60 (RTX 30/40 series all qualify). Run `nvidia-smi` to check.
> **ROCm requirement:** ROCm 6.2+ installed on Linux. Supported GPUs: AMD RX 6000 series (RDNA2) and newer, including the RX 7000 series (RDNA3), plus Instinct accelerators. ROCm mode is not supported on Windows. Flash attention is not available on ROCm, so SDPA is used instead.
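
Because ROCm builds of PyTorch expose AMD GPUs through the `cuda` device API, the mode reported by `/health` and the device string handed to PyTorch can differ. The sketch below illustrates that split. It is not the code in `vibevoice_server.py`: the function names and signatures are hypothetical, and the two hardware facts are passed in as plain arguments (on a real server they would come from `torch.cuda.is_available()` and `torch.version.hip`) so the example runs without torch installed.

```python
from typing import Optional


def resolve_device(
    requested: Optional[str],
    cuda_available: bool,
    hip_version: Optional[str],
) -> str:
    """Pick the mode to report: 'cuda', 'rocm', or 'cpu'.

    `requested` stands in for an explicit override (for example an
    environment variable); None means auto-detect.
    """
    if requested in ("cuda", "rocm", "cpu"):
        return requested
    if cuda_available:
        # ROCm builds of PyTorch set torch.version.hip to a version string;
        # CUDA builds leave it as None.
        return "rocm" if hip_version else "cuda"
    return "cpu"


def torch_device(device: str) -> str:
    """Map the reported mode to the device string PyTorch expects."""
    return "cuda" if device == "rocm" else device


mode = resolve_device(None, cuda_available=True, hip_version="6.2")
print(mode, torch_device(mode))
```

Keeping the two values separate lets the health check say `rocm` while tensors and `map_location` arguments still use `cuda`.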
## Individual commands
```bash
pnpm dev # CUDA — server + web
pnpm dev:cpu # CPU — server + web
pnpm dev:rocm # ROCm — server + web
pnpm dev:server # CUDA — Python server only
pnpm dev:server:cpu # CPU — Python server only
pnpm dev:server:rocm # ROCm — Python server only
pnpm dev:web # Next.js only (no Python server)
pnpm build # Production build of the frontend
```
## Environment variables
Copy `.env.example` to `.env.local` and set:
| Variable | Default | Description |
| ---------------------- | ----------------------- | --------------------------------------------------------- |
| `VIBEVOICE_SERVER_URL` | `http://localhost:8000` | URL the Next.js API routes use to reach the Python server |
| `HF_TOKEN` | — | HuggingFace token (required if the model repo is gated) |
| `HF_HOME` | — | Override the HuggingFace model cache directory |
## Project structure
```
web/
├── app/
│ ├── api/generate/ Proxies POST requests to the Python server
│ ├── api/health/ Proxies health checks (status: loading | online | error)
│ ├── page.tsx Main UI — script input, controls, audio player
│ └── layout.tsx
├── components/
│ ├── Header.tsx
│ ├── TextInputPanel.tsx
│ ├── GenerationControls.tsx cfg_scale and inference_steps sliders
│ ├── AudioPlayer.tsx
│ └── StatusLog.tsx
└── hooks/
└── useAudioPlayer.ts
server/
├── vibevoice_server.py FastAPI app — /health and /generate endpoints
├── download_model.py One-shot HuggingFace model prefetch
├── start.sh Entry point: uv sync → model check → uvicorn
└── pyproject.toml Python deps managed by uv
```
## Generation parameters
| Parameter | Range | Default | Effect |
| ----------------- | --------------------------------------------------- | -------- | ---------------------------------------------- |
| `speaker` | `carter`, `davis`, `emma`, `frank`, `grace`, `mike` | `carter` | Voice preset used for the generated audio |
| `cfg_scale` | 0.5 to 4.0 | 1.5 | Higher = more expressive guidance |
| `inference_steps` | 5 to 20 | 10 | More steps = higher quality, slower generation |
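
A client that builds requests by hand may want to check these ranges before sending anything. The following is a small sketch using the names, defaults, and bounds from the table. The `GenerationParams` class itself is hypothetical, and treating the bounds as inclusive is an assumption.

```python
from dataclasses import dataclass

SPEAKERS = ("carter", "davis", "emma", "frank", "grace", "mike")


@dataclass(frozen=True)
class GenerationParams:
    """Generation settings with the defaults listed in the table above."""

    speaker: str = "carter"
    cfg_scale: float = 1.5
    inference_steps: int = 10

    def __post_init__(self) -> None:
        # Bounds are assumed to be inclusive at both ends.
        if self.speaker not in SPEAKERS:
            raise ValueError(f"unknown speaker: {self.speaker!r}")
        if not 0.5 <= self.cfg_scale <= 4.0:
            raise ValueError("cfg_scale must be between 0.5 and 4.0")
        if not 5 <= self.inference_steps <= 20:
            raise ValueError("inference_steps must be between 5 and 20")


params = GenerationParams(speaker="emma", cfg_scale=2.0)
print(params)
```

Validating in `__post_init__` means an out-of-range value fails at construction time rather than after a round trip to the server.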
## How it works
1. The user pastes a script and hits **Generate**
2. The Next.js `/api/generate` route forwards the request to FastAPI on port 8000
3. FastAPI runs the text through the VibeVoice streaming processor and inference model
4. Audio chunks stream back to the browser as SSE events containing base64 float32 PCM
5. The browser plays the chunks live, assembles a WAV Blob, and loads it into the audio player
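
Step 5 happens in the browser, but the same decode-and-assemble logic can be sketched in Python for a command-line client. Several details here are assumptions rather than facts from this README: the 24 kHz sample rate, mono audio, little-endian sample layout, and the helper names. Extracting the base64 payload from each SSE event is left out.

```python
import base64
import io
import struct
import wave
from typing import Iterable, List

SAMPLE_RATE = 24_000  # assumption: the README does not state the sample rate


def decode_chunk(b64_payload: str) -> List[float]:
    """Decode one base64 payload into float32 samples (little-endian assumed)."""
    raw = base64.b64decode(b64_payload)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def chunks_to_wav(chunks: Iterable[str], sample_rate: int = SAMPLE_RATE) -> bytes:
    """Concatenate chunks and encode them as a mono 16-bit PCM WAV file."""
    ints: List[int] = []
    for chunk in chunks:
        for sample in decode_chunk(chunk):
            clipped = max(-1.0, min(1.0, sample))  # keep within [-1, 1]
            ints.append(int(clipped * 32767))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono (assumed)
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(ints)}h", *ints))
    return buf.getvalue()
```

Writing the result of `chunks_to_wav(...)` to a `.wav` file gives the offline equivalent of the Blob the browser builds. Converting to 16-bit PCM is a choice made here for broad player compatibility, not something the README prescribes.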
## Python dependencies
Managed by [uv](https://docs.astral.sh/uv/). The `server/uv.lock` is committed so installs are fully reproducible.
```bash
# Add a package
cd server && uv add <package>
# Upgrade all dependencies
cd server && uv lock --upgrade
```
> **Note:** The `[tool.uv.sources]` block in `pyproject.toml` pulls torch from the PyTorch CUDA 12.4 index by default. Running with `--cpu` or `--rocm` (or `uv sync --no-sources`) bypasses this and installs the standard PyPI CPU wheel. In ROCm mode, that wheel is then replaced with the PyTorch ROCm 6.2 build via `uv pip install`.