diff --git a/AGENTS.md b/AGENTS.md
index ed18570..7ca403c 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -8,10 +8,10 @@ This file gives AI coding agents (Jules, Copilot, Claude Code, etc.) the context
 
 VibePod is a text-to-speech web app. It has two services that must both run for the app to work:
 
-| Service | Language | Entry point | Port |
-|---------|----------|-------------|------|
-| **server** | Python 3.10+ (FastAPI + VibeVoice) | `server/start.sh` | 8000 |
-| **web** | TypeScript (Next.js 15, React 19) | `pnpm --filter vibepod-web dev` | 3000 |
+| Service    | Language                           | Entry point                     | Port |
+| ---------- | ---------------------------------- | ------------------------------- | ---- |
+| **server** | Python 3.10+ (FastAPI + VibeVoice) | `server/start.sh`               | 8000 |
+| **web**    | TypeScript (Next.js 15, React 19)  | `pnpm --filter vibepod-web dev` | 3000 |
 
 The Next.js frontend proxies all model requests through its own API routes to the FastAPI server — it never calls the Python server directly from the browser.
 
@@ -51,12 +51,12 @@ pnpm build
 
 The `--cpu` flag in `start.sh` sets `VIBEPOD_DEVICE=cpu` and uses a separate venv (`server/.venv-cpu`) so CUDA and CPU installs never conflict. `vibevoice_server.py` reads `VIBEPOD_DEVICE` at startup via `_resolve_device()` — do not remove or rename that function.
 
-| Env var | Values | Set by |
-|---------|--------|--------|
-| `VIBEPOD_DEVICE` | `cpu` \| `cuda` | `server/start.sh` |
-| `UV_PROJECT_ENVIRONMENT` | `.venv-cpu` \| `.venv` | `server/start.sh` |
-| `HF_TOKEN` | HuggingFace token | Jules secret / `.env.local` |
-| `VIBEVOICE_SERVER_URL` | `http://localhost:8000` | `.env.local` |
+| Env var                  | Values                  | Set by                      |
+| ------------------------ | ----------------------- | --------------------------- |
+| `VIBEPOD_DEVICE`         | `cpu` \| `cuda`         | `server/start.sh`           |
+| `UV_PROJECT_ENVIRONMENT` | `.venv-cpu` \| `.venv`  | `server/start.sh`           |
+| `HF_TOKEN`               | HuggingFace token       | Jules secret / `.env.local` |
+| `VIBEVOICE_SERVER_URL`   | `http://localhost:8000` | `.env.local`                |
 
 ---
 
@@ -94,7 +94,9 @@ dev.sh Concurrent launcher (forwards flags to start.sh)
 ## API reference
 
 ### `GET /health`
+
 Returns server status. Safe to poll.
+
 ```json
 {
   "status": "online",
@@ -103,13 +105,17 @@ Returns server status. Safe to poll.
   "voices": ["carter", "davis", "emma", "frank", "grace", "mike"]
 }
 ```
+
 `status` values: `downloading` | `loading` | `online` | `error`
 
 ### `POST /generate`
+
 Streams audio as SSE events.
+ ```json { "text": "Hello world", "speaker": "carter", "cfg_scale": 1.5, "inference_steps": 10 } ``` + Event types: `audio_chunk` (base64 float32 PCM) | `complete` | `error` | `cancelled` --- @@ -117,12 +123,14 @@ Event types: `audio_chunk` (base64 float32 PCM) | `complete` | `error` | `cancel ## Do / Don't **Do:** + - Use `pnpm dev:cpu` in Jules — never plain `pnpm dev` - Run `git checkout server/uv.lock` if uv rewrites it during setup - Keep `_resolve_device()` in `vibevoice_server.py` — it's the CPU/CUDA switching logic - Test server changes against `GET /health` and `POST /generate` **Don't:** + - Run `uv sync` without `UV_PROJECT_ENVIRONMENT=.venv-cpu` in the Jules sandbox - Install Python packages with pip - Modify `server/uv.lock` manually diff --git a/DESIGN.md b/DESIGN.md index 2654734..42a00df 100644 --- a/DESIGN.md +++ b/DESIGN.md @@ -173,16 +173,21 @@ The shape language is a hybrid of structural precision and tactile softness. ## Components ### Card Containers + The fundamental building block of the UI. Every distinct section (Script, Player, Controls, Logs) is housed in a card featuring the `card-bg`, a 1px `border`, and `rounded-xl` corners. The internal layout always features an uppercase teal header for immediate section identification. ### Primary Action Buttons + Used for high-leverage actions like "Generate Audio" and "Play/Pause." These buttons utilize the `gradient-primary-dim` background, bold white text, and emit a soft teal glow to draw the eye and signify their importance. ### Range Sliders + Custom-styled input ranges replace default browser styles. The tracks are muted and slim, while the thumbs are bright teal, fully rounded, and emit a glow that intensifies on hover, providing a premium, tactile scrubbing experience. ### Status Indicators & Logs + A critical component of the application. Status badges utilize a minimalist pill shape with a pulsing ring animation to indicate active server processing. 
The log panel explicitly uses monospace typography and color-codes messages (green for success, red for error, white for neutral) to provide a terminal-like readout of the backend systems. ### Gradients + Gradients are used purposefully to indicate progress, activity, or brand presence. The primary gradient (`135deg` from teal to violet) is used for branding (the logo icon and text) and primary buttons. Horizontal gradients (`90deg`) are used dynamically in progress bars to represent the flow of data over time (e.g., loading, downloading, and audio generation). diff --git a/README.md b/README.md index f8202f5..ab76d3c 100644 --- a/README.md +++ b/README.md @@ -14,12 +14,12 @@ The Next.js app proxies audio generation requests to the FastAPI server, keeping ## Prerequisites -| Tool | Install | -|------|---------| -| [Node.js 20+](https://nodejs.org) | `winget install OpenJS.NodeJS.LTS` | -| [pnpm](https://pnpm.io) | `npm i -g pnpm` | +| Tool | Install | +| ---------------------------------- | ----------------------------------- | +| [Node.js 20+](https://nodejs.org) | `winget install OpenJS.NodeJS.LTS` | +| [pnpm](https://pnpm.io) | `npm i -g pnpm` | | [Python 3.10+](https://python.org) | `winget install Python.Python.3.13` | -| [uv](https://docs.astral.sh/uv/) | `winget install astral-sh.uv` | +| [uv](https://docs.astral.sh/uv/) | `winget install astral-sh.uv` | ## Getting started @@ -50,10 +50,10 @@ The frontend shows a loading indicator while the model downloads. 
Once the serve VibePod maintains two completely separate Python virtual environments so CUDA and CPU torch installs never conflict: -| Mode | Command | venv | torch source | -|------|---------|------|--------------| -| CUDA (default) | `pnpm dev` | `server/.venv` | PyTorch CUDA 12.4 index | -| CPU-only | `pnpm dev:cpu` | `server/.venv-cpu` | PyPI (CPU wheel) | +| Mode | Command | venv | torch source | +| -------------- | -------------- | ------------------ | ----------------------- | +| CUDA (default) | `pnpm dev` | `server/.venv` | PyTorch CUDA 12.4 index | +| CPU-only | `pnpm dev:cpu` | `server/.venv-cpu` | PyPI (CPU wheel) | On first run, each mode creates its own venv automatically. You can switch between them freely — they are fully independent. The active device is reported by the `/health` endpoint as `"device": "cpu"` or `"device": "cuda"`. @@ -74,11 +74,11 @@ pnpm build # Production build of the frontend Copy `.env.example` to `.env.local` and set: -| Variable | Default | Description | -|----------|---------|-------------| +| Variable | Default | Description | +| ---------------------- | ----------------------- | --------------------------------------------------------- | | `VIBEVOICE_SERVER_URL` | `http://localhost:8000` | URL the Next.js API routes use to reach the Python server | -| `HF_TOKEN` | — | HuggingFace token (required if the model repo is gated) | -| `HF_HOME` | — | Override the HuggingFace model cache directory | +| `HF_TOKEN` | — | HuggingFace token (required if the model repo is gated) | +| `HF_HOME` | — | Override the HuggingFace model cache directory | ## Project structure @@ -107,11 +107,11 @@ server/ ## Generation parameters -| Parameter | Range | Default | Effect | -|-----------|-------|---------|--------| -| `speaker` | `carter`, `davis`, `emma`, `frank`, `grace`, `mike` | `carter` | Voice preset used for the generated audio | -| `cfg_scale` | 0.5 – 4.0 | 1.5 | Higher = more expressive guidance | -| `inference_steps` | 5 – 20 | 10 | 
More steps = higher quality, slower generation | +| Parameter | Range | Default | Effect | +| ----------------- | --------------------------------------------------- | -------- | ---------------------------------------------- | +| `speaker` | `carter`, `davis`, `emma`, `frank`, `grace`, `mike` | `carter` | Voice preset used for the generated audio | +| `cfg_scale` | 0.5 – 4.0 | 1.5 | Higher = more expressive guidance | +| `inference_steps` | 5 – 20 | 10 | More steps = higher quality, slower generation | ## How it works diff --git a/pnpm-workspace.yaml b/pnpm-workspace.yaml index b1cedb5..92a7e8b 100644 --- a/pnpm-workspace.yaml +++ b/pnpm-workspace.yaml @@ -1,2 +1,2 @@ packages: - - 'web' + - "web" diff --git a/web/app/api/generate/route.ts b/web/app/api/generate/route.ts index 180c659..310bb01 100644 --- a/web/app/api/generate/route.ts +++ b/web/app/api/generate/route.ts @@ -7,7 +7,7 @@ export async function POST(request: NextRequest) { const pythonServerUrl = process.env.VIBEVOICE_SERVER_URL ?? "http://localhost:8000"; try { - const body = await request.json() as { + const body = (await request.json()) as { text: string; speaker?: string; cfg_scale?: number; @@ -41,7 +41,7 @@ export async function POST(request: NextRequest) { headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache, no-transform", - "Connection": "keep-alive", + Connection: "keep-alive", "X-Content-Type-Options": "nosniff", "X-Accel-Buffering": "no", }, diff --git a/web/app/api/health/route.ts b/web/app/api/health/route.ts index ba17edb..e4d3506 100644 --- a/web/app/api/health/route.ts +++ b/web/app/api/health/route.ts @@ -4,8 +4,7 @@ const OFFLINE_RESPONSE = { status: "offline" }; const COMMON_OPTIONS = { headers: { "Cache-Control": "no-store" } }; export async function GET() { - const pythonServerUrl = - process.env.VIBEVOICE_SERVER_URL ?? "http://localhost:8000"; + const pythonServerUrl = process.env.VIBEVOICE_SERVER_URL ?? 
"http://localhost:8000"; try { const res = await fetch(`${pythonServerUrl}/health`, { diff --git a/web/app/globals.css b/web/app/globals.css index 9388e7f..d4569ee 100644 --- a/web/app/globals.css +++ b/web/app/globals.css @@ -12,8 +12,10 @@ --muted: #64748b; --success: #22c55e; --error: #ef4444; - --font-sans: ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif; - --font-mono: ui-monospace, SFMono-Regular, "SF Mono", Menlo, Consolas, "Liberation Mono", monospace; + --font-sans: + ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif; + --font-mono: + ui-monospace, SFMono-Regular, "SF Mono", Menlo, Consolas, "Liberation Mono", monospace; } @theme inline { diff --git a/web/app/page.tsx b/web/app/page.tsx index a9f6317..128824a 100644 --- a/web/app/page.tsx +++ b/web/app/page.tsx @@ -69,19 +69,39 @@ type AppAction = function reducer(state: AppState, action: AppAction): AppState { switch (action.type) { - case "SET_SCRIPT": return { ...state, script: action.payload }; - case "SET_SPEAKER": return { ...state, speaker: action.payload }; - case "SET_CFG_SCALE": return { ...state, cfgScale: action.payload }; - case "SET_INFERENCE_STEPS": return { ...state, inferenceSteps: action.payload }; - case "SET_PREBUFFER_SECS": return { ...state, prebufferSecs: action.payload }; - case "SET_REBUFFER_THRESHOLD": return { ...state, rebufferThresholdSecs: action.payload }; - case "SET_RESUME_THRESHOLD": return { ...state, resumeThresholdSecs: action.payload }; + case "SET_SCRIPT": + return { ...state, script: action.payload }; + case "SET_SPEAKER": + return { ...state, speaker: action.payload }; + case "SET_CFG_SCALE": + return { ...state, cfgScale: action.payload }; + case "SET_INFERENCE_STEPS": + return { ...state, inferenceSteps: action.payload }; + case "SET_PREBUFFER_SECS": + return { ...state, prebufferSecs: action.payload }; + case "SET_REBUFFER_THRESHOLD": + return { ...state, rebufferThresholdSecs: 
action.payload }; + case "SET_RESUME_THRESHOLD": + return { ...state, resumeThresholdSecs: action.payload }; case "START_GENERATION": - return { ...state, isGenerating: true, audioUrl: null, logs: [], genElapsed: 0, genPct: null }; + return { + ...state, + isGenerating: true, + audioUrl: null, + logs: [], + genElapsed: 0, + genPct: null, + }; case "GEN_PROGRESS": return { ...state, genElapsed: action.elapsed, genPct: action.pct }; case "GENERATION_SUCCESS": - return { ...state, isGenerating: false, genElapsed: 0, genPct: null, audioUrl: action.payload }; + return { + ...state, + isGenerating: false, + genElapsed: 0, + genPct: null, + audioUrl: action.payload, + }; case "GENERATION_CANCELLED": case "GENERATION_ERROR": return { ...state, isGenerating: false, genElapsed: 0, genPct: null }; @@ -89,21 +109,27 @@ function reducer(state: AppState, action: AppAction): AppState { return { ...state, logs: [...state.logs, action.payload] }; case "SET_SERVER_STATUS": { const isNewConfig = !state.serverConfig && action.payload.config; - const deviceChanged = !!(state.serverConfig && action.payload.config && state.serverConfig.device !== action.payload.config.device); + const deviceChanged = !!( + state.serverConfig && + action.payload.config && + state.serverConfig.device !== action.payload.config.device + ); - const nextSteps = (isNewConfig || deviceChanged) + const nextSteps = + isNewConfig || deviceChanged ? action.payload.config!.default_inference_steps : state.inferenceSteps; - const nextPrebuffer = (isNewConfig || deviceChanged) - ? action.payload.config!.prebuffer_secs - : state.prebufferSecs; + const nextPrebuffer = + isNewConfig || deviceChanged ? action.payload.config!.prebuffer_secs : state.prebufferSecs; - const nextRebuffer = (isNewConfig || deviceChanged) + const nextRebuffer = + isNewConfig || deviceChanged ? 
action.payload.config!.rebuffer_threshold_secs : state.rebufferThresholdSecs; - const nextResume = (isNewConfig || deviceChanged) + const nextResume = + isNewConfig || deviceChanged ? action.payload.config!.resume_threshold_secs : state.resumeThresholdSecs; @@ -121,7 +147,8 @@ function reducer(state: AppState, action: AppAction): AppState { resumeThresholdSecs: nextResume, }; } - default: return state; + default: + return state; } } @@ -213,7 +240,10 @@ export default function HomePage() { } poll(); - return () => { cancelled = true; clearTimeout(timeoutId); }; + return () => { + cancelled = true; + clearTimeout(timeoutId); + }; }, []); const handleGenerate = useCallback(async () => { @@ -241,7 +271,6 @@ export default function HomePage() {
- {/* Left: script + audio player */}
dispatch({ type: "SET_CFG_SCALE", payload: v })} inferenceSteps={state.inferenceSteps} onInferenceStepsChange={(v) => dispatch({ type: "SET_INFERENCE_STEPS", payload: v })} - prebufferSecs={state.prebufferSecs} - onPrebufferSecsChange={(v) => dispatch({ type: "SET_PREBUFFER_SECS", payload: v })} - rebufferThresholdSecs={state.rebufferThresholdSecs} - onRebufferThresholdChange={(v) => dispatch({ type: "SET_REBUFFER_THRESHOLD", payload: v })} - resumeThresholdSecs={state.resumeThresholdSecs} - onResumeThresholdChange={(v) => dispatch({ type: "SET_RESUME_THRESHOLD", payload: v })} + prebufferSecs={state.prebufferSecs} + onPrebufferSecsChange={(v) => dispatch({ type: "SET_PREBUFFER_SECS", payload: v })} + rebufferThresholdSecs={state.rebufferThresholdSecs} + onRebufferThresholdChange={(v) => + dispatch({ type: "SET_REBUFFER_THRESHOLD", payload: v }) + } + resumeThresholdSecs={state.resumeThresholdSecs} + onResumeThresholdChange={(v) => + dispatch({ type: "SET_RESUME_THRESHOLD", payload: v }) + } onGenerate={handleGenerate} onStop={stop} onPauseStream={pauseStream} @@ -281,7 +314,6 @@ export default function HomePage() { />
-
diff --git a/web/components/AudioPlayer.tsx b/web/components/AudioPlayer.tsx index f54f25e..36ca5b1 100644 --- a/web/components/AudioPlayer.tsx +++ b/web/components/AudioPlayer.tsx @@ -14,15 +14,8 @@ function formatTime(seconds: number): string { } export default function AudioPlayer({ audioUrl }: AudioPlayerProps) { - const { - isPlaying, - currentTime, - duration, - volume, - toggle, - seek, - setVolume, - } = useAudioPlayer(audioUrl); + const { isPlaying, currentTime, duration, volume, toggle, seek, setVolume } = + useAudioPlayer(audioUrl); if (!audioUrl) return null; @@ -56,12 +49,10 @@ export default function AudioPlayer({ audioUrl }: AudioPlayerProps) { background: "rgba(45, 212, 191, 0.05)", }} onMouseEnter={(e) => { - (e.currentTarget as HTMLButtonElement).style.background = - "rgba(45, 212, 191, 0.15)"; + (e.currentTarget as HTMLButtonElement).style.background = "rgba(45, 212, 191, 0.15)"; }} onMouseLeave={(e) => { - (e.currentTarget as HTMLButtonElement).style.background = - "rgba(45, 212, 191, 0.05)"; + (e.currentTarget as HTMLButtonElement).style.background = "rgba(45, 212, 191, 0.05)"; }} > {isPlaying ? ( - + ) : ( - + )} @@ -143,9 +125,7 @@ export default function AudioPlayer({ audioUrl }: AudioPlayerProps) { {/* Duration info */}
- - {formatTime(currentTime)} - + {formatTime(currentTime)} / {formatTime(duration)}
diff --git a/web/components/GenerationControls.tsx b/web/components/GenerationControls.tsx index 41bce72..373fe0d 100644 --- a/web/components/GenerationControls.tsx +++ b/web/components/GenerationControls.tsx @@ -36,18 +36,27 @@ const STATUS_CONFIG: Record< Exclude, { color: string; label: (p: DownloadProgress | null) => string } > = { - offline: { color: "var(--error)", label: () => "Server offline — waiting for connection..." }, - downloading: { color: "#60a5fa", label: (p) => p && p.total > 0 ? `Downloading model... (${p.done} / ${p.total} files)` : "Downloading model (~1 GB)..." }, - loading: { color: "#fbbf24", label: () => "Loading model into memory..." }, - error: { color: "var(--error)", label: () => "Server error — check the terminal for details." }, + offline: { color: "var(--error)", label: () => "Server offline — waiting for connection..." }, + downloading: { + color: "#60a5fa", + label: (p) => + p && p.total > 0 + ? `Downloading model... (${p.done} / ${p.total} files)` + : "Downloading model (~1 GB)...", + }, + loading: { color: "#fbbf24", label: () => "Loading model into memory..." }, + error: { color: "var(--error)", label: () => "Server error — check the terminal for details." }, }; - function SpinnerIcon() { return ( - + ); } @@ -146,7 +155,10 @@ export default function GenerationControls({ onChange={(e) => onCfgScaleChange(parseFloat(e.target.value))} className="w-full" /> -
+
Flat (0.5) CFG Scale Expressive (4.0) @@ -176,7 +188,10 @@ export default function GenerationControls({ className="w-full" style={{ "--thumb-color": "var(--accent-violet)" } as React.CSSProperties} /> -
+
Faster (5) Diffusion Steps Better (20) @@ -207,7 +222,11 @@ export default function GenerationControls({
{showAdvanced && ( -
+
{/* Pre-buffer */}
@@ -232,7 +251,11 @@ export default function GenerationControls({ {/* Re-buffer threshold */}
-