# VibePod Studio — Build Plan

**Version:** 1.0
**Authors:** LyAhn + Claude (Anthropic) + Codex
**Date:** 2026-05-02
**Status:** Active

---

## Table of Contents

1. [Product Goal](#1-product-goal)
2. [Current Architecture](#2-current-architecture)
3. [Locked Technical Decisions](#3-locked-technical-decisions)
4. [Non-Goals](#4-non-goals)
5. [Data Models](#5-data-models)
6. [Storage Layout](#6-storage-layout)
7. [API Contract](#7-api-contract)
8. [Frontend Routes](#8-frontend-routes)
9. [Component Hierarchy](#9-component-hierarchy)
10. [Timeline Rendering Model](#10-timeline-rendering-model)
11. [Playback Model](#11-playback-model)
12. [Export Rendering Model](#12-export-rendering-model)
13. [Phase Breakdown](#13-phase-breakdown)
14. [Acceptance Criteria](#14-acceptance-criteria)
15. [Deferred Decisions](#15-deferred-decisions)
16. [Third-Party Library Policy](#16-third-party-library-policy)

---

## 1. Product Goal

VibePod becomes a **script-aware AI podcast creation studio** where users can generate, arrange, edit, regenerate, and export podcast episodes from written scripts.

The headline feature that differentiates VibePod from generic audio editors:

> Every generated clip knows where it came from in the script and can be regenerated, replaced, or compared — without touching the original audio file.

The existing fast-generation page stays as the quick path. Studio grows alongside it.

---

## 2. Current Architecture

### What exists today

| Layer | Details |
|---|---|
| Frontend | Next.js 15 · React 19 · TypeScript 5 · Tailwind CSS 4 · App Router |
| Backend | FastAPI · uvicorn · Python 3.10+ |
| Model | VibeVoice Realtime 0.5B via JezzWTF fork |
| Audio format | 24 kHz · float32 PCM · streamed as SSE · assembled to WAV in browser |
| State management | React `useReducer` in `page.tsx` — no Zustand, no Redux |
| Persistence | None — generated audio is an ephemeral browser Blob |
| Packages | pnpm workspace (frontend) · uv (Python) |

### Current generation flow

```
User submits script
  → POST /api/generate (Next.js proxy)
  → POST /generate (FastAPI)
  → VibeVoice inference thread
  → SSE stream of base64 float32 PCM chunks
  → Browser decodes chunks, adaptive buffering, live Web Audio playback
  → On complete: chunks merged, RIFF/WAV header prepended, Blob URL created
  → User can play back or download the WAV
```

### Key files

```
web/
  app/
    page.tsx                     main generation UI (AppState via useReducer)
    api/generate/route.ts        SSE proxy to FastAPI
    api/health/route.ts          health check proxy
  components/
    Header.tsx
    TextInputPanel.tsx
    AudioPlayer.tsx
    GenerationControls.tsx
    StatusLog.tsx
  hooks/
    useStreamingGeneration.ts    core streaming + WAV assembly
    useAudioPlayer.ts            HTML5 audio element wrapper
server/
  vibevoice_server.py            entire FastAPI app (972 lines)
  start.sh                       launcher (CPU/CUDA detection, uv sync, uvicorn)
  download_model.py              HuggingFace prefetch
docs/
  studio-build-plan.md           this file
  roadmap.md                     high-level phase vision
  DESIGN.md                      brand + design system (colours, type, spacing)
  AGENTS.md                      AI agent / CI guide
```

---

## 3. Locked Technical Decisions

These decisions are final and must not be revisited without explicit agreement. New phases build on them.
### 3.1 Rendering approach — Hybrid DOM + Canvas 2D

The Studio uses a **hybrid rendering model**:

| Layer | Technology |
|---|---|
| App shell, layout, sidebars, panels, inspector | React + Tailwind CSS (DOM) |
| Track headers, controls, transport, modals | React + Tailwind CSS (DOM) |
| Timeline clip containers + positioning | React + CSS (`left`/`width` from time → pixels) |
| Waveform rendering inside clips | Raw Canvas 2D |
| Timeline ruler | Raw Canvas 2D |
| Playhead overlay | Raw Canvas 2D |
| Browser playback preview | Web Audio API |
| Final render + export | Python + FFmpeg (server-side only) |

**Why not full canvas:** Every non-waveform element — buttons, text, inputs, scroll, keyboard focus, accessibility — works better in DOM. Reimplementing all of that in canvas is wasted effort.

**Why not pure CSS:** Waveform peaks are thousands of pixel-height values per clip. DOM representation would be extremely slow. Canvas draws them in a tight loop in milliseconds.

**Why not WaveSurfer.js as the core:** WaveSurfer owns playback and its own event model. VibePod Studio needs its own clip model, its own timeline, and eventually its own multi-track playback. Adapting around WaveSurfer's assumptions creates friction. It may be used for standalone audio preview components (e.g., the generation page player), not the Studio timeline.

**Why not Konva.js:** Overkill for v1. Konva is designed for fully canvas-based scenes (whiteboards, diagrams). VibePod's timeline is mostly DOM. The additional mental model (Stage/Layer/Group/Transformer) is not justified unless interaction complexity grows significantly beyond v1.

### 3.2 Frontend stack — No additions without justification

Build inside the existing stack. New packages require a written reason in this document.
**Approved additions (to be installed when their phase begins):**

| Package | Purpose | Phase |
|---|---|---|
| `zustand` | Studio editor state | Phase 2 |
| `@dnd-kit/core` + `@dnd-kit/utilities` | Clip drag-and-drop in timeline | Phase 2 |
| `better-sqlite3` | SQLite for project/studio persistence (Next.js layer) | Phase 2 |

**Conditionally approved (evaluate at phase start):**

| Package | Purpose | Condition |
|---|---|---|
| `framer-motion` | Clip move animations | Only if DnD-kit transitions feel rough after prototype |
| `@radix-ui/react-*` | Accessible modal/dropdown primitives | Only if building custom is taking too long |

### 3.3 Backend stack — Python + FFmpeg for all rendering

Browser-side audio mixing is only for **preview**. Export always goes to the Python backend.

**Approved backend additions:**

| Package | Purpose | Phase |
|---|---|---|
| `soundfile` | Already present — WAV read/write | Phase 1 |
| `numpy` | Audio array manipulation | Phase 1 |
| `pydub` | Audio trimming, mixing, concatenation | Phase 2 |
| `pyloudnorm` | Loudness normalisation (LUFS) | Phase 5 |

FFmpeg must be available on the server host. The render endpoint assumes `ffmpeg` is on PATH.

### 3.4 State management — Zustand for Studio, useReducer stays on generation page

The existing generation page uses `useReducer` and works well. Do not refactor it.

Studio requires a shared store that multiple components read and write (timeline, inspector, transport, script panel). Zustand is the right tool. It is lightweight, does not require providers, and handles editor-style state (undo stacks, selection, playhead) cleanly.

### 3.5 Database — SQLite from Phase 1

Do not start with flat JSON files. SQLite is still a single file, requires no server process, and gives proper queries, transactions, and schema migrations from day one.

Use `better-sqlite3` in the Next.js API layer. Schema lives in `web/lib/db/schema.sql`. Migrations are numbered SQL files in `web/lib/db/migrations/`.
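As an illustration of how numbered migration files could be ordered before they are applied, here is a minimal sketch. The file naming pattern (`001_init.sql` style), the `Migration` type, and both function names are assumptions for this example, not part of the plan. The current schema version is passed in as a plain number; one option would be to track it with SQLite's `PRAGMA user_version`, but that choice is also an assumption.

```typescript
// Hypothetical convention: "<number>_<description>.sql", e.g. "001_init.sql".
interface Migration {
  version: number;
  fileName: string;
}

// Parse one file name; returns null for files that do not match the convention.
function parseMigration(fileName: string): Migration | null {
  const match = /^(\d+)_.+\.sql$/.exec(fileName);
  if (match === null) return null;
  const digits = match[1];
  if (digits === undefined) return null;
  return { version: Number.parseInt(digits, 10), fileName };
}

// Return the migrations newer than appliedVersion, in ascending version order.
// Throws if two files share a version number, since their order would be ambiguous.
function pendingMigrations(fileNames: string[], appliedVersion: number): Migration[] {
  const parsed = fileNames
    .map(parseMigration)
    .filter((m): m is Migration => m !== null)
    .sort((a, b) => a.version - b.version);

  const seen = new Set<number>();
  for (const m of parsed) {
    if (seen.has(m.version)) {
      throw new Error(`Duplicate migration version ${m.version}: ${m.fileName}`);
    }
    seen.add(m.version);
  }
  return parsed.filter((m) => m.version > appliedVersion);
}
```

Sorting by the parsed integer rather than by file name keeps `010_...` after `002_...` even if zero padding is inconsistent. Each returned file would then be executed inside a transaction by whatever database layer the project uses.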
### 3.6 Audio sample rate

All generated audio is 24 kHz float32 mono (VibeVoice output). Studio renders at 44.1 kHz stereo WAV or 48 kHz for podcast MP3 export. The render pipeline handles resampling.

---

## 4. Non-Goals

These will not be built and must not creep in:

- **Real-time collaborative editing** — single-user per project only
- **Cloud sync or user accounts** — local-first, no auth system
- **MIDI or music composition** — audio clips only, no MIDI tracks
- **Plugin system** — no third-party audio plugin API
- **Browser-side FFmpeg (ffmpeg.wasm)** — all rendering is server-side
- **Mobile / responsive Studio layout** — Studio targets desktop viewport only
- **Offline PWA** — the server must be running; no service worker caching of model output
- **Real-time voice cloning** — out of scope until VibeVoice supports it cleanly
- **Exporting to streaming platforms** — export to file only; no Spotify/Apple Podcasts upload

---

## 5. Data Models

### 5.1 Generation job

```ts
type GenerationJob = {
  id: string;                  // "gen_"
  createdAt: string;           // ISO 8601
  status: "pending" | "generating" | "complete" | "error" | "cancelled";
  script: string;
  speaker: string;
  cfgScale: number;
  inferenceSteps: number;
  durationSecs: number | null; // set on complete
  sampleRate: number;          // always 24000
  audioPath: string | null;    // relative to data/generations//audio.wav
  waveformPath: string | null; // relative to data/generations//waveform.json
  errorMessage: string | null;
};
```

### 5.2 Studio project

```ts
type StudioProject = {
  id: string; // "proj_"
  name: string;
  createdAt: string;
  updatedAt: string;
  script: ScriptDocument;
  assets: AudioAsset[];
  tracks: Track[];
  edits: EditOperation[];
  renderSettings: RenderSettings;
};
```

### 5.3 Script document

```ts
type ScriptDocument = {
  blocks: ScriptBlock[];
};

type ScriptBlock = {
  id: string; // "block_"
  speakerId: string;
  text: string;
  order: number;
  generatedAssetId: string | null;
  timelineClipIds: string[];
};
```

### 5.4 Audio asset

```ts
type AudioAsset = {
  id: string;               // "asset_"
  projectId: string | null; // null = generation library asset
  kind: "generated_voice" | "upload" | "music" | "sfx" | "render";
  filePath: string;
  durationSecs: number;
  sampleRate: number;
  channels: number;
  waveformPath: string | null;
  source: {
    generationJobId?: string;
    scriptBlockId?: string;
    providerId?: string;
    modelId?: string;
    voiceId?: string;
    settings?: Record<string, unknown>;
  } | null;
};
```

### 5.5 Track

```ts
type Track = {
  id: string; // "track_"
  name: string;
  type: "voice" | "music" | "sfx" | "ambience" | "master";
  order: number;
  muted: boolean;
  solo: boolean;
  gainDb: number;
  clips: TimelineClip[];
};
```

### 5.6 Timeline clip

```ts
type TimelineClip = {
  id: string;          // "clip_"
  assetId: string;
  trackId: string;
  startTime: number;   // seconds from timeline origin
  sourceStart: number; // trim start within source asset (seconds)
  sourceEnd: number;   // trim end within source asset (seconds)
  gainDb: number;
  fadeInMs: number;
  fadeOutMs: number;
  linkedScriptRange: {
    blockId: string;
    startChar: number;
    endChar: number;
  } | null;
};
```

### 5.7 Edit operation (non-destructive EDL)

```ts
type EditOperation =
  | { type: "split"; clipId: string; at: number }
  | { type: "trim"; clipId: string; sourceStart: number; sourceEnd: number }
  | { type: "move"; clipId: string; startTime: number; trackId: string }
  | { type: "gain"; clipId: string; gainDb: number }
  | { type: "fade"; clipId: string; fadeInMs: number; fadeOutMs: number }
  | { type: "delete"; clipId: string }
  | { type: "mute"; trackId: string; muted: boolean }
  | { type: "solo"; trackId: string; solo: boolean };
```

### 5.8 Take (regeneration history)

```ts
type Take = {
  id: string;            // "take_"
  scriptBlockId: string;
  assetId: string;
  voiceId: string;
  modelId: string;
  settings: Record<string, unknown>;
  createdAt: string;
  rating: number | null; // 1-5 stars, optional
  notes: string | null;
  isActive: boolean;     // true = the one placed on the timeline
};
```

### 5.9 Render settings

```ts
type RenderSettings = {
  format: "wav" | "mp3";
  sampleRate: 44100 | 48000;
  bitrate: number | null; // kbps, null for WAV
  normaliseLoudness: boolean;
  lufsTarget: number;     // default -16 LUFS for podcast
  metadata: {
    title: string;
    artist: string;
    album: string;
    episodeNumber: number | null;
    description: string;
  } | null;
};
```

### 5.10 Waveform peaks

```ts
type WaveformPeaks = {
  sampleRate: number;
  durationSecs: number;
  channels: number;
  samplesPerPixel: number;
  length: number;
  data: {
    min: number[]; // range -1.0 to 0.0
    max: number[]; // range 0.0 to 1.0
  };
};
```

---

## 6. Storage Layout

```
data/
  generations/
    gen_/
      audio.wav          raw float32 WAV at 24 kHz
      waveform.json      WaveformPeaks at 256 samples/pixel
      metadata.json      GenerationJob fields (denormalised)
  projects/
    proj_/
      project.json       full StudioProject serialised
      assets/
        asset_.wav       uploaded or imported audio
      renders/
        render_.wav      exported renders
        render_.mp3
  db/
    vibepod.db           SQLite database
```

The SQLite database is the source of truth for IDs, status, and relationships. Files on disk are the source of truth for audio (WAV) and waveform (JSON) data.

---

## 7. API Contract

All new routes are under `/api/`. The Next.js app proxies to FastAPI only for generation and health. Persistence routes are handled directly by Next.js API routes talking to SQLite.
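The split described above (two proxied routes, everything else served from the Next.js layer) can be written down as a small lookup. This is only a sketch of that one sentence: the `RouteTarget` type, the `PROXIED_PATHS` list, and `routeTarget` are hypothetical names, and the sketch does not model how later phases might reach Python.

```typescript
// Where a request under /api/ is ultimately served from (illustrative names).
type RouteTarget = "fastapi" | "sqlite";

// Per the paragraph above, only generation and health are proxied to FastAPI.
const PROXIED_PATHS: readonly string[] = ["/api/generate", "/api/health"];

function routeTarget(pathname: string): RouteTarget {
  // Strip trailing slashes so "/api/health/" and "/api/health" match.
  const normalised = pathname.replace(/\/+$/, "");
  // Exact matching matters: "/api/generations" must not match "/api/generate".
  return PROXIED_PATHS.includes(normalised) ? "fastapi" : "sqlite";
}
```

Exact path comparison is used instead of a prefix check because the library route `/api/generations` shares a prefix with the proxied `/api/generate`.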
### 7.1 Generation (existing, extended)

```
POST   /api/generate    start streaming generation (existing)
GET    /api/health      server health check (existing)
```

### 7.2 Generation library (Phase 1)

```
GET    /api/generations                 list all jobs, newest first
                                        query: ?limit=20&offset=0&status=complete
GET    /api/generations/:id             get single job metadata
GET    /api/generations/:id/audio       stream WAV file
GET    /api/generations/:id/waveform    get WaveformPeaks JSON
DELETE /api/generations/:id             delete job and files
```

### 7.3 Projects (Phase 2)

```
POST   /api/projects        create project
GET    /api/projects        list projects
GET    /api/projects/:id    get project with full StudioProject
PUT    /api/projects/:id    save/autosave project
DELETE /api/projects/:id    delete project and assets
```

### 7.4 Project assets (Phase 2)

```
POST   /api/projects/:id/assets         upload audio file or import from generation
GET    /api/projects/:id/assets/:aid    get asset metadata
DELETE /api/projects/:id/assets/:aid    remove asset
```

### 7.5 Takes (Phase 3)

```
GET    /api/projects/:id/takes/:blockId            list takes for a script block
POST   /api/projects/:id/takes/:blockId            save new take
PUT    /api/projects/:id/takes/:blockId/:takeId    set active take
DELETE /api/projects/:id/takes/:blockId/:takeId    delete take
```

### 7.6 Render (Phase 5)

```
POST   /api/projects/:id/render                   start render job (sends EDL to Python)
GET    /api/projects/:id/renders                  list render history
GET    /api/projects/:id/renders/:rid             poll render status
GET    /api/projects/:id/renders/:rid/download    stream rendered file
```

All error responses follow:

```json
{ "error": "human-readable message", "code": "SNAKE_CASE_CODE" }
```

---

## 8. Frontend Routes

```
/                     generation page (current fast path, unchanged)
/library              generation library — browse and replay saved jobs
/projects             project dashboard — create or open Studio projects
/studio/:projectId    Studio workspace
/studio/new           redirect: creates project + navigates to /studio/:id
```

---

## 9. Component Hierarchy

### Generation page (existing — do not restructure without reason)

```
page.tsx
  Header
  TextInputPanel
  GenerationControls
  StatusLog
  AudioPlayer
```

### Library page (Phase 1)

```
/app/library/page.tsx
  Header
  LibraryPage
    GenerationCard[]
      WaveformPreview       (small static canvas render of peaks)
      GenerationMetadata    (speaker, duration, date, settings)
      GenerationActions     (play, download, open in studio, delete)
```

### Projects dashboard (Phase 2)

```
/app/projects/page.tsx
  Header
  ProjectsPage
    NewProjectButton
    ProjectCard[]
      ProjectThumbnail
      ProjectMetadata
      ProjectActions
```

### Studio workspace (Phase 2+)

```
/app/studio/[projectId]/page.tsx
  StudioShell
    StudioTopBar
      ProjectNameInput
      UndoButton / RedoButton
      SaveStatus
      ExportButton
    StudioBody
      ProjectSidebar
        SidebarTabs           (Script | Voices | Media | Effects | Templates)
        ScriptPanel           (Phase 3)
        VoicesPanel
        MediaBin              (Phase 2)
        EffectsPanel          (Phase 4)
      StudioMain
        TimelineRuler         (canvas)
        TimelineArea
          TimelineTrack[]     (one per track)
            TrackHeader
            TrackClipArea
              TimelineClip[]
                WaveformCanvas    (canvas)
                ClipLabel
      InspectorPanel
        ClipInspector         (when clip selected)
          ClipMetadata
          GainControl
          FadeControls
          RegenerateButton    (Phase 3)
          TakeStack           (Phase 3)
        TrackInspector        (when track selected)
        EmptyInspector        (nothing selected)
    TransportBar
      PlayPauseButton
      StopButton
      PlayheadTimeDisplay
      ZoomControls
      SnapToggle
```

---

## 10. Timeline Rendering Model

### Clip positioning

Timeline clips are DOM elements with CSS `position: absolute`. Position and size derive from the project's pixels-per-second zoom level:

```ts
const left = clip.startTime * pixelsPerSecond;
const width = (clip.sourceEnd - clip.sourceStart) * pixelsPerSecond;
```

The `pixelsPerSecond` value lives in Zustand and changes with zoom. All clip positions recompute via derived selectors.

### Waveform canvas

Each `TimelineClip` contains a `<canvas>` element that receives peak data as a prop. The renderer draws one vertical line per pixel column:

```ts
function drawWaveform(
  ctx: CanvasRenderingContext2D,
  peaks: WaveformPeaks,
  width: number,
  height: number,
  color: string
): void {
  const midY = height / 2;
  ctx.clearRect(0, 0, width, height);
  ctx.strokeStyle = color;
  ctx.lineWidth = 1;
  for (let x = 0; x < width; x++) {
    // Map the pixel column to the nearest precomputed peak bucket.
    const peakIndex = Math.floor((x / width) * peaks.length);
    // Canvas y grows downward: negative minima land below the midline,
    // positive maxima above it.
    const minY = midY - peaks.data.min[peakIndex] * midY;
    const maxY = midY - peaks.data.max[peakIndex] * midY;
    ctx.beginPath();
    ctx.moveTo(x + 0.5, minY);
    ctx.lineTo(x + 0.5, maxY);
    ctx.stroke();
  }
}
```

The canvas rerenders when `width`, `peaks`, or zoom changes. It does not rerender on playback.

### Playhead

The playhead is a separate absolutely-positioned element (thin vertical line) that overlays the entire timeline area. Its `left` position is updated via `requestAnimationFrame` during playback — no React state updates, direct DOM style write.

### Timeline ruler

A `<canvas>` element spanning the full timeline width. Draws time markers at intervals derived from current zoom level. Rerenders on zoom change only.

---

## 11. Playback Model

Studio preview uses the **Web Audio API**, not HTML5 `