From 47e0c7e512f025355b44635b1f4dd39d6266c6b6 Mon Sep 17 00:00:00 2001 From: LyAhn Date: Sat, 2 May 2026 17:24:45 +0100 Subject: [PATCH] chore(phase-0): stabilise foundation for Studio build MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Extract WAV assembly (buildWav, mergeFloat32Arrays, decodeFloat32Chunk, SAMPLE_RATE) into web/lib/audio/wav.ts so it can be reused by the Studio playback engine and library waveform previews - Add server/waveform.py with compute_peaks() / write_peaks() — reads any WAV, mixes to mono, returns min/max peak arrays matching the WaveformPeaks TypeScript type - Add server/ids.py with prefixed URL-safe ID helpers (gen_id, proj_id, asset_id, etc.) using stdlib secrets — no new dependency - Add docs/studio-build-plan.md — full execution spec covering stack decisions, data models, API contract, component hierarchy, phase breakdown and acceptance criteria - Ignore data/ directory (generated audio, waveforms, SQLite DB) --- .gitignore | 4 + docs/studio-build-plan.md | 925 ++++++++++++++++++++++++++++ server/ids.py | 35 ++ server/waveform.py | 77 +++ web/hooks/useStreamingGeneration.ts | 49 +- web/lib/audio/wav.ts | 56 ++ 6 files changed, 1098 insertions(+), 48 deletions(-) create mode 100644 docs/studio-build-plan.md create mode 100644 server/ids.py create mode 100644 server/waveform.py create mode 100644 web/lib/audio/wav.ts diff --git a/.gitignore b/.gitignore index 040007f..47a2110 100644 --- a/.gitignore +++ b/.gitignore @@ -20,8 +20,12 @@ web/tsconfig.tsbuildinfo web/next-env.d.ts web/node_modules/ +# Generated data (audio files, waveforms, SQLite DB) +data/ + # OS .DS_Store Thumbs.db .vscode/settings.json .claude/settings.local.json +*.txt diff --git a/docs/studio-build-plan.md b/docs/studio-build-plan.md new file mode 100644 index 0000000..210a04a --- /dev/null +++ b/docs/studio-build-plan.md @@ -0,0 +1,925 @@ +# VibePod Studio — Build Plan + +**Version:** 1.0 +**Authors:** LyAhn + Claude 
(Anthropic) + Codex +**Date:** 2026-05-02 +**Status:** Active + +--- + +## Table of Contents + +1. [Product Goal](#1-product-goal) +2. [Current Architecture](#2-current-architecture) +3. [Locked Technical Decisions](#3-locked-technical-decisions) +4. [Non-Goals](#4-non-goals) +5. [Data Models](#5-data-models) +6. [Storage Layout](#6-storage-layout) +7. [API Contract](#7-api-contract) +8. [Frontend Routes](#8-frontend-routes) +9. [Component Hierarchy](#9-component-hierarchy) +10. [Timeline Rendering Model](#10-timeline-rendering-model) +11. [Playback Model](#11-playback-model) +12. [Export Rendering Model](#12-export-rendering-model) +13. [Phase Breakdown](#13-phase-breakdown) +14. [Acceptance Criteria](#14-acceptance-criteria) +15. [Deferred Decisions](#15-deferred-decisions) +16. [Third-Party Library Policy](#16-third-party-library-policy) + +--- + +## 1. Product Goal + +VibePod becomes a **script-aware AI podcast creation studio** where users can generate, arrange, edit, regenerate, and export podcast episodes from written scripts. + +The headline feature that differentiates VibePod from generic audio editors: + +> Every generated clip knows where it came from in the script and can be regenerated, replaced, or compared — without touching the original audio file. + +The existing fast-generation page stays as the quick path. Studio grows alongside it. + +--- + +## 2. 
Current Architecture + +### What exists today + +| Layer | Details | +|---|---| +| Frontend | Next.js 15 · React 19 · TypeScript 5 · Tailwind CSS 4 · App Router | +| Backend | FastAPI · uvicorn · Python 3.10+ | +| Model | VibeVoice Realtime 0.5B via JezzWTF fork | +| Audio format | 24 kHz · float32 PCM · streamed as SSE · assembled to WAV in browser | +| State management | React `useReducer` in `page.tsx` — no Zustand, no Redux | +| Persistence | None — generated audio is an ephemeral browser Blob | +| Packages | pnpm workspace (frontend) · uv (Python) | + +### Current generation flow + +``` +User submits script + → POST /api/generate (Next.js proxy) + → POST /generate (FastAPI) + → VibeVoice inference thread + → SSE stream of base64 float32 PCM chunks + → Browser decodes chunks, adaptive buffering, live Web Audio playback + → On complete: chunks merged, RIFF/WAV header prepended, Blob URL created + → User can play back or download the WAV +``` + +### Key files + +``` +web/ + app/ + page.tsx main generation UI (AppState via useReducer) + api/generate/route.ts SSE proxy to FastAPI + api/health/route.ts health check proxy + components/ + Header.tsx + TextInputPanel.tsx + AudioPlayer.tsx + GenerationControls.tsx + StatusLog.tsx + hooks/ + useStreamingGeneration.ts core streaming + WAV assembly + useAudioPlayer.ts HTML5 audio element wrapper + +server/ + vibevoice_server.py entire FastAPI app (972 lines) + start.sh launcher (CPU/CUDA detection, uv sync, uvicorn) + download_model.py HuggingFace prefetch + +docs/ + studio-build-plan.md this file + +roadmap.md high-level phase vision +DESIGN.md brand + design system (colours, type, spacing) +AGENTS.md AI agent / CI guide +``` + +--- + +## 3. Locked Technical Decisions + +These decisions are final and must not be revisited without explicit agreement. New phases build on them. 
+ +### 3.1 Rendering approach — Hybrid DOM + Canvas 2D + +The Studio uses a **hybrid rendering model**: + +| Layer | Technology | +|---|---| +| App shell, layout, sidebars, panels, inspector | React + Tailwind CSS (DOM) | +| Track headers, controls, transport, modals | React + Tailwind CSS (DOM) | +| Timeline clip containers + positioning | React + CSS (`left`/`width` from time → pixels) | +| Waveform rendering inside clips | Raw Canvas 2D | +| Timeline ruler | Raw Canvas 2D | +| Playhead overlay | Raw Canvas 2D | +| Browser playback preview | Web Audio API | +| Final render + export | Python + FFmpeg (server-side only) | + +**Why not full canvas:** Every non-waveform element — buttons, text, inputs, scroll, keyboard focus, accessibility — works better in DOM. Reimplementing all of that in canvas is wasted effort. + +**Why not pure CSS:** Waveform peaks are thousands of pixel-height values per clip. DOM representation would be extremely slow. Canvas draws them in a tight loop in milliseconds. + +**Why not WaveSurfer.js as the core:** WaveSurfer owns playback and its own event model. VibePod Studio needs its own clip model, its own timeline, and eventually its own multi-track playback. Adapting around WaveSurfer's assumptions creates friction. It may be used for standalone audio preview components (e.g., the generation page player), not the Studio timeline. + +**Why not Konva.js:** Overkill for v1. Konva is designed for fully canvas-based scenes (whiteboards, diagrams). VibePod's timeline is mostly DOM. The additional mental model (Stage/Layer/Group/Transformer) is not justified unless interaction complexity grows significantly beyond v1. + +### 3.2 Frontend stack — No additions without justification + +Build inside the existing stack. New packages require a written reason in this document. 
+ +**Approved additions (to be installed when their phase begins):** + +| Package | Purpose | Phase | +|---|---|---| +| `zustand` | Studio editor state | Phase 2 | +| `@dnd-kit/core` + `@dnd-kit/utilities` | Clip drag-and-drop in timeline | Phase 2 | +| `better-sqlite3` | SQLite for job and project persistence | Phase 1 | + +**Conditionally approved (evaluate at phase start):** + +| Package | Purpose | Condition | +|---|---|---| +| `framer-motion` | Clip move animations | Only if DnD-kit transitions feel rough after prototype | +| `@radix-ui/react-*` | Accessible modal/dropdown primitives | Only if building custom is taking too long | + +### 3.3 Backend stack — Python + FFmpeg for all rendering + +Browser-side audio mixing is only for **preview**. Export always goes to the Python backend. + +**Approved backend additions:** + +| Package | Purpose | Phase | +|---|---|---| +| `soundfile` | Already present — WAV read/write | Phase 1 | +| `numpy` | Audio array manipulation | Phase 1 | +| `pydub` | Audio trimming, mixing, concatenation | Phase 2 | +| `pyloudnorm` | Loudness normalisation (LUFS) | Phase 5 | + +FFmpeg must be available on the server host. The render endpoint assumes `ffmpeg` is on PATH. + +### 3.4 State management — Zustand for Studio, useReducer stays on generation page + +The existing generation page uses `useReducer` and works well. Do not refactor it. + +Studio requires a shared store that multiple components read and write (timeline, inspector, transport, script panel). Zustand is the right tool. It is lightweight, does not require providers, and handles editor-style state (undo stacks, selection, playhead) cleanly. + +### 3.5 Database — SQLite from Phase 1 + +Do not start with flat JSON files. SQLite is still a single file, requires no server process, and gives proper queries, transactions, and schema migrations from day one. Use `better-sqlite3` in the Next.js API layer. + +Schema lives in `web/lib/db/schema.sql`. 
Migrations are numbered SQL files in `web/lib/db/migrations/`. + +### 3.6 Audio sample rate + +All generated audio is 24 kHz float32 mono (VibeVoice output). Studio renders at 44.1 kHz stereo WAV or 48 kHz for podcast MP3 export. The render pipeline handles resampling. + +--- + +## 4. Non-Goals + +These will not be built and must not creep in: + +- **Real-time collaborative editing** — single-user per project only +- **Cloud sync or user accounts** — local-first, no auth system +- **MIDI or music composition** — audio clips only, no MIDI tracks +- **Plugin system** — no third-party audio plugin API +- **Browser-side FFmpeg (ffmpeg.wasm)** — all rendering is server-side +- **Mobile / responsive Studio layout** — Studio targets desktop viewport only +- **Offline PWA** — the server must be running; no service worker caching of model output +- **Real-time voice cloning** — out of scope until VibeVoice supports it cleanly +- **Exporting to streaming platforms** — export to file only; no Spotify/Apple Podcasts upload + +--- + +## 5. 
Data Models + +### 5.1 Generation job + +```ts +type GenerationJob = { + id: string; // "gen_" + createdAt: string; // ISO 8601 + status: "pending" | "generating" | "complete" | "error" | "cancelled"; + script: string; + speaker: string; + cfgScale: number; + inferenceSteps: number; + durationSecs: number | null; // set on complete + sampleRate: number; // always 24000 + audioPath: string | null; // relative to data/generations//audio.wav + waveformPath: string | null; // relative to data/generations//waveform.json + errorMessage: string | null; +}; +``` + +### 5.2 Studio project + +```ts +type StudioProject = { + id: string; // "proj_" + name: string; + createdAt: string; + updatedAt: string; + script: ScriptDocument; + assets: AudioAsset[]; + tracks: Track[]; + edits: EditOperation[]; + renderSettings: RenderSettings; +}; +``` + +### 5.3 Script document + +```ts +type ScriptDocument = { + blocks: ScriptBlock[]; +}; + +type ScriptBlock = { + id: string; // "block_" + speakerId: string; + text: string; + order: number; + generatedAssetId: string | null; + timelineClipIds: string[]; +}; +``` + +### 5.4 Audio asset + +```ts +type AudioAsset = { + id: string; // "asset_" + projectId: string | null; // null = generation library asset + kind: "generated_voice" | "upload" | "music" | "sfx" | "render"; + filePath: string; + durationSecs: number; + sampleRate: number; + channels: number; + waveformPath: string | null; + source: { + generationJobId?: string; + scriptBlockId?: string; + providerId?: string; + modelId?: string; + voiceId?: string; + settings?: Record; + } | null; +}; +``` + +### 5.5 Track + +```ts +type Track = { + id: string; // "track_" + name: string; + type: "voice" | "music" | "sfx" | "ambience" | "master"; + order: number; + muted: boolean; + solo: boolean; + gainDb: number; + clips: TimelineClip[]; +}; +``` + +### 5.6 Timeline clip + +```ts +type TimelineClip = { + id: string; // "clip_" + assetId: string; + trackId: string; + startTime: number; // 
seconds from timeline origin + sourceStart: number; // trim start within source asset (seconds) + sourceEnd: number; // trim end within source asset (seconds) + gainDb: number; + fadeInMs: number; + fadeOutMs: number; + linkedScriptRange: { + blockId: string; + startChar: number; + endChar: number; + } | null; +}; +``` + +### 5.7 Edit operation (non-destructive EDL) + +```ts +type EditOperation = + | { type: "split"; clipId: string; at: number } + | { type: "trim"; clipId: string; sourceStart: number; sourceEnd: number } + | { type: "move"; clipId: string; startTime: number; trackId: string } + | { type: "gain"; clipId: string; gainDb: number } + | { type: "fade"; clipId: string; fadeInMs: number; fadeOutMs: number } + | { type: "delete"; clipId: string } + | { type: "mute"; trackId: string; muted: boolean } + | { type: "solo"; trackId: string; solo: boolean }; +``` + +### 5.8 Take (regeneration history) + +```ts +type Take = { + id: string; // "take_" + scriptBlockId: string; + assetId: string; + voiceId: string; + modelId: string; + settings: Record; + createdAt: string; + rating: number | null; // 1-5 stars, optional + notes: string | null; + isActive: boolean; // true = the one placed on the timeline +}; +``` + +### 5.9 Render settings + +```ts +type RenderSettings = { + format: "wav" | "mp3"; + sampleRate: 44100 | 48000; + bitrate: number | null; // kbps, null for WAV + normaliseLoudness: boolean; + lufsTarget: number; // default -16 LUFS for podcast + metadata: { + title: string; + artist: string; + album: string; + episodeNumber: number | null; + description: string; + } | null; +}; +``` + +### 5.10 Waveform peaks + +```ts +type WaveformPeaks = { + sampleRate: number; + durationSecs: number; + channels: number; + samplesPerPixel: number; + length: number; + data: { + min: number[]; // range -1.0 to 0.0 + max: number[]; // range 0.0 to 1.0 + }; +}; +``` + +--- + +## 6. 
Storage Layout + +``` +data/ + generations/ + gen_/ + audio.wav raw float32 WAV at 24 kHz + waveform.json WaveformPeaks at 256 samples/pixel + metadata.json GenerationJob fields (denormalised) + + projects/ + proj_/ + project.json full StudioProject serialised + assets/ + asset_.wav uploaded or imported audio + renders/ + render_.wav exported renders + render_.mp3 + + db/ + vibepod.db SQLite database +``` + +The SQLite database is the source of truth for IDs, status, and relationships. The files on disk (WAV audio and waveform JSON) are the source of truth for audio and waveform data. + +--- + +## 7. API Contract + +All new routes are under `/api/`. The Next.js app proxies to FastAPI only for generation and health. Persistence routes are handled directly by Next.js API routes talking to SQLite. + +### 7.1 Generation (existing, extended) + +``` +POST /api/generate start streaming generation (existing) +GET /api/health server health check (existing) +``` + +### 7.2 Generation library (Phase 1) + +``` +GET /api/generations list all jobs, newest first + query: ?limit=20&offset=0&status=complete +GET /api/generations/:id get single job metadata +GET /api/generations/:id/audio stream WAV file +GET /api/generations/:id/waveform get WaveformPeaks JSON +DELETE /api/generations/:id delete job and files +``` + +### 7.3 Projects (Phase 2) + +``` +POST /api/projects create project +GET /api/projects list projects +GET /api/projects/:id get project with full StudioProject +PUT /api/projects/:id save/autosave project +DELETE /api/projects/:id delete project and assets +``` + +### 7.4 Project assets (Phase 2) + +``` +POST /api/projects/:id/assets upload audio file or import from generation +GET /api/projects/:id/assets/:aid get asset metadata +DELETE /api/projects/:id/assets/:aid remove asset +``` + +### 7.5 Takes (Phase 3) + +``` +GET /api/projects/:id/takes/:blockId list takes for a script block +POST /api/projects/:id/takes/:blockId save new take +PUT /api/projects/:id/takes/:blockId/:takeId set active take 
+DELETE /api/projects/:id/takes/:blockId/:takeId delete take +``` + +### 7.6 Render (Phase 5) + +``` +POST /api/projects/:id/render start render job (sends EDL to Python) +GET /api/projects/:id/renders list render history +GET /api/projects/:id/renders/:rid poll render status +GET /api/projects/:id/renders/:rid/download stream rendered file +``` + +All error responses follow: + +```json +{ "error": "human-readable message", "code": "SNAKE_CASE_CODE" } +``` + +--- + +## 8. Frontend Routes + +``` +/ generation page (current fast path, unchanged) +/library generation library — browse and replay saved jobs +/projects project dashboard — create or open Studio projects +/studio/:projectId Studio workspace +/studio/new redirect: creates project + navigates to /studio/:id +``` + +--- + +## 9. Component Hierarchy + +### Generation page (existing — do not restructure without reason) + +``` +page.tsx + Header + TextInputPanel + GenerationControls + StatusLog + AudioPlayer +``` + +### Library page (Phase 1) + +``` +/app/library/page.tsx + Header + LibraryPage + GenerationCard[] + WaveformPreview (small static canvas render of peaks) + GenerationMetadata (speaker, duration, date, settings) + GenerationActions (play, download, open in studio, delete) +``` + +### Projects dashboard (Phase 2) + +``` +/app/projects/page.tsx + Header + ProjectsPage + NewProjectButton + ProjectCard[] + ProjectThumbnail + ProjectMetadata + ProjectActions +``` + +### Studio workspace (Phase 2+) + +``` +/app/studio/[projectId]/page.tsx + StudioShell + StudioTopBar + ProjectNameInput + UndoButton / RedoButton + SaveStatus + ExportButton + StudioBody + ProjectSidebar + SidebarTabs (Script | Voices | Media | Effects | Templates) + ScriptPanel (Phase 3) + VoicesPanel + MediaBin (Phase 2) + EffectsPanel (Phase 4) + StudioMain + TimelineRuler (canvas) + TimelineArea + TimelineTrack[] (one per track) + TrackHeader + TrackClipArea + TimelineClip[] + WaveformCanvas (canvas) + ClipLabel + InspectorPanel + 
ClipInspector (when clip selected) + ClipMetadata + GainControl + FadeControls + RegenerateButton (Phase 3) + TakeStack (Phase 3) + TrackInspector (when track selected) + EmptyInspector (nothing selected) + TransportBar + PlayPauseButton + StopButton + PlayheadTimeDisplay + ZoomControls + SnapToggle +``` + +--- + +## 10. Timeline Rendering Model + +### Clip positioning + +Timeline clips are DOM elements with CSS `position: absolute`. Position and size derive from the project's pixels-per-second zoom level: + +```ts +const left = clip.startTime * pixelsPerSecond; +const width = (clip.sourceEnd - clip.sourceStart) * pixelsPerSecond; +``` + +The `pixelsPerSecond` value lives in Zustand and changes with zoom. All clip positions recompute via derived selectors. + +### Waveform canvas + +Each `TimelineClip` contains a `<canvas>` element that receives peak data as a prop. The renderer draws one vertical line per pixel column: + +```ts +function drawWaveform( + ctx: CanvasRenderingContext2D, + peaks: WaveformPeaks, + width: number, + height: number, + color: string +): void { + const midY = height / 2; + ctx.clearRect(0, 0, width, height); + ctx.strokeStyle = color; + ctx.lineWidth = 1; + + for (let x = 0; x < width; x++) { + const peakIndex = Math.floor((x / width) * peaks.length); + const minY = midY - peaks.data.min[peakIndex] * midY; // min <= 0, so this lands below the midline (canvas y grows downward) + const maxY = midY - peaks.data.max[peakIndex] * midY; // max >= 0, so this lands above the midline + ctx.beginPath(); + ctx.moveTo(x + 0.5, minY); + ctx.lineTo(x + 0.5, maxY); + ctx.stroke(); + } +} +``` + +The canvas rerenders when `width`, `peaks`, or zoom changes. It does not rerender on playback. + +### Playhead + +The playhead is a separate absolutely-positioned element (thin vertical line) that overlays the entire timeline area. Its `left` position is updated via `requestAnimationFrame` during playback — no React state updates, direct DOM style write. + +### Timeline ruler + +A `<canvas>` element spanning the full timeline width. 
Draws time markers at intervals derived from current zoom level. Rerenders on zoom change only. + +--- + +## 11. Playback Model + +Studio preview uses the **Web Audio API**, not HTML5 `