chore(phase-0): stabilise foundation for Studio build

- Extract WAV assembly (buildWav, mergeFloat32Arrays, decodeFloat32Chunk, SAMPLE_RATE) into web/lib/audio/wav.ts so it can be reused by the Studio playback engine and library waveform previews
- Add server/waveform.py with compute_peaks() / write_peaks(): reads any WAV, mixes to mono, returns min/max peak arrays matching the WaveformPeaks TypeScript type
- Add server/ids.py with prefixed URL-safe ID helpers (gen_id, proj_id, asset_id, etc.) using stdlib secrets (no new dependency)
- Add docs/studio-build-plan.md, a full execution spec covering stack decisions, data models, API contract, component hierarchy, phase breakdown and acceptance criteria
- Ignore data/ directory (generated audio, waveforms, SQLite DB)
2026-07-31 21:07:07 +00:00 · 2026-05-02 17:24:45 +01:00
parent 0236807928
commit 47e0c7e512
6 changed files with 1098 additions and 48 deletions
@@ -20,8 +20,12 @@ web/tsconfig.tsbuildinfo
 web/next-env.d.ts
 web/node_modules/

+# Generated data (audio files, waveforms, SQLite DB)
+data/
+
 # OS
 .DS_Store
 Thumbs.db
 .vscode/settings.json
 .claude/settings.local.json
+*.txt
@@ -0,0 +1,925 @@
+# VibePod Studio — Build Plan
+
+**Version:** 1.0  
+**Authors:** LyAhn + Claude (Anthropic) + Codex  
+**Date:** 2026-05-02  
+**Status:** Active
+
+---
+
+## Table of Contents
+
+1. [Product Goal](#1-product-goal)
+2. [Current Architecture](#2-current-architecture)
+3. [Locked Technical Decisions](#3-locked-technical-decisions)
+4. [Non-Goals](#4-non-goals)
+5. [Data Models](#5-data-models)
+6. [Storage Layout](#6-storage-layout)
+7. [API Contract](#7-api-contract)
+8. [Frontend Routes](#8-frontend-routes)
+9. [Component Hierarchy](#9-component-hierarchy)
+10. [Timeline Rendering Model](#10-timeline-rendering-model)
+11. [Playback Model](#11-playback-model)
+12. [Export Rendering Model](#12-export-rendering-model)
+13. [Phase Breakdown](#13-phase-breakdown)
+14. [Acceptance Criteria](#14-acceptance-criteria)
+15. [Deferred Decisions](#15-deferred-decisions)
+16. [Third-Party Library Policy](#16-third-party-library-policy)
+
+---
+
+## 1. Product Goal
+
+VibePod becomes a **script-aware AI podcast creation studio** where users can generate, arrange, edit, regenerate, and export podcast episodes from written scripts.
+
+The headline feature that differentiates VibePod from generic audio editors:
+
+> Every generated clip knows where it came from in the script and can be regenerated, replaced, or compared — without touching the original audio file.
+
+The existing fast-generation page stays as the quick path. Studio grows alongside it.
+
+---
+
+## 2. Current Architecture
+
+### What exists today
+
+| Layer | Details |
+|---|---|
+| Frontend | Next.js 15 · React 19 · TypeScript 5 · Tailwind CSS 4 · App Router |
+| Backend | FastAPI · uvicorn · Python 3.10+ |
+| Model | VibeVoice Realtime 0.5B via JezzWTF fork |
+| Audio format | 24 kHz · float32 PCM · streamed as SSE · assembled to WAV in browser |
+| State management | React `useReducer` in `page.tsx` — no Zustand, no Redux |
+| Persistence | None — generated audio is an ephemeral browser Blob |
+| Packages | pnpm workspace (frontend) · uv (Python) |
+
+### Current generation flow
+
+```
+User submits script
+  → POST /api/generate (Next.js proxy)
+  → POST /generate (FastAPI)
+  → VibeVoice inference thread
+  → SSE stream of base64 float32 PCM chunks
+  → Browser decodes chunks, adaptive buffering, live Web Audio playback
+  → On complete: chunks merged, RIFF/WAV header prepended, Blob URL created
+  → User can play back or download the WAV
+```
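
The "RIFF/WAV header prepended" step in the flow above can be illustrated with a short Python sketch. This is not the `buildWav` function from `web/lib/audio/wav.ts`. It assumes the output stays float32 (format tag 3) with a minimal 44-byte header, and `build_float32_wav` is a hypothetical name.

```python
import struct


def build_float32_wav(samples, sample_rate=24000, channels=1):
    """Prepend a minimal 44-byte RIFF/WAVE header to float32 PCM samples.

    Hypothetical helper for illustration only.
    """
    data = struct.pack(f"<{len(samples)}f", *samples)
    bytes_per_sample = 4
    block_align = channels * bytes_per_sample
    # RIFF chunk size = 36 header bytes after this field + payload size.
    header = b"RIFF" + struct.pack("<I", 36 + len(data)) + b"WAVE"
    header += b"fmt " + struct.pack(
        "<IHHIIHH",
        16,                         # fmt chunk size
        3,                          # format tag 3 = IEEE float
        channels,
        sample_rate,
        sample_rate * block_align,  # byte rate
        block_align,
        bytes_per_sample * 8,       # bits per sample
    )
    header += b"data" + struct.pack("<I", len(data))
    return header + data
```

Keeping the header builder free of I/O is what lets it be unit-tested as a pure function, which is the Phase 0 acceptance target for WAV assembly.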
+
+### Key files
+
+```
+web/
+  app/
+    page.tsx                    main generation UI (AppState via useReducer)
+    api/generate/route.ts       SSE proxy to FastAPI
+    api/health/route.ts         health check proxy
+  components/
+    Header.tsx
+    TextInputPanel.tsx
+    AudioPlayer.tsx
+    GenerationControls.tsx
+    StatusLog.tsx
+  hooks/
+    useStreamingGeneration.ts   core streaming + WAV assembly
+    useAudioPlayer.ts           HTML5 audio element wrapper
+
+server/
+  vibevoice_server.py           entire FastAPI app (972 lines)
+  start.sh                      launcher (CPU/CUDA detection, uv sync, uvicorn)
+  download_model.py             HuggingFace prefetch
+
+docs/
+  studio-build-plan.md          this file
+
+roadmap.md                      high-level phase vision
+DESIGN.md                       brand + design system (colours, type, spacing)
+AGENTS.md                       AI agent / CI guide
+```
+
+---
+
+## 3. Locked Technical Decisions
+
+These decisions are final and must not be revisited without explicit agreement. New phases build on them.
+
+### 3.1 Rendering approach — Hybrid DOM + Canvas 2D
+
+The Studio uses a **hybrid rendering model**:
+
+| Layer | Technology |
+|---|---|
+| App shell, layout, sidebars, panels, inspector | React + Tailwind CSS (DOM) |
+| Track headers, controls, transport, modals | React + Tailwind CSS (DOM) |
+| Timeline clip containers + positioning | React + CSS (`left`/`width` from time → pixels) |
+| Waveform rendering inside clips | Raw Canvas 2D |
+| Timeline ruler | Raw Canvas 2D |
+| Playhead overlay | Raw Canvas 2D |
+| Browser playback preview | Web Audio API |
+| Final render + export | Python + FFmpeg (server-side only) |
+
+**Why not full canvas:** Every non-waveform element — buttons, text, inputs, scroll, keyboard focus, accessibility — works better in DOM. Reimplementing all of that in canvas is wasted effort.
+
+**Why not pure CSS:** Waveform peaks are thousands of pixel-height values per clip. DOM representation would be extremely slow. Canvas draws them in a tight loop in milliseconds.
+
+**Why not WaveSurfer.js as the core:** WaveSurfer owns playback and its own event model. VibePod Studio needs its own clip model, its own timeline, and eventually its own multi-track playback. Adapting around WaveSurfer's assumptions creates friction. It may be used for standalone audio preview components (e.g., the generation page player), not the Studio timeline.
+
+**Why not Konva.js:** Overkill for v1. Konva is designed for fully canvas-based scenes (whiteboards, diagrams). VibePod's timeline is mostly DOM. The additional mental model (Stage/Layer/Group/Transformer) is not justified unless interaction complexity grows significantly beyond v1.
+
+### 3.2 Frontend stack — No additions without justification
+
+Build inside the existing stack. New packages require a written reason in this document.
+
+**Approved additions (to be installed when their phase begins):**
+
+| Package | Purpose | Phase |
+|---|---|---|
+| `zustand` | Studio editor state | Phase 2 |
+| `@dnd-kit/core` + `@dnd-kit/utilities` | Clip drag-and-drop in timeline | Phase 2 |
+| `better-sqlite3` | SQLite for job and project persistence | Phase 1 |
+
+**Conditionally approved (evaluate at phase start):**
+
+| Package | Purpose | Condition |
+|---|---|---|
+| `framer-motion` | Clip move animations | Only if DnD-kit transitions feel rough after prototype |
+| `@radix-ui/react-*` | Accessible modal/dropdown primitives | Only if building custom is taking too long |
+
+### 3.3 Backend stack — Python + FFmpeg for all rendering
+
+Browser-side audio mixing is only for **preview**. Export always goes to the Python backend.
+
+**Approved backend additions:**
+
+| Package | Purpose | Phase |
+|---|---|---|
+| `soundfile` | Already present — WAV read/write | Phase 1 |
+| `numpy` | Audio array manipulation | Phase 1 |
+| `pydub` | Audio trimming, mixing, concatenation | Phase 2 |
+| `pyloudnorm` | Loudness normalisation (LUFS) | Phase 5 |
+
+FFmpeg must be available on the server host. The render endpoint assumes `ffmpeg` is on PATH.
+
+### 3.4 State management — Zustand for Studio, useReducer stays on generation page
+
+The existing generation page uses `useReducer` and works well. Do not refactor it.
+
+Studio requires a shared store that multiple components read and write (timeline, inspector, transport, script panel). Zustand is the right tool. It is lightweight, does not require providers, and handles editor-style state (undo stacks, selection, playhead) cleanly.
+
+### 3.5 Database — SQLite from Phase 1
+
+Do not start with flat JSON files. SQLite is still a single file, requires no server process, and gives proper queries, transactions, and schema migrations from day one. Use `better-sqlite3` in the Next.js API layer.
+
+Schema lives in `web/lib/db/schema.sql`. Migrations are numbered SQL files in `web/lib/db/migrations/`.
+
+### 3.6 Audio sample rate
+
+All generated audio is 24 kHz float32 mono (VibeVoice output). Studio renders at 44.1 kHz stereo WAV or 48 kHz for podcast MP3 export. The render pipeline handles resampling.
+
+---
+
+## 4. Non-Goals
+
+These will not be built and must not creep in:
+
+- **Real-time collaborative editing** — single-user per project only
+- **Cloud sync or user accounts** — local-first, no auth system
+- **MIDI or music composition** — audio clips only, no MIDI tracks
+- **Plugin system** — no third-party audio plugin API
+- **Browser-side FFmpeg (ffmpeg.wasm)** — all rendering is server-side
+- **Mobile / responsive Studio layout** — Studio targets desktop viewport only
+- **Offline PWA** — the server must be running; no service worker caching of model output
+- **Real-time voice cloning** — out of scope until VibeVoice supports it cleanly
+- **Exporting to streaming platforms** — export to file only; no Spotify/Apple Podcasts upload
+
+---
+
+## 5. Data Models
+
+### 5.1 Generation job
+
+```ts
+type GenerationJob = {
+  id: string;                      // "gen_<nanoid>"
+  createdAt: string;               // ISO 8601
+  status: "pending" | "generating" | "complete" | "error" | "cancelled";
+  script: string;
+  speaker: string;
+  cfgScale: number;
+  inferenceSteps: number;
+  durationSecs: number | null;     // set on complete
+  sampleRate: number;              // always 24000
+  audioPath: string | null;        // relative path: data/generations/<id>/audio.wav
+  waveformPath: string | null;     // relative path: data/generations/<id>/waveform.json
+  errorMessage: string | null;
+};
+```
+
+### 5.2 Studio project
+
+```ts
+type StudioProject = {
+  id: string;                      // "proj_<nanoid>"
+  name: string;
+  createdAt: string;
+  updatedAt: string;
+  script: ScriptDocument;
+  assets: AudioAsset[];
+  tracks: Track[];
+  edits: EditOperation[];
+  renderSettings: RenderSettings;
+};
+```
+
+### 5.3 Script document
+
+```ts
+type ScriptDocument = {
+  blocks: ScriptBlock[];
+};
+
+type ScriptBlock = {
+  id: string;                      // "block_<nanoid>"
+  speakerId: string;
+  text: string;
+  order: number;
+  generatedAssetId: string | null;
+  timelineClipIds: string[];
+};
+```
+
+### 5.4 Audio asset
+
+```ts
+type AudioAsset = {
+  id: string;                      // "asset_<nanoid>"
+  projectId: string | null;        // null = generation library asset
+  kind: "generated_voice" | "upload" | "music" | "sfx" | "render";
+  filePath: string;
+  durationSecs: number;
+  sampleRate: number;
+  channels: number;
+  waveformPath: string | null;
+  source: {
+    generationJobId?: string;
+    scriptBlockId?: string;
+    providerId?: string;
+    modelId?: string;
+    voiceId?: string;
+    settings?: Record<string, unknown>;
+  } | null;
+};
+```
+
+### 5.5 Track
+
+```ts
+type Track = {
+  id: string;                      // "track_<nanoid>"
+  name: string;
+  type: "voice" | "music" | "sfx" | "ambience" | "master";
+  order: number;
+  muted: boolean;
+  solo: boolean;
+  gainDb: number;
+  clips: TimelineClip[];
+};
+```
+
+### 5.6 Timeline clip
+
+```ts
+type TimelineClip = {
+  id: string;                      // "clip_<nanoid>"
+  assetId: string;
+  trackId: string;
+  startTime: number;               // seconds from timeline origin
+  sourceStart: number;             // trim start within source asset (seconds)
+  sourceEnd: number;               // trim end within source asset (seconds)
+  gainDb: number;
+  fadeInMs: number;
+  fadeOutMs: number;
+  linkedScriptRange: {
+    blockId: string;
+    startChar: number;
+    endChar: number;
+  } | null;
+};
+```
+
+### 5.7 Edit operation (non-destructive EDL)
+
+```ts
+type EditOperation =
+  | { type: "split";  clipId: string; at: number }
+  | { type: "trim";   clipId: string; sourceStart: number; sourceEnd: number }
+  | { type: "move";   clipId: string; startTime: number; trackId: string }
+  | { type: "gain";   clipId: string; gainDb: number }
+  | { type: "fade";   clipId: string; fadeInMs: number; fadeOutMs: number }
+  | { type: "delete"; clipId: string }
+  | { type: "mute";   trackId: string; muted: boolean }
+  | { type: "solo";   trackId: string; solo: boolean };
+```
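
As an illustration of how a `split` operation can stay non-destructive, here is a small Python sketch that works on plain dicts shaped like `TimelineClip`. It assumes `at` is measured in timeline seconds rather than inside the source asset, and that the cut edge loses its fade. Both are assumptions, and `split_clip` is a hypothetical name.

```python
def split_clip(clip, at, new_id):
    """Split a clip dict at timeline time `at` (seconds); return (left, right).

    The input clip and the underlying audio asset are left untouched:
    only the trim window (sourceStart/sourceEnd) of each half changes.
    """
    duration = clip["sourceEnd"] - clip["sourceStart"]
    if not clip["startTime"] < at < clip["startTime"] + duration:
        raise ValueError("split point must fall strictly inside the clip")
    # Map the timeline position back into the source asset.
    cut = clip["sourceStart"] + (at - clip["startTime"])
    left = {**clip, "sourceEnd": cut, "fadeOutMs": 0}
    right = {**clip, "id": new_id, "startTime": at,
             "sourceStart": cut, "fadeInMs": 0}
    return left, right
```

Because both halves still point at the same `assetId`, undoing the split only needs the original clip record back.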
+
+### 5.8 Take (regeneration history)
+
+```ts
+type Take = {
+  id: string;                      // "take_<nanoid>"
+  scriptBlockId: string;
+  assetId: string;
+  voiceId: string;
+  modelId: string;
+  settings: Record<string, unknown>;
+  createdAt: string;
+  rating: number | null;           // 1-5 stars, optional
+  notes: string | null;
+  isActive: boolean;               // true = the one placed on the timeline
+};
+```
+
+### 5.9 Render settings
+
+```ts
+type RenderSettings = {
+  format: "wav" | "mp3";
+  sampleRate: 44100 | 48000;
+  bitrate: number | null;          // kbps, null for WAV
+  normaliseLoudness: boolean;
+  lufsTarget: number;              // default -16 LUFS for podcast
+  metadata: {
+    title: string;
+    artist: string;
+    album: string;
+    episodeNumber: number | null;
+    description: string;
+  } | null;
+};
+```
+
+### 5.10 Waveform peaks
+
+```ts
+type WaveformPeaks = {
+  sampleRate: number;
+  durationSecs: number;
+  channels: number;
+  samplesPerPixel: number;
+  length: number;
+  data: {
+    min: number[];                 // range -1.0 to 0.0
+    max: number[];                 // range 0.0 to 1.0
+  };
+};
+```
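
How the `WaveformPeaks` fields relate to the raw samples can be sketched in Python. This is not the `compute_peaks()` in `server/waveform.py`. It assumes the audio is already mono and held in a plain list of floats, that `channels` describes the mixed-down peak data, and `peaks_from_samples` is a hypothetical name.

```python
import math


def peaks_from_samples(samples, sample_rate=24000, samples_per_pixel=256):
    """Reduce mono float samples to one (min, max) pair per pixel column.

    Hypothetical helper for illustration only.
    """
    if samples_per_pixel <= 0:
        raise ValueError("samples_per_pixel must be positive")
    length = math.ceil(len(samples) / samples_per_pixel)
    mins, maxs = [], []
    for i in range(length):
        window = samples[i * samples_per_pixel:(i + 1) * samples_per_pixel]
        # Clamp so min stays in [-1.0, 0.0] and max in [0.0, 1.0],
        # matching the ranges documented on WaveformPeaks.data.
        mins.append(max(-1.0, min(0.0, min(window))))
        maxs.append(min(1.0, max(0.0, max(window))))
    return {
        "sampleRate": sample_rate,
        "durationSecs": len(samples) / sample_rate,
        "channels": 1,  # assumption: peaks describe the mono mixdown
        "samplesPerPixel": samples_per_pixel,
        "length": length,
        "data": {"min": mins, "max": maxs},
    }
```

At 24 kHz and 256 samples per pixel, one second of audio reduces to about 94 peak pairs, which keeps `waveform.json` small compared with the WAV it describes.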
+
+---
+
+## 6. Storage Layout
+
+```
+data/
+  generations/
+    gen_<id>/
+      audio.wav                   raw float32 WAV at 24 kHz
+      waveform.json               WaveformPeaks at 256 samples/pixel
+      metadata.json               GenerationJob fields (denormalised)
+
+  projects/
+    proj_<id>/
+      project.json                full StudioProject serialised
+      assets/
+        asset_<id>.wav            uploaded or imported audio
+      renders/
+        render_<timestamp>.wav    exported renders
+        render_<timestamp>.mp3
+
+  db/
+    vibepod.db                    SQLite database
+```
+
+The SQLite database is the source of truth for IDs, status, and relationships. The files on disk (WAV and JSON) are the source of truth for audio and waveform data.
+
+---
+
+## 7. API Contract
+
+All new routes are under `/api/`. The Next.js app proxies to FastAPI only for generation and health. Persistence routes are handled directly by Next.js API routes talking to SQLite.
+
+### 7.1 Generation (existing, extended)
+
+```
+POST   /api/generate                   start streaming generation (existing)
+GET    /api/health                     server health check (existing)
+```
+
+### 7.2 Generation library (Phase 1)
+
+```
+GET    /api/generations                list all jobs, newest first
+                                       query: ?limit=20&offset=0&status=complete
+GET    /api/generations/:id            get single job metadata
+GET    /api/generations/:id/audio      stream WAV file
+GET    /api/generations/:id/waveform   get WaveformPeaks JSON
+DELETE /api/generations/:id            delete job and files
+```
+
+### 7.3 Projects (Phase 2)
+
+```
+POST   /api/projects                   create project
+GET    /api/projects                   list projects
+GET    /api/projects/:id               get project with full StudioProject
+PUT    /api/projects/:id               save/autosave project
+DELETE /api/projects/:id               delete project and assets
+```
+
+### 7.4 Project assets (Phase 2)
+
+```
+POST   /api/projects/:id/assets        upload audio file or import from generation
+GET    /api/projects/:id/assets/:aid   get asset metadata
+DELETE /api/projects/:id/assets/:aid   remove asset
+```
+
+### 7.5 Takes (Phase 3)
+
+```
+GET    /api/projects/:id/takes/:blockId         list takes for a script block
+POST   /api/projects/:id/takes/:blockId         save new take
+PUT    /api/projects/:id/takes/:blockId/:takeId  set active take
+DELETE /api/projects/:id/takes/:blockId/:takeId  delete take
+```
+
+### 7.6 Render (Phase 5)
+
+```
+POST   /api/projects/:id/render        start render job (sends EDL to Python)
+GET    /api/projects/:id/renders       list render history
+GET    /api/projects/:id/renders/:rid  poll render status
+GET    /api/projects/:id/renders/:rid/download  stream rendered file
+```
+
+All error responses follow:
+
+```json
+{ "error": "human-readable message", "code": "SNAKE_CASE_CODE" }
+```
+
+---
+
+## 8. Frontend Routes
+
+```
+/                                  generation page (current fast path, unchanged)
+/library                           generation library — browse and replay saved jobs
+/projects                          project dashboard — create or open Studio projects
+/studio/:projectId                 Studio workspace
+/studio/new                        redirect: creates project + navigates to /studio/:id
+```
+
+---
+
+## 9. Component Hierarchy
+
+### Generation page (existing — do not restructure without reason)
+
+```
+page.tsx
+  Header
+  TextInputPanel
+  GenerationControls
+  StatusLog
+  AudioPlayer
+```
+
+### Library page (Phase 1)
+
+```
+/app/library/page.tsx
+  Header
+  LibraryPage
+    GenerationCard[]
+      WaveformPreview          (small static canvas render of peaks)
+      GenerationMetadata       (speaker, duration, date, settings)
+      GenerationActions        (play, download, open in studio, delete)
+```
+
+### Projects dashboard (Phase 2)
+
+```
+/app/projects/page.tsx
+  Header
+  ProjectsPage
+    NewProjectButton
+    ProjectCard[]
+      ProjectThumbnail
+      ProjectMetadata
+      ProjectActions
+```
+
+### Studio workspace (Phase 2+)
+
+```
+/app/studio/[projectId]/page.tsx
+  StudioShell
+    StudioTopBar
+      ProjectNameInput
+      UndoButton / RedoButton
+      SaveStatus
+      ExportButton
+    StudioBody
+      ProjectSidebar
+        SidebarTabs (Script | Voices | Media | Effects | Templates)
+        ScriptPanel           (Phase 3)
+        VoicesPanel
+        MediaBin              (Phase 2)
+        EffectsPanel          (Phase 4)
+      StudioMain
+        TimelineRuler         (canvas)
+        TimelineArea
+          TimelineTrack[]     (one per track)
+            TrackHeader
+            TrackClipArea
+              TimelineClip[]
+                WaveformCanvas  (canvas)
+                ClipLabel
+      InspectorPanel
+        ClipInspector         (when clip selected)
+          ClipMetadata
+          GainControl
+          FadeControls
+          RegenerateButton    (Phase 3)
+          TakeStack           (Phase 3)
+        TrackInspector        (when track selected)
+        EmptyInspector        (nothing selected)
+    TransportBar
+      PlayPauseButton
+      StopButton
+      PlayheadTimeDisplay
+      ZoomControls
+      SnapToggle
+```
+
+---
+
+## 10. Timeline Rendering Model
+
+### Clip positioning
+
+Timeline clips are DOM elements with CSS `position: absolute`. Position and size derive from the project's pixels-per-second zoom level:
+
+```ts
+const left = clip.startTime * pixelsPerSecond;
+const width = (clip.sourceEnd - clip.sourceStart) * pixelsPerSecond;
+```
+
+The `pixelsPerSecond` value lives in Zustand and changes with zoom. All clip positions recompute via derived selectors.
+
+### Waveform canvas
+
+Each `TimelineClip` contains a `<canvas>` element that receives peak data as a prop. The renderer draws one vertical line per pixel column:
+
+```ts
+function drawWaveform(
+  ctx: CanvasRenderingContext2D,
+  peaks: WaveformPeaks,
+  width: number,
+  height: number,
+  color: string
+): void {
+  const midY = height / 2;
+  ctx.clearRect(0, 0, width, height);
+  ctx.strokeStyle = color;
+  ctx.lineWidth = 1;
+
+  for (let x = 0; x < width; x++) {
+    const peakIndex = Math.floor((x / width) * peaks.length);
+    const minY = midY - peaks.data.min[peakIndex] * midY; // min <= 0, so this lands below the centre line
+    const maxY = midY - peaks.data.max[peakIndex] * midY;
+    ctx.beginPath();
+    ctx.moveTo(x + 0.5, minY);
+    ctx.lineTo(x + 0.5, maxY);
+    ctx.stroke();
+  }
+}
+```
+
+The canvas rerenders when `width`, `peaks`, or zoom changes. It does not rerender on playback.
+
+### Playhead
+
+The playhead is a separate absolutely-positioned element (thin vertical line) that overlays the entire timeline area. Its `left` position is updated via `requestAnimationFrame` during playback — no React state updates, direct DOM style write.
+
+### Timeline ruler
+
+A `<canvas>` element spanning the full timeline width. Draws time markers at intervals derived from current zoom level. Rerenders on zoom change only.
+
+---
+
+## 11. Playback Model
+
+Studio preview uses the **Web Audio API**, not HTML5 `<audio>`.
+
+### Graph
+
+```
+AudioBufferSourceNode (per clip)
+  → GainNode (clip gain + fades)
+    → GainNode (track gain)
+      → GainNode (master gain)
+        → AudioContext.destination
+```
+
+### Scheduling
+
+On play, the engine:
+
+1. Reads current `playheadTime` from Zustand and sets a shared anchor `startTime = audioCtx.currentTime + 0.01`.
+2. For each clip where `clip.startTime + (clip.sourceEnd - clip.sourceStart) > playheadTime`:
+   - Calculates `offset = clip.sourceStart + max(0, playheadTime - clip.startTime)` and `delay = max(0, clip.startTime - playheadTime)`
+   - Calls `source.start(startTime + delay, offset, clip.sourceEnd - offset)`
+3. Because every start time derives from the same anchor, all sources stay in sync.
+
+On stop or seek, all sources are disconnected and the graph is torn down. A new graph is built on the next play call.
+
+### Audio buffer cache
+
+Fetched WAV files are decoded to `AudioBuffer` via `AudioContext.decodeAudioData()` and cached by asset ID. Cache is invalidated if the asset is deleted or replaced.
+
+### Fade implementation
+
+Fades are implemented as `AudioParam` ramps on the clip GainNode:
+
+```ts
+gainNode.gain.setValueAtTime(0, startTime);
+gainNode.gain.linearRampToValueAtTime(clipGain, startTime + fadeInSecs);
+gainNode.gain.setValueAtTime(clipGain, endTime - fadeOutSecs);
+gainNode.gain.linearRampToValueAtTime(0, endTime);
+```
+
+---
+
+## 12. Export Rendering Model
+
+When the user triggers export, the frontend sends a render request to the Python backend. The browser is never involved in mixing.
+
+### Request payload
+
+```ts
+type RenderRequest = {
+  projectId: string;
+  tracks: Track[];
+  clips: TimelineClip[];
+  assets: Array<{ id: string; filePath: string }>;
+  settings: RenderSettings;
+};
+```
+
+### Python render pipeline
+
+```python
+# Pseudocode — actual implementation lives in server/render.py
+def render_project(req: RenderRequest) -> str:
+    # 1. Load all source audio files into numpy arrays
+    # 2. Determine total timeline duration
+    # 3. Create output buffer (zeros) at target sample rate
+    # 4. For each clip (sorted by startTime):
+    #    a. Load source audio
+    #    b. Resample to target sample rate if needed
+    #    c. Apply trim (sourceStart → sourceEnd)
+    #    d. Apply gain (dB → linear)
+    #    e. Apply fade in/out (linear ramp)
+    #    f. Place at clip.startTime offset in output buffer
+    # 5. Apply track gain to each track's summed signal
+    # 6. Sum all tracks into master buffer
+    # 7. Apply master gain
+    # 8. If normaliseLoudness: apply pyloudnorm to target LUFS
+    # 9. Export WAV or MP3 via soundfile / ffmpeg
+    # 10. Write to data/projects/<id>/renders/render_<timestamp>.wav (or .mp3)
+    # 11. Return file path
+```
+
+Render runs in a background thread. The client polls `GET /api/projects/:id/renders/:rid` for status.
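
Steps 4c to 4f of the pipeline above (trim, gain, fade, place) can be sketched with plain Python lists. This is not the code in `server/render.py`. It assumes mono audio already at the target sample rate (so resampling is skipped), and `db_to_linear` and `mix_clip` are hypothetical names.

```python
def db_to_linear(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def mix_clip(out, source, sample_rate, start_time, source_start, source_end,
             gain_db=0.0, fade_in_ms=0.0, fade_out_ms=0.0):
    """Trim, gain, fade, and sum one mono clip into `out` in place."""
    first = round(source_start * sample_rate)
    last = round(source_end * sample_rate)
    segment = source[first:last]            # 4c: trim
    n = len(segment)
    gain = db_to_linear(gain_db)            # 4d: dB -> linear
    fade_in = round(fade_in_ms / 1000.0 * sample_rate)
    fade_out = round(fade_out_ms / 1000.0 * sample_rate)
    offset = round(start_time * sample_rate)
    for i, sample in enumerate(segment):
        env = 1.0                           # 4e: linear fade envelope
        if fade_in and i < fade_in:
            env = min(env, i / fade_in)
        if fade_out and i >= n - fade_out:
            env = min(env, (n - 1 - i) / fade_out)
        if 0 <= offset + i < len(out):      # 4f: place on the timeline
            out[offset + i] += sample * gain * env
    return out
```

Summing into the buffer with `+=` rather than assigning is what lets overlapping clips on one track mix instead of overwriting each other. A real implementation would presumably do the same with numpy arrays.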
+
+---
+
+## 13. Phase Breakdown
+
+### Phase 0 — Stabilise (current state → pre-Phase 1)
+
+**Goal:** Clean foundation. No new features.
+
+Tasks:
+- [ ] Extract WAV assembly from `useStreamingGeneration.ts` into `web/lib/audio/wav.ts`
+- [ ] Extract waveform peak generation into `server/waveform.py`
+- [ ] Confirm generation cancellation works cleanly (stream abort + server cancel_event)
+- [ ] Add `nanoid` to backend for stable generation IDs
+- [ ] Add `data/` directory to `.gitignore`
+
+**Acceptance:** WAV assembly is a pure function with unit tests. Generation IDs are stable.
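
The commit message describes `server/ids.py` as producing prefixed URL-safe IDs from the stdlib `secrets` module. A minimal sketch of that idea follows. The real signatures of `gen_id`, `proj_id`, and `asset_id` are not shown in this diff, so `make_id` and its 9-byte default are assumptions.

```python
import secrets


def make_id(prefix, nbytes=9):
    """Return '<prefix>_' followed by a URL-safe random token.

    Hypothetical helper: 9 random bytes encode to exactly 12 characters
    drawn from A-Z, a-z, 0-9, '-' and '_'.
    """
    return f"{prefix}_{secrets.token_urlsafe(nbytes)}"
```

Usage would look like `make_id("gen")` for a generation job or `make_id("proj")` for a project, giving IDs that are safe to use in both URLs and directory names such as `data/generations/<id>/`.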
+
+---
+
+### Phase 1 — Persistent Generation Library
+
+**Goal:** Every generation is saved. Users can browse, play, and download past generations.
+
+**Backend tasks:**
+- [ ] Add SQLite setup (`data/db/vibepod.db`, schema migration 001)
+- [ ] `generations` table: `id, created_at, status, script, speaker, cfg_scale, inference_steps, duration_secs, sample_rate, audio_path, waveform_path, error_message`
+- [ ] On generation complete: save WAV to `data/generations/<id>/audio.wav`
+- [ ] On generation complete: compute and save waveform peaks to `data/generations/<id>/waveform.json`
+- [ ] Implement `GET /api/generations` (list, paginated)
+- [ ] Implement `GET /api/generations/:id` (single)
+- [ ] Implement `GET /api/generations/:id/audio` (stream file)
+- [ ] Implement `GET /api/generations/:id/waveform` (peaks JSON)
+- [ ] Implement `DELETE /api/generations/:id` (delete row + files)
+
+**Frontend tasks:**
+- [ ] Install `better-sqlite3` + types
+- [ ] Create `web/lib/db/` — schema, migration runner, query helpers
+- [ ] Create `/library` route and `LibraryPage` component
+- [ ] `GenerationCard` component: waveform preview canvas, metadata, play/download/delete actions
+- [ ] `WaveformPreview` component: draws peaks on canvas (static, no playback)
+- [ ] Mini audio player for library card playback (reuse `useAudioPlayer` hook)
+- [ ] Link "Open in Studio" button (navigates to `/studio/new?fromGeneration=<id>`)
+- [ ] Add "Library" link to `Header`
+
+**Acceptance:**
+- Generate audio → close browser → reopen → generation appears in library with waveform
+- Play button plays correct audio
+- Delete removes from library and disk
+- Library renders without error when empty
+
+---
+
+### Phase 2 — Studio MVP
+
+**Goal:** Single-track timeline editor. Open a generation, view waveform, trim/split/delete, export WAV.
+
+**Backend tasks:**
+- [ ] `projects` table: `id, name, created_at, updated_at, project_json`
+- [ ] `assets` table: `id, project_id, kind, file_path, duration_secs, sample_rate, channels, waveform_path, source_json`
+- [ ] Implement `POST /api/projects`
+- [ ] Implement `GET /api/projects` (list)
+- [ ] Implement `GET /api/projects/:id`
+- [ ] Implement `PUT /api/projects/:id` (save)
+- [ ] Implement `DELETE /api/projects/:id`
+- [ ] Implement `POST /api/projects/:id/assets` (import from generation or upload)
+- [ ] Implement basic render endpoint (single voice track, WAV out only)
+
+**Frontend tasks:**
+- [ ] Install `zustand`, `@dnd-kit/core`, `@dnd-kit/utilities`
+- [ ] Create Studio Zustand store (`web/stores/studioStore.ts`)
+  - Project state, selected clip, playhead time, zoom, isPlaying, undo stack
+  - Actions: selectClip, moveClip, splitClip, trimClip, setClipGain, undo, redo
+- [ ] Create `/projects` route and dashboard
+- [ ] Create `/studio/[projectId]` route
+- [ ] `StudioShell` — top-level layout
+- [ ] `StudioTopBar` — project name, undo/redo, save status, export button
+- [ ] `ProjectSidebar` — tabs shell + `MediaBin` tab
+- [ ] `MediaBin` — list assets, drag to timeline
+- [ ] `TimelineArea` — scrollable container with tracks
+- [ ] `TimelineRuler` — canvas ruler, rerenders on zoom
+- [ ] `TimelineTrack` — track header + clip area
+- [ ] `TimelineClip` — positioned div, selectable, draggable
+- [ ] `WaveformCanvas` — canvas inside clip, draws peaks
+- [ ] `InspectorPanel` — shows selected clip properties
+- [ ] `TransportBar` — play/pause/stop, time display, zoom slider
+- [ ] Web Audio playback engine (`web/lib/audio/playbackEngine.ts`)
+- [ ] Autosave: debounced PUT on every store change (500ms delay)
+- [ ] Export dialog: format picker → POST /api/projects/:id/render → poll → download
+
+**Acceptance:**
+- Open generation from library → Studio loads with waveform on single track
+- Play button plays audio in sync with playhead
+- Drag clip moves it on timeline
+- Split at playhead creates two clips
+- Trim handles reduce clip duration
+- Delete removes clip
+- Export produces downloadable WAV
+- Undo/redo works for all operations
+
+---
+
+### Phase 3 — Script-Linked Regeneration
+
+**Goal:** Script blocks are the source of truth. Clicking a clip highlights the script. Regenerating a clip produces a new take.
+
+**Backend tasks:**
+- [ ] `takes` table: `id, project_id, script_block_id, asset_id, voice_id, model_id, settings_json, created_at, rating, notes, is_active`
+- [ ] Implement takes API endpoints (list, create, set active, delete)
+- [ ] Waveform peak generation on regenerated takes
+
+**Frontend tasks:**
+- [ ] `ScriptPanel` sidebar tab — editable script blocks with speaker labels
+- [ ] Script block → clip bidirectional linking (click clip → highlight block, click block → select clip)
+- [ ] Clip inspector: show source script text (read-only in Phase 3)
+- [ ] `RegenerateButton` in inspector — sends block text + current voice settings → new generation
+- [ ] Regeneration creates new Take, new Asset, new Clip (does not replace existing clip automatically)
+- [ ] `TakeStack` in inspector — list takes for selected block, click to preview, "Replace in timeline" action
+- [ ] Per-block voice setting override (speaker, cfg_scale, inference_steps)
+
+**Acceptance:**
+- Clicking a clip selects the related script block in ScriptPanel
+- Clicking a script block selects the clip on the timeline
+- Regenerate produces a new take visible in TakeStack
+- "Replace in timeline" swaps the clip's asset to the new take
+- Previous take is preserved and can be restored
+- Undo works across take replacements
+
+---
+
+### Phase 4 — Multi-Speaker Podcast Builder
+
+**Goal:** Multiple voice tracks, music/SFX tracks, speaker assignment, show templates.
+
+**Tasks:**
+- [ ] Multiple tracks: Host, Guest, Music, SFX, Ambience
+- [ ] Track type icons and colour coding per track type
+- [ ] Per-track mute/solo buttons (functional in Web Audio engine)
+- [ ] Track gain slider
+- [ ] Speaker assignment per track (voice preset tied to track)
+- [ ] Music/SFX uploads to media bin
+- [ ] Basic music ducking on voice tracks (auto-gain on music track when voice plays)
+- [ ] Show template: save a project's track layout + speaker assignments as a reusable template
+- [ ] Template picker on new project creation
+
+**Acceptance:**
+- Two voice tracks play independently and mix correctly
+- Mute/solo work
+- Music bed plays under voice tracks
+- A project created from a saved template reproduces the template's track layout and speaker assignments
+- Exported WAV contains all tracks mixed
+
+---
+
+### Phase 5 — Production Export
+
+**Goal:** MP3 export, loudness normalisation, podcast metadata, render queue, mastering presets.
+
+**Tasks:**
+- [ ] MP3 export via FFmpeg on render backend
+- [ ] `pyloudnorm` integration — LUFS targeting per preset
+- [ ] Export presets: Podcast Balanced, Podcast Loud, Audiobook, Raw WAV, YouTube Audio
+- [ ] ID3 metadata fields in export dialog (title, artist, episode number, cover art, description)
+- [ ] Render job queue — multiple renders can be queued
+- [ ] Render status polling with progress bar
+- [ ] Render history panel in project
+- [ ] Autosave recovery: on crash/close, restore last autosaved state on next open
+
+**Acceptance:**
+- MP3 export produces valid file with correct ID3 tags
+- Loudness normalisation lands within ±0.5 LU of the target LUFS
+- Render queue processes jobs sequentially
+- Recovering an autosave restores timeline to last saved state
+
+---
+
+## 14. Acceptance Criteria
+
+### Cross-cutting criteria (all phases)
+
+- No TypeScript `any` types anywhere in Studio code
+- Zustand store actions are pure (no side effects except explicit async actions)
+- Autosave never blocks the UI thread
+- Undo/redo covers every timeline mutation
+- No orphaned audio files — deleting a project deletes its files
+- Waveform canvas does not re-render on playback (only on zoom/resize)
+- Playhead position updates at 60fps via `requestAnimationFrame`, not React state
+
+---
+
+## 15. Deferred Decisions
+
+These are intentionally not decided yet. Revisit at the phase that needs them.
+
+| Decision | Deferred until |
+|---|---|
+| Voice cloning / custom voice upload | Depends on VibeVoice roadmap |
+| XTTS or ElevenLabs as second provider | Phase 3+ — only after VoiceModelProvider abstraction is proven |
+| Clip crossfades (overlapping clips) | Phase 4 — requires mixing model update |
+| Clip-level EQ / compression | Phase 5 |
+| Per-segment emotion / style tags | Phase 3 evaluation |
+| WebSocket vs SSE for render progress | Phase 5 — evaluate based on render durations seen in practice |
+| IndexedDB caching of AudioBuffers | Phase 2 evaluation — only if cache miss latency is a real problem |
+| Noise gate / background removal | Post-Phase 5 |
+| Multi-window Studio (popout inspector etc.) | Not planned |
+
+---
+
+## 16. Third-Party Library Policy
+
+VibePod is an open-source project. Any third-party library integrated into the codebase must satisfy:
+
+1. **License:** MIT, Apache 2.0, BSD 2/3-Clause, or ISC. No GPL unless the entire application is separately GPL-licensed. No CC-NC.
+2. **Attribution:** Add to a `LICENSES.md` file in the repo root when integrating. Include library name, version, license type, and project URL.
+3. **Size:** For frontend packages, check the bundle size on Bundlephobia before adding. Prefer packages under 20 kB gzipped unless there is no alternative.
+4. **Maintenance:** Prefer packages with active maintenance. Check last commit date and open issue count before adding.
+5. **Source code:** If copying or adapting a snippet (not a full package), add an inline comment with the source URL and license.
+
+---
+
+*This document is the execution specification. The high-level vision lives in `roadmap.md`. When in doubt about scope, refer to Section 4 (Non-Goals) first.*
@@ -0,0 +1,35 @@
+"""Stable, URL-safe ID generation for VibePod entities."""
+
+import secrets
+
+
+def _make_id(prefix: str) -> str:
+    return f"{prefix}_{secrets.token_urlsafe(8)}"
+
+
+def gen_id() -> str:
+    return _make_id("gen")
+
+
+def proj_id() -> str:
+    return _make_id("proj")
+
+
+def asset_id() -> str:
+    return _make_id("asset")
+
+
+def track_id() -> str:
+    return _make_id("track")
+
+
+def clip_id() -> str:
+    return _make_id("clip")
+
+
+def block_id() -> str:
+    return _make_id("block")
+
+
+def take_id() -> str:
+    return _make_id("take")
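
IDs of this shape are easy to generate but need care when parsed, because `secrets.token_urlsafe` uses the URL-safe base64 alphabet, which itself includes `_` and `-`. The sketch below illustrates this. `make_id` mirrors `_make_id` from the file above, while `split_id` is a hypothetical helper that is not part of this commit.

```python
import secrets


def make_id(prefix: str) -> str:
    # Mirrors _make_id in server/ids.py: 8 random bytes become an 11-character token.
    return f"{prefix}_{secrets.token_urlsafe(8)}"


def split_id(value: str) -> tuple[str, str]:
    """Split 'proj_abc' into ('proj', 'abc').

    Splits on the first underscore only, since the random token may contain
    further underscores. Assumes prefixes never contain '_'.
    """
    prefix, sep, token = value.partition("_")
    if not sep or not prefix or not token:
        raise ValueError(f"not a prefixed ID: {value!r}")
    return prefix, token


prefix, token = split_id(make_id("proj"))
```

Eight random bytes give 64 bits of entropy per ID, which is generous for a local project database but worth revisiting if IDs ever become access tokens.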
@@ -0,0 +1,77 @@
+"""Waveform peak generation for VibePod.
+
+Reads a WAV file and produces min/max peak arrays suitable for canvas rendering.
+The output format matches the WaveformPeaks TypeScript type in the frontend.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import numpy as np
+import soundfile as sf
+
+
+def compute_peaks(
+    audio_path: str | Path,
+    samples_per_pixel: int = 256,
+) -> dict:
+    """Compute min/max waveform peaks from a WAV file.
+
+    Args:
+        audio_path: Path to a WAV file (any bit depth, any channel count).
+        samples_per_pixel: How many audio samples are condensed into one peak pair.
+                           At a 24 kHz sample rate, the default of 256 yields
+                           about 94 peak pairs per second of audio.
+
+    Returns:
+        A dict matching the WaveformPeaks TypeScript type:
+        {
+            "sampleRate": int,
+            "durationSecs": float,
+            "channels": int,
+            "samplesPerPixel": int,
+            "length": int,           # number of peak pairs
+            "data": {
+                "min": [float, ...], # per-frame minimum sample, typically in [-1.0, 0.0]
+                "max": [float, ...], # per-frame maximum sample, typically in [0.0, 1.0]
+            }
+        }
+    """
+    if samples_per_pixel <= 0:
+        raise ValueError("samples_per_pixel must be a positive integer")
+
+    samples, sample_rate = sf.read(str(audio_path), dtype="float32", always_2d=True)
+
+    # Mix to mono by averaging channels
+    mono = samples.mean(axis=1)
+    total_samples = len(mono)
+    duration_secs = total_samples / sample_rate
+    channels = samples.shape[1]
+
+    # Pad so total_samples is divisible by samples_per_pixel
+    remainder = total_samples % samples_per_pixel
+    if remainder:
+        pad = samples_per_pixel - remainder
+        mono = np.concatenate([mono, np.zeros(pad, dtype=np.float32)])
+
+    frames = mono.reshape(-1, samples_per_pixel)
+    peak_min = frames.min(axis=1).tolist()
+    peak_max = frames.max(axis=1).tolist()
+    length = len(peak_min)
+
+    return {
+        "sampleRate": int(sample_rate),
+        "durationSecs": round(duration_secs, 4),
+        "channels": int(channels),
+        "samplesPerPixel": samples_per_pixel,
+        "length": length,
+        "data": {
+            "min": [round(float(v), 5) for v in peak_min],
+            "max": [round(float(v), 5) for v in peak_max],
+        },
+    }
+
+
+def write_peaks(audio_path: str | Path, output_path: str | Path, samples_per_pixel: int = 256) -> None:
+    """Compute peaks and write them to a JSON file."""
+    peaks = compute_peaks(audio_path, samples_per_pixel)
+    Path(output_path).write_text(json.dumps(peaks, separators=(",", ":")), encoding="utf-8")
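
To make the bucketing step concrete, here is a dependency-free sketch of the same pad-then-reduce idea applied to a plain list of samples. It is an illustration only, not a replacement for `compute_peaks`, and `bucket_peaks` is a hypothetical name.

```python
def bucket_peaks(samples: list[float], samples_per_pixel: int) -> dict:
    """Reduce samples to one (min, max) pair per block of samples_per_pixel."""
    if samples_per_pixel <= 0:
        raise ValueError("samples_per_pixel must be a positive integer")

    # Zero-pad the tail so the last block is full, as compute_peaks does.
    padded = list(samples)
    remainder = len(padded) % samples_per_pixel
    if remainder:
        padded.extend([0.0] * (samples_per_pixel - remainder))

    mins, maxs = [], []
    for start in range(0, len(padded), samples_per_pixel):
        frame = padded[start : start + samples_per_pixel]
        mins.append(min(frame))
        maxs.append(max(frame))
    return {"length": len(mins), "data": {"min": mins, "max": maxs}}


peaks = bucket_peaks([0.5, -0.25, 0.1], samples_per_pixel=2)
```

Two details show up in the example. The zero padding pulls the final block's minimum to 0.0, and a block whose samples are all positive produces a positive `min`, so a renderer should not assume `min` is always at or below zero.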
@@ -1,8 +1,8 @@
 "use client";

 import { useCallback, useEffect, useRef, useState } from "react";
+import { buildWav, decodeFloat32Chunk, mergeFloat32Arrays, SAMPLE_RATE } from "@/lib/audio/wav";

-const SAMPLE_RATE = 24_000;
 const DEFAULT_PREBUFFER_SECS = 5.0;
 const DEFAULT_REBUFFER_THRESHOLD_SECS = 1.0;
 const DEFAULT_RESUME_THRESHOLD_SECS = 3.0;
@@ -30,53 +30,6 @@ interface UseStreamingGenerationOptions {
  resumeThresholdSecs?: number;
 }

-function mergeFloat32Arrays(chunks: Float32Array<ArrayBuffer>[]): Float32Array<ArrayBuffer> {
-  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
-  const out = new Float32Array(total);
-  let offset = 0;
-  for (const chunk of chunks) {
-    out.set(chunk, offset);
-    offset += chunk.length;
-  }
-  return out;
-}
-
-function buildWav(samples: Float32Array<ArrayBuffer>, sampleRate: number): Blob {
-  const dataSize = samples.length * 4;
-  const buffer = new ArrayBuffer(44 + dataSize);
-  const view = new DataView(buffer);
-  const writeString = (offset: number, value: string) => {
-    for (let i = 0; i < value.length; i += 1) {
-      view.setUint8(offset + i, value.charCodeAt(i));
-    }
-  };
-
-  writeString(0, "RIFF");
-  view.setUint32(4, 36 + dataSize, true);
-  writeString(8, "WAVE");
-  writeString(12, "fmt ");
-  view.setUint32(16, 16, true);
-  view.setUint16(20, 3, true);
-  view.setUint16(22, 1, true);
-  view.setUint32(24, sampleRate, true);
-  view.setUint32(28, sampleRate * 4, true);
-  view.setUint16(32, 4, true);
-  view.setUint16(34, 32, true);
-  writeString(36, "data");
-  view.setUint32(40, dataSize, true);
-  new Float32Array(buffer, 44).set(samples);
-  return new Blob([buffer], { type: "audio/wav" });
-}
-
-function decodeFloat32Chunk(data: string): Float32Array<ArrayBuffer> {
-  const raw = atob(data);
-  const bytes = new Uint8Array(raw.length);
-  for (let i = 0; i < raw.length; i += 1) {
-    bytes[i] = raw.charCodeAt(i);
-  }
-  return new Float32Array(bytes.buffer as ArrayBuffer);
-}
-
 export function useStreamingGeneration({
  onLog,
  onStart,
@@ -0,0 +1,56 @@
+export const SAMPLE_RATE = 24_000;
+
+export function decodeFloat32Chunk(data: string): Float32Array<ArrayBuffer> {
+  const raw = atob(data);
+  const bytes = new Uint8Array(raw.length);
+  for (let i = 0; i < raw.length; i += 1) {
+    bytes[i] = raw.charCodeAt(i);
+  }
+  return new Float32Array(bytes.buffer as ArrayBuffer);
+}
+
+export function mergeFloat32Arrays(
+  chunks: Float32Array<ArrayBuffer>[]
+): Float32Array<ArrayBuffer> {
+  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
+  const out = new Float32Array(total);
+  let offset = 0;
+  for (const chunk of chunks) {
+    out.set(chunk, offset);
+    offset += chunk.length;
+  }
+  return out;
+}
+
+/** Builds a float32 RIFF/WAV Blob from raw PCM samples. */
+export function buildWav(
+  samples: Float32Array<ArrayBuffer>,
+  sampleRate: number
+): Blob {
+  const dataSize = samples.length * 4;
+  const buffer = new ArrayBuffer(44 + dataSize);
+  const view = new DataView(buffer);
+
+  const writeString = (offset: number, value: string) => {
+    for (let i = 0; i < value.length; i += 1) {
+      view.setUint8(offset + i, value.charCodeAt(i));
+    }
+  };
+
+  writeString(0, "RIFF");
+  view.setUint32(4, 36 + dataSize, true);
+  writeString(8, "WAVE");
+  writeString(12, "fmt ");
+  view.setUint32(16, 16, true);
+  view.setUint16(20, 3, true); // format tag 3 = IEEE float
+  view.setUint16(22, 1, true); // mono
+  view.setUint32(24, sampleRate, true);
+  view.setUint32(28, sampleRate * 4, true); // byte rate
+  view.setUint16(32, 4, true); // block align
+  view.setUint16(34, 32, true); // bits per sample
+  writeString(36, "data");
+  view.setUint32(40, dataSize, true);
+  new Float32Array(buffer, 44).set(samples);
+
+  return new Blob([buffer], { type: "audio/wav" });
+}
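
For cross-checking the header from the Python side, the same 44-byte layout can be written with `struct`. The sketch below mirrors the field order of `buildWav` above (mono, 32-bit float, little-endian). `build_float32_wav` is a hypothetical name, and nothing in this commit requires a Python writer.

```python
import struct

# RIFF header, fmt chunk, and data chunk header: 44 bytes, little-endian.
HEADER_FORMAT = "<4sI4s4sIHHIIHH4sI"


def build_float32_wav(samples: list[float], sample_rate: int) -> bytes:
    """Build a mono float32 WAV byte string with the same layout as buildWav."""
    data = struct.pack(f"<{len(samples)}f", *samples)
    header = struct.pack(
        HEADER_FORMAT,
        b"RIFF", 36 + len(data), b"WAVE",
        b"fmt ", 16,          # fmt chunk size
        3,                    # format tag 3 = IEEE float
        1,                    # channels (mono)
        sample_rate,
        sample_rate * 4,      # byte rate = sample rate * block align
        4,                    # block align = channels * bytes per sample
        32,                   # bits per sample
        b"data", len(data),
    )
    return header + data


wav = build_float32_wav([0.0, 0.5, -0.5], 24_000)
fields = struct.unpack(HEADER_FORMAT, wav[:44])
```

Unpacking the first 44 bytes of a file produced by the TypeScript helper with the same format string should yield the same field values, which gives a quick way to confirm both sides agree.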