chore(phase-0): stabilise foundation for Studio build

- Extract WAV assembly (buildWav, mergeFloat32Arrays, decodeFloat32Chunk,
  SAMPLE_RATE) into web/lib/audio/wav.ts so it can be reused by the
  Studio playback engine and library waveform previews
- Add server/waveform.py with compute_peaks() / write_peaks() — reads
  any WAV, mixes to mono, returns min/max peak arrays matching the
  WaveformPeaks TypeScript type
- Add server/ids.py with prefixed URL-safe ID helpers (gen_id, proj_id,
  asset_id, etc.) using stdlib secrets — no new dependency
- Add docs/studio-build-plan.md — full execution spec covering stack
  decisions, data models, API contract, component hierarchy, phase
  breakdown and acceptance criteria
- Ignore data/ directory (generated audio, waveforms, SQLite DB)
2026-05-02 17:24:45 +01:00
parent 0236807928
commit 47e0c7e512
6 changed files with 1098 additions and 48 deletions
+4
@@ -20,8 +20,12 @@ web/tsconfig.tsbuildinfo
web/next-env.d.ts
web/node_modules/
# Generated data (audio files, waveforms, SQLite DB)
data/
# OS
.DS_Store
Thumbs.db
.vscode/settings.json
.claude/settings.local.json
*.txt
+925
@@ -0,0 +1,925 @@
# VibePod Studio — Build Plan
**Version:** 1.0
**Authors:** LyAhn + Claude (Anthropic) + Codex
**Date:** 2026-05-02
**Status:** Active
---
## Table of Contents
1. [Product Goal](#1-product-goal)
2. [Current Architecture](#2-current-architecture)
3. [Locked Technical Decisions](#3-locked-technical-decisions)
4. [Non-Goals](#4-non-goals)
5. [Data Models](#5-data-models)
6. [Storage Layout](#6-storage-layout)
7. [API Contract](#7-api-contract)
8. [Frontend Routes](#8-frontend-routes)
9. [Component Hierarchy](#9-component-hierarchy)
10. [Timeline Rendering Model](#10-timeline-rendering-model)
11. [Playback Model](#11-playback-model)
12. [Export Rendering Model](#12-export-rendering-model)
13. [Phase Breakdown](#13-phase-breakdown)
14. [Acceptance Criteria](#14-acceptance-criteria)
15. [Deferred Decisions](#15-deferred-decisions)
16. [Third-Party Library Policy](#16-third-party-library-policy)
---
## 1. Product Goal
VibePod becomes a **script-aware AI podcast creation studio** where users can generate, arrange, edit, regenerate, and export podcast episodes from written scripts.
The headline feature that differentiates VibePod from generic audio editors:
> Every generated clip knows where it came from in the script and can be regenerated, replaced, or compared — without touching the original audio file.
The existing fast-generation page stays as the quick path. Studio grows alongside it.
---
## 2. Current Architecture
### What exists today
| Layer | Details |
|---|---|
| Frontend | Next.js 15 · React 19 · TypeScript 5 · Tailwind CSS 4 · App Router |
| Backend | FastAPI · uvicorn · Python 3.10+ |
| Model | VibeVoice Realtime 0.5B via JezzWTF fork |
| Audio format | 24 kHz · float32 PCM · streamed as SSE · assembled to WAV in browser |
| State management | React `useReducer` in `page.tsx` — no Zustand, no Redux |
| Persistence | None — generated audio is an ephemeral browser Blob |
| Packages | pnpm workspace (frontend) · uv (Python) |
### Current generation flow
```
User submits script
→ POST /api/generate (Next.js proxy)
→ POST /generate (FastAPI)
→ VibeVoice inference thread
→ SSE stream of base64 float32 PCM chunks
→ Browser decodes chunks, adaptive buffering, live Web Audio playback
→ On complete: chunks merged, RIFF/WAV header prepended, Blob URL created
→ User can play back or download the WAV
```
### Key files
```
web/
app/
page.tsx main generation UI (AppState via useReducer)
api/generate/route.ts SSE proxy to FastAPI
api/health/route.ts health check proxy
components/
Header.tsx
TextInputPanel.tsx
AudioPlayer.tsx
GenerationControls.tsx
StatusLog.tsx
hooks/
useStreamingGeneration.ts core streaming + WAV assembly
useAudioPlayer.ts HTML5 audio element wrapper
server/
vibevoice_server.py entire FastAPI app (972 lines)
start.sh launcher (CPU/CUDA detection, uv sync, uvicorn)
download_model.py HuggingFace prefetch
docs/
studio-build-plan.md this file
roadmap.md high-level phase vision
DESIGN.md brand + design system (colours, type, spacing)
AGENTS.md AI agent / CI guide
```
---
## 3. Locked Technical Decisions
These decisions are final and must not be revisited without explicit agreement. New phases build on them.
### 3.1 Rendering approach — Hybrid DOM + Canvas 2D
The Studio uses a **hybrid rendering model**:
| Layer | Technology |
|---|---|
| App shell, layout, sidebars, panels, inspector | React + Tailwind CSS (DOM) |
| Track headers, controls, transport, modals | React + Tailwind CSS (DOM) |
| Timeline clip containers + positioning | React + CSS (`left`/`width` from time → pixels) |
| Waveform rendering inside clips | Raw Canvas 2D |
| Timeline ruler | Raw Canvas 2D |
| Playhead overlay | Absolutely positioned DOM element (style written directly per frame, see Section 10) |
| Browser playback preview | Web Audio API |
| Final render + export | Python + FFmpeg (server-side only) |
**Why not full canvas:** Every non-waveform element — buttons, text, inputs, scroll, keyboard focus, accessibility — works better in DOM. Reimplementing all of that in canvas is wasted effort.
**Why not pure CSS:** Waveform peaks are thousands of pixel-height values per clip. DOM representation would be extremely slow. Canvas draws them in a tight loop in milliseconds.
**Why not WaveSurfer.js as the core:** WaveSurfer owns playback and its own event model. VibePod Studio needs its own clip model, its own timeline, and eventually its own multi-track playback. Adapting around WaveSurfer's assumptions creates friction. It may be used for standalone audio preview components (e.g., the generation page player), not the Studio timeline.
**Why not Konva.js:** Overkill for v1. Konva is designed for fully canvas-based scenes (whiteboards, diagrams). VibePod's timeline is mostly DOM. The additional mental model (Stage/Layer/Group/Transformer) is not justified unless interaction complexity grows significantly beyond v1.
### 3.2 Frontend stack — No additions without justification
Build inside the existing stack. New packages require a written reason in this document.
**Approved additions (to be installed when their phase begins):**
| Package | Purpose | Phase |
|---|---|---|
| `zustand` | Studio editor state | Phase 2 |
| `@dnd-kit/core` + `@dnd-kit/utilities` | Clip drag-and-drop in timeline | Phase 2 |
| `better-sqlite3` | SQLite for job and project persistence | Phase 1 |
**Conditionally approved (evaluate at phase start):**
| Package | Purpose | Condition |
|---|---|---|
| `framer-motion` | Clip move animations | Only if DnD-kit transitions feel rough after prototype |
| `@radix-ui/react-*` | Accessible modal/dropdown primitives | Only if building custom is taking too long |
### 3.3 Backend stack — Python + FFmpeg for all rendering
Browser-side audio mixing is only for **preview**. Export always goes to the Python backend.
**Approved backend additions:**
| Package | Purpose | Phase |
|---|---|---|
| `soundfile` | Already present — WAV read/write | Phase 1 |
| `numpy` | Audio array manipulation | Phase 1 |
| `pydub` | Audio trimming, mixing, concatenation | Phase 2 |
| `pyloudnorm` | Loudness normalisation (LUFS) | Phase 5 |
FFmpeg must be available on the server host. The render endpoint assumes `ffmpeg` is on PATH.
### 3.4 State management — Zustand for Studio, useReducer stays on generation page
The existing generation page uses `useReducer` and works well. Do not refactor it.
Studio requires a shared store that multiple components read and write (timeline, inspector, transport, script panel). Zustand is the right tool. It is lightweight, does not require providers, and handles editor-style state (undo stacks, selection, playhead) cleanly.
### 3.5 Database — SQLite from Phase 1
Do not start with flat JSON files. SQLite is still a single file, requires no server process, and gives proper queries, transactions, and schema migrations from day one. Use `better-sqlite3` in the Next.js API layer.
Schema lives in `web/lib/db/schema.sql`. Migrations are numbered SQL files in `web/lib/db/migrations/`.
### 3.6 Audio sample rate
All generated audio is 24 kHz float32 mono (VibeVoice output). Studio renders at 44.1 kHz stereo WAV or 48 kHz for podcast MP3 export. The render pipeline handles resampling.
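The resampling step implies fixed rational ratios between the 24 kHz source and the two render rates. A minimal sketch of that arithmetic follows; the function names are hypothetical and the snippet only computes ratios and lengths, it does not resample audio.

```python
from fractions import Fraction

SOURCE_RATE = 24_000  # VibeVoice output rate, per this section


def resample_ratio(target_rate: int, source_rate: int = SOURCE_RATE) -> Fraction:
    """Exact target/source ratio as a reduced fraction (up / down factors)."""
    return Fraction(target_rate, source_rate)


def resampled_length(num_samples: int, target_rate: int,
                     source_rate: int = SOURCE_RATE) -> int:
    """Output sample count after resampling, rounded to the nearest sample."""
    return round(num_samples * resample_ratio(target_rate, source_rate))


if __name__ == "__main__":
    for rate in (44_100, 48_000):
        ratio = resample_ratio(rate)
        # One second of 24 kHz audio becomes `rate` samples at the target rate.
        print(rate, ratio.numerator, ratio.denominator, resampled_length(24_000, rate))
```

The 48 kHz path is an exact 2x upsample, while 44.1 kHz needs a 147/80 ratio, which may matter when choosing a resampler for the render pipeline.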
---
## 4. Non-Goals
These will not be built and must not creep in:
- **Real-time collaborative editing** — single-user per project only
- **Cloud sync or user accounts** — local-first, no auth system
- **MIDI or music composition** — audio clips only, no MIDI tracks
- **Plugin system** — no third-party audio plugin API
- **Browser-side FFmpeg (ffmpeg.wasm)** — all rendering is server-side
- **Mobile / responsive Studio layout** — Studio targets desktop viewport only
- **Offline PWA** — the server must be running; no service worker caching of model output
- **Real-time voice cloning** — out of scope until VibeVoice supports it cleanly
- **Exporting to streaming platforms** — export to file only; no Spotify/Apple Podcasts upload
---
## 5. Data Models
### 5.1 Generation job
```ts
type GenerationJob = {
id: string; // "gen_<nanoid>"
createdAt: string; // ISO 8601
status: "pending" | "generating" | "complete" | "error" | "cancelled";
script: string;
speaker: string;
cfgScale: number;
inferenceSteps: number;
durationSecs: number | null; // set on complete
sampleRate: number; // always 24000
audioPath: string | null; // relative to data/generations/<id>/audio.wav
waveformPath: string | null; // relative to data/generations/<id>/waveform.json
errorMessage: string | null;
};
```
### 5.2 Studio project
```ts
type StudioProject = {
id: string; // "proj_<nanoid>"
name: string;
createdAt: string;
updatedAt: string;
script: ScriptDocument;
assets: AudioAsset[];
tracks: Track[];
edits: EditOperation[];
renderSettings: RenderSettings;
};
```
### 5.3 Script document
```ts
type ScriptDocument = {
blocks: ScriptBlock[];
};
type ScriptBlock = {
id: string; // "block_<nanoid>"
speakerId: string;
text: string;
order: number;
generatedAssetId: string | null;
timelineClipIds: string[];
};
```
### 5.4 Audio asset
```ts
type AudioAsset = {
id: string; // "asset_<nanoid>"
projectId: string | null; // null = generation library asset
kind: "generated_voice" | "upload" | "music" | "sfx" | "render";
filePath: string;
durationSecs: number;
sampleRate: number;
channels: number;
waveformPath: string | null;
source: {
generationJobId?: string;
scriptBlockId?: string;
providerId?: string;
modelId?: string;
voiceId?: string;
settings?: Record<string, unknown>;
} | null;
};
```
### 5.5 Track
```ts
type Track = {
id: string; // "track_<nanoid>"
name: string;
type: "voice" | "music" | "sfx" | "ambience" | "master";
order: number;
muted: boolean;
solo: boolean;
gainDb: number;
clips: TimelineClip[];
};
```
### 5.6 Timeline clip
```ts
type TimelineClip = {
id: string; // "clip_<nanoid>"
assetId: string;
trackId: string;
startTime: number; // seconds from timeline origin
sourceStart: number; // trim start within source asset (seconds)
sourceEnd: number; // trim end within source asset (seconds)
gainDb: number;
fadeInMs: number;
fadeOutMs: number;
linkedScriptRange: {
blockId: string;
startChar: number;
endChar: number;
} | null;
};
```
### 5.7 Edit operation (non-destructive EDL)
```ts
type EditOperation =
| { type: "split"; clipId: string; at: number }
| { type: "trim"; clipId: string; sourceStart: number; sourceEnd: number }
| { type: "move"; clipId: string; startTime: number; trackId: string }
| { type: "gain"; clipId: string; gainDb: number }
| { type: "fade"; clipId: string; fadeInMs: number; fadeOutMs: number }
| { type: "delete"; clipId: string }
| { type: "mute"; trackId: string; muted: boolean }
| { type: "solo"; trackId: string; solo: boolean };
```
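As an illustration of why these operations are non-destructive, here is a sketch of how a `split` might be applied. It assumes `at` is a timeline time in seconds (the type above does not say) and uses a hypothetical, trimmed-down `Clip` dataclass rather than the full `TimelineClip`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Clip:
    id: str
    start_time: float    # seconds from timeline origin
    source_start: float  # trim start within the source asset
    source_end: float    # trim end within the source asset


def split_clip(clip: Clip, at: float, new_id: str) -> tuple[Clip, Clip]:
    """Split at timeline time `at`; both halves keep pointing at the same asset."""
    duration = clip.source_end - clip.source_start
    if not clip.start_time < at < clip.start_time + duration:
        raise ValueError("split point must fall strictly inside the clip")
    # Convert the timeline position into a position inside the source audio.
    cut = clip.source_start + (at - clip.start_time)
    left = replace(clip, source_end=cut)
    right = replace(clip, id=new_id, start_time=at, source_start=cut)
    return left, right
```

Only trim points and IDs change; the audio file itself is never rewritten, which is what makes undo cheap.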
### 5.8 Take (regeneration history)
```ts
type Take = {
id: string; // "take_<nanoid>"
scriptBlockId: string;
assetId: string;
voiceId: string;
modelId: string;
settings: Record<string, unknown>;
createdAt: string;
rating: number | null; // 1-5 stars, optional
notes: string | null;
isActive: boolean; // true = the one placed on the timeline
};
```
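The `isActive` flag suggests that one take per script block is the one on the timeline. A sketch of enforcing that invariant follows; the single-active-take rule is an assumption read from the comment above, and the reduced `Take` dataclass and `set_active_take` helper are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Take:
    id: str
    script_block_id: str
    is_active: bool


def set_active_take(takes: list[Take], block_id: str, take_id: str) -> list[Take]:
    """Return a new list where `take_id` is the only active take of `block_id`."""
    if not any(t.id == take_id and t.script_block_id == block_id for t in takes):
        raise KeyError(f"{take_id} is not a take of {block_id}")
    return [
        # Takes of other blocks are passed through untouched.
        replace(t, is_active=(t.id == take_id)) if t.script_block_id == block_id else t
        for t in takes
    ]
```

Returning a new list rather than mutating in place keeps the previous state available for undo.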
### 5.9 Render settings
```ts
type RenderSettings = {
format: "wav" | "mp3";
sampleRate: 44100 | 48000;
bitrate: number | null; // kbps, null for WAV
normaliseLoudness: boolean;
lufsTarget: number; // default -16 LUFS for podcast
metadata: {
title: string;
artist: string;
album: string;
episodeNumber: number | null;
description: string;
} | null;
};
```
### 5.10 Waveform peaks
```ts
type WaveformPeaks = {
sampleRate: number;
durationSecs: number;
channels: number;
samplesPerPixel: number;
length: number;
data: {
min: number[]; // range -1.0 to 0.0
max: number[]; // range 0.0 to 1.0
};
};
```
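The `length` field follows from the sample count and `samplesPerPixel`: `server/waveform.py` zero-pads the tail to a whole frame, so the count rounds up. A small worked sketch (the helper name `peak_count` is hypothetical):

```python
import math


def peak_count(total_samples: int, samples_per_pixel: int = 256) -> int:
    """Number of min/max pairs once the tail is padded to a full frame."""
    return math.ceil(total_samples / samples_per_pixel)


if __name__ == "__main__":
    # 10 s of 24 kHz mono audio: 240000 / 256 = 937.5, rounded up.
    print(peak_count(10 * 24_000))
```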
---
## 6. Storage Layout
```
data/
  generations/
    gen_<id>/
      audio.wav                raw float32 WAV at 24 kHz
      waveform.json            WaveformPeaks at 256 samples/pixel
      metadata.json            GenerationJob fields (denormalised)
  projects/
    proj_<id>/
      project.json             full StudioProject serialised
      assets/
        asset_<id>.wav         uploaded or imported audio
      renders/
        render_<timestamp>.wav exported renders
        render_<timestamp>.mp3
  db/
    vibepod.db                 SQLite database
```
The SQLite database is the source of truth for IDs, status, and relationships. The files on disk (WAV audio and waveform JSON) are the source of truth for audio and waveform data.
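A sketch of path helpers that mirror this layout, so callers never assemble paths by hand. The helper names and the `DATA_ROOT` default are assumptions, not part of the plan.

```python
from pathlib import Path

DATA_ROOT = Path("data")  # assumed to be resolved relative to the repo root


def generation_paths(gen_id: str, root: Path = DATA_ROOT) -> dict[str, Path]:
    """Files belonging to one generation; `gen_id` already carries the gen_ prefix."""
    base = root / "generations" / gen_id
    return {
        "audio": base / "audio.wav",
        "waveform": base / "waveform.json",
        "metadata": base / "metadata.json",
    }


def render_path(proj_id: str, timestamp: str, ext: str, root: Path = DATA_ROOT) -> Path:
    """Location of an exported render inside a project directory."""
    return root / "projects" / proj_id / "renders" / f"render_{timestamp}.{ext}"
```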
---
## 7. API Contract
All new routes are under `/api/`. The Next.js app proxies to FastAPI only for generation and health. Persistence routes are handled directly by Next.js API routes talking to SQLite.
### 7.1 Generation (existing, extended)
```
POST /api/generate start streaming generation (existing)
GET /api/health server health check (existing)
```
### 7.2 Generation library (Phase 1)
```
GET /api/generations list all jobs, newest first
query: ?limit=20&offset=0&status=complete
GET /api/generations/:id get single job metadata
GET /api/generations/:id/audio stream WAV file
GET /api/generations/:id/waveform get WaveformPeaks JSON
DELETE /api/generations/:id delete job and files
```
### 7.3 Projects (Phase 2)
```
POST /api/projects create project
GET /api/projects list projects
GET /api/projects/:id get project with full StudioProject
PUT /api/projects/:id save/autosave project
DELETE /api/projects/:id delete project and assets
```
### 7.4 Project assets (Phase 2)
```
POST /api/projects/:id/assets upload audio file or import from generation
GET /api/projects/:id/assets/:aid get asset metadata
DELETE /api/projects/:id/assets/:aid remove asset
```
### 7.5 Takes (Phase 3)
```
GET /api/projects/:id/takes/:blockId list takes for a script block
POST /api/projects/:id/takes/:blockId save new take
PUT /api/projects/:id/takes/:blockId/:takeId set active take
DELETE /api/projects/:id/takes/:blockId/:takeId delete take
```
### 7.6 Render (Phase 5)
```
POST /api/projects/:id/render start render job (sends EDL to Python)
GET /api/projects/:id/renders list render history
GET /api/projects/:id/renders/:rid poll render status
GET /api/projects/:id/renders/:rid/download stream rendered file
```
All error responses follow:
```json
{ "error": "human-readable message", "code": "SNAKE_CASE_CODE" }
```
---
## 8. Frontend Routes
```
/ generation page (current fast path, unchanged)
/library generation library — browse and replay saved jobs
/projects project dashboard — create or open Studio projects
/studio/:projectId Studio workspace
/studio/new redirect: creates project + navigates to /studio/:id
```
---
## 9. Component Hierarchy
### Generation page (existing — do not restructure without reason)
```
page.tsx
Header
TextInputPanel
GenerationControls
StatusLog
AudioPlayer
```
### Library page (Phase 1)
```
/app/library/page.tsx
Header
LibraryPage
GenerationCard[]
WaveformPreview (small static canvas render of peaks)
GenerationMetadata (speaker, duration, date, settings)
GenerationActions (play, download, open in studio, delete)
```
### Projects dashboard (Phase 2)
```
/app/projects/page.tsx
Header
ProjectsPage
NewProjectButton
ProjectCard[]
ProjectThumbnail
ProjectMetadata
ProjectActions
```
### Studio workspace (Phase 2+)
```
/app/studio/[projectId]/page.tsx
StudioShell
StudioTopBar
ProjectNameInput
UndoButton / RedoButton
SaveStatus
ExportButton
StudioBody
ProjectSidebar
SidebarTabs (Script | Voices | Media | Effects | Templates)
ScriptPanel (Phase 3)
VoicesPanel
MediaBin (Phase 2)
EffectsPanel (Phase 4)
StudioMain
TimelineRuler (canvas)
TimelineArea
TimelineTrack[] (one per track)
TrackHeader
TrackClipArea
TimelineClip[]
WaveformCanvas (canvas)
ClipLabel
InspectorPanel
ClipInspector (when clip selected)
ClipMetadata
GainControl
FadeControls
RegenerateButton (Phase 3)
TakeStack (Phase 3)
TrackInspector (when track selected)
EmptyInspector (nothing selected)
TransportBar
PlayPauseButton
StopButton
PlayheadTimeDisplay
ZoomControls
SnapToggle
```
---
## 10. Timeline Rendering Model
### Clip positioning
Timeline clips are DOM elements with CSS `position: absolute`. Position and size derive from the project's pixels-per-second zoom level:
```ts
const left = clip.startTime * pixelsPerSecond;
const width = (clip.sourceEnd - clip.sourceStart) * pixelsPerSecond;
```
The `pixelsPerSecond` value lives in Zustand and changes with zoom. All clip positions recompute via derived selectors.
### Waveform canvas
Each `TimelineClip` contains a `<canvas>` element that receives peak data as a prop. The renderer draws one vertical line per pixel column:
```ts
function drawWaveform(
  ctx: CanvasRenderingContext2D,
  peaks: WaveformPeaks,
  width: number,
  height: number,
  color: string
): void {
  const midY = height / 2;
  ctx.clearRect(0, 0, width, height);
  ctx.strokeStyle = color;
  ctx.lineWidth = 1;
  for (let x = 0; x < width; x++) {
    const peakIndex = Math.floor((x / width) * peaks.length);
    // Canvas y grows downward: negative minima land below the centre line,
    // positive maxima above it.
    const minY = midY - peaks.data.min[peakIndex] * midY;
    const maxY = midY - peaks.data.max[peakIndex] * midY;
    ctx.beginPath();
    ctx.moveTo(x + 0.5, minY);
    ctx.lineTo(x + 0.5, maxY);
    ctx.stroke();
  }
}
```
The canvas rerenders when `width`, `peaks`, or zoom changes. It does not rerender on playback.
### Playhead
The playhead is a separate absolutely-positioned element (thin vertical line) that overlays the entire timeline area. Its `left` position is updated via `requestAnimationFrame` during playback — no React state updates, direct DOM style write.
### Timeline ruler
A `<canvas>` element spanning the full timeline width. Draws time markers at intervals derived from current zoom level. Rerenders on zoom change only.
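One way to derive marker intervals from zoom is to pick the smallest step from a fixed ladder that keeps labels a minimum distance apart. The ladder values, the 80 px spacing, and the function names below are all hypothetical; the arithmetic is shown in Python for brevity and would port directly to the canvas code.

```python
CANDIDATE_STEPS_SECS = (0.1, 0.5, 1, 5, 10, 30, 60, 300, 600)  # assumed ladder


def pick_tick_interval(pixels_per_second: float, min_spacing_px: float = 80.0) -> float:
    """Smallest step whose on-screen spacing is at least `min_spacing_px`."""
    for step in CANDIDATE_STEPS_SECS:
        if step * pixels_per_second >= min_spacing_px:
            return step
    return CANDIDATE_STEPS_SECS[-1]


def tick_positions(width_px: float, pixels_per_second: float) -> list[tuple[float, float]]:
    """(time in seconds, x in pixels) for every marker that fits in the ruler."""
    step = pick_tick_interval(pixels_per_second)
    count = int(width_px / (step * pixels_per_second)) + 1
    return [(i * step, i * step * pixels_per_second) for i in range(count)]
```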
---
## 11. Playback Model
Studio preview uses the **Web Audio API**, not HTML5 `<audio>`.
### Graph
```
AudioBufferSourceNode (per clip)
→ GainNode (clip gain + fades)
→ GainNode (track gain)
→ GainNode (master gain)
→ AudioContext.destination
```
### Scheduling
On play, the engine:
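1. Reads current `playheadTime` from Zustand and fixes a shared anchor `startTime = audioCtx.currentTime + 0.01`.
2. For each clip where `clip.startTime + (clip.sourceEnd - clip.sourceStart) > playheadTime`:
- Calculates `offset = clip.sourceStart + max(0, playheadTime - clip.startTime)`, so a clip that has not started yet plays from its trim start
- Calculates `when = startTime + max(0, clip.startTime - playheadTime)`, so clips ahead of the playhead are delayed rather than started immediately
- Calls `source.start(when, offset, clip.sourceEnd - offset)`
3. Because every `when` is measured from the same anchor, all sources stay in sync.
On stop or seek, all sources are stopped and disconnected and the graph is torn down. A new graph is built on the next play call.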
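The per-clip scheduling arithmetic can be sketched as a pure function. This assumes clips that begin after the playhead should be delayed rather than started at once; the reduced `Clip` dataclass and the `schedule` helper are hypothetical, and the returned tuple corresponds to the `when`, `offset`, and `duration` arguments of `AudioBufferSourceNode.start()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clip:
    start_time: float    # seconds from timeline origin
    source_start: float  # trim start within the source asset
    source_end: float    # trim end within the source asset


def schedule(clip: Clip, playhead: float,
             anchor: float) -> Optional[tuple[float, float, float]]:
    """Return (when, offset, duration), or None if the clip ended before the playhead."""
    length = clip.source_end - clip.source_start
    elapsed = playhead - clip.start_time  # positive when the playhead is inside the clip
    if elapsed >= length:
        return None
    skipped = max(0.0, elapsed)          # audio already behind the playhead
    when = anchor + max(0.0, -elapsed)   # delay clips that start later
    offset = clip.source_start + skipped
    return when, offset, length - skipped
```

Keeping this free of Web Audio objects makes the timing logic easy to unit test separately from the audio graph.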
### Audio buffer cache
Fetched WAV files are decoded to `AudioBuffer` via `AudioContext.decodeAudioData()` and cached by asset ID. Cache is invalidated if the asset is deleted or replaced.
### Fade implementation
Fades are implemented as `AudioParam` ramps on the clip GainNode:
```ts
gainNode.gain.setValueAtTime(0, startTime);
gainNode.gain.linearRampToValueAtTime(clipGain, startTime + fadeInSecs);
gainNode.gain.setValueAtTime(clipGain, endTime - fadeOutSecs);
gainNode.gain.linearRampToValueAtTime(0, endTime);
```
---
## 12. Export Rendering Model
When the user triggers export, the frontend sends a render request to the Python backend. The browser is never involved in mixing.
### Request payload
```ts
type RenderRequest = {
projectId: string;
tracks: Track[];
clips: TimelineClip[];
assets: Array<{ id: string; filePath: string }>;
settings: RenderSettings;
};
```
### Python render pipeline
```python
# Pseudocode; the actual implementation lives in server/render.py
def render_project(req: "RenderRequest") -> str:
    # 1. Load all source audio files into numpy arrays
    # 2. Determine total timeline duration
    # 3. Create one output buffer (zeros) per track at the target sample rate
    # 4. For each clip (sorted by startTime):
    #    a. Look up the clip's source audio
    #    b. Resample to target sample rate if needed
    #    c. Apply trim (sourceStart → sourceEnd)
    #    d. Apply gain (dB → linear)
    #    e. Apply fade in/out (linear ramp)
    #    f. Place at clip.startTime offset in its track's buffer
    # 5. Apply track gain to each track's summed signal
    # 6. Sum all tracks into master buffer
    # 7. Apply master gain
    # 8. If normaliseLoudness: apply pyloudnorm to target LUFS
    # 9. Export WAV or MP3 via soundfile / ffmpeg
    # 10. Write to data/projects/<id>/renders/render_<timestamp>.<ext>
    # 11. Return file path
    raise NotImplementedError
```
Render runs in a background thread. The client polls `GET /api/projects/:id/renders/:rid` for status.
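Steps 4d to 4f of the pipeline (gain, fades, placement) can be illustrated with a dependency-free sketch. Plain Python lists stand in for the numpy arrays the real renderer would presumably use, and all function names here are hypothetical.

```python
def db_to_linear(gain_db: float) -> float:
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20)


def apply_gain_and_fades(samples: list[float], rate: int, gain_db: float,
                         fade_in_ms: float, fade_out_ms: float) -> list[float]:
    """Scale by gain and apply linear fade-in / fade-out envelopes."""
    gain = db_to_linear(gain_db)
    n = len(samples)
    fade_in = min(n, int(rate * fade_in_ms / 1000))
    fade_out = min(n, int(rate * fade_out_ms / 1000))
    out = []
    for i, sample in enumerate(samples):
        env = 1.0
        if i < fade_in:
            env = min(env, i / fade_in)               # ramp up from 0
        if i >= n - fade_out:
            env = min(env, (n - 1 - i) / fade_out)    # ramp down to 0
        out.append(sample * gain * env)
    return out


def mix_into(buffer: list[float], clip: list[float], start_secs: float, rate: int) -> None:
    """Add `clip` into `buffer` at the clip's timeline offset (in place)."""
    start = int(round(start_secs * rate))
    for i, sample in enumerate(clip):
        if start + i < len(buffer):
            buffer[start + i] += sample
```

Summing into the buffer, rather than overwriting, is what lets overlapping clips on different tracks mix.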
---
## 13. Phase Breakdown
### Phase 0 — Stabilise (current state → pre-Phase 1)
**Goal:** Clean foundation. No new features.
Tasks:
- [ ] Extract WAV assembly from `useStreamingGeneration.ts` into `web/lib/audio/wav.ts`
- [ ] Extract waveform peak generation into `server/waveform.py`
- [ ] Confirm generation cancellation works cleanly (stream abort + server cancel_event)
- [ ] Add prefixed ID helpers to the backend for stable generation IDs (`server/ids.py`, using stdlib `secrets` instead of a `nanoid` dependency)
- [ ] Add `data/` directory to `.gitignore`
**Acceptance:** WAV assembly is a pure function with unit tests. Generation IDs are stable.
---
### Phase 1 — Persistent Generation Library
**Goal:** Every generation is saved. Users can browse, play, and download past generations.
**Backend tasks:**
- [ ] Add SQLite setup (`data/db/vibepod.db`, schema migration 001)
- [ ] `generations` table: `id, created_at, status, script, speaker, cfg_scale, inference_steps, duration_secs, sample_rate, audio_path, waveform_path, error_message`
- [ ] On generation complete: save WAV to `data/generations/<id>/audio.wav`
- [ ] On generation complete: compute and save waveform peaks to `data/generations/<id>/waveform.json`
- [ ] Implement `GET /api/generations` (list, paginated)
- [ ] Implement `GET /api/generations/:id` (single)
- [ ] Implement `GET /api/generations/:id/audio` (stream file)
- [ ] Implement `GET /api/generations/:id/waveform` (peaks JSON)
- [ ] Implement `DELETE /api/generations/:id` (delete row + files)
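The `generations` table and the paginated list query from the tasks above could look roughly like this. The column types, constraints, and the `list_generations` helper are assumptions; Python's stdlib `sqlite3` is used here only to exercise the SQL, whereas the plan itself calls for `better-sqlite3` in the Next.js layer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS generations (
    id              TEXT PRIMARY KEY,
    created_at      TEXT NOT NULL,
    status          TEXT NOT NULL,
    script          TEXT NOT NULL,
    speaker         TEXT NOT NULL,
    cfg_scale       REAL NOT NULL,
    inference_steps INTEGER NOT NULL,
    duration_secs   REAL,
    sample_rate     INTEGER NOT NULL DEFAULT 24000,
    audio_path      TEXT,
    waveform_path   TEXT,
    error_message   TEXT
);
"""


def list_generations(conn: sqlite3.Connection, limit: int = 20, offset: int = 0,
                     status=None) -> list:
    """Newest-first page of (id, status) rows, optionally filtered by status."""
    sql = "SELECT id, status FROM generations"
    params: list = []
    if status is not None:
        sql += " WHERE status = ?"
        params.append(status)
    sql += " ORDER BY created_at DESC LIMIT ? OFFSET ?"
    params += [limit, offset]
    return conn.execute(sql, params).fetchall()
```

Storing `created_at` as ISO 8601 text keeps lexical and chronological order aligned, so `ORDER BY created_at DESC` yields newest first.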
**Frontend tasks:**
- [ ] Install `better-sqlite3` + types
- [ ] Create `web/lib/db/` — schema, migration runner, query helpers
- [ ] Create `/library` route and `LibraryPage` component
- [ ] `GenerationCard` component: waveform preview canvas, metadata, play/download/delete actions
- [ ] `WaveformPreview` component: draws peaks on canvas (static, no playback)
- [ ] Mini audio player for library card playback (reuse `useAudioPlayer` hook)
- [ ] Link "Open in Studio" button (navigates to `/studio/new?fromGeneration=<id>`)
- [ ] Add "Library" link to `Header`
**Acceptance:**
- Generate audio → close browser → reopen → generation appears in library with waveform
- Play button plays correct audio
- Delete removes from library and disk
- Library renders without error when empty
---
### Phase 2 — Studio MVP
**Goal:** Single-track timeline editor. Open a generation, view waveform, trim/split/delete, export WAV.
**Backend tasks:**
- [ ] `projects` table: `id, name, created_at, updated_at, project_json`
- [ ] `assets` table: `id, project_id, kind, file_path, duration_secs, sample_rate, channels, waveform_path, source_json`
- [ ] Implement `POST /api/projects`
- [ ] Implement `GET /api/projects` (list)
- [ ] Implement `GET /api/projects/:id`
- [ ] Implement `PUT /api/projects/:id` (save)
- [ ] Implement `DELETE /api/projects/:id`
- [ ] Implement `POST /api/projects/:id/assets` (import from generation or upload)
- [ ] Implement basic render endpoint (single voice track, WAV out only)
**Frontend tasks:**
- [ ] Install `zustand`, `@dnd-kit/core`, `@dnd-kit/utilities`
- [ ] Create Studio Zustand store (`web/stores/studioStore.ts`)
- Project state, selected clip, playhead time, zoom, isPlaying, undo stack
- Actions: selectClip, moveClip, splitClip, trimClip, setClipGain, undo, redo
- [ ] Create `/projects` route and dashboard
- [ ] Create `/studio/[projectId]` route
- [ ] `StudioShell` — top-level layout
- [ ] `StudioTopBar` — project name, undo/redo, save status, export button
- [ ] `ProjectSidebar` — tabs shell + `MediaBin` tab
- [ ] `MediaBin` — list assets, drag to timeline
- [ ] `TimelineArea` — scrollable container with tracks
- [ ] `TimelineRuler` — canvas ruler, rerenders on zoom
- [ ] `TimelineTrack` — track header + clip area
- [ ] `TimelineClip` — positioned div, selectable, draggable
- [ ] `WaveformCanvas` — canvas inside clip, draws peaks
- [ ] `InspectorPanel` — shows selected clip properties
- [ ] `TransportBar` — play/pause/stop, time display, zoom slider
- [ ] Web Audio playback engine (`web/lib/audio/playbackEngine.ts`)
- [ ] Autosave: debounced PUT on every store change (500ms delay)
- [ ] Export dialog: format picker → POST /api/projects/:id/render → poll → download
**Acceptance:**
- Open generation from library → Studio loads with waveform on single track
- Play button plays audio in sync with playhead
- Drag clip moves it on timeline
- Split at playhead creates two clips
- Trim handles reduce clip duration
- Delete removes clip
- Export produces downloadable WAV
- Undo/redo works for all operations
---
### Phase 3 — Script-Linked Regeneration
**Goal:** Script blocks are the source of truth. Clicking a clip highlights the script. Regenerating a clip produces a new take.
**Backend tasks:**
- [ ] `takes` table: `id, project_id, script_block_id, asset_id, voice_id, model_id, settings_json, created_at, rating, notes, is_active`
- [ ] Implement takes API endpoints (list, create, set active, delete)
- [ ] Waveform peak generation on regenerated takes
**Frontend tasks:**
- [ ] `ScriptPanel` sidebar tab — editable script blocks with speaker labels
- [ ] Script block → clip bidirectional linking (click clip → highlight block, click block → select clip)
- [ ] Clip inspector: show source script text (read-only in Phase 3)
- [ ] `RegenerateButton` in inspector — sends block text + current voice settings → new generation
- [ ] Regeneration creates new Take, new Asset, new Clip (does not replace existing clip automatically)
- [ ] `TakeStack` in inspector — list takes for selected block, click to preview, "Replace in timeline" action
- [ ] Per-block voice setting override (speaker, cfg_scale, inference_steps)
**Acceptance:**
- Clicking a clip selects the related script block in ScriptPanel
- Clicking a script block selects the clip on the timeline
- Regenerate produces a new take visible in TakeStack
- "Replace in timeline" swaps the clip's asset to the new take
- Previous take is preserved and can be restored
- Undo works across take replacements
---
### Phase 4 — Multi-Speaker Podcast Builder
**Goal:** Multiple voice tracks, music/SFX tracks, speaker assignment, show templates.
**Tasks:**
- [ ] Multiple tracks: Host, Guest, Music, SFX, Ambience
- [ ] Track type icons and colour coding per track type
- [ ] Per-track mute/solo buttons (functional in Web Audio engine)
- [ ] Track gain slider
- [ ] Speaker assignment per track (voice preset tied to track)
- [ ] Music/SFX uploads to media bin
- [ ] Basic music ducking on voice tracks (auto-gain on music track when voice plays)
- [ ] Show template: save a project's track layout + speaker assignments as a reusable template
- [ ] Template picker on new project creation
**Acceptance:**
- Two voice tracks play independently and mix correctly
- Mute/solo work
- Music bed plays under voice tracks
- Creating a project from a saved template reproduces the template's track layout and speaker assignments
- Exported WAV contains all tracks mixed
---
### Phase 5 — Production Export
**Goal:** MP3 export, loudness normalisation, podcast metadata, render queue, mastering presets.
**Tasks:**
- [ ] MP3 export via FFmpeg on render backend
- [ ] `pyloudnorm` integration — LUFS targeting per preset
- [ ] Export presets: Podcast Balanced, Podcast Loud, Audiobook, Raw WAV, YouTube Audio
- [ ] ID3 metadata fields in export dialog (title, artist, episode number, cover art, description)
- [ ] Render job queue — multiple renders can be queued
- [ ] Render status polling with progress bar
- [ ] Render history panel in project
- [ ] Autosave recovery: on crash/close, restore last autosaved state on next open
**Acceptance:**
- MP3 export produces valid file with correct ID3 tags
- Loudness normalisation hits target LUFS ± 0.5
- Render queue processes jobs sequentially
- Recovering an autosave restores timeline to last saved state
---
## 14. Acceptance Criteria
### Cross-cutting criteria (all phases)
- No TypeScript `any` types anywhere in Studio code
- Zustand store actions are pure (no side effects except explicit async actions)
- Autosave never blocks the UI thread
- Undo/redo covers every timeline mutation
- No orphaned audio files — deleting a project deletes its files
- Waveform canvas does not rerender on playback (only on zoom/resize)
- Playhead position updates at 60fps via `requestAnimationFrame`, not React state
---
## 15. Deferred Decisions
These are intentionally not decided yet. Revisit at the phase that needs them.
| Decision | Deferred until |
|---|---|
| Voice cloning / custom voice upload | Depends on VibeVoice roadmap |
| XTTS or ElevenLabs as second provider | Phase 3+ — only after VoiceModelProvider abstraction is proven |
| Clip crossfades (overlapping clips) | Phase 4 — requires mixing model update |
| Clip-level EQ / compression | Phase 5 |
| Per-segment emotion / style tags | Phase 3 evaluation |
| WebSocket vs SSE for render progress | Phase 5 — evaluate based on render durations seen in practice |
| IndexedDB caching of AudioBuffers | Phase 2 evaluation — only if cache miss latency is a real problem |
| Noise gate / background removal | Post-Phase 5 |
| Multi-window Studio (popout inspector etc.) | Not planned |
---
## 16. Third-Party Library Policy
VibePod is an open-source project. Any third-party library integrated into the codebase must satisfy:
1. **License:** MIT, Apache 2.0, BSD 2/3-Clause, or ISC. No GPL unless the entire application is separately GPL-licensed. No CC-NC.
2. **Attribution:** Add to a `LICENSES.md` file in the repo root when integrating. Include library name, version, license type, and project URL.
3. **Size:** For frontend packages, check the bundle size on Bundlephobia before adding. Prefer packages under 20 kB gzipped unless there is no alternative.
4. **Maintenance:** Prefer packages with active maintenance. Check last commit date and open issue count before adding.
5. **Source code:** If copying or adapting a snippet (not a full package), add an inline comment with the source URL and license.
---
*This document is the execution specification. The high-level vision lives in `roadmap.md`. When in doubt about scope, refer to Section 4 (Non-Goals) first.*
+35
@@ -0,0 +1,35 @@
"""Stable, URL-safe ID generation for VibePod entities."""

import secrets


def _make_id(prefix: str) -> str:
    return f"{prefix}_{secrets.token_urlsafe(8)}"


def gen_id() -> str:
    return _make_id("gen")


def proj_id() -> str:
    return _make_id("proj")


def asset_id() -> str:
    return _make_id("asset")


def track_id() -> str:
    return _make_id("track")


def clip_id() -> str:
    return _make_id("clip")


def block_id() -> str:
    return _make_id("block")


def take_id() -> str:
    return _make_id("take")
+77
@@ -0,0 +1,77 @@
"""Waveform peak generation for VibePod.
Reads a WAV file and produces min/max peak arrays suitable for canvas rendering.
The output format matches the WaveformPeaks TypeScript type in the frontend.
"""

from __future__ import annotations

import json
from pathlib import Path

import numpy as np
import soundfile as sf


def compute_peaks(
    audio_path: str | Path,
    samples_per_pixel: int = 256,
) -> dict:
    """Compute min/max waveform peaks from a WAV file.

    Args:
        audio_path: Path to a WAV file (any bit depth, any channel count).
        samples_per_pixel: How many audio samples are condensed into one peak pair.
                           256 is a good default for a ~1000px wide waveform at
                           standard podcast lengths.

    Returns:
        A dict matching the WaveformPeaks TypeScript type:
        {
            "sampleRate": int,
            "durationSecs": float,
            "channels": int,          # channel count of the source file (peaks are mono)
            "samplesPerPixel": int,
            "length": int,            # number of peak pairs
            "data": {
                "min": [float, ...],  # lowest sample per bucket, nominally in [-1.0, 1.0]
                "max": [float, ...],  # highest sample per bucket, nominally in [-1.0, 1.0]
            }
        }
    """
    samples, sample_rate = sf.read(str(audio_path), dtype="float32", always_2d=True)

    # Mix to mono by averaging channels
    mono = samples.mean(axis=1)
    total_samples = len(mono)
    duration_secs = total_samples / sample_rate
    channels = samples.shape[1]

    # Zero-pad so the sample count is divisible by samples_per_pixel
    remainder = total_samples % samples_per_pixel
    if remainder:
        pad = samples_per_pixel - remainder
        mono = np.concatenate([mono, np.zeros(pad, dtype=np.float32)])

    # One row per peak pair; reduce each row to its min and max
    frames = mono.reshape(-1, samples_per_pixel)
    peak_min = frames.min(axis=1).tolist()
    peak_max = frames.max(axis=1).tolist()
    length = len(peak_min)

    return {
        "sampleRate": int(sample_rate),
        "durationSecs": round(duration_secs, 4),
        "channels": int(channels),
        "samplesPerPixel": samples_per_pixel,
        "length": length,
        "data": {
            "min": [round(float(v), 5) for v in peak_min],
            "max": [round(float(v), 5) for v in peak_max],
        },
    }


def write_peaks(audio_path: str | Path, output_path: str | Path, samples_per_pixel: int = 256) -> None:
    """Compute peaks and write them to a JSON file."""
    peaks = compute_peaks(audio_path, samples_per_pixel)
    Path(output_path).write_text(json.dumps(peaks, separators=(",", ":")), encoding="utf-8")
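The bucketing step in `compute_peaks` can be checked by hand on a tiny input. The sketch below reimplements only the pad-and-reduce logic in plain Python (no NumPy or soundfile), so `bucket_peaks` is an illustrative stand-in rather than the function from this commit, and its `ValueError` guard is an assumption the original does not have.

```python
def bucket_peaks(samples: list[float], samples_per_pixel: int) -> dict:
    """Illustrative stand-in for the pad/reshape/min/max step of compute_peaks."""
    if samples_per_pixel <= 0:
        # Assumption: reject non-positive bucket sizes up front.
        raise ValueError("samples_per_pixel must be positive")
    # Zero-pad the tail so the final bucket is full, as the NumPy version does.
    pad = -len(samples) % samples_per_pixel
    padded = samples + [0.0] * pad
    buckets = [
        padded[i : i + samples_per_pixel]
        for i in range(0, len(padded), samples_per_pixel)
    ]
    return {
        "length": len(buckets),
        "min": [min(b) for b in buckets],
        "max": [max(b) for b in buckets],
    }


peaks = bucket_peaks([0.5, -0.25, 0.75, 0.1, 0.2], samples_per_pixel=2)
print(peaks)
```

Five samples at two samples per pixel give three peak pairs. The second bucket holds only positive samples, so its minimum is 0.1, and the zero padding sets the last bucket's minimum to 0.0.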
+1 -48
@@ -1,8 +1,8 @@
"use client";
 import { useCallback, useEffect, useRef, useState } from "react";
+import { buildWav, decodeFloat32Chunk, mergeFloat32Arrays, SAMPLE_RATE } from "@/lib/audio/wav";
 
-const SAMPLE_RATE = 24_000;
 const DEFAULT_PREBUFFER_SECS = 5.0;
 const DEFAULT_REBUFFER_THRESHOLD_SECS = 1.0;
 const DEFAULT_RESUME_THRESHOLD_SECS = 3.0;
@@ -30,53 +30,6 @@ interface UseStreamingGenerationOptions {
   resumeThresholdSecs?: number;
 }
 
-function mergeFloat32Arrays(chunks: Float32Array<ArrayBuffer>[]): Float32Array<ArrayBuffer> {
-  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
-  const out = new Float32Array(total);
-  let offset = 0;
-  for (const chunk of chunks) {
-    out.set(chunk, offset);
-    offset += chunk.length;
-  }
-  return out;
-}
-
-function buildWav(samples: Float32Array<ArrayBuffer>, sampleRate: number): Blob {
-  const dataSize = samples.length * 4;
-  const buffer = new ArrayBuffer(44 + dataSize);
-  const view = new DataView(buffer);
-  const writeString = (offset: number, value: string) => {
-    for (let i = 0; i < value.length; i += 1) {
-      view.setUint8(offset + i, value.charCodeAt(i));
-    }
-  };
-  writeString(0, "RIFF");
-  view.setUint32(4, 36 + dataSize, true);
-  writeString(8, "WAVE");
-  writeString(12, "fmt ");
-  view.setUint32(16, 16, true);
-  view.setUint16(20, 3, true);
-  view.setUint16(22, 1, true);
-  view.setUint32(24, sampleRate, true);
-  view.setUint32(28, sampleRate * 4, true);
-  view.setUint16(32, 4, true);
-  view.setUint16(34, 32, true);
-  writeString(36, "data");
-  view.setUint32(40, dataSize, true);
-  new Float32Array(buffer, 44).set(samples);
-  return new Blob([buffer], { type: "audio/wav" });
-}
-
-function decodeFloat32Chunk(data: string): Float32Array<ArrayBuffer> {
-  const raw = atob(data);
-  const bytes = new Uint8Array(raw.length);
-  for (let i = 0; i < raw.length; i += 1) {
-    bytes[i] = raw.charCodeAt(i);
-  }
-  return new Float32Array(bytes.buffer as ArrayBuffer);
-}
-
 export function useStreamingGeneration({
   onLog,
   onStart,
+56
@@ -0,0 +1,56 @@
export const SAMPLE_RATE = 24_000;

export function decodeFloat32Chunk(data: string): Float32Array<ArrayBuffer> {
  const raw = atob(data);
  const bytes = new Uint8Array(raw.length);
  for (let i = 0; i < raw.length; i += 1) {
    bytes[i] = raw.charCodeAt(i);
  }
  return new Float32Array(bytes.buffer as ArrayBuffer);
}

export function mergeFloat32Arrays(
  chunks: Float32Array<ArrayBuffer>[]
): Float32Array<ArrayBuffer> {
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

/** Builds a float32 RIFF/WAV Blob from raw mono samples. */
export function buildWav(
  samples: Float32Array<ArrayBuffer>,
  sampleRate: number
): Blob {
  const dataSize = samples.length * 4;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeString = (offset: number, value: string) => {
    for (let i = 0; i < value.length; i += 1) {
      view.setUint8(offset + i, value.charCodeAt(i));
    }
  };

  writeString(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // RIFF chunk size
  writeString(8, "WAVE");
  writeString(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 3, true); // format tag 3 = IEEE float (not integer PCM)
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 4, true); // byte rate
  view.setUint16(32, 4, true); // block align
  view.setUint16(34, 32, true); // bits per sample
  writeString(36, "data");
  view.setUint32(40, dataSize, true); // data chunk size

  // Sample bytes are written in platform byte order; WAV expects little-endian.
  new Float32Array(buffer, 44).set(samples);
  return new Blob([buffer], { type: "audio/wav" });
}
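To cross-check the byte offsets written by `buildWav`, here is a rough Python equivalent of the 44-byte header using `struct`. It mirrors the field order in the TypeScript above under the same assumptions (mono, 32-bit float, little-endian); `build_wav_header` is a hypothetical name used only for this sketch.

```python
import struct

WAVE_FORMAT_IEEE_FLOAT = 3  # format tag written at offset 20


def build_wav_header(num_samples: int, sample_rate: int) -> bytes:
    """Hypothetical Python mirror of the header layout in buildWav."""
    bytes_per_sample = 4  # float32
    channels = 1          # mono, as in buildWav
    data_size = num_samples * bytes_per_sample
    # "<" means little-endian with no padding, so offsets match the DataView writes.
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF",
        36 + data_size,                             # offset 4: RIFF chunk size
        b"WAVE",
        b"fmt ",
        16,                                         # offset 16: fmt chunk size
        WAVE_FORMAT_IEEE_FLOAT,                     # offset 20: format tag
        channels,                                   # offset 22: channel count
        sample_rate,                                # offset 24: sample rate
        sample_rate * channels * bytes_per_sample,  # offset 28: byte rate
        channels * bytes_per_sample,                # offset 32: block align
        bytes_per_sample * 8,                       # offset 34: bits per sample
        b"data",
        data_size,                                  # offset 40: data chunk size
    )


# One second of audio at the 24 kHz rate exported as SAMPLE_RATE.
header = build_wav_header(num_samples=24_000, sample_rate=24_000)
print(len(header), header[:4], header[36:40])
```

For one second at 24 kHz the data chunk and the byte rate are both 96,000 bytes, which makes the two fields easy to confirm in a hex dump of a generated file.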