mirror of https://github.com/JezzWTF/vibepod.git synced 2026-06-01 15:22:14 +00:00

LyAhn 13085166fb feat(phase-1): persistent generation library

- Save every completed generation to SQLite (generation_store.py) with
  WAV and waveform peaks written to data/generations/<id>/
- Deferred DB write until success — cancelled/errored generations never
  touch the DB and never appear in the library
- Fixed cancel+regenerate IndexError: _reset_scheduler_caches() now
  directly zeros scheduler._step_index and running state in addition to
  clearing VibePod cache dicts; same explicit resets added in the fresh
  path of prepare_noise_scheduler as belt-and-suspenders
- Added /library page with GenerationCard, WaveformPreview, waveform
  fetch, play/pause, download, delete, pagination, empty + error states
- Added generation API routes (list, single, audio stream, waveform,
  delete) proxying to Python server
- Added Library nav link to Header with active state
- Persist script/speaker/CFG to localStorage so generate page state
  survives navigation
- Updated build plan: Phase 0+1 ticked off, better-sqlite3 moved to
  Phase 2, architectural note on Python owning all persistence

2026-05-02 23:05:11 +01:00


VibePod Studio — Build Plan

Version: 1.0
Authors: LyAhn + Claude (Anthropic) + Codex
Date: 2026-05-02
Status: Active

Product Goal
Current Architecture
Locked Technical Decisions
Non-Goals
Data Models
Storage Layout
API Contract
Frontend Routes
Component Hierarchy
Timeline Rendering Model
Playback Model
Export Rendering Model
Phase Breakdown
Acceptance Criteria
Deferred Decisions
Third-Party Library Policy

1. Product Goal

VibePod becomes a script-aware AI podcast creation studio where users can generate, arrange, edit, regenerate, and export podcast episodes from written scripts.

The headline feature that differentiates VibePod from generic audio editors:

Every generated clip knows where it came from in the script and can be regenerated, replaced, or compared — without touching the original audio file.

The existing fast-generation page stays as the quick path. Studio grows alongside it.

2. Current Architecture

What exists today

Layer	Details
Frontend	Next.js 15 · React 19 · TypeScript 5 · Tailwind CSS 4 · App Router
Backend	FastAPI · uvicorn · Python 3.10+
Model	VibeVoice Realtime 0.5B via JezzWTF fork
Audio format	24 kHz · float32 PCM · streamed as SSE · assembled to WAV in browser
State management	React `useReducer` in `page.tsx` — no Zustand, no Redux
Persistence	None — generated audio is an ephemeral browser Blob
Packages	pnpm workspace (frontend) · uv (Python)

Current generation flow

User submits script
  → POST /api/generate (Next.js proxy)
  → POST /generate (FastAPI)
  → VibeVoice inference thread
  → SSE stream of base64 float32 PCM chunks
  → Browser decodes chunks, adaptive buffering, live Web Audio playback
  → On complete: chunks merged, RIFF/WAV header prepended, Blob URL created
  → User can play back or download the WAV
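The final "chunks merged, RIFF/WAV header prepended" step can be sketched as a pure function. This is a minimal illustration, not the project's actual code: the name `encodeWavFloat32` and its signature are assumptions, and it writes the common 44-byte header with format tag 3 (IEEE float), which most decoders accept.

```typescript
// Hypothetical helper: name and signature are illustrative only.
function encodeWavFloat32(
  chunks: Float32Array[],
  sampleRate = 24000,
  channels = 1,
): ArrayBuffer {
  const bytesPerSample = 4;
  const totalSamples = chunks.reduce((n, c) => n + c.length, 0);
  const dataBytes = totalSamples * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataBytes);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 3, true); // format tag 3 = IEEE float
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * bytesPerSample, true); // byte rate
  view.setUint16(32, channels * bytesPerSample, true); // block align
  view.setUint16(34, 32, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);

  // Append every chunk's samples as little-endian float32.
  let offset = 44;
  for (const chunk of chunks) {
    for (let i = 0; i < chunk.length; i++) {
      view.setFloat32(offset, chunk[i] ?? 0, true);
      offset += bytesPerSample;
    }
  }
  return buffer;
}
```

In the browser the returned buffer could be wrapped in a `Blob` with type `audio/wav` to obtain the Blob URL mentioned above.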

Key files

web/
  app/
    page.tsx                    main generation UI (AppState via useReducer)
    api/generate/route.ts       SSE proxy to FastAPI
    api/health/route.ts         health check proxy
  components/
    Header.tsx
    TextInputPanel.tsx
    AudioPlayer.tsx
    GenerationControls.tsx
    StatusLog.tsx
  hooks/
    useStreamingGeneration.ts   core streaming + WAV assembly
    useAudioPlayer.ts           HTML5 audio element wrapper

server/
  vibevoice_server.py           entire FastAPI app (972 lines)
  start.sh                      launcher (CPU/CUDA detection, uv sync, uvicorn)
  download_model.py             HuggingFace prefetch

docs/
  studio-build-plan.md          this file

roadmap.md                      high-level phase vision
DESIGN.md                       brand + design system (colours, type, spacing)
AGENTS.md                       AI agent / CI guide

3. Locked Technical Decisions

These decisions are final and must not be revisited without explicit agreement. New phases build on them.

3.1 Rendering approach — Hybrid DOM + Canvas 2D

The Studio uses a hybrid rendering model:

Layer	Technology
App shell, layout, sidebars, panels, inspector	React + Tailwind CSS (DOM)
Track headers, controls, transport, modals	React + Tailwind CSS (DOM)
Timeline clip containers + positioning	React + CSS (`left`/`width` from time → pixels)
Waveform rendering inside clips	Raw Canvas 2D
Timeline ruler	Raw Canvas 2D
Playhead overlay	Raw Canvas 2D
Browser playback preview	Web Audio API
Final render + export	Python + FFmpeg (server-side only)

Why not full canvas: Every non-waveform element — buttons, text, inputs, scroll, keyboard focus, accessibility — works better in DOM. Reimplementing all of that in canvas is wasted effort.

Why not pure CSS: Waveform peaks are thousands of pixel-height values per clip. DOM representation would be extremely slow. Canvas draws them in a tight loop in milliseconds.

Why not WaveSurfer.js as the core: WaveSurfer owns playback and its own event model. VibePod Studio needs its own clip model, its own timeline, and eventually its own multi-track playback. Adapting around WaveSurfer's assumptions creates friction. It may be used for standalone audio preview components (e.g., the generation page player), not the Studio timeline.

Why not Konva.js: Overkill for v1. Konva is designed for fully canvas-based scenes (whiteboards, diagrams). VibePod's timeline is mostly DOM. The additional mental model (Stage/Layer/Group/Transformer) is not justified unless interaction complexity grows significantly beyond v1.

3.2 Frontend stack — No additions without justification

Build inside the existing stack. New packages require a written reason in this document.

Approved additions (to be installed when their phase begins):

Package	Purpose	Phase
`zustand`	Studio editor state	Phase 2
`@dnd-kit/core` + `@dnd-kit/utilities`	Clip drag-and-drop in timeline	Phase 2
`better-sqlite3`	SQLite for project/studio persistence (Next.js layer)	Phase 2

Conditionally approved (evaluate at phase start):

Package	Purpose	Condition
`framer-motion`	Clip move animations	Only if DnD-kit transitions feel rough after prototype
`@radix-ui/react-*`	Accessible modal/dropdown primitives	Only if building custom is taking too long

3.3 Backend stack — Python + FFmpeg for all rendering

Browser-side audio mixing is only for preview. Export always goes to the Python backend.

Approved backend additions:

Package	Purpose	Phase
`soundfile`	Already present — WAV read/write	Phase 1
`numpy`	Audio array manipulation	Phase 1
`pydub`	Audio trimming, mixing, concatenation	Phase 2
`pyloudnorm`	Loudness normalisation (LUFS)	Phase 5

FFmpeg must be available on the server host. The render endpoint assumes ffmpeg is on PATH.

3.4 State management — Zustand for Studio, useReducer stays on generation page

The existing generation page uses useReducer and works well. Do not refactor it.

Studio requires a shared store that multiple components read and write (timeline, inspector, transport, script panel). Zustand is the right tool. It is lightweight, does not require providers, and handles editor-style state (undo stacks, selection, playhead) cleanly.

3.5 Database — SQLite from Phase 1

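Do not start with flat JSON files. SQLite is still a single file, requires no server process, and gives proper queries, transactions, and schema migrations from day one. In Phase 1 the Python server owns all persistence and writes generations to SQLite itself; better-sqlite3 enters the Next.js API layer in Phase 2 for project and Studio persistence.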

Schema lives in web/lib/db/schema.sql. Migrations are numbered SQL files in web/lib/db/migrations/.

3.6 Audio sample rate

All generated audio is 24 kHz float32 mono (VibeVoice output). Studio renders at 44.1 kHz stereo WAV or 48 kHz for podcast MP3 export. The render pipeline handles resampling.

4. Non-Goals

These will not be built and must not creep in:

Real-time collaborative editing — single-user per project only
Cloud sync or user accounts — local-first, no auth system
MIDI or music composition — audio clips only, no MIDI tracks
Plugin system — no third-party audio plugin API
Browser-side FFmpeg (ffmpeg.wasm) — all rendering is server-side
Mobile / responsive Studio layout — Studio targets desktop viewport only
Offline PWA — the server must be running; no service worker caching of model output
Real-time voice cloning — out of scope until VibeVoice supports it cleanly
Exporting to streaming platforms — export to file only; no Spotify/Apple Podcasts upload

5. Data Models

5.1 Generation job

type GenerationJob = {
  id: string;                      // "gen_<nanoid>"
  createdAt: string;               // ISO 8601
  status: "pending" | "generating" | "complete" | "error" | "cancelled";
  script: string;
  speaker: string;
  cfgScale: number;
  inferenceSteps: number;
  durationSecs: number | null;     // set on complete
  sampleRate: number;              // always 24000
  audioPath: string | null;        // relative path, e.g. data/generations/<id>/audio.wav
  waveformPath: string | null;     // relative path, e.g. data/generations/<id>/waveform.json
  errorMessage: string | null;
};

5.2 Studio project

type StudioProject = {
  id: string;                      // "proj_<nanoid>"
  name: string;
  createdAt: string;
  updatedAt: string;
  script: ScriptDocument;
  assets: AudioAsset[];
  tracks: Track[];
  edits: EditOperation[];
  renderSettings: RenderSettings;
};

5.3 Script document

type ScriptDocument = {
  blocks: ScriptBlock[];
};

type ScriptBlock = {
  id: string;                      // "block_<nanoid>"
  speakerId: string;
  text: string;
  order: number;
  generatedAssetId: string | null;
  timelineClipIds: string[];
};

5.4 Audio asset

type AudioAsset = {
  id: string;                      // "asset_<nanoid>"
  projectId: string | null;        // null = generation library asset
  kind: "generated_voice" | "upload" | "music" | "sfx" | "render";
  filePath: string;
  durationSecs: number;
  sampleRate: number;
  channels: number;
  waveformPath: string | null;
  source: {
    generationJobId?: string;
    scriptBlockId?: string;
    providerId?: string;
    modelId?: string;
    voiceId?: string;
    settings?: Record<string, unknown>;
  } | null;
};

5.5 Track

type Track = {
  id: string;                      // "track_<nanoid>"
  name: string;
  type: "voice" | "music" | "sfx" | "ambience" | "master";
  order: number;
  muted: boolean;
  solo: boolean;
  gainDb: number;
  clips: TimelineClip[];
};

5.6 Timeline clip

type TimelineClip = {
  id: string;                      // "clip_<nanoid>"
  assetId: string;
  trackId: string;
  startTime: number;               // seconds from timeline origin
  sourceStart: number;             // trim start within source asset (seconds)
  sourceEnd: number;               // trim end within source asset (seconds)
  gainDb: number;
  fadeInMs: number;
  fadeOutMs: number;
  linkedScriptRange: {
    blockId: string;
    startChar: number;
    endChar: number;
  } | null;
};

5.7 Edit operation (non-destructive EDL)

type EditOperation =
  | { type: "split";  clipId: string; at: number }
  | { type: "trim";   clipId: string; sourceStart: number; sourceEnd: number }
  | { type: "move";   clipId: string; startTime: number; trackId: string }
  | { type: "gain";   clipId: string; gainDb: number }
  | { type: "fade";   clipId: string; fadeInMs: number; fadeOutMs: number }
  | { type: "delete"; clipId: string }
  | { type: "mute";   trackId: string; muted: boolean }
  | { type: "solo";   trackId: string; solo: boolean };
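To illustrate why these operations are non-destructive, here is a minimal sketch of how a `split` could be applied to a clip's timing fields. It assumes `at` is a timeline position in seconds (the plan does not state whether it is timeline or source time); `ClipTiming` and `splitClipAt` are hypothetical names.

```typescript
// Minimal clip shape: only the timing fields of TimelineClip (section 5.6).
type ClipTiming = {
  id: string;
  startTime: number;
  sourceStart: number;
  sourceEnd: number;
};

// Assumption: `at` is measured on the timeline, in seconds.
function splitClipAt(
  clip: ClipTiming,
  at: number,
  rightId: string,
): [ClipTiming, ClipTiming] {
  const duration = clip.sourceEnd - clip.sourceStart;
  const localOffset = at - clip.startTime;
  if (localOffset <= 0 || localOffset >= duration) {
    throw new RangeError("split point must fall strictly inside the clip");
  }
  // Both halves keep pointing at the same asset; only the trim window changes.
  const cut = clip.sourceStart + localOffset;
  const left: ClipTiming = { ...clip, sourceEnd: cut };
  const right: ClipTiming = { ...clip, id: rightId, startTime: at, sourceStart: cut };
  return [left, right];
}
```

The source audio file is never rewritten: the two resulting clips simply reference adjacent ranges of the same asset.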

5.8 Take (regeneration history)

type Take = {
  id: string;                      // "take_<nanoid>"
  scriptBlockId: string;
  assetId: string;
  voiceId: string;
  modelId: string;
  settings: Record<string, unknown>;
  createdAt: string;
  rating: number | null;           // 1-5 stars, optional
  notes: string | null;
  isActive: boolean;               // true = the one placed on the timeline
};

5.9 Render settings

type RenderSettings = {
  format: "wav" | "mp3";
  sampleRate: 44100 | 48000;
  bitrate: number | null;          // kbps, null for WAV
  normaliseLoudness: boolean;
  lufsTarget: number;              // default -16 LUFS for podcast
  metadata: {
    title: string;
    artist: string;
    album: string;
    episodeNumber: number | null;
    description: string;
  } | null;
};

5.10 Waveform peaks

type WaveformPeaks = {
  sampleRate: number;
  durationSecs: number;
  channels: number;
  samplesPerPixel: number;
  length: number;
  data: {
    min: number[];                 // range -1.0 to 0.0
    max: number[];                 // range 0.0 to 1.0
  };
};
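The min/max layout above can be produced by bucketing samples, one bucket per `samplesPerPixel` samples. The plan places peak generation on the server (server/waveform.py); the sketch below only illustrates the arithmetic in TypeScript, and `PeaksSketch` and `computePeaks` are hypothetical names for mono input.

```typescript
type PeaksSketch = {
  sampleRate: number;
  durationSecs: number;
  channels: number;
  samplesPerPixel: number;
  length: number;
  data: { min: number[]; max: number[] };
};

// Assumption: mono float samples in the range -1.0 to 1.0.
function computePeaks(
  samples: Float32Array,
  sampleRate: number,
  samplesPerPixel = 256,
): PeaksSketch {
  const length = Math.ceil(samples.length / samplesPerPixel);
  const min: number[] = [];
  const max: number[] = [];
  for (let bucket = 0; bucket < length; bucket++) {
    const start = bucket * samplesPerPixel;
    const end = Math.min(start + samplesPerPixel, samples.length);
    // Starting both extremes at 0 keeps min <= 0 and max >= 0, as documented.
    let lo = 0;
    let hi = 0;
    for (let i = start; i < end; i++) {
      const s = samples[i] ?? 0;
      if (s < lo) lo = s;
      if (s > hi) hi = s;
    }
    min.push(Math.max(lo, -1));
    max.push(Math.min(hi, 1));
  }
  return {
    sampleRate,
    durationSecs: samples.length / sampleRate,
    channels: 1,
    samplesPerPixel,
    length,
    data: { min, max },
  };
}
```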

6. Storage Layout

data/
  generations/
    gen_<id>/
      audio.wav                   raw float32 WAV at 24 kHz
      waveform.json               WaveformPeaks at 256 samples/pixel
      metadata.json               GenerationJob fields (denormalised)

  projects/
    proj_<id>/
      project.json                full StudioProject serialised
      assets/
        asset_<id>.wav            uploaded or imported audio
      renders/
        render_<timestamp>.wav    exported renders
        render_<timestamp>.mp3

  db/
    vibepod.db                    SQLite database

The SQLite database is the source of truth for IDs, status, and relationships. Files on disk (WAV audio and JSON waveform peaks) are the source of truth for audio and waveform data.

7. API Contract

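All new routes are under /api/. The Next.js app proxies to FastAPI for generation, health, and the Phase 1 generation library routes, because the Python server owns generation persistence. Project persistence routes (Phase 2 onward) are handled directly by Next.js API routes talking to SQLite.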

7.1 Generation (existing, extended)

POST   /api/generate                   start streaming generation (existing)
GET    /api/health                     server health check (existing)

7.2 Generation library (Phase 1)

GET    /api/generations                list all jobs, newest first
                                       query: ?limit=20&offset=0&status=complete
GET    /api/generations/:id            get single job metadata
GET    /api/generations/:id/audio      stream WAV file
GET    /api/generations/:id/waveform   get WaveformPeaks JSON
DELETE /api/generations/:id            delete job and files

7.3 Projects (Phase 2)

POST   /api/projects                   create project
GET    /api/projects                   list projects
GET    /api/projects/:id               get project with full StudioProject
PUT    /api/projects/:id               save/autosave project
DELETE /api/projects/:id               delete project and assets

7.4 Project assets (Phase 2)

POST   /api/projects/:id/assets        upload audio file or import from generation
GET    /api/projects/:id/assets/:aid   get asset metadata
DELETE /api/projects/:id/assets/:aid   remove asset

7.5 Takes (Phase 3)

GET    /api/projects/:id/takes/:blockId         list takes for a script block
POST   /api/projects/:id/takes/:blockId         save new take
PUT    /api/projects/:id/takes/:blockId/:takeId  set active take
DELETE /api/projects/:id/takes/:blockId/:takeId  delete take

7.6 Render (Phase 5)

POST   /api/projects/:id/render        start render job (sends EDL to Python)
GET    /api/projects/:id/renders       list render history
GET    /api/projects/:id/renders/:rid  poll render status
GET    /api/projects/:id/renders/:rid/download  stream rendered file

All error responses follow:

{ "error": "human-readable message", "code": "SNAKE_CASE_CODE" }
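A client can narrow an unknown response body to this error shape with a small type guard. This is a sketch under assumptions: `ApiErrorBody` and `isApiErrorBody` are hypothetical names, the upper-snake-case check is inferred from the `SNAKE_CASE_CODE` placeholder, and `GENERATION_NOT_FOUND` is an invented example code.

```typescript
type ApiErrorBody = { error: string; code: string };

// Returns true only for objects shaped like { error: string, code: "UPPER_SNAKE" }.
function isApiErrorBody(value: unknown): value is ApiErrorBody {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const error = candidate["error"];
  const code = candidate["code"];
  return (
    typeof error === "string" &&
    typeof code === "string" &&
    /^[A-Z][A-Z0-9_]*$/.test(code)
  );
}

// Example with an invented error code.
const exampleBody: unknown = JSON.parse(
  '{ "error": "Generation not found", "code": "GENERATION_NOT_FOUND" }',
);
const exampleMessage = isApiErrorBody(exampleBody) ? exampleBody.error : "Unknown error";
```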

8. Frontend Routes

/                                  generation page (current fast path, unchanged)
/library                           generation library — browse and replay saved jobs
/projects                          project dashboard — create or open Studio projects
/studio/:projectId                 Studio workspace
/studio/new                        redirect: creates project + navigates to /studio/:id

9. Component Hierarchy

Generation page (existing — do not restructure without reason)

page.tsx
  Header
  TextInputPanel
  GenerationControls
  StatusLog
  AudioPlayer

Library page (Phase 1)

/app/library/page.tsx
  Header
  LibraryPage
    GenerationCard[]
      WaveformPreview          (small static canvas render of peaks)
      GenerationMetadata       (speaker, duration, date, settings)
      GenerationActions        (play, download, open in studio, delete)

Projects dashboard (Phase 2)

/app/projects/page.tsx
  Header
  ProjectsPage
    NewProjectButton
    ProjectCard[]
      ProjectThumbnail
      ProjectMetadata
      ProjectActions

Studio workspace (Phase 2+)

/app/studio/[projectId]/page.tsx
  StudioShell
    StudioTopBar
      ProjectNameInput
      UndoButton / RedoButton
      SaveStatus
      ExportButton
    StudioBody
      ProjectSidebar
        SidebarTabs (Script | Voices | Media | Effects | Templates)
        ScriptPanel           (Phase 3)
        VoicesPanel
        MediaBin              (Phase 2)
        EffectsPanel          (Phase 4)
      StudioMain
        TimelineRuler         (canvas)
        TimelineArea
          TimelineTrack[]     (one per track)
            TrackHeader
            TrackClipArea
              TimelineClip[]
                WaveformCanvas  (canvas)
                ClipLabel
      InspectorPanel
        ClipInspector         (when clip selected)
          ClipMetadata
          GainControl
          FadeControls
          RegenerateButton    (Phase 3)
          TakeStack           (Phase 3)
        TrackInspector        (when track selected)
        EmptyInspector        (nothing selected)
    TransportBar
      PlayPauseButton
      StopButton
      PlayheadTimeDisplay
      ZoomControls
      SnapToggle

10. Timeline Rendering Model

Clip positioning

Timeline clips are DOM elements with CSS position: absolute. Position and size derive from the project's pixels-per-second zoom level:

const left = clip.startTime * pixelsPerSecond;
const width = (clip.sourceEnd - clip.sourceStart) * pixelsPerSecond;

The pixelsPerSecond value lives in Zustand and changes with zoom. All clip positions recompute via derived selectors.
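The time-to-pixel mapping and its inverse can be kept as small pure helpers so that selectors and drag handlers share one definition. The names below (`clipRect`, `pixelsToTime`, `snapTime`) and the grid-snapping behaviour are illustrative assumptions, not part of the plan.

```typescript
type PositionedClip = {
  startTime: number;
  sourceStart: number;
  sourceEnd: number;
};

// Timeline seconds to CSS pixels, as in the two formulas above.
function clipRect(
  clip: PositionedClip,
  pixelsPerSecond: number,
): { left: number; width: number } {
  return {
    left: clip.startTime * pixelsPerSecond,
    width: (clip.sourceEnd - clip.sourceStart) * pixelsPerSecond,
  };
}

// Inverse mapping, e.g. for converting a drag position back to seconds.
function pixelsToTime(px: number, pixelsPerSecond: number): number {
  return px / pixelsPerSecond;
}

// Hypothetical snapping rule: round to the nearest grid line.
function snapTime(timeSecs: number, gridSecs: number): number {
  return Math.round(timeSecs / gridSecs) * gridSecs;
}
```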

Waveform canvas

Each TimelineClip contains a <canvas> element that receives peak data as a prop. The renderer draws one vertical line per pixel column:

function drawWaveform(
  ctx: CanvasRenderingContext2D,
  peaks: WaveformPeaks,
  width: number,
  height: number,
  color: string
): void {
  const midY = height / 2;
  ctx.clearRect(0, 0, width, height);
  ctx.strokeStyle = color;
  ctx.lineWidth = 1;

  for (let x = 0; x < width; x++) {
    const peakIndex = Math.floor((x / width) * peaks.length);
    const minY = midY - peaks.data.min[peakIndex] * midY; // min is negative, so this lands below the centre line
    const maxY = midY - peaks.data.max[peakIndex] * midY;
    ctx.beginPath();
    ctx.moveTo(x + 0.5, minY);
    ctx.lineTo(x + 0.5, maxY);
    ctx.stroke();
  }
}

The canvas rerenders when width, peaks, or zoom changes. It does not rerender on playback.

Playhead

The playhead is a separate absolutely-positioned element (thin vertical line) that overlays the entire timeline area. Its left position is updated via requestAnimationFrame during playback — no React state updates, direct DOM style write.

Timeline ruler

A <canvas> element spanning the full timeline width. Draws time markers at intervals derived from current zoom level. Rerenders on zoom change only.
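One way to derive marker intervals from the zoom level is to pick the smallest "nice" step whose on-screen spacing stays readable. The step table, the 80 px minimum spacing, and the function names are all assumptions made for this sketch.

```typescript
// Candidate tick steps in seconds (hypothetical values).
const TICK_STEPS_SECS: number[] = [0.1, 0.5, 1, 5, 10, 30, 60, 300, 600];

// Smallest step whose spacing in pixels is at least minSpacingPx.
function chooseTickInterval(pixelsPerSecond: number, minSpacingPx = 80): number {
  for (const step of TICK_STEPS_SECS) {
    if (step * pixelsPerSecond >= minSpacingPx) return step;
  }
  // Fall back to the coarsest step when fully zoomed out.
  return TICK_STEPS_SECS[TICK_STEPS_SECS.length - 1] ?? 600;
}

// Tick times and x positions for a timeline of the given duration.
function tickPositions(
  durationSecs: number,
  pixelsPerSecond: number,
  minSpacingPx = 80,
): Array<{ timeSecs: number; x: number }> {
  const step = chooseTickInterval(pixelsPerSecond, minSpacingPx);
  const count = Math.floor(durationSecs / step);
  const ticks: Array<{ timeSecs: number; x: number }> = [];
  for (let i = 0; i <= count; i++) {
    const timeSecs = i * step;
    ticks.push({ timeSecs, x: timeSecs * pixelsPerSecond });
  }
  return ticks;
}
```

The canvas ruler would then draw one marker per returned entry, and recompute the list only when `pixelsPerSecond` changes.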

11. Playback Model

Studio preview uses the Web Audio API, not HTML5 <audio>.

Graph

AudioBufferSourceNode (per clip)
  → GainNode (clip gain + fades)
    → GainNode (track gain)
      → GainNode (master gain)
        → AudioContext.destination

Scheduling

On play, the engine:

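Reads the current playheadTime from Zustand.
Sets a shared anchor, startTime = audioCtx.currentTime + 0.01, so that all sources start in sync.
For each clip where clip.startTime + (clip.sourceEnd - clip.sourceStart) > playheadTime:
- Calculates offset = clip.sourceStart + max(0, playheadTime - clip.startTime), so a clip that begins after the playhead plays from its trim start
- Calculates delay = max(0, clip.startTime - playheadTime), so a clip that begins after the playhead waits its turn
- Calls source.start(startTime + delay, offset, clip.sourceEnd - offset), so playback stops at the trim end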

On stop or seek, all sources are disconnected and the graph is torn down. A new graph is built on the next play call.
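The scheduling arithmetic can be isolated in a pure function that is easy to unit test without an AudioContext. This sketch assumes that a clip starting after the playhead is delayed by its distance from the playhead and that playback stops at the trim end; `planPlayback` and its types are hypothetical names.

```typescript
type SchedulableClip = {
  id: string;
  startTime: number;
  sourceStart: number;
  sourceEnd: number;
};

type ScheduledStart = {
  clipId: string;
  when: number; // AudioContext time at which the source should start
  offset: number; // position inside the source buffer, in seconds
  duration: number; // seconds to play before the trim end is reached
};

// anchorTime corresponds to the shared audioCtx.currentTime + 0.01 anchor.
function planPlayback(
  clips: SchedulableClip[],
  playheadTime: number,
  anchorTime: number,
): ScheduledStart[] {
  const plan: ScheduledStart[] = [];
  for (const clip of clips) {
    const clipDuration = clip.sourceEnd - clip.sourceStart;
    // Skip clips that have already finished at the playhead.
    if (clip.startTime + clipDuration <= playheadTime) continue;
    const elapsedInClip = Math.max(0, playheadTime - clip.startTime);
    const delay = Math.max(0, clip.startTime - playheadTime);
    plan.push({
      clipId: clip.id,
      when: anchorTime + delay,
      offset: clip.sourceStart + elapsedInClip,
      duration: clipDuration - elapsedInClip,
    });
  }
  return plan;
}
```

Each entry maps onto one `source.start(when, offset, duration)` call on the clip's AudioBufferSourceNode.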

Audio buffer cache

Fetched WAV files are decoded to AudioBuffer via AudioContext.decodeAudioData() and cached by asset ID. Cache is invalidated if the asset is deleted or replaced.

Fade implementation

Fades are implemented as AudioParam ramps on the clip GainNode:

gainNode.gain.setValueAtTime(0, startTime);
gainNode.gain.linearRampToValueAtTime(clipGain, startTime + fadeInSecs);
gainNode.gain.setValueAtTime(clipGain, endTime - fadeOutSecs);
gainNode.gain.linearRampToValueAtTime(0, endTime);
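The `clipGain` value above is linear, while the data model stores `gainDb`, so a conversion is needed. The sketch below shows that conversion together with a pure function that evaluates the same linear fade envelope at a given time. It assumes the fade-in and fade-out regions do not overlap; `dbToLinear` and `fadeGainAt` are hypothetical names.

```typescript
// Decibels to linear amplitude: 0 dB is 1.0, about -6 dB is 0.5.
function dbToLinear(gainDb: number): number {
  return Math.pow(10, gainDb / 20);
}

// Envelope value at time t for a clip playing from startTime to endTime.
// Assumption: fadeInSecs + fadeOutSecs <= endTime - startTime.
function fadeGainAt(
  t: number,
  startTime: number,
  endTime: number,
  fadeInSecs: number,
  fadeOutSecs: number,
  clipGain: number,
): number {
  if (t < startTime || t > endTime) return 0;
  const fadeIn = fadeInSecs > 0 ? Math.min(1, (t - startTime) / fadeInSecs) : 1;
  const fadeOut = fadeOutSecs > 0 ? Math.min(1, (endTime - t) / fadeOutSecs) : 1;
  return clipGain * Math.min(fadeIn, fadeOut);
}
```

With the data model's millisecond fields, `fadeInSecs` would be `clip.fadeInMs / 1000` and `clipGain` would be `dbToLinear(clip.gainDb)`.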

12. Export Rendering Model

When the user triggers export, the frontend sends a render request to the Python backend. The browser is never involved in mixing.

Request payload

type RenderRequest = {
  projectId: string;
  tracks: Track[];
  clips: TimelineClip[];
  assets: Array<{ id: string; filePath: string }>;
  settings: RenderSettings;
};

Python render pipeline

# Pseudocode — actual implementation lives in server/render.py
def render_project(req: RenderRequest) -> str:
    # 1. Load all source audio files into numpy arrays
    # 2. Determine total timeline duration
    # 3. Create output buffer (zeros) at target sample rate
    # 4. For each clip (sorted by startTime):
    #    a. Load source audio
    #    b. Resample to target sample rate if needed
    #    c. Apply trim (sourceStart → sourceEnd)
    #    d. Apply gain (dB → linear)
    #    e. Apply fade in/out (linear ramp)
    #    f. Place at clip.startTime offset in output buffer
    # 5. Apply track gain to each track's summed signal
    # 6. Sum all tracks into master buffer
    # 7. Apply master gain
    # 8. If normaliseLoudness: apply pyloudnorm to target LUFS
    # 9. Export WAV or MP3 via soundfile / ffmpeg
    # 10. Write to data/projects/<id>/renders/render_<timestamp>.wav (or .mp3)
    # 11. Return file path
    raise NotImplementedError("pseudocode outline only")

Render runs in a background thread. The client polls GET /api/projects/:id/renders/:rid for status.
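Steps 3, 4d, and 4f of the outline (allocate a zeroed buffer, apply gain, place each clip at its offset and sum) reduce to simple array arithmetic. The real implementation is planned in Python in server/render.py; the TypeScript below is only an illustration of that arithmetic. It assumes mono clips that are already trimmed and resampled, with non-negative start times, and `mixClips` is a hypothetical name.

```typescript
type MixClip = {
  samples: Float32Array; // already trimmed and at the target sample rate
  startTime: number; // seconds from timeline origin, assumed >= 0
  gainDb: number;
};

function mixClips(clips: MixClip[], sampleRate: number): Float32Array {
  // Total length is the latest clip end, in samples.
  let totalSamples = 0;
  for (const clip of clips) {
    const end = Math.round(clip.startTime * sampleRate) + clip.samples.length;
    if (end > totalSamples) totalSamples = end;
  }
  const output = new Float32Array(totalSamples); // zero-filled
  for (const clip of clips) {
    const gain = Math.pow(10, clip.gainDb / 20); // dB to linear
    const startIndex = Math.round(clip.startTime * sampleRate);
    for (let i = 0; i < clip.samples.length; i++) {
      const current = output[startIndex + i] ?? 0;
      output[startIndex + i] = current + (clip.samples[i] ?? 0) * gain;
    }
  }
  return output;
}
```

Overlapping clips are summed, so the result can exceed the range -1.0 to 1.0 before master gain or loudness normalisation is applied.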

13. Phase Breakdown

Phase 0 — Stabilise (current state → pre-Phase 1)

Goal: Clean foundation. No new features.

Tasks:

Extract WAV assembly from useStreamingGeneration.ts into web/lib/audio/wav.ts
Extract waveform peak generation into server/waveform.py
Confirm generation cancellation works cleanly (stream abort + server cancel_event)
Add stable generation IDs to backend (server/ids.py via secrets.token_urlsafe)
Add data/ directory to .gitignore

Acceptance: WAV assembly is a pure function with unit tests. Generation IDs are stable.

Phase 1 — Persistent Generation Library

Goal: Every generation is saved. Users can browse, play, and download past generations.

Backend tasks:

Add SQLite setup (data/db/vibepod.db)
generations table: id, created_at, status, script, speaker, cfg_scale, inference_steps, duration_secs, sample_rate, audio_path, waveform_path, error_message
On generation complete: save WAV to data/generations/<id>/audio.wav
On generation complete: compute and save waveform peaks to data/generations/<id>/waveform.json
Implement GET /api/generations (list, paginated)
Implement GET /api/generations/:id (single)
Implement GET /api/generations/:id/audio (stream file)
Implement GET /api/generations/:id/waveform (peaks JSON)
Implement DELETE /api/generations/:id (delete row + files)

Frontend tasks:

Create /library route and LibraryPage component
GenerationCard component: waveform preview canvas, metadata, play/download/delete actions
WaveformPreview component: draws peaks on canvas (static, no playback)
Mini audio player for library card playback
Link "Open in Studio" button (navigates to /studio/new?fromGeneration=<id>) — deferred to Phase 2
Add "Library" link to Header

Acceptance:

Generate audio → close browser → reopen → generation appears in library with waveform
Play button plays correct audio
Delete removes from library and disk
Library renders without error when empty

Phase 2 — Studio MVP

Goal: Single-track timeline editor. Open a generation, view waveform, trim/split/delete, export WAV.

Backend tasks:

projects table: id, name, created_at, updated_at, project_json
assets table: id, project_id, kind, file_path, duration_secs, sample_rate, channels, waveform_path, source_json
Implement POST /api/projects
Implement GET /api/projects (list)
Implement GET /api/projects/:id
Implement PUT /api/projects/:id (save)
Implement DELETE /api/projects/:id
Implement POST /api/projects/:id/assets (import from generation or upload)
Implement basic render endpoint (single voice track, WAV out only)

Frontend tasks:

Install better-sqlite3 + types
Create web/lib/db/ — schema, migration runner, query helpers
Install zustand, @dnd-kit/core, @dnd-kit/utilities
Create Studio Zustand store (web/stores/studioStore.ts)
- Project state, selected clip, playhead time, zoom, isPlaying, undo stack
- Actions: selectClip, moveClip, splitClip, trimClip, setClipGain, undo, redo
Create /projects route and dashboard
Create /studio/[projectId] route
StudioShell — top-level layout
StudioTopBar — project name, undo/redo, save status, export button
ProjectSidebar — tabs shell + MediaBin tab
MediaBin — list assets, drag to timeline
TimelineArea — scrollable container with tracks
TimelineRuler — canvas ruler, rerenders on zoom
TimelineTrack — track header + clip area
TimelineClip — positioned div, selectable, draggable
WaveformCanvas — canvas inside clip, draws peaks
InspectorPanel — shows selected clip properties
TransportBar — play/pause/stop, time display, zoom slider
Web Audio playback engine (web/lib/audio/playbackEngine.ts)
Autosave: debounced PUT on every store change (500ms delay)
Export dialog: format picker → POST /api/projects/:id/render → poll → download
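The debounced autosave listed above can be sketched as a small wrapper that restarts a timer on every change and saves only the latest snapshot. The injectable `TimerApi` is an assumption added purely to make the sketch testable; `createDebouncedSave` is a hypothetical name.

```typescript
type TimerApi = {
  set: (fn: () => void, ms: number) => unknown;
  clear: (handle: unknown) => void;
};

// Default timers backed by the platform's setTimeout and clearTimeout.
const realTimers: TimerApi = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

function createDebouncedSave<T>(
  save: (snapshot: T) => void,
  delayMs = 500,
  timers: TimerApi = realTimers,
): (snapshot: T) => void {
  let handle: unknown = null;
  return (snapshot: T): void => {
    // A newer change cancels the pending save and restarts the delay.
    if (handle !== null) timers.clear(handle);
    handle = timers.set(() => {
      handle = null;
      save(snapshot);
    }, delayMs);
  };
}
```

In the Studio, `save` would issue the PUT /api/projects/:id request, and the returned function would be called from a store subscription.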

Acceptance:

Open generation from library → Studio loads with waveform on single track
Play button plays audio in sync with playhead
Drag clip moves it on timeline
Split at playhead creates two clips
Trim handles reduce clip duration
Delete removes clip
Export produces downloadable WAV
Undo/redo works for all operations

Phase 3 — Script-Linked Regeneration

Goal: Script blocks are the source of truth. Clicking a clip highlights the script. Regenerating a clip produces a new take.

Backend tasks:

takes table: id, project_id, script_block_id, asset_id, voice_id, model_id, settings_json, created_at, rating, notes, is_active
Implement takes API endpoints (list, create, set active, delete)
Waveform peak generation on regenerated takes

Frontend tasks:

ScriptPanel sidebar tab — editable script blocks with speaker labels
Script block → clip bidirectional linking (click clip → highlight block, click block → select clip)
Clip inspector: show source script text (read-only in Phase 3)
RegenerateButton in inspector — sends block text + current voice settings → new generation
Regeneration creates new Take, new Asset, new Clip (does not replace existing clip automatically)
TakeStack in inspector — list takes for selected block, click to preview, "Replace in timeline" action
Per-block voice setting override (speaker, cfg_scale, inference_steps)

Acceptance:

Clicking a clip selects the related script block in ScriptPanel
Clicking a script block selects the clip on the timeline
Regenerate produces a new take visible in TakeStack
"Replace in timeline" swaps the clip's asset to the new take
Previous take is preserved and can be restored
Undo works across take replacements

Phase 4 — Multi-Speaker Podcast Builder

Goal: Multiple voice tracks, music/SFX tracks, speaker assignment, show templates.

Tasks:

Multiple tracks: Host, Guest, Music, SFX, Ambience
Track type icons and colour coding per track type
Per-track mute/solo buttons (functional in Web Audio engine)
Track gain slider
Speaker assignment per track (voice preset tied to track)
Music/SFX uploads to media bin
Basic music ducking on voice tracks (auto-gain on music track when voice plays)
Show template: save a project's track layout + speaker assignments as a reusable template
Template picker on new project creation

Acceptance:

Two voice tracks play independently and mix correctly
Mute/solo work
Music bed plays under voice tracks
Creating a new project from a saved template reproduces the template's track layout and speaker assignments
Exported WAV contains all tracks mixed

Phase 5 — Production Export

Goal: MP3 export, loudness normalisation, podcast metadata, render queue, mastering presets.

Tasks:

MP3 export via FFmpeg on render backend
pyloudnorm integration — LUFS targeting per preset
Export presets: Podcast Balanced, Podcast Loud, Audiobook, Raw WAV, YouTube Audio
ID3 metadata fields in export dialog (title, artist, episode number, cover art, description)
Render job queue — multiple renders can be queued
Render status polling with progress bar
Render history panel in project
Autosave recovery: on crash/close, restore last autosaved state on next open

Acceptance:

MP3 export produces valid file with correct ID3 tags
Loudness normalisation hits target LUFS ± 0.5
Render queue processes jobs sequentially
Recovering an autosave restores timeline to last saved state

14. Acceptance Criteria

Cross-cutting criteria (all phases)

No TypeScript any types anywhere in Studio code
Zustand store actions are pure (no side effects except explicit async actions)
Autosave never blocks the UI thread
Undo/redo covers every timeline mutation
No orphaned audio files — deleting a project deletes its files
Waveform canvas does not rerender on playback (only on zoom/resize)
Playhead position updates at 60fps via requestAnimationFrame, not React state
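The "pure store actions" and "undo/redo covers every timeline mutation" criteria fit naturally with an immutable past/present/future history. The sketch below is one possible shape, not the plan's chosen design; `EditHistory`, `commitEdit`, `undoEdit`, and `redoEdit` are hypothetical names, and the state type `T` is assumed never to be `undefined`.

```typescript
type EditHistory<T> = {
  past: T[];
  present: T;
  future: T[];
};

// Record a new state; any redo branch is discarded.
function commitEdit<T>(history: EditHistory<T>, next: T): EditHistory<T> {
  return { past: [...history.past, history.present], present: next, future: [] };
}

// Step back one state, or return the history unchanged if there is none.
function undoEdit<T>(history: EditHistory<T>): EditHistory<T> {
  const previous = history.past[history.past.length - 1];
  if (previous === undefined) return history;
  return {
    past: history.past.slice(0, -1),
    present: previous,
    future: [history.present, ...history.future],
  };
}

// Step forward one state, or return the history unchanged if there is none.
function redoEdit<T>(history: EditHistory<T>): EditHistory<T> {
  const next = history.future[0];
  if (next === undefined) return history;
  return {
    past: [...history.past, history.present],
    present: next,
    future: history.future.slice(1),
  };
}
```

Because every function returns a new object and never mutates its input, each one can back a store action without side effects.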

15. Deferred Decisions

These are intentionally not decided yet. Revisit at the phase that needs them.

Decision	Deferred until
Voice cloning / custom voice upload	Depends on VibeVoice roadmap
XTTS or ElevenLabs as second provider	Phase 3+ — only after VoiceModelProvider abstraction is proven
Clip crossfades (overlapping clips)	Phase 4 — requires mixing model update
Clip-level EQ / compression	Phase 5
Per-segment emotion / style tags	Phase 3 evaluation
WebSocket vs SSE for render progress	Phase 5 — evaluate based on render durations seen in practice
IndexedDB caching of AudioBuffers	Phase 2 evaluation — only if cache miss latency is a real problem
Noise gate / background removal	Post-Phase 5
Multi-window Studio (popout inspector etc.)	Not planned

16. Third-Party Library Policy

VibePod is an open-source project. Any third-party library integrated into the codebase must satisfy:

License: MIT, Apache 2.0, BSD 2/3-Clause, or ISC. No GPL unless the entire application is separately GPL-licensed. No CC-NC.
Attribution: Add to a LICENSES.md file in the repo root when integrating. Include library name, version, license type, and project URL.
Size: For frontend packages, run bundlephobia before adding. Prefer packages under 20 kB gzipped unless there is no alternative.
Maintenance: Prefer packages with active maintenance. Check last commit date and open issue count before adding.
Source code: If copying or adapting a snippet (not a full package), add an inline comment with the source URL and license.

This document is the execution specification. The high-level vision lives in roadmap.md. When in doubt about scope, refer to Section 4 (Non-Goals) first.
