VibePod Roadmap

Studio Vision

VibePod Studio will turn generated audio from a one-shot download into a reusable editing workspace. The core idea is to persist each generation as a project artifact with the source script, voice, generation settings, audio file, waveform peaks, and edit history, then expose those artifacts in a timeline editor.

Phase 1: Generation Artifacts

  • Store generated audio as server-side jobs instead of browser-only object URLs.
  • Save job metadata: script, speaker, CFG scale, inference steps, duration, sample rate, created date, and generation status.
  • Generate waveform peak data for fast timeline rendering.
  • Add a library view for previous generations.
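
As a rough illustration of the Phase 1 items, the sketch below pairs a job record with a peak reducer. The class name `GenerationJob`, its field names, the status strings, and `compute_peaks` are all hypothetical; the fields simply mirror the metadata listed above, and the peak format (one min/max pair per bucket) is one common choice, not a settled design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Sequence, Tuple
import uuid


@dataclass
class GenerationJob:
    """Hypothetical record for one generation; field names are illustrative."""
    script: str
    speaker: str
    cfg_scale: float
    inference_steps: int
    sample_rate: int
    duration_s: float = 0.0
    status: str = "queued"  # assumed states: queued, running, done, failed
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def compute_peaks(
    samples: Sequence[float], buckets: int
) -> List[Tuple[float, float]]:
    """Reduce samples to at most `buckets` (min, max) pairs for drawing."""
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    if not samples:
        return []
    size = -(-len(samples) // buckets)  # ceiling division
    return [
        (min(samples[i:i + size]), max(samples[i:i + size]))
        for i in range(0, len(samples), size)
    ]
```

Storing the peaks next to the job record would let the library view and timeline draw a waveform without decoding the audio file again.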

Phase 2: Basic Studio Editor

  • Add a Studio route with waveform timeline playback.
  • Support trim start/end, split, delete range, silence insertion, fade in/out, and clip gain.
  • Keep edits non-destructive by storing an edit decision list instead of rewriting the original audio immediately.
  • Export edited audio as WAV first; add compressed formats later.
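
One way to picture the non-destructive approach is an edit decision list made of segments that point into the original samples. This is a minimal sketch under stated assumptions: `Segment`, `delete_range`, and `render` are hypothetical names, positions are expressed in source-sample coordinates rather than timeline coordinates, and fades and silence insertion are left out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Segment:
    """A span of the original audio, in samples, with a gain multiplier."""
    start: int
    end: int  # exclusive
    gain: float = 1.0


def delete_range(edl: List[Segment], start: int, end: int) -> List[Segment]:
    """Return a new EDL with source samples [start, end) removed."""
    out: List[Segment] = []
    for seg in edl:
        if seg.end <= start or seg.start >= end:
            out.append(seg)  # no overlap, keep the segment as-is
            continue
        if seg.start < start:
            out.append(Segment(seg.start, start, seg.gain))  # left remainder
        if seg.end > end:
            out.append(Segment(end, seg.end, seg.gain))  # right remainder
    return out


def render(samples: Sequence[float], edl: List[Segment]) -> List[float]:
    """Apply the EDL to the untouched original and return new samples."""
    out: List[float] = []
    for seg in edl:
        out.extend(s * seg.gain for s in samples[seg.start:seg.end])
    return out
```

Because `render` only reads the original samples, undo can be as simple as restoring an earlier copy of the list.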

Phase 3: Regeneration Workflow

  • Link script text ranges to generated audio ranges.
  • Allow users to select a clip and regenerate just that segment.
  • Support voice/settings changes per regenerated segment.
  • Add replace, insert, and compare-take workflows.
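
The link between script text and audio could be sketched as a list of range pairs. Everything named here (`RangeLink`, `links_for_selection`, `text_to_regenerate`) is hypothetical, and the sketch assumes the generator can report where each piece of text landed in the audio, which this roadmap does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RangeLink:
    """Ties a script character range to an audio time range."""
    text_start: int  # character offset into the script
    text_end: int    # exclusive
    audio_start: float  # seconds
    audio_end: float    # seconds, exclusive


def links_for_selection(
    links: List[RangeLink], start_s: float, end_s: float
) -> List[RangeLink]:
    """Return the links whose audio overlaps the selection [start_s, end_s)."""
    return [
        link for link in links
        if link.audio_start < end_s and link.audio_end > start_s
    ]


def text_to_regenerate(script: str, links: List[RangeLink]) -> str:
    """Return the script text spanned by the given links."""
    if not links:
        return ""
    first = min(link.text_start for link in links)
    last = max(link.text_end for link in links)
    return script[first:last]
```

Selecting a clip on the timeline would then resolve to the text that needs regenerating, and the new take can be swapped in for the matching audio range.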

Phase 4: Multi-Speaker Projects

  • Support script blocks with per-speaker assignment.
  • Render speakers into separate timeline lanes.
  • Add voice presets, reusable show templates, and episode-level settings.
  • Support intro/outro/music beds once the audio engine can mix multiple lanes.
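
Mixing lanes can be as simple as a gain-weighted sum, shown here as a minimal sketch. `mix_lanes` is a hypothetical helper that assumes mono float samples in the range -1.0 to 1.0 with every lane starting at time zero; a real mixer would also need per-clip offsets and a better limiter than hard clipping.

```python
from typing import List, Sequence


def mix_lanes(
    lanes: Sequence[Sequence[float]], gains: Sequence[float]
) -> List[float]:
    """Sum lanes sample by sample with per-lane gain, then hard-clip."""
    if len(lanes) != len(gains):
        raise ValueError("need exactly one gain per lane")
    length = max((len(lane) for lane in lanes), default=0)
    mixed = [0.0] * length
    for lane, gain in zip(lanes, gains):
        for i, sample in enumerate(lane):
            mixed[i] += sample * gain  # shorter lanes act as trailing silence
    return [max(-1.0, min(1.0, s)) for s in mixed]
```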

Phase 5: Production Export

  • Add loudness normalization, silence cleanup, and final mastering presets.
  • Export MP3, WAV, and podcast-ready metadata.
  • Add project save/load, autosave, and recoverable render jobs.
  • Prepare the audio pipeline for queueing longer renders outside the request lifecycle.

Later: VibeVoice Performance Research

  • Move the current VibePod hot-path monkey patches into the JezzWTF/VibeVoice fork once the feature direction has settled.
  • Add clearer generation profiling for overlapped CPU work, especially decode wait time versus total acoustic decode time.
  • Prototype batched positive/negative CFG TTS LM inference behind an opt-in flag and benchmark it against the current sequential path on CPU and CUDA.
  • Keep experimental performance work isolated from user-facing feature work unless it shows a clear speedup without audio quality regressions.
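
For the profiling item, a small accumulating timer is one possible starting point. `StageTimer` and the stage names used with it are hypothetical; the sketch only measures wall-clock time per labelled stage and makes no assumption about how VibeVoice structures its decode loop.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds and call counts per named stage."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def measure(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record even if the wrapped block raises.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1
```

Wrapping the blocking wait in `with timer.measure("decode_wait"):` and the decode work itself in a second stage would let the two totals be compared once a generation finishes.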

Foundation Work Needed First

  • Persist generated outputs with stable IDs.
  • Move waveform and WAV assembly into reusable modules.
  • Add cancellation-aware generation jobs.
  • Add a backend audio processing layer for edits and exports.
  • Keep the current generate screen as the fast path while Studio grows beside it.
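
Two of the foundation items, reusable WAV assembly and cancellation-aware jobs, can be sketched with the standard library alone. `float_to_wav_bytes` and `run_job` are hypothetical names, and the sketch assumes mono float samples and a generator that yields audio in chunks, which may not match how the current pipeline produces audio.

```python
import io
import struct
import threading
import wave
from typing import Iterable, List, Optional, Sequence


def float_to_wav_bytes(samples: Sequence[float], sample_rate: int) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = struct.pack(
        f"<{len(clipped)}h", *(int(s * 32767) for s in clipped)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample, i.e. 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()


def run_job(
    chunks: Iterable[List[float]], cancel: threading.Event
) -> Optional[List[float]]:
    """Collect generated chunks, returning None if cancellation is requested."""
    out: List[float] = []
    for chunk in chunks:
        if cancel.is_set():
            return None  # drop partial output; the caller marks the job cancelled
        out.extend(chunk)
    return out
```

Checking the event between chunks keeps cancellation latency bounded by the time it takes to produce one chunk.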