Overlap acoustic_decode with forward_tts_lm calls using a background
ThreadPoolExecutor, hiding ~72s of decode cost behind tts_lm work.
Achieved 0.67x realtime (up from 0.43x, ~56% improvement).
- vibevoice_generate_patch.py: patched generate() loop reordered to
submit decode to thread before running connector + tts_lm×2, then
resolve future. Installed as instance method via types.MethodType so
uv sync reinstalling the package cannot revert the patch.
- Dynamic INT8 quantization of Linear layers (VIBEPOD_QUANTIZE=1,
default on CPU). prediction_head excluded — small fixed-size tensors
regressed ~20% with INT8 due to pack/unpack overhead.
- Auto-detect AVX512_BF16 and load model in bfloat16 if supported
(VIBEPOD_CPU_BF16=auto, overridable with 0/1).
- CPU thread count auto-configured from logical CPU count; OMP/MKL env
vars set accordingly. Lock file preserved around uv sync --no-sources
so CPU mode does not alter the shared uv.lock.
- torch.compile retained as opt-in (VIBEPOD_COMPILE=1) but marked not
recommended — dynamic KV cache shapes prevent kernel reuse.
* Improve CPU Inference Stability: Implement Adaptive Buffering and Chunk Accumulation
This change addresses audio stuttering issues when running on CPU-only hardware by:
- Implementing server-side audio chunk accumulation to reduce SSE overhead.
- Introducing device-aware default configurations for buffering and inference steps.
- Exposing key performance parameters as environment variables.
- Enabling the frontend to adaptively adjust its buffering thresholds based on the server's configuration.
Changes:
- Modified `server/vibevoice_server.py` to support accumulation and provide config via `/health`.
- Updated `web/hooks/useStreamingGeneration.ts` to accept configurable buffering parameters.
- Updated `web/app/page.tsx` to fetch and apply server-side configuration.
Verified on CPU mode in the development environment.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>
* Improve CPU Inference Stability: Implement Adaptive Buffering and Chunk Accumulation
This change addresses audio stuttering issues when running on CPU-only hardware by:
- Implementing server-side audio chunk accumulation to reduce SSE overhead.
- Introducing device-aware default configurations for buffering and inference steps.
- Exposing key performance parameters as environment variables.
- Enabling the frontend to adaptively adjust its buffering thresholds based on the server's configuration.
Changes:
- Modified `server/vibevoice_server.py` to support accumulation and provide config via `/health`.
- Updated `web/hooks/useStreamingGeneration.ts` to accept configurable buffering parameters.
- Updated `web/app/page.tsx` to fetch and apply server-side configuration.
Verified on CPU mode in the development environment.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>
* Improve CPU Inference Stability: Adaptive Buffering UI & Logic
This change enhances the initial CPU stability fix by:
- Exposing adaptive buffering settings (Pre-buffer, Re-buffer Threshold, Resume Threshold) in a new "Advanced Buffering" UI section.
- Managing buffering settings in the application state to allow for manual overrides.
- Implementing robust re-initialization of buffering and inference defaults whenever the server's device (CPU/CUDA) changes.
- Including the active device in the server's config object for reliable client-side detection.
Verified with frontend screenshots and full build. Responds to PR feedback regarding actioning the adaptive logic.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>
* Refine adaptive buffering: env helpers, threshold validation, a11y fixes
- Extract _env_int/_env_float helpers in server to validate env-var config
with graceful fallback instead of bare int/float casts
- Fix inference_steps falsy-check (0 is valid) to use explicit None guard
- Enforce rebufferThresholdSecs < resumeThresholdSecs in both the hook
(with console.warn + clamp) and the GenerationControls UI (sliders block
invalid states by auto-bumping or ignoring the drag)
- Add type="button", aria-expanded, aria-controls, htmlFor, and input id
attributes to GenerationControls for accessibility
- Add .vscode/settings.json to .gitignore; sort package.json scripts
---------
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
🎯 What: Changed the uvicorn host binding from 0.0.0.0 to 127.0.0.1 in server/start.sh.
⚠️ Risk: Binding to 0.0.0.0 exposes the unauthenticated backend API to any network interface, potentially allowing unauthorized access.
🛡️ Solution: Binding to 127.0.0.1 ensures the FastAPI backend is only accessible from the local machine, relying on the Next.js frontend to securely proxy external requests.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>
🎯 What: Changed the uvicorn host binding from 0.0.0.0 to 127.0.0.1 in server/start.sh.
⚠️ Risk: Binding to 0.0.0.0 exposes the unauthenticated backend API to any network interface, potentially allowing unauthorized access.
🛡️ Solution: Binding to 127.0.0.1 ensures the FastAPI backend is only accessible from the local machine, relying on the Next.js frontend to securely proxy external requests.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>
- Replaced multiple named event handler functions with inline state setters.
- Used an AbortController to cleanly remove all event listeners with a single `controller.abort()` call in the cleanup hook.
- This improves maintainability and readability by reducing verbosity without changing functionality.
- Formatted inline callbacks across multiple lines for better readability as requested.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>
🎯 What: Extracted inline model loading logic from `_load_model_sync` into distinct helper functions (`_init_processor`, `_init_model`, and `_load_voice_presets`). Added exc_info to model load exception logging.
💡 Why: This significantly reduces the complexity of `_load_model_sync`, making the code easier to read and maintain. Better logging helps diagnose initialization failures.
✅ Verification: Ran a syntax check (`python -m py_compile`), started the backend server with CPU inference, and verified the model initialized and correctly processed a text-to-speech request to the `/generate` endpoint without regressions.
✨ Result: Improved code modularity while preserving identical behavior.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>
- Replaced multiple named event handler functions with inline state setters.
- Used an AbortController to cleanly remove all event listeners with a single `controller.abort()` call in the cleanup hook.
- This improves maintainability and readability by reducing verbosity without changing functionality.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>
🎯 What: Extracted inline model loading logic from `_load_model_sync` into distinct helper functions (`_init_processor`, `_init_model`, and `_load_voice_presets`).
💡 Why: This significantly reduces the complexity of `_load_model_sync`, making the code easier to read and maintain.
✅ Verification: Ran a syntax check (`python -m py_compile`), started the backend server with CPU inference, and verified the model initialized and correctly processed a text-to-speech request to the `/generate` endpoint without regressions.
✨ Result: Improved code modularity while preserving identical behavior.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>
Replace detailed exception strings with generic error messages in
the health and generate endpoints to prevent information leakage.
Internal logs still contain full exception details for debugging.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>
Extract the duplicated offline response payload and common headers into
constants to improve maintainability and readability.
- Define OFFLINE_RESPONSE for { status: "offline" }
- Define COMMON_OPTIONS for { headers: { "Cache-Control": "no-store" } }
- Use these constants across all response paths in the route.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>