Keep CPU async decode enabled without CFG parallelism, expand CPU buffering defaults for smooth playback, prevent CPU startup from mutating the lockfile during thread autodetection, and document runtime tuning variables in the example environment file.
Overlap acoustic_decode with forward_tts_lm calls using a background
ThreadPoolExecutor, hiding ~72s of decode cost behind tts_lm work.
Achieved 0.67x realtime (up from 0.43x, ~56% improvement).
- vibevoice_generate_patch.py: patched generate() loop reordered to
submit decode to thread before running connector + tts_lm×2, then
resolve future. Installed as instance method via types.MethodType so
uv sync reinstalling the package cannot revert the patch.
- Dynamic INT8 quantization of Linear layers (VIBEPOD_QUANTIZE=1,
default on CPU). prediction_head excluded — small fixed-size tensors
regressed ~20% with INT8 due to pack/unpack overhead.
- Auto-detect AVX512_BF16 and load model in bfloat16 if supported
(VIBEPOD_CPU_BF16=auto, overridable with 0/1).
- CPU thread count auto-configured from logical CPU count; OMP/MKL env
vars set accordingly. Lock file preserved around uv sync --no-sources
so CPU mode does not alter the shared uv.lock.
- torch.compile retained as opt-in (VIBEPOD_COMPILE=1) but marked not
recommended — dynamic KV cache shapes prevent kernel reuse.
🎯 What: Changed the uvicorn host binding from 0.0.0.0 to 127.0.0.1 in server/start.sh.
⚠️ Risk: Binding to 0.0.0.0 exposes the unauthenticated backend API to any network interface, potentially allowing unauthorized access.
🛡️ Solution: Binding to 127.0.0.1 ensures the FastAPI backend is only accessible from the local machine, relying on the Next.js frontend to securely proxy external requests.
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>