Overlap acoustic_decode with forward_tts_lm calls using a background
ThreadPoolExecutor, hiding ~72s of decode cost behind tts_lm work.
Achieved 0.67x realtime (up from 0.43x, ~56% improvement).
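
The overlap described above can be sketched roughly as follows. This is a minimal illustration, not the patched generate() loop itself: the two functions are hypothetical stand-ins that sleep instead of doing model work, and the state handling is simplified to a single integer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the real model calls; the sleeps mimic compute cost.
def acoustic_decode(latent):
    time.sleep(0.05)
    return f"audio({latent})"


def forward_tts_lm(state):
    time.sleep(0.05)
    return state + 1


def generate_overlapped(steps):
    """Decode each step's latent in the background while tts_lm advances."""
    chunks, state = [], 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(steps):
            future = pool.submit(acoustic_decode, state)  # start decode
            state = forward_tts_lm(state)  # runs while decode is in flight
            chunks.append(future.result())  # resolve before the next step
    return chunks
```

Threads only hide the decode cost when the background work releases the GIL, as time.sleep does here; the sketch assumes the real decode behaves similarly.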
- vibevoice_generate_patch.py: patched generate() loop, reordered to
submit the decode to a worker thread before running the connector and
two tts_lm passes, then resolve the future. Installed as an instance
method via types.MethodType, so a uv sync that reinstalls the package
cannot revert the patch.
- Dynamic INT8 quantization of Linear layers (VIBEPOD_QUANTIZE=1,
default on CPU). prediction_head is excluded: its small fixed-size
tensors regressed ~20% under INT8 because of pack/unpack overhead.
- Auto-detect AVX512_BF16 and load model in bfloat16 if supported
(VIBEPOD_CPU_BF16=auto, overridable with 0/1).
- CPU thread count auto-configured from logical CPU count; OMP/MKL env
vars set accordingly. Lock file preserved around uv sync --no-sources
so CPU mode does not alter the shared uv.lock.
- torch.compile retained as opt-in (VIBEPOD_COMPILE=1) but marked not
recommended, because dynamic KV cache shapes prevent kernel reuse.
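
The auto/0/1 override and the thread-count setup in the list above might look something like this. It is a sketch under assumptions: the helper names are hypothetical, detection is assumed to read the avx512_bf16 flag from a /proc/cpuinfo-style dump, and the OMP/MKL variables are assumed to be OMP_NUM_THREADS and MKL_NUM_THREADS. The commit may do any of this differently.

```python
import os


def cpu_supports_bf16(cpuinfo_text):
    """Return True if a /proc/cpuinfo-style dump lists the avx512_bf16 flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "avx512_bf16" in line.split(":", 1)[-1].split()
    return False


def resolve_bf16(setting, cpuinfo_text):
    """Map a VIBEPOD_CPU_BF16 value ('auto', '0' or '1') to a boolean."""
    if setting == "1":
        return True  # forced on
    if setting == "0":
        return False  # forced off
    return cpu_supports_bf16(cpuinfo_text)  # 'auto': follow the hardware


def thread_env(logical_cpus=None):
    """Build thread-count env vars from the logical CPU count."""
    n = logical_cpus or os.cpu_count() or 1
    return {"OMP_NUM_THREADS": str(n), "MKL_NUM_THREADS": str(n)}
```

Keeping detection as a pure function over text makes the override logic testable on machines without the instruction set.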
🎯 What: Changed the uvicorn host binding from 0.0.0.0 to 127.0.0.1 in server/start.sh.
⚠️ Risk: Binding to 0.0.0.0 exposes the unauthenticated backend API on every network interface, so any host that can reach the machine could call it without authorization.
🛡️ Solution: Binding to 127.0.0.1 ensures the FastAPI backend is only accessible from the local machine, relying on the Next.js frontend to securely proxy external requests.
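
The effect of the loopback binding can be illustrated with a plain socket. This is only a standard-library sketch of the idea, not the uvicorn invocation in server/start.sh; the function name is hypothetical.

```python
import socket


def bind_loopback(port=0):
    """Open a listening TCP socket that only accepts local connections."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # 127.0.0.1 restricts the listener to the loopback interface;
    # 0.0.0.0 would listen on every interface instead.
    sock.bind(("127.0.0.1", port))
    sock.listen()
    return sock
```

Port 0 asks the OS for any free port, which keeps the sketch from colliding with a running server.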
Co-authored-by: LyAhn <27559362+LyAhn@users.noreply.github.com>