Microsoft's cutting-edge open-source multi-speaker conversational text-to-speech model, capable of generating up to 90 minutes of expressive dialogue audio with up to 4 distinct speakers.

MIT · Python · microsoft/VibeVoice · 6.7k stars · Last Updated: September 01, 2025

VibeVoice - Microsoft's Cutting-Edge Open-Source Speech Synthesis Framework

Project Overview

VibeVoice is a novel open-source framework developed by Microsoft Research, designed for generating expressive, long-form, multi-speaker dialogue audio, such as podcasts, from text. It addresses significant challenges that traditional Text-to-Speech (TTS) systems face in scalability, speaker consistency, and natural turn-taking.

Core Technical Innovations

Continuous Speech Tokenizers

A core innovation of VibeVoice lies in its use of continuous speech tokenizers (acoustic and semantic), operating at an ultra-low frame rate of 7.5 Hz. These tokenizers significantly enhance computational efficiency for processing long sequences while effectively maintaining audio fidelity.
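
To put the 7.5 Hz figure in perspective, the short calculation below compares per-hour token counts against a conventional speech tokenizer. The 50 Hz baseline is an illustrative assumption, not a number from the VibeVoice paper.

# Token-count comparison for one hour of audio (the 50 Hz baseline is assumed)
SECONDS_PER_HOUR = 3600

vibevoice_rate_hz = 7.5          # frame rate of VibeVoice's tokenizers
conventional_rate_hz = 50.0      # assumed rate of a typical speech tokenizer

vibevoice_frames = vibevoice_rate_hz * SECONDS_PER_HOUR        # 27,000 frames
conventional_frames = conventional_rate_hz * SECONDS_PER_HOUR  # 180,000 frames

print(f"{conventional_frames / vibevoice_frames:.1f}x fewer tokens per hour")  # 6.7x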

Next-Token Diffusion Framework

VibeVoice employs a next-token diffusion framework, leveraging Large Language Models (LLMs) to understand text context and dialogue flow, and utilizing a diffusion head to generate high-fidelity acoustic details.
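
To make the data flow concrete, here is a heavily simplified, self-contained sketch of a next-token diffusion loop: the language model produces a hidden state for the next position, and a small diffusion head refines random noise into a continuous acoustic latent conditioned on that state. All classes, names, and shapes below are toy placeholders for illustration and bear no relation to the actual VibeVoice implementation.

# Toy sketch of next-token diffusion (hypothetical names, not the real API)
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in for the LLM backbone."""
    def __init__(self, dim=32):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        out, _ = self.rnn(x)
        return self.proj(out[:, -1])  # hidden state for the next position

class TinyDiffusionHead(nn.Module):
    """Toy stand-in for the lightweight conditional diffusion head."""
    num_steps = 4
    def __init__(self, latent_dim=8, cond_dim=32):
        super().__init__()
        self.net = nn.Linear(latent_dim + cond_dim, latent_dim)
    def denoise(self, x, t, condition):
        # One (drastically simplified) denoising step, conditioned on the LM state.
        return self.net(torch.cat([x, condition], dim=-1))

def generate(lm, head, context, num_frames, latent_dim=8):
    latents = []
    for _ in range(num_frames):
        hidden = lm(context)                          # next-token hidden state
        x = torch.randn(1, latent_dim)                # start from pure noise
        for t in reversed(range(head.num_steps)):
            x = head.denoise(x, t, condition=hidden)  # iterative refinement
        latents.append(x)
        step = torch.zeros(1, 1, context.shape[-1])   # feed the latent back in
        step[..., :latent_dim] = x
        context = torch.cat([context, step], dim=1)
    return torch.cat(latents)  # in VibeVoice, latents are decoded back to audio

lm, head = TinyLM(), TinyDiffusionHead()
script_embedding = torch.randn(1, 5, 32)  # stand-in for the embedded dialogue script
print(generate(lm, head, script_embedding, num_frames=3).shape)  # torch.Size([3, 8])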

Key Features

🎯 Core Capabilities

  • Ultra-long Audio Generation: Can synthesize speech up to 90 minutes long.
  • Multi-speaker Dialogue Support: Supports up to 4 distinct speakers, surpassing the 1-2 speaker limit of many existing models.
  • Cross-lingual Synthesis: Supports English and Chinese, and enables cross-lingual narration (e.g., English prompt → Chinese speech).
  • Basic Singing Synthesis: Offers rudimentary singing synthesis in addition to spoken dialogue.

🏗️ Technical Architecture

VibeVoice is built upon a 1.5B-parameter LLM (Qwen2.5-1.5B) and integrates two novel tokenizers, acoustic and semantic, both designed to operate at a low frame rate (7.5 Hz) for computational efficiency and consistency across long sequences.

Technical Components:

  • Acoustic Tokenizer: A σ-VAE variant with a mirrored encoder-decoder architecture (approximately 340M parameters each), achieving 3200× downsampling from 24 kHz raw audio.
  • Semantic Tokenizer: An encoder-only counterpart that mirrors the acoustic tokenizer's encoder design, trained via an ASR proxy task.
  • Diffusion Decoder Head: A lightweight (approximately 123M-parameter) conditional diffusion module that predicts acoustic features.
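
These figures are internally consistent, as the quick check below confirms: a 3200× downsampling of 24 kHz audio gives exactly the 7.5 Hz frame rate, and even a full 90-minute generation stays well under the 64K context listed in the Model Versions table below (the context window must also hold text tokens, so this is only a lower bound on what it needs to accommodate).

# Sanity-check the architecture numbers quoted above
sample_rate_hz = 24_000
downsampling_factor = 3200
frame_rate_hz = sample_rate_hz / downsampling_factor
print(frame_rate_hz)                      # 7.5 (Hz), matching the tokenizer spec

frames_for_90_min = 90 * 60 * frame_rate_hz
print(f"{frames_for_90_min:,.0f}")        # 40,500 acoustic frames
print(frames_for_90_min < 64 * 1024)      # True: fits within a 64K context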

Model Versions

Model                      Context Length   Generation Length   Download Link
VibeVoice-1.5B             64K              ~90 minutes         HuggingFace
VibeVoice-7B               64K              ~90 minutes         HuggingFace
VibeVoice-0.5B-Streaming   -                -                   Coming Soon

Installation and Usage

Environment Preparation

It is recommended to use an NVIDIA Deep Learning Container to manage the CUDA environment:

# Start Docker container
sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:24.07-py3

# If flash attention is not already in the environment, install it manually
pip install flash-attn --no-build-isolation
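
After installation, a quick check inside the container's Python interpreter can confirm that the GPUs are visible and that flash-attn imports cleanly (this assumes the standard flash-attn package name):

# Quick environment check (run inside the container's Python interpreter)
import torch
import flash_attn  # raises ImportError if flash-attn is not installed

print(torch.cuda.is_available())  # True if the container can see the GPUs
print(flash_attn.__version__)     # version of the installed flash-attn build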

Installation Steps

# Clone the project
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/

# Install dependencies
pip install -e .
apt update && apt install ffmpeg -y

Usage Methods

Gradio Demo Interface

# 1.5B model
python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share

# 7B model
python demo/gradio_demo.py --model_path WestZhang/VibeVoice-Large-pt --share

Inference from File

# Single-speaker audio
python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/1p_abs.txt --speaker_names Alice

# Multi-speaker audio
python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/2p_zh.txt --speaker_names Alice Yunfan
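
For custom scripts, the transcript files under demo/text_examples/ show the expected layout. The sketch below writes a small two-speaker transcript and calls the same entry point. Note that the "Speaker N:" prefix convention and the file name my_dialogue.txt are assumptions inferred from the bundled examples; verify the exact format against the repository before relying on it.

# Minimal sketch: write a two-speaker transcript, then run batch inference.
# The "Speaker N:" prefix format is inferred from the bundled examples.
import subprocess
from pathlib import Path

transcript = """\
Speaker 1: Welcome back to the show! Today we are talking about long-form speech synthesis.
Speaker 2: Thanks for having me. Ninety minutes of audio from one script is quite a claim.
"""
Path("my_dialogue.txt").write_text(transcript, encoding="utf-8")

subprocess.run([
    "python", "demo/inference_from_file.py",
    "--model_path", "microsoft/VibeVoice-1.5B",
    "--txt_path", "my_dialogue.txt",
    "--speaker_names", "Alice", "Yunfan",  # one name per speaker, as above
], check=True)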

Application Scenarios

  • Podcast Production: Generate multi-host dialogue audio (up to 4 voices) lasting up to 90 minutes.
  • Audiobook Creation: Create emotionally rich narrations to make audiobooks more vivid and engaging.
  • Dialogue Systems: Natural speech generation in multi-turn dialogue scenarios.
  • Content Creation: Automate audio content generation.

Technical Limitations

Current Limitations

  • Language Restrictions: Only supports English and Chinese.
  • Non-speech Audio: The model focuses on speech synthesis and does not process background music or sound effects.
  • Overlapping Speech: The current model does not support generating overlapping dialogue segments.

Notes on Chinese Speech

Occasional instability may occur when synthesizing Chinese speech. It is recommended to:

  • Use English punctuation even for Chinese text, preferably limited to commas and periods (e.g., write 你好,很高兴认识你. rather than 你好，很高兴认识你。).
  • Use the 7B model version, which offers significantly better stability.

Usage Responsibility and Limitations

Research Purposes

This model is intended for research and development purposes only. We do not recommend using VibeVoice for commercial or practical applications without further testing and development.

Potential Risks

Potential for Deepfakes and Misinformation: High-quality synthetic speech could be misused to create convincing fake audio content for impersonation, fraud, or spreading misinformation. Users must ensure the reliability of transcripts, verify content accuracy, and avoid using generated content in a misleading manner.

Contact Information

For suggestions, questions, or to report anomalous or offensive model behavior, please contact: VibeVoice@microsoft.com
