Chatterbox - Open Source Text-to-Speech Model
Project Overview
Chatterbox is Resemble AI's first production-grade open-source text-to-speech (TTS) model. Released under the MIT license, it is a speech synthesis system that performs strongly across multiple benchmarks and has outperformed leading closed-source systems such as ElevenLabs in side-by-side evaluations.
Core Features
🎯 Technical Advantages
- State-of-the-art Zero-Shot TTS: Clones a voice from a short reference clip, with no per-speaker training or fine-tuning.
- 500-Million-Parameter Llama Backbone: A robust architecture that underpins generation quality.
- Unique Emotional Exaggeration/Intensity Control: The first open-source TTS model to support this kind of emotional control.
- Ultra-Stable Alignment-Aware Inference: Ensures the stability and consistency of generated speech.
- Large-Scale Training Data: Trained on 500,000 hours of clean data.
- Built-in Watermarking: All generated audio contains Perth perceptual threshold watermarks.
🚀 Performance
- Outperforms ElevenLabs: Preferred in comparative listening tests on the Podonos platform.
- Low Latency: Commercial version supports ultra-low latency below 200ms.
- High-Quality Synthesis: Trained on large-scale clean data, ensuring output quality.
Application Scenarios
Chatterbox is suitable for various application scenarios:
- Content Creation: Meme creation, video dubbing.
- Game Development: Character voices, game narration.
- AI Agents: Intelligent assistants, chatbots.
- Interactive Media: Interactive applications, educational content.
- Voice Conversion: Voice style transfer.
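The voice-conversion scenario above can be sketched with the package's `ChatterboxVC` class. The class name, `generate` signature, and file-path arguments below follow upstream examples and should be treated as assumptions to verify against your installed version.

```python
def convert_voice(source_path: str, target_voice_path: str,
                  out_path: str = "vc-out.wav") -> str:
    """Hedged sketch: re-speak `source_path` in the voice of `target_voice_path`.

    Assumes chatterbox's ChatterboxVC class and torchaudio are installed;
    verify the class and argument names against your installed version.
    """
    import torchaudio as ta
    from chatterbox.vc import ChatterboxVC

    # Load the pretrained voice-conversion model (CUDA assumed, as in the TTS example).
    model = ChatterboxVC.from_pretrained(device="cuda")
    # Convert the source audio to the target speaker's voice.
    wav = model.generate(source_path, target_voice_path=target_voice_path)
    ta.save(out_path, wav, model.sr)
    return out_path
```

The heavyweight imports are deferred inside the function so the module can be loaded without a GPU or the model weights present.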
Installation and Usage
Quick Installation
pip install chatterbox-tts
Basic Usage Example
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
# Initialize the model
model = ChatterboxTTS.from_pretrained(device="cuda")
# Generate speech
text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("test-1.wav", wav, model.sr)
# Use audio prompt for voice cloning
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)
Parameter Tuning Guide
General Use (TTS and Voice Agents)
- Default Settings: `exaggeration=0.5`, `cfg=0.5` works well for most prompts.
- Fast Voice Style: If the reference speaker speaks quickly, reduce `cfg` to around 0.3 to improve rhythm.
Expressive or Dramatic Speech
- Low CFG Value: Try a lower `cfg` value (e.g., around 0.3).
- High Exaggeration: Increase `exaggeration` to around 0.7 or higher.
- Speed Compensation: Higher `exaggeration` tends to speed up speech; lowering `cfg` compensates with a slower, more deliberate rhythm.
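The presets above can be captured in a small hypothetical helper. Note one assumption: in upstream Python examples the CFG knob is passed to `generate()` as the `cfg_weight` keyword, while this guide writes it as `cfg`; verify the spelling against your installed version.

```python
def tts_preset(style: str) -> dict:
    """Hypothetical helper mapping this guide's presets to generate() kwargs.

    The guide's `cfg` is spelled `cfg_weight` here, matching the keyword used
    in upstream examples; verify against your installed version.
    """
    presets = {
        "general": {"exaggeration": 0.5, "cfg_weight": 0.5},         # default settings
        "fast_reference": {"exaggeration": 0.5, "cfg_weight": 0.3},  # quick-talking reference voice
        "dramatic": {"exaggeration": 0.7, "cfg_weight": 0.3},        # expressive or dramatic speech
    }
    return presets[style]

# Usage (sketch): wav = model.generate(text, **tts_preset("dramatic"))
```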
Technical Architecture
Model Architecture
- Backbone Network: 500 million parameter model based on the Llama architecture.
- Training Data: 500,000 hours of high-quality clean data.
- Inference Optimization: Alignment-aware inference technology ensures stability.
Security Features
- Built-in Watermark: Uses Resemble AI's Perth (Perceptual Threshold) watermarking technology.
- Detection Accuracy: The watermark retains nearly 100% detection accuracy after MP3 compression, audio editing, and other common manipulations.
- Transparency: Open-source model provides complete transparency and control.
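Because every generated file carries a Perth watermark, audio can be checked after the fact. The sketch below assumes the `resemble-perth` package (import name `perth`) and `librosa` are installed; the `PerthImplicitWatermarker` class and `get_watermark` method follow upstream examples and should be verified against your installed versions.

```python
def extract_watermark(audio_path: str):
    """Hedged sketch: load a wav file and extract its Perth watermark.

    Assumes the `resemble-perth` package (import name `perth`) and librosa;
    class and method names follow upstream examples.
    """
    import librosa
    import perth

    # Load at the file's native sample rate.
    audio, sample_rate = librosa.load(audio_path, sr=None)
    watermarker = perth.PerthImplicitWatermarker()
    # Returns the decoded watermark payload; upstream examples report
    # 0.0 (no watermark found) or 1.0 (watermark present).
    return watermarker.get_watermark(audio, sample_rate=sample_rate)
```

As in the voice-conversion sketch, imports are deferred inside the function so the module loads even when the optional packages are absent.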
Project Resources
Commercial Support
For users who need to scale or fine-tune for higher accuracy, Resemble AI offers competitively priced TTS services with the following features:
- Reliable Performance: Stable production-grade service.
- Ultra-Low Latency: Response time below 200ms.
- Suitable Scenarios: Production use for agents, applications, or interactive media.
Usage Notice
This model should be used responsibly and not for malicious purposes. Training prompts are derived from freely available data on the internet.
Contribution and Community
As an open-source project, Chatterbox welcomes community contributions. Developers can participate in project development through GitHub by submitting issue reports or feature suggestions.
