ChatTTS - Professional Conversational Text-to-Speech Model

Project Overview

ChatTTS is a generative text-to-speech (TTS) model developed by the 2noise team and designed specifically for conversational scenarios. With over 35,000 stars on GitHub, it is one of the most popular open-source TTS projects available.

  • Project Address: https://github.com/2noise/ChatTTS
  • Development Team: 2noise
  • Open Source License: AGPLv3+ (code) / CC BY-NC 4.0 (model)
  • Primary Language Support: Chinese, English

ChatTTS is designed to provide a natural and fluent voice interaction experience for conversational applications such as LLM assistants. Compared to traditional TTS models, it performs significantly better in conversational contexts.

Core Features and Characteristics

🎯 Conversational Optimization Design

  • Optimized for Conversational Scenarios: Specifically optimized for conversational applications such as chatbots and LLM assistants.
  • Natural Conversational Experience: Generates more natural and fluent speech, suitable for human-machine dialogue scenarios.
  • Interactive Dialogue: Maintains voice coherence across multi-turn conversations.

🎭 Multi-Speaker Support

  • Multi-Speaker Capability: Supports switching between different speakers, enabling multi-role conversations.
  • Speaker Sampling: Allows random sampling of speaker characteristics from a Gaussian distribution.
  • Voice Tone Control: A sampled voice can be fixed and reused, keeping a character's tone consistent across utterances.
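
The following is a minimal sketch of fixing one sampled voice so the same character speaks across multiple calls. It assumes the ChatTTS API shown in the usage examples later in this document; the example texts are illustrative.

import ChatTTS

chat = ChatTTS.Chat()
chat.load(compile=False)

# Sample one voice and reuse the same embedding for every turn.
# In recent releases the embedding is a plain string, so it can be
# saved to disk and reloaded to keep a character's voice stable.
spk = chat.sample_random_speaker()
params = ChatTTS.Chat.InferCodeParams(spk_emb=spk)

turn1 = chat.infer(["Hello, nice to meet you."], params_infer_code=params)
turn2 = chat.infer(["Let's pick up where we left off."], params_infer_code=params)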

🎵 Fine-Grained Prosody Control

  • Laughter Control: Insert laughter at different intensities with [laugh] or [laugh_0-2].
  • Pause Control: Control pauses and breaks precisely with [uv_break], [lbreak], and [break_0-7].
  • Intonation Control: Adjust the degree of colloquialism with [oral_0-9].
  • Emotional Expression: Able to predict and control fine-grained prosodic features, including intonation changes.
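
These control tokens can also be written directly into the input text. A minimal sketch, assuming the API shown in the usage examples below; skip_refine_text keeps the refinement stage from inserting additional tokens on top of the manual ones:

import ChatTTS

chat = ChatTTS.Chat()
chat.load(compile=False)

# Manually placed prosody tokens: a pause, a laugh, and a final break.
text = "What is [uv_break]your favorite english food?[laugh][lbreak]"
wavs = chat.infer(text, skip_refine_text=True)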

🌐 Multi-Language Support

  • Mixed Chinese and English: Native support for mixed Chinese and English input without language tags.
  • Language Adaptation: Automatically identifies and processes text content in different languages.
  • Future Expansion: Plans to support more languages in the future.

⚡ Technical Advantages

  • Advanced Architecture: Based on an autoregressive model architecture, drawing on techniques from systems such as Bark and VALL-E.
  • Prosody Advantage: Surpasses most open-source TTS models in terms of prosody performance.
  • High-Quality Pre-training: The main model is trained on 100,000+ hours of Chinese and English audio data.
  • Open-Source Friendly: Provides a 40,000-hour pre-trained base model for research use.

Model Specifications and Performance

Training Data

  • Main Model: Trained on 100,000+ hours of Chinese and English audio data.
  • Open Source Version: 40,000-hour pre-trained model (without SFT).
  • Data Source: Publicly available audio data sources.

Performance Metrics

  • GPU Requirement: Requires at least 4GB of GPU memory to generate 30 seconds of audio.
  • Generation Speed: Approximately 7 semantic tokens per second on an NVIDIA 4090 GPU.
  • Real-Time Factor (RTF): Approximately 0.3, i.e. a 30-second clip takes roughly 9 seconds to generate.
  • Audio Quality: 24kHz sampling rate output.

Hardware Requirements

  • Minimum Configuration: 4GB+ GPU memory.
  • Recommended Configuration: High-end graphics cards such as RTX 3090/4090.
  • CPU: Supports multi-core processor acceleration.
  • Memory: Recommended 16GB+ system memory.
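
A quick pre-flight check against the 4GB minimum can be done with plain PyTorch (a generic check, not part of the ChatTTS API):

import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    status = "meets" if total_gb >= 4 else "is below"
    print(f"GPU memory: {total_gb:.1f} GB ({status} the 4 GB minimum)")
else:
    print("No CUDA GPU detected; CPU inference will be much slower.")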

Installation and Usage

Quick Installation

# Clone the project
git clone https://github.com/2noise/ChatTTS
cd ChatTTS

# Install dependencies
pip install --upgrade -r requirements.txt

# Or use a conda environment
conda create -n chattts python=3.11
conda activate chattts
pip install -r requirements.txt

Basic Usage Example

import ChatTTS
import torch
import torchaudio

# Initialize the model
chat = ChatTTS.Chat()
chat.load(compile=False)  # Set to True for better performance

# Text-to-speech
texts = ["你好,我是ChatTTS", "Hello, I am ChatTTS"]
wavs = chat.infer(texts)

# Save audio files
for i, wav in enumerate(wavs):
    torchaudio.save(f"output_{i}.wav", torch.from_numpy(wav).unsqueeze(0), 24000)
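
Note that torchaudio versions disagree on whether save expects a 1-D or 2-D tensor; the upstream examples guard against this, roughly as follows:

for i, wav in enumerate(wavs):
    try:
        # Some torchaudio versions expect a 2-D (channels, samples) tensor.
        torchaudio.save(f"output_{i}.wav", torch.from_numpy(wav).unsqueeze(0), 24000)
    except Exception:
        # Others accept the raw 1-D waveform.
        torchaudio.save(f"output_{i}.wav", torch.from_numpy(wav), 24000)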

Advanced Control Features

# Randomly sample a speaker
rand_spk = chat.sample_random_speaker()

# Set inference parameters
params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,   # speaker embedding
    temperature=0.3,    # sampling temperature (lower = more deterministic)
    top_P=0.7,          # nucleus (top-p) sampling threshold
    top_K=20,           # top-k sampling cutoff
)

# Set text refinement parameters
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_2][laugh_0][break_6]',  # Add prosody control
)

# Generate speech
wavs = chat.infer(
    texts,
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)
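
Continuing from the snippet above, the effect of text refinement can be inspected directly: with refine_text_only=True (a flag present in recent releases of the library), infer returns the token-augmented text instead of audio:

# Return the refined text (with inserted control tokens) rather than
# audio, which is useful for debugging prosody control.
refined = chat.infer(
    texts,
    refine_text_only=True,
    params_refine_text=params_refine_text,
)
print(refined)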

Application Scenarios

🤖 AI Assistants and Chatbots

  • Voice output for LLM dialogue systems
  • Intelligent customer service systems
  • Virtual assistant applications

📚 Education and Training

  • Online education platforms
  • Language learning applications
  • Audiobook production

🎬 Content Creation

  • Podcast production
  • Video dubbing
  • Audio content generation

🏢 Enterprise Applications

  • Meeting summary broadcasts
  • Voice announcements
  • Accessibility assistance features

Technical Architecture

Core Components

  • Text Encoder: Processes the semantic understanding of input text.
  • Prosody Predictor: Predicts and controls the prosodic features of speech.
  • Vocoder: Converts features into high-quality audio waveforms.
  • Speaker Encoder: Processes multi-speaker feature embeddings.

Model Characteristics

  • Autoregressive Architecture: Transformer-based autoregressive generation model.
  • End-to-End Training: Unified end-to-end training framework.
  • Multi-Modal Fusion: Effective fusion of text, prosody, and speaker information.

Precautions and Limitations

Usage Restrictions

  • Academic Use: The released model is limited to academic research use.
  • Commercial Restrictions: Not to be used for commercial or illegal purposes.
  • Ethical Considerations: A small amount of high-frequency noise has been added to the released model to deter malicious use such as voice impersonation.

Technical Limitations

  • Audio Length: Longer audio may experience a decrease in quality.
  • Computational Requirements: Requires high GPU computing resources.
  • Language Support: Currently primarily supports Chinese and English.

Common Issues

  • Generation Speed: Can be improved by optimizing hardware configuration and parameter adjustments.
  • Audio Quality: MP3 compression format may affect the final quality.
  • Stability: Autoregressive models may exhibit unstable output.
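
Because sampling is stochastic, a common workaround for unstable output is to generate several candidates with a fixed speaker and keep the best take. A minimal sketch of the pattern (a usage convention, not a built-in feature):

import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)

# Fix the voice so only prosody and articulation vary between runs.
spk = chat.sample_random_speaker()
params = ChatTTS.Chat.InferCodeParams(spk_emb=spk)

for i in range(3):
    wavs = chat.infer(["Welcome to today's episode."], params_infer_code=params)
    torchaudio.save(f"candidate_{i}.wav", torch.from_numpy(wavs[0]).unsqueeze(0), 24000)
# Audition the candidate files and keep the best take.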

Summary

ChatTTS, as a TTS model specifically designed for conversational scenarios, excels in the following aspects:

🎯 Professionalism: Specifically optimized for conversational scenarios, performing excellently in applications such as chatbots and AI assistants.

🚀 Technical Advancement: Employs the latest deep learning technologies, leading in prosody control and multi-speaker support.

🌟 Open-Source Value: Provides a complete open-source solution, lowering the barrier to entry for high-quality TTS technology.

🤝 Active Community: Boasts an active developer community and rich ecosystem resources.

⚡ Practicality: Offers complete functionality from basic usage to advanced control, meeting the needs of different levels of users.

ChatTTS fills the gap left by the lack of dedicated TTS models for conversational scenarios, providing a strong technical foundation for more natural human-machine voice interaction. As the technology matures and the community continues to contribute, ChatTTS is likely to play an increasingly important role in the field of speech synthesis.