index-tts/index-tts
Please refer to the latest official releases for current information.
IndexTTS is an industrial-grade, controllable, and efficient zero-shot text-to-speech system built on XTTS and Tortoise, supporting Chinese Pinyin error correction and precise voice control.
Apache-2.0 | Python | 3.6k | index-tts/index-tts | Last Updated: 2025-06-17
IndexTTS: Detailed Project Introduction
Project Overview
IndexTTS is an industrial-grade, controllable, and efficient zero-shot text-to-speech system built primarily on XTTS and Tortoise. It uses a GPT-style architecture and is specifically optimized for Chinese speech synthesis.
Core Features
1. Zero-shot Voice Cloning
- Achieves high-quality voice cloning from only a small amount of reference audio.
- Supports multilingual speech synthesis, primarily Chinese and English.
2. Chinese Pinyin Correction
- Able to correct the pronunciation of Chinese characters using Pinyin.
- Employs a character-pinyin hybrid modeling method to quickly correct mispronounced characters.
- Effectively handles pronunciation issues for polyphonic characters and long-tail characters.
3. Precise Voice Control
- Controls pauses at arbitrary positions through punctuation marks.
- Supports precise control over speech rhythm and prosody.
- Provides rich options for adjusting voice expressiveness.
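To make the punctuation-based pause control more concrete, the sketch below splits input text at punctuation marks, which is where pauses would fall. This is illustrative preprocessing only, not IndexTTS's internal tokenizer; the function name `split_at_pauses` and the set of punctuation marks are assumptions.

```python
import re

# Punctuation treated as pause markers. This set is an assumption covering
# common Chinese and English marks; IndexTTS's own set may differ.
PAUSE_MARKS = ",。!?;:,.!?;:"


def split_at_pauses(text: str) -> list[str]:
    """Split text into segments, each ending with its trailing pause marks."""
    marks = re.escape(PAUSE_MARKS)
    # One or more non-mark characters, followed by any run of pause marks.
    pattern = f"[^{marks}]+[{marks}]*"
    return [seg.strip() for seg in re.findall(pattern, text) if seg.strip()]


for segment in split_at_pauses("大家好,我现在正在体验。说实话!"):
    print(segment)
```

Inserting or removing a punctuation mark in the input changes the segment boundaries, which is the lever the feature list describes for placing pauses.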
Technical Architecture
Model Components
- GPT-style Text-to-Speech Model: Based on the Transformer architecture.
- Conformer Conditional Encoder: Enhances training stability and voice similarity.
- BigVGAN2 Speech Decoder: Optimizes audio quality and timbre fidelity.
- Character-Pinyin Hybrid Modeling: Specifically optimized for Chinese speech synthesis.
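As a rough illustration of character-pinyin hybrid input, the sketch below replaces selected characters with pinyin tokens so that a polyphonic character gets an explicit reading. The helper `apply_pinyin_overrides` is hypothetical, and the pinyin notation used here (uppercase syllable plus tone digit, such as `CHONG2`) is an assumption; check the official repository for the format the model actually expects.

```python
def apply_pinyin_overrides(text: str, overrides: dict[int, str]) -> str:
    """Replace the character at each given index with a pinyin token.

    `overrides` maps a character index to a pinyin string. Indexes that
    fall outside the text are ignored.
    """
    return "".join(overrides.get(i, ch) for i, ch in enumerate(text))


# 重 is polyphonic: "zhong4" (heavy) or "chong2" (again).
# Force the second reading for the first character.
mixed = apply_pinyin_overrides("重新开始", {0: "CHONG2"})
print(mixed)
```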
Training Data
- Trained on tens of thousands of hours of data.
- Covers multiple languages and voice styles.
- Includes rich Chinese speech datasets.
Performance Metrics
Objective Evaluation Metrics
Word Error Rate (WER) Comparison
Test results on the seed-test dataset (WER in %, lower is better):
Model | test_zh | test_en | test_hard |
---|---|---|---|
Human | 1.26 | 2.14 | - |
SeedTTS | 1.002 | 1.945 | 6.243 |
CosyVoice 2 | 1.45 | 2.57 | 6.83 |
F5TTS | 1.56 | 1.83 | 8.67 |
IndexTTS | 0.937 | 1.936 | 6.831 |
IndexTTS-1.5 | 0.821 | 1.606 | 6.565 |
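For readers unfamiliar with the metric, WER is the edit distance between a reference transcript and the recognized transcript, divided by the reference length. The sketch below is a minimal implementation of that definition; the tokenization and the speech recognizer used for the table above are not stated here, so treat the token lists as placeholders. For Chinese text, the same computation is often applied to characters rather than words.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)


# One substituted word out of three reference words.
print(word_error_rate("the cat sat".split(), "the cat sit".split()))
```

Multiplying the result by 100 gives a percentage.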
Speaker Similarity (SS) Comparison
Model | aishell1_test | commonvoice_20_test_zh | commonvoice_20_test_en | librispeech_test_clean | Average |
---|---|---|---|---|---|
Human | 0.846 | 0.809 | 0.820 | 0.858 | 0.836 |
CosyVoice 2 | 0.796 | 0.743 | 0.742 | 0.837 | 0.788 |
IndexTTS | 0.744 | 0.742 | 0.758 | 0.823 | 0.776 |
IndexTTS-1.5 | 0.741 | 0.722 | 0.753 | 0.819 | 0.771 |
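The page does not say how speaker similarity is computed. A common approach, assumed here, is the cosine similarity between speaker embeddings extracted from the reference audio and from the synthesized audio. The sketch below shows only the similarity step; the embedding vectors are made-up placeholders, and the embedding model is left unspecified.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Placeholder embeddings, not real model output.
reference_embedding = [0.2, 0.7, 0.1]
synthesized_embedding = [0.25, 0.65, 0.05]
print(cosine_similarity(reference_embedding, synthesized_embedding))
```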
Subjective Evaluation (MOS) Scores
Model | Prosody | Timbre | Quality | Average |
---|---|---|---|---|
CosyVoice 2 | 3.67 | 4.05 | 3.73 | 3.81 |
F5TTS | 3.56 | 3.88 | 3.56 | 3.66 |
XTTS | 3.23 | 2.99 | 3.10 | 3.11 |
IndexTTS | 3.79 | 4.20 | 4.05 | 4.01 |
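The Average column appears to be the unweighted mean of the Prosody, Timbre, and Quality scores. The sketch below recomputes it from the table; the recomputed means agree with the reported averages to within 0.01, and the small differences presumably come from rounding of the underlying ratings (an assumption, since the raw scores are not given).

```python
# (prosody, timbre, quality, reported_average), copied from the table above.
MOS_SCORES = {
    "CosyVoice 2": (3.67, 4.05, 3.73, 3.81),
    "F5TTS": (3.56, 3.88, 3.56, 3.66),
    "XTTS": (3.23, 2.99, 3.10, 3.11),
    "IndexTTS": (3.79, 4.20, 4.05, 4.01),
}


def mean_of_dimensions(prosody: float, timbre: float, quality: float) -> float:
    """Unweighted mean of the three subjective dimensions."""
    return (prosody + timbre + quality) / 3


for model, (prosody, timbre, quality, reported) in MOS_SCORES.items():
    mean = mean_of_dimensions(prosody, timbre, quality)
    print(f"{model}: recomputed={mean:.3f}, reported={reported}")
```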
Installation and Usage
Environment Configuration
# Clone the repository and enter it
git clone https://github.com/index-tts/index-tts.git
cd index-tts
# Create and activate a conda environment
conda create -n index-tts python=3.10
conda activate index-tts
# Install Python dependencies
pip install -r requirements.txt
# Install ffmpeg (Debian/Ubuntu; may require sudo)
apt-get install ffmpeg
Model Download
# Optional, for users in mainland China: set the mirror endpoint before downloading
export HF_ENDPOINT="https://hf-mirror.com"
# Download using huggingface-cli
huggingface-cli download IndexTeam/IndexTTS-1.5 \
config.yaml bigvgan_discriminator.pth bigvgan_generator.pth bpe.model dvae.pth gpt.pth unigram_12000.vocab \
--local-dir checkpoints
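A quick way to confirm the download completed is to check that every file named in the command above is present in the target directory. The sketch below does that with the standard library only; the helper name `missing_checkpoint_files` is hypothetical, and the file list is copied from the download command rather than from any official manifest.

```python
from pathlib import Path

# File names taken from the huggingface-cli command above.
REQUIRED_FILES = [
    "config.yaml",
    "bigvgan_discriminator.pth",
    "bigvgan_generator.pth",
    "bpe.model",
    "dvae.pth",
    "gpt.pth",
    "unigram_12000.vocab",
]


def missing_checkpoint_files(model_dir: str = "checkpoints") -> list[str]:
    """Return the required file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_checkpoint_files("checkpoints")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All checkpoint files are present.")
```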
Command Line Usage
# Install the command-line tool
pip install -e .
# Usage example
indextts "大家好,我现在正在bilibili 体验 ai 科技,说实话,来之前我绝对想不到!AI技术已经发展到这样匪夷所思的地步了!" \
--voice reference_voice.wav \
--model_dir checkpoints \
--config checkpoints/config.yaml \
--output output.wav
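When the command-line tool needs to be driven from a script, it can help to build the argument list programmatically. The sketch below assembles the same invocation shown above, using only the flags that appear in that example; the helper name `build_indextts_command` is hypothetical.

```python
import shlex


def build_indextts_command(
    text: str,
    voice: str,
    output: str,
    model_dir: str = "checkpoints",
    config: str = "checkpoints/config.yaml",
) -> list[str]:
    """Build the argument list for the indextts CLI invocation shown above."""
    return [
        "indextts", text,
        "--voice", voice,
        "--model_dir", model_dir,
        "--config", config,
        "--output", output,
    ]


cmd = build_indextts_command("大家好", "reference_voice.wav", "output.wav")
# Print a shell-quoted form for inspection. To execute the command instead,
# pass the list to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems with punctuation in the input text.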
Web Interface
# Install Web UI dependencies
pip install -e ".[webui]"
# Start the Web UI
python webui.py
Then access http://127.0.0.1:7860 in your browser.
Python API Usage
from indextts.infer import IndexTTS
# Initialize the model
tts = IndexTTS(model_dir="checkpoints", cfg_path="checkpoints/config.yaml")
# Set the reference audio, the text to synthesize, and the output file
voice = "reference_voice.wav"
text = "大家好,我现在正在bilibili 体验 ai 科技,说实话,来之前我绝对想不到!AI技术已经发展到这样匪夷所思的地步了!"
output_path = "output.wav"
# Generate speech and write it to output_path
tts.infer(voice, text, output_path)
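To synthesize several sentences with the same reference voice, the `infer` call can be wrapped in a loop. The sketch below assumes only the call shape used above, `tts.infer(voice, text, output_path)`; the helper `synthesize_batch` and the `utt_001.wav` naming scheme are hypothetical.

```python
from pathlib import Path


def synthesize_batch(tts, voice: str, texts: list[str], out_dir: str = "outputs") -> list[str]:
    """Call tts.infer once per text and return the output file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(texts, start=1):
        path = str(out / f"utt_{index:03d}.wav")
        # Same argument order as the single-sentence example above.
        tts.infer(voice, text, path)
        paths.append(path)
    return paths
```

With a model loaded as in the example above, a call would look like `synthesize_batch(tts, "reference_voice.wav", ["大家好", "说实话"])`.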
Project Advantages
- Industrial-grade Performance: Among the systems compared above, IndexTTS-1.5 has the lowest WER on test_zh and test_en, and IndexTTS has the highest MOS in every category.
- Multi-language Support: Specifically optimized for Chinese speech synthesis, while also supporting English.
- Flexible Control: Provides precise voice control capabilities.
- Easy Deployment: Offers multiple usage methods and comprehensive deployment documentation.
- Continuous Updates: The team continuously optimizes and improves system performance.
IndexTTS offers a high-quality, efficient solution for speech synthesis applications, with results that are competitive with other current text-to-speech systems.