IndexTTS is an industrial-grade, controllable, and efficient zero-shot text-to-speech system built on XTTS and Tortoise, supporting Chinese Pinyin error correction and precise voice control.

License: Apache-2.0 | Language: Python | Stars: 3.6k | Repository: index-tts/index-tts | Last Updated: 2025-06-17

IndexTTS Project Detailed Introduction

Project Overview

IndexTTS is an industrial-grade, controllable, and efficient zero-shot text-to-speech system built primarily on XTTS and Tortoise. It uses a GPT-style architecture and is optimized in particular for Chinese speech synthesis.

Core Features

1. Zero-shot Voice Cloning

  • Clones a voice with high quality from only a small amount of reference audio, without speaker-specific training.
  • Synthesizes speech in multiple languages, primarily Chinese and English.

2. Chinese Pinyin Correction

  • Lets users correct the pronunciation of Chinese characters by supplying Pinyin.
  • Uses character-pinyin hybrid modeling, so a mispronounced character can be fixed quickly by writing its Pinyin in the input.
  • Handles pronunciation problems with polyphonic characters and rare (long-tail) characters.
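
As a rough illustration of character-pinyin hybrid input, the sketch below replaces selected characters with a Pinyin token before the text is handed to the synthesizer. The helper name `apply_pinyin_overrides` and the token format (uppercase syllable plus tone digit, separated by spaces) are assumptions made for illustration; the exact notation the model expects should be checked against the project's own documentation.

```python
def apply_pinyin_overrides(text: str, overrides: dict[int, str]) -> str:
    """Replace the character at each index in `overrides` with a Pinyin token.

    Hypothetical helper: the token format used here is an assumption,
    not a documented IndexTTS API.
    """
    pieces = []
    for index, char in enumerate(text):
        if index in overrides:
            # Pad the token with spaces so it stays separate from
            # the neighbouring characters.
            pieces.append(f" {overrides[index]} ")
        else:
            pieces.append(char)
    return "".join(pieces).strip()


# "行" is polyphonic (xing2 / hang2); force the reading "hang2" at index 1.
mixed = apply_pinyin_overrides("银行开门了", {1: "HANG2"})
print(mixed)  # 银 HANG2 开门了
```

The resulting string would then be passed as the text to synthesize in place of the original sentence.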

3. Precise Voice Control

  • Controls pauses at arbitrary positions through punctuation marks.
  • Supports precise control over speech rhythm and prosody.
  • Provides rich options for adjusting voice expressiveness.
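
Because pauses follow punctuation, one simple way to experiment with pause placement is to insert a punctuation mark at chosen character offsets before synthesis. The helper below is a hypothetical sketch (`insert_pauses` is not part of IndexTTS); which marks produce which pause lengths is not specified here and would need to be checked by listening.

```python
def insert_pauses(text: str, offsets: list[int], mark: str = ",") -> str:
    """Insert `mark` before the character at each offset (hypothetical helper)."""
    wanted = set(offsets)
    result = []
    for index, char in enumerate(text):
        if index in wanted:
            result.append(mark)
        result.append(char)
    return "".join(result)


plain = "今天天气很好我们去公园散步"
# Ask for a pause after the sixth character by inserting a Chinese comma.
paused = insert_pauses(plain, [6], mark="，")
print(paused)  # 今天天气很好，我们去公园散步
```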

Technical Architecture

Model Components

  • GPT-style Text-to-Speech Model: Based on the Transformer architecture.
  • Conformer Conditional Encoder: Enhances training stability and voice similarity.
  • BigVGAN2 Speech Decoder: Optimizes audio quality and timbre fidelity.
  • Character-Pinyin Hybrid Modeling: Specifically optimized for Chinese speech synthesis.

Training Data

  • Trained on tens of thousands of hours of data.
  • Covers multiple languages and voice styles.
  • Includes rich Chinese speech datasets.

Performance Metrics

Objective Evaluation Metrics

Word Error Rate (WER) Comparison

Test results on the seed-test dataset (WER in %, lower is better):

Model          test_zh   test_en   test_hard
Human          1.26      2.14      -
SeedTTS        1.002     1.945     6.243
CosyVoice 2    1.45      2.57      6.83
F5TTS          1.56      1.83      8.67
IndexTTS       0.937     1.936     6.831
IndexTTS-1.5   0.821     1.606     6.565
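
For reference, WER is conventionally the token-level edit distance between a recognized transcript and the reference transcript, divided by the number of reference tokens. The sketch below implements that textbook definition. It is not the evaluation script behind the table, and for Chinese text the tokens would typically be characters rather than words.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Levenshtein distance over tokens, divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # previous[j] holds the distance between the reference prefix
    # processed so far and hypothesis[:j].
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)


ref = "the quick brown fox jumps".split()
hyp = "the quick brown box".split()
# One substitution (fox -> box) and one deletion (jumps): 2 / 5 tokens.
print(f"{100 * word_error_rate(ref, hyp):.1f}%")  # 40.0%
```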

Speaker Similarity (SS) Comparison

Model          aishell1_test   commonvoice_20_test_zh   commonvoice_20_test_en   librispeech_test_clean   Average
Human          0.846           0.809                    0.820                    0.858                    0.836
CosyVoice 2    0.796           0.743                    0.742                    0.837                    0.788
IndexTTS       0.744           0.742                    0.758                    0.823                    0.776
IndexTTS-1.5   0.741           0.722                    0.753                    0.819                    0.771
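
The table does not state how SS is computed. A common choice, assumed here, is the cosine similarity between speaker embeddings of the reference audio and the synthesized audio. The vectors below are made-up placeholders; a real evaluation would obtain them from a speaker-verification model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Placeholder embeddings, for illustration only.
reference_embedding = [0.2, 0.7, 0.1]
synthesized_embedding = [0.25, 0.6, 0.2]
score = cosine_similarity(reference_embedding, synthesized_embedding)
print(round(score, 3))
```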

Subjective Evaluation (MOS) Scores

Model          Prosody   Timbre   Quality   Average
CosyVoice 2    3.67      4.05     3.73      3.81
F5TTS          3.56      3.88     3.56      3.66
XTTS           3.23      2.99     3.10      3.11
IndexTTS       3.79      4.20     4.05      4.01
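
The Average column can be sanity-checked against the three category scores. The short script below recomputes each mean from the values in the table and records the gap to the reported figure; small gaps are expected if the published averages were computed from unrounded scores.

```python
# (prosody, timbre, quality, reported average), copied from the table above.
mos = {
    "CosyVoice 2": (3.67, 4.05, 3.73, 3.81),
    "F5TTS": (3.56, 3.88, 3.56, 3.66),
    "XTTS": (3.23, 2.99, 3.10, 3.11),
    "IndexTTS": (3.79, 4.20, 4.05, 4.01),
}

differences = {}
for model, (prosody, timbre, quality, reported) in mos.items():
    mean = (prosody + timbre + quality) / 3
    differences[model] = abs(mean - reported)
    print(f"{model:12s} mean={mean:.3f} reported={reported:.2f}")
```

All four recomputed means fall within 0.01 of the reported averages.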

Installation and Usage

Environment Configuration

# Clone the repository and enter it
git clone https://github.com/index-tts/index-tts.git
cd index-tts

# Create a conda environment
conda create -n index-tts python=3.10
conda activate index-tts

# Install dependencies (apt-get is for Debian/Ubuntu and may need sudo)
pip install -r requirements.txt
apt-get install ffmpeg

Model Download

# Download using huggingface-cli
huggingface-cli download IndexTeam/IndexTTS-1.5 \
config.yaml bigvgan_discriminator.pth bigvgan_generator.pth bpe.model dvae.pth gpt.pth unigram_12000.vocab \
--local-dir checkpoints

# Users in mainland China can set this mirror endpoint (run it before the download command)
export HF_ENDPOINT="https://hf-mirror.com"
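
The same download can be scripted from Python. The sketch below uses `snapshot_download` from the `huggingface_hub` package with the repository name and file list from the command above; the wrapper function `download_checkpoints` and its `use_mirror` flag are illustrative names, not part of IndexTTS.

```python
import os

REPO_ID = "IndexTeam/IndexTTS-1.5"
CHECKPOINT_FILES = [
    "config.yaml",
    "bigvgan_discriminator.pth",
    "bigvgan_generator.pth",
    "bpe.model",
    "dvae.pth",
    "gpt.pth",
    "unigram_12000.vocab",
]


def download_checkpoints(local_dir: str = "checkpoints", use_mirror: bool = False) -> str:
    """Download the listed checkpoint files and return the local directory."""
    if use_mirror:
        # Set before huggingface_hub is first imported, because the
        # library reads HF_ENDPOINT when it is imported.
        os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=CHECKPOINT_FILES,  # fetch only these files
        local_dir=local_dir,
    )
```

Calling `download_checkpoints()` would fetch the files into `checkpoints`, and `download_checkpoints(use_mirror=True)` would route the request through the mirror.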

Command Line Usage

# Install the command-line tool
pip install -e .

# Usage example
indextts "大家好,我现在正在bilibili 体验 ai 科技,说实话,来之前我绝对想不到!AI技术已经发展到这样匪夷所思的地步了!" \
--voice reference_voice.wav \
--model_dir checkpoints \
--config checkpoints/config.yaml \
--output output.wav

Web Interface

# Install Web UI dependencies
pip install -e ".[webui]"

# Start the Web UI
python webui.py

Then access http://127.0.0.1:7860 in your browser.

Python API Usage

from indextts.infer import IndexTTS

# Initialize the model
tts = IndexTTS(model_dir="checkpoints", cfg_path="checkpoints/config.yaml")

# Set reference audio and text
voice = "reference_voice.wav"
text = "大家好,我现在正在bilibili 体验 ai 科技,说实话,来之前我绝对想不到!AI技术已经发展到这样匪夷所思的地步了!"

# Generate speech and write it to output_path
output_path = "output.wav"
tts.infer(voice, text, output_path)
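
To synthesize several sentences with the same reference voice, the `infer` call can be wrapped in a loop. The sketch below relies only on the `infer(voice, text, output_path)` call shown in this section; the helper name `synthesize_batch` and the output file naming scheme are illustrative choices.

```python
from pathlib import Path


def synthesize_batch(tts, voice: str, texts: list[str], out_dir: str = "outputs") -> list[str]:
    """Synthesize each text with the same reference voice; return the output paths.

    `tts` is any object exposing infer(voice, text, output_path),
    such as the IndexTTS instance created above.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(texts, start=1):
        path = str(out / f"utt_{index:03d}.wav")
        tts.infer(voice, text, path)
        paths.append(path)
    return paths
```

With the model loaded as `tts`, `synthesize_batch(tts, "reference_voice.wav", ["你好", "Hello"])` would write `outputs/utt_001.wav` and `outputs/utt_002.wav`.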

Project Advantages

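  1. Industrial-grade Performance: Among the systems compared above, IndexTTS-1.5 has the lowest WER on test_zh and test_en, and IndexTTS has the highest MOS in every category; SeedTTS is lower on test_hard and CosyVoice 2 scores higher on speaker similarity.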
  2. Multi-language Support: Specifically optimized for Chinese speech synthesis, while also supporting English.
  3. Flexible Control: Provides precise voice control capabilities.
  4. Easy Deployment: Offers multiple usage methods and comprehensive deployment documentation.
  5. Continuous Updates: The team continuously optimizes and improves system performance.

IndexTTS offers a high-quality, efficient option for speech synthesis applications, with particular strength in Chinese.
