index-tts/index-tts

IndexTTS is an industrial-grade, controllable, and efficient zero-shot text-to-speech system built on XTTS and Tortoise, supporting Chinese Pinyin error correction and precise voice control.

License: Apache-2.0 | Language: Python | Stars: 3.6k | Repository: index-tts/index-tts | Last Updated: 2025-06-17

IndexTTS Project Detailed Introduction

Project Overview

IndexTTS is an industrial-grade, controllable, and efficient zero-shot text-to-speech system built primarily on XTTS and Tortoise. It uses a GPT-style architecture and is particularly optimized for Chinese speech synthesis.

Core Features

1. Zero-shot Voice Cloning

Clones a voice with high quality from only a short reference audio clip, without speaker-specific fine-tuning.
Supports multi-language speech synthesis, especially Chinese and English.

2. Chinese Pinyin Correction

Able to correct the pronunciation of Chinese characters using Pinyin.
Employs a character-pinyin hybrid modeling method to quickly correct mispronounced characters.
Effectively handles pronunciation issues for polyphonic characters and long-tail characters.
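
As a hedged illustration of character-pinyin hybrid input, the sketch below swaps a mispronounced character for a pinyin token before the text is passed to the model. The helper name `annotate_pinyin` and the token format (uppercase pinyin followed by a tone digit, such as `ZHANG3`) are assumptions made for this example, not a documented interface; check the project's own examples for the exact notation it accepts.

```python
def annotate_pinyin(text: str, corrections: dict[int, str]) -> str:
    """Return text with the character at each given index replaced by a pinyin token.

    `corrections` maps a character index to a pinyin token. The token format
    (e.g. "ZHANG3") is an assumption made for illustration only.
    """
    return "".join(corrections.get(i, ch) for i, ch in enumerate(text))


# The character 长 is polyphonic (chang2 / zhang3); force the "zhang3" reading.
mixed_text = annotate_pinyin("他长大了", {1: "ZHANG3"})
print(mixed_text)
```

Only the characters listed in `corrections` are touched, so the rest of the sentence stays as ordinary Chinese text and the model sees a mix of characters and pinyin.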

3. Precise Voice Control

Controls pauses at arbitrary positions through punctuation marks.
Supports precise control over speech rhythm and prosody.
Provides rich options for adjusting voice expressiveness.
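
Because pauses follow punctuation, it can help to preview where they would fall before synthesizing. The sketch below only splits the input text at punctuation marks; it does not call the model. The function name `pause_segments` and the set of marks in `PAUSE_MARKS` are assumptions for illustration, since the source does not list which marks the model treats as pauses.

```python
import re

# Punctuation treated as pause markers in this sketch (an assumed set).
PAUSE_MARKS = "，。！？、；,.!?;"
_SEGMENT = re.compile(
    f"[^{re.escape(PAUSE_MARKS)}]+[{re.escape(PAUSE_MARKS)}]?"
)


def pause_segments(text: str) -> list[str]:
    """Split text into chunks, each ending at a pause mark when one follows."""
    return [seg.strip() for seg in _SEGMENT.findall(text) if seg.strip()]


for segment in pause_segments("大家好，我现在正在体验！"):
    print(segment)
```

Moving or adding a comma in the input changes the segment boundaries, which is a quick way to check the intended rhythm of a sentence before generating audio.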

Technical Architecture

Model Components

GPT-style Text-to-Speech Model: Based on the Transformer architecture.
Conformer Conditional Encoder: Enhances training stability and voice similarity.
BigVGAN2 Speech Decoder: Optimizes audio quality and timbre fidelity.
Character-Pinyin Hybrid Modeling: Specifically optimized for Chinese speech synthesis.

Training Data

Trained on tens of thousands of hours of data.
Covers multiple languages and voice styles.
Includes rich Chinese speech datasets.

Performance Metrics

Objective Evaluation Metrics

Word Error Rate (WER) Comparison

WER (%) on the seed-test dataset (lower is better):

Model	test_zh	test_en	test_hard
Human	1.26	2.14	-
SeedTTS	1.002	1.945	6.243
CosyVoice 2	1.45	2.57	6.83
F5TTS	1.56	1.83	8.67
IndexTTS	0.937	1.936	6.831
IndexTTS-1.5	0.821	1.606	6.565
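
For readers unfamiliar with the metric, word error rate is the edit distance between a reference transcript and a recognized transcript, divided by the reference length. The sketch below is a minimal generic implementation, not the scoring pipeline used to produce the table; the recognizer and text normalization behind those numbers are not described here.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Levenshtein distance between token lists, divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] is the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(reference)


# One substitution and one deletion against four reference tokens.
print(word_error_rate("a b c d".split(), "a x c".split()))
```

Multiply the result by 100 to get a percentage. For Chinese text, tokens are often individual characters (for example `list(text)`), which gives a character error rate instead.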

Speaker Similarity (SS) Comparison

Model	aishell1_test	commonvoice_20_test_zh	commonvoice_20_test_en	librispeech_test_clean	Average
Human	0.846	0.809	0.820	0.858	0.836
CosyVoice 2	0.796	0.743	0.742	0.837	0.788
IndexTTS	0.744	0.742	0.758	0.823	0.776
IndexTTS-1.5	0.741	0.722	0.753	0.819	0.771
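
The source does not say how speaker similarity is computed. A common approach, assumed here, is the cosine similarity between speaker embeddings of the reference audio and the synthesized audio. The sketch below shows only that final step on plain lists of floats; the embedding model itself is out of scope.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Toy embeddings (made-up values) standing in for real speaker embeddings.
print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.05]))
```

Values closer to 1 mean the two embeddings point in nearly the same direction, which is read as the synthesized voice sounding more like the reference speaker.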

Subjective Evaluation (MOS) Scores

Model	Prosody	Timbre	Quality	Average
CosyVoice 2	3.67	4.05	3.73	3.81
F5TTS	3.56	3.88	3.56	3.66
XTTS	3.23	2.99	3.10	3.11
IndexTTS	3.79	4.20	4.05	4.01

Installation and Usage

Environment Configuration

# Clone the repository and enter it
git clone https://github.com/index-tts/index-tts.git
cd index-tts

# Create a conda environment
conda create -n index-tts python=3.10
conda activate index-tts

# Install dependencies
pip install -r requirements.txt
# Install ffmpeg (Debian/Ubuntu; may require sudo)
apt-get install ffmpeg

Model Download

# Optional: if huggingface.co is slow or unreachable (for example, in mainland China),
# set the mirror endpoint before downloading
export HF_ENDPOINT="https://hf-mirror.com"

# Download using huggingface-cli
huggingface-cli download IndexTeam/IndexTTS-1.5 \
config.yaml bigvgan_discriminator.pth bigvgan_generator.pth bpe.model dvae.pth gpt.pth unigram_12000.vocab \
--local-dir checkpoints
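
A partial download is easy to miss until inference fails. The sketch below checks that every file named in the download command is present in the local directory. The helper name `missing_checkpoints` is hypothetical; the file list is copied from the command above, and the check uses only the standard library.

```python
from pathlib import Path

# File names taken from the huggingface-cli download command.
CHECKPOINT_FILES = [
    "config.yaml",
    "bigvgan_discriminator.pth",
    "bigvgan_generator.pth",
    "bpe.model",
    "dvae.pth",
    "gpt.pth",
    "unigram_12000.vocab",
]


def missing_checkpoints(model_dir: str = "checkpoints") -> list[str]:
    """Return the expected checkpoint files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in CHECKPOINT_FILES if not (root / name).is_file()]


missing = missing_checkpoints("checkpoints")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All checkpoint files are present.")
```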

Command Line Usage

# Install the command-line tool
pip install -e .

# Usage example
indextts "大家好，我现在正在bilibili 体验 ai 科技，说实话，来之前我绝对想不到！AI技术已经发展到这样匪夷所思的地步了！" \
--voice reference_voice.wav \
--model_dir checkpoints \
--config checkpoints/config.yaml \
--output output.wav

Web Interface

# Install Web UI dependencies
pip install -e ".[webui]"

# Start the Web UI
python webui.py

Then access http://127.0.0.1:7860 in your browser.

Python API Usage

from indextts.infer import IndexTTS

# Initialize the model
tts = IndexTTS(model_dir="checkpoints", cfg_path="checkpoints/config.yaml")

# Set reference audio, text, and output path
voice = "reference_voice.wav"
text = "大家好，我现在正在bilibili 体验 ai 科技，说实话，来之前我绝对想不到！AI技术已经发展到这样匪夷所思的地步了！"
output_path = "output.wav"

# Generate speech and write it to output_path
tts.infer(voice, text, output_path)
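
To synthesize several sentences with the same reference voice, the call can be wrapped in a small loop. The sketch below relies only on the `infer(voice, text, output_path)` call shown in this section; the helper name `synthesize_batch`, the `outputs` directory, and the `utt_000.wav` naming scheme are assumptions for illustration.

```python
from pathlib import Path


def synthesize_batch(tts, voice: str, texts: list[str], out_dir: str = "outputs") -> list[str]:
    """Call tts.infer(voice, text, output_path) once per text; return the output paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(texts):
        # Zero-padded file names keep the outputs in input order when sorted.
        output_path = str(target / f"utt_{index:03d}.wav")
        tts.infer(voice, text, output_path)
        paths.append(output_path)
    return paths
```

With a loaded model, usage would look like `synthesize_batch(tts, "reference_voice.wav", ["你好。", "Hello."])`. The model is loaded once and reused for every sentence, which avoids paying the initialization cost per utterance.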

Online Demo

Project Advantages

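Industrial-grade Performance: In the evaluations above, IndexTTS-1.5 has the lowest WER on test_zh and test_en, and IndexTTS has the highest MOS in every category; CosyVoice 2 has the higher average speaker similarity.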
Multi-language Support: Specifically optimized for Chinese speech synthesis, while also supporting English.
Flexible Control: Provides precise voice control capabilities.
Easy Deployment: Offers multiple usage methods and comprehensive deployment documentation.
Continuous Updates: The team continuously optimizes and improves system performance.

IndexTTS offers a high-quality, efficient option for speech synthesis applications, particularly those that need accurate Chinese pronunciation and zero-shot voice cloning.