MegaTTS3: A high-quality zero-shot text-to-speech model developed by ByteDance, supporting Chinese and English voice cloning.

Apache-2.0 · Python · 5.5k stars · bytedance · Last Updated: 2025-05-11

MegaTTS3 Project Detailed Introduction

Project Overview

MegaTTS3 is a high-quality zero-shot text-to-speech system developed by ByteDance, based on Sparse Alignment Enhanced Latent Diffusion Transformer technology. This project is primarily for academic research purposes, providing powerful text-to-speech (TTS) and voice cloning capabilities.

Core Features

🚀 Lightweight and Efficient

  • Parameter Scale: The backbone network of the TTS diffusion transformer has only 0.45B parameters.
  • Efficient Inference: Optimized architecture design for fast speech generation.

🎧 Ultra-High-Quality Voice Cloning

  • Zero-Shot Synthesis: Clones a new speaker's voice without any additional training.
  • High Fidelity: Generated speech quality is close to the original recording.
  • Online Demo: Can be tried online via the Huggingface demo.

🌍 Bilingual Support

  • Multilingual: Supports both Chinese and English speech synthesis.
  • Code Switching: Supports mixed Chinese and English speech generation.
  • Cross-Lingual: English voices can synthesize Chinese speech (with accent control).

✍️ Strong Controllability

  • Accent Strength Control: Adjustable accent level of generated speech.
  • Fine-Grained Pronunciation Adjustment: Supports fine-grained pronunciation and duration adjustments (coming soon).
  • Intelligibility Weight: Controls speech clarity through the p_w parameter.
  • Similarity Weight: Controls similarity to the original voice through the t_w parameter.

Technical Architecture

Main Components

  1. TTS Main Model
  • Based on a Sparse Alignment Enhanced Latent Diffusion Transformer.
  • Supports zero-shot speech synthesis.
  • High-quality voice cloning capability.
  2. Speech-Text Aligner
  • Trained on pseudo-labels produced by multiple MFA expert models.
  • Uses: dataset preparation, noise filtering, phoneme recognition, speech segmentation.
  3. Grapheme-to-Phoneme Converter (G2P)
  • Fine-tuned from the Qwen2.5-0.5B model.
  • Provides robust grapheme-to-phoneme conversion.
  4. WaveVAE
  • A powerful waveform variational autoencoder.
  • Compresses 24 kHz speech into a 25 Hz acoustic latent representation.
  • Reconstructs the original waveform almost losslessly.
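
The WaveVAE compression ratio follows directly from the two rates above: mapping 24 kHz audio to 25 latent frames per second means one latent frame summarizes 960 waveform samples. A minimal sketch of this arithmetic in plain Python (no MegaTTS3 code assumed):

```python
# WaveVAE temporal compression: 24,000 samples/s -> 25 latent frames/s.
SAMPLE_RATE = 24_000   # input waveform rate (Hz)
LATENT_RATE = 25       # acoustic latent rate (Hz)

def latent_frames(num_samples: int) -> int:
    """Number of 25 Hz latent frames covering `num_samples` of 24 kHz audio."""
    hop = SAMPLE_RATE // LATENT_RATE  # 960 samples per latent frame
    return -(-num_samples // hop)     # ceiling division

# A 10-second clip (240,000 samples) compresses to 250 latent frames.
print(latent_frames(10 * SAMPLE_RATE))  # -> 250
```

This 960× temporal compression is what keeps the 0.45B diffusion transformer fast: it models short latent sequences rather than raw waveforms.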

Installation and Usage

System Requirements

  • Python 3.10
  • Linux/Windows/Docker support
  • Optional GPU acceleration (recommended)

Quick Start

  1. Clone Repository
git clone https://github.com/bytedance/MegaTTS3
cd MegaTTS3
  2. Environment Configuration
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt
export PYTHONPATH="/path/to/MegaTTS3:$PYTHONPATH"
  3. Model Download
  • Download the pre-trained models from Google Drive or Huggingface.
  • Place the model files into the ./checkpoints/xxx directory.

Usage Methods

Command-Line Inference (Standard)

# Chinese speech synthesis
python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "另一边的桌上,一位读书人嗤之以鼻道,'佛子三藏,神子燕小鱼是什么样的人物,李家的那个李子夜如何与他们相提并论?'" --output_dir ./gen

# English speech synthesis (high expressiveness)
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders.' --output_dir ./gen --p_w 2.0 --t_w 3.0

Accent Control Synthesis

# Maintain original accent (p_w ≈ 1.0)
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这是一条有口音的音频。' --output_dir ./gen --p_w 1.0 --t_w 3.0

# Standard pronunciation (p_w > 2.0)
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这条音频的发音标准一些了吗?' --output_dir ./gen --p_w 2.5 --t_w 2.5

Web Interface

python tts/gradio_api.py

Parameter Description

Core Parameters

  • p_w (Intelligibility Weight): Controls speech clarity; noisy prompt audio requires a higher p_w value.
  • t_w (Similarity Weight): Controls similarity to the original voice; usually 0-3 points higher than p_w.
  • Inference Steps: Default 10 steps, CPU inference takes approximately 30 seconds.

Accent Control

  • p_w ≈ 1.0: Preserves the speaker's original accent.
  • Higher p_w: Shifts the output towards standard pronunciation.
  • t_w Range: Usually between 2.0 and 5.0; increasing it moderately can improve expressiveness.
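
The rules of thumb above (t_w is usually 0–3 points higher than p_w and kept roughly in 2.0–5.0) can be encoded in a small helper. This is a hypothetical convenience function, not part of the MegaTTS3 CLI:

```python
def suggest_t_w(p_w: float, offset: float = 1.0) -> float:
    """Suggest a similarity weight t_w for a given intelligibility weight p_w.

    Hypothetical helper encoding the guidance above: t_w is usually
    0-3 points higher than p_w, and normally lies in [2.0, 5.0].
    """
    offset = min(max(offset, 0.0), 3.0)      # keep the gap within 0-3
    return min(max(p_w + offset, 2.0), 5.0)  # clamp into the usual t_w range

print(suggest_t_w(2.0))  # -> 3.0, matching the English example above (p_w 2.0, t_w 3.0)
print(suggest_t_w(1.0))  # -> 2.0, an accent-preserving setting
```

The clamping simply prevents accidentally leaving the ranges the project documentation recommends; the exact offset remains a matter of taste per prompt.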

Security and Limitations

Security Considerations

  • WaveVAE Encoder: For security reasons, the encoder parameters are not publicly available.
  • Pre-extracted Latent Representations: Only pre-extracted .npy latent files can be used for inference.
  • Academic Use: The project is primarily for academic research.

Usage Flow

  1. Prepare audio files (.wav format, <24 seconds, filename without spaces).
  2. Upload to the Voice Request Queue.
  3. After security verification, obtain the corresponding .npy latent file.
  4. Use the .wav and .npy files for inference.
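
The checks in the flow above can be scripted as a pre-flight step before calling the CLI. A minimal sketch using only the Python standard library (the file names are illustrative, and the `.npy` check only inspects the NumPy magic bytes):

```python
import os
import wave

MAX_SECONDS = 24  # prompt audio must be shorter than 24 seconds

def check_prompt(wav_path: str) -> str:
    """Validate a prompt .wav and confirm its paired .npy latent exists.

    Returns the matching .npy path; raises ValueError on any violation.
    """
    if " " in os.path.basename(wav_path):
        raise ValueError("filename must not contain spaces")
    if not wav_path.endswith(".wav"):
        raise ValueError("prompt must be a .wav file")
    with wave.open(wav_path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    if duration >= MAX_SECONDS:
        raise ValueError(f"prompt is {duration:.1f}s; must be under {MAX_SECONDS}s")
    npy_path = wav_path[:-4] + ".npy"
    if not os.path.exists(npy_path):
        raise ValueError("missing paired .npy latent from the Voice Request Queue")
    with open(npy_path, "rb") as f:          # sanity-check the NumPy magic bytes
        if f.read(6) != b"\x93NUMPY":
            raise ValueError(f"{npy_path} is not a valid .npy file")
    return npy_path
```

Running such a check before inference catches the common failure modes (spaces in filenames, over-long prompts, missing latents) early, instead of partway through synthesis.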

License and Citation

  • License: Apache-2.0 License
  • Release Date: March 22, 2025
  • Maintainer: ByteDance

Application Scenarios

Main Uses

  1. Speech Synthesis Research: Provides researchers with a high-quality TTS baseline.
  2. Voice Cloning: Enables personalized voice assistants.
  3. Multilingual Applications: Supports bilingual Chinese and English content creation.
  4. Accent Research: Research and control accent features in speech.

Extended Applications

  • Dataset Preparation: Use the aligner to prepare data for model training.
  • Speech Quality Filtering: Filter large-scale speech datasets.
  • Phoneme Recognition: Perform phoneme-level analysis of speech.
  • Voice Conversion: Achieve voice conversion between different speakers.

Precautions

  1. Model Download: Manually download pre-trained model files.
  2. Dependency Management: Pay attention to pydantic and gradio version matching.
  3. Environment Variables: Correctly set PYTHONPATH and CUDA_VISIBLE_DEVICES.
  4. File Format: Input audio must be in .wav format and less than 24 seconds in length.
  5. Security Review: Uploaded audio files must pass a security review.