MegaTTS3: A high-quality zero-shot text-to-speech model developed by ByteDance, supporting Chinese and English voice cloning.

Apache-2.0 · Python · 5.5k stars · bytedance · Last Updated: 2025-05-11

MegaTTS3 Project Detailed Introduction

Project Overview

MegaTTS3 is a high-quality zero-shot text-to-speech system developed by ByteDance, based on Sparse Alignment Enhanced Latent Diffusion Transformer technology. This project is primarily for academic research purposes, providing powerful text-to-speech (TTS) and voice cloning capabilities.

Core Features

🚀 Lightweight and Efficient

  • Parameter Scale: The backbone network of the TTS diffusion transformer has only 0.45B parameters.
  • Efficient Inference: Optimized architecture design for fast speech generation.

🎧 Ultra-High-Quality Voice Cloning

  • Zero-Shot Synthesis: Clones a new speaker's voice without any additional training.
  • High Fidelity: Generated speech quality is close to the original recording.
  • Online Demo: Can be tried online via the Huggingface demo.

🌍 Bilingual Support

  • Multilingual: Supports both Chinese and English speech synthesis.
  • Code Switching: Supports mixed Chinese and English speech generation.
  • Cross-Lingual: English voices can synthesize Chinese speech (with accent control).

✍️ Strong Controllability

  • Accent Strength Control: Adjustable accent level of generated speech.
  • Fine-Grained Pronunciation Adjustment: Supports fine-grained pronunciation and duration adjustments (coming soon).
  • Intelligibility Weight: Controls speech clarity through the p_w parameter.
  • Similarity Weight: Controls similarity to the original voice through the t_w parameter.

Technical Architecture

Main Components

  1. TTS Main Model
  • Based on a Sparse Alignment Enhanced Latent Diffusion Transformer.
  • Supports zero-shot speech synthesis.
  • High-quality voice cloning capability.
  2. Speech-Text Aligner
  • Trained on pseudo-labels produced by multiple MFA expert models.
  • Uses: dataset preparation, noise filtering, phoneme recognition, speech segmentation.
  3. Grapheme-to-Phoneme Converter (G2P)
  • Fine-tuned from the Qwen2.5-0.5B model.
  • Provides robust grapheme-to-phoneme conversion.
  4. WaveVAE
  • A powerful waveform variational autoencoder.
  • Compresses 24 kHz speech into a 25 Hz acoustic latent representation.
  • Reconstructs the original waveform almost losslessly.
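
The WaveVAE compression ratio follows directly from the two rates above: mapping 24 kHz audio to 25 latent frames per second means one latent frame summarizes 960 waveform samples. A minimal sketch of this arithmetic in plain Python (no MegaTTS3 code assumed):

```python
# WaveVAE temporal compression: 24,000 samples/s -> 25 latent frames/s.
SAMPLE_RATE = 24_000   # input waveform rate (Hz)
LATENT_RATE = 25       # acoustic latent rate (Hz)

def latent_frames(num_samples: int) -> int:
    """Number of 25 Hz latent frames covering `num_samples` of 24 kHz audio."""
    hop = SAMPLE_RATE // LATENT_RATE  # 960 samples per latent frame
    return -(-num_samples // hop)     # ceiling division

# A 10-second clip (240,000 samples) compresses to 250 latent frames.
print(latent_frames(10 * SAMPLE_RATE))  # -> 250
```

This 960× temporal compression is what keeps the 0.45B diffusion transformer fast: it models short latent sequences rather than raw waveforms.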

Installation and Usage

System Requirements

  • Python 3.10
  • Linux/Windows/Docker support
  • Optional GPU acceleration (recommended)

Quick Start

  1. Clone Repository
git clone https://github.com/bytedance/MegaTTS3
cd MegaTTS3
  2. Environment Configuration
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt
export PYTHONPATH="/path/to/MegaTTS3:$PYTHONPATH"
  3. Model Download
  • Download the pre-trained models from Google Drive or Huggingface.
  • Place the model files into the ./checkpoints/xxx directory.

Usage Methods

Command-Line Inference (Standard)

# Chinese speech synthesis
python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "另一边的桌上,一位读书人嗤之以鼻道,'佛子三藏,神子燕小鱼是什么样的人物,李家的那个李子夜如何与他们相提并论?'" --output_dir ./gen

# English speech synthesis (high expressiveness)
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders.' --output_dir ./gen --p_w 2.0 --t_w 3.0

Accent Control Synthesis

# Maintain original accent (p_w ≈ 1.0)
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这是一条有口音的音频。' --output_dir ./gen --p_w 1.0 --t_w 3.0

# Standard pronunciation (p_w > 2.0)
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这条音频的发音标准一些了吗?' --output_dir ./gen --p_w 2.5 --t_w 2.5

Web Interface

python tts/gradio_api.py

Parameter Description

Core Parameters

  • p_w (Intelligibility Weight): Controls speech clarity; noisy prompt audio requires a higher p_w value.
  • t_w (Similarity Weight): Controls similarity to the original voice; usually 0-3 points higher than p_w.
  • Inference Steps: Default 10 steps, CPU inference takes approximately 30 seconds.

Accent Control

  • p_w ≈ 1.0: Preserves the speaker's original accent.
  • Higher p_w: Shifts the output towards standard pronunciation.
  • t_w Range: Usually between 2.0 and 5.0; increasing it moderately can improve expressiveness.
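
The rules of thumb above (t_w is usually 0–3 points higher than p_w and kept roughly in 2.0–5.0) can be encoded in a small helper. This is a hypothetical convenience function, not part of the MegaTTS3 CLI:

```python
def suggest_t_w(p_w: float, offset: float = 1.0) -> float:
    """Suggest a similarity weight t_w for a given intelligibility weight p_w.

    Hypothetical helper encoding the guidance above: t_w is usually
    0-3 points higher than p_w, and normally lies in [2.0, 5.0].
    """
    offset = min(max(offset, 0.0), 3.0)      # keep the gap within 0-3
    return min(max(p_w + offset, 2.0), 5.0)  # clamp into the usual t_w range

print(suggest_t_w(2.0))  # -> 3.0, matching the English example above (p_w 2.0, t_w 3.0)
print(suggest_t_w(1.0))  # -> 2.0, an accent-preserving setting
```

The clamping simply prevents accidentally leaving the ranges the project documentation recommends; the exact offset remains a matter of taste per prompt.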

Security and Limitations

Security Considerations

  • WaveVAE Encoder: For security reasons, the encoder parameters are not publicly available.
  • Pre-extracted Latent Representations: Only pre-extracted .npy latent files can be used for inference.
  • Academic Use: The project is primarily for academic research.

Usage Flow

  1. Prepare audio files (.wav format, <24 seconds, filename without spaces).
  2. Upload to the Voice Request Queue.
  3. After security verification, obtain the corresponding .npy latent file.
  4. Use the .wav and .npy files for inference.
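
The checks in the flow above can be scripted as a pre-flight step before calling the CLI. A minimal sketch using only the Python standard library (the file names are illustrative, and the `.npy` check only inspects the NumPy magic bytes):

```python
import os
import wave

MAX_SECONDS = 24  # prompt audio must be shorter than 24 seconds

def check_prompt(wav_path: str) -> str:
    """Validate a prompt .wav and confirm its paired .npy latent exists.

    Returns the matching .npy path; raises ValueError on any violation.
    """
    if " " in os.path.basename(wav_path):
        raise ValueError("filename must not contain spaces")
    if not wav_path.endswith(".wav"):
        raise ValueError("prompt must be a .wav file")
    with wave.open(wav_path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    if duration >= MAX_SECONDS:
        raise ValueError(f"prompt is {duration:.1f}s; must be under {MAX_SECONDS}s")
    npy_path = wav_path[:-4] + ".npy"
    if not os.path.exists(npy_path):
        raise ValueError("missing paired .npy latent from the Voice Request Queue")
    with open(npy_path, "rb") as f:          # sanity-check the NumPy magic bytes
        if f.read(6) != b"\x93NUMPY":
            raise ValueError(f"{npy_path} is not a valid .npy file")
    return npy_path
```

Running such a check before inference catches the common failure modes (spaces in filenames, over-long prompts, missing latents) early, instead of partway through synthesis.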

License and Citation

  • License: Apache-2.0 License
  • Release Date: March 22, 2025
  • Maintainer: ByteDance

Application Scenarios

Main Uses

  1. Speech Synthesis Research: Provides researchers with a high-quality TTS baseline.
  2. Voice Cloning: Enables personalized voice assistants.
  3. Multilingual Applications: Supports bilingual Chinese and English content creation.
  4. Accent Research: Research and control accent features in speech.

Extended Applications

  • Dataset Preparation: Use the aligner to prepare data for model training.
  • Speech Quality Filtering: Filter large-scale speech datasets.
  • Phoneme Recognition: Perform phoneme-level analysis of speech.
  • Voice Conversion: Achieve voice conversion between different speakers.

Precautions

  1. Model Download: Manually download pre-trained model files.
  2. Dependency Management: Pay attention to pydantic and gradio version matching.
  3. Environment Variables: Correctly set PYTHONPATH and CUDA_VISIBLE_DEVICES.
  4. File Format: Input audio must be in .wav format and less than 24 seconds in length.
  5. Security Review: Uploaded audio files must pass a security review.