Advanced open-source TTS model series supporting multilingual speech generation, 3-second voice cloning, and ultra-low-latency streaming synthesis
Qwen3-TTS: Advanced Multilingual Text-to-Speech Model Series
Project Overview
Qwen3-TTS is an open-source series of advanced text-to-speech (TTS) models developed by the Qwen team at Alibaba Cloud. Released in January 2026, the suite spans multilingual voice generation, rapid voice cloning, and ultra-low-latency streaming synthesis.
Key Features and Capabilities
Core Functionality
- Multilingual Support: Native support for 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian)
- Voice Cloning: State-of-the-art 3-second rapid voice cloning from minimal audio input
- Voice Design: Create entirely new voices using natural language descriptions
- Streaming Generation: Ultra-low-latency streaming with 97ms first-packet emission
- Custom Voice Control: Fine-grained control over acoustic attributes including timbre, emotion, and prosody
Technical Architecture
Dual-Track Language Model Architecture
Qwen3-TTS employs an innovative dual-track hybrid streaming generation architecture that supports both streaming and non-streaming generation modes. This design allows audio output to begin as soon as the first character of input text arrives, making it well suited to real-time interactive applications.
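To make the streaming mode concrete, the sketch below shows how a client loop might consume audio chunks as they are produced and measure time-to-first-packet. It is a minimal sketch under an assumption: the generator-style method generate_speech_streaming and its (chunk, sample rate) yield format are illustrative placeholders, not a documented API.
import time

def consume_stream(tts, text):
    # Hypothetical streaming call, assumed to yield (audio_chunk, sample_rate)
    # tuples incrementally; the real entry point may differ.
    start = time.perf_counter()
    first_packet_ms = None
    chunks, sample_rate = [], None
    for chunk, sample_rate in tts.generate_speech_streaming(text):
        if first_packet_ms is None:
            # Time from request to first audio packet (the first-packet metric quoted for this model)
            first_packet_ms = (time.perf_counter() - start) * 1000
        chunks.append(chunk)  # in a real app, push straight to an audio output queue
    print(f"first packet after {first_packet_ms:.0f} ms, {len(chunks)} chunks")
    return chunks, sample_rate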
Two Speech Tokenizers
Qwen-TTS-Tokenizer-25Hz:
- Single-codebook codec emphasizing semantic content
- Seamless integration with Qwen-Audio models
- Supports streaming waveform reconstruction via block-wise DiT
Qwen-TTS-Tokenizer-12Hz:
- Multi-codebook design with 16 layers operating at 12.5 Hz
- Extreme bitrate reduction for ultra-low-latency streaming (see the frame-rate arithmetic after this list)
- Lightweight causal ConvNet for efficient speech reconstruction
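To put the two frame rates in perspective, the arithmetic below counts how many codec frames the language model must emit per second of audio under each tokenizer, using only the frame rates and codebook counts listed above.
# Codec frame rates stated above
frames_per_sec_25hz = 25.0   # Qwen-TTS-Tokenizer-25Hz: single codebook per frame
frames_per_sec_12hz = 12.5   # Qwen-TTS-Tokenizer-12Hz: 16 codebooks per frame

clip_seconds = 10.0

# Autoregressive frames needed for a 10-second clip
print(frames_per_sec_25hz * clip_seconds)   # 250 frames at 25 Hz
print(frames_per_sec_12hz * clip_seconds)   # 125 frames at 12.5 Hz

# Halving the frame rate halves the number of sequential decoding steps per
# second of audio, which is what makes the 12.5 Hz tokenizer suited to
# ultra-low-latency streaming.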
Model Variants
Available Models
- Qwen3-TTS-12Hz-1.7B-Base: Foundation model for voice cloning and fine-tuning
- Qwen3-TTS-12Hz-1.7B-CustomVoice: Pre-configured with 9 premium voice timbres
- Qwen3-TTS-12Hz-1.7B-VoiceDesign: Specialized for description-based voice creation
- Qwen3-TTS-12Hz-0.6B-CustomVoice: Lightweight version with custom voice capabilities
- Qwen3-TTS-12Hz-0.6B-Base: Compact foundation model
Training Data
- Trained on over 5 million hours of high-quality speech data
- Comprehensive coverage across 10 languages and multiple dialectal profiles
- Advanced contextual understanding for adaptive tone and emotional expression control
Technical Innovations
Advanced Speech Representation
- Semantic-Acoustic Disentanglement: Separates high-level semantic content from acoustic details
- Multi-Token Prediction (MTP): Enables speech decoding to begin from the first codec frame
- GAN-based Training: The generator operates on raw waveforms while a discriminator drives improvements in naturalness
Streaming Capabilities
- Causal Architecture: Fully causal feature encoders and decoders for real-time processing
- Real-time Synthesis: End-to-end synthesis latency as low as 97ms
- Incremental Decoding: Progressive audio reconstruction from discrete tokens
Installation and Usage
Quick Installation
# Create isolated environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
# Install via PyPI
pip install qwen-tts
# Optional: FlashAttention 2 for memory optimization
pip install flash-attn
Development Installation
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .
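After either installation route, a quick import check confirms that the package and a CUDA device are visible; this only assumes the qwen_tts import name used in the usage example below.
# Sanity check after installation (assumes the qwen_tts import name shown below)
import torch
import qwen_tts
print("CUDA available:", torch.cuda.is_available())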
Basic Usage Example
from qwen_tts import Qwen3TTSModel
import torch
# Load model
tts = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
# Generate speech
text = "Hello, this is Qwen3-TTS speaking!"
wavs, sr = tts.generate_speech(text)
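To keep the example self-contained, the generated audio can be written to disk with the soundfile package; this assumes wavs is a list (or batch) of waveform arrays and sr is the sample rate, matching the variable names above but otherwise an assumption about the return types.
import soundfile as sf

# Assumes `wavs` holds one waveform array per generated utterance
sf.write("output.wav", wavs[0], sr)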
Performance and Benchmarks
State-of-the-Art Results
- Superior performance on multilingual TTS test sets
- Excellent scores on InstructTTSEval benchmarks
- Outstanding results on long speech generation tasks
- Robust handling of noisy input text
Quality Metrics
- High-fidelity speech reconstruction
- Natural prosody and emotional expression
- Consistent voice quality across languages
- Minimal artifacts in streaming mode
Integration and Deployment
Platform Support
- vLLM-Omni: Official day-0 support for deployment and inference
- ComfyUI: Multiple community implementations for workflow integration
- Hugging Face: Direct model hosting and inference APIs
- DashScope API: Alibaba Cloud's optimized deployment platform
Hardware Requirements
- CUDA-compatible GPU recommended
- FlashAttention 2 compatible hardware for optimal performance
- Support for torch.float16 or torch.bfloat16 precision (see the loading sketch after this list)
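As a practical illustration, model loading can be adapted to the available hardware: prefer bfloat16 where the GPU supports it, fall back to float16 otherwise, and request FlashAttention 2 only when the flash-attn package is installed. The from_pretrained call mirrors the usage example above; the fallback logic itself is a minimal sketch, not a project-recommended configuration.
import importlib.util
import torch
from qwen_tts import Qwen3TTSModel

# Prefer bfloat16 on GPUs that support it, otherwise fall back to float16
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

# Request FlashAttention 2 only when the flash-attn package is actually installed
kwargs = {"device_map": "cuda:0", "dtype": dtype}
if importlib.util.find_spec("flash_attn") is not None:
    kwargs["attn_implementation"] = "flash_attention_2"

tts = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-Base", **kwargs)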
Community and Ecosystem
Open Source Commitment
- Released under Apache 2.0 License
- Full model weights and tokenizers available
- Comprehensive documentation and examples
- Active community development support
Community Integrations
- Multiple ComfyUI custom node implementations
- Third-party wrapper libraries and tools
- Integration with popular ML frameworks
- Extensive example code and tutorials
Research and Development
Technical Paper
The project is accompanied by a comprehensive technical report (arXiv:2601.15621) detailing the architecture, training methodology, and performance evaluations.
Future Roadmap
- Enhanced online serving capabilities
- Additional language support
- Improved streaming performance optimizations
- Extended integration with multimodal AI systems
Conclusion
Qwen3-TTS represents a significant leap forward in open-source text-to-speech technology. With its combination of multilingual support, ultra-low latency streaming, advanced voice cloning capabilities, and robust performance across diverse scenarios, it sets a new standard for accessible, high-quality speech synthesis. The project's commitment to open-source development and comprehensive documentation makes it an excellent choice for researchers, developers, and organizations seeking state-of-the-art TTS capabilities.