Advanced open-source TTS model series supporting multilingual speech generation, 3-second voice cloning, and ultra-low-latency streaming synthesis
Qwen3-TTS: Advanced Multilingual Text-to-Speech Model Series
Project Overview
Qwen3-TTS is an open-source series of advanced text-to-speech (TTS) models developed by the Qwen team at Alibaba Cloud. Released in January 2026, the suite spans multilingual voice generation, rapid voice cloning, and ultra-low-latency streaming synthesis.
Key Features and Capabilities
Core Functionality
- Multilingual Support: Native support for 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian)
- Voice Cloning: State-of-the-art 3-second rapid voice cloning from minimal audio input
- Voice Design: Create entirely new voices using natural language descriptions
- Streaming Generation: Ultra-low-latency streaming with 97ms first-packet emission
- Custom Voice Control: Fine-grained control over acoustic attributes including timbre, emotion, and prosody
Technical Architecture
Dual-Track Language Model Architecture
Qwen3-TTS employs an innovative dual-track hybrid streaming generation architecture that supports both streaming and non-streaming generation modes. This design allows audio output to begin as soon as the first character of input text arrives, making it well suited to real-time interactive applications.
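To make the streaming mode concrete, the sketch below shows how a client loop might consume audio chunks as they are produced and measure time-to-first-packet. It is a minimal sketch under an assumption: the generator-style method generate_speech_streaming and its (chunk, sample rate) yield format are illustrative placeholders, not a documented API.
import time

def consume_stream(tts, text):
    # Hypothetical streaming call, assumed to yield (audio_chunk, sample_rate)
    # tuples incrementally; the real entry point may differ.
    start = time.perf_counter()
    first_packet_ms = None
    chunks, sample_rate = [], None
    for chunk, sample_rate in tts.generate_speech_streaming(text):
        if first_packet_ms is None:
            # Time from request to first audio packet (the first-packet metric quoted for this model)
            first_packet_ms = (time.perf_counter() - start) * 1000
        chunks.append(chunk)  # in a real app, push straight to an audio output queue
    print(f"first packet after {first_packet_ms:.0f} ms, {len(chunks)} chunks")
    return chunks, sample_rate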
Two Speech Tokenizers
Qwen-TTS-Tokenizer-25Hz:
- Single-codebook codec emphasizing semantic content
- Seamless integration with Qwen-Audio models
- Supports streaming waveform reconstruction via block-wise DiT
Qwen-TTS-Tokenizer-12Hz:
- Multi-codebook design with 16 layers operating at 12.5 Hz
- Extreme bitrate reduction for ultra-low-latency streaming (see the frame-rate arithmetic after this list)
- Lightweight causal ConvNet for efficient speech reconstruction
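To put the two frame rates in perspective, the arithmetic below counts how many codec frames the language model must emit per second of audio under each tokenizer, using only the frame rates and codebook counts listed above.
# Codec frame rates stated above
frames_per_sec_25hz = 25.0   # Qwen-TTS-Tokenizer-25Hz: single codebook per frame
frames_per_sec_12hz = 12.5   # Qwen-TTS-Tokenizer-12Hz: 16 codebooks per frame

clip_seconds = 10.0

# Autoregressive frames needed for a 10-second clip
print(frames_per_sec_25hz * clip_seconds)   # 250 frames at 25 Hz
print(frames_per_sec_12hz * clip_seconds)   # 125 frames at 12.5 Hz

# Halving the frame rate halves the number of sequential decoding steps per
# second of audio, which is what makes the 12.5 Hz tokenizer suited to
# ultra-low-latency streaming.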
Model Variants
Available Models
- Qwen3-TTS-12Hz-1.7B-Base: Foundation model for voice cloning and fine-tuning
- Qwen3-TTS-12Hz-1.7B-CustomVoice: Pre-configured with 9 premium voice timbres
- Qwen3-TTS-12Hz-1.7B-VoiceDesign: Specialized for description-based voice creation
- Qwen3-TTS-12Hz-0.6B-CustomVoice: Lightweight version with custom voice capabilities
- Qwen3-TTS-12Hz-0.6B-Base: Compact foundation model
Training Data
- Trained on over 5 million hours of high-quality speech data
- Comprehensive coverage across 10 languages and multiple dialectal profiles
- Advanced contextual understanding for adaptive tone and emotional expression control
Technical Innovations
Advanced Speech Representation
- Semantic-Acoustic Disentanglement: Separates high-level semantic content from acoustic details
- Multi-Token Prediction (MTP): Enables speech decoding to begin from the first codec frame
- GAN-based Training: The generator operates on raw waveforms while a discriminator drives improvements in naturalness
Streaming Capabilities
- Causal Architecture: Fully causal feature encoders and decoders for real-time processing
- Real-time Synthesis: End-to-end synthesis latency as low as 97ms
- Incremental Decoding: Progressive audio reconstruction from discrete tokens
Installation and Usage
Quick Installation
# Create isolated environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
# Install via PyPI
pip install qwen-tts
# Optional: FlashAttention 2 for memory optimization
pip install flash-attn
Development Installation
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .
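After either installation route, a quick import check confirms that the package and a CUDA device are visible; this only assumes the qwen_tts import name used in the usage example below.
# Sanity check after installation (assumes the qwen_tts import name shown below)
import torch
import qwen_tts
print("CUDA available:", torch.cuda.is_available())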
Basic Usage Example
from qwen_tts import Qwen3TTSModel
import torch
# Load model
tts = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
# Generate speech
text = "Hello, this is Qwen3-TTS speaking!"
wavs, sr = tts.generate_speech(text)
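To keep the example self-contained, the generated audio can be written to disk with the soundfile package; this assumes wavs is a list (or batch) of waveform arrays and sr is the sample rate, matching the variable names above but otherwise an assumption about the return types.
import soundfile as sf

# Assumes `wavs` holds one waveform array per generated utterance
sf.write("output.wav", wavs[0], sr)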
Performance and Benchmarks
State-of-the-Art Results
- Superior performance on multilingual TTS test sets
- Excellent scores on InstructTTSEval benchmarks
- Outstanding results on long speech generation tasks
- Robust handling of noisy input text
Quality Metrics
- High-fidelity speech reconstruction
- Natural prosody and emotional expression
- Consistent voice quality across languages
- Minimal artifacts in streaming mode
Integration and Deployment
Platform Support
- vLLM-Omni: Official day-0 support for deployment and inference
- ComfyUI: Multiple community implementations for workflow integration
- Hugging Face: Direct model hosting and inference APIs
- DashScope API: Alibaba Cloud's optimized deployment platform
Hardware Requirements
- CUDA-compatible GPU recommended
- FlashAttention 2 compatible hardware for optimal performance
- Support for torch.float16 or torch.bfloat16 precision (see the loading sketch after this list)
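As a practical illustration, model loading can be adapted to the available hardware: prefer bfloat16 where the GPU supports it, fall back to float16 otherwise, and request FlashAttention 2 only when the flash-attn package is installed. The from_pretrained call mirrors the usage example above; the fallback logic itself is a minimal sketch, not a project-recommended configuration.
import importlib.util
import torch
from qwen_tts import Qwen3TTSModel

# Prefer bfloat16 on GPUs that support it, otherwise fall back to float16
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

# Request FlashAttention 2 only when the flash-attn package is actually installed
kwargs = {"device_map": "cuda:0", "dtype": dtype}
if importlib.util.find_spec("flash_attn") is not None:
    kwargs["attn_implementation"] = "flash_attention_2"

tts = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-Base", **kwargs)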
Community and Ecosystem
Open Source Commitment
- Released under Apache 2.0 License
- Full model weights and tokenizers available
- Comprehensive documentation and examples
- Active community development support
Community Integrations
- Multiple ComfyUI custom node implementations
- Third-party wrapper libraries and tools
- Integration with popular ML frameworks
- Extensive example code and tutorials
Research and Development
Technical Paper
The project is accompanied by a comprehensive technical report (arXiv:2601.15621) detailing the architecture, training methodology, and performance evaluations.
Future Roadmap
- Enhanced online serving capabilities
- Additional language support
- Improved streaming performance optimizations
- Extended integration with multimodal AI systems
Conclusion
Qwen3-TTS represents a significant leap forward in open-source text-to-speech technology. With its combination of multilingual support, ultra-low latency streaming, advanced voice cloning capabilities, and robust performance across diverse scenarios, it sets a new standard for accessible, high-quality speech synthesis. The project's commitment to open-source development and comprehensive documentation makes it an excellent choice for researchers, developers, and organizations seeking state-of-the-art TTS capabilities.