Advanced open-source TTS model series supporting multilingual speech generation, 3-second voice cloning, and ultra-low-latency streaming synthesis

Last Updated: January 25, 2026

Qwen3-TTS: Advanced Multilingual Text-to-Speech Model Series

Project Overview

Qwen3-TTS is an open-source series of advanced text-to-speech (TTS) models developed by the Qwen team at Alibaba Cloud. Released in January 2026, the suite combines multilingual voice generation, 3-second voice cloning, and ultra-low-latency streaming synthesis in a single openly licensed model family.

Key Features and Capabilities

Core Functionality

  • Multilingual Support: Native support for 10 major languages including Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian
  • Voice Cloning: State-of-the-art 3-second rapid voice cloning from minimal audio input (a hypothetical call is sketched after this list)
  • Voice Design: Create entirely new voices using natural language descriptions
  • Streaming Generation: Ultra-low-latency streaming with 97ms first-packet emission
  • Custom Voice Control: Fine-grained control over acoustic attributes including timbre, emotion, and prosody
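
As a concrete taste of the cloning feature, the call below sketches how a roughly 3-second reference clip might be supplied. The ref_audio and ref_text parameter names are illustrative assumptions rather than the confirmed qwen-tts signature, and tts is a model loaded as in the Basic Usage Example further down.

# Hypothetical voice-cloning call; the parameter names are assumptions,
# not the confirmed API. "tts" is loaded as in the Basic Usage Example.
wavs, sr = tts.generate_speech(
    "This sentence should come out in the cloned voice.",
    ref_audio="speaker_sample_3s.wav",   # ~3 s of the target speaker
    ref_text="Transcript of the reference clip."
)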

Technical Architecture

Dual-Track Language Model Architecture

Qwen3-TTS employs an innovative dual-track hybrid streaming generation architecture that supports both streaming and non-streaming generation modes. The design lets audio output begin after as little as a single character of input text, making it well suited to real-time interactive applications.
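
To make the dual-track idea concrete, the loop below shows how such a streaming interface is typically consumed: each chunk is playable the moment it arrives, instead of waiting for the full utterance. The method name generate_speech_streaming is an assumption for illustration, not the confirmed qwen-tts API; tts is a model loaded as in the Basic Usage Example below.

import numpy as np

# Hypothetical streaming loop; the method name and chunk format are assumptions.
chunks = []
for chunk in tts.generate_speech_streaming("Streaming synthesis demo."):
    chunks.append(chunk)            # each chunk is playable on arrival
audio = np.concatenate(chunks)      # full waveform once generation finishes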

Two Speech Tokenizers

  1. Qwen-TTS-Tokenizer-25Hz:

    • Single-codebook codec emphasizing semantic content
    • Seamless integration with Qwen-Audio models
    • Supports streaming waveform reconstruction via block-wise DiT
  2. Qwen-TTS-Tokenizer-12Hz:

    • Multi-codebook design with 16 layers operating at 12.5 Hz
    • Extreme bitrate reduction for ultra-low-latency streaming (see the token-rate arithmetic after this list)
    • Lightweight causal ConvNet for efficient speech reconstruction
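
A quick back-of-the-envelope calculation, using only the figures stated above (12.5 Hz frame rate, 16 codebooks per frame), shows why this design suits low-latency streaming: a single decoding step already corresponds to 80 ms of audio.

# Token-rate arithmetic from the stated 12Hz tokenizer figures.
frame_rate_hz = 12.5
codebooks = 16
tokens_per_second = frame_rate_hz * codebooks   # 200 codec tokens per second
frame_duration_ms = 1000 / frame_rate_hz        # each frame covers 80 ms of audio
print(tokens_per_second, frame_duration_ms)     # 200.0, 80.0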

Model Variants

Available Models

  • Qwen3-TTS-12Hz-1.7B-Base: Foundation model for voice cloning and fine-tuning
  • Qwen3-TTS-12Hz-1.7B-CustomVoice: Pre-configured with 9 premium voice timbres
  • Qwen3-TTS-12Hz-1.7B-VoiceDesign: Specialized for description-based voice creation
  • Qwen3-TTS-12Hz-0.6B-CustomVoice: Lightweight version with custom voice capabilities
  • Qwen3-TTS-12Hz-0.6B-Base: Compact foundation model

Training Data

  • Trained on over 5 million hours of high-quality speech data
  • Comprehensive coverage across 10 languages and multiple dialectal profiles
  • Advanced contextual understanding for adaptive tone and emotional expression control

Technical Innovations

Advanced Speech Representation

  • Semantic-Acoustic Disentanglement: Separates high-level semantic content from acoustic details
  • Multi-Token Prediction (MTP): Enables immediate speech decoding from the first codec frame (a toy sketch follows this list)
  • GAN-based Training: The generator operates on raw waveforms, with a discriminator guiding it toward more natural-sounding output
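
The toy PyTorch module below illustrates the MTP idea referenced in the list above: one hidden state is projected to logits for every codebook in parallel, so a complete codec frame exists after a single decoding step. All sizes are assumptions for illustration; this is not the actual Qwen3-TTS implementation.

import torch
import torch.nn as nn

NUM_CODEBOOKS = 16     # matches the 12Hz tokenizer's 16-layer design
CODEBOOK_SIZE = 1024   # assumed vocabulary size per codebook
HIDDEN_DIM = 512       # assumed LM hidden width

class MTPHead(nn.Module):
    """Projects one hidden state to token ids for every codebook at once."""
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(HIDDEN_DIM, CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)
        )

    def forward(self, h):  # h: (batch, hidden)
        # One step yields a full codec frame: 16 token ids per batch item.
        return torch.stack([head(h).argmax(-1) for head in self.heads], dim=-1)

h = torch.randn(1, HIDDEN_DIM)   # stand-in for the LM's last hidden state
frame_tokens = MTPHead()(h)      # shape (1, 16): one complete codec frame
print(frame_tokens.shape)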

Streaming Capabilities

  • Causal Architecture: Fully causal feature encoders and decoders for real-time processing
  • Real-time Synthesis: End-to-end synthesis latency as low as 97ms
  • Incremental Decoding: Progressive audio reconstruction from discrete tokens

Installation and Usage

Quick Installation

# Create isolated environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# Install via PyPI
pip install qwen-tts

# Optional: FlashAttention 2 for memory optimization
pip install flash-attn

Development Installation

git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .

Basic Usage Example

from qwen_tts import Qwen3TTSModel
import torch

# Load model
tts = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

# Generate speech
text = "Hello, this is Qwen3-TTS speaking!"
wavs, sr = tts.generate_speech(text)
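
Persisting the result is straightforward if, as the variable names above suggest, generate_speech returns a list of waveform arrays plus a sample rate (an assumption worth checking against the repository's examples):

import soundfile as sf

# Save the first generated waveform; assumes wavs[0] is a NumPy array
# and sr is its sample rate, per the variable names above.
sf.write("output.wav", wavs[0], sr)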

Performance and Benchmarks

State-of-the-Art Results

  • Superior performance on multilingual TTS test sets
  • Excellent scores on InstructTTSEval benchmarks
  • Outstanding results on long speech generation tasks
  • Robust handling of noisy input text

Quality Metrics

  • High-fidelity speech reconstruction
  • Natural prosody and emotional expression
  • Consistent voice quality across languages
  • Minimal artifacts in streaming mode

Integration and Deployment

Platform Support

  • vLLM-Omni: Official day-0 support for deployment and inference
  • ComfyUI: Multiple community implementations for workflow integration
  • Hugging Face: Direct model hosting and inference APIs
  • DashScope API: Alibaba Cloud's optimized deployment platform

Hardware Requirements

  • CUDA-compatible GPU recommended
  • FlashAttention 2 compatible hardware for optimal performance
  • Support for torch.float16 or torch.bfloat16 precision (a capability probe is sketched below)
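
A small capability probe can choose settings that match this guidance before calling from_pretrained. The attn_implementation values assume the loader follows Hugging Face transformers conventions, which the Basic Usage Example suggests but does not guarantee.

import importlib.util
import torch

# Prefer bfloat16 where the GPU supports it, else fall back to float16;
# CPU float32 is a last resort and will be much slower.
if torch.cuda.is_available():
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    device_map = "cuda:0"
else:
    dtype, device_map = torch.float32, "cpu"

# Request FlashAttention 2 only if the package is actually installed.
attn = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
print(dtype, device_map, attn)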

Community and Ecosystem

Open Source Commitment

  • Released under Apache 2.0 License
  • Full model weights and tokenizers available
  • Comprehensive documentation and examples
  • Active community development support

Community Integrations

  • Multiple ComfyUI custom node implementations
  • Third-party wrapper libraries and tools
  • Integration with popular ML frameworks
  • Extensive example code and tutorials

Research and Development

Technical Paper

The project is accompanied by a comprehensive technical report (arXiv:2601.15621) detailing the architecture, training methodology, and performance evaluations.

Future Roadmap

  • Enhanced online serving capabilities
  • Additional language support
  • Improved streaming performance optimizations
  • Extended integration with multimodal AI systems

Conclusion

Qwen3-TTS represents a significant leap forward in open-source text-to-speech technology. With its combination of multilingual support, ultra-low latency streaming, advanced voice cloning capabilities, and robust performance across diverse scenarios, it sets a new standard for accessible, high-quality speech synthesis. The project's commitment to open-source development and comprehensive documentation makes it an excellent choice for researchers, developers, and organizations seeking state-of-the-art TTS capabilities.
