
Spark-TTS: An efficient text-to-speech system based on large language models, supporting zero-shot voice cloning and controllable speech generation.

Apache-2.0 · Python · 9.8k · SparkAudio · Last Updated: 2025-04-09

Spark-TTS Project Detailed Introduction

Project Overview

Spark-TTS is an advanced text-to-speech (TTS) system based on a large language model (LLM), developed by the SparkAudio team. The system uses an innovative single-stream decoupled speech token design to produce high-quality, natural-sounding speech. Built on the Qwen2.5 large language model, the project is designed to be efficient, flexible, and powerful for both research and production environments.

Core Features and Characteristics

1. Concise and Efficient Architecture Design

  • Fully built on Qwen2.5, eliminating the need for additional generative models (such as flow-matching models).
  • Reconstructs audio directly from the speech codes predicted by the LLM, simplifying the processing pipeline.
  • Improves efficiency and reduces system complexity.

2. Zero-Shot Voice Cloning

  • Supports zero-shot voice cloning, replicating a speaker's voice from a short reference recording without any speaker-specific training data (see the command sketch below).
  • Ideal for cross-language and code-switching scenarios.
  • Seamlessly switches between different languages and voices.
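
As an illustration, zero-shot cloning is typically driven by a short reference recording plus its transcript. The following is a minimal command sketch assuming the repository's cli.inference entry point and the prompt-audio arguments described in the project README; exact flag names should be verified against the current code:

# Zero-shot voice cloning from a short reference recording (illustrative sketch)
python -m cli.inference \
    --text "Text to be synthesized in the cloned voice." \
    --device 0 \
    --save_dir example/results \
    --model_dir pretrained_models/Spark-TTS-0.5B \
    --prompt_text "Transcript of the reference recording." \
    --prompt_speech_path example/prompt.wav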

3. Bilingual Support Capability

  • Supports Chinese and English speech synthesis.
  • Possesses cross-lingual zero-shot voice cloning capabilities.
  • Maintains high naturalness and accuracy in multilingual environments.

4. Controllable Speech Generation

  • Supports creating virtual speakers by adjusting parameters.
  • Allows control over voice characteristics such as gender, pitch, and speaking rate.
  • Provides both coarse-grained attribute control and fine-grained parameter adjustment, as illustrated in the sketch below.
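
Attribute-based control can be expressed through command-line options. Below is a hypothetical invocation: the --gender, --pitch, and --speed flags are assumptions that map directly to the attributes listed above, so consult the CLI help for the actual option names and accepted values:

# Creating a virtual speaker from attribute settings (flag names are illustrative)
python -m cli.inference \
    --text "A virtual voice defined purely by attribute settings." \
    --device 0 \
    --save_dir example/results \
    --model_dir pretrained_models/Spark-TTS-0.5B \
    --gender female \
    --pitch moderate \
    --speed moderate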

5. Advanced Technical Architecture

  • BiCodec Technology: A single-stream speech codec that decomposes speech into two complementary token types (see the conceptual sketch after this list):
    • Low-bitrate semantic tokens: for language content.
    • Fixed-length global tokens: for speaker-specific attributes.
  • Chain-of-Thought (CoT) Generation Method: Combines decoupled representations for precise control.
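
To make the two-stream idea concrete, the toy Python sketch below mimics the shapes involved: a variable-length stream of low-bitrate semantic tokens for linguistic content and a short fixed-length stream of global tokens for speaker attributes. All names, codebook sizes, and lengths here are illustrative, not the actual BiCodec implementation:

# Conceptual sketch of BiCodec's two token streams (all sizes are illustrative)
import torch

def encode_sketch(waveform: torch.Tensor):
    # Semantic tokens: one discrete code per frame, carrying linguistic content.
    num_frames = waveform.shape[-1] // 320            # illustrative hop size
    semantic_tokens = torch.randint(0, 8192, (num_frames,))
    # Global tokens: a short, fixed-length sequence encoding speaker attributes.
    global_tokens = torch.randint(0, 4096, (32,))     # illustrative fixed length
    return semantic_tokens, global_tokens

waveform = torch.zeros(16000)                         # 1 second of dummy audio at 16 kHz
semantic, speaker = encode_sketch(waveform)
print(semantic.shape, speaker.shape)                  # content stream vs. fixed speaker stream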

Technical Specifications

System Requirements

  • Operating System: Linux (primarily supported), Windows (refer to the installation guide).
  • Python Version: 3.12+
  • Deep Learning Framework: PyTorch 2.5+
  • License: Apache 2.0

Model Information

  • Model Name: Spark-TTS-0.5B
  • Hosting Platform: Hugging Face
  • Supported Platform: NVIDIA Triton Inference Server

Installation and Usage

Basic Installation

# Clone the repository
git clone https://github.com/SparkAudio/Spark-TTS.git
cd Spark-TTS

# Create a Conda environment
conda create -n sparktts -y python=3.12
conda activate sparktts
pip install -r requirements.txt

Model Download

# Download via Python
from huggingface_hub import snapshot_download
snapshot_download("SparkAudio/Spark-TTS-0.5B", local_dir="pretrained_models/Spark-TTS-0.5B")
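
Alternatively, the same checkpoint can be fetched from the shell; the commands below assume git-lfs is installed, and the Hugging Face repository path mirrors the model name above:

# Download via git (requires git-lfs)
git lfs install
git clone https://huggingface.co/SparkAudio/Spark-TTS-0.5B pretrained_models/Spark-TTS-0.5B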

Usage Methods

  1. Command-Line Interface: Supports direct command-line inference.
  2. Web UI Interface: Provides a graphical interface with support for voice cloning and voice creation (launch command shown below).
  3. API Interface: Supports programmatic invocation.
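
For example, the graphical interface can be started from the repository root; this assumes the webui.py script shipped with the project, and the device flag may differ in newer versions:

# Launch the Web UI (voice cloning and voice creation tabs)
python webui.py --device 0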

Performance Metrics

Inference Performance

  • Benchmarked on a single L20 GPU.
  • Test data: 26 different prompt audio/target text pairs (totaling 169 seconds of audio).
  • Supports high concurrency processing.
  • Provides Real-Time Factor (RTF) performance metrics (see the note below).
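
For reference, the Real-Time Factor is simply the wall-clock synthesis time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time generation. A minimal illustration (the 42-second timing is a made-up example, not a published result):

# RTF = synthesis time / duration of generated audio
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# Example: 169 s of audio produced in 42 s of wall-clock time (illustrative numbers)
print(real_time_factor(42.0, 169.0))  # ~0.25, i.e. roughly 4x faster than real time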

Voice Quality

  • High-quality zero-shot voice cloning results.
  • Supports voice replication of various well-known figures and characters.
  • Maintains excellent performance in both Chinese and English environments.

Application Scenarios

Academic Research

  • Speech synthesis technology research.
  • Linguistics research.
  • Artificial intelligence and machine learning research.

Practical Applications

  • Personalized speech synthesis.
  • Assistive technology development.
  • Multimedia content production.
  • Cross-language communication tools.

Technical Advantages

  1. Innovative Architecture: Novel design based on single-stream decoupled speech tokens.
  2. Efficient Implementation: Directly reconstructs audio from LLM output, avoiding complex intermediate steps.
  3. Flexible Control: Supports multi-level control of voice characteristics.
  4. Cross-Lingual Capability: Excellent multilingual and cross-lingual performance.
  5. Zero-Shot Learning: Adapts to new speakers without additional training.

Ethics and Usage Guidelines

The project explicitly stipulates usage guidelines:

  • Only for academic research, educational purposes, and legal applications.
  • Prohibited for unauthorized voice cloning, impersonation, fraud, and other illegal activities.
  • Users must comply with local laws, regulations, and ethical standards.
  • Developers are not responsible for misuse.

Summary

Spark-TTS is a technologically advanced and powerful text-to-speech system, representing the cutting edge of current TTS technology. Through innovative architectural design and advanced deep learning techniques, it provides excellent voice quality and flexible control capabilities while maintaining high efficiency. The project is not only suitable for academic research but also has the potential for practical applications, making it an important contribution to the field of speech synthesis.