
An open-source text-to-speech system built by reverse-engineering Whisper

MIT License · Jupyter Notebook · 4.3k stars · Last updated: 2025-06-08
https://github.com/WhisperSpeech/WhisperSpeech

WhisperSpeech Project Details

Overview

WhisperSpeech is an open-source text-to-speech (TTS) system built by reverse-engineering OpenAI's Whisper. The project's vision is to become the "Stable Diffusion" of speech synthesis: both powerful and easily customizable.

Initially known as spear-tts-pytorch, the project has evolved into a mature, multilingual speech-synthesis solution. WhisperSpeech trains only on properly licensed voice recordings, and all of its code is open source, making it safe for commercial use.

Core Features and Characteristics

🎯 Key Features

  • Open-Source and Commercially Safe: MIT-licensed, with all code open source and only properly licensed voice data used for training.
  • Multilingual Support: Currently supports English and Polish, with plans to expand to more languages.
  • Voice Cloning: Supports voice cloning based on reference audio files.
  • Multilingual Mixing: Can mix multiple languages within a single sentence.
  • High-Performance Inference: Exceeds 12x real-time speed on a consumer RTX 4090 GPU.

🔧 Technical Architecture

WhisperSpeech's architecture is similar to Google's AudioLM and SPEAR TTS, as well as Meta's MusicGen, built on top of powerful open-source models:

  • Whisper (OpenAI): Used to generate semantic tokens and perform transcription.
  • EnCodec (Meta): Used for acoustic modeling.
  • Vocos (Charactr Inc): Serves as a high-quality vocoder.

📊 Model Components

  1. Semantic Token Generation: Utilizes OpenAI Whisper encoder blocks to generate embeddings, which are then quantized to obtain semantic tokens.
  2. Acoustic Modeling: Uses EnCodec to model audio waveforms, delivering reasonable quality at 1.5 kbps.
  3. High-Quality Vocoder: Converts EnCodec tokens into high-quality audio using Vocos.
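The three components above form a pipeline from text to audio. The sketch below illustrates that data flow with toy stand-in functions; all function names, vocabulary sizes, and frame sizes here are made up for illustration, and the real stages are large neural models:

```python
# Toy sketch of the WhisperSpeech three-stage pipeline (illustrative only;
# the actual stages are trained neural networks, not these stubs).

def encode_semantic(text):
    # Stage 1: text is mapped to semantic tokens (in WhisperSpeech, these
    # come from quantized Whisper encoder embeddings). Toy vocab of 1024.
    return [hash(word) % 1024 for word in text.split()]

def semantic_to_acoustic(semantic_tokens):
    # Stage 2: an S2A model maps semantic tokens to EnCodec acoustic tokens.
    # Toy version emits one pair of codebook indices per semantic token.
    return [(t % 512, (t * 7) % 512) for t in semantic_tokens]

def vocode(acoustic_tokens, frame_size=320):
    # Stage 3: a vocoder (Vocos) turns acoustic tokens into waveform samples.
    # Toy version emits silence: frame_size samples per acoustic frame.
    return [0.0] * (len(acoustic_tokens) * frame_size)

semantic = encode_semantic("hello world")
acoustic = semantic_to_acoustic(semantic)
audio = vocode(acoustic)
```

The key design point is that each stage is independently replaceable: a better vocoder or codec can be swapped in without retraining the semantic model.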

🌍 Dataset and Training

  • English Data: Trained on the Libri-Light dataset.
  • Multilingual Expansion: Successfully trained a small model on English + Polish + French datasets.
  • Voice Cloning: Supports cross-lingual voice cloning, even if semantic tokens are only trained on a subset of languages.

Latest Developments

Performance Optimization

  • Integrated torch.compile
  • Added kv-caching
  • Optimized network layer structure
  • Achieved over 12x real-time inference speed on an RTX 4090 GPU
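To see why kv-caching matters, consider a toy cost model of autoregressive decoding (an illustration of the general technique, not WhisperSpeech's actual code): without a cache, every step recomputes attention keys and values for the entire prefix; with a cache, they are computed once per token and reused.

```python
# Toy cost model for autoregressive decoding with and without kv-caching.

def steps_without_cache(n_tokens):
    # Step t recomputes keys/values for all t prefix tokens, so total
    # work grows quadratically: 1 + 2 + ... + n.
    return sum(t for t in range(1, n_tokens + 1))

def steps_with_cache(n_tokens):
    # Keys/values are computed once per new token and cached, so total
    # work grows linearly in the sequence length.
    cache = []
    for _ in range(n_tokens):
        cache.append(object())  # compute K/V for the new token only
    return len(cache)

print(steps_without_cache(100))  # 5050 units of work
print(steps_with_cache(100))    # 100 units of work
```

Combined with `torch.compile`, which fuses and specializes the model's kernels, this is the kind of change that pushes decoding well past real time.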

Multilingual Capabilities

  • Successfully implemented mixed English and Polish speech synthesis.
  • Supports seamless switching between multiple languages in a single sentence.
  • Cross-lingual voice cloning functionality.

Model Updates

  • Released a faster SD S2A model, improving speed while maintaining high quality.
  • Improved voice cloning functionality.
  • Optimized dependencies, reducing installation time to under 30 seconds.

Usage

Quick Start

  • Google Colab: Provides ready-to-use Colab notebooks, completing installation in 30 seconds.
  • Local Execution: Supports local notebook environments.
  • HuggingFace: Pre-trained models and converted datasets are available on HuggingFace.
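A minimal usage sketch, assuming the `whisperspeech` package is installed; the `Pipeline` interface and model-reference string below follow the pattern shown in the project's notebooks, but names may differ between releases, so treat this as an approximation rather than authoritative API documentation:

```python
# Hypothetical quick-start sketch; consult the project's README and Colab
# notebook for the current API. Pre-trained models are fetched from
# HuggingFace on first use, so network access (and ideally a GPU) is assumed.
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')
pipe.generate_to_file('output.wav', "Hello from WhisperSpeech.")
```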


Technical Principles

WhisperSpeech employs an innovative "reverse engineering" approach:

  1. Uses Whisper's speech recognition capabilities to reverse-engineer a speech synthesis system.
  2. Bridges text and speech through semantic tokens.
  3. Leverages existing powerful open-source models to avoid reinventing the wheel.
  4. Focuses on compliant data and commercial safety.
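The semantic-token bridge in step 2 comes down to vector quantization: a continuous encoder embedding is snapped to the nearest entry of a learned codebook, and that entry's index becomes the discrete token. A minimal sketch with a made-up 2-D codebook (the real codebook and embeddings are learned and high-dimensional):

```python
# Toy vector quantization: embedding -> index of nearest codebook vector.
# The codebook below is invented for illustration; WhisperSpeech learns its
# codebook from Whisper encoder embeddings.
import math

CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

def quantize(embedding):
    # The semantic token is the index of the closest codebook entry.
    distances = [math.dist(embedding, entry) for entry in CODEBOOK]
    return distances.index(min(distances))

embeddings = [(0.1, 0.2), (0.9, 0.1), (0.4, 0.9)]
tokens = [quantize(e) for e in embeddings]
print(tokens)  # [0, 1, 2]
```

Because the tokens are just integers, any sequence model can then learn to predict them from text, which is what turns a recognition model's representation into a synthesis pipeline.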

Summary

WhisperSpeech represents a significant breakthrough in open-source speech synthesis technology. It not only achieves high-quality multilingual speech synthesis technically, but more importantly, establishes a completely open-source and commercially safe ecosystem. Through the innovative approach of reverse engineering Whisper, this project provides a powerful and flexible solution for the field of speech synthesis.