
An open-source text-to-speech system built by reverse-engineering Whisper

MIT License · Jupyter Notebook · 4.3k stars · Last updated: 2025-06-08
https://github.com/WhisperSpeech/WhisperSpeech

WhisperSpeech Project Details

Overview

WhisperSpeech is an open-source text-to-speech (TTS) system built by reverse-engineering OpenAI's Whisper. The project's vision is to become the "Stable Diffusion" of speech synthesis: both powerful and easily customizable.

Initially known as spear-tts-pytorch, the project has evolved into a mature, multilingual speech-synthesis solution. WhisperSpeech trains only on properly licensed voice recordings, and all of its code is open source, making it safe for commercial use.

Core Features and Characteristics

🎯 Key Features

  • Open-Source and Commercially Safe: MIT-licensed, with all code open source and only properly licensed voice data used for training.
  • Multilingual Support: Currently supports English and Polish, with plans to expand to more languages.
  • Voice Cloning: Supports voice cloning based on reference audio files.
  • Multilingual Mixing: Can mix multiple languages within a single sentence.
  • High-Performance Inference: Exceeds 12x real-time speed on a consumer RTX 4090 GPU.

🔧 Technical Architecture

WhisperSpeech's architecture is similar to Google's AudioLM and SPEAR TTS, as well as Meta's MusicGen, built on top of powerful open-source models:

  • Whisper (OpenAI): Used to generate semantic tokens and perform transcription.
  • EnCodec (Meta): Used for acoustic modeling.
  • Vocos (Charactr Inc): Serves as a high-quality vocoder.

📊 Model Components

  1. Semantic Token Generation: Utilizes OpenAI Whisper encoder blocks to generate embeddings, which are then quantized to obtain semantic tokens.
  2. Acoustic Modeling: Uses EnCodec to model audio waveforms, delivering reasonable quality at 1.5 kbps.
  3. High-Quality Vocoder: Converts EnCodec tokens into high-quality audio using Vocos.
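The three components above form a pipeline from text to audio. The sketch below illustrates that data flow with toy stand-in functions; all function names, vocabulary sizes, and frame sizes here are made up for illustration, and the real stages are large neural models:

```python
# Toy sketch of the WhisperSpeech three-stage pipeline (illustrative only;
# the actual stages are trained neural networks, not these stubs).

def encode_semantic(text):
    # Stage 1: text is mapped to semantic tokens (in WhisperSpeech, these
    # come from quantized Whisper encoder embeddings). Toy vocab of 1024.
    return [hash(word) % 1024 for word in text.split()]

def semantic_to_acoustic(semantic_tokens):
    # Stage 2: an S2A model maps semantic tokens to EnCodec acoustic tokens.
    # Toy version emits one pair of codebook indices per semantic token.
    return [(t % 512, (t * 7) % 512) for t in semantic_tokens]

def vocode(acoustic_tokens, frame_size=320):
    # Stage 3: a vocoder (Vocos) turns acoustic tokens into waveform samples.
    # Toy version emits silence: frame_size samples per acoustic frame.
    return [0.0] * (len(acoustic_tokens) * frame_size)

semantic = encode_semantic("hello world")
acoustic = semantic_to_acoustic(semantic)
audio = vocode(acoustic)
```

The key design point is that each stage is independently replaceable: a better vocoder or codec can be swapped in without retraining the semantic model.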

🌍 Dataset and Training

  • English Data: Trained on the Libri-Light dataset.
  • Multilingual Expansion: Successfully trained a small model on English + Polish + French datasets.
  • Voice Cloning: Supports cross-lingual voice cloning, even if semantic tokens are only trained on a subset of languages.

Latest Developments

Performance Optimization

  • Integrated torch.compile
  • Added kv-caching
  • Optimized network layer structure
  • Achieved over 12x real-time inference speed on an RTX 4090 GPU
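To see why kv-caching matters, consider a toy cost model of autoregressive decoding (an illustration of the general technique, not WhisperSpeech's actual code): without a cache, every step recomputes attention keys and values for the entire prefix; with a cache, they are computed once per token and reused.

```python
# Toy cost model for autoregressive decoding with and without kv-caching.

def steps_without_cache(n_tokens):
    # Step t recomputes keys/values for all t prefix tokens, so total
    # work grows quadratically: 1 + 2 + ... + n.
    return sum(t for t in range(1, n_tokens + 1))

def steps_with_cache(n_tokens):
    # Keys/values are computed once per new token and cached, so total
    # work grows linearly in the sequence length.
    cache = []
    for _ in range(n_tokens):
        cache.append(object())  # compute K/V for the new token only
    return len(cache)

print(steps_without_cache(100))  # 5050 units of work
print(steps_with_cache(100))    # 100 units of work
```

Combined with `torch.compile`, which fuses and specializes the model's kernels, this is the kind of change that pushes decoding well past real time.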

Multilingual Capabilities

  • Successfully implemented mixed English and Polish speech synthesis.
  • Supports seamless switching between multiple languages in a single sentence.
  • Cross-lingual voice cloning functionality.

Model Updates

  • Released a faster SD S2A model, improving speed while maintaining high quality.
  • Improved voice cloning functionality.
  • Optimized dependencies, reducing installation time to under 30 seconds.

Usage

Quick Start

  • Google Colab: Provides ready-to-use Colab notebooks, completing installation in 30 seconds.
  • Local Execution: Supports local notebook environments.
  • HuggingFace: Pre-trained models and converted datasets are available on HuggingFace.
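A minimal usage sketch, assuming the `whisperspeech` package is installed; the `Pipeline` interface and model-reference string below follow the pattern shown in the project's notebooks, but names may differ between releases, so treat this as an approximation rather than authoritative API documentation:

```python
# Hypothetical quick-start sketch; consult the project's README and Colab
# notebook for the current API. Pre-trained models are fetched from
# HuggingFace on first use, so network access (and ideally a GPU) is assumed.
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')
pipe.generate_to_file('output.wav', "Hello from WhisperSpeech.")
```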


Technical Principles

WhisperSpeech employs an innovative "reverse engineering" approach:

  1. Uses Whisper's speech recognition capabilities to reverse-engineer a speech synthesis system.
  2. Bridges text and speech through semantic tokens.
  3. Leverages existing powerful open-source models to avoid reinventing the wheel.
  4. Focuses on compliant data and commercial safety.
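The semantic-token bridge in step 2 comes down to vector quantization: a continuous encoder embedding is snapped to the nearest entry of a learned codebook, and that entry's index becomes the discrete token. A minimal sketch with a made-up 2-D codebook (the real codebook and embeddings are learned and high-dimensional):

```python
# Toy vector quantization: embedding -> index of nearest codebook vector.
# The codebook below is invented for illustration; WhisperSpeech learns its
# codebook from Whisper encoder embeddings.
import math

CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

def quantize(embedding):
    # The semantic token is the index of the closest codebook entry.
    distances = [math.dist(embedding, entry) for entry in CODEBOOK]
    return distances.index(min(distances))

embeddings = [(0.1, 0.2), (0.9, 0.1), (0.4, 0.9)]
tokens = [quantize(e) for e in embeddings]
print(tokens)  # [0, 1, 2]
```

Because the tokens are just integers, any sequence model can then learn to predict them from text, which is what turns a recognition model's representation into a synthesis pipeline.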

Summary

WhisperSpeech represents a significant breakthrough in open-source speech synthesis technology. It not only achieves high-quality multilingual speech synthesis technically, but more importantly, establishes a completely open-source and commercially safe ecosystem. Through the innovative approach of reverse engineering Whisper, this project provides a powerful and flexible solution for the field of speech synthesis.