WhisperSpeech is an open-source text-to-speech (TTS) system built by inverting OpenAI's Whisper speech recognition model. The project aims to be the "Stable Diffusion" of speech synthesis: both powerful and easily customizable.
Initially known as spear-tts-pytorch, the project has evolved into a mature, multilingual speech synthesis solution. WhisperSpeech is trained only on properly licensed speech recordings, and all of its code is open source, so it is safe to use in commercial applications.
WhisperSpeech's architecture is similar to that of Google's AudioLM and SPEAR TTS and Meta's MusicGen, and it is built on top of powerful existing open-source models rather than components written from scratch.
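WhisperSpeech takes a "reverse engineering" approach: where Whisper converts speech into text, WhisperSpeech runs the mapping in the opposite direction, from text back to speech.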
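The staged design shared by AudioLM-style systems can be sketched in a few lines. This is a minimal illustration, not the WhisperSpeech API: it assumes a three-stage flow (text to semantic tokens, semantic tokens to acoustic tokens, acoustic tokens to waveform), and every name in it (`TTSPipeline`, `toy_t2s`, `toy_s2a`, `toy_vocoder`) is a hypothetical placeholder. The toy stages are simple arithmetic stand-ins for what would be neural networks in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[int]


@dataclass
class TTSPipeline:
    """Hypothetical three-stage TTS pipeline (illustrative only)."""
    text_to_semantic: Callable[[str], Tokens]         # stage 1: text -> semantic tokens
    semantic_to_acoustic: Callable[[Tokens], Tokens]  # stage 2: semantic -> acoustic tokens
    vocoder: Callable[[Tokens], List[float]]          # stage 3: acoustic tokens -> samples

    def synthesize(self, text: str) -> List[float]:
        semantic = self.text_to_semantic(text)
        acoustic = self.semantic_to_acoustic(semantic)
        return self.vocoder(acoustic)


# Toy stand-ins so the sketch runs end to end; they produce no real audio.
def toy_t2s(text: str) -> Tokens:
    # One "semantic token" per character, in a 512-entry vocabulary.
    return [ord(c) % 512 for c in text]


def toy_s2a(semantic: Tokens) -> Tokens:
    # Two "acoustic tokens" per semantic token, in a 1024-entry vocabulary.
    return [t for s in semantic for t in (s, (s * 7) % 1024)]


def toy_vocoder(acoustic: Tokens) -> List[float]:
    # Map each token id to a sample value in [-1.0, 1.0].
    return [(t / 1023.0) * 2.0 - 1.0 for t in acoustic]


pipeline = TTSPipeline(toy_t2s, toy_s2a, toy_vocoder)
samples = pipeline.synthesize("hello")
```

Keeping each stage behind a plain callable means any one of them can be retrained or swapped without touching the others, which is the kind of customizability the project aims for.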
WhisperSpeech is a significant step forward for open-source speech synthesis. Beyond delivering high-quality multilingual synthesis, it establishes a fully open-source, commercially safe ecosystem. By reverse engineering Whisper, the project offers a powerful and flexible approach to speech synthesis.