Fish Speech - Open-Source Text-to-Speech System
Project Overview
Fish Speech is an open-source text-to-speech (TTS) system developed by the FishAudio team. The project targets state-of-the-art (SOTA) speech synthesis and offers both voice generation and voice cloning.
Core Features
🎯 Zero-Shot and Few-Shot TTS
- Generate high-quality TTS output with only 10-30 seconds of voice samples.
- Supports rapid voice cloning without lengthy training.
- Provides a detailed Voice Clone Best Practices Guide.
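The 10-30 second guideline above can be checked before a clip is used as a cloning reference. The following is a minimal sketch, assuming the reference is an uncompressed WAV file; the function names and default bounds are illustrative and are not part of Fish Speech.

```python
import wave


def reference_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_suitable_reference(path: str, min_s: float = 10.0, max_s: float = 30.0) -> bool:
    """Check a clip against the 10-30 second guideline (bounds are adjustable)."""
    return min_s <= reference_duration_seconds(path) <= max_s
```

Other audio formats would need a decoding library; the standard-library `wave` module only reads WAV files.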
🌍 Multilingual and Cross-Lingual Support
- Supports multiple languages: English, Japanese, Chinese, etc.
- Simply copy and paste multilingual text into the input box without worrying about language recognition.
- Powerful cross-lingual capabilities.
🔤 Phoneme-Free
- The model has strong generalization capabilities.
- Does not rely on phonemes for TTS processing.
- Can handle text in any language script.
📊 High Accuracy
- On 5-minute English texts, both the Character Error Rate (CER) and the Word Error Rate (WER) are approximately 2%.
- Industry-leading accuracy.
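For readers unfamiliar with these metrics, CER and WER are both an edit distance divided by the reference length, counted over characters and words respectively. The sketch below shows the standard definitions only. The source does not describe how its figures were measured; a common setup, assumed here, is to transcribe the synthesized audio with an ASR model and compare the transcript with the input text.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits (spaces included)
    divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

In practice both strings are usually normalized first (case, punctuation, whitespace), since the choice of normalization changes the reported rate.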
⚡ High-Speed Inference
- Real-time factor of approximately 1:5 on an Nvidia RTX 4060 laptop.
- Real-time factor of approximately 1:15 on an Nvidia RTX 4090.
- Employs fish-tech acceleration technology.
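The ratios above can be reproduced on other hardware by timing a synthesis call. This sketch assumes the ratio reads as processing time to generated audio duration, so 1:5 means roughly one second of compute per five seconds of speech. The `synthesize` callable is hypothetical and stands in for whatever inference entry point is used; it is assumed to return the audio samples and their sample rate.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(synthesize, text: str) -> float:
    """Time one synthesis call.

    `synthesize` is a hypothetical callable returning (samples, sample_rate).
    """
    start = time.perf_counter()
    samples, sample_rate = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)
```

Under that reading, 1:5 corresponds to a value of 0.2 and 1:15 to about 0.067. A single call is noisy, so averaging over several inputs after a warm-up run gives a more stable figure.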
🖥️ User-Friendly Interface
- WebUI Inference: Easy-to-use web interface based on Gradio, compatible with Chrome, Firefox, Edge, and other browsers.
- GUI Inference: Provides a PyQt6 graphical interface that seamlessly integrates with the API server, supporting Linux, Windows, and macOS.
🚀 Deployment Friendly
- Easy to set up inference servers.
- Native support for Linux, Windows, and macOS.
- Minimal speed loss.
🔄 Fully End-to-End
- Integrates the ASR and TTS components automatically.
- No need to insert other models.
- True end-to-end solution, not a three-stage (ASR+LLM+TTS) architecture.
🎨 Advanced Features
- Voice Tone Control: Voice tone can be controlled using reference audio.
- Emotional Expression: The model can generate speech with strong emotions.
Technical Architecture
Fish Speech builds on a large language model (LLM) to perform high-quality multilingual text-to-speech synthesis. The system is fully end-to-end, which avoids the complexity of the traditional three-stage (ASR+LLM+TTS) pipeline.
License Information
- Code Repository: Released under the Apache License.
- Model Weights: Released under the CC BY-NC-SA 4.0 License.
- The CC BY-NC-SA 4.0 license requires attribution, limits use to non-commercial purposes, and requires derivatives to be shared under the same license.
Latest Developments
The project has been rebranded as OpenAudio, which introduces a new generation of text-to-speech models based on Fish-Speech, with significant improvements and new features.
Academic Citation
@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}
Summary
Fish Speech is an easy-to-use open-source TTS system, particularly suitable for developers and researchers who need high-quality speech synthesis and voice cloning. Its end-to-end architecture, multilingual support, and user-friendly interfaces make it one of the stronger open-source TTS systems currently available.