fishaudio/fish-speech

SOTA open-source Text-to-Speech (TTS) system

Apache-2.0 | Python | fish-speech | fishaudio | 22.6k | Last Updated: July 23, 2025

Fish Speech - Open-Source Text-to-Speech System

Project Overview

Fish Speech is an open-source text-to-speech (TTS) system developed by the FishAudio team. Built on large language model (LLM) technology, it targets state-of-the-art (SOTA) speech synthesis and offers voice generation and voice cloning capabilities.

Core Features

🎯 Zero-Shot and Few-Shot TTS

Generates high-quality TTS output from a voice sample of only 10-30 seconds.
Supports rapid voice cloning without lengthy training.
Provides a detailed Voice Clone Best Practices Guide.
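
As a rough pre-check before cloning, the length of a reference clip can be compared against the 10-30 second range mentioned above. The sketch below uses only Python's standard `wave` module and assumes the clip is an uncompressed WAV file. The function names and the hard limits are illustrative assumptions and are not part of Fish Speech.

```python
import wave

# Assumed bounds, taken from the 10-30 second range quoted above.
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_suitable_reference(path: str) -> bool:
    """Check whether a clip falls inside the assumed 10-30 second range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

A clip outside the range may still work; the check only flags samples that are shorter or longer than the range the feature list recommends.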

🌍 Multilingual and Cross-Lingual Support

Supports multiple languages, including English, Japanese, and Chinese.
Multilingual text can be pasted directly into the input box; the language does not need to be specified.
Powerful cross-lingual capabilities.

🔤 Phoneme-Free

The model has strong generalization capabilities.
Does not rely on phonemes for TTS processing.
Can handle text in any language script.

📊 High Accuracy

On 5 minutes of English text, the character error rate (CER) and word error rate (WER) are both approximately 2%.
Industry-leading accuracy performance.
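
The source does not say how these figures were measured. A common approach is to transcribe the synthesized audio with a speech recognizer and compare the transcript with the input text. The sketch below shows only the comparison step, using a plain Levenshtein distance over words (WER) or characters (CER). All function names are illustrative, and no text normalization is applied.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `word_error_rate("the quick brown fox", "the quick brown box")` returns 0.25, because one of the four reference words is wrong. A rate of about 2% corresponds to roughly one error per fifty reference words or characters.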

⚡ High-Speed Inference

Real-time factor of approximately 1:5 on an Nvidia RTX 4060 laptop.
Real-time factor of approximately 1:15 on an Nvidia RTX 4090.
Employs fish-tech acceleration technology.
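
To make the ratios concrete, the sketch below assumes that "1:N" means one second of compute produces about N seconds of audio. That reading is an assumption, since the source does not define the ratio, and the helper names are illustrative.

```python
def speedup_ratio(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio produced per second of compute (the N in 1:N)."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


def estimated_compute_seconds(audio_seconds: float, ratio: float) -> float:
    """Estimated synthesis time for a clip, given an assumed 1:ratio speed."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return audio_seconds / ratio
```

Under that assumption, one minute of speech would take about 12 seconds at 1:5 (`estimated_compute_seconds(60, 5)`) and about 4 seconds at 1:15 (`estimated_compute_seconds(60, 15)`).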

🖥️ User-Friendly Interface

WebUI Inference: Easy-to-use web interface based on Gradio, compatible with Chrome, Firefox, Edge, and other browsers.
GUI Inference: Provides a PyQt6 graphical interface that seamlessly integrates with the API server, supporting Linux, Windows, and macOS.

🚀 Deployment Friendly

Easy to set up inference servers.
Native support for Linux, Windows, and macOS.
Deployment incurs minimal speed loss.

🔄 Fully End-to-End

Integrates the ASR and TTS components automatically.
No additional models need to be plugged in.
True end-to-end solution, not a three-stage (ASR+LLM+TTS) architecture.

🎨 Advanced Features

Voice Tone Control: The tone of the generated voice can be steered with reference audio.
Emotional Expression: The model can generate speech with strong emotions.

Technical Architecture

Fish Speech is based on large language model (LLM) technology, utilizing advanced deep learning algorithms to achieve high-quality multilingual text-to-speech synthesis. The system adopts a fully end-to-end architecture design, avoiding the complexity of traditional three-stage methods.

License Information

Code Repository: Released under the Apache-2.0 License.
Model Weights: Released under the CC-BY-NC-SA-4.0 License.
Attribution is required when using content released under the CC-BY-NC-SA-4.0 License.

Latest Developments

The project has been rebranded as OpenAudio, which introduces a new generation of advanced text-to-speech models built on Fish-Speech, with significant improvements and new features.

Academic Citation

@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}

Summary

Fish Speech is a powerful and easy-to-use open-source TTS solution, particularly suitable for developers and researchers who need high-quality speech synthesis and voice cloning capabilities. Its advanced technical architecture, multilingual support, and user-friendly interface make it one of the best open-source TTS systems currently available.