SOTA open-source Text-to-Speech (TTS) system

Apache-2.0 | Python | 21.9k | fishaudio | Last Updated: 2025-06-12

Fish Speech - Open-Source Text-to-Speech System

Project Overview

Fish Speech is an open-source text-to-speech (TTS) system developed by the FishAudio team. It is built on large language model (LLM) technology and offers voice generation and voice cloning. The project presents it as state of the art (SOTA) in speech synthesis.

Core Features

🎯 Zero-Shot and Few-Shot TTS

  • Generates high-quality speech from only 10-30 seconds of voice samples.
  • Supports rapid voice cloning without lengthy training.
  • Provides a detailed Voice Clone Best Practices Guide.
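
As a small illustration of the 10-30 second figure above, the sketch below checks whether a WAV reference clip falls inside that window before it is used for cloning. The helper names and the constants are hypothetical and are not part of Fish Speech; the code uses only the Python standard library and assumes an uncompressed PCM WAV file.

```python
import wave

# Hypothetical bounds taken from the "10-30 seconds" figure quoted above.
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def is_suitable_reference(path: str) -> bool:
    """True if the clip length lies within [MIN_SECONDS, MAX_SECONDS]."""
    return MIN_SECONDS <= clip_duration_seconds(path) <= MAX_SECONDS
```

A clip that fails this check may still work; the window is only the range the project quotes for good results.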

🌍 Multilingual and Cross-Lingual Support

  • Supports multiple languages, including English, Japanese, and Chinese.
  • Multilingual text can be pasted directly into the input box; there is no need to specify the language.
  • Powerful cross-lingual capabilities.

🔤 Phoneme-Free

  • The model has strong generalization capabilities.
  • Does not rely on phonemes for TTS processing.
  • Can handle text in any language script.

📊 High Accuracy

  • For a 5-minute English text, both the Character Error Rate (CER) and the Word Error Rate (WER) are approximately 2%.
  • The project describes this accuracy as industry-leading.
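
For readers unfamiliar with these metrics, the sketch below shows how CER and WER are conventionally computed: the edit distance between a reference text and a transcript, divided by the reference length in characters or words. The source does not say how its figures were measured, so the function names and the plain whitespace tokenization here are assumptions made for illustration only.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

At an error rate of about 2%, a 1,000-word reference would correspond to roughly 20 word-level edits between the reference and the transcript.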

⚡ High-Speed Inference

  • Real-time factor of approximately 1:5 on an Nvidia RTX 4060 laptop.
  • Real-time factor of approximately 1:15 on an Nvidia RTX 4090.
  • Employs fish-tech acceleration technology.
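
To make the ratios above concrete, the sketch below converts them into estimated generation times. It assumes the ratios are read as compute time to audio time, so 1:5 means one second of compute yields about five seconds of audio; the source does not define the ratio explicitly, and the function and dictionary names are illustrative.

```python
def generation_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Estimate compute time, given seconds of audio produced per compute second."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Ratios quoted above, read as compute:audio (an assumption).
SPEEDUPS = {"RTX 4060 laptop": 5.0, "RTX 4090": 15.0}

for gpu, speedup in SPEEDUPS.items():
    seconds = generation_time_seconds(60.0, speedup)
    print(f"{gpu}: ~{seconds:.0f} s to synthesize 60 s of audio")
```

Under this reading, one minute of speech would take roughly 12 seconds on the RTX 4060 laptop and roughly 4 seconds on the RTX 4090.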

🖥️ User-Friendly Interface

  • WebUI Inference: Easy-to-use web interface based on Gradio, compatible with Chrome, Firefox, Edge, and other browsers.
  • GUI Inference: Provides a PyQt6 graphical interface that seamlessly integrates with the API server, supporting Linux, Windows, and macOS.

🚀 Deployment Friendly

  • Easy to set up inference servers.
  • Native support for Linux, Windows, and macOS.
  • Minimal speed loss.

🔄 Fully End-to-End

  • Integrates the ASR and TTS components automatically.
  • No need to insert other models.
  • True end-to-end solution, not a three-stage (ASR+LLM+TTS) architecture.

🎨 Advanced Features

  • Voice Tone Control: Voice tone can be controlled using reference audio.
  • Emotional Expression: The model can generate speech with strong emotions.

Technical Architecture

Fish Speech is built on large language model (LLM) technology and uses deep learning to synthesize high-quality multilingual speech. The system is fully end-to-end, which avoids the complexity of the traditional three-stage (ASR+LLM+TTS) pipeline.

License Information

  • Code Repository: Released under the Apache License 2.0.
  • Model Weights: Released under the CC BY-NC-SA 4.0 License.
  • Attribution is required when using content released under the CC BY-NC-SA 4.0 License.

Latest Developments

The project has been rebranded as OpenAudio and has launched a new generation of text-to-speech models based on Fish-Speech, with significant improvements and new features.

Academic Citation

@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}

Summary

Fish Speech is a powerful and easy-to-use open-source TTS solution, particularly suitable for developers and researchers who need high-quality speech synthesis and voice cloning capabilities. Its advanced technical architecture, multilingual support, and user-friendly interface make it one of the best open-source TTS systems currently available.