
CosyVoice: A multilingual large-scale speech generation model providing full-stack capabilities for inference, training, and deployment.

License: Apache-2.0 | Language: Python | 14.5k stars | Organization: FunAudioLLM | Last Updated: 2025-06-12

CosyVoice Project Detailed Introduction

Project Overview

CosyVoice is a multilingual large-scale speech generation model developed by Alibaba's FunAudioLLM team, providing a complete full-stack solution for inference, training, and deployment. This project focuses on high-quality speech synthesis technology, supporting multiple languages and application scenarios.

Core Features

CosyVoice 2.0 Latest Features

Supported Languages

  • Chinese, English, Japanese, Korean
  • Chinese Dialects: Cantonese, Sichuanese, Shanghainese, Tianjin dialect, Wuhan dialect, etc.

Technical Breakthroughs

  • Cross-lingual and Multilingual Mixing: Supports zero-shot voice cloning in cross-lingual and code-switching scenarios.
  • Bidirectional Streaming Support: Integrates offline and streaming modeling techniques.
  • Ultra-Low Latency Synthesis: First-packet synthesis latency as low as 150ms while maintaining high-quality audio output (see the streaming sketch after this list).
  • Improved Pronunciation Accuracy: Reduces pronunciation errors by 30% to 50% compared to version 1.0.
  • Benchmark Achievements: Achieves the lowest character error rate on the difficult test set of the Seed-TTS evaluation set.
  • Voice Consistency: Ensures reliable voice consistency for zero-shot and cross-lingual speech synthesis.
  • Prosody and Audio Quality Enhancement: Improved alignment of synthesized audio, with the MOS evaluation score rising from 5.4 to 5.53.
  • Emotional and Dialect Flexibility: Supports finer-grained emotional control and accent adjustment.
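
The ultra-low-latency figure above refers to streaming synthesis, where audio chunks are yielded as soon as they are generated. Below is a minimal sketch of measuring first-packet latency, assuming the stream=True flag of the project's Python API; the timing harness itself is illustrative.

import time
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech = load_wav('./asset/zero_shot_prompt.wav', 16000)

start = time.time()
for i, result in enumerate(cosyvoice.inference_zero_shot(
    '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐。',  # text to synthesize
    '希望你以后能够做的比我还好呦。',  # transcript of the prompt audio
    prompt_speech,
    stream=True  # yield audio chunks incrementally instead of one full waveform
)):
    if i == 0:
        # Elapsed time until the first audio chunk: the "first-packet" latency
        print(f'first packet after {(time.time() - start) * 1000:.0f} ms')
    torchaudio.save(f'stream_chunk_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)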

Model Versions

CosyVoice2-0.5B (Recommended)

  • Latest version with superior performance.
  • Supports all the latest features.

CosyVoice-300M Series

  • CosyVoice-300M: Base model
  • CosyVoice-300M-SFT: Supervised Fine-Tuning version
  • CosyVoice-300M-Instruct: Instruction Fine-Tuning version

Functionality Modes

1. Zero-shot Voice Cloning

  • Clones a voice with just a few seconds of audio sample.
  • Supports cross-lingual voice cloning.
  • Maintains the original speaker's voice characteristics.

2. Cross-lingual Synthesis

  • Synthesizes speech in one language using an audio sample from another language.
  • Supports multiple language combinations, including Chinese, English, Japanese, Korean, Cantonese, etc.
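
A minimal sketch using the inference_cross_lingual method from the project's Python API: the prompt is an audio sample in one language and the output text is in another (the asset path follows the repository's examples).

from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
# Prompt audio in one language (e.g. Chinese); target text in another (e.g. English)
prompt_speech = load_wav('./asset/cross_lingual_prompt.wav', 16000)
for i, result in enumerate(cosyvoice.inference_cross_lingual(
    'And then later on, fully acquiring that company.',
    prompt_speech
)):
    torchaudio.save(f'cross_lingual_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)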

3. Voice Conversion

  • Converts the voice of one speaker to the voice of another speaker.
  • Changes the voice while preserving the original content.
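
As a sketch, voice conversion goes through the inference_vc method of the project's Python API, which takes the source speech (the content to keep) and a prompt from the target speaker (the voice to imitate); the file paths here are placeholders.

from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
source_speech = load_wav('./asset/source_speech.wav', 16000)   # what is said (placeholder path)
target_prompt = load_wav('./asset/target_speaker.wav', 16000)  # whose voice to use (placeholder path)
for i, result in enumerate(cosyvoice.inference_vc(source_speech, target_prompt)):
    torchaudio.save(f'vc_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)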

4. Supervised Fine-Tuning Mode (SFT)

  • Performs speech synthesis using predefined speaker identities.
  • Provides stable and reliable synthesis quality.
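
A sketch with the CosyVoice-300M-SFT model, using the inference_sft method and the list_available_spks helper from the project's CLI API:

from cosyvoice.cli.cosyvoice import CosyVoice
import torchaudio

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
print(cosyvoice.list_available_spks())  # predefined speaker IDs, e.g. '中文女' (Chinese female)
for i, result in enumerate(cosyvoice.inference_sft(
    '你好,我是通义生成式语音大模型。',  # "Hello, I am the Tongyi generative speech model."
    '中文女'  # predefined female Mandarin voice
)):
    torchaudio.save(f'sft_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)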

5. Instruction Control Mode (Instruct)

  • Controls speech synthesis through natural language instructions.
  • Supports emotional labels and special effects.
  • Allows control over speech style, emotional expression, etc.
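
With CosyVoice2, instruction control goes through inference_instruct2 (see the Basic Usage Example below); the older CosyVoice-300M-Instruct model uses inference_instruct with a predefined speaker ID. A sketch, with an illustrative instruction string:

from cosyvoice.cli.cosyvoice import CosyVoice
import torchaudio

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
for i, result in enumerate(cosyvoice.inference_instruct(
    '在面对挑战时,他展现了非凡的勇气与智慧。',  # "Facing challenges, he showed extraordinary courage and wisdom."
    '中文男',  # predefined male Mandarin voice
    'A calm, deep voice with a storytelling tone.'  # natural-language style instruction (illustrative)
)):
    torchaudio.save(f'instruct300m_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)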

6. Fine-grained Control

  • Supports special markers such as laughter [laughter] and breath [breath].
  • Supports emphasis control with <strong></strong> tags.
  • Fine-grained adjustment of emotion and prosody.
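
These markers are written inline in the input text and rendered as sounds rather than read aloud. A sketch modeled on the fine-grained control example in the project README, which passes marked-up text to inference_cross_lingual:

from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech = load_wav('./asset/zero_shot_prompt.wav', 16000)
# Text with [laughter] markers: "While telling that absurd story, he suddenly
# [laughter] stopped, because he had cracked himself up [laughter]."
for i, result in enumerate(cosyvoice.inference_cross_lingual(
    '在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。',
    prompt_speech
)):
    torchaudio.save(f'fine_grained_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)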

Technical Architecture

Core Technologies

  • Discrete Speech Tokens: Supervised discrete speech tokenization technology.
  • Progressive Semantic Decoding: Uses Language Models (LMs) and Flow Matching.
  • Bidirectional Streaming Modeling: Supports real-time and batch inference.
  • Multi-modal Integration: Seamless integration with large language models.

Performance Optimization

  • Streaming Inference Support: Includes KV caching and SDPA optimization.
  • Repetition-Aware Sampling (RAS): Improves LLM stability.
  • TensorRT Acceleration: Supports GPU-accelerated inference.
  • FP16 Precision: Balances performance and quality.
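
These optimizations are enabled when the model is loaded. A sketch assuming the load_jit, load_trt, and fp16 constructor flags shown in the project's examples; TensorRT use requires a corresponding exported engine.

from cosyvoice.cli.cosyvoice import CosyVoice2

cosyvoice = CosyVoice2(
    'pretrained_models/CosyVoice2-0.5B',
    load_jit=True,   # use TorchScript-compiled submodules
    load_trt=True,   # TensorRT-accelerated inference (needs a TensorRT build/engine)
    fp16=True        # half-precision weights: faster inference with minimal quality loss
)
# Combine with stream=True on the inference_* calls for low-latency streaming output.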

Installation and Usage

Environment Requirements

  • Python 3.10
  • CUDA-enabled GPU (recommended)
  • Conda environment management

Quick Start

# Clone the repository
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice

# Create the environment
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt

Model Download

from modelscope import snapshot_download

# Download CosyVoice2.0 (Recommended)
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')

# Download other versions
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')

Basic Usage Example

from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

# Initialize the model
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')

# Zero-shot voice cloning: a few seconds of 16 kHz prompt audio plus its transcript
prompt_speech = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, result in enumerate(cosyvoice.inference_zero_shot(
    # Text to synthesize: "Receiving a birthday gift mailed by a friend from afar, that
    # unexpected surprise and deep blessing filled my heart with sweet joy."
    '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐。',
    # Transcript of the prompt audio: "I hope you can do even better than me in the future."
    '希望你以后能够做的比我还好呦。',
    prompt_speech
)):
    torchaudio.save(f'output_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)

# Instruction-controlled synthesis
for i, result in enumerate(cosyvoice.inference_instruct2(
    '今天天气真不错,我们去公园散步吧。',  # "The weather is great today; let's take a walk in the park."
    '用四川话说这句话',  # instruction: "Say this sentence in Sichuanese"
    prompt_speech
)):
    torchaudio.save(f'instruct_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)

Deployment Solutions

Web Interface Deployment

python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice2-0.5B

Docker Container Deployment

cd runtime/python
docker build -t cosyvoice:v1.0 .

# gRPC service
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && \
  python3 server.py --port 50000 --model_dir iic/CosyVoice2-0.5B"

# FastAPI service
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && \
  python3 server.py --port 50000 --model_dir iic/CosyVoice2-0.5B"
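
A client-side sketch for the FastAPI service. The endpoint path, HTTP method, payload fields, and output sample rate are assumptions modeled on the repository's runtime/python/fastapi/client.py; check the shipped client for the exact contract.

import numpy as np
import requests
import torch
import torchaudio

url = 'http://127.0.0.1:50000/inference_sft'  # assumed endpoint, per the repo's client.py pattern
payload = {'tts_text': '你好,很高兴认识你。', 'spk_id': '中文女'}  # "Hello, nice to meet you."
response = requests.get(url, data=payload, stream=True)

# The server is assumed to stream raw 16-bit PCM; collect the chunks and save as WAV
tts_audio = b''.join(response.iter_content(chunk_size=16000))
tts_speech = torch.from_numpy(np.frombuffer(tts_audio, dtype=np.int16).copy()).unsqueeze(0)
torchaudio.save('fastapi_output.wav', tts_speech, 22050)  # rate assumed; CosyVoice2 outputs 24 kHz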

Application Scenarios

Commercial Applications

  • Intelligent Customer Service: Multilingual customer service systems.
  • Audiobooks: Personalized narration and character voice acting.
  • Voice Assistants: Natural human-machine interaction experiences.
  • Online Education: Multilingual educational content creation.

Creative Applications

  • Podcast Production: Automated podcast content generation.
  • Game Voice Acting: Character voice synthesis.
  • Short Video Production: Quick voice-over solutions.
  • Voice Translation: Real-time speech-to-speech translation.

Technical Integration

  • Integration with LLMs: Building complete dialogue systems.
  • Emotional Voice Chat: Dialogue robots that support emotional expression.
  • Interactive Podcasts: Dynamic content generation.
  • Expressive Audiobooks: Rich emotional expression.

Technical Advantages

Performance Metrics

  • Latency: First-packet synthesis as low as 150ms.
  • Quality: MOS evaluation score of 5.53, up from 5.4 in version 1.0.