CosyVoice Project Detailed Introduction
Project Overview
CosyVoice is a multilingual large-scale speech generation model developed by Alibaba's FunAudioLLM team, providing a complete full-stack solution for inference, training, and deployment. This project focuses on high-quality speech synthesis technology, supporting multiple languages and application scenarios.
Core Features
CosyVoice 2.0 Latest Features
Supported Languages
- Chinese, English, Japanese, Korean
- Chinese Dialects: Cantonese, Sichuanese, Shanghainese, Tianjin dialect, Wuhan dialect, etc.
Technical Breakthroughs
- Cross-lingual and Multilingual Mixing: Supports zero-shot voice cloning in cross-lingual and code-switching scenarios.
- Bidirectional Streaming Support: Integrates offline and streaming modeling techniques.
- Ultra-Low Latency Synthesis: First-packet synthesis latency as low as 150ms, while maintaining high-quality audio output.
- Improved Pronunciation Accuracy: Reduces pronunciation errors by 30% to 50% compared to version 1.0.
- Benchmark Achievements: Achieves the lowest character error rate on the difficult test set of the Seed-TTS evaluation set.
- Voice Consistency: Ensures reliable voice consistency for zero-shot and cross-lingual speech synthesis.
- Prosody and Audio Quality Enhancement: Improved synthesized audio alignment, with MOS evaluation scores increasing from 5.4 to 5.53.
- Emotional and Dialect Flexibility: Supports finer-grained emotional control and accent adjustment.
Model Versions
CosyVoice2-0.5B (Recommended)
- Latest version with superior performance.
- Supports all the latest features.
CosyVoice-300M Series
- CosyVoice-300M: Base model
- CosyVoice-300M-SFT: Supervised Fine-Tuning version
- CosyVoice-300M-Instruct: Instruction Fine-Tuning version
Functionality Modes
1. Zero-shot Voice Cloning
- Clones a voice from just a few seconds of sample audio.
- Supports cross-lingual voice cloning.
- Maintains the original speaker's voice characteristics.
2. Cross-lingual Synthesis
- Synthesizes speech in one language using an audio sample from another language.
- Supports multiple language combinations, including Chinese, English, Japanese, Korean, Cantonese, etc.
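A minimal sketch of this mode via the CLI's `inference_cross_lingual` entry point, assuming the CosyVoice2-0.5B model from the download step below; the prompt recording can be in a different language from the target text, and the prompt file shown here comes from the repository's asset folder:

```python
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
# A prompt recording in one language drives synthesis of text in another;
# load_wav resamples the prompt to 16 kHz.
prompt_speech = load_wav('./asset/cross_lingual_prompt.wav', 16000)
for i, result in enumerate(cosyvoice.inference_cross_lingual(
        '在一个阳光明媚的早晨,我们出发去郊外远足。',
        prompt_speech)):
    torchaudio.save(f'cross_lingual_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```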
3. Voice Conversion
- Converts the voice of one speaker to the voice of another speaker.
- Changes the voice while preserving the original content.
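A minimal sketch using the CLI's `inference_vc` entry point; both file paths below are placeholders for your own 16 kHz recordings:

```python
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
# source carries the spoken content to keep; prompt carries the target voice.
source_speech = load_wav('./asset/source_speech.wav', 16000)   # placeholder path
prompt_speech = load_wav('./asset/target_prompt.wav', 16000)   # placeholder path
for i, result in enumerate(cosyvoice.inference_vc(source_speech, prompt_speech)):
    torchaudio.save(f'vc_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```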
4. Supervised Fine-Tuning Mode (SFT)
- Performs speech synthesis using predefined speaker identities.
- Provides stable and reliable synthesis quality.
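A minimal sketch, assuming the CosyVoice-300M-SFT model has been downloaded as shown later in this document; '中文女' is one of its predefined speaker IDs:

```python
from cosyvoice.cli.cosyvoice import CosyVoice
import torchaudio

# The SFT model ships with predefined speakers, so no prompt audio is needed.
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
for i, result in enumerate(cosyvoice.inference_sft(
        '你好,我是通义生成式语音大模型。', '中文女')):
    torchaudio.save(f'sft_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```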
5. Instruction Control Mode (Instruct)
- Controls speech synthesis through natural language instructions.
- Supports emotional labels and special effects.
- Allows control over speech style, emotional expression, etc.
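A minimal sketch using the CosyVoice-300M-Instruct model's `inference_instruct` entry point (CosyVoice 2 uses `inference_instruct2` instead, shown in the usage example below); the instruction string and output file name here are illustrative:

```python
from cosyvoice.cli.cosyvoice import CosyVoice
import torchaudio

# The 300M-Instruct model takes a natural-language style instruction
# alongside the text and a predefined speaker ID.
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
for i, result in enumerate(cosyvoice.inference_instruct(
        '在面对挑战时,他展现了非凡的勇气与智慧。',
        '中文男',
        '请用激昂的语气朗读这句话。')):  # illustrative instruction text
    torchaudio.save(f'instruct300m_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```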
6. Fine-grained Control
- Supports special markers such as laughter `[laughter]` and breath `[breath]`.
- Supports emphasis control via `<strong></strong>` tags.
- Fine-grained adjustment of emotion and prosody.
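These markers are embedded directly in the input text. A sketch following the pattern of the project's examples (the prompt file comes from the repository's asset folder; the output file name is illustrative):

```python
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech = load_wav('./asset/zero_shot_prompt.wav', 16000)
# [laughter] is interpreted as a control token, not read out literally.
for i, result in enumerate(cosyvoice.inference_cross_lingual(
        '在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。',
        prompt_speech)):
    torchaudio.save(f'fine_grained_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```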
Technical Architecture
Core Technologies
- Discrete Speech Tokens: Supervised discrete speech tokenization technology.
- Progressive Semantic Decoding: Uses Language Models (LMs) and Flow Matching.
- Bidirectional Streaming Modeling: Supports real-time and batch inference.
- Multi-modal Integration: Seamless integration with large language models.
Performance Optimization
- Streaming Inference Support: Includes KV caching and SDPA optimization.
- Repetition-Aware Sampling (RAS): Improves LLM stability.
- TensorRT Acceleration: Supports GPU-accelerated inference.
- FP16 Precision: Balances performance and quality.
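Several of these optimizations are toggled at model load time through constructor flags. A hedged sketch (flag names follow the repository's CLI constructor; exact availability may vary between releases):

```python
from cosyvoice.cli.cosyvoice import CosyVoice2

# load_jit / load_trt / fp16 trade startup cost and precision for speed;
# TensorRT and FP16 require a compatible NVIDIA GPU.
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_jit=True, load_trt=True, fp16=True)
```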
Installation and Usage
Environment Requirements
- Python 3.10
- CUDA-enabled GPU (recommended)
- Conda environment management
Quick Start
```bash
# Clone the repository
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice

# Create the environment
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt
```
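If installation fails with audio-related compilation errors, the project recommends installing sox system-wide first, e.g.:

```bash
# Ubuntu / Debian
sudo apt-get install sox libsox-dev
# CentOS
sudo yum install sox sox-devel
```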
Model Download
```python
from modelscope import snapshot_download

# Download CosyVoice 2.0 (recommended)
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')

# Download other versions
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
```
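If the ModelScope SDK is unavailable, the models can also be fetched with git (requires git-lfs); the URL pattern below follows ModelScope's repository convention:

```bash
# Make sure git-lfs is installed first
git lfs install
git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
```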
Basic Usage Example
```python
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

# Initialize the model
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')

# Zero-shot voice cloning: the first argument is the text to synthesize,
# the second is the transcript of the prompt audio.
prompt_speech = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, result in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐。',
        '希望你以后能够做的比我还好呦。',
        prompt_speech)):
    torchaudio.save(f'output_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)

# Instruction-controlled synthesis (CosyVoice 2 uses inference_instruct2)
for i, result in enumerate(cosyvoice.inference_instruct2(
        '今天天气真不错,我们去公园散步吧。',
        '用四川话说这句话',
        prompt_speech)):
    torchaudio.save(f'instruct_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```
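Continuing the script above, the same entry points also accept a `stream` flag that yields audio chunks incrementally; this is the mechanism behind the 150ms first-packet latency described earlier:

```python
# Streaming zero-shot synthesis: chunks are yielded as they are generated.
# In a real application each chunk would be pushed to a playback buffer;
# saving per-chunk files here is just for illustration.
for i, result in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐。',
        '希望你以后能够做的比我还好呦。',
        prompt_speech,
        stream=True)):
    torchaudio.save(f'stream_chunk_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```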
Deployment Solutions
Web Interface Deployment
```bash
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice2-0.5B
```
Docker Container Deployment
```bash
cd runtime/python
docker build -t cosyvoice:v1.0 .

# gRPC service
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && \
  python3 server.py --port 50000 --model_dir iic/CosyVoice2-0.5B"

# FastAPI service
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && \
  python3 server.py --port 50000 --model_dir iic/CosyVoice2-0.5B"
```
Application Scenarios
Commercial Applications
- Intelligent Customer Service: Multilingual customer service systems.
- Audiobooks: Personalized narration and character voice acting.
- Voice Assistants: Natural human-machine interaction experiences.
- Online Education: Multilingual educational content creation.
Creative Applications
- Podcast Production: Automated podcast content generation.
- Game Voice Acting: Character voice synthesis.
- Short Video Production: Quick voice-over solutions.
- Voice Translation: Real-time speech-to-speech translation.
Technical Integration
- Integration with LLMs: Building complete dialogue systems.
- Emotional Voice Chat: Dialogue robots that support emotional expression.
- Interactive Podcasts: Dynamic content generation.
- Expressive Audiobooks: Rich emotional expression.
Technical Advantages
Performance Metrics
- Latency: First-packet synthesis as low as 150ms.
- Quality: MOS evaluation score of 5.53.