CosyVoice Project Detailed Introduction
Project Overview
CosyVoice is a multilingual large-scale speech generation model developed by Alibaba's FunAudioLLM team, providing a complete full-stack solution for inference, training, and deployment. This project focuses on high-quality speech synthesis technology, supporting multiple languages and application scenarios.
Core Features
CosyVoice 2.0 Latest Features
Supported Languages
- Chinese, English, Japanese, Korean
- Chinese Dialects: Cantonese, Sichuanese, Shanghainese, Tianjin dialect, Wuhan dialect, etc.
Technical Breakthroughs
- Cross-lingual and Multilingual Mixing: Supports zero-shot voice cloning in cross-lingual and code-switching scenarios.
- Bidirectional Streaming Support: Integrates offline and streaming modeling techniques.
- Ultra-Low Latency Synthesis: First-packet synthesis latency as low as 150ms, while maintaining high-quality audio output.
- Improved Pronunciation Accuracy: Reduces pronunciation errors by 30% to 50% compared to version 1.0.
- Benchmark Achievements: Achieves the lowest character error rate on the difficult test set of the Seed-TTS evaluation set.
- Voice Consistency: Ensures reliable voice consistency for zero-shot and cross-lingual speech synthesis.
- Prosody and Audio Quality Enhancement: Improved synthesized audio alignment, with MOS evaluation scores increasing from 5.4 to 5.53.
- Emotional and Dialect Flexibility: Supports finer-grained emotional control and accent adjustment.
Model Versions
CosyVoice2-0.5B (Recommended)
- Latest version with superior performance.
- Supports all the latest features.
CosyVoice-300M Series
- CosyVoice-300M: Base model
- CosyVoice-300M-SFT: Supervised Fine-Tuning version
- CosyVoice-300M-Instruct: Instruction Fine-Tuning version
Functionality Modes
1. Zero-shot Voice Cloning
- Clones a voice from just a few seconds of sample audio.
- Supports cross-lingual voice cloning.
- Maintains the original speaker's voice characteristics.
2. Cross-lingual Synthesis
- Synthesizes speech in one language using an audio sample from another language.
- Supports multiple language combinations, including Chinese, English, Japanese, Korean, Cantonese, etc.
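A minimal sketch of this mode via the CLI's `inference_cross_lingual` entry point, assuming the CosyVoice2-0.5B model from the download step below; the prompt recording can be in a different language from the target text, and the prompt file shown here comes from the repository's asset folder:

```python
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
# A prompt recording in one language drives synthesis of text in another;
# load_wav resamples the prompt to 16 kHz.
prompt_speech = load_wav('./asset/cross_lingual_prompt.wav', 16000)
for i, result in enumerate(cosyvoice.inference_cross_lingual(
        '在一个阳光明媚的早晨,我们出发去郊外远足。',
        prompt_speech)):
    torchaudio.save(f'cross_lingual_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```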
3. Voice Conversion
- Converts the voice of one speaker to the voice of another speaker.
- Changes the voice while preserving the original content.
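A minimal sketch using the CLI's `inference_vc` entry point; both file paths below are placeholders for your own 16 kHz recordings:

```python
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
# source carries the spoken content to keep; prompt carries the target voice.
source_speech = load_wav('./asset/source_speech.wav', 16000)   # placeholder path
prompt_speech = load_wav('./asset/target_prompt.wav', 16000)   # placeholder path
for i, result in enumerate(cosyvoice.inference_vc(source_speech, prompt_speech)):
    torchaudio.save(f'vc_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```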
4. Supervised Fine-Tuning Mode (SFT)
- Performs speech synthesis using predefined speaker identities.
- Provides stable and reliable synthesis quality.
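A minimal sketch, assuming the CosyVoice-300M-SFT model has been downloaded as shown later in this document; '中文女' is one of its predefined speaker IDs:

```python
from cosyvoice.cli.cosyvoice import CosyVoice
import torchaudio

# The SFT model ships with predefined speakers, so no prompt audio is needed.
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
for i, result in enumerate(cosyvoice.inference_sft(
        '你好,我是通义生成式语音大模型。', '中文女')):
    torchaudio.save(f'sft_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```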
5. Instruction Control Mode (Instruct)
- Controls speech synthesis through natural language instructions.
- Supports emotional labels and special effects.
- Allows control over speech style, emotional expression, etc.
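A minimal sketch using the CosyVoice-300M-Instruct model's `inference_instruct` entry point (CosyVoice 2 uses `inference_instruct2` instead, shown in the usage example below); the instruction string and output file name here are illustrative:

```python
from cosyvoice.cli.cosyvoice import CosyVoice
import torchaudio

# The 300M-Instruct model takes a natural-language style instruction
# alongside the text and a predefined speaker ID.
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
for i, result in enumerate(cosyvoice.inference_instruct(
        '在面对挑战时,他展现了非凡的勇气与智慧。',
        '中文男',
        '请用激昂的语气朗读这句话。')):  # illustrative instruction text
    torchaudio.save(f'instruct300m_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```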
6. Fine-grained Control
- Supports special markers such as laughter `[laughter]` and breath `[breath]`.
- Supports emphasis control via `<strong></strong>` tags.
- Fine-grained adjustment of emotion and prosody.
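These markers are embedded directly in the input text. A sketch following the pattern of the project's examples (the prompt file comes from the repository's asset folder; the output file name is illustrative):

```python
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech = load_wav('./asset/zero_shot_prompt.wav', 16000)
# [laughter] is interpreted as a control token, not read out literally.
for i, result in enumerate(cosyvoice.inference_cross_lingual(
        '在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。',
        prompt_speech)):
    torchaudio.save(f'fine_grained_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```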
Technical Architecture
Core Technologies
- Discrete Speech Tokens: Supervised discrete speech tokenization technology.
- Progressive Semantic Decoding: Uses Language Models (LMs) and Flow Matching.
- Bidirectional Streaming Modeling: Supports real-time and batch inference.
- Multi-modal Integration: Seamless integration with large language models.
Performance Optimization
- Streaming Inference Support: Includes KV caching and SDPA optimization.
- Repetition-Aware Sampling (RAS): Improves LLM stability.
- TensorRT Acceleration: Supports GPU-accelerated inference.
- FP16 Precision: Balances performance and quality.
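Several of these optimizations are toggled at model load time through constructor flags. A hedged sketch (flag names follow the repository's CLI constructor; exact availability may vary between releases):

```python
from cosyvoice.cli.cosyvoice import CosyVoice2

# load_jit / load_trt / fp16 trade startup cost and precision for speed;
# TensorRT and FP16 require a compatible NVIDIA GPU.
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_jit=True, load_trt=True, fp16=True)
```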
Installation and Usage
Environment Requirements
- Python 3.10
- CUDA-enabled GPU (recommended)
- Conda environment management
Quick Start
```bash
# Clone the repository
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice

# Create the environment
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt
```
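If installation fails with audio-related compilation errors, the project recommends installing sox system-wide first, e.g.:

```bash
# Ubuntu / Debian
sudo apt-get install sox libsox-dev
# CentOS
sudo yum install sox sox-devel
```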
Model Download
```python
from modelscope import snapshot_download

# Download CosyVoice 2.0 (recommended)
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')

# Download other versions
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
```
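If the ModelScope SDK is unavailable, the models can also be fetched with git (requires git-lfs); the URL pattern below follows ModelScope's repository convention:

```bash
# Make sure git-lfs is installed first
git lfs install
git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
```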
Basic Usage Example
```python
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

# Initialize the model
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')

# Zero-shot voice cloning: the first argument is the text to synthesize,
# the second is the transcript of the prompt audio.
prompt_speech = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, result in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐。',
        '希望你以后能够做的比我还好呦。',
        prompt_speech)):
    torchaudio.save(f'output_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)

# Instruction-controlled synthesis (CosyVoice 2 uses inference_instruct2)
for i, result in enumerate(cosyvoice.inference_instruct2(
        '今天天气真不错,我们去公园散步吧。',
        '用四川话说这句话',
        prompt_speech)):
    torchaudio.save(f'instruct_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```
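Continuing the script above, the same entry points also accept a `stream` flag that yields audio chunks incrementally; this is the mechanism behind the 150ms first-packet latency described earlier:

```python
# Streaming zero-shot synthesis: chunks are yielded as they are generated.
# In a real application each chunk would be pushed to a playback buffer;
# saving per-chunk files here is just for illustration.
for i, result in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐。',
        '希望你以后能够做的比我还好呦。',
        prompt_speech,
        stream=True)):
    torchaudio.save(f'stream_chunk_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```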
Deployment Solutions
Web Interface Deployment
```bash
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice2-0.5B
```
Docker Container Deployment
```bash
cd runtime/python
docker build -t cosyvoice:v1.0 .

# gRPC service
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && \
  python3 server.py --port 50000 --model_dir iic/CosyVoice2-0.5B"

# FastAPI service
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && \
  python3 server.py --port 50000 --model_dir iic/CosyVoice2-0.5B"
```
Application Scenarios
Commercial Applications
- Intelligent Customer Service: Multilingual customer service systems.
- Audiobooks: Personalized narration and character voice acting.
- Voice Assistants: Natural human-machine interaction experiences.
- Online Education: Multilingual educational content creation.
Creative Applications
- Podcast Production: Automated podcast content generation.
- Game Voice Acting: Character voice synthesis.
- Short Video Production: Quick voice-over solutions.
- Voice Translation: Real-time speech-to-speech translation.
Technical Integration
- Integration with LLMs: Building complete dialogue systems.
- Emotional Voice Chat: Dialogue robots that support emotional expression.
- Interactive Podcasts: Dynamic content generation.
- Expressive Audiobooks: Rich emotional expression.
Technical Advantages
Performance Metrics
- Latency: First-packet synthesis as low as 150ms.
- Quality: MOS evaluation score of 5.53.