An end-to-end speech recognition toolkit based on ModelScope, supporting various functions such as speech recognition, voice activity detection, and punctuation restoration.

MIT License · Python · FunASR · modelscope · 12.6k stars · Last Updated: September 09, 2025

FunASR - A Fundamental End-to-End Speech Recognition Toolkit

Project Overview

FunASR is a fundamental speech recognition toolkit that provides various functionalities, including Automatic Speech Recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization, and Multi-speaker ASR. Developed by Alibaba Damo Academy, this project aims to bridge the gap between academic research and industrial applications.

Project Address: https://github.com/modelscope/FunASR

Core Features

1. Multifunctional Speech Processing

  • Automatic Speech Recognition (ASR): Supports streaming and non-streaming recognition
  • Voice Activity Detection (VAD): Detects speech activity segments
  • Punctuation Restoration: Automatically adds punctuation marks
  • Speaker Recognition: Supports speaker verification and diarization
  • Emotion Recognition: Speech emotion analysis
  • Keyword Spotting: Supports keyword wake-up

2. Pre-trained Model Library

FunASR has released a large number of academic and industrial-grade pre-trained models on ModelScope and Hugging Face, primarily including:

| Model Name | Function Description | Training Data | Parameters |
|---|---|---|---|
| SenseVoiceSmall | Multi-modal speech understanding capabilities, including ASR, ITN, LID, SER, and AED | 300k hours | 234M |
| paraformer-zh | Chinese speech recognition, with timestamps, non-streaming | 60k hours, Chinese | 220M |
| paraformer-zh-streaming | Chinese speech recognition, streaming | 60k hours, Chinese | 220M |
| paraformer-en | English speech recognition, non-streaming | 50k hours, English | 220M |
| ct-punc | Punctuation restoration | 100M entries, Chinese & English | 290M |
| fsmn-vad | Voice activity detection | 5,000 hours, Chinese & English | 0.4M |
| Whisper-large-v3 | Multilingual speech recognition | Multilingual | 1550M |

3. Core Model Introduction

Paraformer

Paraformer-large is a non-autoregressive end-to-end speech recognition model, offering high accuracy, high efficiency, and convenient deployment, supporting the rapid construction of speech recognition services.

SenseVoice

SenseVoice is a foundational speech model with various speech understanding capabilities, including ASR, LID, SER, and AED, supporting multiple languages such as Chinese, Cantonese, English, Japanese, and Korean.

Installation and Usage

Installation Methods

Install via pip

pip3 install -U funasr

Install from source

git clone https://github.com/modelscope/FunASR.git && cd FunASR
pip3 install -e ./

Install model library support (optional)

pip3 install -U modelscope huggingface_hub

Quick Start

1. Command Line Usage

funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=asr_example_zh.wav

2. Python API - SenseVoice Model

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# English recognition
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
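
The rich_transcription_postprocess step matters because SenseVoice's raw transcript embeds marker tags for language, emotion, and audio events. As a rough illustration of the idea (not the library's actual implementation, and the tag contents here are assumptions), stripping such `<|...|>` markers could look like:

```python
import re

def strip_sensevoice_tags(raw: str) -> str:
    """Illustrative only: remove <|...|> marker tags from a raw
    SenseVoice-style transcript. The real postprocess also maps
    emotion/event tags to symbols and normalizes spacing."""
    return re.sub(r"<\|[^|]+\|>", "", raw).strip()

# Hypothetical raw output shape (tag names are assumptions for illustration)
raw = "<|en|><|NEUTRAL|><|Speech|>hello world"
print(strip_sensevoice_tags(raw))  # → hello world
```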

3. Python API - Paraformer Model

from funasr import AutoModel

# paraformer-zh is a multi-functional ASR model
model = AutoModel(
    model="paraformer-zh", 
    vad_model="fsmn-vad", 
    punc_model="ct-punc",
    # spk_model="cam++",  # Optional speaker recognition
)

res = model.generate(
    input=f"{model.model_path}/example/asr_example.wav",
    batch_size_s=300,
    hotword='魔搭'  # Hotword
)
print(res)
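
Since the model table above lists paraformer-zh as producing timestamps, it helps to see how a result might be consumed. The exact output schema varies by model and version, so the sketch below uses a mocked result list; the `timestamp` field shape ([start_ms, end_ms] pairs aligned to tokens) is an assumption for illustration:

```python
# Mocked generate() output: a list of result dicts keyed by "text"
# (as used via res[0]["text"] above). The [start_ms, end_ms] token
# alignment in "timestamp" is an assumed shape for this sketch.
res = [{
    "text": "欢 迎 使 用",
    "timestamp": [[90, 330], [330, 570], [570, 810], [810, 1050]],
}]

segments = []
for tok, (start_ms, end_ms) in zip(res[0]["text"].split(), res[0]["timestamp"]):
    # Convert millisecond offsets to readable second-based ranges
    segments.append(f"{start_ms / 1000:.2f}s-{end_ms / 1000:.2f}s {tok}")

print("\n".join(segments))
```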

4. Streaming Recognition

from funasr import AutoModel
import soundfile
import os

chunk_size = [0, 10, 5]  # 600ms latency configuration
encoder_chunk_look_back = 4
decoder_chunk_look_back = 1

model = AutoModel(model="paraformer-zh-streaming")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600ms

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)

for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(
        input=speech_chunk, 
        cache=cache, 
        is_final=is_final, 
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back, 
        decoder_chunk_look_back=decoder_chunk_look_back
    )
    print(res)
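
The numbers in the configuration above fit together as follows: at 16 kHz, one chunk unit of 960 samples spans 60 ms, so the middle chunk_size entry of 10 yields the 600 ms stride and latency noted in the comments. A quick check of that arithmetic:

```python
# Arithmetic behind the streaming config above (16 kHz input assumed).
SAMPLE_RATE = 16000
SAMPLES_PER_UNIT = 960                 # samples per chunk unit, as above
chunk = 10                             # middle entry of chunk_size = [0, 10, 5]

unit_ms = SAMPLES_PER_UNIT * 1000 / SAMPLE_RATE   # duration of one unit
chunk_stride = chunk * SAMPLES_PER_UNIT           # samples fed per generate()
latency_ms = chunk * unit_ms                      # per-chunk latency

print(unit_ms, chunk_stride, latency_ms)  # 60.0 9600 600.0
```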

Service Deployment

FunASR supports deploying pre-trained or fine-tuned models as services. The following service types are currently supported:

Supported Service Types

  • Offline File Transcription Service (Chinese, CPU version)
  • Real-time Transcription Service (Chinese, CPU version)
  • Offline File Transcription Service (English, CPU version)
  • Offline File Transcription Service (Chinese, GPU version)

Deployment Configuration Recommendations

Recommended configurations:

  • Configuration 1: (X86 compute) 4 vCPUs, 8 GB RAM; a single machine supports approximately 32 concurrent requests
  • Configuration 2: (X86 compute) 16 vCPUs, 32 GB RAM; a single machine supports approximately 64 concurrent requests

Technical Features

1. Model Innovation

  • Non-autoregressive Architecture: Paraformer adopts a non-autoregressive design to improve inference efficiency
  • 2Pass Mode: Combines the advantages of streaming and non-streaming
  • Hotword Support: Supports custom hotwords to improve recognition accuracy for specific vocabulary

2. Engineering Optimization

  • ONNX Export: Supports ONNX format export for models, facilitating deployment
  • Multi-platform Support: Supports CPU, GPU, ARM64, and other platforms
  • Containerized Deployment: Provides Docker image support

3. Developer Friendly

  • Unified Interface: AutoModel unifies the inference interfaces of ModelScope, Hugging Face, and FunASR
  • Plugin Design: Supports flexible combination of components like VAD, punctuation, and speaker diarization
  • Rich Documentation: Provides detailed tutorials and examples

Application Scenarios

1. Real-time Speech Transcription

  • Meeting minutes
  • Live captions
  • Voice assistants

2. Offline Audio Processing

  • Audio file transcription
  • Speech data analysis
  • Content moderation

3. Multilingual Support

  • Cross-language speech recognition
  • Speech translation
  • Multilingual customer service

Latest Updates

Major Updates in 2024

  • 2024/10/29: Real-time transcription service 1.12 released, 2pass-offline mode supports SenseVoice model
  • 2024/10/10: Added Whisper-large-v3-turbo model support
  • 2024/09/26: Fixed memory leak issues, supported SenseVoice ONNX model
  • 2024/07/04: Released SenseVoice foundational speech model
  • 2024/06/27: Offline file transcription service GPU 1.0 released

Community and Support

Open Source License

FunASR is released under the MIT License.

Community Participation

  • GitHub Issues: Technical questions and bug reports
  • DingTalk Group: Daily communication and discussion
  • ModelScope: Model download and sharing

Citation

If you use FunASR in your research, please cite the following paper:

@inproceedings{gao2023funasr,
  author={Zhifu Gao and Zerui Li and Jiaming Wang and Haoneng Luo and Xian Shi and Mengzhe Chen and Yabin Li and Lingyun Zuo and Zhihao Du and Zhangyu Xiao and Shiliang Zhang},
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  year={2023},
  booktitle={INTERSPEECH},
}

Summary

FunASR is a feature-complete and high-performance speech recognition toolkit that successfully combines cutting-edge academic research with the practical needs of industrial applications. Whether for researchers validating algorithms or developers building speech applications, FunASR provides powerful technical support and a convenient development experience. Through its rich pre-trained models, flexible deployment solutions, and active open-source community, FunASR is becoming an important infrastructure in the field of speech recognition.
