An end-to-end speech recognition toolkit based on ModelScope, supporting various functions such as speech recognition, voice activity detection, and punctuation restoration.

MIT License · Python · FunASR · modelscope · 12.6k stars · Last Updated: September 09, 2025

FunASR - A Fundamental End-to-End Speech Recognition Toolkit

Project Overview

FunASR is a fundamental speech recognition toolkit that provides various functionalities, including Automatic Speech Recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization, and Multi-speaker ASR. Developed by Alibaba Damo Academy, this project aims to bridge the gap between academic research and industrial applications.

Project Address: https://github.com/modelscope/FunASR

Core Features

1. Multifunctional Speech Processing

  • Automatic Speech Recognition (ASR): Supports streaming and non-streaming recognition
  • Voice Activity Detection (VAD): Detects speech activity segments
  • Punctuation Restoration: Automatically adds punctuation marks
  • Speaker Recognition: Supports speaker verification and diarization
  • Emotion Recognition: Speech emotion analysis
  • Keyword Spotting: Supports keyword wake-up

2. Pre-trained Model Library

FunASR has released a large number of academic and industrial-grade pre-trained models on ModelScope and Hugging Face, primarily including:

| Model Name | Function Description | Training Data | Parameters |
|---|---|---|---|
| SenseVoiceSmall | Multi-modal speech understanding capabilities, including ASR, ITN, LID, SER, and AED | 300k hours | 234M |
| paraformer-zh | Chinese speech recognition, with timestamps, non-streaming | 60k hours, Chinese | 220M |
| paraformer-zh-streaming | Chinese speech recognition, streaming | 60k hours, Chinese | 220M |
| paraformer-en | English speech recognition, non-streaming | 50k hours, English | 220M |
| ct-punc | Punctuation restoration | 100M entries, Chinese & English | 290M |
| fsmn-vad | Voice activity detection | 5,000 hours, Chinese & English | 0.4M |
| Whisper-large-v3 | Multilingual speech recognition | Multilingual | 1550M |

3. Core Model Introduction

Paraformer

Paraformer-large is a non-autoregressive end-to-end speech recognition model, offering high accuracy, high efficiency, and convenient deployment, supporting the rapid construction of speech recognition services.

SenseVoice

SenseVoice is a foundational speech model with various speech understanding capabilities, including ASR, LID, SER, and AED, supporting multiple languages such as Chinese, Cantonese, English, Japanese, and Korean.

Installation and Usage

Installation Methods

Install via pip

pip3 install -U funasr

Install from source

git clone https://github.com/modelscope/FunASR.git && cd FunASR
pip3 install -e ./

Install model library support (optional)

pip3 install -U modelscope huggingface_hub

Quick Start

1. Command Line Usage

funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=asr_example_zh.wav

2. Python API - SenseVoice Model

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# English recognition
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
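
The rich_transcription_postprocess step matters because SenseVoice's raw transcript embeds marker tags for language, emotion, and audio events. As a rough illustration of the idea (not the library's actual implementation, and the tag contents here are assumptions), stripping such `<|...|>` markers could look like:

```python
import re

def strip_sensevoice_tags(raw: str) -> str:
    """Illustrative only: remove <|...|> marker tags from a raw
    SenseVoice-style transcript. The real postprocess also maps
    emotion/event tags to symbols and normalizes spacing."""
    return re.sub(r"<\|[^|]+\|>", "", raw).strip()

# Hypothetical raw output shape (tag names are assumptions for illustration)
raw = "<|en|><|NEUTRAL|><|Speech|>hello world"
print(strip_sensevoice_tags(raw))  # → hello world
```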

3. Python API - Paraformer Model

from funasr import AutoModel

# paraformer-zh is a multi-functional ASR model
model = AutoModel(
    model="paraformer-zh", 
    vad_model="fsmn-vad", 
    punc_model="ct-punc",
    # spk_model="cam++",  # Optional speaker recognition
)

res = model.generate(
    input=f"{model.model_path}/example/asr_example.wav",
    batch_size_s=300,
    hotword='魔搭'  # Hotword
)
print(res)
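
Since the model table above lists paraformer-zh as producing timestamps, it helps to see how a result might be consumed. The exact output schema varies by model and version, so the sketch below uses a mocked result list; the `timestamp` field shape ([start_ms, end_ms] pairs aligned to tokens) is an assumption for illustration:

```python
# Mocked generate() output: a list of result dicts keyed by "text"
# (as used via res[0]["text"] above). The [start_ms, end_ms] token
# alignment in "timestamp" is an assumed shape for this sketch.
res = [{
    "text": "欢 迎 使 用",
    "timestamp": [[90, 330], [330, 570], [570, 810], [810, 1050]],
}]

segments = []
for tok, (start_ms, end_ms) in zip(res[0]["text"].split(), res[0]["timestamp"]):
    # Convert millisecond offsets to readable second-based ranges
    segments.append(f"{start_ms / 1000:.2f}s-{end_ms / 1000:.2f}s {tok}")

print("\n".join(segments))
```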

4. Streaming Recognition

from funasr import AutoModel
import soundfile
import os

chunk_size = [0, 10, 5]  # 600ms latency configuration
encoder_chunk_look_back = 4
decoder_chunk_look_back = 1

model = AutoModel(model="paraformer-zh-streaming")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600ms

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)

for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(
        input=speech_chunk, 
        cache=cache, 
        is_final=is_final, 
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back, 
        decoder_chunk_look_back=decoder_chunk_look_back
    )
    print(res)
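
The numbers in the configuration above fit together as follows: at 16 kHz, one chunk unit of 960 samples spans 60 ms, so the middle chunk_size entry of 10 yields the 600 ms stride and latency noted in the comments. A quick check of that arithmetic:

```python
# Arithmetic behind the streaming config above (16 kHz input assumed).
SAMPLE_RATE = 16000
SAMPLES_PER_UNIT = 960                 # samples per chunk unit, as above
chunk = 10                             # middle entry of chunk_size = [0, 10, 5]

unit_ms = SAMPLES_PER_UNIT * 1000 / SAMPLE_RATE   # duration of one unit
chunk_stride = chunk * SAMPLES_PER_UNIT           # samples fed per generate()
latency_ms = chunk * unit_ms                      # per-chunk latency

print(unit_ms, chunk_stride, latency_ms)  # 60.0 9600 600.0
```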

Service Deployment

FunASR supports deploying pre-trained or fine-tuned models as services. The following service types are currently supported:

Supported Service Types

  • Offline File Transcription Service (Chinese, CPU version)
  • Real-time Transcription Service (Chinese, CPU version)
  • Offline File Transcription Service (English, CPU version)
  • Offline File Transcription Service (Chinese, GPU version)

Deployment Configuration Recommendations

Recommended configurations:

  • Configuration 1: (X86 compute) 4 vCPUs, 8 GB RAM; a single machine supports approximately 32 concurrent requests
  • Configuration 2: (X86 compute) 16 vCPUs, 32 GB RAM; a single machine supports approximately 64 concurrent requests

Technical Features

1. Model Innovation

  • Non-autoregressive Architecture: Paraformer adopts a non-autoregressive design to improve inference efficiency
  • 2Pass Mode: Combines the advantages of streaming and non-streaming
  • Hotword Support: Supports custom hotwords to improve recognition accuracy for specific vocabulary

2. Engineering Optimization

  • ONNX Export: Supports ONNX format export for models, facilitating deployment
  • Multi-platform Support: Supports CPU, GPU, ARM64, and other platforms
  • Containerized Deployment: Provides Docker image support

3. Developer Friendly

  • Unified Interface: AutoModel unifies the inference interfaces of ModelScope, Hugging Face, and FunASR
  • Plugin Design: Supports flexible combination of components like VAD, punctuation, and speaker diarization
  • Rich Documentation: Provides detailed tutorials and examples

Application Scenarios

1. Real-time Speech Transcription

  • Meeting minutes
  • Live captions
  • Voice assistants

2. Offline Audio Processing

  • Audio file transcription
  • Speech data analysis
  • Content moderation

3. Multilingual Support

  • Cross-language speech recognition
  • Speech translation
  • Multilingual customer service

Latest Updates

Major Updates in 2024

  • 2024/10/29: Real-time transcription service 1.12 released, 2pass-offline mode supports SenseVoice model
  • 2024/10/10: Added Whisper-large-v3-turbo model support
  • 2024/09/26: Fixed memory leak issues, supported SenseVoice ONNX model
  • 2024/07/04: Released SenseVoice foundational speech model
  • 2024/06/27: Offline file transcription service GPU 1.0 released

Community and Support

Open Source License

FunASR is released under the MIT License.

Community Participation

  • GitHub Issues: Technical questions and bug reports
  • DingTalk Group: Daily communication and discussion
  • ModelScope: Model download and sharing

Citation

If you use FunASR in your research, please cite the following paper:

@inproceedings{gao2023funasr,
  author={Zhifu Gao and Zerui Li and Jiaming Wang and Haoneng Luo and Xian Shi and Mengzhe Chen and Yabin Li and Lingyun Zuo and Zhihao Du and Zhangyu Xiao and Shiliang Zhang},
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  year={2023},
  booktitle={INTERSPEECH},
}

Summary

FunASR is a feature-complete and high-performance speech recognition toolkit that successfully combines cutting-edge academic research with the practical needs of industrial applications. Whether for researchers validating algorithms or developers building speech applications, FunASR provides powerful technical support and a convenient development experience. Through its rich pre-trained models, flexible deployment solutions, and active open-source community, FunASR is becoming an important infrastructure in the field of speech recognition.
