PaddleSpeech: An easy-to-use speech toolkit, including self-supervised learning models, state-of-the-art/streaming ASR with punctuation, streaming TTS with text frontend, speaker verification system, end-to-end speech translation, and keyword spotting. Winner of the NAACL 2022 Best Demo Award.

Apache-2.0 | Python | 12.0k | PaddlePaddle | Last Updated: 2025-06-10

PaddleSpeech: Detailed Project Introduction

Project Overview

PaddleSpeech is an open-source speech toolkit built on Baidu's PaddlePaddle platform. It covers a variety of key speech and audio tasks and ships state-of-the-art and influential models. The project won the NAACL 2022 Best Demo Award.

Core Features

🚀 Easy to Use

  • Low Barrier to Entry: Installs with a single pip command or from source.
  • Quick-Start Tools: Provides a CLI, a Server, and a Streaming Server.
  • Multiple Interfaces: Supports both command-line and Python API usage.

🏆 Cutting-Edge Technology

  • State-of-the-Art Alignment: Provides fast, ultra-lightweight models alongside cutting-edge techniques.
  • Streaming System: Offers production-ready streaming ASR and streaming TTS systems.
  • Self-Supervised Learning: Integrates self-supervised learning models.

💯 Chinese Speech Frontend

  • Text Normalization and G2P: Includes text normalization and grapheme-to-phoneme (G2P) conversion.
  • Polyphone Processing: Handles polyphonic characters and tone sandhi.
  • Linguistic Rules: Uses custom linguistic rules to adapt to the Chinese context.

Main Functional Modules

1. Automatic Speech Recognition (ASR)

  • Supported Models: DeepSpeech2, Transformer, Conformer, U2, etc.
  • Multi-Language Support: Chinese, English, Chinese-English mixed.
  • Real-Time Recognition: Supports streaming speech recognition.
  • Punctuation Restoration: Automatically adds punctuation marks.
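
As a rough illustration of how recognition and punctuation restoration can be chained, the sketch below passes the raw transcript through a second step. It assumes that `TextExecutor` from `paddlespeech.cli.text.infer` is the toolkit's punctuation entry point and that it accepts a `text=` argument; the helper name `recognize_with_punctuation` is hypothetical. Both steps can be swapped for any callable, so the wiring can be tried without the toolkit installed.

```python
def recognize_with_punctuation(audio_file, asr=None, punctuate=None):
    """Transcribe `audio_file`, then restore punctuation in the transcript."""
    if asr is None:
        # Imported lazily so the helper can be exercised without the toolkit.
        from paddlespeech.cli.asr.infer import ASRExecutor
        asr = ASRExecutor()
    if punctuate is None:
        # Assumption: TextExecutor performs punctuation restoration by default.
        from paddlespeech.cli.text.infer import TextExecutor
        punctuate = TextExecutor()
    raw_text = asr(audio_file=audio_file)
    return punctuate(text=raw_text)
```

Calling `recognize_with_punctuation("zh.wav")` with no extra arguments would use the toolkit's own executors, provided they are installed.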

2. Text-to-Speech (TTS)

  • Acoustic Models: Tacotron2, FastSpeech2, SpeedySpeech, VITS, etc.
  • Vocoders: WaveFlow, PWGAN, HiFiGAN, Multi Band MelGAN, etc.
  • Multi-Language Support: Chinese, English, Chinese-English mixed, Cantonese.
  • Voice Cloning: Supports voice cloning and fine-tuning.

3. Voiceprint Recognition (VPR)

  • Speaker Recognition: Based on the ECAPA-TDNN model.
  • Voiceprint Extraction: Industrial-grade voiceprint feature extraction.
  • Speaker Diarization: Supports speaker diarization tasks.

4. Speech Translation (ST)

  • End-to-End Translation: English-to-Chinese speech translation.
  • Multi-Modal Pre-training: Combines acoustic and text features.

5. Audio Classification (CLS)

  • Open-Domain Classification: 527-class audio classification based on the AudioSet dataset.
  • PANN Models: Uses pre-trained audio neural networks.

6. Keyword Spotting (KWS)

  • Wake-Up Word Detection: Supports custom wake-up words.
  • Lightweight Models: Suitable for mobile deployment.

Technical Architecture

Model Support

  • Self-Supervised Learning: Wav2vec2.0, HuBERT, WavLM, etc.
  • Attention Mechanism: Transformer, Conformer architectures.
  • End-to-End Training: Unified models such as U2 and U2++.
  • Adversarial Training: Generative models such as VITS and StarGAN.

Dataset Support

  • ASR Datasets: Aishell, LibriSpeech, CommonVoice, etc.
  • TTS Datasets: LJSpeech, CSMSC, VCTK, etc.
  • Multi-Language Data: Supports Chinese-English mixed datasets.

Installation and Usage

System Requirements

  • Operating System: Linux (recommended), Windows, macOS.
  • Python Version: ≥ 3.8
  • Compiler: gcc ≥ 4.8.5
  • Dependency Framework: PaddlePaddle
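
The requirements above can be checked before installing anything else. The following is a minimal sketch using only the standard library; the function name `check_environment` is hypothetical. It tests the Python version against 3.8 and reports whether the `paddle` and `paddlespeech` packages can be found. It does not check the gcc version.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version from the requirements list


def check_environment(packages=("paddle", "paddlespeech")):
    """Report whether the interpreter and the named packages look usable."""
    report = {"python_ok": sys.version_info[:2] >= MIN_PYTHON}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```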

Installation Methods

1. pip Installation

pip install paddlespeech

2. Source Code Installation (Recommended)

git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install pytest-runner
pip install .

Quick Experience

Speech Recognition Example

# Command-line method (run in a shell)
paddlespeech asr --lang zh --input zh.wav

# Python API method (run in a Python interpreter)
from paddlespeech.cli.asr.infer import ASRExecutor
asr = ASRExecutor()
result = asr(audio_file="zh.wav")
print(result)
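
When several recordings need transcribing, one executor can be reused across files. The sketch below assumes only the `audio_file=` call shown above; the helper name `transcribe_all` is hypothetical, and the recognizer is injectable so the loop can be tested without loading a model.

```python
def transcribe_all(paths, recognize=None):
    """Return {path: transcript} for each audio file in `paths`.

    `recognize` is any callable that takes `audio_file=` and returns text.
    When omitted, a single ASRExecutor is created and reused for every file.
    """
    if recognize is None:
        # Imported lazily so the helper can be exercised without the toolkit.
        from paddlespeech.cli.asr.infer import ASRExecutor
        recognize = ASRExecutor()
    results = {}
    for path in paths:
        results[str(path)] = recognize(audio_file=str(path))
    return results
```

For example, `transcribe_all(["a.wav", "b.wav"])` would return a dictionary with one transcript per file.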

Speech Synthesis Example

# Command-line method (run in a shell)
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav

# Python API method (run in a Python interpreter)
from paddlespeech.cli.tts.infer import TTSExecutor
tts = TTSExecutor()
tts(text="今天天气十分不错。", output="output.wav")
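
Some applications, such as subtitling, want one audio file per sentence. The sketch below is one possible way to do that using only the `text=` and `output=` arguments shown above. The helper names and the output naming scheme are hypothetical, and the splitting rule is deliberately simple: it breaks only after 。！？ and their ASCII counterparts ! and ?.

```python
import re


def split_sentences(text):
    """Split text after Chinese or ASCII sentence-ending marks (no '.')."""
    parts = re.split(r"(?<=[。！？!?])", text)
    return [part.strip() for part in parts if part.strip()]


def synthesize_sentences(text, prefix="sentence", synthesize=None):
    """Write one WAV file per sentence and return the output file names."""
    if synthesize is None:
        # Imported lazily so the helper can be exercised without the toolkit.
        from paddlespeech.cli.tts.infer import TTSExecutor
        synthesize = TTSExecutor()
    outputs = []
    for index, sentence in enumerate(split_sentences(text), start=1):
        output = f"{prefix}_{index:03d}.wav"
        synthesize(text=sentence, output=output)
        outputs.append(output)
    return outputs
```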

Service Deployment

Speech Server

PaddleSpeech provides a complete server solution:

Start Service

paddlespeech_server start --config_file ./demos/speech_server/conf/application.yaml
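
The server may take a while to load its models, so scripts that start it often wait until the port accepts connections before sending requests. The following is a minimal standard-library sketch; the function name `wait_for_port` is hypothetical, and the default port 8090 is an assumption taken from the client examples in this section, not from the configuration file.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8090, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): pause, then retry.
            time.sleep(interval)
    return False
```

A successful TCP connection only shows that something is listening on the port; it does not prove that every engine has finished loading.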

Client Call

# ASR Service
paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input input_16k.wav

# TTS Service
paddlespeech_client tts --server_ip 127.0.0.1 --port 8090 --input "您好,欢迎使用百度飞桨语音合成服务。"
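
The ASR example above uses a file named input_16k.wav, which suggests, but does not confirm, that the service expects 16 kHz audio. Under that assumption, and the further assumption that the audio should be mono, the sketch below inspects a PCM WAV file with the standard `wave` module before it is sent. The helper names are hypothetical.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def looks_like_16k_mono(path):
    """Check the assumed input format: 16 kHz sample rate, one channel."""
    rate, channels, _ = wav_info(path)
    return rate == 16000 and channels == 1
```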

Streaming Service

The client also supports real-time streaming speech recognition and speech synthesis:

# Streaming ASR
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input input_16k.wav

# Streaming TTS
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --input "您好,欢迎使用百度飞桨语音合成服务。"
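
The client commands in this section share one shape, so they can be driven from Python. The sketch below builds the argument list from the flags shown above and runs it with `subprocess`. The helper names are hypothetical, and the default ports are simply copied from the examples; a real deployment may use different ones.

```python
import subprocess

# Subcommand -> port, copied from the examples in this section.
DEFAULT_PORTS = {"asr": 8090, "tts": 8090, "asr_online": 8090, "tts_online": 8092}


def client_command(task, payload, server_ip="127.0.0.1", port=None):
    """Build the argument list for one paddlespeech_client call."""
    if task not in DEFAULT_PORTS:
        raise ValueError(f"unsupported task: {task}")
    port = DEFAULT_PORTS[task] if port is None else port
    return ["paddlespeech_client", task, "--server_ip", server_ip,
            "--port", str(port), "--input", payload]


def run_client(task, payload, **kwargs):
    """Run the client and return its captured standard output."""
    cmd = client_command(task, payload, **kwargs)
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout
```

Passing the arguments as a list avoids shell quoting problems when the input text contains spaces or punctuation.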

Application Cases

Industrial Applications

  • Intelligent Customer Service: Speech recognition + speech synthesis.
  • Voice Assistant: Wake-up word detection + dialogue system.
  • Content Creation: Voice cloning + multi-language synthesis.
  • Accessibility Services: Speech-to-text + text-to-speech.

Academic Research

  • Multi-Modal Pre-training: ERNIE-SAT and other models.
  • Speech Translation: End-to-end English-to-Chinese.
  • Speaker Recognition: Voiceprint recognition and verification.
  • Audio Analysis: Audio classification and scene recognition.

Technical Advantages

1. Model Performance

  • SOTA Results: Reaches industry-leading levels in multiple tasks.
  • Lightweight Deployment: Supports mobile and edge devices.
  • Real-Time Processing: Meets real-time interaction needs.

2. Ease of Use

  • One-Click Deployment: Simplified installation and configuration process.
  • Rich Documentation: Complete usage instructions and examples.
  • Community Support: Active developer community.

3. Scalability

  • Modular Design: Supports custom models and tasks.
  • Multi-Language Support: Continuously expanding language coverage.
  • Cross-Platform Deployment: Supports multiple deployment environments.

Community and Ecosystem

Open Source Community

  • GitHub Stars: Over 10k stars.
  • Contributors: Developers from around the world.
  • Community Projects: Derivative projects based on PaddleSpeech.

Related Projects

  • PaddleBoBo: Virtual anchor voice generation.
  • VTuberTalk: Video voice cloning tool.
  • FastASR: C++ inference implementation.
  • VoiceTyping: Real-time voice input tool.

Summary

PaddleSpeech is a comprehensive, easy-to-use speech toolkit that covers core tasks such as speech recognition, speech synthesis, speaker verification, and speech translation. Its modular design and broad set of pre-trained models give developers and researchers a practical base for both academic research and industrial applications.