CTranslate2-based optimized implementation of Whisper speech recognition, up to 4x faster than the original with lower memory usage
Faster-Whisper Project Details
Project Overview
Faster-Whisper is a re-implementation of the OpenAI Whisper model, utilizing CTranslate2 as a fast inference engine. Compared to the original openai/whisper, it achieves up to 4x faster inference while maintaining the same accuracy and consuming less memory. Further efficiency gains can be achieved on both CPU and GPU through 8-bit quantization.
Core Features
🚀 Performance Advantages
- Speed Boost: Up to 4x faster than the original Whisper
- Memory Optimization: Lower memory footprint
- Quantization Support: Supports 8-bit quantization for further performance enhancement
- Batch Processing: Supports batch transcription to improve throughput
🛠️ Technical Features
- Based on the CTranslate2 inference engine
- Supports GPU and CPU execution
- Compatible with original Whisper models
- Supports multiple precision modes (FP16, FP32, INT8)
- Built-in audio decoding (no FFmpeg required)
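Because decoding goes through PyAV, audio can also be decoded to a raw waveform first and the array passed straight to transcribe. A minimal sketch, assuming the decode_audio helper exported by recent faster-whisper releases:
from faster_whisper import WhisperModel, decode_audio
# Decode to a mono float32 waveform at the 16 kHz rate Whisper expects
audio = decode_audio("audio.mp3", sampling_rate=16000)
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe(audio)  # a NumPy array works like a file path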
Performance Comparison
GPU Benchmark (NVIDIA RTX 3070 Ti 8GB)
Performance comparison for transcribing 13 minutes of audio:
| Implementation | Precision | Beam Size | Time | VRAM Usage |
| --- | --- | --- | --- | --- |
| openai/whisper | fp16 | 5 | 2m23s | 4708MB |
| whisper.cpp (Flash Attention) | fp16 | 5 | 1m05s | 4127MB |
| faster-whisper | fp16 | 5 | 1m47s | 3244MB |
| faster-whisper | int8 | 5 | 1m33s | 2926MB |
CPU Benchmark (Intel Core i7-12700K)
| Implementation | Precision | Beam Size | Time | RAM Usage |
| --- | --- | --- | --- | --- |
| openai/whisper | fp32 | 5 | 6m58s | 2335MB |
| whisper.cpp | fp32 | 5 | 2m05s | 1049MB |
| faster-whisper | fp32 | 5 | 2m37s | 2257MB |
| faster-whisper | int8 | 5 | 1m42s | 1477MB |
Installation Instructions
System Requirements
- Python 3.9 or higher
- NVIDIA CUDA library support required for GPU execution
Basic Installation
pip install faster-whisper
GPU Support Installation
NVIDIA libraries are required:
- CUDA 12.x
- cuDNN 9.x
- cuBLAS
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
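To verify the libraries are picked up, a quick check using CTranslate2's get_cuda_device_count helper:
import ctranslate2
# A non-zero count means CUDA, cuBLAS, and cuDNN were found and a GPU is usable
print("CUDA devices visible:", ctranslate2.get_cuda_device_count())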
Docker Method
# Use official NVIDIA CUDA image
docker run --gpus all -it nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
Usage Guide
Basic Transcription
from faster_whisper import WhisperModel
model_size = "large-v3"
# GPU execution (FP16)
model = WhisperModel(model_size, device="cuda", compute_type="float16")
# Or GPU execution (INT8)
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# Or CPU execution (INT8)
# model = WhisperModel(model_size, device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
Batch Transcription
from faster_whisper import WhisperModel, BatchedInferencePipeline
model = WhisperModel("turbo", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
Word-level Timestamps
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
Voice Activity Detection (VAD)
# Enable VAD filtering
segments, _ = model.transcribe("audio.mp3", vad_filter=True)
# Custom VAD parameters
segments, _ = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
)
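The filter is backed by the Silero VAD model, and further options such as threshold and speech_pad_ms can be passed the same way (parameter names follow faster-whisper's VadOptions and may differ between versions):
segments, _ = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters=dict(
        threshold=0.5,                # speech probability cut-off
        min_silence_duration_ms=500,  # silence length that splits segments
        speech_pad_ms=400,            # padding kept around detected speech
    ),
)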
Distil-Whisper Support
from faster_whisper import WhisperModel
model_size = "distil-large-v3"
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
    language="en",
    condition_on_previous_text=False,
)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
Model Conversion
From Transformers
pip install "transformers[torch]>=4.23"
ct2-transformers-converter \
    --model openai/whisper-large-v3 \
    --output_dir whisper-large-v3-ct2 \
    --copy_files tokenizer.json preprocessor_config.json \
    --quantization float16
Loading Custom Models
# Load from local directory
model = WhisperModel("whisper-large-v3-ct2")
# Load from Hugging Face Hub
model = WhisperModel("username/whisper-large-v3-ct2")
Application Scenarios
- Speech-to-text
- Real-time transcription
- Subtitle generation
- Multilingual translation
- Speech analysis
- Audio content indexing
Configuration and Optimization
Logging Configuration
import logging
logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)
Performance Optimization Suggestions
- Use GPU acceleration for optimal performance
- Select appropriate precision mode based on hardware
- Utilize batch processing to improve throughput
- Enable VAD filtering to reduce processing time
- Tune beam size and batch size for your workload (see the combined sketch below)
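A rough sketch combining several of these suggestions (int8 weights, batched inference, and VAD filtering; the model name, batch size, and options here are illustrative, and the batched pipeline's exact parameters may vary between versions):
from faster_whisper import WhisperModel, BatchedInferencePipeline
# int8_float16 reduces VRAM on GPU; use compute_type="int8" on CPU-only machines
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe(
    "audio.mp3",
    batch_size=8,      # raise for throughput, lower if VRAM is tight
    vad_filter=True,   # skip silent regions to reduce work
)
for segment in segments:
    print(segment.text)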
Thread Configuration
# Set CPU thread count
OMP_NUM_THREADS=4 python3 my_script.py
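Threading can also be configured per model; WhisperModel exposes cpu_threads and num_workers arguments:
# cpu_threads: threads used per inference; num_workers: parallel transcribe() calls
model = WhisperModel("small", device="cpu", compute_type="int8",
                     cpu_threads=4, num_workers=1)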
Technical Architecture
Core Components
- CTranslate2: Fast inference engine
- PyAV: Audio decoding library
- Silero VAD: Voice Activity Detection
- Transformers: Model conversion support
Supported Models
- OpenAI Whisper series (tiny, base, small, medium, large-v1/v2/v3)
- Distil-Whisper series
- Custom fine-tuned models
Community and Support
- GitHub Repository: https://github.com/SYSTRAN/faster-whisper
- PyPI Package: https://pypi.org/project/faster-whisper/
- Hugging Face Models: https://huggingface.co/Systran
Summary
Faster-Whisper is a high-performance speech recognition solution: an optimized inference engine delivers significant speedups while preserving the accuracy of the original Whisper. Its feature set, ecosystem support, and straightforward API make it a strong choice for speech recognition applications, letting developers and researchers alike build efficient speech processing pipelines quickly.