An advanced real-time text-to-speech Python library that supports multiple TTS engines, featuring low latency and high-quality audio output.
A Detailed Introduction to the RealtimeTTS Project
Project Overview
RealtimeTTS is an advanced real-time Text-to-Speech (TTS) Python library designed for applications that demand low latency and high-quality audio output. It converts text streams into speech with minimal delay, making it well suited to voice assistants, AI dialogue systems, and accessibility tools.
Project Address: https://github.com/KoljaB/RealtimeTTS
Core Features
1. Low Latency Processing
- Near-instantaneous text-to-speech conversion: Optimized processing flow ensures minimal latency.
- LLM Output Compatibility: Can directly process streaming output from Large Language Models.
- Real-time Stream Processing: Supports real-time processing at both character and sentence levels.
2. High-Quality Audio Output
- Clear and Natural Voice: Generates natural, human-like speech.
- Multiple Audio Format Support: Supports various audio output formats.
- Configurable Audio Parameters: Adjustable parameters such as sample rate and bit rate.
3. Multi-Engine Support
RealtimeTTS supports multiple TTS engines, providing a wide range of choices:
Cloud Engines 🌐
- OpenAIEngine: OpenAI's TTS service, offering 6 high-quality voices.
- AzureEngine: Microsoft Azure Speech Service, with 500,000 free characters per month.
- ElevenlabsEngine: High-end voice quality, providing a rich selection of voices.
- GTTSEngine: Free Google Translate TTS, no GPU required.
- EdgeEngine: Microsoft Edge free TTS service.
Local Engines 🏠
- CoquiEngine: High-quality neural TTS, supports local processing and voice cloning.
- ParlerEngine: Local neural TTS, suitable for high-end GPUs.
- SystemEngine: Built-in system TTS, quick setup.
- PiperEngine: Extremely fast TTS system, even runs on Raspberry Pi.
- StyleTTS2Engine: Stylized speech synthesis.
- KokoroEngine: New engine with multilingual support.
- OrpheusEngine: Newly added engine option.
4. Multilingual Support
- Supports speech synthesis in multiple languages.
- Intelligent sentence segmentation and language detection.
- Configurable language-specific parameters (see the sketch below).
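For example, a minimal sketch of steering segmentation with the language parameter; whether a natural-sounding voice is available still depends on the chosen engine:
from RealtimeTTS import TextToAudioStream, SystemEngine

# The language code steers sentence segmentation; the engine must
# still provide a matching voice for the target language.
stream = TextToAudioStream(SystemEngine(), language="de")
stream.feed("Hallo! Dies ist ein kurzer Test auf Deutsch.")
stream.play()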
5. Robustness and Reliability
- Failover Mechanism: Automatically switches to a backup engine when one engine encounters a problem (see the sketch after this list).
- Continuous Operation Assurance: Ensures consistent performance and reliability for critical and professional use cases.
- Error Handling: Comprehensive error handling and recovery mechanisms.
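A minimal sketch of the failover idea, assuming TextToAudioStream accepts a list of engines and falls back in order (check the current API documentation for the exact semantics):
from RealtimeTTS import TextToAudioStream, GTTSEngine, SystemEngine

# Preferred engine first; the offline SystemEngine serves as the backup.
stream = TextToAudioStream([GTTSEngine(), SystemEngine()])
stream.feed("If the first engine fails, playback continues on the backup.")
stream.play()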
Installation
Recommended Installation (Full Version)
pip install -U realtimetts[all]
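Note: in shells like zsh, square brackets are glob characters, so quote the extras: pip install -U "realtimetts[all]".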
Custom Installation
You can choose specific engine support as needed:
# System TTS only
pip install realtimetts[system]
# Azure support
pip install realtimetts[azure]
# Multi-engine combination
pip install realtimetts[azure,elevenlabs,openai]
Available Installation Options
- all: Full installation, supports all engines.
- system: Local system TTS (pyttsx3).
- azure: Azure Speech Service support.
- elevenlabs: ElevenLabs API integration.
- openai: OpenAI TTS service.
- gtts: Google Text-to-Speech.
- edge: Microsoft Edge TTS.
- coqui: Coqui TTS engine.
- minimal: Core package only (for custom engine development).
Core Components
1. Text Stream Processing
- Sentence Boundary Detection: Supports NLTK and Stanza tokenizers (selectable as shown below).
- Intelligent Segmentation: Segments text based on punctuation and language rules.
- Stream Processing: Supports character iterators and generators.
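For example, the tokenizer and language can be chosen when the stream is constructed, a sketch using the parameters documented in the configuration section below:
from RealtimeTTS import TextToAudioStream, SystemEngine

# Use the Stanza tokenizer for sentence boundary detection.
stream = TextToAudioStream(SystemEngine(), tokenizer="stanza", language="en")
stream.feed("First sentence. A second sentence follows immediately.")
stream.play()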
2. Audio Stream Management
- Asynchronous Playback: The play_async() method supports non-blocking playback.
- Synchronous Playback: The play() method blocks until playback finishes.
- Stream Control: Supports pause, resume, and stop operations.
3. Callback System
Provides rich callback functions for monitoring and control (a wiring sketch follows the list):
- on_text_stream_start(): Triggered when the text stream starts.
- on_text_stream_stop(): Triggered when the text stream ends.
- on_audio_stream_start(): Triggered when audio playback starts.
- on_audio_stream_stop(): Triggered when audio playback ends.
- on_character(): Triggered as each character is processed.
- on_word(): Word-level time synchronization (supported by the Azure and Kokoro engines).
Basic Usage Examples
Simple Usage
from RealtimeTTS import TextToAudioStream, SystemEngine
# Create engine and stream
engine = SystemEngine()
stream = TextToAudioStream(engine)
# Input text and play
stream.feed("Hello world! How are you today?")
stream.play_async()
Streaming Text Processing
# Process a string
stream.feed("Hello, this is a sentence.")
# Process a generator (suitable for LLM output)
# Uses the openai>=1.0 client API (pip install openai); the client
# reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()

def write(prompt: str):
    for chunk in client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        if (text_chunk := chunk.choices[0].delta.content) is not None:
            yield text_chunk
text_stream = write("A three-sentence relaxing speech.")
stream.feed(text_stream)
# Process a character iterator
char_iterator = iter("Streaming this character by character.")
stream.feed(char_iterator)
Playback Control
# Asynchronous playback
import time

stream.play_async()
while stream.is_playing():
    time.sleep(0.1)
# Synchronous playback
stream.play()
# Control operations
stream.pause() # Pause
stream.resume() # Resume
stream.stop() # Stop
Advanced Configuration
TextToAudioStream Parameters
stream = TextToAudioStream(
    engine=engine,                    # TTS engine
    on_text_stream_start=callback,    # Text stream start callback
    on_audio_stream_start=callback,   # Audio stream start callback
    output_device_index=None,         # Audio output device
    tokenizer="nltk",                 # Tokenizer selection
    language="en",                    # Language code
    muted=False,                      # Whether to mute
    level=logging.WARNING             # Log level
)
Playback Parameters
stream.play(
    fast_sentence_fragment=True,           # Fast sentence fragment processing
    buffer_threshold_seconds=0.0,          # Buffer threshold
    minimum_sentence_length=10,            # Minimum sentence length
    log_synthesized_text=False,            # Log synthesized text
    reset_generated_text=True,             # Reset generated text
    output_wavfile=None,                   # Save to WAV file
    on_sentence_synthesized=callback,      # Sentence synthesis complete callback
    before_sentence_synthesized=callback,  # Before sentence synthesis callback
    on_audio_chunk=callback                # Audio chunk ready callback
)
Engine-Specific Configuration
OpenAI Engine
from RealtimeTTS import OpenAIEngine
engine = OpenAIEngine(
    api_key="your-api-key",  # Or set the OPENAI_API_KEY environment variable
    voice="alloy",           # Options: alloy, echo, fable, onyx, nova, shimmer
    model="tts-1"            # Or tts-1-hd
)
Azure Engine
from RealtimeTTS import AzureEngine
engine = AzureEngine(
    speech_key="your-speech-key",   # Or set the AZURE_SPEECH_KEY environment variable
    service_region="your-region",   # For example: "eastus"
    voice_name="en-US-AriaNeural"   # Azure voice name
)
Coqui Engine (Voice Cloning)
from RealtimeTTS import CoquiEngine
engine = CoquiEngine(
    voice="path/to/voice/sample.wav",  # Voice cloning source file
    language="en"                      # Language code
)
Test Files
The project provides a rich set of test examples:
- simple_test.py: Basic "Hello World" demonstration.
- complex_test.py: Full-featured demonstration.
- coqui_test.py: Local Coqui TTS engine test.
- translator.py: Real-time multilingual translation (requires installing openai and realtimetts).
- openai_voice_interface.py: Voice-activated OpenAI API interface.
- advanced_talk.py: Advanced dialogue system.
- minimalistic_talkbot.py: A simple chatbot in 20 lines of code.
- test_callbacks.py: Callback functionality and latency testing.
CUDA Support
For better performance, especially when using local neural engines, it is recommended to install CUDA support:
Installation Steps
- Install NVIDIA CUDA Toolkit (version 11.8 or 12.X).
- Install NVIDIA cuDNN.
- Install ffmpeg.
- Install CUDA-enabled PyTorch:
# CUDA 11.8
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.X
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
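After installation, a quick check confirms that PyTorch can see the GPU:
import torch

print(torch.cuda.is_available())  # True if the CUDA build is active
print(torch.version.cuda)         # CUDA version PyTorch was built against
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # Name of the detected NVIDIA GPU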
Application Scenarios
1. AI Assistants and Chatbots
- Real-time response to user queries.
- Natural conversational experience.
- Multilingual support.
2. Accessibility Tools
- Screen readers.
- Visual impairment assistance.
- Learning aids.
3. Content Creation
- Podcast production.
- Audiobooks.
- Educational content.
4. Customer Service
- Automated customer service systems.
- Automated phone agents.
- Real-time translation services.
5. Games and Entertainment
- In-game voice.
- Virtual character voice acting.
- Interactive entertainment applications.
Project Ecosystem
RealtimeTTS is part of a larger ecosystem:
- RealtimeSTT: A complementary speech-to-text library; combined with RealtimeTTS, it forms a complete real-time voice pipeline (see the sketch after this list).
- Linguflex: The original project, a powerful open-source AI assistant.
- LocalAIVoiceChat: A local AI voice dialogue system based on the Zephyr 7B model.
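A hedged sketch of that pairing, assuming RealtimeSTT's AudioToTextRecorder API, where recorder.text() blocks until a phrase has been transcribed:
from RealtimeSTT import AudioToTextRecorder
from RealtimeTTS import TextToAudioStream, SystemEngine

recorder = AudioToTextRecorder()
stream = TextToAudioStream(SystemEngine())

while True:
    heard = recorder.text()            # wait for a transcribed phrase
    stream.feed(f"You said: {heard}")  # speak it back
    stream.play()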
License Information
The project itself is open source, but note the license restrictions of each engine:
- Open Source Engines: SystemEngine, GTTSEngine (MIT License).
- Commercially Restricted Engines: CoquiEngine, ElevenlabsEngine, AzureEngine (free for non-commercial use).
- Paid Services: OpenAI requires an API key and a paid plan.
System Requirements
- Python Version: >= 3.9, < 3.13
- Operating System: Windows, macOS, Linux
- Dependencies: PyAudio, pyttsx3, pydub, etc.
- GPU Support: NVIDIA graphics card recommended for local neural engines.
Summary
RealtimeTTS is a powerful and well-designed real-time text-to-speech library, suitable for modern applications that require high-quality, low-latency speech synthesis. Its multi-engine support, robust error handling mechanisms, and rich configuration options make it an ideal choice for building professional-grade voice applications. Whether for personal projects or enterprise-level applications, RealtimeTTS provides a reliable and efficient solution.