
An advanced real-time text-to-speech Python library that supports multiple TTS engines, featuring low latency and high-quality audio output.

MIT License · Python · 3.2k stars · KoljaB/RealtimeTTS · Last Updated: 2025-06-17

RealtimeTTS Project Detailed Introduction

Project Overview

RealtimeTTS is an advanced real-time Text-to-Speech (TTS) Python library, specifically designed for real-time applications that require low latency and high-quality audio output. This library can quickly convert text streams into high-quality audio output with minimal delay, making it ideal for building voice assistants, AI dialogue systems, and accessibility tools.

Project Address: https://github.com/KoljaB/RealtimeTTS

Core Features

1. Low Latency Processing

  • Near-instantaneous text-to-speech conversion: Optimized processing flow ensures minimal latency.
  • LLM Output Compatibility: Can directly process streaming output from Large Language Models.
  • Real-time Stream Processing: Supports real-time processing at both character and sentence levels.

2. High-Quality Audio Output

  • Clear and Natural Voice: Generates natural, human-like speech.
  • Multiple Audio Format Support: Supports various audio output formats.
  • Configurable Audio Parameters: Adjustable parameters such as sample rate and bit rate.

3. Multi-Engine Support

RealtimeTTS supports multiple TTS engines, providing a wide range of choices:

Cloud Engines 🌐

  • OpenAIEngine: OpenAI's TTS service, offering 6 high-quality voices.
  • AzureEngine: Microsoft Azure Speech Service, with 500,000 free characters per month.
  • ElevenlabsEngine: High-end voice quality, providing a rich selection of voices.
  • GTTSEngine: Free Google Translate TTS, no GPU required.
  • EdgeEngine: Microsoft Edge free TTS service.

Local Engines 🏠

  • CoquiEngine: High-quality neural TTS, supports local processing and voice cloning.
  • ParlerEngine: Local neural TTS, suitable for high-end GPUs.
  • SystemEngine: Built-in system TTS, quick setup.
  • PiperEngine: Extremely fast TTS system, even runs on Raspberry Pi.
  • StyleTTS2Engine: Stylized speech synthesis.
  • KokoroEngine: New engine with multilingual support.
  • OrpheusEngine: Newly added engine option.

4. Multilingual Support

  • Supports speech synthesis in multiple languages.
  • Intelligent sentence segmentation and language detection.
  • Configurable language-specific parameters.

5. Robustness and Reliability

  • Failover Mechanism: Automatically switches to a backup engine when one engine encounters a problem.
  • Continuous Operation Assurance: Ensures consistent performance and reliability for critical and professional use cases.
  • Error Handling: Comprehensive error handling and recovery mechanisms.
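
The failover idea can be sketched in plain Python, independent of any particular engine. Note that the `synthesize` method and the engine classes below are hypothetical stand-ins used to illustrate the concept, not the RealtimeTTS API:

```python
def synthesize_with_failover(engines, text):
    """Try each engine in order; fall back to the next on failure.

    `engines` is a list of objects exposing a hypothetical
    synthesize(text) method. This mirrors the idea behind automatic
    engine fallback, not the library's actual interface.
    """
    errors = []
    for engine in engines:
        try:
            return engine.synthesize(text)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append((type(engine).__name__, exc))
    raise RuntimeError(f"All engines failed: {errors}")


class FlakyEngine:
    """Dummy engine that always fails, to demonstrate the fallback."""
    def synthesize(self, text):
        raise ConnectionError("service unavailable")


class LocalEngine:
    """Dummy engine that always succeeds."""
    def synthesize(self, text):
        return f"audio<{text}>"


# The flaky engine fails, so the local engine handles the request.
audio = synthesize_with_failover([FlakyEngine(), LocalEngine()], "Hello")
print(audio)  # audio<Hello>
```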

Installation

Recommended Installation (Full Version)

pip install -U realtimetts[all]

Custom Installation

You can choose specific engine support as needed:

# System TTS only
pip install realtimetts[system]

# Azure support
pip install realtimetts[azure]

# Multi-engine combination
pip install realtimetts[azure,elevenlabs,openai]

Available Installation Options

  • all: Full installation, supports all engines.
  • system: Local system TTS (pyttsx3).
  • azure: Azure Speech Service support.
  • elevenlabs: ElevenLabs API integration.
  • openai: OpenAI TTS service.
  • gtts: Google Text-to-Speech.
  • edge: Microsoft Edge TTS.
  • coqui: Coqui TTS engine.
  • minimal: Core package only (for custom engine development).

Core Components

1. Text Stream Processing

  • Sentence Boundary Detection: Supports NLTK and Stanza tokenizers.
  • Intelligent Segmentation: Segments text based on punctuation and language rules.
  • Stream Processing: Supports character iterators and generators.
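
A toy illustration of how incremental sentence segmentation over a character stream can work. This is a simplified punctuation-based sketch, not the NLTK/Stanza tokenization RealtimeTTS actually uses:

```python
def segment_stream(chars, min_length=10):
    """Yield sentences from a character stream.

    Buffers characters and emits a sentence once terminal punctuation
    is seen and the buffer exceeds `min_length` characters -- a toy
    version of the incremental segmentation a streaming TTS pipeline
    performs so it can start synthesizing before the full text arrives.
    """
    buffer = ""
    for ch in chars:
        buffer += ch
        if ch in ".!?" and len(buffer) >= min_length:
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing fragment
        yield buffer.strip()

text = "Hello there! This is a second sentence. Bye."
print(list(segment_stream(iter(text))))
# ['Hello there!', 'This is a second sentence.', 'Bye.']
```

Short trailing fragments like "Bye." are held back until the stream ends, which is why a real pipeline exposes tuning knobs such as a minimum sentence length.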

2. Audio Stream Management

  • Asynchronous Playback: play_async() method supports non-blocking playback.
  • Synchronous Playback: play() method for blocking playback.
  • Stream Control: Supports pause, resume, and stop operations.

3. Callback System

Provides rich callback functions for monitoring and control:

  • on_text_stream_start(): Triggered when the text stream starts.
  • on_text_stream_stop(): Triggered when the text stream ends.
  • on_audio_stream_start(): Triggered when audio playback starts.
  • on_audio_stream_stop(): Triggered when audio playback ends.
  • on_character(): Triggered when each character is processed.
  • on_word(): Word-level time synchronization (supports Azure and Kokoro engines).
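
The lifecycle ordering of these callbacks can be demonstrated with a minimal stand-in class. The `ToyStream` below only mimics the firing order documented above (text-stream start, per-character, text-stream stop); it is not the RealtimeTTS class:

```python
events = []

class ToyStream:
    """Minimal stand-in showing when lifecycle callbacks fire.

    Mimics the ordering of the real callbacks (on_text_stream_start,
    on_character per character, on_text_stream_stop at the end).
    """
    def __init__(self, on_text_stream_start=None, on_character=None,
                 on_text_stream_stop=None):
        self.on_text_stream_start = on_text_stream_start
        self.on_character = on_character
        self.on_text_stream_stop = on_text_stream_stop

    def feed(self, text):
        if self.on_text_stream_start:
            self.on_text_stream_start()
        for ch in text:
            if self.on_character:
                self.on_character(ch)
        if self.on_text_stream_stop:
            self.on_text_stream_stop()

stream = ToyStream(
    on_text_stream_start=lambda: events.append("start"),
    on_character=lambda ch: events.append(ch),
    on_text_stream_stop=lambda: events.append("stop"),
)
stream.feed("Hi")
print(events)  # ['start', 'H', 'i', 'stop']
```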

Basic Usage Examples

Simple Usage

from RealtimeTTS import TextToAudioStream, SystemEngine

# Create engine and stream
engine = SystemEngine()
stream = TextToAudioStream(engine)

# Input text and play asynchronously (non-blocking)
stream.feed("Hello world! How are you today?")
stream.play_async()

Streaming Text Processing

# Process a string
stream.feed("Hello, this is a sentence.")

# Process a generator (suitable for LLM output)
# Note: this uses the legacy openai<1.0 ChatCompletion interface
import openai

def write(prompt: str):
    for chunk in openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    ):
        if (text_chunk := chunk["choices"][0]["delta"].get("content")) is not None:
            yield text_chunk

text_stream = write("A three-sentence relaxing speech.")
stream.feed(text_stream)

# Process a character iterator
char_iterator = iter("Streaming this character by character.")
stream.feed(char_iterator)

Playback Control

import time

# Asynchronous playback
stream.play_async()
while stream.is_playing():
    time.sleep(0.1)

# Synchronous playback
stream.play()

# Control operations
stream.pause()   # Pause
stream.resume()  # Resume
stream.stop()    # Stop

Advanced Configuration

TextToAudioStream Parameters

stream = TextToAudioStream(
    engine=engine,                    # TTS engine
    on_text_stream_start=callback,    # Text stream start callback
    on_audio_stream_start=callback,   # Audio stream start callback
    output_device_index=None,         # Audio output device
    tokenizer="nltk",                # Tokenizer selection
    language="en",                   # Language code
    muted=False,                     # Whether to mute
    level=logging.WARNING            # Log level
)

Playback Parameters

stream.play(
    fast_sentence_fragment=True,      # Fast sentence fragment processing
    buffer_threshold_seconds=0.0,     # Buffer threshold
    minimum_sentence_length=10,       # Minimum sentence length
    log_synthesized_text=False,       # Log synthesized text
    reset_generated_text=True,        # Reset generated text
    output_wavfile=None,             # Save to WAV file
    on_sentence_synthesized=callback, # Sentence synthesis complete callback
    before_sentence_synthesized=callback, # Before sentence synthesis callback
    on_audio_chunk=callback          # Audio chunk ready callback
)

Engine-Specific Configuration

OpenAI Engine

from RealtimeTTS import OpenAIEngine

engine = OpenAIEngine(
    api_key="your-api-key",  # Or set environment variable OPENAI_API_KEY
    voice="alloy",           # Optional: alloy, echo, fable, onyx, nova, shimmer
    model="tts-1"           # Or tts-1-hd
)

Azure Engine

from RealtimeTTS import AzureEngine

engine = AzureEngine(
    speech_key="your-speech-key",    # Or set environment variable AZURE_SPEECH_KEY
    service_region="your-region",    # For example: "eastus"
    voice_name="en-US-AriaNeural"   # Azure voice name
)

Coqui Engine (Voice Cloning)

from RealtimeTTS import CoquiEngine

engine = CoquiEngine(
    voice="path/to/voice/sample.wav",  # Voice cloning source file
    language="en"                      # Language code
)

Test Files

The project provides a rich set of test examples:

  • simple_test.py: Basic "Hello World" demonstration.
  • complex_test.py: Full-featured demonstration.
  • coqui_test.py: Local Coqui TTS engine test.
  • translator.py: Real-time multilingual translation (requires the openai package in addition to realtimetts).
  • openai_voice_interface.py: Voice-activated OpenAI API interface.
  • advanced_talk.py: Advanced dialogue system.
  • minimalistic_talkbot.py: Simple chatbot in 20 lines of code.
  • test_callbacks.py: Callback functionality and latency testing.

CUDA Support

For better performance, especially when using local neural engines, it is recommended to install CUDA support:

Installation Steps

  1. Install NVIDIA CUDA Toolkit (version 11.8 or 12.X).
  2. Install NVIDIA cuDNN.
  3. Install ffmpeg.
  4. Install CUDA-enabled PyTorch:
# CUDA 11.8
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.X
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
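
After installing, a quick check confirms whether PyTorch can see the GPU. The snippet is guarded so it also runs on machines where PyTorch is not installed:

```python
def cuda_status():
    """Return a human-readable description of the CUDA setup."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Report the first visible GPU by name
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "PyTorch installed, but no CUDA device detected"

print(cuda_status())
```

If this reports no CUDA device despite an installed toolkit, the usual culprits are a CPU-only PyTorch build or a CUDA/driver version mismatch.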

Application Scenarios

1. AI Assistants and Chatbots

  • Real-time response to user queries.
  • Natural conversational experience.
  • Multilingual support.

2. Accessibility Tools

  • Screen readers.
  • Visual impairment assistance.
  • Learning aids.

3. Content Creation

  • Podcast production.
  • Audiobooks.
  • Educational content.

4. Customer Service

  • Automated customer service systems.
  • Telephone voice bots.
  • Real-time translation services.

5. Games and Entertainment

  • In-game voice.
  • Virtual character voice acting.
  • Interactive entertainment applications.

Project Ecosystem

RealtimeTTS is part of a larger ecosystem:

  • RealtimeSTT: A complementary speech-to-text library; combined with RealtimeTTS, it forms a complete real-time audio processing pipeline.
  • Linguflex: The original project, a powerful open-source AI assistant.
  • LocalAIVoiceChat: A local AI voice dialogue system based on the Zephyr 7B model.

License Information

The project itself is open source, but note the license restrictions of each engine:

  • Open Source Engines: SystemEngine, GTTSEngine (MIT License).
  • Commercially Restricted Engines: CoquiEngine, ElevenlabsEngine, AzureEngine (free for non-commercial use).
  • Paid Services: OpenAI requires an API key and a paid plan.

System Requirements

  • Python Version: >= 3.9, < 3.13
  • Operating System: Windows, macOS, Linux
  • Dependencies: PyAudio, pyttsx3, pydub, etc.
  • GPU Support: NVIDIA graphics card recommended for local neural engines.

Summary

RealtimeTTS is a powerful and well-designed real-time text-to-speech library, suitable for modern applications that require high-quality, low-latency speech synthesis. Its multi-engine support, robust error handling mechanisms, and rich configuration options make it an ideal choice for building professional-grade voice applications. Whether for personal projects or enterprise-level applications, RealtimeTTS provides a reliable and efficient solution.

Star History Chart