huggingface/parler-tts

A lightweight text-to-speech model that generates high-quality, natural-sounding speech from natural language descriptions.

Apache-2.0 | Python | 5.3k | huggingface/parler-tts | Last Updated: 2024-12-10

Parler-TTS Project Details

Project Overview

Parler-TTS is a lightweight text-to-speech (TTS) model that generates high-quality, natural-sounding speech and gives control over the speaker's style (gender, tone, speaking manner, etc.). This project is an open-source implementation of the research paper "Natural language guidance of high-fidelity text-to-speech with synthetic annotations" from Stability AI and the University of Edinburgh.

Project Features

Fully Open Source: Unlike many other TTS models, Parler-TTS is a fully open-source release.
Dataset Openness: All datasets, preprocessing, training code, and weights are publicly released under a permissive license.
Natural Language Control: Voice characteristics can be controlled through simple text prompts.
Multiple Model Sizes: Different parameter-scale model versions are available.

Available Model Versions

1. Parler-TTS Mini v1

Parameters: 880M
Training Data: 45K hours of audiobook data
Features: Lightweight, suitable for fast inference

2. Parler-TTS Large v1

Parameters: 2.2B
Training Data: 45K hours of audio data
Features: Higher quality speech generation

3. Parler-TTS Mini Expresso

Special Features: Provides superior emotional control (happy, confused, laughter, sad) and consistent voices (Jerry, Thomas, Elisabeth, Talia)

Installation

Basic Installation

pip install git+https://github.com/huggingface/parler-tts.git

Apple Silicon Users

After the basic installation, run this follow-up command to install the nightly PyTorch build:

pip3 install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu

Usage

Basic Usage Example

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# Use the first GPU when available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# prompt is the text to be spoken; description says how it should sound
prompt = "Hey, how are you doing today?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

# Tokenize the description and the prompt separately
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate the waveform and save it as a WAV file at the model's sampling rate
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

Using Predefined Speakers

The model supports 34 predefined speakers: Laura, Gary, Jon, Lea, Karen, Rick, Brenda, David, Eileen, Jordan, Mike, Yann, Joy, James, Eric, Lauren, Rose, Will, Jason, Aaron, Naomie, Alisa, Patrick, Jerry, Tina, Jenna, Bill, Tom, Carol, Barbara, Rebecca, Anna, Bruce, Emily. To select one, mention the speaker's name in the description. The example below reuses the model, tokenizer, and device from the basic usage example.

prompt = "Hey, how are you doing today?"
description = "Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
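Because a speaker is selected only by naming them in the description, a misspelled name would presumably go unnoticed rather than raise an error. The following is a minimal sketch of a guard for that case. The helper `speaker_description` and the constant `PREDEFINED_SPEAKERS` are hypothetical names introduced here for illustration; they are not part of the parler_tts package.

```python
# Hypothetical helper for illustration; not part of the parler_tts package.
# The names are copied from the list of 34 predefined speakers above.
PREDEFINED_SPEAKERS = frozenset(
    "Laura Gary Jon Lea Karen Rick Brenda David Eileen Jordan Mike Yann "
    "Joy James Eric Lauren Rose Will Jason Aaron Naomie Alisa Patrick Jerry "
    "Tina Jenna Bill Tom Carol Barbara Rebecca Anna Bruce Emily".split()
)


def speaker_description(name: str, style: str) -> str:
    """Build a description that starts with a validated speaker name."""
    if name not in PREDEFINED_SPEAKERS:
        raise ValueError(f"Unknown predefined speaker: {name!r}")
    return f"{name}'s voice is {style}"


# Rebuilds the description used in the example above
description = speaker_description(
    "Jon",
    "monotone yet slightly fast in delivery, with a very close recording "
    "that almost has no background noise.",
)
```

The resulting string can be tokenized and passed to model.generate exactly as in the example above.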

Usage Tips

Include the phrase "very clear audio" in the description to generate the highest quality audio.
Include "very noisy audio" in the description to add a high level of background noise.
Use punctuation in the prompt to control prosody; for example, commas add small pauses in the speech.
Control other speech characteristics (gender, speaking rate, pitch, and reverb) directly through the description.
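The tips above can be combined into a small description builder. This is only a sketch: `build_description` is a hypothetical helper, and its sentence template and default values are assumptions chosen for illustration, not wording prescribed by the project. Only the two quality phrases come from the tips.

```python
# Hypothetical helper for illustration; not part of the parler_tts package.
def build_description(
    gender: str = "female",
    rate: str = "moderate",
    pitch: str = "moderate",
    reverb: str = "very close up",
    clear: bool = True,
) -> str:
    """Compose a voice description from the controllable characteristics."""
    # The two quality phrases are the ones named in the usage tips
    quality = "very clear audio" if clear else "very noisy audio"
    return (
        f"A {gender} speaker talks at a {rate} speed with a {pitch} pitch. "
        f"The recording sounds {reverb}, with {quality}."
    )


clean = build_description()
noisy = build_description(gender="male", rate="slow", clear=False)
```

Either string can be used as the description in the usage examples, while the prompt carries the text to be spoken along with its punctuation.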

Training and Fine-tuning

Quick Training

accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_v1.json

Fine-tuning Support

The project provides complete training and fine-tuning guides, including:

Architecture introduction
Getting started steps
Detailed training guide
Single speaker dataset fine-tuning example

Technical Optimizations

The project includes various performance optimizations:

SDPA and Flash Attention 2 compatibility
Model compilation capabilities
Streaming generation support
Static cache optimization
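As a rough illustration of the first item, the attention backend can be requested when the model is loaded through the `attn_implementation` argument of the Transformers `from_pretrained` method. The sketch below is written under several assumptions: `pick_attn_implementation` and `load_model` are hypothetical helpers, Flash Attention 2 is only attempted on a CUDA device with the `flash_attn` package installed, and the choice of bfloat16 for that path is an assumption rather than a project recommendation.

```python
import importlib.util


def pick_attn_implementation(device: str, want_flash: bool = False) -> str:
    """Return "flash_attention_2" only when requested and usable, else "sdpa"."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if want_flash and device.startswith("cuda") and flash_installed:
        return "flash_attention_2"
    return "sdpa"


def load_model(name: str = "parler-tts/parler-tts-mini-v1",
               device: str = "cpu",
               want_flash: bool = False):
    """Hypothetical loader: downloads the checkpoint when called."""
    # Imported lazily so the selection logic works without these packages
    import torch
    from parler_tts import ParlerTTSForConditionalGeneration

    attn = pick_attn_implementation(device, want_flash)
    # Assumption: use half precision for Flash Attention 2, float32 otherwise
    dtype = torch.bfloat16 if attn == "flash_attention_2" else torch.float32
    model = ParlerTTSForConditionalGeneration.from_pretrained(
        name, attn_implementation=attn, torch_dtype=dtype
    )
    return model.to(device)
```

Falling back to SDPA keeps the same code path usable on CPU-only machines, where Flash Attention 2 is not available.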

Project Structure

Inference Code: Core TTS inference functionality
Training Code: Complete training and fine-tuning processes
Data-Speech Integration: Works with dataset annotation libraries
Optimization Tools: Multiple inference speed optimization options

Application Scenarios

Audiobook production
Voice assistants
Educational content creation
Accessibility assistive technology
Multimedia content creation

Open Source License and Citation

The project is released under the permissive Apache-2.0 license, which allows commercial use, and it welcomes community contributions. If you use this project, please cite:

@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
title = {Parler-TTS},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/parler-tts}}
}

Community Contributions

The project welcomes community contributions, especially in the following areas:

Dataset expansion and diversity
Training method optimization
Multilingual support
Performance optimization
Evaluation metric improvement

Parler-TTS gives researchers and developers a flexible, fully open text-to-speech solution, with the datasets, training code, and weights all publicly available.