huggingface/parler-tts

A lightweight text-to-speech model that generates high-quality, natural-sounding speech guided by natural-language descriptions

Apache-2.0 | Python | 5.3k | huggingface/parler-tts | Last Updated: 2024-12-10

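Parler-TTS: Detailed Project Introduction

Project overview

Parler-TTS is a lightweight text-to-speech (TTS) model that generates high-quality, natural-sounding speech and lets you control the speaker's style (gender, pitch, speaking style, and so on). The project is an open-source implementation of the research paper "Natural language guidance of high-fidelity text-to-speech with synthetic annotations" from Stability AI and the University of Edinburgh.

Project features

Fully open source: unlike many other TTS models, Parler-TTS is a fully open-source release
Open datasets: all datasets, pre-processing code, training code, and weights are released publicly under a permissive license
Natural-language control: speech characteristics can be controlled with a simple text description
Multiple model sizes: checkpoints are available at different parameter counts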

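Available model versions

1. Parler-TTS Mini v1

Parameters: 880M
Training data: 45K hours of audiobook data
Highlights: lightweight, suited to fast inference

2. Parler-TTS Large v1

Parameters: 2.2B
Training data: 45K hours of audio data
Highlights: higher-quality speech generation

3. Parler-TTS Mini Expresso

Highlights: offers superior control over emotions (happy, confused, laughing, sad) and consistent voices (Jerry, Thomas, Elisabeth, Talia)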
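
To make the choice between the three versions concrete, here is a minimal sketch of a lookup helper. Only the Mini v1 repository ID appears in this document; the Large and Expresso IDs are assumptions inferred from the same naming pattern, and CHECKPOINTS and resolve_checkpoint are hypothetical names that are not part of the parler_tts package.

```python
# Hypothetical lookup table mapping a short variant name to a Hub repo ID.
CHECKPOINTS = {
    "mini": "parler-tts/parler-tts-mini-v1",             # used in this document
    "large": "parler-tts/parler-tts-large-v1",           # assumed repo ID
    "expresso": "parler-tts/parler-tts-mini-expresso",   # assumed repo ID
}


def resolve_checkpoint(variant: str) -> str:
    """Return the repo ID for a variant name, ignoring case and outer spaces."""
    key = variant.strip().lower()
    if key not in CHECKPOINTS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[key]
```

The returned string can be passed to from_pretrained in place of a hard-coded repository ID.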

Installation

Basic installation

pip install git+https://github.com/huggingface/parler-tts.git

Apple Silicon users (follow-up command that installs a nightly PyTorch build)

pip3 install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
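
The basic usage example in this document only checks for CUDA, so on Apple Silicon it falls back to the CPU. The sketch below shows one way to extend the device choice to PyTorch's mps backend. pick_device is a hypothetical helper, and whether every Parler-TTS operation runs on mps is an assumption not confirmed here.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's Metal backend (mps), then the CPU."""
    if cuda_available:
        return "cuda:0"
    if mps_available:
        return "mps"
    return "cpu"


# Usage with PyTorch installed (not executed here):
# import torch
# device = pick_device(torch.cuda.is_available(),
#                      torch.backends.mps.is_available())
```

Keeping the decision in a pure function makes it easy to check without any accelerator present.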

Usage

Basic usage example

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# Use the first GPU when available, otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# `prompt` is the text to be spoken; `description` controls how it is spoken.
prompt = "Hey, how are you doing today?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

# Tokenize the description and the prompt separately.
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate the waveform and save it as a WAV file at the model's sampling rate.
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

Using predefined speakers

The model supports 34 predefined speakers: Laura, Gary, Jon, Lea, Karen, Rick, Brenda, David, Eileen, Jordan, Mike, Yann, Joy, James, Eric, Lauren, Rose, Will, Jason, Aaron, Naomie, Alisa, Patrick, Jerry, Tina, Jenna, Bill, Tom, Carol, Barbara, Rebecca, Anna, Bruce, Emily.

# Reuses model, tokenizer, device, and sf from the basic usage example.
prompt = "Hey, how are you doing today?"
# Naming a predefined speaker in the description selects that voice.
description = "Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
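
Because the speaker is selected purely by name inside the description, a typo silently yields a different voice. The following sketch guards against that by checking the name against the 34 speakers listed above. PREDEFINED_SPEAKERS and speaker_description are hypothetical names, and the "<name>'s voice is ..." template simply mirrors the example description; it is not a format required by the model.

```python
# The 34 speaker names listed in this document.
_NAMES = (
    "Laura Gary Jon Lea Karen Rick Brenda David Eileen Jordan Mike Yann "
    "Joy James Eric Lauren Rose Will Jason Aaron Naomie Alisa Patrick Jerry "
    "Tina Jenna Bill Tom Carol Barbara Rebecca Anna Bruce Emily"
)
PREDEFINED_SPEAKERS = frozenset(_NAMES.split())


def speaker_description(name: str, style: str) -> str:
    """Build a description for a predefined speaker, rejecting unknown names."""
    if name not in PREDEFINED_SPEAKERS:
        raise ValueError(f"{name!r} is not one of the predefined speakers")
    return f"{name}'s voice is {style}"
```

The result can be used as the description string in the example above.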

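Usage tips

Include the phrase "very clear audio" in the description to generate the highest-quality audio
Include the phrase "very noisy audio" to add a high level of background noise
Punctuation in the prompt can be used to control prosody; for example, commas add short pauses to the speech
The remaining speech features (gender, speaking rate, pitch, and reverberation) can be controlled directly through the description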
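
The tips above can be folded into a small description builder. This is a minimal sketch: build_description is a hypothetical helper, and the sentence template is illustrative only. The two quality phrases are the only wording taken from the tips.

```python
def build_description(gender: str, pace: str, pitch: str, noisy: bool = False) -> str:
    """Compose a description from voice attributes and an audio-quality phrase."""
    # The two quality phrases come from the usage tips; the rest is a template.
    quality = "very noisy audio" if noisy else "very clear audio"
    return (
        f"A {gender} speaker talks at a {pace} pace with a {pitch} pitch. "
        f"The recording has {quality}."
    )
```

For example, build_description("female", "moderate", "low") produces a description that requests clean audio, while passing noisy=True switches to the noisy phrase.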

Training and fine-tuning

Quick training

accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_v1.json
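
Since the training script takes a JSON configuration file, one way to experiment is to copy the starting-point config and override a few fields rather than editing it in place. The sketch below uses only the standard library. write_config_with_overrides is a hypothetical helper, and the output_dir key in the usage comment is an assumption about the config's contents, not something this document confirms.

```python
import json
from pathlib import Path


def write_config_with_overrides(base_path, out_path, overrides):
    """Load a JSON config, apply overrides, and write the result to out_path."""
    config = json.loads(Path(base_path).read_text(encoding="utf-8"))
    # Reject keys absent from the base config to catch typos early.
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        raise KeyError(f"keys not present in base config: {unknown}")
    config.update(overrides)
    Path(out_path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Usage (the key name is assumed, check it against the real config file):
# write_config_with_overrides(
#     "./helpers/training_configs/starting_point_v1.json",
#     "./my_config.json",
#     {"output_dir": "./my_run"},
# )
# Then: accelerate launch ./training/run_parler_tts_training.py ./my_config.json
```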

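Fine-tuning support

The project provides complete training and fine-tuning guides, covering:

An introduction to the architecture
Getting-started steps
A detailed training guide
An example of fine-tuning on a single-speaker dataset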

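Technical optimizations

The project includes several performance optimizations:

Compatibility with SDPA and Flash Attention 2
Model compilation
Streaming generation
Static cache optimization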
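
As an illustration of the first item, the attention backend is typically chosen when the model is loaded, through the attn_implementation keyword that Transformers accepts in from_pretrained. The sketch below only builds and validates that keyword argument. attention_kwargs is a hypothetical helper, and whether a given backend works depends on the installed PyTorch and flash-attn versions, which is not verified here.

```python
# Backend names accepted by the Transformers `attn_implementation` argument.
ATTENTION_BACKENDS = ("eager", "sdpa", "flash_attention_2")


def attention_kwargs(backend: str = "sdpa") -> dict:
    """Return from_pretrained keyword arguments selecting an attention backend."""
    if backend not in ATTENTION_BACKENDS:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {ATTENTION_BACKENDS}"
        )
    return {"attn_implementation": backend}


# Usage (requires parler_tts and a model download, not executed here):
# model = ParlerTTSForConditionalGeneration.from_pretrained(
#     "parler-tts/parler-tts-mini-v1", **attention_kwargs("sdpa")
# ).to(device)
```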

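Project structure

Inference code: the core TTS inference functionality
Training code: the complete training and fine-tuning pipeline
Data-Speech integration: works together with the dataset annotation library
Optimization tools: several options for speeding up inference

Use cases

Audiobook production
Voice assistants
Educational content production
Accessibility and assistive technology
Multimedia content creation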

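License and citation

The project is released under the permissive Apache-2.0 license, which encourages community contributions and commercial use. If you use this project, please consider citing it: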

@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
title = {Parler-TTS},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/parler-tts}}
}

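Community contributions

The project welcomes community contributions, particularly in the following areas:

Dataset expansion and diversity
Training method optimization
Multilingual support
Performance optimization
Improved evaluation metrics

Parler-TTS represents an important step forward for open-source TTS, giving researchers and developers a capable and flexible text-to-speech solution.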