PromptTTS: Controllable Text-to-Speech with Text Descriptions

Guo, Zhifang; Leng, Yichong; Wu, Yihan; Zhao, Sheng; Tan, Xu

arXiv:2211.12171 (eess)

[Submitted on 22 Nov 2022]

Abstract: Using a text description as a prompt to guide the generation of text or images (e.g., GPT-3 or DALL-E 2) has drawn wide attention recently. Beyond text and image generation, in this work we explore the possibility of using text descriptions to guide speech synthesis. To this end, we develop a text-to-speech (TTS) system, dubbed PromptTTS, that takes a prompt with both style and content descriptions as input and synthesizes the corresponding speech. Specifically, PromptTTS consists of a style encoder and a content encoder, which extract the corresponding representations from the prompt, and a speech decoder, which synthesizes speech from the extracted style and content representations. Previous work in controllable TTS requires users to have acoustic knowledge to understand style factors such as prosody and pitch. PromptTTS is more user-friendly, since text descriptions are a more natural way to express speech style (e.g., "A lady whispers to her friend slowly"). Because no TTS dataset with prompts exists, we construct and release a dataset to benchmark the PromptTTS task; it contains prompts with style and content information together with the corresponding speech. Experiments show that PromptTTS can generate speech with precise style control and high speech quality. Audio samples and our dataset are publicly available.
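
The data flow described in the abstract (a style encoder and a content encoder feeding a speech decoder) can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`Prompt`, `style_encoder`, `content_encoder`, `speech_decoder`, `synthesize`, `STYLE_DIM`) is hypothetical, and the "encoders" and "decoder" are trivial stand-ins for what would be learned neural modules. The sketch only shows how the two parts of a prompt might be routed through separate encoders and then combined.

```python
from dataclasses import dataclass
from typing import List

STYLE_DIM = 4  # hypothetical size of the style representation


@dataclass
class Prompt:
    style: str    # style description, e.g. "A lady whispers to her friend slowly"
    content: str  # the text to be spoken


def style_encoder(style_description: str) -> List[float]:
    """Toy stand-in: map a style description to a normalized fixed-length vector."""
    vec = [0.0] * STYLE_DIM
    for word in style_description.lower().split():
        # Bucket each word by a simple character-sum hash (illustrative only).
        vec[sum(map(ord, word)) % STYLE_DIM] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def content_encoder(content: str) -> List[int]:
    """Toy stand-in: one integer id per character of the content text."""
    return [ord(ch) for ch in content]


def speech_decoder(style_vec: List[float], content_ids: List[int]) -> List[List[float]]:
    """Toy stand-in: emit one 'frame' per content token, shifted by the style vector."""
    return [[cid / 255.0 + s for s in style_vec] for cid in content_ids]


def synthesize(prompt: Prompt) -> List[List[float]]:
    """Route the prompt through both encoders, then decode."""
    style_vec = style_encoder(prompt.style)
    content_ids = content_encoder(prompt.content)
    return speech_decoder(style_vec, content_ids)


frames = synthesize(
    Prompt(style="A lady whispers to her friend slowly", content="Hello there")
)
print(len(frames), len(frames[0]))
```

In this sketch the number of output frames depends only on the content, while the style vector conditions every frame. A real system would replace each function with a trained model and produce acoustic features rather than placeholder numbers.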

Comments:	Submitted to ICASSP 2023
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2211.12171 [eess.AS]
	(or arXiv:2211.12171v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2211.12171

Submission history

From: Zhifang Guo
[v1] Tue, 22 Nov 2022 10:58:38 UTC (1,129 KB)
