-
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Authors:
Mateusz Łajszczak,
Guillermo Cámbara,
Yang Li,
Fatih Beyhan,
Arent van Korlaar,
Fan Yang,
Arnaud Joly,
Álvaro Martín-Cortinas,
Ammar Abbas,
Adam Michalski,
Alexis Moinet,
Sri Karlapati,
Ewa Muszyńska,
Haohan Guo,
Bartosz Putrycz,
Soledad López Gambino,
Kayeon Yoo,
Elena Sokolova,
Thomas Drugman
Abstract:
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts ra…
▽ More
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
△ Less
Submitted 15 February, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
A Comparative Analysis of Pretrained Language Models for Text-to-Speech
Authors:
Marcel Granero-Moya,
Penny Karanasou,
Sri Karlapati,
Bastian Schnell,
Nicole Peinelt,
Alexis Moinet,
Thomas Drugman
Abstract:
State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS t…
▽ More
State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction. Firstly, we trained a prosody prediction model using 15 different PLMs. Our findings revealed a logarithmic relationship between model size and quality, as well as significant performance differences between neutral and expressive prosody. Secondly, we employed PLMs for pause prediction and found that the task was less sensitive to small models. We also identified a strong correlation between our empirical results and the GLUE scores obtained for these language models. To the best of our knowledge, this is the first study of its kind to investigate the impact of different PLMs on TTS.
△ Less
Submitted 4 September, 2023;
originally announced September 2023.
-
Controllable Emphasis with zero data for text-to-speech
Authors:
Arnaud Joly,
Marco Nicolis,
Ekaterina Peterova,
Alessandro Lombardi,
Ammar Abbas,
Arent van Korlaar,
Aman Hussain,
Parul Sharma,
Alexis Moinet,
Mateusz Lajszczak,
Penny Karanasou,
Antonio Bonafonte,
Thomas Drugman,
Elena Sokolova
Abstract:
We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques im…
▽ More
We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques improving naturalness by $7.3\%$ and correct testers' identification of the emphasized word in a sentence by $40\%$ on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.
△ Less
Submitted 13 July, 2023;
originally announced July 2023.
-
eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer
Authors:
Ammar Abbas,
Sri Karlapati,
Bastian Schnell,
Penny Karanasou,
Marcel Granero Moya,
Amith Nagaraj,
Ayman Boustati,
Nicole Peinelt,
Alexis Moinet,
Thomas Drugman
Abstract:
We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion…
▽ More
We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion from speech. In Stage II, we learn to predict the prosody representations using the contextual information available in text. We compare eCat to CopyCat2, a model capable of both fine-grained prosody transfer (FPT) and multi-speaker TTS. We show that eCat statistically significantly reduces the gap in naturalness between CopyCat2 and human recordings by an average of 46.7% across 2 languages, 3 locales, and 7 speakers, along with better target-speaker similarity in FPT. We also compare eCat to VITS, and show a statistically significant preference.
△ Less
Submitted 20 June, 2023;
originally announced June 2023.
-
Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need
Authors:
Daniel Korzekwa,
Jaime Lorenzo-Trueba,
Thomas Drugman,
Bozena Kostek
Abstract:
The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect…
▽ More
The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation errors with high accuracy (only 60\% precision at 40\%-80\% recall). One of the key problems is the low availability of mispronounced speech that is needed for the reliable training of pronunciation error detection models. If we had a generative model that could mimic non-native speech and produce any amount of training data, then the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field. Earlier studies have used simple speech generation techniques such as P2P conversion, but only as an additional mechanism to improve the accuracy of pronunciation error detection. We, on the other hand, consider speech generation to be the first-class method of detecting pronunciation errors. The effectiveness of these techniques is assessed in the tasks of detecting pronunciation and lexical stress errors. Non-native English speech corpora of German, Italian, and Polish speakers are used in the evaluations. The best proposed S2S technique improves the accuracy of detecting pronunciation errors in AUC metric by 41\% from 0.528 to 0.749 compared to the state-of-the-art approach.
△ Less
Submitted 2 July, 2022;
originally announced July 2022.
-
Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody
Authors:
Peter Makarov,
Ammar Abbas,
Mateusz Łajszczak,
Arnaud Joly,
Sri Karlapati,
Alexis Moinet,
Thomas Drugman,
Penny Karanasou
Abstract:
Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on m…
▽ More
Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on multi-speaker data all improve prosody. More interestingly, they result in synergies. Long context disambiguates prosody, improves coherence, and plays to the strengths of Transformers. Fine-tuning word-level features from a powerful language model, such as BERT, appears to profit from more training data, readily available in a multi-speaker setting. We look into objective metrics on pausing and pacing and perform thorough subjective evaluations for speech naturalness. Our main system, which incorporates all the extensions, achieves consistently strong results, including statistically significant improvements in speech naturalness over all its competitors.
△ Less
Submitted 29 June, 2022;
originally announced June 2022.
-
Expressive, Variable, and Controllable Duration Modelling in TTS
Authors:
Ammar Abbas,
Thomas Merritt,
Alexis Moinet,
Sri Karlapati,
Ewa Muszynska,
Simon Slangen,
Elia Gatti,
Thomas Drugman
Abstract:
Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. The current approaches largely fall back to relying on previous statistical parametric speech synthesis technology for duration prediction, which poorly models the expressiveness and variability in speech. In this paper, we propose two alternate approaches to improve…
▽ More
Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. The current approaches largely fall back to relying on previous statistical parametric speech synthesis technology for duration prediction, which poorly models the expressiveness and variability in speech. In this paper, we propose two alternate approaches to improve duration modelling. First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses. We show that the duration model conditioned on phrasing improves the naturalness of speech over our baseline duration model. Second, we also propose a multi-speaker duration model called Cauliflow, that uses normalising flows to predict durations that better match the complex target duration distribution. Cauliflow performs on par with our other proposed duration model in terms of naturalness, whilst providing variable durations for the same prompt and variable levels of expressiveness. Lastly, we propose to condition Cauliflow on parameters that provide an intuitive control of the pacing and pausing in the synthesised speech in a novel way.
△ Less
Submitted 28 June, 2022;
originally announced June 2022.
-
CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer
Authors:
Sri Karlapati,
Penny Karanasou,
Mateusz Lajszczak,
Ammar Abbas,
Alexis Moinet,
Peter Makarov,
Ray Li,
Arent van Korlaar,
Simon Slangen,
Thomas Drugman
Abstract:
In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel appro…
▽ More
In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby, enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody transfer. CC2 reduces the gap in naturalness between our baseline and copy-synthesised speech by $22.79\%$. In fine-grained prosody transfer evaluations, it obtains a relative improvement of $33.15\%$ in target speaker similarity.
△ Less
Submitted 27 June, 2022;
originally announced June 2022.
-
Distribution augmentation for low-resource expressive text-to-speech
Authors:
Mateusz Lajszczak,
Animesh Prasad,
Arent van Korlaar,
Bajibabu Bollepalli,
Antonio Bonafonte,
Arnaud Joly,
Marco Nicolis,
Alexis Moinet,
Thomas Drugman,
Trevor Wood,
Elena Sokolova
Abstract:
This paper presents a novel data augmentation technique for text-to-speech (TTS), that allows to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a w…
▽ More
This paper presents a novel data augmentation technique for text-to-speech (TTS), that allows to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactical correctness. We take additional measures to ensure that synthesized speech does not contain artifacts caused by combining inconsistent audio samples. The perceptual evaluations show that our method improves speech quality over a number of datasets, speakers, and TTS architectures. We also demonstrate that it greatly improves robustness of attention-based TTS models.
△ Less
Submitted 19 February, 2022; v1 submitted 13 February, 2022;
originally announced February 2022.
-
Multi-Scale Spectrogram Modelling for Neural Text-to-Speech
Authors:
Ammar Abbas,
Bajibabu Bollepalli,
Alexis Moinet,
Arnaud Joly,
Penny Karanasou,
Peter Makarov,
Simon Slangens,
Sri Karlapati,
Thomas Drugman
Abstract:
We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale me…
▽ More
We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale mel-spectrograms capturing fine-grained prosody.
We present details for two specific versions of MSS called Word-level MSS and Sentence-level MSS where the scales in our system are motivated by the linguistic units. The Word-level MSS models word, phoneme, and frame-level spectrograms while Sentence-level MSS models sentence-level spectrogram in addition.
Subjective evaluations show that Word-level MSS performs statistically significantly better compared to the baseline on two voices.
△ Less
Submitted 29 June, 2021;
originally announced June 2021.
-
A learned conditional prior for the VAE acoustic space of a TTS system
Authors:
Penny Karanasou,
Sri Karlapati,
Alexis Moinet,
Arnaud Joly,
Ammar Abbas,
Simon Slangen,
Jaime Lorenzo Trueba,
Thomas Drugman
Abstract:
Many factors influence speech yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for…
▽ More
Many factors influence speech yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system. By doing so, we aim to sample with more prosodic variability, while gaining controllability over the latent space's structure.
By using as prior the posterior distribution of a secondary VAE, which we condition on a speaker vector, we can sample from the primary VAE taking explicitly the conditioning into account and resulting in samples from a specific region of the latent space for each condition (i.e. speaker). A formal preference test demonstrates significant preference of the proposed approach over standard Conditional VAE. We also provide visualisations of the latent space where well-separated condition-specific clusters appear, as well as ablation studies to better understand the behaviour of the system.
△ Less
Submitted 14 June, 2021;
originally announced June 2021.
-
Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments
Authors:
Alejandro Mottini,
Jaime Lorenzo-Trueba,
Sri Vishnu Kumar Karlapati,
Thomas Drugman
Abstract:
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and evaluated on clean speech recordings. However, many acoustic environments are noisy and reverberant, severely restricting the applicability of popular VC method…
▽ More
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and evaluated on clean speech recordings. However, many acoustic environments are noisy and reverberant, severely restricting the applicability of popular VC methods to such scenarios. To address this limitation, we propose Voicy, a new VC framework particularly tailored for noisy speech. Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder. Importantly, Voicy is capable of performing non-parallel zero-shot VC, an important requirement for any VC system that needs to work on speakers not seen during training. We have validated our approach using a noisy reverberant version of the LibriSpeech dataset. Experimental results show that Voicy outperforms other tested VC techniques in terms of naturalness and target speaker similarity in noisy reverberant environments.
△ Less
Submitted 16 June, 2021;
originally announced June 2021.
-
Weakly-supervised word-level pronunciation error detection in non-native English speech
Authors:
Daniel Korzekwa,
Jaime Lorenzo-Trueba,
Thomas Drugman,
Shira Calamaro,
Bozena Kostek
Abstract:
We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words. The lack of phonetic transcriptions for L2 speech means that the model has to learn only from a weak signal of word-level mispronunciations. Because of that and due to…
▽ More
We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words. The lack of phonetic transcriptions for L2 speech means that the model has to learn only from a weak signal of word-level mispronunciations. Because of that and due to the limited amount of mispronounced L2 speech, the model is more likely to overfit. To limit this risk, we train it in a multi-task setup. In the first task, we estimate the probabilities of word-level mispronunciation. For the second task, we use a phoneme recognizer trained on phonetically transcribed L1 speech that is easily accessible and can be automatically annotated. Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors in AUC metric by 30% on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on the Isle Corpus of L2 German and Italian speakers.
△ Less
Submitted 7 June, 2021;
originally announced June 2021.
-
Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling
Authors:
Daniel Korzekwa,
Jaime Lorenzo-Trueba,
Szymon Zaporowski,
Shira Calamaro,
Thomas Drugman,
Bozena Kostek
Abstract:
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare it to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, b) there is a single correct way for a sentence to be pronounced. These assumptions do…
▽ More
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare it to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, b) there is a single correct way for a sentence to be pronounced. These assumptions do not always hold, which can result in a significant amount of false mispronunciation alarms. We propose a novel approach to overcome this problem based on two principles: a) taking into account uncertainty in the automatic phoneme recognition step, b) accounting for the fact that there may be multiple valid pronunciations. We evaluate the model on non-native (L2) English speech of German, Italian and Polish speakers, where it is shown to increase the precision of detecting mispronunciations by up to 18% (relative) compared to the common approach.
△ Less
Submitted 8 February, 2021; v1 submitted 16 January, 2021;
originally announced January 2021.
-
EmoCat: Language-agnostic Emotional Voice Conversion
Authors:
Bastian Schnell,
Goeric Huybrechts,
Bartek Perz,
Thomas Drugman,
Jaime Lorenzo-Trueba
Abstract:
Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data hungry than text-to-speech models and allow to generate large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with les…
▽ More
Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data hungry than text-to-speech models and allow to generate large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with less than 45 minutes of German emotional recordings by exploiting large amounts of emotional data in US English. EmoCat is an encoder-decoder model based on CopyCat, a voice conversion system which transfers prosody. We use adversarial training to remove emotion leakage from the encoder to the decoder. The adversarial training is improved by a novel contribution to gradient reversal to truly reverse gradients. This allows to remove only the leaking information and to converge to better optima with higher conversion performance. Evaluations show that Emocat can convert to different emotions but misses on emotion intensity compared to the recordings, especially for very expressive emotions. EmoCat is able to achieve audio quality on par with the recordings for five out of six tested emotion intensities.
△ Less
Submitted 14 January, 2021;
originally announced January 2021.
-
Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention
Authors:
Daniel Korzekwa,
Roberto Barra-Chicote,
Szymon Zaporowski,
Grzegorz Beringer,
Jaime Lorenzo-Trueba,
Alicja Serafinowicz,
Jasha Droppo,
Thomas Drugman,
Bozena Kostek
Abstract:
This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learni…
▽ More
This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model that automatically derives optimal syllable-level representation from frame-level and phoneme-level audio features. Training this model is challenging because of the limited amount of incorrect stress patterns. To solve this problem, we propose to augment the training set with incorrectly stressed words generated with Neural TTS. Combining both techniques achieves 94.8% precision and 49.2% recall for the detection of incorrectly stressed words in L2 English speech of Slavic and Baltic speakers.
△ Less
Submitted 7 June, 2021; v1 submitted 29 December, 2020;
originally announced December 2020.
-
Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech
Authors:
Sri Karlapati,
Ammar Abbas,
Zack Hodari,
Alexis Moinet,
Arnaud Joly,
Penny Karanasou,
Thomas Drugman
Abstract:
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information ava…
▽ More
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of $13.2\%$ in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
△ Less
Submitted 4 November, 2020;
originally announced November 2020.
-
CAMP: a Two-Stage Approach to Modelling Prosody in Context
Authors:
Zack Hodari,
Alexis Moinet,
Sri Karlapati,
Jaime Lorenzo-Trueba,
Thomas Merritt,
Arnaud Joly,
Ammar Abbas,
Penny Karanasou,
Thomas Drugman
Abstract:
Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In th…
▽ More
Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our Context-Aware Model of Prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly-trained duration model improves prosody significantly.
△ Less
Submitted 12 February, 2021; v1 submitted 2 November, 2020;
originally announced November 2020.
-
Parametric Representation for Singing Voice Synthesis: a Comparative Evaluation
Authors:
Onur Babacan,
Thomas Drugman,
Tuomo Raitio,
Daniel Erro,
Thierry Dutoit
Abstract:
Various parametric representations have been proposed to model the speech signal. While the performance of such vocoders is well-known in the context of speech processing, their extrapolation to singing voice synthesis might not be straightforward. The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical par…
▽ More
Various parametric representations have been proposed to model the speech signal. While the performance of such vocoders is well-known in the context of speech processing, their extrapolation to singing voice synthesis might not be straightforward. The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical parametric synthesis: traditional pulse vocoder, Deterministic plus Stochastic Model, Harmonic plus Noise Model and GlottHMM. The behavior of these techniques as a function of the singer type (baritone, counter-tenor and soprano) is studied. Secondly, the artifacts occurring in high-pitched voices are discussed and possible approaches to overcome them are suggested.
△ Less
Submitted 7 June, 2020;
originally announced June 2020.
-
Maximum Phase Modeling for Sparse Linear Prediction of Speech
Authors:
Thomas Drugman
Abstract:
Linear prediction (LP) is an ubiquitous analysis method in speech processing. Various studies have focused on sparse LP algorithms by introducing sparsity constraints into the LP framework. Sparse LP has been shown to be effective in several issues related to speech modeling and coding. However, all existing approaches assume the speech signal to be minimum-phase. Because speech is known to be mix…
▽ More
Linear prediction (LP) is an ubiquitous analysis method in speech processing. Various studies have focused on sparse LP algorithms by introducing sparsity constraints into the LP framework. Sparse LP has been shown to be effective in several issues related to speech modeling and coding. However, all existing approaches assume the speech signal to be minimum-phase. Because speech is known to be mixed-phase, the resulting residual signal contains a persistent maximum-phase component. The aim of this paper is to propose a novel technique which incorporates a modeling of the maximum-phase contribution of speech, and can be applied to any filter representation. The proposed method is shown to significantly increase the sparsity of the LP residual signal and to be effective in two illustrative applications: speech polarity detection and excitation modeling.
△ Less
Submitted 7 June, 2020;
originally announced June 2020.
-
Analysis and Synthesis of Hypo and Hyperarticulated Speech
Authors:
Benjamin Picart,
Thomas Drugman,
Thierry Dutoit
Abstract:
This paper focuses on the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis. First of all, a new French database matching our needs was created, which contains three identical sets, pronounced with three different degrees of articulation: neutral, hypo and hyperarticulated speech. On that basis, acoustic and phonetic analyses were performed.…
▽ More
This paper focuses on the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis. First of all, a new French database matching our needs was created, which contains three identical sets, pronounced with three different degrees of articulation: neutral, hypo and hyperarticulated speech. On that basis, acoustic and phonetic analyses were performed. It is shown that the degrees of articulation significantly influence, on one hand, both vocal tract and glottal characteristics, and on the other hand, speech rate, phone durations, phone variations and the presence of glottal stops. Finally, neutral, hypo and hyperarticulated speech are synthesized using HMM-based speech synthesis and both objective and subjective tests aiming at assessing the generated speech quality are performed. These tests show that synthesized hypoarticulated speech seems to be less naturally rendered than neutral and hyperarticulated speech.
△ Less
Submitted 7 June, 2020;
originally announced June 2020.
-
Residual Excitation Skewness for Automatic Speech Polarity Detection
Authors:
Thomas Drugman
Abstract:
Detecting the correct speech polarity is a necessary step prior to several speech processing techniques. An error on its determination could have a dramatic detrimental impact on their performance. As current systems have to deal with increasing amounts of data stemming from multiple devices, the automatic detection of speech polarity has become a crucial problem. For this purpose, we here propose…
▽ More
Detecting the correct speech polarity is a necessary step prior to several speech processing techniques. An error on its determination could have a dramatic detrimental impact on their performance. As current systems have to deal with increasing amounts of data stemming from multiple devices, the automatic detection of speech polarity has become a crucial problem. For this purpose, we here propose a very simple algorithm based on the skewness of two excitation signals. The method is shown on 10 speech corpora (8545 files) to lead to an error rate of only 0.06% in clean conditions and to clearly outperform four state-of-the-art methods. Besides it significantly reduces the computational load through its simplicity and is observed to exhibit the strongest robustness in both noisy and reverberant environments.
△ Less
Submitted 31 May, 2020;
originally announced June 2020.
-
Maximum Voiced Frequency Estimation: Exploiting Amplitude and Phase Spectra
Authors:
Thomas Drugman,
Yannis Stylianou
Abstract:
Maximum Voiced Frequency (MVF) is used in various speech models as the spectral boundary separating periodic and aperiodic components during the production of voiced sounds. Recent studies have shown that its proper estimation and modeling enhance the quality of statistical parametric speech synthesizers. Contrastingly, these same methods of MVF estimation have been reported to degrade the perform…
▽ More
Maximum Voiced Frequency (MVF) is used in various speech models as the spectral boundary separating periodic and aperiodic components during the production of voiced sounds. Recent studies have shown that its proper estimation and modeling enhance the quality of statistical parametric speech synthesizers. Contrastingly, these same methods of MVF estimation have been reported to degrade the performance of singing voice synthesizers. This paper proposes a new approach for MVF estimation which exploits both amplitude and phase spectra. It is shown that phase conveys relevant information about the harmonicity of the voice signal, and that it can be jointly used with features derived from the amplitude spectrum. This information is further integrated into a maximum likelihood criterion which provides a decision about the MVF estimate. The proposed technique is compared to two state-of-the-art methods, and shows a superior performance in both objective and subjective evaluations. Perceptual tests indicate a drastic improvement in high-pitched voices.
△ Less
Submitted 31 May, 2020;
originally announced June 2020.
-
Data-driven Detection and Analysis of the Patterns of Creaky Voice
Authors:
Thomas Drugman,
John Kane,
Christer Gobl
Abstract:
This paper investigates the temporal excitation patterns of creaky voice. Creaky voice is a voice quality frequently used as a phrase-boundary marker, but also as a means of portraying attitude, affective states and even social status. Consequently, the automatic detection and modelling of creaky voice may have implications for speech technology applications. The acoustic characteristics of creaky…
▽ More
This paper investigates the temporal excitation patterns of creaky voice. Creaky voice is a voice quality frequently used as a phrase-boundary marker, but also as a means of portraying attitude, affective states and even social status. Consequently, the automatic detection and modelling of creaky voice may have implications for speech technology applications. The acoustic characteristics of creaky voice are, however, rather distinct from modal phonation. Further, several acoustic patterns can bring about the perception of creaky voice, thereby complicating the strategies used for its automatic detection, analysis and modelling. The present study is carried out using a variety of languages, speakers, and on both read and conversational data and involves a mutual information-based assessment of the various acoustic features proposed in the literature for detecting creaky voice. These features are then exploited in classification experiments where we achieve an appreciable improvement in detection accuracy compared to the state of the art. Both experiments clearly highlight the presence of several creaky patterns. A subsequent qualitative and quantitative analysis of the identified patterns is provided, which reveals a considerable speaker-dependent variability in the usage of these creaky patterns. We also investigate how creaky voice detection systems perform across creaky patterns.
△ Less
Submitted 31 May, 2020;
originally announced June 2020.
-
Glottal source estimation robustness: A comparison of sensitivity of voice source estimation techniques
Authors:
Thomas Drugman,
Thomas Dubuisson,
Alexis Moinet,
Nicolas D'Alessandro,
Thierry Dutoit
Abstract:
This paper addresses the problem of estimating the voice source directly from speech waveforms. A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase. This technique is compared to two other state-of-the-art well-known methods, namely the Zeros of the Z-Transform (ZZT) and the Iterative Adaptive Inverse Filtering (IAIF) algorithms. Decompositi…
▽ More
This paper addresses the problem of estimating the voice source directly from speech waveforms. A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase. This technique is compared to two other state-of-the-art well-known methods, namely the Zeros of the Z-Transform (ZZT) and the Iterative Adaptive Inverse Filtering (IAIF) algorithms. Decomposition quality is assessed on synthetic signals through two objective measures: the spectral distortion and a glottal formant determination rate. Technique robustness is tested by analyzing the influence of noise and Glottal Closure Instant (GCI) location errors. Besides impacts of the fundamental frequency and the first formant on the performance are evaluated. Our proposed approach shows significant improvement in robustness, which could be of a great interest when decomposing real speech.
△ Less
Submitted 24 May, 2020;
originally announced May 2020.
-
Oscillating Statistical Moments for Speech Polarity Detection
Authors:
Thomas Drugman,
Thierry Dutoit
Abstract:
An inversion of the speech polarity may have a dramatic detrimental effect on the performance of various techniques of speech processing. An automatic method for determining the speech polarity (which is dependent upon the recording setup) is thus required as a preliminary step for ensuring the well-behaviour of such techniques. This paper proposes a new approach of polarity detection relying on o…
▽ More
An inversion of the speech polarity may have a dramatic detrimental effect on the performance of various techniques of speech processing. An automatic method for determining the speech polarity (which is dependent upon the recording setup) is thus required as a preliminary step for ensuring the well-behaviour of such techniques. This paper proposes a new approach of polarity detection relying on oscillating statistical moments. These moments have the property to oscillate at the local fundamental frequency and to exhibit a phase shift which depends on the speech polarity. This dependency stems from the introduction of non-linearity or higher-order statistics in the moment calculation. The resulting method is shown on 10 speech corpora to provide a substantial improvement compared to state-of-the-art techniques.
△ Less
Submitted 16 May, 2020;
originally announced May 2020.
-
Glottal Source Estimation using an Automatic Chirp Decomposition
Authors:
Thomas Drugman,
Baris Bozkurt,
Thierry Dutoit
Abstract:
In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT). Decomposition was achieved by separating the roots inside (causal contribution) and outside (anticausal contribution) the unit circle. In order to guarantee a correct deconvolution, time alignment on the Glottal Closure Instants (GCIs) was shown to be essentia…
▽ More
In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT). Decomposition was achieved by separating the roots inside (causal contribution) and outside (anticausal contribution) the unit circle. In order to guarantee a correct deconvolution, time alignment on the Glottal Closure Instants (GCIs) was shown to be essential. This paper extends the formalism of ZZT by evaluating the Z-transform on a contour possibly different from the unit circle. A method is proposed for determining automatically this contour by inspecting the root distribution. The derived Zeros of the Chirp Z-Transform (ZCZT)-based technique turns out to be much more robust to GCI location errors.
△ Less
Submitted 16 May, 2020;
originally announced May 2020.
-
Audio and Contact Microphones for Cough Detection
Authors:
Thomas Drugman,
Jerome Urbain,
Nathalie Bauwens,
Ricardo Chessini,
Anne-Sophie Aubriot,
Patrick Lebecque,
Thierry Dutoit
Abstract:
In the framework of assessing the pathology severity in chronic cough diseases, medical literature underlines the lack of tools for allowing the automatic, objective and reliable detection of cough events. This paper describes a system based on two microphones which we developed for this purpose. The proposed approach relies on a large variety of audio descriptors, an efficient algorithm of featur…
▽ More
In the framework of assessing the pathology severity in chronic cough diseases, medical literature underlines the lack of tools for allowing the automatic, objective and reliable detection of cough events. This paper describes a system based on two microphones which we developed for this purpose. The proposed approach relies on a large variety of audio descriptors, an efficient algorithm of feature selection based on their mutual information and the use of artificial neural networks. First, the possible use of a contact microphone (placed on the patient's thorax or trachea) in complement to the audio signal is investigated. This study underlines that this contact microphone suffers from reliability issues, and conveys little new relevant information compared to the audio modality. Secondly, the proposed audio-only approach is compared to a commercially available system using four sensors on a database with different sound categories often misdetected as coughs, and produced in various conditions. With average sensitivity and specificity of 94.7% and 95% respectively, the proposed method achieves better cough detection performance than the commercial system.
△ Less
Submitted 10 May, 2020;
originally announced May 2020.
-
Chirp Complex Cepstrum-based Decomposition for Asynchronous Glottal Analysis
Authors:
Thomas Drugman,
Thierry Dutoit
Abstract:
It was recently shown that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of speech. In order to guarantee a correct estimation, some constraints on the window have been derived. Among these, the window has to be synchronized on a Glottal Closure Instant. This paper proposes an extension of the complex cepstrum-based decompos…
▽ More
It was recently shown that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of speech. In order to guarantee a correct estimation, some constraints on the window have been derived. Among these, the window has to be synchronized on a Glottal Closure Instant. This paper proposes an extension of the complex cepstrum-based decomposition by incorporating a chirp analysis. The resulting method is shown to give a reliable estimation of the glottal flow wherever the window is located. This technique is then suited for its integration in usual speech processing systems, which generally operate in an asynchronous way. Besides its potential for automatic voice quality analysis is highlighted.
△ Less
Submitted 10 May, 2020;
originally announced May 2020.
-
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
Authors:
Sri Karlapati,
Alexis Moinet,
Arnaud Joly,
Viacheslav Klimkov,
Daniel Sáez-Trigueros,
Thomas Drugman
Abstract:
Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained…
▽ More
Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained PT suffer from source speaker leakage, where the synthesised speech has the voice identity of the source speaker as opposed to the target speaker. In order to mitigate this issue, they compromise on the quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT system that is robust to source speaker leakage, without using parallel data. We achieve this through a novel reference encoder architecture capable of capturing temporal prosodic representations which are robust to source speaker leakage. We compare CopyCat against a state-of-the-art fine-grained PT model through various subjective evaluations, where we show a relative improvement of $47\%$ in the quality of prosody transfer and $14\%$ in preserving the target speaker identity, while still maintaining the same naturalness.
△ Less
Submitted 30 April, 2020;
originally announced April 2020.
-
The Deterministic plus Stochastic Model of the Residual Signal and its Applications
Authors:
Thomas Drugman,
Thierry Dutoit
Abstract:
The modeling of speech production often relies on a source-filter approach. Although methods parameterizing the filter have nowadays reached a certain maturity, there is still a lot to be gained for several speech processing applications in finding an appropriate excitation model. This manuscript presents a Deterministic plus Stochastic Model (DSM) of the residual signal. The DSM consists of two c…
▽ More
The modeling of speech production often relies on a source-filter approach. Although methods parameterizing the filter have nowadays reached a certain maturity, there is still a lot to be gained for several speech processing applications in finding an appropriate excitation model. This manuscript presents a Deterministic plus Stochastic Model (DSM) of the residual signal. The DSM consists of two contributions acting in two distinct spectral bands delimited by a maximum voiced frequency. Both components are extracted from an analysis performed on a speaker-dependent dataset of pitch-synchronous residual frames. The deterministic part models the low-frequency contents and arises from an orthonormal decomposition of these frames. As for the stochastic component, it is a high-frequency noise modulated both in time and frequency. Some interesting phonetic and computational properties of the DSM are also highlighted. The applicability of the DSM in two fields of speech processing is then studied. First, it is shown that incorporating the DSM vocoder in HMM-based speech synthesis enhances the delivered quality. The proposed approach turns out to significantly outperform the traditional pulse excitation and provides a quality equivalent to STRAIGHT. In a second application, the potential of glottal signatures derived from the proposed DSM is investigated for speaker identification purpose. Interestingly, these signatures are shown to lead to better recognition rates than other glottal-based methods.
△ Less
Submitted 29 December, 2019;
originally announced January 2020.
-
A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis
Authors:
Thomas Drugman,
Geoffrey Wilfart,
Thierry Dutoit
Abstract:
Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. In order to alleviate this problem, a more suited modeling of the excitation should be adopted. For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two…
▽ More
Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. In order to alleviate this problem, a more suited modeling of the excitation should be adopted. For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two distinct spectral bands delimited by the maximum voiced frequency. The deterministic part concerns the low-frequency contents and consists of a decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. The stochastic component is a high-pass filtered noise whose time structure is modulated by an energy-envelope, similarly to what is done in the Harmonic plus Noise Model (HNM). The proposed residual model is integrated within a HMM-based speech synthesizer and is compared to the traditional excitation through a subjective test. Results show a significative improvement for both male and female voices. In addition the proposed model requires few computational load and memory, which is essential for its integration in commercial applications.
△ Less
Submitted 29 December, 2019;
originally announced January 2020.
-
Glottal Closure and Opening Instant Detection from Speech Signals
Authors:
Thomas Drugman,
Thierry Dutoit
Abstract:
This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Secondly, at each interval a precise position of the speech event is assigned by locating a discont…
▽ More
This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Secondly, at each interval a precise position of the speech event is assigned by locating a discontinuity in the Linear Prediction residual. The proposed method is compared to the DYPSA algorithm on the CMU ARCTIC database. A significant improvement as well as a better noise robustness are reported. Besides, results of GOI identification accuracy are promising for the glottal source characterization.
△ Less
Submitted 28 December, 2019;
originally announced January 2020.
-
A Comparative Study of Glottal Source Estimation Techniques
Authors:
Thomas Drugman,
Baris Bozkurt,
Thierry Dutoit
Abstract:
Source-tract decomposition (or glottal flow estimation) is one of the basic problems of speech processing. For this, several techniques have been proposed in the literature. However studies comparing different approaches are almost nonexistent. Besides, experiments have been systematically performed either on synthetic speech or on sustained vowels. In this study we compare three of the main repre…
▽ More
Source-tract decomposition (or glottal flow estimation) is one of the basic problems of speech processing. For this, several techniques have been proposed in the literature. However studies comparing different approaches are almost nonexistent. Besides, experiments have been systematically performed either on synthetic speech or on sustained vowels. In this study we compare three of the main representative state-of-the-art methods of glottal flow estimation: closed-phase inverse filtering, iterative and adaptive inverse filtering, and mixed-phase decomposition. These techniques are first submitted to an objective assessment test on synthetic speech signals. Their sensitivity to various factors affecting the estimation quality, as well as their robustness to noise are studied. In a second experiment, their ability to label voice quality (tensed, modal, soft) is studied on a large corpus of real connected speech. It is shown that changes of voice quality are reflected by significant modifications in glottal feature distributions. Techniques based on the mixed-phase decomposition and on a closed-phase inverse filtering process turn out to give the best results on both clean synthetic and real speech signals. On the other hand, iterative and adaptive inverse filtering is recommended in noisy environments for its high robustness.
△ Less
Submitted 28 December, 2019;
originally announced January 2020.
-
On the Mutual Information between Source and Filter Contributions for Voice Pathology Detection
Authors:
Thomas Drugman,
Thomas Dubuisson,
Thierry Dutoit
Abstract:
This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevancy of these features is assessed through mutual…
▽ More
This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevancy of these features is assessed through mutual information-based measures. This allows an intuitive interpretation in terms of discrimation power and redundancy between the features, independently of any subsequent classifier. It is discussed which characteristics are interestingly informative or complementary for detecting voice pathologies.
△ Less
Submitted 2 January, 2020;
originally announced January 2020.
-
Excitation-based Voice Quality Analysis and Modification
Authors:
Thomas Drugman,
Thierry Dutoit,
Baris Bozkurt
Abstract:
This paper investigates the differences occuring in the excitation for different voice qualities. Its goal is two-fold. First a large corpus containing three voice qualities (modal, soft and loud) uttered by the same speaker is analyzed and significant differences in characteristics extracted from the excitation are observed. Secondly rules of modification derived from the analysis are used to bui…
▽ More
This paper investigates the differences occuring in the excitation for different voice qualities. Its goal is two-fold. First a large corpus containing three voice qualities (modal, soft and loud) uttered by the same speaker is analyzed and significant differences in characteristics extracted from the excitation are observed. Secondly rules of modification derived from the analysis are used to build a voice quality transformation system applied as a post-process to HMM-based speech synthesis. The system is shown to effectively achieve the transformations while maintaining the delivered quality.
△ Less
Submitted 2 January, 2020;
originally announced January 2020.
-
Eigenresiduals for improved Parametric Speech Synthesis
Authors:
Thomas Drugman,
Geoffrey Wilfart,
Thierry Dutoit
Abstract:
Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. This model is based on the decomposition of pitch-synchronous residual frames…
▽ More
Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. This model is based on the decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. This basis contains a limited number of eigenresiduals and is computed on a relatively small speech database. A stream of PCA-based coefficients is added to our HMM-based synthesizer and allows to generate the voiced excitation during the synthesis. An improvement compared to the traditional excitation is reported while the synthesis engine footprint remains under about 1Mb.
△ Less
Submitted 2 January, 2020;
originally announced January 2020.
-
Assessment of Audio Features for Automatic Cough Detection
Authors:
Thomas Drugman,
Jerome Urbain,
Thierry Dutoit
Abstract:
This paper addresses the issue of cough detection using only audio recordings, with the ultimate goal of quantifying and qualifying the degree of pathology for patients suffering from respiratory diseases, notably mucoviscidosis. A large set of audio features describing various aspects of the audio signal is proposed. These features are assessed in two steps. First, their intrisic potential and re…
▽ More
This paper addresses the issue of cough detection using only audio recordings, with the ultimate goal of quantifying and qualifying the degree of pathology for patients suffering from respiratory diseases, notably mucoviscidosis. A large set of audio features describing various aspects of the audio signal is proposed. These features are assessed in two steps. First, their intrisic potential and redundancy are evaluated using mutual information-based measures. Secondly, their efficiency is confirmed relying on three classifiers: Artificial Neural Network, Gaussian Mixture Model and Support Vector Machine. The influence of both the feature dimension and the classifier complexity are also investigated.
△ Less
Submitted 2 January, 2020;
originally announced January 2020.
-
A Comparative Evaluation of Pitch Modification Techniques
Authors:
Thomas Drugman,
Thierry Dutoit
Abstract:
This paper addresses the problem of pitch modification, as an important module for an efficient voice transformation system. The Deterministic plus Stochastic Model of the residual signal we proposed in a previous work is compared to TDPSOLA, HNM and STRAIGHT. The four methods are compared through an important subjective test. The influence of the speaker gender and of the pitch modification ratio…
▽ More
This paper addresses the problem of pitch modification, as an important module for an efficient voice transformation system. The Deterministic plus Stochastic Model of the residual signal we proposed in a previous work is compared to TDPSOLA, HNM and STRAIGHT. The four methods are compared through an important subjective test. The influence of the speaker gender and of the pitch modification ratio is analyzed. Despite its higher compression level, the DSM technique is shown to give similar or better results than other methods, especially for male speakers and important ratios of modification. The DSM turns out to be only outperformed by STRAIGHT for female voices.
△ Less
Submitted 2 January, 2020;
originally announced January 2020.
-
Objective Study of Sensor Relevance for Automatic Cough Detection
Authors:
Thomas Drugman,
Jerome Urbain,
Nathalie Bauwens,
Ricardo Chessini,
Carlos Valderrama,
Patrick Lebecque,
Thierry Dutoit
Abstract:
The development of a system for the automatic, objective and reliable detection of cough events is a need underlined by the medical literature for years. The benefit of such a tool is clear as it would allow the assessment of pathology severity in chronic cough diseases. Even though some approaches have recently reported solutions achieving this task with a relative success, there is still no stan…
▽ More
The development of a system for the automatic, objective and reliable detection of cough events is a need underlined by the medical literature for years. The benefit of such a tool is clear as it would allow the assessment of pathology severity in chronic cough diseases. Even though some approaches have recently reported solutions achieving this task with a relative success, there is still no standardization about the method to adopt or the sensors to use. The goal of this paper is to study objectively the performance of several sensors for cough detection: ECG, thermistor, chest belt, accelerometer, contact and audio microphones. Experiments are carried out on a database of 32 healthy subjects producing, in a confined room and in three situations, voluntary cough at various volumes as well as other event categories which can possibly lead to some detection errors: background noise, forced expiration, throat clearing, speech and laugh. The relevance of each sensor is evaluated at three stages: mutual information conveyed by the features, ability to discriminate at the frame level cough from these latter other sources of ambiguity, and ability to detect cough events. In this latter experiment, with both an averaged sensitivity and specificity of about 94.5%, the proposed approach is shown to clearly outperform the commercial Karmelsonix system which achieved a specificity of 95.3% and a sensitivity of 64.9%.
△ Less
Submitted 30 December, 2019;
originally announced January 2020.
-
Detection of Glottal Closure Instants from Speech Signals: a Quantitative Review
Authors:
Thomas Drugman,
Mark Thomas,
Jon Gudnason,
Patrick Naylor,
Thierry Dutoit
Abstract:
The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires however that the precise locations of the Glottal Closure Instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six dif…
▽ More
The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires however that the precise locations of the Glottal Closure Instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers. The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA). The efficacy of these methods is first evaluated on clean speech, both in terms of reliabililty and accuracy. Their robustness to additive noise and to reverberation is also assessed. A further contribution of the paper is the evaluation of their performance on a concrete application of speech processing: the causal-anticausal decomposition of speech. It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy. ZFR and SEDREAMS also show a superior robustness to additive noise and reverberation.
△ Less
Submitted 28 December, 2019;
originally announced January 2020.
-
Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics
Authors:
Thomas Drugman,
Abeer Alwan
Abstract:
This paper focuses on the problem of pitch tracking in noisy conditions. A method using harmonic information in the residual signal is presented. The proposed criterion is used both for pitch estimation, as well as for determining the voicing segments of speech. In the experiments, the method is compared to six state-of-the-art pitch trackers on the Keele and CSTR databases. The proposed technique…
▽ More
This paper focuses on the problem of pitch tracking in noisy conditions. A method using harmonic information in the residual signal is presented. The proposed criterion is used both for pitch estimation, as well as for determining the voicing segments of speech. In the experiments, the method is compared to six state-of-the-art pitch trackers on the Keele and CSTR databases. The proposed technique is shown to be particularly robust to additive noise, leading to a significant improvement in adverse conditions.
△ Less
Submitted 28 December, 2019;
originally announced January 2020.
-
Phase-based Information for Voice Pathology Detection
Authors:
Thomas Drugman,
Thomas Dubuisson,
Thierry Dutoit
Abstract:
In most current approaches of speech processing, information is extracted from the magnitude spectrum. However recent perceptual studies have underlined the importance of the phase component. The goal of this paper is to investigate the potential of using phase-based features for automatically detecting voice disorders. It is shown that group delay functions are appropriate for characterizing irre…
▽ More
In most current approaches of speech processing, information is extracted from the magnitude spectrum. However recent perceptual studies have underlined the importance of the phase component. The goal of this paper is to investigate the potential of using phase-based features for automatically detecting voice disorders. It is shown that group delay functions are appropriate for characterizing irregularities in the phonation. Besides the respect of the mixed-phase model of speech is discussed. The proposed phase-based features are evaluated and compared to other parameters derived from the magnitude spectrum. Both streams are shown to be interestingly complementary. Furthermore phase-based features turn out to convey a great amount of relevant information, leading to high discrimination performance.
△ Less
Submitted 2 January, 2020;
originally announced January 2020.
-
Using a Pitch-Synchronous Residual Codebook for Hybrid HMM/Frame Selection Speech Synthesis
Authors:
Thomas Drugman,
Alexis Moinet,
Thierry Dutoit,
Geoffrey Wilfart
Abstract:
This paper proposes a method to improve the quality delivered by statistical parametric speech synthesizers. For this, we use a codebook of pitch-synchronous residual frames, so as to construct a more realistic source signal. First a limited codebook of typical excitations is built from some training database. During the synthesis part, HMMs are used to generate filter and source coefficients. The…
▽ More
This paper proposes a method to improve the quality delivered by statistical parametric speech synthesizers. For this, we use a codebook of pitch-synchronous residual frames, so as to construct a more realistic source signal. First a limited codebook of typical excitations is built from some training database. During the synthesis part, HMMs are used to generate filter and source coefficients. The latter coefficients contain both the pitch and a compact representation of target residual frames. The source signal is obtained by concatenating excitation frames picked up from the codebook, based on a selection criterion and taking target residual coefficients as input. Subjective results show a relevant improvement compared to the basic technique.
△ Less
Submitted 30 December, 2019;
originally announced December 2019.
-
Causal-Anticausal Decomposition of Speech using Complex Cepstrum for Glottal Source Estimation
Authors:
Thomas Drugman,
Baris Bozkurt,
Thierry Dutoit
Abstract:
Complex cepstrum is known in the literature for linearly separating causal and anticausal components. Relying on advances achieved by the Zeros of the Z-Transform (ZZT) technique, we here investigate the possibility of using complex cepstrum for glottal flow estimation on a large-scale database. Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex…
▽ More
Complex cepstrum is known in the literature for linearly separating causal and anticausal components. Relying on advances achieved by the Zeros of the Z-Transform (ZZT) technique, we here investigate the possibility of using complex cepstrum for glottal flow estimation on a large-scale database. Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex cepstrum causal-anticausal decomposition can be effectively used for glottal flow estimation when specific windowing criteria are met. It is also shown that this complex cepstral decomposition gives similar glottal estimates as obtained with the ZZT method. However, as complex cepstrum uses FFT operations instead of requiring the factoring of high-degree polynomials, the method benefits from a much higher speed. Finally in our tests on a large corpus of real expressive speech, we show that the proposed method has the potential to be used for voice quality analysis.
△ Less
Submitted 30 December, 2019;
originally announced December 2019.
-
A Comparative Study of Pitch Extraction Algorithms on a Large Variety of Singing Sounds
Authors:
Onur Babacan,
Thomas Drugman,
Nicolas d'Alessandro,
Nathalie Henrich,
Thierry Dutoit
Abstract:
The problem of pitch tracking has been extensively studied in the speech research community. The goal of this paper is to investigate how these techniques should be adapted to singing voice analysis, and to provide a comparative evaluation of the most representative state-of-the-art approaches. This study is carried out on a large database of annotated singing sounds with aligned EGG recordings, c…
▽ More
The problem of pitch tracking has been extensively studied in the speech research community. The goal of this paper is to investigate how these techniques should be adapted to singing voice analysis, and to provide a comparative evaluation of the most representative state-of-the-art approaches. This study is carried out on a large database of annotated singing sounds with aligned EGG recordings, comprising a variety of singer categories and singing exercises. The algorithmic performance is assessed according to the ability to detect voicing boundaries and to accurately estimate pitch contour. First, we evaluate the usefulness of adapting existing methods to singing voice analysis. Then we compare the accuracy of several pitch-extraction algorithms, depending on singer category and laryngeal mechanism. Finally, we analyze their robustness to reverberation.
△ Less
Submitted 29 December, 2019;
originally announced December 2019.
-
Glottal Source Processing: from Analysis to Applications
Authors:
Thomas Drugman,
Paavo Alku,
Abeer Alwan,
Bayya Yegnanarayana
Abstract:
The great majority of current voice technology applications relies on acoustic features characterizing the vocal tract response, such as the widely used MFCC of LPC parameters. Nonetheless, the airflow passing through the vocal folds, and called glottal flow, is expected to exhibit a relevant complementarity. Unfortunately, glottal analysis from speech recordings requires specific and more complex…
▽ More
The great majority of current voice technology applications relies on acoustic features characterizing the vocal tract response, such as the widely used MFCC of LPC parameters. Nonetheless, the airflow passing through the vocal folds, and called glottal flow, is expected to exhibit a relevant complementarity. Unfortunately, glottal analysis from speech recordings requires specific and more complex processing operations, which explains why it has been generally avoided. This review gives a general overview of techniques which have been designed for glottal source processing. Starting from fundamental analysis tools of pitch tracking, glottal closure instant detection, glottal flow estimation and modelling, this paper then highlights how these solutions can be properly integrated within various voice technology applications.
△ Less
Submitted 29 December, 2019;
originally announced December 2019.
-
Complex Cepstrum-based Decomposition of Speech for Glottal Source Estimation
Authors:
Thomas Drugman,
Baris Bozkurt,
Thierry Dutoit
Abstract:
Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of complex cepstrum for source-tract deconvolution has been discussed in various articles. However there exists no study which proposes a glottal flow estimation methodology based on cepstrum and reports effective results. In this paper, we show that complex cepstrum can be e…
▽ More
Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of complex cepstrum for source-tract deconvolution has been discussed in various articles. However there exists no study which proposes a glottal flow estimation methodology based on cepstrum and reports effective results. In this paper, we show that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of a windowed speech signal as done by the Zeros of the Z-Transform (ZZT) decomposition. Based on exactly the same principles presented for ZZT decomposition, windowing should be applied such that the windowed speech signals exhibit mixed-phase characteristics which conform the speech production model that the anticausal component is mainly due to the glottal flow open phase. The advantage of the complex cepstrum-based approach compared to the ZZT decomposition is its much higher speed.
△ Less
Submitted 29 December, 2019;
originally announced December 2019.
-
Singing Synthesis: with a little help from my attention
Authors:
Orazio Angelini,
Alexis Moinet,
Kayoko Yanagisawa,
Thomas Drugman
Abstract:
We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions. These two classes of models have significantly affected the field of text-to-speech, but have never been thoroughly applied to the task of singing synthesis. UTACO demonstrates that attention can be successfully applied to the singing synthesis…
▽ More
We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions. These two classes of models have significantly affected the field of text-to-speech, but have never been thoroughly applied to the task of singing synthesis. UTACO demonstrates that attention can be successfully applied to the singing synthesis field and improves naturalness over the state of the art. The system requires considerably less explicit modelling of voice features such as F0 patterns, vibratos, and note and phoneme durations, than previous models in the literature. Despite this, it shows a strong improvement in naturalness with respect to previous neural singing synthesis models. The model does not require any durations or pitch patterns as inputs, and learns to insert vibrato autonomously according to the musical context. However, we observe that, by completely dispensing with any explicit duration modelling it becomes harder to obtain the fine control of timing needed to exactly match the tempo of a song.
△ Less
Submitted 6 May, 2020; v1 submitted 12 December, 2019;
originally announced December 2019.
-
Voice Conversion for Whispered Speech Synthesis
Authors:
Marius Cotescu,
Thomas Drugman,
Goeric Huybrechts,
Jaime Lorenzo-Trueba,
Alexis Moinet
Abstract:
We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the map** between acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speak…
▽ More
We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the map** between acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speaker similarity of the converted whisper on an internal corpus and on the publicly available wTIMIT corpus. We show that applying VC techniques is significantly better than using rule-based signal processing methods and it achieves results that are indistinguishable from copy-synthesis of natural whisper recordings. We investigate the ability of the DNN model to generalize on unseen speakers, when trained with data from multiple speakers. We show that excluding the target speaker from the training set has little or no impact on the perceived naturalness and speaker similarity of the converted whisper. The proposed DNN method is used in the newly released Whisper Mode of Amazon Alexa.
△ Less
Submitted 17 January, 2020; v1 submitted 11 December, 2019;
originally announced December 2019.