Showing 1–50 of 59 results for author: Drugman, T

  1. arXiv:2402.08093  [pdf, other]

    cs.LG cs.CL eess.AS

    BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

    Authors: Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

    Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts ra…

    Submitted 15 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: v1.1 (fixed typos)

  2. arXiv:2309.01576  [pdf, other]

    cs.CL cs.SD eess.AS

    A Comparative Analysis of Pretrained Language Models for Text-to-Speech

    Authors: Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman

    Abstract: State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS t…

    Submitted 4 September, 2023; originally announced September 2023.

    Comments: Accepted for presentation at the 12th ISCA Speech Synthesis Workshop (SSW) in Grenoble, France, from 26th to 28th August 2023

  3. arXiv:2307.07062  [pdf, other]

    eess.AS cs.LG cs.SD

    Controllable Emphasis with zero data for text-to-speech

    Authors: Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena Sokolova

    Abstract: We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques im…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: In Proceedings of the 12th Speech Synthesis Workshop (SSW) 2023

  4. arXiv:2306.11327  [pdf, other]

    eess.AS cs.SD

    eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer

    Authors: Ammar Abbas, Sri Karlapati, Bastian Schnell, Penny Karanasou, Marcel Granero Moya, Amith Nagaraj, Ayman Boustati, Nicole Peinelt, Alexis Moinet, Thomas Drugman

    Abstract: We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion…

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2023

  5. Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need

    Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Bozena Kostek

    Abstract: The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect…

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: Published in Speech Communication Journal

  6. arXiv:2206.14643  [pdf, other]

    eess.AS cs.CL

    Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

    Authors: Peter Makarov, Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou

    Abstract: Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on m…

    Submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2022

  7. arXiv:2206.14165  [pdf, other]

    eess.AS cs.SD

    Expressive, Variable, and Controllable Duration Modelling in TTS

    Authors: Ammar Abbas, Thomas Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman

    Abstract: Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. The current approaches largely fall back to relying on previous statistical parametric speech synthesis technology for duration prediction, which poorly models the expressiveness and variability in speech. In this paper, we propose two alternate approaches to improve…

    Submitted 28 June, 2022; originally announced June 2022.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2022

  8. arXiv:2206.13443  [pdf, other]

    eess.AS cs.SD

    CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

    Authors: Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman

    Abstract: In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel appro…

    Submitted 27 June, 2022; originally announced June 2022.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2022

  9. arXiv:2202.06409  [pdf, other]

    eess.AS cs.CL cs.LG

    Distribution augmentation for low-resource expressive text-to-speech

    Authors: Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova

    Abstract: This paper presents a novel data augmentation technique for text-to-speech (TTS) that makes it possible to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a w…

    Submitted 19 February, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

    Comments: ICASSP 2022: camera-ready

  10. arXiv:2106.15649  [pdf, other]

    eess.AS cs.LG cs.SD

    Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

    Authors: Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati, Thomas Drugman

    Abstract: We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale me…

    Submitted 29 June, 2021; originally announced June 2021.

    Comments: Accepted for the 11th ISCA Speech Synthesis Workshop (SSW11)

  11. arXiv:2106.10229  [pdf, other]

    eess.AS cs.LG cs.SD

    A learned conditional prior for the VAE acoustic space of a TTS system

    Authors: Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo Trueba, Thomas Drugman

    Abstract: Many factors influence speech, yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for…

    Submitted 14 June, 2021; originally announced June 2021.

    Comments: in Proceedings of Interspeech 2021

  12. arXiv:2106.08873  [pdf, other]

    cs.SD cs.LG eess.AS

    Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

    Authors: Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati, Thomas Drugman

    Abstract: Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and evaluated on clean speech recordings. However, many acoustic environments are noisy and reverberant, severely restricting the applicability of popular VC method…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Presented at the Speech Synthesis Workshops 2021 (SSW11)

  13. arXiv:2106.03494  [pdf, other]

    eess.AS cs.LG

    Weakly-supervised word-level pronunciation error detection in non-native English speech

    Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek

    Abstract: We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words. The lack of phonetic transcriptions for L2 speech means that the model has to learn only from a weak signal of word-level mispronunciations. Because of that and due to…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  14. arXiv:2101.06396  [pdf, other]

    eess.AS cs.LG cs.SD

    Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling

    Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira Calamaro, Thomas Drugman, Bozena Kostek

    Abstract: A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, b) there is a single correct way for a sentence to be pronounced. These assumptions do…

    Submitted 8 February, 2021; v1 submitted 16 January, 2021; originally announced January 2021.

    Comments: Accepted to ICASSP 2021

  15. arXiv:2101.05695  [pdf, other]

    eess.AS cs.SD

    EmoCat: Language-agnostic Emotional Voice Conversion

    Authors: Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman, Jaime Lorenzo-Trueba

    Abstract: Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data hungry than text-to-speech models and allow generating large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with les…

    Submitted 14 January, 2021; originally announced January 2021.

    Comments: Submitted to IEEE ICASSP 2021

  16. arXiv:2012.14788  [pdf, other]

    eess.AS cs.SD

    Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

    Authors: Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek

    Abstract: This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learni…

    Submitted 7 June, 2021; v1 submitted 29 December, 2020; originally announced December 2020.

    Comments: Accepted to Interspeech 2021

  17. arXiv:2011.02252  [pdf, other]

    eess.AS cs.CL cs.SD

    Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

    Authors: Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman

    Abstract: In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information ava…

    Submitted 4 November, 2020; originally announced November 2020.

    Comments: 5 pages and 3 figures

  18. arXiv:2011.01175  [pdf, other]

    eess.AS

    CAMP: a Two-Stage Approach to Modelling Prosody in Context

    Authors: Zack Hodari, Alexis Moinet, Sri Karlapati, Jaime Lorenzo-Trueba, Thomas Merritt, Arnaud Joly, Ammar Abbas, Penny Karanasou, Thomas Drugman

    Abstract: Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In th…

    Submitted 12 February, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: 5 pages. Published in the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)

  19. arXiv:2006.04142  [pdf, other]

    eess.AS cs.CL cs.SD

    Parametric Representation for Singing Voice Synthesis: a Comparative Evaluation

    Authors: Onur Babacan, Thomas Drugman, Tuomo Raitio, Daniel Erro, Thierry Dutoit

    Abstract: Various parametric representations have been proposed to model the speech signal. While the performance of such vocoders is well-known in the context of speech processing, their extrapolation to singing voice synthesis might not be straightforward. The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical par…

    Submitted 7 June, 2020; originally announced June 2020.

  20. arXiv:2006.04138  [pdf, other]

    eess.AS cs.CL cs.SD

    Maximum Phase Modeling for Sparse Linear Prediction of Speech

    Authors: Thomas Drugman

    Abstract: Linear prediction (LP) is a ubiquitous analysis method in speech processing. Various studies have focused on sparse LP algorithms by introducing sparsity constraints into the LP framework. Sparse LP has been shown to be effective in several issues related to speech modeling and coding. However, all existing approaches assume the speech signal to be minimum-phase. Because speech is known to be mix…

    Submitted 7 June, 2020; originally announced June 2020.

  21. arXiv:2006.04136  [pdf, ps, other]

    eess.AS cs.CL

    Analysis and Synthesis of Hypo and Hyperarticulated Speech

    Authors: Benjamin Picart, Thomas Drugman, Thierry Dutoit

    Abstract: This paper focuses on the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis. First of all, a new French database matching our needs was created, which contains three identical sets, pronounced with three different degrees of articulation: neutral, hypo and hyperarticulated speech. On that basis, acoustic and phonetic analyses were performed.…

    Submitted 7 June, 2020; originally announced June 2020.

  22. arXiv:2006.00525  [pdf, other]

    eess.AS cs.CL cs.SD

    Residual Excitation Skewness for Automatic Speech Polarity Detection

    Authors: Thomas Drugman

    Abstract: Detecting the correct speech polarity is a necessary step prior to several speech processing techniques. An error in its determination could have a dramatic detrimental impact on their performance. As current systems have to deal with increasing amounts of data stemming from multiple devices, the automatic detection of speech polarity has become a crucial problem. For this purpose, we here propose…

    Submitted 31 May, 2020; originally announced June 2020.

  23. arXiv:2006.00521  [pdf, other]

    eess.AS cs.CL cs.SD

    Maximum Voiced Frequency Estimation: Exploiting Amplitude and Phase Spectra

    Authors: Thomas Drugman, Yannis Stylianou

    Abstract: Maximum Voiced Frequency (MVF) is used in various speech models as the spectral boundary separating periodic and aperiodic components during the production of voiced sounds. Recent studies have shown that its proper estimation and modeling enhance the quality of statistical parametric speech synthesizers. Contrastingly, these same methods of MVF estimation have been reported to degrade the perform…

    Submitted 31 May, 2020; originally announced June 2020.

  24. arXiv:2006.00518  [pdf, other]

    eess.AS cs.CL cs.SD

    Data-driven Detection and Analysis of the Patterns of Creaky Voice

    Authors: Thomas Drugman, John Kane, Christer Gobl

    Abstract: This paper investigates the temporal excitation patterns of creaky voice. Creaky voice is a voice quality frequently used as a phrase-boundary marker, but also as a means of portraying attitude, affective states and even social status. Consequently, the automatic detection and modelling of creaky voice may have implications for speech technology applications. The acoustic characteristics of creaky…

    Submitted 31 May, 2020; originally announced June 2020.

  25. arXiv:2005.11682  [pdf, other]

    eess.AS cs.CL cs.SD

    Glottal source estimation robustness: A comparison of sensitivity of voice source estimation techniques

    Authors: Thomas Drugman, Thomas Dubuisson, Alexis Moinet, Nicolas D'Alessandro, Thierry Dutoit

    Abstract: This paper addresses the problem of estimating the voice source directly from speech waveforms. A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase. This technique is compared to two other state-of-the-art well-known methods, namely the Zeros of the Z-Transform (ZZT) and the Iterative Adaptive Inverse Filtering (IAIF) algorithms. Decompositi…

    Submitted 24 May, 2020; originally announced May 2020.

  26. arXiv:2005.07901  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Oscillating Statistical Moments for Speech Polarity Detection

    Authors: Thomas Drugman, Thierry Dutoit

    Abstract: An inversion of the speech polarity may have a dramatic detrimental effect on the performance of various techniques of speech processing. An automatic method for determining the speech polarity (which is dependent upon the recording setup) is thus required as a preliminary step to ensure that such techniques behave well. This paper proposes a new approach to polarity detection relying on o…

    Submitted 16 May, 2020; originally announced May 2020.

  27. arXiv:2005.07897  [pdf, other]

    cs.SD cs.CL eess.AS

    Glottal Source Estimation using an Automatic Chirp Decomposition

    Authors: Thomas Drugman, Baris Bozkurt, Thierry Dutoit

    Abstract: In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT). Decomposition was achieved by separating the roots inside (causal contribution) and outside (anticausal contribution) the unit circle. In order to guarantee a correct deconvolution, time alignment on the Glottal Closure Instants (GCIs) was shown to be essentia…

    Submitted 16 May, 2020; originally announced May 2020.

  28. arXiv:2005.05313  [pdf, other]

    eess.AS cs.SD

    Audio and Contact Microphones for Cough Detection

    Authors: Thomas Drugman, Jerome Urbain, Nathalie Bauwens, Ricardo Chessini, Anne-Sophie Aubriot, Patrick Lebecque, Thierry Dutoit

    Abstract: In the framework of assessing the pathology severity in chronic cough diseases, medical literature underlines the lack of tools allowing the automatic, objective and reliable detection of cough events. This paper describes a system based on two microphones which we developed for this purpose. The proposed approach relies on a large variety of audio descriptors, an efficient algorithm of featur…

    Submitted 10 May, 2020; originally announced May 2020.

    Comments: arXiv admin note: substantial text overlap with arXiv:2001.00537

  29. arXiv:2005.04724  [pdf, other]

    cs.SD cs.CL eess.AS

    Chirp Complex Cepstrum-based Decomposition for Asynchronous Glottal Analysis

    Authors: Thomas Drugman, Thierry Dutoit

    Abstract: It was recently shown that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of speech. In order to guarantee a correct estimation, some constraints on the window have been derived. Among these, the window has to be synchronized on a Glottal Closure Instant. This paper proposes an extension of the complex cepstrum-based decompos…

    Submitted 10 May, 2020; originally announced May 2020.

  30. CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

    Authors: Sri Karlapati, Alexis Moinet, Arnaud Joly, Viacheslav Klimkov, Daniel Sáez-Trigueros, Thomas Drugman

    Abstract: Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained…

    Submitted 30 April, 2020; originally announced April 2020.

    Journal ref: INTERSPEECH 2020: 4387-4391

  31. arXiv:2001.01000  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    The Deterministic plus Stochastic Model of the Residual Signal and its Applications

    Authors: Thomas Drugman, Thierry Dutoit

    Abstract: The modeling of speech production often relies on a source-filter approach. Although methods parameterizing the filter have nowadays reached a certain maturity, there is still a lot to be gained for several speech processing applications in finding an appropriate excitation model. This manuscript presents a Deterministic plus Stochastic Model (DSM) of the residual signal. The DSM consists of two c…

    Submitted 29 December, 2019; originally announced January 2020.

  32. arXiv:2001.00842  [pdf, other]

    cs.SD cs.CL eess.AS

    A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis

    Authors: Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit

    Abstract: Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. In order to alleviate this problem, a more suitable modeling of the excitation should be adopted. For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two…

    Submitted 29 December, 2019; originally announced January 2020.

  33. arXiv:2001.00841  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Glottal Closure and Opening Instant Detection from Speech Signals

    Authors: Thomas Drugman, Thierry Dutoit

    Abstract: This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First, a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Secondly, at each interval a precise position of the speech event is assigned by locating a discont…

    Submitted 28 December, 2019; originally announced January 2020.

  34. arXiv:2001.00840  [pdf, other]

    cs.SD cs.CL eess.AS

    A Comparative Study of Glottal Source Estimation Techniques

    Authors: Thomas Drugman, Baris Bozkurt, Thierry Dutoit

    Abstract: Source-tract decomposition (or glottal flow estimation) is one of the basic problems of speech processing. For this, several techniques have been proposed in the literature. However, studies comparing different approaches are almost nonexistent. Besides, experiments have been systematically performed either on synthetic speech or on sustained vowels. In this study we compare three of the main repre…

    Submitted 28 December, 2019; originally announced January 2020.

  35. arXiv:2001.00583  [pdf, other]

    cs.SD cs.CL eess.AS

    On the Mutual Information between Source and Filter Contributions for Voice Pathology Detection

    Authors: Thomas Drugman, Thomas Dubuisson, Thierry Dutoit

    Abstract: This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevancy of these features is assessed through mutual…

    Submitted 2 January, 2020; originally announced January 2020.

  36. arXiv:2001.00582  [pdf, other]

    cs.SD cs.CL eess.AS

    Excitation-based Voice Quality Analysis and Modification

    Authors: Thomas Drugman, Thierry Dutoit, Baris Bozkurt

    Abstract: This paper investigates the differences occurring in the excitation for different voice qualities. Its goal is two-fold. First, a large corpus containing three voice qualities (modal, soft and loud) uttered by the same speaker is analyzed and significant differences in characteristics extracted from the excitation are observed. Secondly, rules of modification derived from the analysis are used to bui…

    Submitted 2 January, 2020; originally announced January 2020.

  37. arXiv:2001.00581  [pdf, other]

    cs.SD cs.CL eess.AS

    Eigenresiduals for improved Parametric Speech Synthesis

    Authors: Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit

    Abstract: Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately, the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. This model is based on the decomposition of pitch-synchronous residual frames…

    Submitted 2 January, 2020; originally announced January 2020.

  38. arXiv:2001.00580  [pdf, ps, other]

    cs.SD cs.HC eess.AS

    Assessment of Audio Features for Automatic Cough Detection

    Authors: Thomas Drugman, Jerome Urbain, Thierry Dutoit

    Abstract: This paper addresses the issue of cough detection using only audio recordings, with the ultimate goal of quantifying and qualifying the degree of pathology for patients suffering from respiratory diseases, notably mucoviscidosis. A large set of audio features describing various aspects of the audio signal is proposed. These features are assessed in two steps. First, their intrinsic potential and re…

    Submitted 2 January, 2020; originally announced January 2020.

  39. arXiv:2001.00579  [pdf, other]

    cs.SD cs.CL eess.AS

    A Comparative Evaluation of Pitch Modification Techniques

    Authors: Thomas Drugman, Thierry Dutoit

    Abstract: This paper addresses the problem of pitch modification, as an important module for an efficient voice transformation system. The Deterministic plus Stochastic Model of the residual signal we proposed in a previous work is compared to TDPSOLA, HNM and STRAIGHT. The four methods are compared through an important subjective test. The influence of the speaker gender and of the pitch modification ratio…

    Submitted 2 January, 2020; originally announced January 2020.

  40. arXiv:2001.00537  [pdf, other]

    physics.med-ph cs.SD eess.AS

    Objective Study of Sensor Relevance for Automatic Cough Detection

    Authors: Thomas Drugman, Jerome Urbain, Nathalie Bauwens, Ricardo Chessini, Carlos Valderrama, Patrick Lebecque, Thierry Dutoit

    Abstract: The development of a system for the automatic, objective and reliable detection of cough events has been a need underlined by the medical literature for years. The benefit of such a tool is clear as it would allow the assessment of pathology severity in chronic cough diseases. Even though some approaches have recently reported solutions achieving this task with relative success, there is still no stan…

    Submitted 30 December, 2019; originally announced January 2020.

  41. arXiv:2001.00473  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Detection of Glottal Closure Instants from Speech Signals: a Quantitative Review

    Authors: Thomas Drugman, Mark Thomas, Jon Gudnason, Patrick Naylor, Thierry Dutoit

    Abstract: The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires, however, that the precise locations of the Glottal Closure Instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six dif…

    Submitted 28 December, 2019; originally announced January 2020.

  42. arXiv:2001.00459  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics

    Authors: Thomas Drugman, Abeer Alwan

    Abstract: This paper focuses on the problem of pitch tracking in noisy conditions. A method using harmonic information in the residual signal is presented. The proposed criterion is used both for pitch estimation and for determining the voicing segments of speech. In the experiments, the method is compared to six state-of-the-art pitch trackers on the Keele and CSTR databases. The proposed technique…

    Submitted 28 December, 2019; originally announced January 2020.

  43. arXiv:2001.00372  [pdf, other]

    cs.SD cs.CL eess.AS

    Phase-based Information for Voice Pathology Detection

    Authors: Thomas Drugman, Thomas Dubuisson, Thierry Dutoit

    Abstract: In most current approaches of speech processing, information is extracted from the magnitude spectrum. However recent perceptual studies have underlined the importance of the phase component. The goal of this paper is to investigate the potential of using phase-based features for automatically detecting voice disorders. It is shown that group delay functions are appropriate for characterizing irre…

    Submitted 2 January, 2020; originally announced January 2020.

  44. arXiv:1912.12887  [pdf, other]

    cs.SD cs.CL eess.AS

    Using a Pitch-Synchronous Residual Codebook for Hybrid HMM/Frame Selection Speech Synthesis

    Authors: Thomas Drugman, Alexis Moinet, Thierry Dutoit, Geoffrey Wilfart

    Abstract: This paper proposes a method to improve the quality delivered by statistical parametric speech synthesizers. For this, we use a codebook of pitch-synchronous residual frames, so as to construct a more realistic source signal. First a limited codebook of typical excitations is built from some training database. During the synthesis part, HMMs are used to generate filter and source coefficients. The…

    Submitted 30 December, 2019; originally announced December 2019.

  45. arXiv:1912.12843  [pdf, other]

    cs.SD cs.CL eess.AS

    Causal-Anticausal Decomposition of Speech using Complex Cepstrum for Glottal Source Estimation

    Authors: Thomas Drugman, Baris Bozkurt, Thierry Dutoit

    Abstract: Complex cepstrum is known in the literature for linearly separating causal and anticausal components. Relying on advances achieved by the Zeros of the Z-Transform (ZZT) technique, we here investigate the possibility of using complex cepstrum for glottal flow estimation on a large-scale database. Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex…

    Submitted 30 December, 2019; originally announced December 2019.

  46. arXiv:1912.12609  [pdf, other]

    cs.SD eess.AS

    A Comparative Study of Pitch Extraction Algorithms on a Large Variety of Singing Sounds

    Authors: Onur Babacan, Thomas Drugman, Nicolas d'Alessandro, Nathalie Henrich, Thierry Dutoit

    Abstract: The problem of pitch tracking has been extensively studied in the speech research community. The goal of this paper is to investigate how these techniques should be adapted to singing voice analysis, and to provide a comparative evaluation of the most representative state-of-the-art approaches. This study is carried out on a large database of annotated singing sounds with aligned EGG recordings, c…

    Submitted 29 December, 2019; originally announced December 2019.

  47. arXiv:1912.12604  [pdf, other]

    cs.SD cs.CL eess.AS

    Glottal Source Processing: from Analysis to Applications

    Authors: Thomas Drugman, Paavo Alku, Abeer Alwan, Bayya Yegnanarayana

    Abstract: The great majority of current voice technology applications relies on acoustic features characterizing the vocal tract response, such as the widely used MFCC or LPC parameters. Nonetheless, the airflow passing through the vocal folds, called the glottal flow, is expected to exhibit a relevant complementarity. Unfortunately, glottal analysis from speech recordings requires specific and more complex…

    Submitted 29 December, 2019; originally announced December 2019.

  48. arXiv:1912.12602  [pdf, other]

    cs.SD cs.CL eess.AS

    Complex Cepstrum-based Decomposition of Speech for Glottal Source Estimation

    Authors: Thomas Drugman, Baris Bozkurt, Thierry Dutoit

    Abstract: Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of complex cepstrum for source-tract deconvolution has been discussed in various articles. However there exists no study which proposes a glottal flow estimation methodology based on cepstrum and reports effective results. In this paper, we show that complex cepstrum can be e…

    Submitted 29 December, 2019; originally announced December 2019.

  49. arXiv:1912.05881  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Singing Synthesis: with a little help from my attention

    Authors: Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman

    Abstract: We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions. These two classes of models have significantly affected the field of text-to-speech, but have never been thoroughly applied to the task of singing synthesis. UTACO demonstrates that attention can be successfully applied to the singing synthesis…

    Submitted 6 May, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

    Comments: Submitted to Interspeech 2020

  50. arXiv:1912.05289  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Voice Conversion for Whispered Speech Synthesis

    Authors: Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet

    Abstract: We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the mapping between acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speak…

    Submitted 17 January, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

    Comments: Submitted to IEEE Signal Processing Letters