Search | arXiv e-print repository

TIPAA-SSL: Text Independent Phone-to-Audio Alignment based on Self-Supervised Learning and Knowledge Transfer

Authors: Noé Tits, Prernna Bhatnagar, Thierry Dutoit

Abstract: In this paper, we present a novel approach for text independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer. Our method leverages a self-supervised model (wav2vec2) fine-tuned for phoneme recognition using a Connectionist Temporal Classification (CTC) loss, a dimension reduction model and a frame-level phoneme classifier trained thanks to f… ▽ More In this paper, we present a novel approach for text independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer. Our method leverages a self-supervised model (wav2vec2) fine-tuned for phoneme recognition using a Connectionist Temporal Classification (CTC) loss, a dimension reduction model and a frame-level phoneme classifier trained thanks to forced-alignment labels (using Montreal Forced Aligner) to produce multi-lingual phonetic representations, thus requiring minimal additional training. We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English, respectively. Our proposed model outperforms the state-of-the-art (charsiu) in statistical metrics and has applications in language learning and speech processing systems. We leave experiments on other languages for future work but the design of the system makes it easily adaptable to other languages. △ Less

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2210.16984 [pdf, other]

Synthesizer Preset Interpolation using Transformer Auto-Encoders

Authors: Gwendal Le Vaillant, Thierry Dutoit

Abstract: Sound synthesizers are widespread in modern music production but they increasingly require expert skills to be mastered. This work focuses on interpolation between presets, i.e., sets of values of all sound synthesis parameters, to enable the intuitive creation of new sounds from existing ones. We introduce a bimodal auto-encoder neural network, which simultaneously processes presets using multi… ▽ More Sound synthesizers are widespread in modern music production but they increasingly require expert skills to be mastered. This work focuses on interpolation between presets, i.e., sets of values of all sound synthesis parameters, to enable the intuitive creation of new sounds from existing ones. We introduce a bimodal auto-encoder neural network, which simultaneously processes presets using multi-head attention blocks, and audio using convolutions. This model has been tested on a popular frequency modulation synthesizer with more than one hundred parameters. Experiments have compared the model to related architectures and methods, and have demonstrated that it performs smoother interpolations. After training, the proposed model can be integrated into commercial synthesizers for live interpolation or sound design tasks. △ Less

Submitted 9 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted to IEEE ICASSP 2023

arXiv:2209.15085 [pdf, other]

Cardiotocography Signal Abnormality Detection based on Deep Unsupervised Models

Authors: Julien Bertieaux, Mohammadhadi Shateri, Fabrice Labeau, Thierry Dutoit

Abstract: Cardiotocography (CTG) is a key element when it comes to monitoring fetal well-being. Obstetricians use it to observe the fetal heart rate (FHR) and the uterine contraction (UC). The goal is to determine how the fetus reacts to the contraction and whether it is receiving adequate oxygen. If a problem occurs, the physician can then respond with an intervention. Unfortunately, the interpretation of… ▽ More Cardiotocography (CTG) is a key element when it comes to monitoring fetal well-being. Obstetricians use it to observe the fetal heart rate (FHR) and the uterine contraction (UC). The goal is to determine how the fetus reacts to the contraction and whether it is receiving adequate oxygen. If a problem occurs, the physician can then respond with an intervention. Unfortunately, the interpretation of CTGs is highly subjective and there is a low inter- and intra-observer agreement rate among practitioners. This can lead to unnecessary medical intervention that represents a risk for both the mother and the fetus. Recently, computer-assisted diagnosis techniques, especially based on artificial intelligence models (mostly supervised), have been proposed in the literature. But, many of these models lack generalization to unseen/test data samples due to overfitting. Moreover, the unsupervised models were applied to a very small portion of the CTG samples where the normal and abnormal classes are highly separable. In this work, deep unsupervised learning approaches, trained in a semi-supervised manner, are proposed for anomaly detection in CTG signals. The GANomaly framework, modified to capture the underlying distribution of data samples, is used as our main model and is applied to the CTU-UHB dataset. Unlike the recent studies, all the CTG data samples, without any specific preferences, are used in our work. The experimental results show that our modified GANomaly model outperforms state-of-the-arts. This study admit the superiority of the deep unsupervised models over the supervised ones in CTG abnormality detection. △ Less

Submitted 29 September, 2022; originally announced September 2022.

arXiv:2204.07162 [pdf, ps, other]

Spatio-Temporal Analysis of Transformer based Architecture for Attention Estimation from EEG

Authors: Victor Delvigne, Hazem Wannous, Jean-Philippe Vandeborre, Laurence Ris, Thierry Dutoit

Abstract: For many years now, understanding the brain mechanism has been a great research subject in many different fields. Brain signal processing and especially electroencephalogram (EEG) has recently known a growing interest both in academia and industry. One of the main examples is the increasing number of Brain-Computer Interfaces (BCI) aiming to link brains and computers. In this paper, we present a n… ▽ More For many years now, understanding the brain mechanism has been a great research subject in many different fields. Brain signal processing and especially electroencephalogram (EEG) has recently known a growing interest both in academia and industry. One of the main examples is the increasing number of Brain-Computer Interfaces (BCI) aiming to link brains and computers. In this paper, we present a novel framework allowing us to retrieve the attention state, i.e degree of attention given to a specific task, from EEG signals. While previous methods often consider the spatial relationship in EEG through electrodes and process them in recurrent or convolutional based architecture, we propose here to also exploit the spatial and temporal information with a transformer-based network that has already shown its supremacy in many machine-learning (ML) related studies, e.g. machine translation. In addition to this novel architecture, an extensive study on the feature extraction methods, frequential bands and temporal windows length has also been carried out. The proposed network has been trained and validated on two public datasets and achieves higher results compared to state-of-the-art models. As well as proposing better results, the framework could be used in real applications, e.g. Attention Deficit Hyperactivity Disorder (ADHD) symptoms or vigilance during a driving assessment. △ Less

Submitted 4 April, 2022; originally announced April 2022.

arXiv:2201.03902 [pdf, other]

Where Is My Mind (looking at)? Predicting Visual Attention from Brain Activity

Authors: Victor Delvigne, Noé Tits, Luca La Fisca, Nathan Hubens, Antoine Maiorca, Hazem Wannous, Thierry Dutoit, Jean-Philippe Vandeborre

Abstract: Visual attention estimation is an active field of research at the crossroads of different disciplines: computer vision, artificial intelligence and medicine. One of the most common approaches to estimate a saliency map representing attention is based on the observed images. In this paper, we show that visual attention can be retrieved from EEG acquisition. The results are comparable to traditional… ▽ More Visual attention estimation is an active field of research at the crossroads of different disciplines: computer vision, artificial intelligence and medicine. One of the most common approaches to estimate a saliency map representing attention is based on the observed images. In this paper, we show that visual attention can be retrieved from EEG acquisition. The results are comparable to traditional predictions from observed images, which is of great interest. For this purpose, a set of signals has been recorded and different models have been developed to study the relationship between visual attention and brain activity. The results are encouraging and comparable with other approaches estimating attention with other modalities. The codes and dataset considered in this paper have been made available at \url{https://figshare.com/s/3e353bd1c621962888ad} to promote research in the field. △ Less

Submitted 11 January, 2022; originally announced January 2022.

arXiv:2103.04097 [pdf, other]

Analysis and Assessment of Controllability of an Expressive Deep Learning-based TTS system

Authors: Noé Tits, Kevin El Haddad, Thierry Dutoit

Abstract: In this paper, we study the controllability of an Expressive TTS system trained on a dataset for a continuous control. The dataset is the Blizzard 2013 dataset based on audiobooks read by a female speaker containing a great variability in styles and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The objective assessment is based on a measure of cor… ▽ More In this paper, we study the controllability of an Expressive TTS system trained on a dataset for a continuous control. The dataset is the Blizzard 2013 dataset based on audiobooks read by a female speaker containing a great variability in styles and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The objective assessment is based on a measure of correlation between acoustic features and the dimensions of the latent space representing expressiveness. The subjective assessment is based on a perceptual experiment in which users are shown an interface for Controllable Expressive TTS and asked to retrieve a synthetic utterance whose expressiveness subjectively corresponds to that a reference utterance. △ Less

Submitted 6 March, 2021; originally announced March 2021.

arXiv:2008.11045 [pdf, other]

ICE-Talk: an Interface for a Controllable Expressive Talking Machine

Authors: Noé Tits, Kevin El Haddad, Thierry Dutoit

Abstract: ICE-Talk is an open source web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. Moreover it is implemented as a module that can be used as part of a Human-Agent interaction. ICE-Talk is an open source web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. Moreover it is implemented as a module that can be used as part of a Human-Agent interaction. △ Less

Submitted 25 August, 2020; originally announced August 2020.

arXiv:2008.09483 [pdf, other]

Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning

Authors: Noé Tits, Kevin El Haddad, Thierry Dutoit

Abstract: Despite the growing interest for expressive speech synthesis, synthesis of nonverbal expressions is an under-explored area. In this paper we propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system. We leverage transfer learning by training a deep learning model to learn to generate both speech and laughs from annotations. We evaluate our model with a listeni… ▽ More Despite the growing interest for expressive speech synthesis, synthesis of nonverbal expressions is an under-explored area. In this paper we propose an audio laughter synthesis system based on a sequence-to-sequence TTS synthesis system. We leverage transfer learning by training a deep learning model to learn to generate both speech and laughs from annotations. We evaluate our model with a listening test, comparing its performance to an HMM-based laughter synthesis one and assess that it reaches higher perceived naturalness. Our solution is a first step towards a TTS system that would be able to synthesize speech with a control on amusement level with laughter integration. △ Less

Submitted 20 August, 2020; originally announced August 2020.

arXiv:2006.04142 [pdf, other]

Parametric Representation for Singing Voice Synthesis: a Comparative Evaluation

Authors: Onur Babacan, Thomas Drugman, Tuomo Raitio, Daniel Erro, Thierry Dutoit

Abstract: Various parametric representations have been proposed to model the speech signal. While the performance of such vocoders is well-known in the context of speech processing, their extrapolation to singing voice synthesis might not be straightforward. The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical par… ▽ More Various parametric representations have been proposed to model the speech signal. While the performance of such vocoders is well-known in the context of speech processing, their extrapolation to singing voice synthesis might not be straightforward. The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical parametric synthesis: traditional pulse vocoder, Deterministic plus Stochastic Model, Harmonic plus Noise Model and GlottHMM. The behavior of these techniques as a function of the singer type (baritone, counter-tenor and soprano) is studied. Secondly, the artifacts occurring in high-pitched voices are discussed and possible approaches to overcome them are suggested. △ Less

Submitted 7 June, 2020; originally announced June 2020.

arXiv:2006.04136 [pdf, ps, other]

Analysis and Synthesis of Hypo and Hyperarticulated Speech

Authors: Benjamin Picart, Thomas Drugman, Thierry Dutoit

Abstract: This paper focuses on the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis. First of all, a new French database matching our needs was created, which contains three identical sets, pronounced with three different degrees of articulation: neutral, hypo and hyperarticulated speech. On that basis, acoustic and phonetic analyses were performed.… ▽ More This paper focuses on the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis. First of all, a new French database matching our needs was created, which contains three identical sets, pronounced with three different degrees of articulation: neutral, hypo and hyperarticulated speech. On that basis, acoustic and phonetic analyses were performed. It is shown that the degrees of articulation significantly influence, on one hand, both vocal tract and glottal characteristics, and on the other hand, speech rate, phone durations, phone variations and the presence of glottal stops. Finally, neutral, hypo and hyperarticulated speech are synthesized using HMM-based speech synthesis and both objective and subjective tests aiming at assessing the generated speech quality are performed. These tests show that synthesized hypoarticulated speech seems to be less naturally rendered than neutral and hyperarticulated speech. △ Less

Submitted 7 June, 2020; originally announced June 2020.

arXiv:2005.11682 [pdf, other]

Glottal source estimation robustness: A comparison of sensitivity of voice source estimation techniques

Authors: Thomas Drugman, Thomas Dubuisson, Alexis Moinet, Nicolas D'Alessandro, Thierry Dutoit

Abstract: This paper addresses the problem of estimating the voice source directly from speech waveforms. A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase. This technique is compared to two other state-of-the-art well-known methods, namely the Zeros of the Z-Transform (ZZT) and the Iterative Adaptive Inverse Filtering (IAIF) algorithms. Decompositi… ▽ More This paper addresses the problem of estimating the voice source directly from speech waveforms. A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase. This technique is compared to two other state-of-the-art well-known methods, namely the Zeros of the Z-Transform (ZZT) and the Iterative Adaptive Inverse Filtering (IAIF) algorithms. Decomposition quality is assessed on synthetic signals through two objective measures: the spectral distortion and a glottal formant determination rate. Technique robustness is tested by analyzing the influence of noise and Glottal Closure Instant (GCI) location errors. Besides impacts of the fundamental frequency and the first formant on the performance are evaluated. Our proposed approach shows significant improvement in robustness, which could be of a great interest when decomposing real speech. △ Less

Submitted 24 May, 2020; originally announced May 2020.

arXiv:2005.07901 [pdf, ps, other]

Oscillating Statistical Moments for Speech Polarity Detection

Authors: Thomas Drugman, Thierry Dutoit

Abstract: An inversion of the speech polarity may have a dramatic detrimental effect on the performance of various techniques of speech processing. An automatic method for determining the speech polarity (which is dependent upon the recording setup) is thus required as a preliminary step for ensuring the well-behaviour of such techniques. This paper proposes a new approach of polarity detection relying on o… ▽ More An inversion of the speech polarity may have a dramatic detrimental effect on the performance of various techniques of speech processing. An automatic method for determining the speech polarity (which is dependent upon the recording setup) is thus required as a preliminary step for ensuring the well-behaviour of such techniques. This paper proposes a new approach of polarity detection relying on oscillating statistical moments. These moments have the property to oscillate at the local fundamental frequency and to exhibit a phase shift which depends on the speech polarity. This dependency stems from the introduction of non-linearity or higher-order statistics in the moment calculation. The resulting method is shown on 10 speech corpora to provide a substantial improvement compared to state-of-the-art techniques. △ Less

Submitted 16 May, 2020; originally announced May 2020.

arXiv:2005.07897 [pdf, other]

Glottal Source Estimation using an Automatic Chirp Decomposition

Authors: Thomas Drugman, Baris Bozkurt, Thierry Dutoit

Abstract: In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT). Decomposition was achieved by separating the roots inside (causal contribution) and outside (anticausal contribution) the unit circle. In order to guarantee a correct deconvolution, time alignment on the Glottal Closure Instants (GCIs) was shown to be essentia… ▽ More In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT). Decomposition was achieved by separating the roots inside (causal contribution) and outside (anticausal contribution) the unit circle. In order to guarantee a correct deconvolution, time alignment on the Glottal Closure Instants (GCIs) was shown to be essential. This paper extends the formalism of ZZT by evaluating the Z-transform on a contour possibly different from the unit circle. A method is proposed for determining automatically this contour by inspecting the root distribution. The derived Zeros of the Chirp Z-Transform (ZCZT)-based technique turns out to be much more robust to GCI location errors. △ Less

Submitted 16 May, 2020; originally announced May 2020.

arXiv:2005.05313 [pdf, other]

Audio and Contact Microphones for Cough Detection

Authors: Thomas Drugman, Jerome Urbain, Nathalie Bauwens, Ricardo Chessini, Anne-Sophie Aubriot, Patrick Lebecque, Thierry Dutoit

Abstract: In the framework of assessing the pathology severity in chronic cough diseases, medical literature underlines the lack of tools for allowing the automatic, objective and reliable detection of cough events. This paper describes a system based on two microphones which we developed for this purpose. The proposed approach relies on a large variety of audio descriptors, an efficient algorithm of featur… ▽ More In the framework of assessing the pathology severity in chronic cough diseases, medical literature underlines the lack of tools for allowing the automatic, objective and reliable detection of cough events. This paper describes a system based on two microphones which we developed for this purpose. The proposed approach relies on a large variety of audio descriptors, an efficient algorithm of feature selection based on their mutual information and the use of artificial neural networks. First, the possible use of a contact microphone (placed on the patient's thorax or trachea) in complement to the audio signal is investigated. This study underlines that this contact microphone suffers from reliability issues, and conveys little new relevant information compared to the audio modality. Secondly, the proposed audio-only approach is compared to a commercially available system using four sensors on a database with different sound categories often misdetected as coughs, and produced in various conditions. With average sensitivity and specificity of 94.7% and 95% respectively, the proposed method achieves better cough detection performance than the commercial system. △ Less

Submitted 10 May, 2020; originally announced May 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:2001.00537

arXiv:2005.04724 [pdf, other]

Chirp Complex Cepstrum-based Decomposition for Asynchronous Glottal Analysis

Authors: Thomas Drugman, Thierry Dutoit

Abstract: It was recently shown that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of speech. In order to guarantee a correct estimation, some constraints on the window have been derived. Among these, the window has to be synchronized on a Glottal Closure Instant. This paper proposes an extension of the complex cepstrum-based decompos… ▽ More It was recently shown that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of speech. In order to guarantee a correct estimation, some constraints on the window have been derived. Among these, the window has to be synchronized on a Glottal Closure Instant. This paper proposes an extension of the complex cepstrum-based decomposition by incorporating a chirp analysis. The resulting method is shown to give a reliable estimation of the glottal flow wherever the window is located. This technique is then suited for its integration in usual speech processing systems, which generally operate in an asynchronous way. Besides its potential for automatic voice quality analysis is highlighted. △ Less

Submitted 10 May, 2020; originally announced May 2020.

arXiv:2001.01000 [pdf, ps, other]

The Deterministic plus Stochastic Model of the Residual Signal and its Applications

Authors: Thomas Drugman, Thierry Dutoit

Abstract: The modeling of speech production often relies on a source-filter approach. Although methods parameterizing the filter have nowadays reached a certain maturity, there is still a lot to be gained for several speech processing applications in finding an appropriate excitation model. This manuscript presents a Deterministic plus Stochastic Model (DSM) of the residual signal. The DSM consists of two c… ▽ More The modeling of speech production often relies on a source-filter approach. Although methods parameterizing the filter have nowadays reached a certain maturity, there is still a lot to be gained for several speech processing applications in finding an appropriate excitation model. This manuscript presents a Deterministic plus Stochastic Model (DSM) of the residual signal. The DSM consists of two contributions acting in two distinct spectral bands delimited by a maximum voiced frequency. Both components are extracted from an analysis performed on a speaker-dependent dataset of pitch-synchronous residual frames. The deterministic part models the low-frequency contents and arises from an orthonormal decomposition of these frames. As for the stochastic component, it is a high-frequency noise modulated both in time and frequency. Some interesting phonetic and computational properties of the DSM are also highlighted. The applicability of the DSM in two fields of speech processing is then studied. First, it is shown that incorporating the DSM vocoder in HMM-based speech synthesis enhances the delivered quality. The proposed approach turns out to significantly outperform the traditional pulse excitation and provides a quality equivalent to STRAIGHT. In a second application, the potential of glottal signatures derived from the proposed DSM is investigated for speaker identification purpose. Interestingly, these signatures are shown to lead to better recognition rates than other glottal-based methods. △ Less

Submitted 29 December, 2019; originally announced January 2020.

arXiv:2001.00842 [pdf, other]

A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis

Authors: Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit

Abstract: Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. In order to alleviate this problem, a more suited modeling of the excitation should be adopted. For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two… ▽ More Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. In order to alleviate this problem, a more suited modeling of the excitation should be adopted. For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two distinct spectral bands delimited by the maximum voiced frequency. The deterministic part concerns the low-frequency contents and consists of a decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. The stochastic component is a high-pass filtered noise whose time structure is modulated by an energy-envelope, similarly to what is done in the Harmonic plus Noise Model (HNM). The proposed residual model is integrated within a HMM-based speech synthesizer and is compared to the traditional excitation through a subjective test. Results show a significative improvement for both male and female voices. In addition the proposed model requires few computational load and memory, which is essential for its integration in commercial applications. △ Less

Submitted 29 December, 2019; originally announced January 2020.

arXiv:2001.00841 [pdf, ps, other]

Glottal Closure and Opening Instant Detection from Speech Signals

Authors: Thomas Drugman, Thierry Dutoit

Abstract: This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Secondly, at each interval a precise position of the speech event is assigned by locating a discont… ▽ More This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Secondly, at each interval a precise position of the speech event is assigned by locating a discontinuity in the Linear Prediction residual. The proposed method is compared to the DYPSA algorithm on the CMU ARCTIC database. A significant improvement as well as a better noise robustness are reported. Besides, results of GOI identification accuracy are promising for the glottal source characterization. △ Less

Submitted 28 December, 2019; originally announced January 2020.

arXiv:2001.00840 [pdf, other]

A Comparative Study of Glottal Source Estimation Techniques

Authors: Thomas Drugman, Baris Bozkurt, Thierry Dutoit

Abstract: Source-tract decomposition (or glottal flow estimation) is one of the basic problems of speech processing. For this, several techniques have been proposed in the literature. However studies comparing different approaches are almost nonexistent. Besides, experiments have been systematically performed either on synthetic speech or on sustained vowels. In this study we compare three of the main repre… ▽ More Source-tract decomposition (or glottal flow estimation) is one of the basic problems of speech processing. For this, several techniques have been proposed in the literature. However studies comparing different approaches are almost nonexistent. Besides, experiments have been systematically performed either on synthetic speech or on sustained vowels. In this study we compare three of the main representative state-of-the-art methods of glottal flow estimation: closed-phase inverse filtering, iterative and adaptive inverse filtering, and mixed-phase decomposition. These techniques are first submitted to an objective assessment test on synthetic speech signals. Their sensitivity to various factors affecting the estimation quality, as well as their robustness to noise are studied. In a second experiment, their ability to label voice quality (tensed, modal, soft) is studied on a large corpus of real connected speech. It is shown that changes of voice quality are reflected by significant modifications in glottal feature distributions. Techniques based on the mixed-phase decomposition and on a closed-phase inverse filtering process turn out to give the best results on both clean synthetic and real speech signals. On the other hand, iterative and adaptive inverse filtering is recommended in noisy environments for its high robustness. △ Less

Submitted 28 December, 2019; originally announced January 2020.

arXiv:2001.00583 [pdf, other]

On the Mutual Information between Source and Filter Contributions for Voice Pathology Detection

Authors: Thomas Drugman, Thomas Dubuisson, Thierry Dutoit

Abstract: This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevancy of these features is assessed through mutual… ▽ More This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevancy of these features is assessed through mutual information-based measures. This allows an intuitive interpretation in terms of discrimation power and redundancy between the features, independently of any subsequent classifier. It is discussed which characteristics are interestingly informative or complementary for detecting voice pathologies. △ Less

Submitted 2 January, 2020; originally announced January 2020.

arXiv:2001.00582 [pdf, other]

Excitation-based Voice Quality Analysis and Modification

Authors: Thomas Drugman, Thierry Dutoit, Baris Bozkurt

Abstract: This paper investigates the differences occuring in the excitation for different voice qualities. Its goal is two-fold. First a large corpus containing three voice qualities (modal, soft and loud) uttered by the same speaker is analyzed and significant differences in characteristics extracted from the excitation are observed. Secondly rules of modification derived from the analysis are used to bui… ▽ More This paper investigates the differences occuring in the excitation for different voice qualities. Its goal is two-fold. First a large corpus containing three voice qualities (modal, soft and loud) uttered by the same speaker is analyzed and significant differences in characteristics extracted from the excitation are observed. Secondly rules of modification derived from the analysis are used to build a voice quality transformation system applied as a post-process to HMM-based speech synthesis. The system is shown to effectively achieve the transformations while maintaining the delivered quality. △ Less

Submitted 2 January, 2020; originally announced January 2020.

arXiv:2001.00581 [pdf, other]

Eigenresiduals for improved Parametric Speech Synthesis

Authors: Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit

Abstract: Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. This model is based on the decomposition of pitch-synchronous residual frames… ▽ More Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. This model is based on the decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. This basis contains a limited number of eigenresiduals and is computed on a relatively small speech database. A stream of PCA-based coefficients is added to our HMM-based synthesizer and allows to generate the voiced excitation during the synthesis. An improvement compared to the traditional excitation is reported while the synthesis engine footprint remains under about 1Mb. △ Less

Submitted 2 January, 2020; originally announced January 2020.

arXiv:2001.00580 [pdf, ps, other]

Assessment of Audio Features for Automatic Cough Detection

Authors: Thomas Drugman, Jerome Urbain, Thierry Dutoit

Abstract: This paper addresses the issue of cough detection using only audio recordings, with the ultimate goal of quantifying and qualifying the degree of pathology for patients suffering from respiratory diseases, notably mucoviscidosis. A large set of audio features describing various aspects of the audio signal is proposed. These features are assessed in two steps. First, their intrisic potential and re… ▽ More This paper addresses the issue of cough detection using only audio recordings, with the ultimate goal of quantifying and qualifying the degree of pathology for patients suffering from respiratory diseases, notably mucoviscidosis. A large set of audio features describing various aspects of the audio signal is proposed. These features are assessed in two steps. First, their intrisic potential and redundancy are evaluated using mutual information-based measures. Secondly, their efficiency is confirmed relying on three classifiers: Artificial Neural Network, Gaussian Mixture Model and Support Vector Machine. The influence of both the feature dimension and the classifier complexity are also investigated. △ Less

Submitted 2 January, 2020; originally announced January 2020.

arXiv:2001.00579 [pdf, other]

A Comparative Evaluation of Pitch Modification Techniques

Authors: Thomas Drugman, Thierry Dutoit

Abstract: This paper addresses the problem of pitch modification, as an important module for an efficient voice transformation system. The Deterministic plus Stochastic Model of the residual signal we proposed in a previous work is compared to TDPSOLA, HNM and STRAIGHT. The four methods are compared through an important subjective test. The influence of the speaker gender and of the pitch modification ratio… ▽ More This paper addresses the problem of pitch modification, as an important module for an efficient voice transformation system. The Deterministic plus Stochastic Model of the residual signal we proposed in a previous work is compared to TDPSOLA, HNM and STRAIGHT. The four methods are compared through an important subjective test. The influence of the speaker gender and of the pitch modification ratio is analyzed. Despite its higher compression level, the DSM technique is shown to give similar or better results than other methods, especially for male speakers and important ratios of modification. The DSM turns out to be only outperformed by STRAIGHT for female voices. △ Less

Submitted 2 January, 2020; originally announced January 2020.

arXiv:2001.00537 [pdf, other]

Objective Study of Sensor Relevance for Automatic Cough Detection

Authors: Thomas Drugman, Jerome Urbain, Nathalie Bauwens, Ricardo Chessini, Carlos Valderrama, Patrick Lebecque, Thierry Dutoit

Abstract: The development of a system for the automatic, objective and reliable detection of cough events is a need underlined by the medical literature for years. The benefit of such a tool is clear as it would allow the assessment of pathology severity in chronic cough diseases. Even though some approaches have recently reported solutions achieving this task with a relative success, there is still no stan… ▽ More The development of a system for the automatic, objective and reliable detection of cough events is a need underlined by the medical literature for years. The benefit of such a tool is clear as it would allow the assessment of pathology severity in chronic cough diseases. Even though some approaches have recently reported solutions achieving this task with a relative success, there is still no standardization about the method to adopt or the sensors to use. The goal of this paper is to study objectively the performance of several sensors for cough detection: ECG, thermistor, chest belt, accelerometer, contact and audio microphones. Experiments are carried out on a database of 32 healthy subjects producing, in a confined room and in three situations, voluntary cough at various volumes as well as other event categories which can possibly lead to some detection errors: background noise, forced expiration, throat clearing, speech and laugh. The relevance of each sensor is evaluated at three stages: mutual information conveyed by the features, ability to discriminate at the frame level cough from these latter other sources of ambiguity, and ability to detect cough events. In this latter experiment, with both an averaged sensitivity and specificity of about 94.5%, the proposed approach is shown to clearly outperform the commercial Karmelsonix system which achieved a specificity of 95.3% and a sensitivity of 64.9%. △ Less

Submitted 30 December, 2019; originally announced January 2020.

arXiv:2001.00473 [pdf, ps, other]

Detection of Glottal Closure Instants from Speech Signals: a Quantitative Review

Authors: Thomas Drugman, Mark Thomas, Jon Gudnason, Patrick Naylor, Thierry Dutoit

Abstract: The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires however that the precise locations of the Glottal Closure Instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six dif… ▽ More The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires however that the precise locations of the Glottal Closure Instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers. The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA). The efficacy of these methods is first evaluated on clean speech, both in terms of reliabililty and accuracy. Their robustness to additive noise and to reverberation is also assessed. A further contribution of the paper is the evaluation of their performance on a concrete application of speech processing: the causal-anticausal decomposition of speech. It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy. ZFR and SEDREAMS also show a superior robustness to additive noise and reverberation. △ Less

Submitted 28 December, 2019; originally announced January 2020.

arXiv:2001.00372 [pdf, other]

Phase-based Information for Voice Pathology Detection

Authors: Thomas Drugman, Thomas Dubuisson, Thierry Dutoit

Abstract: In most current approaches of speech processing, information is extracted from the magnitude spectrum. However recent perceptual studies have underlined the importance of the phase component. The goal of this paper is to investigate the potential of using phase-based features for automatically detecting voice disorders. It is shown that group delay functions are appropriate for characterizing irre… ▽ More In most current approaches of speech processing, information is extracted from the magnitude spectrum. However recent perceptual studies have underlined the importance of the phase component. The goal of this paper is to investigate the potential of using phase-based features for automatically detecting voice disorders. It is shown that group delay functions are appropriate for characterizing irregularities in the phonation. Besides the respect of the mixed-phase model of speech is discussed. The proposed phase-based features are evaluated and compared to other parameters derived from the magnitude spectrum. Both streams are shown to be interestingly complementary. Furthermore phase-based features turn out to convey a great amount of relevant information, leading to high discrimination performance. △ Less

Submitted 2 January, 2020; originally announced January 2020.

arXiv:1912.12887 [pdf, other]

Using a Pitch-Synchronous Residual Codebook for Hybrid HMM/Frame Selection Speech Synthesis

Authors: Thomas Drugman, Alexis Moinet, Thierry Dutoit, Geoffrey Wilfart

Abstract: This paper proposes a method to improve the quality delivered by statistical parametric speech synthesizers. For this, we use a codebook of pitch-synchronous residual frames, so as to construct a more realistic source signal. First a limited codebook of typical excitations is built from some training database. During the synthesis part, HMMs are used to generate filter and source coefficients. The… ▽ More This paper proposes a method to improve the quality delivered by statistical parametric speech synthesizers. For this, we use a codebook of pitch-synchronous residual frames, so as to construct a more realistic source signal. First a limited codebook of typical excitations is built from some training database. During the synthesis part, HMMs are used to generate filter and source coefficients. The latter coefficients contain both the pitch and a compact representation of target residual frames. The source signal is obtained by concatenating excitation frames picked up from the codebook, based on a selection criterion and taking target residual coefficients as input. Subjective results show a relevant improvement compared to the basic technique. △ Less

Submitted 30 December, 2019; originally announced December 2019.

arXiv:1912.12843 [pdf, other]

Causal-Anticausal Decomposition of Speech using Complex Cepstrum for Glottal Source Estimation

Authors: Thomas Drugman, Baris Bozkurt, Thierry Dutoit

Abstract: Complex cepstrum is known in the literature for linearly separating causal and anticausal components. Relying on advances achieved by the Zeros of the Z-Transform (ZZT) technique, we here investigate the possibility of using complex cepstrum for glottal flow estimation on a large-scale database. Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex… ▽ More Complex cepstrum is known in the literature for linearly separating causal and anticausal components. Relying on advances achieved by the Zeros of the Z-Transform (ZZT) technique, we here investigate the possibility of using complex cepstrum for glottal flow estimation on a large-scale database. Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex cepstrum causal-anticausal decomposition can be effectively used for glottal flow estimation when specific windowing criteria are met. It is also shown that this complex cepstral decomposition gives similar glottal estimates as obtained with the ZZT method. However, as complex cepstrum uses FFT operations instead of requiring the factoring of high-degree polynomials, the method benefits from a much higher speed. Finally in our tests on a large corpus of real expressive speech, we show that the proposed method has the potential to be used for voice quality analysis. △ Less

Submitted 30 December, 2019; originally announced December 2019.

arXiv:1912.12609 [pdf, other]

A Comparative Study of Pitch Extraction Algorithms on a Large Variety of Singing Sounds

Authors: Onur Babacan, Thomas Drugman, Nicolas d'Alessandro, Nathalie Henrich, Thierry Dutoit

Abstract: The problem of pitch tracking has been extensively studied in the speech research community. The goal of this paper is to investigate how these techniques should be adapted to singing voice analysis, and to provide a comparative evaluation of the most representative state-of-the-art approaches. This study is carried out on a large database of annotated singing sounds with aligned EGG recordings, c… ▽ More The problem of pitch tracking has been extensively studied in the speech research community. The goal of this paper is to investigate how these techniques should be adapted to singing voice analysis, and to provide a comparative evaluation of the most representative state-of-the-art approaches. This study is carried out on a large database of annotated singing sounds with aligned EGG recordings, comprising a variety of singer categories and singing exercises. The algorithmic performance is assessed according to the ability to detect voicing boundaries and to accurately estimate pitch contour. First, we evaluate the usefulness of adapting existing methods to singing voice analysis. Then we compare the accuracy of several pitch-extraction algorithms, depending on singer category and laryngeal mechanism. Finally, we analyze their robustness to reverberation. △ Less

Submitted 29 December, 2019; originally announced December 2019.

arXiv:1912.12602 [pdf, other]

Complex Cepstrum-based Decomposition of Speech for Glottal Source Estimation

Authors: Thomas Drugman, Baris Bozkurt, Thierry Dutoit

Abstract: Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of complex cepstrum for source-tract deconvolution has been discussed in various articles. However there exists no study which proposes a glottal flow estimation methodology based on cepstrum and reports effective results. In this paper, we show that complex cepstrum can be e… ▽ More Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of complex cepstrum for source-tract deconvolution has been discussed in various articles. However there exists no study which proposes a glottal flow estimation methodology based on cepstrum and reports effective results. In this paper, we show that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of a windowed speech signal as done by the Zeros of the Z-Transform (ZZT) decomposition. Based on exactly the same principles presented for ZZT decomposition, windowing should be applied such that the windowed speech signals exhibit mixed-phase characteristics which conform the speech production model that the anticausal component is mainly due to the glottal flow open phase. The advantage of the complex cepstrum-based approach compared to the ZZT decomposition is its much higher speed. △ Less

Submitted 29 December, 2019; originally announced December 2019.

arXiv:1910.06234 [pdf, other]

The Theory behind Controllable Expressive Speech Synthesis: a Cross-disciplinary Approach

Authors: Noé Tits, Kevin El Haddad, Thierry Dutoit

Abstract: As part of the Human-Computer Interaction field, Expressive speech synthesis is a very rich domain as it requires knowledge in areas such as machine learning, signal processing, sociology, psychology. In this Chapter, we will focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will have an overview of the main paradigms used in this field, throug… ▽ More As part of the Human-Computer Interaction field, Expressive speech synthesis is a very rich domain as it requires knowledge in areas such as machine learning, signal processing, sociology, psychology. In this Chapter, we will focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will have an overview of the main paradigms used in this field, through some of the most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We present a history of the main methods of Text-to-Speech synthesis: concatenative, parametric and statistical parametric speech synthesis. Finally, we focus on the last one, with the last techniques modeling Text-to-Speech synthesis as a sequence-to-sequence problem. This enables the use of Deep Learning blocks such as Convolutional and Recurrent Neural Networks as well as Attention Mechanism. The last part of the Chapter intends to assemble the different aspects of the theory and summarize the concepts. △ Less

Submitted 14 October, 2019; originally announced October 2019.

Comments: 19 pages, 6 figures. To be published in the book "Human Computer Interaction" edited by Prof. Yves Rybarczyk, published by IntechOpen

arXiv:1903.11570 [pdf, other]

Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis through Audio Analysis

Authors: Noé Tits, Fengna Wang, Kevin El Haddad, Vincent Pagel, Thierry Dutoit

Abstract: The field of Text-to-Speech has experienced huge improvements last years benefiting from deep learning techniques. Producing realistic speech becomes possible now. As a consequence, the research on the control of the expressiveness, allowing to generate speech in different styles or manners, has attracted increasing attention lately. Systems able to control style have been developed and show impre… ▽ More The field of Text-to-Speech has experienced huge improvements last years benefiting from deep learning techniques. Producing realistic speech becomes possible now. As a consequence, the research on the control of the expressiveness, allowing to generate speech in different styles or manners, has attracted increasing attention lately. Systems able to control style have been developed and show impressive results. However the control parameters often consist of latent variables and remain complex to interpret. In this paper, we analyze and compare different latent spaces and obtain an interpretation of their influence on expressive speech. This will enable the possibility to build controllable speech synthesis systems with an understandable behaviour. △ Less

Submitted 27 March, 2019; originally announced March 2019.

arXiv:1901.04276 [pdf, other]

Exploring Transfer Learning for Low Resource Emotional TTS

Authors: Noé Tits, Kevin El Haddad, Thierry Dutoit

Abstract: During the last few years, spoken language technologies have known a big improvement thanks to Deep Learning. However Deep Learning-based algorithms require amounts of data that are often difficult and costly to gather. Particularly, modeling the variability in speech of different speakers, different styles or different emotions with few data remains challenging. In this paper, we investigate how… ▽ More During the last few years, spoken language technologies have known a big improvement thanks to Deep Learning. However Deep Learning-based algorithms require amounts of data that are often difficult and costly to gather. Particularly, modeling the variability in speech of different speakers, different styles or different emotions with few data remains challenging. In this paper, we investigate how to leverage fine-tuning on a pre-trained Deep Learning-based TTS model to synthesize speech with a small dataset of another speaker. Then we investigate the possibility to adapt this model to have emotional TTS by fine-tuning the neutral TTS model with a small emotional dataset. △ Less

Submitted 14 January, 2019; originally announced January 2019.

Comments: Accepted at IntelliSys 2019

arXiv:1806.09514 [pdf, ps, other]

The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems

Authors: Adaeze Adigwe, Noé Tits, Kevin El Haddad, Sarah Ostadabbas, Thierry Dutoit

Abstract: In this paper, we present a database of emotional speech intended to be open-sourced and used for synthesis and generation purpose. It contains data for male and female actors in English and a male actor in French. The database covers 5 emotion classes so it could be suitable to build synthesis and voice transformation systems with the potential to control the emotional dimension in a continuous w… ▽ More In this paper, we present a database of emotional speech intended to be open-sourced and used for synthesis and generation purpose. It contains data for male and female actors in English and a male actor in French. The database covers 5 emotion classes so it could be suitable to build synthesis and voice transformation systems with the potential to control the emotional dimension in a continuous way. We show the data's efficiency by building a simple MLP system converting neutral to angry speech style and evaluate it via a CMOS perception test. Even though the system is a very simple one, the test show the efficiency of the data which is promising for future work. △ Less

Submitted 25 June, 2018; originally announced June 2018.

Comments: Submitted to SLSP 2018

arXiv:1805.09197 [pdf, other]

ASR-based Features for Emotion Recognition: A Transfer Learning Approach

Authors: Noé Tits, Kevin El Haddad, Thierry Dutoit

Abstract: During the last decade, the applications of signal processing have drastically improved with deep learning. However areas of affecting computing such as emotional speech synthesis or emotion recognition from spoken language remains challenging. In this paper, we investigate the use of a neural Automatic Speech Recognition (ASR) as a feature extractor for emotion recognition. We show that these fea… ▽ More During the last decade, the applications of signal processing have drastically improved with deep learning. However areas of affecting computing such as emotional speech synthesis or emotion recognition from spoken language remains challenging. In this paper, we investigate the use of a neural Automatic Speech Recognition (ASR) as a feature extractor for emotion recognition. We show that these features outperform the eGeMAPS feature set to predict the valence and arousal emotional dimensions, which means that the audio-to-text map** learning by the ASR system contain information related to the emotional dimensions in spontaneous speech. We also examine the relationship between first layers (closer to speech) and last layers (closer to text) of the ASR and valence/arousal. △ Less

Submitted 1 June, 2018; v1 submitted 23 May, 2018; originally announced May 2018.

Comments: Accepted to be published in the First Workshop on Computational Modeling of Human Multimodal Language - ACL 2018

Showing 1–36 of 36 results for author: Dutoit, T