-
Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting
Authors:
Ihab Asaad,
Maxime Jacquelin,
Olivier Perrotin,
Laurent Girin,
Thomas Hueber
Abstract:
Most speech self-supervised learning (SSL) models are trained with a pretext task that consists of predicting missing parts of the input signal, either future segments (causal prediction) or segments masked anywhere within the input (non-causal prediction). Learned speech representations can then be efficiently transferred to downstream tasks (e.g., automatic speech or speaker recognition). In the present study, we investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context, i.e., fulfilling a downstream task that is very similar to the pretext task. To that end, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder. In particular, we propose two solutions to match the HuBERT output with the HiFiGAN input, by freezing one and fine-tuning the other, and vice versa. The performance of both approaches was assessed in single- and multi-speaker settings, for both informed and blind inpainting configurations (i.e., the position of the mask is known or unknown, respectively), with different objective metrics and a perceptual evaluation. Results show that, while both solutions correctly reconstruct signal portions of up to 200 ms (and even 400 ms in some cases), fine-tuning the SSL encoder provides a more accurate signal reconstruction in the single-speaker setting, whereas freezing it (and training the neural vocoder instead) is a better strategy when dealing with multi-speaker data.
Submitted 30 May, 2024;
originally announced May 2024.
-
Energy Management of Hydrogen Hybrid Electric Vehicles -- A Potential Study
Authors:
David Theodor Machacek,
Nazim Ozan Yazar,
Thomas Huber,
Christopher Harald Onder
Abstract:
The hydrogen combustion engine (H$_2$ICE) is known to be able to burn H$_2$ under ultra-lean conditions, while producing no CO$_2$ emissions and extremely low engine-out NO$_x^{\mathrm{eo}}$ emissions. Immediate goals, such as the upcoming EURO 7 NO$_x$ limitations, can be reached more easily because extremely low engine-out NO$_x^{\mathrm{eo}}$ emissions facilitate the reduction of the overall tailpipe NO$_x^{\mathrm{tp}}$ emissions. In this work, we discuss the feasibility of achieving consistent reductions in NO$_x^{\mathrm{eo}}$ emissions through the electric hybridization of an H$_2$ICE-equipped passenger car (H$_2$-HEV), combined with a dedicated energy management strategy (EMS). In particular, the mixed H$_2$-HEV architecture is investigated and compared to a series H$_2$-HEV, a parallel H$_2$-HEV, and a base H$_2$-vehicle, which is only equipped with an H$_2$ICE. For hybrid vehicles, low H$_2$ consumption and low NO$_x^{\mathrm{eo}}$ emissions are conflicting objectives, the trade-off between which depends on the EMS and can be represented as a Pareto front. Overall, with a dedicated energy management calibration, the mixed H$_2$-HEV demonstrates the capability to consistently achieve extremely low engine-out NO$_x^{\mathrm{eo}}$ emissions. For a broad range of driving missions, the mixed H$_2$-HEV is able to decrease the engine-out NO$_x^{\mathrm{eo}}$ emissions by more than 90%, while, at the same time, the H$_2$ consumption is decreased by over 16%, compared to a comparable non-hybridized H$_2$-vehicle. These significant emission reductions are possible without modifying the exhaust-gas aftertreatment system or optimizing any of the individual drivetrain components, but solely by setting the EMS calibration accordingly.
Submitted 18 September, 2023;
originally announced September 2023.
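The trade-off mentioned in the abstract above, where low H$_2$ consumption and low engine-out NO$_x$ emissions conflict and the achievable compromises form a Pareto front, can be illustrated with a minimal sketch. The function name `pareto_front`, the `calibrations` list, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's own method for computing the front is not described in this listing.

```python
from typing import List, Tuple

# One (H2 consumption, engine-out NOx) pair per hypothetical EMS calibration.
# Units and values are invented for illustration only.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points when both objectives are minimized.

    A point p is dominated if some other point q is no worse in both
    objectives (and is not identical to p).
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    # Sort by the first objective so the front reads left to right.
    return sorted(front)


calibrations = [
    (1.00, 0.90),
    (1.05, 0.40),
    (1.10, 0.45),  # dominated by (1.05, 0.40)
    (1.20, 0.10),
    (1.30, 0.12),  # dominated by (1.20, 0.10)
]
front = pareto_front(calibrations)
print(front)
```

Each point remaining in `front` is a calibration for which one objective cannot be improved without worsening the other.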
-
Learning-Based Model Predictive Control for the Energy Management of Hybrid Electric Vehicles Including Driving Mode Decisions
Authors:
David Theodor Machacek,
Stijn van Dooren,
Thomas Huber,
Christopher Onder
Abstract:
This paper presents an online-capable controller for the energy management system of a parallel hybrid electric vehicle based on model predictive control. Its task is to minimize the vehicle's fuel consumption along a predicted driving mission by calculating the distribution of the driver's power request between the electrical and the combustive part of the powertrain, and by choosing the driving mode, which depends on the vehicle's clutch state. The inclusion of the clutch state in a model predictive control structure is not trivial because the underlying optimization problem becomes a mixed-integer program as a consequence. Using Pontryagin's Minimum Principle and a simplified vehicle model, it is possible to prove that a drive cycle-dependent critical power request $P_{\mathrm{crit}}$ exists, which uniquely separates the optimal driving modes. Based on this result, a learning algorithm is proposed to determine $P_{\mathrm{crit}}$ during the operation of the vehicle. The learning algorithm is incorporated into a multi-level controller structure, and the working principle of the resulting multi-level learning-based model predictive controller is analyzed in detail using two realistic driving missions. A comparison to the solution obtained by Dynamic Programming reveals that the proposed controller achieves close-to-optimal performance.
Submitted 3 January, 2023;
originally announced January 2023.
-
BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model
Authors:
Brooke Stephenson,
Laurent Besacier,
Laurent Girin,
Thomas Hueber
Abstract:
Several recent studies have tested the use of transformer language model representations to infer prosodic features for text-to-speech synthesis (TTS). While these studies have explored prosody in general, in this work, we look specifically at the prediction of contrastive focus on personal pronouns. This is a particularly challenging task as it often requires semantic, discursive and/or pragmatic knowledge to predict correctly. We collect a corpus of utterances containing contrastive focus and we evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on these samples. We also investigate how past utterances can provide relevant information for this prediction. Furthermore, we evaluate the controllability of pronoun prominence in a TTS model conditioned on acoustic prominence features.
Submitted 4 July, 2022;
originally announced July 2022.
-
Multistream neural architectures for cued-speech recognition using a pre-trained visual feature extractor and constrained CTC decoding
Authors:
Sanjana Sankar,
Denis Beautemps,
Thomas Hueber
Abstract:
This paper proposes a simple and effective approach for automatic recognition of Cued Speech (CS), a visual communication tool that helps people with hearing impairment to understand spoken language with the help of hand gestures that can uniquely identify the uttered phonemes in complement to lipreading. The proposed approach is based on a pre-trained hand and lips tracker used for visual feature extraction and a phonetic decoder based on a multistream recurrent neural network trained with connectionist temporal classification loss and combined with a pronunciation lexicon. The proposed system is evaluated on an updated version of the French CS dataset CSF18 for which the phonetic transcription has been manually checked and corrected. With a decoding accuracy at the phonetic level of 70.88%, the proposed system outperforms our previous CNN-HMM decoder and competes with more complex baselines.
Submitted 11 April, 2022;
originally announced April 2022.
-
Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation
Authors:
Marc-Antoine Georges,
Julien Diard,
Laurent Girin,
Jean-Luc Schwartz,
Thomas Hueber
Abstract:
We propose a computational model of speech production combining a pre-trained neural articulatory synthesizer able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters, a DNN-based internal forward model predicting the sensory consequences of articulatory commands, and an internal inverse model based on a recurrent neural network recovering articulatory commands from the acoustic speech input. Both forward and inverse models are jointly trained in a self-supervised way from raw acoustic-only speech data from different speakers. The imitation simulations are evaluated objectively and subjectively and display quite encouraging performances.
Submitted 5 April, 2022;
originally announced April 2022.
-
A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling
Authors:
Xiaoyu Bie,
Laurent Girin,
Simon Leglaive,
Thomas Hueber,
Xavier Alameda-Pineda
Abstract:
The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers have presented different extensions of the VAE to process sequential data, that not only model the latent space, but also model the temporal dependencies within a sequence of data vectors and corresponding latent vectors, relying on recurrent neural networks. We recently performed a comprehensive review of those models and unified them into a general class called Dynamical Variational Autoencoders (DVAEs). In the present paper, we present the results of an experimental benchmark comparing six of those DVAE models on the speech analysis-resynthesis task, as an illustration of the high potential of DVAEs for speech modeling.
Submitted 14 June, 2021; v1 submitted 11 June, 2021;
originally announced June 2021.
-
Deep learning-based bias transfer for overcoming laboratory differences of microscopic images
Authors:
Ann-Katrin Thebille,
Esther Dietrich,
Martin Klaus,
Lukas Gernhold,
Maximilian Lennartz,
Christoph Kuppe,
Rafael Kramann,
Tobias B. Huber,
Guido Sauter,
Victor G. Puelles,
Marina Zimmermann,
Stefan Bonn
Abstract:
The automated analysis of medical images is currently limited by technical and biological noise and bias. The same source tissue can be represented by vastly different images if the image acquisition or processing protocols vary. For an image analysis pipeline, it is crucial to compensate for such biases to avoid misinterpretations. Here, we evaluate, compare, and improve existing generative model architectures to overcome domain shifts for immunofluorescence (IF) and Hematoxylin and Eosin (H&E) stained microscopy images. To determine the performance of the generative models, the original and transformed images were segmented or classified by deep neural networks that were trained only on images of the target bias. In the scope of our analysis, U-Net cycleGANs trained with an additional identity and an MS-SSIM-based loss and Fixed-Point GANs trained with an additional structure loss led to the best results for the IF and H&E stained samples, respectively. Adapting the bias of the samples significantly improved the pixel-level segmentation for human kidney glomeruli and podocytes and improved the classification accuracy for human prostate biopsies by up to 14%.
Submitted 25 May, 2021;
originally announced May 2021.
-
Learning robust speech representation with an articulatory-regularized variational autoencoder
Authors:
Marc-Antoine Georges,
Laurent Girin,
Jean-Luc Schwartz,
Thomas Hueber
Abstract:
It is increasingly considered that human speech perception and production both rely on articulatory representations. In this paper, we investigate whether this type of representation could improve the performance of a deep generative model (here a variational autoencoder) trained to encode and decode acoustic speech features. First, we develop an articulatory model able to associate articulatory parameters describing the jaw, tongue, lips and velum configurations with vocal tract shapes and spectral features. Then we incorporate these articulatory parameters into a variational autoencoder applied to spectral features by using a regularization technique that constrains part of the latent space to follow articulatory trajectories. We show that this articulatory constraint improves model training by decreasing time to convergence and reconstruction loss at convergence, and yields better performance in a speech denoising task.
Submitted 7 April, 2021;
originally announced April 2021.
-
Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input
Authors:
Brooke Stephenson,
Thomas Hueber,
Laurent Girin,
Laurent Besacier
Abstract:
The prosody of a spoken word is determined by its surrounding context. In incremental text-to-speech synthesis, where the synthesizer produces an output before it has access to the complete input, the full context is often unknown, which can result in a loss of naturalness in the synthesized speech. In this paper, we investigate whether the use of predicted future text can attenuate this loss. We compare several test conditions for the next future word: (a) unknown (zero-word), (b) language model predicted, (c) randomly predicted and (d) ground-truth. We measure the prosodic features (pitch, energy and duration) and find that predicted text provides significant improvements over a zero-word lookahead, but only slight gains over random-word lookahead. We confirm these results with a perceptual test.
Submitted 15 June, 2021; v1 submitted 19 February, 2021;
originally announced February 2021.
-
What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS
Authors:
Brooke Stephenson,
Laurent Besacier,
Laurent Girin,
Thomas Hueber
Abstract:
In incremental text-to-speech synthesis (iTTS), the synthesizer produces an audio output before it has access to the entire input sentence. In this paper, we study the behavior of a neural sequence-to-sequence TTS system when used in an incremental mode, i.e., when generating speech output for token n, the system has access to n + k tokens from the text sequence. We first analyze the impact of this incremental policy on the evolution of the encoder representations of token n for different values of k (the lookahead parameter). The results show that, on average, tokens travel 88% of the way to their full context representation with a one-word lookahead and 94% after two words. We then investigate which text features are the most influential on the evolution towards the final representation using a random forest analysis. The results show that the most salient factors are related to token length. We finally evaluate the effects of lookahead k at the decoder level, using a MUSHRA listening test. This test shows results that contrast with the above high figures: speech synthesis quality obtained with a two-word lookahead is significantly lower than that obtained with the full sentence.
Submitted 4 September, 2020;
originally announced September 2020.
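The incremental policy described in the abstract above (when generating output for token n, the system sees n + k tokens) can be made concrete with a small sketch. The helper `visible_prefix` and the example sentence are hypothetical; the sketch only illustrates which part of the text is visible at each step, assuming 1-indexed token positions, and does not reproduce the TTS system studied in the paper.

```python
from typing import List


def visible_prefix(tokens: List[str], n: int, k: int) -> List[str]:
    """Return the tokens available when synthesizing token n with lookahead k.

    n is 1-indexed (an assumption of this sketch). The visible context is the
    first n + k tokens, truncated at the end of the sentence.
    """
    if not 1 <= n <= len(tokens):
        raise ValueError("n must be between 1 and len(tokens)")
    if k < 0:
        raise ValueError("k must be non-negative")
    return tokens[: min(n + k, len(tokens))]


sentence = "the cat sat on the mat".split()

# Print the context seen at each step for a one-word lookahead (k = 1).
for n in range(1, len(sentence) + 1):
    print(n, visible_prefix(sentence, n, k=1))
```

With k = 0 the system sees only the tokens up to and including the current one; as k grows, the visible context approaches the full sentence.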
-
Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models
Authors:
Fanny Roche,
Thomas Hueber,
Samuel Limier,
Laurent Girin
Abstract:
This study investigates the use of non-linear unsupervised dimensionality reduction techniques to compress a music dataset into a low-dimensional representation which can be used in turn for the synthesis of new sounds. We systematically compare (shallow) autoencoders (AEs), deep autoencoders (DAEs), recurrent autoencoders (with Long Short-Term Memory cells -- LSTM-AEs) and variational autoencoders (VAEs) with principal component analysis (PCA) for representing the high-resolution short-term magnitude spectrum of a large and dense dataset of music notes into a lower-dimensional vector (and then convert it back to a magnitude spectrum used for sound resynthesis). Our experiments were conducted on the publicly available multi-instrument and multi-pitch database NSynth. Interestingly and contrary to the recent literature on image processing, we can show that PCA systematically outperforms shallow AE. Only deep and recurrent architectures (DAEs and LSTM-AEs) lead to a lower reconstruction error. The optimization criterion in VAEs being the sum of the reconstruction error and a regularization term, it naturally leads to a lower reconstruction accuracy than DAEs but we show that VAEs are still able to outperform PCA while providing a low-dimensional latent space with nice "usability" properties. We also provide corresponding objective measures of perceptual audio quality (PEMO-Q scores), which generally correlate well with the reconstruction error.
Submitted 24 May, 2019; v1 submitted 11 June, 2018;
originally announced June 2018.