Search | arXiv e-print repository

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation

Authors: Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann

Abstract: We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling in 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various m… ▽ More We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling in 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics. In addition, we conduct a listening test with 20 participants for the speech enhancement task, where a generative method is preferred. We introduce a blind test set that allows for automatic online evaluation of uploaded data. Dataset download links and automatic evaluation server can be found online. △ Less

Submitted 11 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024

arXiv:2402.00811 [pdf, other]

An Analysis of the Variance of Diffusion-based Speech Enhancement

Authors: Bunlong Lay, Timo Gerkmann

Abstract: Diffusion models proved to be powerful models for generative speech enhancement. In recent SGMSE+ approaches, training involves a stochastic differential equation for the diffusion process, adding both Gaussian and environmental noise to the clean speech signal gradually. The speech enhancement performance varies depending on the choice of the stochastic differential equation that controls the evo… ▽ More Diffusion models proved to be powerful models for generative speech enhancement. In recent SGMSE+ approaches, training involves a stochastic differential equation for the diffusion process, adding both Gaussian and environmental noise to the clean speech signal gradually. The speech enhancement performance varies depending on the choice of the stochastic differential equation that controls the evolution of the mean and the variance along the diffusion processes when adding environmental and Gaussian noise. In this work, we highlight that the scale of the variance is a dominant parameter for speech enhancement performance and show that it controls the tradeoff between noise attenuation and speech distortions. More concretely, we show that a larger variance increases the noise attenuation and allows for reducing the computational footprint, as fewer function evaluations for generating the estimate are required △ Less

Submitted 13 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: 5 pages, 3 figures, 1 table

arXiv:2309.09677 [pdf, other]

Single and Few-step Diffusion for Generative Speech Enhancement

Authors: Bunlong Lay, Jean-Marie Lemercier, Julius Richter, Timo Gerkmann

Abstract: Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This results in a slow inference process and causes discretization errors that accumulate… ▽ More Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This results in a slow inference process and causes discretization errors that accumulate over the sampling trajectory. In this paper, we address these limitations through a two-stage training approach. In the first stage, we train the diffusion model the usual way using the generative denoising score matching loss. In the second stage, we compute the enhanced signal by solving the reverse process and compare the resulting estimate to the clean speech target using a predictive loss. We show that using this second training stage enables achieving the same performance as the baseline model using only 5 function evaluations instead of 60 function evaluations. While the performance of usual generative diffusion algorithms drops dramatically when lowering the number of function evaluations (NFEs) to obtain single-step diffusion, we show that our proposed method keeps a steady performance and therefore largely outperforms the diffusion baseline in this setting and also generalizes better than its predictive counterpart. △ Less

Submitted 15 January, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2309.07828 [pdf, other]

EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

Authors: Navin Raj Prabhu, Bunlong Lay, Simon Welker, Nale Lehmann-Willenbrock, Timo Gerkmann

Abstract: Speech emotion conversion is the task of converting the expressed emotion of a spoken utterance to a target emotion while preserving the lexical content and speaker identity. While most existing works in speech emotion conversion rely on acted-out datasets and parallel data samples, in this work we specifically focus on more challenging in-the-wild scenarios and do not rely on parallel data. To th… ▽ More Speech emotion conversion is the task of converting the expressed emotion of a spoken utterance to a target emotion while preserving the lexical content and speaker identity. While most existing works in speech emotion conversion rely on acted-out datasets and parallel data samples, in this work we specifically focus on more challenging in-the-wild scenarios and do not rely on parallel data. To this end, we propose a diffusion-based generative model for speech emotion conversion, the EmoConv-Diff, that is trained to reconstruct an input utterance while also conditioning on its emotion. Subsequently, at inference, a target emotion embedding is employed to convert the emotion of the input utterance to the given target emotion. As opposed to performing emotion conversion on categorical representations, we use a continuous arousal dimension to represent emotions while also achieving intensity control. We validate the proposed methodology on a large in-the-wild dataset, the MSP-Podcast v1.10. Our results show that the proposed diffusion model is indeed capable of synthesizing speech with a controllable target emotion. Crucially, the proposed approach shows improved performance along the extreme values of arousal and thereby addresses a common challenge in the speech emotion conversion literature. △ Less

Submitted 8 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: Accepted to ICASSP 2024

arXiv:2303.08674 [pdf, other]

Speech Signal Improvement Using Causal Generative Diffusion Models

Authors: Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Tal Peer, Timo Gerkmann

Abstract: In this paper, we present a causal speech signal improvement system that is designed to handle different types of distortions. The method is based on a generative diffusion model which has been shown to work well in scenarios with missing data and non-linear corruptions. To guarantee causal processing, we modify the network architecture of our previous work and replace global normalization with ca… ▽ More In this paper, we present a causal speech signal improvement system that is designed to handle different types of distortions. The method is based on a generative diffusion model which has been shown to work well in scenarios with missing data and non-linear corruptions. To guarantee causal processing, we modify the network architecture of our previous work and replace global normalization with causal adaptive gain control. We generate diverse training data containing a broad range of distortions. This work was performed in the context of an "ICASSP Signal Processing Grand Challenge" and submitted to the non-real-time track of the "Speech Signal Improvement Challenge 2023", where it was ranked fifth. △ Less

Submitted 15 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2302.14748 [pdf, other]

Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement

Authors: Bunlong Lay, Simon Welker, Julius Richter, Timo Gerkmann

Abstract: Recently, score-based generative models have been successfully employed for the task of speech enhancement. A stochastic differential equation is used to model the iterative forward process, where at each step environmental noise and white Gaussian noise are added to the clean speech signal. While in limit the mean of the forward process ends at the noisy mixture, in practice it stops earlier and… ▽ More Recently, score-based generative models have been successfully employed for the task of speech enhancement. A stochastic differential equation is used to model the iterative forward process, where at each step environmental noise and white Gaussian noise are added to the clean speech signal. While in limit the mean of the forward process ends at the noisy mixture, in practice it stops earlier and thus only at an approximation of the noisy mixture. This results in a discrepancy between the terminating distribution of the forward process and the prior used for solving the reverse process at inference. In this paper, we address this discrepancy and propose a forward process based on a Brownian bridge. We show that such a process leads to a reduction of the mismatch compared to previous diffusion processes. More importantly, we show that our approach improves in objective metrics over the baseline process with only half of the iteration steps and having one hyperparameter less to tune. △ Less

Submitted 30 May, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

Comments: 5 pages, 2 figures, Accepted to Interspeech 20223

arXiv:2208.05830 [pdf, other]

Speech Enhancement and Dereverberation with Diffusion-based Generative Models

Authors: Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann

Abstract: In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. Opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussia… ▽ More In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. Opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches our forward process which moves from clean speech to noisy speech by including a drift term. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. By adapting the network architecture, we are able to significantly improve the speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of our original approach. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models and achieves better generalization when evaluating on a different corpus than used for training. We complement the results with an instrumental evaluation using real-world noisy recordings and a listening experiment, in which our proposed method is rated best. Examining different sampler configurations for solving the reverse process allows us to balance the performance and computational speed of the proposed method. Moreover, we show that the proposed method is also suitable for dereverberation and thus not limited to additive background noise removal. Code and audio examples are available online, see https://github.com/sp-uhh/sgmse △ Less

Submitted 13 June, 2023; v1 submitted 11 August, 2022; originally announced August 2022.

Comments: Accepted version

arXiv:1906.11875 [pdf]

Instant automatic diagnosis of diabetic retinopathy

Authors: Gwenolé Quellec, Mathieu Lamard, Bruno Lay, Alexandre Le Guilcher, Ali Erginay, Béatrice Cochener, Pascale Massin

Abstract: The purpose of this study is to evaluate the performance of the OphtAI system for the automatic detection of referable diabetic retinopathy (DR) and the automatic assessment of DR severity using color fundus photography. OphtAI relies on ensembles of convolutional neural networks trained to recognize eye laterality, detect referable DR and assess DR severity. The system can either process single i… ▽ More The purpose of this study is to evaluate the performance of the OphtAI system for the automatic detection of referable diabetic retinopathy (DR) and the automatic assessment of DR severity using color fundus photography. OphtAI relies on ensembles of convolutional neural networks trained to recognize eye laterality, detect referable DR and assess DR severity. The system can either process single images or full examination records. To document the automatic diagnoses, accurate heatmaps are generated. The system was developed and validated using a dataset of 763,848 images from 164,660 screening procedures from the OPHDIAT screening program. For comparison purposes, it was also evaluated in the public Messidor-2 dataset. Referable DR can be detected with an area under the ROC curve of AUC = 0.989 in the Messidor-2 dataset, using the University of Iowa's reference standard (95% CI: 0.984-0.994). This is significantly better than the only AI system authorized by the FDA, evaluated in the exact same conditions (AUC = 0.980). OphtAI can also detect vision-threatening DR with an AUC of 0.997 (95% CI: 0.996-0.998) and proliferative DR with an AUC of 0.997 (95% CI: 0.995-0.999). The system runs in 0.3 seconds using a graphics processing unit and less than 2 seconds without. OphtAI is safer, faster and more comprehensive than the only AI system authorized by the FDA so far. Instant DR diagnosis is now possible, which is expected to streamline DR screening and to give easy access to DR screening to more diabetic patients. △ Less

Submitted 12 June, 2019; originally announced June 2019.

arXiv:1906.11600 [pdf, other]

doi 10.1007/978-3-030-01449-0_39

Dealing with Topological Information within a Fully Convolutional Neural Network

Authors: Etienne Decencière, Santiago Velasco-Forero, Fu Min, Juanjuan Chen, Hélène Burdin, Gervais Gauthier, Bruno Laÿ, Thomas Bornschloegl, Thérèse Baldeweck

Abstract: A fully convolutional neural network has a receptive field of limited size and therefore cannot exploit global information, such as topological information. A solution is proposed in this paper to solve this problem, based on pre-processing with a geodesic operator. It is applied to the segmentation of histological images of pigmented reconstructed epidermis acquired via Whole Slide Imaging. A fully convolutional neural network has a receptive field of limited size and therefore cannot exploit global information, such as topological information. A solution is proposed in this paper to solve this problem, based on pre-processing with a geodesic operator. It is applied to the segmentation of histological images of pigmented reconstructed epidermis acquired via Whole Slide Imaging. △ Less

Submitted 27 June, 2019; originally announced June 2019.

Comments: International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS 2018)

Journal ref: Advanced Concepts for Intelligent Vision Systems. ACIVS 2018. Lecture Notes in Computer Science, vol 11182. Springer, Cham

Showing 1–9 of 9 results for author: Lay, B