
Showing 1–50 of 53 results for author: Gerkmann, T

Searching in archive eess.
  1. arXiv:2406.06185  [pdf, other]

    eess.AS cs.LG cs.SD

    EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation

    Authors: Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann

    Abstract: We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling in 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various m…

    Submitted 11 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  2. arXiv:2406.03460  [pdf, other]

    eess.AS cs.LG cs.SD

    The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement

    Authors: Danilo de Oliveira, Simon Welker, Julius Richter, Timo Gerkmann

    Abstract: To obtain improved speech enhancement models, researchers often focus on increasing performance according to specific instrumental metrics. However, when the same metric is used in a loss function to optimize models, it may be detrimental to aspects that the given metric does not see. The goal of this paper is to illustrate the risk of overfitting a speech enhancement model to the metric used for…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  3. arXiv:2405.04272  [pdf, other]

    eess.AS cs.LG cs.SD

    BUDDy: Single-Channel Blind Unsupervised Dereverberation with Diffusion Models

    Authors: Eloi Moliner, Jean-Marie Lemercier, Simon Welker, Timo Gerkmann, Vesa Välimäki

    Abstract: In this paper, we present an unsupervised single-channel method for joint blind dereverberation and room impulse response estimation, based on posterior sampling with diffusion models. We parameterize the reverberation operator using a filter with exponential decay for each frequency subband, and iteratively estimate the corresponding parameters as the speech utterance gets refined along the rever…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Submitted to IWAENC 2024

  4. arXiv:2402.09821  [pdf, other]

    eess.AS cs.LG cs.SD

    Diffusion Models for Audio Restoration

    Authors: Jean-Marie Lemercier, Julius Richter, Simon Welker, Eloi Moliner, Vesa Välimäki, Timo Gerkmann

    Abstract: With the development of audio playback devices and fast data transmission, the demand for high sound quality is rising, for both entertainment and communications. In this quest for better sound quality, challenges emerge from distortions and interferences originating at the recording side or caused by an imperfect transmission pipeline. To address this problem, audio restoration methods aim to rec…

    Submitted 15 February, 2024; originally announced February 2024.

    Comments: Full paper invited to the IEEE Signal Processing Magazine Special Issue "Model-based and Data-Driven Audio Signal Processing"

  5. arXiv:2402.00811  [pdf, other]

    eess.AS cs.LG cs.SD

    An Analysis of the Variance of Diffusion-based Speech Enhancement

    Authors: Bunlong Lay, Timo Gerkmann

    Abstract: Diffusion models proved to be powerful models for generative speech enhancement. In recent SGMSE+ approaches, training involves a stochastic differential equation for the diffusion process, adding both Gaussian and environmental noise to the clean speech signal gradually. The speech enhancement performance varies depending on the choice of the stochastic differential equation that controls the evo…

    Submitted 13 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

    Comments: 5 pages, 3 figures, 1 table

  6. arXiv:2309.09920  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Distilling HuBERT with LSTMs via Decoupled Knowledge Distillation

    Authors: Danilo de Oliveira, Timo Gerkmann

    Abstract: Much research effort is being applied to the task of compressing the knowledge of self-supervised models, which are powerful, yet large and memory consuming. In this work, we show that the original method of knowledge distillation (and its more recently proposed extension, decoupled knowledge distillation) can be applied to the task of distilling HuBERT. In contrast to methods that focus on distil…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  7. arXiv:2309.09677  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Single and Few-step Diffusion for Generative Speech Enhancement

    Authors: Bunlong Lay, Jean-Marie Lemercier, Julius Richter, Timo Gerkmann

    Abstract: Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This results in a slow inference process and causes discretization errors that accumulate…

    Submitted 15 January, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  8. arXiv:2309.08639  [pdf, other]

    eess.IV eess.SP physics.comp-ph physics.optics

    Live Iterative Ptychography with projection-based algorithms

    Authors: Simon Welker, Tal Peer, Henry N. Chapman, Timo Gerkmann

    Abstract: In this work, we demonstrate that the ptychographic phase problem can be solved in a live fashion during scanning, while data is still being collected. We propose a generally applicable modification of the widespread projection-based algorithms such as Error Reduction (ER) and Difference Map (DM). This novel variant of ptychographic phase retrieval enables immediate visual feedback during experime…

    Submitted 19 September, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 24

  9. arXiv:2309.07828  [pdf, other]

    eess.AS cs.SD eess.SP

    EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

    Authors: Navin Raj Prabhu, Bunlong Lay, Simon Welker, Nale Lehmann-Willenbrock, Timo Gerkmann

    Abstract: Speech emotion conversion is the task of converting the expressed emotion of a spoken utterance to a target emotion while preserving the lexical content and speaker identity. While most existing works in speech emotion conversion rely on acted-out datasets and parallel data samples, in this work we specifically focus on more challenging in-the-wild scenarios and do not rely on parallel data. To th…

    Submitted 8 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  10. arXiv:2309.07043  [pdf, other]

    eess.AS cs.SD eess.SP

    A Flexible Online Framework for Projection-Based STFT Phase Retrieval

    Authors: Tal Peer, Simon Welker, Johannes Kolhoff, Timo Gerkmann

    Abstract: Several recent contributions in the field of iterative STFT phase retrieval have demonstrated that the performance of the classical Griffin-Lim method can be considerably improved upon. By using the same projection operators as Griffin-Lim, but combining them in innovative ways, these approaches achieve better results in terms of both reconstruction quality and required number of iterations, while…

    Submitted 13 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 24

  11. arXiv:2306.12867  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Wind Noise Reduction with a Diffusion-based Stochastic Regeneration Model

    Authors: Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann

    Abstract: In this paper we present a method for single-channel wind noise reduction using our previously proposed diffusion-based stochastic regeneration model combining predictive and generative modelling. We introduce a non-additive speech in noise model to account for the non-linear deformation of the membrane caused by the wind flow and possible clipping. We show that our stochastic regeneration model o…

    Submitted 9 January, 2024; v1 submitted 22 June, 2023; originally announced June 2023.

    Comments: Accepted to VDE 15th ITG conference on Speech Communication

  12. arXiv:2306.12286  [pdf, other]

    eess.AS cs.LG cs.SD

    Diffusion Posterior Sampling for Informed Single-Channel Dereverberation

    Authors: Jean-Marie Lemercier, Simon Welker, Timo Gerkmann

    Abstract: We present in this paper an informed single-channel dereverberation method based on conditional generation with diffusion models. With knowledge of the room impulse response, the anechoic utterance is generated via reverse diffusion using a measurement consistency criterion coupled with a neural network that represents the clean speech prior. The proposed approach is largely more robust to measure…

    Submitted 21 June, 2023; originally announced June 2023.

  13. arXiv:2306.03014  [pdf, other]

    eess.AS cs.LG cs.SD

    On the Behavior of Intrusive and Non-intrusive Speech Enhancement Metrics in Predictive and Generative Settings

    Authors: Danilo de Oliveira, Julius Richter, Jean-Marie Lemercier, Tal Peer, Timo Gerkmann

    Abstract: Since its inception, the field of deep speech enhancement has been dominated by predictive (discriminative) approaches, such as spectral mapping or masking. Recently, however, novel generative approaches have been applied to speech enhancement, attaining good denoising performance with high subjective quality scores. At the same time, advances in deep learning also allowed for the creation of neur…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Submitted to ITG Conference on Speech Communication

  14. arXiv:2306.01916  [pdf, other]

    eess.AS cs.HC cs.LG

    In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis

    Authors: Navin Raj Prabhu, Nale Lehmann-Willenbrock, Timo Gerkmann

    Abstract: Speech emotion conversion aims to convert the expressed emotion of a spoken utterance to a target emotion while preserving the lexical information and the speaker's identity. In this work, we specifically focus on in-the-wild emotion conversion where parallel data does not exist, and the problem of disentangling lexical, speaker, and emotion information arises. In this paper, we introduce a method…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: Submitted to 15th ITG Conference on Speech Communication

  15. arXiv:2306.01432  [pdf, other]

    eess.AS cs.LG

    Audio-Visual Speech Enhancement with Score-Based Generative Models

    Authors: Julius Richter, Simone Frintrop, Timo Gerkmann

    Abstract: This paper introduces an audio-visual speech enhancement system that leverages score-based generative models, also known as diffusion models, conditioned on visual information. In particular, we exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading. The layer-wise features of its transformer-based encoder are aggregated, time-aligne…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: Submitted to ITG Conference on Speech Communication

  16. arXiv:2306.00160  [pdf, other]

    eess.AS cs.LG cs.SD

    Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model

    Authors: Héctor Martel, Julius Richter, Kai Li, Xiaolin Hu, Timo Gerkmann

    Abstract: We propose Audio-Visual Lightweight ITerative model (AVLIT), an effective and lightweight neural network that uses Progressive Learning (PL) to perform audio-visual speech separation in noisy environments. To this end, we adopt the Asynchronous Fully Recurrent Convolutional Neural Network (A-FRCNN), which has shown successful results in audio-only speech separation. Our architecture consists of an…

    Submitted 31 May, 2023; originally announced June 2023.

    Comments: Accepted by Interspeech 2023

  17. Leveraging Semantic Information for Efficient Self-Supervised Emotion Recognition with Audio-Textual Distilled Models

    Authors: Danilo de Oliveira, Navin Raj Prabhu, Timo Gerkmann

    Abstract: In large part due to their implicit semantic modeling, self-supervised learning (SSL) methods have significantly increased the performance of valence recognition in speech emotion recognition (SER) systems. Yet, their large size may often hinder practical implementations. In this work, we take HuBERT as an example of an SSL model and analyze the relevance of each of its layers for SER. We show tha…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

    Journal ref: Proc. Interspeech 2023

  18. arXiv:2305.08744  [pdf, other]

    eess.AS cs.LG cs.SD

    Integrating Uncertainty into Neural Network-based Speech Enhancement

    Authors: Huajian Fang, Dennis Becker, Stefan Wermter, Timo Gerkmann

    Abstract: Supervised masking approaches in the time-frequency domain aim to employ deep neural networks to estimate a multiplicative mask to extract clean speech. This leads to a single estimate for each input without any guarantees or measures of reliability. In this paper, we study the benefits of modeling uncertainty in clean speech estimation. Prediction uncertainty is typically categorized into aleator…

    Submitted 15 May, 2023; originally announced May 2023.

    Comments: Accepted version

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1587-1600, 2023

  19. arXiv:2304.12023  [pdf, other]

    eess.AS cs.LG cs.SD

    Multi-channel Speech Separation Using Spatially Selective Deep Non-linear Filters

    Authors: Kristina Tesch, Timo Gerkmann

    Abstract: In a multi-channel separation task with multiple speakers, we aim to recover all individual speech signals from the mixture. In contrast to single-channel approaches, which rely on the different spectro-temporal characteristics of the speech signals, multi-channel approaches should additionally utilize the different spatial locations of the sources for a more powerful separation especially when th…

    Submitted 21 November, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: Accepted version

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 542-553, 2024

  20. arXiv:2303.15042  [pdf, other]

    eess.AS cs.LG cs.RO cs.SD

    Partially Adaptive Multichannel Joint Reduction of Ego-noise and Environmental Noise

    Authors: Huajian Fang, Niklas Wittmer, Johannes Twiefel, Stefan Wermter, Timo Gerkmann

    Abstract: Human-robot interaction relies on a noise-robust audio processing module capable of estimating target speech from audio recordings impacted by environmental noise, as well as self-induced noise, so-called ego-noise. While external ambient noise sources vary from environment to environment, ego-noise is mainly caused by the internal motors and joints of a robot. Ego-noise and environmental noise re…

    Submitted 27 March, 2023; originally announced March 2023.

    Comments: Accepted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023)

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

  21. arXiv:2303.08674  [pdf, other]

    eess.AS cs.SD

    Speech Signal Improvement Using Causal Generative Diffusion Models

    Authors: Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Tal Peer, Timo Gerkmann

    Abstract: In this paper, we present a causal speech signal improvement system that is designed to handle different types of distortions. The method is based on a generative diffusion model which has been shown to work well in scenarios with missing data and non-linear corruptions. To guarantee causal processing, we modify the network architecture of our previous work and replace global normalization with ca…

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  22. arXiv:2303.00529  [pdf, other]

    eess.AS cs.LG cs.SD

    Extending DNN-based Multiplicative Masking to Deep Subband Filtering for Improved Dereverberation

    Authors: Jean-Marie Lemercier, Julian Tobergte, Timo Gerkmann

    Abstract: In this paper, we present a scheme for extending deep neural network-based multiplicative maskers to deep subband filters for speech restoration in the time-frequency domain. The resulting method can be generically applied to any deep neural network providing masks in the time-frequency domain, while requiring only few more trainable parameters and a computational overhead that is negligible for s…

    Submitted 31 May, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

    Comments: Accepted at ISCA Interspeech 2023

  23. arXiv:2302.14748  [pdf, other]

    eess.AS cs.LG cs.SD

    Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement

    Authors: Bunlong Lay, Simon Welker, Julius Richter, Timo Gerkmann

    Abstract: Recently, score-based generative models have been successfully employed for the task of speech enhancement. A stochastic differential equation is used to model the iterative forward process, where at each step environmental noise and white Gaussian noise are added to the clean speech signal. While in limit the mean of the forward process ends at the noisy mixture, in practice it stops earlier and…

    Submitted 30 May, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: 5 pages, 2 figures, Accepted to Interspeech 2023

  24. arXiv:2212.11851  [pdf, other]

    eess.AS cs.LG cs.SD

    StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation

    Authors: Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann

    Abstract: Diffusion models have shown a great ability at bridging the performance gap between predictive and generative approaches for speech enhancement. We have shown that they may even outperform their predictive counterparts for non-additive corruption types or when they are evaluated on mismatched conditions. However, diffusion models suffer from a high computational burden, mainly as they require to r…

    Submitted 12 March, 2024; v1 submitted 22 December, 2022; originally announced December 2022.

    Comments: Published in IEEE/ACM Transactions on Audio, Speech and Language Processing, 2023

  25. arXiv:2212.04831  [pdf, other]

    eess.AS cs.LG cs.SD

    Uncertainty Estimation in Deep Speech Enhancement Using Complex Gaussian Mixture Models

    Authors: Huajian Fang, Timo Gerkmann

    Abstract: Single-channel deep speech enhancement approaches often estimate a single multiplicative mask to extract clean speech without a measure of its accuracy. Instead, in this work, we propose to quantify the uncertainty associated with clean speech estimates in neural network-based speech enhancement. Predictive uncertainty is typically categorized into aleatoric uncertainty and epistemic uncertainty.…

    Submitted 15 May, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: ©2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing

  26. arXiv:2211.06757  [pdf, other]

    eess.IV cs.CV cs.LG

    DriftRec: Adapting diffusion models to blind JPEG restoration

    Authors: Simon Welker, Henry N. Chapman, Timo Gerkmann

    Abstract: In this work, we utilize the high-fidelity generation abilities of diffusion models to solve blind JPEG restoration at high compression levels. We propose an elegant modification of the forward stochastic differential equation of diffusion models to adapt them to this restoration task and name our method DriftRec. Comparing DriftRec against an $L_2$ regression baseline with the same network archit…

    Submitted 3 April, 2024; v1 submitted 12 November, 2022; originally announced November 2022.

    Comments: (C) 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Report number: pp. 2795 - 2807

    Journal ref: IEEE Transactions on Image Processing, Vol. 33, 2024

  27. arXiv:2211.04332  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    DiffPhase: Generative Diffusion-based STFT Phase Retrieval

    Authors: Tal Peer, Simon Welker, Timo Gerkmann

    Abstract: Diffusion probabilistic models have been recently used in a variety of tasks, including speech enhancement and synthesis. As a generative approach, diffusion models have been shown to be especially suitable for imputation problems, where missing data is generated based on existing data. Phase retrieval is inherently an imputation problem, where phase information has to be generated based on the gi…

    Submitted 2 June, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: Accepted by ICASSP 2023

    Journal ref: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  28. arXiv:2211.02420  [pdf, other]

    eess.AS cs.LG cs.SD

    Spatially Selective Deep Non-linear Filters for Speaker Extraction

    Authors: Kristina Tesch, Timo Gerkmann

    Abstract: In a scenario with multiple persons talking simultaneously, the spatial characteristics of the signals are the most distinct feature for extracting the target signal. In this work, we develop a deep joint spatial-spectral non-linear filter that can be steered in an arbitrary target direction. For this we propose a simple and effective conditioning mechanism, which sets the initial state of the fil…

    Submitted 23 March, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

    Comments: ©2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing

  29. arXiv:2211.02397  [pdf, other]

    eess.AS cs.LG cs.SD

    Analysing Diffusion-based Generative Approaches versus Discriminative Approaches for Speech Restoration

    Authors: Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann

    Abstract: Diffusion-based generative models have had a high impact on the computer vision and speech processing communities these past years. Besides data generation tasks, they have also been employed for data restoration tasks like speech enhancement and dereverberation. While discriminative models have traditionally been argued to be more powerful e.g. for speech enhancement, generative diffusion approac…

    Submitted 16 March, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

    Comments: © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing

  30. arXiv:2208.05830  [pdf, other]

    eess.AS cs.LG cs.SD

    Speech Enhancement and Dereverberation with Diffusion-based Generative Models

    Authors: Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann

    Abstract: In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. Opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussia…

    Submitted 13 June, 2023; v1 submitted 11 August, 2022; originally announced August 2022.

    Comments: Accepted version

  31. arXiv:2207.12135  [pdf, other]

    eess.AS cs.LG cs.SD

    Label Uncertainty Modeling and Prediction for Speech Emotion Recognition using t-Distributions

    Authors: Navin Raj Prabhu, Nale Lehmann-Willenbrock, Timo Gerkmann

    Abstract: As different people perceive others' emotional expressions differently, their annotation in terms of arousal and valence are per se subjective. To address this, these emotion annotations are typically collected by multiple annotators and averaged across annotators in order to obtain labels for arousal and valence. However, besides the average, also the uncertainty of a label is of interest, and sh…

    Submitted 25 July, 2022; originally announced July 2022.

    Comments: Accepted to ACII 2022, 10th International Conference on Affective Computing & Intelligent Interaction

  32. arXiv:2206.13310  [pdf, other]

    eess.AS cs.LG cs.SD

    Insights Into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement

    Authors: Kristina Tesch, Timo Gerkmann

    Abstract: The key advantage of using multiple microphones for speech enhancement is that spatial filtering can be used to complement the tempo-spectral processing. In a traditional setting, linear spatial filtering (beamforming) and single-channel post-filtering are commonly performed separately. In contrast, there is a trend towards employing deep neural networks (DNNs) to learn a joint spatial and tempo-s…

    Submitted 16 January, 2023; v1 submitted 27 June, 2022; originally announced June 2022.

    Comments: Accepted version

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 563-575, 2023

  33. Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes

    Authors: Danilo de Oliveira, Tal Peer, Timo Gerkmann

    Abstract: The SepFormer architecture shows very good results in speech separation. Like other learned-encoder models, it uses short frames, as they have been shown to obtain better performance in these cases. This results in a large number of frames at the input, which is problematic; since the SepFormer is transformer-based, its computational complexity drastically increases with longer sequences. In this…

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: Accepted at Interspeech 2022

    Journal ref: Proc. Interspeech 2022

  34. arXiv:2206.11181  [pdf, other]

    eess.AS cs.LG cs.SD

    On the Role of Spatial, Spectral, and Temporal Processing for DNN-based Non-linear Multi-channel Speech Enhancement

    Authors: Kristina Tesch, Nils-Hendrik Mohrmann, Timo Gerkmann

    Abstract: Employing deep neural networks (DNNs) to directly learn filters for multi-channel speech enhancement has potentially two key advantages over a traditional approach combining a linear spatial filter with an independent tempo-spectral post-filter: 1) non-linear spatial filtering allows to overcome potential restrictions originating from a linear processing model and 2) joint processing of spatial an…

    Submitted 22 June, 2022; originally announced June 2022.

    Comments: Accepted at Interspeech 2022

  35. Beyond Griffin-Lim: Improved Iterative Phase Retrieval for Speech

    Authors: Tal Peer, Simon Welker, Timo Gerkmann

    Abstract: Phase retrieval is a problem encountered not only in speech and audio processing, but in many other fields such as optics. Iterative algorithms based on non-convex set projections are effective and frequently used for retrieving the phase when only STFT magnitudes are available. While the basic Griffin-Lim algorithm and its variants have been the prevalent method for decades, more recent advances,…

    Submitted 11 May, 2022; originally announced May 2022.

    Comments: Submitted to IWAENC 2022

  36. arXiv:2204.02978  [pdf]

    eess.AS cs.LG cs.SD

    A neural network-supported two-stage algorithm for lightweight dereverberation on hearing devices

    Authors: Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann

    Abstract: A two-stage lightweight online dereverberation algorithm for hearing devices is presented in this paper. The approach combines a multi-channel multi-frame linear filter with a single-channel single-frame post-filter. Both components rely on power spectral density (PSD) estimates provided by deep neural networks (DNNs). By deriving new metrics analyzing the dereverberation performance in various ti…

    Submitted 31 May, 2023; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: Accepted for publication in EURASIP Journal on Audio, Speech and Music Processing

  37. arXiv:2204.02741  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments

    Authors: Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann

    Abstract: In this paper, a neural network-augmented algorithm for noise-robust online dereverberation with a Kalman filtering variant of the weighted prediction error (WPE) method is proposed. The filter stochastic variations are predicted by a deep neural network (DNN) trained end-to-end using the filter residual error and signal characteristics. The presented framework allows for robust dereverberation on…

    Submitted 23 June, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: accepted to INTERSPEECH 2022

  38. Customizable End-to-end Optimization of Online Neural Network-supported Dereverberation for Hearing Devices

    Authors: Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann

    Abstract: This work focuses on online dereverberation for hearing devices using the weighted prediction error (WPE) algorithm. WPE filtering requires an estimate of the target speech power spectral density (PSD). Recently deep neural networks (DNNs) have been used for this task. However, these approaches optimize the PSD estimate which only indirectly affects the WPE output, thus potentially resulting in li…

    Submitted 6 April, 2022; originally announced April 2022.

    Comments: ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing

  39. arXiv:2203.17004  [pdf, other]

    eess.AS cs.LG cs.SD

    Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain

    Authors: Simon Welker, Julius Richter, Timo Gerkmann

    Abstract: Score-based generative models (SGMs) have recently shown impressive results for difficult generative tasks such as the unconditional and conditional generation of natural images and audio signals. In this work, we extend these models to the complex short-time Fourier transform (STFT) domain, proposing a novel training task for speech enhancement using a complex-valued deep neural network. We deriv…

    Submitted 7 July, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted by Interspeech 2022

  40. arXiv:2203.16222  [pdf, other]

    eess.AS cs.LG cs.SD

    Phase-Aware Deep Speech Enhancement: It's All About The Frame Length

    Authors: Tal Peer, Timo Gerkmann

    Abstract: Algorithmic latency in speech processing is dominated by the frame length used for Fourier analysis, which in turn limits the achievable performance of magnitude-centric approaches. As previous studies suggest that the importance of phase grows with decreasing frame length, this work presents a systematic study on the contribution of phase and magnitude in modern Deep Neural Network (DNN)-based speec…

    Submitted 4 October, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: The following article has been accepted by JASA Express Letters. After it is published, it will be found at http://asa.scitation.org/journal/jel

    Journal ref: JASA Express Letters 2, 104802 (2022)

  41. Integrating Statistical Uncertainty into Neural Network-Based Speech Enhancement

    Authors: Huajian Fang, Tal Peer, Stefan Wermter, Timo Gerkmann

    Abstract: Speech enhancement in the time-frequency domain is often performed by estimating a multiplicative mask to extract clean speech. However, most neural network-based methods perform point estimation, i.e., their output consists of a single mask. In this paper, we study the benefits of modeling uncertainty in neural network-based speech enhancement. For this, our neural network is trained to map a noi…

    Submitted 4 March, 2022; originally announced March 2022.

    Comments: ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  42. arXiv:2202.10573  [pdf, other]

    eess.IV cs.LG eess.AS eess.SP

    Deep Iterative Phase Retrieval for Ptychography

    Authors: Simon Welker, Tal Peer, Henry N. Chapman, Timo Gerkmann

    Abstract: One of the most prominent challenges in the field of diffractive imaging is the phase retrieval (PR) problem: In order to reconstruct an object from its diffraction pattern, the inverse Fourier transform must be computed. This is only possible given the full complex-valued diffraction data, i.e. magnitude and phase. However, in diffractive imaging, generally only magnitudes can be directly measure…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  43. arXiv:2112.02321  [pdf, other]

    cs.SD eess.AS

    Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network

    Authors: Xiaolin Hu, Kai Li, Weiyi Zhang, Yi Luo, Jean-Marie Lemercier, Timo Gerkmann

    Abstract: Recent advances in the design of neural network architectures, in particular those specialized in modeling sequences, have provided significant improvements in speech separation performance. In this work, we propose to use a bio-inspired architecture called Fully Recurrent Convolutional Neural Network (FRCNN) to solve the separation task. This model contains bottom-up, top-down and lateral connect…

    Submitted 4 December, 2021; originally announced December 2021.

    Comments: Accepted by NeurIPS 2021, Demo at https://cslikai.cn/project/AFRCNN

  44. arXiv:2110.03299  [pdf, other]

    eess.AS cs.LG cs.SD

    End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks

    Authors: Navin Raj Prabhu, Guillaume Carbajal, Nale Lehmann-Willenbrock, Timo Gerkmann

    Abstract: Emotions are subjective constructs. Recent end-to-end speech emotion recognition systems are typically agnostic to the subjective nature of emotions, despite their state-of-the-art performance. In this work, we introduce an end-to-end Bayesian neural network architecture to capture the inherent subjectivity in the arousal dimension of emotional expressions. To the best of our knowledge, this work…

    Submitted 27 June, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: Accepted to INTERSPEECH 2022

  45. Disentanglement Learning for Variational Autoencoders Applied to Audio-Visual Speech Enhancement

    Authors: Guillaume Carbajal, Julius Richter, Timo Gerkmann

    Abstract: Recently, the standard variational autoencoder has been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. Variational autoencoders have then been conditioned on a label describing a high-level speech attribute (e.g. speech activity) that allows for a more explicit control of speech generation. However, the label is not guarantee…

    Submitted 3 August, 2021; v1 submitted 19 May, 2021; originally announced May 2021.

    Comments: arXiv admin note: text overlap with arXiv:2102.06454

    Journal ref: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

  46. Nonlinear Spatial Filtering in Multichannel Speech Enhancement

    Authors: Kristina Tesch, Timo Gerkmann

    Abstract: The majority of multichannel speech enhancement algorithms are two-step procedures that first apply a linear spatial filter, a so-called beamformer, and combine it with a single-channel approach for postprocessing. However, the serial concatenation of a linear spatial filter and a postfilter is not generally optimal in the minimum mean square error (MMSE) sense for noise distributions other than a…

    Submitted 22 April, 2021; originally announced April 2021.

    Comments: Accepted version, 11 pages, 6 figures

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 29, 2021

  47. Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder

    Authors: Huajian Fang, Guillaume Carbajal, Stefan Wermter, Timo Gerkmann

    Abstract: Recently, a generative variational autoencoder (VAE) has been proposed for speech enhancement to model speech statistics. However, this approach only uses clean speech in the training phase, making the estimation particularly sensitive to noise presence, especially in low signal-to-noise ratios (SNRs). To increase the robustness of the VAE, we propose to include noise information in the training p…

    Submitted 17 February, 2021; originally announced February 2021.

    Comments: ICASSP 2021. (c) 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  48. Guided Variational Autoencoder for Speech Enhancement With a Supervised Classifier

    Authors: Guillaume Carbajal, Julius Richter, Timo Gerkmann

    Abstract: Recently, variational autoencoders have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. However, variational autoencoders are trained on clean speech only, which results in a limited ability to extract the speech signal from noisy speech compared to supervised approaches. In this paper, we propose to guide the variatio…

    Submitted 12 February, 2021; originally announced February 2021.

    Journal ref: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  49. arXiv:2006.05741  [pdf, other]

    eess.IV physics.med-ph

    Efficient Joint Estimation of Tracer Distribution and Background Signals in Magnetic Particle Imaging using a Dictionary Approach

    Authors: Tobias Knopp, Mirco Grosser, Matthias Graeser, Timo Gerkmann, Martin Möddel

    Abstract: Background signals are a primary source of artifacts in magnetic particle imaging and limit the sensitivity of the method since background signals are often not precisely known and vary over time. The state-of-the-art method for handling background signals uses one or several background calibration measurements with an empty scanner bore and subtracts a linear combination of these background measu…

    Submitted 17 June, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

  50. arXiv:2004.03512  [pdf, other]

    eess.AS cs.LG cs.SD

    SNR-Based Features and Diverse Training Data for Robust DNN-Based Speech Enhancement

    Authors: Robert Rehr, Timo Gerkmann

    Abstract: In this paper, we address the generalization of deep neural network (DNN) based speech enhancement to unseen noise conditions for the case that training data is limited in size and diversity. To gain more insights, we analyze the generalization with respect to (1) the size and diversity of the training data, (2) different network architectures, and (3) the chosen features. To address (1), we train…

    Submitted 15 May, 2021; v1 submitted 7 April, 2020; originally announced April 2020.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 29, 2021. (c) 2021 IEEE