Showing 1–39 of 39 results for author: Jensen, J

  1. arXiv:2406.06160  [pdf, other]

    eess.AS

    The Effect of Training Dataset Size on Discriminative and Diffusion-Based Speech Enhancement Systems

    Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

    Abstract: The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but…

    Submitted 10 June, 2024; originally announced June 2024.

  2. arXiv:2404.19375  [pdf]

    eess.AS cs.SD

    Deep low-latency joint speech transmission and enhancement over a gaussian channel

    Authors: Mohammad Bokaei, Jesper Jensen, Simon Doclo, Jan Østergaard

    Abstract: Ensuring intelligible speech communication for hearing assistive devices in low-latency scenarios presents significant challenges in terms of speech enhancement, coding and transmission. In this paper, we propose novel solutions for low-latency joint speech transmission and enhancement, leveraging deep neural networks (DNNs). Our approach integrates two state-of-the-art DNN architectures for low-l…

    Submitted 30 April, 2024; originally announced April 2024.

  3. How to train your ears: Auditory-model emulation for large-dynamic-range inputs and mild-to-severe hearing losses

    Authors: Peter Leer, Jesper Jensen, Zheng-Hua Tan, Jan Østergaard, Lars Bramsløw

    Abstract: Advanced auditory models are useful in designing signal-processing algorithms for hearing-loss compensation or speech enhancement. Such auditory models provide rich and detailed descriptions of the auditory pathway, and might allow for individualization of signal-processing strategies, based on physiological measurements. However, these auditory models are often computationally demanding, requirin…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. This version is the authors' version and may vary from the final publication in details

  4. arXiv:2403.10420  [pdf, other]

    eess.AS

    Neural Networks Hear You Loud And Clear: Hearing Loss Compensation Using Deep Neural Networks

    Authors: Peter Leer, Jesper Jensen, Laurel Carney, Zheng-Hua Tan, Jan Østergaard, Lars Bramsløw

    Abstract: This article investigates the use of deep neural networks (DNNs) for hearing-loss compensation. Hearing loss is a prevalent issue affecting millions of people worldwide, and conventional hearing aids have limitations in providing satisfactory compensation. DNNs have shown remarkable performance in various auditory tasks, including speech recognition, speaker identification, and music classificatio…

    Submitted 15 March, 2024; originally announced March 2024.

  5. arXiv:2403.05393  [pdf, other]

    eess.AS

    Binaural Speech Enhancement Using Deep Complex Convolutional Transformer Networks

    Authors: Vikas Tokala, Eric Grinstein, Mike Brookes, Simon Doclo, Jesper Jensen, Patrick A. Naylor

    Abstract: Studies have shown that in noisy acoustic environments, providing binaural signals to the user of an assistive listening device may improve speech intelligibility and spatial awareness. This paper presents a binaural speech enhancement method using a complex convolutional neural network with an encoder-decoder architecture and a complex multi-head attention transformer. The model is trained to est…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted to ICASSP 2024

  6. arXiv:2401.09315  [pdf, other]

    eess.AS

    On Speech Pre-emphasis as a Simple and Inexpensive Method to Boost Speech Enhancement

    Authors: Iván López-Espejo, Aditya Joglekar, Antonio M. Peinado, Jesper Jensen

    Abstract: Pre-emphasis filtering, compensating for the natural energy decay of speech at higher frequencies, has been considered as a common pre-processing step in a number of speech processing tasks over the years. In this work, we demonstrate, for the first time, that pre-emphasis filtering may also be used as a simple and computationally-inexpensive way to leverage deep neural network-based speech enhanc…

    Submitted 17 January, 2024; originally announced January 2024.

  7. arXiv:2312.16613  [pdf, other]

    cs.SD cs.LG eess.AS

    Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions

    Authors: Holger Severin Bovbjerg, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan

    Abstract: In this paper, we propose the use of self-supervised pretraining on a large unlabelled data set to improve the performance of a personalized voice activity detection (VAD) model in adverse conditions. We pretrain a long short-term memory (LSTM)-encoder using the autoregressive predictive coding (APC) framework and fine-tune it for personalized VAD. We also propose a denoising variant of APC, with…

    Submitted 23 January, 2024; v1 submitted 27 December, 2023; originally announced December 2023.

    Comments: To be published at ICASSP 2024, 14th of April 2024, Seoul, South Korea. Copyright (c) 2023 IEEE. 5 pages, 2 figures, 5 tables

    MSC Class: 68T10 ACM Class: I.2.6

  8. arXiv:2312.04370  [pdf, other]

    eess.AS cs.LG cs.SD

    Investigating the Design Space of Diffusion Models for Speech Enhancement

    Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

    Abstract: Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals…

    Submitted 7 December, 2023; originally announced December 2023.

  9. arXiv:2312.02683  [pdf, other]

    eess.AS cs.LG cs.SD

    Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler

    Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

    Abstract: Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on t…

    Submitted 16 January, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024

  10. arXiv:2309.11243  [pdf, other]

    eess.AS cs.SD

    Joint Minimum Processing Beamforming and Near-end Listening Enhancement

    Authors: Andreas J. Fuglsig, Jesper Jensen, Zheng-Hua Tan, Lars S. Bertelsen, Jens Christian Lindof, Jan Østergaard

    Abstract: We consider speech enhancement for signals picked up in one noisy environment that must be rendered to a listener in another noisy environment. For both far-end noise reduction and near-end listening enhancement, it has been shown that excessive focus on noise suppression or intelligibility maximization may lead to excessive speech distortions and quality degradations in favorable noise conditions…

    Submitted 5 February, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted at IEEE ICASSP 2024 Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA) 2024

  11. arXiv:2306.00489  [pdf, other]

    cs.SD cs.AI eess.AS

    Speech inpainting: Context-based speech synthesis guided by video

    Authors: Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen

    Abstract: Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted in Interspeech23

  12. arXiv:2303.00832  [pdf, ps, other]

    eess.SP

    Distributed Adaptive Norm Estimation for Blind System Identification in Wireless Sensor Networks

    Authors: Matthias Blochberger, Filip Elvander, Randall Ali, Jan Østergaard, Jesper Jensen, Marc Moonen, Toon van Waterschoot

    Abstract: Distributed signal-processing algorithms in (wireless) sensor networks often aim to decentralize processing tasks to reduce communication cost and computational complexity or avoid reliance on a single device (i.e., fusion center) for processing. In this contribution, we extend a distributed adaptive algorithm for blind system identification that relies on the estimation of a stacked network-wide…

    Submitted 1 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  13. arXiv:2302.12048  [pdf, ps, other]

    eess.AS cs.SD

    Frequency bin-wise single channel speech presence probability estimation using multiple DNNs

    Authors: Shuai Tao, Himavanth Reddy, Jesper Rindom Jensen, Mads Græsbøll Christensen

    Abstract: In this work, we propose a frequency bin-wise method to estimate the single-channel speech presence probability (SPP) with multiple deep neural networks (DNNs) in the short-time Fourier transform domain. Since all frequency bins are typically considered simultaneously as input features for conventional DNN-based SPP estimators, high model complexity is inevitable. To reduce the model complexity an…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: Accepted for ICASSP 2023

  14. arXiv:2211.10565  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting

    Authors: Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen, John H. L. Hansen

    Abstract: In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS whenever the number of filterbank channels is severely decreased. Reducing the number of channels might yield certain KWS performance drop, but…

    Submitted 23 February, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

  15. HAVOK Model Predictive Control for Time-Delay Systems with Applications to District Heating

    Authors: Christian M. Jensen, Mathias C. Frederiksen, Carsten S. Kallesøe, Jeppe N. Jensen, Laurits H. Andersen, Roozbeh Izadi-Zamanabadi

    Abstract: A computationally efficient Model-Predictive Control (MPC) approach is proposed for systems with unknown delay using only input/output data. We use the Koopman operator framework and the related Hankel Alternative View of Koopman (HAVOK) algorithm to identify a model in a basis of projected time-delay coordinates and demonstrate a novel MPC structure that reduces and bounds the computational compl…

    Submitted 6 April, 2023; v1 submitted 31 October, 2022; originally announced November 2022.

    Comments: This work has been accepted for publication at IFAC World Congress 2023

  16. arXiv:2210.17154  [pdf, other]

    eess.AS cs.SD eess.SP

    Minimum Processing Near-end Listening Enhancement

    Authors: Andreas Jonas Fuglsig, Jesper Jensen, Zheng-Hua Tan, Lars Søndergaard Bertelsen, Jens Christian Lindof, Jan Østergaard

    Abstract: The intelligibility and quality of speech from a mobile phone or public announcement system are often affected by background noise in the listening environment. By pre-processing the speech signal it is possible to improve the speech intelligibility and quality -- this is known as near-end listening enhancement (NLE). Although existing NLE techniques are able to greatly increase intelligibility i…

    Submitted 30 May, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

  17. arXiv:2111.10592  [pdf, other]

    cs.SD cs.HC cs.LG eess.AS

    Deep Spoken Keyword Spotting: An Overview

    Authors: Iván López-Espejo, Zheng-Hua Tan, John Hansen, Jesper Jensen

    Abstract: Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in te…

    Submitted 20 November, 2021; originally announced November 2021.

  18. arXiv:2111.08327  [pdf, other]

    cs.SD eess.AS

    Detecting acoustic reflectors using a robot's ego-noise

    Authors: Usama Saqib, Antoine Deleforge, Jesper Jensen

    Abstract: In this paper, we propose a method to estimate the proximity of an acoustic reflector, e.g., a wall, using ego-noise, i.e., the noise produced by the moving parts of a listening robot. This is achieved by estimating the times of arrival of acoustic echoes reflected from the surface. Simulated experiments show that the proposed nonintrusive approach is capable of accurately estimating the distance…

    Submitted 16 November, 2021; originally announced November 2021.

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2021, Toronto, Canada

  19. Joint Far- and Near-End Speech Intelligibility Enhancement based on the Approximated Speech Intelligibility Index

    Authors: Andreas Jonas Fuglsig, Jan Østergaard, Jesper Jensen, Lars Søndergaard Bertelsen, Peter Mariager, Zheng-Hua Tan

    Abstract: This paper considers speech enhancement of signals picked up in one noisy environment which must be presented to a listener in another noisy environment. Recently, it has been shown that an optimal solution to this problem requires the consideration of the noise sources in both environments jointly. However, the existing optimal mutual information based method requires a complicated system model t…

    Submitted 15 November, 2021; originally announced November 2021.

  20. arXiv:2103.14882  [pdf, other]

    cs.SD eess.AS

    On TasNet for Low-Latency Single-Speaker Speech Enhancement

    Authors: Morten Kolbæk, Zheng-Hua Tan, Søren Holdt Jensen, Jesper Jensen

    Abstract: In recent years, speech processing algorithms have seen tremendous progress primarily due to the deep learning renaissance. This is especially true for speech separation where the time-domain audio separation network (TasNet) has led to significant improvements. However, for the related task of single-speaker speech enhancement, which is of obvious importance, it is yet unknown, if the TasNet arch…

    Submitted 27 March, 2021; originally announced March 2021.

  21. arXiv:2010.04556  [pdf, other]

    eess.AS cs.LG eess.IV

    Audio-Visual Speech Inpainting with Deep Learning

    Authors: Giovanni Morrone, Daniel Michelsanti, Zheng-Hua Tan, Jesper Jensen

    Abstract: In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint…

    Submitted 3 February, 2021; v1 submitted 9 October, 2020; originally announced October 2020.

    Comments: Accepted at ICASSP 2021

  22. arXiv:2008.09586  [pdf, other]

    eess.AS cs.LG eess.IV

    An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

    Authors: Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

    Abstract: Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaf…

    Submitted 12 March, 2021; v1 submitted 21 August, 2020; originally announced August 2020.

  23. arXiv:2006.00217  [pdf, other]

    eess.AS cs.LG cs.SD

    Exploring Filterbank Learning for Keyword Spotting

    Authors: Iván López-Espejo, Zheng-Hua Tan, Jesper Jensen

    Abstract: Despite their great performance over the years, handcrafted speech features are not necessarily optimal for any particular speech application. Consequently, with greater or lesser success, optimal filterbank learning has been studied for different speech processing tasks. In this paper, we fill in a gap by exploring filterbank learning for keyword spotting (KWS). Two approaches are examined: filte…

    Submitted 30 May, 2020; originally announced June 2020.

  24. arXiv:2004.02541  [pdf, other]

    eess.AS cs.CV cs.LG

    Vocoder-Based Speech Synthesis from Silent Videos

    Authors: Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen

    Abstract: Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence determines an extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and…

    Submitted 15 August, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

    Comments: Accepted to Interspeech 2020

  25. arXiv:1909.01019  [pdf, other]

    cs.SD cs.LG eess.AS

    On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

    Authors: Morten Kolbæk, Zheng-Hua Tan, Søren Holdt Jensen, Jesper Jensen

    Abstract: Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, only little is known about the impact of…

    Submitted 30 January, 2020; v1 submitted 3 September, 2019; originally announced September 2019.

    Comments: Published in the IEEE Transactions on Audio, Speech and Language Processing

  26. arXiv:1906.09417  [pdf, other]

    cs.SD cs.HC cs.LG eess.AS

    Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

    Authors: Iván López-Espejo, Zheng-Hua Tan, Jesper Jensen

    Abstract: Keyword spotting (KWS) is experiencing an upswing due to the pervasiveness of small electronic devices that allow interaction with them via speech. Often, KWS systems are speaker-independent, which means that any person --user or not-- might trigger them. For applications like KWS for hearing assistive devices this is unacceptable, as only the user must be allowed to handle them. In this paper we…

    Submitted 26 June, 2019; v1 submitted 22 June, 2019; originally announced June 2019.

  27. Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect

    Authors: Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen

    Abstract: When speaking in presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially a…

    Submitted 29 May, 2019; originally announced May 2019.

  28. arXiv:1905.11785  [pdf, other]

    eess.AS cs.SD

    Automatic Quality Control and Enhancement for Voice-Based Remote Parkinson's Disease Detection

    Authors: Amir Hossein Poorjam, Mathew Shaji Kavalekalam, Liming Shi, Yordan P. Raykov, Jesper Rindom Jensen, Max A. Little, Mads Græsbøll Christensen

    Abstract: The performance of voice-based Parkinson's disease (PD) detection systems degrades when there is an acoustic mismatch between training and operating conditions caused mainly by degradation in test signals. In this paper, we address this mismatch by considering three types of degradation commonly encountered in remote voice analysis, namely background noise, reverberation and nonlinear distortion,…

    Submitted 31 May, 2019; v1 submitted 28 May, 2019; originally announced May 2019.

    Comments: Preprint, 12 pages, 6 figures

  29. arXiv:1905.08557  [pdf, other]

    cs.SD cs.LG eess.AS

    Bayesian Pitch Tracking Based on the Harmonic Model

    Authors: Liming Shi, Jesper Kjaer Nielsen, Jesper Rindom Jensen, Max A. Little, Mads Graesboll Christensen

    Abstract: Fundamental frequency is one of the most important characteristics of speech and audio signals. Harmonic model-based fundamental frequency estimators offer a higher estimation accuracy and robustness against noise than the widely used autocorrelation-based methods. However, the traditional harmonic model-based estimators do not take the temporal smoothness of the fundamental frequency, the model o…

    Submitted 21 May, 2019; originally announced May 2019.

  30. arXiv:1811.06250  [pdf, other]

    eess.AS cs.LG cs.SD eess.IV

    Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems

    Authors: Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen

    Abstract: Humans tend to change their way of speaking when they are immersed in a noisy environment, a reflex known as Lombard effect. Current speech enhancement systems based on deep learning do not usually take into account this change in the speaking style, because they are trained with neutral (non-Lombard) speech utterances recorded under quiet conditions to which noise is artificially added. In this p…

    Submitted 15 November, 2018; originally announced November 2018.

  31. arXiv:1811.06234  [pdf, ps, other]

    eess.AS cs.LG cs.SD eess.IV

    On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

    Authors: Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen

    Abstract: Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and the objective function, which quantifies the…

    Submitted 15 November, 2018; originally announced November 2018.

  32. arXiv:1810.05677  [pdf, ps, other]

    eess.AS cs.SD

    Robust Joint Estimation of Multi-Microphone Signal Model Parameters

    Authors: Andreas I. Koutrouvelis, Richard C. Hendriks, Richard Heusdens, Jesper Jensen

    Abstract: One of the biggest challenges in multi-microphone applications is the estimation of the parameters of the signal model such as the power spectral densities (PSDs) of the sources, the early (relative) acoustic transfer functions of the sources with respect to the microphones, the PSD of late reverberation, and the PSDs of microphone-self noise. Typically, the existing methods estimate subsets of th…

    Submitted 12 October, 2018; originally announced October 2018.

  33. On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement

    Authors: Morten Kolbæk, Zheng-Hua Tan, Jesper Jensen

    Abstract: The majority of deep neural network (DNN) based speech enhancement algorithms rely on the mean-square error (MSE) criterion of short-time spectral amplitudes (STSA), which has no apparent link to human perception, e.g. speech intelligibility. Short-Time Objective Intelligibility (STOI), a popular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of…

    Submitted 4 December, 2018; v1 submitted 21 June, 2018; originally announced June 2018.

    Journal ref: Published in IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 2, pp. 283-295, 2018

  34. arXiv:1805.01692  [pdf, other]

    cs.SD cs.IT eess.AS

    A Convex Approximation of the Relaxed Binaural Beamforming Optimization Problem

    Authors: Andreas I. Koutrouvelis, Richard C. Hendriks, Richard Heusdens, Jesper Jensen

    Abstract: The recently proposed relaxed binaural beamforming (RBB) optimization problem provides a flexible trade-off between noise suppression and binaural-cue preservation of the sound sources in the acoustic scene. It minimizes the output noise power, under the constraints which guarantee that the target remains unchanged after processing and the binaural-cue distortions of the acoustic sources will be l…

    Submitted 4 May, 2018; originally announced May 2018.

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing, 27(2), 321-331, 2019

  35. arXiv:1802.00604  [pdf, other]

    cs.SD eess.AS

    Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure

    Authors: Morten Kolbæk, Zheng-Hua Tan, Jesper Jensen

    Abstract: In this paper we propose a Deep Neural Network (DNN) based Speech Enhancement (SE) system that is designed to maximize an approximation of the Short-Time Objective Intelligibility (STOI) measure. We formalize an approximate-STOI cost function and derive analytical expressions for the gradients required for DNN training and show that these gradients have desirable properties when used together with…

    Submitted 2 February, 2018; originally announced February 2018.

    Comments: To appear in ICASSP 2018

  36. arXiv:1708.09588  [pdf, ps, other]

    cs.SD eess.AS

    Joint Separation and Denoising of Noisy Multi-talker Speech using Recurrent Neural Networks and Permutation Invariant Training

    Authors: Morten Kolbæk, Dong Yu, Zheng-Hua Tan, Jesper Jensen

    Abstract: In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for speaker independent multi-talker speech separation and denoising, simultaneously. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT, for single-channel speaker independent multi-talker speech separation in multiple noisy conditions, including…

    Submitted 31 August, 2017; originally announced August 2017.

    Comments: To appear in MLSP 2017

  37. arXiv:1703.06284  [pdf, other]

    cs.SD cs.LG eess.AS

    Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks

    Authors: Morten Kolbæk, Dong Yu, Zheng-Hua Tan, Jesper Jensen

    Abstract: In this paper we propose the utterance-level Permutation Invariant Training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep learning based solution for speaker independent multi-talker speech separation. Specifically, uPIT extends the recently proposed Permutation Invariant Training (PIT) technique with an utterance-level cost function, hence eliminating the need for solving a…

    Submitted 11 July, 2017; v1 submitted 18 March, 2017; originally announced March 2017.

  38. Relaxed Binaural LCMV Beamforming

    Authors: Andreas I. Koutrouvelis, Richard C. Hendriks, Richard Heusdens, Jesper Jensen

    Abstract: In this paper we propose a new binaural beamforming technique which can be seen as a relaxation of the linearly constrained minimum variance (LCMV) framework. The proposed method can achieve simultaneous noise reduction and exact binaural cue preservation of the target source, similar to the binaural minimum variance distortionless response (BMVDR) method. However, unlike BMVDR, the proposed metho…

    Submitted 11 September, 2016; originally announced September 2016.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(1), 137-152, 2016

  39. arXiv:1607.00325  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation

    Authors: Dong Yu, Morten Kolbæk, Zheng-Hua Tan, Jesper Jensen

    Abstract: We propose a novel deep learning model, which supports permutation invariant training (PIT), for speaker independent multi-talker speech separation, commonly known as the cocktail-party problem. Different from most of the prior arts that treat speech separation as a multi-class regression problem and the deep clustering technique that considers it a segmentation (or clustering) problem, our model…

    Submitted 3 January, 2017; v1 submitted 1 July, 2016; originally announced July 2016.

    Comments: 5 pages