Search | arXiv e-print repository

arXiv:1905.01209 [pdf, other]

A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders

Authors: Manuel Pariente, Antoine Deleforge, Emmanuel Vincent

Abstract: Recent studies have explored the use of deep generative models of speech spectra based on variational autoencoders (VAEs), combined with unsupervised noise models, to perform speech enhancement. These studies developed iterative algorithms involving either Gibbs sampling or gradient descent at each step, making them computationally expensive. This paper proposes a variational inference method to iteratively estimate the power spectrogram of the clean speech. Our main contribution is the analytical derivation of the variational steps in which the encoder of the pre-learned VAE can be used to estimate the variational approximation of the true posterior distribution, using the very same assumption made to train VAEs. Experiments show that the proposed method produces results on par with the aforementioned iterative methods using sampling, while decreasing the computational cost by a factor of 36 to reach a given performance.

Submitted 14 May, 2019; v1 submitted 3 May, 2019; originally announced May 2019.

Comments: Submitted to INTERSPEECH 2019

arXiv:1711.04460 [pdf, other]

Blind Source Separation Using Mixtures of Alpha-Stable Distributions

Authors: Nicolas Keriven, Antoine Deleforge, Antoine Liutkus

Abstract: We propose a new blind source separation algorithm based on mixtures of alpha-stable distributions. Complex symmetric alpha-stable distributions have recently been shown to model audio signals in the time-frequency domain better than classical Gaussian distributions, thanks to their larger dynamic range. However, inference of these models is notoriously hard to perform because their probability density functions do not have a closed-form expression in general. Here, we introduce a novel method for estimating mixtures of alpha-stable distributions based on characteristic function matching. We apply this to the blind estimation of binary masks in individual frequency bands from multichannel convolutive audio mixes. We show that the proposed method yields better separation performance than Gaussian-based binary-masking methods.

Submitted 12 February, 2018; v1 submitted 13 November, 2017; originally announced November 2017.

Comments: International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2018, Calgary, Canada

arXiv:1609.09744 [pdf, other]

Phase Unmixing: Multichannel Source Separation with Magnitude Constraints

Authors: Antoine Deleforge, Yann Traonmilin

Abstract: We consider the problem of estimating the phases of K mixed complex signals from a multichannel observation, when the mixing matrix and signal magnitudes are known. This problem can be cast as a non-convex quadratically constrained quadratic program which is known to be NP-hard in general. We propose three approaches to tackle it: a heuristic method, an alternate minimization method, and a convex relaxation into a semi-definite program. The last two approaches are shown to outperform the oracle multichannel Wiener filter in under-determined informed source separation tasks, using simulated and speech signals. The convex relaxation approach yields the best results, including the potential for exact source separation in under-determined settings.

Submitted 20 March, 2017; v1 submitted 30 September, 2016; originally announced September 2016.

Comments: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2017, New Orleans, United States

Report number: hal-01372418

arXiv:1609.09743 [pdf, other]

Rectified binaural ratio: A complex T-distributed feature for robust sound localization

Authors: Antoine Deleforge, Florence Forbes

Abstract: Most existing methods in binaural sound source localization rely on some kind of aggregation of phase- and level-difference cues in the time-frequency plane. While different aggregation schemes exist, they are often heuristic and suffer in adverse noise conditions. In this paper, we introduce the rectified binaural ratio as a new feature for sound source localization. We show that for Gaussian-process point source signals corrupted by stationary Gaussian noise, this ratio follows a complex t-distribution with explicit parameters. This new formulation provides a principled and statistically sound way to aggregate binaural features in the presence of noise. We subsequently derive two simple and efficient methods for robust relative transfer function and time-delay estimation. Experiments on heavily corrupted simulated and speech signals demonstrate the robustness of the proposed scheme.

Submitted 30 September, 2016; originally announced September 2016.

Comments: European Signal Processing Conference, Aug 2016, Budapest, Hungary. Proceedings of the 24th European Signal Processing Conference (EUSIPCO), 2016

arXiv:1409.8500 [pdf, other]

doi 10.1109/JSTSP.2015.2416677

Hyper-Spectral Image Analysis with Partially-Latent Regression and Spatial Markov Dependencies

Authors: Antoine Deleforge, Florence Forbes, Sileye Ba, Radu Horaud

Abstract: Hyper-spectral data can be analyzed to recover physical properties at large planetary scales. This involves solving inverse problems which can be addressed within machine learning, with the advantage that, once a relationship between physical parameters and spectra has been established in a data-driven fashion, the learned relationship can be used to estimate physical parameters for new hyper-spectral observations. Within this framework, we propose a spatially-constrained and partially-latent regression method which maps high-dimensional inputs (hyper-spectral images) onto low-dimensional responses (physical parameters such as the local chemical composition of the soil). The proposed regression model comprises two key features. Firstly, it combines a Gaussian mixture of locally-linear mappings (GLLiM) with a partially-latent response model. While the former makes high-dimensional regression tractable, the latter makes it possible to deal with physical parameters that cannot be observed or, more generally, with data contaminated by experimental artifacts that cannot be explained with noise models. Secondly, spatial constraints are introduced in the model through a Markov random field (MRF) prior which provides a spatial structure to the Gaussian-mixture hidden variables. Experiments conducted on a database composed of remotely sensed observations collected from the planet Mars by the Mars Express orbiter demonstrate the effectiveness of the proposed model.

Submitted 27 March, 2015; v1 submitted 30 September, 2014; originally announced September 2014.

Comments: 12 pages, 4 figures, 3 tables

Journal ref: IEEE Journal on Selected Topics in Signal Processing, volume 9, number 6, 1037-1048, 2015

arXiv:1408.2700 [pdf, other]

doi 10.1109/TASLP.2015.2405475

Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression

Authors: Antoine Deleforge, Radu Horaud, Yoav Schechner, Laurent Girin

Abstract: This paper addresses the problem of localizing audio sources using binaural measurements. We propose a supervised formulation that simultaneously localizes multiple sources at different locations. The approach is intrinsically efficient because, contrary to prior work, it relies neither on source separation nor on monaural segregation. The method starts with a training stage that establishes a locally-linear Gaussian regression model between the directional coordinates of all the sources and the auditory features extracted from binaural measurements. While fixed-length wide-spectrum sounds (white noise) are used for training to reliably estimate the model parameters, we show that the testing (localization) can be extended to variable-length sparse-spectrum sounds (such as speech), thus enabling a wide range of realistic applications. Indeed, we demonstrate that the method can be used for audio-visual fusion, namely to map speech signals onto images and hence to spatially align the audio and visual modalities, thus making it possible to discriminate between speaking and non-speaking faces. We release a novel corpus of real-room recordings that allow quantitative evaluation of the co-localization method in the presence of one or two sound sources. Experiments demonstrate increased accuracy and speed relative to several state-of-the-art methods.

Submitted 15 April, 2016; v1 submitted 12 August, 2014; originally announced August 2014.

Comments: 15 pages, 8 figures

Journal ref: IEEE Transactions on Audio, Speech, and Language Processing 23(4), 718-731, April, 2015

arXiv:1308.2302 [pdf, ps, other]

doi 10.1007/s11222-014-9461-5

High-Dimensional Regression with Gaussian Mixtures and Partially-Latent Response Variables

Authors: Antoine Deleforge, Florence Forbes, Radu Horaud

Abstract: In this work we address the problem of approximating high-dimensional data with a low-dimensional representation. We make the following contributions. We propose an inverse regression method which exchanges the roles of input and response, such that the low-dimensional variable becomes the regressor, and which is tractable. We introduce a mixture of locally-linear probabilistic mapping model that starts with estimating the parameters of inverse regression, and follows with inferring closed-form solutions for the forward parameters of the high-dimensional regression problem of interest. Moreover, we introduce a partially-latent paradigm, such that the vector-valued response variable is composed of both observed and latent entries, thus being able to deal with data contaminated by experimental artifacts that cannot be explained with noise models. The proposed probabilistic formulation could be viewed as a latent-variable augmentation of regression. We devise expectation-maximization (EM) procedures based on a data augmentation strategy which facilitates the maximum-likelihood search over the model parameters. We propose two augmentation schemes and we describe in detail the associated EM inference procedures that may well be viewed as generalizations of a number of EM regression, dimension reduction, and factor analysis algorithms. The proposed framework is validated with both synthetic and real data. We provide experimental evidence that our method outperforms several existing regression techniques.

Submitted 20 December, 2013; v1 submitted 10 August, 2013; originally announced August 2013.

Journal ref: Statistics and Computing, 25(5), 893-911, 2015

Showing 1–7 of 7 results for author: Deleforge, A