
Fully Reversing the Shoebox Image Source Method: From Impulse Responses to Room Parameters

Authors: Tom Sprunck, Antoine Deleforge, Yannick Privat, Cédric Foy

Abstract: We present an algorithm that fully reverses the shoebox image source method (ISM), a popular and widely used room impulse response (RIR) simulator for cuboid rooms introduced by Allen and Berkley in 1979. More precisely, given a discrete multichannel RIR generated by the shoebox ISM for a microphone array of known geometry, the algorithm reliably recovers the 18 input parameters. These are the 3D source position, the 3 dimensions of the room, the 6-degrees-of-freedom room translation and orientation, and an absorption coefficient for each of the 6 room boundaries. The approach builds on a recently proposed gridless image source localization technique combined with new procedures for room axes recovery and first-order-reflection identification. Extensive simulated experiments reveal that near-exact recovery of all parameters is achieved for a 32-element, 8.4-cm-wide spherical microphone array and a sampling rate of 16 kHz using fully randomized input parameters within rooms of size 2×2×2 to 10×10×5 meters. Estimation errors decay towards zero when increasing the array size and sampling rate. The method is also shown to strongly outperform a known baseline, and its ability to extrapolate RIRs at new positions is demonstrated. Crucially, the approach is strictly limited to low-passed discrete RIRs simulated using the vanilla shoebox ISM. Nonetheless, it represents to our knowledge the first algorithmic demonstration that this difficult inverse problem is in principle fully solvable over a wide range of configurations.

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2308.02560 [pdf, other]

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez

Abstract: Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generating audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or to generate signals at relatively low sampling rates. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and evaluation code, along with audio samples, are available on the facebookresearch/audiocraft GitHub page.

Submitted 8 November, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

Comments: 10 pages

Journal ref: Thirty-seventh Conference on Neural Information Processing Systems (2023)

arXiv:2211.16958 [pdf, ps, other]

How to (virtually) train your speaker localizer

Authors: Prerak Srivastava, Antoine Deleforge, Archontis Politis, Emmanuel Vincent

Abstract: Learning-based methods have become ubiquitous in speaker localization. Existing systems rely on simulated training sets for lack of sufficiently large, diverse and annotated real datasets. Most room acoustics simulators used for this purpose rely on the image source method (ISM) because of its computational efficiency. This paper argues that carefully extending the ISM to incorporate more realistic surface, source and microphone responses into training sets can significantly boost the real-world performance of speaker localization systems. It is shown that increasing the training-set realism of a state-of-the-art direction-of-arrival estimator yields consistent improvements across three different real test sets featuring human speakers in a variety of rooms and various microphone arrays. An ablation study further reveals that every added layer of realism contributes positively to these improvements.

Submitted 25 May, 2023; v1 submitted 30 November, 2022; originally announced November 2022.

Comments: Published in INTERSPEECH 2023

arXiv:2208.14017 [pdf, ps, other]

doi 10.1109/LSP.2022.3224682

Gridless 3D Recovery of Image Sources from Room Impulse Responses

Authors: Tom Sprunck, Yannick Privat, Cédric Foy, Antoine Deleforge

Abstract: Given a sound field generated by a sparse distribution of impulse image sources, can the continuous 3D positions and amplitudes of these sources be recovered from discrete, bandlimited measurements of the field at a finite set of locations, e.g., a multichannel room impulse response? Borrowing from recent advances in super-resolution imaging, it is shown that this nonlinear, non-convex inverse problem can be efficiently relaxed into a convex linear inverse problem over the space of Radon measures in R^3. The linear operator introduced here stems from the fundamental solution of the free-field inhomogeneous wave equation combined with the receivers' responses. An adaptation of the Sliding Frank-Wolfe algorithm is proposed to numerically solve the problem off-the-grid, i.e., in continuous 3D space. Simulated experiments show that the approach achieves near-exact recovery of hundreds of image sources using an arbitrarily placed compact 32-channel spherical microphone array in random rectangular rooms. The impact of noise, sampling rate and array diameter on these results is also examined.

Submitted 7 December, 2022; v1 submitted 30 August, 2022; originally announced August 2022.

Comments: IEEE Signal Processing Letters, 2022

arXiv:2207.09133 [pdf, other]

Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators

Authors: Prerak Srivastava, Antoine Deleforge, Emmanuel Vincent

Abstract: Blind acoustic parameter estimation consists in inferring the acoustic properties of an environment from recordings of unknown sound sources. Recent works in this area have utilized deep neural networks trained either partially or exclusively on simulated data, due to the limited availability of real annotated measurements. In this paper, we study whether a model purely trained using a fast image-source room impulse response simulator can generalize to real data. We present an ablation study on carefully crafted simulated training sets that account for different levels of realism in source, receiver and wall responses. The extent of realism is controlled by the sampling of wall absorption coefficients and by applying measured directivity patterns to microphones and sources. A state-of-the-art model trained on these datasets is evaluated on the task of jointly estimating the room's volume, total surface area, and octave-band reverberation times from multiple, multichannel speech recordings. Results reveal that every added layer of simulation realism at train time significantly improves the estimation of all quantities on real signals.

Submitted 19 July, 2022; originally announced July 2022.

arXiv:2111.08327 [pdf, other]

Detecting acoustic reflectors using a robot's ego-noise

Authors: Usama Saqib, Antoine Deleforge, Jesper Jensen

Abstract: In this paper, we propose a method to estimate the proximity of an acoustic reflector, e.g., a wall, using ego-noise, i.e., the noise produced by the moving parts of a listening robot. This is achieved by estimating the times of arrival of acoustic echoes reflected from the surface. Simulated experiments show that the proposed nonintrusive approach is capable of accurately estimating the distance of a reflector up to 1 meter and outperforms a previously proposed intrusive approach under loud ego-noise conditions. The proposed method is helped by a probabilistic echo detector that estimates whether or not an acoustic reflector is within a short range of the robotic platform. This preliminary investigation paves the way towards a new kind of collision avoidance system that would purely rely on audio sensors rather than conventional proximity sensors.

Submitted 16 November, 2021; originally announced November 2021.

Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2021, Toronto, Canada

arXiv:2109.00393 [pdf, other]

doi 10.1121/10.0005888

Mean absorption estimation from room impulse responses using virtually supervised learning

Authors: Cédric Foy, Antoine Deleforge, Diego Di Carlo

Abstract: In the context of building acoustics and the acoustic diagnosis of an existing room, this paper introduces and investigates a new approach to estimate mean absorption coefficients solely from a room impulse response (RIR). This inverse problem is tackled via virtually-supervised learning, namely, the RIR-to-absorption mapping is implicitly learned by regression on a simulated dataset using artificial neural networks. We focus on simple models based on well-understood architectures. The critical choices of geometric, acoustic and simulation parameters used to train the models are extensively discussed and studied, while keeping in mind conditions that are representative of the field of building acoustics. Estimation errors from the learned neural models are compared to those obtained with classical formulas that require knowledge of the room's geometry and reverberation times. Extensive comparisons made on a variety of simulated test sets highlight different conditions under which the learned models can overcome the well-known limitations of the diffuse sound field hypothesis underlying these formulas. Results obtained on real RIRs measured in an acoustically configurable room show that at 1 kHz and above, the proposed approach performs comparably to classical models when reverberation times can be reliably estimated, and continues to work even when they cannot.

Submitted 1 September, 2021; originally announced September 2021.

Journal ref: Journal of the Acoustical Society of America, Acoustical Society of America, 2021, 150 (2), pp.1286-1299

arXiv:2107.13832 [pdf, other]

Blind Room Parameter Estimation Using Multiple-Multichannel Speech Recordings

Authors: Prerak Srivastava, Antoine Deleforge, Emmanuel Vincent

Abstract: Knowing the geometrical and acoustical parameters of a room may benefit applications such as audio augmented reality, speech dereverberation or audio forensics. In this paper, we study the problem of jointly estimating the total surface area, the volume, as well as the frequency-dependent reverberation time and mean surface absorption of a room in a blind fashion, based on two-channel noisy speech recordings from multiple, unknown source-receiver positions. A novel convolutional neural network architecture leveraging both single- and inter-channel cues is proposed and trained on a large, realistic simulated dataset. Results on both simulated and real data show that using multiple observations in one room significantly reduces estimation errors and variances on all target quantities, and that using two channels helps the estimation of surface and volume. The proposed model outperforms a recently proposed blind volume estimation method on the considered datasets.

Submitted 29 July, 2021; originally announced July 2021.

Comments: Accepted in WASPAA 2021 (IEEE Workshop on Applications of Signal Processing to Audio and Acoustics)

arXiv:2106.06999 [pdf, other]

A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection

Authors: Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen

Abstract: This report presents the dataset and baseline of Task 3 of the DCASE2021 Challenge on Sound Event Localization and Detection (SELD). The dataset is based on emulation of real recordings of static or moving sound events under real conditions of reverberation and ambient noise, using spatial room impulse responses captured in a variety of rooms and delivered in two spatial formats. The acoustical synthesis remains the same as in the previous iteration of the challenge; however, the new dataset brings more challenging conditions of polyphony and overlapping instances of the same class. The most important difference of the new dataset is the introduction of directional interferers, meaning sound events that are localized in space but do not belong to the target classes to be detected and are not annotated. Since such interfering events are expected in every real-world scenario of SELD, the new dataset aims to promote systems that deal with this condition effectively. A modified SELDnet baseline employing the recent ACCDOA representation of SELD problems accompanies the dataset and it is shown to outperform the previous one. The new dataset is shown to be significantly more challenging for both baselines according to all considered metrics. To investigate the individual and combined effects of ambient noise, interferers, and reverberation, we study the performance of the baseline on different versions of the dataset excluding or including combinations of these factors. The results indicate that by far the most detrimental effects are caused by directional interferers.

Submitted 4 July, 2021; v1 submitted 13 June, 2021; originally announced June 2021.

arXiv:2104.13168 [pdf, other]

dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing

Authors: Diego Di Carlo, Pinchas Tandeitnik, Cédric Foy, Antoine Deleforge, Nancy Bertin, Sharon Gannot

Abstract: This paper presents dEchorate: a new database of measured multichannel Room Impulse Responses (RIRs) including annotations of early echo timings and 3D positions of microphones, real sources and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling and reflectors estimation. The database is accompanied by software utilities to easily access, manipulate and visualize the data as well as baseline methods for echo-related tasks.

Submitted 27 April, 2021; originally announced April 2021.

arXiv:2005.04132 [pdf, other]

Asteroid: the PyTorch-based audio source separation toolkit for researchers

Authors: Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, David Ditter, Ariel Frank, Antoine Deleforge, Emmanuel Vincent

Abstract: This paper describes Asteroid, the PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes on common audio source separation datasets are also provided. This paper describes the software architecture of Asteroid and its most important features. By showing experimental results obtained with Asteroid's recipes, we show that our implementations are at least on par with most results reported in reference papers. The toolkit is publicly available at https://github.com/mpariente/asteroid.

Submitted 8 May, 2020; originally announced May 2020.

Comments: Submitted to Interspeech 2020

arXiv:1910.10400 [pdf, other]

Filterbank design for end-to-end speech separation

Authors: Manuel Pariente, Samuele Cornell, Antoine Deleforge, Emmanuel Vincent

Abstract: Single-channel speech separation has recently made great progress thanks to learned filterbanks as used in ConvTasNet. In parallel, parameterized filterbanks have been proposed for speaker recognition where only center frequencies and bandwidths are learned. In this work, we extend real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks and define a set of corresponding representations and masking strategies. We evaluate these filterbanks on a newly released noisy speech separation dataset (WHAM). The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of ConvTasNet. Also, we validate the use of parameterized filterbanks and show that complex-valued representations and masks are beneficial in all conditions. Finally, we show that the STFT achieves its best performance for 2 ms windows.

Submitted 28 February, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

Comments: ICASSP 2020

arXiv:1907.04655 [pdf, other]

doi 10.1109/MSP.2019.2924687

Audio-Based Search and Rescue with a Drone: Highlights from the IEEE Signal Processing Cup 2019 Student Competition

Authors: Antoine Deleforge, Diego Di Carlo, Martin Strauss, Romain Serizel, Lucio Marcenaro

Abstract: Unmanned aerial vehicles (UAV), commonly referred to as drones, have raised increasing interest in recent years. Search and rescue scenarios where humans in emergency situations need to be quickly found in areas difficult to access constitute an important field of application for this technology. While research efforts have mostly focused on developing video-based solutions for this task, UAV-embedded audio-based localization has received relatively less attention. Yet UAVs equipped with a microphone array could be of critical help to localize people in emergency situations, in particular when video sensors are limited by a lack of visual feedback due to bad lighting conditions or obstacles limiting the field of view. This motivated the topic of the 6th edition of the IEEE Signal Processing Cup (SP Cup): a UAV-embedded sound source localization challenge for search and rescue. In this article, we share an overview of the IEEE SP Cup experience including the competition tasks, participating teams, technical approaches and statistics.

Submitted 3 July, 2019; originally announced July 2019.

Journal ref: IEEE Signal Processing Magazine, Institute of Electrical and Electronics Engineers, In press

arXiv:1905.01209 [pdf, other]

A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders

Authors: Manuel Pariente, Antoine Deleforge, Emmanuel Vincent

Abstract: Recent studies have explored the use of deep generative models of speech spectra based on variational autoencoders (VAEs), combined with unsupervised noise models, to perform speech enhancement. These studies developed iterative algorithms involving either Gibbs sampling or gradient descent at each step, making them computationally expensive. This paper proposes a variational inference method to iteratively estimate the power spectrogram of the clean speech. Our main contribution is the analytical derivation of the variational steps in which the encoder of the pre-learned VAE can be used to estimate the variational approximation of the true posterior distribution, using the very same assumption made to train VAEs. Experiments show that the proposed method produces results on par with the aforementioned iterative methods using sampling, while decreasing the computational cost by a factor of 36 to reach a given performance.

Submitted 14 May, 2019; v1 submitted 3 May, 2019; originally announced May 2019.

Comments: Submitted to INTERSPEECH 2019

arXiv:1812.05901 [pdf, ps, other]

Evaluation of an open-source implementation of the SRP-PHAT algorithm within the 2018 LOCATA challenge

Authors: Romain Lebarbenchon, Ewen Camberlein, Diego di Carlo, Clément Gaultier, Antoine Deleforge, Nancy Bertin

Abstract: This short paper presents an efficient, flexible implementation of the SRP-PHAT multichannel sound source localization method. The method is evaluated on the single-source tasks of the LOCATA 2018 development dataset, and an associated Matlab toolbox is made available online.

Submitted 14 December, 2018; originally announced December 2018.

Comments: In Proceedings of the LOCATA Challenge Workshop - a satellite event of IWAENC 2018 (arXiv:1811.08482)

Report number: LOCATAchallenge/2018/01

arXiv:1810.13338 [pdf, other]

MULAN: A Blind and Off-Grid Method for Multichannel Echo Retrieval

Authors: Helena Peic Tukuljac, Antoine Deleforge, Rémi Gribonval

Abstract: This paper addresses the general problem of blind echo retrieval, i.e., given M sensors measuring in the discrete-time domain M mixtures of K delayed and attenuated copies of an unknown source signal, can the echo locations and weights be recovered? This problem has broad applications in fields such as sonar, seismology, ultrasound or room acoustics. It belongs to the broader class of blind channel identification problems, which have been intensively studied in signal processing. Existing methods in the literature proceed in two steps: (i) blind estimation of sparse discrete-time filters and (ii) echo information retrieval by peak-picking on filters. The precision of these methods is fundamentally limited by the rate at which the signals are sampled: estimated echo locations are necessarily on-grid, and since true locations never match the sampling grid, the weight estimation precision is impacted. This is the so-called basis-mismatch problem in compressed sensing. We propose a radically different approach to the problem, building on the framework of finite-rate-of-innovation sampling. The approach operates directly in the parameter-space of echo locations and weights, and enables near-exact blind and off-grid echo retrieval from discrete-time measurements. It is shown to outperform conventional methods by several orders of magnitude in precision.

Submitted 31 October, 2018; originally announced October 2018.

Journal ref: Thirty-second Conference on Neural Information Processing Systems (NIPS 2018), Dec 2018, Montréal, Canada

arXiv:1711.06805 [pdf, other]

doi 10.1109/ICASSP.2018.8461345

Separake: Source Separation with a Little Help From Echoes

Authors: Robin Scheibler, Diego Di Carlo, Antoine Deleforge, Ivan Dokmanić

Abstract: It is commonly believed that multipath hurts various audio processing algorithms. At odds with this belief, we show that multipath in fact helps sound source separation, even with very simple propagation models. Unlike most existing methods, we neither ignore the room impulse responses, nor do we attempt to estimate them fully. We rather assume that we know the positions of a few virtual microphones generated by echoes and we show how this gives us enough spatial diversity to get a performance boost over the anechoic case. We show improvements for two standard algorithms: one that uses only magnitudes of the transfer functions, and one that also uses the phases. Concretely, we show that multichannel non-negative matrix factorization aided with a small number of echoes beats the vanilla variant of the same algorithm, and that with magnitude information only, echoes enable separation where it was previously impossible.

Submitted 17 November, 2017; originally announced November 2017.

arXiv:1711.04460 [pdf, other]

Blind Source Separation Using Mixtures of Alpha-Stable Distributions

Authors: Nicolas Keriven, Antoine Deleforge, Antoine Liutkus

Abstract: We propose a new blind source separation algorithm based on mixtures of alpha-stable distributions. Complex symmetric alpha-stable distributions have recently been shown to better model audio signals in the time-frequency domain than classical Gaussian distributions thanks to their larger dynamic range. However, inference of these models is notoriously hard to perform because their probability density functions do not have a closed-form expression in general. Here, we introduce a novel method for estimating mixtures of alpha-stable distributions based on characteristic function matching. We apply this to the blind estimation of binary masks in individual frequency bands from multichannel convolutive audio mixes. We show that the proposed method yields better separation performance than Gaussian-based binary-masking methods.

Submitted 12 February, 2018; v1 submitted 13 November, 2017; originally announced November 2017.

Comments: International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2018, Calgary, Canada

arXiv:1704.08972 [pdf, ps, other]

Phase retrieval with a multivariate Von Mises prior: from a Bayesian formulation to a lifting solution

Authors: Angelique Dremeau, Antoine Deleforge

Abstract: In this paper, we investigate a new method for phase recovery when prior information on the missing phases is available. In particular, we propose to take into account this information in a generic fashion by means of a multivariate Von Mises distribution. Building on a Bayesian formulation (a Maximum A Posteriori estimation), we show that the problem can be expressed using a Mahalanobis distance and be solved by a lifting optimization procedure.

Submitted 28 April, 2017; originally announced April 2017.

Comments: Preprint of the paper published in the proc. of ICASSP'17

arXiv:1612.06287 [pdf, other]

VAST : The Virtual Acoustic Space Traveler Dataset

Authors: Clément Gaultier, Saurabh Kataria, Antoine Deleforge

Abstract: This paper introduces a new paradigm for sound source localization referred to as virtual acoustic space traveling (VAST) and presents a first dataset designed for this purpose. Existing sound source localization methods are either based on an approximate physical model (physics-driven) or on a specific-purpose calibration set (data-driven). With VAST, the idea is to learn a mapping from audio features to desired audio properties using a massive dataset of simulated room impulse responses. This virtual dataset is designed to be maximally representative of the potential audio scenes that the considered system may be evolving in, while remaining reasonably compact. We show that virtually-learned mappings on this dataset generalize to real data, overcoming some intrinsic limitations of traditional binaural sound localization methods based on time differences of arrival.

Submitted 14 December, 2016; originally announced December 2016.

Comments: International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Feb 2017, Grenoble, France

arXiv:1609.09747 [pdf, other]

Hearing in a shoe-box : binaural source position and wall absorption estimation using virtually supervised learning

Authors: Saurabh Kataria, Clément Gaultier, Antoine Deleforge

Abstract: This paper introduces a new framework for supervised sound source localization referred to as virtually-supervised learning. An acoustic shoe-box room simulator is used to generate a large number of binaural single-source audio scenes. These scenes are used to build a dataset of spatial binaural features annotated with acoustic properties such as the 3D source position and the walls' absorption coefficients. A probabilistic high- to low-dimensional regression framework is used to learn a mapping from these features to the acoustic properties. Results indicate that this mapping successfully estimates the azimuth and elevation of new sources, but also their range and even the walls' absorption coefficients solely based on binaural signals. Results also reveal that incorporating random-diffusion effects in the data significantly improves the estimation of all parameters.

Submitted 20 March, 2017; v1 submitted 30 September, 2016; originally announced September 2016.

Comments: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2017, New Orleans, United States

Report number: hal-01372435

arXiv:1609.09744 [pdf, other]

Phase Unmixing : Multichannel Source Separation with Magnitude Constraints

Authors: Antoine Deleforge, Yann Traonmilin

Abstract: We consider the problem of estimating the phases of K mixed complex signals from a multichannel observation, when the mixing matrix and signal magnitudes are known. This problem can be cast as a non-convex quadratically constrained quadratic program which is known to be NP-hard in general. We propose three approaches to tackle it: a heuristic method, an alternate minimization method, and a convex relaxation into a semi-definite program. The last two approaches are shown to outperform the oracle multichannel Wiener filter in under-determined informed source separation tasks, using simulated and speech signals. The convex relaxation approach yields the best results, including the potential for exact source separation in under-determined settings.

Submitted 20 March, 2017; v1 submitted 30 September, 2016; originally announced September 2016.

Comments: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2017, New Orleans, United States

Report number: hal-01372418

arXiv:1609.09743 [pdf, other]

Rectified binaural ratio: A complex T-distributed feature for robust sound localization

Authors: Antoine Deleforge, Florence Forbes

Abstract: Most existing methods in binaural sound source localization rely on some kind of aggregation of phase- and level-difference cues in the time-frequency plane. While different aggregation schemes exist, they are often heuristic and suffer in adverse noise conditions. In this paper, we introduce the rectified binaural ratio as a new feature for sound source localization. We show that for Gaussian-process point source signals corrupted by stationary Gaussian noise, this ratio follows a complex t-distribution with explicit parameters. This new formulation provides a principled and statistically sound way to aggregate binaural features in the presence of noise. We subsequently derive two simple and efficient methods for robust relative transfer function and time-delay estimation. Experiments on heavily corrupted simulated and speech signals demonstrate the robustness of the proposed scheme.

Submitted 30 September, 2016; originally announced September 2016.

Comments: European Signal Processing Conference, Aug 2016, Budapest, Hungary. Proceedings of the 24th European Signal Processing Conference (EUSIPCO), 2016

arXiv:1603.09732 [pdf, other]

doi 10.1109/TIP.2017.2654165

Robust Head-Pose Estimation Based on Partially-Latent Mixture of Linear Regressions

Authors: Vincent Drouard, Radu Horaud, Antoine Deleforge, Silèye Ba, Georgios Evangelidis

Abstract: Head-pose estimation has many applications, such as social event analysis, human-robot and human-computer interaction, driving assistance, and so forth. Head-pose estimation is challenging because it must cope with changing illumination conditions, variabilities in face orientation and in appearance, partial occlusions of facial landmarks, as well as bounding-box-to-face alignment errors. We propose to use a mixture of linear regressions with partially-latent output. This regression method learns to map high-dimensional feature vectors (extracted from bounding boxes of faces) onto the joint space of head-pose angles and bounding-box shifts, such that they are robustly predicted in the presence of unobservable phenomena. We describe in detail the mapping method that combines the merits of unsupervised manifold learning techniques and of mixtures of regressions. We validate our method with three publicly available datasets and we thoroughly benchmark four variants of the proposed algorithm against several state-of-the-art head-pose estimation methods.

Submitted 6 March, 2017; v1 submitted 31 March, 2016; originally announced March 2016.

Comments: 12 pages, 5 figures, 3 tables

Journal ref: IEEE Transactions on Image Processing, volume 26, Issue 3, 1428-1440, 2017

arXiv:1507.00201 [pdf, ps, other]

Towards a Generalization of Relative Transfer Functions to More Than One Source

Authors: Antoine Deleforge, Sharon Gannot, Walter Kellermann

Abstract: We propose a natural way to generalize relative transfer functions (RTFs) to more than one source. We first prove that such a generalization is not possible using a single multichannel spectro-temporal observation, regardless of the number of microphones. We then introduce a new transform for multichannel multi-frame spectrograms, i.e., containing several channels and time frames in each time-frequency bin. This transform allows a natural generalization which satisfies the three key properties of RTFs, namely, they can be directly estimated from observed signals, they capture spatial properties of the sources and they do not depend on emitted signals. Through simulated experiments, we show how this new method can localize multiple simultaneously active sound sources using short spectro-temporal windows, without relying on source separation.

Submitted 1 July, 2015; originally announced July 2015.

arXiv:1410.2430 [pdf, ps, other]

Phase-Optimized K-SVD for Signal Extraction from Underdetermined Multichannel Sparse Mixtures

Authors: Antoine Deleforge, Walter Kellermann

Abstract: We propose a novel sparse representation for heavily underdetermined multichannel sound mixtures, i.e., with many more sources than microphones. The proposed approach operates in the complex Fourier domain, thus preserving spatial characteristics carried by phase differences. We derive a generalization of K-SVD which jointly estimates a dictionary capturing both spectral and spatial features, a sparse activation matrix, and all instantaneous source phases from a set of signal examples. The dictionary can then be used to extract the learned signal from a new input mixture. The method is applied to the challenging problem of ego-noise reduction for robot audition. We demonstrate its superiority relative to conventional dictionary-based techniques using recordings made in a real room.

Submitted 9 October, 2014; originally announced October 2014.

arXiv:1409.8500 [pdf, other]

doi 10.1109/JSTSP.2015.2416677

Hyper-Spectral Image Analysis with Partially-Latent Regression and Spatial Markov Dependencies

Authors: Antoine Deleforge, Florence Forbes, Sileye Ba, Radu Horaud

Abstract: Hyper-spectral data can be analyzed to recover physical properties at large planetary scales. This involves resolving inverse problems which can be addressed within machine learning, with the advantage that, once a relationship between physical parameters and spectra has been established in a data-driven fashion, the learned relationship can be used to estimate physical parameters for new hyper-spectral observations. Within this framework, we propose a spatially-constrained and partially-latent regression method which maps high-dimensional inputs (hyper-spectral images) onto low-dimensional responses (physical parameters such as the local chemical composition of the soil). The proposed regression model comprises two key features. Firstly, it combines a Gaussian mixture of locally-linear mappings (GLLiM) with a partially-latent response model. While the former makes high-dimensional regression tractable, the latter makes it possible to deal with physical parameters that cannot be observed or, more generally, with data contaminated by experimental artifacts that cannot be explained with noise models. Secondly, spatial constraints are introduced in the model through a Markov random field (MRF) prior which provides a spatial structure to the Gaussian-mixture hidden variables. Experiments conducted on a database composed of remotely sensed observations of the planet Mars collected by the Mars Express orbiter demonstrate the effectiveness of the proposed model.

Submitted 27 March, 2015; v1 submitted 30 September, 2014; originally announced September 2014.

Comments: 12 pages, 4 figures, 3 tables

Journal ref: IEEE Journal on Selected Topics in Signal Processing, volume 9, number 6, 1037-1048, 2015

arXiv:1408.2700 [pdf, other]

doi 10.1109/TASLP.2015.2405475

Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression

Authors: Antoine Deleforge, Radu Horaud, Yoav Schechner, Laurent Girin

Abstract: This paper addresses the problem of localizing audio sources using binaural measurements. We propose a supervised formulation that simultaneously localizes multiple sources at different locations. The approach is intrinsically efficient because, contrary to prior work, it relies neither on source separation, nor on monaural segregation. The method starts with a training stage that establishes a locally-linear Gaussian regression model between the directional coordinates of all the sources and the auditory features extracted from binaural measurements. While fixed-length wide-spectrum sounds (white noise) are used for training to reliably estimate the model parameters, we show that the testing (localization) can be extended to variable-length sparse-spectrum sounds (such as speech), thus enabling a wide range of realistic applications. Indeed, we demonstrate that the method can be used for audio-visual fusion, namely to map speech signals onto images and hence to spatially align the audio and visual modalities, thus making it possible to discriminate between speaking and non-speaking faces. We release a novel corpus of real-room recordings that allow quantitative evaluation of the co-localization method in the presence of one or two sound sources. Experiments demonstrate increased accuracy and speed relative to several state-of-the-art methods.

Submitted 15 April, 2016; v1 submitted 12 August, 2014; originally announced August 2014.

Comments: 15 pages, 8 figures

Journal ref: IEEE Transactions on Audio, Speech, and Language Processing 23(4), 718-731, April, 2015

arXiv:1402.2683 [pdf, other]

doi 10.1142/S0129065714400036

Acoustic Space Learning for Sound Source Separation and Localization on Binaural Manifolds

Authors: Antoine Deleforge, Florence Forbes, Radu Horaud

Abstract: In this paper we address the problems of modeling the acoustic space generated by a full-spectrum sound source and of using the learned model for the localization and separation of multiple sources that simultaneously emit sparse-spectrum sounds. We lay theoretical and methodological grounds in order to introduce the binaural manifold paradigm. We perform an in-depth study of the latent low-dimensional structure of the high-dimensional interaural spectral data, based on a corpus recorded with a human-like audiomotor robot head. A non-linear dimensionality reduction technique is used to show that these data lie on a two-dimensional (2D) smooth manifold parameterized by the motor states of the listener, or equivalently, the sound source directions. We propose a probabilistic piecewise affine mapping model (PPAM) specifically designed to deal with high-dimensional data exhibiting an intrinsic piecewise linear structure. We derive a closed-form expectation-maximization (EM) procedure for estimating the model parameters, followed by Bayes inversion for obtaining the full posterior density function of a sound source direction. We extend this solution to deal with missing data and redundancy in real world spectrograms, and hence for 2D localization of natural sound sources such as speech. We further generalize the model to the challenging case of multiple sound sources and we propose a variational EM framework. The associated algorithm, referred to as variational EM for source separation and localization (VESSL), yields a Bayesian estimation of the 2D locations and time-frequency masks of all the sources. Comparisons of the proposed approach with several existing methods reveal that the combination of acoustic-space learning with Bayesian inference enables our method to outperform state-of-the-art methods.

Submitted 20 March, 2014; v1 submitted 11 February, 2014; originally announced February 2014.

Comments: 19 pages, 9 figures, 3 tables

Journal ref: International Journal of Neural Systems 25(1) 2015

arXiv:1308.2302 [pdf, ps, other]

doi 10.1007/s11222-014-9461-5

High-Dimensional Regression with Gaussian Mixtures and Partially-Latent Response Variables

Authors: Antoine Deleforge, Florence Forbes, Radu Horaud

Abstract: In this work we address the problem of approximating high-dimensional data with a low-dimensional representation. We make the following contributions. We propose an inverse regression method which exchanges the roles of input and response, such that the low-dimensional variable becomes the regressor, and which is tractable. We introduce a mixture of locally-linear probabilistic mapping model that starts with estimating the parameters of inverse regression, and follows with inferring closed-form solutions for the forward parameters of the high-dimensional regression problem of interest. Moreover, we introduce a partially-latent paradigm, such that the vector-valued response variable is composed of both observed and latent entries, thus being able to deal with data contaminated by experimental artifacts that cannot be explained with noise models. The proposed probabilistic formulation could be viewed as a latent-variable augmentation of regression. We devise expectation-maximization (EM) procedures based on a data augmentation strategy which facilitates the maximum-likelihood search over the model parameters. We propose two augmentation schemes and we describe in detail the associated EM inference procedures that may well be viewed as generalizations of a number of EM regression, dimension reduction, and factor analysis algorithms. The proposed framework is validated with both synthetic and real data. We provide experimental evidence that our method outperforms several existing regression techniques.

Submitted 20 December, 2013; v1 submitted 10 August, 2013; originally announced August 2013.

Journal ref: Statistics and Computing, 25(5), 893-911, 2015

Showing 1–30 of 30 results for author: Deleforge, A