-
Fully Reversing the Shoebox Image Source Method: From Impulse Responses to Room Parameters
Authors:
Tom Sprunck,
Antoine Deleforge,
Yannick Privat,
Cédric Foy
Abstract:
We present an algorithm that fully reverses the shoebox image source method (ISM), a popular and widely used room impulse response (RIR) simulator for cuboid rooms introduced by Allen and Berkley in 1979. More precisely, given a discrete multichannel RIR generated by the shoebox ISM for a microphone array of known geometry, the algorithm reliably recovers the 18 input parameters. These are the 3D…
▽ More
We present an algorithm that fully reverses the shoebox image source method (ISM), a popular and widely used room impulse response (RIR) simulator for cuboid rooms introduced by Allen and Berkley in 1979. More precisely, given a discrete multichannel RIR generated by the shoebox ISM for a microphone array of known geometry, the algorithm reliably recovers the 18 input parameters. These are the 3D source position, the 3 dimensions of the room, the 6-degrees-of-freedom room translation and orientation, and an absorption coefficient for each of the 6 room boundaries. The approach builds on a recently proposed gridless image source localization technique combined with new procedures for room axes recovery and first-order-reflection identification. Extensive simulated experiments reveal that near-exact recovery of all parameters is achieved for a 32-element, 8.4-cm-wide spherical microphone array and a sampling rate of 16~kHz using fully randomized input parameters within rooms of size 2X2X2 to 10X10X5 meters. Estimation errors decay towards zero when increasing the array size and sampling rate. The method is also shown to strongly outperform a known baseline, and its ability to extrapolate RIRs at new positions is demonstrated. Crucially, the approach is strictly limited to low-passed discrete RIRs simulated using the vanilla shoebox ISM. Nonetheless, it represents to our knowledge the first algorithmic demonstration that this difficult inverse problem is in-principle fully solvable over a wide range of configurations.
△ Less
Submitted 6 May, 2024;
originally announced May 2024.
-
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Authors:
Robin San Roman,
Yossi Adi,
Antoine Deleforge,
Romain Serizel,
Gabriel Synnaeve,
Alexandre Défossez
Abstract:
Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the condi…
▽ More
Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or generating relatively low sampling rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and, evaluation code, along with audio samples, are available on the facebookresearch/audiocraft Github page.
△ Less
Submitted 8 November, 2023; v1 submitted 2 August, 2023;
originally announced August 2023.
-
How to (virtually) train your speaker localizer
Authors:
Prerak Srivastava,
Antoine Deleforge,
Archontis Politis,
Emmanuel Vincent
Abstract:
Learning-based methods have become ubiquitous in speaker localization. Existing systems rely on simulated training sets for the lack of sufficiently large, diverse and annotated real datasets. Most room acoustics simulators used for this purpose rely on the image source method (ISM) because of its computational efficiency. This paper argues that carefully extending the ISM to incorporate more real…
▽ More
Learning-based methods have become ubiquitous in speaker localization. Existing systems rely on simulated training sets for the lack of sufficiently large, diverse and annotated real datasets. Most room acoustics simulators used for this purpose rely on the image source method (ISM) because of its computational efficiency. This paper argues that carefully extending the ISM to incorporate more realistic surface, source and microphone responses into training sets can significantly boost the real-world performance of speaker localization systems. It is shown that increasing the training-set realism of a state-of-the-art direction-of-arrival estimator yields consistent improvements across three different real test sets featuring human speakers in a variety of rooms and various microphone arrays. An ablation study further reveals that every added layer of realism contributes positively to these improvements.
△ Less
Submitted 25 May, 2023; v1 submitted 30 November, 2022;
originally announced November 2022.
-
Gridless 3D Recovery of Image Sources from Room Impulse Responses
Authors:
Tom Sprunck,
Yannick Privat,
Cédric Foy,
Antoine Deleforge
Abstract:
Given a sound field generated by a sparse distribution of impulse image sources, can the continuous 3D positions and amplitudes of these sources be recovered from discrete, bandlimited measurements of the field at a finite set of locations, e.g., a multichannel room impulse response? Borrowing from recent advances in super-resolution imaging, it is shown that this nonlinear, non-convex inverse pro…
▽ More
Given a sound field generated by a sparse distribution of impulse image sources, can the continuous 3D positions and amplitudes of these sources be recovered from discrete, bandlimited measurements of the field at a finite set of locations, e.g., a multichannel room impulse response? Borrowing from recent advances in super-resolution imaging, it is shown that this nonlinear, non-convex inverse problem can be efficiently relaxed into a convex linear inverse problem over the space of Radon measures in R3. The linear operator introduced here stems from the fundamental solution of the free-field inhomogenous wave equation combined with the receivers' responses. An adaptation of the Sliding Frank-Wolfe algorithm is proposed to numerically solve the problem off-the-grid, i.e., in continuous 3D space. Simulated experiments show that the approach achieves near-exact recovery of hundreds of image sources using an arbitrarily placed compact 32-channel spherical microphone array in random rectangular rooms. The impact of noise, sampling rate and array diameter on these results is also examined.
△ Less
Submitted 7 December, 2022; v1 submitted 30 August, 2022;
originally announced August 2022.
-
Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators
Authors:
Prerak Srivastava,
Antoine Deleforge,
Emmanuel Vincent
Abstract:
Blind acoustic parameter estimation consists in inferring the acoustic properties of an environment from recordings of unknown sound sources. Recent works in this area have utilized deep neural networks trained either partially or exclusively on simulated data, due to the limited availability of real annotated measurements. In this paper, we study whether a model purely trained using a fast image-…
▽ More
Blind acoustic parameter estimation consists in inferring the acoustic properties of an environment from recordings of unknown sound sources. Recent works in this area have utilized deep neural networks trained either partially or exclusively on simulated data, due to the limited availability of real annotated measurements. In this paper, we study whether a model purely trained using a fast image-source room impulse response simulator can generalize to real data. We present an ablation study on carefully crafted simulated training sets that account for different levels of realism in source, receiver and wall responses. The extent of realism is controlled by the sampling of wall absorption coefficients and by applying measured directivity patterns to microphones and sources. A state-of-the-art model trained on these datasets is evaluated on the task of jointly estimating the room's volume, total surface area, and octave-band reverberation times from multiple, multichannel speech recordings. Results reveal that every added layer of simulation realism at train time significantly improves the estimation of all quantities on real signals.
△ Less
Submitted 19 July, 2022;
originally announced July 2022.
-
Detecting acoustic reflectors using a robot's ego-noise
Authors:
Usama Saqib,
Antoine Deleforge,
Jesper Jensen
Abstract:
In this paper, we propose a method to estimate the proximity of an acoustic reflector, e.g., a wall, using ego-noise, i.e., the noise produced by the moving parts of a listening robot. This is achieved by estimating the times of arrival of acoustic echoes reflected from the surface. Simulated experiments show that the proposed nonintrusive approach is capable of accurately estimating the distance…
▽ More
In this paper, we propose a method to estimate the proximity of an acoustic reflector, e.g., a wall, using ego-noise, i.e., the noise produced by the moving parts of a listening robot. This is achieved by estimating the times of arrival of acoustic echoes reflected from the surface. Simulated experiments show that the proposed nonintrusive approach is capable of accurately estimating the distance of a reflector up to 1 meter and outperforms a previously proposed intrusive approach under loud ego-noise conditions. The proposed method is helped by a probabilistic echo detector that estimates whether or not an acoustic reflector is within a short range of the robotic platform. This preliminary investigation paves the way towards a new kind of collision avoidance system that would purely rely on audio sensors rather than conventional proximity sensors.
△ Less
Submitted 16 November, 2021;
originally announced November 2021.
-
Mean absorption estimation from room impulse responses using virtually supervised learning
Authors:
Cédric Foy,
Antoine Deleforge,
Diego Di Carlo
Abstract:
In the context of building acoustics and the acoustic diagnosis of an existing room, this paper introduces and investigates a new approach to estimate mean absorption coefficients solely from a room impulse response (RIR). This inverse problem is tackled via virtually-supervised learning, namely, the RIR-to-absorption map** is implicitly learned by regression on a simulated dataset using artific…
▽ More
In the context of building acoustics and the acoustic diagnosis of an existing room, this paper introduces and investigates a new approach to estimate mean absorption coefficients solely from a room impulse response (RIR). This inverse problem is tackled via virtually-supervised learning, namely, the RIR-to-absorption map** is implicitly learned by regression on a simulated dataset using artificial neural networks. We focus on simple models based on well-understood architectures. The critical choices of geometric, acoustic and simulation parameters used to train the models are extensively discussed and studied, while kee** in mind conditions that are representative of the field of building acoustics. Estimation errors from the learned neural models are compared to those obtained with classical formulas that require knowledge of the room's geometry and reverberation times. Extensive comparisons made on a variety of simulated test sets highlight different conditions under which the learned models can overcome the well-known limitations of the diffuse sound field hypothesis underlying these formulas. Results obtained on real RIRs measured in an acoustically configurable room show that at 1~kHz and above, the proposed approach performs comparably to classical models when reverberation times can be reliably estimated, and continues to work even when they cannot.
△ Less
Submitted 1 September, 2021;
originally announced September 2021.
-
Blind Room Parameter Estimation Using Multiple-Multichannel Speech Recordings
Authors:
Prerak Srivastava,
Antoine Deleforge,
Emmanuel Vincent
Abstract:
Knowing the geometrical and acoustical parameters of a room may benefit applications such as audio augmented reality, speech dereverberation or audio forensics. In this paper, we study the problem of jointly estimating the total surface area, the volume, as well as the frequency-dependent reverberation time and mean surface absorption of a room in a blind fashion, based on two-channel noisy speech…
▽ More
Knowing the geometrical and acoustical parameters of a room may benefit applications such as audio augmented reality, speech dereverberation or audio forensics. In this paper, we study the problem of jointly estimating the total surface area, the volume, as well as the frequency-dependent reverberation time and mean surface absorption of a room in a blind fashion, based on two-channel noisy speech recordings from multiple, unknown source-receiver positions. A novel convolutional neural network architecture leveraging both single- and inter-channel cues is proposed and trained on a large, realistic simulated dataset. Results on both simulated and real data show that using multiple observations in one room significantly reduces estimation errors and variances on all target quantities, and that using two channels helps the estimation of surface and volume. The proposed model outperforms a recently proposed blind volume estimation method on the considered datasets.
△ Less
Submitted 29 July, 2021;
originally announced July 2021.
-
A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection
Authors:
Archontis Politis,
Sharath Adavanne,
Daniel Krause,
Antoine Deleforge,
Prerak Srivastava,
Tuomas Virtanen
Abstract:
This report presents the dataset and baseline of Task 3 of the DCASE2021 Challenge on Sound Event Localization and Detection (SELD). The dataset is based on emulation of real recordings of static or moving sound events under real conditions of reverberation and ambient noise, using spatial room impulse responses captured in a variety of rooms and delivered in two spatial formats. The acoustical sy…
▽ More
This report presents the dataset and baseline of Task 3 of the DCASE2021 Challenge on Sound Event Localization and Detection (SELD). The dataset is based on emulation of real recordings of static or moving sound events under real conditions of reverberation and ambient noise, using spatial room impulse responses captured in a variety of rooms and delivered in two spatial formats. The acoustical synthesis remains the same as in the previous iteration of the challenge, however the new dataset brings more challenging conditions of polyphony and overlap** instances of the same class. The most important difference of the new dataset is the introduction of directional interferers, meaning sound events that are localized in space but do not belong to the target classes to be detected and are not annotated. Since such interfering events are expected in every real-world scenario of SELD, the new dataset aims to promote systems that deal with this condition effectively. A modified SELDnet baseline employing the recent ACCDOA representation of SELD problems accompanies the dataset and it is shown to outperform the previous one. The new dataset is shown to be significantly more challenging for both baselines according to all considered metrics. To investigate the individual and combined effects of ambient noise, interferers, and reverberation, we study the performance of the baseline on different versions of the dataset excluding or including combinations of these factors. The results indicate that by far the most detrimental effects are caused by directional interferers.
△ Less
Submitted 4 July, 2021; v1 submitted 13 June, 2021;
originally announced June 2021.
-
dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing
Authors:
Diego Di Carlo,
Pinchas Tandeitnik,
Cédric Foy,
Antoine Deleforge,
Nancy Bertin,
Sharon Gannot
Abstract:
This paper presents dEchorate: a new database of measured multichannel Room Impulse Responses (RIRs) including annotations of early echo timings and 3D positions of microphones, real sources and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, aco…
▽ More
This paper presents dEchorate: a new database of measured multichannel Room Impulse Responses (RIRs) including annotations of early echo timings and 3D positions of microphones, real sources and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling and reflectors estimation. The database is accompanied with software utilities to easily access, manipulate and visualize the data as well as baseline methods for echo-related tasks.
△ Less
Submitted 27 April, 2021;
originally announced April 2021.
-
Asteroid: the PyTorch-based audio source separation toolkit for researchers
Authors:
Manuel Pariente,
Samuele Cornell,
Joris Cosentino,
Sunit Sivasankaran,
Efthymios Tzinis,
Jens Heitkaemper,
Michel Olvera,
Fabian-Robert Stöter,
Mathieu Hu,
Juan M. Martín-Doñas,
David Ditter,
Ariel Frank,
Antoine Deleforge,
Emmanuel Vincent
Abstract:
This paper describes Asteroid, the PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes on common audio source separation datasets are also provided. This paper describes the software architecture of Aste…
▽ More
This paper describes Asteroid, the PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes on common audio source separation datasets are also provided. This paper describes the software architecture of Asteroid and its most important features. By showing experimental results obtained with Asteroid's recipes, we show that our implementations are at least on par with most results reported in reference papers. The toolkit is publicly available at https://github.com/mpariente/asteroid .
△ Less
Submitted 8 May, 2020;
originally announced May 2020.
-
Filterbank design for end-to-end speech separation
Authors:
Manuel Pariente,
Samuele Cornell,
Antoine Deleforge,
Emmanuel Vincent
Abstract:
Single-channel speech separation has recently made great progress thanks to learned filterbanks as used in ConvTasNet. In parallel, parameterized filterbanks have been proposed for speaker recognition where only center frequencies and bandwidths are learned. In this work, we extend real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks and define a set of corres…
▽ More
Single-channel speech separation has recently made great progress thanks to learned filterbanks as used in ConvTasNet. In parallel, parameterized filterbanks have been proposed for speaker recognition where only center frequencies and bandwidths are learned. In this work, we extend real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks and define a set of corresponding representations and masking strategies. We evaluate these filterbanks on a newly released noisy speech separation dataset (WHAM). The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of ConvTasNet. Also, we validate the use of parameterized filterbanks and show that complex-valued representations and masks are beneficial in all conditions. Finally, we show that the STFT achieves its best performance for 2ms windows.
△ Less
Submitted 28 February, 2020; v1 submitted 23 October, 2019;
originally announced October 2019.
-
Audio-Based Search and Rescue with a Drone: Highlights from the IEEE Signal Processing Cup 2019 Student Competition
Authors:
Antoine Deleforge,
Diego Di Carlo,
Martin Strauss,
Romain Serizel,
Lucio Marcenaro
Abstract:
Unmanned aerial vehicles (UAV), commonly referred to as drones, have raised increasing interest in recent years. Search and rescue scenarios where humans in emergency situations need to be quickly found in areas difficult to access constitute an important field of application for this technology. While research efforts have mostly focused on develo** video-based solutions for this task \cite{lop…
▽ More
Unmanned aerial vehicles (UAV), commonly referred to as drones, have raised increasing interest in recent years. Search and rescue scenarios where humans in emergency situations need to be quickly found in areas difficult to access constitute an important field of application for this technology. While research efforts have mostly focused on develo** video-based solutions for this task \cite{lopez2017cvemergency}, UAV-embedded audio-based localization has received relatively less attention. Though, UAVs equipped with a microphone array could be of critical help to localize people in emergency situations, in particular when video sensors are limited by a lack of visual feedback due to bad lighting conditions or obstacles limiting the field of view. This motivated the topic of the 6th edition of the IEEE Signal Processing Cup (SP Cup): a UAV-embedded sound source localization challenge for search and rescue. In this article, we share an overview of the IEEE SP Cup experience including the competition tasks, participating teams, technical approaches and statistics.
△ Less
Submitted 3 July, 2019;
originally announced July 2019.
-
A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders
Authors:
Manuel Pariente,
Antoine Deleforge,
Emmanuel Vincent
Abstract:
Recent studies have explored the use of deep generative models of speech spectra based of variational autoencoders (VAEs), combined with unsupervised noise models, to perform speech enhancement. These studies developed iterative algorithms involving either Gibbs sampling or gradient descent at each step, making them computationally expensive. This paper proposes a variational inference method to i…
▽ More
Recent studies have explored the use of deep generative models of speech spectra based of variational autoencoders (VAEs), combined with unsupervised noise models, to perform speech enhancement. These studies developed iterative algorithms involving either Gibbs sampling or gradient descent at each step, making them computationally expensive. This paper proposes a variational inference method to iteratively estimate the power spectrogram of the clean speech. Our main contribution is the analytical derivation of the variational steps in which the en-coder of the pre-learned VAE can be used to estimate the varia-tional approximation of the true posterior distribution, using the very same assumption made to train VAEs. Experiments show that the proposed method produces results on par with the afore-mentioned iterative methods using sampling, while decreasing the computational cost by a factor 36 to reach a given performance .
△ Less
Submitted 14 May, 2019; v1 submitted 3 May, 2019;
originally announced May 2019.
-
Evaluation of an open-source implementation of the SRP-PHAT algorithm within the 2018 LOCATA challenge
Authors:
Romain Lebarbenchon,
Ewen Camberlein,
Diego di Carlo,
Clément Gaultier,
Antoine Deleforge,
Nancy Bertin
Abstract:
This short paper presents an efficient, flexible implementation of the SRP-PHAT multichannel sound source localization method. The method is evaluated on the single-source tasks of the LOCATA 2018 development dataset, and an associated Matlab toolbox is made available online.
This short paper presents an efficient, flexible implementation of the SRP-PHAT multichannel sound source localization method. The method is evaluated on the single-source tasks of the LOCATA 2018 development dataset, and an associated Matlab toolbox is made available online.
△ Less
Submitted 14 December, 2018;
originally announced December 2018.
-
MULAN: A Blind and Off-Grid Method for Multichannel Echo Retrieval
Authors:
Helena Peic Tukuljac,
Antoine Deleforge,
Rémi Gribonval
Abstract:
This paper addresses the general problem of blind echo retrieval, i.e., given M sensors measuring in the discrete-time domain M mixtures of K delayed and attenuated copies of an unknown source signal, can the echo locations and weights be recovered? This problem has broad applications in fields such as sonars, seismol-ogy, ultrasounds or room acoustics. It belongs to the broader class of blind cha…
▽ More
This paper addresses the general problem of blind echo retrieval, i.e., given M sensors measuring in the discrete-time domain M mixtures of K delayed and attenuated copies of an unknown source signal, can the echo locations and weights be recovered? This problem has broad applications in fields such as sonars, seismol-ogy, ultrasounds or room acoustics. It belongs to the broader class of blind channel identification problems, which have been intensively studied in signal processing. Existing methods in the literature proceed in two steps: (i) blind estimation of sparse discrete-time filters and (ii) echo information retrieval by peak-picking on filters. The precision of these methods is fundamentally limited by the rate at which the signals are sampled: estimated echo locations are necessary on-grid, and since true locations never match the sampling grid, the weight estimation precision is impacted. This is the so-called basis-mismatch problem in compressed sensing. We propose a radically different approach to the problem, building on the framework of finite-rate-of-innovation sampling. The approach operates directly in the parameter-space of echo locations and weights, and enables near-exact blind and off-grid echo retrieval from discrete-time measurements. It is shown to outperform conventional methods by several orders of magnitude in precision.
△ Less
Submitted 31 October, 2018;
originally announced October 2018.
-
Separake: Source Separation with a Little Help From Echoes
Authors:
Robin Scheibler,
Diego Di Carlo,
Antoine Deleforge,
Ivan Dokmanić
Abstract:
It is commonly believed that multipath hurts various audio processing algorithms. At odds with this belief, we show that multipath in fact helps sound source separation, even with very simple propagation models. Unlike most existing methods, we neither ignore the room impulse responses, nor we attempt to estimate them fully. We rather assume that we know the positions of a few virtual microphones…
▽ More
It is commonly believed that multipath hurts various audio processing algorithms. At odds with this belief, we show that multipath in fact helps sound source separation, even with very simple propagation models. Unlike most existing methods, we neither ignore the room impulse responses, nor we attempt to estimate them fully. We rather assume that we know the positions of a few virtual microphones generated by echoes and we show how this gives us enough spatial diversity to get a performance boost over the anechoic case. We show improvements for two standard algorithms---one that uses only magnitudes of the transfer functions, and one that also uses the phases. Concretely, we show that multichannel non-negative matrix factorization aided with a small number of echoes beats the vanilla variant of the same algorithm, and that with magnitude information only, echoes enable separation where it was previously impossible.
△ Less
Submitted 17 November, 2017;
originally announced November 2017.
-
Blind Source Separation Using Mixtures of Alpha-Stable Distributions
Authors:
Nicolas Keriven,
Antoine Deleforge,
Antoine Liutkus
Abstract:
We propose a new blind source separation algorithm based on mixtures of alpha-stable distributions. Complex symmetric alpha-stable distributions have been recently showed to better model audio signals in the time-frequency domain than classical Gaussian distributions thanks to their larger dynamic range. However, inference of these models is notoriously hard to perform because their probability de…
▽ More
We propose a new blind source separation algorithm based on mixtures of alpha-stable distributions. Complex symmetric alpha-stable distributions have been recently showed to better model audio signals in the time-frequency domain than classical Gaussian distributions thanks to their larger dynamic range. However, inference of these models is notoriously hard to perform because their probability density functions do not have a closed-form expression in general. Here, we introduce a novel method for estimating mixture of alpha-stable distributions based on characteristic function matching. We apply this to the blind estimation of binary masks in individual frequency bands from multichannel convolutive audio mixes. We show that the proposed method yields better separation performance than Gaussian-based binary-masking methods.
△ Less
Submitted 12 February, 2018; v1 submitted 13 November, 2017;
originally announced November 2017.
-
Phase retrieval with a multivariate Von Mises prior: from a Bayesian formulation to a lifting solution
Authors:
Angelique Dremeau,
Antoine Deleforge
Abstract:
In this paper, we investigate a new method for phase recovery when prior information on the missing phases is available. In particular, we propose to take into account this information in a generic fashion by means of a multivariate Von Mises dis- tribution. Building on a Bayesian formulation (a Maximum A Posteriori estimation), we show that the problem can be expressed using a Mahalanobis distanc…
▽ More
In this paper, we investigate a new method for phase recovery when prior information on the missing phases is available. In particular, we propose to take into account this information in a generic fashion by means of a multivariate Von Mises dis- tribution. Building on a Bayesian formulation (a Maximum A Posteriori estimation), we show that the problem can be expressed using a Mahalanobis distance and be solved by a lifting optimization procedure.
△ Less
Submitted 28 April, 2017;
originally announced April 2017.
-
VAST : The Virtual Acoustic Space Traveler Dataset
Authors:
Clément Gaultier,
Saurabh Kataria,
Antoine Deleforge
Abstract:
This paper introduces a new paradigm for sound source lo-calization referred to as virtual acoustic space traveling (VAST) and presents a first dataset designed for this purpose. Existing sound source localization methods are either based on an approximate physical model (physics-driven) or on a specific-purpose calibration set (data-driven). With VAST, the idea is to learn a map** from audio fe…
▽ More
This paper introduces a new paradigm for sound source lo-calization referred to as virtual acoustic space traveling (VAST) and presents a first dataset designed for this purpose. Existing sound source localization methods are either based on an approximate physical model (physics-driven) or on a specific-purpose calibration set (data-driven). With VAST, the idea is to learn a map** from audio features to desired audio properties using a massive dataset of simulated room impulse responses. This virtual dataset is designed to be maximally representative of the potential audio scenes that the considered system may be evolving in, while remaining reasonably compact. We show that virtually-learned map**s on this dataset generalize to real data, overcoming some intrinsic limitations of traditional binaural sound localization methods based on time differences of arrival.
△ Less
Submitted 14 December, 2016;
originally announced December 2016.
-
Hearing in a shoe-box : binaural source position and wall absorption estimation using virtually supervised learning
Authors:
Saurabh Kataria,
Clément Gaultier,
Antoine Deleforge
Abstract:
This paper introduces a new framework for supervised sound source localization referred to as virtually-supervised learning. An acoustic shoe-box room simulator is used to generate a large number of binaural single-source audio scenes. These scenes are used to build a dataset of spatial binaural features annotated with acoustic properties such as the 3D source position and the walls' absorption co…
▽ More
This paper introduces a new framework for supervised sound source localization referred to as virtually-supervised learning. An acoustic shoe-box room simulator is used to generate a large number of binaural single-source audio scenes. These scenes are used to build a dataset of spatial binaural features annotated with acoustic properties such as the 3D source position and the walls' absorption coefficients. A probabilistic high- to low-dimensional regression framework is used to learn a map** from these features to the acoustic properties. Results indicate that this map** successfully estimates the azimuth and elevation of new sources, but also their range and even the walls' absorption coefficients solely based on binaural signals. Results also reveal that incorporating random-diffusion effects in the data significantly improves the estimation of all parameters.
△ Less
Submitted 20 March, 2017; v1 submitted 30 September, 2016;
originally announced September 2016.
-
Phase Unmixing : Multichannel Source Separation with Magnitude Constraints
Authors:
Antoine Deleforge,
Yann Traonmilin
Abstract:
We consider the problem of estimating the phases of K mixed complex signals from a multichannel observation, when the mixing matrix and signal magnitudes are known. This problem can be cast as a non-convex quadratically constrained quadratic program which is known to be NP-hard in general. We propose three approaches to tackle it: a heuristic method, an alternate minimization method, and a convex…
▽ More
We consider the problem of estimating the phases of K mixed complex signals from a multichannel observation, when the mixing matrix and signal magnitudes are known. This problem can be cast as a non-convex quadratically constrained quadratic program which is known to be NP-hard in general. We propose three approaches to tackle it: a heuristic method, an alternate minimization method, and a convex relaxation into a semi-definite program. The last two approaches are showed to outperform the oracle multichannel Wiener filter in under-determined informed source separation tasks, using simulated and speech signals. The convex relaxation approach yields best results, including the potential for exact source separation in under-determined settings.
△ Less
Submitted 20 March, 2017; v1 submitted 30 September, 2016;
originally announced September 2016.
-
Rectified binaural ratio: A complex T-distributed feature for robust sound localization
Authors:
Antoine Deleforge,
Florence Forbes
Abstract:
Most existing methods in binaural sound source localization rely on some kind of aggregation of phase-and level-difference cues in the time-frequency plane. While different ag-gregation schemes exist, they are often heuristic and suffer in adverse noise conditions. In this paper, we introduce the rectified binaural ratio as a new feature for sound source local-ization. We show that for Gaussian-pr…
▽ More
Most existing methods in binaural sound source localization rely on some kind of aggregation of phase-and level-difference cues in the time-frequency plane. While different ag-gregation schemes exist, they are often heuristic and suffer in adverse noise conditions. In this paper, we introduce the rectified binaural ratio as a new feature for sound source local-ization. We show that for Gaussian-process point source signals corrupted by stationary Gaussian noise, this ratio follows a complex t-distribution with explicit parameters. This new formulation provides a principled and statistically sound way to aggregate binaural features in the presence of noise. We subsequently derive two simple and efficient methods for robust relative transfer function and time-delay estimation. Experiments on heavily corrupted simulated and speech signals demonstrate the robustness of the proposed scheme.
△ Less
Submitted 30 September, 2016;
originally announced September 2016.
-
Robust Head-Pose Estimation Based on Partially-Latent Mixture of Linear Regressions
Authors:
Vincent Drouard,
Radu Horaud,
Antoine Deleforge,
Silèye Ba,
Georgios Evangelidis
Abstract:
Head-pose estimation has many applications, such as social event analysis, human-robot and human-computer interaction, driving assistance, and so forth. Head-pose estimation is challenging because it must cope with changing illumination conditions, variabilities in face orientation and in appearance, partial occlusions of facial landmarks, as well as bounding-box-to-face alignment errors. We propo…
▽ More
Head-pose estimation has many applications, such as social event analysis, human-robot and human-computer interaction, driving assistance, and so forth. Head-pose estimation is challenging because it must cope with changing illumination conditions, variabilities in face orientation and in appearance, partial occlusions of facial landmarks, as well as bounding-box-to-face alignment errors. We propose tu use a mixture of linear regressions with partially-latent output. This regression method learns to map high-dimensional feature vectors (extracted from bounding boxes of faces) onto the joint space of head-pose angles and bounding-box shifts, such that they are robustly predicted in the presence of unobservable phenomena. We describe in detail the map** method that combines the merits of unsupervised manifold learning techniques and of mixtures of regressions. We validate our method with three publicly available datasets and we thoroughly benchmark four variants of the proposed algorithm with several state-of-the-art head-pose estimation methods.
△ Less
Submitted 6 March, 2017; v1 submitted 31 March, 2016;
originally announced March 2016.
-
Towards a Generalization of Relative Transfer Functions to More Than One Source
Authors:
Antoine Deleforge,
Sharon Gannot,
Walter Kellermann
Abstract:
We propose a natural way to generalize relative transfer functions (RTFs) to more than one source. We first prove that such a generalization is not possible using a single multichannel spectro-temporal observation, regardless of the number of microphones. We then introduce a new transform for multichannel multi-frame spectrograms, i.e., containing several channels and time frames in each time-freq…
▽ More
We propose a natural way to generalize relative transfer functions (RTFs) to more than one source. We first prove that such a generalization is not possible using a single multichannel spectro-temporal observation, regardless of the number of microphones. We then introduce a new transform for multichannel multi-frame spectrograms, i.e., containing several channels and time frames in each time-frequency bin. This transform allows a natural generalization which satisfies the three key properties of RTFs, namely, they can be directly estimated from observed signals, they capture spatial properties of the sources and they do not depend on emitted signals. Through simulated experiments, we show how this new method can localize multiple simultaneously active sound sources using short spectro-temporal windows, without relying on source separation.
△ Less
Submitted 1 July, 2015;
originally announced July 2015.
-
Phase-Optimized K-SVD for Signal Extraction from Underdetermined Multichannel Sparse Mixtures
Authors:
Antoine Deleforge,
Walter Kellermann
Abstract:
We propose a novel sparse representation for heavily underdetermined multichannel sound mixtures, i.e., with much more sources than microphones. The proposed approach operates in the complex Fourier domain, thus preserving spatial characteristics carried by phase differences. We derive a generalization of K-SVD which jointly estimates a dictionary capturing both spectral and spatial features, a sp…
▽ More
We propose a novel sparse representation for heavily underdetermined multichannel sound mixtures, i.e., with much more sources than microphones. The proposed approach operates in the complex Fourier domain, thus preserving spatial characteristics carried by phase differences. We derive a generalization of K-SVD which jointly estimates a dictionary capturing both spectral and spatial features, a sparse activation matrix, and all instantaneous source phases from a set of signal examples. The dictionary can then be used to extract the learned signal from a new input mixture. The method is applied to the challenging problem of ego-noise reduction for robot audition. We demonstrate its superiority relative to conventional dictionary-based techniques using recordings made in a real room.
△ Less
Submitted 9 October, 2014;
originally announced October 2014.
-
Hyper-Spectral Image Analysis with Partially-Latent Regression and Spatial Markov Dependencies
Authors:
Antoine Deleforge,
Florence Forbes,
Sileye Ba,
Radu Horaud
Abstract:
Hyper-spectral data can be analyzed to recover physical properties at large planetary scales. This involves resolving inverse problems which can be addressed within machine learning, with the advantage that, once a relationship between physical parameters and spectra has been established in a data-driven fashion, the learned relationship can be used to estimate physical parameters for new hyper-sp…
▽ More
Hyper-spectral data can be analyzed to recover physical properties at large planetary scales. This involves resolving inverse problems which can be addressed within machine learning, with the advantage that, once a relationship between physical parameters and spectra has been established in a data-driven fashion, the learned relationship can be used to estimate physical parameters for new hyper-spectral observations. Within this framework, we propose a spatially-constrained and partially-latent regression method which maps high-dimensional inputs (hyper-spectral images) onto low-dimensional responses (physical parameters such as the local chemical composition of the soil). The proposed regression model comprises two key features. Firstly, it combines a Gaussian mixture of locally-linear map**s (GLLiM) with a partially-latent response model. While the former makes high-dimensional regression tractable, the latter enables to deal with physical parameters that cannot be observed or, more generally, with data contaminated by experimental artifacts that cannot be explained with noise models. Secondly, spatial constraints are introduced in the model through a Markov random field (MRF) prior which provides a spatial structure to the Gaussian-mixture hidden variables. Experiments conducted on a database composed of remotely sensed observations collected from the Mars planet by the Mars Express orbiter demonstrate the effectiveness of the proposed model.
△ Less
Submitted 27 March, 2015; v1 submitted 30 September, 2014;
originally announced September 2014.
-
Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression
Authors:
Antoine Deleforge,
Radu Horaud,
Yoav Schechner,
Laurent Girin
Abstract:
This paper addresses the problem of localizing audio sources using binaural measurements. We propose a supervised formulation that simultaneously localizes multiple sources at different locations. The approach is intrinsically efficient because, contrary to prior work, it relies neither on source separation, nor on monaural segregation. The method starts with a training stage that establishes a lo…
▽ More
This paper addresses the problem of localizing audio sources using binaural measurements. We propose a supervised formulation that simultaneously localizes multiple sources at different locations. The approach is intrinsically efficient because, contrary to prior work, it relies neither on source separation, nor on monaural segregation. The method starts with a training stage that establishes a locally-linear Gaussian regression model between the directional coordinates of all the sources and the auditory features extracted from binaural measurements. While fixed-length wide-spectrum sounds (white noise) are used for training to reliably estimate the model parameters, we show that the testing (localization) can be extended to variable-length sparse-spectrum sounds (such as speech), thus enabling a wide range of realistic applications. Indeed, we demonstrate that the method can be used for audio-visual fusion, namely to map speech signals onto images and hence to spatially align the audio and visual modalities, thus enabling to discriminate between speaking and non-speaking faces. We release a novel corpus of real-room recordings that allow quantitative evaluation of the co-localization method in the presence of one or two sound sources. Experiments demonstrate increased accuracy and speed relative to several state-of-the-art methods.
△ Less
Submitted 15 April, 2016; v1 submitted 12 August, 2014;
originally announced August 2014.
-
Acoustic Space Learning for Sound Source Separation and Localization on Binaural Manifolds
Authors:
Antoine Deleforge,
Florence Forbes,
Radu Horaud
Abstract:
In this paper we address the problems of modeling the acoustic space generated by a full-spectrum sound source and of using the learned model for the localization and separation of multiple sources that simultaneously emit sparse-spectrum sounds. We lay theoretical and methodological grounds in order to introduce the binaural manifold paradigm. We perform an in-depth study of the latent low-dimens…
▽ More
In this paper we address the problems of modeling the acoustic space generated by a full-spectrum sound source and of using the learned model for the localization and separation of multiple sources that simultaneously emit sparse-spectrum sounds. We lay theoretical and methodological grounds in order to introduce the binaural manifold paradigm. We perform an in-depth study of the latent low-dimensional structure of the high-dimensional interaural spectral data, based on a corpus recorded with a human-like audiomotor robot head. A non-linear dimensionality reduction technique is used to show that these data lie on a two-dimensional (2D) smooth manifold parameterized by the motor states of the listener, or equivalently, the sound source directions. We propose a probabilistic piecewise affine map** model (PPAM) specifically designed to deal with high-dimensional data exhibiting an intrinsic piecewise linear structure. We derive a closed-form expectation-maximization (EM) procedure for estimating the model parameters, followed by Bayes inversion for obtaining the full posterior density function of a sound source direction. We extend this solution to deal with missing data and redundancy in real world spectrograms, and hence for 2D localization of natural sound sources such as speech. We further generalize the model to the challenging case of multiple sound sources and we propose a variational EM framework. The associated algorithm, referred to as variational EM for source separation and localization (VESSL) yields a Bayesian estimation of the 2D locations and time-frequency masks of all the sources. Comparisons of the proposed approach with several existing methods reveal that the combination of acoustic-space learning with Bayesian inference enables our method to outperform state-of-the-art methods.
△ Less
Submitted 20 March, 2014; v1 submitted 11 February, 2014;
originally announced February 2014.
-
High-Dimensional Regression with Gaussian Mixtures and Partially-Latent Response Variables
Authors:
Antoine Deleforge,
Florence Forbes,
Radu Horaud
Abstract:
In this work we address the problem of approximating high-dimensional data with a low-dimensional representation. We make the following contributions. We propose an inverse regression method which exchanges the roles of input and response, such that the low-dimensional variable becomes the regressor, and which is tractable. We introduce a mixture of locally-linear probabilistic map** model that…
▽ More
In this work we address the problem of approximating high-dimensional data with a low-dimensional representation. We make the following contributions. We propose an inverse regression method which exchanges the roles of input and response, such that the low-dimensional variable becomes the regressor, and which is tractable. We introduce a mixture of locally-linear probabilistic map** model that starts with estimating the parameters of inverse regression, and follows with inferring closed-form solutions for the forward parameters of the high-dimensional regression problem of interest. Moreover, we introduce a partially-latent paradigm, such that the vector-valued response variable is composed of both observed and latent entries, thus being able to deal with data contaminated by experimental artifacts that cannot be explained with noise models. The proposed probabilistic formulation could be viewed as a latent-variable augmentation of regression. We devise expectation-maximization (EM) procedures based on a data augmentation strategy which facilitates the maximum-likelihood search over the model parameters. We propose two augmentation schemes and we describe in detail the associated EM inference procedures that may well be viewed as generalizations of a number of EM regression, dimension reduction, and factor analysis algorithms. The proposed framework is validated with both synthetic and real data. We provide experimental evidence that our method outperforms several existing regression techniques.
△ Less
Submitted 20 December, 2013; v1 submitted 10 August, 2013;
originally announced August 2013.