-
Blind identification of Ambisonic reduced room impulse response
Authors:
Srđan Kitić,
Jérôme Daniel
Abstract:
The recently proposed Generalized Time-domain Velocity Vector (GTVV) is a generalization of the relative room impulse response in the spherical harmonic (aka Ambisonic) domain that allows for blind estimation of early-echo parameters: the directions and relative delays of individual reflections. However, the derived closed-form expression of GTVV requires a few assumptions to hold, the most important being that the impulse response of the reference signal needs to be a minimum-phase filter. In practice, the reference is obtained by spatial filtering towards the Direction-of-Arrival of the source, and the aforementioned condition is bounded by the performance of the applied beamformer (and thus, by the Ambisonic array order). In the present work, we propose to circumvent this problem by directly modeling the impulse responses constituting the GTVV time series, which makes it possible not only to relax the initial assumptions, but also to extract the information therein in a more consistent and efficient manner, entering the realm of blind system identification. Experiments using measured room impulse responses confirm the effectiveness of the proposed approach.
Submitted 6 November, 2023; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Echo-enabled Direction-of-Arrival and range estimation of a mobile source in Ambisonic domain
Authors:
Jérôme Daniel,
Srđan Kitić
Abstract:
Range estimation of a far-field sound source in a reverberant environment is known to be a notoriously difficult problem, hence most localization methods are only capable of estimating the source's Direction-of-Arrival (DoA). In an earlier work, we have demonstrated that, under certain restrictive acoustic conditions and given the orientation of a reflecting surface, one can exploit the dominant acoustic reflection to evaluate the DoA and the distance to a static sound source in the Ambisonic domain. In this article, we leverage the recently presented Generalized Time-domain Velocity Vector (GTVV) representation to estimate these quantities for a moving sound source without a priori knowledge of the reflectors' orientations. We show that the trajectories of a moving source and the corresponding reflections are spatially and temporally related, which can be used to infer the absolute delay of the propagating source signal and, therefore, approximate the microphone-to-source distance. Experiments on real sound data confirm the validity of the proposed approach.
Submitted 10 March, 2022;
originally announced March 2022.
-
Generalized Time Domain Velocity Vector
Authors:
Srđan Kitić,
Jérôme Daniel
Abstract:
We introduce and analyze the Generalized Time Domain Velocity Vector (GTVV), an extension of the previously presented acoustic multipath footprint extracted from Ambisonic recordings. GTVV is better adapted to adverse acoustic conditions, and enables efficient parameter estimation of multiple plane wave components in the recorded multichannel mixture. Experiments on simulated data confirm the predicted theoretical advantages of these new spatio-temporal features.
Submitted 19 May, 2022; v1 submitted 12 October, 2021;
originally announced October 2021.
-
A Survey of Sound Source Localization with Deep Learning Methods
Authors:
Pierre-Amaury Grumiaux,
Srđan Kitić,
Laurent Girin,
Alexandre Guérin
Abstract:
This article is a survey of deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environments, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of deep learning-based sound source localization methods. Tables summarizing the literature survey are provided at the end of the paper for a quick search of methods with a given set of target characteristics.
Submitted 17 June, 2022; v1 submitted 8 September, 2021;
originally announced September 2021.
-
SALADnet: Self-Attentive multisource Localization in the Ambisonics Domain
Authors:
Pierre-Amaury Grumiaux,
Srdan Kitic,
Prerak Srivastava,
Laurent Girin,
Alexandre Guérin
Abstract:
In this work, we propose a novel self-attention based neural network for robust multi-speaker localization from Ambisonics recordings. Starting from a state-of-the-art convolutional recurrent neural network (CRNN), we investigate the benefit of replacing the recurrent layers with self-attention encoders, inherited from the Transformer architecture. We evaluate these models on synthetic and real-world data, with up to 3 simultaneous speakers. The obtained results indicate that the majority of the proposed architectures either perform on par with or outperform the CRNN baseline, especially in the multisource scenario. Moreover, by avoiding the recurrent layers, the proposed models lend themselves to parallel computing, which is shown to produce considerable savings in execution time.
Submitted 23 July, 2021;
originally announced July 2021.
-
Improved feature extraction for CRNN-based multiple sound source localization
Authors:
Pierre-Amaury Grumiaux,
Srdan Kitic,
Laurent Girin,
Alexandre Guérin
Abstract:
In this work, we propose to extend a state-of-the-art multi-source localization system based on a convolutional recurrent neural network and Ambisonics signals. We significantly improve the performance of the baseline network by changing the layout between convolutional and pooling layers. We propose several configurations with more convolutional layers and smaller pooling sizes in between, so that less information is lost across the layers, leading to better feature extraction. In parallel, we test the system's ability to localize up to 3 sources, in which case the improved feature extraction provides the most significant boost in accuracy. We evaluate and compare these improved configurations on synthetic and real-world data. The obtained results show a substantial improvement in multiple sound source localization performance over the baseline network.
Submitted 5 May, 2021;
originally announced May 2021.
-
Multichannel CRNN for Speaker Counting: an Analysis of Performance
Authors:
Pierre-Amaury Grumiaux,
Srdan Kitic,
Laurent Girin,
Alexandre Guérin
Abstract:
Speaker counting is the task of estimating the number of people that are simultaneously speaking in an audio recording. For several audio processing tasks such as speaker diarization, separation, localization and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least a strong advantage, in addition to enabling low-latency processing. In a previous work, we addressed the speaker counting problem with a multichannel convolutional recurrent neural network which produces an estimate at a short-term frame resolution. In this work, we show that, for a given frame, there is an optimal position in the input sequence for best prediction accuracy. We empirically demonstrate the link between that optimal position, the length of the input sequence and the size of the convolutional filters.
Submitted 6 January, 2021;
originally announced January 2021.
-
Time Domain Velocity Vector for Retracing the Multipath Propagation
Authors:
Jérôme Daniel,
Srđan Kitić
Abstract:
We propose a conceptually and computationally simple form of sound velocity that offers a readable view of the interference between direct and indirect sound waves. Unlike most approaches in the literature, it jointly exploits both active and reactive sound intensity measurements, as typically derived from a first-order Ambisonics recording. This representation has potential both as a valuable tool for directly analyzing sound multipath propagation and as a new spatial feature format for machine learning algorithms in audio and acoustics. As a showcase, we demonstrate that the Direction-of-Arrival and the range of a sound source can be estimated as a development of this approach. To the best of the authors' knowledge, this is the first time that range is estimated from an Ambisonics recording.
Submitted 3 June, 2020;
originally announced June 2020.
-
Dilated U-net based approach for multichannel speech enhancement from First-Order Ambisonics recordings
Authors:
Amélie Bosca,
Alexandre Guérin,
Lauréline Perotin,
Srđan Kitić
Abstract:
We present a CNN architecture for speech enhancement from multichannel first-order Ambisonics mixtures. The data-dependent spatial filters, deduced from a mask-based approach, are used to help an automatic speech recognition engine cope with adverse conditions of reverberation and competing speakers. The mask predictions are provided by a neural network, fed with rough estimates of speech and noise amplitude spectra, under the assumption of known directions of arrival. This study evaluates replacing the previously investigated recurrent LSTM network with a convolutional U-net under more demanding conditions, with an additional second competing speaker. We show that, due to more accurate short-term mask prediction, the U-net architecture brings some improvements in terms of word error rate. Moreover, results indicate that the use of dilated convolutional layers is beneficial in difficult situations with two interfering speakers, and/or where the target and interferences are close to each other in terms of angular distance. In addition, these results come with a two-fold reduction in the number of parameters.
Submitted 2 June, 2020;
originally announced June 2020.
-
Sparsity-based audio declipping methods: selected overview, new algorithms, and large-scale evaluation
Authors:
Clément Gaultier,
Srđan Kitić,
Rémi Gribonval,
Nancy Bertin
Abstract:
Recent advances in audio declipping have substantially improved the state of the art. Yet, practitioners need guidelines to choose a method, and while existing benchmarks have been instrumental in advancing the field, larger-scale experiments are needed to guide such choices. First, we show that the clipping levels in existing small-scale benchmarks are moderate and call for benchmarks with more perceptually significant clipping levels. We then propose a general algorithmic framework for declipping that covers existing and new combinations of variants of state-of-the-art techniques exploiting time-frequency sparsity: synthesis vs. analysis sparsity, with plain or structured sparsity. Finally, we systematically compare these combinations and a selection of state-of-the-art methods. Using a large-scale numerical benchmark and a smaller-scale formal listening test, we provide guidelines for various clipping levels, both for speech and various musical genres. The code is made publicly available for the purpose of reproducible research and benchmarking.
Submitted 30 November, 2020; v1 submitted 19 May, 2020;
originally announced May 2020.
-
High-Resolution Speaker Counting In Reverberant Rooms Using CRNN With Ambisonics Features
Authors:
Pierre-Amaury Grumiaux,
Srdjan Kitic,
Laurent Girin,
Alexandre Guérin
Abstract:
Speaker counting is the task of estimating the number of people that are simultaneously speaking in an audio recording. For several audio processing tasks such as speaker diarization, separation, localization and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least a strong advantage, in addition to enabling low-latency processing. For that purpose, we address the speaker counting problem with a multichannel convolutional recurrent neural network which produces an estimate at a short-term frame resolution. We trained the network to predict up to 5 concurrent speakers in a multichannel mixture, with simulated data including many different conditions in terms of source and microphone positions, reverberation, and noise. The network can predict the number of speakers with good accuracy at frame resolution.
Submitted 17 March, 2020;
originally announced March 2020.
-
Scattering Features for Multimodal Gait Recognition
Authors:
Srđan Kitić,
Gilles Puy,
Patrick Pérez,
Philippe Gilberton
Abstract:
We consider the problem of identifying people on the basis of their walk (gait) pattern. Classical approaches to tackle this problem are based on, e.g., video recordings or piezoelectric sensors embedded in the floor. In this work, we rely on acoustic and vibration measurements, obtained from a microphone and a geophone sensor, respectively. The contribution of this work is twofold. First, we propose a feature extraction method based on an (untrained) shallow scattering network, specially tailored for the gait signals. Second, we demonstrate that fusing the two modalities improves identification in the practically relevant open set scenario.
Submitted 23 January, 2020;
originally announced January 2020.
-
A Comparative Study of Multilateration Methods for Single-Source Localization in Distributed Audio
Authors:
Srđan Kitić,
Clément Gaultier,
Grégory Pallone
Abstract:
In this article we analyze the state of the art in multilateration, the family of localization methods enabled by range difference observations. These methods are computationally efficient, signal-independent, and flexible with regard to the number of sensing nodes and their spatial arrangement. However, the multilateration problem does not admit a closed-form solution in the general case, and the localization performance is conditioned on the accuracy of the range difference estimates. For that reason, we consider a simplified use case where multiple distributed microphones capture the signal coming from a near-field sound source, and discuss the robustness of these methods to the estimation errors. In addition to surveying the relevant bibliography, we present the results of a small-scale benchmark of a few "mainstream" multilateration algorithms, based on an in-house Room Impulse Response dataset.
Submitted 28 July, 2020; v1 submitted 23 October, 2019;
originally announced October 2019.
-
TRAMP: Tracking by a Real-time AMbisonic-based Particle filter
Authors:
Srđan Kitić,
Alexandre Guérin
Abstract:
This article presents a multiple sound source localization and tracking system, fed by the Eigenmike array. The First Order Ambisonics (FOA) format is used to build a pseudointensity-based spherical histogram, from which the source position estimates are deduced. These instantaneous estimates are processed by a well-known tracking system relying on a set of particle filters. While the novelty within localization and tracking is incremental, a fully functional, complete and real-time system based on these algorithms is proposed for the first time. As such, it could serve as an additional baseline method of the LOCATA challenge.
Submitted 4 December, 2018; v1 submitted 9 October, 2018;
originally announced October 2018.
-
A modeling and algorithmic framework for (non)social (co)sparse audio restoration
Authors:
Clément Gaultier,
Nancy Bertin,
Srđan Kitić,
Rémi Gribonval
Abstract:
We propose a unified modeling and algorithmic framework for the audio restoration problem. It encompasses analysis sparse priors as well as more classical synthesis sparse priors, and regular sparsity as well as various forms of structured sparsity embodied by shrinkage operators (such as social shrinkage). The versatility of the framework is illustrated on two restoration scenarios: denoising and declipping. Extensive experimental results on these scenarios highlight both the speedups of 20% or even more offered by the analysis sparse prior, and the substantial declipping quality that is achievable with both the social and the plain flavor. While both flavors overall exhibit similar performance, their detailed comparison displays distinct trends depending on whether declipping or denoising is considered.
Submitted 30 November, 2017;
originally announced November 2017.