Search | arXiv e-print repository

Ambisonics Networks -- The Effect Of Radial Functions Regularization

Authors: Bar Shaybet, Anurag Kumar, Vladimir Tourbabin, Boaz Rafaely

Abstract: Ambisonics, a popular format of spatial audio, is the spherical harmonic (SH) representation of the plane wave density function of a sound field. Many algorithms operate in the SH domain and utilize the Ambisonics as their input signal. The process of encoding Ambisonics from a spherical microphone array involves dividing by the radial functions, which may amplify noise at low frequencies. This ca… ▽ More Ambisonics, a popular format of spatial audio, is the spherical harmonic (SH) representation of the plane wave density function of a sound field. Many algorithms operate in the SH domain and utilize the Ambisonics as their input signal. The process of encoding Ambisonics from a spherical microphone array involves dividing by the radial functions, which may amplify noise at low frequencies. This can be overcome by regularization, with the downside of introducing errors to the Ambisonics encoding. This paper aims to investigate the impact of different ways of regularization on Deep Neural Network (DNN) training and performance. Ideally, these networks should be robust to the way of regularization. Simulated data of a single speaker in a room and experimental data from the LOCATA challenge were used to evaluate this robustness on an example algorithm of speaker localization based on the direct-path dominance (DPD) test. Results show that performance may be sensitive to the way of regularization, and an informed approach is proposed and investigated, highlighting the importance of regularization information. △ Less

Submitted 29 February, 2024; originally announced February 2024.

Comments: to be published in Icassp 2024

arXiv:2401.03286 [pdf, ps, other]

doi 10.1109/TASLP.2014.2351133

Theoretical Framework for the Optimization of Microphone Array Configuration for Humanoid Robot Audition

Authors: Vladimir Tourbabin, Boaz Rafaely

Abstract: An important aspect of a humanoid robot is audition. Previous work has presented robot systems capable of sound localization and source segregation based on microphone arrays with various configurations. However, no theoretical framework for the design of these arrays has been presented. In the current paper, a design framework is proposed based on a novel array quality measure. The measure is bas… ▽ More An important aspect of a humanoid robot is audition. Previous work has presented robot systems capable of sound localization and source segregation based on microphone arrays with various configurations. However, no theoretical framework for the design of these arrays has been presented. In the current paper, a design framework is proposed based on a novel array quality measure. The measure is based on the effective rank of a matrix composed of the generalized head related transfer functions (GHRTFs) that account for microphone positions other than the ears. The measure is shown to be theoretically related to standard array performance measures such as beamforming robustness and DOA estimation accuracy. Then, the measure is applied to produce sample designs of microphone arrays. Their performance is investigated numerically, verifying the advantages of array design based on the proposed theoretical framework. △ Less

Submitted 6 January, 2024; originally announced January 2024.

Journal ref: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, 1803-1814, 2014

arXiv:2401.02386 [pdf, ps, other]

doi 10.1109/TASLP.2015.2464671

Direction of Arrival Estimation Using Microphone Array Processing for Moving Humanoid Robots

Authors: Vladimir Tourbabin, Boaz Rafaely

Abstract: The auditory system of humanoid robots has gained increased attention in recent years. This system typically acquires the surrounding sound field by means of a microphone array. Signals acquired by the array are then processed using various methods. One of the widely applied methods is direction of arrival estimation. The conventional direction of arrival estimation methods assume that the array i… ▽ More The auditory system of humanoid robots has gained increased attention in recent years. This system typically acquires the surrounding sound field by means of a microphone array. Signals acquired by the array are then processed using various methods. One of the widely applied methods is direction of arrival estimation. The conventional direction of arrival estimation methods assume that the array is fixed at a given position during the estimation. However, this is not necessarily true for an array installed on a moving humanoid robot. The array motion, if not accounted for appropriately, can introduce a significant error in the estimated direction of arrival. The current paper presents a signal model that takes the motion into account. Based on this model, two processing methods are proposed. The first one compensates for the motion of the robot. The second method is applicable to periodic signals and utilizes the motion in order to enhance the performance to a level beyond that of a stationary array. Numerical simulations and an experimental study are provided, demonstrating that the motion compensation method almost eliminates the motion-related error. It is also demonstrated that by using the motion-based enhancement method it is possible to improve the direction of arrival estimation performance, as compared to that obtained when using a stationary array. △ Less

Submitted 4 January, 2024; originally announced January 2024.

Journal ref: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 2046-2058, Nov. 2015

arXiv:2401.02285 [pdf, ps, other]

doi 10.1109/TASL.2012.2208626

Optimal Real-Weighted Beamforming With Application to Linear and Spherical Arrays

Authors: V. Tourbabin, M. Agmon, B. Rafaely, J. Tabrikian

Abstract: One of the uses of sensor arrays is for spatial filtering or beamforming. Current digital signal processing methods facilitate complex-weighted beamforming, providing flexibility in array design. Previous studies proposed the use of real-valued beamforming weights, which although reduce flexibility in design, may provide a range of benefits, e.g., simplified beamformer implementation or efficient… ▽ More One of the uses of sensor arrays is for spatial filtering or beamforming. Current digital signal processing methods facilitate complex-weighted beamforming, providing flexibility in array design. Previous studies proposed the use of real-valued beamforming weights, which although reduce flexibility in design, may provide a range of benefits, e.g., simplified beamformer implementation or efficient beamforming algorithms. This paper presents a new method for the design of arrays with real-valued weights, that achieve maximum directivity, providing closed-form solution to array weights. The method is studied for linear and spherical arrays, where it is shown that rigid spherical arrays are particularly suitable for real-weight designs as they do not suffer from grating lobes, a dominant feature in linear arrays with real weights. A simulation study is presented for linear and spherical arrays, along with an experimental investigation, validating the theoretical developments. △ Less

Submitted 4 January, 2024; originally announced January 2024.

Journal ref: n IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 9, pp. 2575-2585, Nov. 2012

arXiv:2312.13707 [pdf, other]

Blind Localization of Room Reflections with Application to Spatial Audio

Authors: Yogev Hadadi, Vladimir Tourbabin, Paul Calamia, Boaz Rafaely

Abstract: Blind estimation of early room reflections, without knowledge of the room impulse response, holds substantial value. The FF-PHALCOR (Frequency Focusing PHase ALigned CORrelation), method was recently developed for this objective, extending the original PHALCOR method from spherical to arbitrary arrays. However, previous studies only compared the two methods under limited conditions without present… ▽ More Blind estimation of early room reflections, without knowledge of the room impulse response, holds substantial value. The FF-PHALCOR (Frequency Focusing PHase ALigned CORrelation), method was recently developed for this objective, extending the original PHALCOR method from spherical to arbitrary arrays. However, previous studies only compared the two methods under limited conditions without presenting a comprehensive performance analysis. This study presents an advance by evaluating the performance of the algorithm in a wider range of conditions. Additionally, performance in terms of perception is investigated through a listening test. This test involves synthesizing room impulse responses from known room acoustics parameters and replacing the early reflections with the estimated ones. The importance of the estimated reflections for spatial perception is demonstrated through this test. △ Less

Submitted 21 December, 2023; originally announced December 2023.

Journal ref: in 2023 Immersive and 3D Audio: from Architecture to Automotive (I3DA 2023), Bologna, Italy, September 2023

arXiv:2311.18689 [pdf, other]

Subspace Hybrid MVDR Beamforming for Augmented Hearing

Authors: Sina Hafezi, Alastair H. Moore, Pierre H. Guiraud, Patrick A. Naylor, Jacob Donley, Vladimir Tourbabin, Thomas Lunner

Abstract: Signal-dependent beamformers are advantageous over signal-independent beamformers when the acoustic scenario - be it real-world or simulated - is straightforward in terms of the number of sound sources, the ambient sound field and their dynamics. However, in the context of augmented reality audio using head-worn microphone arrays, the acoustic scenarios encountered are often far from straightforwa… ▽ More Signal-dependent beamformers are advantageous over signal-independent beamformers when the acoustic scenario - be it real-world or simulated - is straightforward in terms of the number of sound sources, the ambient sound field and their dynamics. However, in the context of augmented reality audio using head-worn microphone arrays, the acoustic scenarios encountered are often far from straightforward. The design of robust, high-performance, adaptive beamformers for such scenarios is an on-going challenge. This is due to the violation of the typically required assumptions on the noise field caused by, for example, rapid variations resulting from complex acoustic environments, and/or rotations of the listener's head. This work proposes a multi-channel speech enhancement algorithm which utilises the adaptability of signal-dependent beamformers while still benefiting from the computational efficiency and robust performance of signal-independent super-directive beamformers. The algorithm has two stages. (i) The first stage is a hybrid beamformer based on a dictionary of weights corresponding to a set of noise field models. (ii) The second stage is a wide-band subspace post-filter to remove any artifacts resulting from (i). The algorithm is evaluated using both real-world recordings and simulations of a cocktail-party scenario. Noise suppression, intelligibility and speech quality results show a significant performance improvement by the proposed algorithm compared to the baseline super-directive beamformer. A data-driven implementation of the noise field dictionary is shown to provide more noise suppression, and similar speech intelligibility and quality, compared to a parametric dictionary. △ Less

Submitted 30 November, 2023; originally announced November 2023.

Comments: 14 pages, 10 figures, submitted for IEEE/ACM Transactions on Audio, Speech, and Language Processing on 23-Nov-2023

arXiv:2107.04174 [pdf, other]

EasyCom: An Augmented Reality Dataset to Support Algorithms for Easy Communication in Noisy Environments

Authors: Jacob Donley, Vladimir Tourbabin, Jung-Suk Lee, Mark Broyles, Hao Jiang, Jie Shen, Maja Pantic, Vamsi Krishna Ithapu, Ravish Mehra

Abstract: Augmented Reality (AR) as a platform has the potential to facilitate the reduction of the cocktail party effect. Future AR headsets could potentially leverage information from an array of sensors spanning many different modalities. Training and testing signal processing and machine learning algorithms on tasks such as beam-forming and speech enhancement require high quality representative data. To… ▽ More Augmented Reality (AR) as a platform has the potential to facilitate the reduction of the cocktail party effect. Future AR headsets could potentially leverage information from an array of sensors spanning many different modalities. Training and testing signal processing and machine learning algorithms on tasks such as beam-forming and speech enhancement require high quality representative data. To the best of the author's knowledge, as of publication there are no available datasets that contain synchronized egocentric multi-channel audio and video with dynamic movement and conversations in a noisy environment. In this work, we describe, evaluate and release a dataset that contains over 5 hours of multi-modal data useful for training and testing algorithms for the application of improving conversations for an AR glasses wearer. We provide speech intelligibility, quality and signal-to-noise ratio improvement results for a baseline method and show improvements across all tested metrics. The dataset we are releasing contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head bounding boxes, target of speech and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem. △ Less

Submitted 18 October, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

Comments: Dataset is available at: https://github.com/facebookresearch/EasyComDataset

arXiv:2007.11795 [pdf, other]

Sound Field Translation and Mixed Source Model for Virtual Applications with Perceptual Validation

Authors: Lachlan Birnie, Thushara Abhayapala, Vladimir Tourbabin, Prasanga Samarasinghe

Abstract: Non-interactive and linear experiences like cinema film offer high quality surround sound audio to enhance immersion, however the listener's experience is usually fixed to a single acoustic perspective. With the rise of virtual reality, there is a demand for recording and recreating real-world experiences in a way that allows for the user to interact and move within the reproduction. Conventional… ▽ More Non-interactive and linear experiences like cinema film offer high quality surround sound audio to enhance immersion, however the listener's experience is usually fixed to a single acoustic perspective. With the rise of virtual reality, there is a demand for recording and recreating real-world experiences in a way that allows for the user to interact and move within the reproduction. Conventional sound field translation techniques take a recording and expand it into an equivalent environment of virtual sources. However, the finite sampling of a commercial higher order microphone produces an acoustic sweet-spot in the virtual reproduction. As a result, the technique remains to restrict the listener's navigable region. In this paper, we propose a method for listener translation in an acoustic reproduction that incorporates a mixture of near-field and far-field sources in a sparsely expanded virtual environment. We perceptually validate the method through a Multiple Stimulus with Hidden Reference and Anchor (MUSHRA) experiment. Compared to the planewave benchmark, the proposed method offers both improved source localizability and robustness to spectral distortions at translated positions. A cross-examination with numerical simulations demonstrated that the sparse expansion relaxes the inherent sweet-spot constraint, leading to the improved localizability for sparse environments. Additionally, the proposed method is seen to better reproduce the intensity and binaural room impulse response spectra of near-field environments, further supporting the strong perceptual results. △ Less

Submitted 23 July, 2020; originally announced July 2020.

Comments: 12 pages, 11 figures This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Showing 1–8 of 8 results for author: Tourbabin, V