Search | arXiv e-print repository

Unsupervised Improved MVDR Beamforming for Sound Enhancement

Authors: Jacob Kealey, John Hershey, François Grondin

Abstract: Neural networks have recently become the dominant approach to sound separation. Their good performance relies on large datasets of isolated recordings. For speech and music, isolated single channel data are readily available; however the same does not hold in the multi-channel case, and with most other sound classes. Multi-channel methods have the potential to outperform single channel approaches… ▽ More Neural networks have recently become the dominant approach to sound separation. Their good performance relies on large datasets of isolated recordings. For speech and music, isolated single channel data are readily available; however the same does not hold in the multi-channel case, and with most other sound classes. Multi-channel methods have the potential to outperform single channel approaches as they can exploit both spatial and spectral features, but the lack of training data remains a challenge. We propose unsupervised improved minimum variation distortionless response (UIMVDR), which enables multi-channel separation to leverage in-the-wild single-channel data through unsupervised training and beamforming. Results show that UIMVDR generalizes well and improves separation performance compared to supervised models, particularly in cases with limited supervised data. By using data available online, it also reduces the effort required to gather data for multi-channel approaches. △ Less

Submitted 12 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2309.08005 [pdf, ps, other]

Efficient Face Detection with Audio-Based Region Proposals for Human-Robot Interactions

Authors: William Aris, François Grondin

Abstract: Efficient face detection is critical to provide natural human-robot interactions. However, computer vision tends to involve a large computational load due to the amount of data (i.e. pixels) that needs to be processed in a short amount of time. This is undesirable on robotics platforms where multiple processes need to run in parallel and where the processing power is limited by portability constra… ▽ More Efficient face detection is critical to provide natural human-robot interactions. However, computer vision tends to involve a large computational load due to the amount of data (i.e. pixels) that needs to be processed in a short amount of time. This is undesirable on robotics platforms where multiple processes need to run in parallel and where the processing power is limited by portability constraints. Existing solutions often involve reducing image quality which can negatively impact processing. The literature also reports methods to generate regions of interest in images from pixel data. Although it is a promising idea, these methods often involve heavy vision algorithms. In this paper, we evaluate how audio can be used to generate regions of interest in optical images to reduce the number of pixels to process with computer vision. Thereby, we propose a unique attention mechanism to localize a speech source and evaluate its impact on an existing face detection algorithm. Our results show that the attention mechanism reduces the computational load and offers an interesting trade-off between speed and accuracy. The proposed pipeline is flexible and can be easily adapted to other applications such as robot surveillance, video conferences or smart glasses. △ Less

Submitted 15 March, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

arXiv:2309.05057 [pdf, other]

Gray Jedi MVDR Post-filtering

Authors: François Grondin, Caleb Rascón

Abstract: Spatial filters can exploit deep-learning-based speech enhancement models to increase their reliability in scenarios with multiple speech sources scenarios. To further improve speech quality, it is common to perform postfiltering on the estimated target speech obtained with spatial filtering. In this work, Minimum Variance Distortionless Response (MVDR) is employed to provide the interference esti… ▽ More Spatial filters can exploit deep-learning-based speech enhancement models to increase their reliability in scenarios with multiple speech sources scenarios. To further improve speech quality, it is common to perform postfiltering on the estimated target speech obtained with spatial filtering. In this work, Minimum Variance Distortionless Response (MVDR) is employed to provide the interference estimation, along with the estimation of the target speech, to be later used for postfiltering. This improves the enhancement performance over a single-input baseline in a far more significant way than by increasing the model's complexity. Results suggest that less computing resources are required for postfiltering when provided with both target and interference signals, which is a step forward in develo** an online speech enhancement system for multi-speech scenarios. △ Less

Submitted 10 September, 2023; originally announced September 2023.

Comments: \c{opyright} 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2303.00949 [pdf, other]

Real-time Audio Video Enhancement \\with a Microphone Array and Headphones

Authors: Jacob Kealey, Anthony Gosselin, Étienne Deshaies-Samson, Francis Cardinal, Félix Ducharme-Turcotte, Olivier Bergeron, Amélie Rioux-Joyal, Jérémy Bélec, François Grondin

Abstract: This paper presents a complete hardware and software pipeline for real-time speech enhancement in noisy and reverberant conditions. The device consists of a microphone array and a camera mounted on eyeglasses, connected to an embedded system that enhances speech and plays back the audio in headphones, with a latency of maximum 120 msec. The proposed approach relies on face detection, tracking and… ▽ More This paper presents a complete hardware and software pipeline for real-time speech enhancement in noisy and reverberant conditions. The device consists of a microphone array and a camera mounted on eyeglasses, connected to an embedded system that enhances speech and plays back the audio in headphones, with a latency of maximum 120 msec. The proposed approach relies on face detection, tracking and verification to enhance the speech of a target speaker using a beamformer and a postfiltering neural network. Results demonstrate the feasibility of the approach, and opens the door to the exploration and validation of a wide range of beamformer and speech enhancement methods for real-time speech enhancement. △ Less

Submitted 1 March, 2023; originally announced March 2023.

Comments: Submitted to IROS 2023

arXiv:2303.00829 [pdf, other]

Ego-noise reduction of a mobile robot using noise spatial covariance matrix learning and minimum variance distortionless response

Authors: Pierre-Olivier Lagacé, François Ferland, François Grondin

Abstract: The performance of speech and events recognition systems significantly improved recently thanks to deep learning methods. However, some of these tasks remain challenging when algorithms are deployed on robots due to the unseen mechanical noise and electrical interference generated by their actuators while training the neural networks. Ego-noise reduction as a preprocessing step therefore can help… ▽ More The performance of speech and events recognition systems significantly improved recently thanks to deep learning methods. However, some of these tasks remain challenging when algorithms are deployed on robots due to the unseen mechanical noise and electrical interference generated by their actuators while training the neural networks. Ego-noise reduction as a preprocessing step therefore can help solve this issue when using pre-trained speech and event recognition algorithms on robots. In this paper, we propose a new method to reduce ego-noise using only a microphone array and less than two minute of noise recordings. Using Principal Component Analysis (PCA), the best covariance matrix candidate is selected from a dictionary created online during calibration and used with the Minimum Variance Distortionless Response (MVDR) beamformer. Results show that the proposed method runs in real-time, improves the signal-to-distortion ratio (SDR) by up to 10 dB, decreases the word error rate (WER) by 55\% in some cases and increases the Average Precision (AP) of event detection by up to 0.2. △ Less

Submitted 6 March, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

Comments: Submitted to IROS 2023

arXiv:2206.09507 [pdf, other]

Resource-Efficient Separation Transformer

Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli, Samuele Cornell, Frédéric Lepoutre, François Grondin

Abstract: Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-bas… ▽ More Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlap** blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer reaches a competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than the previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures. △ Less

Submitted 15 January, 2024; v1 submitted 19 June, 2022; originally announced June 2022.

Comments: Accepted to ICASSP 2024

arXiv:2204.13622 [pdf, other]

Fast Cross-Correlation for TDoA Estimation on Small Aperture Microphone Arrays

Authors: François Grondin, Marc-Antoine Maheux, Jean-Samuel Lauzon, Jonathan Vincent, François Michaud

Abstract: This paper introduces the Fast Cross-Correlation (FCC) method for Time Difference of Arrival (TDoA) Estimation for pairs of microphones on a small aperture microphone array. FCC relies on low-rank decomposition and exploits symmetry in even and odd bases to speed up computation while preserving TDoA accuracy. FCC reduces the number of flops by a factor of 4.5 and the execution speed by factors bet… ▽ More This paper introduces the Fast Cross-Correlation (FCC) method for Time Difference of Arrival (TDoA) Estimation for pairs of microphones on a small aperture microphone array. FCC relies on low-rank decomposition and exploits symmetry in even and odd bases to speed up computation while preserving TDoA accuracy. FCC reduces the number of flops by a factor of 4.5 and the execution speed by factors between 3.5 and 8.3 on embedded hardware, compared to the state-of-the-art Generalized Cross-Correlation (GCC) method that relies on the Fast Fourier Transform (FFT). This improvement can provide portable microphone arrays with extended battery life and allow real-time processing on low-cost hardware. △ Less

Submitted 10 March, 2023; v1 submitted 28 April, 2022; originally announced April 2022.

Comments: Submitted to IEEE ICASSP 2023

arXiv:2203.14409 [pdf, other]

SMP-PHAT: Lightweight DoA Estimation by Merging Microphone Pairs

Authors: François Grondin, Marc-Antoine Maheux, Jean-Samuel Lauzon, Jonathan Vincent, François Michaud

Abstract: This paper introduces SMP-PHAT, which performs direction of arrival (DoA) of sound estimation with a microphone array by merging pairs of microphones that are parallel in space. This approach reduces the number of pairwise cross-correlation computations, and brings down the number of flops and memory lookups when searching for DoA. Experiments on low-cost hardware with commonly used microphone arr… ▽ More This paper introduces SMP-PHAT, which performs direction of arrival (DoA) of sound estimation with a microphone array by merging pairs of microphones that are parallel in space. This approach reduces the number of pairwise cross-correlation computations, and brings down the number of flops and memory lookups when searching for DoA. Experiments on low-cost hardware with commonly used microphone arrays show that the proposed method provides the same accuracy as the former SRP-PHAT approach, while reducing the computational load by 39% in some cases. △ Less

Submitted 27 March, 2022; originally announced March 2022.

Comments: Submitted to Interspeech 2022

arXiv:2202.08947 [pdf, other]

Machine Learning for Touch Localization on Ultrasonic Wave Touchscreen

Authors: Sahar Bahrami, Jérémy Moriot, Patrice Masson, François Grondin

Abstract: Classification and regression employing a simple Deep Neural Network (DNN) are investigated to perform touch localization on a tactile surface using ultrasonic guided waves. A robotic finger first simulates the touch action and captures the data to train a model. The model is then validated with data from experiments conducted with human fingers. The localization root mean square errors (RMSE) in… ▽ More Classification and regression employing a simple Deep Neural Network (DNN) are investigated to perform touch localization on a tactile surface using ultrasonic guided waves. A robotic finger first simulates the touch action and captures the data to train a model. The model is then validated with data from experiments conducted with human fingers. The localization root mean square errors (RMSE) in time and frequency domains are presented. The proposed method provides satisfactory localization results for most human-machine interactions, with a mean error of 0.47 cm and standard deviation of 0.18 cm and a computing time of 0.44 ms. The classification approach is also adapted to identify touches on an access control keypad layout, which leads to an accuracy of 97% with a computing time of 0.28 ms. These results demonstrate that DNN-based methods are a viable alternative to signal processing-based approaches for accurate and robust touch localization using ultrasonic guided waves. △ Less

Submitted 27 April, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

arXiv:2202.02884 [pdf, other]

Exploring Self-Attention Mechanisms for Speech Separation

Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, Francois Grondin, Mirko Bronzi

Abstract: Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies in-depth Transformers for speech separation. In particular, we… ▽ More Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies in-depth Transformers for speech separation. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Lonformers, and ReFormers. We found that they reduce memory requirements significantly. For example, we show that the Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption. △ Less

Submitted 27 May, 2023; v1 submitted 6 February, 2022; originally announced February 2022.

Comments: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2111.04614 [pdf, other]

Learning Filterbanks for End-to-End Acoustic Beamforming

Authors: Samuele Cornell, Manuel Pariente, François Grondin, Stefano Squartini

Abstract: Recent work on monaural source separation has shown that performance can be increased by using fully learned filterbanks with short windows. On the other hand it is widely known that, for conventional beamforming techniques, performance increases with long analysis windows. This applies also to most hybrid neural beamforming methods which rely on a deep neural network (DNN) to estimate the spatial… ▽ More Recent work on monaural source separation has shown that performance can be increased by using fully learned filterbanks with short windows. On the other hand it is widely known that, for conventional beamforming techniques, performance increases with long analysis windows. This applies also to most hybrid neural beamforming methods which rely on a deep neural network (DNN) to estimate the spatial covariance matrices. In this work we try to bridge the gap between these two worlds and explore fully end-to-end hybrid neural beamforming in which, instead of using the Short-Time-Fourier Transform, also the analysis and synthesis filterbanks are learnt jointly with the DNN. In detail, we explore two different types of learned filterbanks: fully learned and analytic. We perform a detailed analysis using the recent Clarity Challenge data and show that by using learnt filterbanks it is possible to surpass oracle-mask based beamforming for short windows. △ Less

Submitted 19 February, 2022; v1 submitted 8 November, 2021; originally announced November 2021.

Comments: accepted at ICASSP 2022

arXiv:2110.10812 [pdf, other]

REAL-M: Towards Speech Separation on Real Mixtures

Authors: Cem Subakan, Mirco Ravanelli, Samuele Cornell, François Grondin

Abstract: In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life… ▽ More In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Secondly, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures. The performance predictions of the SI-SNR estimator indeed correlate well with human opinions. Moreover, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow those achieved on synthetic benchmarks when evaluating popular speech separation models. △ Less

Submitted 20 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP 2022

arXiv:2110.03103 [pdf, other]

Lightweight Speech Enhancement in Unseen Noisy and Reverberant Conditions using KISS-GEV Beamforming

Authors: Thomas Bernard, François Grondin

Abstract: This paper introduces a new method referred to as KISS-GEV (for Keep It Super Simple Generalized eigenvalue) beamforming. While GEV beamforming usually relies on deep neural network for estimating target and noise time-frequency masks, this method uses a signal processing approach based on the direction of arrival (DoA) of the target. This considerably reduces the amount of computations involved a… ▽ More This paper introduces a new method referred to as KISS-GEV (for Keep It Super Simple Generalized eigenvalue) beamforming. While GEV beamforming usually relies on deep neural network for estimating target and noise time-frequency masks, this method uses a signal processing approach based on the direction of arrival (DoA) of the target. This considerably reduces the amount of computations involved at test time, and works for speech enhancement in unseen conditions as there is no need to train a neural network with noisy speech. The proposed method can also be used to separate speech from a mixture, provided the speech sources come from different directions. Results also show that the proposed method uses the same minimal DoA assumption as Delay-and-Sum beamforming, yet outperforms this traditional approach. △ Less

Submitted 10 October, 2021; v1 submitted 6 October, 2021; originally announced October 2021.

arXiv:2109.08263 [pdf, other]

doi 10.1109/SSRR53300.2021.9597855

Neural Network Based Lidar Gesture Recognition for Realtime Robot Teleoperation

Authors: Simon Chamorro, Jack Collier, François Grondin

Abstract: We propose a novel low-complexity lidar gesture recognition system for mobile robot control robust to gesture variation. Our system uses a modular approach, consisting of a pose estimation module and a gesture classifier. Pose estimates are predicted from lidar scans using a Convolutional Neural Network trained using an existing stereo-based pose estimation system. Gesture classification is accomp… ▽ More We propose a novel low-complexity lidar gesture recognition system for mobile robot control robust to gesture variation. Our system uses a modular approach, consisting of a pose estimation module and a gesture classifier. Pose estimates are predicted from lidar scans using a Convolutional Neural Network trained using an existing stereo-based pose estimation system. Gesture classification is accomplished using a Long Short-Term Memory network and uses a sequence of estimated body poses as input to predict a gesture. Breaking down the pipeline into two modules reduces the dimensionality of the input, which could be lidar scans, stereo imagery, or any other modality from which body keypoints can be extracted, making our system lightweight and suitable for mobile robot control with limited computing power. The use of lidar contributes to the robustness of the system, allowing it to operate in most outdoor conditions, to be independent of lighting conditions, and for input to be detected 360 degrees around the robot. The lidar-based pose estimator and gesture classifier use data augmentation and automated labeling techniques, requiring a minimal amount of data collection and avoiding the need for manual labeling. We report experimental results for each module of our system and demonstrate its effectiveness by testing it in a real-world robot teleoperation setting. △ Less

Submitted 16 September, 2021; originally announced September 2021.

Journal ref: 2021 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR)

arXiv:2106.04624 [pdf, other]

SpeechBrain: A General-Purpose Speech Toolkit

Authors: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, Yoshua Bengio

Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing… ▽ More SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies. △ Less

Submitted 8 June, 2021; originally announced June 2021.

Comments: Preprint

arXiv:2104.01466 [pdf, other]

doi 10.21437/Interspeech.2021-941

ECAPA-TDNN Embeddings for Speaker Diarization

Authors: Nauman Dawalatabad, Mirco Ravanelli, François Grondin, Jenthe Thienpondt, Brecht Desplanques, Hwidong Na

Abstract: Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker discriminative characteristics and popular deep embeddings such as x-vectors are nowadays a fundamental component of modern diarization systems. Recently, some improvements over the standard TDNN architecture used for x-vectors have been proposed. The ECAPA-TDNN model, f… ▽ More Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker discriminative characteristics and popular deep embeddings such as x-vectors are nowadays a fundamental component of modern diarization systems. Recently, some improvements over the standard TDNN architecture used for x-vectors have been proposed. The ECAPA-TDNN model, for instance, has shown impressive performance in the speaker verification domain, thanks to a carefully designed neural model. In this work, we extend, for the first time, the use of the ECAPA-TDNN model to speaker diarization. Moreover, we improved its robustness with a powerful augmentation scheme that concatenates several contaminated versions of the same signal within the same training batch. The ECAPA-TDNN model turned out to provide robust speaker embeddings under both close-talking and distant-talking conditions. Our results on the popular AMI meeting corpus show that our system significantly outperforms recently proposed approaches. △ Less

Submitted 3 April, 2021; originally announced April 2021.

arXiv:2103.03954 [pdf, other]

ODAS: Open embeddeD Audition System

Authors: François Grondin, Dominic Létourneau, Cédric Godin, Jean-Samuel Lauzon, Jonathan Vincent, Simon Michaud, Samuel Faucher, François Michaud

Abstract: Artificial audition aims at providing hearing capabilities to machines, computers and robots. Existing frameworks in robot audition offer interesting sound source localization, tracking and separation performance, although involve a significant amount of computations that limit their use on robots with embedded computing capabilities. This paper presents ODAS, the Open embeddeD Audition System fra… ▽ More Artificial audition aims at providing hearing capabilities to machines, computers and robots. Existing frameworks in robot audition offer interesting sound source localization, tracking and separation performance, although involve a significant amount of computations that limit their use on robots with embedded computing capabilities. This paper presents ODAS, the Open embeddeD Audition System framework, which includes strategies to reduce the computational load and perform robot audition tasks on low-cost embedded computing systems. It presents key features of ODAS, along with cases illustrating its uses in different robots and artificial audition applications. △ Less

Submitted 11 May, 2022; v1 submitted 5 March, 2021; originally announced March 2021.

Comments: This paper was published in Frontiers Robotics and AI

arXiv:2103.01830 [pdf, other]

doi 10.1109/JIOT.2021.3103523

Audio scene monitoring using redundant ad-hoc microphone array networks

Authors: Peter Gerstoft, Yihan Hu, Michael J. Bianco, Chaitanya Patil, Ardel Alegre, Yoav Freund, Francois Grondin

Abstract: We present a system for localizing sound sources in a room with several ad-hoc microphone arrays. Each circular array performs direction of arrival (DOA) estimation independently using commercial software. The DOAs are fed to a fusion center, concatenated, and used to perform the localization based on two proposed methods, which require only few labeled source locations (anchor points) for trainin… ▽ More We present a system for localizing sound sources in a room with several ad-hoc microphone arrays. Each circular array performs direction of arrival (DOA) estimation independently using commercial software. The DOAs are fed to a fusion center, concatenated, and used to perform the localization based on two proposed methods, which require only few labeled source locations (anchor points) for training. The first proposed method is based on principal component analysis (PCA) of the observed DOA and does not require any knowledge of anchor points. The array cluster can then perform localization on a manifold defined by the PCA of concatenated DOAs over time. The second proposed method performs localization using an affine transformation between the DOA vectors and the room manifold. The PCA has fewer requirements on the training sequence, but is less robust to missing DOAs from one of the arrays. The methods are demonstrated with five IoT 8-microphone circular arrays, placed at unspecified fixed locations in an office. Both the PCA and the affine method can easily map out a rectangle based on a few anchor points with similar accuracy. The proposed methods provide a step towards monitoring activities in a smart home and require little installation effort as the array locations are not needed. △ Less

Submitted 23 August, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

Comments: IN press, IEEE Internet of Things Journal

arXiv:2010.09930 [pdf, other]

BIRD: Big Impulse Response Dataset

Authors: François Grondin, Jean-Samuel Lauzon, Simon Michaud, Mirco Ravanelli, François Michaud

Abstract: This paper introduces BIRD, the Big Impulse Response Dataset. This open dataset consists of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the Image Method, making it the largest multichannel open dataset currently available. These RIRs can be used toperform efficient online data augmentation for scenarios that involve two microphones and multiple sound sources… ▽ More This paper introduces BIRD, the Big Impulse Response Dataset. This open dataset consists of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the Image Method, making it the largest multichannel open dataset currently available. These RIRs can be used toperform efficient online data augmentation for scenarios that involve two microphones and multiple sound sources. The paper also introduces use cases to illustrate how BIRD can perform data augmentation with existing speech corpora. △ Less

Submitted 19 October, 2020; originally announced October 2020.

arXiv:2008.00072 [pdf, other]

Dynamic Object Tracking and Masking for Visual SLAM

Authors: Jonathan Vincent, Mathieu Labbé, Jean-Samuel Lauzon, François Grondin, Pier-Marc Comtois-Rivet, François Michaud

Abstract: In dynamic environments, performance of visual SLAM techniques can be impaired by visual features taken from moving objects. One solution is to identify those objects so that their visual features can be removed for localization and map**. This paper presents a simple and fast pipeline that uses deep neural networks, extended Kalman filters and visual SLAM to improve both localization and mappin… ▽ More In dynamic environments, performance of visual SLAM techniques can be impaired by visual features taken from moving objects. One solution is to identify those objects so that their visual features can be removed for localization and map**. This paper presents a simple and fast pipeline that uses deep neural networks, extended Kalman filters and visual SLAM to improve both localization and map** in dynamic environments (around 14 fps on a GTX 1080). Results on the dynamic sequences from the TUM dataset using RTAB-Map as visual SLAM suggest that the approach achieves similar localization performance compared to other state-of-the-art methods, while also providing the position of the tracked dynamic objects, a 3D map free of those dynamic objects, better loop closure detection with the whole pipeline able to run on a robot moving at moderate speed. △ Less

Submitted 31 July, 2020; originally announced August 2020.

arXiv:2007.11079 [pdf, other]

3D Localization of a Sound Source Using Mobile Microphone Arrays Referenced by SLAM

Authors: Simon Michaud, Samuel Faucher, François Grondin, Jean-Samuel Lauzon, Mathieu Labbé, Dominic Létourneau, François Ferland, François Michaud

Abstract: A microphone array can provide a mobile robot with the capability of localizing, tracking and separating distant sound sources in 2D, i.e., estimating their relative elevation and azimuth. To combine acoustic data with visual information in real world settings, spatial correlation must be established. The approach explored in this paper consists of having two robots, each equipped with a microphon… ▽ More A microphone array can provide a mobile robot with the capability of localizing, tracking and separating distant sound sources in 2D, i.e., estimating their relative elevation and azimuth. To combine acoustic data with visual information in real world settings, spatial correlation must be established. The approach explored in this paper consists of having two robots, each equipped with a microphone array, localizing themselves in a shared reference map using SLAM. Based on their locations, data from the microphone arrays are used to triangulate in 3D the location of a sound source in relation to the same map. This strategy results in a novel cooperative sound map** approach using mobile microphone arrays. Trials are conducted using two mobile robots localizing a static or a moving sound source to examine in which conditions this is possible. Results suggest that errors under 0.3 m are observed when the relative angle between the two robots are above 30 degrees for a static sound source, while errors under 0.3 m for angles between 40 degrees and 140 degrees are observed with a moving sound source. △ Less

Submitted 21 July, 2020; originally announced July 2020.

arXiv:2005.09587 [pdf, other]

GEV Beamforming Supported by DOA-based Masks Generated on Pairs of Microphones

Authors: Francois Grondin, Jean-Samuel Lauzon, Jonathan Vincent, Francois Michaud

Abstract: Distant speech processing is a challenging task, especially when dealing with the cocktail party effect. Sound source separation is thus often required as a preprocessing step prior to speech recognition to improve the signal to distortion ratio (SDR). Recently, a combination of beamforming and speech separation networks have been proposed to improve the target source quality in the direction of a… ▽ More Distant speech processing is a challenging task, especially when dealing with the cocktail party effect. Sound source separation is thus often required as a preprocessing step prior to speech recognition to improve the signal to distortion ratio (SDR). Recently, a combination of beamforming and speech separation networks have been proposed to improve the target source quality in the direction of arrival of interest. However, with this type of approach, the neural network needs to be trained in advance for a specific microphone array geometry, which limits versatility when adding/removing microphones, or changing the shape of the array. The solution presented in this paper is to train a neural network on pairs of microphones with different spacing and acoustic environmental conditions, and then use this network to estimate a time-frequency mask from all the pairs of microphones forming the array with an arbitrary shape. Using this mask, the target and noise covariance matrices can be estimated, and then used to perform generalized eigenvalue (GEV) beamforming. Results show that the proposed approach improves the SDR from 4.78 dB to 7.69 dB on average, for various microphone array geometries that correspond to commercially available hardware. △ Less

Submitted 5 August, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

arXiv:2002.01440 [pdf, other]

Audio-Visual Calibration with Polynomial Regression for 2-D Projection Using SVD-PHAT

Authors: Francois Grondin, Hao Tang, James Glass

Abstract: This paper proposes a straightforward 2-D method to spatially calibrate the visual field of a camera with the auditory field of an array microphone by generating and overlaying an acoustic image over an optical image. Using a low-cost microphone array and an off-the-shelf camera, we show that polynomial regression can deal efficiently with non-linear camera distortion, and that a recently proposed… ▽ More This paper proposes a straightforward 2-D method to spatially calibrate the visual field of a camera with the auditory field of an array microphone by generating and overlaying an acoustic image over an optical image. Using a low-cost microphone array and an off-the-shelf camera, we show that polynomial regression can deal efficiently with non-linear camera distortion, and that a recently proposed sound source localization method for real-time processing, SVD-PHAT, can be adapted for this task. △ Less

Submitted 12 February, 2020; v1 submitted 4 February, 2020; originally announced February 2020.

arXiv:1910.10049 [pdf, other]

Sound Event Localization and Detection Using CRNN on Pairs of Microphones

Authors: Francois Grondin, James Glass, Iwona Sobieraj, Mark D. Plumbley

Abstract: This paper proposes sound event localization and detection methods from multichannel recording. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) to perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones in a microphone array. In this paper, the system is evaluated with a four-microphone array, and thus com… ▽ More This paper proposes sound event localization and detection methods from multichannel recording. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) to perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones in a microphone array. In this paper, the system is evaluated with a four-microphone array, and thus combines the results from six pairs of microphones to provide a final classification and a 3-D direction of arrival (DOA) estimate. Results demonstrate that the proposed approach outperforms the DCASE 2019 baseline system. △ Less

Submitted 22 October, 2019; originally announced October 2019.

arXiv:1907.12621 [pdf, other]

Fast and Robust 3-D Sound Source Localization with DSVD-PHAT

Authors: Francois Grondin, James Glass

Abstract: This paper introduces a variant of the Singular Value Decomposition with Phase Transform (SVD-PHAT), named Difference SVD-PHAT (DSVD-PHAT), to achieve robust Sound Source Localization (SSL) in noisy conditions. Experiments are performed on a Baxter robot with a four-microphone planar array mounted on its head. Results show that this method offers similar robustness to noise as the state-of-the-art… ▽ More This paper introduces a variant of the Singular Value Decomposition with Phase Transform (SVD-PHAT), named Difference SVD-PHAT (DSVD-PHAT), to achieve robust Sound Source Localization (SSL) in noisy conditions. Experiments are performed on a Baxter robot with a four-microphone planar array mounted on its head. Results show that this method offers similar robustness to noise as the state-of-the-art Multiple Signal Classification based on Generalized Singular Value Decomposition (GSVD-MUSIC) method, and considerably reduces the computational load by a factor of 250. This performance gain thus makes DSVD-PHAT appealing for real-time application on robots with limited on-board computing power. △ Less

Submitted 29 July, 2019; originally announced July 2019.

arXiv:1906.11913 [pdf, ps, other]

Multiple Sound Source Localization with SVD-PHAT

Authors: Francois Grondin, James Glass

Abstract: This paper introduces a modification of phase transform on singular value decomposition (SVD-PHAT) to localize multiple sound sources. This work aims to improve localization accuracy and keeps the algorithm complexity low for real-time applications. This method relies on multiple scans of the search space, with projection of each low-dimensional observation onto orthogonal subspaces. We show that… ▽ More This paper introduces a modification of phase transform on singular value decomposition (SVD-PHAT) to localize multiple sound sources. This work aims to improve localization accuracy and keeps the algorithm complexity low for real-time applications. This method relies on multiple scans of the search space, with projection of each low-dimensional observation onto orthogonal subspaces. We show that this method localizes multiple sound sources more accurately than discrete SRP-PHAT, with a reduction in the Root Mean Square Error up to 0.0395 radians. △ Less

Submitted 27 June, 2019; originally announced June 2019.

arXiv:1812.00115 [pdf, other]

Lightweight and Optimized Sound Source Localization and Tracking Methods for Open and Closed Microphone Array Configurations

Authors: Francois Grondin, Francois Michaud

Abstract: Human-robot interaction in natural settings requires filtering out the different sources of sounds from the environment. Such ability usually involves the use of microphone arrays to localize, track and separate sound sources online. Multi-microphone signal processing techniques can improve robustness to noise but the processing cost increases with the number of microphones used, limiting response… ▽ More Human-robot interaction in natural settings requires filtering out the different sources of sounds from the environment. Such ability usually involves the use of microphone arrays to localize, track and separate sound sources online. Multi-microphone signal processing techniques can improve robustness to noise but the processing cost increases with the number of microphones used, limiting response time and widespread use on different types of mobile robots. Since sound source localization methods are the most expensive in terms of computing resources as they involve scanning a large 3D space, minimizing the amount of computations required would facilitate their implementation and use on robots. The robot's shape also brings constraints on the microphone array geometry and configurations. In addition, sound source localization methods usually return noisy features that need to be smoothed and filtered by tracking the sound sources. This paper presents a novel sound source localization method, called SRP-PHAT-HSDA, that scans space with coarse and fine resolution grids to reduce the number of memory lookups. A microphone directivity model is used to reduce the number of directions to scan and ignore non significant pairs of microphones. A configuration method is also introduced to automatically set parameters that are normally empirically tuned according to the shape of the microphone array. For sound source tracking, this paper presents a modified 3D Kalman (M3K) method capable of simultaneously tracking in 3D the directions of sound sources. Using a 16-microphone array and low cost hardware, results show that SRP-PHAT-HSDA and M3K perform at least as well as other sound source localization and tracking methods while using up to 4 and 30 times less computing resources respectively. △ Less

Submitted 30 November, 2018; originally announced December 2018.

arXiv:1811.11787 [pdf, ps, other]

A Study of the Complexity and Accuracy of Direction of Arrival Estimation Methods Based on GCC-PHAT for a Pair of Close Microphones

Authors: Francois Grondin, James Glass

Abstract: This paper investigates the accuracy of various Generalized Cross-Correlation with Phase Transform (GCC-PHAT) methods for a close pair of microphones. We investigate interpolation-based methods and also propose another approach based on Singular Value Decomposition (SVD). All investigated methods are implemented in C code, and the execution time is measured to determine which approach is the most… ▽ More This paper investigates the accuracy of various Generalized Cross-Correlation with Phase Transform (GCC-PHAT) methods for a close pair of microphones. We investigate interpolation-based methods and also propose another approach based on Singular Value Decomposition (SVD). All investigated methods are implemented in C code, and the execution time is measured to determine which approach is the most appealing for real-time applications on low-cost embedded hardware. △ Less

Submitted 28 November, 2018; originally announced November 2018.

arXiv:1811.11785 [pdf, ps, other]

SVD-PHAT: A Fast Sound Source Localization Method

Authors: Francois Grondin, James Glass

Abstract: This paper introduces a new localization method called SVD-PHAT. The SVD-PHAT method relies on Singular Value Decomposition of the SRP-PHAT projection matrix. A k-d tree is also proposed to speed up the search for the most likely direction of arrival of sound. We show that this method performs as accurately as SRP-PHAT, while reducing significantly the amount of computation required. This paper introduces a new localization method called SVD-PHAT. The SVD-PHAT method relies on Singular Value Decomposition of the SRP-PHAT projection matrix. A k-d tree is also proposed to speed up the search for the most likely direction of arrival of sound. We show that this method performs as accurately as SRP-PHAT, while reducing significantly the amount of computation required. △ Less

Submitted 11 February, 2019; v1 submitted 28 November, 2018; originally announced November 2018.

Journal ref: Proceedings of the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing

arXiv:1806.04841 [pdf, ps, other]

A Study of Enhancement, Augmentation, and Autoencoder Methods for Domain Adaptation in Distant Speech Recognition

Authors: Hao Tang, Wei-Ning Hsu, Francois Grondin, James Glass

Abstract: Speech recognizers trained on close-talking speech do not generalize to distant speech and the word error rate degradation can be as large as 40% absolute. Most studies focus on tackling distant speech recognition as a separate problem, leaving little effort to adapting close-talking speech recognizers to distant speech. In this work, we review several approaches from a domain adaptation perspecti… ▽ More Speech recognizers trained on close-talking speech do not generalize to distant speech and the word error rate degradation can be as large as 40% absolute. Most studies focus on tackling distant speech recognition as a separate problem, leaving little effort to adapting close-talking speech recognizers to distant speech. In this work, we review several approaches from a domain adaptation perspective. These approaches, including speech enhancement, multi-condition training, data augmentation, and autoencoders, all involve a transformation of the data between domains. We conduct experiments on the AMI data set, where these approaches can be realized under the same controlled setting. These approaches lead to different amounts of improvement under their respective assumptions. The purpose of this paper is to quantify and characterize the performance gap between the two domains, setting up the basis for studying adaptation of speech recognizers from close-talking speech to distant speech. Our results also have implications for improving distant speech recognition. △ Less

Submitted 13 June, 2018; originally announced June 2018.

Comments: Interspeech, 2018

Showing 1–30 of 30 results for author: Grondin, F