doi 10.1109/ICASSP43922.2022.9746132

SALSA-Lite: A Fast and Effective Feature for Polyphonic Sound Event Localization and Detection with Microphone Arrays

Authors: Thi Ngoc Tho Nguyen, Douglas L. Jones, Karn N. Watcharasupat, Huy Phan, Woon-Seng Gan

Abstract: Polyphonic sound event localization and detection (SELD) has many practical applications in acoustic sensing and monitoring. However, the development of real-time SELD has been limited by the demanding computational requirement of most recent SELD systems. In this work, we introduce SALSA-Lite, a fast and effective feature for polyphonic SELD using microphone array inputs. SALSA-Lite is a lightweight variation of a previously proposed SALSA feature for polyphonic SELD. SALSA, which stands for Spatial Cue-Augmented Log-Spectrogram, consists of multichannel log-spectrograms stacked channelwise with the normalized principal eigenvectors of the spectrotemporally corresponding spatial covariance matrices. In contrast to SALSA, which uses eigenvector-based spatial features, SALSA-Lite uses normalized inter-channel phase differences as spatial features, allowing a 30-fold speedup compared to the original SALSA feature. Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset showed that the SALSA-Lite feature achieved competitive performance compared to the full SALSA feature, and significantly outperformed the traditional feature set of multichannel log-mel spectrograms with generalized cross-correlation spectra. Specifically, using SALSA-Lite features increased localization-dependent F1 score and class-dependent localization recall by 15% and 5%, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.

Submitted 4 May, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

Comments: arXiv admin note: text overlap with arXiv:2110.00275

Journal ref: Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 716-720

arXiv:2110.00275 [pdf, other]

doi 10.1109/TASLP.2022.3173054

SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection

Authors: Thi Ngoc Tho Nguyen, Karn N. Watcharasupat, Ngoc Khanh Nguyen, Douglas L. Jones, Woon-Seng Gan

Abstract: Sound event localization and detection (SELD) consists of two subtasks, which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable for different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone array (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6% each, compared to the multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased F1 score and localization recall by 16% and 7%, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.

Submitted 6 June, 2022; v1 submitted 1 October, 2021; originally announced October 2021.

Comments: (c) 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1749-1762, 2022

arXiv:2107.10471 [pdf, ps, other]

Improving Polyphonic Sound Event Detection on Multichannel Recordings with the Sørensen-Dice Coefficient Loss and Transfer Learning

Authors: Karn N. Watcharasupat, Thi Ngoc Tho Nguyen, Ngoc Khanh Nguyen, Zhen Jian Lee, Douglas L. Jones, Woon Seng Gan

Abstract: The Sørensen-Dice Coefficient has recently seen rising popularity as a loss function (also known as Dice loss) due to its robustness in tasks where the number of negative samples significantly exceeds that of positive samples, such as semantic segmentation, natural language processing, and sound event detection. Conventional training of polyphonic sound event detection systems with binary cross-entropy loss often results in suboptimal detection performance as the training is often overwhelmed by updates from negative samples. In this paper, we investigated the effect of the Dice loss, intra- and inter-modal transfer learning, data augmentation, and recording formats, on the performance of polyphonic sound event detection systems with multichannel inputs. Our analysis showed that polyphonic sound event detection systems trained with Dice loss consistently outperformed those trained with cross-entropy loss across different training settings and recording formats in terms of F1 score and error rate. We achieved further performance gains via the use of transfer learning and an appropriate combination of different data augmentation techniques.

Submitted 2 October, 2021; v1 submitted 22 July, 2021; originally announced July 2021.

Comments: Submitted to the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2021

arXiv:2107.10469 [pdf, other]

What Makes Sound Event Localization and Detection Difficult? Insights from Error Analysis

Authors: Thi Ngoc Tho Nguyen, Karn N. Watcharasupat, Zhen Jian Lee, Ngoc Khanh Nguyen, Douglas L. Jones, Woon Seng Gan

Abstract: Sound event localization and detection (SELD) is an emerging research topic that aims to unify the tasks of sound event detection and direction-of-arrival estimation. As a result, SELD inherits the challenges of both tasks, such as noise, reverberation, interference, polyphony, and non-stationarity of sound sources. Furthermore, SELD often faces an additional challenge of assigning correct correspondences between the detected sound classes and directions of arrival to multiple overlapping sound events. Previous studies have shown that unknown interferences in reverberant environments often cause major degradation in the performance of SELD systems. To further understand the challenges of the SELD task, we performed a detailed error analysis on two of our SELD systems, which both ranked second in the team category of DCASE SELD Challenge, one in 2020 and one in 2021. Experimental results indicate polyphony as the main challenge in SELD, due to the difficulty in detecting all sound events of interest. In addition, the SELD systems tend to make fewer errors for the polyphonic scenario that is dominant in the training set.

Submitted 2 October, 2021; v1 submitted 22 July, 2021; originally announced July 2021.

Comments: Accepted for the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2021

Journal ref: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop, pp. 120-124

arXiv:2106.15190 [pdf, other]

doi 10.5281/zenodo.5031836

DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic Sound Event Localization and Detection

Authors: Thi Ngoc Tho Nguyen, Karn Watcharasupat, Ngoc Khanh Nguyen, Douglas L. Jones, Woon Seng Gan

Abstract: Sound event localization and detection consists of two subtasks which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses magnitude or phase differences between microphones to estimate source directions. Therefore, it is often difficult to jointly train these two subtasks simultaneously. We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival. The feature includes multichannel log-spectrograms stacked along with the estimated direct-to-reverberant ratio and a normalized version of the principal eigenvector of the spatial covariance matrix at each time-frequency bin on the spectrograms. Experimental results on the DCASE 2021 dataset for sound event localization and detection with directional interference showed that the deep learning-based models trained on this new feature outperformed the DCASE challenge baseline by a large margin. We combined several models with slightly different architectures that were trained on the new feature to further improve the system performances for the DCASE sound event localization and detection challenge.

Submitted 29 June, 2021; originally announced June 2021.

Comments: 5 pages, Technical Report for DCASE 2021 Challenge Task 3. arXiv admin note: text overlap with arXiv:2110.00275

arXiv:1911.11373 [pdf, other]

A two-step system for sound event localization and detection

Authors: T. N. T. Nguyen, D. L. Jones, R. Ranjan, S. Jayabalan, W. S. Gan

Abstract: Sound event detection and sound event localization requires different features from audio input signals. While sound event detection mainly relies on time-frequency patterns to distinguish different event classes, sound event localization uses magnitude or phase differences between microphones to estimate source directions. Therefore, we propose a two-step system to do sound event localization and detection. In the first step, we detect the sound events and estimate the directions-of-arrival separately. In the second step, we combine the results of the event detector and direction-of-arrival estimator together. The obtained results show a significant improvement over the baseline solution for sound event localization and detection in DCASE 2019 task 3 challenge. Using the evaluation dataset, the proposed system achieved an F1 score of 93.4% for sound event detection and an error of 5.4 degrees for direction-of-arrival estimation, while the winning solution achieved an F1 score of 94.7% and an angle error of 3.7 degrees respectively.

Submitted 26 November, 2019; originally announced November 2019.

Comments: 5 pages

arXiv:1705.00615 [pdf, other]

Guided-Processing Outperforms Duty-Cycling for Energy-Efficient Systems

Authors: Long N. Le, Douglas L. Jones

Abstract: Energy-efficiency is highly desirable for sensing systems in the Internet of Things (IoT). A common approach to achieve low-power systems is duty-cycling, where components in a system are turned off periodically to meet an energy budget. However, this work shows that such an approach is not necessarily optimal in energy-efficiency, and proposes guided-processing as a fundamentally better alternative. The proposed approach offers 1) explicit modeling of performance uncertainties in system internals, 2) a realistic resource consumption model, and 3) a key insight into the superiority of guided-processing over duty-cycling. Generalization from the cascade structure to the more general graph-based one is also presented. Once applied to optimize a large-scale audio sensing system with a practical detection application, empirical results show that the proposed approach significantly improves the detection performance (up to $1.7\times$ and $4\times$ reduction in false-alarm and miss rate, respectively) for the same energy consumption, when compared to the duty-cycling approach.

Submitted 1 May, 2017; originally announced May 2017.

Comments: preprint, the published version is in IEEE Transactions on Circuits and Systems I, Special Issue on Circuits and Systems for the Internet of Things - From Sensing to Sensemaking, 2017. arXiv admin note: substantial text overlap with arXiv:1705.00596

arXiv:1705.00596 [pdf, other]

doi 10.1109/JSTSP.2017.2679539

Feature-Sharing in Cascade Detection Systems with Multiple Applications

Authors: Long N. Le, Douglas L. Jones

Abstract: Traditional distributed detection systems are often designed for a single target application. However, with the emergence of the Internet of Things (IoT) paradigm, next-generation systems are expected to be a shared infrastructure for multiple applications. To this end, we propose a modular, cascade design for resource-efficient, multi-task detection systems. Two (classes of) applications are considered in the system, a primary and a secondary one. The primary application has universal features that can be shared with other applications, to reduce the overall feature extraction cost, while the secondary application does not. In this setting, the two applications can collaborate via feature sharing. We provide a method to optimize the operation of the multi-application cascade system based on an accurate resource consumption model. In addition, the inherent uncertainties in feature models are articulated and taken into account. For evaluation, the twin-comparison argument is invoked, and it is shown that, with the optimal feature sharing strategy, a system can achieve 9$\times$ resource saving and 1.43$\times$ improvement in detection performance.

Submitted 1 May, 2017; originally announced May 2017.

Comments: preprint, the published version is in IEEE Journal of Selected Topics in Signal Processing, Special Issue on Cooperative Signal Processing for Heterogeneous and Multi-Task Wireless Sensor Networks, 2017

arXiv:1410.4249 [pdf, other]

doi 10.1109/ISIT.2014.6874897

Optimal Simultaneous Detection and Signal and Noise Power Estimation

Authors: Long Le, Douglas L. Jones

Abstract: Simultaneous detection and estimation is important in many engineering applications. In particular, there are many applications where it is important to perform signal detection and Signal-to-Noise-Ratio (SNR) estimation jointly. Application of existing frameworks in the literature that handle simultaneous detection and estimation is not straightforward for this class of application. This paper therefore aims at bridging the gap between an existing framework, specifically the work by Middleton et al., and the mentioned application class by presenting a jointly optimal detector and signal and noise power estimators. The detector and estimators are given for the Gaussian observation model with appropriate conjugate priors on the signal and noise power. Simulation results affirm the superior performance of the optimal solution compared to the separate detection and estimation approaches.

Submitted 15 October, 2014; originally announced October 2014.

Comments: appears in 2014 IEEE International Symposium on Information Theory (ISIT)

Showing 1–9 of 9 results for author: Jones, D L