-
Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms
Authors:
Iran R. Roman,
Christopher Ick,
Sivan Ding,
Adrian S. Roman,
Brian McFee,
Juan P. Bello
Abstract:
Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatially localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific rooms. We present SpatialScaper, a library for SELD data simulation and augmentation. Compared to existing tools, SpatialScaper emulates virtual rooms via parameters such as size and wall absorption. This allows for parameterized placement (including movement) of foreground and background sound sources. SpatialScaper also includes data augmentation pipelines that can be applied to existing SELD data. As a case study, we use SpatialScaper to add rooms to the DCASE SELD data. Training a model with our data led to progressive performance improvements as a direct function of acoustic diversity. These results show that SpatialScaper is valuable for training robust SELD models.
Submitted 19 January, 2024;
originally announced January 2024.
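The convolution step mentioned in the abstract above (placing a sound event in a room by convolving it with an RIR) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SpatialScaper's API: the function name `spatialize`, the synthetic decaying-noise RIR, and the 4-channel layout are all hypothetical, and only NumPy and SciPy's `fftconvolve` are used.

```python
import numpy as np
from scipy.signal import fftconvolve


def spatialize(event, rir):
    """Convolve a mono event (n_samples,) with a multichannel RIR (n_channels, rir_len)."""
    # One convolution per microphone channel; output length is n_samples + rir_len - 1.
    return np.stack([fftconvolve(event, channel, mode="full") for channel in rir])


# Hypothetical inputs: 1 s of noise as the "event" and an exponentially
# decaying noise burst standing in for a measured or simulated RIR.
sr = 16000
rng = np.random.default_rng(0)
event = rng.standard_normal(sr)
decay = np.exp(-np.arange(4000) / 800.0)
rir = rng.standard_normal((4, 4000)) * decay

wet = spatialize(event, rir)
print(wet.shape)  # (4, 19999)
```

A moving source could be approximated by switching or cross-fading between RIRs over time, but that is beyond this sketch.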
-
Leveraging Geometrical Acoustic Simulations of Spatial Room Impulse Responses for Improved Sound Event Detection and Localization
Authors:
Christopher Ick,
Brian McFee
Abstract:
As deeper and more complex models are developed for the task of sound event localization and detection (SELD), the demand for annotated spatial audio data continues to increase. Annotating field recordings with 360$^{\circ}$ video takes many hours from trained annotators, while recording events within motion-tracked laboratories is bounded by cost and expertise. Because of this, localization models rely on a relatively limited amount of spatial audio data in the form of spatial room impulse response (SRIR) datasets, which limits the progress of increasingly deep neural network based approaches. In this work, we demonstrate that simulated geometrical acoustics can provide an appealing solution to this problem. We use simulated geometrical acoustics to generate a novel SRIR dataset that can train a SELD model to provide similar performance to that of a real SRIR dataset. Furthermore, we demonstrate using simulated data to augment existing datasets, improving on benchmarks set by state-of-the-art SELD models. We explore the potential and limitations of geometric acoustic simulation for localization and event detection. We also propose further studies to verify the limitations of this method, as well as further methods to generate synthetic data for SELD tasks without the need to record more data.
Submitted 6 September, 2023;
originally announced September 2023.
-
Blind Acoustic Room Parameter Estimation Using Phase Features
Authors:
Christopher Ick,
Adib Mehrabi,
Wenyu Jin
Abstract:
Modeling room acoustics in a field setting involves some degree of blind parameter estimation from noisy and reverberant audio. Modern approaches leverage convolutional neural networks (CNNs) in tandem with time-frequency representations. Using short-time Fourier transforms to develop these spectrogram-like features has shown promising results, but this method implicitly discards a significant amount of audio information in the phase domain. Inspired by recent works in speech enhancement, we propose utilizing novel phase-related features to extend recent approaches to blindly estimate the so-called "reverberation fingerprint" parameters, namely, volume and RT60. The addition of these features is shown to outperform existing methods that rely solely on magnitude-based spectral features across a wide range of acoustic spaces. We evaluate the effectiveness of the deployment of these novel features in both single-parameter and multi-parameter estimation strategies, using a novel dataset that consists of publicly available room impulse responses (RIRs), synthesized RIRs, and in-house measurements of real acoustic spaces.
Submitted 13 March, 2023;
originally announced March 2023.
-
Searching for quasi-periodic oscillations in astrophysical transients using Gaussian processes
Authors:
M. Hübner,
D. Huppenkothen,
P. D. Lasky,
A. R. Inglis,
C. Ick,
D. W. Hogg
Abstract:
Analyses of quasi-periodic oscillations (QPOs) are important to understanding the dynamic behaviour in many astrophysical objects during transient events like gamma-ray bursts, solar flares, magnetar flares and fast radio bursts. Astrophysicists often search for QPOs with frequency-domain methods such as (Lomb-Scargle) periodograms, which generally assume power-law models plus some excess around the QPO frequency. Time-series data can alternatively be investigated directly in the time domain using Gaussian Process (GP) regression. While GP regression is computationally expensive in the general case, the properties of astrophysical data and models allow fast likelihood strategies. Heteroscedasticity and non-stationarity in data have been shown to cause bias in periodogram-based analyses. Gaussian processes can take account of these properties. Using GPs, we model QPOs as a stochastic process on top of a deterministic flare shape. Using Bayesian inference, we demonstrate how to infer GP hyperparameters and assign them physical meaning, such as the QPO frequency. We also perform model selection between QPOs and alternative models such as red noise and show that this can be used to reliably find QPOs. This method is easily applicable to a variety of different astrophysical data sets. We demonstrate the use of this method on a range of short transients: a gamma-ray burst, a magnetar flare, a magnetar giant flare, and simulated solar flare data.
Submitted 25 May, 2022;
originally announced May 2022.
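The time-domain model described in the abstract above (a stochastic QPO process on top of a deterministic flare shape, with heteroscedastic noise) can be illustrated with a dense Gaussian process log-likelihood. This is a sketch under stated assumptions, not the paper's implementation: the damped-cosine kernel, the Gaussian-bump flare shape, all parameter values, and the function names are hypothetical, and the dense Cholesky solve used here scales cubically rather than using the fast likelihood strategies the abstract mentions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve


def qpo_kernel(tau, amp, decay, freq):
    """Damped-cosine covariance: amp * exp(-decay * |tau|) * cos(2 pi freq tau)."""
    return amp * np.exp(-decay * np.abs(tau)) * np.cos(2.0 * np.pi * freq * tau)


def flare_mean(t, height, t0, width):
    """Hypothetical deterministic flare shape: a Gaussian bump."""
    return height * np.exp(-0.5 * ((t - t0) / width) ** 2)


def gp_log_likelihood(t, y, yerr, kernel_params, mean_params):
    """Dense Gaussian log-likelihood with per-point (heteroscedastic) noise."""
    resid = y - flare_mean(t, *mean_params)
    # Covariance = QPO kernel on all time lags + per-point noise variance on the diagonal.
    K = qpo_kernel(t[:, None] - t[None, :], *kernel_params) + np.diag(yerr ** 2)
    factor = cho_factor(K, lower=True)
    log_det = 2.0 * np.sum(np.log(np.diag(factor[0])))
    quad = resid @ cho_solve(factor, resid)
    return -0.5 * (quad + log_det + len(t) * np.log(2.0 * np.pi))


# Simulated data: a flare plus noise whose amplitude varies from point to point.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 200)
yerr = rng.uniform(0.1, 0.3, size=t.size)
y = flare_mean(t, 5.0, 4.0, 1.5) + yerr * rng.standard_normal(t.size)
ll = gp_log_likelihood(t, y, yerr, (1.0, 0.5, 2.0), (5.0, 4.0, 1.5))
print(ll)
```

In a Bayesian analysis, this likelihood would be combined with priors over the kernel and mean parameters, and model selection would compare it against the likelihood under an alternative kernel such as red noise.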
-
Sound Event Detection in Urban Audio With Single and Multi-Rate PCEN
Authors:
Christopher Ick,
Brian McFee
Abstract:
Recent literature has demonstrated that the use of per-channel energy normalization (PCEN) yields significant performance improvements over traditional log-scaled mel-frequency spectrograms in acoustic sound event detection (SED) in a multi-class setting with overlapping events. However, the configuration of PCEN's parameters is sensitive to the recording environment, the characteristics of the class of events of interest, and the presence of multiple overlapping events. This leads to improvements on a class-by-class basis, but poor cross-class performance. In this article, we experiment using PCEN spectrograms as an alternative method for SED in urban audio using the UrbanSED dataset, demonstrating per-class improvements based on parameter configuration. Furthermore, we address cross-class performance with PCEN using a novel method, Multi-Rate PCEN (MRPCEN). We evaluate cross-class SED performance with MRPCEN and show improvements over traditional single-rate PCEN.
Submitted 5 February, 2021;
originally announced February 2021.
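For readers unfamiliar with PCEN, the standard formulation smooths each frequency band with a first-order filter and then applies adaptive gain control and root compression. The sketch below is a plain NumPy illustration, not the authors' code: the parameter values are illustrative, and the `multi_rate_pcen` helper, which simply stacks PCEN outputs computed at several smoothing rates, is an assumption about what a multi-rate variant could look like, since the abstract does not specify how MRPCEN combines rates.

```python
import numpy as np


def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to a non-negative spectrogram E of shape (n_bands, n_frames)."""
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]  # initialise the smoother with the first frame
    for t in range(1, E.shape[1]):
        # First-order IIR smoother: M[t] = (1 - s) * M[t-1] + s * E[t]
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * E[:, t]
    # Adaptive gain control followed by dynamic range compression
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r


def multi_rate_pcen(E, rates=(0.015, 0.05, 0.25)):
    """Hypothetical multi-rate variant: stack PCEN outputs along a new leading axis."""
    return np.stack([pcen(E, s=s) for s in rates])


# Random stand-in for a 40-band mel spectrogram with 100 frames.
rng = np.random.default_rng(0)
E = rng.random((40, 100)) + 0.1
out = multi_rate_pcen(E)
print(out.shape)  # (3, 40, 100)
```

A smaller smoothing coefficient `s` tracks slowly varying background energy, while a larger one adapts quickly, which is one reason a single setting may suit some event classes better than others.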
-
Learning a Lie Algebra from Unlabeled Data Pairs
Authors:
Christopher Ick,
Vincent Lostanlen
Abstract:
Deep convolutional networks (convnets) show a remarkable ability to learn disentangled representations. In recent years, the generalization of deep learning to Lie groups beyond rigid motion in $\mathbb{R}^n$ has made it possible to build convnets over datasets with non-trivial symmetries, such as patterns over the surface of a sphere. However, one limitation of this approach is the need to explicitly define the Lie group underlying the desired invariance property before training the convnet. Whereas rotations on the sphere have a well-known symmetry group ($\mathrm{SO}(3)$), the same cannot be said of many real-world factors of variability. For example, the disentanglement of pitch, intensity dynamics, and playing technique remains a challenging task in music information retrieval.
This article proposes a machine learning method to discover a nonlinear transformation of the space $\mathbb{R}^n$ which maps a collection of $n$-dimensional vectors $(\boldsymbol{x}_i)_i$ onto a collection of target vectors $(\boldsymbol{y}_i)_i$. The key idea is to approximate every target $\boldsymbol{y}_i$ by a matrix--vector product of the form $\boldsymbol{\widetilde{y}}_i = \boldsymbol{\phi}(t_i) \boldsymbol{x}_i$, where the matrix $\boldsymbol{\phi}(t_i)$ belongs to a one-parameter subgroup of $\mathrm{GL}_n (\mathbb{R})$. Crucially, the value of the parameter $t_i \in \mathbb{R}$ may change between data pairs $(\boldsymbol{x}_i, \boldsymbol{y}_i)$ and does not need to be known in advance.
Submitted 12 November, 2020; v1 submitted 19 September, 2020;
originally announced September 2020.
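As a concrete illustration of the one-parameter subgroup in the abstract above: every such subgroup of $\mathrm{GL}_n(\mathbb{R})$ can be written as a matrix exponential $t \mapsto \exp(tA)$ for some generator matrix $A$. The sketch below fixes a hypothetical 2x2 generator (planar rotation) purely for demonstration; in the paper's setting the generator is what has to be learned, and nothing here reproduces the authors' training procedure.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical generator: the 2x2 skew-symmetric matrix, whose exponential is a rotation.
A = np.array([[0.0, -1.0], [1.0, 0.0]])


def phi(t):
    """Element of the one-parameter subgroup generated by A: phi(t) = exp(t A)."""
    return expm(t * A)


x = np.array([1.0, 0.0])
t = np.pi / 2
y_tilde = phi(t) @ x  # approximates the target paired with x; here a 90-degree rotation
print(np.round(y_tilde, 6))

# Subgroup property: phi(s + t) equals phi(s) @ phi(t).
print(np.allclose(phi(0.3) @ phi(0.5), phi(0.8)))
```

Because each data pair may carry its own value of $t_i$, a learning procedure would need to estimate both the shared generator and the per-pair parameters.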