-
Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms
Authors:
Iran R. Roman,
Christopher Ick,
Sivan Ding,
Adrian S. Roman,
Brian McFee,
Juan P. Bello
Abstract:
Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatially localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific rooms. We present SpatialScaper, a library for SELD data simulation and augmentation. Compared to existing tools, SpatialScaper emulates virtual rooms via parameters such as size and wall absorption. This allows for parameterized placement (including movement) of foreground and background sound sources. SpatialScaper also includes data augmentation pipelines that can be applied to existing SELD data. As a case study, we use SpatialScaper to add rooms to the DCASE SELD data. Training a model with our data led to progressive performance improvements as a direct function of acoustic diversity. These results show that SpatialScaper is valuable for training robust SELD models.
Submitted 19 January, 2024;
originally announced January 2024.
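The simulation step named in the abstract above (convolving a room impulse response with a sound waveform) can be illustrated with a minimal sketch. This is not SpatialScaper's API: the function names `convolve` and `spatialize` are hypothetical, the signals are plain Python lists, and a real pipeline would likely use an FFT-based convolution on sampled audio.

```python
def convolve(signal, rir):
    """Full linear convolution of a dry signal with one impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            # Each input sample excites a scaled, delayed copy of the RIR.
            out[i + j] += s * h
    return out


def spatialize(signal, rirs):
    """Hypothetical helper: one output channel per microphone RIR."""
    return [convolve(signal, rir) for rir in rirs]


# Toy example: a two-sample "event" and a two-microphone array.
dry = [1.0, 2.0]
rirs = [[1.0, 0.0, 0.5], [0.5, 0.25, 0.0]]
channels = spatialize(dry, rirs)
```

Each channel is longer than the dry signal by `len(rir) - 1` samples, which is the reverberant tail contributed by the room.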
-
Robust DOA estimation using deep acoustic imaging
Authors:
Adrian S. Roman,
Iran R. Roman,
Juan P. Bello
Abstract:
Direction of arrival estimation (DoAE) aims at tracking a sound in azimuth and elevation. Recent advancements include data-driven models with inputs derived from ambisonics intensity vectors or correlations between channels in a microphone array. A spherical intensity map (SIM), or acoustic image, is an alternative input representation that remains underexplored. SIMs benefit from high-resolution microphone arrays, yet most DoAE datasets use low-resolution ones. Therefore, we first propose a super-resolution method to upsample low-resolution microphones. Next, we benchmark DoAE models that use SIMs as input. We arrive at a model that uses SIMs for DoAE and outperforms a baseline and a state-of-the-art model. Our study highlights the relevance of acoustic imaging for DoAE tasks.
Submitted 15 January, 2024;
originally announced January 2024.
-
From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior
Authors:
Jaeho Moon,
Juan Luis Gonzalez Bello,
Byeongjun Kwon,
Munchurl Kim
Abstract:
Self-supervised monocular depth estimation (DE) is an approach to learning depth without costly depth ground truths. However, it often struggles with moving objects that violate the static scene assumption during training. To address this issue, we introduce a coarse-to-fine training strategy leveraging the ground contacting prior based on the observation that most moving objects in outdoor scenes contact the ground. In the coarse training stage, we exclude the objects in dynamic classes from the reprojection loss calculation to avoid inaccurate depth learning. To provide precise supervision on the depth of the objects, we present a novel Ground-contacting-prior Disparity Smoothness Loss (GDS-Loss) that encourages a DE network to align the depth of the objects with their ground-contacting points. Subsequently, in the fine training stage, we refine the DE network to learn the detailed depth of the objects from the reprojection loss, while ensuring accurate DE on the moving object regions by employing our regularization loss with a cost-volume-based weighting factor. Our overall coarse-to-fine training strategy can easily be integrated with existing DE methods without any modifications, significantly enhancing DE performance on challenging Cityscapes and KITTI datasets, especially in the moving object regions.
Submitted 15 December, 2023;
originally announced December 2023.
-
ProNeRF: Learning Efficient Projection-Aware Ray Sampling for Fine-Grained Implicit Neural Radiance Fields
Authors:
Juan Luis Gonzalez Bello,
Minh-Quan Viet Bui,
Munchurl Kim
Abstract:
Recent advances in neural rendering have shown that, albeit slow, implicit compact models can learn a scene's geometries and view-dependent appearances from multiple views. To maintain such a small memory footprint but achieve faster inference times, recent works have adopted `sampler' networks that adaptively sample a small subset of points along each ray in the implicit neural radiance fields. Although these methods achieve up to a 10$\times$ reduction in rendering time, they still suffer from considerable quality degradation compared to the vanilla NeRF. In contrast, we propose ProNeRF, which provides an optimal trade-off between memory footprint (similar to NeRF), speed (faster than HyperReel), and quality (better than K-Planes). ProNeRF is equipped with a novel projection-aware sampling (PAS) network together with a new training strategy for ray exploration and exploitation, allowing for efficient fine-grained particle sampling. Our ProNeRF yields state-of-the-art metrics, being 15-23x faster with 0.65dB higher PSNR than NeRF and yielding 0.95dB higher PSNR than the best published sampler-based method, HyperReel. Our exploration and exploitation training strategy allows ProNeRF to learn the full scenes' color and density distributions while also learning efficient ray sampling focused on the highest-density regions. We provide extensive experimental results that support the effectiveness of our method on the widely adopted forward-facing and 360 datasets, LLFF and Blender, respectively.
Submitted 13 December, 2023;
originally announced December 2023.
-
Novel View Synthesis with View-Dependent Effects from a Single Image
Authors:
Juan Luis Gonzalez Bello,
Munchurl Kim
Abstract:
In this paper, we consider, for the first time, view-dependent effects in single image-based novel view synthesis (NVS) problems. For this, we propose to exploit the camera motion priors in NVS to model view-dependent appearance or effects (VDE) as the negative disparity in the scene. By recognizing that specularities "follow" the camera motion, we infuse VDEs into the input images by aggregating input pixel colors along the negative depth region of the epipolar lines. Also, we propose a `relaxed volumetric rendering' approximation that allows computing the densities in a single pass, improving efficiency for NVS from single images. Our method can learn single-image NVS from image sequences only, making it a completely self-supervised learning method that, for the first time, requires neither depth nor camera pose annotations. We present extensive experimental results and show that our proposed method can learn NVS with VDEs, outperforming the SOTA single-view NVS methods on the RealEstate10k and MannequinChallenge datasets.
Submitted 13 December, 2023;
originally announced December 2023.
-
Two vs. Four-Channel Sound Event Localization and Detection
Authors:
Julia Wilkins,
Magdalena Fuentes,
Luca Bondi,
Shabnam Ghaffarzadegan,
Ali Abavisani,
Juan Pablo Bello
Abstract:
Sound event localization and detection (SELD) systems estimate both the direction-of-arrival (DOA) and class of sound sources over time. In the DCASE 2022 SELD Challenge (Task 3), models are designed to operate in a 4-channel setting. While this is beneficial for furthering the development of SELD systems using a multichannel recording setup such as first-order Ambisonics (FOA), most consumer electronics devices are rarely able to record using more than two channels. For this reason, in this work we investigate the performance of the DCASE 2022 SELD baseline model using three audio input representations: FOA, binaural, and stereo. We perform a novel comparative analysis illustrating the effect of these audio input representations on SELD performance. Crucially, we show that binaural and stereo (i.e. 2-channel) audio-based SELD models are still able to localize and detect sound sources laterally quite well, despite overall performance degrading as less audio information is provided. Further, we segment our analysis by scenes containing varying degrees of sound source polyphony to better understand the effect of audio input representation on localization and detection performance as scene conditions become increasingly complex.
Submitted 23 September, 2023;
originally announced September 2023.
-
Sound Source Distance Estimation in Diverse and Dynamic Acoustic Conditions
Authors:
Saksham Singh Kushwaha,
Iran R. Roman,
Magdalena Fuentes,
Juan Pablo Bello
Abstract:
Localizing a moving sound source in the real world involves determining its direction-of-arrival (DOA) and distance relative to a microphone. Advancements in DOA estimation have been facilitated by data-driven methods optimized with large open-source datasets with microphone array recordings in diverse environments. In contrast, estimating a sound source's distance remains understudied. Existing approaches assume recordings by non-coincident microphones to use methods that are susceptible to differences in room reverberation. We present a CRNN able to estimate the distance of moving sound sources across multiple datasets featuring diverse rooms, outperforming a recently published approach. We also characterize our model's performance as a function of sound source distance and different training losses. This analysis reveals optimal training using a loss that weighs model errors as an inverse function of the sound source's true distance. Our study is the first to demonstrate that sound source distance estimation can be performed across diverse acoustic conditions using deep learning.
Submitted 17 September, 2023;
originally announced September 2023.
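The abstract above mentions a loss that weighs model errors as an inverse function of the true source distance but does not give its exact form. The sketch below assumes one plausible instance, an absolute error divided by the true distance; the function name `inverse_distance_weighted_l1` and the `eps` stabilizer are hypothetical and are not taken from the paper.

```python
def inverse_distance_weighted_l1(pred, true, eps=1e-8):
    """Mean absolute error where each term is scaled by 1 / true distance.

    Assumed form for illustration only; the paper's loss may differ.
    """
    if not true or len(pred) != len(true):
        raise ValueError("pred and true must be non-empty and equal in length")
    total = 0.0
    for p, t in zip(pred, true):
        # A 1 m error at 10 m counts less than a 0.5 m error at 1 m.
        total += abs(p - t) / (t + eps)
    return total / len(true)


# Distances in metres (toy values).
loss = inverse_distance_weighted_l1([1.5, 9.0], [1.0, 10.0])
```

Under this assumed form the loss behaves like a relative error, so near-field mistakes are penalized more heavily than equally sized far-field ones.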
-
Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries
Authors:
Julia Wilkins,
Justin Salamon,
Magdalena Fuentes,
Juan Pablo Bello,
Oriol Nieto
Abstract:
Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio-visual training data, previous work on audio-visual retrieval relies on YouTube (in-the-wild) videos of varied quality for training, where the audio is often noisy and the video of amateur quality. As such it is unclear whether these systems would generalize to the task of matching HQ audio to production-quality video. To address this, we propose a multimodal framework for recommending HQ SFX given a video frame by (1) leveraging large language models and foundational vision-language models to bridge HQ audio and video to create audio-visual pairs, resulting in a highly scalable automatic audio-visual data curation pipeline; and (2) using pre-trained audio and visual encoders to train a contrastive learning-based retrieval system. We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of HQ SFX retrieval for video. Furthermore, while the baselines fail to generalize to this task, our system generalizes well from clean to in-the-wild data, outperforming the baselines on a dataset of YouTube videos despite only being trained on the HQ audio-visual pairs. A user study confirms that people prefer SFX retrieved by our system over the baseline 67% of the time both for HQ and in-the-wild data. Finally, we present ablations to determine the impact of model and data pipeline design choices on downstream retrieval performance. Please visit our project website to listen to and view our SFX retrieval results.
Submitted 17 August, 2023;
originally announced August 2023.
-
ARGUS: Visualization of AI-Assisted Task Guidance in AR
Authors:
Sonia Castelo,
Joao Rulff,
Erin McGowan,
Bea Steers,
Guande Wu,
Shaoyu Chen,
Iran Roman,
Roque Lopez,
Ethan Brewer,
Chen Zhao,
Jing Qian,
Kyunghyun Cho,
He He,
Qi Sun,
Huy Vo,
Juan Bello,
Michael Krone,
Claudio Silva
Abstract:
The concept of augmented reality (AR) assistants has captured the human imagination for decades, becoming a staple of modern science fiction. To pursue this goal, it is necessary to develop artificial intelligence (AI)-based methods that simultaneously perceive the 3D environment, reason about physical tasks, and model the performer, all in real-time. Within this framework, a wide variety of sensors are needed to generate data across different modalities, such as audio, video, depth, speech, and time-of-flight. The required sensors are typically part of the AR headset, providing performer sensing and interaction through visual, audio, and haptic feedback. AI assistants not only record the performer as they perform activities, but also require machine learning (ML) models to understand and assist the performer as they interact with the physical world. Therefore, developing such assistants is a challenging task. We propose ARGUS, a visual analytics system to support the development of intelligent AR assistants. Our system was designed as part of a multi-year collaboration between visualization researchers and ML and AR experts. This co-design process has led to advances in the visualization of ML in AR. Our system allows for online visualization of object, action, and step detection as well as offline analysis of previously recorded AR sessions. It visualizes not only the multimodal sensor data streams but also the output of the ML models. This allows developers to gain insights into the performer activities as well as the ML models, helping them troubleshoot, improve, and fine-tune the components of the AR assistant.
Submitted 11 August, 2023;
originally announced August 2023.
-
Audio-Text Models Do Not Yet Leverage Natural Language
Authors:
Ho-Hsiang Wu,
Oriol Nieto,
Juan Pablo Bello,
Justin Salamon
Abstract:
Multi-modal contrastive learning techniques in the audio-text domain have quickly become a highly active area of research. Most works are evaluated with standard audio retrieval and classification benchmarks assuming that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In this work, we show that state-of-the-art audio-text models do not yet really understand natural language, especially contextual concepts such as sequential or concurrent ordering of sound events. Our results suggest that existing benchmarks are not sufficient to assess these models' capabilities to match complex contexts from the audio and text modalities. We propose a Transformer-based architecture and show that, unlike prior work, it is capable of modeling the sequential relationship between sound events in the text and audio, given appropriate benchmark data. We advocate for the collection or generation of additional, diverse, data to allow future research to fully leverage natural language for audio-text modeling.
Submitted 19 March, 2023;
originally announced March 2023.
-
Granger causality test for heteroskedastic and structural-break time series using generalized least squares
Authors:
Hugo J. Bello
Abstract:
This paper proposes a novel method (GLS Granger test) to determine causal relationships between time series based on the estimation of the autocovariance matrix and generalized least squares. We show the effectiveness of the proposed autocovariance matrix estimator (the sliding autocovariance matrix) and we compare the proposed method with the classical Granger F-test via a synthetic dataset and a real dataset composed of cryptocurrencies. The simulations show that the proposed GLS Granger test captures causality more accurately than Granger F-tests in the cases of heteroskedastic or structural-break residuals. Finally, we use the proposed method to unravel unknown causal relationships between cryptocurrencies.
Submitted 8 January, 2023;
originally announced January 2023.
-
FlowGrad: Using Motion for Visual Sound Source Localization
Authors:
Rajsuryan Singh,
Pablo Zinemanas,
Xavier Serra,
Juan Pablo Bello,
Magdalena Fuentes
Abstract:
Most recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner, and by design excludes temporal information present in videos. While it proves to be effective for widely used benchmark datasets, the method falls short for challenging scenarios like urban traffic. This work introduces temporal context into the state-of-the-art methods for sound source localization in urban scenes using optical flow as a means to encode motion information. An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.
Submitted 14 April, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Urban Rhapsody: Large-scale exploration of urban soundscapes
Authors:
Joao Rulff,
Fabio Miranda,
Maryam Hosseini,
Marcos Lage,
Mark Cartwright,
Graham Dove,
Juan Bello,
Claudio T. Silva
Abstract:
Noise is one of the primary quality-of-life issues in urban environments. In addition to annoyance, noise negatively impacts public health and educational performance. While low-cost sensors can be deployed to monitor ambient noise levels at high temporal resolutions, the amount of data they produce and the complexity of these data pose significant analytical challenges. One way to address these challenges is through machine listening techniques, which are used to extract features in attempts to classify the source of noise and understand temporal patterns of a city's noise situation. However, the overwhelming number of noise sources in the urban environment and the scarcity of labeled data make it nearly impossible to create classification models with large enough vocabularies that capture the true dynamism of urban soundscapes. In this paper, we first identify a set of requirements in the yet unexplored domain of urban soundscape exploration. To satisfy the requirements and tackle the identified challenges, we propose Urban Rhapsody, a framework that combines state-of-the-art audio representation, machine learning, and visual analytics to allow users to interactively create classification models, understand noise patterns of a city, and quickly retrieve and label audio excerpts in order to create a large high-precision annotated database of urban sound recordings. We demonstrate the tool's utility through case studies performed by domain experts using data generated over the five-year deployment of a one-of-a-kind sensor network in New York City.
Submitted 25 May, 2022;
originally announced May 2022.
-
Positional Information is All You Need: A Novel Pipeline for Self-Supervised SVDE from Videos
Authors:
Juan Luis Gonzalez Bello,
Jaeho Moon,
Munchurl Kim
Abstract:
Recently, much attention has been drawn to learning the underlying 3D structures of a scene from monocular videos in a fully self-supervised fashion. One of the most challenging aspects of this task is handling the independently moving objects as they break the rigid-scene assumption. For the first time, we show that pixel positional information can be exploited to learn SVDE (Single View Depth Estimation) from videos. Our proposed moving object (MO) masks, which are induced by shifted positional information (SPI) and referred to as `SPIMO' masks, are very robust and consistently remove the independently moving objects in the scenes, allowing for better learning of SVDE from videos. Additionally, we introduce a new adaptive quantization scheme that assigns the best per-pixel quantization curve for our depth discretization. Finally, we employ existing boosting techniques in a new way to further self-supervise the depth of the moving objects. With these features, our pipeline is robust against moving objects and generalizes well to high-resolution images, even when trained with small patches, yielding state-of-the-art (SOTA) results with almost 8.5x fewer parameters than the previous works that learn from videos. We present extensive experiments on KITTI and CityScapes that show the effectiveness of our method.
Submitted 18 May, 2022;
originally announced May 2022.
-
Few-Shot Musical Source Separation
Authors:
Yu Wang,
Daniel Stoller,
Rachel M. Bittner,
Juan Pablo Bello
Abstract:
Deep learning-based approaches to musical source separation are often limited to the instrument classes that the models are trained on and do not generalize to separate unseen instruments. To address this, we propose a few-shot musical source separation paradigm. We condition a generic U-Net source separation model using a few audio examples of the target instrument. We train a few-shot conditioning encoder jointly with the U-Net to encode the audio examples into a conditioning vector to configure the U-Net via feature-wise linear modulation (FiLM). We evaluate the trained models on real musical recordings in the MUSDB18 and MedleyDB datasets. We show that our proposed few-shot conditioning paradigm outperforms the baseline one-hot instrument-class conditioned model for both seen and unseen instruments. To extend the scope of our approach to a wider variety of real-world scenarios, we also experiment with different conditioning example characteristics, including examples from different recordings, with multiple sources, or negative conditioning examples.
Submitted 2 May, 2022;
originally announced May 2022.
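Feature-wise linear modulation (FiLM), which the abstract above uses to configure the U-Net, scales and shifts each feature channel with a per-channel pair (gamma, beta). The sketch below shows only that modulation step. The function name `film` is hypothetical, and `gamma` and `beta` are passed in as plain lists here, whereas in the paper they would be derived from the conditioning vector produced by the few-shot encoder.

```python
def film(features, gamma, beta):
    """Apply y[c][t] = gamma[c] * x[c][t] + beta[c] to every channel c."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one (gamma, beta) pair per feature channel")
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Two feature channels, two time steps each (toy values).
x = [[1.0, 2.0], [3.0, 4.0]]
modulated = film(x, gamma=[2.0, 0.5], beta=[0.0, 1.0])
```

Because the modulation is per channel, the same separation network can be steered toward a different target instrument by changing only `gamma` and `beta`.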
-
Ferrimagnet GdFeCo characterization for spin-orbitronics: large field-like and damping-like torques
Authors:
Héloïse Damas,
Alberto Anadon,
David Céspedes-Berrocal,
Junior Alegre-Saenz,
Jean-Loïs Bello,
Aldo Arriola-Córdova,
Sylvie Migot,
Jaafar Ghanbaja,
Olivier Copie,
Michel Hehn,
Vincent Cros,
Sébastien Petit-Watelot,
Juan-Carlos Rojas-Sánchez
Abstract:
Spintronics is showing promising results in the search for new materials and effects to reduce energy consumption in information technology. Among these materials, ferrimagnets are of special interest, since they can produce large spin currents that trigger the magnetization dynamics of adjacent layers or even their own magnetization. Here, we present a study of the generation of spin current by GdFeCo in a GdFeCo/Cu/NiFe trilayer where the FeCo sublattice magnetization is dominant at room temperature. Magnetic properties such as the saturation magnetization are deduced from magnetometry measurements, while the damping constant is estimated from spin-torque ferromagnetic resonance (ST-FMR). We show that the overall damping-like (DL) and field-like (FL) effective fields as well as the associated spin Hall angles can be reliably obtained by measuring the dependence of ST-FMR on an added dc current. The sums of the spin Hall angles for both the spin Hall effect (SHE) and the spin anomalous Hall effect (SAHE) symmetries are: $θ_{DL}^{SAHE} + θ_{DL}^{SHE}=-0.15 \pm 0.05$ and $θ_{FL}^{SAHE} + θ_{FL}^{SHE}=0.026 \pm 0.005$. From the symmetry of ST-FMR signals we find that $θ_{DL}^{SHE}$ is positive and dominated by the negative $θ_{DL}^{SAHE}$. The present study paves the way for tuning the different symmetries in spin conversion in highly efficient ferrimagnetic systems.
Submitted 20 April, 2022;
originally announced April 2022.
-
How to Listen? Rethinking Visual Sound Localization
Authors:
Ho-Hsiang Wu,
Magdalena Fuentes,
Prem Seetharaman,
Juan Pablo Bello
Abstract:
Localizing visual sounds consists on locating the position of objects that emit sound within an image. It is a growing research area with potential applications in monitoring natural and urban environments, such as wildlife migration and urban traffic. Previous works are usually evaluated with datasets having mostly a single dominant visible object, and proposed models usually require the introduc…
▽ More
Localizing visual sounds consists on locating the position of objects that emit sound within an image. It is a growing research area with potential applications in monitoring natural and urban environments, such as wildlife migration and urban traffic. Previous works are usually evaluated with datasets having mostly a single dominant visible object, and proposed models usually require the introduction of localization modules during training or dedicated sampling strategies, but it remains unclear how these design choices play a role in the adaptability of these methods in more challenging scenarios. In this work, we analyze various model choices for visual sound localization and discuss how their different components affect the model's performance, namely the encoders' architecture, the loss function and the localization strategy. Furthermore, we study the interaction between these decisions, the model performance, and the data, by digging into different evaluation datasets spanning different difficulties and characteristics, and discuss the implications of such decisions in the context of real-world applications. Our code and model weights are open-sourced and made available for further applications.
Submitted 11 April, 2022;
originally announced April 2022.
-
A Study on Robustness to Perturbations for Representations of Environmental Sound
Authors:
Sangeeta Srivastava,
Ho-Hsiang Wu,
Joao Rulff,
Magdalena Fuentes,
Mark Cartwright,
Claudio Silva,
Anish Arora,
Juan Pablo Bello
Abstract:
Audio applications involving environmental sound analysis increasingly use general-purpose audio representations, also known as embeddings, for transfer learning. Recently, Holistic Evaluation of Audio Representations (HEAR) evaluated twenty-nine embedding models on nineteen diverse tasks. However, the evaluation's effectiveness depends on the variation already captured within a given dataset. Therefore, for a given data domain, it is unclear how the representations would be affected by the variations caused by myriad microphones' range and acoustic conditions -- commonly known as channel effects. We aim to extend HEAR to evaluate invariance to channel effects in this work. To accomplish this, we imitate channel effects by injecting perturbations into the audio signal and measure the shift in the new (perturbed) embeddings with three distance measures, making the evaluation domain-dependent but not task-dependent. Combined with the downstream performance, it helps us make a more informed prediction of how robust the embeddings are to the channel effects. We evaluate two embeddings -- YAMNet and OpenL3 -- on monophonic (UrbanSound8K) and polyphonic (SONYC-UST) urban datasets. We show that one distance measure does not suffice in such task-independent evaluation. Although Fréchet Audio Distance (FAD) correlates with the trend of the performance drop in the downstream task most accurately, we show that we need to study FAD in conjunction with the other distances to get a clear understanding of the overall effect of the perturbation. In terms of the embedding performance, we find OpenL3 to be more robust than YAMNet, which aligns with the HEAR evaluation.
Submitted 6 July, 2022; v1 submitted 19 March, 2022;
originally announced March 2022.
-
Infrastructure-free, Deep Learned Urban Noise Monitoring at $\sim$100mW
Authors:
Jihoon Yun,
Sangeeta Srivastava,
Dhrubojyoti Roy,
Nathan Stohs,
Charlie Mydlarz,
Mahin Salman,
Bea Steers,
Juan Pablo Bello,
Anish Arora
Abstract:
The Sounds of New York City (SONYC) wireless sensor network (WSN) has been fielded in Manhattan and Brooklyn over the past five years, as part of a larger human-in-the-loop cyber-physical control system for monitoring, analyzing, and mitigating urban noise pollution. We describe the evolution of the 2-tier SONYC WSN from an acoustic data collection fabric into a 3-tier in situ noise complaint monitoring WSN, and its current evaluation. The added tier consists of long-range (LoRa), multi-hop networks of a new low-power acoustic mote, MKII ("Mach 2"), that we have designed and fabricated. MKII motes are notable in three ways: First, they advance machine learning capability at mote-scale in this application domain by introducing a real-time Convolutional Neural Network (CNN) based embedding model that is competitive with alternatives while also requiring 10$\times$ less training data and $\sim$2 orders of magnitude fewer runtime resources. Second, they are conveniently deployed relatively far from higher-tier base station nodes without assuming power or network infrastructure support at operationally relevant sites (such as construction zones), yielding a relatively low-cost solution. And third, their networking is frequency agile, unlike conventional LoRa networks: it tolerates, in a distributed, self-stabilizing way, the variable external interference and link fading in the cluttered 902-928MHz ISM band urban environment by dynamically choosing good frequencies using an efficient new method that combines passive and active measurements.
Submitted 11 March, 2022;
originally announced March 2022.
-
Wav2CLIP: Learning Robust Audio Representations From CLIP
Authors:
Ho-Hsiang Wu,
Prem Seetharaman,
Kundan Kumar,
Juan Pablo Bello
Abstract:
We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification, and cross-modal retrieval. Furthermore, Wav2CLIP needs just ~10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as qualitative assessment of the shared embedding space. Our code and model weights are open sourced and made available for further applications.
Submitted 15 February, 2022; v1 submitted 21 October, 2021;
originally announced October 2021.
-
Who calls the shots? Rethinking Few-Shot Learning for Audio
Authors:
Yu Wang,
Nicholas J. Bryan,
Justin Salamon,
Mark Cartwright,
Juan Pablo Bello
Abstract:
Few-shot learning aims to train models that can recognize novel classes given just a handful of labeled examples, known as the support set. While the field has seen notable advances in recent years, they have often focused on multi-class image classification. Audio, in contrast, is often multi-label due to overlapping sounds, resulting in unique properties such as polyphony and signal-to-noise ratios (SNR). This leads to unanswered questions concerning the impact such audio properties may have on few-shot learning system design, performance, and human-computer interaction, as it is typically up to the user to collect and provide inference-time support set examples. We address these questions through a series of dedicated experiments. We introduce two novel datasets, FSD-MIX-CLIPS and FSD-MIX-SED, whose programmatic generation allows us to explore these questions systematically. Our experiments lead to audio-specific insights on few-shot learning, some of which are at odds with recent findings in the image domain: there is no best one-size-fits-all model, method, and support set selection criterion. Rather, it depends on the expected application scenario. Our code and data are available at https://github.com/wangyu/rethink-audio-fsl.
Submitted 18 October, 2021;
originally announced October 2021.
-
Soundata: A Python library for reproducible use of audio datasets
Authors:
Magdalena Fuentes,
Justin Salamon,
Pablo Zinemanas,
Martín Rocamora,
Genís Paja,
Irán R. Román,
Marius Miron,
Xavier Serra,
Juan Pablo Bello
Abstract:
Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version. It speeds up research pipelines by allowing users to quickly download a dataset, load it into memory in a standardized and reproducible way, validate that the dataset is complete and correct, and more. Soundata is based on and inspired by mirdata, and is designed to complement it by working with environmental sound, bioacoustic and speech datasets, among others. Soundata was created to be easy to use, easy to contribute to, and to increase reproducibility and standardize usage of sound datasets in a flexible way.
Submitted 4 October, 2021; v1 submitted 26 September, 2021;
originally announced September 2021.
-
Cyclic Cellular Automata and Greenberg-Hastings Models on Regular Trees
Authors:
Jason Bello,
David Sivakoff
Abstract:
We study the cyclic cellular automaton (CCA) and the Greenberg-Hastings model (GHM) with $κ\ge 3$ colors and contact threshold $θ\ge 2$ on the infinite $(d+1)$-regular tree, $T_d$. When the initial state has the uniform product distribution, we show that these dynamical systems exhibit at least two distinct phases. For sufficiently large $d$, we show that if $κ(θ-1) \le d - O(\sqrt{dκ\ln(d)})$, then every vertex almost surely changes its color infinitely often, while if $κθ\ge d + O(κ\sqrt{d\ln(d)})$, then every vertex almost surely changes its color only finitely many times. Roughly, this implies that as $d\to \infty$, there is a phase transition where $κθ/d = 1$. For the GHM dynamics, in the scenario where every vertex changes color finitely many times, we moreover give an exponential tail bound for the distribution of the time of the last color change at a given vertex.
Submitted 13 August, 2021;
originally announced August 2021.
-
Exploring modality-agnostic representations for music classification
Authors:
Ho-Hsiang Wu,
Magdalena Fuentes,
Juan P. Bello
Abstract:
Music information is often conveyed or recorded across multiple data modalities including but not limited to audio, images, text and scores. However, music information retrieval research has almost exclusively focused on single modality recognition, requiring development of separate models for each modality. Some multi-modal works require multiple coexisting modalities given to the model as inputs, constraining the use of these models to the few cases where data from all modalities are available. To the best of our knowledge, no existing model has the ability to take inputs from varying modalities, e.g. images or sounds, and classify them into unified music categories. We explore the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality. We select instrument classification as an example task for our study as both visual and audio components provide relevant semantic information. We train music instrument classifiers that can take either images or sounds as input, and perform comparably to sound-only or image-only classifiers. Furthermore, we explore the case when there is limited labeled data for a given modality, and the impact on performance of using labeled data from other modalities. We are able to achieve almost 70% of the performance of the best-performing system in a zero-shot setting. We provide a detailed analysis of experimental results to understand the potential and limitations of the approach, and discuss future steps towards modality-agnostic classifiers.
Submitted 2 June, 2021;
originally announced June 2021.
-
Weakly Supervised Source-Specific Sound Level Estimation in Noisy Soundscapes
Authors:
Aurora Cramer,
Mark Cartwright,
Fatemeh Pishdadian,
Juan Pablo Bello
Abstract:
While the estimation of what sound sources are, when they occur, and from where they originate has been well-studied, the estimation of how loud these sound sources are has often been overlooked. Current solutions to this task, which we refer to as source-specific sound level estimation (SSSLE), suffer from challenges due to the impracticality of acquiring realistic data and a lack of robustness to realistic recording conditions. Recently proposed weakly supervised source separation methods offer a means of leveraging clip-level source annotations to train source separation models, which we augment with modified loss functions to bridge the gap between source separation and SSSLE and to address the presence of background. We show that our approach improves SSSLE performance compared to baseline source separation models and provide an ablation analysis to explore our method's design choices, showing that SSSLE in practical recording and annotation scenarios is possible.
Submitted 29 July, 2021; v1 submitted 6 May, 2021;
originally announced May 2021.
-
PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss
Authors:
Juan Luis Gonzalez Bello,
Munchurl Kim
Abstract:
In this paper, we propose a self-supervised single-view pixel-level accurate depth estimation network, called PLADE-Net. The PLADE-Net is the first work that shows unprecedented accuracy levels, exceeding 95\% in terms of the $δ^1$ metric on the challenging KITTI dataset. Our PLADE-Net is based on a new network architecture with neural positional encoding and a novel loss function that borrows from the closed-form solution of the matting Laplacian to learn pixel-level accurate depth estimation from stereo images. Neural positional encoding allows our PLADE-Net to obtain more consistent depth estimates by letting the network reason about location-specific image properties such as lens and projection distortions. Our novel distilled matting Laplacian loss allows our network to predict sharp depths at object boundaries and more consistent depths in highly homogeneous regions. Our proposed method outperforms all previous self-supervised single-view depth estimation methods by a large margin on the challenging KITTI dataset, with unprecedented levels of accuracy. Furthermore, our PLADE-Net, naively extended for stereo inputs, outperforms the most recent self-supervised stereo methods, even without any advanced blocks like 1D correlations, 3D convolutions, or spatial pyramid pooling. We present extensive ablation studies and experiments that support our method's effectiveness on the KITTI, CityScapes, and Make3D datasets.
Submitted 12 March, 2021;
originally announced March 2021.
-
Suicide Classificaction for News Media Using Convolutional Neural Network
Authors:
Hugo J. Bello,
Nora Palomar-Ciria,
Enrique Baca-García,
Celia Lozano
Abstract:
Currently, the process of evaluating suicides is highly subjective, which limits the efficacy and accuracy of prevention efforts. Artificial intelligence (AI) has emerged as a means of investigating large datasets to identify patterns within "big data" that can determine the factors behind suicide outcomes. Here, we use AI tools to extract the topic from (press and social) media text. However, news media articles lack suicide tags. Using tweets with hashtags related to suicide, we train a neural model which identifies whether a given text has a suicide-related contagion. Our results suggest a strong impact of media coverage on suicide cases, and an intrinsic thematic relationship among suicide news. These results pave the way to build more interpretable suicide data, which may help to better track suicide, understand its origin, and improve prevention strategies.
Submitted 18 February, 2021;
originally announced March 2021.
-
Direct imaging of chiral domain walls and Néel-type skyrmionium in ferrimagnetic alloys
Authors:
Boris Seng,
Daniel Schönke,
Javier Yeste,
Robert M. Reeve,
Nico Kerber,
Daniel Lacour,
Jean-Loïs Bello,
Nicolas Bergeard,
Fabian Kammerbauer,
Mona Bhukta,
Tom Ferté,
Christine Boeglin,
Florin Radu,
Radu Abrudan,
Torsten Kachel,
Stéphane Mangin,
Michel Hehn,
Mathias Kläui
Abstract:
The evolution of chiral spin structures is studied in ferrimagnet Ta/Ir/Fe/GdFeCo/Pt multilayers as a function of temperature using scanning electron microscopy with polarization analysis (SEMPA). The GdFeCo ferrimagnet exhibits pure right-hand Néel-type domain wall (DW) spin textures over a large temperature range. This indicates the presence of a negative Dzyaloshinskii-Moriya interaction (DMI) that can originate from both the top Fe/Pt and the Co/Pt interfaces. From measurements of the DW width, as well as complementary magnetic characterization, the exchange stiffness as a function of temperature is ascertained. The exchange stiffness is surprisingly mostly constant, which is explained by theoretical predictions. Beyond single skyrmions, we find by direct imaging a pure Néel-type skyrmionium, which due to the absence of a skyrmion Hall angle is a promising topological spin structure to enable high impact potential applications in the next generation of spintronic devices.
Submitted 21 July, 2021; v1 submitted 26 February, 2021;
originally announced February 2021.
-
Multi-Task Self-Supervised Pre-Training for Music Classification
Authors:
Ho-Hsiang Wu,
Chieh-Chi Kao,
Qingming Tang,
Ming Sun,
Brian McFee,
Juan Pablo Bello,
Chao Wang
Abstract:
Deep learning is very data hungry, and supervised learning especially requires massive labeled data to work well. Machine listening research often suffers from the problem of limited labeled data, as human annotations are costly to acquire, and annotations for audio are time consuming and less intuitive. Besides, models learned from a labeled dataset often embed biases specific to that particular dataset. Therefore, unsupervised learning techniques have become popular approaches to solving machine listening problems. In particular, a self-supervised learning technique utilizing reconstructions of multiple hand-crafted audio features has shown promising results when applied to speech-domain tasks such as emotion recognition and automatic speech recognition (ASR). In this paper, we apply self-supervised and multi-task learning methods for pre-training music encoders, and explore various design choices including encoder architectures, weighting mechanisms to combine losses from multiple tasks, and worker selections of pretext tasks. We investigate how these design choices interact with various downstream music classification tasks. We find that using various music-specific workers together with weighting mechanisms to balance the losses during pre-training helps improve and generalize to the downstream tasks.
Submitted 5 February, 2021;
originally announced February 2021.
-
Machine Learning to study the impact of gender-based violence in the news media
Authors:
Hugo J. Bello,
Nora Palomar,
Elisa Gallego,
Lourdes Jiménez Navascués,
Celia Lozano
Abstract:
While it remains a taboo topic, gender-based violence (GBV) undermines the health, dignity, security and autonomy of its victims. Many factors that generate or maintain this kind of violence have been studied; however, the influence of the media is still uncertain. Here, we use Machine Learning tools to extrapolate the effect of the news on GBV. By feeding neural networks with news, the topic information associated with each article can be recovered. Our findings show a relationship between GBV news and public awareness, the effect of widely publicized GBV cases, and the intrinsic thematic relationship of GBV news. Because the neural model used can be easily adjusted, this also allows us to extend our approach to other media sources or topics.
Submitted 27 November, 2020;
originally announced December 2020.
-
Current-induced spin torques on single GdFeCo magnetic layers
Authors:
David Céspedes-Berrocal,
Heloïse Damas,
Sébastien Petit-Watelot,
David Maccariello,
Ping Tang,
Aldo Arriola-Córdova,
Pierre Vallobra,
Yong Xu,
Jean-Loïs Bello,
Elodie Martin,
Sylvie Migot,
Jaafar Ghanbaja,
Shufeng Zhang,
Michel Hehn,
Stéphane Mangin,
Christos Panagopoulos,
Vincent Cros,
Albert Fert,
Juan-Carlos Rojas-Sánchez
Abstract:
Spintronics exploits spin-orbit coupling (SOC) to generate spin currents, spin torques, and, in the absence of inversion symmetry, Rashba and Dzyaloshinskii-Moriya interactions (DMI). The widely used magnetic materials, based on 3d metals such as Fe and Co, possess a small SOC. To circumvent this shortcoming, the common practice has been to utilize the large SOC of nonmagnetic layers of 5d heavy metals (HMs), such as Pt, to generate spin currents by the spin Hall effect (SHE) and, in turn, exert spin torques on the magnetic layers. Here, we introduce a new class of material architectures, excluding nonmagnetic 5d HMs, for high-performance spintronics operations. We demonstrate very strong current-induced torques exerted on single GdFeCo layers due to the combination of the large SOC of the Gd 5d states and inversion symmetry breaking mainly engineered by interfaces. These "self-torques" are enhanced around the magnetization compensation temperature (close to room temperature) and can be tuned by adjusting the spin absorption outside the GdFeCo layer. In other measurements, we determine the very large emission of spin current from GdFeCo. This material platform opens new perspectives to exert "self-torques" on single magnetic layers as well as to generate spin currents from a magnetic layer.
Submitted 18 October, 2020;
originally announced October 2020.
-
SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context
Authors:
Mark Cartwright,
Jason Cramer,
Ana Elisa Mendez Mendez,
Yu Wang,
Ho-Hsiang Wu,
Vincent Lostanlen,
Magdalena Fuentes,
Graham Dove,
Charlie Mydlarz,
Justin Salamon,
Oded Nov,
Juan Pablo Bello
Abstract:
We present SONYC-UST-V2, a dataset for urban sound tagging with spatiotemporal information. This dataset is intended for the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC-UST-V2 consists of 18510 audio recordings from the "Sounds of New York City" (SONYC) acoustic sensor network, including the timestamp of audio acquisition and location of the sensor. The dataset contains annotations by volunteers from the Zooniverse citizen science platform, as well as a two-stage verification with our team. In this article, we describe our data collection procedure and propose evaluation metrics for multilabel classification of urban sound tags. We report the results of a simple baseline model that exploits spatiotemporal information.
Submitted 10 September, 2020;
originally announced September 2020.
-
Few-Shot Drum Transcription in Polyphonic Music
Authors:
Yu Wang,
Justin Salamon,
Mark Cartwright,
Nicholas J. Bryan,
Juan Pablo Bello
Abstract:
Data-driven approaches to automatic drum transcription (ADT) are often limited to a predefined, small vocabulary of percussion instrument classes. Such models cannot recognize out-of-vocabulary classes nor are they able to adapt to finer-grained vocabularies. In this work, we address open vocabulary ADT by introducing few-shot learning to the task. We train a Prototypical Network on a synthetic dataset and evaluate the model on multiple real-world ADT datasets with polyphonic accompaniment. We show that, given just a handful of selected examples at inference time, we can match and in some cases outperform a state-of-the-art supervised ADT approach under a fixed vocabulary setting. At the same time, we show that our model can successfully generalize to finer-grained or extended vocabularies unseen during training, a scenario where supervised approaches cannot operate at all. We provide a detailed analysis of our experimental results, including a breakdown of performance by sound class and by polyphony.
Submitted 6 August, 2020;
originally announced August 2020.
-
Pan-Sharpening with Color-Aware Perceptual Loss and Guided Re-Colorization
Authors:
Juan Luis Gonzalez Bello,
Soomin Seo,
Munchurl Kim
Abstract:
We present a novel color-aware perceptual (CAP) loss for learning the task of pan-sharpening. Our CAP loss is designed to focus on the deep features of a pre-trained VGG network that are more sensitive to spatial details and ignore color information to allow the network to extract the structural information from the PAN image while keeping the color from the lower resolution MS image. Additionally, we propose "guided re-colorization", which generates a pan-sharpened image with real colors from the MS input by "picking" the closest MS pixel color for each pan-sharpened pixel, as a human operator would do in manual colorization. Such a re-colorized (RC) image is completely aligned with the pan-sharpened (PS) network output and can be used as a self-supervision signal during training, or to enhance the colors in the PS image during test. We present several experiments where our network trained with our CAP loss generates natural-looking pan-sharpened images with fewer artifacts and outperforms the state of the art on the WorldView3 dataset in terms of ERGAS, SCC, and QNR metrics.
Submitted 30 June, 2020;
originally announced June 2020.
-
One or Two Components? The Scattering Transform Answers
Authors:
Vincent Lostanlen,
Alice Cohen-Hadria,
Juan Pablo Bello
Abstract:
With the aim of constructing a biologically plausible model of machine listening, we study the representation of a multicomponent stationary signal by a wavelet scattering network. First, we show that renormalizing second-order nodes by their first-order parents gives a simple numerical criterion to assess whether two neighboring components will interfere psychoacoustically. Secondly, we run a manifold learning algorithm (Isomap) on scattering coefficients to visualize the similarity space underlying parametric additive synthesis. Thirdly, we generalize the "one or two components" framework to three sine waves or more, and prove that the effective scattering depth of a Fourier series grows in logarithmic proportion to its bandwidth.
Submitted 25 June, 2020; v1 submitted 2 March, 2020;
originally announced March 2020.
-
Long-distance Detection of Bioacoustic Events with Per-channel Energy Normalization
Authors:
Vincent Lostanlen,
Kaitlin Palmer,
Elly Knight,
Christopher Clark,
Holger Klinck,
Andrew Farnsworth,
Tina Wong,
Jason Cramer,
Juan Pablo Bello
Abstract:
This paper proposes to perform unsupervised detection of bioacoustic events by pooling the magnitudes of spectrogram frames after per-channel energy normalization (PCEN). Although PCEN was originally developed for speech recognition, it also has beneficial effects in enhancing animal vocalizations, despite the presence of atmospheric absorption and intermittent noise. We prove that PCEN generalizes logarithm-based spectral flux, yet with a tunable time scale for background noise estimation. In comparison with pointwise logarithm, PCEN reduces false alarm rate by 50x in the near field and 5x in the far field, both on avian and marine bioacoustic datasets. Such improvements come at moderate computational cost and require no human intervention, thus heralding a promising future for PCEN in bioacoustics.
Submitted 1 November, 2019;
originally announced November 2019.
-
Learning the helix topology of musical pitch
Authors:
Vincent Lostanlen,
Sripathi Sridhar,
Brian McFee,
Andrew Farnsworth,
Juan Pablo Bello
Abstract:
To explain the consonance of octaves, music psychologists represent pitch as a helix where azimuth and axial coordinate correspond to pitch class and pitch height respectively. This article addresses the problem of discovering this helical structure from unlabeled audio data. We measure Pearson correlations in the constant-Q transform (CQT) domain to build a K-nearest neighbor graph between frequency subbands. Then, we run the Isomap manifold learning algorithm to represent this graph in a three-dimensional space in which straight lines approximate graph geodesics. Experiments on isolated musical notes demonstrate that the resulting manifold resembles a helix which makes a full turn at every octave. A circular shape is also found in English speech, but not in urban noise. We discuss the impact of various design choices on the visualization: instrumentarium, loudness mapping function, and number of neighbors K.
Submitted 4 February, 2020; v1 submitted 22 October, 2019;
originally announced October 2019.
-
Deep 3D Pan via adaptive "t-shaped" convolutions with global and local adaptive dilations
Authors:
Juan Luis Gonzalez Bello,
Munchurl Kim
Abstract:
Recent advances in deep learning have shown promising results in many low-level vision tasks. However, single-image view synthesis is still an open problem. In particular, the generation of new images at parallel camera views given a single input image is of great interest, as it enables 3D visualization of the 2D input scenery. We propose a novel network architecture to perform stereoscopic view synthesis at arbitrary camera positions along the X-axis, or Deep 3D Pan, with "t-shaped" adaptive kernels equipped with globally and locally adaptive dilations. Our proposed network architecture, the monster-net, is devised with a novel "t-shaped" adaptive kernel with globally and locally adaptive dilation, which can efficiently incorporate the global camera shift and handle the local 3D geometries of the target image's pixels to synthesize natural-looking 3D panned views from a 2D input image. Extensive experiments were performed on the KITTI and CityScapes datasets and our VICLAB_STEREO indoor dataset to demonstrate the efficacy of our method. Our monster-net outperforms the state-of-the-art (SOTA) method by a large margin in RMSE, PSNR, and SSIM. Our proposed monster-net is capable of reconstructing more reliable image structures in synthesized images with coherent geometry. Moreover, the disparity information that can be extracted from the "t-shaped" kernel is much more reliable than that of the SOTA for the unsupervised monocular depth estimation task, confirming the effectiveness of our method.
Submitted 20 October, 2019; v1 submitted 2 October, 2019;
originally announced October 2019.
-
Deep 3D-Zoom Net: Unsupervised Learning of Photo-Realistic 3D-Zoom
Authors:
Juan Luis Gonzalez Bello,
Munchurl Kim
Abstract:
The 3D-zoom operation is the positive translation of the camera in the Z-axis, perpendicular to the image plane. In contrast, the optical zoom changes the focal length and the digital zoom is used to enlarge a certain region of an image to the original image size. In this paper, we are the first to formulate an unsupervised 3D-zoom learning problem where images with an arbitrary zoom factor can be generated from a given single image. An unsupervised framework is convenient, as it is a challenging task to obtain a 3D-zoom dataset of natural scenes due to the need for special equipment to ensure camera movement is restricted to the Z-axis. In addition, the objects in the scenes should not move when being captured, which hinders the construction of a large dataset of outdoor scenes. We present a novel unsupervised framework to learn how to generate arbitrarily 3D-zoomed versions of a single image, not requiring a 3D-zoom ground truth, called the Deep 3D-Zoom Net. The Deep 3D-Zoom Net incorporates the following features: (i) transfer learning from a pre-trained disparity estimation network via a back re-projection reconstruction loss; (ii) a fully convolutional network architecture that models depth-image-based rendering (DIBR), taking into account high-frequency details without the need for estimating the intermediate disparity; and (iii) incorporating a discriminator network that acts as a no-reference penalty for unnaturally rendered areas. Even though there is no baseline to fairly compare our results, our method outperforms previous novel view synthesis research in terms of realistic appearance on large camera baselines. We performed extensive experiments to verify the effectiveness of our method on the KITTI and Cityscapes datasets.
Submitted 2 October, 2019; v1 submitted 20 September, 2019;
originally announced September 2019.
-
Adversarial Learning for Improved Onsets and Frames Music Transcription
Authors:
Jong Wook Kim,
Juan Pablo Bello
Abstract:
Automatic music transcription is considered to be one of the hardest problems in music information retrieval, yet recent deep learning approaches have achieved substantial improvements on transcription performance. These approaches commonly employ supervised learning models that predict various time-frequency representations, by minimizing element-wise losses such as the cross entropy function. However, applying the loss in this manner assumes conditional independence of each label given the input, and thus cannot accurately express inter-label dependencies. To address this issue, we introduce an adversarial training scheme that operates directly on the time-frequency representations and makes the output distribution closer to the ground-truth. Through adversarial learning, we achieve a consistent improvement in both frame-level and note-level metrics over Onsets and Frames, a state-of-the-art music transcription model. Our results show that adversarial learning can significantly reduce the error rate while increasing the confidence of the model estimations. Our approach is generic and applicable to any transcription model based on multi-label predictions, which are very common in music signal analysis.
Submitted 20 June, 2019;
originally announced June 2019.
-
Robust sound event detection in bioacoustic sensor networks
Authors:
Vincent Lostanlen,
Justin Salamon,
Andrew Farnsworth,
Steve Kelling,
Juan Pablo Bello
Abstract:
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNN) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60 milliseconds) and long-term (30 minutes) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which applies short-term automatic gain control to every subband in the mel-frequency spectrogram. Secondly, we replace the last dense layer in the network by a context-adaptive neural network (CA-NN) layer. Combining them yields state-of-the-art results that are unmatched by artificial data augmentation alone. We release a pre-trained version of our best performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.
Submitted 29 October, 2019; v1 submitted 20 May, 2019;
originally announced May 2019.
-
A Novel Monocular Disparity Estimation Network with Domain Transformation and Ambiguity Learning
Authors:
Juan Luis Gonzalez Bello,
Munchurl Kim
Abstract:
Convolutional neural networks (CNNs) have shown state-of-the-art results for low-level computer vision problems such as stereo and monocular disparity estimation, but still have much room for improvement in terms of accuracy, number of parameters, etc. Recent works have uncovered the advantages of using an unsupervised scheme to train CNNs to estimate monocular disparity, where only the relatively-easy-to-obtain stereo images are needed for training. We propose a novel encoder-decoder architecture that outperforms previous unsupervised monocular depth estimation networks by (i) taking into account ambiguities, (ii) efficient fusion between encoder and decoder features with rectangular convolutions and (iii) domain transformations between encoder and decoder. Our architecture outperforms the Monodepth baseline in all metrics, even with a considerable reduction of parameters. Furthermore, our architecture is capable of estimating a full disparity map in a single forward pass, whereas the baseline needs two passes. We perform extensive experiments to verify the effectiveness of our method on the KITTI dataset.
Submitted 20 March, 2019;
originally announced March 2019.
-
The life of a New York City noise sensor network
Authors:
Charlie Mydlarz,
Mohit Sharma,
Yitzchak Lockerman,
Ben Steers,
Claudio Silva,
Juan Pablo Bello
Abstract:
Noise pollution is one of the topmost quality of life issues for urban residents in the United States. Continued exposure to high levels of noise has proven effects on health, including acute effects such as sleep disruption, and long-term effects such as hypertension, heart disease, and hearing loss. To investigate and ultimately aid in the mitigation of urban noise, a network of 55 sensor nodes has been deployed across New York City for over two years, collecting sound pressure level (SPL) and audio data. This network has cumulatively amassed over 75 years of calibrated, high-resolution SPL measurements and 35 years of audio data. In addition, high-frequency telemetry data has been collected that provides an indication of a sensor's health. This telemetry data was analyzed over an 18-month period across 31 of the sensors. It has been used to develop a prototype model for pre-failure detection which has the ability to identify sensors in a prefail state 69.1% of the time. The entire network infrastructure is outlined, including the operation of the sensors, followed by an analysis of its data yield, the development of the fault detection approach, and the plans for its future system integration.
Submitted 26 March, 2019; v1 submitted 7 March, 2019;
originally announced March 2019.
-
Neural Music Synthesis for Flexible Timbre Control
Authors:
Jong Wook Kim,
Rachel Bittner,
Aparna Kumar,
Juan Pablo Bello
Abstract:
The recent success of raw audio waveform synthesis models like WaveNet motivates a new approach for music synthesis, in which the entire process --- creating audio samples from a score and instrument information --- is modeled using generative neural networks. This paper describes a neural music synthesis model with flexible timbre controls, which consists of a recurrent neural network conditioned on a learned instrument embedding followed by a WaveNet vocoder. The learned embedding space successfully captures the diverse variations in timbres within a large dataset and enables timbre control and morphing by interpolating between instruments in the embedding space. The synthesis quality is evaluated both numerically and perceptually, and an interactive web demo is presented.
Submitted 1 November, 2018;
originally announced November 2018.
-
Multitask Learning for Fundamental Frequency Estimation in Music
Authors:
Rachel M. Bittner,
Brian McFee,
Juan P. Bello
Abstract:
Fundamental frequency (f0) estimation from polyphonic music includes the tasks of multiple-f0, melody, vocal, and bass line estimation. Historically these problems have been approached separately, and only recently with learning-based approaches. We present a multitask deep learning architecture that jointly estimates outputs for various tasks including multiple-f0, melody, vocal and bass line estimation, and is trained using a large, semi-automatically annotated dataset. We show that the multitask model outperforms its single-task counterparts, explore the effect of various design decisions in our approach, and show that it performs better than, or at least competitively with, strong baseline methods.
Submitted 2 September, 2018;
originally announced September 2018.
-
SONYC: A System for the Monitoring, Analysis and Mitigation of Urban Noise Pollution
Authors:
Juan Pablo Bello,
Claudio Silva,
Oded Nov,
R. Luke DuBois,
Anish Arora,
Justin Salamon,
Charles Mydlarz,
Harish Doraiswamy
Abstract:
We present the Sounds of New York City (SONYC) project, a smart cities initiative focused on developing a cyber-physical system for the monitoring, analysis and mitigation of urban noise pollution. Noise pollution is one of the topmost quality of life issues for urban residents in the U.S. with proven effects on health, education, the economy, and the environment. Yet, most cities lack the resources to continuously monitor noise and understand the contribution of individual sources, the tools to analyze patterns of noise pollution at city-scale, and the means to empower city agencies to take effective, data-driven action for noise mitigation. The SONYC project advances novel technological and socio-technical solutions that help address these needs.
SONYC includes a distributed network of both sensors and people for large-scale noise monitoring. The sensors use low-cost, low-power technology, and cutting-edge machine listening techniques, to produce calibrated acoustic measurements and recognize individual sound sources in real time. Citizen science methods are used to help urban residents connect to city agencies and each other, understand their noise footprint, and facilitate reporting and self-regulation. Crucially, SONYC utilizes big data solutions to analyze, retrieve and visualize information from sensors and citizens, creating a comprehensive acoustic model of the city that can be used to identify significant patterns of noise pollution. These data can be used to drive the strategic application of noise code enforcement by city agencies to optimize the reduction of noise pollution. The entire system, integrating cyber, physical and social infrastructure, forms a closed loop of continuous sensing, analysis and actuation on the environment.
SONYC provides a blueprint for the mitigation of noise pollution that can potentially be applied to other cities in the US and abroad.
Submitted 18 May, 2018; v1 submitted 2 May, 2018;
originally announced May 2018.
-
Adaptive pooling operators for weakly labeled sound event detection
Authors:
Brian McFee,
Justin Salamon,
Juan Pablo Bello
Abstract:
Sound event detection (SED) methods are tasked with labeling segments of audio recordings by the presence of active sound sources. SED is typically posed as a supervised machine learning problem, requiring strong annotations for the presence or absence of each sound source at every time instant within the recording. However, strong annotations of this type are both labor- and cost-intensive for human annotators to produce, which limits the practical scalability of SED methods.
In this work, we treat SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality. The models, however, must still produce temporally dynamic predictions, which must be aggregated (pooled) when comparing against static labels during training. To facilitate this aggregation, we develop a family of adaptive pooling operators---referred to as auto-pool---which smoothly interpolate between common pooling operators, such as min-, max-, or average-pooling, and automatically adapt to the characteristics of the sound sources in question. We evaluate the proposed pooling operators on three datasets, and demonstrate that in each case, the proposed methods outperform non-adaptive pooling operators for static prediction, and nearly match the performance of models trained with strong, dynamic annotations. The proposed method is evaluated in conjunction with convolutional neural networks, but can be readily applied to any differentiable model for time-series label prediction.
Submitted 10 August, 2018; v1 submitted 26 April, 2018;
originally announced April 2018.
-
CREPE: A Convolutional Representation for Pitch Estimation
Authors:
Jong Wook Kim,
Justin Salamon,
Peter Li,
Juan Pablo Bello
Abstract:
The task of estimating the fundamental frequency of a monophonic sound recording, also known as pitch tracking, is fundamental to audio processing with multiple applications in speech processing and music information retrieval. To date, the best performing techniques, such as the pYIN algorithm, are based on a combination of DSP pipelines and heuristics. While such techniques perform very well on average, there remain many cases in which they fail to correctly estimate the pitch. In this paper, we propose a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform. We show that the proposed model produces state-of-the-art results, performing as well as or better than pYIN. Furthermore, we evaluate the model's generalizability in terms of noise robustness. A pre-trained version of CREPE is made freely available as an open-source Python module for easy application.
Submitted 16 February, 2018;
originally announced February 2018.
-
Confidence regions for neutrino oscillation parameters from double-Chooz data
Authors:
B. Vargas Perez,
J. García-Ravelo,
Dionisio Tun,
Jorge Garcia Bello,
Jesús Escamilla Roa
Abstract:
In this work, an independent and detailed statistical analysis of the double-Chooz experiment is performed. To have a thorough understanding of the implications of the double-Chooz data on both oscillation parameters $\sin^{2}(2θ_{13})$ and $Δm^2_{31}$, we decided to analyze the data corresponding to the Far detector, with no additional restriction. By doing this, confidence regions and best fit values are obtained for ($\sin^{2}(2θ_{13}),Δm^2_{31}$). This analysis yields an out-of-order $Δm^2_{31}$ minimum, which has already been mentioned in previous works, and it is corrected with the inclusion of additional restrictions. With such restrictions it is obtained that $\sin ^{2}(2 θ_{13})=0{.}084_{-0{.}028}^{+0{.}030}$ and $Δm^2_{31}=2.444^{+0.187}_{-0.215} \times 10^{-3}$ eV$^2$/c$^4$. Our analysis allows us to study the effects of the so-called "spectral bump" around 5 MeV: it is observed that a variation of this spectral bump may be able to move the $Δm^2_{31}$ best fit value, in such a way that $Δm^2_{31}$ takes the order of magnitude of the MINOS value. Finally, and with the intention of understanding the effects of the preliminary Near detector data, we performed two different analyses, aiming to eliminate the effects of the energy bump. As a consequence, it is found that unlike the Far Detector analysis, the Near detector data may be able to fully determine both oscillation parameters by itself, resulting in $\sin^2(2θ_{13}) = 0.095 \pm 0.053$ and $Δm^{2}_{31} = 2.63^{+0.98}_{-1.15} \times 10^{-3} \text{eV}^2 / \text{c}^4$. The later analyses represent an improvement with respect to previous works, where additional constraints for $Δm^2_{31}$ were necessary.
Submitted 1 August, 2018; v1 submitted 14 December, 2017;
originally announced December 2017.
-
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
Authors:
Justin Salamon,
Juan Pablo Bello
Abstract:
The ability of deep convolutional neural networks (CNN) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep convolutional neural network architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a "shallow" dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.
Submitted 28 November, 2016; v1 submitted 15 August, 2016;
originally announced August 2016.