Search | arXiv e-print repository

arXiv:2401.12238 [pdf, other]

Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms

Authors: Iran R. Roman, Christopher Ick, Sivan Ding, Adrian S. Roman, Brian McFee, Juan P. Bello

Abstract: Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatially localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific rooms. We present SpatialScaper, a library for SELD data simulation and augmentation. Compared to existing tools, SpatialScaper emulates virtual rooms via parameters such as size and wall absorption. This allows for parameterized placement (including movement) of foreground and background sound sources. SpatialScaper also includes data augmentation pipelines that can be applied to existing SELD data. As a case study, we use SpatialScaper to add rooms to the DCASE SELD data. Training a model with our data led to progressive performance improvements as a direct function of acoustic diversity. These results show that SpatialScaper is valuable to train robust SELD models.

Submitted 19 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures, 1 table, to be presented at ICASSP 2024 in Seoul, South Korea

arXiv:2401.08717 [pdf, other]

Robust DOA estimation using deep acoustic imaging

Authors: Adrian S. Roman, Iran R. Roman, Juan P. Bello

Abstract: Direction of arrival estimation (DoAE) aims at tracking a sound in azimuth and elevation. Recent advancements include data-driven models with inputs derived from ambisonics intensity vectors or correlations between channels in a microphone array. A spherical intensity map (SIM), or acoustic image, is an alternative input representation that remains underexplored. SIMs benefit from high-resolution microphone arrays, yet most DoAE datasets use low-resolution ones. Therefore, we first propose a super-resolution method to upsample low-resolution microphones. Next, we benchmark DoAE models that use SIMs as input. We arrive at a model that uses SIMs for DoAE and outperforms a baseline and a state-of-the-art model. Our study highlights the relevance of acoustic imaging for DoAE tasks.

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2312.10118 [pdf, other]

From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior

Authors: Jaeho Moon, Juan Luis Gonzalez Bello, Byeongjun Kwon, Munchurl Kim

Abstract: Self-supervised monocular depth estimation (DE) is an approach to learning depth without costly depth ground truths. However, it often struggles with moving objects that violate the static scene assumption during training. To address this issue, we introduce a coarse-to-fine training strategy leveraging the ground contacting prior based on the observation that most moving objects in outdoor scenes contact the ground. In the coarse training stage, we exclude the objects in dynamic classes from the reprojection loss calculation to avoid inaccurate depth learning. To provide precise supervision on the depth of the objects, we present a novel Ground-contacting-prior Disparity Smoothness Loss (GDS-Loss) that encourages a DE network to align the depth of the objects with their ground-contacting points. Subsequently, in the fine training stage, we refine the DE network to learn the detailed depth of the objects from the reprojection loss, while ensuring accurate DE on the moving object regions by employing our regularization loss with a cost-volume-based weighting factor. Our overall coarse-to-fine training strategy can easily be integrated with existing DE methods without any modifications, significantly enhancing DE performance on challenging Cityscapes and KITTI datasets, especially in the moving object regions.

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2312.08136 [pdf, other]

ProNeRF: Learning Efficient Projection-Aware Ray Sampling for Fine-Grained Implicit Neural Radiance Fields

Authors: Juan Luis Gonzalez Bello, Minh-Quan Viet Bui, Munchurl Kim

Abstract: Recent advances in neural rendering have shown that, albeit slow, implicit compact models can learn a scene's geometries and view-dependent appearances from multiple views. To maintain such a small memory footprint but achieve faster inference times, recent works have adopted `sampler' networks that adaptively sample a small subset of points along each ray in the implicit neural radiance fields. Although these methods achieve up to a 10$\times$ reduction in rendering time, they still suffer from considerable quality degradation compared to the vanilla NeRF. In contrast, we propose ProNeRF, which provides an optimal trade-off between memory footprint (similar to NeRF), speed (faster than HyperReel), and quality (better than K-Planes). ProNeRF is equipped with a novel projection-aware sampling (PAS) network together with a new training strategy for ray exploration and exploitation, allowing for efficient fine-grained particle sampling. Our ProNeRF yields state-of-the-art metrics, being 15-23x faster with 0.65dB higher PSNR than NeRF and yielding 0.95dB higher PSNR than the best published sampler-based method, HyperReel. Our exploration and exploitation training strategy allows ProNeRF to learn the full scenes' color and density distributions while also learning efficient ray sampling focused on the highest-density regions. We provide extensive experimental results that support the effectiveness of our method on the widely adopted forward-facing and 360 datasets, LLFF and Blender, respectively.

Submitted 13 December, 2023; originally announced December 2023.

Comments: Visit our project website at https://kaist-viclab.github.io/pronerf-site/

arXiv:2312.08071 [pdf, other]

Novel View Synthesis with View-Dependent Effects from a Single Image

Authors: Juan Luis Gonzalez Bello, Munchurl Kim

Abstract: In this paper, we consider, for the first time, view-dependent effects in single-image-based novel view synthesis (NVS) problems. For this, we propose to exploit the camera motion priors in NVS to model view-dependent appearance or effects (VDE) as the negative disparity in the scene. By recognizing that specularities "follow" the camera motion, we infuse VDEs into the input images by aggregating input pixel colors along the negative depth region of the epipolar lines. Also, we propose a `relaxed volumetric rendering' approximation that allows computing the densities in a single pass, improving efficiency for NVS from single images. Our method can learn single-image NVS from image sequences only, which is a completely self-supervised learning method, for the first time requiring neither depth nor camera pose annotations. We present extensive experimental results and show that our proposed method can learn NVS with VDEs, outperforming the SOTA single-view NVS methods on the RealEstate10k and MannequinChallenge datasets.

Submitted 13 December, 2023; originally announced December 2023.

Comments: Visit our website https://kaist-viclab.github.io/monovde-site

arXiv:2309.13343 [pdf, other]

Two vs. Four-Channel Sound Event Localization and Detection

Authors: Julia Wilkins, Magdalena Fuentes, Luca Bondi, Shabnam Ghaffarzadegan, Ali Abavisani, Juan Pablo Bello

Abstract: Sound event localization and detection (SELD) systems estimate both the direction-of-arrival (DOA) and class of sound sources over time. In the DCASE 2022 SELD Challenge (Task 3), models are designed to operate in a 4-channel setting. While beneficial to further the development of SELD systems using a multichannel recording setup such as first-order Ambisonics (FOA), most consumer electronics devices are rarely able to record using more than two channels. For this reason, in this work we investigate the performance of the DCASE 2022 SELD baseline model using three audio input representations: FOA, binaural, and stereo. We perform a novel comparative analysis illustrating the effect of these audio input representations on SELD performance. Crucially, we show that binaural and stereo (i.e. 2-channel) audio-based SELD models are still able to localize and detect sound sources laterally quite well, despite overall performance degrading as less audio information is provided. Further, we segment our analysis by scenes containing varying degrees of sound source polyphony to better understand the effect of audio input representation on localization and detection performance as scene conditions become increasingly complex.

Submitted 23 September, 2023; originally announced September 2023.

arXiv:2309.09288 [pdf, other]

Sound Source Distance Estimation in Diverse and Dynamic Acoustic Conditions

Authors: Saksham Singh Kushwaha, Iran R. Roman, Magdalena Fuentes, Juan Pablo Bello

Abstract: Localizing a moving sound source in the real world involves determining its direction-of-arrival (DOA) and distance relative to a microphone. Advancements in DOA estimation have been facilitated by data-driven methods optimized with large open-source datasets with microphone array recordings in diverse environments. In contrast, estimating a sound source's distance remains understudied. Existing approaches assume recordings by non-coincident microphones to use methods that are susceptible to differences in room reverberation. We present a CRNN able to estimate the distance of moving sound sources across multiple datasets featuring diverse rooms, outperforming a recently-published approach. We also characterize our model's performance as a function of sound source distance and different training losses. This analysis reveals optimal training using a loss that weighs model errors as an inverse function of the sound source true distance. Our study is the first to demonstrate that sound source distance estimation can be performed across diverse acoustic conditions using deep learning.

Submitted 17 September, 2023; originally announced September 2023.

Comments: Accepted in WASPAA 2023

arXiv:2308.09089 [pdf, other]

Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

Authors: Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto

Abstract: Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio-visual training data, previous work on audio-visual retrieval relies on YouTube (in-the-wild) videos of varied quality for training, where the audio is often noisy and the video of amateur quality. As such it is unclear whether these systems would generalize to the task of matching HQ audio to production-quality video. To address this, we propose a multimodal framework for recommending HQ SFX given a video frame by (1) leveraging large language models and foundational vision-language models to bridge HQ audio and video to create audio-visual pairs, resulting in a highly scalable automatic audio-visual data curation pipeline; and (2) using pre-trained audio and visual encoders to train a contrastive learning-based retrieval system. We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of HQ SFX retrieval for video. Furthermore, while the baselines fail to generalize to this task, our system generalizes well from clean to in-the-wild data, outperforming the baselines on a dataset of YouTube videos despite only being trained on the HQ audio-visual pairs. A user study confirms that people prefer SFX retrieved by our system over the baseline 67% of the time both for HQ and in-the-wild data. Finally, we present ablations to determine the impact of model and data pipeline design choices on downstream retrieval performance. Please visit our project website to listen to and view our SFX retrieval results.

Submitted 17 August, 2023; originally announced August 2023.

Comments: WASPAA 2023. Project page: https://juliawilkins.github.io/sound-effects-retrieval-from-video/. 4 pages, 2 figures, 2 tables

arXiv:2308.06246 [pdf, other]

ARGUS: Visualization of AI-Assisted Task Guidance in AR

Authors: Sonia Castelo, Joao Rulff, Erin McGowan, Bea Steers, Guande Wu, Shaoyu Chen, Iran Roman, Roque Lopez, Ethan Brewer, Chen Zhao, Jing Qian, Kyunghyun Cho, He He, Qi Sun, Huy Vo, Juan Bello, Michael Krone, Claudio Silva

Abstract: The concept of augmented reality (AR) assistants has captured the human imagination for decades, becoming a staple of modern science fiction. To pursue this goal, it is necessary to develop artificial intelligence (AI)-based methods that simultaneously perceive the 3D environment, reason about physical tasks, and model the performer, all in real-time. Within this framework, a wide variety of sensors are needed to generate data across different modalities, such as audio, video, depth, speech, and time-of-flight. The required sensors are typically part of the AR headset, providing performer sensing and interaction through visual, audio, and haptic feedback. AI assistants not only record the performer as they perform activities, but also require machine learning (ML) models to understand and assist the performer as they interact with the physical world. Therefore, developing such assistants is a challenging task. We propose ARGUS, a visual analytics system to support the development of intelligent AR assistants. Our system was designed as part of a multi-year collaboration between visualization researchers and ML and AR experts. This co-design process has led to advances in the visualization of ML in AR. Our system allows for online visualization of object, action, and step detection as well as offline analysis of previously recorded AR sessions. It visualizes not only the multimodal sensor data streams but also the output of the ML models. This allows developers to gain insights into the performer activities as well as the ML models, helping them troubleshoot, improve, and fine-tune the components of the AR assistant.

Submitted 11 August, 2023; originally announced August 2023.

Comments: 11 pages, 8 figures. This is the author's version of the article that has been accepted for publication in IEEE Transactions on Visualization and Computer Graphics

arXiv:2303.10667 [pdf, other]

Audio-Text Models Do Not Yet Leverage Natural Language

Authors: Ho-Hsiang Wu, Oriol Nieto, Juan Pablo Bello, Justin Salamon

Abstract: Multi-modal contrastive learning techniques in the audio-text domain have quickly become a highly active area of research. Most works are evaluated with standard audio retrieval and classification benchmarks assuming that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In this work, we show that state-of-the-art audio-text models do not yet really understand natural language, especially contextual concepts such as sequential or concurrent ordering of sound events. Our results suggest that existing benchmarks are not sufficient to assess these models' capabilities to match complex contexts from the audio and text modalities. We propose a Transformer-based architecture and show that, unlike prior work, it is capable of modeling the sequential relationship between sound events in the text and audio, given appropriate benchmark data. We advocate for the collection or generation of additional, diverse, data to allow future research to fully leverage natural language for audio-text modeling.

Submitted 19 March, 2023; originally announced March 2023.

Comments: Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2301.03085 [pdf, other]

Granger causality test for heteroskedastic and structural-break time series using generalized least squares

Authors: Hugo J. Bello

Abstract: This paper proposes a novel method (GLS Granger test) to determine causal relationships between time series based on the estimation of the autocovariance matrix and generalized least squares. We show the effectiveness of the proposed autocovariance matrix estimator (the sliding autocovariance matrix) and we compare the proposed method with the classical Granger F-test via a synthetic dataset and a real dataset composed of cryptocurrencies. The simulations show that the proposed GLS Granger test captures causality more accurately than Granger F-tests in the cases of heteroskedastic or structural-break residuals. Finally, we use the proposed method to unravel unknown causal relationships between cryptocurrencies.

Submitted 8 January, 2023; originally announced January 2023.

arXiv:2211.08367 [pdf, other]

FlowGrad: Using Motion for Visual Sound Source Localization

Authors: Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello, Magdalena Fuentes

Abstract: Most recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner, and by design excludes temporal information present in videos. While it proves to be effective for widely used benchmark datasets, the method falls short for challenging scenarios like urban traffic. This work introduces temporal context into the state-of-the-art methods for sound source localization in urban scenes using optical flow as a means to encode motion information. An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.

Submitted 14 April, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: Accepted in ICASSP 2023

arXiv:2205.13064 [pdf, other]

doi 10.1111/cgf.14534

Urban Rhapsody: Large-scale exploration of urban soundscapes

Authors: Joao Rulff, Fabio Miranda, Maryam Hosseini, Marcos Lage, Mark Cartwright, Graham Dove, Juan Bello, Claudio T. Silva

Abstract: Noise is one of the primary quality-of-life issues in urban environments. In addition to annoyance, noise negatively impacts public health and educational performance. While low-cost sensors can be deployed to monitor ambient noise levels at high temporal resolutions, the amount of data they produce and the complexity of these data pose significant analytical challenges. One way to address these challenges is through machine listening techniques, which are used to extract features in attempts to classify the source of noise and understand temporal patterns of a city's noise situation. However, the overwhelming number of noise sources in the urban environment and the scarcity of labeled data make it nearly impossible to create classification models with large enough vocabularies that capture the true dynamism of urban soundscapes. In this paper, we first identify a set of requirements in the yet unexplored domain of urban soundscape exploration. To satisfy the requirements and tackle the identified challenges, we propose Urban Rhapsody, a framework that combines state-of-the-art audio representation, machine learning, and visual analytics to allow users to interactively create classification models, understand noise patterns of a city, and quickly retrieve and label audio excerpts in order to create a large high-precision annotated database of urban sound recordings. We demonstrate the tool's utility through case studies performed by domain experts using data generated over the five-year deployment of a one-of-a-kind sensor network in New York City.

Submitted 25 May, 2022; originally announced May 2022.

Comments: Accepted at EuroVis 2022. Source code available at: https://github.com/VIDA-NYU/Urban-Rhapsody

arXiv:2205.08851 [pdf, other]

Positional Information is All You Need: A Novel Pipeline for Self-Supervised SVDE from Videos

Authors: Juan Luis Gonzalez Bello, Jaeho Moon, Munchurl Kim

Abstract: Recently, much attention has been drawn to learning the underlying 3D structures of a scene from monocular videos in a fully self-supervised fashion. One of the most challenging aspects of this task is handling the independently moving objects as they break the rigid-scene assumption. For the first time, we show that pixel positional information can be exploited to learn SVDE (Single View Depth Estimation) from videos. Our proposed moving object (MO) masks, which are induced by shifted positional information (SPI) and referred to as `SPIMO' masks, are very robust and consistently remove the independently moving objects in the scenes, allowing for better learning of SVDE from videos. Additionally, we introduce a new adaptive quantization scheme that assigns the best per-pixel quantization curve for our depth discretization. Finally, we employ existing boosting techniques in a new way to further self-supervise the depth of the moving objects. With these features, our pipeline is robust against moving objects and generalizes well to high-resolution images, even when trained with small patches, yielding state-of-the-art (SOTA) results with almost 8.5x fewer parameters than the previous works that learn from videos. We present extensive experiments on KITTI and CityScapes that show the effectiveness of our method.

Submitted 18 May, 2022; originally announced May 2022.

arXiv:2205.01273 [pdf, other]

Few-Shot Musical Source Separation

Authors: Yu Wang, Daniel Stoller, Rachel M. Bittner, Juan Pablo Bello

Abstract: Deep learning-based approaches to musical source separation are often limited to the instrument classes that the models are trained on and do not generalize to separate unseen instruments. To address this, we propose a few-shot musical source separation paradigm. We condition a generic U-Net source separation model using a few audio examples of the target instrument. We train a few-shot conditioning encoder jointly with the U-Net to encode the audio examples into a conditioning vector to configure the U-Net via feature-wise linear modulation (FiLM). We evaluate the trained models on real musical recordings in the MUSDB18 and MedleyDB datasets. We show that our proposed few-shot conditioning paradigm outperforms the baseline one-hot instrument-class conditioned model for both seen and unseen instruments. To extend the scope of our approach to a wider variety of real-world scenarios, we also experiment with different conditioning example characteristics, including examples from different recordings, with multiple sources, or negative conditioning examples.

Submitted 2 May, 2022; originally announced May 2022.

Comments: ICASSP 2022

arXiv:2204.09776 [pdf]

doi 10.1002/pssr.202200035

Ferrimagnet GdFeCo characterization for spin-orbitronics: large field-like and damping-like torques

Authors: Héloïse Damas, Alberto Anadon, David Céspedes-Berrocal, Junior Alegre-Saenz, Jean-Loïs Bello, Aldo Arriola-Córdova, Sylvie Migot, Jaafar Ghanbaja, Olivier Copie, Michel Hehn, Vincent Cros, Sébastien Petit-Watelot, Juan-Carlos Rojas-Sánchez

Abstract: Spintronics is showing promising results in the search for new materials and effects to reduce energy consumption in information technology. Among these materials, ferrimagnets are of special interest, since they can produce large spin currents that trigger the magnetization dynamics of adjacent layers or even their own magnetization. Here, we present a study of the generation of spin current by GdFeCo in a GdFeCo/Cu/NiFe trilayer where the FeCo sublattice magnetization is dominant at room temperature. Magnetic properties such as the saturation magnetization are deduced from magnetometry measurements, while the damping constant is estimated from spin-torque ferromagnetic resonance (ST-FMR). We show that the overall damping-like (DL) and field-like (FL) effective fields as well as the associated spin Hall angles can be reliably obtained by measuring the dependence of ST-FMR on an added dc current. The sums of the spin Hall angles for both the spin Hall effect (SHE) and the spin anomalous Hall effect (SAHE) symmetries are: $\theta_{DL}^{SAHE} + \theta_{DL}^{SHE}=-0.15 \pm 0.05$ and $\theta_{FL}^{SAHE} + \theta_{FL}^{SHE}=0.026 \pm 0.005$. From the symmetry of ST-FMR signals we find that $\theta_{DL}^{SHE}$ is positive and dominated by the negative $\theta_{DL}^{SAHE}$. The present study paves the way for tuning the different symmetries in spin conversion in highly efficient ferrimagnetic systems.

Submitted 20 April, 2022; originally announced April 2022.

Comments: 20 pages, 4 figures

Journal ref: Physica Status Solidi - Rapid Research Letters 2022

arXiv:2204.05156 [pdf, other]

How to Listen? Rethinking Visual Sound Localization

Authors: Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan Pablo Bello

Abstract: Localizing visual sounds consists of locating the position of objects that emit sound within an image. It is a growing research area with potential applications in monitoring natural and urban environments, such as wildlife migration and urban traffic. Previous works are usually evaluated with datasets having mostly a single dominant visible object, and proposed models usually require the introduction of localization modules during training or dedicated sampling strategies, but it remains unclear how these design choices play a role in the adaptability of these methods in more challenging scenarios. In this work, we analyze various model choices for visual sound localization and discuss how their different components affect the model's performance, namely the encoders' architecture, the loss function and the localization strategy. Furthermore, we study the interaction between these decisions, the model performance, and the data, by digging into different evaluation datasets spanning different difficulties and characteristics, and discuss the implications of such decisions in the context of real-world applications. Our code and model weights are open-sourced and made available for further applications.

Submitted 11 April, 2022; originally announced April 2022.

Comments: Submitted to INTERSPEECH 2022

arXiv:2203.10425 [pdf, other]

A Study on Robustness to Perturbations for Representations of Environmental Sound

Authors: Sangeeta Srivastava, Ho-Hsiang Wu, Joao Rulff, Magdalena Fuentes, Mark Cartwright, Claudio Silva, Anish Arora, Juan Pablo Bello

Abstract: Audio applications involving environmental sound analysis increasingly use general-purpose audio representations, also known as embeddings, for transfer learning. Recently, Holistic Evaluation of Audio Representations (HEAR) evaluated twenty-nine embedding models on nineteen diverse tasks. However, the evaluation's effectiveness depends on the variation already captured within a given dataset. Therefore, for a given data domain, it is unclear how the representations would be affected by the variations caused by myriad microphones' range and acoustic conditions -- commonly known as channel effects. We aim to extend HEAR to evaluate invariance to channel effects in this work. To accomplish this, we imitate channel effects by injecting perturbations to the audio signal and measure the shift in the new (perturbed) embeddings with three distance measures, making the evaluation domain-dependent but not task-dependent. Combined with the downstream performance, it helps us make a more informed prediction of how robust the embeddings are to the channel effects. We evaluate two embeddings -- YAMNet and OpenL3 -- on monophonic (UrbanSound8K) and polyphonic (SONYC-UST) urban datasets. We show that one distance measure does not suffice in such task-independent evaluation. Although Fréchet Audio Distance (FAD) correlates with the trend of the performance drop in the downstream task most accurately, we show that we need to study FAD in conjunction with the other distances to get a clear understanding of the overall effect of the perturbation. In terms of the embedding performance, we find OpenL3 to be more robust than YAMNet, which aligns with the HEAR evaluation.

Submitted 6 July, 2022; v1 submitted 19 March, 2022; originally announced March 2022.

Comments: Accepted in EUSIPCO 2022

arXiv:2203.06220 [pdf, other]

Infrastructure-free, Deep Learned Urban Noise Monitoring at $\sim$100mW

Authors: Jihoon Yun, Sangeeta Srivastava, Dhrubojyoti Roy, Nathan Stohs, Charlie Mydlarz, Mahin Salman, Bea Steers, Juan Pablo Bello, Anish Arora

Abstract: The Sounds of New York City (SONYC) wireless sensor network (WSN) has been fielded in Manhattan and Brooklyn over the past five years, as part of a larger human-in-the-loop cyber-physical control system for monitoring, analyzing, and mitigating urban noise pollution. We describe the evolution of the 2-tier SONYC WSN from an acoustic data collection fabric into a 3-tier in situ noise complaint monitoring WSN, and its current evaluation. The added tier consists of long-range (LoRa), multi-hop networks of a new low-power acoustic mote, MKII ("Mach 2"), that we have designed and fabricated. MKII motes are notable in three ways: First, they advance machine learning capability at mote-scale in this application domain by introducing a real-time Convolutional Neural Network (CNN) based embedding model that is competitive with alternatives while also requiring 10$\times$ less training data and $\sim$2 orders of magnitude fewer runtime resources. Second, they are conveniently deployed relatively far from higher-tier base station nodes without assuming power or network infrastructure support at operationally relevant sites (such as construction zones), yielding a relatively low-cost solution. And third, their networking is frequency agile, unlike conventional LoRa networks: it tolerates in a distributed, self-stabilizing way the variable external interference and link fading in the cluttered 902-928MHz ISM band urban environment by dynamically choosing good frequencies using an efficient new method that combines passive and active measurements.

Submitted 11 March, 2022; originally announced March 2022.

Comments: Accepted in ICCPS 2022

arXiv:2110.11499 [pdf, other]

Wav2CLIP: Learning Robust Audio Representations From CLIP

Authors: Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello

Abstract: We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification and cross-modal retrieval. Furthermore, Wav2CLIP needs just ~10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as a qualitative assessment of the shared embedding space. Our code and model weights are open sourced and made available for further applications.

Submitted 15 February, 2022; v1 submitted 21 October, 2021; originally announced October 2021.

Comments: Copyright 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2110.09600 [pdf, other]

Who calls the shots? Rethinking Few-Shot Learning for Audio

Authors: Yu Wang, Nicholas J. Bryan, Justin Salamon, Mark Cartwright, Juan Pablo Bello

Abstract: Few-shot learning aims to train models that can recognize novel classes given just a handful of labeled examples, known as the support set. While the field has seen notable advances in recent years, they have often focused on multi-class image classification. Audio, in contrast, is often multi-label due to overlapping sounds, resulting in unique properties such as polyphony and signal-to-noise ratios (SNR). This leads to unanswered questions concerning the impact such audio properties may have on few-shot learning system design, performance, and human-computer interaction, as it is typically up to the user to collect and provide inference-time support set examples. We address these questions through a series of experiments designed to elucidate the answers to these questions. We introduce two novel datasets, FSD-MIX-CLIPS and FSD-MIX-SED, whose programmatic generation allows us to explore these questions systematically. Our experiments lead to audio-specific insights on few-shot learning, some of which are at odds with recent findings in the image domain: there is no best one-size-fits-all model, method, and support set selection criterion. Rather, it depends on the expected application scenario. Our code and data are available at https://github.com/wangyu/rethink-audio-fsl.

Submitted 18 October, 2021; originally announced October 2021.

Comments: WASPAA 2021

arXiv:2109.12690 [pdf, ps, other]

Soundata: A Python library for reproducible use of audio datasets

Authors: Magdalena Fuentes, Justin Salamon, Pablo Zinemanas, Martín Rocamora, Genís Paja, Irán R. Román, Marius Miron, Xavier Serra, Juan Pablo Bello

Abstract: Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version. It speeds up research pipelines by allowing users to quickly download a dataset, load it into memory in a standardized and reproducible way, validate that the dataset is complete and correct, and more. Soundata is based on and inspired by mirdata and is designed to complement mirdata by working with environmental sound, bioacoustic and speech datasets, among others. Soundata was created to be easy to use, easy to contribute to, and to increase reproducibility and standardize usage of sound datasets in a flexible way.

Submitted 4 October, 2021; v1 submitted 26 September, 2021; originally announced September 2021.

arXiv:2108.06404 [pdf, other]

Cyclic Cellular Automata and Greenberg-Hastings Models on Regular Trees

Authors: Jason Bello, David Sivakoff

Abstract: We study the cyclic cellular automaton (CCA) and the Greenberg-Hastings model (GHM) with $κ\ge 3$ colors and contact threshold $θ\ge 2$ on the infinite $(d+1)$-regular tree, $T_d$. When the initial state has the uniform product distribution, we show that these dynamical systems exhibit at least two distinct phases. For sufficiently large $d$, we show that if $κ(θ-1) \le d - O(\sqrt{dκ\ln(d)})$, then every vertex almost surely changes its color infinitely often, while if $κθ\ge d + O(κ\sqrt{d\ln(d)})$, then every vertex almost surely changes its color only finitely many times. Roughly, this implies that as $d\to \infty$, there is a phase transition where $κθ/d = 1$. For the GHM dynamics, in the scenario where every vertex changes color finitely many times, we moreover give an exponential tail bound for the distribution of the time of the last color change at a given vertex.

Submitted 13 August, 2021; originally announced August 2021.

Comments: 22 pages, 2 figures

MSC Class: 60K35 (Primary) 05D99 (Secondary)

arXiv:2106.01149 [pdf, other]

Exploring modality-agnostic representations for music classification

Authors: Ho-Hsiang Wu, Magdalena Fuentes, Juan P. Bello

Abstract: Music information is often conveyed or recorded across multiple data modalities including but not limited to audio, images, text and scores. However, music information retrieval research has almost exclusively focused on single modality recognition, requiring development of separate models for each modality. Some multi-modal works require multiple coexisting modalities given to the model as inputs, constraining the use of these models to the few cases where data from all modalities are available. To the best of our knowledge, no existing model has the ability to take inputs from varying modalities, e.g. images or sounds, and classify them into unified music categories. We explore the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality. We select instrument classification as an example task for our study as both visual and audio components provide relevant semantic information. We train music instrument classifiers that can take either images or sounds as input, and perform comparably to sound-only or image-only classifiers. Furthermore, we explore the case when there is limited labeled data for a given modality, and the impact on performance of using labeled data from other modalities. We are able to achieve almost 70% of the performance of the best-performing system in a zero-shot setting. We provide a detailed analysis of experimental results to understand the potential and limitations of the approach, and discuss future steps towards modality-agnostic classifiers.

Submitted 2 June, 2021; originally announced June 2021.

arXiv:2105.02911 [pdf, other]

Weakly Supervised Source-Specific Sound Level Estimation in Noisy Soundscapes

Authors: Aurora Cramer, Mark Cartwright, Fatemeh Pishdadian, Juan Pablo Bello

Abstract: While the estimation of what sound sources are, when they occur, and from where they originate has been well-studied, the estimation of how loud these sound sources are has been often overlooked. Current solutions to this task, which we refer to as source-specific sound level estimation (SSSLE), suffer from challenges due to the impracticality of acquiring realistic data and a lack of robustness to realistic recording conditions. Recently proposed weakly supervised source separation methods offer a means of leveraging clip-level source annotations to train source separation models, which we augment with modified loss functions to bridge the gap between source separation and SSSLE and to address the presence of background. We show that our approach improves SSSLE performance compared to baseline source separation models and provide an ablation analysis to explore our method's design choices, showing that SSSLE in practical recording and annotation scenarios is possible.

Submitted 29 July, 2021; v1 submitted 6 May, 2021; originally announced May 2021.

Comments: 5 pages, 3 figures, WASPAA 2021 preprint

arXiv:2103.07362 [pdf, other]

PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss

Authors: Juan Luis Gonzalez Bello, Munchurl Kim

Abstract: In this paper, we propose a self-supervised single-view pixel-level accurate depth estimation network, called PLADE-Net. The PLADE-Net is the first work that shows unprecedented accuracy levels, exceeding 95\% in terms of the $δ^1$ metric on the challenging KITTI dataset. Our PLADE-Net is based on a new network architecture with neural positional encoding and a novel loss function that borrows from the closed-form solution of the matting Laplacian to learn pixel-level accurate depth estimation from stereo images. Neural positional encoding allows our PLADE-Net to obtain more consistent depth estimates by letting the network reason about location-specific image properties such as lens and projection distortions. Our novel distilled matting Laplacian loss allows our network to predict sharp depths at object boundaries and more consistent depths in highly homogeneous regions. Our proposed method outperforms all previous self-supervised single-view depth estimation methods by a large margin on the challenging KITTI dataset, with unprecedented levels of accuracy. Furthermore, our PLADE-Net, naively extended for stereo inputs, outperforms the most recent self-supervised stereo methods, even without any advanced blocks like 1D correlations, 3D convolutions, or spatial pyramid pooling. We present extensive ablation studies and experiments that support our method's effectiveness on the KITTI, CityScapes, and Make3D datasets.

Submitted 12 March, 2021; originally announced March 2021.

Comments: Accepted paper (poster) at CVPR2021

arXiv:2103.03727 [pdf]

Suicide Classificaction for News Media Using Convolutional Neural Network

Authors: Hugo J. Bello, Nora Palomar-Ciria, Enrique Baca-García, Celia Lozano

Abstract: Currently, the process of evaluating suicides is highly subjective, which limits the efficacy and accuracy of prevention efforts. Artificial intelligence (AI) has emerged as a means of investigating large datasets to identify patterns within "big data" that can determine the factors affecting suicide outcomes. Here, we use AI tools to extract the topic from (press and social) media text. However, news media articles lack suicide tags. Using tweets with hashtags related to suicide, we train a neural model which identifies whether a given text has suicide-related content. Our results suggest a high level of impact of media coverage on suicide cases, and an intrinsic thematic relationship among suicide news. These results pave the way to build more interpretable suicide data, which may help to better track suicide, understand its origin, and improve prevention strategies.

Submitted 18 February, 2021; originally announced March 2021.

arXiv:2102.13502 [pdf]

Direct imaging of chiral domain walls and Néel-type skyrmionium in ferrimagnetic alloys

Authors: Boris Seng, Daniel Schönke, Javier Yeste, Robert M. Reeve, Nico Kerber, Daniel Lacour, Jean-Loïs Bello, Nicolas Bergeard, Fabian Kammerbauer, Mona Bhukta, Tom Ferté, Christine Boeglin, Florin Radu, Radu Abrudan, Torsten Kachel, Stéphane Mangin, Michel Hehn, Mathias Kläui

Abstract: The evolution of chiral spin structures is studied in ferrimagnet Ta/Ir/Fe/GdFeCo/Pt multilayers as a function of temperature using scanning electron microscopy with polarization analysis (SEMPA). The GdFeCo ferrimagnet exhibits pure right-hand Néel-type domain wall (DW) spin textures over a large temperature range. This indicates the presence of a negative Dzyaloshinskii-Moriya interaction (DMI) that can originate from both the top Fe/Pt and the Co/Pt interfaces. From measurements of the DW width, as well as complementary magnetic characterization, the exchange stiffness as a function of temperature is ascertained. The exchange stiffness is surprisingly mostly constant, which is explained by theoretical predictions. Beyond single skyrmions, we find by direct imaging a pure Néel-type skyrmionium, which due to the absence of a skyrmion Hall angle is a promising topological spin structure to enable high impact potential applications in the next generation of spintronic devices.

Submitted 21 July, 2021; v1 submitted 26 February, 2021; originally announced February 2021.

arXiv:2102.03229 [pdf, other]

Multi-Task Self-Supervised Pre-Training for Music Classification

Authors: Ho-Hsiang Wu, Chieh-Chi Kao, Qingming Tang, Ming Sun, Brian McFee, Juan Pablo Bello, Chao Wang

Abstract: Deep learning is very data hungry, and supervised learning especially requires massive labeled data to work well. Machine listening research often suffers from a limited labeled data problem, as human annotations are costly to acquire, and annotations for audio are time consuming and less intuitive. Besides, models learned from a labeled dataset often embed biases specific to that particular dataset. Therefore, unsupervised learning techniques have become popular approaches in solving machine listening problems. In particular, a self-supervised learning technique utilizing reconstructions of multiple hand-crafted audio features has shown promising results when applied to speech-domain tasks such as emotion recognition and automatic speech recognition (ASR). In this paper, we apply self-supervised and multi-task learning methods for pre-training music encoders, and explore various design choices including encoder architectures, weighting mechanisms to combine losses from multiple tasks, and worker selections of pretext tasks. We investigate how these design choices interact with various downstream music classification tasks. We find that using various music-specific workers together with weighting mechanisms to balance the losses during pre-training helps improve and generalize to the downstream tasks.

Submitted 5 February, 2021; originally announced February 2021.

Comments: Copyright 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2012.07490 [pdf]

Machine Learning to study the impact of gender-based violence in the news media

Authors: Hugo J. Bello, Nora Palomar, Elisa Gallego, Lourdes Jiménez Navascués, Celia Lozano

Abstract: While it remains a taboo topic, gender-based violence (GBV) undermines the health, dignity, security and autonomy of its victims. Many factors that generate or maintain this kind of violence have been studied; however, the influence of the media is still uncertain. Here, we use Machine Learning tools to extrapolate the effect of the news in GBV. By feeding neural networks with news, the topic information associated with each article can be recovered. Our findings show a relationship between GBV news and public awareness, the effect of mediatic GBV cases, and the intrinsic thematic relationship of GBV news. Because the neural model used can be easily adjusted, this also allows us to extend our approach to other media sources or topics.

Submitted 27 November, 2020; originally announced December 2020.

arXiv:2010.09137 [pdf]

doi 10.1002/adma.202007047

Current-induced spin torques on single GdFeCo magnetic layers

Authors: David Céspedes-Berrocal, Heloïse Damas, Sébastien Petit-Watelot, David Maccariello, Ping Tang, Aldo Arriola-Córdova, Pierre Vallobra, Yong Xu, Jean-Loïs Bello, Elodie Martin, Sylvie Migot, Jaafar Ghanbaja, Shufeng Zhang, Michel Hehn, Stéphane Mangin, Christos Panagopoulos, Vincent Cros, Albert Fert, Juan-Carlos Rojas-Sánchez

Abstract: Spintronics exploits spin-orbit coupling (SOC) to generate spin currents, spin torques, and, in the absence of inversion symmetry, Rashba, and Dzyaloshinskii-Moriya interactions (DMI). The widely used magnetic materials, based on 3d metals such as Fe and Co, possess a small SOC. To circumvent this shortcoming, the common practice has been to utilize the large SOC of nonmagnetic layers of 5d heavy metals (HMs), such as Pt, to generate spin currents by Spin Hall Effect (SHE) and, in turn, exert spin torques on the magnetic layers. Here, we introduce a new class of material architectures, excluding nonmagnetic 5d HMs, for high-performance spintronics operations. We demonstrate very strong current-induced torques exerted on single GdFeCo layers due to the combination of large SOC of the Gd 5d states, and inversion symmetry breaking mainly engineered by interfaces. These "self-torques" are enhanced around the magnetization compensation temperature (close to room temperature) and can be tuned by adjusting the spin absorption outside the GdFeCo layer. In other measurements, we determine the very large emission of spin current from GdFeCo. This material platform opens new perspectives to exert "self-torques" on single magnetic layers as well as to generate spin currents from a magnetic layer.

Submitted 18 October, 2020; originally announced October 2020.

Comments: 26 pages, 4 figures plus 5 pages of sup. information

Journal ref: Advanced Materials 2021

arXiv:2009.05188 [pdf, other]

SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context

Authors: Mark Cartwright, Jason Cramer, Ana Elisa Mendez Mendez, Yu Wang, Ho-Hsiang Wu, Vincent Lostanlen, Magdalena Fuentes, Graham Dove, Charlie Mydlarz, Justin Salamon, Oded Nov, Juan Pablo Bello

Abstract: We present SONYC-UST-V2, a dataset for urban sound tagging with spatiotemporal information. This dataset is aimed for the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC-UST-V2 consists of 18510 audio recordings from the "Sounds of New York City" (SONYC) acoustic sensor network, including the timestamp of audio acquisition and location of the sensor. The dataset contains annotations by volunteers from the Zooniverse citizen science platform, as well as a two-stage verification with our team. In this article, we describe our data collection procedure and propose evaluation metrics for multilabel classification of urban sound tags. We report the results of a simple baseline model that exploits spatiotemporal information.

Submitted 10 September, 2020; originally announced September 2020.

arXiv:2008.02791 [pdf, other]

Few-Shot Drum Transcription in Polyphonic Music

Authors: Yu Wang, Justin Salamon, Mark Cartwright, Nicholas J. Bryan, Juan Pablo Bello

Abstract: Data-driven approaches to automatic drum transcription (ADT) are often limited to a predefined, small vocabulary of percussion instrument classes. Such models cannot recognize out-of-vocabulary classes nor are they able to adapt to finer-grained vocabularies. In this work, we address open vocabulary ADT by introducing few-shot learning to the task. We train a Prototypical Network on a synthetic dataset and evaluate the model on multiple real-world ADT datasets with polyphonic accompaniment. We show that, given just a handful of selected examples at inference time, we can match and in some cases outperform a state-of-the-art supervised ADT approach under a fixed vocabulary setting. At the same time, we show that our model can successfully generalize to finer-grained or extended vocabularies unseen during training, a scenario where supervised approaches cannot operate at all. We provide a detailed analysis of our experimental results, including a breakdown of performance by sound class and by polyphony.

Submitted 6 August, 2020; originally announced August 2020.

Comments: ISMIR 2020 camera-ready

arXiv:2006.16583 [pdf, other]

Pan-Sharpening with Color-Aware Perceptual Loss and Guided Re-Colorization

Authors: Juan Luis Gonzalez Bello, Soomin Seo, Munchurl Kim

Abstract: We present a novel color-aware perceptual (CAP) loss for learning the task of pan-sharpening. Our CAP loss is designed to focus on the deep features of a pre-trained VGG network that are more sensitive to spatial details and ignore color information to allow the network to extract the structural information from the PAN image while keeping the color from the lower resolution MS image. Additionally, we propose "guided re-colorization", which generates a pan-sharpened image with real colors from the MS input by "picking" the closest MS pixel color for each pan-sharpened pixel, as a human operator would do in manual colorization. Such a re-colorized (RC) image is completely aligned with the pan-sharpened (PS) network output and can be used as a self-supervision signal during training, or to enhance the colors in the PS image during test. We present several experiments where our network trained with our CAP loss generates natural-looking pan-sharpened images with fewer artifacts and outperforms the state of the art on the WorldView3 dataset in terms of ERGAS, SCC, and QNR metrics.

Submitted 30 June, 2020; originally announced June 2020.

arXiv:2003.01037 [pdf, other]

One or Two Components? The Scattering Transform Answers

Authors: Vincent Lostanlen, Alice Cohen-Hadria, Juan Pablo Bello

Abstract: With the aim of constructing a biologically plausible model of machine listening, we study the representation of a multicomponent stationary signal by a wavelet scattering network. First, we show that renormalizing second-order nodes by their first-order parents gives a simple numerical criterion to assess whether two neighboring components will interfere psychoacoustically. Secondly, we run a manifold learning algorithm (Isomap) on scattering coefficients to visualize the similarity space underlying parametric additive synthesis. Thirdly, we generalize the "one or two components" framework to three sine waves or more, and prove that the effective scattering depth of a Fourier series grows in logarithmic proportion to its bandwidth.

Submitted 25 June, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

Comments: 5 pages, 4 figures, in English. Proceedings of the European Signal Processing Conference (EUSIPCO 2020)

arXiv:1911.00417 [pdf, other]

doi 10.33682/ts6e-sn53

Long-distance Detection of Bioacoustic Events with Per-channel Energy Normalization

Authors: Vincent Lostanlen, Kaitlin Palmer, Elly Knight, Christopher Clark, Holger Klinck, Andrew Farnsworth, Tina Wong, Jason Cramer, Juan Pablo Bello

Abstract: This paper proposes to perform unsupervised detection of bioacoustic events by pooling the magnitudes of spectrogram frames after per-channel energy normalization (PCEN). Although PCEN was originally developed for speech recognition, it also has beneficial effects in enhancing animal vocalizations, despite the presence of atmospheric absorption and intermittent noise. We prove that PCEN generalizes logarithm-based spectral flux, yet with a tunable time scale for background noise estimation. In comparison with pointwise logarithm, PCEN reduces false alarm rate by 50x in the near field and 5x in the far field, both on avian and marine bioacoustic datasets. Such improvements come at moderate computational cost and require no human intervention, thus heralding a promising future for PCEN in bioacoustics.

Submitted 1 November, 2019; originally announced November 2019.

Comments: 5 pages, 3 figures. Presented at the 3rd International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE). 25--26 October 2019, New York, NY, USA
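
The abstract above describes PCEN as a normalization with a tunable time scale for background noise estimation. As a rough illustration only, the sketch below applies the commonly cited form of PCEN to a single subband: a first-order low-pass filter tracks the background, the energy is divided by that estimate, and the result is root-compressed. The function name and all parameter defaults (`s`, `alpha`, `delta`, `r`, `eps`) are assumptions chosen for illustration, not values taken from the paper.

```python
def pcen(energies, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to one subband's sequence of non-negative energies.

    All default parameter values are illustrative assumptions.
    """
    out = []
    smoothed = energies[0]  # initialize the background estimate
    for e in energies:
        # First-order IIR low-pass filter; s sets the time scale of
        # the background noise estimate.
        smoothed = (1.0 - s) * smoothed + s * e
        # Adaptive gain control, then root compression with offset delta.
        gain_normalized = e / (eps + smoothed) ** alpha
        out.append((gain_normalized + delta) ** r - delta ** r)
    return out


# A steady background followed by one loud frame.
steady_then_burst = [1.0] * 50 + [10.0]
normalized = pcen(steady_then_burst)
```

In this toy input the final frame stands out from the steady background after normalization, which is the property a detector would pool over; a smaller `s` makes the background estimate adapt more slowly.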

arXiv:1910.10246 [pdf, other]

Learning the helix topology of musical pitch

Authors: Vincent Lostanlen, Sripathi Sridhar, Brian McFee, Andrew Farnsworth, Juan Pablo Bello

Abstract: To explain the consonance of octaves, music psychologists represent pitch as a helix where azimuth and axial coordinate correspond to pitch class and pitch height respectively. This article addresses the problem of discovering this helical structure from unlabeled audio data. We measure Pearson correlations in the constant-Q transform (CQT) domain to build a K-nearest neighbor graph between frequency subbands. Then, we run the Isomap manifold learning algorithm to represent this graph in a three-dimensional space in which straight lines approximate graph geodesics. Experiments on isolated musical notes demonstrate that the resulting manifold resembles a helix which makes a full turn at every octave. A circular shape is also found in English speech, but not in urban noise. We discuss the impact of various design choices on the visualization: instrumentarium, loudness mapping function, and number of neighbors K.

Submitted 4 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: 5 pages, 6 figures. To appear in the Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Barcelona, Spain, May 2020
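
The helix in the abstract above can be made concrete with a small sketch. It shows only the textbook pitch helix (azimuth for pitch class, axial coordinate for pitch height, one full turn per octave), not the paper's data-driven procedure. The function name `pitch_helix` and the 440 Hz reference are assumptions for illustration.

```python
import math


def pitch_helix(freq_hz, ref_hz=440.0):
    """Map a frequency to (x, y, z) on a unit-radius pitch helix.

    ref_hz is a hypothetical reference; any positive value works.
    """
    # Pitch height: signed number of octaves above the reference.
    octaves = math.log2(freq_hz / ref_hz)
    # Pitch class: fractional part of the octave count, as an angle.
    azimuth = 2.0 * math.pi * (octaves % 1.0)
    return math.cos(azimuth), math.sin(azimuth), octaves


a4 = pitch_helix(440.0)
a5 = pitch_helix(880.0)
```

Two notes an octave apart share the same azimuth and differ by one unit in height, which is the sense in which the helix makes a full turn at every octave.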

arXiv:1910.01089 [pdf, other]

Deep 3D Pan via adaptive "t-shaped" convolutions with global and local adaptive dilations

Authors: Juan Luis Gonzalez Bello, Munchurl Kim

Abstract: Recent advances in deep learning have shown promising results in many low-level vision tasks. However, solving the single-image-based view synthesis is still an open problem. In particular, the generation of new images at parallel camera views given a single input image is of great interest, as it enables 3D visualization of the 2D input scenery. We propose a novel network architecture to perform stereoscopic view synthesis at arbitrary camera positions along the X-axis, or Deep 3D Pan, with "t-shaped" adaptive kernels equipped with globally and locally adaptive dilations. Our proposed network architecture, the monster-net, is devised with a novel "t-shaped" adaptive kernel with globally and locally adaptive dilation, which can efficiently incorporate global camera shift into and handle local 3D geometries of the target image's pixels for the synthesis of naturally looking 3D panned views when a 2-D input image is given. Extensive experiments were performed on the KITTI, CityScapes and our VICLAB_STEREO indoors dataset to prove the efficacy of our method. Our monster-net significantly outperforms the state-of-the-art method, SOTA, by a large margin in all metrics of RMSE, PSNR, and SSIM. Our proposed monster-net is capable of reconstructing more reliable image structures in synthesized images with coherent geometry. Moreover, the disparity information that can be extracted from the "t-shaped" kernel is much more reliable than that of the SOTA for the unsupervised monocular depth estimation task, confirming the effectiveness of our method.

Submitted 20 October, 2019; v1 submitted 2 October, 2019; originally announced October 2019.

Comments: Check our video at https://www.youtube.com/watch?v=o0b-e282Rt4

arXiv:1909.09349 [pdf, other]

Deep 3D-Zoom Net: Unsupervised Learning of Photo-Realistic 3D-Zoom

Authors: Juan Luis Gonzalez Bello, Munchurl Kim

Abstract: The 3D-zoom operation is the positive translation of the camera in the Z-axis, perpendicular to the image plane. In contrast, the optical zoom changes the focal length and the digital zoom is used to enlarge a certain region of an image to the original image size. In this paper, we are the first to formulate an unsupervised 3D-zoom learning problem where images with an arbitrary zoom factor can be generated from a given single image. An unsupervised framework is convenient, as it is a challenging task to obtain a 3D-zoom dataset of natural scenes due to the need for special equipment to ensure camera movement is restricted to the Z-axis. In addition, the objects in the scenes should not move when being captured, which hinders the construction of a large dataset of outdoor scenes. We present a novel unsupervised framework to learn how to generate arbitrarily 3D-zoomed versions of a single image, not requiring a 3D-zoom ground truth, called the Deep 3D-Zoom Net. The Deep 3D-Zoom Net incorporates the following features: (i) transfer learning from a pre-trained disparity estimation network via a back re-projection reconstruction loss; (ii) a fully convolutional network architecture that models depth-image-based rendering (DIBR), taking into account high-frequency details without the need for estimating the intermediate disparity; and (iii) incorporating a discriminator network that acts as a no-reference penalty for unnaturally rendered areas. Even though there is no baseline to fairly compare our results, our method outperforms previous novel view synthesis research in terms of realistic appearance on large camera baselines. We performed extensive experiments to verify the effectiveness of our method on the KITTI and Cityscapes datasets.

Submitted 2 October, 2019; v1 submitted 20 September, 2019; originally announced September 2019.

Comments: Check our video at https://www.youtube.com/watch?v=Gz76VYwUzZ8

arXiv:1906.08512 [pdf, other]

Adversarial Learning for Improved Onsets and Frames Music Transcription

Authors: Jong Wook Kim, Juan Pablo Bello

Abstract: Automatic music transcription is considered to be one of the hardest problems in music information retrieval, yet recent deep learning approaches have achieved substantial improvements on transcription performance. These approaches commonly employ supervised learning models that predict various time-frequency representations, by minimizing element-wise losses such as the cross entropy function. However, applying the loss in this manner assumes conditional independence of each label given the input, and thus cannot accurately express inter-label dependencies. To address this issue, we introduce an adversarial training scheme that operates directly on the time-frequency representations and makes the output distribution closer to the ground-truth. Through adversarial learning, we achieve a consistent improvement in both frame-level and note-level metrics over Onsets and Frames, a state-of-the-art music transcription model. Our results show that adversarial learning can significantly reduce the error rate while increasing the confidence of the model estimations. Our approach is generic and applicable to any transcription model based on multi-label predictions, which are very common in music signal analysis.

Submitted 20 June, 2019; originally announced June 2019.

arXiv:1905.08352 [pdf, other]

doi 10.1371/journal.pone.0214168

Robust sound event detection in bioacoustic sensor networks

Authors: Vincent Lostanlen, Justin Salamon, Andrew Farnsworth, Steve Kelling, Juan Pablo Bello

Abstract: Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNN) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60 milliseconds) and long-term (30 minutes) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which applies short-term automatic gain control to every subband in the mel-frequency spectrogram. Secondly, we replace the last dense layer in the network by a context-adaptive neural network (CA-NN) layer. Combining them yields state-of-the-art results that are unmatched by artificial data augmentation alone. We release a pre-trained version of our best performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.

Submitted 29 October, 2019; v1 submitted 20 May, 2019; originally announced May 2019.

Comments: 32 pages, in English. Submitted to PLOS ONE journal in February 2019; revised August 2019; published October 2019

arXiv:1903.08514 [pdf, other]

A Novel Monocular Disparity Estimation Network with Domain Transformation and Ambiguity Learning

Authors: Juan Luis Gonzalez Bello, Munchurl Kim

Abstract: Convolutional neural networks (CNN) have shown state-of-the-art results for low-level computer vision problems such as stereo and monocular disparity estimations, but still, have much room to further improve their performance in terms of accuracy, numbers of parameters, etc. Recent works have uncovered the advantages of using an unsupervised scheme to train CNN's to estimate monocular disparity, where only the relatively-easy-to-obtain stereo images are needed for training. We propose a novel encoder-decoder architecture that outperforms previous unsupervised monocular depth estimation networks by (i) taking into account ambiguities, (ii) efficient fusion between encoder and decoder features with rectangular convolutions and (iii) domain transformations between encoder and decoder. Our architecture outperforms the Monodepth baseline in all metrics, even with a considerable reduction of parameters. Furthermore, our architecture is capable of estimating a full disparity map in a single forward pass, whereas the baseline needs two passes. We perform extensive experiments to verify the effectiveness of our method on the KITTI dataset.

Submitted 20 March, 2019; originally announced March 2019.

arXiv:1903.03195 [pdf, other]

doi 10.3390/s19061415

The life of a New York City noise sensor network

Authors: Charlie Mydlarz, Mohit Sharma, Yitzchak Lockerman, Ben Steers, Claudio Silva, Juan Pablo Bello

Abstract: Noise pollution is one of the topmost quality of life issues for urban residents in the United States. Continued exposure to high levels of noise has proven effects on health, including acute effects such as sleep disruption, and long-term effects such as hypertension, heart disease, and hearing loss. To investigate and ultimately aid in the mitigation of urban noise, a network of 55 sensor nodes has been deployed across New York City for over two years, collecting sound pressure level (SPL) and audio data. This network has cumulatively amassed over 75 years of calibrated, high-resolution SPL measurements and 35 years of audio data. In addition, high frequency telemetry data has been collected that provides an indication of a sensor's health. This telemetry data was analyzed over an 18 month period across 31 of the sensors. It has been used to develop a prototype model for pre-failure detection which has the ability to identify sensors in a prefail state 69.1% of the time. The entire network infrastructure is outlined, including the operation of the sensors, followed by an analysis of its data yield and the development of the fault detection approach and the future system integration plans for this.

Submitted 26 March, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

Comments: This article belongs to the Section Intelligent Sensors, 24 pages, 15 figures, 3 tables, 45 references

Journal ref: Sensors 2019, 19, 1415

arXiv:1811.00223 [pdf, other]

Neural Music Synthesis for Flexible Timbre Control

Authors: Jong Wook Kim, Rachel Bittner, Aparna Kumar, Juan Pablo Bello

Abstract: The recent success of raw audio waveform synthesis models like WaveNet motivates a new approach for music synthesis, in which the entire process --- creating audio samples from a score and instrument information --- is modeled using generative neural networks. This paper describes a neural music synthesis model with flexible timbre controls, which consists of a recurrent neural network conditioned on a learned instrument embedding followed by a WaveNet vocoder. The learned embedding space successfully captures the diverse variations in timbres within a large dataset and enables timbre control and morphing by interpolating between instruments in the embedding space. The synthesis quality is evaluated both numerically and perceptually, and an interactive web demo is presented.

Submitted 1 November, 2018; originally announced November 2018.

arXiv:1809.00381 [pdf, other]

Multitask Learning for Fundamental Frequency Estimation in Music

Authors: Rachel M. Bittner, Brian McFee, Juan P. Bello

Abstract: Fundamental frequency (f0) estimation from polyphonic music includes the tasks of multiple-f0, melody, vocal, and bass line estimation. Historically these problems have been approached separately, and only recently, using learning-based approaches. We present a multitask deep learning architecture that jointly estimates outputs for various tasks including multiple-f0, melody, vocal and bass line estimation, and is trained using a large, semi-automatically annotated dataset. We show that the multitask model outperforms its single-task counterparts, and explore the effect of various design decisions in our approach, and show that it performs better or at least competitively when compared against strong baseline methods.

Submitted 2 September, 2018; originally announced September 2018.

arXiv:1805.00889 [pdf, other]

SONYC: A System for the Monitoring, Analysis and Mitigation of Urban Noise Pollution

Authors: Juan Pablo Bello, Claudio Silva, Oded Nov, R. Luke DuBois, Anish Arora, Justin Salamon, Charles Mydlarz, Harish Doraiswamy

Abstract: We present the Sounds of New York City (SONYC) project, a smart cities initiative focused on developing a cyber-physical system for the monitoring, analysis and mitigation of urban noise pollution. Noise pollution is one of the topmost quality of life issues for urban residents in the U.S. with proven effects on health, education, the economy, and the environment. Yet, most cities lack the resources to continuously monitor noise and understand the contribution of individual sources, the tools to analyze patterns of noise pollution at city-scale, and the means to empower city agencies to take effective, data-driven action for noise mitigation. The SONYC project advances novel technological and socio-technical solutions that help address these needs. SONYC includes a distributed network of both sensors and people for large-scale noise monitoring. The sensors use low-cost, low-power technology, and cutting-edge machine listening techniques, to produce calibrated acoustic measurements and recognize individual sound sources in real time. Citizen science methods are used to help urban residents connect to city agencies and each other, understand their noise footprint, and facilitate reporting and self-regulation. Crucially, SONYC utilizes big data solutions to analyze, retrieve and visualize information from sensors and citizens, creating a comprehensive acoustic model of the city that can be used to identify significant patterns of noise pollution. These data can be used to drive the strategic application of noise code enforcement by city agencies to optimize the reduction of noise pollution. The entire system, integrating cyber, physical and social infrastructure, forms a closed loop of continuous sensing, analysis and actuation on the environment. SONYC provides a blueprint for the mitigation of noise pollution that can potentially be applied to other cities in the US and abroad.

Submitted 18 May, 2018; v1 submitted 2 May, 2018; originally announced May 2018.

Comments: Accepted May 2018, Communications of the ACM. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published in Communications of the ACM

arXiv:1804.10070 [pdf, other]

Adaptive pooling operators for weakly labeled sound event detection

Authors: Brian McFee, Justin Salamon, Juan Pablo Bello

Abstract: Sound event detection (SED) methods are tasked with labeling segments of audio recordings by the presence of active sound sources. SED is typically posed as a supervised machine learning problem, requiring strong annotations for the presence or absence of each sound source at every time instant within the recording. However, strong annotations of this type are both labor- and cost-intensive for human annotators to produce, which limits the practical scalability of SED methods. In this work, we treat SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality. The models, however, must still produce temporally dynamic predictions, which must be aggregated (pooled) when comparing against static labels during training. To facilitate this aggregation, we develop a family of adaptive pooling operators---referred to as auto-pool---which smoothly interpolate between common pooling operators, such as min-, max-, or average-pooling, and automatically adapt to the characteristics of the sound sources in question. We evaluate the proposed pooling operators on three datasets, and demonstrate that in each case, the proposed methods outperform non-adaptive pooling operators for static prediction, and nearly match the performance of models trained with strong, dynamic annotations. The proposed method is evaluated in conjunction with convolutional neural networks, but can be readily applied to any differentiable model for time-series label prediction.

Submitted 10 August, 2018; v1 submitted 26 April, 2018; originally announced April 2018.
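
The abstract above describes pooling operators that interpolate smoothly between min-, max-, and average-pooling. One way to obtain that behavior is a softmax-weighted average with a single sharpness parameter. The abstract does not state the exact parameterization, so the sketch below is an assumption, and the names `auto_pool` and `alpha` are illustrative. In a trained model the parameter would be learned rather than fixed by hand.

```python
import math


def auto_pool(scores, alpha):
    """Softmax-weighted average of per-frame scores.

    alpha = 0 gives the mean; large positive alpha approaches the max;
    large negative alpha approaches the min. (Assumed formulation.)
    """
    # Subtract the largest exponent for numerical stability.
    shift = max(alpha * s for s in scores)
    weights = [math.exp(alpha * s - shift) for s in scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


frames = [0.1, 0.9, 0.4]  # hypothetical per-frame predictions
mean_like = auto_pool(frames, 0.0)
max_like = auto_pool(frames, 50.0)
min_like = auto_pool(frames, -50.0)
```

Because the operator is differentiable in both the scores and `alpha`, it can sit at the end of a network and be trained against clip-level labels.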

arXiv:1802.06182 [pdf, other]

CREPE: A Convolutional Representation for Pitch Estimation

Authors: Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello

Abstract: The task of estimating the fundamental frequency of a monophonic sound recording, also known as pitch tracking, is fundamental to audio processing with multiple applications in speech processing and music information retrieval. To date, the best performing techniques, such as the pYIN algorithm, are based on a combination of DSP pipelines and heuristics. While such techniques perform very well on average, there remain many cases in which they fail to correctly estimate the pitch. In this paper, we propose a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform. We show that the proposed model produces state-of-the-art results, performing equally or better than pYIN. Furthermore, we evaluate the model's generalizability in terms of noise robustness. A pre-trained version of CREPE is made freely available as an open-source Python module for easy application.

Submitted 16 February, 2018; originally announced February 2018.

Comments: ICASSP 2018

arXiv:1712.05522 [pdf]

doi 10.1103/PhysRevD.97.093005

Confidence regions for neutrino oscillation parameters from double-Chooz data

Authors: B. Vargas Perez, J. García-Ravelo, Dionisio Tun, Jorge Garcia Bello, Jesús Escamilla Roa

Abstract: In this work, an independent and detailed statistical analysis of the double-Chooz experiment is performed. To have a thorough understanding of the implications of the double-Chooz data on both oscillation parameters $\sin^{2}(2θ_{13})$ and $Δm^2_{31}$, we decided to analyze the data corresponding to the Far detector, with no additional restriction. By doing this, confidence regions and best fit values are obtained for ($\sin^{2}(2θ_{13}),Δm^2_{31}$). This analysis yields an out-of-order $Δm^2_{31}$ minimum, which has already been mentioned in previous works, and it is corrected with the inclusion of additional restrictions. With such restrictions it is obtained that $\sin ^{2}(2 θ_{13})=0{.}084_{-0{.}028}^{+0{.}030}$ and $Δm^2_{31}=2.444^{+0.187}_{-0.215} \times 10^{-3}$ eV$^2$/c$^4$. Our analysis allows us to study the effects of the so-called "spectral bump" around 5 MeV; it is observed that a variation of this spectral bump may be able to move the $Δm^2_{31}$ best fit value, in such a way that $Δm^2_{31}$ takes the order of magnitude of the MINOS value. Finally, and with the intention of understanding the effects of the preliminary Near detector data, we performed two different analyses, aiming to eliminate the effects of the energy bump. As a consequence, it is found that unlike the Far Detector analysis, the Near detector data may be able to fully determine both oscillation parameters by itself, resulting in $\sin^2(2θ_{13}) = 0.095 \pm 0.053$ and $Δm^{2}_{31} = 2.63^{+0.98}_{-1.15} \times 10^{-3} \text{eV}^2 / \text{c}^4$. The latter analyses represent an improvement with respect to previous works, where additional constraints for $Δm^2_{31}$ were necessary.

Submitted 1 August, 2018; v1 submitted 14 December, 2017; originally announced December 2017.

Comments: 11 pages, 2 b/w figures, 5 color figures, 6 tables

Journal ref: Phys. Rev. D 97, 093005 (2018)

arXiv:1608.04363 [pdf, other]

doi 10.1109/LSP.2017.2657381

Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification

Authors: Justin Salamon, Juan Pablo Bello

Abstract: The ability of deep convolutional neural networks (CNN) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep convolutional neural network architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a "shallow" dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.

Submitted 28 November, 2016; v1 submitted 15 August, 2016; originally announced August 2016.

Comments: Accepted November 2016, IEEE Signal Processing Letters. Copyright IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material, creating new collective works, for resale or redistribution, or reuse of any copyrighted component of this work in other works

Showing 1–50 of 54 results for author: Bello, J