-
Joint Selection: Adaptively Incorporating Public Information for Private Synthetic Data
Authors:
Miguel Fuentes,
Brett Mullins,
Ryan McKenna,
Gerome Miklau,
Daniel Sheldon
Abstract:
Mechanisms for generating differentially private synthetic data based on marginals and graphical models have been successful in a wide range of settings. However, one limitation of these methods is their inability to incorporate public data. Initializing a data generating model by pre-training on public data has been shown to improve the quality of synthetic data, but this technique is not applicable when model structure is not determined a priori. We develop the mechanism jam-pgm, which expands the adaptive measurements framework to jointly select between measuring public data and private data. This technique allows for public data to be included in a graphical-model-based mechanism. We show that jam-pgm is able to outperform both publicly assisted and non-publicly assisted synthetic data generation mechanisms even when the public data distribution is biased.
Submitted 12 March, 2024;
originally announced March 2024.
-
Two vs. Four-Channel Sound Event Localization and Detection
Authors:
Julia Wilkins,
Magdalena Fuentes,
Luca Bondi,
Shabnam Ghaffarzadegan,
Ali Abavisani,
Juan Pablo Bello
Abstract:
Sound event localization and detection (SELD) systems estimate both the direction-of-arrival (DOA) and class of sound sources over time. In the DCASE 2022 SELD Challenge (Task 3), models are designed to operate in a 4-channel setting. While this is beneficial for furthering the development of SELD systems using a multichannel recording setup such as first-order Ambisonics (FOA), most consumer electronics devices are rarely able to record using more than two channels. For this reason, in this work we investigate the performance of the DCASE 2022 SELD baseline model using three audio input representations: FOA, binaural, and stereo. We perform a novel comparative analysis illustrating the effect of these audio input representations on SELD performance. Crucially, we show that binaural and stereo (i.e. 2-channel) audio-based SELD models are still able to localize and detect sound sources laterally quite well, despite overall performance degrading as less audio information is provided. Further, we segment our analysis by scenes containing varying degrees of sound source polyphony to better understand the effect of audio input representation on localization and detection performance as scene conditions become increasingly complex.
Submitted 23 September, 2023;
originally announced September 2023.
-
Sound Source Distance Estimation in Diverse and Dynamic Acoustic Conditions
Authors:
Saksham Singh Kushwaha,
Iran R. Roman,
Magdalena Fuentes,
Juan Pablo Bello
Abstract:
Localizing a moving sound source in the real world involves determining its direction-of-arrival (DOA) and distance relative to a microphone. Advancements in DOA estimation have been facilitated by data-driven methods optimized with large open-source datasets with microphone array recordings in diverse environments. In contrast, estimating a sound source's distance remains understudied. Existing approaches assume recordings by non-coincident microphones to use methods that are susceptible to differences in room reverberation. We present a CRNN able to estimate the distance of moving sound sources across multiple datasets featuring diverse rooms, outperforming a recently-published approach. We also characterize our model's performance as a function of sound source distance and different training losses. This analysis reveals optimal training using a loss that weighs model errors as an inverse function of the sound source true distance. Our study is the first to demonstrate that sound source distance estimation can be performed across diverse acoustic conditions using deep learning.
Submitted 17 September, 2023;
originally announced September 2023.
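The inverse-distance loss weighting mentioned in the abstract above can be illustrated with a minimal sketch. It assumes an absolute-error loss in which each term is scaled by 1 / (true distance); the function name `inverse_distance_weighted_loss`, the `eps` guard, and the exact form of the weighting are illustrative assumptions, not the authors' implementation.

```python
def inverse_distance_weighted_loss(predicted, true, eps=1e-6):
    """Mean absolute error with each term weighted by 1 / true distance.

    Hypothetical helper: errors on nearby sources count more than the
    same absolute error on distant sources.
    """
    if len(predicted) != len(true):
        raise ValueError("predicted and true must have the same length")
    if not true:
        raise ValueError("at least one sample is required")
    total = 0.0
    for p, t in zip(predicted, true):
        weight = 1.0 / max(t, eps)  # eps avoids division by zero (assumption)
        total += weight * abs(p - t)
    return total / len(true)


# A 0.5 m error at 1 m is penalized more than a 1 m error at 10 m.
loss = inverse_distance_weighted_loss([1.5, 9.0], [1.0, 10.0])
```

Under this weighting, the same absolute error contributes less as the source moves farther away, which is one plausible reading of "an inverse function of the sound source true distance".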
-
Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries
Authors:
Julia Wilkins,
Justin Salamon,
Magdalena Fuentes,
Juan Pablo Bello,
Oriol Nieto
Abstract:
Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio-visual training data, previous work on audio-visual retrieval relies on YouTube (in-the-wild) videos of varied quality for training, where the audio is often noisy and the video of amateur quality. As such it is unclear whether these systems would generalize to the task of matching HQ audio to production-quality video. To address this, we propose a multimodal framework for recommending HQ SFX given a video frame by (1) leveraging large language models and foundational vision-language models to bridge HQ audio and video to create audio-visual pairs, resulting in a highly scalable automatic audio-visual data curation pipeline; and (2) using pre-trained audio and visual encoders to train a contrastive learning-based retrieval system. We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of HQ SFX retrieval for video. Furthermore, while the baselines fail to generalize to this task, our system generalizes well from clean to in-the-wild data, outperforming the baselines on a dataset of YouTube videos despite only being trained on the HQ audio-visual pairs. A user study confirms that people prefer SFX retrieved by our system over the baseline 67% of the time both for HQ and in-the-wild data. Finally, we present ablations to determine the impact of model and data pipeline design choices on downstream retrieval performance. Please visit our project website to listen to and view our SFX retrieval results.
Submitted 17 August, 2023;
originally announced August 2023.
-
A Multimodal Prototypical Approach for Unsupervised Sound Classification
Authors:
Saksham Singh Kushwaha,
Magdalena Fuentes
Abstract:
In the context of environmental sound classification, the adaptability of systems is key: which sound classes are interesting depends on the context and the user's needs. Recent advances in text-to-audio retrieval allow for zero-shot audio classification, but performance compared to supervised models remains limited. This work proposes a multimodal prototypical approach that exploits local audio-text embeddings to provide more relevant answers to audio queries, augmenting the adaptability of sound detection in the wild. We do this by first using text to query a nearby community of audio embeddings that best characterize each query sound, and select the group's centroids as our prototypes. Second, we compare unseen audio to these prototypes for classification. We perform multiple ablation studies to understand the impact of the embedding models and prompts. Our unsupervised approach improves upon the zero-shot state-of-the-art in three sound recognition benchmarks by an average of 12%.
Submitted 17 August, 2023; v1 submitted 21 June, 2023;
originally announced June 2023.
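The two steps described in the abstract above (text queries a neighbourhood of audio embeddings whose centroid becomes the prototype, then unseen audio is compared to the prototypes) can be sketched as follows. This is a toy illustration under stated assumptions: embeddings are plain lists of floats, similarity is cosine, and the helper names (`cosine`, `centroid`, `build_prototypes`, `classify`) and the neighbourhood size `k` are hypothetical, not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_prototypes(text_embeddings, audio_bank, k=2):
    """For each class, average the k audio embeddings closest to its text embedding."""
    prototypes = {}
    for label, text_vec in text_embeddings.items():
        neighbours = sorted(
            audio_bank, key=lambda v: cosine(text_vec, v), reverse=True
        )[:k]
        prototypes[label] = centroid(neighbours)
    return prototypes


def classify(audio_vec, prototypes):
    """Assign the label whose prototype is most similar to the audio embedding."""
    return max(prototypes, key=lambda label: cosine(audio_vec, prototypes[label]))


# Toy 2-D embeddings (made up for illustration only).
text_embeddings = {"dog": [1.0, 0.0], "siren": [0.0, 1.0]}
audio_bank = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
prototypes = build_prototypes(text_embeddings, audio_bank, k=2)
label = classify([0.7, 0.3], prototypes)
```

No labels on the audio bank are used here: the class names enter only through the text embeddings, which is what makes the scheme unsupervised in this sketch.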
-
Adapting Meter Tracking Models to Latin American Music
Authors:
Lucas S. Maia,
Martín Rocamora,
Luiz W. P. Biscainho,
Magdalena Fuentes
Abstract:
Beat and downbeat tracking models have improved significantly in recent years with the introduction of deep learning methods. However, despite these improvements, several challenges remain. Particularly, the adaptation of available models to underrepresented music traditions in MIR is usually synonymous with collecting and annotating large amounts of data, which is impractical and time-consuming. Transfer learning, data augmentation, and fine-tuning techniques have been used quite successfully in related tasks and are known to alleviate this bottleneck. Furthermore, when studying these music traditions, models are not required to generalize to multiple mainstream music genres but to perform well in more constrained, homogeneous conditions. In this work, we investigate simple yet effective strategies to adapt beat and downbeat tracking models to two different Latin American music traditions and analyze the feasibility of these adaptations in real-world applications concerning the data and computational requirements. Contrary to common belief, our findings show it is possible to achieve good performance by spending just a few minutes annotating a portion of the data and training a model in a standard CPU machine, with the precise amount of resources needed depending on the task and the complexity of the dataset.
Submitted 14 April, 2023;
originally announced April 2023.
-
Tempo vs. Pitch: understanding self-supervised tempo estimation
Authors:
Giovana Morais,
Matthew E. P. Davies,
Marcelo Queiroz,
Magdalena Fuentes
Abstract:
Self-supervision methods learn representations by solving pretext tasks that do not require human-generated labels, alleviating the need for time-consuming annotations. These methods have been applied in computer vision, natural language processing, environmental sound analysis, and recently in music information retrieval, e.g. for pitch estimation. Particularly in the context of music, there are few insights about the fragility of these models regarding different distributions of data, and how they could be mitigated. In this paper, we explore these questions by dissecting a self-supervised model for pitch estimation adapted for tempo estimation via rigorous experimentation with synthetic data. Specifically, we study the relationship between the input representation and data distribution for self-supervised tempo estimation.
Submitted 13 April, 2023;
originally announced April 2023.
-
FlowGrad: Using Motion for Visual Sound Source Localization
Authors:
Rajsuryan Singh,
Pablo Zinemanas,
Xavier Serra,
Juan Pablo Bello,
Magdalena Fuentes
Abstract:
Most recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner, and by design excludes temporal information present in videos. While it proves to be effective for widely used benchmark datasets, the method falls short for challenging scenarios like urban traffic. This work introduces temporal context into the state-of-the-art methods for sound source localization in urban scenes using optical flow as a means to encode motion information. An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.
Submitted 14 April, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
How to Listen? Rethinking Visual Sound Localization
Authors:
Ho-Hsiang Wu,
Magdalena Fuentes,
Prem Seetharaman,
Juan Pablo Bello
Abstract:
Localizing visual sounds consists of locating the position of objects that emit sound within an image. It is a growing research area with potential applications in monitoring natural and urban environments, such as wildlife migration and urban traffic. Previous works are usually evaluated with datasets having mostly a single dominant visible object, and proposed models usually require the introduction of localization modules during training or dedicated sampling strategies, but it remains unclear how these design choices play a role in the adaptability of these methods in more challenging scenarios. In this work, we analyze various model choices for visual sound localization and discuss how their different components affect the model's performance, namely the encoders' architecture, the loss function and the localization strategy. Furthermore, we study the interaction between these decisions, the model performance, and the data, by digging into different evaluation datasets spanning different difficulties and characteristics, and discuss the implications of such decisions in the context of real-world applications. Our code and model weights are open-sourced and made available for further applications.
Submitted 11 April, 2022;
originally announced April 2022.
-
A Study on Robustness to Perturbations for Representations of Environmental Sound
Authors:
Sangeeta Srivastava,
Ho-Hsiang Wu,
Joao Rulff,
Magdalena Fuentes,
Mark Cartwright,
Claudio Silva,
Anish Arora,
Juan Pablo Bello
Abstract:
Audio applications involving environmental sound analysis increasingly use general-purpose audio representations, also known as embeddings, for transfer learning. Recently, Holistic Evaluation of Audio Representations (HEAR) evaluated twenty-nine embedding models on nineteen diverse tasks. However, the evaluation's effectiveness depends on the variation already captured within a given dataset. Therefore, for a given data domain, it is unclear how the representations would be affected by the variations caused by myriad microphones' range and acoustic conditions -- commonly known as channel effects. We aim to extend HEAR to evaluate invariance to channel effects in this work. To accomplish this, we imitate channel effects by injecting perturbations into the audio signal and measure the shift in the new (perturbed) embeddings with three distance measures, making the evaluation domain-dependent but not task-dependent. Combined with the downstream performance, this helps us make a more informed prediction of how robust the embeddings are to the channel effects. We evaluate two embeddings, YAMNet and OpenL3, on monophonic (UrbanSound8K) and polyphonic (SONYC-UST) urban datasets. We show that one distance measure does not suffice in such task-independent evaluation. Although Fréchet Audio Distance (FAD) correlates with the trend of the performance drop in the downstream task most accurately, we show that we need to study FAD in conjunction with the other distances to get a clear understanding of the overall effect of the perturbation. In terms of the embedding performance, we find OpenL3 to be more robust than YAMNet, which aligns with the HEAR evaluation.
Submitted 6 July, 2022; v1 submitted 19 March, 2022;
originally announced March 2022.
-
Soundata: A Python library for reproducible use of audio datasets
Authors:
Magdalena Fuentes,
Justin Salamon,
Pablo Zinemanas,
Martín Rocamora,
Genís Paja,
Irán R. Román,
Marius Miron,
Xavier Serra,
Juan Pablo Bello
Abstract:
Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version. It speeds up research pipelines by allowing users to quickly download a dataset, load it into memory in a standardized and reproducible way, validate that the dataset is complete and correct, and more. Soundata is based on and inspired by mirdata, and is designed to complement it by working with environmental sound, bioacoustic and speech datasets, among others. Soundata was created to be easy to use, easy to contribute to, and to increase reproducibility and standardize usage of sound datasets in a flexible way.
Submitted 4 October, 2021; v1 submitted 26 September, 2021;
originally announced September 2021.
-
Exploring modality-agnostic representations for music classification
Authors:
Ho-Hsiang Wu,
Magdalena Fuentes,
Juan P. Bello
Abstract:
Music information is often conveyed or recorded across multiple data modalities including but not limited to audio, images, text and scores. However, music information retrieval research has almost exclusively focused on single modality recognition, requiring development of separate models for each modality. Some multi-modal works require multiple coexisting modalities given to the model as inputs, constraining the use of these models to the few cases where data from all modalities are available. To the best of our knowledge, no existing model has the ability to take inputs from varying modalities, e.g. images or sounds, and classify them into unified music categories. We explore the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality. We select instrument classification as an example task for our study as both visual and audio components provide relevant semantic information. We train music instrument classifiers that can take either images or sounds as input, and perform comparably to sound-only or image-only classifiers. Furthermore, we explore the case when there is limited labeled data for a given modality, and the impact on performance of using labeled data from other modalities. We are able to achieve almost 70% of the performance of the best-performing system in a zero-shot setting. We provide a detailed analysis of experimental results to understand the potential and limitations of the approach, and discuss future steps towards modality-agnostic classifiers.
Submitted 2 June, 2021;
originally announced June 2021.
-
SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context
Authors:
Mark Cartwright,
Jason Cramer,
Ana Elisa Mendez Mendez,
Yu Wang,
Ho-Hsiang Wu,
Vincent Lostanlen,
Magdalena Fuentes,
Graham Dove,
Charlie Mydlarz,
Justin Salamon,
Oded Nov,
Juan Pablo Bello
Abstract:
We present SONYC-UST-V2, a dataset for urban sound tagging with spatiotemporal information. This dataset is intended for the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC-UST-V2 consists of 18510 audio recordings from the "Sounds of New York City" (SONYC) acoustic sensor network, including the timestamp of audio acquisition and location of the sensor. The dataset contains annotations by volunteers from the Zooniverse citizen science platform, as well as a two-stage verification with our team. In this article, we describe our data collection procedure and propose evaluation metrics for multilabel classification of urban sound tags. We report the results of a simple baseline model that exploits spatiotemporal information.
Submitted 10 September, 2020;
originally announced September 2020.
-
Pioneering Studies on LTE eMBMS: Towards 5G Point-to-Multipoint Transmissions
Authors:
Hongzhi Chen,
De Mi,
Manuel Fuentes,
David Vargas,
Eduardo Garro,
Jose Luis Carcel,
Belkacem Mouhouche,
Pei Xiao,
Rahim Tafazolli
Abstract:
The first 5G (5th generation wireless systems) New Radio Release-15 was recently completed. However, the specification only considers the use of unicast technologies and the extension to point-to-multipoint (PTM) scenarios is not yet considered. To this end, we first present in this work a technical overview of the state-of-the-art LTE (Long Term Evolution) PTM technology, i.e., eMBMS (evolved Multimedia Broadcast Multicast Services), and investigate the physical layer performance via link-level simulations. Then, based on the simulation analysis, we discuss potential improvements for the two current eMBMS solutions, i.e., MBSFN (MBMS over Single Frequency Networks) and SC-PTM (Single-Cell PTM). This work explicitly focuses on equipping the current eMBMS solutions with 5G candidate techniques, e.g., multiple antennas and millimeter wave, and on their potential to meet the requirements of next generation PTM transmissions.
Submitted 29 November, 2019;
originally announced January 2020.
-
On the Performance of PDCCH in LTE and 5G New Radio
Authors:
Hongzhi Chen,
De Mi,
Manuel Fuentes,
Eduardo Garro,
Jose Luis Carcel,
Belkacem Mouhouche,
Pei Xiao,
Rahim Tafazolli
Abstract:
5G New Radio (NR) Release 15 was specified in June 2018. It introduces numerous changes and potential improvements for physical layer data transmissions, although only point-to-point (PTP) communications are considered. In order to use physical data channels such as the Physical Downlink Shared Channel (PDSCH), it is essential to guarantee a successful transmission of control information via the Physical Downlink Control Channel (PDCCH). Taking into account these two aspects, in this paper, we first analyze the PDCCH processing chain in NR PTP as well as in the state-of-the-art Long Term Evolution (LTE) point-to-multipoint (PTM) solution, i.e., evolved Multimedia Broadcast Multicast Service (eMBMS). Then, via link level simulations, we compare the performance of the two technologies, observing the Bit/Block Error Rate (BER/BLER) for various scenarios. The objective is to identify the performance gap brought by physical layer changes in NR PDCCH as well as to provide insightful guidelines on the control channel configuration towards NR PTM scenarios.
Submitted 29 November, 2019;
originally announced January 2020.
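The two metrics named in the abstract above, Bit Error Rate (BER) and Block Error Rate (BLER), can be computed from transmitted and received blocks as in the minimal sketch below. It assumes blocks are equal-length lists of 0/1 bits and that a block counts as erroneous if any of its bits differs; the function name `error_rates` and the toy data are hypothetical and unrelated to the paper's simulator.

```python
def error_rates(sent_blocks, received_blocks):
    """Return (BER, BLER) for two equally shaped lists of bit blocks."""
    if len(sent_blocks) != len(received_blocks) or not sent_blocks:
        raise ValueError("need the same, non-zero number of blocks")
    bit_errors = 0
    total_bits = 0
    block_errors = 0
    for sent, received in zip(sent_blocks, received_blocks):
        if len(sent) != len(received):
            raise ValueError("blocks must have matching lengths")
        errors_in_block = sum(1 for s, r in zip(sent, received) if s != r)
        bit_errors += errors_in_block
        total_bits += len(sent)
        if errors_in_block > 0:  # any wrong bit makes the whole block wrong
            block_errors += 1
    return bit_errors / total_bits, block_errors / len(sent_blocks)


# Toy example: one flipped bit in the first of two 4-bit blocks.
sent = [[0, 1, 1, 0], [1, 1, 0, 0]]
received = [[0, 1, 0, 0], [1, 1, 0, 0]]
ber, bler = error_rates(sent, received)
```

In the toy example one of eight bits is wrong (BER 0.125) while one of two blocks is wrong (BLER 0.5), which shows why the two rates can differ widely for the same channel.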
-
A Fuzzy Control System for Inductive Video Games
Authors:
Carlos Lara-Alvarez,
Hugo Mitre-Hernandez,
Juan Flores,
Maria Fuentes
Abstract:
It has been shown that the emotional state of students has an important relationship with learning; for instance, engaged concentration is positively correlated with learning. This paper proposes the Inductive Control (IC) for educational games. Unlike conventional approaches that only modify the game level, the proposed technique also induces emotions in the player for supporting the learning process. This paper explores a fuzzy system that analyzes the players' performance and their emotional state for controlling the level and aesthetic content of an educational video game. The emotional state of the player is recognized through voice analysis. A total of 20 subjects played a video game designed to practice basic math skills; for each trial, a student played the same game twice in a row, each time controlled by one of the two approaches, Dynamic Difficulty Adjustment (DDA) or IC, with the playing order assigned randomly. Results show that when the proposed approach is used, the participants changed faster from Unpleasant-low to pleasant or high emotions, and reached the flow zone smoothly and remained in it. These experiments demonstrate that the inductive control technique improves the learning effectiveness through detection and stimulation of positive emotions.
Submitted 15 April, 2018; v1 submitted 4 September, 2017;
originally announced September 2017.
-
Does network complexity help organize Babel's library?
Authors:
Juan Pablo Cárdenas,
Iván González,
Gerardo Vidal,
Miguel Fuentes
Abstract:
In this work, we study properties of texts from the perspective of complex network theory. Words in given texts are linked by co-occurrence and transformed into networks, and we observe that these display topological properties common to other complex systems. However, there are some properties that seem to be exclusive to texts; many of these properties depend on the frequency of words in the text, while others seem to be strictly determined by the grammar. It is precisely these properties that allow texts to be categorized as either having a sense, or as encoded or senseless.
Submitted 16 October, 2015; v1 submitted 23 September, 2014;
originally announced September 2014.
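The construction described in the abstract above, where words are linked by co-occurrence to form a network, can be sketched as follows. This is an illustrative assumption about the details: co-occurrence is taken to mean appearing within `window` positions of each other in a whitespace-tokenized, lower-cased text, and the function name `cooccurrence_network` is hypothetical.

```python
from collections import defaultdict


def cooccurrence_network(text, window=1):
    """Build an undirected word network as a dict mapping word -> set of neighbours."""
    words = text.lower().split()
    graph = defaultdict(set)
    for i, word in enumerate(words):
        # Link the word to the next `window` words that follow it.
        for j in range(i + 1, min(i + 1 + window, len(words))):
            other = words[j]
            if other != word:  # skip self-loops
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


network = cooccurrence_network("the cat saw the dog")
# Node degree, a basic topological property of the network.
degrees = {word: len(neighbours) for word, neighbours in network.items()}
```

Frequent words such as "the" accumulate more neighbours, so even this toy network shows the dependence of topology on word frequency that the abstract mentions.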