
Showing 1–20 of 20 results for author: Salamon, J

Searching in archive eess.
  1. arXiv:2308.09089

    cs.SD cs.CV cs.IR cs.MM eess.AS

    Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

    Authors: Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto

    Abstract: Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: WASPAA 2023. Project page: https://juliawilkins.github.io/sound-effects-retrieval-from-video/. 4 pages, 2 figures, 2 tables

  2. arXiv:2304.08490

    cs.CV cs.MM cs.SD eess.AS

    Conditional Generation of Audio from Video via Foley Analogies

    Authors: Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens

    Abstract: The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributi…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: CVPR 2023

  3. arXiv:2303.10667

    cs.SD eess.AS

    Audio-Text Models Do Not Yet Leverage Natural Language

    Authors: Ho-Hsiang Wu, Oriol Nieto, Juan Pablo Bello, Justin Salamon

    Abstract: Multi-modal contrastive learning techniques in the audio-text domain have quickly become a highly active area of research. Most works are evaluated with standard audio retrieval and classification benchmarks assuming that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In…

    Submitted 19 March, 2023; originally announced March 2023.

    Comments: Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  4. arXiv:2203.15135

    cs.CL cs.SD eess.AS

    Filler Word Detection and Classification: A Dataset and Benchmark

    Authors: Ge Zhu, Juan-Pablo Caceres, Justin Salamon

    Abstract: Filler words such as 'uh' or 'um' are sounds or words people use to signal they are pausing to think. Finding and removing filler words from recordings is a common and tedious task in media editing. Automatically detecting and classifying filler words could greatly aid in this task, but few studies have been published on this problem to date. A key reason is the absence of a dataset with annotated…

    Submitted 1 July, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: To appear at Interspeech 2022

  5. arXiv:2203.03022

    cs.SD cs.AI cs.LG eess.AS stat.ML

    HEAR: Holistic Evaluation of Audio Representations

    Authors: Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

    Abstract: What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, in…

    Submitted 29 May, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

    Comments: to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track

  6. arXiv:2110.09600

    cs.SD eess.AS

    Who calls the shots? Rethinking Few-Shot Learning for Audio

    Authors: Yu Wang, Nicholas J. Bryan, Justin Salamon, Mark Cartwright, Juan Pablo Bello

    Abstract: Few-shot learning aims to train models that can recognize novel classes given just a handful of labeled examples, known as the support set. While the field has seen notable advances in recent years, they have often focused on multi-class image classification. Audio, in contrast, is often multi-label due to overlapping sounds, resulting in unique properties such as polyphony and signal-to-noise rat…

    Submitted 18 October, 2021; originally announced October 2021.

    Comments: WASPAA 2021

  7. arXiv:2109.12690

    cs.SD cs.DB cs.LG eess.AS

    Soundata: A Python library for reproducible use of audio datasets

    Authors: Magdalena Fuentes, Justin Salamon, Pablo Zinemanas, Martín Rocamora, Genís Paja, Irán R. Román, Marius Miron, Xavier Serra, Juan Pablo Bello

    Abstract: Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version. It speeds up research pipelines by allowing users to quickly download a dataset, load it into memory in a standardized and reproducible way, valid…

    Submitted 4 October, 2021; v1 submitted 26 September, 2021; originally announced September 2021.

  8. arXiv:2011.00803

    cs.SD eess.AS

    What's All the FUSS About Free Universal Sound Separation Data?

    Authors: Scott Wisdom, Hakan Erdogan, Daniel Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, John Hershey

    Abstract: We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate…

    Submitted 2 November, 2020; originally announced November 2020.

  9. arXiv:2011.00801

    cs.SD eess.AS

    Sound Event Detection and Separation: a Benchmark on Desed Synthetic Soundscapes

    Authors: Nicolas Turpault, Romain Serizel, Scott Wisdom, Hakan Erdogan, John Hershey, Eduardo Fonseca, Prem Seetharaman, Justin Salamon

    Abstract: We propose a benchmark of state-of-the-art sound event detection systems (SED). We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4 depending on time related modifications (time position of an event and length of clips) and we study the impact of non-target sound events and reverberation. We…

    Submitted 2 November, 2020; originally announced November 2020.

  10. arXiv:2009.05188

    cs.SD cs.LG eess.AS

    SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context

    Authors: Mark Cartwright, Jason Cramer, Ana Elisa Mendez Mendez, Yu Wang, Ho-Hsiang Wu, Vincent Lostanlen, Magdalena Fuentes, Graham Dove, Charlie Mydlarz, Justin Salamon, Oded Nov, Juan Pablo Bello

    Abstract: We present SONYC-UST-V2, a dataset for urban sound tagging with spatiotemporal information. This dataset is aimed for the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC…

    Submitted 10 September, 2020; originally announced September 2020.

  11. arXiv:2008.03729

    cs.SD cs.IR cs.LG eess.AS

    Metric Learning vs Classification for Disentangled Music Representation Learning

    Authors: Jongpil Lee, Nicholas J. Bryan, Justin Salamon, Zeyu Jin, Juhan Nam

    Abstract: Deep representation learning offers a powerful paradigm for mapping input data onto an organized embedding space and is useful for many music information retrieval tasks. Two central methods for representation learning include deep metric learning and classification, both having the same goal of learning a representation that can generalize well across tasks. Along with generalization, the emergin…

    Submitted 12 August, 2020; v1 submitted 9 August, 2020; originally announced August 2020.

    Comments: Accepted for publication at the 21st International Society for Music Information Retrieval Conference (ISMIR 2020)

  12. arXiv:2008.03720

    eess.AS cs.LG cs.SD

    Disentangled Multidimensional Metric Learning for Music Similarity

    Authors: Jongpil Lee, Nicholas J. Bryan, Justin Salamon, Zeyu Jin, Juhan Nam

    Abstract: Music similarity search is useful for a variety of creative tasks such as replacing one music recording with another recording with a similar "feel", a common task in video editing. For this task, it is typically necessary to define a similarity metric to compare one recording to another. Music similarity, however, is hard to define and depends on multiple simultaneous notions of similarity (i.e.…

    Submitted 12 August, 2020; v1 submitted 9 August, 2020; originally announced August 2020.

    Comments: Accepted for publication at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  13. arXiv:2008.03388

    eess.AS cs.LG cs.SD

    Controllable Neural Prosody Synthesis

    Authors: Max Morrison, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore

    Abstract: Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with diverse speaker excitement levels and emotions. We…

    Submitted 11 August, 2020; v1 submitted 7 August, 2020; originally announced August 2020.

    Comments: To appear in proceedings of INTERSPEECH 2020

  14. arXiv:2008.02791

    cs.SD eess.AS

    Few-Shot Drum Transcription in Polyphonic Music

    Authors: Yu Wang, Justin Salamon, Mark Cartwright, Nicholas J. Bryan, Juan Pablo Bello

    Abstract: Data-driven approaches to automatic drum transcription (ADT) are often limited to a predefined, small vocabulary of percussion instrument classes. Such models cannot recognize out-of-vocabulary classes nor are they able to adapt to finer-grained vocabularies. In this work, we address open vocabulary ADT by introducing few-shot learning to the task. We train a Prototypical Network on a synthetic da…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: ISMIR 2020 camera-ready

  15. arXiv:2007.03932

    cs.SD eess.AS eess.SP

    Improving Sound Event Detection In Domestic Environments Using Sound Separation

    Authors: Nicolas Turpault, Scott Wisdom, Hakan Erdogan, John Hershey, Romain Serizel, Eduardo Fonseca, Prem Seetharaman, Justin Salamon

    Abstract: Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing for sound event detection. In this paper we start from a sound separation model trained on t…

    Submitted 8 July, 2020; originally announced July 2020.

  16. arXiv:2006.06175

    cs.CV cs.SD eess.AS

    Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

    Authors: Karren Yang, Bryan Russell, Justin Salamon

    Abstract: Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the sensory streams. We propose a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of…

    Submitted 11 June, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

    Comments: CVPR 2020

    MSC Class: 68T45; ACM Class: I.4.0

  17. arXiv:1905.08352

    cs.SD cs.AI cs.LG eess.AS

    Robust sound event detection in bioacoustic sensor networks

    Authors: Vincent Lostanlen, Justin Salamon, Andrew Farnsworth, Steve Kelling, Juan Pablo Bello

    Abstract: Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and acro…

    Submitted 29 October, 2019; v1 submitted 20 May, 2019; originally announced May 2019.

    Comments: 32 pages, in English. Submitted to PLOS ONE journal in February 2019; revised August 2019; published October 2019

  18. arXiv:1805.00889

    cs.SD cs.CY cs.HC eess.AS

    SONYC: A System for the Monitoring, Analysis and Mitigation of Urban Noise Pollution

    Authors: Juan Pablo Bello, Claudio Silva, Oded Nov, R. Luke DuBois, Anish Arora, Justin Salamon, Charles Mydlarz, Harish Doraiswamy

    Abstract: We present the Sounds of New York City (SONYC) project, a smart cities initiative focused on developing a cyber-physical system for the monitoring, analysis and mitigation of urban noise pollution. Noise pollution is one of the topmost quality of life issues for urban residents in the U.S. with proven effects on health, education, the economy, and the environment. Yet, most cities lack the resourc…

    Submitted 18 May, 2018; v1 submitted 2 May, 2018; originally announced May 2018.

    Comments: Accepted May 2018, Communications of the ACM. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published in Communications of the ACM

  19. arXiv:1804.10070

    cs.SD cs.LG eess.AS stat.ML

    Adaptive pooling operators for weakly labeled sound event detection

    Authors: Brian McFee, Justin Salamon, Juan Pablo Bello

    Abstract: Sound event detection (SED) methods are tasked with labeling segments of audio recordings by the presence of active sound sources. SED is typically posed as a supervised machine learning problem, requiring strong annotations for the presence or absence of each sound source at every time instant within the recording. However, strong annotations of this type are both labor- and cost-intensive for hu…

    Submitted 10 August, 2018; v1 submitted 26 April, 2018; originally announced April 2018.

  20. arXiv:1802.06182

    eess.AS cs.LG cs.SD stat.ML

    CREPE: A Convolutional Representation for Pitch Estimation

    Authors: Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello

    Abstract: The task of estimating the fundamental frequency of a monophonic sound recording, also known as pitch tracking, is fundamental to audio processing with multiple applications in speech processing and music information retrieval. To date, the best performing techniques, such as the pYIN algorithm, are based on a combination of DSP pipelines and heuristics. While such techniques perform very well on…

    Submitted 16 February, 2018; originally announced February 2018.

    Comments: ICASSP 2018