Showing 1–18 of 18 results for author: Elizalde, B

  1. arXiv:2402.00282 [pdf, other]

    eess.AS cs.SD

    PAM: Prompting Audio-Language Models for Audio Quality Assessment

    Authors: Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang

    Abstract: While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calcu…

    Submitted 31 January, 2024; originally announced February 2024.

  2. arXiv:2401.08887 [pdf, ps, other]

    cs.SD cs.AI cs.CL eess.AS

    NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription

    Authors: Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Pe'er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka

    Abstract: We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings ("NOTSOFAR-1") Challenge alongside datasets and baseline system. The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios, with single-channel and known-geometry multi-channel tracks, and serves as a launch platform for two new datasets: First…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: preprint

  3. arXiv:2310.02298 [pdf, other]

    cs.SD cs.AI eess.AS

    Prompting Audios Using Acoustic Properties For Emotion Representation

    Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh

    Abstract: Emotions lie on a continuum, but current models treat emotions as a finite valued discrete variable. This representation does not capture the diversity in the expression of emotion. To better represent emotions we propose the use of natural language descriptions (or prompts). In this work, we address the challenge of automatically generating these prompts and training a model to better learn emoti…

    Submitted 6 December, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2211.07737

  4. arXiv:2309.07372 [pdf, other]

    eess.AS cs.SD

    Training Audio Captioning Models without Audio

    Authors: Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang

    Abstract: Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an a…

    Submitted 13 September, 2023; originally announced September 2023.

  5. arXiv:2309.05767 [pdf, other]

    cs.SD eess.AS

    Natural Language Supervision for General-Purpose Audio Representations

    Authors: Benjamin Elizalde, Soham Deshmukh, Huaming Wang

    Abstract: Audio-Language models jointly learn multimodal text and audio representations that enable Zero-Shot inference. Models rely on the encoders to create powerful representations of the input and generalize to multiple tasks ranging from sounds, music, and speech. Although models have achieved remarkable performance, there is still a performance gap with task-specific models. In this paper, we propose…

    Submitted 6 February, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

  6. arXiv:2305.11834 [pdf, other]

    eess.AS cs.SD

    Pengi: An Audio Language Model for Audio Tasks

    Authors: Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang

    Abstract: In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks, while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended ta…

    Submitted 18 January, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: Accepted at NeurIPS 2023. The manuscript is updated with additional experiments suggested by reviewers

  7. arXiv:2302.09719 [pdf, ps, other]

    eess.AS cs.SD

    Synergy between human and machine approaches to sound/scene recognition and processing: An overview of ICASSP special session

    Authors: Laurie M. Heller, Benjamin Elizalde, Bhiksha Raj, Soham Deshmukh

    Abstract: Machine Listening, as usually formalized, attempts to perform a task that is, from our perspective, fundamentally human-performable, and performed by humans. Current automated models of Machine Listening vary from purely data-driven approaches to approaches imitating human systems. In recent years, the most promising approaches have been hybrid in that they have used data-driven approaches informe…

    Submitted 23 February, 2023; v1 submitted 19 February, 2023; originally announced February 2023.

    Comments: 4 pages. Summary of Special Session planned for 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://2023.ieeeicassp.org/ Second version has corrected spelling of an author's name

  8. arXiv:2211.07737 [pdf, other]

    cs.SD cs.LG eess.AS

    Describing emotions with acoustic property prompts for speech emotion recognition

    Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh

    Abstract: Emotions lie on a broad continuum and treating emotions as a discrete number of classes limits the ability of a model to capture the nuances in the continuum. The challenge is how to describe the nuances of emotions and how to enable a model to learn the descriptions. In this work, we devise a method to automatically create a description (or prompt) for a given audio by computing acoustic properti…

    Submitted 14 November, 2022; originally announced November 2022.

  9. arXiv:2209.14275 [pdf, other]

    eess.AS cs.AI

    Audio Retrieval with WavText5K and CLAP Training

    Authors: Soham Deshmukh, Benjamin Elizalde, Huaming Wang

    Abstract: Audio-Text retrieval takes a natural language query to retrieve relevant audio files in a database. Conversely, Text-Audio retrieval takes an audio file as a query to retrieve relevant natural language descriptions. Most of the literature train retrieval systems with one audio captioning dataset, but evaluating the benefit of training with multiple datasets is underexplored. Moreover, retrieval sy…

    Submitted 28 September, 2022; originally announced September 2022.

  10. arXiv:2206.04769 [pdf, other]

    cs.SD eess.AS

    CLAP: Learning Audio Concepts From Natural Language Supervision

    Authors: Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang

    Abstract: Mainstream Audio Analytics models are trained to learn under the paradigm of one class label to many recordings focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our app…

    Submitted 9 June, 2022; originally announced June 2022.

  11. arXiv:2105.10619 [pdf, other]

    cs.SD eess.AS

    COVID-19 Detection Using Recorded Coughs in the 2021 DiCOVA Challenge

    Authors: Benjamin Elizalde, Daniel Tompkins

    Abstract: COVID-19 has resulted in over 100 million infections and caused worldwide lock downs due to its high transmission rate and limited testing options. Current diagnostic tests can be expensive, limited in availability, time-intensive and require risky in-person appointments. It has been established that symptomatic COVID-19 seriously impairs normal functioning of the respiratory system, thus affectin…

    Submitted 21 May, 2021; originally announced May 2021.

  12. arXiv:2104.12693 [pdf, other]

    cs.SD eess.AS

    Identifying Actions for Sound Event Classification

    Authors: Benjamin Elizalde, Radu Revutchi, Samarjit Das, Bhiksha Raj, Ian Lane, Laurie M. Heller

    Abstract: In Psychology, actions are paramount for humans to identify sound events. In Machine Learning (ML), action recognition achieves high accuracy; however, it has not been asked whether identifying actions can benefit Sound Event Classification (SEC), as opposed to mapping the audio directly to a sound event. Therefore, we propose a new Psychology-inspired approach for SEC that includes identification…

    Submitted 5 August, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

  13. arXiv:2002.09026 [pdf]

    eess.AS cs.IR cs.LG cs.SD

    Multi-label Sound Event Retrieval Using a Deep Learning-based Siamese Structure with a Pairwise Presence Matrix

    Authors: Jianyu Fan, Eric Nichols, Daniel Tompkins, Ana Elisa Mendez Mendez, Benjamin Elizalde, Philippe Pasquier

    Abstract: Realistic recordings of soundscapes often have multiple sound events co-occurring, such as car horns, engine and human voices. Sound event retrieval is a type of content-based search aiming at finding audio samples, similar to an audio query based on their acoustic or semantic content. State of the art sound event retrieval models have focused on single-label audio recordings, with only one sound…

    Submitted 20 February, 2020; originally announced February 2020.

    Comments: Paper accepted for 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  14. arXiv:1801.05544 [pdf, other]

    cs.SD eess.AS

    NELS -- Never-Ending Learner of Sounds

    Authors: Benjamin Elizalde, Rohan Badlani, Ankit Shah, Anurag Kumar, Bhiksha Raj

    Abstract: Sounds are essential to how humans perceive and interact with the world and are captured in recordings and shared on the Internet on a minute-by-minute basis. These recordings, which are predominantly videos, constitute the largest archive of sounds we know. However, most of these recordings have undescribed content making necessary methods for automatic sound analysis, indexing and retrieval. The…

    Submitted 29 March, 2023; v1 submitted 16 January, 2018; originally announced January 2018.

    Comments: Accepted at Machine Learning for Audio Signal Processing (ML4Audio), 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA

  15. arXiv:1801.02690 [pdf, other]

    cs.SD eess.AS

    DCASE 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features

    Authors: Abelino Jimenez, Benjamin Elizalde, Bhiksha Raj

    Abstract: Acoustic scene recordings are represented by different types of handcrafted or Neural Network-derived features. These features, typically of thousands of dimensions, are classified in state of the art approaches using kernel machines, such as the Support Vector Machines (SVM). However, the complexity of training these methods increases with the dimensionality of these input features and the size o…

    Submitted 8 January, 2018; originally announced January 2018.

  16. arXiv:1711.00804 [pdf, other]

    cs.SD cs.AI cs.IR eess.AS

    Framework for evaluation of sound event detection in web videos

    Authors: Rohan Badlani, Ankit Shah, Benjamin Elizalde, Anurag Kumar, Bhiksha Raj

    Abstract: The largest source of sound events is web videos. Most videos lack sound event labels at segment level, however, a significant number of them do respond to text queries, from a match found using metadata by search engines. In this paper we explore the extent to which a search query can be used as the true label for detection of sound events in videos. We present a framework for large-scale sound e…

    Submitted 4 April, 2018; v1 submitted 2 November, 2017; originally announced November 2017.

    Comments: Camera Ready Version of Paper accepted at International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2018. First two Authors - Rohan Badlani and Ankit Shah contributed equally

  17. arXiv:1710.10974 [pdf, other]

    cs.SD cs.IR eess.AS

    Content-based Representations of audio using Siamese neural networks

    Authors: Pranay Manocha, Rohan Badlani, Anurag Kumar, Ankit Shah, Benjamin Elizalde, Bhiksha Raj

    Abstract: In this paper, we focus on the problem of content-based retrieval for audio, which aims to retrieve all semantically similar audio recordings for a given audio clip query. This problem is similar to the problem of query by example of audio, which aims to retrieve media samples from a database, which are similar to the user-provided example. We propose a novel approach which encodes the audio into…

    Submitted 15 February, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

  18. arXiv:1710.04288 [pdf, other]

    eess.AS cs.SD

    Audio Concept Classification with Hierarchical Deep Neural Networks

    Authors: Mirco Ravanelli, Benjamin Elizalde, Karl Ni, Gerald Friedland

    Abstract: Audio-based multimedia retrieval tasks may identify semantic information in audio streams, i.e., audio concepts (such as music, laughter, or a revving engine). Conventional Gaussian-Mixture-Models have had some success in classifying a reduced set of audio concepts. However, multi-class classification can benefit from context window analysis and the discriminating power of deeper architectures. Al…

    Submitted 11 October, 2017; originally announced October 2017.

    Journal ref: EUSIPCO 2014