Showing 1–14 of 14 results for author: Nieto, O

  1. arXiv:2406.11768  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

    Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

    Abstract: Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including feat…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Project Website: https://sreyan88.github.io/gamaaudio/

  2. arXiv:2405.15683  [pdf, other]

    cs.CV cs.AI cs.CL

    VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap

    Authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha

    Abstract: Recent interest in Large Vision-Language Models (LVLMs) for practical applications is moderated by the significant challenge of hallucination or the inconsistency between the factual information and the generated text. In this paper, we first perform an in-depth analysis of hallucinations and discover several novel insights about how and when LVLMs hallucinate. From our analysis, we show that: (1)…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: Preprint. Under review. Code will be released on paper acceptance

  3. arXiv:2310.08753  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

    Authors: Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, Ramaneswaran S, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

    Abstract: A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perfo…

    Submitted 18 June, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  4. arXiv:2308.09089  [pdf, other]

    cs.SD cs.CV cs.IR cs.MM eess.AS

    Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

    Authors: Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto

    Abstract: Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: WASPAA 2023. Project page: https://juliawilkins.github.io/sound-effects-retrieval-from-video/. 4 pages, 2 figures, 2 tables

  5. arXiv:2306.01945  [pdf, other]

    cs.CL cs.LG

    Efficient Spoken Language Recognition via Multilabel Classification

    Authors: Oriol Nieto, Zeyu Jin, Franck Dernoncourt, Justin Salamon

    Abstract: Spoken language recognition (SLR) is the task of automatically identifying the language present in a speech signal. Existing SLR models are either too computationally expensive or too large to run effectively on devices with limited resources. For real-world deployment, a model should also gracefully handle unseen languages outside of the target language set, yet prior work has focused on closed-s…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: Accepted to InterSpeech 2023

  6. arXiv:2303.16342  [pdf, other]

    cs.CV cs.AI cs.CL

    Language-Guided Audio-Visual Source Separation via Trimodal Consistency

    Authors: Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko

    Abstract: We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform, all without access to…

    Submitted 23 September, 2023; v1 submitted 28 March, 2023; originally announced March 2023.

    Comments: Accepted at CVPR 2023

  7. arXiv:2303.10667  [pdf, other]

    cs.SD eess.AS

    Audio-Text Models Do Not Yet Leverage Natural Language

    Authors: Ho-Hsiang Wu, Oriol Nieto, Juan Pablo Bello, Justin Salamon

    Abstract: Multi-modal contrastive learning techniques in the audio-text domain have quickly become a highly active area of research. Most works are evaluated with standard audio retrieval and classification benchmarks assuming that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In…

    Submitted 19 March, 2023; originally announced March 2023.

    Comments: Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  8. arXiv:2204.13289  [pdf, other]

    cs.SD cs.LG eess.AS

    Music Enhancement via Image Translation and Vocoding

    Authors: Nikhil Kandpal, Oriol Nieto, Zeyu Jin

    Abstract: Consumer-grade music recordings such as those captured by mobile devices typically contain distortions in the form of background noise, reverb, and microphone-induced EQ. This paper presents a deep learning approach to enhance low-quality music recordings by combining (i) an image-to-image translation model for manipulating audio in its mel-spectrogram representation and (ii) a music vocoding mode…

    Submitted 28 April, 2022; originally announced April 2022.

    Comments: ICASSP 2022

  9. arXiv:2010.16030  [pdf, other]

    cs.IR cs.MM cs.SD eess.AS

    Multimodal Metric Learning for Tag-based Music Retrieval

    Authors: Minz Won, Sergio Oramas, Oriol Nieto, Fabien Gouyon, Xavier Serra

    Abstract: Tag-based music retrieval is crucial to browse large-scale music libraries efficiently. Hence, automatic music tagging has been actively explored, mostly as a classification task, which has an inherent limitation: a fixed vocabulary. On the other hand, metric learning enables flexible vocabularies by using pretrained word embeddings as side information. Also, metric learning has already proven its…

    Submitted 29 October, 2020; originally announced October 2020.

    Comments: 5 pages, 2 figures, submitted to ICASSP 2021

  10. arXiv:2010.11512  [pdf, other]

    cs.SD cs.IR eess.AS

    Mood Classification Using Listening Data

    Authors: Filip Korzeniowski, Oriol Nieto, Matthew McCallum, Minz Won, Sergio Oramas, Erik Schmidt

    Abstract: The mood of a song is a highly relevant feature for exploration and recommendation in large collections of music. These collections tend to require automatic methods for predicting such moods. In this work, we show that listening-based features outperform content-based ones when classifying moods: embeddings obtained through matrix factorization of listening data appear to be more informative of a…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: Appears in Proc. of the International Society for Music Information Retrieval Conference 2020 (ISMIR 2020)

  11. arXiv:1802.03319  [pdf, other]

    stat.ML cs.SD eess.AS

    Predicting Audio Advertisement Quality

    Authors: Samaneh Ebrahimi, Hossein Vahabi, Matthew Prockup, Oriol Nieto

    Abstract: Online audio advertising is a particular form of advertising used abundantly in online music streaming services. In these platforms, which tend to host tens of thousands of unique audio advertisements (ads), providing high quality ads ensures a better user experience and results in longer user engagement. Therefore, the automatic assessment of these ads is an important step toward audio ads rankin…

    Submitted 9 February, 2018; originally announced February 2018.

    Comments: WSDM '18 Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 9 pages

    Journal ref: 2018. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM '18)

  12. arXiv:1711.02520  [pdf, other]

    cs.SD eess.AS

    End-to-end learning for music audio tagging at scale

    Authors: Jordi Pons, Oriol Nieto, Matthew Prockup, Erik Schmidt, Andreas Ehmann, Xavier Serra

    Abstract: The lack of data tends to limit the outcomes of deep learning research, particularly when dealing with end-to-end learning stacks processing raw data such as waveforms. In this study, 1.2M tracks annotated with musical labels are available to train our end-to-end models. This large amount of data allows us to unrestrictedly explore two different design paradigms for music auto-tagging: assumption-…

    Submitted 15 June, 2018; v1 submitted 7 November, 2017; originally announced November 2017.

    Comments: Presented at the Workshop on Machine Learning for Audio Signal Processing (ML4Audio) at NIPS 2017, and in proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR2018). Code: https://github.com/jordipons/music-audio-tagging-at-scale-models. Demo: http://www.jordipons.me/apps/music-audio-tagging-at-scale-demo/

  13. arXiv:1707.04916  [pdf, other]

    cs.IR

    Multi-label Music Genre Classification from Audio, Text, and Images Using Deep Features

    Authors: Sergio Oramas, Oriol Nieto, Francesco Barbieri, Xavier Serra

    Abstract: Music genres allow to categorize musical items that share common characteristics. Although these categories are not mutually exclusive, most related research is traditionally focused on classifying tracks into a single class. Furthermore, these categories (e.g., Pop, Rock) tend to be too broad for certain applications. In this work we aim to expand this task by categorizing musical items into mult…

    Submitted 16 July, 2017; originally announced July 2017.

    Comments: In Proceedings of the 18th International Society of Music Information Retrieval Conference (ISMIR 2017)

  14. A Deep Multimodal Approach for Cold-start Music Recommendation

    Authors: Sergio Oramas, Oriol Nieto, Mohamed Sordo, Xavier Serra

    Abstract: An increasing amount of digital music is being published daily. Music streaming services often ingest all available music, but this poses a challenge: how to recommend new artists for which prior knowledge is scarce? In this work we aim to address this so-called cold-start problem by combining text and audio information with user feedback data using deep network architectures. Our method is divide…

    Submitted 24 July, 2017; v1 submitted 29 June, 2017; originally announced June 2017.

    Comments: In Proceedings of the 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017), collocated with RecSys 2017