Showing 1–24 of 24 results for author: Pascual, S

Searching in archive eess.
  1. arXiv:2310.00140  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS eess.SP

    GASS: Generalizing Audio Source Separation with Large-scale Data

    Authors: Jordi Pons, Xiaoyu Liu, Santiago Pascual, Joan Serrà

    Abstract: Universal source separation targets at separating the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation is limited because most existing works focus on mixes with predominantly sound events, and small training datasets also limit its potential for supervised learning. Here, we study a s…

    Submitted 29 September, 2023; originally announced October 2023.

  2. arXiv:2308.09300  [pdf, other]

    cs.CV cs.AI cs.MM cs.SD eess.AS

    V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

    Authors: Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, Weidong Cai

    Abstract: Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched whe…

    Submitted 13 December, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

    Comments: AAAI 2024. Demo page: https://v2a-mapper.github.io/

  3. arXiv:2306.14647  [pdf, other]

    cs.SD cs.LG eess.AS

    Mono-to-stereo through parametric stereo generation

    Authors: Joan Serrà, Davide Scaini, Santiago Pascual, Daniel Arteaga, Jordi Pons, Jeroen Breebaart, Giulio Cengarle

    Abstract: Generating a stereophonic presentation from a monophonic audio signal is a challenging open task, especially if the goal is to obtain a realistic spatial imaging with a specific panning of sound elements. In this work, we propose to convert mono to stereo by means of predicting parametric stereo (PS) parameters using both nearest neighbor and deep network approaches. In combination with PS, we als…

    Submitted 26 June, 2023; originally announced June 2023.

    Comments: 7 pages, 1 figure; accepted for ISMIR23

  4. arXiv:2306.09635  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS eess.SP

    CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

    Authors: Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, Julian McAuley

    Abstract: Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge…

    Submitted 23 July, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: Accepted by WASPAA 2023. Demo: https://salu133445.github.io/clipsonic/

  5. arXiv:2210.14661  [pdf, other]

    cs.SD cs.LG eess.AS

    Full-band General Audio Synthesis with Score-based Diffusion

    Authors: Santiago Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, Joan Serrà

    Abstract: Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of an autoregressive approach, they are typically conformed by pre-trained latent encoders and/or several cascaded modules. In this work, we propose a d…

    Submitted 26 October, 2022; originally announced October 2022.

  6. arXiv:2210.12108  [pdf, other]

    cs.SD cs.LG eess.AS

    Adversarial Permutation Invariant Training for Universal Sound Separation

    Authors: Emilian Postolache, Jordi Pons, Santiago Pascual, Joan Serrà

    Abstract: Universal sound separation consists of separating mixes with arbitrary sounds of different types, and permutation invariant training (PIT) is used to train source agnostic models that do so. In this work, we complement PIT with adversarial losses but find it challenging with the standard formulation used in speech source separation. We overcome this challenge with a novel I-replacement context-bas…

    Submitted 6 March, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

    Comments: Demo page: http://jordipons.me/apps/adversarialPIT/, Accepted at ICASSP-23

  7. arXiv:2206.03065  [pdf, other]

    cs.SD cs.LG eess.AS

    Universal Speech Enhancement with Score-based Diffusion

    Authors: Joan Serrà, Santiago Pascual, Jordi Pons, R. Oguz Araz, Davide Scaini

    Abstract: Removing background noise from speech audio has been the subject of considerable effort, especially in recent years due to the rise of virtual communication and amateur recordings. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbi…

    Submitted 16 September, 2022; v1 submitted 7 June, 2022; originally announced June 2022.

    Comments: 24 pages, 6 figures; includes appendix; examples in https://serrjoa.github.io/projects/universe/

  8. arXiv:2202.07968  [pdf, other]

    cs.SD cs.LG eess.AS

    On loss functions and evaluation metrics for music source separation

    Authors: Enric Gusó, Jordi Pons, Santiago Pascual, Joan Serrà

    Abstract: We investigate which loss functions provide better separations via benchmarking an extensive set of those for music source separation. To that end, we first survey the most representative audio source separation losses we identified, to later consistently benchmark them in a controlled experimental setup. We also explore using such losses as evaluation metrics, via cross-correlating them with the…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022

  9. arXiv:2111.11773  [pdf, other]

    cs.SD cs.AI eess.AS

    Upsampling layers for music source separation

    Authors: Jordi Pons, Joan Serrà, Santiago Pascual, Giulio Cengarle, Daniel Arteaga, Davide Scaini

    Abstract: Upsampling artifacts are caused by problematic upsampling layers and due to spectral replicas that emerge while upsampling. Also, depending on the used upsampling layer, such artifacts can either be tonal artifacts (additive high-frequency noise) or filtering artifacts (subtractive, attenuating some bands). In this work we investigate the practical implications of having upsampling artifacts in t…

    Submitted 23 November, 2021; originally announced November 2021.

    Comments: Demo page: http://www.jordipons.me/apps/upsamplers/

  10. arXiv:2107.03100  [pdf, other]

    cs.SD eess.AS

    Adversarial Auto-Encoding for Packet Loss Concealment

    Authors: Santiago Pascual, Joan Serrà, Jordi Pons

    Abstract: Communication technologies like voice over IP operate under constrained real-time conditions, with voice packets being subject to delays and losses from the network. In such cases, the packet loss concealment (PLC) algorithm reconstructs missing frames until a new real packet is received. Recently, autoregressive deep neural networks have been shown to surpass the quality of signal processing meth…

    Submitted 8 July, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

  11. arXiv:2104.03725  [pdf, other]

    cs.LG cs.AI cs.CV cs.SD eess.AS

    On tuning consistent annealed sampling for denoising score matching

    Authors: Joan Serrà, Santiago Pascual, Jordi Pons

    Abstract: Score-based generative models provide state-of-the-art quality for image and audio synthesis. Sampling from these models is performed iteratively, typically employing a discretized series of noise levels and a predefined scheme. In this note, we first overview three common sampling schemes for models trained with denoising score matching. Next, we focus on one of them, consistent annealed sampling…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: 3 pages and 1 figure

  12. arXiv:2010.14356  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Upsampling artifacts in neural audio synthesis

    Authors: Jordi Pons, Santiago Pascual, Giulio Cengarle, Joan Serrà

    Abstract: A number of recent advances in neural audio synthesis rely on upsampling layers, which can introduce undesired artifacts. In computer vision, upsampling artifacts have been studied and are known as checkerboard artifacts (due to their characteristic visual pattern). However, their effect has been overlooked so far in audio processing. Here, we address this gap by studying this problem from the aud…

    Submitted 9 February, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: In proceedings of ICASSP2021. Code: https://github.com/DolbyLaboratories/neural-upsampling-artifacts-audio

  13. arXiv:2010.10291  [pdf, other]

    eess.AS cs.SD

    Automatic multitrack mixing with a differentiable mixing console of neural audio effects

    Authors: Christian J. Steinmetz, Jordi Pons, Santiago Pascual, Joan Serrà

    Abstract: Applications of deep learning to automatic multitrack mixing are largely unexplored. This is partly due to the limited available data, coupled with the fact that such data is relatively unstructured and variable. To address these challenges, we propose a domain-inspired model with a strong inductive bias for the mixing task. We achieve this with the application of pre-trained sub-networks and weig…

    Submitted 20 October, 2020; originally announced October 2020.

  14. arXiv:2010.00368  [pdf, other]

    eess.AS cs.LG cs.SD

    SESQA: semi-supervised learning for speech quality assessment

    Authors: Joan Serrà, Jordi Pons, Santiago Pascual

    Abstract: Automatic speech quality assessment is an important, transversal task whose progress is hampered by the scarcity of human annotations, poor generalization to unseen recording conditions, and a lack of flexibility of existing approaches. In this work, we tackle these problems with a semi-supervised learning approach, combining available annotations with programmatically generated data, and using 3…

    Submitted 8 February, 2021; v1 submitted 1 October, 2020; originally announced October 2020.

    Comments: Long version (with appendix) of the paper with the same title accepted for ICASSP2021

  15. arXiv:2001.09239  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Multi-task self-supervised learning for Robust Speech Recognition

    Authors: Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, Yoshua Bengio

    Abstract: Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require ma…

    Submitted 17 April, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

    Comments: In Proc. of ICASSP 2020

  16. Sample Drop Detection for Distant-speech Recognition with Asynchronous Devices Distributed in Space

    Authors: Tina Raissi, Santiago Pascual, Maurizio Omologo

    Abstract: In many applications of multi-microphone multi-device processing, the synchronization among different input channels can be affected by the lack of a common clock and isolated drops of samples. In this work, we address the issue of sample drop detection in the context of a conversational speech scenario, recorded by a set of microphones distributed in space. The goal is to design a neural-based mo…

    Submitted 15 November, 2019; originally announced November 2019.

    Comments: Submitted to ICASSP 2020

    ACM Class: I.2.7

  17. arXiv:1906.04155  [pdf, other]

    physics.ins-det astro-ph.IM eess.IV physics.optics

    Standardized spectral and radiometric calibration of consumer cameras

    Authors: Olivier Burggraaff, Norbert Schmidt, Jaime Zamorano, Klaas Pauly, Sergio Pascual, Carlos Tapia, Evangelos Spyrakos, Frans Snik

    Abstract: Consumer cameras, particularly onboard smartphones and UAVs, are now commonly used as scientific instruments. However, their data processing pipelines are not optimized for quantitative radiometry and their calibration is more complex than that of scientific cameras. The lack of a standardized calibration methodology limits the interoperability between devices and, in the ever-changing market, ult…

    Submitted 7 June, 2019; originally announced June 2019.

    Comments: 27 pages, 11 figures, accepted for publication in Optics Express

  18. arXiv:1906.00794  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion

    Authors: Joan Serrà, Santiago Pascual, Carlos Segura

    Abstract: End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations. Voice conversion, in which a model has to impersonate a speaker in a recording, is one of those situations. In this paper, we propose Blow, a single-scale normalizing flow using hypernetwork conditioning to perform many-to-many voice conv…

    Submitted 5 September, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

    Comments: Includes appendix. Accepted for NeurIPS2019

  19. arXiv:1906.00733  [pdf, other]

    cs.SD eess.AS

    Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN

    Authors: David Álvarez, Santiago Pascual, Antonio Bonafonte

    Abstract: Text-to-speech (TTS) acoustic models map linguistic features into an acoustic representation out of which an audible waveform is generated. The latest and most natural TTS systems build a direct mapping between linguistic and waveform domains, like SampleRNN. This way, possible signal naturalness losses are avoided as intermediate acoustic representations are discarded. Another important dimension…

    Submitted 22 September, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

    Comments: Published at 10th ISCA Speech Synthesis Workshop

  20. arXiv:1904.03418  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards Generalized Speech Enhancement with Generative Adversarial Networks

    Authors: Santiago Pascual, Joan Serrà, Antonio Bonafonte

    Abstract: The speech enhancement task usually consists of removing additive noise or reverberation that partially mask spoken utterances, affecting their intelligibility. However, little attention is drawn to other, perhaps more aggressive signal distortions like clipping, chunk elimination, or frequency-band removal. Such distortions can have a large impact not only on intelligibility, but also on naturaln…

    Submitted 6 April, 2019; originally announced April 2019.

  21. arXiv:1904.03416  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

    Authors: Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, Yoshua Bengio

    Abstract: Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This…

    Submitted 6 April, 2019; originally announced April 2019.

  22. arXiv:1808.10687  [pdf, other]

    cs.SD cs.LG eess.AS

    Whispered-to-voiced Alaryngeal Speech Conversion with Generative Adversarial Networks

    Authors: Santiago Pascual, Antonio Bonafonte, Joan Serrà, Jose A. Gonzalez

    Abstract: Most methods of voice restoration for patients suffering from aphonia either produce whispered or monotone speech. Apart from intelligibility, this type of speech lacks expressiveness and naturalness due to the absence of pitch (whispered speech) or artificial generation of it (monotone speech). Existing techniques to restore prosodic information typically combine a vocoder, which parameterises th…

    Submitted 5 November, 2018; v1 submitted 31 August, 2018; originally announced August 2018.

  23. arXiv:1808.10678  [pdf, other]

    cs.SD cs.LG eess.AS

    Self-Attention Linguistic-Acoustic Decoder

    Authors: Santiago Pascual, Antonio Bonafonte, Joan Serrà

    Abstract: The conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models like recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure tends to make them slow to train and to sample from. In this work, w…

    Submitted 5 November, 2018; v1 submitted 31 August, 2018; originally announced August 2018.

  24. arXiv:1712.06340  [pdf, other]

    cs.SD cs.LG eess.AS

    Language and Noise Transfer in Speech Enhancement Generative Adversarial Network

    Authors: Santiago Pascual, Maruchan Park, Joan Serrà, Antonio Bonafonte, Kang-Hun Ahn

    Abstract: Speech enhancement deep learning systems usually require large amounts of training data to operate in broad conditions or real applications. This makes the adaptability of those systems into new, low resource environments an important topic. In this work, we present the results of adapting a speech enhancement generative adversarial network by finetuning the generator with small amounts of data. W…

    Submitted 18 December, 2017; originally announced December 2017.