Showing 1–30 of 30 results for author: Serrà, J

Searching in archive eess.

  1. arXiv:2310.00140  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS eess.SP

    GASS: Generalizing Audio Source Separation with Large-scale Data

    Authors: Jordi Pons, Xiaoyu Liu, Santiago Pascual, Joan Serrà

    Abstract: Universal source separation aims at separating the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation is limited because most existing works focus on mixes with predominantly sound events, and small training datasets also limit its potential for supervised learning. Here, we study a s…

    Submitted 29 September, 2023; originally announced October 2023.

  2. arXiv:2306.14647  [pdf, other]

    cs.SD cs.LG eess.AS

    Mono-to-stereo through parametric stereo generation

    Authors: Joan Serrà, Davide Scaini, Santiago Pascual, Daniel Arteaga, Jordi Pons, Jeroen Breebaart, Giulio Cengarle

    Abstract: Generating a stereophonic presentation from a monophonic audio signal is a challenging open task, especially if the goal is to obtain a realistic spatial imaging with a specific panning of sound elements. In this work, we propose to convert mono to stereo by means of predicting parametric stereo (PS) parameters using both nearest neighbor and deep network approaches. In combination with PS, we als…

    Submitted 26 June, 2023; originally announced June 2023.

    Comments: 7 pages, 1 figure; accepted for ISMIR23

  3. arXiv:2306.09635  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS eess.SP

    CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

    Authors: Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, Julian McAuley

    Abstract: Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge…

    Submitted 23 July, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: Accepted by WASPAA 2023. Demo: https://salu133445.github.io/clipsonic/

  4. arXiv:2210.14661  [pdf, other]

    cs.SD cs.LG eess.AS

    Full-band General Audio Synthesis with Score-based Diffusion

    Authors: Santiago Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, Joan Serrà

    Abstract: Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of an autoregressive approach, they are typically composed of pre-trained latent encoders and/or several cascaded modules. In this work, we propose a d…

    Submitted 26 October, 2022; originally announced October 2022.

  5. arXiv:2210.12635  [pdf, other]

    cs.SD cs.AI eess.AS

    Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

    Authors: Xiaoyu Liu, Xu Li, Joan Serrà

    Abstract: Single channel target speaker separation (TSS) aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings. In this paper, we look into several impo…

    Submitted 26 October, 2022; v1 submitted 23 October, 2022; originally announced October 2022.

    Comments: Submitted version to ICASSP 2023

  6. arXiv:2210.12108  [pdf, other]

    cs.SD cs.LG eess.AS

    Adversarial Permutation Invariant Training for Universal Sound Separation

    Authors: Emilian Postolache, Jordi Pons, Santiago Pascual, Joan Serrà

    Abstract: Universal sound separation consists of separating mixes with arbitrary sounds of different types, and permutation invariant training (PIT) is used to train source agnostic models that do so. In this work, we complement PIT with adversarial losses but find it challenging with the standard formulation used in speech source separation. We overcome this challenge with a novel I-replacement context-bas…

    Submitted 6 March, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

    Comments: Demo page: http://jordipons.me/apps/adversarialPIT/, Accepted at ICASSP-23

  7. arXiv:2206.03065  [pdf, other]

    cs.SD cs.LG eess.AS

    Universal Speech Enhancement with Score-based Diffusion

    Authors: Joan Serrà, Santiago Pascual, Jordi Pons, R. Oguz Araz, Davide Scaini

    Abstract: Removing background noise from speech audio has been the subject of considerable effort, especially in recent years due to the rise of virtual communication and amateur recordings. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbi…

    Submitted 16 September, 2022; v1 submitted 7 June, 2022; originally announced June 2022.

    Comments: 24 pages, 6 figures; includes appendix; examples in https://serrjoa.github.io/projects/universe/

  8. arXiv:2202.07968  [pdf, other]

    cs.SD cs.LG eess.AS

    On loss functions and evaluation metrics for music source separation

    Authors: Enric Gusó, Jordi Pons, Santiago Pascual, Joan Serrà

    Abstract: We investigate which loss functions provide better separations via benchmarking an extensive set of those for music source separation. To that end, we first survey the most representative audio source separation losses we identified, to later consistently benchmark them in a controlled experimental setup. We also explore using such losses as evaluation metrics, via cross-correlating them with the…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022

  9. arXiv:2111.11773  [pdf, other]

    cs.SD cs.AI eess.AS

    Upsampling layers for music source separation

    Authors: Jordi Pons, Joan Serrà, Santiago Pascual, Giulio Cengarle, Daniel Arteaga, Davide Scaini

    Abstract: Upsampling artifacts are caused by problematic upsampling layers and due to spectral replicas that emerge while upsampling. Also, depending on the upsampling layer used, such artifacts can either be tonal artifacts (additive high-frequency noise) or filtering artifacts (subtractive, attenuating some bands). In this work we investigate the practical implications of having upsampling artifacts in t…

    Submitted 23 November, 2021; originally announced November 2021.

    Comments: Demo page: http://www.jordipons.me/apps/upsamplers/

  10. arXiv:2109.15188  [pdf, other]

    cs.SD cs.IR eess.AS

    Assessing Algorithmic Biases for Musical Version Identification

    Authors: Furkan Yesiler, Marius Miron, Joan Serrà, Emilia Gómez

    Abstract: Version identification (VI) systems now offer accurate and scalable solutions for detecting different renditions of a musical composition, allowing the use of these systems in industrial applications and throughout the wider music ecosystem. Such use can have an important impact on various stakeholders regarding recognition and financial benefits, including how royalties are circulated for digital…

    Submitted 30 September, 2021; originally announced September 2021.

  11. Audio-based Musical Version Identification: Elements and Challenges

    Authors: Furkan Yesiler, Guillaume Doras, Rachel M. Bittner, Christopher J. Tralie, Joan Serrà

    Abstract: In this article, we aim to provide a review of the key ideas and approaches proposed in 20 years of scientific literature around musical version identification (VI) research and connect them to current practice. For more than a decade, VI systems suffered from the accuracy-scalability trade-off, with attempts to increase accuracy that typically resulted in cumbersome, non-scalable systems. Recent…

    Submitted 6 September, 2021; originally announced September 2021.

    Comments: Accepted to be published in IEEE Signal Processing Magazine

  12. arXiv:2107.03100  [pdf, other]

    cs.SD eess.AS

    Adversarial Auto-Encoding for Packet Loss Concealment

    Authors: Santiago Pascual, Joan Serrà, Jordi Pons

    Abstract: Communication technologies like voice over IP operate under constrained real-time conditions, with voice packets being subject to delays and losses from the network. In such cases, the packet loss concealment (PLC) algorithm reconstructs missing frames until a new real packet is received. Recently, autoregressive deep neural networks have been shown to surpass the quality of signal processing meth…

    Submitted 8 July, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

  13. arXiv:2104.04143  [pdf, other]

    cs.SD eess.AS physics.soc-ph

    Heaps' Law and Vocabulary Richness in the History of Classical Music Harmony

    Authors: Marc Serra-Peralta, Joan Serrà, Álvaro Corral

    Abstract: Music is a fundamental human construct, and harmony provides the building blocks of musical language. Using the Kunstderfuge corpus of classical music, we analyze the historical evolution of the richness of harmonic vocabulary of 76 classical composers, covering almost 6 centuries. Such corpus comprises about 9500 pieces, resulting in more than 5 million tokens of music codewords. The fulfilment o…

    Submitted 9 April, 2021; originally announced April 2021.

    Comments: 12 pages

  14. arXiv:2104.03725  [pdf, other]

    cs.LG cs.AI cs.CV cs.SD eess.AS

    On tuning consistent annealed sampling for denoising score matching

    Authors: Joan Serrà, Santiago Pascual, Jordi Pons

    Abstract: Score-based generative models provide state-of-the-art quality for image and audio synthesis. Sampling from these models is performed iteratively, typically employing a discretized series of noise levels and a predefined scheme. In this note, we first overview three common sampling schemes for models trained with denoising score matching. Next, we focus on one of them, consistent annealed sampling…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: 3 pages and 1 figure

  15. arXiv:2102.12046  [pdf, other]

    eess.IV

    An Adaptive Video Acquisition Scheme for Object Tracking and its Performance Optimization

    Authors: Srutarshi Banerjee, Henry H. Chopp, Juan G. Serra, Hao Tian Yang, Oliver Cossairt, A. K. Katsaggelos

    Abstract: We present a novel adaptive host-chip modular architecture for video acquisition to optimize an overall objective task constrained under a given bit rate. The chip is a high resolution imaging sensor such as gigapixel focal plane array (FPA) with low computational power deployed on the field remotely, while the host is a server with high computational power. The communication channel data bandwidt…

    Submitted 23 February, 2021; originally announced February 2021.

  16. arXiv:2101.02098  [pdf, other]

    cs.SD cs.IR eess.AS

    Investigating the efficacy of music version retrieval systems for setlist identification

    Authors: Furkan Yesiler, Emilio Molina, Joan Serrà, Emilia Gómez

    Abstract: The setlist identification (SLI) task addresses a music recognition use case where the goal is to retrieve the metadata and timestamps for all the tracks played in live music events. Due to various musical and non-musical changes in live performances, developing automatic SLI systems is still a challenging task that, despite its industrial relevance, has been under-explored in the academic literat…

    Submitted 6 January, 2021; originally announced January 2021.

  17. arXiv:2010.14356  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Upsampling artifacts in neural audio synthesis

    Authors: Jordi Pons, Santiago Pascual, Giulio Cengarle, Joan Serrà

    Abstract: A number of recent advances in neural audio synthesis rely on upsampling layers, which can introduce undesired artifacts. In computer vision, upsampling artifacts have been studied and are known as checkerboard artifacts (due to their characteristic visual pattern). However, their effect has been overlooked so far in audio processing. Here, we address this gap by studying this problem from the aud…

    Submitted 9 February, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: In proceedings of ICASSP2021. Code: https://github.com/DolbyLaboratories/neural-upsampling-artifacts-audio

  18. arXiv:2010.10291  [pdf, other]

    eess.AS cs.SD

    Automatic multitrack mixing with a differentiable mixing console of neural audio effects

    Authors: Christian J. Steinmetz, Jordi Pons, Santiago Pascual, Joan Serrà

    Abstract: Applications of deep learning to automatic multitrack mixing are largely unexplored. This is partly due to the limited available data, coupled with the fact that such data is relatively unstructured and variable. To address these challenges, we propose a domain-inspired model with a strong inductive bias for the mixing task. We achieve this with the application of pre-trained sub-networks and weig…

    Submitted 20 October, 2020; originally announced October 2020.

  19. arXiv:2010.03284  [pdf, other]

    cs.SD cs.LG eess.AS

    Less is more: Faster and better music version identification with embedding distillation

    Authors: Furkan Yesiler, Joan Serrà, Emilia Gómez

    Abstract: Version identification systems aim to detect different renditions of the same underlying musical composition (loosely called cover songs). By learning to encode entire recordings into plain vector embeddings, recent systems have made significant progress in bridging the gap between accuracy and scalability, which has been a key challenge for nearly two decades. In this work, we propose to further…

    Submitted 7 October, 2020; originally announced October 2020.

    Comments: Accepted to the 21st International Society for Music Information Retrieval Conference (ISMIR 2020)

  20. arXiv:2010.00368  [pdf, other]

    eess.AS cs.LG cs.SD

    SESQA: semi-supervised learning for speech quality assessment

    Authors: Joan Serrà, Jordi Pons, Santiago Pascual

    Abstract: Automatic speech quality assessment is an important, transversal task whose progress is hampered by the scarcity of human annotations, poor generalization to unseen recording conditions, and a lack of flexibility of existing approaches. In this work, we tackle these problems with a semi-supervised learning approach, combining available annotations with programmatically generated data, and using 3…

    Submitted 8 February, 2021; v1 submitted 1 October, 2020; originally announced October 2020.

    Comments: Long version (with appendix) of the paper with the same title accepted for ICASSP2021

  21. arXiv:1910.12551  [pdf, other]

    cs.SD cs.LG eess.AS

    Accurate and Scalable Version Identification Using Musically-Motivated Embeddings

    Authors: Furkan Yesiler, Joan Serrà, Emilia Gómez

    Abstract: The version identification (VI) task deals with the automatic detection of recordings that correspond to the same underlying musical piece. Despite many efforts, VI is still an open problem, with much room for improvement, especially with regard to combining accuracy and scalability. In this paper, we present MOVE, a musically-motivated method for accurate and scalable version identification. MOVE…

    Submitted 13 April, 2020; v1 submitted 28 October, 2019; originally announced October 2019.

  22. arXiv:1906.00794  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion

    Authors: Joan Serrà, Santiago Pascual, Carlos Segura

    Abstract: End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations. Voice conversion, in which a model has to impersonate a speaker in a recording, is one of those situations. In this paper, we propose Blow, a single-scale normalizing flow using hypernetwork conditioning to perform many-to-many voice conv…

    Submitted 5 September, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

    Comments: Includes appendix. Accepted for NeurIPS2019

  23. arXiv:1904.03418  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards Generalized Speech Enhancement with Generative Adversarial Networks

    Authors: Santiago Pascual, Joan Serrà, Antonio Bonafonte

    Abstract: The speech enhancement task usually consists of removing additive noise or reverberation that partially mask spoken utterances, affecting their intelligibility. However, little attention is drawn to other, perhaps more aggressive signal distortions like clipping, chunk elimination, or frequency-band removal. Such distortions can have a large impact not only on intelligibility, but also on naturaln…

    Submitted 6 April, 2019; originally announced April 2019.

  24. arXiv:1904.03416  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

    Authors: Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, Yoshua Bengio

    Abstract: Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This…

    Submitted 6 April, 2019; originally announced April 2019.

  25. arXiv:1810.10274  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Training neural audio classifiers with few data

    Authors: Jordi Pons, Joan Serrà, Xavier Serra

    Abstract: We investigate supervised learning strategies that improve the training of neural network audio classifiers on small annotated collections. In particular, we study whether (i) a naive regularization of the solution space, (ii) prototypical networks, (iii) transfer learning, or (iv) their combination, can foster deep learning models to better leverage a small amount of training examples. To this en…

    Submitted 3 November, 2018; v1 submitted 24 October, 2018; originally announced October 2018.

    Comments: Code: https://github.com/jordipons/neural-classifiers-with-few-audio/

  26. arXiv:1808.10687  [pdf, other]

    cs.SD cs.LG eess.AS

    Whispered-to-voiced Alaryngeal Speech Conversion with Generative Adversarial Networks

    Authors: Santiago Pascual, Antonio Bonafonte, Joan Serrà, Jose A. Gonzalez

    Abstract: Most methods of voice restoration for patients suffering from aphonia produce either whispered or monotone speech. Apart from intelligibility, this type of speech lacks expressiveness and naturalness due to the absence of pitch (whispered speech) or artificial generation of it (monotone speech). Existing techniques to restore prosodic information typically combine a vocoder, which parameterises th…

    Submitted 5 November, 2018; v1 submitted 31 August, 2018; originally announced August 2018.

  27. arXiv:1808.10678  [pdf, other]

    cs.SD cs.LG eess.AS

    Self-Attention Linguistic-Acoustic Decoder

    Authors: Santiago Pascual, Antonio Bonafonte, Joan Serrà

    Abstract: The conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models like recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure tends to make them slow to train and to sample from. In this work, w…

    Submitted 5 November, 2018; v1 submitted 31 August, 2018; originally announced August 2018.

  28. arXiv:1712.06340  [pdf, other]

    cs.SD cs.LG eess.AS

    Language and Noise Transfer in Speech Enhancement Generative Adversarial Network

    Authors: Santiago Pascual, Maruchan Park, Joan Serrà, Antonio Bonafonte, Kang-Hun Ahn

    Abstract: Speech enhancement deep learning systems usually require large amounts of training data to operate in broad conditions or real applications. This makes the adaptability of those systems to new, low-resource environments an important topic. In this work, we present the results of adapting a speech enhancement generative adversarial network by finetuning the generator with small amounts of data. W…

    Submitted 18 December, 2017; originally announced December 2017.

  29. arXiv:1704.05249  [pdf, other]

    cs.LG cs.NI eess.SY

    Hot or not? Forecasting cellular network hot spots using sector performance indicators

    Authors: Joan Serrà, Ilias Leontiadis, Alexandros Karatzoglou, Konstantina Papagiannaki

    Abstract: To manage and maintain large-scale cellular networks, operators need to know which sectors underperform at any given time. For this purpose, they use the so-called hot spot score, which is the result of a combination of multiple network measurements and reflects the instantaneous overall performance of individual sectors. While operators have a good understanding of the current performance of a ne…

    Submitted 18 April, 2017; originally announced April 2017.

    Comments: Accepted for publication at ICDE 2017 - Industrial Track

  30. arXiv:1204.5383  [pdf, ps, other]

    math.OC eess.SY

    Climbing on Pyramids

    Authors: Jean Serra, Bangalore Ravi Kiran

    Abstract: A new approach is proposed for finding the "best cut" in a hierarchy of partitions by energy minimization. Said energy must be "climbing" i.e. it must be hierarchically and scale increasing. It encompasses separable energies and those composed under supremum.

    Submitted 13 June, 2012; v1 submitted 21 April, 2012; originally announced April 2012.

    Comments: Few more errata fixed