Search | arXiv e-print repository

Sequential Contrastive Audio-Visual Learning

Authors: Ioannis Tsiamas, Santiago Pascual, Chunghsin Yeh, Joan Serrà

Abstract: Contrastive learning has emerged as a powerful technique in audio-visual representation learning, leveraging the natural co-occurrence of audio and visual modalities in extensive web-scale video datasets to achieve significant advancements. However, conventional contrastive audio-visual learning methodologies often rely on aggregated representations derived through temporal aggregation, which neglects the intrinsic sequential nature of the data. This oversight raises concerns regarding the ability of standard approaches to capture and utilize fine-grained information within sequences, information that is vital for distinguishing between semantically similar yet distinct examples. In response to this limitation, we propose sequential contrastive audio-visual learning (SCAV), which contrasts examples based on their non-aggregated representation space using sequential distances. Retrieval experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV, showing 2-3x relative improvements against traditional aggregation-based contrastive learning and other methods from the literature. We also show that models trained with SCAV exhibit a high degree of flexibility regarding the metric employed for retrieval, allowing them to operate on a spectrum of efficiency-accuracy trade-offs, potentially making them applicable in multiple scenarios, from small- to large-scale retrieval.
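The contrast the abstract draws between aggregated and non-aggregated representations can be illustrated with a toy example. The sketch below is not the SCAV method: the function names, the choice of mean pooling, and the frame-by-frame Euclidean distance are all assumptions made for illustration. It only shows that two sequences with the same frames in a different order become indistinguishable after temporal averaging, while a distance computed along the sequence still separates them.

```python
import math

def mean_pool(seq):
    """Average a sequence of frame embeddings over time (assumed aggregation)."""
    n = len(seq)
    return [sum(frame[d] for frame in seq) / n for d in range(len(seq[0]))]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def aggregated_distance(seq_a, seq_b):
    """Distance between the time-averaged embeddings of two sequences."""
    return euclidean(mean_pool(seq_a), mean_pool(seq_b))

def sequential_distance(seq_a, seq_b):
    """Mean frame-by-frame distance between two equal-length sequences
    (a hypothetical stand-in for a sequential distance)."""
    return sum(euclidean(a, b) for a, b in zip(seq_a, seq_b)) / len(seq_a)

# Two toy sequences containing the same frames in opposite temporal order.
rising = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
falling = list(reversed(rising))

print(aggregated_distance(rising, falling))  # the time averages coincide
print(sequential_distance(rising, falling))  # the temporal order still matters
```

In this toy setting the aggregated distance is zero because both sequences share the same time average, whereas the frame-wise distance is clearly positive.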

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2310.00140 [pdf, ps, other]

GASS: Generalizing Audio Source Separation with Large-scale Data

Authors: Jordi Pons, Xiaoyu Liu, Santiago Pascual, Joan Serrà

Abstract: Universal source separation aims at separating the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation is limited because most existing works focus on mixes with predominantly sound events, and small training datasets also limit its potential for supervised learning. Here, we study a single general audio source separation (GASS) model trained to separate speech, music, and sound events in a supervised fashion with a large-scale dataset. We assess GASS models on a diverse set of tasks. Our strong in-distribution results show the feasibility of GASS models, and the competitive out-of-distribution performance in sound event and speech separation shows its generalization abilities. Yet, it is challenging for GASS models to generalize for separating out-of-distribution cinematic and music content. We also fine-tune GASS models on each dataset and consistently outperform the ones without pre-training. All fine-tuned models (except the music separation one) obtain state-of-the-art results in their respective benchmarks.

Submitted 29 September, 2023; originally announced October 2023.

arXiv:2306.14647 [pdf, other]

Mono-to-stereo through parametric stereo generation

Authors: Joan Serrà, Davide Scaini, Santiago Pascual, Daniel Arteaga, Jordi Pons, Jeroen Breebaart, Giulio Cengarle

Abstract: Generating a stereophonic presentation from a monophonic audio signal is a challenging open task, especially if the goal is to obtain a realistic spatial imaging with a specific panning of sound elements. In this work, we propose to convert mono to stereo by means of predicting parametric stereo (PS) parameters using both nearest neighbor and deep network approaches. In combination with PS, we also propose to model the task with generative approaches, making it possible to synthesize multiple, equally plausible stereo renditions from the same mono signal. To achieve this, we consider both autoregressive and masked token modelling approaches. We provide evidence that the proposed PS-based models outperform a competitive classical decorrelation baseline and that, within a PS prediction framework, modern generative models outshine equivalent non-generative counterparts. Overall, our work positions both PS and generative modelling as strong and appealing methodologies for mono-to-stereo upmixing. A discussion of the limitations of these approaches is also provided.

Submitted 26 June, 2023; originally announced June 2023.

Comments: 7 pages, 1 figure; accepted for ISMIR23

arXiv:2306.09635 [pdf, other]

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Authors: Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, Julian McAuley

Abstract: Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model. At test time, we first explore performing a zero-shot modality transfer and condition the diffusion model with a CLIP-encoded text query. However, we observe a noticeable performance drop with respect to image queries. To close this gap, we further adopt a pretrained diffusion prior model to generate a CLIP image embedding given a CLIP text embedding. Our results show the effectiveness of the proposed method, and that the pretrained diffusion prior can reduce the modality transfer gap. While we focus on text-to-audio synthesis, the proposed model can also generate audio from image queries, and it shows competitive performance against a state-of-the-art image-to-audio synthesis model in a subjective listening test. This study offers a new direction of approaching text-to-audio synthesis that leverages the naturally-occurring audio-visual correspondence in videos and the power of pretrained language-vision models.

Submitted 23 July, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: Accepted by WASPAA 2023. Demo: https://salu133445.github.io/clipsonic/

arXiv:2210.14661 [pdf, other]

Full-band General Audio Synthesis with Score-based Diffusion

Authors: Santiago Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, Joan Serrà

Abstract: Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of an autoregressive approach, they are typically composed of pre-trained latent encoders and/or several cascaded modules. In this work, we propose a diffusion-based generative model for general audio synthesis, named DAG, which deals with full-band signals end-to-end in the waveform domain. Results show the superiority of DAG over existing label-conditioned generators in terms of both quality and diversity. More specifically, when compared to the state of the art, the band-limited and full-band versions of DAG achieve relative improvements that go up to 40 and 65%, respectively. We believe DAG is flexible enough to accommodate different conditioning schemas while providing good quality synthesis.

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2210.12635 [pdf, other]

Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

Authors: Xiaoyu Liu, Xu Li, Joan Serrà

Abstract: Single channel target speaker separation (TSS) aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings. In this paper, we look into several important but overlooked aspects of the enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of the log-mel filterbank and self-supervised embeddings, and the embeddings' cross-dataset generalization capability. Our results show that the speaker identification embeddings could lose relevant information due to a sub-optimal metric, training objective, or common pre-processing. In contrast, both the filterbank and the self-supervised embeddings preserve the integrity of the speaker information, but the former consistently outperforms the latter in a cross-dataset evaluation. The competitive separation and generalization performance of the previously overlooked filterbank embedding is consistent across our study, which calls for future research on better upstream features.

Submitted 26 October, 2022; v1 submitted 23 October, 2022; originally announced October 2022.

Comments: Submitted version to ICASSP 2023

arXiv:2210.12108 [pdf, other]

Adversarial Permutation Invariant Training for Universal Sound Separation

Authors: Emilian Postolache, Jordi Pons, Santiago Pascual, Joan Serrà

Abstract: Universal sound separation consists of separating mixes with arbitrary sounds of different types, and permutation invariant training (PIT) is used to train source agnostic models that do so. In this work, we complement PIT with adversarial losses but find it challenging with the standard formulation used in speech source separation. We overcome this challenge with a novel I-replacement context-based adversarial loss, and by training with multiple discriminators. Our experiments show that by simply improving the loss (keeping the same model and dataset) we obtain a non-negligible improvement of 1.4 dB SI-SNRi in the reverberant FUSS dataset. We also find adversarial PIT to be effective at reducing spectral holes, ubiquitous in mask-based separation models, which highlights the potential relevance of adversarial losses for source separation.
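Permutation invariant training, which the abstract above builds on, can be sketched in a few lines. The following is a minimal illustration of plain PIT only: it scores every assignment of estimated sources to reference sources and keeps the best one. The mean squared error pair loss and the names `mse` and `pit_loss` are assumptions for illustration; the adversarial losses and discriminators described in the abstract are not represented here.

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equal-length signals (assumed pair loss)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(estimates, references, pair_loss=mse):
    """Return (loss, permutation) with the lowest mean pair loss.

    permutation[i] is the index of the reference matched to estimates[i].
    The search is exhaustive, so it is only practical for a few sources.
    """
    best = None
    for perm in permutations(range(len(references))):
        loss = sum(
            pair_loss(estimates[i], references[p]) for i, p in enumerate(perm)
        ) / len(perm)
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best

# Toy usage: the model outputs the two sources in swapped order.
references = [[1.0, 1.0], [0.0, 0.0]]
estimates = [[0.0, 0.1], [1.0, 0.9]]
loss, perm = pit_loss(estimates, references)
print(perm)  # assignment chosen by the exhaustive search
```

Because the loss is taken over the best assignment, the model is not penalised for emitting the sources in an arbitrary order, which is what makes the training source agnostic.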

Submitted 6 March, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

Comments: Demo page: http://jordipons.me/apps/adversarialPIT/, Accepted at ICASSP-23

arXiv:2206.03065 [pdf, other]

Universal Speech Enhancement with Score-based Diffusion

Authors: Joan Serrà, Santiago Pascual, Jordi Pons, R. Oguz Araz, Davide Scaini

Abstract: Removing background noise from speech audio has been the subject of considerable effort, especially in recent years due to the rise of virtual communication and amateur recordings. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.

Submitted 16 September, 2022; v1 submitted 7 June, 2022; originally announced June 2022.

Comments: 24 pages, 6 figures; includes appendix; examples in https://serrjoa.github.io/projects/universe/

arXiv:2202.07968 [pdf, other]

On loss functions and evaluation metrics for music source separation

Authors: Enric Gusó, Jordi Pons, Santiago Pascual, Joan Serrà

Abstract: We investigate which loss functions provide better separations via benchmarking an extensive set of those for music source separation. To that end, we first survey the most representative audio source separation losses we identified, to later consistently benchmark them in a controlled experimental setup. We also explore using such losses as evaluation metrics, via cross-correlating them with the results of a subjective test. Based on the observation that the standard signal-to-distortion ratio metric can be misleading in some scenarios, we study alternative evaluation metrics based on the considered losses.

Submitted 16 February, 2022; originally announced February 2022.

Comments: Accepted to ICASSP 2022

arXiv:2111.11773 [pdf, other]

Upsampling layers for music source separation

Authors: Jordi Pons, Joan Serrà, Santiago Pascual, Giulio Cengarle, Daniel Arteaga, Davide Scaini

Abstract: Upsampling artifacts are caused by problematic upsampling layers and by spectral replicas that emerge while upsampling. Also, depending on the upsampling layer used, such artifacts can either be tonal artifacts (additive high-frequency noise) or filtering artifacts (subtractive, attenuating some bands). In this work we investigate the practical implications of having upsampling artifacts in the resulting audio, by studying how different artifacts interact and assessing their impact on the models' performance. To that end, we benchmark a large set of upsampling layers for music source separation: different transposed and subpixel convolution setups, different interpolation upsamplers (including two novel layers based on stretch and sinc interpolation), and different wavelet-based upsamplers (including a novel learnable wavelet layer). Our results show that filtering artifacts, associated with interpolation upsamplers, are perceptually preferable, even if they tend to achieve worse objective scores.

Submitted 23 November, 2021; originally announced November 2021.

Comments: Demo page: http://www.jordipons.me/apps/upsamplers/

arXiv:2109.15188 [pdf, other]

Assessing Algorithmic Biases for Musical Version Identification

Authors: Furkan Yesiler, Marius Miron, Joan Serrà, Emilia Gómez

Abstract: Version identification (VI) systems now offer accurate and scalable solutions for detecting different renditions of a musical composition, allowing the use of these systems in industrial applications and throughout the wider music ecosystem. Such use can have an important impact on various stakeholders regarding recognition and financial benefits, including how royalties are circulated for digital rights management. In this work, we take a step toward acknowledging this impact and consider VI systems as socio-technical systems rather than isolated technologies. We propose a framework for quantifying performance disparities across 5 systems and 6 relevant side attributes: gender, popularity, country, language, year, and prevalence. We also consider 3 main stakeholders for this particular information retrieval use case: the performing artists of query tracks, those of reference (original) tracks, and the composers. By categorizing the recordings in our dataset using such attributes and stakeholders, we analyze whether the considered VI systems show any implicit biases. We find signs of disparities in identification performance for most of the groups we include in our analyses. Moreover, we also find that learning- and rule-based systems behave differently for some attributes, which suggests an additional dimension to consider along with accuracy and scalability when evaluating VI systems. Lastly, we share our dataset with attribute annotations to encourage VI researchers to take these aspects into account while building new systems.

Submitted 30 September, 2021; originally announced September 2021.

arXiv:2109.02472 [pdf, other]

doi 10.1109/MSP.2021.3105941

Audio-based Musical Version Identification: Elements and Challenges

Authors: Furkan Yesiler, Guillaume Doras, Rachel M. Bittner, Christopher J. Tralie, Joan Serrà

Abstract: In this article, we aim to provide a review of the key ideas and approaches proposed in 20 years of scientific literature around musical version identification (VI) research and connect them to current practice. For more than a decade, VI systems suffered from the accuracy-scalability trade-off, with attempts to increase accuracy that typically resulted in cumbersome, non-scalable systems. Recent years, however, have witnessed the rise of deep learning-based approaches that take a step toward bridging the accuracy-scalability gap, yielding systems that can realistically be deployed in industrial applications. Although this trend positively influences the number of researchers and institutions working on VI, it may also result in obscuring the literature before the deep learning era. To appreciate two decades of novel ideas in VI research and to facilitate building better systems, we now review some of the successful concepts and applications proposed in the literature and study their evolution throughout the years.

Submitted 6 September, 2021; originally announced September 2021.

Comments: Accepted to be published in IEEE Signal Processing Magazine

arXiv:2107.03100 [pdf, other]

Adversarial Auto-Encoding for Packet Loss Concealment

Authors: Santiago Pascual, Joan Serrà, Jordi Pons

Abstract: Communication technologies like voice over IP operate under constrained real-time conditions, with voice packets being subject to delays and losses from the network. In such cases, the packet loss concealment (PLC) algorithm reconstructs missing frames until a new real packet is received. Recently, autoregressive deep neural networks have been shown to surpass the quality of signal processing methods for PLC, especially for long-term predictions beyond 60 ms. In this work, we propose a non-autoregressive adversarial auto-encoder, named PLAAE, to perform real-time PLC in the waveform domain. PLAAE has a causal convolutional structure, and it learns in an auto-encoder fashion to reconstruct signals with gaps, with the help of an adversarial loss. During inference, it is able to predict smooth and coherent continuations of such gaps in a single feed-forward step, as opposed to autoregressive models. Our evaluation highlights the superiority of PLAAE over two classic PLCs and two deep autoregressive models in terms of spectral and intonation reconstruction, perceptual quality, and intelligibility.

Submitted 8 July, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

arXiv:2104.04143 [pdf, other]

doi 10.1140/epjds/s13688-021-00293-8

Heaps' Law and Vocabulary Richness in the History of Classical Music Harmony

Authors: Marc Serra-Peralta, Joan Serrà, Álvaro Corral

Abstract: Music is a fundamental human construct, and harmony provides the building blocks of musical language. Using the Kunstderfuge corpus of classical music, we analyze the historical evolution of the richness of harmonic vocabulary of 76 classical composers, covering almost 6 centuries. This corpus comprises about 9500 pieces, resulting in more than 5 million tokens of music codewords. The fulfilment of Heaps' law for the relation between the size of the harmonic vocabulary of a composer (in codeword types) and the total length of his works (in codeword tokens), with an exponent around 0.35, allows us to define a relative measure of vocabulary richness that has a transparent interpretation. When coupled with the considered corpus, this measure allows us to quantify harmony richness across centuries, unveiling a clear increasing linear trend. In this way, we are able to rank the composers in terms of richness of vocabulary, in the same way as for other related metrics, such as entropy. We find that the latter is particularly highly correlated with our measure of richness. Our approach is not specific to music and can be applied to other systems built by tokens of different types, as for instance natural language.
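Heaps' law states that vocabulary size V grows as a power of text length N, that is V ≈ K · N^β, which becomes a straight line in log-log space. The sketch below shows one common way such an exponent can be estimated, by ordinary least squares on the logarithms. It is an illustration under assumptions, not the authors' fitting procedure: the function name `fit_heaps` is hypothetical, and the data are synthetic values generated with β = 0.35 purely to check the fit.

```python
import math

def fit_heaps(token_counts, vocab_sizes):
    """Least-squares fit of log V = log K + beta * log N; returns (K, beta)."""
    xs = [math.log(n) for n in token_counts]
    ys = [math.log(v) for v in vocab_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the regression line in log-log space is the Heaps exponent.
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    k = math.exp(mean_y - beta * mean_x)
    return k, beta

# Synthetic (not real) data: token counts and vocabulary sizes that follow
# V = 3 * N**0.35 exactly, so the fit should recover those parameters.
token_counts = [100, 1_000, 10_000, 100_000]
vocab_sizes = [3 * n ** 0.35 for n in token_counts]
k, beta = fit_heaps(token_counts, vocab_sizes)
print(round(k, 3), round(beta, 3))
```

With real data, each point would correspond to one composer (total tokens versus distinct codeword types), and deviations from the fitted line are what a relative richness measure would capture.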

Submitted 9 April, 2021; originally announced April 2021.

Comments: 12 pages

arXiv:2104.03725 [pdf, other]

On tuning consistent annealed sampling for denoising score matching

Authors: Joan Serrà, Santiago Pascual, Jordi Pons

Abstract: Score-based generative models provide state-of-the-art quality for image and audio synthesis. Sampling from these models is performed iteratively, typically employing a discretized series of noise levels and a predefined scheme. In this note, we first overview three common sampling schemes for models trained with denoising score matching. Next, we focus on one of them, consistent annealed sampling, and study its hyper-parameter boundaries. We then highlight a possible formulation of such hyper-parameter that explicitly considers those boundaries and facilitates tuning when using few or a variable number of steps. Finally, we highlight some connections of the formulation with other sampling schemes.

Submitted 8 April, 2021; originally announced April 2021.

Comments: 3 pages and 1 figure

arXiv:2101.02098 [pdf, other]

Investigating the efficacy of music version retrieval systems for setlist identification

Authors: Furkan Yesiler, Emilio Molina, Joan Serrà, Emilia Gómez

Abstract: The setlist identification (SLI) task addresses a music recognition use case where the goal is to retrieve the metadata and timestamps for all the tracks played in live music events. Due to various musical and non-musical changes in live performances, developing automatic SLI systems is still a challenging task that, despite its industrial relevance, has been under-explored in the academic literature. In this paper, we propose an end-to-end workflow that identifies relevant metadata and timestamps of live music performances using a version identification system. We compare 3 such systems to investigate their suitability for this particular task. For developing and evaluating SLI systems, we also contribute a new dataset that contains 99.5h of concerts with annotated metadata and timestamps, along with the corresponding reference set. The dataset is categorized by audio qualities and genres to analyze the performance of SLI systems in different use cases. Our approach can identify 68% of the annotated segments, with values ranging from 35% to 77% based on the genre. Finally, we evaluate our approach against a database of 56.8k songs to illustrate the effect of expanding the reference set, where we can still identify 56% of the annotated segments.

Submitted 6 January, 2021; originally announced January 2021.

arXiv:2010.14356 [pdf, ps, other]

Upsampling artifacts in neural audio synthesis

Authors: Jordi Pons, Santiago Pascual, Giulio Cengarle, Joan Serrà

Abstract: A number of recent advances in neural audio synthesis rely on upsampling layers, which can introduce undesired artifacts. In computer vision, upsampling artifacts have been studied and are known as checkerboard artifacts (due to their characteristic visual pattern). However, their effect has been overlooked so far in audio processing. Here, we address this gap by studying this problem from the audio signal processing perspective. We first show that the main sources of upsampling artifacts are: (i) the tonal and filtering artifacts introduced by problematic upsampling operators, and (ii) the spectral replicas that emerge while upsampling. We then compare different upsampling layers, showing that nearest neighbor upsamplers can be an alternative to the problematic (but state-of-the-art) transposed and subpixel convolutions which are prone to introduce tonal artifacts.

Submitted 9 February, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

Comments: In proceedings of ICASSP2021. Code: https://github.com/DolbyLaboratories/neural-upsampling-artifacts-audio

arXiv:2010.10291 [pdf, other]

Automatic multitrack mixing with a differentiable mixing console of neural audio effects

Authors: Christian J. Steinmetz, Jordi Pons, Santiago Pascual, Joan Serrà

Abstract: Applications of deep learning to automatic multitrack mixing are largely unexplored. This is partly due to the limited available data, coupled with the fact that such data is relatively unstructured and variable. To address these challenges, we propose a domain-inspired model with a strong inductive bias for the mixing task. We achieve this with the application of pre-trained sub-networks and weight sharing, as well as with a sum/difference stereo loss function. The proposed model can be trained with a limited number of examples, is permutation invariant with respect to the input ordering, and places no limit on the number of input sources. Furthermore, it produces human-readable mixing parameters, allowing users to manually adjust or refine the generated mix. Results from a perceptual evaluation involving audio engineers indicate that our approach generates mixes that outperform baseline approaches. To the best of our knowledge, this work demonstrates the first approach in learning multitrack mixing conventions from real-world data at the waveform level, without knowledge of the underlying mixing parameters.

Submitted 20 October, 2020; originally announced October 2020.

arXiv:2010.03284 [pdf, other]

Less is more: Faster and better music version identification with embedding distillation

Authors: Furkan Yesiler, Joan Serrà, Emilia Gómez

Abstract: Version identification systems aim to detect different renditions of the same underlying musical composition (loosely called cover songs). By learning to encode entire recordings into plain vector embeddings, recent systems have made significant progress in bridging the gap between accuracy and scalability, which has been a key challenge for nearly two decades. In this work, we propose to further narrow this gap by employing a set of data distillation techniques that reduce the embedding dimensionality of a pre-trained state-of-the-art model. We compare a wide range of techniques and propose new ones, from classical dimensionality reduction to more sophisticated distillation schemes. With those, we obtain 99% smaller embeddings that, moreover, yield up to a 3% accuracy increase. Such small embeddings can have an important impact on retrieval time, up to the point of making a real-world system practical on a standalone laptop.

Submitted 7 October, 2020; originally announced October 2020.

Comments: Accepted to the 21st International Society for Music Information Retrieval Conference (ISMIR 2020)

arXiv:2010.00368 [pdf, other]

SESQA: semi-supervised learning for speech quality assessment

Authors: Joan Serrà, Jordi Pons, Santiago Pascual

Abstract: Automatic speech quality assessment is an important, transversal task whose progress is hampered by the scarcity of human annotations, poor generalization to unseen recording conditions, and a lack of flexibility of existing approaches. In this work, we tackle these problems with a semi-supervised learning approach, combining available annotations with programmatically generated data, and using 3 different optimization criteria together with 5 complementary auxiliary tasks. Our results show that such a semi-supervised approach can cut the error of existing methods by more than 36%, while providing additional benefits in terms of reusable features or auxiliary outputs. Improvement is further corroborated with an out-of-sample test showing promising generalization capabilities.

Submitted 8 February, 2021; v1 submitted 1 October, 2020; originally announced October 2020.

Comments: Long version (with appendix) of the paper with the same title accepted for ICASSP2021

arXiv:1910.12551 [pdf, other]

Accurate and Scalable Version Identification Using Musically-Motivated Embeddings

Authors: Furkan Yesiler, Joan Serrà, Emilia Gómez

Abstract: The version identification (VI) task deals with the automatic detection of recordings that correspond to the same underlying musical piece. Despite many efforts, VI is still an open problem, with much room for improvement, especially with regard to combining accuracy and scalability. In this paper, we present MOVE, a musically-motivated method for accurate and scalable version identification. MOVE achieves state-of-the-art performance on two publicly-available benchmark sets by learning scalable embeddings in a Euclidean distance space, using a triplet loss and a hard triplet mining strategy. It improves over previous work by employing an alternative input representation, and introducing a novel technique for temporal content summarization, a standardized latent space, and a data augmentation strategy specifically designed for VI. In addition to the main results, we perform an ablation study to highlight the importance of our design choices, and study the relation between embedding dimensionality and model performance.

Submitted 13 April, 2020; v1 submitted 28 October, 2019; originally announced October 2019.
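
The abstract above mentions a triplet loss with hard triplet mining in a Euclidean space. As a rough illustration only, the sketch below implements a common "batch-hard" variant: for each anchor it picks the farthest same-label embedding and the closest different-label embedding in the batch. Whether MOVE uses exactly this mining rule is not stated in the abstract, so the rule, the margin value, and all names here are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Mean triplet loss over anchors, using the hardest positive
    (farthest same-label item) and hardest negative (closest
    different-label item) found in the batch."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if j != i and labels[j] == label]
        neg = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if labels[j] != label]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical 2-D embeddings; labels stand for version groups (cliques).
embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]

# Versions of the same piece are already close together: no loss.
well_separated = batch_hard_triplet_loss(embeddings, [0, 0, 1, 1])
# Versions of the same piece are far apart: every anchor is penalized.
badly_mixed = batch_hard_triplet_loss(embeddings, [0, 1, 0, 1])
print(well_separated, badly_mixed)
```

Because the learned embeddings are compared with a plain Euclidean distance, retrieval can rely on standard nearest neighbor search, which is where the scalability claimed in the abstract would come from.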

arXiv:1909.11480 [pdf, other]

Input complexity and out-of-distribution detection with likelihood-based generative models

Authors: Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F. Núñez, Jordi Luque

Abstract: Likelihood-based generative models are a promising resource to detect out-of-distribution (OOD) inputs which could compromise the robustness or reliability of a machine learning system. However, likelihoods derived from such models have been shown to be problematic for detecting certain types of inputs that significantly differ from training data. In this paper, we posit that this problem is due to the excessive influence that input complexity has in generative models' likelihoods. We report a set of experiments supporting this hypothesis, and use an estimate of input complexity to derive an efficient and parameter-free OOD score, which can be seen as a likelihood-ratio, akin to Bayesian model comparison. We find this score to perform comparably to, or even better than, existing OOD detection approaches under a wide range of data sets, models, model sizes, and complexity estimates.

Submitted 17 January, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: Accepted for ICLR2020
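
One way to read the score described in the abstract above is as the model's negative log-likelihood minus a compression-based estimate of input complexity, both in bits per dimension. The sketch below follows that reading under stated assumptions: zlib is only a stand-in compressor, the negative log-likelihood value is a made-up placeholder rather than the output of a real generative model, and the function names are hypothetical.

```python
import os
import zlib


def complexity_bits_per_dim(data: bytes) -> float:
    """Complexity estimate: compressed size in bits divided by the
    number of input dimensions (here, one dimension per byte)."""
    return 8.0 * len(zlib.compress(data, 9)) / len(data)


def ood_score(nll_bits_per_dim: float, data: bytes) -> float:
    """Assumed score: model negative log-likelihood minus the
    complexity estimate, both in bits per dimension."""
    return nll_bits_per_dim - complexity_bits_per_dim(data)


# Two hypothetical inputs of equal size: one trivially simple, one noisy.
simple_input = bytes(4096)        # all zeros, compresses very well
noisy_input = os.urandom(4096)    # random bytes, barely compressible

# Placeholder likelihood value; a real model would supply one per input.
placeholder_nll = 3.0

print(complexity_bits_per_dim(simple_input))
print(complexity_bits_per_dim(noisy_input))
print(ood_score(placeholder_nll, simple_input))
```

Subtracting the complexity term is meant to offset the tendency of simple inputs to receive high likelihoods regardless of whether they resemble the training data.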

arXiv:1906.00794 [pdf, other]

Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion

Authors: Joan Serrà, Santiago Pascual, Carlos Segura

Abstract: End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations. Voice conversion, in which a model has to impersonate a speaker in a recording, is one of those situations. In this paper, we propose Blow, a single-scale normalizing flow using hypernetwork conditioning to perform many-to-many voice conversion between raw audio. Blow is trained end-to-end, with non-parallel data, on a frame-by-frame basis using a single speaker identifier. We show that Blow compares favorably to existing flow-based architectures and other competitive baselines, obtaining equal or better performance in both objective and subjective evaluations. We further assess the impact of its main components with an ablation study, and quantify a number of properties such as the necessary amount of training data or the preference for source or target speakers.

Submitted 5 September, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

Comments: Includes appendix. Accepted for NeurIPS2019

arXiv:1904.03418 [pdf, other]

Towards Generalized Speech Enhancement with Generative Adversarial Networks

Authors: Santiago Pascual, Joan Serrà, Antonio Bonafonte

Abstract: The speech enhancement task usually consists of removing additive noise or reverberation that partially mask spoken utterances, affecting their intelligibility. However, little attention is drawn to other, perhaps more aggressive signal distortions like clipping, chunk elimination, or frequency-band removal. Such distortions can have a large impact not only on intelligibility, but also on naturalness or even speaker identity, and require careful signal reconstruction. In this work, we give full consideration to this generalized speech enhancement task, and show it can be tackled with a time-domain generative adversarial network (GAN). In particular, we extend a previous GAN-based speech enhancement system to deal with mixtures of four types of aggressive distortions. Firstly, we propose the addition of an adversarial acoustic regression loss that promotes a richer feature extraction at the discriminator. Secondly, we also make use of a two-step adversarial training schedule, acting as a warm up-and-fine-tune sequence. Both objective and subjective evaluations show that these two additions bring improved speech reconstructions that better match the original speaker identity and naturalness.

Submitted 6 April, 2019; originally announced April 2019.

arXiv:1904.03416 [pdf, other]

Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

Authors: Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, Yoshua Bengio

Abstract: Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The needed consensus across different tasks naturally imposes meaningful constraints on the encoder, helping to discover general representations and to minimize the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct usage or adaptation to different problems.

Submitted 6 April, 2019; originally announced April 2019.

arXiv:1811.12507 [pdf, other]

Regression and Classification by Zonal Kriging

Authors: Jean Serra, Jesus Angulo, B Ravi Kiran

Abstract: Consider a family $Z=\{(\boldsymbol{x_{i}},y_{i}),\ 1\leq i\leq N\}$ of $N$ pairs of vectors $\boldsymbol{x_{i}} \in \mathbb{R}^d$ and scalars $y_{i}$ that we aim to predict for a new sample vector $\boldsymbol{x_{0}}$. Kriging models $y$ as a sum of a deterministic function $m$, a drift which depends on the point $\boldsymbol{x}$, and a random function $z$ with zero mean. The zonality hypothesis interprets $y$ as a weighted sum of $d$ random functions of a single independent variable, each of which is a kriging, with a quadratic form for the variograms drift. We can therefore construct an unbiased estimator $y^{*}(\boldsymbol{x_{0}})=\sum_{i}\lambda^{i}z(\boldsymbol{x_{i}})$ of $y(\boldsymbol{x_{0}})$ with minimal variance $E[y^{*}(\boldsymbol{x_{0}})-y(\boldsymbol{x_{0}})]^{2}$, with the help of the known training set points. We give the explicit closed form for $\lambda^{i}$ without having to compute the inverse of the matrices.

Submitted 11 December, 2018; v1 submitted 29 November, 2018; originally announced November 2018.

Comments: Technical Report

arXiv:1810.10274 [pdf, other]

Training neural audio classifiers with few data

Authors: Jordi Pons, Joan Serrà, Xavier Serra

Abstract: We investigate supervised learning strategies that improve the training of neural network audio classifiers on small annotated collections. In particular, we study whether (i) a naive regularization of the solution space, (ii) prototypical networks, (iii) transfer learning, or (iv) their combination, can foster deep learning models to better leverage a small number of training examples. To this end, we evaluate (i-iv) for the tasks of acoustic event recognition and acoustic scene classification, considering from 1 to 100 labeled examples per class. Results indicate that transfer learning is a powerful strategy in such scenarios, but prototypical networks show promising results when one cannot rely on external or validation data.

Submitted 3 November, 2018; v1 submitted 24 October, 2018; originally announced October 2018.

Comments: Code: https://github.com/jordipons/neural-classifiers-with-few-audio/

arXiv:1808.10687 [pdf, other]

Whispered-to-voiced Alaryngeal Speech Conversion with Generative Adversarial Networks

Authors: Santiago Pascual, Antonio Bonafonte, Joan Serrà, Jose A. Gonzalez

Abstract: Most methods of voice restoration for patients suffering from aphonia either produce whispered or monotone speech. Apart from intelligibility, this type of speech lacks expressiveness and naturalness due to the absence of pitch (whispered speech) or artificial generation of it (monotone speech). Existing techniques to restore prosodic information typically combine a vocoder, which parameterises the speech signal, with machine learning techniques that predict prosodic information. In contrast, this paper describes an end-to-end neural approach for estimating a fully-voiced speech waveform from whispered alaryngeal speech. By adapting our previous work in speech enhancement with generative adversarial networks, we develop a speaker-dependent model to perform whispered-to-voiced speech conversion. Preliminary qualitative results show effectiveness in re-generating voiced speech, with the creation of realistic pitch contours.

Submitted 5 November, 2018; v1 submitted 31 August, 2018; originally announced August 2018.

arXiv:1808.10678 [pdf, other]

Self-Attention Linguistic-Acoustic Decoder

Authors: Santiago Pascual, Antonio Bonafonte, Joan Serrà

Abstract: The conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models like recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure tends to make them slow to train and to sample from. In this work, we try to overcome the limitations of recursive structure by using a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder network is competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU inference time. On average, it increases Mel cepstral distortion between 0.1 and 0.3 dB, but it is over an order of magnitude faster on average. Fast inference is important for the deployment of speech synthesis systems on devices with restricted resources, like mobile phones or embedded systems, where speaking virtual assistants are gaining importance.

Submitted 5 November, 2018; v1 submitted 31 August, 2018; originally announced August 2018.

arXiv:1806.03192 [pdf]

Assessing the impact of machine intelligence on human behaviour: an interdisciplinary endeavour

Authors: Emilia Gómez, Carlos Castillo, Vicky Charisi, Verónica Dahl, Gustavo Deco, Blagoj Delipetrev, Nicole Dewandre, Miguel Ángel González-Ballester, Fabien Gouyon, José Hernández-Orallo, Perfecto Herrera, Anders Jonsson, Ansgar Koene, Martha Larson, Ramón López de Mántaras, Bertin Martens, Marius Miron, Rubén Moreno-Bote, Nuria Oliver, Antonio Puertas Gallardo, Heike Schweitzer, Nuria Sebastian, Xavier Serra, Joan Serrà, Songül Tolan, et al. (1 additional author not shown)

Abstract: This document contains the outcome of the first Human behaviour and machine intelligence (HUMAINT) workshop that took place 5-6 March 2018 in Barcelona, Spain. The workshop was organized in the context of a new research programme at the Centre for Advanced Studies, Joint Research Centre of the European Commission, which focuses on studying the potential impact of artificial intelligence on human behaviour. The workshop gathered an interdisciplinary group of experts to establish the state of the art research in the field and a list of future research challenges to be addressed on the topic of human and machine intelligence, algorithms' potential impact on human cognitive capabilities and decision making, and evaluation and regulation needs. The document is made of short position statements and identification of challenges provided by each expert, and incorporates the result of the discussions carried out during the workshop. In the conclusion section, we provide a list of emerging research topics and strategies to be addressed in the near future.

Submitted 7 June, 2018; originally announced June 2018.

Comments: Proceedings of 1st HUMAINT (Human Behaviour and Machine Intelligence) workshop, Barcelona, Spain, March 5-6, 2018, edited by European Commission, Seville, 2018, JRC111773 https://ec.europa.eu/jrc/communities/community/humaint/document/assessing-impact-machine-intelligence-human-behaviour-interdisciplinary. arXiv admin note: text overlap with arXiv:1409.3097 by other authors

Report number: JRC111773

arXiv:1806.02701 [pdf, other]

There goes Wally: Anonymously sharing your location gives you away

Authors: Apostolos Pyrgelis, Nicolas Kourtellis, Ilias Leontiadis, Joan Serrà, Claudio Soriente

Abstract: With current technology, a number of entities have access to user mobility traces at different levels of spatio-temporal granularity. At the same time, users frequently reveal their location through different means, including geo-tagged social media posts and mobile app usage. Such leaks are often bound to a pseudonym or a fake identity in an attempt to preserve one's privacy. In this work, we investigate how large-scale mobility traces can de-anonymize anonymous location leaks. By mining the country-wide mobility traces of tens of millions of users, we aim to understand how many location leaks are required to uniquely match a trace, how spatio-temporal obfuscation decreases the matching quality, and how the location popularity and time of the leak influence de-anonymization. We also study the mobility characteristics of those individuals whose anonymous leaks are more prone to identification. Finally, by extending our matching methodology to full traces, we show how large-scale human mobility is highly unique. Our quantitative results have implications for the privacy of users' traces, and may serve as a guideline for future policies regarding the management and publication of mobility data.

Submitted 15 November, 2018; v1 submitted 7 June, 2018; originally announced June 2018.

Comments: To appear in the 2018 IEEE International Conference on Big Data

arXiv:1805.03908 [pdf, other]

Towards a universal neural network encoder for time series

Authors: Joan Serrà, Santiago Pascual, Alexandros Karatzoglou

Abstract: We study the use of a time series encoder to learn representations that are useful on data set types on which it has not been trained. The encoder is formed of a convolutional neural network whose temporal output is summarized by a convolutional attention mechanism. This way, we obtain a compact, fixed-length representation from longer, variable-length time series. We evaluate the performance of the proposed approach on a well-known time series classification benchmark, considering full adaptation, partial adaptation, and no adaptation of the encoder to the new data type. Results show that such strategies are competitive with the state-of-the-art, often outperforming conceptually-matching approaches. Besides accuracy scores, the facility of adaptation and the efficiency of pre-trained encoders make them an appealing option for the processing of scarcely- or non-labeled time series.

Submitted 10 May, 2018; originally announced May 2018.

Comments: 10 pages, 2 figures

arXiv:1803.07310 [pdf, other]

doi 10.1109/MVT.2015.2508320

The CTTC 5G end-to-end experimental platform: Integrating heterogeneous wireless/optical networks, distributed cloud, and IoT devices

Authors: Raul Muñóz, Josep Mangues, Ricard Vilalta, Christos Verikoukis, Jesús Alonso-Zarate, Nikolaos Bartzoudis, Apostolos Georgiadis, Miquel Payaró, Ana Pérez-Neira, Ramon Casellas, Ricardo Martínez, José Núñez-Martínez, Manuel Requena-Esteso, David Pubill, Oriol Font-Bach, Pol Henarejos, Jordi Serra, Francisco Vazquez-Gallego

Abstract: The Internet of Things (IoT) will facilitate a wide variety of applications in different domains, such as smart cities, smart grids, industrial automation (Industry 4.0), smart driving, assistance of the elderly, and home automation. Billions of heterogeneous smart devices with different application requirements will be connected to the networks and will generate huge aggregated volumes of data that will be processed in distributed cloud infrastructures. On the other hand, there is also a general trend to deploy functions as software (SW) instances in cloud infrastructures [e.g., network function virtualization (NFV) or mobile edge computing (MEC)]. Thus, the next generation of mobile networks, the fifth-generation (5G), will need not only to develop new radio interfaces or waveforms to cope with the expected traffic growth but also to integrate heterogeneous networks from end to end (E2E) with distributed cloud resources to deliver E2E IoT and mobile services. This article presents the E2E 5G platform that is being developed by the Centre Tecnològic de Telecomunicacions de Catalunya (CTTC), the first known platform capable of reproducing such an ambitious scenario.

Submitted 20 March, 2018; originally announced March 2018.

arXiv:1801.01423 [pdf, other]

Overcoming catastrophic forgetting with hard attention to the task

Authors: Joan Serrà, Dídac Surís, Marius Miron, Alexandros Karatzoglou

Abstract: Catastrophic forgetting occurs when a neural network loses the information learned in a previous task after training on subsequent tasks. This problem remains a hurdle for artificial intelligence systems with sequential learning capabilities. In this paper, we propose a task-based hard attention mechanism that preserves previous tasks' information without affecting the current task's learning. A hard attention mask is learned concurrently with every task, through stochastic gradient descent, and previous masks are exploited to condition such learning. We show that the proposed mechanism is effective for reducing catastrophic forgetting, cutting current rates by 45 to 80%. We also show that it is robust to different hyperparameter choices, and that it offers a number of monitoring capabilities. The approach features the possibility to control both the stability and compactness of the learned knowledge, which we believe makes it also attractive for online learning or network compression applications.

Submitted 29 May, 2018; v1 submitted 4 January, 2018; originally announced January 2018.

Comments: Includes appendix. Accepted for ICML 2018
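
The abstract above says that a mask is learned per task and that previous masks condition later learning. The sketch below is a heavily simplified, per-unit illustration of that idea, not the paper's implementation: the scaled-sigmoid gate, the element-wise maximum over earlier masks, the gradient scaling rule, and all names and numbers are assumptions made for the example.

```python
import math


def gate(task_embedding, scale):
    """Near-binary attention mask: a sigmoid of the scaled task embedding.
    A larger `scale` pushes the values closer to 0 or 1."""
    return [1.0 / (1.0 + math.exp(-scale * e)) for e in task_embedding]


def cumulative_mask(previous_masks):
    """Element-wise maximum over the masks of all earlier tasks."""
    return [max(values) for values in zip(*previous_masks)]


def condition_gradient(grad, cum_mask):
    """Shrink the update of units that earlier tasks attend to, so that
    what those tasks learned is left (almost) untouched."""
    return [g * (1.0 - m) for g, m in zip(grad, cum_mask)]


# Hypothetical task embeddings for two earlier tasks over three units.
mask_task1 = gate([4.0, -4.0, -4.0], scale=5.0)   # task 1 uses unit 0
mask_task2 = gate([-4.0, 4.0, -4.0], scale=5.0)   # task 2 uses unit 1

cum = cumulative_mask([mask_task1, mask_task2])
new_grad = condition_gradient([1.0, 1.0, 1.0], cum)

# Units 0 and 1 are protected; unit 2 remains free for the new task.
print([round(g, 3) for g in new_grad])
```

Counting how many units in the cumulative mask are close to 1 gives a simple measure of how much capacity has been used, which is the kind of monitoring the abstract alludes to.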

arXiv:1712.07120 [pdf, other]

Continual Prediction of Notification Attendance with Classical and Deep Network Approaches

Authors: Kleomenis Katevas, Ilias Leontiadis, Martin Pielot, Joan Serrà

Abstract: We investigate to what extent mobile use patterns can predict -- at the moment it is posted -- whether a notification will be clicked within the next 10 minutes. We use a data set containing the detailed mobile phone usage logs of 279 users, who over the course of 5 weeks received 446,268 notifications from a variety of apps. Besides using classical gradient-boosted trees, we demonstrate how to make continual predictions using a recurrent neural network (RNN). The two approaches achieve a similar AUC of ca. 0.7 on unseen users, with a possible operation point of 50% sensitivity and 80% specificity considering all notification types (an increase of 40% with respect to a probabilistic baseline). These results enable automatic, intelligent handling of mobile phone notifications without the need for user feedback or personalization. Furthermore, they showcase how to forgo feature extraction by using RNNs for continual predictions directly on mobile usage logs. To the best of our knowledge, this is the first work that leverages mobile sensor data for continual, context-aware predictions of interruptibility using deep neural networks.

Submitted 19 December, 2017; originally announced December 2017.

Comments: 15 pages

arXiv:1712.06340 [pdf, other]

Language and Noise Transfer in Speech Enhancement Generative Adversarial Network

Authors: Santiago Pascual, Maruchan Park, Joan Serrà, Antonio Bonafonte, Kang-Hun Ahn

Abstract: Speech enhancement deep learning systems usually require large amounts of training data to operate in broad conditions or real applications. This makes the adaptability of those systems to new, low resource environments an important topic. In this work, we present the results of adapting a speech enhancement generative adversarial network by finetuning the generator with small amounts of data. We investigate the minimum requirements to obtain a stable behavior in terms of several objective metrics in two very different languages: Catalan and Korean. We also study the variability of test performance to unseen noise as a function of the amount of different types of noise available for training. Results show that adapting a pre-trained English model with 10 min of data already achieves a comparable performance to having two orders of magnitude more data. They also demonstrate the relative stability in test performance with respect to the number of training noise types.

Submitted 18 December, 2017; originally announced December 2017.

arXiv:1709.10299 [pdf, other]

MobInsight: A Framework Using Semantic Neighborhood Features for Localized Interpretations of Urban Mobility

Authors: Souneil Park, Joan Serra, Enrique Frias Martinez, Nuria Oliver

Abstract: Collective urban mobility embodies the residents' local insights on the city. Mobility practices of the residents are produced from their spatial choices, which involve various considerations such as the atmosphere of destinations, distance, past experiences, and preferences. The advances in mobile computing and the rise of geo-social platforms have provided the means for capturing the mobility practices; however, interpreting the residents' insights is challenging due to the scale and complexity of an urban environment, and its unique context. In this paper, we present MobInsight, a framework for making localized interpretations of urban mobility that reflect various aspects of the urbanism. MobInsight extracts a rich set of neighborhood features through holistic semantic aggregation, and models the mobility between all-pairs of neighborhoods. We evaluate MobInsight with the mobility data of Barcelona and demonstrate diverse localized and semantically-rich interpretations.

Submitted 29 September, 2017; originally announced September 2017.

arXiv:1706.03993 [pdf, other]

Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks

Authors: Joan Serrà, Alexandros Karatzoglou

Abstract: Recommendation algorithms that incorporate techniques from deep learning are becoming increasingly popular. Due to the structure of the data coming from recommendation domains (i.e., one-hot-encoded vectors of item preferences), these algorithms tend to have large input and output dimensionalities that dominate their overall size. This makes them difficult to train, due to the limited memory of graphical processing units, and difficult to deploy on mobile devices with limited hardware. To address these difficulties, we propose Bloom embeddings, a compression technique that can be applied to the input and output of neural network models dealing with sparse high-dimensional binary-coded instances. Bloom embeddings are computationally efficient, and do not seriously compromise the accuracy of the model up to 1/5 compression ratios. In some cases, they even improve over the original accuracy, with relative increases up to 12%. We evaluate Bloom embeddings on 7 data sets and compare it against 4 alternative methods, obtaining favorable results. We also discuss a number of further advantages of Bloom embeddings, such as 'on-the-fly' constant-time operation, zero or marginal space requirements, training time speedups, or the fact that they do not require any change to the core model architecture or training configuration.

Submitted 13 June, 2017; originally announced June 2017.

Comments: Accepted for publication at ACM RecSys 2017; previous version submitted to ICLR 2016

arXiv:1705.06224 [pdf, other]

doi 10.1145/3089801.3089802

Practical Processing of Mobile Sensor Data for Continual Deep Learning Predictions

Authors: Kleomenis Katevas, Ilias Leontiadis, Martin Pielot, Joan Serrà

Abstract: We present a practical approach for processing mobile sensor time series data for continual deep learning predictions. The approach comprises data cleaning, normalization, capping, time-based compression, and finally classification with a recurrent neural network. We demonstrate the effectiveness of the approach in a case study with 279 participants. On the basis of sparse sensor events, the network continually predicts whether the participants would attend to a notification within 10 minutes. Compared to a random baseline, the classifier achieves a 40% performance increase (AUC of 0.702) on a withheld test set. This approach allows to forgo resource-intensive, domain-specific, error-prone feature engineering, which may drastically increase the applicability of machine learning to mobile phone sensor data.

Submitted 17 May, 2017; originally announced May 2017.

Comments: 6 pages, 3 figures, 3 tables

Journal ref: DeepMobile Workshop, MobileHCI 2017

arXiv:1704.05249 [pdf, other]

Hot or not? Forecasting cellular network hot spots using sector performance indicators

Authors: Joan Serrà, Ilias Leontiadis, Alexandros Karatzoglou, Konstantina Papagiannaki

Abstract: To manage and maintain large-scale cellular networks, operators need to know which sectors underperform at any given time. For this purpose, they use the so-called hot spot score, which is the result of a combination of multiple network measurements and reflects the instantaneous overall performance of individual sectors. While operators have a good understanding of the current performance of a network and its overall trend, forecasting the performance of each sector over time is a challenging task, as it is affected by both regular and non-regular events, triggered by human behavior and hardware failures. In this paper, we study the spatio-temporal patterns of the hot spot score and uncover its regularities. Based on our observations, we then explore the possibility to use recent measurements' history to predict future hot spots. To this end, we consider tree-based machine learning models, and study their performance as a function of time, amount of past data, and prediction horizon. Our results indicate that, compared to the best baseline, tree-based models can deliver up to 14% better forecasts for regular hot spots and 153% better forecasts for non-regular hot spots. The latter brings strong evidence that, for moderate horizons, forecasts can be made even for sectors exhibiting isolated, non-regular behavior. Overall, our work provides insight into the dynamics of cellular sectors and their predictability. It also paves the way for more proactive network operations with greater forecasting horizons.

Submitted 18 April, 2017; originally announced April 2017.

Comments: Accepted for publication at ICDE 2017 - Industrial Track

arXiv:1703.09452 [pdf, other]

SEGAN: Speech Enhancement Generative Adversarial Network

Authors: Santiago Pascual, Antonio Bonafonte, Joan Serrà

Abstract: Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm the effectiveness of it. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.

Submitted 9 June, 2017; v1 submitted 28 March, 2017; originally announced March 2017.

Comments: 5 pages, 4 figures, accepted in INTERSPEECH 2017

arXiv:1703.05430 [pdf, other]

Cost-complexity pruning of random forests

Authors: Kiran Bangalore Ravi, Jean Serra

Abstract: Random forests perform bootstrap-aggregation by sampling the training samples with replacement. This enables the evaluation of out-of-bag error, which serves as an internal cross-validation mechanism. Our motivation lies in using the unsampled training samples to improve each decision tree in the ensemble. We study the effect of using the out-of-bag samples to improve the generalization error, first of the decision trees and second of the random forest, by post-pruning. A preliminary empirical study on four UCI repository datasets shows a consistent decrease in the size of the forests without considerable loss in accuracy.

Submitted 19 July, 2017; v1 submitted 15 March, 2017; originally announced March 2017.

Comments: Previous version in proceedings of ISMM 2017

arXiv:1511.04986 [pdf, other]

A genetic algorithm to discover flexible motifs with support

Authors: Joan Serrà, Aleksandar Matic, Josep Luis Arcos, Alexandros Karatzoglou

Abstract: Finding repeated patterns or motifs in a time series is an important unsupervised task that has still a number of open issues, starting by the definition of motif. In this paper, we revise the notion of motif support, characterizing it as the number of patterns or repetitions that define a motif. We then propose GENMOTIF, a genetic algorithm to discover motifs with support which, at the same time, is flexible enough to accommodate other motif specifications and task characteristics. GENMOTIF is an anytime algorithm that easily adapts to many situations: searching in a range of segment lengths, applying uniform scaling, dealing with multiple dimensions, using different similarity and grouping criteria, etc. GENMOTIF is also parameter-friendly: it has only two intuitive parameters which, if set within reasonable bounds, do not substantially affect its performance. We demonstrate the value of our approach in a number of synthetic and real-world settings, considering traffic volume measurements, accelerometer signals, and telephone call records.

Submitted 5 December, 2016; v1 submitted 16 November, 2015; originally announced November 2015.

Comments: 9 pages, 8 figures, code available at https://github.com/joansj/genmotif

arXiv:1503.01883 [pdf, other]

doi 10.1016/j.eswa.2016.02.026

Ranking and significance of variable-length similarity-based time series motifs

Authors: Joan Serrà, Isabel Serra, Álvaro Corral, Josep Lluis Arcos

Abstract: The detection of very similar patterns in a time series, commonly called motifs, has received continuous and increasing attention from diverse scientific communities. In particular, recent approaches for discovering similar motifs of different lengths have been proposed. In this work, we show that such variable-length similarity-based motifs cannot be directly compared, and hence ranked, by their normalized dissimilarities. Specifically, we find that length-normalized motif dissimilarities still have intrinsic dependencies on the motif length, and that lowest dissimilarities are particularly affected by this dependency. Moreover, we find that such dependencies are generally non-linear and change with the considered data set and dissimilarity measure. Based on these findings, we propose a solution to rank those motifs and measure their significance. This solution relies on a compact but accurate model of the dissimilarity space, using a beta distribution with three parameters that depend on the motif length in a non-linear way. We believe the incomparability of variable-length dissimilarities could go beyond the field of time series, and that similar modeling strategies as the one used here could be of help in a more broad context.

Submitted 6 March, 2015; originally announced March 2015.

Comments: 20 pages, 10 figures

Journal ref: Expert Systems with Applications 55: 452-460. Aug 2016

arXiv:1501.07399 [pdf, other]

doi 10.1016/j.knosys.2015.10.021

Particle swarm optimization for time series motif discovery

Authors: Joan Serrà, Josep Lluis Arcos

Abstract: Efficiently finding similar segments or motifs in time series data is a fundamental task that, due to the ubiquity of these data, is present in a wide range of domains and situations. Because of this, countless solutions have been devised but, to date, none of them seems to be fully satisfactory and flexible. In this article, we propose an innovative standpoint and present a solution coming from it: an anytime multimodal optimization algorithm for time series motif discovery based on particle swarms. By considering data from a variety of domains, we show that this solution is extremely competitive when compared to the state-of-the-art, obtaining comparable motifs in considerably less time using minimal memory. In addition, we show that it is robust to different implementation choices and see that it offers an unprecedented degree of flexibility with regard to the task. All these qualities make the presented solution stand out as one of the most prominent candidates for motif discovery in long time series streams. Besides, we believe the proposed standpoint can be exploited in further time series analysis and mining tasks, widening the scope of research and potentially yielding novel effective solutions.

Submitted 29 January, 2015; originally announced January 2015.

Comments: 12 pages, 9 figures, 2 tables

Journal ref: Knowledge-Based Systems 92: 127-137. Jan 2016

arXiv:1401.3973 [pdf, other]

doi 10.1016/j.knosys.2014.04.035

An Empirical Evaluation of Similarity Measures for Time Series Classification

Authors: Joan Serrà, Josep Lluis Arcos

Abstract: Time series are ubiquitous, and a measure to assess their similarity is a core part of many computational systems. In particular, the similarity measure is the most essential ingredient of time series clustering and classification systems. Because of this importance, countless approaches to estimate time series similarity have been proposed. However, there is a lack of comparative studies using empirical, rigorous, quantitative, and large-scale assessment strategies. In this article, we provide an extensive evaluation of similarity measures for time series classification following the aforementioned principles. We consider 7 different measures coming from alternative measure 'families', and 45 publicly-available time series data sets coming from a wide variety of scientific domains. We focus on out-of-sample classification accuracy, but in-sample accuracies and parameter choices are also discussed. Our work is based on rigorous evaluation methodologies and includes the use of powerful statistical significance tests to derive meaningful conclusions. The obtained results show the equivalence, in terms of accuracy, of a number of measures, but with one single candidate outperforming the rest. Such findings, together with the followed methodology, invite researchers on the field to adopt a more consistent evaluation criteria and a more informed decision regarding the baseline measures to which new developments should be compared.

Submitted 16 January, 2014; originally announced January 2014.

Comments: 28 pages, 5 figures, 3 tables

Journal ref: Knowledge-Based Systems 67: 305-314, 2014

arXiv:1205.5651 [pdf, other]

doi 10.1038/srep00521

Measuring the evolution of contemporary western popular music

Authors: Joan Serrà, Álvaro Corral, Marián Boguñá, Martín Haro, Josep Lluis Arcos

Abstract: Popular music is a key cultural expression that has captured listeners' attention for ages. Many of the structural regularities underlying musical discourse are yet to be discovered and, accordingly, their historical evolution remains formally unknown. Here we unveil a number of patterns and metrics characterizing the generic usage of primary musical facets such as pitch, timbre, and loudness in contemporary western popular music. Many of these patterns and metrics have been consistently stable for a period of more than fifty years, thus pointing towards a great degree of conventionalism. Nonetheless, we prove important changes or trends related to the restriction of pitch transitions, the homogenization of the timbral palette, and the growing loudness levels. This suggests that our perception of the new would be rooted on these changing characteristics. Hence, an old tune could perfectly sound novel and fashionable, provided that it consisted of common harmonic progressions, changed the instrumentation, and increased the average loudness.

Submitted 25 May, 2012; originally announced May 2012.

Comments: Supplementary materials not included. Please see the journal reference or contact the authors

Journal ref: Scientific Reports 2, 521 (2012)

arXiv:1108.6003 [pdf, ps, other]

doi 10.1016/j.patrec.2012.02.013

Characterization and exploitation of community structure in cover song networks

Authors: Joan Serrà, Massimiliano Zanin, Perfecto Herrera, Xavier Serra

Abstract: The use of community detection algorithms is explored within the framework of cover song identification, i.e. the automatic detection of different audio renditions of the same underlying musical piece. Until now, this task has been posed as a typical query-by-example task, where one submits a query song and the system retrieves a list of possible matches ranked by their similarity to the query. In this work, we propose a new approach which uses song communities to provide more relevant answers to a given query. Starting from the output of a state-of-the-art system, songs are embedded in a complex weighted network whose links represent similarity (related musical content). Communities inside the network are then recognized as groups of covers and this information is used to enhance the results of the system. In particular, we show that this approach increases both the coherence and the accuracy of the system. Furthermore, we provide insight into the internal organization of individual cover song communities, showing that there is a tendency for the original song to be central within the community. We postulate that the methods and results presented here could be relevant to other query-by-example tasks.

Submitted 12 September, 2011; v1 submitted 29 August, 2011; originally announced August 2011.

Journal ref: Pattern Recognition Letters 33(9): 1032-1041, 2012

Showing 1–48 of 48 results for author: Serra, J