VampNet: Music Generation via Masked Acoustic Token Modeling

Authors: Hugo Flores Garcia, Prem Seetharaman, Rithesh Kumar, Bryan Pardo

Abstract: We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation. We use a variable masking schedule during training which allows us to sample coherent music from the model by applying a variety of masking approaches (called prompts) during inference. VampNet is non-autoregressive, leveraging a bidirectional transformer architecture that attends to all tokens in a forward pass. With just 36 sampling passes, VampNet can generate coherent high-fidelity musical waveforms. We show that by prompting VampNet in various ways, we can apply it to tasks like music compression, inpainting, outpainting, continuation, and looping with variation (vamping). Appropriately prompted, VampNet is capable of maintaining style, genre, instrumentation, and other high-level aspects of the music. This flexible prompting capability makes VampNet a powerful music co-creation tool. Code and audio samples are available online.

Submitted 12 July, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

arXiv:2306.06546 [pdf, other]

High-Fidelity Audio Compression with Improved RVQGAN

Authors: Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar

Abstract: Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice, as well as open-source code and trained model weights. We hope our work can lay the foundation for the next generation of high-fidelity audio modeling.

Submitted 26 October, 2023; v1 submitted 10 June, 2023; originally announced June 2023.

Comments: Accepted at NeurIPS 2023 (spotlight)
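
The ~90x figure quoted in the abstract above can be sanity-checked with simple arithmetic. The sketch below assumes 16-bit mono PCM as the uncompressed baseline, which the abstract does not state; all names are illustrative only.

```python
# Rough check of the "~90x" compression claim.
# ASSUMPTION: the uncompressed baseline is 16-bit, single-channel PCM.
SAMPLE_RATE_HZ = 44_100      # from the abstract (44.1 kHz)
BITS_PER_SAMPLE = 16         # assumed, not stated in the abstract
CHANNELS = 1                 # assumed, not stated in the abstract
TOKEN_BITRATE_BPS = 8_000    # from the abstract (8 kbps)


def compression_ratio(sample_rate_hz, bits_per_sample, channels, coded_bps):
    """Return raw PCM bitrate divided by the coded bitrate."""
    raw_bps = sample_rate_hz * bits_per_sample * channels
    return raw_bps / coded_bps


ratio = compression_ratio(
    SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS, TOKEN_BITRATE_BPS
)
print(f"{ratio:.1f}x")  # prints "88.2x"
```

Under these assumptions the raw stream is 705.6 kbps, so the ratio comes out near 88x, consistent with the rounded figure in the abstract.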

arXiv:2208.12387 [pdf, other]

Music Separation Enhancement with Generative Modeling

Authors: Noah Schaffer, Boaz Cogan, Ethan Manilow, Max Morrison, Prem Seetharaman, Bryan Pardo

Abstract: Despite phenomenal progress in recent years, state-of-the-art music separation systems produce source estimates with significant perceptual shortcomings, such as adding extraneous noise or removing harmonics. We propose a post-processing model (the Make it Sound Good (MSG) post-processor) to enhance the output of music source separation systems. We apply our post-processing model to state-of-the-art waveform-based and spectrogram-based music source separators, including a separator unseen by MSG during training. Our analysis of the errors produced by source separators shows that waveform models tend to introduce more high-frequency noise, while spectrogram models tend to lose transients and high frequency content. We introduce objective measures to quantify both kinds of errors and show MSG improves the source reconstruction of both kinds of errors. Crowdsourced subjective evaluations demonstrate that human listeners prefer source estimates of bass and drums that have been post-processed by MSG.

Submitted 25 August, 2022; originally announced August 2022.

Comments: Accepted to ISMIR 2022

arXiv:2204.05156 [pdf, other]

How to Listen? Rethinking Visual Sound Localization

Authors: Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan Pablo Bello

Abstract: Localizing visual sounds consists of locating the position of objects that emit sound within an image. It is a growing research area with potential applications in monitoring natural and urban environments, such as wildlife migration and urban traffic. Previous works are usually evaluated with datasets having mostly a single dominant visible object, and proposed models usually require the introduction of localization modules during training or dedicated sampling strategies, but it remains unclear how these design choices play a role in the adaptability of these methods in more challenging scenarios. In this work, we analyze various model choices for visual sound localization and discuss how their different components affect the model's performance, namely the encoders' architecture, the loss function and the localization strategy. Furthermore, we study the interaction between these decisions, the model performance, and the data, by digging into different evaluation datasets spanning different difficulties and characteristics, and discuss the implications of such decisions in the context of real-world applications. Our code and model weights are open-sourced and made available for further applications.

Submitted 11 April, 2022; originally announced April 2022.

Comments: Submitted to INTERSPEECH 2022

arXiv:2110.13071 [pdf, other]

Unsupervised Source Separation By Steering Pretrained Music Models

Authors: Ethan Manilow, Patrick O'Reilly, Prem Seetharaman, Bryan Pardo

Abstract: We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation, without any retraining. An audio generation model is conditioned on an input mixture, producing a latent encoding of the audio used to generate audio. This generated audio is fed to a pretrained music tagger that creates source labels. The cross-entropy loss between the tag distribution for the generated audio and a predefined distribution for an isolated source is used to guide gradient ascent in the (unchanging) latent space of the generative model. This system does not update the weights of the generative model or the tagger, and only relies on moving through the generative model's latent space to produce separated sources. We use OpenAI's Jukebox as the pretrained generative model, and we couple it with four kinds of pretrained music taggers (two architectures and two tagging datasets). Experimental results on two source separation datasets show this approach can produce separation estimates for a wider variety of sources than any tested supervised or unsupervised system. This work points to the vast and heretofore untapped potential of large pretrained music models for audio-to-audio tasks like source separation.

Submitted 25 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP 2022

arXiv:2110.11499 [pdf, other]

Wav2CLIP: Learning Robust Audio Representations From CLIP

Authors: Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello

Abstract: We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification and cross-modal retrieval. Furthermore, Wav2CLIP needs just ~10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as a qualitative assessment of the shared embedding space. Our code and model weights are open sourced and made available for further applications.

Submitted 15 February, 2022; v1 submitted 21 October, 2021; originally announced October 2021.

Comments: Copyright 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2110.10139 [pdf, other]

Chunked Autoregressive GAN for Conditional Waveform Synthesis

Authors: Max Morrison, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, Yoshua Bengio

Abstract: Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond with an inability for the generator to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and phase, and show that this inductive bias holds even when autoregressively sampling large chunks of the waveform during each forward pass. Relative to prior state-of-the-art GAN-based models, our proposed model, Chunked Autoregressive GAN (CARGAN), reduces pitch error by 40-60%, reduces training time by 58%, maintains a fast generation speed suitable for real-time or interactive applications, and maintains or improves subjective quality.

Submitted 3 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: Published as a conference paper at ICLR 2022

arXiv:2011.00803 [pdf, other]

What's All the FUSS About Free Universal Sound Separation Data?

Authors: Scott Wisdom, Hakan Erdogan, Daniel Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, John Hershey

Abstract: We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation tools are also provided to produce new mixtures with different combinations of sources and room simulations. Finally, we introduce an open-source baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.5 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.

Submitted 2 November, 2020; originally announced November 2020.
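
For readers unfamiliar with the SI-SNR metric named in the abstract above: it is commonly defined by projecting the estimate onto the reference and comparing the energy of that projection with the energy of the residual. The following is a minimal sketch of that common definition, not the FUSS evaluation code. The function and variable names are hypothetical, and the signals are assumed to be zero-mean already.

```python
import math


def si_snr_db(estimate, reference):
    """Scale-invariant SNR in dB for two equal-length sample lists.

    ASSUMPTION: inputs are already zero-mean; no mean removal is done here.
    """
    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy

    # Split the estimate into a scaled-reference part and a residual.
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10(target_energy / residual_energy)


reference = [1.0, -1.0]
snr_a = si_snr_db([1.0, 0.0], reference)
snr_b = si_snr_db([2.0, 0.0], reference)  # same estimate, doubled in gain
```

Because the reference is rescaled before the comparison, `snr_a` and `snr_b` agree: changing the gain of the estimate does not change the score. The improvement variant (SI-SNRi) is the SI-SNR of the separated estimate minus the SI-SNR of the unprocessed mixture against the same reference.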

arXiv:2011.00801 [pdf, other]

Sound Event Detection and Separation: a Benchmark on Desed Synthetic Soundscapes

Authors: Nicolas Turpault, Romain Serizel, Scott Wisdom, Hakan Erdogan, John Hershey, Eduardo Fonseca, Prem Seetharaman, Justin Salamon

Abstract: We propose a benchmark of state-of-the-art sound event detection systems (SED). We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4 depending on time-related modifications (time position of an event and length of clips) and we study the impact of non-target sound events and reverberation. We show that the localization in time of sound events is still a problem for SED systems. We also show that reverberation and non-target sound events are severely degrading the performance of the SED systems. In the latter case, sound separation seems like a promising solution.

Submitted 2 November, 2020; originally announced November 2020.

arXiv:2010.12650 [pdf, other]

A Study of Transfer Learning in Music Source Separation

Authors: Andreas Bugler, Bryan Pardo, Prem Seetharaman

Abstract: Supervised deep learning methods for performing audio source separation can be very effective in domains where there is a large amount of training data. While some music domains have enough data suitable for training a separation system, such as rock and pop genres, many musical domains do not, such as classical music, choral music, and non-Western music traditions. It is well known that transferring learning from related domains can result in a performance boost for deep learning systems, but it is not always clear how best to do pretraining. In this work we investigate the effectiveness of data augmentation during pretraining, the impact on performance as a result of pretraining and downstream datasets having similar content domains, and also explore how much of a model must be retrained on the final target task, once pretrained.

Submitted 23 October, 2020; originally announced October 2020.

Comments: 4 pages + 1 reference page. 3 figures. Submitted to ICASSP

ACM Class: I.5.4

arXiv:2007.14469 [pdf, other]

AutoClip: Adaptive Gradient Clipping for Source Separation Networks

Authors: Prem Seetharaman, Gordon Wichern, Bryan Pardo, Jonathan Le Roux

Abstract: Clipping the gradient is a known approach to improving gradient descent, but requires hand selection of a clipping threshold hyperparameter. We present AutoClip, a simple method for automatically and adaptively choosing a gradient clipping threshold, based on the history of gradient norms observed during training. Experimental results show that applying AutoClip results in improved generalization performance for audio source separation networks. Observation of the training dynamics of a separation network trained with and without AutoClip shows that AutoClip guides optimization into smoother parts of the loss landscape. AutoClip is very simple to implement and can be integrated readily into a variety of applications across multiple domains.

Submitted 25 July, 2020; originally announced July 2020.

Comments: Accepted at 2020 IEEE International Workshop on Machine Learning for Signal Processing, Sept. 21-24, 2020, Espoo, Finland
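
The abstract above says only that the clipping threshold is chosen from the history of gradient norms. One plausible way to realize that idea is to clip to a fixed percentile of the norms seen so far. The sketch below is written under that assumption and is not the authors' implementation; the class name, the nearest-rank percentile rule, and the default percentile are all hypothetical.

```python
import math


class HistoryClipper:
    """Clip gradients to a percentile of all gradient norms seen so far.

    ASSUMPTION: the percentile rule is an illustrative choice; the abstract
    states only that the threshold depends on the gradient-norm history.
    """

    def __init__(self, percentile=10.0):
        if not 0.0 <= percentile <= 100.0:
            raise ValueError("percentile must be in [0, 100]")
        self.percentile = percentile
        self.history = []  # gradient norms observed so far

    def threshold(self):
        # Nearest-rank percentile over the recorded norms.
        ordered = sorted(self.history)
        rank = math.ceil(self.percentile / 100.0 * len(ordered))
        return ordered[max(rank, 1) - 1]

    def clip(self, grad):
        # Record the current norm first, then clip against the history.
        norm = math.sqrt(sum(g * g for g in grad))
        self.history.append(norm)
        limit = self.threshold()
        if norm > limit and norm > 0.0:
            factor = limit / norm
            return [g * factor for g in grad]
        return list(grad)


clipper = HistoryClipper(percentile=50.0)
first = clipper.clip([3.0, 4.0])   # norm 5, nothing smaller in the history
second = clipper.clip([6.0, 8.0])  # norm 10, rescaled down to norm 5
```

In a real training loop the same logic would sit between the backward pass and the optimizer step, operating on the global norm of all parameter gradients rather than on a plain list.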

arXiv:2007.06123 [pdf, other]

OtoWorld: Towards Learning to Separate by Learning to Move

Authors: Omkar Ranadive, Grant Gasser, David Terpay, Prem Seetharaman

Abstract: We present OtoWorld, an interactive environment in which agents must learn to listen in order to solve navigational tasks. The purpose of OtoWorld is to facilitate reinforcement learning research in computer audition, where agents must learn to listen to the world around them to navigate. OtoWorld is built on three open source libraries: OpenAI Gym for environment and agent interaction, PyRoomAcoustics for ray-tracing and acoustics simulation, and nussl for training deep computer audition models. OtoWorld is the audio analogue of GridWorld, a simple navigation game. OtoWorld can be easily extended to more complex environments and games. To solve one episode of OtoWorld, an agent must move towards each sounding source in the auditory scene and "turn it off". The agent receives no other input than the current sound of the room. The sources are placed randomly within the room and can vary in number. The agent receives a reward for turning off a source. We present preliminary results on the ability of agents to win at OtoWorld. OtoWorld is open-source and available.

Submitted 12 July, 2020; originally announced July 2020.

Comments: Published in Self Supervision in Audio and Speech Workshop, 37th International Conference on Machine Learning, Vienna, Austria (ICML 2020)

arXiv:2007.03932 [pdf, other]

Improving Sound Event Detection In Domestic Environments Using Sound Separation

Authors: Nicolas Turpault, Scott Wisdom, Hakan Erdogan, John Hershey, Romain Serizel, Eduardo Fonseca, Prem Seetharaman, Justin Salamon

Abstract: Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing step for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods to combine separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the sound separation model to the sound event detection data on both the sound separation and the sound event detection.

Submitted 8 July, 2020; originally announced July 2020.

arXiv:2006.13331 [pdf, other]

Incorporating Music Knowledge in Continual Dataset Augmentation for Music Generation

Authors: Alisa Liu, Alexander Fang, Gaëtan Hadjeres, Prem Seetharaman, Bryan Pardo

Abstract: Deep learning has rapidly become the state-of-the-art approach for music generation. However, training a deep model typically requires a large training set, which is often not available for specific musical styles. In this paper, we present augmentative generation (Aug-Gen), a method of dataset augmentation for any music generation system trained on a resource-constrained domain. The key intuition of this method is that the training data for a generative system can be augmented by examples the system produces during the course of training, provided these examples are of sufficiently high quality and variety. We apply Aug-Gen to Transformer-based chorale generation in the style of J.S. Bach, and show that this allows for longer training and results in better generative output.

Submitted 20 July, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

Comments: 2 pages, 2 figures, Machine Learning for Media Discovery (ML4MD) Workshop at ICML 2020

arXiv:2006.13329 [pdf, other]

Bach or Mock? A Grading Function for Chorales in the Style of J.S. Bach

Authors: Alexander Fang, Alisa Liu, Prem Seetharaman, Bryan Pardo

Abstract: Deep generative systems that learn probabilistic models from a corpus of existing music do not explicitly encode knowledge of a musical style, compared to traditional rule-based systems. Thus, it can be difficult to determine whether deep models generate stylistically correct output without expert evaluation, but this is expensive and time-consuming. Therefore, there is a need for automatic, interpretable, and musically-motivated evaluation measures of generated music. In this paper, we introduce a grading function that evaluates four-part chorales in the style of J.S. Bach along important musical features. We use the grading function to evaluate the output of a Transformer model, and show that the function is both interpretable and outperforms human experts at discriminating Bach chorales from model-generated ones.

Submitted 17 July, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

Comments: 2 pages, 3 figures, Machine Learning for Media Discovery (ML4MD) Workshop at ICML 2020

arXiv:1910.12626 [pdf, other]

Model selection for deep audio source separation via clustering analysis

Authors: Alisa Liu, Prem Seetharaman, Bryan Pardo

Abstract: Audio source separation is the process of separating a mixture (e.g. a pop band recording) into isolated sounds from individual sources (e.g. just the lead vocals). Deep learning models are the state-of-the-art in source separation, given that the mixture to be separated is similar to the mixtures the deep model was trained on. This requires the end user to know enough about each model's training to select the correct model for a given audio mixture. In this work, we automate selection of the appropriate model for an audio mixture. We present a confidence measure that does not require ground truth to estimate separation quality, given a deep model and audio mixture. We use this confidence measure to automatically select the model output with the best predicted separation quality. We compare our confidence-based ensemble approach to using individual models with no selection, to an oracle that always selects the best model and to a random model selector. Results show our confidence-based ensemble significantly outperforms the random ensemble over general mixtures and approaches oracle performance for music mixtures.

Submitted 26 July, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

arXiv:1910.12621 [pdf, other]

Simultaneous Separation and Transcription of Mixtures with Multiple Polyphonic and Percussive Instruments

Authors: Ethan Manilow, Prem Seetharaman, Bryan Pardo

Abstract: We present a single deep learning architecture that can both separate an audio recording of a musical mixture into constituent single-instrument recordings and transcribe these instruments into a human-readable format at the same time, learning a shared musical representation for both tasks. This novel architecture, which we call Cerberus, builds on the Chimera network for source separation by adding a third "head" for transcription. By training each head with different losses, we are able to jointly learn how to separate and transcribe up to 5 instruments in our experiments with a single network. We show that the two tasks are highly complementary with one another and when learned jointly, lead to Cerberus networks that are better at both separation and transcription and generalize better to unseen mixtures.

Submitted 12 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: Accepted to ICASSP 2020

arXiv:1910.11133 [pdf, other]

Bootstrapping deep music separation from primitive auditory grouping principles

Authors: Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, Bryan Pardo

Abstract: Separating an audio scene such as a cocktail party into constituent, meaningful components is a core task in computer audition. Deep networks are the state-of-the-art approach. They are trained on synthetic mixtures of audio made from isolated sound source recordings so that ground truth for the separation is known. However, the vast majority of available audio is not isolated. The brain uses primitive cues that are independent of the characteristics of any particular sound source to perform an initial segmentation of the audio scene. We present a method for bootstrapping a deep model for music source separation without ground truth by using multiple primitive cues. We apply our method to train a network on a large set of unlabeled music recordings from YouTube to separate vocals from accompaniment without the need for ground truth isolated sources or artificial training mixtures.

Submitted 23 October, 2019; originally announced October 2019.

arXiv:1909.08494 [pdf, other]

Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity

Authors: Ethan Manilow, Gordon Wichern, Prem Seetharaman, Jonathan Le Roux

Abstract: Music source separation performance has greatly improved in recent years with the advent of approaches based on deep learning. Such methods typically require large amounts of labelled training data, which in the case of music consist of mixtures and corresponding instrument stems. However, stems are unavailable for most commercial music, and only limited datasets have so far been released to the public. It can thus be difficult to draw conclusions when comparing various source separation methods, as the difference in performance may stem as much from better data augmentation techniques or training tricks to alleviate the limited availability of training data, as from intrinsically better model architectures and objective functions. In this paper, we present the synthesized Lakh dataset (Slakh) as a new tool for music source separation research. Slakh consists of high-quality renderings of instrumental mixtures and corresponding stems generated from the Lakh MIDI dataset (LMD) using professional-grade sample-based virtual instruments. A first version, Slakh2100, focuses on 2100 songs, resulting in 145 hours of mixtures. While not fully comparable because it is purely instrumental, this dataset contains an order of magnitude more data than MUSDB18, the de facto standard dataset in the field. We show that Slakh can be used to effectively augment existing datasets for musical instrument separation, while opening the door to a wide array of data-intensive music signal analysis tasks.

Submitted 18 September, 2019; originally announced September 2019.

Comments: Accepted for publication at WASPAA 2019

arXiv:1811.03076 [pdf, other]

Class-conditional embeddings for music source separation

Authors: Prem Seetharaman, Gordon Wichern, Shrikant Venkataramani, Jonathan Le Roux

Abstract: Isolating individual instruments in a musical mixture has a myriad of potential applications, and seems imminently achievable given the levels of performance reached by recent deep learning methods. While most musical source separation techniques learn an independent model for each instrument, we propose using a common embedding space for the time-frequency bins of all instruments in a mixture inspired by deep clustering and deep attractor networks. Additionally, an auxiliary network is used to generate parameters of a Gaussian mixture model (GMM) where the posterior distribution over GMM components in the embedding space can be used to create a mask that separates individual sources from a mixture. In addition to outperforming a mask-inference baseline on the MUSDB-18 dataset, our embedding space is easily interpretable and can be used for query-based separation.

Submitted 7 November, 2018; originally announced November 2018.

Comments: 5 pages

arXiv:1811.02130 [pdf, other]

Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures

Authors: Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, Bryan Pardo

Abstract: Separating an audio scene into isolated sources is a fundamental problem in computer audition, analogous to image segmentation in visual scene analysis. Source separation systems based on deep learning are currently the most successful approaches for solving the underdetermined separation problem, where there are more sources than channels. Traditionally, such systems are trained on sound mixtures where the ground truth decomposition is already known. Since most real-world recordings do not have such a decomposition available, this limits the range of mixtures one can train on, and the range of mixtures the learned models may successfully separate. In this work, we use a simple blind spatial source separation algorithm to generate estimated decompositions of stereo mixtures. These estimates, together with a weighting scheme in the time-frequency domain, based on confidence in the separation quality, are used to train a deep learning model that can be used for single-channel separation, where no source direction information is available. This demonstrates how a simple cue such as the direction of origin of source can be used to bootstrap a model for source separation that can be used in situations where that cue is not available.

Submitted 5 November, 2018; originally announced November 2018.

Comments: 5 pages, 2 figures

arXiv:1606.03539 [pdf]

Firm Growth and Innovation in the ERP Industry: A Systems Thinking Approach

Authors: Srujana Pinjala, Rahul Roy, Priya Seetharaman

Abstract: Achievement and sustenance of growth are essential themes in organizational literature. In our paper, we develop models using systems thinking approach to understand how firms achieve and sustain growth in a technology-intensive product domain. We augment these to explain the possible impact of a disruptive technological innovation. We use enterprise software industry as the context where SAP has been acknowledged as the market leader. We find that product differentiation and learning effects helped SAP establish itself, and this growth was further sustained through networks and complementors. Introducing cloud computing as the disruptive innovation, we explain its impact on a firm. Analysis reveals that for the next wave of growth to occur, and to tap into newer markets, it would be imperative for SAP to create attractive cloud based offerings. We also discuss how the model can be enhanced by considering competitor dynamics.

Submitted 10 June, 2016; originally announced June 2016.

Comments: Research-in-progress ISBN# 978-0-646-95337-3 Presented at the Australasian Conference on Information Systems 2015 (arXiv:1605.01032)

Report number: ACIS/2015/219

Showing 1–22 of 22 results for author: Seetharaman, P