LLark: A Multimodal Instruction-Following Language Model for Music

Authors: Josh Gardner, Simon Durand, Daniel Stoller, Rachel M. Bittner

Abstract: Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand, and presents unique challenges relative to other forms of audio. We present LLark, an instruction-tuned multimodal model for music understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLark, integrating a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, reasoning), we show that LLark matches or outperforms existing baselines in music understanding, and that humans show a high degree of agreement with its responses in captioning and reasoning tasks. LLark is trained entirely from open-source music data and models, and we make our training code available along with the release of this paper. Additional results and audio examples are at https://bit.ly/llark, and our source code is available at https://github.com/spotify-research/llark.

Submitted 2 June, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: ICML camera-ready version

arXiv:2307.10515 [pdf, other]

Gaussian Partial Information Decomposition: Bias Correction and Application to High-dimensional Data

Authors: Praveen Venkatesh, Corbett Bennett, Sam Gale, Tamina K. Ramirez, Greggory Heller, Severine Durand, Shawn Olsen, Stefan Mihalas

Abstract: Recent advances in neuroscientific experimental techniques have enabled us to simultaneously record the activity of thousands of neurons across multiple brain regions. This has led to a growing need for computational tools capable of analyzing how task-relevant information is represented and communicated between several brain regions. Partial information decompositions (PIDs) have emerged as one such tool, quantifying how much unique, redundant and synergistic information two or more brain regions carry about a task-relevant message. However, computing PIDs is computationally challenging in practice, and statistical issues such as the bias and variance of estimates remain largely unexplored. In this paper, we propose a new method for efficiently computing and estimating a PID definition on multivariate Gaussian distributions. We show empirically that our method satisfies an intuitive additivity property, and recovers the ground truth in a battery of canonical examples, even at high dimensionality. We also propose and evaluate, for the first time, a method to correct the bias in PID estimates at finite sample sizes. Finally, we demonstrate that our Gaussian PID effectively characterizes inter-areal interactions in the mouse brain, revealing higher redundancy between visual areas when a stimulus is behaviorally relevant.

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2306.07744 [pdf, other]

doi 10.1109/ICASSP49357.2023.10096725

Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages

Authors: Simon Durand, Daniel Stoller, Sebastian Ewert

Abstract: Lyrics alignment has gained considerable attention in recent years. State-of-the-art systems either re-use established speech recognition toolkits, or design end-to-end solutions involving a Connectionist Temporal Classification (CTC) loss. However, both approaches suffer from specific weaknesses: toolkits are known for their complexity, and CTC systems use a loss designed for transcription which can limit alignment accuracy. In this paper, we instead use a contrastive learning procedure that derives cross-modal embeddings linking the audio and text domains. This way, we obtain a novel system that is simple to train end-to-end, can make use of weakly annotated training data, jointly learns a powerful text model, and is tailored to alignment. The system is not only the first to yield an average absolute error below 0.2 seconds on the standard Jamendo dataset but it is also robust to other languages, even when trained on English data only. Finally, we release word-level alignments for the JamendoLyrics Multi-Lang dataset.

Submitted 13 June, 2023; originally announced June 2023.

Comments: 5 pages, accepted at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2023

Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5

arXiv:2009.09946 [pdf, other]

Optimal Targeting in Super-Modular Games

Authors: Giacomo Como, Stéphane Durand, Fabio Fagnani

Abstract: We study an optimal targeting problem for super-modular games with binary actions and finitely many players. The considered problem consists in the selection of a subset of players of minimum size such that, when the actions of these players are forced to a controlled value while the others are left to repeatedly play a best response action, the system will converge to the greatest Nash equilibrium of the game. Our main contributions consist in showing that the problem is NP-complete and in proposing an efficient iterative algorithm with provable convergence properties for its solution. We discuss in detail the special case of network coordination games and its relation with the notion of cohesiveness. Finally, we show with simulations the strength of our approach with respect to naive heuristics based on classical network centrality measures.

Submitted 21 September, 2020; originally announced September 2020.

arXiv:2008.02069 [pdf, other]

Data Cleansing with Contrastive Learning for Vocal Note Event Annotations

Authors: Gabriel Meseguer-Brocal, Rachel Bittner, Simon Durand, Brian Brost

Abstract: Data cleansing is a well-studied strategy for cleaning erroneous labels in datasets, which has not yet been widely adopted in Music Information Retrieval. Previously proposed data cleansing models do not consider structured (e.g. time-varying) labels, such as those common to music data. We propose a novel data cleansing model for time-varying, structured labels which exploits the local structure of the labels, and demonstrate its usefulness for vocal note event annotations in music. Our model is trained in a contrastive learning manner by automatically contrasting likely correct label pairs against local deformations of them. We demonstrate that the accuracy of a transcription model improves greatly when trained using our proposed strategy compared with the accuracy when trained using the original dataset. Additionally, we use our model to estimate the annotation error rates in the DALI dataset, and highlight other potential uses for this type of model.

Submitted 27 April, 2021; v1 submitted 5 August, 2020; originally announced August 2020.

Comments: 21st International Society for Music Information Retrieval Conference 11-15 October 2020, Montreal, Canada

arXiv:2007.12581 [pdf, other]

Dereverberation using joint estimation of dry speech signal and acoustic system

Authors: Sanna Wager, Keunwoo Choi, Simon Durand

Abstract: The purpose of speech dereverberation is to remove quality-degrading effects of a time-invariant impulse response filter from the signal. In this report, we describe an approach to speech dereverberation that involves joint estimation of the dry speech signal and of the room impulse response. We explore deep learning models that apply to each task separately, and how these can be combined in a joint model with shared parameters.

Submitted 24 July, 2020; originally announced July 2020.

arXiv:1912.07859 [pdf, other]

Controlling network coordination games

Authors: Stephane Durand, Giacomo Como, Fabio Fagnani

Abstract: We study a novel control problem in the context of network coordination games: the individuation of the smallest set of players capable of driving the system, globally, from one Nash equilibrium to another one. Our main contribution is the design of a randomized algorithm based on a time-reversible Markov chain with provable convergence guarantees.

Submitted 17 December, 2019; originally announced December 2019.

Comments: submitted to the conference IFAC

arXiv:1902.06797 [pdf, other]

End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model

Authors: Daniel Stoller, Simon Durand, Sebastian Ewert

Abstract: Time-aligned lyrics can enrich the music listening experience by enabling karaoke, text-based song retrieval and intra-song navigation, and other applications. Compared to text-to-speech alignment, lyrics alignment remains highly challenging, despite many attempts to combine numerous sub-modules including vocal separation and detection in an effort to break down the problem. Furthermore, training required fine-grained annotations to be available in some form. Here, we present a novel system based on a modified Wave-U-Net architecture, which predicts character probabilities directly from raw audio using learnt multi-scale representations of the various signal components. There are no sub-modules whose interdependencies need to be optimized. Our training procedure is designed to work with weak, line-level annotations available in the real world. With a mean alignment error of 0.35s on a standard dataset our system outperforms the state-of-the-art by an order of magnitude.

Submitted 18 February, 2019; originally announced February 2019.

Comments: 5 pages (1 for references), 2 figures, 2 tables. Camera-ready version, accepted at the International Conference on Acoustics, Speech, and Signal Processing 2019 (ICASSP)

arXiv:1605.08396 [pdf, other]

Robust Downbeat Tracking Using an Ensemble of Convolutional Networks

Authors: S. Durand, J. P. Bello, B. David, G. Richard

Abstract: In this paper, we present a novel state-of-the-art system for automatic downbeat tracking from music signals. The audio signal is first segmented into frames which are synchronized at the tatum level of the music. We then extract different kinds of features based on harmony, melody, rhythm and bass content to feed convolutional neural networks that are adapted to take advantage of each feature's characteristics. This ensemble of neural networks is combined to obtain one downbeat likelihood per tatum. The downbeat sequence is finally decoded with a flexible and efficient temporal model which takes advantage of the metrical continuity of a song. We then perform an evaluation of our system on a large base of 9 datasets, compare its performance to 4 other published algorithms and obtain a significant increase of 16.8 percentage points compared to the second best system, for altogether a moderate cost in test and training. The influence of each step of the method is studied to show its strengths and shortcomings.

Submitted 26 May, 2016; originally announced May 2016.

Showing 1–9 of 9 results for author: Durand, S