
Retrieval Augmented Generation of Symbolic Music with LLMs

Authors: Nicolas Jonason, Luca Casini, Carl Thomé, Bob L. T. Sturm

Abstract: We explore the use of large language models (LLMs) for music generation using a retrieval system to select relevant examples. We find promising initial results for music generation in a dialogue with the user, especially considering the ease with which such a system can be implemented. The code is available online.

Submitted 28 December, 2023; v1 submitted 17 November, 2023; originally announced November 2023.

Comments: LBD @ ISMIR 2023

arXiv:2202.02115

Polyphonic pitch detection with convolutional recurrent neural networks

Authors: Carl Thomé, Sven Ahlbäck

Abstract: Recent directions in automatic speech recognition (ASR) research have shown that applying deep learning models from image recognition challenges in computer vision is beneficial. As automatic music transcription (AMT) is superficially similar to ASR, in the sense that methods often rely on transforming spectrograms to symbolic sequences of events (e.g. words or notes), deep learning should benefit AMT as well. In this work, we outline an online polyphonic pitch detection system that streams audio to MIDI by ConvLSTMs. Our system achieves state-of-the-art results on the 2007 MIREX multi-F0 development set, with an F-measure of 83% on the bassoon, clarinet, flute, horn and oboe ensemble recording without requiring any musical language modelling or assumptions of instrument timbre.

Submitted 4 February, 2022; originally announced February 2022.

Comments: MIREX 2017

arXiv:2202.02112

Musical Audio Similarity with Self-supervised Convolutional Neural Networks

Authors: Carl Thomé, Sebastian Piwell, Oscar Utterbäck

Abstract: We have built a music similarity search engine that lets video producers search by listenable music excerpts, as a complement to traditional full-text search. Our system suggests similar sounding track segments in a large music catalog by training a self-supervised convolutional neural network with triplet loss terms and musical transformations. Semi-structured user interviews demonstrate that we can successfully impress professional video producers with the quality of the search experience, and perceived similarities to query tracks averaged 7.8/10 in user testing. We believe this search tool will make for a more natural search experience that is easier to find music to soundtrack videos with.

Submitted 4 February, 2022; originally announced February 2022.

Comments: ISMIR LBD 2021

arXiv:2006.06287

Perceiving Music Quality with GANs

Authors: Agrin Hilmkil, Carl Thomé, Anders Arpteg

Abstract: Several methods have been developed to assess the perceptual quality of audio under transforms like lossy compression. However, they require paired reference signals of the unaltered content, limiting their use in applications where references are unavailable. This has hindered progress in audio generation and style transfer, where a no-reference quality assessment method would allow more reproducible comparisons across methods. We propose training a GAN on a large music library, and using its discriminator as a no-reference quality assessment measure of the perceived quality of music. This method is unsupervised, needs no access to degraded material and can be tuned for various domains of music. In a listening test with 448 human subjects, where participants rated professionally produced music tracks degraded with different levels and types of signal degradations such as waveshaping distortion and low-pass filtering, we establish a dataset of human rated material. By using the human rated dataset we show that the discriminator score correlates significantly with the subjective ratings, suggesting that the proposed method can be used to create a no-reference musical audio quality assessment measure.

Submitted 4 April, 2021; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: Extended abstract (first version) accepted for the Northern Lights Deep Learning Workshop 2020

arXiv:1602.08750

Filtering Video Noise as Audio with Motion Detection to Form a Musical Instrument

Authors: Carl Thomé

Abstract: Even though they differ in the physical domain, digital video and audio share many characteristics. Both are temporal data streams often stored in buffers with 8-bit values. This paper investigates a method for creating harmonic sounds with a video signal as input. A musical instrument is proposed, that utilizes video in both a sound synthesis method, and in a controller interface for selecting musical notes at specific velocities. The resulting instrument was informally determined by the author to sound both pleasant and interesting, but hard to control, and therefore suited for synth pad sounds.

Submitted 28 February, 2016; originally announced February 2016.

Comments: Received the 2015 best paper award in the KTH Royal Institute of Technology course "Musical Communication and Music Technology"

Showing 1–5 of 5 results for author: Thomé, C