-
Modeling Music Modality with a Key-Class Invariant Pitch Chroma CNN
Authors:
Anders Elowsson,
Anders Friberg
Abstract:
This paper presents a convolutional neural network (CNN) that uses input from a polyphonic pitch estimation system to predict perceived minor/major modality in music audio. The pitch activation input is structured to allow the first CNN layer to compute two pitch chromas focused on different octaves. The following layers perform harmony analysis across chroma and time scales. Through max pooling across pitch, the CNN becomes invariant with regard to the key class (i.e., key disregarding mode) of the music. A multilayer perceptron combines the modality activation output with spectral features for the final prediction. The study uses 203 excerpts rated by around 20 listeners each, a small and challenging dataset that requires carefully designed parameter sharing. With an R² of about 0.71, the system clearly outperforms previous systems as well as individual human listeners. A final ablation study highlights the importance of using pitch activations processed across longer time scales and of using pooling to facilitate invariance with regard to the key class.
Submitted 17 June, 2019;
originally announced June 2019.
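The key-class invariance described in the abstract above can be illustrated with a toy example. This is only a sketch under simplifying assumptions, not the paper's network: the binary triad templates, the 12-bin chroma vectors, and the function names are all hypothetical. It shows that when a pattern is matched at every transposition and the result is max pooled across the pitch-class axis, transposing the input leaves the pooled value unchanged.

```python
import numpy as np

def chord_activations(chroma, template):
    """Match a 12-bin chroma vector against a template at all 12 transpositions."""
    return np.array([chroma @ np.roll(template, shift) for shift in range(12)])

def pooled_activation(chroma, template):
    """Max pooling across the pitch-class axis discards which transposition matched."""
    return float(chord_activations(chroma, template).max())

# Hypothetical binary templates (root, third, fifth), for illustration only.
MAJOR = np.zeros(12)
MAJOR[[0, 4, 7]] = 1.0
MINOR = np.zeros(12)
MINOR[[0, 3, 7]] = 1.0

c_major_chord = np.zeros(12)
c_major_chord[[0, 4, 7]] = 1.0
f_major_chord = np.roll(c_major_chord, 5)  # same chord type, 5 semitones higher

for name, chord in [("C major", c_major_chord), ("F major", f_major_chord)]:
    print(name, pooled_activation(chord, MAJOR), pooled_activation(chord, MINOR))
```

Both chords give the same pair of pooled values, so a later stage that sees only the pooled output can separate major from minor without knowing the key class.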
-
Tempo-Invariant Processing of Rhythm with Convolutional Neural Networks
Authors:
Anders Elowsson
Abstract:
Rhythm patterns can be performed at a wide range of tempi. This presents a challenge for many music information retrieval (MIR) systems; ideally, perceptually similar rhythms should be represented and processed similarly, regardless of the specific tempo at which they were performed. Several recent systems for tempo estimation, beat tracking, and downbeat tracking have therefore sought to process rhythm in a tempo-invariant way, often by sampling input vectors according to a precomputed pulse level. This paper describes how a log-frequency representation of rhythm-related activations can instead promote tempo invariance when processed with convolutional neural networks. The strategy incorporates invariance at a fundamental level and can be useful for most tasks related to rhythm processing. Different methods are described, relying on magnitude, on phase relationships between different rhythm channels, and on raw phase information. Several variations are explored to provide direction for future implementations.
Submitted 28 April, 2018; v1 submitted 22 April, 2018;
originally announced April 2018.
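The reason a log-frequency axis helps with tempo invariance can be shown with a small numerical sketch. The bin resolution, the reference frequency, and the set of metrical ratios below are assumptions chosen for illustration and are not taken from the paper. Changing the tempo multiplies every rhythmic periodicity by the same factor, which on a logarithmic axis becomes a constant shift, and a convolutional filter responds the same way to a shifted pattern.

```python
import numpy as np

def log_bins(freqs_hz, bins_per_octave=12, f_ref=0.5):
    """Map periodicity frequencies (Hz) to integer bins on a log2 axis."""
    ratios = np.asarray(freqs_hz, dtype=float) / f_ref
    return np.round(bins_per_octave * np.log2(ratios)).astype(int)

def rhythm_frequencies(bpm, ratios=(1, 2, 3, 4)):
    """Periodicities of a hypothetical pattern: the beat rate and some subdivisions."""
    beat_hz = bpm / 60.0
    return [beat_hz * r for r in ratios]

slow = log_bins(rhythm_frequencies(100))
fast = log_bins(rhythm_frequencies(200))
shift = fast - slow

print("bins at 100 BPM:", slow)
print("bins at 200 BPM:", fast)
print("shift per component:", shift)
```

Doubling the tempo moves every component by exactly one octave (12 bins with these settings), while the spacing between components stays the same.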
-
Deep Layered Learning in MIR
Authors:
Anders Elowsson
Abstract:
Deep learning has boosted the performance of many music information retrieval (MIR) systems in recent years. Yet, the complex hierarchical arrangement of music makes end-to-end learning hard for some MIR tasks: a very deep and flexible processing chain is necessary to model some aspects of music audio. Representations involving tones, chords, and rhythm are fundamental building blocks of music. This paper discusses how these can be used as intermediate targets and priors in MIR to deal with structurally complex learning problems, with learning modules connected in a directed acyclic graph. It is suggested that this strategy for inference, referred to as deep layered learning (DLL), can help generalization by (1) enforcing the validity and invariance of intermediate representations during processing, and (2) letting the inferred representations establish the musical organization to support higher-level invariant processing. A background to modular music processing is provided together with an overview of previous publications. Relevant concepts from information processing, such as pruning, skip connections, and performance supervision, are reviewed within the context of DLL. Finally, a test is performed, showing how layered learning affects pitch tracking. The results indicate that offsets in particular are easier to detect if guided by extracted framewise fundamental frequencies.
Submitted 9 December, 2018; v1 submitted 17 April, 2018;
originally announced April 2018.
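The idea of learning modules connected in a directed acyclic graph can be sketched as follows. This is a minimal illustration, not the architecture from the paper: the module names (`f0`, `onsets`, `notes`) and their placeholder functions are hypothetical stand-ins for trained networks. Each module receives the outputs of its predecessors, so an intermediate representation such as framewise f0 can guide later stages.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_dag(modules, deps, audio):
    """Evaluate modules in dependency order, passing predecessor outputs as inputs."""
    outputs = {"audio": audio}
    for name in TopologicalSorter(deps).static_order():
        if name in outputs:  # the raw input has no module attached
            continue
        inputs = [outputs[d] for d in deps[name]]
        outputs[name] = modules[name](*inputs)
    return outputs

# Placeholder modules; in a real system each would be a trained network.
modules = {
    "f0": lambda audio: [abs(x) for x in audio],
    "onsets": lambda audio, f0: [i for i, v in enumerate(f0) if v > 0.5],
    "notes": lambda f0, onsets: [(i, f0[i]) for i in onsets],
}
# Tuples keep the argument order of each module explicit.
deps = {
    "f0": ("audio",),
    "onsets": ("audio", "f0"),
    "notes": ("f0", "onsets"),
}

result = run_dag(modules, deps, [0.1, -0.9, 0.2])
print(result["notes"])
```

The connection from `f0` directly to `notes`, bypassing `onsets`, plays the role of a skip connection between modules.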
-
Polyphonic Pitch Tracking with Deep Layered Learning
Authors:
Anders Elowsson
Abstract:
This paper presents a polyphonic pitch tracking system able to extract both framewise and note-based estimates from audio. The system uses several artificial neural networks in a deep layered learning setup. First, cascading networks are applied to a spectrogram for framewise fundamental frequency (f0) estimation. A sparse receptive field is learned by the first network and then used as a filter kernel for parameter sharing throughout the system. The f0 activations are connected across time to extract pitch contours. These contours define a framework within which subsequent networks perform onset and offset detection, operating simultaneously across time and smaller pitch fluctuations. As input, the networks use, e.g., variations of latent representations from the f0 estimation network. Finally, incorrect tentative notes are removed one by one in an iterative procedure that allows a network to classify notes within an accurate context. The system was evaluated on four public test sets (MAPS, Bach10, TRIOS, and the MIREX Woodwind quintet) and achieved state-of-the-art results on all four. It performs well across all subtasks: f0, pitched onset, and pitched offset tracking.
Submitted 18 March, 2019; v1 submitted 9 April, 2018;
originally announced April 2018.
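The final step of the abstract above, removing incorrect tentative notes one at a time so that each remaining note is judged in an updated context, can be sketched as a simple loop. The scoring function here is a hypothetical hand-written rule standing in for the paper's classification network, and the note format `(onset_seconds, midi_pitch, confidence)` and the threshold are likewise assumptions.

```python
def prune_notes(notes, score_fn, threshold=0.5):
    """Remove the lowest-scoring note, then re-score the rest in the new context."""
    notes = list(notes)
    while notes:
        scores = [score_fn(note, notes) for note in notes]
        worst = min(range(len(notes)), key=scores.__getitem__)
        if scores[worst] >= threshold:
            break  # every remaining note is accepted
        del notes[worst]
    return notes

def toy_score(note, context):
    """Hypothetical rule: confidence, minus a penalty per near-duplicate note."""
    onset, pitch, confidence = note
    duplicates = sum(
        1 for other in context
        if other is not note and other[1] == pitch and abs(other[0] - onset) < 0.1
    )
    return confidence - 0.3 * duplicates

tentative = [(0.0, 60, 0.9), (0.05, 60, 0.6), (1.0, 64, 0.55)]
kept = prune_notes(tentative, toy_score)
print(kept)
```

Removing a single note per iteration matters in this toy case: once the spurious duplicate at 0.05 s is gone, the first note is no longer penalized and the loop stops.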
-
Using perceptually defined music features in music information retrieval
Authors:
Anders Friberg,
Erwin Schoonderwaldt,
Anton Hedblad,
Marco Fabiani,
Anders Elowsson
Abstract:
In this study, the notion of perceptual features is introduced for describing general music properties based on human perception. This is an attempt at rethinking the concept of features, in order to understand the underlying human perception mechanisms. Instead of using concepts from music theory such as tones, pitches, and chords, a set of nine features describing overall properties of the music was selected. They were chosen from qualitative measures used in psychology studies and motivated by an ecological approach. The selected perceptual features were rated in two listening experiments using two different data sets. They were modeled from both symbolic (MIDI) and audio data using different sets of computational features. Ratings of emotional expression were predicted using the perceptual features. The results indicate that (1) at least some of the perceptual features are reliable estimates; (2) emotion ratings could be predicted by a small combination of perceptual features with an explained variance of up to 90%; (3) the perceptual features could be modeled only to a limited extent using existing audio features. The results also clearly indicated that a small number of dedicated features were superior to a 'brute force' model using a large number of general audio features.
Submitted 31 March, 2014;
originally announced March 2014.
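Explained variance, as reported in the abstract above, is commonly computed as the R² of a regression model. The sketch below shows that computation for an ordinary least-squares fit. It is an assumption that the study used a model of this form, and the data are synthetic random numbers generated purely to exercise the code; they do not reproduce any result from the paper.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, weights...]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def r_squared(y, y_hat):
    """Explained variance: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins: 50 "excerpts", 3 "perceptual features", 1 "emotion rating".
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
ratings = features @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=50)

coef = fit_linear(features, ratings)
r2 = r_squared(ratings, predict(coef, features))
print(f"R^2 on synthetic data: {r2:.3f}")
```

With real listener ratings, the fit would normally be evaluated with cross-validation rather than on the training data, since in-sample R² overstates how well the model generalizes.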