Abstract

In the pursuit of developing expressive music performance models using artificial intelligence, this paper introduces DExter, a new approach leveraging diffusion probabilistic models to render Western classical piano performances. In this approach, performance parameters are represented in a continuous expression space and a diffusion model is trained to predict these continuous parameters while being conditioned on the musical score. Furthermore, DExter also enables the generation of interpretations (expressive variations of a performance) guided by perceptually meaningful features by conditioning jointly on score and perceptual feature representations. Consequently, we find that our model is useful for learning expressive performance, generating perceptually steered performances, and transferring performance styles. We assess the model through quantitative and qualitative analyses, focusing on specific performance metrics regarding dimensions like asynchrony and articulation, as well as through listening tests comparing generated performances with different human interpretations. Results show that DExter is able to capture the time-varying correlation of the expressive parameters, and compares well to existing rendering models in subjectively evaluated ratings. The perceptual-feature-conditioned generation and transferring capabilities of DExter are verified by a proxy model predicting perceptual characteristics of differently steered performances.

keywords:
Piano Performance, Expressive Rendering, Diffusion Model, Deep Learning, Music
Title: DExter: Learning and Controlling Performance Expression with Diffusion Models

Authors: Huan Zhang 1, Shreyan Chowdhury 2, Carlos Eduardo Cancino-Chacón 2, J. Liang 1, Simon Dixon 1, Gerhard Widmer 2

Correspondence: [email protected]

1 Introduction

A trained musician can take a piece of music and interpret it in their own way, moulding and varying the emotional expression of the piece by subtly changing performance parameters. Parametric dimensions include timing, dynamics, articulation, and use of devices like sustain pedals in piano. Studying such expression patterns has long been of keen interest to musicians, educators and researchers, and it presents a compelling inquiry into exploring whether such intricate expressions can be accurately encapsulated and replicated by computational systems Widmer et al. (2003). The accurate replication of human musical expression by machines not only bridges the gap between traditional and technological approaches to music but also opens new avenues for interactive performances Cancino-Chacón et al. (2023) and music education systems Morsi et al. (2024). Leveraging such technology can enhance musical training, allowing students and professionals alike to experiment with and learn from dynamically generated expression, thus broadening both creative perspectives and educational methods. In this paper, we aim at rendering such an expressive performance of a piece of music from its score using a machine learning model. We propose DExter, a Diffusion-based Expressive performance generat(o)r, which predicts expression parameters conditioned on the score. In addition, we investigate whether the rendering process can also be conditioned on desired high-level performance characteristics given in the form of mid-level perceptual features Aljanaki (2018); Chowdhury et al. (2019), in this way permitting us to control general performance character, as well as to explore the potential of style transfer within the varied space of human expressive performances. In this context, we leverage the conditional design of diffusion models as well as the diffusion chain to serve as a mechanism to regulate the extent of transferred information from a source performance to a target performance.

This paper offers three contributions (project demo page with examples: http://bit.ly/4a1xs1x; code is available at https://github.com/anusfoil/DExter): 1) we propose a diffusion-based method for learning and conditioning the expression parameters in Western classical solo piano performance; 2) we conduct a comprehensive quantitative evaluation of the rendered outputs alongside other renderers from the literature (re-trained for a fair comparison), taking into account multiple expressive dimensions such as asynchrony and articulation; and 3) we explore mid-level conditioned generation and style transfer with our model and conduct an experimental study on the conditioning effects.

2 Related Work

2.1 Expressive Performance Rendering

Expressive performance rendering has long been a challenge for Music Information Retrieval (MIR) research. While the role of machine learning in such a task was recognised early on Widmer et al. (2003), several rule-based methods have been proposed and investigated over the years Widmer and Goebl (2004); Cancino-Chacón et al. (2018); Kirke and Miranda (2013). Early experiments in deep-learning based performance rendering Cancino-Chacón (2018); Maezawa et al. (2019); Jeong et al. (2019) use traditional sequence modeling architectures like RNNs and LSTMs with modifications focusing on the music hierarchy and score features being applied as inputs. Recently, transformer-based systems Rhyu et al. (2022); Borovik and Viro (2023) have been proposed for controllable rendering, predicting different aspects of performance such as the shape of expressive attributes Rhyu et al. (2022) and performance direction markings in the score Borovik and Viro (2023). All of the above systems predict descriptors designed to capture expressive aspects of musical performance, typically tokens representing local tempo and timing deviations. However, such tokenized and quantized encodings of performance parameters are not lossless and can result in a large vocabulary to train Zhang and Dixon (2023).

Regarding the evaluation of performance rendering systems, there has been a growing criticism of the practice of evaluating against a single ground truth and ignoring the variations in interpretations Peter et al. (2023), as reconstruction-based error analysis has inherent limitations on fidelity and diversity Plasser et al. (2023); Peter et al. (2023). To mitigate this problem, we will evaluate the rendered performances with respect to a multitude of performance parameter dimensions, and against multiple different human interpretations.

2.2 Diffusion Models in Music

Diffusion Probabilistic Models (DPMs) generate data by inverting a Markovian data corruption process. DPMs have demonstrated impressive results first in the vision domain by generating text-controlled images Ramesh et al. (2021), and then also in the audio domain, with the most promising applications involving generation of high-fidelity audio samples Kong et al. (2021); Chen et al. (2021) and synthesis of speech Kim et al. (2022) and music Hawthorne et al. (2022).

Symbolic music, however, seems to be a more challenging target for DPMs – the challenge is to fit their probabilistic formulation into discrete data distributions. Mittal et al. (2021) train continuous DDPMs (Denoising Diffusion Probabilistic Models) on sequences of latent MusicVAE Roberts et al. (2018) embeddings, in order to achieve generation of monophonic melodies. Plasser et al. (2023) build upon the MusicVAE-like token representation and directly apply discrete denoising diffusion probabilistic models (D3PM). Another representation suitable for learning symbolic music Zhang et al. (2023) under the DPM framework is the piano roll: Cheuk et al. (2023) managed to transcribe music by generating a piano roll using an audio spectrogram as condition. Min et al. (2023) also achieved piano roll generation with more diverse control such as infilling music context and high-level guidance of chords and texture.

Our work places music DPMs into a niche spot: while the rendering is applied on symbolic data (discrete notes), DExter predicts continuously varying expressive parameters which are then applied at the note level. Generation of the continuous expressive parameters facilitates fine-grained control of performance parameters of each note without the reverse diffusion process having to learn a quantized representation space.

Figure 1: Training (left) and inference (right) phases of the diffusion framework. Training starts with p_codec and corrupts $x_0$ by injecting noise; the UNet model takes in the corrupted $x_t$ and the conditions $c_s$ and $c_c$ to predict the injected noise, which is then used to reconstruct $\hat{x}_0$. Loss is calculated for both noise prediction and p_codec reconstruction. The inference process (right) starts with a random sample from $\mathcal{N}(\mathbf{0}, \mathbf{I})$; the model iteratively predicts the noise and reconstructs $\hat{x}_0$, conditioned on the same $c_s$ and $c_c$. Alternatively, for transfer, we initialize the process from another performance $y_0$, corrupting it for $t_0$ steps and denoising for the remaining $T - t_0$ steps.

3 Methodology

In this section, we first introduce the representations for expressive parameters, musical score, and perceptual features used to train DExter. We call these codecs, which are described in detail in Section 4.1. We then explain our diffusion framework that learns these representations, followed by the training and inference architectures and conditioning methodology.

3.1 The Codecs

We represent score information (note onset, duration, pitch, and voice), performance parameters (beat period, velocity, timing, articulation ratio, and pedal), and mid-level perceptual features (melodiousness, articulation, rhythmic complexity, rhythmic stability, dissonance, tonal stability, and minorness) as two-dimensional spectrogram-like matrices of (mostly real-valued, except for the score codec) numeric values. We call these the score codec (s_codec), the performance codec (p_codec), and the perceptual features codec (c_codec) respectively. The task of our diffusion model is to predict a p_codec conditioned on the s_codec and c_codec. Detailed descriptions of the composition of the codecs are given in Section 4.1.

3.2 Diffusion Framework

We frame the expression rendering problem as the task of learning a continuous space of performance expression parameters. Diffusion models Ho et al. (2020) consist of two processes: i) a forward process that transforms each data sample into a standard Gaussian noisy sample step-by-step with a predefined noise schedule; and ii) a reverse process where the model learns to denoise pure-noise inputs gradually, generating samples from the learned training data distribution. In effect, our model aims to convert Gaussian noise $x_t$ into a posterior performance codec $\hat{x}_0$, conditioned on a score codec $c_s :=$ s_codec and a perceptual features codec $c_c :=$ c_codec.

The diffusion forward pass $q(x_t | x_0)$ produces a noisy version of the performance codec. With the noise $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ sampled from a standard Gaussian distribution, we blend it with the input sample $x_0$ using $\beta$ as a scaling factor intended to ultimately achieve zero mean and unit variance of the fully-noised result. Specifically, the sampling process applies a linear noise schedule with $\beta \in [0.0001, 0.2]$. As we would like to perform multiple steps simultaneously, reparameterization is applied to derive a closed-form equation, given that $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \tag{1}$$
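As an illustration of Eq. 1, the following minimal NumPy sketch builds the linear schedule over the stated range and draws a noised sample in a single step. The helper name `q_sample`, the 5 × 200 toy array standing in for a p_codec segment, and the use of `np.linspace` to realise the linear schedule are assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np

T = 1000                                  # number of diffusion steps (from the text)
betas = np.linspace(1e-4, 0.2, T)         # assumed realisation of the linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # alpha_bar_t = product of alpha_s for s <= t

def q_sample(x0, t, rng):
    """Draw x_t from q(x_t | x_0) in one shot via Eq. 1 (t is 1-indexed)."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=(5, 200))  # toy stand-in for a p_codec segment
x_t, eps = q_sample(x0, t=T, rng=rng)      # at t = T the sample is almost pure noise
```

With this schedule the cumulative product at $t = T$ is vanishingly small, so the fully noised sample is numerically indistinguishable from the drawn noise.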

During training, the model $f_\theta(x_t, t, c_s, c_c)$ learns to predict the injected noise $\hat{\epsilon}_\theta$ given a random timestep $t$ and its noised codec version $x_t$ calculated in the forward pass. The timestep $t$ is sampled from $[1, T]$; we use $T = 1000$ in our experiments. Then, we use the predicted noise $\hat{\epsilon}_\theta$ to reconstruct the predicted initial codec $\hat{x}_0$ by inverting Eq. 1.
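For reference, inverting Eq. 1 amounts to solving it for the clean codec and substituting the predicted noise for the true noise. The expression below follows directly from Eq. 1; it is a worked step added for clarity rather than an equation quoted from the paper:

$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \hat{\epsilon}_\theta}{\sqrt{\bar{\alpha}_t}}$$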

The training objective combines the noise estimation and codec reconstruction as shown in Eq. 2. Although the noise prediction is theoretically enough for training the model, empirically we found that constraining on the reconstructed codec yields better performance, with weighting $h = 0.2$.

$$L(\theta) = \| \epsilon - \hat{\epsilon}_\theta \|^2 + h\, \| x_0 - \hat{x}_0 \|^2 \tag{2}$$
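The objective in Eq. 2 can be sketched as follows. The function names `reconstruct_x0` and `dexter_loss` are hypothetical, the squared norms are taken as plain sums over all entries (the reduction used in the actual training code is not stated in the text), and the "model output" is simulated by perturbing the true noise.

```python
import numpy as np

def reconstruct_x0(x_t, eps_hat, a_bar):
    """Invert Eq. 1 to recover the predicted clean codec from predicted noise."""
    return (x_t - np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a_bar)

def dexter_loss(eps, eps_hat, x0, x_t, a_bar, h=0.2):
    """Eq. 2: noise-prediction error plus h times codec-reconstruction error."""
    x0_hat = reconstruct_x0(x_t, eps_hat, a_bar)
    noise_term = np.sum((eps - eps_hat) ** 2)   # assumed reduction: sum of squares
    recon_term = np.sum((x0 - x0_hat) ** 2)
    return noise_term + h * recon_term

rng = np.random.default_rng(0)
x0 = rng.uniform(size=(5, 200))                 # toy p_codec segment
eps = rng.standard_normal(x0.shape)
a_bar = 0.5                                     # illustrative value of alpha_bar_t
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

perfect = dexter_loss(eps, eps, x0, x_t, a_bar)        # oracle noise estimate
biased = dexter_loss(eps, eps + 0.1, x0, x_t, a_bar)   # noise estimate off by 0.1
```

The reconstruction term penalises the same error as the noise term but rescaled by the noise level, which is one way to see why it acts as an additional constraint rather than a separate objective.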

During inference, we start from a Gaussian noise distribution $p(x_t) \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively generate the codec posterior through $p_\theta(\hat{x}_{t-1} | x_t, c_s, c_c) = \mathcal{N}(\mu_{\theta,t}(x_t, t, c_s, c_c), \sigma_t^2 \mathbf{I})$ until $\hat{x}_0$ is reached. As the model $f_\theta$ estimates the noise $\hat{\epsilon}_\theta$, we use it to construct the model mean $\mu_{\theta,t}$, while the posterior variance is predetermined by the noise schedule. The full construction of the model mean and posterior variance is given in Eq. 3 and Eq. 4:

$$\mu_{\theta,t}(x_t, t, c_s, c_c) = \sqrt{\frac{1}{\alpha_t}} \left( x_t - \frac{\hat{\epsilon}_\theta\, (1 - \alpha_t)}{\sqrt{1 - \bar{\alpha}_t}} \right) \tag{3}$$

$$\sigma_t^2 = (1 - \alpha_t)\, \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \tag{4}$$
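A minimal sketch of the sampling loop implied by Eq. 3 and Eq. 4 is given below. Since no trained network is available here, `oracle_eps` is a hypothetical stand-in for $f_\theta$ that returns the exact noise implied by a known target codec; the conditions $c_s$ and $c_c$ are omitted for the same reason. Setting $\bar{\alpha}_0 = 1$ and adding no noise at the final step are conventional choices assumed for this example.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.2, T)          # assumed realisation of the linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def p_sample_step(x_t, t, eps_hat, rng):
    """One reverse step x_t -> x_{t-1} using Eq. 3 and Eq. 4 (t is 1-indexed)."""
    a_t, a_bar_t = alphas[t - 1], alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0   # assumed: alpha_bar_0 = 1
    mean = np.sqrt(1.0 / a_t) * (x_t - eps_hat * (1.0 - a_t) / np.sqrt(1.0 - a_bar_t))
    var = (1.0 - a_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    noise = rng.standard_normal(x_t.shape) if t > 1 else 0.0  # no noise on the last step
    return mean + np.sqrt(var) * noise

def oracle_eps(x_t, t, x0):
    """Hypothetical stand-in for f_theta: the exact noise implied by a known x0."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar) * x0) / np.sqrt(1.0 - a_bar)

rng = np.random.default_rng(0)
x0 = rng.uniform(size=(5, 200))            # toy target codec
x = rng.standard_normal(x0.shape)          # start from pure Gaussian noise
for t in range(T, 0, -1):
    x = p_sample_step(x, t, oracle_eps(x, t, x0), rng)
```

With a perfect noise estimate the chain ends exactly on the target codec, which is a convenient sanity check for an implementation of the two equations.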
Figure 2: Diagram of the UNet conditioning module in the network.

3.3 Architecture and Conditioning

We employ a 2-D UNet as the backbone of our architecture; the detailed layer and insertion structure can be found in Fig. 2. The conditioning on score information and perceptual feature information is enforced by a joint conditioning layer that projects the score dimensions and perceptual dimensions (5 and 7 respectively; see definition of codecs below) onto 512 dimensions. The diffusion timestep $t$ is encoded via sinusoidal position embeddings. The input codec and the conditioning codecs are downsampled and upsampled through ResNet blocks and 2D convolutions. Attention layers are interleaved at bottlenecks.
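The sinusoidal timestep encoding mentioned above can be sketched as follows. This is the standard Transformer-style construction; the embedding width of 512 and the `max_period` constant are assumptions for illustration, as the text does not state the values used for the timestep embedding.

```python
import numpy as np

def timestep_embedding(t, dim=512, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a vector of length dim."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(10)     # embedding for timestep t = 10
emb0 = timestep_embedding(0)     # t = 0 gives zeros (sin half) and ones (cos half)
```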

Before narrowing down on the above described architecture, we experimented with a DiffWave-based architecture Kong et al. (2021) which uses a series of 12 residual layers of 1D convolution. For conditioning, we also experimented with FiLM Perez et al. (2018); Kim and Serra (2023) which yields comparable results to the UNet model. We found that our present architecture gives the best trade-off between model simplicity and performance.

Classifier-Free Guidance (CFG) Ho and Salimans (2021) is widely used for conditioning diffusion models to achieve controllable generation, and we adopt it as well. During training, a dropout layer is applied to the conditions $c_s$ and $c_c$ to randomly mask out the conditions with probability $p$, in order to simultaneously train the conditional model $f_\theta(x_t, t, c_s, c_c)$ and the unconditional model $f_\theta(x_t, t)$. We fixed $p = 0.1$ in our training. In inference, a weight parameter $w$ is applied as the guidance scale to a combined prediction:

$$\hat{\epsilon} = w\, \hat{\epsilon}(x_t, t) + (1 - w)\, \hat{\epsilon}(x_t, t, c_s, c_c) \tag{5}$$
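The two halves of this scheme, condition dropout during training and the weighted combination of Eq. 5 at inference, might look as follows. Both function names are hypothetical, and masking by replacing the conditions with zeros is an assumption; the text only says the conditions are masked out.

```python
import numpy as np

def drop_conditions(c_s, c_c, p, rng):
    """With probability p, mask both conditions (assumed here: replace with zeros)."""
    if rng.random() < p:
        return np.zeros_like(c_s), np.zeros_like(c_c)
    return c_s, c_c

def guided_eps(eps_uncond, eps_cond, w):
    """Eq. 5: blend the unconditional and conditional noise predictions."""
    return w * eps_uncond + (1.0 - w) * eps_cond

rng = np.random.default_rng(0)
c_s = np.ones((4, 200))                      # toy score condition
c_c = np.ones((7, 200))                      # toy perceptual-feature condition
masked_s, masked_c = drop_conditions(c_s, c_c, p=1.0, rng=rng)   # always masked
kept_s, kept_c = drop_conditions(c_s, c_c, p=0.0, rng=rng)       # never masked

eps_u = np.full((5, 200), 2.0)               # toy unconditional prediction
eps_c = np.full((5, 200), 4.0)               # toy conditional prediction
blend = guided_eps(eps_u, eps_c, w=0.5)
```

Under Eq. 5 as written, $w = 0$ reduces to the purely conditional prediction and $w = 1$ to the purely unconditional one.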
| Dataset | Pieces | Performances | Duration | MIDI | Repertoire |
| --- | --- | --- | --- | --- | --- |
| Vienna4x22 (Goebl (1999)) | 4 | 88 | 2h 18m | recorded | Excerpts from 4 pieces by F. Chopin (Op. 10 No. 3, Op. 38), W. A. Mozart (KV331, first mov.), and F. Schubert (D. 783 No. 15) |
| (n)ASAP (Peter et al. (2023)) | 235 | 1067 | 94h 30m | recorded | Common Practice Period solo piano works by 15 composers |
| ATEPP-subset* (Zhang et al. (2022)) | 1580 | 11677 | 1000h | transcribed | Solo piano works by 25 composers, ranging from Baroque to Modern era |

Table 1: Overview of datasets used in experiments.

4 Data, Representation, and Processing

4.1 Input and Target Encodings

The performance codec (p_codec), our prediction target, was originally proposed in the expressive rendering framework Basis Mixer Cancino-Chacón (2018), where four expressive parameters are computed for each note $n$ appearing in the score. These parameters of the p_codec encode the expression controls that modify properties of notes in a MIDI piano performance, thus changing the speed and loudness of the performance over time. Combined with score information, the full expressive performance can be reconstructed in a lossless fashion. We expanded the original Performance Codec v.1.0 Cancino-Chacón (2018); Cancino-Chacón et al. (2023) by defining an additional parameter for sustain pedal control. The resulting five (note-wise) performance parameters are as follows:

  • Beat period: the ratio of the inter-onset intervals (IOI) between two consecutive notes of the performance and the score. This parameter is computed for each onset $o_k$ instead of each note $n_i$. It is defined as:
    $x_{\mathit{bp}}(o_k) = \frac{\text{IOI}_{\mathit{perf}}(o_k)}{\text{IOI}_{\mathit{score}}(o_k)} = \frac{\hat{o}^{\mathit{perf}}_{k+1} - \hat{o}^{\mathit{perf}}_{k}}{o_{k+1} - o_k}$, where $\hat{o}^{\mathit{perf}}_{k}$ is the actual performed onset time, in seconds, corresponding to score onset $o_k$ in beats, calculated as the average onset time of all notes played at score onset position $o_k$.

  • Velocity: $x_{\mathit{vel}}(n_i) = \frac{\text{vel}(n_i)}{127}$, where vel is the MIDI velocity of a played note.

  • Timing: $x_{\mathit{tim}}(n_i) = \Delta_t(n_i) = \hat{o}^{\mathit{perf}}(n_i) - \text{onset}(n_i)$, the average onset time of all notes played at score onset position (used in beat period) minus the performance onset time of $n_i$. Taking beat period as the ‘tempo grid’ notion Gillick et al. (2021), timing would then refer to the micro-deviation of each note relative to the grid.

  • Articulation ratio: $x_{\mathit{art}}(n_i) = \frac{\text{dur}^{\mathit{perf}}(n_i)}{\text{dur}(n_i) \cdot x_{\mathit{bp}}(n_i)}$, which measures the fraction of the expected note duration that is actually played.

  • Pedal: $x_{\mathit{ped}}(n_i) = \frac{\text{ped}(n_i)}{127}$, where $\text{ped}(n_i)$ is the discrete MIDI pedal value at the onset of $n_i$. Note that pedal encoding is not lossless since changes of value between note onsets will not be captured.

The p_codec is fully invertible in that the full event information from the MIDI file [Pitch, Onset, Duration, Velocity] can be reconstructed given the p_codec and score s_codec.
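The five definitions above can be illustrated on a toy alignment of four notes. The function `encode_p_codec` is a hypothetical sketch written from the formulas in the text, not the partitura implementation; in particular, reusing the previous beat period for the final onset (which has no following onset) and the row order of the output are assumptions.

```python
import numpy as np

def encode_p_codec(score_onset, score_dur, perf_onset, perf_dur, velocity, pedal):
    """Return a 5 x n array: beat period, velocity, timing, articulation ratio, pedal."""
    uniq, inv = np.unique(score_onset, return_inverse=True)
    # Average performed onset (seconds) per unique score onset (beats).
    mean_perf = np.bincount(inv, weights=perf_onset) / np.bincount(inv)
    # Beat period per score onset: performed IOI divided by score IOI.
    bp = np.diff(mean_perf) / np.diff(uniq)
    bp = np.append(bp, bp[-1])            # assumption: last onset reuses the previous value
    bp_note = bp[inv]                     # broadcast onset-wise values to notes
    timing = mean_perf[inv] - perf_onset  # average onset minus the note's own onset
    articulation = perf_dur / (score_dur * bp_note)
    return np.stack([bp_note, velocity / 127.0, timing, articulation, pedal / 127.0])

codec = encode_p_codec(
    score_onset=np.array([0.0, 0.0, 1.0, 2.0]),    # beats; first two notes form a chord
    score_dur=np.array([1.0, 1.0, 1.0, 1.0]),      # beats
    perf_onset=np.array([0.00, 0.02, 0.51, 1.05]), # seconds
    perf_dur=np.array([0.4, 0.5, 0.5, 0.6]),       # seconds
    velocity=np.array([64, 80, 70, 127]),
    pedal=np.array([0, 0, 127, 127]),
)
```

In this toy case the chord's two notes share one beat period but receive opposite timing deviations, since they fall on either side of the chord's average onset.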

The Score codec (s_codec) represents the musical score and is derived from the note array from the partitura package Cancino-Chacón et al. (2022). Aligned with the p_codec at the note level, it contains four score parameters for each note: (notated) Onset, Duration, Pitch, and Voice, resulting in a 2D matrix of dimension $4 \times n$, where $n$ is the number of notes. The score is indispensable for performance conditioning, as it defines the musical content of the piece.

The Perceptual features codec (c_codec), which we use as the steering input for performance generation, is a representation of the so-called mid-level perceptual features Aljanaki (2018), namely melodiousness, articulation, rhythm complexity, rhythm stability, dissonance, tonal stability, and ‘minorness’ (or mode). They describe musical qualities that most listeners can easily perceive. Taking our cue from previous research Chowdhury et al. (2019); Chowdhury (2022) showing that these features effectively represent musical factors underlying a wide range of emotions and capture variations in expressive character between different performances of a piece Chowdhury and Widmer (2021), we incorporate them as the performance steering inputs. In our scenario, these features are calculated by a previously trained specialised model Chowdhury et al. (2019) over the recorded audio performance data of the Vienna4x22, (n)ASAP, and ATEPP datasets (see Sec. 4.4). The values are calculated from successive overlapping 15s windows with a hop size of 5s. Each computed window is then aligned with the score note array and broadcast into the c_codec, a 2D matrix of dimension $7 \times n$.
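One possible way to broadcast window-level features to notes is sketched below. The text does not specify the alignment rule, so assigning each note to the window that starts at or just before its performed onset is purely an assumption, as is the function name `broadcast_windows`.

```python
import numpy as np

def broadcast_windows(window_feats, note_onsets_sec, hop=5.0):
    """Map (n_windows, 7) window features to a (7, n_notes) c_codec.

    Assumed rule: a note at time s takes the features of window floor(s / hop),
    clipped to the last available window.
    """
    idx = np.floor(np.asarray(note_onsets_sec) / hop).astype(int)
    idx = np.clip(idx, 0, len(window_feats) - 1)
    return window_feats[idx].T

window_feats = np.arange(21, dtype=float).reshape(3, 7)   # 3 toy windows, 7 features each
note_onsets = [0.0, 4.9, 5.0, 12.0, 40.0]                 # performed onsets in seconds
c_codec = broadcast_windows(window_feats, note_onsets)
```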

4.2 Processing

Given that there could be slight variations in each performance (missing and extra notes relative to the score), we perform padding based on the score note array so that each pair of performances is perfectly aligned. To accommodate pieces of different lengths, we train our network on segments of $N$ notes, where shorter segments are padded. In our experiment, we take $N = 200$ (which corresponds to about 10 to 20 seconds of music depending on the tempo and note density).
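The segmentation step can be sketched as follows, assuming a codec laid out as channels × notes. The function name `segment_and_pad` and the choice of zero as the padding value are assumptions; the text does not state which value is used for padding.

```python
import numpy as np

def segment_and_pad(codec, n=200, pad_value=0.0):
    """Split a (channels, notes) codec into (n_segments, channels, n), right-padding the last."""
    channels, notes = codec.shape
    n_segments = -(-notes // n)                       # ceiling division
    padded = np.full((channels, n_segments * n), pad_value, dtype=float)
    padded[:, :notes] = codec
    return padded.reshape(channels, n_segments, n).transpose(1, 0, 2)

codec = np.arange(5 * 450, dtype=float).reshape(5, 450)   # toy piece with 450 notes
segments = segment_and_pad(codec)                         # three segments of 200 notes
```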

4.3 Mixup Augmentation

Mixup Zhang et al. (2018) is a data augmentation scheme that regularizes a network to favor simple linear behavior between training examples. To strengthen our model’s ability to model different interpretations, we fuse p_codec pairs $x_1$ and $x_2$ (codecs representing two different performances of the same piece) and their corresponding c_codec using Eq. 6, where $\lambda$ is a scaling factor in $[0, 1]$.

x1,2=λx1+(1λ)x2subscript𝑥12𝜆subscript𝑥11𝜆subscript𝑥2x_{1,2}=\lambda x_{1}+(1-\lambda)x_{2}italic_x start_POSTSUBSCRIPT 1 , 2 end_POSTSUBSCRIPT = italic_λ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ( 1 - italic_λ ) italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT (6)

After the mixup augmentation, our dataset consists of 170k segments; the interpolated data are only used in training.
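Eq. 6 can be sketched as follows, applying the same $\lambda$ to the p_codec pair and to the corresponding c_codec pair. Drawing $\lambda$ uniformly is an assumption here (the original mixup formulation samples it from a Beta distribution), and the function names are hypothetical.

```python
import random


def interpolate(a, b, lam):
    """Element-wise lam * a + (1 - lam) * b for two equally shaped matrices."""
    return [[lam * u + (1.0 - lam) * v for u, v in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def mixup(x1, x2, c1, c2, lam=None, rng=random):
    """Fuse two performances of the same piece and their perceptual codecs."""
    if lam is None:
        lam = rng.random()  # assumption: lambda drawn uniformly from [0, 1)
    return interpolate(x1, x2, lam), interpolate(c1, c2, lam), lam


x_mix, c_mix, lam = mixup([[0.0, 2.0]], [[4.0, 6.0]], [[1.0]], [[3.0]], lam=0.25)
```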

4.4 Datasets and Training Setup

We used three datasets of expressive performances (from the Western classical music solo piano repertoire): Vienna4×22 Goebl (1999), (n)ASAP Peter et al. (2023), and ATEPP Zhang et al. (2022). Each dataset includes audio, performance MIDI, score in MusicXML format, and their alignment. Information and a comparison of these sets can be found in Table 1. The training is based on ATEPP and 80% of the (n)ASAP and Vienna4×22 data, while the testing set (used in all subsequent experiments in Sec. 5) contains the remaining 20% of the (n)ASAP and Vienna4×22 data. The latter two datasets were recorded on computer-controlled grand pianos and are thus more accurate and precise than the ATEPP data, which were obtained through curated audio transcription.

For the training of the network described in Sec. 3, we use the Adam optimizer with a learning rate of $5\times 10^{-5}$, and employ early stopping with a patience of 50 epochs.
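The stopping rule can be sketched independently of any deep learning framework. That the monitored quantity is a validation loss is an assumption, and the class name `EarlyStopping` is hypothetical; the optimizer itself is omitted.

```python
class EarlyStopping:
    """Stop training once the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# With patience 3, training stops after three epochs without improvement.
stopper = EarlyStopping(patience=3)
decisions = [stopper.step(v) for v in [1.0, 0.9, 0.95, 0.92, 0.91]]
```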

| Attribute | Basis Mixer Cancino-Chacón (2018) | VirtuosoNet Jeong et al. (2019) | DExter (Ours) |
| --- | --- | --- | --- |
| Deviation multiple ($\rightarrow 0$) | | | |
| articulation.key_overlap_ratio | 0.62 ± 3.15 | 1.92 ± 2.72 | 2.24 ± 2.76 |
| asynchrony.pitch_correlation | -1.19 ± 1.41 | -1.67 ± 1.25 | -1.17 ± 1.66 |
| asynchrony.delta | 4.10 ± 1.78 | 4.38 ± 1.63 | 4.43 ± 2.21 |
| dynamics.agreement | -0.07 ± 1.33 | -0.40 ± 1.30 | -0.002 ± 1.05 |
| dynamics.consistency | -0.67 ± 1.64 | -1.07 ± 1.56 | 0.73 ± 2.21 |
| dynamics.ramp_correlation | -0.36 ± 2.54 | 0.65 ± 1.96 | -0.28 ± 2.44 |
| pedal.onset_value | - | -1.16 ± 1.68 | -1.39 ± 2.13 |
| tempo_curve | -0.10 ± 2.44 | 0.52 ± 2.48 | 0.72 ± 2.65 |
| velocity_curve | -0.67 ± 1.48 | 0.15 ± 1.02 | 1.48 ± 2.12 |
| KL Divergence ($\downarrow$) | | | |
| articulation.key_overlap_ratio | 0.92 ± 2.15 | 1.66 ± 6.89 | 1.64 ± 3.63 |
| asynchrony.pitch_correlation | 0.14 ± 0.278 | 0.13 ± 1.25 | 0.20 ± 0.33 |
| asynchrony.delta | 4.04 ± 5.15 | 4.83 ± 9.58 | 1.29 ± 3.16 |
| dynamics.agreement | 0.10 ± 0.04 | 0.09 ± 0.04 | 0.06 ± 0.04 |
| dynamics.consistency | 0.12 ± 0.23 | 0.06 ± 0.07 | 0.28 ± 0.49 |
| dynamics.ramp_correlation | 1.54 ± 5.43 | 0.35 ± 1.01 | 0.42 ± 1.13 |
| pedal.onset_value | - | 0.34 ± 1.45 | 0.33 ± 0.36 |
| tempo_curve | 0.98 ± 2.55 | 0.65 ± 1.86 | 1.26 ± 5.66 |
| velocity_curve | 0.16 ± 0.21 | 0.10 ± 0.06 | 0.71 ± 1.37 |
| Pearson’s Correlation ($\uparrow$) | | | |
| articulation.key_overlap_ratio | -0.01 ± 0.13 | 0.05 ± 0.16 | 0.11 ± 0.17 |
| asynchrony.pitch_correlation | 0.33 ± 0.25 | 0.55 ± 0.17 | 0.57 ± 0.25 |
| asynchrony.delta | 0.17 ± 0.22 | 0.29 ± 0.19 | 0.28 ± 0.21 |
| dynamics.agreement | 0.02 ± 0.87 | 0.04 ± 0.84 | 0.11 ± 0.79 |
| dynamics.consistency | 0.92 ± 0.17 | 0.91 ± 0.13 | 0.92 ± 0.15 |
| dynamics.ramp_correlation | 0.04 ± 0.76 | 0.12 ± 0.80 | 0.14 ± 0.73 |
| pedal.onset_value | - | 0.01 ± 0.13 | 0.02 ± 0.14 |
| tempo_curve | 0.02 ± 0.13 | 0.09 ± 0.19 | 0.19 ± 0.17 |
| velocity_curve | 0.08 ± 0.23 | 0.21 ± 0.27 | 0.27 ± 0.23 |
Table 2: Quantitative expression metrics in the categories of articulation, asynchrony, dynamics, pedaling, plus global tempo and velocity curves. Columns represent different models, and rows are divided into blocks according to the three different evaluation metrics, with each block detailing the outcomes for all features. Note that each generated performance is compared with multiple human ground truth interpretations.

5 Evaluation

In this section, we present quantitative evaluation of generated performances without and with steering, followed by evaluation of performance transfer, and an investigation into the effect of varying the conditioning weight. Finally, we also describe our qualitative study employing a listening test and human participants and present the results.

5.1 Quantitative Evaluation

In this subsection, we evaluate the expressiveness of our generated samples by comparing core expression attributes with ground truth performances. This experiment is conducted on the aforementioned testing set, conditioned on the s_codec and on the c_codec inferred from the audio performance, as described in Sec. 4.1. In light of the critique of reconstruction-based evaluation Peter et al. (2023), we compare against multiple ground truth interpretations (the testing set contains about 5.3 human performances per piece, on average).

5.1.1 Assessed Attributes

The expression attributes we assess are derived from the tempo and velocity curves (joint-onset level), joint-onset asynchrony, articulation, dynamics and pedalling. While a detailed documentation of the selected attributes can be found on the project page, we provide a summary below:

  • Tempo curve: Onset-level tempo (inverse of local inter-beat-intervals), with values averaged across notes on the joint onset. (tem_curve)

  • Velocity curve: Onset-level velocity, with values averaged across notes on the joint onset. (vel_curve)

  • Asynchrony: The absolute difference in seconds between the earliest and latest note in a joint onset (asy.delta). We also measure the pitch correlation (asy.p_cor) between the pitch and micro-timing within the joint onset, inspired by the melody lead phenomenon Goebl (2001).

  • Articulation: Key overlap ratio (art.kor) Bresin and Umberto Battel (2000), measured at each note transition; overlap time (or gap time if staccato) divided by the IOI between the two notes.

  • Dynamics: Comparing performed velocity and score marking ($f$, $p$, etc.), and measuring their agreement (dyn.agr) and consistency (dyn.con) as proposed by Kosta et al. (2018). We also propose the ramp correlation (dyn.r_cor) for changing markings (hairpins) since Kosta et al. (2018) only worked with constant markings. The ramp correlation computes the amount of agreement between the performed velocities with respect to their cresc. or decresc. ramp, if the markings exist.

  • Pedals: We measure the sustain pedal value at the note onset (ped.onval). In practice, the sustain pedal produces a continuous stream of values, and changes in pedal position often happen between note attacks; however, sampling at the note onsets simplifies the computation and allows for a consistent assessment across models.
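Two of the attributes above, asy.delta and art.kor, can be sketched directly from their definitions. The sign convention for the key overlap ratio (positive for overlap, negative for a gap) is an assumption, and the function names are hypothetical; the documented implementation on the project page may differ.

```python
def asynchrony_delta(onset_times):
    """Time span in seconds between the earliest and latest note of one joint onset."""
    return max(onset_times) - min(onset_times)


def key_overlap_ratio(onset_a, offset_a, onset_b):
    """Key overlap ratio for the transition from note a to the following note b.

    Positive values indicate overlap (legato), negative values a gap
    (staccato); the overlap or gap is divided by the inter-onset interval.
    """
    ioi = onset_b - onset_a
    if ioi <= 0:
        raise ValueError("note b must start strictly after note a")
    return (offset_a - onset_b) / ioi


spread = asynchrony_delta([1.00, 1.03, 1.01])   # chord spread of about 30 ms
legato = key_overlap_ratio(0.0, 0.6, 0.5)       # note a is held into note b
staccato = key_overlap_ratio(0.0, 0.25, 0.5)    # note a is released early
```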

5.1.2 Metrics

For each expression dimension, we measure three metrics between the generated performance and the ground truth space:

Standard deviation multiple: This metric computes the deviation of an attribute of the rendered output from the mean of multiple human performances on a beat-level basis. Different from absolute deviation, this measure incorporates the flexibility of interpretations: when human interpretations already contain large differences, a larger discrepancy can be tolerated. But if human players tend to agree on the interpretation, we would expect the rendered output to fit more closely to ground truth values. Additionally, we retain the sign (direction) of deviation, so that negative values indicate deviations in the direction of slower tempo or softer dynamics, for example.
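A minimal sketch of this metric, assuming that all performances are already sampled on a common beat grid and that the population standard deviation is used (neither detail is specified above). The function name is hypothetical.

```python
import statistics


def deviation_multiple(rendered, human_performances):
    """Signed per-beat deviation from the human mean, in units of the human std.

    rendered:           attribute value per beat for the generated performance.
    human_performances: one list per human interpretation, on the same beats.
    """
    result = []
    for beat, value in enumerate(rendered):
        human_values = [perf[beat] for perf in human_performances]
        mean = statistics.fmean(human_values)
        std = statistics.pstdev(human_values)
        # Beats on which all humans agree exactly are left undefined.
        result.append((value - mean) / std if std > 0 else float("nan"))
    return result


# Two human interpretations; the rendering is one std above, then one below.
multiples = deviation_multiple([3.0, 1.0], [[1.0, 1.0], [3.0, 3.0]])
```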

KL divergence takes all the note-level or beat-level attributes, and compares their divergence with the ground truth attributes as an overall distribution. (Note that the ground truth attribute distributions are aggregated from multiple interpretations.) The KL divergence is calculated by estimating the two distributions using Monte Carlo sampling ($N=300$) and computing the relative entropy between them.
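One possible reading of this procedure is sketched below: draw $N=300$ values from each pool of attributes, bin them on a shared histogram, and compute the relative entropy of the generated distribution with respect to the ground truth. The histogram estimator, the number of bins, the smoothing floor, and the direction of the divergence are all assumptions.

```python
import math
import random


def kl_divergence_sampled(generated, ground_truth, n_samples=300, n_bins=20, seed=0):
    """Estimate KL(generated || ground_truth) from resampled attribute values."""
    rng = random.Random(seed)
    p_samples = [rng.choice(generated) for _ in range(n_samples)]
    q_samples = [rng.choice(ground_truth) for _ in range(n_samples)]
    lo = min(p_samples + q_samples)
    hi = max(p_samples + q_samples)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all samples identical: everything falls into bin 0

    def histogram(samples):
        counts = [1e-6] * n_bins  # small floor avoids log(0) for empty bins
        for s in samples:
            counts[min(int((s - lo) / width), n_bins - 1)] += 1
        total = sum(counts)
        return [c / total for c in counts]

    p, q = histogram(p_samples), histogram(q_samples)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


same = kl_divergence_sampled([0.0] * 50, [0.0] * 50)
apart = kl_divergence_sampled([0.0] * 50, [10.0] * 50)
```

Identical pools give a divergence of zero, while disjoint pools give a large positive value bounded by the smoothing floor.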

Pearson’s correlation is measured between the feature sequences of generated and ground truth performances. In contrast to the previous two metrics, this metric captures the time-varying similarity between the performance attributes.
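The correlation itself is standard; the sketch below additionally averages it over several ground truth interpretations, which is an assumption about how the multiple references are aggregated. Both function names are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson's r between two equally long attribute sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)


def mean_correlation(generated, references):
    """Average correlation of one rendering against several human performances."""
    values = [pearson_correlation(generated, ref) for ref in references]
    return sum(values) / len(values)


rising = pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
falling = pearson_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
average = mean_correlation([1.0, 2.0, 3.0], [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
```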

5.1.3 Results

In Table 2, we compare our model with samples from two existing performance rendering systems: Basis Mixer Cancino-Chacón (2018) (BM; we applied the Basis Mixer model with LSTM architecture, trained on the ASAP corpus) and VirtuosoNet Jeong et al. (2019) (VN; the applied model is isgn with the default tempo and composer settings, as agreed with the author). Results are computed on the same testing set as ours and reported as mean and standard deviation.

Overall, DExter shows commendable results particularly in correlation across almost all performance dimensions, especially in capturing the global curve of tempo and velocity. This latter effect (learning the overall musical shape) could be attributed to the diffusion model predicting the time-varying codec in one pass, in contrast to autoregressive approaches. However, it is evident that DExter has room for improvement in terms of deviation and divergence: DExter’s outputs demonstrate more volatile changes of parameters that are less smooth than other renderings.

Meanwhile, each model exhibits distinct strengths across various performance dimensions. BM’s outputs have articulation that is closer to human ground truths, and this can be attributed to BM being more conservative in its use of expressive devices, using smaller deviations from mechanical reproduction of the score. VN also excels in modelling the dynamics in agreement with the score markings. It is also notable that both models with sustain pedal prediction did a poor job of mimicking human pedaling techniques, with almost no time-wise correlation and on average one standard deviation away from the ground truth. Another area where all models struggle is the asynchrony time (asy.delta: $\sim$4 deviations away from the human benchmark), highlighting the need to refine the micro-timing aspect of performance rendering models.

Figure 3: Steering the expressive characteristics of generated performances by using mid-level features (c_codec, see Sec. 4.1) as conditions. For each piece, a performance is generated with the c_codec derived from an actual performance of the piece, and two further performances are generated with one of the mid-level features doubled (2×) or halved (0.5×). The average difference between the halved and unmodified, and between the doubled and unmodified conditions are plotted here.

5.2 Expressive Steering with Perceptual Features

Our framework of conditioned performance generation allows us to explore conditioning the performance generation on additional features. As described in Sec. 4.1, we use mid-level perceptual features (encoded as c_codec) as steering inputs to guide the expressive character of the generated performance.

To gauge DExter’s sensitivity to changes in these features, we use the perceptual feature recognition model of Chowdhury et al. (2019) as a proxy for human perception. However, as that model had originally been trained on audio input, and we wanted to eliminate the effect of acoustic artifacts introduced by rendering MIDI to audio, we decided to fit a MIDI-to-perceptual-features model to serve as the proxy instead. Details on this proxy model are given in the Appendix.

Steering performance generation is done by manipulating individual dimensions of the perceptual features, aiming to induce measurable corresponding effects in the resulting outputs. For each test sample and across all seven perceptual attributes, we first generate performances using the unmodified target $c_c$. These target perceptual feature values could be randomly initialized in practice, or derived from an actual performance. We take the feature values derived from actual performances and modify the values to steer the generation in particular expressive directions, thus generating “alternate” performances of the original performance. We either halve one feature, $\frac{1}{2}c_c$, or double one feature, $2c_c$, at a time.
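The manipulation of a single feature dimension can be sketched as below. The row order of the $7\times n$ matrix (following the order in which the features are listed in Sec. 4.1) and the function name `scale_feature` are assumptions.

```python
# Assumed row order of the 7 x n c_codec, following the listing in Sec. 4.1.
FEATURES = ["melodiousness", "articulation", "rhythm complexity",
            "rhythm stability", "dissonance", "tonal stability", "minorness"]


def scale_feature(c_codec, feature_index, factor):
    """Return a copy of c_codec with one feature row multiplied by `factor`."""
    return [
        [value * factor for value in row] if i == feature_index else list(row)
        for i, row in enumerate(c_codec)
    ]


# Halve and double the articulation row of a toy codec with two notes.
target = [[1.0, 2.0] for _ in FEATURES]
articulation = FEATURES.index("articulation")
halved = scale_feature(target, articulation, 0.5)
doubled = scale_feature(target, articulation, 2.0)
```

Each steered codec then replaces the unmodified $c_c$ as the condition for generation, while the score condition stays fixed.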

Fig. 3 displays the proxy model’s predictions on the generated outputs. We observe that for the first four features (melodiousness, articulation, rhythm complexity, rhythm stability), the adjustments applied to the input conditions manifest as anticipated directional changes (the $c_{c,\mathit{double}}$ group leads the $c_{c,\mathit{half}}$ group by 12.2% in terms of their absolute value), providing evidence of the model’s responsive behavior to the controlled feature alterations.

The other three dimensions – notably, dissonance – exhibit less consistent patterns in alignment with the input modifications. That seems reasonable, as harmonic and tonality-related properties are more a function of a piece itself, rather than any specific interpretation of it.

5.3 Transferring from a Source Performance

As suggested by Liu et al. (2023) and Zhang et al. (2023), style transfer can be achieved in a diffusion framework by using, as a starting point for generation, a shallowly noised version of the source information. Given the large amount of music overlap in our datasets, we can test this by forming data pairs that consist of two interpretations of the same piece, to be used as the source and target p_codec in this experiment.

Given a source performance codec $x_{\mathit{src}}$, we calculate its noisy version $x_{t_0}$ with a predefined time step $t_0 \leq T$ according to the forward process shown in Equation 1. By using $x_{t_0}$ as the starting point for the reverse process of a pretrained model, we enable the manipulation of performance $x_{\mathit{src}}$ with target mid-level condition and shared score condition $c_{(s,\mathit{tgt})}$ in a shallow reverse process $p_{\theta}(\hat{x}_{0:t_0} \mid x_{t_0}, c_{(s,\mathit{tgt})})$, as illustrated in Fig. 1 (top, right). With the transfer experiments, we attempt to answer the following two questions:

1. Does transferring help with the final generation quality compared with rendering from scratch?

| $t$ | dev: tem ($\downarrow$) | dev: vel ($\downarrow$) | cor: tem ($\uparrow$) | cor: vel ($\uparrow$) |
| --- | --- | --- | --- | --- |
| $T$ | 0.72 ± 2.65 | 1.48 ± 2.13 | 0.19 ± 0.17 | 0.27 ± 0.23 |
| $\frac{3T}{4}$ | 0.68 ± 2.55 | 1.40 ± 2.16 | 0.19 ± 0.16 | 0.28 ± 0.21 |
| $\frac{T}{2}$ | 0.74 ± 2.49 | 1.33 ± 2.10 | 0.15 ± 0.17 | 0.21 ± 0.22 |
| $\frac{T}{4}$ | 0.87 ± 2.69 | 1.50 ± 2.11 | 0.11 ± 0.16 | 0.18 ± 0.21 |

Table 3: Deviation and correlation of test set relative to ground truth space (same analysis as in Section 5.1) of tempo and velocity curves.

In the transfer experiment, we combine pairs of ground truth performances of the same piece segment, $x_{\mathit{src}}$ and $x_{\mathit{tgt}}$, where $s_{\mathit{src}} = s_{\mathit{tgt}}$. The same testing set as the previous sections is used, and the source performance is randomly taken from the ground truth. We experimented with different transfer steps $t_0 \in \{T, \frac{3T}{4}, \frac{T}{2}, \frac{T}{4}\}$, and report the global metrics of tempo curve and velocity curve with their deviation and correlation.

What we observed in Table 3 is that transferring from a source performance slightly helps with initialization. Specifically, employing a denoiser for three-quarters of the diffusion steps (ideally preserving around one-quarter of the source’s characteristics) yielded the highest quality outcomes. However, transfer quality does not steadily improve with the retained information from the source: the $\frac{T}{4}$-step transfer results in ambiguous outputs that do not align well with the given score.
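The shallow noising of the source codec can be sketched as follows, assuming that the forward process has the standard closed form $x_{t_0} = \sqrt{\bar{\alpha}_{t_0}}\,x_{\mathit{src}} + \sqrt{1-\bar{\alpha}_{t_0}}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The linear noise schedule and its end points are assumptions, $T=1000$ follows the 1k-step process mentioned in Sec. 6, and the function name is hypothetical.

```python
import math
import random


def noise_to_step(x_src, t0, total_steps=1000, beta_start=1e-4, beta_end=0.02, seed=0):
    """Sample x_{t0} from x_src under an assumed linear beta schedule.

    Returns the noised codec and the cumulative signal level alpha_bar_{t0}.
    """
    if total_steps < 2 or not 1 <= t0 <= total_steps:
        raise ValueError("require total_steps >= 2 and 1 <= t0 <= total_steps")
    rng = random.Random(seed)
    alpha_bar = 1.0
    for t in range(t0):
        beta = beta_start + (beta_end - beta_start) * t / (total_steps - 1)
        alpha_bar *= 1.0 - beta
    signal, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t0 = [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in x_src]
    return x_t0, alpha_bar


# A shallower step keeps more of the source than a full forward pass.
x_quarter, keep_quarter = noise_to_step([[1.0, -1.0]], t0=250)
x_full, keep_full = noise_to_step([[1.0, -1.0]], t0=1000)
```

The reverse process would then start from the returned codec instead of pure noise, conditioned on $c_{(s,\mathit{tgt})}$.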

2. Does a transferred rendering sound ‘closer’ to the source or the target?

Similar to Sec. 5.2, we wish to measure the transfer proximity using the predicted perceptual features. The radar plots in Fig. 4 show the seven perceptual feature dimensions predicted by the proxy, illustrating the perceptual distance between source, target, and generated performance for three different transfer gradations, $\frac{T}{4}$, $\frac{T}{2}$, and $\frac{3T}{4}$. At $\frac{T}{4}$ steps, the predicted performance deviates from both source and target, which fits our previous observation that insufficient denoising steps result in ambiguous outputs. As the transfer step increases, there is a discernible shift in the predicted output towards the target profile across most perceptual dimensions.

Figure 4: Seven dimensions of perceptual features (AR: articulation; RC: rhythm complexity; RS: rhythm stability; TS: tonal stability; DI: dissonance; MI: minorness; ME: melodiousness) predicted using the proxy for output, source, and target, averaged across the testing set. The three plots correspond to transfer steps of $0.25T$, $0.5T$, and $0.75T$.

5.4 Effect of Varying Conditioning Weights

In this experiment, we look at the effect of the conditioning weight $w$ on the generated results. As described in Sec. 3.3 and Eq. 5, the scale of classifier-free guidance $w$ is the ratio that combines the prediction with and without (masked by 0) the $c_s$ and $c_c$ conditioning. While the conditional and unconditional models are jointly trained in the training phase, the weighting parameter $w$ is only introduced in the sampling phase and the optimal $w$ is not trivial to find. In Tab. 4, conditioning weights $w = 0.5, 1.2, 2, 3$ are compared, while other settings are the same as in Section 5.1. Experimental results are best for $w=1.2$. Interestingly, with a greater scale of classifier-free guidance, the generated results exhibit larger fluctuations in expressive parameters and less stability.

| $w$ | dev: tem ($\downarrow$) | dev: vel ($\rightarrow 0$) | cor: tem ($\uparrow$) | cor: vel ($\uparrow$) |
| --- | --- | --- | --- | --- |
| 0.5 | 1.11 ± 2.46 | -2.47 ± 1.10 | 0.11 ± 0.16 | 0.02 ± 0.22 |
| 1.2 | 0.72 ± 2.65 | 1.48 ± 2.12 | 0.19 ± 0.16 | 0.28 ± 0.21 |
| 2 | 1.33 ± 2.37 | 3.02 ± 1.53 | 0.04 ± 0.15 | 0.13 ± 0.24 |
| 3 | 1.86 ± 1.81 | 4.63 ± 1.15 | 0.04 ± 0.14 | 0.10 ± 0.23 |

Table 4: Deviation and correlation of test set relative to ground truth space (same analysis as in Section 5.1) of tempo and velocity curves.
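The role of the conditioning weight can be illustrated with one common parameterization of classifier-free guidance, $\hat{\epsilon} = \epsilon_{\mathit{uncond}} + w\,(\epsilon_{\mathit{cond}} - \epsilon_{\mathit{uncond}})$. This form is an assumption: Eq. 5 may use a different but related weighting, and the function name is hypothetical.

```python
def guided_prediction(eps_cond, eps_uncond, w):
    """Combine conditional and unconditional predictions with guidance weight w.

    Under this assumed form, w = 0 ignores the conditions, w = 1 recovers the
    conditional prediction, and w > 1 extrapolates beyond it.
    """
    return [
        [u + w * (c - u) for c, u in zip(row_c, row_u)]
        for row_c, row_u in zip(eps_cond, eps_uncond)
    ]


unconditional = guided_prediction([[1.0]], [[0.0]], w=0.0)
conditional = guided_prediction([[1.0]], [[0.0]], w=1.0)
extrapolated = guided_prediction([[1.0]], [[0.0]], w=2.0)
```

Under this form, larger weights extrapolate further from the unconditional prediction, which is consistent with the larger parameter fluctuations reported for $w = 2$ and $w = 3$.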

5.5 Qualitative study

We evaluated the naturalness and expressiveness of the rendered performances through a listening test. For samples from eight selected pieces, we compared the following: 1) two human performances, with relatively distant interpretations; 2) renderings made with Basis Mixer and VirtuosoNet, as described in Section 5.1; and 3) a rendering from the proposed model DExter. The performances (including the ground truths) were rendered to audio using a Yamaha Disklavier, which produced similar pedal/articulation-related artifacts both in the human and the machine performances. 82 participants listened to the performances and evaluated them on a 100-point Likert scale, rating the overall naturalness and expression of the output as one score. The performances used for the test can be found on the demo page.

Figure 5: Mean ratings for the eight pieces evaluated in the listening test, for each model and human (Ground Truth, GT) performances.

The results of the listening test, shown in Fig. 5, give a nuanced view of the performance rendering capabilities of the models in comparison to the ground truth. In terms of the mean rating, DExter demonstrates better performances of the pieces of Chopin, even sometimes comparable to the ground truth (Barcarolle). However, it is outperformed by Basis Mixer or VirtuosoNet in the case of older compositions (Bach Fugue and Beethoven Sonatas). Overall, in terms of the mean rating scores, there is still a gap between the generative outputs and GT (51.81), while DExter (48.54) slightly outperforms VirtuosoNet (48.31) and Basis Mixer (46.33). It is also surprising to observe that GT does not always secure the highest ratings. In the case of the Chopin Etude, at least, it might be explained by the fact that the human pianists suffer from physical limitations in technically demanding passages, while the generative models do not.

Besides the numerical ratings, we also asked three musically trained participants for explicit feedback on their ratings. In addition to some positive comments, we also received quite specific and useful negative feedback, such as (specifically referring to polyphonic music such as Bach) no clear voicing among the lines and poor balance between hands. Given currently dominating ‘flattened’ representations such as our p_codec or tokens in Transformer models, learning the vertical structure of music remains a challenge to rendering models Zhang et al. (2023).

6 Conclusion and future work

In this paper, we introduce a novel diffusion-based model, DExter, for learning and controlling performance expression in solo piano music. Besides rendering performances at a comparable level of quality to existing models in quantitative measurement of expressive characteristics, DExter is also capable of style transfer between interpretations and conditioning the rendered expression with perceptual variables.

As a future direction, we would like to improve the inference speed of DExter (currently 40 seconds of inference time for a 95-second piece on a single RTX 6000 GPU). To accelerate sampling in the 1k-step iterative process, we will explore techniques such as DDIM or learning in a latent space. We would also like to explore other conditioning inputs, such as text, that could allow for more explicit control over the rendered outputs.

7 Acknowledgment

This work is supported by the UKRI Centre for Doctoral Training in Artificial Intelligence and Music, funded by UK Research and Innovation [grant number EP/S022694/1], and by the European Research Council (ERC) under the EU’s Horizon 2020 research and innovation programme, grant agreement No. 101019375 (Whither Music?). We would also like to thank Daesam Jeong for contributing VirtuosoNet and for comparative discussion, as well as Maximillian Hofmann for the Basis Mixer trained with LSTM.

8 Appendix

8.1 MIDI to perceptual features model

The MIDI to perceptual features model is used as a proxy for human mid-level perception in the experiments described in Sec. 5.2 and Sec. 5.3. It takes in a rendered MIDI and outputs a 7-dimensional perceptual feature vector for each window of 15 seconds. The specifications are as follows:

  • Data: The data used to train this proxy model are the ASAP performance MIDI files, along with the perceptual features predicted from the ASAP performance audio by the mid-level feature recognition model of Chowdhury and Widmer (2021), for 15-second windows.

  • Representation: Each 15-second MIDI window is transformed into a piano-roll matrix of dimension $800 \times 131$ (128 pitches + 3 pedal channels), with MIDI velocity as matrix value.

  • Architecture: The network consists of two residual blocks, each containing two convolution layers. A final projection layer is attached at the end to output the 7-dimensional perceptual features.

  • Training: Adam optimizer with learning rate $10^{-3}$. After 20 epochs the training converges with a validation loss of 0.038.
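The input representation described above can be sketched as follows. That the 800 rows are evenly spaced time frames (18.75 ms each for a 15-second window), that the three pedal channels are sustain, sostenuto, and soft, and that a pedal value is held until the next event are all assumptions; the function name and the tuple layouts are hypothetical.

```python
import math


def midi_window_to_pianoroll(notes, pedals, window_start=0.0, window_len=15.0, n_frames=800):
    """Build an n_frames x 131 piano roll for one window.

    notes:  (pitch, onset, offset, velocity) tuples with times in seconds.
    pedals: (channel, time, value) tuples, channel in {0, 1, 2}.
    """
    roll = [[0] * (128 + 3) for _ in range(n_frames)]

    def to_frame(t):
        return math.floor((t - window_start) * n_frames / window_len)

    for pitch, onset, offset, velocity in notes:
        first = max(to_frame(onset), 0)
        last = min(to_frame(offset), n_frames - 1)
        for frame in range(first, last + 1):
            roll[frame][pitch] = velocity  # MIDI velocity as matrix value
    # Later events overwrite earlier ones, so each value is held until the next.
    for channel, time, value in sorted(pedals, key=lambda event: event[1]):
        for frame in range(max(to_frame(time), 0), n_frames):
            roll[frame][128 + channel] = value
    return roll


# Middle C held for one second, sustain pedal pressed halfway through the window.
roll = midi_window_to_pianoroll([(60, 0.0, 1.0, 80)], [(0, 7.5, 127)])
```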

References

  • Widmer et al. (2003) Widmer, G.; Dixon, S.; Goebl, W.; Pampalk, E.; Tobudic, A. In Search of the Horowitz Factor. AI Magazine 2003, 24, 111–130.
  • Cancino-Chacón et al. (2023) Cancino-Chacón, C.; Peter, S.; Hu, P.; Karystinaios, E.; Henkel, F.; Foscarin, F.; Widmer, G. The ACCompanion: Combining Reactivity, Robustness, and Musical Expressivity in an Automatic Piano Accompanist. In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, Macau, China, 2023; [2304.12939]. https://doi.org/10.24963/ijcai.2023/641.
  • Morsi et al. (2024) Morsi, A.; Zhang, H.; Maezawa, A.; Dixon, S.; Serra, X. Simulating and Validating Piano Performance Mistakes for Music Learning Context. In Proceedings of the Sound and Music Computing Conference (SMC), 2024.
  • Aljanaki (2018) Aljanaki, A. A data-driven approach to mid-level perceptual musical feature modeling. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018; [arXiv:1806.04903v1].
  • Chowdhury et al. (2019) Chowdhury, S.; Vall, A.; Haunschmid, V.; Widmer, G. Towards explainable music emotion recognition: The route via Mid-level features. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), Delft, Netherlands, 2019; [1907.03572].
  • Widmer and Goebl (2004) Widmer, G.; Goebl, W. Computational models of expressive music performance: The state of the art. Journal of New Music Research 2004, 33, 203–216.
  • Cancino-Chacón et al. (2018) Cancino-Chacón, C.E.; Grachten, M.; Goebl, W.; Widmer, G. Computational Models of Expressive Music Performance: A Comprehensive and Critical Review. Frontiers in Digital Humanities 2018, 5, 1–23. https://doi.org/10.3389/fdigh.2018.00025.
  • Kirke and Miranda (2013) Kirke, A.; Miranda, E. Guide to Computing for Expressive Music Performance; 2013. https://doi.org/10.1007/978-1-4471-4123-5.
  • Cancino-Chacón (2018) Cancino-Chacón, C.E. Computational Modeling of Expressive Music Performance with Linear and Non-linear Basis Function Models. PhD thesis, Johannes Kepler University Linz, 2018.
  • Maezawa et al. (2019) Maezawa, A.; Yamamoto, K.; Fujishima, T. Rendering music performance with interpretation variations using conditional variational RNN. Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR) 2019.
  • Jeong et al. (2019) Jeong, D.; Kwon, T.; Kim, Y.; Lee, K.; Nam, J. VirtuosoNet: A Hierarchical RNN-based System for Modeling Expressive Piano Performance. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), Delft, Netherlands, 2019.
  • Rhyu et al. (2022) Rhyu, S.; Kim, S.; Lee, K. Sketching the Expression: Flexible Rendering of Expressive Piano Performance with Self-Supervised Learning. In Proceedings of the 23rd International Society on Music Information Retrieval (ISMIR), Bengaluru, India, 2022; [2208.14867].
  • Borovik and Viro (2023) Borovik, I.; Viro, V. ScorePerformer: Expressive Piano Performance Rendering with Fine-grained Control. In Proceedings of the 24th International Society on Music Information Retrieval (ISMIR), Milan, Italy, 2023.
  • Zhang and Dixon (2023) Zhang, H.; Dixon, S. Disentangling the Horowitz Factor: Learning Content and Style From Expressive Piano Performance. In Proceedings of the ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023. https://doi.org/10.1109/icassp49357.2023.10095009.
  • Peter et al. (2023) Peter, S.D.; Cancino-Chacón, C.E.; Widmer, G. Sounding Out Reconstruction Error-Based Evaluation of Generative Models of Expressive Performance. In Proceedings of the Digital Libraries for Musicology (DLfM), Milan, Italy, 2023. https://doi.org/10.1145/3625135.3625141.
  • Plasser et al. (2023) Plasser, M.; Peter, S.; Widmer, G. Discrete Diffusion Probabilistic Models for Symbolic Music Generation. In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, Macau, China, 2023; [2305.09489]. https://doi.org/10.24963/ijcai.2023/648.
  • Ramesh et al. (2021) Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning (ICML), Online, 2021; [2102.12092].
  • Kong et al. (2021) Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. DiffWave: a Versatile Diffusion Model for Audio Synthesis. In Proceedings of the ICLR 2021 - 9th International Conference on Learning Representations, Vienna, Austria, 2021; [2009.09761].
  • Chen et al. (2021) Chen, N.; Zhang, Y.; Zen, H.; Weiss, R.J.; Norouzi, M.; Chan, W. WaveGrad: Estimating Gradients for Waveform Generation. In Proceedings of the ICLR 2021 - 9th International Conference on Learning Representations, Vienna, Austria, 2021; [2009.00713].
  • Kim et al. (2022) Kim, H.; Kim, S.; Yoon, S. Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance. In Proceedings of Machine Learning Research, Baltimore, USA, 2022; [2111.11755].
  • Hawthorne et al. (2022) Hawthorne, C.; Simon, I.; Roberts, A.; Zeghidour, N.; Gardner, J.; Manilow, E.; Engel, J. Multi-instrument Music Synthesis with Spectrogram Diffusion. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Bengaluru, India, 2022; [2206.05408].
  • Mittal et al. (2021) Mittal, G.; Engel, J.; Hawthorne, C.; Simon, I. Symbolic Music Generation with Diffusion Models. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), Online, 2021.
  • Roberts et al. (2018) Roberts, A.; Engel, J.; Raffel, C.; Hawthorne, C.; Eck, D. A hierarchical latent vector model for learning long-term structure in music. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholm, Sweden, 2018; [1803.05428].
  • Zhang et al. (2023) Zhang, H.; Karystinaios, E.; Dixon, S.; Widmer, G.; Cancino-Chacón, C.E. Symbolic Music Representations for Classification Tasks: A Systematic Evaluation. In Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), Milan, Italy, 2023; [2309.02567].
  • Cheuk et al. (2023) Cheuk, K.W.; Sawata, R.; Uesaka, T.; Murata, N.; Takahashi, N.; Takahashi, S.; Herremans, D.; Mitsufuji, Y. DiffRoll: Diffusion-Based Generative Music Transcription with Unsupervised Pretraining Capability. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023.
  • Min et al. (2023) Min, L.; Jiang, J.; Xia, G.; Zhao, J. Polyffusion: A Diffusion Model for Polyphonic Score Generation with Internal and External Controls. In Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), Milan, Italy, 2023; [2307.10304].
  • Ho et al. (2020) Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 2020; [2006.11239].
  • Perez et al. (2018) Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, New Orleans, USA, 2018; [1709.07871]. https://doi.org/10.1609/aaai.v32i1.11671.
  • Kim and Serra (2023) Kim, H.; Serra, X. DiffVel: Note-Level MIDI Velocity Estimation for Piano Performance by a Double Conditioned Diffusion Model. In Proceedings of the 16th International Symposium on Computer Music Multidisciplinary Research (CMMR), Tokyo, Japan, 2023.
  • Ho and Salimans (2021) Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. In Proceedings of the NeurIPS Workshop on Deep Generative Models and Downstream Applications, Online, 2021; [2207.12598].
  • Goebl (1999) Goebl, W. The Vienna 4x22 Piano Corpus, 1999. https://doi.org/10.21939/4X22.
  • Peter et al. (2023) Peter, S.D.; Cancino-Chacón, C.E.; Foscarin, F.; Henkel, F.; Widmer, G. Automatic Note-Level Alignments in the ASAP Dataset. Transactions of the International Society for Music Information Retrieval (TISMIR) 2023, 6, 27–42. https://doi.org/10.5334/tismir.149.
  • Zhang et al. (2022) Zhang, H.; Tang, J.; Rafee, S.; Dixon, S.; Fazekas, G. ATEPP: A Dataset of Automatically Transcribed Expressive Piano Performance. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Bengaluru, India, 2022.
  • Gillick et al. (2021) Gillick, J.; Yang, J.; Cella, C.E.; Bamman, D. Drumroll Please: Modeling Multi-Scale Rhythmic Gestures with Flexible Grids. Transactions of the International Society for Music Information Retrieval 2021, 4, 156–166. https://doi.org/10.5334/tismir.98.
  • Cancino-Chacón et al. (2022) Cancino-Chacón, C.; Peter, S.D.; Karystinaios, E.; Foscarin, F.; Grachten, M.; Widmer, G. Partitura: A Python Package for Symbolic Music Processing. In Proceedings of the Music Encoding Conference (MEC), Halifax, Canada, 2022; [2206.01071].
  • Chowdhury (2022) Chowdhury, S. Modelling Emotional Expression in Music Using Interpretable and Transferable Perceptual Features. PhD thesis, Johannes Kepler University Linz, 2022.
  • Chowdhury and Widmer (2021) Chowdhury, S.; Widmer, G. On Perceived Emotion in Expressive Piano Performance: Further Experimental Evidence for the Relevance of Mid-Level Perceptual Features. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), Online, 2021.
  • Zhang et al. (2018) Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. MixUp: Beyond empirical risk minimization. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, Vancouver, Canada, 2018; [1710.09412].
  • Goebl (2001) Goebl, W. Melody lead in piano performance: Expressive device or artifact? The Journal of the Acoustical Society of America 2001, 110, 641. https://doi.org/10.1121/1.1376133.
  • Bresin and Umberto Battel (2000) Bresin, R.; Umberto Battel, G. Articulation Strategies in Expressive Piano Performance: Analysis of Legato, Staccato, and Repeated Notes in Performances of the Andante Movement of Mozart’s Sonata in G Major (K.545). Journal of New Music Research 2000, 29, 211–224. https://doi.org/10.1076/jnmr.29.3.211.3092.
  • Kosta et al. (2018) Kosta, K.; Bandtlow, O.F.; Chew, E. Dynamics and relativity: Practical implications of dynamic markings in the score. Journal of New Music Research 2018, 47, 438–461. https://doi.org/10.1080/09298215.2018.1486430.
  • Liu et al. (2023) Liu, H.; Chen, Z.; Yuan, Y.; Mei, X.; Liu, X.; Mandic, D.; Wang, W.; Plumbley, M.D. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. In Proceedings of the 40th International Conference on Machine Learning (ICML), Hawaii, USA, 2023; [2301.12503].
  • Zhang et al. (2023) Zhang, Y.; Huang, N.; Tang, F.; Huang, H.; Ma, C.; Dong, W.; Xu, C. Inversion-based Style Transfer with Diffusion Models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023; [2211.13203]. https://doi.org/10.1109/CVPR52729.2023.00978.