PianoMotion10M: Dataset and Benchmark
for Hand Motion Generation in Piano Performance

Qijun Gan
Zhejiang University
[email protected]
Song Wang
Zhejiang University
[email protected]
Shengtao Wu
Hangzhou Dianzi University
[email protected]
Jianke Zhu
Zhejiang University
[email protected]
Abstract

Recently, artificial intelligence techniques for education have received increasing attention, yet designing effective musical instrument instruction systems remains an open problem. Although key presses can be directly derived from sheet music, the transitional movements between key presses require more extensive guidance in piano performance. In this work, we construct a piano-hand motion generation benchmark to guide hand movements and fingerings for piano playing. To this end, we collect an annotated dataset, PianoMotion10M, consisting of 116 hours of piano playing videos from a bird’s-eye view with 10 million annotated hand poses. We also introduce a powerful baseline model that generates hand motions from piano audio through a position predictor and a position-guided gesture generator. Furthermore, a series of evaluation metrics is designed to assess the performance of the baseline model, including motion similarity, smoothness, positional accuracy of the left and right hands, and overall fidelity of the movement distribution. Although piano key presses corresponding to music scores or audio are already accessible, PianoMotion10M aims to provide guidance on piano fingering for instruction purposes. The dataset and source code can be accessed at https://agnjason.github.io/PianoMotion-page.

1 Introduction

Artificial intelligence techniques have significantly improved the process of learning, enabling individuals to enhance their skills under the guidance of an AI coach [64, 66]. This can be extended to learning to play musical instruments. In particular, piano performance requires a profound understanding of the underlying relationship between musical compositions and their corresponding physical motions. Just as rigorous practice and training programs are necessary for athletes to master a diverse range of expressive human poses, extensive practice is required to achieve proficiency in piano fingering and hand movement. The need for accessible playing guidance has spurred the development of AI piano coaches.

Large-scale piano-motion datasets are the foundation of a nuanced approach to motion generation, offering valuable guidance for physical performance and musical expression. The computational challenge of motion generation lies in capturing the nonlinear relationship between musical pieces and the intricate hand motions required for piano playing. The hand poses for the same note vary across different melodies. Moreover, the dynamic nature of musical expression demands continuous motion, which is difficult to learn from small datasets. These limitations underscore the urgent need for a large-scale piano-motion dataset.

To address the lack of datasets for guiding hand movements and fingerings in piano playing, we introduce a large-scale 3D piano-motion dataset named PianoMotion10M. As shown in Fig. 1, PianoMotion10M contains piano audio tracks, Musical Instrument Digital Interface (MIDI) files, and annotated hand motions with their corresponding videos, meticulously collected from the Internet. It comprises 1,966 pairs of video and music, with a total duration of 116 hours and 10 million annotated frames. The parametric MANO hand model [47] is employed to represent hand gestures. Our dataset covers a diverse range of music styles and piano techniques, which addresses the demands of various preferences and skill levels.

Figure 1: Overview of our framework. We collect videos of professional piano performances from the Internet and process them to construct a large-scale dataset, PianoMotion10M, which comprises piano music, MIDI files and hand motions. Building upon this dataset, we establish a benchmark for generating hand motions from piano music.

Traditional applications like PianoPlayer [45] are adept at generating static hand gestures and positions for piano scores, typically using classifiers to predict the proper fingering combinations. However, they often fail to capture the diversity and continuity inherent in the piano performances of PianoMotion10M. Both rule-based methods [5, 33] and HMM-based approaches [69, 41] can estimate fingerings, but they primarily focus on the local fingering constraints of consecutive notes. Consequently, they often overlook crucial information such as long-range fingering relationships, whereas our task aims to estimate the motions of long clips.

To address these limitations, we present a novel baseline model that demonstrates the effectiveness of PianoMotion10M by generating realistic hand motions from piano melodies. Given a piece of piano music, our model locates the positions of both hands and generates a long sequence of hand gestures for the performance. It learns the music-position correlation through an efficient position predictor and produces continuous gestures with a position-guided gesture generator based on a diffusion probabilistic model. To assess our baseline model, we propose several evaluation metrics: Frechet Gesture Distance and Wasserstein Gesture Distance to measure the fidelity of each hand's motions, Frechet Inception Distance with a pre-trained auto-encoder to assess the motion quality of both hands together, Position Distance to assess the accuracy of hand positioning, and Smoothness of the generated motions.

In summary, our main contributions are: 1) A large-scale piano-motion dataset, PianoMotion10M, which comprises 116 hours of music and 10 million annotated frames with hand poses. To the best of our knowledge, it is the first dataset integrating piano music with its corresponding hand motions, which facilitates the tasks of 3D hand motion generation from piano audio tracks and piano music generation conditioned on hand motions. 2) A benchmark built on PianoMotion10M, with a series of evaluation metrics to assess the effectiveness of hand motion generation, including the accuracy of positions and the fidelity of gestures. 3) A novel baseline model that bridges piano music and hand motions, which estimates hand locations with a position predictor and generates hand gesture sequences through a position-guided gesture generator.

2 Related Work

Motion-music Datasets. While multi-modal datasets [52, 55, 36, 31] have become key to various learning tasks, there remains a significant gap in the availability of repositories specifically designed for music-conditional motion generation. Although the existing hand gesture datasets [39, 38, 30, 16, 67] contain a large number of image-hand gesture pairs, they do not include audio or other related information. This limitation makes them unsuitable for generative tasks, since they mainly focus on hand reconstruction and pose estimation. Recently, datasets like AIST++ [32] and TikTok [77] have been tailored for music-dance learning, but they provide limited music segments, typically less than 5 hours in duration. Moryossef et al. [40] automatically detect which fingers press the piano keys and provide a piano-fingering dataset. However, it does not provide continuous hand gesture movements. Therefore, there is a crucial need to build a piano-motion dataset specifically tailored for motion generation tasks. To this end, we introduce the PianoMotion10M dataset, which comprises extensive piano music and the corresponding hand motion annotations.

3D Human Motion Synthesis. The problem of generating realistic and controllable 3D human motion sequences has been a subject of long-standing study. By taking advantage of 2D keypoint detection [8], the synthesis of 2D skeletons has been extensively explored [46, 53, 15]. Considerable research efforts have been devoted to 2D speech-driven head generation for facial mouth and lip motion generation [72, 10, 19, 25, 21], which usually employ either image-driven or voice-driven methods to produce realistic videos of speaking individuals. However, the expressive capabilities of 2D pose skeletons are limited and they are not applicable to 3D character models. Recent methods for full-body 3D dance generation have utilized LSTM [58, 68, 78], GANs [57, 17] or transformer encoders [32, 24, 54]. To generate vivid talking head videos, extensive research has been conducted in the field of speech-driven 3D facial animation [7, 13, 60, 14, 73]. EmoTalk [44] animates emotional 3D faces from speech input by generating controllable personal and emotional styles. Tian et al. [61] introduce a speed controller and a face region controller to enhance stability during the head generation process.

Methods for hand motion generation fall primarily into rule-based methods [9, 23, 56] and data-driven approaches [29, 12, 34, 70, 6]. For instance, Speech2Gesture [17] utilizes conditional generative adversarial networks to generate personalized 2D keypoints from audio. Ahuja et al. [1] propose a method for personalized motion transfer. Ao et al. [62] learn the mapping between speech and gestures from data using a combined network structure based on the vector quantized variational auto-encoder (VQ-VAE). Beyond 2D keypoint generation, TriModal [70] extracts different upper body movements from TED talks and designs an LSTM-based neural network conditioned on audio, text, and identity to generate co-speech gestures.

Recently, diffusion models have achieved promising results in generating human motions. Previous works such as MDM [59] and MotionDiffuse [74] have produced realistic motion inspired by denoising diffusion probabilistic models (DDPM) [20]. PhysDiff [71] extends MDM by imposing physical constraints. MLD [11] utilizes latent carrier DDPM for forward denoising and reverse diffusion in motion latent space. MAA [3] enhances the performance of non-distributed data by pre-training diffusion models. Zhang et al. [75] introduce retrieval-enhanced DDPM, which improves the text-to-motion functionality in distribution.

3 PianoMotion10M Dataset

Mapping piano music to hand motions is a formidable challenge due to the significant influence of note sequences and positions on hand movements and fingering. A lack of diverse data may lead to inferior performance in estimating the various hand poses in piano playing. To capture this variability, we present the first large-scale piano-motion dataset, PianoMotion10M, which comprises 1,966 piano performance videos along with 10 million hand poses and their corresponding MIDI files, resulting in an overall duration of 116 hours. Fig. 2 presents an example from our dataset. These videos are segmented into 16,739 individual clips with a length of 30 seconds. To ensure comprehensive evaluation, we extract 7,519 clips for training, 821 for validation and 8,399 for testing. Detailed comparisons with the existing datasets on hands and 3D human motions are summarized in Tab. 1.

Table 1: Comparison between different hand and motion datasets. The proposed PianoMotion10M dataset consists of piano music with corresponding hand poses for hand motion generation. Existing hand-image datasets are listed in the first four rows, and music-motion datasets are presented in the subsequent four rows for reference.
  Dataset Pose Size Subject Music MIDI Duration(hour)
FreiHAND [79] 134K 32 -
InterHand2.6M [39] 2.6M 27 -
RGB2Hands [65] 1K 2 -
Re:InterHand [37] 1.5M 10 -
GrooveNet [2] - 1 0.38
DanceNet [78] - 2 0.96
EA-MUD [57] - - 0.35
AIST++ [32] 10.1M 30 5.19
PianoMotion10M 10.5M 14 116.16
 
Figure 2: Illustration of sample from PianoMotion10M. Each sample in our dataset includes audio, hand pose annotations, and a MIDI file along with the corresponding Bilibili video ID.

3.1 Data Collection

There is a wealth of videos and live streams dedicated to musical instrument performances and tutorials on the Internet. Since each individual has a unique playing style, we first select 14 piano experts from Bilibili (https://www.bilibili.com), one of the most popular video-sharing platforms in China, and collect a total of 3,647 candidate videos. To ensure consistency and enhance the quality of our dataset, we conduct pre-processing based on five pivotal factors: resolution, audio quality, camera perspective, presence of multiple individuals, and visibility of hands. We manually select pure piano music to ensure that the audio contains no human vocals or sounds from other instruments. Moreover, videos with a resolution lower than $1080\times 1920$ are discarded. To enhance the observation of hand movements and gestures, we choose videos with a bird’s-eye view to minimize hand occlusion. Furthermore, we remove videos in which hand regions are frequently obstructed or invisible during the performance. Additionally, we exclude videos with multiple piano players, which are beyond the scope of this dataset. By addressing these factors, the overall quality and coherence of the dataset are substantially enhanced. This pre-processing stage yields a collection of 1,966 high-quality raw videos, each showing a single individual playing the piano, captured from a bird’s-eye view with pure piano audio.

3.2 Data Annotations

MIDI is a universal digital protocol for storing musical data across various musical devices, which serves as a digital representation of musical notes, volume, tempo, and other performance parameters. Once the video and pure piano audio pairs are collected, we transcribe the piano performances into MIDI files using a state-of-the-art automatic piano transcription method [28]. To ensure the accuracy of the conversion results, we replay the MIDI files and compare them with the original music tracks. Files with high discrepancies are adjusted.

Hand Pose is captured via the parametric hand model MANO [47], which serves as our hand prior model. It maps the pose parameter $\theta\in\mathbb{R}^{J\times 3}$ with $J$ per-bone parts and the shape parameter $\rho\in\mathbb{R}^{10}$ onto a template mesh $\bar{\mathcal{M}}$ with vertices $V$. The MANO model enables the simulation of various hand gestures and movements during piano performances, which provides an effective way to study the relationships among hand gestures in piano playing. Due to the non-uniformity of hand sizes and positions in the collected videos, we first employ the MediaPipe hand detection framework [35] to obtain the bounding box of each hand region. Video frames are cropped according to the detected hand bounding boxes to enhance the robustness of the results. Hand poses in the cropped frames are then annotated using HaMeR [42], which follows a fully transformer-based architecture and reconstructs hand models with increased accuracy and robustness. All images in which MediaPipe detects hands are annotated with hand poses by HaMeR, resulting in a dataset of 10 million image-pose pairs.

To enhance the smoothness and continuity of our dataset, the generated hand poses need to be cleaned and refined. The results of MediaPipe and HaMeR are accurate in most cases, while some inferior results may occur due to rapid motion and image blurring. The Hampel filter [43] with a window size of 20 is utilized to identify these outliers. The outliers and the timestamps where hands are undetected are first labeled as missing values. Within a short period, hand movement can be approximated as motion with constant speed, so small gaps can be interpolated from both sides. Accordingly, missing segments with a frame length $\delta$ of less than 30 frames are filled by linear interpolation with respect to their surrounding values. The remaining missing segments are considered periods when the hand is invisible. To ensure the reliability of detected hands, a similar strategy is employed to label observations with excessively short duration ($\delta<15$) as invisible. Finally, to ensure the smoothness of hand motions, we apply a Savitzky-Golay filter [51] for data smoothing. The annotated hand poses are manually checked to ensure their quality.
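The cleaning steps above can be sketched for a single pose channel as follows. This is a minimal illustration and not the authors' released code: the function names, the 3-sigma Hampel threshold, and the use of NaN to mark missing frames are assumptions, while the window size (20), the gap limit (30 frames) and the minimum visible duration (15 frames) follow the text. The final Savitzky-Golay smoothing is omitted here; its window length and polynomial order are not stated in the paper.

```python
import numpy as np

def hampel_mask(x, window=20, n_sigmas=3.0):
    """Flag samples that deviate strongly from the local median.
    n_sigmas=3.0 is an assumed threshold; the paper only gives the window size."""
    x = np.asarray(x, dtype=float)
    half = window // 2
    outlier = np.zeros(x.shape, dtype=bool)
    for i in range(len(x)):
        if np.isnan(x[i]):
            continue
        seg = x[max(0, i - half): i + half + 1]
        seg = seg[~np.isnan(seg)]
        med = np.median(seg)
        mad = 1.4826 * np.median(np.abs(seg - med))  # robust std estimate
        outlier[i] = np.abs(x[i] - med) > n_sigmas * mad
    return outlier

def runs(mask):
    """Return (start, stop) pairs, stop exclusive, for each run of True values."""
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[::2], edges[1::2]))

def clean_track(x, max_gap=30, min_visible=15, window=20):
    """Clean one pose channel; NaN marks frames where the hand is missing."""
    x = np.asarray(x, dtype=float).copy()
    x[hampel_mask(x, window)] = np.nan          # outliers become missing values
    t = np.arange(len(x))
    for start, stop in runs(np.isnan(x)):
        interior = start > 0 and stop < len(x)  # need valid neighbours on both sides
        if interior and stop - start < max_gap:
            x[start:stop] = np.interp(
                t[start:stop], [start - 1, stop], [x[start - 1], x[stop]]
            )
    for start, stop in runs(~np.isnan(x)):
        if stop - start < min_visible:          # too short to be a reliable detection
            x[start:stop] = np.nan
    return x
```

In this sketch a short gap or an isolated spike is replaced by a straight line between its neighbours, whereas a long gap stays missing and any brief detection next to it is discarded as unreliable.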

3.3 Data Statistics

The dataset consists of contributions from multiple expert subjects, each of whom provides a different amount of data in terms of videos, clips, duration, and annotated frames. Tab. 2 presents the detailed statistics on the distribution of subjects within our PianoMotion10M dataset.

Table 2: Statistics on the distribution of subjects in the PianoMotion10M dataset, where subject names denote the identity ID of experts.
  Subject Name   Videos   Clips   Time (sec)      Frames
  1467634           337   4,359      103,293   2,237,181
  2084102325         11     103        2,766      70,243
  36760114           22     160        4,772     106,037
  37367458          114     859       20,802     571,525
  403444513          12     171        4,245      99,973
  434762078          74     128        2,948      80,607
  442401135         275   1,788       52,030   1,209,905
  470175873          14     100        2,450      64,946
  478315001         285   2,674       62,615   1,792,707
  494725787           1      12          586       7,488
  66685747          535   5,136      130,352   3,520,370
  676539782          19     359       11,314     187,917
  688183660         264     872       19,074     564,699
  864712              3      18          923      13,569
  Total Videos: 1,966, Clips: 16,739, Total Duration (hour): 116.16, Annotated Frames: 10,527,167
 

The whole piano-motion dataset, PianoMotion10M, consists of 1,966 videos with approximately 116 total hours of footage. Each piece of music is segmented into 30-second clips at 24-second intervals, resulting in a total of 16,739 clips. Note that we discard those clips where hand visibility is below 80%. Our dataset has 14 subjects with different playing styles. This variability ensures a rich diversity of playing techniques and music styles, which is essential for training robust models to predict piano hand gestures from musical pieces. All videos in our dataset are publicly accessible with provided video IDs on the Bilibili website.
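The segmentation rule described above (30-second clips taken every 24 seconds, dropping clips with less than 80% hand visibility) can be illustrated with a short sketch. The function names are hypothetical, and two details are assumptions rather than facts from the paper: a trailing segment shorter than 30 seconds is not emitted, and visibility is measured as the fraction of frames flagged visible.

```python
def clip_starts(duration_s, clip_len=30.0, stride=24.0):
    """Start times (seconds) of fixed-length clips taken every `stride` seconds.
    Assumption: an incomplete trailing clip is dropped."""
    starts = []
    t = 0.0
    while t + clip_len <= duration_s:
        starts.append(t)
        t += stride
    return starts

def keep_clip(visible_flags, min_ratio=0.8):
    """Keep a clip only if hands are visible in at least `min_ratio` of its frames.
    Assumption: visibility is a per-frame boolean flag."""
    return sum(visible_flags) / len(visible_flags) >= min_ratio
```

With a 24-second stride and a 30-second length, consecutive clips overlap by 6 seconds, so a 90-second piece yields clips starting at 0, 24 and 48 seconds under these assumptions.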

4 Baselines

To tackle the challenging task of generating hand motions synchronized with piano audio tracks, we introduce a novel motion generation framework upon the PianoMotion10M dataset. As illustrated in Fig. 3, our framework consists of a position predictor and a gesture generator. The position predictor extracts the hand positions from piano music and integrates them as contextual input for the gesture generator. By leveraging a DDPM model [20], the gesture generator estimates hand pose sequence based on the piano audio and the predicted positions. Further details on each component are elaborated in the following subsections.

Figure 3: Illustration of our baseline model. Given a piece of piano music, our baseline model estimates the hand motions by predicting hand positions and generating hand gestures.

Problem Formulation. Given a piano audio piece $A$ with $N$ frames, our objective is to obtain the hand position sequence $P\in\mathbb{R}^{N\times 3\times 2}$ and hand gestures $\Theta\in\mathbb{R}^{N\times J\times 3\times 2}$. The hand position sequence $P$ encompasses the 3D coordinates of the left and right hands. The hand gestures $\Theta$ consist of the Euler angles at each joint of both hands.

Due to the highly nonlinear relationship between acoustic signals and hand gestures, it is very challenging to estimate motions through discriminative models [44], which usually leads to an average pose, as demonstrated in Section 5.2. To address this issue, a hand position predictor is introduced to estimate the continuous changes in hand positions. Moreover, a generative model is utilized to reconstruct hand gestures from a piano music piece. The task of hand motion generation thus becomes simpler and more tractable by disentangling it into hand position estimation and gesture generation. To extract the audio features $f_a$ from the audio $A$, we make use of a pre-trained audio feature extractor $\Phi_a$ [4, 22], which is formulated as $f_a=\Phi_a(A)$, $f_a\in\mathbb{R}^{N\times C}$, where $C$ is the feature dimension of $\Phi_a$.

4.1 Position Predictor

The position predictor estimates the 3D positions of the left and right hands, six dimensions in total. Since the audio features $f_a$ cannot be mapped directly to positions, a feature embedding module serves as our position decoder $\Phi_p$ to extract latent features. It projects the sequential audio features $f_a$ onto latent features $f_p$ that carry more comprehensive temporal information, formulated as $f_p=\Phi_p(f_a)$, $f_p\in\mathbb{R}^{N\times 512}$. Subsequently, a linear mapping projects the features $f_p$ onto the output positions $P$.

In our experiments, we can make use of either Transformer [63] or State-Space Model (SSM) [18] as our feature embedding module. Transformer leverages self-attention mechanisms to effectively capture long-range dependencies and contextual information in order to learn temporal relationships. Recently, a different representation inspired by classical SSM [26] is proposed to replace the attention mechanism, which is built upon a more contemporary Selective Structured State Space Model (S6) [18] suitable for deep learning. By sharing a similar architecture to the classical RNN, it can efficiently capture information from previous inputs.

4.2 Position-guided Gesture Generator

Leveraging recent achievements [76, 3] in motion generation, our approach incorporates a diffusion probabilistic model [20]. By learning to denoise, this model effectively captures the complex distribution of hand motions observed in large-scale piano-motion datasets and is able to generate motions under different conditions. Starting with a clean sample $\Theta_0$ of a motion sequence, the forward diffusion process establishes a Markov chain that gradually adds noise to $\Theta_0$ in $T$ steps, which generates a series of noisy samples $\Theta_1,\dots,\Theta_T$ as

$$q(\Theta_t \mid \Theta_0)=\mathcal{N}\big(\Theta_t;\ \sqrt{\bar{\alpha}_t}\,\Theta_0,\ (1-\bar{\alpha}_t)I\big), \tag{1}$$

where $\mathcal{N}$ denotes the Gaussian distribution, $\bar{\alpha}_t=\prod_{s=1}^{t}(1-\beta_s)$, and $\beta_s$ is the variance schedule of the added noise. A model parameterized by a deep neural network $G$ is then trained to learn the reverse process as another Markov chain, i.e., the mapping $p(\Theta_{t-1}|\Theta_t)$, to sequentially denoise samples over $T$ steps. Specifically, the denoising model $G$ consists of a 4-layer U-Net [48] with 256, 512, 1024, and 2048 dimensions for the respective layers.
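As a small illustration of the forward process in Eq. (1), the sketch below computes $\bar{\alpha}_t$ from a variance schedule and draws a noisy sample in closed form. NumPy is used here only for brevity. The linear schedule from 1e-4 to 0.02 with $T=1000$ and the joint count $J=16$ are assumptions for the example, since the paper does not state them.

```python
import numpy as np

def alpha_bar(betas):
    """Cumulative product of (1 - beta_s), i.e. alpha_bar_t for t = 1..T."""
    return np.cumprod(1.0 - np.asarray(betas, dtype=float))

def q_sample(theta0, t, abar, rng):
    """Draw Theta_t ~ N(sqrt(abar_t) * Theta_0, (1 - abar_t) I); t is 1-indexed."""
    eps = rng.standard_normal(theta0.shape)
    a = abar[t - 1]
    return np.sqrt(a) * theta0 + np.sqrt(1.0 - a) * eps, eps

# Assumed schedule and shapes, for illustration only.
T = 1000
abar = alpha_bar(np.linspace(1e-4, 0.02, T))
rng = np.random.default_rng(0)
theta0 = np.zeros((240, 16, 3, 2))   # N=240 frames (8 s at 30 FPS), hypothetical J=16
theta_t, eps = q_sample(theta0, t=T, abar=abar, rng=rng)
```

Because $\bar{\alpha}_t$ decreases monotonically towards zero under such a schedule, the sample at $t=T$ is close to pure Gaussian noise, which is the starting point for the learned reverse process.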

To reduce the noise in piano audio, the audio features $f_a$ are simultaneously fed into the denoising neural network $G$ as conditions. Similar to the position decoder $\Phi_p$, a gesture decoder $\Phi_g$ is utilized to extract gesture features $f_g$ from the audio features $f_a$ as $f_g=\Phi_g(f_a)$. Considering that the finger movements at different positions for the same pitch are distinct, $G$ requires the guidance of hand positions $P$ from the position predictor. This enhances the fidelity of the generation process. As an additional condition, the embedding of the time step $t$ is concatenated in the denoising process. The denoising process can be formulated as follows

$$\hat{\Theta}_0=G(\Theta_t,t;f_g,P), \tag{2}$$

where $\hat{\Theta}_0\in\mathbb{R}^{N\times J\times 3}$ denotes the resulting hand motions over $N$ frames.

4.3 Implementation Details

We employ a two-stage scheme to train our proposed network. In the first stage, the position predictor is trained using the position error $\mathcal{L}_p$ and the velocity loss $\mathcal{L}_v$. The position error $\mathcal{L}_p$ computes the $L_1$ loss between the predicted position $\hat{P}$ and the ground truth $P$, expressed as $\mathcal{L}_p=||\hat{P}-P||_1$. Inspired by [44], the velocity loss is adopted to induce temporal stability for generating smooth movements, which is formulated as $\mathcal{L}_v=||(\hat{P}_n-\hat{P}_{n-1})-(P_n-P_{n-1})||_2$, where $n$ denotes the $n$-th frame in a motion sequence. Specifically, our model is trained on data from subject 1467634 and subject 66685747, which share a similar piano keyboard layout. In the second stage, the parameters of the position predictor are frozen, and the estimated positions are employed as guidance for the gesture generator. As in [50], the denoising process is modified from noise prediction to velocity prediction. During the whole training process, the parameters of the audio feature extractor $\Phi_a$ are frozen.
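The two first-stage losses can be written out as a brief sketch. NumPy arrays stand in for PyTorch tensors so that the example stays self-contained. The function names are hypothetical, and the reductions (mean over all entries for $\mathcal{L}_p$, mean over frames of the per-frame norm for $\mathcal{L}_v$) are assumptions, since the paper does not specify how the norms are aggregated or weighted.

```python
import numpy as np

def position_loss(p_hat, p):
    """L1 position error between predicted and ground-truth positions.
    Assumption: averaged over all frames, coordinates and hands."""
    return float(np.abs(p_hat - p).mean())

def velocity_loss(p_hat, p):
    """L2 error between frame-to-frame displacements of prediction and ground truth.
    Inputs have shape (N, 3, 2); assumption: norm per frame, then mean over frames."""
    dv = np.diff(p_hat, axis=0) - np.diff(p, axis=0)   # (N-1, 3, 2)
    dv = dv.reshape(len(dv), -1)                        # flatten both hands per frame
    return float(np.linalg.norm(dv, axis=1).mean())
```

A prediction that is offset from the ground truth by a constant has a non-zero position error but zero velocity loss, which shows that the velocity term only penalizes mismatched frame-to-frame motion.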

The baseline model is implemented with PyTorch. We resample the inputs to 30 FPS through interpolation, and each music segment lasts 8 seconds. Both stages are trained for 100,000 iterations, at learning rates of 2e-5 and 5e-5, respectively. We conducted all the experiments on a PC with a single NVIDIA RTX 3090Ti GPU, which has 24GB of GPU RAM.

5 Benchmark

5.1 Evaluation Metrics

To assess the performance of our proposed baseline, we employ several evaluation metrics to examine the effectiveness of hand poses generated by the input piano music, which are crucial in understanding the relationship between piano melody and its corresponding playing motions.

Frechet Inception Distance (FID). Frechet Inception Distance is introduced to measure the Frechet distance between the feature vectors of the prediction and the ground truth. We pre-train an auto-encoder [70] to project the motion sequences of both hands onto a latent space. The FID is adopted to assess the fidelity of the overall motions generated by our baseline model.

Frechet Gesture Distance (FGD). Unlike FID, Frechet Gesture Distance is utilized to compute the disparity between predicted gestures and ground truth of one hand. This metric is instrumental in evaluating the similarity of single hand gestures.
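Both FID and FGD rest on a Frechet distance between two sets of feature vectors. A minimal sketch is given below, assuming the standard closed form for two Gaussians fitted to the features, $\|\mu_a-\mu_b\|^2+\mathrm{Tr}(\Sigma_a+\Sigma_b-2(\Sigma_a\Sigma_b)^{1/2})$. The paper does not spell out its exact implementation, so the function name and the eigenvalue-based evaluation of the trace term are illustrative choices.

```python
import numpy as np

def frechet_distance(feat_a, feat_b):
    """Frechet distance between Gaussians fitted to two feature sets of shape (M, D).
    Assumes the standard closed form; features would come from a pre-trained encoder."""
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feat_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feat_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Under this form the distance is zero when the two feature sets coincide and grows with both the shift in means and the mismatch in covariances; for FID the features would describe both hands, and for FGD a single hand.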

Wasserstein Gesture Distance (WGD). Wasserstein Gesture Distance [49] is computed between two distributions, each of which is represented as a Gaussian Mixture Model (GMM) [27] as below

$$W(\mathcal{I}_x,\mathcal{I}_y)=\inf_{\gamma\sim\Pi(\mathcal{I}_x,\mathcal{I}_y)}\mathbb{E}_{(x,y)\sim\gamma}\left[\|x-y\|\right], \tag{3}$$

where Π(x,y)Πsubscript𝑥subscript𝑦\Pi(\mathcal{I}_{x},\mathcal{I}_{y})roman_Π ( caligraphic_I start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , caligraphic_I start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) indicates the joint distributions that combine the parametric GMM distributions xsubscript𝑥\mathcal{I}_{x}caligraphic_I start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and ysubscript𝑦\mathcal{I}_{y}caligraphic_I start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT. Distributions xsubscript𝑥\mathcal{I}_{x}caligraphic_I start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and ysubscript𝑦\mathcal{I}_{y}caligraphic_I start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT are fitted by predicted gestures and ground truth of single hand. 𝔼𝔼\mathbb{E}blackboard_E calculates the expectation of the sample pairs (x,y)𝑥𝑦(x,y)( italic_x , italic_y ). The WGD metric offers a robust measure of dissimilarity between the generated motion and ground truth.
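To give a feel for the optimal-transport term in Eq. (3), the sketch below computes a sample-based sliced 1-Wasserstein distance. This is a swapped-in approximation, not the GMM-fitting procedure described above: it projects both sample sets onto random directions and uses the fact that, in one dimension, the optimal coupling matches sorted samples. The function name and the parameters `n_proj` and `seed` are hypothetical.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=64, seed=0):
    """Monte Carlo sliced 1-Wasserstein distance between two (N, D) sample sets."""
    assert x.shape == y.shape, "this sketch assumes equally sized sample sets"
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit projection directions
    # In 1-D the infimum over couplings is attained by pairing order statistics,
    # so each projected distance is a mean absolute difference of sorted values.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.abs(px - py).mean())
```

The sliced value is generally not equal to $W$ in Eq. (3); it is only meant to show how the infimum over couplings becomes tractable.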

Position Distance (PD). The Position Distance calculates the $L_{2}$ distance between the estimated positions and the ground truth. It is essential to evaluate the precision of predicted hand positions to ensure the accuracy required for piano fingerings.
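A minimal sketch of this metric is given below. It assumes positions are stored as `(T, 3)` arrays and that the per-frame distances are averaged over the sequence; the averaging and the function name `position_distance` are assumptions, since the text only specifies the $L_{2}$ distance.

```python
import numpy as np

def position_distance(pred, gt):
    """Mean per-frame L2 distance between (T, 3) predicted and ground-truth positions."""
    # Euclidean norm over the coordinate axis, then an (assumed) average over frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```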

Smoothness. Smoothness is measured by computing the acceleration of each joint. However, hands in a static state exhibit maximum smoothness, which is contrary to the desired outcome. Consequently, we take the acceleration of the ground truth as a reference and use the relative acceleration as the evaluation metric for smoothness, which is formulated as $Smooth=\sum_{i}|\hat{\tau}_{i}-\tau_{i}|$, where $\hat{\tau}_{i}$ and $\tau_{i}$ denote the acceleration of the $i$-th joint in the prediction and in the ground truth, respectively. It reflects the continuity and coherence of the estimated gestures.
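The relative-acceleration formula can be sketched as follows. The sketch assumes joint trajectories of shape `(T, J, 3)`, estimates acceleration with a second-order finite difference, and averages the acceleration magnitude over time before comparing joints; these choices and the function names are assumptions rather than details given in the text.

```python
import numpy as np

def joint_acceleration(joints, fps=30.0):
    """Mean acceleration magnitude per joint for a (T, J, 3) trajectory."""
    acc = np.diff(joints, n=2, axis=0) * fps ** 2      # second finite difference
    return np.linalg.norm(acc, axis=-1).mean(axis=0)   # shape (J,)

def relative_smoothness(pred, gt, fps=30.0):
    """Sum over joints of |acc_pred - acc_gt|, following the Smooth formula."""
    diff = joint_acceleration(pred, fps) - joint_acceleration(gt, fps)
    return float(np.abs(diff).sum())
```

With this definition, a static prediction is penalized whenever the ground-truth hand actually accelerates, which is the behavior motivated in the paragraph above.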

5.2 Experiments

Table 3: Quantitative evaluation of our proposed hand motion generation baseline and existing models on the validation set. We present a comparative analysis of various network architectures, highlighting the performance and efficiency of our baselines in generating hand motion.
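| Method | Audio extractor | Decoder | Step | Right Hand FGD↓ | Right Hand WGD↓ | Right Hand PD↓ | Right Hand Smooth↓ | Left Hand FGD↓ | Left Hand WGD↓ | Left Hand PD↓ | Left Hand Smooth↓ | FID↓ | Params (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EmoTalk [44] | | TF | - | 0.360 | 0.259 | 0.033 | 0.313 | 0.445 | 0.232 | 0.044 | 0.353 | 4.645 | 308 |
| LivelySpeaker [76] | | TF | 1000 | 0.535 | 0.249 | 0.030 | 0.334 | 0.538 | 0.220 | 0.038 | 0.406 | 4.157 | 321 |
| Our-Base | Wav2vec | SSM | 1000 | 0.416 | 0.246 | 0.034 | 0.335 | 0.425 | 0.223 | 0.042 | 0.412 | 3.587 | 320 |
| Our-Base | Wav2vec | TF | 1000 | 0.424 | 0.246 | 0.033 | 0.334 | 0.426 | 0.219 | 0.040 | 0.402 | 3.608 | 323 |
| Our-Base | HuBERT | SSM | 1000 | 0.402 | 0.247 | 0.033 | 0.336 | 0.432 | 0.218 | 0.041 | 0.407 | 3.412 | 320 |
| Our-Base | HuBERT | TF | 1000 | 0.418 | 0.247 | 0.034 | 0.338 | 0.432 | 0.219 | 0.041 | 0.412 | 3.529 | 323 |
| Our-Large | Wav2vec | SSM | 1000 | 0.421 | 0.244 | 0.031 | 0.208 | 0.430 | 0.219 | 0.038 | 0.253 | 3.453 | 539 |
| Our-Large | Wav2vec | TF | 1000 | 0.354 | 0.244 | 0.030 | 0.209 | 0.372 | 0.214 | 0.038 | 0.251 | 3.376 | 557 |
| Our-Large | HuBERT | SSM | 1000 | 0.403 | 0.244 | 0.030 | 0.214 | 0.406 | 0.217 | 0.037 | 0.237 | 3.395 | 539 |
| Our-Large | HuBERT | TF | 1000 | 0.351 | 0.244 | 0.030 | 0.205 | 0.372 | 0.217 | 0.037 | 0.248 | 3.281 | 557 |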
 

Experimental Setup. Within the network, we compare two transformer-based audio feature extractors, Wav2Vec2.0 [4] and HuBERT [22], which are popular in the field of self-supervised speech recognition. For the feature embedding module (Decoder), we compare an SSM-based model [18] with the transformer (TF) approach [63]. We conduct experiments with two model configurations, which share the same diffusion architecture but differ in the feature extractor setup. The base model uses Wav2Vec2.0-base/HuBERT-base as the audio feature extractor, with a model dimension of 768 and an 8-layer TF/SSM decoder. The large model uses Wav2Vec2.0-large/HuBERT-large, with a model dimension of 1,024 and a 16-layer TF/SSM decoder.

| Step | FID↓ | FGD↓ | WGD↓ | Smooth↓ |
|---|---|---|---|---|
| 5 | 3.540 | 0.361 | 0.237 | 0.354 |
| 10 | 3.682 | 0.363 | 0.236 | 0.310 |
| 100 | 3.438 | 0.366 | 0.232 | 0.254 |
| 300 | 3.360 | 0.348 | 0.233 | 0.240 |
| 1000 | 3.281 | 0.362 | 0.231 | 0.227 |
 
Table 4: Ablation study on denoising steps.

Quantitative Results. Tab. 3 reports the results of existing methods and our baseline under various experimental settings on the PianoMotion10M dataset. Since there is no prior work on generating gestures from piano music, we follow the network structures of EmoTalk [44] and LivelySpeaker [76] and re-implement them for our task. EmoTalk directly generates poses from audio, while LivelySpeaker utilizes an MLP-based diffusion backbone for gesture generation. Our baseline models achieve better fidelity of hand motions (lower FID) by estimating positions and gestures separately. Among our baseline models, the TF-based decoders outperform the SSM-based ones in the large model (e.g., an FID of 3.281 vs. 3.395 with HuBERT), while the two are close in the base model. For the audio feature extractor, HuBERT slightly outperforms Wav2Vec2.0 in terms of FID. Additionally, we conduct ablation experiments on the number of denoising steps, and our model also achieves competitive results with fewer steps, as shown in Tab. 4.

Qualitative Results. Fig. 4 shows the visual results of our baseline and the existing models. The output of EmoTalk exhibits a static average gesture. While MLPs afford LivelySpeaker rapid inference speed, they compromise the fidelity of the generated motions. In contrast, our model produces motions of notably higher fidelity than previous methods by taking advantage of a two-stage approach. To attain accurate positional information, we use an end-to-end position predictor rather than an uncontrollable diffusion model for position generation. Furthermore, we employ a position-guided, diffusion-based gesture generator for hand motion estimation.

Refer to caption
Figure 4: Illustration of the qualitative results. We display the generated gestures across frames using different methods. Our method stands out due to its greater fidelity, as shown in the examples.

6 Conclusion

In this work, we present PianoMotion10M, a new large-scale piano-motion dataset for hands, which contains 116 hours of piano music and 10 million frames annotated with hand poses. It addresses the critical issue that current datasets are insufficient for studying human-piano interaction. Based on PianoMotion10M, we develop a baseline model that maps piano music pieces to hand motions. To simplify the learning target, we divide the motion into position and gesture by introducing a position predictor along with a gesture generator guided by the estimated positions. We propose evaluation metrics that measure the fidelity and smoothness of the hand motions with respect to the ground truth. Our dataset and benchmark will further advance the automation of piano performance simulation and assist in learning piano playing.

References

  • [1] Ahuja, C., Lee, D.W., Nakano, Y.I., Morency, L.P.: Style transfer for co-speech gesture animation: A multi-speaker conditional-mixture approach. In: ECCV. pp. 248–265 (2020)
  • [2] Alemi, O., Françoise, J., Pasquier, P.: Groovenet: Real-time music-driven dance movement generation using artificial neural networks. Networks 8(17), 26 (2017)
  • [3] Azadi, S., Shah, A., Hayes, T., Parikh, D., Gupta, S.: Make-an-animation: Large-scale text-conditional 3d human motion generation. In: ICCV. pp. 15039–15048 (2023)
  • [4] Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS 33, 12449–12460 (2020)
  • [5] Balliauw, M., Herremans, D., Palhazi Cuervo, D., Sörensen, K.: A variable neighborhood search algorithm to generate piano fingerings for polyphonic sheet music. ITOR 24(3), 509–535 (2017)
  • [6] Bhattacharya, U., Rewkowski, N., Banerjee, A., Guhan, P., Bera, A., Manocha, D.: Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents. In: VR. pp. 1–10 (2021)
  • [7] Cao, Y., Tien, W.C., Faloutsos, P., Pighin, F.: Expressive speech-driven facial animation. TOG 24(4), 1283–1302 (2005)
  • [8] Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: CVPR. pp. 7291–7299 (2017)
  • [9] Cassell, J., Vilhjálmsson, H.H., Bickmore, T.: Beat: the behavior expression animation toolkit. In: SIGGRAPH. pp. 477–486 (2001)
  • [10] Chen, L., Maddox, R.K., Duan, Z., Xu, C.: Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In: CVPR. pp. 7832–7841 (2019)
  • [11] Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: CVPR. pp. 18000–18010 (2023)
  • [12] Chiu, C.C., Morency, L.P., Marsella, S.: Predicting co-verbal gestures: A deep and temporal modeling approach. In: Intelligent Virtual Agents: 15th International Conference. pp. 152–166 (2015)
  • [13] Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A., Black, M.J.: Capture, learning, and synthesis of 3d speaking styles. In: CVPR. pp. 10101–10111 (2019)
  • [14] Fan, Y., Lin, Z., Saito, J., Wang, W., Komura, T.: Faceformer: Speech-driven 3d facial animation with transformers. In: CVPR. pp. 18770–18780 (2022)
  • [15] Ferreira, J.P., Coutinho, T.M., Gomes, T.L., Neto, J.F., Azevedo, R., Martins, R., Nascimento, E.R.: Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio. Computers & Graphics 94, 11–21 (2021)
  • [16] Gan, Q., Li, W., Ren, J., Zhu, J.: Fine-grained multi-view hand reconstruction using inverse rendering. In: AAAI. pp. 1779–1787 (2024)
  • [17] Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., Malik, J.: Learning individual styles of conversational gesture. In: CVPR. pp. 3497–3506 (2019)
  • [18] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023)
  • [19] Guo, Y., Chen, K., Liang, S., Liu, Y.J., Bao, H., Zhang, J.: Ad-nerf: Audio driven neural radiance fields for talking head synthesis. In: ICCV. pp. 5784–5794 (2021)
  • [20] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33, 6840–6851 (2020)
  • [21] Hong, F.T., Zhang, L., Shen, L., Xu, D.: Depth-aware generative adversarial network for talking head video generation. In: CVPR. pp. 3397–3406 (2022)
  • [22] Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: Hubert: Self-supervised speech representation learning by masked prediction of hidden units. TASLP 29, 3451–3460 (2021)
  • [23] Huang, C.M., Mutlu, B.: Robot behavior toolkit: generating effective social behaviors for robots. In: HRI. pp. 25–32 (2012)
  • [24] Huang, R., Hu, H., Wu, W., Sawada, K., Zhang, M., Jiang, D.: Dance revolution: Long-term dance generation with music via curriculum learning. arXiv preprint arXiv:2006.06119 (2020)
  • [25] Ji, X., Zhou, H., Wang, K., Wu, W., Loy, C.C., Cao, X., Xu, F.: Audio-driven emotional video portraits. In: CVPR. pp. 14080–14089 (2021)
  • [26] Kalman, R.E.: A new approach to linear filtering and prediction problems (1960)
  • [27] Kolouri, S., Rohde, G.K., Hoffmann, H.: Sliced wasserstein distance for learning gaussian mixture models. In: CVPR. pp. 3427–3436 (2018)
  • [28] Kong, Q., Li, B., Song, X., Wan, Y., Wang, Y.: High-resolution piano transcription with pedals by regressing onset and offset times. TASLP 29, 3707–3717 (2021)
  • [29] Kopp, S., Krenn, B., Marsella, S., Marshall, A.N., Pelachaud, C., Pirker, H., Thórisson, K.R., Vilhjálmsson, H.: Towards a common framework for multimodal generation: The behavior markup language. In: Intelligent Virtual Agents: 6th International Conference. pp. 205–217 (2006)
  • [30] Kwon, T., Tekin, B., Stühmer, J., Bogo, F., Pollefeys, M.: H2o: Two hands manipulating objects for first person interaction recognition. In: ICCV. pp. 10138–10148 (2021)
  • [31] Lee, S., Chung, J., Yu, Y., Kim, G., Breuel, T., Chechik, G., Song, Y.: Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning. In: ICCV. pp. 10274–10284 (2021)
  • [32] Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Ai choreographer: Music conditioned 3d dance generation with aist++. In: ICCV. pp. 13401–13412 (2021)
  • [33] Lin, C.C., Liu, D.S.M.: An intelligent virtual piano tutor. In: VRCIA. pp. 353–356 (2006)
  • [34] Liu, X., Wu, Q., Zhou, H., Xu, Y., Qian, R., Lin, X., Zhou, X., Wu, W., Dai, B., Zhou, B.: Learning hierarchical cross-modal association for co-speech gesture generation. In: CVPR. pp. 10462–10472 (2022)
  • [35] Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M.G., Lee, J., et al.: Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)
  • [36] Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In: ICCV. pp. 2630–2640 (2019)
  • [37] Moon, G., Saito, S., Xu, W., Joshi, R., Buffalini, J., Bellan, H., Rosen, N., Richardson, J., Mize, M., De Bree, P., et al.: A dataset of relighted 3d interacting hands. NeurIPS 36 (2023)
  • [38] Moon, G., Shiratori, T., Lee, K.M.: Deephandmesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. In: ECCV. pp. 440–455 (2020)
  • [39] Moon, G., Yu, S.I., Wen, H., Shiratori, T., Lee, K.M.: Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In: ECCV. pp. 548–564 (2020)
  • [40] Moryossef, A., Elazar, Y., Goldberg, Y.: At your fingertips: Extracting piano fingering instructions from videos. arXiv preprint arXiv:2303.03745 (2023)
  • [41] Nakamura, E., Sagayama, S.: Automatic piano reduction from ensemble scores based on merged-output hidden markov model. In: ICMC (2015)
  • [42] Pavlakos, G., Shan, D., Radosavovic, I., Kanazawa, A., Fouhey, D., Malik, J.: Reconstructing hands in 3d with transformers. arXiv preprint arXiv:2312.05251 (2023)
  • [43] Pearson, R.K., Neuvo, Y., Astola, J., Gabbouj, M.: Generalized hampel filters. EURASIP 2016, 1–18 (2016)
  • [44] Peng, Z., Wu, H., Song, Z., Xu, H., Zhu, X., He, J., Liu, H., Fan, Z.: Emotalk: Speech-driven emotional disentanglement for 3d face animation. In: ICCV. pp. 20687–20697 (2023)
  • [45] PianoPlayer: (2018), https://github.com/marcomusy/pianoplayer
  • [46] Ren, X., Li, H., Huang, Z., Chen, Q.: Self-supervised dance video synthesis conditioned on music. In: ACM MM. pp. 46–54 (2020)
  • [47] Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. ACM TOG pp. 245:1–245:17 (2017)
  • [48] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: LNCS. pp. 234–241 (2015)
  • [49] Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. IJCV 40, 99–121 (2000)
  • [50] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022)
  • [51] Schafer, R.W.: What is a savitzky-golay filter?[lecture notes]. IEEE Signal processing magazine 28(4), 111–117 (2011)
  • [52] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
  • [53] Shiratori, T., Nakazawa, A., Ikeuchi, K.: Dancing-to-music character animation. In: CGF. pp. 449–458 (2006)
  • [54] Siyao, L., Yu, W., Gu, T., Lin, C., Wang, Q., Qian, C., Loy, C.C., Liu, Z.: Bailando: 3d dance generation by actor-critic gpt with choreographic memory. In: CVPR. pp. 11050–11059 (2022)
  • [55] Srinivasan, K., Raman, K., Chen, J., Bendersky, M., Najork, M.: Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In: SIGIR. pp. 2443–2449 (2021)
  • [56] Starke, S., Mason, I., Komura, T.: Deepphase: Periodic autoencoders for learning motion phase manifolds. TOG 41(4), 1–13 (2022)
  • [57] Sun, G., Wong, Y., Cheng, Z., Kankanhalli, M.S., Geng, W., Li, X.: Deepdance: music-to-dance motion choreography with adversarial learning. TOMM 23, 497–509 (2020)
  • [58] Tang, T., Jia, J., Mao, H.: Dance with melody: An lstm-autoencoder approach to music-oriented dance synthesis. In: ACM MM. pp. 1598–1606 (2018)
  • [59] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)
  • [60] Tian, G., Yuan, Y., Liu, Y.: Audio2face: Generating speech/face animation from single audio with attention-based bidirectional lstm networks. In: ICMEW. pp. 366–371. IEEE (2019)
  • [61] Tian, L., Wang, Q., Zhang, B., Bo, L.: Emo: Emote portrait alive-generating expressive portrait videos with audio2video diffusion model under weak conditions. arXiv preprint arXiv:2402.17485 (2024)
  • [62] Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. NeurIPS 30 (2017)
  • [63] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. NeurIPS 30 (2017)
  • [64] Wang, J., Qiu, K., Peng, H., Fu, J., Zhu, J.: Ai coach: Deep human pose estimation and analysis for personalized athletic training assistance. In: ACM MM. pp. 374–382 (2019)
  • [65] Wang, J., Mueller, F., Bernard, F., Sorli, S., Sotnychenko, O., Qian, N., Otaduy, M.A., Casas, D., Theobalt, C.: Rgb2hands: real-time tracking of 3d hand interactions from monocular rgb video. ACM TOG 39(6), 1–16 (2020)
  • [66] Wang, Z., Veličković, P., Hennes, D., Tomašev, N., Prince, L., Kaisers, M., Bachrach, Y., Elie, R., Wenliang, L.K., Piccinini, F., et al.: Tacticai: an ai assistant for football tactics. Nature communications 15(1), 1–13 (2024)
  • [67] Wu, E., Nishioka, H., Furuya, S., Koike, H.: Marker-removal networks to collect precise 3d hand data for rgb-based estimation and its application in piano. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2977–2986 (2023)
  • [68] Yalta, N., Watanabe, S., Nakadai, K., Ogata, T.: Weakly-supervised deep recurrent neural networks for basic dance step generation. In: IJCNN. pp. 1–8 (2019)
  • [69] Yonebayashi, Y., Kameoka, H., Sagayama, S.: Automatic decision of piano fingering based on a hidden markov models. In: IJCAI. vol. 7, pp. 2915–2921 (2007)
  • [70] Yoon, Y., Cha, B., Lee, J.H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. TOG 39(6), 1–16 (2020)
  • [71] Yuan, Y., Song, J., Iqbal, U., Vahdat, A., Kautz, J.: Physdiff: Physics-guided human motion diffusion model. In: ICCV. pp. 16010–16021 (2023)
  • [72] Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.: Few-shot adversarial learning of realistic neural talking head models. In: ICCV. pp. 9459–9468 (2019)
  • [73] Zhang, C., Zhao, Y., Huang, Y., Zeng, M., Ni, S., Budagavi, M., Guo, X.: Facial: Synthesizing dynamic talking face with implicit attribute learning. In: ICCV. pp. 3867–3876 (2021)
  • [74] Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. TPAMI (2024)
  • [75] Zhang, M., Guo, X., Pan, L., Cai, Z., Hong, F., Li, H., Yang, L., Liu, Z.: Remodiffuse: Retrieval-augmented motion diffusion model. In: ICCV. pp. 364–373 (2023)
  • [76] Zhi, Y., Cun, X., Chen, X., Shen, X., Guo, W., Huang, S., Gao, S.: Livelyspeaker: Towards semantic-aware co-speech gesture generation. In: ICCV. pp. 20807–20817 (2023)
  • [77] Zhu, Y., Olszewski, K., Wu, Y., Achlioptas, P., Chai, M., Yan, Y., Tulyakov, S.: Quantized gan for complex music generation from dance videos. In: ECCV. pp. 182–199 (2022)
  • [78] Zhuang, W., Wang, C., Chai, J., Wang, Y., Shao, M., Xia, S.: Music2dance: Dancenet for music-driven dance generation. TOMM 18(2), 1–21 (2022)
  • [79] Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: Freihand: A dataset for markerless capture of hand pose and shape from single rgb images. In: ICCV. pp. 813–822 (2019)

Appendix

In the appendix, we provide further details and discussions on our proposed dataset and benchmark:

  • §A: More statistics of PianoMotion10M;

  • §B: More visual samples of PianoMotion10M dataset;

  • §C: More details of our baseline;

  • §D: Limitation and future work;

  • §E: Ethical considerations;

  • §F: Author statement;

  • §G: License and consent with public resources.

Appendix A More Statistics of PianoMotion10M

Refer to caption
Figure 5: Distribution of Note Clicks and Volume Levels in the PianoMotion10M Dataset. The top figure depicts note click frequency, and the bottom one shows the volume distribution.

Fig. 5 presents a detailed statistical analysis of piano fingerings, which focuses on the frequency of note clicks and the distribution of volume levels.

Note Click Counts. The top figure in Fig. 5 displays the frequency of each note played, measured in millions. The distribution is plotted over 128 pitch values, matching the MIDI note range that contains the 88 keys of a standard piano (A0 to C8), and shows higher counts in specific note ranges.

Notes around the middle of the keyboard, particularly from C4 to C6, exhibit significantly higher click counts, aligning with their common use in melodies and harmonic accompaniments. Conversely, notes in the low (A0 to B1) and high (C7 to C8) octaves have markedly fewer clicks, as these ranges are less frequently used and typically reserved for specific musical effects or embellishments. Notably, certain notes, particularly those fundamental to common chords and scales (e.g., A, C, E, and G), exhibit higher frequencies, reflecting their frequent use in various musical pieces.
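Such per-note statistics can be gathered from transcribed note events along the following lines. The sketch assumes the notes are available as MIDI pitch numbers (0 to 127) and uses the common convention in which pitch 60 is C4, so that A0 is 21 and C8 is 108; the function names are hypothetical and the input list is a toy example, not dataset content.

```python
from collections import Counter

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_to_name(pitch):
    """Convert a MIDI pitch (0-127) to scientific pitch notation, with 60 -> C4."""
    return f"{NOTE_NAMES[pitch % 12]}{pitch // 12 - 1}"

def note_histogram(pitches):
    """Count how often each of the 128 MIDI pitches occurs in a list of note events."""
    counts = Counter(pitches)
    return [counts.get(p, 0) for p in range(128)]

# Toy example: three note events, two of them on middle C.
hist = note_histogram([60, 60, 64])
print(midi_to_name(60), hist[60])  # C4 2
```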

Volume Distribution. In addition to note frequency, the bottom figure in Fig. 5 presents the distribution of volume levels, binned into ranges to highlight the dynamics of piano playing. Volume counts, measured in millions, provide insights into the intensity and expression captured in our constructed dataset.

There is a higher count of notes played at moderate volume levels. This reflects the natural dynamics of piano playing, where most notes are neither extremely soft nor loud.

Appendix B More Visual Samples of PianoMotion10M Dataset

Refer to caption
Figure 6: More samples from the PianoMotion10M dataset. BV*** denotes the corresponding video IDs.

Appendix C More Details of Our Baseline

The audio feature extractor $\Phi_{a}$ maps the audio $A$ to the feature vector $f_{a}$. We use pre-trained Wav2Vec2.0 [4] and HuBERT [22] models developed by Facebook AI for $\Phi_{a}$. Both models leverage extensive unlabeled data for unsupervised pretraining to learn high-dimensional speech representations. HuBERT performs self-supervised learning by predicting pseudo-labels (hidden units) for masked inputs. In our experiments, we use wav2vec2-base-960h and hubert-base-ls960 for the base model, and wav2vec2-large-960h-lv60-self and hubert-large-ls960-ft for the large model as the audio feature extractor $\Phi_{a}$.

Appendix D Limitation and Future Work

Our dataset closely associates piano music with hand movements. Due to data diversity, we have not yet aligned piano positions in all videos. We plan to engage extra experts to annotate piano key positions for more precise spatial alignment in the next version. Additionally, the variance in piano tones across recordings may also affect the baseline model’s performance.

PianoMotion10M provides piano music, corresponding MIDI files, and hand poses, offering researchers a valuable resource for studying human-piano interaction. This dataset enables the analysis of piano music through hand gestures and the generation of hand poses from audio tracks. With PianoMotion10M, we hope to benefit and facilitate further research in relevant fields.

Appendix E Ethical Considerations

Piano motion datasets may pose significant privacy challenges, particularly concerning the pianist’s identifiable aspect, mainly their hands, during piano performance. Our dataset comprises videos uploaded by users on Bilibili, which are publicly accessible. To address privacy risks, we adopt a strict policy of releasing only video IDs, not the videos or images directly. Furthermore, we employ the MANO hand prior model [47] as a form of robust anonymization. This approach ensures the protection of personal information, mitigates the risk of individual identification, and minimizes privacy concerns. With these precautions, our dataset enables valuable research in piano motion analysis and generation.

Appendix F Author Statement

The authors bear all responsibility in case of violation of rights. We confirm that the PianoMotion10M dataset is open-sourced under the CC BY-NC 4.0 International license and the released code is publicly available under the Apache-2.0 license, ensuring open access and permissive usage for academic and research purposes.

Appendix G License and Consent with Public Resources

G.1 Tools for Annotation

The piano audio was transcribed with piano_transcription_inference [28], and the hand poses of the MANO model [47] in the videos were annotated with MediaPipe [35] and HaMeR [42].

G.2 Models for Baseline

Pre-trained Wav2Vec2.0 [4] and HuBERT [22] were used as audio feature extractors, while the transformer approach [63] and the SSM-based model [18] were employed as our feature decoders.

G.3 Re-evaluated Methods

In the experimental section, we evaluated EmoTalk [44] and LivelySpeaker [76], both of which were reproduced using the official code: