PianoMotion10M: Dataset and Benchmark
for Hand Motion Generation in Piano Performance

Qijun Gan
Zhejiang University
[email protected]
Song Wang
Zhejiang University
[email protected]
Shengtao Wu
Hangzhou Dianzi University
[email protected]
Jianke Zhu
Zhejiang University
[email protected]

Abstract

Recently, artificial intelligence techniques for education have received increasing attention, yet designing effective musical instrument instruction systems remains an open problem. Although key presses can be directly derived from sheet music, the transitional movements between key presses require more extensive guidance in piano performance. In this work, we construct a piano-hand motion generation benchmark to guide hand movements and fingerings for piano playing. To this end, we collect an annotated dataset, PianoMotion10M, consisting of 116 hours of piano playing videos from a bird’s-eye view with 10 million annotated hand poses. We also introduce a powerful baseline model that generates hand motions from piano audio through a position predictor and a position-guided gesture generator. Furthermore, a series of evaluation metrics is designed to assess the performance of the baseline model, including motion similarity, smoothness, positional accuracy of the left and right hands, and overall fidelity of the movement distribution. Although piano key presses corresponding to music scores or audio are already accessible, PianoMotion10M aims to provide guidance on piano fingering for instruction purposes. The dataset and source code can be accessed at https://agnjason.github.io/PianoMotion-page.

1 Introduction

The process of learning has been significantly improved by artificial intelligence techniques, which enable individuals to enhance their skills under the guidance of an AI coach [64, 66]. This can be extended to learning to play musical instruments. In particular, piano performance requires a profound understanding of the underlying relationship between musical compositions and their corresponding physical motions. Just as rigorous practice and training programs are necessary for athletes to master a diverse range of expressive poses, extensive practice is required to achieve proficiency in piano fingering and hand movement. This need for accessible playing guidance has spurred the development of AI piano coaches.

Large-scale piano-motion datasets are the foundation of a nuanced approach to motion generation, offering valuable guidance for physical performance and musical expression. The computational challenge of motion generation lies in capturing the nonlinear relationship between musical pieces and the intricate hand motions required for piano playing. The hand poses for the same note vary across different melodies. Moreover, the dynamic nature of musical expression demands continuous motion, which is difficult to learn from small datasets. These limitations underscore the urgent need for a large-scale piano-motion dataset.

To address the deficiency of the dataset for guiding hand movements and fingerings in playing piano, we introduce a large-scale 3D piano-motion dataset named PianoMotion10M. As shown in Fig. 1, PianoMotion10M contains piano audio tracks, Musical Instrument Digital Interface (MIDI) files, and annotated hand motions with their corresponding videos, meticulously collected from the Internet. It comprises 1,966 pairs of video and music, with a total duration of 116 hours and 10 million annotated frames. The parametric MANO hand model [47] is employed to represent hand gestures. Our constructed dataset offers a diverse range of music styles and piano techniques, which addresses the demands of various preferences and skill levels.

Figure 1: Overview of our framework. We collect videos of professional piano performances from the Internet and process them to construct a large-scale dataset, PianoMotion10M, which comprises piano music, MIDI files and hand motions. Building upon this dataset, we establish a benchmark for generating hand motions from piano music.

Traditional applications like PianoPlayer [45] are adept at generating static hand gestures and positions for piano scores, typically using classifiers to predict the proper fingering combinations. However, they often fail to capture the diversity and continuity inherent in the piano performances of PianoMotion10M. Both rule-based methods [5, 33] and HMM-based approaches [69, 41] can estimate fingerings, but they primarily focus on the local fingering constraints of consecutive notes. Consequently, they often overlook crucial information such as long-range fingering relationships, whereas our task aims to estimate the motions of long clips.

To address these limitations, a novel baseline model is presented to show the effectiveness of PianoMotion10M, which is able to generate realistic hand motions from piano melodies. Given a piece of piano music, our model can locate the positions of both hands and generate a long sequence of hand gestures for the performance. It learns the music-position correlation through an efficient position predictor and produces continuous gestures with a position-guided gesture generator based on a diffusion probabilistic model. To assess our baseline model, we propose several evaluation metrics: Frechet Gesture Distance and Wasserstein Gesture Distance to measure the fidelity of each hand's motion, Frechet Inception Distance with a pre-trained auto-encoder to investigate the motion quality of both hands jointly, Position Distance to assess the accuracy of hand positioning, and Smoothness of the generated motions.

In summary, our main contributions are: 1) A large-scale piano-motion dataset, PianoMotion10M, which comprises 116 hours of music and 10 million frames annotated with hand poses. To the best of our knowledge, it is the first dataset integrating piano music with its corresponding hand motions, which facilitates the tasks of 3D hand motion generation from piano audio tracks and piano music generation conditioned on hand motions. 2) Based on PianoMotion10M, we propose a benchmark with a series of evaluation metrics to investigate the effectiveness of hand motion generation, including the accuracy of positions and the fidelity of gestures. 3) A novel baseline model that bridges piano music with hand motions, which estimates hand locations with a position predictor and generates hand gesture sequences through a position-guided gesture generator.

2 Related Work

Motion-music Datasets. While multi-modal datasets [52, 55, 36, 31] have become key to various learning tasks, there remains a significant gap in the availability of repositories specifically designed for music-conditional motion generation. Although the existing hand gesture datasets [39, 38, 30, 16, 67] contain a large number of image-hand gesture pairs, they do not include audio or other related information. This limitation makes them unsuitable for generative tasks, since they mainly focus on hand reconstruction and pose estimation. Recently, datasets like AIST++ [32] and TikTok [77] have been tailored for music-dance learning, but they provide limited music segments, typically less than 5 hours in duration. Moryossef et al. [40] automatically detect which fingers press the piano keys and provide a piano-fingering dataset. However, it does not provide continuous hand gesture movements. Therefore, there is a crucial need to build a piano-motion dataset specifically tailored for motion generation tasks. To this end, we introduce the PianoMotion10M dataset, which comprises extensive piano music and the corresponding hand motion annotations.

3D Human Motion Synthesis. The problem of generating realistic and controllable 3D human motion sequences has been a subject of long-standing study. By taking advantage of 2D keypoint detection [8], the synthesis of 2D skeletons has been extensively explored [46, 53, 15]. Considerable research efforts have been devoted to 2D speech-driven head generation for facial mouth and lip motion generation [72, 10, 19, 25, 21], which usually employ either image-driven or voice-driven methods to produce realistic videos of speaking individuals. However, the expressive capabilities of 2D pose skeletons are limited and they are not applicable to 3D character models. Recent methods for full-body 3D dance generation have utilized LSTM [58, 68, 78], GANs [57, 17] or transformer encoders [32, 24, 54]. To generate vivid talking head videos, extensive research has been conducted in the field of speech-driven 3D facial animation [7, 13, 60, 14, 73]. EmoTalk [44] animates emotional 3D faces from speech input by generating controllable personal and emotional styles. Tian et al. [61] introduce a speed controller and a face region controller to enhance stability during the head generation process.

In the domain of hand motion generation, existing work falls primarily into rule-based methods [9, 23, 56] and data-driven approaches [29, 12, 34, 70, 6]. For instance, Speech2Gesture [17] utilizes conditional generative adversarial networks to generate personalized 2D keypoints from audio. Ahuja et al. [1] propose a method for personalized motion transfer. Ao et al. [62] learn the mapping between speech and gestures from data using a combined network structure based on the vector quantized variational auto-encoder (VQ-VAE). Beyond 2D keypoint generation, TriModal [70] extracts different upper body movements from TED talks and designs an LSTM-based neural network conditioned on audio, text, and identity to generate co-speech gestures.

Recently, diffusion models have achieved promising results in generating human motions. Previous works such as MDM [59] and MotionDiffuse [74] have produced realistic motion inspired by denoising diffusion probabilistic models (DDPM) [20]. PhysDiff [71] extends MDM by imposing physical constraints. MLD [11] utilizes latent carrier DDPM for forward denoising and reverse diffusion in motion latent space. MAA [3] enhances the performance of non-distributed data by pre-training diffusion models. Zhang et al. [75] introduce retrieval-enhanced DDPM, which improves the text-to-motion functionality in distribution.

3 PianoMotion10M Dataset

Mapping piano music to hand motions is a formidable task due to the significant influence of note sequences and positions on hand movements and fingering. A lack of diverse data may lead to inferior performance in estimating the various hand poses used in piano playing. To capture this variability, we present the first large-scale piano-motion dataset, PianoMotion10M, which comprises 1,966 piano performance videos along with 10 million hand poses and their corresponding MIDI files, resulting in an overall duration of 116 hours. Fig. 2 presents an example from our dataset. These videos are segmented into 16,739 individual clips with a length of 30 seconds. To ensure comprehensive evaluation, we extract 7,519 clips for training, 821 for validation and 8,399 for testing. Detailed comparisons with the existing datasets on hands and 3D human motions are summarized in Tab. 1.

Table 1: Comparison between different hand and motion datasets. The proposed PianoMotion10M dataset consists of piano music with corresponding hand poses for hand motion generation. Existing hand-image datasets are listed in the first four rows, and music-motion datasets are presented in the subsequent four rows for reference.

Dataset	Pose	Size	Subject	Music	MIDI	Duration(hour)
FreiHAND [79]	✓	134K	32	✗	✗	-
InterHand2.6M [39]	✓	2.6M	27	✗	✗	-
RGB2Hands [65]	✓	1K	2	✗	✗	-
Re:InterHand [37]	✓	1.5M	10	✗	✗	-
GrooveNet [2]	✓	-	1	✓	✗	0.38
DanceNet [78]	✓	-	2	✓	✗	0.96
EA-MUD [57]	✓	-	-	✓	✗	0.35
AIST++ [32]	✓	10.1M	30	✓	✗	5.19
PianoMotion10M	✓	10.5M	14	✓	✓	116.16

3.1 Data Collection

There is a wealth of videos and live streams dedicated to musical instrument performances and tutorials on the Internet. Since each individual has a unique playing style, we first select $14$ piano experts from Bilibili (https://www.bilibili.com), one of the most popular video-sharing platforms in China, and collect a total of 3,647 candidate videos. To ensure consistency and enhance the quality of our dataset, we conduct pre-processing based on five pivotal factors: resolution, audio quality, camera perspective, presence of multiple individuals, and visibility of hands. We manually select pure piano music to ensure that the audio contains no human vocals or sounds from other instruments. Moreover, videos with a resolution lower than $1080\times 1920$ are discarded to ensure the quality of our dataset. To enhance the observation of hand movements and gestures, we choose videos with a bird’s-eye view to minimize hand occlusion. Furthermore, we remove those videos where hand regions are frequently obstructed or invisible during the performance. Additionally, we exclude videos with multiple piano players, which are beyond the scope of this dataset. By addressing these factors, the overall quality and coherence of the dataset are substantially enhanced. This pre-processing stage yields a collection of $1,966$ high-quality raw videos. Each one shows an individual playing the piano, captured from a bird’s-eye view along with pure piano audio.

3.2 Data Annotations

MIDI is a universal digital protocol for storing musical data for various musical devices, which serves as a digital representation of musical notes, volume, tempo, and other performance parameters. Once video and pure piano audio pairs are collected, we transcribe the piano performances into MIDI files using a state-of-the-art automatic piano transcription method [28]. To ensure the accuracy of the conversion results, we replay the MIDI files and compare them with the original music tracks. Files with high discrepancies are adjusted.

Hand Pose is captured via the parametric hand model MANO [47], which serves as our hand prior model. It maps the pose parameter $\theta\in\mathbb{R}^{J\times 3}$ with $J$ per-bone parts and the shape parameter $\rho\in\mathbb{R}^{10}$ onto a template mesh $\bar{\mathcal{M}}$ with vertices $V$. The MANO model enables the simulation of various hand gestures and movements during piano performances, which provides an effective way to study the relationships among hand gestures in piano playing. Due to the non-uniformity of hand sizes and positions in the collected videos, we first employ the MediaPipe hand detection framework [35] to obtain the bounding box of the hand region. Video frames are cropped according to the detected hand bounding boxes to enhance the robustness of the results. Hand poses in the collected videos are then annotated using HaMeR [42]. HaMeR follows a fully transformer-based architecture and reconstructs hand models with increased accuracy and robustness. All images in which MediaPipe detects hands are annotated with hand poses by HaMeR, which results in a dataset of 10 million image-pose pairs.

To enhance the smoothness and continuity of our dataset, the generated hand poses need to be cleaned and refined. The results of MediaPipe and HaMeR are accurate in most cases, while some inferior results may occur due to rapid motion and image blurring. The Hampel filter [43] with a window size of $20$ is utilized to identify these outliers. The outliers and the timestamps where hands are undetected are first labeled as missing values. Within a short period, hand movements can be approximated as motion with constant speed. Therefore, a small gap can be interpolated from both sides. To address the missing data in the time series, missing segments with a frame length $\delta$ of less than $30$ frames are filled by linear interpolation with respect to their surrounding values. The remaining segments are considered as periods when the hand is invisible. To ensure the reliability of detected hands, a similar strategy is employed to label observations with excessively short duration ( $\delta<15$ ) as invisible. Finally, to ensure the smoothness of hand motions, we make use of a Savitzky-Golay filter [51] for data smoothing. The annotated hand poses are manually checked to ensure their quality.
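The outlier-removal and gap-filling steps described above can be sketched as follows. This is a minimal illustration on a single scalar pose channel, not the authors' implementation: the threshold `k = 3`, the use of `None` for missing frames, and all function names are assumptions, and the final Savitzky-Golay smoothing step is omitted.

```python
from statistics import median


def hampel_outliers(x, window=20, k=3.0):
    """Return indices whose value deviates from the local median by more
    than k scaled MADs. `window` is the total window size in frames
    (window // 2 on each side); `k` is an assumed threshold."""
    half = window // 2
    flagged = []
    for i in range(len(x)):
        if x[i] is None:  # frame where no hand was detected
            continue
        local = [v for v in x[max(0, i - half): i + half + 1] if v is not None]
        med = median(local)
        # 1.4826 scales the MAD to the standard deviation of a Gaussian
        mad = 1.4826 * median(abs(v - med) for v in local)
        if abs(x[i] - med) > k * mad:
            flagged.append(i)
    return flagged


def fill_short_gaps(x, max_gap=30):
    """Linearly interpolate runs of None shorter than max_gap frames.
    Longer runs, and runs touching either end, are left as None."""
    y = list(x)
    n = len(y)
    i = 0
    while i < n:
        if y[i] is not None:
            i += 1
            continue
        j = i
        while j < n and y[j] is None:
            j += 1
        if i > 0 and j < n and (j - i) < max_gap:
            left, right = y[i - 1], y[j]
            span = j - i + 1
            for m in range(i, j):
                y[m] = left + (right - left) * (m - i + 1) / span
        i = j
    return y


def clean_track(x, window=20, k=3.0, max_gap=30):
    """Mark Hampel outliers as missing, then fill short gaps."""
    y = list(x)
    for i in hampel_outliers(y, window, k):
        y[i] = None
    return fill_short_gaps(y, max_gap)
```

In this sketch, frames that remain `None` after `clean_track` correspond to periods treated as "hand invisible".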

3.3 Data Statistics

The dataset is composed of contributions from several expert subjects, each of whom provides a different amount of data in terms of videos, clips, duration, and annotated frames. Tab. 2 presents the detailed statistics on the distribution of subjects within our PianoMotion10M dataset.

Table 2: Statistics on the distribution of subjects in the PianoMotion10M dataset, where subject names denote the identity ID of experts.

Subject Name	Videos	Clips	Time(sec)	Frames	Subject Name	Videos	Clips	Time(sec)	Frames
1467634	337	4,359	103,293	2,237,181	470175873	14	100	2,450	64,946
2084102325	11	103	2,766	70,243	478315001	285	2,674	62,615	1,792,707
36760114	22	160	4,772	106,037	494725787	1	12	586	7,488
37367458	114	859	20,802	571,525	66685747	535	5,136	130,352	3,520,370
403444513	12	171	4,245	99,973	676539782	19	359	11,314	187,917
434762078	74	128	2,948	80,607	688183660	264	872	19,074	564,699
442401135	275	1,788	52,030	1,209,905	864712	3	18	923	13,569
Total Videos: 1,966, Clips: 16,739, Total Duration(hour): 116.16, Annotated Frames: 10,527,167

The whole piano-motion dataset, PianoMotion10M, consists of 1,966 videos with approximately 116 total hours of footage. Each piece of music is segmented into 30-second clips at 24-second intervals, resulting in a total of 16,739 clips. Note that we discard those clips where hand visibility is below 80%. Our dataset has 14 subjects with different playing styles. This variability ensures a rich diversity of playing techniques and music styles, which is essential for training robust models to predict piano hand gestures from musical pieces. All videos in our dataset are publicly accessible with provided video IDs on the Bilibili website.
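The segmentation rule above (30-second clips taken every 24 seconds, with low-visibility clips discarded) can be illustrated with a short sketch. The function names are hypothetical, and dropping a trailing segment shorter than 30 seconds is an assumption, since the text does not specify how the tail of a video is handled.

```python
def clip_starts(duration_s, clip_len=30.0, stride=24.0):
    """Start times (seconds) of full-length clips taken every `stride` seconds.
    Consecutive clips overlap by clip_len - stride seconds."""
    starts = []
    t = 0.0
    while t + clip_len <= duration_s:
        starts.append(t)
        t += stride
    return starts


def keep_clip(visible_flags, min_ratio=0.8):
    """Keep a clip only if hands are visible in at least min_ratio of its frames."""
    return sum(visible_flags) / len(visible_flags) >= min_ratio
```

For a 100-second video, this sketch yields clips starting at 0, 24, and 48 seconds.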

4 Baselines

To tackle the challenging task of generating hand motions synchronized with piano audio tracks, we introduce a novel motion generation framework upon the PianoMotion10M dataset. As illustrated in Fig. 3, our framework consists of a position predictor and a gesture generator. The position predictor extracts the hand positions from piano music and integrates them as contextual input for the gesture generator. By leveraging a DDPM model [20], the gesture generator estimates hand pose sequence based on the piano audio and the predicted positions. Further details on each component are elaborated in the following subsections.

Problem Formulation. Given a piano audio piece $A$ with $N$ frames, our objective is to obtain the hand position sequence $P\in\mathbb{R}^{N\times 3\times 2}$ and hand gestures $\Theta\in\mathbb{R}^{N\times J\times 3\times 2}$ . The hand position sequence $P$ encompasses the 3D coordinates of the left and right hands. The hand gestures $\Theta$ consist of the Euler angles at each joint of both hands.

Due to the highly nonlinear relationship between acoustic signals and hand gestures, it poses a significant challenge to estimate motions through discriminative models [44], which usually leads to an average pose, as demonstrated in Section 5.2. To address this issue, a hand position predictor is introduced to estimate the continuous changes in hand positions. Moreover, a generative model is utilized to reconstruct hand gestures from a piano music piece. Hereby, the task of hand motion generation becomes more concise and comprehensive by disentangling it into hand position estimation and gesture generation. To extract the audio features $f_{a}$ from the audio $A$ , we make use of a pre-trained audio feature extractor $\Phi_{a}$ [4, 22], which is formulated as $f_{a}=\Phi_{a}(A),f_{a}\in\mathbb{R}^{N\times C}$ . $C$ is the feature dimension of $\Phi_{a}$ .

4.1 Position Predictor

The position predictor is employed to predict the 3D positions of the left and right hands, which are 6-dimensional in total. Since the audio features $f_{a}$ cannot be mapped directly to the positions, a feature embedding module is used as our position decoder $\Phi_{p}$ to extract the latent features. It projects the sequential audio features $f_{a}$ onto the latent features $f_{p}$ with more comprehensive temporal information, which is formulated as $f_{p}=\Phi_{p}(f_{a}),f_{p}\in\mathbb{R}^{N\times 512}$ . Subsequently, a linear mapping is employed to project the feature $f_{p}$ onto the output positions $P$ .

In our experiments, we can make use of either Transformer [63] or State-Space Model (SSM) [18] as our feature embedding module. Transformer leverages self-attention mechanisms to effectively capture long-range dependencies and contextual information in order to learn temporal relationships. Recently, a different representation inspired by classical SSM [26] is proposed to replace the attention mechanism, which is built upon a more contemporary Selective Structured State Space Model (S6) [18] suitable for deep learning. By sharing a similar architecture to the classical RNN, it can efficiently capture information from previous inputs.

4.2 Position-guided Gesture Generator

Leveraging recent achievements [76, 3] in motion generation, our approach incorporates a diffusion probabilistic model [20]. By mastering denoising, this model effectively captures the complex distribution of hand motions observed in large-scale piano-motion datasets, which has the capability to generate motions with different conditions. Starting with a clean sample $\Theta_{0}$ of motion sequence, the forward diffusion process establishes a Markov chain that gradually adds noise to $\Theta_{0}$ in $T$ steps, which generates a series of noisy samples $\Theta_{1},...,\Theta_{T}$ as

$$q(\Theta_{t}|\Theta_{0})=\mathcal{N}(\Theta_{t};\sqrt{\bar{\alpha}_{t}}\Theta_{0},(1-\bar{\alpha}_{t})I), \qquad (1)$$

where $\mathcal{N}$ denotes the Gaussian distribution, $\bar{\alpha}_{t}=\prod_{s=1}^{t}(1-\beta_{s})$ , and $\beta_{s}$ denotes the variance schedule for the added noise. A model parameterized by a deep neural network $G$ is then trained to learn the reverse process within another Markov chain, which learns the mapping $p(\Theta_{t-1}|\Theta_{t})$ to sequentially denoise samples over $T$ steps. Specifically, the denoising model $G$ consists of a 4-layer U-Net [48] with 256, 512, 1024, and 2048 dimensions for the respective layers.
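As a worked illustration of the forward process, the following sketch computes $\bar{\alpha}_{t}$ from a variance schedule and draws one noisy sample by scaling the clean sample by $\sqrt{\bar{\alpha}_{t}}$ and adding Gaussian noise with variance $1-\bar{\alpha}_{t}$. The linear schedule and its endpoints (`1e-4` to `0.02`) are assumptions borrowed from common DDPM practice; the text does not state which schedule is used, and the function names are hypothetical.

```python
import math
import random


def alpha_bar_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)
    for an assumed linear beta schedule (requires T >= 2)."""
    betas = [beta_start + (beta_end - beta_start) * s / (T - 1) for s in range(T)]
    alpha_bars = []
    prod = 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(theta0, t, alpha_bars, rng):
    """Draw Theta_t ~ N(sqrt(alpha_bar_t) * Theta_0, (1 - alpha_bar_t) I)
    for a flattened clean sample theta0 and a step t in 1..T."""
    ab = alpha_bars[t - 1]
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
            for x in theta0]
```

With this schedule, $\bar{\alpha}_{t}$ decreases monotonically toward zero, so the sample at step $T$ is close to pure Gaussian noise.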

To reduce the noise in piano audio, the audio features $f_{a}$ are simultaneously fed into the denoising neural network $G$ as conditions. Similar to the position decoder $\Phi_{p}$ , a gesture decoder $\Phi_{g}$ is utilized to extract gesture features $f_{g}$ from the audio features $f_{a}$ as $f_{g}=\Phi_{g}(f_{a})$ . Considering that the finger movements at different positions for the same pitch are distinct, $G$ requires the guidance of hand positions $P$ from the position predictor, which enhances the fidelity of the generation process. As an additional condition, the embedding of time step $t$ is concatenated during the denoising process. The denoising process can be formulated as follows

$$\hat{\Theta}_{0}=G(\Theta_{t},t;f_{g},P), \qquad (2)$$

where $\hat{\Theta}_{0}\in\mathbb{R}^{N\times J\times 3}$ denotes the result of hand motions within $N$ frames.

4.3 Implementation Details

We employ a two-stage scheme to train our proposed network. In the first stage, the position predictor is trained using the position error $\mathcal{L}_{p}$ and the velocity loss $\mathcal{L}_{v}$ . The position error $\mathcal{L}_{p}$ computes the $L_{1}$ loss between the predicted position $\hat{P}$ and the ground truth $P$ , expressed as $\mathcal{L}_{p}=||\hat{P}-P||_{1}$ . Inspired by [44], the velocity loss is adopted to induce temporal stability for generating smooth movements, which is formulated as $\mathcal{L}_{v}=||(\hat{P}_{n}-\hat{P}_{n-1})-(P_{n}-P_{n-1})||_{2}$ , where $n$ denotes the $n$ -th frame in a motion sequence. Specifically, our model is trained on data from subject 1467634 and subject 66685747, which share a similar piano keyboard layout. In the second stage, the parameters of the position predictor are frozen, and the estimated positions are employed as guidance for the gesture generator. As in [50], the denoising process is modified from noise prediction to velocity prediction. During the whole training process, the parameters of the audio feature extractor $\Phi_{a}$ are frozen.
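The two first-stage losses can be written out concretely. The sketch below is a plain-Python illustration rather than the training code: averaging over elements and frames is an assumed reduction, since the text gives only the norms, and the function names are hypothetical. Each sequence is a list of frames, and each frame is a list of position coordinates.

```python
import math


def position_loss(pred, gt):
    """L1 position error between predicted and ground-truth positions,
    averaged over all frames and coordinates (assumed reduction)."""
    total = sum(abs(p - g) for pf, gf in zip(pred, gt) for p, g in zip(pf, gf))
    return total / (len(gt) * len(gt[0]))


def velocity_loss(pred, gt):
    """L2 norm of the difference between predicted and ground-truth
    frame-to-frame displacements, averaged over frames (assumed reduction)."""
    norms = []
    for n in range(1, len(gt)):
        residual = [(pred[n][d] - pred[n - 1][d]) - (gt[n][d] - gt[n - 1][d])
                    for d in range(len(gt[n]))]
        norms.append(math.sqrt(sum(r * r for r in residual)))
    return sum(norms) / len(norms)
```

A prediction that is offset from the ground truth by a constant has a non-zero position loss but a zero velocity loss, which is why the two terms are complementary.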

The baseline model is implemented with PyTorch. We normalize the inputs to 30 FPS through interpolation, and each piece of music lasts 8 seconds. Both stages are trained for $100,000$ iterations, at learning rates of $2\times 10^{-5}$ and $5\times 10^{-5}$ , respectively. We conduct all the experiments on a PC with a single NVIDIA RTX 3090 Ti GPU, which has 24GB of GPU RAM.

5 Benchmark

5.1 Evaluation Metrics

To assess the performance of our proposed baseline, we employ several evaluation metrics to examine the effectiveness of hand poses generated by the input piano music, which are crucial in understanding the relationship between piano melody and its corresponding playing motions.

Frechet Inception Distance (FID). Frechet Inception Distance is introduced to measure the Frechet distance between the feature vectors of the prediction and the ground truth. We pre-train an auto-encoder [70] to project the motion sequence of both hands onto a latent space. The FID is adopted to assess the fidelity of the overall motions generated by our baseline model.

Frechet Gesture Distance (FGD). Unlike FID, Frechet Gesture Distance is utilized to compute the disparity between predicted gestures and ground truth of one hand. This metric is instrumental in evaluating the similarity of single hand gestures.
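For intuition, the Frechet distance between two Gaussians fitted to feature sets has a closed form. The sketch below uses a simplifying assumption that is not stated in the text, namely diagonal covariances, under which the squared distance reduces to $\sum_{d}(\mu_{x,d}-\mu_{y,d})^{2}+(\sigma_{x,d}-\sigma_{y,d})^{2}$. A full implementation would use complete covariance matrices and a matrix square root; the function name is hypothetical.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(x, y):
    """Squared Frechet distance between Gaussians fitted to two sets of
    feature vectors, assuming diagonal covariance (a simplification)."""
    dim = len(x[0])
    d2 = 0.0
    for d in range(dim):
        xs = [v[d] for v in x]
        ys = [v[d] for v in y]
        mu_x, mu_y = fmean(xs), fmean(ys)
        sd_x, sd_y = math.sqrt(pvariance(xs)), math.sqrt(pvariance(ys))
        d2 += (mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2
    return d2
```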

Wasserstein Gesture Distance (WGD). Wasserstein Gesture Distance [49] is computed between two distributions, each of which is represented as a Gaussian Mixture Model (GMM) [27] as below

$$W(\mathcal{I}_{x},\mathcal{I}_{y})=\inf_{\gamma\sim\Pi(\mathcal{I}_{x},\mathcal{I}_{y})}\mathbb{E}_{(x,y)\sim\gamma}\left[\|x-y\|\right], \qquad (3)$$

where $\Pi(\mathcal{I}_{x},\mathcal{I}_{y})$ denotes the set of joint distributions whose marginals are the parametric GMM distributions $\mathcal{I}_{x}$ and $\mathcal{I}_{y}$ . The distributions $\mathcal{I}_{x}$ and $\mathcal{I}_{y}$ are fitted to the predicted gestures and the ground truth of a single hand, respectively. $\mathbb{E}$ denotes the expectation over the sample pairs $(x,y)$ . The WGD metric offers a robust measure of dissimilarity between the generated motion and the ground truth.
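Eq. (3) has a simple closed form in one special case, which may help build intuition: for two equal-size sets of scalar samples, the optimal coupling pairs the sorted values, so the distance is the mean absolute difference between them. The sketch below implements only this 1-D empirical case. It is not the GMM-based WGD used in the benchmark, and the function name is hypothetical.

```python
def wasserstein_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-size sets of
    scalar samples: pair the sorted values and average the gaps."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

Because the samples are sorted first, the result depends only on the two distributions and not on the order in which samples are listed.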

Position Distance (PD). The Position Distance calculates the $L_{2}$ distance between estimated positions and the ground truth. It is essential to evaluate the precision of predicted hand positions to ensure the accuracy required for piano fingerings.

Smoothness. Smoothness is measured by computing the acceleration of each joint. However, hands in a static state exhibit maximum smoothness, which is contrary to the desired outcome. Consequently, we take the acceleration of the ground truth as a reference and utilize the relative acceleration as the evaluation metric for smoothness, which is formulated as $\mathrm{Smooth}=\sum_{i}|\hat{\tau}_{i}-\tau_{i}|$ , where $\hat{\tau}_{i}$ and $\tau_{i}$ denote the predicted and ground-truth acceleration of the $i$ -th joint. It reflects the continuity and coherence of the estimated gestures.
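The Position Distance and the relative smoothness metric can be sketched in a few lines. Several details here are assumptions not fixed by the text: acceleration is approximated by the second finite difference of a scalar joint trajectory, the per-joint acceleration $\tau_{i}$ is taken as its mean absolute value over time, and PD is averaged over frames. The function names are hypothetical.

```python
import math


def position_distance(pred, gt):
    """Mean L2 distance between predicted and ground-truth 3D positions."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def mean_abs_acceleration(traj):
    """Mean absolute second finite difference of a scalar trajectory
    (requires at least 3 frames)."""
    acc = [traj[n + 1] - 2.0 * traj[n] + traj[n - 1]
           for n in range(1, len(traj) - 1)]
    return sum(abs(a) for a in acc) / len(acc)


def relative_smoothness(pred_joints, gt_joints):
    """Sum over joints of |tau_hat_i - tau_i|, where tau is the per-joint
    acceleration of the prediction and of the ground truth."""
    return sum(abs(mean_abs_acceleration(p) - mean_abs_acceleration(g))
               for p, g in zip(pred_joints, gt_joints))
```

Under this definition a motionless prediction is penalized whenever the ground truth accelerates, which is the behavior the relative formulation is meant to enforce.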

5.2 Experiments

Table 3: Quantitative evaluation of our proposed hand motion generation baseline and existing models on the validation set. We present a comparative analysis of various network architectures, highlighting the performance and efficiency of our baselines in generating hand motion.

Method	Feature Extractor	Decoder	Step	Right Hand FGD $\downarrow$	Right Hand WGD $\downarrow$	Right Hand PD $\downarrow$	Right Hand Smooth $\downarrow$	Left Hand FGD $\downarrow$	Left Hand WGD $\downarrow$	Left Hand PD $\downarrow$	Left Hand Smooth $\downarrow$	FID $\downarrow$	Params (M)
EmoTalk [44]		TF	-	0.360	0.259	0.033	0.313	0.445	0.232	0.044	0.353	4.645	308
LivelySpeaker [76]		TF	1000	0.535	0.249	0.030	0.334	0.538	0.220	0.038	0.406	4.157	321
Our-Base	Wav2vec	SSM	1000	0.416	0.246	0.034	0.335	0.425	0.223	0.042	0.412	3.587	320
	Wav2vec	TF	1000	0.424	0.246	0.033	0.334	0.426	0.219	0.040	0.402	3.608	323
	HuBERT	SSM	1000	0.402	0.247	0.033	0.336	0.432	0.218	0.041	0.407	3.412	320
	HuBERT	TF	1000	0.418	0.247	0.034	0.338	0.432	0.219	0.041	0.412	3.529	323
Our-Large	Wav2vec	SSM	1000	0.421	0.244	0.031	0.208	0.430	0.219	0.038	0.253	3.453	539
	Wav2vec	TF	1000	0.354	0.244	0.030	0.209	0.372	0.214	0.038	0.251	3.376	557
	HuBERT	SSM	1000	0.403	0.244	0.030	0.214	0.406	0.217	0.037	0.237	3.395	539
	HuBERT	TF	1000	0.351	0.244	0.030	0.205	0.372	0.217	0.037	0.248	3.281	557

Experimental Setup. Within the network, we compare two transformer-based audio feature extractors, Wav2Vec2.0 [4] and HuBERT [22], which are popular in the field of self-supervised speech recognition. Regarding the feature embedding module (Decoder), we evaluate the performance of the SSM-based model [18] in contrast to the transformer (TF) approach [63]. We conduct experiments using two different model configurations. Both of them employ the same diffusion architecture while differing in the feature extractor setup. The base model incorporates Wav2Vec2.0-base/HuBERT-base as the audio feature extractor, with a model dimension of 768 and an 8-layer TF/SSM decoder. The large model utilizes Wav2Vec2.0-large/HuBERT-large, with a model dimension of 1,024 and a 16-layer TF/SSM decoder.

Step	FID $\downarrow$	FGD $\downarrow$	WGD $\downarrow$	Smooth $\downarrow$
5	3.540	0.361	0.237	0.354
10	3.682	0.363	0.236	0.310
100	3.438	0.366	0.232	0.254
300	3.360	0.348	0.233	0.240
1000	3.281	0.362	0.231	0.227

Table 4: Ablation study on denoising steps.

Quantitative Results. Tab. 3 provides the results of existing methods and our baseline under various experimental settings on the PianoMotion10M dataset. Due to the absence of prior work on generating gestures for piano music, we refer to the network structures of EmoTalk [44] and LivelySpeaker [76] and re-implement them for our task. EmoTalk directly generates poses from audio, while LivelySpeaker utilizes an MLP-based diffusion backbone for gesture generation. Our baseline models achieve better overall fidelity of hand motions, as measured by FID, by estimating positions and gestures separately. Among our baseline models, TF-based models outperform the SSM-based models in processing our time-series information, especially in the large configuration. For the audio feature extractor, HuBERT slightly outperforms Wav2Vec2.0. Additionally, we conduct ablation experiments on different numbers of denoising steps, and our model also achieves competitive results with fewer steps, as shown in Tab. 4.

Qualitative Results. Fig. 4 demonstrates the visual results of our baseline and the existing models. The output of EmoTalk method exhibits a static average gesture. While MLPs afford LivelySpeaker rapid inference speed, they compromise the fidelity of generated motions. Conversely, our model demonstrates notably superior performance compared to previous methods by taking advantage of a two-stage approach, as illustrated in Fig. 4. To attain accurate positional information, we utilize an end-to-end position predictor rather than making use of an uncontrollable diffusion model for position generation. Furthermore, we employ a position-guided approach with a diffusion-based gesture generator for hand motion estimation.

6 Conclusion

In this work, we present PianoMotion10M, a new large-scale piano-motion dataset for hands, which has 116 hours of piano music and 10 million frames annotated with hand poses. We address the critical issue that current datasets are insufficient for human-piano interaction. Based on PianoMotion10M, we develop a baseline model that maps piano music pieces to hand motions. To simplify the learning target, we divide the motion into position and gesture by introducing a position predictor along with a gesture generator guided by the estimated positions. We propose evaluation metrics to measure the fidelity and smoothness of the generated hand motions compared to the ground truth. Our dataset and benchmark will further advance the automation of piano performance simulation and assist in learning to play the piano.

References

[1] Ahuja, C., Lee, D.W., Nakano, Y.I., Morency, L.P.: Style transfer for co-speech gesture animation: A multi-speaker conditional-mixture approach. In: ECCV. pp. 248–265 (2020)
[2] Alemi, O., Françoise, J., Pasquier, P.: Groovenet: Real-time music-driven dance movement generation using artificial neural networks. Networks 8(17), 26 (2017)
[3] Azadi, S., Shah, A., Hayes, T., Parikh, D., Gupta, S.: Make-an-animation: Large-scale text-conditional 3d human motion generation. In: ICCV. pp. 15039–15048 (2023)
[4] Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS 33, 12449–12460 (2020)
[5] Balliauw, M., Herremans, D., Palhazi Cuervo, D., Sörensen, K.: A variable neighborhood search algorithm to generate piano fingerings for polyphonic sheet music. ITOR 24(3), 509–535 (2017)
[6] Bhattacharya, U., Rewkowski, N., Banerjee, A., Guhan, P., Bera, A., Manocha, D.: Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents. In: VR. pp. 1–10 (2021)
[7] Cao, Y., Tien, W.C., Faloutsos, P., Pighin, F.: Expressive speech-driven facial animation. TOG 24(4), 1283–1302 (2005)
[8] Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: CVPR. pp. 7291–7299 (2017)
[9] Cassell, J., Vilhjálmsson, H.H., Bickmore, T.: Beat: the behavior expression animation toolkit. In: SIGGRAPH. pp. 477–486 (2001)
[10] Chen, L., Maddox, R.K., Duan, Z., Xu, C.: Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In: CVPR. pp. 7832–7841 (2019)
[11] Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: CVPR. pp. 18000–18010 (2023)
[12] Chiu, C.C., Morency, L.P., Marsella, S.: Predicting co-verbal gestures: A deep and temporal modeling approach. In: Intelligent Virtual Agents: 15th International Conference. pp. 152–166 (2015)
[13] Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A., Black, M.J.: Capture, learning, and synthesis of 3d speaking styles. In: CVPR. pp. 10101–10111 (2019)
[14] Fan, Y., Lin, Z., Saito, J., Wang, W., Komura, T.: Faceformer: Speech-driven 3d facial animation with transformers. In: CVPR. pp. 18770–18780 (2022)
[15] Ferreira, J.P., Coutinho, T.M., Gomes, T.L., Neto, J.F., Azevedo, R., Martins, R., Nascimento, E.R.: Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio. Computers & Graphics 94, 11–21 (2021)
[16] Gan, Q., Li, W., Ren, J., Zhu, J.: Fine-grained multi-view hand reconstruction using inverse rendering. In: AAAI. pp. 1779–1787 (2024)
[17] Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., Malik, J.: Learning individual styles of conversational gesture. In: CVPR. pp. 3497–3506 (2019)
[18] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023)
[19] Guo, Y., Chen, K., Liang, S., Liu, Y.J., Bao, H., Zhang, J.: Ad-nerf: Audio driven neural radiance fields for talking head synthesis. In: ICCV. pp. 5784–5794 (2021)
[20] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33, 6840–6851 (2020)
[21] Hong, F.T., Zhang, L., Shen, L., Xu, D.: Depth-aware generative adversarial network for talking head video generation. In: CVPR. pp. 3397–3406 (2022)
[22] Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: Hubert: Self-supervised speech representation learning by masked prediction of hidden units. TASLP 29, 3451–3460 (2021)
[23] Huang, C.M., Mutlu, B.: Robot behavior toolkit: generating effective social behaviors for robots. In: HRI. pp. 25–32 (2012)
[24] Huang, R., Hu, H., Wu, W., Sawada, K., Zhang, M., Jiang, D.: Dance revolution: Long-term dance generation with music via curriculum learning. arXiv preprint arXiv:2006.06119 (2020)
[25] Ji, X., Zhou, H., Wang, K., Wu, W., Loy, C.C., Cao, X., Xu, F.: Audio-driven emotional video portraits. In: CVPR. pp. 14080–14089 (2021)
[26] Kalman, R.E.: A new approach to linear filtering and prediction problems (1960)
[27] Kolouri, S., Rohde, G.K., Hoffmann, H.: Sliced wasserstein distance for learning gaussian mixture models. In: CVPR. pp. 3427–3436 (2018)
[28] Kong, Q., Li, B., Song, X., Wan, Y., Wang, Y.: High-resolution piano transcription with pedals by regressing onset and offset times. TASLP 29, 3707–3717 (2021)
[29] Kopp, S., Krenn, B., Marsella, S., Marshall, A.N., Pelachaud, C., Pirker, H., Thórisson, K.R., Vilhjálmsson, H.: Towards a common framework for multimodal generation: The behavior markup language. In: Intelligent Virtual Agents: 6th International Conference. pp. 205–217 (2006)
[30] Kwon, T., Tekin, B., Stühmer, J., Bogo, F., Pollefeys, M.: H2o: Two hands manipulating objects for first person interaction recognition. In: ICCV. pp. 10138–10148 (2021)
[31] Lee, S., Chung, J., Yu, Y., Kim, G., Breuel, T., Chechik, G., Song, Y.: Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning. In: ICCV. pp. 10274–10284 (2021)
[32] Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Ai choreographer: Music conditioned 3d dance generation with aist++. In: ICCV. pp. 13401–13412 (2021)
[33] Lin, C.C., Liu, D.S.M.: An intelligent virtual piano tutor. In: VRCIA. pp. 353–356 (2006)
[34] Liu, X., Wu, Q., Zhou, H., Xu, Y., Qian, R., Lin, X., Zhou, X., Wu, W., Dai, B., Zhou, B.: Learning hierarchical cross-modal association for co-speech gesture generation. In: CVPR. pp. 10462–10472 (2022)
[35] Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M.G., Lee, J., et al.: Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)
[36] Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In: ICCV. pp. 2630–2640 (2019)
[37] Moon, G., Saito, S., Xu, W., Joshi, R., Buffalini, J., Bellan, H., Rosen, N., Richardson, J., Mize, M., De Bree, P., et al.: A dataset of relighted 3d interacting hands. NeurIPS 36 (2023)
[38] Moon, G., Shiratori, T., Lee, K.M.: Deephandmesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. In: ECCV. pp. 440–455 (2020)
[39] Moon, G., Yu, S.I., Wen, H., Shiratori, T., Lee, K.M.: Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In: ECCV. pp. 548–564 (2020)
[40] Moryossef, A., Elazar, Y., Goldberg, Y.: At your fingertips: Extracting piano fingering instructions from videos. arXiv preprint arXiv:2303.03745 (2023)
[41] Nakamura, E., Sagayama, S.: Automatic piano reduction from ensemble scores based on merged-output hidden markov model. In: ICMC (2015)
[42] Pavlakos, G., Shan, D., Radosavovic, I., Kanazawa, A., Fouhey, D., Malik, J.: Reconstructing hands in 3d with transformers. arXiv preprint arXiv:2312.05251 (2023)
[43] Pearson, R.K., Neuvo, Y., Astola, J., Gabbouj, M.: Generalized hampel filters. EURASIP 2016, 1–18 (2016)
[44] Peng, Z., Wu, H., Song, Z., Xu, H., Zhu, X., He, J., Liu, H., Fan, Z.: Emotalk: Speech-driven emotional disentanglement for 3d face animation. In: ICCV. pp. 20687–20697 (2023)
[45] PianoPlayer: (2018), https://github.com/marcomusy/pianoplayer
[46] Ren, X., Li, H., Huang, Z., Chen, Q.: Self-supervised dance video synthesis conditioned on music. In: ACM MM. pp. 46–54 (2020)
[47] Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. ACM TOG pp. 245:1–245:17 (2017)
[48] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: LNCS. pp. 234–241 (2015)
[49] Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. IJCV 40, 99–121 (2000)
[50] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022)
[51] Schafer, R.W.: What is a savitzky-golay filter?[lecture notes]. IEEE Signal processing magazine 28(4), 111–117 (2011)
[52] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
[53] Shiratori, T., Nakazawa, A., Ikeuchi, K.: Dancing-to-music character animation. In: CGF. pp. 449–458 (2006)
[54] Siyao, L., Yu, W., Gu, T., Lin, C., Wang, Q., Qian, C., Loy, C.C., Liu, Z.: Bailando: 3d dance generation by actor-critic gpt with choreographic memory. In: CVPR. pp. 11050–11059 (2022)
[55] Srinivasan, K., Raman, K., Chen, J., Bendersky, M., Najork, M.: Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In: SIGIR. pp. 2443–2449 (2021)
[56] Starke, S., Mason, I., Komura, T.: Deepphase: Periodic autoencoders for learning motion phase manifolds. TOG 41(4), 1–13 (2022)
[57] Sun, G., Wong, Y., Cheng, Z., Kankanhalli, M.S., Geng, W., Li, X.: Deepdance: music-to-dance motion choreography with adversarial learning. TOMM 23, 497–509 (2020)
[58] Tang, T., Jia, J., Mao, H.: Dance with melody: An lstm-autoencoder approach to music-oriented dance synthesis. In: ACM MM. pp. 1598–1606 (2018)
[59] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)
[60] Tian, G., Yuan, Y., Liu, Y.: Audio2face: Generating speech/face animation from single audio with attention-based bidirectional lstm networks. In: ICMEW. pp. 366–371. IEEE (2019)
[61] Tian, L., Wang, Q., Zhang, B., Bo, L.: Emo: Emote portrait alive-generating expressive portrait videos with audio2video diffusion model under weak conditions. arXiv preprint arXiv:2402.17485 (2024)
[62] Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. NeurIPS 30 (2017)
[63] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. NeurIPS 30 (2017)
[64] Wang, J., Qiu, K., Peng, H., Fu, J., Zhu, J.: Ai coach: Deep human pose estimation and analysis for personalized athletic training assistance. In: ACM MM. pp. 374–382 (2019)
[65] Wang, J., Mueller, F., Bernard, F., Sorli, S., Sotnychenko, O., Qian, N., Otaduy, M.A., Casas, D., Theobalt, C.: Rgb2hands: real-time tracking of 3d hand interactions from monocular rgb video. ACM TOG 39(6), 1–16 (2020)
[66] Wang, Z., Veličković, P., Hennes, D., Tomašev, N., Prince, L., Kaisers, M., Bachrach, Y., Elie, R., Wenliang, L.K., Piccinini, F., et al.: Tacticai: an ai assistant for football tactics. Nature communications 15(1), 1–13 (2024)
[67] Wu, E., Nishioka, H., Furuya, S., Koike, H.: Marker-removal networks to collect precise 3d hand data for rgb-based estimation and its application in piano. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2977–2986 (2023)
[68] Yalta, N., Watanabe, S., Nakadai, K., Ogata, T.: Weakly-supervised deep recurrent neural networks for basic dance step generation. In: IJCNN. pp. 1–8 (2019)
[69] Yonebayashi, Y., Kameoka, H., Sagayama, S.: Automatic decision of piano fingering based on a hidden markov models. In: IJCAI. vol. 7, pp. 2915–2921 (2007)
[70] Yoon, Y., Cha, B., Lee, J.H., Jang, M., Lee, J., Kim, J., Lee, G.: Speech gesture generation from the trimodal context of text, audio, and speaker identity. TOG 39(6), 1–16 (2020)
[71] Yuan, Y., Song, J., Iqbal, U., Vahdat, A., Kautz, J.: Physdiff: Physics-guided human motion diffusion model. In: ICCV. pp. 16010–16021 (2023)
[72] Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.: Few-shot adversarial learning of realistic neural talking head models. In: ICCV. pp. 9459–9468 (2019)
[73] Zhang, C., Zhao, Y., Huang, Y., Zeng, M., Ni, S., Budagavi, M., Guo, X.: Facial: Synthesizing dynamic talking face with implicit attribute learning. In: ICCV. pp. 3867–3876 (2021)
[74] Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. TPAMI (2024)
[75] Zhang, M., Guo, X., Pan, L., Cai, Z., Hong, F., Li, H., Yang, L., Liu, Z.: Remodiffuse: Retrieval-augmented motion diffusion model. In: ICCV. pp. 364–373 (2023)
[76] Zhi, Y., Cun, X., Chen, X., Shen, X., Guo, W., Huang, S., Gao, S.: Livelyspeaker: Towards semantic-aware co-speech gesture generation. In: ICCV. pp. 20807–20817 (2023)
[77] Zhu, Y., Olszewski, K., Wu, Y., Achlioptas, P., Chai, M., Yan, Y., Tulyakov, S.: Quantized gan for complex music generation from dance videos. In: ECCV. pp. 182–199 (2022)
[78] Zhuang, W., Wang, C., Chai, J., Wang, Y., Shao, M., Xia, S.: Music2dance: Dancenet for music-driven dance generation. TOMM 18(2), 1–21 (2022)
[79] Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: Freihand: A dataset for markerless capture of hand pose and shape from single rgb images. In: ICCV. pp. 813–822 (2019)

Appendix

In this appendix, we provide further details and discussions on our proposed dataset and benchmark:

• §A: More statistics of PianoMotion10M;
• §B: More visual samples of PianoMotion10M dataset;
• §C: More details of our baseline;
• §D: Limitation and future work;
• §E: Ethical considerations;
• §F: Author statement;
• §G: License and consent with public resources.

Appendix A More Statistics of PianoMotion10M

Fig. 5 presents a detailed statistical analysis of piano fingerings, which focuses on the frequency of note clicks and the distribution of volume levels.

Note Click Counts. The top figure in Fig. 5 displays the frequency of each note played, measured in millions. The distribution spans 128 keys and shows higher counts in specific note ranges, which indicates that those keys are used more frequently in performances.

Notes around the middle of the keyboard, particularly from C4 to C6, exhibit significantly higher click counts, in line with their common use in melodies and harmonic accompaniments. Conversely, notes in the low (A0 to B1) and high (C7 to C8) octaves have markedly fewer clicks, as these ranges are used less frequently and are typically reserved for specific musical effects or embellishments. It is worth noting that certain notes, particularly those fundamental to common chords and scales (e.g., A, C, E, and G), occur more often, reflecting their frequent use in various musical pieces.
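A per-note click count of this kind could be computed along the following lines. This is a minimal sketch, assuming that the 128 bins correspond to MIDI note numbers (0 to 127, with middle C, note 60, written as C4) and that the transcribed notes are available as a flat list of pitches; the helper names are hypothetical and not part of the released code.

```python
from collections import Counter
from typing import Iterable, List

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_name(pitch: int) -> str:
    """Convert a MIDI note number to scientific pitch notation (60 -> C4)."""
    return f"{NOTE_NAMES[pitch % 12]}{pitch // 12 - 1}"


def click_counts(pitches: Iterable[int]) -> List[int]:
    """Count how often each of the 128 MIDI note numbers is played."""
    counts = Counter(pitches)
    return [counts.get(p, 0) for p in range(128)]


def share_in_range(counts: List[int], low: int, high: int) -> float:
    """Fraction of all clicks whose pitch lies in [low, high], inclusive."""
    total = sum(counts)
    return sum(counts[low:high + 1]) / total if total else 0.0
```

Under this convention a standard piano covers notes 21 (A0) to 108 (C8), and the C4 to C6 register corresponds to `share_in_range(counts, 60, 84)`.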

Volume Distribution. In addition to note frequency, the bottom figure in Fig. 5 presents the distribution of volume levels, binned into ranges to highlight the dynamics of piano playing. Volume counts, measured in millions, provide insights into the intensity and expression captured in our dataset.

There is a higher count of notes played at moderate volume levels. This reflects the natural dynamics of piano playing, where most notes are neither extremely soft nor loud.
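The volume histogram can be sketched in the same way. The example below assumes that volume is given as MIDI velocity (an integer from 0 to 127) and uses a bin width of 16, which is an arbitrary choice for illustration and not necessarily the binning used in Fig. 5.

```python
from typing import Iterable, List


def velocity_histogram(velocities: Iterable[int], bin_width: int = 16) -> List[int]:
    """Bin MIDI velocities (0..127) into equal-width ranges and count notes per bin."""
    n_bins = (128 + bin_width - 1) // bin_width  # ceiling division
    hist = [0] * n_bins
    for v in velocities:
        if not 0 <= v <= 127:
            raise ValueError(f"velocity out of MIDI range: {v}")
        hist[v // bin_width] += 1
    return hist
```

With the default width, bin 0 covers velocities 0 to 15 and bin 7 covers 112 to 127, so a peak in the central bins corresponds to the moderate volume levels described above.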

Appendix B More Visual Samples of PianoMotion10M Dataset

Appendix C More Details of Our Baseline

The audio feature extractor $\Phi_{a}$ maps the audio $A$ to the feature vector $f_{a}$. We use the pre-trained Wav2Vec2.0 [4] and HuBERT [22] models developed by Facebook AI for $\Phi_{a}$. Both models are pre-trained on extensive unlabeled audio data to learn high-dimensional speech representations. HuBERT further uses pseudo-labels as prediction targets in its self-supervised pretraining. In our experiments, we use wav2vec2-base-960h and hubert-base-ls960 for the base model, and wav2vec2-large-960h-lv60-self and hubert-large-ls960-ft for the large model as the audio feature extractor $\Phi_{a}$.
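The choice of checkpoint per configuration can be written as a small lookup. This is a sketch under assumptions: the `facebook/` prefix assumes the checkpoints are fetched from the Hugging Face Hub, the helper names are hypothetical, and loading through `transformers.AutoModel` is one possible route rather than the loading code of the released baseline.

```python
# Checkpoint names follow the text; the "facebook/" Hub prefix is an assumption.
CHECKPOINTS = {
    ("wav2vec2", "base"): "facebook/wav2vec2-base-960h",
    ("hubert", "base"): "facebook/hubert-base-ls960",
    ("wav2vec2", "large"): "facebook/wav2vec2-large-960h-lv60-self",
    ("hubert", "large"): "facebook/hubert-large-ls960-ft",
}


def checkpoint_for(extractor: str, size: str) -> str:
    """Return the checkpoint identifier for an (extractor, size) pair."""
    key = (extractor.lower(), size.lower())
    if key not in CHECKPOINTS:
        raise ValueError(f"unknown audio feature extractor configuration: {key}")
    return CHECKPOINTS[key]


def load_audio_encoder(extractor: str, size: str):
    """Load the pre-trained encoder (requires `transformers` and network access)."""
    from transformers import AutoModel  # imported lazily so the lookup works without it

    return AutoModel.from_pretrained(checkpoint_for(extractor, size))
```

For instance, `checkpoint_for("hubert", "large")` selects the HuBERT checkpoint used for the large model.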

Appendix D Limitation and Future Work

Our dataset closely associates piano music with hand movements. Due to data diversity, we have not yet aligned piano positions in all videos. We plan to engage extra experts to annotate piano key positions for more precise spatial alignment in the next version. Additionally, the variance in piano tones across recordings may also affect the baseline model’s performance.

PianoMotion10M provides piano music, corresponding MIDI files, and hand poses, offering researchers a valuable resource for studying human-piano interaction. This dataset enables the analysis of piano music through hand gestures and the generation of hand poses from audio tracks. With PianoMotion10M, we hope to benefit and facilitate further research in relevant fields.

Appendix E Ethical Considerations

Piano motion datasets may pose significant privacy challenges, particularly concerning the identifiable aspects of a pianist, mainly the hands, during piano performance. Our dataset comprises videos uploaded by users on Bilibili, which are publicly accessible. To address privacy risks, we adopt a strict policy of releasing only video IDs, not the videos or images themselves. Furthermore, we employ the MANO hand prior model [47] as a form of robust anonymization. This approach protects personal information, mitigates the risk of individual identification, and minimizes privacy concerns. With these precautions, our dataset enables valuable research in piano motion analysis and generation.

Appendix F Author Statement

The authors bear all responsibility in case of violation of rights. We confirm that the PianoMotion10M dataset is open-sourced under the CC BY-NC 4.0 International license and the released code is publicly available under the Apache-2.0 license, ensuring open access and permissive usage for academic and research purposes.

Appendix G License and Consent with Public Resources

G.1 Tools for Annotation

The piano audio was transcribed with piano_transcription_inference [28], and the hand poses in the videos, parameterized by the MANO model [47], were annotated with MediaPipe [35] and HaMeR [42]:

• piano_transcription_inference (https://github.com/bytedance/piano_transcription): Apache License 2.0
• MANO (https://mano.is.tue.mpg.de/): CC BY-NC
• MediaPipe (https://ai.google.dev/edge/mediapipe): Apache License 2.0
• HaMeR (https://github.com/geopavlakos/hamer): MIT License

G.2 Models for Baseline

Pre-trained Wav2Vec2.0 [4] and HuBERT [22] were used as audio feature extractors, while a transformer [63] and an SSM-based model [18] were employed as our feature decoders:

• Wav2Vec2.0 (https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec): MIT license
• HuBERT (https://github.com/pytorch/fairseq/tree/master/examples/hubert): MIT license
• Transformer (https://github.com/huggingface/transformers): Apache License 2.0
• SSM (https://github.com/state-spaces/mamba): Apache License 2.0

G.3 Re-evaluated Methods

In the experimental section, we evaluated EmoTalk [44] and LivelySpeaker [76]; both models were reproduced using the official code:

• EmoTalk (https://github.com/psyai-net/EmoTalk_release): CC BY-NC 4.0
• LivelySpeaker (https://github.com/zyhbili/LivelySpeaker): Unknown
