Guoheng Huang [1], Xuhang Chen [2]. Affiliations: [1] Guangdong University of Technology, Guangdong, China; [2] Huizhou University, Guangdong, China; [3] Guangdong Mechanical and Electrical College, Guangdong, China; [4] University of Western Australia, WA, Australia

QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation

Zhizhen Zhou, Ye**g Huo, An Zeng, Lian Huang, Zinuo Li

Abstract

The study of music-generated dance is a novel and challenging image generation task. Given a piece of music and seed motions as input, it aims to generate natural dance movements for the subsequent music. Transformer-based methods face challenges in time-series prediction tasks related to human movements and music because they struggle to capture nonlinear relationships and temporal aspects. This can lead to issues such as joint deformation, role deviation, floating, and inconsistencies between the generated dance movements and the music. In this paper, we propose a Quaternion-Enhanced Attention Network (QEAN) for visual dance synthesis from a quaternion perspective, which consists of a Spin Position Embedding (SPE) module and a Quaternion Rotary Attention (QRA) module. First, SPE embeds position information into self-attention in a rotational manner, leading to better learning of features of movement sequences and audio sequences, and improved understanding of the connection between music and dance. Second, QRA represents and fuses 3D motion features and audio features in the form of a series of quaternions, enabling the model to better learn the temporal coordination of music and dance under the complex temporal cycle conditions of dance generation. Finally, we conduct experiments on the AIST++ dataset, and the results show that our approach achieves better and more robust performance in generating accurate, high-quality dance movements. Our source code and the dataset are available at https://github.com/MarasyZZ/QEAN and https://google.github.io/aistplusplus_dataset, respectively.

keywords:

Dance generation, Multi-modal task, Quaternion network, Time-series prediction task, Animation generation task

1 Introduction

Figure 1: The motivation of our method. We compare the effectiveness of our method with other approaches in generating dance movements from seed motions. In the top row, labeled “other methods”, two sets of images show seed movements transforming into unnatural final poses characterized by joint deformation and character drift. In the bottom row, labeled “our method”, we show how applying Pre-quaternion parameterization (P), Spin Position Embedding (S), and Quaternion Attention (Q) yields natural-looking final poses. Each prediction produced by our method successfully learns the correlations between dance and music.

Dancing is a universal language across all cultures [1, 2] and is used by many as a powerful means of self-expression on online media platforms, becoming a dynamic tool for disseminating information on the Internet. Although dance is an art form, it requires professional practice and training to give dancers a rich expressive voice [3]. Therefore, from a computational point of view, music-conditioned 3D dance generation [4, 5, 6, 7] has become a critical task that promises to open up a variety of practical applications. However, creating satisfying dance sequences that harmonize with specific music and body structures faces challenges because of our lack of understanding of the timing of human movements and the connection between music and dance. Overcoming these challenges is essential to achieve fluid movements with a high degree of kinematic complexity while ensuring consistency with the complex non-linear relationships of the accompanying music.

With the continuous advancement of deep learning, many deep learning methods [6, 7, 8, 9, 10, 11] have been applied to dance generation. RNN-based methods [8, 10] were first used to simulate human dances, but they suffer from static poses and error accumulation, especially when the input data varies. Subsequently, some researchers used Variational Auto-Encoders (VAE) and Generative Adversarial Networks (GAN) to model 2D dance movements [12], and LSTM-Auto-Encoders were later used to model 3D dance movements directly from musical features [13]. Although such approaches mitigate the error accumulation of RNN methods, they are unstable and prone to regressing to non-standard poses.

In recent years, the Transformer [12, 13, 14, 15, 16, 17, 18] has been widely adopted in natural language processing and visual processing, and some studies [19, 20] have made great progress in generating high-quality dance movements given a piece of music. However, because the Transformer inadequately models the temporal dependence of sequences when dealing with time-series data, the generated dances suffer from problems such as drifting and foot slipping (as shown in Fig. 1). Moreover, existing Transformer-based approaches do not fully model the non-linear relationship between music and dance.

Quaternions are widely used as a mathematical tool for rotational expression and gesture control [21]. In view of this, we believe that the introduction of quaternions in the field of dance generation may be a promising approach. Compared to traditional Euler angles, quaternions are more effective in avoiding the “Gimbal Lock” problem and improving the stability of gesture representations. We expect to use quaternions to more accurately adapt dance movements to the rhythm and emotion of the music. By combining musical features with the correlation of quaternions, we can more accurately capture the complex relationship between music and dance.

In this paper, to address these challenges, we propose a Quaternion-Enhanced Attention Network for multi-modal dance synthesis (shown in Fig. 2). The network mainly consists of a Spin Position Embedding (SPE) module and a Quaternion Rotary Attention (QRA) module. The SPE module is used in the Transformer structures of the network and embeds position information into self-attention in the form of relative positions. The audio and motion features are extracted by separate Transformer structures, and the SPE module combines the advantages of relative and absolute position coding to strengthen the model’s representation of sequence features. The extracted audio and motion features are expanded to four dimensions by quaternion parameterization, and the splicing operation is then performed through the Quaternion Rotary Attention module. Compared to the work of Li [19], the proposed SPE makes better use of positional information. In addition, the QRA module better learns the relationship between audio and movement, improving the quality of the generated dances with good robustness.

The contribution of the proposed network can be summarized as follows:

1. In this paper, we introduce a Quaternion-Enhanced Attention Network (QEAN) for generating multimodal dances. This addresses challenges seen in current methods, like awkward joint movements and character inconsistency. QEAN uses quaternion operations to better capture the complex relationship between music and dance, improving the modeling of temporal dependencies.

2. We introduce the Spin Position Embedding (SPE) module, which computes query and key vectors for features, applies rotational operations, and embeds results into self-attention. SPE addresses limitations of traditional position encoding by introducing relative position encoding based on rotations, enhancing modeling for variable-length sequences while solving length consistency and overfitting issues. Additionally, relative position information improves modeling of intrinsic feature associations, significantly enhancing the model’s representation and utilization of temporal order in human motion.

3. We introduce the quaternion perspective and propose the Quaternion Rotary Attention (QRA) module. The QRA module maps audio and motion features to the quaternion space and explores the intrinsic correlation between the two using Hamiltonian multiplication, which enables the model to better learn the temporal coordination between music and dance, and generate smooth and natural dances coordinated with the music tempo.

4. Experimental results on the AIST++ dataset demonstrate that our proposed network is capable of effectively learning the connection between audio and movement, leading to the generation of higher quality dance movements. It outperforms other current state-of-the-art methods in terms of dance quality.

2 Related Work

3D Human Motion Synthesis The research on generating realistic and controllable 3D human motion sequences, as discussed in [8, 22, 23, 24], has seen significant advancements in recent years. Initial efforts utilized statistical models like kernel-based probability distributions [25] to synthesize motion, but these methods tended to oversimplify motion details. A subsequent breakthrough came with the introduction of the motion graph approach [26], which addressed this limitation by adopting a non-parametric method. This technique involved constructing a directed graph using motion capture datasets, where each node represented a pose, and edges denoted transitions between poses. Motion generation was achieved through random walks on this graph. However, a notable challenge in motion graphs was the generation of plausible transitions, and certain methods sought to overcome this by introducing parameterizations for transitions [27]. As deep learning gained prominence, several approaches explored the use of neural networks trained on extensive motion capture datasets to generate 3D motion. Various network architectures, including CNNs [23, 28], GANs [29], VAE [30], RNNs [6, 20], and Transformers [4, 20] have been investigated. While auto-regressive models like RNNs and pure Transformers [31] theoretically have the capacity to generate infinite motion, practical challenges such as mean regression arise. This phenomenon leads to motion “freezing” or drifting into unnatural movements after several iterations. To address this, some studies [31, 32] propose periodic usage of the network’s output as input during the training process. Additionally, Phase Function Neural Networks and their variants have been introduced [33, 34] to tackle the mean regression issue by conditioning network weights on the phase. However, their scalability in representing diverse movements is limited.

Music-Driven Dance Generation In recent years, data-driven deep learning has become the dominant technique for dance generation. Joao [35] used graph convolutional networks to learn from a variety of dance datasets and generate new dance sequences that are smooth and continuous. This deep learning approach significantly improves the quality and continuity of the generated dances. Holden [36], Qiu [37] and Starke [38] built on deep learning by strengthening long-term dependency modeling as a means of generating more coherent long dance sequences. Common approaches include integrating skeleton information and employing attention mechanisms. Li [19] built on that previous work by proposing the Full Attention Cross-Modal Transformer model (FACT), which can generate non-freezing, high-quality 3D motion sequences conditioned on music by learning audio-motion correspondences.

Quaternion Networks Quaternion Neural Networks have made significant strides in various domains of deep learning, such as few-shot segmentation [39], human motion synthesis [21], and multi-sensor signal processing. As in the task discussed in this paper, quaternion representations are employed in neural network architectures as a parameterization for rotations. Quaternion networks, exemplified by QuaterNet [21], utilize quaternions to represent joint rotations in both Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). This approach addresses the discontinuity issues associated with Euler angles, achieving outstanding performance in long-term prediction tasks. In the context of our work, focused on music-driven dance generation, we propose constructing a learning process for the correlation between music and dance. This is essential as it requires consideration of the non-linear characteristics of both motion and music. Therefore, our method involves exploring the relationship between audio and motion features using quaternions. By leveraging quaternions, we aim to enhance the correlation between audio and motion, facilitating the generation of high-quality dance sequences.

3 Methods

3.1 Overview of QEAN

In this paper, we propose a Quaternion-Enhanced Attention Network (QEAN) for generating high-quality dances under musical conditions, as illustrated in Fig. 2.

We are given random motion seeds Y of length 120 frames and audio features Z of length 240 frames, where Y can be denoted as $Y=\{y_{1},y_{2},\ldots,y_{t}\}$ and Z can be denoted as $Z=\{z_{1},z_{2},\ldots,z_{t^{\prime}}\}$ . Our objective is to generate a sequence of future motion from t+1 to $t^{\prime}$ , $Y^{\prime}=\{y_{t+1},y_{t+2}\ldots y_{t^{\prime}}\}$ , where $t^{\prime}\gg t$ . QEAN first utilizes the two input transformers, the motion transformer $f_{mot}$ and the audio transformer $f_{audio}$ , to encode features and generate motion and audio embeddings, represented as $h^{y}_{1:T}$ and $h^{z}_{1:T^{\prime}}$ , respectively. Next, these two embedded features are combined and subjected to a quaternion parameterization operation (see 3.2 for details). This operation maps the features to four dimensions and embeds the information into the self-attention in a rotational manner using Spin Position Embedding (see 3.3 for details). Finally, the fused features are learned by a Quaternion Rotary Attention Transformer (see 3.4 for details) to generate the corresponding dance movements.

3.2 Quaternion Algorithms and Quaternion Parameterization

We begin by elucidating the fundamental concepts of quaternions crucial for understanding the context of this paper. Quaternions, classified as hyper-complex numbers of rank 4, stand out as a direct and non-commutative extension of complex-valued numbers. In our proposed methodologies, the intricate interplay between Hamilton products and quaternion algebra emerges as the linchpin, forming the cornerstone of our innovative approaches. This exploration of quaternion principles lays the groundwork for the subsequent discussions and applications detailed in this study.

A quaternion Q in quaternion domain D, Q $\in$ D, can be represented as:

Q=e+f\textbf{i}+g\textbf{j}+h\textbf{k}

(1)

where $e$, $f$, $g$ and $h$ are real numbers, and $\textbf{i}$, $\textbf{j}$ and $\textbf{k}$ are the quaternion unit basis. In a quaternion, $e$ is the real part and $f\textbf{i}+g\textbf{j}+h\textbf{k}$ is the imaginary part, with $\textbf{i}^{2}=\textbf{j}^{2}=\textbf{k}^{2}=\textbf{i}\textbf{j}\textbf{k}=-1$ .

A pure quaternion is a quaternion whose real part is 0, resulting in the vector Q=fi+gj+hk. Operations on quaternions are defined as follows.

The addition of two Quaternions is defined as:

Q+R=Q_{e}+R_{e}+(Q_{f}+R_{f})\mathbf{i}+(Q_{g}+R_{g})\mathbf{j}+(Q_{h}+R_{h})\mathbf{k}

(2)

where the subscripts on Q and R denote the real and imaginary components of the quaternions Q and R.

The Multiplication with scalar $\gamma$ can be defined as:

\gamma Q=\gamma e+\gamma f\mathbf{i}+\gamma g\mathbf{j}+\gamma h\mathbf{k}

(3)

The conjugate $Q^{\star}$ of $Q$ can be defined as:

Q^{\star}=e-f\mathbf{i}-g\mathbf{j}-h\mathbf{k}

(4)

The multiplication of quaternions Q and R can be defined as follows:

\begin{split}Q\otimes R=&(Q_{e}R_{e}-Q_{f}R_{f}-Q_{g}R_{g}-Q_{h}R_{h})\\ &+(Q_{f}R_{e}+Q_{e}R_{f}-Q_{h}R_{g}+Q_{g}R_{h})\textbf{i}\\ &+(Q_{g}R_{e}+Q_{h}R_{f}+Q_{e}R_{g}-Q_{f}R_{h})\textbf{j}\\ &+(Q_{h}R_{e}-Q_{g}R_{f}+Q_{f}R_{g}+Q_{e}R_{h})\textbf{k}\end{split}

(5)

The equation above describes how the components of quaternions Q and R interact, which is why the Hamilton product is essential in quaternion neural networks. In this study, we extensively employ the Hamilton product to learn the correlations between music and dance, which forms the foundation of QEAN and enhances its generalization ability.
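The quaternion operations in this subsection can be checked numerically. The following is a minimal sketch of the standard Hamilton product and conjugate on plain tuples `(e, f, g, h)`; the function names `hamilton` and `conjugate` are illustrative choices, not identifiers from the QEAN code.

```python
def hamilton(q, r):
    """Hamilton product of two quaternions given as (e, f, g, h) tuples."""
    qe, qf, qg, qh = q
    re, rf, rg, rh = r
    return (
        qe * re - qf * rf - qg * rg - qh * rh,  # real part
        qf * re + qe * rf - qh * rg + qg * rh,  # i component
        qg * re + qh * rf + qe * rg - qf * rh,  # j component
        qh * re - qg * rf + qf * rg + qe * rh,  # k component
    )


def conjugate(q):
    """Quaternion conjugate: negate the three imaginary components."""
    e, f, g, h = q
    return (e, -f, -g, -h)


I = (0, 1, 0, 0)
J = (0, 0, 1, 0)
K = (0, 0, 0, 1)

ij = hamilton(I, J)      # equals K
ji = hamilton(J, I)      # equals -K, so the product is non-commutative
ijk = hamilton(ij, K)    # equals the real number -1
norm_sq = hamilton((1, 2, 3, 4), conjugate((1, 2, 3, 4)))  # purely real
```

The last line illustrates that multiplying a quaternion by its conjugate leaves only a real part, the squared norm.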

To implement our approach, we combine music and motion features to create a feature vector. Specifically, 35-dimensional music features and 219-dimensional motion features can be combined into a 254-dimensional feature vector through concatenation, based on dimension and time correspondence. Subsequently, we convert this concatenated feature vector into a sequence of quaternions. In this process, each music feature and three-dimensional dance motion feature are broken down into four components, representing a quaternion with a real part and three imaginary parts. As a result, the original 254-dimensional feature vector is transformed into a quaternion sequence with a length of 63 (disregarding the last two dimensions as they are insufficient to form a complete quaternion). Finally, we input this quaternion sequence into the Spin Position Embedding for further processing. In this model, a position encoding is assigned to each quaternion, enabling the capture of position information within the sequence.

By incorporating position information, the model gains a better understanding of the sequence and improves its performance accordingly.
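The dimension bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: the concatenation order (music first, then motion) and the grouping of consecutive values into quaternions are guesses for the sake of the example, and `to_quaternions` is a hypothetical helper.

```python
MUSIC_DIM = 35    # music feature size stated in the text
MOTION_DIM = 219  # motion feature size stated in the text


def to_quaternions(frame):
    """Group a flat per-frame feature vector into 4-tuples (e, f, g, h).

    Trailing values that cannot fill a complete quaternion are dropped.
    """
    usable = len(frame) - len(frame) % 4
    return [tuple(frame[i:i + 4]) for i in range(0, usable, 4)]


# Placeholder values; real features would come from the audio/motion pipeline.
music = [0.0] * MUSIC_DIM
motion = [1.0] * MOTION_DIM
frame = music + motion            # 254-dimensional concatenation (assumed order)
quats = to_quaternions(frame)
print(len(frame), len(quats))     # 254 63
```

Since 254 = 63 × 4 + 2, the final two values are discarded, matching the sequence length of 63 given in the text.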

3.3 Spin Position Embedding

The main types of position embedding are relative position embedding and absolute position embedding. The Transformer model [14], proposed in 2017, introduced positional embedding to provide information about the position of each word or token in the input sequence. This is crucial for NLP tasks that rely heavily on the relative position of words. In the Transformer model, positional embedding encodes different positions using sine and cosine functions, with a different frequency for each dimension, so that the model can learn the relative positions of the tokens in the sequence. This sine and cosine scheme, also known as absolute position coding, is easy to implement and relies directly on the absolute position without position loss. However, it generalises poorly: the model adapts only to absolute position codes of a specific length, is prone to overfitting when the length varies, and performs poorly on long sequences. Relative positional embedding, on the other hand, derives the embedding from the relative distance or order relationship between tokens instead of relying completely on absolute position. This approach highlights the relevance of tokens in terms of content, which is conducive to content comprehension, and avoids the excessive computation caused by the growth of absolute positional embeddings with position. Consequently, it improves generalisation ability.

We are motivated by the work of Jianlin Su [40], who proposed Rotary Position Embedding, a positional embedding method designed to enhance the performance of the Transformer architecture by integrating relative positional information into self-attention. The popular LLaMA 2 [10] model also employs this position embedding approach. Following Su, we embed the extracted audio and motion features into self-attention in the form of rotated positions to better learn their features and improve computational efficiency. The basic idea is shown in Fig. 3.

First, we define a sequence of features of length N: $F_{N}=\{w_{i}\}_{i=1}^{N}$ , where $w_{i}$ represents the i-th token in the input sequence. The embedding corresponding to the input sequence $F_{N}$ is denoted as $E_{N}=\{x_{i}\}_{i=1}^{N}$ , where $x_{i}$ represents the d-dimensional embedding vector of the i-th token $w_{i}$ . Since motion and audio features are processed in the same way by SPE, the operator O in the following equations stands for the operation applied to whichever feature vector is being embedded.

Before performing self-attention operations, we use the feature embedding vectors to calculate the q, k, and v vectors and incorporate the corresponding positional information. The function expressions are as follows:

$\displaystyle q_{s}$	$\displaystyle=O_{q}(x_{s},s)$	(6)
$\displaystyle k_{t}$	$\displaystyle=O_{k}(x_{t},t)$	(7)
$\displaystyle v_{t}$	$\displaystyle=O_{v}(x_{t},t)$	(8)

Here, $q_{s}$ represents the query vector for the s-th token with positional information s integrated into the feature vector $x_{s}$ , while $k_{t}$ and $v_{t}$ represent the key and value vectors for the t-th token with positional information t integrated into the feature vector $x_{t}$ .

Then, we need to compute the output of self-attention for the s-th feature embedding vector $x_{s}$ . This involves calculating an attention score between $q_{s}$ and other $k_{t}$ , and then multiplying the attention score by the corresponding $v_{t}$ , followed by summation to obtain the output vector $o_{s}$ :

	$\displaystyle a_{s,t}=\frac{\exp(\frac{q_{s}^{T}k_{t}}{\sqrt{d}})}{\sum_{j=1}^{N}\exp(\frac{q_{s}^{T}k_{j}}{\sqrt{d}})}$		(9)
	$\displaystyle o_{s}=\sum_{t=1}^{N}a_{s,t}v_{t}$		(10)

Next, in order to leverage the relative positional relationships between the mentioned tokens, let’s assume that the dot product operation between the query vector $q_{s}$ and the key vector $k_{t}$ is represented by a function g . The input to function g includes the word embedding vectors $x_{s}$ , $x_{t}$ and their relative position s-t:

\displaystyle<O_{q}(x_{s},s),O_{k}(x_{t},t)>=g(x_{s},x_{t},s-t)

(11)

We then discover an alternative approach to position embedding that upholds the aforementioned relationship.

	$\displaystyle O_{q}(x_{s},s)=(W_{q}x_{s})e^{is\theta}$		(12)
	$\displaystyle O_{k}(x_{t},t)=(W_{k}x_{t})e^{it\theta}$		(13)
	$\displaystyle g(x_{s},x_{t},s-t)=\mathrm{Re}[(W_{q}x_{s})(W_{k}x_{t})^{*}e^{i(s-t)\theta}]$		(14)

Here, $e$ is the base of the natural logarithm, $i$ is the imaginary unit, $\mathrm{Re}[\cdot]$ takes the real part of a complex number, and $*$ denotes the complex conjugate.

We can use Euler’s formula $e^{ix}=\cos x+i\sin x$ , which holds for any real number $x$ and in which $\cos x$ is the real part and $\sin x$ is the imaginary part of the complex number.

Applying it to the exponential terms in the formulas for O and g gives:

	$\displaystyle e^{is\theta}=cos(s\theta)+isin(s\theta)$		(15)
	$\displaystyle e^{it\theta}=cos(t\theta)+isin(t\theta)$		(16)
	$\displaystyle e^{i(s-t)\theta}=cos((s-t)\theta)+isin((s-t)\theta)$		(17)
	$\displaystyle O_{q}(x_{s},s)=(W_{q}x_{s})e^{is\theta}$		(18)

Then, taking the two-dimensional case for simplicity, we can write $q_{s}$ in matrix form:

	$\displaystyle q_{s}=\begin{pmatrix}q_{s}^{(1)}\\ q_{s}^{(2)}\end{pmatrix}=W_{q}x_{s}=\begin{pmatrix}W_{q}^{(11)}&W_{q}^{(12)}\\ W_{q}^{(21)}&W_{q}^{(22)}\end{pmatrix}\begin{pmatrix}x_{s}^{(1)}\\ x_{s}^{(2)}\end{pmatrix}$		(25)
	$\displaystyle O_{q}(x_{s},s)=(W_{q}x_{s})e^{is\theta}=q_{s}e^{is\theta}$		(26)

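Therefore, treating $q_{s}$ as the complex number $q_{s}^{(1)}+iq_{s}^{(2)}$ and multiplying it by $e^{is\theta}$ , we get the following result, written as a (real part, imaginary part) pair:

\displaystyle q_{s}e^{is\theta}=[q_{s}^{(1)}\cos(s\theta)-q_{s}^{(2)}\sin(s\theta),\;q_{s}^{(2)}\cos(s\theta)+q_{s}^{(1)}\sin(s\theta)]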

(27)

We then observe that the above expression is equal to the query vector multiplied by a rotation matrix:

	$\displaystyle O_{q}(x_{s},s)$	$\displaystyle=(W_{q}x_{s})e^{is\theta}=q_{s}e^{is\theta}$
		$\displaystyle=\begin{pmatrix}\cos(s\theta)&-\sin(s\theta)\\ \sin(s\theta)&\cos(s\theta)\end{pmatrix}\begin{pmatrix}q_{s}^{(1)}\\ q_{s}^{(2)}\end{pmatrix}$		(32)

Similarly, the key vector $k_{t}$ can be represented as follows:

$\displaystyle O_{k}(x_{t},t)$	$\displaystyle=(W_{k}x_{t})e^{it\theta}=k_{t}e^{it\theta}$
	$\displaystyle=\begin{pmatrix}\cos(t\theta)&-\sin(t\theta)\\ \sin(t\theta)&\cos(t\theta)\end{pmatrix}\begin{pmatrix}k_{t}^{(1)}\\ k_{t}^{(2)}\end{pmatrix}$	(36)

By rearranging the above formulas, we can simplify the following expression:

	$\displaystyle<O_{q}(x_{s},s),O_{k}(x_{t},t)>$
	$\displaystyle=\left(\begin{pmatrix}\cos(s\theta)&-\sin(s\theta)\\ \sin(s\theta)&\cos(s\theta)\end{pmatrix}\begin{pmatrix}q_{s}^{(1)}\\ q_{s}^{(2)}\end{pmatrix}\right)^{T}$		(45)
	$\displaystyle\quad\begin{pmatrix}\cos(t\theta)&-\sin(t\theta)\\ \sin(t\theta)&\cos(t\theta)\end{pmatrix}\begin{pmatrix}k_{t}^{(1)}\\ k_{t}^{(2)}\end{pmatrix}$		(50)
	$\displaystyle=\begin{pmatrix}q_{s}^{(1)}&q_{s}^{(2)}\end{pmatrix}\begin{pmatrix}\cos((s-t)\theta)&\sin((s-t)\theta)\\ -\sin((s-t)\theta)&\cos((s-t)\theta)\end{pmatrix}\begin{pmatrix}k_{t}^{(1)}\\ k_{t}^{(2)}\end{pmatrix}$		(56)

With the above formulas, we can summarize the following calculation process: In simple terms, the process of self-attention with Spin Position Embedding involves, for each feature embedding vector in the token sequence, first calculating its corresponding query and key vectors. Then, for each token position, calculate the corresponding rotated position embedding information. After that, apply the rotation transformation to the elements of the query and key vectors for each token position in pairs, and finally, calculate the dot product between the query and key to obtain the result of self-attention.
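The calculation process summarized above can be sketched in a few lines. This is a minimal illustration, not the QEAN implementation: the per-pair frequency schedule is borrowed from the Rotary Position Embedding paper as an assumption, and the query and key values are made up. The point it demonstrates is that the rotated dot product depends only on the offset s - t.

```python
import math


def rotate_pairs(vec, pos, thetas):
    """Rotate consecutive 2-D pairs of vec by the angle pos * theta_i."""
    out = []
    for i, theta in enumerate(thetas):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([a * c - b * s, b * c + a * s])
    return out


def dot(u, v):
    """Plain real dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


d = 4
# Geometric frequency schedule: an assumption, not a value stated in the text.
thetas = [10000 ** (-2 * i / d) for i in range(d // 2)]

q = [0.3, -1.2, 0.8, 0.5]   # hypothetical query vector W_q x_s
k = [1.1, 0.4, -0.7, 0.9]   # hypothetical key vector W_k x_t

# Same offset s - t = 3 at two different absolute positions.
score_a = dot(rotate_pairs(q, 5, thetas), rotate_pairs(k, 2, thetas))
score_b = dot(rotate_pairs(q, 105, thetas), rotate_pairs(k, 102, thetas))
```

Up to floating-point rounding, `score_a` and `score_b` agree, which is the relative-position property the derivation establishes.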

3.4 Quaternion Rotary Attention

For the features produced by the rotated attention module, we assume an N-length query series $\chi$ and an M-length key-value series $\gamma$ . First, the original $\chi$ and $\gamma$ are projected onto the representation space: $Q=\chi{W^{Q}}\in R^{N\times d}$ , $K=\gamma{W^{K}}\in R^{M\times d}$ and $V=\gamma{W^{V}}\in R^{M\times d}$ .

Here, Q represents the queries, K the keys, and V the values, d represents the number of channels in the attention layer, and W represents the trainable weights. QRA then calculates $H=Attn(\chi,\gamma)$ , mapping the query series to the output $H$ using the key-value series, in the following four steps.

Frequency/phase-Generation:

\displaystyle\begin{gathered}\begin{pmatrix}\omega_{1}^{\mathrm{Q}}\\ \cdots\\ \omega_{P}^{\mathrm{Q}}\end{pmatrix},\begin{pmatrix}\theta_{1}^{\mathrm{Q}}\\ \cdots\\ \theta_{P}^{\mathrm{Q}}\end{pmatrix}=\mathrm{Conv}(Q;W_{\omega}^{\mathrm{Q}}),\mathrm{Conv}(Q;W_{\theta}^{\mathrm{Q}}),\\ \begin{pmatrix}\omega_{1}^{\mathrm{K}}\\ \cdots\\ \omega_{P}^{\mathrm{K}}\end{pmatrix},\begin{pmatrix}\theta_{1}^{\mathrm{K}}\\ \cdots\\ \theta_{P}^{\mathrm{K}}\end{pmatrix}=\mathrm{Conv}(K;W_{\omega}^{\mathrm{K}}),\mathrm{Conv}(K;W_{\theta}^{\mathrm{K}}),\end{gathered}

(71)

Series-Rotation:

\displaystyle\begin{aligned} \Phi_{p}(Q,\text{pos}^{\mathrm{Q}})&=\tilde{Q}e^{\mathbf{i}(2\pi\omega_{p}^{\mathrm{Q}}\text{pos}^{\mathrm{Q}}+\theta_{p}^{\mathrm{Q}})},\quad p=1,2,\cdots,P\\ \Psi_{p}(K,\text{pos}^{\mathrm{K}})&=\tilde{K}e^{\mathbf{j}(2\pi\omega_{p}^{\mathrm{K}}\text{pos}^{\mathrm{K}}+\theta_{p}^{\mathrm{K}})},\quad p=1,2,\cdots,P\end{aligned}

(72)

Series-Attention with softmax-kernel (shown in Fig. 5):

\displaystyle S=\text{softmax}\left(\frac{1}{P\sqrt{d}}\sum_{p=1}^{P}\text{Re}[\Phi_{p}(Q,\text{pos}^{\mathrm{Q}})\Psi_{p}(K,\text{pos}^{\mathrm{K}})^{\mathsf{H}}]\right)

(73)

Series-Aggregation:

\displaystyle H=SV

(74)

Here, we hypothesize that the series have $P$ periods, where $P$ is a hyper-parameter. In the frequency/phase-generation step, we use 1D convolutions with ReLU activation to generate $P$ latent frequencies $\omega_{1\sim P}^{\mathrm{Q}}\in[0,+\infty)^{N\times 1}$ ( $\omega_{1{\sim}P}^{\mathrm{K}}$ is similar). Convolutions can effectively capture the local context of each time step to generate reliable latent frequencies, and these latent frequencies differ across time steps, which allows for variable periods. Moreover, to account for phase shifts, we additionally generate $P$ latent phases $\theta_{1\sim P}^{\mathrm{Q}}\in(-\pi,\pi)^{N\times 1}$ using 1D convolutions with activation $\pi\cdot\mathrm{tanh}$ ( $\theta_{1{\sim}P}^{\mathrm{K}}$ is similar). In the series-rotation step, we rotate the representations at each time step according to the latent frequencies and phases learned in the previous step. Each row vector of $\tilde{Q},\tilde{K}$ is the quaternion form of the corresponding row vector of $Q$, $K$, and $pos^{Q}=[0,1,2,\cdots,N-1]^{\mathrm{T}}/\mathrm{N}$ , $pos^{K}=[0,1,2,\cdots,M-1]^{\mathrm{T}}/\mathrm{M}$ are the position vectors of series Q and K, respectively. In the series-attention step, to jointly capture position-wise similarity under multiple periods, the unnormalized similarity is the mean of the quaternion dot-products under the $P$ rotations. Finally, in the series-aggregation step, the output $H\in\mathbb{R}^{N\times d}$ is generated using the softmax-normalized similarity. In practice, we employ the multi-head variant of QRA; we omit the details here, as it can be derived directly. Note that QRA is more expressive than canonical dot-product attention: when $P$ = 1, $\omega$ = 0 and $\theta$ = 0, QRA degenerates into canonical attention.
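The series-rotation, series-attention and series-aggregation steps can be sketched in plain Python as below. This is a simplified single-head illustration under several assumptions: the latent frequencies and phases are passed in as ready-made lists instead of being produced by 1D convolutions, every group of four consecutive channels is treated as one quaternion, and the names `qra` and `rotate_row` are hypothetical. It relies on the fact that the real part of a quaternion times the conjugate of another equals their ordinary 4-D dot product.

```python
import math


def hamilton(q, r):
    """Hamilton product of two quaternions given as (e, f, g, h) tuples."""
    qe, qf, qg, qh = q
    re, rf, rg, rh = r
    return (qe * re - qf * rf - qg * rg - qh * rh,
            qf * re + qe * rf - qh * rg + qg * rh,
            qg * re + qh * rf + qe * rg - qf * rh,
            qh * re - qg * rf + qf * rg + qe * rh)


def rotate_row(row, angle, axis):
    """Right-multiply each quaternion in row by exp(unit * angle).

    axis = 1 selects the unit i, axis = 2 selects the unit j.
    """
    unit = [math.cos(angle), 0.0, 0.0, 0.0]
    unit[axis] = math.sin(angle)
    out = []
    for c in range(0, len(row), 4):
        out.extend(hamilton(tuple(row[c:c + 4]), tuple(unit)))
    return out


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def qra(Q, K, V, om_q, th_q, om_k, th_k):
    """Q: N x d, K and V: M rows; om_*/th_*: P lists of per-step values."""
    N, M, d, P = len(Q), len(K), len(Q[0]), len(om_q)
    H = []
    for n in range(N):
        scores = [0.0] * M
        for p in range(P):
            a = 2 * math.pi * om_q[p][n] * (n / N) + th_q[p][n]
            qn = rotate_row(Q[n], a, axis=1)
            for m in range(M):
                b = 2 * math.pi * om_k[p][m] * (m / M) + th_k[p][m]
                km = rotate_row(K[m], b, axis=2)
                # Re[q * conj(k)] summed over channels is the real dot product.
                scores[m] += sum(x * y for x, y in zip(qn, km))
        weights = softmax([s / (P * math.sqrt(d)) for s in scores])
        H.append([sum(w * V[m][c] for m, w in enumerate(weights))
                  for c in range(len(V[0]))])
    return H


# Toy inputs (made-up values) with P = 1 and all frequencies/phases zero.
Q = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2]]
K = [[0.3, 0.1, -0.2, 0.5], [0.0, 0.4, 0.1, -0.3], [0.2, 0.2, 0.2, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
zeros_q, zeros_k = [[0.0] * 2], [[0.0] * 3]
H_plain = qra(Q, K, V, zeros_q, zeros_q, zeros_k, zeros_k)
```

With zero frequencies and phases the rotations are identities, so `H_plain` coincides with ordinary scaled dot-product attention, mirroring the degenerate case noted in the text.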

4 Experiments

4.1 Datasets

The AIST++ [41] dance movement dataset was constructed from the AIST dance [19] video database. A well-developed process was designed for estimating camera parameters, 3D human keypoints and 3D human dance movement sequences from multi-view videos. The dataset provides 3D human keypoint annotations and camera parameters for 10.1 million images covering 30 different subjects in 9 viewpoints. These features make it the largest and richest dataset containing 3D human keypoint annotations currently available. Additionally, the dataset contains 1,408 3D human dance movement sequences represented as joint rotations and root trajectories. These dance movements are evenly distributed across 10 dance genres and contain hundreds of choreographies. The duration of the movements ranges from 7.4 to 48.0 seconds, and each dance movement is accompanied by corresponding music. Based on these annotations, AIST++ is designed to support multiple tasks including multi-view human keypoint estimation, human motion prediction/generation, and cross-modal analysis between human motion and music.

4.2 Implementation Details

In our primary experiments, the model takes a seed motion sequence spanning 120 frames (2 seconds) and a music sequence covering 240 frames (4 seconds) as input. These two sequences are aligned at the initial frame, and the model’s output is a future motion sequence of N=20 frames supervised by an L2 loss. During inference, future motions are generated continuously in an auto-regressive manner at 60 FPS, with only the first predicted motion retained at each step.

For music feature extraction, we employ the publicly available audio processing toolbox Librosa [42] to obtain a 1-dimensional envelope, 20-dimensional MFCC, 12-dimensional chroma, 1-dimensional one-hot peaks, and 1-dimensional one-hot beats, resulting in a 35-dimensional music feature. The motion features combine a 9-dimensional rotation matrix representation for each of the 24 joints with a 3-dimensional global translation vector, resulting in a 219-dimensional motion feature. These raw audio and motion features are first embedded into 800-dimensional hidden representations using linear layers, with learnable position embeddings added before they are fed into the transformer layers. All three transformers (audio, motion, cross-modal) use 16 attention heads with a hidden size of 800.

All experiments are trained using the Adam optimizer with a batch size of 16. The learning rate starts at 1e-4 and decays to {1e-5, 1e-6} at {90k, 150k} steps, respectively. Training concludes after 500k steps, taking approximately 2 days on one RTX 3090. The baseline comparison includes the latest works on 3D dance generation with music and seed motion as input, such as Li et al. [19] and Li et al. [4]. For a more comprehensive evaluation, we also compare with the recent state-of-the-art 2D dance generation method, DanceRevolution [5]. We adapt this work to output 3D joint positions for a direct quantitative comparison with our results, even though joint positions do not allow for immediate repositioning. The official code provided by the authors is used to train and test these baselines on the same dataset as ours.
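The input sizes, learning-rate schedule, and auto-regressive inference described above can be sketched as follows. This is a minimal illustration, not the released implementation: `dummy_model` is a hypothetical stand-in for the trained network, `rollout` and `learning_rate` are names introduced here, and the schedule assumes the decay takes effect exactly at the stated step counts.

```python
import numpy as np

SEED_LEN, MUSIC_LEN, PRED_LEN = 120, 240, 20   # frames, as stated in the text
MOTION_DIM = 24 * 9 + 3                        # 24 rotation matrices + translation = 219
MUSIC_DIM = 1 + 20 + 12 + 1 + 1                # envelope, MFCC, chroma, peaks, beats = 35


def learning_rate(step):
    """Piecewise-constant schedule: 1e-4, then 1e-5 from 90k and 1e-6 from 150k steps."""
    if step < 90_000:
        return 1e-4
    if step < 150_000:
        return 1e-5
    return 1e-6


def dummy_model(motion, music):
    """Hypothetical stand-in for the trained network: repeats the last seed pose."""
    assert motion.shape == (SEED_LEN, MOTION_DIM)
    assert music.shape == (MUSIC_LEN, MUSIC_DIM)
    return np.repeat(motion[-1:], PRED_LEN, axis=0)


def rollout(model, seed_motion, music, n_frames):
    """Auto-regressive generation that keeps only the first predicted frame per step."""
    window = np.asarray(seed_motion, dtype=float).copy()
    generated = []
    for t in range(n_frames):
        # Motion and music windows start at the same frame t.
        pred = model(window, music[t:t + MUSIC_LEN])   # (PRED_LEN, MOTION_DIM)
        first = pred[0]
        generated.append(first)
        # Slide the motion window forward by one frame.
        window = np.vstack([window[1:], first[None]])
    return np.stack(generated)
```

Generating `n_frames` frames this way requires at least `n_frames - 1 + MUSIC_LEN` frames of music features, because the music window advances by one frame per step.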

4.3 Quantitative Evaluation

In this section, we assess the performance of our proposed Multi-modal Roformer across three key dimensions: (1) motion quality, (2) generation diversity, and (3) motion-music correlation. The results presented in Table 1 show that, under identical experimental conditions, our model surpasses state-of-the-art methods [2, 6, 7] in these aspects.

Motion Quality: Similar to previous studies, we assess the quality of generated motion by computing the Fréchet Inception Distance (FID) [43], which measures the dissimilarity between the distribution of generated motion and that of ground-truth motion. To capture motion features, we use two hand-crafted motion feature extractors, since the motion encoders employed in earlier works were not disclosed [44]. These extractors are: (1) a geometric feature extractor, which generates a boolean vector representing geometric relationships among specific body points in the motion sequence, and (2) a dynamic feature extractor, which maps the motion sequence to features capturing dynamic aspects such as velocity and acceleration. We denote the FID computed on these geometric and dynamic features as $FID_{g}$ and $FID_{k}$, respectively. The metrics are computed by comparing real dance motion sequences in the AIST++ test set with 40 generated motion sequences, each comprising T = 1200 frames (20 seconds). As shown in Table 1, the distributions of our generated motion sequences are closer to the ground-truth motion than those of the three compared methods.
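As a rough illustration of how such an FID value is obtained once features are extracted, the sketch below computes the Fréchet distance between two Gaussian fits of feature sets. The geometric and dynamic feature extractors themselves are not reproduced; `frechet_distance` is a name introduced here, and the inputs are assumed to be arrays of shape (number of sequences, feature dimension) with at least two feature dimensions.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets."""
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # The trace of sqrt(cov_a @ cov_b) equals the sum of the square roots of
    # the eigenvalues of the product, which are real and non-negative for
    # covariance matrices; tiny negative values from round-off are clipped.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Identical feature sets give a distance of zero, and shifting every feature by a constant changes only the mean term.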

Generation Diversity: We also assess the model’s capacity to generate diverse dance movements in response to different input music, comparing its performance to the baseline methods. Following a methodology similar to previous research [45], we compute the average Euclidean distance in the feature space of 40 generated motions from the AIST++ test set to quantify diversity. The motion diversity in the geometric and dynamic feature spaces is denoted as $Dist_{g}$ and $Dist_{k}$, respectively. Table 1 shows that our method generates more diverse dance movements than the baselines, with the exception of Li [29]. The latter discretizes motions, resulting in discontinuous outputs and elevated $Dist_{k}$.
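The diversity measure can be sketched as the mean Euclidean distance over all distinct pairs of generated sequences in feature space. This is an assumption about the exact averaging (the text only states "average Euclidean distance"), and `average_pairwise_distance` is a name introduced for illustration.

```python
import numpy as np


def average_pairwise_distance(features):
    """Mean Euclidean distance over all distinct pairs of feature vectors."""
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    diffs = features[:, None, :] - features[None, :, :]   # (n, n, dim)
    dists = np.linalg.norm(diffs, axis=-1)                # (n, n)
    upper = np.triu_indices(n, k=1)                       # each pair counted once
    return float(dists[upper].mean())
```

For example, three feature vectors at (0, 0), (3, 4), and (6, 8) have pairwise distances 5, 10, and 5, so the function returns their mean.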

Motion-Music Correlation: Moreover, we gauge the correlation between the generated 3D motion and the input music with a metric known as the Beat Alignment Score. This metric evaluates the motion-music correlation by measuring the similarity between the beats in the motion and those in the music. Librosa [42] is employed to extract music beats, while motion beats are computed as local minima of the motion velocity. The Beat Alignment Score is computed from the distance between each motion beat and its nearest music beat, averaged over all motion beats. Specifically, our Beat Alignment Score is defined as:

$$\mathrm{BeatAlign}=\frac{1}{z}\sum_{i=1}^{z}\exp\left(-\frac{\min_{\forall t_{j}^{d}\in B^{d}}\left\|t_{i}^{c}-t_{j}^{d}\right\|^{2}}{2\alpha^{2}}\right) \tag{75}$$

where $B^{c}=\left\{t_{i}^{c}\right\}$ is the set of motion beats, $B^{d}=\left\{t_{j}^{d}\right\}$ is the set of music beats, $z$ is the number of motion beats, and $\alpha$ is a parameter for normalizing sequences with different FPS.

We set $\alpha=3$ for all experiments since the FPS for all our experimental sequences is 60. A similar metric called Beat Hit Rate has been introduced in prior work, but it relies on manually set thresholds for alignment (“hits”) that depend on the dataset, while our metric directly measures distances. The metric is explicitly designed to be unidirectional: dance movements do not necessarily need to match every music beat, but each motion beat should have a corresponding music beat. To calibrate the results, we compute the correlation metric for the entire AIST++ dataset (upper bound) and for randomly paired data (lower bound). As shown in Table 1, our generated motion correlates better with the input music than the baselines do. However, there is still considerable room for improvement for all methods, including ours, when compared to real data. This reflects that music-motion correlation remains a challenging problem.
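The Beat Alignment Score defined above can be sketched directly from the formula. The sketch assumes beat times are given as frame indices, that motion speed is the mean joint speed between consecutive frames, and that an empty beat set yields a score of 0; these choices and the function names are illustrative assumptions rather than details stated in the text.

```python
import numpy as np


def motion_beats(joint_positions):
    """Indices of local minima of the mean joint speed.

    joint_positions has shape (T, J, 3); index i of the speed sequence
    refers to the motion between frames i and i + 1.
    """
    velocity = np.diff(np.asarray(joint_positions, dtype=float), axis=0)
    speed = np.linalg.norm(velocity, axis=-1).mean(axis=-1)   # (T - 1,)
    is_min = (speed[1:-1] < speed[:-2]) & (speed[1:-1] < speed[2:])
    return np.flatnonzero(is_min) + 1


def beat_align(motion_beat_frames, music_beat_frames, alpha=3.0):
    """Average over motion beats of exp(-d^2 / (2 alpha^2)), d = distance to nearest music beat."""
    motion = np.asarray(motion_beat_frames, dtype=float)
    music = np.asarray(music_beat_frames, dtype=float)
    if motion.size == 0 or music.size == 0:
        return 0.0   # assumption: no beats on either side means no alignment
    nearest_sq = ((motion[:, None] - music[None, :]) ** 2).min(axis=1)
    return float(np.exp(-nearest_sq / (2.0 * alpha ** 2)).mean())
```

Because the minimum is taken over music beats for each motion beat, extra music beats with no matching motion beat do not lower the score, which reflects the unidirectional design.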

Methods	Motion Quality		Motion Diversity		Motion-Music Corr
	$FID_{k}$ ↓	$FID_{g}$ ↓	$Dist_{k}$ ↑	$Dist_{g}$ ↑	BeatAlign ↑
AIST++	-	-	9.057	7.556	0.292
AIST++ (random)	-	-	-	-	0.213
Li et al. [5]	86.43	20.58	6.85	4.93	0.232
Dancenet [6]	69.18	17.76	2.86	2.72	0.232
DanceRevolution [7]	73.42	31.01	3.52	2.46	0.22
FACT (baseline) [19]	48.95	28.1	4.9	6.69	0.232
Ours	30.1	11.5	7.82	9.37	0.239

Table 1: Conditional motion generation evaluation on the AIST++ dataset. Compared with the three recent state-of-the-art methods, our generated motions are more realistic, better correlated with the input music, and more diverse.

4.4 Ablation Study

We conducted ablation studies on the Spin Position Embedding and the multi-modal quaternion parameterization, respectively. The quantitative scores are shown in Table 2.

Position Embedding: In the ablation experiments on position encoding, we compare two configurations: (1) a learnable encoding of absolute positions (baseline), and (2) a rotary encoding of relative positions. Method 2 was selected to introduce an explicit relative position dependence into the self-attention formulation. This choice offers increased flexibility in sequence length, a potential reduction in dependencies between tokens, and the capacity to encode relative positions for linear self-attention. As shown in Table 2, the rotary embedding of relative positions leads to a significant reduction in $FID_{g}$ compared to the original method (from 28.1 to 11.5). This indicates that dances generated using rotary position embedding are notably closer to real dances.
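The relative-position property of rotary encoding can be illustrated with a generic rotary embedding in the style of RoFormer [40]. This is a sketch of the general technique, not the Spin Position Embedding code itself; `rotary_embed` and the frequency base of 10000 are assumptions taken from the common rotary formulation.

```python
import numpy as np


def rotary_embed(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.

    x has shape (T, d) with even d; positions has shape (T,).
    """
    x = np.asarray(x, dtype=float)
    positions = np.asarray(positions, dtype=float)
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # (d / 2,)
    angles = positions[:, None] * inv_freq[None, :]   # (T, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```

After this rotation, the dot product between a query at position m and a key at position n depends only on the offset n - m, so shifting both positions by the same amount leaves the attention score unchanged.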

Quaternion parameterization: Here, we compare the original baseline with the baseline augmented by the QRA module. As shown in Table 2, Quaternion Rotary Attention (QRA), which introduces quaternion operations to model the relationship between audio and motion, improves all three metrics over the baseline ($FID_{k}$ from 48.95 to 46.33, $FID_{g}$ from 28.1 to 26.2, and BeatAlign from 0.232 to 0.236).
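For readers unfamiliar with quaternion operations, the sketch below shows the Hamilton product, the standard quaternion multiplication on which quaternion-valued layers are built. It does not reproduce QRA; treating the last axis as the (w, x, y, z) components and the name `hamilton_product` are assumptions made for illustration.

```python
import numpy as np


def hamilton_product(p, q):
    """Hamilton product of quaternions stored as (..., 4) arrays in (w, x, y, z) order."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    pw, px, py, pz = p[..., 0], p[..., 1], p[..., 2], p[..., 3]
    qw, qx, qy, qz = q[..., 0], q[..., 1], q[..., 2], q[..., 3]
    return np.stack(
        [
            pw * qw - px * qx - py * qy - pz * qz,
            pw * qx + px * qw + py * qz - pz * qy,
            pw * qy - px * qz + py * qw + pz * qx,
            pw * qz + px * qy - py * qx + pz * qw,
        ],
        axis=-1,
    )
```

The product is not commutative (i times j gives k, while j times i gives -k), and the norm of a product equals the product of the norms, which is why unit quaternions can represent rotations.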

	$FID_{k}$ ↓	$FID_{g}$ ↓	BeatAlign ↑
baseline	48.95	28.1	0.232
baseline+Spin Position Embedding	30.1	11.5	0.239
baseline+QRA	46.33	26.2	0.236

Table 2: Ablation study on Spin Position Embedding and Quaternion Rotary Attention. As shown in the table, the experimental results demonstrate the effectiveness of our proposed method, with improved performance metrics relative to the baseline approach.

5 Conclusion

We propose a network called QEAN for generating 3D dance movements. QEAN employs Spin Position Embedding (SPE) in the position encoding part to embed position information into self-attention in a rotational manner, which improves the model’s representation of the sequences and enhances its understanding of the temporal order of human movements. Additionally, we propose Quaternion Rotary Attention (QRA), a quaternion-valued relational learning network, which uses quaternion values to explore the temporal coordination between music and dance. We conducted experiments on the AIST++ dataset, and the results demonstrate the superiority of our approach in the 3D dance generation task. Furthermore, the results of our ablation experiments illustrate the importance of SPE and QRA in this task.

6 Acknowledgement

This work was supported in part by National Natural Science Foundation under Grant 92267107, the Science and Technology Planning Project of Guangdong under Grant 2021B0101220006, Science and Technology Projects in Guangzhou under Grant 202201011706, Key Areas Research and Development Program of Guangzhou under Grant 2023B01J0029, Science and technology research in key areas in Foshan under Grant 2020001006832, Key Area Research and Development Program of Guangdong Province under Grant 2018B010109007 and 2019B010153002, Science and technology projects of Guangzhou under Grant 202007040006, and Guangdong Provincial Key Laboratory of Cyber-Physical System under Grant 2020B1212060069.

7 Declarations

Conflict of interest We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

References

[1] Yue Yang and Ensi Zhang “Cultural thought and philosophical elements of singing and dancing in Indian films” In Trans/Form/Ação 46, 2023, pp. 315–328 DOI: 10.1590/0101-3173.2023.v46n4.p315
[2] Mark Siciliano “A citation analysis of business librarianship: Examining the Journal of Business and Finance Librarianship from 1990–2014” In Journal of Business & Finance Librarianship 22, 2017, pp. 81–96 URL: https://api.semanticscholar.org/CorpusID:63474056
[3] Andreas Aristidou et al. “Style-based motion analysis for dance composition” In The Visual Computer 34, 2018, pp. 1725–1737 URL: https://api.semanticscholar.org/CorpusID:27531229
[4] Jiaman Li et al. “Learning to Generate Diverse Dance Motions with Transformer” In ArXiv abs/2008.08171, 2020 URL: https://api.semanticscholar.org/CorpusID:221173065
[5] Ruozi Huang et al. “Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning” In International Conference on Learning Representations, 2020 URL: https://api.semanticscholar.org/CorpusID:235614403
[6] Xinjian Zhang et al. “Dance Generation with Style Embedding: Learning and Transferring Latent Representations of Dance Styles” In ArXiv abs/2104.14802, 2021 URL: https://api.semanticscholar.org/CorpusID:233476346
[7] Ruozi Huang et al. “Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning” In International Conference on Learning Representations, 2020 URL: https://api.semanticscholar.org/CorpusID:235614403
[8] Samy Bengio, Oriol Vinyals, Navdeep Jaitly and Noam M. Shazeer “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks” In ArXiv abs/1506.03099, 2015 URL: https://api.semanticscholar.org/CorpusID:1820089
[9] Zhifeng Xie et al. “BaGFN: Broad Attentive Graph Fusion Network for High-Order Feature Interactions” In IEEE Transactions on Neural Networks and Learning Systems 34, 2021, pp. 4499–4513 URL: https://api.semanticscholar.org/CorpusID:238476689
[10] Shiry Ginosar et al. “Learning Individual Styles of Conversational Gesture” In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3492–3501 URL: https://api.semanticscholar.org/CorpusID:182952539
[11] Bin Sheng, Ping Li, Riaz Ali and C. L. Philip Chen “Improving Video Temporal Consistency via Broad Learning System” In IEEE Transactions on Cybernetics 52.7, 2022, pp. 6662–6675 DOI: 10.1109/TCYB.2021.3079311
[12] Alexey Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” In ArXiv abs/2010.11929, 2020 URL: https://api.semanticscholar.org/CorpusID:225039882
[13] Ze Liu et al. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10002 URL: https://api.semanticscholar.org/CorpusID:232352874
[14] Ashish Vaswani et al. “Attention is All you Need” In Neural Information Processing Systems, 2017 URL: https://api.semanticscholar.org/CorpusID:13756489
[15] Xiao Lin et al. “EAPT: Efficient Attention Pyramid Transformer for Image Processing” In IEEE Transactions on Multimedia 25, 2021, pp. 50–61 URL: https://api.semanticscholar.org/CorpusID:245536278
[16] Zinuo Li, Xuhang Chen, Chi-Man Pun and Xiaodong Cun “High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset and A Frequency-Aware Shadow Erasing Net” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 12449–12458
[17] Zinuo Li, Xuhang Chen, Shuqiang Wang and Chi-Man Pun “A Large-Scale Film Style Dataset for Learning Multi-frequency Driven Film Enhancement” In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23), Main Track, International Joint Conferences on Artificial Intelligence Organization, 2023, pp. 1160–1168 DOI: 10.24963/ijcai.2023/129
[18] Shenghong Luo et al. “Devignet: High-Resolution Vignetting Removal via a Dual Aggregated Fusion Transformer With Adaptive Channel Expansion” In arXiv preprint arXiv:2308.13739, 2023
[19] Ruilong Li, Sha Yang, David A. Ross and Angjoo Kanazawa “AI Choreographer: Music Conditioned 3D Dance Generation with AIST++” In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13381–13392 URL: https://api.semanticscholar.org/CorpusID:236882798
[20] Lian Siyao et al. “Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory” In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11040–11049 URL: https://api.semanticscholar.org/CorpusID:247627867
[21] Dario Pavllo, Christoph Feichtenhofer, Michael Auli and David Grangier “Modeling Human Motion with Quaternion-Based Neural Networks” In International Journal of Computer Vision 128, 2019, pp. 855–872 URL: https://api.semanticscholar.org/CorpusID:59158790
[22] Weizhao Ma et al. “PCMG:3D point cloud human motion generation based on self-attention and transformer” In The Visual Computer, 2023 URL: https://api.semanticscholar.org/CorpusID:261566852
[23] David Greenwood, Stephen D. Laycock and Iain Matthews “Predicting Head Pose from Speech with a Conditional Variational Autoencoder” In Interspeech, 2017 URL: https://api.semanticscholar.org/CorpusID:11113871
[24] Yuhang Huang et al. “Genre-Conditioned Long-Term 3D Dance Generation Driven by Music” In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4858–4862 URL: https://api.semanticscholar.org/CorpusID:249437513
[25] Sepp Hochreiter and Jürgen Schmidhuber “Long Short-Term Memory” In Neural Computation 9, 1997, pp. 1735–1780 URL: https://api.semanticscholar.org/CorpusID:1915014
[26] Qihang Yu et al. “Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP” In ArXiv abs/2308.02487, 2023 URL: https://api.semanticscholar.org/CorpusID:260611350
[27] Yao-Hung Hubert Tsai et al. “Multimodal Transformer for Unaligned Multimodal Language Sequences” In Proceedings of the conference. Association for Computational Linguistics. Meeting 2019, 2019, pp. 6558–6569 URL: https://api.semanticscholar.org/CorpusID:173990158
[28] Ziheng Wu et al. “EasyPhoto: Your Smart AI Photo Generator”, 2023 URL: https://api.semanticscholar.org/CorpusID:263829612
[29] Purva Tendulkar, Abhishek Das, Aniruddha Kembhavi and Devi Parikh “Feel The Music: Automatically Generating A Dance For An Input Song” In ArXiv abs/2006.11905, 2020 URL: https://api.semanticscholar.org/CorpusID:219572850
[30] Jogendra Nath Kundu, Himanshu Buckchash, Priyanka Mandikal and Rahul “Cross-Conditioned Recurrent Networks for Long-Term Synthesis of Inter-Person Human Motion Interactions” In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 2713–2722 URL: https://api.semanticscholar.org/CorpusID:214675800
[31] Linjie Li et al. “VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation” In ArXiv abs/2106.04632, 2021 URL: https://api.semanticscholar.org/CorpusID:235377363
[32] Partha Ghosh, Jie Song, Emre Aksan and Otmar Hilliges “Learning Human Motion Models for Long-Term Predictions” In 2017 International Conference on 3D Vision (3DV), 2017, pp. 458–466 URL: https://api.semanticscholar.org/CorpusID:13549534
[33] Chenfei Wu et al. “Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models” In ArXiv abs/2303.04671, 2023 URL: https://api.semanticscholar.org/CorpusID:257404891
[34] Zhengxiao Du et al. “GLM: General Language Model Pretraining with Autoregressive Blank Infilling” In Annual Meeting of the Association for Computational Linguistics, 2021 URL: https://api.semanticscholar.org/CorpusID:247519241
[35] Zongwen Bai et al. “Low-rank multimodal fusion algorithm based on context modeling” In Journal of Internet Technology 22.4, 2021, pp. 913–921
[36] Daniel Holden, Jun Saito and Taku Komura “A deep learning framework for character motion synthesis and editing” In ACM Transactions on Graphics (TOG) 35, 2016, pp. 1–11 URL: https://api.semanticscholar.org/CorpusID:18149328
[37] Haibo Qiu et al. “Cross View Fusion for 3D Human Pose Estimation” In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 4341–4350 URL: https://api.semanticscholar.org/CorpusID:201891326
[38] Ye Zhu et al. “Quantized GAN for Complex Music Generation from Dance Videos” In ArXiv abs/2204.00604, 2022 URL: https://api.semanticscholar.org/CorpusID:247922422
[39] Zewen Zheng et al. “Quaternion-Valued Correlation Learning for Few-Shot Semantic Segmentation” In IEEE Transactions on Circuits and Systems for Video Technology 33, 2023, pp. 2102–2115 URL: https://api.semanticscholar.org/CorpusID:253661872
[40] Jianlin Su et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding” In ArXiv abs/2104.09864, 2021 URL: https://api.semanticscholar.org/CorpusID:233307138
[41] Shuhei Tsuchida, Satoru Fukayama, Masahiro Hamasaki and Masataka Goto “AIST Dance Video Database: Multi-Genre, Multi-Dancer, and Multi-Camera Database for Dance Information Processing” In International Society for Music Information Retrieval Conference, 2019 URL: https://api.semanticscholar.org/CorpusID:208334750
[42] Brian McFee et al. “librosa: Audio and Music Signal Analysis in Python” In SciPy, 2015 URL: https://api.semanticscholar.org/CorpusID:33504
[43] Martin Heusel et al. “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium” In Neural Information Processing Systems, 2017 URL: https://api.semanticscholar.org/CorpusID:326772
[44] Kensuke Onuma, Christos Faloutsos and Jessica K. Hodgins “FMDistance: A Fast and Effective Distance Function for Motion Capture Data” In Eurographics, 2008 URL: https://api.semanticscholar.org/CorpusID:8323054
[45] Hao Hao Tan and Mohit Bansal “LXMERT: Learning Cross-Modality Encoder Representations from Transformers” In Conference on Empirical Methods in Natural Language Processing, 2019 URL: https://api.semanticscholar.org/CorpusID:201103729