
License: arXiv.org perpetual non-exclusive license
arXiv:2403.11626v1 [cs.GR] 18 Mar 2024

[1] Guoheng Huang, [2] Xuhang Chen
Affiliations: [1] Guangdong University of Technology, Guangdong, China; [2] Huizhou University, Guangdong, China; [3] Guangdong Mechanical and Electrical College, Guangdong, China; [4] University of Western Australia, WA, Australia

QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation

Zhizhen Zhou    Ye**g Huo    An Zeng    Lian Huang    Zinuo Li
Abstract

The study of music-generated dance is a novel and challenging image generation task. It aims to take a piece of music and seed motions as input, then generate natural dance movements for the subsequent music. Transformer-based methods face challenges in time series prediction tasks related to human movements and music because they struggle to capture nonlinear relationships and temporal aspects. This can lead to issues like joint deformation, role deviation, floating, and inconsistencies in dance movements generated in response to the music. In this paper, we propose a Quaternion-Enhanced Attention Network (QEAN) for visual dance synthesis from a quaternion perspective, which consists of a Spin Position Embedding (SPE) module and a Quaternion Rotary Attention (QRA) module. First, SPE embeds position information into self-attention in a rotational manner, leading to better learning of features of movement sequences and audio sequences, and improved understanding of the connection between music and dance. Second, QRA represents and fuses 3D motion features and audio features in the form of a series of quaternions, enabling the model to better learn the temporal coordination of music and dance under the complex temporal cycle conditions of dance generation. Finally, we conducted experiments on the AIST++ dataset, and the results show that our approach achieves better and more robust performance in generating accurate, high-quality dance movements. Our source code and the dataset are available at https://github.com/MarasyZZ/QEAN and https://google.github.io/aistplusplus_dataset, respectively.

keywords:
Dance generation, Multi-modal task, Quaternion network, Time-series prediction task, Animation generation task

1 Introduction

Figure 1: The motivation of our method. We compare our method with other approaches in generating dance movements from seed motions. In the top row labeled “other methods”, two sets of images showcase the transformation of seed movements into unnatural final poses characterized by joint deformation and character drift. Conversely, in the bottom row labeled “our method”, we demonstrate how the application of Pre-quaternion parameterization (P), Spin Position Embedding (S), and Quaternion Attention (Q) yields natural-looking final poses. Each prediction produced by our method successfully learns the correlations between dance and music.

Dancing is a universal language across all cultures [1, 2] and is used by many as a powerful means of self-expression on online media platforms, becoming a dynamic tool for disseminating information on the Internet. Although dance is an art form, it requires professional practice and training to give dancers a rich expressive voice [3]. Therefore, from a computational point of view, music-conditioned 3D dance generation [4, 5, 6, 7] has become a critical task that promises to open up a variety of practical applications. However, creating satisfying dance sequences that harmonize with specific music and body structures faces challenges because of our lack of understanding of the timing of human movements and the connection between music and dance. Overcoming these challenges is essential to achieve fluid movements with a high degree of kinematic complexity while ensuring consistency with the complex non-linear relationships of the accompanying music.

With the continued advancement of deep learning, many deep learning methods [6, 7, 8, 9, 10, 11] have been applied to dance generation. RNN-based methods [8, 10] were the first to be used to simulate human dances, but they suffer from static poses and error accumulation, especially when the input data varies. Subsequently, some researchers used Variational Auto-Encoders (VAE) and Generative Adversarial Networks (GAN) to model 2D dance movements [12], and LSTM-Auto-Encoders were then used to model 3D dance movements directly from musical features [13]. Although these approaches alleviate the error accumulation of RNN-based methods, they are unstable and prone to regressing to non-standard poses.

In recent years, the Transformer [12, 13, 14, 15, 16, 17, 18] has been widely adopted in natural language processing as well as visual processing, and some researchers have made great progress [19, 20] in generating high-quality dance movements given a piece of music. However, because the Transformer inadequately models the temporal dependence of sequences when dealing with time-series data, the generated dances suffer from problems such as drifting and foot slipping (as shown in Fig. 1). Moreover, existing Transformer-based approaches do not fully model the non-linear relationship between music and dance.

Quaternions are widely used as a mathematical tool for rotational expression and gesture control [21]. In view of this, we believe that the introduction of quaternions in the field of dance generation may be a promising approach. Compared to traditional Euler angles, quaternions are more effective in avoiding the “Gimbal Lock” problem and improving the stability of gesture representations. We expect to use quaternions to more accurately adapt dance movements to the rhythm and emotion of the music. By combining musical features with the correlation of quaternions, we can more accurately capture the complex relationship between music and dance.

In this paper, to address these challenges, we propose a Quaternion-Enhanced Attention Network for multi-modal dance synthesis (shown in Fig. 2). The network mainly consists of a Spin Position Embedding (SPE) module and a Quaternion Rotary Attention (QRA) module. The SPE module is used in the Transformer structure of the network and embeds information into self-attention in the form of relative positions. The audio and motion features are extracted by separate Transformer structures in the network, and the SPE module combines the advantages of relative position coding and absolute position coding to maximize the model’s representation of sequence features. The extracted audio and motion features are expanded to four dimensions by quaternion parameterization and then fused through the Quaternion Rotary Attention module. Compared to the work of Li [19], the proposed SPE improves the model’s representation and utilisation of positional information. In addition, the QRA module better learns the relationship between audio and movement, improving the quality of the generated dances with good robustness.

The contribution of the proposed network can be summarized as follows:

1. In this paper, we introduce a Quaternion-Enhanced Attention Network (QEAN) for generating multimodal dances. This addresses challenges seen in current methods, like awkward joint movements and character inconsistency. QEAN uses quaternion operations to better capture the complex relationship between music and dance, improving the modeling of temporal dependencies.

2. We introduce the Spin Position Embedding (SPE) module, which computes query and key vectors for features, applies rotational operations, and embeds results into self-attention. SPE addresses limitations of traditional position encoding by introducing relative position encoding based on rotations, enhancing modeling for variable-length sequences while solving length consistency and overfitting issues. Additionally, relative position information improves modeling of intrinsic feature associations, significantly enhancing the model’s representation and utilization of temporal order in human motion.

3. We introduce the quaternion perspective and propose the Quaternion Rotary Attention (QRA) module. The QRA module maps audio and motion features to the quaternion space and explores the intrinsic correlation between the two using Hamiltonian multiplication, which enables the model to better learn the temporal coordination between music and dance, and generate smooth and natural dances coordinated with the music tempo.

4. Experimental results on the AIST++ dataset demonstrate that our proposed network is capable of effectively learning the connection between audio and movement, leading to the generation of higher quality dance movements. It outperforms other current state-of-the-art methods in terms of dance quality.

2 Related Work

3D Human Motion Synthesis The research on generating realistic and controllable 3D human motion sequences, as discussed in [8, 22, 23, 24], has seen significant advancements in recent years. Initial efforts utilized statistical models like kernel-based probability distributions [25] to synthesize motion, but these methods tended to oversimplify motion details. A subsequent breakthrough came with the introduction of the motion graph approach [26], which addressed this limitation by adopting a non-parametric method. This technique involved constructing a directed graph using motion capture datasets, where each node represented a pose, and edges denoted transitions between poses. Motion generation was achieved through random walks on this graph. However, a notable challenge in motion graphs was the generation of plausible transitions, and certain methods sought to overcome this by introducing parameterizations for transitions [27]. As deep learning gained prominence, several approaches explored the use of neural networks trained on extensive motion capture datasets to generate 3D motion. Various network architectures, including CNNs [23, 28], GANs [29], VAE [30], RNNs [6, 20], and Transformers [4, 20] have been investigated. While auto-regressive models like RNNs and pure Transformers [31] theoretically have the capacity to generate infinite motion, practical challenges such as mean regression arise. This phenomenon leads to motion “freezing” or drifting into unnatural movements after several iterations. To address this, some studies [31, 32] propose periodic usage of the network’s output as input during the training process. Additionally, Phase Function Neural Networks and their variants have been introduced [33, 34] to tackle the mean regression issue by conditioning network weights on the phase. However, their scalability in representing diverse movements is limited.

Music-Driven Dance Generation In recent years, data-driven deep learning has become the dominant technique for dance generation. Joao [35] used graph convolutional networks to learn from a variety of dance datasets and generate new dance sequences that are smooth and continuous. This deep learning approach significantly improves the quality and continuity of the generated dances. Holden [36], Qiu [37] and Starke [38] built on deep learning by strengthening long-term dependency modeling as a means of generating more coherent long dance sequences. Common approaches include integrating skeleton information and employing attention mechanisms. Li [19] built on that previous work by proposing the Full Attention Cross-Modal Transformer model (FACT), which can generate non-freezing, high-quality 3D motion sequences conditioned on music by learning audio-motion correspondences.

Quaternion Networks In various domains of deep learning such as few-shot segmentation [39], human motion synthesis [21], and multi-sensor signal processing, Quaternion Neural Networks have made significant strides. In tasks similar to the one discussed in this paper, quaternion representations are employed in neural network architectures as a parameterization for rotations. Quaternion networks, exemplified by QuaterNet [21], utilize quaternions to represent joint rotations in both Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). This approach addresses the discontinuity issues associated with Euler angles, achieving outstanding performance in long-term prediction tasks. In the context of our work, focused on music-driven dance generation, we propose constructing a learning process for the correlation between music and dance. This is essential as it requires consideration of the non-linear characteristics of both motion and music. Therefore, our method involves exploring the relationship between audio and motion features using quaternions. By leveraging quaternions, we aim to enhance the correlation between audio and motion, facilitating the generation of high-quality dance sequences.

Figure 2: The overview of our method. (a) describes the basic process, which contains three modules (i), (ii), and (iii). When the inputs are a motion sequence with a length of 120 frames and an audio sequence with a length of 240 frames, features are extracted by the motion transformer and the audio transformer, respectively. The extracted features are parameterized by a quaternion parameterization operation, which changes their dimension to 4. Through the Spin Position Embedding (SPE) module, the corresponding 4-dimensional features are rotated to embed the information into the self-attention in a rotational manner. The information processed by the SPE is used to explore the coordination between the music and the dance through the quaternion attention transformer, and finally, the corresponding dance is generated. (i), (ii) and (iii) describe the processing of quaternion parameterization, spin position embedding and the basic structure of the quaternion attention transformer, respectively. Specific details are given in the Methods section.

3 Methods

3.1 Overview of QEAN

In this paper, we propose a Quaternion-Enhanced Attention Network (QEAN) for generating high-quality dances under musical conditions, as illustrated in Fig. 2.

We are given random motion seeds $Y$ of length 120 frames and audio features $Z$ of length 240 frames, where $Y$ can be denoted as $Y=\{y_{1},y_{2},\ldots,y_{t}\}$ and $Z$ can be denoted as $Z=\{z_{1},z_{2},\ldots,z_{t'}\}$. Our objective is to generate a sequence of future motion from $t+1$ to $t'$, $Y'=\{y_{t+1},y_{t+2},\ldots,y_{t'}\}$, where $t'\gg t$.
QEAN first utilizes the two input transformers, the motion transformer $f_{mot}$ and the audio transformer $f_{audio}$, to encode features and generate motion and audio embeddings, represented as $h^{y}_{1:T}$ and $h^{z}_{1:T'}$, respectively. Next, these two embedded features are combined and subjected to a quaternion parameterization operation (see 3.2 for details). This operation maps the features to four dimensions and embeds the information into the self-attention in a rotational manner using Spin Position Embedding (see 3.3 for details). Finally, the fused features are learned by a Quaternion Rotary Attention Transformer (see 3.4 for details) to generate the corresponding dance movements.

3.2 Quaternion Algorithms and Quaternion Parameterization

We begin by elucidating the fundamental concepts of quaternions crucial for understanding the context of this paper. Quaternions, classified as hyper-complex numbers of rank 4, stand out as a direct and non-commutative extension of complex-valued numbers. In our proposed methodologies, the intricate interplay between Hamilton products and quaternion algebra emerges as the linchpin, forming the cornerstone of our innovative approaches. This exploration of quaternion principles lays the groundwork for the subsequent discussions and applications detailed in this study.

A quaternion $Q$ in the quaternion domain $D$, $Q\in D$, can be represented as:

$$Q=e+f\mathbf{i}+g\mathbf{j}+h\mathbf{k} \tag{1}$$

where $e$, $f$, $g$ and $h$ are real numbers, and $\mathbf{i}$, $\mathbf{j}$ and $\mathbf{k}$ are the quaternion unit basis, satisfying $\mathbf{i}^{2}=\mathbf{j}^{2}=\mathbf{k}^{2}=\mathbf{ijk}=-1$. In a quaternion, $e$ is the real part and $f\mathbf{i}+g\mathbf{j}+h\mathbf{k}$ is the imaginary part.

A pure quaternion is a quaternion whose real part is 0, resulting in the vector $Q=f\mathbf{i}+g\mathbf{j}+h\mathbf{k}$. Operations on quaternions are defined as follows.

The addition of two quaternions is defined as:

$$Q+R=(Q_{e}+R_{e})+(Q_{f}+R_{f})\mathbf{i}+(Q_{g}+R_{g})\mathbf{j}+(Q_{h}+R_{h})\mathbf{k} \tag{2}$$

where $Q$ and $R$ with subscripts denote the real and imaginary components of the quaternions $Q$ and $R$.

Multiplication by a scalar $\gamma$ is defined as:

$$\gamma Q=\gamma e+\gamma f\mathbf{i}+\gamma g\mathbf{j}+\gamma h\mathbf{k} \tag{3}$$

The conjugate $Q^{\star}$ of $Q$ is defined as:

$$Q^{\star}=e-f\mathbf{i}-g\mathbf{j}-h\mathbf{k} \tag{4}$$

The multiplication (Hamilton product) of quaternions $Q$ and $R$ is defined as follows:

$$\begin{split}Q\otimes R=&(Q_{e}R_{e}-Q_{f}R_{f}-Q_{g}R_{g}-Q_{h}R_{h})\\ &+(Q_{f}R_{e}+Q_{e}R_{f}-Q_{h}R_{g}+Q_{g}R_{h})\mathbf{i}\\ &+(Q_{g}R_{e}+Q_{h}R_{f}+Q_{e}R_{g}-Q_{f}R_{h})\mathbf{j}\\ &+(Q_{h}R_{e}-Q_{g}R_{f}+Q_{f}R_{g}+Q_{e}R_{h})\mathbf{k}\end{split} \tag{5}$$

The equation above describes the interaction between quaternions $Q$ and $R$: every component of one operand is combined with every component of the other, which is why the Hamilton product is essential in quaternion neural networks. In this study, we extensively employ the Hamilton product to learn the correlations between music and dance, which forms the foundation of QEAN and enhances its generalization ability.
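The quaternion operations in Eqs. (2) to (5) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the tuple layout `(e, f, g, h)` and the function names `q_add`, `q_scale`, `q_conj` and `hamilton` are assumptions made here for readability.

```python
from typing import Tuple

# Assumed layout: (e, f, g, h) stands for e + f*i + g*j + h*k.
Quaternion = Tuple[float, float, float, float]


def q_add(q: Quaternion, r: Quaternion) -> Quaternion:
    """Component-wise addition, as in Eq. (2)."""
    return tuple(a + b for a, b in zip(q, r))


def q_scale(gamma: float, q: Quaternion) -> Quaternion:
    """Multiplication by a real scalar, as in Eq. (3)."""
    return tuple(gamma * a for a in q)


def q_conj(q: Quaternion) -> Quaternion:
    """Conjugate: negate the three imaginary components, as in Eq. (4)."""
    e, f, g, h = q
    return (e, -f, -g, -h)


def hamilton(q: Quaternion, r: Quaternion) -> Quaternion:
    """Hamilton product of q and r, written term by term as in Eq. (5)."""
    qe, qf, qg, qh = q
    re, rf, rg, rh = r
    return (
        qe * re - qf * rf - qg * rg - qh * rh,  # real part
        qf * re + qe * rf - qh * rg + qg * rh,  # i component
        qg * re + qh * rf + qe * rg - qf * rh,  # j component
        qh * re - qg * rf + qf * rg + qe * rh,  # k component
    )


# Unit basis elements used to check the defining identities.
I = (0.0, 1.0, 0.0, 0.0)
J = (0.0, 0.0, 1.0, 0.0)
K = (0.0, 0.0, 0.0, 1.0)
```

With these definitions, `hamilton(I, J)` gives `K` while `hamilton(J, I)` gives the negative of `K`, which illustrates the non-commutativity mentioned at the start of this section.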

To implement our approach, we combine music and motion features to create a feature vector. Specifically, 35-dimensional music features and 219-dimensional motion features can be combined into a 254-dimensional feature vector through concatenation, based on dimension and time correspondence. Subsequently, we convert this concatenated feature vector into a sequence of quaternions. In this process, each music feature and three-dimensional dance motion feature are broken down into four components, representing a quaternion with a real part and three imaginary parts. As a result, the original 254-dimensional feature vector is transformed into a quaternion sequence with a length of 63 (disregarding the last two dimensions as they are insufficient to form a complete quaternion). Finally, we input this quaternion sequence into the Spin Position Embedding for further processing. In this model, a position encoding is assigned to each quaternion, enabling the capture of position information within the sequence.

By incorporating position information, the model gains a better understanding of the sequence and improves its performance accordingly.
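The dimension bookkeeping of the quaternion parameterization can be sketched as follows. The text does not state exactly how components are assigned to quaternions, so grouping four consecutive values of the concatenated vector is an assumption of this sketch, as are the names `to_quaternions`, `MUSIC_DIM` and `MOTION_DIM` and the placeholder feature values.

```python
from typing import List, Sequence, Tuple

Quaternion = Tuple[float, float, float, float]

MUSIC_DIM = 35    # per-frame music feature size given in the text
MOTION_DIM = 219  # per-frame motion feature size given in the text


def to_quaternions(frame: Sequence[float]) -> List[Quaternion]:
    """Group a flat feature vector into quaternions of four components.

    Assumption: consecutive values are grouped; trailing values that do
    not fill a complete quaternion are dropped.
    """
    n_quats = len(frame) // 4
    return [
        (frame[4 * i], frame[4 * i + 1], frame[4 * i + 2], frame[4 * i + 3])
        for i in range(n_quats)
    ]


# Placeholder features: the values only make the grouping easy to inspect.
music = [float(v) for v in range(MUSIC_DIM)]
motion = [float(v) for v in range(MUSIC_DIM, MUSIC_DIM + MOTION_DIM)]

frame = music + motion          # concatenation: 35 + 219 = 254 values
quats = to_quaternions(frame)   # 254 // 4 = 63 quaternions, 2 values dropped

print(len(frame), len(quats))   # 254 63
```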

Figure 3: Overview of Spin Position Embedding. The input action sequences and audio sequences are given feature vector representations after being encoded by their respective Transformers. The feature vectors of the action sequences are $x_{m}$, and the feature vectors of the audio sequences are $x_{n}$. These feature vectors are then multiplied by different rotation matrices $R_{m}$, $R_{n}$ according to their positions $m$ and $n$ in their respective sequences to fuse in the positional information. Finally, the encoded vectors of the action sequences are transformed into query vectors $q_{m}$, and a dot product with the rotationally transformed key vectors $k_{n}$ of the encoded audio sequences is taken to compute the correlation between the two modal sequences. With this Spin Position Embedding, the model can better capture the positional information of the two sequences, as well as the correlation between them, thus improving the learning of cross-modal representations.

3.3 Spin Position Embedding

Position embedding methods fall into two main types: absolute position embedding and relative position embedding. The Transformer [14] model, proposed in 2017, introduced positional embedding to provide information about the position of each word or token in the input sequence. This is crucial for NLP tasks that rely heavily on the relative position of words. In the Transformer model, positional embedding encodes each position using sine and cosine functions, with different frequencies for different dimensions. In this way, the model is able to learn the relative positions of the tokens in the sequence. This sine and cosine scheme, also known as absolute position coding, is easy to implement and relies directly on the absolute position without losing position information. However, it generalises poorly: the model adapts only to the specific sequence lengths seen with the absolute position coding, is prone to overfitting when the length varies, and performs poorly on long sequences.

Relative positional embedding, on the other hand, derives the embedding from the relative distance or order relationship between tokens, so it provides relative position information instead of relying completely on absolute position. Such an embedding highlights the relevance of tokens in terms of content, which is conducive to content comprehension, and avoids the growing cost of absolute positional embedding as positions increase. Consequently, it improves the generalisation ability.
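For reference, the sine and cosine absolute encoding described above can be sketched as follows. The base of 10000 and the interleaved sine/cosine layout follow the original Transformer formulation; the function name `sinusoidal_encoding` and the even `d_model` requirement are assumptions of this sketch.

```python
import math
from typing import List


def sinusoidal_encoding(pos: int, d_model: int) -> List[float]:
    """Absolute positional encoding for one position.

    Even index 2k holds sin(pos / 10000^(2k / d_model)) and the following
    odd index holds the cosine at the same frequency.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    encoding = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000.0 ** (i / d_model))  # lower frequency for later dimensions
        encoding.append(math.sin(pos * freq))
        encoding.append(math.cos(pos * freq))
    return encoding
```

Because the vector depends only on the absolute index `pos`, positions beyond those seen in training produce encodings the model has never observed, which is the generalisation weakness noted above.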

Our design is motivated by the work of Jianlin Su [40], who proposed Rotary Position Embedding, a positional embedding method designed to enhance the performance of the Transformer architecture by integrating relative positional information into self-attention. The popular Llama 2 [10] model currently employs this position embedding approach. We therefore follow Su and embed the extracted audio and motion features into self-attention in the form of rotated positions, to better learn the features and improve computational efficiency. The basic idea is shown in Fig. 3.

First, we define a sequence of features of length $N$: $F_{N}=\{w_{i}\}_{i=1}^{N}$, where $w_{i}$ represents the $i$-th token in the input sequence. Since motion and audio features are processed in the same way by SPE, the operator $O$ in the following equations denotes the operation applied to whichever feature vector is being processed. The embedding corresponding to the input sequence $F_{N}$ is denoted as $E_{N}=\{x_{i}\}_{i=1}^{N}$, where $x_{i}$ represents the $d$-dimensional embedding vector of the $i$-th token $w_{i}$.

Before performing self-attention operations, we use the feature embedding vectors to calculate the q, k, and v vectors and incorporate the corresponding positional information. The function expressions are as follows:

$$q_{s}=O_{q}(x_{s},s) \tag{6}$$
$$k_{t}=O_{k}(x_{t},t) \tag{7}$$
$$v_{t}=O_{v}(x_{t},t) \tag{8}$$

Here, $q_{s}$ represents the query vector for the $s$-th token with positional information $s$ integrated into the feature vector $x_{s}$, while $k_{t}$ and $v_{t}$ represent the key and value vectors for the $t$-th token with positional information $t$ integrated into the feature vector $x_{t}$.

Then, we need to compute the output of self-attention for the s-th feature embedding vector xssubscript𝑥𝑠x_{s}italic_x start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT . This involves calculating an attention score between qssubscript𝑞𝑠q_{s}italic_q start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT and other ktsubscript𝑘𝑡k_{t}italic_k start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , and then multiplying the attention score by the corresponding vtsubscript𝑣𝑡v_{t}italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , followed by summation to obtain the output vector ossubscript𝑜𝑠o_{s}italic_o start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT:

\begin{align}
a_{s,t} &= \frac{\exp\left(\frac{q_s^{T} k_t}{\sqrt{d}}\right)}{\sum_{j=1}^{N} \exp\left(\frac{q_s^{T} k_j}{\sqrt{d}}\right)} \tag{9}\\
o_s &= \sum_{t=1}^{N} a_{s,t}\, v_t \tag{10}
\end{align}
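As an illustration of the softmax-weighted sum described above, the following NumPy sketch computes the attention weights and the output for a single query. The function name `attention_output` and the random test data are assumptions made for this example and do not come from the released code.

```python
import numpy as np

def attention_output(q_s, K, V):
    """Output o_s for one query q_s (d,), keys K (N, d) and values V (N, d_v)."""
    d = q_s.shape[0]
    logits = K @ q_s / np.sqrt(d)        # q_s^T k_t / sqrt(d) for every t
    logits = logits - logits.max()       # subtract the max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()    # attention scores a_{s,t}
    return weights @ V, weights          # weighted sum of the value vectors

rng = np.random.default_rng(0)
K = rng.normal(size=(6, 4))              # illustrative keys
V = rng.normal(size=(6, 3))              # illustrative values
o_s, a_s = attention_output(K[2], K, V)  # use the third key as the query
```

The weights are positive and sum to one, so the output is a convex combination of the value vectors.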

Next, to leverage the relative positional relationships between tokens, we assume that the dot product between the query vector $q_s$ and the key vector $k_t$ can be expressed by a function $g$ whose inputs are the feature embedding vectors $x_s$, $x_t$ and their relative position $s-t$:

\begin{equation}
\langle O_q(x_s, s),\, O_k(x_t, t) \rangle = g(x_s, x_t, s-t) \tag{11}
\end{equation}

The following position embedding, expressed with complex exponentials, upholds this relationship:

\begin{align}
O_q(x_s, s) &= (W_q x_s)\, e^{is\theta} \tag{12}\\
O_k(x_t, t) &= (W_k x_t)\, e^{it\theta} \tag{13}\\
g(x_s, x_t, s-t) &= \mathrm{Re}\left[(W_q x_s)(W_k x_t)^{*}\, e^{i(s-t)\theta}\right] \tag{14}
\end{align}

Here, $e$ is the base of the natural logarithm, $i$ is the imaginary unit, $\mathrm{Re}[\cdot]$ takes the real part of a complex number, and ${}^{*}$ denotes the complex conjugate.

We can use Euler’s formula $e^{ix} = \cos x + i \sin x$, which holds for any real number $x$, where $\cos x$ is the real part and $\sin x$ is the imaginary part of the complex number.

Applying Euler’s formula, the exponential terms in $O_q$, $O_k$ and $g$ can be expanded as:

\begin{align}
e^{is\theta} &= \cos(s\theta) + i\sin(s\theta) \tag{15}\\
e^{it\theta} &= \cos(t\theta) + i\sin(t\theta) \tag{16}\\
e^{i(s-t)\theta} &= \cos((s-t)\theta) + i\sin((s-t)\theta) \tag{17}\\
O_q(x_s, s) &= (W_q x_s)\, e^{is\theta} \tag{18}
\end{align}

Then, taking the two-dimensional case, we can write $q_s$ as a matrix-vector product:

\begin{align}
q_s &= \begin{pmatrix} q_s^{(1)} \\ q_s^{(2)} \end{pmatrix} = W_q x_s = \begin{pmatrix} W_q^{(11)} & W_q^{(12)} \\ W_q^{(21)} & W_q^{(22)} \end{pmatrix} \begin{pmatrix} x_s^{(1)} \\ x_s^{(2)} \end{pmatrix} \tag{25}\\
O_q(x_s, s) &= (W_q x_s)\, e^{is\theta} = q_s e^{is\theta} \tag{26}
\end{align}

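Therefore, treating $q_s$ as the complex number $q_s^{(1)} + i\, q_s^{(2)}$ and multiplying it by $e^{is\theta}$, we get the following result, written as a pair of real and imaginary parts: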

\begin{equation}
q_s e^{is\theta} = \left[ q_s^{(1)}\cos(s\theta) - q_s^{(2)}\sin(s\theta),\; q_s^{(2)}\cos(s\theta) + q_s^{(1)}\sin(s\theta) \right] \tag{27}
\end{equation}

This expression is exactly the query vector multiplied by a rotation matrix:

\begin{align}
O_q(x_s, s) &= (W_q x_s)\, e^{is\theta} = q_s e^{is\theta} \notag\\
&= \begin{pmatrix} \cos(s\theta) & -\sin(s\theta) \\ \sin(s\theta) & \cos(s\theta) \end{pmatrix} \begin{pmatrix} q_s^{(1)} \\ q_s^{(2)} \end{pmatrix} \tag{32}
\end{align}

Similarly, the key vector $k_t$ can be represented as follows:

\begin{align}
O_k(x_t, t) &= (W_k x_t)\, e^{it\theta} = k_t e^{it\theta} \tag{36}\\
&= \begin{pmatrix} \cos(t\theta) & -\sin(t\theta) \\ \sin(t\theta) & \cos(t\theta) \end{pmatrix} \begin{pmatrix} k_t^{(1)} \\ k_t^{(2)} \end{pmatrix} \tag{40}
\end{align}

Combining the two results, the inner product of the rotated query and key simplifies to an expression that depends on the positions only through $s-t$:

\begin{align}
\langle O_q(x_s, s),\, O_k(x_t, t) \rangle
&= \left( \begin{pmatrix} \cos(s\theta) & -\sin(s\theta) \\ \sin(s\theta) & \cos(s\theta) \end{pmatrix} \begin{pmatrix} q_s^{(1)} \\ q_s^{(2)} \end{pmatrix} \right)^{T} \tag{45}\\
&\quad\; \begin{pmatrix} \cos(t\theta) & -\sin(t\theta) \\ \sin(t\theta) & \cos(t\theta) \end{pmatrix} \begin{pmatrix} k_t^{(1)} \\ k_t^{(2)} \end{pmatrix} \tag{50}\\
&= \begin{pmatrix} q_s^{(1)} & q_s^{(2)} \end{pmatrix} \begin{pmatrix} \cos((s-t)\theta) & \sin((s-t)\theta) \\ -\sin((s-t)\theta) & \cos((s-t)\theta) \end{pmatrix} \begin{pmatrix} k_t^{(1)} \\ k_t^{(2)} \end{pmatrix} \tag{56}
\end{align}

In summary, self-attention with Spin Position Embedding proceeds as follows. For each feature embedding vector in the token sequence, we first calculate its query and key vectors. Then, for each token position, we calculate the corresponding rotation angles. Next, we apply the rotation to the elements of the query and key vectors in pairs at each token position. Finally, we calculate the dot product between the rotated query and key vectors to obtain the self-attention score.
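The procedure can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: the function name `spin_embed`, the pairing of adjacent channels, and the per-pair frequencies (chosen in the style of rotary position embeddings) are assumptions, since the derivation above uses a single generic angle $\theta$.

```python
import numpy as np

def spin_embed(x, freqs):
    """Rotate adjacent channel pairs of x (N, d) by the angle position * freqs."""
    n, d = x.shape
    angles = np.arange(n)[:, None] * freqs[None, :]  # (N, d/2): s * theta per pair
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # first and second component of each pair
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # first row of the 2x2 rotation matrix
    out[:, 1::2] = x2 * cos + x1 * sin               # second row of the 2x2 rotation matrix
    return out

d, n = 8, 16
# Assumed per-pair frequencies; any fixed set of angles gives the same property.
freqs = 10000.0 ** (-np.arange(d // 2) * 2.0 / d)

rng = np.random.default_rng(0)
# Same query/key content at every position, so only the position varies.
q = np.tile(rng.normal(size=(1, d)), (n, 1))
k = np.tile(rng.normal(size=(1, d)), (n, 1))

scores = spin_embed(q, freqs) @ spin_embed(k, freqs).T  # (N, N) rotated dot products
```

Because the content is identical at every position, `scores[s, t]` depends only on the offset `s - t`; for example, `scores[5, 2]` and `scores[13, 10]` coincide, and the diagonal equals the unrotated dot product.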

3.4 Quaternion Rotary Attention

Figure 4: The overall structure of our Transformer. Our Transformer structure enhances the generalisation ability of the model by adding regularisation such as Dropout in multiple places and by adjusting the number of attention heads to expand the model capacity relative to the original. Using Spin Position Embedding, the absolute position information of the input sequence is converted into a polar coordinate representation $(\rho, \theta)$ of the relative position, where $\rho$ denotes the distance from the centre point and $\theta$ denotes the relative angle. This Spin Position Embedding module provides better local relative positions with some rotational invariance. In this way, our model can better support tasks that are sensitive to position information, such as behavioural sequence modelling and 3D shape analysis.

For the features after the rotated attention module, we assume that there are an $N$-length query series $X$ and an $M$-length key-value series $\gamma$. First, $X$ and $\gamma$ are projected onto the representation space: $Q = X W^{Q} \in \mathbb{R}^{N \times d}$, $K = \gamma W^{K} \in \mathbb{R}^{M \times d}$ and $V = \gamma W^{V} \in \mathbb{R}^{M \times d}$.

Here, $Q$ represents the queries, $K$ the keys and $V$ the values, $d$ is the number of channels in the attention layer, and $W^{Q}$, $W^{K}$, $W^{V}$ are trainable weights. QRA then calculates $H = \mathrm{Attn}(X, \gamma)$, which maps the query series to the output $H$ using the key-value series, through the following steps.

Frequency/phase-Generation:

\begin{equation}
\begin{gathered}
\begin{pmatrix} \omega_1^{\mathrm{Q}} \\ \cdots \\ \omega_P^{\mathrm{Q}} \end{pmatrix}, \begin{pmatrix} \theta_1^{\mathrm{Q}} \\ \cdots \\ \theta_P^{\mathrm{Q}} \end{pmatrix} = \mathrm{Conv}(Q; W_{\omega}^{\mathrm{Q}}),\ \mathrm{Conv}(Q; W_{\theta}^{\mathrm{Q}}), \\
\begin{pmatrix} \omega_1^{\mathrm{K}} \\ \cdots \\ \omega_P^{\mathrm{K}} \end{pmatrix}, \begin{pmatrix} \theta_1^{\mathrm{K}} \\ \cdots \\ \theta_P^{\mathrm{K}} \end{pmatrix} = \mathrm{Conv}(K; W_{\omega}^{\mathrm{K}}),\ \mathrm{Conv}(K; W_{\theta}^{\mathrm{K}}),
\end{gathered} \tag{71}
\end{equation}

Series-Rotation:

\begin{equation}
\begin{aligned}
\Phi_p(Q, \mathrm{pos}^{\mathrm{Q}}) &= \tilde{Q}\, e^{\mathbf{i}(2\pi \omega_p^{\mathrm{Q}} \mathrm{pos}^{\mathrm{Q}} + \theta_p^{\mathrm{Q}})}, \quad p = 1, 2, \cdots, P \\
\Psi_p(K, \mathrm{pos}^{\mathrm{K}}) &= \tilde{K}\, e^{\mathbf{j}(2\pi \omega_p^{\mathrm{K}} \mathrm{pos}^{\mathrm{K}} + \theta_p^{\mathrm{K}})}, \quad p = 1, 2, \cdots, P
\end{aligned} \tag{72}
\end{equation}

Series-Attention with softmax-kernel (shown in Fig. 5):

\begin{equation}
S = \mathrm{softmax}\left( \frac{1}{P\sqrt{d}} \sum_{p=1}^{P} \mathrm{Re}\left[ \Phi_p(Q, \mathrm{pos}^{\mathrm{Q}})\, \Psi_p(K, \mathrm{pos}^{\mathrm{K}})^{\mathsf{H}} \right] \right) \tag{73}
\end{equation}

Series-Aggregation:

\begin{equation}
H = SV \tag{74}
\end{equation}
Figure 5: A three-dimensional illustration of a rotated softmax-kernel. The rotated softmax-kernel represents the embeddings in quaternion form and rotates them using the angular frequency $\omega$, so that embeddings with different phases can be distinguished. Finally, the similarity of the rotated embeddings is measured by the exponential dot product between them.

Here, we hypothesize that the series have $P$ periods, where $P$ is a hyper-parameter. In the frequency/phase-generation step, we use 1D convolutions with ReLU activation to generate $P$ latent frequencies $\omega_{1 \sim P}^{\mathrm{Q}} \in [0, +\infty)^{N \times 1}$ ($\omega_{1 \sim P}^{\mathrm{K}}$ is obtained similarly). Convolutions can effectively capture the local context of each time step to generate reliable latent frequencies, and these latent frequencies are not identical across time steps, which allows for variable periods. Moreover, to account for phase shifts, we additionally generate $P$ latent phases $\theta_{1 \sim P}^{\mathrm{Q}} \in (-\pi, \pi)^{N \times 1}$ using 1D convolutions with activation $\pi \cdot \tanh$ ($\theta_{1 \sim P}^{\mathrm{K}}$ is obtained similarly). In the series-rotation step, we rotate the representations at each time step according to the latent frequencies and phases learned in the previous step. Each row vector of $\tilde{Q}$, $\tilde{K}$ is the quaternion form of the corresponding row vector of $Q$, $K$, and $\mathrm{pos}^{\mathrm{Q}} = [0, 1, 2, \cdots, N-1]^{T}/N$ and $\mathrm{pos}^{\mathrm{K}} = [0, 1, 2, \cdots, M-1]^{T}/M$ are the position vectors of series $Q$ and $K$, respectively. In the series-attention step, to jointly capture position-wise similarity under multiple periods, the unnormalized similarity is the mean of the quaternion dot products under the $P$ rotations. Finally, in the series-aggregation step, the output $H \in \mathbb{R}^{N \times d}$ is generated using the softmax-normalized similarity. In practice, we employ the multi-head variant of QRA; we omit the details here, as it can be derived directly. Note that QRA is more expressive than canonical dot-product attention: when $P = 1$, $\omega = 0$ and $\theta = 0$, QRA degenerates into canonical attention.
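The series-rotation, series-attention and series-aggregation steps can be sketched in NumPy as below. This is only an illustration under stated assumptions: every four consecutive channels are treated as one quaternion (the text does not specify how the quaternion form is built), the rotation is applied by right-multiplication, and the latent frequencies and phases are passed in as arrays instead of being produced by 1D convolutions. The names `qmul`, `rotor` and `qra` are hypothetical.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions stored as [..., 4] arrays (w, x, y, z)."""
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def rotor(angle, axis):
    """Unit quaternion cos(angle) + u*sin(angle), with u = i (axis=1) or j (axis=2)."""
    r = np.zeros(angle.shape + (1, 4))
    r[..., 0, 0] = np.cos(angle)
    r[..., 0, axis] = np.sin(angle)
    return r

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def qra(Q, K, V, omega_q, theta_q, omega_k, theta_k):
    """omega_q/theta_q have shape (P, N); omega_k/theta_k have shape (P, M)."""
    N, d = Q.shape
    M, P = K.shape[0], omega_q.shape[0]
    pos_q, pos_k = np.arange(N) / N, np.arange(M) / M
    Qt = Q.reshape(N, d // 4, 4)   # assumed quaternion form: 4 channels per quaternion
    Kt = K.reshape(M, d // 4, 4)
    scores = np.zeros((N, M))
    for p in range(P):
        phi = qmul(Qt, rotor(2 * np.pi * omega_q[p] * pos_q + theta_q[p], axis=1))
        psi = qmul(Kt, rotor(2 * np.pi * omega_k[p] * pos_k + theta_k[p], axis=2))
        # Re[a * conj(b)] of two quaternions equals their 4-D dot product.
        scores += phi.reshape(N, d) @ psi.reshape(M, d).T
    S = softmax(scores / (P * np.sqrt(d)))  # series-attention
    return S @ V                            # series-aggregation

rng = np.random.default_rng(0)
N, M, d = 5, 7, 8
Q, K, V = rng.normal(size=(N, d)), rng.normal(size=(M, d)), rng.normal(size=(M, 3))
H = qra(Q, K, V, np.zeros((1, N)), np.zeros((1, N)), np.zeros((1, M)), np.zeros((1, M)))
H_ref = softmax(Q @ K.T / np.sqrt(d)) @ V   # canonical dot-product attention
```

With $P = 1$ and all frequencies and phases set to zero, `H` matches `H_ref`, which mirrors the degenerate case noted in the text.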

4 Experiments

4.1 Datasets

The AIST++ [41] dance movement dataset was constructed from the AIST dance video database [19]. A well-developed process was designed for estimating camera parameters, 3D human keypoints and 3D human dance movement sequences from multi-view videos. The dataset provides 3D human keypoint annotations and camera parameters for 10.1 million images covering 30 different subjects in 9 viewpoints. These features make it the largest and richest dataset containing 3D human keypoint annotations currently available. Additionally, the dataset contains 1,408 3D human dance movement sequences represented as joint rotations and root trajectories. These dance movements are evenly distributed across 10 dance genres and contain hundreds of choreographies. The duration of the movements ranges from 7.4 to 48.0 seconds, and each dance movement is accompanied by corresponding music. Based on these annotations, AIST++ is designed to support multiple tasks including multi-view human keypoint estimation, human motion prediction/generation, and cross-modal analysis between human motion and music.

4.2 Implementation Details

In our primary experiments, the model takes a seed motion sequence spanning 120 frames (2 seconds) and a music sequence covering 240 frames (4 seconds) as input. The two sequences are aligned at the initial frame, and the model outputs a future motion sequence of N = 20 frames, supervised by an L2 loss. During inference, future motions are generated continuously in an auto-regressive manner at 60 FPS, with only the first predicted motion retained at each step.

For music feature extraction, we employ the publicly available audio processing toolbox Librosa [42] to extract a 1-dimensional envelope, 20-dimensional MFCC, 12-dimensional chroma, 1-dimensional one-hot peaks, and 1-dimensional one-hot beats, resulting in a 35-dimensional music feature. The motion features combine a 9-dimensional rotation matrix representation for each of the 24 joints with a 3-dimensional global translation vector, resulting in a 219-dimensional motion feature (24 × 9 + 3 = 219). These raw audio and motion features are first embedded into 800-dimensional hidden representations using linear layers, and learnable position embeddings are added before they are fed into the transformer layers. All three transformers (audio, motion, cross-modal) use 16 attention heads with a hidden size of 800.

All experiments are trained using the Adam optimizer with a batch size of 16. The learning rate starts at 1e-4 and decays to 1e-5 at 90k steps and to 1e-6 at 150k steps. Training concludes after 500k steps, taking approximately 2 days on one RTX 3090. The baseline comparison includes the latest works on 3D dance generation with music and seed motion as input, such as Li et al. [19] and Li et al. [4]. For a more comprehensive evaluation, we also compare against the recent state-of-the-art 2D dance generation method, DanceRevolution [5]. We adapt this work to output 3D joint positions for a direct quantitative comparison with our results, even though joint positions do not allow for immediate repositioning. The official code provided by the authors is used to train and test these baselines on the same dataset as ours.
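The auto-regressive inference described above can be sketched as follows. This is only an illustration under stated assumptions: the `predict` callable, the `dummy_predict` stand-in, and the way the music window slides by one frame per step are hypothetical and are not taken from the released code. Only the window lengths and feature dimensions come from the text.

```python
from typing import Callable, List

SEED_FRAMES = 120                  # seed motion window (2 s at 60 FPS)
MUSIC_FRAMES = 240                 # music window (4 s at 60 FPS)
MUSIC_DIM = 1 + 20 + 12 + 1 + 1    # envelope, MFCC, chroma, peaks, beats
MOTION_DIM = 24 * 9 + 3            # 24 joint rotation matrices + translation

Frame = List[float]
# Hypothetical predictor: (motion window, music window) -> N future frames.
Predictor = Callable[[List[Frame], List[Frame]], List[Frame]]


def rollout(predict: Predictor, seed: List[Frame],
            music: List[Frame], steps: int) -> List[Frame]:
    """Generate `steps` frames, keeping only the first prediction per step."""
    if len(seed) != SEED_FRAMES:
        raise ValueError("seed must contain exactly SEED_FRAMES frames")
    if len(music) < max(steps - 1, 0) + MUSIC_FRAMES:
        raise ValueError("not enough music frames for the requested steps")
    motion = list(seed)
    for t in range(steps):
        motion_window = motion[-SEED_FRAMES:]      # most recent 120 frames
        music_window = music[t:t + MUSIC_FRAMES]   # assumed: slides by one frame
        future = predict(motion_window, music_window)
        motion.append(future[0])                   # discard the other N - 1 frames
    return motion[SEED_FRAMES:]


def dummy_predict(motion_window: List[Frame],
                  music_window: List[Frame]) -> List[Frame]:
    """Stand-in model: increments the first coordinate of the last frame."""
    last = motion_window[-1]
    nxt = [last[0] + 1.0] + last[1:]
    return [nxt] * 20                              # N = 20 predicted frames
```

Calling `rollout(dummy_predict, seed, music, steps=5)` with an all-zero seed returns five frames whose first coordinate counts up from 1.0, which makes it easy to see that each step consumes the frame produced by the previous one.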

4.3 Quantitative Evaluation

In this section, we assess the performance of our proposed Multi-modal Roformer across three key dimensions: (1) motion quality, (2) generation diversity, and (3) motion-music correlation. The results presented in Table 1 show that, under identical experimental conditions, our model surpasses state-of-the-art methods [2, 6, 7] in these aspects.

Motion Quality: Similar to previous studies, we assess the quality of generated motion by computing the Fréchet Inception Distance (FID) [43], which measures the dissimilarity between the distribution of generated motion and that of ground-truth motion. To capture motion features, we utilize two carefully designed motion feature extractors, as undisclosed motion encoders were employed in earlier works [44]. These extractors are: (1) a geometric feature extractor, which generates a boolean vector representing geometric relationships among specific body points in the motion sequence, and (2) a dynamic feature extractor, which maps the motion sequence to features capturing dynamic aspects such as velocity and acceleration. We denote the FID based on these geometric and dynamic features as $FID_g$ and $FID_k$, respectively. The metrics are computed by comparing real dance motion sequences in the AIST++ test set with 40 generated motion sequences, each comprising T = 1200 frames (20 seconds). As shown in Table 1, the distributions of our generated motion sequences are closer to ground-truth motion than those of the compared methods.
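For reference, FID as defined in [43] fits a Gaussian to each set of features and compares the two. Assuming the same definition is applied here to the geometric and dynamic motion features, and writing $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ for the mean and covariance of the real and generated features (symbols chosen for this note, not taken from the paper), the distance is:

$$\mathrm{FID} = \left\| \mu_r - \mu_g \right\|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$

Lower values indicate that the generated feature distribution is closer to the real one.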

Generation Diversity: We also assess the model’s capacity to generate diverse dance movements in response to different input music, comparing its performance to baseline methods. Following a methodology similar to previous research [45], we compute the average Euclidean distance in the feature space of 40 generated motions from the AIST++ test set to quantify diversity. The motion diversity in the geometric and dynamic feature spaces is denoted as $Dist_g$ and $Dist_k$, respectively. Table 1 shows that our method generates more diverse dance movements than the baselines, with the exception of Li [29]. The latter discretizes motions, resulting in discontinuous outputs and elevated $Dist_k$.
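A minimal sketch of such a diversity score is given below. It assumes that the average is taken over all unordered pairs of generated motions, which the text does not state explicitly; the function name is illustrative.

```python
import math
from itertools import combinations
from typing import Sequence


def average_pairwise_distance(features: Sequence[Sequence[float]]) -> float:
    """Mean Euclidean distance over all unordered pairs of feature vectors."""
    pairs = list(combinations(features, 2))
    if not pairs:
        raise ValueError("need at least two feature vectors")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)
```

With 40 generated motions this averages 780 pairwise distances, computed once in the geometric feature space and once in the dynamic feature space.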

Motion-Music Correlation: Moreover, we gauge the correlation between the generated 3D motion and input music by introducing a novel metric known as the Beat Alignment Score. This metric evaluates the motion-music correlation by measuring the similarity between the beats in the motion and music. Librosa [42] is employed to extract music beats, while motion beats are computed as local minima in the motion velocity. The Beat Alignment Score is based on the average distance between each motion beat and its nearest music beat. To be specific, our Beat Alignment Score is defined as:

$$\mathrm{BeatAlign} = \frac{1}{z}\sum_{i=1}^{z}\exp\left(-\frac{\min_{\forall t_j^d \in B^d}\left\|t_i^c - t_j^d\right\|^2}{2\alpha^2}\right) \tag{75}$$

where $B^c = \{t_i^c\}$ is the set of motion beats, $B^d = \{t_j^d\}$ is the set of music beats, $z$ is the number of motion beats, and $\alpha$ is a parameter for normalizing sequences with different FPS.

We set $\alpha = 3$ for all experiments since the FPS for all our experimental sequences is 60. A similar metric called Beat Hit Rate has been introduced in prior work, but it relies on manually set thresholds for alignment (“hits”) depending on the dataset, while our metric directly measures distances. This metric is explicitly designed to be unidirectional, as dance movements do not necessarily need to match every music beat. On the other hand, each motion beat should have a corresponding music beat. To calibrate the results, we compute correlation metrics for the entire AIST++ dataset (upper bound) and randomly paired data (lower bound). As shown in Table 1, our generated motion shows better correlation with input music compared to the baselines. However, there is still considerable room for improvement for all methods, including ours, when compared to actual data. This reflects that music-motion correlation remains a challenging problem.
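The Beat Alignment Score defined above can be sketched in a few lines. The sketch assumes that beat positions are given as frame indices at 60 FPS and that motion beats are strict local minima of a per-frame speed curve; the function names are illustrative and are not taken from the released code.

```python
import math
from typing import List, Sequence


def motion_beats(speed: Sequence[float]) -> List[int]:
    """Frame indices where the speed curve has a strict local minimum."""
    return [i for i in range(1, len(speed) - 1)
            if speed[i] < speed[i - 1] and speed[i] < speed[i + 1]]


def beat_align(motion_beat_frames: Sequence[float],
               music_beat_frames: Sequence[float],
               alpha: float = 3.0) -> float:
    """Average Gaussian-weighted distance from each motion beat
    to its nearest music beat (unidirectional)."""
    if not motion_beat_frames or not music_beat_frames:
        raise ValueError("both beat sets must be non-empty")
    total = 0.0
    for t_c in motion_beat_frames:
        nearest_sq = min((t_c - t_d) ** 2 for t_d in music_beat_frames)
        total += math.exp(-nearest_sq / (2.0 * alpha ** 2))
    return total / len(motion_beat_frames)
```

The score is 1.0 when every motion beat coincides with a music beat and decays toward 0 as motion beats drift away. Music beats without a nearby motion beat are not penalized, which reflects the unidirectional design.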

| Methods | Motion Quality: $FID_k$ | Motion Quality: $FID_g$ | Motion Diversity: $Dist_k$ | Motion Diversity: $Dist_g$ | Motion-Music Corr.: BeatAlign |
|---|---|---|---|---|---|
| AIST++ | - | - | 9.057 | 7.556 | 0.292 |
| AIST++ (random) | - | - | - | - | 0.213 |
| Li et al. [5] | 86.43 | 20.58 | 6.85 | 4.93 | 0.232 |
| Dancenet [6] | 69.18 | 17.76 | 2.86 | 2.72 | 0.232 |
| DanceRevolution [7] | 73.42 | 31.01 | 3.52 | 2.46 | 0.22 |
| FACT (baseline) [19] | 48.95 | 28.1 | 4.9 | 6.69 | 0.232 |
| Ours | 30.1 | 11.5 | 7.82 | 9.37 | 0.239 |

Table 1: Conditional motion generation evaluation on the AIST++ dataset. Compared to the three recent state-of-the-art methods, our generated motions are more realistic, better correlated with input music and more diversified.

4.4 Ablation Study

We conducted ablation studies on the Spin Position Embedding and Multi-modal Quaternion parameterization, respectively. The quantitative scores are shown in Table 2.

Position Embedding: In the ablation experiments on position encoding, we compare two approaches: (1) a learnable encoding of absolute positions (baseline), and (2) a rotary encoding of relative positions. Approach (2) was selected to introduce explicit relative position dependence into the self-attention formulation. This choice offers increased flexibility in sequence length, a potential reduction in dependencies between tokens, and the capacity to encode relative positions for linear self-attention. As shown in Table 2, the rotary embedding of relative positions leads to a significant reduction in $FID_g$ compared to the original method. This indicates that dances generated using rotary position embedding are notably closer to reality.
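The relative-position property can be illustrated with the rotary scheme of RoFormer [40]. The sketch below is that generic scheme, written under the assumption that the rotary encoding used here follows it; it is not the SPE module itself, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def rotary_embed(x: Sequence[float], position: int,
                 base: float = 10000.0) -> List[float]:
    """Rotate each pair (x[i], x[i+1]) by the angle position * base**(-i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out: List[float] = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product, standing in for an attention logit."""
    return sum(u * v for u, v in zip(a, b))
```

Because each pair is rotated by an angle proportional to its position, the inner product of a rotated query at position m and a rotated key at position n depends only on the offset between m and n. For example, the logits for positions (3, 1) and (12, 10) agree up to floating-point error.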

Quaternion parameterization: Here, we performed ablation experiments on the original baseline with and without the QRA module. As shown in Table 2, adding Quaternion Rotary Attention (QRA), which introduces quaternion operations to model the relationship between audio and motion, improves $FID_k$ from 48.95 to 46.33, $FID_g$ from 28.1 to 26.2, and BeatAlign from 0.232 to 0.236 over the baseline.

| Method | $FID_k$ | $FID_g$ | BeatAlign ↑ |
|---|---|---|---|
| baseline | 48.95 | 28.1 | 0.232 |
| baseline + Spin Position Embedding | 30.1 | 11.5 | 0.239 |
| baseline + QRA | 46.33 | 26.2 | 0.236 |

Table 2: Ablation study on Spin Position Embedding and Quaternion Rotary Attention. As shown in the table, both variants improve on the baseline across all reported metrics.

Figure 6: Frame extraction. The visual representation clearly emphasizes the effectiveness of our proposed method.

Figure 7: This image illustrates the frame extraction effect of a dance generated by alternative methods. During the post-production phase, the generated dance movements exhibit dance collapse and physically implausible limb folding.

5 Conclusion

We propose a network called QEAN for generating 3D dance movements. QEAN employs Spin Position Embedding (SPE) in the position encoding part to embed position information into self-attention in a rotational manner, which improves the model’s representation of the sequences and enhances its understanding of the temporal order of human movements. Additionally, we propose Quaternion Rotary Attention (QRA), a quaternion-valued relational learning network, which uses quaternion values to explore the temporal coordination between music and dance. Experiments on the AIST++ dataset show that our approach outperforms the compared methods on the 3D dance generation task. Furthermore, the results of our ablation experiments illustrate the importance of SPE and QRA in this task.

6 Acknowledgement

This work was supported in part by National Natural Science Foundation under Grant 92267107, the Science and Technology Planning Project of Guangdong under Grant 2021B0101220006, Science and Technology Projects in Guangzhou under Grant 202201011706, Key Areas Research and Development Program of Guangzhou under Grant 2023B01J0029, Science and technology research in key areas in Foshan under Grant 2020001006832, Key Area Research and Development Program of Guangdong Province under Grant 2018B010109007 and 2019B010153002, Science and technology projects of Guangzhou under Grant 202007040006, and Guangdong Provincial Key Laboratory of Cyber-Physical System under Grant 2020B1212060069.

7 Declarations

Conflict of interest We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

References

  • [1] Yue Yang and Ensi Zhang “Cultural thought and philosophical elements of singing and dancing in Indian films” In Trans/Form/Ação 46, 2023, pp. 315–328 DOI: 10.1590/0101-3173.2023.v46n4.p315
  • [2] Mark Siciliano “A citation analysis of business librarianship: Examining the Journal of Business and Finance Librarianship from 1990–2014” In Journal of Business & Finance Librarianship 22, 2017, pp. 81–96 URL: https://api.semanticscholar.org/CorpusID:63474056
  • [3] Andreas Aristidou et al. “Style-based motion analysis for dance composition” In The Visual Computer 34, 2018, pp. 1725–1737 URL: https://api.semanticscholar.org/CorpusID:27531229
  • [4] Jiaman Li et al. “Learning to Generate Diverse Dance Motions with Transformer” In ArXiv abs/2008.08171, 2020 URL: https://api.semanticscholar.org/CorpusID:221173065
  • [5] Ruozi Huang et al. “Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning” In International Conference on Learning Representations, 2020 URL: https://api.semanticscholar.org/CorpusID:235614403
  • [6] Xinjian Zhang et al. “Dance Generation with Style Embedding: Learning and Transferring Latent Representations of Dance Styles” In ArXiv abs/2104.14802, 2021 URL: https://api.semanticscholar.org/CorpusID:233476346
  • [7] Ruozi Huang et al. “Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning” In International Conference on Learning Representations, 2020 URL: https://api.semanticscholar.org/CorpusID:235614403
  • [8] Samy Bengio, Oriol Vinyals, Navdeep Jaitly and Noam M. Shazeer “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks” In ArXiv abs/1506.03099, 2015 URL: https://api.semanticscholar.org/CorpusID:1820089
  • [9] Zhifeng Xie et al. “BaGFN: Broad Attentive Graph Fusion Network for High-Order Feature Interactions” In IEEE Transactions on Neural Networks and Learning Systems 34, 2021, pp. 4499–4513 URL: https://api.semanticscholar.org/CorpusID:238476689
  • [10] Shiry Ginosar et al. “Learning Individual Styles of Conversational Gesture” In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3492–3501 URL: https://api.semanticscholar.org/CorpusID:182952539
  • [11] Bin Sheng, Ping Li, Riaz Ali and C. L. Philip Chen “Improving Video Temporal Consistency via Broad Learning System” In IEEE Transactions on Cybernetics 52.7, 2022, pp. 6662–6675 DOI: 10.1109/TCYB.2021.3079311
  • [12] Alexey Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” In ArXiv abs/2010.11929, 2020 URL: https://api.semanticscholar.org/CorpusID:225039882
  • [13] Ze Liu et al. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10002 URL: https://api.semanticscholar.org/CorpusID:232352874
  • [14] Ashish Vaswani et al. “Attention is All you Need” In Neural Information Processing Systems, 2017 URL: https://api.semanticscholar.org/CorpusID:13756489
  • [15] Xiao Lin et al. “EAPT: Efficient Attention Pyramid Transformer for Image Processing” In IEEE Transactions on Multimedia 25, 2021, pp. 50–61 URL: https://api.semanticscholar.org/CorpusID:245536278
  • [16] Zinuo Li, Xuhang Chen, Chi-Man Pun and Xiaodong Cun “High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset and A Frequency-Aware Shadow Erasing Net” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 12449–12458
  • [17] Zinuo Li, Xuhang Chen, Shuqiang Wang and Chi-Man Pun “A Large-Scale Film Style Dataset for Learning Multi-frequency Driven Film Enhancement” Main Track In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23 International Joint Conferences on Artificial Intelligence Organization, 2023, pp. 1160–1168 DOI: 10.24963/ijcai.2023/129
  • [18] Shenghong Luo et al. “Devignet: High-Resolution Vignetting Removal via a Dual Aggregated Fusion Transformer With Adaptive Channel Expansion” In arXiv preprint arXiv:2308.13739, 2023
  • [19] Ruilong Li, Sha Yang, David A. Ross and Angjoo Kanazawa “AI Choreographer: Music Conditioned 3D Dance Generation with AIST++” In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13381–13392 URL: https://api.semanticscholar.org/CorpusID:236882798
  • [20] Lian Siyao et al. “Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory” In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11040–11049 URL: https://api.semanticscholar.org/CorpusID:247627867
  • [21] Dario Pavllo, Christoph Feichtenhofer, Michael Auli and David Grangier “Modeling Human Motion with Quaternion-Based Neural Networks” In International Journal of Computer Vision 128, 2019, pp. 855–872 URL: https://api.semanticscholar.org/CorpusID:59158790
  • [22] Weizhao Ma et al. “PCMG:3D point cloud human motion generation based on self-attention and transformer” In The Visual Computer, 2023 URL: https://api.semanticscholar.org/CorpusID:261566852
  • [23] David Greenwood, Stephen D. Laycock and Iain Matthews “Predicting Head Pose from Speech with a Conditional Variational Autoencoder” In Interspeech, 2017 URL: https://api.semanticscholar.org/CorpusID:11113871
  • [24] Yuhang Huang et al. “Genre-Conditioned Long-Term 3D Dance Generation Driven by Music” In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4858–4862 URL: https://api.semanticscholar.org/CorpusID:249437513
  • [25] Sepp Hochreiter and Jürgen Schmidhuber “Long Short-Term Memory” In Neural Computation 9, 1997, pp. 1735–1780 URL: https://api.semanticscholar.org/CorpusID:1915014
  • [26] Qihang Yu et al. “Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP” In ArXiv abs/2308.02487, 2023 URL: https://api.semanticscholar.org/CorpusID:260611350
  • [27] Yao-Hung Hubert Tsai et al. “Multimodal Transformer for Unaligned Multimodal Language Sequences” In Proceedings of the conference. Association for Computational Linguistics. Meeting 2019, 2019, pp. 6558–6569 URL: https://api.semanticscholar.org/CorpusID:173990158
  • [28] Ziheng Wu et al. “EasyPhoto: Your Smart AI Photo Generator”, 2023 URL: https://api.semanticscholar.org/CorpusID:263829612
  • [29] Purva Tendulkar, Abhishek Das, Aniruddha Kembhavi and Devi Parikh “Feel The Music: Automatically Generating A Dance For An Input Song” In ArXiv abs/2006.11905, 2020 URL: https://api.semanticscholar.org/CorpusID:219572850
  • [30] Jogendra Nath Kundu, Himanshu Buckchash, Priyanka Mandikal and Rahul “Cross-Conditioned Recurrent Networks for Long-Term Synthesis of Inter-Person Human Motion Interactions” In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 2713–2722 URL: https://api.semanticscholar.org/CorpusID:214675800
  • [31] Linjie Li et al. “VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation” In ArXiv abs/2106.04632, 2021 URL: https://api.semanticscholar.org/CorpusID:235377363
  • [32] Partha Ghosh, Jie Song, Emre Aksan and Otmar Hilliges “Learning Human Motion Models for Long-Term Predictions” In 2017 International Conference on 3D Vision (3DV), 2017, pp. 458–466 URL: https://api.semanticscholar.org/CorpusID:13549534
  • [33] Chenfei Wu et al. “Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models” In ArXiv abs/2303.04671, 2023 URL: https://api.semanticscholar.org/CorpusID:257404891
  • [34] Zhengxiao Du et al. “GLM: General Language Model Pretraining with Autoregressive Blank Infilling” In Annual Meeting of the Association for Computational Linguistics, 2021 URL: https://api.semanticscholar.org/CorpusID:247519241
  • [35] Zongwen Bai et al. “Low-rank multimodal fusion algorithm based on context modeling” In Journal of Internet Technology 22.4, 2021, pp. 913–921
  • [36] Daniel Holden, Jun Saito and Taku Komura “A deep learning framework for character motion synthesis and editing” In ACM Transactions on Graphics (TOG) 35, 2016, pp. 1–11 URL: https://api.semanticscholar.org/CorpusID:18149328
  • [37] Haibo Qiu et al. “Cross View Fusion for 3D Human Pose Estimation” In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 4341–4350 URL: https://api.semanticscholar.org/CorpusID:201891326
  • [38] Ye Zhu et al. “Quantized GAN for Complex Music Generation from Dance Videos” In ArXiv abs/2204.00604, 2022 URL: https://api.semanticscholar.org/CorpusID:247922422
  • [39] Zewen Zheng et al. “Quaternion-Valued Correlation Learning for Few-Shot Semantic Segmentation” In IEEE Transactions on Circuits and Systems for Video Technology 33, 2023, pp. 2102–2115 URL: https://api.semanticscholar.org/CorpusID:253661872
  • [40] Jianlin Su et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding” In ArXiv abs/2104.09864, 2021 URL: https://api.semanticscholar.org/CorpusID:233307138
  • [41] Shuhei Tsuchida, Satoru Fukayama, Masahiro Hamasaki and Masataka Goto “AIST Dance Video Database: Multi-Genre, Multi-Dancer, and Multi-Camera Database for Dance Information Processing” In International Society for Music Information Retrieval Conference, 2019 URL: https://api.semanticscholar.org/CorpusID:208334750
  • [42] Brian McFee et al. “librosa: Audio and Music Signal Analysis in Python” In SciPy, 2015 URL: https://api.semanticscholar.org/CorpusID:33504
  • [43] Martin Heusel et al. “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium” In Neural Information Processing Systems, 2017 URL: https://api.semanticscholar.org/CorpusID:326772
  • [44] Kensuke Onuma, Christos Faloutsos and Jessica K. Hodgins “FMDistance: A Fast and Effective Distance Function for Motion Capture Data” In Eurographics, 2008 URL: https://api.semanticscholar.org/CorpusID:8323054
  • [45] Hao Hao Tan and Mohit Bansal “LXMERT: Learning Cross-Modality Encoder Representations from Transformers” In Conference on Empirical Methods in Natural Language Processing, 2019 URL: https://api.semanticscholar.org/CorpusID:201103729