in2IN: Leveraging individual Information to Generate Human INteractions

Pablo Ruiz-Ponce1,2, German Barquero2,3, Cristina Palmero2,3, Sergio Escalera2,3, José García-Rodríguez1
1Universidad de Alicante, San Vicente del Raspeig, Spain
2Universitat de Barcelona, Barcelona, Spain
3Computer Vision Center, Cerdanyola del Vallès, Spain
https://pabloruizponce.github.io/in2IN
Abstract

Generating human-human motion interactions conditioned on textual descriptions is useful in many areas, such as robotics, gaming, animation, and the metaverse. This utility comes with the great difficulty of modeling high-dimensional inter-personal dynamics. In addition, properly capturing the intra-personal diversity of interactions poses many challenges. Current methods generate interactions with limited diversity of intra-personal dynamics due to the limitations of the available datasets and conditioning strategies. To address this, we introduce in2IN, a novel diffusion model for human-human motion generation that is conditioned not only on the textual description of the overall interaction but also on the individual descriptions of the actions performed by each person involved in the interaction. To train this model, we use a large language model to extend the InterHuman dataset with individual descriptions. As a result, in2IN achieves state-of-the-art performance on the InterHuman dataset. Furthermore, to increase the intra-personal diversity over that of the existing interaction datasets, we propose DualMDM, a model composition technique that combines the motions generated with in2IN and the motions generated by a single-person motion prior pre-trained on HumanML3D. As a result, DualMDM generates motions with higher individual diversity and improves control over the intra-personal dynamics while maintaining inter-personal coherence.

Figure 1: We present in2IN, a diffusion model architecture capable of generating human-human motion interactions using general interaction descriptions to model the inter-personal dynamics and specific individual descriptions to model the intra-personal dynamics. Furthermore, we propose DualMDM, a motion composition method that is able to combine predictions made by an interaction model and by a single-person motion prior, thus increasing the intra-personal diversity of human motion interactions.

1 Introduction

Human Motion Generation refers to creating synthetic human movements that closely mimic those performed by actual individuals. This field has experienced significant advancements alongside the general progress in generative AI over recent years [56]. However, unlike other areas of generative AI, such as image and text generation, annotated motion datasets are scarce due to the need for expensive recording setups and actors. Controlling the generation of a motion based on a given condition is extremely important for applications such as video games or robotics. We can find many different condition types such as actions [16, 32, 41, 12], audio [43, 27, 55, 47], or natural text [18, 1, 40, 33, 19, 49, 54, 25, 34, 20, 26, 51, 41, 48, 12, 50]. In contrast to discrete conditioning means such as actions, utilizing text is advantageous due to its capacity to convey detailed descriptions of specific motions. Natural text allows for the specification of movements in different body parts, at varying velocities, and within diverse contexts or emotional states. Recent advancements with Large Language Models (LLMs) have underscored the potency of text as a versatile tool across various applications [14, 10, 42, 53].

Generating realistic individual human motion conditioned on a textual description is a very challenging task due to the complexity of the intra-personal dynamics as well as the difficulty of aligning a textual description with a specific motion. Additionally, motion is rarely performed in isolation in the real world. As an intelligent species, we adapt our motions depending on several factors, such as the environment and other individuals that we might interact with [13, 5]. Modeling such interactions is extremely difficult due to the intricacy of inter-personal dynamics [21, 6, 57]. More specifically, a single person might behave in many different ways under the same interaction. This individual diversity can arise from variations in the joint trajectories, velocities, or even the action semantics. For example, two people can salute each other by waving the left or the right hand, slowly or quickly, or even by bowing instead. Controlling such intra-personal dynamics when generating human-human interactions is an important and underexplored capability.

Available annotated interaction datasets such as InterHuman [28] contain a significant amount of annotated interactions. However, none of them [28, 36, 39] provides enough individual diversity or detailed textual descriptions of the individual motions of the interaction. As a consequence, recent human-human interaction generation methods [36, 28, 39, 11] tend to replicate the interactions from the training datasets, show limited diversity in the individual motions that compose the interactions, and lack individual control capabilities. To address all these problems, we could scale up by collecting bigger and more diverse datasets. This work, instead, proposes a new methodology that effectively exploits the individual diversity already present in the available datasets to improve the performance and control when generating human-human interactions. More particularly, our main contributions are:

  • We propose in2IN, a novel diffusion model architecture that is not only conditioned on the overall interaction description but also on the descriptions of the individual motion performed by each interactant, as illustrated in Fig. 1. To do so, we extend the InterHuman dataset [28] with LLM-generated textual descriptions of the individual human motions involved in the interaction. Our approach allows for a more precise interaction generation and achieves state-of-the-art results on InterHuman.

  • We introduce a diffusion conditioning technique based on Classifier-Free Guidance (CFG) [22] that allows the importance of each condition to be weighted independently during the interaction generation. This enables greater control over the influence of individual and interaction descriptions on the sampling process.

  • We propose DualMDM, a new motion composition technique to further increase the individual diversity and control. By combining our in2IN interaction model with a single-person (individual) motion prior, we generate interactions with more diverse intra-personal dynamics.

2 Related Work

2.1 Text-Driven Human Motion Generation

A literature review [56] reveals significant progress in this domain over the past two years, with a plethora of explored approaches. The first works proposed to align the text and motion latent spaces using the Kullback-Leibler divergence loss [18, 1, 40, 33]. A decoder is trained to convert the text latents into the corresponding motion. The main limitation of these approaches is that the scarcity of motion data might lead to latent space misalignments and therefore semantic mismatches between the text and the generated motion.

Based on the recent success of auto-regressive approaches in domains like language, with the advent of LLMs [14, 10, 42, 53] powered by Transformers [44], new approaches have emerged in the motion field [19, 49, 54, 25]. In these, motions are tokenized into discrete codes from a learned codebook, and a Transformer architecture is used to convert text tokens into motion tokens in an autoregressive manner. While these approaches generate more realistic motions, they have some downsides. Firstly, while tokenizing text is a relatively simple task, tokenizing motion is not straightforward because motion has no clear individual logical units comparable to the words or lemmas of text. Additionally, auto-regressive models cannot model bi-directional dependencies. MMM [34] and MoMask [20] address this limitation using masked attention in the style of BERT [14].

Diffusion Models [37, 23] have emerged as the best option for many generative tasks [46], also achieving excellent results in the text-to-motion field. FLAME [26] and MotionDiffusion [51] employ a traditional diffusion model with a Transformer as the noise predictor, achieving state-of-the-art results. Instead of predicting the noise, MDM [41] predicts the fully denoised motion at each step. This strategy, typically called $x_0$ reparametrization [45, 7], enables the use of kinematic loss functions, leading to better human motion generation. Other methods propose incorporating physical constraints into the diffusion process [48], using latent diffusion models for speeding up the sampling [12], or leveraging retrieval-based methods [50]. Although the sequential multi-step nature of diffusion models during inference makes them very slow, it also empowers them to generate very realistic samples with high diversity [15] and fine-grained control capabilities. As a result, diffusion models are very powerful for human interaction generation.

2.2 Text-Driven Human Interaction Generation

ComMDM [36] extends MDM’s capabilities to generate multi-human interactions. ComMDM is a cross-attention module integrated into specific layers of the denoisers in two frozen MDM models. This module processes the activations from the two models and adjusts them to foster interaction. In [39], a similar concept is employed but this time with two distinct models. Interaction modeling is achieved through a shared cross-attention module that connects both models, an architecture particularly suited for asymmetric interactions involving an actor and a receiver. However, they observed that their method overfitted to the training dataset due to the lack of annotated interaction datasets. Recently, InterHuman [28] was released, becoming the most extensive annotated dataset of human interactions to date. The authors also propose a baseline method called InterGen, which is based on two cooperative denoisers with shared weights. Finally, MoMat-MoGen [11] extends the retrieval diffusion model proposed in [50] and adapts it to human interactions, becoming the current state of the art on InterHuman. In contrast to the previous approaches, we propose a diffusion model (in2IN) that conditions the generation on both the general interaction description and a fine-grained description providing more details on the action performed by each individual involved in the interaction. This results in a model that generates adequate inter-personal dynamics and, at the same time, enables precise control over the intra-personal dynamics.

2.3 Human Motion Composition

The iterative paradigm underlying diffusion models provides them the capability to combine data, such as multiple images or motions, in a harmonized way [4, 52]. In the realm of motion, the literature has traditionally differentiated between temporal and spatial composition. Temporal composition refers to combining multiple individual motions into a larger sequence [2, 36, 8], making smooth and realistic transitions among them emerge. On the other hand, spatial composition refers to combining multiple motions to generate a new motion of the same length that combines certain elements of the original motions, such as the actions, the trajectory, or joint-specific movements [40, 3]. However, they all share the same limitation: they apply to single-person motion composition. In a broader sense, [36] proposed a generic model composition technique to combine the sampling processes of two different diffusion models, thus generating a harmonized motion. However, they used a fixed score-merging technique along the whole denoising process, which we prove is a suboptimal strategy in more complex scenarios like ours. Instead, we propose a novel model composition technique (DualMDM) that can combine 1) individual motions generated with a prior pre-trained on a single-person motion dataset, and 2) interactive motions generated by any human-human interaction model. Our interactions show more diverse intra-personal dynamics while preserving the inter-personal coherence.

Figure 2: in2IN diffusion model. Our proposed architecture consists of a Siamese Transformer that generates the denoised motion of each individual in the interaction ($x^0_a$ and $x^0_b$). First, a self-attention layer models the intra-personal dependencies using the encoded individual condition and noisy motion of each person ($x^t_a$ and $x^t_b$). Then, a cross-attention module models the inter-personal dynamics using the encoded interaction description, the self-attention output, and the noisy motion from the other interacting person.

3 Method

In this section, we introduce our main contributions. First, in Sec. 3.1, we describe in2IN, our proposed interaction-aware diffusion model conditioned on both the interaction and the individual textual descriptions. Then, we introduce the multi-weight CFG technique, which increases the user control over the influence that each condition has over the generation process. Finally, in Sec. 3.2, we discuss how our second contribution, DualMDM, can increase the control and diversity of the intra-personal dynamics generated by pre-trained interaction models such as in2IN.

3.1 in2IN: Interaction diffusion model

The architecture of our interaction diffusion model (in2IN) is founded on the principle that interactions between two persons exhibit a commutative property [28], denoted as $\{x_a, x_b\}$, which is considered to be equivalent to $\{x_b, x_a\}$. Building on this concept, we introduce a Transformer-based diffusion model in a Siamese configuration [9]. Two copies of the diffusion model are made, sharing parameters. Each network is responsible for processing its respective noisy motion inputs, $\mathbf{x}_a^t$ and $\mathbf{x}_b^t$, and aims to produce the denoised versions, $\mathbf{x}^0_a$ and $\mathbf{x}^0_b$, respectively. We predict $x_0$ directly [45, 7] as this allows us to use kinematic losses. Once the losses have been calculated, the motions of both interactants, generated by each one of the copies of the model, are noised back to $x^{t-1}$ to become the inputs of the next denoising iteration.

Similarly to [28, 39], our diffusion model architecture (Fig. 2) has a multi-head self-attention module that learns the intra-personal dynamics of the motion, and a multi-head cross-attention module that combines the self-attention output with the motion of the other individual in the interaction, thus modeling the inter-personal dynamics. We also condition the generation with adaptive normalization layers [30]. However, in contrast to previous approaches, we introduce different conditions for the different attention modules. For the self-attention module, where only the individual motion matters, we provide the specific textual description of the individual motion as conditioning. On the other hand, in the cross-attention module, where the whole interaction is important, we provide the interaction textual description as conditioning. More specifically, on one of the copies of the model, $\mathbf{x}_a^t$ and its individual description are fed together into the self-attention module, while $\mathbf{x}_b^t$ is injected into the cross-attention one. Conversely, the other copy uses $\mathbf{x}_b^t$ for the self-attention and $\mathbf{x}_a^t$ for the cross-attention. This allows for a more precise control of the intra- and inter-personal dynamics.
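
The routing of inputs between the two weight-sharing copies can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `siamese_denoise`, `toy_denoiser`, and all argument names are hypothetical, and the toy function only stands in for the Transformer block of Fig. 2 (scalars replace motion tensors).

```python
def siamese_denoise(denoiser, x_a_t, x_b_t, t, c_ind_a, c_ind_b, c_inter):
    """Run one shared denoiser twice, swapping the roles of the two persons.

    `denoiser(own, other, t, c_individual, c_interaction)` is a hypothetical
    callable: `own` and `c_individual` would feed the self-attention module,
    while `other` and `c_interaction` would feed the cross-attention module.
    """
    x_a_0 = denoiser(x_a_t, x_b_t, t, c_ind_a, c_inter)
    x_b_0 = denoiser(x_b_t, x_a_t, t, c_ind_b, c_inter)
    return x_a_0, x_b_0


def toy_denoiser(own, other, t, c_individual, c_interaction):
    # Stand-in for the real network: any deterministic function works here.
    return 0.5 * own + 0.1 * other + c_individual + c_interaction


# Swapping the two persons (and their individual conditions) swaps the
# outputs, mirroring the commutative property {x_a, x_b} = {x_b, x_a}.
out_ab = siamese_denoise(toy_denoiser, 1.0, 2.0, 10, 0.3, 0.7, 5.0)
out_ba = siamese_denoise(toy_denoiser, 2.0, 1.0, 10, 0.7, 0.3, 5.0)
```

Because both calls use the same function (shared parameters), the order in which the two persons are passed does not change what is predicted for each of them.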

Multi-Weight Classifier-Free Guidance. Our conditioning strategy for the diffusion model builds upon CFG, initially proposed by Ho et al. [22]. Generally, diffusion models have a significant dependency on CFG to generate high-quality samples. However, incorporating multiple conditions using CFG is not trivial. We address this by employing distinct weighting strategies for each condition. The equation representing our model’s sampling function, denoted as $G^{I}(x^t, t, c)$, is as follows:

$$
\begin{aligned}
G^{I}\left(x^t, t, c\right) &= G\left(x^t, t, \emptyset\right) \\
&+ w_c \cdot \left(G\left(x^t, t, c\right) - G\left(x^t, t, \emptyset\right)\right) \\
&+ w_I \cdot \left(G\left(x^t, t, c_{\text{I}}\right) - G\left(x^t, t, \emptyset\right)\right) \\
&+ w_i \cdot \left(G\left(x^t, t, c_{\text{i}}\right) - G\left(x^t, t, \emptyset\right)\right),
\end{aligned}
\tag{1}
$$

where $G(x^t, t, \emptyset)$ is the unconditional output of the model, and $G(x^t, t, c)$, $G(x^t, t, c_{\text{I}})$, and $G(x^t, t, c_{\text{i}})$ denote the model outputs conditioned on the whole conditioning $c = \{c_{\text{I}}, c_{\text{i}}\}$, only the interaction, and only the individual, respectively. The weights $w_c, w_I, w_i \in \mathbb{R}$ adjust the influence of each conditioned output relative to the unconditional baseline. A notable limitation of this approach is the necessity to perform quadruple sampling from the denoiser, as opposed to the double sampling required in a conventional CFG methodology. In exchange, this method allows for more refined control over the generation process, ensuring that the model can effectively capture and express the nuances of both individual and interaction-specific conditions. If a weight is set to 0, then that particular conditioning is ignored during the generation process.
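
The multi-weight guidance of Eq. (1) can be sketched in a few lines. The snippet below is an illustrative sketch rather than the authors' code: `multi_weight_cfg` and `toy_denoiser` are hypothetical names, passing `None` to drop a condition is an assumed convention, and scalars stand in for motion tensors (the same arithmetic applies to any array type supporting `+`, `-`, and `*`).

```python
def multi_weight_cfg(denoiser, x_t, t, c_inter, c_indiv, w_c, w_I, w_i):
    """Combine four denoiser outputs as in Eq. (1).

    `denoiser(x_t, t, c_inter, c_indiv)` is a hypothetical callable in which
    passing None for a condition means that condition is dropped.
    """
    uncond = denoiser(x_t, t, None, None)         # no condition
    full = denoiser(x_t, t, c_inter, c_indiv)     # whole conditioning c
    inter_only = denoiser(x_t, t, c_inter, None)  # interaction only
    indiv_only = denoiser(x_t, t, None, c_indiv)  # individual only
    return (uncond
            + w_c * (full - uncond)
            + w_I * (inter_only - uncond)
            + w_i * (indiv_only - uncond))


def toy_denoiser(x_t, t, c_inter, c_indiv):
    # Stand-in network on scalars; a real model would return motion tensors.
    out = 0.9 * x_t
    if c_inter is not None:
        out += c_inter
    if c_indiv is not None:
        out += 2.0 * c_indiv
    return out


# Four forward passes per sampling step, one per term of Eq. (1).
guided = multi_weight_cfg(toy_denoiser, 1.0, 5, 1.0, 1.0,
                          w_c=2.0, w_I=0.5, w_i=0.5)
```

Setting $w_I = w_i = 0$ leaves only the unconditional term and the $w_c$ term, which is the conventional single-weight CFG formula.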

3.2 DualMDM: Model composition

In our second contribution, we propose a motion model composition technique that allows us to combine interactions generated by an interaction model with the motions generated by an individual motion prior trained with a single-person motion dataset. This method uses a single-person human motion prior to provide the generated human-human interactions with a higher diversity of intra-personal dynamics. This model composition technique is built on top of the method proposed in DiffusionBlending [36]:

$$
\begin{aligned}
G^{a,b}(x^t, t, c_a, c_b) &= G^{a}(x^t, t, c_a) \\
&+ w \cdot \left(G^{b}(x^t, t, c_b) - G^{a}(x^t, t, c_a)\right),
\end{aligned}
\tag{2}
$$

where $w \in \mathbb{R}$ is the blending weight, and $G^{a}(x^t, t, c_a)$ and $G^{b}(x^t, t, c_b)$ are the outputs of the diffusion models $a$ and $b$, respectively. Since the original proposal was made to combine single-person diffusion models, we adapt the previous formula to our scenario:

$$
\begin{aligned}
G^{\text{I},\text{i}}(x^t, t, c) &= G^{\text{I}}(x^t, t, c) \\
&+ w \cdot \left(G^{\text{i}}(x^t, t, c_{\text{i}}) - G^{\text{I}}(x^t, t, c)\right),
\end{aligned}
\tag{3}
$$

Figure 3: Different weight schedulers tested for DualMDM: Exponential, Inverse Exponential, Constant, and Linear.

where $G^{\text{I}}(x^t, t, c)$ is the output of the interaction diffusion model and $G^{\text{i}}(x^t, t, c_{\text{i}})$ is the output of the individual motion prior. By choosing $w$ to be constant, the authors of [36] assumed that the optimal blending weight is the same along the whole sampling process. However, in line with [24], we argue that the optimal blending weight might vary along the denoising chain, depending on the particularities of each scenario. To account for this, we propose to replace the constant $w$ with a weight scheduler $w(t)$ that parameterizes the blending weight used to combine the denoised motion from both models, making it variable along the sampling phase (Fig. 3). As a generalization of the DiffusionBlending technique, DualMDM is a more flexible and powerful strategy to combine two diffusion models.
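
A time-dependent blending weight can be sketched as below. The functional forms, the normalization $s = t/T$, the sharpness parameter `k`, and the choice that the weight grows with $s$ are all assumptions made for illustration; this section only names the four scheduler families and does not give their formulas. All function names are hypothetical, and scalars again stand in for motion tensors.

```python
import math


def constant_schedule(s, w_max=1.0):
    # Same weight at every step (the DiffusionBlending special case).
    return w_max


def linear_schedule(s, w_max=1.0):
    # s is the normalized timestep t / T in [0, 1] (assumed convention).
    return w_max * s


def exponential_schedule(s, w_max=1.0, k=5.0):
    # Stays close to 0 for small s and rises sharply towards s = 1.
    return w_max * math.expm1(k * s) / math.expm1(k)


def inverse_exponential_schedule(s, w_max=1.0, k=5.0):
    # Rises quickly for small s and saturates towards s = 1.
    return w_max * (1.0 - math.exp(-k * s)) / (1.0 - math.exp(-k))


def dual_mdm_blend(x0_inter, x0_indiv, w_t):
    """Eq. (3) with the constant w replaced by a scheduled value w(t)."""
    return x0_inter + w_t * (x0_indiv - x0_inter)


# Hypothetical use inside a sampling loop with T steps.
T = 1000
for t in (T, T // 2, 1):
    w_t = exponential_schedule(t / T)
    blended = dual_mdm_blend(x0_inter=0.2, x0_indiv=0.8, w_t=w_t)
```

With $w(t) = 0$ the blend returns the interaction model output unchanged, and with $w(t) = 1$ it returns the individual prior output, so the scheduler decides which model dominates at each stage of the denoising chain.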

4 Experimental Evaluation

Our experiments are conducted with the InterHuman [28] and HumanML3D [17] datasets. InterHuman is the largest annotated interaction dataset. There, each motion is represented as $x^i = \left[\mathbf{j}_g^p, \mathbf{j}_g^v, \mathbf{j}^r, \mathbf{c}^f\right]$, where $x^i$, the $i$-th motion timestep, encompasses joint positions and velocities $\mathbf{j}_g^p, \mathbf{j}_g^v \in \mathbb{R}^{3N_j}$ in the world frame, the $6D$ representation of local rotations $\mathbf{j}^r \in \mathbb{R}^{6N_j}$ in the root frame, and binary foot-ground contact features $\mathbf{c}^f \in \mathbb{R}^{4}$. $N_j$ is the number of joints; in our case, $N_j = 22$. As InterHuman does not provide textual descriptions of the individuals’ motions within an interaction, we have automatically generated them using LLMs.

InterHuman focuses on providing a wide range of interactions rather than individual diversity in its motions. For this reason, we have trained an individual motion prior with the HumanML3D dataset, which contains a much wider range of annotated individual motions. For compatibility purposes, we converted the HumanML3D motion representation to the one used in the InterHuman dataset. More details on the LLM-based generation of the individual descriptions and the implementation details of our individual motion prior can be found in the Supp. Material.

4.1 Evaluation Metrics

We use the evaluation metrics proposed in [17]. More concretely, R-Precision and Multimodal-Dist evaluate how semantically close the generated motions are to the input prompts. The FID score is used to measure the dissimilarity between the distributions of generated motions and the actual ground truth motions. Diversity is assessed to gauge the range of variation within the generated motion distribution, while MultiModality calculates the average variance for motions generated from a single text prompt. To compute these metrics, we need encoders that align the text and motion latent representations, which we borrow from [28].
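
To make the retrieval-style metric concrete, the following sketch computes a top-k R-Precision over a pool of paired text and motion embeddings. It is an illustration under assumptions, not the evaluation code of [17] or [28]: the function name, the use of Euclidean distance, and the way the candidate pool is formed are hypothetical, and the embeddings would in practice come from the pre-trained text and motion encoders.

```python
import math


def r_precision(text_embs, motion_embs, top_k=3):
    """Fraction of texts whose paired motion is among the top_k nearest motions.

    text_embs[i] and motion_embs[i] are assumed to be a matched pair, given as
    equal-length sequences of floats; Euclidean distance is an assumption.
    """
    hits = 0
    for i, text in enumerate(text_embs):
        dists = [math.dist(text, motion) for motion in motion_embs]
        ranking = sorted(range(len(motion_embs)), key=lambda j: dists[j])
        if i in ranking[:top_k]:
            hits += 1
    return hits / len(text_embs)


# Toy embeddings: the third motion is far from its paired text.
toy_texts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_motions = [[0.1, 0.0], [0.9, 0.1], [2.0, 2.0]]
top1 = r_precision(toy_texts, toy_motions, top_k=1)
top3 = r_precision(toy_texts, toy_motions, top_k=3)
```

A generated motion that does not match its prompt moves away from the prompt in the shared latent space, so its rank drops and the score decreases.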

None of the previous evaluation metrics assesses the alignment of the generated interactions with the individual descriptions. Due to the lack of ground-truth individual annotations, we cannot train single-person motion and text encoders for InterHuman. Therefore, we cannot reliably assess the individual alignment using R-Precision, Multimodal-Dist, or FID. Still, such metrics are not only sensitive to the global quality of the interaction, but also to the coherence of the intra-personal dynamics within the interaction context. In other words, they are sensitive to wrong intra-personal dynamics during an interaction. For instance, if an interactant is kicking a ball during a "salute to each other" interaction, the generated motion is not coherent and will have low R-Precision. However, these metrics cannot assess the influence of using distinct individual descriptions on the generation of varied intra-personal dynamics. For example, an interaction generated with {$c_{\text{I}}$ = "salute to each other", $c_{\text{i}_1} = c_{\text{i}_2}$ = "wave right hand"} will be different from the one generated with the same set but with $c_{\text{i}_2}$ = "bows forward" instead. However, such a difference might come from: 1) the intrinsic diversity of the generative model, quantified by the MultiModality metric (i.e., different ways of waving with the right hand, and not bowing at all); or 2) the extrinsic diversity caused by differences in the individual descriptions used, thus showing control capabilities over the generated intra-personal dynamics. Thus, to quantify the latter, we introduce a new evaluation metric called Extrinsic Individual Diversity (EID).

Extrinsic Individual Diversity (EID). In order to assess the extrinsic diversity of the model, we need to disentangle it from the intrinsic one. To do so, we generate two empirical distributions that will serve as a proxy for quantifying the intrinsic diversity of 1) the ground-truth scenario, and 2) a synthetic scenario where the individual descriptions are randomly changed. In particular, for every set of interaction and individual descriptions $\{c_{\text{I}}, c_{\text{i}_1}, c_{\text{i}_2}\}$ in the dataset, we proceed as follows: 1) we build $D_{\text{GT}}$ as the set of $N$ motions generated with $\{c_{\text{I}}, c_{\text{i}_1}, c_{\text{i}_2}\}$, and 2) we build $D_{\text{rand}}$ as the set of $N$ motions generated by randomly replacing $c_{\text{i}_1}$ and $c_{\text{i}_2}$ with other individual descriptions from the dataset.
Then, we define the EID as the Wasserstein distance between $D_{\text{GT}}$ and $D_{\text{rand}}$. A higher distance means more influence of the individual descriptions on the diversity of the generated motions, arguably leading to higher control on the intra-personal dynamics of the interaction. This metric can be combined with others such as the R-Precision and FID to analyze the trade-off between individual diversity and interaction quality and fidelity. In our experiments, we set $N = 32$. To quantify the additional extrinsic diversity provided by the DualMDM technique, we build $D_{\text{GT}}$ with in2IN and $D_{\text{rand}}$ with in2IN combined with the DualMDM.
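The EID computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes each generated motion has already been mapped to a feature vector, and it approximates both sets of motions by Gaussians with diagonal covariance, for which the 2-Wasserstein distance has a closed form. The feature values, the diagonal-Gaussian approximation, and all function names are hypothetical, since the text does not specify the estimator.

```python
import math
import random


def mean_and_std(features):
    """Per-dimension mean and population standard deviation of a list of vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[k] for f in features) / n for k in range(dim)]
    std = [math.sqrt(sum((f[k] - mean[k]) ** 2 for f in features) / n)
           for k in range(dim)]
    return mean, std


def gaussian_w2_diag(feats_a, feats_b):
    """2-Wasserstein distance between diagonal-Gaussian fits of two feature sets.

    For diagonal covariances: W2^2 = ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2.
    """
    mu_a, sd_a = mean_and_std(feats_a)
    mu_b, sd_b = mean_and_std(feats_b)
    squared = sum((ma - mb) ** 2 for ma, mb in zip(mu_a, mu_b))
    squared += sum((sa - sb) ** 2 for sa, sb in zip(sd_a, sd_b))
    return math.sqrt(squared)


# Stand-in features for one set of descriptions (random numbers, not real motions).
rng = random.Random(0)
N = 32  # number of generated motions per set, as in the experiments
d_gt = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(N)]
d_rand = [[rng.gauss(0.5, 1.5) for _ in range(4)] for _ in range(N)]

eid = gaussian_w2_diag(d_gt, d_rand)
print(f"EID (toy example): {eid:.3f}")
```

Under this approximation, a model that ignores the individual descriptions would produce nearly identical feature statistics for both sets and therefore a distance close to zero.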

4.2 Implementation Details

Our in2IN models consist of 8 consecutive multi-head attention layers with a latent dimension of 1024 and 8 heads. We utilize a frozen CLIP ViT-L/14 model [35] as our text encoder. We set the number of diffusion timesteps to 1,000 and employ a cosine noise schedule [31]. During inference, we use DDIM sampling [38] with $\eta = 0$ and 50 timesteps, and our proposed multi-weight CFG variation. To enable the latter, 10% of the CLIP embeddings are randomly set to zero during training. All models have been trained using the AdamW optimizer [29] with betas of $(0.9, 0.999)$, weight decay of $2 \cdot 10^{-5}$, maximum learning rate of $10^{-4}$, and a cosine learning rate schedule with an initial 10-epoch linear warm-up period. They have been trained using the L2 loss and, thanks to the use of the $x_0$ parameterization, kinematic losses have also been used. These include the foot contact and the velocity losses from the MDM framework [41], and the bone length, the masked joint distance map, and the relative orientation losses suggested in InterGen [28]. Additionally, we have used the kinematic loss scheduler from InterGen. All models have been trained for 2,000 epochs with a batch size of 64 and 16-bit mixed precision, taking 5 days using two Nvidia 3090 GPUs.
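For concreteness, the learning rate schedule described above could be sketched as below. The function name, the per-epoch granularity, and the assumption that the cosine phase decays to zero are ours; the text only specifies the maximum learning rate, the cosine shape, the 10-epoch linear warm-up, and the 2,000 training epochs.

```python
import math

MAX_LR = 1e-4         # maximum learning rate from the text
WARMUP_EPOCHS = 10    # linear warm-up length from the text
TOTAL_EPOCHS = 2000   # training length from the text


def learning_rate(epoch, max_lr=MAX_LR, warmup=WARMUP_EPOCHS,
                  total=TOTAL_EPOCHS, min_lr=0.0):
    """Linear warm-up to max_lr, then cosine decay to min_lr.

    min_lr = 0 and the per-epoch update are assumptions of this sketch.
    """
    if epoch < warmup:
        # Ramp linearly so that the last warm-up epoch reaches max_lr.
        return max_lr * (epoch + 1) / warmup
    progress = (epoch - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 9, 10, 1005, 1999):
    print(epoch, f"{learning_rate(epoch):.2e}")
```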

DualMDM schedulers. We test these functions:

\begin{equation}
\begin{aligned}
\text{constant:} \quad & w(t) = \lambda, \\
\text{linear:} \quad & w(t) = t/T, \\
\text{exponential:} \quad & w(t) = e^{-\lambda \cdot (T - t)}, \\
\text{inverse exponential:} \quad & w(t) = 1 - e^{-\lambda \cdot (T - t)},
\end{aligned}
\tag{4}
\end{equation}

where $t$ is the current denoising step, $T$ the total number of denoising steps, and $\lambda$ the parameter that determines the slope of our scheduler function. We visualize them in Fig. 3.
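The four weight schedulers can be written directly from the definitions above. This is a minimal sketch with hypothetical function names; it assumes, as is conventional for diffusion sampling, that $t$ runs from $T$ down to 0 during denoising. The values $T = 1000$ and $\lambda = 0.00875$ in the usage lines are illustrative choices taken from elsewhere in this section, not prescribed settings.

```python
import math


def constant(t, T, lam):
    """w(t) = lambda, independent of the denoising step."""
    return lam


def linear(t, T, lam=None):
    """w(t) = t / T; lambda is unused and kept only for a uniform signature."""
    return t / T


def exponential(t, T, lam):
    """w(t) = exp(-lambda * (T - t)); equals 1 at t = T and decays as t decreases."""
    return math.exp(-lam * (T - t))


def inverse_exponential(t, T, lam):
    """w(t) = 1 - exp(-lambda * (T - t)); equals 0 at t = T and grows as t decreases."""
    return 1.0 - math.exp(-lam * (T - t))


T = 1000        # assumed number of denoising steps for this example
LAM = 0.00875   # illustrative slope parameter

for t in (T, T // 2, 0):
    print(t, round(exponential(t, T, LAM), 4), round(inverse_exponential(t, T, LAM), 4))
```

With $t$ decreasing from $T$, the exponential scheduler starts at 1 and decays towards 0, whereas the inverse exponential scheduler does the opposite.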

4.3 Quantitative Analysis

| Methods | R-Precision Top 1 $\uparrow$ | R-Precision Top 2 $\uparrow$ | R-Precision Top 3 $\uparrow$ | FID $\downarrow$ | MM Dist $\downarrow$ | Diversity $\rightarrow$ | MModality $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ground Truth | $0.452^{\pm .008}$ | $0.610^{\pm .009}$ | $0.701^{\pm .008}$ | $0.273^{\pm .007}$ | $3.755^{\pm .008}$ | $7.948^{\pm .064}$ | - |
| TEMOS [33] | $0.224^{\pm .010}$ | $0.316^{\pm .013}$ | $0.450^{\pm .018}$ | $17.375^{\pm .043}$ | $6.342^{\pm .015}$ | $6.939^{\pm .071}$ | $0.535^{\pm .014}$ |
| T2M [17] | $0.238^{\pm .012}$ | $0.325^{\pm .010}$ | $0.464^{\pm .014}$ | $13.769^{\pm .072}$ | $5.731^{\pm .013}$ | $7.046^{\pm .022}$ | $1.387^{\pm .076}$ |
| MDM [41] | $0.153^{\pm .012}$ | $0.260^{\pm .009}$ | $0.339^{\pm .012}$ | $9.167^{\pm .056}$ | $7.125^{\pm .018}$ | $7.602^{\pm .045}$ | $\mathbf{2.35}^{\pm .080}$ |
| ComMDM [36] | $0.223^{\pm .009}$ | $0.334^{\pm .008}$ | $0.466^{\pm .010}$ | $7.069^{\pm .054}$ | $6.212^{\pm .021}$ | $7.244^{\pm .038}$ | $1.822^{\pm .052}$ |
| InterGen [28] | $0.371^{\pm .010}$ | $0.515^{\pm .012}$ | $0.624^{\pm .010}$ | $5.918^{\pm .079}$ | $5.108^{\pm .014}$ | $7.387^{\pm .029}$ | $\underline{2.141}^{\pm .063}$ |
| MoMat-MoGen [11] | $\underline{0.449}^{\pm .004}$ | $\underline{0.591}^{\pm .003}$ | $\underline{0.666}^{\pm .004}$ | $5.674^{\pm .085}$ | $\mathbf{3.790}^{\pm .001}$ | $8.021^{\pm .035}$ | $1.295^{\pm .023}$ |
| in2IN* | $0.425^{\pm 0.008}$ | $0.576^{\pm 0.008}$ | $0.662^{\pm 0.009}$ | $\underline{5.535}^{\pm 0.120}$ | $\underline{3.803}^{\pm 0.002}$ | $\mathbf{7.953}^{\pm 0.047}$ | $1.215^{\pm 0.023}$ |
| in2IN | $\mathbf{0.455}^{\pm 0.004}$ | $\mathbf{0.611}^{\pm 0.005}$ | $\mathbf{0.687}^{\pm 0.005}$ | $\mathbf{5.177}^{\pm 0.103}$ | $\mathbf{3.790}^{\pm 0.002}$ | $\underline{7.940}^{\pm 0.030}$ | $1.061^{\pm 0.038}$ |
Table 1: Comparison of our model (in2IN) to the state of the art in human-human interaction motion generation on the InterHuman dataset. *in2IN model only using $w_I$ (conditioning only on the interaction during sampling). All evaluations have been executed 10 times to account for the randomness of the generation; $\pm$ indicates the 95% confidence interval. We highlight the best (bold) and the second best (underlined) results.
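A common way to obtain such an interval from repeated evaluations is the normal approximation, mean $\pm\ 1.96 \cdot s / \sqrt{n}$, where $s$ is the sample standard deviation over the $n$ runs. The sketch below assumes this estimator and uses made-up per-run values; the caption does not state how the interval was computed, and the function name is hypothetical.

```python
import math
import statistics


def mean_and_ci95(values):
    """Mean and half-width of a 95% confidence interval (normal approximation)."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return mean, 1.96 * std / math.sqrt(n)


# Hypothetical FID values from 10 repeated evaluations (not real results).
runs = [5.10, 5.25, 5.05, 5.30, 5.18, 5.22, 5.12, 5.08, 5.27, 5.20]
mean, half_width = mean_and_ci95(runs)
print(f"{mean:.3f} +/- {half_width:.3f}")
```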

4.3.1 in2IN: Interaction Generation

Tab. 1 shows the quantitative evaluation of our in2IN architecture against the previously existing methods evaluated on the InterHuman dataset. By using individual information, in2IN obtains better R-Precision and FID than all previous methods, and matches the best MM Dist. As might reasonably be anticipated, the additional information used only by in2IN, in the form of LLM-generated individual descriptions, narrows the spectrum of valid motions fulfilling the interaction description, which is reflected in a lower MultiModality.

With respect to the Multi-Weight CFG, we evaluate the isolated effect of each weight on the evaluation metrics in Fig. 4. As can be observed, for weights $w_c$ and $w_I$, 4 is the best individual value. For weight $w_i$, on the other hand, 2 is the best value, and larger values decrease performance. We find the best combination with a grid search on a validation subset: $w_c = 3$, $w_I = 3$, and $w_i = 1$.

Figure 4: R-Precision and FID for the different weights on the Multi-Weight CFG tested in isolation. Each column ablates a different weight ($w_c$, $w_I$, $w_i$). $w_c$ has been ablated with $w_I = w_i = 0$. $w_I$ and $w_i$ with $w_c = 1$, and the other weight set to 0.
| Scheduler | $\lambda$ | R-Precision (Top-3) $\uparrow$ | FID $\downarrow$ | EID $\uparrow$ |
| --- | --- | --- | --- | --- |
| Constant [36] | 0.00 | $\mathbf{0.687}^{\pm .005}$ | $\mathbf{5.177}^{\pm .103}$ | $1.238^{\pm .011}$ |
| Constant [36] | 0.25 | $0.577^{\pm .004}$ | $33.75^{\pm .293}$ | $1.516^{\pm .005}$ |
| Constant [36] | 0.50 | $0.383^{\pm .006}$ | $91.99^{\pm .000}$ | $1.972^{\pm .018}$ |
| Constant [36] | 0.75 | $0.218^{\pm .016}$ | $127.8^{\pm .691}$ | $\mathbf{2.188}^{\pm .010}$ |
| Constant [36] | 1.00 | $0.094^{\pm .004}$ | $130.4^{\pm .226}$ | $2.118^{\pm .010}$ |
| Exponential | 0.0100 | $\mathbf{0.589}^{\pm .006}$ | $\mathbf{19.76}^{\pm .232}$ | $1.461^{\pm .007}$ |
| Exponential | 0.00875 | $0.574^{\pm .003}$ | $22.86^{\pm .190}$ | $1.492^{\pm .006}$ |
| Exponential | 0.0075 | $0.565^{\pm .007}$ | $26.20^{\pm .129}$ | $1.534^{\pm .013}$ |
| Exponential | 0.00625 | $0.530^{\pm .013}$ | $31.23^{\pm .211}$ | $1.596^{\pm .009}$ |
| Exponential | 0.0050 | $0.500^{\pm .007}$ | $39.36^{\pm .301}$ | $\mathbf{1.680}^{\pm .004}$ |
| Inverse Exponential | 0.0100 | $0.232^{\pm .006}$ | $114.3^{\pm .433}$ | $\mathbf{2.140}^{\pm .013}$ |
| Inverse Exponential | 0.0075 | $0.251^{\pm .004}$ | $111.1^{\pm .316}$ | $2.115^{\pm .008}$ |
| Inverse Exponential | 0.0050 | $\mathbf{0.282}^{\pm .006}$ | $\mathbf{106.8}^{\pm .386}$ | $2.088^{\pm .009}$ |
| Linear | - | $\mathbf{0.235}^{\pm .005}$ | $\mathbf{98.27}^{\pm .528}$ | $\mathbf{2.118}^{\pm .010}$ |
Table 2: EID and interaction metrics of different weight schedulers: Exponential, Inverse Exponential, Constant [36], and Linear. Bold represents the best value for each scheduler.

4.3.2 DualMDM: Individual Diversity

In Tab. 2, the EID metric is compared with the R-Precision and FID for different DualMDM schedulers. In general, we can observe that, for all schedulers, the settings that assign more weight to the individual model obtain higher individual diversity in exchange for a lower interaction quality. The constant scheduler with $\lambda = 0$ uses only the interaction model, representing an upper bound in terms of interaction quality (R-Precision and FID), and a lower bound for the individual diversity (EID). While a constant scheduler with $\lambda = 0.25$ seems to achieve good quantitative values, we can observe that the exponential weight scheduler with $\lambda = 0.00875$ provides a better trade-off between individual diversity and interaction quality: it reaches a similar R-Precision (0.574 vs. 0.577) and EID (1.492 vs. 1.516) with a much lower FID (22.86 vs. 33.75). This is fundamental, as we want to have high intra-personal diversity while keeping the inter-personal coherence. We hypothesize that the good trade-off achieved by the exponential scheduler is due to the fact that the intra-relationships of the motion (provided by the individual motion prior) are much more important during the early stages of denoising. However, as the sampling advances, the inter-relationships of the motion interaction become more relevant. Also, when the individual model is used during the later stages of denoising, it deteriorates the denoised inter-personal dynamics. On the contrary, if the weight on this individual prior is gradually reduced, the interaction model is able to recover these dynamics in the later stages of the denoising. In Sec. 4.4, we validate these hypotheses by means of a qualitative analysis.

4.4 Qualitative Analysis

As depicted in Fig. 5 and Fig. 6, our in2IN model can generate more realistic interactions aligned with the textual description. Upon qualitative evaluation, our model consistently outperforms InterGen across various scenarios. Fig. 7 illustrates the effect of the different weighting strategies for our DualMDM motion composition method. It can be observed how the exponential scheduler provides more coherent results, preserving the interaction semantics while generating individual motions that match the individual descriptions, yielding superior fine-grained control. While a constant scheduler might quantitatively provide decent results, the qualitative evaluation demonstrates the superiority of the exponential scheduler. For the constant schedulers, we notice that increasing the weight assigned to the individual prior leads to a degradation of the inter-personal dynamics, particularly concerning trajectories and orientations. As a limitation of the exponential scheduler, we can observe that the $\lambda$ value selected for each case is critical and might not be the same for all compositions. The selection of this value will depend on the specific characteristics of the interaction and individual motions that we want to combine. More visualizations are included in the Supp. Material.

Figure 5: Interaction Description: The two guys meet, grip each other’s hand, and nod in agreement. The X-axis represents time.
Figure 6: Interaction Description: One person spots the other person on the street, lifts the right hand to greet, and the other person glances towards one person. The X-axis represents time.
Figure 7: Interaction Description: Two persons are in an intense boxing match. Individual Description #1: An individual throws a kick with his right leg. Individual Description #2: An individual is boxing. The X-axis represents time.

5 Conclusion

We presented in2IN, an interaction diffusion model that leverages both interaction and individual textual descriptions to generate better inter- and intra-personal dynamics in human-human motion interaction generation. With a more precise conditioning, in2IN achieves state-of-the-art results on the InterHuman dataset. We also introduced DualMDM, a motion model composition technique that injects the single-person dynamics learned by a pre-trained individual motion prior into the generated interactions. As a result, combining in2IN with DualMDM provides better control over the intra-personal dynamics of the interaction.

Limitations and Future work. LLM-generated individual descriptions might not faithfully match the individual motion. In future work, more complex techniques for generating individual descriptions will be tested. Additionally, one of our main reasons to propose DualMDM is that the optimal strategy for combining the outputs of the individual and the interaction models changes along the sampling process. However, we observed in Sec. 4.4 that these dynamics vary as well depending on the descriptions, or even on the stochasticity of the generation itself. Future work includes exploring better blending strategies for which the user does not need to define any scheduler parameter.

Acknowledgments This work has been partially supported by the Spanish project PID2022-136436NB-I00, the Spanish national grant for PhD studies, FPU22/04200, and by ICREA under the ICREA Academia programme. We thank HORIZON-MSCA-2021-SE-0 action number: 101086387, REMARKABLE, Rural Environmental Monitoring via ultra wide-ARea networKs And distriButed federated Learning and CIAICO/2022/132 Consolidated group project “AI4Health” funded by Valencian government.

References

  • Ahuja and Morency [2019] Chaitanya Ahuja and Louis-Philippe Morency. Language2pose: Natural language grounded pose forecasting. In 2019 International Conference on 3D Vision (3DV), pages 719–728. IEEE, 2019.
  • Athanasiou et al. [2022] Nikos Athanasiou, Mathis Petrovich, Michael J Black, and Gül Varol. Teach: Temporal action composition for 3d humans. In 2022 International Conference on 3D Vision (3DV), pages 414–423. IEEE, 2022.
  • Athanasiou et al. [2023] Nikos Athanasiou, Mathis Petrovich, Michael J Black, and Gül Varol. Sinc: Spatial composition of 3d human motions for simultaneous action generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9984–9995, 2023.
  • Bar-Tal et al. [2023] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. In International Conference on Machine Learning, pages 1737–1752. PMLR, 2023.
  • Barquero et al. [2022a] German Barquero, Johnny Núnez, Sergio Escalera, Zhen Xu, Wei-Wei Tu, Isabelle Guyon, and Cristina Palmero. Didn’t see that coming: a survey on non-verbal social human behavior forecasting. In Understanding Social Behavior in Dyadic and Small Group Interactions, pages 139–178. PMLR, 2022a.
  • Barquero et al. [2022b] German Barquero, Johnny Núnez, Zhen Xu, Sergio Escalera, Wei-Wei Tu, Isabelle Guyon, and Cristina Palmero. Comparison of spatio-temporal models for human motion and pose forecasting in face-to-face interaction scenarios. In Understanding Social Behavior in Dyadic and Small Group Interactions, pages 107–138. PMLR, 2022b.
  • Barquero et al. [2023] German Barquero, Sergio Escalera, and Cristina Palmero. Belfusion: Latent diffusion for behavior-driven human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2317–2327, 2023.
  • Barquero et al. [2024] German Barquero, Sergio Escalera, and Cristina Palmero. Seamless human motion composition with blended positional encodings. arXiv preprint arXiv:2402.15509, 2024.
  • Bromley et al. [1993] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems. Morgan-Kaufmann, 1993.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Cai et al. [2023] Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin, Xiangyu Fan, Han Du, Liang Pan, Peng Gao, Zhitao Yang, Yang Gao, Jiaqi Li, Tianxiang Ren, Yukun Wei, Xiaogang Wang, Chen Change Loy, Lei Yang, and Ziwei Liu. Digital life project: Autonomous 3d characters with social intelligence. arXiv preprint arXiv:2312.04547, 2023.
  • Chen et al. [2023] Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.
  • Curto et al. [2021] David Curto, Albert Clapes, Javier Selva, Sorina Smeureanu, Julio C.S. Jacques Junior, David Gallardo-Pujol, Georgina Guilera, David Leiva, Thomas B. Moeslund, Sergio Escalera, and Cristina Palmero. Dyadformer: A multi-modal transformer for long-range modeling of dyadic interactions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2177–2188, 2021.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
  • Gu et al. [2022] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
  • Guo et al. [2020] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia. ACM, 2020.
  • Guo et al. [2022a] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5152–5161, 2022a.
  • Guo et al. [2022b] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5152–5161, 2022b.
  • Guo et al. [2022c] Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts. In European Conference on Computer Vision, pages 580–597. Springer, 2022c.
  • Guo et al. [2023] Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. arXiv preprint arXiv:2312.00063, 2023.
  • Guo et al. [2022d] Wen Guo, Xiaoyu Bie, Xavier Alameda-Pineda, and Francesc Moreno-Noguer. Multi-person extreme motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13053–13064, 2022d.
  • Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Huang et al. [2023] Ziqi Huang, Kelvin CK Chan, Yuming Jiang, and Ziwei Liu. Collaborative diffusion for multi-modal face generation and editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6080–6090, 2023.
  • Jiang et al. [2024] Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems, 36, 2024.
  • Kim et al. [2023] Jihoon Kim, Jiseob Kim, and Sungjoon Choi. FLAME: Free-form Language-based Motion Synthesis & Editing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8255–8263, 2023.
  • Le et al. [2023] Nhat Le, Thang Pham, Tuong Do, Erman Tjiputra, Quang D Tran, and Anh Nguyen. Music-driven group choreography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8673–8682, 2023.
  • Liang et al. [2023] Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi-human motion generation under complex interactions. arXiv preprint arXiv:2304.05684, 2023.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  • Nichol et al. [2022] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2022.
  • Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021.
  • Petrovich et al. [2021] Mathis Petrovich, Michael J Black, and Gül Varol. Action-conditioned 3d human motion synthesis with transformer vae. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10985–10995, 2021.
  • Petrovich et al. [2022] Mathis Petrovich, Michael J Black, and Gül Varol. TEMOS: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision, pages 480–497. Springer, 2022.
  • Pinyoanuntapong et al. [2023] Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, and Chen Chen. Mmm: Generative masked motion model. arXiv preprint arXiv:2312.03596, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Shafir et al. [2023] Yonatan Shafir, Guy Tevet, Roy Kapon, and Amit H Bermano. Human Motion Diffusion as a Generative Prior. arXiv preprint arXiv:2303.01418, 2023.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
  • Tanaka and Fujiwara [2023] Mikihiro Tanaka and Kent Fujiwara. Role-Aware Interaction Generation from Textual Description. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15999–16009, 2023.
  • Tevet et al. [2022] Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. In European Conference on Computer Vision, pages 358–374. Springer, 2022.
  • Tevet et al. [2023] Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffusion model. In The Eleventh International Conference on Learning Representations, 2023.
  • Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Tseng et al. [2023] Jonathan Tseng, Rodrigo Castellon, and Karen Liu. Edge: Editable dance generation from music. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 448–458, 2023.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Xiao et al. [2022] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations (ICLR), 2022.
  • Yang et al. [2023] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
  • Yi et al. [2023] Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, and Michael J Black. Generating holistic 3d human motion from speech. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 469–480, 2023.
  • Yuan et al. [2023] Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz. Physdiff: Physics-guided human motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16010–16021, 2023.
  • Zhang et al. [2023a] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations. arXiv preprint arXiv:2301.06052, 2023a.
  • Zhang et al. [2023b] Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, and Ziwei Liu. Remodiffuse: Retrieval-augmented motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 364–373, 2023b.
  • Zhang et al. [2024] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • Zhang et al. [2023c] Qinsheng Zhang, Jiaming Song, Xun Huang, Yongxin Chen, and Ming-Yu Liu. Diffcollage: Parallel generation of large content with diffusion models. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10188–10198. IEEE, 2023c.
  • Zhao et al. [2023] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • Zhong et al. [2023] Chongyang Zhong, Lei Hu, Zihao Zhang, and Shihong Xia. Attt2m: Text-driven human motion generation with multi-perspective attention mechanism. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 509–519, 2023.
  • Zhu et al. [2023a] Lingting Zhu, Xian Liu, Xuanyu Liu, Rui Qian, Ziwei Liu, and Lequan Yu. Taming diffusion models for audio-driven co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10544–10553, 2023a.
  • Zhu et al. [2023b] Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang. Human motion generation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–20, 2023b.
  • Zhu et al. [2024] Wentao Zhu, Jason Qin, Yuke Lou, Hang Ye, Xiaoxuan Ma, Hai Ci, and Yizhou Wang. Social motion prediction with cognitive hierarchies. Advances in Neural Information Processing Systems, 36, 2024.

Supplementary Material

In this supplementary material, we provide additional information and examples that were omitted from the main paper for space reasons. This material supports reproducibility and a fuller understanding of the contributions presented there. Sec. A explains how we generated the individual descriptions for the InterHuman dataset. Sec. B describes the implementation details and results of the individual prior used in DualMDM. Finally, Sec. C presents additional examples from the qualitative evaluation of in2IN and DualMDM.

A Extended InterHuman dataset

The InterHuman dataset contains a large number of annotated human-human interactions. However, its textual descriptions do not focus on the specific motions performed by each participant in the interaction. Because in2IN requires these individual descriptions, we generated them with an LLM. Starting from the original interaction descriptions, we generate the individual ones using the following prompt:

Having the description of an interaction, extract descriptions for the motions of each individual.
-
Interaction Description: In an intense boxing match, one person attacks the opponent with a straight punch, and then the opponent falls over.
Individual Motion 1: One person is moving and then throws a punch.
Individual Motion 2: One person falls over and stays on the ground.
-
Interaction Description: <interaction motion description>

The LLM used for this task is gpt3.5_turbo from OpenAI, with a p_value of 1 and a temperature of 1.5. Generating these individual descriptions automatically with an LLM carries some risks, such as hallucinations or a mismatch between an individual description and the corresponding individual motion. However, as manually annotating this dataset is not feasible and its interactions are not very complex, we adopted this approach as a proof of concept. In future work, more sophisticated techniques for generating individual descriptions could be tested.
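The prompting step described above can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the helper names `build_prompt` and `parse_individual_motions` are hypothetical, the call to the LLM itself is deliberately left out, and the assumption that the model answers with "Individual Motion 1:" and "Individual Motion 2:" labels follows from the one-shot example in the prompt rather than from any stated guarantee.

```python
import re

# One-shot prompt copied from the supplementary text; only the final
# interaction description changes between requests.
PROMPT_TEMPLATE = (
    "Having the description of an interaction, extract descriptions "
    "for the motions of each individual.\n"
    "-\n"
    "Interaction Description: In an intense boxing match, one person attacks "
    "the opponent with a straight punch, and then the opponent falls over.\n"
    "Individual Motion 1: One person is moving and then throws a punch.\n"
    "Individual Motion 2: One person falls over and stays on the ground.\n"
    "-\n"
    "Interaction Description: {description}\n"
)

# Assumption: the reply reuses the labels from the one-shot example.
_MOTION_PATTERN = re.compile(
    r"Individual Motion ([12]):\s*(.+?)(?=\s*Individual Motion [12]:|\n|$)"
)


def build_prompt(description: str) -> str:
    """Fill the one-shot template with a new interaction description."""
    return PROMPT_TEMPLATE.format(description=description.strip())


def parse_individual_motions(reply: str):
    """Return (motion_1, motion_2), or None if the reply is malformed."""
    found = {}
    for index, text in _MOTION_PATTERN.findall(reply):
        # Keep the first occurrence of each label.
        found.setdefault(int(index), text.strip())
    if 1 in found and 2 in found:
        return found[1], found[2]
    return None
```

Returning `None` for malformed replies makes it straightforward to retry or flag samples, which is one simple way to limit the hallucination risk mentioned above.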

B Individual Motion Prior

The proposal in Sec. 3.2 requires an individual motion prior. As mentioned in Sec. 2.1, there are many existing approaches for single-human motion generation. However, none of them is trained with the motion representation used in our interaction model. For this reason, we propose a single-human baseline based on our in2IN architecture. It differs from the architecture proposed in Sec. 3.1 in that we removed the cross-attention modules and retained only the individual conditioning.

While this individual motion prior is in principle interchangeable with others, the selected prior must have been trained with the same motion representation as the interaction model and with the same training and sampling scheduler. We trained this individual prior on the HumanML3D dataset converted to the InterHuman format. This conversion is necessary because HumanML3D stores relative joint positions and velocities, whereas InterHuman stores these values in the world frame so that the global positions of the individuals in the interaction are properly represented. The remaining implementation details are the same as those described in Sec. 4.2 for the in2IN model, except that we trained the individual prior with only the L2 loss.
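To make the difference between the two representations concrete, the following is a deliberately simplified sketch of moving from root-relative joints to world-frame joints. It is not the conversion used for the paper: the function `to_world_frame` is hypothetical, the root velocity is assumed to be a plain per-frame displacement, and the root rotation (facing direction) that a real conversion would also have to apply is ignored.

```python
def to_world_frame(root_velocities, local_joints, start=(0.0, 0.0, 0.0)):
    """Convert root-relative joints to world-frame joints.

    root_velocities[t] is the (dx, dy, dz) displacement of the root from
    frame t to frame t + 1; local_joints[t] lists the (x, y, z) offsets of
    each joint relative to the root at frame t. Both conventions are
    assumptions made for this sketch.
    """
    if len(root_velocities) != len(local_joints) - 1:
        raise ValueError("expected one velocity per pair of consecutive frames")

    root = list(start)
    world = []
    for t, joints in enumerate(local_joints):
        if t > 0:
            # Integrate the root displacement to recover its global position.
            root = [r + v for r, v in zip(root, root_velocities[t - 1])]
        # Shift every root-relative joint by the global root position.
        world.append([tuple(r + j for r, j in zip(root, joint)) for joint in joints])
    return world
```

For example, a single joint held one unit above a root that moves one unit along X per frame ends up at X = 0, 1, 2 in world coordinates, which is the kind of global placement needed to position two individuals relative to each other.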

| Method | R Precision (top 3) $\uparrow$ | FID $\downarrow$ | MM Dist $\downarrow$ | Diversity $\rightarrow$ | Multimodality $\uparrow$ |
|---|---|---|---|---|---|
| Real | $0.797^{\pm.002}$ | $0.002^{\pm.000}$ | $2.974^{\pm.008}$ | $9.503^{\pm.065}$ | - |
| JL2P | $0.486^{\pm.002}$ | $11.02^{\pm.046}$ | $5.296^{\pm.008}$ | $7.676^{\pm.058}$ | - |
| Text2Gesture | $0.345^{\pm.002}$ | $7.664^{\pm.030}$ | $6.030^{\pm.008}$ | $6.409^{\pm.071}$ | - |
| T2M | $\mathbf{0.740}^{\pm.003}$ | $1.067^{\pm.002}$ | $\mathbf{3.340}^{\pm.008}$ | $9.188^{\pm.002}$ | $2.090^{\pm.083}$ |
| MDM | $0.707^{\pm.004}$ | $\mathbf{0.489}^{\pm.025}$ | $3.631^{\pm.023}$ | $\mathbf{9.449}^{\pm.066}$ | $\mathbf{2.873}^{\pm.111}$ |
| Individual Prior (Ours) | $0.6172^{\pm.005}$ | $5.0631^{\pm.150}$ | $4.2910^{\pm.026}$ | $7.8289^{\pm.082}$ | $0.4354^{\pm.024}$ |

Table A: Quantitative evaluation of our individual motion prior in comparison with other models. $\pm$ indicates the 95% confidence interval. The best results are highlighted in bold.
Figure A: Interaction Description: both they lift their right legs to kick one another. The X-axis represents time.
Figure B: Interaction Description: Two people greet each other by shaking hands. Individual Description #1: One person reaches out their hand to meet the other person’s hand, shaking it in a vertical motion. Individual Description #2: An individual jumps. The X-axis represents time.
Figure C: Interaction Description: the two individuals are dancing ballroom together. The X-axis represents time.
Figure D: Interaction Description: Two people salute to each other. Individual Description #1: An individual bows forward. Individual Description #2: An individual raises their right arm and waves it. The X-axis represents time.

B.1 Results Individual Generation

The evaluation results for the individual prior are shown in Tab. A. Although our model does not outperform the best models in this table, its results are adequate for use as a motion prior. Better architectures could have been used, but this is outside the scope of this paper, as the only objective was to obtain a reasonable motion prior that uses the InterHuman motion representation.
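The ± values in Tab. A are 95% confidence intervals. As a rough illustration of how such an interval can be obtained from repeated evaluation runs, the sketch below uses a normal approximation (the 1.96 factor). Both the approximation and the helper name `mean_and_ci95` are assumptions, since the number of repetitions and the exact procedure are not stated in this section.

```python
import math
import statistics


def mean_and_ci95(values):
    """Return (mean, half_width) of a 95% confidence interval.

    Assumes the per-run metric values are approximately normally
    distributed, so the half-width is 1.96 standard errors.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two runs are needed to estimate the spread")
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return mean, 1.96 * std / math.sqrt(n)
```

A metric would then be reported as `mean` with `half_width` as the ± term, matching the notation used in the table.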

C Additional Qualitative Evaluation

In addition to the examples shown in Sec. 4.4, we include further cases from the qualitative evaluation that illustrate the observations presented in the main paper. In Fig. A and Fig. C, the interactions generated by our in2IN architecture outperform those generated by InterGen. Additionally, Fig. B and Fig. D further corroborate the improvements of the exponential schedulers for DualMDM over the others. The same $\lambda$ can also be seen to behave differently across examples. As stated in the main paper, future work could explore better blending strategies without scheduler parameters.
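To give an intuition for why a scheduler parameter such as λ can behave differently from one example to another, the sketch below blends two predictions with an exponentially decaying weight. The exact DualMDM schedule is defined in the main paper and is not reproduced here: the functional form, the direction of the decay, and the names `exponential_weight` and `blend` are all illustrative assumptions.

```python
import math


def exponential_weight(step, total_steps, lam):
    """Hypothetical weight of the individual prior at a denoising step.

    `step` runs from 0 to total_steps - 1; the weight starts at 1 and
    decays to exp(-lam). This form is an assumption, not the paper's.
    """
    progress = step / max(total_steps - 1, 1)
    return math.exp(-lam * progress)


def blend(interaction_pred, individual_pred, w):
    """Linearly mix two predictions; w = 0 keeps only the interaction model."""
    return [(1.0 - w) * a + w * b for a, b in zip(interaction_pred, individual_pred)]
```

Under this assumed form, a larger λ hands control to the interaction model sooner, so the value that best balances individual diversity and inter-personal coherence may depend on how much the two descriptions agree for a given sample.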