Efficient Text-driven Motion Generation via Latent Consistency Training

Mengxian Hu, Tongji University
Minghao Zhu, Tongji University
Xun Zhou, Tongji University
Qingqing Yan, Tongji University
Shu Li, Tongji University
Chengju Liu, Tongji University
Qijun Chen, Tongji University
Abstract

Motion diffusion models have recently proven successful for text-driven human motion generation. Despite their excellent generation performance, they are difficult to run in real time because their multi-step sampling mechanism involves tens or hundreds of repeated function evaluations. To this end, we investigate Motion Latent Consistency Training (MLCT) for motion generation to reduce the computation and time consumed during iterative inference. It applies diffusion pipelines to low-dimensional motion latent spaces to mitigate the computational burden of each function evaluation. Explaining the diffusion process with probabilistic flow ordinary differential equation (PF-ODE) theory, MLCT allows inference in extremely few steps from the prior distribution to the motion latent representation distribution by maintaining consistency of the outputs along the PF-ODE trajectory. In particular, we introduce a quantization constraint to optimize motion latent representations that are bounded, regular, and well reconstructed compared to those obtained with traditional variational constraints. Furthermore, we propose a conditional PF-ODE trajectory simulation method, which improves conditional generation performance with minimal additional training cost. Extensive experiments on two human motion generation benchmarks show that the proposed model achieves state-of-the-art performance with less than 10% of the time cost.

Keywords: quantized representation · latent consistency training · motion generation

1 Introduction

Synthesizing human motion sequences under specified conditions is a fundamental task in robotics and virtual reality. Research in recent years has explored the text-to-motion diffusion framework [1, 2, 3] to generate realistic and diverse motions, which gradually recovers the motion representation from a prior distribution with multiple iterations. These works show more stable distribution estimation and stronger controllability than traditional single-step methods (e.g., GANs [4] or VAEs [5, 6]), but at the cost of a hundredfold increase in computational burden. Such a high-cost sampling mechanism is expensive in time and memory, limiting the model’s accessibility in real-time applications.

To mitigate inference cost, previous text-to-motion diffusion frameworks trade off fidelity against efficiency from two perspectives: i) mapping length-varying, high-dimensional motion sequences into well-reconstructed, low-dimensional motion latent representations [3, 7] to reduce data redundancy and complexity, and ii) using skip-step sampling strategies [3, 8] to minimize expensive and repetitive function evaluations. Works following the first perspective, inspired by the excellent performance of the latent diffusion model in text-to-image synthesis, introduce a variational autoencoder with Kullback-Leibler (KL) divergence constraints as the motion representation extractor. However, unlike image datasets that contain more than ten million samples, the high cost of motion capture limits the number of samples available for text-based motion generation. As an example, the largest current human motion dataset contains no more than fifteen thousand samples after data augmentation. Simultaneously optimizing the reconstruction loss and the KL divergence loss, which are competing objectives, is very challenging with such limited training resources. To ensure high reconstruction performance, previous state-of-the-art models usually set the KL divergence weight very low, which results in poorly regularized motion representations. Such low-regularity, continuous motion representations suffer from redundancy and low robustness. This can be mitigated by a sufficiently large number of function evaluations, but it seriously harms generative performance when only a few sampling steps are used.

The second perspective follows from recently established diffusion solvers, which can be categorized into training-free and training-based methods. Previous work confirms that the forward diffusion process corresponds to a reverse diffusion process without a stochastic term, known as the probabilistic flow ordinary differential equation (PF-ODE) [9]. Training-free methods construct different discrete solvers for the special form of the PF-ODE, achieving an almost 20-fold improvement in efficiency. These works effectively compress sampling to 50-100 steps, but the fidelity of the ODE solution degrades when the number of iterations is much smaller, owing to the complexity of the probability distribution of motion sequences and the cumulative error of discrete ODE sampling. A significant gap in computational effort therefore remains compared to traditional single-step motion generation models. Training-based methods usually rely on model distillation or trajectory distillation, and one promising approach is the consistency model. It imposes constraints on the model to keep its outputs consistent along the same PF-ODE trajectory, thus achieving a single-step or multi-step generative mapping from the prior distribution to the target distribution. Typical PF-ODE trajectory generation methods are consistency distillation, which generates trajectories with pre-trained diffusion models, and consistency training, which simulates trajectories with an unbiased estimate based on the ground truth. The former relies on well-trained diffusion models as foundation models, and training these models from scratch is computationally expensive and time-consuming. Less costly consistency training frameworks avoid additional pre-trained models, but they suffer from poor generation performance and even training collapse due to redundant and irregular latent representations. Moreover, existing consistency training frameworks have not sufficiently explored conditional PF-ODE trajectories. As a result, vanilla consistency-training-based models show no significant advantage over well-established multi-step diffusion samplers that use classifier-free guidance.

To address the above limitations, we propose a Motion Latent Consistency Training (MLCT) framework, which generates high-quality motions with no more than 5 sampling steps. Following the common latent space modeling paradigm, we focus on constructing low-dimensional and regular motion latent representations, and on simulating conditional PF-ODE trajectories with the consistency training model in the absence of pre-trained models. Specifically, the first contribution of this paper is a pixel-like latent autoencoder with quantization constraints, which aggregates motion information of arbitrary length into multiple latent representation tokens via self-attention. Its representations differ significantly from the widely used variational representations: the former are bounded and discrete, while the latter are unbounded and continuous. We restrict the representation boundaries with the hyperbolic tangent (tanh) function and force the continuous representation to map to the nearest predefined clustering center. Compared to the black-box control strategy of fine-tuning the KL divergence weight, our approach trades off the regularity and reconstruction performance of the motion latent representations more controllably by designing a finite-dimensional discrete latent representation space. In addition, previous practice demonstrates that bounded representations help sustain stable inference with classifier-free guidance (CFG). The second contribution of this paper is a one-stage conditionally guided consistency training framework. The main insight is to treat the unbiased estimate based on ground-truth motion representations as a simulation of the conditional probability gradient, and to propose an online updating mechanism for the unconditional probability gradient. To the best of our knowledge, this is the first application of classifier-free guidance to consistency training. Since the guidance is used for generating trajectories, the denoiser does not need to double its computational cost at inference to obtain better conditional generation results.

Figure 1: Our model achieves better FID metrics with less inference time and allows for the generation of high-quality human motions based on textual prompts in around 5 NFE. The color of humans darkens over time.

We evaluate the proposed framework on two widely used datasets: KIT and HumanML. The results of our generation with 1, 3, and 5 function evaluations (NFE) are shown in Figure 1, along with a comparison of FID against existing methods. Extensive experiments indicate the effectiveness of MLCT and its components. The proposed framework achieves state-of-the-art performance in motion generation in only around 5 steps.

To sum up, the contributions of this paper are as follows:

  • We explore a pixel-like motion latent representation based on quantization constraints, which is highly regular, well reconstructed, and bounded.

  • We introduce classifier-free guidance into consistency training for the first time. It enables more controllable motion generation as well as more stable training convergence.

  • Our proposed MLCT achieves state-of-the-art performance on two challenging datasets with extremely few sampling steps.

2 Related Work

Human motion generation. Human motion generation aims to synthesize human motion sequences under specified conditions, such as action categories [10, 11], audio [12, 13], and textual descriptions [14, 2, 3]. In the past few years, numerous works have investigated motion generation with various generative frameworks. For example, VAE-based models [15, 16, 5] represent the motion as a set of Gaussian distributions and constrain its regularity with KL divergence. Such a constraint allows them to reconstruct the motion information from the standard normal distribution, yet the results are often ambiguous. GAN-based methods [17, 4] achieve better performance by bypassing direct estimation of probabilistic likelihoods via adversarial training, but the adversarial property makes their training often unstable and prone to mode collapse. Some multi-step generative methods have emerged recently with great success, such as auto-regressive [18, 19] and diffusion methods [1, 2, 3]. In particular, the latter are gradually dominating the research frontier due to their stable distribution estimation capability and high-quality sampling results. MotionDiffuse [1] and MDM [2] were the pioneers in implementing diffusion frameworks for motion generation. MLD [3] realizes diffusion in the latent space, which significantly improves efficiency. M2DM [7] represents motion as discrete features and performs diffusion in a finite state space, with state-of-the-art performance. Some recent work [8] has focused on more controlled generation with equally excellent results. These works validate the outstanding capabilities of the motion diffusion framework and continue to receive attention.

Efficient diffusion sampling. Efficient diffusion sampling is the primary challenge for diffusion frameworks oriented to real-time generation tasks. DDIM [20] relaxes the Markov condition of the original diffusion framework and achieves a 20-fold improvement in computational efficiency. The score-based method [9] from the same period relates the diffusion framework to a stochastic differential equation and notes that it has a special form known as the probability flow ODE. This is a milestone achievement. It guides subsequent works either to steer a simplified diffusion process through a specially designed form of ODE [21, 22, 23], or to skip a sufficiently large number of sampling steps via more sophisticated higher-order ODE solvers [24]. In addition to the above work, the diffusion process can be executed in lower-dimensional and more regular latent spaces, thus reducing the computational burden of each step [25]. While these works have proven effective in computer vision, they have received only limited attention in motion diffusion frameworks. Previous state-of-the-art methods such as MLD [3] and GraphMotion [8] use VAE-based representations and DDIM sampling strategies. Precise and robust motion representation and efficient motion diffusion design remain open problems.

Consistency model. Consistency modeling is a novel and flexible diffusion sampling framework that allows the model to trade off between extremely few sampling steps and generation quality. Latent consistency models extend consistency distillation to the latent representation space, reducing memory usage and further improving inference efficiency. Subsequently, VideoLCM applies consistency distillation to video generation. Recent approaches have also investigated the application of LoRA and ControlNet to consistency modeling with impressive results. These methods rely on a strong teacher model as the distillation target, and training such a model from scratch requires not only a large dataset but also substantial computational resources. To reduce the training cost, ICM further explores and improves consistency training methods to obtain performance similar to consistency distillation without pre-trained models. However, it is still limited to the original pixel representation space of fixed dimensions and is applied to variance-exploding ODE frameworks. Consistency training methods for broader diffusion strategies in the latent representation space remain underexplored.

3 Preliminaries

In this section, we briefly introduce diffusion and consistency models.

3.1 Score-based Diffusion Models

The diffusion model [26] is a generative model that gradually injects Gaussian noise into the data and then generates samples from the noise through a reverse denoising process. Specifically, it gradually transforms the data distribution $p_{data}(x_{0})$ into a prior distribution $p(x_{T})$ that is easy to sample via a Gaussian perturbation kernel $p(x_{t}|x_{0})=\mathcal{N}(x_{t}|\alpha_{t}x_{0},\sigma_{t}^{2}I)$, where $\alpha_{t}$ and $\sigma_{t}$ are the specified noise schedules. Recent studies have formalized it in continuous time as a stochastic differential equation,

$$dx_{t}=f(t)x_{t}\,dt+g(t)\,dw_{t}, \tag{1}$$

where $t\in[\epsilon,T]$, $\epsilon$ and $T$ are fixed positive constants, $w_{t}$ denotes standard Brownian motion, and $f$ and $g$ are the drift and diffusion coefficients, respectively, given by

$$f(t)=\frac{d\log\alpha_{t}}{dt},\quad g^{2}(t)=\frac{d\sigma_{t}^{2}}{dt}-2\frac{d\log\alpha_{t}}{dt}\sigma_{t}^{2}. \tag{2}$$
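As a quick check of Eq. 2, consider a variance-preserving schedule, that is, one satisfying $\sigma_{t}^{2}=1-\alpha_{t}^{2}$, with a linear rate $\beta(t)=\beta_{0}+t(\beta_{1}-\beta_{0})$. This particular schedule and the symbols $\beta_{0}$, $\beta_{1}$, and $\beta(t)$ are assumptions chosen for illustration. Substituting them gives the familiar coefficients:

```latex
\begin{aligned}
\log\alpha_t &= -\tfrac{1}{4}t^{2}(\beta_1-\beta_0)-\tfrac{1}{2}t\beta_0
  \;\Rightarrow\; f(t)=\frac{d\log\alpha_t}{dt}=-\tfrac{1}{2}\beta(t),\\
\frac{d\sigma_t^{2}}{dt} &= -2\alpha_t^{2}\,\frac{d\log\alpha_t}{dt}=\alpha_t^{2}\beta(t),\\
g^{2}(t) &= \alpha_t^{2}\beta(t)+\beta(t)\sigma_t^{2}
  =\beta(t)\,(\alpha_t^{2}+\sigma_t^{2})=\beta(t).
\end{aligned}
```

Under this assumption the drift depends only on $\beta(t)$ and the noise injection rate is exactly $\beta(t)$, so the total variance stays bounded for data of unit scale.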

Previous work has revealed that the reverse process of Eq. 1 shares the same marginal probabilities with the probabilistic flow ODE:

$$dx_{t}=\Big[f(t)x_{t}-\frac{1}{2}g^{2}(t)\nabla_{x_{t}}\log p(x_{t})\Big]dt, \tag{3}$$

where $\nabla_{x_{t}}\log p(x_{t})$ is called the score function, which is the only unknown term in the sampling pipeline. An effective approach is to train a time-dependent score network $\mathcal{S}_{\theta}(x_{t},t)$ to estimate $\nabla_{x_{t}}\log p(x_{t})$ based on conditional score matching, parameterized as the prediction of the noise or of the initial value in forward diffusion. Further, Eq. 3 can be solved in a finite number of steps by any numerical ODE solver, such as the Euler [9] and Heun [27] solvers.
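To make the solver step concrete, the following is a minimal sketch of Euler integration of Eq. 3 on a toy problem. It assumes a variance-preserving schedule with $f(t)=-\frac{1}{2}\beta(t)$ and $g^{2}(t)=\beta(t)$, hypothetical constants `BETA0` and `BETA1`, and 1-D Gaussian data, so that the exact score is available in closed form and stands in for a trained score network. None of these choices come from the paper's implementation.

```python
import math

# Assumed (hypothetical) constants for a variance-preserving schedule.
BETA0, BETA1 = 0.1, 20.0
EPS, T = 1e-3, 1.0
DATA_MEAN, DATA_STD = 2.0, 0.5  # toy 1-D Gaussian data distribution


def beta(t):
    return BETA0 + t * (BETA1 - BETA0)


def alpha(t):
    return math.exp(-0.25 * t * t * (BETA1 - BETA0) - 0.5 * t * BETA0)


def marginal_var(t):
    """Variance of p(x_t) for Gaussian data under N(alpha * x0, sigma^2)."""
    a2 = alpha(t) ** 2
    return a2 * DATA_STD ** 2 + (1.0 - a2)


def score(x, t):
    """Exact score of the Gaussian marginal; stands in for a trained network."""
    return -(x - alpha(t) * DATA_MEAN) / marginal_var(t)


def pf_ode_euler(x_T, n_steps=2000):
    """Integrate Eq. 3 from t = T down to t = EPS with fixed Euler steps."""
    dt = (EPS - T) / n_steps  # negative: we move backward in time
    x, t = x_T, T
    for _ in range(n_steps):
        f, g2 = -0.5 * beta(t), beta(t)  # VP drift and squared diffusion
        x += (f * x - 0.5 * g2 * score(x, t)) * dt
        t += dt
    return x


def pf_ode_exact(x_T):
    """Closed form, available only because the toy marginals are Gaussian."""
    u = (x_T - alpha(T) * DATA_MEAN) / math.sqrt(marginal_var(T))
    return alpha(EPS) * DATA_MEAN + math.sqrt(marginal_var(EPS)) * u
```

Comparing `pf_ode_euler(x_T)` with `pf_ode_exact(x_T)` for a few starting points shows how the discretization error shrinks as `n_steps` grows, which is the cost that few-step methods try to avoid.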

3.2 Consistency Models

Theoretically, the reverse process expressed by Eq. 3 is deterministic, and the consistency model (CM) [23] achieves one-step or few-step generation by pulling together the outputs on the same ODE trajectory. It is more formally expressed as,

$$\mathcal{S}_{\theta}(x_{t},t)=\mathcal{S}_{\theta}(x_{t^{\prime}},t^{\prime})\approx\mathcal{S}_{\theta}(x_{\epsilon},\epsilon)\quad\forall t,t^{\prime}\in[\epsilon,T], \tag{4}$$

which is known as the self-consistency property. To maintain the boundary conditions, existing consistency models are commonly parameterized by skip connections, i.e.,

$$\mathcal{S}_{\theta}(x_{t},t):=c_{skip}(t)x_{t}+c_{out}(t)\hat{\mathcal{S}}_{\theta}(x_{t},t), \tag{5}$$

where $c_{skip}(t)$ and $c_{out}(t)$ are differentiable functions satisfying $c_{skip}(\epsilon)=1$ and $c_{out}(\epsilon)=0$. To stabilize training, the consistency model maintains a target model $\mathcal{S}_{\theta^{-}}$ whose parameters are an exponential moving average (EMA) of $\theta$ with rate $\gamma$, that is, $\theta^{-}\leftarrow\gamma\theta^{-}+(1-\gamma)\theta$. The consistency loss can be formulated as,

$$\mathcal{L}_{cm}(\theta,\theta^{-})=\mathbb{E}_{x,t}\big[d\big(\mathcal{S}_{\theta}(x_{t_{n+1}},t_{n+1}),\mathcal{S}_{\theta^{-}}(\hat{x}_{t_{n}},t_{n})\big)\big], \tag{6}$$

where $d(\cdot,\cdot)$ is a metric function such as the mean squared error or the pseudo-Huber metric, and $\hat{x}_{t_{n}}$ is a one-step estimate from $x_{t_{n+1}}$ obtained by applying an ODE solver to Eq. 3.
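The pieces of Eqs. 5 and 6 can be sketched in a few lines of scalar Python. The specific forms of `c_skip` and `c_out` below are one common choice from the consistency-model literature and are an assumption here; the text only requires $c_{skip}(\epsilon)=1$ and $c_{out}(\epsilon)=0$. The constants `SIGMA_DATA` and `c`, and the one-weight `raw_network`, are hypothetical placeholders.

```python
import math

EPS = 1e-3          # smallest time on the trajectory
SIGMA_DATA = 0.5    # assumed data scale (hypothetical value)


def c_skip(t):
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)


def c_out(t):
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def raw_network(x, t, theta):
    """Hypothetical stand-in for the free-form network: one scalar weight."""
    return theta * x


def consistency_fn(x, t, theta):
    """Eq. 5: skip parameterization that enforces the boundary condition."""
    return c_skip(t) * x + c_out(t) * raw_network(x, t, theta)


def pseudo_huber(a, b, c=0.03):
    """Pseudo-Huber metric; c is an assumed smoothing constant."""
    return math.sqrt((a - b) ** 2 + c ** 2) - c


def consistency_loss(x_next, t_next, x_hat, t_cur, theta, theta_minus):
    """One-sample Eq. 6: online model at t_{n+1}, target model at t_n."""
    online = consistency_fn(x_next, t_next, theta)
    target = consistency_fn(x_hat, t_cur, theta_minus)  # treated as constant
    return pseudo_huber(online, target)


def ema_update(theta_minus, theta, gamma=0.95):
    """Target parameters follow the online ones by exponential averaging."""
    return gamma * theta_minus + (1.0 - gamma) * theta
```

Because `c_out(EPS)` is zero, `consistency_fn(x, EPS, theta)` returns `x` for any `theta`, so the boundary condition holds by construction rather than by training.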

4 Motion Latent Consistency Training Framework

Figure 2: Our Motion Consistency model can achieve high-quality motion generation given a text prompt with around 5 steps. The color of humans darkens over time.
[Figure 2 diagram labels: encoder $\mathcal{E}$, decoder $\mathcal{D}$, consistency model $\mathcal{S}$; states $x_{\epsilon}$, $x_{t}$, $x_{t^{\prime}}$, $x_{T}$; forward SDE $dx_{t}=f(t)x_{t}dt+g(t)dw_{t}$; PF-ODE $dx_{t}=[f(t)x_{t}-\frac{1}{2}g^{2}(t)\nabla_{x_{t}}\log p(x_{t})]dt$. Consistency property: $\mathcal{S}(x_{T},T,c)\approx\mathcal{S}(x_{t^{\prime}},t^{\prime},c)\approx\mathcal{S}(x_{t},t,c)\approx x_{\epsilon}$, $\forall t,t^{\prime}\in[\epsilon,T]$.]

In this section, we discuss two critical targets. The first is encoding motions of arbitrary length into low-dimensional and regularized motion latent representations to align all motion dimensions. The second is introducing the conditional PF-ODE into the lower-cost consistency training framework for few-step, high-quality latent representation sampling. To this end, we propose a Motion Latent Consistency Training (MLCT) framework, as shown in Figure 2. It consists of an autoencoder with quantization constraints, which learns various motion representations in low-dimensional and regularized latent spaces (details in Section 4.1), and a denoising network, which captures the corresponding latent state distributions and implements few-step sampling (details in Section 4.2).

4.1 Encoding Motion as Quantized Latent Representation

We construct an autoencoder $\mathcal{G}=\{\mathcal{E},\mathcal{D}\}$ with a transformer-based architecture to encode and reconstruct between motion sequences $x$ and latent motion representations $z$. The core insight is that each dimension of $z$ is sampled from a finite set $\mathcal{M}$ of size $2l+1$ as follows,

$$\mathcal{M}=\{z_{i};-1,-j/l,\cdots,0,\cdots,j/l,\cdots,1\}_{j=0}^{l}. \tag{7}$$

To this end, we denote $z\in\mathcal{R}^{n,d}$ as $n$ learnable tokens of dimension $d$, which aggregate the motion sequence features via attention. Inspired by recent quantization work [28], we apply a hyperbolic tangent (tanh) function to the output of the encoder $\mathcal{E}$ to constrain the boundaries of the representation, and then quantize the result with a rounding operator $\mathcal{R}$. Furthermore, the gradient of the quantized values is approximated by the gradient of the pre-quantization state so that gradients backpropagate normally. The latent representations $z$ are sampled as follows,

$$z=\mathcal{R}\Big(l\cdot\tanh(\mathcal{E}(x))\Big)/l. \tag{8}$$
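A small sketch of the forward pass of Eq. 8 may help. It operates on a flat list of hypothetical encoder outputs and uses only the standard library, so the gradient approximation described above is not shown, since it requires an automatic differentiation framework. The helper names `codebook` and `quantize` are illustrative and not taken from the paper.

```python
import math


def codebook(l):
    """The 2l+1 admissible values of Eq. 7: -1, ..., -1/l, 0, 1/l, ..., 1."""
    return [j / l for j in range(-l, l + 1)]


def quantize(encoder_out, l):
    """Eq. 8 element-wise: bound with tanh, scale by l, round, rescale."""
    # Python's round() sends exact .5 ties to the even integer; such ties
    # have probability zero for real-valued encoder outputs.
    return [round(l * math.tanh(h)) / l for h in encoder_out]


tokens = quantize([-3.0, -0.2, 0.0, 0.4, 5.0], l=2)
print(tokens)  # [-1.0, 0.0, 0.0, 0.5, 1.0]
```

Every output lies in `codebook(2)`, which illustrates the boundedness and discreteness of the representation: a larger `l` gives a finer grid and better reconstruction at the cost of a less regular latent space.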

The standard optimization target is to reconstruct motion information from $z$ with the decoder $\mathcal{D}$, i.e., to optimize the smooth $l_{1}$ loss,

$$\mathcal{L}_{z}=\mathbb{E}_{x}\Big[d\Big(x,\mathcal{D}(z)\Big)+\lambda_{j}d\Big(\mathcal{J}(x),\mathcal{J}(\mathcal{D}(z))\Big)\Big], \tag{9}$$

where $\mathcal{J}$ is a function that transforms features such as joint rotations into joint coordinates, as also applied in MLD [3] and GraphMotion [8], and $\lambda_{j}$ is a balancing term.

Compared with traditional VAEs, the optimization target in Eq. 9 does not contain an adversarial divergence term. A well-trained autoencoder $\mathcal{G}$ outputs bounded and regular motion latent representations, which in turn improves the solution space of the denoising network. Experimentally, we found that this improvement is important for the convergence of consistency training.

4.2 Few Step Motion Generation via Consistency Training

For conditional motion generation, classifier-free guidance (CFG) is crucial for synthesizing high-fidelity samples in most successful motion diffusion models, such as MLD or GraphMotion. Previous work introduced CFG into consistency distillation, demonstrating the feasibility of the consistency model on conditional PF-ODE trajectories. However, such methods rely on powerful pre-trained teacher models, which not only involve additional training costs but also limit performance through distillation errors. Therefore, we are motivated to simulate CFG more efficiently from the original motion latent representation following the consistency training framework to alleviate the computational burden.

The diffusion stage of MLCM begins with the variance-preserving schedule [9], which perturbs the motion latent representation $x_{\epsilon}=z$ with the perturbation kernel $\mathcal{N}(x_{t};\alpha(t)x_{0},\sigma^{2}(t)I)$,

$$\alpha(t):=e^{-\frac{1}{4}t^{2}(\beta_{1}-\beta_{0})-\frac{1}{2}t\beta_{0}},\quad\sigma(t):=\sqrt{1-e^{2\alpha(t)}}. \tag{10}$$
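The perturbation step can be sketched as follows. Two points are assumptions rather than statements from the paper: the values $\beta_{0}=0.1$ and $\beta_{1}=20$ are common defaults for variance-preserving schedules and are not given in the text, and $\sigma(t)$ is computed from the variance-preserving identity $\alpha(t)^{2}+\sigma(t)^{2}=1$, which is one reading of Eq. 10.

```python
import math


def alpha(t, beta0=0.1, beta1=20.0):
    """Signal scale alpha(t) of the variance-preserving schedule."""
    return math.exp(-0.25 * t * t * (beta1 - beta0) - 0.5 * t * beta0)


def sigma(t, beta0=0.1, beta1=20.0):
    """Noise scale, assuming alpha(t)^2 + sigma(t)^2 = 1 (variance preserving)."""
    return math.sqrt(1.0 - alpha(t, beta0, beta1) ** 2)


def perturb(x0, t, noise):
    """Sample x_t = alpha(t) * x0 + sigma(t) * noise for a latent given as a list."""
    a, s = alpha(t), sigma(t)
    return [a * xi + s * ni for xi, ni in zip(x0, noise)]
```

Here `noise` would be drawn from a standard Gaussian; it is passed explicitly so that the sketch stays deterministic.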

The consistency model $\mathcal{S}_{\theta}$ is constructed to predict $x_{\epsilon}$ from the perturbed $x_{t}$ on a given PF-ODE trajectory. To maintain the boundary condition $\mathcal{S}_{\theta}(x_{\epsilon},\epsilon,c)=x_{\epsilon}$, we employ the same skip setting for Eq. LABEL:equ5 as in the latent consistency model (LCM), which is parameterized as follows:

$$\mathcal{S}_{\theta}(x_{t},t,c):=\frac{\eta^{2}}{(10t)^{2}+\eta^{2}}\cdot x_{t}+\frac{10t}{\sqrt{(10t)^{2}+\eta^{2}}}\cdot\widetilde{\mathcal{S}}_{\theta}(x_{t},t,c), \tag{11}$$

where $\widetilde{\mathcal{S}}_{\theta}$ is a transformer-based network and $\eta$ is a hyperparameter, usually set to 0.5. Following the self-consistency property (as detailed in Eq. 4), the model $\mathcal{S}_{\theta}$ has to keep its output at the given perturbed state $x_{t}$ consistent with its output at the previous state $\widetilde{x}_{t-\Delta t}$ on the same ODE trajectory. The latter can be estimated via the DPM++ solver:

$$\widetilde{x}_{t-\Delta t}\approx\frac{\sigma_{t-\Delta t}}{\sigma_{t}}\cdot x_{t}-\alpha_{t}\cdot\Big(\frac{\alpha_{t-\Delta t}\cdot\sigma_{t}}{\sigma_{t-\Delta t}\cdot\alpha_{t}}-1\Big)\cdot x_{\epsilon}^{\Phi}, \tag{12}$$

where $x_{\epsilon}^{\Phi}$ is the estimate of $x_{\epsilon}$ under different sampling strategies. In particular, $x_{\epsilon}^{\Phi}$ can be parameterized as a linear combination of the conditional and unconditional latent representation predictions following the CFG strategy, i.e.,

$$x_{\epsilon}^{\Phi}(x_{t},t,c)=(1+\omega)\cdot\mathcal{F}_{\theta}(x_{t},t,c)-\omega\mathcal{F}_{\theta}(x_{t},t,\emptyset), \tag{13}$$

where $\mathcal{F}_{\theta}(\cdot)$ is a well-trained, $x_{\epsilon}$-prediction-based motion diffusion model.

It is worth noting that $x_{\epsilon}$ can be utilized to simulate $\mathcal{F}_{\theta}(x_{t},t,c)$, as in the vanilla consistency training pipeline. Furthermore, $\mathcal{F}_{\theta}(x_{t},t,\emptyset)$ can be replaced by $\mathcal{S}_{\theta}(x_{t},t,\emptyset)$ with online updating. Thus Eq. 13 can be rewritten as:

$$x_{\epsilon}^{\Phi}(x_{t},t,c)=(1+\omega)\cdot x_{\epsilon}-\omega\mathcal{S}_{\theta}(x_{t},t,\emptyset). \tag{14}$$
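The guided estimate and the one-step move along the trajectory can be sketched together. The function `cfg_estimate` covers both Eq. 13 (teacher predictions) and Eq. 14 (the clean latent $x_{\epsilon}$ in place of the conditional prediction, and the online unconditional output in place of the teacher). The step function is written in the standard first-order DPM-Solver++ data-prediction form with $s=t-\Delta t$; its subscript placement follows that solver and is an assumption here, and the latents are plain lists standing in for network outputs.

```python
def cfg_estimate(cond_pred, uncond_pred, omega):
    """(1 + omega) * conditional - omega * unconditional, as in Eqs. 13 and 14."""
    return [(1.0 + omega) * c - omega * u for c, u in zip(cond_pred, uncond_pred)]


def dpmpp_first_order_step(x_t, x0_est, alpha_t, sigma_t, alpha_s, sigma_s):
    """One first-order DPM-Solver++ step from time t to an earlier time s.

    Uses the data-prediction form x_s = (sigma_s / sigma_t) * x_t
    - alpha_s * (exp(-h) - 1) * x0_est with h = lambda_s - lambda_t
    and lambda = log(alpha / sigma).
    """
    exp_neg_h = (alpha_t * sigma_s) / (sigma_t * alpha_s)
    return [
        (sigma_s / sigma_t) * x - alpha_s * (exp_neg_h - 1.0) * x0
        for x, x0 in zip(x_t, x0_est)
    ]
```

With `omega = 0` the guided estimate reduces to the conditional input, which corresponds to vanilla consistency training. When `x0_est` equals the true clean latent, the step lands exactly on the same noise realization at the earlier time, which is a convenient sanity check.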

The optimization objective of the consistency model $\mathcal{S}_{\theta}$ is

$$\mathcal{L}_{c}=\mathbb{E}_{x,t}\Big[\frac{1}{\Delta t}d\Big(\mathcal{S}_{\theta}(x_{t},t,c),\mathcal{S}_{\theta^{-}}(\hat{x}_{t-\Delta t},t-\Delta t,c)\Big)+\lambda_{c}\,d\Big(\mathcal{S}_{\theta}(x_{t},t,\emptyset),x_{\epsilon}\Big)\Big], \tag{15}$$

where $d(x,y)=\sqrt{(x-y)^{2}+\gamma^{2}}-\gamma$ is the pseudo-Huber metric, $\gamma$ is a constant, and $\lambda_{c}$ is a balancing term. The target network $\mathcal{S}_{\theta^{-}}$ is updated after each iteration via an exponential moving average (EMA).
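The building blocks of this objective can be sketched in plain Python: the skip coefficients of Eq. 11, the pseudo-Huber metric, the two-term loss of Eq. 15 for a single sample, and the EMA update of the target parameters. The defaults `gamma`, `lambda_c`, and `decay` are hypothetical values chosen for illustration, and averaging the metric over latent elements is an assumption about the reduction.

```python
import math


def skip_coefficients(t, eta=0.5):
    """Weights on x_t and on the raw network output in Eq. 11."""
    scaled = 10.0 * t
    c_skip = eta ** 2 / (scaled ** 2 + eta ** 2)
    c_out = scaled / math.sqrt(scaled ** 2 + eta ** 2)
    return c_skip, c_out


def consistency_output(x_t, raw_output, t, eta=0.5):
    """Combine x_t and the raw network output with the skip coefficients."""
    c_skip, c_out = skip_coefficients(t, eta)
    return [c_skip * x + c_out * r for x, r in zip(x_t, raw_output)]


def pseudo_huber(a, b, gamma=0.01):
    """Element-wise mean of sqrt((a - b)^2 + gamma^2) - gamma."""
    terms = (math.sqrt((u - v) ** 2 + gamma ** 2) - gamma for u, v in zip(a, b))
    return sum(terms) / len(a)


def consistency_loss(online_cond, target_prev, online_uncond, x_eps,
                     delta_t, lambda_c=1.0):
    """Single-sample Eq. 15: consistency term plus unconditional regression term."""
    consistency = pseudo_huber(online_cond, target_prev) / delta_t
    unconditional = pseudo_huber(online_uncond, x_eps)
    return consistency + lambda_c * unconditional


def ema_update(target_params, online_params, decay=0.999):
    """theta_minus <- decay * theta_minus + (1 - decay) * theta."""
    return [decay * tp + (1.0 - decay) * op
            for tp, op in zip(target_params, online_params)]
```

At `t = 0` the skip weight is exactly 1 and the output weight is exactly 0, so the model returns its input; at small positive times such as $t=\epsilon$ the identity holds only approximately. In a training loop, `target_prev` would be computed with the EMA parameters and treated as a constant when differentiating the loss.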

5 Experiments

5.1 Datasets and Metrics

Datasets. We evaluate the proposed framework on two mainstream benchmarks for text-driven motion generation, KIT [29] and HumanML3D [5]. The former contains 3,911 motions and 6,363 corresponding natural language descriptions. The latter is currently the largest 3D human motion dataset; it combines the HumanAct12 [15] and AMASS [30] datasets and contains 14,616 motions and 44,970 descriptions.

Evaluation Metrics. Consistent with previous work, we evaluate the proposed framework from four aspects. (a) Motion quality: we use the Fréchet inception distance (FID) to measure the distance between the feature distributions of the generated and real data. (b) Condition matching: we first employ R-precision to measure the correlation between the text description and the generated motion sequence, recording the probability of a match within the top $k=1,2,3$ candidates. We then calculate the distance between motions and texts with the multi-modal distance (MM Dist). (c) Motion diversity: we compute the differences between features with the diversity metric and measure the generative diversity under the same text input with the multimodality (MM) metric. (d) Computational burden: we first use the number of function evaluations (NFE) to evaluate generation performance under few-step sampling. We then report the average sampling time (AST) of a single sample.
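To make the top-$k$ matching in (b) concrete, the sketch below computes R-precision from a precomputed distance matrix. It assumes that `distances[i][j]` holds the feature distance between generated motion `i` and text `j`, and that text `i` is the ground-truth description of motion `i`; the feature extractors and the composition of the candidate pool are not specified here and are left outside the example.

```python
def r_precision(distances, ks=(1, 2, 3)):
    """Fraction of motions whose ground-truth text ranks within the top k."""
    hits = {k: 0 for k in ks}
    for i, row in enumerate(distances):
        # Rank of the matching text = number of candidates strictly closer.
        rank = sum(1 for d in row if d < row[i])
        for k in ks:
            if rank < k:
                hits[k] += 1
    return {k: hits[k] / len(distances) for k in ks}
```

For example, with three motions where the matching text is closest for two of them and third-closest for the other, the function reports 2/3 for Top-1 and Top-2 and 1.0 for Top-3.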

5.2 Implementation Details

Model Configuration. The motion autoencoder $\{\mathcal{E},\mathcal{D}\}$ and the score network $\mathcal{S}$ both use the transformer architecture with long skip connections [31], which is also used in MLD [3]. Specifically, the encoder $\mathcal{E}$ and decoder $\mathcal{D}$ each contain 7 transformer blocks with input dimension 256, and each block contains 3 learnable tokens. The size of the finite set $\mathcal{M}$ is set to 2001, i.e., $l=1000$. The score network $\mathcal{S}$ contains 15 transformer blocks with input dimension 512. The frozen CLIP-ViT-L-14 model [32] is used as the text encoder. It encodes the text into a pooled output $w\in\mathcal{R}^{1,256}$, which is then projected into a text embedding and summed with the time embedding before the input of each block.

Train Configuration. We divide the diffusion time horizon $[\epsilon,T]$ into $N-1$ sub-intervals, with $\epsilon=0.002$, $T=1$, and $N=1000$. We follow the consistency model [23] to determine $t_{i}=(\epsilon^{1/\rho}+\frac{i-1}{N-1}(T^{1/\rho}-\epsilon^{1/\rho}))^{\rho}$, where $\rho=2$. To balance training, we set $\lambda_{j}$ to 0.001. All the proposed models are trained with the AdamW optimizer with a learning rate of $10^{-4}$ on a single RTX 4090 GPU. The mini-batch size is 64 for the autoencoder and 128 for the denoising network, which are trained for 1500 and 2000 epochs, respectively.
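The time discretization can be reproduced with a few lines of Python. This is only a sketch of the formula for $t_{i}$ with the stated values as defaults; the function name `time_grid` is introduced here for illustration.

```python
def time_grid(n=1000, eps=0.002, t_max=1.0, rho=2.0):
    """Return [t_1, ..., t_N] with t_1 = eps and t_N = t_max."""
    lo = eps ** (1.0 / rho)
    hi = t_max ** (1.0 / rho)
    return [(lo + (i - 1) / (n - 1) * (hi - lo)) ** rho for i in range(1, n + 1)]
```

With $\rho=2$ the grid is denser near $\epsilon$ than near $T$, so adjacent training pairs $(t-\Delta t, t)$ are closer together at low noise levels.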

5.3 Comparisons to State-of-the-art Methods

Table 1: Comparisons to state-of-the-art methods on the HumanML test set. We repeat all evaluations 20 times and report the average with a 95% confidence interval. "↑" denotes that higher is better, "↓" denotes that lower is better, and "→" denotes that results are better when the metric is closer to that of the real motion. † denotes that classifier-free guidance is utilized, causing a double NFE.

| Method | R-Precision Top-1 ↑ | R-Precision Top-2 ↑ | R-Precision Top-3 ↑ | FID ↓ | MM-Dist ↓ | Diversity → | MModality ↑ | NFE ↓ |
|---|---|---|---|---|---|---|---|---|
| Real | 0.511±.003 | 0.703±.003 | 0.797±.002 | 0.002±.000 | 2.974±.008 | 9.503±.065 | - | - |
| TEMOS [6] | 0.424±.002 | 0.612±.002 | 0.722±.002 | 3.734±.028 | 3.703±.008 | 8.973±.071 | 0.368±.018 | - |
| T2M [5] | 0.457±.002 | 0.639±.003 | 0.740±.003 | 1.067±.002 | 3.340±.008 | 9.188±.002 | 2.090±.083 | - |
| MDM [2] | 0.320±.005 | 0.498±.004 | 0.611±.007 | 0.544±.044 | 5.566±.027 | 9.559±.086 | 2.799±.072 | 1000 |
| MD [1] | 0.491±.001 | 0.681±.001 | 0.782±.001 | 0.630±.001 | 3.113±.001 | 9.410±.049 | 1.553±.042 | 1000 |
| MLD [3] | 0.481±.003 | 0.673±.003 | 0.772±.002 | 0.473±.013 | 3.196±.010 | 9.724±.082 | 2.413±.079 | 100 |
| GraphMotion [8] | 0.504±.003 | 0.699±.002 | 0.785±.002 | 0.116±.007 | 3.070±.008 | 9.692±.067 | 2.766±.096 | 300 |
| M2DM [7] | 0.497±.003 | 0.682±.002 | 0.763±.003 | 0.352±.005 | 3.134±.010 | 9.926±.073 | 3.587±.072 | 100 |
| Ours | 0.460±.001 | 0.655±.002 | 0.760±.006 | 0.232±.007 | 3.238±.008 | 9.658±.065 | 3.506±.008 | 5 |

Table 2: Comparisons to state-of-the-art methods on the KIT test set. The meaning of the markers is the same as in Tab. 1.

| Method | R-Precision Top-1 ↑ | R-Precision Top-2 ↑ | R-Precision Top-3 ↑ | FID ↓ | MM-Dist ↓ | Diversity → | MModality ↑ | NFE ↓ |
|---|---|---|---|---|---|---|---|---|
| Real | 0.424±.005 | 0.649±.006 | 0.779±.006 | 0.031±.004 | 2.788±.012 | 11.08±.097 | - | - |
| TEMOS [6] | 0.353±.006 | 0.561±.007 | 0.687±.005 | 3.717±.051 | 3.417±.019 | 10.84±.100 | 0.532±.034 | - |
| T2M [5] | 0.370±.005 | 0.569±.007 | 0.693±.007 | 2.770±.109 | 3.401±.008 | 10.91±.119 | 1.482±.065 | - |
| MDM [2] | 0.164±.004 | 0.291±.004 | 0.396±.004 | 0.497±.021 | 9.191±.022 | 10.85±.109 | 1.907±.214 | 1000 |
| MD [1] | 0.417±.004 | 0.621±.004 | 0.739±.004 | 1.954±.062 | 2.958±.005 | 11.10±.143 | 0.730±.013 | 1000 |
| MLD [3] | 0.390±.008 | 0.609±.008 | 0.734±.007 | 0.404±.027 | 3.204±.027 | 10.80±.117 | 2.192±.071 | 100 |
| GM†,‡ [8] | 0.429±.007 | 0.648±.006 | 0.769±.006 | 0.313±.013 | 3.076±.022 | 11.12±.135 | 3.627±.113 | 300 |
| M2DM [7] | 0.416±.004 | 0.628±.004 | 0.743±.004 | 0.515±.029 | 3.015±.017 | 11.417±.970 | 3.325±.370 | 100 |
| Ours | 0.433±.007 | 0.655±.006 | 0.783±.006 | 0.408±.013 | 2.831±.018 | 11.179±.085 | 1.23±.037 | 5 |

The test results on HumanML and KIT are shown in Tab. 1 and Tab. 2, respectively. Our framework achieves state-of-the-art generation performance. Compared to existing motion diffusion frameworks that require 50 to 1000 iterations (e.g., MDM, MotionDiffuse, and MLD), our approach reduces the computational burden by more than tenfold without severely degrading generation quality. Remarkably, our inference pipeline is very concise, with no tricks such as the additional text preprocessing used in GraphMotion. Sampling in fewer steps also does not significantly reduce the diversity and multimodality metrics, which remain competitive. Fig. 3 compares our visualization results with those of previous models.

Figure 3: Qualitative analysis of our model and previous models. We provide three textual prompts for the motion visualization results. We achieve better motion generation performance to match some text conditions with fewer NFE.

5.4 Ablation Study

Figure 4: Ablation study comparing the quantized autoencoder employed in our framework with the conventional variational autoencoder and the vanilla autoencoder under different guidance parameters. We repeat all evaluations 3 times every 50 epochs and report the average values.
Table 3: Ablation study of our framework with more generation metrics under different guidance parameters. The meaning of the markers is the same as in Tab. 1.

| Dataset | $w$ | R-Precision Top-3 ↑ | FID ↓ | MM-Dist ↓ | MModality ↑ |
|---|---|---|---|---|---|
| KIT | 0 | 0.742±.006 | 0.717±.028 | 3.051±.021 | 2.496±.065 |
| KIT | 0.5 | 0.771±.006 | 0.504±.021 | 2.885±.023 | 1.935±.044 |
| KIT | 1 | 0.775±.005 | 0.494±.019 | 2.831±.021 | 1.844±.049 |
| KIT | 1.5 | 0.783±.006 | 0.411±.019 | 2.809±.019 | 1.648±.040 |
| KIT | 2 | 0.777±.006 | 0.518±.016 | 2.799±.023 | 1.612±.041 |

Effectiveness of each component. We explore the generative performance of the classifier-free guidance technique under different representations, and the results are reported in Fig. 4. When the guidance coefficient $w$ equals 0, the model degenerates into a vanilla consistency model. We find that increasing the degree of classifier-free guidance accelerates the convergence of consistency training and improves generation quality. The pixel-discrete motion representation produced by the quantized autoencoder converges better and generates better results than the continuous motion representation. In particular, under the same consistency training parameters, we have not observed significant gains in generation quality from variational constraints compared to the vanilla autoencoder. We further examine more comprehensive generation metrics under different guidance parameters, and the results are reported in Tab. 3. As the guidance parameter increases, controllability and generation quality gradually improve, with a corresponding decrease in diversity. In contrast to the larger guidance parameters employed in the traditional diffusion framework (which can usually be set to 7), we find that guidance with $w$ greater than 2 no longer contributes to the generation quality in the consistency training framework.

Table 4: Ablation study of different numbers of tokens and sizes of the finite representation set. The meaning of the markers is the same as in Tab. 1.
| Dataset | Token | $l$ | R-Precision Top-3 $\uparrow$ | FID $\downarrow$ | MM-Dist $\downarrow$ | MModality $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- |
| KIT | 2 | 100 | 0.770±.006 | 0.599±.025 | 2.870±.020 | 1.656±.043 |
| KIT | 2 | 500 | 0.774±.005 | 0.550±.019 | 2.829±.018 | 1.769±.021 |
| KIT | 2 | 2000 | 0.775±.005 | 0.428±.016 | 2.844±.019 | 1.645±.045 |
| KIT | 4 | 1000 | 0.781±.003 | 0.489±.021 | 2.823±.021 | 1.859±.044 |
| KIT | 6 | 1000 | 0.781±.004 | 0.465±.021 | 2.821±.019 | 1.839±.055 |
| KIT | 2 | 1000 | 0.783±.006 | 0.411±.019 | 2.809±.019 | 1.648±.040 |

Ablation study on the different model hyperparameters. In Tab. 4, we test the model performance with different hyperparameters. Consistent with the findings of MLD, increasing the number of tokens does not noticeably improve the generation quality. Appropriately increasing the size $2l+1$ of the finite set is beneficial to the generation results, but the gain is no longer significant when $l$ is larger than 1000.
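As an illustration of what a finite set of size $2l+1$ means for a single latent scalar, the sketch below bounds a value with `tanh` and snaps it to the nearest multiple of $1/l$. This tanh-and-round scheme is an assumption in the spirit of finite scalar quantization [28]; the function name `quantize` and the toy inputs are hypothetical, and the gradient handling needed for training (for example a straight-through estimator) is omitted.

```python
import math


def quantize(z, l):
    """Map a real value to one of 2l+1 evenly spaced levels in [-1, 1]."""
    bounded = math.tanh(z)          # squash into (-1, 1)
    return round(bounded * l) / l   # snap to the nearest multiple of 1/l


l = 2
# Quantize a sweep of toy inputs and collect the distinct levels that occur.
levels = sorted({quantize(x / 10.0, l) for x in range(-50, 51)})
print(levels)
print(len(levels) == 2 * l + 1)
```

With a small `l` the latent is coarse but strictly bounded; raising `l` refines the grid, which is consistent with the diminishing gains seen in Tab. 4 for large $l$.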

Table 5: Ablation study of different number of function evaluations.
| Dataset | NFE | R-Precision Top-3 $\uparrow$ | FID $\downarrow$ | MM-Dist $\downarrow$ | MModality $\uparrow$ |
| --- | --- | --- | --- | --- | --- |
| KIT | 1 | 0.777±.005 | 0.567±.002 | 2.865±.013 | 1.424±.040 |
| KIT | 3 | 0.781±.005 | 0.409±.014 | 2.812±.019 | 1.598±.037 |
| KIT | 5 | 0.783±.006 | 0.411±.019 | 2.809±.019 | 1.648±.040 |
| KIT | 8 | 0.783±.006 | 0.400±.015 | 2.810±.017 | 1.667±.051 |
| KIT | 10 | 0.786±.006 | 0.395±.015 | 2.795±.019 | 1.663±.049 |

Ablation study on the different sampling steps. Our generation results at different numbers of sampling steps are shown in Tab. 5. We obtain excellent results with few sampling steps; however, increasing the number of sampling steps beyond 15 brings no further gain in quality, which is a common limitation of consistency training.
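The relation between the number of sampling steps and the NFE can be sketched with the multistep consistency sampling procedure of [23]: one call maps prior noise to a sample, and every extra step re-noises that sample to a smaller time and maps it back. The sketch below is a toy under stated assumptions: the data are one-dimensional Gaussian, for which the PF-ODE solution map is known in closed form and stands in for a trained denoiser, and the constants `EPS`, `T_MAX`, `DATA_STD` and the time grids are hypothetical choices rather than our training settings.

```python
import math
import random

EPS, T_MAX, DATA_STD = 0.002, 80.0, 1.0   # assumed toy settings


def consistency_fn(x, t):
    """Exact PF-ODE solution map for 1-D Gaussian data N(0, DATA_STD^2).

    Stands in for a trained consistency model; satisfies f(x, EPS) = x.
    """
    return x * math.sqrt((DATA_STD**2 + EPS**2) / (DATA_STD**2 + t**2))


def multistep_sample(timesteps, rng):
    """Multistep consistency sampling; the NFE equals len(timesteps)."""
    nfe = 0
    x = rng.gauss(0.0, timesteps[0])        # draw from the prior at the largest time
    x = consistency_fn(x, timesteps[0])     # first jump to the data end
    nfe += 1
    for t in timesteps[1:]:
        # Re-noise the current sample to time t, then jump back again.
        x_t = x + math.sqrt(t**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(x_t, t)
        nfe += 1
    return x, nfe


rng = random.Random(0)
for grid in ([T_MAX], [T_MAX, 10.0, 1.0], [T_MAX, 20.0, 5.0, 1.0, 0.2]):
    sample, nfe = multistep_sample(grid, rng)
    print(nfe, sample)
```

In this toy the solution map is exact, so extra steps cannot improve the sample; with a learned model each additional step trades one more function evaluation for a chance to correct earlier errors.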

5.5 Time Cost

Table 6: Comparison of inference time with previous state-of-the-art models.
| Method | MDM | MLD | T2M-GPT | GraphMotion | Ours (NFE 5) | Ours (NFE 3) |
| --- | --- | --- | --- | --- | --- | --- |
| AST (s) | 7.5604 | 0.0786 | 0.2168 | 0.5417 | 0.0141 | 0.0098 |

The consistency training method we use does not require a pre-trained diffusion model, so training is inexpensive and can be run on a single RTX 4090 GPU. On the HumanML dataset, we train the encoder in 15 hours and the denoiser in 12 hours. Benefiting from the consistency sampling strategy, our inference time is also far lower than that of existing models: at NFE 5 it is about 5.6 times lower than MLD and more than 15 times lower than the remaining baselines. A more detailed time comparison is reported in Tab. 6.
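The speed-up factors implied by Tab. 6 can be recomputed directly from the reported AST values. The short script below is only this arithmetic; the dictionary names and the helper `speedup` are hypothetical, and the numbers are copied from the table.

```python
# AST values in seconds, copied from Tab. 6.
baselines = {
    "MDM": 7.5604,
    "MLD": 0.0786,
    "T2M-GPT": 0.2168,
    "GraphMotion": 0.5417,
}
ours = {"NFE 5": 0.0141, "NFE 3": 0.0098}


def speedup(baseline_time, our_time):
    """Return how many times faster our sampling is than the baseline."""
    return baseline_time / our_time


for setting, our_time in ours.items():
    for method, base_time in baselines.items():
        print(f"{method} vs ours ({setting}): {speedup(base_time, our_time):.1f}x")
```

The smallest factor is obtained against MLD, which already samples in a latent space, and the largest against MDM, which runs many denoising steps on raw motion sequences.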

6 Conclusion

In this paper, we propose a motion latent consistency training framework, called MLCT, for high-quality, few-step sampling. It encodes motion sequences of arbitrary length into representation tokens under quantization constraints and enforces consistency of the outputs along the same ODE trajectory to realize the latent diffusion pipeline. Inspired by classifier-free guidance, we propose a method called consistent trajectory offset for fast convergence of consistency training. We validate our model and each of its components through extensive experiments and achieve the best trade-off between performance and computational burden in a very small number of steps (around 10). Our approach can provide a reference for subsequent latent consistency model training on different tasks.

Limitation and Future Work. Our work still leaves room for improvement. First, we focus on few-step motion generation and do not discuss fine-grained motion control. Fortunately, our proposed method is a general diffusion model training framework with fewer sampling steps, so recent textual controllers (such as GraphMotion) can be integrated into it. Second, we note that, unlike common diffusion frameworks, consistency training fails to yield higher sampling quality as the number of steps increases. Overcoming this difficulty is the main focus of our future work.

References

  • [1] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022.
  • [2] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. In International Conference on Learning Representations, 2023.
  • [3] Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.
  • [4] Haoye Cai, Chunyan Bai, Yu-Wing Tai, and Chi-Keung Tang. Deep video generation, prediction and completion of human action sequences. In Proceedings of the European conference on computer vision (ECCV), pages 366–382, 2018.
  • [5] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022.
  • [6] Mathis Petrovich, Michael J Black, and Gül Varol. Temos: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision, pages 480–497. Springer, 2022.
  • [7] Hanyang Kong, Kehong Gong, Dongze Lian, Michael Bi Mi, and Xinchao Wang. Priority-centric human motion generation in discrete latent space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14806–14816, 2023.
  • [8] Peng Jin, Yang Wu, Yanbo Fan, Zhongqian Sun, Yang Wei, and Li Yuan. Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs. arXiv preprint arXiv:2311.01015, 2023.
  • [9] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  • [10] Taeryung Lee, Gyeongsik Moon, and Kyoung Mu Lee. Multiact: Long-term 3d human motion generation from multiple action labels. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1231–1239, 2023.
  • [11] Liang Xu, Ziyang Song, Dongliang Wang, Jing Su, Zhicheng Fang, Chenjing Ding, Weihao Gan, Yichao Yan, Xin Jin, Xiaokang Yang, et al. Actformer: A gan-based transformer towards general action-conditioned 3d human motion generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2228–2238, 2023.
  • [12] Buyu Li, Yongchi Zhao, Shi Zhelun, and Lu Sheng. Danceformer: Music conditioned 3d dance generation with parametric motion transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1272–1279, 2022.
  • [13] Kunkun Pang, Dafei Qin, Yingruo Fan, Julian Habekost, Takaaki Shiratori, Junichi Yamagishi, and Taku Komura. Bodyformer: Semantics-guided 3d body gesture synthesis with transformer. ACM Transactions on Graphics (TOG), 42(4):1–12, 2023.
  • [14] Chaitanya Ahuja and Louis-Philippe Morency. Language2pose: Natural language grounded pose forecasting. In 2019 International Conference on 3D Vision (3DV), pages 719–728. IEEE, 2019.
  • [15] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2021–2029, 2020.
  • [16] Mathis Petrovich, Michael J Black, and Gül Varol. Action-conditioned 3d human motion synthesis with transformer vae. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10985–10995, 2021.
  • [17] Zhenyi Wang, Ping Yu, Yang Zhao, Ruiyi Zhang, Yufan Zhou, Junsong Yuan, and Changyou Chen. Learning diverse stochastic human-action generators by learning smooth latent transitions. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 12281–12288, 2020.
  • [18] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual descriptions with discrete representations. arXiv preprint arXiv:2301.06052, 2023.
  • [19] Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. 2023.
  • [20] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  • [21] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations, 2022.
  • [22] Yilun Xu, Ziming Liu, Max Tegmark, and Tommi Jaakkola. Poisson flow generative models. Advances in Neural Information Processing Systems, 35:16782–16795, 2022.
  • [23] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, 2023.
  • [24] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  • [25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [26] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [27] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
  • [28] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023.
  • [29] Matthias Plappert, Christian Mandery, and Tamim Asfour. The kit motion-language dataset. Big data, 4(4):236–252, 2016.
  • [30] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019.
  • [31] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • [32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.