Efficient Text-driven Motion Generation via Latent Consistency Training

Mengxian Hu, Tongji University
Minghao Zhu, Tongji University
Xun Zhou, Tongji University
Qingqing Yan, Tongji University
Shu Li, Tongji University
Chengju Liu, Tongji University
Qijun Chen, Tongji University

Abstract

Motion diffusion models have recently proven successful for text-driven human motion generation. Despite their excellent generation performance, they are difficult to run in real time because their multi-step sampling mechanism involves tens or hundreds of repeated function evaluations. To this end, we investigate Motion Latent Consistency Training (MLCT) for motion generation to reduce the computation and time consumed during iterative inference. It applies diffusion pipelines to low-dimensional motion latent spaces to mitigate the computational burden of each function evaluation. Interpreting the diffusion process through probability flow ordinary differential equation (PF-ODE) theory, MLCT enables inference from the prior distribution to the motion latent representation distribution in very few steps by maintaining consistency of the outputs along the PF-ODE trajectory. In particular, we introduce a quantization constraint to optimize motion latent representations that are bounded, regular, and well-reconstructed compared to traditional variational constraints. Furthermore, we propose a conditional PF-ODE trajectory simulation method, which improves conditional generation performance with minimal additional training cost. Extensive experiments on two human motion generation benchmarks show that the proposed model achieves state-of-the-art performance with less than 10% of the time cost.

Keywords quantized representation $\cdot$ latent consistency training $\cdot$ motion generation

1 Introduction

Synthesizing human motion sequences under specified conditions is a fundamental task in robotics and virtual reality. Research in recent years has explored the text-to-motion diffusion framework [1, 2, 3] to generate realistic and diverse motions, which gradually recovers the motion representation from a prior distribution with multiple iterations. These works show more stable distribution estimation and stronger controllability than traditional single-step methods (e.g., GANs [4] or VAEs [5, 6]), but at the cost of a hundredfold increase in computational burden. Such a high-cost sampling mechanism is expensive in time and memory, limiting the model’s accessibility in real-time applications.

To mitigate inference cost, previous text-to-motion diffusion frameworks trade off fidelity against efficiency from two perspectives: i) mapping length-varying, high-dimensional motion sequences into well-reconstructed, low-dimensional motion latent representations [3, 7] to reduce data redundancy and complexity, and ii) using skip-step sampling strategies [3, 8] to minimize expensive and repetitive function evaluations. For the first perspective, inspired by the excellent performance of the latent diffusion model in text-to-image synthesis, these works introduce a variational autoencoder with Kullback-Leibler (KL) divergence constraints as the motion representation extractor. However, unlike image datasets, which contain more than ten million samples, the high cost of motion capture limits the number of samples available for text-based motion generation. As an example, the largest current human motion dataset contains no more than fifteen thousand samples even after data augmentation. Simultaneously optimizing the reconstruction loss and the KL divergence loss, which are competing objectives, is significantly challenging with such limited training resources. To ensure high reconstruction performance, previous state-of-the-art models usually set the KL divergence weight very low, which results in low regularity of the motion representations. Such low-regularity, continuous motion representations suffer from redundancy and low robustness. This can be mitigated by a sufficiently large number of function evaluations, but it seriously harms generative performance when only a few sampling steps are used. The second perspective follows from recent well-established diffusion solvers, which can be categorized as training-free methods and training-based methods.
Previous studies confirm that the forward diffusion process corresponds to a reverse diffusion process without a stochastic term, known as the probability flow ordinary differential equation (PF-ODE) [9]. Training-free methods construct different discrete solvers for the special form of the PF-ODE, achieving almost a 20-fold efficiency improvement. These works effectively compress sampling to 50-100 steps, but the fidelity of the ODE solution degrades when the number of iterations is much smaller, due to the complexity of the probability distribution of motion sequences and the cumulative error of discrete ODE sampling. A significant gap in computational effort therefore remains compared to traditional single-step motion generation models. Training-based methods usually rely on model distillation or trajectory distillation, and one promising approach is the consistency model. It imposes constraints on the model to keep its outputs consistent along the same PF-ODE trajectory, thus achieving a single-step or multi-step generative mapping from the prior distribution to the target distribution. Typical PF-ODE trajectory generation methods are consistency distillation, which generates trajectories with pre-trained diffusion models, and consistency training, which simulates trajectories with an unbiased estimate based on the ground truth. The former relies on well-trained diffusion models as foundation models, and training these models from scratch is computationally expensive and time-consuming. Less costly consistency training frameworks avoid additional pre-trained models, but they suffer from poor generation performance and even training collapse due to redundant and irregular latent representations. Moreover, existing consistency training frameworks have not sufficiently explored conditional PF-ODE trajectories. As a result, vanilla consistency-training-based models show no significant advantage over well-established multi-step diffusion samplers using classifier-free guidance.

To address the above limitations, we propose a Motion Latent Consistency Training (MLCT) framework that generates high-quality motions with no more than 5 sampling steps. Following the common latent space modeling paradigm, our motivation focuses on constructing low-dimensional and regular motion latent representations, as well as exploring the simulation of conditional PF-ODE trajectories with the consistency training model in the absence of pre-trained models. Specifically, the first contribution of this paper is to introduce a pixel-like latent autoencoder with quantization constraints, which aggregates motion information of arbitrary length into multiple latent representation tokens via self-attention. It differs significantly from the widely used variational representations: our representation is bounded and discrete, while variational representations are unbounded and continuous. We restrict the representation boundaries with the hyperbolic tangent (tanh) function and force the continuous representation to map to the nearest predefined clustering center. Compared to the black-box control strategy of fine-tuning the KL divergence weight, our approach trades off the regularity and reconstruction performance of the motion latent representations more controllably by designing a finite-dimensional discrete latent representation space. In addition, previous practice demonstrates that the boundedness of the representations helps sustain stable inference with classifier-free guidance (CFG). The second contribution of this paper is to explore a one-stage conditionally guided consistency training framework. The main insight is to treat the unbiased estimate based on ground truth motion representations as the simulation of a conditional probability gradient and to propose an online updating mechanism for the unconditional probability gradient. To the best of our knowledge, this is the first application of classifier-free guidance to consistency training. Since the guidance is used only for generating training trajectories, the denoiser does not need the doubled computation of conditional and unconditional evaluations at inference to obtain better conditional generation results.

Figure 1: Our model achieves better FID metrics with less inference time and allows for the generation of high-quality human motions based on textual prompts in around 5 NFE. The color of humans darkens over time.

We evaluate the proposed framework on two widely used datasets: KIT and HumanML3D. The results of our generation with 1, 3, and 5 function evaluations (NFE) are shown in Figure 1, along with the differences in FID metrics compared with existing methods. Extensive experiments indicate the effectiveness of MLCT and its components. The proposed framework achieves state-of-the-art performance in motion generation in only around 5 steps.

To sum up, the contributions of this paper are as follows:

• We explore a pixel-like motion latent representation relying on quantization constraints, which is highly regular, well-reconstructed, and bounded.

• We introduce classifier-free guidance in consistency training for the first time. It helps realize more controllable motion generation as well as more stable training convergence.

• Our proposed MLCT achieves state-of-the-art performance on two challenging datasets with very few sampling steps.

2 Related Work

Human motion generation. Human motion generation aims to synthesize human motion sequences under specified conditions, such as action categories [10, 11], audio [12, 13], and textual descriptions [14, 2, 3]. In the past few years, numerous works have investigated motion generation with various generative frameworks. For example, VAE-based models [15, 16, 5] represent the motion as a set of Gaussian distributions and constrain its regularity with the KL divergence. Such a constraint allows them to reconstruct motion information from the standard normal distribution, yet the results are often ambiguous. GAN-based methods [17, 4] achieve better performance by bypassing direct estimation of probabilistic likelihoods via adversarial training, but the adversarial property often makes their training unstable and prone to mode collapse. Some multi-step generative methods have emerged recently with great success, such as auto-regressive [18, 19] and diffusion methods [1, 2, 3]. In particular, the latter are gradually dominating the research frontier due to their stable distribution estimation capability and high-quality sampling results. MotionDiffuse [1] and MDM [2] were the pioneers in implementing diffusion frameworks for motion generation. MLD [3] performs diffusion in the latent space, which significantly improves efficiency. M2DM [7] represents motion as discrete features and performs the diffusion process in a finite state space, achieving state-of-the-art performance. Some recent work [8] has focused on more controllable generation with equally excellent results. These works validate the outstanding capabilities of the motion diffusion framework and continue to receive attention.

Efficient diffusion sampling. Efficient sampling is the primary challenge for diffusion frameworks aimed at real-time generation tasks. DDIM [20] relaxes the Markov condition of the original diffusion framework and achieves a 20-fold improvement in computational efficiency. The score-based method [9] from the same period relates the diffusion framework to a stochastic differential equation and notes that it has a special form known as the probability flow ODE. This is a milestone achievement. It guides subsequent works either to steer a simplified diffusion process through a specially designed form of ODE [21, 22, 23], or to skip a sufficiently large number of sampling steps via more sophisticated higher-order ODE solvers [24]. In addition, the diffusion process can be executed in lower-dimensional and more regular latent spaces, thus reducing the computational burden of each step [25]. While these works have proven effective in computer vision, they have received only limited attention in motion diffusion frameworks. Previous state-of-the-art methods such as MLD [3] and GraphMotion [8] use VAE-based representations and DDIM sampling strategies. Precise and robust motion representation and efficient motion diffusion design remain open problems.

Consistency model. Consistency modeling is a novel and flexible diffusion sampling framework that allows the model to trade off between an extremely small number of steps and generation quality. Latent consistency models extend consistency distillation to the latent representation space, saving memory and further improving inference efficiency. Subsequently, VideoLCM applies consistency distillation to video generation. Recent approaches have also investigated the application of LoRA and ControlNet to consistency modeling with impressive results. These methods rely on a strong teacher model as the distillation target, and training such a teacher from scratch requires not only a large dataset but also substantial computational resources. To reduce the training cost, ICM further explores and improves consistency training methods to obtain performance similar to consistency distillation without pre-trained models. However, it is still limited to the original pixel representation space of fixed dimensions and is applied to variance-exploding ODE frameworks. Consistency training methods for broader diffusion strategies in the latent representation space remain underexplored.

3 Preliminaries

In this section, we briefly introduce diffusion and consistency models.

3.1 Score-based Diffusion Models

The diffusion model [26] is a generative model that gradually injects Gaussian noise into the data and then generates samples from the noise through a reverse denoising process. Specifically, it gradually transforms the data distribution $p_{data}(x_{0})$ into an easily sampled prior distribution $p(x_{T})$ via a Gaussian perturbation kernel $p(x_{t}|x_{0})=\mathcal{N}(x_{t}|\alpha_{t}x_{0},\sigma_{t}^{2}I)$, where $\alpha_{t}$ and $\sigma_{t}$ specify the noise schedule. Recent studies have formalized it in continuous time as a stochastic differential equation (SDE),

$$dx_{t}=f(t)x_{t}\,dt+g(t)\,dw_{t},\tag{1}$$

where $t\in[\epsilon,T]$, $\epsilon$ and $T$ are fixed positive constants, $w_{t}$ denotes the standard Brownian motion, and $f$ and $g$ are the drift and diffusion coefficients, respectively, given by

$$f(t)=\frac{d\log\alpha_{t}}{dt},\quad g^{2}(t)=\frac{d\sigma_{t}^{2}}{dt}-2\frac{d\log\alpha_{t}}{dt}\sigma_{t}^{2}.\tag{2}$$

Previous work has revealed that the reverse process of Eq. 1 shares the same marginal probabilities with the probability flow ODE:

$$dx_{t}=\Big[f(t)x_{t}-\frac{1}{2}g^{2}(t)\nabla_{x_{t}}\log p(x_{t})\Big]dt,\tag{3}$$

where $\nabla_{x_{t}}\log p(x_{t})$ is the score function, which is the only unknown term in the sampling pipeline. An effective approach is to train a time-dependent score network $\mathcal{S}_{\theta}(x_{t},t)$ to estimate $\nabla_{x_{t}}\log p(x_{t})$ based on conditional score matching, parameterized as the prediction of the noise or of the initial value in forward diffusion. Eq. 3 can then be solved in a finite number of steps by any numerical ODE solver, such as the Euler [9] and Heun [27] solvers.
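To make the role of the ODE solver concrete, the following is a minimal sketch of Euler integration of Eq. 3 on a toy problem where the score is known in closed form. It is not the paper's sampler: the 1-D Gaussian data, its standard deviation `DATA_STD`, the linear variance-preserving schedule, and the constants `BETA_0` and `BETA_1` are all assumptions chosen for illustration. For this schedule $f(t)=-\beta(t)/2$ and $g^{2}(t)=\beta(t)$, and the exact ODE solution rescales $x_{T}$ by the ratio of marginal standard deviations, which gives a reference value to compare against.

```python
import math

BETA_0, BETA_1 = 0.1, 20.0  # assumed schedule constants (not taken from the paper)
DATA_STD = 0.5              # assumed std of the toy 1-D Gaussian data


def beta(t):
    return BETA_0 + t * (BETA_1 - BETA_0)


def alpha_sq(t):
    # alpha_t^2 for the variance-preserving schedule; sigma_t^2 = 1 - alpha_t^2
    return math.exp(-0.5 * t * t * (BETA_1 - BETA_0) - t * BETA_0)


def marginal_var(t):
    # Var[x_t] when x_0 ~ N(0, DATA_STD^2)
    a2 = alpha_sq(t)
    return a2 * DATA_STD ** 2 + (1.0 - a2)


def score(x, t):
    # Exact score of the Gaussian marginal N(0, marginal_var(t))
    return -x / marginal_var(t)


def pf_ode_drift(x, t):
    # Right-hand side of Eq. 3 with f(t) = -beta(t)/2 and g^2(t) = beta(t)
    return -0.5 * beta(t) * x - 0.5 * beta(t) * score(x, t)


def euler_solve(x_T, T=1.0, eps=0.002, steps=5000):
    # Integrate backwards in time from T to eps with a fixed negative step
    x, t = x_T, T
    dt = (eps - T) / steps
    for _ in range(steps):
        x += pf_ode_drift(x, t) * dt
        t += dt
    return x


x_eps = euler_solve(1.0)
exact = math.sqrt(marginal_var(0.002) / marginal_var(1.0))
print(x_eps, exact)
```

With thousands of steps the Euler result tracks the closed-form value closely; the motivation for consistency models is precisely that such step counts are too expensive when each step is a network evaluation.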

3.2 Consistency Models

Theoretically, the reverse process expressed by Eq. 3 is deterministic, and the consistency model (CM) [23] achieves one-step or few-step generation by pulling together the outputs on the same ODE trajectory. It is more formally expressed as,

$$\mathcal{S}_{\theta}(x_{t},t)=\mathcal{S}_{\theta}(x_{t^{\prime}},t^{\prime})\approx\mathcal{S}_{\theta}(x_{\epsilon},\epsilon)\quad\forall t,t^{\prime}\in[\epsilon,T],\tag{4}$$

which is known as the self-consistency property. To maintain the boundary conditions, existing consistency models are commonly parameterized by skip connections, i.e.,

$$\mathcal{S}_{\theta}(x_{t},t):=c_{skip}(t)x_{t}+c_{out}(t)\hat{\mathcal{S}}_{\theta}(x_{t},t),\tag{5}$$

where $c_{skip}(t)$ and $c_{out}(t)$ are differentiable functions satisfying $c_{skip}(\epsilon)=1$ and $c_{out}(\epsilon)=0$. To stabilize training, the consistency model maintains a target model $\mathcal{S}_{\theta^{-}}$, whose parameters are updated as an exponential moving average (EMA) with decay rate $\gamma$, that is, $\theta^{-}\leftarrow\gamma\theta^{-}+(1-\gamma)\theta$. The consistency loss can be formulated as,

$$\mathcal{L}_{cm}(\theta,\theta^{-})=\mathbb{E}_{x,t}\Big[d\Big(\mathcal{S}_{\theta}(x_{t_{n+1}},t_{n+1}),\mathcal{S}_{\theta^{-}}(\hat{x}_{t_{n}},t_{n})\Big)\Big],\tag{6}$$

where $d(\cdot,\cdot)$ is a metric function such as the mean squared error or the pseudo-Huber metric, and $\hat{x}_{t_{n}}$ is a one-step estimate from $x_{t_{n+1}}$ obtained by applying an ODE solver to Eq. 3.
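The EMA rule for the target parameters can be sketched in a few lines. This is only an illustration of the update $\theta^{-}\leftarrow\gamma\theta^{-}+(1-\gamma)\theta$ on plain Python lists; the function name `ema_update` and the decay value used in the example are assumptions, not values from the paper.

```python
def ema_update(target_params, online_params, gamma):
    """Return gamma * theta_minus + (1 - gamma) * theta, parameter by parameter."""
    return [gamma * tm + (1.0 - gamma) * th
            for tm, th in zip(target_params, online_params)]


theta_minus = [0.0, 1.0]  # target parameters (toy values)
theta = [1.0, 1.0]        # online parameters (held fixed here for clarity)
for _ in range(3):
    theta_minus = ema_update(theta_minus, theta, gamma=0.5)
print(theta_minus)  # [0.875, 1.0]
```

The target parameters drift toward the online ones at a rate set by $\gamma$; a value close to 1 makes the target in Eq. 6 change slowly, which is what stabilizes training.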

4 Motion Latent Consistency Training Framework

In this section, we discuss two critical targets. The first is encoding motions of arbitrary length into low-dimensional, regularized motion latent representations so that all motions share the same latent dimensions. The second is introducing the conditional PF-ODE into the less costly consistency training framework for few-step, high-quality latent representation sampling. To this end, we propose a Motion Latent Consistency Training (MLCT) framework, as shown in Figure 2. It consists of an autoencoder with quantization constraints, which is used to learn various motion representations in low-dimensional and regularized latent spaces (details in Section 4.1), and a denoising network, which is used to capture the corresponding latent state distributions and to implement few-step sampling (details in Section 4.2).

4.1 Encoding Motion as Quantized Latent Representation

We construct an autoencoder $\mathcal{G}=\{\mathcal{E},\mathcal{D}\}$ with a transformer-based architecture to encode motion sequences $x$ into latent motion representations $z$ and to reconstruct them. The core insight is that each dimension of $z$ is sampled from a finite set $\mathcal{M}$ of size $2l+1$ as follows,

$$\mathcal{M}=\{z_{i};-1,\cdots,-j/l,\cdots,0,\cdots,j/l,\cdots,1\}_{j=0}^{l}.\tag{7}$$

To this end, we denote $z\in\mathbb{R}^{n\times d}$ as $n$ learnable tokens of dimension $d$, which aggregate the motion sequence features via attention. Inspired by recent quantization work [28], we apply a hyperbolic tangent (tanh) function to the output of the encoder $\mathcal{E}$ to constrain the boundaries of the representation, and then quantize the result with a rounding operator $\mathcal{R}$. Furthermore, because rounding is not differentiable, the gradient of the quantized values is approximated by the gradient of the pre-quantization values so that backpropagation proceeds normally. The latent representations $z$ are computed as follows,

$$z=\mathcal{R}\Big(l\cdot\tanh(\mathcal{E}(x))\Big)/l.\tag{8}$$
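The forward pass of Eq. 8 can be sketched directly. The snippet below is an illustration on a plain list of encoder outputs, not the authors' implementation; the function name `quantize` and the small value $l=10$ are assumptions chosen so the grid is easy to read (Section 5.2 reports $l=1000$). Only the forward computation is shown; in a training framework the rounding would additionally be treated as the identity in the backward pass, as described above.

```python
import math


def quantize(encoder_output, l):
    """Eq. 8: bound each entry with tanh, then snap it to the grid {j/l : j = -l..l}."""
    return [round(l * math.tanh(v)) / l for v in encoder_output]


z = quantize([-3.2, -0.0004, 0.25, 7.0], l=10)
print(z)  # [-1.0, 0.0, 0.2, 1.0]
```

Every output lies in $[-1, 1]$ and on one of $2l+1$ grid points, regardless of how large the encoder output is, which is the boundedness and regularity property the section relies on.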

The standard optimization target is to reconstruct motion information from $z$ with the decoder $\mathcal{D}$, i.e., to optimize the smooth $l_{1}$ loss,

$$\mathcal{L}_{z}=\mathbb{E}_{x}\Big[d\Big(x,\mathcal{D}(z)\Big)+\lambda_{j}d\Big(\mathcal{J}(x),\mathcal{J}(\mathcal{D}(z))\Big)\Big],\tag{9}$$

where $\mathcal{J}$ is a function to transform features such as joint rotations into joint coordinates, and it is also applied in MLD [3] and GraphMotion [8]. $\lambda_{j}$ is a balancing term.

Compared with traditional VAEs, the optimization target in Eq. 9 does not contain a competing divergence term. A well-trained autoencoder $\mathcal{G}$ outputs bounded and regular motion latent representations, which in turn improves the solution space of the denoising network. Experimentally, we found that this improvement is important for the convergence of consistency training.

4.2 Few Step Motion Generation via Consistency Training

For conditional motion generation, classifier-free guidance (CFG) is crucial for synthesizing high-fidelity samples in most successful motion diffusion models, such as MLD or GraphMotion. Previous work introduced CFG into consistency distillation, demonstrating the feasibility of the consistency model on conditional PF-ODE trajectories. However, such methods rely on powerful pre-trained teacher models, which not only involve additional training costs but also limit performance through distillation errors. Therefore, we are motivated to simulate CFG more efficiently from the original motion latent representation, following the consistency training framework, to alleviate the computational burden.

The diffusion stage of MLCT begins with the variance-preserving schedule [9] to perturb the motion latent representations $x_{\epsilon}=z$ with the perturbation kernel $\mathcal{N}(x_{t};\alpha(t)x_{0},\sigma^{2}(t)I)$,

$$\alpha(t):=e^{-\frac{1}{4}t^{2}(\beta_{1}-\beta_{0})-\frac{1}{2}t\beta_{0}},\quad\sigma(t):=\sqrt{1-\alpha^{2}(t)}.\tag{10}$$
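As an illustration of the perturbation step, the sketch below evaluates a variance-preserving schedule and applies the kernel to a latent vector. It assumes the variance-preserving identity $\alpha^{2}(t)+\sigma^{2}(t)=1$; the values of $\beta_{0}$ and $\beta_{1}$ are not given in the text, so the defaults `beta_0=0.1` and `beta_1=20.0` are placeholders, and `perturb` is a hypothetical helper name.

```python
import math


def alpha(t, beta_0=0.1, beta_1=20.0):
    # Signal coefficient of the variance-preserving schedule (beta values are assumed)
    return math.exp(-0.25 * t * t * (beta_1 - beta_0) - 0.5 * t * beta_0)


def sigma(t, beta_0=0.1, beta_1=20.0):
    # Noise coefficient, using alpha^2 + sigma^2 = 1
    return math.sqrt(1.0 - alpha(t, beta_0, beta_1) ** 2)


def perturb(z, t, noise):
    """Draw x_t from N(alpha(t) * z, sigma(t)^2 I) given a standard normal `noise`."""
    a, s = alpha(t), sigma(t)
    return [a * zi + s * ni for zi, ni in zip(z, noise)]


for t in (0.002, 0.5, 1.0):
    print(t, alpha(t), sigma(t))
```

Under these assumed constants the latent is almost untouched at $t=\epsilon$ and almost pure noise at $t=T$, so the prior at $T$ is close to a standard normal.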

The consistency model $\mathcal{S}_{\theta}$ is constructed to predict $x_{\epsilon}$ from the perturbed $x_{t}$ on a given PF-ODE trajectory. To maintain the boundary condition $\mathcal{S}_{\theta}(x_{\epsilon},\epsilon,c)=x_{\epsilon}$, we employ the same skip setting for Eq. 5 as in the latent consistency model (LCM), which is parameterized as follows:

$$\mathcal{S}_{\theta}(x_{t},t,c):=\frac{\eta^{2}}{(10t)^{2}+\eta^{2}}\cdot x_{t}+\frac{10t}{\sqrt{(10t)^{2}+\eta^{2}}}\cdot\widetilde{\mathcal{S}}_{\theta}(x_{t},t,c),\tag{11}$$

where $\widetilde{\mathcal{S}}_{\theta}$ is a transformer-based network and $\eta$ is a hyperparameter, usually set to 0.5. Following the self-consistency property (detailed in Eq. 4), the model $\mathcal{S}_{\theta}$ has to keep its output at the given perturbed state $x_{t}$ consistent with its output at the previous state $\widetilde{x}_{t-\Delta t}$ on the same ODE trajectory. The latter can be estimated via the DPM-Solver++ solver:

$$\widetilde{x}_{t-\Delta t}\approx\frac{\sigma_{t-\Delta t}}{\sigma_{t}}\cdot x_{t}-\alpha_{t-\Delta t}\cdot\Big(\frac{\alpha_{t}\cdot\sigma_{t-\Delta t}}{\alpha_{t-\Delta t}\cdot\sigma_{t}}-1\Big)\cdot x_{\epsilon}^{\Phi},\tag{12}$$

where $x_{\epsilon}^{\Phi}$ is the estimate of $x_{\epsilon}$ under a given sampling strategy. In particular, $x_{\epsilon}^{\Phi}$ can be parameterized as a linear combination of conditional and unconditional latent representation predictions following the CFG strategy, i.e.,

$$x_{\epsilon}^{\Phi}(x_{t},t,c)=(1+\omega)\cdot\mathcal{F}_{\theta}(x_{t},t,c)-\omega\cdot\mathcal{F}_{\theta}(x_{t},t,\emptyset),\tag{13}$$

where $\mathcal{F}_{\theta}(\cdot)$ is a well-trained, $x_{\epsilon}$-prediction-based motion diffusion model.

It is worth noting that $x_{\epsilon}$ can be utilized to simulate $\mathcal{F}_{\theta}(x_{t},t,c)$ as used in the vanilla consistency training pipeline. Furthermore, $\mathcal{F}_{\theta}(x_{t},t,\emptyset)$ can be replaced by $\mathcal{S}_{\theta}(x_{t},t,\emptyset)$ with online updating. Thus Eq. 13 can be rewritten as:

$$x_{\epsilon}^{\Phi}(x_{t},t,c)=(1+\omega)\cdot x_{\epsilon}-\omega\cdot\mathcal{S}_{\theta}(x_{t},t,\emptyset).\tag{14}$$
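The trajectory simulation described above can be sketched as two small functions: one forming the guided target of Eq. 14 and one stepping from $t$ to $t-\Delta t$. This is a sketch under stated assumptions rather than the authors' code. The step is written in the standard first-order data-prediction form (equivalent to a deterministic DDIM step), the schedule constants `beta_0` and `beta_1` are placeholders, and the names `guided_target` and `previous_state` are hypothetical. The unconditional prediction is passed in as a plain list; in training it would come from the online model $\mathcal{S}_{\theta}(x_{t},t,\emptyset)$.

```python
import math


def alpha(t, beta_0=0.1, beta_1=20.0):
    # Variance-preserving signal coefficient (beta values are assumed)
    return math.exp(-0.25 * t * t * (beta_1 - beta_0) - 0.5 * t * beta_0)


def sigma(t):
    return math.sqrt(1.0 - alpha(t) ** 2)


def guided_target(x_eps, uncond_pred, omega):
    """Eq. 14: (1 + omega) * x_eps - omega * S_theta(x_t, t, null)."""
    return [(1.0 + omega) * xe - omega * up for xe, up in zip(x_eps, uncond_pred)]


def previous_state(x_t, t, dt, x_phi):
    """One first-order data-prediction step from t to t - dt along the PF-ODE."""
    a_t, s_t = alpha(t), sigma(t)
    a_p, s_p = alpha(t - dt), sigma(t - dt)
    ratio = s_p / s_t
    return [ratio * xt + (a_p - a_t * ratio) * xp for xt, xp in zip(x_t, x_phi)]


x_eps = [0.2, -0.7]   # clean latent (toy values)
noise = [0.3, -1.1]   # the noise used to build x_t
t, dt = 0.5, 0.01
x_t = [alpha(t) * z + sigma(t) * n for z, n in zip(x_eps, noise)]

# If the unconditional prediction equals x_eps, guidance has no effect.
x_phi = guided_target(x_eps, uncond_pred=x_eps, omega=1.5)
x_prev = previous_state(x_t, t, dt, x_phi)
print(x_prev)
```

When the target equals the true clean latent, the step returns exactly the point with the same noise at the lower noise level, $\alpha_{t-\Delta t}x_{\epsilon}+\sigma_{t-\Delta t}\,\text{noise}$. Guidance moves the target away from the unconditional prediction, which bends the simulated trajectory toward the condition without requiring a pre-trained teacher.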

The optimization objective of the consistency model $\mathcal{S}_{\theta}$ is,

$$\mathcal{L}_{c}=\mathbb{E}_{x,t}\Big[\frac{1}{\Delta t}d\Big(\mathcal{S}_{\theta}(x_{t},t,c),\mathcal{S}_{\theta^{-}}(\widetilde{x}_{t-\Delta t},t-\Delta t,c)\Big)+\lambda_{c}d\Big(\mathcal{S}_{\theta}(x_{t},t,\emptyset),x_{\epsilon}\Big)\Big],\tag{15}$$

where $d(x,y)=\sqrt{(x-y)^{2}+\gamma^{2}}-\gamma$ is the pseudo-Huber metric, $\gamma$ is a constant, and $\lambda_{c}$ is a balancing term. The target network $\mathcal{S}_{\theta^{-}}$ is updated after each iteration via EMA.
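Putting Eq. 11 and Eq. 15 together, a single-sample loss evaluation can be sketched as below. This is a toy illustration, not the training code: the networks are stand-in Python callables, the pseudo-Huber metric is averaged over elements (an assumption), and the values `gamma=0.03` and `lambda_c=1.0` are placeholders that the text does not specify. In a real framework the target branch would also be excluded from gradient computation.

```python
import math


def c_skip(t, eta=0.5):
    return eta ** 2 / ((10.0 * t) ** 2 + eta ** 2)


def c_out(t, eta=0.5):
    return 10.0 * t / math.sqrt((10.0 * t) ** 2 + eta ** 2)


def consistency_fn(net, x_t, t, cond):
    """Eq. 11: skip-connected wrapper around the raw network output."""
    raw = net(x_t, t, cond)
    return [c_skip(t) * x + c_out(t) * r for x, r in zip(x_t, raw)]


def pseudo_huber(x, y, gamma=0.03):
    # Element-wise pseudo-Huber metric, averaged over dimensions (assumed reduction)
    return sum(math.sqrt((a - b) ** 2 + gamma ** 2) - gamma
               for a, b in zip(x, y)) / len(x)


def mlct_loss(online_net, target_net, x_t, x_prev, t, dt, cond, x_eps, lambda_c=1.0):
    """Eq. 15 for one sample: consistency term plus unconditional regression term."""
    online_out = consistency_fn(online_net, x_t, t, cond)
    target_out = consistency_fn(target_net, x_prev, t - dt, cond)
    uncond_out = consistency_fn(online_net, x_t, t, None)  # None stands for the null condition
    return (pseudo_huber(online_out, target_out) / dt
            + lambda_c * pseudo_huber(uncond_out, x_eps))


def toy_net(x, t, cond):
    # Stand-in for the transformer denoiser: always predicts zeros
    return [0.0 for _ in x]


loss = mlct_loss(toy_net, toy_net, [0.4, -0.2], [0.41, -0.21],
                 t=0.5, dt=0.01, cond="walk", x_eps=[0.2, -0.7])
print(loss)
```

Note that with $\eta=0.5$ the skip coefficient at $t=\epsilon=0.002$ is about 0.998 rather than exactly 1, so the boundary condition holds approximately under this parameterization.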

5 Experiments

5.1 Datasets and Metrics

Datasets. We evaluate the proposed framework on two mainstream benchmarks for text-driven motion generation, KIT [29] and HumanML3D [5]. The former contains 3,911 motions and their corresponding 6,363 natural language descriptions. The latter is currently the largest 3D human motion dataset, comprising the HumanAct12 [15] and AMASS [30] datasets and containing 14,616 motions and 44,970 descriptions.

Evaluation Metrics. Consistent with previous work, we evaluate the proposed framework in four respects. (a) Motion quality: we use the Fréchet inception distance (FID) to evaluate the distance in feature distribution between the generated data and the real data. (b) Condition matching: we first employ R-precision to measure the correlation between the text description and the generated motion sequence and record the top-$k$ matching accuracy for $k=1,2,3$. Then, we further calculate the distance between motions and texts with the multi-modal distance (MM Dist). (c) Motion diversity: we compute differences between features with the diversity metric and then measure generative diversity for the same text input using the multimodality (MM) metric. (d) Computational burden: we first use the number of function evaluations (NFE) to evaluate generation performance with few sampling steps. Then, we report the average sampling time (AST) of a single sample.

5.2 Implementation Details

Model Configuration. The motion autoencoder $\{\mathcal{E},\mathcal{D}\}$ and the score network $\mathcal{S}$ both use the transformer architecture with long skip connections [31], which is also used in MLD [3]. Specifically, both the encoder $\mathcal{E}$ and decoder $\mathcal{D}$ contain 7 layers of transformer blocks with input dimension 256, and each block contains 3 learnable tokens. The size of the finite set $\mathcal{M}$ is set to 2001, i.e., $l=1000$. The score network $\mathcal{S}$ contains 15 layers of transformer blocks with input dimension 512. The frozen CLIP-ViT-L-14 model [32] is used as the text encoder. It encodes the text into a pooled output $w\in\mathbb{R}^{1\times 256}$, which is then projected as a text embedding and summed with the time embedding before the input of each block.

Training Configuration. We discretize the diffusion time horizon $[\epsilon,T]$ into $N-1$ sub-intervals, with $\epsilon=0.002$, $T=1$, and $N=1000$. We follow the consistency model [23] to determine $t_{i}=(\epsilon^{1/\rho}+\frac{i-1}{N-1}(T^{1/\rho}-\epsilon^{1/\rho}))^{\rho}$, where $\rho=2$. To balance training, we set $\lambda_{j}$ to 0.001. All the proposed models are trained with the AdamW optimizer with a learning rate of $10^{-4}$ on a single RTX 4090 GPU. The mini-batch sizes are 64 and 128 for the autoencoder and the denoising network, respectively, and the two networks are trained for 1500 and 2000 epochs, respectively.
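The time discretization above can be computed directly from the stated formula and constants. The sketch below is a straightforward transcription; only the function name `time_grid` is an assumption.

```python
def time_grid(n=1000, eps=0.002, T=1.0, rho=2.0):
    """t_i = (eps^(1/rho) + (i - 1)/(n - 1) * (T^(1/rho) - eps^(1/rho)))^rho, i = 1..n."""
    lo, hi = eps ** (1.0 / rho), T ** (1.0 / rho)
    return [(lo + (i - 1) / (n - 1) * (hi - lo)) ** rho for i in range(1, n + 1)]


ts = time_grid()
print(len(ts), ts[0], ts[-1])
```

Because the grid is uniform in $t^{1/\rho}$ rather than in $t$, the points are denser near $\epsilon$ for $\rho=2$, where the signal-to-noise ratio changes fastest relative to $t$.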

5.3 Comparisons to State-of-the-art Methods

Table 1: Comparisons to state-of-the-art methods on the HumanML test set. We repeat all the evaluations 20 times and report the average with a 95% confidence interval. "$\uparrow$" denotes that higher is better. "$\downarrow$" denotes that lower is better. "$\rightarrow$" denotes that results are better if the metric is closer to the real motion. $\dagger$ denotes that classifier-free guidance is utilized, causing a double NFE.

Method	R-Precision Top-1 $\uparrow$	R-Precision Top-2 $\uparrow$	R-Precision Top-3 $\uparrow$	FID $\downarrow$	MM-Dist $\downarrow$	Diversity $\rightarrow$	MModality $\uparrow$	NFE $\downarrow$
Real	$0.511^{\pm.003}$	$0.703^{\pm.003}$	$0.797^{\pm.002}$	$0.002^{\pm.000}$	$2.974^{\pm.008}$	$9.503^{\pm.065}$	-	-
TEMOS[6]	$0.424^{\pm.002}$	$0.612^{\pm.002}$	$0.722^{\pm.002}$	$3.734^{\pm.028}$	$3.703^{\pm.008}$	$8.973^{\pm.071}$	$0.368^{\pm.018}$	-
T2M[5]	$0.457^{\pm.002}$	$0.639^{\pm.003}$	$0.740^{\pm.003}$	$1.067^{\pm.002}$	$3.340^{\pm.008}$	$9.188^{\pm.002}$	$2.090^{\pm.083}$	-
MDM [2]	$0.320^{\pm.005}$	$0.498^{\pm.004}$	$0.611^{\pm.007}$	$0.544^{\pm.044}$	$5.566^{\pm.027}$	$9.559^{\pm.086}$	$2.799^{\pm.072}$	1000
MD [1]	$0.491^{\pm.001}$	$0.681^{\pm.001}$	$0.782^{\pm.001}$	$0.630^{\pm.001}$	$3.113^{\pm.001}$	$9.410^{\pm.049}$	$1.553^{\pm.042}$	1000
MLD$^{\dagger}$ [3]	$0.481^{\pm.003}$	$0.673^{\pm.003}$	$0.772^{\pm.002}$	$0.473^{\pm.013}$	$3.196^{\pm.010}$	$9.724^{\pm.082}$	$2.413^{\pm.079}$	100
GraphMotion$^{\dagger}$ [8]	$0.504^{\pm.003}$	$0.699^{\pm.002}$	$0.785^{\pm.002}$	$0.116^{\pm.007}$	$3.070^{\pm.008}$	$9.692^{\pm.067}$	$2.766^{\pm.096}$	300
M2DM [7]	$0.497^{\pm.003}$	$0.682^{\pm.002}$	$0.763^{\pm.003}$	$0.352^{\pm.005}$	$3.134^{\pm.010}$	$9.926^{\pm.073}$	$3.587^{\pm.072}$	100
Our	$0.460^{\pm.001}$	$0.655^{\pm.002}$	$0.760^{\pm.006}$	$0.232^{\pm.007}$	$3.238^{\pm.008}$	$9.658^{\pm.065}$	$3.506^{\pm.008}$	5

Table 2: Comparisons to state-of-the-art methods on the KIT test set. The meaning of the markers is the same as in Tab. 1.

Method	R-Precision Top-1 $\uparrow$	R-Precision Top-2 $\uparrow$	R-Precision Top-3 $\uparrow$	FID $\downarrow$	MM-Dist $\downarrow$	Diversity $\rightarrow$	MModality $\uparrow$	NFE $\downarrow$
Real	$0.424^{\pm.005}$	$0.649^{\pm.006}$	$0.779^{\pm.006}$	$0.031^{\pm.004}$	$2.788^{\pm.012}$	$11.08^{\pm.097}$	-	-
TEMOS[6]	$0.353^{\pm.006}$	$0.561^{\pm.007}$	$0.687^{\pm.005}$	$3.717^{\pm.051}$	$3.417^{\pm.019}$	$10.84^{\pm.100}$	$0.532^{\pm.034}$	-
T2M[5]	$0.370^{\pm.005}$	$0.569^{\pm.007}$	$0.693^{\pm.007}$	$2.770^{\pm.109}$	$3.401^{\pm.008}$	$10.91^{\pm.119}$	$1.482^{\pm.065}$	-
MDM [2]	$0.164^{\pm.004}$	$0.291^{\pm.004}$	$0.396^{\pm.004}$	$0.497^{\pm.021}$	$9.191^{\pm.022}$	$10.85^{\pm.109}$	$1.907^{\pm.214}$	1000
MD [1]	$0.417^{\pm.004}$	$0.621^{\pm.004}$	$0.739^{\pm.004}$	$1.954^{\pm.062}$	$2.958^{\pm.005}$	$11.10^{\pm.143}$	$0.730^{\pm.013}$	1000
MLD$^{\dagger}$ [3]	$0.390^{\pm.008}$	$0.609^{\pm.008}$	$0.734^{\pm.007}$	$0.404^{\pm.027}$	$3.204^{\pm.027}$	$10.80^{\pm.117}$	$2.192^{\pm.071}$	100
GraphMotion$^{\dagger,\ddagger}$ [8]	$0.429^{\pm.007}$	$0.648^{\pm.006}$	$0.769^{\pm.006}$	$0.313^{\pm.013}$	$3.076^{\pm.022}$	$11.12^{\pm.135}$	$3.627^{\pm.113}$	300
M2DM [7]	$0.416^{\pm.004}$	$0.628^{\pm.004}$	$0.743^{\pm.004}$	$0.515^{\pm.029}$	$3.015^{\pm.017}$	$11.417^{\pm.970}$	$3.325^{\pm.370}$	100
Our	$0.433^{\pm.007}$	$0.655^{\pm.006}$	$0.783^{\pm.006}$	$0.408^{\pm.013}$	$2.831^{\pm.018}$	$11.179^{\pm.085}$	$1.23^{\pm.037}$	5

The test results on HumanML and KIT are shown in Tab. 1 and Tab. 2, respectively. Our framework achieves state-of-the-art generation performance. Compared to existing motion diffusion frameworks that require 50 to 1000 iterations (e.g., MDM, MotionDiffuse, and MLD), our approach reduces the computational burden by more than tenfold without severely degrading generation quality. Remarkably, our inference pipeline is very concise, with no tricks such as the additional text preprocessing used in GraphMotion. Sampling in fewer steps also does not significantly reduce the diversity and multi-modality metrics, which remain competitive. Fig. 3 compares the visualization results with those of previous models.

5.4 Ablation Study

Table 3: Ablation study of our framework with more generation metrics under different guidance parameters. The meaning of the markers is the same as in Tab. 1.

Dataset	$w$	R-Precision Top-3 $\uparrow$	FID $\downarrow$	MM-Dist $\downarrow$	MModality $\uparrow$
KIT	0	$0.742^{\pm.006}$	$0.717^{\pm.028}$	$3.051^{\pm.021}$	$2.496^{\pm.065}$
	0.5	$0.771^{\pm.006}$	$0.504^{\pm.021}$	$2.885^{\pm.023}$	$1.935^{\pm.044}$
	1	$0.775^{\pm.005}$	$0.494^{\pm.019}$	$2.831^{\pm.021}$	$1.844^{\pm.049}$
	1.5	$0.783^{\pm.006}$	$0.411^{\pm.019}$	$2.809^{\pm.019}$	$1.648^{\pm.040}$
	2	$0.777^{\pm.006}$	$0.518^{\pm.016}$	$2.799^{\pm.023}$	$1.612^{\pm.041}$

Effectiveness of each component. We explore the generative performance of the classifier-free guidance technique under different representations, and the results are reported in Fig. 4. When the guidance coefficient $w$ equals 0, the model degenerates into a vanilla consistency model. We find that adding classifier-free guidance, at any of the tested strengths, accelerates the convergence of consistency training and improves generation quality. The pixel-discrete motion representation obtained with the quantized autoencoder converges better and generates better than the continuous motion representation. In particular, under the same consistency training parameters, we do not observe significant gains in generation quality from variational constraints compared to the vanilla autoencoder. We further evaluate more comprehensive generation metrics at different guidance parameters, and the results are reported in Tab. 3. As the guidance parameter increases, controllability and generation quality gradually improve, with a corresponding decrease in diversity. In contrast to the larger guidance parameters employed in the traditional diffusion framework (where $w$ is usually set to around 7), we find that the generation quality no longer improves once $w$ reaches 2 in the consistency training framework.
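As a minimal sketch of how the guidance coefficient $w$ acts, the snippet below applies the commonly used classifier-free guidance extrapolation, $(1+w)\cdot$conditional $- w\cdot$unconditional. This form is an assumption chosen because it reduces to the plain conditional output at $w=0$, matching the degenerate case described above; it illustrates the guidance rule only and is not the trajectory simulation procedure used in training. The function name and the numeric vectors are hypothetical.

```python
def guided_output(cond, uncond, w):
    """Extrapolate from the unconditional towards the conditional prediction.

    Assumed form: (1 + w) * cond - w * uncond, so w = 0 returns cond unchanged.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]


# Hypothetical predictions for one latent token (illustrative values only).
cond = [0.8, -0.2, 0.5]    # prediction given the text prompt
uncond = [0.5, 0.0, 0.1]   # prediction with the prompt dropped

# The guidance values mirror the w column of Tab. 3.
for w in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(w, guided_output(cond, uncond, w))
```

Larger $w$ pushes the output further along the direction that separates the conditional from the unconditional prediction, which is in line with the trade-off between controllability and diversity in Tab. 3.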

Table 4: Ablation study of different numbers of tokens and sizes of the finite representation set. The meaning of the markers is the same as in Tab. 1.

Dataset	Token	$l$	R-Precision Top-3 $\uparrow$	FID $\downarrow$	MM-Dist $\downarrow$	MModality $\uparrow$
KIT	2	100	$0.770^{\pm.006}$	$0.599^{\pm.025}$	$2.870^{\pm.020}$	$1.656^{\pm.043}$
	2	500	$0.774^{\pm.005}$	$0.550^{\pm.019}$	$2.829^{\pm.018}$	$1.769^{\pm.021}$
	2	2000	$0.775^{\pm.005}$	$0.428^{\pm.016}$	$2.844^{\pm.019}$	$1.645^{\pm.045}$
	4	1000	$0.781^{\pm.003}$	$0.489^{\pm.021}$	$2.823^{\pm.021}$	$1.859^{\pm.044}$
	6	1000	$0.781^{\pm.004}$	$0.465^{\pm.021}$	$2.821^{\pm.019}$	$1.839^{\pm.055}$
	2	1000	$0.783^{\pm.006}$	$0.411^{\pm.019}$	$2.809^{\pm.019}$	$1.648^{\pm.040}$

Ablation study on different model hyperparameters. In Tab. 4, we test the model performance with different hyperparameters. Consistent with the findings of MLD, increasing the number of tokens does not noticeably improve the generation quality. Appropriately increasing the size of the finite set, $2l+1$, improves the generation results, but this gain is no longer significant when $l$ is larger than 1000.
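To make the role of $l$ concrete, the following sketch quantizes a latent vector onto a finite set of $2l+1$ evenly spaced levels in $[-1, 1]$. It assumes a tanh bound followed by rounding, in the spirit of finite scalar quantization [28]; the function names are hypothetical, and the straight-through gradient needed for training is omitted.

```python
import math


def quantize(z, l):
    """Map each latent value to one of 2l+1 evenly spaced levels in [-1, 1].

    Assumed rule: bound the value with tanh, scale by l, round, rescale.
    """
    return [round(l * math.tanh(v)) / l for v in z]


def num_levels(l):
    """Size of the finite set {-l, ..., 0, ..., l} / l."""
    return 2 * l + 1


z = [-3.0, -0.4, 0.0, 0.25, 2.0]  # hypothetical unbounded encoder outputs
for l in (100, 500, 1000, 2000):  # the values of l tested in Tab. 4
    print(l, num_levels(l), quantize(z, l))
```

The spacing between neighbouring levels is $1/l$, so doubling $l$ from 1000 to 2000 only refines an already fine grid, which is consistent with the saturating gain reported in Tab. 4.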

Table 5: Ablation study of different number of function evaluations.

Dataset	NFE	R-Precision Top-3 $\uparrow$	FID $\downarrow$	MM-Dist $\downarrow$	MModality $\uparrow$
KIT	1	$0.777^{\pm.005}$	$0.567^{\pm.002}$	$2.865^{\pm.013}$	$1.424^{\pm.040}$
	3	$0.781^{\pm.005}$	$0.409^{\pm.014}$	$2.812^{\pm.019}$	$1.598^{\pm.037}$
	5	$0.783^{\pm.006}$	$0.411^{\pm.019}$	$2.809^{\pm.019}$	$1.648^{\pm.040}$
	8	$0.783^{\pm.006}$	$0.400^{\pm.015}$	$2.810^{\pm.017}$	$1.667^{\pm.051}$
	10	$0.786^{\pm.006}$	$0.395^{\pm.015}$	$2.795^{\pm.019}$	$1.663^{\pm.049}$

Ablation study on different sampling steps. Our generation results at different sampling steps are shown in Tab. 5. We obtain excellent results with few sampling steps. However, once the number of sampling steps exceeds 15, additional steps no longer yield a quality gain, which is a common problem with consistency training.
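The relation between the number of sampling steps and the NFE can be illustrated with the multistep sampling procedure of consistency models [23]: one network call maps pure noise to a sample, and every further step re-noises the sample to a smaller noise level and denoises it again. The sketch below is an illustration under stated assumptions, not our trained denoiser: `ToyConsistencyFn` is a hypothetical stand-in that uses the closed-form PF-ODE solution for one-dimensional Gaussian data, and the values of $T$, $\epsilon$ and the time schedule are assumed.

```python
import math
import random


class ToyConsistencyFn:
    """Closed-form consistency function for Gaussian data N(mu, s^2); counts calls."""

    def __init__(self, mu=1.0, s=0.5, eps=0.002):
        self.mu, self.s, self.eps = mu, s, eps
        self.calls = 0  # number of function evaluations so far

    def __call__(self, x, t):
        self.calls += 1
        # Shrinks the spread around mu from sqrt(s^2 + t^2) back to sqrt(s^2 + eps^2).
        ratio = math.sqrt((self.s ** 2 + self.eps ** 2) / (self.s ** 2 + t ** 2))
        return [self.mu + (xi - self.mu) * ratio for xi in x]


def multistep_sample(f, times, dim, eps=0.002, rng=None):
    """Multistep consistency sampling; `times` is decreasing and times[0] = T.

    The number of function evaluations equals len(times).
    """
    rng = rng or random.Random(0)
    x_noisy = [times[0] * rng.gauss(0.0, 1.0) for _ in range(dim)]  # prior sample
    x = f(x_noisy, times[0])
    for t in times[1:]:
        scale = math.sqrt(t * t - eps * eps)
        x_noisy = [xi + scale * rng.gauss(0.0, 1.0) for xi in x]  # re-noise to level t
        x = f(x_noisy, t)                                         # denoise in one call
    return x


fn = ToyConsistencyFn()
sample = multistep_sample(fn, times=[80.0, 20.0, 5.0, 1.0, 0.2], dim=4)
print("NFE:", fn.calls, "sample:", sample)
```

With a schedule of five time points the function is evaluated five times, which corresponds to the NFE 5 row of Tab. 5; a single time point gives the one-step case.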

5.5 Time Cost

Table 6: Comparison of inference time with previous state-of-the-art models.

Method	MDM	MLD	T2M-GPT	GraphMotion	Ours (NFE 5)	Ours (NFE 3)
AST (s)	7.5604	0.0786	0.2168	0.5417	0.0141	0.0098

The consistency training method we use does not require a pre-trained diffusion model, so training is inexpensive and feasible on a single 4090 GPU. On the HumanML dataset, we train the encoder in 15 hours and the denoiser in 12 hours. Benefiting from the consistency sampling strategy, our inference time is also more than fivefold lower than that of every compared model and more than tenfold lower than that of all models except MLD. A more detailed time comparison is reported in Tab. 6.
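The relative speedups implied by Tab. 6 can be computed directly from the reported AST values. The short script below does only this arithmetic; the dictionary and function names are hypothetical, and the numbers are copied from Tab. 6.

```python
# AST values in seconds, copied from Tab. 6.
baseline_ast = {
    "MDM": 7.5604,
    "MLD": 0.0786,
    "T2M-GPT": 0.2168,
    "GraphMotion": 0.5417,
}
ours_ast = {"NFE 5": 0.0141, "NFE 3": 0.0098}


def speedup(baseline, setting):
    """Ratio of a baseline's AST to ours for the given NFE setting."""
    return baseline_ast[baseline] / ours_ast[setting]


for name in baseline_ast:
    ratios = ", ".join(f"{s}: {speedup(name, s):.1f}x" for s in ours_ast)
    print(f"{name:12s} {ratios}")
```

The smallest ratio is obtained against MLD, which also operates in a latent space, and the largest against MDM, which runs 1000 denoising steps on raw motion sequences.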

6 Conclusion

In this paper, we propose a motion latent consistency training framework, called MLCT, for high-quality, few-step sampling. It encodes motion sequences of arbitrary length into representational tokens with quantization constraints and enforces consistency of the outputs along the same ODE trajectory to realize the latent diffusion pipeline. Inspired by classifier-free guidance, we propose a method called consistent trajectory offset for fast convergence of consistency training. We validate our model and each of its components through extensive experiments and achieve the best trade-off between performance and computational burden in a very small number of steps (around 10). Our approach can serve as a reference for training latent consistency models on other tasks.

Limitation and Future Work. Our work still leaves room for improvement. First, we focus on few-step motion generation and do not discuss fine-grained motion control. Fortunately, our proposed method is a general framework for training diffusion models with fewer sampling steps, so recent textual controllers (such as GraphMotion) can be integrated into the current work. Second, we note that, unlike common diffusion frameworks, consistency training fails to yield higher sampling quality as the number of steps increases. Overcoming this difficulty is the main focus of our future work.

References

[1] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022.
[2] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. In International Conference on Learning Representations, 2023.
[3] Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.
[4] Haoye Cai, Chunyan Bai, Yu-Wing Tai, and Chi-Keung Tang. Deep video generation, prediction and completion of human action sequences. In Proceedings of the European conference on computer vision (ECCV), pages 366–382, 2018.
[5] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022.
[6] Mathis Petrovich, Michael J Black, and Gül Varol. Temos: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision, pages 480–497. Springer, 2022.
[7] Hanyang Kong, Kehong Gong, Dongze Lian, Michael Bi Mi, and Xinchao Wang. Priority-centric human motion generation in discrete latent space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14806–14816, 2023.
[8] Peng Jin, Yang Wu, Yanbo Fan, Zhongqian Sun, Yang Wei, and Li Yuan. Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs. arXiv preprint arXiv:2311.01015, 2023.
[9] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
[10] Taeryung Lee, Gyeongsik Moon, and Kyoung Mu Lee. Multiact: Long-term 3d human motion generation from multiple action labels. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1231–1239, 2023.
[11] Liang Xu, Ziyang Song, Dongliang Wang, Jing Su, Zhicheng Fang, Chenjing Ding, Weihao Gan, Yichao Yan, Xin Jin, Xiaokang Yang, et al. Actformer: A gan-based transformer towards general action-conditioned 3d human motion generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2228–2238, 2023.
[12] Buyu Li, Yongchi Zhao, Shi Zhelun, and Lu Sheng. Danceformer: Music conditioned 3d dance generation with parametric motion transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1272–1279, 2022.
[13] Kunkun Pang, Dafei Qin, Yingruo Fan, Julian Habekost, Takaaki Shiratori, Junichi Yamagishi, and Taku Komura. Bodyformer: Semantics-guided 3d body gesture synthesis with transformer. ACM Transactions on Graphics (TOG), 42(4):1–12, 2023.
[14] Chaitanya Ahuja and Louis-Philippe Morency. Language2pose: Natural language grounded pose forecasting. In 2019 International Conference on 3D Vision (3DV), pages 719–728. IEEE, 2019.
[15] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2021–2029, 2020.
[16] Mathis Petrovich, Michael J Black, and Gül Varol. Action-conditioned 3d human motion synthesis with transformer vae. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10985–10995, 2021.
[17] Zhenyi Wang, Ping Yu, Yang Zhao, Ruiyi Zhang, Yufan Zhou, Junsong Yuan, and Changyou Chen. Learning diverse stochastic human-action generators by learning smooth latent transitions. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 12281–12288, 2020.
[18] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual descriptions with discrete representations. arXiv preprint arXiv:2301.06052, 2023.
[19] Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. 2023.
[20] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
[21] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations, 2022.
[22] Yilun Xu, Ziming Liu, Max Tegmark, and Tommi Jaakkola. Poisson flow generative models. Advances in Neural Information Processing Systems, 35:16782–16795, 2022.
[23] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, 2023.
[24] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
[25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[26] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
[27] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
[28] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023.
[29] Matthias Plappert, Christian Mandery, and Tamim Asfour. The kit motion-language dataset. Big data, 4(4):236–252, 2016.
[30] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019.
[31] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.