MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Yuang Zhang¹,² Jiaxi Gu¹ ✉ Li-Wen Wang¹ Han Wang¹,²* Junqi Cheng¹
Yuefeng Zhu¹ Fangyuan Zou¹
¹Tencent ²Shanghai Jiao Tong University
{yuaaazhang,levenwang,kathyhwang,junqicheng,yuefengzhu,ericfyzou}@tencent.com
✉ [email protected]
* Work done during internship at Tencent.

Abstract

In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, for generating long and smooth videos, we propose a progressive latent fusion strategy. By this means, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion.

Figure 1: Pose-guided dancing and talking videos generated by MimicMotion showcase its capability to produce diverse human motions and long videos.

1 Introduction

With the rapid development of generative artificial intelligence, video generation is gaining attention in parallel with the growing maturity of image generation techniques. However, video generation is more challenging due to its higher inherent complexities, including the need for high-quality imagery and seamless temporal smoothness. This sets higher standards for video generation technology. In addition to these challenges, controlling the generated content and extension to significant lengths without compromising quality is essential for real-world use. In this paper, we focus on pose-guided video generation conditioned on a reference image. Our goal is to generate a video that not only contains rich imagery details but also adheres to the reference image and the pose guidance.

Currently, many works focus on image-conditioned pose-guided video generation, such as Follow Your Pose [1], DreamPose [2], DisCo [3], MagicDance [4], AnimateAnyone [5], MagicAnimate [6], DreaMoving [7], and Champ [8]. Although various model architectures and training techniques have been studied for better generation performance, the generated results remain unsatisfactory in several aspects. Image distortion, especially in the regions of human hands, is still a common issue and is particularly evident in videos containing large movements. Moreover, to achieve good temporal smoothness, image details are sometimes sacrificed, resulting in blurred frames. Given the diverse appearances and motions in videos, accurate pose estimation is inherently challenging. This inaccuracy not only creates a conflict between pose alignment and temporal smoothness but also prevents the model from benefiting from longer training schedules, because it overfits to noisy samples. In addition, due to computational limitations and model capabilities, there are still significant challenges in generating high-quality long videos containing a large number of frames. To solve these problems, we propose a series of approaches for generating long but still smooth videos based on pose guidance and image reference.

To alleviate the negative impact of inaccurate pose estimation, we propose an approach of confidence-aware pose guidance. By introducing the concept of confidence to the pose sequence representation, better temporal smoothness can be achieved and imagery distortion can also be eased. Confidence-based regional loss amplification can make the hand regions more accurate and clear. In addition, we propose a progressive latent fusion method for achieving long but still smooth video generation. Through generating video segments with overlapped frames with the proposed progressive latent fusion, our model can handle arbitrary-length pose sequence guidance. By merging the generated video segments, the final long video can be of good cross-frame smoothness and imagery richness at the same time. For model training, to keep the cost of model training within an acceptable range, our method is based on a generally pre-trained video generation model. The amount of training data is not large and no special manual annotation is required.

In summary, there are three key contributions of this work:

1. We improve the pose guidance by employing a confidence-aware strategy. In this way, the negative impact of inaccurate pose estimation can be alleviated. This approach not only reduces the influence of noisy samples during training but also corrects erroneous pose guidance during inference.

2. Based on the confidence-aware strategy, we propose hand region enhancement to alleviate hand distortion by strengthening the loss weight of the region of human hands with high pose confidence.

3. While cross-frame overlapped diffusion is a standard technique for generating long videos, we advance it with a position-aware progressive latent fusion approach that improves temporal smoothness at segment boundaries. Extensive experimental results show the effectiveness of the proposed approach.

2 Related work

2.1 Diffusion models for image/video generation

Diffusion-based models have demonstrated promising results in the fields of image [9, 10, 11, 12, 13] and video generation [14, 15, 16, 17, 18, 19, 20], renowned for their capacity in generative tasks. Diffusion models operating in the pixel domain encounter challenges in generating high-resolution images due to information redundancy and high computational costs. Latent Diffusion Models (LDM) [11] address these issues by performing the diffusion process in low-dimensional latent spaces, significantly enhancing generation efficiency and quality while reducing computational demands. Compared to image generation, video generation demands a more precise understanding of spatial relationships and temporal motion patterns. Recent video generation models leverage diffusion models by adding temporal layers to pre-trained image generation models [14, 15, 21, 22], or utilizing transformers structures [23, 24, 25, 26] to enhance generative capabilities for videos. Stable Video Diffusion (SVD) [20] is one of the most popular open-source models built upon LDM. It offers a straightforward and effective method for image-based video generation and serves as a powerful pre-trained model for this task. Our approach extends SVD for pose-guided video generation, leveraging the pre-trained generative capabilities of SVD.

2.2 Pose-guided human motion transfer

Pose-to-appearance mapping aims to transfer motion from the source identity to the target identity. Methods based on paired keypoints from source and target images employ local affine transformations [27, 28] or Thin-Plate Spline transformations [29] to warp the source image to match the driving image. These techniques aim to minimize distortion by applying weighted affine transformations, thereby generating poses in the output image that closely resemble those in the driving image. Similarly, methods such as [30, 1, 5, 7] utilize pose stick figures obtained from off-the-shelf human pose detectors as motion indicators and directly generate video frames through generative models. Depth information [7] or 3D human parametric models, such as SMPL (Skinned Multi-Person Linear) [8], can also be used to represent human geometry and motion characteristics from the source video. Nevertheless, these overly dense guidance techniques can rely too much on the signal from the source video, such as the outline of the body, leading to a degradation in the quality of the generated videos, especially when the target identity differs significantly from the source. Our approach, leveraging off-the-shelf human pose detectors, is capable of capturing the motion of the human body in driving videos without introducing excessive extraneous information, thereby ensuring the overall quality of the generated video. Different from existing methods, we introduce confidence-aware pose guidance, which effectively mitigates the influence of inaccurate pose estimation in training and inference. In this way, we achieve superior portrait frame quality, especially in the hand regions.

2.3 Long video generation

Recent diffusion-based video generation algorithms are constrained to producing videos with durations of only a few seconds, significantly limiting their practical applications. As a result, substantial research efforts have been dedicated to extending the duration of generated videos, leading to the proposal of various approaches to overcome this limitation. Methods like [17, 31] autoregressively predict successive frames, enabling the generation of infinitely long videos. However, these methods often face quality degradation due to error accumulation and the lack of long-term temporal coherence. Hierarchical approaches [32, 22] generate long videos in a coarse-to-fine manner. They first create a coarse storyline with keyframes using a global diffusion model, then iteratively refine the video with local diffusion models to produce detailed intermediate frames.

MultiDiffusion [33] combines multiple processes that use pre-trained text-to-image diffusion models to create high-quality images with user-defined controls. It works by applying the model to different parts of an image and using an optimization method to ensure all parts blend seamlessly. This allows users to generate images that meet specific requirements, like certain aspect ratios or spatial layouts, without needing additional training or fine-tuning. Lumiere [34] extends MultiDiffusion to video generation by dividing the video into overlapping temporal segments. Each segment is independently denoised, and an optimization algorithm then combines these denoised segments. This approach ensures high coherence in the generated video, effectively maintaining temporal smoothness across segments. However, our experiments reveal that abrupt transitions can still occur at segment boundaries.

Building upon the principle of MultiDiffusion, we introduce a position-aware progressive latent fusion strategy that enhances temporal smoothness near segment boundaries. We adaptively assign fusion weight based on the temporal position, ensuring a smooth transition at the segment boundaries that further reduces flickering.

3 Method

3.1 Preliminaries

A Diffusion Model (DM) learns a diffusion process that generates a probability distribution for a given dataset. In the case of visual content generation tasks, a neural network of DM is trained to reverse the process of adding noise to real data so new data can be progressively generated starting from random noise. For a data sample $\mathbf{x}\sim p_{\text{data}}$ from a specific data distribution $p_{\text{data}}$ , the forward diffusion process is defined as a fixed Markov Chain that gradually adds Gaussian noise to the data following:

$$q(\mathbf{x}_{t}\mid\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};\sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1},\beta_{t}\mathbf{I}) \tag{1}$$

for $t=1,\cdots,T$, where $T$ is the number of perturbing steps and $\mathbf{x}_{t}$ represents the noisy data after adding $t$ steps of noise to the real data $\mathbf{x}_{0}$. This process is controlled by a noise schedule $\beta_{t}$ indexed by the noising step $t$. Because the composition of Gaussian transitions is again Gaussian, $\mathbf{x}_{t}$ can be computed directly from $\mathbf{x}_{0}$ by rewriting the above diffusion process as follows:

$$q(\mathbf{x}_{t}\mid\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\,\mathbf{x}_{0},(1-\bar{\alpha}_{t})\mathbf{I}) \tag{2}$$

where $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$ and $\alpha_{t}=1-\beta_{t}$ . Following DDPM [9], a denoising function $\epsilon_{\theta}$ parameterized with $\theta$ , commonly implemented with a neural network, is trained by minimizing the mean square error loss as follows:

$$\mathbb{E}_{\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\mathbf{x}_{t},\mathbf{c},t}\left[\lVert\epsilon-\epsilon_{\theta}(\mathbf{x}_{t};\mathbf{c},t)\rVert_{2}^{2}\right] \tag{3}$$

where $\mathbf{c}$ is an optional condition and $\mathbf{x}_{t}$ is a perturbed version of the real data $\mathbf{x}_{0}\sim p_{\text{data}}$ obtained by adding $t$ steps of noise. In this way, $\epsilon_{\theta}$ can be trained until convergence by sampling $\mathbf{x}_{0}$ from the real data distribution together with a time step $t$ and an optional condition $\mathbf{c}$.
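As a concrete illustration of Eqs. (2) and (3), the following minimal Python sketch draws $\mathbf{x}_{t}$ directly from $\mathbf{x}_{0}$ and evaluates the squared noise-prediction error for a single sample. It is not the training code of MimicMotion: the function names (`alpha_bar`, `q_sample`, `noise_prediction_error`), the use of plain Python lists in place of tensors, and the linear $\beta_{t}$ schedule in the usage lines are all assumptions made for illustration.

```python
import math
import random


def alpha_bar(betas):
    """Cumulative products alpha_bar_t = prod_{i<=t} (1 - beta_i) for t = 1..T."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) as in Eq. (2); t is 1-indexed.

    Returns the noisy sample and the Gaussian noise that produced it.
    """
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def noise_prediction_error(eps, eps_pred):
    """Squared L2 distance between true and predicted noise (inside of Eq. (3))."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


# Usage with an assumed linear schedule of T = 1000 steps (illustrative values).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * k / (T - 1) for k in range(T)]
abar = alpha_bar(betas)
x_t, eps = q_sample([0.5, -0.3, 0.1], t=T, abar=abar, rng=random.Random(0))
```

In training, the expectation in Eq. (3) would be approximated by averaging this error over mini-batches, with `eps_pred` produced by the denoising network $\epsilon_{\theta}$.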

3.2 Data preparation

To train a pose-guided video diffusion model, we collect a video dataset containing various human motions. Leveraging the powerful capability of the generally pre-trained image-to-video model, the dataset need not be excessively large, as the pre-trained model already has a good prior.

Given a video from our dataset, the training sample is constructed with three parts: a reference image (denoted as $I_{\text{ref}}$), a sequence of raw video frames, and the corresponding poses. First, basic pre-processing operations such as frame resizing and cropping are applied to the raw video to obtain a sequence of video frames with a fixed aspect ratio. For a given video, a fixed number of frames are randomly sampled at equal intervals as input video frames to the diffusion model. The input reference image is randomly sampled from the same video and is not restricted to the sampled video frames. This reference image is pre-processed in the same way as the video frames. Another input of the model is the pose sequence, which is extracted from the video frames with DWPose [35] frame by frame.
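The sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data loader: the function name `sample_training_indices`, the parameters `num_samples` and `stride`, and the choice of a fixed stride with a random starting frame are hypothetical.

```python
import random


def sample_training_indices(num_video_frames, num_samples, stride, rng):
    """Pick equally spaced frame indices and an independent reference index.

    Assumption: "equal intervals" is modeled as a fixed stride with a random
    start; the reference frame is drawn from the whole video.
    """
    span = (num_samples - 1) * stride + 1  # frames covered by the sampled clip
    if span > num_video_frames:
        raise ValueError("video is too short for the requested clip")
    start = rng.randrange(num_video_frames - span + 1)
    frame_ids = [start + k * stride for k in range(num_samples)]
    ref_id = rng.randrange(num_video_frames)  # may or may not be in frame_ids
    return frame_ids, ref_id


# Usage: 16 frames with stride 4 from a 300-frame video (illustrative numbers).
frame_ids, ref_id = sample_training_indices(300, 16, 4, random.Random(0))
```

Drawing the reference index independently of the clip means the reference image can show a pose that differs from every sampled frame, which matches the description that it is not restricted to the sampled frames.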

3.3 Pose-guided Video Diffusion Model

The goal of MimicMotion is to generate high-quality, pose-guided human videos from a single reference image and a sequence of poses to mimic. This task involves the synthesis of realistic motion that adheres to the provided pose sequence while maintaining visual fidelity to the reference image. We exploit the ability of a specific pre-trained video diffusion model to reduce the data requirement and computational cost of training a video diffusion model from scratch. Stable Video Diffusion (SVD) [20] is an open-source image-to-video diffusion model trained on a large-scale video dataset. It shows good performance on both video quality and diversity compared with the other contemporary models. The model structure of MimicMotion is designed to integrate a pre-trained Stable Video Diffusion (SVD) model to leverage its image-to-video generation capabilities.

Learning a diffusion process in pixel space is costly, and the cost becomes more severe when generating high-definition videos with many frames. We follow the Latent Diffusion Model (LDM) [36] and encode pixels into latent space so that diffusion can be conducted in a low-dimensional latent space. LDM adopts an autoencoder consisting of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$. Given a data sample $\mathbf{x}$, it is encoded into the latent space as $\mathbf{z}=\mathcal{E}(\mathbf{x})$. Conversely, the latent vector $\mathbf{z}$ can be decoded back into pixel space via $\mathbf{x}=\mathcal{D}(\mathbf{z})$.

Figure 2: MimicMotion integrates an image-to-video diffusion model with novel confidence-aware pose guidance. The model’s trainable components consist of a spatiotemporal U-Net and a PoseNet for introducing pose sequence as the condition. Key features of confidence-aware pose guidance include: 1) The pose sequence condition is accompanied by keypoint confidence scores, enabling the model to adaptively adjust the influence of pose guidance based on the score. 2) The regions with high confidence are given greater weight in the loss function, amplifying their impact in training.

Figure 2 shows the structure of our model. The core structure of our model is a latent video diffusion model with a U-Net for progressive denoising in latent space. The VAE encoder on input video frames and the corresponding decoder for getting denoised video frames are both adopted from SVD and these parameters are frozen. The VAE encoder is applied independently to each frame of the input video as well as to the conditional reference image, operating on a per-frame basis without considering temporal or cross-frame interactions. Differently, the VAE decoder processes the latent features, which undergo spatiotemporal interaction from U-Net. To ensure the generation of a smooth video, the VAE decoder incorporates temporal layers alongside the spatial layers, mirroring the architecture of the VAE encoder.

In addition to the input video frames, the reference image and the sequence of poses are two other inputs of the model. The reference image is fed into the diffusion model along two separate pathways. One pathway involves feeding the image into each block of the U-Net. Specifically, through a visual encoder like CLIP [37], the image feature is extracted and fed into the cross-attention of every U-Net block for finally controlling the output results. The other pathway targets the input latent features. Similar to the raw video frames, the input reference image is encoded with the same frozen VAE encoder to get its representation in the latent space. The latent feature of the single reference image is then duplicated along the temporal dimension to align with the features of input video frames. The duplicated latent reference images are concatenated with latent video frames along the channel dimension and then fed into U-Net for diffusion altogether.
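The second pathway (duplicate the reference latent along time, then concatenate along channels) can be made concrete with a small sketch. It uses nested Python lists instead of tensors so that it stays self-contained; the function name `concat_reference` and the list-of-channels layout are assumptions for illustration only.

```python
def concat_reference(latent_frames, latent_ref):
    """Tile the reference latent over time and concatenate along channels.

    latent_frames: list of F frames, each a list of C channel maps.
    latent_ref:    list of C channel maps for the single reference image.
    Returns a new list of F frames, each holding 2 * C channel maps.
    """
    return [list(frame) + list(latent_ref) for frame in latent_frames]


# Usage: 3 frames with 2 channels each; strings stand in for channel maps.
frames = [["z0_c0", "z0_c1"], ["z1_c0", "z1_c1"], ["z2_c0", "z2_c1"]]
reference = ["ref_c0", "ref_c1"]
unet_input = concat_reference(frames, reference)
```

Each frame of the U-Net input therefore carries twice as many channels as a plain video latent, with the second half identical across all frames.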

To introduce pose guidance, PoseNet, which is implemented with multiple convolution layers, is designed as a trainable module for extracting features of the input sequence of poses. The reason for not using the VAE encoder is that the pixel value distribution of the pose sequence is different from that of common images on which the VAE autoencoder is trained. With PoseNet, the features of poses are extracted and then added element-wise to the output of the first convolution layer of U-Net. In this way, the influence of the pose guidance takes effect from the very beginning of denoising. We do not add pose guidance to every U-Net block for the following considerations: a) the pose sequence is extracted frame by frame without any temporal interaction, so it may confuse the spatio-temporal layers within U-Net if it acts on these layers directly; b) excessive involvement of the pose sequence may degrade the performance of the pre-trained image-to-video model.

3.4 Confidence-aware pose guidance

Inaccurate pose estimation has a negative impact on the model’s training and inference. Accurately estimating poses from 2D images is inherently difficult, and even more so in dynamic videos. The limited capability of the pose estimation model, such as DWPose [35], is only one part of the reason. The more significant cause is the inherent uncertainty of pose arising from dynamic appearances and motions. Specifically, incorrect pose guidance signals can mislead the model, resulting in the generation of inaccurate or distorted outputs, as illustrated in Figure 9. Moreover, noisy pose guidance signals can lead to overfitting on samples with incorrect poses, potentially causing training instability. This in turn may hinder the model’s ability to benefit from extended training schedules.

To address this problem, we propose confidence-aware pose guidance, which leverages the confidence scores associated with each keypoint from the pose estimation model. These scores reflect the likelihood of accurate detection, with higher values indicating higher visibility, less occlusion, and less motion blur. Instead of applying a fixed confidence threshold to filter the keypoints, as commonly adopted in prior works [38, 4], we utilize brightness on the pose guidance frame to represent the confidence level of pose estimation. Specifically, we integrate the confidence scores of the keypoints and limbs into their respective drawing colors. This means that we multiply the color assigned to each keypoint and limb by its confidence score. Consequently, keypoints and corresponding limbs with higher confidence scores appear brighter, and therefore more prominent, on the pose guidance map. This method enables the model to prioritize more reliable pose information in its guidance, thereby enhancing the overall accuracy of pose-guided generation.

Figure 3 illustrates this concept, showing how confidence-aware pose frames reflect situations of occlusion and motion blur. In this way, the uncertainty of pose estimation can be conveyed through the pose guidance, making pose guidance more informative. Our ablation studies show the effectiveness of this technique in suppressing visual artifacts, as shown in Figure 9.
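The color scaling behind this idea can be sketched in a few lines. The snippet below is only an illustration: the function names, the clamping of scores to $[0, 1]$, the truncation to integers, and in particular the rule that a limb's confidence is the product of its two endpoint scores are assumptions, since the text does not specify how a limb score is derived.

```python
def scale_color(base_rgb, confidence):
    """Dim an RGB drawing color by a confidence score.

    Assumption: scores are clamped to [0, 1] and channels are truncated to
    integers, so confidence 1 keeps the color and confidence 0 yields black.
    """
    c = min(max(float(confidence), 0.0), 1.0)
    return tuple(int(channel * c) for channel in base_rgb)


def limb_confidence(conf_a, conf_b):
    """Hypothetical rule: a limb is as reliable as the product of its endpoints."""
    return conf_a * conf_b


# Usage: a confidently detected wrist and a blurred elbow joined by one limb.
wrist_color = scale_color((255, 0, 100), 0.95)
elbow_color = scale_color((255, 0, 100), 0.30)
limb_color = scale_color((0, 255, 0), limb_confidence(0.95, 0.30))
```

Unlike hard thresholding, an uncertain keypoint is never removed outright; it is drawn faintly, so the model still sees it but can learn to trust it less.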

Hand region enhancement

Moreover, we employ pose estimation and the associated confidence scores to alleviate region-specific artifacts, such as hand distortion, which are prevalent in the diffusion-based image and video generation models. Specifically, we identify reliable regions via thresholding keypoint confidence scores. By setting a threshold, we can distinguish between keypoints that are confidently detected and those that may be ambiguous or incorrect due to factors like occlusion or motion blur. Keypoints with confidence scores above the threshold are considered reliable. We implement a masking strategy that generates masks based on a confidence threshold. We unmask areas where confidence scores surpass a predefined threshold, thereby identifying reliable regions. When computing the loss of the video diffusion model, the loss values corresponding to the unmasked regions are amplified by a certain scale so they can have more effect on the model training than other masked regions.

Specifically, to mitigate hand distortion, we compute masks using a confidence threshold for keypoints in the hand region. Only hands with all keypoint confidence scores exceeding this threshold are considered reliable, as a higher score correlates with higher visual quality. We then construct a bounding box around the hand by padding the boundary of these keypoints, and the enclosed rectangle is designated as unmasked. This region is subsequently assigned a larger weight in the loss calculation during the training of the video diffusion model. This selective unmasking and weighting process biases the model’s learning towards hands, especially hands with higher visual quality, effectively reducing distortion and improving the overall realism of the generated content.
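A minimal sketch of this masking and weighting procedure is given below. It works on a small 2D grid in pixel coordinates for readability; in practice the weight map would have to match the resolution at which the diffusion loss is computed. The function names and the values chosen for `threshold`, `pad`, and `scale` are hypothetical, as the text does not state them.

```python
def reliable_hand_box(keypoints, scores, threshold, pad, width, height):
    """Padded box (x0, y0, x1, y1) around a hand, or None if it is unreliable.

    A hand counts as reliable only if every keypoint score exceeds the threshold.
    The box is half-open: x1 and y1 are exclusive and clipped to the frame.
    """
    if not keypoints or min(scores) <= threshold:
        return None
    xs = [int(x) for x, _ in keypoints]
    ys = [int(y) for _, y in keypoints]
    return (max(0, min(xs) - pad), max(0, min(ys) - pad),
            min(width, max(xs) + pad + 1), min(height, max(ys) + pad + 1))


def loss_weight_map(width, height, boxes, scale):
    """Weight 1 everywhere, `scale` inside every reliable (unmasked) hand box."""
    weights = [[1.0] * width for _ in range(height)]
    for box in boxes:
        if box is None:
            continue
        x0, y0, x1, y1 = box
        for y in range(y0, y1):
            for x in range(x0, x1):
                weights[y][x] = scale
    return weights


def weighted_squared_error(pred, target, weights):
    """Mean of per-position squared errors after regional amplification."""
    total, count = 0.0, 0
    for row_p, row_t, row_w in zip(pred, target, weights):
        for p, t, w in zip(row_p, row_t, row_w):
            total += w * (p - t) ** 2
            count += 1
    return total / count


# Usage on an 8x8 grid with illustrative threshold, padding, and scale.
box = reliable_hand_box([(2, 2), (3, 4)], [0.90, 0.95], 0.8, 1, 8, 8)
weights = loss_weight_map(8, 8, [box], scale=10.0)
```

A hand with even one low-confidence keypoint contributes no box, so blurred or occluded hands are trained with the ordinary loss weight rather than being amplified.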

3.5 Progressive latent fusion for long video generation

Limited by computation resources, generating long videos containing a large number of frames is challenging. To address this, latent fusion during denoising has been validated by prior works such as MultiDiffusion [33], which applies latent fusion to overlapped tiles to realize panoramic image generation. A similar idea can be applied to the video generation task. A straightforward approach is to apply MultiDiffusion directly in the time domain, as in Lumiere [34]. However, viewers are more sensitive to temporal discontinuity than to spatial discontinuity between image tiles, because temporal discontinuity can cause noticeable flickering or even abrupt changes in content. We therefore propose a progressive approach for generating long videos with high temporal continuity.

Progressive latent fusion is training-free and is integrated into the denoising process of the latent diffusion model during inference. Figure 4 shows an overview of this process; the VAE is omitted for brevity because denoising is performed entirely in latent space. There are $T$ denoising steps in total, and latent fusion is applied within each step. A long pose sequence is split by a pre-defined strategy into segments of a fixed number of frames ($N$), with a certain number ($C$) of overlapped frames between every two adjacent segments. For the sake of generation efficiency, it is common to assume that $C\ll N$. During each denoising step, the video segments are first denoised separately with the trained model, conditioned on the same reference image and the corresponding sub-sequence of poses.

Algorithm 1 shows the specific details of progressive latent fusion. As inputs, the reference image is denoted as $I_{\text{ref}}$ (cf. Sec. 3.2) and the pose frame corresponding to the $j$-th frame in the $i$-th video segment is denoted as $P_{i}^{j}$. We use $\mathbf{z}_{i}^{j}$ to denote the latent feature of the $j$-th frame in the $i$-th video segment. The denoising process starts from the maximum time step $T$, and the latent features are initialized from a normal distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$. Within each denoising step at time step $t$, the reverse diffusion process defined by the trained model (DM) is applied to the latent features of each video segment $i$ separately, with $\mathbf{z}_{i}$, $I_{\text{ref}}$, $P_{i}$ and $t$ as inputs. During the latent fusion stage, the frames shared by every two adjacent video segments are then fused.

To avoid corrupting temporal smoothness near video segment boundaries after latent fusion, we propose progressive latent fusion. For a video frame involved in latent fusion, its fusion weight is determined by its relative position in the video segment it belongs to. Specifically, the closer a frame is to the interior of the segment it belongs to (that is, the farther it is from that segment's boundary), the heavier the weight assigned to it. For implementation, a fusion scale is pre-defined as $\lambda_{\text{fusion}}=1/(C+1)$ for controlling the level of latent fusion.
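To make the segment layout concrete, note that $S$ segments of $N$ frames with $C$ overlapped frames cover $N+(S-1)(N-C)$ frames in total. The sketch below computes the frame range of each segment under that layout. The function name `split_segments` is hypothetical, and rejecting sequence lengths that do not fit the layout exactly is an assumption; padding or truncating the pose sequence would be alternative choices.

```python
def split_segments(num_frames, N, C):
    """Return half-open frame ranges (start, end) of overlapped segments.

    Adjacent segments share exactly C frames, so starts advance by N - C.
    Assumption: num_frames must equal N + k * (N - C) for some integer k >= 0.
    """
    if not 0 <= C < N:
        raise ValueError("require 0 <= C < N")
    stride = N - C
    if num_frames < N or (num_frames - N) % stride != 0:
        raise ValueError("sequence length must equal N + k * (N - C)")
    return [(start, start + N) for start in range(0, num_frames - N + 1, stride)]


# Usage: a 40-frame pose sequence with N = 16 and C = 4 gives three segments.
segments = split_segments(40, N=16, C=4)
```

With these illustrative numbers, frames 12 to 15 belong to both the first and the second segment, and frames 24 to 27 to both the second and the third; these are the frames that take part in latent fusion.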

Input: $I_{\text{ref}}$: reference image; $P_{i}^{j}$: pose frame corresponding to the $j$-th frame in the $i$-th video segment; $\mathbf{z}_{i}^{j}$: the latent feature of the $j$-th frame in the $i$-th video segment; $N$: the number of frames in a video segment; $C$: the number of overlapped frames.

Output: $\mathbf{z}^{\prime}$: a long sequence of latent features of video frames.

$\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$; // Random initialization of video latent features.
$\lambda_{\text{fusion}}\leftarrow 1/(C+1)$; // Set a scale of latent fusion.
for $t=T$ to $1$ do // Denoising from noisy latent features step by step.
    for $i=1,2,\dots$ do
        $\mathbf{z}_{i}\leftarrow\text{DM}(\mathbf{z}_{i},I_{\text{ref}},P_{i},t)$ // Separately denoise each segment.
    for $i=1,2,\dots$ do // Within each video segment.
        for $j=1$ to $N$ do // Start latent fusion for each frame.
            if $i>1$ and $j\leq C$ then // Latent fusion with the previous segment.
                $\mathbf{z}_{i}^{j}\leftarrow j\lambda_{\text{fusion}}\mathbf{z}_{i}^{j}+(1-j\lambda_{\text{fusion}})\mathbf{z}_{i-1}^{N-C+j}$
            else if $j>N-C$ then // Latent fusion with the next segment.
                $\mathbf{z}_{i}^{j}\leftarrow(N+1-j)\lambda_{\text{fusion}}\mathbf{z}_{i}^{j}+(C-N+j)\lambda_{\text{fusion}}\mathbf{z}_{i+1}^{C-N+j}$
return $\mathbf{z}^{\prime}=\text{Merge}(\mathbf{z})$; // Merge multi-segment features following Listing 1.

Algorithm 1: Progressive frame-level latent fusion for long video generation.
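The fusion stage of Algorithm 1 can be illustrated with scalar "latents". The sketch below is one possible reading rather than the reference implementation: it assumes that both copies of an overlapped frame are computed from the pre-fusion values of the two segments, so that they end up identical, and that $2C\leq N$ so no frame lies in two overlaps at once. The function name `fuse_overlaps` is hypothetical.

```python
def fuse_overlaps(segments, C):
    """Apply one pass of position-weighted latent fusion to adjacent segments.

    segments: list of equally long lists, one scalar latent per frame.
    For the k-th overlapped frame (k = 1..C), the later segment receives weight
    k / (C + 1) and the earlier segment receives weight 1 - k / (C + 1).
    """
    N = len(segments[0])
    lam = 1.0 / (C + 1)  # lambda_fusion in Algorithm 1
    fused = [list(seg) for seg in segments]  # read from `segments`, write here
    for i in range(1, len(segments)):
        prev, cur = segments[i - 1], segments[i]
        for k in range(1, C + 1):
            value = k * lam * cur[k - 1] + (1.0 - k * lam) * prev[N - C + k - 1]
            fused[i][k - 1] = value              # head of the later segment
            fused[i - 1][N - C + k - 1] = value  # tail of the earlier segment
    return fused


# Usage: two 4-frame segments with C = 3 overlapped frames.
fused = fuse_overlaps([[0.0, 0.0, 0.0, 0.0], [4.0, 4.0, 4.0, 4.0]], C=3)
# fused == [[0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]]
```

In this toy case the overlapped frames form a linear ramp from the value of the earlier segment to that of the later one, which is the gradual transition that the position-dependent weights are meant to produce.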

After $T$ iterative denoising steps, a merging strategy denoted as Merge is used to obtain the final long sequence from the denoised, overlapped video segments in latent space. The Merge function concatenates the multi-segment latents, as described in detail in Listing 1.

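Listing 1: Merge concatenates the 2D list z of overlapped video segments into a single long list zp.

    def Merge(z, C):
        # Copy the first segment so that the input list z is not modified.
        zp = list(z[0])
        for i in range(1, len(z)):
            # Skip the first C frames, which overlap with the previous segment.
            zp.extend(z[i][C:])
        return zp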

4 Experiments

4.1 Implementation details

We train our model on 4,436 human dancing videos collected from the internet. The average length of the training videos is 20.1s. We adopt the pre-trained weights from the public Stable Video Diffusion 1.1 image-to-video model. The PoseNet is trained from scratch. We train our model on 8 NVIDIA A100 GPUs (40G) for 20 epochs, with a per-device batch size of 1. The learning rate is $10^{-5}$ with a linear warmup for the first 500 iterations. We tune all the parameters in the U-Net and PoseNet.
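The learning-rate schedule stated above can be written as a short function. This is a sketch under assumptions: the text specifies only the peak rate of $10^{-5}$ and a linear warmup over the first 500 iterations, so starting the ramp at `base_lr / warmup_steps` and holding the rate constant afterwards are choices made here for illustration, as is the function name.

```python
def learning_rate(step, base_lr=1e-5, warmup_steps=500):
    """Linear warmup to base_lr over warmup_steps, then a constant rate.

    `step` is the 0-indexed training iteration.
    Assumption: no decay is applied after the warmup phase.
    """
    if step < warmup_steps:
        return base_lr * ((step + 1) / warmup_steps)
    return base_lr


# Usage: rates at the first, the last warmup, and a later iteration.
rates = [learning_rate(s) for s in (0, 499, 5000)]
```

With 8 GPUs and a per-device batch size of 1, each iteration processes 8 samples, so the warmup spans roughly 4,000 training samples.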

4.2 Comparison to state-of-the-art methods

We compare our method with the latest state-of-the-art pose-guided human video generation methods, including MagicPose [4], Moore-AnimateAnyone [38], and MuseV [39]. Following previous works [3, 4], we adopt the TikTok [40] dataset and use sequences 335 to 340 for testing.

We provide both qualitative and quantitative comparisons, complemented by a user study. Each method has a different input aspect ratio. To ensure a fair comparison, we only consider the central square region of the videos. Specifically, to accommodate each method’s unique input aspect ratios, we individually apply a center crop to the reference image and pose sequence. Then, we extract the center squares from the generated videos for a fair comparison across different methods. This applies to all experiments in comparison to state-of-the-art methods.
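Extracting the central square region amounts to a simple box computation, sketched below. The function name and the (left, top, right, bottom) convention with exclusive right and bottom edges are assumptions; the frame sizes in the usage lines are illustrative only.

```python
def center_square_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square.

    The right and bottom edges are exclusive, so the side length is
    right - left == bottom - top == min(width, height).
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


# Usage: a portrait frame and a landscape frame (illustrative sizes).
portrait_box = center_square_box(576, 1024)
landscape_box = center_square_box(1024, 576)
```

Because every method is evaluated on the same centered square, differences in native aspect ratio do not change which part of the scene is compared.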

Qualitative evaluation

We conduct qualitative comparisons between the selected baselines and our method. In Figure 5, we showcase sample frames to highlight the superior quality of individual frames produced by our method. Additionally, in Figure 6, we illustrate the temporal differences, demonstrating the enhanced temporal smoothness of our approach compared to existing methods.

Figure 5 presents a comparison of the generated frames, where each row represents a distinct example. The first row demonstrates the superior hand quality achieved by our approach, while the second row showcases the improved adherence to pose guidance. These improvements directly result from our confidence-aware pose guidance and hand region enhancement design.

Importantly, our method shows superior temporal smoothness, characterized by smooth motion and minimal flickering. To make this visible, we present the pixel-wise differences between consecutive frames in Figure 6, which reflect the temporal stability of each method. From the figure, it is evident that MagicPose [4] exhibits abrupt transitions, Moore-AnimateAnyone [38] shows flickering in the texture of clothing wrinkles, and MuseV [39] struggles with generating consistent text on clothing. In contrast, our method maintains stable inter-frame differences without obvious artifacts, demonstrating better temporal smoothness in our generation results. Videos are included on the project page. This enhancement in temporal smoothness is likely due to the robustness provided by our confidence-aware pose guidance, which effectively mitigates the impact of inaccurate pose inputs and temporal noise. By weighting the influence of pose signals based on their confidence, our method ensures that the generated videos maintain a high level of temporal smoothness in the presence of noise.
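A pixel-wise difference map of the kind shown in Figure 6 can be computed as sketched below. The exact difference measure used for the figure is not stated, so the absolute difference on grayscale frames, the nested-list frame format, and the function names are assumptions.

```python
def consecutive_frame_differences(frames):
    """Absolute per-pixel difference between each pair of consecutive frames.

    frames: list of grayscale frames, each a list of rows of numeric pixels.
    Returns len(frames) - 1 difference maps with the same height and width.
    """
    return [
        [[abs(c - p) for p, c in zip(row_prev, row_cur)]
         for row_prev, row_cur in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]


def mean_value(diff_map):
    """Average of a difference map, a rough scalar summary of frame change."""
    values = [v for row in diff_map for v in row]
    return sum(values) / len(values)


# Usage: three tiny 2x2 frames; the last two frames are identical.
clip = [[[0, 0], [0, 0]], [[1, 0], [0, 3]], [[1, 0], [0, 3]]]
diffs = consecutive_frame_differences(clip)
```

Flickering shows up as large, spatially scattered values that change from one difference map to the next, whereas smooth motion yields differences concentrated along moving edges.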

Quantitative evaluation

In Table 8, we present a quantitative comparison of our method against state-of-the-art approaches using the FID-VID [41] and FVD [42] metrics on the test sequences from the TikTok [40] dataset. The results indicate that our method achieves a notable performance advantage over all existing methods in terms of both metrics.

User study

To supplement our quantitative and qualitative evaluations, we conduct a user study to assess the subjective preferences of participants regarding the generated videos on the TikTok dataset test split. The study involves showing two video clips, one generated by our method and the other by one of the baseline methods, to a diverse group of users. Participants are instructed to select the video that they perceive as having higher quality, considering factors such as image quality, flickering, and the temporal smoothness of characters and clothing. We collect data from 36 participants, with each participant evaluating 6 video pairs for our method against each baseline method. As shown in Figure 8, the results indicate a strong preference for MimicMotion over the baseline methods. In comparison to MagicPose and Moore, participants favored almost all videos produced by our method. Although MuseV shows higher image quality than the other baselines, the preference for videos produced by our method still reaches 75.5%. These findings align with our qualitative and quantitative evaluation, reinforcing the effectiveness of our method in meeting user expectations for high-quality human video generation.

Method	FID-VID $\downarrow$	FVD $\downarrow$
MagicPose [4]	13.3	916
Moore [38]	12.4	728
MuseV [39]	14.6	754
MimicMotion (ours)	9.3	594

4.3 Ablation Study

Confidence-aware pose guiding

Figure 9 shows the effectiveness of confidence-aware pose guidance. Each row corresponds to one example. On the left side, we show three images used to extract the pose. On the right side, we plot the guiding signals corresponding to the pose estimation, both with and without confidence-aware pose guiding. From the guiding signals, we can see that there are errors in the pose estimated by DWPose. Nevertheless, our confidence-aware design minimizes the impact of incorrect pose estimation in guidance signals.

Specifically, in the case of Pose 1, the estimation exhibits a duplicate detection issue, which leads to the inclusion of duplicate keypoints. In the case of Pose 2, one hand is obscured, and its keypoints are incorrectly placed on the other hand. In the case of Pose 3, the right elbow is obscured, but it is still detected with confidence above the threshold and therefore wrongly remains in the guidance signal. These problems produce confusing hand guidance signals and ultimately lead to distortions such as deformed hands or wrong spatial relationships in the generated frames.

In contrast, by integrating confidence scores into the pose representation, our method effectively mitigates these issues. The confidence scores provide a measure of reliability for each keypoint, allowing the system to weigh the guidance signals accordingly. Specifically, keypoints with lower confidence, which typically correspond to inaccurate estimates caused by occlusion or motion blur, carry less weight in the guidance. This approach leads to clearer and richer pose guidance, as the influence of potentially erroneous keypoints is reduced. The corresponding generation results demonstrate how our method enhances the robustness of generation against false guiding signals (Pose 1 and Pose 2) and offers visibility hints to resolve the front-back ambiguity of 2D pose estimation (Pose 3).
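One plausible way to integrate confidence into a rendered pose map is to scale the brightness of each keypoint by its confidence score, so that unreliable keypoints fade instead of being dropped by a hard threshold. The sketch below illustrates this idea under that assumption; the type aliases and the functions `confidence_scaled_color` and `guidance_points` are hypothetical and not taken from the paper's implementation.

```python
from typing import List, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence in [0, 1])
RGB = Tuple[int, int, int]


def confidence_scaled_color(base: RGB, confidence: float) -> RGB:
    """Dim a keypoint's colour in proportion to its confidence score."""
    c = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return tuple(round(channel * c) for channel in base)


def guidance_points(
    keypoints: List[Keypoint], base_colors: List[RGB]
) -> List[Tuple[Tuple[float, float], RGB]]:
    """Pair each keypoint position with its confidence-dimmed colour."""
    return [
        ((x, y), confidence_scaled_color(color, conf))
        for (x, y, conf), color in zip(keypoints, base_colors)
    ]


# A confident keypoint keeps its colour; an occluded one is strongly dimmed.
points = guidance_points(
    [(10.0, 20.0, 1.0), (30.0, 40.0, 0.2)],
    [(255, 0, 100), (255, 0, 100)],
)
```

Compared with thresholding, this keeps low-confidence keypoints visible at reduced intensity, which is how a guidance signal could carry a visibility hint such as the one discussed for Pose 3.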

Hand region enhancement

In conjunction with confidence-aware pose guiding, we further improve the quality of hand generation by assigning a higher weight to the hand region in the training loss. Figure 10 compares the generation results with and without hand region enhancement, using the same reference image and pose guidance. All experiments incorporate confidence-aware pose guidance. The hands in the first row are cropped from video frames generated by a model trained without hand region enhancement and exhibit noticeable distortions, such as irregular and misplaced fingers. In contrast, the results of the model trained with hand region enhancement (the second row) show consistent improvements in hand generation quality and a reduction in hand distortion. These results show the effectiveness of the proposed hand region enhancement design, which substantially mitigates hand distortion, a prevalent challenge in diffusion-based models.

Moreover, hand region enhancement improves the visual appeal of the generated content. The hand region is often the area of interest that human observers tend to focus on. By emphasizing the hand regions, we align the regional preferences of the training process with human preferences, thereby enhancing the visual appeal of the generation results.
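The idea of weighting the hand region more heavily in the loss can be sketched as follows. This is a simplified illustration rather than the training code: it assumes integer pixel coordinates, a rectangular mask around confident hand keypoints, and a mean squared error loss. The function names and the values of `threshold`, `pad` and `scale` are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]


def hand_mask(height: int, width: int,
              hand_keypoints: List[Tuple[int, int, float]],
              threshold: float = 0.8, pad: int = 0) -> List[List[int]]:
    """Binary mask over the padded bounding box of confident hand keypoints."""
    confident = [(x, y) for x, y, c in hand_keypoints if c > threshold]
    mask = [[0] * width for _ in range(height)]
    if not confident:
        return mask  # no reliable hand keypoints: nothing is amplified
    x0 = max(min(x for x, _ in confident) - pad, 0)
    x1 = min(max(x for x, _ in confident) + pad, width - 1)
    y0 = max(min(y for _, y in confident) - pad, 0)
    y1 = min(max(y for _, y in confident) + pad, height - 1)
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            mask[y][x] = 1
    return mask


def weighted_mse(pred: Grid, target: Grid, mask: List[List[int]],
                 scale: float = 10.0) -> float:
    """Mean squared error with masked pixels amplified by `scale`."""
    total, count = 0.0, 0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            total += (scale if m else 1.0) * (p - t) ** 2
            count += 1
    return total / count


mask = hand_mask(4, 4, [(1, 1, 0.9), (2, 2, 0.95), (3, 3, 0.1)])
```

In this sketch the low-confidence keypoint at (3, 3) does not enlarge the mask, so the amplification is applied only where the hand location is reliable.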

Progressive latent fusion

To achieve seamless transitions between video segments, we introduce progressive latent fusion, a technique that gradually blends frames in the overlapped regions of consecutive video segments. The original MultiDiffusion approach employs a simple averaging of frames within the overlap region. As illustrated in Figure 11(a), this method assigns equal weight to all frames in the overlap region, irrespective of their temporal position (whether they are closer to the preceding or the subsequent segment). Without a gradual shift of influence from one segment to the next, the video can exhibit abrupt transitions and noticeable flicker. This is evident in the y-t slice shown on the left, where the transition at segment boundaries is abrupt. The right side of the figure shows four frames at the segment boundary. Note that the background in the top-left corner (enlarged) is initially clear in segment 1. It suddenly becomes blurry in the overlapped region and then just as suddenly becomes clear again in the main part of segment 2. This artifact is not observed when progressive latent fusion is applied.

The proposed progressive latent fusion approach (see Figure 11(b)) effectively mitigates these issues. The y-t slice on the left demonstrates that this method enables a smooth transition across segment boundaries, eliminating the abrupt changes seen in the original approach. The right side of the figure shows that the sudden blurring no longer appears. This strategy significantly mitigates flickering artifacts, thus improving the overall temporal coherence of long generated videos.
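The contrast between uniform averaging and a position-dependent blend can be illustrated with a small sketch. It assumes a linear ramp of weights across the overlap and represents each frame latent by a single number for readability. The ramp shape and the function names are assumptions made for illustration and may differ from the actual weighting scheme.

```python
from typing import List


def uniform_weights(overlap: int) -> List[float]:
    """Weight of the next segment under plain averaging: always one half."""
    return [0.5] * overlap


def progressive_weights(overlap: int) -> List[float]:
    """Weight of the next segment, ramping linearly across the overlap."""
    return [(i + 1) / (overlap + 1) for i in range(overlap)]


def fuse_overlap(prev_latents: List[float], next_latents: List[float],
                 next_weights: List[float]) -> List[float]:
    """Blend overlapping frame latents of two consecutive segments."""
    return [
        (1.0 - w) * a + w * b
        for a, b, w in zip(prev_latents, next_latents, next_weights)
    ]


# Toy latents: the previous segment predicts 0.0, the next one predicts 1.0.
prev_latents = [0.0, 0.0, 0.0]
next_latents = [1.0, 1.0, 1.0]
averaged = fuse_overlap(prev_latents, next_latents, uniform_weights(3))
progressive = fuse_overlap(prev_latents, next_latents, progressive_weights(3))
```

With uniform weights, every overlapped frame jumps to the midpoint of the two predictions and then jumps again when the overlap ends. With the ramp, the fused value moves gradually from the previous segment's prediction to the next one's.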

5 Conclusion

In this study, we introduce MimicMotion, a pose-guided human video generation model that leverages confidence-aware pose guidance and progressive latent fusion to produce high-quality, long videos of human motion. Through extensive experiments and ablation studies, we show that our model adapts better to noisy pose estimation, improves hand quality, and ensures temporal smoothness. The integration of confidence scores into pose guidance, the enhancement of hand region loss, and the implementation of progressive latent fusion are crucial in achieving these improvements, resulting in more visually compelling and realistic human video generation.

References

Ma et al. [2024a] Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4117–4125, 2024a.
Karras et al. [2023] Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. Dreampose: Fashion image-to-video synthesis via stable diffusion. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 22623–22633. IEEE, 2023.
Wang et al. [2023] Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, and Lijuan Wang. Disco: Disentangled control for realistic human dance generation. arXiv preprint arXiv:2307.00040, 2023.
Chang et al. [2023] Di Chang, Yichun Shi, Quankai Gao, Hongyi Xu, Jessica Fu, Guoxian Song, Qing Yan, Yizhe Zhu, Xiao Yang, and Mohammad Soleymani. Magicpose: Realistic human poses and facial expressions retargeting with identity-aware diffusion. In Forty-first International Conference on Machine Learning, 2023.
Hu [2024] Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8153–8163, 2024.
Xu et al. [2024] Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024.
Feng et al. [2023] Mengyang Feng, Jinlin Liu, Kai Yu, Yuan Yao, Zheng Hui, Xiefan Guo, Xianhui Lin, Haolan Xue, Chen Shi, Xiaowen Li, et al. Dreamoving: A human video generation framework based on diffusion models, 2023.
Zhu et al. [2024] Shenhao Zhu, Junming Leo Chen, Zuozhuo Dai, Yinghui Xu, Xun Cao, Yao Yao, Hao Zhu, and Siyu Zhu. Champ: Controllable and consistent human image animation with 3d parametric guidance, 2024.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Song et al. [2020] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
Rombach et al. [2022a] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022a.
Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
Wang et al. [2024a] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024a.
Singer et al. [2022] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
Guo et al. [2023] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
Ho et al. [2022] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
He et al. [2022] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
Chen et al. [2023] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
Chen et al. [2024] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024.
Blattmann et al. [2023] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
Wu et al. [2023] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
Wang et al. [2024b] Weimin Wang, Jiawei Liu, Zhijie Lin, Jiangqiao Yan, Shuo Chen, Chetwin Low, Tuyen Hoang, Jie Wu, Jun Hao Liew, Hanshu Yan, et al. Magicvideo-v2: Multi-stage high-aesthetic video generation. arXiv preprint arXiv:2401.04468, 2024b.
Yan et al. [2021] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
Yu et al. [2023] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023.
Ma et al. [2024b] Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024b.
Bao et al. [2024] Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024.
Siarohin et al. [2019] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in neural information processing systems, 32, 2019.
Siarohin et al. [2021] Aliaksandr Siarohin, Oliver J Woodford, Jian Ren, Menglei Chai, and Sergey Tulyakov. Motion representations for articulated animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13653–13662, 2021.
Zhao and Zhang [2022] Jian Zhao and Hui Zhang. Thin-plate spline motion model for image animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3657–3666, 2022.
Chan et al. [2019] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5933–5942, 2019.
Voleti et al. [2022] Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. Advances in neural information processing systems, 35:23371–23385, 2022.
Yin et al. [2023] Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, et al. Nuwa-xl: Diffusion over diffusion for extremely long video generation. arXiv preprint arXiv:2303.12346, 2023.
Bar-Tal et al. [2023] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
Bar-Tal et al. [2024] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024.
Yang et al. [2023] Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effective whole-body pose estimation with two-stages distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023.
Rombach et al. [2022b] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022b.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
moo [2024] MooreThreads/Moore-AnimateAnyone, May 2024. URL https://github.com/MooreThreads/Moore-AnimateAnyone. original-date: 2024-01-12T07:55:21Z.
Xia et al. [2024] Zhiqiang Xia, Zhaokang Chen, Bin Wu, Chao Li, Kwok-Wai Hung, Chao Zhan, Yingjie He, and Wenjiang Zhou. Musev: Infinite-length and high fidelity virtual human video generation with visual conditioned parallel denoising. arxiv, 2024.
Jafarian and Park [2021] Yasamin Jafarian and Hyun Soo Park. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12753–12762, June 2021.
Balaji et al. [2019] Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. Conditional gan with discriminative filter generation for text-to-video synthesis. In IJCAI, volume 1, page 2, 2019.
Unterthiner et al. [2018] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.