MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Pengyang Ling^1,4 Jiazi Bu^2,4∗ Pan Zhang^4† Xiaoyi Dong⁴
Yuhang Zang⁴ Tong Wu³ Huaian Chen¹ Jiaqi Wang⁴ Yi **^1†
¹University of Science and Technology of China ²Shanghai Jiao Tong University
³The Chinese University of Hong Kong ⁴Shanghai AI Laboratory
https://github.com/Bujiazi/MotionClone/
∗Equal contribution. †Corresponding author.

Abstract

Motion-based controllable text-to-video generation uses motion cues to control video generation. Previous methods typically require training models to encode motion cues or fine-tuning video diffusion models. However, these approaches often result in suboptimal motion generation when applied outside the trained domain. In this work, we propose MotionClone, a training-free framework that enables motion cloning from a reference video to control text-to-video generation. We employ temporal attention in video inversion to represent the motions in the reference video and introduce primary temporal-attention guidance to mitigate the influence of noisy or very subtle motions within the attention weights. Furthermore, to assist the generation model in synthesizing reasonable spatial relationships and enhance its prompt-following capability, we propose a location-aware semantic guidance mechanism that leverages the coarse location of the foreground from the reference video and original classifier-free guidance features to guide the video generation. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.

1 Introduction

The generation of high-quality videos that align with human intentions has recently attracted significant attention, particularly with the rise of mainstream text-to-video (T2V) diffusion models [11, 3]. Despite the substantial progress witnessed in text-to-image generation, text-to-video generation presents unique challenges, primarily due to the complexities introduced by motion synthesis. Incorporating additional motion cues not only reduces the ambiguity inherent in video synthesis, thereby facilitating the generation of high-quality motion, but also enhances the controllability of the generated content.

Within the domain of text-to-video generation guided by motion cues, existing methods typically fall into two categories: one that leverages the dense optical flow or depth of a reference video [32, 17], and another that employs trajectories [34, 38]. The former often integrates a pre-trained model to extract motion cues at the pixel level. Despite achieving high-quality outputs, these dense motion cues can be entangled with the structural elements of the reference video, thereby limiting their transferability to other objects. In contrast, the latter, which is predicated on trajectories, is more user-friendly for incorporating motion cues. However, while such models proficiently capture macroscopic object movements, they are limited in delineating finer, localized motions such as head turns or hand raises. Additionally, both categories typically necessitate the training of a model to encode motion cues, which often results in suboptimal generation when applied outside the trained domain. In some cases, they also entail the fine-tuning of pre-trained text-to-video models, potentially degrading generation quality.

Figure 1: Given a reference video, MotionClone can clone the contained motion into novel scenarios with excellent prompt-following ability, without motion-specific fine-tuning.

In this work, we introduce MotionClone, a novel training-free framework designed to clone motion from a reference video for controllable text-to-video generation. Diverging from traditional approaches involving dense flow or trajectories, MotionClone employs a temporal-attention mechanism within the video generation model to capture the motion in the reference video. This strategy effectively renders detailed motion while concurrently preserving minimal interdependencies with the structural components of the reference video.

Nevertheless, we find that the majority of weights within the temporal attention tend to correspond to either noisy or very subtle motions. When the temporal-attention constraint is applied uniformly to all weights, these weights can overshadow the motion guidance, resulting in the suppression of the primary motion. To address this limitation, we propose primary temporal-attention guidance, which leverages only the principal components of the temporal-attention weights for motion-guided video generation. This approach enables the model to overlook noisy or less significant motions and concentrate on the primary motion, thus significantly improving the quality of the motion clone. In addition, we observe that directly applying primary temporal-attention component sampling in vanilla video generation enhances the primary motions within the generated videos.

Despite achieving success in cloning motion from the reference video, we have observed that current text-to-video models sometimes synthesize unreasonable spatial relationships and show suboptimal prompt-following capability when guided by motion cues alone. To address this issue, we propose a location-aware semantic guidance mechanism that leverages a coarse foreground location derived from the reference video along with original classifier-free guidance features. The location is obtained from the spatial cross-attention within our generation model. The proposed guidance maintains generative flexibility while enhancing the rationality of spatial relationships in the synthesized video.

In summary, MotionClone is a novel training-free framework designed to clone motion from a reference video for controllable text-to-video generation, and it is composed of primary temporal-attention guidance and location-aware semantic guidance. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object action, with notable superiority in terms of motion fidelity, text alignment, and temporal consistency.

2 Related Work

2.1 Text-to-video diffusion models

Equipped with sophisticated text encoders [28, 39], a great breakthrough has been achieved in the realm of text-to-image (T2I) generation [10, 24, 29, 25], which has sparked enthusiasm for advanced text-to-video (T2V) models [3, 33, 5, 6, 11]. Notably, VideoLDM [3] introduces a motion module that utilizes 3D convolutions and temporal attention to capture frame-to-frame correlations. AnimateDiff [11] enhances a pre-trained T2I diffusion model with motion modeling capabilities. This is achieved by fine-tuning a series of specialized temporal attention layers on extensive video datasets, allowing for a harmonious fusion with the original T2I generation process. To address the challenge of data scarcity, VideoCrafter2 [6] suggests learning motion from low-quality videos [1] while simultaneously learning appearance from high-quality images [31]. Despite these advancements, there remains a significant disparity in the quality of generated content between the available T2V models and their sophisticated T2I counterparts, primarily due to the intricate nature of diverse motions and the limited availability of high-quality video data. In this work, a motion guidance strategy is developed that incorporates motion cues from given videos to ease the challenges of motion modeling, yielding more realistic and coherent video sequences without model fine-tuning.

2.2 Controllable video generation

Building on the success of controllable image generation through the integration of additional conditions [40, 18, 19, 27, 15], a multitude of studies [5, 38, 8, 22, 2] have endeavored to introduce diverse control signals for versatile video generation. These include control over the first video frame [5], motion trajectory [38], motion region [8], and motion object [22]. Furthermore, in pursuit of high-quality video customization, several studies delve into reference-based video generation, leveraging the motion from an existing real video to direct the creation of new video content. A straightforward solution developed in [32, 9, 37] involves the direct integration of frame-wise depth maps or canny maps to regularize motion. However, this approach inadvertently introduces motion-independent features, such as structures in static areas, which can disrupt the alignment of the resulting video appearance with new text. To address this issue, motion-specific fine-tuning frameworks, as explored in [41, 16], have been developed to extract a distinct motion pattern from a single video or a collection of videos with identical motion. While holding promise, these methods are subject to complex training processes and potential model degradation. To avoid these drawbacks, we present a novel motion cloning scheme, which extracts temporal correlations from existing videos as explicit motion cues to guide the generation of new video content, providing a plug-and-play motion customization solution.

2.3 Attention control

Attention mechanisms have been confirmed as vital for high-quality content generation. Prompt2Prompt [12] illustrates that cross-attention maps are instrumental in dictating the spatial layout of synthesized images. This observation subsequently motivates a series of works in semantic preservation [4], multi-object generation [21, 36], and video editing [20]. FreeControl [23] highlights that the feature space within self-attention layers encodes structural image information, facilitating reference-based image generation. While previous methods have concentrated on spatial attention blocks, our work uncovers the untapped potential of temporal attention layers for effective motion guidance, thereby enabling flexible motion cloning from existing videos.

3 MotionClone

In this section, we first introduce video diffusion and temporal attention mechanisms. We then present our observations regarding the temporal attentions within the video diffusion model. Subsequently, we elaborate on the proposed MotionClone framework, which comprises primary temporal-attention guidance and location-aware semantic guidance.

3.1 Preliminaries

Diffusion sampling. Following pioneering work [29], video diffusion models encode an input video $x$ into a latent representation $z=\mathcal{E}(x)$ by using a pre-trained encoder $\mathcal{E}(\cdot)$ . To enable video distribution learning, the diffusion model $\epsilon_{\theta}$ is encouraged to estimate the noise component $\epsilon_{t}$ from the latent $z_{t}$ that follows a time-dependent scheduler [13], i.e.,

\mathcal{L}(\theta)=\mathbb{E}_{\mathcal{E}(x),\epsilon_{t}\sim\mathcal{N}(0,1),t\sim\mathcal{U}(1,T)}\left[\|\epsilon_{t}-\epsilon_{\theta}(z_{t},c,t)\|_{2}^{2}\right],

(1)

where $t$ is the time step, and $c$ is the condition signal such as text. In the inference phase, the generative process commences with a standard Gaussian noise. The trajectory of sampling can be adjusted by incorporating classifier-free guidance [14], denoted as $\epsilon_{\theta}(z_{t},\phi,t)$ , and an additional energy function, represented by $g(z_{t},y,t)$ , which is parameterized by the label $y$ , i.e.,

\hat{\epsilon}_{\theta}=\epsilon_{\theta}(z_{t},c,t)+s\,(\epsilon_{\theta}(z_{t},c,t)-\epsilon_{\theta}(z_{t},\phi,t))+\lambda\,g(z_{t},y,t),

(2)

where $s$ and $\lambda$ are guidance weights, and $\phi$ denotes null text or negative prompt.
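As a minimal sketch of how Eq. 2 combines its terms, the snippet below forms the guided noise estimate from three arrays of equal shape. The function name `guided_noise`, the toy tensor shape, and the default weights are illustrative assumptions; the guidance term is passed in as a precomputed array (in practice it would typically be derived from the energy function, for example as its gradient with respect to the latent, which is an assumption here and not stated in the text above).

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, guidance_term, s=7.5, lam=1.0):
    """Combine the terms of Eq. 2: eps_c + s * (eps_c - eps_phi) + lam * g.

    All inputs are arrays of identical shape. `guidance_term` is assumed to be
    precomputed and to live in the same space as the noise prediction.
    """
    return eps_cond + s * (eps_cond - eps_uncond) + lam * guidance_term

# Toy noise predictions with a hypothetical shape (frames, channels, height, width).
rng = np.random.default_rng(0)
shape = (16, 4, 8, 8)
eps_c = rng.standard_normal(shape)    # conditional prediction
eps_phi = rng.standard_normal(shape)  # null-text / negative-prompt prediction
g = np.zeros(shape)                   # no extra guidance in this toy example

eps_hat = guided_noise(eps_c, eps_phi, g, s=7.5, lam=1.0)
```

With a zero guidance term the expression reduces to the classifier-free guidance part of Eq. 2, and when the conditional and unconditional predictions coincide it returns the conditional prediction unchanged.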

Temporal attention. To model video motion, temporal attention is introduced to establish correlation across frames. Given a video feature with $F$ frames $f_{in}\in\mathbb{R}^{B\times F\times C\times H\times W}$ , the temporal attention mechanism reshapes this tensor into a 3D tensor $\bar{f}_{in}\in\mathbb{R}^{(B\times H\times W)\times F\times C}$ by merging the spatial dimensions into the batch dimension. Subsequently, it executes self-attention along the frame axis, i.e.,

{f}_{out}=Attention(Q(\bar{f}_{in}),K(\bar{f}_{in}),V(\bar{f}_{in})),

(3)

where $Q(\cdot)$ , $K(\cdot)$ , and $V(\cdot)$ are projection layers. The corresponding attention map can be obtained as $\mathcal{A}\in\mathbb{R}^{(B\times H\times W)\times F\times F}$ , which represents the temporal relation for each spatial pixel. For the sake of brevity, in the ensuing exposition, we employ the latent representation $z$ to denote videos, given that all operations are executed within the latent space.
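The reshaping and the resulting attention map can be illustrated with a small NumPy sketch. It assumes single-head scaled dot-product attention with plain linear projections `w_q` and `w_k`; these, and the function names, are illustrative assumptions and not the actual layers of any particular video diffusion model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_map(f_in, w_q, w_k):
    """Return the temporal attention map of shape (B*H*W, F, F)."""
    b, f, c, h, w = f_in.shape
    # (B, F, C, H, W) -> (B, H, W, F, C) -> (B*H*W, F, C):
    # every spatial position becomes its own sequence of F frame tokens.
    f_bar = f_in.transpose(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
    q = f_bar @ w_q
    k = f_bar @ w_k
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
B, F, C, H, W = 1, 16, 8, 4, 4
f_in = rng.standard_normal((B, F, C, H, W))
w_q = rng.standard_normal((C, C))
w_k = rng.standard_normal((C, C))

attn = temporal_attention_map(f_in, w_q, w_k)
print(attn.shape)  # (B*H*W, F, F)
```

Each row of the map is a distribution over frames for one spatial position and one query frame, which is the normalization property used later by the guidance terms.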

3.2 Observation

Since the motion in the generated video is governed by temporal attention mechanisms, videos with similar temporal attentions should exhibit similar motion characteristics. To investigate this hypothesis, we control the denoising process by aligning the temporal attentions of the generated video with those of a reference video. As depicted in Fig. 2, this plain control can preserve certain motion patterns from the reference video, such as the gait of a cat and the directional movement of a tank. However, such naive guidance based on the full temporal attention reproduces the reference motion only partially, particularly with regard to the amplitude of motion for both objects and the camera. We postulate that this is because the majority of weights within the temporal attention correspond to either noisy or exceedingly subtle motions, and such weights can obscure the motion guidance. The proposed primary control strategy, elaborated in Sec. 3.3, effectively replicates the reference motion.

In addition, we directly apply primary component sampling to the temporal attention module of the video generation model during the inference phase, and a significant enhancement in the motion within the generated videos is observed, as illustrated in Fig. 3. This finding further supports the notion that the principal components of temporal attention represent the primary motions in the generated videos, which we aim to utilize for motion cloning.

3.3 Methodology

Method overview. The framework of MotionClone is depicted in Fig. 4. Given a real reference video $z^{g}$ , we employ DDIM [30] inversion to obtain the time-dependent latent set $S_{z^{g}}=\left\{z_{1}^{g},z_{2}^{g},...,z_{t}^{g},...,z_{T}^{g}\right\}$ . During the video generation process, an initial latent $z_{T}$ is sampled from a standard Gaussian distribution and subsequently duplicated to create a sibling latent $\bar{z}_{T}$ . This sibling latent, in conjunction with classifier-free guidance, is utilized to confer semantic appearance characteristics derived from the prompt. At each denoising step, $z_{t}^{g}$ , $z_{t}$ , and $\bar{z}_{t}$ are fed to the pretrained video diffusion model, in which the motion encapsulated by the temporal attentions of $z_{t}^{g}$ and the semantic appearance encoded within the cross attentions of $\bar{z}_{t}$ are jointly employed to guide the denoising process of $z_{t}$ . Specifically, motion guidance is implemented by aligning the primary temporal-attention components of the generated latent $z_{t}$ with those of the reference latent $z_{t}^{g}$ . This alignment drives $z_{t}$ to clone motion in regions where $z_{t}^{g}$ exhibits substantial motion activity. For location-aware semantic guidance, we derive coarse object masks from the cross-attention layers of $z_{t}^{g}$ and $\bar{z}_{t}$ , with the reference mask encoded using a Gaussian kernel. These masks are then utilized to guide $z_{t}$ , leveraging the spatial location information from $z_{t}^{g}$ and the semantic appearance details from $\bar{z}_{t}$ . The joint guidance facilitates the generation of videos that exhibit compelling motion fidelity and precise textual alignment.

Primary temporal-attention guidance. Let $\mathcal{A}^{g}\in\mathbb{R}^{(1\times H\times W)\times F\times F}$ denote the temporal attention from a given reference video, which satisfies $\sum_{j=1}^{F}\mathcal{A}_{(p,i,j)}^{g}=1$ . In the subsequent exposition, the time step $t$ is omitted for brevity. The value of $\mathcal{A}_{(p,i,j)}^{g}$ reflects the relation between frame $i$ and frame $j$ at position $p$ , and a larger value of $\mathcal{A}_{(p,i,j)}^{g}$ implies a stronger correlation. The primary temporal-attention guidance $g_{m}$ for motion cloning can be expressed as:

g_{m}=\left\|\mathcal{M}\cdot(\mathcal{A}^{g}-\mathcal{A})\right\|_{2}^{2},

(4)

where $\mathcal{M}$ is the temporal mask for the primary temporal attention constraint, and $\mathcal{A}$ is the temporal attention of $z$ . In particular, $\mathcal{M}=1$ represents the plain temporal-attention constraint, i.e., the “plain control” that exhibits limited motion transfer capability as illustrated in Fig. 2. We postulate that this is because the majority of weights within the temporal attention correspond to either noisy or subtle motions, and these weights may obscure the motion guidance. Following this postulate, a rudimentary approach is threshold-based filtering with a threshold $\alpha$ :

\mathcal{M}_{(p,i,j)}^{threshold}:=\left\{\begin{array}{ll}1, & \text{if}\ \mathcal{A}_{(p,i,j)}^{g}\geq\alpha\\ 0, & \text{otherwise}.\end{array}\right.

(5)

However, this straightforward attempt leads to unstable control. Therefore, we propose to obtain the sparse temporal mask according to the rank of the $\mathcal{A}^{g}$ values along the temporal axis. Let $\hat{\mathcal{A}}^{g}$ denote the attention $\mathcal{A}^{g}$ sorted along the temporal axis in decreasing order, i.e., $\hat{\mathcal{A}}_{(p,i,k_{1})}^{g}\geq\hat{\mathcal{A}}_{(p,i,k_{2})}^{g}$ for arbitrary $k_{1}\leq k_{2}$ . Consequently, the subset comprising the top $k$ values can be defined as $\Omega_{(p,i)}^{k}=\left\{\hat{\mathcal{A}}_{(p,i,1)}^{g},\hat{\mathcal{A}}_{(p,i,2)}^{g},...,\hat{\mathcal{A}}_{(p,i,k)}^{g}\right\}$ . The mask of rank sampling is defined as:

\mathcal{M}_{(p,i,j)}^{rank}:=\left\{\begin{array}{ll}1, & \text{if}\ \mathcal{A}_{(p,i,j)}^{g}\in\Omega_{(p,i)}^{k}\\ 0, & \text{otherwise},\end{array}\right.

(6)

where $k$ is a hyper-parameter. Particularly, in the case where $k=1$ , the guidance focuses solely on the highest activation for each spatial location. This ranking-based sparse strategy enables primary motion guidance from the reference video.
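The two masking strategies and the guidance of Eq. 4 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the released implementation: the function names are hypothetical, the attention maps are random toy data, and ties are resolved by index so that exactly $k$ entries are kept per row (the set-membership definition in Eq. 6 could keep more when values tie).

```python
import numpy as np

def threshold_mask(attn_ref, alpha):
    """Eq. 5: keep entries of the reference attention that are >= alpha."""
    return (attn_ref >= alpha).astype(attn_ref.dtype)

def rank_mask(attn_ref, k):
    """Eq. 6: keep the k largest entries along the temporal (last) axis."""
    top_idx = np.argsort(-attn_ref, axis=-1)[..., :k]
    mask = np.zeros_like(attn_ref)
    np.put_along_axis(mask, top_idx, 1.0, axis=-1)
    return mask

def motion_guidance(attn_ref, attn_gen, mask):
    """Eq. 4: squared L2 norm of the masked attention difference."""
    return float(np.sum((mask * (attn_ref - attn_gen)) ** 2))

# Toy row-normalized attention maps of shape (H*W, F, F).
rng = np.random.default_rng(0)
P, F = 6, 16

def random_attention():
    a = rng.random((P, F, F))
    return a / a.sum(axis=-1, keepdims=True)

attn_ref, attn_gen = random_attention(), random_attention()

mask_top1 = rank_mask(attn_ref, k=1)
mask_top3 = rank_mask(attn_ref, k=3)
g_plain = motion_guidance(attn_ref, attn_gen, np.ones_like(attn_ref))
g_top1 = motion_guidance(attn_ref, attn_gen, mask_top1)
```

Because the rank mask always keeps the same number of entries per row, its sparsity does not depend on the scale of the attention values, whereas the number of entries kept by a fixed threshold varies from row to row.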

Location-aware semantic guidance. The primary motion guidance effectively facilitates video motion. However, there is no guarantee that the generated videos will be physically plausible, which introduces the risk of unrealistic generations due to misalignments between motion and appearance, as shown in Fig. 5. This phenomenon has minimal impact on camera motion scenarios that are characterized by globally consistent motion; however, it substantially degrades the quality in object action scenarios where there is a lack of motion coherence between the foreground and background. Additionally, motion guidance may attenuate the model’s adherence to the prompt, resulting in limited prompt-following generation.

To address the above issues, we propose to employ the foreground location from the reference video $z^{g}$ and the appearance of the sibling video $\bar{z}$ to provide joint semantic guidance. Specifically, given the foreground tokens derived from the prompts, the foreground masks of the reference video $z^{g}$ and the sibling video $\bar{z}$ , denoted as $M^{g}$ and $\bar{M}$ respectively, are extracted from the cross-attention layers [12]. To eliminate structural information from the reference video, we employ a Gaussian kernel to encode the foreground mask, using the mask’s center as the mean of the kernel. Formally, the Gaussian kernel can be obtained by the following equation:

G_{p}=\frac{1}{2\pi\sigma^{2}}e^{-\frac{\|p-p_{0}\|^{2}}{2\sigma^{2}}},

(7)

where $p_{0}$ represents the coordinates of the foreground mask’s center, and $\sigma$ is the standard deviation of the Gaussian distribution. Finally, the location-aware semantic guidance is defined as:

g_{s}=\left\|\frac{G_{p}\cdot M_{p}^{g}\cdot\mathcal{F}_{p}}{{\textstyle\sum_{p}}G_{p}\cdot M_{p}^{g}}-\frac{\bar{M}_{p}\cdot\bar{\mathcal{F}}_{p}}{{\textstyle\sum_{p}}\bar{M}_{p}}\right\|^{2}+\left\|\frac{(1-G_{p})\cdot(1-M_{p}^{g})\cdot\mathcal{F}_{p}}{{\textstyle\sum_{p}}(1-G_{p})\cdot(1-M_{p}^{g})}-\frac{(1-\bar{M}_{p})\cdot\bar{\mathcal{F}}_{p}}{{\textstyle\sum_{p}}(1-\bar{M}_{p})}\right\|^{2},

(8)

where $p$ represents the spatial coordinate, while $\mathcal{F}$ and $\bar{\mathcal{F}}$ denote the key features obtained from the self-attention layers of the generated video $z$ and the sibling video $\bar{z}$ , respectively. Essentially, location-aware semantic guidance leverages the foreground location from the reference video and aligns the global appearance of $\mathcal{F}_{p}$ with that of $\bar{\mathcal{F}}_{p}$ . Consequently, it assists the generation model in synthesizing more reasonable spatial relationships and enhances its ability to adhere to the prompt.
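A small NumPy sketch of Eqs. 7 and 8 for a single frame is given below. It rests on several assumptions that are not stated in the text: the mask center $p_{0}$ is taken as the centroid of the non-zero mask pixels, each fraction in Eq. 8 is read as a weighted average of features over spatial positions, masks are assumed to be neither empty nor full, and a small `eps` guards the denominators. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(mask, sigma):
    """Eq. 7 on an (H, W) grid, centered at the centroid of the mask (assumed p_0)."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    yy, xx = np.mgrid[0:h, 0:w]
    dist2 = (yy - cy) ** 2 + (xx - cx) ** 2
    return np.exp(-dist2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)

def weighted_mean(weights, feats, eps=1e-8):
    """Average (H, W, C) features with (H, W) weights, giving a (C,) vector."""
    return (weights[..., None] * feats).sum(axis=(0, 1)) / (weights.sum() + eps)

def semantic_guidance(feat, feat_sib, mask_ref, mask_sib, sigma):
    """Eq. 8: foreground term plus background term."""
    g = gaussian_kernel(mask_ref, sigma)
    fg = weighted_mean(g * mask_ref, feat) - weighted_mean(mask_sib, feat_sib)
    bg = (weighted_mean((1 - g) * (1 - mask_ref), feat)
          - weighted_mean(1 - mask_sib, feat_sib))
    return float(np.sum(fg**2) + np.sum(bg**2))

H, W, C = 8, 8, 4
mask_ref = np.zeros((H, W)); mask_ref[2:5, 2:5] = 1.0  # reference foreground
mask_sib = np.zeros((H, W)); mask_sib[3:7, 1:4] = 1.0  # sibling foreground
kernel = gaussian_kernel(mask_ref, sigma=2.0)

same = np.ones((H, W, C))
gs_same = semantic_guidance(same, same, mask_ref, mask_sib, sigma=2.0)
gs_diff = semantic_guidance(same, 2 * same, mask_ref, mask_sib, sigma=2.0)
```

In this reading, the guidance vanishes when the pooled foreground and background appearance of the two feature maps agree, regardless of where the foreground sits in each, which matches the stated goal of taking only the location from the reference video.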

Controllable video generation. Based on the primary temporal-attention guidance and location-aware semantic guidance, controllable video generation can be achieved by replacing the energy function in Eq. 2:

\hat{\epsilon}_{\theta}=\epsilon_{\theta}(z_{t},c,t)+s\,(\epsilon_{\theta}(z_{t},c,t)-\epsilon_{\theta}(z_{t},\phi,t))+\lambda_{1}g_{m}+\lambda_{2}g_{s},

(9)

where $s$ is the weight for classifier-free guidance, $\lambda_{1}$ and $\lambda_{2}$ are weights for motion guidance and semantic guidance, respectively.

4 Experiments

4.1 Implementation details

In this work, AnimateDiff [11] is adopted as the base text-to-video generation model. When the reference video is a synthesized 16-frame video with resolution $512\times 512$ , the number of generation steps is set to 100, with guidance applied only in the first 60 steps; generation typically takes around 1 minute. For real videos, DDIM inversion is adopted to obtain latent representations. The time cost is around 3 minutes with 1000 inversion steps, and the numbers of guidance steps and generation steps are extended to 300 and 500, respectively. The motion guidance is conducted on the temporal attention layers in “up block.1” and “up block.2”, which were empirically found to give the best performance. We apply semantic guidance in the self-attention layers in “up block.1”. $s$ , $\lambda_{1}$ , and $\lambda_{2}$ are empirically set to 7.5, 20000, and 200, respectively.

4.2 Experimental setup

Dataset. For experimental evaluation, 30 text-video pairs sourced from DAVIS [26] are utilized for a thorough analysis. These videos encompass a rich tapestry of motion types and scenarios, ranging from the dynamic motions of animals and humans to the global camera motion.

Evaluation metrics. For objective evaluation, two commonly used metrics are adopted: i) Textual alignment, which quantifies the congruence with the provided textual prompt. Following previous work [32], it is measured by the average CLIP [28] cosine similarity between all video frames and text [16]; ii) Temporal consistency, the indicator of video smoothness, is quantified by calculating the average CLIP similarity among consecutive video frames. Beyond the scope of objective metrics, a user study has been employed for a more nuanced assessment of human preferences in video quality, incorporating two additional criteria: i) motion preservation which evaluates the motion’s adherence to the reference video, and ii) appearance diversity which assesses the visual range and diversity in contrast to the reference video. The scores of the user study are derived from the average ratings provided by 20 volunteers, ranging from 1 to 5.
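The two objective metrics can be sketched as follows, assuming the per-frame image embeddings and the text embedding have already been computed by a CLIP-style encoder. No encoder is called here; the function names and the toy 2-D embeddings are illustrative assumptions.

```python
import numpy as np

def _normalize(x):
    # Scale vectors along the last axis to unit length.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def textual_alignment(frame_emb, text_emb):
    """Mean cosine similarity between every frame embedding and the text embedding."""
    return float(np.mean(_normalize(frame_emb) @ _normalize(text_emb)))

def temporal_consistency(frame_emb):
    """Mean cosine similarity between consecutive frame embeddings."""
    f = _normalize(frame_emb)
    return float(np.mean(np.sum(f[:-1] * f[1:], axis=-1)))

# Toy embeddings: four identical frames in a 2-D embedding space.
frames = np.tile(np.array([1.0, 0.0]), (4, 1))
text_match = np.array([2.0, 0.0])     # same direction as the frames
text_mismatch = np.array([0.0, 3.0])  # orthogonal to the frames

print(textual_alignment(frames, text_match))     # 1.0
print(textual_alignment(frames, text_mismatch))  # 0.0
print(temporal_consistency(frames))              # 1.0
```

Note that a static video scores perfectly on temporal consistency under this definition, so the metric is best read together with a measure of motion quality.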

Baselines. MotionClone aims to achieve high-quality training-free video motion transfer. For a thorough comparative analysis, various alternative methods are included in the comparison, namely VideoComposer [32], Tune-A-Video [35], Control-A-Video [7], VMC [16], Gen-1 [9], and MotionCtrl [34]. A detailed description of each method is given in Appendix A.

4.3 Qualitative comparison

Vanilla T2V model. The integration of motion guidance fulfills two principal objectives: it enhances the customization of video motion and improves the quality of generated motion. As shown in Fig. 6, MotionClone achieves superior quality in terms of motion fidelity and controllability, a result attributed to the reduction of inherent ambiguities within the video synthesis process.

Camera motion clone. As shown in Fig. 7, the "clockwise rotation" motion presents a significant challenge. Despite MotionCtrl’s commendable effort in motion preservation, it fails to produce adequate appearance details. VMC and Tune-A-Video generate scenes with acceptable text alignment but exhibit deficiencies in motion transfer. The outputs from VideoComposer, Gen-1, and Control-A-Video are notably unrealistic, which can be attributed to the dense integration of the structural elements from the original videos. Conversely, MotionClone demonstrates superior text alignment and motion consistency, thereby suggesting its superior video motion transfer capabilities within global camera motion scenarios.

Object motion clone. Beyond camera motion, the proficiency in handling local object motions has also been validated. As evidenced by Fig. 8, VMC falls short in matching the motion of the source videos. VideoComposer tends to generate grayish colors with limited prompt-following ability. Gen-1 is constrained by the structure of the original videos. Tune-A-Video struggles with capturing detailed body motions, while Control-A-Video cannot maintain a faithful appearance. In contrast, MotionClone stands out in scenarios with localized object motions, improving both motion accuracy and text alignment.

Table 1: Comparison on the DAVIS dataset using automatic metrics and a user study.

Method	VMC	VideoComposer	Gen-1	Tune-A-Video	Control-A-Video	MotionClone
Text Alignment (automatic)	0.337	0.265	0.311	0.332	0.291	0.342
Temporal Consistency (automatic)	0.942	0.942	0.932	0.948	0.918	0.949
Motion Preservation (user study)	3.48	3.93	4.07	3.03	4.20	4.67
Appearance Diversity (user study)	3.87	3.68	4.03	3.95	3.77	4.28
Text Alignment (user study)	4.27	2.95	3.29	3.70	3.25	4.40
Temporal Consistency (user study)	3.32	3.65	3.45	2.47	3.16	4.43

Table 2: Quantitative results of ablation study. Motion Preservation and Motion Quality are subjective metrics obtained from user study.

Method	Text Alignment	Temporal Consistency	Motion Preservation	Motion Quality
w/o motion control	0.353	0.987	1.47	2.62
Threshold mask	0.302	0.932	4.52	3.14
w/o semantic control	0.295	0.939	4.55	3.82
MotionClone	0.342	0.949	4.63	4.37

4.4 Quantitative comparison

The quantitative comparison results on the DAVIS dataset are outlined in Tab. 1. MotionClone achieves the highest scores among the compared methods in both textual alignment and temporal consistency. Moreover, MotionClone outperforms the other methods in motion preservation, appearance diversity, temporal consistency, and textual alignment in the human preference tests, underscoring its ability to produce visually compelling outcomes.

Effect of primary temporal-attention guidance. We validate the effect of the primary motion guidance strategy with two variants: i) w/o motion control, which sets $\lambda_{1}=0$ in Eq. 9, and ii) threshold masking as in Eq. 5. The results are presented in Fig. 9 and Tab. 2. Videos without motion control tend to exhibit minimal movement or remain static, which leads to higher temporal consistency scores. However, these videos significantly underperform compared to MotionClone in terms of motion preservation and motion quality. Additionally, threshold masking is inferior to MotionClone in both qualitative and quantitative results. More results are given in the supplementary materials.

Effect of location-aware semantic guidance. Location-aware semantic guidance aids in the model’s synthesis of plausible spatial relationships and enhances its adherence to the prompt, as illustrated in Fig. 5 and Fig. 9. Furthermore, additional quantitative ablation study results presented in Tab. 2 substantiate these improvements.

5 Conclusion

In this work, we observe that the temporal attention layers embedded within video generation models harbor substantial representational capacities pertinent to video motion transfer. Motivated by these findings, we introduce a training-free method dubbed MotionClone for motion cloning. This methodology is founded on two principal elements: primary temporal-attention guidance, which plays a pivotal role in facilitating motion transfer, and location-aware semantic guidance, responsible for orchestrating the visual appearance. Employing a real reference video, MotionClone demonstrates its capability to preserve motion fidelity robustly while concurrently assimilating novel textual semantics. This framework thereby emerges as a highly adaptable and efficient tool for motion customization within the realm of text-to-video generation.

References

[1] M. Bain, A. Nagrani, G. Varol, and A. Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
[2] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[3] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
[4] H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
[5] H. Chen, M. Xia, Y. He, Y. Zhang, X. Cun, S. Yang, J. Xing, Y. Liu, Q. Chen, X. Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
[6] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024.
[7] W. Chen, J. Wu, P. Xie, H. Wu, J. Li, X. Xia, X. Xiao, and L. Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023.
[8] Z. Dai, Z. Zhang, Y. Yao, B. Qiu, S. Zhu, L. Qin, and W. Wang. Animateanything: Fine-grained open domain image animation with motion guidance. arXiv e-prints, pages arXiv–2311, 2023.
[9] P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
[10] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
[11] Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
[12] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[13] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
[14] J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[15] L. Huang, D. Chen, Y. Liu, Y. Shen, D. Zhao, and J. Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
[16] H. Jeong, G. Y. Park, and J. C. Ye. Vmc: Video motion customization using temporal attention adaption for text-to-video diffusion models. arXiv preprint arXiv:2312.00845, 2023.
[17] H. Jeong and J. C. Ye. Ground-a-video: Zero-shot grounded video editing using text-to-image diffusion models. arXiv preprint arXiv:2310.01107, 2023.
[18] Y. Kim, J. Lee, J.-H. Kim, J.-W. Ha, and J.-Y. Zhu. Dense text-to-image generation with attention modulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7701–7711, 2023.
[19] Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
[20] S. Liu, Y. Zhang, W. Li, Z. Lin, and J. Jia. Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761, 2023.
[21] J. Ma, J. Liang, C. Chen, and H. Lu. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. arXiv preprint arXiv:2307.11410, 2023.
[22] Y. Ma, Y. He, H. Wang, A. Wang, C. Qi, C. Cai, X. Li, Z. Li, H.-Y. Shum, W. Liu, et al. Follow-your-click: Open-domain regional image animation via short prompts. arXiv preprint arXiv:2403.08268, 2024.
[23] S. Mo, F. Mu, K. H. Lin, Y. Liu, B. Guan, Y. Li, and B. Zhou. Freecontrol: Training-free spatial control of any text-to-image diffusion model with any condition. arXiv preprint arXiv:2312.07536, 2023.
[24] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[25] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[26] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
[27] C. Qin, S. Zhang, N. Yu, Y. Feng, X. Yang, Y. Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savarese, et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild. arXiv preprint arXiv:2305.11147, 2023.
[28] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[29] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[30] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[31] K. Sun, J. Pan, Y. Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y. Wang, et al. Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems, 36, 2024.
[32] X. Wang, H. Yuan, S. Zhang, D. Chen, J. Wang, Y. Zhang, Y. Shen, D. Zhao, and J. Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36, 2024.
[33] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023.
[34] Z. Wang, Z. Yuan, X. Wang, T. Chen, M. Xia, P. Luo, and Y. Shan. Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641, 2023.
[35] J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
[36] G. Xiao, T. Yin, W. T. Freeman, F. Durand, and S. Han. Fastcomposer: Tuning-free multi-subject image generation with localized attention. arXiv preprint arXiv:2305.10431, 2023.
[37] J. Xing, M. Xia, Y. Liu, Y. Zhang, Y. He, H. Liu, H. Chen, X. Cun, X. Wang, Y. Shan, et al. Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics, 2024.
[38] S. Yin, C. Wu, J. Liang, J. Shi, H. Li, G. Ming, and N. Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.
[39] B. Zhang, P. Zhang, X. Dong, Y. Zang, and J. Wang. Long-clip: Unlocking the long-text capability of clip. arXiv preprint arXiv:2403.15378, 2024.
[40] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[41] R. Zhao, Y. Gu, J. Z. Wu, D. J. Zhang, J. Liu, W. Wu, J. Keppo, and M. Z. Shou. Motiondirector: Motion customization of text-to-video diffusion models. arXiv preprint arXiv:2310.08465, 2023.

Appendix A: Baseline description

Among the compared methods, VideoComposer[32] generates videos conditioned on features extracted from existing videos, such as frame-wise depth or Canny maps, achieving a compositional approach to controllable video generation. Gen-1[9] leverages the original structure of reference videos to generate new video content, akin to video-to-video translation. Tune-A-Video[35] expands the spatial self-attention of pre-trained text-to-image models into spatio-temporal attention and then fine-tunes it for motion-specific generation. Control-A-Video[7] incorporates the first video frame as an additional motion cue for customized video generation. MotionCtrl[34] introduces the trajectory of both camera and object motion as supplementary signals for conditional motion modeling. VMC[16] aims to distill motion patterns by fine-tuning the temporal attention layers in a pre-trained text-to-video diffusion model.

Appendix B: More results without motion guidance

Without the designed primary temporal-attention guidance, the motion in the resultant videos tends to be minimal or nearly static, akin to the vanilla T2V model. As depicted in Fig. 13, although the generated content is well aligned with the provided prompt, the desired motion is not conveyed in the video sequence. Such near-static outputs score highly on text alignment and temporal consistency (because the frames barely change over time), yet fail to satisfy the human preference for dynamic video content.
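The interaction between static content and temporal consistency can be illustrated with a minimal sketch. It assumes that temporal consistency is scored as the mean cosine similarity between feature vectors of consecutive frames, which is a common choice but is not restated in this appendix. The function names and the toy feature vectors are hypothetical and stand in for real per-frame image embeddings.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_consistency(frame_features: List[Sequence[float]]) -> float:
    """Mean cosine similarity over consecutive frame pairs (assumed metric)."""
    if len(frame_features) < 2:
        raise ValueError("need at least two frames")
    sims = [cosine(a, b) for a, b in zip(frame_features, frame_features[1:])]
    return sum(sims) / len(sims)


# Toy stand-ins for per-frame embeddings (hypothetical values).
static_clip = [[1.0, 2.0, 3.0]] * 4                  # identical frames, no motion
moving_clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # frames change strongly

static_score = temporal_consistency(static_clip)
moving_score = temporal_consistency(moving_clip)
```

Under this assumed metric, a clip of identical frames attains the maximum score, which is why a high temporal-consistency value alone does not indicate that the reference motion was reproduced.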

Appendix C: More generated results

A broader array of generated content is displayed to validate the versatile generation capability of MotionClone. As shown in Figs. 10-12, MotionClone extracts motion cues from a diverse range of existing videos and generates content that is both prompt-aligned and motion-preserving, showcasing its robust motion cloning capabilities.

Appendix D: Limitations

While MotionClone demonstrates notable improvements in cloning motion from reference videos in a training-free manner, it has inherent limitations. First, the motion contained within the reference video must be appropriate for the objects depicted in the new prompt; otherwise, MotionClone may produce unrealistic video outputs. Second, despite the application of primary temporal-attention guidance and a Gaussian kernel to mitigate the impact of structural information from the reference video, a small number of generated samples still retain structural elements from the reference. These limitations will be addressed in future research.

Appendix E: Broader Impact

The development and implementation of MotionClone, a novel training-free framework for motion-based controllable text-to-video generation, carry distinct societal implications, both beneficial and challenging.

On the positive side, MotionClone’s capability to efficiently clone motions from reference videos while ensuring high fidelity and textual alignment opens new avenues in numerous fields. In the realm of digital content creation, film and media professionals can utilize this technology to streamline the production process, enhance narrative expressions, and create more engaging visual experiences without extensive resource commitments. Furthermore, in the educational sector, instructors and content creators can leverage this innovation to produce customized instructional videos that incorporate precise motions aligned with textual descriptions, potentially increasing engagement and comprehension among students. This could be particularly transformative for subjects where demonstration of physical actions or processes plays a crucial role, such as in sports training or scientific experiments.

On the negative side, the power of MotionClone to generate realistic videos based on text and existing motion cues raises concerns about its potential misuse, including the creation of deepfakes or misleading media content. Such applications can undermine trust in media, affect public opinion through the dissemination of false information, and infringe on personal rights and privacy. Moreover, the ease of generating convincing videos might enable the proliferation of propaganda or harmful content that can have widespread negative implications on society.

In conclusion, while MotionClone presents significant advancements in the field of AI-driven video generation, it is imperative that these technologies are developed and utilized with a conscious commitment to ethical standards and regulatory oversight. Promoting transparency in AI-generated content, establishing clear usage guidelines, and fostering an open dialogue about the capabilities and ethics of such technologies are crucial steps in ensuring that the benefits of MotionClone are realized while its risks are effectively mitigated. This involves collaborative efforts among technologists, policymakers, industry stakeholders, and the broader public to steer the responsible development and application of AI-driven media tools.