
MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Pengyang Ling1,4 Jiazi Bu2,4∗ Pan Zhang4† Xiaoyi Dong4
Yuhang Zang4 Tong Wu3 Huaian Chen1 Jiaqi Wang4 Yi **1†
1University of Science and Technology of China  2Shanghai Jiao Tong University
3The Chinese University of Hong Kong  4Shanghai AI Laboratory
https://github.com/Bujiazi/MotionClone/
∗Equal contribution. †Corresponding author.
Abstract

Motion-based controllable text-to-video generation uses motion cues to control video generation. Previous methods typically require training models to encode motion cues or fine-tuning video diffusion models. However, these approaches often result in suboptimal motion generation when applied outside the trained domain. In this work, we propose MotionClone, a training-free framework that enables motion cloning from a reference video to control text-to-video generation. We employ temporal attention in video inversion to represent the motions in the reference video and introduce primary temporal-attention guidance to mitigate the influence of noisy or very subtle motions within the attention weights. Furthermore, to assist the generation model in synthesizing reasonable spatial relationships and enhance its prompt-following capability, we propose a location-aware semantic guidance mechanism that leverages the coarse location of the foreground from the reference video and original classifier-free guidance features to guide the video generation. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.

1 Introduction

The generation of high-quality videos that align with human intentions has recently attracted significant attention, particularly with the rise of mainstream text-to-video (T2V) diffusion models [11, 3]. Despite the substantial progress witnessed in text-to-image generation, the domain of text-to-video generation presents unique challenges, primarily due to the complexities introduced by motion synthesis. Incorporating additional motion elements not only reduces the ambiguity inherent in video synthesis, thereby facilitating the generation of high-quality motion, but also enhances the controllability of the generated content.

Within the domain of text-to-video generation guided by motion cues, existing methods typically fall into two principal categories: one that leverages the dense optical flow or depth of a reference video [32, 17], and another that employs trajectories [34, 38]. The former often integrates a pre-trained model to extract motion cues at the pixel level. Despite achieving high-quality outputs, these dense motion cues can be entangled with the structural elements of the reference video, thereby limiting their transferability to other objects. In contrast, the latter, which is predicated on trajectories, is more user-friendly for incorporating motion cues. However, while such models proficiently capture macroscopic object movements, they are limited in delineating finer, localized motions such as head turns or hand raises. Additionally, both categories typically necessitate training a model to encode motion cues, which often results in suboptimal generation when applied outside the trained domain. In some cases, they also entail fine-tuning pre-trained text-to-video models, potentially degrading generation quality.

Figure 1: Given a reference video, MotionClone can clone the contained motion into novel scenarios with excellent prompt-following ability, without motion-specific fine-tuning.

In this work, we introduce MotionClone, a novel training-free framework designed to clone motion from a reference video for controllable text-to-video generation. Diverging from traditional approaches involving dense flow or trajectories, MotionClone employs a temporal-attention mechanism within the video generation model to capture the motion in the reference video. This strategy effectively renders detailed motion while concurrently preserving minimal interdependencies with the structural components of the reference video.

Nevertheless, we find that the majority of weights within the temporal attention tend to correspond to either noisy or very subtle motions. When temporal attention is applied uniformly across the model, these weights can overshadow the motion guidance, consequently resulting in the suppression of the primary motion. To address this limitation, we propose primary temporal-attention guidance, which leverages only the principal components of the temporal-attention weights for motion-guided video generation. This approach enables the model to overlook noisy or less significant motions and concentrate on the primary motion, thus significantly improving the quality of the motion clone. In addition, we observe that directly applying primary temporal-attention component sampling in vanilla video generation enhances the primary motions within the generated videos.

Despite achieving success in cloning motion from the reference video, we have observed that current text-to-video models sometimes synthesize unreasonable spatial relationships and show suboptimal prompt-following capability when guided by motion cues alone. To address this issue, we propose a location-aware semantic guidance mechanism that leverages a coarse foreground location derived from the reference video along with original classifier-free guidance features. The location is obtained from the spatial cross-attention within our generation model. The proposed guidance maintains generative flexibility while enhancing the rationality of spatial relationships in the synthesized video.

In summary, MotionClone is a novel training-free framework designed to clone motion from a reference video for controllable text-to-video generation, and it is composed of primary temporal-attention guidance and location-aware semantic guidance. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object action, with notable superiority in terms of motion fidelity, text alignment, and temporal consistency.

2 Related Work

2.1 Text-to-video diffusion models

Equipped with sophisticated text encoders [28, 39], text-to-image (T2I) generation has achieved a great breakthrough [10, 24, 29, 25], which has sparked enthusiasm for advanced text-to-video (T2V) models [3, 33, 5, 6, 11]. Notably, VideoLDM [3] introduces a motion module that utilizes 3D convolutions and temporal attention to capture frame-to-frame correlations. AnimateDiff [11] enhances a pre-trained T2I diffusion model with motion modeling capabilities. This is achieved by fine-tuning a series of specialized temporal attention layers on extensive video datasets, allowing for a harmonious fusion with the original T2I generation process. To address the challenge of data scarcity, VideoCrafter2 [6] suggests learning motion from low-quality videos [1] while simultaneously learning appearance from high-quality images [31]. Despite these advancements, there remains a significant disparity in the quality of generated content between the available T2V models and their sophisticated T2I counterparts, primarily due to the intricate nature of diverse motions and the limited availability of high-quality video data. In this work, we develop a motion guidance strategy that incorporates motion cues from given videos to ease the challenges of motion modeling, yielding more realistic and coherent video sequences without model fine-tuning.

2.2 Controllable video generation

Building on the success of controllable image generation through the integration of additional conditions [40, 18, 19, 27, 15], a multitude of studies [5, 38, 8, 22, 2] have endeavored to introduce diverse control signals for versatile video generation. These include control over the first video frame [5], motion trajectory [38], motion region [8], and motion object [22]. Furthermore, in pursuit of high-quality video customization, several studies delve into reference-based video generation, leveraging the motion from an existing real video to direct the creation of new video content. A straightforward solution developed in [32, 9, 37] involves the direct integration of frame-wise depth maps or Canny maps to regularize motion. However, this approach inadvertently introduces motion-independent features, such as structures in static areas, which can disrupt the alignment of the resulting video appearance with new text. To address this issue, motion-specific fine-tuning frameworks, as explored in [41, 16], have been developed to extract a distinct motion pattern from a single video or a collection of videos with identical motion. While holding promise, these methods are subject to complex training processes and potential model degradation. In contrast, we present a novel motion cloning scheme, which extracts temporal correlations from existing videos as explicit motion cues to guide the generation of new video content, providing a plug-and-play motion customization solution.

2.3 Attention control

Attention mechanisms have been confirmed as vital for high-quality content generation. Prompt2Prompt [12] illustrates that cross-attention maps are instrumental in dictating the spatial layout of synthesized images. This observation subsequently motivates a series of works in semantic preservation [4], multi-object generation [21, 36], and video editing [20]. FreeControl [23] highlights that the feature space within self-attention layers encodes structural image information, facilitating reference-based image generation. While previous methods have concentrated on spatial attention blocks, our work uncovers the untapped potential of temporal attention layers for effective motion guidance, thereby enabling flexible motion cloning from existing videos.

3 MotionClone

In this section, we first introduce video diffusion and temporal attention mechanisms. We then present our observations regarding the temporal attentions within the video diffusion model. Subsequently, we elaborate on the proposed MotionClone framework, which comprises primary temporal-attention guidance and location-aware semantic guidance.

Figure 2: Leveraging temporal attentions derived from a reference video to guide video generation. Plain control refers to a rudimentary approach whereby all weights are uniformly applied. Primary control utilizes primary temporal-attention guidance, as outlined in Section 3.3.

3.1 Preliminaries

Diffusion sampling. Following pioneering work [29], video diffusion models encode an input video $x$ into a latent representation $z=\mathcal{E}(x)$ by using a pre-trained encoder $\mathcal{E}(\cdot)$. To enable video distribution learning, the diffusion model $\epsilon_{\theta}$ is encouraged to estimate the noise component $\epsilon_{t}$ from the latent $z_{t}$ that follows a time-dependent scheduler [13], i.e.,

$$\mathcal{L}(\theta)=\mathbb{E}_{\mathcal{E}(x),\,\epsilon_{t}\sim\mathcal{N}(0,1),\,t\sim\mathcal{U}(1,T)}\left[\|\epsilon_{t}-\epsilon_{\theta}(z_{t},c,t)\|_{2}^{2}\right], \tag{1}$$

where $t$ is the time step, and $c$ is the condition signal such as text. In the inference phase, the generative process commences with standard Gaussian noise. The sampling trajectory can be adjusted by incorporating classifier-free guidance [14], denoted as $\epsilon_{\theta}(z_{t},\phi,t)$, and an additional energy function, represented by $g(z_{t},y,t)$, which is parameterized by the label $y$, i.e.,

$$\hat{\epsilon}_{\theta}=\epsilon_{\theta}(z_{t},c,t)+s\,(\epsilon_{\theta}(z_{t},c,t)-\epsilon_{\theta}(z_{t},\phi,t))+\lambda\,g(z_{t},y,t), \tag{2}$$

where $s$ and $\lambda$ are guidance weights, and $\phi$ denotes null text or a negative prompt.
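The combination in Eq. (2) can be illustrated with a minimal NumPy sketch. The function name `guided_noise`, its argument names, and the default weights are hypothetical choices for illustration and are not taken from the paper; the guidance term is passed in as a precomputed array of the same shape as the noise predictions.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, g, s=7.5, lam=1.0):
    """Combine noise predictions following Eq. (2).

    eps_cond   : conditional prediction eps_theta(z_t, c, t)
    eps_uncond : prediction with null/negative prompt eps_theta(z_t, phi, t)
    g          : additional guidance term g(z_t, y, t), same shape as the predictions
    s, lam     : guidance weights (default values here are assumptions)
    """
    return eps_cond + s * (eps_cond - eps_uncond) + lam * g

# Toy usage on a latent of shape (frames, channels, height, width).
shape = (2, 4, 8, 8)
eps_hat = guided_noise(np.ones(shape), np.zeros(shape), 2.0 * np.ones(shape), s=2.0, lam=0.5)
```

Setting `lam=0` recovers plain classifier-free guidance in the form written in Eq. (2).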

Temporal attention. To model video motion, temporal attention is introduced to establish correlation across frames. Given a video feature with $F$ frames $f_{in}\in\mathbb{R}^{B\times F\times C\times H\times W}$, the temporal attention mechanism reshapes this tensor into a 3D tensor $\bar{f}_{in}\in\mathbb{R}^{(B\times H\times W)\times F\times C}$ by merging the spatial dimensions into the batch size. Subsequently, it executes self-attention along the frame axis, i.e.,

$$f_{out}=\mathrm{Attention}(Q(\bar{f}_{in}),K(\bar{f}_{in}),V(\bar{f}_{in})), \tag{3}$$

where $Q(\cdot)$, $K(\cdot)$, and $V(\cdot)$ are projection layers. The corresponding attention map can be obtained as $\mathcal{A}\in\mathbb{R}^{(B\times H\times W)\times F\times F}$, which represents the temporal relation for each spatial pixel. For the sake of brevity, in the ensuing exposition, we employ the latent representation $z$ to denote videos, given that all operations are executed within the latent space.
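The reshaping and the resulting attention map can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: a single attention head, bias-free linear projections given as plain matrices `w_q` and `w_k`, and scaled dot-product attention. The function name and these parameters are hypothetical and do not describe the exact layers of any particular video diffusion model.

```python
import numpy as np

def temporal_attention_map(f_in, w_q, w_k):
    """Return the temporal attention map A of shape (B*H*W, F, F).

    f_in     : video feature of shape (B, F, C, H, W)
    w_q, w_k : assumed projection matrices of shape (C, D)
    """
    B, F, C, H, W = f_in.shape
    # Merge spatial dimensions into the batch: (B, F, C, H, W) -> (B*H*W, F, C).
    f_bar = f_in.transpose(0, 3, 4, 1, 2).reshape(B * H * W, F, C)
    q = f_bar @ w_q                                   # (B*H*W, F, D)
    k = f_bar @ w_k                                   # (B*H*W, F, D)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the key-frame axis j.
    logits = logits - logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    return attn / attn.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
feat = rng.normal(size=(1, 16, 8, 4, 4))             # B=1, F=16, C=8, H=W=4
attn_map = temporal_attention_map(feat, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
```

Each slice `attn_map[p]` is an F-by-F matrix whose row `i` describes how strongly frame `i` attends to every frame `j` at spatial position `p`.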

3.2 Observation

Since the motion in the generated video is governed by temporal attention mechanisms, videos with similar temporal attentions should exhibit similar motion characteristics. To investigate this hypothesis, we control the denoising process by aligning the temporal attentions from the generated video with those from a reference video. As depicted in Fig. 2, plain control of the generation can preserve certain motion patterns from the reference video, such as the gait of a cat and the directional movement of a tank. However, this naive guidance based on temporal attentions tracks the reference motion only suboptimally, particularly with regard to the amplitude of motion for both objects and the camera. We postulate that the cause is the majority of weights within the temporal attentions, which often correspond to either noisy or exceedingly subtle motions. Such weights have the potential to obscure the motion guidance. The proposed primary control strategy, which is elaborated upon in subsequent sections, demonstrates the capacity to effectively replicate the reference motion.

In addition, we directly apply primary component sampling to the temporal attention module of the video generation model during the inference phase, and a significant enhancement in the motion within the generated videos is observed, as illustrated in Fig. 3. This finding further supports the notion that the principal components of temporal attention represent the primary motions in the generated videos, which we aim to utilize for motion cloning.

Figure 3: Primary sampling in vanilla video generation. By applying primary sampling to the temporal attention module of the video generation model during the inference phase, we observe an impressive enhancement in the range and quality of motions within the generated videos.
Figure 4: The framework of MotionClone. MotionClone comprises two core components: Primary Temporal-Attention Guidance and Location-Aware Semantic Guidance, which operate synergistically to provide comprehensive motion and semantic guidance.

3.3 Methodology

Method overview. The framework of MotionClone is depicted in Fig. 4. Given a real reference video $z^{g}$, we employ DDIM [30] inversion to obtain the time-dependent latent set $S_{z^{g}}=\left\{z_{1}^{g},z_{2}^{g},\ldots,z_{t}^{g},\ldots,z_{T}^{g}\right\}$. During the video generation process, an initial latent $z_{T}$ is sampled from a standard Gaussian distribution and subsequently duplicated to create a sibling latent $\bar{z}_{T}$. This sibling latent, in conjunction with classifier-free guidance, is utilized to confer semantic appearance characteristics derived from the prompt.
At each denoising step, $z_{t}^{g}$, $z_{t}$, and $\bar{z}_{t}$ are fed to the pretrained video diffusion model, in which the motion encapsulated by the temporal attentions of $z_{t}^{g}$ and the semantic appearance encoded within the cross attentions of $\bar{z}_{t}$ are collectively employed to guide the denoising process of $z_{t}$. Specifically, motion guidance is implemented by aligning the primary temporal-attention components of the generated latent $z_{t}$ with those of the reference latent $z_{t}^{g}$. This alignment propels $z_{t}$ to clone motion in regions where $z_{t}^{g}$ exhibits substantial motion activity. For location-aware semantic guidance, we derive coarse object masks from the cross-attention layers of $z_{t}^{g}$ and $\bar{z}_{t}$, which are encoded using a Gaussian kernel.
These masks are then utilized to guide $z_{t}$, leveraging the spatial location information from $z_{t}^{g}$ and the semantic appearance details from $\bar{z}_{t}$. The joint guidance facilitates the generation of videos that exhibit compelling motion fidelity and precise textual alignment.
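To make the inversion step in the overview more concrete, the following is a minimal sketch of deterministic DDIM inversion that collects the latent set for a reference latent. It assumes a hypothetical noise predictor `eps_model(z, t)` and a decreasing sequence of cumulative noise-schedule products `alphas_cumprod` for steps 1 to T; both names, and the convention of evaluating the predictor at the target step, are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def ddim_inversion(z0, eps_model, alphas_cumprod):
    """Return [z_1, ..., z_T] obtained by running deterministic DDIM steps backwards.

    z0             : clean reference latent
    eps_model      : assumed callable (z, t) -> predicted noise, same shape as z
    alphas_cumprod : assumed sequence of alpha-bar values for t = 1..T, each in (0, 1]
    """
    latents = []
    z = z0
    a_prev = 1.0  # alpha-bar of the clean latent (no noise)
    for t, a_t in enumerate(alphas_cumprod, start=1):
        eps = eps_model(z, t)
        # Clean latent implied by the current latent and the predicted noise.
        z0_pred = (z - np.sqrt(1.0 - a_prev) * eps) / np.sqrt(a_prev)
        # Move to the noisier level t while reusing the same predicted noise.
        z = np.sqrt(a_t) * z0_pred + np.sqrt(1.0 - a_t) * eps
        latents.append(z)
        a_prev = a_t
    return latents

# Toy usage with a predictor that always returns zero noise.
z_ref = np.ones((2, 4, 8, 8))
latent_set = ddim_inversion(z_ref, lambda z, t: np.zeros_like(z), [0.9, 0.5])
```

In a real pipeline the predictor would be the pretrained video diffusion model, and the stored latents would be replayed at matching denoising steps to read out the reference temporal attentions.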

Primary temporal-attention guidance. Let $\mathcal{A}^{g}\in\mathbb{R}^{(1\times H\times W)\times F\times F}$ denote the temporal attention from a given reference video, which satisfies $\sum_{j=1}^{F}\mathcal{A}_{(p,i,j)}^{g}=1$. In the subsequent exposition, the time step $t$ is omitted for brevity. The value of $\mathcal{A}_{(p,i,j)}^{g}$ reflects the relation between frame $i$ and frame $j$ at position $p$, and a larger value of $\mathcal{A}_{(p,i,j)}^{g}$ implies a stronger correlation. The primary temporal-attention guidance $g_{m}$ for motion cloning can be expressed as:

$$g_{m}=\left\|\mathcal{M}\cdot(\mathcal{A}^{g}-\mathcal{A})\right\|_{2}^{2}, \tag{4}$$

where $\mathcal{M}$ is the temporal mask for the primary temporal attention constraint, and $\mathcal{A}$ is the temporal attention of $z$. Particularly, we denote $\mathcal{M}=1$ to represent the plain temporal-attention constraint, indicative of the “plain control” that exhibits limited motion transfer capability as illustrated in Fig. 2. We postulate that the cause is the majority of weights within the temporal attentions, which often correspond to either noisy or subtle motions. These weights may obscure the motion guidance. Following this postulate, a rudimentary approach involves the implementation of threshold-based filtering:

$$\mathcal{M}_{(p,i,j)}^{threshold}:=\begin{cases}1, & \text{if } \mathcal{A}_{(p,i,j)}^{g}\geq\alpha\\ 0, & \text{otherwise.}\end{cases} \tag{5}$$

However, this straightforward attempt leads to unstable control. Therefore, we propose to obtain the sparse temporal mask according to the rank of the $\mathcal{A}^{g}$ values along the temporal axis. Let $\hat{\mathcal{A}}^{g}$ denote the attention $\mathcal{A}^{g}$ sorted along the temporal axis in decreasing order, i.e., $\hat{\mathcal{A}}_{(p,i,k_{1})}^{g}\geq\hat{\mathcal{A}}_{(p,i,k_{2})}^{g}$ for arbitrary $k_{1}\leq k_{2}$.
Consequently, the subset comprising the top $k$ values can be defined as $\Omega_{(p,i)}^{k}=\left\{\hat{\mathcal{A}}_{(p,i,1)}^{g},\hat{\mathcal{A}}_{(p,i,2)}^{g},\ldots,\hat{\mathcal{A}}_{(p,i,k)}^{g}\right\}$. The mask of rank sampling is defined as:

$$\mathcal{M}_{(p,i,j)}^{rank}:=\begin{cases}1, & \text{if } \mathcal{A}_{(p,i,j)}^{g}\in\Omega_{(p,i)}^{k}\\ 0, & \text{otherwise,}\end{cases} \tag{6}$$
Figure 5: Temporal guidance leads to unreasonable spatial relationships.

where $k$ is a hyper-parameter. Particularly, in the case where $k=1$, the guidance focuses solely on the highest activation for each spatial location. This ranking-based sparse strategy enables primary motion guidance from the reference video.
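The two masks in Eqs. (5) and (6) and the guidance in Eq. (4) can be sketched in NumPy as below. The function names are hypothetical, the attention tensors are assumed to have shape (H*W, F, F) with the last axis indexing the key frame $j$, and ties among equal attention values are broken by index so that exactly $k$ entries are selected per row, which is a simplification of the set membership in Eq. (6).

```python
import numpy as np

def threshold_mask(attn_ref, alpha):
    """Eq. (5): keep entries of the reference attention that are >= alpha."""
    return (attn_ref >= alpha).astype(attn_ref.dtype)

def rank_mask(attn_ref, k=1):
    """Eq. (6): keep the k largest entries along the last (temporal) axis."""
    # Stable sort on negated values gives indices in decreasing order of attention.
    top_idx = np.argsort(-attn_ref, axis=-1, kind="stable")[..., :k]
    mask = np.zeros_like(attn_ref)
    np.put_along_axis(mask, top_idx, 1.0, axis=-1)
    return mask

def motion_guidance(attn_ref, attn_gen, mask):
    """Eq. (4): squared L2 norm of the masked attention difference."""
    return float(np.sum((mask * (attn_ref - attn_gen)) ** 2))

# Toy example: one spatial position, two query frames, three key frames.
a_ref = np.array([[[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]]])
a_gen = np.full_like(a_ref, 1.0 / 3.0)
m_rank = rank_mask(a_ref, k=1)
g_primary = motion_guidance(a_ref, a_gen, m_rank)            # primary control
g_plain = motion_guidance(a_ref, a_gen, np.ones_like(a_ref))  # plain control, M = 1
```

With the rank mask, only the dominant entry of each row contributes to the guidance, whereas the all-ones mask also penalizes differences in the many small entries.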

Location-aware semantic guidance. The primary motion guidance effectively facilitates video motion. However, there is no guarantee that the generated videos will be physically plausible, which introduces the risk of unrealistic generations due to misalignments between motion and appearance, as shown in Fig. 5. This phenomenon has minimal impact on camera motion scenarios that are characterized by globally consistent motion; however, it substantially degrades the quality in object action scenarios where there is a lack of motion coherence between the foreground and background. Additionally, motion guidance may attenuate the model’s adherence to the prompt, resulting in limited prompt-following generation.

To address the above issues, we propose to employ the foreground location from the reference video $z^{g}$ and the appearance of the sibling video $\bar{z}$ to provide joint semantic guidance. Specifically, given the foreground tokens derived from the prompts, the foreground masks of the reference video $z^{g}$ and the sibling video $\bar{z}$, denoted as $M^{g}$ and $\bar{M}$ respectively, are extracted from the cross-attention layers [12]. To eliminate structural information from the reference video, we employ a Gaussian kernel to encode the foreground mask, using the mask’s center as the mean of the kernel. Formally, the Gaussian kernel can be obtained by the following equation:

$$G_{p}=\frac{1}{2\pi\sigma^{2}}e^{-\frac{\|p-p_{0}\|^{2}}{2\sigma^{2}}}, \tag{7}$$

where $p_{0}$ represents the coordinates of the foreground mask’s center, and $\sigma$ is the standard deviation of the Gaussian distribution. Finally, the location-aware semantic guidance is defined as:

$$g_{s}=\left\|\frac{G_{p}\cdot M_{p}^{g}\cdot\mathcal{F}_{p}}{\sum_{p}G_{p}\cdot M_{p}^{g}}-\frac{\bar{M}_{p}\cdot\bar{\mathcal{F}}_{p}}{\sum_{p}\bar{M}_{p}}\right\|^{2}+\left\|\frac{(1-G_{p})\cdot(1-M_{p}^{g})\cdot\mathcal{F}_{p}}{\sum_{p}(1-G_{p})\cdot(1-M_{p}^{g})}-\frac{(1-\bar{M}_{p})\cdot\bar{\mathcal{F}}_{p}}{\sum_{p}(1-\bar{M}_{p})}\right\|^{2}, \tag{8}$$

where $p$ represents the spatial coordinate, while $\mathcal{F}$ and $\bar{\mathcal{F}}$ denote the key feature obtained from the self-attention layers of the reference video $z$ and the sibling video $\bar{z}$, respectively. Essentially, location-aware semantic guidance leverages the location from the reference video and aligns the global appearance of $\mathcal{F}_{p}$ with that of $\bar{\mathcal{F}}_{p}$. Consequently, it assists the generation model in synthesizing spatial relationships that are more reasonable and enhances its ability to adhere to the prompt.
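Eqs. 7 and 8 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the mask center $p_{0}$ is taken to be the centroid of the mask, the numerators of Eq. 8 are summed over $p$ so that each fraction becomes a weighted average feature vector, and nested lists stand in for tensors. The function names `gaussian_kernel`, `weighted_mean`, `sq_dist`, and `semantic_guidance` are hypothetical.

```python
import math


def gaussian_kernel(mask, sigma):
    """Eq. 7 on a 2D grid. Assumption: p0 is the centroid of a non-empty mask."""
    h, w = len(mask), len(mask[0])
    total = sum(map(sum, mask))
    cy = sum(y * mask[y][x] for y in range(h) for x in range(w)) / total
    cx = sum(x * mask[y][x] for y in range(h) for x in range(w)) / total
    scale = 1.0 / (2.0 * math.pi * sigma ** 2)
    return [
        [scale * math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
         for x in range(w)]
        for y in range(h)
    ]


def weighted_mean(weights, feats):
    """Weighted average of per-position feature vectors (assumed sum over p)."""
    h, w, c = len(feats), len(feats[0]), len(feats[0][0])
    total = sum(map(sum, weights))
    return [
        sum(weights[y][x] * feats[y][x][k] for y in range(h) for x in range(w)) / total
        for k in range(c)
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def semantic_guidance(G, M_ref, M_sib, F, F_sib):
    """Eq. 8: foreground term plus background term.

    G, M_ref, M_sib are H x W grids; F and F_sib are H x W x C key features.
    """
    h, w = len(G), len(G[0])
    fg_w = [[G[y][x] * M_ref[y][x] for x in range(w)] for y in range(h)]
    bg_w = [[(1 - G[y][x]) * (1 - M_ref[y][x]) for x in range(w)] for y in range(h)]
    bg_sib = [[1 - M_sib[y][x] for x in range(w)] for y in range(h)]
    fg = sq_dist(weighted_mean(fg_w, F), weighted_mean(M_sib, F_sib))
    bg = sq_dist(weighted_mean(bg_w, F), weighted_mean(bg_sib, F_sib))
    return fg + bg
```

Because only averaged features enter the two distances, the term compares the overall foreground and background appearance rather than per-pixel structure, which matches the stated goal of removing structural information from the reference video.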

Controllable video generation. Based on the primary temporal-attention guidance and location-aware semantic guidance, controllable video generation can be achieved by replacing the energy function in Eq. 2:

$$\hat{\epsilon_{\theta}}=\epsilon_{\theta}(z_{t},c,t)+s\,(\epsilon_{\theta}(z_{t},c,t)-\epsilon_{\theta}(z_{t},\phi,t))+\lambda_{1}g_{m}+\lambda_{2}g_{s}, \tag{9}$$

where $s$ is the weight for classifier-free guidance, and $\lambda_{1}$ and $\lambda_{2}$ are the weights for motion guidance and semantic guidance, respectively.
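As a worked instance of Eq. 9, the sketch below combines the four terms element-wise. It is a minimal illustration, not the authors' code: flat lists of floats stand in for tensors, the function name `guided_noise` is hypothetical, and `g_m` and `g_s` are assumed to be already available with the same shape as the noise prediction (for example, as gradients of the guidance terms with respect to $z_{t}$), since the equation writes them simply as $g_{m}$ and $g_{s}$.

```python
def guided_noise(eps_cond, eps_uncond, g_m, g_s, s, lam1, lam2):
    """Eq. 9, element-wise over equal-length lists.

    eps_cond   : eps_theta(z_t, c, t), the text-conditioned prediction
    eps_uncond : eps_theta(z_t, phi, t), the unconditional prediction
    g_m, g_s   : motion and semantic guidance terms (assumed noise-shaped)
    """
    return [
        ec + s * (ec - eu) + lam1 * gm + lam2 * gs
        for ec, eu, gm, gs in zip(eps_cond, eps_uncond, g_m, g_s)
    ]


# Toy call with made-up numbers, only to show the call shape.
example = guided_noise(
    eps_cond=[1.0, 2.0],
    eps_uncond=[0.5, 1.0],
    g_m=[0.1, 0.0],
    g_s=[0.0, 0.2],
    s=2.0,
    lam1=10.0,
    lam2=5.0,
)
```

Setting `lam1` and `lam2` to zero leaves only the classifier-free guidance terms of Eq. 9.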

4 Experiments

4.1 Implementation details

In this work, AnimateDiff [11] is adopted as the base text-to-video generation model. When the reference video is a synthesized 16-frame video with a resolution of $512 \times 512$, the number of generation steps is set to 100, with guidance applied only in the first 60 steps; generation typically takes around 1 minute. For real videos, DDIM inversion is adopted to obtain latent representations. The time cost is around 3 minutes with 1000 inversion steps, and the guidance steps and generation steps are extended to 300 and 500, respectively. The motion guidance is applied to the temporal attention layers in “up block.1” and “up block.2”, which were chosen through careful observation to ensure optimal performance. We apply semantic guidance in the self-attention layers in “up block.1”. $s$, $\lambda_{1}$, and $\lambda_{2}$ are empirically set to 7.5, 20000, and 200, respectively.
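The settings above can be collected into a small configuration object. The sketch below is illustrative only: the class name `GuidanceConfig`, its field names, and the 0-based step indexing are hypothetical, and it assumes that both guidance weights simply drop to zero once the guidance window ends, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuidanceConfig:
    generation_steps: int            # total denoising steps
    guidance_steps: int              # guidance is applied only in the first N steps
    cfg_scale: float = 7.5           # s in Eq. 9
    lambda_motion: float = 20000.0   # lambda_1 in Eq. 9
    lambda_semantic: float = 200.0   # lambda_2 in Eq. 9

    def weights_at(self, step):
        """Return (lambda_1, lambda_2) for a 0-based denoising step.

        Assumption: outside the guidance window both weights are zero.
        """
        if step < self.guidance_steps:
            return self.lambda_motion, self.lambda_semantic
        return 0.0, 0.0


# Step counts reported for synthesized and real reference videos.
SYNTHESIZED = GuidanceConfig(generation_steps=100, guidance_steps=60)
REAL_VIDEO = GuidanceConfig(generation_steps=500, guidance_steps=300)
```

Keeping the step counts in one place makes the two regimes (synthesized versus real reference videos) easy to switch between and compare.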

4.2 Experimental setup

Dataset. For experimental evaluation, 30 text-video pairs sourced from DAVIS [26] are used. These videos cover a wide range of motion types and scenarios, from the dynamic motions of animals and humans to global camera motion.

Figure 6: Comparison with vanilla AnimateDiff, in which MotionClone achieves better motion quality with excellent detail preservation.
Figure 7: Comparison in camera motion cloning, in which MotionClone achieves superior textual alignment by better suppressing the original structure.

Evaluation metrics. For objective evaluation, two commonly used metrics are adopted: i) Textual alignment, which quantifies the congruence with the provided textual prompt. Following previous work [32], it is measured by the average CLIP [28] cosine similarity between all video frames and text [16]; ii) Temporal consistency, the indicator of video smoothness, is quantified by calculating the average CLIP similarity among consecutive video frames. Beyond the scope of objective metrics, a user study has been employed for a more nuanced assessment of human preferences in video quality, incorporating two additional criteria: i) motion preservation which evaluates the motion’s adherence to the reference video, and ii) appearance diversity which assesses the visual range and diversity in contrast to the reference video. The scores of the user study are derived from the average ratings provided by 20 volunteers, ranging from 1 to 5.
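Both automatic metrics reduce to averaged cosine similarities over embeddings. The sketch below assumes the CLIP image and text embeddings have already been computed elsewhere and are passed in as plain lists of floats; the function names `cosine`, `textual_alignment`, and `temporal_consistency` are hypothetical, and the code is an illustration of the metric definitions rather than the evaluation script used for the reported numbers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def textual_alignment(frame_embs, text_emb):
    """Mean cosine similarity between every frame embedding and the text embedding."""
    return sum(cosine(f, text_emb) for f in frame_embs) / len(frame_embs)


def temporal_consistency(frame_embs):
    """Mean cosine similarity between consecutive frame embeddings (needs >= 2 frames)."""
    pairs = list(zip(frame_embs, frame_embs[1:]))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Note that a completely static video scores a temporal consistency of 1 under this definition, so a high value on its own does not indicate good motion.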

Baselines. MotionClone aims to achieve high-quality training-free video motion transfer. For a thorough comparative analysis, several alternative methods are included in the comparison: VideoComposer [32], Tune-A-Video [35], Control-A-Video [7], VMC [16], Gen-1 [9], and MotionCtrl [34]. A detailed description of each method is given in Appendix A.

Figure 8: Comparison in object motion cloning, in which MotionClone achieves better motion fidelity with improved prompt-following ability.

4.3 Qualitative comparison

Vanilla T2V model. The integration of motion guidance fulfills two principal objectives: it enhances the customization of video motion and improves the quality of generated motion. As shown in Fig. 6, MotionClone achieves superior quality in terms of motion fidelity and controllability, a result attributed to the reduction of inherent ambiguities within the video synthesis process.

Camera motion clone. As shown in Fig. 7, the "clockwise rotation" motion presents a significant challenge. Despite MotionCtrl’s commendable effort in motion preservation, it fails to produce adequate appearance details. VMC and Tune-A-Video generate scenes with acceptable text alignment but exhibit deficiencies in motion transfer. The outputs from VideoComposer, Gen-1, and Control-A-Video are notably unrealistic, which can be attributed to the dense integration of the structural elements from the original videos. Conversely, MotionClone demonstrates superior text alignment and motion consistency, thereby suggesting its superior video motion transfer capabilities within global camera motion scenarios.

Object motion clone. Beyond camera motion, the handling of local object motions has also been validated. As shown in Fig. 8, VMC falls short in matching motion with the source videos. VideoComposer tends to generate grayish colors with limited prompt-following ability. Gen-1 is constrained by the structure of the original videos. Tune-A-Video struggles with capturing detailed body motions, while Control-A-Video cannot maintain a faithful appearance. In contrast, MotionClone stands out in scenarios with localized object motions, with enhanced motion accuracy and improved text alignment.

Table 1: Comparison on the DAVIS dataset using automatic metrics and a user study.
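
| Method | VMC | VideoComposer | Gen-1 | Tune-A-Video | Control-A-Video | MotionClone |
| --- | --- | --- | --- | --- | --- | --- |
| Text Alignment (automatic) | 0.337 | 0.265 | 0.311 | 0.332 | 0.291 | 0.342 |
| Temporal Consistency (automatic) | 0.942 | 0.942 | 0.932 | 0.948 | 0.918 | 0.949 |
| Motion Preservation (user study) | 3.48 | 3.93 | 4.07 | 3.03 | 4.20 | 4.67 |
| Appearance Diversity (user study) | 3.87 | 3.68 | 4.03 | 3.95 | 3.77 | 4.28 |
| Text Alignment (user study) | 4.27 | 2.95 | 3.29 | 3.70 | 3.25 | 4.40 |
| Temporal Consistency (user study) | 3.32 | 3.65 | 3.45 | 2.47 | 3.16 | 4.43 |
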
Table 2: Quantitative results of the ablation study. Motion Preservation and Motion Quality are subjective metrics obtained from the user study.

| Method | Text Alignment | Temporal Consistency | Motion Preservation | Motion Quality |
| --- | --- | --- | --- | --- |
| w/o motion control | 0.353 | 0.987 | 1.47 | 2.62 |
| Threshold mask | 0.302 | 0.932 | 4.52 | 3.14 |
| w/o semantic control | 0.295 | 0.939 | 4.55 | 3.82 |
| MotionClone | 0.342 | 0.949 | 4.63 | 4.37 |

4.4 Quantitative comparison

The quantitative comparison on the DAVIS dataset is outlined in Tab. 1. Among the compared methods, MotionClone achieves the highest scores in both textual alignment and temporal consistency. Moreover, MotionClone outperforms its rivals in motion preservation, appearance diversity, temporal consistency, and textual alignment in the human preference tests, underscoring its ability to produce visually compelling outcomes.

Effect of primary temporal-attention guidance. We validate the effect of the primary motion guidance strategy with two variants: i) w/o motion control, which sets the motion guidance weight $\lambda_{1}=0$ in Eq. 9, and ii) threshold masking in Eq. 6. The results are presented in Fig. 9 and Tab. 2. Videos without motion control tend to exhibit minimal movement or remain static, which leads to higher temporal consistency scores. However, these videos significantly underperform compared to MotionClone in terms of motion preservation and motion quality. Additionally, threshold masking is inferior to MotionClone in both qualitative and quantitative results. More results are given in the supplementary materials.

Figure 9: Ablations on primary temporal-attention guidance and location-aware semantic guidance.

Effect of location-aware semantic guidance. Location-aware semantic guidance helps the model synthesize plausible spatial relationships and enhances its adherence to the prompt, as illustrated in Fig. 5 and Fig. 9. The quantitative ablation results in Tab. 2 further substantiate these improvements.

5 Conclusion

In this work, we observe that the temporal attention layers embedded within video generation models harbor substantial representational capacities pertinent to video motion transfer. Motivated by these findings, we introduce a training-free method dubbed MotionClone for motion cloning. This methodology is founded on two principal elements: primary temporal-attention guidance, which plays a pivotal role in facilitating motion transfer, and location-aware semantic guidance, responsible for orchestrating the visual appearance. Employing a real reference video, MotionClone demonstrates its capability to preserve motion fidelity robustly while concurrently assimilating novel textual semantics. This framework thereby emerges as a highly adaptable and efficient tool for motion customization within the realm of text-to-video generation.

References

  • [1] M. Bain, A. Nagrani, G. Varol, and A. Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
  • [2] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
  • [3] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
  • [4] H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
  • [5] H. Chen, M. Xia, Y. He, Y. Zhang, X. Cun, S. Yang, J. Xing, Y. Liu, Q. Chen, X. Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
  • [6] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024.
  • [7] W. Chen, J. Wu, P. Xie, H. Wu, J. Li, X. Xia, X. Xiao, and L. Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023.
  • [8] Z. Dai, Z. Zhang, Y. Yao, B. Qiu, S. Zhu, L. Qin, and W. Wang. Animateanything: Fine-grained open domain image animation with motion guidance. arXiv e-prints, pages arXiv–2311, 2023.
  • [9] P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
  • [10] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
  • [11] Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  • [12] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • [13] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [14] J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • [15] L. Huang, D. Chen, Y. Liu, Y. Shen, D. Zhao, and J. Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
  • [16] H. Jeong, G. Y. Park, and J. C. Ye. Vmc: Video motion customization using temporal attention adaption for text-to-video diffusion models. arXiv preprint arXiv:2312.00845, 2023.
  • [17] H. Jeong and J. C. Ye. Ground-a-video: Zero-shot grounded video editing using text-to-image diffusion models. arXiv preprint arXiv:2310.01107, 2023.
  • [18] Y. Kim, J. Lee, J.-H. Kim, J.-W. Ha, and J.-Y. Zhu. Dense text-to-image generation with attention modulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7701–7711, 2023.
  • [19] Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
  • [20] S. Liu, Y. Zhang, W. Li, Z. Lin, and J. Jia. Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761, 2023.
  • [21] J. Ma, J. Liang, C. Chen, and H. Lu. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. arXiv preprint arXiv:2307.11410, 2023.
  • [22] Y. Ma, Y. He, H. Wang, A. Wang, C. Qi, C. Cai, X. Li, Z. Li, H.-Y. Shum, W. Liu, et al. Follow-your-click: Open-domain regional image animation via short prompts. arXiv preprint arXiv:2403.08268, 2024.
  • [23] S. Mo, F. Mu, K. H. Lin, Y. Liu, B. Guan, Y. Li, and B. Zhou. Freecontrol: Training-free spatial control of any text-to-image diffusion model with any condition. arXiv preprint arXiv:2312.07536, 2023.
  • [24] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [25] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • [26] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  • [27] C. Qin, S. Zhang, N. Yu, Y. Feng, X. Yang, Y. Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savarese, et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild. arXiv preprint arXiv:2305.11147, 2023.
  • [28] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [29] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [30] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • [31] K. Sun, J. Pan, Y. Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y. Wang, et al. Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems, 36, 2024.
  • [32] X. Wang, H. Yuan, S. Zhang, D. Chen, J. Wang, Y. Zhang, Y. Shen, D. Zhao, and J. Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36, 2024.
  • [33] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023.
  • [34] Z. Wang, Z. Yuan, X. Wang, T. Chen, M. Xia, P. Luo, and Y. Shan. Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641, 2023.
  • [35] J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
  • [36] G. Xiao, T. Yin, W. T. Freeman, F. Durand, and S. Han. Fastcomposer: Tuning-free multi-subject image generation with localized attention. arXiv preprint arXiv:2305.10431, 2023.
  • [37] J. Xing, M. Xia, Y. Liu, Y. Zhang, Y. He, H. Liu, H. Chen, X. Cun, X. Wang, Y. Shan, et al. Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics, 2024.
  • [38] S. Yin, C. Wu, J. Liang, J. Shi, H. Li, G. Ming, and N. Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.
  • [39] B. Zhang, P. Zhang, X. Dong, Y. Zang, and J. Wang. Long-clip: Unlocking the long-text capability of clip. arXiv preprint arXiv:2403.15378, 2024.
  • [40] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • [41] R. Zhao, Y. Gu, J. Z. Wu, D. J. Zhang, J. Liu, W. Wu, J. Keppo, and M. Z. Shou. Motiondirector: Motion customization of text-to-video diffusion models. arXiv preprint arXiv:2310.08465, 2023.

Appendix A: Baseline description

Among the compared methods, VideoComposer [32] creates videos by extracting specific features such as frame-wise depth or canny maps from existing videos, achieving a compositional approach to controllable video generation. Gen-1 [9] leverages the original structure of reference videos to generate new video content, akin to video-to-video translation. Tune-A-Video [35] expands the spatial self-attention of pre-trained text-to-image models into spatio-temporal attention and then fine-tunes it for motion-specific generation. Control-A-Video [7] incorporates the first video frame as an additional motion cue for customized video generation. MotionCtrl [34] introduces the trajectory of both camera and object motion as supplementary signals for conditional motion modeling. VMC [16] aims to distill motion patterns by fine-tuning the temporal attention layers in a pre-trained text-to-video diffusion model.

Appendix B: More results without motion guidance

Without the designed primary temporal-attention guidance, the motion in the resultant videos tends to be minimal or nearly static, akin to the vanilla T2V model. As depicted in Fig. 13, even though the generated content is well-aligned with the provided prompt, the desired motion is not effectively conveyed within the video sequence. This leads to high text alignment and temporal consistency (since there is little variation over time), yet fails to satisfy the human preference for dynamic video content.

Appendix C: More generated results

A broader array of generated content is displayed to validate the versatile generation capability. As shown in Figs. 10-12, MotionClone is able to extract motion cues from a diverse range of existing videos and thus enables the creation of content that is both prompt-aligned and motion-preserved, showcasing its robust motion cloning capabilities.

Appendix D: Limitation

While MotionClone demonstrates notable improvements in cloning motion from reference videos in a training-free manner, there are inherent limitations associated with it. Firstly, the motion contained within the reference video must be appropriate for the objects depicted in the new prompt; otherwise, MotionClone may produce unrealistic video outputs. Secondly, despite the application of primary temporal-attention guidance and a Gaussian kernel to mitigate the impact of structural information from the reference video, a small number of generated samples still retain structural elements from the reference. These limitations will be addressed in future research.

Appendix E: Broader Impact

The development and implementation of MotionClone, a novel training-free framework for motion-based controllable text-to-video generation, carry distinct societal implications, both beneficial and challenging.

On the positive side, MotionClone’s capability to efficiently clone motions from reference videos while ensuring high fidelity and textual alignment opens new avenues in numerous fields. In the realm of digital content creation, film and media professionals can utilize this technology to streamline the production process, enhance narrative expressions, and create more engaging visual experiences without extensive resource commitments. Furthermore, in the educational sector, instructors and content creators can leverage this innovation to produce customized instructional videos that incorporate precise motions aligned with textual descriptions, potentially increasing engagement and comprehension among students. This could be particularly transformative for subjects where demonstration of physical actions or processes plays a crucial role, such as in sports training or scientific experiments.

On the negative side, the power of MotionClone to generate realistic videos based on text and existing motion cues raises concerns about its potential misuse, including the creation of deepfakes or misleading media content. Such applications can undermine trust in media, affect public opinion through the dissemination of false information, and infringe on personal rights and privacy. Moreover, the ease of generating convincing videos might enable the proliferation of propaganda or harmful content that can have widespread negative implications on society.

In conclusion, while MotionClone presents significant advancements in the field of AI-driven video generation, it is imperative that these technologies are developed and utilized with a conscious commitment to ethical standards and regulatory oversight. Promoting transparency in AI-generated content, establishing clear usage guidelines, and fostering an open dialogue about the capabilities and ethics of such technologies are crucial steps in ensuring that the benefits of MotionClone are realized while its risks are effectively mitigated. This involves collaborative efforts among technologists, policymakers, industry stakeholders, and the broader public to steer the responsible development and application of AI-driven media tools.

Figure 10: More results of MotionClone. Within each group, the first row presents the reference video, while the subsequent rows display videos generated by MotionClone.
Figure 11: More results of MotionClone. Within each group, the first row presents the reference video, while the subsequent rows display videos generated by MotionClone.
Figure 12: More results of MotionClone. Within each group, the first row presents the reference video, while the subsequent rows display videos generated by MotionClone.
Figure 13: More qualitative comparison with AnimateDiff. Within each group, the first row and the second row display videos generated by AnimateDiff and MotionClone, respectively. The videos generated by AnimateDiff exhibit minimal movement or remain static.