Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Chen, Weifeng; Ji, Yatai; Wu, Jie; Wu, Hefeng; Xie, Pan; Li, Jiashi; Xia, Xin; Xiao, Xuefeng; Lin, Liang

arXiv:2305.13840 (cs)

[Submitted on 23 May 2023 (v1), last revised 6 Dec 2023 (this version, v2)]

Abstract: Recent advancements in diffusion models have unlocked unprecedented abilities in visual creation. However, current text-to-video generation models struggle with the trade-off among movement range, action coherence, and object consistency. To mitigate this issue, we present a controllable text-to-video (T2V) diffusion model, called Control-A-Video, that maintains consistency while supporting customizable video synthesis. Built on a pre-trained conditional text-to-image (T2I) diffusion model, our model generates videos conditioned on a sequence of control signals, such as edge or depth maps. To improve object consistency, Control-A-Video integrates motion priors and content priors into video generation. We propose two motion-adaptive noise initialization strategies, based on pixel residual and optical flow, that introduce motion priors from input videos and produce more coherent videos. Moreover, a first-frame conditioned controller is proposed to generate videos from the content priors of the first frame, which facilitates semantic alignment with the text and allows longer video generation in an auto-regressive manner. With the proposed architecture and strategies, our model achieves resource-efficient convergence and generates consistent and coherent videos with fine-grained control. Extensive experiments demonstrate its success in various video generative tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality.
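
The auto-regressive use of the first-frame conditioned controller can be pictured with a small scheduling sketch. The abstract does not specify how consecutive clips are chained, so everything below is an assumption made for illustration: the function name `generate_long_video`, the callback `generate_chunk(window, first_frame)`, the `chunk_len` parameter, and the one-frame overlap between clips are all hypothetical and are not taken from the paper or its code. The toy generator only produces string labels; a real model would return image tensors.

```python
from typing import Callable, List, Optional, Sequence


def generate_long_video(
    controls: Sequence,
    generate_chunk: Callable[[Sequence, Optional[object]], List],
    chunk_len: int,
) -> List:
    """Chain fixed-length clips so each clip starts from the previous clip's last frame.

    Assumption: generate_chunk(window, first_frame) returns one frame per control
    signal in `window`; when first_frame is given, the returned chunk's first
    frame corresponds to it.
    """
    if chunk_len < 2:
        raise ValueError("chunk_len must be at least 2 to make progress")

    frames: List = []
    first_frame = None
    start = 0  # index of the next control signal that still needs a frame
    while start < len(controls):
        if first_frame is None:
            # First clip: no content prior is available yet.
            window = controls[start:start + chunk_len]
            frames.extend(generate_chunk(window, None))
            start += len(window)
        else:
            # Later clips: overlap by one frame and condition on that frame.
            window = controls[start - 1:start - 1 + chunk_len]
            chunk = generate_chunk(window, first_frame)
            frames.extend(chunk[1:])  # drop the overlapping frame
            start += len(window) - 1
        first_frame = frames[-1]
    return frames


# Toy stand-in for a video model (hypothetical): labels instead of images.
conditioning_log = []


def toy_generate_chunk(window, first_frame):
    conditioning_log.append(first_frame)
    chunk = [f"frame({c})" for c in window]
    if first_frame is not None:
        chunk[0] = first_frame
    return chunk


video = generate_long_video(list(range(10)), toy_generate_chunk, chunk_len=4)
print(len(video), conditioning_log)
```

With ten control signals and a clip length of four, the sketch issues three generation calls; the second and third are conditioned on the last frame of the clip before them, which is one simple way to read "longer video generation in an auto-regressive manner".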

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2305.13840 [cs.CV]
	(or arXiv:2305.13840v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.13840

Submission history

From: Weifeng Chen
[v1] Tue, 23 May 2023 09:03:19 UTC (4,917 KB)
[v2] Wed, 6 Dec 2023 14:03:00 UTC (38,957 KB)
