InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

Qin, Bosheng; Li, Juncheng; Tang, Siliang; Chua, Tat-Seng; Zhuang, Yueting

arXiv:2305.12328 (cs)

[Submitted on 21 May 2023 (v1), last revised 29 May 2024 (this version, v2)]

Abstract: We introduce InstructVid2Vid, an end-to-end diffusion-based method for video editing guided by natural language instructions. The approach edits videos directly from an instruction, without per-example fine-tuning or inversion. InstructVid2Vid modifies a pretrained image generation model, Stable Diffusion, to generate a time-dependent sequence of video frames. By combining several existing models, we build a training dataset of video-instruction triplets, which is more cost-efficient than collecting such data in real-world scenarios. To improve coherence between successive frames of the generated videos, we propose the Inter-Frames Consistency Loss and incorporate it during training. With multimodal classifier-free guidance at inference, the generated videos stay consistent with both the input video and the accompanying instruction. Experimental results demonstrate that InstructVid2Vid generates high-quality, temporally coherent videos and performs diverse edits, including attribute editing, background changes, and style transfer. These results underscore the versatility and effectiveness of the proposed method.
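
The abstract names multimodal classifier-free guidance but does not state its formula. As a rough illustration only, the sketch below assumes the two-condition form popularized by InstructPix2Pix, where one scale weights the input video and another weights the instruction. The function name `multimodal_cfg`, the argument names, and the default scales are hypothetical and are not taken from the paper.

```python
import numpy as np

def multimodal_cfg(eps_uncond, eps_video, eps_full, s_video=1.5, s_text=7.5):
    """Combine three noise predictions with two guidance scales.

    ASSUMPTION: this follows the InstructPix2Pix-style two-condition rule;
    the abstract does not confirm that InstructVid2Vid uses exactly this form.

    eps_uncond: prediction with neither video nor instruction conditioning
    eps_video:  prediction conditioned on the input video only
    eps_full:   prediction conditioned on both video and instruction
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_video = np.asarray(eps_video, dtype=float)
    eps_full = np.asarray(eps_full, dtype=float)
    # Start from the unconditional prediction, then push toward the
    # input video and, separately, toward the instruction.
    return (
        eps_uncond
        + s_video * (eps_video - eps_uncond)
        + s_text * (eps_full - eps_video)
    )

# Toy tensors shaped (frames, channels, height, width); values are placeholders.
shape = (2, 4, 8, 8)
guided = multimodal_cfg(np.zeros(shape), np.ones(shape), np.full(shape, 3.0))
```

Under this assumed form, setting both scales to 1 recovers the fully conditioned prediction, while raising `s_video` or `s_text` trades fidelity to the input video against adherence to the instruction.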

Comments:	Accepted by ICME 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Cite as:	arXiv:2305.12328 [cs.CV]
	(or arXiv:2305.12328v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.12328

Submission history

From: Bosheng Qin
[v1] Sun, 21 May 2023 03:28:13 UTC (20,512 KB)
[v2] Wed, 29 May 2024 11:08:41 UTC (7,907 KB)
