FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

Cong, Yuren; Xu, Mengmeng; Simon, Christian; Chen, Shoufa; Ren, Jiawei; **; Perez-Rua, Juan-Manuel; Rosenhahn, Bodo; Xiang, Tao; He, Sen

arXiv:2310.05922 (cs)

[Submitted on 9 Oct 2023 (v1), last revised 29 Feb 2024 (this version, v3)]

Authors: Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He

Abstract: Text-to-video editing aims to edit the visual appearance of a source video conditioned on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating the 2D spatial attention in the U-Net into spatio-temporal attention. Although spatio-temporal attention adds temporal context, it may introduce irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency of the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing method to improve its visual consistency. Experimental results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance. In particular, our method excels at maintaining visual consistency in the edited videos.
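
The central idea of the abstract, that patches lying on the same optical-flow path attend only to each other, can be illustrated with a small sketch. This is not the authors' implementation. The function names, the nested-list layout, the (dx, dy) flow convention in patch units, the use of identity projections (Q = K = V), and the choice to start paths only from frame 0 are all assumptions made for illustration. A real pipeline would presumably obtain the flow from an optical flow estimator and operate on latent patches inside the U-Net.

```python
import math


def trace_path(flow, y, x, h, w):
    """Follow forward flow from patch (y, x) of frame 0 (hypothetical helper).

    flow[t][y][x] is an assumed (dx, dy) displacement, in patch units,
    from frame t to frame t + 1. Positions are clipped to the grid and
    rounded to the nearest patch.
    """
    path = [(y, x)]
    fy, fx = float(y), float(x)
    for step in flow:
        cy, cx = path[-1]
        dx, dy = step[cy][cx]
        fy = min(max(fy + dy, 0.0), h - 1.0)
        fx = min(max(fx + dx, 0.0), w - 1.0)
        path.append((round(fy), round(fx)))
    return path


def attend(tokens):
    """Softmax self-attention over a list of vectors, with Q = K = V."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(wgt / total * v[i] for wgt, v in zip(weights, tokens))
            for i in range(dim)
        ])
    return out


def flow_guided_attention(feats, flow):
    """Restrict attention to patches that share a flow path.

    feats[t][y][x] is a feature vector. Patches not reached by any path
    keep their input features; if two paths land on the same patch, the
    later write wins. Both are simplifications of this sketch.
    """
    h, w = len(feats[0]), len(feats[0][0])
    out = [[[list(vec) for vec in row] for row in frame] for frame in feats]
    for y in range(h):
        for x in range(w):
            path = trace_path(flow, y, x, h, w)
            tokens = [feats[t][py][px] for t, (py, px) in enumerate(path)]
            for t, vec in enumerate(attend(tokens)):
                py, px = path[t]
                out[t][py][px] = vec
    return out
```

Compared with full spatio-temporal attention, where every patch attends to every patch in every frame, each token here only sees the few tokens on its own trajectory, which is the mechanism the abstract credits for reducing irrelevant context.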

Comments:	Accepted by ICLR2024. Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2310.05922 [cs.CV]
	(or arXiv:2310.05922v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2310.05922

Submission history

From: Yuren Cong
[v1] Mon, 9 Oct 2023 17:59:53 UTC (18,842 KB)
[v2] Thu, 22 Feb 2024 13:37:09 UTC (32,554 KB)
[v3] Thu, 29 Feb 2024 21:06:58 UTC (32,554 KB)
