FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

He, Xuehai; Zheng, Jian; Fang, Jacob Zhiyuan; Piramuthu, Robinson; Bansal, Mohit; Ordonez, Vicente; Sigurdsson, Gunnar A; Peng, Nanyun; Wang, Xin Eric

arXiv:2405.04834 (cs)

[Submitted on 8 May 2024 (v1), last revised 22 May 2024 (this version, v2)]

Abstract: Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.04834 [cs.CV]
	(or arXiv:2405.04834v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.04834

Submission history

From: Xuehai He
[v1] Wed, 8 May 2024 06:09:11 UTC (13,742 KB)
[v2] Wed, 22 May 2024 02:45:13 UTC (13,743 KB)
