Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Chatterjee, Agneet; Stan, Gabriela Ben Melech; Aflalo, Estelle; Paul, Sayak; Ghosh, Dhruba; Gokhale, Tejas; Schmidt, Ludwig; Hajishirzi, Hannaneh; Lal, Vasudev; Baral, Chitta; Yang, Yezhou

arXiv:2404.01197 (cs)

[Submitted on 1 Apr 2024]

Abstract: One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially-focused, large scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while also improving the FID and CMMD scores. Secondly, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on <500 images. Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. We publicly release our dataset and model to foster further research in this area.

Comments:	project webpage : this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.01197 [cs.CV]
	(or arXiv:2404.01197v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.01197

Submission history

From: Agneet Chatterjee
[v1] Mon, 1 Apr 2024 15:55:25 UTC (35,976 KB)
