Improving Vision-and-Language Reasoning via Spatial Relations Modeling

Yang, Cheng; Xu, Rui; Guo, Ye; Huang, Peixiang; Chen, Yiru; Ding, Wenkui; Wang, Zhongyuan; Zhou, Hong

Computer Science > Computer Vision and Pattern Recognition

arXiv:2311.05298 (cs)

[Submitted on 9 Nov 2023]

Abstract: Visual commonsense reasoning (VCR) is a challenging multi-modal task that requires high-level cognition and commonsense reasoning about the real world. In recent years, large-scale pre-training approaches have been developed and have advanced the state of the art on VCR. However, existing approaches almost exclusively employ BERT-like objectives to learn multi-modal representations. These objectives, which originate in the text domain, are insufficient for exploiting the complex scenarios of the visual modality. Most importantly, the spatial distribution of the visual objects is largely neglected. To address this issue, we propose constructing a spatial relation graph from the given visual scene. Further, we design two pre-training tasks, object position regression (OPR) and spatial relation classification (SRC), each of which learns to reconstruct the spatial relation graph. Quantitative analysis suggests that the proposed method guides the representations to retain more spatial context and helps focus attention on the visual regions essential for reasoning. We achieve state-of-the-art results on VCR and on two other vision-and-language reasoning tasks, VQA and NLVR.
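To make the idea of a spatial relation graph concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how relations or regression targets are defined, so everything here is an assumption made for illustration: the `Box` format, the six relation labels, the center-offset rule, and the normalized-coordinate targets are all hypothetical. Nodes are detected object boxes; each directed edge carries a coarse label that could serve as an SRC-style classification target, while normalized box coordinates could serve as an OPR-style regression target.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical format).

    The origin is the top-left image corner, so y grows downward.
    """
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)


def spatial_relation(a: Box, b: Box) -> str:
    """Describe where box b lies relative to box a.

    The label set is an illustrative assumption, not taken from the paper.
    """
    # Containment is checked first because center offsets are uninformative there.
    if a.x1 <= b.x1 and a.y1 <= b.y1 and a.x2 >= b.x2 and a.y2 >= b.y2:
        return "inside"   # b lies inside a
    if b.x1 <= a.x1 and b.y1 <= a.y1 and b.x2 >= a.x2 and b.y2 >= a.y2:
        return "covers"   # b covers a
    (ax, ay), (bx, by) = a.center, b.center
    dx, dy = bx - ax, by - ay
    # Otherwise use the dominant axis of the center-to-center offset.
    if abs(dx) >= abs(dy):
        return "right_of" if dx > 0 else "left_of"
    return "below" if dy > 0 else "above"


def build_spatial_graph(boxes: List[Box]) -> Dict[Tuple[int, int], str]:
    """Return directed edges (i, j) -> relation of box j relative to box i."""
    return {
        (i, j): spatial_relation(a, b)
        for i, a in enumerate(boxes)
        for j, b in enumerate(boxes)
        if i != j
    }


def normalized_position(box: Box, width: float, height: float) -> Tuple[float, float, float, float]:
    """Scale box coordinates to [0, 1]; an assumed form of a position-regression target."""
    return (box.x1 / width, box.y1 / height, box.x2 / width, box.y2 / height)


# Toy scene: a large region, a small object inside it, and a separate object to the right.
boxes = [Box(0, 0, 100, 100), Box(10, 10, 20, 20), Box(200, 0, 300, 100)]
graph = build_spatial_graph(boxes)
targets = [normalized_position(b, width=400, height=200) for b in boxes]
```

In this toy scene, `graph[(0, 1)]` is `"inside"` and `graph[(0, 2)]` is `"right_of"`. A pre-training setup along these lines would mask or withhold such labels and positions and ask the model to predict them from its multi-modal representations.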

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.05298 [cs.CV]
	(or arXiv:2311.05298v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.05298

Submission history

From: Rui Xu
[v1] Thu, 9 Nov 2023 11:54:55 UTC (9,490 KB)
