YORO -- Lightweight End to End Visual Grounding

Ho, Chih-Hui; Appalaraju, Srikar; Jasani, Bhavan; Manmatha, R.; Vasconcelos, Nuno

arXiv:2211.07912 (cs)

[Submitted on 15 Nov 2022]

Abstract: We present YORO, a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred to via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design, without a CNN backbone. YORO consumes natural language queries, image patches, and learnable detection tokens and predicts coordinates of the referred object, using a single transformer encoder. To assist the alignment between text and visual objects, a novel patch-text alignment loss is proposed. Extensive experiments are conducted on 5 different datasets with ablations on architecture design choices. YORO is shown to support real-time inference and outperform all approaches in this class (single-stage methods) by large margins. It is also the fastest VG model and achieves the best speed/accuracy trade-off in the literature.
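
As a rough illustration of the single-sequence design described in the abstract, the sketch below lays out one encoder input made of text tokens, image patches, and detection tokens, then converts a predicted box to pixel coordinates. It is not the authors' implementation. The token ordering, the patch size, the token counts, the normalized (cx, cy, w, h) box format, and the score-based choice among detection tokens are all assumptions made for illustration, and the names `SequenceLayout`, `build_layout`, `select_box`, and `to_pixel_xyxy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SequenceLayout:
    """Index ranges of each token group in one encoder input sequence."""
    text: range
    patches: range
    det: range

    @property
    def length(self) -> int:
        return len(self.text) + len(self.patches) + len(self.det)


def build_layout(num_text_tokens, image_h, image_w, patch_size, num_det_tokens):
    """Assumed ordering: [text tokens | image patches | detection tokens]."""
    if image_h % patch_size or image_w % patch_size:
        raise ValueError("image size must be divisible by patch size")
    num_patches = (image_h // patch_size) * (image_w // patch_size)
    text_end = num_text_tokens
    patch_end = text_end + num_patches
    det_end = patch_end + num_det_tokens
    return SequenceLayout(
        text=range(0, text_end),
        patches=range(text_end, patch_end),
        det=range(patch_end, det_end),
    )


def select_box(det_outputs):
    """Return the box of the highest-scoring detection token.

    det_outputs is a list of (score, (cx, cy, w, h)) pairs, one per
    detection token, with coordinates normalized to [0, 1] (assumed format).
    """
    _, box = max(det_outputs, key=lambda pair: pair[0])
    return box


def to_pixel_xyxy(box, image_w, image_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * image_w,
        (cy - h / 2) * image_h,
        (cx + w / 2) * image_w,
        (cy + h / 2) * image_h,
    )


# Hypothetical sizes: 40 text tokens, a 384x384 image, 32x32 patches, 5 detection tokens.
layout = build_layout(40, 384, 384, 32, 5)
# Placeholder values standing in for the encoder outputs at the detection positions.
fake_outputs = [(0.2, (0.1, 0.1, 0.1, 0.1)), (0.9, (0.5, 0.5, 0.5, 0.5))]
box_pixels = to_pixel_xyxy(select_box(fake_outputs), 384, 384)
```

In this layout only the positions in `layout.det` would be read out for box prediction, which is one way a single encoder pass could yield coordinates without a separate proposal stage.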

Comments:	Accepted to ECCVW on International Challenge on Compositional and Multimodal Perception
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2211.07912 [cs.CV]
	(or arXiv:2211.07912v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2211.07912

Submission history

From: Chih-Hui Ho
[v1] Tue, 15 Nov 2022 05:34:40 UTC (18,869 KB)
