Search | arXiv e-print repository

arXiv:2002.02012

From Route Instructions to Landmark Graphs

Abstract: Landmarks are central to how people navigate, but most navigation technologies do not incorporate them into their representations. We propose the landmark graph generation task (creating landmark-based spatial representations from natural language) and introduce a fully end-to-end neural approach to generate these graphs. We evaluate our models on the SAIL route instruction dataset, as well as on a small set of real-world delivery instructions that we collected, and we show that our approach yields high quality results on both our task and the related robotic navigation task.

Submitted 5 February, 2020; originally announced February 2020.

arXiv:1611.06641

Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues

Authors: Bryan A. Plummer, Arun Mallya, Christopher M. Cervantes, Julia Hockenmaier, Svetlana Lazebnik

Abstract: This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues. We model the appearance, size, and position of entity bounding boxes, adjectives that contain attribute information, and spatial relationships between pairs of entities connected by verbs or prepositions. Special attention is given to relationships between people and clothing or body part mentions, as they are useful for distinguishing individuals. We automatically learn weights for combining these cues and at test time, perform joint inference over all phrases in a caption. The resulting system produces state of the art performance on phrase localization on the Flickr30k Entities dataset and visual relationship detection on the Stanford VRD dataset.

Submitted 8 August, 2017; v1 submitted 20 November, 2016; originally announced November 2016.

Comments: IEEE ICCV 2017 accepted paper

arXiv:1505.04870

Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

Authors: Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik

Abstract: The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.

Submitted 19 September, 2016; v1 submitted 19 May, 2015; originally announced May 2015.

Showing 1–3 of 3 results for author: Cervantes, C M