VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering

Cangea, Cătălina; Belilovsky, Eugene; Liò, Pietro; Courville, Aaron

arXiv:1908.04950 (cs)

[Submitted on 14 Aug 2019]

Abstract: Embodied Question Answering (EQA) is a recently proposed task, where an agent is placed in a rich 3D environment and must act based solely on its egocentric input to answer a given question. The desired outcome is that the agent learns to combine capabilities such as scene understanding, navigation and language understanding in order to perform complex reasoning in the visual world. However, initial advancements combining standard vision and language methods with imitation and reinforcement learning algorithms have shown EQA might be too complex and challenging for these techniques. In order to investigate the feasibility of EQA-type tasks, we build the VideoNavQA dataset that contains pairs of questions and videos generated in the House3D environment. The goal of this dataset is to assess question-answering performance from nearly-ideal navigation paths, while considering a much more complete variety of questions than current instantiations of the EQA task. We investigate several models, adapted from popular VQA methods, on this new benchmark. This establishes an initial understanding of how well VQA-style methods can perform within this novel EQA paradigm.

Comments:	To appear at BMVC 2019. 15 pages, 5 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:1908.04950 [cs.CV]
	(or arXiv:1908.04950v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1908.04950

Submission history

From: Cătălina Cangea
[v1] Wed, 14 Aug 2019 04:44:26 UTC (286 KB)
