Visual Reference Resolution using Attention Memory for Visual Dialog

Seo, Paul Hongsuck; Lehrmann, Andreas; Han, Bohyung; Sigal, Leonid

arXiv:1709.07992v2 (cs)

[Submitted on 23 Sep 2017 (v1), revised 20 Nov 2017 (this version, v2), latest version 6 Aug 2018 (v3)]

Title: Visual Reference Resolution using Attention Memory for Visual Dialog

Authors: Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal

Abstract: Visual dialog is the task of answering a series of inter-dependent questions about an input image, and it often requires resolving visual references among the questions. This problem differs from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from a single image and question pair. We propose a novel attention mechanism that exploits past visual attentions to resolve the current reference in the visual dialog scenario. The proposed model is equipped with an associative attention memory that stores a sequence of previous (attention, key) pairs. From this memory, the model retrieves the previous attention that is most relevant to the current question, taking recency into account, in order to resolve potentially ambiguous references. The model then merges the retrieved attention with a tentative one to obtain the final attention for the current question; specifically, we use dynamic parameter prediction to combine the two attentions conditioned on the question. Through extensive experiments on a new synthetic visual dialog dataset, we show that our model significantly outperforms the state of the art (by ~16 percentage points) in situations where visual reference resolution plays an important role. Moreover, the proposed model achieves superior performance (~2 percentage points improvement) on the Visual Dialog dataset, despite having significantly fewer parameters than the baselines.
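The retrieve-then-merge idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dot-product key matching, the linear recency penalty (`recency_weight`), and the two-way mixing predicted from the question (`W_dyn`) are all assumptions chosen for illustration, as are the function names. In particular, the mixing step is a simplified stand-in for the dynamic parameter prediction mentioned in the abstract.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def retrieve_attention(keys, attentions, query, recency_weight=0.1):
    """Softly retrieve a past attention map from an (attention, key) memory.

    keys:       (T, d) one key per previous dialog turn
    attentions: (T, H, W) one normalized attention map per previous turn
    query:      (d,) embedding of the current question
    """
    T = keys.shape[0]
    content = keys @ query                           # (T,) key/query match
    # Assumed recency term: older entries (small index) are penalized linearly.
    recency = -recency_weight * (T - 1 - np.arange(T))
    weights = softmax(content + recency)             # (T,) addressing weights
    retrieved = np.tensordot(weights, attentions, axes=1)  # (H, W)
    return retrieved, weights

def combine_attentions(retrieved, tentative, question_vec, W_dyn):
    """Merge retrieved and tentative attention, conditioned on the question.

    Assumption: the question predicts two mixing weights through a
    hypothetical projection W_dyn of shape (2, d).
    """
    mix = softmax(W_dyn @ question_vec)              # (2,) question-dependent
    combined = mix[0] * retrieved + mix[1] * tentative
    return combined / combined.sum()                 # renormalize to sum to 1
```

In this sketch, a question such as "what color is it?" would ideally produce a query that matches the key stored when "it" was first attended, so the retrieved map dominates the merge; a self-contained question would instead lean on the tentative attention.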

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1709.07992 [cs.CV]
	(or arXiv:1709.07992v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1709.07992

Submission history

From: Paul Hongsuck Seo
[v1] Sat, 23 Sep 2017 02:53:48 UTC (6,552 KB)
[v2] Mon, 20 Nov 2017 18:09:07 UTC (2,844 KB)
[v3] Mon, 6 Aug 2018 21:03:18 UTC (2,844 KB)
