Attention-Based Keyword Localisation in Speech using Visual Grounding

Olaleye, Kayode; Kamper, Herman

arXiv:2106.08859 (cs)

[Submitted on 16 Jun 2021 (v1), last revised 23 Jun 2021 (this version, v2)]

Abstract: Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs without any explicit text-based or alignment supervision. We specifically consider whether incorporating attention into a convolutional model is beneficial for localisation. Although absolute localisation performance with visually supervised models is still modest (compared to using unordered bag-of-word text labels for supervision), we show that attention provides a large gain in performance over previous visually grounded models. As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions, e.g. locating the word 'backstroke' for the query keyword 'swimming'.
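
To make the idea of attention-based localisation concrete, the following is a minimal sketch and not the authors' model. The abstract does not specify the architecture, so everything here is an assumption: frame-level features are taken as given (standing in for the output of a convolutional encoder), each keyword has a hypothetical query vector, attention is plain dot-product scoring followed by a softmax over time, and the predicted location is simply the frame with the highest attention weight. The names `softmax` and `attention_localise` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_localise(frame_features, keyword_query):
    """Toy attention over time for one keyword (illustrative assumption).

    frame_features: list of T frames, each a list of D floats
                    (assumed to come from some speech encoder).
    keyword_query:  list of D floats, a hypothetical keyword embedding.

    Returns (attention weights over frames, detection probability,
    index of the frame with the highest attention weight).
    """
    # Dot-product relevance of every frame to the keyword.
    scores = [
        sum(f * q for f, q in zip(frame, keyword_query))
        for frame in frame_features
    ]
    weights = softmax(scores)

    # Attention-weighted summary of the utterance.
    dim = len(keyword_query)
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]

    # Utterance-level detection score: is the keyword present at all?
    logit = sum(c * q for c, q in zip(context, keyword_query))
    detection_prob = 1.0 / (1.0 + math.exp(-logit))

    # Localisation: the frame that received the most attention.
    predicted_frame = max(range(len(weights)), key=weights.__getitem__)
    return weights, detection_prob, predicted_frame


if __name__ == "__main__":
    # Five toy frames; only frame 3 points in the keyword's direction.
    features = [[0.0, 0.0, 0.0] for _ in range(5)]
    features[3] = [2.0, 0.0, 0.0]
    query = [3.0, 0.0, 0.0]

    weights, prob, frame = attention_localise(features, query)
    print("predicted frame:", frame)
```

In this sketch only the utterance-level detection probability would need a training target (for example, a soft label from a visual tagger), while the attention weights come for free and can be read off as a localisation. That mirrors the setting described in the abstract, where no alignment supervision is available.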

Comments:	Accepted to Interspeech 2021
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2106.08859 [cs.CL]
	(or arXiv:2106.08859v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2106.08859

Submission history

From: Kayode Olaleye
[v1] Wed, 16 Jun 2021 15:29:11 UTC (98 KB)
[v2] Wed, 23 Jun 2021 12:57:46 UTC (98 KB)
