Audio-Visual Instance Discrimination with Cross-Modal Agreement

Morgado, Pedro; Vasconcelos, Nuno; Misra, Ishan

arXiv:2004.12943v1 (cs)

[Submitted on 27 Apr 2020 (this version), latest version 29 Mar 2021 (v3)]

Abstract: We present a self-supervised learning approach to learn audio-visual representations from video and audio. Our method uses contrastive learning for cross-modal discrimination of video from audio and vice versa. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio. With this simple but powerful insight, our method achieves state-of-the-art results when finetuned on action recognition tasks. While recent work in contrastive learning defines positive and negative samples as individual instances, we generalize this definition by exploring cross-modal agreement. We group together multiple instances as positives by measuring their similarity in both the video and the audio feature spaces. Cross-modal agreement creates better positive and negative sets, and allows us to calibrate visual similarities by seeking within-modal discrimination of positive instances.
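
The two ideas in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. The first function is a generic InfoNCE-style loss in which each video embedding must pick out its own audio embedding among the others in the batch, and vice versa. The second groups instances as positives only when they are similar in both feature spaces. The function names, the temperature value, the use of cosine similarity, and the choice of the minimum of the two similarities as the agreement score are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cross_modal_nce(video, audio, temperature=0.07):
    """Mean InfoNCE loss over video->audio and audio->video discrimination.

    video[i] and audio[i] come from the same instance (the positive pair);
    every other cross-modal pair in the batch serves as a negative.
    The temperature is an assumed hyperparameter, not taken from the paper.
    """
    v = [normalize(x) for x in video]
    a = [normalize(x) for x in audio]
    n = len(v)
    loss = 0.0
    for i in range(n):
        # Each query is compared only against keys from the OTHER modality.
        for query, keys in ((v[i], a), (a[i], v)):
            logits = [dot(query, k) / temperature for k in keys]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += log_z - logits[i]
    return loss / (2 * n)

def agreement_positives(video, audio, index, k=1):
    """Return the k instances most similar to `index` in BOTH modalities.

    Assumption: agreement is scored as the minimum of the video-video and
    audio-audio cosine similarities, so a high score requires both to agree.
    """
    v = [normalize(x) for x in video]
    a = [normalize(x) for x in audio]
    scores = []
    for j in range(len(v)):
        if j == index:
            continue
        score = min(dot(v[index], v[j]), dot(a[index], a[j]))
        scores.append((score, j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Toy embeddings (hypothetical): instances 0 and 1 are alike in both
# modalities, instance 2 differs from them in both.
video = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]

aligned_loss = cross_modal_nce(video, audio)
shuffled_loss = cross_modal_nce(video, audio[::-1])  # break the pairing
positives_of_0 = agreement_positives(video, audio, index=0, k=1)
```

On this toy data, the correctly paired batch yields a lower loss than the batch with the audio order reversed, and instance 1 is selected as the positive for instance 0 because it is close to it in both the video and the audio space.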

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2004.12943 [cs.CV]
	(or arXiv:2004.12943v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2004.12943

Submission history

From: Pedro Morgado
[v1] Mon, 27 Apr 2020 16:59:49 UTC (4,864 KB)
[v2] Tue, 6 Oct 2020 20:04:40 UTC (4,864 KB)
[v3] Mon, 29 Mar 2021 20:14:23 UTC (4,823 KB)