Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

Song, Zengjie; Zhang, Zhaoxiang

doi:10.1109/TNNLS.2023.3288022

arXiv:2306.10684 (cs)

[Submitted on 19 Jun 2023]

Abstract: The framework of visually-guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor an involved visual feature extractor for informative visual guidance and to separately devise a module for feature fusion, while utilizing U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and may also yield suboptimal performance, since jointly optimizing and harmonizing the various model components is challenging. By contrast, this paper presents a novel approach, dubbed audio-visual predictive coding (AVPC), to tackle this task in a parameter-efficient and more effective manner. The network of AVPC features a simple ResNet-based video analysis network for deriving semantic visual features, and a predictive coding-based sound separation network that can extract audio features, fuse multimodal information, and predict sound separation masks in the same architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds, while reducing the model size significantly. Code is available at: this https URL.
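The abstract's idea of iteratively minimizing a prediction error can be illustrated with a toy sketch. This is not the AVPC network. It assumes a single linear prediction W @ r of a fixed target feature v, and every name here (predictive_coding_steps, r, v, W, step) is hypothetical. Each iteration computes the prediction error and nudges the representation r along the gradient that reduces it, so the error shrinks step by step.

```python
import numpy as np

def predictive_coding_steps(r, v, W, step, n_iters):
    """Toy predictive-coding loop (illustrative assumption, not the paper's model).

    r: current representation, shape (d,)
    v: target feature to be predicted, shape (m,)
    W: linear prediction weights, shape (m, d)
    Returns the updated representation and the error norm seen at each iteration.
    """
    errors = []
    for _ in range(n_iters):
        e = v - W @ r                  # prediction error between target and prediction
        errors.append(float(np.linalg.norm(e)))
        r = r + step * (W.T @ e)       # gradient step on 0.5 * ||v - W r||^2 w.r.t. r
    return r, errors

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
v = rng.standard_normal(8)
r0 = np.zeros(4)

# A step of 1 / sigma_max(W)^2 keeps the quadratic objective from increasing.
step = 1.0 / np.linalg.norm(W, 2) ** 2
r, errors = predictive_coding_steps(r0, v, W, step, n_iters=5)
```

In this sketch the recorded error norms never increase from one iteration to the next, which mirrors, in a much simpler setting, the progressive improvement the abstract attributes to recursive audio-visual integration.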

Comments:	Accepted to IEEE Transactions on Neural Networks and Learning Systems (T-NNLS)
Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2306.10684 [cs.SD]
	(or arXiv:2306.10684v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2306.10684
Related DOI:	https://doi.org/10.1109/TNNLS.2023.3288022

Submission history

From: Zengjie Song
[v1] Mon, 19 Jun 2023 03:10:57 UTC (2,791 KB)
