Video Question Answering Using CLIP-Guided Visual-Text Attention

Ye, Shuhong; Kong, Weikai; Yao, Chenglin; Ren, Jianfeng; Jiang, Xudong

arXiv:2303.03131v1 (cs)

[Submitted on 6 Mar 2023 (this version), latest version 8 Mar 2023 (v2)]

Authors: Shuhong Ye (1), Weikai Kong (1), Chenglin Yao (1), Jianfeng Ren (1), Xudong Jiang (2) ((1) School of Computer Science, University of Nottingham Ningbo China, (2) School of Electrical & Electronic Engineering, Nanyang Technological University)

Abstract: Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that uses Contrastive Language-Image Pre-training (CLIP), trained on a large number of general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features using a TimeSformer and text features using BERT from the target application domain, and use CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose Cross-domain Learning to extract the attention information between visual and linguistic features across the target domain and the general domain. The set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and outperforms state-of-the-art methods.
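
The abstract does not give the exact form of the visual-text attention, so the following is only a minimal sketch of one way such cross-domain guidance could look, not the authors' implementation. It assumes plain scaled dot-product cross-attention, random arrays standing in for TimeSformer, BERT and CLIP features, and a shared feature width `d` (real encoders would need projection layers first). All names (`cross_attention`, `video_feats`, `clip_text`, and so on) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product attention: each query row attends over keys_values rows.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights

rng = np.random.default_rng(0)
d = 64  # assumed shared feature width after projection

# Random stand-ins for encoder outputs (assumptions, not real features).
video_feats = rng.standard_normal((8, d))    # target-domain video tokens
text_feats = rng.standard_normal((12, d))    # target-domain question tokens
clip_visual = rng.standard_normal((8, d))    # general-domain CLIP visual tokens
clip_text = rng.standard_normal((12, d))     # general-domain CLIP text tokens

# Target-domain video attends over CLIP text; target-domain text over CLIP visual.
video_guided, w_v = cross_attention(video_feats, clip_text)
text_guided, w_t = cross_attention(text_feats, clip_visual)

# Pool each guided stream and concatenate into one fused vector.
fused = np.concatenate([video_guided.mean(axis=0), text_guided.mean(axis=0)])

# Hypothetical linear answer classifier over 5 candidate answers.
answer_weights = rng.standard_normal((5, 2 * d))
answer_logits = answer_weights @ fused
predicted_answer = int(np.argmax(answer_logits))
```

In this sketch the general-domain CLIP features act only as keys and values, so the target-domain features are re-expressed in terms of them; the actual method may combine the two domains differently.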

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
ACM classes:	I.2.10
Cite as:	arXiv:2303.03131 [cs.CV]
	(or arXiv:2303.03131v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.03131

Submission history

From: Shuhong Ye
[v1] Mon, 6 Mar 2023 13:49:15 UTC (918 KB)
[v2] Wed, 8 Mar 2023 11:35:51 UTC (918 KB)
