TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval

Tian, Kaibin; Zhao, Ruixiang; Hu, Hu; Xie, Runquan; Lian, Fengzong; Kang, Zhanhui; Li, Xirong

arXiv:2308.01217 (cs)

[Submitted on 2 Aug 2023]

Abstract: For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods dominate. Compared to CLIP4Clip, which is efficient and compact, state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, which puts their scalability for large-scale T2VR into doubt. For efficient T2VR, we propose TeachCLIP with multi-grained teaching to let a CLIP4Clip-based student network learn from more advanced yet computationally heavy models such as X-CLIP, TS2-Net and X-Pool. To improve the student's learning capability, we add an Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage or computation overhead at the retrieval stage. While attentive weights produced by AFA are commonly used for combining frame-level features, we propose a novel use of the weights: letting them imitate the frame-text relevance estimated by the teacher network. As such, AFA provides a fine-grained learning (teaching) channel for the student (teacher). Extensive experiments on multiple public datasets justify the viability of the proposed method.
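The abstract does not spell out how AFA or the teaching signal is computed, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the attention scores come from a hypothetical linear scoring vector `w`, that the teacher's frame-text relevance is a softmax over its frame-text similarities, and that the imitation loss is a KL divergence; none of these choices are confirmed by the abstract, and all names and toy values are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(frames, w):
    """One weight per frame; linear scoring with `w` is an assumption."""
    return softmax([dot(f, w) for f in frames])

def aggregate(frames, weights):
    """Weighted sum of frame features, giving one vector per video."""
    dim = len(frames[0])
    return [sum(wt * f[i] for wt, f in zip(weights, frames)) for i in range(dim)]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); used here as a hypothetical imitation loss."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy inputs (illustrative values only, not from the paper).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 frames, 2-d features
w = [0.5, -0.5]                                  # hypothetical scoring vector
text = [1.0, 0.2]                                # text query feature
teacher_frame_text_sims = [0.9, 0.1, 0.6]        # teacher's per-frame scores

# Student: attention weights, then a single video-level vector.
student_weights = attention_weights(frames, w)
video_vec = aggregate(frames, student_weights)

# Fine-grained teaching: student weights imitate teacher relevance.
teacher_relevance = softmax(teacher_frame_text_sims)
fine_loss = kl_divergence(teacher_relevance, student_weights)

# Retrieval needs only one stored vector per video and one cosine per query.
score = cosine(text, video_vec)
```

In this sketch the attention weights are used only to build `video_vec` and to compute the training loss, so the stored index and the per-query cost are the same as for any single-vector video representation, which is consistent with the abstract's claim of no extra overhead at retrieval time.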

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2308.01217 [cs.CV]
	(or arXiv:2308.01217v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.01217

Submission history

From: Kaibin Tian
[v1] Wed, 2 Aug 2023 15:22:00 UTC (4,485 KB)
