Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

Peng, Min; Wang, Chongyang; Shi, Yu; Zhou, Xiang-Dong

arXiv:2302.02136v2 (cs)

[Submitted on 4 Feb 2023 (v1), last revised 5 Mar 2023 (this version, v2)]

Abstract: This paper presents a new method for end-to-end Video Question Answering (VideoQA) that departs from the currently popular reliance on large-scale pre-training with huge feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer and a few convolutional and transformer layers. We use an anisotropic pyramid to carry out video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, we propose novel strategies that decompose the visual feature stream into spatial and temporal sub-streams at different scales and implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. Against state-of-the-art methods on five VideoQA benchmarks, we demonstrate better or on-par performance with high computational efficiency. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, and also the effectiveness of the pyramid.
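
As a rough illustration of the "canonical pyramid" the abstract refers to, the sketch below builds a generic feature pyramid over a sequence of per-frame values: a bottom-up pathway that coarsens the temporal scale, and a top-down pathway that merges coarse levels back into finer ones through lateral connections. This is not the PMT model itself. The function names, the pair-averaging downsampler, the nearest-neighbour upsampler, and the use of scalars in place of feature vectors are all assumptions made for illustration, and the spatial/temporal decomposition and language interaction described in the abstract are not modelled here.

```python
def downsample(seq):
    """Bottom-up step (assumed): halve the length by averaging adjacent pairs."""
    return [(seq[i] + seq[i + 1]) / 2.0 for i in range(0, len(seq) - 1, 2)]


def upsample(seq, length):
    """Top-down step (assumed): nearest-neighbour upsampling to `length` items."""
    return [seq[min(i // 2, len(seq) - 1)] for i in range(length)]


def build_pyramid(features, num_levels=3):
    """Return (bottom_up, top_down) lists of levels, finest level first."""
    # Bottom-up pathway: progressively coarser temporal scales.
    bottom_up = [list(features)]
    for _ in range(num_levels - 1):
        bottom_up.append(downsample(bottom_up[-1]))

    # Top-down pathway: start at the coarsest level and merge downward.
    top_down = [bottom_up[-1]]
    for lateral in reversed(bottom_up[:-1]):
        coarse = upsample(top_down[0], len(lateral))
        # Lateral connection: element-wise sum of same-scale features.
        top_down.insert(0, [a + b for a, b in zip(lateral, coarse)])
    return bottom_up, top_down


# Hypothetical per-frame features (scalars stand in for feature vectors).
frame_features = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0]
bottom_up, top_down = build_pyramid(frame_features)
for level, feats in enumerate(top_down):
    print(level, feats)
```

In a real model each level would hold tensors and the merge would typically pass through learned layers; the plain sum here only shows how information from coarse scales flows back to fine ones.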

Comments:	Accepted by AAAI 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2302.02136 [cs.CV]
	(or arXiv:2302.02136v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2302.02136

Submission history

From: Chongyang Wang
[v1] Sat, 4 Feb 2023 09:14:18 UTC (10,045 KB)
[v2] Sun, 5 Mar 2023 10:09:11 UTC (6,485 KB)