Understanding Long Videos in One Multimodal Language Model Pass

Ranasinghe, Kanchana; Li, Xiang; Kahatapitiya, Kumara; Ryoo, Michael S.

arXiv:2403.16998 (cs)

[Submitted on 25 Mar 2024]

Abstract: Large Language Models (LLMs), known to contain a strong awareness of world knowledge, have allowed recent approaches to achieve excellent performance on Long-Video Understanding benchmarks, but at high inference costs. In this work, we first propose Likelihood Selection, a simple technique that unlocks faster inference in autoregressive LLMs for the multiple-choice tasks common in long-video benchmarks. In addition to faster inference, we find that the resulting models yield surprisingly good accuracy on long-video tasks, even with no video-specific information. Building on this, we inject video-specific object-centric information extracted from off-the-shelf pre-trained models and use natural language as a medium for information fusion. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across long-video and fine-grained action recognition benchmarks. Code available at: this https URL
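The abstract does not spell out how Likelihood Selection works, so the following is only a minimal sketch of the general idea of choosing a multiple-choice answer by likelihood instead of by free-form generation. Every name here (`sequence_log_likelihood`, `select_by_likelihood`, `toy_log_prob`) is hypothetical, the length normalization is an assumption, and the toy scoring function merely stands in for an autoregressive LLM; none of it is taken from the authors' code.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: returns log P(token | context) for a single token.
TokenLogProb = Callable[[Sequence[str], str], float]


def sequence_log_likelihood(
    log_prob: TokenLogProb, prompt: Sequence[str], answer: Sequence[str]
) -> float:
    """Sum the log-probabilities of the answer tokens given the prompt."""
    context: List[str] = list(prompt)
    total = 0.0
    for token in answer:
        total += log_prob(context, token)
        context.append(token)  # condition later tokens on earlier answer tokens
    return total


def select_by_likelihood(
    log_prob: TokenLogProb,
    prompt: Sequence[str],
    choices: Sequence[Sequence[str]],
    length_normalize: bool = True,
) -> int:
    """Return the index of the candidate answer with the highest score."""
    scores = []
    for answer in choices:
        score = sequence_log_likelihood(log_prob, prompt, answer)
        if length_normalize:
            # Assumption: average per token so longer answers are not penalized.
            score /= max(len(answer), 1)
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


def toy_log_prob(context: Sequence[str], token: str) -> float:
    """Toy stand-in for an LLM: tokens already seen in the context are likelier."""
    return math.log(0.5) if token in context else math.log(0.05)


prompt = "what is the person cutting with the knife ?".split()
choices = [["a", "guitar"], ["the", "knife"], ["blue", "sky"]]
best = select_by_likelihood(toy_log_prob, prompt, choices)
print(choices[best])
```

With a real autoregressive model, the answer tokens are already known, so their log-probabilities can be read from a single forward pass over the prompt plus answer rather than produced by token-by-token decoding, which is the kind of saving that likelihood-based selection makes possible for multiple-choice tasks.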

Comments:	24 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.16998 [cs.CV]
	(or arXiv:2403.16998v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.16998

Submission history

From: Kanchana Ranasinghe
[v1] Mon, 25 Mar 2024 17:59:09 UTC (2,561 KB)