VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Chen, Sihan; Li, Handong; Wang, Qunbo; Zhao, Zijia; Sun, Mingzhen; Zhu, Xinxin; Liu, Jing

arXiv:2305.18500 (cs)

[Submitted on 29 May 2023 (v1), last revised 7 Oct 2023 (this version, v2)]

Title: VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Authors: Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu

Abstract: Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we seek to establish connections between multi-modality video tracks (Vision, Audio, and Subtitle) and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process the vision, audio, and subtitle modalities of a video, and which better supports various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments demonstrate the effectiveness of the proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model and dataset will be released at this https URL.
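
The caption-integration step described in the abstract (vision caption, audio caption, subtitle, and an instructional prompt passed to an LLM) could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `ClipAnnotations` and `build_integration_prompt`, the field labels, and the prompt layout are hypothetical and are not taken from the paper or its released code, and the LLM call itself is left out.

```python
from dataclasses import dataclass


@dataclass
class ClipAnnotations:
    """Per-clip text inputs named in the abstract (field names are hypothetical)."""
    vision_caption: str
    audio_caption: str
    subtitle: str


def build_integration_prompt(clip: ClipAnnotations, instruction: str) -> str:
    """Assemble one LLM prompt from the three text sources plus an instruction.

    The layout is an illustrative assumption, not the prompt used by the authors.
    """
    parts = [
        instruction.strip(),
        f"Vision caption: {clip.vision_caption.strip()}",
        f"Audio caption: {clip.audio_caption.strip()}",
    ]
    # Assumption: a clip may come without a subtitle; skip that line if it is empty.
    if clip.subtitle.strip():
        parts.append(f"Subtitle: {clip.subtitle.strip()}")
    return "\n".join(parts)


example = ClipAnnotations(
    vision_caption="A man plays guitar on a stage.",
    audio_caption="Acoustic guitar music with applause.",
    subtitle="",
)
prompt = build_integration_prompt(example, "Merge the descriptions into one caption.")
```

The resulting string would then be sent to whichever LLM is used for integration; that choice, and the exact instruction wording, are outside what the abstract specifies.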

Comments:	Accepted by NeurIPS 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2305.18500 [cs.CV]
	(or arXiv:2305.18500v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.18500

Submission history

From: Sihan Chen
[v1] Mon, 29 May 2023 14:34:50 UTC (14,293 KB)
[v2] Sat, 7 Oct 2023 12:58:26 UTC (14,292 KB)
