Hierarchical Memory for Long Video QA
Abstract
This paper describes our champion solution to the LOVEU Challenge @ CVPR’24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, which makes long video question-answering challenging. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency while preserving the information essential for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream [11], that can process long videos with limited GPU memory (VRAM). We further use the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available at the project homepage: https://invinciblewyq.github.io/vstream-page.
1 Introduction
Most existing large video-language models struggle with long video question-answering upon user queries [3, 2, 6, 10]. The main reason is that, without effective compression, the visual tokens of consecutive frames are both numerous and redundant. This makes it impossible to keep all visual features in limited GPU memory (VRAM) and significantly increases the decoding latency of the language model. To address this issue, we adopt a hierarchical memory mechanism named STAR Memory, recently proposed in Flash-VStream [11]. By further fine-tuning the pretrained weights released by Flash-VStream on the MovieChat-1K training set, we leverage both the general video compression and understanding ability of STAR Memory and the domain-specific knowledge of MovieChat-1K. The strong performance of our model demonstrates the effectiveness of this hierarchical memory method in long video question-answering tasks.
2 Method
![Figure 1](x1.png)
As shown in Figure 1, the Flash-VStream framework that we adopt consists of three main components: (1) a streaming visual encoder that continuously processes video frames; (2) a Spatial-Temporal-Abstract-Retrieved memory mechanism (STAR memory), which covers memory writing and reading with the help of a feature buffer; and (3) an LLM decoder that answers questions raised by users.
2.1 Streaming visual encoder
We use the pre-trained CLIP ViT-L [8] as the visual encoder. Only patch tokens are used during training and inference. Specifically, given a frame stream $\{V^t\}_{t=1}^{\infty}$, the encoder maps the $t$-th frame $V^t$ to a feature map $e^t \in \mathbb{R}^{P \times P \times D}$, where $P^2$ is the number of ViT patch tokens and $D$ is the hidden dimension of ViT.
2.2 Spatial-Temporal-Abstract-Retrieved memory
![Figure 2](x2.png)
To handle information at different levels of granularity, the STAR memory consists of 4 components: spatial memory $M_{\text{spa}}$, temporal memory $M_{\text{tem}}$, abstract memory $M_{\text{abs}}$ and retrieved memory $M_{\text{ret}}$. A feature buffer $M_{\text{buff}}$ stores the features of the latest $N_{\text{buff}}$ frames. Therefore, the overall memory size is limited to MAXSIZE tokens.
Spatial memory. Spatial memory houses the most recent and detailed spatial information for short-term use, implemented as a FIFO (First-In-First-Out) queue, as illustrated in Figure 2 and Equation 2. This architecture enables continuous updating with the newest frames, facilitating immediate access to fine-grained spatial data.
Temporal memory. Temporal memory integrates dynamic information over time, which is crucial for long-term retention. When its size surpasses $N_{\text{tem}}$, the $g_{\text{wkmeans}}$ (Weighted K-means Clustering) algorithm is applied, as shown in Equation 3. This strategy condenses the memory content into $N_{\text{tem}}$ clusters, which can be seen as representations of the key events in a video. The centroids of these clusters are then used as the new memory, storing temporal context efficiently.
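The consolidation step described above can be sketched in a few lines. This is a minimal illustration, not the Flash-VStream implementation: it assumes each memory entry is a flat feature vector with an associated weight (the number of frames it already summarizes), and the names `weighted_kmeans`, `write_temporal`, `n_tem`, `iters` and `seed` are hypothetical.

```python
import math
import random


def weighted_kmeans(points, weights, k, iters=10, seed=0):
    """Cluster `points` into k groups, weighting every point by `weights`.

    Returns (centroids, cluster_weights). Each centroid is the weighted mean
    of its members; each cluster weight is the sum of its members' weights.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    cluster_weights = [0.0] * k
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[j].append((p, w))
        # Update step: weighted mean of each non-empty cluster.
        for j, group in enumerate(groups):
            total = sum(w for _, w in group)
            cluster_weights[j] = total
            if group:
                centroids[j] = [
                    sum(p[d] * w for p, w in group) / total
                    for d in range(len(centroids[j]))
                ]
    return centroids, cluster_weights


def write_temporal(memory, counts, new_feature, n_tem):
    """Add one frame feature; re-cluster only when capacity is exceeded."""
    points = memory + [new_feature]
    weights = counts + [1.0]
    if len(points) <= n_tem:
        return points, weights
    return weighted_kmeans(points, weights, n_tem)
```

Carrying the cluster weights forward is what makes the clustering "weighted": a centroid that already stands for many frames moves less when a single new frame is merged into it.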
Abstract memory. Abstract memory supports high-level semantic concept interpretation through $f_{\text{SA}}$, the Semantic Attention model. It follows Equation 4 to synthesize the insights gained from both spatial and temporal memories into abstracted, actionable knowledge. $f_{\text{SA}}$ keeps updating $M_{\text{abs}}$, the synopsis of the whole video, with the newest features. Refer to Figure 2 for details.
Retrieved memory. Retrieved memory focuses on recalling precise spatial details by identifying and retrieving the most substantial frame features. As shown in Figure 2, it first selects the top-K (where K equals $N_{\text{ret}}$) largest clusters from the clusters obtained in temporal memory $M_{\text{tem}}$. Then the frame features in the feature buffer that are nearest to the centroids of these K clusters are retrieved to supplement the temporal memory with more detailed spatial information. This process is illustrated in Equation 5.
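The two-step selection above (largest clusters first, then nearest buffered frames) can be sketched as follows. This is an illustrative sketch under simplifying assumptions: features are flat vectors compared with Euclidean distance, cluster size is given as a number per centroid, and the function name `retrieve` and its parameters are hypothetical.

```python
import math


def retrieve(buffer, centroids, cluster_sizes, k):
    """Return, for each of the k largest clusters, the nearest buffered feature.

    buffer        -- list of frame feature vectors held in the feature buffer
    centroids     -- list of cluster centroids from the temporal memory
    cluster_sizes -- size (or weight) of each cluster, aligned with `centroids`
    """
    # Step 1: indices of the k largest clusters.
    largest = sorted(
        range(len(centroids)), key=lambda j: cluster_sizes[j], reverse=True
    )[:k]
    # Step 2: for each selected centroid, pick the closest buffered feature.
    return [
        min(buffer, key=lambda f: math.dist(f, centroids[j]))
        for j in largest
    ]
```

The retrieved entries are actual frame features rather than averaged centroids, which is why they can restore spatial detail that clustering smooths away.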
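In brief, a new feature $e^t$ is written to STAR memory as follows:

$$
\begin{aligned}
M_{\text{buff}}^{t} &= \text{concat}\big(g_{\text{pooling}}(e^{t}, P_{\text{spa}}),\, M_{\text{buff}}^{t-1}\big)[0:N_{\text{buff}}] && (1)\\
M_{\text{spa}}^{t} &= M_{\text{buff}}^{t}[0:N_{\text{spa}}] && (2)\\
M_{\text{tem}}^{t} &= g_{\text{wkmeans}}\big(\text{concat}\big(g_{\text{pooling}}(e^{t}, P_{\text{tem}}),\, M_{\text{tem}}^{t-1}\big),\, N_{\text{tem}}\big) && (3)\\
M_{\text{abs}}^{t} &= f_{\text{SA}}\big(M_{\text{abs}}^{t-1},\, g_{\text{pooling}}(e^{t}, P_{\text{abs}}),\, N_{\text{abs}}\big) && (4)\\
M_{\text{ret}}^{t} &= g_{\text{retrieve}}\big(M_{\text{buff}}^{t},\, M_{\text{tem}}^{t},\, N_{\text{ret}}\big) && (5)
\end{aligned}
$$

Here $g_{\text{pooling}}(e, P')$ applies Average Pooling to compress the feature map $e$ from $P^2$ to $P'^2$ size along the width and height dimensions, and $\text{concat}(a, b)$ means concatenating tensors $a$ and $b$ along the time axis.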
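As a concrete illustration of the pooling and first-in-first-out behaviour used when writing a frame, the following sketch pools a feature map and pushes it into a bounded buffer. It is a simplified stand-in rather than the released code: feature maps are nested Python lists of shape P x P x D, P is assumed to be divisible by the target size, and `avg_pool`, `FeatureBuffer`, `n_buff` and `n_spa` are hypothetical names.

```python
from collections import deque


def avg_pool(feature_map, out_size):
    """Average-pool a P x P x D nested list down to out_size x out_size x D."""
    p = len(feature_map)
    if p % out_size != 0:
        raise ValueError("P must be divisible by out_size in this sketch")
    s = p // out_size               # side length of each pooling window
    d = len(feature_map[0][0])      # channel dimension D
    return [
        [
            [
                sum(
                    feature_map[i * s + a][j * s + b][c]
                    for a in range(s)
                    for b in range(s)
                ) / (s * s)
                for c in range(d)
            ]
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


class FeatureBuffer:
    """FIFO buffer of pooled frame features; the newest entries form spatial memory."""

    def __init__(self, n_buff, n_spa):
        self.frames = deque(maxlen=n_buff)  # oldest entry is dropped when full
        self.n_spa = n_spa

    def write(self, pooled_feature):
        self.frames.append(pooled_feature)

    def spatial_memory(self):
        # The most recent n_spa entries, oldest first.
        return list(self.frames)[-self.n_spa:]
```

Using `deque(maxlen=...)` keeps the buffer size constant no matter how long the video stream runs, which is the property that bounds memory usage.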
2.3 Real-time LLM decoder
The LLM decoder works as part of a real-time question answering server. When triggered by a question $Q^t$ at time $t$, the LLM decoder first calculates the text embedding $I_{\text{text}}^t$ and maps the STAR memory $M^t$ to the embedding space with the projector $f_{\text{proj}}$. Then it starts to generate the answer $A^t$ in real time.
2.4 Adopting automatic speech recognition (ASR)
To further utilize the audio information in videos, we adopt an automatic speech recognition (ASR) model to transcribe the audio stream into text. The transcribed text is then fed into the LLM decoder as an additional input to improve video question-answering performance.
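One simple way to pass a transcript to a text decoder is to place it in front of the question. The sketch below shows that idea only; the prompt wording, the character limit and the name `build_question` are assumptions made for illustration and are not taken from the challenge submission.

```python
def build_question(question, transcript, max_chars=2000):
    """Prepend an ASR transcript to the question text.

    The template and the max_chars truncation are hypothetical choices.
    """
    # Collapse runs of whitespace that ASR output often contains.
    transcript = " ".join(transcript.split())
    if not transcript:
        # Silent clip: fall back to the plain question.
        return question
    if len(transcript) > max_chars:
        transcript = transcript[:max_chars]
    return f"Speech transcript: {transcript}\nQuestion: {question}"
```

Truncating the transcript keeps the text prompt bounded, in the same spirit as bounding the number of visual tokens.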
2.5 Implementation details
In this study, we use the pre-trained CLIP ViT-L/14-336px [8] as the streaming visual encoder. Following LLaVA [4], we choose a 2-layer MLP as the visual projector and the pre-trained Vicuna-7B [1] as the LLM decoder. We adopt the open-source ASR model Whisper-large-v3 [9] to pre-transcribe the audio stream into text.
Considering the balance between performance and resource consumption, we set , , , , , and . The MAXSIZE of STAR memory is set to 681 tokens in order to keep computational efficiency.
We train Flash-VStream in 3 stages: modality alignment, instruction tuning, and domain-specific fine-tuning. The training data of the first 2 stages are kept the same as in LLaMA-VID [3]: LLaVA-filtered-558K [5] image-caption pairs and LLaMA-VID-filtered-232K [3] video-caption pairs for stage 1, and LLaVA-filtered-665K [5] image QA pairs and Video-ChatGPT-filtered-98K [7] video QA pairs for stage 2. The training data of stage 3 is the MovieChat-1K training set [10]. Speech data transcribed by the ASR model are used only in the third stage. They are fed into the LLM decoder as part of the question text, providing additional information for video question answering.
For each stage, the model is trained for 1 epoch on 8 A100 80G GPUs. During training, the parameters of the visual encoder are always frozen, while the parameters of the LLM are frozen only in the first stage. All training and inference experiments were conducted in BF16 precision to save time and resources.
During inference on the MovieChat-1K test set, we use different strategies for global and breakpoint questions. For global questions, we use the STAR memory to store the information of the whole video and answer the questions. For breakpoint questions, we use only the part of the video that ranges from 15s before the breakpoint to 15s after it.
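The breakpoint window can be computed as in the sketch below. The clamping to the video boundaries and the frame-sampling helper are assumptions added for illustration (the text only specifies the 15s margin), and the names `breakpoint_window`, `frame_indices`, `margin_s` and `fps` are hypothetical.

```python
import math


def breakpoint_window(breakpoint_s, duration_s, margin_s=15.0):
    """Return (start, end) in seconds around a breakpoint.

    Clamping to [0, duration_s] is an assumption for breakpoints near
    the beginning or end of the video.
    """
    start = max(0.0, breakpoint_s - margin_s)
    end = min(duration_s, breakpoint_s + margin_s)
    return start, end


def frame_indices(start_s, end_s, fps):
    """Indices of frames sampled at `fps` inside the half-open window [start, end)."""
    return list(range(math.ceil(start_s * fps), math.ceil(end_s * fps)))


# Example: a breakpoint at 100s in a 600s video, sampled at 1 frame per second.
start, end = breakpoint_window(100.0, 600.0)
window_frames = frame_indices(start, end, fps=1.0)
```

Only the frames inside this window are written to memory for a breakpoint question, so the model answers from local context instead of the whole video.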
3 Experiments
Table 1: Results on MovieChat-1K. G.: global questions; B.: breakpoint questions; Acc.: accuracy; Sco.: score; FVS: Flash-VStream.

| Method | G. Acc. | G. Sco. | B. Acc. | B. Sco. |
| --- | --- | --- | --- | --- |
| *val set, GPT-3.5 eval* | | | | |
| FVS | 66.7 | 3.823 | 53.9 | 3.518 |
| FVS + stage-3 | 84.0 | 4.624 | 73.5 | 4.078 |
| *test set, Gemini-Pro eval* | | | | |
| MovieChat [10] | 55.1 | 2.78 | 38.5 | 1.87 |
| FVS + stage-3 | 84.5 | 4.12 | 52.4 | 2.89 |
| FVS + stage-3 & speech | 96.0 | 4.60 | 59.6 | 2.99 |
To show how different design choices affect the performance of our method, we hold out 20% of the MovieChat-1K training set as a validation set and evaluate our model on it with GPT-3.5 as the evaluator. As shown in the upper part of Table 1, fine-tuning the pretrained weights released by Flash-VStream on the MovieChat-1K training set is necessary: stage-3 fine-tuning raises global accuracy from 66.7 to 84.0 and breakpoint accuracy from 53.9 to 73.5. The lower part of Table 1 shows the performance of our model on the MovieChat-1K test set with Gemini-Pro as the evaluator, as evaluated by the organizers. Adding the speech data as an additional input to the LLM decoder improves global accuracy from 84.5 to 96.0 and breakpoint accuracy from 52.4 to 59.6. Our model also surpasses the training-free baseline MovieChat [10] by a large margin, demonstrating the effectiveness of our method as well as the importance of audio information when processing the MovieChat-1K dataset.
4 Conclusion
In conclusion, we adopted Flash-VStream [11], a recently proposed long video-QA model. By incorporating a hierarchical memory, the model can effectively compress visual tokens throughout the whole video. We further fine-tuned the pretrained weights released by Flash-VStream on the MovieChat-1K training set. The transcribed speech data is also leveraged as an additional input to the LLM decoder to further improve performance on the MovieChat-1K dataset, achieving 1st place in the challenge. We hope our method can inspire further research and advancements in the field of long video stream understanding.
References
- Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.
- Jin et al. [2023] Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023.
- Li et al. [2023] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023.
- Liu et al. [2023a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
- Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b.
- Ma et al. [2023] Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, and Yi Yang. Vista-llama: Reliable video narrator via equal distance to visual tokens. arXiv preprint arXiv:2312.08870, 2023.
- Maaz et al. [2023] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
- Radford et al. [2022] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022.
- Song et al. [2023] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023.
- Zhang et al. [2024] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024.