Hierarchical Memory for Long Video QA

Yiqin Wang¹,*, Haoji Zhang¹,*, Yansong Tang¹,†, Yong Liu¹, Jifeng Dai², Jiashi Feng³, Xiaojie **³,†
¹ Shenzhen International Graduate School, Tsinghua University
² Department of Electronic Engineering, Tsinghua University ³ ByteDance Inc.
{yq-wang23@mails.,haoji-zh20@mails.,tang.yansong@sz.}tsinghua.edu.cn
[email protected]

Abstract

This paper describes our champion solution to the LOVEU Challenge @ CVPR’24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, which makes long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency while preserving the information essential for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream [11], that is capable of processing long videos with limited GPU memory (VRAM). We further use the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available at the project homepage: https://invinciblewyq.github.io/vstream-page.

* Equal contribution. † Corresponding author.

1 Introduction

Most existing large video-language models struggle with long video question-answering in response to user queries [3, 2, 6, 10]. The main reason is that the visual tokens of consecutive frames are numerous and highly redundant. Without effective compression, it is impossible to keep all visual features in limited GPU memory (VRAM), and the decoding latency of the language model increases significantly. To address this issue, we adopt a hierarchical memory mechanism named STAR Memory, recently proposed in Flash-VStream [11]. By further fine-tuning the pretrained weights released by Flash-VStream on the training branch of MovieChat-1K, we leverage both the general video compression and understanding ability of STAR Memory and the domain-specific knowledge of MovieChat-1K. The strong performance of our model demonstrates the effectiveness of this hierarchical memory in long video question-answering tasks.

2 Method

Figure 1: Overview of the Flash-VStream framework that we adopted for real-time online video stream understanding. Flash-VStream is executed by two processes, namely the “frame handler” and the “question handler”. The frame handler is responsible for encoding frames and writing to memory, and contains a visual encoder, a STAR memory and a feature buffer. The question handler is responsible for reading from memory and answering questions at any time, and contains a projector and a Large Language Model.

As shown in Figure 1, the Flash-VStream framework that we adopted consists of three main components: (1) a streaming visual encoder that continuously processes video frames, (2) a Spatial-Temporal-Abstract-Retrieved memory mechanism (STAR memory), which covers memory writing and reading with the help of a feature buffer, and (3) an LLM decoder capable of responding to questions raised by users.

2.1 Streaming visual encoder

We use the pre-trained CLIP ViT-L [8] as the visual encoder. Only patch tokens are used during training and inference. Specifically, given a frame stream $\{V^{t}\}_{t=1}^{\infty}$, the encoder maps the $t$-th frame $V^{t}\in\mathbb{R}^{H\times W\times 3}$ to a feature map $e^{t}\in\mathbb{R}^{P\times P\times D}$, where $P\times P$ is the number of ViT patch tokens and $D$ is the hidden dimension of the ViT.

2.2 Spatial-Temporal-Abstract-Retrieved memory

To handle information at different levels of granularity, the STAR memory consists of four components: spatial memory $M_{\text{spa}}\in\mathbb{R}^{N_{\text{spa}}\times P_{\text{spa}}^{2}\times D}$, temporal memory $M_{\text{tem}}\in\mathbb{R}^{N_{\text{tem}}\times P_{\text{tem}}^{2}\times D}$, abstract memory $M_{\text{abs}}\in\mathbb{R}^{N_{\text{abs}}\times P_{\text{abs}}^{2}\times D}$, and retrieved memory $M_{\text{ret}}\in\mathbb{R}^{N_{\text{ret}}\times P_{\text{spa}}^{2}\times D}$. A feature buffer $M_{\text{buff}}\in\mathbb{R}^{N_{\text{buff}}\times P_{\text{spa}}^{2}\times D}$ stores the features of the latest $N_{\text{buff}}$ frames. Therefore, the overall memory size is limited to $\text{MAXSIZE}=(N_{\text{spa}}+N_{\text{ret}})\times P_{\text{spa}}^{2}+N_{\text{tem}}\times P_{\text{tem}}^{2}+N_{\text{abs}}\times P_{\text{abs}}^{2}$ tokens.

Spatial memory. Spatial memory houses the most recent and detailed spatial information for short-term use, implemented as a FIFO (First-In-First-Out) queue, as illustrated in Figure 2 and Equation 2. This architecture enables continuous updating with the newest frames, facilitating immediate access to fine-grained spatial data.
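The FIFO behaviour can be illustrated with a minimal sketch. It assumes a newest-first ordering, as suggested by the slicing in Equations 1 and 2; the class name `FeatureBuffer` and the use of strings in place of pooled feature maps are purely illustrative and are not taken from the Flash-VStream code.

```python
from collections import deque


class FeatureBuffer:
    """Keeps the newest n_buff frame features, newest first (sketch of Eq. 1)."""

    def __init__(self, n_buff, n_spa):
        self.n_spa = n_spa
        # With maxlen set, appendleft() drops the item at the right end
        # (the oldest frame) once the deque is full.
        self.frames = deque(maxlen=n_buff)

    def write(self, feature):
        # Prepend so that index 0 is always the most recent frame.
        self.frames.appendleft(feature)

    def spatial_memory(self):
        # Sketch of Eq. 2: the first n_spa entries of the buffer.
        return list(self.frames)[: self.n_spa]


buf = FeatureBuffer(n_buff=3, n_spa=1)
for t in range(5):
    buf.write(f"frame-{t}")  # strings stand in for pooled feature maps

print(list(buf.frames))      # newest first, oldest frames already dropped
print(buf.spatial_memory())
```

With `n_buff=3`, only the three most recent frames survive, and the spatial memory is simply the head of that queue.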

Temporal memory. Temporal memory integrates dynamic information over time, which is crucial for long-term retention. When its size surpasses $N_{\text{tem}}$, the $g_{\text{wkmeans}}$ (Weighted K-means Clustering) algorithm is applied, as shown in Equation 3. This strategy condenses the memory content into $N_{\text{tem}}$ clusters, which can be seen as representations of the key events in the video. The centroids of these clusters are then used as the new memory, efficiently storing temporal context.
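To make the condensation step concrete, the following is a minimal pure-Python sketch of a weighted k-means update. It is not the Flash-VStream implementation: the initialisation from the first `k` points, the fixed iteration count, the squared Euclidean distance, and the convention that a new frame carries weight 1 while an old centroid carries the number of frames it has absorbed are all assumptions made for illustration.

```python
def weighted_kmeans(points, weights, k, iters=10):
    """Cluster `points` (lists of floats) into k groups, weighting each point.

    Returns (centroids, cluster_weights).
    """
    centroids = [list(p) for p in points[:k]]  # assumed initialisation
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid under squared Euclidean distance.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assign[i] = dists.index(min(dists))
        # Update step: weighted mean of the members of each cluster.
        for j in range(k):
            members = [i for i, a in enumerate(assign) if a == j]
            total = sum(weights[i] for i in members)
            if total == 0:
                continue  # keep the previous centroid for an empty cluster
            dim = len(centroids[j])
            centroids[j] = [
                sum(weights[i] * points[i][d] for i in members) / total
                for d in range(dim)
            ]
    # Weight of each cluster under the last assignment.
    cluster_weights = [
        sum(w for w, a in zip(weights, assign) if a == j) for j in range(k)
    ]
    return centroids, cluster_weights


# One new frame (weight 1) followed by two old centroids (weights 3 and 2),
# condensed back to k = 2 clusters.
points = [[1.0, 1.0], [0.0, 0.0], [10.0, 10.0]]
weights = [1, 3, 2]
centroids, sizes = weighted_kmeans(points, weights, k=2)
print(centroids, sizes)
```

In this toy case the new frame is merged into the nearby heavy cluster, whose centroid moves only slightly because of its larger weight.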

Abstract memory. Abstract memory supports high-level semantic concept interpretation through $f_{SA}$, the Semantic Attention model. It follows Equation 4 to synthesize the insights gained from both spatial and temporal memories into abstracted, actionable knowledge. $f_{SA}$ continually updates $M_{\text{abs}}$, a synopsis of the whole video, using the newest features. Refer to Figure 2 for details.

Retrieved memory. Retrieved memory focuses on recalling precise spatial details by identifying and retrieving the most substantial frame features. As shown in Figure 2, it first selects the top-K (where K equals $N_{\text{ret}}$) largest clusters from the $N_{\text{tem}}$ clusters obtained in temporal memory $M_{\text{tem}}$. Then, for each of these K clusters, the frame feature in the feature buffer nearest to the cluster centroid is retrieved to supplement the temporal memory with more detailed spatial information. This process is illustrated in Equation 5.
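The selection logic can be sketched as follows. The sketch assumes that buffer features and centroids have already been reduced to vectors of equal length so that a squared Euclidean distance is well defined; how the two resolutions are actually matched is not specified here, and the function name `retrieve` is hypothetical.

```python
def retrieve(buffer_feats, centroids, cluster_sizes, n_ret):
    """Return indices into buffer_feats of the frames nearest to the
    centroids of the n_ret largest clusters."""
    # Rank clusters by size, largest first, and keep the top n_ret.
    order = sorted(
        range(len(centroids)), key=lambda j: cluster_sizes[j], reverse=True
    )
    picked = []
    for j in order[:n_ret]:
        # Squared Euclidean distance from every buffered frame to centroid j.
        dists = [
            sum((a - b) ** 2 for a, b in zip(feat, centroids[j]))
            for feat in buffer_feats
        ]
        picked.append(dists.index(min(dists)))
    return picked


buffer_feats = [[0.0, 0.1], [5.0, 5.0], [9.0, 9.5], [0.3, 0.2]]
centroids = [[0.2, 0.2], [9.0, 9.0], [5.0, 4.0]]
cluster_sizes = [10, 7, 2]
print(retrieve(buffer_feats, centroids, cluster_sizes, n_ret=2))
```

The returned indices point at full-resolution buffer entries, which is what lets the retrieved memory carry more spatial detail than the pooled temporal centroids.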

In brief, a new feature $e^{t}$ is written to STAR memory as follows:

$$M_{\text{buff}}^{t}=\text{concat}\big(g_{\text{pooling}}(e^{t},P_{\text{spa}}),M_{\text{buff}}^{t-1}\big)[0:N_{\text{buff}},:,:]\tag{1}$$
$$M_{\text{spa}}^{t}=M_{\text{buff}}^{t}[0:N_{\text{spa}},:,:]\tag{2}$$
$$M_{\text{tem}}^{t}=g_{\text{wkmeans}}\Big(\text{concat}\big(g_{\text{pooling}}(e^{t},P_{\text{tem}}),M_{\text{tem}}^{t-1}\big),N_{\text{tem}}\Big)\tag{3}$$
$$M_{\text{abs}}^{t}=f_{SA}\big(M_{\text{abs}}^{t-1},g_{\text{pooling}}(e^{t},P_{\text{abs}}),N_{\text{abs}}\big)\tag{4}$$
$$M_{\text{ret}}^{t}=g_{\text{retrieve}}(M_{\text{buff}}^{t},M_{\text{tem}}^{t},N_{\text{ret}})\tag{5}$$

Here $g_{\text{pooling}}(e,P^{\prime})$ applies average pooling to compress the feature map $e$ from size $P^{2}$ to size $P^{\prime 2}$ along the width and height dimensions, and $\text{concat}(a,b)$ concatenates tensors $a$ and $b$ along the time axis.
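As an illustration of the pooling operator, here is a small pure-Python sketch that averages non-overlapping windows of a $P\times P\times D$ feature map stored as nested lists. It assumes that $P$ is divisible by the target size, which holds for a $24\times 24$ patch grid (336 px input with 14 px patches) and targets of 8, 4 and 1. A real implementation would use a tensor library; the function name `average_pool` is illustrative.

```python
def average_pool(feature_map, p_out):
    """Average-pool a P x P x D feature map (nested lists) to p_out x p_out x D."""
    p_in = len(feature_map)
    if p_in % p_out != 0:
        raise ValueError("P must be divisible by the target size")
    s = p_in // p_out  # side length of each pooling window
    dim = len(feature_map[0][0])
    pooled = []
    for i in range(p_out):
        row = []
        for j in range(p_out):
            cell = [0.0] * dim
            # Sum the s x s window belonging to output position (i, j).
            for di in range(s):
                for dj in range(s):
                    vec = feature_map[i * s + di][j * s + dj]
                    for d in range(dim):
                        cell[d] += vec[d]
            row.append([v / (s * s) for v in cell])
        pooled.append(row)
    return pooled


# 4 x 4 grid with D = 1; each token stores its row-major index.
fmap = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
pooled = average_pool(fmap, 2)
print(pooled)
```

Pooling the same map to a single token, `average_pool(fmap, 1)`, gives the global mean, which corresponds to the $P_{\text{abs}}=1$ setting used for the abstract memory.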

2.3 Real-time LLM decoder

The LLM decoder works as part of a real-time question-answering server. When triggered by a question $Q^{t}$ at time $t$, the LLM decoder first calculates the text embedding $I_{\text{text}}^{t}=f_{\text{embed}}(Q^{t})$ and maps the STAR memory $M^{t}=M_{\text{spa}}^{t}+M_{\text{tem}}^{t}+M_{\text{abs}}^{t}+M_{\text{ret}}^{t}$ to the embedding space with the projector, $I_{\text{vision}}^{t}=f_{\text{proj}}(M^{t})$. It then starts to generate the answer $A^{t}=f_{\text{LLM}}(I_{\text{text}}^{t},I_{\text{vision}}^{t}).\text{decode}()$ in real time.

2.4 Adopting automatic speech recognition (ASR)

To further utilize the audio information in videos, we adopt an automatic speech recognition (ASR) model to transcribe the audio stream into text. The transcribed text is then fed into the LLM decoder as an additional input to improve the performance of video question answering.
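One simple way to pass a transcript to the decoder is to prepend it to the question text. The sketch below is only an illustration of that idea: the prompt template, the function name `build_question_with_speech` and the character limit `max_chars` are hypothetical and are not taken from our released code.

```python
def build_question_with_speech(question, transcript, max_chars=2000):
    """Prepend an ASR transcript to the question text.

    The template and the truncation limit are assumptions of this sketch.
    """
    transcript = " ".join(transcript.split())  # collapse runs of whitespace
    if not transcript:
        return question  # silent clip: fall back to the plain question
    if len(transcript) > max_chars:
        transcript = transcript[:max_chars].rstrip() + " ..."
    return f"Speech transcript: {transcript}\nQuestion: {question}"


print(build_question_with_speech("Who is speaking?", "  Hello   there,  Bob. "))
```

Falling back to the plain question for clips without speech keeps the input format unchanged for silent videos.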

2.5 Implementation details

In this study, we use the pre-trained CLIP ViT-L/14-336px [8] as the streaming visual encoder. Following LLaVA [4], we choose a 2-layer MLP as the visual projector and the pre-trained Vicuna-7B [1] as the LLM decoder. We adopt the open-source ASR model Whisper-large-v3 [9] to pre-transcribe the audio stream into text.

Considering the balance between performance and resource consumption, we set $P_{\text{spa}}=8$, $P_{\text{tem}}=4$, $P_{\text{abs}}=1$, $N_{\text{buff}}=300$, $N_{\text{spa}}=1$, $N_{\text{tem}}=N_{\text{abs}}=25$ and $N_{\text{ret}}=3$. With these settings, the MAXSIZE of the STAR memory is 681 tokens, which keeps computation efficient.
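The token budget can be checked with a few lines of arithmetic. The sketch below evaluates the MAXSIZE formula from Section 2.2 with the settings above and compares it with the number of patch tokens in a single unpooled frame; the helper name `star_max_tokens` is illustrative.

```python
def star_max_tokens(n_spa, n_ret, n_tem, n_abs, p_spa, p_tem, p_abs):
    """Token count of the STAR memory, following the MAXSIZE formula."""
    return (n_spa + n_ret) * p_spa**2 + n_tem * p_tem**2 + n_abs * p_abs**2


# ViT-L/14 at 336 px yields a 24 x 24 patch grid per frame.
patches_per_side = 336 // 14
raw_tokens_per_frame = patches_per_side**2

max_size = star_max_tokens(
    n_spa=1, n_ret=3, n_tem=25, n_abs=25, p_spa=8, p_tem=4, p_abs=1
)
print(raw_tokens_per_frame, max_size)
```

The breakdown is 256 tokens for spatial plus retrieved memory, 400 for temporal memory and 25 for abstract memory, so the whole video is summarized in little more than the 576 tokens of one unpooled frame.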

We train Flash-VStream in three stages: modality alignment, instruction tuning, and domain-specific fine-tuning. The training data of the first two stages are kept the same as in LLaMA-VID [3]: LLaVA-filtered-558K [5] image-caption pairs and LLaMA-VID-filtered-232K [3] video-caption pairs for stage 1, and LLaVA-filtered-665K [5] image QA pairs and Video-ChatGPT-filtered-98K [7] video QA pairs for stage 2. The training data of stage 3 is the training branch of MovieChat-1K [10]. Speech data transcribed by the ASR model are used only in the third stage. They are fed into the LLM decoder as part of the question text, thus providing additional information for video question answering.

For each stage, the model is trained for 1 epoch on 8 A100 80G GPUs. During training, the parameters of the visual encoder are always frozen, while the parameters of the LLM are frozen only in the first stage. All training and inference experiments were conducted in BF16 precision to save time and resources.

During inference on the test branch of MovieChat-1K, we use different strategies for global and breakpoint questions. For global questions, we use the STAR memory to store the information of the whole video and answer the questions. For breakpoint questions, we use only the part of the video that ranges from 15s before the breakpoint to 15s after it.
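The breakpoint window can be sketched as below. Clamping the window to the video boundaries and the frame sampling rate `fps` are assumptions of this sketch, and both function names are hypothetical.

```python
import math


def breakpoint_window(breakpoint_s, duration_s, margin_s=15.0):
    """Return (start, end) in seconds of the clip around a breakpoint,
    clamped to the video boundaries (the clamping is an assumption)."""
    start = max(0.0, breakpoint_s - margin_s)
    end = min(duration_s, breakpoint_s + margin_s)
    return start, end


def frames_in_window(start_s, end_s, fps=1.0):
    """Indices of the frames sampled at `fps` that fall inside the window."""
    first = math.ceil(start_s * fps)
    last = math.floor(end_s * fps)
    return list(range(first, last + 1))


start, end = breakpoint_window(120.0, 600.0)
frames = frames_in_window(start, end, fps=1.0)
print(start, end, len(frames))
```

Only the frames inside this window would be written to the STAR memory before a breakpoint question is answered, whereas global questions use every frame of the video.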

3 Experiments

Table 1: Performance on MovieChat-1K dataset. G. and B. denote global and breakpoint, while Acc. and Sco. denote accuracy and score, respectively. FVS denotes Flash-VStream-7b [11].

Validation set (GPT-3.5 evaluation)
Method	G. Acc.	G. Sco.	B. Acc.	B. Sco.
FVS	66.7	3.823	53.9	3.518
FVS + stage-3	84.0	4.624	73.5	4.078

Test set (Gemini-Pro evaluation)
Method	G. Acc.	G. Sco.	B. Acc.	B. Sco.
MovieChat [10]	55.1	2.78	38.5	1.87
FVS + stage-3	84.5	4.12	52.4	2.89
FVS + stage-3 & speech	96.0	4.60	59.6	2.99

To better showcase how different designs affect the performance of our method, we split off 20% of the MovieChat-1K training set as a validation set and evaluate our model on it with GPT-3.5 as the evaluator. As shown in the upper part of Table 1, fine-tuning the pretrained weights released by Flash-VStream on the training branch of MovieChat-1K is necessary: stage-3 fine-tuning raises global accuracy from 66.7 to 84.0 and breakpoint accuracy from 53.9 to 73.5. The lower part of Table 1 shows the performance of our model on the test set of MovieChat-1K with Gemini-Pro as the evaluator, as evaluated by the organizers. Adding the speech data as an additional input to the LLM decoder further improves global accuracy from 84.5 to 96.0 and breakpoint accuracy from 52.4 to 59.6. Our model also surpasses the training-free baseline MovieChat [10] by a large margin, demonstrating the effectiveness of our method as well as the importance of audio information when processing the MovieChat-1K dataset.

4 Conclusion

In conclusion, we adopted Flash-VStream [11], a recently proposed long video-QA model. By incorporating a hierarchical memory, the model can effectively compress visual tokens throughout the whole video. We further fine-tuned the pretrained weights released by Flash-VStream on the training branch of MovieChat-1K. The transcribed speech data are also leveraged as an additional input to the LLM decoder to further improve performance on the MovieChat-1K dataset, achieving 1st place in the challenge. We hope our method can inspire further research and advancements in the field of long video stream understanding.

References

[1] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.
[2] Peng **, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023.
[3] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023.
[4] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
[5] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b.
[6] Fan Ma, Xiaojie **, Heng Wang, Yuchen Xian, Jiashi Feng, and Yi Yang. Vista-llama: Reliable video narrator via equal distance to visual tokens. arXiv preprint arXiv:2312.08870, 2023.
[7] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
[8] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
[9] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022.
[10] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023.
[11] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie **. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024.