Hierarchical Memory for Long Video QA

Yiqin Wang1,*, Haoji Zhang1,*, Yansong Tang1,†, Yong Liu1, Jifeng Dai2, Jiashi Feng3, Xiaojie **3,†
1 Shenzhen International Graduate School, Tsinghua University
2 Department of Electronic Engineering, Tsinghua University   3 ByteDance Inc.
{yq-wang23@mails.,haoji-zh20@mails.,tang.yansong@sz.}tsinghua.edu.cn
Abstract

This paper describes our champion solution to the LOVEU Challenge @ CVPR’24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, which makes long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency while preserving the information essential for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream [11], that is capable of processing long videos with limited GPU memory (VRAM). We further utilize the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available at the project homepage https://invinciblewyq.github.io/vstream-page.

* Equal contribution. † Corresponding author.

1 Introduction

Most existing large video-language models face challenges when performing long video question-answering upon user queries [3, 2, 6, 10]. The main reason is that, without effective compression, the visual tokens of consecutive frames are heavy and redundant. This makes it impossible to keep all visual features in limited GPU memory (VRAM) and significantly increases the decoding latency of the language model. To address this issue, we adopt a hierarchical memory mechanism named STAR Memory, recently proposed in Flash-VStream [11]. By further fine-tuning the pretrained weights released by Flash-VStream on the training branch of MovieChat-1K, we leverage both the general video compression and understanding ability of STAR Memory and the domain-specific knowledge of MovieChat-1K. The strong performance of our model demonstrates the effectiveness of the hierarchical memory method, namely STAR Memory, in long video question-answering tasks.

2 Method

Figure 1: Overview of the Flash-VStream framework that we adopt for real-time online video stream understanding. Flash-VStream is executed by two processes, namely the “frame handler” and the “question handler”. The frame handler is responsible for encoding frames and writing to memory; it contains a visual encoder, a STAR memory and a feature buffer. The question handler is responsible for reading from memory and answering questions at any time; it contains a projector and a Large Language Model.

As shown in Figure 1, the Flash-VStream framework that we adopt consists of three main components: (1) a streaming visual encoder that continuously processes video frames; (2) a Spatial-Temporal-Abstract-Retrieved memory mechanism (STAR memory), including memory writing and reading with the help of a feature buffer; and (3) an LLM decoder capable of answering questions raised by users.

2.1 Streaming visual encoder

We use the pre-trained CLIP ViT-L [8] as visual encoder. Only patch tokens are used during training and inference. Specifically, given a frame stream $\{V^{t}\}_{t=1}^{\infty}$, the encoder maps the $t$-th frame $V^{t}\in\mathbb{R}^{H\times W\times 3}$ to a feature map $e^{t}\in\mathbb{R}^{P\times P\times D}$, where $P\times P$ is the number of ViT patch tokens and $D$ is the hidden dimension of ViT.

2.2 Spatial-Temporal-Abstract-Retrieved memory

Figure 2: STAR memory writing mechanism. (a) Update spatial memory by a FIFO queue. (b) Update temporal memory by Weighted K-means Clustering. (c) Update abstract memory by Semantic Attention. (d) Update retrieved memory by key frame feature retrieval. Here feature map $e^{T}$ has multiple sizes. “S”, “T”, “A” and “R” represent tokens of spatial, temporal, abstract and retrieved memory, respectively.

In order to handle information of different levels of granularity, the STAR memory consists of 4 components: spatial memory $M_{\text{spa}}\in\mathbb{R}^{N_{\text{spa}}\times P_{\text{spa}}^{2}\times D}$, temporal memory $M_{\text{tem}}\in\mathbb{R}^{N_{\text{tem}}\times P_{\text{tem}}^{2}\times D}$, abstract memory $M_{\text{abs}}\in\mathbb{R}^{N_{\text{abs}}\times P_{\text{abs}}^{2}\times D}$ and retrieved memory $M_{\text{ret}}\in\mathbb{R}^{N_{\text{ret}}\times P_{\text{spa}}^{2}\times D}$. A feature buffer $M_{\text{buff}}\in\mathbb{R}^{N_{\text{buff}}\times P_{\text{spa}}^{2}\times D}$ is used to store the features of the latest $N_{\text{buff}}$ frames. Therefore, the overall memory size is limited to $\text{MAXSIZE}=(N_{\text{spa}}+N_{\text{ret}})\times P_{\text{spa}}^{2}+N_{\text{tem}}\times P_{\text{tem}}^{2}+N_{\text{abs}}\times P_{\text{abs}}^{2}$ tokens.

Spatial memory. Spatial memory houses the most recent and detailed spatial information for short-term use, implemented as a FIFO (First-In-First-Out) queue, as illustrated in Figure 2 and Equation 2. This architecture enables continuous updating with the newest frames, facilitating immediate access to fine-grained spatial data.

Temporal memory. Temporal memory integrates dynamic information over time, crucial for long-term retention. When its size surpasses $N_{\text{tem}}$, the $g_{\text{wkmeans}}$ (Weighted K-means Clustering) algorithm is applied, as shown in Equation 3. This strategy condenses the memory content into $N_{\text{tem}}$ clusters which can be seen as the representation of key events in videos. Then the centroids of these clusters are used as the new memory for efficiently storing temporal contexts.
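The clustering step can be illustrated with a small NumPy sketch. This is not the released implementation: it assumes that every memory entry carries a weight equal to the number of frames it already summarizes (a new frame enters with weight 1), that entries are flattened to vectors, and that the first k points are used for initialization. The function name `weighted_kmeans` and its parameters are hypothetical.

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iter=10):
    """Condense n weighted points into k centroids.

    points:  (n, f) array, one flattened memory entry per row.
    weights: (n,) array, assumed to be the number of frames each entry summarizes.
    Returns (centroids, cluster_weights) with shapes (k, f) and (k,).
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if points.shape[0] <= k:
        # Memory is not over capacity yet, nothing to condense.
        return points.copy(), weights.copy()
    centroids = points[:k].copy()  # assumption: deterministic init from the first k points
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid: (n, k).
        dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                w = weights[mask]
                # Weighted mean, so entries that summarize more frames pull harder.
                centroids[j] = (w[:, None] * points[mask]).sum(axis=0) / w.sum()
    # A cluster's weight is the total weight of its members (0 if it ended up empty).
    cluster_weights = np.array([weights[assign == j].sum() for j in range(k)])
    return centroids, cluster_weights

# Toy usage: three 2-D entries condensed into two clusters.
cents, sizes = weighted_kmeans([[0, 0], [0, 1], [10, 10]], [1, 3, 1], k=2)
```

Carrying the cluster weights forward lets the next write treat each centroid as a summary of many frames rather than as a single frame.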

Abstract memory. Abstract memory supports high-level semantic concept interpretation through $f_{SA}$, the Semantic Attention model. It follows Equation 4 to synthesize the insights gained from both spatial and temporal memories into abstracted, actionable knowledge. $f_{SA}$ keeps updating $M_{\text{abs}}$, a synopsis of the whole video, with the newest features. Refer to Figure 2 for details.

Retrieved memory. Retrieved memory focuses on recalling precise spatial details by identifying and retrieving the most substantial frame features. As shown in Figure 2, it first selects the top-K (where K equals $N_{\text{ret}}$) largest clusters from the $N_{\text{tem}}$ clusters obtained in temporal memory $M_{\text{tem}}$. Then the frame features in the feature buffer nearest to the centroids of these K clusters are retrieved to supplement the temporal memory with more detailed spatial information. This process is illustrated in Equation 5.
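A minimal sketch of this retrieval step is given below. It rests on assumptions that the text does not state: cluster sizes are available from the clustering step, and each buffered frame has a low-resolution key vector living in the same space as the temporal centroids (for instance, the frame feature pooled to the temporal resolution), so that distances are well defined. The names `g_retrieve_sketch`, `buff_keys` and `cluster_sizes` are hypothetical.

```python
import numpy as np

def g_retrieve_sketch(m_buff, buff_keys, centroids, cluster_sizes, n_ret):
    """Pick the buffered frames closest to the centroids of the largest clusters.

    m_buff:        (n_buff, p_spa**2, D) fine-grained buffered frame features.
    buff_keys:     (n_buff, f) one key per buffered frame (assumed comparable to centroids).
    centroids:     (k, f) temporal-memory centroids, flattened.
    cluster_sizes: (k,) size of each cluster.
    Returns the retrieved features (n_ret, p_spa**2, D) and their buffer indices.
    """
    cluster_sizes = np.asarray(cluster_sizes, dtype=float)
    # Indices of the n_ret largest clusters (stable sort keeps ties in order).
    top = np.argsort(-cluster_sizes, kind="stable")[:n_ret]
    # Squared distances between the selected centroids and every buffered key.
    diff = centroids[top][:, None, :] - buff_keys[None, :, :]
    dist = (diff ** 2).sum(axis=-1)          # (n_ret, n_buff)
    nearest = dist.argmin(axis=1)            # one buffered frame per selected cluster
    return m_buff[nearest], nearest

# Toy usage with 1-D keys: clusters 0 and 2 are the largest.
feats = np.arange(6, dtype=float).reshape(3, 2, 1)
ret, idx = g_retrieve_sketch(
    feats,
    buff_keys=np.array([[19.0], [1.0], [9.0]]),
    centroids=np.array([[0.0], [10.0], [20.0]]),
    cluster_sizes=[5, 1, 3],
    n_ret=2,
)
```

Returning the full-resolution buffer entries rather than the centroids themselves is what supplies the extra spatial detail described above.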

In brief, a new feature $e^{t}$ is written to STAR memory as follows:

\begin{align}
M_{\text{buff}}^{t} &= \text{concat}\big(g_{\text{pooling}}(e^{t},P_{\text{spa}}),M_{\text{buff}}^{t-1}\big)[0:N_{\text{buff}},:,:] \tag{1}\\
M_{\text{spa}}^{t} &= M_{\text{buff}}^{t}[0:N_{\text{spa}},:,:] \tag{2}\\
M_{\text{tem}}^{t} &= g_{\text{wkmeans}}\Big(\text{concat}\big(g_{\text{pooling}}(e^{t},P_{\text{tem}}),M_{\text{tem}}^{t-1}\big),N_{\text{tem}}\Big) \tag{3}\\
M_{\text{abs}}^{t} &= f_{SA}\big(M_{\text{abs}}^{t-1},g_{\text{pooling}}(e^{t},P_{\text{abs}}),N_{\text{abs}}\big) \tag{4}\\
M_{\text{ret}}^{t} &= g_{\text{retrieve}}(M_{\text{buff}}^{t},M_{\text{tem}}^{t},N_{\text{ret}}) \tag{5}
\end{align}

Here $g_{\text{pooling}}(e,P^{\prime})$ applies Average Pooling to compress feature map $e$ from $P^{2}$ to $P^{\prime 2}$ size along width and height dimensions. $\texttt{concat}(a,b)$ means concatenating tensors $a$ and $b$ along time axis.
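Equations 1 and 2 together with the pooling operator can be sketched in NumPy as follows. The sketch assumes that the input side length P is divisible by the target side length (true for a 24 × 24 patch grid pooled to 8, 4 or 1) and that pooled maps are flattened to P′² tokens before being stored; the function names are illustrative, not taken from the released code.

```python
import numpy as np

def g_pooling(e, p_out):
    """Average-pool a (P, P, D) feature map to (p_out, p_out, D).

    Assumes P is divisible by p_out.
    """
    p, p_w, d = e.shape
    assert p == p_w and p % p_out == 0
    k = p // p_out  # side length of each pooling window
    # Split both spatial axes into (p_out, k) blocks, then average inside each block.
    return e.reshape(p_out, k, p_out, k, d).mean(axis=(1, 3))

def write_buffer(e_t, m_buff_prev, n_buff, n_spa, p_spa):
    """Eq. (1)-(2): prepend the newest pooled frame, truncate, slice spatial memory."""
    newest = g_pooling(e_t, p_spa).reshape(1, p_spa * p_spa, -1)
    m_buff = np.concatenate([newest, m_buff_prev], axis=0)[:n_buff]  # FIFO, newest first
    m_spa = m_buff[:n_spa]
    return m_buff, m_spa

# Toy usage: stream three constant frames through a buffer of capacity 2.
m_buff = np.zeros((0, 64, 2))  # empty buffer, D = 2 for illustration only
for t in range(3):
    frame = np.full((24, 24, 2), float(t))
    m_buff, m_spa = write_buffer(frame, m_buff, n_buff=2, n_spa=1, p_spa=8)
```

Because the newest frame is placed at index 0 and the buffer is truncated from the end, the oldest frame is the one that gets dropped, which matches the FIFO behaviour described for spatial memory.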

2.3 Real-time LLM decoder

The LLM decoder works as part of a real-time question answering server. When triggered by a question $Q^{t}$ at time $t$, the LLM decoder first calculates the text embedding $I_{\text{text}}^{t}=f_{\text{embed}}(Q^{t})$ and maps the STAR memory $M^{t}=M_{\text{spa}}^{t}+M_{\text{tem}}^{t}+M_{\text{abs}}^{t}+M_{\text{ret}}^{t}$ to embedding space with the projector $I_{\text{vision}}^{t}=f_{\text{proj}}(M^{t})$. Then it starts to generate answer $A^{t}=f_{\text{LLM}}(I_{\text{text}}^{t},I_{\text{vision}}^{t}).\text{decode}()$ in real time.
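Since the four memories have different per-frame token counts, the “+” in the definition of the STAR memory is most naturally read as concatenation along the token axis rather than element-wise addition. The sketch below illustrates that reading with toy shapes; the interpretation, the ordering of the parts and the function name `read_memory` are assumptions.

```python
import numpy as np

def read_memory(m_spa, m_tem, m_abs, m_ret):
    """Flatten each (N, P**2, D) memory to (N * P**2, D) and join along the token axis."""
    parts = [m.reshape(-1, m.shape[-1]) for m in (m_spa, m_tem, m_abs, m_ret)]
    return np.concatenate(parts, axis=0)

# Toy shapes only (D = 3); real sizes are given in the implementation details.
tokens = read_memory(
    np.zeros((1, 4, 3)),   # spatial:   1 frame  x 4 tokens
    np.zeros((3, 1, 3)),   # temporal:  3 entries x 1 token
    np.zeros((3, 1, 3)),   # abstract:  3 entries x 1 token
    np.zeros((2, 4, 3)),   # retrieved: 2 frames x 4 tokens
)
print(tokens.shape)  # (18, 3)
```

The resulting token sequence is what the projector would map into the embedding space of the language model.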

2.4 Adopting automatic speech recognition (ASR)

To further utilize the audio information in videos, we adopt an automatic speech recognition (ASR) model to transcribe the audio stream into text. The transcribed text is then fed into the LLM decoder as an additional input to improve the performance of video question answering.

2.5 Implementation details

In this study, we utilize the pre-trained CLIP ViT-L/14-336px [8] as the streaming visual encoder. Following LLaVA [4], we choose a 2-layer MLP as the visual projector and the pre-trained Vicuna-7B [1] as the LLM decoder. We adopt the open-source ASR model Whisper-large-v3 [9] to pre-transcribe the audio stream into text.

Considering the balance between performance and resource consumption, we set $P_{\text{spa}}=8$, $P_{\text{tem}}=4$, $P_{\text{abs}}=1$, $N_{\text{buff}}=300$, $N_{\text{spa}}=1$, $N_{\text{tem}}=N_{\text{abs}}=25$ and $N_{\text{ret}}=3$. The resulting MAXSIZE of the STAR memory is 681 tokens, which keeps computation efficient.
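As a check, substituting these values into the MAXSIZE formula from Section 2.2 reproduces the stated token budget:

```latex
\begin{align*}
\text{MAXSIZE} &= (N_{\text{spa}}+N_{\text{ret}})\times P_{\text{spa}}^{2}
                 + N_{\text{tem}}\times P_{\text{tem}}^{2}
                 + N_{\text{abs}}\times P_{\text{abs}}^{2} \\
               &= (1+3)\times 8^{2} + 25\times 4^{2} + 25\times 1^{2} \\
               &= 256 + 400 + 25 = 681 .
\end{align*}
```

The feature buffer is not part of this budget, since it is only used while writing memory and is not read by the decoder.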

We train Flash-VStream in 3 stages: modality alignment, instruction tuning, and domain-specific fine-tuning. The training data of the first 2 stages are kept the same as in LLaMA-VID [3], including LLaVA-filtered-558K [5] image-caption pairs and LLaMA-VID-filtered-232K [3] video-caption pairs for stage 1, and LLaVA-filtered-665K [5] image QA pairs and Video-ChatGPT-filtered-98K [7] video QA pairs for stage 2. The training data of stage 3 is the training branch of MovieChat-1K [10]. Speech data transcribed by the ASR model are only used in the third stage. Speech data are fed into the LLM decoder as part of the question text, thus providing additional information for video question answering.
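Feeding speech as part of the question text could look like the following sketch. The prompt template, the field labels and the function name `build_question_text` are hypothetical; the exact format used in training is not specified here.

```python
def build_question_text(question: str, transcript: str = "") -> str:
    """Prepend the ASR transcript to the question when one is available.

    The template below is a hypothetical example, not the format used by the authors.
    """
    transcript = transcript.strip()
    if not transcript:
        # No speech detected: fall back to the plain question.
        return question
    return f"Speech transcript: {transcript}\nQuestion: {question}"

prompt = build_question_text("Who enters the room?", " Hello, is anyone home? ")
```

Keeping the transcript inside the text prompt means the visual memory and the projector need no changes to accommodate audio.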

For each stage, the model is trained for 1 epoch on 8 A100 80G GPUs. During training, the parameters of the visual encoder are frozen, and the parameters of the LLM are frozen only in the first stage. All training and inference experiments were conducted under BF16 precision to save time and resources.

During inference on the test branch of MovieChat-1K, we use different strategies for global and breakpoint questions. For global questions, we use the STAR memory to store the information of the whole video and answer the questions. For breakpoint questions, we only use a part of the video, ranging from 15s before the breakpoint to 15s after the breakpoint, to answer the questions.
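The clip selection for breakpoint questions can be sketched as a small helper. Clamping the window to the video boundaries is an assumption for breakpoints near the start or end of a video, and the function name `breakpoint_window` is illustrative.

```python
def breakpoint_window(breakpoint_s: float, duration_s: float, margin_s: float = 15.0):
    """Return the (start, end) time range in seconds used for a breakpoint question.

    The window spans margin_s before and after the breakpoint and is clamped
    to [0, duration_s] (the clamping is an assumption of this sketch).
    """
    start = max(0.0, breakpoint_s - margin_s)
    end = min(duration_s, breakpoint_s + margin_s)
    return start, end

print(breakpoint_window(100.0, 600.0))  # (85.0, 115.0)
print(breakpoint_window(5.0, 600.0))    # (0.0, 20.0)
```

Only frames inside this range would be written to the STAR memory before the question is answered.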

3 Experiments

Table 1: Performance on the MovieChat-1K dataset. G. and B. denote global and breakpoint, while Acc. and Sco. denote accuracy and score, respectively. FVS denotes Flash-VStream-7b [11].

Val set, GPT-3.5 eval

| Method | G. Acc. | G. Sco. | B. Acc. | B. Sco. |
|---|---|---|---|---|
| FVS | 66.7 | 3.823 | 53.9 | 3.518 |
| FVS + stage-3 | 84.0 | 4.624 | 73.5 | 4.078 |

Test set, Gemini-Pro eval

| Method | G. Acc. | G. Sco. | B. Acc. | B. Sco. |
|---|---|---|---|---|
| MovieChat [10] | 55.1 | 2.78 | 38.5 | 1.87 |
| FVS + stage-3 | 84.5 | 4.12 | 52.4 | 2.89 |
| FVS + stage-3 & speech | 96.0 | 4.60 | 59.6 | 2.99 |

To better showcase how different designs affect the performance of our method, we split out 20% of the MovieChat-1K training set as a validation set and evaluate our model on it with GPT-3.5 as the evaluator. As shown in the upper part of Table 1, fine-tuning the pretrained weights released by Flash-VStream on the training branch of MovieChat-1K is necessary: stage-3 fine-tuning raises global accuracy from 66.7 to 84.0 and breakpoint accuracy from 53.9 to 73.5. The lower part of Table 1 shows the performance of our model on the test set of MovieChat-1K with Gemini-Pro as the evaluator, as evaluated by the organizers. Adding the speech data as an additional input to the LLM decoder improves global accuracy from 84.5 to 96.0 and breakpoint accuracy from 52.4 to 59.6. Our model also surpasses the training-free baseline method MovieChat [10] by a large margin, demonstrating the effectiveness of our method as well as the importance of audio information when processing the MovieChat-1K dataset.

4 Conclusion

In conclusion, we adopted Flash-VStream [11], a recently proposed long video-QA model. By incorporating a hierarchical memory, the model can effectively compress visual tokens throughout the whole video. We further fine-tuned the pretrained weights released by Flash-VStream on the training branch of MovieChat-1K. The transcribed speech data are also leveraged as an additional input to the LLM decoder to further improve its performance on the MovieChat-1K dataset, achieving 1st place in the challenge. We hope our method can inspire further research and advancements in the field of long video stream understanding.

References

  • Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.
  • ** et al. [2023] Peng **, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023.
  • Li et al. [2023] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023.
  • Liu et al. [2023a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
  • Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b.
  • Ma et al. [2023] Fan Ma, Xiaojie **, Heng Wang, Yuchen Xian, Jiashi Feng, and Yi Yang. Vista-llama: Reliable video narrator via equal distance to visual tokens. arXiv preprint arXiv:2312.08870, 2023.
  • Maaz et al. [2023] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
  • Radford et al. [2022] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022.
  • Song et al. [2023] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023.
  • Zhang et al. [2024] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie **. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024.