LLaVolta: Efficient Multi-modal Models
via Stage-wise Visual Context Compression

Jieneng Chen*, Luoxin Ye*, Ju He, Zhao-Yang Wang, Daniel Khashabi†, Alan Yuille†
(*Equally contributed. †Equally advised.)
Johns Hopkins University
Abstract

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained a largely overlooked area. In this work, we present a study of the redundancy of visual tokens and of efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simple average pooling only leads to a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. Addressing this, we introduce Visual Context Compressor, which reduces the number of visual tokens during training to enhance training efficiency without sacrificing performance. To minimize the information loss caused by compressing visual tokens while maintaining training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta incorporates stage-wise visual context compression that progressively moves from heavy to light compression, and finally to no compression at the end of training, yielding no loss of information when testing. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs.

1 Introduction

The advent of LLMs [chatgpt, gpt4, touvron2023llama2] has marked a new era in the field of artificial intelligence and natural language processing. LLMs can play a role as a universal interface for a general-purpose assistant, where various task instructions can be explicitly represented in language and guide the end-to-end trained neural assistant to solve a task of interest. For example, the recent success of ChatGPT (chatgpt) and GPT-4 (gpt4) has demonstrated the power of aligned LLMs in following human instructions, and has stimulated tremendous interest in developing open-source LLMs (team2024gemma; touvron2023llama). As the horizon of LLM applications broadens and the availability of open-source LLMs increases, the integration of multi-modality into these models presents a new frontier in expanding their capabilities. Multi-modal LLMs (MLLMs) (alayrac2022flamingo; liu2024visual; team2023gemini; zhu2023minigpt), which can process and understand not just text but also visual information, stand at the cutting edge of this evolution.

Figure 1: Visual tokens are redundant in MLLMs. Left: The accuracy of the LLaVA-1.5-7B (liu2024visual) model on the GQA (hudson2019gqa) benchmark varies with the percentage of retained visual tokens. The $x$-axis represents the percentage of original visual tokens preserved after applying 1D average pooling with varying stride sizes $S$ in the $i$-th Transformer layer. Right: Visual tokens receive less attention from the [ANS] token in the deeper layers of the LLaVA-1.5-7B model. These findings collectively suggest a significant redundancy within the visual tokens of MLLMs.

While MLLMs have made significant strides, a crucial aspect that remains relatively unexplored is the efficient representation and processing of visual information within these models. Substantial efforts (hou2022token; qin2023nugget; zeng2024vcc) have been dedicated to the efficient representation of text tokens through various compression techniques, aimed at enhancing inference efficiency by attentively selecting important tokens. However, the efficient learning of visual tokens in MLLMs has not garnered comparable attention. Naturally, this raises questions about the potential redundancy present in visual tokens and its implications for the overall computational efficiency of MLLMs.

We start our work by addressing the question: Are visual tokens redundant in multi-modal LLMs? To explore this, we first experiment with simply reducing the number of visual tokens in a pre-trained LLaVA-1.5-7B (liu2024visual) at the inference stage via average pooling (§3). As shown in Fig. 1 (left), our initial results demonstrate that eliminating up to 70% of visual tokens by pooling them with a stride of 4 starting from Transformer layer 2 incurs only a minimal performance loss on the GQA benchmark, specifically a 3% accuracy reduction. Additionally, we compute and present the average attention values from the [ANS] token to visual tokens and system prompt tokens across different Transformer layers in the pre-trained LLaVA-1.5-7B (liu2024visual). As revealed in Fig. 1 (right; blue trends), the visual tokens receive less attention from the [ANS] token, on average, as the layers get deeper. These two early explorations indicate significant redundancy in visual tokens.

Addressing this, in this work we develop an effective Visual Context Compressor that can be integrated into the training of MLLMs. Surprisingly, a simple average pooler nested in LLMs stands out as the most effective compressor, outperforming the attention-based (hou2022token; zeng2024vcc) and parametric (li2023blip) counterparts. We attribute this to two reasons: (1) The simple pooling operation makes training stable, whereas prior attention-based approaches (hou2022token; zeng2024vcc) are specifically designed for accelerating inference rather than training. (2) Visual tokens in the deeper Transformer layers are less attended to (see Fig. 1 (right)) and particularly redundant, making a simple compressor placed in a deeper Transformer layer effective enough. At a lower training cost, the LLaVA-1.5-7B (liu2024visual) trained with the proposed Visual Context Compressor is competitive with the non-compressed baseline across various multi-modal benchmarks (e.g., GQA (hudson2019gqa) and MM-Vet (yu2023mm)). Together, the lower training cost and the competitive accuracy show that Visual Context Compressor improves the efficiency of MLLMs without degrading their performance on multi-modal question-answering benchmarks.

To further mitigate the information loss caused by compressing visual tokens, especially under a large compression ratio (CR), we have devised a LLaVA-powered lite training scheme, dubbed LLaVolta, which progressively employs Visual Context Compressor at multiple training stages with different compression ratios (§3.3). Specifically, LLaVolta progresses through several stages, beginning with a high level of visual token compression and gradually reducing the compression ratio until the final stages, where full visual tokens are utilized. This multi-stage approach allows for adaptive compression levels that ensure training efficiency without losing information at testing, thus maintaining the overall effectiveness of the model.

Extensive experimental evaluations of LLaVolta have been conducted on thirteen widely adopted MLLM benchmarks for both image-language understanding and video-language understanding, showing promising results. We observe that LLaVolta not only enhances the performance of MLLMs, but also achieves a substantial reduction in training costs. These experiments validate the effectiveness of our method, demonstrating its capability to optimize resource utilization while maintaining or even improving model performance.

In summary, our paper makes the following contributions:

  • We present two initial studies to verify the redundancy of visual tokens in MLLMs.

  • We propose the Visual Context Compressor, a simple yet effective compression technique that utilizes an average pooler, enhancing the efficiency of multi-modal models.

  • We propose LLaVolta, an efficient training scheme that applies Visual Context Compressor at multiple training stages with a progressively decreasing compression ratio. To the best of our knowledge, we are among the first to explore efficient training of MLLMs.

  • Extensive experiments show that our approach not only improves the performance of MLLMs in image-language and video-language understanding across various benchmarks but also showcases efficiency gains by reducing training costs by 16%.

2 Related Works

Multi-modal LLMs. The evolution of large language models (gpt4; chatgpt; chiang2023vicuna) into their multi-modal counterparts (team2023gemini; liu2024visual) represents a significant leap in their ability to follow instructions and generalize across tasks. This transition has been marked by seminal works such as Flamingo (alayrac2022flamingo), BLIP-2 (li2023blip) and LLaVA (liu2024visual), which have extended LLM capabilities to encompass visual tasks, demonstrating impressive zero-shot generalization and in-context learning abilities. Progress in multi-modal LLMs has primarily been driven by advancements in visual instruction tuning (liu2024visual; zhu2023minigpt), leveraging vision-language datasets and refining visual instruction-following data. Additionally, efforts have been made to enhance the grounding capabilities of multi-modal LLMs through the use of specialized datasets aimed at improving task-specific performance. Despite these advancements, the exploration of visual compression within multi-modal LLMs remains relatively underdeveloped. The design and optimization of compression strategies are crucial for maximizing the effectiveness and efficiency of multi-modal LLMs, suggesting a potential area for future research and development.

Visual Redundancy. In computer vision, reducing redundancy is crucial for creating efficient yet effective models without losing accuracy (barlow2001redundancy). Redundancy in images often arises from the inherent characteristics of natural scenes, including repetitive patterns, textures, and areas of uniform color. These features, while contributing to the richness and detail of visual perception, can lead to inefficiencies in both storage and processing when not adequately addressed. Image compression algorithms (wallace1992jpeg) can reduce file size by eliminating or efficiently encoding redundant data. These methods take advantage of the tolerances of human visual perception to subtly reduce data without significantly impacting image quality. Advanced machine learning models, particularly CNNs and autoencoders (baldi2012autoencoders), offer sophisticated approaches to minimizing redundancy. Transformers (vaswani2017attention), as a fundamental architecture for LLMs (chiang2023vicuna; gpt4), apply self-attention mechanisms to dynamically bind the most informative parts of tokens. Vision Transformers (chen2024vitamin; chen2022transmix; dosovitskiy2020image; he2022transfg) trained with the CLIP objective (chen2024vitamin; radford2021learning) encode an image into a sequence of visual features for multi-modal LLMs (liu2024visual). Nevertheless, visual tokens receive less attention in LLMs due to attention shrinkage (xiao2023efficient), resulting in a waste of computation. In this work, we focus on reducing the redundancy of visual tokens in MLLMs.

Efficient LLMs. Efficient inference and training for LLMs are important. Compressing input sequences for efficiency reasons in Transformers is not a new idea in NLP, and much work has been done to accelerate the inference of LMs. For example, Pyramid Transformer variants (dai2020funnel; huang2022pyramid) are proposed for Encoder-Decoder LMs that progressively compress the sequence as the layers grow deeper via pooling or core-set selection. Nawrot et al. (nawrot2022efficient) propose adaptively compressing the sequence based on the predicted semantic boundaries within the sequence. Rae et al. (rae2019compressive) propose compressing the fine-grained past activations to coarser memories. VCC (zeng2024vcc) compresses the sequence into a much smaller representation at each layer by prioritizing important tokens. Besides efficient inference, accelerating training for LLMs attracts attention as well. A staged training setup (shen2022staged) is proposed which begins with a small model and incrementally increases the amount of compute used for training by applying a growth operator to increase the model depth and width. However, efficient training for LLMs in multi-modal scenarios is rarely explored.

3 Method

In this section, we first introduce an overview of multi-modal LLMs in § 3.1. Then, we define the problem of visual redundancy and introduce Visual Context Compressor in § 3.2. Finally, we present our proposed LLaVolta in § 3.3.

3.1 Preliminaries: A Multi-modal LLM

We start by reviewing the design of the LLaVA family (liu2024visual; liu2023improvedllava). To process an input image $\mathbf{X}_v$, we use the pre-trained CLIP visual encoder ViT-L/14 (radford2021learning) to extract the visual feature $\mathbf{Z}_v = g(\mathbf{X}_v)$, where $g(\cdot)$ denotes the visual encoder. To bridge the gap between the visual and linguistic modalities, the LLaVA framework (liu2024visual; liu2023improvedllava), as an MLLM, implements a straightforward linear/MLP transformation. This involves a trainable projection matrix $\mathbf{W}$, which maps the visual features $\mathbf{Z}_v$ into the linguistic embedding space, producing language embedding tokens $\mathbf{H}_v = \mathbf{W}\mathbf{Z}_v$. These tokens are designed to match the dimensionality of the word embeddings within the LLM.

For each image $\mathbf{X}_v$, one can generate multi-turn conversation data $(\mathbf{X}_q^1, \mathbf{X}_a^1, \cdots, \mathbf{X}_q^T, \mathbf{X}_a^T)$, where $T$ is the number of turns. One can organize them as a sequence by treating all answers as the assistant's responses, with the instruction $\mathbf{X}_{\texttt{instruct}}^t$ at the $t$-th turn defined as:

\begin{equation}
\mathbf{X}_{\texttt{instruct}}^{t} =
\begin{cases}
\text{Random Choose } [\mathbf{X}_q^1, \mathbf{X}_v] \text{ or } [\mathbf{X}_v, \mathbf{X}_q^1], & t = 1 \\
\mathbf{X}_q^t, & t > 1
\end{cases}
\tag{1}
\end{equation}
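As a rough illustration of Eq. (1), the sketch below builds the instruction for a given turn. The function name `build_instruction` and the `"<image>"` placeholder standing in for $\mathbf{X}_v$ are hypothetical and chosen only for this example; they are not taken from the LLaVA codebase.

```python
import random


def build_instruction(t, questions, image_token="<image>", rng=random):
    """Return X_instruct^t following Eq. (1).

    t is the 1-based turn index and questions[t - 1] holds X_q^t.
    image_token is a hypothetical placeholder standing in for X_v.
    """
    question = questions[t - 1]
    if t == 1:
        # First turn: randomly place the image before or after the question.
        if rng.random() < 0.5:
            return [question, image_token]
        return [image_token, question]
    # Later turns: the instruction is just the question.
    return [question]


questions = ["What is in the image?", "What color is it?"]
first = build_instruction(1, questions, rng=random.Random(0))
second = build_instruction(2, questions)
print(first, second)
```

Only the first turn carries the image, so the visual tokens appear once per conversation regardless of the number of turns.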

This approach establishes a standardized format for the multi-modal instruction-following sequence. It allows for the instruction-based tuning of the LLM to be applied to the prediction tokens, utilizing the model's native auto-regressive training objective. Specifically, for a sequence of length $L$, the likelihood of the target responses $\mathbf{X}_a$ is calculated as:

\begin{equation}
p(\mathbf{X}_a \mid \mathbf{X}_v, \mathbf{X}_{\texttt{instruct}}) = \prod_{i=1}^{L} p_{\theta}(x_i \mid \mathbf{X}_v, \mathbf{X}_{\texttt{instruct},<i}, \mathbf{X}_{a,<i}).
\tag{2}
\end{equation}
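The product in Eq. (2) is commonly evaluated in log space to avoid numerical underflow on long sequences. The following minimal sketch illustrates the factorization with made-up per-token probabilities; the function name and the values are assumptions for illustration only.

```python
import math


def sequence_log_likelihood(token_probs):
    """Sum of log p_theta(x_i | context) over the L answer tokens."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical per-token probabilities for a 3-token answer.
probs = [0.9, 0.8, 0.5]
log_lik = sequence_log_likelihood(probs)
likelihood = math.exp(log_lik)  # equals the product in Eq. (2)
print(round(likelihood, 6))
```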

3.2 Visual Context Compressor

Problem Formulation: The redundancy observed in images often arises from inherent traits of natural scenes, including repetitive patterns, textures, and regions with uniform color. While these traits enrich visual perception by offering detail and depth, they can also present challenges in terms of storage and processing efficiency. Considering the inherent limitations of Transformers in handling long sequences (liu2023lost; anil2022exploring; ye2024analobench), it is critical to minimize any length redundancies to obtain a more effective accuracy/efficiency trade-off.

The objective of this study is to decrease the length of visual tokens $\mathbf{X}_v$ (i.e., their hidden states $\mathbf{H}_v$ inside the LLM), while simultaneously maximizing the probability of the target response $p(\mathbf{X}_a \mid \mathbf{X}_v, \mathbf{X}_{\texttt{instruct}})$ as described in Equation (2).

Visual Context Compressor: A key design change that we introduce is a compressor layer that compresses the visual inputs by reducing the effective number of visual tokens. As depicted in Fig. 2, the compressor is simply an average pooler in our setting. It is applied to the visual tokens at the $k$-th Transformer layer of an LLM. Formally, given the hidden visual tokens at the $k$-th Transformer layer $\mathbf{H}_k \in \mathbb{R}^{B \times C \times L}$, the compressor is expected to fulfill the projection $f: \mathbb{R}^{B \times C \times L} \mapsto \mathbb{R}^{B \times C \times L_{\text{out}}}$, which results in compressed visual tokens $\tilde{\mathbf{H}}_k \in \mathbb{R}^{B \times C \times L_{\text{out}}}$, where $L_{\text{out}} = \frac{L}{S}$ and $S$ is the compression stride. In § 4, we explore multiple variants of the compressor $f$ to reduce the token length, including random token dropping (he2022masked) with dropping ratio $1 - \frac{1}{S}$, K-Means (kanungo2002efficient) with the number of centroids set to $N_C = \frac{L}{S}$, attention-based token-centric compression (zeng2024vcc), attention-based token dropping (chen2024image; hou2022token), and average pooling with stride $S$. To our surprise, we find that the simple average pooler is the most effective compressor for visual tokens within MLLMs, owing to its stability during training, as detailed in § 4.4. Thus, we choose the average pooler as the compressor.
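To make the pooling step concrete, here is a minimal sketch of 1D average pooling over the token axis for a single sample, written in plain Python rather than a tensor library. It assumes a kernel size equal to the stride $S$ and a token count divisible by $S$; the paper does not state the kernel size, so both are assumptions of this example, and `average_pool_tokens` is a hypothetical name.

```python
def average_pool_tokens(tokens, stride):
    """1D average pooling over the token axis with kernel size == stride.

    tokens: list of L vectors (each a list of C floats) for one sample.
    Returns L // stride pooled vectors; assumes L is divisible by stride.
    """
    if len(tokens) % stride != 0:
        raise ValueError("token count must be divisible by the stride")
    pooled = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + stride]
        channels = len(window[0])
        # Average each channel over the tokens in the window.
        pooled.append([sum(vec[c] for vec in window) / stride
                       for c in range(channels)])
    return pooled


# Toy example: L = 8 tokens with C = 2 channels, stride S = 4.
tokens = [[float(i), float(2 * i)] for i in range(8)]
compressed = average_pool_tokens(tokens, stride=4)
print(len(compressed), compressed)
```

In an MLLM, only the visual-token positions of the hidden states would be pooled, with text tokens left untouched.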

Note that the proposed Visual Context Compressor can be directly applied to any off-the-shelf MLLMs to assess the visual redundancy, as conducted in §4.2. One can also train an MLLM with Visual Context Compressor to reduce the number of visual tokens while maintaining competitive multi-modal performance.

Figure 2: Example of Visual Context Compressor in a multi-modal LLM.

Compression Ratio (CR). Following the definition of compression ratio from Wikipedia, for an LLM with $N$ Transformer decoder layers, the compression ratio for visual tokens can be calculated as:

\begin{equation}
\text{CR} = \frac{N \cdot L}{(N - K) \cdot L_{\text{out}} + K \cdot L},
\tag{3}
\end{equation}

where $K$ is the index of the Transformer layer of the multi-modal LLM at which Visual Context Compressor is applied, $L$ is the length of the visual tokens input into Visual Context Compressor, and $L_{\text{out}}$ is the compressed length of the visual tokens generated by Visual Context Compressor, as illustrated in Fig. 2.
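As a quick check of Eq. (3), the sketch below evaluates the compression ratio for the layer and stride settings listed in Tab. 1. It assumes $N = 32$ decoder layers (the depth of a 7B LLaMA-style model, which the text does not state explicitly) and $L = 576$ visual tokens as given in § 4.1; the function name is hypothetical.

```python
def compression_ratio(num_layers, layer_k, num_tokens, stride):
    """Compression ratio from Eq. (3), returned as a fraction (1.0 == 100%)."""
    tokens_out = num_tokens // stride  # L_out = L / S
    compressed = (num_layers - layer_k) * tokens_out + layer_k * num_tokens
    return (num_layers * num_tokens) / compressed


N, L = 32, 576  # N = 32 is an assumption for a 7B LLaMA-style decoder
for k, s in [(2, 8), (16, 8), (2, 2), (16, 2)]:
    print(f"K={k:2d}, S={s}: CR = {compression_ratio(N, k, L, s):.0%}")
```

Under these assumptions the four settings evaluate to 557%, 178%, 188%, and 133%, matching the CR column of Tab. 1.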

Our architecture modifications thus far mostly impact the inference efficiency of the MLLM; however, their impact on the performance-compression trade-off remains unclear. We study this question in the context of training MLLMs, with the goal of enhancing efficiency without compromising performance. We then use Visual Context Compressor to design an efficient training scheme that incorporates the compressor at various stages of the training process.

3.3 LLaVolta as a Lite Training Scheme

Training with Visual Context Compressor not only facilitates efficient inference but also enhances training efficiency. However, devising an effective training scheme is challenging when a fair comparison with the original LLaVA (liu2023improvedllava) is required, primarily because the number of tokens involved in inference differs. This discrepancy may lead to information loss, particularly under a high compression ratio. To tackle this issue, we have developed a lite training scheme for LLaVA, dubbed LLaVolta, which employs stage-wise visual context compression. Generally, assuming there are $N_s$ total stages, stage $i$ takes up $\frac{1}{N_s}$ of the total training epochs with a compression ratio of $r_i$, and the final stage proceeds without any compression. Essentially, as training progresses, $i$ increases while $r_i$ decreases.

In this work, as depicted in Fig. 3, we primarily explore a three-stage training pipeline that progressively reduces the compression ratio, as detailed below:

Training Stage I: Heavy Compression. The MLLM training at the first one-third of the total training iterations commences with a heavy compression ratio (> 500%), where Visual Context Compressor is applied in an early layer of the LLM with a large pooling stride. This setup enables a very fast training speed.

Training Stage II: Light Compression. The MLLM continues training with another one-third of the total training epochs. At this stage, Visual Context Compressor is applied at only the deeper layers of the LLM with a smaller pooling stride compared to Training Stage I.

Training Stage III: No Compression. The MLLM continues training with the final one-third of the total training epochs, following the standard MLLM training protocol without compression. Disabling compression in the final stage ensures that the number of tokens remains consistent with the original MLLM during inference, avoiding the loss of information caused by the reduction of visual tokens.

Given the above meta framework, we can instantiate a family of training schemes, as demonstrated in Tab. 1. The single-stage (non-compression) scheme is equivalent to the MLLM baseline. For multi-stage training, the compression stage can go either deeper or wider: “deeper” implies an increase in $K$ (the Transformer layer at which the compressor is applied), while “wider” means a decrease in the stride of the pooler.

Figure 3: Meta framework of LLaVolta, consisting of multiple training stages: Stage I with heavy visual compression; Stage II with light visual compression in a deeper layer with a wider token window; Stage III with standard MLLM training (as there is also no compression in standard inference). This can accelerate the training by 16+% while maintaining performance.
\begin{tabular}{llccccc}
\hline
\#Stages & Scheme & Stage & Layer & Stride & CR & \#Epoch \\
\hline
Single & no compression & $S1$ & / & / & 100\% & 1 \\
\hline
Two & compression & $S1$ & 2 & 8 & 557\% & 0.5 \\
 & & $S2$ & / & / & 100\% & 0.5 \\
\hline
Three & compr. deeper & $S1$ & 2 & 8 & 557\% & 0.33 \\
 & & $S2$ & 16 & 8 & 178\% & 0.33 \\
 & & $S3$ & / & / & 100\% & 0.33 \\
\hline
Three & compr. wider & $S1$ & 2 & 8 & 557\% & 0.33 \\
 & & $S2$ & 2 & 2 & 188\% & 0.33 \\
 & & $S3$ & / & / & 100\% & 0.33 \\
\hline
Four & wider then deeper & $S1$ & 2 & 8 & 557\% & 0.25 \\
 & & $S2$ & 2 & 2 & 188\% & 0.25 \\
 & & $S3$ & 16 & 2 & 133\% & 0.25 \\
 & & $S4$ & / & / & 100\% & 0.25 \\
\hline
Four & deeper then wider & $S1$ & 2 & 8 & 557\% & 0.25 \\
 & & $S2$ & 16 & 8 & 178\% & 0.25 \\
 & & $S3$ & 16 & 2 & 133\% & 0.25 \\
 & & $S4$ & / & / & 100\% & 0.25 \\
\hline
\end{tabular}
Table 1: Instantiations of LLaVolta schemes. deeper indicates that the compressor’s position in the LLM shifts from the shallow layer (e.g., 2) to a deeper layer (e.g., 16). wider indicates that the compressor’s stride decreases while the number of visual tokens increases.

Note that all training schemes will be standardized to complete just one epoch. Thus, in the three-stage training, each stage will receive one third of an epoch, while in the four-stage training, each stage will receive one fourth of an epoch. Effects of non-uniform stage splitting are presented in the Appendix.
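The stage-wise schedule described above can be sketched as a lookup from training progress to the active compressor setting. This is a minimal illustration assuming equally sized stages within one epoch; the function name, the `(layer, stride)` tuple encoding, and the use of `None` for "no compression" are hypothetical choices for this example.

```python
def stage_for_progress(progress, stages):
    """Return the (layer, stride) setting active at a given training progress.

    progress: fraction of the single epoch completed, in [0, 1].
    stages: list of (layer, stride) tuples, one per equally sized stage;
            None means no compression.
    """
    index = min(int(progress * len(stages)), len(stages) - 1)
    return stages[index]


# "Four, wider then deeper" scheme from Tab. 1.
WIDER_THEN_DEEPER = [(2, 8), (2, 2), (16, 2), None]

for p in (0.0, 0.3, 0.6, 0.9):
    print(p, stage_for_progress(p, WIDER_THEN_DEEPER))
```

A training loop would query this once per step and enable, reconfigure, or disable the pooler accordingly, so that the final stage always runs on the full set of visual tokens.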

4 Experiments

In this section, we begin by detailing the experimental setup in § 4.1. Next, we elaborate on the proof-of-concept in § 4.2. Following this, we validate the proposed LLaVolta in § 4.3 with an ablation study in § 4.4. Finally, we assess the extensibility to video-language understanding in § 4.5.

4.1 Experimental Setup

We adopt Vicuna-v1.5-7B (chiang2023vicuna) as the language model, leveraging the LLaMA2 codebase (touvron2023llama). We use the pre-trained CLIP ViT-L/14 (radford2021learning; dosovitskiy2020image) with an input resolution of $336 \times 336$, resulting in $576$ visual tokens (a $24 \times 24$ grid of $14 \times 14$ patches). We employ the LLaVA framework (liu2023improvedllava) to connect the frozen CLIP vision encoder and the Vicuna LLM. Along with the projector, we train the entire LLM instead of using parameter-efficient finetuning. We follow LLaVA-1.5 (liu2023improvedllava) for data preparation and the training schedule in both pretraining and instruction tuning. We conduct all experiments on a machine with 8$\times$ NVIDIA RTX 6000 Ada GPUs. Due to multiple invalid image links in the instruction-tuning dataset, the LLaVA-1.5 scores reported in our analysis are reproduced by ourselves to ensure a fair comparison under the same experimental environment.

It is worth mentioning that assessing visual token redundancy only necessitates the inference of existing off-the-shelf models, whereas the other experiments involve the training of multi-modal LLMs, specifically projectors and LLMs.

Benchmarks and Metrics: We adopt thirteen benchmarks specifically designed for MLLM evaluation, including GQA (hudson2019gqa), MM-Vet (yu2023mm), ScienceQA (SQA) (lu2022learn), MME (fu2023mme), TextVQA (singh2019towards), POPE (li2023evaluating), MMBench (liu2023mmbench), MMBench-CN (liu2023mmbench), VQA-v2 (goyal2017making), LLaVA-Bench-in-the-Wild (LLaVA$^{\text{W}}$) (liu2024visual), VizWiz (gurari2018vizwiz), SEED-Image (li2023seed) and MMMU (yue2024mmmu). GQA and VQA-v2 evaluate the model's visual perception capabilities on open-ended short answers. MME-Perception evaluates the model's visual perception with yes/no questions. ScienceQA, with multiple-choice questions, is used to evaluate zero-shot generalization on scientific question answering. TextVQA contains text-rich visual question answering. MMBench and its CN version evaluate the robustness of a model's answers with all-round shuffling of multiple-choice answers. MM-Vet evaluates a model's capabilities in engaging in visual conversations. Additionally, we extend LLaVolta to video-language understanding, and follow Video-LLaVA (lin2023video) to evaluate the models on MSVD-QA (chen2011collecting), MSRVTT-QA (xu2016msr) and ActivityNet-QA (yu2019activitynet), where the accuracy and score are assessed using GPT-Assistant.
We report the official metrics calculated using the standard implementations provided for each benchmark for a fair comparison. Latency is reported as the time taken during inference until the first answer token is produced. When reporting average performance in Table 2, the score of MME is divided by 2000, as its range is from 800 to 2000. TFLOPs are profiled via DeepSpeed. The total number of tokens is $\#\text{Tokens} = \sum_{i}^{N} \#\text{Token}^{i}$, i.e., the number of visual tokens summed over the $N$ Transformer layers. The training time is reported for one epoch of training during the LLaVA instruction-tuning stage. The Compression Ratio (CR) is defined as in Equation (3).
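The averaging convention can be illustrated with the single-stage baseline row of Table 2. The sketch below assumes that "divided by 2000" means the MME score is rescaled to a percentage of 2000 before averaging, which is an interpretation rather than something the text spells out; the function name and dictionary keys are hypothetical.

```python
def average_score(scores, mme_key="MME", mme_max=2000.0):
    """Mean over benchmarks, with MME rescaled to a percentage of mme_max."""
    normalized = [
        value / mme_max * 100.0 if name == mme_key else value
        for name, value in scores.items()
    ]
    return sum(normalized) / len(normalized)


# Baseline (single-stage, no compression) row of Table 2.
baseline = {
    "GQA": 62.6, "MM-Vet": 31.9, "SQA": 70.8, "MME": 1467, "TextVQA": 58.3,
    "POPE": 86.1, "MMBench": 65.3, "MMBench-CN": 59.4, "VQA-v2": 78.9,
    "LLaVA-W": 65.5, "VizWiz": 49.8, "SEED-Image": 66.7, "MMMU": 35.1,
}
print(f"{average_score(baseline):.1f}")
```

Under this interpretation the baseline average comes out at about 61.8, in line with the Avg. column of Table 2.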

4.2 Proof of Concept: Visual Context Redundancy

To assess the redundancy of visual tokens, we perform average pooling within an off-the-shelf LLaVA-1.5-7B checkpoint at the testing stage, using different pooling stride sizes $S$ across various Transformer layers $K$. As shown in Fig. 1, the model still exhibits strong performance even when retaining only 62.5% of the visual tokens ($S=4, K=16$) in the MM-Vet benchmark, without the need for additional training. Under the same setting ($S=4, K=16$), a similar trend can be observed in the GQA benchmark, where the compressed model shows only a 1% performance drop relative to the uncompressed counterpart. Surprisingly, in the GQA benchmark, eliminating up to 70% of visual tokens ($S=4, K=2$, i.e., pooling with a stride of 4 starting from Transformer layer 2, as in § 1) results in a mere 3% decrease in performance. This proof-of-concept shows a certain level of redundancy in the visual tokens within MLLMs.
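The 62.5% and roughly 70% figures can be reproduced by counting visual tokens layer by layer: the first $K$ layers see all tokens, and the remaining layers see $1/S$ of them. The sketch below assumes $N = 32$ decoder layers for the 7B language model, which is not stated in the text; the roughly 70% figure corresponds to a stride of 4 applied from layer 2, the setting named in the introduction.

```python
def retained_fraction(num_layers, layer_k, stride):
    """Fraction of layer-wise visual tokens kept when pooling from layer K on."""
    kept = layer_k + (num_layers - layer_k) / stride
    return kept / num_layers


N = 32  # assumed decoder depth of the 7B language model
print(retained_fraction(N, 16, 4))      # stride 4 from layer 16
print(1 - retained_fraction(N, 2, 4))   # fraction eliminated, stride 4 from layer 2
```

With these assumptions, pooling from layer 16 retains 62.5% of the tokens, while pooling from layer 2 eliminates about 70% of them.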

4.3 Main Results: LLaVolta

In this section, we present the main results of the LLaVolta schemes instantiated in § 3.3. We conduct a thorough evaluation of the multi-modal capability across 13 benchmarks. Tab. 2 demonstrates that our proposed LLaVolta not only consistently lowers training costs by 16% (15.3 hours vs. 12.8 hours) but also surpasses the non-compression baseline. The four-stage training scheme achieves the best performance in nine out of the thirteen benchmarks and obtains 61.9% average performance, improving on LLaVA-v1.5-7B (liu2023improvedllava) with far fewer overall TFLOPs and less training time. This indicates the necessity of designing an optimally lite training scheme.

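\begin{tabular}{llrrrr*{14}{c}}
\hline
\#Stages & Scheme & \#Tokens & CR & Test TFLOPs & Train Time & GQA & MMVet & SQA & MME & VQA\textsuperscript{T} & POPE & MMB & MMB\textsuperscript{CN} & VQA\textsuperscript{v2} & LLaVA\textsuperscript{W} & VizWiz & SEED\textsuperscript{I} & MMMU & Avg. \\
\hline
Single & no compression & 18432 & - & 8.26 & 15.3h & $62.6_{.49}$ & $31.9_{1}$ & $70.8_{.59}$ & $1467_{13}$ & $58.3_{.15}$ & $86.1_{.24}$ & $65.3_{.93}$ & $59.4_{.92}$ & $78.9_{.37}$ & $65.5_{.56}$ & $49.8_{.6}$ & $66.7_{.25}$ & $35.1_{.86}$ & $61.8_{.32}$ \\
Two & compression & 10062 & 183\% & 5.20 & 12.8h & $61.9_{.23}$ & $31.7_{1.5}$ & $70.9_{.34}$ & $1480_{23}$ & $58.3_{.46}$ & $86.5_{.33}$ & $64.8_{.23}$ & $59.0_{1.1}$ & $78.5_{.20}$ & $67.3_{.91}$ & $47.2_{1.8}$ & $64.9_{.17}$ & $34.9_{.11}$ & $61.5_{.40}$ \\
Three & compr. deeper & 10597 & 174\% & 5.13 & 12.8h & $62.1_{.01}$ & $30.5_{.40}$ & $70.5_{.23}$ & $1477_{13}$ & $58.4_{.07}$ & $86.6_{.14}$ & $65.6_{.26}$ & $59.9_{.27}$ & $78.5_{.22}$ & $67.5_{1.4}$ & $49.2_{.56}$ & $65.9_{.17}$ & $35.0_{.19}$ & $61.8_{.10}$ \\
Three & compr. wider & 10407 & 177\% & 3.93 & 12.8h & $61.1_{1.6}$ & $31.8_{.61}$ & $71.0_{.28}$ & $1434_{12}$ & $58.5_{.04}$ & $86.6_{.06}$ & $64.8_{.23}$ & $59.1_{.83}$ & $78.7_{.02}$ & $64.3_{4.8}$ & $49.8_{1.1}$ & $65.3_{.04}$ & $34.3_{.75}$ & $61.3_{.28}$ \\
Four & wider then deeper & 11088 & 166\% & 5.39 & 12.9h & $62.1_{.09}$ & $31.6_{.58}$ & $71.4_{.36}$ & $1444_{15}$ & $58.7_{.24}$ & $86.8_{.21}$ & $65.3_{.30}$ & $59.3_{.26}$ & $78.8_{.05}$ & $67.7_{3.1}$ & $50.1_{.21}$ & $65.6_{.15}$ & $33.8_{.78}$ & $61.8_{.35}$ \\
Four & deeper then wider & 10863 & 170\% & 5.45 & 12.8h & $62.1_{.07}$ & $31.5_{.20}$ & $70.5_{.16}$ & $1472_{16}$ & $58.7_{.08}$ & $86.3_{.33}$ & $65.6_{.52}$ & $59.9_{.61}$ & $78.8_{.03}$ & $68.2_{2.1}$ & $48.3_{1.3}$ & $66.1_{.20}$ & $35.1_{.02}$ & $61.9_{.47}$ \\
\hline
\end{tabular}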
Table 2: Performance of LLaVolta. See the definition of each training scheme in Tab. 1. $\dagger$: average across stages. The five derived training schemes achieve competitive results while reducing training time by 16%. We report the average results across three runs, with the standard deviation written at the bottom right of the average result. The four-stage training achieves the highest performance in nine of thirteen benchmarks, outperforming the baseline (LLaVA-v1.5-7B) while requiring significantly fewer TFLOPs and less training time.
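The Avg. column can be checked with a short calculation. The sketch below assumes that the MME score, after being divided by 2000, is expressed as a percentage before averaging with the other twelve scores; the source only states the division, so the factor of 100 is our assumption. The helper name and the dictionary keys are hypothetical labels, and the values are the mean scores of the single-stage (no compression) row of Table 2.

```python
def average_score(scores, mme_key="MME", mme_max=2000.0):
    """Mean over benchmarks after mapping the MME score onto a 0-100 scale."""
    normalised = [
        100.0 * value / mme_max if name == mme_key else value
        for name, value in scores.items()
    ]
    return sum(normalised) / len(normalised)


# Mean scores of the no-compression baseline row in Table 2.
baseline = {
    "GQA": 62.6, "MMVet": 31.9, "SQA": 70.8, "MME": 1467,
    "VQA-T": 58.3, "POPE": 86.1, "MMB": 65.3, "MMB-CN": 59.4,
    "VQA-v2": 78.9, "LLaVA-W": 65.5, "VizWiz": 49.8, "SEED-I": 66.7,
    "MMMU": 35.1,
}

if __name__ == "__main__":
    print(round(average_score(baseline), 1))
```

Under this reading the baseline row averages to 61.8, which agrees with the reported Avg. value.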

4.4 Ablation Study

In this section, we perform an ablation study on the choice of visual compressors by comparing different compression methods. Additionally, we examine the effects of varying the stride and the LLM layer when training with the Visual Context Compressor.

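\begin{tabular}{lrr*{14}{c}}
\hline
Compressor & \#Tokens & CR & GQA & MM-Vet & SQA & MME & VQA\textsuperscript{T} & POPE & MMB & MMB\textsuperscript{CN} & VQA\textsuperscript{v2} & LLaVA\textsuperscript{W} & VizWiz & SEED\textsuperscript{I} & MMMU & Avg. \\
\hline
\multicolumn{17}{l}{Train without compression; Testing with compression} \\
Random Dropping & 3312 & 556\% & 50.6 & 21.4 & 69.3 & 1142 & 46.5 & 55.8 & 39.7 & 33.3 & 59.3 & 47.6 & 47.2 & 52.2 & 34.3 & 47.3 \\
K-Means & 3312 & 556\% & 54.4 & 25.9 & 69.7 & 1155 & 49.0 & 78.6 & 55.3 & 46.1 & 69.3 & 57.6 & 48.9 & 56.1 & 32.9 & 54.0 \\
FastV (chen2024image) & 3312 & 556\% & 52.1 & 30.6 & 69.4 & 1298 & 53.4 & 65.6 & 60.1 & 53.0 & 68.6 & 54.8 & 50.0 & 56.3 & 34.9 & 54.9 \\
VCC (zeng2024vcc) & 3582 & 514\% & 54.7 & 26.9 & 69.2 & 1246 & 49.2 & 72.3 & 60.8 & 52.0 & 68.1 & 55.6 & 47.8 & 57.0 & 34.8 & 54.7 \\
Average Pooling & 3312 & 556\% & 53.7 & 25.6 & 69.4 & 1150 & 47.7 & 70.1 & 56.4 & 46.5 & 67.0 & 55.6 & 50.0 & 55.7 & 34.3 & 53.0 \\
\hline
\multicolumn{17}{l}{Train with compression; Testing with compression} \\
Random Dropping & 3312 & 556\% & 53.4 & 25.0 & 69.4 & 1186 & 49.4 & 64.9 & 52.0 & 41.1 & 59.7 & 51.5 & 47.9 & 52.6 & 34.6 & 50.8 \\
K-Means & 3312 & 556\% & 57.5 & 25.9 & 55.6 & 1279 & 51.4 & 79.4 & 62.6 & 54.6 & 75.7 & 59 & 46.1 & 59.2 & 34.1 & 57.9 \\
FastV (chen2024image) & 3312 & 556\% & 55.9 & 27.9 & 70.4 & 1327 & 49.7 & 79.8 & 62.9 & 55.9 & 69.5 & 61.7 & 49.6 & 56.8 & 35.1 & 57.0 \\
VCC (zeng2024vcc) & 3582 & 514\% & 57.7 & 29.3 & 70.7 & 1398 & 53.0 & 83.6 & 65.0 & 55.8 & 74.1 & 58.0 & 48.2 & 60.1 & 35.0 & 58.5 \\
Average Pooling & 3312 & 556\% & 60.0 & 30.7 & 70.8 & 1450 & 55.1 & 85.5 & 65.0 & 59.5 & 75.9 & 66.9 & 46.4 & 62.6 & 33.8 & 60.4 \\
\hline
\end{tabular}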
Table 3: Comparison among different visual compressors. Higher values are preferred. All methods except VCC are set to a compression ratio of 556% to approximate VCC’s 514% (zeng2024vcc) for a fair comparison. The best scores are marked as gray and the second best are underlined. Attention-based compressors (i.e., FastV and VCC) excel during the inference phase, yet their application to the training phase proves challenging. Average pooling shows more stable performance during the training phase.

Choice of Visual Compressors. The design choices include (1) random token dropping, (2) K-Means clustering, (3) average pooling, (4) FastV (chen2024image), (5) VCC (hou2022token), and (6) a parametric pre-trained Q-Former (li2023blip). We make the following three observations. First, Tab. 3 shows that the attention-based methods, FastV and VCC, win 9 of the 13 best and second-best scores, showcasing their high performance when compressing visual tokens at inference. However, they are ineffective when applied to training because the in-training attention scores are unstable. Second, and surprisingly, average pooling obtains the highest scores on eleven out of thirteen benchmarks when it is used to train MLLMs with a high CR. Third, Tab. 4 shows that both Q-Former and average pooling can obtain reasonably good performance when trained with extremely high CRs, and average pooling performs better with less training cost. The reason could be that the Q-Former resamples tokens outside the LLM, potentially causing the LLM to overlook crucial information relevant to the response. In contrast, our approach employs average pooling after Transformer layer $K$, allowing the initial $K$ layers of the LLM to effectively retain important information from uncompressed tokens. Given these three insights, we select average pooling as our favored approach for visual compression.

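\begin{tabular}{lrrrr*{14}{c}}
\hline
Method & \#Param & \#Tokens & CR & Train Time & GQA & MMVet & SQA & MME & VQA\textsuperscript{T} & POPE & MMB & MMB\textsuperscript{CN} & VQA\textsuperscript{v2} & LLaVA\textsuperscript{W} & VizWiz & SEED\textsuperscript{I} & MMMU & Avg. \\
\hline
Q-Former (li2023blip) & 105M & 1024 & 1800\% & 10.4h & 55.7 & 26.4 & 69.3 & 1217 & 49.2 & 83.0 & 57.7 & 50.7 & 71.4 & 64.6 & 52.6 & 55.1 & 34.0 & 56.2 \\
Ours & 0 & 855 & 2156\% & 9.2h & 55.9 & 26.3 & 71.0 & 1321 & 51.6 & 82.5 & 63.3 & 55.9 & 74.5 & 63.1 & 47.8 & 57.3 & 35.7 & 57.8 \\
\hline
\end{tabular}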
Table 4: Parametric vs. nonparametric visual compressor. We follow MiniGPT-4 (zhu2023minigpt), which uses a Q-Former pre-trained from BLIP-2 (li2023blip) as the parametric compressor (all other aspects are maintained as in LLaVA to ensure a fair comparison). Ours: pooling with stride 64 on LLM layer 1 to ensure comparable CRs. Our nonparametric compressor outperforms the parametric Q-Former counterpart in terms of both average performance and training efficiency.

Performance Across Compression Ratios. Herein, we train the multi-modal LLM with our Visual Context Compressor in various settings. As demonstrated in Tab. 5, the proposed method offers certain improvements and trade-offs compared to the state-of-the-art method, LLaVA-1.5-7B. We make the following two observations. First, at the heavy compression level, the performance of the MLLM is inversely proportional to the compression ratio (scaling linearly with the number of visual tokens). Second, the performance of MLLMs at the light compression level does not correlate directly with the number of visual tokens, which is somewhat unexpected. We attribute this to MLLMs at this level of compression being relatively insensitive to changes in the compression ratio. This indicates that training MLLMs at a light compression level does not hurt model performance. For instance, the setting of stride 16 in light compression level attains a 188% CR and also outperforms the baseline LLaVA-v1.5-7B across all four metrics. The above observations pave the way for developing a more systematic training scheme.

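\begin{tabular}{lrrrrr*{14}{c}}
\hline
Stride & \#Tokens & CR & Latency & TFLOPs & Train time & GQA & MMVet & SQA & MME & VQA\textsuperscript{T} & POPE & MMB & MMB\textsuperscript{CN} & VQA\textsuperscript{v2} & LLaVA\textsuperscript{W} & VizWiz & SEED\textsuperscript{I} & MMMU & Avg. \\
\hline
\multicolumn{20}{l}{Heavy compression in LLM layer 2} \\
8 & 3312 & 557\% & 37.9ms & 2.14 & 12.0h & $59.9_{.13}$ & $30.1_{.92}$ & $70.9_{.17}$ & $1443_{11}$ & $55.3_{.3}$ & $85.3_{.21}$ & $65.2_{.25}$ & $59.5_{.06}$ & $76.0_{.09}$ & $65.9_{2.0}$ & $46.6_{.2}$ & $62.6_{.0}$ & $34.2_{.54}$ & $60.3_{.2}$ \\
2 & 9792 & 188\% & 48.6ms & 4.77 & 12.6h & $61.9_{.43}$ & $30.9_{1.1}$ & $71.6_{.69}$ & $1450_{18}$ & $57.6_{.08}$ & $86.3_{.22}$ & $67.2_{.05}$ & $59.9_{.4}$ & $78.0_{.17}$ & $66.4_{.85}$ & $48.7_{.25}$ & $65.9_{.49}$ & $34.1_{.34}$ & $61.6_{.08}$ \\
\hline
\multicolumn{20}{l}{Light compression in LLM layer 16} \\
8 & 10368 & 178\% & 51.3ms & 5.00 & 12.8h & $62.6_{.03}$ & $30.4_{.54}$ & $71.1_{.27}$ & $1462_{9}$ & $58.2_{.01}$ & $86.0_{.09}$ & $65.3_{.52}$ & $58.9_{.57}$ & $78.8_{.12}$ & $63.9_{1.1}$ & $51.4_{.15}$ & $66.8_{.23}$ & $35.8_{1.4}$ & $61.8_{.04}$ \\
2 & 13824 & 133\% & 58.8ms & 6.40 & 14.2h & $61.9_{.45}$ & $31.5_{1.0}$ & $70.8_{.49}$ & $1462_{24}$ & $58.5_{.02}$ & $86.4_{.12}$ & $66.4_{.33}$ & $59.6_{.47}$ & $78.9_{.02}$ & $65.3_{.46}$ & $49.5_{.97}$ & $66.7_{.23}$ & $35.1_{.87}$ & $61.8_{.01}$ \\
\hline
Base (liu2023improvedllava) & 18432 & 100\% & 68.5ms & 8.26 & 15.3h & $62.6_{.49}$ & $31.9_{1.0}$ & $70.8_{.59}$ & $1467_{13}$ & $58.3_{.15}$ & $86.1_{.24}$ & $65.3_{.93}$ & $59.4_{.92}$ & $78.9_{.37}$ & $65.5_{.56}$ & $49.8_{.6}$ & $66.7_{.25}$ & $35.1_{.86}$ & $61.8_{.32}$ \\
\hline
\end{tabular}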
Table 5: Training MLLMs with Visual Context Compressor in various compression levels. We report the average results across three runs, with the standard deviation written at the bottom right of the average result. In the heavy compression range, the performance is inversely proportional to the compression ratio. In the light compression range, the performance is not sensitive to compression. Performance remains high for models at the light compression level.

Furthermore, we conduct an ablation study on the number of iterations in different stages (uniform vs. non-uniform stage splitting), which is detailed in the Appendix.

4.5 Extensibility to Video MLLMs

We extend our training scheme to VideoLLaVA (lin2023video), and the results in Tab. 6 reveal similar findings as before: the proposed training scheme achieves competitive results while reducing training time by 9%. It is worth mentioning that VideoLLaVA does not support DeepSpeed ZeRO-3, unlike LLaVA, which results in different relative efficiency gains.

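\begin{tabular}{llrrrr*{8}{c}}
\hline
\#Stages & Scheme & \#Tokens & CR & TFLOPs & Train-time & \multicolumn{2}{c}{MSVD-QA} & \multicolumn{2}{c}{MSRVTT-QA} & \multicolumn{2}{c}{ActivityNet-QA} & \multicolumn{2}{c}{Average} \\
 & & & & & & Score & Acc & Score & Acc & Score & Acc & Score & Acc \\
\hline
Single & no compression & 147456 & - & 29.68 & 40.7h & 3.69 & 69.1 & 3.48 & 56.8 & 3.28 & 47.5 & 3.48 & 57.8 \\
Two & compression & 80496 & 183\% & 17.73 & 37.1h & 3.71 & 69.0 & 3.50 & 56.9 & 3.29 & 47.9 & 3.50 & 57.9 \\
Three & compr. deeper & 84776 & 174\% & 17.29 & 37.1h & 3.73 & 69.3 & 3.51 & 57.2 & 3.28 & 47.4 & 3.51 & 58.0 \\
Three & compr. wider & 83256 & 177\% & 16.86 & 37.0h & 3.72 & 69.0 & 3.51 & 57.2 & 3.29 & 47.7 & 3.51 & 58.0 \\
Four & wider then deeper & 88704 & 166\% & 18.32 & 37.2h & 3.72 & 69.1 & 3.51 & 57.2 & 3.27 & 48.0 & 3.50 & 58.1 \\
Four & deeper then wider & 86904 & 170\% & 18.64 & 37.1h & 3.74 & 69.8 & 3.49 & 56.9 & 3.27 & 47.8 & 3.50 & 58.2 \\
\hline
\end{tabular}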
Table 6: Performance of LLaVolta on VideoLLaVA (lin2023video). See the definition of each training scheme in Tab. 1. $\dagger$: average across stages. To implement our multi-stage training, we apply the same compression processing to each of the 8 frames representing the video. The five derived training schemes achieve competitive results while reducing training time by 9%.

5 Conclusion

In this work, we conduct two initial studies to investigate and verify the redundancy of visual tokens in multi-modal LLMs. To address this, we propose Visual Context Compressor, a straightforward yet effective compression technique that employs a simple average pooler and integrates seamlessly into the training of MLLMs. This approach enhances training efficiency without compromising performance. To further mitigate the information loss brought by token compression, we introduce LLaVolta, a multi-stage training scheme that utilizes Visual Context Compressor with a progressively decreasing compression rate. Experimental results on various visual question answering benchmarks verify the effectiveness of LLaVolta in boosting performance while also demonstrating efficiency gains by reducing training costs by 16%. To the best of our knowledge, we are the first to accelerate the training of multi-modal LLMs from the compression perspective. We hope that the proposed Visual Context Compressor and LLaVolta will inspire more in-depth analysis of the visual redundancy in current MLLMs and call for future designs of efficient training for MLLMs.

\printbibliography[heading=bibintoc]

Appendix

In the appendix, we provide additional information as listed below:

  • § A provides the additional experimental results.

Appendix A Additional Experimental Results

A.1 Non-uniform Stage Splitting

By default, the training time is evenly divided across each stage. To explore how the compression stage affects total training time, we modify the relative proportion of different stages. This variation is tested in the two-stage setup referenced in Tab. 1, adjusting from the standard 50% in Stage 1 and 50% in Stage 2 to different distributions. Tab. 7 below displays the results of these experiments.

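\begin{tabular}{rrrr*{8}{c}}
\hline
Stage 1 & Stage 2 & \#Tokens & CR & GQA & MMVet & SQA & MME & VQA\textsuperscript{T} & POPE & MMB & MMB\textsuperscript{CN} \\
\hline
0\% & 100\% & 18432 & - & 62.0 & 31.1 & 70.1 & 1453.0 & 58.2 & 85.9 & 64.3 & 58.3 \\
25\% & 75\% & 11088 & 166\% & 62.1 & 31.7 & 70.6 & 1474.5 & 58.8 & 86.4 & 65.1 & 59.6 \\
50\% & 50\% & 10863 & 170\% & 62.2 & 30.0 & 70.3 & 1443.5 & 57.5 & 85.8 & 64.8 & 59.7 \\
75\% & 25\% & 10597 & 174\% & 61.6 & 32.2 & 70.8 & 1471.5 & 57.5 & 86.6 & 65.2 & 58.9 \\
90\% & 10\% & 10407 & 177\% & 61.2 & 31.0 & 70.5 & 1447.5 & 56.3 & 86.4 & 64.4 & 56.9 \\
100\% & 0\% & 10062 & 183\% & 55.9 & 29.5 & 64.1 & 1257.8 & 49.1 & 86.6 & 47.4 & 29.2 \\
\hline
\end{tabular}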
Table 7: Effects of non-uniform stage splitting in the two-stage set-up. Performance decreases as the proportion of Stage 2 decreases, in exchange for higher compression ratios.

We observe that as the proportion of Stage 1 (the compression stage) increases from 0% to 100%, there is a gradual decrease in the model’s performance across various metrics (such as GQA, MMVet, SQA, MME, VQA, POPE, MMB, and MMBCN). The decline is relatively minor when the compression stage makes up at most 50% of the training duration. However, when the proportion of the compression stage rises above 50%, the decline in performance becomes more significant. In conclusion, keeping the compression stage between 0-50% of the training time minimizes performance loss while still achieving significant compression ratios.