
LLaVolta: Efficient Multi-modal Models
via Stage-wise Visual Context Compression

Jieneng Chen*, Luoxin Ye*, Ju He, Zhao-Yang Wang, Daniel Khashabi†, Alan Yuille† (*equally contributed; †equally advised)
Johns Hopkins University

Abstract

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained largely overlooked. In this work, we present a study of the redundancy of visual tokens and of efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simple average pooling leads to only a 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in the visual context. To address this, we introduce Visual Context Compressor, which reduces the number of visual tokens during training to enhance training efficiency without sacrificing performance. To minimize the information loss caused by compressing visual tokens while maintaining training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta incorporates stage-wise visual context compression that progressively compresses the visual tokens from heavily to lightly, and applies no compression at the end of training, so that no information is lost at test time. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs.

	Website	https://beckschen.github.io/llavolta.html
	Code	https://github.com/Beckschen/LLaVolta

1 Introduction

The advent of LLMs (chatgpt; gpt4; touvron2023llama2) has marked a new era in the field of artificial intelligence and natural language processing. LLMs can serve as a universal interface for a general-purpose assistant, where various task instructions can be explicitly represented in language and guide the end-to-end trained neural assistant to solve a task of interest. For example, the recent success of ChatGPT (chatgpt) and GPT-4 (gpt4) has demonstrated the power of aligned LLMs in following human instructions, and has stimulated tremendous interest in developing open-source LLMs (team2024gemma; touvron2023llama). As the horizon of LLM applications broadens and the availability of open-source LLMs increases, the integration of multi-modality into these models presents a new frontier in expanding their capabilities. Multi-modal LLMs (MLLMs) (alayrac2022flamingo; liu2024visual; team2023gemini; zhu2023minigpt), which can process and understand not just text but also visual information, stand at the cutting edge of this evolution.

Figure 1: Visual tokens are redundant in MLLMs. Left: The accuracy of the LLaVA-1.5-7B (liu2024visual) model on the GQA (hudson2019gqa) benchmark varies with the percentage of retained visual tokens. The $x$-axis represents the percentage of original visual tokens preserved after applying 1D average pooling with stride $S$ in the $i$-th Transformer layer. Right: Visual tokens receive less attention from the [ANS] token in the deeper layers of the LLaVA-1.5-7B model. These findings collectively suggest a significant redundancy within the visual tokens of MLLMs.

While MLLMs have made significant strides, a crucial aspect that remains relatively unexplored is the efficient representation and processing of visual information within these models. Substantial efforts (hou2022token; qin2023nugget; zeng2024vcc) have been dedicated to the efficient representation of text tokens through various compression techniques, aimed at enhancing inference efficiency by attentively selecting important tokens. However, the efficient learning of visual tokens in MLLMs has not garnered comparable attention. Naturally, this raises questions about the potential redundancy present in visual tokens and its implications for the overall computational efficiency of MLLMs.

We start our work by addressing the question: Are visual tokens redundant in multi-modal LLMs? To explore this, we first experiment with simply reducing the number of visual tokens in a pre-trained LLaVA-1.5-7B (liu2024visual) at the inference stage via average pooling (§3). As shown in Fig. 1 (left), our initial results demonstrate that eliminating up to 70% of visual tokens by pooling them with a stride of 4 starting from Transformer layer 2 incurs only a minimal performance loss on the GQA benchmark, specifically a 3% accuracy reduction. Additionally, we compute the average attention values from the [ANS] token to visual tokens and system prompt tokens across different Transformer layers in the pre-trained LLaVA-1.5-7B (liu2024visual). As revealed in Fig. 1 (right; blue trends), the visual tokens receive less attention, measured as the average attention from the [ANS] token, as the layers get deeper. These two early explorations indicate significant redundancy in visual tokens.

To address this, we develop an effective Visual Context Compressor that can be integrated into the training of MLLMs. Surprisingly, a simple average pooler nested in the LLM stands out as the most effective compressor, outperforming the attention-based (hou2022token; zeng2024vcc) and parametric (li2023blip) counterparts. We attribute this to two reasons: (1) The simple pooling operation makes training stable, whereas prior attention-based approaches (hou2022token; zeng2024vcc) are specifically designed for accelerating inference rather than training. (2) Visual tokens in the deeper Transformer layers receive less attention (see Fig. 1 (right)) and are particularly redundant, making a simple compressor placed in a deeper Transformer layer effective enough. At a lower training cost, the LLaVA-1.5-7B (liu2024visual) trained with the proposed Visual Context Compressor is competitive with the non-compressed baseline across various multi-modal benchmarks (e.g., GQA (hudson2019gqa) and MM-Vet (yu2023mm)). This combination of lower training cost and competitive accuracy highlights the value of Visual Context Compressor for improving the efficiency of MLLMs across multi-modal question-answering benchmarks.

To further mitigate the information loss caused by compressing visual tokens, especially under a large compression ratio (CR), we have devised a LLaVA-powered lite training scheme, dubbed LLaVolta, which progressively employs Visual Context Compressor at multiple training stages with different compression ratios (§3.3). Specifically, LLaVolta progresses through several stages, beginning with a high level of visual token compression and gradually reducing the compression ratio until the final stages, where full visual tokens are utilized. This multi-stage approach allows for adaptive compression levels that ensure training efficiency without losing information at testing, thus maintaining the overall effectiveness of the model.

Extensive experimental evaluations of LLaVolta have been conducted on thirteen widely adopted MLLM benchmarks for both image-language understanding and video-language understanding, showing promising results. We observe that LLaVolta not only enhances the performance of MLLMs, but also achieves a substantial reduction in training costs. These experiments validate the effectiveness of our method, demonstrating its capability to optimize resource utilization while maintaining or even improving model performance.

In summary, our paper makes the following contributions:

• We present two initial studies to verify the redundancy of visual tokens in MLLMs.

• We propose the Visual Context Compressor, a simple yet effective compression technique that utilizes an average pooler, enhancing the efficiency of multi-modal models.

• We propose LLaVolta as an efficient training scheme that leverages Visual Context Compressor at multiple training stages with a progressively decreasing compression ratio. To the best of our knowledge, we are among the first to explore efficient training of MLLMs.

• Extensive experiments show that our approach not only improves the performance of MLLMs in image-language and video-language understanding across various benchmarks but also showcases efficiency gains by reducing training costs by 16%.

2 Related Works

Multi-modal LLMs. The evolution of large language models (gpt4; chatgpt; chiang2023vicuna) into their multi-modal counterparts (team2023gemini; liu2024visual) represents a significant leap in their ability to follow instructions and generalize across tasks. This transition has been marked by seminal works such as Flamingo (alayrac2022flamingo), BLIP-2 (li2023blip) and LLaVA (liu2024visual), which have extended LLM capabilities to encompass visual tasks, demonstrating impressive zero-shot generalization and in-context learning abilities. Progress in multi-modal LLMs has primarily been driven by advancements in visual instruction tuning (liu2024visual; zhu2023minigpt), leveraging vision-language datasets and refining visual instruction-following data. Additionally, efforts have been made to enhance the grounding capabilities of multi-modal LLMs through the use of specialized datasets aimed at improving task-specific performance. Despite these advancements, the exploration of visual compression within multi-modal LLMs remains relatively underdeveloped. The design and optimization of compression strategies are crucial for maximizing the effectiveness and efficiency of multi-modal LLMs, suggesting a potential area for future research and development.

Visual Redundancy. In computer vision, reducing redundancy is crucial for creating efficient yet effective models without losing accuracy (barlow2001redundancy). Redundancy in images often arises from the inherent characteristics of natural scenes, including repetitive patterns, textures, and areas of uniform color. These features, while contributing to the richness and detail of visual perception, can lead to inefficiencies in both storage and processing when not adequately addressed. Image compression algorithms (wallace1992jpeg) can reduce file size by eliminating or efficiently encoding redundant data. These methods take advantage of the tolerances of human visual perception to subtly reduce data without significantly impacting image quality. Advanced machine learning models, particularly CNNs and autoencoders (baldi2012autoencoders), offer sophisticated approaches to minimizing redundancy. Transformers (vaswani2017attention), as a fundamental architecture for LLMs (chiang2023vicuna; gpt4), apply self-attention mechanisms to dynamically bind the most informative parts of tokens. Vision Transformers (chen2024vitamin; chen2022transmix; dosovitskiy2020image; he2022transfg) trained with the CLIP objective (chen2024vitamin; radford2021learning) encode an image into a sequence of visual features for multi-modal LLMs (liu2024visual). Nevertheless, visual tokens receive less attention in LLMs due to attention shrinkage (xiao2023efficient), resulting in a waste of computation. In this work, we focus on reducing the redundancy of visual tokens in MLLMs.

Efficient LLMs. Efficient inference and training for LLMs are important. Compressing input sequences for efficiency in Transformers is not a new idea in NLP, and much work has been done to accelerate the inference of LMs. For example, Pyramid Transformer variants (dai2020funnel; huang2022pyramid) are proposed for Encoder-Decoder LMs; they progressively compress the sequence as the layers grow deeper via pooling or core-set selection. Nawrot et al. (nawrot2022efficient) propose adaptively compressing the sequence based on the predicted semantic boundaries within the sequence. Rae et al. (rae2019compressive) propose compressing the fine-grained past activations into coarser memories. VCC (zeng2024vcc) compresses the sequence into a much smaller representation at each layer by prioritizing important tokens. Besides efficient inference, accelerating training for LLMs attracts attention as well. A staged training setup (shen2022staged) begins with a small model and incrementally increases the amount of compute used for training by applying a growth operator to increase the model depth and width. However, efficient training for LLMs in multi-modal scenarios is rarely explored.

3 Method

In this section, we first introduce an overview of multi-modal LLMs in § 3.1. Then, we define the problem of visual redundancy and introduce Visual Context Compressor in § 3.2. Finally, we present our proposed LLaVolta in § 3.3.

3.1 Preliminaries: A Multi-modal LLM

We start by reviewing the design of the LLaVA family (liu2024visual; liu2023improvedllava). For processing an input image $\mathbf{X}_{v}$ , we utilize the pre-trained CLIP visual encoder ViT-L/14, as detailed by (radford2021learning), to extract the visual feature $\mathbf{Z}_{v}=g(\mathbf{X}_{v})$ , where $g(.)$ indicates the visual encoder. To bridge the gap between visual and linguistic modalities, the LLaVA (liu2024visual; liu2023improvedllava) framework as an MLLM implements a straightforward linear/MLP transformation. This involves a trainable projection matrix $\mathbf{W}$ , which maps the visual features $\mathbf{Z}_{v}$ into the linguistic embedding space, producing language embedding tokens $\mathbf{H}_{v}=\mathbf{W}\mathbf{Z}_{v}$ . These tokens are designed to match the dimensionality of the word embeddings within the LLM.

For each image $\mathbf{X}_{v}$ , one can generate multi-turn conversation data $(\mathbf{X}_{q}^{1},\mathbf{X}_{a}^{1},\cdots,\mathbf{X}_{q}^{T},\mathbf{X}_{a}^{T})$ with $T$ as the number of turns. One can organize them as a sequence, by treating all answers as the assistant’s response and the instruction $\mathbf{X}_{\texttt{instruct}}^{t}$ at the $t$ -th turn as:

$$\mathbf{X}_{\texttt{instruct}}^{t}=\begin{cases}\text{Random Choose}\ [\mathbf{X}_{q}^{1},\mathbf{X}_{v}]\ \text{or}\ [\mathbf{X}_{v},\mathbf{X}_{q}^{1}], & t=1\\ \mathbf{X}_{q}^{t}, & t>1\end{cases} \tag{1}$$

This approach establishes a standardized format for the multi-modal instruction-following sequence. It allows for the instruction-based tuning of the LLM to be applied to the prediction tokens, utilizing the model’s native auto-regressive training objective. Specifically, for a sequence with length $L$ , the likelihood of the target responses $\mathbf{X}_{a}$ is calculated as:

$$p(\mathbf{X}_{a}|\mathbf{X}_{v},\mathbf{X}_{\texttt{instruct}})=\prod_{i=1}^{L}p_{\theta}(x_{i}|\mathbf{X}_{v},\mathbf{X}_{\texttt{instruct},<i},\mathbf{X}_{a,<i}). \tag{2}$$
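Because Equation (2) is a product of per-token probabilities, it is commonly evaluated in log space as a sum of log-probabilities. The following is a minimal sketch of that arithmetic only; the function name `sequence_log_likelihood` and the toy probabilities are illustrative assumptions, not part of the LLaVA codebase.

```python
import math


def sequence_log_likelihood(token_probs):
    """Log of the product in Eq. (2): sum of log p(x_i | context) over answer tokens.

    token_probs: iterable of per-token conditional probabilities in (0, 1].
    (Hypothetical helper for illustration; a real model would output logits.)
    """
    return sum(math.log(p) for p in token_probs)


# Toy example: three answer tokens with made-up conditional probabilities.
probs = [0.9, 0.8, 0.5]
log_likelihood = sequence_log_likelihood(probs)
likelihood = math.exp(log_likelihood)  # equals 0.9 * 0.8 * 0.5 up to float error
```

Maximizing this quantity is equivalent to minimizing the usual token-level cross-entropy loss summed over the answer tokens.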

3.2 Visual Context Compressor

Problem Formulation: The redundancy observed in images often arises from inherent traits of natural scenes, including repetitive patterns, textures, and regions with uniform color. While these traits enrich visual perception by offering detail and depth, they can also present challenges in terms of storage and processing efficiency. Considering the inherent limitations of Transformers in handling long sequences (liu2023lost; anil2022exploring; ye2024analobench), it is critical to minimize any length redundancies to obtain a more effective accuracy/efficiency trade-off.

The objective of this study is to decrease the length of visual tokens $\mathbf{X}_{v}$ (i.e., its hidden states $\mathbf{H}_{v}$ if inside LLMs), while simultaneously maximizing the probability of the target response $p(\mathbf{X}_{a}|\mathbf{X}_{v},\mathbf{X}_{\texttt{instruct}})$ as described in Equation (2).

Visual Context Compressor: A key design change that we introduce is a compressor layer that shrinks the visual inputs by reducing the effective number of visual tokens. As depicted in Fig. 2, the compressor is simply an average pooler in our setting. It is applied to the visual tokens in the $k$ -th Transformer layer of an LLM. Formally, given the hidden visual tokens at the $k$ -th Transformer layer $\mathbf{H}_{k}\in\mathbb{R}^{B\times C\times L}$ , the compressor is expected to fulfill the following projection: $f:\mathbb{R}^{B\times C\times L}\mapsto\mathbb{R}^{B\times C\times L_{\text{out}}},$ which results in compressed visual tokens $\tilde{\mathbf{H}}_{k}\in\mathbb{R}^{B\times C\times L_{\text{out}}}$ , where $L_{\text{out}}=\frac{L}{S}$ with $S$ as the compression stride. In §4, we explore multiple variants of compressor $f$ to reduce the token length, including random token dropping (he2022masked) with dropping ratio $1-\frac{1}{S}$ , K-Means (kanungo2002efficient) with the number of centroids set to $N_{C}=\frac{L}{S}$ , attention-based token-centric compression (zeng2024vcc), attention-based token dropping (chen2024image; hou2022token), and average pooling with stride $S$ . To our surprise, we find that the simple average pooler is the most effective compressor for vision tokens within MLLMs, due to its stability during training, as detailed in § 4.4. Thus, we choose the average pooler as the compressor.

Note that the proposed Visual Context Compressor can be directly applied to any off-the-shelf MLLMs to assess the visual redundancy, as conducted in §4.2. One can also train an MLLM with Visual Context Compressor to reduce the number of visual tokens while maintaining competitive multi-modal performance.
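The average pooler described above can be sketched in a few lines. This is a minimal pure-Python illustration for a single sequence of visual tokens, not the authors' implementation: the function name `average_pool_tokens`, the list-of-vectors layout, and the choice to drop a trailing remainder are assumptions of this sketch. In a tensor library the same operation would typically be a 1D average pooling over the token axis with kernel size and stride both equal to $S$.

```python
def average_pool_tokens(tokens, stride):
    """Average-pool a sequence of token vectors with kernel size == stride.

    tokens: list of L vectors, each a list of C floats (visual tokens only;
            text tokens are assumed to be handled separately and left untouched).
    Returns L // stride pooled vectors. A trailing remainder is dropped, which
    is an assumption of this sketch (L = 576 divides evenly by strides 2, 4, 8).
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    pooled = []
    for start in range(0, len(tokens) - stride + 1, stride):
        window = tokens[start:start + stride]
        # Average each channel across the `stride` tokens in the window.
        pooled.append([sum(channel) / stride for channel in zip(*window)])
    return pooled


# Toy example: L = 8 tokens with C = 2 channels, pooled with stride 4.
tokens = [[float(i), float(2 * i)] for i in range(8)]
out = average_pool_tokens(tokens, stride=4)
```

With 576 visual tokens and a stride of 8, this reduces the visual sequence to 72 tokens from the chosen layer onward.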

Compression Ratio (CR). For an LLM with $N$ Transformer decoder layers, the compression ratio for visual tokens, following the definition of compression ratio from Wikipedia, can be calculated as:

$$\text{CR}=\frac{N\cdot L}{(N-K)\cdot L_{\text{out}}+K\cdot L}, \tag{3}$$

where $K$ is the index of the Transformer layer at which Visual Context Compressor is applied; $L$ is the length of visual tokens input into Visual Context Compressor; $L_{\text{out}}$ is the compressed length of visual tokens generated by Visual Context Compressor, as illustrated in Fig. 2.
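As a quick check of Equation (3), the sketch below evaluates the CR for the layer/stride pairs that appear in Tab. 1. It assumes $N=32$ decoder layers and $L=576$ visual tokens; $L=576$ is stated in § 4.1, while $N=32$ is inferred from the baseline token count of 18432 = 32 × 576 rather than stated explicitly. The function name `compression_ratio` is illustrative.

```python
def compression_ratio(num_layers, layer_k, num_tokens, stride):
    """Eq. (3): CR = N*L / ((N - K) * L_out + K * L), with L_out = L / S."""
    pooled_tokens = num_tokens // stride  # L_out
    total_uncompressed = num_layers * num_tokens
    total_compressed = (num_layers - layer_k) * pooled_tokens + layer_k * num_tokens
    return total_uncompressed / total_compressed


N, L = 32, 576  # assumed layer count; token count from the experimental setup
for layer_k, stride in [(2, 8), (16, 8), (2, 2), (16, 2)]:
    cr = compression_ratio(N, layer_k, L, stride)
    print(f"K={layer_k:2d}, S={stride}: CR = {cr:.0%}")
```

Under these assumptions the four settings evaluate to 557%, 178%, 188% and 133%, which agrees with the CR column of Tab. 1.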

Our architecture modifications thus far mostly impact the inference efficiency of MLLMs; however, their impact on the performance-compression trade-off remains unclear. We study this question in the context of training MLLMs, with the goal of enhancing efficiency without compromising performance. We then further utilize Visual Context Compressor to design an efficient training scheme that incorporates it at various stages of the training process.

3.3 LLaVolta as a Lite Training Scheme

Training with Visual Context Compressor not only facilitates efficient inference but also enhances training efficiency. However, devising an effective training scheme poses challenges when ensuring fair comparisons with the original LLaVA (liu2023improvedllava), primarily due to differences in the number of tokens involved in inference. This discrepancy may lead to information loss, particularly under a high compression ratio. To tackle this issue, we have developed a lite training scheme for LLaVA, dubbed LLaVolta, which employs stage-wise visual context compression. Generally, assuming there are $N_{s}$ total stages, stage $i$ involves $\frac{1}{N_{s}}$ of the total training epochs with a compression ratio of $r_{i}$ , and the final stage proceeds without any compression. Essentially, as training progresses, $i$ increases while $r_{i}$ decreases.

In this work, as depicted in Fig. 3, we primarily explore a three-stage training pipeline that progressively reduces the compression ratio, as detailed below:

Training Stage I: Heavy Compression. The MLLM training at the first one-third of the total training iterations commences with a heavy compression ratio (> 500%), where Visual Context Compressor is applied in an early layer of the LLM with a large pooling stride. This setup enables a very fast training speed.

Training Stage II: Light Compression. The MLLM continues training with another one-third of the total training epochs. At this stage, Visual Context Compressor is applied at only the deeper layers of the LLM with a smaller pooling stride compared to Training Stage I.

Training Stage III: No Compression. The MLLM continues training with the final one-third of the total training epochs, following the standard MLLM training protocol without compression. Disabling compression in the final stage ensures that the number of tokens remains consistent with the original MLLM during inference, avoiding the loss of information caused by the reduction of visual tokens.

Given the above meta framework, we can instantiate a family of training schemes, as demonstrated in Tab. 1. The single-stage (non-compression) scheme is equivalent to the MLLM baseline. For multi-stage training, the compression stage can either go deeper or wider. “deeper” implies an increase in $K$ (Transformer layer), while “wider” means a decrease in the stride of the pooler.

#Stages	Scheme	Stage	Layer	Stride	CR	#Epoch
Single	no compression	$S1$	/	/	100%	1
Two	compression	$S1$	2	8	557%	0.5
Two	compression	$S2$	/	/	100%	0.5
Three	compr. deeper	$S1$	2	8	557%	0.33
		$S2$	16	8	178%	0.33
		$S3$	/	/	100%	0.33
Three	compr. wider	$S1$	2	8	557%	0.33
		$S2$	2	2	188%	0.33
		$S3$	/	/	100%	0.33

#Stages	Scheme	Stage	Layer	Stride	CR	#Epoch
Four	wider then deeper	$S1$	2	8	557%	0.25
		$S2$	2	2	188%	0.25
		$S3$	16	2	133%	0.25
		$S4$	/	/	100%	0.25
Four	deeper then wider	$S1$	2	8	557%	0.25
		$S2$	16	8	178%	0.25
		$S3$	16	2	133%	0.25
		$S4$	/	/	100%	0.25

Table 1: Instantiations of LLaVolta schemes. deeper indicates that the compressor’s position in the LLM shifts from the shallow layer (e.g., 2) to a deeper layer (e.g., 16). wider indicates that the compressor’s stride decreases while the number of visual tokens increases.

Note that all training schemes will be standardized to complete just one epoch. Thus, in the three-stage training, each stage will receive one third of an epoch, while in the four-stage training, each stage will receive one fourth of an epoch. Effects of non-uniform stage splitting are presented in the Appendix.
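The stage-wise schedule can be expressed as a small lookup from training progress to compressor settings. The sketch below encodes the three-stage "compr. deeper" scheme from Tab. 1 with uniform stage lengths; the names `THREE_STAGE_DEEPER` and `stage_config` are hypothetical, and how the returned setting is wired into the model is left out.

```python
# Three-stage "compr. deeper" scheme from Tab. 1, one entry per stage:
# (Transformer layer K, pooling stride S); None means no compression.
THREE_STAGE_DEEPER = [(2, 8), (16, 8), None]


def stage_config(progress, stages=THREE_STAGE_DEEPER):
    """Return the compressor setting for a training progress in [0, 1].

    Each stage receives an equal share of the single training epoch,
    so stage i covers progress in [i / N_s, (i + 1) / N_s).
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    index = min(int(progress * len(stages)), len(stages) - 1)
    return stages[index]


early = stage_config(0.10)   # heavy compression: layer 2, stride 8
middle = stage_config(0.40)  # light compression: layer 16, stride 8
late = stage_config(0.90)    # no compression
```

A four-stage scheme from Tab. 1 can be obtained by passing a four-entry list, for example `[(2, 8), (2, 2), (16, 2), None]` for "wider then deeper".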

4 Experiments

In this section, we begin by detailing the experimental setup in § 4.1. Next, we elaborate on the proof of concept in § 4.2. Following this, we validate the proposed LLaVolta in § 4.3 with an ablation study in § 4.4. Finally, we assess the extensibility to video-language understanding in § 4.5.

4.1 Experimental Setup

We adopt Vicuna-v1.5-7B (chiang2023vicuna) as the language model, leveraging the LLaMA2 codebase (touvron2023llama). We leverage the pre-trained CLIP ViT-L/14 (radford2021learning; dosovitskiy2020image) with an input resolution of $336\times 336$ , resulting in $576$ visual tokens. We employ the LLaVA framework (liu2023improvedllava) to connect the frozen CLIP vision encoder and the Vicuna LLM. Along with the projector, we train the entire LLM instead of using parameter-efficient finetuning. We follow LLaVA-1.5 (liu2023improvedllava) for data preparation and the training schedule for pretraining and instruction tuning. We conduct all experiments on a machine with 8 $\times$ NVIDIA RTX 6000 Ada GPUs. Due to multiple invalid image links in the instruction-tuning dataset, the LLaVA-1.5 scores reported in our analysis are reproduced by ourselves to ensure a fair comparison under the same experimental environment.

It is worth mentioning that assessing visual token redundancy only necessitates the inference of existing off-the-shelf models, whereas the other experiments involve the training of multi-modal LLMs, specifically projectors and LLMs.

Benchmarks and Metrics: We adopt thirteen benchmarks specifically designed for MLLM evaluation, including GQA (hudson2019gqa), MM-Vet (yu2023mm), ScienceQA (SQA) (lu2022learn), MME (fu2023mme), TextVQA (singh2019towards), POPE (li2023evaluating), MMBench (liu2023mmbench), MMBench-CN (liu2023mmbench), VQA-v2 (goyal2017making), LLaVA-Bench-in-the-Wild (LLaVA^W) (liu2024visual), VizWiz (gurari2018vizwiz), SEED-Image (li2023seed) and MMMU (yue2024mmmu). GQA and VQA-v2 evaluate the model’s visual perception capabilities on open-ended short answers. MME-Perception evaluates the model’s visual perception with yes/no questions. ScienceQA, with multiple-choice questions, is used to evaluate zero-shot generalization on scientific question answering. TextVQA contains text-rich visual question answering. MMBench and its CN version evaluate a model’s answer robustness with all-round shuffling on multiple-choice answers. MM-Vet evaluates a model’s capabilities in engaging in visual conversations. Additionally, we extend LLaVolta to video-language understanding, and follow Video-LLaVA (lin2023video) to evaluate the models on MSVD-QA (chen2011collecting), MSRVTT-QA (xu2016msr) and ActivityNet-QA (yu2019activitynet), where the accuracy and score are assessed using GPT-Assistant.
We report the official metrics calculated using the standard implementations provided for each benchmark for a fair comparison. Latency is reported as the time taken during inference until the first answer token is produced. When reporting average performance in Table 2, the score of MME is divided by 2000, as its range is from 800 to 2000. TFLOPs are profiled via DeepSpeed. The total number of tokens is $\#\text{Tokens}=\sum_{i=1}^{N}\#\text{Token}^{i}$ , where $\#\text{Token}^{i}$ is the number of visual tokens at the $i$ -th Transformer layer. The training time is reported for one epoch of training during the LLaVA instruction-tuning stage. The Compression Ratio (CR) is defined as in Equation 3.

4.2 Proof of Concept: Visual Context Redundancy

To assess the redundancy of visual tokens, we perform average pooling within an off-the-shelf LLaVA-1.5-7B checkpoint at the testing stage, using different pooling strides $S$ at various Transformer layers $K$ . As shown in Fig. 1, the model still exhibits strong performance on the MM-Vet benchmark even when retaining only 62.5% of the visual tokens ( $S=4,K=16$ ), without the need for additional training. With the same setting ( $S=4,K=16$ ), a similar trend can be observed on the GQA benchmark, where the compressed model has a performance drop of only 1% compared with its uncompressed counterpart. Surprisingly, on the GQA benchmark, eliminating up to 70% of visual tokens ( $S=4,K=2$ ) results in a mere 3% decrease in performance. This proof of concept shows a considerable level of redundancy in the visual tokens within MLLMs.
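As a sanity check on the percentages quoted in this subsection, the retained fraction of visual tokens is the reciprocal of the CR in Equation (3). The arithmetic below assumes $N=32$ decoder layers, a value inferred from the baseline count of 18432 = 32 × 576 visual tokens in Tab. 2 rather than stated in this section:

$$\frac{1}{\text{CR}}=\frac{K+(N-K)/S}{N},\qquad \frac{16+16/4}{32}=62.5\%\;\;(S=4,K=16),\qquad \frac{2+30/4}{32}\approx 29.7\%\;\;(S=4,K=2).$$

Under this assumption, pooling with stride 4 from layer 16 retains 62.5% of the visual tokens, while pooling with stride 4 from layer 2 retains about 30%, i.e., it removes roughly 70% of them, which is the stride-4, layer-2 setting described in the introduction.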

4.3 Main Results: LLaVolta

In this section, we present the main results of the LLaVolta schemes instantiated in § 3.3. We conduct a thorough evaluation of the multi-modal capability across 13 benchmarks. Tab. 2 demonstrates that our proposed LLaVolta not only consistently lowers training costs by 16% (15.3 hours vs. 12.8 hours) but also surpasses the non-compression baseline. The four-stage training scheme achieves the best performance in nine out of the thirteen benchmarks and obtains 61.9% average performance, improving on LLaVA-v1.5-7B (liu2023improvedllava) with far fewer overall TFLOPs and less training time. This indicates the necessity of designing an optimally lite training scheme.

#Stages	Scheme	#Tokens†	CR†	Test TFLOPs†	Train Time	GQA	MMVet	SQA	MME	VQA^T	POPE	MMB	MMB^CN	VQA^v2	LLaVA^w	VizWiz	SEED^I	MMMU	Avg.
Single	no compression	18432	100%	8.26	15.3h	62.6_.49	31.9_1	70.8_.59	1467_13	58.3_.15	86.1_.24	65.3_.93	59.4_.92	78.9_.37	65.5_.56	49.8_.6	66.7_.25	35.1_.86	61.8_.32
Two	compression	10062	183%	5.20	12.8h	61.9_.23	31.7_1.5	70.9_.34	1480_23	58.3_.46	86.5_.33	64.8_.23	59.0_1.1	78.5_.20	67.3_.91	47.2_1.8	64.9_.17	34.9_.11	61.5_.40
Three	compr. deeper	10597	174%	5.13	12.8h	62.1_.01	30.5_.40	70.5_.23	1477_13	58.4_.07	86.6_.14	65.6_.26	59.9_.27	78.5_.22	67.5_1.4	49.2_.56	65.9_.17	35.0_.19	61.8_.10
Three	compr. wider	10407	177%	3.93	12.8h	61.1_1.6	31.8_.61	71.0_.28	1434_12	58.5_.04	86.6_.06	64.8_.23	59.1_.83	78.7_.02	64.3_4.8	49.8_1.1	65.3_.04	34.3_.75	61.3_.28
Four	wider then deeper	11088	166%	5.39	12.9h	62.1_.09	31.6_.58	71.4_.36	1444_15	58.7_.24	86.8_.21	65.3_.30	59.3_.26	78.8_.05	67.7_3.1	50.1_.21	65.6_.15	33.8_.78	61.8_.35
Four	deeper then wider	10863	170%	5.45	12.8h	62.1_.07	31.5_.20	70.5_.16	1472_16	58.7_.08	86.3_.33	65.6_.52	59.9_.61	78.8_.03	68.2_2.1	48.3_1.3	66.1_.20	35.1_.02	61.9_.47

Table 2: Performance of LLaVolta. See the definition of each training scheme in Tab. 1. †: average across stages. The derived five training schemes achieve competitive results while reducing 16% training time. We report the average results across three runs, with the standard deviation written at the bottom right of the average result (shown here after an underscore). The four-stage training achieves the highest performance in nine of thirteen benchmarks, outperforming the baseline (LLaVA-v1.5-7B) while requiring significantly fewer TFLOPs and less training time.

4.4 Ablation Study

In this section, we perform an ablation study on the choice of visual compressors by comparing different compression methods. Additionally, we examine the effects of varying the stride and LLM layer in training Visual Context Compressor.

Compressor	#Tokens	CR	GQA	MM-Vet	SQA	MME	VQA^T	POPE	MMB	MMB^CN	VQA^v2	LLaVA^W	VizWiz	SEED^I	MMMU	Avg.
Train without compression; Testing with compression
Random Dropping	3312	556%	50.6	21.4	69.3	1142	46.5	55.8	39.7	33.3	59.3	47.6	47.2	52.2	34.3	47.3
K-Means	3312	556%	54.4	25.9	69.7	1155	49.0	78.6	55.3	46.1	69.3	57.6	48.9	56.1	32.9	54.0
FastV chen2024image	3312	556%	52.1	30.6	69.4	1298	53.4	65.6	60.1	53.0	68.6	54.8	50.0	56.3	34.9	54.9
VCC zeng2024vcc	3582	514%	54.7	26.9	69.2	1246	49.2	72.3	60.8	52.0	68.1	55.6	47.8	57.0	34.8	54.7
Average Pooling	3312	556%	53.7	25.6	69.4	1150	47.7	70.1	56.4	46.5	67.0	55.6	50.0	55.7	34.3	53.0
Train with compression; Testing with compression
Random Dropping	3312	556%	53.4	25.0	69.4	1186	49.4	64.9	52.0	41.1	59.7	51.5	47.9	52.6	34.6	50.8
K-Means	3312	556%	57.5	25.9	55.6	1279	51.4	79.4	62.6	54.6	75.7	59	46.1	59.2	34.1	57.9
FastV chen2024image	3312	556%	55.9	27.9	70.4	1327	49.7	79.8	62.9	55.9	69.5	61.7	49.6	56.8	35.1	57.0
VCC zeng2024vcc	3582	514%	57.7	29.3	70.7	1398	53.0	83.6	65.0	55.8	74.1	58.0	48.2	60.1	35.0	58.5
Average Pooling	3312	556%	60.0	30.7	70.8	1450	55.1	85.5	65.0	59.5	75.9	66.9	46.4	62.6	33.8	60.4

Table 3: Comparison among different visual compressors. Higher values are preferred. All methods except VCC are set to the compression ratio of 556% to approximate VCC’s 514% zeng2024vcc for a fair comparison. The best scores are marked as gray and the second best are underlined. Attention-based compressors (i.e., FastV and VCC) excel during the inference phase, yet their application to the training phase proves challenging. Average pooling shows a more stable performance during the training phase.

Choice of Visual Compressors. The design choices include (1) random token dropping, (2) K-Means clustering, (3) average pooling, (4) FastV (chen2024image), (5) VCC (zeng2024vcc), and (6) a parametric pre-trained Q-Former (li2023blip). We have the following three observations. Firstly, Tab. 3 shows that the attention-based methods, FastV and VCC, win 9 of the 13 best or second-best scores when the model is trained without compression, showcasing their high performance when compressing visual tokens at inference. However, they are less effective when applied to training because the in-training attention scores are unstable. Secondly, and surprisingly, average pooling obtains the highest scores on eleven out of thirteen benchmarks when it is used to train MLLMs with a high CR. Thirdly, Tab. 4 shows that both Q-Former and average pooling can obtain reasonably good performance when trained with extremely high CRs, and average pooling performs better with less training cost. The reason could be that the Q-Former resamples tokens outside the LLM, potentially causing the LLM to overlook crucial information relevant to the response. In contrast, our approach employs average pooling after Transformer layer $K$ , allowing the initial $K$ layers of the LLM to effectively retain important information from uncompressed tokens. Given these three insights, we select average pooling as our favored approach for visual compression.

Method	#Param	#Tokens	CR	Train Time	GQA	MMVet	SQA	MME	VQA^T	POPE	MMB	MMB^CN	VQA^v2	LLaVA^w	VisWiz	SEED^I	MMMU	Avg.
Q-Former li2023blip	105M	1024	1800%	10.4h	55.7	26.4	69.3	1217	49.2	83.0	57.7	50.7	71.4	64.6	52.6	55.1	34.0	56.2
Ours	0	855	2156%	9.2h	55.9	26.3	71.0	1321	51.6	82.5	63.3	55.9	74.5	63.1	47.8	57.3	35.7	57.8

Table 4: Parametric vs. nonparametric visual compressor. We follow miniGPT-4 (zhu2023minigpt), which uses the Q-Former pre-trained from BLIP-2 (li2023blip) as the parametric compressor (all other aspects are kept as in LLaVA to ensure a fair comparison). Ours: pooling with stride 64 at LLM layer 1 to ensure comparable CRs. Our nonparametric compressor outperforms its parametric Q-Former counterpart in both performance and training efficiency.

Performance Across Compression Ratios. Herein, we train the multi-modal LLM with our Visual Context Compressor in various settings. As demonstrated in Tab. 5, the proposed method offers certain improvements and trade-offs compared to the state-of-the-art method, LLaVA-1.5-7B. We make two observations. First, at the heavy compression level, the performance of the MLLM is inversely proportional to the compression ratio (scaling linearly with the number of visual tokens). Second, the performance of MLLMs at the light compression level does not correlate directly with the number of visual tokens, which is somewhat unexpected. We attribute this to MLLMs at this level of compression being relatively insensitive to changes in the compression ratio. This indicates that training MLLMs at a light compression level does not hurt model performance. For instance, the setting of stride 16 in light compression level attains a 188% CR and also outperforms the baseline LLaVA-v1.5-7B across all four metrics. The above observations pave the way for developing a more systematic training scheme.
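The relation between stride, pooling layer, token count, and CR can be sketched with a small calculation. The accounting below is an assumption rather than a definition taken from this section: it counts visual tokens summed over all LLM layers, with 576 visual tokens per image and 32 layers (the LLaVA-1.5-7B configuration), and the function names are hypothetical. Under that assumption it reproduces the 855 tokens and 2156% CR reported in Tab. 4 for stride 64 at layer 1.

```python
def visual_token_count(num_visual: int, num_layers: int,
                       pool_layer: int, stride: int) -> int:
    """Visual tokens summed over all layers (assumed accounting).

    Layers 1..pool_layer process every visual token; the remaining
    layers process the pooled sequence of num_visual // stride tokens.
    """
    compressed = num_visual // stride
    return num_visual * pool_layer + compressed * (num_layers - pool_layer)


def compression_ratio(num_visual: int, num_layers: int,
                      pool_layer: int, stride: int) -> float:
    """Uncompressed token count divided by compressed token count."""
    baseline = num_visual * num_layers
    return baseline / visual_token_count(num_visual, num_layers, pool_layer, stride)


# Assumed configuration: 576 visual tokens, 32 LLM layers.
NUM_VISUAL, NUM_LAYERS = 576, 32

# Setting stated in Tab. 4: pooling with stride 64 at LLM layer 1.
tokens = visual_token_count(NUM_VISUAL, NUM_LAYERS, pool_layer=1, stride=64)
cr = compression_ratio(NUM_VISUAL, NUM_LAYERS, pool_layer=1, stride=64)
print(tokens, f"{cr:.0%}")
```

Under the same accounting, pooling later in the network or with a smaller stride leaves more tokens in play, which is why the light-compression settings sit closer to a CR of 100%.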

Stride	#Tokens	CR	Latency	TFLOPs	Train time	GQA	MMVet	SQA	MME	VQA^T	POPE	MMB	MMB^CN	VQA^v2	LLaVA^w	VisWiz	SEED^I	MMMU	Avg.
Heavy compression in LLM layer 2
	3312	557%	37.9ms	2.14	12.0h	59.9_.13	30.1_.92	70.9_.17	1443_11	55.3_.3	85.3_.21	65.2_.25	59.5_.06	76.0_.09	65.9_2.0	46.6_.2	62.6_.0	34.2_.54	60.3_.2
	9792	188%	48.6ms	4.77	12.6h	61.9_.43	30.9_1.1	71.6_.69	1450_18	57.6_.08	86.3_.22	67.2_.05	59.9_.4	78.0_.17	66.4_.85	48.7_.25	65.9_.49	34.1_.34	61.6_.08
Light compression in LLM layer 16
	10368	178%	51.3ms	5.00	12.8h	62.6_.03	30.4_.54	71.1_.27	1462_9	58.2_.01	86.0_.09	65.3_.52	58.9_.57	78.8_.12	63.9_1.1	51.4_.15	66.8_.23	35.8_1.4	61.8_.04
	13824	133%	58.8ms	6.40	14.2h	61.9_.45	31.5_1.0	70.8_.49	1462_24	58.5_.02	86.4_.12	66.4_.33	59.6_.47	78.9_.02	65.3_.46	49.5_.97	66.7_.23	35.1_.87	61.8_.01
Base (liu2023improvedllava)	18432	100%	68.5ms	8.26	15.3h	62.6_.49	31.9_1.0	70.8_.59	1467_13	58.3_.15	86.1_.24	65.3_.93	59.4_.92	78.9_.37	65.5_.56	49.8_.6	66.7_.25	35.1_.86	61.8_.32

Table 5: Training MLLMs with Visual Context Compressor in various compression levels. We report the average results across three runs, with the standard deviation written at the bottom right of the average result. In the heavy compression range, the performance is inversely proportional to the compression ratio. In the light compression range, the performance is not sensitive to compression. Performance remains high for models at the light compression level.

Furthermore, we conduct an ablation study on the number of iterations in different stages (uniform vs. non-uniform stage splitting), which is detailed in the Appendix.

4.5 Extensibility to Video MLLMs

We extend our training scheme to VideoLLaVA (lin2023video), and the results in Tab. 6 reveal findings similar to those above: the proposed training scheme achieves competitive results while reducing training time by 9%. It is worth mentioning that VideoLLaVA, unlike LLaVA, does not support DeepSpeed ZeRO-3, which results in different relative efficiency gains.

#Stages	Scheme	#Tokens^†	CR^†	TFLOPs^†	Train-time	MSVD-QA Score	MSVD-QA Acc	MSRVTT-QA Score	MSRVTT-QA Acc	ActivityNet-QA Score	ActivityNet-QA Acc	Average Score	Average Acc
Single	no compression	147456	-	29.68	40.7h	3.69	69.1	3.48	56.8	3.28	47.5	3.48	57.8
Two	compression	80496	183%	17.73	37.1h	3.71	69.0	3.50	56.9	3.29	47.9	3.50	57.9
Three	compr. deeper	84776	174%	17.29	37.1h	3.73	69.3	3.51	57.2	3.28	47.4	3.51	58.0
Three	compr. wider	83256	177%	16.86	37.0h	3.72	69.0	3.51	57.2	3.29	47.7	3.51	58.0
Four	wider then deeper	88704	166%	18.32	37.2h	3.72	69.1	3.51	57.2	3.27	48.0	3.50	58.1
Four	deeper then wider	86904	170%	18.64	37.1h	3.74	69.8	3.49	56.9	3.27	47.8	3.50	58.2

Table 6: Performance of LLaVolta on VideoLLaVA (lin2023video). See the definition of each training scheme in Tab. 1. †: average across stages. To implement our multi-stage training, we apply the same compression processing to each of the 8 frames representing the video. The five derived training schemes achieve competitive results while reducing training time by 9%.

5 Conclusion

In this work, we conduct two initial studies to investigate and verify the redundancy of visual tokens in multi-modal LLMs. To address this, we propose Visual Context Compressor, a straightforward yet effective compression technique that employs a simple average pooler and integrates seamlessly into the training of MLLMs. This approach enhances training efficiency without compromising performance. To further mitigate the information loss caused by token compression, we introduce LLaVolta, a multi-stage training scheme that uses Visual Context Compressor with a progressively decreasing compression rate. Experimental results on various visual question answering benchmarks verify the effectiveness of LLaVolta in boosting performance while also reducing training costs by 16%. To the best of our knowledge, we are the first to accelerate the training of multi-modal LLMs from the compression perspective. We hope that the proposed Visual Context Compressor and LLaVolta will inspire more in-depth analysis of the visual redundancy in current MLLMs and motivate future designs for efficient MLLM training.


Appendix

In the appendix, we provide additional information as listed below:

• § A provides the additional experimental results.

Appendix A Additional Experimental Results

A.1 Non-uniform Stage Splitting

By default, the training time is evenly divided across each stage. To explore how the compression stage affects total training time, we modify the relative proportion of different stages. This variation is tested in the two-stage setup referenced in Tab. 1, adjusting from the standard 50% in Stage 1 and 50% in Stage 2 to different distributions. Tab. 7 below displays the results of these experiments.

Stage 1	Stage 2	#Tokens	CR	GQA	MMVet	SQA	MME	VQA^T	POPE	MMB	MMB^CN
0%	100%	18432	-	62.0	31.1	70.1	1453.0	58.2	85.9	64.3	58.3
25%	75%	11088	166%	62.1	31.7	70.6	1474.5	58.8	86.4	65.1	59.6
50%	50%	10863	170%	62.2	30.0	70.3	1443.5	57.5	85.8	64.8	59.7
75%	25%	10597	174%	61.6	32.2	70.8	1471.5	57.5	86.6	65.2	58.9
90%	10%	10407	177%	61.2	31.0	70.5	1447.5	56.3	86.4	64.4	56.9
100%	0%	10062	183%	55.9	29.5	64.1	1257.8	49.1	86.6	47.4	29.2

Table 7: Effects of non-uniform stage splitting in the two-stage setup. Performance decreases as the proportion of Stage 2 decreases, in exchange for higher compression ratios.

We observe that as the proportion of Stage 1 (the compression stage) increases from 0% to 100%, the model's performance gradually decreases across various metrics (such as GQA, MMVet, SQA, MME, VQA^T, POPE, MMB, and MMB^CN). The decline is relatively minor when the compression stage makes up no more than 50% of the training duration. However, when the proportion of the compression stage rises above 50%, the decline becomes more significant. In conclusion, keeping the compression stage between 0% and 50% of the training time minimizes performance loss while still achieving significant compression ratios.
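One way to summarize the trend in Tab. 7 is to compare each split against the uncompressed run (Stage 1 = 0%). The sketch below copies the scores from Tab. 7 and averages the per-benchmark relative change. This summary metric is an illustrative choice made here, not one used in the paper, and the names `TABLE7` and `mean_relative_change` are hypothetical.

```python
from typing import Dict, List

BENCHMARKS = ["GQA", "MMVet", "SQA", "MME", "VQA^T", "POPE", "MMB", "MMB^CN"]

# Scores from Tab. 7, keyed by the share of training spent in Stage 1 (%).
TABLE7: Dict[int, List[float]] = {
    0:   [62.0, 31.1, 70.1, 1453.0, 58.2, 85.9, 64.3, 58.3],
    25:  [62.1, 31.7, 70.6, 1474.5, 58.8, 86.4, 65.1, 59.6],
    50:  [62.2, 30.0, 70.3, 1443.5, 57.5, 85.8, 64.8, 59.7],
    75:  [61.6, 32.2, 70.8, 1471.5, 57.5, 86.6, 65.2, 58.9],
    90:  [61.2, 31.0, 70.5, 1447.5, 56.3, 86.4, 64.4, 56.9],
    100: [55.9, 29.5, 64.1, 1257.8, 49.1, 86.6, 47.4, 29.2],
}


def mean_relative_change(row: List[float], baseline: List[float]) -> float:
    """Mean of (score - baseline) / baseline over all benchmarks.

    Using relative change keeps MME (scored in the thousands) from
    dominating benchmarks scored out of 100.
    """
    return sum((r - b) / b for r, b in zip(row, baseline)) / len(baseline)


changes = {
    share: mean_relative_change(scores, TABLE7[0])
    for share, scores in TABLE7.items()
}

for share, change in changes.items():
    print(f"Stage 1 = {share:3d}%: mean relative change {change:+.1%}")
```

Under this metric, the split that spends all of training in the compression stage stands out with by far the largest drop, while the 50%/50% split stays close to the uncompressed run.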
