[1] FAIR, Meta [2] Sorbonne University [3] Valeo.ai

Improved Baselines for Data-efficient
Perceptual Augmentation of LLMs

Théophane Vallaeys, Mustafa Shukor, Matthieu Cord, Jakob Verbeek

(March 20, 2024)

Abstract

The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas. In computer vision, LLMs can be used to prime vision-language tasks such as image captioning and visual question answering when coupled with pre-trained vision backbones. While different approaches have been explored to interface LLMs with “perceptual backbones” that process, e.g., visual or audio data, they are often explored for different tasks, different datasets, and using different perceptual backbones and language models, hindering direct comparison of the interfacing mechanisms. To remedy this lack of comparability between methods, we present an extensive experimental evaluation of different interfacing mechanisms, across multiple tasks (including image, video, and audio captioning as well as visual question answering), datasets and backbones, paying special attention to low-data settings. We find improved performance using existing mechanisms over state-of-the-art results, and identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a $4\times$ reduction in training time.

Correspondence: Théophane Vallaeys, Mustafa Shukor

Figure 1: (a) Our unified framework for perceptual augmentation of LLMs.

1 Introduction

The advent of large language models (LLMs) has brought unprecedented capabilities in the understanding and production of natural language [92, 82, 83, 6, 86]. These models can be leveraged to provide a natural user interface in a wide variety of applications, including text-based generation of images, video and audio [68, 50, 31], using external tools [69], and making models talk to each other [89].

Currently, state-of-the-art models in image captioning [40, 35] and visual question-answering (VQA) [9, 30] mostly consist of task-specific, end-to-end trained models. To build more general models, beyond a single task or dataset, several works leverage the generalization capabilities of pre-trained LLMs, coupled with visual encoders [1, 9, 87, 43, 73]. Such approaches rely on end-to-end training of very large numbers of parameters, e.g. 10B in Flamingo [1], requiring very large datasets, e.g. up to several billions of examples in [9]. Recently, significant efforts have been focused on building more powerful multimodal models [51, 3, 52, 48, 7]. Yet, they still involve costly training stages, such as multimodal pretraining and multitask instruction tuning. These models are interesting when millions of training samples are available, there is no constraint on compute efficiency, and the goal is to have generalist models with good performance on many tasks. An interesting alternative line of research has emerged that studies data and parameter efficient methods [72, 58, 25, 84, 61, 79, 60] to address multimodal tasks. These approaches focus on adapting pre-trained and frozen LLMs, by training modules with few parameters, on limited training sets. This line of research is complementary to the former, and aims to maximize performance on a specific task (e.g. VQA), within a few hours of training on a single machine. This becomes even more important in cases of data scarcity, where large datasets with millions of training samples are unavailable.

Improvements in parameter-efficient approaches span several axes, such as the LLMs and perceptual encoders used [60, 61, 72, 91], the perceptual feature extraction/injection mechanism [72, 60, 91], and the cross-modal mapping module [58, 84, 60]. This variety of design choices prevents a fair and comprehensive comparison between existing approaches, and hinders the understanding of the main factors driving their success. In addition, most of these approaches focus on parameter-efficiency, with little focus on data-efficiency [72, 58], while we argue that the latter is a more important aspect together with compute-efficiency.

More than proposing novel approaches to couple LLMs with perceptual backbones, we believe it is important to have a unified understanding and proper comparison between existing methods. To this end, we propose a unified framework to comprehensively study previous approaches (Figure 1). Our framework allows a fair and systematic comparison along the design of several blocks: feature extraction (e.g. which visual features to consider), feature mapping (e.g. how to project the extracted features in the LLM textual space) and feature injection (e.g. where to inject the projected features). We consider the impact of the choice of LLM and perceptual backbones, and carefully and fairly tune hyperparameters. This by itself already improves over previously reported results. The systematic characterization of existing approaches naturally leads us to define and evaluate alternative approaches. We find that one of these approaches emerges as the overall best, which we dub DePALM, leading to (near) optimal results across different datasets and tasks. Our approach consistently and significantly improves over earlier data and parameter efficient approaches, and in some cases also outperforms few-shot performance of large-scale state-of-the-art models that train billions of parameters on massive datasets.

To summarize, our contributions are as follows:

• We present the first systematic experimental study of methods to interface perceptual backbones with LLMs, using the same tasks, datasets, and underlying backbone networks.
• For all considered tasks, we find improvements over previous state-of-the-art data and parameter efficient methods by careful setting of training hyperparameters and architectural choices.
• We identify a new mechanism, DePALM, to interface LLMs with perceptual backbones based on token pooling, which obtains near optimal results, while being up to $4\times$ faster to train than the closest competitor (training in less than 1.5h on a single machine for a typical dataset).

2 Related work

Table 1: Overview of different architectures from the literature, as well as from this work (DePALM models). The LLM adaptation mechanisms consist of four fundamental components: feature extraction, feature mapping, feature injection, and a fine-tuning mechanism. The last column shows the number of trainable parameters, as reported by papers, or with the LLaMA+CLIP-L setting in our models. Methods in orange leverage pre-training on large amounts of data, or cross-dataset training. Others have at least one version trained on a single dataset, which is the setting we consider.

Method	Backbones		Adaptation mechanism				# Tr.
Method	LLMs	Perceptual Enc.	Feature extraction	Feature mapping	Feature injection	Fine-tuning mechanisms	params.
Flamingo [1]	Chinchilla [33]	NFNet [5]	Tokens from last layer	Perceiver Resampler (Transformer)	GATED XATTN-DENSE (Cross-attention)	–	10B
BLIP-2 [43]	OPT [92], FlanT5 [13]	CLIP [65]	Tokens from last layer	Q-Former	1st layer token injection	–	1.2B
MAGMA [22]	GPT-J 6B [86]	CLIP [65] / NFNet [5]	Tokens from last layer	MLP	1st layer token injection	fine-tuning of perceptual model	243M
MAPL [58]	GPT-J 6B [86]	CLIP-L [65]	Tokens from last layer	QPMapper ( $d_{\text{embed}}$ =256, 4 layers)	1st layer token injection	–	3.4M
PromptFuse [46]	BART [42]	ViT [19]	Tokens from last layer	nothing	–	prompt tuning	15K
LiMBeR [60]	GPT-J 6B [86]	CLIP [65]	Tokens from last layer	Linear projection	1st layer token injection	–	12.5M
eP-ALM [72]	OPT-2.7B/6.7B [92]	ViT [77], AST [27], TimeSformer [4]	CLS tokens from $n$ last layers	(Shared) linear projection	Token injection in intermediate layers	prompt tuning	4.2M
LLaMA-Adapter [91, 25]	LLaMA[82]	CLIP [65]	Tokens from last layer	Linear projection	Token injection in intermediate layers	inner-layer prompt tuning, bias tuning, norm tuning	14M
Frozen [84]	GPT-like [66]	NFNet [5]	Pooled output tokens	nothing	1st layer token injection	Fine-tune the NFNet	40.3M
ClipCap [61]	GPT-2[66]	CLIP [65]	Tokens from last layer	Transformer	1st layer token injection	–	43M
VL-Adapter [79]	BART [42], T5 [67]	CLIP [65]	Tokens from last layer	Linear projection	1st layer token injection	Adapters	5.8M
AnyMAL [62]	Llama 2-70B-chat [83]	CLIP [65], CLAP [23]	Tokens from last layer	Perceiver Resampler, or linear projection	1st layer token injection	LoRA [34]	–
DePALM^QP,inner	OPT-6.7B [92], LLaMA [82]	CLIP-L [65], DINOv2 [63], MAViL [36], TimeSformer [4]	Tokens from $n$ last layers	QPMapper	Token injection in intermediate layers	prompt tuning	18.1M
DePALM			Tokens from last layer	QPMapper	1st layer token injection		17.9M
DePALM^R-rand,L0, DePALM^R-linear,L0, DePALM^{R-QPMapper,L0}, DePALM^R-avgpool,L0			Tokens from last layer	Linear projection + Resampler	1st layer token injection		21M, 88M, 18M, 21M
DePALM^c-attn			Tokens from $n$ last layers	Projection + Small Transformer	Gated cross-attention		17.9M

Multimodal models.

In recent years there has been significant interest in multimodal models, and in particular in vision-language pre-training, see e.g. [80, 10, 45, 71, 20, 44, 65]. These models can be subsequently fine-tuned to address a range of tasks, such as visual question answering (VQA) and image captioning. The advent of large language models (LLMs) [6, 32, 92, 82, 12] has triggered another line of work on large-scale multimodal training built on top of LLMs. The typical approach is to fine-tune a pre-trained language model on large multimodal datasets [9, 8].

Due to the large computational cost of training these approaches, especially with LLMs on the scale of billions of parameters, other approaches keep the LLM part of the model frozen, and only train additional parameters to solve multimodal tasks [1, 43]. Although most research focuses on image-text tasks, these approaches have recently been shown to adapt in a straightforward manner to other modalities, including video and audio [72, 29, 73, 57, 62, 87]. Nonetheless, such efforts still require training a substantial number of parameters, e.g. 10B parameters in Flamingo [1], on billion-scale multimodal training sets [8].

Efficient adaptation of unimodal models.

In contrast to the paradigm of large-scale end-to-end multimodal training, another line of work considers efficient adaptation of pre-trained unimodal models. Methods such as MAGMA [22], Frozen [84] and ClipCap [61] tackle vision-language tasks by training the visual encoder [84, 22] or additional adapters [22] to leverage a pre-trained language model. Other approaches train a smaller number of parameters by keeping all pre-trained models fixed, and train only a linear layer [60] or a small transformer mapping network [58]. Nevertheless, these methods rely on multimodal visual encoders such as CLIP [65], or inject a substantial number of visual tokens in the language model, which reduces inference speed. Recently, several approaches [72, 91, 39] have explored the use of simple linear layers to transform features and inject them in LLMs, some of which use only unimodal encoders, across image/video/audio modalities, see e.g. [72]. While each of these approaches shows good performance within its own specific experimental setup, it is difficult to compare them due to the differences in the considered tasks and datasets.

3 Unified framework

Even though different approaches leverage LLMs for multimodal tasks, it remains challenging to discern the specific components responsible for the superiority of one method over another. In response to this challenge, we structure previous work in a comprehensive and unified framework, as depicted in Figure 1, enabling a systematic and fair comparison of various existing approaches. Within this framework, the process of adapting LLMs for multimodal tasks boils down to making different design choices. These include the choice of the LLM and perceptual backbone models, which we discuss in Section 3.1, and the adaptation mechanism, which we discuss in Section 3.2. The latter consists of a feature extraction, feature mapping and feature injection mechanism, and finally a fine-tuning mechanism. Different design choices lead to different existing or new approaches, as exemplified for a number of methods from the literature in Table 1.

3.1 Backbone models

Language models.

Despite a non-negligible effort in encoder-decoder LLMs, the NLP community has mostly converged to decoder-only LLMs for very large scales. Most powerful LLMs come in different model sizes; the best choices in terms of the trade-off between performance and efficiency are usually models with on the order of 7B parameters. To solve multimodal tasks, these LLMs can be fully or partly fine-tuned, or completely frozen. Here we focus on frozen, decoder-only LLMs with 7B parameters, such as OPT [92] and LLaMA [82]. We also experiment with instruction-tuned LLMs such as Vicuna [11] and LLaMA-2-chat [83].

Perceptual encoders.

Encoders are chosen depending on the modality, and they differ mainly in the architecture (e.g. CNNs or Transformers) and training paradigm (e.g. class-label supervised, text supervised, or self-supervised). In our experiments, we focus on transformer-based encoders for their strong performance when pre-trained on large-scale datasets [19]. We experiment with CLIP [65] for image-text tasks, which has been pre-trained on text-aligned data. We also experiment with models pre-trained in a unimodal manner such as TimeSformer [4] for videos, and self-supervised ones such as DINOv2 [63] for images, and MAViL [36] for audio.

3.2 Adaptation mechanisms

To couple the perceptual backbone with the LLM, perceptual tokens are first extracted from the perceptual backbone, transformed via a mapping network, and then injected in the LLM. To further improve the adaptation, different fine-tuning mechanisms can be adopted. An overview of the different designs of these components is given in columns four to seven in Table 1. Below, we discuss each of them in detail.

Feature extraction.

In transformer-based perceptual encoders, features take the form of “tokens” that correspond to specific parts of the input, e.g. an image patch. Some transformer models also include a special “class token”, denoted as CLS, which is not tied to a specific part of the input; it interacts with all the input-tied tokens, and encodes global information that can, e.g., be used to classify an image. These tokens can be extracted from any layer of the encoder. We consider two design choices: (i) where to extract the tokens: from the last encoder layer only, or from the last $k$ layers, and (ii) which tokens to extract: all of them, or only the CLS token.
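The two extraction choices can be illustrated with a minimal sketch. It assumes, hypothetically, that the encoder exposes its per-layer hidden states as a list of arrays with the CLS token stored at index 0; the function name `extract_tokens` and the toy dimensions are illustrative and do not come from any specific library.

```python
import numpy as np

def extract_tokens(hidden_states, last_k=1, cls_only=False):
    """Select perceptual tokens from the last `last_k` encoder layers.

    hidden_states: list of arrays, one per layer, each of shape
    (1 + num_patches, dim); index 0 is assumed to hold the CLS token.
    Returns a list with one array per selected layer.
    """
    selected = hidden_states[-last_k:]
    if cls_only:
        # Keep only the global CLS token of each selected layer.
        return [layer[:1] for layer in selected]
    # Keep every token (CLS and patch tokens) of each selected layer.
    return [layer for layer in selected]

# Toy encoder output: 12 layers, 1 CLS + 16 patch tokens, width 8.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(17, 8)) for _ in range(12)]

# All tokens from the last layer only.
last_layer_all = extract_tokens(layers, last_k=1)
# CLS tokens from the last six layers.
cls_last_six = extract_tokens(layers, last_k=6, cls_only=True)
```

The first call corresponds to last-layer extraction of all tokens, the second to multi-layer extraction of CLS tokens only.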

Feature mapping.

To render the encoder features compatible with the internal features of the LLM, we apply a mapping which can take different forms.

1) Linear projection.

The simplest approach is to use a single linear layer that projects the extracted visual tokens to the hidden state dimension of the LLM. If multiple tokens are extracted, each of them is projected independently with the same linear layer.
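As a minimal sketch of this shared projection, the snippet below applies one weight matrix to every token. The dimensions `d_enc` and `d_llm` and the random initialization are illustrative assumptions, not values from our experiments.

```python
import numpy as np

def linear_projection(tokens, weight, bias):
    """Map (num_tokens, d_enc) tokens to (num_tokens, d_llm) with one shared layer."""
    return tokens @ weight + bias

rng = np.random.default_rng(0)
d_enc, d_llm = 16, 32  # toy encoder and LLM widths (assumed)
weight = rng.normal(size=(d_enc, d_llm)) / np.sqrt(d_enc)
bias = np.zeros(d_llm)

tokens = rng.normal(size=(5, d_enc))
projected = linear_projection(tokens, weight, bias)  # shape (5, 32)
```

Because the same matrix multiplies every row, projecting a token on its own gives the same result as projecting it within the full sequence.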

2) Query pooling mapper.

Typical perceptual backbones use on the order of hundreds of internal tokens, e.g. 256 tokens organized in a $16\times 16$ grid for images. The training and inference costs depend directly on the number of tokens that are extracted from the encoder and injected in the LLM. Even if the LLM in principle needs to process only short text sequences for the task at hand, such as in visual question answering, injecting hundreds of perceptual tokens in the LLM makes it computationally demanding. We design a QPMapper block to aggregate the tokens extracted from the encoder into a smaller set. The input feature tokens are projected and concatenated to a sequence of learnable query tokens (hence the name), and only the outputs corresponding to the query tokens are kept, upsampled to the LLM dimension, and normalized using RMSNorm [90]. This architecture is inspired by several previous works [43, 21, 81] using query or class tokens to compute an aggregate representation of the input. In our case, however, we use multiple query tokens rather than a single global token. It is also similar to the mapping network of MAPL [58], with a lower number of layers and higher dimensionality, and its main benefit is to limit the number of tokens passed to the LLM. Our QPMapper consists of a small stack of $N_{QS}$ transformer layers, wrapped by linear downsampling and upsampling projections to and from $d_{\text{embed}}$-dimensional internal features, which allows us to control the number of trainable parameters in this block. See Figure 1 (right panel) for an illustration of the QPMapper architecture.
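The data flow of the QPMapper can be sketched as follows. This is a simplified illustration under several assumptions that are not specified in the text: single-head attention, pre-norm layers with a ReLU MLP, RMSNorm without a learnable gain, queries placed before the feature tokens, and toy dimensions. All function and parameter names are hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learnable gain (omitted for brevity).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_layer(x, p):
    # Simplified single-head, pre-norm self-attention block with a ReLU MLP.
    h = rms_norm(x)
    q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    x = x + (attn @ v) @ p["wo"]
    h = rms_norm(x)
    return x + np.maximum(h @ p["w1"], 0.0) @ p["w2"]

def init_qpmapper(rng, d_enc, d_embed, d_llm, n_queries, n_layers):
    def mat(rows, cols):
        return rng.normal(size=(rows, cols)) / np.sqrt(rows)
    layers = [
        {"wq": mat(d_embed, d_embed), "wk": mat(d_embed, d_embed),
         "wv": mat(d_embed, d_embed), "wo": mat(d_embed, d_embed),
         "w1": mat(d_embed, 4 * d_embed), "w2": mat(4 * d_embed, d_embed)}
        for _ in range(n_layers)
    ]
    return {"down": mat(d_enc, d_embed),
            "queries": rng.normal(size=(n_queries, d_embed)),
            "layers": layers,
            "up": mat(d_embed, d_llm)}

def qpmapper(tokens, params):
    n_queries = params["queries"].shape[0]
    x = tokens @ params["down"]                         # project features to d_embed
    x = np.concatenate([params["queries"], x], axis=0)  # add learnable query tokens
    for layer in params["layers"]:
        x = transformer_layer(x, layer)
    out = x[:n_queries] @ params["up"]                  # keep query outputs, upsample
    return rms_norm(out)                                # normalize for the LLM

rng = np.random.default_rng(0)
params = init_qpmapper(rng, d_enc=64, d_embed=32, d_llm=128, n_queries=4, n_layers=2)
encoder_tokens = rng.normal(size=(256, 64))   # e.g. a 16x16 grid of patch tokens
llm_tokens = qpmapper(encoder_tokens, params)  # 256 tokens reduced to 4
```

In this toy configuration, the LLM receives 4 tokens instead of 256, which is where the reduction in sequence length comes from.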

3) Token resamplers.

We consider several other alternatives to reduce the number of tokens, which are inspired by pooling blocks in CNNs. For example, average-pooling and max-pooling aggregate features over a small patch of, typically, $2\times 2$ features. In our experiments, we explore the following resamplers to reduce the number of tokens:

• R-avgpool: tokens in a patch are averaged, which is equivalent to an average pooling layer on the input grid.
• R-linear: tokens in a patch are concatenated and then linearly projected, which is equivalent to a strided convolution on the input grid.
• R-QPMapper: tokens in a patch are passed through a QPMapper with a single query token, with parameters of the QPMapper shared across patches.
• R-rand: a random subset of tokens, e.g. 50%, is selected during training. During evaluation, we keep all tokens.
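Three of these resamplers can be sketched in a few lines; R-QPMapper is omitted since it reuses the QPMapper on each patch. The sketch assumes tokens laid out in row-major order on a square grid, and the helper names (`to_patches`, `r_avgpool`, `r_linear`, `r_rand`) are hypothetical.

```python
import numpy as np

def to_patches(tokens, grid, patch):
    """Group (grid*grid, d) row-major tokens into (n_patches, patch*patch, d)."""
    d = tokens.shape[-1]
    g = grid // patch
    x = tokens.reshape(g, patch, g, patch, d)  # (patch row, in-row, patch col, in-col, d)
    x = x.transpose(0, 2, 1, 3, 4)             # bring both patch indices to the front
    return x.reshape(g * g, patch * patch, d)

def r_avgpool(tokens, grid, patch=2):
    # Average the tokens of each patch.
    return to_patches(tokens, grid, patch).mean(axis=1)

def r_linear(tokens, grid, weight, patch=2):
    # Concatenate the tokens of each patch, then apply a shared linear map.
    p = to_patches(tokens, grid, patch)
    return p.reshape(p.shape[0], -1) @ weight

def r_rand(tokens, keep_ratio, rng, training=True):
    # Random subset during training; all tokens are kept at evaluation time.
    if not training:
        return tokens
    n_keep = max(1, int(round(keep_ratio * tokens.shape[0])))
    idx = np.sort(rng.choice(tokens.shape[0], size=n_keep, replace=False))
    return tokens[idx]

# Toy 4x4 grid of one-dimensional tokens whose value equals their index.
grid_tokens = np.arange(16, dtype=float).reshape(16, 1)
pooled = r_avgpool(grid_tokens, grid=4)  # four 2x2 patches
mixed = r_linear(grid_tokens, 4, np.full((4, 1), 0.25))
rng = np.random.default_rng(0)
train_subset = r_rand(grid_tokens, 0.5, rng, training=True)
eval_tokens = r_rand(grid_tokens, 0.5, rng, training=False)
```

With all weights set to 0.25, R-linear reduces to R-avgpool on a $2\times 2$ patch, which shows that average pooling is a special case of the strided linear projection.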

4) Cross-attention.

Prepending tokens to the textual tokens inside an LLM significantly increases the inference complexity. Rather than reducing the number of prepended tokens, we consider a parameter-efficient cross-attention module that is inserted in the LLM and allows it to access the tokens of the perceptual encoder, inspired by Flamingo [1]. Specifically, the perceptual and textual tokens are projected to a smaller hidden dimension $d_{\text{embed}}$. In this latent space, a typical cross-attention block is applied. The textual tokens are considered as queries, and the perceptual ones as keys and values. Finally, the output is upsampled to the LLM dimension and added back to the initial textual tokens using a tanh-gated residual connection. These modules are inserted throughout the second half of the LLM, in-between the LLM transformer modules.
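A minimal sketch of such a gated cross-attention module is given below. It assumes single-head attention, separate down-projections for text and perceptual tokens, and a scalar gate initialized to zero (a common choice that makes the module an identity at the start of training); these details and all names are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_cross_attention(rng, d_llm, d_enc, d_embed):
    def mat(rows, cols):
        return rng.normal(size=(rows, cols)) / np.sqrt(rows)
    return {"down_text": mat(d_llm, d_embed), "down_percept": mat(d_enc, d_embed),
            "wq": mat(d_embed, d_embed), "wk": mat(d_embed, d_embed),
            "wv": mat(d_embed, d_embed), "up": mat(d_embed, d_llm),
            "gate": 0.0}  # scalar gate, assumed to start at zero

def gated_cross_attention(text, percept, p):
    t = text @ p["down_text"]        # textual tokens in the latent space
    m = percept @ p["down_percept"]  # perceptual tokens in the latent space
    q, k, v = t @ p["wq"], m @ p["wk"], m @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # text queries attend to perception
    update = (attn @ v) @ p["up"]    # back to the LLM dimension
    return text + np.tanh(p["gate"]) * update       # tanh-gated residual connection

rng = np.random.default_rng(0)
params = init_cross_attention(rng, d_llm=64, d_enc=48, d_embed=16)
text_tokens = rng.normal(size=(10, 64))
percept_tokens = rng.normal(size=(256, 48))

out_closed = gated_cross_attention(text_tokens, percept_tokens, params)
out_open = gated_cross_attention(text_tokens, percept_tokens, {**params, "gate": 1.0})
```

The number of tokens processed by the LLM stays at the text length (10 here), regardless of how many perceptual tokens are attended to.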

Feature injection.

As for the injection of tokens in the LLM, there are two choices to make: (i) how to inject, and (ii) where to inject these tokens. Regarding (i), we can prepend the tokens to the textual tokens, so that they interact with the text in the LLM self-attention layers, or we can inject them via a cross-attention mechanism. Regarding (ii), tokens are either injected in the LLM input layer, or in intermediate ones. When prepending the tokens to textual tokens in the input layer, they propagate up until the last layer. When injecting in intermediate layers, they are only kept for a single attention block and discarded afterward, and possibly replaced by the same set or another set of tokens in the next transformer block. In this last case, if the tokens were extracted from $k$ levels of the perceptual model, and inserted into $n$ LLM layers, then each sequence $i=1,\dots,k$ of perceptual features is inserted into the LLM blocks $\lfloor in/k\rfloor,\dots,\lfloor(i+1)n/k-1\rfloor$ .
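The assignment of encoder levels to LLM blocks can be worked through with a short sketch. Although the text indexes the levels as $i=1,\dots,k$, the sketch assumes a zero-based reading, $i=0,\dots,k-1$, since under that convention the ranges exactly tile blocks $0,\dots,n-1$ of the $n$ layers chosen for injection. The function name and the example values are hypothetical.

```python
def blocks_for_level(i, n, k):
    """LLM block indices that receive the tokens of encoder level i.

    Implements floor(i*n/k), ..., floor((i+1)*n/k - 1) with zero-based i.
    """
    start = (i * n) // k
    end = ((i + 1) * n) // k - 1
    return list(range(start, end + 1))

# Example: k = 3 encoder levels spread over n = 16 injection blocks.
schedule = [blocks_for_level(i, n=16, k=3) for i in range(3)]
# Level 0 -> blocks 0..4, level 1 -> blocks 5..9, level 2 -> blocks 10..15.
```

Each block receives tokens from exactly one level, and when $k$ does not divide $n$ the floor operation makes some levels cover one more block than others.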

Fine-tuning mechanisms.

While keeping the LLM frozen is most efficient, parameter-efficient fine-tuning techniques can be used to further boost performance [18]. In our experiments, we consider prompt-tuning and bias-tuning, which we detail in the supplementary material.

4 Experiments

Below, we present our experimental setup in Section 4.1, followed by the results in Section 4.2.

4.1 Experimental setup

Datasets and metrics.

The datasets used in our experiments are listed in Table 2. For all datasets we use standard splits, except for COCO and VQAv2 where we use the commonly used Karpathy splits [37]. To study limited data settings, we consider OKVQA and MSRVTT, and also experiment with COCO and VQAv2 using 1% of the training data. We evaluate using the standard metric of each benchmark. Specifically, for both image and video captioning, we use CIDEr [85], and for VQA tasks, we use the official VQAv2 accuracy metric on the test and/or validation set. Audio captioning is evaluated with SPIDEr [53]. We add other standard metrics (BLEU [64], METEOR [17], SPICE [2]) in the supplementary material.
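For reference, the core of the VQA accuracy metric can be sketched as below, using the commonly cited simplified form in which an answer counts as fully correct once at least three annotators gave it. The official evaluation additionally normalizes answer strings and averages over subsets of annotators; both steps are omitted here, and the function name and toy answers are hypothetical.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(number of matching annotators / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == prediction)
    return min(matches / 3.0, 1.0)

# Toy example with ten annotator answers for one question.
answers = ["red"] * 5 + ["dark red"] * 2 + ["maroon"] * 3
full_credit = vqa_accuracy("red", answers)          # 5 matches
partial_credit = vqa_accuracy("dark red", answers)  # 2 matches
no_credit = vqa_accuracy("blue", answers)           # 0 matches
```

The soft credit rewards answers that only part of the annotators agree on, which matters for open-ended questions with several valid phrasings.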

Baselines.

To ensure fair comparison of the different interfacing mechanisms, we re-implement several parameter-efficient approaches: LiMBeR [60], MAPL [58] and eP-ALM [72]. We selected LiMBeR as this is the simplest method, used in a number of other works, and the other two as their original papers also report results in the data-efficient setting. We found the other models either redundant (VL-Adapter [79] is the same as LiMBeR with additional fine-tuning, which is not our main focus), not parameter- or data-efficient (Frozen [84], BLIP-2 [43]), or designed for another setting (LLaMA-Adapter [91] was conceived for instruction fine-tuning first). We refer to our re-implementations as LiMBeR^(all) (our reimpl.), MAPL (our reimpl.) and eP-ALM (our reimpl.), and note that we change the backbones from the original papers so that they are the same for all methods, for proper comparison. We also use a variant of LiMBeR from [72], which we name LiMBeR⁽¹⁾ (our reimpl.), where only the CLS token is injected in the LLM.

Table 2: Datasets used in our experiments, listing the modality type, task, and size of the training set. We also list the LLM and perceptual backbone used by default for each dataset.

Dataset	Type	Task	Size	LLM	Backbone
COCO [47]	Image	Captioning	82K	LLaMA-7B	EVA-CLIP-L
TextCaps [75]	Image	Captioning	21K	LLaMA-7B	CLIP-ViT-L
VQAv2 [28]	Image	Question Ans.	605K	OPT-6.7B	CLIP-ViT-L
TextVQA [76]	Image	Question Ans.	34K	OPT-6.7B	CLIP-ViT-L
AOKVQA [70]	Image	Question Ans.	17K	OPT-6.7B	CLIP-ViT-L
OKVQA [59]	Image	Question Ans.	9K	OPT-6.7B	CLIP-ViT-L
AudioCaps [38]	Audio	Captioning	49K	OPT-6.7B	MAViL
MSRVTT [88]	Video	Captioning	7K	LLaMA-7B	TimeSformer

Our models.

Based on our unified framework, we explore seven novel interfacing mechanisms, summarized in Table 1: DePALM^QP,L0 (that we refer to as DePALM), DePALM^QP,inner, DePALM^c-attn, DePALM^R-rand,L0, DePALM^R-linear,L0, DePALM^{R-QPMapper,L0} and DePALM^R-avgpool,L0. To get a good trade-off between performance and efficiency, we include different pooling strategies to reduce the number of perceptual tokens, contrary to prior work that either used a single token [72] or all tokens [60]. Most of these variants extract features from the last perceptual encoder layer and inject the mapped features in the first LLM layer. In addition, we also explore models that consider intermediate layers as in [72]. In terms of fine-tuning mechanism, we consider prompt tuning and bias-tuning, due to their effectiveness in previous work [72, 46], and leave the LLM and perceptual backbone frozen. In the appendix we report additional experiments regarding different fine-tuning approaches.

Please refer to the supplementary material for further architectural detail of the baselines and our models.

Implementation details.

For fair comparison between different approaches, we use a unified training setup. For each dataset, we use the same LLM and perceptual encoders for all methods, as listed in Table 2. Models are trained directly on downstream tasks using the standard cross-entropy loss, without any pre-training, except for the LLM and perceptual backbone, which are pre-trained and frozen. We use random perturbations for data-augmentation, using the same procedure as in [45] for images. We train with the AdamW [54] optimizer and the cosine learning rate scheduler [55]. For each experiment, we conduct five different runs with different random seeds, unless specified otherwise, each run being executed on a single machine equipped with eight V100 GPUs. We report the mean performance metrics in the main paper, and refer to the supplementary material for the standard deviations. Further implementation details can also be found in the supplementary material.
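The shape of a cosine learning rate schedule can be illustrated with a short sketch. It assumes a single decay cycle without warm-up or restarts, and the base learning rate and step count are hypothetical values, not the ones used in our experiments.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate after `step` updates under single-cycle cosine decay."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run of 100 steps starting from a learning rate of 1e-3.
lr_start = cosine_lr(0, 100, 1e-3)
lr_middle = cosine_lr(50, 100, 1e-3)
lr_end = cosine_lr(100, 100, 1e-3)
```

The rate starts at `base_lr`, passes through the midpoint between `base_lr` and `min_lr` halfway through training, and reaches `min_lr` at the end.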

Table 3: Comparison of our implementation of baselines with results reported by the original papers. Results are averaged over 5 runs. The best results per column are marked in bold.
${}^{\dagger}$: The published results for LiMBeR use the standard split and a 4-shot evaluation using a model trained on a larger dataset, which do not correspond directly to our setting.
${}^{\ddagger}$: Results using 8 shots, after training on the target dataset only.

Method	COCO $\uparrow$	COCO (1%) $\uparrow$	VQAv2 $\uparrow$	VQAv2 (1%) $\uparrow$	OKVQA $\uparrow$
Method	CIDEr	CIDEr	Val	Val	Val
LiMBeR (4-shot) [60]	–	–	39.2 ${}^{\dagger}$	–	–
MAPL [58]	125.2	65.9	43.5	37.7	18.7 / 31.6 ${}^{\ddagger}$
eP-ALM [72]	111.6	–	54.9	41.9 ${}^{\dagger}$	–
LiMBeR^(all) (our reimpl.)	136.3	83.7	73.4	48.0	36.2
MAPL (our reimpl.)	126.1	69.2	67.1	45.9	36.2
eP-ALM (our reimpl.)	115.3	64.7	59.3	41.4	23.5

4.2 Main experimental results

Improved baseline performances.

We start with our reproductions of existing parameter and data-efficient baselines. Table 3 shows a comparison between the scores we obtained and those reported in the original papers. We improve the existing baselines by large margins across all metrics. This comes mainly from using better backbones (e.g., LLaMA and CLIP), and using a thorough hyperparameter search for the training algorithm. We conducted this hyperparameter search independently for each experiment, over the learning rate and gradient clipping, using a grid search over a set of values we found to work particularly well for a diverse set of models on our tasks. With our implementation, LiMBeR^(all) achieves the best performance across the board. However, LiMBeR^(all) is computationally more expensive, as it dramatically increases the length of the sequence processed by the LLM by passing all (typically 256) perceptual tokens to it, compared to passing a couple of tokens in MAPL, or just one in eP-ALM.

Table 4: Comparison of our proposed DePALM architectures and our baseline re-implementations, after training on 100% or 1% of each dataset. We highlight the first, second and third best results. All results are averaged over 5 runs. We show the training time on AudioCaps, the average rank and average normalized score of each method across benchmarks. For these last two values, we add the rank over all our models.
${}^{\blacksquare}$: tokens are first averaged across time to prevent memory errors.
${}^{\blacklozenge}$: incomplete data due to unstable training.

Method	COCO	COCO (1%)	TextCaps	VQAv2	VQAv2 (1%)	TextVQA	OKVQA	AOKVQA	AudioCaps	MSRVTT	Train time $\downarrow$	Average
Method	CIDEr $\uparrow$	CIDEr $\uparrow$	CIDEr $\uparrow$	Val $\uparrow$	Val $\uparrow$	Val $\uparrow$	Val $\uparrow$	Val $\uparrow$	SPIDEr $\uparrow$	CIDEr $\uparrow$	Train time $\downarrow$	Rank $\downarrow$	Score $\uparrow$
LiMBeR⁽¹⁾ (our reimpl.)	122.85	87.10	51.85	60.19	45.93	17.96	33.38	34.13	38.94	46.03	1h19	8.0 (9)	84.7 (9)
LiMBeR^(all) (our reimpl.)	136.31	83.74	75.51	73.42	47.98	31.25	36.19	38.93	40.12	46.87 ${}^{\blacksquare}$	4h59	3.2 (2)	97.4 (1)
MAPL (our reimpl.)	126.05	69.20	50.57	67.13	45.94	21.04	36.21	37.02	40.89	47.27	1h31	6.4 (7)	86.9 (7)
eP-ALM (our reimpl.)	115.34	64.65	42.58	59.34	41.38	16.59	23.52	27.82	38.13	38.83	1h20	10.4 (10)	73.1 (11)
DePALM	131.29	87.05	73.67	70.11	48.25	22.97	37.69	38.45	43.37	49.88	1h25	2.5 (1)	95.7 (2)
DePALM^QP,inner	130.91	75.86	65.22	67.88	45.27	23.70	35.98	36.36	41.20	47.76	2h21	5.2 (5)	90.7 (5)
DePALM^R-avgpool,L0	131.77	86.09	61.18	64.84	48.86	19.14	35.17	35.41	41.54	50.52	1h50	4.4 (3)	90.4 (6)
DePALM^R-linear,L0	133.01	85.31	69.76	64.76	47.66	19.08	34.58	35.30	40.92	51.60	1h48	5.2 (5)	91.1 (3)
DePALM^{R-QPMapper,L0}	131.92	75.46	51.03	61.09	46.08	18.56	35.35	35.63	41.17	45.49	1h48	6.8 (8)	85.6 (8)
DePALM^R-rand,L0	134.90	86.84	58.15	71.33	47.60	21.28	35.00	34.74	41.37	47.90 ${}^{\blacksquare}$	2h40	4.4 (3)	90.9 (4)
DePALM^c-attn ${}^{\blacklozenge}$	130.05	81.38	–	69.45	41.73	–	–	–	–	–	1h31	9.5 ${}^{\blacklozenge}$ (10)	36.9 ${}^{\blacklozenge}$ (11)

Better adaptation mechanism.

Next, we explore seven additional cross-modal interaction mechanisms, beyond the baseline ones, across a set of ten tasks. We also add LiMBeR⁽¹⁾, as it was shown to be a fast and efficient baseline [72]. We report results in Table 4, where we also report the training time for AudioCaps as an illustration of the training cost. To easily compare these methods across different tasks, we use two aggregate metrics. (i) The average rank: for each task, we rank from 1 (best) to 11 (worst), and average the ranks across tasks. (ii) The average score: we normalize the score for each task by the maximum score across the methods, and then average the normalized scores.
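The two aggregate metrics can be computed as in the sketch below. It assumes higher-is-better scores with no ties and no missing entries, and expresses the normalized score as a percentage to match the scale of the Score column in Table 4. The method names and numbers in the example are hypothetical toy values, not results from our experiments.

```python
def aggregate(results):
    """Average rank and average normalized score per method.

    results[method][task] is a score where higher is better.
    """
    methods = list(results)
    tasks = list(results[methods[0]])
    ranks = {m: [] for m in methods}
    normalized = {m: [] for m in methods}
    for task in tasks:
        # Rank 1 is the best method on this task.
        ordered = sorted(methods, key=lambda m: results[m][task], reverse=True)
        best = results[ordered[0]][task]
        for position, method in enumerate(ordered, start=1):
            ranks[method].append(position)
            # Normalize by the best score on this task, as a percentage.
            normalized[method].append(100.0 * results[method][task] / best)
    avg_rank = {m: sum(ranks[m]) / len(tasks) for m in methods}
    avg_score = {m: sum(normalized[m]) / len(tasks) for m in methods}
    return avg_rank, avg_score

# Hypothetical toy scores for three methods on two tasks.
toy = {
    "A": {"task1": 100.0, "task2": 40.0},
    "B": {"task1": 80.0, "task2": 50.0},
    "C": {"task1": 90.0, "task2": 30.0},
}
avg_rank, avg_score = aggregate(toy)
```

In this toy example, methods A and B obtain the same average score (90) but different average ranks (1.5 and 2.0), which illustrates why the two metrics can disagree on the best method.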

We use our results to conduct an analysis over the building blocks of the models. First, using the same feature mapping, injecting tokens inside the LLM in the first layer (DePALM, LiMBeR⁽¹⁾) prevails over inner-layer injection (eP-ALM, DePALM^QP,inner). We also found that cross-attention (DePALM^c-attn) leads to unstable training in most low-data settings. Extracting tokens from different encoder layers (eP-ALM, DePALM^QP,inner, DePALM^c-attn) makes sense with inner-layer injection techniques, but is not sufficient to improve over methods using only tokens from the last layer (LiMBeR variants, DePALM, and DePALM^*,L0 variants). Therefore, we now consider models with last-layer extraction and first-layer injection. For the central mapping block, using a resampler (DePALM^R-*,L0 and DePALM^QP,* variants) to reduce the number of tokens provides a trade-off between efficiency and performance, compared to injecting all tokens (LiMBeR^(all)) or just one (LiMBeR⁽¹⁾ and eP-ALM). The QPMapper used over all feature tokens (DePALM and DePALM^QP,inner) provides the best trade-off, while the local resamplers (DePALM^R-*,L0) that preserve spatial feature structure lag behind or do not consistently achieve high scores.

Overall, DePALM and LiMBeR^(all) achieve the best performance, reaching the best average rank and score, respectively. In terms of training speed, however, DePALM is almost $4\times$ more efficient, due to the small number of visual tokens injected in the LLM. Its training cost is similar to that of the most efficient approaches, eP-ALM and LiMBeR⁽¹⁾, which inject only one visual token, while it significantly outperforms them.

Qualitative results.

We give some qualitative results of our model for multiple multimodal tasks in Figure 2. We can notice that the model adapts to answer in the style corresponding to the dataset, and has notions of real-world objects, being able to identify colors, animals and objects.

Table 5: Comparison of different visual backbones, with a fixed LLM (left) and with different LLMs (right). We show the CIDEr score on COCO, the validation accuracy on OKVQA, and the SPICE score on AudioCaps. For reference, we add the ImageNet [16] Top1 score of each visual backbone, and ARC for each LLM, that measures textual question answering capabilities. The results are averaged over 3 runs.

Visual backbone	COCO DePALM	COCO LiMBeR⁽¹⁾	OKVQA DePALM	ImageNet Top1
DINOv2-S [63]	118.26	100.63	33.64	81.1%
DINOv2-B [63]	125.42	106.12	34.82	84.5%
DINOv2-L [63]	126.95	107.17	31.81	86.3%
DINOv2-G [63]	127.49	110.58	35.52	86.5%
ViT-L [77]	118.59	106.49	36.11	85.6%
CLIP-ViT-B [65]	121.93	111.88	36.68	68.6%
CLIP-ViT-L [65]	128.69	116.80	37.27	75.3%
EVA-CLIP-L [24]	130.66	123.20	37.13	79.8%

LLM backbone	COCO DePALM	COCO LiMBeR⁽¹⁾	AudioCaps DePALM	ARC
OPT-125M [92]	126.88	102.45	41.82	22.87
OPT-1.3B [92]	129.41	112.43	42.77	29.52
OPT-2.7B [92]	125.75	115.81	43.35	33.96
OPT-6.7B [92]	131.64	117.51	43.83	39.16
LLaMA-7B [82]	130.73	123.12	42.48	51.02
Vicuna-7B [11]	125.66	111.53	21.79	53.24

4.3 Analysis and ablation study

Text-aligned perceptual features adapt better to LLMs.

We investigate the influence of the perceptual backbones on the overall performance. In Table 5 (left) we compare visual encoders of varying sizes and training paradigms on image captioning and visual question answering datasets. Within a model family (see DINOv2 and CLIP-ViT), larger models generally perform better. Self-supervised encoders (DINOv2) performed better than supervised ones (ViT) for image captioning, but the reverse was true for OKVQA. Finally, vision-language pre-training of the encoders (CLIP) performs best across all tested settings. This reveals that using existing text-aligned perceptual encoders makes the cross-modal interaction between the encoder and LLMs more effective. Overall, models with better feature quality (higher ImageNet score) improve our results, with a large boost when there is a pre-existing alignment with text.

Better LLMs are not always better for multimodality.

Next, in Table 5 (right), we compare LLMs of different model sizes and pretraining data sizes, and consider the impact on image and audio captioning results. We find a clear positive correlation between the LLM size and the score for the OPT models, similar to the ARC metric [14], which measures textual question-answering capabilities of the LLMs. However, when comparing LLMs with similar model sizes in the 7B range, we do not see a clear improvement when using LLMs pretrained on more data (LLaMA), nor when fine-tuning on language instructions (Vicuna), contrary to observations for the ARC metric.

Parameter and data efficiency.

We consider COCO captioning performance as a function of the number of trainable parameters in Figure 3 (left), for a set of diverse methods: we include the two best architectures (LiMBeR^(all) and DePALM) and add models using different injection or mapping mechanisms (DePALM^QP,inner, eP-ALM, MAPL) to cover a range of behaviors. We focus on the low-data training regime by using only 1% of the training set. To vary the number of trainable parameters, we do the following: for MAPL and DePALM we change the hidden dimension of the QPMapper; for LiMBeR^(all) and eP-ALM, we replace the linear feature projection by a two-layer projection (MLP_2) with a bottleneck of variable dimension.

First, we observe that for most of the considered parameter range, LiMBeR^(all) and DePALM yield the best performance, consistent with earlier experiments. Second, we do not observe strong overfitting for any of the methods, suggesting that in this small data regime the type of interfacing mechanism is more important for performance than the number of parameters.

We investigate data efficiency by varying the training data size from 0.12% to 100% in Figure 3 (right). All methods scale similarly well with the number of training examples, with LiMBeR^(all) and DePALM yielding optimal performance across all data sizes. We find that training only on 10% of data achieves roughly 90% of the final performance, validating the data-efficiency of these methods.

Table 6: Comparison with state-of-the-art LLM augmentation methods. The DePALM results are averaged over 3 runs. We highlight the best results for each category in underlined bold.

$\dagger$: uses the standard split instead of the Karpathy one. Note that only the results in the last group of parameter-efficient methods are directly comparable to ours.

Method	COCO (CIDEr)	COCO 1% (CIDEr)	VQAv2 (Val)	OKVQA (Val)	MSRVTT (CIDEr)
Large-scale methods in few-shot mode
Flamingo [1] (32-shot)	113.8	–	67.6	57.8	–
BLIP-2 [43] (0-shot)	121.6	–	–	45.9	–
Large-scale methods finetuned on target task
Flamingo [1]	138.1 ${}^{\dagger}$	–	82.1 ${}^{\dagger}$	–	–
BLIP-2 [43]	145.8	–	82.30 ${}^{\dagger}$	–	–
UnIVAL [73]	128	–	73.24	45.7	56.3
NExT-GPT [87]	156.7	–	–	–	–
IDEFICS 80B Instruct [41] (32-shot)	123.2	–	68.8	59.5	–
Qwen-VL (7B) [3]	–	–	79.5	58.6	–
Parameter-efficient methods for LLM augmentation
MAPL [58]	125.2	65.9	43.5	31.6	–
VL-Adapter [79]	116	–	65.9	–	–
eP-ALM [72]	111.6	–	54.9	–	48.8
DePALM (ours)	131.3	87.1	70.1	37.7	49.9

Comparison with the state-of-the-art.

In Table 6 we compare our results with state-of-the-art approaches, including large-scale ones. We compare only with models with at least one top score. For this comparison, we use DePALM with the backbones listed in Table 2. Our approach outperforms all parameter-efficient approaches (bottom part of the table) such as eP-ALM and MAPL. We significantly reduce the performance gap of parameter-efficient approaches with respect to large-scale models that are fine-tuned on the target task (middle part of the table). We also compare our approach to generalist models that do not require finetuning (top part), showing that we compete with and sometimes outperform them. While these models are not directly comparable to ours, the results show that finetuning can significantly boost performance, and DePALM emerges as a promising and efficient approach that does not require large-scale pretraining.

5 Discussion and conclusion

Small vs. large-scale setups.

This work focuses on adapting LLMs for multimodal tasks, with a focus on efficiency along three main axes: (a) training set size, (b) number of trainable parameters, and (c) amount of compute. This makes it possible to obtain LLM-based solutions significantly faster and more affordably. Importantly, it streamlines the adoption of stronger LLMs and perceptual foundation models that are continuously released.

A different approach is to go large scale along these three axes, with the objective to obtain good performance across many datasets [51, 3, 52, 43, 1, 8]. This usually requires conducting pretraining, followed by instruction tuning, and even single-task finetuning when targeting a particular dataset. While this approach is more generalist, it requires enormous resources in terms of data and compute. Nonetheless, we believe that both setups are worth pursuing and are complementary, paving the way for very effective multimodal models, spanning a wide range of setups.

Limitations.

While this work achieves large improvements compared to previous efficient approaches, there is still room for improvement, especially regarding harder tasks such as OK-VQA or those requiring reasoning [56]. We believe the proposed framework will be a good basis for developing more effective approaches in the future. Besides, this work focuses mainly on performance and efficiency. However, there are other axes that should be considered before deployment, in particular safety issues such as hallucinations [74, 49], abstention [15], harmfulness, or the broader objective of aligning these models to human preferences [78].

Conclusion.

We presented a systematic comparative study of mechanisms to interface perceptual backbones (for image, video and audio data) with large language models to address tasks such as captioning and question answering. We focused on parameter-efficient approaches, which leave the LLM and feature backbone unchanged, and can be trained on limited training sets. We conducted extensive experiments on different datasets and tasks in which we evaluated both existing and new mechanisms, considered different choices for the perceptual backbones and language models, and tuned hyperparameters for all methods in a fair manner. We find improved results compared to previously reported ones, even when using the same existing interfacing mechanisms. In general, our study shows that most of the improvement comes from better perceptual encoders, especially text-aligned ones, rather than from more powerful LLMs. We also find that simple design choices work best, such as passing all perceptual tokens at the input to the LLM, or using transformer-based token pooling mechanisms for efficiency. Moreover, we find that our proposed DePALM mechanism, which compresses tokens from the perceptual backbone to a few “summary tokens” to inject into the LLM, yields on par or better results than existing approaches, while being $4\times$ faster to train than the second best method.

References

[1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J.L., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Bińkowski, M.a., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a visual language model for few-shot learning. In: NeurIPS (2022)
[2] Anderson, P., Fernando, B., Johnson, M., Gould, S.: Spice: Semantic propositional image caption evaluation. In: ECCV (2016)
[3] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint 2308.12966 (2023)
[4] Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML (2021)
[5] Brock, A., De, S., Smith, S.L., Simonyan, K.: High-performance large-scale image recognition without normalization. In: ICML (2021)
[6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: NeurIPS (2020)
[7] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
[8] Chen, X., Djolonga, J., Padlewski, P., Mustafa, B., Changpinyo, S., Wu, J., Ruiz, C.R., Goodman, S., Wang, X., Tay, Y., et al.: PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint 2305.18565 (2023)
[9] Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A.J., Padlewski, P., Salz, D.M., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J., Ding, N., Rong, K., Akbari, H., Mishra, G., Xue, L., Thapliyal, A.V., Bradbury, J., Kuo, W., Seyedhosseini, M., Jia, C., Ayan, B.K., Riquelme, C., Steiner, A., Angelova, A., Zhai, X., Houlsby, N., Soricut, R.: PaLI: A jointly-scaled multilingual language-image model. In: ICLR (2022)
[10] Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: UNITER: UNiversal Image-TExt Representation Learning. In: ECCV (2020)
[11] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality (March 2023), https://lmsys.org/blog/2023-03-30-vicuna/
[12] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: PaLM: Scaling language modeling with pathways. JMLR 24 (2023)
[13] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S.S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Valter, D., Narang, S., Mishra, G., Yu, A.W., Zhao, V., Huang, Y., Dai, A.M., Yu, H., Petrov, S., hsin Chi, E.H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q.V., Wei, J.: Scaling instruction-finetuned language models. arXiv preprint 2210.11416 (2022)
[14] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint 1803.05457 (2018)
[15] Dancette, C., Whitehead, S., Maheshwary, R., Vedantam, R., Scherer, S., Chen, X., Cord, M., Rohrbach, M.: Improving selective visual question answering by learning from your peers. In: CVPR (2023)
[16] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
[17] Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evaluation for any target language. In: EACL Workshop on Statistical Machine Translation (2014)
[18] Ding, N., Qin, Y., Yang, G., Wei, F., Yang, Z., Su, Y., Hu, S., Chen, Y., Chan, C.M., Chen, W., et al.: Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models. arXiv preprint 2203.06904 (2022)
[19] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
[20] Dou, Z.Y., Xu, Y., Gan, Z., Wang, J., Wang, S., Wang, L., Zhu, C., Zhang, P., Yuan, L., Peng, N., et al.: An empirical study of training end-to-end vision-and-language transformers. In: CVPR (2022)
[21] Douillard, A., Ramé, A., Couairon, G., Cord, M.: DyTox: Transformers for continual learning with dynamic token expansion. In: CVPR (2022)
[22] Eichenberg, C., Black, S., Weinbach, S., Parcalabescu, L., Frank, A.: Magma – multimodal augmentation of generative models through adapter-based finetuning. In: EMNLP (2022)
[23] Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: CLAP: learning audio concepts from natural language supervision. In: ICASSP (2023)
[24] Fang, Y., Wang, W., Xie, B., Sun, Q.S., Wu, L.Y., Wang, X., Huang, T., Wang, X., Cao, Y.: EVA: Exploring the limits of masked visual representation learning at scale. In: CVPR (2022)
[25] Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., Li, H., Qiao, Y.J.: LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint 2304.15010 (2023)
[26] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio set: An ontology and human-labeled dataset for audio events. In: ICASSP (2017)
[27] Gong, Y., Chung, Y.A., Glass, J.R.: AST: Audio spectrogram transformer. arXiv preprint 2104.01778 (2021)
[28] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in VQA matter: Elevating the role of image understanding in visual question answering. In: CVPR (2017)
[29] Han, J., Zhang, R., Shao, W., Gao, P., Xu, P., Xiao, H., Zhang, K., Liu, C., Wen, S., Guo, Z., Lu, X., Ren, S., Wen, Y., Chen, X., Yue, X., Li, H., Qiao, Y.J.: ImageBind-LLM: Multi-modality instruction tuning. arXiv preprint 2309.03905 (2023)
[30] He, X., Chen, S., Ma, F., Huang, Z., Jin, X., Liu, Z., Fu, D., Yang, Y., Liu, J., Feng, J.: VLAB: Enhancing video language pre-training by feature adapting and blending. arXiv preprint 2305.13167 (2023)
[31] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., Salimans, T.: Imagen video: High definition video generation with diffusion models. arXiv preprint 2210.02303 (2022)
[32] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.d.L., Hendricks, L.A., Welbl, J., Clark, A., et al.: Training compute-optimal large language models. arXiv preprint 2203.15556 (2022)
[33] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J.W., Vinyals, O., Sifre, L.: Training compute-optimal large language models. In: NeurIPS (2022)
[34] Hu, J.E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Chen, W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)
[35] Hu, J., Cavicchioli, R., Capotondi, A.: ExpansionNet v2: Block static expansion in fast end to end training for image captioning. arXiv preprint 2208.06551 (2022)
[36] Huang, P.Y., Sharma, V., Xu, H., Ryali, C.K., Fan, H., Li, Y., Li, S.W., Ghosh, G., Malik, J., Feichtenhofer, C.: MAViL: Masked audio-video learners. In: NeurIPS (2023)
[37] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
[38] Kim, C.D., Kim, B., Lee, H., Kim, G.: AudioCaps: Generating captions for audios in the wild. In: NAACL-HLT (2019)
[39] Koh, J.Y., Salakhutdinov, R., Fried, D.: Grounding language models to images for multimodal inputs and outputs. In: ICML (2023)
[40] Labbé, E., Pellegrini, T., Pinquier, J.: CoNeTTE: An efficient audio captioning system leveraging multiple datasets with task embedding. arXiv preprint 2309.00454 (2023)
[41] Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A., Kiela, D., et al.: Obelics: An open web-scale filtered dataset of interleaved image-text documents. In: NeurIPS (2023)
[42] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., rahman Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: ACL (2019)
[43] Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint 2301.12597 (2023)
[44] Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML (2022)
[45] Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S., Xiong, C., Hoi, S.: Align before fuse: Vision and language representation learning with momentum distillation. In: NeurIPS (2021)
[46] Liang, S., Zhao, M., Schütze, H.: Modular and parameter-efficient multimodal fusion with prompting. In: ACL (2022)
[47] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV (2014)
[48] Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al.: Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint 2311.07575 (2023)
[49] Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Aligning large multi-modal model with robust instruction tuning. arXiv preprint 2306.14565 (2023)
[50] Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., Plumbley, M.D.: AudioLDM: Text-to-audio generation with latent diffusion models. In: ICML (2023)
[51] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint 2310.03744 (2023)
[52] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2024)
[53] Liu, S., Zhu, Z., Ye, N., Guadarrama, S., Murphy, K.P.: Improved image captioning via policy gradient optimization of SPIDEr. In: ICCV (2017)
[54] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2017)
[55] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: ICLR (2017)
[56] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: NeurIPS (2022)
[57] Lyu, C., Wu, M., Wang, L., Huang, X., Liu, B., Du, Z., Shi, S., Tu, Z.: Macaw-LLM: Multi-modal language modeling with image, audio, video, and text integration. arXiv preprint 2306.09093 (2023)
[58] Mañas, O., López, P.R., Ahmadi, S., Nematzadeh, A., Goyal, Y., Agrawal, A.: MAPL: Parameter-efficient adaptation of unimodal pre-trained models for vision-language few-shot prompting. In: EACL (2023)
[59] Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: OK-VQA: A visual question answering benchmark requiring external knowledge. In: CVPR (2019)
[60] Merullo, J., Castricato, L., Eickhoff, C., Pavlick, E.J.: Linearly mapping from image to text space. In: ICLR (2023)
[61] Mokady, R.: ClipCap: CLIP prefix for image captioning. arXiv preprint 2111.09734 (2021)
[62] Moon, S., Madotto, A., Lin, Z., Nagarajan, T., Smith, M., Jain, S., Yeh, C.F., Murugesan, P., Heidari, P., Liu, Y., Srinet, K., Damavandi, B., Kumar, A.: AnyMAL: An efficient and scalable any-modality augmented language model. arXiv preprint 2309.16058 (2023)
[63] Oquab, M., Darcet, T., Moutakanni, T., Vo, H.Q., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M.G., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision. arXiv preprint 2304.07193 (2023)
[64] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: ACL (2002)
[65] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)
[66] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. Tech. rep., OpenAI (2019)
[67] Raffel, C., Shazeer, N.M., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21 (2020)
[68] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint 2204.06125 (2022)
[69] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools. arXiv preprint 2302.04761 (2023)
[70] Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R.: A-OKVQA: A benchmark for visual question answering using world knowledge. In: ECCV (2022)
[71] Shukor, M., Couairon, G., Cord, M.: Efficient vision-language pretraining with visual concepts and hierarchical alignment. In: BMVC (2022)
[72] Shukor, M., Dancette, C., Cord, M.: eP-ALM: Efficient perceptual augmentation of language models. In: ICCV (2023)
[73] Shukor, M., Dancette, C., Ramé, A., Cord, M.: Unified model for image, video, audio and language tasks. arXiv preprint 2307.16184 (2023)
[74] Shukor, M., Rame, A., Dancette, C., Cord, M.: Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning. In: ICLR (2024)
[75] Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: TextCaps: a dataset for image captioning with reading comprehension. In: ECCV (2020)
[76] Singh, A., Natarjan, V., Shah, M., Jiang, Y., Chen, X., Parikh, D., Rohrbach, M.: Towards VQA models that can read. In: CVPR (2019)
[77] Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. TMLR (2022)
[78] Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L.Y., Wang, Y.X., Yang, Y., et al.: Aligning large multimodal models with factually augmented RLHF. arXiv preprint 2309.14525 (2023)
[79] Sung, Y.L., Cho, J., Bansal, M.: VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. In: CVPR (2022)
[80] Tan, H.H., Bansal, M.: LXMERT: Learning cross-modality encoder representations from transformers. In: EMNLP (2019)
[81] Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: ICCV (2021)
[82] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and efficient foundation language models. arXiv preprint 2302.13971 (2023)
[83] Touvron, H., Martin, L., Stone, K.R., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D.M., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A.S., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I.M., Korenev, A.V., Koura, P.S., Lachaux, M.A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X., Tang, B., Taylor, R., Williams, A., Kuan, J.X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., Scialom, T.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint 2307.09288 (2023)
[84] Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S.M.A., Vinyals, O., Hill, F.: Multimodal few-shot learning with frozen language models. In: NeurIPS (2021)
[85] Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: Consensus-based image description evaluation. In: CVPR (2014)
[86] Wang, B., Komatsuzaki, A.: GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax (May 2021)
[87] Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.S.: NExT-GPT: Any-to-any multimodal LLM. arXiv preprint 2309.05519 (2023)
[88] Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: A large video description dataset for bridging video and language. In: CVPR (2016)
[89] Zeng, A., Attarian, M., Ichter, B., Choromanski, K., Wong, A., Welker, S., Tombari, F., Purohit, A., Ryoo, M., Sindhwani, V., et al.: Socratic models: Composing zero-shot multimodal reasoning with language. In: ICLR (2023)
[90] Zhang, B., Sennrich, R.: Root mean square layer normalization. In: NeurIPS (2019)
[91] Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., Qiao, Y.J.: LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint 2303.16199 (2023)
[92] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M.T., Li, X., Lin, X.V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P.S., Sridhar, A., Wang, T., Zettlemoyer, L.: Opt: Open pre-trained transformer language models. arXiv preprint 2205.01068 (2022)

Appendix A Assets and licensing information

In Table S1, we list the datasets and pre-trained models we use for our experiments. We provide links to the repositories and their licenses.

Name	Link	License
COCO [47]	https://cocodataset.org	CC BY 4.0
TextCaps [75]	https://textvqa.org/textcaps/	CC BY 4.0
VQAv2 [28]	https://visualqa.org/	CC BY 4.0
TextVQA [76]	https://textvqa.org/	CC BY 4.0
OKVQA [59]	https://okvqa.allenai.org/	Unknown
AOKVQA [70]	https://allenai.org/project/a-okvqa	Unknown
AudioSet [26]	https://research.google.com/audioset/	CC BY 4.0
AudioCaps [38]	https://audiocaps.github.io/	MIT
MSRVTT [88]	Microsoft website	Unknown
CLIP [65]	https://github.com/openai/CLIP	Unknown
EVA-CLIP [24]	https://github.com/baaivision/EVA/tree/master/EVA-CLIP	MIT
DINOv2 [63]	https://github.com/facebookresearch/dinov2	Apache License 2.0
ViT-L [77]	https://github.com/huggingface/pytorch-image-models	Apache License 2.0
TimeSformer [4]	https://github.com/facebookresearch/TimeSformer	CC BY 4.0
OPT [92]	https://github.com/facebookresearch/metaseq	MIT
LLaMA [82]	https://github.com/facebookresearch/llama	llama license
Llama2 [83]	https://ai.meta.com/llama/	llama license
Vicuna [11]	https://lmsys.org/blog/2023-03-30-vicuna/	llama license

Table S1: Links to the assets used in the paper, and their respective licenses.

Appendix B Building blocks of our framework

In this section, we provide more details about the different blocks we use to implement existing baseline models, as well as our DePALM models. We assume a feature extractor model with tokens of dimension $d_{\text{feats}}$, and an LLM with tokens of dimension $d_{\text{LLM}}$.

B.1 Feature extraction

The design of feature extraction is based on two decisions: the number $n_{\text{fl}}$ of feature levels, and whether we keep all tokens, or only the CLS token. We take the output of the last $n_{\text{fl}}$ transformer layers of the feature extractor (the image, video or audio backbone). It gives us an output of dimension $(n_{\text{fl}},n_{\text{tk}}+1,d_{\text{feats}})$ , where $n_{\text{tk}}$ is the number of patch tokens, and $d_{\text{feats}}$ the embedding dimension of the feature model (perceptual encoders). When only extracting the CLS token, the output dimension becomes $(n_{\text{fl}},1,d_{\text{feats}})$ (the patch tokens are removed).

A special case is added for the MAViL model, where the CLS token is replaced by the mean of all patch tokens of the same level, but only when we do not keep the patch tokens.
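The extraction rules above can be illustrated with a minimal sketch on plain Python lists. It assumes each backbone layer returns its CLS token first, followed by the patch tokens; the function and argument names (`extract_features`, `keep_patches`, `mean_as_cls`) are hypothetical and not taken from the original implementation.

```python
from typing import List

Token = List[float]   # one embedding vector of size d_feats
Level = List[Token]   # assumed layout: [CLS, patch_1, ..., patch_n_tk]


def mean_token(tokens: List[Token]) -> Token:
    """Element-wise mean of a non-empty list of equally sized tokens."""
    n = len(tokens)
    return [sum(values) / n for values in zip(*tokens)]


def extract_features(layer_outputs: List[Level], n_fl: int,
                     keep_patches: bool, mean_as_cls: bool = False) -> List[Level]:
    """Return the last n_fl levels, with all tokens or only one token per level.

    mean_as_cls mimics the MAViL special case: when patch tokens are dropped,
    the CLS token is replaced by the mean of the patch tokens of that level.
    """
    if n_fl < 1:
        raise ValueError("n_fl must be at least 1")
    levels = layer_outputs[-n_fl:]
    if keep_patches:
        return [list(level) for level in levels]            # (n_fl, n_tk + 1, d_feats)
    if mean_as_cls:
        return [[mean_token(level[1:])] for level in levels]  # (n_fl, 1, d_feats)
    return [[level[0]] for level in levels]                 # (n_fl, 1, d_feats)
```

For a toy backbone with three layers, two patch tokens and $d_{\text{feats}}=2$, calling `extract_features(layers, 2, keep_patches=False)` returns two levels holding one token each.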

B.2 Feature injection

First-layer token injection.

Here $n_{\text{fl}}=1$ . The feature tokens are prepended to the sequence of textual token embeddings (including “BOS” token for OPT). They are propagated through the LLM, and removed from its final output. We use a causal attention mask, where each token can only attend to previous ones, including in-between inserted perceptual tokens.
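As a rough illustration of first-layer injection, the sketch below prepends the perceptual tokens, builds a standard causal mask in which each position attends to itself and to earlier positions, and drops the perceptual positions from the output. The helper names are hypothetical, and placing all perceptual tokens before the whole text sequence is one reading of the description above.

```python
from typing import List, Sequence, Tuple


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def inject_first_layer(perceptual: Sequence, text: Sequence) -> Tuple[list, List[List[bool]]]:
    """Prepend perceptual tokens to the text embeddings and build the mask."""
    sequence = list(perceptual) + list(text)
    return sequence, causal_mask(len(sequence))


def strip_perceptual(outputs: Sequence, n_perceptual: int) -> list:
    """Drop the positions of the perceptual tokens from the LLM output."""
    return list(outputs[n_perceptual:])
```

Because the perceptual tokens come first, every text position can attend to all of them, while the perceptual tokens never attend to the text.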

Inner-layers token injection.

Here we only require $n_{\text{fl}}\geqslant 1$. This method is additionally parametrized by the number of LLM layers $n_{\text{LLM}}$ into which we inject tokens, and the number of left-out layers at the end, $n_{\text{left}}$. Denoting by $L_{\text{LLM}}$ the total number of LLM layers, for each layer $i\in\{L_{\text{LLM}}-n_{\text{LLM}}-n_{\text{left}},\dots,L_{\text{LLM}}-n_{\text{left}}-1\}$, we inject feature tokens extracted from level $l_{i}=\lfloor\frac{(i-L_{\text{LLM}})\cdot n_{\text{fl}}}{n_{\text{LLM}}}\rfloor$. For injection, we follow the same procedure as for first-layer token injection, where the feature tokens are prepended to the input sequence of layer $i$. Additionally, we remove them from the output sequence of this layer.
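The layer-to-level assignment can be written out as a literal transcription of the formula above. The resulting level indices are negative; reading them as offsets from the last extracted level (so that $-1$ is the final layer of the backbone) is an assumption of this sketch, as are the function name and the 24-layer LLM used in the example.

```python
from typing import Dict


def injection_levels(n_layers: int, n_llm: int, n_left: int, n_fl: int) -> Dict[int, int]:
    """Map each injected LLM layer index i to the feature level l_i.

    n_layers is the total number of LLM layers. Integer floor division
    rounds toward negative infinity, matching the floor in the formula.
    """
    first = n_layers - n_llm - n_left
    last = n_layers - n_left - 1
    return {i: ((i - n_layers) * n_fl) // n_llm for i in range(first, last + 1)}
```

With `n_layers=24`, `n_llm=12`, `n_left=0` and `n_fl=6`, consecutive pairs of LLM layers share one feature level, from level $-6$ for layers 12 and 13 up to level $-1$ for layers 22 and 23. The sketch makes no claim about how the original implementation treats other settings of $n_{\text{left}}$.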

B.3 Feature mapping

QPMapper.

This mapping block is parametrized by the number of layers $L_{\textrm{QP}}$ and the number of query tokens $n_{\textrm{Q}}$. It takes as input a sequence of tokens of dimension $d_{\textrm{embed}}$. These tokens are concatenated with $n_{\textrm{Q}}$ query tokens, which are learnable parameters. The resulting sequence is then passed through a stack of $L_{\textrm{QP}}$ standard transformer encoder layers. We use a dropout of $0.1$, an embedding dimension of $d_{\textrm{embed}}$, the GELU activation and 8 attention heads. Only the last $n_{\textrm{Q}}$ output tokens, corresponding to the query tokens, are kept as output.
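The input/output contract of this block can be sketched independently of any deep-learning library. In the sketch below, each transformer encoder layer is replaced by an arbitrary length-preserving callable, which is a stand-in and not the actual layer; `qp_mapper` and its arguments are hypothetical names.

```python
from typing import Callable, List, Sequence

LayerFn = Callable[[list], list]   # stand-in for one transformer encoder layer


def qp_mapper(tokens: Sequence, queries: Sequence, layers: List[LayerFn]) -> list:
    """Append the learnable queries, run the layer stack, keep the query outputs."""
    n_q = len(queries)
    if n_q == 0:
        raise ValueError("at least one query token is required")
    sequence = list(tokens) + list(queries)   # queries occupy the last n_q slots
    for layer in layers:
        sequence = layer(sequence)            # each layer must preserve the length
    return sequence[-n_q:]
```

Whatever the number of input tokens, the output always has $n_{\textrm{Q}}$ tokens, which is what keeps the number of tokens passed on to the LLM small and fixed.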

Block-based token resamplers.

Some of the resamplers use a common framework, based on local blocks of patches. This framework is parametrized by an embedding dimension $d_{\textrm{embed}}$ and a pooling function. The extracted feature tokens are first projected from dimension $d_{\textrm{feats}}$ to $d_{\textrm{embed}}$ using a linear layer. The patch tokens are then arranged on a 1D, 2D or 3D grid, depending on the modality, and grouped into blocks of dimension $4$ (1D case) or $2\times 2$ (2D case). Each block is pooled using the pooling function, resulting in a single token. This is similar to using a 1D or 2D pooling operation on the grid. The tokens are then rearranged as a sequence again, to which the CLS token is prepended, before being normalized using RMSNorm, and finally projected with a linear layer from $d_{\textrm{embed}}$ to $d_{\textrm{LLM}}$.
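The block-pooling step can be sketched as follows, using mean pooling as one possible pooling function. The sketch covers only the grouping, pooling and CLS-prepending steps; the linear projections and the RMSNorm are omitted. It assumes grid sides divisible by 2, a 1D length divisible by 4, and a row-major output order; all names are hypothetical.

```python
from typing import Callable, List

Token = List[float]


def mean_pool(block: List[Token]) -> Token:
    """Average the tokens of one block element-wise."""
    return [sum(values) / len(block) for values in zip(*block)]


def pool_1d(tokens: List[Token], pool: Callable = mean_pool, size: int = 4) -> List[Token]:
    """Pool consecutive groups of `size` tokens (1D case)."""
    return [pool(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def pool_2d(grid: List[List[Token]], pool: Callable = mean_pool) -> List[Token]:
    """Pool non-overlapping 2x2 blocks of a token grid, in row-major order."""
    pooled = []
    for r in range(0, len(grid), 2):
        for c in range(0, len(grid[0]), 2):
            block = [grid[r + dr][c + dc] for dr in (0, 1) for dc in (0, 1)]
            pooled.append(pool(block))
    return pooled


def resample(cls_token: Token, grid: List[List[Token]]) -> List[Token]:
    """Pooled patch tokens as a sequence, with the CLS token prepended."""
    return [cls_token] + pool_2d(grid)
```

Both cases reduce the number of patch tokens by a factor of 4, so a $2\times 4$ grid of patches becomes two pooled tokens plus the CLS token.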

B.4 Fine-tuning mechanism

Prompt-tuning.

is a parameter-efficient fine-tuning mechanism, parametrized by $n_{\textrm{pt}}$ , the number of learned tokens. When used, $n_{\textrm{pt}}$ constant embedding vectors of dimension $d_{\textrm{LLM}}$ are learned and prepended before the textual tokens (and after the perceptual tokens) at the beginning of LLMs. We also use a causal padding attention mask with these tokens.

Appendix C Implementation details

C.1 Reproducing existing baseline

LiMBeR^(all):

we use the feature extraction with $n_{\text{fl}}=1$ level, project all the feature tokens from dimension $d_{\text{feats}}$ to $d_{\text{LLM}}$ using a linear layer, and use first-layer token injection.

LiMBeR⁽¹⁾:

we use the same mechanism as for LiMBeR^(all), but only keep the single CLS token.

MAPL:

we use the feature extraction with $n_{\text{fl}}=1$ level, and a feature mapping block consisting of a linear projection from dimension $d_{\text{feats}}$ to $d_{\text{embed}}=256$, followed by a QPMapper using $L_{\textrm{QP}}=4$ layers and $n_{\textrm{Q}}=32$ query tokens, and then a linear projection from $d_{\text{embed}}$ to the LLM inner dimension $d_{\text{LLM}}$. We then insert tokens with the first-layer injection mechanism.

eP-ALM:

we extract $n_{\text{fl}}=6$ levels of feature tokens, and only keep the CLS token from each level. We project each one from dimension $d_{\text{feats}}$ to $d_{\text{LLM}}$ with the same linear layer. Mapped tokens are inserted into $n_{\text{LLM}}=12$ inner layers, leaving out the last one ( $n_{\text{left}}=1$ ).

C.2 DePALM variants

Our DePALM and DePALM^*,L0 methods use the following blocks:

• Feature extraction from $n_{\text{fl}}=1$ level.
• First-layer injection of tokens, after the mapping block.
• Prompt-tuning with $n_{\text{pt}}=1$.

DePALM:

we use a feature map** block consisting of a linear projection from dimension $d_{\textrm{feats}}$ to $d_{\text{embed}}=1024$ , followed by a QPMapper using $L_{\textrm{QP}}=2$ layers and $n_{\textrm{Q}}=32$ query tokens, and a linear projection from $d_{\textrm{embed}}$ to the LLM inner dimension $d_{\textrm{LLM}}$ .

DePALM^R-rand,L0:

we sample a subset of the tokens, using the following procedure. If the model outputs $n_{\text{tk}}$ patch tokens, we keep the CLS token, and $\lfloor fn_{\text{tk}}\rfloor$ uniformly sampled patch tokens, with $f$ the proportion of tokens we keep:

• During training, with a probability of $\frac{1}{10}$, we set $f=f_{\text{max}}$. Otherwise, we sample $f^{\prime}\sim\mathcal{N}(f_{\text{mean}},f_{\text{std}})$ and set $f=\min(\max(f^{\prime},f_{\text{min}}),f_{\text{max}})$ to restrict the values to the interval $[f_{\text{min}};f_{\text{max}}]$.
• During inference, we set $f=f_{\text{max}}$.
• We use $f_{\text{min}}=\frac{1}{16}$, $f_{\text{max}}=\frac{1}{2}$, $f_{\text{mean}}=\frac{1}{4}$ and $f_{\text{std}}=0.2$.

We project the resulting tokens from dimension $d_{\text{feats}}$ to $d_{\text{LLM}}$ , normalize them using the RMSNorm, and project them again with a linear layer ( $d_{\text{LLM}}$ to $d_{\text{LLM}}$ ) before injection inside the LLM.
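The sampling procedure above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, $f_{\text{std}}$ is read as a standard deviation, patch tokens are drawn without replacement, and the kept indices are returned in their original order. The CLS token, which is always kept, is handled outside this sketch.

```python
import math
import random

# Values stated in the text.
F_MIN, F_MAX, F_MEAN, F_STD = 1 / 16, 1 / 2, 1 / 4, 0.2


def sample_keep_fraction(training, rng):
    """Return the proportion f of patch tokens to keep."""
    # Inference always uses f_max; training does so with probability 1/10.
    if not training or rng.random() < 0.1:
        return F_MAX
    # Assumption: f_std is the standard deviation of the Gaussian.
    f_prime = rng.gauss(F_MEAN, F_STD)
    # Clamp to the interval [f_min, f_max].
    return min(max(f_prime, F_MIN), F_MAX)


def sample_patch_indices(n_tk, training, rng):
    """Pick floor(f * n_tk) patch-token indices uniformly at random."""
    f = sample_keep_fraction(training, rng)
    n_keep = math.floor(f * n_tk)
    # Assumption: sampling without replacement, original order preserved.
    return sorted(rng.sample(range(n_tk), n_keep))
```

With 256 patch tokens, for instance, inference keeps 128 of them, while a training step keeps between 16 and 128.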

DePALM^R-linear,L0:

we use a block-based resampler with $d_{\text{emb}}=d_{\text{LLM}}$ . The pooling function is a linear projection from dimension $4\times d_{\text{emb}}$ to $d_{\text{emb}}$ , taking as input the concatenation of all tokens from the same block.

DePALM^{R-QPMapper,L0}:

we use a block-based resampler with $d_{\text{emb}}=768$ . The pooling function is a QPMapper taking as input the 4 tokens of a single block, using $L_{\textrm{QP}}=4$ layers and a single query token ( $n_{\textrm{Q}}=1$ ).

DePALM^R-avgpool,L0:

we use a block-based resampler with $d_{\text{emb}}=d_{\text{LLM}}$ . The pooling function returns the mean of the tokens from the same block.
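As an illustration of the pooling step alone, the sketch below averages each $2\times 2$ block of a 2D token grid in plain Python. It is a simplified sketch, not the authors' implementation: the function name is hypothetical, tokens are represented as lists of floats, the grid height and width are assumed to be even, and the input projection, CLS prepending, RMSNorm and output projection of the block-based framework are omitted.

```python
def avgpool_blocks_2d(grid):
    """Average each 2x2 block of a token grid.

    grid: H x W nested list, each entry a token (list of floats).
    Returns the pooled tokens as a flat sequence in row-major block order.
    """
    height, width = len(grid), len(grid[0])
    pooled = []
    for r in range(0, height, 2):
        for c in range(0, width, 2):
            # The four tokens belonging to the same block.
            block = [grid[r][c], grid[r][c + 1],
                     grid[r + 1][c], grid[r + 1][c + 1]]
            dim = len(block[0])
            # Element-wise mean over the block.
            pooled.append([sum(tok[d] for tok in block) / 4
                           for d in range(dim)])
    return pooled
```

Each block of four tokens becomes one token, so the sequence length is divided by four while the token dimension is unchanged.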

DePALM^QP,inner:

we extract $n_{\text{fl}}=4$ levels of feature tokens, and use the same feature mapping as DePALM. In particular, the mapping block is shared across all token levels, meaning that each sequence of tokens from each level goes through the same projection, with the same weights, but as separate batch elements. Mapped tokens are inserted into $n_{\text{LLM}}=12$ inner layers, with $n_{\text{left}}=3$. We also use prompt-tuning with $n_{\textrm{pt}}=16$.

DePALM^c-attn:

we extract $n_{\text{fl}}=4$ levels of feature tokens, and use a single cross-attention block (detailed below) that performs both feature mapping and injection. Injection takes place into the last $n_{\text{LLM}}=12$ inner layers, leaving out the $n_{\text{left}}=3$ last ones, similarly to the inner-layer injection mechanism, but using cross-attention instead of concatenation.

For the cross-attention block (see Figure S1), we first project the tokens from dimension $d_{\text{feats}}$ to $d_{\text{embed}}=1024$ using the linear projection $P_{\textrm{in}}$. They are then passed through a single transformer layer $\mathrm{Transf}$ (8 heads, dropout of 0.1), acting as a minimal resampler network. We also project the input textual tokens to the dimension $d_{\text{embed}}=1024$, using a two-layer MLP (the FFN block, with a hidden dimension of $d_{\text{embed}}$ and the GELU activation function), and normalize them using RMSNorm. We then use the perceptual tokens as keys and values, and the textual ones as queries, in a cross-attention layer. We project the resulting tokens back to dimension $d_{\text{LLM}}$ using a linear projection, normalize them with RMSNorm, and add them to the input textual tokens using a tanh-gated residual connection. This residual connection takes the form $x^{(k-1)},x^{(k)}_{r},h^{(k)}\mapsto x^{(k-1)}+\tanh(h^{(k)})\times x^{(k)}_{r}$, where $x^{(k-1)}$ are the original textual tokens, $x^{(k)}_{r}$ the output of our cross-attention block, and $h^{(k)}$ a single learned float, initialized at $0$.
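The effect of the zero initialization of the gate can be seen in a minimal sketch of the residual connection alone, written in plain Python on a single token represented as a list of floats. The function name is hypothetical, and the cross-attention computation that produces $x^{(k)}_{r}$ is not reproduced here.

```python
import math


def tanh_gated_residual(x_prev, x_r, h):
    """Compute x_prev + tanh(h) * x_r element-wise.

    x_prev: original textual token, x_r: cross-attention output,
    h: scalar gate (a learned parameter in the paper, a plain float here).
    """
    gate = math.tanh(h)
    return [xp + gate * xr for xp, xr in zip(x_prev, x_r)]
```

With $h^{(k)}=0$ the gate is $\tanh(0)=0$, so the block initially returns the textual tokens unchanged and the LLM starts training from its original behavior; the perceptual signal is mixed in only as the gate moves away from zero.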

C.3 Training

General training.

We train each model on a single node using eight V100-32G GPUs, using the following method:

• Optimizer: we use AdamW, with a weight decay of 0.1 and a default learning rate of $\alpha_{\text{max}}=8\cdot 10^{-4}$, which we further adapt for each experiment.
• Gradient clipping: we use a clipping value of $g_{\text{clip}}=0.8$, which we further adapt for each experiment.
• Learning rate scheduler: we set a minimum learning rate of $\alpha_{\text{min}}=\alpha_{\text{max}}\cdot 10^{-4}$, with a cosine scheduler. During the first 20% of all iteration steps, we linearly warm up the effective learning rate from $\alpha_{\text{min}}$ to $\alpha_{\text{max}}$, then use the cosine scheduler to decrease it from $\alpha_{\text{max}}$ to $\alpha_{\text{min}}$.
• Batch size: we use a batch size of 16 on each GPU, for an effective batch size of 128. In experiments where memory is an issue, we use gradient accumulation to keep the same effective batch size.
• Epochs: we use a base number of 8 epochs, and increase it on small datasets: we use 12 epochs on AudioCaps, 20 on TextCaps, OKVQA, AOKVQA and TextVQA, and 30 epochs on the two 1% settings.
• Loss: we compute the loss only on the generated text. Additionally, we use a label smoothing value of $2\cdot 10^{-3}$ in the cross-entropy loss of the LLM.
• Float precision: we load pre-trained models in float16, and train new weights in float32, using mixed precision.
• Duplicate inputs: we group together all training samples with the same perceptual and text input but different outputs. During training, we select the target output for the loss randomly.
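The learning rate schedule described in the list above can be sketched as a function of the training step. This is a hedged illustration rather than the authors' code: the function name and arguments are hypothetical, and the decay is assumed to follow a single half-period of the cosine from the end of warmup to the last step.

```python
import math


def learning_rate(step, total_steps, lr_max=8e-4, warmup_frac=0.2):
    """Linear warmup followed by cosine decay, between lr_min and lr_max."""
    lr_min = lr_max * 1e-4  # alpha_min = alpha_max * 10^-4
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear warmup from lr_min to lr_max over the first 20% of steps.
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    # Assumed shape: half-cosine from lr_max down to lr_min.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

For a run of 1000 steps, the rate starts at $\alpha_{\text{min}}$, reaches $\alpha_{\text{max}}$ at step 200, and returns to $\alpha_{\text{min}}$ at the final step.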

Grid search.

As each setting has different training dynamics, which can be very sensitive to the learning rate, we use a grid search over a few values that we experimentally found to work well. We start by sweeping over the learning rate values $\alpha_{\text{max}}\in\{1\cdot 10^{-3},8\cdot 10^{-4},4\cdot 10^{-4}\}$, and set the gradient clipping parameter to $g_{\text{clip}}=0.8$. For each experiment where the score significantly increases or decreases between the three runs, or where the results are noticeably lower than for other methods or previous experiments, we further experiment with $g_{\text{clip}}=0.8$ for the same sweep over learning rates. We perform this sweep for each method-dataset pair, including baseline reproductions.

We also note that we used grid search on each parameter of our models (number of layers, etc.) on the COCO dataset, to confirm that they are at least a local optimum.

Data augmentation.

We use random augmentation during training of each model. Each time a training sample is used during the training loop, a random modification is applied, without increasing the dataset size.

• Images: the normalized image is resized to $128\times 128$ with a random scale in $[0.5,1]$ and a random aspect ratio in $[\frac{3}{4},\frac{4}{3}]$, and horizontally flipped with probability $0.5$. Data is then augmented with the modified RandAugment procedure [45].
• Audio: we augment the dataset using frequency masking with maximum length of 24, and time masking with maximum length of 96, after normalizing the audio.
• Video: we normalize the videos and apply the same random scaling and flipping as for images, followed by the default RandAugment procedure implemented in PyTorch.

Special cases:

due to memory limitations, when training LiMBeR⁽¹⁾ and DePALM^R-rand,L0 on MSRVTT, we average the patch tokens along the time dimension to reduce their number. This yields a representation with the dimension of the embedding of a single frame.

Appendix D Experimental results

Method	COCO		COCO (1% data)		TextCaps		VQAv2		VQAv2 (1% data)		TextVQA	OKVQA	AOKVQA
Method	B@4	CIDEr	B@4	CIDEr	B@4	CIDEr	Val	Test	Val	Test	Val
LiMBeR⁽¹⁾ (our reimpl.)	35.85±0.32	122.85±0.69	25.05±0.31	87.10±1.36	17.28±0.08	51.85±0.71	60.19±0.36	59.95±0.42	45.93±1.04	45.68±0.95	17.96±0.76	33.38±1.60	34.13±1.39
LiMBeR^(all) (our reimpl.)	39.86±0.30	136.31±0.63	23.85±0.81	83.74±2.80	22.17±3.39	75.51±19.68	73.42±0.08	72.73±0.10	47.98±1.87	47.70±1.58	31.25±0.50	36.19±2.42	38.93±2.86
MAPL (our reimpl.)	36.96±1.60	126.05±5.13	21.15±2.71	69.20±11.74	18.43±0.82	50.57±3.99	67.13±1.28	66.76±1.35	45.94±1.76	45.65±1.62	21.04±0.88	36.21±1.24	37.02±0.45
eP-ALM (our reimpl.)	33.79±0.43	115.34±1.23	17.98±0.79	64.65±1.38	16.27±0.37	42.58±0.54	59.34±0.21	59.03±0.25	41.38±3.06	41.20±2.89	16.59±0.93	23.52±6.40	27.82±1.59
DePALM	38.66±1.25	131.29±3.38	25.09±0.37	87.05±1.61	22.23±0.91	73.67±5.36	70.11±0.14	69.56±0.19	48.25±1.20	47.80±1.17	22.97±1.22	37.69±0.65	38.45±1.48
DePALM^QP,inner	38.80±0.66	130.91±1.04	21.34±0.38	75.86±0.92	21.21±0.30	65.22±1.55	67.88±0.28	67.64±0.11	45.27±0.38	44.92±0.70	23.70±0.96	35.98±0.68	36.36±1.38
DePALM^R-avgpool,L0	38.52±0.88	131.77±3.50	24.68±1.35	86.09±4.45	20.04±0.86	61.18±4.87	64.84±2.07	64.61±2.14	48.86±0.76	48.56±0.90	19.14±0.60	35.17±1.34	35.41±1.78
DePALM^R-linear,L0	38.97±0.32	133.01±0.61	24.69±1.74	85.31±5.26	21.11±0.93	69.76±2.77	64.76±0.27	64.45±0.31	47.66±0.52	47.31±0.31	19.08±0.93	34.58±0.72	35.30±1.56
DePALM^{R-QPMapper,L0}	38.62±1.72	131.92±5.41	22.76±1.17	75.46±3.95	17.50±0.59	51.03±2.04	61.09±2.27	60.89±2.20	46.08±0.30	45.91±0.17	18.56±0.71	35.35±0.58	35.63±0.78
DePALM^R-rand,L0	39.63±0.48	134.90±0.67	24.60±0.53	86.84±1.49	19.49±1.33	58.15±5.21	71.33±0.09	70.76±0.10	47.60±0.96	47.25±1.08	21.28±1.59	35.00±1.29	34.74±1.34
DePALM^c-attn	38.16±0.41	130.05±1.01	23.83±0.73	81.38±2.52	–	–	69.45±1.17	69.05±1.01	41.73±0.83	41.48±0.61	–	–	–

Table S2: Comparison of our proposed DePALM architectures and our baseline re-implementations, after training on 100% or 1% of each image dataset. We show the average score and standard deviation over five runs, in the format avg±std.

Additional experiments.

We provide complementary results that extend those in Table 4 of the main paper. In Table S2 we report the standard deviation across the five runs on image datasets, along with additional metrics (BLEU for COCO and TextCaps, test set performance for VQAv2). Table S3 and Table S4 do the same for AudioCaps (audio captioning) and MSRVTT (video captioning), respectively.

In addition, in Table S5, we report the results on the COCO and VQAv2 benchmarks when using the DINOv2 visual encoder in place of the CLIP models. The scores degrade slightly compared to Table S2, but remain better than previous state-of-the-art parameter-efficient results. In particular, with the larger dataset (VQAv2), the effect is smaller, especially when using DePALM. This suggests that models trained in an unsupervised setting could be good candidates for efficiently adapting LLMs to multimodal tasks.

Method (on AudioCaps)	B@1	B@2	METEOR	CIDEr	SPICE	SPIDEr
LiMBeR⁽¹⁾ (our reimpl.)	69.38±0.70	51.11±0.43	21.90±0.14	62.04±0.82	15.84±0.28	38.94±0.47
LiMBeR^(all) (our reimpl.)	69.52±0.69	51.23±0.74	22.53±0.30	64.34±2.44	15.91±0.65	40.12±1.37
MAPL (our reimpl.)	70.05±1.27	52.11±1.10	22.87±0.20	65.36±1.61	16.42±0.34	40.89±0.82
eP-ALM (our reimpl.)	61.94±2.01	45.83±1.75	21.38±0.46	60.84±2.31	15.41±0.53	38.13±1.37
DePALM	71.54±0.89	53.37±1.05	23.66±0.33	69.70±2.31	17.03±0.63	43.37±1.42
DePALM^QP,inner	70.85±1.63	52.75±1.87	23.12±0.48	65.96±3.55	16.44±0.71	41.20±2.09
DePALM^R-avgpool,L0	68.91±1.02	50.81±1.44	23.07±0.39	66.80±3.69	16.29±0.34	41.54±1.81
DePALM^R-linear,L0	68.73±1.46	50.59±1.15	23.08±0.35	65.52±2.89	16.32±0.28	40.92±1.42
DePALM^{R-QPMapper,L0}	69.89±0.88	51.82±0.72	22.69±0.28	66.16±2.38	16.19±0.32	41.17±1.17
DePALM^R-rand,L0	70.44±0.64	52.22±1.13	22.99±0.43	66.38±3.38	16.36±0.48	41.37±1.84

Table S3: Comparison of our proposed DePALM architectures and our baseline re-implementations, after training on AudioCaps. We show the average score and standard deviation over five runs, in the format avg±std.

Method (on MSRVTT)	B@4	METEOR	CIDEr
LiMBeR⁽¹⁾ (our reimpl.)	34.22±1.04	27.56±0.35	46.03±2.11
LiMBeR^(all) (our reimpl.)	36.30±0.87	27.77±0.36	46.87±1.80
MAPL (our reimpl.)	36.78±1.22	28.01±0.24	47.27±1.90
eP-ALM (our reimpl.)	25.59±1.28	25.35±0.35	38.83±2.14
DePALM	38.78±1.51	28.54±0.37	49.88±2.01
DePALM^QP,inner	39.44±1.07	28.29±0.47	47.76±2.18
DePALM^R-avgpool,L0	39.36±1.51	28.59±0.29	50.52±2.24
DePALM^R-linear,L0	40.56±1.23	28.71±0.36	51.60±2.28
DePALM^{R-QPMapper,L0}	38.32±1.05	27.41±0.32	45.49±1.49
DePALM^R-rand,L0	36.39±0.88	27.85±0.47	47.90±2.30

Table S4: Comparison of our proposed DePALM architectures and our baseline re-implementations, after training on MSRVTT for captioning. We show the average score and standard deviation over five runs, in the format avg±std.

Method	COCO		COCO (1% data)		VQAv2		VQAv2 (1% data)
Method	B@4	CIDEr	B@4	CIDEr	Val	Test	Val	Test
LiMBeR⁽¹⁾ (our reimpl.)	31.28±0.62	106.93±1.35	16.55±0.55	61.54±1.55	55.63±0.07	55.30±0.19	20.01±17.72	19.79±17.61
LiMBeR^(all) (our reimpl.)	37.86±0.28	129.24±0.75	22.21±0.99	74.96±4.21	68.95±2.42	68.64±2.42	44.56±0.46	44.31±0.52
MAPL (our reimpl.)	36.22±1.55	122.28±4.99	21.39±1.09	70.50±2.42	66.37±0.22	66.13±0.23	45.47±0.46	45.38±0.30
eP-ALM (our reimpl.)	31.28±0.24	106.21±0.56	16.34±0.68	57.45±1.38	57.62±0.16	57.37±0.25	39.90±1.32	39.41±1.64
DePALM	37.68±0.39	127.38±1.11	23.29±0.74	79.45±2.22	68.42±0.10	68.05±0.12	47.91±0.91	47.51±0.92
DePALM^QP,inner	37.28±0.61	124.53±1.28	22.27±0.34	73.53±1.20	65.73±0.25	65.49±0.27	43.81±0.50	43.52±0.40
DePALM^R-avgpool,L0	37.24±0.45	126.29±0.96	22.52±0.53	77.53±1.82	63.63±2.14	63.66±2.03	45.51±0.51	45.08±0.54
DePALM^R-linear,L0	37.14±0.33	125.29±0.67	22.37±0.43	76.71±2.13	66.47±0.20	66.48±0.22	45.62±0.49	45.29±0.45
DePALM^{R-QPMapper,L0}	36.00±1.59	122.30±5.36	20.74±1.49	65.85±5.08	59.93±0.30	59.79±0.55	48.09±0.46	47.64±0.35
DePALM^R-rand,L0	37.13±0.59	126.44±1.20	23.02±0.41	79.68±1.16	63.24±2.07	63.10±1.95	44.23±0.50	44.07±0.42
DePALM^c-attn	36.41±0.42	122.69±1.02	8.34±0.83	18.03±4.04	44.84	44.89	–	–

Table S5: Comparison of our proposed DePALM architectures and our baseline re-implementations, after training on 100% or 1% of COCO and VQAv2 datasets, when using the DINOv2 as the perceptual backbone extractor. We show the average score and standard deviation over five runs, in the format avg±std.

Method	COCO	TextCaps	AudioCaps	MSRVTT
Method	CIDEr	CIDEr	SPIDEr	CIDEr
LiMBeR^(all) (our reimpl.)	136.31	75.51	40.12	46.87
DePALM	131.29	73.67	43.37	49.88
LiMBeR^(all) (our reimpl.) + bias-FT	137.37	74.12	45.45	49.16
DePALM + bias-FT	133.55	67.98	47.35	50.86

Table S6: Impact of bias fine-tuning in the perceptual backbone model. The results are averaged over 5 runs.

Efficient fine-tuning of the feature model.

In our experiments so far, we used prompt-tuning, but did not fine-tune any internal parameters of the LLM or feature backbone. In Table S6 we consider the impact of adding bias-tuning to the feature model. While this adds only 0.5M learnable parameters, we observe substantial gains on COCO, AudioCaps, and MSRVTT but, surprisingly, a performance loss on smaller datasets such as TextCaps. This method should therefore mostly be considered when enough data is available. For simplicity, and to keep good results on small datasets, we used only prompt-tuning for the LLM and kept the encoders completely frozen in all other experiments.

Appendix E Carbon footprint estimation

We report the estimated carbon footprint of training a single instance of DePALM for four different datasets, using the following method. We take the average training time $T$, and then compute the total GPU hours $T_{\text{GPU}}=T\times 8$, as we use a single 8-GPU node for each model. We then estimate the power consumption in kWh, given a Thermal Design Power (TDP) of the V100-32G GPU equal to 250W and a Power Usage Effectiveness (PUE) of 1.1, as $K=\frac{250\times 1.1}{1000}\times T_{\text{GPU}}$. Finally, given a carbon intensity factor of 0.385 kg CO$_{2}$ per kWh, we obtain the emission $E$ in kg of CO$_{2}$ as $E=0.385\times K$.

	COCO	OKVQA	AudioCaps	MSRVTT
Training time: $T$	2h14	1h23	2h21	0h31
GPU hours (8 GPUs): $T_{\textrm{GPU}}$	17.87	11.07	18.80	4.13
Estimated kWh: $K$	4.91	3.04	5.17	1.14
Emitted kg of CO$_{2}$: $E$	1.89	1.17	1.99	0.44

Table S7: Estimated carbon footprint of training a single DePALM^QP,L0 model, on four different datasets.
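The estimation method is simple enough to reproduce as a short calculation. The sketch below applies the three formulas of this appendix in plain Python; the function name and argument names are hypothetical, while the default constants (8 GPUs, 250W TDP, PUE of 1.1, 0.385 kg CO$_{2}$ per kWh) are the ones stated in the text.

```python
def carbon_footprint(hours, minutes, n_gpus=8, tdp_watts=250,
                     pue=1.1, kg_co2_per_kwh=0.385):
    """Return (GPU hours, estimated kWh, emitted kg of CO2)."""
    training_time = hours + minutes / 60           # T, in hours
    gpu_hours = training_time * n_gpus             # T_GPU = T * 8
    kwh = tdp_watts * pue / 1000 * gpu_hours       # K
    emissions = kg_co2_per_kwh * kwh               # E = 0.385 * K
    return gpu_hours, kwh, emissions


# COCO training time from Table S7: 2h14.
gpu_hours, kwh, emissions = carbon_footprint(2, 14)
print(f"{gpu_hours:.2f} GPU hours, {kwh:.2f} kWh, {emissions:.2f} kg CO2")
```

Rounded to two decimals, the values obtained for the four training times agree with the corresponding columns of Table S7.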
