License: arXiv.org perpetual non-exclusive license
arXiv:2403.13499v1 [cs.CV] 20 Mar 2024

(1) FAIR, Meta  (2) Sorbonne University  (3) Valeo.ai

Improved Baselines for Data-efficient
Perceptual Augmentation of LLMs

Théophane Vallaeys    Mustafa Shukor    Matthieu Cord    Jakob Verbeek
(March 20, 2024)
Abstract

The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas. In computer vision, LLMs can be used to prime vision-language tasks such as image captioning and visual question answering when coupled with pre-trained vision backbones. While different approaches have been explored to interface LLMs with “perceptual backbones” that process, e.g., visual or audio data, they are often explored for different tasks, different datasets, and using different perceptual backbones and language models, hindering direct comparison of the interfacing mechanisms. To remedy this lack of comparability between methods, we present an extensive experimental evaluation of different interfacing mechanisms, across multiple tasks (including image, video, and audio captioning as well as visual question answering), datasets and backbones, paying special attention to low-data settings. We find improved performance using existing mechanisms over state-of-the-art results, and identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a $4\times$ reduction in training time.

Correspondence: Théophane Vallaeys, Mustafa Shukor

(a) Our unified framework for perceptual augmentation of LLMs.
(b) Architecture of the feature mapping (QPMapper) in DePALM.
Figure 1: (left) Unified view of data-efficient approaches for perceptual augmentation of LLMs. Existing approaches can be characterized by four configurable blocks: feature extraction, mapping, and injection, as well as fine-tuning mechanisms. (right) Architecture of the feature mapping (QPMapper) used by our model (DePALM): it aggregates tokens from the encoder before injecting them in the LLM, which results in significant savings in compute.

1 Introduction

The advent of large language models (LLMs) has brought unprecedented capabilities in the understanding and production of natural language [92, 82, 83, 6, 86]. These models can be leveraged to provide a natural user interface in a wide variety of applications, including text-based generation of images, video and audio [68, 50, 31], the use of external tools [69], and making models talk to each other [89].

Currently, state-of-the-art models in image captioning [40, 35] and visual question-answering (VQA) [9, 30] mostly consist of task-specific, end-to-end trained models. To build more general models, beyond a single task or dataset, several works leverage the generalization capabilities of pre-trained LLMs, coupled with visual encoders [1, 9, 87, 43, 73]. Such approaches rely on end-to-end training of very large numbers of parameters, e.g. 10B in Flamingo [1], requiring very large datasets, e.g. up to several billions of examples in [9]. Recently, significant efforts have focused on building more powerful multimodal models [51, 3, 52, 48, 7]. Yet, they still involve costly training stages, such as multimodal pretraining and multitask instruction tuning. These models are interesting when millions of training samples are available, there is no constraint on compute efficiency, and the goal is to have generalist models with good performance on many tasks. An interesting alternative line of research has emerged that studies data and parameter efficient methods [72, 58, 25, 84, 61, 79, 60] to address multimodal tasks. These approaches focus on adapting pre-trained and frozen LLMs, by training modules with few parameters, on limited training sets. This line of research is complementary to the former, and aims to maximize performance on a specific task (e.g. VQA), within a few hours of training on a single machine. This becomes even more important in case of data scarcity, where we lack big datasets with millions of training samples.

Improvements in parameter-efficient approaches span several axes, such as the LLMs and perceptual encoders used [60, 61, 72, 91], the perceptual feature extraction/injection mechanism [72, 60, 91], and the cross-modal mapping module [58, 84, 60]. This variety of design choices prevents a fair and comprehensive comparison between existing approaches, and hinders the understanding of the main factors driving their success. In addition, most of these approaches focus on parameter-efficiency, with little focus on data-efficiency [72, 58], while we argue that the latter is a more important aspect together with compute-efficiency.

More than proposing novel approaches to couple LLMs with perceptual backbones, we believe it is important to have a unified understanding and proper comparison between existing methods. To this end, we propose a unified framework to comprehensively study previous approaches (Figure 1). Our framework allows a fair and systematic comparison along the designs of several blocks: feature extraction (e.g. which visual features to consider), feature mapping (e.g. how to project the extracted features in the LLM textual space) and feature injection (e.g. where to inject the projected features). We consider the impact of the choice of LLM and perceptual backbones, and carefully and fairly tune hyperparameters. This by itself already improves over previously reported results. The systematic characterization of existing approaches naturally leads us to define and evaluate alternative approaches. We find that one of these approaches emerges as the overall best, which we dub DePALM, leading to (near) optimal results across different datasets and tasks. Our approach consistently and significantly improves over earlier data and parameter efficient approaches, and in some cases also outperforms few-shot performance of large-scale state-of-the-art models that train billions of parameters on massive datasets.

To summarize, our contributions are as follows:

  • We present the first systematic experimental study of methods to interface perceptual backbones with LLMs, using the same tasks, datasets, and underlying backbone networks.

  • For all considered tasks, we find improvements over previous state-of-the-art data and parameter efficient methods by careful setting of training hyperparameters and architectural choices.

  • We identify a new mechanism, DePALM, to interface LLMs with perceptual backbones based on token pooling, which obtains near optimal results, while being up to $4\times$ faster to train than the closest competitor (training in less than 1.5h on a single machine for a typical dataset).

2 Related work

Table 1: Overview of different architectures from the literature, as well as from this work (DePALM models). The LLM adaptation mechanisms consist of four fundamental components: feature extraction, feature mapping, feature injection, and a fine-tuning mechanism. The last column shows the number of trainable parameters, as reported by papers, or with the LLaMA+CLIP-L setting in our models. Methods in orange leverage pre-training on large amounts of data, or cross-dataset training. Others have at least one version trained on a single dataset, which is the setting we consider.
Method Backbones Adaptation mechanism # Tr.
LLMs Perceptual Enc. Feature extraction Feature mapping Feature injection Fine-tuning mechanisms params.
Flamingo [1] Chinchilla [33] NFNet [5] Tokens from last layer Perceiver Resampler (Transformer) GATED XATTN-DENSE (Cross-attention) 10B
BLIP-2 [43] OPT [92], FlanT5 [13] CLIP [65] Tokens from last layer Q-Former 1st layer token injection 1.2B
MAGMA [22] GPT-J 6B [86] CLIP [65] / NFNet [5] Tokens from last layer MLP 1st layer token injection fine-tuning of perceptual model 243M
MAPL [58] GPT-J 6B [86] CLIP-L [65] Tokens from last layer QPMapper ($d_{\text{embed}}$=256, 4 layers) 1st layer token injection 3.4M
PromptFuse [46] BART [42] ViT [19] Tokens from last layer nothing prompt tuning 15K
LiMBeR [60] GPT-J 6B [86] CLIP [65] Tokens from last layer Linear projection 1st layer token injection 12.5M
eP-ALM [72] OPT-2.7B/6.7B [92] ViT [77], AST [27], TimeSformer [4] CLS tokens from $n$ last layers (Shared) linear projection Token injection in intermediate layers prompt tuning 4.2M
LLaMA-Adapter [91, 25] LLaMA [82] CLIP [65] Tokens from last layer Linear projection Token injection in intermediate layers inner-layer prompt tuning, bias tuning, norm tuning 14M
Frozen [84] GPT-like [66] NFNet [5] Pooled output tokens nothing 1st layer token injection Fine-tune the NFNet 40.3M
ClipCap [61] GPT-2 [66] CLIP [65] Tokens from last layer Transformer 1st layer token injection 43M
VL-Adapter [79] BART [42], T5 [67] CLIP [65] Tokens from last layer Linear projection 1st layer token injection Adapters 5.8M
AnyMAL [62] Llama 2-70B-chat [83], CLIP [65], CLAP [23] Tokens from last layer Perceiver Resampler, or linear projection 1st layer token injection LoRA [34]
DePALMQP,inner OPT-6.7B [92], LLaMA [82] CLIP-L [65], DINOv2 [63], MAViL [36] TimeSformer [4] Tokens from $n$ last layers QPMapper Token injection in intermediate layers prompt tuning 18.1M
DePALM Tokens from last layer 1st layer token injection 17.9M
DePALMR-rand,L0, DePALMR-linear,L0, DePALMR-QPMapper,L0, DePALMR-avgpool,L0 Linear projection + Resampler 21M, 88M 18M, 21M
DePALMc-attn Tokens from $n$ last layers Projection + Small Transformer Gated cross-attention 17.9M

Multimodal models.

In recent years there has been significant interest in multimodal models, and in particular in vision-language pre-training, see e.g. [80, 10, 45, 71, 20, 44, 65]. These models can be subsequently fine-tuned to address a range of tasks, such as visual question answering (VQA) and image captioning. The advent of large language models (LLMs) [6, 32, 92, 82, 12] has triggered another line of work on large-scale multimodal training built on top of LLMs. The typical approach is to fine-tune a pre-trained language model on large multimodal datasets [9, 8].

Due to the large computational cost to train these approaches, especially with LLMs on the scale of billions of parameters, other approaches keep the LLM part of the model frozen, and only train additional parameters to solve multimodal tasks [1, 43]. Although the predominant focus of research revolves around image-text tasks, the adaptability of these approaches to other modalities, including video and audio, has recently been demonstrated in a straightforward manner [72, 29, 73, 57, 73, 62, 87]. Nonetheless, such endeavors still necessitate the training of a substantial number of parameters, e.g. 10B parameters in Flamingo [1], on billion-scale multimodal training sets [8].

Efficient adaptation of unimodal models.

In contrast to the paradigm of large-scale end-to-end multimodal training, another line of work considers efficient adaptation of pre-trained unimodal models. Methods such as MAGMA [22], Frozen [84] and ClipCap [61] tackle vision-language tasks by training the visual encoder [84, 22] or additional adapters [22] to leverage a pre-trained language model. Other approaches train a smaller number of parameters by keeping all pre-trained models fixed, and training only a linear layer [60] or a small transformer mapping network [58]. Nevertheless, these methods rely on multimodal visual encoders such as CLIP [65], or inject a substantial number of visual tokens in the language model, which reduces inference speed. Recently, several approaches [72, 91, 39] have explored the use of simple linear layers to transform features and inject them in LLMs; some of them even use only unimodal encoders, across image/video/audio modalities, see e.g. [72]. While each of these approaches shows good performance within its own specific experimental setup, it is difficult to compare them due to the differences in the considered tasks and datasets.

3 Unified framework

Even though different approaches leverage LLMs for multimodal tasks, it remains challenging to discern the specific components responsible for the superiority of one method over another. In response to this challenge, we structure previous work in a comprehensive and unified framework, as depicted in Figure 1, enabling a systematic and fair comparison of various existing approaches. Within this framework, the process of adapting LLMs for multimodal tasks boils down to making different design choices. These include the choice of the LLM and perceptual backbone models, which we discuss in Section 3.1, and the adaptation mechanism, which we discuss in Section 3.2. The latter consists of feature extraction, feature mapping and feature injection mechanisms, and finally a fine-tuning mechanism. Different design choices lead to different existing or new approaches, as exemplified for a number of methods from the literature in Table 1.

3.1 Backbone models

Language models.

Despite a non-negligible effort in encoder-decoder LLMs, the NLP community has mostly converged to decoder-only LLMs for very large scales. Most powerful LLMs come in different model sizes; the best choices in terms of the trade-off between performance and efficiency are usually models with on the order of 7B parameters. To solve multimodal tasks, these LLMs can be fully or partly fine-tuned, or completely frozen. Here we focus on frozen, decoder-only LLMs with 7B parameters, such as OPT [92] and LLaMA [82]. We also experiment with instruction-tuned LLMs such as Vicuna [11] and LLaMA-2-chat [83].

Perceptual encoders.

Encoders are chosen depending on the modality, and they differ mainly in the architecture (e.g. CNNs or Transformers) and training paradigm (e.g. class-label supervised, text supervised, or self-supervised). In our experiments, we focus on transformer-based encoders for their strong performance when pre-trained on large-scale datasets [19]. We experiment with CLIP [65] for image-text tasks, which has been pre-trained from text-aligned data. We also experiment with models pre-trained in a unimodal manner such as TimeSformer [4] for videos, and self-supervised ones such as DINOv2 [63] for images, and MAViL [36] for audio.

3.2 Adaptation mechanisms

To couple the perceptual backbone with the LLM, perceptual tokens are first extracted from the perceptual backbone, transformed via a mapping network, and then injected in the LLM. To further improve the adaptation, different fine-tuning mechanisms can be adopted. An overview of the different designs of these components is given in columns four to seven in Table 1. Below, we discuss each of them in detail.

Feature extraction.

In transformer-based perceptual encoders, features take the form of “tokens” that correspond to specific parts of the input, e.g. an image patch. Some transformer models also include a special “class token”, denoted as CLS, which is not tied to a specific part of the input; it interacts with all the input-tied tokens, and encodes global information that can, e.g., be used to classify an image. These tokens can be extracted from any layer of the encoder. We consider two design choices: (i) where to extract the tokens: from the last encoder layer only, or from the last $k$ layers, and (ii) which tokens to extract: all of them, or only the CLS token.
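The two extraction choices can be illustrated with a minimal sketch. It assumes the encoder exposes its per-layer tokens as nested lists and that the CLS token sits at index 0 of each layer; the function name `extract_tokens` and the toy data are hypothetical and not taken from the paper's code.

```python
from typing import List

Token = List[float]   # one feature vector
Layer = List[Token]   # tokens of one encoder layer; index 0 is assumed to be CLS


def extract_tokens(layers: List[Layer], k: int = 1, cls_only: bool = False) -> List[Layer]:
    """Return tokens from the last k layers: all tokens, or only the CLS token."""
    if not 1 <= k <= len(layers):
        raise ValueError("k must be between 1 and the number of encoder layers")
    selected = layers[-k:]
    if cls_only:
        return [[layer[0]] for layer in selected]
    return [list(layer) for layer in selected]


# Toy encoder output: 4 layers, each with CLS + 3 patch tokens of dimension 2.
toy_layers = [
    [[float(layer_idx), float(token_idx)] for token_idx in range(4)]
    for layer_idx in range(4)
]

last_layer_all = extract_tokens(toy_layers, k=1)               # 1 layer x 4 tokens
last_two_cls = extract_tokens(toy_layers, k=2, cls_only=True)  # 2 layers x 1 token
```

In this sketch, `k=1` with all tokens corresponds to the "tokens from last layer" setting, and `cls_only=True` with `k>1` to the "CLS tokens from the last layers" setting.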

Feature mapping.

To render the encoder features compatible with the internal features of the LLM, we apply a mapping which can take different forms.

1) Linear projection.

The simplest approach is to use a single linear layer that projects the extracted visual tokens to the hidden-state dimension of the LLM. If multiple tokens are extracted, each of them is projected independently with the same linear layer.
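As a rough illustration of the shared projection, the sketch below applies one weight matrix and bias to every token. The toy dimensions and random initialization are assumptions made for the example only.

```python
import random
from typing import List

Matrix = List[List[float]]


def linear_project(tokens: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Apply y = W x + b to every token, sharing W and b across tokens."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


d_enc, d_llm = 4, 6  # toy sizes; real models use much larger dimensions
rng = random.Random(0)
W = [[rng.uniform(-0.1, 0.1) for _ in range(d_enc)] for _ in range(d_llm)]
b = [0.0] * d_llm

tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
projected = linear_project(tokens, W, b)  # 3 tokens of dimension d_llm
```

Because the layer is shared, identical input tokens map to identical outputs, and the number of tokens is unchanged: only their dimension is adapted.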

2) Query pooling mapper.

Typical perceptual backbones use on the order of hundreds of internal tokens, e.g. 256 tokens organized in a $16 \times 16$ grid for images. The training and inference costs depend directly on the number of tokens that are extracted from the encoder and injected in the LLM. Even if the LLM in principle needs to process only short text sequences for the task at hand, such as in visual question answering, injecting hundreds of perceptual tokens in the LLM makes it computationally demanding. We design a QPMapper block to aggregate the tokens extracted from the encoder into a smaller set. The input feature tokens are projected and concatenated to a sequence of learnable query tokens (hence the name), and only the outputs corresponding to the query tokens are kept, upsampled to the LLM dimension, and normalized using RMSNorm [90]. This architecture is inspired by several previous works [43, 21, 81] using query or class tokens to compute an aggregate representation of the input. In our case, however, we use multiple query tokens rather than a single global token. It is also similar to the mapping network of MAPL [58], with a lower number of layers and higher dimensionality, and its main benefit is to limit the number of tokens passed to the LLM. Our QPMapper consists of a small sequence of $N_{QS}$ transformer layers, wrapped by linear downsampling and upsampling projections to and from $d_{\text{embed}}$-dimensional internal features, which allows us to control the number of trainable parameters in this block. See Figure 1 (right panel) for an illustration of the QPMapper architecture.
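The data flow of such a query pooling block can be sketched in plain Python. This is a simplified illustration under several assumptions, not the authors' implementation: each transformer layer is reduced to single-head self-attention with a residual connection (no MLP, no inner normalization), the query tokens are placed before the feature tokens, RMSNorm has no learnable gain, and all sizes and random weights are toy values.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rand_matrix(rows: int, cols: int, rng: random.Random, scale: float = 0.5) -> Matrix:
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix, wq: Matrix, wk: Matrix, wv: Matrix) -> Matrix:
    """Single-head self-attention with a residual connection (simplified layer)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    attn = [softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]) for qi in q]
    out = matmul(attn, v)
    return [[xi + oi for xi, oi in zip(xr, orow)] for xr, orow in zip(x, out)]


def rms_norm(x: Matrix, eps: float = 1e-6) -> Matrix:
    """RMSNorm per token, without a learnable gain (assumption)."""
    normed = []
    for row in x:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        normed.append([v / rms for v in row])
    return normed


def qp_mapper(tokens: Matrix, params: dict) -> Matrix:
    x = matmul(tokens, params["down"])       # project features to d_embed
    x = params["queries"] + x                # concatenate learnable queries (order assumed)
    for wq, wk, wv in params["layers"]:      # N_QS simplified transformer layers
        x = self_attention(x, wq, wk, wv)
    pooled = x[: len(params["queries"])]     # keep only the query outputs
    return rms_norm(matmul(pooled, params["up"]))  # upsample to d_llm, then normalize


rng = random.Random(0)
d_enc, d_embed, d_llm, n_queries, n_layers = 8, 4, 16, 3, 2
params = {
    "down": rand_matrix(d_enc, d_embed, rng),
    "queries": rand_matrix(n_queries, d_embed, rng),
    "layers": [
        tuple(rand_matrix(d_embed, d_embed, rng) for _ in range(3))
        for _ in range(n_layers)
    ],
    "up": rand_matrix(d_embed, d_llm, rng),
}
encoder_tokens = rand_matrix(256, d_enc, rng, scale=1.0)  # e.g. a 16 x 16 grid
llm_tokens = qp_mapper(encoder_tokens, params)            # n_queries tokens of size d_llm
```

The number of tokens passed on to the LLM equals the number of query tokens, regardless of how many tokens the encoder produces, which is where the compute savings come from.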

3) Token resamplers.

We consider several other alternatives to reduce the number of tokens, which are inspired by pooling blocks in CNNs. For example, average-pooling and max-pooling aggregate features over a small patch of, typically, $2 \times 2$ features. In our experiments, we explore the following resamplers to reduce the number of tokens:

  • R-avgpool: tokens in a patch are averaged, which is equivalent to an average pooling layer on the input grid.

  • R-linear: tokens in a patch are concatenated and then linearly projected, which is equivalent to a strided convolution on the input grid.

  • R-QPMapper: tokens in a patch are passed through a QPMapper with a single query token, with parameters of the QPMapper shared across patches.

  • R-rand: a random subset of tokens, e.g. 50%, is selected during training. During evaluation, we keep all tokens.
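Three of these resamplers can be sketched on a toy token grid as follows. The sketch assumes non-overlapping $2 \times 2$ patches and grid sides divisible by the patch size; function names and toy values are hypothetical, and R-QPMapper is omitted for brevity.

```python
import random
from typing import Iterator, List

Vec = List[float]
Grid = List[List[Vec]]  # grid[row][col] is one token


def patches(grid: Grid, p: int = 2) -> Iterator[List[Vec]]:
    """Yield the tokens of each non-overlapping p x p patch, in row-major order."""
    for r in range(0, len(grid), p):
        for c in range(0, len(grid[0]), p):
            yield [grid[r + dr][c + dc] for dr in range(p) for dc in range(p)]


def r_avgpool(grid: Grid, p: int = 2) -> List[Vec]:
    """Average the tokens of each patch (average pooling on the grid)."""
    return [[sum(vals) / len(patch) for vals in zip(*patch)] for patch in patches(grid, p)]


def r_linear(grid: Grid, weight: List[Vec], p: int = 2) -> List[Vec]:
    """Concatenate the tokens of each patch and apply one shared linear map."""
    out = []
    for patch in patches(grid, p):
        flat = [v for tok in patch for v in tok]
        out.append([sum(w * x for w, x in zip(row, flat)) for row in weight])
    return out


def r_rand(tokens: List[Vec], keep: float = 0.5, training: bool = True, rng=random) -> List[Vec]:
    """Keep a random subset of tokens during training; keep all at evaluation."""
    if not training:
        return list(tokens)
    n_keep = max(1, round(keep * len(tokens)))
    idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in idx]


# Toy 4 x 4 grid of 2-dimensional tokens: token (r, c) = [r, c].
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
flat_tokens = [tok for row in grid for tok in row]

pooled = r_avgpool(grid)  # 16 tokens -> 4 tokens
# A linear resampler whose weights average each coordinate reproduces R-avgpool.
avg_weight = [[0.25 if j % 2 == c else 0.0 for j in range(8)] for c in range(2)]
pooled_linear = r_linear(grid, avg_weight)
subset = r_rand(flat_tokens, keep=0.5, training=True, rng=random.Random(0))
```

With $2 \times 2$ patches, R-avgpool and R-linear reduce the token count by a factor of four, while R-rand only shortens the sequence at training time.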

4) Cross-attention.

Prepending tokens to the textual tokens inside an LLM significantly increases the inference complexity. Rather than reducing the number of prepended tokens, we consider a parameter-efficient cross-attention module that is inserted in the LLM and allows it to access the tokens of the perceptual encoder, inspired by Flamingo [1]. Specifically, the perceptual and textual tokens are projected to a smaller hidden dimension $d_{\text{embed}}$. In this latent space, a typical cross-attention block is applied. The textual tokens are considered as queries, and the perceptual ones as keys and values. Finally, the output is upsampled to the LLM dimension and added back to the initial textual tokens using a tanh-gated residual connection. These modules are inserted throughout the second half of the LLM, in-between the LLM transformer modules.
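A minimal single-head sketch of such a gated cross-attention module is given below. The parameter names, toy sizes and random weights are hypothetical, and starting the gate at zero is an assumption borrowed from Flamingo-style gating rather than a detail stated here.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gated_cross_attention(text: Matrix, percept: Matrix, p: dict, gate: float) -> Matrix:
    """Text tokens attend to perceptual tokens in a d_embed space, then a
    tanh-gated residual adds the result back to the text tokens."""
    t = matmul(text, p["text_down"])         # (n_text, d_embed)
    s = matmul(percept, p["percept_down"])   # (n_percept, d_embed)
    q, k, v = matmul(t, p["wq"]), matmul(s, p["wk"]), matmul(s, p["wv"])
    scale = 1.0 / math.sqrt(len(q[0]))
    attn = [softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]) for qi in q]
    out = matmul(matmul(attn, v), p["up"])   # back to (n_text, d_llm)
    g = math.tanh(gate)
    return [[ti + g * oi for ti, oi in zip(trow, orow)] for trow, orow in zip(text, out)]


rng = random.Random(0)


def rand_matrix(rows: int, cols: int, scale: float = 0.5) -> Matrix:
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


d_llm, d_enc, d_embed = 8, 6, 4  # toy sizes
params = {
    "text_down": rand_matrix(d_llm, d_embed),
    "percept_down": rand_matrix(d_enc, d_embed),
    "wq": rand_matrix(d_embed, d_embed),
    "wk": rand_matrix(d_embed, d_embed),
    "wv": rand_matrix(d_embed, d_embed),
    "up": rand_matrix(d_embed, d_llm),
}
text_tokens = rand_matrix(5, d_llm)
percept_tokens = rand_matrix(20, d_enc)

closed = gated_cross_attention(text_tokens, percept_tokens, params, gate=0.0)
opened = gated_cross_attention(text_tokens, percept_tokens, params, gate=1.0)
```

With the gate at zero, the module leaves the text tokens untouched, so the frozen LLM initially behaves as if no perceptual input were present; the text sequence length never grows, unlike with prepended tokens.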

Feature injection.

As for the injection of tokens in the LLM, there are two choices to make: (i) how to inject, and (ii) where to inject these tokens. Regarding (i), we can prepend the tokens to the textual tokens and then interact with the text in the LLM self-attention layers, or we inject them via a cross-attention mechanism. Regarding (ii), tokens are either injected in the LLM input layer, or in intermediate ones. When prepending the tokens to textual tokens in the input layer, they propagate up until the last layer. When injecting in intermediate layers, they are only kept for a single attention block and discarded afterward, and possibly replaced by the same set or another set of tokens in the next transformer block. In this last case, if the tokens were extracted from $k$ levels of the perceptual model, and inserted into $n$ LLM layers, then each sequence $i=1,\dots,k$ of perceptual features is inserted into the LLM blocks $\lfloor in/k \rfloor, \dots, \lfloor (i+1)n/k - 1 \rfloor$.
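The block assignment for intermediate-layer injection can be checked with a short sketch. It assumes 0-based indices, i.e. feature sequences $i = 0, \dots, k-1$ and LLM blocks $0, \dots, n-1$, so that the floor expressions stay within range; the function name is hypothetical.

```python
def injection_blocks(i: int, k: int, n: int) -> range:
    """LLM blocks (0-based) that receive feature sequence i, following
    floor(i*n/k), ..., floor((i+1)*n/k - 1)."""
    if not 0 <= i < k:
        raise ValueError("i must satisfy 0 <= i < k")
    first = (i * n) // k
    last = ((i + 1) * n) // k - 1  # subtracting 1 commutes with the floor
    return range(first, last + 1)


# Example: features from k = 3 encoder levels injected into n = 32 LLM blocks.
schedule = {i: list(injection_blocks(i, k=3, n=32)) for i in range(3)}
# Level 0 covers blocks 0..9, level 1 blocks 10..20, level 2 blocks 21..31.
```

Under this indexing, the ranges are contiguous and cover every one of the $n$ blocks exactly once, even when $n$ is not divisible by $k$.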

Fine-tuning mechanisms.

While keeping the LLM frozen is most efficient, parameter-efficient fine-tuning techniques can be used to further boost performance [18]. In our experiments, we consider prompt-tuning and bias-tuning, which we detail in the supplementary material.

4 Experiments

Below, we present our experimental setup in Section 4.1, followed by the results in Section 4.2.

4.1 Experimental setup

Datasets and metrics.

The datasets used in our experiments are listed in Table 2. For all datasets we use standard splits, except for COCO and VQAv2 where we use the commonly used Karpathy splits [37]. To study limited data settings, we consider OKVQA and MSRVTT, and also experiment with COCO and VQAv2 using 1% of the training data. We evaluate using the standard metric of each benchmark. Specifically, for both image and video captioning, we use CIDEr [85], and for VQA tasks, we use the official VQAv2 accuracy metric on the test and/or validation set. Audio captioning is evaluated with SPIDEr [53]. We add other standard metrics (BLEU [64], METEOR [17], SPICE [2]) in the supplementary material.

Baselines.

To ensure fair comparison of the different interfacing mechanisms, we re-implement several parameter-efficient approaches: LiMBeR [60], MAPL [58] and eP-ALM [72]. We selected LiMBeR as this is the simplest method, used in a number of other works, and the other two as their original papers also report results in the data-efficient setting. We found the other models either redundant (VL-Adapter [79] is the same as LiMBeR with additional fine-tuning, which is not our main focus), not parameter-efficient or not data-efficient (Frozen [84], BLIP-2 [43]), or designed for another setting (LLaMA-Adapter [91] was conceived for instruction fine-tuning first). We refer to these as LiMBeR(all) (our reimpl.), MAPL (our reimpl.) and eP-ALM (our reimpl.), and note that we change the backbones from the original papers to be all the same, for proper comparisons. We also use a variant of LiMBeR from [72], which we name LiMBeR(1) (our reimpl.), where only the CLS token is injected in the LLM.

Table 2: Datasets used in our experiments, listing the modality type, task, and size of the training set. We also list the LLM and perceptual backbone used by default for each dataset.
Dataset Type Task Size LLM Backbone
COCO [47] Image Captioning 82K LLaMA-7B EVA-CLIP-L
TextCaps [75] Image Captioning 21K LLaMA-7B CLIP-ViT-L
VQAv2 [28] Image Question Ans. 605K OPT-6.7B CLIP-ViT-L
TextVQA [76] Image Question Ans. 34K OPT-6.7B CLIP-ViT-L
AOKVQA [70] Image Question Ans. 17K OPT-6.7B CLIP-ViT-L
OKVQA [59] Image Question Ans. 9K OPT-6.7B CLIP-ViT-L
AudioCaps [38] Audio Captioning 49K OPT-6.7B MAViL
MSRVTT [88] Video Captioning 7K LLaMA-7B TimeSformer

Our models.

Based on our unified framework, we explore seven novel interfacing mechanisms, summarized in Table 1: DePALMQP,L0 (that we refer to as DePALM), DePALMQP,inner, DePALMc-attn, DePALMR-rand,L0, DePALMR-linear,L0, DePALMR-QPMapper,L0 and DePALMR-avgpool,L0. To get a good trade-off between performance and efficiency, we include different pooling strategies to reduce the number of perceptual tokens, contrary to prior work that either used a single token [72] or all tokens [60]. Most of these variants extract features from the last perceptual encoder layer and inject the mapped features in the first LLM layer. In addition, we also explore models that consider intermediate layers as in [72]. In terms of fine-tuning mechanisms, we consider prompt tuning and bias-tuning, due to their effectiveness in previous work [72, 46], and leave the LLM and perceptual backbone frozen. In the appendix we report additional experiments regarding different fine-tuning approaches.

Please refer to the supplementary material for further architectural detail of the baselines and our models.

Implementation details.

For fair comparison between different approaches, we use a unified training setup. For each dataset, we use the same LLM and perceptual encoders for all methods, as listed in Table 2. Models are trained directly on downstream tasks, without any pre-training, using the standard cross-entropy loss; the only exceptions are the LLM and perceptual backbone, which are pre-trained and frozen. We use random perturbations for data-augmentation, using the same procedure as in [45] for images. We train with the AdamW [54] optimizer and the cosine learning rate scheduler [55]. For each experiment, we conduct five different runs with different random seeds, unless specified otherwise, each run being executed on a single machine equipped with eight V100 GPUs. We report the mean performance metrics in the main paper, and refer to the supplementary material for the standard deviations. Further implementation details can also be found in the supplementary material.

Table 3: Comparison of our implementation of baselines with results reported by the original papers. Results averaged over 5 runs. The best results per column are marked in bold. †: The published results for LiMBeR use the standard split and a 4-shot evaluation using a model trained on a larger dataset, which do not correspond directly to our setting. ‡: results using 8 shots, after training on the target dataset only.
Method COCO ↑ COCO (1%) ↑ VQAv2 ↑ VQAv2 (1%) ↑ OKVQA ↑
CIDEr CIDEr Val Val Val
LiMBeR (4-shot) [60] 39.2†
MAPL [58] 125.2 65.9 43.5 37.7 18.7 / 31.6‡
eP-ALM [72] 111.6 54.9 41.9†
LiMBeR(all) (our reimpl.) 136.3 83.7 73.4 48.0 36.2
MAPL (our reimpl.) 126.1 69.2 67.1 45.9 36.2
eP-ALM (our reimpl.) 115.3 64.7 59.3 41.4 23.5

4.2 Main experimental results

Improved baseline performances.

We start with our reproductions of existing parameter and data-efficient baselines. Table 3 shows a comparison between the scores we obtained and those reported in the original papers. We improve the existing baselines by large margins across all metrics. This comes mainly from using better backbones (e.g., LLaMA and CLIP), and using a thorough hyperparameter search for the training algorithm. We conducted this hyperparameter search independently for each experiment, on the learning rate and gradient clipping, using a grid search over a set of values we found to work particularly well for a set of diverse models on our task. With our implementation, LiMBeR(all) achieves the best performance across the board. However, LiMBeR(all) is computationally more expensive, as it dramatically increases the length of the sequence processed by the LLM by passing all (typically 256) perceptual tokens to the LLM, compared to passing a couple of tokens in MAPL, or just one in eP-ALM.

Table 4: Comparison of our proposed DePALM architectures and our baseline re-implementations, after training on 100% or 1% of each dataset. We highlight the first, second and third best results. All results are averaged over 5 runs. We show the training time on AudioCaps, the average rank and average normalized score of each method across benchmarks. For these last two values, we add the rank over all our models. ■: tokens are first averaged across time to prevent memory errors. ◆: incomplete data due to unstable training.
Method COCO COCO (1%) TextCaps VQAv2 VQAv2 (1%) TextVQA OKVQA AOKVQA AudioCaps MSRVTT Train time ↓ Average
CIDEr ↑ CIDEr ↑ CIDEr ↑ Val ↑ Val ↑ Val ↑ Val ↑ Val ↑ SPIDEr ↑ CIDEr ↑ Rank ↓ Score ↑
LiMBeR(1) (our reimpl.) 122.85 87.10 51.85 60.19 45.93 17.96 33.38 34.13 38.94 46.03 1h19 8.0 (9) 84.7 (9)
LiMBeR(all) (our reimpl.) 136.31 83.74 75.51 73.42 47.98 31.25 36.19 38.93 40.12 46.87■ 4h59 3.2 (2) 97.4 (1)
MAPL (our reimpl.) 126.05 69.20 50.57 67.13 45.94 21.04 36.21 37.02 40.89 47.27 1h31 6.4 (7) 86.9 (7)
eP-ALM (our reimpl.) 115.34 64.65 42.58 59.34 41.38 16.59 23.52 27.82 38.13 38.83 1h20 10.4 (10) 73.1 (11)
DePALM 131.29 87.05 73.67 70.11 48.25 22.97 37.69 38.45 43.37 49.88 1h25 2.5 (1) 95.7 (2)
DePALMQP,inner 130.91 75.86 65.22 67.88 45.27 23.70 35.98 36.36 41.20 47.76 2h21 5.2 (5) 90.7 (5)
DePALMR-avgpool,L0 131.77 86.09 61.18 64.84 48.86 19.14 35.17 35.41 41.54 50.52 1h50 4.4 (3) 90.4 (6)
DePALMR-linear,L0 133.01 85.31 69.76 64.76 47.66 19.08 34.58 35.30 40.92 51.60 1h48 5.2 (5) 91.1 (3)
DePALMR-QPMapper,L0 131.92 75.46 51.03 61.09 46.08 18.56 35.35 35.63 41.17 45.49 1h48 6.8 (8) 85.6 (8)
DePALMR-rand,L0 134.90 86.84 58.15 71.33 47.60 21.28 35.00 34.74 41.37 47.90■ 2h40 4.4 (3) 90.9 (4)
DePALMc-attn◆ 130.05 81.38 69.45 41.73 1h31 9.5◆ (10) 36.9◆ (11)

Better adaptation mechanism.

Next, we explore seven additional cross-modal interaction mechanisms, beyond the baseline ones, across a set of ten tasks. We also add LiMBeR(1), as it was shown to be a fast and efficient baseline [72]. We report results in Table 4, where we also report the training time for AudioCaps as an illustration of the training cost. To easily compare these methods across different tasks, we use two aggregate metrics. (i) The average rank: for each task, we rank from 1 (best) to 11 (worst), and average the ranks across tasks. (ii) The average score: we normalize the score for each task by the maximum score across the methods, and then average the normalized scores.
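The two aggregate metrics can be written down in a few lines. The sketch below uses only three methods and two tasks (COCO CIDEr and OKVQA accuracy, values copied from Table 4), so its outputs are illustrative and do not reproduce the table's averages over all methods and tasks. Ties are not handled, and scaling the normalized score to a percentage is an assumption made to match the table's range.

```python
from typing import Dict, List

Scores = Dict[str, List[float]]  # method -> one score per task (higher is better)


def average_rank(scores: Scores) -> Dict[str, float]:
    """Rank methods per task (1 = best), then average the ranks across tasks."""
    methods = list(scores)
    n_tasks = len(next(iter(scores.values())))
    totals = {m: 0 for m in methods}
    for t in range(n_tasks):
        ordered = sorted(methods, key=lambda m: scores[m][t], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_tasks for m in methods}


def average_normalized_score(scores: Scores) -> Dict[str, float]:
    """Divide each score by the best score on its task, then average (in percent)."""
    methods = list(scores)
    n_tasks = len(next(iter(scores.values())))
    best = [max(scores[m][t] for m in methods) for t in range(n_tasks)]
    return {
        m: 100.0 * sum(scores[m][t] / best[t] for t in range(n_tasks)) / n_tasks
        for m in methods
    }


# Subset of Table 4: [COCO CIDEr, OKVQA accuracy].
subset = {
    "LiMBeR(all)": [136.31, 36.19],
    "MAPL": [126.05, 36.21],
    "DePALM": [131.29, 37.69],
}
ranks = average_rank(subset)
norm_scores = average_normalized_score(subset)
```

On this subset, DePALM obtains the best average rank because it is second on COCO and first on OKVQA, which illustrates how the rank rewards consistency across tasks rather than a single top score.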

We use our results to conduct an analysis over the building blocks of the models. First, using the same feature mapping, injecting tokens inside the LLM in the first layer (DePALM, LiMBeR(1)) prevails over inner-layer injection (eP-ALM, DePALMQP,inner). We also found that cross-attention (DePALMc-attn) leads to unstable training in most low-data settings. Extracting tokens from different encoder layers (eP-ALM, DePALMQP,inner, DePALMc-attn) makes sense with inner-layer injection techniques, but is not sufficient to improve over methods using only tokens from the last layer (LiMBeR variants, DePALM, and DePALM*,L0 variants). Therefore, we now consider models with last-layer extraction and first-layer injection. For the central mapping block, using a resampler (DePALMR-*,L0 and DePALMQP,* variants) to reduce the number of tokens provides a trade-off between efficiency and performance, compared to injecting all tokens (LiMBeR(all)) or just one (LiMBeR(1) and eP-ALM). The QPMapper used over all feature tokens (DePALM and DePALMQP,inner) provides the best trade-off, while the local resamplers (DePALMR-*,L0) that preserve spatial feature structure lag behind or do not consistently achieve high scores.

Overall, DePALM and LiMBeR(all) achieve the best performance, reaching the best average rank and score, respectively. In terms of training speed, however, DePALM is almost 4× more efficient, owing to the small number of visual tokens injected into the LLM. Its training cost is similar to that of the most efficient approaches, eP-ALM and LiMBeR(1), which inject only one visual token, while it significantly outperforms them.

[Figure 2 images and audio clips omitted; the sample inputs and outputs are listed below.]

COCO (output caption):
  • A group of cats laying on top of a bed.
  • A close up of a cake with nuts on it
  • A group of people standing in front of a large clock.

VQAv2 (input question → output answer):
  • What is the substance in the bowl? → bananas
  • What color is the building? → brown
  • How many rolls are on the plate? → 2

AudioCaps (ground-truth caption → output caption):
  • A sewing machine running briefly → A sewing machine is being used
  • A vehicle engine is revving up → A vehicle engine is idling and then accelerates
  • A woman is speaking while food is frying and sizzling → A woman is speaking and frying food

Figure 2: Qualitative samples of DePALM on COCO (first row), VQAv2 (second row) and AudioCaps (third row), when finetuned on 100% of the data. Input is in gray, and output is in green. For AudioCaps, a ground-truth caption of the input is shown.

Qualitative results.

We give some qualitative results of our model for multiple multimodal tasks in Figure 2. The model adapts its answers to the style of each dataset and shows knowledge of real-world concepts, being able to identify colors, animals and objects.

Table 5: Comparison of different visual backbones with a fixed LLM (left), and of different LLMs (right). We show the CIDEr score on COCO, the validation accuracy on OKVQA, and the SPICE score on AudioCaps. For reference, we add the ImageNet [16] Top1 score of each visual backbone, and the ARC score of each LLM, which measures textual question-answering capabilities. The results are averaged over 3 runs.
Visual backbone COCO DePALM COCO LiMBeR(1) OKVQA DePALM ImageNet Top1
DINOv2-S [63] 118.26 100.63 33.64 81.1%
DINOv2-B [63] 125.42 106.12 34.82 84.5%
DINOv2-L [63] 126.95 107.17 31.81 86.3%
DINOv2-G [63] 127.49 110.58 35.52 86.5%
ViT-L [77] 118.59 106.49 36.11 85.6%
CLIP-ViT-B [65] 121.93 111.88 36.68 68.6%
CLIP-ViT-L [65] 128.69 116.80 37.27 75.3%
EVA-CLIP-L [24] 130.66 123.20 37.13 79.8%
LLM backbone COCO DePALM COCO LiMBeR(1) AudioCaps DePALM ARC
OPT-125M [92] 126.88 102.45 41.82 22.87
OPT-1.3B [92] 129.41 112.43 42.77 29.52
OPT-2.7B [92] 125.75 115.81 43.35 33.96
OPT-6.7B [92] 131.64 117.51 43.83 39.16
LLaMA-7B [82] 130.73 123.12 42.48 51.02
Vicuna-7B [11] 125.66 111.53 21.79 53.24

4.3 Analysis and ablation study

Text-aligned perceptual features adapt better to LLMs.

We investigate the influence of the perceptual backbones on the overall performance. In Table 5 (left) we compare visual encoders of varying sizes and training paradigms on image captioning and visual question answering datasets. Within the same model family (see DINOv2 and CLIP-ViT), larger models generally perform better. Self-supervised encoders (DINOv2) performed better than supervised ones (ViT) for image captioning, but the reverse was true for OKVQA. Finally, vision-language pre-training of the encoders (CLIP) performs best across all tested settings. This suggests that using existing text-aligned perceptual encoders makes the cross-modal interaction between the encoder and LLMs more effective. Overall, backbones with better feature quality (higher ImageNet score) improve our results, with a large boost when there is a pre-existing alignment with text.

Better LLMs are not always better for multimodality.

Next, in Table 5 (right), we compare LLMs of different model sizes and pretraining data sizes, and consider the impact on image and audio captioning results. We find a clear positive correlation between the LLM size and the score for the OPT models, similar to the ARC metric [14], which measures textual question-answering capabilities of the LLMs. However, when comparing LLMs with similar model sizes in the 7B range, we do not see a clear improvement when using LLMs pretrained on more data (LLaMA), nor when fine-tuning on language instructions (Vicuna), contrary to observations for the ARC metric.

Figure 3: Captioning performance on COCO (averaged across three runs) as a function of the number of parameters when using 1% of the training data (left), and as a function of the training set size (right). We control the number of parameters by setting the internal feature dimension $d$. For LiMBeR and eP-ALM, we replace the single linear projection with a two-layer projection (MLP_2) with a varying number of hidden units.

Parameter and data efficiency.

We consider COCO captioning performance as a function of the number of trainable parameters in Figure 3 (left), for a set of diverse methods: we include the two best architectures (LiMBeR(all) and DePALM) and add models using different injection or mapping mechanisms (DePALMQP,inner, eP-ALM, MAPL) to compare diverse behaviors. We focus on the low-data training regime by using only 1% of the training set. To vary the number of trainable parameters, we do the following: for MAPL and DePALM we change the hidden dimension of the QPMapper; for LiMBeR(all) and eP-ALM, we replace the linear feature projection by a two-layer projection (MLP_2) with a bottleneck of variable dimension.
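To make concrete how a bottleneck controls the parameter budget of a two-layer projection, the following sketch counts trainable parameters. It assumes both linear layers carry a bias term, and the dimensions `D_FEATS` and `D_LLM` are illustrative values, not the ones used in the experiments.

```python
def linear_params(d_in, d_out):
    # Weight matrix plus bias vector (bias is an assumption of this sketch).
    return d_in * d_out + d_out

def mlp2_params(d_in, d_hidden, d_out):
    # Two linear layers with a bottleneck of width d_hidden in between.
    return linear_params(d_in, d_hidden) + linear_params(d_hidden, d_out)

# Assumed dimensions, for illustration only.
D_FEATS, D_LLM = 1024, 4096
for d_hidden in (16, 64, 256):
    print(d_hidden, mlp2_params(D_FEATS, d_hidden, D_LLM))
print("single linear layer:", linear_params(D_FEATS, D_LLM))
```

With a narrow bottleneck, the two-layer projection can have far fewer parameters than the single linear layer it replaces, so the hidden width sweeps the budget in both directions.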

First, we observe that for most of the considered parameter range, LiMBeR(all) and DePALM yield the best performance, consistent with earlier experiments. Second, we do not observe strong overfitting for any of the methods, suggesting that in this small data regime the type of interfacing mechanism is more important for performance than the number of parameters.

We investigate data efficiency by varying the training data size from 0.12% to 100% in Figure 3 (right). All methods scale similarly well with the number of training examples, with LiMBeR(all) and DePALM yielding the best performance across all data sizes. We find that training on only 10% of the data achieves roughly 90% of the final performance, validating the data efficiency of these methods.

Table 6: Comparison with state-of-the-art LLM augmentation methods. The DePALM results are averaged over 3 runs. We highlight the best results for each category in underlined bold. †: uses the standard split instead of the Karpathy one. Note that only the results in the last group of parameter-efficient methods are directly comparable to ours.
Method COCO COCO (1%) VQAv2 OKVQA MSRVTT
CIDEr CIDEr Val Val CIDEr
Large-scale methods in few-shot mode
Flamingo [1] (32-shot) 113.8 67.6 57.8
BLIP-2 [43] (0-shot) 121.6 45.9
Large-scale methods finetuned on target task
Flamingo [1] 138.1† 82.1†
BLIP-2 [43] 145.8 82.30†
UnIVAL [73] 128 73.24 45.7 56.3
NExT-GPT [87] 156.7
IDEFICS 80B Instruct [41] (32-shot) 123.2 68.8 59.5
Qwen-VL (7B) [3] 79.5 58.6
Parameter-efficient methods for LLM augmentation
MAPL [58] 125.2 65.9 43.5 31.6
VL-Adapter [79] 116 65.9
eP-ALM [72] 111.6 54.9 48.8
DePALM (ours) 131.3 87.1 70.1 37.7 49.9

Comparison with the state-of-the-art.

In Table 6 we compare our results with state-of-the-art approaches, including large-scale ones. We compare only with models with at least one top score. For this comparison, we use DePALM with the backbones listed in Table 2. Our approach outperforms all parameter-efficient approaches (bottom part of the table) such as eP-ALM and MAPL. We significantly reduce the performance gap between parameter-efficient approaches and large-scale models that are fine-tuned on the target task (middle part of the table). We also compare our approach to generalist models that do not require finetuning (top part), showing that we compete with and sometimes outperform them. While these models are not directly comparable to ours, the results show that finetuning can significantly boost performance, and DePALM emerges as a promising and efficient approach that does not require large-scale pretraining.

5 Discussion and conclusion

Small vs. large-scale setups.

This work focuses on adapting LLMs for multimodal tasks with a focus on efficiency along three main axes: (a) training set size, (b) number of trainable parameters, and (c) amount of compute. This makes it possible to obtain LLM-based solutions significantly faster and more affordably. Importantly, it streamlines the adoption of stronger LLMs and perceptual foundation models that are continuously released.

A different approach is to go large scale along these three axes, with the objective of obtaining good performance across many datasets [51, 3, 52, 43, 1, 8]. This usually requires pretraining, followed by instruction tuning, and even single-task finetuning when targeting a particular dataset. While this approach is more generalist, it requires enormous resources in terms of data and compute. Nonetheless, we believe that both setups are worth pursuing and are complementary, paving the way for very effective multimodal models spanning a wide range of setups.

Limitations.

While this work achieves large improvements compared to previous efficient approaches, there is still room for improvement, especially on harder tasks such as OK-VQA or those requiring reasoning [56]. We believe the proposed framework will be a good foundation for developing more effective approaches in the future. Besides, this work focuses mainly on performance and efficiency. However, other axes should be considered before deployment, in particular safety issues such as hallucinations [74, 49], abstention [15], harmfulness, or the broader objective of aligning these models to human preferences [78].

Conclusion.

We presented a systematic comparative study of mechanisms to interface perceptual backbones (for image, video and audio data) with large language models to address tasks such as captioning and question answering. We focus on parameter-efficient approaches, which leave the LLM and feature backbone unchanged, and can be trained on limited training sets. We conducted extensive experiments on different datasets and tasks in which we evaluated both existing and new mechanisms, considered different choices for the perceptual backbones and language models, and tuned hyperparameters for all methods in a fair manner. We find improved results compared to previously reported ones, even when using the same existing interfacing mechanisms. In general, our study shows that most of the improvement comes from better perceptual encoders, especially text-aligned ones, rather than from more powerful LLMs. We also find that simple design choices work best, such as passing all perceptual tokens at the input to the LLM, or using transformer-based token pooling mechanisms for efficiency. Moreover, we find that our proposed DePALM mechanism, which compresses tokens from the perceptual backbone to a few “summary tokens” to inject in the LLMs, yields on-par or better results than existing approaches, while being 4× faster to train than the second-best method.

References

  • [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J.L., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Bińkowski, M.a., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a visual language model for few-shot learning. In: NeurIPS (2022)
  • [2] Anderson, P., Fernando, B., Johnson, M., Gould, S.: Spice: Semantic propositional image caption evaluation. In: ECCV (2016)
  • [3] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint 2308.12966 (2023)
  • [4] Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML (2021)
  • [5] Brock, A., De, S., Smith, S.L., Simonyan, K.: High-performance large-scale image recognition without normalization. In: ICML (2021)
  • [6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: NeurIPS (2020)
  • [7] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
  • [8] Chen, X., Djolonga, J., Padlewski, P., Mustafa, B., Changpinyo, S., Wu, J., Ruiz, C.R., Goodman, S., Wang, X., Tay, Y., et al.: PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint 2305.18565 (2023)
  • [9] Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A.J., Padlewski, P., Salz, D.M., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J., Ding, N., Rong, K., Akbari, H., Mishra, G., Xue, L., Thapliyal, A.V., Bradbury, J., Kuo, W., Seyedhosseini, M., Jia, C., Ayan, B.K., Riquelme, C., Steiner, A., Angelova, A., Zhai, X., Houlsby, N., Soricut, R.: PaLI: A jointly-scaled multilingual language-image model. In: ICLR (2022)
  • [10] Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: UNITER: UNiversal Image-TExt Representation Learning. In: ECCV (2020)
  • [11] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality (March 2023), https://lmsys.org/blog/2023-03-30-vicuna/
  • [12] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: PaLM: Scaling language modeling with pathways. JMLR 24 (2023)
  • [13] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S.S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Valter, D., Narang, S., Mishra, G., Yu, A.W., Zhao, V., Huang, Y., Dai, A.M., Yu, H., Petrov, S., hsin Chi, E.H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q.V., Wei, J.: Scaling instruction-finetuned language models. arXiv preprint 2210.11416 (2022)
  • [14] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint 1803.05457 (2018)
  • [15] Dancette, C., Whitehead, S., Maheshwary, R., Vedantam, R., Scherer, S., Chen, X., Cord, M., Rohrbach, M.: Improving selective visual question answering by learning from your peers. In: CVPR (2023)
  • [16] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
  • [17] Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evaluation for any target language. In: EACL Workshop on Statistical Machine Translation (2014)
  • [18] Ding, N., Qin, Y., Yang, G., Wei, F., Yang, Z., Su, Y., Hu, S., Chen, Y., Chan, C.M., Chen, W., et al.: Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models. arXiv preprint 2203.06904 (2022)
  • [19] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  • [20] Dou, Z.Y., Xu, Y., Gan, Z., Wang, J., Wang, S., Wang, L., Zhu, C., Zhang, P., Yuan, L., Peng, N., et al.: An empirical study of training end-to-end vision-and-language transformers. In: CVPR (2022)
  • [21] Douillard, A., Ramé, A., Couairon, G., Cord, M.: DyTox: Transformers for continual learning with dynamic token expansion. In: CVPR (2022)
  • [22] Eichenberg, C., Black, S., Weinbach, S., Parcalabescu, L., Frank, A.: Magma – multimodal augmentation of generative models through adapter-based finetuning. In: EMNLP (2022)
  • [23] Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: CLAP: learning audio concepts from natural language supervision. In: ICASSP (2023)
  • [24] Fang, Y., Wang, W., Xie, B., Sun, Q.S., Wu, L.Y., Wang, X., Huang, T., Wang, X., Cao, Y.: EVA: Exploring the limits of masked visual representation learning at scale. In: CVPR (2022)
  • [25] Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., Li, H., Qiao, Y.J.: LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint 2304.15010 (2023)
  • [26] Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio set: An ontology and human-labeled dataset for audio events. In: ICASSP (2017)
  • [27] Gong, Y., Chung, Y.A., Glass, J.R.: AST: Audio spectrogram transformer. arXiv preprint 2104.01778 (2021)
  • [28] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in VQA matter: Elevating the role of image understanding in visual question answering. In: CVPR (2017)
  • [29] Han, J., Zhang, R., Shao, W., Gao, P., Xu, P., Xiao, H., Zhang, K., Liu, C., Wen, S., Guo, Z., Lu, X., Ren, S., Wen, Y., Chen, X., Yue, X., Li, H., Qiao, Y.J.: ImageBind-LLM: Multi-modality instruction tuning. arXiv preprint 2309.03905 (2023)
  • [30] He, X., Chen, S., Ma, F., Huang, Z., Jin, X., Liu, Z., Fu, D., Yang, Y., Liu, J., Feng, J.: VLAB: Enhancing video language pre-training by feature adapting and blending. arXiv preprint 2305.13167 (2023)
  • [31] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., Salimans, T.: Imagen video: High definition video generation with diffusion models. arXiv preprint 2210.02303 (2022)
  • [32] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.d.L., Hendricks, L.A., Welbl, J., Clark, A., et al.: Training compute-optimal large language models. arXiv preprint 2203.15556 (2022)
  • [33] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J.W., Vinyals, O., Sifre, L.: Training compute-optimal large language models. In: NeurIPS (2022)
  • [34] Hu, J.E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Chen, W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)
  • [35] Hu, J., Cavicchioli, R., Capotondi, A.: ExpansionNet v2: Block static expansion in fast end to end training for image captioning. arXiv preprint 2208.06551 (2022)
  • [36] Huang, P.Y., Sharma, V., Xu, H., Ryali, C.K., Fan, H., Li, Y., Li, S.W., Ghosh, G., Malik, J., Feichtenhofer, C.: MAViL: Masked audio-video learners. In: NeurIPS (2023)
  • [37] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)
  • [38] Kim, C.D., Kim, B., Lee, H., Kim, G.: AudioCaps: Generating captions for audios in the wild. In: NAACL-HLT (2019)
  • [39] Koh, J.Y., Salakhutdinov, R., Fried, D.: Grounding language models to images for multimodal inputs and outputs. ICML (2023)
  • [40] Labbé, E., Pellegrini, T., Pinquier, J.: CoNeTTE: An efficient audio captioning system leveraging multiple datasets with task embedding. arXiv preprint 2309.00454 (2023)
  • [41] Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A., Kiela, D., et al.: Obelics: An open web-scale filtered dataset of interleaved image-text documents. In: NeurIPS (2023)
  • [42] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., rahman Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: ACL (2019)
  • [43] Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint 2301.12597 (2023)
  • [44] Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML (2022)
  • [45] Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S., Xiong, C., Hoi, S.: Align before fuse: Vision and language representation learning with momentum distillation. In: NeurIPS (2021)
  • [46] Liang, S., Zhao, M., Schütze, H.: Modular and parameter-efficient multimodal fusion with prompting. In: ACL (2022)
  • [47] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV (2014)
  • [48] Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al.: Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint 2311.07575 (2023)
  • [49] Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Aligning large multi-modal model with robust instruction tuning. arXiv preprint 2306.14565 (2023)
  • [50] Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., Plumbley, M.D.: AudioLDM: Text-to-audio generation with latent diffusion models. In: ICML (2023)
  • [51] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint 2310.03744 (2023)
  • [52] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2024)
  • [53] Liu, S., Zhu, Z., Ye, N., Guadarrama, S., Murphy, K.P.: Improved image captioning via policy gradient optimization of SPIDEr. ICCV (2017)
  • [54] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2017)
  • [55] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: ICLR (2017)
  • [56] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: NeurIPS (2022)
  • [57] Lyu, C., Wu, M., Wang, L., Huang, X., Liu, B., Du, Z., Shi, S., Tu, Z.: Macaw-LLM: Multi-modal language modeling with image, audio, video, and text integration. arXiv preprint 2306.09093 (2023)
  • [58] Mañas, O., López, P.R., Ahmadi, S., Nematzadeh, A., Goyal, Y., Agrawal, A.: MAPL: Parameter-efficient adaptation of unimodal pre-trained models for vision-language few-shot prompting. In: EACL (2023)
  • [59] Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: OK-VQA: A visual question answering benchmark requiring external knowledge. In: CVPR (2019)
  • [60] Merullo, J., Castricato, L., Eickhoff, C., Pavlick, E.J.: Linearly mapping from image to text space. In: ICLR (2023)
  • [61] Mokady, R.: ClipCap: CLIP prefix for image captioning. arXiv preprint 2111.09734 (2021)
  • [62] Moon, S., Madotto, A., Lin, Z., Nagarajan, T., Smith, M., Jain, S., Yeh, C.F., Murugesan, P., Heidari, P., Liu, Y., Srinet, K., Damavandi, B., Kumar, A.: AnyMAL: An efficient and scalable any-modality augmented language model. arXiv preprint 2309.16058 (2023)
  • [63] Oquab, M., Darcet, T., Moutakanni, T., Vo, H.Q., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M.G., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision. arXiv preprint 2304.07193 (2023)
  • [64] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: ACL (2002)
  • [65] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)
  • [66] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. Tech. rep., OpenAI (2019)
  • [67] Raffel, C., Shazeer, N.M., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21 (2020)
  • [68] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint 2204.06125 (2022)
  • [69] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools. arXiv preprint 2302.04761 (2023)
  • [70] Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R.: A-OKVQA: A benchmark for visual question answering using world knowledge. In: ECCV (2022)
  • [71] Shukor, M., Couairon, G., Cord, M.: Efficient vision-language pretraining with visual concepts and hierarchical alignment. In: BMVC (2022)
  • [72] Shukor, M., Dancette, C., Cord, M.: eP-ALM: Efficient perceptual augmentation of language models. In: ICCV (2023)
  • [73] Shukor, M., Dancette, C., Ramé, A., Cord, M.: Unified model for image, video, audio and language tasks. arXiv preprint 2307.16184 (2023)
  • [74] Shukor, M., Rame, A., Dancette, C., Cord, M.: Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning. In: ICLR (2024)
  • [75] Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: TextCaps: a dataset for image captioning with reading comprehension. In: ECCV (2020)
  • [76] Singh, A., Natarjan, V., Shah, M., Jiang, Y., Chen, X., Parikh, D., Rohrbach, M.: Towards VQA models that can read. In: CVPR (2019)
  • [77] Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. TMLR (2022)
  • [78] Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L.Y., Wang, Y.X., Yang, Y., et al.: Aligning large multimodal models with factually augmented RLHF. arXiv preprint 2309.14525 (2023)
  • [79] Sung, Y.L., Cho, J., Bansal, M.: VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. In: CVPR (2022)
  • [80] Tan, H.H., Bansal, M.: LXMERT: Learning cross-modality encoder representations from transformers. In: EMNLP (2019)
  • [81] Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: ICCV (2021)
  • [82] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and efficient foundation language models. arXiv preprint 2302.13971 (2023)
  • [83] Touvron, H., Martin, L., Stone, K.R., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D.M., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A.S., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I.M., Korenev, A.V., Koura, P.S., Lachaux, M.A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X., Tang, B., Taylor, R., Williams, A., Kuan, J.X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., Scialom, T.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint 2307.09288 (2023)
  • [84] Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S.M.A., Vinyals, O., Hill, F.: Multimodal few-shot learning with frozen language models. In: NeurIPS (2021)
  • [85] Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: Consensus-based image description evaluation. In: CVPR (2014)
  • [86] Wang, B., Komatsuzaki, A.: GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax (May 2021)
  • [87] Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.S.: NExT-GPT: Any-to-any multimodal LLM. arXiv preprint 2309.05519 (2023)
  • [88] Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: A large video description dataset for bridging video and language. In: CVPR (2016)
  • [89] Zeng, A., Attarian, M., Ichter, B., Choromanski, K., Wong, A., Welker, S., Tombari, F., Purohit, A., Ryoo, M., Sindhwani, V., et al.: Socratic models: Composing zero-shot multimodal reasoning with language. In: ICLR (2023)
  • [90] Zhang, B., Sennrich, R.: Root mean square layer normalization. In: NeurIPS (2019)
  • [91] Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., Qiao, Y.J.: LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint 2303.16199 (2023)
  • [92] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M.T., Li, X., Lin, X.V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P.S., Sridhar, A., Wang, T., Zettlemoyer, L.: Opt: Open pre-trained transformer language models. arXiv preprint 2205.01068 (2022)

Appendix A Assets and licensing information

In Table S1, we list the datasets and pre-trained models we use for our experiments. We provide the links to the repositories and their licenses.

Name Link License
COCO [47] https://cocodataset.org CC BY 4.0
TextCaps [75] https://textvqa.org/textcaps/ CC BY 4.0
VQAv2 [28] https://visualqa.org/ CC BY 4.0
TextVQA [76] https://textvqa.org/ CC BY 4.0
OKVQA [59] https://okvqa.allenai.org/ Unknown
AOKVQA [70] https://allenai.org/project/a-okvqa Unknown
AudioSet [26] https://research.google.com/audioset/ CC BY 4.0
AudioCaps [38] https://audiocaps.github.io/ MIT
MSRVTT [88] Microsoft website Unknown
CLIP [65] https://github.com/openai/CLIP Unknown
EVA-CLIP [24] https://github.com/baaivision/EVA/tree/master/EVA-CLIP MIT
DINOv2 [63] https://github.com/facebookresearch/dinov2 Apache License 2.0
ViT-L [77] https://github.com/huggingface/pytorch-image-models Apache License 2.0
TimeSformer [4] https://github.com/facebookresearch/TimeSformer CC BY 4.0
OPT [92] https://github.com/facebookresearch/metaseq MIT
LLaMA [82] https://github.com/facebookresearch/llama llama license
Llama2 [83] https://ai.meta.com/llama/ llama license
Vicuna [11] https://lmsys.org/blog/2023-03-30-vicuna/ llama license
Table S1: Links to the assets used in the paper, and their respective licenses.

Appendix B Building blocks of our framework

In this section, we provide more details about the different blocks we use to implement existing baseline models, as well as our DePALM models. We assume a feature extractor model with tokens of dimension $d_{\text{feats}}$, and an LLM with tokens of dimension $d_{\text{LLM}}$.

B.1 Feature extraction

The design of feature extraction is based on two decisions: the number $n_{\text{fl}}$ of feature levels, and whether we keep all tokens or only the CLS token. We take the output of the last $n_{\text{fl}}$ transformer layers of the feature extractor (the image, video or audio backbone). This gives an output of dimension $(n_{\text{fl}}, n_{\text{tk}} + 1, d_{\text{feats}})$, where $n_{\text{tk}}$ is the number of patch tokens, and $d_{\text{feats}}$ the embedding dimension of the feature model (perceptual encoder). When only extracting the CLS token, the output dimension becomes $(n_{\text{fl}}, 1, d_{\text{feats}})$ (the patch tokens are removed).

A special case is added for the MAViL model, where the CLS token is replaced by the mean of all patch tokens of the same level, but only when we do not keep the patch tokens.

B.2 Feature injection

First-layer token injection.

Here $n_{\text{fl}}=1$. The feature tokens are prepended to the sequence of textual token embeddings (including the “BOS” token for OPT). They are propagated through the LLM, and removed from its final output. We use a causal attention mask, where each token can only attend to previous ones, including in-between inserted perceptual tokens.

Inner-layers token injection.

Here we only require $n_{\text{fl}} \geqslant 1$. This method is additionally parametrized by the number $n_{\text{LLM}}$ of LLM layers where we inject tokens, and the number $n_{\text{left}}$ of left-out layers at the end. Then, denoting by $L_{\text{LLM}}$ the total number of LLM layers, for each layer $i \in \{L_{\text{LLM}} - n_{\text{LLM}} - n_{\text{left}}, \dots, L_{\text{LLM}} - n_{\text{left}} - 1\}$, we inject feature tokens extracted from level $l_i = \lfloor \frac{(i - L_{\text{LLM}}) \cdot n_{\text{fl}}}{n_{\text{LLM}}} \rfloor$. For injection, we follow the same procedure as for first-layer token injection, where the feature tokens are prepended to the input sequence of the layer $i$. Additionally, we remove them from the output sequence of this layer.
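To make the layer range and the level formula concrete, the following minimal sketch evaluates them for one configuration. The function name, argument names and the 32-layer LLM are hypothetical choices for illustration. The formula yields negative values because every injected layer index is smaller than the total number of layers; the text does not state how these values index the extracted levels, so the sketch only reports the raw value.

```python
def injection_schedule(num_llm_layers, n_llm, n_left, n_fl):
    """Return (layer index, raw level value) pairs for inner-layer injection.

    num_llm_layers: total number of LLM layers (hypothetical name).
    n_llm: number of LLM layers that receive perceptual tokens.
    n_left: number of final layers left untouched.
    n_fl: number of feature levels extracted from the backbone.
    """
    first = num_llm_layers - n_llm - n_left
    last = num_llm_layers - n_left - 1
    schedule = []
    for i in range(first, last + 1):
        # Python's // rounds toward negative infinity, like the floor in the text.
        level = ((i - num_llm_layers) * n_fl) // n_llm
        schedule.append((i, level))
    return schedule


# Assumed example: a 32-layer LLM with n_LLM = 12, n_left = 1 and n_fl = 6.
schedule = injection_schedule(num_llm_layers=32, n_llm=12, n_left=1, n_fl=6)
for layer, level in schedule:
    print(f"LLM layer {layer:2d} <- raw level value {level}")
```

In this assumed configuration the first injected layer yields a raw value of -7 although only 6 levels are extracted. The text does not say how such a value is handled, so an implementation would need to clamp or offset it.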

B.3 Feature mapping

QPMapper.

This mapping block is parametrized by the number of layers $L_{\text{QP}}$ and the number of query tokens $n_{\text{Q}}$. It takes as input a sequence of tokens of dimension $d_{\text{embed}}$. These tokens are concatenated to $n_{\text{Q}}$ query tokens, which are learnable parameters. The resulting sequence is then passed through a stack of $L_{\text{QP}}$ standard transformer encoder layers. We use a dropout of $0.1$, an embedding dimension of $d_{\text{embed}}$, the GELU activation and 8 attention heads. Only the last $n_{\text{Q}}$ output tokens, corresponding to the query tokens, are considered as output.

Block-based token resamplers.

Some of the resamplers use a common framework, based on local blocks of patches. This framework is parametrized by an embedding dimension $d_{\text{emb}}$ and a pooling function. The extracted feature tokens are first projected from dimension $d_{\text{feats}}$ to $d_{\text{emb}}$ using a linear layer. The patch tokens are then arranged on a 1D, 2D or 3D grid, depending on the modality, and grouped into blocks of dimension $4$ (1D case) or $2\times 2$ (2D case). Each block is pooled using the pooling function, resulting in a single token. This is similar to using a 1D or 2D pooling operation on the grid. The tokens are then rearranged as a sequence again, to which the CLS token is prepended, before being normalized using RMSNorm, and finally projected with a linear layer from $d_{\text{emb}}$ to $d_{\text{LLM}}$.
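The grouping and pooling step in the 2D case can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation: tokens are lists of floats stored in row-major order, the grid height and width are assumed to be even, the mean is used as an example pooling function, and the linear projections and RMSNorm are omitted. All function names are hypothetical.

```python
def mean_pool(block):
    """Average a list of equal-length token vectors into a single token."""
    dim = len(block[0])
    return [sum(tok[d] for tok in block) / len(block) for d in range(dim)]


def pool_2d_blocks(tokens, height, width, pool=mean_pool):
    """Pool a row-major (height x width) grid of tokens over 2x2 blocks."""
    assert len(tokens) == height * width
    assert height % 2 == 0 and width % 2 == 0  # assumption: the grid divides evenly
    pooled = []
    for r in range(0, height, 2):
        for c in range(0, width, 2):
            # Gather the four tokens of the block whose top-left corner is (r, c).
            block = [tokens[(r + dr) * width + (c + dc)]
                     for dr in (0, 1) for dc in (0, 1)]
            pooled.append(pool(block))
    return pooled


# Toy 4x4 grid of 2-dimensional tokens: the token at (r, c) is [r, c].
grid = [[float(r), float(c)] for r in range(4) for c in range(4)]
cls_token = [9.0, 9.0]  # hypothetical CLS embedding
sequence = [cls_token] + pool_2d_blocks(grid, height=4, width=4)
print(len(sequence))  # 1 CLS token + 4 pooled tokens
print(sequence[1])    # mean of the tokens at (0,0), (0,1), (1,0), (1,1)
```

Passing a different `pool` callable changes the resampler while the grouping logic stays the same.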

B.4 Fine-tuning mechanism

Prompt-tuning.

is a parameter-efficient fine-tuning mechanism, parametrized by $n_{\text{pt}}$, the number of learned tokens. When used, $n_{\text{pt}}$ constant embedding vectors of dimension $d_{\text{LLM}}$ are learned and prepended before the textual tokens (and after the perceptual tokens) at the input of the LLM. We also use a causal padding attention mask with these tokens.

Appendix C Implementation details

C.1 Reproducing existing baselines

LiMBeR(all):

we use the feature extraction with $n_{\text{fl}}=1$ level, project all the feature tokens from dimension $d_{\text{feats}}$ to $d_{\text{LLM}}$ using a linear layer, and use first-layer token injection.

LiMBeR(1):

we use the same mechanism as for LiMBeR(all), but only keep the single CLS token.

MAPL:

we use the feature extraction with $n_{\text{fl}}=1$ level, and a feature mapping block consisting of a linear projection from dimension $d_{\text{feats}}$ to $d_{\text{embed}}=256$, followed by a QPMapper using $L_{\text{QP}}=4$ layers and $n_{\text{Q}}=32$ query tokens, and then a linear projection from $d_{\text{embed}}$ to the LLM inner dimension $d_{\text{LLM}}$. We then insert tokens with the first-layer injection mechanism.

eP-ALM:

we extract $n_{\text{fl}}=6$ levels of feature tokens, and only keep the CLS token from each level. We project each one from dimension $d_{\text{feats}}$ to $d_{\text{LLM}}$ with the same linear layer. Mapped tokens are inserted into $n_{\text{LLM}}=12$ inner layers, leaving out the last one ($n_{\text{left}}=1$).

C.2 DePALM variants

Our DePALM and DePALM$^{*,\text{L0}}$ methods use the following blocks:

  • Feature extraction from $n_{\text{fl}}=1$ level.

  • First-layer injection of tokens, after the mapping block.

  • Prompt-tuning with $n_{\text{pt}}=1$.

DePALM:

we use a feature mapping block consisting of a linear projection from dimension $d_{\text{feats}}$ to $d_{\text{embed}}=1024$, followed by a QPMapper using $L_{\text{QP}}=2$ layers and $n_{\text{Q}}=32$ query tokens, and a linear projection from $d_{\text{embed}}$ to the LLM inner dimension $d_{\text{LLM}}$.

DePALM$^{\text{R-rand,L0}}$:

we sample a subset of the tokens, using the following procedure. If the model outputs $n_{\text{tk}}$ patch tokens, we keep the CLS token and $\lfloor f n_{\text{tk}} \rfloor$ uniformly sampled patch tokens, with $f$ the proportion of tokens we keep:

  • During training, with a probability of $\frac{1}{10}$, we set $f=f_{\text{max}}$. Otherwise, we sample $f^{\prime} \sim \mathcal{N}(f_{\text{mean}}, f_{\text{std}})$, and set $f=\min(\max(f^{\prime}, f_{\text{min}}), f_{\text{max}})$ to restrict the values to the interval $[f_{\text{min}}; f_{\text{max}}]$.

  • During inference, we set $f=f_{\text{max}}$.

  • We use $f_{\text{min}}=\frac{1}{16}$, $f_{\text{max}}=\frac{1}{2}$, $f_{\text{mean}}=\frac{1}{4}$ and $f_{\text{std}}=0.2$.

We project the resulting tokens from dimension $d_{\text{feats}}$ to $d_{\text{LLM}}$, normalize them using RMSNorm, and project them again with a linear layer ($d_{\text{LLM}}$ to $d_{\text{LLM}}$) before injection inside the LLM.
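The sampling procedure above can be sketched with the Python standard library as follows. The function names are hypothetical and tokens are represented by arbitrary Python objects. Keeping the sampled patch tokens in their original order is an assumption, since the text does not specify the ordering. The projections and normalization are omitted.

```python
import math
import random

F_MIN, F_MAX, F_MEAN, F_STD = 1 / 16, 1 / 2, 1 / 4, 0.2


def sample_keep_fraction(training, rng):
    """Draw the proportion f of patch tokens to keep."""
    if not training:
        return F_MAX                      # inference: always keep the maximum
    if rng.random() < 0.1:                # training: with probability 1/10, use f_max
        return F_MAX
    f_prime = rng.gauss(F_MEAN, F_STD)    # otherwise sample from a normal distribution
    return min(max(f_prime, F_MIN), F_MAX)  # clamp to [f_min, f_max]


def subsample_tokens(cls_token, patch_tokens, training, rng):
    """Keep the CLS token plus floor(f * n_tk) uniformly sampled patch tokens."""
    f = sample_keep_fraction(training, rng)
    n_keep = math.floor(f * len(patch_tokens))
    # Assumption: sampled tokens keep their original relative order.
    kept_idx = sorted(rng.sample(range(len(patch_tokens)), n_keep))
    return [cls_token] + [patch_tokens[i] for i in kept_idx]


rng = random.Random(0)          # fixed seed only to make the demo repeatable
patches = list(range(256))      # stand-ins for 256 patch tokens
inference_tokens = subsample_tokens("CLS", patches, training=False, rng=rng)
print(len(inference_tokens))    # 1 + floor(256 / 2) = 129
```

During training, the number of kept patch tokens for 256 patches then varies between 16 and 128 from one sample to the next.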

DePALM$^{\text{R-linear,L0}}$:

we use a block-based resampler with $d_{\text{emb}}=d_{\text{LLM}}$. The pooling function is a linear projection from dimension $4\times d_{\text{emb}}$ to $d_{\text{emb}}$, taking as input the concatenation of all tokens from the same block.

DePALM$^{\text{R-QPMapper,L0}}$:

we use a block-based resampler with $d_{\text{emb}}=768$. The pooling function is a QPMapper taking as input the 4 tokens of a single block, using $L_{\text{QP}}=4$ layers and a single query token ($n_{\text{Q}}=1$).

DePALM$^{\text{R-avgpool,L0}}$:

we use a block-based resampler with $d_{\text{emb}}=d_{\text{LLM}}$. The pooling function returns the mean of the tokens from the same block.

DePALM$^{\text{QP,inner}}$:

we extract $n_{\text{fl}}=4$ levels of feature tokens, and use the same feature mapping as DePALM. In particular, the mapping block is shared across all token levels, meaning that each sequence of tokens from each level goes through the same projection, with the same weights, but as separate batch elements. Mapped tokens are inserted into $n_{\text{LLM}}=12$ inner layers, with $n_{\text{left}}=3$. We also use prompt-tuning with $n_{\text{pt}}=16$.

DePALM$^{\text{c-attn}}$:

we extract $n_{\text{fl}}=4$ levels of feature tokens, and use a single cross-attention block (detailed below) that performs both feature mapping and injection. Injection takes place into the last $n_{\text{LLM}}=12$ inner layers, leaving out the $n_{\text{left}}=3$ last ones, similarly to the inner-layer injection mechanism, but using cross-attention instead of concatenation.

For the cross-attention block (see Figure S1), we first project the tokens from dimension $d_{\text{feats}}$ to $d_{\text{embed}}=1024$ using the linear projection $P_{\text{in}}$. They are then passed through a single transformer layer $\mathrm{Transf}$ (8 heads, dropout of 0.1), acting as a minimal resampler network. We also project the input textual tokens to the dimension $d_{\text{embed}}=1024$, using a two-layer MLP (the FFN block, with a hidden dimension of $d_{\text{embed}}$ and the GELU activation function), and normalize them using RMSNorm. We then use the perceptual tokens as keys and values, and the text ones as queries, in a cross-attention layer. We project the resulting tokens back to dimension $d_{\text{LLM}}$ using a linear projection, normalize them with RMSNorm, and add them to the input textual tokens, using a tanh-gated residual connection.
This residual connection takes the form $x^{(k-1)}, x^{(k)}_r, h^{(k)} \mapsto x^{(k-1)} + \tanh(h^{(k)}) \times x^{(k)}_r$, where $x^{(k-1)}$ are the original textual tokens, $x^{(k)}_r$ the output of our cross-attention block, and $h^{(k)}$ a single learned float, initialized at $0$.
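A small numerical illustration of the tanh-gated residual connection is given below. It is only a sketch on toy vectors with a hypothetical function name. It shows that, with the gate parameter initialized at zero, the block initially leaves the textual tokens unchanged.

```python
import math


def gated_residual(x_prev, x_r, h):
    """Return x_prev + tanh(h) * x_r, element-wise over two equal-length vectors."""
    gate = math.tanh(h)
    return [a + gate * b for a, b in zip(x_prev, x_r)]


x_prev = [1.0, -2.0, 0.5]   # toy textual token
x_r = [10.0, 10.0, 10.0]    # toy cross-attention output

at_init = gated_residual(x_prev, x_r, h=0.0)   # tanh(0) = 0: the input passes through
after_update = gated_residual(x_prev, x_r, h=0.5)
print(at_init)
print(after_update)
```

Since the gate starts at zero, the perceptual signal only enters the LLM as the gate parameter moves away from zero during training.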

Figure S1: Overview of cross-attention mapping and injection in DePALM$^{\text{c-attn}}$.

C.3 Training

General training.

We train each model on a single node using eight V100-32G GPUs, using the following method:

  • Optimizer: we use AdamW, with a weight decay of 0.1 and a default learning rate of $\alpha_{\text{max}}=8\cdot 10^{-4}$, which we further adapt for each experiment.

  • Gradient clipping: we use a clipping value of $g_{\text{clip}}=0.8$, which we further adapt for each experiment.

  • Learning rate scheduler: we set a minimum learning rate of $\alpha_{\text{min}}=\alpha_{\text{max}}\cdot 10^{-4}$, with a cosine scheduler. During the first 20% of all iteration steps, we linearly warm up the effective learning rate from $\alpha_{\text{min}}$ to $\alpha_{\text{max}}$, then use the cosine scheduler to decrease it from $\alpha_{\text{max}}$ to $\alpha_{\text{min}}$.

  • Batch size: we use a batch size of 16 on each GPU, for an effective batch size of 128. On experiments where memory is an issue, we use gradient accumulation to train with the same effective batch size.

  • Epochs: we use a base number of 8 epochs, and increase it on small datasets: we use 12 epochs on AudioCaps, 20 on TextCaps, OKVQA, AOKVQA and TextVQA, and 30 epochs on the two 1% settings.

  • Loss: we compute the loss only on the generated text. Additionally, we use a label smoothing value of $2\cdot 10^{-3}$ in the cross-entropy loss of the LLM.

  • Float precision: we load pre-trained models in float16, and train new weights in float32, using mixed-precision.

  • Duplicate inputs: we group together every training sample with the same perceptual and text input, but different outputs. During training, we select the target output for the loss randomly.
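The learning rate schedule from the list above can be sketched as a function of the step index. The text only specifies a linear warmup over the first 20% of steps followed by a cosine decrease. The half-cosine formula below, the 0-based step convention and the function name are assumptions made for illustration.

```python
import math

ALPHA_MAX = 8e-4                 # default peak learning rate from the text
ALPHA_MIN = ALPHA_MAX * 1e-4     # minimum learning rate from the text
WARMUP_FRACTION = 0.2            # first 20% of all iteration steps


def learning_rate(step, total_steps):
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    warmup_steps = int(round(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        # Linear ramp from ALPHA_MIN (step 0) towards ALPHA_MAX (end of warmup).
        progress = step / warmup_steps
        return ALPHA_MIN + (ALPHA_MAX - ALPHA_MIN) * progress
    # Assumed half-cosine decay from ALPHA_MAX down to ALPHA_MIN at the last step.
    decay_steps = max(1, total_steps - 1 - warmup_steps)
    progress = (step - warmup_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return ALPHA_MIN + (ALPHA_MAX - ALPHA_MIN) * cosine


total = 1000  # hypothetical number of training steps
for step in (0, 100, 200, 600, 999):
    print(step, f"{learning_rate(step, total):.3e}")
```

With these assumptions the rate starts at $\alpha_{\text{min}}$, peaks at $\alpha_{\text{max}}$ when the warmup ends, and returns to $\alpha_{\text{min}}$ at the final step.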

Grid search.

As each setting has a different training dynamic, which can be very sensitive to the learning rate, we use a grid search over a few values that we experimentally found to work well. We start by sweeping over the learning rate values $\alpha_{\text{max}} \in \{1\cdot 10^{-3}, 8\cdot 10^{-4}, 4\cdot 10^{-4}\}$, and set the gradient clipping parameter to $g_{\text{clip}}=0.8$. On each experiment where the score significantly increases or decreases between the three runs, or where the results are noticeably lower than other methods or previous experiments, we further experiment with $g_{\text{clip}}=0.8$ for the same sweep over learning rates. We perform this sweep for each method-dataset pair, including baseline reproductions.

We also note that we used grid search on each parameter of our models (number of layers, etc.) on the COCO dataset, to confirm that they are at least a local optimum.

Data augmentation.

We use random augmentation during training of each model. Each time a training sample is used during the training loop, a random modification is applied, without increasing the dataset size.

  • Images: the normalized image is resized to $128\times 128$ with a random scale in $[0.5, 1]$ and a random aspect ratio in $[\frac{3}{4}, \frac{4}{3}]$, and horizontally flipped with probability $0.5$. Data is then augmented with the modified RandAugment procedure [45].

  • Audio: we augment the dataset using frequency masking with maximum length of 24, and time masking with maximum length of 96, after normalizing the audio.

  • Video: we normalize the videos, and apply the same random scaling and flipping as for images, followed by the default RandAugment procedure implemented in PyTorch.

Special cases:

due to memory limitations, when training LiMBeR(1) and DePALM$^{\text{R-rand,L0}}$ on MSRVTT, we average the patch tokens along the time dimension to reduce their number. This yields a representation with the dimension of the embedding of a single frame.

Appendix D Experimental results

Method COCO COCO (1% data) TextCaps VQAv2 VQAv2 (1% data) TextVQA OKVQA AOKVQA
B@4 CIDEr B@4 CIDEr B@4 CIDEr Val Test Val Test Val Val Val
LiMBeR(1) (our reimpl.) 35.85±0.32 122.85±0.69 25.05±0.31 87.10±1.36 17.28±0.08 51.85±0.71 60.19±0.36 59.95±0.42 45.93±1.04 45.68±0.95 17.96±0.76 33.38±1.60 34.13±1.39
LiMBeR(all) (our reimpl.) 39.86±0.30 136.31±0.63 23.85±0.81 83.74±2.80 22.17±3.39 75.51±19.68 73.42±0.08 72.73±0.10 47.98±1.87 47.70±1.58 31.25±0.50 36.19±2.42 38.93±2.86
MAPL (our reimpl.) 36.96±1.60 126.05±5.13 21.15±2.71 69.20±11.74 18.43±0.82 50.57±3.99 67.13±1.28 66.76±1.35 45.94±1.76 45.65±1.62 21.04±0.88 36.21±1.24 37.02±0.45
eP-ALM (our reimpl.) 33.79±0.43 115.34±1.23 17.98±0.79 64.65±1.38 16.27±0.37 42.58±0.54 59.34±0.21 59.03±0.25 41.38±3.06 41.20±2.89 16.59±0.93 23.52±6.40 27.82±1.59
DePALM 38.66±1.25 131.29±3.38 25.09±0.37 87.05±1.61 22.23±0.91 73.67±5.36 70.11±0.14 69.56±0.19 48.25±1.20 47.80±1.17 22.97±1.22 37.69±0.65 38.45±1.48
DePALMQP,inner 38.80±0.66 130.91±1.04 21.34±0.38 75.86±0.92 21.21±0.30 65.22±1.55 67.88±0.28 67.64±0.11 45.27±0.38 44.92±0.70 23.70±0.96 35.98±0.68 36.36±1.38
DePALMR-avgpool,L0 38.52±0.88 131.77±3.50 24.68±1.35 86.09±4.45 20.04±0.86 61.18±4.87 64.84±2.07 64.61±2.14 48.86±0.76 48.56±0.90 19.14±0.60 35.17±1.34 35.41±1.78
DePALMR-linear,L0 38.97±0.32 133.01±0.61 24.69±1.74 85.31±5.26 21.11±0.93 69.76±2.77 64.76±0.27 64.45±0.31 47.66±0.52 47.31±0.31 19.08±0.93 34.58±0.72 35.30±1.56
DePALMR-QPMapper,L0 38.62±1.72 131.92±5.41 22.76±1.17 75.46±3.95 17.50±0.59 51.03±2.04 61.09±2.27 60.89±2.20 46.08±0.30 45.91±0.17 18.56±0.71 35.35±0.58 35.63±0.78
DePALMR-rand,L0 39.63±0.48 134.90±0.67 24.60±0.53 86.84±1.49 19.49±1.33 58.15±5.21 71.33±0.09 70.76±0.10 47.60±0.96 47.25±1.08 21.28±1.59 35.00±1.29 34.74±1.34
DePALMc-attn 38.16±0.41 130.05±1.01 23.83±0.73 81.38±2.52 69.45±1.17 69.05±1.01 41.73±0.83 41.48±0.61
Table S2: Comparison of our proposed DePALM architectures and our baseline re-implementations, after training on 100% or 1% of each image dataset. We show the average score and standard deviation over five runs, in the format avg±std.

Additional experiments.

We provide complementary results that extend the ones in Table 4 of the main paper. In Table S2 we add standard deviation across the five runs on image datasets, and add additional metrics (BLEU for COCO and TextCaps, test set performance for VQAv2). In Table S3 and Table S4, we do similarly but for AudioCaps (audio captioning) and MSRVTT (video captioning).

In addition, in Table S5, we report the results on image datasets using the DINOv2 visual encoder in place of the CLIP models for the COCO and VQAv2 benchmarks. We see that the scores slightly degrade compared to Table S2, but remain better than previous state-of-the-art parameter-efficient results. In particular, with the larger dataset (VQAv2), the effect is smaller, especially when using DePALM. This suggests that models trained in an unsupervised setting could be good candidates for efficiently adapting LLMs to multimodal tasks.

Method (on AudioCaps) B@1 B@2 METEOR CIDEr SPICE SPIDEr
LiMBeR(1) (our reimpl.) 69.38±0.70 51.11±0.43 21.90±0.14 62.04±0.82 15.84±0.28 38.94±0.47
LiMBeR(all) (our reimpl.) 69.52±0.69 51.23±0.74 22.53±0.30 64.34±2.44 15.91±0.65 40.12±1.37
MAPL (our reimpl.) 70.05±1.27 52.11±1.10 22.87±0.20 65.36±1.61 16.42±0.34 40.89±0.82
eP-ALM (our reimpl.) 61.94±2.01 45.83±1.75 21.38±0.46 60.84±2.31 15.41±0.53 38.13±1.37
DePALM 71.54±0.89 53.37±1.05 23.66±0.33 69.70±2.31 17.03±0.63 43.37±1.42
DePALMQP,inner 70.85±1.63 52.75±1.87 23.12±0.48 65.96±3.55 16.44±0.71 41.20±2.09
DePALMR-avgpool,L0 68.91±1.02 50.81±1.44 23.07±0.39 66.80±3.69 16.29±0.34 41.54±1.81
DePALMR-linear,L0 68.73±1.46 50.59±1.15 23.08±0.35 65.52±2.89 16.32±0.28 40.92±1.42
DePALMR-QPMapper,L0 69.89±0.88 51.82±0.72 22.69±0.28 66.16±2.38 16.19±0.32 41.17±1.17
DePALMR-rand,L0 70.44±0.64 52.22±1.13 22.99±0.43 66.38±3.38 16.36±0.48 41.37±1.84
Table S3: Comparison of our proposed DePALM architectures and our baseline re-implementations, after training on AudioCaps. We show the average score and standard deviation over five runs, in the format avg±std.

Method (on MSRVTT) B@4 METEOR CIDEr
LiMBeR(1) (our reimpl.) 34.22±1.04 27.56±0.35 46.03±2.11
LiMBeR(all) (our reimpl.) 36.30±0.87 27.77±0.36 46.87±1.80
MAPL (our reimpl.) 36.78±1.22 28.01±0.24 47.27±1.90
eP-ALM (our reimpl.) 25.59±1.28 25.35±0.35 38.83±2.14
DePALM 38.78±1.51 28.54±0.37 49.88±2.01
DePALMQP,inner 39.44±1.07 28.29±0.47 47.76±2.18
DePALMR-avgpool,L0 39.36±1.51 28.59±0.29 50.52±2.24
DePALMR-linear,L0 40.56±1.23 28.71±0.36 51.60±2.28
DePALMR-QPMapper,L0 38.32±1.05 27.41±0.32 45.49±1.49
DePALMR-rand,L0 36.39±0.88 27.85±0.47 47.90±2.30
Table S4: Comparison of our proposed DePALM architectures and our baseline re-implementations, after training on MSRVTT for captioning. We show the average score and standard deviation over five runs, in the format avg±std.

Method COCO COCO (1% data) VQAv2 VQAv2 (1% data)
B@4 CIDEr B@4 CIDEr Val Test Val Test
LiMBeR(1) (our reimpl.) 31.28±0.62 106.93±1.35 16.55±0.55 61.54±1.55 55.63±0.07 55.30±0.19 20.01±17.72 19.79±17.61
LiMBeR(all) (our reimpl.) 37.86±0.28 129.24±0.75 22.21±0.99 74.96±4.21 68.95±2.42 68.64±2.42 44.56±0.46 44.31±0.52
MAPL (our reimpl.) 36.22±1.55 122.28±4.99 21.39±1.09 70.50±2.42 66.37±0.22 66.13±0.23 45.47±0.46 45.38±0.30
eP-ALM (our reimpl.) 31.28±0.24 106.21±0.56 16.34±0.68 57.45±1.38 57.62±0.16 57.37±0.25 39.90±1.32 39.41±1.64
DePALM 37.68±0.39 127.38±1.11 23.29±0.74 79.45±2.22 68.42±0.10 68.05±0.12 47.91±0.91 47.51±0.92
DePALMQP,inner 37.28±0.61 124.53±1.28 22.27±0.34 73.53±1.20 65.73±0.25 65.49±0.27 43.81±0.50 43.52±0.40
DePALMR-avgpool,L0 37.24±0.45 126.29±0.96 22.52±0.53 77.53±1.82 63.63±2.14 63.66±2.03 45.51±0.51 45.08±0.54
DePALMR-linear,L0 37.14±0.33 125.29±0.67 22.37±0.43 76.71±2.13 66.47±0.20 66.48±0.22 45.62±0.49 45.29±0.45
DePALMR-QPMapper,L0 36.00±1.59 122.30±5.36 20.74±1.49 65.85±5.08 59.93±0.30 59.79±0.55 48.09±0.46 47.64±0.35
DePALMR-rand,L0 37.13±0.59 126.44±1.20 23.02±0.41 79.68±1.16 63.24±2.07 63.10±1.95 44.23±0.50 44.07±0.42
DePALMc-attn 36.41±0.42 122.69±1.02 8.34±0.83 18.03±4.04 44.84 44.89
Table S5: Comparison of our proposed DePALM architectures and our baseline re-implementations, after training on 100% or 1% of the COCO and VQAv2 datasets, when using DINOv2 as the perceptual backbone. We show the average score and standard deviation over five runs, in the format avg±std.

Method COCO TextCaps AudioCaps MSRVTT
CIDEr CIDEr SPIDEr CIDEr
LiMBeR(all) (our reimpl.) 136.31 75.51 40.12 46.87
DePALM 131.29 73.67 43.37 49.88
LiMBeR(all) (our reimpl.) + bias-FT 137.37 74.12 45.45 49.16
DePALM + bias-FT 133.55 67.98 47.35 50.86
Table S6: Impact of bias fine-tuning in the perceptual backbone model. The results are averaged over 5 runs.

Efficient fine-tuning of the feature model.

In our experiments so far, we used prompt-tuning, but did not fine-tune any internal parameters of the LLM or feature backbone. In Table S6 we consider the impact of adding bias-tuning to the feature model. While adding only 0.5M learnable parameters, we observe gains on COCO, AudioCaps, and MSRVTT but, surprisingly, a performance loss on smaller datasets such as TextCaps. This method should therefore mostly be considered when enough data is available. For simplicity, and to keep good results on small datasets, we used only prompt-tuning for the LLM and kept the encoders completely frozen in all other experiments.

Appendix E Carbon footprint estimation

We report the estimated carbon footprint of training a single instance of DePALM for four different datasets, using the following method. We take the average training time $T$, and then compute the total GPU hours $T_{\text{GPU}} = T \times 8$, as we use a single 8-GPU node for each model. We then estimate the power consumption in kWh, given a Thermal Design Power (TDP) of the V100-32G GPU equal to 250W and a Power Usage Effectiveness (PUE) of 1.1, as $K = \frac{250 \times 1.1}{1000} \times T_{\text{GPU}}$. Finally, given a carbon intensity factor of 0.385 kg CO$_2$ per kWh, we obtain the emission $E$ in kg of CO$_2$ as $E = 0.385 \times K$.

COCO OKVQA AudioCaps MSRVTT
Training time: $T$ 2h14 1h23 2h21 0h31
GPU hours (8 GPUs): $T_{\text{GPU}}$ 17.87 11.07 18.80 4.13
Estimated kWh: $K$ 4.91 3.04 5.17 1.14
Emitted kg of CO$_2$: $E$ 1.89 1.17 1.99 0.44
Table S7: Estimated carbon footprint of training a single DePALM$^{\text{QP,L0}}$ model, on four different datasets.
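As a check of the arithmetic, the estimation method described in this appendix can be written as a few lines of Python. The constants and training times are taken from the text and Table S7; the function and variable names are hypothetical. Up to rounding to two decimals, the script should reproduce the rows of Table S7.

```python
TDP_WATTS = 250            # V100-32G thermal design power
PUE = 1.1                  # power usage effectiveness
CARBON_INTENSITY = 0.385   # kg CO2 per kWh
GPUS_PER_NODE = 8


def carbon_footprint(hours, minutes):
    """Return (GPU hours, kWh, kg CO2) for one training run on a single node."""
    t = hours + minutes / 60                       # training time T in hours
    t_gpu = t * GPUS_PER_NODE                      # T_GPU = T x 8
    kwh = TDP_WATTS * PUE / 1000 * t_gpu           # K = (250 x 1.1 / 1000) x T_GPU
    co2 = CARBON_INTENSITY * kwh                   # E = 0.385 x K
    return t_gpu, kwh, co2


training_times = {
    "COCO": (2, 14),
    "OKVQA": (1, 23),
    "AudioCaps": (2, 21),
    "MSRVTT": (0, 31),
}
for name, (h, m) in training_times.items():
    t_gpu, kwh, co2 = carbon_footprint(h, m)
    print(f"{name}: {t_gpu:.2f} GPU hours, {kwh:.2f} kWh, {co2:.2f} kg CO2")
```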