What Makes Multimodal In-Context Learning Work?

Folco Bertini Baldassini¹ Mustafa Shukor¹ Matthieu Cord¹,² Laure Soulier¹ Benjamin Piwowarski¹
¹Sorbonne Université, CNRS, ISIR, F-75005 Paris, France
²Valeo.ai, Paris, France

Abstract

Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code is available at gitlab.com/folbaeni/multimodal-icl.

Figure 1: Empirical analysis of M-ICL behavior. 1. Images play a crucial role in image-to-text tasks. 2. M-ICL is mostly driven by text when the task includes both image and text as input. 3. For advanced M-ICL strategies ranking ICL examples by their similarity to the query, the LMM mostly does a majority vote over the demonstration pairs. 4. M-ICL copies the output of the last demonstration pair.

1 Introduction

Recently, Large Multimodal Models (LMMs) have made considerable progress in comprehending and generating visual and textual content [21, 54, 3, 37, 53]. These models can be seamlessly adapted to solve novel tasks, through In-context learning (ICL) [6]. It is a training-free approach that consists of augmenting the input prompt with a few pairs (input,output) prepended to the query prompt. This extra context acts as demonstrations that should help the model understand the task at hand. The choice and ordering of examples used in the ICL is decisive to its performance, as observed for retrieval methods [31, 43, 25, 46], and for multimodal tasks by exploiting CLIP [42, 68, 22], exemplified by RICES [3]. While extensive research has been carried out into conditions and biases of ICL for LLMs [11, 32, 76, 25, 29], extending this knowledge to the multimodal domain is not trivial. Besides, multimodal ICL (M-ICL) presents new challenges and biases [49, 7, 73] that may not be fully addressed by existing unimodal studies.

In this paper, we propose a comprehensive framework to study M-ICL: using the best open-source LMM models with M-ICL ability, such as IDEFICS [18] and OpenFlamingo [5], we consider a wide range of multimodal benchmarks that cover Visual Question Answering (VQA), captioning and classification tasks. To investigate how modalities (image and text) affect the M-ICL behavior, we systematically remove or mix each modality. We then extend our study to approaches that improve ICL with retrieval-based context selection (RICES [3]).

To summarize, we propose a comprehensive framework to evaluate the M-ICL behavior in LMMs. Our empirical study led to the following findings illustrated in Figure 1:

• In general, M-ICL is primarily focused on text, overshadowing the role played by images. This is less the case for image captioning and classification tasks.

• For advanced similarity-based context selection M-ICL methods, LMMs so far behave no better than a majority voting mechanism over the context demonstrations.

• We also identify a major flaw in these advanced similarity-based methods. They suffer from recency bias, where the model tends to "copy" the answer of the last example in context. This sheds light on several limitations that should be considered before deployment.

2 Related work

Multimodal models

have undergone significant advancements recently [72], by moving towards more unified models that can support a myriad of tasks and modalities [59, 48, 26, 34, 3, 20]. These models are generally built on top of pre-trained LLMs and visual encoders that are simply connected by linear transformations [47, 24, 55, 23, 12, 54, 60, 35], or transformer-based mechanisms [20, 3, 18]. The level of performance of these models has started to approach that of LLMs, especially after multimodal instruction tuning [66, 24, 10, 30, 19]. In addition, several models can now support ICL [3], arguably due to training on interleaved image-text datasets. In this work, we focus on the best open-source models with ICL abilities (IDEFICS [18] and OpenFlamingo [5]), and in particular on IDEFICS, which achieves performance comparable to Flamingo.

In-Context Learning (ICL)

is a paradigm that allows language models to learn tasks given only a few demonstrations [6] and is particularly effective for tackling more complex and reasoning-based tasks [63, 62, 28]. To explain ICL, studies compare it with gradient descent [57, 2, 67, 9, 36] and examine the inner workings of the models [36, 15]. ICL is highly sensitive to the prompt and the choice of demonstrations. Min et al. [32] indicate that the format of the prompt and the distribution of the words matter, though the importance of labels is debated [69, 58]. Interestingly, [39] discusses task recognition and task learning, where the former requires a few examples to understand the task format, and the latter to reproduce the input-output mapping. This depends on multiple factors such as whether the model has been instruction tuned [40], the model size [65, 64], and the semantics of the prompt [61], all of which affect the necessary number of shots. Studies also identify recency and majority biases [76] and order sensitivity [29].

Multimodal ICL.

ICL can be extended to multimodal models after training on interleaved image-text datasets [3, 54, 52, 74, 19]. To further enhance M-ICL, several works try to select better context demonstrations using similarity sampling-based approaches [68, 25, 22, 46, 14, 3, 7]. Although these approaches are effective, especially in handling out-of-distribution tasks [73], several works have highlighted flaws, in particular increased object hallucinations and a limited ability to solve complex tasks such as instruction following or compositional image-text matching [49]. In addition, Chen et al. [7] study OpenFlamingo and find that the image plays a marginal role in VQA tasks, raising questions about the effectiveness of ICL in a multimodal context.

3 Analysis framework of M-ICL

For M-ICL, LMMs process inputs composed of a query $Q$ and a context $C$. The query $Q$ includes an image $I$ and an optional associated text $T$, which can be a question, instruction, or additional information. The context $C$ comprises $N$ demonstrations (examples) from the training dataset $D$, each containing images and texts along with their corresponding responses $R$. M-ICL can be written as follows:

$$C=\left((I_{i},T_{i},R_{i})\right)_{i\in D_{C}},\qquad O=\text{LMM}\left(C,(I_{Q},T_{Q})\right)\qquad(1)$$

Our similarity sampling method is RICES [3]. Given a query $Q$, it retrieves the $N$ most similar demonstrations from the training set according to $S_{iq}=s(I_{i},I_{Q})+s(T_{i},T_{Q})$, where $s$ represents the similarity score calculated by the visual encoder CLIP [42]. These demonstrations are arranged in the context in order of increasing similarity.
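The retrieval step can be illustrated with a minimal sketch. It assumes that CLIP embeddings have already been computed and that $s$ is a cosine similarity, which the text above does not specify; the function names and the dictionary keys `img_emb` and `txt_emb` are hypothetical and are not taken from the released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed choice for s)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rices_select(query, support, n):
    """Return (index, S_iq) for the n most similar demonstrations, least similar first.

    `query` and every item of `support` are dicts with the hypothetical keys
    "img_emb" and "txt_emb" holding precomputed embedding vectors.
    """
    scored = []
    for idx, demo in enumerate(support):
        # S_iq = s(I_i, I_Q) + s(T_i, T_Q)
        s_iq = (cosine(demo["img_emb"], query["img_emb"])
                + cosine(demo["txt_emb"], query["txt_emb"]))
        scored.append((s_iq, idx))
    top = sorted(scored, reverse=True)[:n]  # keep the n highest scores
    top.reverse()                           # increasing similarity: closest demo comes last
    return [(idx, score) for score, idx in top]
```

With this ordering, the demonstration closest to the query sits immediately before the query in the prompt.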

3.1 Research questions & analysis methodology

Our objective is to understand how different modalities, here text and image, affect M-ICL. While there are several methods for demonstration retrieval in the literature [13, 25, 22, 46, 14, 43], there is limited work [3, 68] on M-ICL and consequently little analysis of these methods. We believe that it is essential to investigate how ICL’s sensitivity factors apply to these methods and identify their limitations. We address the following research questions:

RQ1: How does each modality influence M-ICL? To analyze the effect of each modality, we modify the context $C$ by adjusting either $I$ (images) or $T$ (text). We describe the procedure for $I$, but the same method applies to $T$. We either completely remove the image component, resulting in a new context defined as $((\varnothing,T_{i},R_{i}))_{i\in D_{C}}$, or randomize this modality by using random images from the demonstration dataset. In the latter case, the altered context is represented as $((I_{j},T_{i},R_{i})|j\neq i)_{i\in D_{C}}$. We also conduct experiments with RICES to identify any behavioral differences.
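The two image alterations can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: demonstrations are plain `(image, text, response)` tuples, and the random variant reuses images from within the context through a non-zero cyclic shift, whereas the text above draws them from the demonstration dataset. The name `ablate_images` is hypothetical.

```python
import random

def ablate_images(context, mode, rng=None):
    """Return a copy of `context` with the image slot altered.

    context: list of (image, text, response) tuples.
    mode: "remove" replaces every image with None (the empty-set case);
          "random" pairs each (text, response) with the image of another
          demonstration, so that j != i always holds.
    """
    if mode == "remove":
        return [(None, text, resp) for _, text, resp in context]
    if mode == "random":
        n = len(context)
        if n < 2:
            raise ValueError("need at least two demonstrations to swap images")
        rng = rng or random.Random()
        shift = rng.randrange(1, n)  # a non-zero shift guarantees j != i
        return [(context[(i + shift) % n][0], text, resp)
                for i, (_, text, resp) in enumerate(context)]
    raise ValueError(f"unknown mode: {mode}")
```

The same function shape would apply to the text slot by altering the second tuple element instead of the first.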

RQ2: Which kind of shortcuts influence M-ICL? We are interested in whether M-ICL involves genuine learning from demonstrations, or whether it relies on what we call “shortcuts”. Using Generalized Linear Models (GLM) and Spearman’s rank correlation, we evaluate the relationship between the similarity of the demonstrations to the query and their performance outcomes. We compare random sampling with RICES to understand M-ICL’s behavior, focusing on the improvements attributed to RICES. This analysis aims to understand the reason for these improvements and whether they reveal any emerging behaviors that suggest reliance on shortcuts. We then turn to the question of which performance gains can be attributed to RICES or to the M-ICL of LLMs. More precisely, for classification tasks, we rely on a simple RICES-based KNN where the predicted answer $O^{\prime}$ is given by $\text{argmax}_{R}\left(\sum_{\{i\in D_{C}|R_{i}=R\}}e^{S_{iq}}\right)$. For generation tasks (VQA and captioning), we rely on a different set of analyses, since the KNN approach is not well suited to them. Finally, we investigate another factor impacting M-ICL, namely the recency bias, which complements our analysis of the relationship between the similarity of the context answers to the target one.
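The RICES-based KNN rule above amounts to a similarity-weighted vote over the retrieved responses. A minimal sketch, assuming the demonstrations are given as `(response, S_iq)` pairs (the function name and this input format are illustrative choices):

```python
import math
from collections import defaultdict

def rices_knn(demos):
    """Return argmax_R of the sum of exp(S_iq) over demonstrations with R_i = R.

    demos: non-empty list of (response, s_iq) pairs for the retrieved context.
    """
    votes = defaultdict(float)
    for response, s_iq in demos:
        votes[response] += math.exp(s_iq)
    return max(votes, key=votes.get)
```

For example, `rices_knn([("cat", 0.2), ("dog", 0.9), ("cat", 0.3)])` returns `"cat"`, because the two moderately similar "cat" demonstrations together outweigh the single, more similar "dog" one.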

3.2 Experimental setup

Datasets

In our study, we investigate various tasks, including image captioning, classification, and visual question answering. For captioning, we employ the COCO dataset [8] and the Flickr30k dataset [41], where each image is annotated with five captions; we select one caption randomly for our experiments and evaluate using the CIDEr [56] metric. In classification, we use the CIFAR-100 [17] and ImageNet [44] datasets, with 100 and 1000 classes, respectively. The predicted class is the one whose label has the smallest Levenshtein distance to the model’s output. We use accuracy as the metric. An alternative would be to instruct the model to choose among all the classes, but this has a high computational cost. We also examine the Hateful Memes [16] and Rendered SST2 [38, 51] datasets for detecting hate speech and performing sentiment analysis through OCR, measuring performance by exact match accuracy. For visual question answering, we use the VizWiz [4], VQAv2 [1], OK-VQA [45], TextVQA [50], ScienceQA [27] (only items containing images), and MMMU [71] datasets, covering a range of applications from assisting visually impaired users to scientific reasoning. We report VQA accuracy for most of them, and plain accuracy for the multiple-choice formats of ScienceQA and MMMU. The test set is composed of a maximum of 5000 items, chosen randomly if the original dataset exceeds this number. This set remains the same across all tests, serving as a consistent basis for comparison. Additionally, the entire training dataset is used as the support set for M-ICL demonstrations.
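The mapping from a free-form model output to a class label can be sketched as below. The edit distance is the standard dynamic-programming formulation; lowercasing and stripping the output before comparison is an assumption added here, and `nearest_label` is a hypothetical name.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

def nearest_label(output, labels):
    """Return the label with the smallest edit distance to the model output."""
    text = output.strip().lower()  # normalization is an assumption of this sketch
    return min(labels, key=lambda label: levenshtein(text, label.lower()))
```

Ties are resolved in favor of the label that appears first in `labels`, since `min` returns the first minimal element.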

Models and ICL details.

We conduct our tests with the 9B version of IDEFICS [18] (for OpenFlamingo, results are reported in the Appendix Sec. 8.1). For RICES we use the CLIP version "openai/clip-vit-large-patch14". Unless specified, demonstrations are chosen randomly. For captioning and classification tasks (image-to-text tasks), the demonstrations consist of interleaved images and captions/classes. For VQA datasets, the text consists of the question-answer pairs. We do not use explicit task instructions, letting the model understand the task from its context. We repeat each experiment 3 times and report the averaged results.

4 RQ1: How does each modality influence M-ICL?

In this section, we try to answer RQ1, i.e., we investigate the influence of each modality on M-ICL and their interactions by manipulating the context (text or images). We conduct our study with randomly sampled demonstrations and extend it to retrieval-based M-ICL such as RICES in Section 4.3. We summarise the results in Figure 2, presenting the scores for the 16-shot scenario with both contexts of altered images (Figure 2(a)) and texts (Figure 2(c)). Additionally, we illustrate the effect of the number of demonstrations in Figure 2(b). To make values comparable across tasks, we normalize the measures.

4.1 Images impact M-ICL

In Figure 2(a), we observe that image-to-text tasks like captioning and classification are highly affected when altering the images. Compared to the context baseline that consists of images and their correct classes/captions, using random images or removing them from the context leads to a significant drop in performance. The performance for datasets such as CIFAR and ImageNet is close to the level of zero-shot M-ICL, and for MS-COCO it is even worse. This phenomenon is corroborated in Figure 2(b), where we show that adding more demonstrations with random images has a strong negative impact on image-to-text tasks, in stark contrast to the initial demonstrations.

To understand the effect of using random images, in Figure 3 we examine the model’s output in this setup. Our analysis focuses on the most common n-grams found in the captions over the whole dataset (pink) within the context, looking at their frequency in the model’s output. We compare the base prompt (blue) when 32 demonstrations are used, against random images setup in 4 shot (orange) and 32 shot (green). In the case of the base prompt, the distribution appears similar to that of the context, indicating a similar input and output distribution of words. However, in scenarios involving random images, there is a noticeable shift towards an over-representation of the most frequent n-grams, and the more demonstrations the more this happens. This suggests that the mismatch between images and their corresponding textual outputs in the demonstrations causes the model to switch to a generic mode, in which it tends to output the most frequent words in the dataset used to construct the ICL context.

These results suggest that demonstration images influence the performance of M-ICL in image-to-text tasks, and that the model leverages the relationship between visual inputs and textual outputs. We discuss the potential reasons for this behavior in section 5.

We now turn to VQA, which exhibits a different behavior. Altering or omitting images results in a minor decrease in performance, typically between 1.2 and 1.5 points from the base prompt (Figure 2(a)). This suggests that the inclusion of textual information (i.e. questions) diminishes the model’s dependence on visual data, a topic we explore in the next section.

4.2 Text drives M-ICL

In VQA, which has both image and text (i.e. questions) as input, Figure 2(c) illustrates that removing the question (purple) results in an average drop of 3.5 points. Moreover, replacing it with a random question (green) leads to an average decrease of 9.5 points. We further observe (Figure 2(b)) that the decrease worsens with an increasing number of shots (in practice, M-ICL often outputs ’no’, the most frequent answer in the dataset).

For image-to-text tasks, Figure 2(a) also provides insights into the role of text, as scenarios without images (orange) correspond to a scenario with only text. In classification tasks, where the text carries limited information, i.e. just the one- or two-word labels, the text-only scenario performs as poorly as the zero-shot setup (black), with only a 0.47% increase in accuracy. However, in captioning, where the text is richer, M-ICL enables capturing the style of the captions and/or the distribution of words, resulting in an average improvement of 31 points over the zero-shot approach. These results indicate that text influences M-ICL when it carries sufficient semantic content.

In summary, in classification tasks, text has a minor impact compared to images. When text becomes richer, particularly in the context of captioning, the use of text alone can improve zero-shot methods by 31 points. Incorporating images further enhances performance by an additional 20 points, underscoring the importance of both modalities. In tasks like VQA, textual information becomes dominant and significantly influences performance, with random text leading to a significant drop in performance. We conclude that while images do have an impact on M-ICL, textual information takes precedence and drives the model’s decision-making process.

4.3 How does retrieving similar demonstrations affect interactions?

In the previous section, demonstrations were randomly sampled. Here, we turn to similarity-based (RICES) M-ICL, and analyze which observations still hold and which don’t. First, Figure 4 shows that in most cases RICES leads to better performance. For captioning tasks, the more demonstrations, the better the performance. For VQA, the use of RICES leads to improvements of between 2 and 5% for most datasets. The most significant improvements are in classification, where gains range from 10 to 50%.

Investigating the factors influencing these improvements and how each modality contributes can help us better understand multimodal interactions. We follow the same procedure as with random sampling: we investigate the effect of disrupting the alignment between visual and textual parts, while keeping one modality closely related to the query. Additionally, we explore which modality is pivotal for the improvements by computing similarity based on different modality choices.

Disrupting image-text alignment

In Figure 5 we observe that there is no significant degradation when removing the images or replacing them with random ones. The context with random images (in blue), where only the demonstration responses resemble the query, yields results comparable to random sampling and is slightly better than removing images. Furthermore, there is no noticeable drop in performance as the number of examples increases, unlike with randomly sampled demonstrations (as shown in Appendix Fig. 13). On the other hand, random responses (in purple), where only the demonstration images are similar to the query’s, lead to a significant decrease in performance.

In particular, when substituting the responses in the context with random ones, the drop is larger with RICES than with random sampling (e.g., as shown in Appendix Figure 13; for random sampling and image-to-text tasks, random images and random labels are equivalent). Having the wrong responses for images similar to the query might push the model to output the wrong response as well.

Overall, the results above suggest that images serve as a prior for the demonstrations, which is confirmed by the analysis conducted in Sec. 5.

Retrieving demonstrations similar to text or image query?

In the case of VQA, the question is composed of text and images. As described in Section 3, $S_{iq}$ is the sum of CLIP textual and visual similarities. In Figure 6, to further explore the effect of each modality, we compare this baseline (orange) to using only CLIP image similarity (blue) or CLIP text similarity (pink). Results vary across datasets: for TextVQA, VQAv2, and VizWiz, image similarity gives better results, while textual similarity is better for MMMU, OK-VQA, and ScienceQA. This might be explained by the nature of each dataset: TextVQA, VQAv2, and VizWiz require images for accurate responses, whereas MMMU, OK-VQA, and ScienceQA depend more on textual information. To conclude, the right similarity depends highly on the dataset, and there is no clear indication of which to choose for M-ICL models.

5 RQ2: Which kind of shortcuts influence M-ICL?

In this section, we address RQ2, i.e. we try to explain the M-ICL behavior with random or similarity-based demonstrations. More precisely, we investigate whether M-ICL performance can be partially explained by the fact that the demonstration responses can be close to the desired response, and that the M-ICL model does a “soft copy” of the demonstration responses. Formally, we hypothesize (1) that, given a demonstration $i$ and a query $q$, the similarity function $S_{iq}$ correlates with the CLIP score between the responses $R_{i}$ and $R_{q}$, denoted $S_{iq}^{R}$, i.e. if the demonstration inputs are similar to the query inputs, the same applies to the responses. Furthermore, we also hypothesize (2) that, given a context $C$ composed of demonstrations $i$, the average of the similarities $S^{R}_{iq}$ of the demonstration responses $R_{i}$ with the target response $R_{q}$ correlates with performance, i.e., the closer the context responses are to the target one, the better the generated one.

To verify these hypotheses, we compute both Generalized Linear Model (GLM) coefficients and Spearman correlation to characterize the relationship between the different factors described above. In the first column of Table 1, we compare $S_{iq}$ and $S^{R}_{iq}=s(R_{i},R_{q})$, with $s$ the CLIP similarity, across all demonstrations (hypothesis 1). With RICES, we observe a positive Spearman correlation, especially for classification (SST2) and image-to-text (COCO) datasets, and slightly less for VQA (VQAv2). The regression coefficient, close to 1, shows that the similarities almost match on average. We also observe that the correlation drops when using random samples, showing that this relation holds only when looking at more similar demonstrations. In the second column of Table 1, we look at the relationship between (a) the average similarity $avg(S^{R}_{iq})$ between the demonstration responses and the target response; and (b) the performance of M-ICL. Again, only with RICES demonstrations do we observe a correlation in all cases.
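For readers unfamiliar with the rank statistic used here, Spearman's correlation is the Pearson correlation of the ranks of the two variables. The dependency-free sketch below is illustrative only; the source does not say which statistical package was used, and the function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation, e.g. between S_iq and S^R_iq values.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks matter, any monotonically increasing relation between the two similarity scores yields a value of 1, whether or not it is linear.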

Dataset	Sampling	$S_{iq}\sim S^{R}_{iq}$ (GLM)	$S_{iq}\sim S^{R}_{iq}$ (Sp.)	$avg(S^{R}_{iq})\sim$ score (GLM)	$avg(S^{R}_{iq})\sim$ score (Sp.)
COCO	Random	0.69	0.16	0.96	-0.01
COCO	RICES	0.75	0.37	0.51	0.22
VQAv2	Random	1.33	0.10	0.66	0.25
VQAv2	RICES	1.01	0.18	0.61	0.33
R. SST2	Random	0.80	0.05	1.01	0.22
R. SST2	RICES	0.89	0.29	0.96	0.35

Table 1: Correlation between input and output similarities and performance. The correlation between inputs and outputs of any given demonstration and any query is represented by $S_{iq}\sim S^{R}_{iq}$. Here, $S_{iq}$ refers to the similarity of the inputs of the demonstration and the query, while $S^{R}_{iq}$ refers to the similarity of their responses. $avg(S^{R}_{iq})\sim\text{score}$ represents the correlation, for a given set of demonstrations $i$ within a context $C$, between the mean similarity of the demonstration responses with the query’s and the overall score. We show the coefficients of the Generalized Linear Model (GLM) as well as Spearman’s rank correlation (Sp.), with all p-values $<0.01$.

These observations support our initial explanation of M-ICL performance in the case of RICES (this is less clear otherwise), i.e. RICES is effective because it retrieves responses that closely match the target one. This raises the question of whether the performance gains from M-ICL are simply due to better context responses acting as shortcuts, or whether there is genuine learning involved, with demonstrations that are more similar to the query proving to be more useful. In what follows, we study two potential shortcuts: one being that M-ICL might simply exploit the presence of more accurate or relevant responses in the context, and the other being that the most similar demonstrations, whose response is probably the same as or close to the query’s, are the most recent, and the model could be leveraging this recency. The remainder of this section aims to explore and clarify these two possibilities.

5.1 M-ICL does a majority vote over the demonstrations

We dive into the first possibility, which examines the impact of having more accurate or relevant responses in the context. We aim to assess the effectiveness of M-ICL with demonstrations similar to the query by comparing the performance of M-ICL and a simple RICES KNN baseline described in Sec. 3.1.

Figure 7 illustrates that for classification, RICES KNN (blue) performs similarly to M-ICL when using the same demonstrations (orange), and outperforms the random sampling setup (green). In particular, RICES M-ICL struggles to surpass RICES KNN, and this is particularly visible for SST-2, where increasing the number of demonstrations decreases the performance of both the KNN and ICL (see Appendix 11).

To further show this majority voting effect, we observed that ensuring that the labels are uniformly distributed within the demonstrations degrades the performance of both M-ICL and the KNN (see Appendix 5). This suggests that M-ICL benefits from similar demonstrations by exploiting the distribution of context responses, rather than by actually learning. Put differently, in classification tasks, M-ICL’s effectiveness is comparable to that of a KNN, and M-ICL does not seem to be useful.

In open-ended generation tasks, i.e. captioning and visual question answering, majority voting is insufficient. Here the baseline method falls short against random sampling, and the RICES approach shows small improvements. Table 2 and Figure 8 show that performance correlates mainly with the similarity of the responses, while this is not the case for the images and texts. This holds more strongly with RICES, but is also present for random sampling in VQA.

To further analyze this phenomenon, we introduce Oracle RICES, which leverages the similarity metric $S^{R}_{iq}=s(R_{i},R_{q})$ that uses the ground truth response $R_{q}$. This approach enables us to select examples with responses that closely match the desired answer. In VQA, if "yes" is the correct answer, the chosen examples will all share this answer despite differences in image or text content. Figure 7 shows this method in pink: it significantly outperforms the other methods, providing an upper limit for the RICES approach. This in turn shows that (a) for open-ended generation, M-ICL can perform an intelligent soft copy when provided with close responses; and (b) the RICES similarity used here does not select enough demonstrations with a high target response similarity, which could substantially improve performance.

Dataset	Sampling	$I$	$T$	$R$	$IT$	$TR$	$IR$
COCO	Random	0.61	-	0.40	-	-	-0.73
COCO	RICES	-0.01	-	0.64	-	-	-0.22
VQA	Random	-0.55	-0.18	0.41	0.29	0.29	0.55
VQA	RICES	0.02	-1.00	0.98	0.24	0.24	0.17

Table 2: Influence of the demonstrations’ parts on performance. GLM coefficients (with the score as the response variable) of the similarities of the context image $I$, text $T$, and response $R$ with the target ones, as well as their interactions, i.e. Image*Text ($IT$), Text*Response ($TR$), Image*Response ($IR$). For each context, we select the maximum of each value across the demonstrations. All coefficients have a p-value $<0.001$.

5.2 M-ICL tends to copy recent similar responses

Another factor impacting the performance can be the ordering of the demonstrations. In Table 3, we compute the GLM coefficients for $S_{i}^{R}=s(R_{q},R_{i})$ when the performance is the response variable. For random sampling, we observe that this coefficient does not depend much on the position, while for RICES the coefficient increases from 0.01 (first rank) to 0.30 (4th rank) in captioning (and similarly in VQA, but to a lesser extent). As we saw earlier, this might be explained by the fact that this coefficient increases with more similar demonstrations. Another possibility is that M-ICL relies more on later ranks. The "RICES Reverse" rows suggest that the latter explanation is more plausible, since after reversing the RICES order the coefficient still increases (to some extent) with the rank of the demonstration.

Dataset	Sampling	$S^{R}_{1}\sim$ perf	$S^{R}_{2}\sim$ perf	$S^{R}_{3}\sim$ perf	$S^{R}_{4}\sim$ perf
COCO	Random	0.26	0.26	0.25	0.18
COCO	RICES	0.01	0.06	0.14	0.30
COCO	RICES Reverse	0.11	0.13	0.09	0.20
VQA	Random	0.10	0.13	0.22	0.21
VQA	RICES	0.10	0.15	0.16	0.20
VQA	RICES Reverse	0.15	0.21	0.19	0.06

Table 3: Influence of demonstrations on the performance based on their position. GLM coefficients (with the score as the response variable) of the similarity of each demonstration according to its position. All coefficients have a p-value $<0.01$.

To further analyze the impact of this recency phenomenon, we compare the outputs of the model against each demonstration’s output. We record a copy when there is a complete match between an entire demonstration’s response and the full output produced by the model. For multiple matches, only the last one is recorded. Yes/no answers are excluded since their frequency in VQA would skew the results. This allows us to measure the extent to which a demonstration response is replicated in the model output. Although we observed that exact copies are extremely rare for random sampling (not shown here), the RICES method shows a frequent replication of the last demonstration (as depicted by the bars in Figure 9). For VQA, the final context response is used in 12% of the cases, regardless of the number of shots. For captioning, this ranges from 24% with four shots to 4% with 32 shots. We compare with a variation of RICES where demonstrations are arranged from most to least similar (depicted by the lines). In this setup, the model less frequently replicates the last output, yet the same trend remains, indicating that the model tends to replicate the outputs of the more recent demonstrations over the more similar ones. This demonstrates that when M-ICL is faced with similar demonstrations, a recency bias leads it to replicate the output of the latest ones rather than the most similar.
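The copy measurement described above can be sketched as follows. The exact-match rule, the last-match rule, and the yes/no exclusion follow the text; lowercasing and stripping whitespace before comparison is an assumption of this sketch, and the function names are hypothetical.

```python
from collections import Counter

YES_NO = {"yes", "no"}

def copied_position(output, demo_responses):
    """Index of the last demonstration whose response equals the full output, else None."""
    out = output.strip().lower()  # normalization is an assumption of this sketch
    if out in YES_NO:
        return None               # yes/no answers are excluded
    match = None
    for pos, response in enumerate(demo_responses):
        if response.strip().lower() == out:
            match = pos           # a later match overwrites an earlier one
    return match

def copy_counts(samples):
    """Count, per context position, how often the model output copies that demonstration.

    samples: list of (model_output, demo_responses) pairs.
    """
    counts = Counter()
    for output, demo_responses in samples:
        pos = copied_position(output, demo_responses)
        if pos is not None:
            counts[pos] += 1
    return counts
```

Dividing these counts by the number of evaluated queries gives per-position replication rates; whether excluded yes/no answers stay in that denominator is not stated in the text, so the sketch returns raw counts.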

6 Conclusion

We propose a framework to study ICL in a multimodal context. Our study reveals that M-ICL is primarily text-driven, and that images in the context have little impact on the overall performance. This is exacerbated when using RICES to improve M-ICL. We also show that the success of similarity-based M-ICL is partially due to the fact that such techniques retrieve responses which are more similar to the target one, rather than merely retrieving more related demonstrations. The practical consequences are that for classification-based tasks, M-ICL is useless when using RICES, and that for open-ended generation, there is still a gap that could be exploited between RICES-retrieved responses and ideal ones. In addition, we show that M-ICL suffers from different biases, such as the tendency to replicate the last example in the demonstrations. Our work sheds light on several limitations and suggests that there is room for improvement regarding M-ICL. Current M-ICL improvements can be brought by M-ICL variants or prompting strategies [49, 33, 70], or better training datasets [75, 18]. Our work suggests that working on better retrieval and reducing the biases (e.g. recency) would also benefit this line of models. Finally, while our findings hold for the best open-source M-ICL models, we recognize it would be important to study more powerful models such as GPT4-V [37] and Gemini [53] to check whether our conclusions still hold.

7 Acknowledgments

This work was partly funded by the ANR-23-PEIA-0008, PEPR IA, project "Principes théoriques et algorithmiques de l’apprentissage frugal (SHARP)," and received computing AI and storage resources from GENCI at IDRIS on the Jean Zay supercomputer’s V100/A100 partition through grant 2023-AD011014764.

References

Agrawal et al. [2016] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. VQA: Visual Question Answering, 2016. arXiv:1505.00468 [cs].
Akyürek et al. [2023] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models, 2023. arXiv:2211.15661 [cs].
Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
Antol et al. [2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
Awadalla et al. [2023] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models, 2023. arXiv:2308.01390 [cs].
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Chen et al. [2023] Shuo Chen, Zhen Han, Bailan He, Mark Buckley, Philip Torr, Volker Tresp, and Jindong Gu. Understanding and Improving In-Context Learning on Vision-language Models, 2023. arXiv:2311.18021 [cs].
Chen et al. [2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015. arXiv:1504.00325 [cs].
Dai et al. [2023] Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can gpt learn in-context? language models secretly perform gradient descent as meta-optimizers. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4005–4019, 2023.
Dai et al. [2024] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
Dong et al. [2022] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. A Survey on In-context Learning, 2022.
Eichenberg et al. [2022] Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. Magma–multimodal augmentation of generative models through adapter-based finetuning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2416–2428, 2022.
Gao et al. [2024] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-Augmented Generation for Large Language Models: A Survey, 2024. arXiv:2312.10997 [cs].
Gui et al. [2022] Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander Hauptmann, Yonatan Bisk, and Jianfeng Gao. KAT: A knowledge augmented transformer for vision-and-language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 956–968, Seattle, United States, 2022. Association for Computational Linguistics.
Hendel et al. [2023] Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9318–9333, 2023.
Kiela et al. [2020] Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems, 33:2611–2624, 2020.
Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
Laurençon et al. [2024] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024.
Li et al. [2023a] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A Multi-Modal Model with In-Context Instruction Tuning, 2023a. arXiv:2305.03726 [cs].
Li et al. [2023b] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023b.
Li et al. [2019] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
Lin et al. [2022] Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, and Lu Yuan. Revive: Regional visual representation matters in knowledge-based visual question answering. Advances in Neural Information Processing Systems, 35:10560–10571, 2022.
Liu et al. [2023] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
Liu et al. [2024] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
Liu et al. [2022] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online, 2022. Association for Computational Linguistics.
Lu et al. [2022a] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2022a.
Lu et al. [2022b] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022b.
Lu et al. [2023] Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, and Iryna Gurevych. Are Emergent Abilities in Large Language Models Just in-Context Learning?, 2023. arXiv:2309.01809 [cs].
Lu et al. [2022c] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland, 2022c. Association for Computational Linguistics.
Luo et al. [2024a] Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language models. Advances in Neural Information Processing Systems, 36, 2024a.
Luo et al. [2024b] Man Luo, Xin Xu, Yue Liu, Panupong Pasupat, and Mehran Kazemi. In-context Learning with Retrieved Demonstrations for Language Models: A Survey, 2024b. arXiv:2401.11624 [cs].
Min et al. [2022] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics.
Mitra et al. [2023] Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. arXiv preprint arXiv:2311.17076, 2023.
Mizrahi et al. [2024] David Mizrahi, Roman Bachmann, Oguzhan Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, and Amir Zamir. 4m: Massively multimodal masked modeling. Advances in Neural Information Processing Systems, 36, 2024.
Mokady et al. [2021] Ron Mokady, Amir Hertz, and Amit H. Bermano. ClipCap: CLIP Prefix for Image Captioning, 2021. arXiv:2111.09734 [cs].
Olsson et al. [2022] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context Learning and Induction Heads, 2022. arXiv:2209.11895 [cs].
OpenAI [2024a] OpenAI. GPT-4 Technical Report, 2024a. arXiv:2303.08774 [cs].
OpenAI [2024b] OpenAI. Clip: Rendered sst2 dataset, 2024b. GitHub repository.
Pan et al. [2023] Jane Pan, Tianyu Gao, Howard Chen, and Danqi Chen. What in-context learning “learns” in-context: Disentangling task recognition and task learning, 2023.
Peng et al. [2023] Hao Peng, Xiaozhi Wang, Jianhui Chen, Weikai Li, Yunjia Qi, Zimu Wang, Zhili Wu, Kaisheng Zeng, Bin Xu, Lei Hou, and Juanzi Li. When does In-context Learning Fall Short and Why? A Study on Specification-Heavy Tasks, 2023. arXiv:2311.08993 [cs].
Plummer et al. [2015] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Rubin et al. [2022] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, Seattle, United States, 2022. Association for Computational Linguistics.
Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
Schwenk et al. [2022] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
Shao et al. [2023] Zhenwei Shao, Zhou Yu, Meng Wang, and Jun Yu. Prompting large language models with answer heuristics for knowledge-based visual question answering. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14974–14983, 2023.
Shukor et al. [2023a] Mustafa Shukor, Corentin Dancette, and Matthieu Cord. ep-alm: Efficient perceptual augmentation of language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22056–22069, 2023a.
Shukor et al. [2023b] Mustafa Shukor, Corentin Dancette, Alexandre Rame, and Matthieu Cord. Unival: Unified model for image, video, audio and language tasks. Transactions on Machine Learning Research Journal (TMLR), 2023b.
Shukor et al. [2024] Mustafa Shukor, Alexandre Rame, Corentin Dancette, and Matthieu Cord. Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
Singh et al. [2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
Socher et al. [2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. Parsing With Compositional Vector Grammars. In EMNLP. 2013.
Tai et al. [2023] Yan Tai, Weichen Fan, Zhao Zhang, Feng Zhu, Rui Zhao, and Ziwei Liu. Link-Context Learning for Multimodal LLMs, 2023. arXiv:2308.07891 [cs].
Team [2023] Gemini Team. Gemini: A Family of Highly Capable Multimodal Models, 2023. arXiv:2312.11805 [cs].
Tsimpoukelli et al. [2021] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. Multimodal Few-Shot Learning with Frozen Language Models. In Advances in Neural Information Processing Systems, pages 200–212. Curran Associates, Inc., 2021.
Vallaeys et al. [2024] Théophane Vallaeys, Mustafa Shukor, Matthieu Cord, and Jakob Verbeek. Improved baselines for data-efficient perceptual augmentation of llms. arXiv preprint arXiv:2403.13499, 2024.
Vedantam et al. [2015] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
Von Oswald et al. [2023] Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR, 2023.
Wang et al. [2023] Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, 2023.
Wang et al. [2022a] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022a.
Wang et al. [2022b] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, 2022b. arXiv:2108.10904 [cs].
Webson and Pavlick [2022] Albert Webson and Ellie Pavlick. Do Prompt-Based Models Really Understand the Meaning of Their Prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300–2344, Seattle, United States, 2022. Association for Computational Linguistics.
Wei et al. [2022a] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022a.
Wei et al. [2022b] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022b.
Wei et al. [2023a] Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc Le. Symbol tuning improves in-context learning in language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 968–979, Singapore, 2023a. Association for Computational Linguistics.
Wei et al. [2023b] Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. Larger language models do in-context learning differently, 2023b.
Xu et al. [2023] Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11445–11465, 2023.
Yadlowsky et al. [2023] Steve Yadlowsky, Lyric Doshi, and Nilesh Tripuraneni. Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models, 2023. arXiv:2311.00871 [cs, stat] version: 1.
Yang et al. [2022] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3081–3089, 2022.
Yoo et al. [2022] Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, and Taeuk Kim. Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2422–2437, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics.
Yu et al. [2022] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive Captioners are Image-Text Foundation Models, 2022. arXiv:2205.01917 [cs].
Yue et al. [2023] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, 2023. arXiv:2311.16502 [cs].
Zhang et al. [2024a] Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. MM-LLMs: Recent Advances in MultiModal Large Language Models, 2024a. arXiv:2401.13601 [cs].
Zhang et al. [2024b] Xingxuan Zhang, Jiansheng Li, Wenjing Chu, Junjia Hai, Renzhe Xu, Yuqing Yang, Shikai Guan, Jiazheng Xu, and Peng Cui. On the Out-Of-Distribution Generalization of Multimodal Large Language Models, 2024b. arXiv:2402.06599 [cs].
Zhao et al. [2023] Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning, 2023. arXiv:2309.07915 [cs].
Zhao et al. [2024] Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. MMICL: Empowering vision-language model with multi-modal in-context learning. In The Twelfth International Conference on Learning Representations, 2024.
Zhao et al. [2021] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate Before Use: Improving Few-shot Performance of Language Models. In Proceedings of the 38th International Conference on Machine Learning, pages 12697–12706. PMLR, 2021. ISSN: 2640-3498.

8 Appendix

8.1 Consideration on different behaviour of IDEFICS and OpenFlamingo

The two open-source models, IDEFICS [18] and OpenFlamingo [5], are both implementations of the model proposed in [3]. Despite sharing the same architecture, our analysis (Tables 4 and 7) reveals distinct behaviors between the two models under image removal or random image swapping. OpenFlamingo shows only a slight decrease in performance when images are removed or swapped compared to the golden prompt, indicating that it is barely affected by these perturbations: it recognizes the task but does not focus on the image-text mapping. IDEFICS, on the other hand, exhibits a larger performance drop without images, and with random images its performance degrades even further as the number of shots increases.

Dataset	Prompt	4 shots	8 shots	16 shots	32 shots
Flickr30k	W/o image	61.11	63.45	62.57	61.66
	Rnd. image	51.04	53.15	58.20	59.07
	Base	60.92	62.42	64.05	63.03
ImageNet 1k	W/o image	25.67	24.05	20.93	16.43
	Rnd. image	11.09	7.73	6.16	5.18
	Base	22.55	21.54	18.73	16.11
MS-COCO	W/o image	83.33	88.68	93.36	94.50
	Rnd. image	76.31	84.89	90.55	NaN
	Base	84.43	91.34	96.52	NaN
rendered SST-2	W/o image	10.70	29.87	11.19	14.22
	Rnd. image	53.48	61.29	60.63	56.79
	Base	53.44	59.94	60.53	57.91

Table 4: Evaluation results using OpenFlamingo 9B with demonstrations sampled uniformly at random, across four image-to-text datasets, using 4, 8, 16, and 32 in-context demonstrations. We report the base prompt (Base) and two prompt modifications: removing the image (W/o image) or replacing it with a different random instance from the training dataset (Rnd. image).

The disparity in behavior between the two models can likely be attributed to differences in their training datasets. IDEFICS was trained on the OBELICS [18] dataset, which contains longer, more contextual texts and extracts data directly from the HTML DOM tree, thus providing cleaner data free from ads and spam. This method ensures higher document quality, comparable to renowned datasets like The Pile and Wikipedia. Furthermore, OBELICS addresses the issue of image duplication present in Multimodal C4, in which only 60% of images are unique, thus offering a higher quality and more efficient training dataset. In contrast, OpenFlamingo was trained on the shorter, less detailed texts of Multimodal C4.

Given that IDEFICS generally achieves better scores and is more responsive to ICL, we have chosen to focus our study on this model.

Comparison with Chen et al. [7]

The findings presented by Chen et al. [7] corroborate the behavioral differences between the two models that we observed. However, their study emphasizes the behavior of OpenFlamingo and concludes that ICL is primarily driven by text, as it appears insensitive to changes in images. Our observations regarding VQA align with this: ICL indeed seems to be driven predominantly by text. However, we note a different pattern in image-to-text tasks, where ICL does respond to visual elements. Nonetheless, when text is also available, it tends to become the dominant factor influencing the model's responses.

8.2 Balanced sampling

In Section 5.1, we showed that the performance of ICL with RICES improves significantly owing to a majority voting process that selects the most common label in the context. To better understand the impact of label imbalance, we conducted experiments on binary classification tasks, adjusting the sampling method to ensure an equal number of demonstrations from each class in the context. For random sampling, the demonstrations were arranged in no specific order, while for RICES, we selected the closest demonstrations from each class and sorted them by increasing similarity. As reported in Tab. 5, the order of performance from worst to best is: random sampling (reference point), balanced random sampling (+1.74% improvement), balanced RICES sampling (+8.40% improvement), and RICES sampling alone (+18.90% improvement). This suggests that while balancing the samples improves performance in random contexts, the balanced RICES approach yields roughly half the performance boost of RICES alone. Therefore, we can infer that while example similarity contributes to model performance, the distribution of labels also plays an important role.
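
The two balanced sampling schemes can be sketched as follows. This is an illustrative Python sketch rather than the code used in the experiments: the function names, the tuple layout of the candidate pool, and the assumption that a similarity score to the query (for instance, a CLIP-based one) has already been computed are all hypothetical.

```python
import random


def balanced_random(pool, n_shots, rng):
    """pool: list of (item, label) pairs. Draws the same number of
    demonstrations per class and returns them in no specific order."""
    by_label = {}
    for item, label in pool:
        by_label.setdefault(label, []).append((item, label))
    per_class = n_shots // len(by_label)
    picked = []
    for label in sorted(by_label):
        # Raises ValueError if a class has fewer than per_class items.
        picked.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(picked)
    return picked


def balanced_rices(scored_pool, n_shots):
    """scored_pool: list of (item, label, similarity_to_query) triples,
    with similarities assumed to be precomputed. Keeps the closest
    candidates of each class, then sorts by increasing similarity so
    that the most similar demonstration comes last in the context."""
    by_label = {}
    for item, label, sim in scored_pool:
        by_label.setdefault(label, []).append((item, label, sim))
    per_class = n_shots // len(by_label)
    picked = []
    for group in by_label.values():
        closest = sorted(group, key=lambda d: d[2], reverse=True)
        picked.extend(closest[:per_class])
    return sorted(picked, key=lambda d: d[2])
```

In both functions every class contributes `n_shots // n_classes` demonstrations, so a majority vote over the context labels is uninformative by construction.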

Dataset	Sampling	4 shots	8 shots	16 shots	32 shots
Hateful Memes	RICES	60.50	62.30	63.40	62.60
	Balanced rnd.	53.30	53.37	55.03	55.17
	Balanced RICES	54.60	56.10	58.30	57.70
	Random	50.57	50.93	52.00	53.77
rendered SST-2	RICES	75.80	84.14	82.84	80.18
	Balanced rnd.	57.07	57.27	58.11	58.85
	Balanced RICES	61.46	70.74	77.30	80.34
	Random	56.41	56.81	57.62	58.67

Table 5: Evaluation results using IDEFICS 9B across two binary classification vision-language datasets using 4, 8, 16, and 32 in-context demonstrations. We compare random sampling (Random), RICES, and their balanced counterparts.

Dataset	Sampling	4 shots	8 shots	16 shots	32 shots
CIFAR-100	R. image	72.24	74.06	75.68	77.20
CIFAR-100	Random	47.70	49.03	49.68	51.23
Flickr30k	RICES	45.48	54.54	60.21	64.03
Flickr30k	Random	61.69	62.12	60.83	61.89
Hateful Memes	RICES	60.50	62.30	63.40	62.60
	Random	50.57	50.93	52.00	53.77
	R. image	60.80	61.10	61.70	62.20
	R. OCR	59.80	61.70	62.30	62.80
ImageNet 1k	RICES	73.04	74.52	75.18	76.00
ImageNet 1k	Random	18.01	21.53	23.90	25.81
MMMU	RICES	22.60	26.60	20.20	NaN
	Random	23.93	26.60	15.67	NaN
	R. image	25.80	27.40	14.90	NaN
	R. question	24.80	24.30	20.20	NaN
MS-COCO	RICES	84.65	97.98	106.44	110.36
MS-COCO	Random	92.98	99.88	103.66	104.57
OK-VQA	RICES	42.47	44.70	46.87	48.84
	Random	39.54	41.85	42.58	43.68
	R. image	39.71	42.10	44.00	45.94
	R. question	42.35	44.92	46.74	48.02
ScienceQA	RICES	39.07	36.84	32.03	NaN
	Random	41.88	40.89	39.88	NaN
	R. image	39.07	36.14	33.76	NaN
	R. question	39.96	40.16	38.37	NaN
TextVQA	RICES	26.48	27.67	28.54	28.51
	Random	25.77	26.09	26.40	26.50
	R. image	26.39	27.38	28.24	28.33
	R. question	24.97	26.10	26.37	26.91
VQAv2	RICES	51.68	54.25	56.26	57.04
	Random	53.33	54.58	55.39	55.93
	R. image	52.92	54.57	55.98	57.20
	R. question	48.99	49.88	52.37	52.95
VizWiz	RICES	31.75	33.23	34.17	34.85
	Random	23.58	28.18	29.71	30.69
	R. image	32.45	34.61	34.82	35.03
	R. question	27.15	30.02	31.37	31.83
rendered SST-2	RICES	75.80	84.14	82.84	80.18
rendered SST-2	Random	56.41	56.81	57.62	58.67

Table 6: Full evaluation results using IDEFICS 9B and the base prompt across twelve vision-language datasets using 4, 8, 16, and 32 in-context demonstrations. We report the scores of random sampling (Random) and of RICES, either in its standard form or using only one modality in the similarity function (R. modality).

Dataset	Prompt	4 shots	8 shots	16 shots	32 shots
CIFAR-100	W/o image	36.03	37.84	39.21	40.57
	Rnd. image	20.05	11.67	6.65	4.79
	Base	47.70	49.03	49.68	51.23
Flickr30k	W/o image	41.78	45.05	50.15	54.16
	Rnd. image	48.54	40.99	34.29	37.75
	Base	61.69	62.12	60.83	61.89
Hateful Memes	W/o image	50.93	51.03	51.83	53.70
	Rnd. image	51.87	51.40	52.50	52.43
	Base	50.57	50.93	52.00	53.77
ImageNet 1k	W/o image	17.46	18.47	16.77	17.77
	Rnd. image	3.72	2.07	2.25	2.99
	Base	18.01	21.53	23.90	25.81
MS-COCO	W/o image	60.87	71.25	78.02	82.86
	Rnd. image	60.29	37.12	28.43	29.63
	Base	92.98	99.88	103.66	104.57
OK-VQA	W/o image	38.48	40.39	41.17	42.56
	W/o question	35.18	35.19	35.17	34.70
	Rnd. image	38.54	39.78	40.77	41.46
	Rnd. question	34.06	29.88	24.72	20.27
	Base	39.54	41.85	42.58	43.68
ScienceQA	W/o image	39.83	40.31	38.99	NaN
	W/o question	40.41	40.37	40.59	NaN
	Rnd. image	41.41	40.64	39.12	NaN
	Base	41.88	40.89	39.88	NaN
TextVQA	W/o image	25.08	24.69	24.71	24.38
	W/o question	22.66	22.90	23.08	22.58
	Rnd. image	24.25	24.26	24.22	24.08
	Rnd. question	23.49	23.23	22.92	22.87
	Base	25.77	26.09	26.40	26.50
VQAv2	W/o image	52.26	52.67	53.22	53.47
	W/o question	49.98	50.63	49.73	48.49
	Rnd. image	51.67	52.84	52.90	53.57
	Rnd. question	46.80	43.75	38.52	33.92
	Base	53.33	54.58	55.39	55.93
VizWiz	W/o image	21.96	27.44	30.90	30.89
	W/o question	20.36	25.02	29.63	31.55
	Rnd. image	22.94	27.22	28.97	28.65
	Rnd. question	22.08	24.40	24.60	23.59
	Base	23.58	28.18	29.71	30.69
rendered SST-2	W/o image	55.55	56.57	59.61	61.88
	Rnd. image	56.57	56.37	55.69	55.46
	Base	56.41	56.81	57.62	58.67

Table 7: Full evaluation results using IDEFICS 9B with demonstrations sampled uniformly at random across twelve vision-language datasets using 4, 8, 16, and 32 in-context demonstrations. We report the base prompt and various prompt modifications, such as removing one modality (either the image or the question) or replacing it with a different random instance from the training dataset.
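
To make the prompt modifications in this table concrete, the following Python sketch shows one way to build them. It is a hypothetical illustration, not the authors' pipeline: the `Demo` container, the `perturb` function, and the mode strings are invented names, images are represented by file paths, and only the demonstrations (not the query) are perturbed, which is an assumption.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Demo:
    image: Optional[str]     # e.g. a file path; None means the image is removed
    question: Optional[str]  # None when absent or removed
    answer: str


def _different_value(pool, field, current, rng):
    """Draw the given field from a training instance whose value differs
    from the current one. Raises IndexError if no such instance exists."""
    candidates = [getattr(t, field) for t in pool if getattr(t, field) != current]
    return rng.choice(candidates)


def perturb(demos, mode, train_pool, rng):
    """mode is 'base', 'drop:image', 'drop:question', 'random:image'
    or 'random:question' (hypothetical names for the table rows)."""
    if mode == "base":
        return list(demos)
    action, _, field = mode.partition(":")
    if field not in ("image", "question"):
        raise ValueError(f"unknown mode: {mode}")
    if action == "drop":
        return [replace(d, **{field: None}) for d in demos]
    if action == "random":
        return [
            replace(d, **{field: _different_value(train_pool, field, getattr(d, field), rng)})
            for d in demos
        ]
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, `drop:image` corresponds to the "W/o image" rows and `random:question` to the "Rnd. question" rows; the answers are left untouched in every mode.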

Dataset	Ordering	4 shots	8 shots	16 shots	32 shots
CIFAR-100	ascending	72.24	74.06	75.68	77.20
CIFAR-100	descending	70.46	72.36	73.84	74.92
Flickr30k	ascending	45.48	54.54	60.21	64.03
Flickr30k	descending	44.56	57.21	59.53	64.11
Hateful Memes	ascending	60.50	62.30	63.40	62.60
Hateful Memes	descending	59.70	58.10	59.10	59.20
ImageNet 1k	ascending	73.04	74.52	75.18	76.00
ImageNet 1k	descending	69.74	71.70	71.78	73.50
MMMU	ascending	22.60	26.60	20.20	NaN
MMMU	descending	26.00	26.60	22.80	NaN
MS-COCO	ascending	84.65	97.98	106.44	110.36
MS-COCO	descending	85.37	97.68	107.28	111.41
OK-VQA	ascending	42.47	44.70	46.87	48.84
OK-VQA	descending	41.09	43.54	46.16	48.88
ScienceQA	ascending	39.07	36.84	32.03	NaN
ScienceQA	descending	39.56	37.68	32.37	NaN
TextVQA	ascending	26.48	27.67	28.54	28.51
TextVQA	descending	26.42	26.14	26.61	27.40
VQAv2	ascending	51.68	54.25	56.26	57.04
VQAv2	descending	50.26	53.04	55.30	56.16
VizWiz	ascending	31.75	33.23	34.17	34.85
VizWiz	descending	31.33	33.26	34.80	35.66
rendered SST-2	ascending	75.80	84.14	82.84	80.18
rendered SST-2	descending	62.52	67.54	67.72	68.52

Table 8: Evaluation results using IDEFICS 9B and the base prompt across twelve vision-language datasets using 4, 8, 16, and 32 in-context demonstrations. Comparison of RICES with the default ordering of demonstrations (ascending similarity) and a variant with descending similarity ordering.

Dataset	Variant	4 shots	8 shots	16 shots	32 shots
CIFAR-100	Rnd. S. LMM	47.70	49.03	49.68	51.23
	RICES LMM	72.24	74.06	75.68	77.20
	RICES KNN	80.28	80.96	81.24	80.82
Flickr30k	Rnd. S. LMM	61.69	62.12	60.83	61.89
	RICES LMM	45.48	54.54	60.21	64.03
	RICES KNN	20.77	20.77	20.77	20.73
Hateful Memes	Rnd. S. LMM	50.57	50.93	52.00	53.77
	RICES LMM	60.50	62.30	63.40	62.60
	RICES KNN	63.00	63.40	62.40	60.20
ImageNet 1k	Rnd. S. LMM	18.01	21.53	23.90	25.81
	RICES LMM	73.04	74.52	75.18	76.00
	RICES KNN	78.58	79.46	79.52	78.90
MMMU	Rnd. S. LMM	23.93	26.60	15.67	NaN
	RICES LMM	22.60	26.60	20.20	NaN
	RICES KNN	3.10	3.10	2.90	NaN
MS-COCO	Rnd. S. LMM	92.98	99.88	103.66	104.57
	RICES LMM	84.65	97.98	106.44	110.36
	RICES KNN	57.69	57.90	59.00	61.55
OK-VQA	Rnd. S. LMM	39.54	41.85	42.58	43.68
	RICES LMM	42.47	44.70	46.87	48.84
	RICES KNN	13.86	14.46	15.14	15.35
ScienceQA	Rnd. S. LMM	41.88	40.89	39.88	NaN
	RICES LMM	39.07	36.84	32.03	NaN
	RICES KNN	30.29	29.10	29.55	NaN
TextVQA	Rnd. S. LMM	25.77	26.09	26.40	26.50
	RICES LMM	26.48	27.67	28.54	28.51
	RICES KNN	8.69	9.09	9.75	10.13
VQAv2	Rnd. S. LMM	53.33	54.58	55.39	55.93
	RICES LMM	51.68	54.25	56.26	57.04
	RICES KNN	38.01	42.01	43.12	42.25
VizWiz	Rnd. S. LMM	23.58	28.18	29.71	30.69
	RICES LMM	31.75	33.23	34.17	34.85
	RICES KNN	32.66	39.91	43.55	44.43
rendered SST-2	Rnd. S. LMM	56.41	56.81	57.62	58.67
	RICES LMM	75.80	84.14	82.84	80.18
	RICES KNN	92.26	87.12	82.96	78.38

Table 9: Evaluation results using IDEFICS 9B across twelve vision-language datasets using 4, 8, 16, and 32 in-context demonstrations. We report M-ICL with random sampling (Rnd. S. LMM), M-ICL with RICES sampling (RICES LMM), and the majority voting baseline (RICES KNN).
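
The majority voting baseline (RICES KNN) involves no language model at all: it predicts the most frequent response among the retrieved demonstrations. A minimal Python sketch is given below; the function name and the tie-breaking rule (preferring the most similar demonstration, assumed to come last in the list) are assumptions, since the text does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(retrieved_answers):
    """retrieved_answers: responses of the retrieved demonstrations,
    assumed to be ordered by increasing similarity to the query."""
    if not retrieved_answers:
        raise ValueError("need at least one retrieved demonstration")
    counts = Counter(retrieved_answers)
    best = max(counts.values())
    # Assumed tie-break: scan from the most similar demonstration (last)
    # to the least similar one and return the first most-frequent answer.
    for answer in reversed(retrieved_answers):
        if counts[answer] == best:
            return answer
```

For open-ended tasks such as captioning, retrieved responses rarely repeat, so under this tie-break the baseline effectively returns the response of the nearest neighbor.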

Dataset	Prompt	4 shots	8 shots	16 shots	32 shots
CIFAR-100	W/o image	68.28	69.88	69.68	70.80
	Rnd. image	70.59	71.71	72.07	72.18
	Rnd. label	9.91	3.63	2.09	1.72
	Base	72.24	74.06	75.68	77.20
Flickr30k	W/o image	30.75	36.66	43.75	47.83
	Rnd. image	38.88	46.78	54.52	58.04
	Rnd. label	26.80	26.40	24.12	26.42
	Base	45.48	54.54	60.21	64.03
Hateful Memes	W/o image	60.10	62.80	64.60	65.10
	Rnd. image	60.00	62.17	63.27	62.70
	Rnd. label	54.77	54.10	54.43	53.67
	Base	60.50	62.30	63.40	62.60
ImageNet 1k	W/o image	68.94	69.28	69.02	70.42
	Rnd. image	72.41	72.79	71.84	72.67
	Rnd. label	2.32	0.51	0.23	0.14
	Base	73.04	74.52	75.18	76.00
MMMU	W/o image	22.40	26.40	19.90	NaN
	Rnd. image	22.47	26.47	19.67	NaN
	Rnd. label	22.47	26.47	19.90	NaN
	Base	22.60	26.60	20.20	NaN
MS-COCO	W/o image	67.58	77.81	88.01	93.93
	Rnd. image	75.42	88.40	98.06	103.08
	Rnd. label	29.19	21.85	18.34	19.64
	Base	84.65	97.98	106.44	110.36
OK-VQA	W/o image	39.69	41.11	42.76	44.82
	W/o quest.	34.50	33.77	34.47	34.44
	Rnd. image	37.83	40.37	43.01	44.72
	Rnd. label	17.80	8.60	3.06	1.02
	Rnd. quest.	34.11	30.20	26.66	23.18
	Base	42.47	44.70	46.87	48.84
ScienceQA	W/o image	38.70	37.98	33.07	NaN
	W/o quest.	39.56	41.45	38.92	NaN
	Rnd. image	38.13	36.42	30.52	NaN
	Rnd. label	39.07	36.84	32.52	NaN
	Rnd. quest.	44.04	40.92	35.35	NaN
	Base	39.07	36.84	32.03	NaN
TextVQA	W/o image	19.68	19.43	19.94	20.36
	W/o quest.	19.05	18.90	19.04	18.51
	Rnd. image	19.77	20.17	20.91	20.97
	Rnd. label	11.97	9.44	6.96	5.56
	Rnd. quest.	22.82	23.17	23.63	23.71
	Base	26.48	27.67	28.54	28.51
VQAv2	W/o image	51.43	52.32	53.39	54.07
	W/o quest.	48.47	48.90	48.96	48.38
	Rnd. image	50.21	52.63	54.08	55.22
	Rnd. label	31.87	28.32	26.24	25.20
	Rnd. quest.	48.42	48.24	46.10	44.54
	Base	51.68	54.25	56.26	57.04
VizWiz	W/o image	28.89	29.95	30.98	32.29
	W/o quest.	26.62	29.39	32.54	34.17
	Rnd. image	30.67	32.51	32.56	31.98
	Rnd. label	17.11	17.86	18.70	16.78
	Rnd. quest.	31.21	31.37	29.81	28.69
	Base	31.75	33.23	34.17	34.85
rendered SST-2	W/o image	72.56	70.36	67.00	64.10
	Rnd. image	72.35	70.93	68.11	64.83
	Rnd. label	51.97	51.89	52.53	52.41
	Base	75.80	84.14	82.84	80.18

Table 10: Full evaluation results using IDEFICS 9B with demonstrations sampled with RICES across twelve vision-language datasets with 4, 8, 16, and 32 in-context demonstrations. Rows correspond to prompt modifications: removing one modality (either the image or the question) or replacing the image, the question, or the label with that of a different random instance from the training dataset. Base denotes the unmodified prompt.
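The prompt modifications in Table 10 can be sketched as simple transformations of the demonstration list. The code below is an illustrative sketch under several assumptions: the `Demo` container, the `ablate` function, and the mode names are hypothetical, images are represented by placeholder strings, and the modifications are applied to the demonstrations only.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Demo:
    image: Optional[str]     # stand-in for an image, e.g. a file path
    question: Optional[str]  # None for tasks without a question
    answer: str


def ablate(demos: Sequence[Demo], mode: str, pool: Sequence[Demo],
           rng: random.Random) -> List[Demo]:
    """Apply one prompt modification to every demonstration."""
    out = []
    for demo in demos:
        # Replacement content comes from a *different* training instance.
        others = [p for p in pool if p != demo]
        if not others:
            raise ValueError("pool must contain another instance")
        donor = rng.choice(others)
        if mode == "base":
            out.append(demo)
        elif mode == "wo_image":
            out.append(replace(demo, image=None))
        elif mode == "wo_quest":
            out.append(replace(demo, question=None))
        elif mode == "rnd_image":
            out.append(replace(demo, image=donor.image))
        elif mode == "rnd_quest":
            out.append(replace(demo, question=donor.question))
        elif mode == "rnd_label":
            out.append(replace(demo, answer=donor.answer))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out


pool = [Demo("a.jpg", "What is shown?", "x"),
        Demo("b.jpg", "What color is it?", "y"),
        Demo("c.jpg", "How many are there?", "z")]
print(ablate(pool[:2], "rnd_label", pool, random.Random(0)))
```

Each mode changes exactly one field and leaves the others intact, so a drop in score relative to Base can be attributed to that field.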

Dataset	Zero-shot score
ScienceQA	36.39
MMMU	4.37
MS-COCO	38.94
Flickr30k	19.44
OK-VQA	10.29
VQAv2	6.66
VizWiz	2.16
ImageNet 1k	16.98
Hateful Memes	0.00
TextVQA	7.66
rendered SST-2	0.02
CIFAR-100	39.98

Table 11: Full evaluation results using IDEFICS 9B across twelve vision-language datasets using no demonstrations.

Dataset	Oracle RICES score
CIFAR-100	91.98
Flickr30k	76.53
Hateful Memes	100.00
ImageNet 1k	99.56
MMMU	19.30
MS-COCO	139.03
OK-VQA	75.90
ScienceQA	35.05
TextVQA	49.79
VQAv2	82.97
VizWiz	44.26
rendered SST-2	100.00

Table 12: Evaluation results using IDEFICS 9B with 16 in-context demonstrations across twelve vision-language datasets, where demonstrations are sampled with Oracle RICES, i.e., RICES with similarity computed on the ground truth.
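As a rough illustration of retrieval that uses the ground truth as the similarity signal, the sketch below ranks training instances by how close their responses are to the ground-truth response of the query. The similarity function is an assumption: this section does not state which measure is used, so a token-level Jaccard overlap stands in for it here, and the names `jaccard` and `oracle_select` are hypothetical.

```python
from typing import List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap (assumed stand-in for the similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def oracle_select(query_answer: str, train_answers: Sequence[str],
                  k: int) -> List[int]:
    """Return indices of the k training responses closest to the ground truth.

    Python's sort is stable, so ties keep their original order.
    """
    ranked = sorted(range(len(train_answers)),
                    key=lambda i: jaccard(query_answer, train_answers[i]),
                    reverse=True)
    return ranked[:k]


print(oracle_select("a red bus",
                    ["a blue car", "a red bus", "red bus parked"], k=2))
```

Because the ground truth of the query is not available at test time, such a selection is only meaningful as an upper-bound analysis, not as a deployable retrieval strategy.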