License: arXiv.org perpetual non-exclusive license
arXiv:2404.07983v1 [cs.CV] 11 Apr 2024
¹ University of Freiburg, ² Bosch Center for Artificial Intelligence
Email: {schrodi,hoffmann,argusm,brox}@cs.uni-freiburg.de, [email protected]

Two Effects, One Trigger:
On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning

Simon Schrodi¹ (ORCID 0009-0003-7006-953X), David T. Hoffmann¹,² (ORCID 0009-0002-4942-814X), Max Argus¹ (ORCID 0000-0002-1288-7476), Volker Fischer² (ORCID 0000-0001-5437-4030), Thomas Brox¹ (ORCID 0000-0002-6282-8861)

Abstract

Contrastive vision-language models like CLIP have gained popularity because their learned representations are applicable to a wide variety of downstream tasks. Despite their successes in some tasks, like zero-shot image recognition, they perform surprisingly poorly on other tasks, like attribute detection. Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and to a bias towards objects over other factors, such as attributes. In this work, we investigate both phenomena. We find that only a few embedding dimensions drive the modality gap. Further, we propose a measure for object bias and find that object bias does not lead to worse performance on other concepts, such as attributes. But what leads to the emergence of the modality gap and object bias? To answer this question, we carefully designed an experimental setting that allows us to control the amount of shared information between the modalities. This revealed that the driving factor behind both the modality gap and the object bias is the information imbalance between images and captions.

Keywords: Contrastive vision-language representation learning · Information imbalance · Modality gap · Object bias

* Equal contribution.

1 Introduction

Vision-language models have become increasingly popular and are successfully applied to numerous tasks. They benefit from their ability to exploit weak supervision, which can be acquired by scraping the internet for image-text pairs. Models are trained with contrastive [46, 27] or captioning-based pretraining [16, 49, 58]. Subsequent works improved reproducibility [11] and training efficiency [71, 34, 70, 33, 31, 55]. Despite weaker supervision, these models show intriguing properties: strong zero-shot image recognition performance [46, 43], cross-modal understanding [8] and retrieval [40], or robustness [44].

Despite these remarkable improvements and the widespread usage, our understanding of the representations learned by vision-language models is still in its infancy. For instance, a recent work identified a modality gap in the shared embedding space [36], while other work conjectured about a bias towards objects [3]. But how bad are these effects? Neither the consequences nor the underlying triggers are fully understood. In this paper, we thoroughly compare the learned embeddings for both modalities, study the two phenomena, modality gap and object bias, and identify their common trigger.

The modality gap discovered by Liang et al. [36] is defined as image and text embeddings occupying separate regions of the shared embedding space. They attributed the phenomenon to the cone effect during model initialization and the contrastive loss preserving the gap, while subsequent work studied the influence of the Softmax temperature [59, 51]. In this paper, we show that a few embedding dimensions drive the modality gap and that it is caused by an information imbalance between images and their captions. While we find that increases of the modality gap correlate with improvements in downstream performance, the effect of the modality gap on downstream performance is small compared to other components, such as model size or dataset properties. For instance, when we control for the datasets, we indeed observe that the modality gap decreases as performance improves. Further, we find that image and text embeddings exhibit different biases and that neighborhood orderings vary between the modalities.

Figure 1: Illustration of information imbalance between images (top left) and captions (bottom left). This makes it virtually impossible for the image encoder (top right) to know what a (sparse) caption may contain. Consequently, it focuses on the most salient objects due to their high probability of being present in the caption and tends to neglect other more unlikely factors, such as attributes.

Beyond the above cross-modal analyses, recent work hypothesized that vision-language models exhibit a bias towards objects [3]. To formalize the notion of “a bias towards objects”, we propose a metric, Matching Object Attribute Distance (MOAD), which assesses the bias towards objects compared to other factors, such as attributes. We confirm that contrastive vision-language models are indeed more biased towards objects than attributes. However, we observe that performance on attributes positively correlates with performance on objects. This suggests that improvements on object tasks also lead to improvements on attribute tasks. Further, we find that the bias does not stem from the global word frequency of the training dataset but from a per-sample caption presence bias, i.e., models are more biased towards words that are more consistently mentioned in the captions, as objects typically are.

Finally, we identify the common trigger for both the modality gap and object bias: information imbalance. Information imbalance describes the availability of more information for one modality in comparison to the other. For example, for vision-language models, captions are sparse (lossy) descriptions of images and determine the focal point, whereas images contain much more information. As a result, image encoders cannot know what information of the image they need to encode to align the image encoding with the text encoding for some unknown caption; see Fig. 1 for an illustration. The best that the image encoders can do is to focus on the most salient parts of the image that are typically present in captions, e.g., objects. Consequently, the encoder exhibits a bias towards these parts of the captions. Similarly, the modality gap emerges as a by-product of the contrastive optimization under the information-imbalanced regime. Here, the model trades off alignment, which is limited due to the information imbalance, against uniformity by making all images and all texts more dissimilar (i.e., the modality gap). Besides that, a reduction of information imbalance through caption enrichment also improves zero-shot downstream performance. We believe that the above findings help to better understand the role of sparse captions in contrastive vision-language representation learning.

In summary, the contributions of our analysis paper are as follows: 1) The modality gap is driven by only a few embedding dimensions. 2) For off-the-shelf contrastive vision-language models, the modality gap and downstream performance are positively correlated due to common confounders. Controlling for more confounders indicates that a lower modality gap indeed correlates with higher performance. 3) Image and text embeddings have distinct characteristics despite being coupled via the cross-modal loss. 4) Object bias is caused by a higher per-sample caption presence bias. 5) Improvements on object tasks yield improvements on attribute tasks. 6) An information imbalance between the modalities leads to both the modality gap and the object bias.

2 Related Work

Contrastive vision-language representation learning has recently emerged as an effective technique to learn representations with weak supervision that work for a wide range of tasks and have intriguing properties, such as strong zero-shot abilities. However, our understanding of the learned representation is in its infancy. For instance, recent work showed the presence of a modality gap and attributed it to the cone effect of model initialization and the contrastive loss [36]. Subsequent work explored the influence of the Softmax temperature [59, 51]. Other work found that the modality gap is orthogonal to the span of image and text embeddings [73]. In this work, we find that few dimensions are responsible for the separation of the modalities, and identify that information imbalance between images and text is the driving factor for the modality gap.

While some work found that large models, including vision-language models, close the gap to human perception [19, 32], other work found several failure modes of them [69, 4]. Other work studied the importance of data [44, 64], generalization/robustness [42, 14], analyzed the learned features/representations [20, 41, 47], compositionality [27, 13, 57], or learned abilities and (social) biases [1, 65, 72, 63, 52, 22]. It has also been discussed that vision-language models may be biased towards objects [3]. To study object bias, we introduce a measure to assess “bias towards objects”, affirming that they are indeed biased towards objects. But, we also find that improvements on object tasks correlate with improvements on attribute tasks. Lastly, we find that the bias stems from a per-sample caption presence bias caused by information imbalance.

Finally, theoretical work disentangled the InfoNCE (contrastive) loss [45] into an alignment and uniformity term [62] and showed the importance of shared task-relevant information between the modalities [61, 15, 35]. In this work, we connect information imbalance of task-relevant information to the two phenomena modality gap and bias towards objects.

3 Experimental Setup

Contrastive vision-language models. Unless stated otherwise, we used CLIP ViT-B/16 [46] and SigLIP ViT-B/16 [70] for our analyses. For our large-scale analyses, we used a total of 112 contrastive vision-language models provided by OpenCLIP [25, 11]. We distinguished between medium-scale (i.e., dataset size of $\leq 128\,\mathrm{M}$) and large-scale datasets.

Downstream evaluation tasks. We conducted our evaluations on ImageNet [48], MS COCO [37, 10], MIT-States [26], and UT-Zappos [67] using the standard evaluation protocols from the literature. For ImageNet, we used the CLIP-style prompts "a photo of a {obj}" [46] and computed the zero-shot (object) accuracy. For MS COCO, we prepended the prompt "a photo of" to the description of each image following Radford et al. [46] and used R@1 to assess zero-shot image-to-text retrieval performance (text-to-image retrieval yielded similar results). For MIT-States and UT-Zappos, we used the prompts "an image of a {attr} object" and computed the zero-shot attribute accuracy. We provide further evaluation details in Appendix 0.A.

Multi-modal Attributes and Digits (MAD). To understand the influence of data, we built a multi-modal dataset based on Morpho-MNIST [6] (a variation of MNIST [30]) with full control over the data-generating process, called Multi-modal Attributes and Digits (MAD). We used the following morphing or warping operations as latent factors (i.e., attributes): altering image thickness (thickening, thinning, no thickening/thinning), swelling (swelling, no swelling), and fractures (fracture, no fracture) from Castro et al. [6]. We added scaling (large, small), colors (gray, red, green, blue, cyan, magenta, yellow), and captions. Visual examples are provided in Fig. 2 and Appendix 0.A. To generate captions, we mapped the digit class and latent factors to words and chained them together in random order, e.g., 0-thickening-swelling-fractures-large-blue. Model and training details are provided in Appendix 0.B.

Figure 2: MAD examples.

To study the effect of information imbalance due to missing information in captions, we varied the number of attributes included in each caption, while ensuring that the digit remained consistently present. Importantly, the images remain unchanged, i.e., all latent factors still affect the images. For example, if we restrict each caption to one attribute (in addition to the digit), the above (full) caption reduces to, e.g., 0-blue or 0-large.
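The caption construction described above can be sketched as follows. This is a minimal illustration and not the authors' generation code: the helper name `make_caption`, the placement of the digit at the front (as in the examples), and the uniform sampling of the attribute subset are assumptions.

```python
import random

def make_caption(digit, attribute_words, num_attributes, rng):
    """Chain the digit with a random subset of attribute words (hypothetical helper)."""
    # Assumption: the digit is always kept and placed first, as in "0-blue".
    # rng.sample draws without replacement and returns the words in random order.
    chosen = rng.sample(attribute_words, k=num_attributes)
    return "-".join([str(digit)] + chosen)

rng = random.Random(0)
factors = ["thickening", "swelling", "fractures", "large", "blue"]

# Full caption: every latent factor is mentioned (no information imbalance).
full_caption = make_caption(0, factors, len(factors), rng)
# Sparse caption: only one attribute is mentioned, the image would stay unchanged.
sparse_caption = make_caption(0, factors, 1, rng)
```

Lowering `num_attributes` while leaving the images untouched is what increases the information imbalance in this setting.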

4 Cross-modal Disparities of Embeddings

Most previous work focused on improving downstream performance of contrastive vision-language models, while some works discovered intriguing phenomena or shortcomings of these models. However, we still lack a thorough understanding of the causes and effects of the learned representations. In an effort to enrich our understanding, we first analyze differences between image and text embeddings. We start by discovering that few embedding dimensions drive the modality gap (Sec. 4.1) and discuss the relation of the modality gap to downstream performance (Sec. 4.2). Finally, we study further similarities and differences of the embeddings (Sec. 4.3).

Revisiting the Modality Gap

Liang et al. [36] showed that embeddings of the modalities are located in completely separate regions of the embedding space and coined the phenomenon modality gap. They defined the modality gap distance as the L2-distance between the Means (L2M) of the embeddings:

$$\text{L2M} \coloneqq \left\lVert \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i - \frac{1}{n}\sum_{i=1}^{n}\mathbf{y}_i \right\rVert , \tag{1}$$

where $\mathbf{x}_i$ and $\mathbf{y}_i$ are the $i$-th L2-normalized image and text embeddings, respectively.
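As a minimal sketch of Eq. 1, L2M can be computed with NumPy as shown below. The function names `l2_normalize` and `l2m` and the toy embeddings are illustrative assumptions, not part of the original work.

```python
import numpy as np

def l2_normalize(z):
    # Scale every row (one embedding per row) to unit L2 norm.
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def l2m(image_emb, text_emb):
    """L2 distance between the means of the normalized embeddings (Eq. 1)."""
    x = l2_normalize(np.asarray(image_emb, dtype=float))
    y = l2_normalize(np.asarray(text_emb, dtype=float))
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))

# Toy example: images lie on the first axis, texts on the second axis.
images = np.array([[2.0, 0.0], [3.0, 0.0]])
texts = np.array([[0.0, 1.0], [0.0, 5.0]])
gap = l2m(images, texts)  # normalized means are [1, 0] and [0, 1]
```

In this toy case the two modality means are orthogonal unit vectors, so the gap equals the square root of 2.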

4.1 Few Embedding Dimensions Make Up the Modality Gap

(a) Abs. difference of means of embedding dims.
(b) Some dims. perfectly separate the modalities.
Figure 3: Few embedding dimensions separate the modalities. (a) We plot the absolute difference in the means of each embedding dimension between the modalities. Most dimensions are comparable, but a few dimensions have vastly different means between the modalities. (b) Pairs of these dimensions can perfectly separate the modalities (top: CLIP ViT-B/16, bottom: SigLIP ViT-B/16).
Figure 4: Successive ablation of embedding dimensions based on the sorting of embedding dimensions from Fig. 3(a) leads to a sharp drop, followed by a partial recovery of downstream performance, while the modality gap gradually closes.

To obtain better insights into the nature of the modality gap, we asked two questions: 1) Is the modality gap present in all dimensions or only a subset thereof? 2) Does post-hoc closing of the modality gap improve downstream performance?

We compared the distributions (means and variances) per embedding dimension between the modalities. Interestingly, a few embedding dimensions have stark differences in their means, while that difference is close to zero for most other dimensions (Fig. 3(a)). Moreover, we find that some dimensions exhibit substantial variance within one modality and negligible variance within the other, whereas this is reversed for other dimensions. Consequently, two of these embedding dimensions suffice to perfectly separate the modalities, as shown in Fig. 3(b). Moreover, these dimensions are by far the largest components of the image or text embeddings, respectively. Hence, they substantially influence the entire embedding and are responsible for the largest part of the measured modality gap (see the sharp drop of L2M in Fig. 4).

Takeaway 1: Few embedding dimensions drive the modality gap and two dimensions suffice to separate the modalities.

Does removing these dimensions close the gap and improve performance? One may suspect that ablating the dimensions with high contributions to the modality gap will (substantially) close the modality gap and lead to better downstream performance. To test this, we successively ablated these dimensions using the sorting from Fig. 3(a). We re-normalized the remaining dimensions and evaluated their downstream performance. Fig. 4 shows that the modality gap (L2M) indeed closes, but downstream performance initially decreases sharply before partially recovering as more dimensions are ablated.
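The ablation procedure can be sketched as follows. This is an illustration on synthetic data under stated assumptions, not the authors' evaluation code: the helper names are hypothetical, and the toy embeddings simply plant one modality-separating dimension (index 3) into otherwise random vectors.

```python
import numpy as np

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def l2m(x, y):
    # Eq. 1 for already normalized embeddings.
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))

def rank_dims_by_mean_gap(x, y):
    # Sort dimensions by the absolute difference of their per-modality means,
    # largest first.
    return np.argsort(-np.abs(x.mean(axis=0) - y.mean(axis=0)))

def ablate(z, dims):
    # Drop the given dimensions and re-normalize the remaining ones.
    keep = np.setdiff1d(np.arange(z.shape[1]), dims)
    return l2_normalize(z[:, keep])

# Synthetic stand-in for image/text embeddings (assumption, not real model output).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 16))
y = rng.normal(size=(200, 16))
x[:, 3] += 10.0  # one large dimension that separates the modalities
y[:, 3] -= 10.0
x, y = l2_normalize(x), l2_normalize(y)

order = rank_dims_by_mean_gap(x, y)
gap_before = l2m(x, y)
gap_after = l2m(ablate(x, order[:1]), ablate(y, order[:1]))
```

On this toy data, ablating the single top-ranked dimension already removes most of the measured gap. Downstream performance would have to be evaluated separately on the ablated embeddings.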

What is the mechanism that explains this observation? Note that the first embedding dimensions are the largest components of the embeddings (as discussed above). Hence, ablating and re-normalizing them causes substantial changes in cosine similarities and cross-modal neighborhoods. Consider the following example: let $i = [8, 0.6, 0.7, 0.3]^T$ be an image embedding, and let $t = [5, 0.13, 0.035, 0.02]^T$ and $t' = [5, 1.5, 0.7, 0.45]^T$ be the matching and non-matching text embedding, respectively. We have cosine similarities $d$ of $d(i, t) = 0.995$ and $d(i, t') = 0.975$. After ablating the first dimension, we have cosine similarities of $0.822$ and $0.917$, and hence the image-text alignment flipped. Finally, it flips back after also ablating the second dimension.
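The numbers in this example can be checked with a few lines of NumPy. The vectors are taken from the text; the variable names are illustrative. Since cosine similarity is invariant to scaling, no explicit re-normalization is needed after dropping dimensions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

i = np.array([8.0, 0.6, 0.7, 0.3])            # image embedding
t = np.array([5.0, 0.13, 0.035, 0.02])        # matching text embedding
t_prime = np.array([5.0, 1.5, 0.7, 0.45])     # non-matching text embedding

# results[k] holds (d(i, t), d(i, t')) after ablating the first k dimensions.
results = [(cosine(i[k:], t[k:]), cosine(i[k:], t_prime[k:])) for k in range(3)]
```

Rounded to three decimals, `results` is (0.995, 0.975) without ablation, (0.822, 0.917) after ablating the first dimension, and (0.993, 0.986) after also ablating the second one, so the ranking of the matching and non-matching text flips and then flips back.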

But why do these modality-separating embedding dimensions appear? As we will discuss in more detail in Sec. 6, maximization of the alignment term of the popular (contrastive) InfoNCE loss [45] (refer to Appendix 0.C for background) is limited by an information imbalance between the modalities, i.e., texts are sparse descriptions of the images. This makes it impossible to minimize the contrastive loss only by maximizing the similarity of the matching image-text pairs (the numerator of the loss, also called alignment term). To still minimize the InfoNCE loss, a model also maximizes its uniformity term (denominator). It can do so by making images and texts as dissimilar as possible. (Footnote 1: Note that the repulsive forces of the uniformity term of the InfoNCE loss solely act across the modalities and not within the modalities.) After all, the InfoNCE loss maximizes the similarity of the matching image-text pair relative to the similarity of the non-matching image-text pairs, not the absolute similarity. Thus, we hypothesize that models trade off alignment, which is limited due to information imbalance (cf. Sec. 6), in a subset of dimensions for higher uniformity to minimize the loss. To still achieve good alignment, the model uses few dimensions that will have a negligible effect on the alignment term but significantly increase the uniformity term (overall lower cosine similarity scores between images and texts). This hypothesis explains our observations in Fig. 3 (few large and modality-separating dimensions), Fig. 4 (recovery of performance when these dimensions are dropped but alignment is recovered), and Fig. 8 (modality gap closes as information imbalance reduces).
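To make the two parts of the loss concrete, the sketch below splits a one-directional (image-to-text) InfoNCE loss into the contribution of its numerator and of its denominator. It is a generic illustration and not the formulation from the appendix: the function name, the temperature value `tau=0.07`, and the random embeddings are assumptions. Both returned quantities are written as loss contributions, i.e., the loss is their sum.

```python
import numpy as np

def info_nce_terms(x, y, tau=0.07):
    """Split the image-to-text InfoNCE loss into numerator and denominator parts."""
    logits = (x @ y.T) / tau  # pairwise cosine similarities divided by temperature
    # Numerator part: small when matching pairs (the diagonal) are similar.
    neg_log_numerator = -np.mean(np.diag(logits))
    # Denominator part: small when images are dissimilar to all texts overall.
    row_max = logits.max(axis=1, keepdims=True)  # for a stable log-sum-exp
    log_sum_exp = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_denominator = np.mean(log_sum_exp)
    return float(neg_log_numerator), float(log_denominator)

# Random L2-normalized stand-ins for image and text embeddings (assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.normal(size=(8, 4))
y /= np.linalg.norm(y, axis=1, keepdims=True)

numerator_part, denominator_part = info_nce_terms(x, y)
loss = numerator_part + denominator_part
```

Because only the sum matters, a model that cannot lower the numerator part any further (limited alignment) can still reduce the loss by lowering the denominator part, which is the trade-off hypothesized above.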

4.2 Does the Modality Gap Harm Downstream Performance?

Figure 5: Relation between modality gap and downstream task performance. We plot downstream performance against the modality gap for a total of 112 contrastive vision-language models. Performance and the modality gap distances show a positive correlation, but downstream performance is largely influenced by training dataset size. To factor this out, we split the models into two groups based on dataset size, i.e., medium- and large-scale.

Table 1: Spearman rank correlation between downstream task performance and various factors for models trained on medium and large datasets (each cell: medium | large).
Downstream task   Modality gap (L2M)   Modality gap (RMG)   Model size      Embedding size   Dataset size
ImageNet          46.8 | 46.2          24.3 | 34.1          34.9 | 79.1      6.4 | 77.9       22.4 | -22.6
MS COCO           29.6 | 32.0          24.9 | 13.1         -14.8 | 77.2     54.0 | 75.2      -18.6 | -15.8

The influence of the modality gap on downstream performance is controversially discussed in the literature [53, 74, 36, 73]. In the experiments from Fig. 4, ablating dimensions closed the modality gap but did not improve the downstream performance. Similarly, Liang et al. [36] closed the gap by shifting the embeddings but found that an increase of the modality gap actually improved performance. This is in contrast to intuition, which suggests that image-text pairs should be close and a smaller modality gap should improve downstream performance.

To bring additional insights into this discussion, we evaluated 112 contrastive vision-language models provided by OpenCLIP [25, 11] on ImageNet classification and MS COCO image-to-text retrieval (text-to-image retrieval yielded similar results). We computed the modality gap distance with L2M (Eq. 1) proposed by Liang et al. [36]. Our results in Fig. 5 and Tab. 1 show that a larger L2M distance counter-intuitively correlates with downstream performance improvements. Also note the separation of models trained on medium-scale (i.e., $\leq 128\,\mathrm{M}$ image-text pairs) and large-scale data.

We suspect that L2M may be the cause for this counter-intuitive observation (refer to Appendix 0.D for a discussion of its limitations). Thus, we propose the alternative Relative Modality Gap (RMG) measure:

$$\text{RMG} \coloneqq \frac{\frac{1}{n}\sum\limits_{i=1}^{n} d(\mathbf{x}_i, \mathbf{y}_i)}{\frac{0.5}{n(n-1)}\Big(\sum\limits_{i,j=1;\, i\neq j}^{n} d(\mathbf{x}_i, \mathbf{x}_j) + \sum\limits_{i,j=1;\, i\neq j}^{n} d(\mathbf{y}_i, \mathbf{y}_j)\Big) + \frac{1}{n}\sum\limits_{i=1}^{n} d(\mathbf{x}_i, \mathbf{y}_i)} , \tag{2}$$

where $\mathbf{x}_i$ and $\mathbf{y}_i$ are the $i$-th L2-normalized image and text embeddings, respectively, and $d$ is some distance function (we used cosine dissimilarity scaled to $[0,1]$). Intuitively, the numerator takes a per-sample view, measuring the gap where it matters, and the denominator accounts for the effectively used space by setting the numerator in relation to the average distances within the modalities. We also add the distances of the matching image-text pairs to the denominator to scale the metric to $[0,1]$. However, even with this more sophisticated measure for the modality gap, we still observe a positive correlation, even though it became weaker (Fig. 5 and Tab. 1).
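A minimal NumPy sketch of the RMG measure is given below. It assumes that the numerator averages the distances of matching pairs and that the within-modality sums run over all pairs with different indices, as described in the text. The concrete scaling of the cosine dissimilarity, (1 - cos) / 2, is an assumption, as are the function names and the synthetic data.

```python
import numpy as np

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def cosine_dissimilarity(a, b):
    # Assumed scaling: (1 - cos) / 2 maps cosine similarity from [-1, 1] to [0, 1].
    return (1.0 - a @ b.T) / 2.0

def relative_modality_gap(image_emb, text_emb):
    x, y = l2_normalize(image_emb), l2_normalize(text_emb)
    n = x.shape[0]
    # Mean distance of matching image-text pairs (diagonal of the cross matrix).
    matching = np.mean(np.diag(cosine_dissimilarity(x, y)))
    # Average within-modality distance over all pairs with different indices.
    off_diag = ~np.eye(n, dtype=bool)
    within = 0.5 / (n * (n - 1)) * (
        cosine_dissimilarity(x, x)[off_diag].sum()
        + cosine_dissimilarity(y, y)[off_diag].sum()
    )
    return float(matching / (within + matching))

# Synthetic data (assumption): identical modalities versus shifted modalities.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 8))
shift = np.zeros(8)
shift[0] = 5.0
rmg_no_gap = relative_modality_gap(base, base)
rmg_gap = relative_modality_gap(base + shift, base - shift)
```

With identical image and text embeddings the measure is (numerically) zero, and it grows towards one when matching pairs are far apart relative to the spread within each modality.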

Our intuition can fool us when we assume that the modality gap causes a certain downstream performance independent of other factors. Other factors can overshadow the effect of the modality gap. As Tab. 1 reveals, downstream performance is much more influenced by the model and embedding size, particularly for models trained on large datasets. Besides that, when comparing only models trained on the same dataset (removing the dataset as a confounder), we observe the expected negative correlation for LAION-400M (-46.4), LAION-2B (-27.5), and WebLI (-58.3); see Appendix 0.D for details. OpenAI’s CLIP models still behave differently (38.3), possibly due to differences in the training protocols. We note that the grouping based on the datasets substantially reduced the number of available models and, thus, these correlations need to be taken with caution. However, as we will see in Sec. 6, the modality gap as well as downstream performance are indeed both affected by common third variables, e.g., dataset quality (or embedding size). Tab. 1 also suggests model size as a potential shared influential factor, but we leave further investigation of it for future work.

Takeaway 2: A larger modality gap positively correlates with downstream performance, yet there is no indication that this is a causal relationship; rather, there are common confounders.

4.3 Further Similarities and Differences of Cross-modal Embeddings

Beyond the modality gap, we find further similarities and differences of the embeddings by investigating the following aspects: 1) Do the directions have similar meaning? 2) Are the nearest neighbor relations the same? 3) Are the biases of the embeddings equally pronounced (refer to Sec. 5)?

To identify meaningful directions (1st question), we followed the ideal words approach of Trager et al. [57]: We paired all objects with all attributes in the captions, marginalized the attributes and subtracted the mean text embedding to get ideal object words, and vice versa. To get ideal image embeddings, we followed the same procedure but were limited by the available labeled images. We find low cosine similarities between ideal words and images (CLIP ViT-B/16: 0.19 for MIT-States, 0.16 for UT-Zappos; SigLIP ViT-B/16: 0.20, 0.16). However, when we correct them with the modality gap vector (mean difference vector between matching image and text embeddings) cosine similarities significantly increase (CLIP ViT-B/16: 0.56 for MIT-States, 0.40 for UT-Zappos; SigLIP ViT-B/16: 0.68, 0.56). Hence, ideal words and images are not aligned and it suggests that the embedding directions of each modality have different meanings when not corrected by the modality gap vector.
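The marginalization step for the ideal words can be sketched as follows. This is a simplified illustration under assumptions: it uses uniform averaging over attributes and objects, synthetic embeddings instead of real caption embeddings, and a hypothetical function name. The correction with the modality gap vector is not shown because its exact form is not specified here.

```python
import numpy as np

def ideal_words(pair_emb):
    """pair_emb[o, a] is the text embedding of the caption pairing object o with attribute a."""
    mean_emb = pair_emb.mean(axis=(0, 1))                 # mean text embedding over all pairs
    ideal_objects = pair_emb.mean(axis=1) - mean_emb      # marginalize over attributes
    ideal_attributes = pair_emb.mean(axis=0) - mean_emb   # marginalize over objects
    return ideal_objects, ideal_attributes

# Synthetic stand-in: 3 objects, 4 attributes, 16-dimensional embeddings.
rng = np.random.default_rng(0)
pair_emb = rng.normal(size=(3, 4, 16))
ideal_obj, ideal_attr = ideal_words(pair_emb)
```

Each row of `ideal_obj` is one direction per object. By construction the rows are centered, i.e., they sum to zero across objects.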

Table 2: Dissimilarity of neighborhood orderings in the embedding space using the normalized Kendall-Tau distance $\in [0,1]$. Higher normalized Kendall-Tau distance values indicate that the ranking of neighbors is altered more. For ImageNet-100, “s. $i$” indicates the $i$-th split.
          CIFAR-10   CIFAR-100   ImgNet-100 s. 1   ImgNet-100 s. 2   ImgNet-100 s. 3
CLIP      0.3399     0.4965      0.4975            0.5046            0.5081
SigLIP    0.5044     0.4981      0.5003            0.4965            0.4987

To test the similarity of neighborhood relations of the embeddings (2nd question), we computed the mean embedding for each class in CIFAR-10, CIFAR-100 [28], and three ImageNet-100 splits [24]. We computed the normalized Kendall-Tau distance ($\in [0,1]$), where the normalization accounts for the varying number of classes. Intuitively, it counts the percentage of bubble-sort swaps (w.r.t. all possible swaps) necessary to transform a nearest neighbor list of modality A to match the nearest neighbor list of modality B. Tab. 2 reveals that the neighborhood orderings are dissimilar between the modalities.
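The normalized Kendall-Tau distance between two rankings of the same items can be sketched in plain Python as below. The function name and the toy class lists are illustrative. How the nearest neighbor lists are built per class and aggregated over classes is not covered by this sketch.

```python
from itertools import combinations

def normalized_kendall_tau_distance(ranking_a, ranking_b):
    """Fraction of item pairs that appear in opposite order in the two rankings."""
    position_b = {item: k for k, item in enumerate(ranking_b)}
    n = len(ranking_a)
    # combinations yields pairs (u, v) with u ranked before v in ranking_a;
    # the pair is discordant if u comes after v in ranking_b.
    discordant = sum(
        1 for u, v in combinations(ranking_a, 2) if position_b[u] > position_b[v]
    )
    return discordant / (n * (n - 1) / 2)

neighbors_image = ["cat", "dog", "truck"]   # neighbor list in modality A (toy data)
neighbors_text = ["dog", "cat", "truck"]    # neighbor list in modality B (toy data)
distance = normalized_kendall_tau_distance(neighbors_image, neighbors_text)
```

Identical rankings give a distance of 0 and fully reversed rankings give 1. The toy lists above differ by one swap out of three possible pairs.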

Takeaway 3: Directions of image and text embeddings align when corrected by the modality gap vector and neighborhood relations vary between the modalities.

5 Object Bias Is a Caption Presence Bias

Object bias refers to the observation that contrastive vision-language models have high performance on downstream tasks mainly linked to objects, while achieving comparably worse performance on tasks linked to other latent factors, such as attributes [3]. However, solely assessing object bias based on worse performance on some attribute benchmark may be misleading, as the task just could be more difficult than an object-based task.

Instead, we propose a measure for object vs. attribute bias, denoted as Matching Object Attribute Distance (MOAD). MOAD quantifies how well a model distinguishes matching from non-matching images (or texts) of objects $o$ compared to attributes $a$. Matching images (texts) show the same object or attribute, whereas non-matching images (texts) show different objects or attributes. We define MOAD for L2-normalized image embeddings $\mathbf{x}$ as follows:

$$\begin{split}\text{MOAD}_{\text{img}} \coloneqq~ & \frac{1}{2|O|}\sum\limits_{o\in O}\left(\frac{1}{N_1}\sum\limits_{\substack{\mathbf{x}_i,\mathbf{x}_j\in X_o\\ i\neq j}} \mathbf{x}_i^T\mathbf{x}_j - \frac{1}{N_2}\sum\limits_{\mathbf{x}_i\in X_o,\,\mathbf{x}_j\in X_{\neg o}} \mathbf{x}_i^T\mathbf{x}_j\right)\\ -~ & \frac{1}{2|A|}\sum\limits_{a\in A}\left(\frac{1}{N_3}\sum\limits_{\substack{\mathbf{x}_i,\mathbf{x}_j\in X_a\\ i\neq j}} \mathbf{x}_i^T\mathbf{x}_j - \frac{1}{N_4}\sum\limits_{\mathbf{x}_i\in X_a,\,\mathbf{x}_j\in X_{\neg a}} \mathbf{x}_i^T\mathbf{x}_j\right) ,\end{split} \tag{3}$$

where $N_1, \dots, N_4$ are normalization factors, and $X_o, X_{\neg o}, X_a, X_{\neg a}$ are the sets of all images $\mathbf{x}$ that do (or do not) contain the object $o \in O$ or attribute $a \in A$, respectively. We similarly define $\text{MOAD}_{\text{txt}}$ for text embeddings $\mathbf{y}$. Positive values indicate a bias towards objects, negative values a bias towards attributes, and zero no bias.
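The MOAD measure of Eq. (3) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are assumed to be L2-normalized, the normalization factors $N_1, \dots, N_4$ are assumed to be the number of summed pairs (so that each term is a mean similarity), and every concept is required to have at least two positive and one negative sample. The names `separation`, `moad`, `object_masks`, and `attribute_masks` are hypothetical.

```python
import numpy as np


def separation(emb: np.ndarray, mask: np.ndarray) -> float:
    """Mean similarity within a concept minus mean similarity to all other samples."""
    pos, neg = emb[mask], emb[~mask]
    n = len(pos)
    if n < 2 or len(neg) == 0:
        raise ValueError("each concept needs >= 2 positive and >= 1 negative samples")
    gram = pos @ pos.T
    # Exclude the diagonal (i == j); assumes N1 = n * (n - 1) ordered pairs.
    within = (gram.sum() - np.trace(gram)) / (n * (n - 1))
    # Assumes N2 = |X_o| * |X_not_o|, i.e., a plain mean over all cross pairs.
    across = (pos @ neg.T).mean()
    return float(within - across)


def moad(emb: np.ndarray, object_masks: np.ndarray, attribute_masks: np.ndarray) -> float:
    """emb: (n, d) unit-norm embeddings; masks: boolean arrays of shape (n, |O|) and (n, |A|)."""
    obj = np.mean([separation(emb, object_masks[:, k]) for k in range(object_masks.shape[1])])
    att = np.mean([separation(emb, attribute_masks[:, k]) for k in range(attribute_masks.shape[1])])
    return float(0.5 * obj - 0.5 * att)


# Toy example: the embeddings cluster by object and ignore the attribute.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
object_masks = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=bool)
attribute_masks = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=bool)
score = moad(emb, object_masks, attribute_masks)  # positive: biased towards objects
```

Swapping the two mask arrays flips the sign of the score, which matches the reading that positive values indicate an object bias and negative values an attribute bias.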

(a) Object bias vs. downstream performance.
(b) Object vs. attribute performance.
Figure 6: Object bias and performance on attribute tasks. (a) We find a bias towards objects (mostly positive MOAD values) but only weak to no correlation with attribute performance. (b) We attribute this to the observation that performance improvements on object tasks (ImageNet, MS COCO) positively correlate with improvements on attribute tasks (MIT-States, UT-Zappos).

We analyzed the relation between MOAD and downstream performance in Fig. 6(a). As expected, the majority of contrastive vision-language models exhibit a bias towards objects (positive MOAD values). Notably, models trained on large-scale data exhibit a less pronounced object bias (smaller positive MOAD values) than models trained on medium-scale data. However, we find only a very weak to no correlation between MOAD and downstream performance for models trained on large-scale data. Fig. 6(b) shows that this can be attributed to the medium-to-strong correlations between performance on object tasks (ImageNet, MS COCO) and attribute tasks (MIT-States, UT-Zappos). This suggests that contrastive vision-language models tend to perform well or poorly on both types of tasks, rather than excelling in one while underperforming in the other.

Takeaway 4: Contrastive vision-language models trained on large-scale data tend to have a lower object bias than medium-scale models. However, there is no clear relation between object bias and downstream performance. This can be attributed to the observation that performance improvements on object tasks correlate with improvements on attribute tasks.

Where does the object bias come from?

(a) Object and attribute counts in LAION-2B.
(b) Object bias is a caption presence bias.
Figure 7: Object bias is caused by a per-sample caption presence bias. (a) Object bias is not caused by word frequency, since attributes appear more frequently than objects. We counted the objects and attributes from Bravo et al. [3] in the LAION-2B captions [50]. (b) We trained CLIP models on MAD and changed which factor is always in the caption (i.e., the caption presence bias). The resulting models are more biased towards the factor (e.g., color, thickness, …) that is always in the caption during training (left) and achieve higher performance on that factor (right). Hence, the bias towards objects is not caused by the salience of the object, but arises because captioners tend to name the object but only some of its attributes. The bias is larger for the image encoder, as it needs to match the most likely caption, while the text encoder can encode the entire information with less bias, as sketched in Fig. 1 and outlined in Sec. 6.

In principle, the word frequency of the training dataset could cause the object bias. However, Fig. 7(a) disproves this, as attributes are more frequent than objects in LAION-2B. Thus, we posit that the bias arises from the per-sample prevalence of objects in natural language, e.g., humans tend to describe the most salient object(s) and typically only a few of their attributes in a caption. (Footnote 2: Not all objects in a scene will be described, but objects are more consistently present than attributes.) To verify this hypothesis, we used MAD and redefined the prevalence of the latent factors. Specifically, we trained models in 5 settings. In each setting, a different factor (e.g., digit, swelling, fracture, etc.) was always present in the caption, while only one of the remaining factors was randomly sampled for each training sample. Fig. 7(b) confirms that the caption presence bias results in a bias towards objects for natural language captions. It also illustrates that models perform better on tasks for which the (task-relevant) information is always present in the captions.

Takeaway 5: Bias towards concepts, e.g., objects, is caused by their high presence probability in captions if said concept appears in the image.
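The caption construction used to vary the caption presence bias can be sketched as follows. This is only an illustration of the sampling scheme described above, assuming each sample is given as a dictionary from factor name to its word; the function name `presence_biased_caption` and the factor keys are hypothetical, and the word order is shuffled as for the MAD captions.

```python
import random


def presence_biased_caption(factors: dict, always_present: str, rng: random.Random) -> str:
    """Caption with the `always_present` factor plus one randomly chosen other factor."""
    others = [name for name in factors if name != always_present]
    extra = rng.choice(others)  # exactly one of the remaining factors is sampled
    words = [factors[always_present], factors[extra]]
    rng.shuffle(words)  # captions are treated as unordered word sets
    return "-".join(words)


# Hypothetical sample: a large, yellow, thickened, swollen, fractured 9.
sample = {
    "digit": "9",
    "thickness": "thickening",
    "swelling": "swelling",
    "fracture": "fracture",
    "scaling": "large",
    "color": "yellow",
}
rng = random.Random(0)
captions = [presence_biased_caption(sample, "color", rng) for _ in range(5)]
```

Here the color word appears in every caption, whereas each other factor, including the digit, appears only occasionally. This is the condition under which the trained model is expected to become biased towards color.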

6 Information Imbalance Triggers Modality Gap and Object Bias

The previous sections analyzed the differences between image and text embeddings, particularly the modality gap, and a bias towards objects. However, what is the underlying cause of their emergence? This section reveals one factor that creates both phenomena: information imbalance between images and texts due to sparse captions. That is, images contain all the information, while captions are an incomplete description of the images, as captions typically mention the most salient object(s) and only a handful of other factors, such as attributes. The information imbalance problem is illustrated in Fig. 1.

As a consequence of information imbalance, the image encoder can hardly align its embedding of an image with the text encoder's embedding of a matching caption, as it cannot know which latent factors are described in that caption. To still achieve sufficient alignment, the best both encoders can do is to focus on the latent factors that are likely present in the caption (e.g., objects), while neglecting other factors that are unlikely to be present (e.g., attributes). This results in the caption presence bias discussed above and, consequently, leads to a bias towards the words that are most likely present. In natural language descriptions, these are objects. Moreover, the model will maximize the uniformity of the image and text embeddings to minimize the total contrastive loss, as there is a shortcut to quickly achieve this during training: make images and texts maximally dissimilar. This creates the modality gap (refer to Sec. 4.1 for a thorough explanation).

To validate our hypothesis, we varied the information imbalance in MAD in a controlled setting. Specifically, we varied the number of attributes mentioned in the captions and ensured that the object (i.e., digit class) is always included; refer to Sec. 3 for details. Note that settings with high information imbalance (few attributes are present) resemble natural language captions, which typically mention only the most salient object(s) while neglecting most of the attributes. In contrast, settings with low information imbalance are closer to approaches using enriched captions. Fig. 8 shows the results and indeed validates our hypothesis.

(a) Effect of information imbalance.
(b) UMAP embeddings.
Figure 8: Increasing shared information between modalities improves the representations. To study the influence of information imbalance between the modalities, we control the number of attributes present in the captions in MAD (the image is always affected by all attributes). (a) As the amount of information shared between the modalities increases, the modality gap (I-II) and the bias towards objects (III-IV) decrease, while downstream accuracy improves (V-VI). (b) The contrastive loss is able to close the modality gap given full shared information between modalities, as illustrated by the UMAP embeddings after model initialization (top) and after training (bottom).

Information imbalance and modality gap. Fig. 8(a)(I-II) shows that the modality gap decreases with decreasing information imbalance. Moreover, even when there is a modality gap after model initialization, the contrastive loss is capable of substantially reducing it in the full information setting; see Fig. 8(b).

Information imbalance and object bias. Fig. 8(a)(III-IV) shows that the bias towards objects decreases as information imbalance decreases. The image encoder is also more biased towards objects than the text encoder (Fig. 6(a), 7(b), and 8(a)(II-III)), which is implied by our hypothesis.

Information imbalance and zero-shot performance. Fig. 8(a)(V-VI) shows that zero-shot digit (object) and attribute performance improve with decreasing information imbalance. This is also supported by the theoretical results of Daunhawer et al. [15], who showed that latent factors, i.e., objects or attributes, can be block-identified if they are shared between the modalities.

Takeaway 6: Information imbalance between the modalities leads to both the modality gap and object bias. Reducing the level of information imbalance leads to a smaller modality gap and a smaller object bias.

Embedding dimensionality. Fig. 8(a)(V-VI) shows that a higher embedding dimensionality also improves zero-shot downstream performance, even in the presence of substantial information imbalance. This suggests that increasing the embedding dimension could also be an effective way to improve zero-shot downstream performance. We leave further investigation for future work.

7 Discussion

We found that information imbalance between modalities results in poor downstream performance for the imbalanced factors and leads to the modality gap as well as a bias towards the factors that are always present in the captions. In contrast, when we ensure information balance with all task-relevant information, e.g., attributes, contrastive vision-language models achieve strong zero-shot downstream performance, a negligible modality gap, and less bias towards objects.

Synthetic vs. real data. We validated our hypothesis on our synthetic MAD dataset for small contrastive vision-language models. However, there are several differences between real and synthetic data, e.g., real images have substantially more latent factors (attributes, lighting, relations, etc.). Further, we acknowledge that models are typically trained in the large-scale setting, i.e., a large model trained on a large dataset with large amounts of compute. Nonetheless, we are confident that our findings go beyond the synthetic setting. Evidence for this comes from the aligned findings of concurrent empirical studies that showed the positive influence of data quality [44] or caption enrichment [66, 29] on the performance of contrastive vision-language models. In this work, we studied the upper bound of data quality and caption enrichment (i.e., all latent factors are described in the caption) and found it effective for learning better representations.

Beyond contrastive vision-language models. We focused on contrastive vision-language models due to their popularity. However, our analysis and conclusions are not specific to these modalities and generalize to multi-modal models trained on other input modalities. Another interesting future direction is to include recent large-scale captioning-based models [16, 49, 58] in our analysis. Unfortunately, without publicly accessible weights this is not possible for us.

8 Conclusion

This work investigated contrastive vision-language models to gain a better understanding of their characteristics. We found that the modality gap and a bias towards objects are both triggered by an information imbalance between modalities. Reducing this imbalance mitigates both and improves downstream performance. Surprisingly, we also found that only a few embedding dimensions drive the modality gap. Besides that, we introduced two novel measures to compare modality gaps across models and to quantify the notion of a “bias towards objects”. While we observed that a larger modality gap correlates with better downstream performance, we found that both the modality gap and downstream performance are substantially more influenced by other factors, such as the dataset quality. Finally, we confirmed that contrastive vision-language models have a bias towards objects but also found that improvements on object tasks positively correlate with improvements on attribute tasks.

Acknowledgments

This research was funded by the Bundesministerium für Umwelt, Naturschutz, nukleare Sicherheit und Verbraucherschutz (BMUV, German Federal Ministry for the Environment, Nature Conservation, Nuclear Safety and Consumer Protection) based on a resolution of the German Bundestag (67KI2029A), the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant number 417962828, and the Bosch Center for Artificial Intelligence.

References

  • [1] Agarwal, S., Krueger, G., Clark, J., Radford, A., Kim, J.W., Brundage, M.: Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications. arXiv (2021)
  • [2] Alabdulmohsin, I., Zhai, X., Kolesnikov, A., Beyer, L.: Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design. In: NeurIPS (2023)
  • [3] Bravo, M.A., Mittal, S., Ging, S., Brox, T.: Open-vocabulary Attribute Detection. In: CVPR (2023)
  • [4] Brody, J.: On the Potential of CLIP for Compositional Logical Reasoning. In: ICLP (2023)
  • [5] Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: COYO-700M: Image-Text Pair Dataset (2022), https://github.com/kakaobrain/coyo-dataset
  • [6] Castro, D.C., Tan, J., Kainz, B., Konukoglu, E., Glocker, B.: Morpho-MNIST: Quantitative Assessment and Diagnostics for Representation Learning. JMLR (2019)
  • [7] Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. In: CVPR (2021)
  • [8] Chen, R., Liu, Y., Kong, L., Zhu, X., Ma, Y., Li, Y., Hou, Y., Qiao, Y., Wang, W.: CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP. In: CVPR (2023)
  • [9] Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J., Ding, N., Rong, K., Akbari, H., Mishra, G., Xue, L., Thapliyal, A.V., Bradbury, J., Kuo, W., Seyedhosseini, M., Jia, C., Ayan, B.K., Ruiz, C.R., Steiner, A.P., Angelova, A., Zhai, X., Houlsby, N., Soricut, R.: PaLI: A Jointly-Scaled Multilingual Language-Image Model. In: ICLR (2023)
  • [10] Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv (2015)
  • [11] Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: CVPR (2023)
  • [12] Chopra, S., Hadsell, R., LeCun, Y.: Learning a Similarity Metric Discriminatively, with Application to Face Verification. In: CVPR (2005)
  • [13] Couairon, G., Douze, M., Cord, M., Schwenk, H.: Embedding Arithmetic of Multimodal Queries for Image Retrieval. In: CVPR (2022)
  • [14] Crabbé, J., Rodríguez, P., Shankar, V., Zappella, L., Blaas, A.: Robust multimodal models have outlier features and encode more concepts. arXiv (2023)
  • [15] Daunhawer, I., Bizeul, A., Palumbo, E., Marx, A., Vogt, J.E.: Identifiability Results for Multimodal Contrastive Learning. In: ICLR (2023)
  • [16] Desai, K., Johnson, J.: VirTex: Learning Visual Representations from Textual Annotations. In: CVPR (2021)
  • [17] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: ICLR (2020)
  • [18] Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al.: DataComp: In search of the next generation of multimodal datasets. In: Datasets and Benchmarks Track@NeurIPS (2023)
  • [19] Geirhos, R., Narayanappa, K., Mitzkus, B., Thieringer, T., Bethge, M., Wichmann, F.A., Brendel, W.: Partial success in closing the gap between human and machine vision. In: NeurIPS (2021)
  • [20] Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., Olah, C.: Multimodal Neurons in Artificial Neural Networks. Distill (2021)
  • [21] Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: AISTATS (2010)
  • [22] Hamidieh, K., Zhang, H., Hartvigsen, T., Ghassemi, M.: Identifying Implicit Social Biases in Vision-Language Models. Workshop@ICLR (2023)
  • [23] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: CVPR (2016)
  • [24] Hoffmann, D.T., Behrmann, N., Gall, J., Brox, T., Noroozi, M.: Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives. In: AAAI (2022)
  • [25] Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: OpenCLIP (2021). https://doi.org/10.5281/zenodo.5143773
  • [26] Isola, P., Lim, J.J., Adelson, E.H.: Discovering States and Transformations in Image Collections. In: CVPR (2015)
  • [27] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In: ICML (2021)
  • [28] Krizhevsky, A., Hinton, G., et al.: Learning Multiple Layers of Features from Tiny Images (2009)
  • [29] Lai, Z., Zhang, H., Wu, W., Bai, H., Timofeev, A., Du, X., Gan, Z., Shan, J., Chuah, C.N., Yang, Y., et al.: From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions. arXiv (2023)
  • [30] LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
  • [31] Li, X., Wang, Z., Xie, C.: CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy. Workshop@NeurIPS (2023)
  • [32] Li, X., Wang, Z., Xie, C.: Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? In: EMNLP (2023)
  • [33] Li, X., Wang, Z., Xie, C.: An Inverse Scaling Law for CLIP Training. In: NeurIPS (2023)
  • [34] Li, Y., Fan, H., Hu, R., Feichtenhofer, C., He, K.: Scaling Language-Image Pre-training via Masking. In: CVPR (2023)
  • [35] Liang, P.P., Deng, Z., Ma, M., Zou, J., Morency, L.P., Salakhutdinov, R.: Factorized Contrastive Learning: Going Beyond Multi-view Redundancy. In: NeurIPS (2023)
  • [36] Liang, V.W., Zhang, Y., Kwon, Y., Yeung, S., Zou, J.Y.: Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In: NeurIPS (2022)
  • [37] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: ECCV (2014)
  • [38] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: CVPR (2022)
  • [39] Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: ICLR (2019)
  • [40] Ma, Y., Xu, G., Sun, X., Yan, M., Zhang, J., Ji, R.: X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval. In: International Conference on Multimedia (2022)
  • [41] Materzyńska, J., Torralba, A., Bau, D.: Disentangling visual and written concepts in CLIP. In: CVPR (2022)
  • [42] Mayilvahanan, P., Wiedemer, T., Rusak, E., Bethge, M., Brendel, W.: Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity? In: ICLR (2024)
  • [43] Menon, S., Vondrick, C.: Visual Classification via Description from Large Language Models. In: ICLR (2023)
  • [44] Nguyen, T., Ilharco, G., Wortsman, M., Oh, S., Schmidt, L.: Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP. In: NeurIPS (2022)
  • [45] Oord, A.v.d., Li, Y., Vinyals, O.: Representation Learning with Contrastive Predictive Coding. arXiv (2018)
  • [46] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning Transferable Visual Models From Natural Language Supervision. In: ICML (2021)
  • [47] Rashtchian, C., Herrmann, C., Ferng, C.S., Chakrabarti, A., Krishnan, D., Sun, D., Juan, D.C., Tomkins, A.: Substance or Style: What Does Your Image Embedding Know? arXiv (2023)
  • [48] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet Large Scale Visual Recognition Challenge. IJCV (2015)
  • [49] Sariyildiz, M.B., Perez, J., Larlus, D.: Learning Visual Representations with Caption Annotations. In: ECCV (2020)
  • [50] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C.W., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S.R., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: LAION-5B: An open large-scale dataset for training next generation image-text models. In: Datasets and Benchmarks Track@NeurIPS (2022)
  • [51] Shi, P., Welle, M.C., Björkman, M., Kragic, D.: Towards understanding the modality gap in CLIP. In: Workshop@ICLR (2023)
  • [52] Shtedritski, A., Rupprecht, C., Vedaldi, A.: What does CLIP know about a red circle? Visual prompt engineering for VLMs. In: ICCV (2023)
  • [53] So, J., Oh, C., Lim, Y., Byun, H., Shin, M., Song, K.: Geodesic Multi-Modal Mixup for Robust Fine-Tuning. In: NeurIPS (2023)
  • [54] Sohn, K.: Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In: NeurIPS (2016)
  • [55] Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv (2023)
  • [56] Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: YFCC100M: The New Data in Multimedia Research. Communications of the ACM (2016)
  • [57] Trager, M., Perera, P., Zancato, L., Achille, A., Bhatia, P., Soatto, S.: Linear Spaces of Meanings: Compositional Structures in Vision-Language Models. In: ICCV (2023)
  • [58] Tschannen, M., Kumar, M., Steiner, A., Zhai, X., Houlsby, N., Beyer, L.: Image Captioners Are Scalable Vision Learners Too. In: NeurIPS (2023)
  • [59] Udandarao, V.: Understanding and Fixing the Modality Gap in Vision-Language Models (2022), Master’s thesis
  • [60] Visheratin, A.: NLLB-CLIP–train performant multilingual image retrieval model on a budget. arXiv (2023)
  • [61] Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., Locatello, F.: Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. In: NeurIPS (2021)
  • [62] Wang, T., Isola, P.: Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In: ICML (2020)
  • [63] Wu, C., Maji, S.: How well does CLIP understand texture? Workshop@ECCV (2022)
  • [64] Xu, H., Xie, S., Tan, X.E., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying CLIP Data. In: ICLR (2024)
  • [65] Yamada, Y., Tang, Y., Yildirim, I.: When are Lemons Purple? The Concept Association Bias of CLIP. In: EMNLP (2023)
  • [66] Yao, L., Chen, W., Jin, Q.: CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge. In: WWW (2023)
  • [67] Yu, A., Grauman, K.: Fine-Grained Visual Comparisons with Local Learning. In: CVPR (2014)
  • [68] Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: CoCa: Contrastive Captioners are Image-Text Foundation Models. TMLR (2022)
  • [69] Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It? In: ICLR (2022)
  • [70] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid Loss for Language Image Pre-Training. In: ICCV (2023)
  • [71] Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: LiT: Zero-Shot Transfer with Locked-image text Tuning. In: CVPR (2022)
  • [72] Zhang, R., Zeng, Z., Guo, Z., Li, Y.: Can Language Understand Depth? In: International Conference on Multimedia (2022)
  • [73] Zhang, Y., HaoChen, J.Z., Huang, S.C., Wang, K.C., Zou, J., Yeung, S.: Diagnosing and Rectifying Vision Models using Language. In: ICLR (2023)
  • [74] Zhou, C., Zhong, F., Öztireli, C.: CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable and Controllable Text-Guided Face Manipulation. In: SIGGRAPH (2023)

Appendix 0.A Evaluation Details

We ran our evaluations on ImageNet [48], MS COCO [37, 10], MIT-States [26], and UT-Zappos [67]. The datasets comprise 50000, 25000 (5000 images with 5 captions each), 12995, and 2914 test samples, respectively. ImageNet and MS COCO are standard datasets for the evaluation of object recognition and retrieval performance, respectively. We used the standard evaluation protocols to compute accuracy or image retrieval performance. MIT-States consists of 245 objects and 115 adjectives (attributes), while UT-Zappos consists of 12 shoe types with 16 fine-grained states ($\sim$ attributes). For both datasets, we assume that we do not know the object in a given image and only want to find the adjective or fine-grained state. We treated this as a classification problem, following previous work [57]. Note that these datasets implicitly assume that the adjectives are mutually exclusive per image. However, this is not necessarily true, as multiple adjectives or fine-grained states may be present in an image.
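Treating attribute recognition as a classification problem can be sketched as a nearest-prompt lookup in the shared embedding space. The snippet below is a generic zero-shot classification sketch, not the exact evaluation protocol used here: it assumes that one text embedding per adjective (or fine-grained state) has already been computed by the text encoder, and the function names and toy embeddings are hypothetical.

```python
import numpy as np


def zero_shot_predict(image_emb: np.ndarray, class_text_emb: np.ndarray) -> np.ndarray:
    """Return, per image, the index of the class prompt with the highest cosine similarity."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    return (img @ txt.T).argmax(axis=1)


def accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Fraction of images whose predicted class index matches the label."""
    return float((pred == target).mean())


# Toy embeddings: two images, two attribute prompts.
image_emb = np.array([[2.0, 0.1], [0.1, 3.0]])
class_text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
pred = zero_shot_predict(image_emb, class_text_emb)
acc = accuracy(pred, np.array([0, 1]))
```

Because the prediction is a single argmax per image, this setup inherits the mutual-exclusivity assumption noted above: an image showing several valid adjectives is still scored against one label.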

Contrastive vision-language model details. For our large-scale analyses, we used a total of 112 contrastive vision-language models trained across various datasets provided by OpenCLIP [25, 11] (https://github.com/mlfoundations/open_clip). It contains contrastive vision-language models, such as OpenAI’s CLIP [46], CLIP-A [33], EVA-CLIP [55], CoCa [68], NLLB-CLIP [60], or SigLIP [70]. Note that these models use various backbones, including ResNet [23], ConvNeXt [38], or ViT [17]. The models were trained on, e.g., OpenAI’s proprietary (400 M) WebImageText dataset [46], LAION-400M, LAION-2B, LAION-5B [50], Merged-2B (merge of 1.6 B samples from LAION-2B and 0.4 B samples from COYO-700M [5]) [55], WebLI [9], So-400M [2], MetaCLIP (400 M) [64], Conceptual 12M [7], YFCC (15 M) [56], CommonPool-s (max. 12.8 M; refer to Table 3 of Gadre et al. [18] for the details of filtering), CommonPool-m (max. 128 M), CommonPool-l (max. 1.28 B), CommonPool-xl (max. 12.8 B) [18], or DataPool-s (1.4 M), DataPool-m (14 M), DataPool-l (140 M), DataPool-xl (1 B) [18].


[Figure: causal graph with the nodes base image, class, thickness, swelling, fractures, scaling, color, and MAD image.]
Figure 9: Causal graph of MAD.
Figure 10: Example images with corresponding captions from our MAD dataset. Note that the words of the captions are shuffled during training. For example, the image in the first row and first column shows the digit 4 with unaltered thickness, no swelling, a fracture, scaled down, and colored gray.

Multi-modal Attributes and Digits (MAD). Our dataset Multi-modal Attributes and Digits (MAD) is based on Morpho-MNIST [6], a variation of MNIST [30]. The causal graph of the data-generating process of MAD is depicted in Fig. 9. We used the following words for digits (0, …, 9), image thickness (thickening, thinning, no thickthinning), swelling (swelling, no swelling), fractures (fracture, no fracture), scaling (large, small), and color (gray, red, green, blue, cyan, magenta, yellow). Thus, we have 16 different attribute words. Fig. 10 provides examples of image-text pairs of MAD.
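The word lists above can be written down directly, which makes the count of 16 attribute words easy to verify. The dictionary layout and the names `FACTORS`, `num_attribute_words`, and `num_combinations` are illustrative assumptions; only the words themselves come from the description above.

```python
import math

# Words per latent factor, as listed in the text.
FACTORS = {
    "digit": [str(d) for d in range(10)],
    "thickness": ["thickening", "thinning", "no thickthinning"],
    "swelling": ["swelling", "no swelling"],
    "fracture": ["fracture", "no fracture"],
    "scaling": ["large", "small"],
    "color": ["gray", "red", "green", "blue", "cyan", "magenta", "yellow"],
}

# Attribute words are all words except the digits: 3 + 2 + 2 + 2 + 7.
num_attribute_words = sum(len(words) for name, words in FACTORS.items() if name != "digit")

# Number of distinct factor combinations that a full caption can describe.
num_combinations = math.prod(len(words) for words in FACTORS.values())
```

With these lists, `num_attribute_words` evaluates to 16, and a full caption can describe 10 · 3 · 2 · 2 · 2 · 7 = 1680 distinct combinations of digit and attributes.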

In our experiments, we investigated information imbalance by restricting the number of attributes present within each caption. We provide examples below, where we successively reduce the amount of information within the captions, i.e., fewer latent factors (attributes) are present in the caption:

  • Full information setting (i.e., digit & all five attributes)

    • yellow-swelling-thickening-9-large-fracture

    • swelling-thickening-6-red-small-fracture

    • 5-large-yellow-no swelling-fracture-thinning

  • Partial information setting I (i.e., digit & four attributes)

    • yellow-swelling-thickening-9-large

    • swelling-thickening-6-red-small

    • 5-large-yellow-no swelling-fracture

  • Partial information setting II (i.e., digit & three attributes)

    • yellow-swelling-thickening-9

    • swelling-thickening-6-red

    • 5-large-yellow-no swelling

  • Partial information setting III (i.e., digit & two attributes)

    • yellow-swelling-9

    • swelling-thickening-6

    • 5-large-yellow

  • Partial information setting IV (i.e., digit & one attribute)

    • yellow-9

    • swelling-6

    • 5-large

Note that while all the latent factors, i.e., digit and all five attributes, still affect the generated image, the caption may only provide partial information, i.e., attributes are missing from the caption.
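The caption settings above can be sketched in a few lines of Python. This is a minimal illustration, not the generation code used for MAD: the helper `make_caption` is hypothetical, and we assume that the attributes kept in a caption are chosen uniformly at random, which the text does not specify. The vocabulary and the word shuffling follow the description above.

```python
import random

# Attribute words as listed in the dataset description.
ATTRIBUTES = {
    "thickness": ["thickening", "thinning", "no thickthinning"],
    "swelling": ["swelling", "no swelling"],
    "fractures": ["fracture", "no fracture"],
    "scaling": ["large", "small"],
    "color": ["gray", "red", "green", "blue", "cyan", "magenta", "yellow"],
}


def make_caption(digit, factors, num_attributes, rng):
    """Return the caption words for a digit and `num_attributes` attributes.

    Assumption: the attributes kept in the caption are drawn uniformly at
    random. The word order is shuffled, as stated for training captions.
    """
    kept = rng.sample(sorted(factors), num_attributes)
    words = [str(digit)] + [factors[name] for name in kept]
    rng.shuffle(words)
    return words


rng = random.Random(0)
# All five latent factors are always sampled; they all affect the image.
factors = {name: rng.choice(values) for name, values in ATTRIBUTES.items()}
for k in range(5, 0, -1):  # full information down to partial setting IV
    print(k, "-".join(make_caption(9, factors, k, rng)))
```

The latent factors are sampled once and only the caption is truncated, which mirrors the point that the image still depends on all factors while the caption carries partial information.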

Appendix 0.B Model and Training Details for the Experiments on Multi-modal Attributes and Digits

Model details. We used small CLIP models. Specifically, the ViT-based vision backbone comprises 6 layers, each with a dimensionality $d$ of 256 and $\lfloor d/64 \rfloor = 4$ heads. The transformer-based language backbone also comprises 6 layers, each with a dimensionality of 256 and 8 heads. We set the patch size to 7 and context length to 8. The vocabulary consists of 28 words, i.e., all the words for digits (10) and attributes (16), as well as a start and end symbol (2).
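The arithmetic behind these numbers can be checked with a short sketch. The class `TowerConfig` and all variable names are hypothetical and not taken from any CLIP implementation. The 28×28 input size is an assumption carried over from MNIST, and we assume each vocabulary word (including "no swelling") maps to a single token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TowerConfig:
    layers: int
    width: int
    heads: int


NUM_DIGIT_WORDS = 10
NUM_ATTRIBUTE_WORDS = 16
NUM_SPECIAL_TOKENS = 2  # start and end symbol

width = 256
vision = TowerConfig(layers=6, width=width, heads=width // 64)  # floor(d / 64)
text = TowerConfig(layers=6, width=width, heads=8)
vocab_size = NUM_DIGIT_WORDS + NUM_ATTRIBUTE_WORDS + NUM_SPECIAL_TOKENS

# Longest caption: digit plus five attributes, wrapped in start/end symbols.
longest_caption_tokens = 1 + 5 + NUM_SPECIAL_TOKENS

# Assumption: 28x28 inputs as in MNIST; the image size is not stated here.
image_size, patch_size = 28, 7
num_patches = (image_size // patch_size) ** 2

print(vision, text, vocab_size, longest_caption_tokens, num_patches)
```

Under these assumptions, the full-information caption exactly fills the context length of 8.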

Training details. We trained all models with a batch size of 128 for 200 epochs with a learning rate warm-up period of 5 epochs. We used AdamW [39] as optimizer with cosine annealing learning rate schedule [39]. We always selected the best performing learning rate across 3 learning rates $\{5\cdot 10^{-4},\; 5\cdot 10^{-5},\; 10^{-5}\}$, each trained with 3 random seeds. The best learning rate was selected by comparing average accuracies over the ideal word accuracy and average zero-shot accuracy on all attributes and the class label. For all of our results, we report the average over 3 random seeds.
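For illustration, a warm-up followed by cosine annealing can be written as below. This is a sketch under assumptions: the warm-up is taken to be linear from zero, the annealing is taken to end at zero, and the function name `learning_rate` is hypothetical. The actual schedule implementation may differ in these details.

```python
import math


def learning_rate(epoch, base_lr, warmup_epochs=5, total_epochs=200):
    """Learning rate at a (possibly fractional) epoch.

    Assumptions: linear warm-up from 0 to base_lr, then cosine annealing to 0.
    """
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


CANDIDATE_LRS = [5e-4, 5e-5, 1e-5]  # the three learning rates listed above
for lr in CANDIDATE_LRS:
    print(lr, [learning_rate(e, lr) for e in (0, 5, 100, 200)])
```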

Appendix 0.C Alignment and Uniformity Terms in InfoNCE

Contrastive representation learning [12, 21, 54, 45] leverages paired inputs as a weak supervision signal. The basic idea is to learn representations in a shared representation space that are as similar as possible for “positive/matching” pairs, while being as dissimilar as possible for “negative/non-matching” pairs. A popular choice for contrastive learning approaches is the InfoNCE objective [45]:

$$\mathcal{L}(f_x, f_y) \coloneqq \underset{(\mathbf{x}_i,\mathbf{y}_i)\sim p_{\text{data}}}{\mathbb{E}}\left[-\log\frac{\exp(f_x(\mathbf{x}_i)^T f_y(\mathbf{y}_i)/\tau)}{\exp(f_x(\mathbf{x}_i)^T f_y(\mathbf{y}_i)/\tau)+\sum\limits_{j=1}^{N-1}\exp(f_x(\mathbf{x}_i)^T f_y(\mathbf{y}_j)/\tau)}\right], \tag{4}$$

where $f_x, f_y$ are two encoders for the inputs $\mathbf{x}, \mathbf{y}$, $\tau$ is the scalar temperature, and $p_{\text{data}}$ is the data distribution over $\mathbb{R}^d \times \mathbb{R}^d$. Wang & Isola [62] define two components of the loss:

  • Alignment (numerator): matching pairs should be close, i.e., aligned.

  • Uniformity (denominator): representations should be roughly uniformly distributed on the unit hypersphere.
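The split into the two components can be made explicit by expanding the logarithm of the fraction in Eq. 4. The shorthand $s_{ij}$ is introduced here only for readability and is not part of the original notation; the expansion is a plain algebraic identity rather than the asymptotic analysis of Wang & Isola [62].

```latex
% Shorthand (an assumption for readability): s_{ij} = f_x(x_i)^T f_y(y_j) / \tau
\begin{align}
-\log\frac{\exp(s_{ii})}{\exp(s_{ii}) + \sum_{j=1}^{N-1}\exp(s_{ij})}
  &= \underbrace{-\,s_{ii}}_{\text{alignment (numerator)}}
   + \underbrace{\log\Big(\exp(s_{ii}) + \sum_{j=1}^{N-1}\exp(s_{ij})\Big)}_{\text{uniformity (denominator)}}
\end{align}
```

The first term decreases as the matching pair becomes more similar, while the second term decreases as the similarities to the non-matching samples shrink.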

For multi-modal contrastive representation learning, it is popular to use a symmetric version of the above InfoNCE objective [46]:

$$\mathcal{L}_{\text{sym}} = \frac{1}{2}\mathcal{L}(f_x, f_y) + \frac{1}{2}\mathcal{L}(f_y, f_x). \tag{5}$$

Note that the repulsive forces of the uniformity term only act across the modalities but not within them.
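A minimal NumPy sketch of Eq. 4 and Eq. 5 on a batch of already-computed embeddings is given below. The function names and the default temperature are assumptions for illustration, and the expectation over the data distribution is replaced by a mean over the batch. Note that the logits only contain image-text similarities, so negatives repel only across modalities.

```python
import numpy as np


def info_nce(zx, zy, tau=0.07):
    """One direction (x -> y) of InfoNCE for row-wise paired embeddings.

    zx, zy: arrays of shape (N, d); row i of zx matches row i of zy.
    The other N - 1 rows of zy act as negatives for row i of zx.
    """
    logits = zx @ zy.T / tau  # (N, N) cross-modal similarities only
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def symmetric_info_nce(zx, zy, tau=0.07):
    """Average of the x -> y and y -> x directions."""
    return 0.5 * info_nce(zx, zy, tau) + 0.5 * info_nce(zy, zx, tau)


# Two perfectly aligned, antipodal pairs versus a fully collapsed solution.
aligned = np.array([[1.0, 0.0], [-1.0, 0.0]])
collapsed = np.array([[1.0, 0.0], [1.0, 0.0]])
print(symmetric_info_nce(aligned, aligned, tau=1.0))
print(symmetric_info_nce(collapsed, collapsed, tau=1.0))
```

The aligned configuration attains a lower loss than the collapsed one, since collapsing all embeddings makes positives and negatives indistinguishable.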

Mismatches Can Cause Global Optima Exhibiting a Modality Gap

Figure 11: Visualization of the global minima of the contrastive loss under perfect or imperfect pairs in our toy 2D example. For perfect pairs (left), matching pairs are aligned and uniformly distributed. For imperfect pairs (right), the global minimum exhibits a modality gap. Here, $\mathbf{x}_i$ denotes sample $i$ from modality 1, e.g., images, and $\mathbf{y}_i$ sample $i$ from modality 2, e.g., text.

Going beyond the discussion in the main text, we illustrate below that the modality gap can even be present in the global optimum under certain data properties. For perfect image-text pairs, it is easy to see that the global minimum of the contrastive loss does not exhibit a modality gap. That is, all matching image-text pairs are perfectly aligned and the pairs are uniformly distributed on the unit hypersphere. But what happens in the presence of mismatches caused by, e.g., miscaptioning?

To illustrate the effect of such mismatches on the final image and text embeddings, we designed a 2D toy example. We generated two sets of points on the unit circle and directly optimized their positions. The points represent the embeddings of both modalities: $\{\mathbf{x}_1, \mathbf{x}_2\} = \mathbf{X}$ and $\{\mathbf{y}_1, \mathbf{y}_2\} = \mathbf{Y}$, where $\mathbf{x}_1, \mathbf{y}_1, \mathbf{x}_2, \mathbf{y}_2 \in \{[a,b]^T \mid a^2 + b^2 = 1\}$. Further, we specified which point of set $\mathbf{X}$ matches which point in set $\mathbf{Y}$. For the perfect matching setting, we considered the following matching pairs:

$$\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2)\}. \tag{6}$$

It is easy to see that the global minimum (up to rotations) is: $\mathbf{x}_1 = \mathbf{y}_1 = [1, 0]^T$, $\mathbf{x}_2 = \mathbf{y}_2 = [-1, 0]^T$; see Fig. 11 for a visualization.

However, what happens when we introduce mismatches? For image-text pairs, this can happen if a human annotator miscaptions an image. It could also stem from human annotators focusing on distinct aspects of the image. We considered the following matching pairs (footnote 4: we need the additional matching pairs of $(\mathbf{x}_1, \mathbf{y}_1)$ and $(\mathbf{x}_2, \mathbf{y}_2)$ to avoid the degenerate global minimum $\mathbf{x}_1 = \mathbf{y}_1 = \mathbf{x}_2 = \mathbf{y}_2$):

$$\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), (\mathbf{x}_2, \mathbf{y}_2), \underbrace{(\mathbf{x}_1, \mathbf{y}_2), (\mathbf{x}_2, \mathbf{y}_2)}_{\text{mismatches}}\}. \tag{7}$$

We recognize that identifying and removing the above mismatches may be simple in practice. However, our goal is to illustrate the impact of mismatches in a simplistic setting to provide an intuition on the behavior of the contrastive loss.

To search for the globally optimal embeddings for Eq. 7, we ran a grid search with an angular resolution of $6^\circ$. We found the following global minimum (again up to rotations): $\mathbf{x}_1 = [1, 0]^T$, $\mathbf{y}_1 = [\cos 276^\circ, \sin 276^\circ]^T$, $\mathbf{x}_2 = [\cos 42^\circ, \sin 42^\circ]^T$, $\mathbf{y}_2 = [\cos 126^\circ, \sin 126^\circ]^T$; see Fig. 11 for a visualization. It is apparent that the global minimum exhibits a modality gap.
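Such a grid search can be sketched as follows. The sketch rests on several assumptions that are not stated in the text: the loss is the symmetric InfoNCE averaged over the listed pairs, the negative for a pair is the other point of the opposite set, and the temperature is set to 1. The function names are hypothetical. Consequently, the angles it finds for the pairs of Eq. 7 depend on these choices and need not reproduce the values reported above.

```python
import numpy as np


def unit(deg):
    """Map angles in degrees to points on the unit circle, shape (..., 2)."""
    rad = np.deg2rad(deg)
    return np.stack([np.cos(rad), np.sin(rad)], axis=-1)


def pair_loss(X, Y, pairs, tau):
    """Symmetric InfoNCE averaged over the listed (i, j) matching pairs.

    X, Y: arrays of shape (..., 2, 2) holding [x1, x2] and [y1, y2].
    Assumption: for a pair (i, j), the other point of the opposite set
    serves as the negative.
    """
    sim = np.einsum("...id,...jd->...ij", X, Y) / tau  # sim[..., i, j] = x_i . y_j / tau
    log_x2y = sim - np.log(np.exp(sim).sum(axis=-1, keepdims=True))
    log_y2x = sim - np.log(np.exp(sim).sum(axis=-2, keepdims=True))
    total = 0.0
    for i, j in pairs:
        total = total - 0.5 * (log_x2y[..., i, j] + log_y2x[..., i, j])
    return total / len(pairs)


def grid_search(pairs, step_deg=6, tau=1.0):
    """Fix x1 at 0 degrees (rotation invariance) and scan the other angles."""
    angles = np.arange(0, 360, step_deg)
    a_y1, a_x2, a_y2 = np.meshgrid(angles, angles, angles, indexing="ij")
    X = np.stack([unit(np.zeros_like(a_y1)), unit(a_x2)], axis=-2)
    Y = np.stack([unit(a_y1), unit(a_y2)], axis=-2)
    loss = pair_loss(X, Y, pairs, tau)
    idx = np.unravel_index(np.argmin(loss), loss.shape)
    best = {"x1": 0, "y1": int(a_y1[idx]), "x2": int(a_x2[idx]), "y2": int(a_y2[idx])}
    return best, float(loss[idx])


perfect = [(0, 0), (1, 1)]                                       # Eq. 6, zero-indexed
mismatched = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 1), (1, 1)]    # Eq. 7 as listed
print(grid_search(perfect))
print(grid_search(mismatched))
```

For the perfect pairs, the search recovers the aligned, antipodal configuration described above, which serves as a sanity check of the setup.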

Appendix 0.D Extended details for Sec. 4

0.D.1 Relative Modality Gap

Limitations of L2M. L2M was initially proposed as a modality gap distance by Liang et al. [36]. However, it has several limitations, which we discuss below:

  1. L2M does not account for differences in the effectively used hyperspaces.

  2. L2M takes a distributional instead of a per-sample view.

  3. The L2 norm can be sensitive to outlier embedding dimensions.

Regarding the first point: note that different models can use different amounts of the unit hypersphere, as the contrastive loss accounts for relative cosine similarities. Here, the similarity is always relative to the similarity to the negative samples. Consequently, models can use a varying degree of the unit hypersphere and, thus, L2 distances can have a different meaning. For example, consider two models that have the same L2M but the first model uses the entire unit hypersphere, while the second model only uses a small fraction of it. While L2M suggests that the modality gap distance is the same, the actual gap of the first model is significantly smaller, since the average distances between the samples are larger.

Regarding the second point: intuition suggests that matching image-text pairs should be close, whereas the distance between non-matching pairs can (and should) be large. L2M, however, only considers the distance between the means over all pairs. This can lead to misleading results. An illustrative (but unlikely) example is the case of two distributions that occupy exactly the same region of the hypersphere, but are rotated by n degrees (i.e., they are very misaligned). Clearly, there exists a gap between the modalities but L2M does not indicate it.

Last, the L2 norm can be sensitive to embedding dimensions that exhibit vast differences. For instance, the most modality-separating embedding dimensions that we discovered are such dimensions.

Relative Modality Gap (RMG). As a remedy to the limitations outlined above, we proposed a Relative Modality Gap (RMG) measure in the main text. RMG computes the distances between matching image-text pairs instead of the means to address limitation 2. Since density estimation in high-dimensional spaces is difficult, we used the mean distances between all samples per modality as a rough approximation to address limitation 1. Finally, we used cosine similarities instead of the L2 norm to address limitation 3.
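The rotated-distributions example from the second limitation can be reproduced numerically. In the sketch below, `l2m` follows the description of L2M as the L2 distance between the modality means. The function `relative_gap` is only an illustrative stand-in that combines the three ingredients named above (per-pair distances, normalization by within-modality distances, cosine similarity); its exact normalization is an assumption and not the RMG formula of the main text.

```python
import numpy as np


def l2m(img, txt):
    """L2 distance between the means of the two modalities."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


def cosine_distance_matrix(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def relative_gap(img, txt):
    """Mean matching-pair distance relative to the mean within-modality distance.

    Illustrative normalization only; not the exact RMG formula.
    """
    n = len(img)
    paired = np.mean(np.diag(cosine_distance_matrix(img, txt)))
    off_diag = ~np.eye(n, dtype=bool)
    within = 0.5 * (cosine_distance_matrix(img, img)[off_diag].mean()
                    + cosine_distance_matrix(txt, txt)[off_diag].mean())
    return float(paired / within)


# Both modalities cover the same circle but are rotated by 90 degrees.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
img = np.stack([np.cos(theta), np.sin(theta)], axis=1)
txt = np.stack([np.cos(theta + np.pi / 2), np.sin(theta + np.pi / 2)], axis=1)
print(l2m(img, txt), relative_gap(img, txt))
```

Here the two means coincide, so L2M is numerically zero, while the per-pair view reports that matching pairs are about as far apart as arbitrary samples within a modality.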

0.D.2 Rank Correlations when Fixing the Datasets

We also computed rank correlation for single datasets. We applied the following filtering criteria:

  1. We removed models that were subsequently finetuned, e.g., on MS COCO.

  2. We removed multiple instances of the same model that differ only in training epochs, activations (GeLU vs. Quick-GeLU), or training protocol (e.g., augmentations). For varying training epochs, we used the models that were trained longer. For varying activations or training protocols, we used the model that achieved better image recognition performance on ImageNet.

We only considered datasets that have at least seven models after filtering (LAION-400M: 7, LAION-2B: 14, OpenAI’s CLIP dataset: 9, WebLI: 9). Note that the small number of models makes the rank correlations susceptible to noise and they need to be interpreted with caution.
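The sensitivity to noise with so few models can be illustrated with a small sketch. This section does not state which rank correlation coefficient was used; Spearman's ρ is assumed here purely for illustration, and the inputs are synthetic ranks rather than measurements from any model.

```python
import numpy as np


def spearman_rho(a, b):
    """Spearman rank correlation for 1-D sequences without ties."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


# Synthetic example with seven "models" (the minimum after filtering).
x = [0, 1, 2, 3, 4, 5, 6]
same_order = [0, 1, 2, 3, 4, 5, 6]
extremes_swapped = [6, 1, 2, 3, 4, 5, 0]  # only the best and worst exchanged
print(spearman_rho(x, same_order))
print(spearman_rho(x, extremes_swapped))
```

With seven models, exchanging just the two extreme entries turns a perfect positive correlation into a negative one, which is why these correlations need to be interpreted with caution.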