¹ University of Freiburg
² Bosch Center for Artificial Intelligence
Email: {schrodi,hoffmann,argusm,brox}@cs.uni-freiburg.de, [email protected]

Two Effects, One Trigger:
On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Representation Learning

Simon Schrodi¹ (0009-0003-7006-953X), David T. Hoffmann¹,² (0009-0002-4942-814X), Max Argus¹ (0000-0002-1288-7476),
Volker Fischer² (0000-0001-5437-4030), Thomas Brox¹ (0000-0002-6282-8861)

Abstract

Contrastive vision-language models like CLIP have gained popularity because their learned representations are versatile and applicable to various downstream tasks. Despite their successes in some tasks, like zero-shot image recognition, they perform surprisingly poorly on other tasks, like attribute detection. Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and a bias towards objects over other factors, such as attributes. In this work, we investigate both phenomena. We find that only a few embedding dimensions drive the modality gap. Further, we propose a measure for object bias and find that object bias does not lead to worse performance on other concepts, such as attributes. But what leads to the emergence of the modality gap and object bias? To answer this question, we carefully designed an experimental setting that allows us to control the amount of shared information between the modalities. This revealed that the driving factor behind both the modality gap and the object bias is the information imbalance between images and captions.

Keywords: Contrastive vision-language representation learning · Information imbalance · Modality gap · Object bias

* Equal contribution.

1 Introduction

Vision-language models have become increasingly popular and are successfully applied to numerous tasks. They benefit from their ability to exploit weak supervision, which can be acquired by scraping the internet for image-text pairs. Models are trained with contrastive [46, 27] or captioning-based pretraining [16, 49, 58]. Subsequent works improved reproducibility [11] and training efficiency [71, 34, 70, 33, 31, 55]. Despite weaker supervision, these models show intriguing properties: strong zero-shot image recognition performance [46, 43], cross-modal understanding [8] and retrieval [40], or robustness [44].

Despite these remarkable improvements and the widespread usage, our understanding of the representations learned by vision-language models is still in its infancy. For instance, a recent work identified a modality gap in the shared embedding space [36], while other work conjectured about a bias towards objects [3]. But how bad are these effects? The consequences as well as the underlying triggers are not fully understood. In this paper, we thoroughly compare the learned embeddings for both modalities, study the two phenomena, modality gap and object bias, and identify their common trigger.

The modality gap discovered by Liang et al. [36] is defined as image and text embeddings occupying separate regions of the shared embedding space. They attributed the phenomenon to the cone effect during model initialization and the contrastive loss preserving the gap, while subsequent work studied the influence of the Softmax temperature [59, 51]. In this paper, we show that few embedding dimensions drive the modality gap and that it is caused by an information imbalance between images and their captions. While we find that increases in the modality gap correlate with improvements in downstream performance, the effect of the modality gap on downstream performance is small compared to other factors, such as model size or dataset properties. For instance, when we control for the dataset, we indeed observe that the modality gap decreases as performance improves. Further, we find that image and text embeddings exhibit different biases and that neighborhood orderings vary between the modalities.

Figure 1: Illustration of information imbalance between images (top left) and captions (bottom left). This makes it virtually impossible for the image encoder (top right) to know what a (sparse) caption may contain. Consequently, it focuses on the most salient objects due to their high probability of being present in the caption and tends to neglect other more unlikely factors, such as attributes.

Beyond the above cross-modal analyses, recent work hypothesized that vision-language models exhibit a bias towards objects [3]. To formalize the notion of “a bias towards objects”, we propose a metric, Matching Object Attribute Distance (MOAD), which assesses the bias towards objects compared to other factors, such as attributes. We confirm that contrastive vision-language models are indeed more biased towards objects than attributes. However, we observe that performance on attributes positively correlates with performance on objects. This suggests that improvements on object tasks also lead to improvements on attribute tasks. Further, we find that the bias does not stem from the global word frequency of the training dataset but from a per-sample caption presence bias, i.e., models are more biased towards words that are more consistently mentioned in the captions, as objects typically are.

Finally, we identify the common trigger for both the modality gap and object bias: information imbalance. Information imbalance describes the availability of more information for one modality in comparison to the other. For example, for vision-language models, captions are sparse (lossy) descriptions of images and determine the focal point, whereas images contain much more information. As a result, image encoders cannot know what information of the image they need to encode to align the image encoding with the text encoding for some unknown caption; see Fig. 1 for an illustration. The best that the image encoders can do is to focus on the most salient parts of the image that are typically present in captions, e.g., objects. Consequently, the encoder exhibits a bias towards these parts of the captions. Similarly, the modality gap emerges as a by-product of the contrastive optimization under information imbalance. Here, the model trades off alignment, which is limited due to the information imbalance, with uniformity by making all images and all texts more dissimilar (i.e., the modality gap). Besides that, a reduction of information imbalance through caption enrichment also improves zero-shot downstream performance. We believe that the above findings help to better understand the role of sparse captions in contrastive vision-language representation learning.

In summary, the contributions of our analysis paper are as follows: 1) The modality gap is driven by only a few embedding dimensions. 2) For off-the-shelf contrastive vision-language models, the modality gap and downstream performance are positively correlated due to common confounders. Controlling for more confounders indicates that a lower modality gap indeed correlates with higher performance. 3) Image and text embeddings have distinct characteristics despite being coupled via the cross-modal loss. 4) Object bias is caused by a higher per-sample caption presence of objects. 5) Improvements on object tasks yield improvements on attribute tasks. 6) An information imbalance between the modalities leads to both the modality gap and the object bias.

2 Related Work

Contrastive vision-language representation learning has recently emerged as an effective technique to learn representations with weak supervision that work for a wide range of tasks and have intriguing properties, such as strong zero-shot abilities. However, our understanding of the learned representation is in its infancy. For instance, recent work showed the presence of a modality gap and attributed it to the cone effect of model initialization and the contrastive loss [36]. Subsequent work explored the influence of the Softmax temperature [59, 51]. Other work found that the modality gap is orthogonal to the span of image and text embeddings [73]. In this work, we find that few dimensions are responsible for the separation of the modalities, and identify that information imbalance between images and text is the driving factor for the modality gap.

While some work found that large models, including vision-language models, close the gap to human perception [19, 32], other work found several failure modes of them [69, 4]. Other work studied the importance of data [44, 64], generalization/robustness [42, 14], analyzed the learned features/representations [20, 41, 47], compositionality [27, 13, 57], or learned abilities and (social) biases [1, 65, 72, 63, 52, 22]. It has also been discussed that vision-language models may be biased towards objects [3]. To study object bias, we introduce a measure to assess “bias towards objects”, affirming that they are indeed biased towards objects. But, we also find that improvements on object tasks correlate with improvements on attribute tasks. Lastly, we find that the bias stems from a per-sample caption presence bias caused by information imbalance.

Finally, theoretical work disentangled the InfoNCE (contrastive) loss [45] into an alignment and uniformity term [62] and showed the importance of shared task-relevant information between the modalities [61, 15, 35]. In this work, we connect information imbalance of task-relevant information to the two phenomena modality gap and bias towards objects.

3 Experimental Setup

Contrastive vision-language models. Unless stated otherwise, we used CLIP ViT-B/16 [46] and SigLIP ViT-B/16 [70] for our analyses. For our large-scale analyses, we used a total of 112 contrastive vision-language models provided by OpenCLIP [25, 11]. We distinguished between medium- (i.e., dataset size of $\leq 128\,\mathrm{M}$) and large-scale datasets.

Downstream evaluation tasks. We conducted our evaluations on ImageNet [48], MS COCO [37, 10], MIT-States [26], and UT-Zappos [67] using the standard evaluation protocols from the literature. For ImageNet, we used the CLIP-style prompts "a photo of a {obj}" [46] and computed the zero-shot (object) accuracy. For MS COCO, we prepended the prompt "a photo of" to the description of each image following Radford et al. [46] and used R@1 to assess zero-shot image-to-text retrieval performance (text-to-image retrieval yielded similar results). For MIT-States and UT-Zappos, we used the prompts "an image of a {attr} object" and computed the zero-shot attribute accuracy. We provide further evaluation details in Appendix 0.A.

Multi-modal Attributes and Digits (MAD). To understand the influence of data, we built a multi-modal dataset based on Morpho-MNIST [6] (a variation of MNIST [30]) with full control over the data-generating process, called Multi-modal Attributes and Digits (MAD). We used the following morphing or warping operations as latent factors (i.e., attributes): altering image thickness (thickening, thinning, no thickening/thinning), swelling (swelling, no swelling), fractures (fracture, no fracture) from Castro et al. [6], and added scaling (large, small), colors (gray, red, green, blue, cyan, magenta, yellow) and captions. Visual examples are provided in Fig. 2 and Appendix 0.A. To generate captions, we mapped the digit class and latent factors to words and chained them together in random order, e.g., 0-thickening-swelling-fractures-large-blue. Model and training details are provided in Appendix 0.B.

To study the effect of information imbalance due to missing information in captions, we varied the number of attributes included in each caption, while ensuring that the digit remained consistently present. Importantly, the images remain unchanged, i.e., all latent factors still affect the images. For example, if we restrict each caption to one attribute (in addition to the digit), the above (full) caption reduces to, e.g., 0-blue or 0-large.
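The caption-sparsification step can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_caption`, the dictionary layout, and the choice to keep the digit as the first token (as in the examples 0-blue and 0-large) are hypothetical and not taken from the released MAD code.

```python
import random

def make_caption(digit, attribute_words, n_attributes, rng):
    """Build a caption that always contains the digit plus a random subset of attributes.

    attribute_words maps a latent factor name to the word describing it for this
    image (hypothetical layout). The image itself is never changed; only the
    caption becomes sparser when n_attributes is small.
    """
    # random.sample draws without replacement and returns the picks in random order.
    kept = rng.sample(list(attribute_words.values()), n_attributes)
    return "-".join([str(digit)] + kept)

# Words for one example image, copied from the example caption in the text.
example_attributes = {
    "thickness": "thickening",
    "swelling": "swelling",
    "fracture": "fractures",
    "scale": "large",
    "color": "blue",
}

rng = random.Random(0)
full_caption = make_caption(0, example_attributes, 5, rng)    # low information imbalance
sparse_caption = make_caption(0, example_attributes, 1, rng)  # high information imbalance
```

Setting `n_attributes` to the number of latent factors corresponds to the full-information setting, while smaller values increase the information imbalance.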

4 Cross-modal Disparities of Embeddings

Most previous work focused on improving downstream performance of contrastive vision-language models, while some works discovered intriguing phenomena or shortcomings of contrastive vision-language models. However, we still lack a thorough understanding of the causes and effects of the learned representations. In an effort to enrich our understanding, we first analyze differences between image and text embeddings. We start by discovering that few embedding dimensions drive the modality gap (Sec. 4.1) and discuss the relation of the modality gap to downstream performance (Sec. 4.2). Finally, we study further similarities and differences of the embeddings (Sec. 4.3).

Revisiting the Modality Gap

Liang et al. [36] showed that embeddings of the modalities are located in completely separate regions of the embedding space and coined the phenomenon modality gap. They defined the modality gap distance as the L2-distance between the Means (L2M) of the embeddings:

\text{L2M}\coloneqq\left\|\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_{i}-\frac{1}{n}\sum_{i=1}^{n}\mathbf{y}_{i}\right\|\quad,

(1)

where $\mathbf{x}_{i},\mathbf{y}_{i}$ are the $i$ -th L2-normalized image or text embeddings, respectively.
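As a minimal sketch of Eq. 1, the snippet below computes L2M with NumPy. The function name `l2m` and the array layout (one embedding per row) are assumptions made for illustration; they are not part of the original evaluation code.

```python
import numpy as np

def l2m(image_emb, text_emb):
    """L2 distance between the means of L2-normalized image and text embeddings."""
    # Normalize every embedding (row) to unit length, as required by Eq. 1.
    x = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    y = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Distance between the two modality centroids.
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))
```

Because the measure only compares the two centroids, it is insensitive to how individual image-text pairs are arranged around them.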

4.1 Few Embedding Dimensions Make Up the Modality Gap

To obtain better insights into the nature of the modality gap, we asked two questions: 1) Is the modality gap present in all dimensions or only a subset thereof? 2) Does post-hoc closing of the modality gap improve downstream performance?

We compared the distributions (means and variances) per embedding dimension between the modalities. Interestingly, there are few embedding dimensions that have stark differences in their means, while that difference is close to zero for most other dimensions (Fig. 2(a)). Moreover, we find that some dimensions exhibit substantial variance within one modality and negligible variance within the other, whereas this is reversed for other dimensions. Consequently, two of these embedding dimensions suffice to perfectly separate the modalities, as shown in Fig. 2(b). Moreover, these dimensions are by far the largest components of the image or text embeddings, respectively. Hence, they substantially influence the entire embedding and are responsible for the largest part of the measured modality gap (see the sharp drop of L2M in Fig. 4). Takeaway 1: Few embedding dimensions drive the modality gap and two dimensions suffice to separate the modalities.

Does removing these dimensions close the gap and improve performance? One may suspect that ablating the dimensions with high contributions to the modality gap will (substantially) close the modality gap and lead to better downstream performance. To test this, we successively ablated these dimensions using the sorting from Fig. 2(a). We re-normalized the remaining dimensions and evaluated their downstream performance. Fig. 4 shows that the modality gap (L2M) indeed closes, but downstream performance initially decreases sharply before partially recovering as more dimensions are ablated.
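The ablation procedure can be sketched as below. The sketch assumes that dimensions are ranked by the absolute difference of the per-dimension means between the modalities; the exact sorting used for the figure is not spelled out here, so treat this ranking, as well as the function names, as assumptions.

```python
import numpy as np

def l2_normalize(emb):
    """Scale every row to unit L2 norm."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def ablate_gap_dimensions(image_emb, text_emb, k):
    """Drop the k dimensions with the largest mean difference and re-normalize.

    Returns the ablated image embeddings, the ablated text embeddings, and the
    distance between their means (the modality gap after ablation).
    """
    x, y = l2_normalize(image_emb), l2_normalize(text_emb)
    # Assumed ranking: largest absolute per-dimension mean difference first.
    mean_diff = np.abs(x.mean(axis=0) - y.mean(axis=0))
    order = np.argsort(-mean_diff)
    keep = np.sort(order[k:])
    # Re-normalize the remaining dimensions so cosine similarities stay well defined.
    x_kept, y_kept = l2_normalize(x[:, keep]), l2_normalize(y[:, keep])
    gap = float(np.linalg.norm(x_kept.mean(axis=0) - y_kept.mean(axis=0)))
    return x_kept, y_kept, gap
```

Note that the re-normalization divides by zero if an embedding has no mass left in the kept dimensions; a practical implementation would need to guard against that case.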

What is the mechanism that explains this observation? Note that the first embedding dimensions are the largest components of the embeddings (as discussed above). Hence, ablating and re-normalizing them causes substantial changes in cosine similarities and cross-modal neighborhoods. Consider the following example: let $i=[8,0.6,0.7,0.3]^{T}$ be an image embedding, and $t=[5,0.13,0.035,0.02]^{T}$ and $t^{\prime}=[5,1.5,0.7,0.45]^{T}$ be the matching and non-matching text embeddings, respectively. We have cosine similarities $d$ of $d(i,t)=0.995$ and $d(i,t^{\prime})=0.975$ . After ablating the first dimension, we have cosine similarities of $0.822$ and $0.917$, and hence the image-text alignment flips. Finally, it flips back after also ablating the second dimension.
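The numbers in this example can be checked with a few lines of NumPy. The helper names are hypothetical; since cosine similarity is invariant to rescaling, dropping the leading dimensions is equivalent to ablating and re-normalizing them.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

image = np.array([8.0, 0.6, 0.7, 0.3])
text_match = np.array([5.0, 0.13, 0.035, 0.02])
text_other = np.array([5.0, 1.5, 0.7, 0.45])

def similarities_after_ablation(k):
    """Similarity to the matching and non-matching text after dropping the first k dims."""
    return (cosine(image[k:], text_match[k:]),
            cosine(image[k:], text_other[k:]))

for k in range(3):
    match, other = similarities_after_ablation(k)
    print(f"ablated dims: {k}  matching: {match:.3f}  non-matching: {other:.3f}")
```

With no ablation the matching caption wins, after one ablated dimension the non-matching caption wins, and after two the matching caption wins again, which mirrors the drop and partial recovery of downstream performance described above.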

But why do these modality-separating embedding dimensions appear? As we will discuss in more detail in Sec. 6, maximization of the alignment term of the popular (contrastive) InfoNCE loss [45] (refer to Appendix 0.C for background) is limited by an information imbalance between the modalities, i.e., texts are sparse descriptions of the images. This makes it impossible to minimize the contrastive loss only by maximizing the similarity of the matching image-text pairs (the numerator of the loss, also called the alignment term). To still minimize the InfoNCE loss, a model also maximizes its uniformity term (denominator). It can do so by making images and texts as dissimilar as possible. (Footnote 1: Note that the repulsive forces of the uniformity term of the InfoNCE loss solely act across the modalities and not within the modalities.) After all, the InfoNCE loss maximizes the similarity of the matching image-text pair relative to the similarity of the non-matching image-text pairs, not the absolute similarity. Thus, we hypothesize that models trade off alignment, which is limited due to information imbalance (cf. Sec. 6), in a subset of dimensions for higher uniformity to minimize the loss. To still achieve good alignment, the model uses few dimensions that will have a negligible effect on the alignment term but significantly increase the uniformity term (overall lower cosine similarity scores between images and texts). This hypothesis explains our observations in Fig. 3 (few large and modality-separating dimensions), Fig. 4 (recovery of performance when these dimensions are dropped but alignment is recovered), and Fig. 8 (modality gap closes as information imbalance reduces).
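To make the two terms concrete, the sketch below splits the image-to-text direction of an InfoNCE-style loss into the matched-pair (numerator) part and the log-denominator part. The temperature value, the function name, and the restriction to a single direction are assumptions for illustration; Appendix 0.C remains the reference for the exact loss used in the paper.

```python
import numpy as np

def info_nce_terms(image_emb, text_emb, temperature=0.07):
    """Split an image-to-text InfoNCE loss into its two parts.

    Returns (matched_term, log_denominator, loss) with
    loss = matched_term + log_denominator.
    """
    x = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    y = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = x @ y.T / temperature  # row i: image i against all texts

    # Numerator part: negative scaled similarity of the matching pairs.
    matched_term = -float(np.mean(np.diag(logits)))

    # Denominator part: log-sum-exp over all texts, computed stably.
    row_max = logits.max(axis=1, keepdims=True)
    log_sum_exp = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_denominator = float(np.mean(log_sum_exp))

    return matched_term, log_denominator, matched_term + log_denominator
```

In this form, lowering all cross-modal similarities reduces the log-denominator part, while the matched-pair part only depends on how similar each image is to its own caption. This is the trade-off the hypothesis above refers to.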

4.2 Does the Modality Gap Harm Downstream Performance?

Table 1: Spearman rank correlation between downstream task performance and various factors for models trained on medium and large datasets (each cell: medium | large).

Downstream task	Modality gap (L2M)	Modality gap (RMG)	Model size	Embedding size	Dataset size
ImageNet	46.8 | 46.2	24.3 | 34.1	34.9 | 79.1	6.4 | 77.9	22.4 | -22.6
MS COCO	29.6 | 32.0	24.9 | 13.1	-14.8 | 77.2	54.0 | 75.2	-18.6 | -15.8

The influence of the modality gap on downstream performance is controversially discussed in the literature [53, 74, 36, 73]. In the experiments from Fig. 4, ablating dimensions closed the modality gap but did not improve the downstream performance. Similarly, Liang et al. [36] closed the gap by shifting the embeddings but found that an increase of the modality gap actually improved performance. This is in contrast to intuition, which suggests that image-text pairs should be close and a smaller modality gap should improve downstream performance.

To bring additional insights into this discussion, we evaluated 112 contrastive vision-language models provided by OpenCLIP [25, 11] on ImageNet classification and MS COCO image-to-text retrieval (text-to-image retrieval yielded similar results). We computed the modality gap distance with L2M (Eq. 1) proposed by Liang et al. [36]. Our results in Fig. 5 and Tab. 1 show that a larger L2M distance counter-intuitively correlates with downstream performance improvements. Also note the separation of models trained on medium- (i.e., $\leq 128\,\mathrm{M}$ image-text pairs) and large-scale data.

We suspect that L2M may be the cause for this counter-intuitive observation (refer to Appendix 0.D for a discussion of its limitations). Thus, we propose the alternative Relative Modality Gap (RMG) measure:

\text{RMG}\coloneqq\frac{\frac{1}{n}\sum\limits_{i=1}^{n}d(\mathbf{x}_{i},\mathbf{y}_{i})}{\frac{0.5}{n(n-1)}\left(\sum\limits_{i,j=1;\,i\neq j}^{n}d(\mathbf{x}_{i},\mathbf{x}_{j})+\sum\limits_{i,j=1;\,i\neq j}^{n}d(\mathbf{y}_{i},\mathbf{y}_{j})\right)+\frac{1}{n}\sum\limits_{i=1}^{n}d(\mathbf{x}_{i},\mathbf{y}_{i})}\quad,

(2)

where $\mathbf{x}_{i},\mathbf{y}_{i}$ are the $i$-th L2-normalized image or text embeddings, respectively, and $d$ is some distance function (we used cosine dissimilarity scaled to [0,1]). Intuitively, the numerator takes a per-sample view, measuring the gap where it matters, and the denominator accounts for the effectively used space by setting the numerator in relation to the average distances within the modalities. We also add the distances of the matching image-text pairs to the denominator to scale the metric to [0,1]. However, even with this more sophisticated measure for the modality gap, we still observe a positive correlation, even though it became weaker (Fig. 5 and Tab. 1).
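A possible NumPy sketch of RMG follows. It assumes that $d$ is the cosine dissimilarity mapped to [0,1] via $(1-\cos)/2$, that the numerator averages over matching image-text pairs, and that the within-modality sums run over all pairs with $i\neq j$, as described in the surrounding text. The function name `rmg` is hypothetical.

```python
import numpy as np

def rmg(image_emb, text_emb):
    """Relative Modality Gap for paired embeddings (row i of both arrays match)."""
    x = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    y = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    n = x.shape[0]

    # Mean distance of matching image-text pairs (numerator).
    matched = float(np.mean((1.0 - np.sum(x * y, axis=1)) / 2.0))

    # Mean within-modality distance over all pairs with i != j.
    off_diagonal = ~np.eye(n, dtype=bool)
    d_xx = (1.0 - x @ x.T) / 2.0
    d_yy = (1.0 - y @ y.T) / 2.0
    within = 0.5 * (d_xx[off_diagonal].mean() + d_yy[off_diagonal].mean())

    return float(matched / (within + matched))
```

The function needs at least two pairs and is undefined when all embeddings coincide (the denominator becomes zero), which a practical implementation would have to handle.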

Our intuition can fool us when we assume that the modality gap causes a certain downstream performance independent of other factors. Other factors can overshadow the effect of the modality gap. As Tab. 1 reveals, downstream performance is much more influenced by the model and embedding size, particularly for models trained on large datasets. Besides that, when comparing only models trained on the same dataset (removing the dataset as confounder), we observe the expected negative correlation for LAION-400M (-46.4), LAION-2B (-27.5), and WebLI (-58.3); see Appendix 0.D for details. OpenAI’s CLIP models still behave differently (38.3), possibly due to differences in the training protocols. We note that the grouping based on the datasets substantially reduced the number of available models and, thus, these correlations need to be taken with caution. However, as we will see in Sec. 6, the modality gap as well as downstream performance are indeed both affected by common third variables, e.g., dataset quality (or embedding size). Tab. 1 also suggests model size as a potential shared influential factor, but we leave further investigation of it for future work. Takeaway 2: A larger modality gap positively correlates with downstream performance, yet there is no indication that this relationship is causal; rather, both are affected by common confounders.

4.3 Further Similarities and Differences of Cross-modal Embeddings

Beyond the modality gap, we find further similarities and differences of the embeddings by investigating the following aspects: 1) Do the directions have similar meaning? 2) Are the nearest neighbor relations the same? 3) Are the biases of the embeddings equally pronounced (refer to Sec. 5)?

To identify meaningful directions (1st question), we followed the ideal words approach of Trager et al. [57]: We paired all objects with all attributes in the captions, marginalized the attributes and subtracted the mean text embedding to get ideal object words, and vice versa. To get ideal image embeddings, we followed the same procedure but were limited by the available labeled images. We find low cosine similarities between ideal words and images (CLIP ViT-B/16: 0.19 for MIT-States, 0.16 for UT-Zappos; SigLIP ViT-B/16: 0.20, 0.16). However, when we correct them with the modality gap vector (mean difference vector between matching image and text embeddings), cosine similarities significantly increase (CLIP ViT-B/16: 0.56 for MIT-States, 0.40 for UT-Zappos; SigLIP ViT-B/16: 0.68, 0.56). Hence, ideal words and images are not aligned, which suggests that the embedding directions of each modality have different meanings when not corrected by the modality gap vector.

Table 2: Dissimilarity of neighborhood orderings in the embedding space using the normalized Kendall-Tau distance $\in[0,1]$. Higher normalized Kendall-Tau distance values indicate that the ranking of neighbors is altered more. For ImageNet-100, “s. $i$” indicates the $i$-th split.

	CIFAR-10	CIFAR-100	ImgNet-100 s. 1	ImgNet-100 s. 2	ImgNet-100 s. 3
CLIP	0.3399	0.4965	0.4975	0.5046	0.5081
SigLIP	0.5044	0.4981	0.5003	0.4965	0.4987

To test similarity of neighborhood relations of the embeddings (2nd question), we computed the mean embedding for each class in CIFAR-10, CIFAR-100 [28], and three ImageNet-100 splits [24]. We computed the normalized Kendall-Tau distance ( $[0,1]$ ), where the normalization accounts for varying number of classes. Intuitively, it counts the percentage of bubble-sort swaps (w.r.t. all possible swaps) necessary to transform a nearest neighbor list of modality A to match the nearest neighbor list of modality B. Tab. 2 reveals that the neighborhood orderings are dissimilar between the modalities.
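The normalized Kendall-Tau distance between two neighbor lists can be sketched as below. The helper `neighbor_order`, which ranks the other classes by cosine similarity of their mean embeddings, is an assumption about how the neighbor lists might be built; only the pair-counting distance itself follows directly from the description above.

```python
from itertools import combinations

import numpy as np

def normalized_kendall_tau(ranking_a, ranking_b):
    """Fraction of item pairs that the two rankings order differently (0 = same, 1 = reversed)."""
    position_b = {item: idx for idx, item in enumerate(ranking_b)}
    n = len(ranking_a)
    # A pair is discordant if ranking_a puts it in the opposite order to ranking_b.
    discordant = sum(
        1
        for i, j in combinations(range(n), 2)
        if position_b[ranking_a[i]] > position_b[ranking_a[j]]
    )
    return discordant / (n * (n - 1) / 2)

def neighbor_order(class_means, anchor):
    """Other classes sorted from most to least similar to the anchor class (hypothetical helper)."""
    means = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    sims = means @ means[anchor]
    order = np.argsort(-sims, kind="stable")
    return [int(c) for c in order if c != anchor]
```

For a given class, one would compute `neighbor_order` once with the image class means and once with the text class means and compare the two lists; both rankings must contain the same items and at least two of them.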

Takeaway 3: Directions of image and text embeddings align when corrected by the modality gap vector and neighborhood relations vary between the modalities.

5 Object Bias Is a Caption Presence Bias

Object bias refers to the observation that contrastive vision-language models have high performance on downstream tasks mainly linked to objects, while achieving comparably worse performance on tasks linked to other latent factors, such as attributes [3]. However, solely assessing object bias based on worse performance on some attribute benchmark may be misleading, as the task just could be more difficult than an object-based task.

Instead, we propose a measure for object vs. attribute bias, denoted as Matching Object Attribute Distance (MOAD). MOAD quantifies how well a model distinguishes matching from non-matching images (or texts) of objects $o$ compared to attributes $a$ . Matching images (texts) show the same object or attribute, whereas non-matching images (texts) show different objects or attributes. We define MOAD for L2-normalized image embeddings $\mathbf{x}$ as follows:

\begin{split}
\text{MOAD}_{\text{img}}\coloneqq{}&\frac{1}{2|O|}\sum_{o\in O}\left(\frac{1}{N_{1}}\sum_{\substack{\mathbf{x}_{i},\mathbf{x}_{j}\in X_{o}\\ i\neq j}}\mathbf{x}_{i}^{T}\mathbf{x}_{j}-\frac{1}{N_{2}}\sum_{\mathbf{x}_{i}\in X_{o},\,\mathbf{x}_{j}\in X_{\neg o}}\mathbf{x}_{i}^{T}\mathbf{x}_{j}\right)\\
-&\frac{1}{2|A|}\sum_{a\in A}\left(\frac{1}{N_{3}}\sum_{\substack{\mathbf{x}_{i},\mathbf{x}_{j}\in X_{a}\\ i\neq j}}\mathbf{x}_{i}^{T}\mathbf{x}_{j}-\frac{1}{N_{4}}\sum_{\mathbf{x}_{i}\in X_{a},\,\mathbf{x}_{j}\in X_{\neg a}}\mathbf{x}_{i}^{T}\mathbf{x}_{j}\right)\quad,
\end{split}

(3)

where $N_{1},...,N_{4}$ are normalization factors and $X_{o},X_{\neg o},X_{a},X_{\neg a}$ are the sets of all images $\mathbf{x}$ that do (not) contain the object $o\in O$ or attribute $a\in A$ , respectively. We similarly define $\text{MOAD}_{\mathbf{\text{txt}}}$ for text embeddings $\mathbf{y}$ . Positive values indicate a bias towards objects, negative values a bias towards attributes, and zero no bias.
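A minimal sketch of MOAD for one modality is given below. It assumes that the normalization factors $N_{1},...,N_{4}$ are simply the numbers of summed pairs, so that each inner term is a mean similarity, and it skips classes with fewer than two members or without any non-member. Both choices, as well as the function names, are assumptions and may differ from the original implementation.

```python
import numpy as np

def _mean_separation(emb, labels):
    """Average over classes of (mean within-class similarity - mean similarity to non-members)."""
    labels = np.asarray(labels)
    sims = emb @ emb.T
    gaps = []
    for c in np.unique(labels):
        inside = labels == c
        n_in, n_out = int(inside.sum()), int((~inside).sum())
        if n_in < 2 or n_out == 0:
            continue  # assumption: classes without valid pairs are skipped
        block = sims[np.ix_(inside, inside)]
        # Exclude the diagonal (i == j) from the within-class mean.
        within = (block.sum() - np.trace(block)) / (n_in * (n_in - 1))
        across = sims[np.ix_(inside, ~inside)].mean()
        gaps.append(within - across)
    return float(np.mean(gaps))

def moad(emb, object_labels, attribute_labels):
    """Positive: bias towards objects. Negative: bias towards attributes."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 0.5 * _mean_separation(emb, object_labels) - 0.5 * _mean_separation(emb, attribute_labels)
```

As a usage example, embeddings that cluster almost entirely by object and only weakly by attribute yield a clearly positive value, and swapping the two label sets flips the sign.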

We analyzed the relation between MOAD and downstream performance in Fig. 5(a). As expected, the majority of contrastive vision-language models showcase a bias towards objects (positive MOAD values). Notably, models trained on large-scale data exhibit a less pronounced object bias (smaller positive MOAD values) compared to models trained on medium-scale data. However, we find only a very weak to no correlation between MOAD and downstream performance for models trained on large-scale data. Fig. 5(b) shows that this can be attributed to the medium-to-strong correlations between performance on object (ImageNet, MS COCO) and attribute tasks (MIT-States, UT-Zappos). This suggests that contrastive vision-language models tend to perform well or poorly in both types of tasks, rather than excelling in one while underperforming in the other. Takeaway 4: Contrastive vision-language models trained on large-scale data tend to have a lower object bias than medium-scale models. However, there is no clear relation between object bias and downstream performance. This can be attributed to the observation that performance improvements on object tasks correlate with improvements on attribute tasks.

Where does the object bias come from?

In principle, the word frequency of the training dataset could cause the object bias. However, Fig. 6(a) disproves this, as attributes are more frequent than objects in LAION-2B. Thus, we posit that the bias arises from the per-sample prevalence of objects in natural language, e.g., humans tend to describe the most salient object(s) and typically only few of their attributes in a caption. (Footnote 2: Not all objects in a scene will be described, but objects are more consistently present than attributes.) To verify this hypothesis, we used MAD and redefined the prevalence of the latent factors. Specifically, we trained models in 5 settings. In each setting, a different factor (e.g., digit, swelling, fracture, etc.) was always present in the caption, while only one of the remaining factors was randomly sampled for each training sample. Fig. 6(b) confirms that the caption presence bias results in a bias towards objects for natural language captions. It also illustrates that models perform better on tasks for which the (task-relevant) information is always present in the captions. Takeaway 5: Bias towards concepts, e.g., objects, is caused by their high presence probability in captions if said concept appears in the image.

6 Information Imbalance Triggers Modality Gap and Object Bias

The previous sections analyzed the differences of image and text embeddings, particularly the modality gap, and a bias towards objects. However, what is the underlying cause for their emergence? This section reveals one factor that creates both phenomena: information imbalance between images and texts due to sparse captions. That is, images contain all the information, while captions are an incomplete description of the images, as captions typically entail the most salient object(s) and only a handful of other factors, such as attributes. The information imbalance problem is illustrated in Fig. 1.

As a consequence of information imbalance, the image encoder can hardly align its embedding of an image to the one of the text encoder of a matching caption, as it cannot know what latent factors may be encoded in that caption. To still achieve sufficient alignment, the best both encoders can do is to focus on the latent factors that are likely present in the caption (e.g., objects), while neglecting other factors that are unlikely to be present (e.g., attributes). This results in the above discussed caption presence bias and, consequently, leads to a bias towards the most likely present words. In natural language descriptions these are objects. Moreover, the model will maximize the uniformity of the image and text embeddings to minimize the total contrastive loss, as there is a shortcut to quickly achieve this during training: make images and texts maximally dissimilar. This creates the modality gap (refer to Sec. 4.1 for a thorough explanation).

To validate our hypothesis, we varied the information imbalance in MAD in a controlled setting. Specifically, we varied the number of attributes mentioned in the captions and ensured that the object (i.e., digit class) is always included; refer to Sec. 3 for details. Note that settings with high information imbalance (few attributes are present) are aligned with natural language captions that typically only entail the most salient object(s) while neglecting most of the attributes. In contrast, settings with low information imbalance are more aligned with approaches using enriched captions. Fig. 8 shows the results and indeed validates our hypothesis.

Information imbalance and modality gap. Fig. 7(a)(I-II) show that the modality gap decreases with decreasing information imbalance. Moreover, even when there is a modality gap after model initialization, the contrastive loss is capable of substantially reducing it in the full information setting; see Fig. 7(b).

Information imbalance and object bias. Fig. 7(a)(III-IV) show that the bias towards objects reduces, as information imbalance reduces. The image encoder is also more biased towards objects than the text encoder (Fig. 5(a), 6(b) and 7(a)(II-III)), which is implied by our hypothesis.

Information imbalance and zero-shot performance. Fig. 7(a)(V-VI) show that zero-shot digit (object) and attribute performance improve with decreasing information imbalance. This is also supported by the theoretical results of Daunhawer et al. [15], who showed that latent factors, i.e., objects or attributes, can be block-identified if they are shared between the modalities.

Takeaway 6: Information imbalance between the modalities leads to both the modality gap and the object bias. Reducing the level of information imbalance results in a smaller modality gap and a smaller object bias.

Embedding dimensionality. Fig. 7(a)(V-VI) show that a higher embedding dimensionality also improves zero-shot downstream performance, even in the presence of substantial information imbalance. This suggests that increasing the embedding dimension could also be an effective way to improve zero-shot downstream performance. We leave further investigation for future work.

7 Discussion

We found that information imbalance between modalities results in poor downstream performance for the imbalanced factors and leads to both the modality gap and a bias towards the factors that are always present in the captions. In contrast, when we ensure information balance with all task-relevant information, e.g., attributes, contrastive vision-language models achieve strong zero-shot downstream performance, a negligible modality gap, and less bias towards objects.

Synthetic vs. real data. We validated our hypothesis on our synthetic MAD dataset for small contrastive vision-language models. However, there are several differences between real and synthetic data, e.g., real images have substantially more latent factors (attributes, lighting, relations, etc.). Further, we acknowledge that models are typically trained in the large-scale setting, i.e., a large model trained on a large dataset with large amounts of compute. Nonetheless, we are confident that our findings go beyond the synthetic setting. We find evidence for this in the form of consistent findings from concurrent empirical studies that showed the positive influence of data quality [44] or caption enrichment [66, 29] on the performance of contrastive vision-language models. In this work, we studied the upper bound of data quality and caption enrichment (i.e., all latent factors are described in the caption) and found it effective for learning better representations.

Beyond contrastive vision-language models. We focused on contrastive vision-language models due to their popularity. However, our analysis and conclusions are not specific to these modalities and generalize to multi-modal models trained on other input modalities. Another interesting future direction is to include recent large-scale captioning-based models [16, 49, 58] in our analysis. Unfortunately, without publicly accessible weights this is not possible for us.

8 Conclusion

This work investigated contrastive vision-language models to gain a better understanding of their characteristics. We found that the modality gap and a bias towards objects are both triggered by an information imbalance between modalities. Reducing this imbalance mitigates both and improves downstream performance. Surprisingly, we also found that only a few embedding dimensions drive the modality gap. Besides that, we introduced two novel measures to compare modality gaps across models and to quantify the notion of a “bias towards objects”. While we observed that a larger modality gap correlates with better downstream performance, we found that both the modality gap and downstream performance are substantially more influenced by other factors, such as the dataset quality. Finally, we confirmed that contrastive vision-language models have a bias towards objects but also found that improvements on object tasks positively correlate with improvements on attribute tasks.

Acknowledgments

This research was funded by the Bundesministerium für Umwelt, Naturschutz, nukleare Sicherheit und Verbraucherschutz (BMUV, German Federal Ministry for the Environment, Nature Conservation, Nuclear Safety and Consumer Protection) based on a resolution of the German Bundestag (67KI2029A), the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant number 417962828, and the Bosch Center for Artificial Intelligence.

References

[1] Agarwal, S., Krueger, G., Clark, J., Radford, A., Kim, J.W., Brundage, M.: Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications. arXiv (2021)
[2] Alabdulmohsin, I., Zhai, X., Kolesnikov, A., Beyer, L.: Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design. In: NeurIPS (2023)
[3] Bravo, M.A., Mittal, S., Ging, S., Brox, T.: Open-vocabulary Attribute Detection. In: CVPR (2023)
[4] Brody, J.: On the Potential of CLIP for Compositional Logical Reasoning. In: ICLP (2023)
[5] Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: COYO-700M: Image-Text Pair Dataset (2022), https://github.com/kakaobrain/coyo-dataset
[6] Castro, D.C., Tan, J., Kainz, B., Konukoglu, E., Glocker, B.: Morpho-MNIST: Quantitative Assessment and Diagnostics for Representation Learning. JMLR (2019)
[7] Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. In: CVPR (2021)
[8] Chen, R., Liu, Y., Kong, L., Zhu, X., Ma, Y., Li, Y., Hou, Y., Qiao, Y., Wang, W.: CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP. In: CVPR (2023)
[9] Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J., Ding, N., Rong, K., Akbari, H., Mishra, G., Xue, L., Thapliyal, A.V., Bradbury, J., Kuo, W., Seyedhosseini, M., Jia, C., Ayan, B.K., Ruiz, C.R., Steiner, A.P., Angelova, A., Zhai, X., Houlsby, N., Soricut, R.: PaLI: A Jointly-Scaled Multilingual Language-Image Model. In: ICLR (2023)
[10] Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv (2015)
[11] Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: CVPR (2023)
[12] Chopra, S., Hadsell, R., LeCun, Y.: Learning a Similarity Metric Discriminatively, with Application to Face Verification. In: CVPR (2005)
[13] Couairon, G., Douze, M., Cord, M., Schwenk, H.: Embedding Arithmetic of Multimodal Queries for Image Retrieval. In: CVPR (2022)
[14] Crabbé, J., Rodríguez, P., Shankar, V., Zappella, L., Blaas, A.: Robust multimodal models have outlier features and encode more concepts. arXiv (2023)
[15] Daunhawer, I., Bizeul, A., Palumbo, E., Marx, A., Vogt, J.E.: Identifiability Results for Multimodal Contrastive Learning. In: ICLR (2023)
[16] Desai, K., Johnson, J.: VirTex: Learning Visual Representations from Textual Annotations. In: CVPR (2021)
[17] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: ICLR (2020)
[18] Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al.: DataComp: In search of the next generation of multimodal datasets. In: Datasets and Benchmarks Track@NeurIPS (2023)
[19] Geirhos, R., Narayanappa, K., Mitzkus, B., Thieringer, T., Bethge, M., Wichmann, F.A., Brendel, W.: Partial success in closing the gap between human and machine vision. In: NeurIPS (2021)
[20] Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., Olah, C.: Multimodal Neurons in Artificial Neural Networks. Distill (2021)
[21] Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: AISTATS (2010)
[22] Hamidieh, K., Zhang, H., Hartvigsen, T., Ghassemi, M.: Identifying Implicit Social Biases in Vision-Language Models. Workshop@ICLR (2023)
[23] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: CVPR (2016)
[24] Hoffmann, D.T., Behrmann, N., Gall, J., Brox, T., Noroozi, M.: Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives. In: AAAI (2022)
[25] Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: OpenCLIP (2021). https://doi.org/10.5281/zenodo.5143773
[26] Isola, P., Lim, J.J., Adelson, E.H.: Discovering States and Transformations in Image Collections. In: CVPR (2015)
[27] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In: ICML (2021)
[28] Krizhevsky, A., Hinton, G., et al.: Learning Multiple Layers of Features from Tiny Images (2009)
[29] Lai, Z., Zhang, H., Wu, W., Bai, H., Timofeev, A., Du, X., Gan, Z., Shan, J., Chuah, C.N., Yang, Y., et al.: From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions. arXiv (2023)
[30] LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
[31] Li, X., Wang, Z., Xie, C.: CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy. Workshop@NeurIPS (2023)
[32] Li, X., Wang, Z., Xie, C.: Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? In: EMNLP (2023)
[33] Li, X., Wang, Z., Xie, C.: An Inverse Scaling Law for CLIP Training. In: NeurIPS (2023)
[34] Li, Y., Fan, H., Hu, R., Feichtenhofer, C., He, K.: Scaling Language-Image Pre-training via Masking. In: CVPR (2023)
[35] Liang, P.P., Deng, Z., Ma, M., Zou, J., Morency, L.P., Salakhutdinov, R.: Factorized Contrastive Learning: Going Beyond Multi-view Redundancy. In: NeurIPS (2023)
[36] Liang, V.W., Zhang, Y., Kwon, Y., Yeung, S., Zou, J.Y.: Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In: NeurIPS (2022)
[37] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: ECCV (2014)
[38] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: CVPR (2022)
[39] Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: ICLR (2019)
[40] Ma, Y., Xu, G., Sun, X., Yan, M., Zhang, J., Ji, R.: X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval. In: International Conference on Multimedia (2022)
[41] Materzyńska, J., Torralba, A., Bau, D.: Disentangling visual and written concepts in CLIP. In: CVPR (2022)
[42] Mayilvahanan, P., Wiedemer, T., Rusak, E., Bethge, M., Brendel, W.: Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity? In: ICLR (2024)
[43] Menon, S., Vondrick, C.: Visual Classification via Description from Large Language Models. In: ICLR (2023)
[44] Nguyen, T., Ilharco, G., Wortsman, M., Oh, S., Schmidt, L.: Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP. In: NeurIPS (2022)
[45] Oord, A.v.d., Li, Y., Vinyals, O.: Representation Learning with Contrastive Predictive Coding. arXiv (2018)
[46] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning Transferable Visual Models From Natural Language Supervision. In: ICML (2021)
[47] Rashtchian, C., Herrmann, C., Ferng, C.S., Chakrabarti, A., Krishnan, D., Sun, D., Juan, D.C., Tomkins, A.: Substance or Style: What Does Your Image Embedding Know? arXiv (2023)
[48] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet Large Scale Visual Recognition Challenge. IJCV (2015)
[49] Sariyildiz, M.B., Perez, J., Larlus, D.: Learning Visual Representations with Caption Annotations. In: ECCV (2020)
[50] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C.W., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S.R., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: LAION-5B: An open large-scale dataset for training next generation image-text models. In: Datasets and Benchmarks Track@NeurIPS (2022)
[51] Shi, P., Welle, M.C., Björkman, M., Kragic, D.: Towards understanding the modality gap in CLIP. In: Workshop@ICLR (2023)
[52] Shtedritski, A., Rupprecht, C., Vedaldi, A.: What does CLIP know about a red circle? Visual prompt engineering for VLMs. In: ICCV (2023)
[53] So, J., Oh, C., Lim, Y., Byun, H., Shin, M., Song, K.: Geodesic Multi-Modal Mixup for Robust Fine-Tuning. In: NeurIPS (2023)
[54] Sohn, K.: Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In: NeurIPS (2016)
[55] Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv (2023)
[56] Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: YFCC100M: The New Data in Multimedia Research. Communications of the ACM (2016)
[57] Trager, M., Perera, P., Zancato, L., Achille, A., Bhatia, P., Soatto, S.: Linear Spaces of Meanings: Compositional Structures in Vision-Language Models. In: ICCV (2023)
[58] Tschannen, M., Kumar, M., Steiner, A., Zhai, X., Houlsby, N., Beyer, L.: Image Captioners Are Scalable Vision Learners Too. In: NeurIPS (2023)
[59] Udandarao, V.: Understanding and Fixing the Modality Gap in Vision-Language Models (2022), Master’s thesis
[60] Visheratin, A.: NLLB-CLIP–train performant multilingual image retrieval model on a budget. arXiv (2023)
[61] Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., Locatello, F.: Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. In: NeurIPS (2021)
[62] Wang, T., Isola, P.: Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In: ICML (2020)
[63] Wu, C., Maji, S.: How well does CLIP understand texture? Workshop@ECCV (2022)
[64] Xu, H., Xie, S., Tan, X.E., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying CLIP Data. In: ICLR (2024)
[65] Yamada, Y., Tang, Y., Yildirim, I.: When are Lemons Purple? The Concept Association Bias of CLIP. In: EMNLP (2023)
[66] Yao, L., Chen, W., Jin, Q.: CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge. In: WWW (2023)
[67] Yu, A., Grauman, K.: Fine-Grained Visual Comparisons with Local Learning. In: CVPR (2014)
[68] Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: CoCa: Contrastive Captioners are Image-Text Foundation Models. TMLR (2022)
[69] Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It? In: ICLR (2022)
[70] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid Loss for Language Image Pre-Training. In: ICCV (2023)
[71] Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: LiT: Zero-Shot Transfer with Locked-image text Tuning. In: CVPR (2022)
[72] Zhang, R., Zeng, Z., Guo, Z., Li, Y.: Can Language Understand Depth? In: International Conference on Multimedia (2022)
[73] Zhang, Y., HaoChen, J.Z., Huang, S.C., Wang, K.C., Zou, J., Yeung, S.: Diagnosing and Rectifying Vision Models using Language. In: ICLR (2023)
[74] Zhou, C., Zhong, F., Öztireli, C.: CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable and Controllable Text-Guided Face Manipulation. In: SIGGRAPH (2023)

Appendix 0.A Evaluation Details

We ran our evaluations on ImageNet [48], MS COCO [37, 10], MIT-States [26], and UT-Zappos [67]. The datasets comprise 50000, 25000 (5000 images with 5 captions each), 12995, and 2914 test samples, respectively. ImageNet and MS COCO are standard datasets for evaluating object recognition and retrieval performance, respectively. We used the standard evaluation protocols to compute accuracy and image retrieval performance, respectively. MIT-States consists of 245 objects and 115 adjectives (attributes), while UT-Zappos consists of 12 shoe types with 16 fine-grained states ($\sim$ attributes). For both datasets, we assume that the object in an image is unknown and only aim to predict the adjective or fine-grained state. We treated this as a classification problem, following previous work [57]. Note that these datasets implicitly assume that the adjectives are mutually exclusive per image. However, this is not necessarily true, as multiple adjectives or fine-grained states may be present in an image.
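The attribute evaluation described above can be sketched as a nearest-text-embedding classifier. This is a minimal illustration, assuming precomputed image embeddings and one text embedding per adjective; the function names and the toy arrays are hypothetical stand-ins for encoder outputs, and the single-label prediction mirrors the mutual-exclusivity assumption noted above.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar."""
    # L2-normalise so that the dot product equals the cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    return np.argmax(img @ txt.T, axis=1)

def accuracy(pred, labels):
    """Fraction of images whose predicted class equals the label."""
    return float(np.mean(pred == np.asarray(labels)))

# Toy embeddings standing in for encoder outputs (3 adjectives, 4 images).
text_emb = np.eye(3)
image_emb = np.array([[0.9, 0.1, 0.0],
                      [0.2, 0.7, 0.1],
                      [0.0, 0.3, 0.8],
                      [0.6, 0.5, 0.1]])
pred = zero_shot_classify(image_emb, text_emb)
```

Because only the arg-max class counts, an image that shows several valid adjectives is still scored against a single label.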

Contrastive vision-language model details. For our large-scale analyses, we used a total of 112 contrastive vision-language models trained across various datasets provided by OpenCLIP [25, 11] (https://github.com/mlfoundations/open_clip). It contains contrastive vision-language models, such as OpenAI’s CLIP [46], CLIP-A [33], EVA-CLIP [55], CoCa [68], NLLB-CLIP [60], or SigLIP [70]. Note that these models use various backbones, including ResNet [23], ConvNeXt [38], or ViT [17]. The models were trained on, e.g., OpenAI’s proprietary (400M) WebImageText dataset [46], LAION-400M, LAION-2B, LAION-5B [50], Merged-2B (merge of 1.6B samples from LAION-2B and 0.4B samples from COYO-700M [5]) [55], WebLI [9], So-400M [2], MetaCLIP (400M) [64], Conceptual 12M [7], YFCC (15M) [56], CommonPool-s (max. 12.8M; refer to Table 3 of Gadre et al. [18] for the details of filtering), CommonPool-m (max. 128M), CommonPool-l (max. 1.28B), CommonPool-xl (max. 12.8B) [18], or DataPool-s (1.4M), DataPool-m (14M), DataPool-l (140M), DataPool-xl (1B) [18].

Figure 9: Causal graph of MAD.

Multi-modal Attributes and Digits (MAD). Our dataset Multi-modal Attributes and Digits (MAD) is based on the MNIST [30] variation Morpho-MNIST [6]. The causal graph of the data-generating process of MAD is depicted in Fig. 9. We used the following words for digits (0, …, 9), altering image thickness (thickening, thinning, no thickthinning), swelling (swelling, no swelling), fractures (fracture, no fracture), scaling (large, small), and color (gray, red, green, blue, cyan, magenta, yellow). Thus, we have 16 different attributes. Fig. 10 provides examples of image-text pairs of MAD.

In our experiments, we investigated information imbalance in the captions by restricting the number of attributes present within each caption. We provide examples below, where we progressively reduce the amount of information within the captions, i.e., fewer latent factors (attributes) are present in the caption:

• Full information setting (i.e., digit & all five attributes)
  – yellow-swelling-thickening-9-large-fracture
  – swelling-thickening-6-red-small-fracture
  – 5-large-yellow-no swelling-fracture-thinning
• Partial information setting I (i.e., digit & four attributes)
  – yellow-swelling-thickening-9-large
  – swelling-thickening-6-red-small
  – 5-large-yellow-no swelling-fracture
• Partial information setting II (i.e., digit & three attributes)
  – yellow-swelling-thickening-9
  – swelling-thickening-6-red
  – 5-large-yellow-no swelling
• Partial information setting III (i.e., digit & two attributes)
  – yellow-swelling-9
  – swelling-thickening-6
  – 5-large-yellow
• Partial information setting IV (i.e., digit & one attribute)
  – yellow-9
  – swelling-6
  – 5-large

Note that while all the latent factors, i.e., digit and all five attributes, still affect the generated image, the caption may only provide partial information, i.e., attributes are missing from the caption.
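The caption settings above can be sketched as follows. This is a hedged illustration rather than the generation code used for MAD: the function `make_caption`, the attribute names, the random choice of which attributes to keep, and the random word order are all assumptions made for this example.

```python
import random

# Hypothetical names for the five attribute groups of MAD.
ATTRIBUTE_NAMES = ["thickness", "swelling", "fracture", "scale", "color"]

def make_caption(digit, attributes, num_attributes, rng):
    """Build a caption from the digit and `num_attributes` attribute words.

    `attributes` maps each attribute name to its word for this image.
    The digit is always kept; which attributes survive is chosen at random
    (an assumption of this sketch).
    """
    kept = rng.sample(ATTRIBUTE_NAMES, num_attributes)
    words = [str(digit)] + [attributes[name] for name in kept]
    rng.shuffle(words)  # word order varies in the examples above
    return "-".join(words)

rng = random.Random(0)
latents = {"thickness": "thickening", "swelling": "swelling",
           "fracture": "fracture", "scale": "large", "color": "yellow"}
full = make_caption(9, latents, 5, rng)     # full information setting
partial = make_caption(9, latents, 1, rng)  # partial information setting IV
```

In both calls the image would be generated from all five attributes; only the caption changes, which is what creates the information imbalance.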

Appendix 0.B Model and Training Details for the Experiments on Multi-modal Attributes and Digits

Model details. We used small CLIP models. Specifically, the ViT-based vision backbone comprises 6 layers, each with a dimensionality $d$ of 256 and $\lfloor d/64\rfloor=4$ heads. The transformer-based language backbone also comprises 6 layers, each with a dimensionality of 256 and 8 heads. We set the patch size to 7 and context length to 8. The vocabulary consists of 28 words, i.e., all the words for digits (10) and attributes (16), as well as a start and end symbol (2).
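As a consistency check on the numbers above, the configuration and vocabulary can be written down explicitly. This is a sketch only: the class `TowerConfig`, the special-token names, and the treatment of multi-word attributes such as "no swelling" as single vocabulary entries are assumptions of this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TowerConfig:
    layers: int
    width: int
    heads: int

vision = TowerConfig(layers=6, width=256, heads=256 // 64)  # floor(d/64) = 4
text = TowerConfig(layers=6, width=256, heads=8)
PATCH_SIZE, CONTEXT_LENGTH = 7, 8

DIGITS = [str(d) for d in range(10)]
ATTRIBUTES = [
    "thickening", "thinning", "no thickthinning",
    "swelling", "no swelling",
    "fracture", "no fracture",
    "large", "small",
    "gray", "red", "green", "blue", "cyan", "magenta", "yellow",
]
SPECIAL = ["<start>", "<end>"]  # hypothetical token names
VOCAB = {word: idx for idx, word in enumerate(SPECIAL + DIGITS + ATTRIBUTES)}

# Longest caption: 1 digit + 5 attribute words + 2 special tokens.
max_tokens = 1 + 5 + len(SPECIAL)
```

Under these assumptions the vocabulary has 28 entries, and the longest caption fills the context length of 8 exactly.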

Training details. We trained all models with a batch size of 128 for 200 epochs with a learning rate warm-up period of 5 epochs. We used AdamW [39] as the optimizer with a cosine annealing learning rate schedule [39]. We always selected the best-performing learning rate among 3 learning rates $\{5\cdot 10^{-4},\;5\cdot 10^{-5},\;10^{-5}\}$, each trained with 3 random seeds. The best learning rate was selected by comparing average accuracies over the ideal word accuracy and average zero-shot accuracy on all attributes and the class label. For all of our results, we report the average over 3 random seeds.
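The schedule and the selection rule above can be sketched as below. The linear shape of the warm-up, the decay to zero, the per-epoch (rather than per-step) granularity, and the function names are assumptions of this example, not details stated in the text.

```python
import math

def learning_rate(epoch, base_lr, warmup_epochs=5, total_epochs=200):
    """Linear warm-up followed by cosine annealing to zero (assumed shape)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

CANDIDATE_LRS = [5e-4, 5e-5, 1e-5]

def select_learning_rate(scores):
    """Pick the learning rate with the best mean selection score over seeds.

    `scores` maps a learning rate to a list of per-seed scores.
    """
    return max(scores, key=lambda lr: sum(scores[lr]) / len(scores[lr]))
```

Here `scores` would hold, per candidate learning rate, the three per-seed values of the averaged accuracy criterion described above.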

Appendix 0.C Alignment and Uniformity Terms in InfoNCE

Contrastive representation learning [12, 21, 54, 45] leverages paired inputs as a weak supervision signal. The basic idea is to learn representations in a shared representation space that are as similar as possible for “positive/matching” pairs and as dissimilar as possible for “negative/non-matching” pairs. A popular choice for contrastive learning approaches is the InfoNCE objective [45]:

\mathcal{L}(f_{x},f_{y})\coloneqq\underset{(\mathbf{x}_{i},\mathbf{y}_{i})\sim p_{\text{data}}}{\mathbb{E}}\left[-\log\frac{\exp(f_{x}(\mathbf{x}_{i})^{T}f_{y}(\mathbf{y}_{i})/\tau)}{\exp(f_{x}(\mathbf{x}_{i})^{T}f_{y}(\mathbf{y}_{i})/\tau)+\sum\limits_{j=1}^{N-1}\exp(f_{x}(\mathbf{x}_{i})^{T}f_{y}(\mathbf{y}_{j})/\tau)}\right],

(4)

where $f_{x},f_{y}$ are two encoders for the inputs $\mathbf{x},\mathbf{y}$, $\tau$ is the scalar temperature, and $p_{\text{data}}$ is the data distribution over $\mathbb{R}^{d}\times\mathbb{R}^{d}$. Wang & Isola [62] define two components of the loss:

• Alignment (numerator): matching pairs should be close, i.e., aligned.
• Uniformity (denominator): representations should be roughly uniformly distributed on the unit hypersphere.

For multi-modal contrastive representation learning, it is popular to use a symmetric version of the above InfoNCE objective [46]:

\mathcal{L}_{\text{sym}}=\frac{1}{2}\mathcal{L}(f_{x},f_{y})+\frac{1}{2}\mathcal{L}(f_{y},f_{x})\quad.

(5)

Note that the repulsive forces of the uniformity term only act across the modalities but not within them.
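A minimal NumPy sketch of Eqs. 4 and 5 with in-batch negatives is given below. It assumes that the rows of `zx` and `zy` are already L2-normalised encoder outputs and that row i of one matrix matches row i of the other; the function names are illustrative. The logits only contain cross-modal similarities (`za @ zb.T`), which is why the repulsion acts across modalities and never within one.

```python
import numpy as np

def info_nce(za, zb, tau):
    """Eq. 4 with in-batch negatives: row i of `za` matches row i of `zb`."""
    logits = za @ zb.T / tau                             # (N, N) cross-modal similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the log-probabilities of the matching pairs.
    return float(-np.mean(np.diag(log_prob)))

def symmetric_info_nce(zx, zy, tau):
    """Eq. 5: average of the x-to-y and y-to-x losses."""
    return 0.5 * info_nce(zx, zy, tau) + 0.5 * info_nce(zy, zx, tau)
```

For two perfectly aligned, orthogonal pairs (`zx = zy = np.eye(2)`) and `tau = 1`, each row contributes $\log(1+e^{-1})$, which is a convenient sanity check.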

Mismatches Can Cause Global Optima Exhibiting a Modality Gap

Going beyond the discussion in the main text, we illustrate below that the modality gap can even be present in the global optimum under certain data properties. Under perfect image-text pairs, it is easy to see that the global minimum of the contrastive loss does not exhibit a modality gap, i.e., all matching image-text pairs are perfectly aligned and the pairs are uniformly distributed on the unit hypersphere. But what happens in the presence of mismatches caused by, e.g., miscaptioning?

To illustrate the effect of such mismatches on the final image and text embeddings, we designed a 2D toy example. We generated two sets of points on the unit circle and directly optimized their positions. The points represent the embeddings of both modalities: $\{\mathbf{x}_{1},\mathbf{x}_{2}\}=\mathbf{X}$ and $\{\mathbf{y}_{1},\mathbf{y}_{2}\}=\mathbf{Y}$, where $\mathbf{x}_{1},\mathbf{y}_{1},\mathbf{x}_{2},\mathbf{y}_{2}\in\{[a,b]^{T}\mid a^{2}+b^{2}=1\}$. Further, we specified which point of set $\mathbf{X}$ matches which point in set $\mathbf{Y}$. For the perfect matching setting, we considered the following matching pairs:

\{(\mathbf{x}_{1},\mathbf{y}_{1}),(\mathbf{x}_{2},\mathbf{y}_{2})\}\quad.

(6)

It is easy to see that the global minimum (up to rotations) is: $\mathbf{x}_{1}=\mathbf{y}_{1}=[1,0]^{T}$, $\mathbf{x}_{2}=\mathbf{y}_{2}=[-1,0]^{T}$; see Fig. 11 for a visualization.

However, what happens when we introduce mismatches? For image-text pairs, this can happen if a human annotator miscaptions an image. It could also stem from human annotators focusing on different aspects of the image. We considered the following matching pairs (the additional copies of the matching pairs $(\mathbf{x}_{1},\mathbf{y}_{1})$ and $(\mathbf{x}_{2},\mathbf{y}_{2})$ are needed to avoid the degenerate global minimum $\mathbf{x}_{1}=\mathbf{y}_{1}=\mathbf{x}_{2}=\mathbf{y}_{2}$):

\{(\mathbf{x}_{1},\mathbf{y}_{1}),(\mathbf{x}_{1},\mathbf{y}_{1}),(\mathbf{x}_{2},\mathbf{y}_{2}),(\mathbf{x}_{2},\mathbf{y}_{2}),\underbrace{(\mathbf{x}_{1},\mathbf{y}_{2}),(\mathbf{x}_{2},\mathbf{y}_{1})}_{\text{mismatches}}\}\quad.

(7)

We recognize that identifying and removing the above mismatches may be simple in practice. However, our goal is to illustrate the impact of mismatches in a simplistic setting to provide an intuition for the behavior of the contrastive loss.

To search for the globally optimal embeddings for Eq. 7, we ran a grid search with an angular resolution of $6^{\circ}$. We found the following global minimum (again up to rotations): $\mathbf{x}_{1}=[1,0]^{T}$, $\mathbf{y}_{1}=[\cos{276^{\circ}},\sin{276^{\circ}}]^{T}$, $\mathbf{x}_{2}=[\cos{42^{\circ}},\sin{42^{\circ}}]^{T}$, $\mathbf{y}_{2}=[\cos{126^{\circ}},\sin{126^{\circ}}]^{T}$; see Fig. 11 for a visualization. It is apparent that the global minimum exhibits a modality gap.
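A grid search of this kind can be sketched as follows. Several details are assumptions of this example and are not given in the text: the temperature (`tau=1.0`), the rule that each listed pair contributes one symmetric InfoNCE term whose negative is the other point of the opposite set, and the function name `toy_grid_search`. The sketch is therefore not expected to reproduce the exact angles reported above.

```python
import numpy as np

def toy_grid_search(pairs, tau=1.0, step_deg=6):
    """Exhaustively search the angles of x1, y1, x2, y2 on the unit circle.

    `pairs` lists 0-based index pairs (i, j), meaning x_i is paired with y_j.
    x1 is pinned to 0 degrees because the loss is rotation invariant.
    """
    angles = np.deg2rad(np.arange(0, 360, step_deg))
    # Grids over the three free angles (y1, x2, y2).
    y1, x2, y2 = np.meshgrid(angles, angles, angles, indexing="ij")
    x = [np.zeros_like(y1), x2]
    y = [y1, y2]
    # For unit vectors on the circle, the dot product is cos(angle difference).
    sim = [[np.cos(x[a] - y[b]) / tau for b in range(2)] for a in range(2)]
    total = np.zeros_like(y1)
    for i, j in pairs:
        x_to_y = np.logaddexp(sim[i][0], sim[i][1]) - sim[i][j]
        y_to_x = np.logaddexp(sim[0][j], sim[1][j]) - sim[i][j]
        total += 0.5 * (x_to_y + y_to_x)
    best = np.unravel_index(np.argmin(total), total.shape)
    return {"x1": 0,
            "y1": int(best[0]) * step_deg,
            "x2": int(best[1]) * step_deg,
            "y2": int(best[2]) * step_deg}

# Perfect matching (Eq. 6): matched points coincide, the two pairs are antipodal.
perfect = toy_grid_search([(0, 0), (1, 1)])
```

The pair list of Eq. 7 can be passed in the same way, using 0-based indices and repeating the duplicated pairs.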

Appendix 0.D Extended details for Sec. 4

0.D.1 Relative Modality Gap

Limitations of L2M. L2M was initially proposed as a measure of the modality gap by Liang et al. [36]. However, it has several limitations that we discuss below:

1. L2M does not account for differences in how much of the unit hypersphere is effectively used.
2. L2M takes a distributional instead of a per-sample view.
3. The L2 norm can be sensitive to outlier embedding dimensions.

Regarding the first point: note that different models can use different amounts of the unit hypersphere, as the contrastive loss accounts for relative cosine similarities. Here, the similarity is always relative to the similarity to the negative samples. Consequently, models can use a varying degree of the unit hypersphere and, thus, L2 distances can have a different meaning. For example, consider two models that have the same L2M but the first model uses the entire unit hypersphere, while the second model only uses a small fraction of it. While L2M suggests that the modality gap distance is the same, the actual gap of the first model is significantly smaller, since the average distances between the samples are larger.

Regarding the second point: intuition suggests that matching image-text pairs should be close, while distances between non-matching pairs can (and should) be large. L2M, however, only considers the distance between the modality means over all samples. This can lead to misleading results. An illustrative (but unlikely) example is the case of two distributions that occupy exactly the same region of the hypersphere but are rotated against each other by some angle (i.e., they are very misaligned). Clearly, there exists a gap between the modalities, but L2M does not indicate it.

Last, the L2 norm can be sensitive to embedding dimensions that exhibit vast differences. For instance, the most modality-separating embedding dimensions that we discovered are such dimensions.

Relative Modality Gap (RMG). As a remedy to the limitations outlined above, we proposed a Relative Modality Gap (RMG) measure in the main text. RMG computes the distances between matching image-text pairs instead of the means to address 2. Since density estimation in high-dimensional spaces is difficult, we used the mean distances between all samples per modality as a rough approximation to address 1. Finally, we used cosine similarities instead of the L2 norm to address 3.
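The rotated-distributions example can be made concrete with a small NumPy sketch. `l2m` follows the description above (distance between the modality means). `relative_gap_sketch` only mirrors the three ingredients listed for RMG (per-pair distances, within-modality normalisation, cosine instead of L2); its exact normalisation is an assumption of this example and may differ from the definition in the main text.

```python
import numpy as np

def l2m(img, txt):
    """Euclidean distance between the modality means."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

def mean_within_distance(z):
    """Mean cosine distance over all distinct pairs within one modality."""
    n = len(z)
    dist = 1.0 - z @ z.T
    return float((dist.sum() - np.trace(dist)) / (n * (n - 1)))

def relative_gap_sketch(img, txt):
    """Matching-pair distance relative to the typical within-modality distance."""
    matching = float(np.mean(1.0 - np.sum(img * txt, axis=1)))
    within = 0.5 * (mean_within_distance(img) + mean_within_distance(txt))
    return matching / (within + matching)

# Both modalities cover the full unit circle; text is rotated by 90 degrees.
theta = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
img = np.stack([np.cos(theta), np.sin(theta)], axis=1)
txt = np.stack([np.cos(theta + np.pi / 2), np.sin(theta + np.pi / 2)], axis=1)
```

In this toy case `l2m(img, txt)` is numerically zero although every matching pair is orthogonal, whereas the per-pair, relative measure is roughly 0.5 and thus flags the misalignment.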

0.D.2 Rank Correlations when Fixing the Datasets

We also computed rank correlation for single datasets. We applied the following filtering criteria:

1. We removed models that were subsequently finetuned, e.g., on MS COCO.
2. We removed multiple instances of the same model due to varying training epochs, activations (GeLU vs. Quick-GeLU), or training protocol (e.g., augmentations). We used the models that were trained longer. For varying activations or training protocols, we used the model that achieved better image recognition performance on ImageNet.

We only considered datasets that have at least seven models after filtering (LAION-400M: 7, LAION-2B: 14, OpenAI’s CLIP dataset: 9, WebLI: 9). Note that the small number of models makes the rank correlations susceptible to noise and they need to be interpreted with caution.
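The per-dataset analysis can be sketched as below. The section does not name the rank correlation that was used, so Kendall's tau (tau-a, without tie correction) is an assumption of this example, as are the function names and the record layout `(dataset, modality_gap, accuracy)`.

```python
from collections import defaultdict
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    score = 0
    for i, j in pairs:
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        score += (product > 0) - (product < 0)
    return score / len(pairs)

def per_dataset_correlation(models, min_models=7):
    """Group (dataset, gap, accuracy) records and correlate within each dataset.

    Datasets with fewer than `min_models` records are skipped.
    """
    grouped = defaultdict(list)
    for dataset, gap, acc in models:
        grouped[dataset].append((gap, acc))
    return {
        name: kendall_tau([g for g, _ in rows], [a for _, a in rows])
        for name, rows in grouped.items()
        if len(rows) >= min_models
    }
```

With only seven to fourteen models per dataset, a few swapped ranks change the coefficient noticeably, which is the reason for the caution noted above.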
