(eccv) Package eccv Warning: Package ‘hyperref’ is loaded with option ‘pagebackref’, which is *not* recommended for camera-ready version

¹¹institutetext: Dartmouth College, Hanover NH 03755, USA ²²institutetext: Meta, FAIR
²²email: {maxwell.m.aladago.gr, LT, soroush.voshoughi}@dartmouth.edu

Semantic Compositions Enhance Vision-Language Contrastive Learning

Maxwell Mbabilla Aladago 11 Lorenzo Torresani 1122 Soroush Vosoughi 11

Abstract

In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP- $\mathcal{C}$ for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification and cross-modal retrieval. The benefits of CLIP- $\mathcal{C}$ are particularly pronounced in settings with relatively limited pretraining data.

Keywords:

multimodal learningsemantic composition pretraining

1 Introduction

Recent advancements in vision-language pretraining have propelled a multitude of tasks, including zero-shot image classification [41, 22, 32], video understanding [52, 59], and various multi-modal applications [27, 54, 36]. These successes echo the transformative trajectory initiated by large-scale pretraining efforts in Computer Vision (CV) [26, 17] and later in Natural Language Processing (NLP) [11, 42, 43, 1]. A prominent recent example in the vision-language interplay is the Contrastive Language-Image Pre-training (CLIP) model [41], which has become a benchmark for language-supervised training.

The objective of contrastive language-supervised pretraining is straightforward yet powerful: to align embeddings of corresponding image-text pairs in a shared embedding space while distancing the non-matching pairs [4, 5, 16]. CLIP has pioneered this direction with a dual-encoder framework, training on an expansive dataset of image-caption pairs sourced from the internet, using a bidirectional contrastive loss [37].

Subsequent studies aimed at improving the data efficiency of CLIP have introduced supplementary objectives, such as within-modality self-supervision [35, 32], multi-crop supervision [32], and the use of captions generated by large language models [13, 28, 59]. These methods typically necessitate additional computational steps, such as multiple forward passes or the inclusion of extra encoders.

Our contribution, termed CLIP- $\mathcal{C}$ (illustrated in Fig. 1), enhances data efficiency through an innovative but straightforward compositional approach, merging original image-caption pairs into novel compound examples. This approach builds upon the achievements of CutMix [56] in the domain of vision categorization tasks, adapting it to the vision-language pretraining context. This adaptation results in significant enhancements in model performance across various downstream tasks, achieved without incurring extra computational costs or increasing model complexity.

Our technique involves composing together two image-caption pairs from the dataset. This is done by conjoining the captions with “and” serving as the conjunction and merging the central crops of both images. The result is a set of compound instances that embody a broader array of concepts than the individual pairs, presenting the model with expanded semantic challenges that drive learning (Fig. 5). The two image-caption pairs constituting the composite instance are sampled dynamically in each iteration based on a predefined probability, empowering the model to uncover novel combinations of examples throughout training.

Distinguishing itself from stylistic variation methods that manipulate single examples, CLIP- $\mathcal{C}$ leverages “semantic composition” to introduce contextually varied training examples. Interestingly, we discover that the benefits of CLIP- $\mathcal{C}$ arise from other factors besides the diverse nature of these novel semantic associations. Surprisingly and very counter-intuitively, we find that the model learns to match compound image-caption pairs more easily than the original plain image-caption pairs. This then initiates a positive spillover effect where the model is able to learn better representations of plain unmodified examples in CLIP- $\mathcal{C}$ compared to CLIP.

Like [13], our approach improves the performance of CLIP in poor data scenarios. However, we enhance the data with compositions of both the images and captions without any reliance on external systems. As a result, we can efficiently and flexibly generate novel captions and images online during training. Moreover, our method does not increase the batch size or the number of iterations needed, maintaining operational parity with CLIP. Indeed, we demonstrate that training CLIP longer or with a higher batch size than CLIP- $\mathcal{C}$ is not sufficient to close the performance gap between the two methods.

In downstream applications, CLIP- $\mathcal{C}$ exhibits a competitive edge, surpassing CLIP by over 5% in cross-modal retrieval accuracy on Flickr30k [53] and showing substantial improvements on MS-COCO [33] in both image-to-text and text-to-image retrieval tasks. Additionally, our model demonstrates impressive gains in zero-shot classification, with a 2% increase on ImageNet [9], and superior linear evaluation results without necessitating any additional model parameters, memory, or dependence on external language processing systems. Even when evaluated on relatively large datasets such as CC12M [3] and RedCaps [10], CLIP- $\mathcal{C}$ still outperform the baseline CLIP in cross-modal retrieval and zero-shot classification tasks, albeit with decreased margins.

Finally, we believe that CLIP- $\mathcal{C}$ will be particularly beneficial in contexts where image-text datasets do not exist in large quantities or are not easily accessible (e.g., medical images, satellite images, etc.). However, it is not feasible to carry out comprehensive evaluations in these domains precisely because there are no established benchmarks of images with captions for them. Thus, we use evaluations on medium-size Web-derived datasets to demonstrate the potential value of our approach in application scenarios where in-domain data is not as abundantly available as for general natural images.

2 Related Works

The use of language as an effective supervisory signal for learning visual representations has a rich history in machine learning [40, 15, 24, 30, 41]. Early influential works such as DeViSE [15] first learned semantic relations using unannotated textual data before map** the images into that semantic space using class labels. More recently, models like CLIP [41], ALIGN [22], and others [35, 32, 20, 21] further improved the capabilities of joint vision-language embedding models by training on massive image-text paired datasets contrastively using the InfoNCE loss [37]. Our work aligns with these prior arts but focuses on incorporating semantic compositions during pretraining to improve data efficiency and enhance performance.

Both CLIP [41] and ALIGN [22] use huge datasets —400 million and 1B image-text pairs for CLIP and ALIGN, respectively. DeCLIP [32] improves the data efficiency of CLIP by incorporating several training objectives, including self-supervision within each modality [5, 11], nearest-neighbor supervision, and multi-view supervision [2]. SLIP [35], on the other hand, adds image self-supervision, SimCLR [4], to the language supervision. In [50], Wu et al. show good zero-shot results in the low data regime through soft image-text matches via optimal transport distillation. These methods, however, require multiple passes through the image encoder [35, 32] for each update or a first-in-first-out feature queue [32] to generate the representations for the extra objectives. Our method is free of these additional complexities.

Most similar to our work are data augmentation methods such as CutMix [56] and MixUP [58] which have been very effective in training categorization models in computer vision. Our work brings the benefits of these established augmentation techniques in image understanding to the vision-language joint-embedding space. In addition to the incorporation of language, our method differs from CutMix by concatenating the image crops instead of pasting one crop on the other. Additionally, we train our models using contrastive loss. As our method is a pre-training mechanism, we do not discuss works [23, 19, 57, 31, 55] that use open-source CLIP checkpoints transfer learning scenarios.

Refer to caption — Figure 1: CLIP- $\mathcal{C}$ : We use the center half crops spanning the width (as in this illustration) or the height of the image. The captions are concatenated with the delimiter “and”. We vary the positions of the captions on either side of the conjunction, *i.e*., the output caption can be either (a) $\{\text{caption1 }\text{and }\text{caption2}\}$ or (b) $\{\text{caption2 }\text{and }\text{caption1}\}$ . We emphasize that only a fraction of the batch in each iteration constitute composite samples. The colored boxes and texts shown here are for illustrative purposes.

3 Method

This section covers a background of the baseline method as well as the core components of CLIP- $\mathcal{C}$ ’s framework.

3.1 Background

Contrastive Language-Image Pre-training (CLIP) from Radford et al. [41] has emerged as a highly successful approach for training vision-language models. CLIP is a dual encoder model with separate encoders $f_{I}$ and $f_{T}$ for extracting visual and textual features respectively. It also has two dedicated projection functions $g_{I}$ and $g_{T}$ that map the outputs of the encoders to a shared embedding space. Given a batch of $B$ images and text pairs $\left\{x_{I}^{(i)},x_{T}^{(i)}\right\}_{i=1}^{B}$ in each training step, CLIP computes the embeddings $z_{I}^{(i)}=g_{I}\left(f_{I}\left(x_{I}^{(i)}\right)\right)$ and $z_{T}^{(i)}=g_{T}\left(f_{T}\left(x_{T}^{(i)}\right)\right)$ where $z_{I}^{(i)}\in\mathbb{R}^{d}$ represents the normalized features of image $x_{I}^{(i)}$ . $z_{T}^{(i)}\in\mathbb{R}^{d}$ denotes the normalized features of the corresponding caption $x_{T}^{(i)}$ . The loss is evaluated using InfoNCE [37] whereby matching image-text pairs $\{x_{I}^{(i)},x_{T}^{(i)}\}$ constitute the positive samples and non-matching pairs $\{x_{I}^{(i)},x_{T}^{(j)}\}\quad\forall j\neq i$ form the negative examples. A bidirectional loss is computed as

\displaystyle\mathcal{L}_{I_{2}T}

\displaystyle=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(\frac{1}{\tau}% \text{sim}\left(z_{I}^{(i)},z_{T}^{(i)}\right)\right)}{\sum_{j=1}^{B}\exp\left% (\frac{1}{\tau}\text{sim}\left(z_{I}^{(i)},z_{T}^{(j)}\right)\right)}

(1)

\displaystyle\mathcal{L}_{T_{2}I}

\displaystyle=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(\frac{1}{\tau}% \text{sim}\left(z_{I}^{(i)},z_{T}^{(i)}\right)\right)}{\sum_{k=1}^{B}\exp\left% (\frac{1}{\tau}\text{sim}\left(z_{I}^{(k)},z_{T}^{(i)}\right)\right)}

(2)

where temperature $\tau$ is typically a learnable parameter used to scale the logits. $\tau$ is fixed in all of our ablation experiments as it has a noticeable impact on the model [29] which makes comparisons across different experiments difficult. $\text{sim}(\cdot,\cdot)$ is a similarity function measuring the distance between the features. In CLIP [41] and our experiments, $\text{sim}(\cdot,\cdot)$ is set as the dot product function. The total loss is an average of the two losses in Eq. 1 and Eq. 2:

\displaystyle\mathcal{L}=(\mathcal{L}_{I_{2}T}+\mathcal{L}_{T_{2}I})/2\ .

(3)

3.2 CLIP- $\mathcal{C}$

In each training step, CLIP- $\mathcal{C}$ samples a batch of examples of size $B$ , $\left\{\hat{x}_{I}^{(i)},\hat{x}_{T}^{(i)}\right\}_{i=1}^{B}$ . Any given paired instance $\left(\hat{x}_{I}^{(i)},\hat{x}_{T}^{(i)}\right)$ is either the original example $\left({x}_{I}^{(i)},{x}_{T}^{(i)}\right)$ or a composition of that example and another example $\left({x}_{I}^{(i^{\prime})},{x}_{T}^{(i^{\prime})}\right),i\neq i^{\prime}$ , drawn from the dataset. Note that index $i^{\prime}$ is taken with respect to the dataset size and not the batch size $B$ , i.e., sample $i^{\prime}$ may not be present in the current mini-batch. The proportion of composed samples in any mini-batch is controlled by a sampling rate hyper-parameter $\rho$ . The impact of this parameter is discussed in Sec. 6.2.

In the case whereby $\left(\hat{x}_{I}^{(i)},\hat{x}_{T}^{(i)}\right)$ is a composite sample, the new caption $\hat{x}_{T}^{(i)}$ is a concatenation of the two original captions involved: $\hat{x}_{T}^{(i)}=[x_{T}^{(i)},x_{T}^{(i^{\prime})}]$ where $[\cdot,\cdot]$ is a string concatenation function with the word “and” as a conjunction. The positions of the captions on either side of this conjunction change, with $x_{T}^{(i)}$ appearing first fifty percent of the time.

The new image is composed of the center half crops spanning either the height or the width of each image. For example, if the images have resolution $(S\times S)$ , either $(\frac{S}{2}\times S)$ or $(S\times\frac{S}{2})$ center crops are taken from both images and concatenated as illustrated in Fig. 1. We experiment with other forms of image augmentation methods such as MixUP[58] and CutMix[56] in Tab. 8.

After assembling the mini-batch as described above, CLIP- $\mathcal{C}$ proceeds to extract the image and text features as in CLIP: $\hat{z}_{I}^{(i)}=g_{I}\left(f_{I}\left(\hat{x}_{I}^{(i)}\right)\right)$ and $\hat{z}_{T}^{(i)}=g_{T}\left(f_{T}\left(\hat{x}_{T}^{(i)}\right)\right)$ . With $\hat{z}_{I}^{(i)}$ and $\hat{z}_{T}^{(i)}$ computed, Eq. 1, Eq. 2, and Eq. 3 are used to compute the InfoNCE loss.

The sampling strategy CLIP- $\mathcal{C}$ employs exposes the model to a much higher diversity of images and their corresponding captions compared to the vanilla pretraining pipeline. As a result, we observe much more significant improvements in downstream transfer when the pretraining dataset is small. It is reasonably expected that relatively larger datasets such as RedCaps [10] are already sufficiently diverse and, therefore, may not benefit from our method. Nonetheless, CLIP- $\mathcal{C}$ still does better than CLIP on these large datasets.

4 Experimental Setup

All our experiments use the CLIP framework due to its demonstrated effectiveness, simplicity, and widespread usage. We emphasize that we do not use pretrained CLIP checkpoints from prior works as our method is a pretraining mechanism. Thus, we retrain CLIP on our pretraining datasets and compare it to our approach. Finally, due to resource constraints, we conduct our experiments in the low data and small model regimes. Consequently, we are unable to compare with prior large-scale training systems.

Pretraining Datasets. We use three widely adopted web-crawled datasets of varying sizes and distributions for our experiments: Conceptual Captions [44], Conceptual 12M [3], and RedCaps [10]. These three datasets together enable us to assess the effectiveness of our method across pretraining datasets of different sizes and qualities.

Models. We use Vision Transformer [12] models of various sizes as in [35]. The vision encoder is set to ViT-S/16 [45] in all our ablation experiments unless explicitly specified otherwise. We use ViT-B/16 [12, 45] as the image encoder to demonstrate the efficacy of our method at scale as we are unable to run much bigger models such as ViT-L/16 because of resource constraints. The text encoder in all our experiments is set to the 38M parameter text Transformer model from [41]. Following previous methods, Byte-Pair encoding is used for tokenization with a context length of 77 and a vocabulary size of 49k. Finally, we fixed the temperature parameter at $0.01$ , the maximum value used in CLIP [41].

Hyper-parameters. We train all our models using PyTorch [39] with a global batch size of $2,048$ split across 8 GPUS in a single machine. AdamW [34] is the optimizer during pretraining. All models are pretrained for 40 epochs using a cosine decay learning rate schedule with a base rate of $0.003$ , a warm-up period of $5$ epochs, and a final learning rate of $1e^{-5}$ . The weight decay parameter is always set to $0.1$ . Random crop** is the only augmentation applied to the images during pretraining. We refer the reader to the Supplemental for more detailed information about these and other hyper-parameters.

Evaluation. We perform zero-shot evaluation on several classification benchmarks using class names and prompts provided by [41, 35]. First, the embeddings for all classes in a given benchmark are computed with each class embedding being an ensemble of multiple prompt templates. The highest cosine similarity between the image embedding and the class embeddings is then used as the zero-shot prediction.

We test our model on eleven downstream datasets including ImageNet [9], CIFAR-10 [25], CIFAR-100 [25], Caltech-101 [14], Oxford Pets [38], Country211 [41], DTD [7], Sun397 [51], STL-10 [8], RESISC-45 [6], and EuroSAT [18]. Following previous works [35, 13], we use “mean per class accuracy” as the metric for Oxford Pets and Caltech-101. Accuracy is the metric for all other datasets. In addition to the zero-shot analysis, we also conduct zero-shot retrieval in Sec. 5.3, and linear probing evaluations in Sec. 5.4. For the linear probing experiments, we use the standard “train” and “test” splits for training and evaluation whenever possible. In instances where the standard splits are not present (RESISC-45 [6], and EuroSAT [18]), we randomly split the dataset into an 80%-20% ratio for training and testing respectively.

5 Results

This section outlines our key comparisons between CLIP and CLIP- $\mathcal{C}$ (our method) on zero-shot image classification, cross-modal retrieval, and linear probing. However, we explain first why our method works.

5.1 Why is CLIP- $\mathcal{C}$ an Effective Method?

Why will combining multiple different image-caption pairs into single instances during pretraining lead to improvements in downstream evaluations? In other words, why will CLIP- $\mathcal{C}$ work? To investigate this salient question, we examined the pretraining losses and cosine similarities of both the composite examples and plain examples as the model evolves. This fine-grained tracking of training mechanics provides insights into how the model handles plain simple examples versus composite examples, and whether there are any differences between the two groups.

Contrary to expectation that compound examples will be the more challenging to the model (since they are multiple examples condensed into single instances), we observed precisely the opposite: as shown in Fig. 3, the loss on the composite examples is lower than the loss on plain examples especially in the early stages. Our hypothesis for this empirical observation is that the model more easily recognizes compound image-caption pairs because they tend to be structurally different from plain examples. The more interesting development arising from this phenomenon, however, is that the model is encouraged to dedicate more effort into learning the plain examples in CLIP- $\mathcal{C}$ compared to CLIP as seen in Fig. 3. We believe this elevated learning of plain examples together with the use of dynamic semantic compositions (See Sec. 6) all contribute to the superior capabilities of our method as discussed in the next sections.

Table 1: Zero-shot Image Classification: CLIP-

\mathcal{C}

is our method. CLIP is a the model from [41] trained in our setting. CC3M CLIP-

\mathcal{C}

models use

\rho=0.3

while CC12M and RedCaps models use

\rho=0.15

. Bold numbers are the best in each dataset and architecture comparison.

PT Dataset	Method	Food-101	CIFAR-10	CIFAR-100	Caltech-101	Pets	DTD	Country211	Sun397	STL-10	RESISC45	EuroSAT	ImageNet
Vision Encoder: ViT-S/16
CC3M	CLIP	11.6	56.1	22.7	46.9	12.9	10.5	0.6	20.5	77.0	24.5	23.7	18.5
CC3M	CLIP- $\mathcal{C}$	15.1	66.4	26.9	51.9	14.5	14.8	0.7	27.2	84.6	25.4	30.7	20.5
Vision Encoder: ViT-B/16
CC3M	CLIP	13.8	54.8	20.4	49.8	14.9	12.2	0.7	21.9	76.0	22.7	19.6	19.6
CC3M	CLIP- $\mathcal{C}$	15.7	58.0	28.5	50.1	11.4	14.2	0.7	27.8	86.8	26.1	21.3	21.2
CC12M	CLIP	46.9	78.0	43.0	76.2	57.2	19.3	4.8	41.2	89.7	33.8	27.8	37.9
CC12M	CLIP- $\mathcal{C}$	48.1	76.8	44.8	73.5	60.8	21.9	5.0	41.1	90.3	36.2	36.1	38.5
RedCaps	CLIP	78.8	72.8	38.7	72.1	76.0	16.2	6.1	27.5	92.9	36.5	30.9	40.7
RedCaps	CLIP- $\mathcal{C}$	79.0	73.7	42.2	72.1	77.1	18.1	6.6	29.4	94.2	41.1	34.8	41.6

5.2 Zero-shot Image Classification

We conduct a thorough study of the transfer learning capabilities of our model in zero-shot image classification on many downstream benchmarks, including ImageNet [9] in Tab. 1. Across different pretraining datasets, our method substantially improves over CLIP. For ViT-S/16, CLIP- $\mathcal{C}$ achieves a $2\%$ top-1 improvement over the baseline CLIP model on ImageNet while outperforming CLIP on $12$ out of $12$ downstream datasets when pretraining on CC3M. Furthermore, these enhancements are maintained when we scale the vision encoder from ViT-S/16 to ViT-B/16 showing the continued effectiveness of our method over CLIP in a bigger model. When pretraining on RedCaps and CC12M, the gains of CLIP- $\mathcal{C}$ over CLIP on ImageNet are respectively are $0.9\%$ and $0.4\%$ . These results are remarkable, considering that our approach and CLIP both use the same number of parameters, memory, and computational resources during pretraining. Even in the relatively data-rich settings of CC12M and RedCaps, CLIP- $\mathcal{C}$ still improves over CLIP on $11$ out of $12$ benchmarks for RedCaps and $9$ of the $12$ benchmarks for CC12M.

Table 2: Zero-shot Cross-modal Retrieval.

\rho

is set to

0.3

for CC3M and

0.15

for CC12M abd RedCaps. Similarly to zero-shot classification, our semantic composition model is nontrivially better than CLIP on zero-shot retrieval.

Vision Encoder: ViT-S/16
		Flickr30k				MS-COCO
PT Dataset	Method	Image $\rightarrow$ Text		Text $\rightarrow$ Image		Image $\rightarrow$ Text		Text $\rightarrow$ Image
PT Dataset	Method	R@1	R@5	R@1	R@5	R@1	R@5	R@1	R@5
CC3M	CLIP	35.2	62.3	25.4	49.12	17.3	39.0	13.1	31.2
CC3M	CLIP- $\mathcal{C}$	40.7	70.9	30.6	57.9	21.4	45.6	16.2	36.5
Vision Encoder: ViT-B/16
CC3M	CLIP	36.1	65.1	26.3	52.4	18.6	41.1	13.9	32.8
CC3M	CLIP- $\mathcal{C}$	39.6	69.4	31.2	58.3	22.9	46.7	17.0	37.9
CC12M	CLIP	61.5	87.2	46.1	74.9	36.2	64.2	25.3	49.7
CC12M	CLIP- $\mathcal{C}$	66.0	87.8	49.5	75.6	38.4	65.6	26.4	51.5
RedCaps	CLIP	26.8	51.9	20.5	42.5	24.3	44.8	16.7	35.7
RedCaps	CLIP- $\mathcal{C}$	32.3	57.2	23.6	44.9	27.1	49.2	18.2	38.4

Table 3: Linear Probing: CLIP-

\mathcal{C}

(ours) is very competitive with CLIP in linear probe experiments on CC12M and RedCaps. Our method outperforms CLIP on a majority of benchmarks when using CC3M. Additionally, on our largest downstream dataset, ImageNet, CLIP-

\mathcal{C}

beats CLIP in all settings except when pretraining on CC12M. All CLIP-

\mathcal{C}

models here are trained using a sampling probability

\rho=0.3

PT Dataset	Method	Food-101	CIFAR-10	CIFAR-100	Caltech-101	Pets	DTD	Country211	Sun397	STL-10	RESISC45	EuroSAT	ImageNet
Vision Encoder: ViT-S/16
CC3M	CLIP	63.7	84.9	65.7	79.6	69.6	60.9	12.3	63.5	91.6	88.5	95.8	55.0
CC3M	CLIP- $\mathcal{C}$	64.7	85.3	66.6	81.2	69.3	61.8	12.7	64.6	92.9	88.1	95.5	56.8
Vision Encoder: ViT-B/16
CC3M	CLIP	66.6	85.7	67.1	79.0	71.9	59.1	12.6	63.9	91.8	89.4	96.3	58.4
CC3M	CLIP- $\mathcal{C}$	67.6	86.2	67.1	81.0	72.4	62.2	13.7	66.1	93.2	90.0	95.9	59.5
CC12M	CLIP	79.4	91.7	74.9	88.8	83.4	67.4	16.6	72.3	95.2	91.6	97.0	68.6
CC12M	CLIP- $\mathcal{C}$	79.7	91.2	75.0	88.7	83.3	68.6	17.0	72.8	95.7	91.7	96.7	68.3
RedCaps	CLIP	88.7	91.4	73.5	88.3	90.2	69.1	15.7	68.6	96.8	91.7	95.4	70.4
RedCaps	CLIP- $\mathcal{C}$	88.9	90.8	74.1	88.6	89.9	69.7	16.0	69.7	96.9	92.3	97.0	70.7

5.3 Zero-shot Cross-Modal Retrieval

In addition to the zero-shot transfer results as detailed in Sec. 5.2, we also provide analysis of the performance of CLIP- $\mathcal{C}$ versus CLIP on zero-shot cross-modal retrieval in Tab. 2. For these evaluations, we use MS-COCO [33] and Flickr30k [53] as the downstream benchmarks. As in the zero-shot transfer setting, CLIP- $\mathcal{C}$ yields significant improvements over the baseline model on both MS-COCO and Flickr30k across different pretraining datasets and model sizes. For example, when using CC3M as the pretraining dataset, our method outperforms CLIP by over $5\%$ absolute top-1 retrieval accuracy in both image-to-text and text-to-image retrieval. The enhancement on MS-COCO is $4\%$ on image-to-text and $3\%$ on text-to-image retrievals. For both CLIP and our method, we noticed low retrieval results when pretraining on RedCaps, which we believe is related to the data distribution. We leave that analysis out for later works.

5.4 Linear Probe Evaluations

Having verified the efficacy of our method using joint-embedding features in the zero-shot settings, we conduct several linear-probing evaluations in Tab. 3 to test the quality of the learned image features. In these experiments, a randomly initialized linear layer is added on top of the pretrained image encoder $f_{I}$ which is frozen. The text encoder $f_{T}$ , along with linear projections $g_{I}$ and $g_{T}$ are discarded. We train the linear layer for 50 epochs using a stochastic gradient descent optimizer with a weight decay of $0$ and a momentum of $0.9$ . The linear probe learning rate and mini-batch size for each downstream dataset are provided in Tab. 9 in the Supplemental.

Using our proposed compositions, CLIP- $\mathcal{C}$ surpasses CLIP in several linear probe experiments when the pretraining dataset is relatively small, e.g., CC3M, indicating that our method also learns more discriminative image features than CLIP in that regime. These superior generalization capabilities come as a result of exposing the image encoder to a more diverse array of images through the compositions. We stress again that these gains are obtained using the same pretraining datasets (without any external augmentations), computational costs, and number of parameters as the CLIP baseline. The linear probing results are competitive on all downstream benchmarks when using larger pretraining datasets.

6 Ablations

We ablate the various components of our framework including (1) providing more training resources to the CLIP model, (2) the sampling probability $\rho$ , (3) semantic versus stylistic compositions, (4) the impact of stochasticity in drawing the second example, and (5) the composition function used for the images. We cover further ablations in Sec. C of the Supplemental detailing different mechanisms of pairing the examples, and other ways of merging captions including large language model rewrites.

These ablation experiments underscore the importance of using semantically diverse examples in compositions (Sec. 6.3). They also reveal that while incorporating a proportion of CLIP- $\mathcal{C}$ examples in the mini-batch contributes positively to performance, exclusively using such compositions during training detracts from downstream transfer capabilities (Sec. 6.2). Finally, the results demonstrate the necessity of generating compound examples dynamically during training rather than relying on a static set of pre-generated instances (Sec. 6.4). Collectively, these insights affirm the effectiveness of the design principles underpinning our method.

Due to computational constraints, all ablation experiments are conducted using CC3M with ViT-S/16 as the image encoder. Additionally, we present only the zero-shot results of CIFAR-10, CIFAR-100, and ImageNet for the ablation experiments. Also, we are unable to conduct multiple runs of each experiment because of the number and scale of our ablations. As a result, we execute most of our experiments once using a shared fixed random seed.

Table 4: CLIP-

\mathcal{C}

beats a CLIP despite using half the batch size of the CLIP model.

Method	Batch-Size	CIFAR-10	CIFAR-100	ImageNet
CLIP	$2,048$	56.1	22.7	18.5
CLIP- $\mathcal{C}$ (Ours)	$1,024$	67.7	31.1	20.1

Table 5: Both CLIP and CLIP-

\mathcal{C}

are consistent across the three initializations.

Method	CIFAR-10	CIFAR-100	ImageNet
CLIP	$57.8\pm 2.36$	$24.6\pm 1.62$	$18.6\pm 0.24$
CLIP- $\mathcal{C}$	$64.7\pm 2.04$	$27.6\pm 0.53$	$20.4\pm 0.35$

To ablate the impact of this choice, we train three models each for CLIP and CLIP- $\mathcal{C}$ on CC3M with three different random seeds. The results in Tab. 5 indicate that zero-shot performances are consistent across different random initializations.

6.1 Is CLIP- $\mathcal{C}$ a CLIP Model Exposed to More Data?

It could be argued that our method sees a lot more examples due to the compositions we employ, and that may be the reason for the observed improved performances. In Tab. 4, we show that a CLIP- $\mathcal{C}$ model that uses a batch size of $1,024$ examples outperforms the equivalent CLIP model trained with a batch-size of $2,048$ by 1.6% on ImageNet, strongly indicating that our method is different from —and more impactful than— a technique to increase the CLIP batch size. Similarly, we show in Fig. 5 that training CLIP for $(1+\rho)$ times the number of epochs for the CLIP- $\mathcal{C}$ model does not close the performance gap (compare CLIP-52 and CLIP- $\mathcal{C}$ -40 epochs in Fig. 5). Indeed, the results in Fig. 5 highlight the strong regularization effect of CLIP- $\mathcal{C}$ as its superiority over CLIP emerges towards the later stages of pretraining. CLIP- $\mathcal{C}$ becomes even more superior to CLIP as training duration increases, extending the improvement from $+2\%$ zero-short accuracy on ImageNet when both models are trained for 40 epochs to over $+3\%$ when both are trained for 52 epochs. All these empirical results point to concrete beneficial qualities of CLIP- $\mathcal{C}$ as discussed in Sec. 5.1 and not from any implicit amplified exposure to data.

6.2 Sampling Probability $\rho$

The probability at which we create a composite sample as opposed to the original image-caption pair is an important parameter in our method which determines the percentage of the mini-batch that are compound instances. When $\rho=0$ , our method is identical to CLIP as no composition is performed. On the other extreme, when $\rho=1$ , all the examples in each mini-batch are instances of our composition method. As shown in Fig. 5, using a small non-zero sampling rate is more effective than CLIP. However, the performance deteriorates when more than fifty percent of the mini-batch are these compound image-text pairs. These results indicate that maintaining a reasonable percentage of the original examples is necessary likely because streamlined non-contradictory learning signal is significantly reduced when a majority of the batch are compositions. Also, since downstream evaluations do not involve such compositions, some exposure to examples with uniform semantic content during pretraining is important for effective transfer.

6.3 Why Semantic Compositions?

We call CLIP- $\mathcal{C}$ compositions semantic because the new instances are not just stylistically different from the constituent original examples, they are also semantically different. Thus, it is fair to question whether or not this semantic differentiation is important in producing the observed favorable results over CLIP. After all, purely stylistic augmentations that use content from the same examples also increase data diversity and could yield the same outcomes as our semantic compositions. We investigate this prospect in this section. We train a model using two augmentations of the same example instead of two distinct examples as outlined in Sec. 3.2. On the image side, two random crops of the image are taken simulating two instances. For the text, we employ “Easy Data Augmentation (EDA)” [48] to generate a caption for the second crop while the first crop uses the original caption. These two stylistically generated examples are then combined using CLIP- $\mathcal{C}$ .

Table 6: Semantic Compositions: Using semantically distinct examples is better than stylistic augmentations.

Method	CIFAR-10	CIFAR-100	ImageNet
CLIP	56.1	22.7	18.5
Stylistic	53.7	25.0	19.0
Semantic (Ours)	66.4	26.9	20.5

In Tab. 6, it is evident that such stylistic augmentations are sub-optimal compared to the semantic generations we employ in CLIP- $\mathcal{C}$ . On ImageNet, the CLIP- $\mathcal{C}$ model achieves a $1.5\%$ absolute top-1 accuracy than the stylistic augmentations model. This suggests that the content of the new instances is important as the model prefers the use of distinct examples in the composition. We also note that just increasing the diversity of examples is helpful as the stylistic augmentations method yields a $0.5$ % zero-shot accuracy gain over CLIP on ImageNet.

6.4 Impact of Stochasticity During Sampling

Whenever CLIP- $\mathcal{C}$ composition is activated, the second example is usually chosen randomly from the dataset. This allows for every image-caption pair to be paired with any other image-caption pair in the dataset. Moreover, the pairings differ from one epoch to another, thus uncovering novel combinations of examples throughout pretraining.

Table 7: Impact of Stochasticity: Dynamic assignments is more effective than fixed pairings.

Method	CIFAR-10	CIFAR-100	ImageNet
CLIP	56.1	22.7	18.5
Fixed	60.4	28.0	19.7
Dynamic (Ours)	66.4	26.9	20.5

We examine the impact of this dynamic nature of CLIP- $\mathcal{C}$ versus using fixed pairs of examples. To do this, for every example $x$ , we allocate only one other example $x^{\prime}$ that is fixed throughout training. Then, whenever $x$ is involved in a CLIP- $\mathcal{C}$ composition, $x^{\prime}$ is used. The results in Tab. 7 suggest that dynamic compositions lead to better downstream results than fixed compositions. This makes intuitive sense because in the dynamic case, if a particular composition is unhelpful, there is a possibility of changing it in subsequent epochs. This possibility does not exist when the combinations are fixed. in Sec. C.2 of the Supplemental, we also investigate scenarios whereby we combine examples whose captions are either close or far apart in the feature space of a sentence embedding model [47].

6.5 Image Composition Function

In this section, we compare our image mixing method with established systems such as CutMix [56] and MixUP [58]. When activated, MixUP executes a weighted pixel-wise summation of the two images, $\hat{x}_{I}^{(i)}=\omega\cdot x_{I}^{(i)}+(1-\omega)\cdot x_{I}^{(j)}$ with the weighting factor, $\omega$ sampled from the beta distribution $\omega\sim\beta(1,1)$ . CutMix on the other hand takes a random crop from one of the images and pastes it at the same spatial location on the other image. The crop’s dimensions are scaled by the value $\alpha=\sqrt{1-\omega}$ , $\omega\sim\beta(1,1)$ . That is, $H_{\text{cut}}=\alpha\cdot H$ , $W_{\text{cut}}=\alpha\cdot W$ where $H$ and $W$ are the height and width of the image respectively.

Table 8: Image Composition: Our strategy outperforms CutMix and MixUP.

Function	CIFAR-10	CIFAR-100	ImageNet
MixUP [58]	50.8	22.3	20.2
CutMix [56]	54.2	26.9	20.4
CLIP- $\mathcal{C}$ (Ours)	66.4	26.9	20.5

Unlike MixUP, our method as depicted in Fig. 1 preserves the integrity of each crop, and does not paste parts of one image on the other as in CutMix. Additionally, using the center-half crop of each image guarantees that substantial portions of both images are represented in the output image. We believe these characteristics of our method are important as demonstrated by its superior zero-shot results over MixUP and CutMix in Tab. 8.

7 Conclusion

We have demonstrated in this study that fast and simple compositions of different image-caption pairs can significantly enhance the effectiveness of language-supervised visual representation learning models. This approach proves particularly beneficial when pretraining on smaller datasets. Our comprehensive analysis shows that CLIP- $\mathcal{C}$ , our proposed model, delivers substantial improvements in zero-shot learning tasks over the baseline CLIP model and performs robustly in linear evaluation settings. Our ablation studies provide crucial insights, emphasizing that the observed performance improvements stem not from a mere increase in data augmentation but from the strategic use of semantically distinct examples in compositions. We anticipate that these findings will encourage further exploration into novel and efficient uses of small-scale datasets for vision-language pretraining, especially in settings where it is difficult to curate massive amounts of paired data (e.g., medical and satellite images).

References

[1] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 1877–1901 (2020), https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[2] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 9912–9924 (2020), https://proceedings.neurips.cc/paper_files/paper/2020/file/70feb62b69f16e0238f741fab228fec2-Paper.pdf
[3] Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: CVPR (2021)
[4] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 1597–1607. PMLR (13–18 Jul 2020), https://proceedings.mlr.press/v119/chen20j.html
[5] Chen, X., He, K.: Exploring simple siamese representation learning. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15745–15753 (2021). https://doi.org/10.1109/CVPR46437.2021.01549
[6] Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (Oct 2017). https://doi.org/10.1109/jproc.2017.2675998, http://dx.doi.org/10.1109/JPROC.2017.2675998
[7] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2014)
[8] Coates, A., Ng, A., Lee, H.: An Analysis of Single Layer Networks in Unsupervised Feature Learning. In: AISTATS (2011)
[9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
[10] Desai, K., Kaul, G., Aysola, Z.T., Johnson, J.: Redcaps: Web-curated image-text data created by the people, for the people. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1) (2021), https://openreview.net/forum?id=VjJxBi1p9zh
[11] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics (Jun 2019). https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
[12] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=YicbFdNTTy
[13] Fan, L., Krishnan, D., Isola, P., Katabi, D., Tian, Y.: Improving clip training with language rewrites. arXiv:2305.20088 (2023)
[14] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. CVPR Workshop (2004)
[15] Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M.A., Mikolov, T.: Devise: A deep visual-semantic embedding model. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems. vol. 26 (2013), https://proceedings.neurips.cc/paper_files/paper/2013/file/7cce53cf90577442771720a370c3c723-Paper.pdf
[16] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9726–9735 (2020). https://doi.org/10.1109/CVPR42600.2020.00975
[17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
[18] Helber, P., Bischke, B., Dengel, A., Borth, D.: Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification (2017)
[19] Hong, T., Guo, X., Ma, J.: Itmix: Image-text mix augmentation for transferring clip to image classification. In: 2022 16th IEEE International Conference on Signal Processing (ICSP). vol. 1, pp. 129–133 (2022). https://doi.org/10.1109/ICSP56322.2022.9965292
[20] Hu Xu, S.X., Tan, X.E., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying clip data (2023)
[21] Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: Openclip (Jul 2021). https://doi.org/10.5281/zenodo.5143773, https://doi.org/10.5281/zenodo.5143773
[22] Jia, C., Yang, Y., Xia, Y., Chen, Y., Parekh, Z., Pham, H., Le, Q.V., Sung, Y., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021. Proceedings of Machine Learning Research, vol. 139, pp. 4904–4916 (2021), http://proceedings.mlr.press/v139/jia21b.html
[23] Jiang, K., He, X., Xu, R., Wang, X.E.: Comclip: Training-free compositional image and text matching (2023)
[24] Joulin, A., van der Maaten, L., Jabri, A., Vasilache, N.: Learning visual features from large weakly supervised data. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. pp. 67–84. Springer International Publishing (2016)
[25] Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep., University of Toronto (2009)
[26] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Advances in Neural Information Processing Systems. vol. 25 (2012), https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
[27] Kuo, W., Piergiovanni, A., Kim, D., Luo, X., Caine, B., Li, W., Ogale, A., Zhou, L., Dai, A., Chen, Z., Cui, C., Angelova, A.: Mammut: A simple vision-encoder text-decoder architecture for multimodal tasks. Transactions on Machine Learning Research (2023), https://arxiv.longhoe.net/abs/2303.16839
[28] Lai, Z., Zhang, H., Wu, W., Bai, H., Timofeev, A., Du, X., Gan, Z., Shan, J., Chuah, C.N., Yang, Y., Cao, M.: From scarcity to efficiency: Improving clip training via visual-enriched captions (2023)
[29] Lazaridou, A., Kuncoro, A., Gribovskaya, E., Agrawal, D., Liska, A., Terzi, T., Gimenez, M., de Masson d'Autume, C., Kocisky, T., Ruder, S., Yogatama, D., Cao, K., Young, S., Blunsom, P.: Mind the gap: Assessing temporal generalization in neural language models. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 29348–29363 (2021), https://proceedings.neurips.cc/paper_files/paper/2021/file/f5bf0ba0a17ef18f9607774722f5698c-Paper.pdf
[30] Li, A., Jabri, A., Joulin, A., van der Maaten, L.: Learning visual n-grams from web data. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 4193–4202. IEEE Computer Society (2017). https://doi.org/10.1109/ICCV.2017.449, https://doi.ieeecomputersociety.org/10.1109/ICCV.2017.449
[31] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrap** language-image pre-training with frozen image encoders and large language models. In: ICML (2023)
[32] Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., Yan, J.: Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=zq1iJkNk3uN
[33] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 740–755 (2014)
[34] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICML (2019), https://openreview.net/forum?id=Bkg6RiCqY7
[35] Mu, N., Kirillov, A., Wagner, D., Xie, S.: Slip: Self-supervision meets language-image pre-training. arXiv:2112.12750 (2021)
[36] Naeem, M.F., Xian, Y., Zhai, X., Hoyer, L., Gool, L.V., Tombari, F.: Silc: Improving vision language pretraining with self-distillation (2023)
[37] van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding (2019)
[38] Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
[39] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035 (2019), http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[40] Quattoni, A., Collins, M., Darrell, T.: Learning visual representations using images with captions. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–8 (2007). https://doi.org/10.1109/CVPR.2007.383173
[41] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021)
[42] Radford, A., Narasimhan, K.: Improving language understanding by generative pre-training (2018)
[43] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019)
[44] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565. Association for Computational Linguistics (Jul 2018). https://doi.org/10.18653/v1/P18-1238, https://aclanthology.org/P18-1238
[45] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers &; distillation through attention. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 10347–10357. PMLR (18–24 Jul 2021), https://proceedings.mlr.press/v139/touvron21a.html
[46] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.S., Lachaux, M.A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X.E., Tang, B., Taylor, R., Williams, A., Kuan, J.X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., Scialom, T.: Llama 2: Open foundation and fine-tuned chat models (2023)
[47] Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., Zhou, M.: Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers (2020)
[48] Wei, J., Zou, K.: EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 6382–6388. Association for Computational Linguistics (Nov 2019). https://doi.org/10.18653/v1/D19-1670, https://aclanthology.org/D19-1670
[49] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.: Transformers: State-of-the-art natural language processing. In: Liu, Q., Schlangen, D. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. Association for Computational Linguistics (Oct 2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6, https://aclanthology.org/2020.emnlp-demos.6
[50] Wu, B., Cheng, R., Zhang, P., Gao, T., Gonzalez, J.E., Vajda, P.: Data efficient language-supervised zero-shot recognition with optimal transport distillation. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=G89-1yZLFHk
[51] Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 3485–3492 (June 2010). https://doi.org/10.1109/CVPR.2010.5539970
[52] Xu, H., Ghosh, G., Huang, P.Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., Feichtenhofer, C.: VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 6787–6800. Association for Computational Linguistics (Nov 2021). https://doi.org/10.18653/v1/2021.emnlp-main.544, https://aclanthology.org/2021.emnlp-main.544
[53] Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, 67–78 (2014). https://doi.org/10.1162/tacl_a_00166, https://aclanthology.org/Q14-1006
[54] Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research (2022), https://openreview.net/forum?id=Ee277P3AYC
[55] Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why vision-language models behave like bags-of-words, and what to do about it? In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=KRLUvxh8uaX
[56] Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: Regularization strategy to train strong classifiers with localizable features. In: ICCV (2019)
[57] Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18123–18133 (June 2022)
[58] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=r1Ddp1-Rb
[59] Zhao, Y., Misra, I., Krähenbühl, P., Girdhar, R.: Learning video representations from large language models. In: arXiv:2212.04501 (2022)

\thetitle

Supplementary Material

Table 9: Linear probe evaluation hyper-parameters: We use stochastic gradient descent optimizer with a decay of 0 and a momentum of 0.9 for all linear probe experiments. All linear probe experiments are tuned for 50 epochs.

Hyper-parameter	Food-101	CIFAR-10	CIFAR-100	Caltech-101	Pets	DTD	Country211	Sun397	STL-10	RESISC45	EuroSAT	ImageNet
Batch Size	16	16	64	64	16	16	64	16	16	16	16	1024
Learning Rate	0.1	0.1	0.1	0.05	0.1	0.1	0.05	0.05	0.1	0.1	0.1	0.1

A Implementation Details

We discuss other hyper-parameters in our setup in addition to those described in Sec. 4. During pretraining, AdamW [34] with $\beta_{1}=0.9$ and $\beta_{2}=0.98$ is used as the optimizer. We use a Stochastic Gradient Descent optimizer with a momentum of 0.9 for linear probing. The learning rate schedule is always set to Cosine Annealing with a warm-up period of $5$ epochs. In all pretraining settings, we train for 40 epochs and use the best checkpoint determined by the top-1 zero-shot accuracy on ImageNet [9] for downstream evaluations. In linear probing experiments, we use the final results after training for 50 epochs on the respective dataset.

In pretraining, the batch-size and learning rate are set to $2,048$ and $0.003$ , respectively. For linear probe experiments, we use a batch-size of 1024 and a learning rate of $0.1$ when evaluating on ImageNet. For all other downstream datasets, we select the batch-size and learning rate using the ViT-B/16 CC3M pretrained model as follows. First, we choose the best batch-size from the set $\{16,64,256\}$ before selecting a learning rate from the set $\{0.05,0.1,0.2,0.3\}$ using the chosen batch-size. After selecting a batch-size and learning rate using the CC3M ViT/B-16 model, we then apply those parameters to all other models without any further hyperparameter searches. Table 9 contains details about the combinations of batch-sizes and base learning rates we use for linear probing on all downstream datasets. Every linear experiment run is executed on a single NVIDIA RTX A6000 GPU.

The image resolution is always set to $224\times 224$ . Given an image of arbitrary size, we invoke the “RandomResizedCrop” from PyTorch [39] with bicubic interpolation to generate an appropriately sized crop for training. The “scale” argument of “RandomResizedCrop” is set to $(0.6,1.0)$ during pretraining and $(0.08,1.0)$ during linear probing. In linear evaluations, the output crop is further flipped horizontally with a probability of $0.5$ . Following previous works, we normalize all images using the ImageNet mean and standard deviation values irrespective of the dataset. We do not use crop** in the zero-shot classification and retrieval settings — we simply resize the image to $224\times 224$ before normalization. In all our experiments, we used a fixed seed for the PyTorch and Numpy random number generators. We also enabled PyTorch’s CUDNN deterministic setting.

Table 10: Pretraining Datasets: Size is the number of image-text pairs in the dataset. Min, Avg, and Max are respectively the minimum, average, and maximum number of words in a caption obtained by splitting the string at spaces. CC12M tends to have the longest captions on average.

Dataset	Size	Min.	Avg.	Max.
CC3M [44]	2.85M	4	10.25	50
CC12M [3]	9.55M	1	17.76	242
RedCaps [10]	12.01M	1	9.50	70

Table 11: Downstream Datasets: Train and Test respectively denote the sizes of the training and evaluation sets. “Classes” is the number of categories while “Metric” is the evaluation metric. “Acc” represents top-1 accuracy over the entire dataset and “P/C Acc” represents the mean of per-category top-1 accuracy.

Dataset	Train	Test	Classes	Metric
Food-101	75,750	25,250	101	Acc
CIFAR-10	50,000	10,000	10	Acc
CIFAR-100	50,000	10,000	100	Acc
Caltech-101	3,060	6,084	102	P/C Acc
Pets	3,721	3,669	37	P/C Acc
DTD	1,880	1,880	47	Acc
Country211	31,650	21,100	211	Acc
Sun397	76, 127	21,750	397	Acc
STL-10	5,000	8,000	10	Acc
RESISC45	25,200	6,300	45	Acc
EuroSAT	21,600	5,400	10	Acc
ImageNet	1,281,167	50,000	1000	Acc

B Datasets

We cover basic characteristics of our pretraining datasets in Tab. 10 and the downstream benchmarks in Tab. 11. We downloaded the pretraining datasets images manually from the internet using the URLs provided in Sharma [44], Changpinyo et al. [3], and Desai et al. [6]. As a result, we could not retrieve the full original datasets as some of the links are now broken. Most of the downstream datasets were downloaded from Tensorflow datasets¹¹1https://www.tensorflow.org/datasets/catalog/overview and Torchvision datasets²²2https://pytorch.org/vision/stable/datasets.html.

C Additional Ablations

Table 12: Modality Involved in Composition: Applying semantic compositions on both modalities is the most consistently effective method across different downstream datasets and tasks.

Modality	CIFAR-10	CIFAR-100	ImageNet
Text only	55.2	27.0	21.3
Images only	55.9	23.1	19.4
Text & Images	66.4	26.9	20.5

C.1 Impact of Modality Used in Composition

Since our inputs are of different modalities, visual and textual, it is important to examine whether compositions in each of these modalities produce similar effects. To that end, in Tab. 12, we conduct analysis where our method is applied on (1) only the captions, (2) only the images, and (3) both the captions and images. Of these three variations, executing the compositions on both the captions and images is the most effective, probably due to the symmetry of transforming both modalities. The second most effective is the captions-only approach. Option (2) is the least effective method likely because the images are naturally augmented (random crop**) in the baseline method whereas the captions are fixed. These observations suggest that our method is more helpful in learning representations of the texts relative to those of the images. They also help elucidate why we obtain much bigger improvements over the baseline in zero-shot settings compared to linear evaluations.

Table 13: Zero-shot Results on Ways of Pairing Examples: For each row, we pretrain a ViT-S/16 model on the 100k CC3M subset explained in Sec. C.2. L-C-S stands for Largest Caption-Similarity while S-C-S represents Smallest Caption-Similarity. Random is the case where we perform random assignments at the beginning and fix them for the rest of the training. We pretrain for 40 epochs using a reduced learning rate of

3\times 10^{-4}

, a batch-size of

256

, and a warm-up period to

20

epochs.

Method	Food-101	CIFAR-10	CIFAR-100	Caltech-101	Pets	DTD	Country211	Sun397	STL-10	RESISC45	EuroSAT	ImageNet
CLIP	1.55	18.1	4.93	8.73	2.64	2.98	0.36	0.6	28.7	4.97	20.4	1.43
Random	2.22	16.7	5.38	7.86	2.95	2.18	0.56	0.69	25.6	8.46	11.4	1.52
L-C-S	2.63	20.4	5.66	9.36	2.64	2.71	0.38	1.12	27.5	8.31	19.2	1.56
S-C-S	2.17	23.6	5.87	8.46	1.93	3.03	0.52	0.75	27.4	8.69	12.7	1.49
CLIP- $\mathcal{C}$	1.83	21.9	5.68	8.15	2.74	3.19	0.37	0.84	25.3	4.90	12.3	1.61

C.2 Other Ways of Pairing Examples

In CLIP- $\mathcal{C}$ , the second example is chosen randomly from the dataset whenever the composition is active. In this section, we explore other ways of choosing that second image-caption pair. The configurations studied here include selecting a second instance whose caption is (1) the closest to, or (2) the farthest from the first caption of the anchor, with distance measured by the pair-wise cosine similarity between features obtained from a sentence embedding model [47]. We also consider the scenario where examples are paired randomly. However, the pairings are fixed once generated at the beginning of training. (CLIP- $\mathcal{C}$ uses random assignments that change in every epoch.)

Since computing the embeddings of all captions as well as the pair-wise cosine similarities of all examples is extraordinarily expensive computationally, we randomly selected a subset of $100,000$ examples from CC3M for these studies. We then computed the normalized features for all captions using the All-MiniLM-L6-v2³³3https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 pretrained model from HuggingFace [49]. Afterward, we generated the $100k\times 100k$ pair-wise cosine similarities for this experiment. We also run CLIP, our method as discussed in the paper, and the fixed random assigned described above on this subset.

In zero-shot evaluations in this setting, we do not observe any significant differences between pairing examples whose captions are close or far in the embedding space. More importantly, CLIP- $\mathcal{C}$ beats all other setups on a plurality of the downstream benchmarks including on ImageNet. This performance, along with its simplicity, efficiency, and scalability (since there is no need for pair-wise comparisons of examples) all further justify the adoption of dynamic random sampling in our method.

Table 14: Zero-shot Results on Text Composition Methods: For each row, we pretrain a ViT-S/16 model on the 100k CC3M subset detailed in Sec. C.3. CLIP-

\mathcal{C}_{\text{w/ AND}}

(default) represents the scenario where the captions are concatenated with “and” as a conjunction. CLIP-

\mathcal{C}_{\text{w/o AND}}

denotes the case where we omit the “and”. Finally, CLIP-

\mathcal{C}_{\text{LLM}}

is the language model rewriting method. We pretrain for 40 epochs using a reduced learning rate of

3\times 10^{-4}

, a batch-size of

256

, and a warm-up period to

20

epochs.

Method	Food-101	CIFAR-10	CIFAR-100	Caltech-101	Pets	DTD	Country211	Sun397	STL-10	RESISC45	EuroSAT	ImageNet
CLIP- $\mathcal{C}_{\text{w/ AND}}$	1.92	15.6	5.58	8.49	4.35	3.24	0.50	1.15	31.0	8.97	19.8	1.70
CLIP- $\mathcal{C}_{\text{w/o AND}}$	2.34	19.48	5.33	7.67	2.34	4.04	0.40	1.29	31.6	6.49	18.5	1.92
CLIP- $\mathcal{C}_{\text{LLM}}$	1.94	15.6	5.68	8.22	3.81	3.19	0.45	1.13	26.7	6.17	9.05	1.83

C.3 Different Methods of Merging Captions

As in Sec. 6.5 where we looked at different ways of composing the merged image, we study other ways of merging the captions in this section. In CLIP- $\mathcal{C}$ , we adopt the very simple and highly flexible process of concatenating the two captions using “and” as a conjunction. Here, we compare that method against omitting the conjunction or rewriting the merged caption with a large language model (LLM).

Similarly to the ablation in Sec. C.2, we randomly selected $100,000$ examples from CC3M for this ablation because of the costs involved in the LLM rewriting method. Each example in the subset is associated with a second example (drawn from the full dataset) at random. We generate these associations once and use them for all ablation methods in this section: when (1) concatenating with “and”, (2) concatenating without “and”, and (3) rewriting the output of (1) with an LLM.

We employ the open-source 70B parameter model LLama2-chat [46] for the LLM rewriting method. The temperature and top-p arguments of the model are set to $0.8$ and $0.95$ , respectively. The maximum sequence length is set to 256. To rewrite a given caption, we (a) instruct the model to act as a text completion assistant, and (b) prompt the model as: rephrase the sentence {merged caption} where merged caption is from option (1) above. We use the first generated sentence as the new caption. The LLM-based captions are generated offline and saved to file before training begins. We emphasize that only $100,000$ rewrites are undertaken, one for each entry in the sampled subset. Some examples of these rewritten captions are provided in Fig. 6.

On zero-shot classification in this limited setting, our simple concatenation method with and without “and” as a conjunction, each obtained the highest accuracy on $6$ out of $14$ downstream datasets while the LLM rewrite option achieved the highest performance on the remaining $2$ datasets as shown in Tab. 14. These results show that besides being highly efficient and flexible, our method is also comparable with the more resource-intensive large language model rewriting method. Additionally, while the LLM rewrite method may hallucinate non-grounded captions (See Fig. 6), our method is devoid of such hallucinations. We use concatenation with the conjunction as the default caption composition method because of its coherence.

Semantic Compositions Enhance Vision-Language Contrastive Learning

Abstract

Keywords:

1 Introduction

2 Related Works

3 Method

3.1 Background

3.2 CLIP-𝒞𝒞\mathcal{C}caligraphic_C

4 Experimental Setup

5 Results

5.1 Why is CLIP-𝒞𝒞\mathcal{C}caligraphic_C an Effective Method?

5.2 Zero-shot Image Classification

5.3 Zero-shot Cross-Modal Retrieval

5.4 Linear Probe Evaluations

6 Ablations

6.1 Is CLIP-𝒞𝒞\mathcal{C}caligraphic_C a CLIP Model Exposed to More Data?

6.2 Sampling Probability ρ𝜌\rhoitalic_ρ

6.3 Why Semantic Compositions?

6.4 Impact of Stochasticity During Sampling

6.5 Image Composition Function

7 Conclusion

References

A Implementation Details

B Datasets

C Additional Ablations

C.1 Impact of Modality Used in Composition

C.2 Other Ways of Pairing Examples

C.3 Different Methods of Merging Captions

3.2 CLIP- $\mathcal{C}$

5.1 Why is CLIP- $\mathcal{C}$ an Effective Method?

6.1 Is CLIP- $\mathcal{C}$ a CLIP Model Exposed to More Data?

6.2 Sampling Probability $\rho$