Centered Masking for Language-Image Pre-Training

Mingliang Liang, Martha Larson
Radboud University
Nijmegen
{m.liang, m.larson}@cs.ru.nl
Abstract

We introduce Gaussian masking for Language-Image Pre-Training (GLIP), a novel, straightforward, and effective technique for masking image patches during pre-training of a vision-language model. GLIP builds on Fast Language-Image Pre-Training (FLIP), which randomly masks image patches while training a CLIP model. GLIP replaces random masking with centered masking, which uses a Gaussian distribution and is inspired by the importance of image patches at the center of the image. GLIP retains the same computational savings as FLIP, while improving performance across a range of downstream datasets and tasks, as demonstrated by our experimental results. We show the benefits of GLIP to be easy to obtain, requiring no delicate tuning of the Gaussian, and also applicable to datasets containing images without an obvious center focus.

Keywords: Vision-Language Model · Multimodal Data · Gaussian Distribution Masking

1 Introduction

The rise of Vision-Language Models trained on a large number of image-text pairs has been led by Contrastive Language-Image Pre-training (CLIP) [1]. CLIP delivers high-quality visual representations, with strong performance on downstream tasks and impressive transferability. However, the downside of CLIP is the large amount of computational resources that it requires to train. According to [1], pre-training CLIP on 400 million image-text pairs over 32 epochs required thousands of GPU days. The high computational cost of training Vision-Language Models limits further increases in training data size.

Recent research has proposed to accelerate the training of CLIP, with surprising success being achieved by Fast Language-Image Pre-Training (FLIP) [2]. FLIP randomly masks image patches during CLIP training. According to [2], if 50% or 75% of the patches in each training image are discarded by masking, FLIP can reduce the computation carried out during training by a factor of 2–4. Interestingly, masking 50% of an image does not compromise performance. In fact, it makes it possible to increase the global batch size for a fixed amount of processing power. More samples per batch are known to be beneficial for contrastive learning [3, 4, 2, 1].

In this paper, we further improve on the performance of FLIP, while retaining the same savings in computation. Specifically, we conjecture that patches closer to the center of an image are more important when training a Vision-Language Model than patches nearer the periphery. Based on this conjecture, we propose an approach that replaces random masking of FLIP with centered masking. Our approach, called Gaussian masking for Language-Image Pre-Training (GLIP), uses a Gaussian distribution centered at the middle of the image in order to select patches during training. Figure 1 contrasts the random masking of FLIP with the centered masking of GLIP. It can be seen that the main subject of the image is better preserved in the retained patches when masking prioritizes patches from the center and disregards more patches at the periphery. The inspiration for the centered masking of GLIP comes from the behavior of human photographers, who often place the main subject of the photo in the center. However, as will be seen from our experimental results, GLIP also delivers improvements for image data that does not have a strong photographic center focus, demonstrating GLIP’s promise for generality. Note that we use masking in the same way that FLIP does, i.e., not for the purpose of reconstruction, which was shown by [2] to actually hurt performance.

Figure 1: Images from ImageNet-1K [5] (left side) and CC12M [6] (right side). A contrast can be seen between random masking (left mask) and centered masking (right mask). Centered masking captures more of the main subject of the image.

GLIP can be considered to be related to other approaches that do not use random masking. A key example is A-CLIP [7], which employs an online Exponential Moving Average (EMA) to generate masks derived from the attention weights associated with the [CLS] token of the visual encoder. The attentive masks attempt to include more image patches that are highly correlated with the semantics of the text description. GLIP aims to leverage the same effect, but instead of calculating attention, assumes that the most important part of the image is the center. In this paper, we do not compare GLIP with A-CLIP because even the efficient version of A-CLIP requires more computational resources than GLIP.

This paper makes the following contributions:

  • We introduce a centered masking approach for language-image pre-training called Gaussian Masking for Language-Image Pre-Training (GLIP), which enjoys the same reduction in computational resources as FLIP.

  • We provide extensive experiments that show the superiority of GLIP over FLIP. GLIP shows general performance improvements and also allows higher masking ratios.

  • We show that GLIP delivers an easy-to-obtain improvement across different datasets without requiring effort to tune the variance of the Gaussian.

Our code is available online (https://github.com/Anastasiais-ml/GLIP). In our experiments comparing FLIP and GLIP we train on a smaller dataset than was used in the original FLIP paper [2]. For this reason, the results of our paper can be easily reproduced using the resources typically available in either an academic or industry setting.

2 Related work

2.1 Vision-Language Models

Refining visual representations of images and developing foundation models to enhance subsequent tasks is a key objective in the field of computer vision. CLIP [1], and closely related ALIGN [8], learn visual representations using natural language supervision. The approach involves collecting billions of image-text pairs from the internet and pre-training the model on the data using contrastive learning techniques. This extensive pre-training on large-scale datasets substantially improves the transferability of the model to a variety of downstream tasks and datasets.

However, pre-training models on billion-scale datasets requires thousands of GPU days, which is prohibitively expensive. The Fast Language-Image Pre-training (FLIP) approach randomly masks a large proportion of the patches in each training image to improve the efficiency of pre-training. Remarkably, as mentioned above, FLIP not only outperforms models trained without masking but also reduces computational requirements by a factor of 2–4 by masking 50% to 75% of the image patches.

Attentive Mask CLIP (A-CLIP) refines the image patch masking strategy by utilizing correlation scores between image and text [7]. The correlation scores are computed using the attention weights to the [CLS] token of the visual encoder. By halving the image resolution of the Exponential Moving Average (EMA) encoder and masking 50% of the image patches of the image encoder, A-CLIP reduces pre-training time by 14% while also improving performance. Nevertheless, selecting external patches for processing still requires over 30% more training time compared to masking patches directly.

Masked autoencoders (MAE) use an encoder-decoder architecture to learn visual representations through self-supervision [3]. In this approach, the MAE randomly masks most of the patches in the input image during the encoding stage and then reconstructs these masked patches using the decoder. The pre-trained encoders developed through this method can be effectively used in downstream tasks, such as image classification. Although MAE appears to have originally inspired FLIP, the original FLIP paper [2] did not find reconstruction helpful, and GLIP also does not use reconstruction.

2.2 Relative Importance of Image Regions

There are several strands of literature motivating our move from random masking to centered masking because they build on the idea that not all parts of an image are equally important. The first that comes to mind might be the concept of model attention in deep learning, which is a mechanism that allows models to focus on specific parts of the input data that are more relevant for a given task [9, 10, 11, 12, 13]. GLIP exploits the same notion that underlies model attention but in a simple manner rooted in the assumption that the center of the image is most important.

The importance of the center of the image is exploited for data augmentation during training. Center cropping and random cropping are commonly used data augmentation techniques for image preprocessing in deep learning, especially in computer vision [14, 15, 16, 17] and Vision-Language Models [2, 1]. Center cropping assumes that the most relevant information is located in the center of the image, while random cropping helps to reduce overfitting by providing the model with slightly different views of the same image during training [14, 15, 16]. The center assumption applies to many datasets, especially those collected from the Internet, which contain images framed by photographers [6, 18, 19, 20]. MaskCLIP [21] and Masked Siamese Networks (MSN) [22] augment images by masking image patches in a Vision Transformer (ViT) [23], and then align the representations of the masked and unmasked images. This process improves the semantic quality of the image representations by ensuring that the semantic content of masked and unmasked images is coherent. In our study, we do not draw comparisons with these methods. In particular, we do not compare with MaskCLIP because it focuses on maintaining semantic consistency between masked and unmasked images to enhance image representation via self-supervised learning, rather than on reducing pre-training time.

2.3 Photographer behavior

When photographers take pictures, they have a strong tendency to place the subject of the image near the center of the image [24, 25, 26]. The placement of the subject or object within a photograph greatly influences its composition and aesthetic appeal [24, 25]. The Rule of Thirds and Center Composition are two widely accepted techniques for enhancing the visual appeal of an image [24]. The Rule of Thirds involves dividing the frame into nine equal sections with two horizontal and two vertical lines and positioning the subject at the intersections or along these lines to create balanced, visually pleasing images [24]. Center Composition focuses on placing the subject at the center of the picture, highlighting symmetry and prominence, a technique commonly used in portraiture and symmetrical scenes [25]. A commonality across these techniques is the avoidance of placing objects at the edges of the image. We refer to the tendency of a dataset to contain images with the main subject material near the center as “center focus”. In addition to being created by a photographer, “center focus” can also arise when someone crops an image after it has been taken so that the main subject appears close to the center.

Figure 2: Our GLIP architecture. Following CLIP [1] and FLIP [2], we use contrastive loss to pre-train our model. Different from FLIP, we mask image patches with a Gaussian distribution instead of random masking.

3 Method

The key idea of FLIP is to enhance computational efficiency and increase the batch size of contrastive objectives by applying masks to the input image in CLIP. GLIP differs by masking image patches according to the idea that the center of an image is more important. This technique substantially improves performance across various tasks with almost no additional computation, outperforming FLIP. Our image masking strategy is straightforward: we use probabilities from a Gaussian distribution to mask the image patches, as illustrated in Figure 2. Because the image plane is two-dimensional, we employ a bivariate Gaussian distribution for this purpose:

$$f(x,y)=\frac{1}{2\pi\sigma_{x}\sigma_{y}\sqrt{1-\rho^{2}}}\exp\left(-\frac{1}{2(1-\rho^{2})}\left[\frac{(x-\mu_{x})^{2}}{\sigma_{x}^{2}}+\frac{(y-\mu_{y})^{2}}{\sigma_{y}^{2}}-\frac{2\rho(x-\mu_{x})(y-\mu_{y})}{\sigma_{x}\sigma_{y}}\right]\right) \tag{1}$$

  • $f(x,y)$ is the probability density function of the bivariate Gaussian distribution.

  • $\mu_{x}$ and $\mu_{y}$ are the means of the two variables, $x$ and $y$, respectively.

  • $\sigma_{x}$ and $\sigma_{y}$ are the standard deviations of the two variables, $x$ and $y$, respectively.

  • $\rho$ is the correlation coefficient between $x$ and $y$.

Setting $\mu_{x}$ and $\mu_{y}$ to $0$ and $\rho$ to $0$ gives:

$$f(x,y)=\frac{1}{2\pi\sigma_{x}\sigma_{y}}\exp\left(-\frac{1}{2}\left[\frac{x^{2}}{\sigma_{x}^{2}}+\frac{y^{2}}{\sigma_{y}^{2}}\right]\right) \tag{2}$$

Finally, we use Equation 2 to calculate the probability with which each image patch is selected (retained) during masking, so that patches closer to the center are more likely to be kept. In this formula, $x$ and $y$ lie within the range $[-1,1]$, and the step size corresponds to the grid size of the image encoder. We set the center of the image as the coordinate origin, which is the center of the bivariate Gaussian distribution. Examples are shown in Figures 1 and 3: more patches are retained in the center of the image.
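The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the sampling rule (drawing the visible patches without replacement, with probability proportional to the density of Equation 2) is an assumption, since the text does not spell out how the density is turned into a mask.

```python
import numpy as np


def gaussian_patch_weights(grid_size, sigma_x=0.2, sigma_y=0.2):
    """Evaluate Equation 2 on a grid_size x grid_size lattice over [-1, 1]^2."""
    coords = np.linspace(-1.0, 1.0, grid_size)
    x, y = np.meshgrid(coords, coords, indexing="xy")
    density = np.exp(-0.5 * (x**2 / sigma_x**2 + y**2 / sigma_y**2))
    density /= 2.0 * np.pi * sigma_x * sigma_y
    return density.ravel()


def sample_visible_patches(grid_size, mask_ratio, sigma=0.2, rng=None):
    """Return the sorted indices of the patches that are kept for one image.

    Assumption: patches are drawn without replacement, with probability
    proportional to the Gaussian density at the patch position.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = gaussian_patch_weights(grid_size, sigma, sigma)
    probs = weights / weights.sum()
    num_patches = grid_size * grid_size
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    keep = rng.choice(num_patches, size=num_keep, replace=False, p=probs)
    return np.sort(keep)


# ViT-B/16 on a 224 x 224 input yields a 14 x 14 grid of patches.
keep = sample_visible_patches(grid_size=14, mask_ratio=0.5,
                              rng=np.random.default_rng(0))
print(len(keep))  # number of visible patches at a 50% mask ratio
```

Only the kept indices would be passed to the image encoder, which is where the computational saving comes from; the number of visible patches is the same as under random masking with the same ratio.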

Figure 3: Comparison of random and Gaussian masking strategies. Panels: (a) random masking, (b) $\sigma=0.1$, (c) $\sigma=0.2$, (d) $\sigma=0.8$. Image (a) demonstrates a random masking strategy with uniform masking probability. Images (b), (c), and (d) illustrate Gaussian masking with increasing standard deviations ($\sigma$), showcasing the effect of masking that is focused in the center and gradually spreads to the edges.

Building on the CLIP [1] method, we pre-train the model using the Noise-Contrastive Estimation (InfoNCE) loss with temperature parameter $\tau$ [27, 1], shown in Equation 3. This function works by bringing positive image-text pairs closer together and pushing negative image-text pairs farther apart [28]. This approach allows the encoder to learn and recognize similar semantics in the corresponding image-text pairs.

$$\mathcal{L}_{\text{InfoNCE}}=-\log\left(\frac{\exp(\text{sim}(I_{i},T_{i})/\tau)}{\sum_{j=1}^{N}\exp(\text{sim}(I_{i},T_{j})/\tau)}\right) \tag{3}$$

Here, $\text{sim}(I_{i},T_{j})$ is the similarity score between image $I_{i}$ and text $T_{j}$, and $N$ is the number of image-text pairs in the batch. Following FLIP [2], we also continue pre-training the model without masking for a small number of steps to reduce the distribution gap between training and testing caused by masking.
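As an illustration of Equation 3, the following NumPy sketch computes the image-to-text InfoNCE loss averaged over a batch. It assumes cosine similarity for sim and uses 0.07 as an illustrative temperature; the function name is hypothetical, and CLIP-style training code typically also adds the symmetric text-to-image term, which is omitted here to match the equation as written.

```python
import numpy as np


def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Mean image-to-text InfoNCE loss (Equation 3) over a batch of N pairs."""
    # Cosine similarity: L2-normalise, then take all pairwise dot products.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs (I_i, T_i) sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Perfectly aligned, mutually orthogonal pairs give a loss close to zero.
aligned = np.eye(4)
print(info_nce_loss(aligned, aligned))
```

When every text is equally similar to every image, the loss equals $\log N$, which is a convenient sanity check for an implementation.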

4 Experiments

Table 1: The details of pre-training and fine-tuning setup.
| Config | Pre-training value | Fine-tuning value |
|---|---|---|
| Optimizer | AdamW | AdamW |
| Learning rate | 1e-3 | 5e-6 |
| Weight decay | 0.1 | 0.05 |
| Optimizer momentum | $\beta_1, \beta_2 = 0.9, 0.999$ | $\beta_1, \beta_2 = 0.9, 0.98$ |
| Learning rate schedule | Cosine decay | Cosine decay |
| Warmup steps | 10k | 10% of total steps |
| Epochs | 30 | 1 |
| Numerical precision | Automatic mixed precision | Automatic mixed precision |
| Augmentation | RandomResizedCrop | RandomResizedCrop |

4.1 Implementation Details

Our implementation follows CLIP [1], OpenCLIP [29] and FLIP [2].

Dataset. We pre-trained our model on the CC12M [6] dataset, which includes 12 million image-text pairs. To observe the performance of the model when pre-trained on a different dataset, we further evaluate our method on CC3M [20]. These datasets were selected due to their diverse range of real-world concepts and scenes. Models such as SLIP, CLIP, and BLIP [30, 31, 7] also use these popular datasets for training. Because some URLs have expired, we were able to download approximately 2.72 million and 9.30 million image-text pairs for CC3M and CC12M, respectively. Notably, our model achieved baseline performance levels similar to those reported by SLIP and A-CLIP for both CC3M and CC12M [31, 7], demonstrating its effectiveness and robustness across diverse datasets.

Architecture. For our image encoder, we employ ViT-B/16 [23] with a patch size of 16, and we use a transformer-based model for the text encoder [32]. According to the results reported by FLIP, patch size is not an important contributor to performance. The maximum context length of the text encoder used in our study is 32 [2]. Consistent with the configurations in CLIP [1] and FLIP [2], the input image size is 224 × 224. In our principal experiments, we set $\sigma_{x}$ and $\sigma_{y}$ in Equation 2 to 0.20 as the default value, unless otherwise mentioned. Additionally, we explored the impact of varying $\sigma$ on the performance of GLIP.

Training and fine-tuning. Following FLIP [2], we pre-train the model for 30 epochs with patch masking and fine-tune the model for a small number of steps without patch masking. Pre-training and fine-tuning configurations are detailed in Table 1. We use 8 RTX A5000 GPUs with batch sizes of 160, 320, 640, and 1024 for mask ratios of 0%, 50%, 75%, and 90%, respectively.
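The relation between mask ratio and batch size can be made concrete with a short calculation. The sketch below counts the visible patches per image for a 224 × 224 input with 16 × 16 patches and multiplies by the batch sizes listed above. It ignores the [CLS] token and assumes the number of kept patches is rounded down, which matters only for the 90% setting; both are assumptions made for illustration.

```python
PATCH_SIZE = 16
IMAGE_SIZE = 224
patches_per_image = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 14 * 14 = 196

# (mask ratio in percent, batch size) pairs as reported in the text.
settings = [(0, 160), (50, 320), (75, 640), (90, 1024)]


def visible_patches(mask_percent):
    # Assumption: the number of kept patches is rounded down.
    return patches_per_image * (100 - mask_percent) // 100


budget = {}
for mask_percent, batch_size in settings:
    kept = visible_patches(mask_percent)
    budget[mask_percent] = (kept, batch_size * kept)
    print(f"{mask_percent:>2}% masked: {kept:>3} visible patches per image, "
          f"{batch_size * kept} per batch")
# 0%, 50% and 75% all give 31360 visible patches per batch; 90% gives 19456.
```

Under these assumptions the 0%, 50%, and 75% settings process the same number of image patches per batch, so doubling the batch size when halving the visible patches keeps the image-encoder workload per step roughly constant.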

4.2 Evaluation

First, we evaluate our approach on the ImageNet dataset for the zero-shot classification task, comparing its performance with CLIP and FLIP. Subsequently, we extend our evaluation to include various other datasets. We follow the prompt engineering of CLIP [1] and OpenCLIP [29]. For each class, we use the average embedding over a set of 80 prompt templates. Then, the cosine similarity between the class embeddings and the image embedding is calculated, and the top-1 class is taken as the predicted class for each image.
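The zero-shot procedure above can be sketched as follows. The embeddings are random stand-ins rather than outputs of a real text or image encoder, the function names are hypothetical, and normalising each template embedding before averaging is an assumption borrowed from common open-source CLIP evaluation code.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def build_class_embeddings(template_emb):
    """template_emb has shape (num_classes, num_templates, dim)."""
    # Normalise each template embedding, average per class, re-normalise.
    mean_emb = l2_normalize(template_emb).mean(axis=1)
    return l2_normalize(mean_emb)


def zero_shot_predict(image_emb, class_emb):
    """Return the top-1 class index for each image by cosine similarity."""
    similarity = l2_normalize(image_emb) @ class_emb.T
    return similarity.argmax(axis=1)


# Toy data: 3 classes, 80 templates per class, 8-dimensional embeddings.
rng = np.random.default_rng(0)
num_classes, num_templates, dim = 3, 80, 8
prototypes = np.eye(num_classes, dim)
template_emb = prototypes[:, None, :] + 0.01 * rng.standard_normal(
    (num_classes, num_templates, dim))
image_emb = prototypes + 0.01 * rng.standard_normal((num_classes, dim))

class_emb = build_class_embeddings(template_emb)
predictions = zero_shot_predict(image_emb, class_emb)
print(predictions)
```

Averaging over templates gives one text embedding per class, so classification reduces to a single matrix product between image and class embeddings.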

Additionally, we evaluate our approach across various tasks to demonstrate its advantages. These tasks include image-text retrieval on the MS COCO [33] and Flickr30K [34] datasets, as well as linear probing and fine-tuning on ImageNet-1K [5]. This comprehensive evaluation highlights the versatility and effectiveness of GLIP in addressing the various challenges in the field.

Table 2: Zero-shot accuracy on ImageNet-1K classification. We pre-trained the models for 30 epochs on the CC12M [6] dataset with different image patch mask ratios, using ViT-B/16 as the image encoder. Then, we fine-tuned them for an additional epoch without image masking.
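| Method | Masking | Inference (masking) | Inference (unmasking) | After tuning |
|---|---|---|---|---|
| CLIP | - | - | 35.5 | - |
| FLIP | 50% | 32.1 | 34.0 | 34.2 |
| GLIP | 50% | 33.2 | 35.1 | 35.4 |
| FLIP | 75% | 26.3 | 29.4 | 30.0 |
| GLIP | 75% | 28.8 | 32.1 | 32.2 |
| FLIP | 90% | 16.6 | 21.1 | 22.0 |
| GLIP | 90% | 20.7 | 16.5 | 25.8 |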

ImageNet zero-shot transfer. We initially compare our model against the CLIP baseline on the ImageNet-1K dataset for zero-shot transfer tasks. As shown in Table 2, our model, with a 50% masking rate and inference unmasking, achieved 35.1% accuracy, trailing CLIP by 0.4%. After additional tuning without image masking, our model reaches 35.4%, very close to CLIP. In comparison, FLIP’s performance remains 1.3% below CLIP’s, even after tuning without masking. Subsequently, we evaluated GLIP against FLIP, also as shown in Table 2. Here, GLIP surpassed FLIP by 1.1% in accuracy with a 50% masking rate and inference unmasking. After fine-tuning, our model maintained a 1.2% lead over FLIP. Further, as we increase the masking ratio, GLIP’s advantage over FLIP grows from 1.2% (with a 50% masking ratio and tuning) to 3.8% (with a 90% masking ratio and tuning), showcasing our approach’s efficacy across different masking ratios.

At a 90% masking rate, GLIP exceeds FLIP’s performance by 4.1% with inference masking. However, performance drops substantially from 20.7% to 16.5% with inference unmasking, suggesting a reduction in the model’s generalizability when masking is removed. Nevertheless, a small number of fine-tuning steps substantially improves performance, from 16.5% to 25.8%, indicating that GLIP is better at learning the relationship between images and text compared to FLIP. Impressively, GLIP achieved a zero-shot classification accuracy of 25.8% on ImageNet-1K while requiring only 10% of the computation normally required by the image encoder. Large masking ratios allow us to use larger batch sizes to pre-train the model. We anticipate that GLIP will effectively extend the advantages observed on smaller datasets to larger ones, such as the LAION-400M [19] and LAION-2B [18] datasets. This is because larger mask ratios allow more samples to be seen in the same training time, a factor that has been shown to enhance performance in CLIP [1] and OpenCLIP [29].

Table 3: Linear probing and fine-tuning accuracy on ImageNet-1K classification for CLIP, FLIP, and GLIP pre-trained on CC12M [6].
| Dataset | Method | 0-shot | Linear | Finetuned |
|---|---|---|---|---|
| CC12M | CLIP | 35.5 | 59.87 | 71.80 |
| CC12M | FLIP | 34.2 | 59.13 | 71.75 |
| CC12M | GLIP | 35.4 | 59.97 | 71.80 |

ImageNet linear probing. Linear probing trains a linear classifier on top of a pre-trained model while keeping the model’s weights fixed. The “Linear” column of Table 3 shows the performance of the methods pre-trained on CC12M. In this evaluation, GLIP surpasses the performance of FLIP, achieving an improvement of 0.84%. Additionally, GLIP marginally outperforms CLIP, with a 0.1% higher accuracy.

ImageNet fine-tuning. For ImageNet fine-tuning, the classifier is trained on the ImageNet dataset without freezing the pre-trained model’s weights. In this case, CLIP and GLIP perform equally well, with a 0.05% lead over FLIP.

Table 4: Zero-shot accuracy on more classification datasets. FLIP and GLIP are pre-trained with a 50% image patch masking ratio and fine-tuned for an additional epoch without image masking.
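| Category | Dataset | CLIP | FLIP (before tuning) | GLIP (before tuning) | FLIP (after tuning) | GLIP (after tuning) |
|---|---|---|---|---|---|---|
| Object | Food101 | 40.46 | 39.02 | 40.04 | 39.68 | 40.42 |
| Object | CIFAR10 | 66.22 | 59.41 | 51.63 | 56.19 | 49.24 |
| Object | CIFAR100 | 28.83 | 26.10 | 26.13 | 24.59 | 24.86 |
| Object | CUB | 7.28 | 9.42 | 9.23 | 9.79 | 9.98 |
| Object | Cars | 13.70 | 10.24 | 13.02 | 10.24 | 12.78 |
| Object | Aircraft | 2.38 | 2.59 | 3.11 | 2.47 | 3.05 |
| Object | DTD | 18.56 | 16.76 | 18.14 | 17.02 | 18.30 |
| Object | Oxford Pets | 51.67 | 50.06 | 54.30 | 50.33 | 54.95 |
| Object | Caltech101 | 71.06 | 66.82 | 69.72 | 65.99 | 70.16 |
| Object | Flowers102 | 1.37 | 2.48 | 3.47 | 1.60 | 3.51 |
| Object | MNIST | 14.46 | 9.80 | 9.36 | 9.80 | 9.74 |
| Object | STL10 | 91.00 | 87.25 | 88.75 | 87.70 | 87.64 |
| Object | GTSRB | 10.10 | 13.24 | 7.95 | 12.75 | 9.07 |
| Object | ImageNet-1K | 35.51 | 33.95 | 35.12 | 34.22 | 35.35 |
| Others | SUN397 | 49.77 | 47.69 | 48.10 | 47.35 | 47.94 |
| Others | EuroSAT | 23.92 | 12.48 | 20.20 | 12.18 | 19.38 |
| Others | RESISC45 | 34.84 | 33.65 | 32.59 | 33.84 | 33.10 |
| Others | Country211 | 4.36 | 4.18 | 3.97 | 4.29 | 4.03 |
| Others | PCam | 50.62 | 50.21 | 57.22 | 52.41 | 55.94 |
| Others | KITTI | 25.67 | 37.30 | 34.22 | 37.90 | 30.95 |
| Others | UCF101 | 40.10 | 37.67 | 38.09 | 37.48 | 38.33 |
| Others | Kinetics700 | 24.07 | 23.15 | 23.70 | 23.15 | 23.51 |
| Others | CLEVR | 18.48 | 20.58 | 21.39 | 18.77 | 21.48 |
| Others | HatefulMemes | 54.36 | 51.64 | 50.59 | 51.01 | 50.76 |
| Others | SST2 | 50.14 | 50.41 | 49.48 | 50.58 | 49.37 |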

Zero-shot classification on more datasets. We evaluate our method across a variety of datasets to demonstrate its superiority. These are the same datasets that were used in the original CLIP [1] and FLIP [2] papers. In order to investigate whether GLIP depends on the characteristics of the data, we divide the datasets into two groups: “Object” and “Other”. Images in the “Object” datasets depict an object and are framed so that the object is in the middle of the image, nearly without exception. The “Other” datasets include scenes, activities, and datasets that otherwise cannot be considered “Object” datasets. Note that the “Other” datasets might also have a center focus, meaning that the main subject matter occurs towards the center of the image, although this subject material might not be an object (e.g., it might be a set of objects in a scene). According to our scan of the content of these datasets, the two datasets that are neither photographer framed nor cropped to center the main subject content, and that clearly depart from the assumption of center focus, are EuroSAT (satellite images) [35] and PCam (medical images) [36, 37].

In Table 4, we see that GLIP outperforms FLIP in the majority of “Object” datasets both before and after fine-tuning. Fine-tuning in general provides a small but noticeable advantage in some but not all cases. Recall that the number of patches used by GLIP is the same as FLIP, and the only difference is that GLIP prefers to retain patches closer to the center during masking.

Moving to the “Other” datasets, we might expect GLIP to be less helpful, since these images are less likely to have their main subject material located at the center of the image. We see that GLIP still delivers an improvement over FLIP in a substantial number of cases. Surprisingly, this improvement appears to be greatest for EuroSAT and PCam, the two datasets that are clearly not center focused. These results are interesting because they demonstrate that the potential of GLIP is not restricted to image data with specific characteristics. We return to comment on the cases of EuroSAT and PCam in the outlook.

In sum, before fine-tuning, the average absolute performance of GLIP over the 25 downstream datasets listed in Table 4 is 0.54% higher than that of FLIP. Following fine-tuning, GLIP continues to hold its advantage across these datasets, underscoring its robustness and efficiency in handling a variety of datasets.

It is worth noting that both FLIP and GLIP do not perform as well as CLIP on low-resolution datasets, such as MNIST, CIFAR-10, CIFAR-100, STL10 [38, 39, 40]. The original FLIP paper [2] does not observe this issue. For our smaller training set, it possibly could have been addressed by data augmentation, a point we leave for future work. It is also important to note that in contrast to the original FLIP paper [2], our FLIP model (50% sampling ratio) does not consistently outperform CLIP. This suggests that very large batch sizes are critical for optimal performance of the sampling approaches and deserve further investigation.

Zero-shot robustness evaluation. In Table 5, we evaluate robustness, following the methodology of Table 16 in [1]. After tuning, GLIP surpasses FLIP on 4 out of 6 datasets and by 1.15% on average, and achieves performance comparable to CLIP, which is pre-trained on entire images. This comparison highlights GLIP’s effectiveness in enhancing robustness across diverse datasets.

Table 5: Zero-shot robustness evaluation, comparing the zero-shot accuracy of CLIP, FLIP, and GLIP on various datasets.
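| Dataset | CLIP | FLIP (before tuning) | GLIP (before tuning) | FLIP (after tuning) | GLIP (after tuning) |
|---|---|---|---|---|---|
| ImageNet-A | 8.47 | 7.03 | 6.95 | 7.47 | 7.45 |
| ImageNet-O | 37.45 | 39.80 | 38.75 | 39.85 | 38.95 |
| ImageNet-R | 46.01 | 39.59 | 43.06 | 40.24 | 43.47 |
| ImageNet Sketch | 23.57 | 19.97 | 22.03 | 20.24 | 22.20 |
| ImageNetV2 | 30.07 | 29.09 | 30.12 | 29.24 | 30.47 |
| ObjectNet | 21.85 | 18.38 | 19.32 | 19.56 | 20.98 |
| Average | 27.90 | 25.64 | 26.71 | 26.10 | 27.25 |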
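
The summary figures for the robustness comparison can be re-derived from the after-tuning columns of Table 5. The short check below copies those values and counts the datasets on which GLIP is ahead, then computes the gap between the two averages; the variable names are illustrative.

```python
import numpy as np

# Rows: ImageNet-A, ImageNet-O, ImageNet-R, ImageNet Sketch, ImageNetV2,
# ObjectNet. Values copied from the "after tuning" columns of Table 5.
flip_after = np.array([7.47, 39.85, 40.24, 20.24, 29.24, 19.56])
glip_after = np.array([7.45, 38.95, 43.47, 22.20, 30.47, 20.98])

wins = int((glip_after > flip_after).sum())
mean_gap = float(glip_after.mean() - flip_after.mean())
print(wins, round(mean_gap, 2))  # 4 1.15
```

The 1.15% figure is therefore the difference between the average accuracies over all six datasets, including the two on which FLIP is slightly ahead.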
Table 6: Zero-shot image-text retrieval. We evaluate the image-text retrieval performance of CLIP, FLIP, and GLIP on the COCO and Flickr30k datasets. FLIP and GLIP are pre-trained with a 50% image patch masking ratio and fine-tuned for an additional epoch.
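| Model | Text/Flickr30k R@1 | Text/Flickr30k R@5 | Text/Flickr30k R@10 | Text/COCO R@1 | Text/COCO R@5 | Text/COCO R@10 | Image/Flickr30k R@1 | Image/Flickr30k R@5 | Image/Flickr30k R@10 | Image/COCO R@1 | Image/COCO R@5 | Image/COCO R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | 55.40 | 81.70 | 88.60 | 34.42 | 61.62 | 72.16 | 41.52 | 70.76 | 79.84 | 23.65 | 47.64 | 59.13 |
| FLIP | 53.90 | 79.70 | 87.50 | 31.82 | 59.12 | 70.42 | 39.80 | 67.58 | 76.66 | 22.42 | 45.62 | 57.40 |
| GLIP | 53.70 | 80.80 | 86.60 | 32.74 | 59.18 | 71.30 | 41.92 | 67.66 | 76.42 | 22.53 | 46.61 | 58.52 |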

Image-text retrieval. Table 6 presents the performance of image-text retrieval on the COCO [33] and Flickr30K [34] datasets. GLIP outperforms FLIP on 9 of the 12 reported metrics, including all COCO metrics. However, the performance of both GLIP and FLIP falls short of that achieved by CLIP, which was pre-trained on the CC12M dataset without any masking. This contrasts with previous reports suggesting that FLIP surpasses CLIP in zero-shot image-text retrieval tasks [2]. Because the models in those reports were pre-trained on a 400M dataset, our method may still hold an advantage when applied to larger datasets.

Inverse Gaussian Masking To confirm the importance of the center patches of the image, we carry out an experiment in which we invert the Gaussian mask used by GLIP. We evaluate this “inverse GLIP” on a standard task, i.e., ImageNet-1K zero-shot classification. As expected, this strategy performs poorly (row 2 of Table 7). After tuning, GLIP, which employs centered masking, outperforms inverse Gaussian masking by 4.5%, and inverse Gaussian masking falls 3.3% short of FLIP, which uses random masking. Because ImageNet-1K is photographer framed, we attribute this difference in performance to the periphery of an image being more likely to contain background information than the main subject of the image. The contrast between the performance of inverse GLIP and FLIP suggests that the strength of FLIP actually derives from the randomly sampled patches that happen to be near the center of the image.
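The three strategies compared here can be illustrated with a small simulation. This is a sketch under stated assumptions rather than our training code: patch coordinates are scaled to [-0.5, 0.5], patches are drawn without replacement using exponential-keys weighted sampling, the inversion is taken to be `1 - w` (a hypothetical choice, since the exact inversion is not spelled out in this section), and the 14 x 14 grid assumes ViT-B/16 with a 224 x 224 input. All function names are illustrative.

```python
import math
import random


def gaussian_weights(grid, sigma):
    """Unnormalized Gaussian weight for each patch of a grid x grid image.

    Patch centers are mapped to [-0.5, 0.5] so that sigma is expressed
    relative to the image side (an assumption of this sketch).
    """
    weights = []
    for row in range(grid):
        for col in range(grid):
            y = (row + 0.5) / grid - 0.5
            x = (col + 0.5) / grid - 0.5
            weights.append(math.exp(-(x * x + y * y) / (2 * sigma ** 2)))
    return weights


def sample_kept(weights, keep_ratio, rng):
    """Draw patches to keep, without replacement, favoring large weights."""
    n_keep = int(len(weights) * keep_ratio)
    # Exponential-keys sampling: key = log(u) / w with u in (0, 1];
    # the n_keep largest keys are kept.
    keys = [(math.log(1.0 - rng.random()) / w, i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return [i for _, i in keys[:n_keep]]


def central_fraction(kept, grid):
    """Share of kept patches whose center lies in the inner half of the image."""
    def is_central(i):
        y = (i // grid + 0.5) / grid - 0.5
        x = (i % grid + 0.5) / grid - 0.5
        return abs(x) < 0.25 and abs(y) < 0.25
    return sum(is_central(i) for i in kept) / len(kept)


def mean_central_fraction(weights, grid, trials=200, seed=0):
    """Average central share over repeated draws at a 50% masking ratio."""
    rng = random.Random(seed)
    total = sum(
        central_fraction(sample_kept(weights, 0.5, rng), grid)
        for _ in range(trials)
    )
    return total / trials


GRID, SIGMA = 14, 0.2  # assumed: ViT-B/16 patches on a 224 x 224 input
gauss = gaussian_weights(GRID, SIGMA)
strategies = {
    "random": [1.0] * len(gauss),
    # Hypothetical inversion: favor the patches that the Gaussian disfavors.
    "inverse Gaussian": [1.0 - w + 1e-6 for w in gauss],
    "Gaussian": gauss,
}
results = {name: mean_central_fraction(w, GRID) for name, w in strategies.items()}
for name, frac in results.items():
    print(f"{name}: {frac:.2f} of kept patches are central")
```

Under these assumptions, Gaussian masking retains the largest share of central patches, random masking an intermediate share, and the inverted mask the smallest, which mirrors the ordering of the three strategies in the experiment.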

Table 7: Zero-shot classification performance on ImageNet-1K when we pre-train the model on CC12M using the random, inverse Gaussian, and Gaussian masking strategies to mask 50% of the image patches. The image encoder is ViT-B/16. We set σ to 0.2.
| Method | 0-shot (before tuning) | 0-shot (after tuning) |
|---|---|---|
| Random masking | 34.0 | 34.2 |
| Inverse Gaussian masking | 31.1 | 30.9 |
| Gaussian masking | 35.1 | 35.4 |

Figure 4: Zero-shot classification performance on ImageNet-1K with different values of σ (cf. Formula 3). The models, using ViT-B/16 as the image encoder, were pre-trained on the CC3M dataset for 30 epochs and fine-tuned for one more epoch without masking.

Different σ For our experiments, the value of σ was initially set using visualizations such as those in Figure 1 together with limited exploratory experiments. Here, we investigate the sensitivity of the model to σ. We again study a standard task, i.e., ImageNet-1K zero-shot classification, and train models with different σ values on the CC3M dataset. Results are presented in Figure 4. From σ = 0.1 to σ = 0.2, the performance of the model improves by 0.6%, and the model achieves optimal performance at σ = 0.2, which is the value that we had chosen via visualization. Increasing σ beyond 0.2 decreases performance. However, after fine-tuning, the model still surpasses FLIP, which uses random masking, for σ values up to 0.8. Thus, it is essential to balance the model’s focus between the image center and its edges, so that the entire image receives attention. Further, from this experiment, we conclude that GLIP is reasonably robust to the choice of σ, since a σ somewhat larger than the optimum decreases performance only mildly and still allows GLIP to outperform FLIP.
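One way to build intuition for the role of σ is to compare the Gaussian weight of a central patch with that of a corner patch. The sketch below assumes patch coordinates scaled to [-0.5, 0.5] and a 14 x 14 patch grid (ViT-B/16 on a 224 x 224 input); the function name `center_to_corner_ratio` is illustrative and the exact parameterization of Formula 3 may differ.

```python
import math


def center_to_corner_ratio(grid, sigma):
    """Gaussian weight of a most-central patch divided by that of a corner patch."""
    def weight(row, col):
        # Patch centers are mapped to [-0.5, 0.5] (an assumption of this sketch).
        y = (row + 0.5) / grid - 0.5
        x = (col + 0.5) / grid - 0.5
        return math.exp(-(x * x + y * y) / (2 * sigma ** 2))
    return weight(grid // 2, grid // 2) / weight(0, 0)


sigmas = (0.1, 0.2, 0.4, 0.8)
ratios = {sigma: center_to_corner_ratio(14, sigma) for sigma in sigmas}
for sigma, ratio in ratios.items():
    print(f"sigma={sigma}: center/corner weight ratio = {ratio:.3g}")
```

The ratio shrinks toward 1 as σ grows, so a large σ yields a nearly flat weighting that behaves much like random masking, while a small σ concentrates retention almost entirely on the center.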

5 Conclusion and Outlook

In this paper, we have introduced GLIP, a novel image masking strategy that improves the pre-training efficiency of Vision-Language Models. Our method outperforms the FLIP random image masking strategy by increasing the retention of image patches at the center of images. We achieved better performance than FLIP on various zero-shot tasks such as classification and image-text retrieval. Unlike A-CLIP, our approach requires no additional inference to compute the attention scores of image patches. Across a wide range of tasks and datasets, GLIP demonstrates superior performance compared to its FLIP counterparts, particularly when a large masking ratio is used during pre-training. We also demonstrated that the performance of GLIP is similar to that of CLIP when we pre-train the model on the CC12M dataset.

Moving forward, it will be interesting to study the improvement offered by GLIP on very large-scale datasets. In the original work on FLIP [2], experiments were reported on training sets with 400 million samples and 2 billion samples. Because FLIP simply samples patches, no inherent constraints prevent it from scaling to larger data, and the same holds for GLIP. We are therefore confident that GLIP will also deliver performance improvements on datasets much larger than the set of 10 million training examples we used here. Further, because GLIP is effective with a large masking ratio, it can process more samples than FLIP given the same training time. We thus expect that GLIP will become increasingly competitive as datasets grow larger.

An interesting aspect of GLIP is its applicability across a wide range of datasets, as demonstrated by our experiments. Intuitively, it might be expected that GLIP performs better on datasets with center focus, i.e., datasets containing images taken by a photographer who actively framed them so that the main subject material would be close to the center of the image, and not at the edge. However, we have seen that GLIP shows a strong advantage over FLIP on data where there is apparently no explicit center focus, specifically, the EuroSAT dataset [35], which contains remote sensing images, and PCam [36, 37], which contains medical images of tumors. It remains an open question why GLIP performs so well on datasets where the main information does not have a strong tendency to be located towards the middle of the image. A possible explanation is that patches at the center of the image are simply more helpful to the vision encoder because they are surrounded by full context. A patch located at the periphery of an image does not enjoy a full context because the image ends at the image edge. Moving forward, we will continue to explore why the centered masking of GLIP delivers an improvement over the random masking of FLIP.

References

  • [1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
  • [2] Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • [3] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  • [4] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [6] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In IEEE/CVF Computer Vision and Pattern Recognition Conference, 2021.
  • [7] Yifan Yang, Weiquan Huang, Yixuan Wei, Houwen Peng, Xinyang Jiang, Huiqiang Jiang, Fangyun Wei, Yin Wang, Han Hu, Lili Qiu, et al. Attentive mask clip. In IEEE/CVF International Conference on Computer Vision, pages 2771–2781, 2023.
  • [8] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 2021.
  • [9] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
  • [10] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • [11] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. Advances in neural information processing systems, 29, 2016.
  • [12] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. Advances in neural information processing systems, 27, 2014.
  • [13] Zhaoyang Niu, Guoqiang Zhong, and Hui Yu. A review on the attention mechanism of deep learning. Neurocomputing, 452:48–62, 2021.
  • [14] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. Randaugment: Practical automated data augmentation with a reduced search space. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18613–18624, 2020.
  • [15] Terrance DeVries and Graham W Taylor. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.
  • [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • [17] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019.
  • [18] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, 2022.
  • [19] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-Filtered 400 million image-text pairs, 2021.
  • [20] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Annual Meeting of the Association for Computational Linguistics, 2018.
  • [21] Xiaoyi Dong, Jianmin Bao, Yinglin Zheng, Ting Zhang, Dongdong Chen, Hao Yang, Ming Zeng, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, and Nenghai Yu. MaskCLIP: masked self-distillation advances contrastive language-image pretraining. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10995–11005, 2023.
  • [22] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In European Conference on Computer Vision, pages 456–473, 2022.
  • [23] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • [24] Rudolf Arnheim. Art and visual perception: A psychology of the creative eye. Univ of California Press, 1954.
  • [25] Roy H. Quan. Photography and the creation of meaning. Art Education, 32(2):4–9, 1979.
  • [26] Benjamin W Tatler. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of vision, 7(14):4–4, 2007.
  • [27] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020.
  • [28] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2019.
  • [29] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021.
  • [30] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
  • [31] Norman Mu, Alexander Kirillov, David A. Wagner, and Saining Xie. SLIP: Self-supervision Meets Language-Image Pre-training. In European Conference on Computer Vision, 2022.
  • [32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  • [33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
  • [34] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, pages 67–78, 2014.
  • [35] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
  • [36] Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes van Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M. van der Laak, and the CAMELYON16 Consortium. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA, 318(22):2199–2210, 2017.
  • [37] Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant cnns for digital pathology. In Medical Image Computing and Computer Assisted Intervention, 2018.
  • [38] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, 2011.
  • [39] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [40] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.