FastCLIP: A Suite of Optimization Techniques to
Accelerate CLIP Training with Limited Resources

Xiyuan Wei¹, Fanjiang Ye², Ori Yonay¹, Xingyu Chen¹, Baixi Sun², Dingwen Tao², Tianbao Yang¹

¹Texas A&M University, College Station, TX 77843
²Indiana University Bloomington, Bloomington, IN 47405

Abstract

Existing studies of training state-of-the-art Contrastive Language-Image Pretraining (CLIP) models on large-scale data involve hundreds or even thousands of GPUs due to the requirement of a large batch size. However, such a large amount of resources is not accessible to most people. While advanced compositional optimization techniques for optimizing global contrastive losses have been shown to be effective at removing the requirement of a large batch size, their performance on large-scale data remains underexplored and not optimized. To bridge the gap, this paper explores several aspects of CLIP training with limited resources (e.g., up to tens of GPUs). First, we introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques and designed and optimized for the distributed setting. Our framework is equipped with an efficient gradient reduction strategy to reduce communication overhead. Second, to further boost training efficiency, we investigate three components of the framework from an optimization perspective: the schedule of the inner learning rate, the update rule of the temperature parameter, and the update rule of the model parameters. Experiments on different strategies for each component shed light on how to conduct CLIP training more efficiently. Finally, we benchmark the performance of FastCLIP and the state-of-the-art training baseline (OpenCLIP) on different compute scales up to 32 GPUs on 8 nodes, and on three data scales (2.7 million, 9.1 million, and 315 million image-text pairs), to demonstrate the significant improvement of FastCLIP in the resource-limited setting. We release the code of FastCLIP at https://github.com/Optimization-AI/fast_clip.

1 Introduction

Contrastive Language-Image Pretraining (CLIP) [38] is a popular approach for vision-language representation learning [7, 47, 6, 27, 37]. The method effectively embeds data from the image and language modality into a joint embedding space by optimizing a contrastive loss in a self-supervised manner. It has demonstrated strong performance on various downstream tasks (e.g., zero-shot classification and retrieval) and has been adopted in various applications, including text-to-image generation [40, 59, 8], image captioning [55, 34], and evaluation of image generation quality [19]. Its popularity is further fueled by releases of web-scale datasets [43, 44, 14, 12].

However, vanilla mini-batch based methods for self-supervised contrastive learning are known to require a large batch size to obtain satisfactory performance. Theoretically, it has been shown that the optimization error of mini-batch based contrastive learning methods inversely depends on the batch size [56]. Empirically, state-of-the-art CLIP models are typically trained using a large batch size on a large number of GPUs (e.g., 84k batch size and 1024 Nvidia A100 GPUs in OpenCLIP [7]). Such a large amount of resources is not accessible to most people in academia and small companies. Recently, Yuan et al. [56] proposed an algorithm named SogCLR to address the large batch size issue, which leverages finite-sum coupled compositional optimization (FCCO) techniques to optimize a global contrastive loss (GCL) that contrasts each anchor data point with all other data in a compositional structure. A key feature of compositional optimization is the separation into inner and outer steps: the inner steps maintain and update a sequence of estimators to track the inner functions on the solution path, and this update can be interpreted as an SGD update with a learning rate called the inner learning rate [49]. SogCLR was later leveraged by Qiu et al. [37] to design the iSogCLR algorithm for optimizing a robust global contrastive loss (RGCL) with individualized learnable temperatures in CLIP models. However, these algorithms are not fully optimized for large-scale training of CLIP models since they were examined only on small-scale datasets.

This paper aims to scale up the advanced optimization algorithms for optimizing global contrastive losses of CLIP training on large-scale data with limited compute resources. We introduce a distributed training framework named FastCLIP by employing data parallelism such that each worker computes the gradient estimator using their respective data and then reduces (averages) them through communication, based on which the model is updated. A novel gradient reduction strategy is designed, which requires less communication than the existing distributed framework. This distributed training framework lays the foundation for scaling up CLIP training with limited resources. To further boost the efficiency of our framework, we investigate its three aspects from an optimization perspective: the schedule of the inner learning rate (LR) of compositional optimization, the update rule of the temperature parameter, and the update rule of the model parameters, respectively.

• Previous studies [56, 37] set the inner LR to a constant value less than but close to one, which could slow down the training for large-scale data at earlier iterations. Inspired by the learning rate schedules of existing deep learning optimizers [31], we examine a cosine decay schedule for the inner LR by benchmarking its performance and comparing it with the constant schedule.

• For the update rule of the temperature parameter, we compare four different strategies in the FastCLIP framework: a heuristic approach based on the gradient of GCL, a constant strategy as used in SogCLR, learning individualized temperatures as used in iSogCLR, and learning a global temperature by optimizing a new RGCL with a single learnable temperature.

• For the update rule of the model parameters, we benchmark the performance of the AdamW optimizer [32] and the LAMB optimizer [53] in FastCLIP. We also explored a momentum-based optimizer, but it yielded much worse performance and hence is not reported in this paper.

Moreover, in order to study the scaling capability of FastCLIP, we compare the performance of FastCLIP and the state-of-the-art baseline OpenCLIP [22] on three data scales and four compute scales. The data scales include 2.7 million (CC3M [45]), 9.1 million (CC12M [3]), and 315 million (LAION400M [43]) image-text pairs; our downloaded versions of these datasets are smaller than their original versions because some web links are no longer valid. The compute scales include 1, 2, 4, and 8 nodes, with 4 GPUs on each node.

The contributions of this paper are summarized as follows: (1) We propose FastCLIP, an efficient distributed framework to scale up CLIP training with limited computing resources. (2) We benchmark the performance of different strategies for three components of FastCLIP, providing insights on how to conduct CLIP training more efficiently. (3) We study the performance of FastCLIP on different data scales and compute scales. The results show that FastCLIP consistently outperforms the state-of-the-art training baseline OpenCLIP by a large margin. A quick comparison between FastCLIP and OpenCLIP on the medium and large data scales across different compute scales is shown in Figure 1.

2 Related Works

CLIP training in the distributed setting: Radford et al. [38] train CLIP models in a distributed setting, but few details regarding the implementation are provided. Ilharco et al. [22] develop OpenCLIP, an open-source implementation of CLIP. They leverage the PyTorch distributed data-parallel module [26] to automatically communicate features and gradients. EVA-CLIP [46, 47] scales the number of parameters of the image encoder in CLIP up to 18 billion by applying several techniques from the system perspective, including the ZeRO optimizer [39] and global half-precision training with DeepSpeed [41]. The key difference between existing works and this work is that they all use a simple mini-batch based contrastive loss, which suffers from the issue of requiring a large batch size. This in turn requires hundreds and even thousands of GPUs. For example, CLIP uses 592 V100 GPUs, OpenCLIP uses 1024 A100 GPUs, and EVA-CLIP uses 256 A100 GPUs. Our work focuses on scaling up CLIP training in a resource-limited setting with only tens of GPUs. We make unique efforts to reduce communication overhead and optimize algorithmic components.

Benchmark for CLIP training: Cherti et al. [7] study the scaling performance of CLIP training. They measure the performance of CLIP across different model sizes and dataset sizes, and study the relationships between downstream task performance and resource consumption. Gadre et al. [14] investigate the impact of different data filtering strategies on the trained model’s downstream performance. They conduct experiments across different data scales ranging from 12.8 million to 12.8 billion and provide insights on how to curate CLIP’s training data. Cui et al. [9] examine the impact of data quality, supervision strategies (e.g., additional image supervision), and model architectures. Li et al. [30] explore different aspects of CLIP training under a limited training budget, including the impact of the quality and quantity of the training data, different model architectures, and different existing training strategies. Different from these works, we study different algorithmic components of CLIP training in an advanced optimization framework for optimizing the global contrastive loss.

Improved CLIP training: Many works have studied efficient CLIP training with limited resources. Yuan et al. [56] propose SogCLR to improve the performance of contrastive learning with small batch size. Our work scales up SogCLR in the distributed setting and incorporates several algorithmic strategies to accelerate its training speed. Besides the algorithm, other directions are also explored for more efficient CLIP training, including augmenting mini-batch based contrastive losses [29, 57, 35, 25, 33, 24, 15], model compression [51, 28, 13], and system optimization [5, 46, 39].

Temperature scheme: The temperature parameter in contrastive losses plays an important role in CLIP training. Many techniques have been proposed to update or set the temperature parameter. Radford et al. [38] treat the temperature as part of the learnable parameters in the mini-batch contrastive loss. Zhang et al. [58] propose to use different temperatures for positive and negative samples to independently control intra-anchor and inter-anchor hardness-awareness. Kukleva et al. [23] study a cosine decay schedule for setting the temperature. Huang et al. [21] propose to set the temperature parameter proportional to the alignment between positive pairs. Qiu et al. [37] propose a robust global contrastive loss (RGCL) with individualized temperatures inspired by Distributionally Robust Optimization and optimize it with the iSogCLR algorithm which extends SogCLR. However, their performance on large-scale data remains unknown. This work focuses on comparing different global contrastive losses for learning the temperature parameter and discovers a new strategy by learning a global temperature in the RGCL that yields better performance for large-scale data.

Optimizers for CLIP training: Different optimizers for updating the learnable parameters have been employed in CLIP training, including AdamW [32] used in [38, 7, 14, 6, 27, 37], and LAMB [53] used in [46, 52, 20]. In this work, we compare the performance of AdamW and LAMB to determine which optimizer is better in FastCLIP for training CLIP models from scratch.

3 Preliminaries

Notations: Given a dataset of $n$ images $\bm{x}_{i}$ and their corresponding text descriptions $\bm{z}_{i}$ : $\mathcal{S}=\{(\bm{x}_{1},\bm{z}_{1}),\ldots,(\bm{x}_{n},\bm{z}_{n})\}$ , we aim to learn an image encoder and a text encoder (jointly represented by $\bm{w}$ ) from the data. We use $\bm{e}_{1,i}=\bm{e}_{1}(\bm{w},\bm{x}_{i})\in\mathbb{R}^{d}$ and $\bm{e}_{2,i}=\bm{e}_{2}(\bm{w},\bm{z}_{i})\in\mathbb{R}^{d}$ to denote the encoded vector of the input $\bm{x}_{i}$ and $\bm{z}_{i}$ , respectively. And we use $\bm{e}_{i}=(\bm{e}_{1,i}^{\top},\bm{e}_{2,i}^{\top})^{\top}$ to denote the concatenation of $\bm{e}_{1,i}$ and $\bm{e}_{2,i}$ . Denote by $\mathcal{B}\subset\mathcal{S}$ a mini-batch of image-text pairs. With slight abuse of notation, we also use $\mathcal{B}$ (and $\mathcal{S}$ ) to denote the index set of the image-text pairs it contains. $\mathcal{S}_{i-}:=\mathcal{S}\backslash\{i\}$ denotes the subset of $\mathcal{S}$ without $i$ -th pair. We consider the data parallel setting such that $\mathcal{S}$ is partitioned evenly across $K$ workers denoted by $\mathcal{S}_{1},\ldots,\mathcal{S}_{K}$ . For a function $\ell(\cdot,\cdot)$ , let $\nabla_{1}\ell(\cdot,\cdot)$ and $\nabla_{2}\ell(\cdot,\cdot)$ denote the partial gradient in terms of the first and the second argument.

Mini-batch Contrastive Loss (MBCL) and Global Contrastive Loss (GCL): The core idea of CLIP training is to leverage a contrastive loss to push features of paired image and text close to each other (i.e., to maximize the similarity between $\bm{e}_{1,i}$ and $\bm{e}_{2,i}$ ), while pushing features of non-paired image and text away from each other (i.e., minimizing the similarity between $\bm{e}_{1,i}$ and $\bm{e}_{2,j}$ for $i\neq j$ ). Mathematically, let $s_{i,j}$ denote the cosine similarity between $\bm{e}_{1,i}$ and $\bm{e}_{2,j}$ . Define

$$\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau):=\exp\left((s_{i,j}-s_{i,i})/\tau\right),\quad\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau):=\exp\left((s_{j,i}-s_{i,i})/\tau\right),$$

where $\tau>0$ is the temperature parameter. Given a mini-batch $\mathcal{B}$ of image-text pairs, let

$$g_{1}(\bm{w},\tau,i,\mathcal{B}):=\frac{1}{|\mathcal{B}|}\sum\nolimits_{j\in\mathcal{B}}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau),\quad g_{2}(\bm{w},\tau,i,\mathcal{B}):=\frac{1}{|\mathcal{B}|}\sum\nolimits_{j\in\mathcal{B}}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau).$$

Following Radford et al. [38], many works minimize the mini-batch contrastive loss (MBCL):

$$\frac{1}{|\mathcal{S}|}\sum\nolimits_{i\in\mathcal{S}}\mathbb{E}_{\mathcal{B}\subset\mathcal{S}_{i-}}\left(\log\left(\frac{1}{|\mathcal{B}|}+g_{1}(\bm{w},\tau,i,\mathcal{B})\right)+\log\left(\frac{1}{|\mathcal{B}|}+g_{2}(\bm{w},\tau,i,\mathcal{B})\right)\right),\tag{MBCL}$$

which contrasts the $i$ -th pair with other pairs within only a mini-batch $\mathcal{B}$ . However, this loss suffers from the large-batch size issue, which has been addressed by the Global Contrastive Loss (GCL) [56] that contrasts the $i$ -th pair with all other pairs in the dataset $\mathcal{S}$ :

$$\frac{\tau}{|\mathcal{S}|}\sum\nolimits_{i\in\mathcal{S}}\left(\log\left(\frac{1}{|\mathcal{S}_{i-}|}+g_{1}(\bm{w},\tau,i,\mathcal{S}_{i-})\right)+\log\left(\frac{1}{|\mathcal{S}_{i-}|}+g_{2}(\bm{w},\tau,i,\mathcal{S}_{i-})\right)\right).\tag{GCL}$$
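As a small illustration of the definitions above, the following NumPy sketch evaluates $g_{1}$, $g_{2}$, (GCL), and a one-batch estimate of (MBCL) from raw features. The function names are hypothetical and not taken from the FastCLIP code base. The one-batch estimate assumes that the anchors are the batch members and that each anchor is contrasted with the rest of the batch, which is one possible way of sampling the expectation in (MBCL).

```python
import numpy as np

def cosine_similarity_matrix(img, txt):
    """s[i, j] = cosine similarity between image feature i and text feature j."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return img @ txt.T

def inner_functions(s, tau):
    """g_1 and g_2 for every anchor i, averaged over all j != i in s."""
    n = s.shape[0]
    diag = np.diag(s)
    l1 = np.exp((s - diag[:, None]) / tau)    # l1[i, j] = exp((s_ij - s_ii) / tau)
    l2 = np.exp((s.T - diag[:, None]) / tau)  # l2[i, j] = exp((s_ji - s_ii) / tau)
    off_diag = ~np.eye(n, dtype=bool)         # excludes the pair i itself
    g1 = (l1 * off_diag).sum(axis=1) / (n - 1)
    g2 = (l2 * off_diag).sum(axis=1) / (n - 1)
    return g1, g2

def global_contrastive_loss(s, tau):
    """(GCL): every anchor is contrasted with all other pairs in the dataset."""
    eps = 1.0 / (s.shape[0] - 1)
    g1, g2 = inner_functions(s, tau)
    return tau * np.mean(np.log(eps + g1) + np.log(eps + g2))

def mini_batch_loss_estimate(s, tau, batch):
    """One-batch estimate of (MBCL); assumes anchors are the batch members."""
    sub = s[np.ix_(batch, batch)]
    eps = 1.0 / (len(batch) - 1)
    g1, g2 = inner_functions(sub, tau)
    return np.mean(np.log(eps + g1) + np.log(eps + g2))
```

Note that the mini-batch estimate changes with the batch size through both the constant inside the logarithm and the set of negatives, whereas (GCL) is defined once for the whole dataset.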

Robust Global Contrastive Loss (RGCL): To improve CLIP training, Qiu et al. [37] designed a robust global contrastive loss (RGCL) with individualized temperature parameters, inspired by Distributionally Robust Optimization. It is defined as:

$$\begin{aligned}\min_{\tau_{1},\tau_{2}\geq\tau_{0}}\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}&\left(\tau_{1,i}\cdot\left(\log\left(\frac{1}{|\mathcal{S}_{i-}|}+g_{1}(\bm{w},\tau_{1,i},i,\mathcal{S}_{i-})\right)+\rho\right)\right.\\&\;\;\left.+\tau_{2,i}\cdot\left(\log\left(\frac{1}{|\mathcal{S}_{i-}|}+g_{2}(\bm{w},\tau_{2,i},i,\mathcal{S}_{i-})\right)+\rho\right)\right),\end{aligned}\tag{RGCL}$$

where $\tau_{1}=(\tau_{1,1},\ldots,\tau_{1,n})$ , $\tau_{2}=(\tau_{2,1},\ldots,\tau_{2,n})$ , $\tau_{0}$ is a small value, $\rho\geq 0$ is a hyperparameter.

Optimization Algorithms. To optimize GCL, Yuan et al. [56] have proposed the SogCLR algorithm based on advanced compositional optimization known as finite-sum coupled compositional optimization (FCCO) [49]. In particular, GCL is formulated as $\frac{1}{n}\sum_{i\in\mathcal{S}}f(g_{i}(\bm{w}))$ , where $f(g)=\log(\varepsilon+g)$ and $g_{i}(\bm{w})$ is the inner function inside the log. The main challenge is to compute a gradient estimator using a mini-batch of samples such that the algorithm can converge without requiring a large batch size. The key idea of SogCLR is to maintain and update an estimator for each inner function $g_{i}(\bm{w})$ denoted by $u_{i}$ , by using Eqn. (1). As a result, the gradient at the $t$ -th iteration is estimated by $\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\nabla f(u_{i}^{t+1})\nabla\hat{g}_{i}(\bm{w}^{t})$ , where $\mathcal{B}$ is a mini-batch and $\hat{g}_{i}(\bm{w})$ is a mini-batch estimator of $g_{i}(\bm{w})$ . To optimize RGCL, Qiu et al. [37] have proposed the iSogCLR algorithm by combining SogCLR with stochastic coordinate updates for the temperature parameters.

4 FastCLIP: A Distributed Training Framework of CLIP Models

Algorithm 1: The FastCLIP Framework (Sketch)

1 Input: Initial model parameters $\bm{w}^{0},\tau^{0},(\bm{u}_{1}^{0},\bm{u}_{2}^{0})$ , number of iterations $T$
2 for $t=0,\ldots,T-1$ do
3   for each worker $k$ do in parallel
4     Sample a batch $\mathcal{B}^{t}_{k}$ from $\mathcal{S}_{k}$ and compute $\mathcal{E}^{t}_{k}=\{(\bm{e}_{1,j},\bm{e}_{2,j})\}_{j\in\mathcal{B}^{t}_{k}}$
5     All_Gather $\mathcal{E}^{t}=\cup_{k}\mathcal{E}^{t}_{k}$ to obtain global features
6     Compute mini-batch contrastive losses $g_{1,i}^{t},g_{2,i}^{t}$ for $i\in\mathcal{B}^{t}_{k}$ (c.f. Proc. 2 in Appendix A)
7     Update $u_{1,i}^{t+1},u_{2,i}^{t+1}$ using Eqn. (1) for $i\in\mathcal{B}^{t}_{k}$ . Set $u_{1,i}^{t+1}=u_{1,i}^{t},u_{2,i}^{t+1}=u_{2,i}^{t}$ for $i\notin\mathcal{B}^{t}_{k}$
8     Set $\mathcal{U}^{t}_{k}=\{(u_{1,j}^{t+1},u_{2,j}^{t+1})\}_{j\in\mathcal{B}^{t}_{k}}$ , and All_Gather $\mathcal{U}^{t}=\cup_{k}\mathcal{U}^{t}_{k}$
9     Compute gradient estimators $G_{\bm{w},k}^{t}=G_{\bm{w},a,k}^{t}+G_{\bm{w},b,k}^{t}$ for $\bm{w}$ using $\mathcal{U}^{t}$ (c.f. Proc. 3)
10    All_Reduce $G_{\bm{w}}^{t}=\frac{1}{K}\sum_{l=1}^{K}G_{\bm{w},l}^{t}$ across all workers
11    Update $\bm{w}^{t+1}$ from $\bm{w}^{t}$ and $G_{\bm{w}}^{t}$ using an optimizer (c.f. Proc. 4).
12    Update $\tau^{t+1}$ from $\tau^{t}$ (c.f. Proc. 5).

FastCLIP is a distributed training framework for optimizing a GCL including RGCL. Its key updates are built upon the SogCLR algorithm. The main difference between SogCLR and mini-batch based methods such as CLIP is that SogCLR maintains $u_{1,i}$ and $u_{2,i}$ to keep track of $g_{1}(\bm{w},\tau,i,\mathcal{S}_{i-})$ and $g_{2}(\bm{w},\tau,i,\mathcal{S}_{i-})$ as stated in Section 3. At iteration $t$ , for $i$ selected in the batch $\mathcal{B}^{t}$ , $u_{1,i}$ and $u_{2,i}$ will be updated using a moving average estimator with hyperparameter $\gamma_{t}\in(0,1]$ :

$$u_{1,i}^{t+1}=(1-\gamma_{t})u_{1,i}^{t}+\gamma_{t}g_{1}(\bm{w}^{t},\tau^{t},i,\mathcal{B}_{i-}^{t}),\quad u_{2,i}^{t+1}=(1-\gamma_{t})u_{2,i}^{t}+\gamma_{t}g_{2}(\bm{w}^{t},\tau^{t},i,\mathcal{B}_{i-}^{t}),\tag{1}$$

and the gradient estimator is computed by $\frac{1}{|\mathcal{B}^{t}|}\sum_{i\in\mathcal{B}^{t}}\nabla f(u_{i}^{t+1})\nabla\hat{g}_{i}(\bm{w}^{t})$ . The core of FastCLIP (Alg. 1) is how to compute the gradient estimator in a distributed manner.
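The moving-average update in Eqn. (1) and the resulting outer weights $\nabla f(u_{i}^{t+1})=1/(\varepsilon+u_{i}^{t+1})$ can be sketched in a few lines of NumPy. This is a minimal single-process sketch; the helper names `update_u` and `outer_weights` are hypothetical, and the mini-batch values `g_batch` are assumed to have been computed beforehand.

```python
import numpy as np

def update_u(u, batch_idx, g_batch, gamma):
    """Eqn. (1): moving average for indices in the batch; others stay unchanged."""
    u_new = u.copy()
    u_new[batch_idx] = (1.0 - gamma) * u[batch_idx] + gamma * g_batch
    return u_new

def outer_weights(u_new, batch_idx, eps):
    """Derivative of f(g) = log(eps + g) evaluated at the updated estimators."""
    return 1.0 / (eps + u_new[batch_idx])

# Toy example: 5 data points, 2 of them sampled in the current batch.
u = np.zeros(5)
batch_idx = np.array([1, 3])
g_batch = np.array([2.0, 4.0])   # assumed mini-batch values of g_1
u_next = update_u(u, batch_idx, g_batch, gamma=0.5)
weights = outer_weights(u_next, batch_idx, eps=0.25)
```

With `gamma=1.0` the estimator reduces to the current mini-batch value, so the history in `u` is ignored entirely.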

Next, we use (GCL) as an example to present our gradient computation strategy, which effectively reduces the communication cost. We present only the key steps and defer the complete derivation to Appendix A due to space limits. Let $\mathcal{B}^{t}_{k}$ denote the local mini-batch on the $k$ -th worker. Below, we omit the superscript $t$ and write $\mathcal{B}_{k}$ for simplicity. Note that (GCL) is the sum of two parts: the image part (loss $g_{1}$ ) and the text part (loss $g_{2}$ ). Due to their symmetric structure, we present only the gradient of the image part. With $\varepsilon=1/|\mathcal{S}_{i-}|$ , the gradient estimator of the image part of (GCL) is computed by $G_{\bm{w},1,a}+G_{\bm{w},1,b}$ :

$$\begin{aligned}
G_{\bm{w},1,a}&=\tau\cdot\underbrace{\frac{1}{K}\sum_{k=1}^{K}}_{\textsc{All\_Reduce}}\frac{1}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\frac{1}{\varepsilon+\underbrace{u_{1,i}}_{\textrm{local}}}\cdot\overbrace{\frac{1}{K}\sum_{k^{\prime}=1}^{K}\frac{1}{|\mathcal{B}_{k^{\prime},i-}|}\sum_{j\in\mathcal{B}_{k^{\prime},i-}}\nabla_{1}\ell_{1}(\underbrace{\bm{e}_{i}}_{\textrm{local}},\underbrace{\colorbox{yellow!15}{$\bm{e}_{2,j}$}}_{\textrm{global}},\tau)\cdot\underbrace{\nabla\bm{e}_{i}}_{\textrm{local}}}^{G_{\bm{w},1,a,i}},\\
G_{\bm{w},1,b}&=\tau\cdot\underbrace{\frac{1}{K}\sum_{k^{\prime}=1}^{K}}_{\textsc{All\_Reduce}}\frac{1}{|\mathcal{B}_{k^{\prime}}|}\sum_{j\in\mathcal{B}_{k^{\prime}}}\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{B}_{k,j-}|}\sum_{i\in\mathcal{B}_{k,j-}}\frac{1}{\varepsilon+\underbrace{\colorbox{yellow!15}{$u_{1,i}$}}_{\textrm{global}}}\nabla_{2}\ell_{1}(\underbrace{\colorbox{yellow!15}{$\bm{e}_{i}$}}_{\textrm{global}},\underbrace{\bm{e}_{2,j}}_{\textrm{local}},\tau)\cdot\underbrace{\nabla\bm{e}_{2,j}}_{\textrm{local}}.
\end{aligned}$$

Both $G_{\bm{w},1,a}$ and $G_{\bm{w},1,b}$ contain two averages over $\mathcal{B}$ due to the compositional structure of the loss. For FastCLIP, the inner average (e.g., $G_{\bm{w},1,a,i}$ ) is computed on a single worker after gathering the global parts (shaded, e.g., $\bm{e}_{2,j}$ ) from all workers. The outer average is then computed using All_Reduce.
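The "gather the global parts, then compute the inner average locally" pattern can be mimicked in a single process. The sketch below does this for the inner function $g_{1}$ rather than for the gradient itself, with `np.concatenate` standing in for All_Gather; the function `g1_on_worker` and the toy sizes are hypothetical, and no real communication library is involved.

```python
import numpy as np

def unit_rows(x):
    """Normalize each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def g1_on_worker(img_local, txt_local, ids_local, txt_global, ids_global, tau):
    """Inner average g_1 for the anchors held by one worker.

    img_local, txt_local: unit-norm features of the worker's own pairs (local).
    txt_global, ids_global: text features and pair ids gathered from all workers.
    """
    s = img_local @ txt_global.T                      # s[i, j]: local i, global j
    s_ii = np.sum(img_local * txt_local, axis=1)      # similarity of positive pairs
    ell = np.exp((s - s_ii[:, None]) / tau)           # ell_1 terms
    keep = ids_local[:, None] != ids_global[None, :]  # drop the term j == i
    return (ell * keep).sum(axis=1) / keep.sum(axis=1)

rng = np.random.default_rng(0)
K, b, d, tau = 2, 3, 4, 0.1                           # toy sizes (assumptions)
img = unit_rows(rng.normal(size=(K * b, d)))
txt = unit_rows(rng.normal(size=(K * b, d)))
ids = np.arange(K * b)

# Each worker holds a contiguous shard; concatenation stands in for All_Gather.
shards = [slice(k * b, (k + 1) * b) for k in range(K)]
txt_global = np.concatenate([txt[sh] for sh in shards])
ids_global = np.concatenate([ids[sh] for sh in shards])

g1_distributed = np.concatenate([
    g1_on_worker(img[sh], txt[sh], ids[sh], txt_global, ids_global, tau)
    for sh in shards
])
# Reference: the same quantity computed in one place on the whole global batch.
g1_reference = g1_on_worker(img, txt, ids, txt, ids, tau)
```

Each worker only needs its own image features plus the gathered text features, and the per-worker results agree with the centralized computation.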

Difference from OpenCLIP. Algorithmically, OpenCLIP does not use the $u$ sequence, which is equivalent to setting $\gamma_{t}=1$ . In terms of distributed implementation, to compute $G_{\bm{w},1,b}$ , OpenCLIP first computes $\frac{1}{\varepsilon+u_{1,i}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)$ on the worker where the $i$ -th pair resides; worker $k^{\prime}$ then gathers these terms using Reduce_Scatter and uses them to compute the inner average. FastCLIP avoids Reduce_Scatter by gathering $u_{1,i}$ using All_Gather and computing the inner average directly on worker $k^{\prime}$ , since $\{\bm{e}_{i}\}$ have already been gathered when computing $\{u_{i}\}$ and $G_{\bm{w},1,a}$ .

FastCLIP has the same communication and computation cost as OpenCLIP for computing $G_{\bm{w},1,a}$ , but substantially reduces communication for computing $G_{\bm{w},1,b}$ . Specifically, Reduce_Scatter in OpenCLIP requires $\mathcal{O}(K|\mathcal{B}|d)$ communication cost, where $d$ is the feature dimensionality (>512 in practice), whereas All_Gather of $u_{1,i}$ in FastCLIP requires only $\mathcal{O}(K|\mathcal{B}|)$ communication since each $u_{1,i}$ is a scalar. This reduction is verified empirically in Sec. 6.
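To get a feel for the gap between the two communication volumes, the following sketch counts only the leading-order number of transferred scalars and ignores constants and implementation details. The function name and the example values ($K=8$, $|\mathcal{B}|=128$, $d=512$) are illustrative assumptions, not measurements from the paper.

```python
def leading_order_comm(num_workers, batch_size, feat_dim):
    """Leading-order scalar counts for the two strategies (constants ignored)."""
    reduce_scatter = num_workers * batch_size * feat_dim  # O(K |B| d): d-dim terms
    all_gather_u = num_workers * batch_size               # O(K |B|): scalar u values
    return reduce_scatter, all_gather_u

rs, ag = leading_order_comm(num_workers=8, batch_size=128, feat_dim=512)
ratio = rs // ag  # equals feat_dim, here 512
```

Under this simplified count, the ratio between the two volumes is exactly the feature dimensionality $d$, independent of $K$ and $|\mathcal{B}|$.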

Table 1: Comparison between different algorithms. In Temperature Scheme, “G” denotes global temperature parameter, while “I” denotes individualized temperature parameters for each data.

Algorithm	Loss	FCCO	Distributed	Inner LR Schedule	Temperature Scheme
OpenCLIP [22]	(MBCL)	No	Yes	N/A	G, Learnable
SogCLR [56]	(GCL)	Yes	No	Constant	G, Constant
iSogCLR [37]	(RGCL)	Yes	No	Constant	I, Learnable
FastCLIP-v0	(GCL)	Yes	Yes	Cosine	G, Learnable
FastCLIP-v1	(GCL)	Yes	Yes	Cosine	G, Constant
FastCLIP-v2	(RGCL)	Yes	Yes	Cosine	I, Learnable
FastCLIP-v3	(RGCL-g)	Yes	Yes	Cosine	G, Learnable

5 Benchmark of Optimization Components

We benchmark three components of the FastCLIP framework, i.e., the schedule for inner LR $\gamma_{t}$ , the update rule of the temperature parameter, and the optimizer for updating the model parameters.

The Inner LR Schedule: We first explore different schedules for $\gamma_{t}$ in Eqn. (1), which is interpreted as an SGD step with learning rate (LR) $\gamma_{t}$ by Wang and Yang [49]. They showed in theory that $\gamma_{t}$ should be set to a very small value close to 0 in order to guarantee convergence. However, in practice a large $\gamma_{t}$ value close to 1 is adopted [56]. Ideally, $\gamma_{t}$ should be large at earlier iterations, to rely more on the current mini-batch, and smaller at later iterations, to rely more on history. To achieve this, we consider a cosine schedule to decrease $\gamma_{t}$ : Let $t$ be the current iteration, $\hat{E}$ be the number of iterations per epoch, and $E$ be the number of decay epochs; then we set $\gamma_{t}=0.5\cdot(1+\cos(\pi\lfloor t/\hat{E}\rfloor/E))\cdot(1-\gamma_{\mathrm{min}})+\gamma_{\mathrm{min}}$ . With this schedule, $\gamma_{t}$ decreases from 1.0 to $\gamma_{\mathrm{min}}$ . Note that $\lfloor t/\hat{E}\rfloor$ denotes the current epoch, which means the value of $\gamma_{t}$ stays unchanged within one epoch. Also, the number of decay epochs $E$ is a hyperparameter, and it is not necessarily equal to the total number of training epochs. If the current epoch exceeds $E$ , $\gamma_{t}$ is set to $\gamma_{\mathrm{min}}$ .
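The schedule described above translates directly into a short function. This is a minimal sketch of the stated formula; the function and argument names are hypothetical, and the values in the usage lines (100 iterations per epoch, 10 decay epochs, $\gamma_{\mathrm{min}}=0.2$) are illustrative.

```python
import math

def inner_lr(t, iters_per_epoch, decay_epochs, gamma_min):
    """Cosine schedule for the inner learning rate gamma_t.

    t: current iteration, iters_per_epoch: E-hat, decay_epochs: E.
    The value is constant within an epoch and equals gamma_min once the
    current epoch reaches or exceeds decay_epochs.
    """
    epoch = t // iters_per_epoch
    if epoch >= decay_epochs:
        return gamma_min
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / decay_epochs))
    return cosine * (1.0 - gamma_min) + gamma_min

# Illustrative values: one value of gamma per epoch for 12 epochs.
schedule = [inner_lr(e * 100, 100, 10, 0.2) for e in range(12)]
```

At epoch $E$ the cosine term already evaluates to $\gamma_{\mathrm{min}}$, so clamping at `epoch >= decay_epochs` is consistent with the formula.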

The Temperature Parameter Updates: At Line 12 of Alg. 1, the temperature parameter $\tau$ is updated. The update rule is not given explicitly there because several variants exist. We consider four different versions, v0 to v3. Specifically, v1 sets $\tau$ to a constant as in SogCLR, and the other three view $\tau$ as a learnable parameter; v2 leverages the same $\tau$ update as iSogCLR, which maintains individual temperature parameters for each data point and updates them using the gradient of (RGCL) w.r.t. $\tau$ . A potential issue of maintaining and updating individualized temperatures is that it may overfit the data and hence harm the generalization for large-scale data. To mitigate this issue, we also consider the following loss, which unifies the individual temperatures in (RGCL) into a single global one:

$$\min_{\tau\geq\tau_{0}}\frac{\tau}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}\left(\log\left(\frac{1}{|\mathcal{S}_{i-}|}+g_{1}(\bm{w},\tau,i,\mathcal{S}_{i-})\right)+\log\left(\frac{1}{|\mathcal{S}_{i-}|}+g_{2}(\bm{w},\tau,i,\mathcal{S}_{i-})\right)\right)+2\rho\tau.\tag{RGCL-g}$$

We refer to this version as v3. We also include a baseline version named v0 that updates $\tau$ using the gradient of an unscaled version of (GCL) that does not multiply $\tau$ , similar to existing $\tau$ updates [38, 7] based on MBCL. The explicit rules of all updates are deferred to Proc. 5 in Appendix A. Combining the four versions of updating/setting $\tau$ with the cosine inner LR schedule, we get four algorithms, FastCLIP-v0 to v3. A comparison between them and existing algorithms is shown in Table 1. Different updates of $\tau$ also lead to slightly different ways of computing the contrastive losses and gradient estimator (Line 6 and Line 9 in Alg. 1), and the details are deferred to Appendix A.
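To see where the $2\rho\tau$ term in (RGCL-g) comes from, one can substitute a single shared temperature $\tau_{1,i}=\tau_{2,i}=\tau$ for all $i$ into the objective of (RGCL). The short derivation below uses only the definitions from Section 3, with the shorthand $\varepsilon=1/|\mathcal{S}_{i-}|$ and $g_{1,i}$, $g_{2,i}$ for $g_{1}(\bm{w},\tau,i,\mathcal{S}_{i-})$ and $g_{2}(\bm{w},\tau,i,\mathcal{S}_{i-})$:

```latex
\begin{aligned}
&\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}
  \Big(\tau\big(\log(\varepsilon+g_{1,i})+\rho\big)
      +\tau\big(\log(\varepsilon+g_{2,i})+\rho\big)\Big)\\
&\quad=\frac{\tau}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}
  \Big(\log(\varepsilon+g_{1,i})+\log(\varepsilon+g_{2,i})\Big)
  +\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}2\rho\tau\\
&\quad=\frac{\tau}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}
  \Big(\log(\varepsilon+g_{1,i})+\log(\varepsilon+g_{2,i})\Big)+2\rho\tau .
\end{aligned}
```

The first term is exactly (GCL), so (RGCL-g) can be read as (GCL) plus a linear penalty $2\rho\tau$ on the temperature, minimized over $\tau\geq\tau_{0}$.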

The Optimizer: We compare the performance of two optimizers (i.e., the update rules of the model parameters and the temperature at Lines 11 to 12 in Alg. 1) in FastCLIP: AdamW [32] and LAMB [53]. The update rules of the two optimizers are presented in Proc. 4 in Appendix A for completeness.

Experiment Settings: We conduct experiments in three different settings, which differ in data scales, vision encoders, and training environments. The differences are presented in Table 2. In all settings, we use a 12-layer transformer [48] as the text encoder. All the experiments are conducted in the multi-node setting where each node has 4 GPUs. Due to its extreme size, the xlarge-scale setting is only used to compare the best version of FastCLIP with OpenCLIP.

Metrics: To evaluate the performance of the trained models, we leverage the Datacomp Benchmark [14], which includes 38 zero-shot downstream tasks. The evaluation metric is the average performance, which is called Datacomp. We also report the average performance on two subsets of the tasks: ImageNet and its different variants (IN & Variants), and Retrieval. IN & Variants tasks consist of ImageNet-1k [10] and 6 ImageNet distribution shift datasets [50, 42, 18, 17, 1] [14, Section 3.5]. Retrieval tasks consist of Flickr30k [54], MSCOCO [4], and WinoGAViL [2].

Table 2: Overview of the experiment settings. # Samples denotes the size of the dataset downloaded. Batch Size denotes per-GPU batch size, with global batch size specified in parentheses.

Setting	Dataset	# Samples/Epochs	Vision Encoder	Batch Size	GPUs
Medium	CC3M [45]	2.7M/37 epochs	ResNet50 [16]	128 (1024)	8 Tesla T4
Large	CC12M [3]	9.1M/33 epochs	ViT-B/32 [11]	256 (2048)	8 Tesla T4
xLarge	LAION400M [43]	315M/30 epochs	ViT-B/16 [11]	320 (5120)	16 A100

5.1 Results

In this subsection, we present the benchmark results. We report results averaged over 3 runs with different seeds, and standard deviation in parentheses. Training details are provided in Appendix B.

The Inner LR Schedule: We first present results of different $\gamma$ schedules. We compare three pairs of approaches: SogCLR and FastCLIP-v1; iSogCLR and FastCLIP-v2; FastCLIP-v3 with Constant $\gamma$ and FastCLIP-v3, where the former of each pair uses the constant $\gamma$ schedule and the latter uses the cosine $\gamma$ schedule. SogCLR and iSogCLR are implemented in the same framework as FastCLIP. The results are presented in Table 3. We can observe that all three approaches obtain a significant performance gain when equipped with the cosine schedule. This indicates that the cosine schedule performs better than the constant schedule. Also, when tuning the $\gamma$ value for the two schedules, we observe that the constant schedule favors larger $\gamma$ values (0.6 or 0.8), while the cosine schedule favors a small $\gamma$ value (0.2) (c.f. Table 7 in Appendix B). These results suggest that (1) $\gamma$ needs to be set to a small value as the theory predicts, but (2) instead of being constant, its value should gradually decrease.

The Temperature Parameter Updates: Next, we present the benchmark results of different $\tau$ updates. We compare the four versions of FastCLIP. The results are presented in Table 4. We have the following observations. In the medium-scale setting, the average performance on Datacomp of the four algorithms is close. FastCLIP-v3 performs better than the others on either Retrieval or IN & Variants. In the large-scale setting, FastCLIP-v3 outperforms the other algorithms on Datacomp and Retrieval. This demonstrates the effectiveness of FastCLIP-v3. We can also see that FastCLIP-v0 and v2 are competitive with each other, while FastCLIP-v1 is generally worse in this setting.

The Optimizer: We use FastCLIP-v3 as the base algorithm and compare the AdamW and LAMB optimizer. The results are presented in Table 5. We observe that AdamW works better than LAMB in both settings. This indicates that AdamW should be chosen in favor of LAMB for FastCLIP.

6 Scaling Performance of FastCLIP

In this section, we benchmark the performance of FastCLIP using AdamW on different numbers of nodes in comparison with OpenCLIP. We conduct experiments on 1, 2, 4, and 8 node(s). Except for the number of nodes, other settings are kept the same as the experiment settings specified in Section 5. Training details and additional experiment results are provided in Appendix B and C, respectively.

Table 3: Performance of different inner LR schedules. FastCLIP-v1, FastCLIP-v2, and FastCLIP-v3 (shaded in the original table) use the cosine schedule, while the others use the constant schedule. Improvement denotes the absolute difference between the two algorithms of each pair on the three metrics. ^∗: v3 (Const. $\gamma$ ) denotes FastCLIP-v3 with constant $\gamma$ schedule.

Setting	Algorithm	Datacomp	Retrieval	IN & Variants	Improvement
Medium	SogCLR	23.41 (0.34)	27.48 (0.24)	16.90 (0.01)	
Medium	FastCLIP-v1	24.87 (0.13)	29.28 (0.30)	18.86 (0.09)	1.46, 1.80, 1.96
Medium	iSogCLR	23.35 (0.63)	27.92 (0.34)	17.05 (0.14)	
Medium	FastCLIP-v2	24.10 (0.34)	29.32 (1.29)	18.52 (0.37)	0.75, 1.40, 1.47
Medium	v3 (Const. $\gamma$ )^∗	23.60 (0.18)	27.68 (0.17)	17.33 (0.22)	
Medium	FastCLIP-v3	24.76 (0.26)	30.36 (0.18)	19.08 (0.16)	1.16, 2.68, 1.75
Large	SogCLR	29.91 (0.23)	30.16 (0.36)	22.98 (0.07)	
Large	FastCLIP-v1	30.65 (0.11)	32.66 (0.12)	24.26 (0.06)	0.74, 2.50, 1.28
Large	iSogCLR	30.32 (0.18)	30.27 (0.41)	24.96 (0.09)	
Large	FastCLIP-v2	30.94 (0.20)	31.84 (0.17)	25.52 (0.17)	0.62, 1.57, 0.56
Large	v3 (Const. $\gamma$ )^∗	29.46 (0.39)	30.33 (0.58)	23.69 (0.09)	
Large	FastCLIP-v3	31.60 (0.46)	34.88 (0.28)	24.78 (0.28)	2.14, 4.55, 1.09

Table 4: Performance of different temperature parameter updates.

Setting	Algorithm	Datacomp	Retrieval	IN & Variants
Medium	FastCLIP-v0	24.71 (0.21)	30.36 (0.26)	17.50 (0.33)
Medium	FastCLIP-v1	24.87 (0.13)	29.28 (0.30)	18.86 (0.09)
Medium	FastCLIP-v2	24.21 (0.76)	30.35 (0.47)	17.86 (0.21)
Medium	FastCLIP-v3	24.76 (0.26)	30.36 (0.18)	19.08 (0.16)
Large	FastCLIP-v0	31.47 (0.31)	34.86 (0.53)	24.55 (0.21)
Large	FastCLIP-v1	30.65 (0.11)	32.66 (0.12)	24.26 (0.06)
Large	FastCLIP-v2	30.95 (0.32)	33.71 (0.20)	24.94 (0.18)
Large	FastCLIP-v3	31.60 (0.46)	34.88 (0.28)	24.78 (0.28)

Table 5: Performance of different optimizers. Gap denotes improvements of AdamW on three metrics.

Setting	Algorithm	Datacomp	Retrieval	IN & Variants	Gap
Medium	LAMB	22.63 (0.30)	24.87 (0.27)	16.43 (0.06)
Medium	AdamW	24.76 (0.26)	30.36 (0.18)	19.08 (0.16)	2.13, 5.49, 2.65
Large	LAMB	30.54 (0.24)	34.02 (0.26)	24.11 (0.21)
Large	AdamW	31.60 (0.46)	34.88 (0.28)	24.78 (0.28)	1.06, 0.86, 0.67

Performance: The results of selected models based on the average Datacomp performance are presented in Figure 4. Subfigures (a) and (b) show the IN & Variants and Retrieval performance in the medium-scale setting, and subfigures (c) and (d) show the results in the large-scale setting. FastCLIP-v3 consistently outperforms OpenCLIP across different numbers of nodes, which illustrates the advantage of the GCL family over MBCL. Also, FastCLIP-v3’s performance plateaus at 2 nodes, which indicates that FastCLIP does not require a large amount of computing resources. In contrast, OpenCLIP shows a significant performance gain when scaling from 2 nodes to 8 nodes, meaning that it requires a large amount of computing resources to obtain good performance. Additionally, Figure 1 demonstrates the significant speedup of FastCLIP-v3 over OpenCLIP.

Training Time: Besides the performance on downstream tasks, we also benchmark the training time of OpenCLIP and FastCLIP-v1 to v3. We use the PyTorch [36] Profiler to record the data. We break down the per-iteration training time into three parts: computation, pure communication (not overlapped with computation), and others. The results are plotted in Figure 4 (a) and (b). We also break down communication into two parts, communication overlapped with computation and pure communication, which are plotted in Figure 4 (c) and (d). From subfigures (a) and (b) we can see that the running time of FastCLIP is similar to that of OpenCLIP when the number of nodes is small (1 and 2), and becomes shorter than that of OpenCLIP when the number of nodes scales up (4 and 8). This is because OpenCLIP has a longer communication time on 4 and 8 nodes (subfigures (c) and (d)), which demonstrates the effectiveness of our efficient gradient computation/communication strategy described in Section 4. For each algorithm, we also plot its speedup over 1 node in terms of training time in Figure 4 (a) and (b). All algorithms have similar speedup over 1 node, and the gap between the ideal speedup (which equals the number of nodes) and the real speedup becomes larger as the number of nodes scales up. This indicates that training with more resources yields diminishing returns.
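The quantities discussed above can be computed as in the following sketch. It is illustrative only: the helper names are assumptions, and the timing values in the demo are hypothetical placeholders, not measurements from our experiments.

```python
def speedup(time_one_node: float, time_n_nodes: float) -> float:
    """Real speedup of an n-node run relative to the 1-node run."""
    return time_one_node / time_n_nodes


def scaling_efficiency(time_one_node: float, time_n_nodes: float, num_nodes: int) -> float:
    """Fraction of the ideal speedup (num_nodes) that is actually achieved."""
    return speedup(time_one_node, time_n_nodes) / num_nodes


def pure_communication(total_comm: float, overlapped_comm: float) -> float:
    """Communication time that is not hidden behind computation."""
    return total_comm - overlapped_comm


if __name__ == "__main__":
    # Hypothetical training times in arbitrary units, NOT results from the paper.
    times = {1: 8.0, 2: 4.4, 4: 2.6, 8: 1.7}
    for nodes, t in times.items():
        print(nodes, speedup(times[1], t), scaling_efficiency(times[1], t, nodes))
```

A scaling efficiency that falls as the number of nodes grows corresponds to the widening gap between ideal and real speedup.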

Moreover, we benchmark the performance of FastCLIP-v3 and OpenCLIP in the xlarge-scale setting with 4 nodes. We plot the ImageNet-1k top-1 accuracy curve in Figure 4(c). OpenCLIP achieves a top-1 accuracy of 53.36% on ImageNet-1k, while FastCLIP-v3 achieves an accuracy of 54.92%, a gain of 1.56 percentage points. We also evaluate the average performance on Datacomp, where FastCLIP-v3 and OpenCLIP perform similarly (see Appendix C).

In summary, the results in this section demonstrate the effectiveness of FastCLIP across different data scales (3 million to 315 million) and compute scales (1 to 8 nodes) in the limited-resource setting.

7 Conclusion

In this paper, we have proposed FastCLIP, a distributed training framework for CLIP models in a resource-limited setting. It leverages advanced compositional optimization with a novel gradient computation strategy to reduce the communication cost. We have investigated different optimization components by proposing new techniques and benchmarking different techniques for each component under different settings, which provides insights on which techniques to use. Finally, leveraging the best-performing techniques from the benchmark results, we compare the performance of FastCLIP with OpenCLIP on different data scales and compute scales, from 3 million to 315 million image-text pairs and from 1 node to 8 nodes. The results demonstrate that FastCLIP outperforms OpenCLIP by a large margin and achieves a significant speedup. A limitation of this work is that extensive benchmark results in the extremely large-scale setting are lacking due to our limited computing budget.

References

Barbu et al. [2019] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/97af07a14cacba681feacf3012730892-Paper.pdf.
Bitton et al. [2022] Yonatan Bitton, Nitzan Bitton Guetta, Ron Yosef, Yuval Elovici, Mohit Bansal, Gabriel Stanovsky, and Roy Schwartz. Winogavil: Gamified association benchmark to challenge vision-and-language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 26549–26564. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/a96fe863f85c59789bba63588a9557b4-Paper-Datasets_and_Benchmarks.pdf.
Changpinyo et al. [2021] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
Chen et al. [2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
Chen et al. [2023a] Yihao Chen, Xianbiao Qi, Jianan Wang, and Lei Zhang. Disco-clip: A distributed contrastive loss for memory efficient clip training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22648–22657, June 2023a.
Chen et al. [2023b] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023b.
Cherti et al. [2023] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, June 2023.
Crowson et al. [2022] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 88–105, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19836-6.
Cui et al. [2022] Yufeng Cui, Lichen Zhao, Feng Liang, Yangguang Li, and Jing Shao. Democratizing contrastive language-image pre-training: A clip benchmark of data, model, and supervision. arXiv preprint arXiv:2203.05796, 2022.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
Fang et al. [2023] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
Fang et al. [2021] Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lijuan Wang, Yezhou Yang, and Zicheng Liu. Compressing visual-linguistic model via knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1428–1438, October 2021.
Gadre et al. [2023] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei W Koh, Olga Saukh, Alexander J Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, and Ludwig Schmidt. Datacomp: In search of the next generation of multimodal datasets. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 27092–27112. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/56332d41d55ad7ad8024aac625881be7-Paper-Datasets_and_Benchmarks.pdf.
Goel et al. [2022] Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan Rossi, Vishwa Vinay, and Aditya Grover. Cyclip: Cyclic contrastive language-image pretraining. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 6704–6719. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/2cd36d327f33d47b372d4711edd08de0-Paper-Conference.pdf.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Hendrycks et al. [2021a] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8340–8349, October 2021a.
Hendrycks et al. [2021b] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15262–15271, June 2021b.
Hessel et al. [2021] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.595. URL https://aclanthology.org/2021.emnlp-main.595.
Huang et al. [2023a] Runhui Huang, Yanxin Long, Jianhua Han, Hang Xu, Xiwen Liang, Chunjing Xu, and Xiaodan Liang. Nlip: Noise-robust language-image pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1):926–934, Jun. 2023a. doi: 10.1609/aaai.v37i1.25172. URL https://ojs.aaai.org/index.php/AAAI/article/view/25172.
Huang et al. [2023b] Zizheng Huang, Haoxing Chen, Ziqi Wen, Chao Zhang, Huaxiong Li, Bo Wang, and Chunlin Chen. Model-aware contrastive learning: Towards escaping the dilemmas. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 13774–13790. PMLR, 23–29 Jul 2023b. URL https://proceedings.mlr.press/v202/huang23c.html.
Ilharco et al. [2021] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. https://doi.org/10.5281/zenodo.5143773, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
Kukleva et al. [2023] Anna Kukleva, Moritz Böhle, Bernt Schiele, Hilde Kuehne, and Christian Rupprecht. Temperature schedules for self-supervised contrastive methods on long-tail data. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ejHUr4nfHhD.
Lee et al. [2022] Janghyeon Lee, Jongsuk Kim, Hyounguk Shon, Bumsoo Kim, Seung Hwan Kim, Honglak Lee, and Junmo Kim. Uniclip: Unified framework for contrastive language-image pre-training. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 1008–1019. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/072fd0525592b43da661e254bbaadc27-Paper-Conference.pdf.
Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. CoRR, abs/2201.12086, 2022. URL https://arxiv.org/abs/2201.12086.
Li et al. [2020] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow., 13(12):3005–3018, aug 2020. ISSN 2150-8097. doi: 10.14778/3415478.3415530. URL https://doi.org/10.14778/3415478.3415530.
Li et al. [2023a] Xianhang Li, Zeyu Wang, and Cihang Xie. An inverse scaling law for clip training. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 49068–49087. Curran Associates, Inc., 2023a. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/996e2b446391fcb8bf32a3d1645cc799-Paper-Conference.pdf.
Li et al. [2023b] Xuanlin Li, Yunhao Fang, Minghua Liu, Zhan Ling, Zhuowen Tu, and Hao Su. Distilling large vision-language model with out-of-distribution generalizability. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2492–2503, October 2023b.
Li et al. [2023c] Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23390–23400, 2023c.
Li et al. [2024] Zichao Li, Cihang Xie, and Ekin Dogus Cubuk. Scaling (down) CLIP: A comprehensive analysis of data, architecture, and training strategies. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=t4nnCi5AO6.
Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Skq89Scxx.
Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
Mo et al. [2023] Sangwoo Mo, Minkyu Kim, Kyungmin Lee, and Jinwoo Shin. S-clip: Semi-supervised vision-language learning using few specialist captions. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 61187–61212. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/c06f788963f0ce069f5b2dbf83fe7822-Paper-Conference.pdf.
Mokady et al. [2021] Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
Mu et al. [2022] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 529–544, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19809-0.
Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
Qiu et al. [2023] Zi-Hao Qiu, Quanqi Hu, Zhuoning Yuan, Denny Zhou, Lijun Zhang, and Tianbao Yang. Not all semantics are created equal: Contrastive self-supervised learning with automatic temperature individualization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28389–28421. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/qiu23a.html.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/radford21a.html.
Rajbhandari et al. [2020] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16, 2020. doi: 10.1109/SC41405.2020.00024.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Rasley et al. [2020] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450379984. doi: 10.1145/3394486.3406703. URL https://doi.org/10.1145/3394486.3406703.
Recht et al. [2019] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5389–5400. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/recht19a.html.
Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
Sharma et al. [2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.
Sun et al. [2023] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
Sun et al. [2024] Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, and Xinlong Wang. Eva-clip-18b: Scaling clip to 18 billion parameters. arXiv preprint arXiv:2402.04252, 2024.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Wang and Yang [2022] Bokun Wang and Tianbao Yang. Finite-sum coupled compositional stochastic optimization: Theory and applications. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 23292–23317. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/wang22ak.html.
Wang et al. [2019] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/3eefceb8087e964f89c2d59e8a249915-Paper.pdf.
Wu et al. [2023] Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi Stephen Chen, Xinggang Wang, et al. Tinyclip: Clip distillation via affinity mimicking and weight inheritance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21970–21980, 2023.
Xie et al. [2023] Chen-Wei Xie, Siyang Sun, Xiong Xiong, Yun Zheng, Deli Zhao, and Jingren Zhou. Ra-clip: Retrieval augmented contrastive language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19265–19274, June 2023.
You et al. [2020] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Syx4wnEtvH.
Young et al. [2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
Yu et al. [2022] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=Ee277P3AYC.
Yuan et al. [2022] Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, and Tianbao Yang. Provable stochastic optimization for global contrastive learning: Small batch does not harm performance. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 25760–25782. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/yuan22b.html.
Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, October 2023.
Zhang et al. [2022] Chaoning Zhang, Kang Zhang, Trung X. Pham, Axi Niu, Zhinan Qiao, Chang D. Yoo, and In So Kweon. Dual temperature helps contrastive learning without many negative samples: Towards understanding and simplifying moco. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14441–14450, June 2022.
Zhou et al. [2022] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. Towards language-free training for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17907–17917, 2022.

Appendix A Details of the FastCLIP Framework

/* global, individual $\tau$: temperature scheme (cf. Table 1) */
1 if global $\tau$ then
2     Compute $g_{1,i}^{t}=g_{1}(\bm{w}^{t},\tau^{t},i,\mathcal{B}_{i-}^{t})$, $g_{2,i}^{t}=g_{2}(\bm{w}^{t},\tau^{t},i,\mathcal{B}_{i-}^{t})$
3 else if individual $\tau$ then
4     Compute $g_{1,i}^{t}=g_{1}(\bm{w}^{t},\tau_{1,i}^{t},i,\mathcal{B}_{i-}^{t})$, $g_{2,i}^{t}=g_{2}(\bm{w}^{t},\tau_{2,i}^{t},i,\mathcal{B}_{i-}^{t})$

Procedure 2 contrastive_loss
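As an illustration of this procedure, the sketch below computes $g_{1,i}$ and $g_{2,i}$ for one batch with NumPy. It is not the released implementation. It assumes that the pairwise losses take the form $\ell_{1}=\exp((s_{i,j}-s_{i,i})/\tau)$ with $s_{i,j}$ the inner product between image $i$ and text $j$ (and symmetrically for $\ell_{2}$), that embeddings are L2-normalized, and that $g_{1}$, $g_{2}$ average these losses over the negatives $j\neq i$ in the batch. The global and individual temperature schemes differ only in whether `tau1` and `tau2` are scalars or per-sample arrays.

```python
import numpy as np


def contrastive_loss(e1, e2, tau1, tau2):
    """Return (g1, g2), each of shape (n,), for one batch.

    e1, e2: (n, d) image and text embeddings (assumed L2-normalized).
    tau1, tau2: a scalar (global tau) or an (n,) array (individual tau).
    """
    n = e1.shape[0]
    sim = e1 @ e2.T                       # sim[i, j] = <e1_i, e2_j>
    pos = np.diag(sim)                    # similarities of the positive pairs
    tau1 = np.broadcast_to(np.asarray(tau1, dtype=float), (n,))
    tau2 = np.broadcast_to(np.asarray(tau2, dtype=float), (n,))
    l1 = np.exp((sim - pos[:, None]) / tau1[:, None])    # image i against text j
    l2 = np.exp((sim.T - pos[:, None]) / tau2[:, None])  # text i against image j
    neg = ~np.eye(n, dtype=bool)          # drop the positive pair j == i
    g1 = (l1 * neg).sum(axis=1) / (n - 1)
    g2 = (l2 * neg).sum(axis=1) / (n - 1)
    return g1, g2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e1 = rng.standard_normal((6, 4))
    e2 = rng.standard_normal((6, 4))
    e1 /= np.linalg.norm(e1, axis=1, keepdims=True)
    e2 /= np.linalg.norm(e2, axis=1, keepdims=True)
    print(contrastive_loss(e1, e2, 0.07, 0.07))
```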

/* global, individual $\tau$: temperature scheme (cf. Table 1) */
1 if global $\tau$ then
2     Compute $G_{\bm{w},a,k}^{t}$ and $G_{\bm{w},b,k}^{t}$ using (2) and (3), respectively
3 else if individual $\tau$ then
4     Compute $G_{\bm{w},a,k}^{t}$ and $G_{\bm{w},b,k}^{t}$ using (6) and (7), respectively

Procedure 3 gradient_estimator

Input: Parameter $\theta^{t}$ (can be $\bm{w}$, $\tau$) and its gradient estimator $G_{\theta}^{t}$, Hyperparameters $\beta_{1},\beta_{2},\epsilon$, Weight decay $\lambda$, Learning rate $\eta_{t}$
/* We only consider AdamW and LAMB here. */
1 Compute $m^{t+1}=\beta_{1}m^{t}+(1-\beta_{1})G_{\theta}^{t}$
2 Compute $v^{t+1}=\beta_{2}v^{t}+(1-\beta_{2})(G_{\theta}^{t})^{2}$
3 Compute $\hat{m}^{t+1}=m^{t+1}/(1-(\beta_{1})^{t+1})$, $\hat{v}^{t+1}=v^{t+1}/(1-(\beta_{2})^{t+1})$
4 if optimizer is AdamW then
5     Set $\theta^{t+1}=\theta^{t}-\eta_{t}\left(\hat{m}^{t+1}/(\sqrt{\hat{v}^{t+1}}+\epsilon)+\lambda\theta^{t}\right)$
6 else if optimizer is LAMB then
7     Compute $r^{t+1}=\hat{m}^{t+1}/(\sqrt{\hat{v}^{t+1}}+\epsilon)$
8     for each layer $\theta^{t,(i)}$ in $\theta^{t}$ do
9         Compute $\alpha_{t,(i)}=\|\theta^{t,(i)}\|_{2}/\|r^{t+1,(i)}+\lambda\theta^{t,(i)}\|_{2}$
10        Set $\theta^{t+1,(i)}=\theta^{t,(i)}-\eta_{t}\cdot\alpha_{t,(i)}\left(r^{t+1,(i)}+\lambda\theta^{t,(i)}\right)$

Procedure 4 parameter_update
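The update rules of this procedure can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: parameters are represented as a list of per-layer arrays, the default hyperparameter values are assumptions, and the guard against a zero denominator in the LAMB trust ratio is an added safeguard that does not appear in the procedure.

```python
import numpy as np


def parameter_update(layers, grads, m, v, t, lr, optimizer="adamw",
                     beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    """Apply one AdamW or LAMB step to a list of per-layer arrays.

    t is the 0-based iteration index, so bias correction uses beta ** (t + 1).
    Returns updated (layers, m, v) without modifying the inputs.
    """
    new_layers, new_m, new_v = [], [], []
    for theta, g, m_i, v_i in zip(layers, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g ** 2   # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** (t + 1))       # bias correction
        v_hat = v_i / (1.0 - beta2 ** (t + 1))
        direction = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta
        if optimizer == "adamw":
            scale = 1.0
        elif optimizer == "lamb":
            # Layer-wise trust ratio; the zero check is an added safeguard.
            denom = np.linalg.norm(direction)
            scale = np.linalg.norm(theta) / denom if denom > 0 else 1.0
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
        new_layers.append(theta - lr * scale * direction)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_layers, new_m, new_v


if __name__ == "__main__":
    theta = [np.array([1.0, 2.0]), np.array([[0.5, -0.5]])]
    grads = [np.array([0.5, -0.5]), np.array([[0.1, 0.2]])]
    zeros = [np.zeros_like(p) for p in theta]
    for name in ("adamw", "lamb"):
        print(name, parameter_update(theta, grads, zeros, zeros, t=0, lr=0.1, optimizer=name)[0])
```

With LAMB, the step length of each layer is `lr` times the norm of that layer's parameters, independent of the gradient magnitude. With AdamW, the step length depends on the bias-corrected moment ratio directly.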

1 if constant $\tau$ then /* FastCLIP-v1 */
2     Set $\tau^{t+1}=\tau^{t}$
3 else if learnable $\tau$ then
4     if loss is (GCL) then /* FastCLIP-v0 */
5         Compute $G_{\tau,k}^{t}$ using (8) and All_Reduce $G_{\tau}^{t}=\frac{1}{K}\sum_{l=1}^{K}G_{\tau,l}^{t}$
6         Update $\tau^{t+1}$ from $\tau^{t}$ and $G_{\tau}^{t}$ using Proc. 4 (with $\lambda=0$)$^{*}$
7     else if loss is (RGCL) then /* FastCLIP-v2 */
8         Compute $G_{\tau,1,i}^{t},G_{\tau,2,i}^{t}$ for $i\in\mathcal{B}_{k}^{t}$ using (9)
9         Update $\tau_{1,i}^{t+1}$ from $\tau_{1,i}^{t}$ and $G_{\tau,1,i}^{t}$, and update $\tau_{2,i}^{t+1}$ from $\tau_{2,i}^{t}$ and $G_{\tau,2,i}^{t}$ using Proc. 4 (with $\lambda=0$) for $i\in\mathcal{B}_{k}^{t}$
10    else if loss is (RGCL-g) then /* FastCLIP-v3 */
11        Compute $G_{\tau,k}^{t}$ using (10) and All_Reduce $G_{\tau}^{t}=\frac{1}{K}\sum_{l=1}^{K}G_{\tau,l}^{t}$
12        Update $\tau^{t+1}$ from $\tau^{t}$ and $G_{\tau}^{t}$ using Proc. 4 (with $\lambda=0$)

Procedure 5 temperature_update

Derivation of gradient of (GCL) w.r.t. $\bm{w}$: Given a global batch $\mathcal{B}$, the gradient of (GCL) w.r.t. $\bm{w}$ is given by $G_{\bm{w},a}+G_{\bm{w},b}$, where

$$\begin{aligned}
G_{\bm{w},a}={}&\tau\cdot\frac{1}{K}\sum_{k=1}^{K}\overbrace{\frac{1}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}}\cdot\frac{1}{K}\sum_{k^{\prime}=1}^{K}\frac{1}{|\mathcal{B}_{k^{\prime},i-}|}\sum_{j\in\mathcal{B}_{k^{\prime},i-}}\nabla_{1}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{i}}^{G_{\bm{w},a,1,k}}\\
&+\tau\cdot\frac{1}{K}\sum_{k=1}^{K}\overbrace{\frac{1}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}}\cdot\frac{1}{K}\sum_{k^{\prime}=1}^{K}\frac{1}{|\mathcal{B}_{k^{\prime},i-}|}\sum_{j\in\mathcal{B}_{k^{\prime},i-}}\nabla_{1}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau)\cdot\nabla\bm{e}_{i}}^{G_{\bm{w},a,2,k}}.
\end{aligned}$$

$$\begin{aligned}
G_{\bm{w},b}={}&\tau\cdot\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}}\cdot\frac{1}{K}\sum_{k^{\prime}=1}^{K}\frac{1}{|\mathcal{B}_{k^{\prime},i-}|}\sum_{j\in\mathcal{B}_{k^{\prime},i-}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{2,j}\\
&+\tau\cdot\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}}\cdot\frac{1}{K}\sum_{k^{\prime}=1}^{K}\frac{1}{|\mathcal{B}_{k^{\prime},i-}|}\sum_{j\in\mathcal{B}_{k^{\prime},i-}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau)\cdot\nabla\bm{e}_{1,j}.
\end{aligned}$$

To compute $G_{\bm{w},a}$, we first gather all the $\bm{e}_{2,j}$ and $\bm{e}_{1,j}$ to each worker using All_Gather, then compute $G_{\bm{w},a,1,k}$ and $G_{\bm{w},a,2,k}$ on the $k$-th worker, and average $G_{\bm{w},a,1,k}$ and $G_{\bm{w},a,2,k}$ over the workers using All_Reduce. To compute $G_{\bm{w},b}$, we first switch the inner and outer averages:

$$\begin{aligned}
G_{\bm{w},b}={}&\tau\cdot\frac{1}{K}\sum_{k^{\prime}=1}^{K}\overbrace{\frac{1}{|\mathcal{B}_{k^{\prime}}|}\sum_{j\in\mathcal{B}_{k^{\prime}}}\cdot\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{B}_{k,j-}|}\sum_{i\in\mathcal{B}_{k,j-}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{2,j}}^{G_{\bm{w},b,1,k^{\prime}}}\\
&+\tau\cdot\frac{1}{K}\sum_{k^{\prime}=1}^{K}\overbrace{\frac{1}{|\mathcal{B}_{k^{\prime}}|}\sum_{j\in\mathcal{B}_{k^{\prime}}}\cdot\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{B}_{k,j-}|}\sum_{i\in\mathcal{B}_{k,j-}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau)\cdot\nabla\bm{e}_{1,j}}^{G_{\bm{w},b,2,k^{\prime}}}.
\end{aligned}$$

Then we gather all the $\bm{u}_{1,i}$ and $\bm{u}_{2,i}$ on each worker using All_Gather, compute $G_{\bm{w},b,1,k^{\prime}}$ and $G_{\bm{w},b,2,k^{\prime}}$ on the $k^{\prime}$-th worker, and average them across workers using All_Reduce to get $G_{\bm{w},b}$. For practical reasons, we switch the inner and outer averages in $G_{\bm{w},b,1,k^{\prime}}$ and $G_{\bm{w},b,2,k^{\prime}}$ again so that they can be computed along with $G_{\bm{w},a,1,k}$ and $G_{\bm{w},a,2,k}$ using the same function:

	$\displaystyle G_{\bm{w},b,1,k^{\prime}}=$	$\displaystyle\frac{1}{\|\mathcal{B}_{k^{\prime}}\|}\sum_{j\in\mathcal{B}_{k^{% \prime}}}\cdot\frac{1}{K}\sum_{k=1}^{K}\frac{1}{\|\mathcal{B}_{k,j-}\|}\sum_{i% \in\mathcal{B}_{k,j-}}\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}}\nabla_{2}% \ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{2,j}$
	$\displaystyle\overset{(*)}{=}$	$\displaystyle\frac{1}{\|\mathcal{B}_{k^{\prime}}\|}\sum_{j\in\mathcal{B}_{k^{% \prime}}}\cdot\frac{1}{\|\mathcal{B}_{j-}\|}\sum_{i\in\mathcal{B}_{j-}}\frac{1}{% \frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j% },\tau)\cdot\nabla\bm{e}_{2,j}$
	$\displaystyle=$	$\displaystyle\frac{1}{\|\mathcal{B}\|}\sum_{i\in\mathcal{B}}\frac{1}{\frac{1}{\|% \mathcal{S}_{i-}\|}+u_{1,i}}\cdot\frac{1}{\|\mathcal{B}_{k^{\prime}}\|}\cdot\frac% {\|\mathcal{B}\|}{\|\mathcal{B}_{j-}\|}\sum_{j\in\mathcal{B}_{k^{\prime},i-}}% \nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{2,j},$

where $(*)$ uses the fact that the average over local batch and workers is equal to the average over the global batch. Similarly,

	$\displaystyle G_{\bm{w},b,2,k^{\prime}}=$	$\displaystyle\frac{1}{\|\mathcal{B}\|}\sum_{i\in\mathcal{B}}\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}}\cdot\frac{1}{\|\mathcal{B}_{k^{\prime}}\|}\cdot\frac{\|\mathcal{B}\|}{\|\mathcal{B}_{j-}\|}\sum_{j\in\mathcal{B}_{k^{\prime},i-}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau)\cdot\nabla\bm{e}_{1,j}.$
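The fact used in $(*)$ can be checked numerically on simulated workers. The following is a minimal single-process NumPy sketch, not the FastCLIP implementation: it assumes $K$ workers with equal-size local batches, ignores the exclusion of the index $j$ for brevity, and uses hypothetical helper names (`all_gather`, `all_reduce_mean`) that only mimic the collective operations.

```python
import numpy as np

def all_gather(local_chunks):
    """Simulate All_Gather: every worker ends up with all chunks concatenated."""
    return np.concatenate(local_chunks)

def all_reduce_mean(local_results):
    """Simulate All_Reduce with averaging across workers."""
    return np.mean(local_results, axis=0)

rng = np.random.default_rng(0)
K, local_bs = 4, 8  # hypothetical: 4 workers, 8 samples per worker

# One scalar per sample stands in for a per-sample gradient term.
local_values = [rng.normal(size=local_bs) for _ in range(K)]

# Each worker averages over its local batch, then the workers average their results.
two_level_mean = all_reduce_mean([v.mean() for v in local_values])

# Averaging directly over the gathered global batch gives the same number.
global_mean = all_gather(local_values).mean()

print(two_level_mean, global_mean)
```

The two printed values agree up to floating-point error only because all local batches have the same size; with unequal local batch sizes the two-level average would need to be weighted.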

Deferred Computation in Alg. 1: At iteration $t$, for SogCLR and other algorithms with a global temperature parameter (except FastCLIP-v0), the gradient estimator for $\bm{w}$ on the $k$-th worker is computed as

	$\displaystyle G_{\bm{w},a,k}^{t}=\frac{\tau^{t}}{\|\mathcal{B}_{k}^{t}\|}\sum_{i% \in\mathcal{B}_{k}^{t}}$	$\displaystyle\left(\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}^{t+1}}\left(% \frac{1}{\|\mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{1}\ell_% {1}(\bm{e}_{i},\bm{e}_{2,j},\tau^{t})\cdot\nabla\bm{e}_{i}\right)\right.$		(2)
		$\displaystyle\left.+\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}^{t+1}}\left(% \frac{1}{\|\mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{1}\ell_% {2}(\bm{e}_{i},\bm{e}_{1,j},\tau^{t})\cdot\nabla\bm{e}_{i}\right)\right).$		(2)

	$\displaystyle G_{\bm{w},b,k}^{t}=\frac{\tau^{t}}{\|\mathcal{B}^{t}\|}\sum_{i\in% \mathcal{B}^{t}}$	$\displaystyle\left(\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}^{t+1}}\left(% \frac{1}{\|\mathcal{B}_{k}^{t}\|}\cdot\frac{\|\mathcal{B}^{t}\|}{\|\mathcal{B}_{i-}% ^{t}\|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2% ,j},\tau^{t})\cdot\nabla\bm{e}_{2,j}\right)\right.$		(3)
		$\displaystyle\left.+\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}^{t+1}}\left(% \frac{1}{\|\mathcal{B}_{k}^{t}\|}\cdot\frac{\|\mathcal{B}^{t}\|}{\|\mathcal{B}_{i-}% ^{t}\|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1% ,j},\tau^{t})\cdot\nabla\bm{e}_{1,j}\right)\right).$		(3)

For FastCLIP-v0, we need to remove the $\tau^{t}$ at the front:

	$\displaystyle G_{\bm{w},a,k}^{t}=\frac{1}{\|\mathcal{B}_{k}^{t}\|}\sum_{i\in% \mathcal{B}_{k}^{t}}$	$\displaystyle\left(\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}^{t+1}}\left(% \frac{1}{\|\mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{1}\ell_% {1}(\bm{e}_{i},\bm{e}_{2,j},\tau^{t})\cdot\nabla\bm{e}_{i}\right)\right.$		(4)
		$\displaystyle\left.+\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}^{t+1}}\left(% \frac{1}{\|\mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{1}\ell_% {2}(\bm{e}_{i},\bm{e}_{1,j},\tau^{t})\cdot\nabla\bm{e}_{i}\right)\right).$		(4)

	$\displaystyle G_{\bm{w},b,k}^{t}=\frac{1}{\|\mathcal{B}^{t}\|}\sum_{i\in\mathcal% {B}^{t}}$	$\displaystyle\left(\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}^{t+1}}\left(% \frac{1}{\|\mathcal{B}_{k}^{t}\|}\cdot\frac{\|\mathcal{B}^{t}\|}{\|\mathcal{B}_{i-}% ^{t}\|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2% ,j},\tau^{t})\cdot\nabla\bm{e}_{2,j}\right)\right.$		(5)
		$\displaystyle\left.+\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}^{t+1}}\left(% \frac{1}{\|\mathcal{B}_{k}^{t}\|}\cdot\frac{\|\mathcal{B}^{t}\|}{\|\mathcal{B}_{i-}% ^{t}\|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1% ,j},\tau^{t})\cdot\nabla\bm{e}_{1,j}\right)\right).$		(5)

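For iSogCLR and other algorithms with individual temperature parameters, the gradient estimator is computed using a slightly different formula, in which the global $\tau^{t}$ is replaced by the individual temperatures $\tau_{1,i}^{t}$ and $\tau_{2,i}^{t}$: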

	$\displaystyle G_{\bm{w},a,k}^{t}=\frac{1}{\|\mathcal{B}_{k}^{t}\|}\sum_{i\in% \mathcal{B}_{k}^{t}}$	$\displaystyle\left(\frac{\tau_{1,i}^{t}}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}^% {t+1}}\left(\frac{1}{\|\mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}% \nabla_{1}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau_{1,i}^{t})\cdot\nabla\bm{e}_{i% }\right)\right.$		(6)
		$\displaystyle\left.+\frac{\tau_{2,i}^{t}}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}% ^{t+1}}\left(\frac{1}{\|\mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}% \nabla_{1}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau_{2,i}^{t})\cdot\nabla\bm{e}_{i% }\right)\right).$		(6)

	$\displaystyle G_{\bm{w},b,k}^{t}=\frac{1}{\|\mathcal{B}^{t}\|}\sum_{i\in\mathcal% {B}^{t}}$	$\displaystyle\left(\frac{\tau_{1,i}^{t}}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}^% {t+1}}\left(\frac{1}{\|\mathcal{B}_{k}^{t}\|}\cdot\frac{\|\mathcal{B}^{t}\|}{\|% \mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{1}(\bm{% e}_{i},\bm{e}_{2,j},\tau_{1,i}^{t})\cdot\nabla\bm{e}_{2,j}\right)\right.$		(7)
		$\displaystyle\left.+\frac{\tau_{2,i}^{t}}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}% ^{t+1}}\left(\frac{1}{\|\mathcal{B}_{k}^{t}\|}\cdot\frac{\|\mathcal{B}^{t}\|}{\|% \mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{2}(\bm{% e}_{i},\bm{e}_{1,j},\tau_{2,i}^{t})\cdot\nabla\bm{e}_{1,j}\right)\right).$		(7)

FastCLIP-v0 computes the following gradient estimator for $\tau$ :

	$\displaystyle G_{\tau,k}^{t}=$	$\displaystyle\frac{1}{\|\mathcal{B}_{k}^{t}\|}\sum_{i\in\mathcal{B}_{k}^{t}}\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}^{t+1}}\cdot\frac{1}{\|\mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau^{t})$		(8)
		$\displaystyle+\frac{1}{\|\mathcal{B}_{k}^{t}\|}\sum_{i\in\mathcal{B}_{k}^{t}}\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}^{t+1}}\cdot\frac{1}{\|\mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau^{t}).$		(8)

FastCLIP-v2 computes the following gradient estimators for $\tau$ :

	$\displaystyle G_{\tau,1,i}^{t}=$	$\displaystyle\frac{1}{\|\mathcal{S}\|}\left(\log\left(\frac{1}{\|\mathcal{S}_{i-}% \|}+u_{1,i}^{t+1}\right)+\rho+\tau_{1,i}^{t}\cdot\frac{1}{\frac{1}{\|\mathcal{S}% _{i-}\|}+u_{1,i}^{t+1}}\cdot\frac{1}{\|\mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{% B}_{i-}^{t}}\nabla_{3}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau_{1,i}^{t})\right),$		(9)
	$\displaystyle G_{\tau,2,i}^{t}=$	$\displaystyle\frac{1}{\|\mathcal{S}\|}\left(\log\left(\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}^{t+1}\right)+\rho+\tau_{2,i}^{t}\cdot\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}^{t+1}}\cdot\frac{1}{\|\mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau_{2,i}^{t})\right).$		(9)

FastCLIP-v3 computes the following gradient estimator for $\tau$ :

$\displaystyle G_{\tau,k}^{t}=$	$\displaystyle\frac{1}{\|\mathcal{B}_{k}^{t}\|}\sum_{i\in\mathcal{B}_{k}^{t}}% \left(\log\left(\frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}^{t+1}\right)+\log\left(% \frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}^{t+1}\right)\right)+2\rho$	(10)
	$\displaystyle+\tau^{t}\cdot\frac{1}{\|\mathcal{B}_{k}^{t}\|}\sum_{i\in\mathcal{B% }_{k}^{t}}\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{1,i}^{t+1}}\cdot\frac{1}{\|% \mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{1}(\bm{e}% _{i},\bm{e}_{2,j},\tau^{t})$
	$\displaystyle+\tau^{t}\cdot\frac{1}{\|\mathcal{B}_{k}^{t}\|}\sum_{i\in\mathcal{B% }_{k}^{t}}\frac{1}{\frac{1}{\|\mathcal{S}_{i-}\|}+u_{2,i}^{t+1}}\cdot\frac{1}{\|% \mathcal{B}_{i-}^{t}\|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{2}(\bm{e}% _{i},\bm{e}_{1,j},\tau^{t}).$
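To make the structure of Eq. (10) concrete, the sketch below evaluates it on a single worker with NumPy. It is an illustration under stated assumptions rather than the released implementation: we assume the pairwise losses take the form $\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)=\exp((s_{i,j}-s_{i,i})/\tau)$ and $\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau)=\exp((s_{j,i}-s_{i,i})/\tau)$ with $s_{i,j}$ the image-text similarity, we write `eps` for $1/\|\mathcal{S}_{i-}\|$, and all function names and toy values are hypothetical.

```python
import numpy as np

def offdiag_row_mean(m):
    """Average each row over j != i, i.e. over the in-batch negatives B_{i-}."""
    n = m.shape[0]
    return (m.sum(axis=1) - np.diag(m)) / (n - 1)

def pair_losses(sim, tau):
    """ASSUMED loss form: l1[i, j] = exp((s_ij - s_ii) / tau), l2[i, j] = exp((s_ji - s_ii) / tau)."""
    pos = np.diag(sim)[:, None]
    d1, d2 = sim - pos, sim.T - pos
    return np.exp(d1 / tau), np.exp(d2 / tau), d1, d2

def tau_grad_v3(sim, tau, u1, u2, rho, eps):
    """Evaluate Eq. (10) on one worker; eps stands for 1 / |S_{i-}|."""
    l1, l2, d1, d2 = pair_losses(sim, tau)
    # d/dtau exp(d / tau) = -exp(d / tau) * d / tau^2, averaged over j != i
    dl1 = offdiag_row_mean(-l1 * d1 / tau**2)
    dl2 = offdiag_row_mean(-l2 * d2 / tau**2)
    log_part = np.mean(np.log(eps + u1) + np.log(eps + u2)) + 2.0 * rho
    grad_part = tau * np.mean(dl1 / (eps + u1)) + tau * np.mean(dl2 / (eps + u2))
    return log_part + grad_part

rng = np.random.default_rng(0)
n, tau, rho, eps = 6, 0.5, 6.5, 1e-3  # hypothetical toy values
sim = rng.uniform(-0.3, 0.3, size=(n, n))  # sim[i, j] = <e_{1,i}, e_{2,j}>
l1, l2, _, _ = pair_losses(sim, tau)
# For illustration, set u to the exact inner averages instead of moving averages.
u1, u2 = offdiag_row_mean(l1), offdiag_row_mean(l2)
print(tau_grad_v3(sim, tau, u1, u2, rho, eps))
```

In training, $u_{1,i}^{t+1}$ and $u_{2,i}^{t+1}$ are the tracked estimates rather than exact batch averages; the exact values are used here only so that the example is self-contained.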

Appendix B Experiment Hyperparameters

Unless otherwise specified, for both FastCLIP and OpenCLIP, we use AdamW as the optimizer. For all settings, we use a cosine learning rate (LR) schedule for updating model parameters, which first linearly increases the LR from 0 to peak LR in the warmup stage, then decreases the LR to 0 at the end of training following a cosine function. The hyperparameters we use are specified in Table 6. Other hyperparameters regarding the inner learning rate schedule, temperature parameter updates, and the LAMB optimizer will be introduced in the paragraphs that follow.

Table 6: Hyperparameters for different settings. $\beta_{1}$, $\beta_{2}$, and $\epsilon$ are hyperparameters of the AdamW optimizer. lr denotes the learning rate, wd denotes the weight decay, and warmup denotes the number of iterations in the warmup stage.

Setting	$\beta_{1}$	$\beta_{2}$	$\epsilon$	lr	wd	warmup
Medium	0.9	0.999	1e-8	1e-3	0.1	10k
Large	0.9	0.98	1e-6	4e-4	0.1	10k
xLarge	0.9	0.98	1e-6	2e-4	0.2	13k
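The warmup-then-cosine learning rate schedule described above can be sketched as follows. This is a minimal illustration, not the training code: the function name `lr_at` and the total number of iterations are hypothetical, while the peak LR and warmup length are taken from the Medium row of Table 6.

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to peak_lr, then cosine decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Medium setting from Table 6: peak lr 1e-3, 10k warmup iterations.
# total_steps is a hypothetical value chosen only for the example.
peak_lr, warmup_steps, total_steps = 1e-3, 10_000, 100_000
for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, lr_at(step, peak_lr, warmup_steps, total_steps))
```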

The Inner LR Schedule: We compare three pairs of approaches: SogCLR and FastCLIP-v1; iSogCLR and FastCLIP-v2; and FastCLIP-v3 with constant $\gamma$ and FastCLIP-v3. The former of each pair uses a constant $\gamma$ schedule and the latter uses a cosine $\gamma$ schedule; the two approaches in a pair differ only in the $\gamma$ schedule. For approaches using a constant $\gamma$ schedule, we tune the value of $\gamma$ in $\{0.2,0.4,0.6,0.8\}$. For approaches using a cosine $\gamma$ schedule, we tune $\gamma_{\mathrm{min}}$ (the value to which $\gamma$ decays at the end) in $\{0.2,0.6\}$ and the number of decay epochs in $\{50\%,100\%\}$ of the number of training epochs. The $\gamma$ values for each algorithm are presented in Table 7. Other hyperparameters are kept the same within each pair. For SogCLR and FastCLIP-v1, we set the temperature parameter to 0.03. For iSogCLR and FastCLIP-v2, we set the initial temperature parameter to 0.03, $\rho$ to 9.0, and the learning rate of $\tau$ to 1e-2. For FastCLIP-v3 with constant $\gamma$ schedule and FastCLIP-v3, we set the initial temperature parameter to 0.07, $\rho$ to 6.5 in the medium-scale setting and 8.5 in the large-scale setting, and the learning rate of $\tau$ to 2e-4 in the medium-scale setting and 1e-4 in the large-scale setting. For FastCLIP-v3, the learning rate of $\tau$ decays to 1/3 of its original value when $\tau$ becomes smaller than 0.03.
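As an illustration of how $\gamma_{\mathrm{min}}$ and the number of decay epochs $E$ interact, the sketch below implements one plausible cosine $\gamma$ schedule. The exact form is an assumption on our part: we assume $\gamma$ starts at 1.0, is updated once per epoch, follows a half-cosine down to $\gamma_{\mathrm{min}}$ over $E$ epochs, and stays at $\gamma_{\mathrm{min}}$ afterwards. The name `gamma_at` and the parameter `gamma_start` are hypothetical.

```python
import math

def gamma_at(epoch, gamma_min, decay_epochs, gamma_start=1.0):
    """Half-cosine decay from gamma_start to gamma_min over decay_epochs, then constant."""
    if epoch >= decay_epochs:
        return gamma_min
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / decay_epochs))
    return gamma_min + (gamma_start - gamma_min) * cosine

# Example with gamma_min = 0.2 and E = 18 decay epochs.
for epoch in (0, 9, 18, 30):
    print(epoch, gamma_at(epoch, gamma_min=0.2, decay_epochs=18))
```

A constant $\gamma$ schedule corresponds to returning the same value for every epoch.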

Table 7: Values of $\gamma$ for different schedules in different settings. For the cosine $\gamma$ schedule, we report the $\gamma_{\mathrm{min}}$ value along with the number of $\gamma$ decay epochs $E$ (cf. Section 5). ^∗: v3 (Const. $\gamma$) denotes FastCLIP-v3 with constant $\gamma$ schedule.

Setting	Algorithm (constant $\gamma$)	$\gamma$	Algorithm (cosine $\gamma$)	$\gamma_{\mathrm{min}},E$
Medium	SogCLR	0.6	FastCLIP-v1	0.2, 18
Medium	iSogCLR	0.6	FastCLIP-v2	0.2, 18
Medium	v3 (Const. $\gamma$)^∗	0.6	FastCLIP-v3	0.2, 18
Large	SogCLR	0.6	FastCLIP-v1	0.2, 16
Large	iSogCLR	0.8	FastCLIP-v2	0.6, 16
Large	v3 (Const. $\gamma$)^∗	0.6	FastCLIP-v3	0.2, 16
xLarge	-	-	FastCLIP-v3	0.8, 10

The Temperature Parameter Updates: For all algorithms, we leverage a cosine $\gamma$ schedule with $\gamma_{\mathrm{min}}=0.2$ and decay epochs $E$ equal to 50% of the number of training epochs, and we tune the initial temperature parameter in $\{0.03,0.05,0.07\}$. For FastCLIP-v2 and -v3, we tune $\rho$ in $[6.0,9.0]$ and the learning rate of $\tau$ in $[1\mathrm{e}-4,1\mathrm{e}-2]$. Other hyperparameters are kept the same for the four algorithms. The tuned initial temperature is 0.07 for FastCLIP-v3 and 0.03 for the other algorithms. The $\rho$ values are presented in Table 8. For FastCLIP-v2, the tuned learning rate of $\tau$ is 1e-2 in the medium-scale setting and 1e-4 in the large-scale setting. For FastCLIP-v3, the tuned learning rate of $\tau$ is 2e-4 in the medium-scale setting and 1e-4 in the large-scale setting. For FastCLIP-v3, the learning rate of $\tau$ decays to 1/3 of its original value when $\tau$ becomes smaller than 0.03.
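The learning rate rule for $\tau$ in FastCLIP-v3 can be written as a small helper. This is a sketch under one reading of the rule: the text does not say whether the reduction is permanent once triggered, so the version below simply re-evaluates the condition at every call. The function name `tau_lr` and its parameter names are hypothetical.

```python
def tau_lr(tau, base_lr, threshold=0.03, factor=1.0 / 3.0):
    """Return base_lr, reduced to one third while tau is below the threshold."""
    return base_lr * factor if tau < threshold else base_lr

# Medium-scale FastCLIP-v3: base learning rate of tau is 2e-4.
print(tau_lr(0.07, 2e-4))  # tau at its initial value, full learning rate
print(tau_lr(0.02, 2e-4))  # tau below 0.03, reduced learning rate
```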

Table 8: Value of $\rho$ for FastCLIP-v2 and -v3 in different settings.

Algorithm	Medium	Large	xLarge
FastCLIP-v2	7.0	8.5	-
FastCLIP-v3	6.5	8.5	16.0

The Optimizer: We use FastCLIP-v3 as the base algorithm. For both optimizers (AdamW and LAMB), we tune the learning rate of model parameters in $[4\mathrm{e}-5,4\mathrm{e}-3]$ and the weight decay in $[0.01,0.2]$. Other hyperparameters are kept the same as in Temperature Parameter Updates. The tuned learning rate of model parameters for AdamW is 1e-3 in the medium-scale setting and 4e-4 in the large-scale setting, and the tuned weight decay is 0.1. The tuned learning rate of $\bm{w}$ for LAMB is 2e-3, and the tuned weight decay is 0.1. Following OpenCLIP, we set the weight decay of the temperature parameter to 0. Following the implementation of LAMB in EVA-CLIP [46], we set $\alpha$ at Line 4 in Proc. 4 to 1.0 when updating the temperature parameter, which leads to the same update as AdamW.

Scaling Performance: We tune the learning rate of model parameters of OpenCLIP on 2 nodes in the medium-scale and large-scale settings in $[4\mathrm{e}-5,4\mathrm{e}-3]$, and on 4 nodes in the xlarge-scale setting in $[4\mathrm{e}-5,4\mathrm{e}-4]$. The tuned learning rate of model parameters of OpenCLIP is 1e-3, 4e-4 and 2e-4 in the medium-scale, large-scale and xlarge-scale settings, respectively. Other hyperparameters are set according to Tables 6 to 8. In the xlarge-scale setting, we set the learning rate of model parameters of FastCLIP-v3 to the same value as OpenCLIP. For different numbers of nodes in the medium-scale and large-scale settings, we scale the learning rates of the model parameters and the temperature parameter linearly in proportion to the global batch size and keep other hyperparameters unchanged. For FastCLIP-v3 in the xlarge-scale setting, we set $\rho$ to 16.0 and the learning rate of the temperature parameter to 5e-5. We leverage a cosine $\gamma$ schedule with $\gamma_{\mathrm{min}}=0.8$ and decay epochs $E=10$.
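The linear scaling rule mentioned above amounts to a single multiplication. The sketch below is only illustrative: the function name `scale_lr` is hypothetical, and the batch sizes in the example are placeholder values rather than the per-node configurations used in the experiments.

```python
def scale_lr(base_lr, base_batch_size, batch_size):
    """Scale a learning rate linearly with the global batch size."""
    return base_lr * batch_size / base_batch_size

# Hypothetical example: a learning rate of 1e-3 tuned at a global batch size of 1024.
base_lr, base_batch_size = 1e-3, 1024
for batch_size in (512, 1024, 2048, 4096):
    print(batch_size, scale_lr(base_lr, base_batch_size, batch_size))
```

The same rule would be applied to the learning rate of the temperature parameter, with its own tuned base value.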

Choice of $\gamma_{\mathrm{min}}$ in the xlarge-scale setting: Note that in the xlarge-scale setting we use a larger $\gamma_{\mathrm{min}}$ value than in the medium-scale and large-scale settings. We find that the batch size impacts how the $\gamma_{\mathrm{min}}$ value should be set. To illustrate this, we conduct two sets of experiments in the large-scale setting on 2 nodes and 8 nodes, respectively. Each set runs FastCLIP-v3 with different $\gamma_{\mathrm{min}}$ values. The results are plotted in Figure 5. Comparing a larger $\gamma_{\mathrm{min}}$ (0.8) with a smaller one (0.2) in the same setting, we find that the training can be split into three stages. In the first stage, the two runs have similar performance. In the second stage, the larger $\gamma_{\mathrm{min}}$ outperforms the smaller one. In the last stage, the smaller one catches up with the larger one and outperforms it. From Figure 5 we can also observe that with a larger global batch size, the second stage becomes longer. Note that in the medium-scale and large-scale settings we use a global batch size of 1024 and 2048, respectively, while we set it to 5120 in the xlarge-scale setting. We also conjecture that the second stage becomes longer as the data scales up, though we did not validate this due to resource limits. The large batch size and large data scale in the xlarge-scale setting motivate our use of a larger $\gamma_{\mathrm{min}}$ value than in the medium-scale and large-scale settings.

Appendix C More Experiment Results

C.1 Optimization Components

We plot the Datacomp average performance curves of different algorithms with constant $\gamma$ schedule and cosine $\gamma$ schedule in Figure 6, which corresponds to Table 5 in Section 5. We plot the Datacomp average performance curves of algorithms with different temperature updates in Figure 7 (a) and (b), which corresponds to Table 5 in Section 5. We plot the Datacomp average performance curves of FastCLIP-v3 with AdamW and LAMB optimizer in Figure 7 (c) and (d), which corresponds to Table 5 in Section 5.

C.2 Scaling Performance

In this subsection we provide more results to complement the figures in Section 6.

Table 9: Datacomp Average performance of OpenCLIP and FastCLIP-v3 trained on different numbers of nodes. Improvement denotes the absolute difference between FastCLIP-v3 and OpenCLIP.

Setting	Algorithm	1 Node	2 Nodes	4 Nodes	8 Nodes
Medium	OpenCLIP	21.82 (0.59)	21.84 (0.23)	21.65 (0.13)	22.22 (0.37)
Medium	FastCLIP-v3	24.54 (0.25)	24.76 (0.26)	24.43 (0.20)	25.23 (0.28)
Medium	Improvement	2.72	2.92	2.78	3.01
Large	OpenCLIP	27.55 (0.46)	27.91 (0.73)	28.93 (0.29)	28.75 (0.59)
Large	FastCLIP-v3	30.81 (0.38)	31.60 (0.46)	31.65 (0.13)	31.45 (0.32)
Large	Improvement	3.26	3.69	2.72	2.70

Table 10: Retrieval performance of OpenCLIP and FastCLIP-v3 trained on different numbers of nodes. Improvement denotes the absolute difference between FastCLIP-v3 and OpenCLIP.

Setting	Algorithm	1 Node	2 Nodes	4 Nodes	8 Nodes
Medium	OpenCLIP	24.07 (0.16)	25.20 (0.22)	25.07 (0.26)	26.20 (0.10)
Medium	FastCLIP-v3	30.02 (0.57)	30.36 (0.18)	30.42 (0.24)	30.42 (0.24)
Medium	Improvement	5.95	5.16	5.35	4.22
Large	OpenCLIP	29.17 (0.17)	29.58 (0.62)	30.25 (0.31)	30.87 (0.11)
Large	FastCLIP-v3	33.90 (0.28)	34.88 (0.28)	34.91 (0.16)	34.74 (0.31)
Large	Improvement	4.73	5.30	4.66	3.87

Table 11: ImageNet & Variants accuracy of OpenCLIP and FastCLIP-v3 trained on different numbers of nodes. Improvement denotes the absolute difference between FastCLIP-v3 and OpenCLIP.

Setting	Algorithm	1 Node	2 Nodes	4 Nodes	8 Nodes
Medium	OpenCLIP	14.16 (0.11)	14.73 (0.22)	15.24 (0.26)	16.03 (0.23)
Medium	FastCLIP-v3	18.37 (0.26)	19.08 (0.16)	19.21 (0.18)	19.20 (0.16)
Medium	Improvement	4.21	4.35	3.97	3.17
Large	OpenCLIP	20.51 (0.14)	21.08 (0.09)	22.32 (0.23)	22.77 (0.14)
Large	FastCLIP-v3	23.76 (0.38)	24.78 (0.28)	24.79 (0.20)	24.93 (0.16)
Large	Improvement	3.25	3.70	2.47	2.16

Performance of OpenCLIP and FastCLIP-v3: The data plotted in Figure 4 are presented in Table 10 and Table 11. We also provide the Datacomp performance in Table 9. The Datacomp performance of OpenCLIP and FastCLIP-v3 in the xlarge-scale setting is plotted in Figure 8.

Table 12: Comparison between OpenCLIP and FastCLIP-v3 in terms of training time in the medium-scale setting. The Algorithm column indicates whether the results are from OpenCLIP or FastCLIP-v3. Computation denotes the whole computation time. Communication denotes the whole communication time. Pure Comm. denotes the communication time that is not overlapped with computation. Overlap denotes the overlapped time between computation and communication.

Category	Algorithm	1 Node	2 Nodes	4 Nodes	8 Nodes
Total	OpenCLIP	867.85 (11.04)	880.19 (53.45)	925.47 (27.77)	1049.90 (32.44)
Total	FastCLIP-v3	866.36 (5.89)	879.91 (52.17)	917.54 (25.46)	1028.06 (32.26)
Computation	OpenCLIP	770.57 (6.10)	738.87 (21.58)	726.07 (1.53)	742.93 (15.91)
Computation	FastCLIP-v3	771.80 (5.53)	737.93 (21.73)	725.40 (2.01)	742.90 (15.90)
Communication	OpenCLIP	222.01 (4.43)	403.40 (130.80)	548.07 (60.97)	698.87 (26.24)
Communication	FastCLIP-v3	223.34 (5.51)	400.76 (125.78)	536.15 (59.29)	675.43 (25.97)
Pure Comm.	OpenCLIP	27.18 (1.61)	68.74 (25.45)	127.39 (30.29)	224.71 (16.05)
Pure Comm.	FastCLIP-v3	25.50 (2.24)	64.32 (22.47)	116.21 (28.48)	200.97 (15.58)
Overlap	OpenCLIP	194.84 (2.88)	334.66 (105.36)	420.68 (30.80)	474.16 (10.23)
Overlap	FastCLIP-v3	197.84 (3.65)	336.44 (103.35)	419.94 (30.83)	474.46 (10.41)
Others	OpenCLIP	70.09 (8.17)	72.58 (6.59)	72.01 (2.73)	82.26 (0.93)
Others	FastCLIP-v3	69.06 (1.67)	77.66 (8.14)	75.93 (2.83)	84.19 (0.86)

Table 13: Comparison between OpenCLIP and FastCLIP-v3 in terms of training time in the large-scale setting. The Algorithm column indicates whether the results are from OpenCLIP or FastCLIP-v3. Computation denotes the whole computation time. Communication denotes the whole communication time. Pure Comm. denotes the communication time that is not overlapped with computation. Overlap denotes the overlapped time between computation and communication.

Category	Algorithm	1 Node	2 Nodes	4 Nodes	8 Nodes
Total	OpenCLIP	1125.29 (14.14)	1234.06 (151.37)	1396.76 (47.86)	1564.46 (47.92)
Total	FastCLIP-v3	1128.75 (9.75)	1234.82 (153.86)	1394.91 (48.35)	1542.32 (47.87)
Computation	OpenCLIP	960.14 (12.00)	910.77 (10.48)	891.71 (6.09)	896.54 (8.02)
Computation	FastCLIP-v3	964.16 (9.10)	910.94 (11.55)	892.72 (4.72)	897.59 (9.09)
Communication	OpenCLIP	360.34 (15.55)	655.30 (175.45)	876.13 (71.52)	1061.52 (55.08)
Communication	FastCLIP-v3	363.38 (16.66)	652.78 (173.41)	870.01 (69.56)	1035.03 (56.84)
Pure Comm.	OpenCLIP	56.73 (4.09)	192.89 (129.45)	379.10 (58.13)	525.78 (57.22)
Pure Comm.	FastCLIP-v3	55.44 (2.23)	190.56 (127.48)	371.30 (55.62)	498.95 (59.72)
Overlap	OpenCLIP	303.62 (14.70)	462.41 (46.02)	497.02 (13.45)	535.74 (2.33)
Overlap	FastCLIP-v3	307.94 (18.14)	462.22 (45.93)	498.71 (13.97)	536.08 (2.99)
Others	OpenCLIP	108.42 (5.54)	130.40 (12.26)	125.95 (5.57)	142.14 (2.08)
Others	FastCLIP-v3	109.14 (2.67)	133.33 (15.30)	130.89 (4.34)	145.78 (3.13)

Training Time of OpenCLIP and FastCLIP-v3: We present the training time breakdown of OpenCLIP and FastCLIP-v3 in Tables 12 and 13 for the medium-scale and large-scale settings, respectively. As the number of nodes scales up, the computation times of OpenCLIP and FastCLIP-v3 remain close to each other, while the gap in communication time becomes much larger, which is also depicted in subfigures (c) and (d). Even if we exclude the part of communication that overlaps with computation, the gap in pure communication still grows with the number of nodes, and thus FastCLIP-v3 has a shorter running time on 4 and 8 nodes.
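The reported categories can be cross-checked with a few lines of arithmetic. The sketch below uses the 8-node column of Table 12 and assumes that, within each category, the first row belongs to OpenCLIP and the second to FastCLIP-v3. It also assumes a decomposition that the reported numbers appear to satisfy but that the captions do not state explicitly: Total is approximately Computation plus Pure Comm. plus Others, and Communication is approximately Pure Comm. plus Overlap. All variable names are hypothetical.

```python
# 8-node column of Table 12 (medium-scale setting); row assignment is an assumption.
timings = {
    "OpenCLIP": dict(total=1049.90, computation=742.93, communication=698.87,
                     pure_comm=224.71, overlap=474.16, others=82.26),
    "FastCLIP-v3": dict(total=1028.06, computation=742.90, communication=675.43,
                        pure_comm=200.97, overlap=474.46, others=84.19),
}

def residuals(t):
    """Differences between reported totals and the assumed decomposition."""
    total_gap = t["total"] - (t["computation"] + t["pure_comm"] + t["others"])
    comm_gap = t["communication"] - (t["pure_comm"] + t["overlap"])
    return total_gap, comm_gap

for name, t in timings.items():
    print(name, residuals(t))

# How much of the total-time gap is explained by non-overlapped communication.
saved_pure_comm = timings["OpenCLIP"]["pure_comm"] - timings["FastCLIP-v3"]["pure_comm"]
saved_total = timings["OpenCLIP"]["total"] - timings["FastCLIP-v3"]["total"]
print(saved_pure_comm, saved_total)
```

Under these assumptions, the difference in total time on 8 nodes is of the same size as the difference in pure communication time, which matches the explanation given in the text.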

