FastCLIP: A Suite of Optimization Techniques to
Accelerate CLIP Training with Limited Resources

Xiyuan Wei1
[email protected]
Fanjiang Ye2
[email protected]
Ori Yonay1
[email protected]
Xingyu Chen1
[email protected]
Baixi Sun2
[email protected]
Dingwen Tao2
[email protected]
Tianbao Yang1
[email protected]

1Texas A&M University
College Station, TX 77843

2Indiana University Bloomington
Bloomington, IN 47405
Abstract

Existing studies of training state-of-the-art Contrastive Language-Image Pretraining (CLIP) models on large-scale data involve hundreds or even thousands of GPUs due to the requirement of a large batch size. However, such a large amount of resources is not accessible to most people. While advanced compositional optimization techniques for optimizing global contrastive losses have been shown to be effective at removing the requirement of a large batch size, their performance on large-scale data remains underexplored and not optimized. To bridge the gap, this paper explores several aspects of CLIP training with limited resources (e.g., up to tens of GPUs). First, we introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques and designed and optimized for the distributed setting. Our framework is equipped with an efficient gradient reduction strategy to reduce communication overhead. Second, to further boost training efficiency, we investigate three components of the framework from an optimization perspective: the schedule of the inner learning rate, the update rule of the temperature parameter, and the update rule of the model parameters. Experiments on different strategies for each component shed light on how to conduct CLIP training more efficiently. Finally, we benchmark the performance of FastCLIP and the state-of-the-art training baseline (OpenCLIP) on different compute scales up to 32 GPUs on 8 nodes, and on three data scales (2.7 million, 9.1 million, and 315 million image-text pairs) to demonstrate the significant improvement of FastCLIP in the resource-limited setting. We release the code of FastCLIP at https://github.com/Optimization-AI/fast_clip.

1 Introduction

Contrastive Language-Image Pretraining (CLIP) [38] is a popular approach for vision-language representation learning [7, 47, 6, 27, 37]. The method effectively embeds data from the image and language modality into a joint embedding space by optimizing a contrastive loss in a self-supervised manner. It has demonstrated strong performance on various downstream tasks (e.g., zero-shot classification and retrieval) and has been adopted in various applications, including text-to-image generation [40, 59, 8], image captioning [55, 34], and evaluation of image generation quality [19]. Its popularity is further fueled by releases of web-scale datasets [43, 44, 14, 12].

However, vanilla mini-batch based methods for self-supervised contrastive learning are known to require a large batch size to obtain satisfactory performance. Theoretically, it has been shown that the optimization error of mini-batch based contrastive learning methods depends inversely on the batch size [56]. Empirically, state-of-the-art CLIP models are typically trained using a large batch size on a large number of GPUs (e.g., 84k batch size and 1024 Nvidia A100 GPUs in OpenCLIP [7]). Such a large amount of resources is not accessible to most people in academia and small companies. Recently, Yuan et al. [56] proposed an algorithm named SogCLR to address the large batch size issue. It leverages finite-sum coupled compositional optimization (FCCO) techniques to optimize a global contrastive loss (GCL) that contrasts each anchor data point with all other data in a compositional structure. A key feature of compositional optimization is the distinction between inner and outer steps: the inner steps maintain and update a sequence of estimators to track the inner functions on the solution path, and each such update can be interpreted as an SGD update with a learning rate called the inner learning rate [49]. Later, Qiu et al. [37] leveraged SogCLR to design the iSogCLR algorithm for optimizing a robust global contrastive loss (RGCL) with individualized learnable temperatures in CLIP models. However, these algorithms are not fully optimized for large-scale training of CLIP models since they were examined only on small-scale datasets.

This paper aims to scale up the advanced optimization algorithms for optimizing global contrastive losses of CLIP training on large-scale data with limited compute resources. We introduce a distributed training framework named FastCLIP that employs data parallelism: each worker computes a gradient estimator using its own data, the estimators are reduced (averaged) through communication, and the model is updated with the result. We design a novel gradient reduction strategy that requires less communication than the existing distributed framework. This distributed training framework lays the foundation for scaling up CLIP training with limited resources. To further boost the efficiency of our framework, we investigate three of its aspects from an optimization perspective: the schedule of the inner learning rate (LR) of compositional optimization, the update rule of the temperature parameter, and the update rule of the model parameters.

  • Previous studies [56, 37] set the inner LR to a constant value less than but close to one, which could slow down the training for large-scale data at earlier iterations. Inspired by the learning rate schedule of existing optimizers of Deep Learning [31], we examine a cosine decay schedule for the inner LR by benchmarking its performance and comparing it with the constant schedule.

  • For the update rule of the temperature parameter, we compare four different strategies in the FastCLIP framework, including a heuristic approach based on the gradient of GCL, a constant strategy as used in SogCLR, learning individualized temperatures as used in iSogCLR, and learning global temperature by optimizing a new RGCL with a single learnable temperature.

  • For the update rule of the model parameters, we benchmark the performance of the AdamW optimizer [32] and the LAMB optimizer [53] in FastCLIP. We also explored a momentum-based optimizer, but it yielded much worse performance and hence is not reported in this paper.
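
As a rough illustration of the first bullet, the two inner LR schedules can be sketched as below. This is a minimal sketch under stated assumptions: the function names, the starting value of 1.0, the floor `gamma_min`, and the constant 0.9 are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def constant_gamma(step: int, total_steps: int, gamma: float = 0.9) -> float:
    """Constant inner LR: a fixed value below but close to one (0.9 is hypothetical)."""
    return gamma


def cosine_gamma(step: int, total_steps: int, gamma_min: float = 0.2) -> float:
    """Cosine decay of the inner LR from 1.0 down to gamma_min (hypothetical values)."""
    # Clamp progress to [0, 1] so steps past the horizon stay at gamma_min.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return gamma_min + 0.5 * (1.0 - gamma_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000
    for step in (0, 250, 500, 750, 1000):
        print(step, round(constant_gamma(step, total), 4), round(cosine_gamma(step, total), 4))
```

Under this form, the cosine schedule starts with a large inner LR, so early estimators follow the current mini-batch closely, and then moves toward a smaller value that averages over more history.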

Moreover, in order to study the scaling capability of FastCLIP, we compare the performance of FastCLIP and the state-of-the-art baseline OpenCLIP [22] on three data scales and four compute scales. The data scales include 2.7 million (CC3M [45]), 9.1 million (CC12M [3]), and 315 million (LAION400M [43]) image-text pairs (our downloaded versions of these datasets are smaller than their original versions because some web links are no longer valid). The compute scales include 1, 2, 4, and 8 nodes, with 4 GPUs on each node.

[Figure 1 panels: (a) 1 Node, Medium; (b) 2 Nodes, Medium; (c) 4 Nodes, Medium; (d) 8 Nodes, Medium; (e) 1 Node, Large; (f) 2 Nodes, Large; (g) 4 Nodes, Large; (h) 8 Nodes, Large]
Figure 1: Zero-shot accuracy curves on ImageNet & its variants of OpenCLIP and FastCLIP-v3 trained on 1 to 8 node(s) with 4 GPUs per node on medium and large-scale settings (cf. Section 5).

The contributions of this paper are summarized as follows: (1) We propose FastCLIP, an efficient distributed framework to scale up CLIP training with limited computing resources. (2) We benchmark the performance of different strategies for three components of FastCLIP, providing insights on how to conduct CLIP training more efficiently. (3) We study the performance of FastCLIP on different data scales and compute scales. The results show that FastCLIP consistently outperforms the state-of-the-art training baseline OpenCLIP by a large margin. A quick comparison between FastCLIP and OpenCLIP on the medium and large data scales across different compute scales is shown in Figure 1.

2 Related Works

CLIP training in the distributed setting: Radford et al. [38] train CLIP models in a distributed setting, but few details regarding the implementation are provided. Ilharco et al. [22] develop OpenCLIP, an open-source implementation of CLIP. They leverage the PyTorch distributed data-parallel module [26] to automatically communicate features and gradients. EVA-CLIP [46, 47] scales the number of parameters of the image encoder in CLIP up to 18 billion by applying several techniques from the system perspective, including the ZeRO optimizer [39] and global half-precision training with DeepSpeed [41]. The key difference between existing works and this work is that they all use a simple mini-batch based contrastive loss, which suffers from the issue of requiring a large batch size. This in turn requires hundreds and even thousands of GPUs. For example, CLIP uses 592 V100 GPUs, OpenCLIP uses 1024 A100 GPUs, and EVA-CLIP uses 256 A100 GPUs. Our work focuses on scaling up CLIP training in a resource-limited setting with only tens of GPUs. We make unique efforts to reduce communication overhead and optimize algorithmic components.

Benchmark for CLIP training: Cherti et al. [7] study the scaling performance of CLIP training. They measure the performance of CLIP across different model sizes and dataset sizes, and study the relationships between downstream task performance and resource consumption. Gadre et al. [14] investigate the impact of different data filtering strategies on the trained model’s downstream performance. They conduct experiments across different data scales ranging from 12.8 million to 12.8 billion and provide insights on how to curate CLIP’s training data. Cui et al. [9] examine the impact of data quality, supervision strategies (e.g., additional image supervision), and model architectures. Li et al. [30] explore different aspects of CLIP training under a limited training budget, including the impact of the quality and quantity of the training data, different model architectures, and different existing training strategies. Different from these works, we study different algorithmic components of CLIP training in an advanced optimization framework for optimizing the global contrastive loss.

Improved CLIP training: Many works have studied efficient CLIP training with limited resources. Yuan et al. [56] propose SogCLR to improve the performance of contrastive learning with small batch size. Our work scales up SogCLR in the distributed setting and incorporates several algorithmic strategies to accelerate its training speed. Besides the algorithm, other directions are also explored for more efficient CLIP training, including augmenting mini-batch based contrastive losses [29, 57, 35, 25, 33, 24, 15], model compression [51, 28, 13], and system optimization [5, 46, 39].

Temperature scheme: The temperature parameter in contrastive losses plays an important role in CLIP training. Many techniques have been proposed to update or set the temperature parameter. Radford et al. [38] treat the temperature as part of the learnable parameters in the mini-batch contrastive loss. Zhang et al. [58] propose to use different temperatures for positive and negative samples to independently control intra-anchor and inter-anchor hardness-awareness. Kukleva et al. [23] study a cosine decay schedule for setting the temperature. Huang et al. [21] propose to set the temperature parameter proportional to the alignment between positive pairs. Qiu et al. [37] propose a robust global contrastive loss (RGCL) with individualized temperatures inspired by Distributionally Robust Optimization and optimize it with the iSogCLR algorithm which extends SogCLR. However, their performance on large-scale data remains unknown. This work focuses on comparing different global contrastive losses for learning the temperature parameter and discovers a new strategy by learning a global temperature in the RGCL that yields better performance for large-scale data.

Optimizers for CLIP training: Different optimizers for updating the learnable parameters have been employed in CLIP training, including AdamW [32] used in [38, 7, 14, 6, 27, 37], and LAMB [53] used in [46, 52, 20]. In this work, we compare the performance of AdamW and LAMB to determine which optimizer is better in FastCLIP for training CLIP models from scratch.

3 Preliminaries

Notations: Given a dataset of $n$ images $\bm{x}_i$ and their corresponding text descriptions $\bm{z}_i$: $\mathcal{S}=\{(\bm{x}_1,\bm{z}_1),\ldots,(\bm{x}_n,\bm{z}_n)\}$, we aim to learn an image encoder and a text encoder (jointly represented by $\bm{w}$) from the data. We use $\bm{e}_{1,i}=\bm{e}_1(\bm{w},\bm{x}_i)\in\mathbb{R}^d$ and $\bm{e}_{2,i}=\bm{e}_2(\bm{w},\bm{z}_i)\in\mathbb{R}^d$ to denote the encoded vector of the input $\bm{x}_i$ and $\bm{z}_i$, respectively. And we use $\bm{e}_i=(\bm{e}_{1,i}^{\top},\bm{e}_{2,i}^{\top})^{\top}$ to denote the concatenation of $\bm{e}_{1,i}$ and $\bm{e}_{2,i}$. Denote by $\mathcal{B}\subset\mathcal{S}$ a mini-batch of image-text pairs. With slight abuse of notation, we also use $\mathcal{B}$ (and $\mathcal{S}$) to denote the index set of the image-text pairs it contains. $\mathcal{S}_{i-}:=\mathcal{S}\backslash\{i\}$ denotes the subset of $\mathcal{S}$ without the $i$-th pair. We consider the data parallel setting such that $\mathcal{S}$ is partitioned evenly across $K$ workers denoted by $\mathcal{S}_1,\ldots,\mathcal{S}_K$. For a function $\ell(\cdot,\cdot)$, let $\nabla_1\ell(\cdot,\cdot)$ and $\nabla_2\ell(\cdot,\cdot)$ denote the partial gradient in terms of the first and the second argument.

Mini-batch Contrastive Loss (MBCL) and Global Contrastive Loss (GCL): The core idea of CLIP training is to leverage a contrastive loss to push features of paired image and text close to each other (i.e., to maximize the similarity between $\bm{e}_{1,i}$ and $\bm{e}_{2,i}$), while pushing features of non-paired image and text away from each other (i.e., minimizing the similarity between $\bm{e}_{1,i}$ and $\bm{e}_{2,j}$ for $i\neq j$). Mathematically, let $s_{i,j}$ denote the cosine similarity between $\bm{e}_{1,i}$ and $\bm{e}_{2,j}$. Define

$$\ell_1(\bm{e}_i,\bm{e}_{2,j},\tau):=\exp\left((s_{i,j}-s_{i,i})/\tau\right),\quad \ell_2(\bm{e}_i,\bm{e}_{1,j},\tau):=\exp\left((s_{j,i}-s_{i,i})/\tau\right),$$

where $\tau>0$ is the temperature parameter. Given a mini-batch $\mathcal{B}$ of image-text pairs, let

$$g_1(\bm{w},\tau,i,\mathcal{B}):=\frac{1}{|\mathcal{B}|}\sum\nolimits_{j\in\mathcal{B}}\ell_1(\bm{e}_i,\bm{e}_{2,j},\tau),\quad g_2(\bm{w},\tau,i,\mathcal{B}):=\frac{1}{|\mathcal{B}|}\sum\nolimits_{j\in\mathcal{B}}\ell_2(\bm{e}_i,\bm{e}_{1,j},\tau).$$

Following Radford et al. [38], many works minimize the mini-batch contrastive loss (MBCL):

$$\frac{1}{|\mathcal{S}|}\sum\nolimits_{i\in\mathcal{S}}\mathbb{E}_{\mathcal{B}\subset\mathcal{S}_{i-}}\left(\log\left(\frac{1}{|\mathcal{B}|}+g_1(\bm{w},\tau,i,\mathcal{B})\right)+\log\left(\frac{1}{|\mathcal{B}|}+g_2(\bm{w},\tau,i,\mathcal{B})\right)\right), \tag{MBCL}$$

which contrasts the $i$-th pair with other pairs within only a mini-batch $\mathcal{B}$. However, this loss suffers from the large-batch size issue, which has been addressed by the Global Contrastive Loss (GCL) [56] that contrasts the $i$-th pair with all other pairs in the dataset $\mathcal{S}$:

$$\frac{\tau}{|\mathcal{S}|}\sum\nolimits_{i\in\mathcal{S}}\left(\log\left(\frac{1}{|\mathcal{S}_{i-}|}+g_1(\bm{w},\tau,i,\mathcal{S}_{i-})\right)+\log\left(\frac{1}{|\mathcal{S}_{i-}|}+g_2(\bm{w},\tau,i,\mathcal{S}_{i-})\right)\right). \tag{GCL}$$
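
To make the definitions of $g_1$, $g_2$, and GCL concrete, the following NumPy sketch evaluates them on a small set of precomputed features. It is an illustration only: the function names are hypothetical, the features are random stand-ins for encoder outputs, and the full sum over $\mathcal{S}_{i-}$ is computed directly, which is feasible only for toy sizes.

```python
import numpy as np


def cosine_similarity_matrix(e1: np.ndarray, e2: np.ndarray) -> np.ndarray:
    """s[i, j] = cosine similarity between image feature i and text feature j."""
    e1n = e1 / np.linalg.norm(e1, axis=1, keepdims=True)
    e2n = e2 / np.linalg.norm(e2, axis=1, keepdims=True)
    return e1n @ e2n.T


def inner_functions(s: np.ndarray, tau: float, i: int, batch: np.ndarray):
    """g_1 and g_2 for anchor i, averaged over the index set `batch`."""
    g1 = np.mean(np.exp((s[i, batch] - s[i, i]) / tau))  # image i against texts j
    g2 = np.mean(np.exp((s[batch, i] - s[i, i]) / tau))  # text i against images j
    return g1, g2


def global_contrastive_loss(s: np.ndarray, tau: float) -> float:
    """GCL: each anchor i is contrasted with all other pairs S_{i-}."""
    n = s.shape[0]
    total = 0.0
    for i in range(n):
        others = np.delete(np.arange(n), i)  # index set S_{i-}
        g1, g2 = inner_functions(s, tau, i, others)
        eps = 1.0 / len(others)
        total += np.log(eps + g1) + np.log(eps + g2)
    return tau * total / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e1, e2 = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
    s = cosine_similarity_matrix(e1, e2)
    batch = np.array([1, 3, 5])  # a mini-batch that excludes anchor 0
    g1, g2 = inner_functions(s, 0.07, 0, batch)
    mbcl_term = np.log(1 / len(batch) + g1) + np.log(1 / len(batch) + g2)
    print("one MBCL term for anchor 0:", mbcl_term)
    print("GCL over all 8 pairs:", global_contrastive_loss(s, 0.07))
```

The only structural difference between the two losses in this sketch is the index set passed to `inner_functions`: a sampled mini-batch for MBCL, and all of $\mathcal{S}_{i-}$ for GCL.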

Robust Global Contrastive Loss (RGCL): To improve CLIP training, Qiu et al. [37] designed a robust global contrastive loss (RGCL) with individualized temperature parameters inspired by Distributionally Robust Optimization. It is defined as:

$$\begin{aligned}\min_{\tau_1,\tau_2\geq\tau_0}\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}&\left(\tau_{1,i}\cdot\left(\log\left(\frac{1}{|\mathcal{S}_{i-}|}+g_1(\bm{w},\tau_{1,i},i,\mathcal{S}_{i-})\right)+\rho\right)\right.\\&\;\;\left.+\tau_{2,i}\cdot\left(\log\left(\frac{1}{|\mathcal{S}_{i-}|}+g_2(\bm{w},\tau_{2,i},i,\mathcal{S}_{i-})\right)+\rho\right)\right),\end{aligned} \tag{RGCL}$$

where $\tau_1=(\tau_{1,1},\ldots,\tau_{1,n})$, $\tau_2=(\tau_{2,1},\ldots,\tau_{2,n})$, $\tau_0$ is a small value, and $\rho\geq 0$ is a hyperparameter.
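
For a fixed set of temperatures, the RGCL objective can be evaluated in the same way as GCL. The sketch below is illustrative only: `rgcl_objective` is a hypothetical name, the similarity matrix is assumed to be precomputed, and the constraint $\tau_{1,i},\tau_{2,i}\geq\tau_0$ and the minimization over temperatures are left to the caller.

```python
import numpy as np


def rgcl_objective(s: np.ndarray, tau1: np.ndarray, tau2: np.ndarray, rho: float) -> float:
    """Value of the RGCL objective for given per-anchor temperatures tau1, tau2."""
    n = s.shape[0]
    total = 0.0
    for i in range(n):
        others = np.delete(np.arange(n), i)  # index set S_{i-}
        eps = 1.0 / len(others)
        # Each anchor uses its own temperature on the image side and the text side.
        g1 = np.mean(np.exp((s[i, others] - s[i, i]) / tau1[i]))
        g2 = np.mean(np.exp((s[others, i] - s[i, i]) / tau2[i]))
        total += tau1[i] * (np.log(eps + g1) + rho)
        total += tau2[i] * (np.log(eps + g2) + rho)
    return total / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = np.clip(rng.normal(scale=0.3, size=(6, 6)), -1.0, 1.0)  # stand-in similarities
    tau = np.full(6, 0.07)
    print(rgcl_objective(s, tau, tau, rho=0.0))
```

When every entry of `tau1` and `tau2` is set to the same value $\tau$ and $\rho=0$, the expression reduces term by term to (GCL); a positive $\rho$ then adds $2\rho\tau$.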

Optimization Algorithms: To optimize GCL, Yuan et al. [56] have proposed the SogCLR algorithm based on advanced compositional optimization known as finite-sum coupled compositional optimization (FCCO) [49]. In particular, GCL is formulated as $\frac{1}{n}\sum_{i\in\mathcal{S}}f(g_i(\bm{w}))$, where $f(g)=\log(\varepsilon+g)$ and $g_i(\bm{w})$ is the inner function inside the log. The main challenge is to compute a gradient estimator using a mini-batch of samples such that the algorithm can converge without requiring a large batch size. The key idea of SogCLR is to maintain and update an estimator for each inner function $g_i(\bm{w})$ denoted by $u_i$, by using Eqn. (1). As a result, the gradient at the $t$-th iteration is estimated by $\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\nabla f(u_i^{t+1})\nabla\hat{g}_i(\bm{w}^t)$, where $\mathcal{B}$ is a mini-batch and $\hat{g}_i(\bm{w})$ is a mini-batch estimator of $g_i(\bm{w})$. To optimize RGCL, Qiu et al. [37] have proposed the iSogCLR algorithm by combining SogCLR with stochastic coordinate updates for the temperature parameters.
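
The estimator update and the resulting gradient estimate can be sketched in a few lines. The sketch assumes that Eqn. (1) has the standard moving-average form $u_i^{t+1}=(1-\gamma_t)u_i^t+\gamma_t\hat{g}_i(\bm{w}^t)$, and it takes the mini-batch values $\hat{g}_i$ and their gradients as given arrays; all function names are hypothetical.

```python
import numpy as np


def update_estimator(u: np.ndarray, g_hat: np.ndarray, gamma: float) -> np.ndarray:
    """Moving-average update of the inner-function estimators.

    Equivalently, one SGD step of size gamma on 0.5 * (u - g_hat)**2:
    u - gamma * (u - g_hat) = (1 - gamma) * u + gamma * g_hat.
    """
    return (1.0 - gamma) * u + gamma * g_hat


def outer_weight(u_new: np.ndarray, eps: float) -> np.ndarray:
    """Derivative of f(g) = log(eps + g), evaluated at the updated estimator."""
    return 1.0 / (eps + u_new)


def gradient_estimate(u, batch, g_hat, grad_g_hat, gamma, eps):
    """(1/|B|) * sum over i in B of f'(u_i^{t+1}) * grad g_hat_i(w^t).

    u:          estimators for all n anchors, shape (n,)
    batch:      indices of the sampled anchors, shape (b,)
    g_hat:      mini-batch values of g_i for i in batch, shape (b,)
    grad_g_hat: gradients of g_hat_i with respect to w, shape (b, p)
    """
    u = u.copy()
    u[batch] = update_estimator(u[batch], g_hat, gamma)  # entries outside the batch are unchanged
    weights = outer_weight(u[batch], eps)
    grad = np.mean(weights[:, None] * grad_g_hat, axis=0)
    return u, grad
```

With `gamma = 1.0` the estimator is overwritten by the current mini-batch value, so the weights depend only on the current batch; smaller values of `gamma` let `u` carry information from earlier iterations.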

4 FastCLIP: A Distributed Training Framework of CLIP Models

Algorithm 1: The FastCLIP Framework (Sketch)
 1: Input: Initial model parameters $\bm{w}^0,\tau^0,(\bm{u}_1^0,\bm{u}_2^0)$, number of iterations $T$.
 2: for $t=0,\ldots,T-1$ do
 3:     for each worker $k$ do in parallel
 4:         Sample a batch $\mathcal{B}^t_k$ from $\mathcal{S}_k$ and compute $\mathcal{E}^t_k=\{(\bm{e}_{1,j},\bm{e}_{2,j})\}_{j\in\mathcal{B}^t_k}$
 5:         All_Gather $\mathcal{E}^t=\cup_k\mathcal{E}^t_k$ to obtain global features
 6:         Compute mini-batch contrastive losses $g_{1,i}^t,g_{2,i}^t$ for $i\in\mathcal{B}^t_k$ (cf. Proc. 2 in Appendix A)
 7:         Update $u_{1,i}^{t+1},u_{2,i}^{t+1}$ using Eqn. (1) for $i\in\mathcal{B}^t_k$. Set $u_{1,i}^{t+1}=u_{1,i}^t,u_{2,i}^{t+1}=u_{2,i}^t$ for $i\notin\mathcal{B}^t_k$
 8:         Set $\mathcal{U}^t_k=\{(u_{1,j}^{t+1},u_{2,j}^{t+1})\}_{j\in\mathcal{B}^t_k}$, and All_Gather $\mathcal{U}^t=\cup_k\mathcal{U}^t_k$
 9:         Compute gradient estimators $G_{\bm{w},k}^t=G_{\bm{w},a,k}^t+G_{\bm{w},b,k}^t$ for $\bm{w}$ using $\mathcal{U}^t$ (cf. Proc. 3)
10:         All_Reduce $G_{\bm{w}}^t=\frac{1}{K}\sum_{l=1}^{K}G_{\bm{w},l}^t$ across all workers
11:         Update $\bm{w}^{t+1}$ from $\bm{w}^t$ and $G_{\bm{w}}^t$ using an optimizer (cf. Proc. 4).
12:         Update $\tau^{t+1}$ from $\tau^t$ (cf. Proc. 5).

FastCLIP is a distributed training framework for optimizing a GCL, including RGCL. Its key updates are built upon the SogCLR algorithm. The main difference between SogCLR and mini-batch based methods such as CLIP is that SogCLR maintains $u_{1,i}$ and $u_{2,i}$ to keep track of $g_1(\bm{w},\tau,i,\mathcal{S}_{i-})$ and $g_2(\bm{w},\tau,i,\mathcal{S}_{i-})$ as stated in Section 3. At iteration $t$, for $i$ selected in the batch $\mathcal{B}^{t}$, $u_{1,i}$ and $u_{2,i}$ will be updated using a moving average estimator with hyperparameter $\gamma_t \in (0,1]$:

$$u_{1,i}^{t+1} = (1-\gamma_t)\, u_{1,i}^{t} + \gamma_t\, g_1(\bm{w}^{t},\tau^{t},i,\mathcal{B}_{i-}^{t}), \quad u_{2,i}^{t+1} = (1-\gamma_t)\, u_{2,i}^{t} + \gamma_t\, g_2(\bm{w}^{t},\tau^{t},i,\mathcal{B}_{i-}^{t}), \tag{1}$$

and the gradient estimator is computed by $\frac{1}{|\mathcal{B}^{t}|}\sum_{i\in\mathcal{B}^{t}} \nabla f(u_i^{t+1})\, \nabla \hat{g}_i(\bm{w}^{t})$. The core of FastCLIP (Alg. 1) is how to compute the gradient estimator in a distributed manner.
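The moving-average update in Eqn. (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function name `update_u`, the array layout (one scalar per training sample), and the toy values are assumptions made here for clarity.

```python
import numpy as np

def update_u(u, batch_idx, g_batch, gamma):
    """Apply the moving-average update of Eqn. (1) to the sampled indices only.

    u         : array holding one running estimate per training sample
    batch_idx : indices of the samples in the current mini-batch
    g_batch   : mini-batch estimates g(w^t, tau^t, i, B_{i-}^t) for those samples
    gamma     : inner learning rate gamma_t in (0, 1]
    """
    u_next = u.copy()  # entries outside the batch keep their previous value
    u_next[batch_idx] = (1.0 - gamma) * u[batch_idx] + gamma * g_batch
    return u_next

# Toy example with hypothetical values.
u1 = np.zeros(6)
batch_idx = np.array([1, 4])
g1_batch = np.array([0.5, 2.0])
u1_next = update_u(u1, batch_idx, g1_batch, gamma=0.8)
print(u1_next)
```

With $\gamma_t = 1$ the history is discarded and $u$ equals the current mini-batch estimate; smaller values of $\gamma_t$ put more weight on past iterations.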

Next, we use (GCL) as an example to present our gradient computation strategy that effectively reduces the communication cost. We only present key steps and defer the complete derivation to Appendix A due to space limits. Let $\mathcal{B}^{t}_{k}$ denote the local mini-batch on the $k$-th worker. Below, we omit the superscript $t$ and use $\mathcal{B}_{k}$ for simplicity. Note that (GCL) is the sum of two parts: the image part (loss $g_1$) and the text part (loss $g_2$). Due to their symmetric structure, we only present the gradient of the image part. With $\varepsilon = 1/|\mathcal{S}_{i-}|$, the gradient estimator of the image part of (GCL) is computed by $G_{\bm{w},1,a} + G_{\bm{w},1,b}$:

\begin{align*}
G_{\bm{w},1,a} &= \tau \cdot \underbrace{\frac{1}{K}\sum_{k=1}^{K}}_{\textsc{All\_Reduce}} \frac{1}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}} \frac{1}{\varepsilon + \underbrace{u_{1,i}}_{\text{local}}} \cdot \overbrace{\frac{1}{K}\sum_{k'=1}^{K} \frac{1}{|\mathcal{B}_{k',i-}|}\sum_{j\in\mathcal{B}_{k',i-}} \nabla_1 \ell_1(\underbrace{\bm{e}_i}_{\text{local}}, \underbrace{\colorbox{yellow!15}{$\bm{e}_{2,j}$}}_{\text{global}}, \tau) \cdot \underbrace{\nabla \bm{e}_i}_{\text{local}}}^{G_{\bm{w},1,a,i}}, \\
G_{\bm{w},1,b} &= \tau \cdot \underbrace{\frac{1}{K}\sum_{k'=1}^{K}}_{\textsc{All\_Reduce}} \frac{1}{|\mathcal{B}_{k'}|}\sum_{j\in\mathcal{B}_{k'}} \cdot \frac{1}{K}\sum_{k=1}^{K} \frac{1}{|\mathcal{B}_{k,j-}|}\sum_{i\in\mathcal{B}_{k,j-}} \frac{1}{\varepsilon + \underbrace{\colorbox{yellow!15}{$u_{1,i}$}}_{\text{global}}} \nabla_2 \ell_1(\underbrace{\colorbox{yellow!15}{$\bm{e}_i$}}_{\text{global}}, \underbrace{\bm{e}_{2,j}}_{\text{local}}, \tau) \cdot \underbrace{\nabla \bm{e}_{2,j}}_{\text{local}}.
\end{align*}

Both $G_{\bm{w},1,a}$ and $G_{\bm{w},1,b}$ have two averages over $\mathcal{B}$ due to the compositional structure of the loss. For FastCLIP, the inner average (e.g., $G_{\bm{w},1,a,i}$) is computed on a single worker after gathering global parts (shaded and labeled "global", e.g., $\bm{e}_{2,j}$) from all workers. The outer average is then computed using All_Reduce.
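To illustrate the "gather global parts, then average locally" pattern, the following NumPy sketch simulates $K$ workers inside a single process. It makes several assumptions that are not stated in this section: the pairwise term is taken to be $\ell_1(\bm{e}_i,\bm{e}_{2,j},\tau)=\exp((\bm{e}_i^\top\bm{e}_{2,j}-\bm{e}_i^\top\bm{e}_{2,i})/\tau)$, All_Gather is replaced by a plain concatenation, and the inner average is normalized by the total number of negatives rather than per worker as in the displayed formula. The sketch evaluates the inner average of the loss values (not of their gradients), which follows the same communication pattern. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, local_bs, d, tau = 2, 3, 4, 0.1  # workers, local batch size, feature dim, temperature

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Each simulated worker holds its own image and text embeddings.
img_local = [normalize(rng.normal(size=(local_bs, d))) for _ in range(K)]
txt_local = [normalize(rng.normal(size=(local_bs, d))) for _ in range(K)]

# Stand-in for All_Gather: every worker ends up with all text embeddings.
txt_global = np.concatenate(txt_local, axis=0)

def inner_average(k):
    """Inner average for the anchors on worker k, using gathered text features."""
    rows = np.arange(local_bs)
    sim = img_local[k] @ txt_global.T          # local anchors vs. global texts
    pos = rows + k * local_bs                  # global index of each anchor's positive
    ell = np.exp((sim - sim[rows, pos][:, None]) / tau)  # assumed form of ell_1
    ell[rows, pos] = 0.0                       # exclude the positive pair
    return ell.sum(axis=1) / (K * local_bs - 1)

g1_distributed = np.concatenate([inner_average(k) for k in range(K)])

# Single-process reference on the full batch, for comparison.
img_global = np.concatenate(img_local, axis=0)
sim_all = img_global @ txt_global.T
ell_all = np.exp((sim_all - np.diag(sim_all)[:, None]) / tau)
np.fill_diagonal(ell_all, 0.0)
g1_reference = ell_all.sum(axis=1) / (K * local_bs - 1)

print(np.allclose(g1_distributed, g1_reference))
```

Each worker only needs the gathered text features plus its own anchors, so the per-anchor inner average never has to leave the worker; only the outer average requires a reduction across workers.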

Difference from OpenCLIP. Algorithmically, OpenCLIP does not use the $u$ sequence, which is equivalent to setting $\gamma_t = 1$. In terms of distributed implementation, for computing $G_{\bm{w},1,b}$, OpenCLIP first computes $\frac{1}{\varepsilon + u_{1,i}} \nabla_2 \ell_1(\bm{e}_i, \bm{e}_{2,j}, \tau)$ on the worker where the $i$-th pair resides; worker $k'$ then gathers them using Reduce_Scatter and uses them to compute the inner average. FastCLIP avoids Reduce_Scatter by gathering $u_{1,i}$ using All_Gather and directly computing the inner average on worker $k'$, given that $\{\bm{e}_i\}$ have already been gathered when computing $G_{\bm{w},1,a}$ and $\{u_i\}$.

FastCLIP has the same communication and computation cost for computing $G_{\bm{w},1,a}$ as OpenCLIP, but significantly reduces communication for computing $G_{\bm{w},1,b}$. Specifically, Reduce_Scatter in OpenCLIP requires $\mathcal{O}(K|\mathcal{B}|d)$ communication cost, where $d$ is the feature dimensionality (>512 in practice). In contrast, All_Gather of $u_{1,i}$ in FastCLIP requires only $\mathcal{O}(K|\mathcal{B}|)$ communication since each $u_{1,i}$ is a scalar. This leads to a communication reduction, as verified empirically in Sec. 6.
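The gap between the two costs can be made concrete with a back-of-the-envelope count. The sketch below only counts transferred scalars according to the $\mathcal{O}(\cdot)$ expressions above and ignores constants, latency, and the collective algorithm actually used; the helper name and the values $K=8$, $|\mathcal{B}|=128$, and $d=512$ are illustrative assumptions.

```python
def transferred_scalars(num_workers, local_batch, per_sample_size):
    """Rough count of scalars moved, following O(K * |B| * per_sample_size)."""
    return num_workers * local_batch * per_sample_size

K, local_batch, d = 8, 128, 512  # illustrative values, not measurements

feature_volume = transferred_scalars(K, local_batch, d)  # d-dimensional terms
scalar_volume = transferred_scalars(K, local_batch, 1)   # one scalar u_{1,i} per sample
ratio = feature_volume // scalar_volume

print(feature_volume, scalar_volume, ratio)
```

Under this count the volume shrinks by a factor of $d$, which is why exchanging the scalars $u_{1,i}$ is cheap relative to exchanging $d$-dimensional quantities.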

Table 1: Comparison between different algorithms. In Temperature Scheme, “G” denotes a global temperature parameter, while “I” denotes individualized temperature parameters for each data sample.

| Algorithm | Loss | FCCO | Distributed | Inner LR Schedule | Temperature Scheme |
|---|---|---|---|---|---|
| OpenCLIP [22] | (MBCL) | No | Yes | N/A | G, Learnable |
| SogCLR [56] | (GCL) | Yes | No | Constant | G, Constant |
| iSogCLR [37] | (RGCL) | Yes | No | Constant | I, Learnable |
| FastCLIP-v0 | (GCL) | Yes | Yes | Cosine | G, Learnable |
| FastCLIP-v1 | (GCL) | Yes | Yes | Cosine | G, Constant |
| FastCLIP-v2 | (RGCL) | Yes | Yes | Cosine | I, Learnable |
| FastCLIP-v3 | (RGCL-g) | Yes | Yes | Cosine | G, Learnable |

5 Benchmark of Optimization Components

We benchmark three components of the FastCLIP framework, i.e., the schedule for inner LR $\gamma_t$, the update rule of the temperature parameter, and the optimizer for updating the model parameters.

The Inner LR Schedule: We first explore different schedules for $\gamma_t$ in Eqn. (1), which is interpreted as an SGD step with learning rate (LR) $\gamma_t$ by Wang and Yang [49]. They showed in theory that $\gamma_t$ should be set to a very small value close to 0 in order to guarantee convergence. However, in practice a large $\gamma_t$ value close to 1 is adopted [56]. Ideally, $\gamma_t$ should be large to rely more on the current mini-batch at earlier iterations and be smaller to rely more on history in later iterations. To achieve this, we consider a cosine schedule to decrease $\gamma_t$: Let $t$ be the current iteration, $\hat{E}$ be the number of iterations per epoch and $E$ be the number of decay epochs, then we set $\gamma_t = 0.5 \cdot (1 + \cos(\pi \lfloor t/\hat{E} \rfloor / E)) \cdot (1 - \gamma_{\min}) + \gamma_{\min}$. With this schedule, $\gamma_t$ will decrease from 1.0 to $\gamma_{\min}$.
Note that $\lfloor t/\hat{E} \rfloor$ denotes the current epoch, which means the value of $\gamma_t$ stays unchanged within one epoch. Also, the number of decay epochs $E$ is a hyperparameter, and it is not necessarily equal to the total number of training epochs. If the current epoch exceeds $E$, $\gamma_t$ will be set to $\gamma_{\min}$.
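The schedule described above can be written as a small function. This is a sketch under the stated definitions; the function and argument names are chosen here for readability and are not taken from the released code.

```python
import math

def gamma_schedule(t, iters_per_epoch, decay_epochs, gamma_min):
    """Epoch-wise cosine decay of the inner learning rate from 1.0 to gamma_min.

    t               : current iteration (0-based)
    iters_per_epoch : number of iterations per epoch (E-hat in the text)
    decay_epochs    : number of decay epochs (E in the text)
    gamma_min       : final value of gamma
    """
    epoch = t // iters_per_epoch          # floor(t / E-hat), constant within an epoch
    if epoch >= decay_epochs:             # past the decay phase: stay at gamma_min
        return gamma_min
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / decay_epochs))
    return cosine * (1.0 - gamma_min) + gamma_min

# Example with hypothetical settings: 100 iterations per epoch, 10 decay epochs.
for t in (0, 99, 500, 1000, 2000):
    print(t, round(gamma_schedule(t, 100, 10, 0.2), 4))
```

Because the epoch index is used instead of the raw iteration, all iterations of the first epoch share $\gamma_t = 1.0$, and the value only changes at epoch boundaries.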

The Temperature Parameter Updates: At Line 12 of Alg. 1, the temperature parameter $\tau$ is updated. The update rule is not explicitly provided due to its variety. We consider four different versions, v0 to v3. Specifically, v1 sets $\tau$ to a constant as in SogCLR and the other three view $\tau$ as a learnable parameter; v2 leverages the same $\tau$ update as iSogCLR, which maintains individual temperature parameters for each data sample and updates them using the gradient of (RGCL) w.r.t. $\tau$. A potential issue of maintaining and updating individualized temperatures is that it may overfit the data and hence harm generalization for large-scale data. To mitigate this issue, we also consider the following loss, which unifies the individual temperatures in (RGCL) into a single global one:

$$\min_{\tau \geq \tau_0} \frac{\tau}{|\mathcal{S}|}\sum_{i\in\mathcal{S}} \left( \log\left(\frac{1}{|\mathcal{S}_{i-}|} + g_1(\bm{w},\tau,i,\mathcal{S}_{i-})\right) + \log\left(\frac{1}{|\mathcal{S}_{i-}|} + g_2(\bm{w},\tau,i,\mathcal{S}_{i-})\right) \right) + 2\rho\tau. \tag{RGCL-g}$$

We refer to this version as v3. We also include a baseline version named v0 that updates $\tau$ using the gradient of an unscaled version of (GCL) that does not multiply by $\tau$, similar to existing $\tau$ updates [38, 7] based on MBCL. The explicit rules of all updates are deferred to Proc. 5 in Appendix A. Combining the four versions of updating/setting $\tau$ with the cosine inner LR schedule, we get four algorithms, FastCLIP-v0 to v3. A comparison between them and existing algorithms is shown in Table 1. Different updates of $\tau$ also lead to slightly different ways of computing the contrastive losses and the gradient estimator (the loss computation step and Line 9 in Alg. 1), and the details are deferred to Appendix A.
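For concreteness, the (RGCL-g) objective can be evaluated as follows once the per-sample quantities $g_1$ and $g_2$ are available. This is a sketch only: the function names and toy inputs are assumptions, and clipping $\tau$ at $\tau_0$ is shown as one simple way to respect the constraint $\tau \geq \tau_0$, not necessarily the rule used in Proc. 5.

```python
import numpy as np

def rgcl_g_objective(tau, g1, g2, rho, num_negatives):
    """Value of the (RGCL-g) objective for a single global temperature tau.

    g1, g2        : arrays with the per-sample image and text terms g_1, g_2
    rho           : hyperparameter multiplying the 2 * rho * tau term
    num_negatives : |S_{i-}|, assumed identical for every sample
    """
    eps = 1.0 / num_negatives
    per_sample = np.log(eps + g1) + np.log(eps + g2)
    return tau * per_sample.mean() + 2.0 * rho * tau

def project_tau(tau, tau0):
    """Keep tau inside the feasible set tau >= tau0 (illustrative projection)."""
    return max(tau, tau0)

# Toy inputs (hypothetical): with eps + g = 1 the log terms vanish.
g1 = np.full(4, 0.75)
g2 = np.full(4, 0.75)
value = rgcl_g_objective(tau=0.07, g1=g1, g2=g2, rho=6.5, num_negatives=4)
print(value, project_tau(0.005, tau0=0.01))
```

In the toy input every logarithm is $\log 1 = 0$, so the objective reduces to the term $2\rho\tau$.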

The Optimizer: We compare the performance of two optimizers (i.e., the update rules of the model parameters and the temperature at Lines 11 to 12 in Alg. 1) in FastCLIP: AdamW [32] and LAMB [53]. The update rules of the two optimizers are presented in Proc. 4 in Appendix A for completeness.

Experiment Settings: We conduct experiments in three different settings, which differ in data scales, vision encoders, and training environments. The differences are presented in Table 2. In all settings, we use a 12-layer transformer [48] as the text encoder. All the experiments are conducted in the multi-node setting where each node has 4 GPUs. Due to its extreme size, the xlarge-scale setting is only used to compare the best version of FastCLIP with OpenCLIP.

Metrics: To evaluate the performance of the trained models, we leverage the Datacomp Benchmark [14], which includes 38 zero-shot downstream tasks. The evaluation metric is the average performance, which is called Datacomp. We also report the average performance on two subsets of the tasks: ImageNet and its different variants (IN & Variants), and Retrieval. IN & Variants tasks consist of ImageNet-1k [10] and 6 ImageNet distribution shift datasets [50, 42, 18, 17, 1] [14, Section 3.5]. Retrieval tasks consist of Flickr30k [54], MSCOCO [4], and WinoGAViL [2].

Table 2: Overview of the experiment settings. # Samples denotes the size of the dataset downloaded. Batch Size denotes per-GPU batch size, with global batch size specified in parentheses.
| Setting | Dataset | # Samples/Epochs | Vision Encoder | Batch Size | GPUs |
|---|---|---|---|---|---|
| Medium | CC3M [45] | 2.7M/37 epochs | ResNet50 [16] | 128 (1024) | 8 Tesla T4 |
| Large | CC12M [3] | 9.1M/33 epochs | ViT-B/32 [11] | 256 (2048) | 8 Tesla T4 |
| xLarge | LAION400M [43] | 315M/30 epochs | ViT-B/16 [11] | 320 (5120) | 16 A100 |

5.1 Results

In this subsection, we present the benchmark results. We report results averaged over 3 runs with different seeds, and standard deviation in parentheses. Training details are provided in Appendix B.

The Inner LR Schedule: We first present results of different $\gamma$ schedules. We compare three pairs of approaches: SogCLR and FastCLIP-v1; iSogCLR and FastCLIP-v2; FastCLIP-v3 with constant $\gamma$ and FastCLIP-v3, where the former of each pair uses the constant $\gamma$ schedule and the latter uses the cosine $\gamma$ schedule. SogCLR and iSogCLR are implemented in the same framework as FastCLIP. The results are presented in Table 3. We can observe that all three approaches obtain a significant performance gain when equipped with the cosine schedule. This indicates that the cosine schedule performs better than the constant schedule. Also, when tuning the $\gamma$ value for the two schedules, we observe that the constant schedule favors larger $\gamma$ values (0.6 or 0.8), while the cosine schedule favors a small $\gamma$ value (0.2) (cf. Table 7 in Appendix B). These results suggest that (1) $\gamma$ needs to be set to a small value as the theory predicts, but (2) instead of being constant, its value should gradually decrease.

The Temperature Parameter Updates: Next, we present the benchmark results of different $\tau$ updates. We compare the four versions of FastCLIP. The results are presented in Table 4. We have the following observations. In the medium-scale setting, the average performance on Datacomp of the four algorithms is close. FastCLIP-v3 matches or exceeds the others on both Retrieval and IN & Variants. In the large-scale setting, FastCLIP-v3 outperforms the other algorithms on Datacomp and Retrieval. This demonstrates the effectiveness of FastCLIP-v3. We can also see that FastCLIP-v0 and v2 are competitive with each other, while FastCLIP-v1 is generally worse in this setting.

The Optimizer: We use FastCLIP-v3 as the base algorithm and compare the AdamW and LAMB optimizers. The results are presented in Table 5. We observe that AdamW works better than LAMB in both settings. This indicates that AdamW should be chosen over LAMB for FastCLIP.

6 Scaling Performance of FastCLIP

In this section, we benchmark the performance of FastCLIP using AdamW on different numbers of nodes in comparison with OpenCLIP. We conduct experiments on 1, 2, 4, and 8 node(s). Except for the number of nodes, other settings are kept the same as the experiment settings specified in Section 5. Training details and additional experiment results are provided in Appendix B and C, respectively.

Table 3: Performance of different inner LR schedules. In each pair of rows, the first algorithm uses the constant schedule and the second (shaded in the original table) uses the cosine schedule. Improvement denotes the absolute difference between the two algorithms on the three metrics. v3 (Const. $\gamma$) denotes FastCLIP-v3 with constant $\gamma$ schedule.

| Setting | Algorithm | Datacomp | Retrieval | IN & Variants | Improvement |
|---|---|---|---|---|---|
| Medium | SogCLR | 23.41 (0.34) | 27.48 (0.24) | 16.90 (0.01) | |
| Medium | FastCLIP-v1 | 24.87 (0.13) | 29.28 (0.30) | 18.86 (0.09) | 1.46, 1.80, 1.96 |
| Medium | iSogCLR | 23.35 (0.63) | 27.92 (0.34) | 17.05 (0.14) | |
| Medium | FastCLIP-v2 | 24.10 (0.34) | 29.32 (1.29) | 18.52 (0.37) | 0.75, 1.40, 1.47 |
| Medium | v3 (Const. $\gamma$) | 23.60 (0.18) | 27.68 (0.17) | 17.33 (0.22) | |
| Medium | FastCLIP-v3 | 24.76 (0.26) | 30.36 (0.18) | 19.08 (0.16) | 1.16, 2.68, 1.75 |
| Large | SogCLR | 29.91 (0.23) | 30.16 (0.36) | 22.98 (0.07) | |
| Large | FastCLIP-v1 | 30.65 (0.11) | 32.66 (0.12) | 24.26 (0.06) | 0.74, 2.50, 1.28 |
| Large | iSogCLR | 30.32 (0.18) | 30.27 (0.41) | 24.96 (0.09) | |
| Large | FastCLIP-v2 | 30.94 (0.20) | 31.84 (0.17) | 25.52 (0.17) | 0.62, 1.57, 0.56 |
| Large | v3 (Const. $\gamma$) | 29.46 (0.39) | 30.33 (0.58) | 23.69 (0.09) | |
| Large | FastCLIP-v3 | 31.60 (0.46) | 34.88 (0.28) | 24.78 (0.28) | 2.14, 4.55, 1.09 |

Table 4: Performance of different temperature parameter updates.

| Setting | Algorithm | Datacomp | Retrieval | IN & Variants |
|---|---|---|---|---|
| Medium | FastCLIP-v0 | 24.71 (0.21) | 30.36 (0.26) | 17.50 (0.33) |
| Medium | FastCLIP-v1 | 24.87 (0.13) | 29.28 (0.30) | 18.86 (0.09) |
| Medium | FastCLIP-v2 | 24.21 (0.76) | 30.35 (0.47) | 17.86 (0.21) |
| Medium | FastCLIP-v3 | 24.76 (0.26) | 30.36 (0.18) | 19.08 (0.16) |
| Large | FastCLIP-v0 | 31.47 (0.31) | 34.86 (0.53) | 24.55 (0.21) |
| Large | FastCLIP-v1 | 30.65 (0.11) | 32.66 (0.12) | 24.26 (0.06) |
| Large | FastCLIP-v2 | 30.95 (0.32) | 33.71 (0.20) | 24.94 (0.18) |
| Large | FastCLIP-v3 | 31.60 (0.46) | 34.88 (0.28) | 24.78 (0.28) |

Table 5: Performance of different optimizers. Gap denotes improvements of AdamW on three metrics.

| Setting | Algorithm | Datacomp | Retrieval | IN & Variants | Gap |
|---|---|---|---|---|---|
| Medium | LAMB | 22.63 (0.30) | 24.87 (0.27) | 16.43 (0.06) | |
| Medium | AdamW | 24.76 (0.26) | 30.36 (0.18) | 19.08 (0.16) | 2.13, 5.49, 2.65 |
| Large | LAMB | 30.54 (0.24) | 34.02 (0.26) | 24.11 (0.21) | |
| Large | AdamW | 31.60 (0.46) | 34.88 (0.28) | 24.78 (0.28) | 1.06, 0.86, 0.67 |

Performance: The results of selected models based on the average Datacomp performance are presented in Figure 2. Subfigures (a) and (b) show the IN & Variants and Retrieval performance in the medium-scale setting, and subfigures (c) and (d) show the results in the large-scale setting. We can observe that FastCLIP-v3 consistently outperforms OpenCLIP across different numbers of nodes. This clearly illustrates the advantage of the GCL family over MBCL. Also, FastCLIP-v3’s performance plateaus at 2 nodes, which verifies that FastCLIP does not require a large amount of computing resources. In contrast, OpenCLIP has a significant performance gain when scaling from 2 nodes to 8 nodes, meaning that it requires a large amount of computing resources to obtain good performance. Additionally, Figure 1 demonstrates the significant speedup of FastCLIP-v3 over OpenCLIP.

(a) IN & Variants, Medium
(b) Retrieval, Medium
(c) IN & Variants, Large
(d) Retrieval, Large
Figure 2: Comparison between OpenCLIP and FastCLIP-v3. The numbers in between represent the improvement of FastCLIP-v3 over OpenCLIP.
(a) Total, Medium
(b) Total, Large
(c) Comm., Medium
(d) Comm., Large
Figure 3: Comparison of per-iteration running time (ms) between OpenCLIP and FastCLIP. Each bar in (a), (b) is divided into (top to bottom): computation, communication, and others. Each bar in (c), (d) is divided into (top to bottom): communication-computation overlap and pure communication.
(a) Speedup, Medium
(b) Speedup, Large
(c) ImageNet1k Top1, xLarge
Figure 4: Subfigures (a), (b) present the speedup of different algorithms in the medium and large-scale settings, respectively. Subfigure (c) presents the ImageNet-1k Top1 accuracy curve of OpenCLIP and FastCLIP-v3 in the xLarge-scale setting, with numbers denoting the improvement.

Training Time: Besides the performance on downstream tasks, we also benchmark the training time of OpenCLIP and FastCLIP-v1 to v3. We use the PyTorch [36] Profiler to record the data. We break down per-iteration training time into 3 parts: computation, pure communication (not overlapped with computation), and others. The results are plotted in Figure 3 (a) and (b). We also break down communication into two parts: communication overlapped with computation and pure communication, which are plotted in Figure 3 (c) and (d). From subfigures (a) and (b) we can see that the running time of FastCLIP is similar to that of OpenCLIP when the number of nodes is small (1 and 2), and becomes shorter than that of OpenCLIP when the number of nodes scales up (4 and 8). This is because OpenCLIP has a longer communication time on 4 and 8 nodes (subfigures (c) and (d)), which demonstrates the effectiveness of our efficient gradient computation/communication strategy described in Section 4. For each algorithm, we also plot its speedup over 1 node in terms of training time in Figure 4 (a) and (b). All algorithms have similar speedup over 1 node, and the gap between the ideal speedup (which is the number of nodes) and the real speedup becomes larger when the number of nodes scales up. This indicates that training with more resources has diminishing returns.
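The speedup and its gap to the ideal value can be computed as in the sketch below. The training times are hypothetical placeholders chosen only to illustrate the calculation; they are not measurements from our experiments, and the names are likewise illustrative.

```python
def speedup(time_one_node, time_n_nodes):
    """Speedup of an n-node run relative to the 1-node run."""
    return time_one_node / time_n_nodes

# Hypothetical training times in hours (NOT measurements from the paper).
train_hours = {1: 80.0, 2: 44.0, 4: 25.0, 8: 16.0}

speedups = {n: speedup(train_hours[1], hours) for n, hours in train_hours.items()}
gaps = {n: n - s for n, s in speedups.items()}  # the ideal speedup equals n

for n in sorted(train_hours):
    print(f"{n} node(s): speedup {speedups[n]:.2f}, gap to ideal {gaps[n]:.2f}")
```

With these placeholder numbers the gap grows with the number of nodes, which is the pattern described above as diminishing returns.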

Moreover, we benchmark FastCLIP-v3 and OpenCLIP in the xLarge-scale setting with 4 nodes. We plot the ImageNet-1k Top1 accuracy curves in Figure 4 (c). OpenCLIP achieves a Top1 accuracy of 53.36% on ImageNet-1k, while FastCLIP-v3 achieves 54.92%, a gain of 1.56 percentage points. We also evaluate the average performance on Datacomp, on which FastCLIP-v3 and OpenCLIP perform similarly; the results are shown in Appendix C.

In summary, the results in this section demonstrate the effectiveness of FastCLIP across different data scales (3 million to 315 million image-text pairs) and compute scales (1 to 8 nodes) in the limited-resource setting.

7 Conclusion

In this paper, we have proposed FastCLIP, a distributed framework for training CLIP models in a resource-limited setting. It leverages advanced compositional optimization with a novel gradient computation strategy to reduce the communication cost. We have investigated different optimization components by proposing new techniques and benchmarking different techniques for each component under different settings, which provides insights on which techniques to use. Finally, leveraging the best-performing techniques from the benchmark results, we compare the performance of FastCLIP with OpenCLIP on different data scales and compute scales, from 3 million to 315 million image-text pairs and from 1 node to 8 nodes. The results demonstrate that FastCLIP outperforms OpenCLIP by a large margin and achieves a significant speedup. A limitation of this work is that extensive benchmark results in extremely large-scale settings are lacking due to our limited computing budget.

References

  • Barbu et al. [2019] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/97af07a14cacba681feacf3012730892-Paper.pdf.
  • Bitton et al. [2022] Yonatan Bitton, Nitzan Bitton Guetta, Ron Yosef, Yuval Elovici, Mohit Bansal, Gabriel Stanovsky, and Roy Schwartz. Winogavil: Gamified association benchmark to challenge vision-and-language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 26549–26564. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/a96fe863f85c59789bba63588a9557b4-Paper-Datasets_and_Benchmarks.pdf.
  • Changpinyo et al. [2021] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
  • Chen et al. [2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • Chen et al. [2023a] Yihao Chen, Xianbiao Qi, Jianan Wang, and Lei Zhang. Disco-clip: A distributed contrastive loss for memory efficient clip training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22648–22657, June 2023a.
  • Chen et al. [2023b] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023b.
  • Cherti et al. [2023] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, June 2023.
  • Crowson et al. [2022] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 88–105, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19836-6.
  • Cui et al. [2022] Yufeng Cui, Lichen Zhao, Feng Liang, Yangguang Li, and Jing Shao. Democratizing contrastive language-image pre-training: A clip benchmark of data, model, and supervision. arXiv preprint arXiv:2203.05796, 2022.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
  • Fang et al. [2023] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
  • Fang et al. [2021] Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lijuan Wang, Yezhou Yang, and Zicheng Liu. Compressing visual-linguistic model via knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1428–1438, October 2021.
  • Gadre et al. [2023] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei W Koh, Olga Saukh, Alexander J Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, and Ludwig Schmidt. Datacomp: In search of the next generation of multimodal datasets. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 27092–27112. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/56332d41d55ad7ad8024aac625881be7-Paper-Datasets_and_Benchmarks.pdf.
  • Goel et al. [2022] Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan Rossi, Vishwa Vinay, and Aditya Grover. Cyclip: Cyclic contrastive language-image pretraining. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 6704–6719. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/2cd36d327f33d47b372d4711edd08de0-Paper-Conference.pdf.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hendrycks et al. [2021a] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8340–8349, October 2021a.
  • Hendrycks et al. [2021b] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15262–15271, June 2021b.
  • Hessel et al. [2021] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.595. URL https://aclanthology.org/2021.emnlp-main.595.
  • Huang et al. [2023a] Runhui Huang, Yanxin Long, Jianhua Han, Hang Xu, Xiwen Liang, Chunjing Xu, and Xiaodan Liang. Nlip: Noise-robust language-image pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1):926–934, Jun. 2023a. doi: 10.1609/aaai.v37i1.25172. URL https://ojs.aaai.org/index.php/AAAI/article/view/25172.
  • Huang et al. [2023b] Zizheng Huang, Haoxing Chen, Ziqi Wen, Chao Zhang, Huaxiong Li, Bo Wang, and Chunlin Chen. Model-aware contrastive learning: Towards escaping the dilemmas. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 13774–13790. PMLR, 23–29 Jul 2023b. URL https://proceedings.mlr.press/v202/huang23c.html.
  • Ilharco et al. [2021] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. https://doi.org/10.5281/zenodo.5143773, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
  • Kukleva et al. [2023] Anna Kukleva, Moritz Böhle, Bernt Schiele, Hilde Kuehne, and Christian Rupprecht. Temperature schedules for self-supervised contrastive methods on long-tail data. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ejHUr4nfHhD.
  • Lee et al. [2022] Janghyeon Lee, Jongsuk Kim, Hyounguk Shon, Bumsoo Kim, Seung Hwan Kim, Honglak Lee, and Junmo Kim. Uniclip: Unified framework for contrastive language-image pre-training. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 1008–1019. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/072fd0525592b43da661e254bbaadc27-Paper-Conference.pdf.
  • Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. CoRR, abs/2201.12086, 2022. URL https://arxiv.org/abs/2201.12086.
  • Li et al. [2020] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow., 13(12):3005–3018, aug 2020. ISSN 2150-8097. doi: 10.14778/3415478.3415530. URL https://doi.org/10.14778/3415478.3415530.
  • Li et al. [2023a] Xianhang Li, Zeyu Wang, and Cihang Xie. An inverse scaling law for clip training. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 49068–49087. Curran Associates, Inc., 2023a. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/996e2b446391fcb8bf32a3d1645cc799-Paper-Conference.pdf.
  • Li et al. [2023b] Xuanlin Li, Yunhao Fang, Minghua Liu, Zhan Ling, Zhuowen Tu, and Hao Su. Distilling large vision-language model with out-of-distribution generalizability. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2492–2503, October 2023b.
  • Li et al. [2023c] Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23390–23400, 2023c.
  • Li et al. [2024] Zichao Li, Cihang Xie, and Ekin Dogus Cubuk. Scaling (down) CLIP: A comprehensive analysis of data, architecture, and training strategies. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=t4nnCi5AO6.
  • Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Skq89Scxx.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Mo et al. [2023] Sangwoo Mo, Minkyu Kim, Kyungmin Lee, and Jinwoo Shin. S-clip: Semi-supervised vision-language learning using few specialist captions. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 61187–61212. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/c06f788963f0ce069f5b2dbf83fe7822-Paper-Conference.pdf.
  • Mokady et al. [2021] Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
  • Mu et al. [2022] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 529–544, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19809-0.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
  • Qiu et al. [2023] Zi-Hao Qiu, Quanqi Hu, Zhuoning Yuan, Denny Zhou, Lijun Zhang, and Tianbao Yang. Not all semantics are created equal: Contrastive self-supervised learning with automatic temperature individualization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28389–28421. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/qiu23a.html.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/radford21a.html.
  • Rajbhandari et al. [2020] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16, 2020. doi: 10.1109/SC41405.2020.00024.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Rasley et al. [2020] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450379984. doi: 10.1145/3394486.3406703. URL https://doi.org/10.1145/3394486.3406703.
  • Recht et al. [2019] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5389–5400. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/recht19a.html.
  • Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • Sharma et al. [2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.
  • Sun et al. [2023] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
  • Sun et al. [2024] Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, and Xinlong Wang. Eva-clip-18b: Scaling clip to 18 billion parameters. arXiv preprint arXiv:2402.04252, 2024.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Wang and Yang [2022] Bokun Wang and Tianbao Yang. Finite-sum coupled compositional stochastic optimization: Theory and applications. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 23292–23317. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/wang22ak.html.
  • Wang et al. [2019] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/3eefceb8087e964f89c2d59e8a249915-Paper.pdf.
  • Wu et al. [2023] Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi Stephen Chen, Xinggang Wang, et al. Tinyclip: Clip distillation via affinity mimicking and weight inheritance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21970–21980, 2023.
  • Xie et al. [2023] Chen-Wei Xie, Siyang Sun, Xiong Xiong, Yun Zheng, Deli Zhao, and Jingren Zhou. Ra-clip: Retrieval augmented contrastive language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19265–19274, June 2023.
  • You et al. [2020] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Syx4wnEtvH.
  • Young et al. [2014] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • Yu et al. [2022] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=Ee277P3AYC.
  • Yuan et al. [2022] Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, and Tianbao Yang. Provable stochastic optimization for global contrastive learning: Small batch does not harm performance. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 25760–25782. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/yuan22b.html.
  • Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, October 2023.
  • Zhang et al. [2022] Chaoning Zhang, Kang Zhang, Trung X. Pham, Axi Niu, Zhinan Qiao, Chang D. Yoo, and In So Kweon. Dual temperature helps contrastive learning without many negative samples: Towards understanding and simplifying moco. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14441–14450, June 2022.
  • Zhou et al. [2022] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. Towards language-free training for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17907–17917, 2022.

Appendix A Details of the FastCLIP Framework

/* global, individual $\tau$: temperature scheme (cf. Table 1) */
1 if global $\tau$ then
      2 Compute $g_{1,i}^{t} = g_{1}(\bm{w}^{t}, \tau^{t}, i, \mathcal{B}_{i-}^{t})$, $g_{2,i}^{t} = g_{2}(\bm{w}^{t}, \tau^{t}, i, \mathcal{B}_{i-}^{t})$
3 else if individual $\tau$ then
      4 Compute $g_{1,i}^{t} = g_{1}(\bm{w}^{t}, \tau_{1,i}^{t}, i, \mathcal{B}_{i-}^{t})$, $g_{2,i}^{t} = g_{2}(\bm{w}^{t}, \tau_{2,i}^{t}, i, \mathcal{B}_{i-}^{t})$
Procedure 2 contrastive_loss
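As an illustration of Procedure 2, the following sketch computes the two mini-batch estimators for one sample under both temperature schemes. The exact form of $g_1$ and $g_2$ is not restated in this appendix, so the sketch assumes the usual global-contrastive form: the average over the other samples in the batch of the exponentiated difference between a negative-pair similarity and the positive-pair similarity, divided by the temperature. It also assumes that $\mathcal{B}_{i-}^{t}$ is the batch without sample $i$. The names `dot` and `inner_estimators` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def inner_estimators(img, txt, i, batch, tau_1, tau_2=None):
    """Return (g1, g2) for sample i, estimated on the batch without i.

    Assumed form (an assumption of this sketch, see the lead-in):
      g1 = mean_j exp((<img_i, txt_j> - <img_i, txt_i>) / tau_1)
      g2 = mean_j exp((<img_j, txt_i> - <img_i, txt_i>) / tau_2)
    Global temperature: pass only tau_1 (both estimators share it).
    Individual temperature: pass per-sample tau_1 and tau_2.
    """
    if tau_2 is None:
        tau_2 = tau_1  # global temperature scheme
    negatives = [j for j in batch if j != i]
    pos = dot(img[i], txt[i])
    g1 = sum(math.exp((dot(img[i], txt[j]) - pos) / tau_1)
             for j in negatives) / len(negatives)
    g2 = sum(math.exp((dot(img[j], txt[i]) - pos) / tau_2)
             for j in negatives) / len(negatives)
    return g1, g2


# Toy embeddings: two perfectly aligned, mutually orthogonal pairs.
img = [[1.0, 0.0], [0.0, 1.0]]
txt = [[1.0, 0.0], [0.0, 1.0]]
g_global = inner_estimators(img, txt, i=0, batch=[0, 1], tau_1=0.5)
g_individual = inner_estimators(img, txt, i=0, batch=[0, 1], tau_1=0.5, tau_2=1.0)
```

In the toy example the only negative similarity is 0 and the positive similarity is 1, so each estimator reduces to exp(-1/τ) for the temperature it uses.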
/* global, individual $\tau$: temperature scheme (cf. Table 1) */
1 if global $\tau$ then
      2 Compute $G_{\bm{w},a,k}^{t}$ and $G_{\bm{w},b,k}^{t}$ using (2) and (3), respectively
3 else if individual $\tau$ then
      4 Compute $G_{\bm{w},a,k}^{t}$ and $G_{\bm{w},b,k}^{t}$ using (6) and (7), respectively
Procedure 3 gradient_estimator
Input: Parameter $\theta^{t}$ (can be $\bm{w}$ or $\tau$) and its gradient estimator $G_{\theta}^{t}$, Hyperparameters $\beta_{1}, \beta_{2}, \epsilon$, Weight decay $\lambda$, Learning rate $\eta_{t}$
/* We only consider AdamW and LAMB here. */
1 Compute $m^{t+1} = \beta_{1} m^{t} + (1-\beta_{1}) G_{\theta}^{t}$
2 Compute $v^{t+1} = \beta_{2} v^{t} + (1-\beta_{2}) (G_{\theta}^{t})^{2}$
3 Compute $\hat{m}^{t+1} = m^{t+1} / (1-(\beta_{1})^{t+1})$, $\hat{v}^{t+1} = v^{t+1} / (1-(\beta_{2})^{t+1})$
4 if optimizer is AdamW then
      5 Set $\theta^{t+1} = \theta^{t} - \eta_{t}\left(\hat{m}^{t+1} / (\sqrt{\hat{v}^{t+1}}+\epsilon) + \lambda\theta^{t}\right)$
6 else if optimizer is LAMB then
      7 Compute $r^{t+1} = \hat{m}^{t+1} / (\sqrt{\hat{v}^{t+1}}+\epsilon)$
      8 for each layer $\theta^{t,(i)}$ in $\theta^{t}$ do
            9 Compute $\alpha_{t,(i)} = \|\theta^{t,(i)}\|_{2} / \|r^{t+1,(i)} + \lambda\theta^{t,(i)}\|_{2}$
            10 Set $\theta^{t+1,(i)} = \theta^{t,(i)} - \eta_{t} \cdot \alpha_{t,(i)} \left(r^{t+1,(i)} + \lambda\theta^{t,(i)}\right)$
Procedure 4 parameter_update
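
The two update branches of Procedure 4 can be illustrated with a minimal pure-Python sketch. It assumes the bias-corrected moments $\hat{m}^{t+1}$ and $\hat{v}^{t+1}$ have already been computed, represents each layer as a flat list of floats, and falls back to $\alpha=1$ when a norm is zero; the function names and the fallback are illustrative assumptions, not part of the released code.

```python
import math


def adamw_step(theta, m_hat, v_hat, lr, weight_decay, eps=1e-8):
    """AdamW branch: theta - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * theta)."""
    return [
        p - lr * (m / (math.sqrt(v) + eps) + weight_decay * p)
        for p, m, v in zip(theta, m_hat, v_hat)
    ]


def lamb_step(layers, m_hat_layers, v_hat_layers, lr, weight_decay, eps=1e-8):
    """LAMB branch: the same direction, rescaled per layer by the trust ratio alpha."""
    new_layers = []
    for theta, m_hat, v_hat in zip(layers, m_hat_layers, v_hat_layers):
        r = [m / (math.sqrt(v) + eps) for m, v in zip(m_hat, v_hat)]
        direction = [ri + weight_decay * p for ri, p in zip(r, theta)]
        theta_norm = math.sqrt(sum(p * p for p in theta))
        dir_norm = math.sqrt(sum(d * d for d in direction))
        # Assumption (not in the procedure): use alpha = 1 if either norm is zero.
        alpha = theta_norm / dir_norm if theta_norm > 0 and dir_norm > 0 else 1.0
        new_layers.append([p - lr * alpha * d for p, d in zip(theta, direction)])
    return new_layers
```

With the trust ratio in place, the length of each layer's step equals $\eta_t\|\theta^{t,(i)}\|_2$ regardless of the gradient scale, which is the main behavioral difference from the AdamW branch.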
1 if constant $\tau$ then /* FastCLIP-v1 */
      2 Set $\tau^{t+1}=\tau^{t}$
3 else if learnable $\tau$ then
      4 if loss is (GCL) then /* FastCLIP-v0 */
            5 Compute $G_{\tau,k}^{t}$ using (8) and All_Reduce $G_{\tau}^{t}=\frac{1}{K}\sum_{l=1}^{K}G_{\tau,l}^{t}$
            6 Update $\tau^{t+1}$ from $\tau^{t}$ and $G_{\tau}^{t}$ using Proc. 4 (with $\lambda=0$)

      7 else if loss is (RGCL) then /* FastCLIP-v2 */
            8 Compute $G_{\tau,1,i}^{t},G_{\tau,2,i}^{t}$ for $i\in\mathcal{B}_{k}^{t}$ using (9)
            9 Update $\tau_{1,i}^{t+1}$ from $\tau_{1,i}^{t}$ and $G_{\tau,1,i}^{t}$, and update $\tau_{2,i}^{t+1}$ from $\tau_{2,i}^{t}$ and $G_{\tau,2,i}^{t}$ using Proc. 4 (with $\lambda=0$) for $i\in\mathcal{B}_{k}^{t}$

      10 else if loss is (RGCL-g) then /* FastCLIP-v3 */
            11 Compute $G_{\tau,k}^{t}$ using (10) and All_Reduce $G_{\tau}^{t}=\frac{1}{K}\sum_{l=1}^{K}G_{\tau,l}^{t}$
            12 Update $\tau^{t+1}$ from $\tau^{t}$ and $G_{\tau}^{t}$ using Proc. 4 (with $\lambda=0$)
            
Procedure 5 temperature_update
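
The branching in Procedure 5 can be sketched as follows. The gradients from (8), (9) and (10) are taken as inputs rather than computed, All_Reduce is simulated in-process by averaging a list of per-worker values, and `plain_gradient_step` is a hypothetical placeholder for the Proc. 4 update with $\lambda=0$; all names here are assumptions made for illustration.

```python
def all_reduce_mean(per_worker_values):
    """In-process stand-in for All_Reduce with averaging over K workers."""
    return sum(per_worker_values) / len(per_worker_values)


def plain_gradient_step(tau, grad, lr=0.1):
    """Hypothetical placeholder for the Proc. 4 update with zero weight decay."""
    return tau - lr * grad


def temperature_update(mode, tau, grads, step):
    """
    mode  : "constant" (FastCLIP-v1), "global" (v0 and v3) or "individual" (v2).
    tau   : float, or dict {sample index: (tau_1, tau_2)} for "individual".
    grads : list of per-worker gradients for "global",
            dict {sample index: (g_1, g_2)} over the local batch for "individual".
    step  : callable (tau, grad) -> updated tau.
    """
    if mode == "constant":
        return tau
    if mode == "global":
        # One scalar is shared by all samples, so worker gradients are averaged first.
        return step(tau, all_reduce_mean(grads))
    if mode == "individual":
        # Only temperatures of samples in the local batch are touched.
        updated = dict(tau)
        for i, (g1, g2) in grads.items():
            t1, t2 = tau[i]
            updated[i] = (step(t1, g1), step(t2, g2))
        return updated
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch the individual-temperature branch involves no reduction of temperature gradients, whereas the two global-temperature branches each reduce a single scalar.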

Derivation of the gradient of (GCL) w.r.t. $\bm{w}$: Given a global batch $\mathcal{B}$, the gradient of (GCL) w.r.t. $\bm{w}$ is given by $G_{\bm{w},a}+G_{\bm{w},b}$, where

$$
\begin{aligned}
G_{\bm{w},a}={}&\tau\cdot\frac{1}{K}\sum_{k=1}^{K}\overbrace{\frac{1}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}}\cdot\frac{1}{K}\sum_{k'=1}^{K}\frac{1}{|\mathcal{B}_{k',i-}|}\sum_{j\in\mathcal{B}_{k',i-}}\nabla_{1}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{i}}^{G_{\bm{w},a,1,k}}\\
&+\tau\cdot\frac{1}{K}\sum_{k=1}^{K}\overbrace{\frac{1}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}}\cdot\frac{1}{K}\sum_{k'=1}^{K}\frac{1}{|\mathcal{B}_{k',i-}|}\sum_{j\in\mathcal{B}_{k',i-}}\nabla_{1}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau)\cdot\nabla\bm{e}_{i}}^{G_{\bm{w},a,2,k}}.\\
G_{\bm{w},b}={}&\tau\cdot\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}}\cdot\frac{1}{K}\sum_{k'=1}^{K}\frac{1}{|\mathcal{B}_{k',i-}|}\sum_{j\in\mathcal{B}_{k',i-}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{2,j}\\
&+\tau\cdot\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}}\cdot\frac{1}{K}\sum_{k'=1}^{K}\frac{1}{|\mathcal{B}_{k',i-}|}\sum_{j\in\mathcal{B}_{k',i-}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau)\cdot\nabla\bm{e}_{1,j}.
\end{aligned}
$$

To compute $G_{\bm{w},a}$, we first gather all the $\bm{e}_{2,j}$ and $\bm{e}_{1,j}$ to each worker using All_Gather, then compute $G_{\bm{w},a,1,k}$ and $G_{\bm{w},a,2,k}$ on the $k$-th worker, and finally average $G_{\bm{w},a,1,k}$ and $G_{\bm{w},a,2,k}$ across workers using All_Reduce. To compute $G_{\bm{w},b}$, we first switch the inner and outer averages:
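
The gather, local compute, reduce pattern for the first term of $G_{\bm{w},a}$ can be sketched in-process as follows. This is only an illustration under stated assumptions: embeddings are scalars, `pair_term` is a hypothetical stand-in for $\nabla_{1}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{i}$, `inv_s` stands for $1/|\mathcal{S}_{i-}|$, the collectives are simulated with Python lists, and every local batch holds at least two samples.

```python
def all_gather(shards):
    """Stand-in for All_Gather: every worker receives the list of all shards."""
    return [list(shard) for shard in shards]


def all_reduce_mean(values):
    """Stand-in for All_Reduce with averaging over workers."""
    return sum(values) / len(values)


def pair_term(e_i, e_j):
    """Hypothetical scalar stand-in for grad_1 l_1(e_i, e_{2,j}, tau) * grad e_i."""
    return e_i * e_j


def g_a1_on_worker(k, e1_local, e2_gathered, u1_local, inv_s):
    """G_{w,a,1,k}: uses only worker k's own samples i, but negatives j from all workers."""
    total = 0.0
    for n, (e_i, u_i) in enumerate(zip(e1_local, u1_local)):
        inner = 0.0
        for kp, shard in enumerate(e2_gathered):
            # B_{k',i-}: shard k' with sample i itself removed.
            negatives = [e_j for m, e_j in enumerate(shard) if (kp, m) != (k, n)]
            inner += sum(pair_term(e_i, e_j) for e_j in negatives) / len(negatives)
        inner /= len(e2_gathered)
        total += inner / (inv_s + u_i)
    return total / len(e1_local)


def g_a1(e1, e2, u1, inv_s, tau):
    """First term of G_{w,a}: tau times the worker average of G_{w,a,1,k}."""
    e2_gathered = all_gather(e2)
    per_worker = [
        g_a1_on_worker(k, e1[k], e2_gathered, u1[k], inv_s) for k in range(len(e1))
    ]
    return tau * all_reduce_mean(per_worker)
```

Each worker touches only its own samples $i$ and the gathered features, so the only quantities communicated are the features (once) and the per-worker results (once).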

$$
\begin{aligned}
G_{\bm{w},b}={}&\tau\cdot\frac{1}{K}\sum_{k'=1}^{K}\overbrace{\frac{1}{|\mathcal{B}_{k'}|}\sum_{j\in\mathcal{B}_{k'}}\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{B}_{k,j-}|}\sum_{i\in\mathcal{B}_{k,j-}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{2,j}}^{G_{\bm{w},b,1,k'}}\\
&+\tau\cdot\frac{1}{K}\sum_{k'=1}^{K}\overbrace{\frac{1}{|\mathcal{B}_{k'}|}\sum_{j\in\mathcal{B}_{k'}}\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{B}_{k,j-}|}\sum_{i\in\mathcal{B}_{k,j-}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau)\cdot\nabla\bm{e}_{1,j}}^{G_{\bm{w},b,2,k'}}.
\end{aligned}
$$

Then we gather all the $u_{1,i}$ and $u_{2,i}$ to each worker using All_Gather, compute $G_{\bm{w},b,1,k'}$ and $G_{\bm{w},b,2,k'}$ on the $k'$-th worker, and average $G_{\bm{w},b,1,k'}$ and $G_{\bm{w},b,2,k'}$ across workers using All_Reduce to get $G_{\bm{w},b}$. For practical reasons, we switch the inner and outer averages in $G_{\bm{w},b,1,k'}$ and $G_{\bm{w},b,2,k'}$ once more, so that they can be computed together with $G_{\bm{w},a,1,k}$ and $G_{\bm{w},a,2,k}$ using the same function:

$$
\begin{aligned}
G_{\bm{w},b,1,k'}={}&\frac{1}{|\mathcal{B}_{k'}|}\sum_{j\in\mathcal{B}_{k'}}\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\mathcal{B}_{k,j-}|}\sum_{i\in\mathcal{B}_{k,j-}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{2,j}\\
\overset{(*)}{=}{}&\frac{1}{|\mathcal{B}_{k'}|}\sum_{j\in\mathcal{B}_{k'}}\frac{1}{|\mathcal{B}_{j-}|}\sum_{i\in\mathcal{B}_{j-}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{2,j}\\
={}&\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}}\cdot\frac{1}{|\mathcal{B}_{k'}|}\cdot\frac{|\mathcal{B}|}{|\mathcal{B}_{j-}|}\sum_{j\in\mathcal{B}_{k',i-}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{2,j},
\end{aligned}
$$

where $(*)$ uses the fact that the average over local batches and workers is equal to the average over the global batch. Similarly,

$$
G_{\bm{w},b,2,k'}=\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}}\cdot\frac{1}{|\mathcal{B}_{k'}|}\cdot\frac{|\mathcal{B}|}{|\mathcal{B}_{j-}|}\sum_{j\in\mathcal{B}_{k',i-}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau)\cdot\nabla\bm{e}_{1,j}.
$$
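
The final rearrangement, from the form with $j$ in the outer sum to the form with $i$ in the outer sum, can be checked numerically. The sketch below is a sanity check under assumptions: random scalars `a[i]` stand in for $1/(\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i})$, random scalars `h[i][j]` stand in for $\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)\cdot\nabla\bm{e}_{2,j}$, all $K$ local batches have the same size $b$, and $|\mathcal{B}_{j-}|=|\mathcal{B}|-1$ for every $j$.

```python
import random


def check_swap(K=3, b=4, seed=0):
    """Return (j-outer form, i-outer form) of G_{w,b,1,k'} for every worker k'."""
    rng = random.Random(seed)
    n = K * b  # global batch size |B|
    a = [rng.random() + 0.5 for _ in range(n)]
    h = [[rng.random() for _ in range(n)] for _ in range(n)]
    results = []
    for kp in range(K):
        local = range(kp * b, (kp + 1) * b)  # indices in B_{k'}
        # Outer average over j in B_{k'}, inner average over i in B \ {j}.
        j_outer = sum(
            sum(a[i] * h[i][j] for i in range(n) if i != j) / (n - 1)
            for j in local
        ) / b
        # Outer average over i in B, inner sum over j in B_{k'} \ {i},
        # rescaled by (1 / |B_{k'}|) * (|B| / |B_{j-}|).
        i_outer = sum(
            a[i] * (1.0 / b) * (n / (n - 1)) * sum(h[i][j] for j in local if j != i)
            for i in range(n)
        ) / n
        results.append((j_outer, i_outer))
    return results
```

Both forms sum the same set of pairs $(i,j)$ with $j\in\mathcal{B}_{k'}$ and $i\neq j$, so they agree up to floating-point error for every worker.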

Deferred Computation in Alg. 1: At iteration $t$, for SogCLR and other algorithms with a global temperature parameter (except FastCLIP-v0), the gradient estimator for $\bm{w}$ on the $k$-th worker is computed as

$$
\begin{aligned}
G_{\bm{w},a,k}^{t}=\frac{\tau^{t}}{|\mathcal{B}_{k}^{t}|}\sum_{i\in\mathcal{B}_{k}^{t}}&\left(\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}}\left(\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{1}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau^{t})\cdot\nabla\bm{e}_{i}\right)\right.\\
&\left.+\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}}\left(\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{1}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau^{t})\cdot\nabla\bm{e}_{i}\right)\right).
\end{aligned}
\tag{2}
$$
G𝒘,b,kt=τt|t|itsuperscriptsubscript𝐺𝒘𝑏𝑘𝑡superscript𝜏𝑡superscript𝑡subscript𝑖superscript𝑡\displaystyle G_{\bm{w},b,k}^{t}=\frac{\tau^{t}}{|\mathcal{B}^{t}|}\sum_{i\in% \mathcal{B}^{t}}italic_G start_POSTSUBSCRIPT bold_italic_w , italic_b , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = divide start_ARG italic_τ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG | caligraphic_B start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_B start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT (11|𝒮i|+u1,it+1(1|kt||t||it|jk,it21(𝒆i,𝒆2,j,τt)𝒆2,j)\displaystyle\left(\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}}\left(% \frac{1}{|\mathcal{B}_{k}^{t}|}\cdot\frac{|\mathcal{B}^{t}|}{|\mathcal{B}_{i-}% ^{t}|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2% ,j},\tau^{t})\cdot\nabla\bm{e}_{2,j}\right)\right.( divide start_ARG 1 end_ARG start_ARG divide start_ARG 1 end_ARG start_ARG | caligraphic_S start_POSTSUBSCRIPT italic_i - end_POSTSUBSCRIPT | end_ARG + italic_u start_POSTSUBSCRIPT 1 , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT end_ARG ( divide start_ARG 1 end_ARG start_ARG | caligraphic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT | end_ARG ⋅ divide start_ARG | caligraphic_B start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT | end_ARG start_ARG | caligraphic_B start_POSTSUBSCRIPT italic_i - end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_B start_POSTSUBSCRIPT italic_k , italic_i - end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_e start_POSTSUBSCRIPT 2 , italic_j 
end_POSTSUBSCRIPT , italic_τ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ⋅ ∇ bold_italic_e start_POSTSUBSCRIPT 2 , italic_j end_POSTSUBSCRIPT ) (3)
+11|𝒮i|+u2,it+1(1|kt||t||it|jk,it22(𝒆i,𝒆1,j,τt)𝒆1,j)).\displaystyle\left.+\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}}\left(% \frac{1}{|\mathcal{B}_{k}^{t}|}\cdot\frac{|\mathcal{B}^{t}|}{|\mathcal{B}_{i-}% ^{t}|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1% ,j},\tau^{t})\cdot\nabla\bm{e}_{1,j}\right)\right).+ divide start_ARG 1 end_ARG start_ARG divide start_ARG 1 end_ARG start_ARG | caligraphic_S start_POSTSUBSCRIPT italic_i - end_POSTSUBSCRIPT | end_ARG + italic_u start_POSTSUBSCRIPT 2 , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT end_ARG ( divide start_ARG 1 end_ARG start_ARG | caligraphic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT | end_ARG ⋅ divide start_ARG | caligraphic_B start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT | end_ARG start_ARG | caligraphic_B start_POSTSUBSCRIPT italic_i - end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_B start_POSTSUBSCRIPT italic_k , italic_i - end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_e start_POSTSUBSCRIPT 1 , italic_j end_POSTSUBSCRIPT , italic_τ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ⋅ ∇ bold_italic_e start_POSTSUBSCRIPT 1 , italic_j end_POSTSUBSCRIPT ) ) .

For FastCLIP-v0, we need to remove the factor $\tau^{t}$ at the front:

\begin{align*}
G_{\bm{w},a,k}^{t} = \frac{1}{|\mathcal{B}_{k}^{t}|}\sum_{i\in\mathcal{B}_{k}^{t}}
&\left(\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}}\left(\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{1}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau^{t})\cdot\nabla\bm{e}_{i}\right)\right. \tag{4}\\
&\left.+\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}}\left(\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{1}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau^{t})\cdot\nabla\bm{e}_{i}\right)\right). \\
G_{\bm{w},b,k}^{t} = \frac{1}{|\mathcal{B}^{t}|}\sum_{i\in\mathcal{B}^{t}}
&\left(\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}}\left(\frac{1}{|\mathcal{B}_{k}^{t}|}\cdot\frac{|\mathcal{B}^{t}|}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau^{t})\cdot\nabla\bm{e}_{2,j}\right)\right. \tag{5}\\
&\left.+\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}}\left(\frac{1}{|\mathcal{B}_{k}^{t}|}\cdot\frac{|\mathcal{B}^{t}|}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau^{t})\cdot\nabla\bm{e}_{1,j}\right)\right).
\end{align*}

For iSogCLR and other algorithms with individual temperature parameters, the gradient estimators are computed using a slightly different formula (only the $\tau$ part differs):

\begin{align*}
G_{\bm{w},a,k}^{t} = \frac{1}{|\mathcal{B}_{k}^{t}|}\sum_{i\in\mathcal{B}_{k}^{t}}
&\left(\frac{\tau_{1,i}^{t}}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}}\left(\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{1}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau_{1,i}^{t})\cdot\nabla\bm{e}_{i}\right)\right. \tag{6}\\
&\left.+\frac{\tau_{2,i}^{t}}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}}\left(\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{1}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau_{2,i}^{t})\cdot\nabla\bm{e}_{i}\right)\right). \\
G_{\bm{w},b,k}^{t} = \frac{1}{|\mathcal{B}^{t}|}\sum_{i\in\mathcal{B}^{t}}
&\left(\frac{\tau_{1,i}^{t}}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}}\left(\frac{1}{|\mathcal{B}_{k}^{t}|}\cdot\frac{|\mathcal{B}^{t}|}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau_{1,i}^{t})\cdot\nabla\bm{e}_{2,j}\right)\right. \tag{7}\\
&\left.+\frac{\tau_{2,i}^{t}}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}}\left(\frac{1}{|\mathcal{B}_{k}^{t}|}\cdot\frac{|\mathcal{B}^{t}|}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{k,i-}^{t}}\nabla_{2}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau_{2,i}^{t})\cdot\nabla\bm{e}_{1,j}\right)\right).
\end{align*}

FastCLIP-v0 computes the following gradient estimator for $\tau$:

\begin{align*}
G_{\tau,k}^{t} = {} &\frac{1}{|\mathcal{B}_{k}^{t}|}\sum_{i\in\mathcal{B}_{k}^{t}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}}\cdot\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau^{t}) \tag{8}\\
&+\frac{1}{|\mathcal{B}_{k}^{t}|}\sum_{i\in\mathcal{B}_{k}^{t}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}}\cdot\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau^{t}).
\end{align*}

FastCLIP-v2 computes the following gradient estimators for $\tau$:

\begin{align*}
G_{\tau,1,i}^{t} = {} &\frac{1}{|\mathcal{S}|}\left(\log\left(\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}\right)+\rho+\tau_{1,i}^{t}\cdot\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}}\cdot\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau_{1,i}^{t})\right), \tag{9}\\
G_{\tau,2,i}^{t} = {} &\frac{1}{|\mathcal{S}|}\left(\log\left(\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}\right)+\rho+\tau_{2,i}^{t}\cdot\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}}\cdot\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau_{2,i}^{t})\right).
\end{align*}

FastCLIP-v3 computes the following gradient estimator for $\tau$:

\begin{align*}
G_{\tau,k}^{t} = {} &\frac{1}{|\mathcal{B}_{k}^{t}|}\sum_{i\in\mathcal{B}_{k}^{t}}\left(\log\left(\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}\right)+\log\left(\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}\right)\right)+2\rho \tag{10}\\
&+\tau^{t}\cdot\frac{1}{|\mathcal{B}_{k}^{t}|}\sum_{i\in\mathcal{B}_{k}^{t}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{1,i}^{t+1}}\cdot\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau^{t}) \\
&+\tau^{t}\cdot\frac{1}{|\mathcal{B}_{k}^{t}|}\sum_{i\in\mathcal{B}_{k}^{t}}\frac{1}{\frac{1}{|\mathcal{S}_{i-}|}+u_{2,i}^{t+1}}\cdot\frac{1}{|\mathcal{B}_{i-}^{t}|}\sum_{j\in\mathcal{B}_{i-}^{t}}\nabla_{3}\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau^{t}).
\end{align*}
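To make the structure of Eqs. (8) and (10) concrete, the following is a minimal single-worker sketch, not the released implementation. It rests on several assumptions that are not stated in this appendix: the pairwise losses take the form $\ell_{1}(\bm{e}_{i},\bm{e}_{2,j},\tau)=\exp((s_{i,j}-s_{i,i})/\tau)$ and $\ell_{2}(\bm{e}_{i},\bm{e}_{1,j},\tau)=\exp((s_{j,i}-s_{i,i})/\tau)$ with $s_{i,j}$ the image-text similarity, there is only one worker (so $\mathcal{B}_{k}^{t}=\mathcal{B}^{t}$), and $|\mathcal{S}_{i-}|$ is the same for every $i$. All function and argument names are hypothetical.

```python
import math


def d_tau_ell(s_neg: float, s_pos: float, tau: float) -> float:
    """Derivative w.r.t. tau of exp((s_neg - s_pos) / tau), the ASSUMED form of ell."""
    diff = s_neg - s_pos
    return -diff / tau**2 * math.exp(diff / tau)


def tau_grad_v0(sim, u1, u2, tau, n_minus):
    """Eq. (8) on one worker. sim[i][j] is the similarity of image i and text j;
    u1[i], u2[i] play the role of u_{1,i}^{t+1}, u_{2,i}^{t+1}; n_minus is |S_{i-}|."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]  # B_{i-}: the batch without i
        g1 = sum(d_tau_ell(sim[i][j], sim[i][i], tau) for j in others) / len(others)
        g2 = sum(d_tau_ell(sim[j][i], sim[i][i], tau) for j in others) / len(others)
        total += g1 / (1.0 / n_minus + u1[i]) + g2 / (1.0 / n_minus + u2[i])
    return total / n


def tau_grad_v3(sim, u1, u2, tau, n_minus, rho):
    """Eq. (10): averaged log terms + 2 * rho + tau * (value of Eq. (8))."""
    n = len(sim)
    log_term = sum(
        math.log(1.0 / n_minus + u1[i]) + math.log(1.0 / n_minus + u2[i])
        for i in range(n)
    ) / n
    return log_term + 2.0 * rho + tau * tau_grad_v0(sim, u1, u2, tau, n_minus)
```

Written this way, the sketch makes visible that the last two lines of Eq. (10) are $\tau^{t}$ times the right-hand side of Eq. (8), so the FastCLIP-v3 estimator adds only the log terms and $2\rho$ on top of the FastCLIP-v0 one.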

Appendix B Experiment Hyperparameters

Unless otherwise specified, we use AdamW as the optimizer for both FastCLIP and OpenCLIP. For all settings, we use a cosine learning rate (LR) schedule for updating model parameters, which first linearly increases the LR from 0 to the peak LR in the warmup stage, then decreases the LR to 0 at the end of training following a cosine function. The hyperparameters we use are specified in Table 6. Other hyperparameters regarding the inner learning rate schedule, temperature parameter updates, and the LAMB optimizer are introduced in the paragraphs that follow.
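The schedule described above can be sketched as follows. This is an illustrative sketch under assumptions: the function name and arguments are hypothetical, steps are 0-indexed, and the exact per-step indexing used in the actual training code may differ.

```python
import math


def cosine_lr_with_warmup(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Learning rate at iteration `step` (0-indexed; an assumption of this sketch)."""
    if step < warmup_steps:
        # Warmup stage: linear increase from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # After warmup: cosine decay from peak_lr down to 0 at step == total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

For example, with `peak_lr=1e-3` and `warmup_steps=10_000`, the LR reaches its peak at iteration 10,000 and falls to half the peak midway through the remaining iterations.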

Table 6: Hyperparameters for different settings. $\beta_1$, $\beta_2$, and $\epsilon$ are hyperparameters of the AdamW optimizer. lr denotes the learning rate, wd the weight decay, and warmup the number of iterations in the warmup stage.

| Setting | $\beta_1$ | $\beta_2$ | $\epsilon$ | lr | wd | warmup |
|---|---|---|---|---|---|---|
| Medium | 0.9 | 0.999 | 1e-8 | 1e-3 | 0.1 | 10k |
| Large | 0.9 | 0.98 | 1e-6 | 4e-4 | 0.1 | 10k |
| xLarge | 0.9 | 0.98 | 1e-6 | 2e-4 | 0.2 | 13k |
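
The warmup-then-cosine schedule described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `cosine_lr` and the value of `total_steps` are assumptions made for the example, while the peak LR and warmup length are the Medium values from Table 6.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay to 0 at total_steps."""
    if step < warmup_steps:
        # Warmup stage: the LR grows linearly with the iteration index.
        return peak_lr * step / warmup_steps
    # Decay stage: progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Medium setting from Table 6: peak LR 1e-3 and 10k warmup iterations.
# TOTAL_STEPS is a hypothetical training length chosen only for illustration.
TOTAL_STEPS = 100_000
for step in (0, 5_000, 10_000, 55_000, 100_000):
    lr = cosine_lr(step, TOTAL_STEPS, peak_lr=1e-3, warmup_steps=10_000)
    print(f"step {step:>7}: lr = {lr:.6f}")
```

The LR peaks exactly at the end of warmup and reaches 0 at the final iteration, matching the description of the schedule.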

The Inner LR Schedule: We compare three pairs of approaches: SogCLR and FastCLIP-v1; iSogCLR and FastCLIP-v2; and FastCLIP-v3 with constant $\gamma$ and FastCLIP-v3. The former of each pair uses a constant $\gamma$ schedule and the latter uses a cosine $\gamma$ schedule; the two approaches in each pair differ only in the $\gamma$ schedule. For approaches using a constant $\gamma$ schedule, we tune $\gamma$ in $\{0.2, 0.4, 0.6, 0.8\}$. For approaches using a cosine $\gamma$ schedule, we tune $\gamma_{\mathrm{min}}$ (the value to which $\gamma$ decays at the end) in $\{0.2, 0.6\}$ and the number of decay epochs in $\{50\%, 100\%\}$ of the number of training epochs. The $\gamma$ values for each algorithm are presented in Table 7. Other hyperparameters are kept the same within each pair. For SogCLR and FastCLIP-v1, we set the temperature parameter to 0.03. For iSogCLR and FastCLIP-v2, we set the initial temperature parameter to 0.03, $\rho$ to 9.0, and the learning rate of $\tau$ to 1e-2. For FastCLIP-v3 with constant $\gamma$ schedule and FastCLIP-v3, we set the initial temperature parameter to 0.07, $\rho$ to 6.5 in the medium-scale setting and 8.5 in the large-scale setting, and the learning rate of $\tau$ to 2e-4 in the medium-scale setting and 1e-4 in the large-scale setting. For FastCLIP-v3, the learning rate of $\tau$ decays to 1/3 of its original value when $\tau$ becomes smaller than 0.03.

Table 7: Values of $\gamma$ for different schedules in different settings. For the cosine $\gamma$ schedule, we report the $\gamma_{\mathrm{min}}$ value along with the number of $\gamma$ decay epochs $E$ (cf. Section 5). v3 (Const. $\gamma$) denotes FastCLIP-v3 with constant $\gamma$ schedule.

| Setting | Algorithm (constant $\gamma$) | $\gamma$ | Algorithm (cosine $\gamma$) | $\gamma_{\mathrm{min}}$, $E$ |
|---|---|---|---|---|
| Medium | SogCLR | 0.6 | FastCLIP-v1 | 0.2, 18 |
| Medium | iSogCLR | 0.6 | FastCLIP-v2 | 0.2, 18 |
| Medium | v3 (Const. $\gamma$) | 0.6 | FastCLIP-v3 | 0.2, 18 |
| Large | SogCLR | 0.6 | FastCLIP-v1 | 0.2, 16 |
| Large | iSogCLR | 0.8 | FastCLIP-v2 | 0.6, 16 |
| Large | v3 (Const. $\gamma$) | 0.6 | FastCLIP-v3 | 0.2, 16 |
| xLarge | - | - | FastCLIP-v3 | 0.8, 10 |
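
To make the pair $(\gamma_{\mathrm{min}}, E)$ in Table 7 concrete, the following is a minimal sketch of a cosine $\gamma$ schedule. The exact rule is defined in Section 5 and is not reproduced in this appendix, so the sketch rests on assumptions: $\gamma$ starts at 1.0, is updated once per epoch, follows a half-cosine down to $\gamma_{\mathrm{min}}$ over $E$ epochs, and stays at $\gamma_{\mathrm{min}}$ afterwards. The function name `cosine_gamma` and the parameter `gamma_max` are hypothetical.

```python
import math


def cosine_gamma(epoch: int, decay_epochs: int, gamma_min: float,
                 gamma_max: float = 1.0) -> float:
    """Half-cosine decay of gamma from gamma_max to gamma_min over decay_epochs.

    Assumption: gamma is held at gamma_min once the decay epochs are over.
    """
    if epoch >= decay_epochs:
        return gamma_min
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * epoch / decay_epochs))
    return gamma_min + (gamma_max - gamma_min) * cos_factor


# Medium-scale FastCLIP-v3 entry of Table 7: gamma_min = 0.2, E = 18.
for epoch in (0, 9, 18, 30):
    print(f"epoch {epoch:>2}: gamma = {cosine_gamma(epoch, 18, 0.2):.3f}")
```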

The Temperature Parameter Updates: For all algorithms, we use a cosine $\gamma$ schedule with $\gamma_{\mathrm{min}} = 0.2$ and decay epochs $E$ equal to 50% of the number of training epochs, and we tune the initial temperature parameter in $\{0.03, 0.05, 0.07\}$. For FastCLIP-v2 and -v3, we tune $\rho$ in $[6.0, 9.0]$ and the learning rate of $\tau$ in [1e-4, 1e-2]. Other hyperparameters are kept the same across the four algorithms. The tuned initial temperature is 0.07 for FastCLIP-v3 and 0.03 for the other algorithms. The $\rho$ values are presented in Table 8. For FastCLIP-v2, the tuned learning rate of $\tau$ is 1e-2 in the medium-scale setting and 1e-4 in the large-scale setting. For FastCLIP-v3, the tuned learning rate of $\tau$ is 2e-4 in the medium-scale setting and 1e-4 in the large-scale setting. For FastCLIP-v3, the learning rate of $\tau$ decays to 1/3 of its original value when $\tau$ becomes smaller than 0.03.

Table 8: Value of $\rho$ for FastCLIP-v2 and -v3 in different settings.

| Algorithm | Medium | Large | xLarge |
|---|---|---|---|
| FastCLIP-v2 | 7.0 | 8.5 | - |
| FastCLIP-v3 | 6.5 | 8.5 | 16.0 |
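
The threshold rule for the learning rate of $\tau$ in FastCLIP-v3 can be sketched as below. This is an illustrative reading of the sentence above, not the released code: the function name `tau_learning_rate` is hypothetical, and the sketch assumes the rule is stateless, so the rate would return to its original value if $\tau$ rose above the threshold again. The text does not say whether the reduction is permanent.

```python
def tau_learning_rate(tau: float, base_lr: float,
                      threshold: float = 0.03, factor: float = 1.0 / 3.0) -> float:
    """Return the learning rate of tau given its current value.

    Assumption: the reduced rate applies whenever tau < threshold (stateless).
    """
    return base_lr * factor if tau < threshold else base_lr


# Medium-scale FastCLIP-v3: initial tau 0.07, learning rate of tau 2e-4.
for tau in (0.07, 0.03, 0.029):
    print(f"tau = {tau:.3f}: lr = {tau_learning_rate(tau, 2e-4):.2e}")
```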

The Optimizer: We use FastCLIP-v3 as the base algorithm. For both optimizers, we tune the learning rate of model parameters in [4e-5, 4e-3] and the weight decay in [0.01, 0.2]. Other hyperparameters are kept the same as in Temperature Parameter Updates. The tuned learning rate of model parameters for AdamW is 1e-3 in the medium-scale setting and 4e-4 in the large-scale setting, and the tuned weight decay is 0.1. The tuned learning rate of $\bm{w}$ for LAMB is 2e-3, and the tuned weight decay is 0.1. Following OpenCLIP, we set the weight decay of the temperature parameter to 0. Following EVA-CLIP [46], in our implementation of LAMB we set $\alpha$ at Line 4 of Proc. 4 to 1.0 when updating the temperature parameter, which yields the same update as AdamW.
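
The remark that fixing $\alpha$ to 1.0 recovers the AdamW update can be checked on a scalar parameter. Proc. 4 is not reproduced in this appendix, so the sketch below assumes that $\alpha$ is the usual LAMB trust ratio (parameter norm divided by update norm) that rescales an Adam-style update. All function names are hypothetical and the gradient value is made up for illustration.

```python
import math


def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.98, eps=1e-6):
    """Return the bias-corrected Adam direction and the updated moments."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adamw_step(w, grad, m, v, step, lr, wd):
    """One AdamW step on a scalar parameter with decoupled weight decay."""
    direction, m, v = adam_direction(grad, m, v, step)
    return w - lr * (direction + wd * w), m, v


def lamb_step(w, grad, m, v, step, lr, wd, alpha=None):
    """One LAMB-style step; alpha=None uses the trust ratio |w| / |update|."""
    direction, m, v = adam_direction(grad, m, v, step)
    update = direction + wd * w
    if alpha is None:
        # Assumed trust ratio, falling back to 1.0 when either norm is zero.
        alpha = abs(w) / abs(update) if w != 0.0 and update != 0.0 else 1.0
    return w - lr * alpha * update, m, v


# Temperature parameter: weight decay 0, illustrative gradient 0.5.
tau = 0.07
tau_adamw, _, _ = adamw_step(tau, grad=0.5, m=0.0, v=0.0, step=1, lr=2e-4, wd=0.0)
tau_lamb_fixed, _, _ = lamb_step(tau, grad=0.5, m=0.0, v=0.0, step=1, lr=2e-4, wd=0.0, alpha=1.0)
tau_lamb_trust, _, _ = lamb_step(tau, grad=0.5, m=0.0, v=0.0, step=1, lr=2e-4, wd=0.0)
print(tau_adamw, tau_lamb_fixed, tau_lamb_trust)
```

Under these assumptions the step with `alpha=1.0` coincides with the AdamW step, whereas the trust-ratio step is scaled by roughly the magnitude of $\tau$ itself, which is small for a temperature near 0.07.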

Scaling Performance: We tune the learning rate of model parameters of OpenCLIP on 2 nodes in the medium-scale and large-scale settings in [4e-5, 4e-3], and on 4 nodes in the xlarge-scale setting in [4e-5, 4e-4]. The tuned learning rate of model parameters of OpenCLIP is 1e-3, 4e-4 and 2e-4 in the medium-scale, large-scale and xlarge-scale settings, respectively. Other hyperparameters are set according to Tables 6 to 8. In the xlarge-scale setting, we set the learning rate of model parameters of FastCLIP-v3 to the same value as OpenCLIP. For different numbers of nodes in the medium-scale and large-scale settings, we scale the learning rates of the model parameters and the temperature parameter linearly in proportion to the global batch size and keep other hyperparameters unchanged. For FastCLIP-v3 in the xlarge-scale setting, we set $\rho$ to 16.0 and the learning rate of the temperature parameter to 5e-5, and we use a cosine $\gamma$ schedule with $\gamma_{\mathrm{min}} = 0.8$ and decay epochs $E = 10$.
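
The linear scaling rule for the medium-scale and large-scale settings can be illustrated as follows. This is a sketch under stated assumptions: the 2-node run (global batch size 2048 in the large-scale setting, as in Figure 5) is taken as the reference point, the per-node batch size is assumed constant at 1024, and the helper name `scale_linearly` is hypothetical.

```python
def scale_linearly(base_value: float, base_batch_size: int, batch_size: int) -> float:
    """Scale a learning rate in proportion to the global batch size."""
    return base_value * batch_size / base_batch_size


# Large-scale FastCLIP-v3 values from this appendix, tuned on 2 nodes.
BASE_BATCH = 2048   # assumed reference global batch size (2 nodes)
BASE_LR = 4e-4      # learning rate of model parameters
BASE_TAU_LR = 1e-4  # learning rate of the temperature parameter

for nodes in (1, 2, 4, 8):
    batch = 1024 * nodes  # assumption: 1024 samples per node
    lr = scale_linearly(BASE_LR, BASE_BATCH, batch)
    tau_lr = scale_linearly(BASE_TAU_LR, BASE_BATCH, batch)
    print(f"{nodes} node(s), batch {batch}: lr = {lr:.1e}, tau lr = {tau_lr:.1e}")
```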

Choice of $\gamma_{\mathrm{min}}$ in the xlarge-scale setting: In the xlarge-scale setting we use a larger $\gamma_{\mathrm{min}}$ value than in the medium-scale and large-scale settings, because we find that the batch size affects how $\gamma_{\mathrm{min}}$ should be set. To illustrate this, we conduct two sets of experiments in the large-scale setting, on 2 nodes and 8 nodes respectively. Each set runs FastCLIP-v3 with different $\gamma_{\mathrm{min}}$ values. The results are plotted in Figure 5. Comparing a larger $\gamma_{\mathrm{min}}$ (0.8) with a smaller one (0.2) in the same setting, we find that training can be split into three stages. In the first stage, the two runs perform similarly. In the second stage, the larger $\gamma_{\mathrm{min}}$ outperforms the smaller one. In the last stage, the smaller one catches up with the larger one and outperforms it. From Figure 5 we can also observe that the second stage becomes longer with a larger global batch size. Note that in the medium-scale and large-scale settings we use a global batch size of 1024 and 2048, respectively, while we set it to 5120 in the xlarge-scale setting. We also conjecture that the second stage becomes longer as the data scales up, though we did not validate this due to resource limits. The large batch size and large data scale in the xlarge-scale setting motivate our use of a larger $\gamma_{\mathrm{min}}$ value than in the medium-scale and large-scale settings.

[Figure 5 panels: (a) 2 Nodes, Batch size 2048; (b) 8 Nodes, Batch size 8192]
Figure 5: Datacomp average performance of FastCLIP-v3 with $\gamma$ decay epochs 16 (145 million samples seen) and different $\gamma_{\mathrm{min}}$ in the large-scale setting. Batch size denotes global batch size. The vertical dashed lines divide each plot into three parts (cf. Choice of $\gamma_{\mathrm{min}}$ in the xlarge-scale setting in Appendix B).

Appendix C More Experiment Results

C.1 Optimization Components

We plot the Datacomp average performance curves of different algorithms with constant $\gamma$ schedule and cosine $\gamma$ schedule in Figure 6, which corresponds to Table 5 in Section 5. We plot the Datacomp average performance curves of algorithms with different temperature updates in Figure 7 (a) and (b), which correspond to Table 5 in Section 5. We plot the Datacomp average performance curves of FastCLIP-v3 with the AdamW and LAMB optimizers in Figure 7 (c) and (d), which correspond to Table 5 in Section 5.

[Figure 6 panels: (a) Medium, FastCLIP-v1; (b) Medium, FastCLIP-v2; (c) Medium, FastCLIP-v3; (d) Large, FastCLIP-v1; (e) Large, FastCLIP-v2; (f) Large, FastCLIP-v3]
Figure 6: Datacomp performance of algorithms with constant $\gamma$ schedule and cosine $\gamma$ schedule. v3 (Const. $\gamma$) denotes FastCLIP-v3 with constant $\gamma$ schedule.

[Figure 7 panels: (a) Temperature, Medium; (b) Temperature, Large; (c) Optimizer, Medium; (d) Optimizer, Large]
Figure 7: Subfigures (a) and (b) present the Datacomp performance of algorithms with different temperature parameter updates in the medium-scale and large-scale settings, respectively. Subfigures (c) and (d) present the Datacomp performance of FastCLIP-v3 with different optimizers in the medium-scale and large-scale settings, respectively.

C.2 Scaling Performance

In this subsection we provide more results to complement the figures in Section 6.

Table 9: Datacomp average performance of OpenCLIP and FastCLIP-v3 trained on different numbers of nodes. Improvement denotes the absolute difference between FastCLIP-v3 and OpenCLIP.

| Setting | Algorithm | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes |
|---|---|---|---|---|---|
| Medium | OpenCLIP | 21.82 (0.59) | 21.84 (0.23) | 21.65 (0.13) | 22.22 (0.37) |
| Medium | FastCLIP-v3 | 24.54 (0.25) | 24.76 (0.26) | 24.43 (0.20) | 25.23 (0.28) |
| Medium | Improvement | 2.72 | 2.92 | 2.78 | 3.01 |
| Large | OpenCLIP | 27.55 (0.46) | 27.91 (0.73) | 28.93 (0.29) | 28.75 (0.59) |
| Large | FastCLIP-v3 | 30.81 (0.38) | 31.60 (0.46) | 31.65 (0.13) | 31.45 (0.32) |
| Large | Improvement | 3.26 | 3.69 | 2.72 | 2.70 |

Table 10: Retrieval performance of OpenCLIP and FastCLIP-v3 trained on different numbers of nodes. Improvement denotes the absolute difference between FastCLIP-v3 and OpenCLIP.

| Setting | Algorithm | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes |
|---|---|---|---|---|---|
| Medium | OpenCLIP | 24.07 (0.16) | 25.20 (0.22) | 25.07 (0.26) | 26.20 (0.10) |
| Medium | FastCLIP-v3 | 30.02 (0.57) | 30.36 (0.18) | 30.42 (0.24) | 30.42 (0.24) |
| Medium | Improvement | 5.95 | 5.16 | 5.35 | 4.22 |
| Large | OpenCLIP | 29.17 (0.17) | 29.58 (0.62) | 30.25 (0.31) | 30.87 (0.11) |
| Large | FastCLIP-v3 | 33.90 (0.28) | 34.88 (0.28) | 34.91 (0.16) | 34.74 (0.31) |
| Large | Improvement | 4.73 | 5.30 | 4.66 | 3.87 |

Table 11: ImageNet & Variants accuracy of OpenCLIP and FastCLIP-v3 trained on different numbers of nodes. Improvement denotes the absolute difference between FastCLIP-v3 and OpenCLIP.

| Setting | Algorithm | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes |
|---|---|---|---|---|---|
| Medium | OpenCLIP | 14.16 (0.11) | 14.73 (0.22) | 15.24 (0.26) | 16.03 (0.23) |
| Medium | FastCLIP-v3 | 18.37 (0.26) | 19.08 (0.16) | 19.21 (0.18) | 19.20 (0.16) |
| Medium | Improvement | 4.21 | 4.35 | 3.97 | 3.17 |
| Large | OpenCLIP | 20.51 (0.14) | 21.08 (0.09) | 22.32 (0.23) | 22.77 (0.14) |
| Large | FastCLIP-v3 | 23.76 (0.38) | 24.78 (0.28) | 24.79 (0.20) | 24.93 (0.16) |
| Large | Improvement | 3.25 | 3.70 | 2.47 | 2.16 |
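
As a small worked example, the Improvement row can be recomputed from the two rows above it. The numbers below are the medium-scale mean accuracies copied from Table 11. The variable names are illustrative only.

```python
# Medium-scale ImageNet & Variants accuracy from Table 11 (1, 2, 4, 8 nodes).
openclip = [14.16, 14.73, 15.24, 16.03]
fastclip_v3 = [18.37, 19.08, 19.21, 19.20]

# Improvement is the absolute difference FastCLIP-v3 minus OpenCLIP,
# rounded to two decimals as reported in the table.
improvement = [round(f - o, 2) for f, o in zip(fastclip_v3, openclip)]
print(improvement)
```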

[Figure 8 panels: (a) Datacomp, Medium; (b) Datacomp, Large; (c) Datacomp, xLarge]
Figure 8: Datacomp average performance of OpenCLIP and FastCLIP-v3 in different settings. Subfigures (a) and (b) present the results in the medium-scale and large-scale settings, with numbers denoting the improvement of FastCLIP-v3 over OpenCLIP. Subfigure (c) presents the results in the xlarge-scale setting.

Performance of OpenCLIP and FastCLIP-v3: The data to plot Figure 4 is presented in Table 10 and Table 11. We also provide the Datacomp performance in Table 9. The Datacomp performance of OpenCLIP and FastCLIP-v3 in the xlarge-scale setting is plotted in Figure 8.

Table 12: Comparison between OpenCLIP and FastCLIP-v3 in terms of training time in the medium-scale setting. For each category, the first row is from OpenCLIP and the second row (shaded in the original table) is from FastCLIP-v3. Computation denotes the whole computation time. Communication denotes the whole communication time. Pure Comm. denotes the communication time that is not overlapped with computation. Overlap denotes the overlapped time between computation and communication.

| Category | Algorithm | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes |
|---|---|---|---|---|---|
| Total | OpenCLIP | 867.85 (11.04) | 880.19 (53.45) | 925.47 (27.77) | 1049.90 (32.44) |
| Total | FastCLIP-v3 | 866.36 (5.89) | 879.91 (52.17) | 917.54 (25.46) | 1028.06 (32.26) |
| Computation | OpenCLIP | 770.57 (6.10) | 738.87 (21.58) | 726.07 (1.53) | 742.93 (15.91) |
| Computation | FastCLIP-v3 | 771.80 (5.53) | 737.93 (21.73) | 725.40 (2.01) | 742.90 (15.90) |
| Communication | OpenCLIP | 222.01 (4.43) | 403.40 (130.80) | 548.07 (60.97) | 698.87 (26.24) |
| Communication | FastCLIP-v3 | 223.34 (5.51) | 400.76 (125.78) | 536.15 (59.29) | 675.43 (25.97) |
| Pure Comm. | OpenCLIP | 27.18 (1.61) | 68.74 (25.45) | 127.39 (30.29) | 224.71 (16.05) |
| Pure Comm. | FastCLIP-v3 | 25.50 (2.24) | 64.32 (22.47) | 116.21 (28.48) | 200.97 (15.58) |
| Overlap | OpenCLIP | 194.84 (2.88) | 334.66 (105.36) | 420.68 (30.80) | 474.16 (10.23) |
| Overlap | FastCLIP-v3 | 197.84 (3.65) | 336.44 (103.35) | 419.94 (30.83) | 474.46 (10.41) |
| Others | OpenCLIP | 70.09 (8.17) | 72.58 (6.59) | 72.01 (2.73) | 82.26 (0.93) |
| Others | FastCLIP-v3 | 69.06 (1.67) | 77.66 (8.14) | 75.93 (2.83) | 84.19 (0.86) |

Table 13: Comparison between OpenCLIP and FastCLIP-v3 in terms of training time in the large-scale setting. For each category, the first row is from OpenCLIP and the second row (shaded in the original table) is from FastCLIP-v3. Computation denotes the whole computation time. Communication denotes the whole communication time. Pure Comm. denotes the communication time that is not overlapped with computation. Overlap denotes the overlapped time between computation and communication.

| Category | Algorithm | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes |
|---|---|---|---|---|---|
| Total | OpenCLIP | 1125.29 (14.14) | 1234.06 (151.37) | 1396.76 (47.86) | 1564.46 (47.92) |
| Total | FastCLIP-v3 | 1128.75 (9.75) | 1234.82 (153.86) | 1394.91 (48.35) | 1542.32 (47.87) |
| Computation | OpenCLIP | 960.14 (12.00) | 910.77 (10.48) | 891.71 (6.09) | 896.54 (8.02) |
| Computation | FastCLIP-v3 | 964.16 (9.10) | 910.94 (11.55) | 892.72 (4.72) | 897.59 (9.09) |
| Communication | OpenCLIP | 360.34 (15.55) | 655.30 (175.45) | 876.13 (71.52) | 1061.52 (55.08) |
| Communication | FastCLIP-v3 | 363.38 (16.66) | 652.78 (173.41) | 870.01 (69.56) | 1035.03 (56.84) |
| Pure Comm. | OpenCLIP | 56.73 (4.09) | 192.89 (129.45) | 379.10 (58.13) | 525.78 (57.22) |
| Pure Comm. | FastCLIP-v3 | 55.44 (2.23) | 190.56 (127.48) | 371.30 (55.62) | 498.95 (59.72) |
| Overlap | OpenCLIP | 303.62 (14.70) | 462.41 (46.02) | 497.02 (13.45) | 535.74 (2.33) |
| Overlap | FastCLIP-v3 | 307.94 (18.14) | 462.22 (45.93) | 498.71 (13.97) | 536.08 (2.99) |
| Others | OpenCLIP | 108.42 (5.54) | 130.40 (12.26) | 125.95 (5.57) | 142.14 (2.08) |
| Others | FastCLIP-v3 | 109.14 (2.67) | 133.33 (15.30) | 130.89 (4.34) | 145.78 (3.13) |

Training Time of OpenCLIP and FastCLIP-v3: We present the training time breakdown of OpenCLIP and FastCLIP-v3 in Tables 12 and 13 for the medium-scale and large-scale settings, respectively. As the number of nodes scales up, the computation times of OpenCLIP and FastCLIP-v3 remain close to each other, while the gap in communication time grows much larger, which is also depicted in subfigures (c) and (d). Even if we exclude the part of communication that overlaps with computation, the gap in pure communication still grows with the number of nodes, and thus FastCLIP-v3 has a shorter running time on 4 and 8 nodes.
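
The categories in the breakdown appear to be related by Total ≈ Computation + Pure Comm. + Others and Communication ≈ Pure Comm. + Overlap. The sketch below checks this on the 8-node medium-scale means from Table 12. This decomposition is inferred from the reported numbers rather than stated in the text, and the sketch assumes that the first row of each category is OpenCLIP and the second is FastCLIP-v3. The dictionary keys and the helper `residuals` are hypothetical names.

```python
# 8-node means from Table 12 (medium-scale setting).
# Assumption: first row of each category is OpenCLIP, second is FastCLIP-v3.
breakdown = {
    "OpenCLIP": {"total": 1049.90, "computation": 742.93, "communication": 698.87,
                 "pure_comm": 224.71, "overlap": 474.16, "others": 82.26},
    "FastCLIP-v3": {"total": 1028.06, "computation": 742.90, "communication": 675.43,
                    "pure_comm": 200.97, "overlap": 474.46, "others": 84.19},
}


def residuals(t):
    """Return how far the reported totals are from the assumed decomposition."""
    total_gap = t["total"] - (t["computation"] + t["pure_comm"] + t["others"])
    comm_gap = t["communication"] - (t["pure_comm"] + t["overlap"])
    return total_gap, comm_gap


for name, t in breakdown.items():
    total_gap, comm_gap = residuals(t)
    print(f"{name}: total residual {total_gap:+.2f}, communication residual {comm_gap:+.2f}")

# Difference between the two methods at 8 nodes.
saved_total = breakdown["OpenCLIP"]["total"] - breakdown["FastCLIP-v3"]["total"]
saved_pure_comm = breakdown["OpenCLIP"]["pure_comm"] - breakdown["FastCLIP-v3"]["pure_comm"]
print(f"total time saved: {saved_total:.2f}, pure communication saved: {saved_pure_comm:.2f}")
```

Under this reading, the reduction in total time at 8 nodes is of the same size as the reduction in non-overlapped communication, in line with the observation that computation time is nearly identical for the two methods.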