Lightweight Model Pre-training via Language Guided Knowledge Distillation

Mingsheng Li*, Lin Zhang*, Mingzhen Zhu*, Zilong Huang, Gang Yu, Jiayuan Fan, , and Tao Chen Mingsheng Li, Lin Zhang, Mingzhen Zhu, and Tao Chen are with the School of Information Science and Technology, Fudan University, Shanghai 200433, China. E-mail: [email protected]; [email protected]; [email protected]; [email protected]. Zilong Huang and Gang Yu are with the Tencent GY-Laboratory, Shanghai 200000, China. E-mail: [email protected]; [email protected]. Jiayuan Fan is with the Academy for Engineering and Technology, Fudan University, Shanghai 200433, China. E-mail: [email protected]. Corresponding author. *Equal Contribution.
Abstract

This paper studies the problem of pre-training for small models, which is essential for many mobile devices. Current state-of-the-art methods on this problem transfer the representational knowledge of a large network (as a Teacher) into a smaller model (as a Student) using self-supervised distillation, improving the performance of the small model on downstream tasks. However, existing approaches are insufficient in extracting the crucial knowledge that is useful for discerning categories in downstream tasks during the distillation process. In this paper, for the first time, we introduce language guidance to the distillation process and propose a new method named Language-Guided Distillation (LGD) system, which uses category names of the target downstream task to help refine the knowledge transferred between the teacher and student. To this end, we utilize a pre-trained text encoder to extract semantic embeddings from language and construct a textual semantic space called Textual Semantics Bank (TSB). Furthermore, we design a Language-Guided Knowledge Aggregation (LGKA) module to construct the visual semantic space, also named Visual Semantics Bank (VSB). The task-related knowledge is transferred by driving a student encoder to mimic the similarity score distribution inferred by a teacher over TSB and VSB. Compared with other small models obtained by either ImageNet pre-training or self-supervised distillation, experiment results show that the distilled lightweight model using the proposed LGD method presents state-of-the-art performance and is validated on various downstream tasks, including classification, detection, and segmentation. We have made the code available at https://github.com/mZhenz/LGD.

Index Terms:
Lightweight model pre-training, language-guided distillation, textual semantics bank, visual semantics banks

I Introduction

Recently, the study of pre-training lightweight (small) models with both a small number of parameters and fast inference speed receives increasing attention [1, 2, 3]. Among existing small model pre-training methods, the self-supervised distillation (SSD) [4, 5, 6] which improves the pre-training performance of small models with a distillation signal from a pre-trained large model, has appeared as a promising solution to this problem, as this pipeline can save the overhead of image labeling and meanwhile maintain reasonable performance. Hence we follow this pipeline to design a small model pre-training approach in this work.

Refer to caption
Figure 1: (a) Comparison of previous self-supervised distillation (SSD) and the proposed Language-Guided Distillation (LGD) on Imagenet-1k, see more details in sec. IV-B2. (b) Comparison of using different input texts for LGD on ImageNet-1k, then fine-tuning on downstream tasks, e.g., classification on Caltech-256, and segmentation on Cityscapes and ADE20K. Using task-related category names as the input language for knowledge distillation could bring constant improvements, see more details in sec. IV-C2.

Considering the smaller capacity, it is difficult for the student model to fully mimic the outputs of the large teacher model. In other words, the student model may not learn all knowledge of the teacher model. An ideal solution is to let the student learn partial but essential knowledge of the teacher, such as the knowledge that is most useful for distinguishing categories in the target tasks. Unfortunately, existing SSD methods such as SEED [4] and CompRess [7] do not consider this. Inspired by the success of Contrastive Language-Image Pre-training (CLIP) [8], which reflects the value of natural language for enhancing the visual semantic features, we propose to distill the essential knowledge of the teacher with the help of language. The reason is that the input language could be built on the information of the target task, e.g., the category names in the classification, segmentation, or detection task. Thus, the textual semantic space is also relevant to the target task and can naturally help to identify the most relevant knowledge in the teacher for distillation. Specifically, we make use of the language with the pre-trained text encoder to extract semantic embeddings and build a textual semantic space, also named Textual Semantics Bank (TSB). Utilizing the proposed TSB to enhance distillation and align the outputs of the student and teacher in the target-related textual semantic space can help the lightweight student learn more knowledge of the teacher.

However, the textual semantic feature extracted from the pre-trained text encoder may not be consistent with the corresponding visual feature extracted by the teacher model, which could result in performance degradation in the distillation process. To alleviate this issue, we design a Language-Guided Knowledge Aggregation (LGKA) module to build visual semantic space, also named Visual Semantics Bank (VSB). The VSB shares the same shape as TSB. During the training process, the LGKA module takes a visual feature extracted from the teacher and the TSB as inputs. Then the visual feature is used to momentum update the corresponding feature in the VSB. The correspondence is determined by the similarity between the input visual feature and the TSB. Thus, the VSB is closer to the real distribution of visual semantics than the TSB. The outputs of the student and teacher will be aligned in both the visual and textual semantic spaces.

Therefore, we design two language-guided loss functions. The first is based on the output consistency constraint of the teacher and student in the textual semantic space, and the second is based on consistency loss after map** the outputs of the teacher and student to the visual semantic space. It is worth noting that the TSB is fixed, while the VSB is continually updated during the training process.

To this end, we propose a Language-Guided Distillation (LGD) framework, which pre-trains a small model by integrating language guidance into the distillation process. Despite the language guidance used, we do not use labels, which is the same setting as SSD, but only use unlabeled images and self-determined texts as source data. As shown in Fig. 1, distilling the same teacher model, the proposed LGD could achieve much better results on the ImageNet-1K dataset with the help of input language. Besides, we could take the corresponding category names according to the target downstream task as language guidance for distilling processing, rather than the fixed one, e.g. category names of ImageNet-1K. It also could bring constant improvements to downstream tasks.

In summary, the main contributions of this paper are summarized below:

  • We propose a novel Language-Guided Distillation (LGD) framework, which is the first attempt to introduce language guidance into the distillation process to pre-train the small model.

  • A Language-Guided Knowledge Aggregation (LGKA) module is developed, which uses language to guide teachers on how to structure the knowledge they pass to students and constrain the learning scope to improve the learning effect. Meanwhile, two losses are designed to maintain the consistency of both the image-to-image and image-to-text relationships between teacher and student features.

  • Thorough experiments are conducted on six datasets and six downstream tasks, which shows the small model pre-trained by the proposed method has better transferability.

The remainder of this paper is organized as follows. Section II reviews the related works on small model pre-training, vision-language model pre-training, and knowledge distillation respectively. In Section III, we present an overview of the whole framework, including language-guided knowledge aggregation, language-aware knowledge transfer, and optimization objectives. In Section IV, extensive experiments and analysis are presented to validate the effectiveness of the proposed LGD. Finally, conclusions are drawn in Section V.

II Related Work

II-A Small Model Pre-training

As a conventional model pre-training way, fully-supervised pre-training [9] trains a classification network to predict a fixed set of predetermined object categories on a large-scale annotated dataset like ImageNet [10]. To alleviate the massive overhead of annotation, self-supervised learning (SSL) [11, 12, 13, 14, 15, 16, 17, 18, 19] aims to dig out good feature representation through the relationship between data samples without labels. However, because of the significant performance drop when the model size decreases, SSD [4, 20, 5, 1] is proposed to improve the SSL performance on a small model. Specifically, SSD transfers the learned feature representation from an off-the-shelf pre-trained large model to the student. However, previous SSD methods do not consider the gap in feature representation ability between teacher and student, and they directly transfer all knowledge from the teacher to the student. In this paper, we introduce language guidance to distillation process and use language to constrain the learning scope of the small model.

II-B Vision-Language Learning

During the past few years, vision-language learning has attracted growing attention [8, 21, 22, 23, 24, 25, 26]. As a milestone, CLIP [8] learns high-quality image representations by a simple image text pairing task on a dataset of 400 million (image, text) pairs collected from the internet. Motivated by CLIP, several works have emerged to improve the training strategy [27, 28, 29] or apply the CLIP method to other domains [30, 31]. However, previous methods directly adopt the CLIP pre-trained large model to downstream tasks, which is unsuitable for practical applications due to the significant computation overhead. In this paper, for the first time, we transfer the visual and textual representations learned by the CLIP pre-trained large model to the small model through distillation, which reveals a new application of the CLIP pre-trained model.

II-C Knowledge Distillation

Knowledge distillation usually transfers knowledge from a large teacher to a small student. Previous methods mainly fall into three streams: response-based [32, 33, 34], feature-based [35, 36, 37], and relation-based  [38, 39]. Previous works [4, 20, 5] have demonstrated that relation-based SSD outperforms response-based and feature-based distillation strategies for self-supervised contrastive pre-trained teacher models, which targets modeling the relationship between features of different samples. Similar to SSL works, CLIP also adopts a contrastive learning strategy. But the difference is that CLIP not only models the relationship between image and image but also between image and language (text), which previous works have not considered. In this paper, we first adopt language guidance to knowledge distillation, which effectively transfers the relation knowledge of image-to-image and image-to-text with language guidance.

SSD v.s. LGD. Here we make a comparison of the self-supervised distillation with the proposed language-guided distillation. Similarly, they all do not use any label information for distillation. They all support the self-supervised pre-trained model (MoCo [12]), weakly-supervised pre-trained model (CLIP [8]), and fully supervised pre-trained model as the teacher model.  Differently, LGD takes the self-determined category names (always using the category names from the downstream task) as language guidance, it does not need extra manual labor to annotate. Thus, we think it is fair to compare LGD with other SSD methods. Besides, to evaluate the generalization ability of the proposed method, we evaluate our method with the weakly-supervised pre-trained text model (CLIP text encoder [8]) and self-supervised pre-trained text model (BERT [40]).

III The Proposed Method

Refer to caption
Figure 2: The overview of the proposed Language-Guided Distillation (LGD) framework. First, all the textual features are extracted by feeding [text prompt] + [class] into a pre-trained text encoder and stored in the Textual Semantics Bank (TSB) at one time. Besides, visual features are extracted by feeding unlabeled images into a pre-trained image encoder (teacher) and the student. Then, a Language-Guided Knowledge Aggregation (LGKA) module is developed to classify the visual features of each batch by the textual anchors, and maintain a Visual Semantics Bank (VSB) to store centroid features of different categories by momentum updating. At last, to align the feature similarity between the teacher/student feature and anchor in both textual and visual space, the cross-entropy loss is adopted to constrain the visual and textual similarity distribution between teacher and student. Best viewed in color.

III-A Overview

The overall framework of the proposed LGD is shown in Fig. 2, which consists of two major modules: Language-Guided Knowledge Aggregation (LGKA) and Language-aware Knowledge Transfer.

Firstly, textual semantic embeddings are extracted by feeding [text prompt] + [class] (e.g. ”a photo of cat”) into the pre-trained text encoder and stored in the Textual Semantics Bank (TSB), denoted as L𝐿Litalic_L. Here, the semantic embeddings can be regarded as the clustering centers of the self-determined categories. Following CLIP [8], the prompt engineering and feature ensembling are adopted here so that there is only one textual semantic embedding for each semantic category. Note that the [class] can be determined by the knowledge required on downstream task, and textual semantic embeddings only need to be extracted once. Similar to previous SSD works [4, 6], unlabeled images after data augmentation are fed into the image encoder (teacher) and student network.

As mentioned earlier, the textual semantic embeddings extracted from the pre-trained text encoder may not be consistent with the corresponding visual feature extracted by the teacher model. So, the LGKA module is developed to guide the teacher to build the Visual Semantics Bank (VSB), denoted as V𝑉Vitalic_V, which has the same shape as TSB. For each input, the teacher’s visual feature is classified by the TSB and is used to momentum update the VSB. Since the input [class] is self-determined, the user can specify the texts containing different semantic categories’ knowledge to supervise the teacher to transfer the knowledge to the student.

Lastly, to align the teacher and student in both visual and textual space, the similarity of the outputs of the teacher and student with the feature in TSB and VSB is calculated as the consistency constraint in both appearance and semantic space.

III-B Language-Guided Knowledge Aggregation

In order to transfer related knowledge from the teacher to the student, the LGKA module is developed to constrain the learning scope. Specifically, we first treat textual semantics bank L𝐿Litalic_L as a classifier and classify the visual feature zIsubscript𝑧𝐼z_{I}italic_z start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT extracted by the teacher for each batch, which can be formulated as

θ=argmax(zIL),𝜃𝑎𝑟𝑔𝑚𝑎𝑥subscript𝑧𝐼𝐿\theta=argmax(z_{I}\cdot L),italic_θ = italic_a italic_r italic_g italic_m italic_a italic_x ( italic_z start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ⋅ italic_L ) , (1)

where zIB×Dsubscript𝑧𝐼superscript𝐵𝐷z_{I}\in\mathbb{R}^{B\times D}italic_z start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_B × italic_D end_POSTSUPERSCRIPT, LD×C𝐿superscript𝐷𝐶L\in\mathbb{R}^{D\times C}italic_L ∈ blackboard_R start_POSTSUPERSCRIPT italic_D × italic_C end_POSTSUPERSCRIPT and θB×1𝜃superscript𝐵1\theta\in\mathbb{R}^{B\times 1}italic_θ ∈ blackboard_R start_POSTSUPERSCRIPT italic_B × 1 end_POSTSUPERSCRIPT. The B𝐵Bitalic_B, D𝐷Ditalic_D, C𝐶Citalic_C and θ𝜃\thetaitalic_θ denote batch size, #channels of the output of the text encoder, the number of self-determined categories, and the classification results of the teacher’s visual feature, respectively. Note that the categories are customized depending on the user given [class]. By calculating the similarity, each image feature will be classified into the most semantically similar category.

There may be more than one feature in a batch zIsubscript𝑧𝐼z_{I}italic_z start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT that belongs to the same category, then the average vectors are computed, which can be formulated as

zCi=mean(zI[θ=i]),superscriptsubscript𝑧𝐶𝑖𝑚𝑒𝑎𝑛subscript𝑧𝐼delimited-[]𝜃𝑖z_{C}^{i}=mean(z_{I}[\theta=i]),italic_z start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = italic_m italic_e italic_a italic_n ( italic_z start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT [ italic_θ = italic_i ] ) , (2)

where zCD×Csubscript𝑧𝐶superscript𝐷𝐶z_{C}\in\mathbb{R}^{D\times C}italic_z start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D × italic_C end_POSTSUPERSCRIPT denotes the centroid feature of each semantic category in this batch and i[0,C1]𝑖0𝐶1i\in[0,C-1]italic_i ∈ [ 0 , italic_C - 1 ]. Note that only when one category is contained in the batch, its centroid feature is computed and the visual anchor for this category in VSB is updated.

At last, the centroid feature of each semantic category zCsubscript𝑧𝐶z_{C}italic_z start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT will be added to the VSB by momentum updating, which can be formulated as

{VimVi+(1m)zCi,VizCi(init.),\left\{\begin{array}[]{l}V^{i}\leftarrow mV^{i}+(1-m)z_{C}^{i},\\ V^{i}\leftarrow z_{C}^{i}(init.),\end{array}\right.{ start_ARRAY start_ROW start_CELL italic_V start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ← italic_m italic_V start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT + ( 1 - italic_m ) italic_z start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , end_CELL end_ROW start_ROW start_CELL italic_V start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ← italic_z start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( italic_i italic_n italic_i italic_t . ) , end_CELL end_ROW end_ARRAY (3)

where VD×C𝑉superscript𝐷𝐶V\in\mathbb{R}^{D\times C}italic_V ∈ blackboard_R start_POSTSUPERSCRIPT italic_D × italic_C end_POSTSUPERSCRIPT. Similar to [12], m𝑚mitalic_m is a momentum coefficient and set to 0.999 as default. When updating the visual anchor of one semantic category in VSB for the first time, we directly replace V𝑉Vitalic_V with zCsubscript𝑧𝐶z_{C}italic_z start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT instead of momentum updating it, which can produce a better initial point as compared with updating all classes’ features in VSB from randomly initialized points.

The proposed VSB has two advantages compared with the previous method. 1) Memory-friendly. Previous SSD methods [7, 4, 5] need to maintain a large queue with enough anchor points to get satisfactory results. Similar to MoCo [12], the queue length is 65536, containing a large number of similar features for the same semantic category, consuming a lot of memory. The proposed VSB only saves the anchor feature for each semantic category, which is more memory-efficient. 2) Data-efficient. Since previous SSD methods [7, 4] maintain a queue to store features, they can only save features in the neighboring few batches. On the contrary, the proposed VSB saves whole dataset-aware features through momentum updating, showing higher data efficiency.

III-C Language-aware Knowledge Transfer

As mentioned before, the language information is used to supervise the optimization of the whole model from both visual and textual space, to be detailed below.

Visual Space Alignment Loss. After LGKA, the VSB maintains the visual anchor for measuring the image-to-image similarity. To align the output of the teacher and the student in visual space, the cosine similarity scores between teacher/student feature zT/zSsubscript𝑧𝑇subscript𝑧𝑆z_{T}/z_{S}italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT / italic_z start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT and V𝑉Vitalic_V are calculated first. And similar to [4], the teacher feature is added at the end of VSB to directly align the student with the teacher. Therefore, the modified VSB is V=[V0,,VC1,VC]superscript𝑉superscript𝑉0superscript𝑉𝐶1superscript𝑉𝐶V^{\prime}=[V^{0},\cdots,V^{C-1},V^{C}]italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = [ italic_V start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , ⋯ , italic_V start_POSTSUPERSCRIPT italic_C - 1 end_POSTSUPERSCRIPT , italic_V start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT ] with VC=zTisuperscript𝑉𝐶superscriptsubscript𝑧𝑇𝑖V^{C}=z_{T}^{i}italic_V start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT = italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT. The calculation of the similarity score can be formulated as

sTVisuperscriptsubscript𝑠𝑇𝑉𝑖\displaystyle s_{T-V}^{i}italic_s start_POSTSUBSCRIPT italic_T - italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT =exp(zTiV/τT)jexp(zTiVj/τT),absent𝑒𝑥𝑝superscriptsubscript𝑧𝑇𝑖superscript𝑉subscript𝜏𝑇subscript𝑗𝑒𝑥𝑝superscriptsubscript𝑧𝑇𝑖superscript𝑉𝑗subscript𝜏𝑇\displaystyle=\frac{exp(z_{T}^{i}\cdot V^{\prime}/\tau_{T})}{\sum_{j}exp(z_{T}% ^{i}\cdot V^{\prime j}/\tau_{T})},= divide start_ARG italic_e italic_x italic_p ( italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT / italic_τ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e italic_x italic_p ( italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_V start_POSTSUPERSCRIPT ′ italic_j end_POSTSUPERSCRIPT / italic_τ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) end_ARG , (4)
sSVisuperscriptsubscript𝑠𝑆𝑉𝑖\displaystyle s_{S-V}^{i}italic_s start_POSTSUBSCRIPT italic_S - italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT =exp(zSiV/τS)jexp(zSiVj/τS),absent𝑒𝑥𝑝superscriptsubscript𝑧𝑆𝑖superscript𝑉subscript𝜏𝑆subscript𝑗𝑒𝑥𝑝superscriptsubscript𝑧𝑆𝑖superscript𝑉𝑗subscript𝜏𝑆\displaystyle=\frac{exp(z_{S}^{i}\cdot V^{\prime}/\tau_{S})}{\sum_{j}exp(z_{S}% ^{i}\cdot V^{\prime j}/\tau_{S})},= divide start_ARG italic_e italic_x italic_p ( italic_z start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT / italic_τ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e italic_x italic_p ( italic_z start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_V start_POSTSUPERSCRIPT ′ italic_j end_POSTSUPERSCRIPT / italic_τ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT ) end_ARG , (5)

where sTV(C+1)×Bsubscript𝑠𝑇𝑉superscript𝐶1𝐵s_{T-V}\in\mathbb{R}^{(C+1)\times B}italic_s start_POSTSUBSCRIPT italic_T - italic_V end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT ( italic_C + 1 ) × italic_B end_POSTSUPERSCRIPT and sSV(C+1)×Bsubscript𝑠𝑆𝑉superscript𝐶1𝐵s_{S-V}\in\mathbb{R}^{(C+1)\times B}italic_s start_POSTSUBSCRIPT italic_S - italic_V end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT ( italic_C + 1 ) × italic_B end_POSTSUPERSCRIPT. And τTsubscript𝜏𝑇\tau_{T}italic_τ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT and τSsubscript𝜏𝑆\tau_{S}italic_τ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT is the temperature parameter.

The proposed visual space align loss can be formulated as the cross entropy between sTVsubscript𝑠𝑇𝑉s_{T-V}italic_s start_POSTSUBSCRIPT italic_T - italic_V end_POSTSUBSCRIPT and sSVsubscript𝑠𝑆𝑉s_{S-V}italic_s start_POSTSUBSCRIPT italic_S - italic_V end_POSTSUBSCRIPT, which can be formulated as

LVISsubscript𝐿𝑉𝐼𝑆\displaystyle L_{VIS}italic_L start_POSTSUBSCRIPT italic_V italic_I italic_S end_POSTSUBSCRIPT =i=0B1sTVilogsSVi,absentsuperscriptsubscript𝑖0𝐵1superscriptsubscript𝑠𝑇𝑉𝑖𝑙𝑜𝑔superscriptsubscript𝑠𝑆𝑉𝑖\displaystyle=-\sum_{i=0}^{B-1}s_{T-V}^{i}\cdot logs_{S-V}^{i},= - ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B - 1 end_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_T - italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_l italic_o italic_g italic_s start_POSTSUBSCRIPT italic_S - italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , (6)

Textual Space Alignment Loss. When language information is introduced, the output of the student should be aligned with the teacher’s output not only in visual space but also in textual space. Specifically, to this end, this loss directly uses textual anchors in TSB to construct the image-to-language relation, which can be formulated as

sTLisuperscriptsubscript𝑠𝑇𝐿𝑖\displaystyle s_{T-L}^{i}italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT =exp(zTiL/τT)jexp(zTiLj/τT),absent𝑒𝑥𝑝superscriptsubscript𝑧𝑇𝑖𝐿subscript𝜏𝑇subscript𝑗𝑒𝑥𝑝superscriptsubscript𝑧𝑇𝑖superscript𝐿𝑗subscript𝜏𝑇\displaystyle=\frac{exp(z_{T}^{i}\cdot L/\tau_{T})}{\sum_{j}exp(z_{T}^{i}\cdot L% ^{j}/\tau_{T})},= divide start_ARG italic_e italic_x italic_p ( italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_L / italic_τ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e italic_x italic_p ( italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_L start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT / italic_τ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) end_ARG , (7)
sSLisuperscriptsubscript𝑠𝑆𝐿𝑖\displaystyle s_{S-L}^{i}italic_s start_POSTSUBSCRIPT italic_S - italic_L end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT =exp(zSiL/τS)jexp(zSiLj/τS),absent𝑒𝑥𝑝superscriptsubscript𝑧𝑆𝑖𝐿subscript𝜏𝑆subscript𝑗𝑒𝑥𝑝superscriptsubscript𝑧𝑆𝑖superscript𝐿𝑗subscript𝜏𝑆\displaystyle=\frac{exp(z_{S}^{i}\cdot L/\tau_{S})}{\sum_{j}exp(z_{S}^{i}\cdot L% ^{j}/\tau_{S})},= divide start_ARG italic_e italic_x italic_p ( italic_z start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_L / italic_τ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e italic_x italic_p ( italic_z start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_L start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT / italic_τ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT ) end_ARG , (8)
LTEXsubscript𝐿𝑇𝐸𝑋\displaystyle L_{TEX}italic_L start_POSTSUBSCRIPT italic_T italic_E italic_X end_POSTSUBSCRIPT =i=0B1sTLilogsSLi,absentsuperscriptsubscript𝑖0𝐵1superscriptsubscript𝑠𝑇𝐿𝑖𝑙𝑜𝑔superscriptsubscript𝑠𝑆𝐿𝑖\displaystyle=-\sum_{i=0}^{B-1}s_{T-L}^{i}\cdot logs_{S-L}^{i},= - ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B - 1 end_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_l italic_o italic_g italic_s start_POSTSUBSCRIPT italic_S - italic_L end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , (9)

where sTLC×Bsubscript𝑠𝑇𝐿superscript𝐶𝐵s_{T-L}\in\mathbb{R}^{C\times B}italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_B end_POSTSUPERSCRIPT and sSLC×Bsubscript𝑠𝑆𝐿superscript𝐶𝐵s_{S-L}\in\mathbb{R}^{C\times B}italic_s start_POSTSUBSCRIPT italic_S - italic_L end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_B end_POSTSUPERSCRIPT denote the similarity score between extracted feature zT/zSsubscript𝑧𝑇subscript𝑧𝑆z_{T}/z_{S}italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT / italic_z start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT and zLsubscript𝑧𝐿z_{L}italic_z start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT. And τTsubscript𝜏𝑇\tau_{T}italic_τ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT and τSsubscript𝜏𝑆\tau_{S}italic_τ start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT is the temperature parameter.

In sum, the total loss is denoted as:

LLGD=αLVIS+(1α)LTEX,subscript𝐿𝐿𝐺𝐷𝛼subscript𝐿𝑉𝐼𝑆1𝛼subscript𝐿𝑇𝐸𝑋L_{LGD}=\alpha L_{VIS}+(1-\alpha)L_{TEX},italic_L start_POSTSUBSCRIPT italic_L italic_G italic_D end_POSTSUBSCRIPT = italic_α italic_L start_POSTSUBSCRIPT italic_V italic_I italic_S end_POSTSUBSCRIPT + ( 1 - italic_α ) italic_L start_POSTSUBSCRIPT italic_T italic_E italic_X end_POSTSUBSCRIPT , (10)

where α𝛼\alphaitalic_α is a hyper-parameter to balance the LVISsubscript𝐿𝑉𝐼𝑆L_{VIS}italic_L start_POSTSUBSCRIPT italic_V italic_I italic_S end_POSTSUBSCRIPT and LTEXsubscript𝐿𝑇𝐸𝑋L_{TEX}italic_L start_POSTSUBSCRIPT italic_T italic_E italic_X end_POSTSUBSCRIPT, which sets to 0.5 in our default.

III-D Generalize to More Image Encoders

To verify the generalization of the proposed LGD, we attempt to extend the LGD to other pre-trained image encoders, such as an SSL model, by making corresponding changes to solve two problems.

The first problem is the mismatch of feature dimensions between the image encoder and text encoder. For example, the representation dimension of MoCo [12] and SimSiam [13] is 128-D, while CLIP is 1024-D. Therefore, we add a learnable MLP layer after the L𝐿Litalic_L to downsample the dimension of L𝐿Litalic_L from 1024 to 128. The second one is mode collapse. From the formula of KL divergence, we can derive that:

LTEX=DKL(sTL||sSL)+H(sTL),L_{TEX}=D_{KL}(s_{T-L}||s_{S-L})+H(s_{T-L}),italic_L start_POSTSUBSCRIPT italic_T italic_E italic_X end_POSTSUBSCRIPT = italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT | | italic_s start_POSTSUBSCRIPT italic_S - italic_L end_POSTSUBSCRIPT ) + italic_H ( italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT ) , (11)

where DKLsubscript𝐷𝐾𝐿D_{KL}italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT is the KL-Divergence between similarity scores of T-L and S-L, and H(sTL)𝐻subscript𝑠𝑇𝐿H(s_{T-L})italic_H ( italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT ) denotes the entropy of sTLsubscript𝑠𝑇𝐿s_{T-L}italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT. When we minimize LTEXsubscript𝐿𝑇𝐸𝑋L_{TEX}italic_L start_POSTSUBSCRIPT italic_T italic_E italic_X end_POSTSUBSCRIPT as before, since the TSB L𝐿Litalic_L is constant and zTsubscript𝑧𝑇z_{T}italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT is learnable (due to the added MLP layer to solve the first problem), zTsubscript𝑧𝑇z_{T}italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT is easy to fall into the suboptimal solution 𝟎0\bf{0}bold_0 to make the entropy of sTLsubscript𝑠𝑇𝐿s_{T-L}italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT minimal. To solve this problem, we let the difference between TL𝑇𝐿T-Litalic_T - italic_L and SL𝑆𝐿S-Litalic_S - italic_L mimic the difference between TV𝑇𝑉T-Vitalic_T - italic_V. The total loss can be formulated as

LLGD=αi=0B1sTVilogsSVi+αi=0B1sTVilogsTLi+αi=0B1sTVilogsSLi,subscript𝐿𝐿𝐺𝐷𝛼superscriptsubscript𝑖0𝐵1superscriptsubscript𝑠𝑇𝑉𝑖𝑙𝑜𝑔superscriptsubscript𝑠𝑆𝑉𝑖𝛼superscriptsubscript𝑖0𝐵1superscriptsubscript𝑠𝑇𝑉𝑖𝑙𝑜𝑔superscriptsubscript𝑠𝑇𝐿𝑖𝛼superscriptsubscript𝑖0𝐵1superscriptsubscript𝑠𝑇𝑉𝑖𝑙𝑜𝑔superscriptsubscript𝑠𝑆𝐿𝑖\begin{split}L_{LGD}=&\alpha\sum_{i=0}^{B-1}-s_{T-V}^{i}\cdot logs_{S-V}^{i}\\ &+\alpha\sum_{i=0}^{B-1}-s_{T-V}^{i}\cdot logs_{T-L}^{i}\\ &+\alpha\sum_{i=0}^{B-1}-s_{T-V}^{i}\cdot logs_{S-L}^{i},\end{split}start_ROW start_CELL italic_L start_POSTSUBSCRIPT italic_L italic_G italic_D end_POSTSUBSCRIPT = end_CELL start_CELL italic_α ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B - 1 end_POSTSUPERSCRIPT - italic_s start_POSTSUBSCRIPT italic_T - italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_l italic_o italic_g italic_s start_POSTSUBSCRIPT italic_S - italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + italic_α ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B - 1 end_POSTSUPERSCRIPT - italic_s start_POSTSUBSCRIPT italic_T - italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_l italic_o italic_g italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + italic_α ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B - 1 end_POSTSUPERSCRIPT - italic_s start_POSTSUBSCRIPT italic_T - italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_l italic_o italic_g italic_s start_POSTSUBSCRIPT italic_S - italic_L end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , end_CELL end_ROW (12)

where α𝛼\alphaitalic_α is 0.33 in our default. We use the same α𝛼\alphaitalic_α for different parts to keep a balanced contribution from each term in the total loss function. In this manner, since zTsubscript𝑧𝑇z_{T}italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT is learnable, the VSB V𝑉Vitalic_V is not constant, then 𝟎0\bf{0}bold_0 will no longer be a suboptimal solution making the loss function falling into a local minima. Additionally, learning with this loss function can minimize the KL-Divergence between similarity scores of T-V and T-L/S-L, thus achieving the original goal of driving the similarity score between zSsubscript𝑧𝑆z_{S}italic_z start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT and L𝐿Litalic_L mimicking that between zTsubscript𝑧𝑇z_{T}italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT and L𝐿Litalic_L. By doing so, the distillation process from teacher to student can be successfully enhanced by the proposed TSB and VSB without the two problems mentioned above.

III-E Text Control on Downstream Tasks

For different downstream tasks and application scenes, the corresponding category names can be utilized as language guidance for distilling processing, rather than the fixed one. With the selected category names and some pre-built templates, the pre-trained language model can provide specific semantic knowledge of each category due to its ability to understand universal text contents. In some open-vocabulary perception works [41, 42, 43], task-specific category names can help refine the image feature and find the patterns with corresponding semantics. In our setting, taking the scene in Section III-C as an example, minimizing LTEXsubscript𝐿𝑇𝐸𝑋L_{TEX}italic_L start_POSTSUBSCRIPT italic_T italic_E italic_X end_POSTSUBSCRIPT is equivalent to minimizing DKL(sTL||sSL)D_{KL}(s_{T-L}||s_{S-L})italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT | | italic_s start_POSTSUBSCRIPT italic_S - italic_L end_POSTSUBSCRIPT ). If L𝐿Litalic_L is built upon category names that not match the image data distribution of current dataset, then values of sTLsubscript𝑠𝑇𝐿s_{T-L}italic_s start_POSTSUBSCRIPT italic_T - italic_L end_POSTSUBSCRIPT in those mismatched dimensions will always have low values, preventing student networks from learning useful information from distribution alignment. When using the selected task-oriented category names, ideal knowledge transfer can be achieved for each category in current scenario.

IV Experiment

IV-A Setup

Dataset and Downstream Tasks. For model pre-training with the LGD, the ImageNet-1k (IN-1k) [10] dataset is used as unlabeled source data by abandoning the labels, which contains 1.28M images for training, and 50,000 images for validation. Then, we evaluate the pre-trained model on various downstream tasks, including zero-shot and fully-supervised classification on the IN-1k and the Caltech-256 (CT-256) [44] dataset, semi-supervised classification on the IN-1k dataset, object detection and instance segmentation on the COCO 2017 [45] dataset, long-tailed object detection and instance segmentation on the LVIS v1 [46] dataset, and semantic segmentation on the ADE20K[47] and Cityscapes (CS) [48] datasets.

Teacher-Student Pair. Experiments are mainly conducted on two pairs of teacher-student models: CLIP pre-trained ResNet-50 (CLIP RN50) [8] → ResNet-18 (RN18) [49] and CLIP RN50 [8] → MobileNetV2 (MNV2) [50], representing knowledge transfer between similar and dissimilar networks respectively. Besides, we also conduct experiments with different teacher networks, such as MoCo RN50 [12] and SimSiam RN50 [13]. The pre-trained text encoder used is a Transformer [51] based encoder that is jointly trained with CLIP RN50.

TABLE I: Semi/Fully-supervised classification results on the ImageNet-1k (IN-1k) and Caltech-256 (CT-256) datasets. The top-1 accuracy is reported.
Method Teacher Student Data Text Epoch Zero-shot Linear-probe
IN-1k CT-256 IN-1k CT-256
1% 10% 100%
MoCo RN50 - - - - - - - - 67.5 -
CLIP RN50 - - - - 58.9 78.1 - - 73.3 -
Super. - RN18 IN-1k - 90 - - - - - 77.1
SEED MoCo RN50 RN18 IN-1k - 200 - - 37.5 51.1 57.9 78.5
BINGO MoCo RN50 RN18 IN-1k - 200 - - 42.8 57.5 61.4 79.3
DisCo MoCo RN50 RN18 IN-1k - 200 - - - - 60.6 -
SMD SimSiam RN50 RN18 IN-1k - 100 - - - - 61.8 -
SEED CLIP RN50 RN18 IN-1k - 200 44.2 62.1 43.1 56.3 62.7 79.5
BINGO CLIP RN50 RN18 IN-1k - 200 45.5 62.3 44.2 59.2 64.0 80.6
LGD CLIP RN50 RN18 IN-1k IN-1k 200 49.5 63.9 53.5 61.7 67.2 82.7
LGD CLIP RN50 RN18 IN-1k IN-1k 90 47.8 62.3 52.9 61.5 66.4 81.6
LGD CLIP RN50 RN18 IN-1k CT-256 90 34.3 65.8 - - - -
LGD CLIP RN50 RN18 CT-256 CT-256 90 18.9 66.5 - - - -
Super. - MNv2 IN-1k - 90 - - - - - 77.5
SEED CLIP RN50 MNv2 IN-1k - 200 - - 42.2 55.5 63.5 79.1
BINGO CLIP RN50 MNv2 IN-1k - 200 - - 43.5 56.9 64.6 80.1
LGD CLIP RN50 MNv2 IN-1k IN-1k 200 50.3 64.7 51.9 60.7 67.4 81.3
LGD CLIP RN50 MNv2 IN-1k IN-1k 90 48.5 63.0 50.3 59.4 66.3 81.2
LGD CLIP RN50 MNv2 IN-1k CT-256 90 35.0 66.8 - - - -
LGD CLIP RN50 MNv2 CT-256 CT-256 90 19.2 67.4 - - - -
TABLE II: The results of object detection and instance segmentation on the COCO and LVIS datasets, and the results of semantic segmentation on the ADE20K and Cityscapes (CS) datasets.
Method Teacher Student Data Text Epoch COCO LVIS ADE20K CS
APbb𝐴superscript𝑃𝑏𝑏AP^{bb}italic_A italic_P start_POSTSUPERSCRIPT italic_b italic_b end_POSTSUPERSCRIPT APmk𝐴superscript𝑃𝑚𝑘AP^{mk}italic_A italic_P start_POSTSUPERSCRIPT italic_m italic_k end_POSTSUPERSCRIPT APbb𝐴superscript𝑃𝑏𝑏AP^{bb}italic_A italic_P start_POSTSUPERSCRIPT italic_b italic_b end_POSTSUPERSCRIPT APmk𝐴superscript𝑃𝑚𝑘AP^{mk}italic_A italic_P start_POSTSUPERSCRIPT italic_m italic_k end_POSTSUPERSCRIPT mIoU𝑚𝐼𝑜𝑈mIoUitalic_m italic_I italic_o italic_U mIoU𝑚𝐼𝑜𝑈mIoUitalic_m italic_I italic_o italic_U
MoCo RN50 - IN-1k - - 38.5 35.1 22.7 21.9 38.9 75.3
CLIP RN50 - WIT - - 39.3 36.8 23.0 22.1 39.6 75.8
Super. - RN18 IN-1k - 90 32.8 31.3 17.5 18.6 33.3 71.3
SEED MoCo RN50 RN18 IN-1k - 200 33.7 31.8 18.1 19.2 34.0 72.2
BINGO MoCo RN50 RN18 IN-1k - 200 33.9 32.1 18.4 19.3 34.4 72.3
SEED CLIP RN50 RN18 IN-1k - 200 34.4 32.2 18.9 19.7 34.4 72.8
BINGO CLIP RN50 RN18 IN-1k - 200 34.6 32.4 19.2 19.9 34.7 73.2
LGD CLIP RN50 RN18 IN-1k IN-1k 90 34.0 32.0 19.7 20.3 34.9 73.7
LGD CLIP RN50 RN18 IN-1k IN-1k 200 35.1 32.9 20.1 20.5 35.3 74.0
Super. - MNv2 IN-1k - 90 27.3 26.4 14.9 14.4 29.7 70.2
SEED MoCo RN50 MNv2 IN-1k - 200 28.1 26.8 15.4 14.8 31.9 70.5
BINGO MoCo RN50 MNv2 IN-1k - 200 28.3 27.0 15.6 15.0 32.2 70.8
SEED CLIP RN50 MNv2 IN-1k - 200 28.9 27.4 15.9 16.1 32.2 71.0
BINGO CLIP RN50 MNv2 IN-1k - 200 29.1 27.7 16.0 16.2 32.5 71.3
LGD CLIP RN50 MNv2 IN-1k IN-1k 90 28.8 27.5 16.1 16.6 33.8 72.3
LGD CLIP RN50 MNv2 IN-1k IN-1k 200 29.6 28.1 16.2 16.9 33.9 72.3

Comparison Methods. The proposed LGD is compared with several small model pre-training methods: 1) fully-supervised pre-training method, including ImageNet pre-training [9]; 2) SSD based methods, including SEED [4], BINGO [5], DisCo [20], and SMD [6].

Implementation Details. The proposed LGD is implemented using the PyTorch framework, and all experiments are conducted on 8 NVIDIA TESLA V100 GPUs. The distillation process is trained with a standard SGD optimizer with a momentum of 0.9 and a weight decay parameter of 1e-4 for 90 and 200 epochs. The initial learning rate is set as 0.03 and updated by a cosine decay scheduler [52] with 5 warm-up epochs and a batch size of 256 per GPU.

For transferring to image classification, we conduct the supervised linear classification on IN-1k and CT-256. For zero-shot setting, we treat the textual semantics bank L𝐿Litalic_L as a classifier and the predicted category index is calculated following Eq. 1. In this way, each image feature will be classified into the most semantically similar category without training with the corresponding label. For linear-probe setting, following previous works [12, 4], we train a single linear layer classifier on top of the frozen network encoder after distillation or pre-training. For IN-1k dataset, besides fully supervised learning which utilizes 100% of the training set, following previous works [11, 5], we evaluate the proposed method by fine-tuning the student model with 1% and 10% labeled data as semi-supervised setting. We follow the training split settings as in previous works for fair comparisons. SGD optimizer is used to prepare the linear classifier for 100 epochs with a weight decay of 0. The initial learning rate is set as 30 and is then reduced by a factor of 10 at 60 and 80 epochs. The results are reported in terms of Top-1 accuracy.

For transferring to object detection and instance segmentation, we use mmdetection [53] for implementation. We train Mask-RCNN FPN [54] with RN18 and MNv2 backbone to evaluate the transferability of the learned features on COCO 2017 and LVIS v1. AdamW [55] optimizer is used to finetune the whole network for 12 epochs (the default 1x schedule). The initial learning rate is set as 2e-4 and then reduced by a factor of 10 at 8 and 11 epochs. To better preserve the pre-trained weights, we set the learning rate of the image encoder as 1/10 of the other parameters.

For semantic segmentation, we use mmsegmentation [56] for implementation. We train Semantic FPN [57] with RN18 backbone and PSPNet[58] with MNv2 backbone to evaluate the transferability of learned features on ADE20k and CS. AdamW [55] optimizer is also used, and we also set the learning rate of the image encoder as 1/10 of the other parameters. For MNv2 and RN18 backbone, the initial learning is set as 1e-2 and 1e-4, respectively. For ADE20k and Cityscapes, we train the model for 160k and 80k iterations, respectively.

IV-B Main Results

IV-B1 Results on Zero-shot Classification.

As shown in Table I, we compare our method with previous SSD methods SEED [4] and BINGO [5] on the ImageNet-1k and Caltech-256 validation set. The top-1 zero-shot classification accuracy is reported. We observe that the proposed LGD surpasses the contrastive SSD methods consistently on all benchmarks, verifying the effectiveness of the proposed LGD.

To further verify the effect of language guidance, we still use the same IN-1k as unlabeled image input but change the input texts to the class of CT-256 for doing experiments. Significant improvement is observed on both RN18 (+3.5%) and MNv2 (+3.8%) backbone, which shows that we can specify the VSB and TSB through the texts to control the knowledge we want to transfer to the students. Besides, we use the CT-256 as our unlabeled image input and achieve the highest zero-shot accuracy (66.5% and 67.4%) on the CT-256 dataset, which reveals that the consistency of data between the distillation and downstream tasks will impact the results. More relevance between the source data used in distillation and the downstream task brings better downstream performance of the trained small model. Nevertheless, to fairly compare the performance of small models transferring to downstream tasks with other methods, we uniformly use IN-1k as the source data for distillation or pre-training in the following experiments.

Refer to caption
Figure 3: The Grad-CAM visualization shows the different attention maps of distilled RN18 in columns (a)-(d). The first column shows the attention map of CLIP pre-trained RN50, which is the teacher model. Columns (a)-(d) show the visualization results of SEED, LGD with only visual space alignment, LGD with only textual space alignment, and LGD with both losses, respectively.

IV-B2 Transfer to Semi/Fully-supervised Classification.

Following [4, 5], we evaluate the learned representation on semi-supervised/fully-supervised classification on IN-1k, where a fixed 1%, 10% or 100% of IN-1k training data are provided with the annotations. Besides, to further study whether the improvement of the learned representations by distillation is confined to ImageNet, we evaluate the additional classification dataset CT-256 to study the generalization and transferability of the feature representation. As shown in Table I, the proposed LGD has a remarkably 1.2%-9.3% improvement compared with previous methods.

IV-B3 Transfer to Object Detection and Instance Segmentation.

As shown in Table II, we conduct experiments on two downstream tasks including object detection and instance segmentation, on the COCO 2017 and LVIS v1 datasets. Compared with standard ImageNet supervised pre-training (Super.), the distilled model pre-trained by the proposed LGD achieves a large improvement in the same number of training epochs. On COCO, the RN18 based Mask RCNN shows +1.2 and +0.7 point improvement on APbb𝐴superscript𝑃𝑏𝑏AP^{bb}italic_A italic_P start_POSTSUPERSCRIPT italic_b italic_b end_POSTSUPERSCRIPT and APmk𝐴superscript𝑃𝑚𝑘AP^{mk}italic_A italic_P start_POSTSUPERSCRIPT italic_m italic_k end_POSTSUPERSCRIPT, respectively. And the MNv2 based Mask RCNN shows +1.5 and +1.7 point improvement on APbb𝐴superscript𝑃𝑏𝑏AP^{bb}italic_A italic_P start_POSTSUPERSCRIPT italic_b italic_b end_POSTSUPERSCRIPT and APmk𝐴superscript𝑃𝑚𝑘AP^{mk}italic_A italic_P start_POSTSUPERSCRIPT italic_m italic_k end_POSTSUPERSCRIPT. Compared with other SSD methods, our method also shows a significant improvement. Compared with SEED and BINGO pre-trained RN18, which is distilled from CLIP RN50, the proposed LGD shows a consistent +0.5 to +0.9 point improvement on COCO. Compared with the RN18 distilled from MoCo RN50, the proposed LGD shows a large improvement of +0.8 to +1.7 points. The MNV2 backbone also shows a similar improvement and the detailed experiment results can be seen in Table II.

IV-B4 Transfer to Semantic Segmentation.

As shown in Table II, we conduct semantic segmentation experiments on the ADE20k and Cityscapes (CS) datasets. Compared with standard ImageNet supervised pre-training (Super.), the RN18 based Semantic FPN shows +1.6 and +2.4 mIoU improvement on the two datasets, respectively. Besides, compared with SSD methods, the proposed LGD also shows significant +0.5 to +1.8 mIoU improvement. For the MNv2 backbone, it also shows a similar improvement and the detailed experiment results can be seen in Table II.

Notably, we observe that the proposed LGD surpasses the comparative methods, including standard ImageNet pre-training and previous self-supervised distillation, on all benchmarks with the same overhead of training time and data amount. This also proves the generalization ability of the learned representations from language-guided distillation to a wide range of data domains and classes.

IV-C Ablation and Analysis

IV-C1 Different Distillation Strategies.

TABLE III: Zero-shot IN-1k Top-1 Acc.(%) of the distilled RN18 with different distillation strategies.
Method Loss ZS Top-1
SEED - 44.2
LGD LVISsubscript𝐿𝑉𝐼𝑆L_{VIS}italic_L start_POSTSUBSCRIPT italic_V italic_I italic_S end_POSTSUBSCRIPT 45.7
LGD LTEXsubscript𝐿𝑇𝐸𝑋L_{TEX}italic_L start_POSTSUBSCRIPT italic_T italic_E italic_X end_POSTSUBSCRIPT 47.3
LGD LVIS+LTEXsubscript𝐿𝑉𝐼𝑆subscript𝐿𝑇𝐸𝑋L_{VIS}+L_{TEX}italic_L start_POSTSUBSCRIPT italic_V italic_I italic_S end_POSTSUBSCRIPT + italic_L start_POSTSUBSCRIPT italic_T italic_E italic_X end_POSTSUBSCRIPT 47.8
TABLE IV: The results of evaluating the text control on other downstream tasks, including linear-probe top-1 acc.(%) on the Caltech-256 (CT-256), and the mIoU on the Cityscapes (CS) and ADE20K (ADE).
Method Data Text CT-256 CS ADE
LGD IN-1k IN-1k 81.6 73.7 34.9
LGD IN-1k CT-256 81.9 - -
LGD IN-1k CS - 74.1 -
LGD IN-1k ADE - - 35.2

To show the effectiveness of consistency loss in visual and textual space, we conduct an ablation study and show results in Table III. We use CLIP RN50 as the teacher network and report the zero-shot top-1 accuracy on the IN-1k validation set. SEED [4] trains a student to mimic the similarity score distribution inferred by a teacher over a set of randomly maintained instances. LVISsubscript𝐿𝑉𝐼𝑆L_{VIS}italic_L start_POSTSUBSCRIPT italic_V italic_I italic_S end_POSTSUBSCRIPT and LTEXsubscript𝐿𝑇𝐸𝑋L_{TEX}italic_L start_POSTSUBSCRIPT italic_T italic_E italic_X end_POSTSUBSCRIPT are the proposed consistency losses in visual and textual space. Compared with SEED, the proposed visual constraint LVISsubscript𝐿𝑉𝐼𝑆L_{VIS}italic_L start_POSTSUBSCRIPT italic_V italic_I italic_S end_POSTSUBSCRIPT has a significant +1.5% improvement. With the constraint in both visual and textual space (LVISsubscript𝐿𝑉𝐼𝑆L_{VIS}italic_L start_POSTSUBSCRIPT italic_V italic_I italic_S end_POSTSUBSCRIPT + LTEXsubscript𝐿𝑇𝐸𝑋L_{TEX}italic_L start_POSTSUBSCRIPT italic_T italic_E italic_X end_POSTSUBSCRIPT), we can get the best result of 47.8%.

To further explore the effect of different distillation strategies, the attention maps are visualized by Grad-CAM [59] and shown in Fig. 3. The proposed LGD has the most accurate attention maps and the most similar attention area to the teacher (CLIP RN50). And we find that if we only apply constraints in visual space, such as SEED and LVISsubscript𝐿𝑉𝐼𝑆L_{VIS}italic_L start_POSTSUBSCRIPT italic_V italic_I italic_S end_POSTSUBSCRIPT, the student network will pay attention to the general feature of the category. For example, it will pay attention to the net and sea when it identifies the fish and shark. Although images of different categories commonly share these areas, they can not sufficiently represent the semantic categories in the downstream task. Besides, with the constraint in both visual and textual space, the attention map can focus on the most critical area, such as the animals’ faces.

IV-C2 Text Control on More Downstream Tasks

TABLE V: Results of different architecture for teacher and student on IN-1k Top-1 Acc.(%).
Method Teacher Student Top-1
CLIP - ViT-B/16 68.3
SEED CLIP RN101 RN18 45.2
LGD CLIP RN101 RN18 50.3
SEED CLIP ViT-B/16 RN18 53.2
LGD CLIP ViT-B/16 RN18 55.9
SEED CLIP ViT-B/16 DeiT-tiny 53.3
LGD CLIP ViT-B/16 DeiT-tiny 56.5
TABLE VI: Linear-probe IN-1k Top-1 Acc.(%) of distilled RN18. LGD is combined with other SSL pre-trained image and text encoder.
Method Image Encoder Text Encoder Top-1
SEED MoCo RN50 - 57.9
SEED+LGD MoCo RN50 CLIP text 59.7
SEED+LGD MoCo RN50 BERTBASE𝐵𝐸𝑅subscript𝑇𝐵𝐴𝑆𝐸BERT_{BASE}italic_B italic_E italic_R italic_T start_POSTSUBSCRIPT italic_B italic_A italic_S italic_E end_POSTSUBSCRIPT 58.5
SMD SimSiam RN50 - 61.8
SMD+LGD SimSiam RN50 CLIP text 62.1

In this section, we evaluate the text control on more downstream tasks as shown in Table IV. We firstly distill a small model and then finetune the distilled small model on the downstream tasks as described in Section II. In the distillation process, we use the IN-1k as unlabeled source data and the language prompt with class names in the downstream tasks as text input. For example, there are 19 text inputs for Cityscapes, 150 text inputs for ADE20K, and 256 text inputs for Caltech-256. The CLIP RN50 and RN18 are selected as the teacher and student models. And the training epoch is set to 90. Other hyper-parameters are the same as those introduced in Section II.

Constant improvements are observed on the Caltech-256 (+0.3), Cityscapes (+0.4), and ADE20K (+0.3). It shows that through the text control, the small model can learn more useful knowledge from the teacher and perform better on corresponding downstream tasks.

IV-C3 Different Pre-trained Image/Text Encoder.

To evaluate the generalization of the proposed LGD, we combine LGD with other SSL pre-trained image or text encoder and show consistent improvement. Specifically, we introduce LGD into SEED [4] and SMD [6] as the language supervision module introduced in the above Section III-D. As shown in Table VI, SEED + LGD has a significant +1.8% improvement compared with SEED. And SMD + LGD also has a +0.3% improvement compared with SMD. Besides, we use BERT [40] as text encoder, which also shows +0.6% improvement compared with SEED. Note that in the above experiments, the image encoder and text encoder are self-supervised and pre-trained independently, which shows that the proposed LGD also works on non-joint-trained image and text encoder.

TABLE VII: Training and Testing Time Comparison between LGD and SEED.
Method Pre-training (h) Finetuning (h) Testing (m)
SEED 58.8(200) 14.8 2.5
LGD 36.3(90) / 72.2(200) 14.7 2.7
TABLE VIII: Zero-shot and Linear-probe classification results on ModelNet40 dataset. The overall accuracy(%) is reported.
Method Student Zero-shot Linear-probe
SEED PointMLPElite 78.48 89.90
SEED Transformer-6L 77.31 89.74
LGD PointMLPElite 80.35 91.17
LGD Transformer-6L 79.86 91.26

To further verify the applicability of our proposed method across various architectures, we conduct experiments employing CLIP RN101 or CLIP ViT-B/16 as teacher and RN18 or DeiT-tiny as student. As shown in Table V, our proposed LGD demonstrates remarkable performance compared to SEED [4] when applied to multiple teacher/student architectures. For instance, in the cases where we use CLIP ViT-B/16 as teacher and DeiT-tiny as student, LGD achieves notable improvements of 3.2% compared with SEED. The results of zero-shot ImageNet classification demonstrate that LGD can not only perform well on CNN-based architectures but also show good performance on other architectures, such as ViTs.

IV-C4 Training and testing time analysis

The training and testing time of our proposed LGD and previous method SEED [4] is assessed in Table VII. Specifically, both models are trained for 200 epochs and finetuned for 100 epochs on the ImageNet1K dataset under the same settings (batch size, number of GPUs, etc). For testing time, we measure the inference time required for classifying images in the validation dataset. Results indicate that under the same number of training epochs (200), the pre-training time for LGD is longer than that for SEED. However, it is noteworthy that our method achieves comparable results with just 90 epochs of training (taking 36.3 hours), as opposed to SEED, which requires 200 epochs (taking 58.8 hours), as shown in Table I. Therefore, despite longer pre-training time per epoch, the overall efficiency of LGD is superior, requiring significantly less training time to achieve comparable performance. Additionally, the finetuning and testing time of SEED and LGD is similar, as both methods only utilize the student network.

IV-C5 Results on Tasks other than Image

To demonstrate the efficacy and generalization ability of the proposed LGD, extensive experiments are conducted on point cloud classification task. For teacher network, we select the latest point-language pre-training framework Point-Bind [60] with I2P-MAE [61] as the point cloud encoder. For student, a solid lightweight point cloud encoder PointMLPElite [62] is employed. Besides, plain transformer network used in [63] with less (6) layers is also used as student since it has a similar structure with teacher point cloud encoder. Results are shown in Table VIII. Compared to the previous method SEED, LGD performs better on both zero-shot and linear-probe classification on ModelNet40 dataset. And the improvement is consistent among both students.

IV-C6 Visualization of Feature Distributions

Refer to caption
Figure 4: Visualization of feature distributions. (a) CLIP RN50 (teacher); (b) LGD RN18; (c) SEED RN18. Different colors represent data samples of different classes. Zoom in for details. Best viewed in color.
Refer to caption
Figure 5: Visualization of feature distributions. (a) LGD; (b) SEED. Different colors represent different types of features, including teacher feature, student feature and text feature. The numbers represent the classes of data samples. Zoom in for details. Best viewed in color.

To validate the effectiveness of LGD in aligning the outputs of the teacher and student in both the visual and textual semantic spaces, we visualize the feature distribution of the teacher model (CLIP RN50) and student model (LGD RN18 & SEED RN18 [4]). In Fig. 4, it can be seen that the LGD learns more compact feature distribution than SEED. Due to the model capacity, the feature distributions of both LGD RN18 and SEED RN18 are more diverse than that of CLIP RN50. In Fig. 5, the feature distribution of the image encoder (teacher), text encoder, and student are shown. It can be observed that the feature distribution of SEED RN18 is similar to that of the teacher but has a large gap with the text feature of the text encoder which contains task-related knowledge . As shown in Fig. 5 (a), with the constraint in textual space, the feature distributions of LGD RN18 have some variances to that of the teacher but also have a matching relationship with the text feature, representing the combination of knowledge of both visual and textual features.

V Conclusion

This paper proposes Language-Guided Distillation (LGD) for transferring knowledge from a pre-trained large teacher model, such as CLIP, to a small one. We first use language guidance to determine which knowledge of the teacher should be transferred to the students. Then, we propose a language-guided knowledge aggregation module to maintain language-guided textual and visual semantic banks. At last, we design two distillation losses to maintain the consistency of teacher and student in both visual and textual space. Thorough experiments show that LGD offers significant performance improvement on various downstream tasks.

In our study, we find that the LGD performance largely relies on the joint-trained text and image encoder, while the performance gain on the non-joint-trained one is relatively small because of the mismatch between the textual and visual features. We hope our work could promote the development of distillation methods to become a new paradigm for small model pre-training by making full use of the off-the-shelf pre-trained models.

ACKNOWLEDGMENTS

This work is supported by National Key R&D Program of China (2022ZD0160300), National Natural Science Foundation of China (No. 62101137 and 62071127), Shanghai Natural Science Foundation (No. 23ZR1402900). The computations in this research were performed using the CFFF platform of Fudan University.

References

  • [1] R. He, S. Sun, J. Yang, S. Bai, and X. Qi, “Knowledge distillation as efficient pre-training: Faster convergence, higher data-efficiency, and better transferability,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9161–9171.
  • [2] H. Shi, Y. Zhang, S. Tang, W. Zhu, Y. Li, Y. Guo, and Y. Zhuang, “On the efficacy of small self-supervised contrastive models without distillation signals,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 2225–2234.
  • [3] H. Wu, Y. Gao, Y. Zhang, S. Lin, Y. Xie, X. Sun, and K. Li, “Self-supervised models are good teaching assistants for vision transformers,” in International Conference on Machine Learning.   PMLR, 2022, pp. 24 031–24 042.
  • [4] Z. Fang, J. Wang, L. Wang, L. Zhang, Y. Yang, and Z. Liu, “Seed: Self-supervised distillation for visual representation,” in International Conference on Learning Representations, 2021.
  • [5] H. Xu, J. Fang, X. Zhang, L. Xie, X. Wang, W. Dai, H. Xiong, and Q. Tian, “Bag of instances aggregation boosts self-supervised distillation,” in International Conference on Learning Representations, 2022.
  • [6] H. Liu and M. Ye, “Improving self-supervised lightweight model learning via hard-aware metric distillation,” in European Conference on Computer Vision.   Springer, 2022, pp. 295–311.
  • [7] S. Abbasi Koohpayegani, A. Tejankar, and H. Pirsiavash, “Compress: Self-supervised learning by compressing representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12 980–12 992, 2020.
  • [8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning.   PMLR, 2021, pp. 8748–8763.
  • [9] K. He, R. Girshick, and P. Dollár, “Rethinking imagenet pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4918–4927.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2009.
  • [11] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • [12] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
  • [13] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 750–15 758.
  • [14] F. Wang, T. Kong, R. Zhang, H. Liu, and H. Li, “Self-supervised learning by estimating twin class distribution,” IEEE Transactions on Image Processing, 2023.
  • [15] Y. Liu, K. Wang, L. Liu, H. Lan, and L. Lin, “Tcgl: Temporal contrastive graph for self-supervised video representation learning,” IEEE Transactions on Image Processing, vol. 31, pp. 1978–1993, 2022.
  • [16] J. Ge, Y. Liu, J. Gui, L. Fang, M. Lin, J. T.-Y. Kwok, L. Huang, and B. Luo, “Learning the relation between similarity loss and clustering loss in self-supervised learning,” IEEE Transactions on Image Processing, 2023.
  • [17] Z. Tao, X. Liu, Y. Xia, X. Wang, L. Yang, X. Huang, and T.-S. Chua, “Self-supervised learning for multimedia recommendation,” IEEE Transactions on Multimedia, 2022.
  • [18] Y. Liu, J. Wu, L. Qu, T. Gan, J. Yin, and L. Nie, “Self-supervised correlation learning for cross-modal retrieval,” IEEE Transactions on Multimedia, 2022.
  • [19] Y. Zhang, C. Liu, Y. Zhou, W. Wang, Q. Ye, and X. Ji, “Beyond instance discrimination: Relation-aware contrastive self-supervised learning,” IEEE Transactions on Multimedia, 2023.
  • [20] Y. Gao, J.-X. Zhuang, S. Lin, H. Cheng, X. Sun, K. Li, and C. Shen, “Disco: Remedy self-supervised learning on lightweight models with distilled contrastive learning,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 1–10.
  • [21] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning.   PMLR, 2021, pp. 4904–4916.
  • [22] M. Li, X. Chen, C. Zhang, S. Chen, H. Zhu, F. Yin, G. Yu, and T. Chen, “M3dbench: Let’s instruct large models with multi-modal 3d prompts,” arXiv preprint arXiv:2312.10763, 2023.
  • [23] S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, “Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 428–26 438.
  • [24] X. Yang, F. Liu, and G. Lin, “Effective end-to-end vision language pretraining with semantic visual loss,” IEEE Transactions on Multimedia, no. 99, pp. 1–10, 2023.
  • [25] Y. Xing, Q. Wu, D. Cheng, S. Zhang, G. Liang, P. Wang, and Y. Zhang, “Dual modality prompt tuning for vision-language pre-trained model,” IEEE Transactions on Multimedia, 2023.
  • [26] S. Chen, H. Zhu, M. Li, X. Chen, P. Guo, Y. Lei, Y. Gang, T. Li, and T. Chen, “Vote2cap-detr++: Decoupling localization and describing for end-to-end 3d dense captioning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [27] Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan, “Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=zq1iJkNk3uN
  • [28] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, “Filip: Fine-grained interactive language-image pre-training,” arXiv preprint arXiv:2111.07783, 2021.
  • [29] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975.
  • [30] M. Wang, J. Xing, and Y. Liu, “Actionclip: A new paradigm for video action recognition,” arXiv preprint arXiv:2109.08472, 2021.
  • [31] Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “Denseclip: Language-guided dense prediction with context-aware prompting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 082–18 091.
  • [32] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 535–541.
  • [33] G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, vol. 2, no. 7, 2015.
  • [34] B. Zhao, Q. Cui, R. Song, Y. Qiu, and J. Liang, “Decoupled knowledge distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 953–11 962.
  • [35] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
  • [36] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” arXiv preprint arXiv:1612.03928, 2016.
  • [37] D. Chen, J.-P. Mei, Y. Zhang, C. Wang, Z. Wang, Y. Feng, and C. Chen, “Cross-layer distillation with semantic calibration,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 8, 2021, pp. 7028–7036.
  • [38] J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4133–4141.
  • [39] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation distillation,” arXiv preprint arXiv:1910.10699, 2019.
  • [40] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds.   Association for Computational Linguistics, 2019, pp. 4171–4186. [Online]. Available: https://doi.org/10.18653/v1/n19-1423
  • [41] J. Qin, J. Wu, P. Yan, M. Li, R. Yuxi, X. Xiao, Y. Wang, R. Wang, S. Wen, X. Pan et al., “Freeseg: Unified, universal and open-vocabulary image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 446–19 455.
  • [42] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
  • [43] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, “Pointclip: Point cloud understanding by clip,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 8552–8562.
  • [44] Griffin, Holub, and Perona, “Caltech 256,” 4 2022.
  • [45] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [46] A. Gupta, P. Dollar, and R. Girshick, “Lvis: A dataset for large vocabulary instance segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5356–5364.
  • [47] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
  • [48] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
  • [49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [50] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
  • [51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [52] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Icml, 2010.
  • [53] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, “MMDetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155, 2019.
  • [54] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
  • [55] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [56] M. Contributors, “MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark,” https://github.com/open-mmlab/mmsegmentation, 2020.
  • [57] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6399–6408.
  • [58] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
  • [59] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
  • [60] Z. Guo, R. Zhang, X. Zhu, Y. Tang, X. Ma, J. Han, K. Chen, P. Gao, X. Li, H. Li et al., “Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following,” arXiv preprint arXiv:2309.00615, 2023.
  • [61] R. Zhang, L. Wang, Y. Qiao, P. Gao, and H. Li, “Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21 769–21 780.
  • [62] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: A simple residual mlp framework,” in International Conference on Learning Representations, 2022.
  • [63] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-bert: Pre-training 3d point cloud transformers with masked point modeling,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 19 313–19 322.