Lightweight Model Pre-training via Language Guided Knowledge Distillation
Abstract
This paper studies the problem of pre-training for small models, which is essential for many mobile devices. Current state-of-the-art methods on this problem transfer the representational knowledge of a large network (as a Teacher) into a smaller model (as a Student) using self-supervised distillation, improving the performance of the small model on downstream tasks. However, existing approaches are insufficient in extracting the crucial knowledge that is useful for discerning categories in downstream tasks during the distillation process. In this paper, for the first time, we introduce language guidance to the distillation process and propose a new method named Language-Guided Distillation (LGD) system, which uses category names of the target downstream task to help refine the knowledge transferred between the teacher and student. To this end, we utilize a pre-trained text encoder to extract semantic embeddings from language and construct a textual semantic space called Textual Semantics Bank (TSB). Furthermore, we design a Language-Guided Knowledge Aggregation (LGKA) module to construct the visual semantic space, also named Visual Semantics Bank (VSB). The task-related knowledge is transferred by driving a student encoder to mimic the similarity score distribution inferred by a teacher over TSB and VSB. Compared with other small models obtained by either ImageNet pre-training or self-supervised distillation, experiment results show that the distilled lightweight model using the proposed LGD method presents state-of-the-art performance and is validated on various downstream tasks, including classification, detection, and segmentation. We have made the code available at https://github.com/mZhenz/LGD.
Index Terms:
Lightweight model pre-training, language-guided distillation, textual semantics bank, visual semantics banksI Introduction
Recently, the study of pre-training lightweight (small) models with both a small number of parameters and fast inference speed receives increasing attention [1, 2, 3]. Among existing small model pre-training methods, the self-supervised distillation (SSD) [4, 5, 6] which improves the pre-training performance of small models with a distillation signal from a pre-trained large model, has appeared as a promising solution to this problem, as this pipeline can save the overhead of image labeling and meanwhile maintain reasonable performance. Hence we follow this pipeline to design a small model pre-training approach in this work.
![Refer to caption](x1.png)
Considering the smaller capacity, it is difficult for the student model to fully mimic the outputs of the large teacher model. In other words, the student model may not learn all knowledge of the teacher model. An ideal solution is to let the student learn partial but essential knowledge of the teacher, such as the knowledge that is most useful for distinguishing categories in the target tasks. Unfortunately, existing SSD methods such as SEED [4] and CompRess [7] do not consider this. Inspired by the success of Contrastive Language-Image Pre-training (CLIP) [8], which reflects the value of natural language for enhancing the visual semantic features, we propose to distill the essential knowledge of the teacher with the help of language. The reason is that the input language could be built on the information of the target task, e.g., the category names in the classification, segmentation, or detection task. Thus, the textual semantic space is also relevant to the target task and can naturally help to identify the most relevant knowledge in the teacher for distillation. Specifically, we make use of the language with the pre-trained text encoder to extract semantic embeddings and build a textual semantic space, also named Textual Semantics Bank (TSB). Utilizing the proposed TSB to enhance distillation and align the outputs of the student and teacher in the target-related textual semantic space can help the lightweight student learn more knowledge of the teacher.
However, the textual semantic feature extracted from the pre-trained text encoder may not be consistent with the corresponding visual feature extracted by the teacher model, which could result in performance degradation in the distillation process. To alleviate this issue, we design a Language-Guided Knowledge Aggregation (LGKA) module to build visual semantic space, also named Visual Semantics Bank (VSB). The VSB shares the same shape as TSB. During the training process, the LGKA module takes a visual feature extracted from the teacher and the TSB as inputs. Then the visual feature is used to momentum update the corresponding feature in the VSB. The correspondence is determined by the similarity between the input visual feature and the TSB. Thus, the VSB is closer to the real distribution of visual semantics than the TSB. The outputs of the student and teacher will be aligned in both the visual and textual semantic spaces.
Therefore, we design two language-guided loss functions. The first is based on the output consistency constraint of the teacher and student in the textual semantic space, and the second is based on consistency loss after map** the outputs of the teacher and student to the visual semantic space. It is worth noting that the TSB is fixed, while the VSB is continually updated during the training process.
To this end, we propose a Language-Guided Distillation (LGD) framework, which pre-trains a small model by integrating language guidance into the distillation process. Despite the language guidance used, we do not use labels, which is the same setting as SSD, but only use unlabeled images and self-determined texts as source data. As shown in Fig. 1, distilling the same teacher model, the proposed LGD could achieve much better results on the ImageNet-1K dataset with the help of input language. Besides, we could take the corresponding category names according to the target downstream task as language guidance for distilling processing, rather than the fixed one, e.g. category names of ImageNet-1K. It also could bring constant improvements to downstream tasks.
In summary, the main contributions of this paper are summarized below:
-
•
We propose a novel Language-Guided Distillation (LGD) framework, which is the first attempt to introduce language guidance into the distillation process to pre-train the small model.
-
•
A Language-Guided Knowledge Aggregation (LGKA) module is developed, which uses language to guide teachers on how to structure the knowledge they pass to students and constrain the learning scope to improve the learning effect. Meanwhile, two losses are designed to maintain the consistency of both the image-to-image and image-to-text relationships between teacher and student features.
-
•
Thorough experiments are conducted on six datasets and six downstream tasks, which shows the small model pre-trained by the proposed method has better transferability.
The remainder of this paper is organized as follows. Section II reviews the related works on small model pre-training, vision-language model pre-training, and knowledge distillation respectively. In Section III, we present an overview of the whole framework, including language-guided knowledge aggregation, language-aware knowledge transfer, and optimization objectives. In Section IV, extensive experiments and analysis are presented to validate the effectiveness of the proposed LGD. Finally, conclusions are drawn in Section V.
II Related Work
II-A Small Model Pre-training
As a conventional model pre-training way, fully-supervised pre-training [9] trains a classification network to predict a fixed set of predetermined object categories on a large-scale annotated dataset like ImageNet [10]. To alleviate the massive overhead of annotation, self-supervised learning (SSL) [11, 12, 13, 14, 15, 16, 17, 18, 19] aims to dig out good feature representation through the relationship between data samples without labels. However, because of the significant performance drop when the model size decreases, SSD [4, 20, 5, 1] is proposed to improve the SSL performance on a small model. Specifically, SSD transfers the learned feature representation from an off-the-shelf pre-trained large model to the student. However, previous SSD methods do not consider the gap in feature representation ability between teacher and student, and they directly transfer all knowledge from the teacher to the student. In this paper, we introduce language guidance to distillation process and use language to constrain the learning scope of the small model.
II-B Vision-Language Learning
During the past few years, vision-language learning has attracted growing attention [8, 21, 22, 23, 24, 25, 26]. As a milestone, CLIP [8] learns high-quality image representations by a simple image text pairing task on a dataset of 400 million (image, text) pairs collected from the internet. Motivated by CLIP, several works have emerged to improve the training strategy [27, 28, 29] or apply the CLIP method to other domains [30, 31]. However, previous methods directly adopt the CLIP pre-trained large model to downstream tasks, which is unsuitable for practical applications due to the significant computation overhead. In this paper, for the first time, we transfer the visual and textual representations learned by the CLIP pre-trained large model to the small model through distillation, which reveals a new application of the CLIP pre-trained model.
II-C Knowledge Distillation
Knowledge distillation usually transfers knowledge from a large teacher to a small student. Previous methods mainly fall into three streams: response-based [32, 33, 34], feature-based [35, 36, 37], and relation-based [38, 39]. Previous works [4, 20, 5] have demonstrated that relation-based SSD outperforms response-based and feature-based distillation strategies for self-supervised contrastive pre-trained teacher models, which targets modeling the relationship between features of different samples. Similar to SSL works, CLIP also adopts a contrastive learning strategy. But the difference is that CLIP not only models the relationship between image and image but also between image and language (text), which previous works have not considered. In this paper, we first adopt language guidance to knowledge distillation, which effectively transfers the relation knowledge of image-to-image and image-to-text with language guidance.
SSD v.s. LGD. Here we make a comparison of the self-supervised distillation with the proposed language-guided distillation. Similarly, they all do not use any label information for distillation. They all support the self-supervised pre-trained model (MoCo [12]), weakly-supervised pre-trained model (CLIP [8]), and fully supervised pre-trained model as the teacher model. Differently, LGD takes the self-determined category names (always using the category names from the downstream task) as language guidance, it does not need extra manual labor to annotate. Thus, we think it is fair to compare LGD with other SSD methods. Besides, to evaluate the generalization ability of the proposed method, we evaluate our method with the weakly-supervised pre-trained text model (CLIP text encoder [8]) and self-supervised pre-trained text model (BERT [40]).
III The Proposed Method
![Refer to caption](x2.png)
III-A Overview
The overall framework of the proposed LGD is shown in Fig. 2, which consists of two major modules: Language-Guided Knowledge Aggregation (LGKA) and Language-aware Knowledge Transfer.
Firstly, textual semantic embeddings are extracted by feeding [text prompt] + [class] (e.g. ”a photo of cat”) into the pre-trained text encoder and stored in the Textual Semantics Bank (TSB), denoted as . Here, the semantic embeddings can be regarded as the clustering centers of the self-determined categories. Following CLIP [8], the prompt engineering and feature ensembling are adopted here so that there is only one textual semantic embedding for each semantic category. Note that the [class] can be determined by the knowledge required on downstream task, and textual semantic embeddings only need to be extracted once. Similar to previous SSD works [4, 6], unlabeled images after data augmentation are fed into the image encoder (teacher) and student network.
As mentioned earlier, the textual semantic embeddings extracted from the pre-trained text encoder may not be consistent with the corresponding visual feature extracted by the teacher model. So, the LGKA module is developed to guide the teacher to build the Visual Semantics Bank (VSB), denoted as , which has the same shape as TSB. For each input, the teacher’s visual feature is classified by the TSB and is used to momentum update the VSB. Since the input [class] is self-determined, the user can specify the texts containing different semantic categories’ knowledge to supervise the teacher to transfer the knowledge to the student.
Lastly, to align the teacher and student in both visual and textual space, the similarity of the outputs of the teacher and student with the feature in TSB and VSB is calculated as the consistency constraint in both appearance and semantic space.
III-B Language-Guided Knowledge Aggregation
In order to transfer related knowledge from the teacher to the student, the LGKA module is developed to constrain the learning scope. Specifically, we first treat textual semantics bank as a classifier and classify the visual feature extracted by the teacher for each batch, which can be formulated as
(1) |
where , and . The , , and denote batch size, #channels of the output of the text encoder, the number of self-determined categories, and the classification results of the teacher’s visual feature, respectively. Note that the categories are customized depending on the user given [class]. By calculating the similarity, each image feature will be classified into the most semantically similar category.
There may be more than one feature in a batch that belongs to the same category, then the average vectors are computed, which can be formulated as
(2) |
where denotes the centroid feature of each semantic category in this batch and . Note that only when one category is contained in the batch, its centroid feature is computed and the visual anchor for this category in VSB is updated.
At last, the centroid feature of each semantic category will be added to the VSB by momentum updating, which can be formulated as
(3) |
where . Similar to [12], is a momentum coefficient and set to 0.999 as default. When updating the visual anchor of one semantic category in VSB for the first time, we directly replace with instead of momentum updating it, which can produce a better initial point as compared with updating all classes’ features in VSB from randomly initialized points.
The proposed VSB has two advantages compared with the previous method. 1) Memory-friendly. Previous SSD methods [7, 4, 5] need to maintain a large queue with enough anchor points to get satisfactory results. Similar to MoCo [12], the queue length is 65536, containing a large number of similar features for the same semantic category, consuming a lot of memory. The proposed VSB only saves the anchor feature for each semantic category, which is more memory-efficient. 2) Data-efficient. Since previous SSD methods [7, 4] maintain a queue to store features, they can only save features in the neighboring few batches. On the contrary, the proposed VSB saves whole dataset-aware features through momentum updating, showing higher data efficiency.
III-C Language-aware Knowledge Transfer
As mentioned before, the language information is used to supervise the optimization of the whole model from both visual and textual space, to be detailed below.
Visual Space Alignment Loss. After LGKA, the VSB maintains the visual anchor for measuring the image-to-image similarity. To align the output of the teacher and the student in visual space, the cosine similarity scores between teacher/student feature and are calculated first. And similar to [4], the teacher feature is added at the end of VSB to directly align the student with the teacher. Therefore, the modified VSB is with . The calculation of the similarity score can be formulated as
(4) | ||||
(5) |
where and . And and is the temperature parameter.
The proposed visual space align loss can be formulated as the cross entropy between and , which can be formulated as
(6) |
Textual Space Alignment Loss. When language information is introduced, the output of the student should be aligned with the teacher’s output not only in visual space but also in textual space. Specifically, to this end, this loss directly uses textual anchors in TSB to construct the image-to-language relation, which can be formulated as
(7) | ||||
(8) | ||||
(9) |
where and denote the similarity score between extracted feature and . And and is the temperature parameter.
In sum, the total loss is denoted as:
(10) |
where is a hyper-parameter to balance the and , which sets to 0.5 in our default.
III-D Generalize to More Image Encoders
To verify the generalization of the proposed LGD, we attempt to extend the LGD to other pre-trained image encoders, such as an SSL model, by making corresponding changes to solve two problems.
The first problem is the mismatch of feature dimensions between the image encoder and text encoder. For example, the representation dimension of MoCo [12] and SimSiam [13] is 128-D, while CLIP is 1024-D. Therefore, we add a learnable MLP layer after the to downsample the dimension of from 1024 to 128. The second one is mode collapse. From the formula of KL divergence, we can derive that:
(11) |
where is the KL-Divergence between similarity scores of T-L and S-L, and denotes the entropy of . When we minimize as before, since the TSB is constant and is learnable (due to the added MLP layer to solve the first problem), is easy to fall into the suboptimal solution to make the entropy of minimal. To solve this problem, we let the difference between and mimic the difference between . The total loss can be formulated as
(12) |
where is 0.33 in our default. We use the same for different parts to keep a balanced contribution from each term in the total loss function. In this manner, since is learnable, the VSB is not constant, then will no longer be a suboptimal solution making the loss function falling into a local minima. Additionally, learning with this loss function can minimize the KL-Divergence between similarity scores of T-V and T-L/S-L, thus achieving the original goal of driving the similarity score between and mimicking that between and . By doing so, the distillation process from teacher to student can be successfully enhanced by the proposed TSB and VSB without the two problems mentioned above.
III-E Text Control on Downstream Tasks
For different downstream tasks and application scenes, the corresponding category names can be utilized as language guidance for distilling processing, rather than the fixed one. With the selected category names and some pre-built templates, the pre-trained language model can provide specific semantic knowledge of each category due to its ability to understand universal text contents. In some open-vocabulary perception works [41, 42, 43], task-specific category names can help refine the image feature and find the patterns with corresponding semantics. In our setting, taking the scene in Section III-C as an example, minimizing is equivalent to minimizing . If is built upon category names that not match the image data distribution of current dataset, then values of in those mismatched dimensions will always have low values, preventing student networks from learning useful information from distribution alignment. When using the selected task-oriented category names, ideal knowledge transfer can be achieved for each category in current scenario.
IV Experiment
IV-A Setup
Dataset and Downstream Tasks. For model pre-training with the LGD, the ImageNet-1k (IN-1k) [10] dataset is used as unlabeled source data by abandoning the labels, which contains 1.28M images for training, and 50,000 images for validation. Then, we evaluate the pre-trained model on various downstream tasks, including zero-shot and fully-supervised classification on the IN-1k and the Caltech-256 (CT-256) [44] dataset, semi-supervised classification on the IN-1k dataset, object detection and instance segmentation on the COCO 2017 [45] dataset, long-tailed object detection and instance segmentation on the LVIS v1 [46] dataset, and semantic segmentation on the ADE20K[47] and Cityscapes (CS) [48] datasets.
Teacher-Student Pair. Experiments are mainly conducted on two pairs of teacher-student models: CLIP pre-trained ResNet-50 (CLIP RN50) [8] → ResNet-18 (RN18) [49] and CLIP RN50 [8] → MobileNetV2 (MNV2) [50], representing knowledge transfer between similar and dissimilar networks respectively. Besides, we also conduct experiments with different teacher networks, such as MoCo RN50 [12] and SimSiam RN50 [13]. The pre-trained text encoder used is a Transformer [51] based encoder that is jointly trained with CLIP RN50.
Method | Teacher | Student | Data | Text | Epoch | Zero-shot | Linear-probe | ||||
IN-1k | CT-256 | IN-1k | CT-256 | ||||||||
1% | 10% | 100% | |||||||||
MoCo | RN50 | - | - | - | - | - | - | - | - | 67.5 | - |
CLIP | RN50 | - | - | - | - | 58.9 | 78.1 | - | - | 73.3 | - |
Super. | - | RN18 | IN-1k | - | 90 | - | - | - | - | - | 77.1 |
SEED | MoCo RN50 | RN18 | IN-1k | - | 200 | - | - | 37.5 | 51.1 | 57.9 | 78.5 |
BINGO | MoCo RN50 | RN18 | IN-1k | - | 200 | - | - | 42.8 | 57.5 | 61.4 | 79.3 |
DisCo | MoCo RN50 | RN18 | IN-1k | - | 200 | - | - | - | - | 60.6 | - |
SMD | SimSiam RN50 | RN18 | IN-1k | - | 100 | - | - | - | - | 61.8 | - |
SEED | CLIP RN50 | RN18 | IN-1k | - | 200 | 44.2 | 62.1 | 43.1 | 56.3 | 62.7 | 79.5 |
BINGO | CLIP RN50 | RN18 | IN-1k | - | 200 | 45.5 | 62.3 | 44.2 | 59.2 | 64.0 | 80.6 |
LGD | CLIP RN50 | RN18 | IN-1k | IN-1k | 200 | 49.5 | 63.9 | 53.5 | 61.7 | 67.2 | 82.7 |
LGD | CLIP RN50 | RN18 | IN-1k | IN-1k | 90 | 47.8 | 62.3 | 52.9 | 61.5 | 66.4 | 81.6 |
LGD | CLIP RN50 | RN18 | IN-1k | CT-256 | 90 | 34.3 | 65.8 | - | - | - | - |
LGD | CLIP RN50 | RN18 | CT-256 | CT-256 | 90 | 18.9 | 66.5 | - | - | - | - |
Super. | - | MNv2 | IN-1k | - | 90 | - | - | - | - | - | 77.5 |
SEED | CLIP RN50 | MNv2 | IN-1k | - | 200 | - | - | 42.2 | 55.5 | 63.5 | 79.1 |
BINGO | CLIP RN50 | MNv2 | IN-1k | - | 200 | - | - | 43.5 | 56.9 | 64.6 | 80.1 |
LGD | CLIP RN50 | MNv2 | IN-1k | IN-1k | 200 | 50.3 | 64.7 | 51.9 | 60.7 | 67.4 | 81.3 |
LGD | CLIP RN50 | MNv2 | IN-1k | IN-1k | 90 | 48.5 | 63.0 | 50.3 | 59.4 | 66.3 | 81.2 |
LGD | CLIP RN50 | MNv2 | IN-1k | CT-256 | 90 | 35.0 | 66.8 | - | - | - | - |
LGD | CLIP RN50 | MNv2 | CT-256 | CT-256 | 90 | 19.2 | 67.4 | - | - | - | - |
Method | Teacher | Student | Data | Text | Epoch | COCO | LVIS | ADE20K | CS | ||
MoCo | RN50 | - | IN-1k | - | - | 38.5 | 35.1 | 22.7 | 21.9 | 38.9 | 75.3 |
CLIP | RN50 | - | WIT | - | - | 39.3 | 36.8 | 23.0 | 22.1 | 39.6 | 75.8 |
Super. | - | RN18 | IN-1k | - | 90 | 32.8 | 31.3 | 17.5 | 18.6 | 33.3 | 71.3 |
SEED | MoCo RN50 | RN18 | IN-1k | - | 200 | 33.7 | 31.8 | 18.1 | 19.2 | 34.0 | 72.2 |
BINGO | MoCo RN50 | RN18 | IN-1k | - | 200 | 33.9 | 32.1 | 18.4 | 19.3 | 34.4 | 72.3 |
SEED | CLIP RN50 | RN18 | IN-1k | - | 200 | 34.4 | 32.2 | 18.9 | 19.7 | 34.4 | 72.8 |
BINGO | CLIP RN50 | RN18 | IN-1k | - | 200 | 34.6 | 32.4 | 19.2 | 19.9 | 34.7 | 73.2 |
LGD | CLIP RN50 | RN18 | IN-1k | IN-1k | 90 | 34.0 | 32.0 | 19.7 | 20.3 | 34.9 | 73.7 |
LGD | CLIP RN50 | RN18 | IN-1k | IN-1k | 200 | 35.1 | 32.9 | 20.1 | 20.5 | 35.3 | 74.0 |
Super. | - | MNv2 | IN-1k | - | 90 | 27.3 | 26.4 | 14.9 | 14.4 | 29.7 | 70.2 |
SEED | MoCo RN50 | MNv2 | IN-1k | - | 200 | 28.1 | 26.8 | 15.4 | 14.8 | 31.9 | 70.5 |
BINGO | MoCo RN50 | MNv2 | IN-1k | - | 200 | 28.3 | 27.0 | 15.6 | 15.0 | 32.2 | 70.8 |
SEED | CLIP RN50 | MNv2 | IN-1k | - | 200 | 28.9 | 27.4 | 15.9 | 16.1 | 32.2 | 71.0 |
BINGO | CLIP RN50 | MNv2 | IN-1k | - | 200 | 29.1 | 27.7 | 16.0 | 16.2 | 32.5 | 71.3 |
LGD | CLIP RN50 | MNv2 | IN-1k | IN-1k | 90 | 28.8 | 27.5 | 16.1 | 16.6 | 33.8 | 72.3 |
LGD | CLIP RN50 | MNv2 | IN-1k | IN-1k | 200 | 29.6 | 28.1 | 16.2 | 16.9 | 33.9 | 72.3 |
Comparison Methods. The proposed LGD is compared with several small model pre-training methods: 1) fully-supervised pre-training method, including ImageNet pre-training [9]; 2) SSD based methods, including SEED [4], BINGO [5], DisCo [20], and SMD [6].
Implementation Details. The proposed LGD is implemented using the PyTorch framework, and all experiments are conducted on 8 NVIDIA TESLA V100 GPUs. The distillation process is trained with a standard SGD optimizer with a momentum of 0.9 and a weight decay parameter of 1e-4 for 90 and 200 epochs. The initial learning rate is set as 0.03 and updated by a cosine decay scheduler [52] with 5 warm-up epochs and a batch size of 256 per GPU.
For transferring to image classification, we conduct the supervised linear classification on IN-1k and CT-256. For zero-shot setting, we treat the textual semantics bank as a classifier and the predicted category index is calculated following Eq. 1. In this way, each image feature will be classified into the most semantically similar category without training with the corresponding label. For linear-probe setting, following previous works [12, 4], we train a single linear layer classifier on top of the frozen network encoder after distillation or pre-training. For IN-1k dataset, besides fully supervised learning which utilizes 100% of the training set, following previous works [11, 5], we evaluate the proposed method by fine-tuning the student model with 1% and 10% labeled data as semi-supervised setting. We follow the training split settings as in previous works for fair comparisons. SGD optimizer is used to prepare the linear classifier for 100 epochs with a weight decay of 0. The initial learning rate is set as 30 and is then reduced by a factor of 10 at 60 and 80 epochs. The results are reported in terms of Top-1 accuracy.
For transferring to object detection and instance segmentation, we use mmdetection [53] for implementation. We train Mask-RCNN FPN [54] with RN18 and MNv2 backbone to evaluate the transferability of the learned features on COCO 2017 and LVIS v1. AdamW [55] optimizer is used to finetune the whole network for 12 epochs (the default 1x schedule). The initial learning rate is set as 2e-4 and then reduced by a factor of 10 at 8 and 11 epochs. To better preserve the pre-trained weights, we set the learning rate of the image encoder as 1/10 of the other parameters.
For semantic segmentation, we use mmsegmentation [56] for implementation. We train Semantic FPN [57] with RN18 backbone and PSPNet[58] with MNv2 backbone to evaluate the transferability of learned features on ADE20k and CS. AdamW [55] optimizer is also used, and we also set the learning rate of the image encoder as 1/10 of the other parameters. For MNv2 and RN18 backbone, the initial learning is set as 1e-2 and 1e-4, respectively. For ADE20k and Cityscapes, we train the model for 160k and 80k iterations, respectively.
IV-B Main Results
IV-B1 Results on Zero-shot Classification.
As shown in Table I, we compare our method with previous SSD methods SEED [4] and BINGO [5] on the ImageNet-1k and Caltech-256 validation set. The top-1 zero-shot classification accuracy is reported. We observe that the proposed LGD surpasses the contrastive SSD methods consistently on all benchmarks, verifying the effectiveness of the proposed LGD.
To further verify the effect of language guidance, we still use the same IN-1k as unlabeled image input but change the input texts to the class of CT-256 for doing experiments. Significant improvement is observed on both RN18 (+3.5%) and MNv2 (+3.8%) backbone, which shows that we can specify the VSB and TSB through the texts to control the knowledge we want to transfer to the students. Besides, we use the CT-256 as our unlabeled image input and achieve the highest zero-shot accuracy (66.5% and 67.4%) on the CT-256 dataset, which reveals that the consistency of data between the distillation and downstream tasks will impact the results. More relevance between the source data used in distillation and the downstream task brings better downstream performance of the trained small model. Nevertheless, to fairly compare the performance of small models transferring to downstream tasks with other methods, we uniformly use IN-1k as the source data for distillation or pre-training in the following experiments.
![Refer to caption](x3.png)
IV-B2 Transfer to Semi/Fully-supervised Classification.
Following [4, 5], we evaluate the learned representation on semi-supervised/fully-supervised classification on IN-1k, where a fixed 1%, 10% or 100% of IN-1k training data are provided with the annotations. Besides, to further study whether the improvement of the learned representations by distillation is confined to ImageNet, we evaluate the additional classification dataset CT-256 to study the generalization and transferability of the feature representation. As shown in Table I, the proposed LGD has a remarkably 1.2%-9.3% improvement compared with previous methods.
IV-B3 Transfer to Object Detection and Instance Segmentation.
As shown in Table II, we conduct experiments on two downstream tasks including object detection and instance segmentation, on the COCO 2017 and LVIS v1 datasets. Compared with standard ImageNet supervised pre-training (Super.), the distilled model pre-trained by the proposed LGD achieves a large improvement in the same number of training epochs. On COCO, the RN18 based Mask RCNN shows +1.2 and +0.7 point improvement on and , respectively. And the MNv2 based Mask RCNN shows +1.5 and +1.7 point improvement on and . Compared with other SSD methods, our method also shows a significant improvement. Compared with SEED and BINGO pre-trained RN18, which is distilled from CLIP RN50, the proposed LGD shows a consistent +0.5 to +0.9 point improvement on COCO. Compared with the RN18 distilled from MoCo RN50, the proposed LGD shows a large improvement of +0.8 to +1.7 points. The MNV2 backbone also shows a similar improvement and the detailed experiment results can be seen in Table II.
IV-B4 Transfer to Semantic Segmentation.
As shown in Table II, we conduct semantic segmentation experiments on the ADE20k and Cityscapes (CS) datasets. Compared with standard ImageNet supervised pre-training (Super.), the RN18 based Semantic FPN shows +1.6 and +2.4 mIoU improvement on the two datasets, respectively. Besides, compared with SSD methods, the proposed LGD also shows significant +0.5 to +1.8 mIoU improvement. For the MNv2 backbone, it also shows a similar improvement and the detailed experiment results can be seen in Table II.
Notably, we observe that the proposed LGD surpasses the comparative methods, including standard ImageNet pre-training and previous self-supervised distillation, on all benchmarks with the same overhead of training time and data amount. This also proves the generalization ability of the learned representations from language-guided distillation to a wide range of data domains and classes.
IV-C Ablation and Analysis
IV-C1 Different Distillation Strategies.
Method | Loss | ZS Top-1 |
---|---|---|
SEED | - | 44.2 |
LGD | 45.7 | |
LGD | 47.3 | |
LGD | 47.8 |
Method | Data | Text | CT-256 | CS | ADE |
---|---|---|---|---|---|
LGD | IN-1k | IN-1k | 81.6 | 73.7 | 34.9 |
LGD | IN-1k | CT-256 | 81.9 | - | - |
LGD | IN-1k | CS | - | 74.1 | - |
LGD | IN-1k | ADE | - | - | 35.2 |
To show the effectiveness of consistency loss in visual and textual space, we conduct an ablation study and show results in Table III. We use CLIP RN50 as the teacher network and report the zero-shot top-1 accuracy on the IN-1k validation set. SEED [4] trains a student to mimic the similarity score distribution inferred by a teacher over a set of randomly maintained instances. and are the proposed consistency losses in visual and textual space. Compared with SEED, the proposed visual constraint has a significant +1.5% improvement. With the constraint in both visual and textual space ( + ), we can get the best result of 47.8%.
To further explore the effect of different distillation strategies, the attention maps are visualized by Grad-CAM [59] and shown in Fig. 3. The proposed LGD has the most accurate attention maps and the most similar attention area to the teacher (CLIP RN50). And we find that if we only apply constraints in visual space, such as SEED and , the student network will pay attention to the general feature of the category. For example, it will pay attention to the net and sea when it identifies the fish and shark. Although images of different categories commonly share these areas, they can not sufficiently represent the semantic categories in the downstream task. Besides, with the constraint in both visual and textual space, the attention map can focus on the most critical area, such as the animals’ faces.
IV-C2 Text Control on More Downstream Tasks
Method | Teacher | Student | Top-1 |
---|---|---|---|
CLIP | - | ViT-B/16 | 68.3 |
SEED | CLIP RN101 | RN18 | 45.2 |
LGD | CLIP RN101 | RN18 | 50.3 |
SEED | CLIP ViT-B/16 | RN18 | 53.2 |
LGD | CLIP ViT-B/16 | RN18 | 55.9 |
SEED | CLIP ViT-B/16 | DeiT-tiny | 53.3 |
LGD | CLIP ViT-B/16 | DeiT-tiny | 56.5 |
Method | Image Encoder | Text Encoder | Top-1 |
---|---|---|---|
SEED | MoCo RN50 | - | 57.9 |
SEED+LGD | MoCo RN50 | CLIP text | 59.7 |
SEED+LGD | MoCo RN50 | 58.5 | |
SMD | SimSiam RN50 | - | 61.8 |
SMD+LGD | SimSiam RN50 | CLIP text | 62.1 |
In this section, we evaluate the text control on more downstream tasks as shown in Table IV. We firstly distill a small model and then finetune the distilled small model on the downstream tasks as described in Section II. In the distillation process, we use the IN-1k as unlabeled source data and the language prompt with class names in the downstream tasks as text input. For example, there are 19 text inputs for Cityscapes, 150 text inputs for ADE20K, and 256 text inputs for Caltech-256. The CLIP RN50 and RN18 are selected as the teacher and student models. And the training epoch is set to 90. Other hyper-parameters are the same as those introduced in Section II.
Constant improvements are observed on the Caltech-256 (+0.3), Cityscapes (+0.4), and ADE20K (+0.3). It shows that through the text control, the small model can learn more useful knowledge from the teacher and perform better on corresponding downstream tasks.
IV-C3 Different Pre-trained Image/Text Encoder.
To evaluate the generalization of the proposed LGD, we combine LGD with other SSL pre-trained image or text encoder and show consistent improvement. Specifically, we introduce LGD into SEED [4] and SMD [6] as the language supervision module introduced in the above Section III-D. As shown in Table VI, SEED + LGD has a significant +1.8% improvement compared with SEED. And SMD + LGD also has a +0.3% improvement compared with SMD. Besides, we use BERT [40] as text encoder, which also shows +0.6% improvement compared with SEED. Note that in the above experiments, the image encoder and text encoder are self-supervised and pre-trained independently, which shows that the proposed LGD also works on non-joint-trained image and text encoder.
Method | Pre-training (h) | Finetuning (h) | Testing (m) |
---|---|---|---|
SEED | 58.8(200) | 14.8 | 2.5 |
LGD | 36.3(90) / 72.2(200) | 14.7 | 2.7 |
Method | Student | Zero-shot | Linear-probe |
---|---|---|---|
SEED | PointMLPElite | 78.48 | 89.90 |
SEED | Transformer-6L | 77.31 | 89.74 |
LGD | PointMLPElite | 80.35 | 91.17 |
LGD | Transformer-6L | 79.86 | 91.26 |
To further verify the applicability of our proposed method across various architectures, we conduct experiments employing CLIP RN101 or CLIP ViT-B/16 as teacher and RN18 or DeiT-tiny as student. As shown in Table V, our proposed LGD demonstrates remarkable performance compared to SEED [4] when applied to multiple teacher/student architectures. For instance, in the cases where we use CLIP ViT-B/16 as teacher and DeiT-tiny as student, LGD achieves notable improvements of 3.2% compared with SEED. The results of zero-shot ImageNet classification demonstrate that LGD can not only perform well on CNN-based architectures but also show good performance on other architectures, such as ViTs.
IV-C4 Training and testing time analysis
The training and testing time of our proposed LGD and previous method SEED [4] is assessed in Table VII. Specifically, both models are trained for 200 epochs and finetuned for 100 epochs on the ImageNet1K dataset under the same settings (batch size, number of GPUs, etc). For testing time, we measure the inference time required for classifying images in the validation dataset. Results indicate that under the same number of training epochs (200), the pre-training time for LGD is longer than that for SEED. However, it is noteworthy that our method achieves comparable results with just 90 epochs of training (taking 36.3 hours), as opposed to SEED, which requires 200 epochs (taking 58.8 hours), as shown in Table I. Therefore, despite longer pre-training time per epoch, the overall efficiency of LGD is superior, requiring significantly less training time to achieve comparable performance. Additionally, the finetuning and testing time of SEED and LGD is similar, as both methods only utilize the student network.
IV-C5 Results on Tasks other than Image
To demonstrate the efficacy and generalization ability of the proposed LGD, extensive experiments are conducted on point cloud classification task. For teacher network, we select the latest point-language pre-training framework Point-Bind [60] with I2P-MAE [61] as the point cloud encoder. For student, a solid lightweight point cloud encoder PointMLPElite [62] is employed. Besides, plain transformer network used in [63] with less (6) layers is also used as student since it has a similar structure with teacher point cloud encoder. Results are shown in Table VIII. Compared to the previous method SEED, LGD performs better on both zero-shot and linear-probe classification on ModelNet40 dataset. And the improvement is consistent among both students.
IV-C6 Visualization of Feature Distributions
![Refer to caption](x4.png)
![Refer to caption](x5.png)
To validate the effectiveness of LGD in aligning the outputs of the teacher and student in both the visual and textual semantic spaces, we visualize the feature distribution of the teacher model (CLIP RN50) and student model (LGD RN18 & SEED RN18 [4]). In Fig. 4, it can be seen that the LGD learns more compact feature distribution than SEED. Due to the model capacity, the feature distributions of both LGD RN18 and SEED RN18 are more diverse than that of CLIP RN50. In Fig. 5, the feature distribution of the image encoder (teacher), text encoder, and student are shown. It can be observed that the feature distribution of SEED RN18 is similar to that of the teacher but has a large gap with the text feature of the text encoder which contains task-related knowledge . As shown in Fig. 5 (a), with the constraint in textual space, the feature distributions of LGD RN18 have some variances to that of the teacher but also have a matching relationship with the text feature, representing the combination of knowledge of both visual and textual features.
V Conclusion
This paper proposes Language-Guided Distillation (LGD) for transferring knowledge from a pre-trained large teacher model, such as CLIP, to a small one. We first use language guidance to determine which knowledge of the teacher should be transferred to the students. Then, we propose a language-guided knowledge aggregation module to maintain language-guided textual and visual semantic banks. At last, we design two distillation losses to maintain the consistency of teacher and student in both visual and textual space. Thorough experiments show that LGD offers significant performance improvement on various downstream tasks.
In our study, we find that the LGD performance largely relies on the joint-trained text and image encoder, while the performance gain on the non-joint-trained one is relatively small because of the mismatch between the textual and visual features. We hope our work could promote the development of distillation methods to become a new paradigm for small model pre-training by making full use of the off-the-shelf pre-trained models.
ACKNOWLEDGMENTS
This work is supported by National Key R&D Program of China (2022ZD0160300), National Natural Science Foundation of China (No. 62101137 and 62071127), Shanghai Natural Science Foundation (No. 23ZR1402900). The computations in this research were performed using the CFFF platform of Fudan University.
References
- [1] R. He, S. Sun, J. Yang, S. Bai, and X. Qi, “Knowledge distillation as efficient pre-training: Faster convergence, higher data-efficiency, and better transferability,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9161–9171.
- [2] H. Shi, Y. Zhang, S. Tang, W. Zhu, Y. Li, Y. Guo, and Y. Zhuang, “On the efficacy of small self-supervised contrastive models without distillation signals,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 2225–2234.
- [3] H. Wu, Y. Gao, Y. Zhang, S. Lin, Y. Xie, X. Sun, and K. Li, “Self-supervised models are good teaching assistants for vision transformers,” in International Conference on Machine Learning. PMLR, 2022, pp. 24 031–24 042.
- [4] Z. Fang, J. Wang, L. Wang, L. Zhang, Y. Yang, and Z. Liu, “Seed: Self-supervised distillation for visual representation,” in International Conference on Learning Representations, 2021.
- [5] H. Xu, J. Fang, X. Zhang, L. Xie, X. Wang, W. Dai, H. Xiong, and Q. Tian, “Bag of instances aggregation boosts self-supervised distillation,” in International Conference on Learning Representations, 2022.
- [6] H. Liu and M. Ye, “Improving self-supervised lightweight model learning via hard-aware metric distillation,” in European Conference on Computer Vision. Springer, 2022, pp. 295–311.
- [7] S. Abbasi Koohpayegani, A. Tejankar, and H. Pirsiavash, “Compress: Self-supervised learning by compressing representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12 980–12 992, 2020.
- [8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
- [9] K. He, R. Girshick, and P. Dollár, “Rethinking imagenet pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4918–4927.
- [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2009.
- [11] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
- [12] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
- [13] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 750–15 758.
- [14] F. Wang, T. Kong, R. Zhang, H. Liu, and H. Li, “Self-supervised learning by estimating twin class distribution,” IEEE Transactions on Image Processing, 2023.
- [15] Y. Liu, K. Wang, L. Liu, H. Lan, and L. Lin, “Tcgl: Temporal contrastive graph for self-supervised video representation learning,” IEEE Transactions on Image Processing, vol. 31, pp. 1978–1993, 2022.
- [16] J. Ge, Y. Liu, J. Gui, L. Fang, M. Lin, J. T.-Y. Kwok, L. Huang, and B. Luo, “Learning the relation between similarity loss and clustering loss in self-supervised learning,” IEEE Transactions on Image Processing, 2023.
- [17] Z. Tao, X. Liu, Y. Xia, X. Wang, L. Yang, X. Huang, and T.-S. Chua, “Self-supervised learning for multimedia recommendation,” IEEE Transactions on Multimedia, 2022.
- [18] Y. Liu, J. Wu, L. Qu, T. Gan, J. Yin, and L. Nie, “Self-supervised correlation learning for cross-modal retrieval,” IEEE Transactions on Multimedia, 2022.
- [19] Y. Zhang, C. Liu, Y. Zhou, W. Wang, Q. Ye, and X. Ji, “Beyond instance discrimination: Relation-aware contrastive self-supervised learning,” IEEE Transactions on Multimedia, 2023.
- [20] Y. Gao, J.-X. Zhuang, S. Lin, H. Cheng, X. Sun, K. Li, and C. Shen, “Disco: Remedy self-supervised learning on lightweight models with distilled contrastive learning,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 1–10.
- [21] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 4904–4916.
- [22] M. Li, X. Chen, C. Zhang, S. Chen, H. Zhu, F. Yin, G. Yu, and T. Chen, “M3dbench: Let’s instruct large models with multi-modal 3d prompts,” arXiv preprint arXiv:2312.10763, 2023.
- [23] S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, “Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 428–26 438.
- [24] X. Yang, F. Liu, and G. Lin, “Effective end-to-end vision language pretraining with semantic visual loss,” IEEE Transactions on Multimedia, no. 99, pp. 1–10, 2023.
- [25] Y. Xing, Q. Wu, D. Cheng, S. Zhang, G. Liang, P. Wang, and Y. Zhang, “Dual modality prompt tuning for vision-language pre-trained model,” IEEE Transactions on Multimedia, 2023.
- [26] S. Chen, H. Zhu, M. Li, X. Chen, P. Guo, Y. Lei, Y. Gang, T. Li, and T. Chen, “Vote2cap-detr++: Decoupling localization and describing for end-to-end 3d dense captioning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
- [27] Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan, “Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=zq1iJkNk3uN
- [28] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, “Filip: Fine-grained interactive language-image pre-training,” arXiv preprint arXiv:2111.07783, 2021.
- [29] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975.
- [30] M. Wang, J. Xing, and Y. Liu, “Actionclip: A new paradigm for video action recognition,” arXiv preprint arXiv:2109.08472, 2021.
- [31] Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “Denseclip: Language-guided dense prediction with context-aware prompting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 082–18 091.
- [32] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 535–541.
- [33] G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, vol. 2, no. 7, 2015.
- [34] B. Zhao, Q. Cui, R. Song, Y. Qiu, and J. Liang, “Decoupled knowledge distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 953–11 962.
- [35] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
- [36] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” arXiv preprint arXiv:1612.03928, 2016.
- [37] D. Chen, J.-P. Mei, Y. Zhang, C. Wang, Z. Wang, Y. Feng, and C. Chen, “Cross-layer distillation with semantic calibration,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 8, 2021, pp. 7028–7036.
- [38] J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4133–4141.
- [39] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation distillation,” arXiv preprint arXiv:1910.10699, 2019.
- [40] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 4171–4186. [Online]. Available: https://doi.org/10.18653/v1/n19-1423
- [41] J. Qin, J. Wu, P. Yan, M. Li, R. Yuxi, X. Xiao, Y. Wang, R. Wang, S. Wen, X. Pan et al., “Freeseg: Unified, universal and open-vocabulary image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 446–19 455.
- [42] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
- [43] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, “Pointclip: Point cloud understanding by clip,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 8552–8562.
- [44] Griffin, Holub, and Perona, “Caltech 256,” 4 2022.
- [45] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
- [46] A. Gupta, P. Dollar, and R. Girshick, “Lvis: A dataset for large vocabulary instance segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5356–5364.
- [47] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
- [48] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
- [49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [50] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
- [51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [52] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Icml, 2010.
- [53] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, “MMDetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155, 2019.
- [54] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
- [55] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
- [56] M. Contributors, “MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark,” https://github.com/open-mmlab/mmsegmentation, 2020.
- [57] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6399–6408.
- [58] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
- [59] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
- [60] Z. Guo, R. Zhang, X. Zhu, Y. Tang, X. Ma, J. Han, K. Chen, P. Gao, X. Li, H. Li et al., “Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following,” arXiv preprint arXiv:2309.00615, 2023.
- [61] R. Zhang, L. Wang, Y. Qiao, P. Gao, and H. Li, “Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21 769–21 780.
- [62] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: A simple residual mlp framework,” in International Conference on Learning Representations, 2022.
- [63] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-bert: Pre-training 3d point cloud transformers with masked point modeling,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 19 313–19 322.