RdimKD: Generic Distillation Paradigm by Dimensionality Reduction

Yi Guo, Yiqian He, Xiaoyang Li, Haotong Qin, Van Tung Pham,
Yang Zhang, Shouda Liu
ByteDance
{guoyi.0,heyiqian.11,lixiaoyang.x,qinhaotong,van.pham,zhangyang.elfin,liushouda}@bytedance.com

Abstract

Knowledge Distillation (KD) emerges as one of the most promising compression technologies to run advanced deep neural networks on resource-limited devices. In order to train a small network (student) under the guidance of a large network (teacher), the intuitive method is regularizing the feature maps or logits of the student using the teacher’s information. However, existing methods either over-restrict the student to learn all information from the teacher, which leads to bad local minima, or use various fancy and elaborate modules to process and align features, which are complex and lack generality. In this work, we propose an abstract and general paradigm for the KD task, referred to as DIMensionality Reduction KD (RdimKD), which relies solely on dimensionality reduction, with a very minor modification to the naive $\ell^{2}$ loss. RdimKD straightforwardly utilizes a projection matrix to project both the teacher’s and student’s feature maps onto a low-dimensional subspace, and the projected feature maps are then optimized during training. RdimKD achieves this goal in the simplest way: the student obtains valuable information from the teacher while retaining sufficient flexibility to adapt to its own low capacity. Our extensive empirical findings indicate the effectiveness of RdimKD across various learning tasks and diverse network architectures.

1 Introduction

With the increasing and extensive application of Deep Neural Networks (DNNs) in industry, model compression technologies [22, 37, 41, 96] have been widely studied to deploy deep models on storage- and computation-limited hardware. Among these technologies, Knowledge Distillation (KD) [29] attracts attention from academia and industry for its high architecture adaptability and compression performance.

The essence of knowledge distillation lies in how to obtain valuable knowledge from the teacher network. For example, soft labels can better reflect the distribution information between categories than hard one-hot labels [29, 3]. To extend knowledge distillation to more general and complex scenarios, more and more works [58, 79, 84, 41] are also exploring the distillation from intermediate feature maps as regularization to assist the training of student networks further. A common method, e.g. [58, 33, 40, 77], is to use a naive $\ell^{2}$ loss on the original feature maps of the teacher and student. Specifically, let $F_{t},F_{s}$ be the feature maps to be distilled of the teacher and student, then the KD loss can be described as

\min\mathcal{L}_{KD}=\min\|F_{t}-r_{\theta}(F_{s})\|^{2}

(1)

where $r_{\theta}(\cdot)$ is a learnable transformation layer needed when the shapes of the feature maps mismatch, $\theta$ the learnable parameters, $\|\cdot\|$ the Frobenius norm for matrix.

Intuitively, it is suboptimal to force the student to absorb all the information of the teacher in the manner of Eq. 1, because of the difference in network capacity, the randomness of initialization, and/or the difficulty of optimization. The experiments in Sec. 4 show that a simple $\ell^{2}$ loss between feature maps (with the same shapes) of teacher and student does not bring enough performance improvement for the student. So, naturally, we may require the student to learn only some useful information from the teacher, while maintaining a certain degree of flexibility to adapt to the reality of its low capacity. Also, methods based on Eq. 1 come at the cost of additional modules to be trained as well as more hyper-parameters to be tuned.

As a result, instead of applying $\ell^{2}$ loss on the original feature maps, some works [36, 84, 94, 28] manipulate and align the feature maps in some fancy and less explainable ways. For example, [84] calculates the attention of the feature maps by pooling along the channel dimension, while [94] performs along the width and height dimensions, and [82] generates FSP matrix from two layers to represent the knowledge flow. Methods in this category are essentially designed to increase students’ freedom of learning without over-restricting their flexibility and to only get some valuable information from the teacher network. But these fancy and specific designs are too elaborate to be essential, and we want to reveal the essence of knowledge distillation at a more abstract and higher level.

Figure 1: The overall framework of RdimKD. The student’s and teacher’s feature maps are projected onto a low-dimensional subspace by a matrix, and then the simple $\ell^{2}$ loss is implemented. The projection of the teacher only allows some valuable knowledge to be transferred out, while the projection of the student leaves the complementary subspace as freedom of the student.

We propose a simple, generic, and effective knowledge distillation method named RdimKD from a more abstract, higher-level, and more essential perspective: dimensionality reduction. RdimKD is based on dimensionality reduction itself to ensure that students can focus on valuable information from teachers and enjoy enough flexibility. The proposed framework, shown in Fig. 1, can be formally described as

\min\mathcal{L}_{KD}=\min\|F_{t}K-F_{s}K\|^{2}

(2)

where $K$ is the projection matrix. We do not use the function $r_{\theta}(\cdot)$ with learnable $\theta$ in RdimKD for two reasons. On the one hand, the function is somewhat tricky to design for specific tasks, and we hope to abstract the essence of knowledge distillation in a more generic way; on the other hand, when optimizing Eq. 1, this function makes the student lazy to some extent: it can rely on $\theta$ to reconcile the difference with the teacher instead of acquiring knowledge from it.
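As a hedged illustration of the freedom that Eq. 2 leaves to the student, the NumPy sketch below builds a student feature map that differs from the teacher only in directions orthogonal to the columns of $K$. The sizes, the random data, and the variable names are illustrative assumptions, not taken from the experiments. The naive loss of Eq. 1 (with $r_{\theta}$ taken as the identity) penalizes this difference, whereas the projected loss of Eq. 2 does not.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c, d = 32, 8, 2  # illustrative sizes (assumptions)

# Orthonormal basis of R^c: the first d columns play the role of K,
# the remaining c - d columns span the complementary subspace.
Q, _ = np.linalg.qr(rng.standard_normal((c, c)))
K, K_perp = Q[:, :d], Q[:, d:]

F_t = rng.standard_normal((N, c))
# The student deviates from the teacher only inside the complementary subspace.
F_s = F_t + rng.standard_normal((N, c - d)) @ K_perp.T

naive_loss = np.sum((F_t - F_s) ** 2)         # Eq. 1 with identity r_theta
rdim_loss = np.sum((F_t @ K - F_s @ K) ** 2)  # Eq. 2

print(f"naive l2 loss: {naive_loss:.3f}, projected loss: {rdim_loss:.3e}")
```

The projected loss vanishes up to floating-point error, which is one way to read the statement in Fig. 1 that the complementary subspace is left as freedom for the student.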

Note that the proposed method requires $F_{t}$ and $F_{s}$ to have the same dimension. To allow more flexible architecture for teacher and student, there are two solutions to bypass the above requirement: 1. train a teacher with the same shapes at distillation positions; 2. set the distillation position at a linear transformation (e.g. linear or convolution layer) $f:\mathbb{R}^{p}\rightarrow\mathbb{R}^{q}$ of the student. $f$ can be split into two transformations, $f_{1}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{t}$ and $f_{2}:\mathbb{R}^{t}\rightarrow\mathbb{R}^{q}$ , where $t$ is the dimension of teacher. In this way, teacher and student have the same dimension $t$ and can be distilled by RdimKD. During inference, $f_{1}$ and $f_{2}$ can be merged into $f$ without changing the original structure of the student.
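The merging step in the second solution can be sketched as follows, assuming that $f_{1}$ and $f_{2}$ are affine maps with no nonlinearity between them. The names `W1`, `b1`, `W2`, `b2` and the dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, t, q = 6, 10, 4  # student input dim, teacher dim, student output dim (assumed)

# f1: R^p -> R^t and f2: R^t -> R^q, used only during distillation training.
W1, b1 = rng.standard_normal((t, p)), rng.standard_normal(t)
W2, b2 = rng.standard_normal((q, t)), rng.standard_normal(q)

def f1(x):
    return W1 @ x + b1  # its output has the teacher's dimension t and can be distilled

def f2(h):
    return W2 @ h + b2

# Merge for inference: f(x) = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.standard_normal(p)
split_out = f2(f1(x))
merged_out = W @ x + b
```

After merging, the student again contains a single map $\mathbb{R}^{p}\rightarrow\mathbb{R}^{q}$, so the inference-time architecture is unchanged.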

RdimKD focuses on the concept of dimensionality reduction itself, regardless of the specific reduction methods. To show this, we provide three reduction methods that are very common in statistical machine learning, i.e., Principal Component Analysis (PCA) [1] (RdimKD-P), Autoencoder (RdimKD-A), and Random Orthogonal Matrices [5, 34] (RdimKD-R), explained in detail in Sec. 3.

Our main highlights are summarized as follows:

• Compared with previous methods that manipulate or align features in elaborate and fancy ways, this work reveals the benefit of dimensionality reduction in distillation from an essential level.

• RdimKD works well on various deep learning tasks (image classification, object detection, semantic segmentation, language understanding, speech recognition) and neural architectures (CNNs, Transformer, Conformer), which makes it scalable to complex and diverse industrial applications.

• Experiments show that RdimKD achieves performance comparable to or higher than state-of-the-art results on the above benchmarks.

• The implementation of RdimKD, especially RdimKD-R, is very simple yet effective, and it has been deployed at one of the most famous short-video companies, which means that it has been evaluated in super-large-scale industrial projects.

2 Related works

Knowledge distillation (KD) was first proposed by Hinton et al. [29] in classification, where they utilize the logits from the teacher as the soft labels to transfer the “dark knowledge” to the student. Later, Fitnets [58] started to distill knowledge from the intermediate layers to further boost the performance of students. Since then, the mainstream KD methods can be roughly divided into logits distillation [29, 12, 18, 49, 75, 89, 90, 62, 68, 56, 4, 46, 52] and intermediate layer distillation [58, 40, 84, 53, 79, 27, 32, 55, 82, 63, 47, 81, 74, 61, 6, 64]. RdimKD falls into the latter category, and we will summarize related works in this category.

Distillation from intermediate layers can be considered as a regularization of model training. Besides classification, some works are designed for detection [7, 40, 66, 21, 13, 87, 39], segmentation [25, 47, 76, 69], and other specific domains [50, 67, 73, 16, 70, 11, 86]. In our paper, we want to construct a general distillation method for all these tasks. SAKD [61] proposed a strategy to adaptively determine the distillation layers in the teacher per sample in each training iteration during the distillation period. ReviewKD [10] built connection paths across different levels between teacher and student. MGD [79] utilized a mask and some transformation modules to make the feature maps of the student mimic those of the teacher. CD [60] normalized the feature map of each channel to obtain a distribution, then minimized the Kullback–Leibler (KL) divergence between the distributions of teacher and student. TAT [41] proposed a one-to-all method that allows each pixel of the teacher to teach all spatial locations of the student given the similarity. Most of these methods focus on a specific view of the teacher’s feature map through some elaborate transformation, but fail to capture the generic information of KD. Rather than simply proposing another variant, we abstract a more generic and higher-level perspective to reveal the nature of the problem.

3 Methods

We introduce the details of RdimKD in this section, including the overall design and the three projection matrices. The overall framework is shown in Fig. 1. Taking a CNN as an example, let $F_{t},F_{s}\in\mathbb{R}^{b\times h\times w\times c}$ be the feature maps to be distilled of the teacher and student, respectively. They can be viewed as two matrices, $F_{t},F_{s}\in\mathbb{R}^{N\times c}$, representing a collection of $N$ points with $c$ dimensions (where $N=bhw$, and the same notation $F_{t},F_{s}$ is used where no confusion arises). These two matrices can be multiplied by a common fixed matrix $K\in\mathbb{R}^{c\times d}(d<c)$ to project these points into a subspace with $d$ dimensions. In the subspace, the simple $\ell^{2}$ loss is used to minimize the difference between the two projected feature maps. The final objective function is as follows:

\min_{w}\mathcal{F}(w)=\mathcal{L}(w)+\frac{\alpha}{Nd}\|F_{t}K-F_{s}K\|^{2}

(3)

where $w$ is all learnable parameters of the student, $\mathcal{L}(w)$ the original loss function of the student, $\alpha$ the balance factor. RdimKD focuses on the concept of dimensionality reduction itself, regardless of the specific reduction methods. To show this, we provide three methods to construct the projection matrix $K$ , explained in the following subsections. Unlike learnable modules of some previous works, we freeze $K$ during the whole training process. Also, we will study the performance when it is changeable at each iteration in the ablation study part.
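The distillation term of Eq. 3 can be sketched in NumPy as below. This is a minimal illustration under assumed shapes; the function name `rdimkd_term`, the placeholder `task_loss`, and the random inputs are hypothetical. In an actual training loop the same expression would be written in an autodiff framework so that gradients reach the student only, with $K$ kept frozen as described above.

```python
import numpy as np

def rdimkd_term(F_t, F_s, K, alpha=1.0):
    """alpha / (N d) * ||F_t K - F_s K||^2 for feature maps of shape (b, h, w, c)."""
    c, d = K.shape
    Ft = F_t.reshape(-1, c)  # (N, c) with N = b * h * w
    Fs = F_s.reshape(-1, c)
    N = Ft.shape[0]
    diff = (Ft - Fs) @ K     # projected difference, shape (N, d)
    return alpha / (N * d) * np.sum(diff ** 2)

rng = np.random.default_rng(0)
b, h, w, c, r = 2, 4, 4, 16, 4  # illustrative sizes; r = c / d is the reduction rate
d = c // r

# A fixed projection with orthonormal columns (any of the three constructions could be used).
K, _ = np.linalg.qr(rng.standard_normal((c, d)))

F_t = rng.standard_normal((b, h, w, c))
F_s = rng.standard_normal((b, h, w, c))

task_loss = 0.7  # placeholder standing in for the student's original loss L(w)
total = task_loss + rdimkd_term(F_t, F_s, K, alpha=1.0)
```

Because of the $\frac{1}{Nd}$ normalization, the term is the mean squared entry of the projected difference scaled by $\alpha$, which keeps its magnitude comparable across different feature-map sizes and reduction rates.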

3.1 Projection via PCA

Principal component analysis (PCA) [1] is a popular technique for reducing the dimensionality of a dataset such that the variance of the dataset is preserved as much as possible. Specifically, we first center the values of each point in $F_{t}$ by subtracting the mean of each column from each of those values, resulting in matrix $\hat{F_{t}}$ . The eigenvalue decomposition of its covariance matrix, $\frac{1}{N-1}\hat{F_{t}}^{T}\hat{F_{t}}$ , is as follows:

\frac{1}{N-1}\hat{F_{t}}^{T}\hat{F_{t}}=U\Sigma U^{T}

(4)

where $\Sigma=diag\{\sigma_{1},\sigma_{2},...,\sigma_{c}\}$ is a diagonal matrix, and each entry represents an eigenvalue. For the sake of description, we assume that they are already in descending order. $U=(u_{1},u_{2},...,u_{c})$ is the matrix whose columns $u_{i},i=1,2,...,c$ are unit vectors that are orthogonal to each other. Each $u_{i}$ can be interpreted as a principal axis, and $\sigma_{i}$ is the corresponding variance along the $i$-th axis.

As an example, Fig. 2 shows the distribution of eigenvalues for ResNet-34 on ImageNet [15]. The severe anisotropy of the distribution suggests that we may only need to project the samples onto the first $d$ principal axes to represent the most important information. Hence, we can let $K=(u_{1},u_{2},...,u_{d})$ in Eq. 3. The distillation method corresponding to the projection matrix obtained in this way is named RdimKD-P. We will show in the experiment section that projecting onto the first $d$ principal axes does give better performance than projecting onto the last $d$ principal axes.
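The construction of $K$ for RdimKD-P can be sketched as follows. This is a minimal NumPy illustration on synthetic anisotropic data; the function name `pca_projection` and the data are assumptions made only for this example.

```python
import numpy as np

def pca_projection(F_t, d):
    """Return K = (u_1, ..., u_d) and all eigenvalues in descending order."""
    F_hat = F_t - F_t.mean(axis=0, keepdims=True)  # center each column
    cov = F_hat.T @ F_hat / (F_t.shape[0] - 1)     # left-hand side of Eq. 4
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # reorder to descending
    return eigvecs[:, order[:d]], eigvals[order]

rng = np.random.default_rng(0)
N, c, d = 500, 16, 4
# Synthetic teacher features whose variance differs strongly across directions.
F_t = rng.standard_normal((N, c)) * np.linspace(5.0, 0.1, c)

K, eigvals = pca_projection(F_t, d)
explained = eigvals[:d].sum() / eigvals.sum()  # share of variance kept by the first d axes
```

Taking the last $d$ columns of the reordered eigenvector matrix instead would give the projection onto the last $d$ principal axes.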

3.2 Projection via autoencoder

PCA aims at projecting the dataset into a normal subspace while preserving the maximum amount of information. Another way to remove noise while retaining the primary information is to use an autoencoder. The matrix $K$ in Eq. 3 can be viewed as an encoder, and we design $K^{\prime}\in\mathbb{R}^{d\times c}$ as a decoder, to minimize the objective function:

\min_{K,K^{\prime}}\mathcal{J}(K,K^{\prime})=\frac{1}{Nc}\|F_{t}-F_{t}KK^{\prime}\|^{2}+\gamma(\|K\|^{2}+\|K^{\prime}\|^{2})

(5)

where $\gamma$ is a small positive number to balance the norm of $K$ and $K^{\prime}$ . Since the decoder $K^{\prime}$ is used to restore the original information as much as possible, the encoded feature map $F_{t}K$ needs to retain general information of $F_{t}$ . When this is done, the solution of $K$ will act as the projection matrix in Eq. 3. We name this method RdimKD-A.
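One possible way to minimize Eq. 5 is plain gradient descent with hand-derived gradients, sketched below in NumPy. The learning rate, step count, initialization scale, value of $\gamma$, and the synthetic data are all assumptions made for illustration. With $R=F_{t}KK^{\prime}-F_{t}$, the gradients are $\frac{2}{Nc}F_{t}^{T}RK^{\prime T}+2\gamma K$ with respect to $K$ and $\frac{2}{Nc}(F_{t}K)^{T}R+2\gamma K^{\prime}$ with respect to $K^{\prime}$.

```python
import numpy as np

def fit_linear_autoencoder(F_t, d, gamma=1e-4, lr=0.05, steps=500, seed=0):
    """Gradient descent on Eq. 5; returns encoder K, decoder K_dec, and the loss history."""
    rng = np.random.default_rng(seed)
    N, c = F_t.shape
    K = 0.1 * rng.standard_normal((c, d))      # encoder, used later as the projection
    K_dec = 0.1 * rng.standard_normal((d, c))  # decoder K'
    history = []
    for _ in range(steps):
        Z = F_t @ K          # encoded features, shape (N, d)
        R = Z @ K_dec - F_t  # reconstruction residual, shape (N, c)
        loss = np.sum(R ** 2) / (N * c) + gamma * (np.sum(K ** 2) + np.sum(K_dec ** 2))
        history.append(loss)
        grad_K = 2.0 / (N * c) * F_t.T @ R @ K_dec.T + 2.0 * gamma * K
        grad_K_dec = 2.0 / (N * c) * Z.T @ R + 2.0 * gamma * K_dec
        K -= lr * grad_K
        K_dec -= lr * grad_K_dec
    return K, K_dec, history

rng = np.random.default_rng(1)
F_t = rng.standard_normal((256, 8)) * np.linspace(2.0, 0.2, 8)  # synthetic teacher features
K, K_dec, history = fit_linear_autoencoder(F_t, d=2)
```

Only the encoder $K$ is kept for distillation; the decoder serves solely to define the reconstruction objective.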

3.3 Projection via random orthogonal matrix

Random projection [2] is a technique used to reduce the dimensionality of datasets in Euclidean space. The core idea behind it is given in the Johnson-Lindenstrauss (JL) lemma [34, 48]:

Lemma 1

For any $0<\epsilon<1$ and integer $N$, let $d$ be an integer with $d>4(\epsilon^{2}/2-\epsilon^{3}/3)^{-1}\log N$. Then, for any set $V$ of $N$ points in $\mathbb{R}^{c}$, there is a linear map $f:\mathbb{R}^{c}\rightarrow\mathbb{R}^{d}$ such that for all $u,v\in V$, the inequality holds:

(1-\epsilon)\|u-v\|^{2}\leq\|f(u)-f(v)\|^{2}\leq(1+\epsilon)\|u-v\|^{2}

(6)
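To give a feeling for the dimension required by Lemma 1, the following sketch evaluates the bound for a given $N$ and $\epsilon$, assuming that $\log$ denotes the natural logarithm. The helper name `jl_min_dim` and the example values are illustrative.

```python
import math

def jl_min_dim(n_points, eps):
    """Smallest integer d with d > 4 * (eps^2/2 - eps^3/3)^(-1) * ln(n_points)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    bound = 4.0 / (eps ** 2 / 2.0 - eps ** 3 / 3.0) * math.log(n_points)
    return math.floor(bound) + 1  # strict inequality, so step past the bound

d_min = jl_min_dim(10_000, 0.5)
print(d_min)
```

For ten thousand points and $\epsilon=0.5$ the bound already exceeds 400 dimensions, which is larger than typical channel counts such as 256.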

One proof takes $f$ to be a suitable multiple of the orthogonal projector onto a random subspace; a simple proof is given in [14]. This lemma states that datasets in a high-dimensional space can be linearly projected onto a low-dimensional space with approximate preservation of distances between the samples. Random projection is simple and computationally efficient compared with other dimensionality reduction methods. We find that this idea can also be borrowed in the field of knowledge distillation, although the dimensions before and after projection do not strictly satisfy the requirements of the JL lemma. Following this idea, we generate the random matrix $K$ in Eq. 3 in two steps: 1. generate a random matrix in $\mathbb{R}^{c\times d}$ with elements drawn from a Gaussian distribution; 2. orthonormalize the columns of the matrix by the Gram–Schmidt process. These two steps are very simple to implement in PyTorch [54] with just one line of code: torch.nn.init.orthogonal_.

A matrix obtained in this way has spherical symmetry, which we conjecture to be a good property. It is possible that, in some extreme cases, the projection matrix will project the samples onto a subspace that approximates the span of the last $d$ principal axes of the PCA, which would result in the bad performance shown as pca_last in Tab. 6. Nevertheless, we did not observe this extreme phenomenon in our experiments. We name this method RdimKD-R.
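The two-step construction of $K$ described above can be sketched in NumPy as follows. Note the substitution: instead of an explicit Gram–Schmidt loop, this sketch uses a reduced QR decomposition, which yields an orthonormal basis of the same column span; the sign correction makes the result agree with what classical Gram–Schmidt would return. The function name and sizes are illustrative assumptions.

```python
import numpy as np

def random_orthogonal(c, d, seed=0):
    """Random c x d matrix with orthonormal columns (d <= c)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((c, d))  # step 1: i.i.d. Gaussian entries
    Q, R = np.linalg.qr(G)           # step 2: orthonormalize the columns (reduced QR)
    # Flip column signs so that diag(R) > 0, matching the Gram-Schmidt convention.
    return Q * np.sign(np.diag(R))

c, r = 256, 4
K = random_orthogonal(c, c // r)
```

With PyTorch available, calling `torch.nn.init.orthogonal_` on an empty `c x d` tensor, as mentioned in the text, serves the same purpose.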

4 Experiments

To show the generality and effectiveness of RdimKD, we conduct experiments on various deep learning tasks (image classification, object detection, semantic segmentation, language understanding, and speech recognition) and neural architectures (CNN [24], Transformer [65], and Conformer [20]), and compare them to works in recent years. We set $r=c/d$, which represents the reduction rate of the subspace dimension. For simplicity, we use the same $r$ for all feature maps to be distilled in a given experiment. For RdimKD-A, we choose to train Eq. 5 by gradient descent before distillation training, although it may have a closed-form solution. For RdimKD-P, we randomly select hundreds of training samples to conduct PCA. In the following, for brevity, “A to B” means a distillation experiment with A as the teacher and B as the student. Due to the page limit, we only explain the primary settings for some experiments; the remaining details and the language understanding experiments are given in the supplementary materials.

4.1 Image classification

The classification experiments are done on the ImageNet ILSVRC-12 dataset [15], which contains 1000 object categories with 1.2 million images for training and 50k for testing. We conduct experiments on “ResNet-34 to ResNet-18” and “ResNet-50 to MobileNet [30]”. Top-1 accuracy is reported. RdimKD can be combined with soft-label-based KD [29] by adding the additional loss:

\mathcal{L}_{KL}=-\beta\sum_{i}q_{i}\log p_{i}

(7)

where $q_{i}$ is the probability distribution of the teacher’s output, $p_{i}$ is that of the student, and $\beta$ is the balance coefficient.
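A minimal NumPy sketch of Eq. 7 is given below. It assumes that $q$ and $p$ are plain softmax outputs without a temperature and that the loss is averaged over the batch; neither detail is specified above, so both are assumptions, as are the function names and the random logits.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def soft_label_loss(teacher_logits, student_logits, beta=1.0):
    """-beta * sum_i q_i log p_i, averaged over the batch (averaging is an assumption)."""
    q = np.exp(log_softmax(teacher_logits))  # teacher distribution q
    log_p = log_softmax(student_logits)      # log of student distribution p
    return -beta * np.mean(np.sum(q * log_p, axis=-1))

rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((4, 10))
student_logits = rng.standard_normal((4, 10))

loss = soft_label_loss(teacher_logits, student_logits, beta=2.0)
loss_matched = soft_label_loss(teacher_logits, teacher_logits, beta=2.0)
```

The loss is a cross-entropy between the two distributions, so it is smallest when the student reproduces the teacher's distribution, in which case it equals $\beta$ times the entropy of $q$.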

ResNet-34 and ResNet-18 both contain four stages and differ only in the number of blocks in each stage. We distill the last feature maps of the third and fourth stages, where the number of channels is 256 and 512, respectively. For MobileNet, we distill the last feature maps of the third and fourth stages of ResNet-50 to the outputs of the 11th and 14th convolutions of MobileNet. To achieve this, we manually changed the number of channels in these two layers of ResNet-50 to 512 and 1024, respectively, and added the necessary 1x1 convolution at the skip layer. Data preprocessing and augmentation are the same as in the official PyTorch example (https://github.com/pytorch/examples/blob/master/imagenet/main.py). We use a cosine learning rate scheduler with an initial value of 0.1 and train for 105 epochs. In “ResNet-34 to ResNet-18”, $r=4$ and $\alpha=1$ for all the RdimKD, and $\beta=2$ for RdimKD. Results are shown in Tab. 1. We can see that RdimKD boosts performance by a clear margin, and the simplest RdimKD-R produces about the same performance as RdimKD-A/P.

method	MbNet	Res18	method	MbNet	Res18
teacher	76.77	74.55	DIST [NIPS 2022] [31]	73.24	72.07
student	70.93	70.96	WSLD [ICLR 2021] [93]	71.52	72.04
RdimKD-R	72.56	71.89	SRRL [ICLR 2021] [77]	72.49	71.73
RdimKD-A	72.65	71.94	KR [CVPR 2021] [10]	72.56	71.61
RdimKD-P	72.77	72.01	DKD [CVPR 2022] [90]	72.05	71.7
RdimKD-R*	73.13	72.53	MGD [ECCV 2022] [79]	72.59	71.8
RdimKD-A*	73.15	72.58	TAT [CVPR 2022] [41]	None	72.41
RdimKD-P*	73.23	72.49	KCD [ECCV 2022] [38]	71.25	72.13

Table 1: Top-1 results on ImageNet. For the MbNet column, ResNet-50 is the teacher and MobileNet is the student; for the Res18 column, ResNet-34 is the teacher and ResNet-18 is the student. * means combined with $\mathcal{L}_{KL}$, and None means not reported in the original paper. Note that TAT, KCD, DIST, and WSLD also contain $\mathcal{L}_{KL}$. We can see that RdimKD boosts the performance of the student by a clear margin, and that the simplest RdimKD-R achieves performance comparable to RdimKD-A/P. All of our results are averaged over 3 trials.

4.2 Object detection

The detection experiments are conducted on the COCO2017 dataset [44], which contains 80 object categories with 115k training images and 5k validation images. All the teachers are trained for 36 epochs and students for 24 epochs. Other training details are the same as the standard protocols in the widely used Detectron2 library [71]. The inheriting strategy [35] is used for distillation. The networks are RetinaNet [43] and Faster-RCNN [57] with different backbones. Mean average precision (AP) is reported.

RetinaNet: RetinaNet uses a Feature Pyramid Network (FPN) [42] to generate a multi-scale feature pyramid with levels $P_{3}$ to $P_{7}$ , all of which contain 256 channels. The only difference between the various benchmark models is the backbone. So, naturally, knowledge can be distilled from the five levels of the pyramid. We use two teacher networks, RetinaNet-ResNet-101 and RetinaNet-ResNeXt-101 [72], separately, to teach the student network, RetinaNet-ResNet-50. Results are shown in Tab. 2. In these experiments, $r=4$ . For RdimKD-R and RdimKD-A, $\alpha=1$ , while for RdimKD-P, $\alpha=0.5$ .

Faster-RCNN: Besides the one-stage detector RetinaNet, we also evaluate our RdimKD in two-stage detector, Faster-RCNN. Similar to previous works [90, 78], we also use FPN to capture multi-scale features. Same as DKD [90], $P_{2}$ to $P_{5}$ are input features for the ROI heads, and the number of channels for each level is 256. Similar to RetinaNet, the only difference between various benchmarks is the backbone, while the structure of the ROI head is the same. So, naturally, we distill at the FPN layers. In these experiments, we conduct “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18”, “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-50” and “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-MobileNet_V2 [59]”. $r=4$ for ResNet-50 and MobileNet_V2, $r=2$ for ResNet-18. Details are in Supplementary Materials. Results are shown in Tab. 3.

method	AP	$\text{AP}_{\text{S}}$	$\text{AP}_{\text{M}}$	$\text{AP}_{\text{L}}$
S:RetinaNet-R50	38.28	22.36	42.34	49.42
T:RetinaNet-R101	40.56	24.46	44.57	52.73
FGD [CVPR 2022] [78]	39.7	22.0	43.7	53.6
KR [CVPR 2021] [10]	38.48	22.67	42.72	58.22
LD [CVPR 2022] [91]	39.0	23.1	43.2	51.1
FRS [NIPS 2021] [92]	39.7	21.8	43.5	52.4
LGD [AAAI 2022] [88]	40.35	24.08	44.15	52.53
GID [CVPR 2021] [13]	39.1	22.8	43.1	52.3
KDRP [AAAI 2022] [39]	39.6	21.4	44.0	52.5
RdimKD-R	40.67	24.57	44.62	52.92
RdimKD-A	40.67	24.45	44.70	53.16
RdimKD-P	40.68	24.17	44.90	52.72
T:RetinaNet-X101	41.10	23.95	44.78	53.27
FGD [CVPR 2022] [78]	40.7	22.9	45.0	54.7
MGD [ECCV 2022] [79]	41.0	23.4	45.3	55.7
FRS [NIPS 2021] [92]	40.1	21.9	43.7	54.3
DIST [NIPS 2022] [31]	40.1	23.2	44.0	53.6
CD [ICCV 2021] [60]	40.8	22.7	44.5	55.3
FB [ICLR 2021] [87]	39.6	22.7	43.3	52.5
LGD [AAAI 2022] [88]	40.35	24.08	44.15	52.53
RdimKD-R	40.97	24.20	45.18	53.99
RdimKD-A	40.95	23.72	45.11	53.86
RdimKD-P	41.05	24.47	45.39	54.02

Table 2: Performance of “RetinaNet-R101 to RetinaNet-R50” (the top half of the table) and “RetinaNet-X101 to RetinaNet-R50” (the bottom half of the table) on the COCO2017 validation set. Here, ‘R101’, ‘R50’ and ‘X101’ denote ResNet-101, ResNet-50 and ResNeXt-101, respectively; ‘T’ and ‘S’ denote teacher and student, respectively. LGD [88] is a self-distillation method and does not use a teacher. We can see that RdimKD consistently equals or outperforms other methods. The simplest RdimKD-R produces about the same performance as RdimKD-A/P. All of our results are averaged over 3 trials.

method	AP	$\text{AP}_{\text{S}}$	$\text{AP}_{\text{M}}$	$\text{AP}_{\text{L}}$
T: ResNet-101	42.17	25.50	45.55	54.93
S: ResNet-18	34.96	19.78	37.39	45.44
DKD [CVPR 2022] [90]	37.01	None	None	None
SCKD [ICCV 2021] [95]	37.5	20.9	42.6	50.8
KR [CVPR 2021] [10]	36.75	19.42	39.51	49.58
RdimKD-R	38.25	21.12	41.25	51.34
RdimKD-A	38.29	21.04	41.03	51.45
RdimKD-P	38.31	20.78	41.12	51.82
T: ResNet-101	42.17	25.50	45.55	54.93
S: ResNet-50	39.66	24.03	42.76	51.74
DKD [CVPR 2022] [90]	40.65	None	None	None
FGD [CVPR 2022] [78]	40.5	22.6	44.7	53.2
ICD [NIPS 2021] [35]	40.9	24.5	44.2	53.5
KR [CVPR 2021] [10]	40.36	23.60	43.81	52.87
FRS [NIPS 2021] [92]	39.5	22.3	43.6	51.7
LGD [AAAI 2022] [88]	40.47	23.96	43.94	52.19
RdimKD-R	41.76	25.22	45.28	54.42
RdimKD-A	41.72	25.12	45.11	54.50
RdimKD-P	41.74	24.97	45.18	54.60
T: ResNet-50	41.84	25.15	45.27	54.48
S: MobileNet_V2	34.51	19.98	36.38	44.94
DKD [CVPR 2022] [90]	34.35	None	None	None
KR [CVPR 2021] [10]	33.71	16.77	35.81	46.47
RdimKD-R	36.26	20.24	38.29	49.21
RdimKD-A	36.06	19.82	37.95	48.86
RdimKD-P	36.00	19.50	38.00	49.20

Table 3: Results on COCO2017 on Faster-RCNN-FPN with different backbones. ‘T’ and ‘S’ mean teacher and student, respectively. In the table, the top part is for “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18”, the middle part is for “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-50”, while the bottom part is for “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-MobileNet_V2”. ‘None’ means not reported in the original paper. For MobileNet_V2, the baselines of DKD [90] and KR [10] are relatively weak, so our baseline performance, 34.51, is higher than that of DKD and KR after their distillation. As the teacher for MobileNet_V2, we choose the best of all the ResNet-50 models trained under the guidance of ResNet-101 with RdimKD-R. All the results are averaged over 3 trials.

method	ResNet18	MobileNet_V2
teacher	79.26	79.26
student	73.59	73.70
TAT [CVPR 2022] [41]	75.76	73.85
CIRKD [CVPR 2022] [76]	74.50	None
FAKD [83]	None	67.62
IFVD [ECCV 2020] [69]	74.05	None
ICKD [ICCV 2021] [45]	75.01	72.79
RdimKD-R	75.63	75.28
RdimKD-A	75.55	75.67
RdimKD-P	75.94	75.19

Table 4: Results on semantic segmentation on Pascal VOC. We use DeepLabv3+-ResNet-101 as the teacher, and DeepLabv3+-ResNet-18 and DeepLabv3+-MobileNet_V2 as students. ‘None’ means not reported in the original paper. Also, for MobileNet_V2, the baseline for FAKD is relatively weak. All of our results are averaged over 6 trials.

4.3 Semantic segmentation

The segmentation experiments are done on Pascal VOC [17], which contains 20 foreground classes and 1 background class. With the additional coarsely annotated training images from [23], there are a total of 10582 images for training. The validation set contains 1449 images, on which we report the mean Intersection over Union (mIoU) to show segmentation performance.

DeepLabv3+: DeepLabv3+ [9] is a popular network for segmentation tasks. Besides the Atrous Spatial Pyramid Pooling (ASPP) module and encoder-decoder structure, it extends DeepLabv3 [8] by adding a decoder module to capture rich semantic information to refine the object boundaries. Knowledge is distilled at the low-level feature coming from the backbone (in front of the resize and ReLU layer) and the output of the ASPP module (in front of the final dropout and ReLU layer), where the number of channels is 48 and 256, respectively. We use the settings from the public code (https://github.com/VainF/DeepLabV3Plus-Pytorch) unless otherwise stated. For all of our experiments, the output stride (OS) is 16 for training and validation. We conduct “DeepLabv3+-ResNet-101 to DeepLabv3+-ResNet-18” and “DeepLabv3+-ResNet-101 to DeepLabv3+-MobileNet_V2”. The values for $\alpha$ and $r$ are given in the supplementary materials. We find that the variance of the results across trials is large, so each of our results is the average of six trials. The results are shown in Tab. 4.

4.4 Speech recognition

To show the generalization and effectiveness of RdimKD, we apply it to a more challenging task, speech recognition. The input to this task is a speech waveform, and the output is the corresponding text. RNN-Transducer (RNN-T) [19] is a well-known end-to-end architecture for streaming speech recognition [26], which contains an encoder, a predictor, and a joiner. For the encoder, we use 12 layers of Conformer [20] for the teacher and 6 layers for the student. For the predictor, three layers of bidirectional transformers [65] (Bitransformers) are used for both teacher and student. We use the Librispeech dataset [51], which contains about 1000 hours of speech sampled at 16 kHz. We use the 960-hour corpus for training and the development and test sets for evaluation. Beam search is used as the decoding mode, and the Word Error Rate (WER) is used as the metric. The implementation details are the same as the corresponding part of the WeNet library [80, 85]. We train each experiment for 50 epochs. The dimension of attention for the encoder is 256, and we distill knowledge from the 8th and 12th layers of the teacher to the 4th and 6th layers of the student. $\{\alpha=2,r=4\}$ is chosen for all results in this experiment. The inheriting strategy [35] is used. The results are shown in Tab. 5. It can be seen that RdimKD generalizes well and remains effective beyond computer vision.

	clean(%)			other(%)
	dev	test	mean	dev	test	mean
Teacher	3.30	3.48	3.39	8.86	8.90	8.88
Student	3.71	3.99	3.85	10.22	10.21	10.22
RdimKD-R	3.51	3.80	3.66	9.40	9.50	9.45
RdimKD-A	3.56	3.78	3.67	9.44	9.44	9.44
RdimKD-P	3.56	3.77	3.67	9.48	9.43	9.46
no_proj	3.67	3.98	3.83	10.10	10.02	10.06

Table 5: Results on speech recognition. We use 12 layers of encoder of RNN-T as teacher and 6 as student, and use Librispeech as experiment dataset. Word Error Rate (WER) is used as metric (smaller is better). no_proj is explained in Sec. 4.5. The mean column is the arithmetic mean of dev and test columns. It can be seen that our RdimKD has strong generalization and effectiveness besides computer vision. Only one trial is run for each result in this table due to the high cost of the experiment.

4.5 Ablation studies

The key to RdimKD is the projection. In this subsection, we focus on the projection methods (including no projection), subspace reduction rate $r$ , and the coefficient of distillation loss $\alpha$ . For generality, we will explore these ablation studies in different experiments.

Projection method: As shown in Tab. 6, we conduct this ablation study via “ResNet-50 to MobileNet” on ImageNet, “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18” on COCO2017, and “DeepLabv3+-ResNet-101 to DeepLabv3+-MobileNet_V2” on Pascal VOC. We have the following findings: 1. In RdimKD-P, feature maps are projected onto PCA’s first $d$ principal axes. A natural question is what happens if feature maps are instead projected onto the last $d$ principal axes (denoted pca_last in Tab. 6). Comparing RdimKD-P with pca_last, we find that the first $d$ principal components do contain more valuable information. Theoretically, in an extreme case, the projection matrix of RdimKD-R could approximate that of pca_last. Nevertheless, RdimKD-R consistently performed well in our experiments. 2. If we remove the projection and apply the $\ell^{2}$ loss directly to the original feature maps (denoted no_proj in Tab. 6), performance still improves over the baseline. However, the improvement is consistently smaller than that with projection (also shown in Tab. 5). We suspect that this is caused by the low capacity of students, which makes it impossible and unnecessary to learn every detail from the teacher network accurately. 3. When the random projection matrix is not necessarily orthogonal (each element is drawn from the Gaussian distribution $\mathcal{N}(0,\frac{1}{c})$, denoted no_orth), the performance is slightly worse than RdimKD-R. 4. If the random projection matrix is regenerated in every iteration rather than kept fixed from the beginning (denoted randE), the performance is also worse than RdimKD-R. It should be noted that the above comparison is not necessarily fair, because each method may have its own optimal $\alpha$, and it is difficult to find the optimal $\alpha$ for each method due to the experimental cost. Nonetheless, RdimKD performs better than no_proj, and RdimKD-P performs better than pca_last, for a certain range of $\alpha$.
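The no_orth and randE variants can be sketched as follows. This is a hedged NumPy illustration: the sizes, the helper name `fresh_orthogonal`, and the use of QR for orthonormalization are assumptions. With entries drawn from $\mathcal{N}(0,\frac{1}{c})$, each column has unit squared norm only in expectation, and distinct columns are only approximately orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
c, d = 256, 64  # illustrative channel count and subspace dimension

# no_orth: i.i.d. Gaussian entries with variance 1/c, no orthonormalization.
K_no_orth = rng.normal(0.0, np.sqrt(1.0 / c), size=(c, d))
gram = K_no_orth.T @ K_no_orth
off_diag = gram - np.diag(np.diag(gram))
mean_sq_norm = np.mean(np.diag(gram))       # close to 1, but not exactly 1
max_overlap = np.max(np.abs(off_diag))      # nonzero: columns are not orthogonal

# randE: a new random orthogonal matrix is drawn at every training iteration.
def fresh_orthogonal(rng, c, d):
    Q, _ = np.linalg.qr(rng.standard_normal((c, d)))
    return Q

K_iter1 = fresh_orthogonal(rng, c, d)
K_iter2 = fresh_orthogonal(rng, c, d)       # differs from K_iter1
```

In RdimKD-R, by contrast, a single orthonormalized matrix is drawn once and reused for the whole training run.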

Reduction rate $r$ : The reduction rate is also an important variable, as it determines the dimension of the subspace onto which the data of the original space is projected. It is easy to prove that $r=1$ is mathematically equivalent to no_proj, and the two are also close in experimental performance. As shown in Fig. 3, projecting the feature maps onto a subspace of appropriate dimension does boost performance further.
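The equivalence at $r=1$ follows from norm preservation: if $K$ is a square matrix with $KK^{\top}=I$, then $\|(a-b)K\|^{2}=(a-b)KK^{\top}(a-b)^{\top}=\|a-b\|^{2}$. The short numerical check below illustrates this under the assumptions that $r=1$ means the subspace dimension equals the channel dimension $c$ and that the loss is a squared $\ell^{2}$ distance between projected features; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
c, n = 8, 32
teacher = rng.standard_normal((n, c))
student = rng.standard_normal((n, c))

# Square orthogonal matrix: K_full.T @ K_full = K_full @ K_full.T = I.
K_full, _ = np.linalg.qr(rng.standard_normal((c, c)))

loss_no_proj = np.sum((teacher - student) ** 2)
loss_r1 = np.sum(((teacher - student) @ K_full) ** 2)

# Keeping only some columns penalizes just part of the residual,
# so the projected loss can never exceed the unprojected one.
K_half = K_full[:, : c // 2]
loss_half = np.sum(((teacher - student) @ K_half) ** 2)
```

The last comparison shows why a lower-dimensional projection is a strictly weaker constraint on the student than no_proj.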

method	top-1	top-5	method	top-1	top-5
baseline	70.93	89.59	pca_last	71.40	89.96
RdimKD-R	72.56	90.94	no_proj	72.24	90.79
RdimKD-A	72.65	91.00	no_orth	72.40	90.87
RdimKD-P	72.77	91.02	randE	71.63	90.47
method	AP	mIOU	method	AP	mIOU
baseline	34.96	73.70	pca_last	35.89	74.40
RdimKD-R	38.31	75.28	no_proj	37.96	74.93
RdimKD-A	38.29	75.67	no_orth	37.75	75.04
RdimKD-P	38.31	75.19	randE	37.77	74.91

Table 6: Ablation study for different projection methods. The top part is “ResNet-50 to MobileNet” on ImageNet, while the bottom is “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18” on COCO2017 (the AP column) and “DeepLabv3+-ResNet-101 to DeepLabv3+-MobileNet_V2” on Pascal VOC (the mIOU column). pca_last means feature maps are projected onto the last $d$ principal axes of PCA. no_proj means the $K$ in Eq. 3 is an identity matrix. no_orth means that elements in the projection matrix $K$ in Eq. 3 are randomly chosen from the Gaussian distribution $\mathcal{N}(0,\frac{1}{c})$, and the matrix itself is not necessarily orthogonal. randE means that the matrix $K$ is a random orthogonal matrix generated in Each iteration, not fixed. Results for ImageNet and COCO2017 are averaged over 3 trials, and those for VOC are averaged over 6 trials.

Coefficient $\alpha$ : The value of $\alpha$ balances the weights of the original loss and the distillation loss. In Fig. 4, we run “DeepLabv3+-ResNet-101 to DeepLabv3+-ResNet-18” on the Pascal VOC segmentation task and “ResNet-50 to MobileNet” on the ImageNet classification task. It can be seen that, as $\alpha$ increases, the performance of the student first rises and then decreases, which is in line with our expectations and verifies the effectiveness of our method.

Distillation position: For non-sequential structures such as RetinaNet, we distill the feature maps at the output of the FPN; for sequential structures that stack blocks, the layer indices to be distilled in the teacher and the student are proportional. For example, if the teacher is 2 times deeper than the student, then the $i$-th layer of the student is taught by the corresponding $2i$-th layer of the teacher. However, we find that, in general, distilling only some upper layers produces better results, as shown in Tab. 7 as an example.

mask	clean(%)			other(%)
mask	dev	test	mean	dev	test	mean
111111	3.58	3.79	3.69	9.65	9.78	9.72
001111	3.51	3.80	3.66	9.40	9.50	9.45
000011	3.54	3.78	3.66	9.47	9.53	9.50

Table 7: An example of distillation position for a sequential-structure network. When a 12-layer RNN-T teaches a 6-layer one, the $i$-th layer of the student is taught by the corresponding $2i$-th layer of the teacher. However, not distilling the lower layers is better. Mask 001111 means not distilling the first and second layers of the student, and 111111 means distilling all 6 layers. The conclusion is consistent with [79].
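The proportional layer mapping and the mask notation of Tab. 7 can be illustrated with a small helper. This is a hypothetical sketch: the function names are not from the paper, layers are indexed from 1, and the teacher depth is assumed to be an integer multiple of the student depth.

```python
def teacher_layer_for(student_layer, teacher_depth, student_depth):
    """1-based proportional mapping: student layer i -> teacher layer i * ratio."""
    if teacher_depth % student_depth != 0:
        raise ValueError("sketch assumes teacher depth is a multiple of student depth")
    return student_layer * (teacher_depth // student_depth)

def distilled_pairs(mask, teacher_depth):
    """Return (student_layer, teacher_layer) for every '1' in the mask string."""
    student_depth = len(mask)
    return [
        (i, teacher_layer_for(i, teacher_depth, student_depth))
        for i, flag in enumerate(mask, start=1)
        if flag == "1"
    ]

pairs_all = distilled_pairs("111111", teacher_depth=12)
pairs_upper = distilled_pairs("001111", teacher_depth=12)
# pairs_all   -> [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (6, 12)]
# pairs_upper -> [(3, 6), (4, 8), (5, 10), (6, 12)]
```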

4.6 Discussion

RdimKD-R and RdimKD-P project the feature maps onto a subspace (denoted as $\mathcal{S}$ ), and an interesting question is what the student learns in the orthogonal complement (denoted as $\mathcal{S}^{\perp}$ ). Note that $\mathcal{S}$ and $\mathcal{S}^{\perp}$ are determined by the teacher.

For illustration, we call the subspace spanned by the first $d$ principal axes the principal subspace. By this definition, for RdimKD-P, $\mathcal{S}$ is also the principal subspace of the teacher. A natural question is whether it is also the student’s principal subspace after the student is trained by RdimKD-P. To explore this, we project the feature maps of the well-trained student onto $\mathcal{S}$ and $\mathcal{S}^{\perp}$ , respectively, and then perform PCA in each of these two subspaces. The distribution of the resulting eigenvalues is plotted in Fig. 5(a). We find that the eigenvalues in $\mathcal{S}$ are almost all larger than those in $\mathcal{S}^{\perp}$ , which shows that $\mathcal{S}$ is approximately the principal subspace of the student, too. Interestingly, for RdimKD-R in Fig. 5(b), $\mathcal{S}$ is chosen randomly rather than by PCA, but the eigenvalues in $\mathcal{S}$ are also generally larger than those in $\mathcal{S}^{\perp}$ . Comparing the ordinates of Fig. 5(a) and Fig. 5(b), the variance of the feature maps trained by RdimKD-R is smaller than that of those trained by RdimKD-P.
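The analysis described above (project onto $\mathcal{S}$ and $\mathcal{S}^{\perp}$, then compare PCA eigenvalues) can be sketched as follows. This is a minimal illustration on synthetic data, not the code behind Fig. 5; the helper names and the toy "student" features are assumptions.

```python
import numpy as np

def split_basis(K):
    """Orthonormal bases for span(K) and for its orthogonal complement."""
    c, d = K.shape
    q, _ = np.linalg.qr(K, mode="complete")   # q has shape (c, c)
    return q[:, :d], q[:, d:]

def subspace_eigenvalues(features, basis):
    """PCA eigenvalues (descending) of features expressed in the given basis."""
    centered = features - features.mean(axis=0, keepdims=True)
    coords = centered @ basis                 # coordinates inside the subspace
    cov = coords.T @ coords / (len(features) - 1)
    return np.linalg.eigvalsh(cov)[::-1]

# Toy setup (assumption): S is spanned by the first d coordinate axes and the
# synthetic student features carry most of their variance along those axes.
rng = np.random.default_rng(0)
c, d, n = 12, 4, 500
K = np.eye(c)[:, :d]
scales = np.concatenate([np.full(d, 3.0), np.full(c - d, 0.3)])
student_feats = rng.standard_normal((n, c)) * scales

S, S_perp = split_basis(K)
eig_S = subspace_eigenvalues(student_feats, S)
eig_perp = subspace_eigenvalues(student_feats, S_perp)
```

If every value in `eig_S` exceeds every value in `eig_perp`, then $\mathcal{S}$ coincides with the principal subspace of these features, which is the property examined in Fig. 5.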

We also consider the rotation of the principal subspace of the student with respect to that of the teacher. To do so, we first project the feature maps of the student onto $\mathcal{S}$ and then plot the heat map of the covariance matrix in Fig. 6(a) (trained by RdimKD-P) and in Fig. 6(b) (trained by RdimKD-R). In Fig. 6(a), the diagonal elements are much larger than the off-diagonal elements, which leads to the conclusion that the angle between the two sets of principal axes is small. For comparison, the heat map for RdimKD-R is also plotted in Fig. 6(b).
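The reasoning behind the heat maps can be made concrete: when the student’s principal axes coincide with the columns of the projection matrix, the covariance of the projected features is nearly diagonal, whereas a rotation between the two sets of axes produces off-diagonal mass. The sketch below illustrates this on synthetic data; the helper names, the off-diagonal ratio used as a summary statistic, and the toy features are all assumptions rather than part of the paper.

```python
import numpy as np

def projected_covariance(features, K):
    """Covariance matrix of the features after projection onto the columns of K."""
    centered = features - features.mean(axis=0, keepdims=True)
    coords = centered @ K
    return coords.T @ coords / (len(features) - 1)

def off_diagonal_ratio(cov):
    """Sum of |off-diagonal| entries divided by the sum of |diagonal| entries."""
    diag = np.abs(np.diag(cov)).sum()
    return (np.abs(cov).sum() - diag) / diag

rng = np.random.default_rng(0)
c, d, n = 8, 3, 2000
K = np.eye(c)[:, :d]                       # toy teacher axes (assumption)
scales = np.array([4.0, 2.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5])
aligned = rng.standard_normal((n, c)) * scales

# Rotate the first two axes by 45 degrees to mimic misaligned principal axes.
theta = np.pi / 4
rot = np.eye(c)
rot[:2, :2] = [[np.cos(theta), -np.sin(theta)],
               [np.sin(theta), np.cos(theta)]]
rotated = aligned @ rot.T

ratio_aligned = off_diagonal_ratio(projected_covariance(aligned, K))
ratio_rotated = off_diagonal_ratio(projected_covariance(rotated, K))
```

A small ratio corresponds to a heat map dominated by its diagonal, as reported for Fig. 6(a).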

5 Conclusion

In this paper, we proposed RdimKD with three projection methods for knowledge distillation. Compared with other methods, the advantage of RdimKD is threefold: it is simple to implement (especially RdimKD-R) and thus well suited to industrial applications; it achieves performance comparable to or better than state-of-the-art methods; and it generalizes to various learning tasks and neural architectures. We believe our simple findings will bring further insight and inspiration to knowledge distillation. Moreover, this approach has been widely used in our company’s industrial applications. However, the theoretical explanation of why such a simple method works still needs further study.

References

[1] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010.
[2] Dimitris Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 274–281, 2001.
[3] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? Advances in neural information processing systems, 27, 2014.
[4] Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10925–10934, 2022.
[5] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 245–250, 2001.
[6] Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7028–7036, 2021.
[7] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. Advances in neural information processing systems, 30, 2017.
[8] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[9] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
[10] Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5008–5017, 2021.
[11] Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, and Dacheng Tao. Dearkd: Data-efficient early knowledge distillation for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12052–12062, 2022.
[12] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4794–4802, 2019.
[13] Xing Dai, Zeren Jiang, Zhao Wu, Yiping Bao, Zhicheng Wang, Si Liu, and Erjin Zhou. General instance distillation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7842–7851, 2021.
[14] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
[15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[16] Yikang Ding, Qingtian Zhu, Xiangyue Liu, Wentao Yuan, Haotian Zhang, and Chi Zhang. Kd-mvs: Knowledge distillation based self-supervised learning for mvs. arXiv preprint arXiv:2207.10425, 2022.
[17] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
[18] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616. PMLR, 2018.
[19] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
[20] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
[21] Jianyuan Guo, Kai Han, Yunhe Wang, Han Wu, Xinghao Chen, Chunjing Xu, and Chang Xu. Distilling object detectors via decoupled features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2154–2164, 2021.
[22] Yi Guo, Huan Yuan, Jianchao Tan, Zhangyang Wang, Sen Yang, and Ji Liu. Gdp: Stabilized neural network pruning via gates with differentiable polarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5239–5250, 2021.
[23] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In 2011 international conference on computer vision, pages 991–998. IEEE, 2011.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[25] Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, and Youliang Yan. Knowledge adaptation for efficient semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 578–587, 2019.
[26] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6381–6385. IEEE, 2019.
[27] Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1921–1930, 2019.
[28] Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3779–3787, 2019.
[29] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. Computer Science, 14(7):38–39, 2015.
[30] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[31] Tao Huang, Shan You, Fei Wang, Chen Qian, and Chang Xu. Knowledge distillation from a stronger teacher. arXiv preprint arXiv:2205.10536, 2022.
[32] Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017.
[33] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.
[34] William B Johnson. Extensions of lipschitz mappings into a hilbert space. Contemp. Math., 26:189–206, 1984.
[35] Zijian Kang, Peizhen Zhang, Xiangyu Zhang, Jian Sun, and Nanning Zheng. Instance-conditional knowledge distillation for object detection. Advances in Neural Information Processing Systems, 34:16468–16480, 2021.
[36] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. Advances in neural information processing systems, 31, 2018.
[37] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
[38] Chenxin Li, Mingbao Lin, Zhiyuan Ding, Nie Lin, Yihong Zhuang, Yue Huang, Xinghao Ding, and Liujuan Cao. Knowledge condensation distillation. arXiv preprint arXiv:2207.05409, 2022.
[39] Gang Li, Xiang Li, Yujie Wang, Shanshan Zhang, Yichao Wu, and Ding Liang. Knowledge distillation for object detection via rank mimicking and prediction-guided feature imitation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1306–1313, 2022.
[40] Quanquan Li, Shengying Jin, and Junjie Yan. Mimicking very efficient network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6356–6364, 2017.
[41] Sihao Lin, Hongwei Xie, Bing Wang, Kaicheng Yu, Xiaojun Chang, Xiaodan Liang, and Gang Wang. Knowledge distillation via the target-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10915–10924, 2022.
[42] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[43] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[44] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[45] Li Liu, Qingle Huang, Sihao Lin, Hongwei Xie, Bing Wang, Xiaojun Chang, and Xiaodan Liang. Exploring inter-channel correlation for diversity-preserved knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8271–8280, 2021.
[46] Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7096–7104, 2019.
[47] Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2604–2613, 2019.
[48] Jiří Matoušek. On variants of the johnson–lindenstrauss lemma. Random Structures & Algorithms, 33(2):142–156, 2008.
[49] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 5191–5198, 2020.
[50] Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, and Juan Carlos Niebles. Spatio-temporal graph for video captioning with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10870–10879, 2020.
[51] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015.
[52] Jinhyuk Park and Albert No. Prune your model before distill it. arXiv preprint arXiv:2109.14960, 2021.
[53] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.
[54] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
[55] Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5007–5016, 2019.
[56] Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In International Conference on Machine Learning, pages 5142–5151. PMLR, 2019.
[57] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
[58] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[59] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
[60] Changyong Shu, Yifan Liu, Jianfei Gao, Zheng Yan, and Chunhua Shen. Channel-wise knowledge distillation for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5311–5320, 2021.
[61] Jie Song, Ying Chen, Jingwen Ye, and Mingli Song. Spot-adaptive knowledge distillation. IEEE Transactions on Image Processing, 31:3359–3370, 2022.
[62] Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A Alemi, and Andrew G Wilson. Does knowledge distillation really work? Advances in Neural Information Processing Systems, 34:6906–6919, 2021.
[63] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019.
[64] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1365–1374, 2019.
[65] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[66] Tao Wang, Li Yuan, Xiaopeng Zhang, and Jiashi Feng. Distilling object detectors with fine-grained feature imitation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4933–4942, 2019.
[67] Xionghui Wang, Jian-Fang Hu, Jian-Huang Lai, Jianguo Zhang, and Wei-Shi Zheng. Progressive teacher-student learning for early action prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3556–3565, 2019.
[68] Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. Kdgan: Knowledge distillation with generative adversarial networks. Advances in neural information processing systems, 31, 2018.
[69] Yukang Wang, Wei Zhou, Tao Jiang, Xiang Bai, and Yongchao Xu. Intra-class feature variation distillation for semantic segmentation. In European Conference on Computer Vision, pages 346–362. Springer, 2020.
[70] Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. Tinyvit: Fast pretraining distillation for small vision transformers. arXiv preprint arXiv:2207.10666, 2022.
[71] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[72] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
[73] Guodong Xu, Yuenan Hou, Ziwei Liu, and Chen Change Loy. Mind the gap in distilling stylegans. arXiv preprint arXiv:2208.08840, 2022.
[74] Chuanguang Yang, Zhulin An, Helong Zhou, Linhang Cai, Xiang Zhi, Jiwen Wu, Yongjun Xu, and Qian Zhang. Mixskd: Self-knowledge distillation from mixup for image recognition. In European Conference on Computer Vision, 2022.
[75] Chenglin Yang, Lingxi Xie, Chi Su, and Alan L Yuille. Snapshot distillation: Teacher-student optimization in one generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2859–2868, 2019.
[76] Chuanguang Yang, Helong Zhou, Zhulin An, Xue Jiang, Yongjun Xu, and Qian Zhang. Cross-image relational knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12319–12328, 2022.
[77] Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos, et al. Knowledge distillation via softmax regression representation learning. International Conference on Learning Representations (ICLR), 2021.
[78] Zhendong Yang, Zhe Li, Xiaohu Jiang, Yuan Gong, Zehuan Yuan, Danpei Zhao, and Chun Yuan. Focal and global knowledge distillation for detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4643–4652, 2022.
[79] Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, and Chun Yuan. Masked generative distillation. arXiv preprint arXiv:2205.01529, 2022.
[80] Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In Proc. Interspeech, Brno, Czech Republic, 2021. IEEE.
[81] Han-Jia Ye, Su Lu, and De-Chuan Zhan. Generalized knowledge distillation via relationship matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[82] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4133–4141, 2017.
[83] Jianlong Yuan, Qian Qi, Fei Du, Zhibin Wang, Fan Wang, and Yifan Liu. Fakd: Feature augmented knowledge distillation for semantic segmentation. arXiv preprint arXiv:2208.14143, 2022.
[84] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
[85] Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, and Jianwei Niu. Wenet 2.0: More productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455, 2022.
[86] Linfeng Zhang, Xin Chen, Xiaobing Tu, Pengfei Wan, Ning Xu, and Kaisheng Ma. Wavelet knowledge distillation: Towards efficient image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12464–12474, 2022.
[87] Linfeng Zhang and Kaisheng Ma. Improve object detection with feature-based knowledge distillation: Towards accurate and efficient detectors. In International Conference on Learning Representations, 2020.
[88] Peizhen Zhang, Zijian Kang, Tong Yang, Xiangyu Zhang, Nanning Zheng, and Jian Sun. Lgd: Label-guided self-distillation for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3309–3317, 2022.
[89] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4320–4328, 2018.
[90] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11953–11962, 2022.
[91] Zhaohui Zheng, Rongguang Ye, Ping Wang, Dongwei Ren, Wangmeng Zuo, Qibin Hou, and Ming-Ming Cheng. Localization distillation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9407–9416, 2022.
[92] Du Zhixing, Rui Zhang, Ming Chang, Shaoli Liu, Tianshi Chen, Yunji Chen, et al. Distilling object detectors with feature richness. Advances in Neural Information Processing Systems, 34:5213–5224, 2021.
[93] Helong Zhou, Liangchen Song, Jiajie Chen, Ye Zhou, Guoli Wang, Junsong Yuan, and Qian Zhang. Rethinking soft labels for knowledge distillation: A bias-variance tradeoff perspective. arXiv preprint arXiv:2102.00650, 2021.
[94] Zaida Zhou, Chaoran Zhuge, Xinwei Guan, and Wen Liu. Channel distillation: Channel-wise attention for knowledge distillation. arXiv preprint arXiv:2006.01683, 2020.
[95] Yichen Zhu and Yi Wang. Student customized knowledge distillation: Bridging the gap between student and teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5057–5066, 2021.
[96] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.