License: arXiv.org perpetual non-exclusive license
arXiv:2312.08700v1 [cs.LG] 14 Dec 2023

RdimKD: Generic Distillation Paradigm by Dimensionality Reduction

Yi Guo, Yiqian He, Xiaoyang Li, Haotong Qin, Van Tung Pham,
Yang Zhang, Shouda Liu
ByteDance
{guoyi.0,heyiqian.11,lixiaoyang.x,qinhaotong,van.pham,zhangyang.elfin,liushouda}@bytedance.com
Abstract

Knowledge Distillation (KD) has emerged as one of the most promising compression technologies for running advanced deep neural networks on resource-limited devices. To train a small network (student) under the guidance of a large network (teacher), the intuitive method is to regularize the feature maps or logits of the student using the teacher’s information. However, existing methods either over-restrict the student by forcing it to learn all information from the teacher, which leads to bad local minima, or use various fancy and elaborate modules to process and align features, which are complex and lack generality. In this work, we propose an abstract and general paradigm for the KD task, referred to as DIMensionality Reduction KD (RdimKD), which relies solely on dimensionality reduction, with a very minor modification to the naive $\ell^{2}$ loss. RdimKD straightforwardly utilizes a projection matrix to project both the teacher’s and student’s feature maps onto a low-dimensional subspace, where the projected features are then optimized during training. RdimKD achieves the goal in the simplest way: the student not only gets valuable information from the teacher, but also keeps sufficient flexibility to adapt to its low-capacity reality. Our extensive empirical findings indicate the effectiveness of RdimKD across various learning tasks and diverse network architectures.

1 Introduction

With the increasing and extensive application of Deep Neural Networks (DNNs) in industry, model compression technologies [22, 37, 41, 96] have been widely studied to deploy deep models on storage and computation limited hardware. Among these technologies, Knowledge Distillation (KD) [29] attracts attention from academia and industry for its high architecture adaptability and compression performance.

The essence of knowledge distillation lies in how to obtain valuable knowledge from the teacher network. For example, soft labels can better reflect the distribution information between categories than hard one-hot labels [29, 3]. To extend knowledge distillation to more general and complex scenarios, more and more works [58, 79, 84, 41] also explore distillation from intermediate feature maps as regularization to further assist the training of student networks. A common method, e.g. [58, 33, 40, 77], is to use a naive $\ell^{2}$ loss on the original feature maps of the teacher and student. Specifically, let $F_{t},F_{s}$ be the feature maps to be distilled of the teacher and student; then the KD loss can be described as

$$\min\mathcal{L}_{KD}=\min\|F_{t}-r_{\theta}(F_{s})\|^{2} \tag{1}$$

where $r_{\theta}(\cdot)$ is a learnable transformation layer needed when the shapes of the feature maps mismatch, $\theta$ denotes its learnable parameters, and $\|\cdot\|$ is the Frobenius norm of a matrix.
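As a rough illustration of Eq. 1 (not the implementation of any cited method), the sketch below assumes that $r_{\theta}$ is a single linear map, represented by a hypothetical matrix `theta`, and that the feature maps are already flattened into rows. All names and shapes are illustrative.

```python
import numpy as np

def naive_l2_kd_loss(f_t, f_s, theta):
    """Squared Frobenius norm ||F_t - r_theta(F_s)||^2, with r_theta(F) = F @ theta."""
    diff = f_t - f_s @ theta
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
f_s = rng.standard_normal((6, 3))    # 6 points, student width 3 (hypothetical)
theta = rng.standard_normal((3, 5))  # maps student width 3 to teacher width 5
f_t = f_s @ theta                    # a teacher that the transformed student matches exactly

loss_zero = naive_l2_kd_loss(f_t, f_s, theta)       # no mismatch
loss_pos = naive_l2_kd_loss(f_t + 1.0, f_s, theta)  # every entry off by 1
```

In this toy setting the loss can be driven to zero by `theta` alone, which is one way to picture the concern raised later about the student relying on $\theta$ instead of learning from the teacher.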

Intuitively, it is suboptimal to force the student to acquire all the information of the teacher in the manner of Eq. 1, because of the difference in network capacity, the randomness of initialization, and/or the difficulty of optimization. As demonstrated experimentally in Sec. 4, a simple $\ell^{2}$ loss between feature maps (with the same shapes) of teacher and student does not bring enough performance improvement for the student. So, naturally, we may require the student to learn only some useful information from the teacher, while maintaining a certain degree of flexibility to adapt to the reality of its low capacity. Also, these methods come at the cost of additional modules to be trained as well as more hyper-parameters to be tuned.

As a result, instead of applying the $\ell^{2}$ loss on the original feature maps, some works [36, 84, 94, 28] manipulate and align the feature maps in fancy and less explainable ways. For example, [84] calculates the attention of the feature maps by pooling along the channel dimension, [94] pools along the width and height dimensions, and [82] generates an FSP matrix from two layers to represent the knowledge flow. Methods in this category are essentially designed to increase the student’s freedom of learning without over-restricting its flexibility, and to obtain only some valuable information from the teacher network. But these fancy and specific designs are too elaborate to be essential, and we want to reveal the essence of knowledge distillation at a more abstract and higher level.

Figure 1: The overall framework of RdimKD. The student’s and teacher’s feature maps are projected onto a low-dimensional subspace by a matrix, and then the simple $\ell^{2}$ loss is applied. The projection of the teacher only allows some valuable knowledge to be transferred out, while the projection of the student leaves the complementary subspace as freedom of the student.

We propose a simple, generic, and effective knowledge distillation method named RdimKD from a more abstract, higher-level, and more essential perspective: dimensionality reduction. RdimKD relies on dimensionality reduction itself to ensure that the student can focus on valuable information from the teacher and enjoy enough flexibility. The proposed framework, shown in Fig. 1, can be formally described as

$$\min\mathcal{L}_{KD}=\min\|F_{t}K-F_{s}K\|^{2} \tag{2}$$

where $K$ is the projection matrix. We do not use the function $r_{\theta}(\cdot)$ with learnable $\theta$ in RdimKD for two reasons. On the one hand, the function is somewhat tricky to design for specific tasks, and we hope to abstract the essence of knowledge distillation in a more generic way; on the other hand, when optimizing Eq. 1, this function makes the student lazy to some extent, since it can rely on $\theta$ to reconcile the difference with the teacher instead of acquiring knowledge from it.

Note that the proposed method requires $F_{t}$ and $F_{s}$ to have the same dimension. To allow more flexible architectures for teacher and student, there are two ways to bypass this requirement: 1. train a teacher with the same shapes at the distillation positions; 2. set the distillation position at a linear transformation (e.g. a linear or convolution layer) $f:\mathbb{R}^{p}\rightarrow\mathbb{R}^{q}$ of the student. $f$ can be split into two transformations, $f_{1}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{t}$ and $f_{2}:\mathbb{R}^{t}\rightarrow\mathbb{R}^{q}$, where $t$ is the dimension of the teacher. In this way, teacher and student have the same dimension $t$ and can be distilled by RdimKD. During inference, $f_{1}$ and $f_{2}$ can be merged into $f$ without changing the original structure of the student.
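The merging step of the second solution can be sketched as follows, assuming $f_{1}$ and $f_{2}$ are plain affine maps with no non-linearity between them. The sizes and the helper name `merge_linear` are hypothetical and only serve the illustration.

```python
import numpy as np

def merge_linear(w1, b1, w2, b2):
    """Fold y = w2 @ (w1 @ x + b1) + b2 into a single map y = w @ x + b."""
    return w2 @ w1, w2 @ b1 + b2

rng = np.random.default_rng(0)
p, t, q = 8, 16, 4  # hypothetical sizes; t plays the role of the teacher dimension
w1, b1 = rng.standard_normal((t, p)), rng.standard_normal(t)  # f1: R^p -> R^t
w2, b2 = rng.standard_normal((q, t)), rng.standard_normal(q)  # f2: R^t -> R^q
w, b = merge_linear(w1, b1, w2, b2)                           # f:  R^p -> R^q

x = rng.standard_normal(p)
hidden = w1 @ x + b1         # t-dimensional feature available for distillation
two_step = w2 @ hidden + b2  # training-time computation
merged = w @ x + b           # inference-time computation
```

The two outputs agree up to floating-point error, so the student keeps its original $p \rightarrow q$ layer at inference time.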

RdimKD focuses on the concept of dimensionality reduction itself, regardless of the specific reduction methods. To show this, we provide three reduction methods that are very common in statistical machine learning, i.e., Principal Component Analysis (PCA) [1] (RdimKD-P), Autoencoder (RdimKD-A), and Random Orthogonal Matrices [5, 34] (RdimKD-R), explained in detail in Sec. 3.

Our main highlights are summarized as follows:

  • Compared with previous methods that manipulate or align features in elaborate and fancy ways, this work reveals the benefit of dimensionality reduction in distillation from an essential level.

  • RdimKD works well on various deep learning tasks (image classification, object detection, semantic segmentation, language understanding, speech recognition) and neural architectures (CNNs, Transformer, Conformer), which makes it scalable to complex and diverse industrial applications.

  • Experiments show that RdimKD achieves performance comparable to or higher than state-of-the-art results on the above benchmarks.

  • The implementation of RdimKD, especially RdimKD-R, is very simple yet effective, and it has been deployed at one of the most famous short-video companies, which means that it has been evaluated in the practice of super large-scale industry projects.

2 Related works

Knowledge distillation (KD) was first proposed by Hinton et al. [29] in classification, where they utilize the logits from the teacher as the soft labels to transfer the “dark knowledge” to the student. Later, Fitnets [58] started to distill knowledge from the intermediate layers to further boost the performance of students. Since then, the mainstream KD methods can be roughly divided into logits distillation [29, 12, 18, 49, 75, 89, 90, 62, 68, 56, 4, 46, 52] and intermediate layer distillation [58, 40, 84, 53, 79, 27, 32, 55, 82, 63, 47, 81, 74, 61, 6, 64]. RdimKD falls into the latter category, and we will summarize related works in this category.

Distillation from intermediate layers can be considered as a regularization of model training. Besides classification, some works are designed for detection [7, 40, 66, 21, 13, 87, 39], segmentation [25, 47, 76, 69], and other specific domains [50, 67, 73, 16, 70, 11, 86]. In our paper, we want to construct a general distillation method for all these tasks. SAKD [61] proposed a strategy to adaptively determine the distillation layers in the teacher per sample in each training iteration during the distillation period. ReviewKD [10] built connection paths across different levels between teacher and student. MGD [79] utilized a mask and some transformation modules to make the feature maps of the student mimic those of the teacher. CD [60] normalized the feature map of each channel to obtain a distribution, then minimized the Kullback–Leibler (KL) divergence between the distributions of teacher and student. TAT [41] proposed a one-to-all method that allows each pixel of the teacher to teach all spatial locations of the student given the similarity. Most of these methods focus on a specific view of the teacher’s feature map by using some elaborate transformation, but fail to capture the generic information of KD. Rather than simply proposing another variant, we abstract a more generic and higher-level perspective to reveal the nature of the problem.

3 Methods

We introduce the details of RdimKD in this section, including the overall design and the three projection matrices. The overall framework is shown in Fig. 1. Taking a CNN as an example, let $F_{t},F_{s}\in\mathbb{R}^{b\times h\times w\times c}$ be the feature maps to be distilled of the teacher and student, respectively. They can be viewed as two matrices, $F_{t},F_{s}\in\mathbb{R}^{N\times c}$, representing a collection of $N$ points with $c$ dimensions (where $N=bhw$, and the same notation $F_{t},F_{s}$ is used when there is no confusion). These two matrices can be multiplied by a common fixed matrix $K\in\mathbb{R}^{c\times d}$ $(d<c)$ to project these points into a subspace with $d$ dimensions. In the subspace, the simple $\ell^{2}$ loss is used to minimize the difference between the two projected feature maps. The final objective function is as follows:

$$\min_{w}\mathcal{F}(w)=\mathcal{L}(w)+\frac{\alpha}{Nd}\|F_{t}K-F_{s}K\|^{2} \tag{3}$$

where $w$ denotes all learnable parameters of the student, $\mathcal{L}(w)$ is the original loss function of the student, and $\alpha$ is the balance factor. RdimKD focuses on the concept of dimensionality reduction itself, regardless of the specific reduction methods. To show this, we provide three methods to construct the projection matrix $K$, explained in the following subsections. Unlike the learnable modules of some previous works, we freeze $K$ during the whole training process. We also study the performance when it is changed at each iteration in the ablation study.
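A minimal NumPy sketch of the distillation term in Eq. 3, assuming channel-last feature maps of shape $(b, h, w, c)$ as in the text. The function name `rdimkd_term` and the toy projection are illustrative, and the task loss $\mathcal{L}(w)$ is omitted.

```python
import numpy as np

def rdimkd_term(f_t, f_s, k, alpha=1.0):
    """alpha / (N d) * ||F_t K - F_s K||^2 for feature maps of shape (b, h, w, c)."""
    c, d = k.shape
    ft = f_t.reshape(-1, c)  # N x c with N = b * h * w
    fs = f_s.reshape(-1, c)
    n = ft.shape[0]
    diff = (ft - fs) @ k     # equals ft @ k - fs @ k
    return alpha * float(np.sum(diff ** 2)) / (n * d)

# Toy projection that keeps channels 0 and 1 out of c = 4 (so d = 2).
k = np.eye(4)[:, :2]
f_t = np.zeros((2, 3, 3, 4))

f_free = f_t.copy()
f_free[..., 2:] = 5.0  # student differs only outside the projected subspace

f_off = f_t.copy()
f_off[..., 0] = 1.0    # student differs inside the projected subspace

loss_free = rdimkd_term(f_t, f_free, k)
loss_off = rdimkd_term(f_t, f_off, k)
```

The first student is not penalized at all, which illustrates the freedom left to the student in the complementary subspace; the second one is penalized because its deviation lies in the span of $K$.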

3.1 Projection via PCA

Principal component analysis (PCA) [1] is a popular technique for reducing the dimensionality of a dataset such that the variance of the dataset is preserved as much as possible. Specifically, we first center the values of each point in $F_{t}$ by subtracting the mean of each column from each of those values, resulting in the matrix $\hat{F_{t}}$. The eigenvalue decomposition of its covariance matrix, $\frac{1}{N-1}\hat{F_{t}}^{T}\hat{F_{t}}$, is as follows:

$$\frac{1}{N-1}\hat{F_{t}}^{T}\hat{F_{t}}=U\Sigma U^{T} \tag{4}$$

where $\Sigma=diag\{\sigma_{1},\sigma_{2},...,\sigma_{c}\}$ is a diagonal matrix, and each entry represents an eigenvalue. For the sake of description, we assume that they are already in descending order. $U=(u_{1},u_{2},...,u_{c})$ is the matrix whose columns $u_{i},i=1,2,...,c$ are unit vectors orthogonal to each other. Each $u_{i}$ can be interpreted as a principal axis, and $\sigma_{i}$ is the corresponding variance along the $i$-th axis.

Figure: The distribution of eigenvalues of the covariance matrix of the last feature maps in the fourth stage for ResNet-34 on ImageNet [15]. We can see that the distribution is very anisotropic. Similar phenomena also occur in other scenarios.

As an example, the figure in Sec. 3.1 shows the distribution of eigenvalues for ResNet-34 on ImageNet [15]. The severe anisotropy of the distribution suggests that we may only need to project the samples onto the first $d$ principal axes to represent the most important information. Hence, we can let $K=(u_{1},u_{2},...,u_{d})$ in Eq. 3. The distillation method corresponding to the projection matrix obtained in this way is named RdimKD-P. We will show in the experiment section that projecting onto the first $d$ principal axes does give better performance than projecting onto the last $d$ principal axes.
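The construction of $K$ from Eq. 4 can be sketched in NumPy as below. This is an illustration under assumptions: the toy data, the helper name `pca_projection`, and the use of `numpy.linalg.eigh` are choices made here, not details given in the text.

```python
import numpy as np

def pca_projection(f_t, d):
    """Return K (c x d), the top-d principal axes of f_t (N x c), and the sorted eigenvalues."""
    centered = f_t - f_t.mean(axis=0, keepdims=True)  # subtract each column mean
    cov = centered.T @ centered / (f_t.shape[0] - 1)  # c x c covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                 # reorder to descending
    return eigvecs[:, order[:d]], eigvals[order]

rng = np.random.default_rng(0)
# Toy anisotropic "teacher features" with per-channel std 10, 5, 1, 0.1.
f_t = rng.standard_normal((2000, 4)) * np.array([10.0, 5.0, 1.0, 0.1])
k, variances = pca_projection(f_t, d=2)
```

On this toy data the first principal axis is nearly aligned with the high-variance channel 0, and the columns of `k` are orthonormal.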

3.2 Projection via autoencoder

PCA aims at projecting the dataset into a normal subspace while preserving the maximum amount of information. Another way to remove noise while retaining the primary information is to use an autoencoder. The matrix $K$ in Eq. 3 can be viewed as an encoder, and we design $K^{\prime}\in\mathbb{R}^{d\times c}$ as a decoder, to minimize the objective function:

$$\min_{K,K^{\prime}}\mathcal{J}(K,K^{\prime})=\frac{1}{Nc}\|F_{t}-F_{t}KK^{\prime}\|^{2}+\gamma(\|K\|^{2}+\|K^{\prime}\|^{2}) \tag{5}$$

where $\gamma$ is a small positive number to balance the norms of $K$ and $K^{\prime}$. Since the decoder $K^{\prime}$ is used to restore the original information as much as possible, the encoded feature map $F_{t}K$ needs to retain the general information of $F_{t}$. Once this optimization is done, the solution $K$ acts as the projection matrix in Eq. 3. We name this method RdimKD-A.
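A hedged sketch of minimizing Eq. 5 by plain gradient descent is given below. The step size, number of steps, initialization scale, value of $\gamma$, and toy data are all assumptions made for illustration; the gradients are derived by hand from Eq. 5.

```python
import numpy as np

def autoencoder_loss(f_t, k, k_dec, gamma):
    """Objective J(K, K') of Eq. 5 for f_t of shape (N, c)."""
    n, c = f_t.shape
    recon_err = f_t - f_t @ k @ k_dec
    return float(np.sum(recon_err ** 2) / (n * c)
                 + gamma * (np.sum(k ** 2) + np.sum(k_dec ** 2)))

def fit_linear_autoencoder(f_t, d, gamma=1e-4, lr=0.05, steps=1000, seed=0):
    """Gradient descent on Eq. 5; returns encoder K (c x d) and decoder K' (d x c)."""
    n, c = f_t.shape
    rng = np.random.default_rng(seed)
    k = 0.1 * rng.standard_normal((c, d))      # encoder K, small random init (assumed)
    k_dec = 0.1 * rng.standard_normal((d, c))  # decoder K'
    for _ in range(steps):
        resid = f_t @ k @ k_dec - f_t          # N x c reconstruction residual
        grad_k = 2.0 / (n * c) * f_t.T @ resid @ k_dec.T + 2.0 * gamma * k
        grad_k_dec = 2.0 / (n * c) * (f_t @ k).T @ resid + 2.0 * gamma * k_dec
        k -= lr * grad_k
        k_dec -= lr * grad_k_dec
    return k, k_dec

rng = np.random.default_rng(1)
basis, _ = np.linalg.qr(rng.standard_normal((6, 2)))  # 6 x 2, orthonormal columns
latent = rng.standard_normal((500, 2)) * np.array([2.0, 1.5])
f_t = latent @ basis.T + 0.05 * rng.standard_normal((500, 6))  # nearly rank-2 toy features

gamma = 1e-4
loss_start = autoencoder_loss(f_t, np.zeros((6, 2)), np.zeros((2, 6)), gamma)
k, k_dec = fit_linear_autoencoder(f_t, d=2, gamma=gamma)
loss_end = autoencoder_loss(f_t, k, k_dec, gamma)
```

Only the encoder `k` would be kept as the projection matrix; the decoder exists solely to define the reconstruction objective.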

3.3 Projection via random orthogonal matrix

Random projection [2] is a technique used to reduce the dimensionality of datasets in Euclidean space. The core idea behind it is given in the Johnson-Lindenstrauss (JL) lemma [34, 48]:

Lemma 1

For any $0<\epsilon<1$ and integer $N$, let $d$ be an integer with $d>4(\epsilon^{2}/2-\epsilon^{3}/3)^{-1}\log N$. Then, for any set $V$ of $N$ points in $\mathbb{R}^{c}$, there is a linear map $f:\mathbb{R}^{c}\rightarrow\mathbb{R}^{d}$ such that for all $u,v\in V$, the following inequality holds:

$$(1-\epsilon)\|u-v\|^{2}\leq\|f(u)-f(v)\|^{2}\leq(1+\epsilon)\|u-v\|^{2} \tag{6}$$

One proof takes $f$ to be a suitable multiple of the orthogonal projector onto a random subspace; a simple proof can be found in [14]. This lemma states that datasets in a high-dimensional space can be linearly projected onto a low-dimensional space with approximate preservation of distances between the samples. Random projection is simple and computationally efficient compared with other dimensionality reduction methods. We find that this idea can also be borrowed in the field of knowledge distillation, although the dimensions before and after projection do not strictly satisfy the requirements of the JL lemma. Following this idea, we generate the random matrix $K$ in Eq. 3 in two steps: 1. generate a random matrix in $\mathbb{R}^{c\times d}$ with elements drawn from a Gaussian distribution; 2. orthonormalize all the columns of the matrix by the Gram–Schmidt process. These two steps are very simple to implement in PyTorch [54] with just one line of code: torch.nn.init.orthogonal_.
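The two steps above can be sketched in NumPy as follows. This is an illustration, not the PyTorch one-liner: a reduced QR decomposition stands in for Gram–Schmidt (both orthonormalize the columns), the $\log$ in Lemma 1 is assumed to be the natural logarithm, and the sizes are hypothetical.

```python
import math
import numpy as np

def random_orthonormal_columns(c, d, seed=0):
    """Gaussian c x d matrix with its columns orthonormalized via reduced QR."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((c, d))  # step 1: Gaussian entries
    q, _ = np.linalg.qr(gaussian)           # step 2: q is c x d with q.T @ q = I
    return q

def jl_min_dim(n_points, eps):
    """Smallest integer d with d > 4 * ln(N) / (eps^2 / 2 - eps^3 / 3)."""
    bound = 4.0 * math.log(n_points) / (eps ** 2 / 2.0 - eps ** 3 / 3.0)
    return math.floor(bound) + 1

k = random_orthonormal_columns(c=512, d=128)      # hypothetical c and d (c / d = 4)
d_needed = jl_min_dim(n_points=10_000, eps=0.5)   # JL requirement for 10,000 points
```

For these illustrative numbers the lemma asks for $d > 442$, well above $d = 128$, which is in line with the remark that the dimensions used for distillation do not strictly satisfy the JL requirements.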

A matrix obtained in this way has spherical symmetry, which we conjecture may be a good property. It is possible that in some extreme cases, the projection matrix will project the samples onto a subspace that approximates the span of the last $d$ principal axes of the PCA, which would result in the bad performance shown in Tab. 6 (pac_last). Nevertheless, we did not observe this extreme phenomenon in our experiments. We name this method RdimKD-R.

4 Experiments

To show the generality and effectiveness of RdimKD, we conduct experiments on various deep learning tasks (image classification, object detection, semantic segmentation, language understanding, and speech recognition) and neural architectures (CNN [24], Transformer [65], and Conformer [20]), and compare them to works from recent years. We set $r=c/d$, which represents the reduction rate of the subspace dimension. For simplicity, we use the same $r$ for all feature maps to be distilled in a given experiment. For RdimKD-A, we choose to train Eq. 5 by gradient descent before distillation training, although it may have a closed-form solution. For RdimKD-P, we randomly select hundreds of training samples to conduct PCA. In the following, for brevity, “A to B” means a distillation experiment with A as the teacher and B as the student. Due to the page limit, we only explain the primary settings for some experiments; other details and the language understanding experiments are provided in the supplementary materials.

4.1 Image classification

The classification experiments are done on the ImageNet ILSVRC-12 dataset [15], which contains 1000 object categories with 1.2 million images for training and 50k for testing. We conduct experiments on “ResNet-34 to ResNet-18” and “ResNet-50 to MobileNet [30]”. Top-1 accuracy is reported. RdimKD can be combined with soft-label-based KD [29], that is, by adding the additional loss:

$$\mathcal{L}_{KL}=-\beta\sum_{i}q_{i}\log p_{i} \tag{7}$$

where $q_{i}$ is the probability distribution of the teacher’s output, $p_{i}$ is that of the student, and $\beta$ is the balance coefficient.
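Eq. 7 can be sketched in NumPy as below. The text does not state whether a softmax temperature is used or how the loss is reduced over a batch, so this sketch assumes no temperature and a mean over the batch; the function names are illustrative.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def soft_label_loss(teacher_logits, student_logits, beta=1.0):
    """-beta * sum_i q_i log p_i, averaged over the batch (averaging is an assumption)."""
    q = np.exp(log_softmax(teacher_logits))  # teacher probabilities q_i
    log_p = log_softmax(student_logits)      # student log-probabilities log p_i
    return float(-beta * np.mean(np.sum(q * log_p, axis=-1)))

uniform = np.zeros((1, 4))
loss_uniform = soft_label_loss(uniform, uniform, beta=2.0)  # equals 2 * ln(4)

teacher = np.array([[5.0, 0.0, 0.0, 0.0]])
loss_match = soft_label_loss(teacher, teacher)
loss_mismatch = soft_label_loss(teacher, np.array([[0.0, 5.0, 0.0, 0.0]]))
```

The loss is smallest when the student distribution equals the teacher distribution and grows as the student puts its mass on a different class.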

ResNet-34 and ResNet-18 both contain four stages and differ only in the number of blocks per stage. We distill the last feature maps of the third and fourth stages, where the number of channels is 128 and 512, respectively. For MobileNet, we distill the last feature maps of the third and fourth stages of ResNet-50 to the outputs of the 11th and 14th convolutions of MobileNet. To achieve this, we manually changed the number of channels in these two layers of ResNet-50 to 512 and 1024, respectively, and added the necessary 1x1 convolution at the skip layer. Data preprocessing and augmentation are the same as in the official PyTorch example (footnote 1: https://github.com/pytorch/examples/blob/master/imagenet/main.py). We use a cosine learning rate scheduler with an initial value of 0.1 and train for 105 epochs. In “ResNet-34 to ResNet-18”, $r=4$ and $\alpha=1$ for all RdimKD variants, and $\beta=2$ wherever $\mathcal{L}_{KL}$ is added. Results are shown in Tab. 1. We can see that RdimKD boosts performance by a clear margin, and the simplest RdimKD-R produces about the same performance as RdimKD-A/P.
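For reference, a cosine schedule with the stated initial value and epoch count can be sketched as below. The text does not specify a final learning rate, a warm-up phase, or whether the rate is updated per epoch or per iteration; this sketch assumes a final value of 0, no warm-up, and per-epoch updates.

```python
import math

def cosine_lr(epoch, total_epochs=105, base_lr=0.1, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at total_epochs."""
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_factor

schedule = [cosine_lr(e) for e in range(106)]  # learning rate at epochs 0..105
```

The schedule starts at 0.1, passes through half of that value at the midpoint, and decays monotonically to the assumed final value.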

| method | MbNet | Res18 | method | MbNet | Res18 |
|---|---|---|---|---|---|
| teacher | 76.77 | 74.55 | DIST [NIPS 2022] [31] | 73.24 | 72.07 |
| student | 70.93 | 70.96 | WSLD [ICLR 2021] [93] | 71.52 | 72.04 |
| RdimKD-R | 72.56 | 71.89 | SRRL [ICLR 2021] [77] | 72.49 | 71.73 |
| RdimKD-A | 72.65 | 71.94 | KR [CVPR 2021] [10] | 72.56 | 71.61 |
| RdimKD-P | 72.77 | 72.01 | DKD [CVPR 2022] [90] | 72.05 | 71.7 |
| RdimKD-R* | 73.13 | 72.53 | MGD [ECCV 2022] [79] | 72.59 | 71.8 |
| RdimKD-A* | 73.15 | 72.58 | TAT [CVPR 2022] [41] | None | 72.41 |
| RdimKD-P* | 73.23 | 72.49 | KCD [ECCV 2022] [38] | 71.25 | 72.13 |
Table 1: Top-1 results on ImageNet. For the MbNet column, ResNet-50 is the teacher and MobileNet is the student; for the Res18 column, ResNet-34 is the teacher and ResNet-18 is the student. * means combined with $\mathcal{L}_{KL}$, and None means not reported in the original paper. Note that TAT, KCD, DIST, and WSLD also contain $\mathcal{L}_{KL}$. We can see that RdimKD boosts the performance of the student by a clear margin, and that the simplest RdimKD-R achieves performance comparable to RdimKD-A/P. All of our results are averaged over 3 trials.

4.2 Object detection

The detection experiments are conducted on the COCO2017 dataset [44], which contains 80 object categories with 115k training images and 5k validation images. All the teachers are trained for 36 epochs and the students for 24 epochs. Other training details are the same as the standard protocols in the widely used Detectron2 library [71]. The inheriting strategy [35] is used for distillation. The networks are RetinaNet [43] and Faster-RCNN [57] with different backbones. Mean average precision (AP) is reported.

RetinaNet: RetinaNet uses a Feature Pyramid Network (FPN) [42] to generate a multi-scale feature pyramid with levels $P_{3}$ to $P_{7}$, all of which contain 256 channels. The only difference between the various benchmark models is the backbone. So, naturally, knowledge can be distilled from the five levels of the pyramid. We use two teacher networks, RetinaNet-ResNet-101 and RetinaNet-ResNeXt-101 [72], separately, to teach the student network, RetinaNet-ResNet-50. Results are shown in Tab. 2. In these experiments, $r=4$. For RdimKD-R and RdimKD-A, $\alpha=1$, while for RdimKD-P, $\alpha=0.5$.

Faster-RCNN: Besides the one-stage detector RetinaNet, we also evaluate RdimKD on a two-stage detector, Faster-RCNN. Similar to previous works [90, 78], we also use FPN to capture multi-scale features. As in DKD [90], $P_{2}$ to $P_{5}$ are the input features for the ROI heads, and the number of channels for each level is 256. Similar to RetinaNet, the only difference between the various benchmarks is the backbone, while the structure of the ROI head is the same. So, naturally, we distill at the FPN layers. In these experiments, we conduct “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18”, “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-50” and “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-MobileNet_V2 [59]”. We set $r=4$ for ResNet-50 and MobileNet_V2, and $r=2$ for ResNet-18. Details are in the Supplementary Materials. Results are shown in Tab. 3.

| method | AP | $\text{AP}_{\text{S}}$ | $\text{AP}_{\text{M}}$ | $\text{AP}_{\text{L}}$ |
|---|---|---|---|---|
| S:RetinaNet-R50 | 38.28 | 22.36 | 42.34 | 49.42 |
| T:RetinaNet-R101 | 40.56 | 24.46 | 44.57 | 52.73 |
| FGD [CVPR 2022] [78] | 39.7 | 22.0 | 43.7 | 53.6 |
| KR [CVPR 2021] [10] | 38.48 | 22.67 | 42.72 | 58.22 |
| LD [CVPR 2022] [91] | 39.0 | 23.1 | 43.2 | 51.1 |
| FRS [NIPS 2021] [92] | 39.7 | 21.8 | 43.5 | 52.4 |
| LGD [AAAI 2022] [88] | 40.35 | 24.08 | 44.15 | 52.53 |
| GID [CVPR 2021] [13] | 39.1 | 22.8 | 43.1 | 52.3 |
| KDRP [AAAI 2022] [39] | 39.6 | 21.4 | 44.0 | 52.5 |
| RdimKD-R | 40.67 | 24.57 | 44.62 | 52.92 |
| RdimKD-A | 40.67 | 24.45 | 44.70 | 53.16 |
| RdimKD-P | 40.68 | 24.17 | 44.90 | 52.72 |
| T:RetinaNet-X101 | 41.10 | 23.95 | 44.78 | 53.27 |
| FGD [CVPR 2022] [78] | 40.7 | 22.9 | 45.0 | 54.7 |
| MGD [ECCV 2022] [79] | 41.0 | 23.4 | 45.3 | 55.7 |
| FRS [NIPS 2021] [92] | 40.1 | 21.9 | 43.7 | 54.3 |
| DIST [NIPS 2022] [31] | 40.1 | 23.2 | 44.0 | 53.6 |
| CD [ICCV 2021] [60] | 40.8 | 22.7 | 44.5 | 55.3 |
| FB [ICLR 2021] [87] | 39.6 | 22.7 | 43.3 | 52.5 |
| LGD [AAAI 2022] [88] | 40.35 | 24.08 | 44.15 | 52.53 |
| RdimKD-R | 40.97 | 24.20 | 45.18 | 53.99 |
| RdimKD-A | 40.95 | 23.72 | 45.11 | 53.86 |
| RdimKD-P | 41.05 | 24.47 | 45.39 | 54.02 |
Table 2: Performance of “RetinaNet-R101 to RetinaNet-R50” (the top half of the table) and “RetinaNet-X101 to RetinaNet-R50” (the bottom half of the table) on the COCO2017 validation set. ‘R101’, ‘R50’ and ‘X101’ denote ResNet-101, ResNet-50 and ResNeXt-101, respectively. ‘T’ and ‘S’ denote teacher and student, respectively. LGD [88] is a self-distillation method and does not use a teacher. In AP, RdimKD is on par with or better than the other methods. The simplest variant, RdimKD-R, performs about the same as RdimKD-A/P. All of our results are averaged over 3 trials.
method AP AP_S AP_M AP_L
T: ResNet-101 42.17 25.50 45.55 54.93
S: ResNet-18 34.96 19.78 37.39 45.44
DKD [CVPR 2022] [90] 37.01 None None None
SCKD [ICCV 2021] [95] 37.5 20.9 42.6 50.8
KR [CVPR 2021] [10] 36.75 19.42 39.51 49.58
RdimKD-R 38.25 21.12 41.25 51.34
RdimKD-A 38.29 21.04 41.03 51.45
RdimKD-P 38.31 20.78 41.12 51.82
T: ResNet-101 42.17 25.50 45.55 54.93
S: ResNet-50 39.66 24.03 42.76 51.74
DKD [CVPR 2022] [90] 40.65 None None None
FGD [CVPR 2022] [78] 40.5 22.6 44.7 53.2
ICD [NIPS 2021] [35] 40.9 24.5 44.2 53.5
KR [CVPR 2021] [10] 40.36 23.60 43.81 52.87
FRS [NIPS 2021] [92] 39.5 22.3 43.6 51.7
LGD [AAAI 2022] [88] 40.47 23.96 43.94 52.19
RdimKD-R 41.76 25.22 45.28 54.42
RdimKD-A 41.72 25.12 45.11 54.50
RdimKD-P 41.74 24.97 45.18 54.60
T: ResNet-50 41.84 25.15 45.27 54.48
S: MobileNet_V2 34.51 19.98 36.38 44.94
DKD [CVPR 2022] [90] 34.35 None None None
KR [CVPR 2021] [10] 33.71 16.77 35.81 46.47
RdimKD-R 36.26 20.24 38.29 49.21
RdimKD-A 36.06 19.82 37.95 48.86
RdimKD-P 36.00 19.50 38.00 49.20
Table 3: Results on COCO2017 for Faster-RCNN-FPN with different backbones. ‘T’ and ‘S’ denote teacher and student, respectively. The top part is for “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18”, the middle part is for “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-50”, and the bottom part is for “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-MobileNet_V2”. ‘None’ means the value is not reported in the original paper. For MobileNet_V2, the baselines used by DKD [90] and KR [10] are relatively weak, so our baseline performance, 34.51, is higher than that of DKD and KR after their distillation. As the teacher for MobileNet_V2, we choose the best ResNet-50 model trained under the guidance of ResNet-101 with RdimKD-R. All results are averaged over 3 trials.
method ResNet18 MobileNet_V2
teacher 79.26 79.26
student 73.59 73.70
TAT [CVPR 2022] [41] 75.76 73.85
CIRKD [CVPR 2022] [76] 74.50 None
FAKD [83] None 67.62
IFVD [ECCV 2020] [69] 74.05 None
ICKD [ICCV 2021] [45] 75.01 72.79
RdimKD-R 75.63 75.28
RdimKD-A 75.55 75.67
RdimKD-P 75.94 75.19
Table 4: Results on semantic segmentation on Pascal VOC. We use DeepLabv3+-ResNet-101 as the teacher, and DeepLabv3+-ResNet-18 and DeepLabv3+-MobileNet_V2 as students. ‘None’ means the value is not reported in the original paper. For MobileNet_V2, the baseline for FAKD is relatively weak. All of our results are averaged over 6 trials.

4.3 Semantic segmentation

The segmentation experiments are done on Pascal VOC [17], which contains 20 foreground classes and 1 background class. With the additional coarsely annotated training images from [23], there are a total of 10582 images for training. The validation set contains 1449 images, on which we report the mean Intersection over Union (mIoU) to show segmentation performance.
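For reference, mIoU can be computed from a confusion matrix as in the sketch below. This is the standard definition rather than the evaluation code used in the experiments; it assumes integer class labels and does not handle an ignore label, and the function name is hypothetical.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0  # skip classes absent from both pred and target
    return float(np.mean(inter[valid] / union[valid]))

# Toy labels with 3 classes; Pascal VOC would use num_classes=21.
pred = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
score = mean_iou(pred, target, num_classes=3)
```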

DeepLabv3+: DeepLabv3+ [9] is a popular network for segmentation tasks. Besides the Atrous Spatial Pyramid Pooling (ASPP) module and encoder-decoder structure, it extends DeepLabv3 [8] by adding a decoder module to capture rich semantic information and refine the object boundaries. Knowledge is distilled at the low-level feature coming from the backbone (in front of the resize and ReLU layer) and at the output of the ASPP module (in front of the final dropout and ReLU layer), where the number of channels is 48 and 256, respectively. We use the settings from the public code (https://github.com/VainF/DeepLabV3Plus-Pytorch) unless otherwise stated. For all of our experiments, the output stride (OS) is 16 for training and validation. We conduct “DeepLabv3+-ResNet-101 to DeepLabv3+-ResNet-18” and “DeepLabv3+-ResNet-101 to DeepLabv3+-MobileNet_V2”. The values for $\alpha$ and $r$ are in the supplementary materials. We find that the variance of the results across trials is large, so each of our results is the average of six trials. The results are shown in Tab. 4.

4.4 Speech recognition

To show the generalization and effectiveness of RdimKD, we apply it to a more challenging task, speech recognition. The input to this task is a speech waveform, and the output is the corresponding text. RNN-Transducer (RNN-T) [19] is a well-known end-to-end architecture for streaming speech recognition [26], which contains an encoder, a predictor, and a joiner. For the encoder, we use 12 layers of Conformer [20] for the teacher and 6 layers for the student. For the predictor, three layers of bidirectional transformers [65] (Bitransformers) are used for both teacher and student. We use the Librispeech dataset [51], which contains about 1000 hours of speech sampled at 16 kHz. We use the 960-hour corpus for training and the development and test sets for evaluation. Beam search is used as the decoding mode, and the Word Error Rate (WER) is used as the metric. The implementation details are the same as the corresponding part of the Wenet library [80, 85]. We train for 50 epochs in each experiment. The attention dimension of the encoder is 256, and we distill knowledge from the 8th and 12th layers of the teacher to the 4th and 6th layers of the student. $\{\alpha=2, r=4\}$ is chosen for all results in this experiment. The inheriting strategy [35] is used. The results are shown in Tab. 5. It can be seen that RdimKD generalizes well and remains effective beyond computer vision.
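For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below implements this textbook definition in plain Python; it is not the Wenet scoring tool, applies no text normalization, and assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h_word in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,                       # deletion
                cur[j - 1] + 1,                    # insertion
                prev[j - 1] + (r_word != h_word),  # substitution or match
            )
        prev = cur
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```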

clean(%) other(%)
dev test mean dev test mean
Teacher 3.30 3.48 3.39 8.86 8.90 8.88
Student 3.71 3.99 3.85 10.22 10.21 10.22
RdimKD-R 3.51 3.80 3.66 9.40 9.50 9.45
RdimKD-A 3.56 3.78 3.67 9.44 9.44 9.44
RdimKD-P 3.56 3.77 3.67 9.48 9.43 9.46
no_proj 3.67 3.98 3.83 10.10 10.02 10.06
Table 5: Results on speech recognition. We use 12 layers of encoder of RNN-T as teacher and 6 as student, and use Librispeech as experiment dataset. Word Error Rate (WER) is used as metric (smaller is better). no_proj is explained in Sec. 4.5. The mean column is the arithmetic mean of dev and test columns. It can be seen that our RdimKD has strong generalization and effectiveness besides computer vision. Only one trial is run for each result in this table due to the high cost of the experiment.

4.5 Ablation studies

The key to RdimKD is the projection. In this subsection, we focus on the projection methods (including no projection), the subspace reduction rate $r$, and the coefficient of the distillation loss $\alpha$. For generality, we explore these ablation studies in different experiments.

Projection method: As shown in Tab. 6, we conduct this ablation study via “ResNet-50 to MobileNet” on ImageNet, “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18” on COCO2017, and “DeepLabv3+-ResNet-101 to DeepLabv3+-MobileNet_V2” on Pascal VOC. We have the following findings: 1. In RdimKD-P, feature maps are projected onto PCA’s first $d$ principal axes. A natural question is what happens if feature maps are instead projected onto the last $d$ principal axes (named pca_last in Tab. 6). Comparing RdimKD-P with pca_last, we find that the first $d$ principal components do contain more valuable information. Theoretically, in an extreme case, the projection matrix of RdimKD-R approximates that of pca_last. Nevertheless, RdimKD-R consistently performed well in our experiments. 2. If we remove the projection operation and apply the $\ell^2$ loss directly to the original feature maps (named no_proj in Tab. 6), performance still improves over the baseline, but the improvement is consistently smaller than with projection (also shown in Tab. 5). We suspect that this is caused by the low capacity of the student, which makes it impossible and unnecessary to learn every detail of the teacher network accurately. 3. When the random projection matrix is not necessarily orthogonal (each element is drawn from the Gaussian distribution $\mathcal{N}(0,\frac{1}{c})$, named no_orth), the performance is slightly worse than RdimKD-R. 4. If the random projection matrix is regenerated in every iteration rather than kept fixed from the beginning (named randE), the performance is also slightly worse than RdimKD-R.
It should be noted that the above comparison is not necessarily fair, because each method may have its own optimal $\alpha$ value, and it is difficult to find the optimal $\alpha$ for each method due to the experimental cost. Nonetheless, RdimKD performs better than no_proj and RdimKD-P performs better than pca_last for a certain range of $\alpha$.
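The projection variants compared above can be sketched as different ways of building the matrix $K$. The code below is an illustrative NumPy sketch under stated assumptions: $\mathcal{N}(0,\frac{1}{c})$ is read as variance $\frac{1}{c}$, PCA is computed on flattened teacher features of shape (samples, channels), and the names `build_projection`, `pca_axes` and `retained_variance` are hypothetical. The randE variant would correspond to calling the random construction with a fresh seed in every iteration.

```python
import numpy as np

def pca_axes(features):
    """Principal axes of (n, c) features, sorted by decreasing variance."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(features) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def build_projection(kind, c, d, teacher_features=None, seed=0):
    """Return a (c, d) projection matrix, or the (c, c) identity for no_proj."""
    rng = np.random.default_rng(seed)
    if kind == "no_proj":
        return np.eye(c)
    if kind == "random_orth":  # RdimKD-R style: orthonormal columns
        q, _ = np.linalg.qr(rng.standard_normal((c, d)))
        return q
    if kind == "no_orth":      # i.i.d. Gaussian entries with variance 1/c
        return rng.normal(0.0, np.sqrt(1.0 / c), size=(c, d))
    if kind in ("pca_first", "pca_last"):
        _, axes = pca_axes(teacher_features)
        return axes[:, :d] if kind == "pca_first" else axes[:, -d:]
    raise ValueError(f"unknown projection kind: {kind}")

def retained_variance(features, k):
    """Fraction of the total variance kept after projecting features with k."""
    total = features.var(axis=0, ddof=1).sum()
    return float((features @ k).var(axis=0, ddof=1).sum() / total)

# Toy teacher features whose per-channel variance decays across channels.
rng = np.random.default_rng(0)
c, d = 16, 4
teacher_features = rng.standard_normal((2000, c)) * np.linspace(4.0, 0.1, c)
kept = {
    kind: retained_variance(
        teacher_features, build_projection(kind, c, d, teacher_features)
    )
    for kind in ("pca_first", "random_orth", "pca_last")
}
```

On such toy data the first $d$ principal axes retain the most variance, the last $d$ the least, and a random orthogonal projection falls in between, which is one way to read the gap between RdimKD-P and pca_last.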

Reduction rate $r$: The reduction rate is also an important variable; it determines the dimension of the subspace onto which the data of the original space is projected. It is easy to prove that $r=1$ is mathematically equivalent to no_proj, and the two are close in experimental performance. As shown in Fig. 3, projecting the feature maps onto a subspace of appropriate dimension does boost performance further.
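The equivalence for $r=1$ follows from the fact that a square orthogonal matrix preserves the Frobenius norm: $\|XK\|_F^2 = \mathrm{tr}(K^\top X^\top X K) = \mathrm{tr}(X^\top X K K^\top) = \|X\|_F^2$ when $KK^\top = I$. The short NumPy check below illustrates this numerically, assuming the loss is the squared $\ell^2$ norm of the projected teacher-student difference; the random matrix `diff` merely stands in for that difference.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 32
# With r = 1 the projection keeps all c dimensions, so K is square and orthogonal.
k, _ = np.linalg.qr(rng.standard_normal((c, c)))
diff = rng.standard_normal((100, c))  # stand-in for teacher minus student features

loss_no_proj = np.sum(diff ** 2)      # l2 loss without projection
loss_r1 = np.sum((diff @ k) ** 2)     # l2 loss after the r = 1 projection
```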

Figure 3: Ablation study for reduction rate $r$. (a) ImageNet: RdimKD-P for “ResNet-34 to ResNet-18” on ImageNet classification. (b) COCO: RdimKD-R for “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18” on COCO object detection. The X-axis is the reduction rate $r$, while the Y-axis is the performance (top-1 for ImageNet, AP for COCO). We can see that projecting the feature maps onto a subspace of appropriate dimension does boost the performance. All results are averaged over 3 trials.
method top-1 top-5 method top-1 top-5
baseline 70.93 89.59 pca_last 71.40 89.96
RdimKD-R 72.56 90.94 no_proj 72.24 90.79
RdimKD-A 72.65 91.00 no_orth 72.40 90.87
RdimKD-P 72.77 91.02 randE 71.63 90.47
method AP mIOU method AP mIOU
baseline 34.96 73.70 pca_last 35.89 74.40
RdimKD-R 38.31 75.28 no_proj 37.96 74.93
RdimKD-A 38.29 75.67 no_orth 37.75 75.04
RdimKD-P 38.31 75.19 randE 37.77 74.91
Table 6: Ablation study for different projection methods. The top part is “ResNet-50 to MobileNet” on ImageNet, while the bottom is “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18” on COCO2017 (the AP column) and “DeepLabv3+-ResNet-101 to DeepLabv3+-MobileNet_V2” on Pascal VOC (the mIOU column). pca_last means feature maps are projected onto the last $d$ principal axes of PCA. no_proj means the $K$ in Eq. 3 is an identity matrix. no_orth means that the elements of the projection matrix $K$ in Eq. 3 are randomly drawn from the Gaussian distribution $\mathcal{N}(0,\frac{1}{c})$, and the matrix itself is not necessarily orthogonal. randE means that the matrix $K$ is a random orthogonal matrix generated in Each iteration, not fixed. Results for ImageNet and COCO2017 are averaged over 3 trials, and those for VOC are averaged over 6 trials.

Coefficient $\alpha$: The value of $\alpha$ balances the weights between the original loss and the distillation loss. In Fig. 4, we run “DeepLabv3+-ResNet-101 to DeepLabv3+-ResNet-18” on the Pascal VOC segmentation task and “ResNet-50 to MobileNet” on the ImageNet classification task. It can be seen that as $\alpha$ increases, the performance of the student first rises and then decreases, which is in line with our expectations and verifies the effectiveness of our method.

Distillation position: For non-sequential structures such as RetinaNet, we distill the feature maps at the output of the FPN; for sequential structures that stack blocks, the layer indices to be distilled in the teacher and the student are proportional. For example, if the teacher is 2 times deeper than the student, then the $i$-th layer of the student is taught by the corresponding $2i$-th layer of the teacher. However, we find that, in general, distilling only some upper layers produces better results, as shown in Tab. 7 as an example.

mask clean(%) other(%)
dev test mean dev test mean
111111 3.58 3.79 3.69 9.65 9.78 9.72
001111 3.51 3.80 3.66 9.40 9.50 9.45
000011 3.54 3.78 3.66 9.47 9.53 9.50
Table 7: An example of distillation position for a sequential network. When a 12-layer RNN-T teaches a 6-layer one, the $i$-th layer of the student is taught by the corresponding $2i$-th layer of the teacher, but not distilling the lower layers works better. Mask 001111 means not distilling the first and second layers of the student, and 111111 means distilling all 6 layers. The conclusion is consistent with [79].
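The proportional layer mapping and the mask notation of Tab. 7 can be sketched as follows. This is an illustrative helper with a hypothetical name, assuming 1-indexed layers and a mask string that lists student layers from lowest to highest.

```python
def distill_pairs(num_teacher_layers, num_student_layers, mask):
    """Return 1-indexed (student_layer, teacher_layer) pairs selected by mask.

    mask: string of '0'/'1', one character per student layer, lowest layer first.
    """
    if len(mask) != num_student_layers:
        raise ValueError("mask needs one character per student layer")
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    ratio = num_teacher_layers // num_student_layers
    return [
        (i, ratio * i)
        for i, flag in enumerate(mask, start=1)
        if flag == "1"
    ]

pairs_all = distill_pairs(12, 6, "111111")    # distill every student layer
pairs_upper = distill_pairs(12, 6, "001111")  # skip the two lowest layers
```

Under the same convention, a mask such as 000101 would select only the pairs (4, 8) and (6, 12), i.e. the 8th and 12th teacher layers mentioned in Sec. 4.4.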

4.6 Discussion

RdimKD-R and RdimKD-P project the feature maps onto a subspace (denoted as $\mathcal{S}$), and an interesting question is what the student learns in the orthogonal complement (denoted as $\mathcal{S}^{\perp}$). Note that $\mathcal{S}$ and $\mathcal{S}^{\perp}$ are determined by the teacher.

For illustration, we name the subspace spanned by the first $d$ principal axes the principal subspace. By this definition, for RdimKD-P, $\mathcal{S}$ is also the principal subspace of the teacher. A natural question is whether it is also the student’s principal subspace after the student is trained by RdimKD-P. To explore this, we project the feature map of the well-trained student into $\mathcal{S}$ and $\mathcal{S}^{\perp}$, respectively, and then do PCA in these two subspaces. The distribution of the resulting eigenvalues is plotted in Fig. 5(a). We find that the eigenvalues in $\mathcal{S}$ are almost all larger than those in $\mathcal{S}^{\perp}$, which shows that $\mathcal{S}$ is almost the principal subspace of the student, too. Interestingly, for RdimKD-R in Fig. 5(b), $\mathcal{S}$ is randomly chosen instead of by PCA, but the eigenvalues in $\mathcal{S}$ are also generally larger than those in $\mathcal{S}^{\perp}$. Comparing the ordinates of Fig. 5(a) and Fig. 5(b), the variance of the feature maps trained by RdimKD-R is smaller than that trained by RdimKD-P.
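The analysis described above can be sketched in NumPy as follows. This is a toy reconstruction, not the authors' analysis code: it assumes student features flattened to shape (samples, channels), obtains a basis of $\mathcal{S}^{\perp}$ from a complete QR decomposition of the projection matrix, and uses synthetic features with more energy inside $\mathcal{S}$ purely to exercise the functions. All names are hypothetical.

```python
import numpy as np

def complement_basis(k):
    """Orthonormal basis of the orthogonal complement of span(k), k of shape (c, d)."""
    q, _ = np.linalg.qr(k, mode="complete")  # q is c x c; its first d columns span k
    return q[:, k.shape[1]:]

def subspace_eigenvalues(features, basis):
    """Descending eigenvalues of the covariance of features expressed in basis.

    features: (n, c); basis: (c, m) with orthonormal columns.
    """
    coords = features @ basis
    cov = np.atleast_2d(np.cov(coords, rowvar=False))
    return np.linalg.eigvalsh(cov)[::-1]

rng = np.random.default_rng(0)
c, d = 12, 3
k, _ = np.linalg.qr(rng.standard_normal((c, d)))  # basis of S (c x d)
k_perp = complement_basis(k)                      # basis of S-perp (c x (c - d))

# Synthetic "student" features: large variance inside S, small variance outside.
inside = 3.0 * (rng.standard_normal((500, d)) @ k.T)
outside = 0.5 * (rng.standard_normal((500, c - d)) @ k_perp.T)
student_features = inside + outside

eig_s = subspace_eigenvalues(student_features, k)
eig_perp = subspace_eigenvalues(student_features, k_perp)
```

Plotting `eig_s` and `eig_perp` against their index would give a picture analogous in form to Fig. 5, though the numbers here come from synthetic data only.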

We also consider the rotation of the principal subspace of the student with respect to that of the teacher. To do so, we first project the feature maps of the student onto $\mathcal{S}$ and then plot the heat map of the covariance matrix in Fig. 6(a) (trained by RdimKD-P) and in Fig. 6(b) (trained by RdimKD-R). In Fig. 6(a), the diagonal elements are much larger than the off-diagonal elements, which leads to the conclusion that the angle between the two sets of principal axes is small. For comparison, the heatmap for RdimKD-R is also plotted in Fig. 6(b).
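One way to quantify the diagonal dominance visible in such a heat map is the share of the absolute covariance mass that lies on the diagonal. The sketch below is an illustrative measure of our own choosing, not a quantity reported in the paper; the toy data use an identity basis as a stand-in for the teacher's principal axes and compare student features aligned with that basis against a copy rotated by 45 degrees in the first two coordinates.

```python
import numpy as np

def diagonal_energy_ratio(features, basis):
    """Share of the absolute covariance mass on the diagonal, in basis coordinates."""
    coords = features @ basis
    cov = np.abs(np.cov(coords, rowvar=False))
    return float(np.trace(cov) / cov.sum())

rng = np.random.default_rng(0)
n, d = 5000, 4
basis = np.eye(d)  # toy stand-in for the teacher's principal axes

# Features whose principal axes coincide with the basis (distinct variances).
aligned = rng.standard_normal((n, d)) * np.array([4.0, 3.0, 2.0, 1.0])

# The same features rotated by 45 degrees in the plane of the first two axes.
theta = np.pi / 4
rot = np.eye(d)
rot[:2, :2] = [[np.cos(theta), -np.sin(theta)],
               [np.sin(theta), np.cos(theta)]]
rotated = aligned @ rot

ratio_aligned = diagonal_energy_ratio(aligned, basis)
ratio_rotated = diagonal_energy_ratio(rotated, basis)
```

A ratio close to 1 indicates a nearly diagonal covariance. Note that this reading relies on the variances along the axes being distinct: if they were equal, the covariance would be diagonal in any rotated basis as well.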

Figure 4: Ablation study for $\alpha$. (a) Pascal VOC: RdimKD-R for “DeepLabv3+-ResNet-101 to DeepLabv3+-ResNet-18” on the Pascal VOC segmentation task. (b) ImageNet: RdimKD-P for “ResNet-50 to MobileNet” on the ImageNet classification task. It can be seen that as $\alpha$ increases, the student performance first rises and then decreases, which is in line with our expectations and verifies our method’s effectiveness. The result for Pascal VOC is the average of 6 trials, and that for ImageNet is the average of 3 trials.
Figure 5: The distribution of eigenvalues of the covariance matrices of the student’s feature maps in $\mathcal{S}$ and $\mathcal{S}^{\perp}$. (a) RdimKD-P. (b) RdimKD-R. The experiment is done in “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18”, and the feature map is the $P_5$ level. The eigenvalues with large index are very close to zero.
Figure 6: The heatmap of the absolute value of the covariance matrix. (a) RdimKD-P. (b) RdimKD-R. The experiment is done in “Faster-RCNN-FPN-ResNet-101 to Faster-RCNN-FPN-ResNet-18”, and the feature map is the $P_5$ level.

5 Conclusion

In this paper, we proposed RdimKD with three projection methods for knowledge distillation. Compared with other methods, the advantage of RdimKD is threefold: it is simple to implement (especially RdimKD-R) and thus well suited to industrial applications; it achieves performance comparable to or higher than state-of-the-art methods; and it generalizes to various learning tasks and neural architectures. We believe our simple findings will bring more insight and inspiration to knowledge distillation. Moreover, this approach has been widely used in our company’s industrial applications. However, the theoretical explanation for why such a simple method works still needs further study.

References

  • [1] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010.
  • [2] Dimitris Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 274–281, 2001.
  • [3] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? Advances in neural information processing systems, 27, 2014.
  • [4] Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10925–10934, 2022.
  • [5] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 245–250, 2001.
  • [6] Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7028–7036, 2021.
  • [7] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. Advances in neural information processing systems, 30, 2017.
  • [8] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [9] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
  • [10] Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5008–5017, 2021.
  • [11] Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, and Dacheng Tao. Dearkd: Data-efficient early knowledge distillation for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12052–12062, 2022.
  • [12] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4794–4802, 2019.
  • [13] Xing Dai, Zeren Jiang, Zhao Wu, Yiping Bao, Zhicheng Wang, Si Liu, and Erjin Zhou. General instance distillation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7842–7851, 2021.
  • [14] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
  • [15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [16] Yikang Ding, Qingtian Zhu, Xiangyue Liu, Wentao Yuan, Haotian Zhang, and Chi Zhang. Kd-mvs: Knowledge distillation based self-supervised learning for mvs. arXiv preprint arXiv:2207.10425, 2022.
  • [17] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [18] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616. PMLR, 2018.
  • [19] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
  • [20] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
  • [21] Jianyuan Guo, Kai Han, Yunhe Wang, Han Wu, Xinghao Chen, Chunjing Xu, and Chang Xu. Distilling object detectors via decoupled features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2154–2164, 2021.
  • [22] Yi Guo, Huan Yuan, Jianchao Tan, Zhangyang Wang, Sen Yang, and Ji Liu. Gdp: Stabilized neural network pruning via gates with differentiable polarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5239–5250, 2021.
  • [23] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In 2011 international conference on computer vision, pages 991–998. IEEE, 2011.
  • [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [25] Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, and Youliang Yan. Knowledge adaptation for efficient semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 578–587, 2019.
  • [26] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6381–6385. IEEE, 2019.
  • [27] Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1921–1930, 2019.
  • [28] Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3779–3787, 2019.
  • [29] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. Computer Science, 14(7):38–39, 2015.
  • [30] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [31] Tao Huang, Shan You, Fei Wang, Chen Qian, and Chang Xu. Knowledge distillation from a stronger teacher. arXiv preprint arXiv:2205.10536, 2022.
  • [32] Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017.
  • [33] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.
  • [34] William B Johnson. Extensions of lipschitz mappings into a hilbert space. Contemp. Math., 26:189–206, 1984.
  • [35] Zijian Kang, Peizhen Zhang, Xiangyu Zhang, Jian Sun, and Nanning Zheng. Instance-conditional knowledge distillation for object detection. Advances in Neural Information Processing Systems, 34:16468–16480, 2021.
  • [36] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. Advances in neural information processing systems, 31, 2018.
  • [37] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
  • [38] Chenxin Li, Mingbao Lin, Zhiyuan Ding, Nie Lin, Yihong Zhuang, Yue Huang, Xinghao Ding, and Liujuan Cao. Knowledge condensation distillation. arXiv preprint arXiv:2207.05409, 2022.
  • [39] Gang Li, Xiang Li, Yujie Wang, Shanshan Zhang, Yichao Wu, and Ding Liang. Knowledge distillation for object detection via rank mimicking and prediction-guided feature imitation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1306–1313, 2022.
  • [40] Quanquan Li, Shengying Jin, and Junjie Yan. Mimicking very efficient network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6356–6364, 2017.
  • [41] Sihao Lin, Hongwei Xie, Bing Wang, Kaicheng Yu, Xiaojun Chang, Xiaodan Liang, and Gang Wang. Knowledge distillation via the target-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10915–10924, 2022.
  • [42] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [43] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • [44] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [45] Li Liu, Qingle Huang, Sihao Lin, Hongwei Xie, Bing Wang, Xiaojun Chang, and Xiaodan Liang. Exploring inter-channel correlation for diversity-preserved knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8271–8280, 2021.
  • [46] Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7096–7104, 2019.
  • [47] Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2604–2613, 2019.
  • [48] Jiří Matoušek. On variants of the johnson–lindenstrauss lemma. Random Structures & Algorithms, 33(2):142–156, 2008.
  • [49] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 5191–5198, 2020.
  • [50] Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, and Juan Carlos Niebles. Spatio-temporal graph for video captioning with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10870–10879, 2020.
  • [51] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015.
  • [52] Jinhyuk Park and Albert No. Prune your model before distill it. arXiv preprint arXiv:2109.14960, 2021.
  • [53] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.
  • [54] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [55] Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5007–5016, 2019.
  • [56] Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In International Conference on Machine Learning, pages 5142–5151. PMLR, 2019.
  • [57] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
  • [58] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • [59] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • [60] Changyong Shu, Yifan Liu, Jianfei Gao, Zheng Yan, and Chunhua Shen. Channel-wise knowledge distillation for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5311–5320, 2021.
  • [61] Jie Song, Ying Chen, Jingwen Ye, and Mingli Song. Spot-adaptive knowledge distillation. IEEE Transactions on Image Processing, 31:3359–3370, 2022.
  • [62] Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A Alemi, and Andrew G Wilson. Does knowledge distillation really work? Advances in Neural Information Processing Systems, 34:6906–6919, 2021.
  • [63] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019.
  • [64] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1365–1374, 2019.
  • [65] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [66] Tao Wang, Li Yuan, Xiaopeng Zhang, and Jiashi Feng. Distilling object detectors with fine-grained feature imitation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4933–4942, 2019.
  • [67] Xionghui Wang, Jian-Fang Hu, Jian-Huang Lai, Jianguo Zhang, and Wei-Shi Zheng. Progressive teacher-student learning for early action prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3556–3565, 2019.
  • [68] Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. Kdgan: Knowledge distillation with generative adversarial networks. Advances in neural information processing systems, 31, 2018.
  • [69] Yukang Wang, Wei Zhou, Tao Jiang, Xiang Bai, and Yongchao Xu. Intra-class feature variation distillation for semantic segmentation. In European Conference on Computer Vision, pages 346–362. Springer, 2020.
  • [70] Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. Tinyvit: Fast pretraining distillation for small vision transformers. arXiv preprint arXiv:2207.10666, 2022.
  • [71] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
  • [72] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
  • [73] Guodong Xu, Yuenan Hou, Ziwei Liu, and Chen Change Loy. Mind the gap in distilling stylegans. arXiv preprint arXiv:2208.08840, 2022.
  • [74] Chuanguang Yang, Zhulin An, Helong Zhou, Linhang Cai, Xiang Zhi, Jiwen Wu, Yongjun Xu, and Qian Zhang. Mixskd: Self-knowledge distillation from mixup for image recognition. In European Conference on Computer Vision, 2022.
  • [75] Chenglin Yang, Lingxi Xie, Chi Su, and Alan L Yuille. Snapshot distillation: Teacher-student optimization in one generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2859–2868, 2019.
  • [76] Chuanguang Yang, Helong Zhou, Zhulin An, Xue Jiang, Yongjun Xu, and Qian Zhang. Cross-image relational knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12319–12328, 2022.
  • [77] Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos, et al. Knowledge distillation via softmax regression representation learning. International Conference on Learning Representations (ICLR), 2021.
  • [78] Zhendong Yang, Zhe Li, Xiaohu Jiang, Yuan Gong, Zehuan Yuan, Danpei Zhao, and Chun Yuan. Focal and global knowledge distillation for detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4643–4652, 2022.
  • [79] Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, and Chun Yuan. Masked generative distillation. arXiv preprint arXiv:2205.01529, 2022.
  • [80] Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In Proc. Interspeech, Brno, Czech Republic, 2021. IEEE.
  • [81] Han-Jia Ye, Su Lu, and De-Chuan Zhan. Generalized knowledge distillation via relationship matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [82] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4133–4141, 2017.
  • [83] Jianlong Yuan, Qian Qi, Fei Du, Zhibin Wang, Fan Wang, and Yifan Liu. Fakd: Feature augmented knowledge distillation for semantic segmentation. arXiv preprint arXiv:2208.14143, 2022.
  • [84] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
  • [85] Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, and Jianwei Niu. Wenet 2.0: More productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455, 2022.
  • [86] Linfeng Zhang, Xin Chen, Xiaobing Tu, Pengfei Wan, Ning Xu, and Kaisheng Ma. Wavelet knowledge distillation: Towards efficient image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12464–12474, 2022.
  • [87] Linfeng Zhang and Kaisheng Ma. Improve object detection with feature-based knowledge distillation: Towards accurate and efficient detectors. In International Conference on Learning Representations, 2020.
  • [88] Peizhen Zhang, Zijian Kang, Tong Yang, Xiangyu Zhang, Nanning Zheng, and Jian Sun. Lgd: Label-guided self-distillation for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3309–3317, 2022.
  • [89] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4320–4328, 2018.
  • [90] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11953–11962, 2022.
  • [91] Zhaohui Zheng, Rongguang Ye, Ping Wang, Dongwei Ren, Wangmeng Zuo, Qibin Hou, and Ming-Ming Cheng. Localization distillation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9407–9416, 2022.
  • [92] Du Zhixing, Rui Zhang, Ming Chang, Shaoli Liu, Tianshi Chen, Yunji Chen, et al. Distilling object detectors with feature richness. Advances in Neural Information Processing Systems, 34:5213–5224, 2021.
  • [93] Helong Zhou, Liangchen Song, Jiajie Chen, Ye Zhou, Guoli Wang, Junsong Yuan, and Qian Zhang. Rethinking soft labels for knowledge distillation: A bias-variance tradeoff perspective. arXiv preprint arXiv:2102.00650, 2021.
  • [94] Zaida Zhou, Chaoran Zhuge, Xinwei Guan, and Wen Liu. Channel distillation: Channel-wise attention for knowledge distillation. arXiv preprint arXiv:2006.01683, 2020.
  • [95] Yichen Zhu and Yi Wang. Student customized knowledge distillation: Bridging the gap between student and teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5057–5066, 2021.
  • [96] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.