License: arXiv.org perpetual non-exclusive license
arXiv:2402.11148v2 [cs.LG] 07 Mar 2024

Knowledge Distillation Based on Transformed Teacher Matching

Kaixiang Zheng & En-Hui Yang
Department of Electrical and Computer Engineering, University of Waterloo
{k56zheng,ehyang}@uwaterloo.ca
Abstract

As a technique to bridge logit matching and probability distribution matching, temperature scaling plays a pivotal role in knowledge distillation (KD). Conventionally, temperature scaling is applied to both the teacher’s logits and the student’s logits in KD. Motivated by some recent works, in this paper we instead drop temperature scaling on the student side and systematically study the resulting variant of KD, dubbed transformed teacher matching (TTM). By reinterpreting temperature scaling as a power transform of the probability distribution, we show that, in comparison with the original KD, TTM has an inherent Rényi entropy term in its objective function, which serves as an extra regularization term. Extensive experimental results demonstrate that, thanks to this inherent regularization, TTM leads to trained students with better generalization than the original KD. To further enhance the student’s capability to match the teacher’s power transformed probability distribution, we introduce a sample-adaptive weighting coefficient into TTM, yielding a novel distillation approach dubbed weighted TTM (WTTM). Comprehensive experiments show that WTTM, although simple, is effective, improves upon TTM, and achieves state-of-the-art accuracy. Our source code is available at https://github.com/zkxufo/TTM.

1 Introduction

Knowledge distillation (KD) has achieved great success and drawn a lot of attention ever since it was proposed. The original form of KD was proposed by Buciluǎ et al. (2006), where a small model (student) was trained to match the logits of a large model (teacher). Later, a generalized version now known as KD was proposed by Hinton et al. (2015), where the small student model was trained to match the class probability distribution of the large teacher model. Compared to a student model trained with standard empirical risk minimization (ERM), a student model trained via KD achieves better accuracy, to the extent that this light-weight KD-trained student model is able to take the place of some larger and more complex models with little performance degradation, achieving the goal of model compression.

In the literature, KD is generally formulated as minimizing the following loss

$$\mathcal{L}_{KD}=(1-\lambda)H(y,q)+\lambda T^{2}D(p^{t}_{T}||q_{T}) \tag{1}$$

where $\mathcal{L}_{CE}=H(y,q)$ is the cross entropy loss between the one-hot probability distribution corresponding to label $y$ and the student output probability distribution $q$, which is the canonical loss of ERM, $D(p^{t}_{T}||q_{T})$ is the Kullback–Leibler divergence between the temperature scaled output probability distribution $p^{t}_{T}$ of the teacher and the temperature scaled output probability distribution $q_{T}$ of the student, $T$ is the temperature of distillation, and $\lambda$ is a balancing weight. Note that $p^{t}_{T}=\sigma(v/T)$ and $q_{T}=\sigma(z/T)$, given logits $v$ of the teacher and logits $z$ of the student, where $\sigma$ denotes the softmax function.
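To make the roles of $T$ and $\lambda$ in (1) concrete, the following is a minimal per-sample sketch in plain Python. The helper names (`softmax`, `cross_entropy`, `kl_divergence`, `kd_loss`), the example logits, and the default values of `T` and `lam` are illustrative assumptions; they are not taken from the released source code.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, with max-subtraction for stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_logits, student_logits, label, T=4.0, lam=0.9):
    """Per-sample KD loss of Eq. (1); T and lam defaults are assumed values."""
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    q = softmax(student_logits)          # student output, no temperature
    p_t_T = softmax(teacher_logits, T)   # temperature scaled teacher
    q_T = softmax(student_logits, T)     # temperature scaled student
    return (1 - lam) * cross_entropy(one_hot, q) + lam * T**2 * kl_divergence(p_t_T, q_T)

# Hypothetical teacher logits v and student logits z for a 3-class problem.
v = [5.0, 2.0, -1.0]
z = [3.0, 1.0, 0.5]
loss = kd_loss(v, z, label=0)
```

With `lam=0` the sketch reduces to the ERM cross entropy loss, and with identical teacher and student logits the divergence term vanishes.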

The use of the temperature $T$ above is a pivotal characteristic of KD. On one hand, it provides a way to build a bridge between class probability distribution matching and logits matching. Indeed, it was shown in Hinton et al. (2015) that as $T$ goes to $\infty$, KD is equivalent to its logits-matching predecessor. On the other hand, it also distinguishes KD from the logits-matching approach, since in practice, empirically optimal values of the temperature $T$ are often quite modest. Beyond these, there is little understanding of the role of the temperature $T$ and, in general, of why KD in its formulation (1) helps the student learn better. In particular, the following questions naturally arise:

Q1: Why does the temperature $T$ have to be applied to both the teacher and the student?

Q2: Would it be better to apply the temperature $T$ to the teacher only, but not to the student?

So far, answers to the above questions remain elusive at best.

The purpose of this paper is to address the above questions. First, we demonstrate both theoretically and experimentally that the answer to question Q2 is affirmative: it is better to drop the temperature $T$ entirely on the student side. The resulting variant of KD is referred to as transformed teacher matching (TTM) and is formulated as minimizing the following objective:

$$\mathcal{L}_{TTM}=H(y,q)+\beta D(p^{t}_{T}||q) \tag{2}$$

where $\beta$ is a balancing weight. Specifically, we show that (1) temperature scaling of logits is equivalent to a power transform of the probability distribution, and (2) in comparison with KD, TTM has an inherent Rényi entropy term in its objective function (2). It is this inherent Rényi entropy that serves as an extra regularization term and hence improves upon KD. This theoretical analysis is further confirmed by extensive experimental results, which show that, thanks to this inherent regularization, TTM leads to trained students with better generalization than KD. Second, to further enhance the student’s capability to match the teacher’s power transformed probability distribution, we introduce a sample-adaptive weighting coefficient into TTM, yielding a novel distillation approach dubbed weighted TTM (WTTM). WTTM is simple and has almost the same computational complexity as KD, yet it is very effective: comprehensive experiments show that it is significantly better than KD in terms of accuracy, improves upon TTM, and achieves state-of-the-art accuracy. For example, WTTM can reach 72.19% classification accuracy on ImageNet for ResNet-18 distilled from ResNet-34, outperforming most highly complex feature-based distillation methods.
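As a rough illustration of objective (2), the sketch below computes the per-sample TTM loss in plain Python. The only structural difference from KD is that the student distribution $q$ is left unscaled. The function names, the example logits, and the default values of `T` and `beta` are assumptions made for this example, not values prescribed here.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, with max-subtraction for stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ttm_loss(teacher_logits, student_logits, label, T=4.0, beta=1.0):
    """Per-sample TTM loss of Eq. (2); T and beta defaults are assumed values."""
    q = softmax(student_logits)          # student: temperature dropped
    p_t_T = softmax(teacher_logits, T)   # teacher: temperature scaled
    ce = -math.log(q[label])             # H(y, q) for a one-hot label
    return ce + beta * kl_divergence(p_t_T, q)

# Hypothetical logits for a 3-class problem.
v = [5.0, 2.0, -1.0]
z = [3.0, 1.0, 0.5]
loss = ttm_loss(v, z, label=0)

# A student whose logits equal v / T matches the transformed teacher exactly,
# so only the cross entropy term remains.
z_match = [x / 4.0 for x in v]
loss_match = ttm_loss(v, z_match, label=0, T=4.0)
```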

With the temperature $T$ dropped entirely on the student side, TTM and WTTM, along with the statistical perspective of KD (Menon et al., 2021) and the newly established upper bound on error rate in terms of the cross entropy $H(p^{*}_{x},q)$ between the true but often unknown conditional probability distribution $p^{*}_{x}$ of label $y$ given an input sample $x$ and the output probability distribution $q$ of a model in response to the input $x$ (Yang et al., 2023a), offer a new explanation of why KD helps. First, the purpose of the teacher in KD is to provide a proper estimate of the unknown true conditional probability distribution $p^{*}_{x}$; this estimate is a linear combination of the one-hot vector corresponding to the label $y$ and the power transformed teacher’s probability distribution $p^{t}_{T}$. Second, the role of the temperature $T$ on the teacher side is to improve this estimate. Third, with $p^{*}_{x}$ replaced by its estimate from the transformed teacher, the learning process in KD simply minimizes the cross entropy upper bound on error rate, which improves upon the standard deep learning process, where $p^{*}_{x}$ in the cross entropy upper bound is rudimentarily approximated by the one-hot vector corresponding to the label $y$.

2 Background and Related Work

2.1 Confidence Penalty

In a multi-class classification setting, the output of a neural network in response to an input sample is a probability vector or distribution $q$ with $K$ entries, where $K$ is the number of all possible classes, and the class with the highest probability is the prediction made by the neural network for this particular sample. Conventionally, a prediction is said to be confident if the corresponding $q$ concentrates most of its probability mass on the predicted class. Szegedy et al. (2016) point out that if a model is too confident about its predictions, then it tends to suffer from overfitting. To avoid overfitting and improve generalization, Pereyra et al. (2017) proposed to penalize confident predictions. Since a confident prediction generally corresponds to $q$ with low entropy, they enforced a confidence penalty (CP) by introducing a negative entropy regularizer into the objective function of the learning process, which is formulated as

$$\mathcal{L}_{CP}=H(y,q)-\eta H(q) \tag{3}$$

where $\eta$ controls the strength of the confidence penalty. Thanks to the entropy regularization, the learned model is encouraged to output smoother distributions with larger entropy, leading to less confident predictions, and most importantly, better generalization.
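The effect of the entropy term in (3) can be seen in a small plain-Python sketch. The names `shannon_entropy` and `cp_loss`, the example logits, and the default `eta` are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(q):
    """H(q) = -sum_i q_i log q_i."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def cp_loss(student_logits, label, eta=0.1):
    """Per-sample confidence penalty loss of Eq. (3); eta default is assumed."""
    q = softmax(student_logits)
    return -math.log(q[label]) - eta * shannon_entropy(q)

# Hypothetical logits: a very confident prediction and a smoother one.
confident = [10.0, 0.0, 0.0]
smooth = [2.0, 0.0, 0.0]
h_confident = shannon_entropy(softmax(confident))
h_smooth = shannon_entropy(softmax(smooth))
```

Because the entropy enters with a negative sign, the smoother output receives a larger reduction of its loss than the confident one, which is the penalty on confidence described above.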

2.2 Rényi Entropy

Rényi entropy (Rényi, 1961) is a generalized version of Shannon entropy, which has been successfully applied in many machine learning topics, such as differential privacy (Mironov, 2017), understanding neural networks (Yu et al., 2020), and representation distillation (Miles et al., 2021). Given a discrete random variable $X$ with alphabet $\mathcal{A}=\{x_{1},x_{2},\dots,x_{n}\}$ and corresponding probabilities $p_{i}$ for $i=1,2,\dots,n$, its Rényi entropy is defined as

$$H_{\alpha}(X)=\frac{1}{1-\alpha}\log\sum_{i=1}^{n}{p_{i}}^{\alpha} \tag{4}$$

where $\alpha$ is called the order of Rényi entropy. The limit of Rényi entropy when $\alpha\rightarrow 1$ is the well-known Shannon entropy.
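Definition (4) and its Shannon limit can be checked numerically with a short plain-Python sketch. The function names and the example distribution are assumptions made for illustration; the branch at $\alpha=1$ simply returns the Shannon entropy as the limiting value.

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha > 0 as in Eq. (4)."""
    if abs(alpha - 1.0) < 1e-12:
        # Eq. (4) is undefined at alpha = 1; use its limit, the Shannon entropy.
        return shannon_entropy(p)
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)

# Hypothetical 3-class distribution.
p = [0.7, 0.2, 0.1]
h_half = renyi_entropy(p, 0.5)
h_near_one = renyi_entropy(p, 1.0 + 1e-6)
h_shannon = shannon_entropy(p)
h_two = renyi_entropy(p, 2.0)
```

For a uniform distribution every order gives $\log n$, and for a non-uniform one the value is non-increasing in $\alpha$, so orders below 1 yield values at least as large as the Shannon entropy.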

2.3 Label Smoothing Perspective towards KD

In the literature, different perspectives have been developed to understand KD. One of them is the label smoothing (LS) perspective advocated by Yuan et al. (2020) and Zhang & Sabuncu (2020).

LS (Szegedy et al., 2016) is a technique to encourage a model to make less confident predictions by minimizing the following objective function in the learning process

$$\mathcal{L}_{LS}=(1-\epsilon)H(y,q)+\epsilon H(u,q) \tag{5}$$

where $u$ is a uniform distribution over all $K$ possible classes, and $\epsilon$ controls the strength of the smoothing effect. The model trained with LS tends to have significantly less confident predictions and output probability distributions with larger Shannon entropy compared to its counterpart in the case of ERM (visualized in A.1).

If we replace $u$ with the teacher output $p^{t}$ in (5), then we have $\mathcal{L}_{LS}=(1-\epsilon)H(y,q)+\epsilon H(p^{t},q)$, which is equivalent to $\mathcal{L}_{KD}$ with $T=1$, since the entropy $H(p^{t})$ does not depend on the student. Therefore, when $T=1$, KD can indeed be regarded as sample-adaptive LS. However, when $T>1$, such a perspective no longer holds since temperature scaling is also applied to the student model. This is confirmed by the empirical analysis shown in A.1. Although KD with $T=1$ is able to increase the Shannon entropy of the output probability distribution $q$ compared to ERM, KD with $T=4$ actually leads to decreased Shannon entropy compared to ERM, an effect opposite to that of LS.
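The equivalence at $T=1$ rests on the identity $H(p^{t},q)=D(p^{t}||q)+H(p^{t})$. The plain-Python sketch below checks that, with $\lambda=\epsilon$, the teacher-smoothed loss and the KD loss differ by the constant $\epsilon H(p^{t})$ regardless of the student. All function names and numeric values are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def teacher_smoothed_loss(p_t, q, label, eps):
    """Eq. (5) with the uniform distribution u replaced by the teacher output p_t."""
    return (1 - eps) * (-math.log(q[label])) + eps * cross_entropy(p_t, q)

def kd_loss_t1(p_t, q, label, lam):
    """Eq. (1) evaluated at T = 1."""
    return (1 - lam) * (-math.log(q[label])) + lam * kl_divergence(p_t, q)

# Hypothetical teacher output and three different hypothetical students.
p_t = softmax([5.0, 2.0, -1.0])
eps = 0.1
gaps = []
for z in ([3.0, 1.0, 0.5], [0.0, 0.0, 0.0], [-2.0, 4.0, 1.0]):
    q = softmax(z)
    gaps.append(teacher_smoothed_loss(p_t, q, 0, eps) - kd_loss_t1(p_t, q, 0, eps))
```

Since the gap does not depend on $q$, both losses have the same gradients with respect to the student.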

The sample-adaptive LS perspective was also advocated in self-distillation (Zhang & Sabuncu, 2020), where the temperature $T$ was dropped for convenience on the student side. However, no systematic treatment was provided to justify dropping the temperature $T$ on the student side. In fact, in terms of prediction accuracy, mixed results were demonstrated: dropping the temperature $T$ for the student can either decrease or increase the accuracy.

2.4 Statistical Perspective and Cross Entropy Upper Bound

Another perspective to understand KD is the statistical perspective advocated by Menon et al. (2021). A key observation therein is that the Bayes-distilled risk has a smaller variance than the standard empirical risk, which is actually a direct consequence of the law of total probability for variance (Ross, 2019). Since the Bayes class-probability distribution over the labels, i.e., the conditional probability distribution $p^{*}_{x}=[P(i|x)]_{i=1}^{K}$ of label $y$ given an input sample $x$, is unknown in practice, the role of the teacher in KD was believed to be to use its output probability distribution $p^{t}$ or temperature scaled output probability distribution $p^{t}_{T}$ to estimate $p^{*}_{x}$ for the student. This, in turn, offers some explanation of why improving teacher accuracy can sometimes harm distillation performance, since improving teacher accuracy and providing better estimates for $p^{*}_{x}$ are two different tasks. In this perspective, the temperature $T$ is also dropped for the student. Again, no justification was provided for dropping $T$ on the student side. In addition, the question of why minimizing the Bayes-distilled risk or teacher-distilled risk could improve the student’s accuracy was not answered either.

Recently, it was shown in Yang et al. (2023a) that for any classification neural network, its error rate is upper bounded by $\mathbb{E}_{x}[H(p^{*}_{x},q)]$. Thus, to reduce its error rate, the neural network can be trained by minimizing $\mathbb{E}_{x}[H(p^{*}_{x},q)]$. Since the true conditional distribution $p^{*}_{x}$ is generally unavailable in practice, KD with the temperature $T$ dropped for the student can essentially be regarded as one way to approximately solve the problem of minimizing $\mathbb{E}_{x}[H(p^{*}_{x},q)]$, where $p^{*}_{x}$ is first approximated by a linear combination of the one-hot probability distribution corresponding to label $y$ and the temperature scaled output probability distribution $p^{t}_{T}$ of the teacher. This perspective, when applied to KD, does provide justifications for dropping the temperature $T$ entirely on the student side and also for minimizing the Bayes-distilled risk or teacher-distilled risk. Of course, KD with the temperature $T$ dropped for the student is not necessarily an effective way to minimize $\mathbb{E}_{x}[H(p^{*}_{x},q)]$. Other recent related works are reviewed in Appendix A.7.

In contrast, in this paper, we show more directly that it is better to drop the temperature $T$ entirely on the student side in KD by comparing TTM with KD both theoretically and experimentally.

3 Transformed Teacher Matching

In this section, we compare TTM with KD theoretically by showing that TTM is equivalent to KD plus Rényi entropy regularization. To this end, we first introduce a general concept of power transform of output distributions. Then, we show the equivalence between temperature scaling and power transform. Based on this, a simple derivation is provided to decompose TTM into KD plus a Rényi entropy regularizer. In view of CP, it is clear that TTM can lead to better generalization than KD because of the penalty on confident output distributions.

3.1 Power Transform of Probability Distributions

In KD, model output distributions are transformed by temperature scaling to improve their smoothness. However, such a transform is not unique. There are many other transforms which can smooth out peaked probability distributions as well. Below we will introduce a generalized transform.

Consider a point-wise mapping $f:[0,1]\rightarrow[0,1]$. For any probability distribution $p=[p_{1},\dots,p_{K}]$, we can apply $f$ to each component of $p$ to define a generalized transform $p\to\hat{p}$, where $\hat{p}=[\hat{p_{1}},\dots,\hat{p_{K}}]$, and

$$\hat{p_{i}}=\frac{f(p_{i})}{\sum_{j=1}^{K}f(p_{j})},\ \forall\ 1\leq i\leq K. \tag{6}$$

In the above, $\sum_{j=1}^{K}f(p_{j})$ is used to normalize the vector $[f(p_{i})]_{i=1}^{K}$ back to the probability simplex. With this generalized framework, any specific transform can be described by its associated mapping $f$. Among all possible mappings $f$, the most interesting one to us is the power function with exponent $\gamma$. If $f$ is selected to be the power function with exponent $\gamma$, the resulting probability distribution transform $p\to\hat{p}$ is referred to as the power transform of the probability distribution. Accordingly, the power transformed distribution is given by

$$\hat{p}=\left[\hat{p_{i}}\right]_{i=1}^{K}=\left[\frac{{p_{i}}^{\gamma}}{\sum_{j=1}^{K}{p_{j}}^{\gamma}}\right]_{i=1}^{K}. \tag{7}$$

Next, we will show that power transform is equivalent to temperature scaling. Indeed, suppose that $p$ is the softmax of logits $[l_{1},l_{2},\cdots,l_{K}]$:

$$p_{i}=\frac{e^{l_{i}}}{\sum_{j=1}^{K}e^{l_{j}}},\ \forall\ 1\leq i\leq K. \tag{8}$$

Then

$$\hat{p_{i}}=\frac{{p_{i}}^{\gamma}}{\sum_{j}{p_{j}}^{\gamma}}=\frac{\left(\frac{e^{l_{i}}}{\sum_{m}e^{l_{m}}}\right)^{\gamma}}{\sum_{j}\left(\frac{e^{l_{j}}}{\sum_{k}e^{l_{k}}}\right)^{\gamma}}=\frac{\left(\frac{1}{\sum_{m}e^{l_{m}}}\right)^{\gamma}\cdot e^{\gamma l_{i}}}{\left(\frac{1}{\sum_{k}e^{l_{k}}}\right)^{\gamma}\cdot\sum_{j}e^{\gamma l_{j}}}=\frac{e^{\gamma l_{i}}}{\sum_{j}e^{\gamma l_{j}}}. \tag{9}$$

Thus $\hat{p}$ is the softmax of the scaled logits $[\gamma l_{1},\gamma l_{2},\cdots,\gamma l_{K}]$, which corresponds to temperature scaling with temperature $T=1/\gamma$.
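The equivalence in (9) is easy to confirm numerically. The plain-Python sketch below applies the power transform (7) with $\gamma=1/T$ to a softmax output and compares it with the temperature scaled softmax of the same logits. The function names, logits, and temperature are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, with max-subtraction for stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def power_transform(p, gamma):
    """Power transform of Eq. (7): raise each entry to gamma, then renormalize."""
    powered = [pi ** gamma for pi in p]
    total = sum(powered)
    return [x / total for x in powered]

# Hypothetical logits and temperature.
logits = [5.0, 2.0, -1.0]
T = 4.0

p = softmax(logits)                          # Eq. (8)
p_hat = power_transform(p, gamma=1.0 / T)    # Eq. (7) with gamma = 1/T
p_T = softmax(logits, temperature=T)         # temperature scaling

max_gap = max(abs(a - b) for a, b in zip(p_hat, p_T))
```

The two vectors agree up to floating-point rounding, so the power transform can be applied directly to probabilities when logits are not available.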

3.2 From KD to TTM

Based on the equivalence between power transform and temperature scaling, we can now reveal the connection between KD and TTM.

Let $\gamma=1/T$. Go back to (1) and (2). In view of (9), we have

$$p^{t}_{T}=\hat{p^{t}}\mbox{ and }q_{T}=\hat{q}. \tag{10}$$

Then we can decompose $D(p^{t}_{T}||q_{T})$ as follows:

\begin{align}
D(p^{t}_{T}||q_{T}) &= D(\hat{p^{t}}||\hat{q}) \notag\\
&= \sum_{i}\hat{p^{t}}_{i}\log\frac{\hat{p^{t}}_{i}}{\hat{q}_{i}} \notag\\
&= -\sum_{i}\hat{p^{t}}_{i}\log\hat{q}_{i} - H(\hat{p^{t}}) \notag\\
&= -\sum_{i}\hat{p^{t}}_{i}\log\frac{q_{i}^{\gamma}}{\sum_{j}q_{j}^{\gamma}} - H(\hat{p^{t}}) \tag{11}\\
&= -\sum_{i}\hat{p^{t}}_{i}\log q_{i}^{\gamma} + \log\sum_{j}q_{j}^{\gamma} - H(\hat{p^{t}}) \notag\\
&= \gamma H(\hat{p^{t}},q) + (1-\gamma)H_{\gamma}(q) - H(\hat{p^{t}}) \tag{12}\\
&= \gamma D(\hat{p^{t}}||q) + (1-\gamma)H_{\gamma}(q) - (1-\gamma)H(\hat{p^{t}}) \tag{13}\\
&= \gamma D(p^{t}_{T}||q) + (1-\gamma)H_{\gamma}(q) - (1-\gamma)H(p^{t}_{T}) \tag{14}
\end{align}

where (11) follows from the power transform (7), $H_{\gamma}(q)$ in (12) is the Rényi entropy of $q$ of order $\gamma$, and (14) is due to (10). Rearranging (14) with $\gamma = 1/T$, we get

\begin{equation}
D(p^{t}_{T}||q) = T D(p^{t}_{T}||q_{T}) - (T-1)H_{\frac{1}{T}}(q) + (T-1)H(p^{t}_{T}). \tag{15}
\end{equation}
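As a quick sanity check, identity (15) can be verified numerically. The following Python sketch does so for a single sample; the logit values and helper names are illustrative assumptions and are not taken from the released code. It relies on the fact that temperature-scaled softmax coincides with the power transform of order $1/T$.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi_entropy(p, order):
    """Renyi entropy of the given order (order != 1)."""
    return math.log(sum(pi ** order for pi in p)) / (1.0 - order)

T = 4.0
teacher_logits = [2.0, 1.0, 0.1, -1.0]  # made-up values
student_logits = [1.5, 0.3, 0.8, -0.5]  # made-up values

p_T = softmax(teacher_logits, T)  # power transformed teacher p^t_T
q = softmax(student_logits)       # student output q (no temperature)
q_T = softmax(student_logits, T)  # power transformed student q_T

lhs = kl(p_T, q)
rhs = (T * kl(p_T, q_T)
       - (T - 1) * renyi_entropy(q, 1.0 / T)
       + (T - 1) * shannon_entropy(p_T))
print(abs(lhs - rhs) < 1e-9)  # the two sides of (15) should agree
```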

Plugging (15) into (2) yields

\begin{align}
\mathcal{L}_{TTM} &= H(y,q) + \beta T D(p^{t}_{T}||q_{T}) - \beta(T-1)H_{\frac{1}{T}}(q) + \beta(T-1)H(p^{t}_{T}) \notag\\
&\equiv H(y,q) + \beta T D(p^{t}_{T}||q_{T}) - \beta(T-1)H_{\frac{1}{T}}(q) \tag{16}\\
&= \frac{1}{1-\lambda}\left[(1-\lambda)H(y,q) + \lambda T^{2}D(p^{t}_{T}||q_{T}) - \lambda T(T-1)H_{\frac{1}{T}}(q)\right] \tag{17}\\
&= \frac{1}{1-\lambda}\left[\mathcal{L}_{KD} - \lambda T(T-1)H_{\frac{1}{T}}(q)\right] \tag{18}
\end{align}

whenever $\beta$ is selected to be

\begin{equation}
\beta = \frac{\lambda}{1-\lambda}T, \tag{19}
\end{equation}

where (16) is due to the fact that the Shannon entropy $H(p^{t}_{T})$ does not depend on the student model, (17) follows from (19), and (18) is attributable to (1).

Thus we have shown that TTM can indeed be decomposed into KD plus a Rényi entropy regularizer. Since Rényi entropy is a generalized version of Shannon entropy, it plays a role in TTM similar to that of Shannon entropy in CP. With this, we have reasons to believe that it can lead to better generalization, which is indeed confirmed later by extensive experiments in Section 5.
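The relation between the two objectives can also be checked numerically. The sketch below assumes the forms $\mathcal{L}_{KD} = (1-\lambda)H(y,q) + \lambda T^{2}D(p^{t}_{T}||q_{T})$ and $\mathcal{L}_{TTM} = H(y,q) + \beta D(p^{t}_{T}||q)$, with $\beta$ chosen as in (19), and confirms that $\mathcal{L}_{TTM}$ equals the right-hand side of (18) plus the student-independent constant $\beta(T-1)H(p^{t}_{T})$ dropped in (16). The logits, the label and the helper names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_entropy(p):
    return cross_entropy(p, p)

def renyi_entropy(p, order):
    return math.log(sum(pi ** order for pi in p)) / (1.0 - order)

T, lam = 4.0, 0.9
beta = lam * T / (1.0 - lam)            # Eq. (19)

y = [1.0, 0.0, 0.0, 0.0]                # one-hot label (made up)
teacher_logits = [2.0, 1.0, 0.1, -1.0]  # made-up values
student_logits = [1.5, 0.3, 0.8, -0.5]  # made-up values

p_T = softmax(teacher_logits, T)
q = softmax(student_logits)
q_T = softmax(student_logits, T)

loss_ttm = cross_entropy(y, q) + beta * kl(p_T, q)
loss_kd = (1 - lam) * cross_entropy(y, q) + lam * T ** 2 * kl(p_T, q_T)

# Right-hand side of (18): KD loss minus the Renyi entropy regularizer.
regularized_kd = (loss_kd - lam * T * (T - 1) * renyi_entropy(q, 1.0 / T)) / (1 - lam)
# Constant removed in (16); it does not depend on the student.
constant = beta * (T - 1) * shannon_entropy(p_T)

gap = loss_ttm - (regularized_kd + constant)
print(abs(gap) < 1e-8)
```

Because the Rényi entropy enters (18) with a negative sign, minimizing $\mathcal{L}_{TTM}$ rewards a larger $H_{\frac{1}{T}}(q)$, which is the regularization effect discussed above.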

It is also instructive to compare TTM and KD from the perspective of their respective gradients. The gradients of the distillation component in $\mathcal{L}_{TTM}$ with respect to the logits are:

\begin{equation}
\frac{\partial D(p^{t}_{T}||q)}{\partial z_{i}} = \frac{\partial H(p^{t}_{T},q)}{\partial z_{i}} = q_{i} - \hat{p^{t}}_{i} = q_{i} - \frac{(p^{t}_{i})^{1/T}}{\sum_{j=1}^{K}(p^{t}_{j})^{1/T}} \tag{20}
\end{equation}

where $z_{i}$ and $q_{i}$ are the $i$th logit and $i$th class probability of the student model, respectively. In comparison, the corresponding gradients for KD are

\begin{equation}
\frac{\partial D(p^{t}_{T}||q_{T})}{\partial z_{i}} = \frac{\partial H(p^{t}_{T},q_{T})}{\partial z_{i}} = \frac{1}{T}\left(\hat{q}_{i} - \hat{p^{t}}_{i}\right) = \frac{1}{T}\left(\frac{q_{i}^{1/T}}{\sum_{j=1}^{K}q_{j}^{1/T}} - \frac{(p^{t}_{i})^{1/T}}{\sum_{j=1}^{K}(p^{t}_{j})^{1/T}}\right). \tag{21}
\end{equation}

From Eq. (20), we see that the gradient descent learning process would push $q_{i}$ towards the power transformed teacher probability distribution, thus encouraging the student to behave like the power transformed teacher, from which the name TTM (transformed teacher matching) is coined. Since the power transformed teacher distribution $p^{t}_{T}$ with $T>1$ is smoother, the student trained by TTM will output a distribution $q$ with similar smoothness, leading to low confidence and high entropy. On the other hand, in Eq. (21), it is the transformed student distribution $q_{T}$ that is pushed towards the transformed teacher distribution $p^{t}_{T}$. Even when $q_{T}$ is as smooth as $p^{t}_{T}$, the original student distribution $q$ can still be quite peaked, thus having high confidence and low entropy.
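The closed-form gradients in Eqs. (20) and (21) can be compared against central finite differences. The following Python sketch is only an illustration under made-up logits; `numeric_grad` and the other helper names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at z."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

T = 4.0
teacher_logits = [2.0, 1.0, 0.1, -1.0]  # made-up values
student_logits = [1.5, 0.3, 0.8, -0.5]  # made-up values
p_T = softmax(teacher_logits, T)

# TTM, Eq. (20): gradient of D(p_T || q) is q_i - (p_T)_i.
ttm_numeric = numeric_grad(lambda z: kl(p_T, softmax(z)), student_logits)
ttm_closed = [qi - pi for qi, pi in zip(softmax(student_logits), p_T)]

# KD, Eq. (21): gradient of D(p_T || q_T) is ((q_T)_i - (p_T)_i) / T.
kd_numeric = numeric_grad(lambda z: kl(p_T, softmax(z, T)), student_logits)
kd_closed = [(qi - pi) / T for qi, pi in zip(softmax(student_logits, T), p_T)]

ttm_err = max(abs(a - b) for a, b in zip(ttm_numeric, ttm_closed))
kd_err = max(abs(a - b) for a, b in zip(kd_numeric, kd_closed))
print(ttm_err < 1e-6, kd_err < 1e-6)
```

The TTM gradient vanishes only when $q$ itself equals $p^{t}_{T}$, whereas the KD gradient vanishes when $q_{T}$ equals $p^{t}_{T}$, which matches the discussion above.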

4 Sample-adaptive Matching to the Transformed Teacher

We can further improve TTM by introducing a sample-adaptive weighting coefficient into TTM. This is explored in this section.

In TTM, the soft target we use is a linear combination of the one-hot probability distribution corresponding to $y$ and the power transformed teacher distribution $p^{t}_{T}$, where the same coefficient $\beta$ is applied to all samples. As discussed in Subsection 2.4, the role of the teacher in KD is to provide $p^{t}_{T}$ and use it as an estimate for $p^{*}_{x}$. Assuming this estimate is good, it is reasonable to favor the soft target over the one-hot target even more for those samples whose $p^{t}_{T}$ has more intrinsic confusion and is far from the one-hot probability distribution. After all, when $p^{t}_{T}$ is close to the corresponding one-hot probability distribution, minimizing $H(p^{t}_{T},q)$ differs little from minimizing $H(y,q)$, and as a result, it is no longer meaningful to do distillation on these types of samples. This motivates us to discriminate among soft targets in TTM based on their smoothness. Concretely, a large $\beta$ should be assigned to a smooth $p^{t}_{T}$, while a small $\beta$ should be assigned to a peaked $p^{t}_{T}$.

To implement the above idea, we need a quantity to quantify the smoothness of a soft target $p^{t}_{T}$. In view of (7) and the definition of Rényi entropy (4), the following power sum defined for any distribution $p$ and any $0<\gamma<1$

\begin{equation*}
U_{\gamma}(p) = \sum_{j=1}^{K} p_{j}^{\gamma}
\end{equation*}

comes in handy. Given $0<\gamma<1$, we can use the power sum $U_{\gamma}(p)$ to quantify the smoothness of $p$, since it is related to both the power transform and Rényi entropy. It is clear that the power sum $U_{\gamma}(p)$ attains its minimum 1 when $p$ is one-hot and its maximum $K^{1-\gamma}$ when $p$ is uniform. Using $U_{\gamma}(p^{t})$ to discriminate among different samples, we modify TTM to minimize the following objective function

\begin{equation}
\mathcal{L}_{WTTM} = H(y,q) + \beta U_{\frac{1}{T}}(p^{t}) \cdot D(p^{t}_{T}||q). \tag{22}
\end{equation}

The resulting variant of KD is referred to as weighted TTM (WTTM). Note that other sample-adaptive weights such as $H(p^{t}_{T})$ may also be effective. Nonetheless, a systematic study of how to select sample-adaptive weights, and of which one is optimal, is left for future work.

Compared to TTM where the student is trained to match all soft targets uniformly, WTTM trains the student to match more closely to smooth soft targets and less closely to peaked soft targets. Thus, students resulting from WTTM would output smoother $q$ than those distilled from TTM, which is further confirmed in the next section by experiments.
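To make the weighting concrete, the following Python sketch evaluates the power sum $U_{\gamma}(p)$ at its two extremes and computes a per-sample WTTM loss in the form of Eq. (22). It is a minimal illustration rather than the released implementation: the function names, the logits, and the values of $T$ and $\beta$ are assumptions chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def power_sum(p, gamma):
    """U_gamma(p) = sum_j p_j ** gamma, for 0 < gamma < 1."""
    return sum(pj ** gamma for pj in p)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def wttm_loss(student_logits, teacher_logits, label, T, beta):
    """Per-sample objective H(y, q) + beta * U_{1/T}(p^t) * D(p^t_T || q)."""
    p_t = softmax(teacher_logits)          # untransformed teacher p^t
    p_T = softmax(teacher_logits, T)       # power transformed teacher p^t_T
    q = softmax(student_logits)            # student output, no temperature
    weight = power_sum(p_t, 1.0 / T)       # sample-adaptive weight
    hard_label_loss = -math.log(q[label])  # H(y, q) for a one-hot y
    return hard_label_loss + beta * weight * kl(p_T, q)

K, T = 4, 4.0
one_hot = [1.0, 0.0, 0.0, 0.0]
uniform = [1.0 / K] * K
u_min = power_sum(one_hot, 1.0 / T)  # minimum, attained by a one-hot p
u_max = power_sum(uniform, 1.0 / T)  # maximum K ** (1 - gamma), uniform p

peaked_teacher = [8.0, 0.0, 0.0, 0.0]  # made-up logits, confident teacher
smooth_teacher = [1.0, 0.8, 0.6, 0.4]  # made-up logits, uncertain teacher
w_peaked = power_sum(softmax(peaked_teacher), 1.0 / T)
w_smooth = power_sum(softmax(smooth_teacher), 1.0 / T)
print(w_peaked < w_smooth)  # the smoother teacher output gets the larger weight

loss = wttm_loss([0.5, 0.2, -0.1, 0.0], smooth_teacher, label=0, T=T, beta=2.0)
```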

5 Experiments

5.1 Experimental Settings

We benchmark TTM and WTTM on two prevailing image classification datasets, namely CIFAR-100 and ImageNet (Deng et al., 2009).

CIFAR-100 contains 60k $32\times 32$ color images of 100 classes, with 600 images per class, and it’s further split into 50k training images and 10k test images. For fair comparison, we adopt the same training strategy and teacher models as CRD (Tian et al., 2019). Also, following CRD, we generate comprehensive experiment results for 13 teacher-student pairs including both same-architecture distillation and different-architecture distillation, and the tested model architectures are VGG (Simonyan & Zisserman, 2014), ResNet (He et al., 2016), WideResNet (Zagoruyko & Komodakis, 2016b), MobileNetV2 (Sandler et al., 2018), and ShuffleNet (Zhang et al., 2018; Ma et al., 2018).

ImageNet is a large-scale image dataset consisting of over 1.2 million training images and 50k validation images from 1000 classes. For experiments on ImageNet, we employ torchdistill (Matsubara, 2021) library and follow all the standard settings. The tested model architectures are ResNet and MobileNet (Howard et al., 2017).

Note that we list the $T$ and $\beta$ values of all experiments in A.4 for reproducibility.

5.2 Main Results

Results on CIFAR-100. The pure performances of TTM and WTTM are shown in Table 1 and Table 3. We compare them with feature-based methods FitNet (Romero et al., 2014), AT (Zagoruyko & Komodakis, 2016a), VID (Ahn et al., 2019), RKD (Park et al., 2019), PKT (Passalis & Tefas, 2018), CRD (Tian et al., 2019), and logits-based methods such as KD, DIST (Huang et al., 2022) and DKD (Zhao et al., 2022). In general, TTM and WTTM provide outstanding performance among all the compared methods, and WTTM is better than TTM in most cases. Note that TTM always outperforms KD, confirming our theoretic analysis in Section 3.

To further improve the performance, we combine WTTM loss with 2 existing distillation losses respectively, namely CRD and ITRD (Miles et al., 2021), and the resulting performance is shown in Table 2 and Table 4. For the combined methods, we directly adopt the optimal hyperparameters specified in the original papers without tuning (see A.5 for details). From the tables, we can see that the performance of the combined loss is always better than the pure performances of both ingredient losses, meaning that our proposed WTTM loss is orthogonal to other losses like CRD and ITRD. More importantly, the performance of WTTM aided by CRD and ITRD is consistently better than all other methods over all teacher-student pairs, achieving the state-of-the-art accuracy.

Table 1: Top-1 accuracy (%) on CIFAR-100 of student models trained with various distillation methods, including both feature-based methods and logits-based methods. Each teacher-student pair has the same architecture. We highlight the best results in bold, and the second best results with underscores. Note that some results of DIST (for the models excluded in their paper) are produced by our reimplementation. Average over 5 runs.

| Teacher → Student | WRN-40-2 → WRN-16-2 | WRN-40-2 → WRN-40-1 | resnet56 → resnet20 | resnet110 → resnet20 | resnet110 → resnet32 | resnet32x4 → resnet8x4 | vgg13 → vgg8 |
|---|---|---|---|---|---|---|---|
| Teacher | 75.61 | 75.61 | 72.34 | 74.31 | 74.31 | 79.42 | 74.64 |
| Student | 73.26 | 71.98 | 69.06 | 69.06 | 71.14 | 72.50 | 70.36 |
| Feature-based | | | | | | | |
| FitNet | 73.58 | 72.24 | 69.21 | 68.99 | 71.06 | 73.50 | 71.02 |
| AT | 74.08 | 72.77 | 70.55 | 70.22 | 72.31 | 73.44 | 71.43 |
| VID | 74.11 | 73.30 | 70.38 | 70.16 | 72.61 | 73.09 | 71.23 |
| RKD | 73.35 | 72.22 | 69.61 | 69.25 | 71.82 | 71.90 | 71.48 |
| PKT | 74.54 | 73.45 | 70.34 | 70.25 | 72.61 | 73.64 | 72.88 |
| CRD | 75.48 | 74.14 | 71.16 | 71.46 | 73.48 | 75.51 | 73.94 |
| Logits-based | | | | | | | |
| KD | 74.92 | 73.54 | 70.66 | 70.67 | 73.08 | 73.33 | 72.98 |
| DIST | 75.51 | 74.73 | 71.75 | 71.65 | 73.69 | 76.31 | 73.89 |
| DKD | 76.24 | 74.81 | 71.97 | n/a | 74.11 | 76.32 | 74.68 |
| TTM | 76.23 | 74.32 | 71.83 | 71.46 | 73.97 | 76.17 | 74.33 |
| WTTM | 76.37 | 74.58 | 71.92 | 71.67 | 74.13 | 76.06 | 74.44 |

Table 2: Top-1 accuracy (%) on CIFAR-100. Each teacher-student pair has the same architecture. Average over 5 runs (3 runs for ITRD and WTTM+ITRD following the original paper of ITRD).

| Teacher → Student | WRN-40-2 → WRN-16-2 | WRN-40-2 → WRN-40-1 | resnet56 → resnet20 | resnet110 → resnet20 | resnet110 → resnet32 | resnet32x4 → resnet8x4 | vgg13 → vgg8 |
|---|---|---|---|---|---|---|---|
| CRD | 75.48 | 74.14 | 71.16 | 71.46 | 73.48 | 75.51 | 73.94 |
| ITRD | 76.12 | 75.18 | 71.47 | 71.99 | 74.26 | 76.19 | 74.93 |
| WTTM | 76.37 | 74.58 | 71.92 | 71.67 | 74.13 | 76.06 | 74.44 |
| WTTM+CRD | 76.61 | 74.94 | 72.20 | 72.13 | 74.52 | 76.65 | 74.71 |
| WTTM+ITRD | 76.65 | 75.34 | 72.16 | 72.20 | 74.36 | 77.36 | 75.13 |

Table 3: Top-1 accuracy (%) on CIFAR-100. Each teacher-student pair has different architectures. Note that some results of DIST (for the models excluded in their paper) are produced by our reimplementation. Average over 3 runs.

| Teacher → Student | vgg13 → MobileNetV2 | ResNet50 → MobileNetV2 | ResNet50 → vgg8 | resnet32x4 → ShuffleNetV1 | resnet32x4 → ShuffleNetV2 | WRN-40-2 → ShuffleNetV1 |
|---|---|---|---|---|---|---|
| Teacher | 74.64 | 79.34 | 79.34 | 79.42 | 79.42 | 75.61 |
| Student | 64.6 | 64.6 | 70.36 | 70.5 | 71.82 | 70.5 |
| Feature-based | | | | | | |
| FitNet | 64.14 | 63.16 | 70.69 | 73.59 | 73.54 | 73.73 |
| AT | 59.40 | 58.58 | 71.84 | 71.73 | 72.73 | 73.32 |
| VID | 65.56 | 67.57 | 70.30 | 73.38 | 73.40 | 73.61 |
| RKD | 64.52 | 64.43 | 71.50 | 72.28 | 73.21 | 72.21 |
| PKT | 67.13 | 66.52 | 73.01 | 74.10 | 74.69 | 73.89 |
| CRD | 69.73 | 69.11 | 74.30 | 75.11 | 75.65 | 76.05 |
| Logits-based | | | | | | |
| KD | 67.37 | 67.35 | 73.81 | 74.07 | 74.45 | 74.83 |
| DIST | 68.50 | 68.66 | 74.11 | 76.34 | 77.35 | 76.40 |
| DKD | 69.71 | 70.35 | n/a | 76.45 | 77.07 | 76.70 |
| TTM | 68.98 | 69.24 | 74.87 | 74.18 | 76.57 | 75.39 |
| WTTM | 69.16 | 69.59 | 74.82 | 74.37 | 76.55 | 75.42 |

Table 4: Top-1 accuracy (%) on CIFAR-100. Each teacher-student pair has different architectures. Average over 3 runs.

| Teacher → Student | vgg13 → MobileNetV2 | ResNet50 → MobileNetV2 | ResNet50 → vgg8 | resnet32x4 → ShuffleNetV1 | resnet32x4 → ShuffleNetV2 | WRN-40-2 → ShuffleNetV1 |
|---|---|---|---|---|---|---|
| CRD | 69.73 | 69.11 | 74.30 | 75.11 | 75.65 | 76.05 |
| ITRD | 70.39 | 71.41 | 75.71 | 76.91 | 77.40 | 77.35 |
| WTTM | 69.16 | 69.59 | 74.82 | 74.37 | 76.55 | 75.42 |
| WTTM+CRD | 70.30 | 70.84 | 75.30 | 75.82 | 77.04 | 76.86 |
| WTTM+ITRD | 70.70 | 71.56 | 76.00 | 77.03 | 77.68 | 77.44 |

Results on ImageNet. In Table 5, we demonstrate the performance of WTTM compared to many competitive distillation methods such as KD, CRD, SRRL (Yang et al., 2020), ReviewKD (Chen et al., 2021), ITRD (Miles et al., 2021), DKD (Zhao et al., 2022), DIST (Huang et al., 2022), KD++ (Wang et al., 2023), NKD (Yang et al., 2023b), CTKD (Li et al., 2023c), and KD-Zero (Li et al., 2023a). It’s shown that WTTM achieves outstanding performance on both teacher-student pairs.

Table 5: Top-1 accuracy (%) on ImageNet. The adopted teacher models are released by PyTorch (Paszke et al., 2019).

| Teacher | Student | KD | CRD | SRRL | ReviewKD | ITRD | DKD | DIST | KD++ | NKD | CTKD | KD-Zero | WTTM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-34 (73.31) | ResNet-18 (69.76) | 70.66 | 71.17 | 71.73 | 71.61 | 71.68 | 71.70 | 72.07 | 71.98 | 71.96 | 71.51 | 72.17 | 72.19 |
| ResNet-50 (76.16) | MobileNet (68.87) | 70.50 | 71.37 | 72.49 | 72.56 | n/a | 72.05 | 73.24 | 72.77 | 72.58 | n/a | 73.02 | 73.09 |

5.3 Extensions

To provide more comprehensive understanding and deeper insight about TTM and WTTM, we include 4 points of extension in this subsection, demonstrating some promising properties of WTTM and supporting our methodology with some analysis.

Distill without $\mathcal{L}_{CE}$. In Table 6, we compare the performance of WTTM without $\mathcal{L}_{CE}$ to the performance of KD with $\mathcal{L}_{CE}$. We find that even in this unfair setting, WTTM can still outperform KD in most cases. This is of great value in the scenario where the ground-truth labels of the transfer set are not available.

Table 6: Comparison between WTTM without $\mathcal{L}_{CE}$ and KD with $\mathcal{L}_{CE}$ on CIFAR-100. Accuracy is averaged over 5 runs.

| Teacher → Student | WRN-40-2 → WRN-16-2 | WRN-40-2 → WRN-40-1 | resnet56 → resnet20 | resnet110 → resnet20 | resnet110 → resnet32 | resnet32x4 → resnet8x4 | vgg13 → vgg8 |
|---|---|---|---|---|---|---|---|
| KD w/ CE | 74.92 | 73.54 | 70.66 | 70.67 | 73.08 | 73.33 | 72.98 |
| WTTM w/o CE | 75.11 | 73.16 | 70.95 | 70.71 | 73.21 | 72.94 | 74.04 |

Distill from better teachers. Results in Table 7 show that the student can benefit more from a better teacher when distilling with WTTM. We observe that as the teacher model grows better, other distillation methods like KD and DIST cannot guarantee consistent improvement on the student side. In contrast, when we apply WTTM, the performance of the student is strictly increasing and consistently better than other distillation methods as the teacher becomes better and better.

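Table 7: Performance of ResNet-18 on ImageNet distilled from different teachers.

| Teacher | Student | Teacher acc. | Student acc. | KD | DIST | WTTM |
|---|---|---|---|---|---|---|
| ResNet-34 | ResNet-18 | 73.31 | 69.76 | 71.21 | 72.07 | 72.19 |
| ResNet-50 | ResNet-18 | 76.13 | 69.76 | 71.35 | 72.12 | 72.26 |
| ResNet-101 | ResNet-18 | 77.37 | 69.76 | 71.09 | 72.08 | 72.34 |
| ResNet-152 | ResNet-18 | 78.31 | 69.76 | 71.12 | 72.24 | 72.39 |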

Regularization effect of TTM and WTTM. Following our methodology, TTM and WTTM are able to embed strong regularization into the distillation process, so it’s expected that student’s output probability distributions $q$ resulting from TTM and WTTM should be much smoother than those resulting from KD. To validate this, we track the behavior of the average Shannon entropy of $q$ for KD, TTM and WTTM respectively during training over 3 teacher-student pairs used in CIFAR-100 experiments, shown in Fig. 1. Comparatively, students trained with TTM always have significantly larger entropy than those trained with KD. This is attributable to the Rényi entropy regularizer introduced in TTM when we remove the temperature scaling on the student side from KD. Moreover, students trained with WTTM always have slightly larger entropy than those trained with TTM, owing to the sample-adaptive weighting coefficient $U_{\frac{1}{T}}(p^{t})$.

[Figure 1 panels: (a) WRN-40-2 → WRN-40-1; (b) vgg13 → vgg8; (c) ResNet50 → MobileNetV2]
Figure 1: Average $H(q)$ of 3 teacher-student pairs during training. For fair comparison, we use the same temperature $T=4$ for KD, TTM and WTTM. The $\lambda$ for KD is 0.9, so the $\beta$ for TTM is 36, computed by Eq. (19), in order to maintain the same ratio between $H(y,q)$ and $H(p^{t}_{T},q_{T})$ as KD. As for WTTM, $\beta=36/\bar{U}$, where $\bar{U}$ is the average of $U_{\frac{1}{T}}(p^{t})$ over all samples.
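For reference, the value of $\beta$ quoted in the caption of Fig. 1 can be reproduced by substituting $\lambda = 0.9$ and $T = 4$ into Eq. (19). The short worked computation below also checks, under the forms (16) and (17), that the resulting weight on $D(p^{t}_{T}||q_{T})$ relative to $H(y,q)$ is the same for TTM and KD.

```latex
\beta = \frac{\lambda}{1-\lambda}\,T = \frac{0.9}{1-0.9}\times 4 = 36,
\qquad
\beta T = 36 \times 4 = 144 = \frac{0.9 \times 4^{2}}{1-0.9} = \frac{\lambda T^{2}}{1-\lambda}.
```

For WTTM, dividing by $\bar{U}$ makes the average effective coefficient $\beta\,\bar{U}$ equal to the same value 36.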

WTTM facilitates more accurate teacher matching. A closer look at TTM and WTTM helps shed light on why WTTM generally performs better than TTM. To this end, we track the behavior of the average $D(p^{t}_{T}||q)$ for TTM and WTTM during training over the same 3 teacher-student pairs as above, shown in Fig. 2. In order to reflect the behavior of pure distillation, we remove $\mathcal{L}_{CE}$ from both WTTM and TTM. It’s clear from the plots that WTTM always leads to a smaller gap between $p^{t}_{T}$ and $q$ than TTM, demonstrating more accurate transformed teacher matching, which is the reason behind the performance improvement.

[Figure 2 panels: (a) WRN-40-2 → WRN-40-1; (b) vgg13 → vgg8; (c) ResNet50 → MobileNetV2]
Figure 2: Average $D(p^{t}_{T}||q)$ of 3 teacher-student pairs during training. For each pair, the same $T$ is adopted in TTM and WTTM.

6 Conclusion

The paper systematically studies a variant of KD without temperature scaling on the student side, dubbed TTM. This slight modification gives rise to a Rényi entropy regularizer which improves the performance of the standard KD. Furthermore, we propose a sample-adaptive version of TTM, dubbed WTTM, to achieve more significant improvement. Extensive experimental results are presented to show the superiority of TTM and WTTM over other distillation methods on two image classification datasets. With almost the same training cost as KD, WTTM demonstrates state-of-the-art performance, better than most feature-based distillation methods with high computational complexity.

Acknowledgments

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grant RGPIN203035-22, and by the Canada Research Chairs Program.

References

  • Ahn et al. (2019) Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  9163–9171, 2019.
  • Buciluǎ et al. (2006) Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp.  535–541, 2006.
  • Chen et al. (2021) Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5008–5017, 2021.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.  248–255. Ieee, 2009.
  • Dong et al. (2023) Peijie Dong, Lujun Li, and Zimian Wei. Diswot: Student architecture search for distillation without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11898–11908, 2023.
  • Hao et al. (2023) Zhiwei Hao, Jianyuan Guo, Kai Han, Han Hu, Chang Xu, and Yunhe Wang. Vanillakd: Revisit the power of vanilla knowledge distillation from small scale to large scale. arXiv preprint arXiv:2305.15781, 2023.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Huang et al. (2022) Tao Huang, Shan You, Fei Wang, Chen Qian, and Chang Xu. Knowledge distillation from a stronger teacher. Advances in Neural Information Processing Systems, 35:33716–33727, 2022.
  • Li (2022) Lujun Li. Self-regulated feature learning via teacher-free feature distillation. In European Conference on Computer Vision, pp.  347–363. Springer, 2022.
  • Li & Jin (2022) Lujun Li and Zhe Jin. Shadow knowledge distillation: Bridging offline and online knowledge transfer. Advances in Neural Information Processing Systems, 35:635–649, 2022.
  • Li et al. (2023a) Lujun Li, Peijie Dong, Anggeng Li, Zimian Wei, and Yang Ya. Kd-zero: Evolving knowledge distiller for any teacher-student pairs. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.
  • Li et al. (2023b) Lujun Li, Peijie Dong, Zimian Wei, and Ya Yang. Automated knowledge distillation via monte carlo tree search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  17413–17424, 2023b.
  • Li et al. (2023c) Zheng Li, Xiang Li, Lingfeng Yang, Borui Zhao, Renjie Song, Lei Luo, Jun Li, and Jian Yang. Curriculum temperature for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.  1504–1512, 2023c.
  • Liu et al. (2023) Xiaolong Liu, Lujun Li, Chao Li, and Anbang Yao. Norm: Knowledge distillation via n-to-one representation matching. arXiv preprint arXiv:2305.13803, 2023.
  • Ma et al. (2018) Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pp.  116–131, 2018.
  • Matsubara (2021) Yoshitomo Matsubara. torchdistill: A modular, configuration-driven framework for knowledge distillation. In International Workshop on Reproducible Research in Pattern Recognition, pp.  24–44. Springer, 2021.
  • Menon et al. (2021) Aditya K Menon, Ankit Singh Rawat, Sashank Reddi, Seungyeon Kim, and Sanjiv Kumar. A statistical perspective on distillation. In International Conference on Machine Learning, pp.  7632–7642. PMLR, 2021.
  • Miles et al. (2021) Roy Miles, Adrian Lopez Rodriguez, and Krystian Mikolajczyk. Information theoretic representation distillation. arXiv preprint arXiv:2112.00459, 2021.
  • Mironov (2017) Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF), pp.  263–275. IEEE, 2017.
  • Park et al. (2019) Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  3967–3976, 2019.
  • Passalis & Tefas (2018) Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  268–284, 2018.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
  • Rényi (1961) Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, pp.  547–562. University of California Press, 1961.
  • Romero et al. (2014) Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • Ross (2019) Sheldon Ross. A First Course in Probability. Pearson Higher Ed, 2019.
  • Sandler et al. (2018) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4510–4520, 2018.
  • Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2818–2826, 2016.
  • Tian et al. (2019) Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In International Conference on Learning Representations, 2019.
  • Touvron et al. (2021) Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pp.  10347–10357. PMLR, 2021.
  • Wang et al. (2023) Yuzhu Wang, Lechao Cheng, Manni Duan, Yongheng Wang, Zunlei Feng, and Shu Kong. Improving knowledge distillation via regularizing feature norm and direction. arXiv preprint arXiv:2305.17007, 2023.
  • Yang et al. (2023a) En-Hui Yang, Shayan Mohajer Hamidi, Linfeng Ye, Renhao Tan, and Beverly Yang. Conditional mutual information constrained deep learning for classification. arXiv preprint arXiv:2309.09123, 2023a.
  • Yang et al. (2020) Jing Yang, Brais Martinez, Adrian Bulat, and Georgios Tzimiropoulos. Knowledge distillation via softmax regression representation learning. In International Conference on Learning Representations, 2020.
  • Yang et al. (2022) Zhendong Yang, Zhe Li, Ailing Zeng, Zexian Li, Chun Yuan, and Yu Li. Vitkd: Practical guidelines for vit feature knowledge distillation. arXiv preprint arXiv:2209.02432, 2022.
  • Yang et al. (2023b) Zhendong Yang, Ailing Zeng, Zhe Li, Tianke Zhang, Chun Yuan, and Yu Li. From knowledge distillation to self-knowledge distillation: A unified approach with normalized loss and customized soft labels. arXiv preprint arXiv:2303.13005, 2023b.
  • Ye et al. (2024) Linfeng Ye, Shayan Mohajer Hamidi, Renhao Tan, and En-Hui Yang. Bayes conditional distribution estimation for knowledge distillation based on conditional mutual information. arXiv preprint arXiv:2401.08732, 2024.
  • Yu et al. (2020) Shujian Yu, Kristoffer Wickstrøm, Robert Jenssen, and Jose C Principe. Understanding convolutional neural networks with information theory: An initial exploration. IEEE transactions on neural networks and learning systems, 32(1):435–442, 2020.
  • Yuan et al. (2020) Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  3903–3911, 2020.
  • Zagoruyko & Komodakis (2016a) Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016a.
  • Zagoruyko & Komodakis (2016b) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016b.
  • Zhang et al. (2018) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  6848–6856, 2018.
  • Zhang & Sabuncu (2020) Zhilu Zhang and Mert Sabuncu. Self-distillation as instance-specific label smoothing. Advances in Neural Information Processing Systems, 33:2184–2195, 2020.
  • Zhao et al. (2022) Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp.  11953–11962, 2022.

Appendix A Appendix

A.1 Empirical Analysis on the LS Perspective of KD

In support of our claims in Subsection 2.3, we carry out a simple empirical analysis in this section. Specifically, we train four resnet20 models on the CIFAR-100 dataset with different objectives and show the histograms of the Shannon entropy of their output probability distributions $q$ in Figure 3.

From Figures 3(a) and 3(b), it is clear that the Shannon entropy of $q$ in the case of LS is significantly larger than its counterpart in the case of ERM, which shows the regularization effect of LS.

Comparing Figure 3(c) with Figure 3(a), it is also clear that the Shannon entropy of $q$ in the case of KD with $T=1$ is significantly larger than its counterpart in the case of ERM, which confirms that KD can indeed be regarded as sample-adaptive LS when $T=1$.

However, when $T>1$, this perspective no longer holds. To demonstrate this, we also trained resnet20 on the CIFAR-100 dataset with KD using $T=4$, corresponding to Figure 3(d). Comparing Figure 3(d) with Figure 3(a), we see that the average Shannon entropy in the case of KD with $T=4$ is even significantly lower than in the ERM case, an effect exactly opposite to that of LS. This confirms that when $T>1$, KD can no longer be regarded as sample-adaptive LS.

Figure 3: Entropy histograms for resnet20 trained with (a) $\mathcal{L}_{CE}$, (b) $\mathcal{L}_{LS}$ with $\epsilon=0.5$, (c) $\mathcal{L}_{KD}$ with $T=1$, and (d) $\mathcal{L}_{KD}$ with $T=4$. For fair comparison, the same $\lambda=0.9$ is adopted in both KD experiments with different temperatures.
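
As a rough illustration of how such entropy histograms could be produced, the sketch below computes the Shannon entropy of each softmax output and bins the values. It is not the authors' script: the function names, the number of bins, and the choice to scale the bin range to the maximum entropy $\log K$ for $K$ classes are all assumptions made here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def shannon_entropy(q):
    """Shannon entropy of q in nats; zero-probability entries are skipped."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def entropy_histogram(all_logits, num_bins=10):
    """Count samples per entropy bin, with bins spanning [0, log(num_classes)]."""
    max_entropy = math.log(len(all_logits[0]))
    counts = [0] * num_bins
    for logits in all_logits:
        h = shannon_entropy(softmax(logits))
        # Clamp so that h == max_entropy falls into the last bin.
        idx = min(int(h / max_entropy * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts
```

Under this sketch, a model with more confident predictions places more mass in the low-entropy bins, while a regularized model shifts mass toward the high-entropy bins.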

A.2 Discussion on the Generalized Transform

In this section, we provide more discussion on the generalized transform proposed in Subsection 3.1. As mentioned in Subsection 3.1, any specific transform can be described by its associated mapping $f$. For visualization, we show some examples of the mapping $f$ in Fig. 4(a). The power function with exponent $\gamma\in(0,1)$ used in TTM and WTTM is visualized in Fig. 4(b).

Figure 4: (a) Various point-wise mappings. (b) Power functions with different exponents $\gamma$.

The reason why we only consider the power function in the main text is that the resulting power transform is equivalent to temperature scaling, which helps us to reveal the Rényi entropy regularizer in the subsequent derivation. However, it’s worth mentioning that the generalized transform is much more than a tool used in our derivations.

Currently, we use the power transform (temperature scaling) to smooth the teacher’s output distributions $p$ in TTM and WTTM, following the convention in standard KD. However, it’s possible that some other transforms could lead to better distillation than the power transform. Intuitively, the mappings $f$ associated with such transforms should satisfy 3 properties:

  • $f(0)=0$ and $f(1)=1$. A deterministic prediction shouldn’t be modified by the transform.

  • Non-decreasing. A non-decreasing mapping avoids ruining the order information in $p$.

  • $f(p_{i})>p_{i}$ for $p_{i}\in(0,1)$. To improve the smoothness of $p$, we need a mapping above the identity, since it expands the dynamic range of low probability values and compresses the dynamic range of high probability values. As a result, after the normalization in Eq. (6), small probability values will be increased while large probability values will be decreased, achieving the goal of smoothing a distribution.
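
The three properties above can be checked numerically for a candidate mapping. The sketch below is only an illustration under stated assumptions: the names `transform` and `satisfies_properties` are hypothetical, the normalization is assumed to be division by the sum of the mapped values, and the check is performed on a finite grid, so it is a sanity test and not a proof.

```python
def transform(p, f):
    """Apply a point-wise mapping f to p, then renormalize to sum to 1."""
    mapped = [f(pi) for pi in p]
    total = sum(mapped)
    return [m / total for m in mapped]

def satisfies_properties(f, grid_size=1000, tol=1e-12):
    """Check the three suggested properties on an evenly spaced grid over [0, 1]."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    values = [f(x) for x in grid]
    # Property 1: endpoints are fixed, f(0) = 0 and f(1) = 1.
    fixes_endpoints = abs(values[0]) <= tol and abs(values[-1] - 1.0) <= tol
    # Property 2: f is non-decreasing.
    non_decreasing = all(b >= a - tol for a, b in zip(values, values[1:]))
    # Property 3: f lies strictly above the identity on the open interval (0, 1).
    above_identity = all(v > x for x, v in zip(grid[1:-1], values[1:-1]))
    return fixes_endpoints and non_decreasing and above_identity
```

For instance, the power function `lambda x: x ** 0.5` passes all three checks, whereas the identity and `lambda x: x ** 2` fail the third one. Applying `transform` with the square-root mapping to `[0.9, 0.05, 0.05]` lowers the largest entry and raises the small ones, which is the smoothing effect described above.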

Following these suggested properties, some potential transforms can be developed in place of the power transform, while we leave this topic for future work.

A.3 Implementation of TTM and WTTM

In this section, we provide the pseudo-code for TTM and WTTM in a PyTorch-like style, shown in Algorithm 1. Both TTM and WTTM are straightforward to implement.

Algorithm 1 PyTorch-style pseudo-code for TTM and WTTM.
    
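import torch
import torch.nn.functional as F

# y_s: student output logits, shape (batch, num_classes)
# y_t: teacher output logits, shape (batch, num_classes)
# r: the exponent for power transform
p_s = F.log_softmax(y_s, dim=1)             # student log-probabilities
p_t = torch.pow(F.softmax(y_t, dim=1), r)   # teacher probabilities raised to the power r
U = torch.sum(p_t, dim=1)                   # power sum
p_t = p_t / U.unsqueeze(1)                  # power transformed teacher
# Per-sample KL divergence; F.kl_div expects log-probabilities as its first argument
KL = torch.sum(F.kl_div(p_s, p_t, reduction='none'), dim=1)

# TTM
ttm_loss = torch.mean(KL)
# WTTM
wttm_loss = torch.mean(U * KL)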

A.4 Hyperparameters

We list the fine-tuned $\gamma$ and $\beta$ in Tables 8, 9 and 10 covering all experiments, where $\gamma=1/T$. Because we implement temperature scaling with the equivalent power transform, the tuning is carried out over the exponent $\gamma$ instead of the temperature $T$.
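
The equivalence between temperature scaling and the power transform with $\gamma=1/T$ can be verified numerically. The following is a small self-contained check, not part of the released code; the function names and the example logits are chosen here purely for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def power_transform(p, gamma):
    """Raise each probability to the power gamma and renormalize."""
    powered = [pi ** gamma for pi in p]
    total = sum(powered)
    return [v / total for v in powered]

logits = [2.0, 0.5, -1.0]   # hypothetical teacher logits
T = 4.0
via_temperature = softmax(logits, temperature=T)
via_power = power_transform(softmax(logits), gamma=1.0 / T)
max_difference = max(abs(a - b) for a, b in zip(via_temperature, via_power))
```

The two results agree up to floating-point error because $\mathrm{softmax}(z)_i^{\gamma}\propto e^{\gamma z_i}$, so renormalizing recovers $\mathrm{softmax}(z/T)$ when $\gamma=1/T$.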

Table 8: Hyperparameters for same-architecture distillation on CIFAR-100.
| Teacher → Student | WRN-40-2 → WRN-16-2 | WRN-40-2 → WRN-40-1 | resnet56 → resnet20 | resnet110 → resnet20 | resnet110 → resnet32 | resnet32x4 → resnet8x4 | vgg13 → vgg8 |
|---|---|---|---|---|---|---|---|
| TTM | $\gamma=0.1$, $\beta=101$ | $\gamma=0.1$, $\beta=76$ | $\gamma=0.3$, $\beta=7$ | $\gamma=0.2$, $\beta=8$ | $\gamma=0.1$, $\beta=33$ | $\gamma=0.1$, $\beta=100$ | $\gamma=0.1$, $\beta=45$ |
| WTTM | $\gamma=0.1$, $\beta=4$ | $\gamma=0.1$, $\beta=3$ | $\gamma=0.3$, $\beta=1.5$ | $\gamma=0.2$, $\beta=2$ | $\gamma=0.1$, $\beta=1.5$ | $\gamma=0.1$, $\beta=3$ | $\gamma=0.1$, $\beta=2.25$ |
| WTTM+CRD | $\gamma=0.1$, $\beta=4$ | $\gamma=0.1$, $\beta=2$ | $\gamma=0.3$, $\beta=0.6$ | $\gamma=0.2$, $\beta=1.4$ | $\gamma=0.2$, $\beta=1$ | $\gamma=0.2$, $\beta=4$ | $\gamma=0.2$, $\beta=4$ |
| WTTM+ITRD | $\gamma=0.3$, $\beta=6$ | $\gamma=0.4$, $\beta=0.08$ | $\gamma=0.5$, $\beta=5$ | $\gamma=0.3$, $\beta=1.5$ | $\gamma=0.3$, $\beta=0.015$ | $\gamma=0.1$, $\beta=1.5$ | $\gamma=0.1$, $\beta=0.5$ |
| WTTM w/o CE | $\gamma=0.2$ | $\gamma=0.5$ | $\gamma=0.6$ | $\gamma=0.4$ | $\gamma=0.4$ | $\gamma=0.5$ | $\gamma=0.2$ |
Table 9: Hyperparameters for different-architecture distillation on CIFAR-100.
| Teacher → Student | vgg13 → MobileNetV2 | ResNet50 → MobileNetV2 | ResNet50 → vgg8 | resnet32x4 → ShuffleNetV1 | resnet32x4 → ShuffleNetV2 | WRN-40-2 → ShuffleNetV1 |
|---|---|---|---|---|---|---|
| TTM | $\gamma=0.2$, $\beta=16$ | $\gamma=0.2$, $\beta=20$ | $\gamma=0.1$, $\beta=70$ | $\gamma=0.2$, $\beta=12$ | $\gamma=0.4$, $\beta=40$ | $\gamma=0.3$, $\beta=8$ |
| WTTM | $\gamma=0.2$, $\beta=3$ | $\gamma=0.2$, $\beta=5$ | $\gamma=0.1$, $\beta=2$ | $\gamma=0.2$, $\beta=1.4$ | $\gamma=0.4$, $\beta=16$ | $\gamma=0.3$, $\beta=3$ |
| WTTM+CRD | $\gamma=0.3$, $\beta=4.2$ | $\gamma=0.3$, $\beta=3$ | $\gamma=0.1$, $\beta=3$ | $\gamma=0.2$, $\beta=0.4$ | $\gamma=0.4$, $\beta=12$ | $\gamma=0.2$, $\beta=0.16$ |
| WTTM+ITRD | $\gamma=0.3$, $\beta=0.03$ | $\gamma=0.2$, $\beta=0.02$ | $\gamma=0.1$, $\beta=1$ | $\gamma=0.3$, $\beta=0.6$ | $\gamma=0.4$, $\beta=0.8$ | $\gamma=0.1$, $\beta=0.2$ |
Table 10: Hyperparameters for ImageNet experiments.
| Teacher | Student | WTTM |
|---|---|---|
| ResNet-34 | ResNet-18 | $\gamma=0.8$, $\beta=1.6$ |
| ResNet-50 | | |
| ResNet-101 | | |
| ResNet-152 | | |
| ResNet-50 | MobileNet | $\gamma=0.7$, $\beta=3.5$ |

A.5 Combination of Distillation Losses

In this section, we clarify how we combine $\mathcal{L}_{WTTM}$ with other distillation losses in our experiments. We simply add another distillation component to $\mathcal{L}_{WTTM}$ with a multiplier. The total objective is

$$\mathcal{L}_{tot}=H(y,q)+\beta U_{\frac{1}{T}}(p^{t})\cdot D(p^{t}_{T}\|q)+\mu\mathcal{L}_{dist} \tag{23}$$

where $\mu$ is a balancing weight, and $\mathcal{L}_{dist}$ is the additional distillation component, which can be CRD or ITRD in our experiments.

When we combine WTTM with CRD, $\mu$ is always set to 0.8, which is the optimal value used in the original paper.

When we combine WTTM with ITRD, $\mu$ is always set to 1. However, the ITRD distillation loss itself is a combination of two components, as follows:

$$\mathcal{L}_{dist}=\beta_{corr}\mathcal{L}_{corr}+\beta_{mi}\mathcal{L}_{mi} \tag{24}$$

where $\beta_{corr}$ and $\beta_{mi}$ are two balancing weights within the ITRD distillation loss. In our experiments, we always select the optimal $\beta_{corr}$ and $\beta_{mi}$ values specified in the original paper. Specifically, $\beta_{corr}=2$ and $\beta_{mi}=0$ for 3 teacher-student pairs, namely ResNet50 → MobileNetV2, ResNet50 → vgg8 and WRN-40-2 → ShuffleNetV1, while $\beta_{corr}=2$ and $\beta_{mi}=1$ for all the other 10 teacher-student pairs. Note that there is another inherent hyperparameter $\alpha_{it}$ within ITRD, which is selected as 1.01 for same-architecture distillation and 1.5 for different-architecture distillation, following the suggestion in the original paper.
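
To make the weighting in Eqs. (23) and (24) concrete, the sketch below assembles the total objective from already-computed scalar terms. It is only an illustration: the function and argument names are hypothetical, the individual loss values are assumed to be computed elsewhere (for example by CRD or ITRD code), and the numbers in the usage lines are made up.

```python
def itrd_distillation_loss(l_corr, l_mi, beta_corr=2.0, beta_mi=1.0):
    """Eq. (24): weighted sum of the two ITRD components."""
    return beta_corr * l_corr + beta_mi * l_mi

def total_loss(cross_entropy, power_sum, kl, l_dist, beta, mu):
    """Eq. (23): CE + beta * U * D(p^t_T || q) + mu * L_dist for one sample."""
    return cross_entropy + beta * power_sum * kl + mu * l_dist

# Hypothetical per-sample values, for illustration only.
l_dist = itrd_distillation_loss(l_corr=0.1, l_mi=0.2, beta_corr=2.0, beta_mi=1.0)
loss = total_loss(cross_entropy=1.0, power_sum=2.0, kl=0.5, l_dist=l_dist, beta=3.0, mu=1.0)
```

Setting `beta_mi=0.0` in this sketch corresponds to the three pairs for which only the correlation component of ITRD is used.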

A.6 Future Work

This work provides multiple directions for our future research:

  • From Eq. (15), we know that the ratio between the distillation term $D(p^{t}_{T}\|q_{T})$ and the regularizer $H_{\frac{1}{T}}(q)$ in TTM is determined by $T$. Also, the order of the Rényi entropy is bound to be $1/T$. However, these constraints are not necessary. In future work, we can directly combine the standard KD with a Rényi entropy regularizer while setting the balancing weight and the order of the Rényi entropy as tunable hyperparameters.

  • Given the generalized transform framework and related discussion in A.2, other transforms can be proposed in place of the power transform (temperature scaling) used in TTM and WTTM.

  • The selection of the sample-adaptive weight in WTTM can be analyzed systematically in order to find the optimal one.
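
As a rough sketch of the first direction, the code below computes the Rényi entropy of order $\alpha$ using the standard definition $H_{\alpha}(q)=\frac{1}{1-\alpha}\log\sum_{i}q_{i}^{\alpha}$ and attaches it to a precomputed KD term with a free weight. This is an assumption-laden illustration, not a method proposed or evaluated in the paper: the function names are hypothetical, and subtracting the entropy (so that higher entropy is rewarded) is a sign convention assumed here.

```python
import math

def renyi_entropy(q, alpha):
    """Renyi entropy of order alpha > 0 in nats; alpha = 1 gives Shannon entropy."""
    if abs(alpha - 1.0) < 1e-9:
        return -sum(qi * math.log(qi) for qi in q if qi > 0)
    power_sum = sum(qi ** alpha for qi in q if qi > 0)
    return math.log(power_sum) / (1.0 - alpha)

def regularized_kd_loss(kd_term, q, weight, alpha):
    """KD term minus a weighted Renyi entropy of the student output q.

    Both `weight` and `alpha` are free hyperparameters in this sketch.
    """
    return kd_term - weight * renyi_entropy(q, alpha)
```

In this form the order `alpha` and the balancing `weight` are decoupled from the temperature, which is the flexibility the first item above refers to.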

A.7 Related Work

In recent years, a variety of works have been proposed to advance the methodology of KD and its application to related fields. Huang et al. (2022) proposed a correlation-based loss capturing the inter-class and intra-class relations from the teacher explicitly. Yang et al. (2023b) unified KD and self distillation by decomposing and reorganizing the vanilla KD loss into a normalized KD (NKD) loss and proposed a novel self distillation method based on it. Li et al. (2023c) proposed a novel distillation method based on a dynamic and learnable distillation temperature. Hao et al. (2023) claimed that the power of vanilla KD was underestimated due to the small data pitfall, and observed that the performance gap between vanilla KD and other meticulously designed KD variants could be greatly reduced by employing stronger training strategies. Li (2022) proposed a novel feature-based self distillation approach, reusing channel-wise and layer-wise features within the student to provide regularization. Liu et al. (2023) presented a two-stage KD method dubbed NORM based on a feature transform module. Li & Jin (2022) proposed a Shadow Knowledge Distillation framework to bridge offline and online distillation in an efficient way. Ye et al. (2024) proposed to maximize the conditional mutual information for the teacher model in order to improve the distillation performance. Dong et al. (2023) presented a training-free framework to search for the optimal student architectures given a teacher architecture. Also, following the trend of Automated Machine Learning (AutoML), several recent works (Li et al., 2023a; b) focused on automating distiller design using techniques like evolutionary algorithms and Monte Carlo tree search.

A.8 Standard Deviation for Results on CIFAR-100

Below, we report the standard deviations for the results on the CIFAR-100 dataset in Tables 11 and 12.

Table 11: Top-1 accuracy (%) on CIFAR-100. Each teacher-student pair has the same architecture. Standard deviation is provided (the standard deviation is missing for DKD since it’s not available in the literature).
| Teacher → Student | WRN-40-2 → WRN-16-2 | WRN-40-2 → WRN-40-1 | resnet56 → resnet20 | resnet110 → resnet20 | resnet110 → resnet32 | resnet32x4 → resnet8x4 | vgg13 → vgg8 |
|---|---|---|---|---|---|---|---|
| Teacher | 75.61 | 75.61 | 72.34 | 74.31 | 74.31 | 79.42 | 74.64 |
| Student | 73.26 | 71.98 | 69.06 | 69.06 | 71.14 | 72.50 | 70.36 |
| Feature-based | | | | | | | |
| FitNet | 73.58 ± 0.32 | 72.24 ± 0.24 | 69.21 ± 0.36 | 68.99 ± 0.27 | 71.06 ± 0.13 | 73.50 ± 0.28 | 71.02 ± 0.31 |
| AT | 74.08 ± 0.25 | 72.77 ± 0.10 | 70.55 ± 0.27 | 70.22 ± 0.16 | 72.31 ± 0.08 | 73.44 ± 0.19 | 71.43 ± 0.09 |
| VID | 74.11 ± 0.24 | 73.30 ± 0.13 | 70.38 ± 0.14 | 70.16 ± 0.39 | 72.61 ± 0.28 | 73.09 ± 0.21 | 71.23 ± 0.06 |
| RKD | 73.35 ± 0.09 | 72.22 ± 0.20 | 69.61 ± 0.06 | 69.25 ± 0.05 | 71.82 ± 0.34 | 71.90 ± 0.11 | 71.48 ± 0.05 |
| PKT | 74.54 ± 0.04 | 73.45 ± 0.19 | 70.34 ± 0.04 | 70.25 ± 0.04 | 72.61 ± 0.17 | 73.64 ± 0.18 | 72.88 ± 0.09 |
| CRD | 75.48 ± 0.09 | 74.14 ± 0.22 | 71.16 ± 0.17 | 71.46 ± 0.09 | 73.48 ± 0.13 | 75.51 ± 0.18 | 73.94 ± 0.22 |
| Logits-based | | | | | | | |
| KD | 74.92 ± 0.28 | 73.54 ± 0.20 | 70.66 ± 0.24 | 70.67 ± 0.27 | 73.08 ± 0.18 | 73.33 ± 0.25 | 72.98 ± 0.19 |
| DIST | 75.51 ± 0.04 | 74.73 ± 0.24 | 71.75 ± 0.30 | 71.65 ± 0.21 | 73.69 ± 0.23 | 76.31 ± 0.19 | 73.89 ± 0.19 |
| DKD | 76.24 | 74.81 | 71.97 | n/a | 74.11 | 76.32 | 74.68 |
| TTM | 76.23 ± 0.15 | 74.32 ± 0.31 | 71.83 ± 0.16 | 71.46 ± 0.16 | 73.97 ± 0.23 | 76.17 ± 0.28 | 74.33 ± 0.07 |
| WTTM | 76.37 ± 0.10 | 74.58 ± 0.26 | 71.92 ± 0.40 | 71.67 ± 0.28 | 74.13 ± 0.37 | 76.06 ± 0.27 | 74.44 ± 0.19 |
| WTTM+CRD | 76.61 ± 0.24 | 74.94 ± 0.35 | 72.20 ± 0.15 | 72.13 ± 0.26 | 74.52 ± 0.29 | 76.65 ± 0.14 | 74.71 ± 0.07 |
| WTTM+ITRD | 76.65 ± 0.33 | 75.34 ± 0.22 | 72.16 ± 0.28 | 72.20 ± 0.27 | 74.36 ± 0.31 | 77.36 ± 0.13 | 75.13 ± 0.16 |
Table 12: Top-1 accuracy (%) on CIFAR-100. Each teacher-student pair has different architectures. Standard deviation is provided (the standard deviation is missing for DKD since it’s not available in the literature).
Teacher vgg13 ResNet50 ResNet50 resnet32x4 resnet32x4 WRN-40-2
Student MobileNetV2 MobileNetV2 vgg8 ShuffleNetV1 ShuffleNetV2 ShuffleNetV1
Teacher accuracy 74.64 79.34 79.34 79.42 79.42 75.61
Student accuracy 64.6 64.6 70.36 70.5 71.82 70.5
Feature-based
FitNet 64.14 ± 0.50 63.16 ± 0.47 70.69 ± 0.22 73.59 ± 0.15 73.54 ± 0.22 73.73 ± 0.32
AT 59.40 ± 0.20 58.58 ± 0.54 71.84 ± 0.28 71.73 ± 0.31 72.73 ± 0.09 73.32 ± 0.35
VID 65.56 ± 0.42 67.57 ± 0.28 70.30 ± 0.31 73.38 ± 0.09 73.40 ± 0.17 73.61 ± 0.12
RKD 64.52 ± 0.45 64.43 ± 0.42 71.50 ± 0.07 72.28 ± 0.39 73.21 ± 0.28 72.21 ± 0.16
PKT 67.13 ± 0.30 66.52 ± 0.33 73.01 ± 0.14 74.10 ± 0.25 74.69 ± 0.34 73.89 ± 0.16
CRD 69.73 ± 0.42 69.11 ± 0.28 74.30 ± 0.14 75.11 ± 0.32 75.65 ± 0.10 76.05 ± 0.14
Logits-based
KD 67.37 ± 0.32 67.35 ± 0.32 73.81 ± 0.13 74.07 ± 0.19 74.45 ± 0.27 74.83 ± 0.17
DIST 68.50 ± 0.26 68.66 ± 0.23 74.11 ± 0.07 76.34 ± 0.18 77.35 ± 0.25 76.40 ± 0.03
DKD 69.71 70.35 n/a 76.45 77.07 76.70
TTM 68.98 ± 0.85 69.24 ± 0.28 74.87 ± 0.31 74.18 ± 0.26 76.57 ± 0.26 75.39 ± 0.33
WTTM 69.16 ± 0.20 69.59 ± 0.58 74.82 ± 0.28 74.37 ± 0.39 76.55 ± 0.08 75.42 ± 0.34
WTTM+CRD 70.30 ± 0.68 70.84 ± 0.56 75.30 ± 0.42 75.82 ± 0.16 77.04 ± 0.19 76.86 ± 0.37
WTTM+ITRD 70.70 ± 0.45 71.56 ± 0.15 76.00 ± 0.17 77.03 ± 0.26 77.68 ± 0.26 77.44 ± 0.27

A.9 Results on Transformer-based Models

To verify the effectiveness of our proposed distillation method WTTM on transformer-based models, we apply it to the vision transformer DeiT-Tiny (Touvron et al., 2021); the results are shown in Table 13. We follow the experimental settings of Yang et al. (2023b) and Yang et al. (2022), and compare our results with vanilla KD and with the two distillation methods proposed in those papers, namely NKD and ViTKD. WTTM (77.03%) outperforms all three baselines: KD (76.01%), ViTKD (76.06%), and NKD (76.68%). Moreover, when combined with ViTKD, WTTM raises the Top-1 accuracy of DeiT-Tiny to 78.04%, which is also higher than that of NKD combined with ViTKD (77.78%).

Table 13: Top-1 accuracy (%) on ImageNet.
Teacher Student KD ViTKD NKD WTTM NKD+ViTKD WTTM+ViTKD
DeiT III-Small (82.76) DeiT-Tiny (74.42) 76.01 76.06 76.68 77.03 77.78 78.04
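
The comparison in Table 13 can be made concrete by computing each method's absolute gain over the student baseline. The following is a minimal sketch rather than part of our experimental code: it assumes that the parenthesized 74.42 is the Top-1 accuracy of DeiT-Tiny trained without distillation, and the names `table13`, `STUDENT_BASELINE`, and `gains_over_baseline` are illustrative only.

```python
# Top-1 accuracies (%) copied from Table 13.
# Assumption: 74.42 is DeiT-Tiny's accuracy without any distillation.
STUDENT_BASELINE = 74.42

table13 = {
    "KD": 76.01,
    "ViTKD": 76.06,
    "NKD": 76.68,
    "WTTM": 77.03,
    "NKD+ViTKD": 77.78,
    "WTTM+ViTKD": 78.04,
}


def gains_over_baseline(results, baseline):
    """Return each method's gain in percentage points, rounded to 2 decimals."""
    return {name: round(acc - baseline, 2) for name, acc in results.items()}


gains = gains_over_baseline(table13, STUDENT_BASELINE)

# Print methods from smallest to largest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1]):
    print(f"{name:<12} +{gain:.2f}")
```

Under this assumption, WTTM alone gains 2.61 percentage points over the baseline, and WTTM+ViTKD gains 3.62, compared with 3.36 for NKD+ViTKD.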