License: CC BY 4.0
arXiv:2403.05894v1 [cs.CV] 09 Mar 2024

Frequency Attention for Knowledge Distillation

Cuong Pham1    Van-Anh Nguyen1    Trung Le1    Dinh Phung1    Gustavo Carneiro2           Thanh-Toan Do1
1Department of Data Science and AI, Monash University, Australia
2Centre for Vision, Speech and Signal Processing, University of Surrey, United Kingdom
Abstract

Knowledge distillation is an attractive approach for learning compact deep neural networks: a lightweight student model is learned by distilling knowledge from a complex teacher model. Attention-based knowledge distillation is a specific form of intermediate feature-based knowledge distillation that uses attention mechanisms to encourage the student to better mimic the teacher. However, most previous attention-based distillation approaches perform attention in the spatial domain, which primarily affects local regions in the input image. This may not be sufficient when we need to capture the broader context or global information necessary for effective knowledge transfer. In the frequency domain, each frequency is determined from all pixels of the image in the spatial domain, so it can contain global information about the image. Inspired by the benefits of the frequency domain, we propose a novel module that functions as an attention mechanism in the frequency domain. The module consists of a learnable global filter that can adjust the frequencies of the student’s features under the guidance of the teacher’s features, which encourages the student’s features to have patterns similar to the teacher’s features. We then propose an enhanced knowledge review-based distillation model by leveraging the proposed frequency attention module. Extensive experiments with various teacher and student architectures on image classification and object detection benchmark datasets show that the proposed approach outperforms other knowledge distillation methods.

1 Introduction

Convolutional Neural Networks (CNNs) have been widely applied and achieved a myriad of successes in various computer vision tasks, such as image classification, object detection, and image segmentation. However, these CNNs often require expensive memory and computation resources, making them unsuitable for applications with limited resources. Different approaches have been proposed to learn efficient deep neural networks, such as pruning [16, 5, 10], knowledge distillation [7, 23, 6, 19], and quantization [8, 35, 37, 20]. Among them, knowledge distillation (KD) is an attractive approach to reduce the computational cost of CNNs. In knowledge distillation, a smaller student network is trained to mimic the behaviour of a larger teacher network.

Different approaches have been proposed for knowledge distillation [7, 23, 11, 30, 2, 6, 19, 36]. Among them, intermediate feature-based KD is a popular approach because it allows flexible design of distillation mechanisms, such as layer-to-layer distillation [23, 11, 6] and layer fusion distillation [19]. Attention-based KD [11, 9, 28, 26] is a specific form of intermediate feature-based knowledge distillation. In those works, the attention is performed in the spatial domain, and attention maps are used to help the student focus on the most informative parts of the teacher’s features. However, in [11, 9, 28, 26] each value of the attention map is calculated from a local region of the input feature map. This focus on local regions may not be sufficient to transfer knowledge effectively from the teacher model to the student model when the broader context or global information needs to be captured.

Our goal is to encourage the student model to capture both detailed and higher-level information, such as object parts, from the teacher model. This can be accomplished by processing the student’s features in the frequency domain instead of the spatial domain. The frequency domain is useful for understanding images with repetitive or periodic patterns that may be difficult to discover using traditional spatial domain techniques. By capturing the intensity changes and patterns in the image, the frequency domain can identify different regions associated with objects, and each frequency could correspond to some specific structures, e.g., high frequencies correspond to large changes in image intensity over a short pixel distance (e.g., edges).

Given the above benefits of the frequency domain, we propose a Frequency Attention Module (FAM), which has a learnable global filter in the frequency domain. The global filter can be seen as a form of attention in the frequency domain, which can adjust the frequencies of the student’s feature maps. We then transform the attended features from the frequency domain back to the spatial domain and minimize their distance to the teacher’s features. By updating the parameters of the learnable filter based on the guidance of the teacher, we can encourage the transformed student’s features to have patterns similar to the teacher’s features.

Given the proposed frequency-based attention module, we propose two enhanced architectures for layer-to-layer [6, 11, 23] and knowledge review distillation [19]. We extensively demonstrate the effectiveness of our proposed method with various teacher and student architectures on benchmark datasets for image classification and object detection. The experimental results show that the proposed approach outperforms other knowledge distillation methods. In summary, our contributions are:

  • We propose a novel module, which is our main contribution, that explores the Fourier frequency domain for knowledge distillation. The module consists of a learnable global filter that can adjust the frequencies of the student’s features, which encourages the student’s features to mimic patterns from the teacher’s features.

  • We propose an enhanced layer-to-layer knowledge distillation model and an enhanced knowledge review-based distillation model by leveraging the proposed FAM module.

  • Our method outperforms other knowledge distillation methods for classification on CIFAR-100 and ImageNet datasets and object detection on MS COCO dataset.

2 Related work

Knowledge Distillation (KD) has received substantial attention recently due to its versatility in various applications. In KD, the student model can benefit from the guidance of various forms from the teacher model to achieve better performance. This could be soft logit-based distillation [7, 36], relation-based distillation [30, 17, 2], or intermediate feature-based distillation [23, 11, 6, 19]. Among them, the feature-based knowledge distillation allows flexibility in designing distillation mechanisms. Particularly, in FitNets [23] given a student layer (guided layer) and a teacher layer (hint layer), the authors minimize the $L_2$ distance between the transformed student’s features and the teacher’s features. Following FitNets, AT [11], PKT [18], and SP [31] transfer knowledge through activation maps, feature distributions, and pairwise similarities, respectively. In OFD [6], the authors propose margin ReLU applied on teacher’s feature maps to select information used for distillation. In [19], the authors introduce the review mechanism to enrich student features. They show that lower-level features of teacher are useful in supervising the higher-level features of student. They propose to fuse different levels of student features before mimicking teacher knowledge.

In [11, 9, 28, 26], the attention is performed in the spatial domain, and attention maps are used to help the student focus on the most informative parts of the teacher’s features. Specifically, spatial attention maps in AT [11] can be computed using the sum of absolute values across the channel dimension. AFD [9] also transfers knowledge from teacher to student through spatial attention maps, which are computed through a channel-wise average pooling layer. They then maximize the similarity between the attention maps of the student’s features and the attention maps of the teacher’s features. Meanwhile, [28] computes the spatial attention maps using average pooling and fully connected layers. However, with the attention in the spatial domain used in [11, 9, 28, 26], weights of the attention map are usually calculated from local regions of the feature maps. The attention weights (i.e., values in the attention map) indicate the importance of the corresponding local regions. Due to this local property, a change in a value of the attention map (in backpropagation) only affects the corresponding local region.

Fourier frequency domain and attention in the frequency domain. In digital image processing, Fourier frequency domain represents an image with a set of sinusoidal waves, with each wave representing a different level of intensity in the whole image. The frequency domain is a helpful way to understand images that have repetitive or periodic patterns [3]. It is more effective than traditional spatial domain techniques in capturing geometric structures that are difficult to extract. By capturing the intensity changes in the image, the frequency domain can identify distinct regions that are associated with objects.

Each frequency in the frequency domain is determined by all the pixels in the image in the spatial domain. Frequencies can correspond to particular structures in the spatial domain. For instance, high frequencies correspond to significant changes in image intensity over a small distance between pixels, such as edges. Therefore, focusing on the frequency domain can be seen as a form of global attention. Meanwhile, attention in the spatial domain [11, 9, 28] primarily affects local regions in the input feature map, which may be insufficient for capturing the global structure of the feature map. By contrast, attention in the frequency domain can be especially useful for identifying global information or geometric structures of the feature map that may be difficult to detect using traditional spatial domain techniques. A change in a frequency of the attention frequency map can impact the entire input feature, compared to the effect on the local regions when changing a value in attention map in the spatial domain.
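The contrast between a local change in the spatial domain and a global change in the frequency domain can be illustrated numerically. The following is a minimal NumPy sketch that is not taken from the paper: the 8×8 random map, the perturbed positions, and the detection threshold are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))  # toy single-channel feature map (hypothetical)

# Spatial change: perturb one pixel; only that position differs afterwards.
x_spatial = x.copy()
x_spatial[3, 4] += 1.0
changed_spatial = int(np.sum(np.abs(x_spatial - x) > 1e-6))

# Frequency change: perturb one Fourier coefficient, then transform back.
freq = np.fft.fft2(x)
freq[1, 2] += 1.0
x_freq = np.fft.ifft2(freq)  # complex-valued, since conjugate symmetry is broken
changed_freq = int(np.sum(np.abs(x_freq - x) > 1e-6))

print(changed_spatial, changed_freq)
```

Perturbing a single coefficient adds a complex sinusoid of magnitude 1/(H·W) at every position, so all 64 positions change, whereas the spatial perturbation touches exactly one.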

In this work, we explore the Fourier frequency domain for the knowledge distillation problem. We propose a frequency attention module (FAM) that has a learnable global filter, which acts as an attention in the frequency domain. Based on the guidance from the teacher, FAM will encourage the student’s features to have similar patterns as teacher’s features.

3 Proposed method

This section first details the frequency attention module (FAM) that encourages the student to better mimic the teacher. We then present our design to integrate the FAM module into two popular knowledge distillation mechanisms, i.e., layer-to-layer feature-based distillation [6] and knowledge review-based distillation [19].

3.1 Frequency attention module

Refer to caption
Figure 1: Fourier Frequency Attention Module. HPF stands for a high pass filter. In the global branch, the input student’s feature map is transformed to the frequency domain using the FFT. The frequency is then adjusted by a learnable global filter. A high pass filter is then applied to the adjusted frequency map to filter out lowest frequencies. The local branch consists of a $1 \times 1$ convolutional layer in the spatial domain. The outputs of the global and local branches are added and the resulting feature map is compared with the teacher’s feature map. $\gamma_1$ and $\gamma_2$ are the learnable weighting parameters of the global and local branches, respectively.

As shown in Figure 1, the FAM module consists of global and local branches. Specifically, given a feature map $X$ with a dimension of $C_{in} \times H \times W$, in the global branch we first transform it into the frequency domain via the Fast Fourier Transform (FFT). Here the FFT is applied to each channel separately. For the $i^{th}$ channel $X_i$ of the feature map $X$, the 2-D discrete FFT of $X_i$, denoted by $\mathcal{X}_i$, is expressed as:

$$\mathcal{X}_i(u,v) = \sum_{k=0}^{H-1}\sum_{l=0}^{W-1} X_i(k,l)\, e^{-i2\pi\left(\frac{uk}{H}+\frac{vl}{W}\right)}. \tag{1}$$

To adjust the frequencies of $\mathcal{X}_i$, we apply a learnable global filter $K$, which can be seen as a form of attention on $\mathcal{X}_i$.
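As a sanity check on Eq. (1), the double sum can be evaluated directly and compared with a library FFT. The sketch below is illustrative only; the helper name `dft2_direct` and the 4×6 input are hypothetical. It relies on `np.fft.fft2` using the same sign convention and no normalization in the forward direction.

```python
import numpy as np

def dft2_direct(x):
    """Evaluate the 2-D DFT of Eq. (1) by explicit summation (O(H^2 W^2), for checking only)."""
    H, W = x.shape
    k = np.arange(H)[:, None]  # spatial row index
    l = np.arange(W)[None, :]  # spatial column index
    out = np.empty((H, W), dtype=complex)
    for u in range(H):
        for v in range(W):
            phase = np.exp(-2j * np.pi * (u * k / H + v * l / W))
            out[u, v] = np.sum(x * phase)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))
direct = dft2_direct(x)
fast = np.fft.fft2(x)
```

Each output coefficient is a weighted sum over every spatial position, which is the property the global filter relies on.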

Global filtering.

It is worth noting that in feature distillation, we want the feature map resulting from the FAM module to have the same dimension as the teacher feature map from which the knowledge will be distilled. Therefore, we design the global filter $K$ with the dimension of $C_{out} \times C_{in} \times H \times W$, where $C_{out}$ is the number of channels of the teacher’s feature map. Each kernel in the global filter $K$ has the same size as the 3D input tensor $\mathcal{X}$, i.e., $C_{in} \times H \times W$. The kernel is multiplied element-wise with the 3D input tensor $\mathcal{X}$, resulting in a 3D feature map with the same size as the input feature map. Next, this 3D frequency feature map is summed along the channel dimension (i.e., sum-pooling in each $C_{in} \times 1 \times 1$ block), resulting in a 2D output with the size $H \times W$. The above operation is performed for each of the $C_{out}$ kernels of the global filter, resulting in a 3D feature map with a size of $C_{out} \times H \times W$ as the output.
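The multiply-then-sum-pool operation described above can be written as a single tensor contraction. The following NumPy sketch is one possible reading, not the authors' implementation: the function name `global_filter`, the toy sizes, and the choice of a complex-valued $K$ are assumptions.

```python
import numpy as np

def global_filter(freq, K):
    """freq: (C_in, H, W) complex spectrum; K: (C_out, C_in, H, W) filter.

    For each of the C_out kernels: element-wise product with freq,
    then a sum over the C_in axis. Returns an array of shape (C_out, H, W).
    """
    return np.einsum("oihw,ihw->ohw", K, freq)

rng = np.random.default_rng(0)
C_in, C_out, H, W = 3, 5, 4, 4                     # hypothetical sizes
X = rng.standard_normal((C_in, H, W))
freq = np.fft.fft2(X, axes=(-2, -1))               # per-channel 2-D FFT
K = (rng.standard_normal((C_out, C_in, H, W))
     + 1j * rng.standard_normal((C_out, C_in, H, W)))  # complex K is an assumption
out = global_filter(freq, K)

# Reference for the first output channel, following the text step by step.
ref0 = (K[0] * freq).sum(axis=0)
```

Because each kernel mixes all input channels at a given frequency, the filter also handles the change in channel count from $C_{in}$ to $C_{out}$.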

It is worth noting that the proposed filter acts in the frequency domain. Each frequency in the frequency domain is determined by all the pixels in the spatial domain; hence, although each element of each kernel attends to a particular frequency, the filter still achieves the global effects.

After that, we further suppress low frequencies, which encourages the student to de-focus from the non-salient regions. To this end, we add a high pass filter (HPF) after the learnable global filter to eliminate part of the lowest frequency components. The HPF is applied to each channel separately. Specifically, for each channel, we adopt the ideal HPF, which suppresses 1 percent of the lowest frequencies.
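The text does not spell out how the lowest 1 percent of frequencies is selected. The sketch below assumes one plausible reading: per channel, the coefficients are ranked by radial distance from the zero-frequency (DC) term and the lowest `fraction` of them (at least one) are set to zero. The function name and the 8×8 test map are hypothetical.

```python
import numpy as np

def ideal_high_pass(freq, fraction=0.01):
    """Zero the `fraction` of coefficients closest to DC; applied to each channel alike."""
    H, W = freq.shape[-2:]
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)           # radial frequency of each coefficient
    n_cut = max(1, int(round(fraction * H * W)))  # always remove at least the DC term
    cut = np.argsort(radius, axis=None, kind="stable")[:n_cut]
    mask = np.ones(H * W)
    mask[cut] = 0.0
    return freq * mask.reshape(H, W)              # broadcasts over the channel axis

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8))
filtered = ideal_high_pass(np.fft.fft2(x, axes=(-2, -1)))
out = np.fft.ifft2(filtered, axes=(-2, -1)).real
```

For an 8×8 map this removes only the DC coefficient, so each output channel equals the input channel minus its mean.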

We then transform the frequency domain back to the spatial domain via the inverse Fast Fourier Transform (IFFT). Given $\bar{\mathcal{X}}$, which is the frequency feature map after the HPF, for the $i^{th}$ channel $\bar{\mathcal{X}_i}$ of $\bar{\mathcal{X}}$, the 2-D IFFT of $\bar{\mathcal{X}_i}$, denoted by $\bar{X_i}$, is expressed as:

$$\bar{X_i}(k,l) = \frac{1}{HW}\sum_{u=0}^{H-1}\sum_{v=0}^{W-1} \bar{\mathcal{X}_i}(u,v)\, e^{i2\pi\left(\frac{uk}{H}+\frac{vl}{W}\right)}. \tag{2}$$

Formally, let $g(\mathcal{X}, K)$ be the output of the global filtering described above, $h$ be the high pass filter, and $\mathbb{F}$ and $\mathbb{F}^{-1}$ be the FFT and the IFFT, respectively. The output of the global branch is calculated as:

$$\mathcal{F}_{global}(X) = \mathbb{F}^{-1}(h(g(\mathbb{F}(X), K))), \tag{3}$$

where $\mathbb{F}$, $h$, and $\mathbb{F}^{-1}$ are applied in a channel-wise fashion.

The FAM module also consists of a local branch, which is a $1 \times 1$ convolutional layer in the spatial domain. This layer aims to leverage the information of features in the spatial domain. Let $\mathcal{F}_{local}(X)$ be the output of the local branch; the output of the frequency attention module is calculated as:

$$\mathcal{F}_{out} = \gamma_1 * \mathcal{F}_{global} + \gamma_2 * \mathcal{F}_{local}, \tag{4}$$

where $\gamma_1$ and $\gamma_2$ are the learnable weighting parameters of the global and local branches, respectively.
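Putting Eq. (3) and Eq. (4) together, a forward pass of the module might look as follows. This is a hedged NumPy sketch rather than the authors' code: the function name `fam_forward`, passing the HPF as a precomputed 0/1 mask, and keeping only the real part after the IFFT are all assumptions made here.

```python
import numpy as np

def fam_forward(X, K, w_local, gamma1, gamma2, hpf_mask):
    """X: (C_in, H, W); K: (C_out, C_in, H, W); w_local: (C_out, C_in); hpf_mask: (H, W)."""
    freq = np.fft.fft2(X, axes=(-2, -1))                   # F(X), per channel
    filtered = np.einsum("oihw,ihw->ohw", K, freq)         # g(F(X), K)
    filtered = filtered * hpf_mask                         # h(.), per channel
    f_global = np.fft.ifft2(filtered, axes=(-2, -1)).real  # F^{-1}(.); real part is an assumption
    f_local = np.einsum("oi,ihw->ohw", w_local, X)         # 1x1 convolution
    return gamma1 * f_global + gamma2 * f_local            # Eq. (4)

# Sanity check with identity settings (hypothetical values).
C, H, W = 3, 4, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((C, H, W))
eye = np.eye(C)
K_identity = np.broadcast_to(eye[:, :, None, None], (C, C, H, W))
no_hpf = np.ones((H, W))
only_global = fam_forward(X, K_identity, eye, 1.0, 0.0, no_hpf)
only_local = fam_forward(X, K_identity, eye, 0.0, 1.0, no_hpf)
```

With an identity filter, no high-pass suppression, and an identity 1×1 kernel, each branch on its own reproduces the input, which is a quick way to check the wiring before training.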

Computational complexity of the FAM module.

The global branch comprises a fast Fourier transform (FFT), an inverse fast Fourier transform (IFFT), a global filter, and a high pass filter (HPF).

The complexity of the FFT of an image with dimensions $H \times W$ is $\mathcal{O}(HW\log(HW))$. Similarly, the complexity of the inverse fast Fourier transform (IFFT) of a frequency image with dimensions $H \times W$ is $\mathcal{O}(HW\log(HW))$. Therefore, the complexities of the FFT, global filter, HPF, and IFFT components in the FAM module are $\mathcal{O}(C_{in}HW\log(HW))$, $\mathcal{O}(C_{out}C_{in}HW)$, $\mathcal{O}(C_{out}HW)$, and $\mathcal{O}(C_{out}HW\log(HW))$, respectively.

The FAM module also consists of a local branch, which is a $1 \times 1$ convolutional layer in the spatial domain. The local branch has the complexity of $\mathcal{O}(C_{out}C_{in}HW)$. Overall, the FAM module has the complexity of $\mathcal{O}(C_{out}C_{in}HW)$.
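The overall bound follows when $\log(HW)$ does not exceed the channel counts, since then the FFT and IFFT terms are dominated by the $C_{out}C_{in}HW$ terms. A small calculation makes this concrete; the channel and spatial sizes below are hypothetical and the counts ignore constant factors.

```python
import math

def fam_op_counts(c_in, c_out, h, w):
    """Leading-order operation counts for each FAM component (constants dropped)."""
    hw = h * w
    return {
        "fft": c_in * hw * math.log2(hw),
        "global_filter": c_out * c_in * hw,
        "hpf": c_out * hw,
        "ifft": c_out * hw * math.log2(hw),
        "local_1x1": c_out * c_in * hw,
    }

counts = fam_op_counts(c_in=64, c_out=256, h=8, w=8)
dominant = max(counts.values())
for name, value in counts.items():
    print(f"{name:>14}: {int(value):>9}")
```

For these sizes the global filter and the 1×1 convolution each cost 1,048,576 operations, while the IFFT costs 98,304 and the FFT 24,576.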

3.2 Applying FAM to knowledge distillation

3.2.1 Layer-to-layer intermediate feature-based knowledge distillation

Refer to caption
Figure 2: The proposed enhanced layer-to-layer knowledge distillation. LA is the local attention and FAM is the proposed frequency attention module. $\mathcal{D}$ is the distance function. $F^T$ and $F^S$ represent the feature maps of teacher and student, respectively.
Refer to caption
Figure 3: The proposed enhanced knowledge review distillation. CrossAT is the cross attention and FAM is the proposed frequency attention module. $\mathcal{D}$ is the distance function. $F^T$ and $F^S$ represent the feature maps of teacher and student, respectively.

Let $\mathcal{I}$ be the selected layer indices from the teacher for intermediate feature-based distillation. The layer-to-layer knowledge distillation loss is defined as

$$\mathcal{L}_{feat} = \sum_{i \in \mathcal{I}} \mathcal{D}\left(F_i^T, f(F_j^S)\right), \tag{5}$$

where $F_j^S$ is the feature map from the $j^{th}$ layer of the student selected for receiving the knowledge from the feature map $F_i^T$ from the $i^{th}$ layer of the teacher; $f$ is a transformation applied on the student’s feature map. In our work, $f$ is the FAM module. $\mathcal{D}$ is a distance function. In this work, we use the $L_2$ distance as the distance function. It is worth noting that the teacher is fixed in our framework, i.e., there are no transformations applied on the teacher’s feature maps.
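The loss in Eq. (5) can be sketched in a few lines. This is an illustration under stated assumptions: the helper names, the dictionary layout, the use of a mean (rather than a sum) inside the $L_2$ distance, and the per-layer `transforms` table are all hypothetical, and identity functions stand in for the FAM modules.

```python
import numpy as np

def l2_distance(a, b):
    # Mean squared difference; using the mean rather than the sum is an assumption.
    return float(np.mean((a - b) ** 2))

def layer_to_layer_loss(teacher_feats, student_feats, pairs, transforms):
    """pairs: list of (i, j) teacher/student layer indices.

    transforms[j] maps the j-th student feature map to the teacher's shape
    (the role played by f in Eq. (5)); teacher features are used unchanged.
    """
    return sum(
        l2_distance(teacher_feats[i], transforms[j](student_feats[j]))
        for i, j in pairs
    )

teacher = {2: np.ones((4, 2, 2)), 4: np.zeros((8, 1, 1))}
student = {1: np.zeros((4, 2, 2)), 3: np.zeros((8, 1, 1))}
identity = {1: lambda s: s, 3: lambda s: s}   # stand-ins for per-layer FAM modules
loss = layer_to_layer_loss(teacher, student, pairs=[(2, 1), (4, 3)], transforms=identity)
```

Only the student side passes through a transformation, matching the statement that the teacher is fixed.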

To help the FAM output better mimic the teacher, we find it beneficial to also enhance local structures in the spatial domain. To this end, we place an attention layer after the student’s feature maps, before feeding them through the FAM module, as shown in Figure 2. To avoid increasing model complexity, we use the local self-attention (LA) layer introduced by [21]. In LA, self-attention is applied only to a small neighbourhood around each position.

3.2.2 Knowledge review distillation

We also integrate the FAM module into the knowledge review distillation mechanism [19], as shown in Figure 3. In [19], the authors propose a knowledge review mechanism that uses teacher’s low-level features to supervise deeper student’s features. They fuse different levels of the student’s features before mimicking knowledge from the teacher. In knowledge review mechanism [19], the distillation loss is defined as follows:

$$\mathcal{L}_{feat} = \mathcal{D}(F_M^T, f(F_N^S)) + \sum_{i=M-1}^{1} \mathcal{D}\left(F_i^T, f(u(F_{j=i}^S, F_{j+1,N}^S))\right), \tag{6}$$

where $M$ and $N$ are the numbers of selected intermediate layers of teacher and student used for knowledge distillation. We note that in intermediate feature-based KD, the student and the teacher models are often divided into stages. The number of stages is the same for the teacher and the student, i.e., $M = N$. The last layers in each stage are used for distillation. $u(\cdot,\cdot)$ is a fusion function that recursively fuses student features. $F_{j+1,N}^S$ denotes the fusion of features from $F_{j+1}^S$ to $F_N^S$; $u(F_{j=i}^S, F_{j+1,N}^S) = u(F_{j=i}^S, u(F_{j+1}^S, F_{j+2,N}^S))$; $f$ is the FAM module.
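The recursion in the fusion function can be unrolled from the deepest stage backwards. The sketch below uses plain Python with hypothetical helper names; integers and addition stand in for feature maps and the fusion function, and the resizing that real feature maps at different stages would need is omitted.

```python
def fuse_student_features(student_feats, u):
    """Return [fused_1, ..., fused_N] with fused_N = F_N and fused_j = u(F_j, fused_{j+1})."""
    fused = [None] * len(student_feats)
    fused[-1] = student_feats[-1]
    for j in range(len(student_feats) - 2, -1, -1):
        fused[j] = u(student_feats[j], fused[j + 1])
    return fused

def review_loss(teacher_feats, student_feats, u, f, dist):
    """Eq. (6) with M = N: compare each teacher stage with the transformed fused student stage."""
    fused = fuse_student_features(student_feats, u)
    return sum(dist(t, f(s)) for t, s in zip(teacher_feats, fused))

# Toy stand-ins: integers as "features", addition as the fusion function.
fused = fuse_student_features([1, 2, 3], u=lambda a, b: a + b)
loss = review_loss(
    [6, 5, 3], [1, 2, 3],
    u=lambda a, b: a + b,
    f=lambda s: s,
    dist=lambda a, b: abs(a - b),
)
```

Computing the fused features once from the last stage to the first avoids re-evaluating the nested calls for every term of the sum.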

In [19], $u(\cdot,\cdot)$ is an attention-based fusion (ABF [19]) function that learns two attention maps for two inputs and uses the attention maps to aggregate the two inputs. In this work, we propose using cross attention [32], in which the low-level feature map is considered as the value and key and the high-level (fused) feature map is considered as the query when fusing the student’s feature maps at different levels. Specifically, let $F^{\prime} = u(F_{j+1}^S, F_{j+2,N}^S)$; then

$$u(F_j^S, F^{\prime}) = softmax((\mathbf{W}_Q F^{\prime})(\mathbf{W}_K F_j^S)^T)\,\mathbf{W}_V F_j^S, \tag{7}$$

where $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$ represent learnable parameters for query, key, and value, respectively. In summary, compared to [19], firstly, to emphasize the importance of the student’s feature map that is at the same level as the teacher’s feature map, our enhanced KD review architecture uses cross attention instead of ABF [19]. Then, we feed the output of cross attention to the FAM module to adjust frequencies before computing the distance function $\mathcal{D}$.
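The cross-attention fusion of Eq. (7) can be sketched on flattened feature maps. This is a minimal NumPy illustration, not the authors' implementation: it assumes both maps have already been reshaped to token matrices with the same number of positions and channels, applies the projections on the right of the token matrices, and takes the softmax over the key axis without scaling, as Eq. (7) is written.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(F_low, F_high, W_q, W_k, W_v):
    """F_low, F_high: (N, C) token matrices; W_q, W_k, W_v: (C, D). Returns (N, D)."""
    Q = F_high @ W_q                  # query from the high-level (fused) features
    K = F_low @ W_k                   # key from the low-level features
    V = F_low @ W_v                   # value from the low-level features
    attn = softmax(Q @ K.T, axis=-1)  # each query row attends over all keys
    return attn @ V

rng = np.random.default_rng(0)
N, C, D = 6, 4, 3                                   # hypothetical sizes
W_q, W_k, W_v = (rng.standard_normal((C, D)) for _ in range(3))
F_high = rng.standard_normal((N, C))
F_low = np.tile(rng.standard_normal((1, C)), (N, 1))  # identical low-level tokens
out = cross_attention_fuse(F_low, F_high, W_q, W_k, W_v)
expected_row = (F_low @ W_v)[0]
```

Because the attention weights in each row sum to one, identical low-level tokens yield an output equal to their shared value vector regardless of the query, which is a convenient correctness check.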

The overall loss consists of the task loss (i.e., cross-entropy loss for the classification task) and the feature distillation loss:

$$\mathcal{L} = \mathcal{L}_{task} + \alpha \mathcal{L}_{feat} \tag{8}$$

4 Experiments

4.1 Experimental setup

Datasets.

We evaluate our approach on the CIFAR-100 [12] and ImageNet [24] datasets for the image classification task, and on the COCO dataset [14] for the object detection task. The CIFAR-100 dataset consists of 60,000 images in 100 classes, of which 50,000 and 10,000 images are used for the training and validation sets, respectively. ImageNet is a challenging dataset with 1000 classes. This dataset contains 1.2 million images for training and 50,000 images for validation; the validation set is used as a test set in our experiments. For the object detection task, COCO is a standard dataset with multiple objects in an image. In total, this dataset contains 1.5 million object instances of 80 object categories in 118,000 training and 5,000 validation images.

Implementation details.

We apply our method across various teacher-student architecture pairs, as shown in Table 1, Table 2, Table 3, and Table 4. For a fair comparison, we run experiments on standard teacher/student pairs following other distillation methods [30, 2, 19, 36] and build on the public distiller code-base [36]. This includes distillation when teachers and students have the same architecture and when they have different architectures. For training, we use the standard training procedure following [19, 36] and pre-trained teachers in all settings for both classification and object detection tasks. We employ the $L_2$ distance as the distance function $\mathcal{D}$ when calculating the $\mathcal{L}_{feat}$ losses (Eq. (5) and Eq. (6)). The implementation details for the CIFAR-100, ImageNet, and MS-COCO datasets and the values of the hyper-parameter $\alpha$ (Eq. (8)) for each teacher/student pair are provided in the supplementary materials due to the page limit.
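As a minimal sketch of how an $L_2$ feature distance combines with the task loss in Eq. (8), consider the snippet below. The function names, the use of a squared distance summed over elements and feature levels, and the toy values (including $\alpha = 0.5$) are assumptions for illustration only; the exact form of $\mathcal{L}_{feat}$ is given by Eq. (5) and Eq. (6), and the actual $\alpha$ values are in the supplementary materials.

```python
def l2_distance(student_feat, teacher_feat):
    """Squared L2 distance between two equally sized, flattened feature maps."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature maps must have the same number of elements")
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat))

def total_loss(task_loss, feat_pairs, alpha):
    """Eq. (8): L = L_task + alpha * L_feat.

    feat_pairs holds (student, teacher) feature pairs, one per distilled level;
    summing the per-level distances is an assumption of this sketch.
    """
    feat_loss = sum(l2_distance(s, t) for s, t in feat_pairs)
    return task_loss + alpha * feat_loss

# Toy values (hypothetical): two feature levels and a task loss of 2.0.
pairs = [([1.0, 2.0], [1.0, 0.0]), ([0.5], [0.0])]
loss = total_loss(task_loss=2.0, feat_pairs=pairs, alpha=0.5)
print(loss)  # 2.0 + 0.5 * (4.0 + 0.25) = 4.125
```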

| Teacher / Student | WRN-40-2 / WRN-16-2 | WRN-40-2 / WRN-40-1 | ResNet56 / ResNet20 | ResNet110 / ResNet32 | ResNet32x4 / ResNet8x4 |
| --- | --- | --- | --- | --- | --- |
| Teacher | 75.61 | 75.61 | 72.34 | 74.31 | 79.42 |
| Student | 73.26 | 71.98 | 69.06 | 71.14 | 72.50 |
| *Soft logit-based distillation* | | | | | |
| KD [7] | 74.92 | 73.54 | 70.66 | 73.08 | 73.33 |
| DKD [36] | 76.24 | 74.81 | 71.97 | 74.11 | 76.32 |
| *Layer to layer-based distillation* | | | | | |
| FITNET [23] | 73.58 | 72.24 | 69.21 | 71.06 | 73.50 |
| AT [11] | 74.08 | 72.77 | 70.55 | 72.31 | 73.44 |
| VID [1] | 74.11 | 73.30 | 70.38 | 72.61 | 73.09 |
| RKD [17] | 73.35 | 72.22 | 69.61 | 71.82 | 71.90 |
| CRD [30] | 75.48 | 74.14 | 71.16 | 73.48 | 75.51 |
| WCoRD [2] | 75.88 | 74.73 | 71.56 | 73.81 | 75.95 |
| OFD [6] | 75.24 | 74.33 | 70.98 | 73.23 | 74.95 |
| FAM-KD (layer-to-layer) - Ours | 76.03 | 74.88 | 72.03 | 74.03 | 76.24 |
| *Layer to layer + Soft logit-based distillation* | | | | | |
| WCoRD + KD [2] | 76.11 | 74.72 | 71.92 | 74.20 | 76.15 |
| *Knowledge review-based distillation* | | | | | |
| ReviewKD [19] | 76.12 | 75.09 | 71.89 | 73.89 | 75.63 |
| FAM-KD (review) - Ours | 76.47 | 75.40 | 72.15 | 74.45 | 76.84 |

Table 1: Results on the CIFAR-100 validation set. Teachers and students are from the same architecture family. FAM-KD (layer-to-layer) and FAM-KD (review) refer to our proposed methods in Section 3.2.1 and Section 3.2.2, respectively. Our reported results are an average of three trials.

| Teacher / Student | ResNet32x4 / ShuffleNet-V1 | WRN-40-2 / ShuffleNet-V1 | ResNet32x4 / ShuffleNet-V2 | VGG13 / MobileNet-V2 |
| --- | --- | --- | --- | --- |
| Teacher | 79.42 | 75.61 | 79.42 | 74.64 |
| Student | 70.50 | 70.50 | 71.82 | 64.60 |
| *Soft logit-based distillation* | | | | |
| KD [7] | 74.07 | 74.83 | 74.45 | 67.37 |
| DKD [36] | 76.45 | 76.70 | 77.07 | 69.71 |
| *Layer to layer-based distillation* | | | | |
| FITNET [23] | 73.59 | 73.73 | 73.54 | 63.16 |
| AT [11] | 71.73 | 73.32 | 72.73 | 59.40 |
| VID [1] | 73.38 | 73.61 | 73.40 | 65.56 |
| RKD [17] | 72.28 | 72.21 | 73.21 | 64.52 |
| CRD [30] | 75.11 | 76.05 | 75.65 | 69.73 |
| WCoRD [2] | 75.40 | 76.32 | 75.96 | 69.47 |
| OFD [6] | 75.98 | 75.85 | 76.82 | 69.48 |
| FAM-KD (layer-to-layer) - Ours | 77.15 | 77.33 | 77.64 | 69.96 |
| *Layer to layer + Soft logit-based distillation* | | | | |
| WCoRD + KD [2] | 75.77 | 76.68 | 76.48 | 70.02 |
| *Knowledge review-based distillation* | | | | |
| ReviewKD [19] | 77.45 | 77.14 | 77.78 | 70.37 |
| FAM-KD (review) - Ours | 77.76 | 77.57 | 78.41 | 70.88 |

Table 2: The comparative results on the CIFAR-100 validation set. Teachers and students are from different architecture families. FAM-KD (layer-to-layer) and FAM-KD (review) refer to our proposed methods in Section 3.2.1 and Section 3.2.2, respectively. Our reported results are an average of three trials.

| Setting | Metric | Teacher | Student | KD [7] | AT [11] | OFD [6] | CRD [30] | WCoRD [2] | DKD [36] | ReviewKD [19] | FAM-KD (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) | Top-1 | 73.31 | 69.75 | 70.66 | 70.69 | 70.81 | 71.17 | 71.49 | 71.70 | 71.61 | 71.91 |
| (a) | Top-5 | 91.42 | 89.07 | 89.88 | 90.01 | 89.98 | 90.13 | 90.16 | 90.41 | 90.51 | 90.53 |
| (b) | Top-1 | 76.16 | 68.87 | 68.58 | 70.69 | 70.81 | 71.17 | - | 72.05 | 72.56 | 73.33 |
| (b) | Top-5 | 92.86 | 88.76 | 88.98 | 90.01 | 89.98 | 90.13 | - | 91.05 | 91.00 | 91.44 |

Table 3: Top-1 and top-5 accuracy (%) on the ImageNet validation set. (a) ResNet34 and ResNet18 and (b) ResNet50 and MobileNetV1 are used as the teacher and student architectures. Our results (FAM-KD) are with the enhanced knowledge review-based distillation (Section 3.2.2). Our reported results are an average of three trials.

Figure 4: (a) Original image. (b) - (e) Grad-CAMs [27] from layer 9 of the ResNet18 model when training (b) without knowledge distillation, (c) with OFD [6], (d) with knowledge review [19], and (e) with FAM-KD (ours), respectively. When training with distillation, ResNet34 is used as the teacher. The figure shows that our FAM-KD (e) has better focus on the object than using OFD [6] and knowledge review [19].

| Method | ResNet101 & ResNet18: AP | AP_50 | AP_75 | ResNet101 & ResNet50: AP | AP_50 | AP_75 |
| --- | --- | --- | --- | --- | --- | --- |
| Teacher | 42.04 | 62.48 | 45.88 | 42.04 | 62.48 | 45.88 |
| Student | 33.26 | 53.61 | 35.26 | 37.93 | 58.84 | 41.05 |
| KD [7] | 33.97 | 54.66 | 36.62 | 38.35 | 59.41 | 41.71 |
| FitNet [23] | 34.13 | 54.16 | 36.71 | 38.76 | 59.62 | 41.80 |
| FGFI [33] | 35.44 | 55.51 | 38.17 | 39.44 | 60.27 | 43.04 |
| ReviewKD [19] | 36.75 | 56.72 | 34.00 | 40.36 | 60.97 | 44.08 |
| DKD [36] | 35.05 | 56.60 | 37.54 | 39.25 | 60.90 | 42.73 |
| DKD + ReviewKD [36] | 37.01 | 57.53 | 39.85 | 40.65 | 61.51 | 44.44 |
| FAM-KD (ours) | 37.20 | 57.86 | 40.01 | 40.77 | 61.42 | 44.49 |

Table 4: Comparative object detection accuracy on the MS-COCO dataset. We use the two-stage method Faster RCNN [22] with FPN [13] as the detector. On the student side, ResNet18 and ResNet50 models are selected as backbones, while teacher models use ResNet101 as a backbone. Our results (FAM-KD) are with the enhanced knowledge review-based distillation (Section 3.2.2). Our reported results are an average of three trials.

4.2 Comparison with the state of the art

4.2.1 Image classification

Comparative results on CIFAR-100.

We present top-1 classification accuracy on CIFAR-100 for various teacher-student pairs, both from the same network family (Table 1) and from different network families (Table 2). The selected networks comprise ResNet [4], WideResNet [34], ShuffleNet [15], MobileNetV2 [25], and VGG [29]. The results of competitors are cited from [2, 36, 19].

Overall, our method FAM-KD (review) consistently outperforms all compared methods in all settings. In some cases, namely WRN-40-2/WRN-16-2, ResNet110/ResNet32, and WRN-40-2/ShuffleNet-V1, the student's performance even surpasses that of the teacher.

Regarding layer-to-layer distillation, our method FAM-KD (layer-to-layer) outperforms all other methods belonging to the same category. In particular, it outperforms the strongest competitor, WCoRD [2], in all settings. Compared to WCoRD + KD [2], our method achieves competitive results even though we only use feature-based distillation. The highest improvement over WCoRD + KD [2] is 1.38% with the ResNet32x4/ShuffleNet-V1 setting. It is worth noting that even in the layer-to-layer setting, FAM-KD achieves results comparable to the current state-of-the-art feature distillation method using the review mechanism [19].

Regarding the knowledge review mechanism (FAM-KD (review)), we outperform the compared methods for all teacher-student distillation pairs. Our method outperforms ReviewKD [19] in all cases. It also outperforms DKD [36], which is a soft logit-based distillation method, in all settings; the highest improvement over DKD is 1.34% with the ResNet32x4/ShuffleNet-V2 setting. These results show the effectiveness of the FAM module in helping students perform better.

Comparative results on ImageNet.

We validate our approach on the large-scale dataset ImageNet [24] by integrating the FAM module into the knowledge review distillation mechanism (FAM-KD). Table 3 presents the top-1 and top-5 classification accuracy on the ImageNet validation set of various distillation methods. When the teacher and student have the same architecture, we employ ResNet34/ResNet18 as the teacher/student pair. When the teacher and student have different architectures, we use ResNet50/MobileNetV1 as the teacher/student pair. Our approach yields the highest performance on both top-1 and top-5 accuracy. For ResNet34/ResNet18, compared to the vanilla KD [7], FAM-KD improves top-1 accuracy by a large margin of 1.25% (71.91% vs. 70.66% in Table 3). Compared to ReviewKD [19] and DKD [36], FAM-KD improves top-1 accuracy by 0.30% and 0.21%, respectively. The relative improvements over DKD and ReviewKD are considerable at 20.2% and 31.6%, respectively. (Footnote 1: Similar to [2], we compute the relative improvement as $\frac{\mathrm{Ours}-\mathrm{A}}{\mathrm{A}-\mathrm{KD}}$, where $\mathrm{A}$ is the method we are comparing to. For each method, the corresponding accuracy of the student is used for the calculation.) For ResNet50/MobileNetV1, our approach yields significant top-1 improvements of 1.28% and 0.77% over DKD and ReviewKD, and the corresponding relative improvements over DKD and ReviewKD are 36.9% and 19.3%, respectively.
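The relative-improvement figures quoted above can be reproduced from the top-1 student accuracies in Table 3. The following is a small illustrative check; the function and variable names are our own and not part of any released code.

```python
def relative_improvement(ours, other, kd):
    """Relative improvement (Ours - A) / (A - KD), expressed as a percentage."""
    return 100.0 * (ours - other) / (other - kd)

# Top-1 student accuracies (%) taken from Table 3.
resnet18 = {"KD": 70.66, "DKD": 71.70, "ReviewKD": 71.61, "FAM-KD": 71.91}
mobilenet = {"KD": 68.58, "DKD": 72.05, "ReviewKD": 72.56, "FAM-KD": 73.33}

for name, acc in (("ResNet34/ResNet18", resnet18), ("ResNet50/MobileNetV1", mobilenet)):
    for method in ("DKD", "ReviewKD"):
        rel = relative_improvement(acc["FAM-KD"], acc[method], acc["KD"])
        print(f"{name} vs {method}: {rel:.1f}%")
# ResNet34/ResNet18 vs DKD: 20.2%
# ResNet34/ResNet18 vs ReviewKD: 31.6%
# ResNet50/MobileNetV1 vs DKD: 36.9%
# ResNet50/MobileNetV1 vs ReviewKD: 19.3%
```

The denominator measures how far the compared method already improves on vanilla KD, so the same absolute gain counts for more when the compared method is only slightly better than KD.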

Figure 4 visualizes Grad-CAMs [27] (gradient-weighted class activation mapping) extracted from layer 9 of ResNet18, which serves as the student architecture, with ResNet34 used as the teacher. The figure shows that the Grad-CAM obtained with our FAM-KD (e) focuses better on the object than those obtained with OFD [6] and knowledge review [19].

4.2.2 Object detection

Table 4 presents the object detection accuracy on the MS COCO dataset. We use Faster R-CNN [22] with FPN [13] as the detector, with the teacher/student pairs ResNet101/ResNet18 and ResNet101/ResNet50 as backbones. The results show that our method FAM-KD consistently outperforms ReviewKD [19] and the recent work DKD [36] in both settings on all metrics. With the teacher/student pairs ResNet101/ResNet18 and ResNet101/ResNet50, the proposed method outperforms DKD by 2.15 and 1.52 AP points, respectively. In [36], to boost the detection accuracy of the student, the authors combine their soft logit-based distillation DKD with the knowledge review-based distillation [19]. Compared to DKD+ReviewKD [36], although we only use intermediate feature-based distillation, our method is better on most metrics, the exception being AP_50 in the ResNet101/ResNet50 setting.

4.3 Ablation studies

Setting Global branch Local branch Top-1
(a) 73.90
(b) 74.17
(c) 74.45
Table 5: Impact of the global branch and local branch of the FAM. ResNet110 and ResNet32 are used as the teacher and the student, respectively. The results are on the CIFAR-100 validation set.

In this section, we investigate how different components of the FAM module contribute to the performance of FAM-KD. All experiments are conducted on the CIFAR-100 dataset with ResNet110 as the teacher and ResNet32 as the student, with the FAM module integrated into the knowledge review-based mechanism as presented in Section 3.2.2. Additional experiments that examine the effectiveness of the FAM module when it is integrated into ReviewKD [19], using the ReviewKD implementation in the public distiller code-base [36], are provided in the supplementary materials.

With/without global and local branches in FAM.

In Table 5, we present the performance of FAM-KD with and without the global branch and the local branch. The results show that the global branch benefits the model more than the local one, and that having both branches gives the best result.

Setting Top-1
FAM-KD (w/o HPF) 73.82
FAM-KD 74.45
Table 6: Effect of the high pass filter (HPF) component in global branch of the FAM. ResNet110 and ResNet32 are used as the teacher and the student, respectively. The results are on the CIFAR-100 validation set.
With/without high pass filter (HPF) in the global branch.

The effectiveness of the HPF is presented in Table 6: having the HPF boosts the top-1 accuracy by 0.63% (from 73.82% to 74.45%). This shows the benefit of the HPF, which helps to filter out the lowest frequency components and encourages the student to de-focus from the non-salient regions.
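To give some intuition for what removing the lowest frequency does, the sketch below applies a naive 2D discrete Fourier transform to a tiny feature map, zeroes only the zero-frequency (DC) bin, and transforms back. This is an illustration under stated assumptions, not the HPF used in FAM: the function names are hypothetical, and the actual filter may suppress more than the single DC bin.

```python
import cmath

def dft2(x):
    """Naive 2D discrete Fourier transform of an H x W grid (list of rows)."""
    h, w = len(x), len(x[0])
    return [[sum(x[m][n] * cmath.exp(-2j * cmath.pi * (u * m / h + v * n / w))
                 for m in range(h) for n in range(w))
             for v in range(w)] for u in range(h)]

def idft2(spectrum):
    """Inverse of dft2, including the 1 / (H * W) normalization."""
    h, w = len(spectrum), len(spectrum[0])
    return [[sum(spectrum[u][v] * cmath.exp(2j * cmath.pi * (u * m / h + v * n / w))
                 for u in range(h) for v in range(w)) / (h * w)
             for n in range(w)] for m in range(h)]

def remove_lowest_frequency(x):
    """Zero the DC bin (assumed stand-in for a high pass filter) and invert."""
    spectrum = dft2(x)
    spectrum[0][0] = 0  # the zero-frequency term carries the spatial mean
    return [[value.real for value in row] for row in idft2(spectrum)]

# Toy 2 x 2 feature map (hypothetical values) with spatial mean 3.0.
feature = [[1.0, 2.0], [3.0, 6.0]]
filtered = remove_lowest_frequency(feature)
```

Because the zero-frequency coefficient equals the sum over all spatial positions, removing it subtracts the spatial mean from every position, so the filtered map here is approximately `[[-2.0, -1.0], [0.0, 3.0]]`. Uniform background activation is suppressed while spatial variation is kept.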

5 Conclusion

In this paper, we propose to use the frequency domain to encourage the student model to capture both detailed and higher-level information such as object parts based on a well-trained teacher’s guidance. We introduce a novel frequency attention module (FAM) for knowledge distillation that operates in the frequency domain and has a filter that can be adjusted to mimic the teacher’s features. This encourages the student’s features to have similar geometric structures to the teacher’s features. Moreover, we propose an enhanced knowledge review-based distillation by leveraging the proposed FAM and cross attention. We extensively evaluate our approach with different teacher and student models, and the proposed approach achieves significant improvements compared to other state-of-the-art methods for image classification on the CIFAR-100 and ImageNet datasets and for object detection on the MS COCO dataset.

References

  • [1] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D. Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In CVPR, 2019.
  • [2] Liqun Chen, Dong Wang, Zhe Gan, Jingjing Liu, Ricardo Henao, and Lawrence Carin. Wasserstein contrastive representation distillation. In CVPR, 2021.
  • [3] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing (3rd Edition). Prentice-Hall, Inc., USA, 2006.
  • [4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [5] Yihui He, Xiangyu Zhang, and Jian Sun. Channel Pruning for Accelerating Very Deep Neural Networks. In ICCV, 2017.
  • [6] Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. In ICCV, 2019.
  • [7] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop, 2014.
  • [8] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 2017.
  • [9] Mingi Ji, Byeongho Heo, and Sungrae Park. Show, attend and distill: Knowledge distillation via attention-based feature matching. In AAAI, 2021.
  • [10] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
  • [11] Nikos Komodakis and Sergey Zagoruyko. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
  • [12] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [13] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • [15] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.
  • [16] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In CVPR, 2019.
  • [17] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, 2019.
  • [18] Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In ECCV, 2018.
  • [19] Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. Distilling knowledge via knowledge review. In CVPR, 2021.
  • [20] Cuong Pham, Tuan Hoang, and Thanh-Toan Do. Collaborative multi-teacher knowledge distillation for learning low bit-width deep neural networks. In WACV, pages 6435–6443, 2023.
  • [21] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. NIPS, 2019.
  • [22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS, 2015.
  • [23] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
  • [24] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • [25] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
  • [26] Bharat Bhusan Sau, Soumya Roy, Vinay P Namboodiri, and Raghu Sesha Iyengar. Deep knowledge distillation using trainable dense attention. 2021.
  • [27] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [28] Sungho Shin, Joosoon Lee, Junseok Lee, Yeonguk Yu, and Kyoobin Lee. Teaching where to look: Attention similarity knowledge distillation for low resolution face recognition. In ECCV, 2022.
  • [29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [30] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020.
  • [31] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In ICCV, 2019.
  • [32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS, 30, 2017.
  • [33] Tao Wang, Li Yuan, Xiaopeng Zhang, and Jiashi Feng. Distilling object detectors with fine-grained feature imitation. In CVPR, 2019.
  • [34] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
  • [35] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. In ECCV, 2018.
  • [36] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In CVPR, 2022.
  • [37] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. CoRR, 2016.