CSAKD: Knowledge Distillation with Cross Self-Attention for Hyperspectral and Multispectral Image Fusion

This study was supported partly by the National Science and Technology Council (NSTC), Taiwan, under Grants NSTC 110-2636-E-006-026, 110-2222-E-006-012, 111-2634-F-007-002, 110-2218-E-006-026, and 111-2221-E-003-019-MY3. (Corresponding author: Li-Wei Kang.) C.-C. Hsu and C.-M. Lee are with the Institute of Data Science and the Miin Wu School of Computing, National Cheng Kung University, Tainan, Taiwan (R.O.C.). C.-C. Ni and L.-W. Kang are with the Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan (R.O.C.).

Chih-Chung Hsu, Chih-Chien Ni, Chia-Ming Lee, and Li-Wei Kang
Abstract

Hyperspectral imaging, which captures detailed spectral information for each pixel, is pivotal in diverse scientific and industrial applications. Yet the acquisition of high-resolution (HR) hyperspectral images (HSIs) is often hindered by the hardware limitations of existing imaging systems. A prevalent workaround is to capture both a high-resolution multispectral image (HR-MSI) and a low-resolution (LR) HSI and then fuse them to yield the desired HR-HSI. Although deep learning-based methods have shown promise in HR-MSI/LR-HSI fusion and LR-HSI super-resolution (SR), their substantial model complexity hinders deployment on resource-constrained imaging devices. This paper introduces a novel knowledge distillation (KD) framework for HR-MSI/LR-HSI fusion to achieve SR of LR-HSI. Our KD framework integrates the proposed Cross-Layer Residual Aggregation (CLRA) block to efficiently construct a Dual Two-Streamed (DTS) network structure, designed to extract joint and distinct features from LR-HSI and HR-MSI simultaneously. To fully exploit the spatial and spectral feature representations of LR-HSI and HR-MSI, we propose a novel Cross Self-Attention (CSA) fusion module that adaptively fuses those features to improve the spatial and spectral quality of the reconstructed HR-HSI. Finally, the proposed KD-based joint loss function is employed to co-train the teacher and student networks. Our experimental results demonstrate that the student model not only achieves comparable or superior LR-HSI SR performance but also significantly reduces the model size and computational requirements relative to existing state-of-the-art methods. The source code is available at https://github.com/ming053l/CSAKD.

Index Terms:
hyperspectral image, multispectral image, image fusion, super-resolution, teacher-student model, knowledge distillation.

I Introduction

Figure 1: Brief illustration of the proposed CSAKD framework, which adaptively fuses the features of the LR-HSI and HR-MSI.

Hyperspectral imaging aims to capture information based on dense spectral sensing at each image pixel of a scene. Compared with conventional imaging modalities, hyperspectral images (HSIs) cover a wider spectral range, with the number of channels ranging from tens to hundreds. HSIs have been shown to enable a wide range of applications in industry, science, the military, agriculture, and medicine [1]. However, the severe hardware limitations of hyperspectral sensing systems, especially on miniaturized satellites, often prevent the spectral and spatial resolutions from both being sufficiently high. In practice, the general solution is to capture an image of high spatial resolution with a limited number of spectral bands. That is, existing sensing systems usually capture high-resolution (HR) multispectral images (MSIs), i.e., HR-MSIs, and low-resolution (LR) HSIs, i.e., LR-HSIs. To further enhance the spatial resolution of LR-HSI, super-resolution (SR) of LR-HSI, achieved by fusing HR-MSI and LR-HSI to obtain the corresponding HR-HSI, has been a promising direction in recent research [2, 3, 4].

Several traditional methods have been presented for LR-HSI and HR-MSI fusion (e.g., [5, 6]). For example, sparse representation-based [7] and low-rank-based [8] matrix decomposition-guided fusion frameworks were proposed to achieve feasible performance for the SR of LR-HSI. Benefiting from recent advances in deep learning (DL) for tasks such as image restoration [9, 10], image classification [11, 12], and object detection [13, 14], DL-based HR-MSI and LR-HSI fusion methods have recently been proposed to obtain better spectral and spatial quality of the reconstructed HR-HSI [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. However, current state-of-the-art DL-based HR-MSI/LR-HSI fusion methods may still suffer from high model complexity or insufficient image detail reconstruction because they do not fully exploit the spectral and spatial feature representations of both HR-MSI and LR-HSI.

To design a lightweight deep HR-MSI/LR-HSI fusion model that produces a sufficiently good HR-HSI from the input LR-HSI, we propose in this paper a knowledge distillation (KD)-based LR-HSI and HR-MSI fusion method to meet the growing demand for real-time applications, as power supply is becoming an emerging concern worldwide. In the proposed framework, we first train a sophisticated teacher network with excellent HR-MSI/LR-HSI fusion performance. Then, we distill the knowledge from the teacher network into a lightweight student network to achieve high-quality outcomes in both the spectral and spatial domains. To effectively guide the learning of the student network, a KD loss based on the response-based KD approach [25] is adopted to ensure high similarity between the feature maps generated by the teacher and student networks, thereby improving the performance of the student network. Since an oversimplified student network could harm the quality of the reconstructed HR-HSI, a good yet simple network architecture is essential. Moreover, it is well known that the feature representations of HR-MSI and LR-HSI can differ significantly from each other, implying that directly fusing them without dynamically determining the corresponding weights could restrict performance. Therefore, to fully exploit the spatial and spectral features of LR-HSI and HR-MSI without increasing the parameters of the teacher/student networks, we propose a novel Dual Two-Streamed (DTS) network based on our Cross-Layer Residual Aggregation (CLRA) block with the Cross-Self-Attention (CSA) fusion module, which judiciously extracts the spatial and spectral information needed to improve the quality of the reconstructed HR-HSI, as illustrated in Fig. 1. In this way, the proposed DTS network not only achieves state-of-the-art performance but also reduces the computational and space complexities simultaneously.
The major novelties and contributions of this paper are three-fold:

  • to the best of our knowledge, we are among the first to propose a response-based KD framework to learn a lightweight HR-MSI and LR-HSI fusion model;

  • the proposed DTS Network effectively incorporates the spatial and spectral features from LR-HSI/HR-MSI dynamically by using our CSA fusion module; and

  • the proposed method has been shown to outperform several state-of-the-art LR-HSI/HR-MSI fusion models in terms of different metrics.

The rest of this paper is organized as follows. Sec. II briefly introduces related works, including traditional frameworks for SR of LR-HSI, DL-based frameworks for SR of LR-HSI, and related KD techniques. Sec. III presents the proposed DTS network with the KD framework for learning a lightweight deep HR-MSI/LR-HSI fusion network. Sec. IV presents experimental results and ablation studies. Finally, Sec. V concludes this paper.

II Related Works

This section provides an overview of the methodologies developed for enhancing the spatial and spectral resolution of hyperspectral images. These methodologies have evolved from traditional techniques, which leverage sparse representations and low-rank matrix factorizations, to contemporary DL-based approaches that exploit the representational power of convolutional neural networks (CNNs) for superior fusion outcomes. Additionally, we discuss the emerging strategy of KD, which aims to improve model efficiency and facilitate deployment on resource-constrained devices.

II-A Optimization-based Approach

In [26], a pioneering MSI pan-sharpening framework was presented, where the goal is to fuse an LR-MSI and an HR panchromatic image (with a single band and high spatial resolution) of the same scene to generate an image with high spectral and spatial resolutions. Moreover, based on sparse or low-rank image priors of HSIs, several sparse representation-based or low-rank-based image fusion frameworks were presented for the SR of HSI or HR-MSI/LR-HSI fusion [7, 8, 27, 28, 29]. For example, in [7], an SR method for LR-HSI was proposed, where the prediction of the HR-HSI is formulated as a joint estimation of the HSI dictionary and the sparse codes, relying on the spatial-spectral sparsity of HSIs. In addition, a group spectral embedding-based HR-MSI/LR-HSI fusion method was presented in [8], where the manifold structures of spectral bands and the low-rank structure of HR-HSIs were explored. A spatial and spectral fusion model was also proposed in [27], using sparse matrix factorization to fuse remote sensing images of HR with low spectral resolution (similar to HR-MSI) and LR with high spectral resolution (similar to LR-HSI). An image fusion framework relying on spectral unmixing and sparse coding was similarly proposed in [28] to fuse HR-MSI and LR-HSI. Furthermore, a coupled sparse tensor factorization framework was presented in [29] for fusing HR-MSI and LR-HSI, where estimating the dictionaries and core tensor was formulated as a coupled tensor factorization problem. Since these traditional methods rely on image priors such as sparsity or low rank, their performance may degrade in real-world scenarios that do not fit these assumptions.
Moreover, optimization-based approaches usually require high-precision computation and are hard to parallelize (e.g., eigendecomposition is often used in optimization-based methods), which makes them difficult to deploy on moderate AI chips. Reducing the computational complexity while retaining promising performance is therefore highly desirable.

II-B Deep Learning-based Approach

DL-based strategies have shown promise in HR-MSI/LR-HSI fusion tasks. With the development of DL technology, such as the powerful representation learning ability of CNNs, several frameworks for SR of LR-HSI or for the fusion of HR-MSI and LR-HSI have recently been proposed. A 3-D CNN was used in [15] to fuse multispectral and hyperspectral images to generate an HR-HSI, where the dimensionality of the HSI was reduced prior to the fusion process to significantly reduce the computational complexity. A blind HR-MSI/LR-HSI fusion problem was formulated and solved based on DL in [16], where the estimation of the observation model and the fusion process are optimized iteratively and alternately during the SR reconstruction. In addition, an HSI reconstruction algorithm with a data-driven prior relying on an optimization-inspired DL was presented in [17], where the prior was learned based on both the local coherence and dynamic characteristics of HSIs. Moreover, an end-to-end DL network was proposed in [18] to jointly learn multi-scale spatial-spectral features for HR-MSI and LR-HSI fusion (denoted by MSSJFL). In addition, a lightweight deep model-based progressive zero-centric residual network (denoted by PZRes-Net) was presented in [19] for SR of HSI, where spectral-spatial separable convolution operations with dense connections were used to efficiently learn the residual image. In [20], a dual-UNet-based architecture with a multi-stage details injection strategy was presented for fusing HR-MSI and LR-HSI, where a multi-scale spatial-spectral attention module was utilized. Furthermore, a deep hyperspectral image fusion network (denoted by DHIF-Net) was proposed in [21], where an end-to-end optimization strategy of iterative spatial-spectral regularization was implemented. On the other hand, an unregistered and unsupervised mutual Dirichlet-Net was presented in [22] for SR of HSI. An interpretable deep neural network designed for HR-MSI/LR-HSI fusion was proposed in [23].
An interpretable deep model, named the spatial-spectral dual-optimization model-driven deep network, was also presented in [24] for HR-MSI/LR-HSI fusion.

However, since manually designing lightweight models can be tedious and their performance is hard to guarantee, we propose an effective network architecture (i.e., the DTS network) to ensure promising performance, and then apply a KD-based approach to reduce the computational and space complexity without significant performance degradation.

II-C Knowledge Distillation

Directly deploying a sophisticated network on low-power devices is infeasible because of their severely limited memory and computational resources. KD offers a solution by training efficient "student" models guided by complex "teacher" networks, aiming for the student to match or exceed the teacher's performance. This process involves strategic knowledge transfer, which can be categorized into response-based [30], feature-based [31], and relation-based [32] KD schemes.

Response-based KD focuses on emulating the teacher model’s final output, enabling the student model to learn directly from these predictions, as seen in [30]. Feature-based KD expands on this by using outputs from both the final and intermediate layers of the teacher model, enriching the student’s learning with deeper insights, exemplified by [31]. Relation-based KD, on the other hand, transfers inter-layer relationships to provide a nuanced understanding of model behaviors, as detailed in [32].
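As a rough illustration of the response-based scheme, the student can be penalized both for deviating from the ground truth and for deviating from the teacher's final output. The minimal Python sketch below assumes an L1 distance and a hypothetical balancing weight `alpha`; neither choice is taken from [30] or from the loss used in this paper.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def response_kd_loss(student_out, teacher_out, target, alpha=0.5):
    """Response-based KD sketch: task loss plus teacher-imitation loss.

    alpha is a hypothetical weight balancing the two terms.
    """
    task_loss = l1_distance(student_out, target)
    distill_loss = l1_distance(student_out, teacher_out)
    return (1.0 - alpha) * task_loss + alpha * distill_loss


# Toy flattened network outputs (illustrative values only).
target = [1.0, 2.0, 3.0, 4.0]
teacher = [1.0, 2.0, 3.0, 5.0]
student = [1.0, 2.0, 4.0, 5.0]
loss = response_kd_loss(student, teacher, target, alpha=0.5)
```

Only the final outputs of the two networks enter the loss here, which is what distinguishes the response-based scheme from the feature-based and relation-based ones.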

The application of KD to HSI processing tasks, including segmentation and pan-sharpening [33, 34], showcases its potential for HSI and MSI fusion with lower computational and space complexity. Directly adopting response-based KD could be sufficient to distill the knowledge of the teacher network into the student one. In this paper, we emphasize that an efficient and effective network architecture for the teacher and student models is essential, while KD is a means of further reducing complexity through knowledge transfer. Therefore, we do not focus on the selection of the KD framework in this study.

III Proposed Dual Two-Streamed Network via Cross-Self-Attention Fusion

Figure 2 gives an overview of the proposed lightweight deep network model for real-time HSI/MSI fusion. First, a complex network built on the proposed DTS (Dual Two-Streamed) network is used as the teacher network. Then, a reduced version of the teacher network, with fewer channels in each layer, is treated as the student network. In our network design, the proposed CSA (Cross Self-Attention) fusion module is essential for judiciously fusing the high-fidelity spatial and spectral quasi-fused (i.e., initially fused) results generated by the proposed DTS backbone network. This design, which applies different sampling rates in the spatial and spectral domains of the HSI and MSI, respectively, can effectively capture high-resolution spectral and spatial features simultaneously, thereby improving performance without increasing the network complexity. A standard KD (knowledge distillation) loss is then applied to train the teacher and student networks simultaneously. Finally, the student network can fuse the HSI and MSI in real time. The technical details are presented in the following subsections.

Figure 2: The proposed network architecture for HSI/MSI fusion based on the proposed Cross-Layer Residual Aggregation (CLRA) unit and Cross-Self-Attention (CSA) fusion module. With the proposed Dual Two-Streamed (DTS) network, our network can judiciously learn the spatial-spectral representations across different branches. The CSA module then enables the network to adaptively fuse these representations. With the proposed knowledge distillation (KD) scheme, the network not only maintains its performance but also reduces the model size to fit real-world scenarios.

III-A Network Architecture

To design a lightweight network model for efficient LR-HSI/HR-MSI fusion, we leverage knowledge distillation to reduce the network complexity. However, KD often requires the teacher and student to share the same network architecture, so the base architecture should itself be efficient and effective enough to leave room to be pruned. Our design is inspired by conventional ensemble learning in machine learning, where classification performance can be improved by combining multiple independent classifiers, each of which may be simple. Similarly, our DTS network inherits the advantages of ensemble learning, but instead of combining classifiers it aggregates the spectral and spatial information from the input LR-HSI/HR-MSI. Specifically, we fully exploit the spectral and spatial information of the inputs by designing four different sub-networks for information aggregation, improving the fusion performance with lower computational and space complexity.

This section illustrates the architecture design of our DTS network, consisting of spatial- and spectral-aware networks (SpaNet and SpeNet) for LR-HSI and HR-MSI, respectively, as shown in Figure 2. To effectively exploit the LR-HSI and HR-MSI information jointly, the proposed SpaNet and SpeNet not only retrieve the respective feature representations from LR-HSI and HR-MSI, but also extract the joint features of LR-HSI and HR-MSI simultaneously through internal feature concatenation, as shown in the left part of Figure 2. In this way, we can easily integrate the spatial and spectral feature representations without increasing the model complexity. Then, a novel CSA fusion module is proposed to judiciously aggregate the spatial and spectral feature representations to obtain the final HR-HSI. The details are given as follows.

III-A1 Proposed Dual Two-Streamed Network

This subsection explicates the network architecture design, starting from the HR-HSI, denoted as $\mathbf{Y}$. The observable LR-HSI is modeled as $\mathbf{X}_{h}=\mathbf{Y}\mathbf{B}$, where $\mathbf{B}$ is the blurring matrix that reduces the pixel count. The observable HR-MSI is represented as $\mathbf{X}_{m}=\mathbf{D}\mathbf{Y}$, with $\mathbf{D}$ being the downsampling matrix that reduces the number of spectral bands.
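To make the two degradations concrete, the following minimal Python sketch applies them to a toy example. It assumes that the HR-HSI is stored as a matrix with spectral bands as rows and flattened pixels as columns, so that multiplying by $\mathbf{B}$ on the right acts on pixels and multiplying by $\mathbf{D}$ on the left acts on bands. The averaging operators and all numbers are illustrative assumptions, not the operators used in the experiments.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Toy HR-HSI Y: 4 spectral bands (rows) x 4 flattened pixels (columns).
Y = [[1.0, 3.0, 5.0, 7.0],
     [2.0, 4.0, 6.0, 8.0],
     [0.0, 2.0, 4.0, 6.0],
     [1.0, 1.0, 3.0, 3.0]]

# Assumed spatial operator B (4 pixels -> 2 pixels): averages neighbouring pixel pairs.
B = [[0.5, 0.0],
     [0.5, 0.0],
     [0.0, 0.5],
     [0.0, 0.5]]

# Assumed spectral operator D (4 bands -> 2 bands): averages neighbouring band pairs.
D = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]

X_h = matmul(Y, B)  # LR-HSI: all 4 bands, only 2 pixels
X_m = matmul(D, Y)  # HR-MSI: only 2 bands, all 4 pixels
```

The LR-HSI keeps every band but loses pixels, whereas the HR-MSI keeps every pixel but loses bands, which is why the two observations are complementary.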

Let the LR-HSI and HR-MSI be ${\bm{X}}_{h}\in\mathbb{R}^{h_{h}\times w_{h}\times b}$ and ${\bm{X}}_{m}\in\mathbb{R}^{h\times w\times b_{m}}$, respectively. The reconstructed HR-HSI ${\bm{Y}}^{*}\in\mathbb{R}^{h\times w\times b}$ is obtained by the proposed DTS network using

$${\bm{Y}}^{*}=f_{\text{DTS}}({\bm{X}}_{h},{\bm{X}}_{m};{\bm{W}}_{\text{DTS}}), \tag{1}$$

where ${\bm{W}}_{\text{DTS}}$ denotes the weights of the proposed DTS network. As mentioned previously, we separately sample the spatial and spectral features from ${\bm{X}}_{h}$ and ${\bm{X}}_{m}$ to obtain better fusion results. We start with spatial feature extraction for HR-MSI and LR-HSI. First, the high-spatial-resolution feature is obtained as follows:

$${\bm{Z}}_{m}=f_{\text{CLRA}}({\bm{X}}_{m}), \tag{2}$$

where $f_{\text{CLRA}}$ denotes the proposed Cross-Layer Residual Aggregation (CLRA) module, which is discussed later. Since fine-detail features must be learned from both LR-HSI and HR-MSI, we upsample the LR-HSI ${\bm{X}}_{h}$ to obtain ${\bm{X}}_{h}^{u}=f_{\text{up}}({\bm{X}}_{h})\in\mathbb{R}^{h\times w\times b}$, where $f_{\text{up}}$ denotes the bicubic interpolation function.

Then, the fused spatial draft ${\bm{Z}}_{hm}$ is obtained by

$${\bm{Z}}_{hm}=f_{\text{CLRA}}(\text{Cat.}({\bm{X}}_{h}^{u},{\bm{X}}_{m})), \tag{3}$$

where Cat. is the channel-wise feature map concatenation operation. In this way, we aggregate the HR-MSI and LR-HSI simultaneously in ${\bm{Z}}_{hm}$, thereby improving the spatial quality of the reconstructed HR-HSI ${\bm{Y}}^{*}$.

Meanwhile, the high-spectral-resolution information can be obtained in a similar manner. Specifically, the LR-HSI ${\bm{X}}_{h}$ carries rich spectral information that can be used to restore the spectra of the reconstructed HR-HSI ${\bm{Y}}^{*}$. Let ${\bm{Z}}_{h}$ denote the feature representation of ${\bm{X}}_{h}$, which is obtained by

$${\bm{Z}}_{h}=f_{\text{CLRA}}({\bm{X}}_{h};g_{c}), \tag{4}$$

where $g_{c}$ indicates the number of groups for the grouped convolution operator in the proposed CLRA (discussed later). Spectral redundancy in an HSI is relatively high, especially between adjacent spectral bands, so grouped convolution can be used to reduce complexity while maintaining performance. Since spectral information can still be extracted from the HR-MSI ${\bm{X}}_{m}$, we follow a similar protocol to retrieve the joint feature representation from both LR-HSI and HR-MSI by

$${\bm{Z}}_{mh}=f_{\text{CLRA}}(\text{Cat.}({\bm{X}}_{m}^{d},{\bm{X}}_{h})), \tag{5}$$

where ${\bm{X}}_{m}^{d}=f_{\text{down}}({\bm{X}}_{m})$ denotes the HR-MSI ${\bm{X}}_{m}$ spatially downsampled by the bicubic interpolation function $f_{\text{down}}$. In this way, high-quality spectral information can be reconstructed by merging $f_{\text{up}}({\bm{Z}}_{mh})$ and $f_{\text{up}}({\bm{Z}}_{h})$. The fused HR-HSI can then be obtained by a simple ensemble of the modalities:

$${\bm{Y}}^{*}=f_{\text{up}}({\bm{Z}}_{mh})+f_{\text{up}}({\bm{Z}}_{h})+{\bm{Z}}_{m}+{\bm{Z}}_{hm}. \tag{6}$$
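The complexity argument behind the grouped convolution in Eq. (4) can be checked with simple arithmetic: splitting the channels into $g_{c}$ groups divides the number of convolution weights by $g_{c}$. The Python sketch below uses hypothetical channel counts, kernel size, and group number chosen only for illustration; they are not the configuration of the proposed CLRA.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights (bias excluded) in a 2-D convolution with the given group count."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees in_channels / groups input channels.
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


# Hypothetical layer: 128 input and output channels, 3x3 kernels.
dense = conv_weight_count(128, 128, 3, groups=1)
grouped = conv_weight_count(128, 128, 3, groups=4)
reduction = dense / grouped
```

With these assumed sizes, four groups cut the weight count from 147456 to 36864, at the cost of no longer mixing information across groups within that layer.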

However, the different modalities ${\bm{Z}}_{mh}$, ${\bm{Z}}_{h}$, ${\bm{Z}}_{m}$, and ${\bm{Z}}_{hm}$ might conflict in their spatial or spectral features, which could suppress the performance. Moreover, retaining high quality of the reconstructed HR-HSI when the HR-MSI or LR-HSI is noisy is essential. If the HR-MSI or LR-HSI has been perturbed by random noise during transmission or by sensor noise, the quality of the reconstructed HR-HSI could be degraded significantly. Consider noise ${\bm{N}}\sim N(\mu,\sigma)$ with zero mean $\mu=0$ and standard deviation $\sigma$; the noisy LR-HSI is then ${\bm{X}}^{\prime}_{h}={\bm{X}}_{h}+{\bm{N}}$. In this case, the noise also propagates to the feature representations ${\bm{Z}}_{m}$, ${\bm{Z}}_{mh}$, and ${\bm{Z}}_{hm}$, so the fused HR-HSI becomes

$$
\begin{aligned}
{\bm{Y}}^{*} &= f_{\text{up}}({\bm{Z}}^{\prime}_{mh})+{\bm{Z}}^{\prime}_{hm}+{\bm{Z}}^{\prime}_{m}+f_{\text{up}}({\bm{Z}}_{h}) \\
&= [f_{\text{up}}({\bm{Z}}_{mh})+f_{\text{up}}({\bm{N}}_{mh})]+[{\bm{Z}}_{hm}+{\bm{N}}_{hm}] \\
&\quad +({\bm{Z}}_{m}+{\bm{N}}_{m})+f_{\text{up}}({\bm{Z}}_{h}),
\end{aligned} \tag{7}
$$

where ${\bm{N}}_{hm}$, ${\bm{N}}_{mh}$, and ${\bm{N}}_{m}$ represent the feature representations of the noise pattern ${\bm{N}}$. While adding noise during the training phase might enhance the robustness of the proposed DTS network, sharing equal weights across the four modalities still restricts performance. A better way to fuse these modalities is to use dynamic weights instead of equal weights for better performance and robustness, which motivates the proposed CSA fusion module.
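The limitation of equal weights can be seen in a toy version of the simple ensemble: when every branch has unit weight, noise injected into one branch reaches the output unattenuated. The Python sketch below is only illustrative. It uses single-band maps, perturbs only the HR-MSI branch feature, and replaces the bicubic $f_{\text{up}}$ with nearest-neighbour upsampling as a simple stand-in; all values are hypothetical.

```python
def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling of a 2-D list; a simple stand-in for bicubic f_up."""
    return [[img[r // factor][c // factor]
             for c in range(len(img[0]) * factor)]
            for r in range(len(img) * factor)]


def add_maps(*maps):
    """Element-wise sum of equally sized 2-D lists."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*maps)]


# Toy single-band branch outputs (illustrative values only).
Z_h = [[1.0]]                       # low resolution, from LR-HSI alone
Z_mh = [[2.0]]                      # low resolution, joint branch
Z_m = [[0.0, 1.0], [1.0, 0.0]]      # high resolution, from HR-MSI alone
Z_hm = [[1.0, 1.0], [1.0, 1.0]]     # high resolution, joint branch

# Equal-weight ensemble: every branch contributes with unit weight.
Y_star = add_maps(upsample_nearest(Z_mh, 2), upsample_nearest(Z_h, 2), Z_m, Z_hm)

# Perturb one branch with a hypothetical noise feature N_m.
N_m = [[0.5, -0.5], [0.25, 0.0]]
Y_noisy = add_maps(upsample_nearest(Z_mh, 2), upsample_nearest(Z_h, 2),
                   add_maps(Z_m, N_m), Z_hm)

# The output error equals the injected noise: nothing down-weights the noisy branch.
residual = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(Y_noisy, Y_star)]
```

A fusion rule with input-dependent weights could, in principle, assign a smaller weight to the perturbed branch, which is the behaviour the dynamic weighting is meant to provide.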

Figure 3: The proposed Cross Self-Attention (CSA) fusion module. The blue cube contains high-spatial-resolution information, and the other two contain relatively rich spectral information. The proposed attention module adaptively weights the different branches and fuses their representations.

III-A2 Proposed Cross-Self-Attention Fusion Module

To learn adaptive weights across the different feature modalities, we propose a novel attention module that fuses the feature representations of LR-HSI and HR-MSI judiciously, as shown in Figure 3. To reduce the computational cost of high-dimensional feature representations such as $\bm{Z}_{m}$ and $\bm{Z}_{hm}$, a simple bottleneck layer projects the feature maps to lower-dimensional ones. Let the reduced features of $f_{\text{up}}(\bm{Z}_{h})$, $\bm{Z}_{m}$, $f_{\text{up}}(\bm{Z}_{mh})$, and $\bm{Z}_{hm}$ be denoted by $\bm{Q}$, $\bm{K}$, $\bm{V}$, and $\bm{Z}_{hm}^{r}$, respectively. The projection is defined as follows:

$$\begin{aligned}
\bm{Q}_{i} &= \text{Proj.}(\bm{Q}), \\
\bm{K}_{i} &= \text{Proj.}(\bm{K}), \\
\bm{V}_{i} &= \text{Proj.}(\bm{V}), \\
\bm{Z}_{hm}^{r} &= \text{Proj.}(\bm{Z}_{hm}),
\end{aligned} \tag{8}$$

where $\text{Proj.}(\cdot)$ consists of multiple stages that project the input feature into a lower-dimensional space. First, a $1\times 1$ convolution projects $\bm{X}\in\mathbb{R}^{b\times c\times h\times w}$ into $\bm{X}^{\prime}\in\mathbb{R}^{b\times r\times h\times w}$, where $r$ is the reduced dimension. To enable multi-head attention in CSA, we reshape $\bm{X}^{\prime}$ into a feature vector of size $b\times h_{a}\times r_{o}$, where $r_{o}=r/h_{a}$ and $h_{a}$ denotes the number of attention heads. We can now perform the cross-attention by

$$\begin{aligned}
\bm{A}_{i} &= \text{softmax}\left(\frac{\bm{Q}_{i}\cdot\bm{K}_{i}^{T}}{\sqrt{r}}\right), \\
\bm{O} &= \text{Cat.}(\bm{A}_{0}\cdot\bm{V}_{0},\bm{A}_{1}\cdot\bm{V}_{1},\dots,\bm{A}_{h_{a}}\cdot\bm{V}_{h_{a}}), \\
\bm{O} &= \text{Proj.}_{O}(\bm{O})+\bm{Z}_{hm}^{r}, \\
\bm{C} &= \text{Cat.}(\bm{Q},\bm{K},\bm{V},\bm{O}), \\
\bm{W} &= \text{Sigmoid}(\text{Proj.}_{C}(\bm{C})),
\end{aligned} \tag{9}$$

where $\text{Proj.}_{O}$ projects the concatenated multi-head attention outputs to the same dimension as $\bm{Z}_{hm}^{r}\in\mathbb{R}^{b\times c\times h\times w}$, and $\text{Proj.}_{C}$ projects the concatenated cross-attention features to the adaptive weights $\bm{W}$ of size $b\times 4\times h\times w$. In this way, the different modalities are fused judiciously, and the proposed CSA remains robust even under noisy inputs owing to its adaptivity, as follows:

$$\bm{Z}_{\text{fused}}=\bm{W}_{1}\cdot\bm{Q}+\bm{W}_{2}\cdot\bm{K}+\bm{W}_{3}\cdot\bm{V}+\bm{W}_{4}\cdot\bm{O}, \tag{10}$$

where $\bm{W}_{i}$ denotes the $i$-th channel of $\bm{W}$. Finally, the reconstructed HR-HSI is obtained via a simple convolution layer as $\bm{Y}^{*}=\text{Conv}_{\text{HR}}(\bm{Z}_{\text{fused}})$.
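The data flow of Eqs. (8)-(10) can be illustrated with a minimal NumPy sketch. It is not the released implementation: it assumes a single image whose pixels are flattened into $n=h\times w$ rows, attention computed across pixels within each head, and random stand-in weights; the names `split_heads`, `csa_fuse`, and the `params` keys are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def split_heads(x, weight, heads):
    """Apply a per-pixel linear map (what a 1x1 convolution does) and split into heads."""
    n = x.shape[0]
    projected = x @ weight                                    # (n, r)
    r_o = projected.shape[1] // heads
    return projected.reshape(n, heads, r_o).transpose(1, 0, 2)  # (heads, n, r_o)

def csa_fuse(q, k, v, z_hm_r, params, heads):
    """q, k, v, z_hm_r: (n, c) arrays, one row per pixel. Assumed layout, see lead-in."""
    n = q.shape[0]
    r = params["wq"].shape[1]
    qi = split_heads(q, params["wq"], heads)
    ki = split_heads(k, params["wk"], heads)
    vi = split_heads(v, params["wv"], heads)
    attn = softmax(qi @ ki.transpose(0, 2, 1) / np.sqrt(r))   # (heads, n, n), Eq. (9)
    o = (attn @ vi).transpose(1, 0, 2).reshape(n, r)           # concatenate the heads
    o = o @ params["wo"] + z_hm_r                              # Proj_O plus residual
    cat = np.concatenate([q, k, v, o], axis=1)                 # (n, 4c)
    w = sigmoid(cat @ params["wc"])                            # (n, 4) adaptive weights
    fused = w[:, 0:1] * q + w[:, 1:2] * k + w[:, 2:3] * v + w[:, 3:4] * o  # Eq. (10)
    return fused, w

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
n, c, r, heads = 16, 8, 4, 2
shapes = {"wq": (c, r), "wk": (c, r), "wv": (c, r), "wo": (r, c), "wc": (4 * c, 4)}
params = {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}
q, k, v, z_hm_r = (rng.standard_normal((n, c)) for _ in range(4))
fused, w = csa_fuse(q, k, v, z_hm_r, params, heads)
```

Each pixel receives its own four weights, so a branch that is unreliable at one location can be down-weighted there without affecting other locations.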

III-A3 Cross-Layer Residual Aggregation Module

Figure 4: Overview of the proposed CLRA. Each CLRA contains three CLR blocks (CLRB) and a residual connection. Each CLRB stacks several convolution layers with LeakyReLU activations, dense connections, and residual connections.

Designing an effective and efficient block to capture the spatial and spectral features of HSI is essential. This subsection describes the basic block design of our CLRA, as shown in Figure 4. The proposed CLRA module inherits the advantages of the block designed in DCSN [35]: residual connections and densely connected feature concatenation are adopted to enlarge the receptive field within a single CLR block, as shown in the bottom part of Figure 4. By aggregating three CLR blocks with a residual connection between the input and output features, we form the basic CLRA module, as shown in the top part of Figure 4. Considering that the high-spectral-resolution input, i.e., $\bm{X}_{h}$, has high redundancy between successive bands, grouped convolution with group number $g_{c}$ is adopted in our CLRA for $\bm{X}_{h}$, while the other inputs use standard convolution. Note that the merged high-spectral-resolution inputs, i.e., $\bm{X}_{hm}$ and $\bm{X}_{mh}$, also adopt standard convolution, since useful information may exist between $\bm{X}_{h}$ and $\bm{X}_{m}$.
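To make the saving from grouped convolution concrete, the following sketch counts the learnable parameters of a single convolution layer using the standard formula $(c_{\text{in}}/g)\cdot c_{\text{out}}\cdot k^{2}$ plus biases. The channel count of 172, the $3\times 3$ kernel, and the group number 4 are illustrative assumptions, not values stated for the CLRA.

```python
def conv_params(c_in, c_out, kernel, groups=1, bias=True):
    """Number of learnable parameters of a 2-D convolution layer."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)

# Assumed sizes: 172 input/output channels, 3x3 kernel, hypothetical g_c = 4.
standard = conv_params(172, 172, 3)
grouped = conv_params(172, 172, 3, groups=4)
print(standard, grouped)
```

With these assumed sizes, the grouped layer needs roughly a quarter of the weights of the standard one, because each output channel only sees the input channels of its own group.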

On the other hand, to keep the inference latency low, relatively shallow networks are constructed for the four input branches. In our teacher model, the CLRA is stacked 6, 6, 4, and 4 times to extract the features $\bm{Z}_{hm}$, $\bm{Z}_{mh}$, $\bm{Z}_{h}$, and $\bm{Z}_{m}$, respectively. In contrast, the student model stacks the CLRA 1, 4, 4, and 1 times to reduce its computational complexity.
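The stacking scheme can be sketched in plain Python as follows. The CLR block itself is replaced by an arbitrary callable, the helper names are hypothetical, and the student depths are assumed to follow the same branch order as the teacher depths.

```python
# Number of stacked CLRA modules per branch (student order assumed to match the teacher's).
TEACHER_DEPTH = {"hm": 6, "mh": 6, "h": 4, "m": 4}
STUDENT_DEPTH = {"hm": 1, "mh": 4, "h": 4, "m": 1}

def clra(x, blocks):
    """Apply the CLR blocks in sequence and add the input back (residual connection)."""
    out = x
    for block in blocks:
        out = block(out)
    return x + out

def build_branch(depth, make_block):
    """Stack `depth` CLRA modules; `make_block` is a stand-in factory for one CLR block."""
    modules = [[make_block() for _ in range(3)] for _ in range(depth)]

    def forward(x):
        for blocks in modules:
            x = clra(x, blocks)
        return x

    return forward

# Toy check with a scalar "feature" and a block that halves its input.
branch = build_branch(2, lambda: (lambda x: 0.5 * x))
print(branch(1.0), sum(TEACHER_DEPTH.values()), sum(STUDENT_DEPTH.values()))
```

Under these depths the teacher uses 20 CLRA modules in total and the student 10.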

III-B Joint Training via Knowledge Distillation

Traditional knowledge distillation (KD) techniques often employ a feature map-based loss, where the student network is trained to mimic the intermediate feature representations of the teacher network. This method, while effective in some scenarios, imposes a stringent requirement on the student network to replicate the feature maps exactly as the teacher’s. Such a constraint can limit the learning capacity of the student network, particularly when the student’s architecture is much lighter, and may lead to difficulties in convergence due to the complex nature of the feature spaces involved.

The feature map-based KD loss assumes that a direct correspondence between the teacher and student feature maps is necessary for knowledge transfer. However, this can be overly restrictive, as the student network might benefit from developing its unique feature representations that are more suited to its capacity, yet still retain the essential characteristics learned by the teacher. The forced alignment of feature maps can, therefore, be counterproductive, leading to a challenging training process and potentially suboptimal student performance.

To address these issues, the feature-map KD loss is applied to only a few layers rather than to every layer, which allows the student network to learn its own feature representations in most layers and thereby improves its performance. Specifically, a Sigmoid cross-entropy loss is used to match the feature map distributions of the student and teacher networks, as follows:

$$\begin{aligned}
\ell_{\text{KD}} = -\Big( & f_{\text{s}}(\bm{Z}_{\text{fused}}^{s})\log\big(f_{\text{s}}(\bm{Z}_{\text{fused}}^{t})\big) \\
& + \big(1-f_{\text{s}}(\bm{Z}_{\text{fused}}^{s})\big)\log\big(1-f_{\text{s}}(\bm{Z}_{\text{fused}}^{t})\big)\Big),
\end{aligned} \tag{11}$$

where $f_{\text{s}}$ denotes the Sigmoid activation function, and $\bm{Z}_{\text{fused}}^{s}$ and $\bm{Z}_{\text{fused}}^{t}$ represent the fused feature maps from the student and teacher networks. To ensure good quality of the reconstructed HR-HSI, reconstruction-related loss functions are also included to enhance the spectral and spatial quality. Traditionally, the $\ell_{1}$-norm distance is used to enforce data fidelity. However, the energy of each band in an HSI may vary significantly, so the traditional $\ell_{\text{L1}}$ distance tends to pay more attention to the bands whose energy is relatively large. Since each band serves its own purpose, the spectral features of all bands are essential for different tasks. Therefore, we propose a Band-Energy-Balance-Aware (BEBA) loss $\ell_{\text{BEBA}}$ to address this problem, thereby improving the spectral quality of the reconstructed HSI $\bm{Y}^{*}$.

$$\ell_{\text{BEBA}}=\frac{f_{\text{m}}\left(\alpha\bm{D}/\beta+f_{\text{ReLU}}(\bm{D}-\beta)-\alpha\beta\right)}{f_{\text{m}}(\bm{Y}^{2}+\epsilon)}, \tag{12}$$

where $\alpha$ and $\beta$ are regularization parameters, $\bm{D}$ denotes the squared absolute difference between the prediction and the target, $|\bm{Y}^{*}-\bm{Y}|^{2}$, $\epsilon$ is a small positive constant, and $f_{\text{m}}$ denotes the mean operator over the spatial axes. Specifically, $\alpha=0.5$ and $\beta=1$ are chosen in our experiments. In this way, $f_{\text{m}}(\bm{Y}^{2}+\epsilon)$ captures the energy of each band, so that the weight of each band is adjusted dynamically according to its energy.

The parameters $\alpha$ and $\beta$ play crucial roles in balancing the sensitivity of the loss function towards small and large prediction errors. The term $\alpha$ primarily scales the mean squared error, enhancing the function's reactivity to smaller deviations between the predicted $\bm{Y}^{*}$ and the ground truth HR-HSI $\bm{Y}$. This scaling is particularly significant when dealing with data that possess subtle variations, as it amplifies the importance of minor discrepancies.

The parameter $\beta$, on the other hand, serves as a thresholding value that delineates the boundary between small and large errors. When the squared difference $\bm{D}$ is less than $\beta$, the ReLU term $f_{\text{ReLU}}(\bm{D}-\beta)$ becomes zero, and the loss function primarily operates in a quadratic regime dominated by $\alpha\bm{D}/\beta$. This regime is sensitive to smaller errors, thus ensuring precision in the predictions. Conversely, for larger errors where $\bm{D}$ exceeds $\beta$, the loss function transitions into a linear regime, mitigating the potential issues of gradient explosion typically associated with large errors in quadratic loss functions. This linear portion of the loss function is given by $f_{\text{ReLU}}(\bm{D}-\beta)-\alpha\beta$, which acts as a safeguard against the disproportionate penalization of large errors, enhancing the robustness of the model against outliers and noise.
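Eq. (12) can be evaluated literally with a few lines of NumPy. The sketch below assumes inputs of shape (bands, H, W), an illustrative $\epsilon=10^{-8}$, and a final mean over bands to obtain a scalar; the last two choices are assumptions, since Eq. (12) does not state them.

```python
import numpy as np

def beba_loss(pred, target, alpha=0.5, beta=1.0, eps=1e-8):
    """Literal evaluation of Eq. (12) for arrays of shape (bands, H, W)."""
    d = np.abs(pred - target) ** 2                              # squared difference D
    numer = alpha * d / beta + np.maximum(d - beta, 0.0) - alpha * beta
    band_numer = numer.mean(axis=(1, 2))                        # f_m over the spatial axes
    band_energy = (target ** 2 + eps).mean(axis=(1, 2))         # per-band energy
    return float((band_numer / band_energy).mean())             # assumed: average over bands

# Two bands with very different energy and the same absolute error.
target = np.stack([np.full((2, 2), 1.0), np.full((2, 2), 10.0)])
pred = target + 2.0
print(beba_loss(pred, target))
```

In this toy case both bands have the same numerator (4.5), but the low-energy band contributes 4.5 and the high-energy band only 0.045, which is the balancing effect described above. Note that, evaluated as written, the constant $-\alpha\beta$ makes the value negative when the error is zero.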

To further enhance the spectral quality of the reconstructed HR-HSI, a Spectral Angle Mapper (SAM) loss $\ell_{\text{SAM}}$ is also adopted to guide our teacher and student networks, as follows:

$$\ell_{\text{SAM}}=1-\frac{1}{HW}\sum_{n=1}^{HW}\left(\frac{(\bm{Y}_{n})^{T}\bm{Y}^{*}_{n}}{\|\bm{Y}_{n}\|_{2}\cdot\|\bm{Y}^{*}_{n}\|_{2}+\epsilon}\right), \tag{13}$$

where $\bm{Y}_{n}$ denotes the $n$-th spectral vector. The SAM loss is one minus the mean cosine similarity between the spectral vectors of the reconstructed HR-HSI $\bm{Y}^{*}$ and those of the ground truth HR-HSI $\bm{Y}$. This measure effectively captures the angular difference between the spectral signatures in the hyperspectral data, making it a robust metric for assessing the spectral fidelity of the predicted image in comparison to the ground truth. The cosine similarity is computed as the dot product of the vectors, normalized by the product of their magnitudes, ensuring that the loss function focuses solely on the angular difference, independent of the magnitude of the spectral signatures.
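A minimal NumPy sketch of Eq. (13) is given below, assuming inputs of shape (bands, H, W) and an illustrative $\epsilon=10^{-8}$; the function name is hypothetical.

```python
import numpy as np

def sam_loss(pred, target, eps=1e-8):
    """One minus the mean per-pixel cosine similarity, following Eq. (13)."""
    bands = pred.shape[0]
    p = pred.reshape(bands, -1)       # each column is the spectrum of one pixel
    t = target.reshape(bands, -1)
    dot = (p * t).sum(axis=0)
    norms = np.linalg.norm(p, axis=0) * np.linalg.norm(t, axis=0)
    return float(1.0 - (dot / (norms + eps)).mean())

rng = np.random.default_rng(0)
target = rng.uniform(0.1, 1.0, size=(5, 4, 4))
same_shape_brighter = 3.0 * target    # same spectral shape, different magnitude
print(sam_loss(target, target), sam_loss(same_shape_brighter, target))
```

Both calls return a value close to zero, which illustrates that the loss depends only on the direction of each spectral vector and not on its magnitude.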

Finally, the standard reconstruction loss, i.e., the $\ell_{1}$-norm loss $\ell_{\text{L1}}$, is used to ensure the high quality of the reconstructed HSI. Thus, the total loss of the teacher network is

$$\begin{aligned}
\ell_{\text{t}} = {} & \ell_{\text{L1}}(\bm{Y}^{t},\bm{Y})+\lambda_{1}\ell_{\text{BEBA}}(\bm{Y}^{t},\bm{Y}) \\
& +\lambda_{2}\ell_{\text{SAM}}(\bm{Y}^{t},\bm{Y}),
\end{aligned} \tag{14}$$

where the superscript $t$ in $\bm{Y}^{t}$ denotes the HR-HSI reconstructed by our teacher network, and $\lambda_{1}$ and $\lambda_{2}$ are parameters that control the balance between the spatial fidelity and spectral quality terms. Likewise, the total loss of the student network is defined by the reconstruction losses and the KD loss, as follows:

$$\begin{aligned}
\ell_{\text{s}} = {} & \ell_{\text{L1}}(\bm{Y}^{s},\bm{Y})+\lambda_{1}\ell_{\text{BEBA}}(\bm{Y}^{s},\bm{Y}) \\
& +\lambda_{2}\ell_{\text{SAM}}(\bm{Y}^{s},\bm{Y})+\lambda_{3}\ell_{\text{KD}}(\bm{Y}^{s},\bm{Y}) \\
& +\lambda_{4}\ell_{\text{L1}}(\bm{Y}^{s},\bm{Y}^{t}),
\end{aligned} \tag{15}$$

where $\ell_{\text{L1}}(\bm{Y}^{s},\bm{Y}^{t})$ aims to relax the constraint of $\ell_{\text{L1}}(\bm{Y}^{s},\bm{Y})$, since the output of the lightweight student network might be hard to match to the ground truth accurately. All the balance parameters, $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$, are set to $0.1$.
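The way the individual terms combine in Eqs. (14) and (15) can be sketched as follows. This is an illustration under assumptions: the KD term is evaluated on the fused feature maps as defined in Eq. (11), every loss is reduced by a mean (the reduction is not specified above), the BEBA and SAM terms are passed in as callables, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kd_loss(z_student, z_teacher, eps=1e-12):
    """Sigmoid cross-entropy between fused feature maps, following Eq. (11)."""
    s = sigmoid(z_student)
    t = np.clip(sigmoid(z_teacher), eps, 1.0 - eps)   # keep the logarithms finite
    return float(-(s * np.log(t) + (1.0 - s) * np.log(1.0 - t)).mean())

def l1_loss(a, b):
    return float(np.abs(a - b).mean())

def teacher_loss(y_t, y, beba, sam, lam=0.1):
    """Eq. (14) with lambda_1 = lambda_2 = lam."""
    return l1_loss(y_t, y) + lam * beba(y_t, y) + lam * sam(y_t, y)

def student_loss(y_s, y_t, y, z_s, z_t, beba, sam, lam=0.1):
    """Eq. (15) with all four balance parameters equal to lam."""
    return (l1_loss(y_s, y) + lam * beba(y_s, y) + lam * sam(y_s, y)
            + lam * kd_loss(z_s, z_t) + lam * l1_loss(y_s, y_t))

# Toy call with zero-valued stand-ins for the BEBA and SAM terms.
zero = lambda a, b: 0.0
y = np.zeros((3, 4, 4))
z = np.zeros((8, 4, 4))
print(student_loss(y, y, y, z, z, zero, zero))
```

In an online distillation setting, both losses would be computed in the same iteration so that the teacher and student are updated together.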

IV Experimental Results

IV-A Experiment Settings

IV-A1 Dataset Preparation and Synthesis of LR-HSI and HR-MSI

The dataset used for performance evaluation in this study was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor [36]. The collected dataset includes various natural landscapes from the US and Canada, such as cities, mountains, lakes, fields, and plants, captured between 2006 and 2011. The original HSI images were partitioned into non-overlapping sub-images of size $259\times 259$ pixels, with 224 spectral bands covering a wavelength range from 400 to 2500 nm. As suggested in [35], low-quality bands (1-10, 104-116, 152-170, and 215-224) were removed, resulting in HSI images with 172 spectral bands.
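The band bookkeeping above can be checked with a short script. It assumes the listed band ranges are 1-based and inclusive, and the helper name is hypothetical.

```python
# Low-quality band ranges quoted in the text (assumed 1-based, inclusive).
BAD_BAND_RANGES = [(1, 10), (104, 116), (152, 170), (215, 224)]

def kept_band_indices(total_bands=224, bad_ranges=BAD_BAND_RANGES):
    """Return the 0-based indices of the bands that remain after removal."""
    bad = set()
    for first, last in bad_ranges:
        bad.update(range(first, last + 1))
    return [band - 1 for band in range(1, total_bands + 1) if band not in bad]

kept = kept_band_indices()
print(len(kept))  # 224 - (10 + 13 + 19 + 10) = 172
```

The resulting index list can be used to slice the band axis of each sub-image before cropping.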

To simulate the image fusion experiments, Wald's protocol [37] was employed. The HR-HSI $\bm{Y}$ of size $256\times 256$ pixels was cropped from the top-left corner of each HSI sub-image. Two different downsampling matrices were used to synthesize multispectral images (MSI) $\bm{Y}_{m4}\in\mathbb{R}^{256\times 256\times 4}$ and $\bm{Y}_{m6}\in\mathbb{R}^{256\times 256\times 6}$ with four and six spectral bands, respectively. The downsampling matrix $\bm{D}_{4}\in\mathbb{R}^{4\times 172}$ approximately corresponds to Landsat TM bands 1-4 (covering 450-520, 520-600, 630-690, and 770-900 nm), while the downsampling matrix $\bm{D}_{6}\in\mathbb{R}^{6\times 172}$ roughly corresponds to Landsat TM bands 1-5 and 7 (covering 450-520, 520-600, 630-690, 770-900, 1550-1750, and 2090-2350 nm).
A Gaussian point spread function with a variance of $\sigma=3$ and a blurring factor of $b_{r}=4$ was used to generate a spatial degradation matrix $\bm{B}\in\mathbb{R}^{L^{2}\times L_{l}^{2}}$ for synthesizing LR-HSIs. The LR-HSI $\bm{Y}_{h1}\in\mathbb{R}^{64\times 64\times 172}$ was obtained by applying the spatial degradation matrix $\bm{B}$ to the HR-HSI $\bm{Y}$.
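The two degradations can be sketched in NumPy as shown below. Several details are assumptions made only for illustration: the $9\times 9$ kernel size, edge padding, uniform averaging inside each Landsat TM wavelength range, and evenly spaced band centres as a stand-in for the true AVIRIS wavelengths. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size, variance):
    """Normalised 2-D Gaussian kernel; `variance` is sigma squared."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * variance))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def spatial_degrade(hsi, factor=4, variance=3.0, size=9):
    """Blur every band of an (H, W, bands) image and keep every `factor`-th pixel."""
    kernel = gaussian_kernel(size, variance)
    pad = size // 2
    padded = np.pad(hsi, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w, bands = hsi.shape
    out = np.zeros((h // factor, w // factor, bands))
    for i in range(h // factor):
        for j in range(w // factor):
            r, c = i * factor, j * factor
            patch = padded[r:r + size, c:c + size, :]   # window centred on pixel (r, c)
            out[i, j] = np.tensordot(kernel, patch, axes=([0, 1], [0, 1]))
    return out

def band_average_response(wavelengths, ranges):
    """Spectral downsampling matrix that averages the bands inside each range."""
    response = np.zeros((len(ranges), len(wavelengths)))
    for row, (low, high) in enumerate(ranges):
        mask = (wavelengths >= low) & (wavelengths <= high)
        response[row, mask] = 1.0 / mask.sum()
    return response

def spectral_degrade(hsi, response):
    """Map (H, W, hsi_bands) to (H, W, msi_bands) with a (msi_bands, hsi_bands) matrix."""
    return hsi @ response.T

# Small constant image so the example runs quickly.
TM_BANDS_1_TO_4 = [(450, 520), (520, 600), (630, 690), (770, 900)]
wavelengths = np.linspace(400, 2500, 172)   # stand-in band centres in nm
hr_hsi = np.ones((16, 16, 172))
lr_hsi = spatial_degrade(hr_hsi)
response = band_average_response(wavelengths, TM_BANDS_1_TO_4)
hr_msi = spectral_degrade(hr_hsi, response)
print(lr_hsi.shape, hr_msi.shape)
```

For a $256\times 256\times 172$ input, the same functions would produce a $64\times 64\times 172$ LR-HSI and a $256\times 256\times 4$ HR-MSI, matching the sizes stated above.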

The collected dataset consisted of 2,078 HR-HSI images, which were randomly partitioned into training, validation, and testing sets for performance evaluation. The training set contained 1,678 images, while the validation and testing sets contained 200 images each. The spatial and spectral resolutions of the HR-MSI and LR-HSI were $256\times 256\times M_{m}$ and $64\times 64\times 172$, respectively, where $M_{m}$ is either 4 or 6 in our experiments.
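A random partition with these sizes could be produced as in the sketch below. The seed and the function name are arbitrary illustrative choices; the actual split used in the experiments is not specified here.

```python
import random

def split_dataset(num_images=2078, n_val=200, n_test=200, seed=0):
    """Shuffle image indices and cut them into train / validation / test subsets."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)     # seeded for a reproducible partition
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test

train, val, test = split_dataset()
print(len(train), len(val), len(test))
```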

IV-A2 Implementation Details.

The experimental platform utilized in this study comprised an Intel® Xeon® Gold 61 CPU, 90GB of system memory, and an NVIDIA Tesla V100 GPU with 32GB of memory. The proposed method was implemented using the PyTorch deep learning framework. The batch size was set to 4, and the number of training epochs was fixed to 600 for all experiments involving the proposed method. For the peer methods, the number of training epochs was set according to the default values specified in their respective original publications. The online distillation strategy employed in the proposed framework facilitated simultaneous updates of the teacher and student networks during the training process. The Adam optimizer [38] was used for training, with an initial learning rate of 0.0001. The learning rate was adjusted during the training process using the Cosine Annealing learning rate scheduler. The weights of the penalty terms in the loss function, denoted as $\lambda_{s}$ and $\lambda_{t}$, were both set to 0.1. Standard data augmentation, including random cropping and rotation, was adopted for all of the evaluated methods.
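For reference, the closed form of a cosine annealing schedule is $\eta_{t}=\eta_{\min}+\tfrac{1}{2}(\eta_{0}-\eta_{\min})\left(1+\cos(\pi t/T)\right)$. The sketch below evaluates it with the stated initial rate of 0.0001 and assumes a period of $T=600$ epochs, a minimum rate of zero, and one update per epoch; these three settings are assumptions, since only the scheduler type is stated.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=600, base_lr=1e-4, min_lr=0.0):
    """Learning rate at a given epoch under a single cosine annealing cycle."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

for epoch in (0, 150, 300, 450, 600):
    print(epoch, cosine_annealing_lr(epoch))
```

The rate starts at the initial value, reaches half of it at the midpoint of training, and decays smoothly towards the minimum at the final epoch.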

IV-A3 Quantitative Metrics

For a comprehensive evaluation, we adopt three commonly used quantitative metrics:

  1.

    Peak signal-to-noise ratio (PSNR in dB) is defined as

    $$\text{PSNR}=\frac{1}{M}\sum_{m=1}^{M}\text{PSNR}_{m},$$

    where $\text{PSNR}_{m}$ measures the spatial quality of a single band, and $m$ represents the $m$-th band, defined by:

    $$\text{PSNR}_{m}=10\log_{10}\left(\frac{\max\{\bm{y}_{mn}^{2}\mid n\in\mathcal{I}_{L}\}}{\frac{1}{L}\|\bm{Y}^{(m)}-\bm{Y}^{*(m)}\|_{2}^{2}}\right),$$

    where $\bm{y}_{mn}$ denotes the $n$th entry in the vector $\bm{Y}^{(m)}$, and $\mathcal{I}_{L}\triangleq\{1,\dots,L\}$. A higher PSNR value indicates a better spatial quality of the fused image $\bm{Y}^{*}$;

  2.

    Spectral angle mapper (SAM) is defined as

    $$\text{SAM}=\frac{1}{L}\sum_{n=1}^{L}\arccos\left(\frac{(\bm{y}[n])^{T}\bm{y}^{*}[n]}{\|\bm{y}[n]\|_{2}\cdot\|\bm{y}^{*}[n]\|_{2}}\right),$$

    where $\bm{y}[n]$ denotes the $n$th column of $\bm{Y}$. The lower the absolute value of SAM is, the greater the spectral restoration performance of $\bm{Y}^{*}$ is; and

  3.

    Root mean squared error (RMSE) is defined as

    $$\text{RMSE}=\sqrt{\frac{1}{M}\sum_{m=1}^{M}\text{RMSE}_{m}^{2}},$$

    where

    $$\text{RMSE}_{m}=\frac{1}{\sqrt{L}}\|\bm{Y}^{(m)}-\bm{Y}^{*(m)}\|_{2}.$$

    The smaller the RMSE value is, the better the global quality of the fused image $\bm{Y}^{*}$ is.
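The three metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' evaluation code: it assumes the reference and fused images are stored as arrays of shape (M, L), i.e. bands by pixels, and the function names `psnr`, `sam`, and `rmse` are hypothetical. The SAM formula yields radians; converting with `np.degrees` for reporting is an assumption about how the tabulated values were produced.

```python
import numpy as np

def psnr(Y, Y_hat):
    """Mean per-band PSNR in dB; Y and Y_hat have shape (M, L) = (bands, pixels)."""
    mse_per_band = np.mean((Y - Y_hat) ** 2, axis=1)   # (1/L) * ||Y^(m) - Y*^(m)||_2^2
    peak_per_band = np.max(Y ** 2, axis=1)             # max over n of y_mn^2
    return float(np.mean(10.0 * np.log10(peak_per_band / mse_per_band)))

def sam(Y, Y_hat, eps=1e-12):
    """Mean spectral angle in radians between corresponding columns (pixel spectra)."""
    dot = np.sum(Y * Y_hat, axis=0)
    norms = np.linalg.norm(Y, axis=0) * np.linalg.norm(Y_hat, axis=0)
    # eps guards against division by zero; clipping guards against rounding outside [-1, 1]
    cos = np.clip(dot / np.maximum(norms, eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

def rmse(Y, Y_hat):
    """Root of the mean squared per-band RMSE."""
    rmse_per_band = np.sqrt(np.mean((Y - Y_hat) ** 2, axis=1))  # ||.||_2 / sqrt(L)
    return float(np.sqrt(np.mean(rmse_per_band ** 2)))
```

Because every band contains the same number of pixels L, the aggregated RMSE equals the root mean squared error over all entries of the image.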

IV-B Performance Evaluation

Figure 5: Hyperspectral and multispectral fusion results on the AVIRIS dataset. The upper row shows the fused RGB images, and the lower row shows the residual images with respect to the ground truth: (a) the ground truth image; (b) the proposed method; (c) PZRes-Net [19]; (d) MSSJFL [18]; (e) Dual-UNet [20]; (f) DHIF-Net [21].
Figure 6: Robustness comparison among the proposed method and other peer methods under different SNR values in LR-HSI for 4-band HR-MSI.
Figure 7: Robustness comparison among the proposed method and other peer methods under different SNR values in both LR-HSI and 4-band HR-MSI.
Figure 8: Robustness comparison among the proposed method and other peer methods under different SNR values in LR-HSI for 6-band HR-MSI.
Figure 9: Robustness comparison among the proposed method and other peer methods under different SNR values in both LR-HSI and 6-band HR-MSI.
TABLE I: Performance evaluation and complexity comparison of the proposed method and other fusion models in terms of several metrics. Note that the methods marked with an asterisk (*) are unsupervised approaches. For the complexity parts, M and G indicate $10^{6}$ and $10^{9}$, respectively. L denotes the large version. EXT represents the extended training scenario, where we reduced the learning rate to 5e-5 and trained for an additional 40 epochs.

| Method | Venue | 4 Bands LR-HSI: PSNR↑ | 4 Bands LR-HSI: SAM↓ | 4 Bands LR-HSI: RMSE↓ | 6 Bands LR-HSI: PSNR↑ | 6 Bands LR-HSI: SAM↓ | 6 Bands LR-HSI: RMSE↓ | 4 Bands LR-HSI: Params | 4 Bands LR-HSI: FLOPs | 4 Bands LR-HSI: Run-time | 4 Bands LR-HSI: Memory |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PZRes-Net [19] | TIP 2021 | 34.963 | 1.934 | 35.498 | 37.427 | 1.478 | 28.234 | 40.15M | 5262G | 0.0141s | 11059MB |
| MSSJFL [18] | HPCC 2021 | 34.966 | 1.792 | 33.636 | 38.006 | 1.390 | 26.893 | 16.33M | 175.56G | 0.0128s | 1349M |
| Dual-UNet [20] | TGRS 2021 | 35.423 | 1.892 | 33.183 | 38.453 | 1.548 | 26.148 | 2.97M | 88.65G | 0.0127s | 2152M |
| DHIF-Net [21] | TCI 2022 | 34.458 | 1.829 | 34.769 | 39.146 | 1.239 | 25.309 | 57.04M | 13795G | 6.005s | 29381M |
| CUCaNet* [39] | ECCV 2020 | 28.848 | 4.140 | 71.710 | 35.509 | 2.205 | 38.973 | 3.0M | 40.0G | 2070.01s | - |
| USDN* [40] | CVPR 2018 | 30.069 | 3.688 | 93.408 | 35.208 | 2.650 | 53.987 | 0.006M | 1.0G | 28.83s | - |
| U2MDN* [41] | TGRS 2021 | 30.127 | 3.235 | 59.071 | 33.356 | 2.243 | 41.528 | 0.01M | 4.0G | 547.28s | - |
| Proposed-Teacher | - | 35.967 | 1.527 | 30.928 | 40.046 | 1.095 | 23.785 | 26.8M | 941.77G | 0.0134s | 8733M |
| Proposed-Student | - | 35.544 | 1.643 | 32.308 | 39.153 | 1.205 | 25.080 | 7.44M | 144.77G | 0.0121s | 1653M |
| Proposed-Teacher-L | - | 36.098 | 1.503 | 30.577 | 40.048 | 1.092 | 23.733 | 37.19M | 1303.3G | 0.1117s | 12110M |
| Proposed-Student-L | - | 35.548 | 1.588 | 31.561 | 39.784 | 1.119 | 23.956 | 11.34M | 399.9G | 0.0292s | 4054M |
| Proposed-Teacher-L-Ext | - | 36.076 | 1.508 | 30.589 | 40.043 | 1.098 | 23.754 | 37.19M | 1303.3G | 0.1117s | 12110M |
| Proposed-Student-L-Ext | - | 35.954 | 1.528 | 30.801 | 39.801 | 1.115 | 23.844 | 11.34M | 399.9G | 0.0292s | 4054M |
TABLE II: Robustness comparison among several state-of-the-art methods. This table reports model performance (PSNR / SAM) under different noise levels, with AWGN added to both the LR-HSI and the HR-MSI in the 4-band setting. This table corresponds to Figure 7.

| Method \ SNR Ratio | 25% (Noisy) | 30% | 35% | 40% | 45% | 0% (Clean) | Average |
|---|---|---|---|---|---|---|---|
| PZRes-Net [19] | 22.417 / 9.3 | 25.658 / 5.72 | 29.017 / 3.681 | 31.566 / 2.654 | 33.454 / 1.945 | 34.963 / 1.934 | 29.512 / 4.205 |
| MSSJFL [18] | 23.553 / 6.549 | 26.603 / 4.573 | 29.627 / 3.195 | 31.932 / 2.475 | 33.464 / 2.103 | 34.966 / 1.792 | 30.024 / 3.447 |
| Dual-UNet [20] | 19.944 / 11.009 | 24.423 / 6.614 | 28.339 / 4.128 | 31.365 / 2.846 | 33.378 / 2.258 | 35.423 / 1.892 | 28.812 / 4.791 |
| DHIF-Net [21] | 24.526 / 6.324 | 28.677 / 4.214 | 31.405 / 2.45 | 33.148 / 2.204 | 34.251 / 1.98 | 34.458 / 1.829 | 31.077 / 3.166 |
| Proposed - Student | 27.632 / 4.282 | 31.138 / 2.776 | 33.432 / 2.068 | 34.609 / 1.787 | 35.316 / 1.648 | 35.544 / 1.643 | 32.945 / 2.367 |

We evaluated our approach against seven state-of-the-art HSI/MSI fusion methods: four supervised methods (PZRes-Net [19], MSSJFL [18], Dual-UNet [20], and DHIF-Net [21]) and three unsupervised methods (CUCaNet [39], USDN [40], and U2MDN [41]). Performance was measured objectively using three metrics: PSNR, SAM, and RMSE. Experiments were conducted with both 4 and 6 MSI bands. As more bands in the LR-HSI are available, richer spectral information can be exploited to potentially enhance the SR quality. The quantitative results are presented in Table I. This study investigates two variants of the proposed model, denoted by the suffixes L and L-Ext. The L model is constructed by stacking additional blocks, as described in Section III.E, resulting in a larger model architecture. The L-Ext model is obtained by extending the training of the L model with more epochs and a reduced learning rate of $5\times 10^{-5}$.

The experimental results demonstrate that the proposed framework outperforms the compared state-of-the-art methods in terms of spectral reconstruction performance for both 4 and 6 bands of HR-MSI. Our method also exhibits superior overall and pixel-level restoration capabilities, as reflected by its better SAM, PSNR, and RMSE values. The outstanding performance of our method can be attributed to several factors. First, the proposed DTS network effectively integrates spatial-spectral feature representations, leading to the excellent performance of the teacher model. Second, the relatively shallow network architecture of the student model enables faster inference while maintaining high-quality results, resulting in a higher performance-to-complexity ratio than previous methods. Third, the proposed response-based KD framework provides refined and strong guidance, helping the student network learn nuanced representations with fewer parameters and thereby streamlining the architecture without compromising model performance. Furthermore, the proposed CSA fusion module and the distillation strategy enable our method to adaptively determine the optimal weights for HSI and MSI features, even in the presence of noise, resulting in improved robustness and stability. The effectiveness of the CSA and KD strategies is further demonstrated in the subsequent experiments. In contrast, the unsupervised approaches, such as CUCaNet [39], USDN [40], and U2MDN [41], can hardly meet the requirements of real-time inference despite their relatively low numbers of parameters and FLOPs (floating-point operations): their run-times in Table I range from 28.83 s to 2070.01 s.

Discerning differences in HSI images through visualization in the RGB color system is challenging. To better observe these differences, we compute residual images by subtracting each method’s fused image $\bm{Y}^{*}$ from the corresponding ground truth $\bm{Y}$ and enhance the contrast through logarithmic mapping. The resulting residual images, which appear closer to black the closer the fused image is to the ground truth, are depicted in Figure 5. These visualized results not only corroborate the superior quantitative performance of our method but also highlight the effectiveness of the proposed CSA and KD strategies in preserving fine image details and producing visually appealing false-color representations.
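The residual visualization described above could be produced along the following lines. This is only a sketch: the text does not specify the exact logarithmic mapping, so the `log1p` form, the `gain` parameter, and the function name `log_residual` are all assumptions made for illustration.

```python
import numpy as np

def log_residual(Y, Y_hat, gain=100.0):
    """Absolute residual with log contrast enhancement, scaled to [0, 1].

    `gain` is a hypothetical parameter controlling how strongly small
    errors are amplified; it does not come from the paper.
    """
    residual = np.abs(Y - Y_hat)
    mapped = np.log1p(gain * residual)   # log(1 + gain * |Y - Y*|), zero stays zero
    peak = mapped.max()
    return mapped / peak if peak > 0 else mapped
```

With this mapping a perfect reconstruction yields an all-zero (black) image, which matches the reading of Figure 5 given in the text.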

IV-C Robustness Evaluation

In real-world scenarios, the quality of LR-HSI and HR-MSI can suffer from lossy transmission or physical distortions, which introduce heavy noise into the input data. Such noise can cause significant degradation, and potentially catastrophic restoration results, in the LR-HSI/HR-MSI fusion task. It is therefore crucial to evaluate the robustness of the considered fusion methods in noisy scenarios.

To assess the robustness of each method, we introduced Additive White Gaussian Noise (AWGN) with varying Signal-to-Noise Ratios (SNR) ranging from 25 to 45. The AWGN and its impact on the HSI are formulated as:

$$\bm{N}_{\text{awgn}}=\sqrt{\frac{\frac{1}{N}\sum_{i=1}^{N}\bm{X}_{i}^{2}}{10^{\frac{\text{SNR}}{10}}}} \qquad (16)$$

where $\bm{X}_{i}$ represents the tensor-form input image and $N$ is the number of tensor elements. The noisy LR-HSI and HR-MSI can be obtained by $\bm{X}_{m}=\bm{X}_{m}+\bm{N}_{\text{awgn}}$ and $\bm{X}_{h}=\bm{X}_{h}+\bm{N}_{\text{awgn}}$, respectively.
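A minimal NumPy sketch of this noise injection follows. It rests on one interpretation that the text does not spell out: the right-hand side of Eq. (16) is treated as the noise standard deviation, which then scales zero-mean, unit-variance Gaussian samples. The function name `add_awgn` and the `rng` argument are hypothetical.

```python
import numpy as np

def add_awgn(X, snr, rng=None):
    """Return X corrupted by white Gaussian noise at the requested SNR.

    Assumption: Eq. (16) gives the noise standard deviation, i.e.
    sqrt(mean(X^2) / 10^(SNR/10)).
    """
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(X ** 2)                       # (1/N) * sum_i X_i^2
    sigma = np.sqrt(signal_power / 10.0 ** (snr / 10.0)) # right-hand side of Eq. (16)
    return X + sigma * rng.standard_normal(X.shape)
```

Under this reading, applying the function separately to the LR-HSI and the HR-MSI reproduces the second noise scenario evaluated in this section, while applying it to the LR-HSI alone reproduces the first.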

As shown in Table II, the proposed method achieves the best performance under noisy scenarios. We further considered two scenarios: (1) adding AWGN to the LR-HSI only, and (2) adding AWGN to both the LR-HSI and HR-MSI. The results for these scenarios are presented in Figures 6, 7, 8, and 9. As expected, the performance of all models generally deteriorates in the presence of noise; however, the degradation patterns vary across methods. The restored results from Dual-UNet [20] collapse when heavy noise is added to both LR-HSI and HR-MSI. The other three methods are also affected to varying degrees. In contrast, our approach is notably resilient, maintaining higher PSNR and lower SAM values whether noise is added solely to LR-HSI or to both LR-HSI and HR-MSI.
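The degradation pattern can be quantified directly from the PSNR values in Table II. The short script below is an illustrative sketch (the `summarize` helper and the variable names are not from the paper): for each method it computes the average PSNR over the six noise settings and the PSNR drop from the clean input to the noisiest setting.

```python
# PSNR values copied from Table II, ordered as: 25, 30, 35, 40, 45, clean.
psnr_by_snr = {
    "PZRes-Net": [22.417, 25.658, 29.017, 31.566, 33.454, 34.963],
    "MSSJFL": [23.553, 26.603, 29.627, 31.932, 33.464, 34.966],
    "Dual-UNet": [19.944, 24.423, 28.339, 31.365, 33.378, 35.423],
    "DHIF-Net": [24.526, 28.677, 31.405, 33.148, 34.251, 34.458],
    "Proposed-Student": [27.632, 31.138, 33.432, 34.609, 35.316, 35.544],
}

def summarize(values):
    """Return (average PSNR, drop from the clean input to the noisiest input)."""
    average = sum(values) / len(values)
    drop = values[-1] - values[0]
    return average, drop

summary = {name: summarize(values) for name, values in psnr_by_snr.items()}
most_robust = min(summary, key=lambda name: summary[name][1])

for name, (average, drop) in summary.items():
    print(f"{name:17s} average={average:.3f} dB  drop={drop:.3f} dB")
print("smallest drop:", most_robust)
```

On these numbers the student model loses about 7.9 dB between the clean and noisiest settings, whereas Dual-UNet loses about 15.5 dB, which is consistent with the collapse noted above.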

The effectiveness of our proposed CSAKD framework in handling noise can be attributed to two key factors. First, the Cross Self-Attention (CSA) fusion module adaptively determines the optimal weights for features extracted from the LR-HSI and HR-MSI branches. By dynamically adjusting these weights based on the input data, the CSA module can effectively suppress the influence of noise and prioritize the more reliable information from each modality. Second, the Knowledge Distillation (KD) strategy enables the student network to learn robust feature representations from the teacher network. During training, the teacher network is exposed to noisy inputs and learns to extract noise-resilient features. Through the distillation process, this robustness is transferred to the student network, allowing it to maintain high-quality fusion results even in the presence of noise.

Figure 10: Model complexity-performance comparison plot. The upper-right corner represents faster inference speed and higher fusion quality. The size of each circle indicates the memory usage of the model deployed on hardware.

IV-D Computational Complexity Analysis

In addition to the restoration performance and robustness of the models, we also consider their computational complexity and lightweight nature for deployment on different hardware platforms. This is crucial because HSI processing needs to be compatible with various hardware constraints and SDG requirements, allowing for feasible applications without heavy computational burden. The results of the complexity analysis are shown in the last four columns of Table I (Params, FLOPs, Run-time, and Memory) and in Figure 10.

The HR-HSI restored from the state-of-the-art supervised learning-based methods all exhibit high fidelity. However, the model complexity and hardware requirements may vary in different aspects. The proposed method demonstrates comprehensively competitive capabilities across parameter size, FLOPs, running time per input pair, and memory usage during inference, underscoring an optimized balance between computational efficiency and fusion quality. By employing the proposed knowledge distillation framework, we significantly reduce the model size, FLOPs, and memory usage, which is extremely valuable for lightweight hardware platforms.
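The size of the reduction can be read off Table I. The following sketch computes teacher-to-student ratios from the tabulated values; the dictionary keys and the `ratios` name are illustrative only.

```python
# Complexity figures for the proposed teacher and student, copied from Table I.
teacher = {"params_M": 26.8, "flops_G": 941.77, "runtime_s": 0.0134, "memory_M": 8733}
student = {"params_M": 7.44, "flops_G": 144.77, "runtime_s": 0.0121, "memory_M": 1653}

# Ratio > 1 means the student is cheaper than the teacher on that axis.
ratios = {key: teacher[key] / student[key] for key in teacher}

for key, value in ratios.items():
    print(f"{key:10s} teacher/student = {value:.2f}x")
```

The student uses roughly 3.6 times fewer parameters, 6.5 times fewer FLOPs, and 5.3 times less memory than the teacher, while the run-time per input pair changes only slightly.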

The other methods face different challenges arising from their drawbacks in handling heavy noise or their substantial hardware requirements. DHIF-Net [21] and PZRes-Net [19] are limited in their applicability to lightweight hardware due to their iterative spatial-spectral-aware optimization strategy or residual learning-based approach, which result in heavy parameter and memory requirements. Dual-UNet [20] achieves low computational complexity but struggles to address highly noisy data. MSSJFL [18] strikes a balance between maintaining fusion quality in the presence of noise and computational complexity, but its performance is relatively limited compared to our method.

Figure 11: Training convergence and validation performance with and without the proposed loss function. The red curve corresponds to the proposed loss function. The blue curve corresponds to the naive loss function, in which only the L1 loss is used for the teacher model, and the L1 loss plus the response distillation loss term guide the lightweight student network.

IV-E Model Scalability and Extended Training

Figure 12: Performance comparison with and without the proposed CSA-fusion block under different SNR levels. Compared with direct fusion, the proposed CSA-fusion method is more robust and effective against noise.
TABLE III: Comparison of different penalty-term coefficients in the proposed student model. In the proposed setting, all penalty-term coefficients are set to 0.1. In the naive setting, the SAM, BEBA, and feature-map KD losses are disabled.

| Setting | Teacher PSNR / SAM / RMSE | Student PSNR / SAM / RMSE |
|---|---|---|
| Naive | 20.430 / 8.145 / 148.33 | 19.266 / 8.448 / 171.361 |
| $\lambda_{1}=0.5$ | 35.794 / 1.565 / 31.514 | 35.021 / 1.732 / 33.899 |
| $\lambda_{2}=0.5$ | 35.812 / 1.561 / 31.475 | 34.906 / 1.775 / 34.532 |
| $\lambda_{3}=0.5$ | 35.785 / 1.565 / 31.513 | 35.041 / 1.717 / 33.638 |
| $\lambda_{4}=0.5$ | 35.781 / 1.566 / 31.552 | 35.016 / 1.709 / 33.358 |
| Proposed | 35.967 / 1.527 / 30.928 | 35.544 / 1.643 / 32.308 |

The proposed CSAKD framework demonstrates exceptional performance, with both the teacher and student networks surpassing other state-of-the-art models. To explore the limitations of the CSAKD framework, which primarily depend on the DTS network architecture and the CSA-fusion module for achieving high-quality fusion results, we enhanced both networks by incorporating additional CLRA units. This enhancement aimed to assess their scalability and identify the upper bound of fusion performance, as detailed in Table I.

Specifically, we augmented the teacher network by stacking the CLRA unit in the four branches ($\bm{Z}_{h},\bm{Z}_{hm},\bm{Z}_{mh},\bm{Z}_{m}$) 8, 8, 6, and 6 times, respectively. In contrast, the student network received a more modest increase of 2, 4, 4, and 4 stacks. These versions, denoted as Proposed-Teacher-L and Proposed-Student-L, are presented in Table I. This strategy served two purposes: first, to preserve the lightweight nature of the student network, and second, to amplify the learning capacity of the teacher model. The results indicate a clear improvement in the SAM and RMSE metrics for the student network: in the 4-band setting of Table I, SAM improves from 1.643 to 1.588 and RMSE from 32.308 to 31.561. However, a limitation was observed in the PSNR metric, which barely changes in the same setting (35.544 to 35.548). These findings suggest that our approach is viable for achieving superior fusion results when computational complexity is not a primary concern.

To further enhance the learning ability of the CSA-Large model, we explored the potential of a deeper teacher model, which can provide richer feature information in the feature domain. To investigate the effectiveness of the KD-guided framework, we extended the training process for CSA-Large by an additional 40 epochs and reduced the learning rate to $5\times 10^{-5}$. These versions, denoted as Proposed-Teacher-L-Ext and Proposed-Student-L-Ext, are presented in Table I. The results show that extending the training process slightly degraded the teacher’s performance (4-band PSNR of 36.076 versus 36.098), indicating that the model had reached its bottleneck. Conversely, the student model acquired more effective guidance under the proposed KD framework, with its 4-band PSNR rising from 35.548 to 35.954. This demonstrates that the proposed feature-map knowledge distillation loss $\ell_{\text{KD}}$ can effectively enhance the student network when using a deeper teacher network and a richer feature space.

The scalability analysis highlights the flexibility of the CSAKD framework in accommodating varying network depths and architectures. By increasing the number of CLRA units, the fusion performance can be further improved, particularly in terms of the SAM and RMSE metrics. However, the limitations observed in the PSNR metric suggest that there may be a trade-off between network depth and certain aspects of fusion quality. This trade-off should be carefully considered when designing the network architecture for specific applications.

The extended training experiment demonstrates the effectiveness of the KD-guided framework in transferring knowledge from a deeper teacher network to a lightweight student network. By leveraging the richer feature space provided by the deeper teacher, the student network can learn more nuanced representations and achieve improved fusion performance. This finding underscores the importance of the feature-map knowledge distillation loss $\ell_{\text{KD}}$ in enabling effective knowledge transfer and enhancing the student network’s learning ability.

In summary, the model scalability and extended training analysis provide valuable insights into the flexibility and effectiveness of the CSAKD framework. These findings can guide future research in designing and optimizing network architectures for HSI/MSI fusion tasks, while also highlighting the potential for further performance improvements through extended training and knowledge distillation.

IV-F Ablation Study

Table I corroborates the superior performance of the student network under teacher guidance. To explore the impact of teacher model complexity on student learning, and to highlight the need for a distillation loss that bridges the output gap between teacher and student, we first perform an ablation study on the loss function. We then perform an ablation study on CLRA depth to determine the number of CLRA units that best balances SR performance and computational complexity.

IV-F1 Loss Function Analysis

We first verify that the proposed loss function for joint training of the teacher-student model not only accelerates training, helping the model converge quickly, but also mitigates instability during backpropagation. As shown in Figure 11 and Table III, the proposed SAM loss and BEBA loss are both crucial for training a strong teacher model. Subsequently, the feature-map distillation loss enables the teacher model to guide the student model in an ideal manner.

Due to the complexity of the loss function in the student network, we compared the influence of each loss function component. The penalty term in the student network’s loss function $\ell_{\text{total}}^{\text{student}}$ is calculated solely based on the discrepancy between the teacher-student network output and the ground truth, with the combined impact detailed in Table III. In this experiment, the penalty terms in $\ell_{\text{total}}^{\text{teacher}}$ are all set to 0.1. The experiment demonstrates that the SAM loss is relatively sensitive in the CSAKD framework.
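To make the role of the penalty coefficients concrete, the sketch below combines an L1 term, a SAM term, and a feature-map distillation term into a weighted sum. It is a simplified stand-in, not the loss defined earlier in the paper: the exact composition of $\ell_{\text{total}}^{\text{student}}$ is not reproduced in this section, the BEBA term is omitted because its definition is not given here, and all function names, the squared-error form of the distillation term, and the mapping of coefficients to terms are assumptions.

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute error."""
    return float(np.mean(np.abs(a - b)))

def sam_loss(Y, Y_hat, eps=1e-8):
    """Mean spectral angle (radians); arrays have shape (bands, pixels)."""
    dot = np.sum(Y * Y_hat, axis=0)
    norms = np.maximum(np.linalg.norm(Y, axis=0) * np.linalg.norm(Y_hat, axis=0), eps)
    return float(np.mean(np.arccos(np.clip(dot / norms, -1.0, 1.0))))

def feature_kd_loss(f_student, f_teacher):
    """Assumed form: mean squared error between same-shaped feature maps."""
    return float(np.mean((f_student - f_teacher) ** 2))

def student_loss(Y, Y_student, f_student, f_teacher, lam_sam=0.1, lam_kd=0.1):
    """Hypothetical weighted sum; 0.1 mirrors the 'proposed' coefficient in Table III."""
    return (l1_loss(Y, Y_student)
            + lam_sam * sam_loss(Y, Y_student)
            + lam_kd * feature_kd_loss(f_student, f_teacher))
```

Raising a single coefficient to 0.5 in such a sum, as in the rows of Table III, shifts the optimization toward that term at the expense of the others, which is one plausible reading of why the student results degrade relative to the uniform 0.1 setting.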

IV-F2 CLRA Depth Analysis

TABLE IV: Comparisons of Stacking Different Amounts of CLRA in Different Branches of the Proposed Model. M and G indicate $10^{6}$ and $10^{9}$.

| $Z_{h}$, $Z_{hm}$, $Z_{mh}$, $Z_{m}$ | PSNR / SAM / RMSE | FLOPs | Params |
|---|---|---|---|
| 1, 3, 3, 3 | 35.528 / 1.598 / 31.698 | 309G | 8.717M |
| 2, 2, 2, 2 | 35.476 / 1.617 / 31.983 | 218G | 6.089M |
| 2, 3, 3, 2 | 35.405 / 1.624 / 32.127 | 224G | 7.418M |
| 1, 4, 4, 1 (proposed) | 35.544 / 1.643 / 32.308 | 144G | 7.449M |

In addition to comparing the computational complexity with other methods, we conducted experiments to determine the optimal depth combination of the CLRA units. The objective was to achieve the best balance between performance and speed. Table IV presents the results of our experiments.

The depth of the CLRA units plays a crucial role in the fusion performance and computational efficiency of the proposed CSAKD framework. By varying the number of CLRA units in each branch of the DTS network, we can fine-tune the network’s capacity to extract and integrate spatial-spectral features. The results in Table IV demonstrate that the optimal combination of CLRA depths varies depending on the specific performance metrics and computational constraints.

For instance, the combination of 1, 3, 3, and 3 CLRA units in the $\bm{Z}_{h}$, $\bm{Z}_{hm}$, $\bm{Z}_{mh}$, and $\bm{Z}_{m}$ branches, respectively, achieves the best SAM and RMSE metrics. However, this configuration also results in a higher number of parameters and FLOPs compared to the proposed combination of 1, 4, 4, and 1 CLRA units. The proposed combination strikes a balance between performance and computational efficiency, achieving competitive PSNR and SAM metrics while maintaining a lower number of parameters and FLOPs.
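One way to make this trade-off explicit is to look for Pareto-optimal configurations in Table IV. The sketch below is an illustration rather than the authors' selection procedure: it uses only PSNR, FLOPs, and parameter count as criteria (SAM and RMSE are deliberately left out), and the names `configs`, `dominates`, and `pareto` are hypothetical.

```python
# Values copied from Table IV; keys are CLRA counts for (Z_h, Z_hm, Z_mh, Z_m).
configs = {
    (1, 3, 3, 3): {"psnr": 35.528, "flops_G": 309, "params_M": 8.717},
    (2, 2, 2, 2): {"psnr": 35.476, "flops_G": 218, "params_M": 6.089},
    (2, 3, 3, 2): {"psnr": 35.405, "flops_G": 224, "params_M": 7.418},
    (1, 4, 4, 1): {"psnr": 35.544, "flops_G": 144, "params_M": 7.449},
}

def dominates(a, b):
    """True if a is no worse than b on every criterion and strictly better on one."""
    no_worse = (a["psnr"] >= b["psnr"] and a["flops_G"] <= b["flops_G"]
                and a["params_M"] <= b["params_M"])
    better = (a["psnr"] > b["psnr"] or a["flops_G"] < b["flops_G"]
              or a["params_M"] < b["params_M"])
    return no_worse and better

# Keep every configuration that no other configuration dominates.
pareto = [name for name, cfg in configs.items()
          if not any(dominates(other, cfg) for other in configs.values())]
print(pareto)
```

Under these three criteria only the 2, 2, 2, 2 and the proposed 1, 4, 4, 1 configurations remain; including SAM or RMSE as additional criteria would also retain 1, 3, 3, 3, which has the best values on both.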

These findings highlight the importance of carefully designing the network architecture and selecting the appropriate depth of the CLRA units based on the specific requirements of the application. The ablation study provides valuable insights into the trade-offs between fusion performance and computational complexity, enabling researchers and practitioners to make informed decisions when deploying the CSAKD framework in real-world scenarios.

V Conclusion

In this work, we have introduced a novel knowledge distillation-based teacher-student framework, named CSAKD, for LR-HSI/HR-MSI fusion. The proposed framework incorporates a Dual Two-Streamed (DTS) network architecture, which effectively captures spectral and spatial information from LR-HSI and HR-MSI. The Cross-Layer Residual Aggregation (CLRA) unit and Cross Self-Attention (CSA) module enhance the network’s ability to handle noise and integrate spatial-spectral features, resulting in high-quality fused results. The application of knowledge distillation in LR-HSI/HR-MSI fusion is a key contribution of this work. The proposed Spectral Angle Mapper (SAM) loss, Band-Energy-Balance-Aware (BEBA) loss, and feature map-based KD loss guide the lightweight student model to achieve excellent fusion performance while reducing model-size and computational requirements. Extensive experiments have demonstrated the superiority of the CSAKD method under various conditions, including noisy images and LR-HSIs with varying numbers of bands. The lightweight student model exhibits outstanding performance compared to larger, state-of-the-art models, offering an exceptional balance of high performance and reduced computational complexity. The CSAKD framework opens up new possibilities for efficient and effective HSI/MSI fusion, with potential applications in remote sensing and related fields. Future research could explore integrating CSAKD with other advanced techniques, such as attention mechanisms and adversarial learning, to further improve fusion performance and adaptability to diverse scenarios.

References

  • [1] P. G. et al., “Advances in hyperspectral image and signal processing: A comprehensive overview of the state of the art,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 37–78, Dec. 2017.
  • [2] R. Dian, S. Li, B. Sun, and A. Guo, “Recent advances and new guidelines on hyperspectral and multispectral image fusion,” Information Fusion, vol. 69, pp. 40–51, 2021.
  • [3] G. Vivone, “Multispectral and hyperspectral image fusion in remote sensing: A survey,” Information Fusion, vol. 89, pp. 405–417, 2023.
  • [4] X. Y. Wang, Q. Hu, Y. S. Cheng, and J. Y. Ma, “Hyperspectral image super-resolution meets deep learning: A survey and perspective,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 8, pp. 1664–1687, Aug. 2023.
  • [5] S. Li, X. Kang, L. Fang, J. Hu, and H. Yin, “Pixel-level image fusion: A survey of the state of the art,” Information Fusion, vol. 33, pp. 100–112, 2017.
  • [6] J. Nunez, X. Otazu, O. Fors, A. Prades, V. Pala, and R. Arbiol, “Multiresolution-based image fusion with additive wavelet decomposition,” IEEE Trans. Geosci. Remote Sens., vol. 37, no. 3, pp. 1204–1211, May 1999.
  • [7] W. D. et al., “Hyperspectral image super-resolution via non-negative structured sparse representation,” IEEE Trans. Image Process., vol. 25, no. 5, pp. 2337–2352, May 2016.
  • [8] K. Zhang, M. Wang, and S. Yang, “Multispectral and hyperspectral image fusion based on group spectral embedding and low-rank factorization,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 3, pp. 1363–1371, Mar. 2017.
  • [9] Y. Chang, L. Yan, H. Fang, S. Zhong, and W. Liao, “HSI-DeNet: Hyperspectral image restoration via convolutional neural network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 2, pp. 667–682, Feb. 2019.
  • [10] X. Deng and P. L. Dragotti, “Deep convolutional neural network for multi-modal image restoration and fusion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3333–3348, Oct. 2021.
  • [11] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, “Deep learning for hyperspectral image classification: An overview,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6690–6709, 2019.
  • [12] X. Yang, W. Cao, Y. Lu, and Y. Zhou, “Hyperspectral image transformer classification networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022, art no. 5528715.
  • [13] L. Yan, M. Zhao, X. Wang, Y. Zhang, and J. Chen, “Object detection in hyperspectral images,” IEEE Signal Processing Letters, vol. 28, pp. 508–512, 2021.
  • [14] C. H. Yeh et al., “Lightweight deep neural network for joint learning of underwater object detection and color conversion,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6129–6143, Nov. 2022.
  • [15] F. Palsson, J. R. Sveinsson, and M. O. Ulfarsson, “Multispectral and hyperspectral image fusion using a 3-D convolutional neural network,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5, pp. 639–643, May 2017.
  • [16] W. Wang, W. Zeng, Y. Huang, X. Ding, and J. Paisley, “Deep blind hyperspectral image fusion,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Seoul, Korea, 2019.
  • [17] L. Wang, C. Sun, Y. Fu, M. H. Kim, and H. Huang, “Hyperspectral image reconstruction using a deep spatial-spectral prior,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 8024–8033.
  • [18] Z. Min, Y. Wang, and S. Jia, “Multiscale spatial-spectral joint feature learning for multispectral and hyperspectral image fusion,” in Proc. IEEE Int. Conf. High Performance Computing and Communications, Haikou, Hainan, China, 2021, pp. 1265–1270.
  • [19] Z. Zhu, J. Hou, J. Chen, H. Zeng, and J. Zhou, “Hyperspectral image super-resolution via deep progressive zero-centric residual learning,” IEEE Trans. Image Processing, vol. 30, pp. 1423–1438, 2021.
  • [20] J. Xiao, J. Li, Q. Yuan, and L. Zhang, “A dual-UNet with multistage details injection for hyperspectral image fusion,” IEEE Trans. Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2021.
  • [21] T. Huang, W. Dong, J. Wu, L. Li, X. Li, and G. Shi, “Deep hyperspectral image fusion network with iterative spatio-spectral regularization,” IEEE Trans. Computational Imaging, vol. 8, pp. 201–214, 2022.
  • [22] Y. Qu, H. Qi, C. Kwan, N. Yokoya, and J. Chanussot, “Unsupervised and unregistered hyperspectral image super-resolution with mutual Dirichlet-Net,” IEEE Trans. Geoscience and Remote Sensing, vol. 60, pp. 1–18, 2022.
  • [23] Q. Xie, M. Zhou, Q. Zhao, Z. Xu, and D. Meng, “MHF-Net: An interpretable deep network for multispectral and hyperspectral image fusion,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1457–1473, Mar. 2022.
  • [24] W. Dong, T. Zhang, J. Qu, Y. Li, and H. Xia, “A spatial–spectral dual-optimization model-driven deep network for hyperspectral and multispectral image fusion,” IEEE Trans. Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022.
  • [25] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” Int. J. Computer Vision, vol. 129, pp. 1789–1819, 2021.
  • [26] B. Aiazzi, S. Baronti, and M. Selva, “Improving component substitution pansharpening through multivariate regression of MS + Pan data,” IEEE Trans. Geoscience and Remote Sensing, vol. 45, no. 10, pp. 3230–3239, Oct. 2007.
  • [27] B. Huang, H. Song, H. Cui, J. Peng, and Z. Xu, “Spatial and spectral image fusion using sparse matrix factorization,” IEEE Trans. Geoscience and Remote Sensing, vol. 52, no. 3, pp. 1693–1704, Mar. 2014.
  • [28] Z. H. Nezhad, A. Karami, R. Heylen, and P. Scheunders, “Fusion of hyperspectral and multispectral images using spectral unmixing and sparse coding,” IEEE J. Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 6, pp. 2377–2389, Jun. 2016.
  • [29] S. Li, R. Dian, L. Fang, and J. M. Bioucas-Dias, “Fusing hyperspectral and multispectral images via coupled sparse tensor factorization,” IEEE Trans. Image Processing, vol. 27, no. 8, pp. 4118–4130, Aug. 2018.
  • [30] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Proc. NIPS Deep Learning and Representation Learning Workshop, 2015.
  • [31] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” in Proc. Int. Conf. Learning Representations, Toulon, France, Apr. 2017.
  • [32] J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 7130–7138.
  • [33] M. H. Phan, S. L. Phung, K. Luu, and A. Bouzerdoum, “Efficient hyperspectral image segmentation for biosecurity scanning using knowledge distillation from multi-head teacher,” Neurocomputing, vol. 504, pp. 189–203, 2022.
  • [34] M. Gong, H. Zhang, H. Xu, X. Tian, and J. Ma, “Multipatch progressive pansharpening with knowledge distillation,” IEEE Trans. Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023.
  • [35] C.-C. Hsu, C.-H. Lin, C.-H. Kao, and Y.-C. Lin, “DCSN: Deep compressed sensing network for efficient hyperspectral data transmission of miniaturized satellite,” IEEE Trans. Geoscience and Remote Sensing, vol. 59, no. 9, pp. 7773–7789, Sep. 2021.
  • [36] G. Vane, R. O. Green, T. G. Chrien, H. T. Enmark, E. G. Hansen, and W. M. Porter, “The airborne visible/infrared imaging spectrometer (AVIRIS),” Remote Sensing of Environment, vol. 44, no. 2-3, pp. 127–143, 1993.
  • [37] L. Wald, T. Ranchin, and M. Mangolini, “Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images,” Photogrammetric Engineering and Remote Sensing, vol. 63, pp. 691–699, 1997.
  • [38] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations, May 2015.
  • [39] J. Yao, D. Hong, J. Chanussot, D. Meng, X. Zhu, and Z. Xu, “Cross-attention in coupled unmixing nets for unsupervised hyperspectral super-resolution,” in Proc. European Conf. Computer Vision (ECCV), 2020, pp. 208–224.
  • [40] Y. Qu, H. Qi, and C. Kwan, “Unsupervised sparse Dirichlet-Net for hyperspectral image super-resolution,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 2511–2520.
  • [41] Y. Qu, H. Qi, C. Kwan, N. Yokoya, and J. Chanussot, “Unsupervised and unregistered hyperspectral image super-resolution with mutual Dirichlet-Net,” IEEE Transactions on Geoscience and Remote Sensing, pp. 1–18, 2021.