CMT: Cross Modulation Transformer with Hybrid Loss for Pansharpening

Wen-Jie Shu, Hong-Xia Dou, Rui Wen, Xiao Wu and Liang-Jian Deng

The research is supported by NSFC (No. 12271083), and National Key Research and Development Program of China (No. 2020YFA0714001). Wen-Jie Shu is with the School of Optoelectronic Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]). Hong-Xia Dou is with the School of Science, Xihua University, Chengdu 610039, China (e-mail: [email protected]). Rui Wen, Xiao Wu and Liang-Jian Deng are with the School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]; [email protected]; [email protected]).

Abstract

Pansharpening aims to enhance remote sensing image (RSI) quality by merging high-resolution panchromatic (PAN) with multispectral (MS) images. However, prior techniques struggled to optimally fuse PAN and MS images for enhanced spatial and spectral information, due to a lack of a systematic framework capable of effectively coordinating their individual strengths. In response, we present the Cross Modulation Transformer (CMT), a pioneering method that modifies the attention mechanism. This approach utilizes a robust modulation technique from signal processing, integrating it into the attention mechanism’s calculations. It dynamically tunes the weights of the carrier’s value (V) matrix according to the modulator’s features, thus resolving historical challenges and achieving a seamless integration of spatial and spectral attributes. Furthermore, considering that RSI exhibit large-scale features and edge details along with local textures, we crafted a hybrid loss function that combines Fourier and wavelet transforms to effectively capture these characteristics, thereby enhancing both spatial and spectral accuracy in pansharpening. Extensive experiments demonstrate our framework’s superior performance over existing state-of-the-art methods. The code will be publicly available to encourage further research.

Index Terms:
Pansharpening, Cross Modulation Transformer, Fourier and Wavelet Transforms

I Introduction

Given the inherent constraints of remote sensing technology, obtaining MS images with high spatial resolution directly from satellites is a significant challenge. As a solution, pansharpening has become a crucial technique. It merges low-resolution MS (LRMS) images with high-resolution PAN images to produce high-spatial-resolution MS (HRMS) images with superior spatial detail. This fusion technique effectively navigates around the limitations of sensor technology, offering invaluable data for RSI analysis.

Figure 1: (a) The modulation process in the field of communication. (b) The modulation process in coded aperture snapshot spectral imaging (CASSI). (c) Our Cross Modulation Multi-head Self-Attention (CM-MSA) modulation process.

Deep learning breakthroughs, led by Convolutional Neural Networks (CNNs), have significantly advanced the field of pansharpening [4], [12], [29]. CNNs have shown exceptional prowess in pansharpening, skillfully extracting and combining intricate features from various images to enhance both spatial and spectral quality. Furthermore, since their introduction, Transformers [21] have revolutionized numerous fields, including pansharpening [11], [28], [17] by their unparalleled ability to model long-range dependencies using self-attention mechanisms. This capability gives them a significant edge in effectively blending spatial and spectral features, outperforming traditional CNN-based approaches by capturing the complex dynamics between different types of images.

However, the field of pansharpening still faces several challenges. Firstly, Hyperspectral Images (HSI) show spatial sparsity and spectral self-similarity, complicating spatial dependency modeling and highlighting the importance of prioritizing inter-spectral over spatial correlations in Transformers. Secondly, RSI are marked by their rich spectral content, complex surface textures and edge details, posing challenges for fusion and necessitating tailored approaches for effective integration. Lastly, the potential of attention structures specifically tailored for pansharpening within the Transformer framework warrants further exploration. Current integration methods of PAN and LRMS images within frameworks like PanFormer [30] and HyperTransformer [1] mainly involve linear projection of images into tokens for merging via attention computations. While these methods are efficient, they do not fully exploit the intricate relationship between the spatial resolution of PAN and the spectral diversity of LRMS images.

Figure 2: Overall structure of the proposed method.

In response to these challenges, we have developed innovative solutions, notably the CMT and a hybrid loss function, to elevate the pansharpening workflow. Firstly, as illustrated in Fig. 1 (c), to harness spectral features for LRMS images and spatial correlation for PAN images, we independently compute the attention block for each spectral and spatial channel, ensuring a targeted processing of the distinct characteristics inherent in each modality. Secondly, to address the nuanced complexities of RSI, characterized by their rich spectral content and complex surface textures, we have implemented a hybrid loss function that combines Fourier and wavelet transforms. This approach utilizes Fourier transforms for identifying widespread features and wavelet transforms for enhancing local texture details. Together, they effectively improve spatial detail and maintain spectral fidelity in the pansharpening process. Lastly, our approach blends PAN’s spatial details with LRMS’s spectral data by applying modulation techniques to the pansharpening process. Specifically, rather than merely concatenating features, our approach leverages the features of the modulator to dynamically modulate the carrier, altering the weights of the Transformer’s value (V) matrix, which achieves a more sophisticated fusion of features. Our approach is inspired by the signal processing technique depicted in Fig. 1 (a), which uses modulation to seamlessly integrate high- and low-frequency signals, significantly improving signal fidelity and richness. In parallel, CASSI systems, illustrated in Fig. 1 (b), compress images by modulating high frequencies with masks, and deep learning methods such as MST by Cai et al. [3] likewise employ masks for recovery.

In conclusion, the CMT framework significantly advances pansharpening through the following contributions:

  1. Our Cross Modulation module within the CMT framework significantly enhances the fusion of PAN and LRMS images through a novel modulation technique.

  2. We introduce a pioneering hybrid loss function that combines Fourier and wavelet transforms, which is the first attempt in the field of pansharpening to the best of our knowledge.

  3. The CMT framework delivers outstanding results on benchmark datasets, establishing a new benchmark for pansharpening performance.

Figure 3: (a) The CMAB module consists of a Double Feed-Forward Network (DFFN), a CM-MSA module, and two layers of normalization. (b) Components of the DFFN. (c) The hybrid loss between Predicted Images ($PI$) and ground truth ($GT$), which employs both the 2D Discrete Fourier Transform (DFT) and the 2D Discrete Wavelet Transform (DWT).

II Method

II-A Overall Architecture

The CMT architecture, depicted in Fig. 2, is structured into three primary phases: feature extraction, modulation, and feature aggregation. In the feature extraction phase, distinct extractors are designed to capture the unique characteristics of PAN and LRMS images. During the modulation phase, the Cross Modulation Attention Block (CMAB) modulator separately modulates PAN images to enrich LRMS images and vice versa, allowing for an effective blend of spatial and spectral information. Finally, to aggregate the features, we simply use a $3\times 3$ convolution and four ResNet blocks.

Initially, LRMS images are upscaled to dimensions $H\times W\times c$. Then, PAN and LRMS images are processed through convolutional and ResNet blocks to extract local and global spatial details, resulting in feature sets $F_{\text{pan}}$ and $F_{\text{ms}}$, respectively, with channel expansion to $C=32$ in implementation.

The extracted features $F_{\text{pan}}$ and $F_{\text{ms}}$ are then modulated within the CMAB module, as shown in Fig. 3 (a). This module comprises a CM-MSA, two layers of normalization and a DFFN, employing varied activation functions for enhanced modulation. Fig. 3 (b) illustrates the specifics of the DFFN. Post-modulation, the features are concatenated and merged through a $3\times 3$ convolution and four ResNet blocks, allowing for further integration of spatial and spectral information.

Ultimately, the aggregated features are combined with the upsampled LRMS images to produce the HRMS.

II-B Modulation Approach

In signal processing, Double Side-Band modulation (DSB) [9], [19] stands as a classic and effective technique wherein the amplitude of a carrier wave is varied in accordance with the instantaneous value of the message signal, thus encoding information within the carrier. The mathematical expression for a DSB signal is given by:

$$s(t) = m(t) \cdot A_c \cdot \cos(2\pi f_c t), \tag{1}$$

where $s(t)$ denotes the modulated signal, $A_c$ represents the constant amplitude of the carrier wave, $f_c$ is the carrier frequency, and $m(t)$ embodies the message signal intended for transmission. This foundational concept of modulation underscores its robust capability to encode and transmit complex information efficiently and accurately across different mediums. In a similar vein, depicted in Fig. 1 (c), our pansharpening framework leverages a cross modulation paradigm where high-resolution spatial features and spectral details are mutually modulated. This bilateral modulation mechanism considerably amplifies the model’s proficiency in capturing and amalgamating multi-dimensional information, marking a significant leap from traditional methods.
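
As a numerical illustration of Eq. (1), the following minimal sketch modulates a single-tone message onto a carrier. The sampling rate, tone frequencies and amplitude are illustrative assumptions and do not come from the paper; the function name `dsb_modulate` is likewise hypothetical.

```python
import numpy as np

def dsb_modulate(message, carrier_amplitude, carrier_freq, t):
    """Eq. (1): s(t) = m(t) * A_c * cos(2*pi*f_c*t)."""
    return message * carrier_amplitude * np.cos(2.0 * np.pi * carrier_freq * t)

# Illustrative settings (assumptions, not taken from the paper).
fs = 8000.0                      # sampling rate in Hz
t = np.arange(400) / fs          # 400 samples, i.e. 0.05 s
message_freq = 100.0             # message tone f_m in Hz
carrier_freq = 1000.0            # carrier frequency f_c in Hz

message = np.cos(2.0 * np.pi * message_freq * t)
signal = dsb_modulate(message, carrier_amplitude=2.0,
                      carrier_freq=carrier_freq, t=t)

# The product of two cosines places the energy at f_c - f_m and f_c + f_m.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(t), d=1.0 / fs)
sidebands = np.sort(freqs[np.argsort(spectrum)[-2:]])
```

The message never appears at its own frequency in the output; it only shapes the carrier, which is the property the cross modulation design borrows when one feature stream reweights the other.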

Firstly, for the carrier, the input feature $X_{\text{in}} \in \mathbb{R}^{H\times W\times C}$ is reshaped into tokens $X \in \mathbb{R}^{HW\times C}$. A multi-head attention mechanism is employed to improve generalization and capture multi-dimensional information, splitting $X$ into $k$ heads:

$$X = [X_1, X_2, \ldots, X_k], \tag{2}$$

where $X_i \in \mathbb{R}^{HW\times d_k}$, $d_k = \frac{C}{k}$, and $i = 1, 2, \ldots, k$.

Each $X_i$ is then linearly projected into queries $Q_i$, keys $K_i$, and values $V_i$ using the following equations:

$$Q_i = X_i W^Q_i, \quad K_i = X_i W^K_i, \quad V_i = X_i W^V_i, \tag{3}$$

where $W^Q_i$, $W^K_i$, and $W^V_i \in \mathbb{R}^{d_k\times C}$ are learnable parameters, and $i = 1, 2, \ldots, k$.

For the modulator, aligned with the carrier, the input feature is reshaped into tokens $M \in \mathbb{R}^{HW\times C}$ and split into $k$ heads. Then, the modulation is integrated into the self-attention calculation by element-wise multiplying $M_i$ with $V_i$:

$$M = [M_1, M_2, \ldots, M_k], \quad V'_i = M_i \odot V_i, \tag{4}$$

where $M_i \in \mathbb{R}^{HW\times d_k}$, $i = 1, 2, \ldots, k$.

For a single head, the modulation-attention computation is:

$$\text{MA}(Q_i, K_i, V'_i) = V'_i \cdot \text{softmax}\left(\frac{K_i^T Q_i}{\alpha_i}\right), \tag{5}$$

where MA is modulation attention and $\alpha_i \in \mathbb{R}$ is a learnable parameter that adaptively scales the matrix multiplication, enhancing the model’s ability to adjust attention weights dynamically.

The results from the multiple heads are concatenated together, and with the addition of position encoding, the final output $X_{\text{out}}$ is derived as follows:

$$X_{\text{out}} = \text{Fc}(\text{Concatenate}(\text{MA}_1, \text{MA}_2, \ldots, \text{MA}_k)) + F_p, \tag{6}$$

where Fc denotes a fully connected layer, and $F_p$ represents the position encoding.
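
The steps in Eqs. (2) to (6) can be traced with a small NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the projection weights are taken as $d_k \times d_k$ matrices so that $V_i$ and $M_i$ have matching shapes for the element-wise product in Eq. (4), the softmax is applied over the last axis of the $d_k \times d_k$ map, the fully connected layer is a bias-free matrix, and the position encoding $F_p$ is passed in as a precomputed array. All function and argument names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modulation_attention(carrier, modulator, w_q, w_k, w_v,
                               alpha, w_fc, pos):
    """Sketch of CM-MSA.

    carrier, modulator: (HW, C) token matrices X and M
    w_q, w_k, w_v:      (k, d_k, d_k) per-head projections (assumed shape)
    alpha:              (k,) per-head scaling factors
    w_fc:               (C, C) fully connected weights (bias omitted)
    pos:                (HW, C) position encoding F_p
    """
    hw, c = carrier.shape
    k = w_q.shape[0]
    d_k = c // k

    # Eq. (2) and Eq. (4): split channels into k heads -> (k, HW, d_k).
    x = carrier.reshape(hw, k, d_k).transpose(1, 0, 2)
    m = modulator.reshape(hw, k, d_k).transpose(1, 0, 2)

    # Eq. (3): per-head linear projections.
    q = x @ w_q
    key = x @ w_k
    v = x @ w_v

    # Eq. (4): the modulator reweights the carrier's values element-wise.
    v_mod = m * v

    # Eq. (5): channel-wise attention map of shape (k, d_k, d_k).
    attn = softmax(key.transpose(0, 2, 1) @ q / alpha[:, None, None], axis=-1)
    heads = v_mod @ attn                      # (k, HW, d_k)

    # Eq. (6): concatenate heads, apply Fc, add the position encoding.
    merged = heads.transpose(1, 0, 2).reshape(hw, c)
    return merged @ w_fc + pos

# Toy usage with random values (shapes only; the numbers carry no meaning).
rng = np.random.default_rng(0)
hw, c, k = 16, 32, 4
d_k = c // k
out = cross_modulation_attention(
    carrier=rng.standard_normal((hw, c)),
    modulator=rng.standard_normal((hw, c)),
    w_q=rng.standard_normal((k, d_k, d_k)),
    w_k=rng.standard_normal((k, d_k, d_k)),
    w_v=rng.standard_normal((k, d_k, d_k)),
    alpha=np.ones(k),
    w_fc=rng.standard_normal((c, c)),
    pos=np.zeros((hw, c)),
)
```

Because the attention map is $d_k \times d_k$ rather than $HW \times HW$, its cost grows linearly with the number of pixels. Cross modulation then amounts to calling the function twice with the roles of the PAN and MS features swapped.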

TABLE I: Quantitative results on 20 reduced-resolution and 20 full-resolution samples of GF2. (Red: best; Blue: second best).
Reduced-Resolution Metrics: SAM, ERGAS, Q4. Full-Resolution Metrics: $D_\lambda$, $D_s$, HQNR.

| Method | SAM↓ | ERGAS↓ | Q4↑ | $D_\lambda$ | $D_s$ | HQNR↑ |
|---|---|---|---|---|---|---|
| PNN [16] | 1.048 ± 0.226 | 1.057 ± 0.235 | 0.960 ± 0.010 | 0.0317 ± 0.0286 | 0.0943 ± 0.0224 | 0.877 ± 0.036 |
| PanNet [27] | 0.997 ± 0.212 | 0.919 ± 0.191 | 0.967 ± 0.010 | 0.0179 ± 0.0110 | 0.0799 ± 0.0178 | 0.904 ± 0.020 |
| DiCNN [10] | 1.052 ± 0.231 | 1.081 ± 0.254 | 0.959 ± 0.010 | 0.0369 ± 0.0132 | 0.0992 ± 0.0131 | 0.868 ± 0.016 |
| FusionNet [5] | 0.973 ± 0.212 | 0.988 ± 0.222 | 0.964 ± 0.009 | 0.0350 ± 0.0124 | 0.1013 ± 0.0134 | 0.867 ± 0.018 |
| DCFNet [25] | 0.872 ± 0.169 | 0.784 ± 0.146 | 0.974 ± 0.009 | 0.0240 ± 0.0115 | 0.0659 ± 0.0096 | 0.912 ± 0.012 |
| MMNet [26] | 0.993 ± 0.141 | 0.777 ± 0.134 | 0.969 ± 0.020 | 0.0443 ± 0.0298 | 0.1033 ± 0.0129 | 0.857 ± 0.027 |
| LAGConv [13] | 0.786 ± 0.148 | 0.687 ± 0.113 | 0.980 ± 0.009 | 0.0284 ± 0.0130 | 0.0792 ± 0.0136 | 0.895 ± 0.020 |
| HMPNet [20] | 0.803 ± 0.156 | 0.564 ± 0.099 | 0.981 ± 0.030 | 0.0819 ± 0.0499 | 0.1146 ± 0.0126 | 0.813 ± 0.049 |
| Proposed | 0.722 ± 0.136 | 0.624 ± 0.107 | 0.992 ± 0.001 | 0.0202 ± 0.0103 | 0.0338 ± 0.0086 | 0.947 ± 0.012 |

Figure 4: Qualitative result comparison between representative methods on the GF2 reduced-resolution dataset. The first row presents RGB outputs, while the second row gives the corresponding QNR maps.

II-C Loss Function

To enhance the resolution and quality of RSI, as shown in Fig. 3 (c), our method improves upon traditional fusion techniques by leveraging both Fourier and wavelet transforms. Fourier transforms [18] are essential for mapping images into the frequency domain, capturing widespread environmental features. In addition, wavelet transforms [14] excel in delineating images across multiple scales, adeptly enhancing local textures and details.

The Fourier transform loss function is defined as:

$$L_{\text{Fourier}} = \frac{1}{M}\sum_{i=1}^{M} \|\mathcal{F}(PI_i) - \mathcal{F}(GT_i)\|_1, \tag{7}$$

which applies the L1 loss to the difference between the Fourier-transformed Predicted Images ($PI_i$) and ground truth images ($GT_i$), averaged over all $M$ training samples.
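
A minimal NumPy sketch of Eq. (7) follows. It assumes a batch layout of `(M, H, W, C)`, an unnormalized 2D DFT applied per channel, and an L1 norm taken as the sum of complex magnitudes; none of these details are stated in the text, and the name `fourier_l1_loss` is hypothetical.

```python
import numpy as np

def fourier_l1_loss(pred, target):
    """Eq. (7) sketch: mean over samples of ||F(PI_i) - F(GT_i)||_1.

    pred, target: arrays of shape (M, H, W, C).
    The 2D DFT is taken over the spatial axes (H, W) of each channel.
    """
    f_pred = np.fft.fft2(pred, axes=(1, 2))
    f_target = np.fft.fft2(target, axes=(1, 2))
    # L1 norm of a complex array: sum of coefficient magnitudes (assumption).
    per_sample = np.abs(f_pred - f_target).sum(axis=(1, 2, 3))
    return per_sample.mean()

# Toy usage: a constant brightness offset only affects the DC coefficient.
rng = np.random.default_rng(0)
gt = rng.random((2, 8, 8, 4))
loss_offset = fourier_l1_loss(gt + 0.5, gt)
```

Each Fourier coefficient depends on every pixel, so an error spread thinly over the whole image concentrates into a few coefficients; this is one way to read the claim that the Fourier term captures widespread features.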

The loss function for the wavelet transform is defined as:

$$L_{\text{Wavelet}} = \frac{1}{M}\sum_{i=1}^{M} D_{j,c}(i), \tag{8}$$

where $D_{j,c}(i) = \|\mathcal{W}_{j,c}(PI_i) - \mathcal{W}_{j,c}(GT_i)\|_1$ and $\mathcal{W}_{j,c}$ captures wavelet coefficients at scale $j$ and orientation $c$, reflecting local variations and textures, averaged across all training instances.
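
Eq. (8) can be sketched with a hand-written Haar transform so that the example stays free of extra dependencies. The choice of the Haar wavelet, the number of levels, the summation over all sub-bands (every scale $j$ and orientation $c$ plus the coarsest approximation), and the `(M, H, W, C)` layout are assumptions made for illustration; the paper does not specify them, and the function names are hypothetical.

```python
import numpy as np

def haar_dwt2(img):
    """One-level orthonormal 2D Haar DWT over axes (H, W).

    img: (M, H, W, C) with even H and W.
    Returns (approx, row_detail, col_detail, diag_detail),
    each of shape (M, H/2, W/2, C).
    """
    a = img[:, 0::2, 0::2, :]   # top-left pixel of each 2x2 block
    b = img[:, 0::2, 1::2, :]   # top-right
    c = img[:, 1::2, 0::2, :]   # bottom-left
    d = img[:, 1::2, 1::2, :]   # bottom-right
    approx = (a + b + c + d) / 2.0
    row_detail = (a + b - c - d) / 2.0   # top row minus bottom row
    col_detail = (a - b + c - d) / 2.0   # left column minus right column
    diag_detail = (a - b - c + d) / 2.0
    return approx, row_detail, col_detail, diag_detail

def wavelet_l1_loss(pred, target, levels=1):
    """Eq. (8) sketch: L1 distance between wavelet coefficients,
    summed over sub-bands and averaged over the M samples."""
    total = np.zeros(pred.shape[0])
    p, t = pred, target
    for _ in range(levels):
        p_bands = haar_dwt2(p)
        t_bands = haar_dwt2(t)
        for pb, tb in zip(p_bands[1:], t_bands[1:]):   # detail sub-bands
            total += np.abs(pb - tb).sum(axis=(1, 2, 3))
        p, t = p_bands[0], t_bands[0]                  # recurse on approx
    total += np.abs(p - t).sum(axis=(1, 2, 3))         # coarsest approx
    return total.mean()

# Toy usage with random images.
rng = np.random.default_rng(0)
gt = rng.random((2, 8, 8, 4))
loss_offset = wavelet_l1_loss(gt + 0.5, gt, levels=1)
```

Each detail coefficient depends only on a small pixel neighbourhood, so errors in edges and textures stay localized in the loss, in contrast to the global support of Fourier coefficients.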

Combining these components, our comprehensive loss function integrates spatial, frequency, and wavelet domain losses:

$$L_{\text{total}} = L_{\text{spa}} + \lambda_1 L_{\text{Fourier}} + \lambda_2 L_{\text{Wavelet}}, \tag{9}$$

where $L_{\text{spa}}$ is the L1 loss in the spatial domain, and $\lambda_1$ and $\lambda_2$ are weighting coefficients that balance the contributions of the different loss components; they are set to 0.7 and 0.2, respectively, in implementation.
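
The weighting in Eq. (9) can be written out as a short sketch. The spatial term is assumed here to be a per-sample L1 sum averaged over the batch, and the Fourier and wavelet values passed in the usage lines are placeholder numbers chosen only to make the arithmetic easy to follow; the function names are hypothetical.

```python
import numpy as np

def spatial_l1(pred, target):
    """L_spa sketch: per-sample L1 distance in the image domain,
    averaged over the batch. pred, target: (M, H, W, C)."""
    return np.abs(pred - target).sum(axis=(1, 2, 3)).mean()

def total_loss(l_spa, l_fourier, l_wavelet, lam1=0.7, lam2=0.2):
    """Eq. (9): L_total = L_spa + lam1 * L_Fourier + lam2 * L_Wavelet."""
    return l_spa + lam1 * l_fourier + lam2 * l_wavelet

# Toy usage: every pixel differs by 1, so L_spa = 4 * 4 * 3 = 48.
pred = np.zeros((2, 4, 4, 3))
target = np.ones((2, 4, 4, 3))
l_spa = spatial_l1(pred, target)
# Placeholder values for the two transform-domain terms.
loss = total_loss(l_spa, l_fourier=10.0, l_wavelet=5.0)
```

With the default weights the spatial term dominates, and the two transform-domain terms act as additional penalties of 0.7 and 0.2 times their raw values.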

III Experiment

III-A Datasets and Implementation Details

To validate our approach, we construct datasets following Wald’s protocol [6], [24] on data collected from the WorldView-3 (WV3) and GaoFen-2 (GF2) satellites. Our datasets and data processing methods are obtained from the PanCollection repository [7]. The datasets consist of images cropped from entire remote sensing images, divided into training and testing sets. The training set comprises PAN/LRMS/GT image pairs obtained by downsampling simulation, with dimensions of 64×64, 16×16×C and 64×64×C, respectively. We evaluate our method with metrics commonly used in the field of pansharpening, including SAM [2], ERGAS [23] and Q8 [8] for the reduced-resolution dataset, and $D_s$, $D_\lambda$ and HQNR [22] for the full-resolution dataset. The CMT is trained for 400 epochs with a batch size of 32 using the Adam optimizer [15], with an initial learning rate of 0.001 that is halved every 100 epochs. As for other DL-based methods, we use the default settings in the related papers or codes to train the networks.
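
The learning-rate schedule described above can be expressed as a small function. This is a sketch that assumes zero-based epoch indexing and that the halving takes effect at epochs 100, 200 and 300; the text does not state either detail, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, step=100, factor=0.5):
    """Step decay: start at base_lr and multiply by `factor`
    after every `step` epochs (epoch is zero-based)."""
    return base_lr * factor ** (epoch // step)

# Learning rate at the start of each 100-epoch stage of a 400-epoch run.
schedule = [learning_rate(e) for e in (0, 100, 200, 300)]
```

Under these assumptions the final stage of training runs at one eighth of the initial learning rate.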

TABLE II: Quantitative results of DL-based methods on 20 reduced-resolution samples acquired by WV3. (Red: best; Blue: second best).

| Method | Q8↑ | SAM↓ | ERGAS↓ |
|---|---|---|---|
| PNN | 0.893 ± 0.092 | 3.677 ± 0.762 | 2.680 ± 0.647 |
| PanNet | 0.891 ± 0.093 | 3.613 ± 0.766 | 2.664 ± 0.688 |
| DiCNN | 0.900 ± 0.087 | 3.592 ± 0.762 | 2.672 ± 0.662 |
| FusionNet | 0.904 ± 0.090 | 3.324 ± 0.698 | 2.465 ± 0.644 |
| MMNet | 0.915 ± 0.086 | 3.084 ± 0.640 | 2.343 ± 0.626 |
| LAGConv | 0.910 ± 0.091 | 3.103 ± 0.558 | 2.292 ± 0.607 |
| HMPNet | 0.916 ± 0.087 | 3.063 ± 0.577 | 2.229 ± 0.545 |
| Proposed | 0.917 ± 0.086 | 3.001 ± 0.610 | 2.201 ± 0.522 |

TABLE III: Ablation study on loss components. (Red: best; Blue: second best).

| $L_{\text{spa}}$ | $L_{\text{wave}}$ | $L_{\text{fft}}$ | Q8↑ | SAM↓ | ERGAS↓ |
|---|---|---|---|---|---|
| | | | 0.912 ± 0.086 | 3.043 ± 0.618 | 2.238 ± 0.530 |
| | | | 0.913 ± 0.087 | 3.033 ± 0.612 | 2.223 ± 0.523 |
| | | | 0.916 ± 0.086 | 3.006 ± 0.609 | 2.208 ± 0.521 |
| | | | 0.917 ± 0.086 | 3.001 ± 0.610 | 2.201 ± 0.522 |

TABLE IV: Ablation study on Modulation Approach. (Red: best; Blue: second best).

| Method | $D_\lambda$ | $D_s$ | HQNR↑ |
|---|---|---|---|
| V1 | 0.0210 ± 0.0074 | 0.0364 ± 0.0125 | 0.9435 ± 0.0180 |
| V2 | 0.0249 ± 0.0123 | 0.0355 ± 0.0132 | 0.9406 ± 0.0187 |
| V3 | 0.0234 ± 0.0079 | 0.0388 ± 0.0155 | 0.9389 ± 0.0210 |
| CMT | 0.0201 ± 0.0074 | 0.0344 ± 0.0135 | 0.9463 ± 0.0188 |

III-B Results

The performance of the proposed CMT method is showcased through extensive evaluations on the GF2 and WV3 datasets. TABLE I presents a comprehensive comparison of CMT with various state-of-the-art methods on the GF2 dataset. The quantitative results show that our method obtains the best SAM, Q4, $D_s$ and HQNR, while HMPNet attains a lower ERGAS and PanNet a lower $D_\lambda$. The visual comparison results are provided in Fig. 4. TABLE II presents the results on the WV3 dataset, where the proposed method obtains the best average results on all quality indexes.

III-C Ablation Experiment

Ablation on Hybrid Loss. We compare the outcomes after training with different loss functions to ascertain the impact of each loss component on the overall performance of the model. We perform experiments on 20 reduced-resolution samples acquired by the WV3 satellite. The results are displayed in TABLE III.

Ablation on Modulation Approach. To validate the effectiveness of our method, we create three variants of the CMT. In the first variant (V1), modulation is omitted, retaining only the Transformer to evaluate its baseline feature integration performance. The second variant (V2) exclusively uses the PAN image to modulate the MS features. The third variant (V3) solely uses the LRMS image to modulate the PAN features. We perform experiments on 20 full-resolution samples acquired by the WV3 satellite. The results in TABLE IV show that CMT has the best overall performance, which supports the efficacy of our method.

IV Conclusion

In this study, we introduce the CMT, a novel pansharpening method that synergistically merges PAN and LRMS images. Central to CMT is the application of signal modulation, innovatively incorporated into a Transformer-based architecture. This allows for precise modulation of the Transformer’s value (V) matrix, facilitating a superior integration of spatial detail and spectral depth. Our method is characterized by its unique CM-MSA modulation technique and a bespoke hybrid loss function that blends Fourier and wavelet transforms. This loss function adeptly captures both global patterns and local textures, thereby enhancing spatial resolution while maintaining spectral fidelity. The versatility of CMT suggests its applicability beyond pansharpening, offering promising enhancements in various fields that require intricate and spectrally accurate image fusion.

References

  • [1] Wele Gedara Chaminda Bandara and Vishal M Patel. Hypertransformer: A textural and spectral feature fusion transformer for pansharpening. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1767–1777, 2022.
  • [2] Joseph W Boardman. Automating spectral unmixing of aviris data using convex geometry concepts. 1993.
  • [3] Yuanhao Cai, Jing Lin, Xiaowan Hu, Haoqian Wang, Xin Yuan, Yulun Zhang, Radu Timofte, and Luc Van Gool. Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17502–17511, 2022.
  • [4] Zhixuan Chen, Cheng Jin, Tianjing Zhang, Xiao Wu, and Liangjian Deng. Spanconv: A new convolution via spanning kernel space for lightweight pansharpening. International Joint Conference on Artificial Intelligence, pages 841–847, 2022.
  • [5] Liangjian Deng, Gemine Vivone, Cheng Jin, and Jocelyn Chanussot. Detail injection-based deep convolutional neural networks for pansharpening. IEEE Transactions on Geoscience and Remote Sensing, pages 6995–7010, 2020.
  • [6] Liangjian Deng, Gemine Vivone, Cheng Jin, and Jocelyn Chanussot. Detail injection-based deep convolutional neural networks for pansharpening. IEEE Transactions on Geoscience and Remote Sensing, pages 6995–7010, 2020.
  • [7] Liangjian Deng, Gemine Vivone, Mercedes E Paoletti, Giuseppe Scarpa, Jiang He, Yongjun Zhang, Jocelyn Chanussot, and Antonio Plaza. Machine learning in pansharpening: A benchmark, from shallow to deep networks. IEEE Geoscience and Remote Sensing Magazine, pages 279–315, 2022.
  • [8] Renwei Dian, Shutao Li, and Leyuan Fang. Learning a low tensor-train rank representation for hyperspectral image super-resolution. IEEE Transactions on Neural Networks and Learning Systems, pages 2672–2683, 2019.
  • [9] Ralph VL Hartley. Transmission of information 1. Bell System technical journal, pages 535–563, 1928.
  • [10] Lin He, Yizhou Rao, Jun Li, Jocelyn Chanussot, Antonio Plaza, Jiawei Zhu, and Bo Li. Pansharpening via detail injection based convolutional neural networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pages 1188–1204, 2019.
  • [11] Jinfan Hu, Tingzhu Huang, Liangjian Deng, Hongxia Dou, Danfeng Hong, and Gemine Vivone. Fusformer: A transformer-based fusion network for hyperspectral image super-resolution. IEEE Geoscience and Remote Sensing Letters, pages 1–5, 2022.
  • [12] Zirong Jin, Liangjian Deng, Tianjing Zhang, and Xiao Xu Jin. Bam: Bilateral activation mechanism for image fusion. Proceedings of the 29th ACM international conference on multimedia, pages 4315–4323, 2021.
  • [13] Zirong Jin, Tianjing Zhang, Taixiang Jiang, Gemine Vivone, and Liangjian Deng. Lagconv: Local-context adaptive convolution kernels with global harmonic bias for pansharpening. AAAI Conference on Artificial Intelligence, 2022.
  • [14] Er Saiqa Khan and Er Arun Kulkarni. An efficient method for detection of copy-move forgery using discrete wavelet transform. International Journal on Computer Science and Engineering, page 1801, 2010.
  • [15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [16] Giuseppe Masi, Davide Cozzolino, Luisa Verdoliva, and Giuseppe Scarpa. Pansharpening by convolutional neural networks. Remote Sensing, page 594, 2016.
  • [17] Xiangchao Meng, Nan Wang, Feng Shao, and Shutao Li. Vision transformer for pansharpening. IEEE Transactions on Geoscience and Remote Sensing, pages 1–11, 2022.
  • [18] Raymond Edward Alan Christopher Paley and Norbert Wiener. Fourier transforms in the complex domain. American Mathematical Soc., 1934.
  • [19] Hans Roder. Amplitude, phase, and frequency modulation. Proceedings of the Institute of Radio Engineers, pages 2145–2176, 1931.
  • [20] Xin Tian, Kun Li, Wei Zhang, Zhongyuan Wang, and Jiayi Ma. Interpretable model-driven deep network for hyperspectral, multispectral, and panchromatic image fusion. IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 2017.
  • [22] Gemine Vivone, Luciano Alparone, Jocelyn Chanussot, Mauro Dalla Mura, Andrea Garzelli, Giorgio A. Licciardi, Rocco Restaino, and Lucien Wald. A critical comparison among pansharpening algorithms. IEEE Transactions on Geoscience and Remote Sensing, pages 2565–2586, 2015.
  • [23] Gemine Vivone, Rocco Restaino, Mauro Dalla Mura, Giorgio Licciardi, and Jocelyn Chanussot. Contrast and error-based fusion schemes for multispectral image pansharpening. IEEE Geoscience and Remote Sensing Letters, pages 930–934, 2014.
  • [24] Lucien Wald, Thierry Ranchin, and Marc Mangolini. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images. Photogrammetric engineering and remote sensing, pages 691–699, 1997.
  • [25] Xiao Wu, Tingzhu Huang, Liangjian Deng, and Tianjing Zhang. Dynamic cross feature fusion for remote sensing pansharpening. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14687–14696, October 2021.
  • [26] Keyu Yan, Man Zhou, Li Zhang, and Chengjun Xie. Memory-augmented model-driven network for pansharpening. European Conference on Computer Vision, pages 306–322, 2022.
  • [27] Junfeng Yang, Xueyang Fu, Yuwen Hu, Yue Huang, Xinghao Ding, and John Paisley. Pannet: A deep network architecture for pan-sharpening. In Proceedings of the IEEE international conference on computer vision, pages 5449–5457, 2017.
  • [28] Hao Zhang, Hebaixu Wang, Xin Tian, and Jiayi Ma. P2sharpen: A progressive pansharpening network with deep spectral transformation. Information Fusion, pages 103–122, 2023.
  • [29] Tianjiang Zhang, Liangjian Deng, Tingzhu Huang, Jocelyn Chanussot, and Gemine Vivone. A triple-double convolutional neural network for panchromatic sharpening. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [30] Huanyu Zhou, Qingjie Liu, and Yunhong Wang. Panformer: A transformer based model for pan-sharpening. IEEE International Conference on Multimedia and Expo, pages 1–6, 2022.