CMT: Cross Modulation Transformer with Hybrid Loss for Pansharpening
Abstract
Pansharpening aims to enhance remote sensing image (RSI) quality by merging high-resolution panchromatic (PAN) with multispectral (MS) images. However, prior techniques struggled to optimally fuse PAN and MS images for enhanced spatial and spectral information, due to a lack of a systematic framework capable of effectively coordinating their individual strengths. In response, we present the Cross Modulation Transformer (CMT), a pioneering method that modifies the attention mechanism. This approach utilizes a robust modulation technique from signal processing, integrating it into the attention mechanism’s calculations. It dynamically tunes the weights of the carrier’s value (V) matrix according to the modulator’s features, thus resolving historical challenges and achieving a seamless integration of spatial and spectral attributes. Furthermore, considering that RSI exhibit large-scale features and edge details along with local textures, we crafted a hybrid loss function that combines Fourier and wavelet transforms to effectively capture these characteristics, thereby enhancing both spatial and spectral accuracy in pansharpening. Extensive experiments demonstrate our framework’s superior performance over existing state-of-the-art methods. The code will be publicly available to encourage further research.
Index Terms:
Pansharpening, Cross Modulation Transformer, Fourier and Wavelet Transforms
I Introduction
Given the inherent constraints of remote sensing technology, obtaining MS images with high spatial resolution directly from satellites is a significant challenge. As a solution, pansharpening has become a crucial technique. It merges low-resolution MS (LRMS) images with high-resolution PAN images to produce high-spatial-resolution MS (HRMS) images with superior spatial detail. This fusion technique effectively navigates around the limitations of sensor technology, offering invaluable data for RSI analysis.
Deep learning breakthroughs, led by Convolutional Neural Networks (CNNs), have significantly advanced the field of pansharpening [4], [12], [29]. CNNs have shown exceptional prowess in pansharpening, skillfully extracting and combining intricate features from various images to enhance both spatial and spectral quality. Furthermore, since their introduction, Transformers [21] have revolutionized numerous fields, including pansharpening [11], [28], [17] by their unparalleled ability to model long-range dependencies using self-attention mechanisms. This capability gives them a significant edge in effectively blending spatial and spectral features, outperforming traditional CNN-based approaches by capturing the complex dynamics between different types of images.
However, the field of pansharpening still faces several challenges. Firstly, hyperspectral images (HSI) show spatial sparsity and spectral self-similarity, which complicates spatial dependency modeling and highlights the importance of prioritizing inter-spectral over spatial correlations in Transformers. Secondly, RSI are marked by rich spectral content, complex surface textures, and edge details, which makes fusion difficult and calls for tailored approaches. Lastly, attention structures designed specifically for pansharpening within the Transformer framework warrant further exploration. Current methods for integrating PAN and LRMS images, such as PanFormer [30] and HyperTransformer [1], mainly project the images linearly into tokens and merge them via attention computations. While these methods are efficient, they do not fully exploit the relationship between the spatial resolution of PAN images and the spectral diversity of LRMS images.
In response to these challenges, we develop the CMT and a hybrid loss function to improve the pansharpening workflow. Firstly, as illustrated in Fig. 1 (c), to harness spectral features for LRMS images and spatial correlation for PAN images, we compute the attention block independently for each spectral and spatial channel, ensuring targeted processing of the distinct characteristics of each modality. Secondly, to address the complexity of RSI, characterized by rich spectral content and complex surface textures, we implement a hybrid loss function that combines Fourier and wavelet transforms. This approach uses Fourier transforms to capture large-scale features and wavelet transforms to enhance local texture details. Together, they improve spatial detail and maintain spectral fidelity in the pansharpening process. Lastly, our approach blends PAN’s spatial details with LRMS’s spectral data by applying modulation techniques to the pansharpening process. Specifically, rather than merely concatenating features, our approach uses the features of the modulator to dynamically modulate the carrier, altering the weights of the Transformer’s value (V) matrix, which achieves a more sophisticated fusion of features. This design is inspired by the signal processing technique depicted in Fig. 1 (a), where modulation integrates high- and low-frequency signals and improves signal fidelity and richness. Similarly, CASSI systems, illustrated in Fig. 1 (b), compress images by modulating high frequencies with masks, and deep learning methods such as MST by Cai et al. [3] likewise employ masks for recovery.
In conclusion, the CMT framework significantly advances pansharpening through the following contributions:
1. Our Cross Modulation module within the CMT framework significantly enhances the fusion of PAN and LRMS images through a novel modulation technique.
2. We introduce a pioneering hybrid loss function that combines Fourier and wavelet transforms, which is the first attempt in the field of pansharpening to the best of our knowledge.
3. The CMT framework delivers outstanding results on benchmark datasets, establishing a new benchmark for pansharpening performance.
II Method
II-A Overall Architecture
The CMT architecture, depicted in Fig. 2, is structured into three primary phases: feature extraction, modulation, and feature aggregation. In the feature extraction phase, distinct extractors are designed to capture the unique characteristics of PAN and LRMS images. During the modulation phase, the Cross Modulation Attention Block (CMAB) modulator separately modulates PAN images to enrich LRMS images and vice versa, allowing for an effective blend of spatial and spectral information. Finally, the features are aggregated by a convolutional layer followed by four ResNet blocks.
Initially, LRMS images are upsampled to dimensions $H \times W \times C$, matching the spatial size $H \times W$ of the PAN image, where $C$ is the number of spectral bands. Then, PAN and LRMS images are processed through convolutional and ResNet blocks to extract local and global spatial details, yielding a PAN feature map and an LRMS feature map, respectively, with the number of feature channels expanded in implementation.
The extracted PAN and LRMS features are then modulated within the CMAB module, as shown in Fig. 3 (a). This module comprises a CM-MSA, two normalization layers, and a DFFN, employing varied activation functions for enhanced modulation. Fig. 3 (b) illustrates the specifics of the DFFN. After modulation, the features are concatenated and merged through a convolution and four ResNet blocks, allowing for further integration of spatial and spectral information.
Ultimately, the aggregated features are combined with the upsampled LRMS images to produce the HRMS.
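The outer data flow described above (upsample the LRMS image, predict details from both inputs, add them back to the upsampled LRMS) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nearest-neighbour upsampling, the scale factor of 4 (taken from the 16×16 to 64×64 patch sizes), and the placeholder `toy_residual` function standing in for the whole network are assumptions.

```python
import numpy as np

def upsample_nearest(lrms: np.ndarray, scale: int = 4) -> np.ndarray:
    """Upsample an (h, w, C) image by pixel repetition (assumed interpolation)."""
    return np.repeat(np.repeat(lrms, scale, axis=0), scale, axis=1)

def fuse(pan, lrms, predict_residual, scale=4):
    """Upsample LRMS, predict a detail residual, and add it back."""
    up = upsample_nearest(lrms, scale)        # (H, W, C)
    residual = predict_residual(pan, up)      # (H, W, C), stands in for the network
    return up + residual

def toy_residual(pan, up):
    """Hypothetical placeholder: inject zero-mean PAN detail into every band."""
    detail = pan - pan.mean()
    return 0.1 * np.repeat(detail[..., None], up.shape[-1], axis=-1)

pan = np.random.default_rng(0).random((64, 64))        # PAN patch
lrms = np.random.default_rng(1).random((16, 16, 8))    # LRMS patch with 8 bands
hrms = fuse(pan, lrms, toy_residual)
print(hrms.shape)  # (64, 64, 8)
```

Because the network only predicts a residual, a zero residual returns the upsampled LRMS unchanged, which is the behaviour the skip connection is meant to guarantee.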
II-B Modulation Approach
In signal processing, Double Side-Band (DSB) modulation [9], [19] is a classic and effective technique in which the amplitude of a carrier wave is varied in accordance with the instantaneous value of the message signal, thus encoding information within the carrier. A DSB signal can be written as:
$$s(t) = A_c\, m(t) \cos(2\pi f_c t) \tag{1}$$
where $s(t)$ denotes the modulated signal, $A_c$ represents the constant amplitude of the carrier wave, $f_c$ is the carrier frequency, and $m(t)$ is the message signal intended for transmission. This foundational concept shows how modulation can encode and transmit complex information efficiently and accurately across different mediums. In a similar vein, as depicted in Fig. 1 (c), our pansharpening framework adopts a cross modulation paradigm in which high-resolution spatial features and spectral details modulate each other. This bilateral modulation mechanism improves the model’s ability to capture and combine multi-dimensional information compared with traditional methods.
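To make the modulation idea concrete, the sketch below samples a DSB signal in which the message scales the carrier's amplitude. It assumes the standard suppressed-carrier form s(t) = A_c · m(t) · cos(2π f_c t); the function name and the example frequencies are illustrative and not taken from the paper.

```python
import math

def dsb_modulate(message, carrier_amp, carrier_freq, sample_rate):
    """Return samples of A_c * m(t) * cos(2*pi*f_c*t) for a sampled message m."""
    return [
        carrier_amp * m * math.cos(2.0 * math.pi * carrier_freq * n / sample_rate)
        for n, m in enumerate(message)
    ]

# Hypothetical example: a 5 Hz message tone on a 100 Hz carrier, sampled at 1 kHz.
fs, fc, fm = 1000, 100.0, 5.0
message = [math.cos(2.0 * math.pi * fm * n / fs) for n in range(fs)]
signal = dsb_modulate(message, carrier_amp=2.0, carrier_freq=fc, sample_rate=fs)
```

The slowly varying message forms the envelope of the fast carrier, which is the role the modulator features play for the carrier's value matrix in the attention block.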
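Firstly, for the carrier, the input feature is reshaped into tokens $X$. A multi-head attention mechanism is employed to improve generalization and capture multi-dimensional information, splitting $X$ into $N$ heads:
$$X = [X_1, X_2, \dots, X_N] \tag{2}$$
where $X \in \mathbb{R}^{HW \times D}$ holds $HW$ tokens with $D$ feature channels, $X_j \in \mathbb{R}^{HW \times d_h}$, and $d_h = D/N$.
Each $X_j$ is then linearly projected into queries $Q_j$, keys $K_j$, and values $V_j$ using the following equations:
$$Q_j = X_j W_j^{Q}, \quad K_j = X_j W_j^{K}, \quad V_j = X_j W_j^{V} \tag{3}$$
where $W_j^{Q}$, $W_j^{K}$, and $W_j^{V}$ are learnable parameters, and $j = 1, \dots, N$.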
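For the modulator, aligned with the carrier, the input feature is reshaped into tokens $Y$ and split into $N$ heads $Y_j$. Then, the modulation is integrated into the self-attention calculation by element-wise multiplying $Y_j$ with the carrier's value matrix $V_j$:
$$\hat{V}_j = Y_j \odot V_j \tag{4}$$
where $\odot$ denotes element-wise multiplication, and $Y_j$ and $\hat{V}_j$ have the same shape as $V_j$.
For a single head, the modulation-attention computation is:
$$\mathrm{MA}(Q_j, K_j, \hat{V}_j) = \hat{V}_j \, \mathrm{softmax}(\sigma_j K_j^{\top} Q_j) \tag{5}$$
where MA is modulation attention and $\sigma_j$ is a learnable parameter that adaptively scales the matrix multiplication, enhancing the model’s ability to adjust attention weights dynamically.
The results from the heads are concatenated, and with the addition of position encoding, the final output is derived as follows:
$$\mathrm{CM\text{-}MSA}(X, Y) = \mathrm{Fc}\big(\mathrm{Concat}_{j=1}^{N}\, \mathrm{MA}(Q_j, K_j, \hat{V}_j)\big) + P \tag{6}$$
where Fc denotes a fully connected layer, and $P$ represents the position encoding.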
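The modulation attention described above can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: it assumes a channel-wise attention map of the form softmax(σ · KᵀQ) in the style of MST [3], treats the position encoding as a given array, and uses hypothetical names (`modulation_attention`, `cross_modulation_msa`) and random weights.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modulation_attention(x_j, y_j, w_q, w_k, w_v, sigma=1.0):
    """Single head: x_j are carrier tokens (n, d), y_j modulator tokens (n, d)."""
    q, k, v = x_j @ w_q, x_j @ w_k, x_j @ w_v     # each (n, d)
    v_mod = y_j * v                                # modulator rescales the carrier's values
    attn = softmax(sigma * (k.T @ q), axis=-1)     # (d, d) channel-wise attention (assumed form)
    return v_mod @ attn                            # (n, d)

def cross_modulation_msa(x, y, head_weights, sigmas, w_out, pos):
    """Split channels into heads, apply modulation attention, project, add position term."""
    n_heads = len(head_weights)
    x_heads = np.split(x, n_heads, axis=1)
    y_heads = np.split(y, n_heads, axis=1)
    heads = [
        modulation_attention(xj, yj, *w, sigma=s)
        for xj, yj, w, s in zip(x_heads, y_heads, head_weights, sigmas)
    ]
    return np.concatenate(heads, axis=1) @ w_out + pos

rng = np.random.default_rng(0)
n_tokens, channels, n_heads = 16, 8, 2
d_head = channels // n_heads
x = rng.standard_normal((n_tokens, channels))      # carrier tokens
y = rng.standard_normal((n_tokens, channels))      # modulator tokens
head_weights = [
    tuple(rng.standard_normal((d_head, d_head)) for _ in range(3))
    for _ in range(n_heads)
]
sigmas = [1.0] * n_heads
w_out = rng.standard_normal((channels, channels))
pos = np.zeros((n_tokens, channels))               # placeholder position encoding
out = cross_modulation_msa(x, y, head_weights, sigmas, w_out, pos)
print(out.shape)  # (16, 8)
```

In this sketch the attention map depends only on the carrier, while the modulator enters through the values, so an all-ones modulator reduces the block to ordinary (unmodulated) attention.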
TABLE I: Results on the GF2 dataset. SAM, ERGAS, and Q4 are reduced-resolution metrics; $D_\lambda$, $D_s$, and HQNR are full-resolution metrics.
Method | SAM↓ | ERGAS↓ | Q4↑ | $D_\lambda$↓ | $D_s$↓ | HQNR↑ |
---|---|---|---|---|---|---|
PNN [16] | 1.048±0.226 | 1.057±0.235 | 0.960±0.010 | 0.0317±0.0286 | 0.0943±0.0224 | 0.877±0.036 |
PanNet [27] | 0.997±0.212 | 0.919±0.191 | 0.967±0.010 | 0.0179±0.0110 | 0.0799±0.0178 | 0.904±0.020 |
DiCNN [10] | 1.052±0.231 | 1.081±0.254 | 0.959±0.010 | 0.0369±0.0132 | 0.0992±0.0131 | 0.868±0.016 |
FusionNet [5] | 0.973±0.212 | 0.988±0.222 | 0.964±0.009 | 0.0350±0.0124 | 0.1013±0.0134 | 0.867±0.018 |
DCFNet [25] | 0.872±0.169 | 0.784±0.146 | 0.974±0.009 | 0.0240±0.0115 | 0.0659±0.0096 | 0.912±0.012 |
MMNet [26] | 0.993±0.141 | 0.777±0.134 | 0.969±0.020 | 0.0443±0.0298 | 0.1033±0.0129 | 0.857±0.027 |
LAGConv [13] | 0.786±0.148 | 0.687±0.113 | 0.980±0.009 | 0.0284±0.0130 | 0.0792±0.0136 | 0.895±0.020 |
HMPNet [20] | 0.803±0.156 | 0.564±0.099 | 0.981±0.030 | 0.0819±0.0499 | 0.1146±0.0126 | 0.813±0.049 |
Proposed | 0.722±0.136 | 0.624±0.107 | 0.992±0.001 | 0.0202±0.0103 | 0.0338±0.0086 | 0.947±0.012 |
II-C Loss Function
To enhance the resolution and quality of RSI, as shown in Fig. 3 (c), our method improves upon traditional fusion techniques by leveraging both Fourier and wavelet transforms. Fourier transforms [18] map images into the frequency domain and capture large-scale features. In addition, wavelet transforms [14] decompose images across multiple scales, which helps enhance local textures and details.
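The Fourier transform loss function is defined as:
$$\mathcal{L}_{F} = \frac{1}{M} \sum_{i=1}^{M} \left\| \mathcal{F}(\hat{I}_i) - \mathcal{F}(I_i) \right\|_1 \tag{7}$$
which applies the L1 loss to the difference between the Fourier transforms $\mathcal{F}(\cdot)$ of the predicted images ($\hat{I}_i$) and the ground truth images ($I_i$), averaged over all $M$ training samples.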
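A frequency-domain L1 loss of this kind can be sketched with NumPy's FFT. The batch layout `(M, H, W, C)`, the use of a mean over all coefficients as normalisation, and the function name are assumptions made for illustration; the paper does not specify them.

```python
import numpy as np

def fourier_l1_loss(pred, target):
    """Mean absolute difference between the 2-D FFTs of two (M, H, W, C) batches."""
    f_pred = np.fft.fft2(pred, axes=(1, 2))    # transform over the spatial axes only
    f_tgt = np.fft.fft2(target, axes=(1, 2))
    return float(np.mean(np.abs(f_pred - f_tgt)))

rng = np.random.default_rng(0)
target = rng.random((2, 8, 8, 4))
loss_same = fourier_l1_loss(target, target)          # identical inputs give zero loss
loss_shift = fourier_l1_loss(target + 0.5, target)   # a global offset only changes the DC term
```

A constant brightness offset lands entirely in the zero-frequency coefficient, which illustrates why the Fourier term is sensitive to global, large-scale discrepancies.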
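The loss function for the wavelet transform is defined as:
$$\mathcal{L}_{W} = \frac{1}{M} \sum_{i=1}^{M} \sum_{s} \sum_{o} \left\| W_{s,o}(\hat{I}_i) - W_{s,o}(I_i) \right\|_1 \tag{8}$$
where $W_{s,o}(\hat{I}_i)$ and $W_{s,o}(I_i)$ capture the wavelet coefficients of the predicted and ground truth images at scale $s$ and orientation $o$, reflecting local variations and textures, averaged across all $M$ training instances.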
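The wavelet term can be illustrated with a hand-written Haar decomposition. The wavelet family, the number of levels, the L1 distance, and the restriction to detail sub-bands are all assumptions here, since the text does not state them; the function names are hypothetical.

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar transform on an (H, W) array with even H and W.
    Returns the approximation band and three detail bands (orientations)."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    approx = (a + b + c + d) / 2.0
    d_rows = (a + b - c - d) / 2.0    # differences between adjacent rows
    d_cols = (a - b + c - d) / 2.0    # differences between adjacent columns
    d_diag = (a - b - c + d) / 2.0    # diagonal differences
    return approx, d_rows, d_cols, d_diag

def wavelet_l1_loss(pred, target, levels=2):
    """Sum over scales and orientations of the mean absolute detail-band difference."""
    loss = 0.0
    for _ in range(levels):
        p_approx, *p_details = haar_dwt2(pred)
        t_approx, *t_details = haar_dwt2(target)
        loss += sum(float(np.mean(np.abs(p - t))) for p, t in zip(p_details, t_details))
        pred, target = p_approx, t_approx    # recurse on the coarser scale
    return loss

rows, cols = np.indices((8, 8))
checkerboard = np.where((rows + cols) % 2 == 0, 1.0, -1.0)   # pure fine texture
texture_loss = wavelet_l1_loss(checkerboard, np.zeros((8, 8)))
print(texture_loss)  # 2.0
```

In contrast to the Fourier term, the detail bands ignore a constant offset and respond to fine local patterns such as the checkerboard above.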
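Combining these components, our comprehensive loss function integrates spatial, frequency, and wavelet domain losses:
$$\mathcal{L} = \mathcal{L}_{1} + \lambda_1 \mathcal{L}_{F} + \lambda_2 \mathcal{L}_{W} \tag{9}$$
where $\mathcal{L}_{1}$ is the L1 loss in the spatial domain, $\mathcal{L}_{F}$ and $\mathcal{L}_{W}$ are the Fourier and wavelet losses of Eqs. (7) and (8), and the weighting coefficients $\lambda_1$ and $\lambda_2$ are set to 0.7 and 0.2 in implementation to balance the contributions of the different loss components.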
III Experiment
III-A Datasets and Implementation Details
To validate our approach, we construct datasets following Wald’s protocol [6], [24] on data collected from the WorldView-3 (WV3) and GaoFen-2 (GF2) satellites. Our datasets and data processing methods are obtained from the PanCollection repository [7]. The datasets consist of images cropped from entire remote sensing images, divided into training and testing sets. The training set comprises PAN/LRMS/GT image pairs obtained by downsampling simulation, with dimensions of 64×64, 16×16×C, and 64×64×C, respectively. We evaluate our method with metrics commonly used in pansharpening: SAM [2], ERGAS [23], and Q8 [8] on the reduced-resolution datasets, and $D_\lambda$, $D_s$, and HQNR [22] on the full-resolution datasets. The CMT was trained for 400 epochs with a batch size of 32, using the Adam optimizer [15] with an initial learning rate of 0.001 that is halved every 100 epochs. For the other DL-based methods, we use the default settings from the related papers or codes to train the networks.
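The learning-rate schedule stated above (0.001 initially, halved every 100 epochs, 400 epochs in total) can be written as a small step function. The function name and signature are illustrative, and epochs are assumed to be counted from zero.

```python
def learning_rate(epoch, base_lr=1e-3, step=100, factor=0.5):
    """Step decay: multiply the base rate by `factor` once per `step` epochs."""
    return base_lr * factor ** (epoch // step)

# Learning rate at the start of each 100-epoch stage of a 400-epoch run.
schedule = [learning_rate(e) for e in (0, 100, 200, 300)]
```

Under these assumptions the final stage (epochs 300 to 399) trains at one eighth of the initial rate.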
TABLE II: Results on the WV3 dataset.
Method | Q8↑ | SAM↓ | ERGAS↓ |
---|---|---|---|
PNN | 0.893±0.092 | 3.677±0.762 | 2.680±0.647 |
PanNet | 0.891±0.093 | 3.613±0.766 | 2.664±0.688 |
DiCNN | 0.900±0.087 | 3.592±0.762 | 2.672±0.662 |
FusionNet | 0.904±0.090 | 3.324±0.698 | 2.465±0.644 |
MMNet | 0.915±0.086 | 3.084±0.640 | 2.343±0.626 |
LAGConv | 0.910±0.091 | 3.103±0.558 | 2.292±0.607 |
HMPNet | 0.916±0.087 | 3.063±0.577 | 2.229±0.545 |
Proposed | 0.917±0.086 | 3.001±0.610 | 2.201±0.522 |
TABLE III: Ablation on the hybrid loss (20 reduced-resolution WV3 samples). The first three columns mark which loss components are used.
$\mathcal{L}_{1}$ | $\mathcal{L}_{F}$ | $\mathcal{L}_{W}$ | Q8↑ | SAM↓ | ERGAS↓ |
---|---|---|---|---|---|
✓ | ✗ | ✗ | 0.912±0.086 | 3.043±0.618 | 2.238±0.530 |
✓ | ✓ | ✗ | 0.913±0.087 | 3.033±0.612 | 2.223±0.523 |
✓ | ✗ | ✓ | 0.916±0.086 | 3.006±0.609 | 2.208±0.521 |
✓ | ✓ | ✓ | 0.917±0.086 | 3.001±0.610 | 2.201±0.522 |
TABLE IV: Ablation on the modulation approach (20 full-resolution WV3 samples).
Method | $D_\lambda$↓ | $D_s$↓ | HQNR↑ |
---|---|---|---|
V1 | 0.0210±0.0074 | 0.0364±0.0125 | 0.9435±0.0180 |
V2 | 0.0249±0.0123 | 0.0355±0.0132 | 0.9406±0.0187 |
V3 | 0.0234±0.0079 | 0.0388±0.0155 | 0.9389±0.0210 |
CMT | 0.0201±0.0074 | 0.0344±0.0135 | 0.9463±0.0188 |
III-B Results
The performance of the proposed CMT method is showcased through extensive evaluations on the GF2 and WV3 datasets. TABLE I presents a comprehensive comparison of CMT with various state-of-the-art methods on the GF2 dataset. The quantitative results show that our method achieves the best SAM, Q4, $D_s$, and HQNR, while HMPNet attains a lower ERGAS and PanNet a lower $D_\lambda$. The visual comparison results are provided in Fig. 3. TABLE II presents the results on the WV3 dataset, where the proposed method obtains the best average results on all quality indexes.
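As a sanity check on the full-resolution numbers, HQNR is commonly defined as the product (1 − D_λ)^α (1 − D_s)^β with α = β = 1. Assuming that definition, and assuming the two distortion columns of TABLE I and TABLE IV are D_λ and D_s, the reported mean values are mutually consistent up to rounding and per-sample averaging:

```python
def hqnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """HQNR from the spectral (D_lambda) and spatial (D_s) distortion indexes."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta

# Mean distortions reported for the proposed method.
gf2 = hqnr(0.0202, 0.0338)   # TABLE I reports HQNR = 0.947
wv3 = hqnr(0.0201, 0.0344)   # TABLE IV reports HQNR = 0.9463
```

Both distortions enter multiplicatively, so a method must keep spectral and spatial distortion low at the same time to score well.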
III-C Ablation Experiment
Ablation on Hybrid Loss. We compare the outcomes after training with different loss functions. The results are displayed in TABLE III. This comparison aims to ascertain the impact of each loss component on the overall performance of the model. We perform experiments on 20 reduced-resolution samples acquired by WV3 satellite.
Ablation on Modulation Approach. To validate the effectiveness of our method, we create three variants of the CMT. In the first variant (V1), modulation is omitted, retaining only the Transformer to evaluate its baseline feature integration performance. The second variant (V2) uses only the PAN image to modulate the MS features. The third variant (V3) uses only the LRMS image to modulate the PAN features. We perform experiments on 20 full-resolution samples acquired by the WV3 satellite. The results in TABLE IV show that CMT has the best overall performance, supporting the effectiveness of cross modulation.
IV Conclusion
In this study, we introduce the CMT, a novel pansharpening method that synergistically merges PAN and LRMS images. Central to CMT is the application of signal modulation, incorporated into a Transformer-based architecture. This allows for precise modulation of the Transformer’s value (V) matrix, facilitating a superior integration of spatial detail and spectral depth. Our method is characterized by its CM-MSA modulation technique and a bespoke hybrid loss function that blends Fourier and wavelet transforms. This loss function captures both global patterns and local textures, thereby enhancing spatial resolution while maintaining spectral fidelity. The versatility of CMT suggests its applicability beyond pansharpening, offering promising enhancements in various fields that require intricate and spectrally accurate image fusion.
References
- [1] Wele Gedara Chaminda Bandara and Vishal M Patel. Hypertransformer: A textural and spectral feature fusion transformer for pansharpening. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1767–1777, 2022.
- [2] Joseph W Boardman. Automating spectral unmixing of aviris data using convex geometry concepts. 1993.
- [3] Yuanhao Cai, Jing Lin, Xiaowan Hu, Haoqian Wang, Xin Yuan, Yulun Zhang, Radu Timofte, and Luc Van Gool. Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17502–17511, 2022.
- [4] Zhixuan Chen, Cheng Jin, Tianjing Zhang, Xiao Wu, and Liangjian Deng. Spanconv: A new convolution via spanning kernel space for lightweight pansharpening. International Joint Conference on Artificial Intelligence, pages 841–847, 2022.
- [5] Liangjian Deng, Gemine Vivone, Cheng Jin, and Jocelyn Chanussot. Detail injection-based deep convolutional neural networks for pansharpening. IEEE Transactions on Geoscience and Remote Sensing, pages 6995–7010, 2020.
- [6] Liangjian Deng, Gemine Vivone, Cheng Jin, and Jocelyn Chanussot. Detail injection-based deep convolutional neural networks for pansharpening. IEEE Transactions on Geoscience and Remote Sensing, pages 6995–7010, 2020.
- [7] Liangjian Deng, Gemine Vivone, Mercedes E Paoletti, Giuseppe Scarpa, Jiang He, Yongjun Zhang, Jocelyn Chanussot, and Antonio Plaza. Machine learning in pansharpening: A benchmark, from shallow to deep networks. IEEE Geoscience and Remote Sensing Magazine, pages 279–315, 2022.
- [8] Renwei Dian, Shutao Li, and Leyuan Fang. Learning a low tensor-train rank representation for hyperspectral image super-resolution. IEEE Transactions on Neural Networks and Learning Systems, pages 2672–2683, 2019.
- [9] Ralph VL Hartley. Transmission of information. Bell System technical journal, pages 535–563, 1928.
- [10] Lin He, Yizhou Rao, Jun Li, Jocelyn Chanussot, Antonio Plaza, Jiawei Zhu, and Bo Li. Pansharpening via detail injection based convolutional neural networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pages 1188–1204, 2019.
- [11] Jinfan Hu, Tingzhu Huang, Liangjian Deng, Hongxia Dou, Danfeng Hong, and Gemine Vivone. Fusformer: A transformer-based fusion network for hyperspectral image super-resolution. IEEE Geoscience and Remote Sensing Letters, pages 1–5, 2022.
- [12] Zirong Jin, Liangjian Deng, Tianjing Zhang, and Xiao Xu Jin. Bam: Bilateral activation mechanism for image fusion. Proceedings of the 29th ACM international conference on multimedia, pages 4315–4323, 2021.
- [13] Zirong Jin, Tianjing Zhang, Taixiang Jiang, Gemine Vivone, and Liangjian Deng. Lagconv: Local-context adaptive convolution kernels with global harmonic bias for pansharpening. AAAI Conference on Artificial Intelligence, 2022.
- [14] Er Saiqa Khan and Er Arun Kulkarni. An efficient method for detection of copy-move forgery using discrete wavelet transform. International Journal on Computer Science and Engineering, page 1801, 2010.
- [15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [16] Giuseppe Masi, Davide Cozzolino, Luisa Verdoliva, and Giuseppe Scarpa. Pansharpening by convolutional neural networks. Remote Sensing, page 594, 2016.
- [17] Xiangchao Meng, Nan Wang, Feng Shao, and Shutao Li. Vision transformer for pansharpening. IEEE Transactions on Geoscience and Remote Sensing, pages 1–11, 2022.
- [18] Raymond Edward Alan Christopher Paley and Norbert Wiener. Fourier transforms in the complex domain. American Mathematical Soc., 1934.
- [19] Hans Roder. Amplitude, phase, and frequency modulation. Proceedings of the Institute of Radio Engineers, pages 2145–2176, 1931.
- [20] Xin Tian, Kun Li, Wei Zhang, Zhongyuan Wang, and Jiayi Ma. Interpretable model-driven deep network for hyperspectral, multispectral, and panchromatic image fusion. IEEE Transactions on Neural Networks and Learning Systems, 2023.
- [21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 2017.
- [22] Gemine Vivone, Luciano Alparone, Jocelyn Chanussot, Mauro Dalla Mura, Andrea Garzelli, Giorgio A. Licciardi, Rocco Restaino, and Lucien Wald. A critical comparison among pansharpening algorithms. IEEE Transactions on Geoscience and Remote Sensing, pages 2565–2586, 2015.
- [23] Gemine Vivone, Rocco Restaino, Mauro Dalla Mura, Giorgio Licciardi, and Jocelyn Chanussot. Contrast and error-based fusion schemes for multispectral image pansharpening. IEEE Geoscience and Remote Sensing Letters, pages 930–934, 2014.
- [24] Lucien Wald, Thierry Ranchin, and Marc Mangolini. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images. Photogrammetric engineering and remote sensing, pages 691–699, 1997.
- [25] Xiao Wu, Tingzhu Huang, Liangjian Deng, and Tianjing Zhang. Dynamic cross feature fusion for remote sensing pansharpening. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14687–14696, October 2021.
- [26] Keyu Yan, Man Zhou, Li Zhang, and Chengjun Xie. Memory-augmented model-driven network for pansharpening. European Conference on Computer Vision, pages 306–322, 2022.
- [27] Junfeng Yang, Xueyang Fu, Yuwen Hu, Yue Huang, Xinghao Ding, and John Paisley. Pannet: A deep network architecture for pan-sharpening. In Proceedings of the IEEE international conference on computer vision, pages 5449–5457, 2017.
- [28] Hao Zhang, Hebaixu Wang, Xin Tian, and Jiayi Ma. P2sharpen: A progressive pansharpening network with deep spectral transformation. Information Fusion, pages 103–122, 2023.
- [29] Tianjiang Zhang, Liangjian Deng, Tingzhu Huang, Jocelyn Chanussot, and Gemine Vivone. A triple-double convolutional neural network for panchromatic sharpening. IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [30] Huanyu Zhou, Qingjie Liu, and Yunhong Wang. Panformer: A transformer based model for pan-sharpening. IEEE International Conference on Multimedia and Expo, pages 1–6, 2022.